Synchronous Parallel Processing of Big-Data Analytics Services to Optimize Performance in Federated Clouds

Gueyoung Jung, Nathan Gnanasambandam
Xerox Research Center Webster, Webster, USA
{gueyoung.jung, nathang}@xerox.com

Tridib Mukherjee
Xerox Research Center India, Bangalore, India
[email protected]

Abstract—Parallelization of big-data analytics services over a federation of heterogeneous clouds has been considered to improve performance. However, contrary to common intuition, there is an inherent tradeoff between the level of parallelism and the performance of big-data analytics, principally because of the significant delay incurred in transferring big-data over the network. The data transfer delay can be comparable to, or even higher than, the time required to compute the data. To address this tradeoff, this paper determines: (a) how many and which computing nodes in federated clouds should be used for parallel execution of big-data analytics; (b) opportunistic apportioning of big-data to these computing nodes in a way that enables synchronized completion at best-effort performance; and (c) the sequence of apportioned, different-sized big-data chunks to be computed in each node so that the transfer of a chunk is overlapped as much as possible with the computation of the previous chunk in the node. In this regard, the Maximally Overlapped Bin-packing driven Bursting (MOBB) algorithm is proposed, which improves performance by up to 60% against existing approaches.

Keywords—federated clouds; big-data analytics; parallelization

I. INTRODUCTION

Deploying big-data analytics services into clouds is more than just a contemporary trend. We are living in an era where data is being generated from many different sources such as sensors, social media, click-streams, log files, and mobile devices. Collected data can now exceed hundreds of terabytes and is continuously generated. Such big-data represents data sets that can no longer be easily analyzed with traditional data management methods and infrastructures [1][2][3]. In order to promptly derive insight from big-data, enterprises have to deploy big-data analytics on an extraordinarily scalable delivery platform. The advent of Cloud Computing has been enabling enterprises to analyze such big-data by leveraging vast amounts of computing resources available on demand at low resource usage cost. One of the research challenges in this regard is figuring out how to best use federated cloud resources to maximize the performance of big-data analytics. In this paper, we mainly focus on parallel data mining, such as topic mining and pattern mining, that can be run in multiple computing nodes simultaneously. Parallel data mining consumes a lot of computing resources to analyze large amounts of unstructured
data, especially when executed under a time constraint. Cloud service providers may have enough capacity dedicated to such data-intensive services in their own data centers. However, facilitating loosely coupled and federated clouds consisting of legacy resources and applications is often a better choice, as analytics can be carried out partly on local private resources while the rest of the big-data is transferred to external computing nodes that are optimized for processing big-data analytics. This paradigm is more flexible and has obvious cost benefits compared to using a single data center [4][5]. In order to optimize parallel data mining not just across multiple computing nodes but across different clouds separated by relatively high latencies, this paper addresses: (a) node determination, i.e., "how many" and "which" computing nodes in federated clouds should be used; (b) synchronized completion, i.e., how to optimally apportion big-data across parallelized computation environments to ensure synchronization, where synchronization refers to completing all workload portions at the same time even when resources and inter-networks are heterogeneous and situated in multiple Internet-separated clouds; and (c) data partition determination, i.e., how to serialize different data chunks to computing nodes to avoid overflow or underflow at the nodes. To address these problems, we develop a heuristic cloud-bursting algorithm, referred to as Maximally Overlapped Bin-packing driven Bursting (MOBB). Specifically, we improve the advantage of data mining parallelization by considering the time overlap: (a) across computing nodes; and (b) between data transfer delay and computation time in each computing node. While unequal loads may be apportioned to the parallel computing nodes, our algorithm can still make sure that outputs are produced at the same time without any single slow node acting as a bottleneck. When a data mining task is run on a pool of inter-connected clouds, extended periods of data transfer delay are often experienced, and the data transfer delay depends on the location of each computing node. Fast transfer of data chunks to a slow computing node can cause data overflow, whereas slow transfer of chunks to a fast node can lead to underflow, causing the node to be idle. Our MOBB algorithm reduces such data overflow and underflow.

To evaluate our approach, we employ frequent pattern mining [6] as a specific type of parallel analytics whose inputs are huge but whose outputs are comparatively small. We then deploy the analytics on multiple small Hadoop [7] clusters in four different clouds. The experimental results show that our approach outperforms other existing load-balancing methods for big-data analytics.

II. RELATED WORK

The problem of load distribution has previously been studied for different distributed computing systems including computing grids [8], parallel architectures [9], and data centers [10][11][12]. Load-balancing for parallel applications has typically involved the distribution of loads to computing nodes so as to maximize performance. Although cost and delay overheads have been considered in many cases, such overheads usually involve application delays in check-pointing and context switching. This paper, on the other hand, focuses on the distribution of big-data over computing nodes that are located far apart. In this setting, the overhead of data transfer can be significant because of high inter-node network latencies and the amount of data being transferred. Although recent research work [13][14] has considered the impact of the data transfer delay on performance when selecting clouds to redirect service loads, it does not consider further optimization by overlapping the transfer delay of a data chunk with the computation time of the previous chunk. The need for such overlapping has been identified for clusters [15]. Continuation-based overlapping of data transfers with instruction execution has been investigated for many-core architectures [9]. However, such overlapping is restricted by a pre-defined order of instruction sets. This paper instead determines the order of different-sized data chunks to be transferred to individual nodes. By doing so, we can maximize the overlap between data transfer and data computation. The other way to optimize the performance of big-data analytics is scheduling sub-tasks among computing nodes. For example, back-filling of tasks at an earlier time than the originally scheduled sequence has been considered as a part of batch-scheduling in data centers and computing clusters [10]. However, such scheduling is geared towards batch processing within data centers and not for big-data apportioning in heterogeneous federated clouds. Some approaches have introduced task schedulers with load-balancing techniques in heterogeneous computing environments. For example, CometCloud [16] has a task scheduler to run analytics on hybrid clouds with a load-balancing technique. [17] has introduced some heuristics to schedule tasks for heterogeneous computing nodes. None of the above approaches has dealt with the potential tradeoff between the data transfer delay and the computation time in parallel execution environments. [18] has introduced a scheduling algorithm that decides which tasks in a task queue have to be run in an internal cloud and which are sent to an external cloud. It mainly focuses on keeping the order of tasks in the queue while increasing performance by utilizing an external cloud

on demand. However, it does not consider how many and which clouds are required and how much data should be allocated to each chosen cloud for parallel processing. Similar to our approach, research efforts [5][19] have been made to deploy parallel applications using MapReduce over massively distributed computing environments. Using only a local cluster with dynamic provisioning [20] may outperform these distributed approaches by reducing the data transfer delay if the local cluster has enough computational power. However, the distributed approach is more flexible and has cost benefits [4][5]. This paper provides a precise load-balancing algorithm for distributed parallel applications dealing with potentially big data, such as continuous data stream analysis [5] and life pattern extraction for healthcare applications in federated clouds [19].

III. PROBLEM STATEMENT

To achieve the optimal performance of big-data analytics services in federated clouds, we have to determine "how many" and "which computing nodes" in clouds are required, where each computing node can be a cluster of servers in a data center, and how to apportion the given big-data to the chosen computing nodes. Specifically, we address these problems for a frequent pattern mining task employed as a big-data analytics service (see Section V).




Fig. 1. Hypothetical curve representing the relation between the overall execution time and the number of parallel computing nodes

The input big-data to the frequent pattern mining algorithm (e.g., a log file containing users' web transactions) is typically collected in a central place over a certain period (e.g., a year), and is processed to generate an output (e.g., frequent user behavior patterns). To execute the mining task on multiple computing nodes, the big-data is first divided into a certain number of data chunks (e.g., logs for user groups), and those data chunks are transferred to the computing nodes. Intuitively, as the number of computing nodes increases, the overall execution time can decrease, but the amount of data to be transferred increases. As shown in Fig. 1, the overall execution time can start to increase if we go beyond a certain number of computing nodes. This is because the delay taken to transfer data chunks starts to dominate the overall execution time. Meanwhile, adding computing nodes can optionally be stopped once a target execution time specified in the Service Level Agreement (SLA) is met.

Our MOBB algorithm is designed to address the problem of how many and which computing nodes are used. It identifies the number of computing nodes by starting from a single node and increasing the number of nodes one at a time. At each step, the best set of computing nodes can be identified by estimating the data transfer delay and computation time of each node for the given big-data. Minimizing the frequency of data synchronization in the parallel process is practically one of the best ways to optimize performance. Thus, we have to understand the characteristics of the input data before designing the parallel process. One characteristic of frequent pattern mining is that the data is in temporal order and mixed with many users' activities. To generate frequent behavior patterns of each individual user (i.e., extract personalized information from big-data), we divide the given big-data into user groups. Distributing and executing individual user data in different computing nodes reduces data synchronization since these data chunks are independent of each other. In this regard, to address the problem of how to apportion the given big-data to computing nodes, we have to consider a set of data chunks, each of which has a different size, and a set of computing nodes, each of which has different network and computing capacities.
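A minimal sketch of this partitioning step is shown below. The tab-separated record format, the position of the user id field, and the 64 MB chunk bound are illustrative assumptions, not the exact format of our logs; the point is only that records are grouped by user so that the resulting chunks stay mutually independent.

from collections import defaultdict

def split_log_by_user(log_lines, max_chunk_bytes=64 * 1024 * 1024):
    # Group records by user id so that chunks stay mutually independent.
    # Assumes tab-separated records whose first field is the user id; the
    # format and the 64 MB bound are illustrative assumptions only.
    per_user = defaultdict(list)
    for line in log_lines:
        user_id = line.split("\t", 1)[0]
        per_user[user_id].append(line)

    chunks, current, current_size = [], [], 0
    for records in per_user.values():
        user_bytes = sum(len(r) for r in records)
        # Never split a single user across chunks; start a new chunk instead.
        if current and current_size + user_bytes > max_chunk_bytes:
            chunks.append(current)
            current, current_size = [], 0
        current.extend(records)
        current_size += user_bytes
    if current:
        chunks.append(current)
    return chunks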

Fig. 2. Data allocation to clouds having different capacities (a central cloud apportions a set of data chunks to low-, medium-, and high-capacity local and remote clouds according to their network capacities)

We encode this problem as a bin-packing problem as shown in Fig. 2. Our MOBB algorithm aims at minimizing the execution time difference among computing nodes by optimally apportioning the given data chunks to computing nodes. In other words, it maximizes the time overlap across computing nodes when it performs the parallel data mining.

Fig. 3. Ideal time overlap when serializing a series of data chunks to a cloud

Moreover, we simultaneously consider improving the time overlap between the data transfer delay and the computation time while distributing data chunks to computing nodes. Practically, this overlap can be achieved since a data chunk can be computed in a node while the next chunk can be transferred

to the node. As shown in Fig. 3, ideally, our algorithm attempts to select a data chunk whose transfer delay to a node equals the computation time of the previous data chunk on that node. Our algorithm optimizes the performance of the parallel mining by maximizing the time overlap not only across computing nodes, but also between the data transfer delay and the computation time in each node, simultaneously.

IV. MOBB APPROACH

A. Maximally Overlapped Cloud-Bursting

The first part of our approach is to decide "how many" and "which" computing nodes are used. Based on estimates of the data transfer delay and the data computation time (see Section IV-B), our algorithm chooses a set of parallel computing nodes, which have shorter delay than other candidates, by identifying the next best node and adding it to the set one at a time.

Algorithm 1 Cloud-bursting to determine computing nodes
  N ← {n1, n2, n3, ..., nn}
  for each ni in N do
    ti ← EstDelay(ni)
    ei ← EstCompute(ni)
  end for
  Sort(N) by (ti + ei)
  S ← n0; p ← 1; xp ← EstExecTime(n0)
  while xp ≥ SLA or xp < xp−1 do
    nmin ← SelectNextBest(N − S)
    S ← S ∪ nmin
    p ← p + 1
    PerformBinPacking(S)
    xp ← EstExecTime(S)
  end while

Algorithm 1 describes our cloud-bursting approach. Given a set of candidate computing nodes N, our approach estimates the data transfer delay t to each node and the node's computation time e for the data assigned. The estimation is made using a unit of data (i.e., a fixed size of data). Then, the candidate nodes are sorted by the total estimate (i.e., t + e), and each node is added to the execution pool S if required. Our approach starts the execution of the mining task with the central node n0, which is the node where the big-data has been collected, or the node to which the service provider initially allocates the big-data before bursting to multiple nodes. One extreme case is using only n0, if the execution time estimated by EstExecTime(n0) meets the exit condition. If the SLA cannot be met using the current set of parallel nodes (i.e., xp ≥ SLA), or we want to further reduce the overall execution time by utilizing more nodes (i.e., xp < xp−1), our approach increases the level of parallelization by adding one node to the pool using SelectNextBest(N − S). The newly added node is the best node, i.e., the one with the minimum (t + e) among all candidates. Once the set of parallel nodes is determined in this step, our approach performs the maximally overlapped bin-packing, PerformBinPacking(S), which attempts to maximize the time overlap across these nodes

and the time overlap between data transfer and computation in each node. Then, EstExecT ime(S) computes the optimal execution time that can be achieved by utilizing these nodes.
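A minimal Python sketch of this cloud-bursting loop is given below. The estimator callables (e.g., an ARMA-based delay model as described in Section IV-B) and the perform_bin_packing hook are assumed interfaces standing in for the components described above; the names and signatures are illustrative, not an exact implementation.

def cloud_burst(candidates, central, sla, est_delay, est_compute,
                est_exec_time, perform_bin_packing):
    # Sort candidate nodes by estimated per-unit total delay t_i + e_i.
    ordered = sorted(candidates, key=lambda n: est_delay(n) + est_compute(n))

    pool = [central]
    exec_time = est_exec_time(pool)
    prev_time = exec_time   # so the loop exits immediately if n0 alone meets the SLA

    # Mirror Algorithm 1's loop condition: keep bursting while the SLA is
    # violated, or while adding the previous node still reduced the estimate.
    while ordered and (exec_time >= sla or exec_time < prev_time):
        pool.append(ordered.pop(0))   # next best node (minimum t_i + e_i)
        perform_bin_packing(pool)     # maximally overlapped bin-packing (Sec. IV-C)
        prev_time, exec_time = exec_time, est_exec_time(pool)

    return pool, exec_time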

B. Estimation of Data Computation and Transfer Delay

Many estimation techniques have been introduced for the data computation time and the data transfer delay, and our approach does not depend on a specific technique. For the computation time estimation, we can adopt the response surface model introduced in [18]. By profiling each node for its computation time, we can build the initial model and tune it based on observations over time. Another well-known technique is the queueing model [21], in which the node has a task queue and processes each task with a defined discipline (e.g., FIFO or PS). The data transfer delay between different clouds can be more dynamic than the computation time because of various factors such as bandwidth congestion and re-routing due to network failures. To estimate the data transfer delay, we can adopt the auto-regressive moving average (ARMA) filter [21][22] by periodically profiling the network latency. To profile the latency, we periodically inject a small unit of data to the target node and record the delay. With the historical record, we can build the model, and tune it over time by recording the error rate. We have used the ARMA filter for estimating both the computation time and the data transfer delay in the current prototype.

C. Maximally Overlapped Bin-Packing

Fig. 4 illustrates the three main steps of our algorithm for data allocation.

• Pre-processing: This step involves: (a) the determination of the bucket size for each node; (b) sorting of data chunks in descending order of their sizes; and (c) sorting of node buckets in descending order of their sizes. The bucket sizes are determined in a way that a node with higher delay gets a smaller bucket. Sorting the buckets then essentially boils down to giving higher preference to nodes with lower delay.

• Greedy bin-packing: In this step, the sorted list of data chunks is assigned to node buckets in a way that larger data chunks are handled by nodes with higher preference. Any fragmentation of the buckets is handled in this step (as shown in Algorithm 2).

• Post-processing: After the data chunks are assigned to the buckets, this step organizes the sequence of chunks for each bucket such that data transfer and computation are overlapped maximally.

1) Determining bucket size: The algorithm parallelizes the mining task by dividing the input data among multiple computing nodes. For parallelization, the size of data given to a particular node depends on the average delay of the mining task on that node. If the average delay of the task for a unit of data on a node $i$ is denoted as $d_i$, then the overall delay for executing a data size $s_i$ (that is provided to node $i$ for the mining task) is $s_i d_i$. In order to ensure ideal parallelization for $n$ nodes and a set of data whose sizes are $s_1, s_2, \ldots, s_n$, the following must be satisfied:

$$s_1 d_1 = s_2 d_2 = \cdots = s_n d_n,$$

where $d_1, d_2, \ldots, d_n$ are the delays per unit of data at each node, respectively. After such an assignment, if the overall execution time of the mining task is assumed to be $r$, then the size of data assigned to each node is

$$s_1 = \frac{r}{d_1},\quad s_2 = \frac{r}{d_2},\quad \ldots,\quad s_n = \frac{r}{d_n} \qquad (1)$$

Let $s$ be the total amount of input logs distributed to the nodes (i.e., $s = \sum_{i=1}^{n} s_i$). Then, we get

$$r = \left\lceil \frac{s}{\sum_{i=1}^{n} \frac{1}{d_i}} \right\rceil \qquad (2)$$

Eq. 2 provides the overall execution time of the mining task under full parallelization. This can be achieved if the data assigned to each node $i$ is limited by an upper bound $s_i$ given (by substituting $r$ from Eq. 2 into Eq. 1) as follows,

$$s_i = \left\lceil \frac{s}{d_i \sum_{j=1}^{n} \frac{1}{d_j}} \right\rceil \qquad (3)$$

Note here that $s_i$ is higher for a node $i$ if $d_i$ is lower (compared to other nodes). Hence, Eq. 3 can be used to determine the bucket size for each node in a way that gives higher preference to nodes with lower delay.

2) Greedy Bin-Packing: Once the bucket sizes are determined, the next step involves assigning the data chunks to the node buckets. We use a greedy bin-packing approach where the largest data chunks are assigned to the nodes with the lowest delay (hence reducing the overall delay), as shown in Fig. 4. There are two main intuitions:

• Weighted load distribution: This involves loading a node based on its overall delay, i.e., the node with lower delay gets more data to handle. This is guaranteed by providing an upper bound on the size of data (i.e., the bucket size) to be handled by the node (as described in Section IV-C1).

• Delay-based node preference: Larger data chunks are assigned to a node with a larger bucket size (i.e., with lower overall delay) so that individual data chunks get fairness in their overall delay. This is guaranteed by sorting the input data chunks in descending order (in the pre-processing step in Fig. 4) and filling up nodes with larger bucket size first (Algorithm 2).

To reduce fragmentation of the buckets, the buckets are completely filled one at a time; i.e., the bucket with the lowest delay is exhausted first, followed by the next one, and so on (Algorithm 2). This approach also fills more data into nodes with lower delay.
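As a small illustration of Eq. 3 and this weighted preference, the sketch below computes per-node bucket sizes from the per-unit delays d_i = t_i + e_i; the numeric values are made up purely for illustration.

import math

def bucket_sizes(total_size, unit_delays):
    # Eq. 3: s_i = ceil( s / (d_i * sum_j 1/d_j) ), where d_i = t_i + e_i
    # is the per-unit delay of node i; a lower d_i yields a larger bucket.
    harmonic = sum(1.0 / d for d in unit_delays)
    return [math.ceil(total_size / (d * harmonic)) for d in unit_delays]

# Made-up numbers purely for illustration: 200 units of data over three
# nodes whose per-unit delays are 2, 4, and 4 time units.
print(bucket_sizes(200, [2, 4, 4]))   # -> [100, 50, 50]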

Fig. 4. Steps to allocate data chunks to computing nodes in our algorithm: pre-processing sorts the input data chunks (total size s) and the node buckets, where the per-unit delay of node i is the sum of its transfer delay and computation time, i.e., di = ti + ei; greedy bin-packing assigns more data (larger data chunks) to nodes with less delay; post-processing organizes the sequence of data chunks to maximize the overlap between computation time and subsequent data transfer.

3) Maximizing the overlap between data transfer and computation: The above approach achieves parallelization of the analytics over a large set of data chunks. However, the delay $d_i$ for a unit of data to run on node $i$ consists of the data transfer delay from the central node to node $i$ and the actual computation time on node $i$. Therefore, it is possible to further reduce the overall execution time by transferring data to a node in parallel with the computation of a different data chunk. Ideally, the transfer delay of a data chunk should be exactly equal to the computation time of the previous data chunk. Otherwise, there can be delay incurred by queuing (in case the computation time is higher) or by waiting for data to arrive (in case the transfer delay is higher). If computation time and transfer delay are not exactly equal, we need to smartly select a sequence of data chunks for the data bucket of each node so that the difference between the computation time of each data chunk and the transfer delay of the immediately following chunk is minimized. Depending on the ratio of data transfer delay to computation time, a node $i$ can be categorized as follows:

• type 1, for which the transfer delay $t_i$ per unit of data is higher than the computation time $e_i$ per unit of data;

• type 2, for which the computation time $e_i$ per unit of data is higher than the transfer delay $t_i$ per unit of data.

It is important to understand the required characteristics of the sequence of data chunks sent to each of these types of nodes. If $s_{ij}$ and $s_{i(j+1)}$ are the sizes of data chunks $j$ and $j+1$ assigned to node $i$, then for complete parallelization of the data transfer of chunk $j+1$ and the computation of chunk $j$, the following holds:

$$s_{i(j+1)} t_i = s_{ij} e_i$$

It should be noted here that if $t_i \geq e_i$, then ideally $s_{i(j+1)} \leq s_{ij}$. Thus, data chunks in the bucket for a type 1 node should

be in descending order of their sizes. Similarly, it can be concluded that for a node of type 2, data chunks should be in ascending order, as shown in the post-processing step in Fig. 4 and ensured at the end of Algorithm 2 (where the descending order of data chunks is reversed to make the order ascending in case ti < ei).

Algorithm 2 Maximally overlapped bin-packing
  Sort DataChunkList by descending order of chunk size
  Determine bucket sizes s1, s2, ..., sn (Eq. 3)
  Sort BucketList by descending order of bucket size
  repeat
    for i = 1 to n do
      Remove first element from DataChunkList
      Insert the element at the tail of BucketList[i]
      for j = 1 to remainingNumberOfDataChunks do
        if (DataChunkList is not empty) and (first element in DataChunkList can fit in BucketList[i]) then
          Remove first element from DataChunkList
          Insert the element at the tail of BucketList[i]
        end if
      end for
    end for
  until all the BucketLists are full
  for i = 1 to numberOfNodes do
    if ti < ei then
      Reverse order of data chunks in BucketList[i]
    end if
  end for
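The following Python sketch mirrors Algorithm 2 under simplifying assumptions (chunk and bucket sizes treated as plain numbers, a first-fit scan within the current bucket); it is an illustrative rendering, not an exact implementation, and would typically be driven by the bucket sizes of Eq. 3 and the per-node t_i, e_i estimates from Section IV-B.

def mobb_bin_packing(chunk_sizes, bucket_capacities, transfer_faster):
    # chunk_sizes: sizes of the data chunks to distribute.
    # bucket_capacities: per-node bucket sizes from Eq. 3 (larger = lower delay).
    # transfer_faster[i]: True if t_i < e_i for node i (type 2), in which case
    # the chunk sequence is reversed to ascending order in post-processing.
    chunks = sorted(chunk_sizes, reverse=True)              # descending chunk sizes
    order = sorted(range(len(bucket_capacities)),
                   key=lambda i: bucket_capacities[i], reverse=True)

    buckets = [[] for _ in bucket_capacities]
    remaining = list(bucket_capacities)
    for i in order:                      # fill the largest (lowest-delay) bucket first
        j = 0
        while j < len(chunks):
            if chunks[j] <= remaining[i]:        # first-fit within this bucket
                remaining[i] -= chunks[j]
                buckets[i].append(chunks.pop(j))
            else:
                j += 1

    # Post-processing: type 2 nodes (t_i < e_i) receive chunks in ascending
    # order so each transfer hides behind the previous chunk's computation.
    for i, faster in enumerate(transfer_faster):
        if faster:
            buckets[i].reverse()
    return buckets

Because the descending chunk list is consumed greedily and the largest-capacity bucket is filled first, larger chunks tend to land in the larger (lower-delay) buckets, matching the delay-based node preference described above, while the inner first-fit scan handles fragmentation.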

V. EXPERIMENTAL EVALUATION

We demonstrate the efficiency of our approach by deploying the frequent pattern mining task as a big-data analytics service on four different computing nodes. We first describe the experimental setup, followed by the results.

A. Experimental Setup

1) Frequent Pattern Mining: Frequent pattern mining [6] aims to extract frequent patterns from a log file. The log contains users' activities on a system in temporal order. A typical example is a web server access log, which contains a history of web page accesses by users. Enterprises need to analyze such web server access logs to discover valuable information such as website traffic patterns and user activity patterns by time of day, time of week, or time of year. These frequent patterns can also be used to generate rules to predict the future activity of a certain user within a certain time interval based on the user's past patterns. To extract such frequent patterns and prediction rules, our mining process parses the log while identifying patterns of each user. Then, it generates a set of prediction rules for each user. As a sample log for our experiments, we have combined a phone call log obtained from a call center and a web access log. The log has been collected for a year, and its size is up to 200 GB. The log contains several million activities generated by more than a hundred thousand users regarding human resources topics such as health insurance, 401K, and retirement plans. The objective of the frequent pattern mining is to obtain patterns of each user's activities on human resource information systems. As such, our approach first divides the log into a set of user logs and then executes these user logs in parallel over federated computing nodes.

2) Computing nodes: For our experiments, we employ four small computing nodes to run the frequent pattern mining. Three nodes are local clusters located in the north-eastern part of the US, and one is a remote cluster located in the mid-western part of the US. One local node, referred to as the Low-end Local Central node (LLC), is used as the central node that collects the big-data to be analyzed. This node consists of 5 virtual machines (VMs), each of which has two 2.8 GHz cores, 1 GB memory, and a 1 TB hard drive. Another local node, referred to as the Low-end Local Worker (LLW), has a configuration similar to LLC. The third local node, referred to as the High-end Local Worker (HLW), is a cluster of 6 non-virtualized servers, each of which has 24 2.6 GHz cores, 48 GB memory, and a 10 TB hard drive. All these local nodes are configured with a high-speed local connection so that data can be moved very fast between nodes. The remote node, referred to as the Mid-end Remote Worker (MRW), has 9 VMs, each of which has two 2.8 GHz cores, 4 GB memory, and a 1 TB hard drive. We deploy Hadoop on all these computing nodes. In our scenario, we assume HLW is shared with other applications; three other data mining tasks are running alongside our frequent pattern mining. Meanwhile, we intentionally impose a network overhead on LLW by moving large files to LLW during the experiments. This increases the data transfer delay between LLC and LLW.

B. Experiment Results

We first show the performance characteristics of the computing nodes in terms of data computation time and data transfer delay. Fig. 5 shows the nodes' performance characteristics when we run the mining task with the entire log in each node. LLC has low-end servers, but the log is stored in its local storage. Thus, its computation time is higher than that of the other nodes while there is no data transfer delay. To run the mining task in the other nodes, we must first copy the log to the target node and then execute it. Since LLW is another local node with intentional network delay, its data transfer delay is high. It also has low-end servers, so it takes a large amount of time to compute. MRW, which is located remotely, has a mid-end configuration and thereby its computation time is lower than LLC's. However, as shown in the figure, its overall execution time is similar to LLC's due to the large transfer delay. Meanwhile, HLW has high-end servers, and the log is stored in the local area network. Thus, its computation time is much lower than the others', and its data transfer delay is small.

Fig. 5. Performance characteristics of computing nodes (computation time and data transfer delay for LLC, LLW, MRW, and HLW)

Fig. 6. Impact of data transfer delay on the overall execution time (execution time vs. relative input data size, for a single node, two nodes, and two nodes with delay)

1) Impact of data transfer delay on overall execution time: Using LLC and LLW, we compare the execution time with an intentional transfer delay to LLW against one without the delay. We use a small log (18.6 GB) in this experiment, and resize it to show how the execution time changes with the size of the input log. As shown in Fig. 6, the execution time increases exponentially as the size increases. This explains the need for parallel execution of such big-data analytics to improve performance. When using two nodes without the transfer delay, the execution time decreases by almost half. However, when we execute the mining task with the transfer delay, the execution time is higher than the previous result, and the gap slightly increases as the size increases. Therefore, we have to deal with the data transfer delay carefully.

2) Effect of cloud-bursting on total execution time: Our MOBB algorithm attempts to use only LLC first, and then integrates additional nodes one at a time, selecting the node with the lowest estimated

TABLE I
SLACK TIMES OF DIFFERENT LOAD-BALANCING ALGORITHMS

              MOBB    Fair       Comp.-based   Delay-based
Slack (sec)   41.2    13952.8    12719.5       1030.8

Fig. 7. Cloud-bursting with four computing nodes (execution time for LLC only, LLC + HLW, LLC + HLW + MRW, and all four nodes)

execution time. This addition is performed until the SLA is met. In this experiment, our approach adds HLW, then MRW, and finally uses all four nodes to run the mining task. As shown in Fig. 7, the execution time decreases as nodes are added. However, the execution time does not improve significantly beyond three nodes. This is because the contributions of MRW and LLW to the performance are small, and the transfer delay caused by MRW and LLW starts to impact the overall execution time. Therefore, using two or three nodes can be a better choice than using four nodes, which incurs higher resource usage cost.

3) Comparison of MOBB with other load-balancing approaches: We run the mining task using MOBB and three other methods that are used in many prior load-balancing approaches [9][13][14][15][16], and then compare the results. The methods used in this comparison are as follows:

• Fair division: This method equally divides the input data and distributes it to the nodes. We use this naive method as a baseline.

• Computation-based division: This method only considers the computation power of each node when it performs load-balancing, rather than considering both computation and data transfer delay.

• Delay-based division: This method considers both each node's computation time and data transfer delay in load-balancing. However, it does not consider the queuing delay in each node incurred by blindly distributing user logs to nodes (i.e., it does not consider the time overlap between the transfer delay and computation time).

Fig. 8. Comparison of different load-balancing algorithms (execution time of MOBB, Fair, Computation-based, and Delay-based division)

Fig. 8 shows the result when we run the mining task on HLW and MRW. As shown in the figure, our algorithm outperforms the other three methods. Since MRW has a large

transfer delay, the execution time of Computation-based division is very close to that of Fair division. Table I explains this situation with slack time (i.e., the measured time difference between nodes' task completions). Although Computation-based division considers the computation powers of MRW and HLW when load-balancing, MRW becomes a bottleneck due to its large transfer delay. Meanwhile, Delay-based division considers both the computation time and the transfer delay, as MOBB does. This significantly reduces the slack time. However, some data chunks accumulate in the queue before being computed. This happens when many small data chunks arrive and accumulate in the waiting queue while the mining task is computing some large data chunks. When this situation occurs too often, significant delay can be incurred. Our MOBB algorithm considers all these factors simultaneously when it allocates the given big-data. As evident from the results in Fig. 8, MOBB achieves a minimum of 20% (and up to 60%) improvement compared to the other approaches. If an ideal optimal data allocation were made, the slack time would be 0 (i.e., computation in multiple computing nodes would complete at the same time). Table I shows that our MOBB has around 40 seconds of slack time. However, this is very small compared to the overall time taken for the parallel data mining. To further evaluate the optimality of our MOBB approach, we have conducted multiple small experiments with LLC and LLW that execute 20 data chunks randomly selected out of 50K data chunks. We have observed that in most cases (i.e., more than 90%), the slack time is caused by the last data chunk of the sequence assigned to the slower node (i.e., there is no data chunk left in the slower node's queue), and in very few cases, the slower node has more than one data chunk in its queue while the faster node has completed all assigned data chunks. This indicates that our MOBB provides close to optimal allocation.

4) Efficiency of MOBB algorithm: Our MOBB algorithm is efficient and scalable with an increasing number of data chunks and applications. As described in Algorithm 2, the complexity of pre-processing is O(n log n + m log m), where n is the number of computing nodes and m is the number of data chunks to be assigned, since it sorts the given nodes and data chunks. Typically, m is much larger than n and thereby the complexity is effectively O(m log m). In our experiments, we have used 4 nodes to run 50K data chunks. The complexity of the rest of the MOBB algorithm (i.e., greedy bin-packing and post-processing) is O(m). Therefore, the overhead of our MOBB algorithm is mainly incurred in sorting a large number of data chunks. We have used an existing quicksort algorithm in our prototype. To deal with 50K data chunks with 4 nodes, it took less than 60 seconds, which is a very small portion of the overall approach including data transfer and parallel data computation.

VI. DISCUSSION

Our MOBB approach has been designed for data-intensive tasks (e.g., big-data analytics) that typically require special platforms such as a MapReduce cluster and, especially, can run in parallel. One of the best situations in which to apply our MOBB approach is the case where the target task can be divided into a set of independent, identical sub-tasks. For the data mining task used in our evaluation, our data preprocessing system divides the input data into a set of data chunks. Then, each sub-task is run independently and in parallel with other sub-tasks to generate frequent patterns of each individual user from a subset of data chunks. An extended situation considered in our MOBB approach is the case where multiple independent data-intensive tasks are in a task queue with different sizes of input data. If a task in the queue cannot be divided into independent sub-tasks (e.g., iterative algorithms, whose data transfer occurs just once but whose computation runs multiple times over the same data), our MOBB approach treats the task as a unit task and attempts to parallelize it with other tasks and sub-tasks in the queue. This is because running such algorithms across federated clouds may not be practical, since they may require considerable communication among computing nodes (e.g., merging and redistributing intermediate results iteratively). We are planning to extend our MOBB approach to be applied to such a task queue holding multiple independent data-intensive tasks. Another extension we are considering is to dynamically re-sort computing nodes and re-target the sequence of data chunks. In the current prototype, the decision making is based on the status of network and computation capacities at the time it is invoked, as described in Algorithm 1. However, the status can change dynamically due to various unexpected events such as node failures and network congestion while nodes are being sorted based on the previous status and data chunks are being allocated. One possible solution is for our MOBB approach to periodically check the available computation capacities and network delays of nodes. Another solution is for distributed monitoring systems to push events to MOBB when the status changes significantly. In either case, the status change triggers MOBB to re-sort nodes and re-target the sequence of remaining data chunks to the next available computing nodes, while data chunks already assigned continue at the corresponding nodes.

VII. CONCLUSION

In this paper, we have described a cloud-bursting approach based on a maximally overlapped load-balancing algorithm to optimize the performance of big-data analytics that can be run in loosely-coupled and distributed computing environments such as federated clouds. More specifically, our algorithm supports decisions on: (a) how many and which computing nodes in federated clouds should be used; (b) opportunistic apportioning of big-data to these nodes in a way that enables synchronized completion; and (c) the sequence of apportioned data chunks to be computed in each node so that the transfer of a chunk is overlapped as much as possible with

the computation of the previous chunk in the node. We have compared our algorithm with other load-balancing schemes. Results show that performance can be improved by at least 20% and up to 60% against the other approaches.

REFERENCES

[1] A. Jacobs. (2009, Jul.) The pathologies of big data. [Online]. Available: http://queue.acm.org/detail.cfm?id=1563874
[2] T. White, Hadoop: The Definitive Guide. O'Reilly, 2009.
[3] D. Kusnetzky. (2010, Feb.) What is big data? [Online]. Available: http://blogs.zdnet.com/virtualization/?p=1708
[4] S. Rozsnyai, A. Slominski, and Y. Doganata, "Large-scale distributed storage system for business provenance," in Proc. Int'l. Conf. on Cloud Computing, 2011, pp. 516–524.
[5] Q. Chen, M. Hsu, and H. Zeller, "Experience in continuous analytics as a service (CaaaS)," in Proc. Int'l. Conf. on Extending Database Technology, 2011, pp. 509–514.
[6] R. Srikant and R. Agrawal, "Mining sequential patterns: Generalizations and performance improvements," in Proc. Int'l. Conf. on Extending Database Technology, Feb. 1996, pp. 3–17.
[7] Apache. (2011) Hadoop. [Online]. Available: http://hadoop.apache.org/
[8] Y. Li and Z. Lan, "A survey of load balancing in grid computing," Springer, Computational and Information Science, vol. 3314, pp. 280–285, 2005.
[9] T. Miyoshi, K. Kise, H. Irie, and T. Yoshinaga, "Codie: Continuation-based overlapping data-transfers with instruction execution," in Int'l. Conf. on Networking and Computing, Nov. 2010, pp. 71–77.
[10] D. Tsafrir, Y. Etsion, and D. Feitelson, "Backfilling using system-generated predictions rather than user runtime estimates," IEEE TPDS, vol. 18, pp. 789–803, 2007.
[11] T. Mukherjee, A. Banerjee, G. Varsamopoulos, S. Gupta, and S. Rungta, "Spatio-temporal thermal-aware job scheduling to minimize energy consumption in virtualized heterogeneous data centers," Comp. Net., vol. 53, pp. 2888–2904, 2009.
[12] Z. Liu, M. Lin, A. Wierman, S. Low, and L. Andrew, "Greening geographical load balancing," in Proc. SIGMETRICS Joint Conf. on Measurement and Modeling of Computer Systems, 2011, pp. 233–244.
[13] P. Fan, J. Wang, Z. Zheng, and M. Lyu, "Toward optimal deployment of communication-intensive cloud applications," in Proc. Int'l. Conf. on Cloud Computing, 2011, pp. 460–467.
[14] M. Andreolini, S. Casolari, and M. Colajanni, "Autonomic request management algorithms for geographically distributed internet-based systems," in Proc. Int'l. Conf. on Self-Adaptive and Self-Organizing Systems, 2008, pp. 171–180.
[15] K. Reid and M. Stumm, "Overlapping data transfer with application execution on clusters," in Proc. Workshop on Cluster-Based Computing, May 2000.
[16] H. Kim and M. Parashar, CometCloud: An Autonomic Cloud Engine. Cloud Computing: Principles and Paradigms, Wiley, Chapter 10, 2011.
[17] M. Maheswaran, S. Ali, H. Siegal, D. Hensgen, and R. Freund, "Dynamic matching and scheduling of a class of independent tasks onto heterogeneous computing systems," in Proc. Heterogeneous Computing Workshop, 1999, pp. 30–44.
[18] S. Kailasam, N. Gnanasambandam, J. Dharanipragada, and N. Sharma, "Optimizing service level agreements for autonomic cloud bursting schedulers," in Proc. Int'l. Conf. on Parallel Processing Workshops, 2010, pp. 285–294.
[19] Y. Huang, Y. Ho, C. Lu, and L. Fu, "A cloud-based accessible architecture for large-scale ADL analysis services," in Proc. Int'l. Conf. on Cloud Computing, 2011, pp. 646–653.
[20] D. Alves, P. Bizarro, and P. Marques, "Deadline queries: Leveraging the cloud to produce on-time results," in Proc. Int'l. Conf. on Cloud Computing, 2011, pp. 171–178.
[21] G. Jung, K. Joshi, M. Hiltunen, R. Schlichting, and C. Pu, "A cost-sensitive adaptation engine for server consolidation of multi-tier applications," in Proc. Int'l. Conf. on Middleware, 2009, pp. 163–183.
[22] G. Box, G. Jenkins, and G. Reinsel, Time Series Analysis: Forecasting and Control. Prentice Hall, 1994.
