Dynamic Tuning of the Workload Partition Factor in Data-Intensive Applications

Claudia Rosas∗, Anna Sikora∗, Josep Jorba†, Andreu Moreno‡ and Eduardo César∗
∗ Universitat Autònoma de Barcelona, 08193 Bellaterra, Spain. Email: {crosas, ania, eduardo}@caos.uab.es
† Universitat Oberta de Catalunya, 08018 Barcelona, Spain. Email: [email protected]
‡ Escola Universitària Salesiana de Sarrià, 08017 Barcelona, Spain. Email: [email protected]

Abstract—The recent data deluge represents one of the major challenges in the computational field. Available high-performance computing (HPC) systems can be very useful for solving this problem when data can be divided into chunks that can be processed in parallel. However, due to the intrinsic characteristics of data-intensive problems, these applications can present huge load imbalances, and it can be difficult to use the available resources efficiently. This work proposes a strategy for dynamically analyzing and tuning the partition factor used to generate the data chunks. With the aim of decreasing the load imbalance and therefore the overall execution time, this strategy divides the data chunks with the biggest computation times and gathers contiguous chunks with the smallest computation times. The criteria to divide or join chunks are based on the chunks' associated execution times (average and standard deviation) and the number of processing nodes being used. We have evaluated our strategy using simulation and a real data-intensive application. Applying our strategy, we have obtained promising results, improving the total execution time by up to 55%.

Keywords-dynamic tuning; data-intensive applications;

I. INTRODUCTION

Nowadays, one of the biggest challenges in the computational field is the continuous growth of the data that needs to be processed. The data flow coming from sensors, from the results of biological and physical experiments [1], and even from information generated by users, is surpassing the capacities of systems and algorithms designed only a few years ago. This has led to a new class of applications known as data-intensive applications [2], or big-data computing [3]. In the era of data-intensive applications, computational systems are intended not only to compute but also to store and manage data. Simple tasks, such as making data available to be processed in research centers, have become a huge challenge given the size of the data. Additionally, processing data efficiently is not only a matter of having a large number of processing units; it also depends on the characteristics of the application's workload.

With the aim of improving performance, many studies have obtained good results, ranging from approaches that analyze the effectiveness of the I/O system to the design of appropriate strategies to define and access data structures [4]. In many cases, it has been necessary to divide the workload of data-intensive applications into smaller data chunks (according to Divisible Load Theory, DLT [5]) to ensure that the workload of the application can be manageable. This has been done to reduce the size of the workload and enable parallelism, but once the workload has been divided, other issues related to disk access or load balancing may appear. Most load balancing methods, such as factoring [6], are based on the idea of distributing the application's data set in chunks of decreasing size. To improve the total computation time of applications, these methods try to determine a good partition factor for obtaining the chunks. When doing this, parameters such as computation time, communication time, and the overall performance of the application are considered.

This work focuses on adapting the partition factor to reduce the overall computation time. To this end, our proposal is based on dynamically tuning the size of the workload data chunks. The proposal is oriented to applications that perform several related explorations or queries on a large data set, and it assumes the possibility of arbitrarily dividing or concatenating the application's data set into data chunks of different sizes. We have designed a strategy that changes the size of the chunks with the highest and lowest associated computation times, making it possible to balance the application workload. The strategy is based on monitoring the computation time of data chunks to determine the order in which they should be scheduled in future explorations. The proposal includes the dynamic repartitioning and gathering of data chunks (when the partitioning cost is low), and the possibility of dynamically choosing among previously generated partitions (when the partitioning cost is too high). In both cases, besides the computation time, the calculation of the partition factor considers the communication cost, the memory use, and the number of available computing nodes.

Since our method is based on the execution of applications in homogeneous clusters of workstations, the computation capacity is constant and the disk and network latencies are stable. Moreover, to facilitate the initial design we have applied a shared-nothing [7] processing approach, in which each node (consisting of processor, local memory, and disk resources) shares nothing with the other nodes in the cluster.

Summarizing, the proposed strategy includes:
• The generation of multiple representative data set divisions prior to the execution of the application when the cost of partitioning data is too high;
• The monitoring of the computation time of every exploration or query on every data chunk;
• The ordering and distribution of data chunks along the execution of the application according to their associated computation times; and
• The tuning of the partition factor of the data chunks with the highest and lowest associated computation times according to the observed efficiency (the relation between computation time and computational resources).

We have used an analytical simulator to assess our approach because the proposal uses certain analytical expressions to estimate the modifications in the size of the data chunks. The aim of the simulation is to evaluate the partitioning strategy for parallel data-intensive applications by analyzing a wide range of situations. Furthermore, to test our proposal on a real data-intensive application, we have applied it to the computation- and data-intensive bioinformatics tool Basic Local Alignment Search Tool (BLAST) [8]. The results obtained are encouraging in terms of time reduction and resource efficiency.

The rest of this paper is organized as follows. First, Section II provides an overview of related work. Next, Section III describes the proposed method for balancing the load of data-intensive applications. The main characteristics of the test scenarios, the experimental evaluation, and the results are discussed in Section IV. Finally, Section V presents the conclusions and outlines future work.

II. RELATED WORK

A divisible workload has been defined as one that can be divided into several independent pieces or chunks of arbitrary size to be processed in parallel by a set of compute nodes. The Divisible Load Theory (DLT) [5] was introduced in the late 1980s. Later on, DLT branched in many new directions, covering scheduling problems and performance modeling for various types of computational environments, such as Grid and Cloud systems [9], systems with memory limitations [10], or systems with computation time restrictions [11]. Many of these studies have used the mathematical model proposed in the first publications on DLT to represent the scheduling problem as a system of linear equations. There are works considering different start-up times of the processing nodes [11] and different types of interconnection networks [12]; in recent publications, DLT has also been applied to MapReduce computations [13]. Most of these works are aimed at Grid and Cloud systems, while our strategy targets homogeneous clusters. However, the main difference is that our proposal uses analytical expressions to dynamically evaluate monitored parameters along the application's execution.

Regarding the dynamic tuning of performance parameters, the works proposed in [14] and [15] introduced the use of the load balancing strategy named factoring to modify the size of the data chunks distributed among the workers. In the first proposal, the authors detail a performance model for Master/Worker applications that performs load balancing and adapts the number of workers. In the second, the authors designed a simple strategy that can be used to obtain automatic and dynamic load balancing for the workers. Later, they defined a strategy for dynamically improving the performance of pipeline applications [16], improving the throughput of applications by gathering the fastest pipe stages and replicating the slowest ones. In addition, the work presented in [17] proposes a predictive dynamic load balancing strategy for simulations performed on the Grid. That approach shares concepts with our work in terms of monitoring the system's performance metrics and balancing the load dynamically; nevertheless, it is solely focused on load migration (not size adaptation).

Finally, there are some works that consider the characteristics of data-intensive applications when developing dynamic load balancing strategies, such as [18] and [19]. The first is focused on multicast problems for data-intensive applications on the Cloud, while the second develops a resource allocation and scheduling strategy for minimizing the total time spent on processing data. Although these works have considered partitioning data, their main focus has been data locality; ours, instead, improves performance by dynamically reducing the computation time variability among the generated data chunks.

III. TUNING WORKLOAD PARTITION FACTOR

Generally, data-intensive applications can take advantage of parallel systems by dividing their workload into smaller pieces. Nevertheless, one of the major problems when dividing data is closely related to the size of the data chunks, i.e., the workload partition factor [10]. The two most significant problems are: (i) the high risk of load imbalance, caused by executing an application with a smaller partition factor, i.e., fewer data chunks with longer computation times; and (ii) the overhead related to scheduling and communication times, caused by executing an application with a bigger partition factor, i.e., many small data chunks.

As a first step toward solving both problems, we initially proposed a method called Heaviest Fragments First (HFF) [20] that successfully reduces the risk of load imbalance. However, HFF is not able to reduce the total execution time beyond the limit imposed by data chunks with large computation times. This situation is shown in the example of Fig. 1(a), where the total execution time is given by the execution time of the heaviest fragment. Unfortunately, this limit results in highly inefficient executions.

To tackle this problem, in this paper we provide a method to improve the performance of data-intensive applications in terms of total execution time and resource utilization by dynamically tuning the partition factor of their workloads. This constitutes the novelty and the major contribution of our work. In a few words, we propose to adapt the size of the data chunks whose computation times are far from the average computation time of the application (i.e., to tune the workload partition factor). To this end, we partition the data chunks with the longest computation times and gather the data chunks with the shortest ones. We apply this strategy to improve the overall computation time of data-intensive applications while reducing communication and scheduling overheads.

It is worth noticing that in a data-intensive application the workload may be partitioned into chunks of equal size (as shown in Fig. 1(b)), but, due to the data characteristics, the execution time of each data chunk may not be the same (Fig. 1(c)). For this reason, in HFF we sort the data chunks and distribute them among the available processing nodes in decreasing order of computation time (Fig. 1(d)). In this work we propose to dynamically monitor the performance of the data-intensive application and tune the size of the data chunks at run time, to provide an efficient execution of the application using some of the available processing nodes.

Our proposal has been developed making the following assumptions about data-intensive applications:
• The initial application data set can be arbitrarily partitioned into independent data chunks;
• The application performs a set of related explorations or queries on the data set; e.g., the application searches for similarities to several related proteins in a large database, or looks for similar strings on the web;
• The performance of the application varies significantly according to the input data; and
• The characteristics of the input data of the application are unknown.

We have expressed the main characteristics of the application through analytical expressions. These expressions enable us to decide, during execution, the adequate value of the workload partition factor and when to modify it. In addition, the new method has been designed considering a homogeneous and dedicated cluster. This has been done to keep the system variables stable, e.g., processing capacity and network latencies. This way, we can focus only on the behavior of the application, without worrying about computation time variability due to the characteristics of the system. Before going further, the notation used throughout this section is summarized in Table I.

Table I. SUMMARY OF NOTATION

  Nf      number of data chunks
  Nw      maximum number of worker nodes available in the cluster
  Nq      number of explorations (queries)
  j       data chunk identifier (0 ≤ j < Nf)
  i       exploration identifier (0 ≤ i < Nq)
  size    data chunk size (in MBytes)
  Cij     computation cost (in seconds) measured for the ith exploration on the jth data chunk
  Ci      total computation time (in seconds) of the ith exploration: Ci = Σj Cij
  μi      average computation time (in seconds) of the ith exploration: μi = (Σj Cij)/Nf
  σi      standard deviation (in seconds) of the computation times of the ith exploration
  y       number of divisions for the data chunks with Tmaxij
  Tsi     total sequential computation time of exploration i: Tsi = Σj Cij, ∀i ∈ Nq
  Tmaxij  highest computation time for exploration i and data chunk j
  Tideal  ideal computation time for a parallel execution
  Tgroup  computation time of the grouped data chunks

Figure 1. Workload characteristics and performance issues on execution: (a) initial execution and expected ideal time (Nw = 5, μi = 8.00, σi = 11.49, Tideal = 16.00); (b) the workload partitioned into Nf = 10 equal-size chunks; (c) data chunks labeled with their computation times {3, 9, 1, 7, 2, 40, 4, 6, 5, 3}; (d) data chunks sorted in decreasing order of computation time.
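As a concrete illustration of the HFF ordering described above, the following Python sketch builds a static dispatch plan from the computation times measured in a previous exploration. It is a minimal sketch, not the authors' implementation: it assumes the times of all chunks are already known and assigns each chunk greedily to the worker that will become free first, whereas in the real system the master dispatches chunks on demand in decreasing order of measured time.

def hff_schedule(measured_times: dict[int, float], num_workers: int) -> list[list[int]]:
    """Heaviest Fragments First: dispatch the chunks with the largest
    measured computation times first, each to the least-loaded worker.

    measured_times maps chunk id -> computation time observed in the
    previous exploration (C_ij in the paper's notation).
    """
    loads = [0.0] * num_workers                # accumulated time per worker
    plan: list[list[int]] = [[] for _ in range(num_workers)]
    for chunk, t in sorted(measured_times.items(), key=lambda kv: kv[1], reverse=True):
        w = loads.index(min(loads))            # worker that becomes free first
        plan[w].append(chunk)
        loads[w] += t
    return plan

# Chunk times from Fig. 1(c), five workers as in the paper's example.
times = {0: 3, 1: 9, 2: 1, 3: 7, 4: 2, 5: 40, 6: 4, 7: 6, 8: 5, 9: 3}
print(hff_schedule(times, 5))

With these times, the 40-unit chunk is dispatched first and ends up alone on one worker, so the makespan is still 40 time units: exactly the imbalance that the partitioning step of Section III-B removes.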

A. Definition of the partitioning factor of the workload

When a data-intensive application with a workload T is waiting to be efficiently executed on Nw processing nodes, some performance issues may appear. For example, the execution of applications using a large number of data chunks (a higher partition factor) leads to good results in terms of execution time and load balancing; however, the serial part of each data chunk is replicated, which results in a processing time overhead. Our aim is to determine, at run time, the workload partition factor that enables a lower execution time.

To define the workload partition factor, we establish an initial common partition factor for all data chunks. This factor is based on the minimum value of some parameters of (i) the hardware and (ii) the application. These parameters enable us to decide the number of initial partitions for the workload. As hardware parameters, we have considered the available physical memory, the network bandwidth, and the number of available nodes. As application parameters, we have analyzed the data set size, the partitioning cost (time), and the number of explorations. In particular, the partitioning cost determines whether all partitions, or only the initial ones, must be generated before the execution of the application.

Once an initial size has been selected and the workload has been divided into Nf data chunks, we proceed to execute the application. By doing this, and given the characteristics of data-intensive applications, we can observe that even when all data chunks have the same size, the average computation time per data chunk may vary. We use this variation to determine the modifications to the size of the data chunks. Additionally, when monitoring the behavior of the application using the initial partition of the workload, we have identified two kinds of data chunks: (i) those with a computation time Cij above the average computation time of the exploration, μi; and (ii) those with a computation time Cij below the average computation time of the exploration, μi.

For a parallel execution using a fixed number of processing nodes, any node that finishes before the worker processing the data chunk labeled as Tmaxij will be idle until that worker finishes, resulting in an inefficient execution. For this reason, to improve efficiency, we may: (i) change the size of the data chunk(s) with a computation time equal or close to Tmaxij, dividing the chunk(s) into smaller pieces (as we have done with the workload initially); or (ii) gather contiguous data chunks with smaller computation times, to reduce the variability among the partitions of the workload.

B. Adjusting the Partition Factor

To improve the performance of data-intensive applications, in terms of execution time and efficiency, we propose a method that dynamically adapts the size of the generated data chunks. We do this to reduce the variability among the data chunks' computation times, enabling maximum utilization of the processing nodes. Our strategy considers: (i) repartitioning the data chunk(s) with maximum computation time; and (ii) gathering contiguous data chunks with low computation times. The main criterion for deciding when to partition or when to group is given by estimating the best possible computation time. In our case, this ideal time to reach (Tideal, shown in expression (1)) is given by the relation between the computation time of the entire workload executed serially, Tsi, and the total number of available processing nodes, Nw:

Tideal = Tsi/Nw = (μi ∗ Nf)/Nw    (1)

This way, by executing the data-intensive application once, on Nw processing nodes and using Nf data chunks, we can obtain the associated computation time of each data chunk, Cij. From these times, we can calculate the average computation time of the exploration, μi, and its standard deviation, σi. These two values enable us to identify the existing variation among the computation times of the data chunks.

1) Partitioning: When executing a data-intensive application in parallel, the total computation time of the execution, Ci, is given by the last worker that finishes processing. Usually, this delay is caused by having to process data chunks with large execution times (close to Tmaxij). With the aim of reducing this time and obtaining a balanced execution, we propose to split these data chunks and redistribute the resulting pieces among the processing nodes, keeping the Heaviest Fragments First approach.

In this work, we keep a conservative approach when partitioning, to prevent unnecessary repartitioning of data chunks with low execution times. We have defined a threshold to be met by the computation time of a data chunk before considering a new division. Expression (2) characterizes all data chunks that should be considered for repartitioning:

Cij > Tideal    (2)

We do this because, in some cases, the computation time of a repartitioned data chunk does not scale linearly; i.e., if a data chunk has a computation time T and it is divided into two new pieces, the computation time associated with each new piece is not necessarily T/2. We have observed this behavior through experimentation, and this non-linearity depends on both the algorithm and the data.

With the aim of reducing the gap between the data chunks with time Tmaxij and the ideal time Tideal, we need to estimate the computation time of the new data chunks obtained after repartitioning. We have used a statistic of order Nw (presented in expression (3)) to estimate an upper bound for the computation time of the new data chunks, based on the average and standard deviation of the data chunks' computation times and on the number of processing nodes used:

E = μi + σi ∗ √(Nw/2)    (3)

From this statistic, and with the aim of estimating the number of new partitions, we introduce the appropriate modifications to ensure the consistency of expression (3), resulting in expression (4):

E = (Tmaxij/y) + (σi/y) ∗ √(Nw/2)    (4)

In this expression, the average computation time of a data chunk that meets the time restriction (2) is given by the relation between its associated computation time Cij and the number of new data chunks generated, y. Similarly, the standard deviation of the new data chunks is represented as the relation between the standard deviation of processing the whole workload and the number of newly generated pieces. Requiring this estimate to stay below the ideal time yields expression (5):

(Tmaxij/y) + (σi/y) ∗ √(Nw/2) ≤ (μi ∗ Nf)/Nw    (5)

Next, to define the number of new data chunks to be generated, we solve expression (5) for y, as presented in expression (6). Having defined y, we can split the corresponding data chunks:

y = [Nw ∗ (Tmaxij + σi ∗ √(Nw/2))]/(μi ∗ Nf)    (6)

For example, when executing an exploration of the application using Nw = 5 and Nf = 10, we observe the behavior shown in Fig. 1(a). In this case, we have data chunks with different associated computation times; in particular, we can identify a data chunk with a large computation time (about 40 time units). From this exploration we obtain an average computation time μi = 8, a standard deviation σi = 11.49, and an expected ideal time Tideal = 16. After evaluating the restriction given by expression (2) and solving for y in expression (6), the resulting number of pieces into which data chunk j = 5 should be partitioned is y = 4, after rounding up to a whole number of pieces (this is shown in Fig. 2(a)).
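The following Python sketch reproduces the arithmetic of expressions (1), (2) and (6) on the running example of Fig. 1. It is a sketch under the paper's apparent conventions (sample standard deviation, which matches σi = 11.49, and rounding y up to a whole number of pieces, which matches y = 4); it is not the authors' code.

import math

def ideal_time(mu_i, n_f, n_w):
    """Expression (1): T_ideal = (mu_i * N_f) / N_w."""
    return (mu_i * n_f) / n_w

def pieces_to_split(t_max, sigma_i, mu_i, n_f, n_w):
    """Expression (6): number of new pieces y for a chunk with time t_max.

    Rounded up so that the estimated time of each piece, expression (4),
    stays below T_ideal (the restriction of expression (5)).
    """
    y = (n_w * (t_max + sigma_i * math.sqrt(n_w / 2))) / (mu_i * n_f)
    return math.ceil(y)

# Running example of Fig. 1: N_w = 5 workers, N_f = 10 chunks.
times = [3, 9, 1, 7, 2, 40, 4, 6, 5, 3]
n_w, n_f = 5, len(times)
mu = sum(times) / n_f                                             # mu_i = 8.00
sigma = math.sqrt(sum((t - mu) ** 2 for t in times) / (n_f - 1))  # sigma_i ~ 11.49
t_ideal = ideal_time(mu, n_f, n_w)                                # T_ideal = 16.00

for j, t in enumerate(times):
    if t > t_ideal:                                               # expression (2)
        y = pieces_to_split(t, sigma, mu, n_f, n_w)
        print(f"split chunk {j} into y = {y} pieces")             # chunk 5 -> y = 4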

However, for these new data chunks we do not know their associated processing times. We use a subsequent exploration to label the new data chunks with their corresponding times. To reduce the possibility of generating load imbalance, we send the new data chunks at the point where their original data chunk was expected to be sent (Fig. 2(b)). After the new chunks have been labeled (as shown in Fig. 2(c)), we can rearrange all the data chunks in decreasing order of computation time and proceed to the next exploration (Fig. 2(d)).

Figure 2. Partitioning data chunks with the highest execution times: (a) partitioning (chunk j = 5 is split into new chunks 50-53); (b) measuring size; (c) updating computation times (C'i = 12, 16, 8 and 7 for the new chunks); (d) tuning (μi = 6.38, σi = 4.21, Tideal = 16.00).

2) Grouping: For some data-intensive applications, partitioning the workload into too small data chunks may produce scheduling or communication overheads that need to be addressed. In our strategy, we propose to evaluate the performance of the application at run time and, when necessary, to group contiguous data chunks and distribute them to be processed as one single chunk. Initially, we considered gathering contiguous data chunks with low computation times, stopping, for each candidate group, when the sum of the associated computation times of its data chunks exceeds Tideal; this reduces the communication overheads and the effect of the serial fractions (when there are too many data chunks). However, by doing this we could generate too many data chunks with a similar computation time, which implies not having enough data chunks with low computation times to fill the gaps left by imbalances along the exploration. To face this situation, we have decided to be more precise in the time estimation when grouping data chunks, by defining a tighter time threshold for the grouped data chunks (shown in expression (7)). This way, the computation time of the new data chunks is kept well below the ideal time, which leaves data chunks small enough to fill gaps in the execution and facilitates a balanced execution:

Tgroup ≤ Tideal − μi    (7)

Therefore, we sort the data chunks in natural order and, using the execution times obtained from the previous exploration, we group only contiguous data chunks that do not exceed the restriction defined in expression (7). At this point, we evaluate the possible resulting computation time of a group, Tgroup, by adding the computation times of the adjacent data chunks. For this particular example, we group data chunks 2 and 3, and 8 and 9 (as shown in Fig. 3(a)), because the sum of each pair of computation times is below the defined threshold. After this, and similarly to the partitioning strategy, the new data chunks are executed when their original data chunks would have been executed (Fig. 3(b) and 3(c)), in order to label them with their new associated computation times. In this way, we can sort them in decreasing order for the following explorations.
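One possible realization of this grouping pass, in Python: a single sweep over the chunks in natural order that accumulates neighbors while the group stays below the threshold of expression (7). This is a hedged sketch of the described behavior, not the authors' implementation.

def group_contiguous(chunks, t_ideal, mu_i):
    """Gather contiguous chunks whose combined time stays below the
    threshold of expression (7): T_group <= T_ideal - mu_i.

    chunks is a list of (chunk_id, time) pairs in natural (id) order.
    Returns groups of chunk ids; singletons are left untouched.
    """
    threshold = t_ideal - mu_i
    groups, current, acc = [], [], 0.0
    for cid, t in chunks:
        if current and acc + t <= threshold:
            current.append(cid)          # extend the group: still under threshold
            acc += t
        else:
            if current:
                groups.append(current)
            current, acc = [cid], t      # start a new group at this chunk
    if current:
        groups.append(current)
    return groups

# Times of Fig. 2(d), after chunk 5 was split into chunks 50-53:
chunks = [(0, 3), (1, 9), (2, 1), (3, 7), (4, 2), (50, 12), (51, 16),
          (52, 8), (53, 7), (6, 4), (7, 6), (8, 5), (9, 3)]
print(group_contiguous(chunks, t_ideal=16.0, mu_i=6.38))
# -> [[0], [1], [2, 3], [4], [50], [51], [52], [53], [6], [7], [8, 9]]

With μi = 6.38 and Tideal = 16, the threshold is 9.62, and the sweep yields exactly the groups {2, 3} and {8, 9} of Fig. 3(a), leaving every other chunk alone.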

Figure 3. Grouping data chunks with the shortest execution times: (a) grouping (chunks 2 and 3, and 8 and 9, are merged into chunks 2^3 and 8^9); (b) measuring size; (c) updating computation times; (d) tuning (μi = 7.54, σi = 4.00, Tideal = 16.00).

IV. EVALUATION RESULTS

To evaluate the proposed strategy for dynamically tuning the workload partition factor in data-intensive applications, we have tested our partitioning and grouping approach. The aim of this evaluation has been to analyze the behavior of the applications when changing the size of the data chunks at run time. As a first step, we have implemented an analytical simulator to analyze a broad number of scenarios. Then, as a second step, we have tested the method on a real data-intensive application, BLAST, to observe its behavior when using our proposal.

In both steps, we have analyzed data-intensive applications developed under a Master/Worker paradigm, where each worker is assigned to a different processing node and workers do not communicate among themselves. The master process delivers the data chunk that should be processed to the next available worker, and each worker returns the associated computation time of that data chunk when it finishes.

The experiments have been conducted on a cluster of workstations composed of 32 processing nodes, each with 12 GB of memory at 667 MHz FSB. Each node consists of two dual-core Intel Xeon 5160 processors at 3 GHz with 4 MB of L2 cache.
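The Master/Worker dispatch just described can be captured analytically in a few lines. The sketch below is an illustration, not the authors' simulator: it hands each chunk, in dispatch order, to the worker that becomes free first and reports the makespan of the exploration; the fixed per-chunk communication cost is our simplifying assumption.

import heapq

def simulate(chunk_times, n_w, comm_per_chunk=0.0):
    """Analytical Master/Worker sketch: assign chunks, in dispatch order,
    to the worker that becomes free first; return the makespan.

    chunk_times must already be in dispatch order (e.g. HFF: decreasing).
    """
    workers = [0.0] * n_w                 # next free time of each worker
    heapq.heapify(workers)
    for t in chunk_times:
        free_at = heapq.heappop(workers)  # earliest available worker
        heapq.heappush(workers, free_at + comm_per_chunk + t)
    return max(workers)

# Fig. 1 example: HFF order, five workers.
print(simulate(sorted([3, 9, 1, 7, 2, 40, 4, 6, 5, 3], reverse=True), 5))  # -> 40.0

Fed with the HFF-ordered times of Fig. 1, the makespan is 40 time units: the heaviest fragment alone fixes the lower bound, which is precisely the barrier that repartitioning removes.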

A. Analytical Simulator

To assess our proposal, we have implemented an analytical simulator that reproduces a Master/Worker pattern. The developed tool allows us to observe and analyze, for a broad range of scenarios, the influence of the size of the data chunks on the overall execution time of applications. Using the analytical simulator, we have evaluated the effect of our proposal on improving the execution time when: (i) adding more resources; and (ii) introducing variability between the performed explorations.

As a first step, the simulator is fed with different scenarios of input data. These initial data include: the execution time of every data chunk (in seconds), the size of the data chunks (in megabytes), the communication time per megabyte, and the number of computation nodes. Then, every scenario is processed while tuning the partition factor. The final result is the total execution time of the application for each scenario, expressed in seconds.

1) Evaluating performance on a representation of data-intensive applications: To observe the performance improvement, we have used the simulator to evaluate the analytical expressions. We have performed the simulations under the following conditions:
• An initial partition factor Nf = 128;
• Nw values ranging from 10 to 80;
• Two different scenarios: our Heaviest Fragments First tuned method (HFF tuned), and the Heaviest Fragments First simple distribution policy (HFF simple); and
• Execution times of the data chunks generated following a normal distribution (a sketch of such a generator follows this list).
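Scenarios like the ones listed above can be drawn from a normal distribution; the Python sketch below shows one way to do it. The mean and standard deviation values are illustrative assumptions of ours, since the paper does not report the exact distribution parameters.

import random

def make_scenario(n_f, mean, stddev, seed=0):
    """Chunk computation times drawn from a normal distribution, clamped
    to a small positive floor so no chunk has a non-positive time."""
    rng = random.Random(seed)
    return [max(0.1, rng.gauss(mean, stddev)) for _ in range(n_f)]

# One scenario per the conditions above: N_f = 128 chunks; it can be fed,
# HFF-sorted, to a dispatch model like the one sketched in Section IV.
times = make_scenario(128, mean=60.0, stddev=30.0)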

The results presented in Fig. 4 show the difference between the maximum execution times Tmaxij obtained with and without applying our strategy for tuning the workload partition factor. These results show that the execution time limitation imposed by data chunks with large computation times can be softened or removed through the repartitioning of those data chunks. It is worth noticing that once this barrier has been removed, the total execution time of the application can be reduced by adding more workers.

Figure 4. Performance improvement when changing data chunk sizes.

2) Introducing variability in the computation time of explorations: Our load balancing strategy is based on tuning the size of the data chunks with higher, and lower, processing times. To accomplish this, we use the history of the computation times measured for each data chunk to adapt the partition factor. However, predictions are likely to fail to some degree, and the observed results can differ from the expected ones. The aim of this experiment is to see how the total execution time is affected when a certain degree of error is introduced when changing the size of the data chunks. We have set the simulation environment to use Nf = 128 and Nw = 64, and we have introduced a certain percentage of variation in the time of each element of the set (from no variation, 0%, up to a 500% variation of the measured time). The greater this percentage, the greater the variability in the time associated with each new data chunk. The generated data set is evaluated for the same scenarios as in the previous subsection. The simulation process is repeated 500 times, and the results are the average values of the computation time.

Introducing variability in the computation time of the data chunks tends to degrade the performance of the HFF simple strategy. Nevertheless, Fig. 5 shows that, when our method is applied, the time degradation is significantly reduced, because the maximum execution time decreases when the corresponding data chunks are repartitioned.

Figure 5. Performance improvement when introducing variability.

B. Real Application: BLAST

The aim of this experiment is to evaluate the performance of a real scientific application when tuning the workload partition factor. We have chosen one of the most widely used bioinformatics tools, BLAST (Basic Local Alignment Search Tool) [8]. BLAST is both a CPU- and a data-intensive application. It presents irregular computation times due to the data characteristics, and it works with biological databases of up to 50 GB in size (and growing) that can be arbitrarily divided into non-dependent data chunks.

Most parallel BLAST versions, such as mpiBLAST [4], take advantage of the system parallelism by applying database partitioning: data processing is carried out after a previous partition of the data, and the generated fragments are distributed among all the available workers to search for similarities with the input sequence. Unfortunately, such parallel BLAST versions can present inefficient data distribution strategies, because the computation time depends on the database fragment: the more similarities BLAST encounters, the more time is required to process a fragment.

To tackle this problem, we have applied HFF simple and HFF tuned using BLAST ncbi-blast-2.2.23. We have chosen a database of nucleotides (≈36 GB in size) called nt, and created eight partitions of the data set with equal-size data chunks, for Nf = {16, 32, 64, 128, 256, 512, 1024, 2048}. We have used a query of 1 MB literally chopped from the last part of the nt database. This piece was selected due to its long computation time (a couple of hours on our computing platform).

For the initial exploration, we have chosen Nf = 128 and Nw = 32, obtaining an average execution time of μi = 599.08 seconds, a standard deviation of σi = 670.86, and an expected ideal time equal to 2,396.32 seconds. These results, and the computation time of each data chunk, are shown in Fig. 6(a). Additionally, after gathering and repartitioning the data chunks, the results for the newly partitioned workload are shown in Fig. 6(b). In this case, Nf is greatly reduced, and so is the maximum computation time per data chunk; consequently, μi has changed to 1,107.56 seconds.

Figure 6. Variation of computation times by data chunk: (a) initial computation time; (b) computation time after tuning.

Additionally, we have evaluated the behavior of the two strategies when increasing the number of workers from 4 to 32. We have compared the total execution times of both approaches with their corresponding ideal times Tideal; the results are shown in Fig. 7. In contrast with HFF simple, when our method is applied to tune the size of the data chunks that limit the time, the total execution time of the application stays close to the ideal time. These results are promising because, once the barrier set by data chunks with large computation times has disappeared, the overall execution time of the application can be reduced by adding more processing nodes. In this case, the reduction in the minimum execution time has been of up to 55% with respect to HFF simple.

Figure 7. Total execution time improvement when tuning the partition factor.
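When repartitioning on the fly is too expensive, as for the nt database here, the strategy selects among pre-generated partitions instead. The helper below is hypothetical (its name and selection policy are ours, not the paper's): it maps a tuning decision "split one chunk of the current partition into y pieces" to the closest pre-generated partition factor.

PREGENERATED_NF = [16, 32, 64, 128, 256, 512, 1024, 2048]  # the BLAST data sets above

def pick_partition(current_nf, y):
    """Hypothetical selection policy: a chunk of a partition with factor
    current_nf, split into y pieces, has the size of a chunk in a partition
    with factor current_nf * y; return the closest available factor."""
    target = current_nf * y
    return min(PREGENERATED_NF, key=lambda nf: abs(nf - target))

print(pick_partition(128, 4))  # -> 512: chunks four times smaller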

V. CONCLUSIONS AND FUTURE WORK

The recent growth of the data that needs to be processed, coming from sensors, from biological and physical experiments, and from information generated by users, has led to the design of new methods to satisfy its basic processing requirements. Concepts such as data-intensive or big-data computing have emerged in the last few years and, along with these terms, approaches such as dividing the workload of applications into smaller pieces have become more common. With all this in mind, the number of performance problems related to load balancing has also increased.

In this work, we have addressed the problem of load balancing through a dynamic strategy for tuning the workload partition factor of certain data-intensive applications. In particular, we have considered data-intensive applications that perform multiple related explorations or queries on the same data set, and whose workload allows arbitrary partitioning into smaller chunks. Our method changes, at run time, the size of these data chunks by repartitioning or gathering specific pieces according to the application's performance. By monitoring each exploration and analyzing the collected data, it has been possible to decide the adequate modifications to the workload partition factor. In executions with data chunks with large computation times, our strategy changes the size of these chunks into smaller pieces and redistributes them in a subsequent exploration. In cases with higher partitioning costs, the strategy determines a range of pre-partitioned sets of the workload and selects the new data chunks from them. Similarly, when the computation times are too low, our strategy groups contiguous data chunks to reduce communication overheads.

The method has been assessed through simulation and by executing a real data-intensive application. The results obtained are encouraging, reducing the minimum total execution time by up to 55%. As open lines and future work, we plan to include refinements in the method to dynamically adapt the workload partition factor for a broader range of applications.

ACKNOWLEDGMENT

This research has been supported by the MICINN Spain, under contracts TIN2007-64974 and TIN2011-28689.

REFERENCES

[1] M. Cannataro, D. Talia, and P. K. Srimani, "Parallel data intensive computing in scientific and commercial applications," Parallel Computing, vol. 28, no. 5, pp. 673–704, May 2002.

[2] R. E. Bryant, "Data-Intensive Supercomputing: The Case for DISC," Carnegie Mellon Univ., Tech. Rep., May 2007.

[3] R. E. Bryant, R. H. Katz, and E. D. Lazowska, "Big-Data Computing," Computing Research Association, White Paper, 2008.

[4] A. E. Darling, L. Carey, and W. Feng, "The Design, Implementation, and Evaluation of mpiBLAST," in 4th International Conference on Linux Clusters: The HPC Revolution 2003, in conjunction with ClusterWorld Conference & Expo, 2003.

[5] V. Bharadwaj, D. Ghose, and T. G. Robertazzi, "Divisible Load Theory: A New Paradigm for Load Scheduling in Distributed Systems," Cluster Computing, vol. 6, pp. 7–17, 2003.

[6] S. F. Hummel, E. Schonberg, and L. E. Flynn, "Factoring: a method for scheduling parallel loops," Communications of the ACM, vol. 35, pp. 90–101, Aug. 1992.

[7] A. Middleton, "HPCC Systems: Introduction to HPCC (High-Performance Computing Cluster)," LexisNexis Risk Solutions, White Paper, 2011. [Online]. Available: http://hpccsystems.com/community/white-papers/hpcc-intro

[8] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. Lipman, "Basic Local Alignment Search Tool," Journal of Molecular Biology, vol. 215, pp. 403–410, Oct. 1990.

[9] S.-S. Boutammine, D. Millot, and C. Parrot, "An Adaptive Scheduling Method for Grid Computing," in Euro-Par 2006 Parallel Processing. Springer Berlin, Heidelberg, 2006, vol. 4128, pp. 188–197.

[10] M. Drozdowski and P. Wolniewicz, "Divisible Load Scheduling in Systems with Limited Memory," Cluster Computing, vol. 6, pp. 19–29, 2003.

[11] S. Chuprat and S. Baruah, "Scheduling Divisible Real-Time Loads on Clusters with Varying Processor Start Times," in Proceedings of the 14th IEEE International Conference on Embedded and Real-Time Computing Systems and Applications, 2008, pp. 15–24.

[12] O. Beaumont, H. Casanova, A. Legrand, Y. Robert, and Y. Yang, "Scheduling divisible loads on star and tree networks: results and open problems," IEEE Transactions on Parallel and Distributed Systems, vol. 16, no. 3, pp. 207–218, Mar. 2005.

[13] J. Berlińska and M. Drozdowski, "Scheduling divisible MapReduce computations," Journal of Parallel and Distributed Computing, vol. 71, no. 3, pp. 450–459, 2011.

[14] E. César, A. Moreno, J. Sorribes, and E. Luque, "Modeling Master/Worker applications for automatic performance tuning," Parallel Computing, vol. 32, no. 7-8, pp. 568–589, 2006.

[15] A. Moreno, E. César, A. Guevara, J. Sorribes, T. Margalef, and E. Luque, "Dynamic Pipeline Mapping (DPM)," in Proceedings of the 14th International Euro-Par Conference on Parallel Processing, ser. Euro-Par '08. Berlin, Heidelberg: Springer-Verlag, 2008, pp. 295–304.

[16] A. Moreno, E. César, J. Sorribes, T. Margalef, and E. Luque, "Task distribution using factoring load balancing in Master-Worker applications," Information Processing Letters, vol. 109, pp. 902–906, 2009.

[17] R. E. De Grande and A. Boukerche, "Predictive dynamic load balancing for large-scale HLA-based simulations," in Proceedings of the 2011 IEEE/ACM 15th International Symposium on Distributed Simulation and Real Time Applications, ser. DSRT '11. Washington, DC, USA: IEEE Computer Society, 2011, pp. 4–11.

[18] T. Chiba, M. den Burger, T. Kielmann, and S. Matsuoka, "Dynamic Load-Balanced Multicast for Data-Intensive Applications on Clouds," in Proceedings of the 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing, ser. CCGRID '10. Washington, DC, USA: IEEE Computer Society, May 2010, pp. 5–14.

[19] L. Glimcher, V. Ravi, and G. Agrawal, "Supporting load balancing for distributed data-intensive applications," in International Conference on High Performance Computing (HiPC '09), Dec. 2009, pp. 235–244.

[20] C. Rosas, A. Morajko, J. Jorba, and E. César, "Workload Balancing Methodology for Data-Intensive Applications with Divisible Load," in Proceedings of the 23rd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD). IEEE Computer Society, 2011, pp. 48–55.