
Parallel Computing xxx (2006) xxx–xxx www.elsevier.com/locate/parco

Self-adaptive skeletal task farm for computational grids

Horacio González-Vélez *

School of Informatics, University of Edinburgh, Edinburgh EH9 3JZ, UK

Received 20 December 2005; received in revised form 8 July 2006; accepted 11 July 2006

Abstract

In this work, we introduce a self-adaptive task farm for computational grids which is based on a single-round scheduling algorithm called dynamic deal. In principle, the dynamic deal approach employs skeletal forecasting information to automatically instrument the task farm scheduling and determine the amount of work assigned to each worker at execution time, allowing the farm to adapt effectively to different load and network conditions in the grid. In practice, it uses self-generated predictive execution values and maps tasks onto the different nodes in a single round. The effectiveness of this approach is illustrated with a computational biology parameter sweep in a non-dedicated departmental grid.

© 2006 Elsevier B.V. All rights reserved.

Keywords: Divisible workload; Grid resource management; Single-round scheduling; Parameter sweep; Algorithmic skeleton

1. Introduction

The use of efficient programming models and structures, which can be staged in a scalable, structured fashion, has long been sought after in computational grids [1,2]. These programming models must necessarily be performance-oriented, and they are expected to provide guidance on the overall execution of their jobs in order to assist in the deployment of heterogeneous resources and policies. From a systems administration perspective, the demand for strategies which minimise communication overheads makes resource management and scheduling key to the correct functioning of a computational grid [3].

Grid predictive techniques have increasingly been used to enhance application performance. Such techniques include the estimation of file transfer times [4], the generation of program traces prior to execution [5], and the provision of comprehensive environments comprising resource discovery, selection, and scheduling [6].

As a matter of theory, the resource availability–performance premise is widely accepted as the determining factor of application adaptiveness in a grid. According to this premise, applications must be able to transform, evolve, or extend their behaviour to conform to the resources present at a certain time, in order to improve their efficiency [7]. The major problem in empirical research based on this premise has arisen from the autonomic deployment of generic applications.

* Tel.: +44 131 650 5152; fax: +44 131 667 7209. E-mail address: [email protected].

0167-8191/$ - see front matter © 2006 Elsevier B.V. All rights reserved. doi:10.1016/j.parco.2006.07.002


Moreover, we consider that scant research has been devoted to exploiting the structure of the application to improve overall grid resource management. Although jobs, or workloads, in a grid must be divided into tasks in order to minimise communication costs, little attention has been paid to partitioning based on the application structure. When a workload is arbitrarily divisible into totally independent tasks, any number of tasks can be grouped and assigned to any available node, making it particularly suitable for grid deployment.

Large divisible workloads can be successfully executed using a task farm (TF). A TF exploits the independent nature of the workload to generate a multiplicity of tasks to compute results. Its operation can be adumbrated as a farmer process which spawns a number of independent worker processes to run simultaneously. Each worker executes a series of tasks by applying the given worker function. Different TF implementations assign different numbers of tasks per worker, based on their scheduling method.

The TF structure can be efficaciously abstracted using algorithmic skeletons (AS) [8]. AS synthesise commonly used patterns of parallel computation, communication, and interaction. Furnishing top-down design composition and control inheritance, AS are typically implemented as parameterisable building blocks, in the form of patterns, templates, or higher-order functions [9], and complete programs are expressed using these blocks analogously to the way sequential structured programs are constructed [10]. Several AS libraries feature task farm constructs with distinct control structures which provide support for heterogeneous systems [11–13]. However, these libraries rely on theoretical performance models and rarely use the intrinsic forecasting capabilities of AS from a systems infrastructure standpoint. Because AS capture the structure of the application and can inherently supply performance estimation guidelines, they provide a solid foundation for resource-guided adaptiveness. In particular, we argue that large divisible workloads can substantially improve their overall execution time through an adaptive task farm, in the form of an instrumented algorithmic skeleton which wisely schedules the number of tasks per node.

1.1. Contribution and structure

Based on the central premise of application adaptiveness to resource availability, we would like to research their actual correlation and provide a methodology to enable generic parallel programming patterns to conform to the heterogeneity of a computational grid. Furthermore, we consider it relevant to explore this correlation by employing the forecasting information of the algorithmic skeleton and exploiting its intrinsic adaptiveness through generic conventions. Based on a skeletal TF which incorporates information at compilation time and adapts at execution time, our methodology takes a pragmatic approach, using a single-round scheduling algorithm which distributes tasks among workers based on their ability to compute results. This work improves on our initial results on adjustable farming in computational grids [14] by incorporating an autonomic calibration function. The validity of this approach is supported using a parameter sweep for the stochastic simulation of calcium currents in cells. This parameter sweep provides a suitable testbed, since the parameter-space experiments can be graded in terms of numerical complexity.

This paper is structured as follows.
First, we discuss other related approaches and provide motivation for this work. Then we describe the dynamic deal approach and its implementation, followed by the experimental evaluation using a parameter sweep case study. Finally, we present some future directions.

2. Related work

The generic problem of mapping groups of totally independent tasks with homogeneous algorithmic complexity to computational nodes is documented in the literature as the scheduling of divisible workloads [15,16]. Abstract approaches to divisible load scheduling develop optimal theoretical models for distributing tasks [17]. Although they provide a guideline for tackling the task distribution problem, they rely on arbitrary deterministic assumptions about the underlying infrastructure, such as the network topology, processor capacity, and termination times, which cannot easily be generalised to accommodate the dynamics of a computational grid.

Moreover, the number of tasks assigned to a worker, determined by the TF scheduling mechanism, defines the processing done and, ultimately, the execution time on a per-node basis. In its canonical form, TF scheduling is based on a self-scheduled workqueue [18], which supplies a single task to any available worker.


After processing its task, a worker reports back to the farmer for the next unit of work or for termination. For a given job, each worker normally processes several tasks in multiple rounds. The workqueue strategy provides an acceptable load-balancing strategy for large workloads in dedicated systems with fixed network latency. The greedy nature of self-scheduling allows the assign-to-idle-node scheme to balance the system load over time. The generalisation of the workqueue model allocates more than one task per round and takes variable network latency into account, effectively distributing small chunks of the workload in a greedy fashion and spawning several multi-round scheduling algorithms [19].

In contrast to multi-round scheduling, single-round scheduling distributes the workload among the workers at once. The deal approach is widely acknowledged for its effectiveness in homogeneous systems. It consists of dividing a workload of tasks with similar processing requirements evenly among the workers, minimising the overall communication costs, since the farmer must communicate with a worker only twice: once to distribute the tasks and once to receive the results. However, single-round scheduling in non-dedicated heterogeneous systems is still considered an open question in computational science [20].

3. The dynamic deal method

The dynamic deal method synthesises the forecasting nature of AS into a single-round scheduling policy for divisible workloads, which can drive the adaptiveness of the TF on heterogeneous distributed systems. In contrast to other approaches, it aims to optimise application performance in a grid from a systems infrastructure standpoint, using real resource measurements and application times.

Since the definition of a correct single-round schedule rests on the premise that the workload can be wisely distributed to the nodes with the most convenient resources for a given application, it is crucial to be able to represent generically the equivalence between application adaptivity and resource convenience. We propose to quantify this equivalence, that is to say, the worker resources at a given time on a certain grid topology from an application-specific perspective, through a fitness index F. The TF scheduling mechanism can then arguably use F as a guideline to distribute the workload among the workers. It is important to note that F is not only application dependent but also time and resource constrained.

The dynamic deal algorithm makes the following assumptions:

• The TF farmer and each worker process are mapped to different nodes of a computational grid. The terms worker and node will be used interchangeably.
• All tasks of the workload are independent and have similar algorithmic complexity.
• The grid fabric comprises a non-dedicated network, clusters, and management software which enables a direct connection from any worker to the farmer.

Let α denote the workload assigned to the farmer, expressed as the number of tasks, and N the number of participating workers. Thus, our objective is to calculate

$$\alpha_i \;\; \forall i \in [1, N] \qquad \text{subject to} \qquad \sum_{i=1}^{N} \alpha_i = \alpha$$

Then let t_i be the total time it takes worker i to compute α_i, including communication overheads, and let a_i and ℓ_i be its processing availability and its communication latency respectively. The processing availability measures the CPU fraction allocatable to a new process executed on this worker, while the latency is the time to receive a TCP message from the farmer. In order to define the relation between t, a, and ℓ for the entire TF, we automatically calibrate the N nodes with the execution of a single task from the workload (typically N ≪ α) and write the execution times to t; a sketch of this calibration step follows the list of options below. Based on this information, we propose three different options to determine F:


• Times-only: only the calibration times are used to determine F.
• Univariate linear regression: F is determined employing a curve-fitting method for the relation between t and a.
• Multivariate regression: a and ℓ are considered independent variables and are employed to fit the t values.
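All three options rely on the calibration times gathered before scheduling. The following is a minimal sketch of that calibration step, assuming the farmer is MPI rank 0 and the workers occupy ranks 1 to N; the function name and the convention that each worker sends back its elapsed time as a single double are our illustration, not the paper's actual code.

#include <mpi.h>

/* Calibration sketch: each worker executes one representative task and
 * reports its wall-clock time; the farmer gathers the times into t[].
 * Assumes rank 0 is the farmer and ranks 1..N are the workers. */
void calibrate(int N, const char *one_task, int task_sz,
               double t[], MPI_Comm comm)
{
    /* Hand the single calibration task to every worker. */
    for (int w = 1; w <= N; w++)
        MPI_Send(one_task, task_sz, MPI_BYTE, w, 0, comm);

    /* Each worker is assumed to time f() on the task (e.g. with
     * MPI_Wtime) and send the elapsed seconds back as one double. */
    for (int w = 1; w <= N; w++)
        MPI_Recv(&t[w - 1], 1, MPI_DOUBLE, w, 1, comm, MPI_STATUS_IGNORE);
}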

3.1. Times-only

Let τ = t. The simplest way to define F is as a normalised decreasing function based on the inverse of τ_i, as shown in (1). Thus, the α_i values are directly determined using (3). Note that (2) ensures the dynamic deal property, in other words, that the workload will be distributed to the workers in a single round.

$$F_i = \frac{1/\tau_i}{\sum_{j=1}^{N} 1/\tau_j} \qquad (1)$$

By construction:

$$\sum_{i=1}^{N} F_i = 1 \qquad (2)$$

Therefore,

$$\alpha_i = \alpha \cdot F_i \quad \forall i \in [1, N] \qquad (3)$$
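As a concrete illustration of Eqs. (1)–(3), the following sketch computes the fitness indices and per-worker task counts from the calibration times. The rounding policy (giving any integer remainder to the fastest-calibrating worker so that the counts sum to α) is our assumption, since the paper does not specify one.

#include <math.h>

/* Times-only dynamic deal sketch: from calibration times tau[0..N-1]
 * and the total workload alpha, compute the task counts alpha_i[]. */
void times_only_deal(int N, const double tau[], int alpha, int alpha_i[])
{
    double inv_sum = 0.0;
    int assigned = 0, fastest = 0;

    for (int i = 0; i < N; i++)
        inv_sum += 1.0 / tau[i];

    for (int i = 0; i < N; i++) {
        double F = (1.0 / tau[i]) / inv_sum;   /* Eq. (1): normalised fitness */
        alpha_i[i] = (int)floor(alpha * F);    /* Eq. (3): tasks per worker   */
        assigned += alpha_i[i];
        if (tau[i] < tau[fastest])
            fastest = i;
    }

    /* Integer rounding can leave a few tasks over; assign the remainder
     * to the fastest worker so the counts sum to alpha, preserving the
     * single-round property implied by Eq. (2). */
    alpha_i[fastest] += alpha - assigned;
}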

3.2. Univariate linear regression

Let us define a′_i, the scaled availability for worker i, as

$$a'_i = a_i \cdot rp'_i \qquad \text{where } rp'_i = \frac{bm_i}{\max_{j \in [1,N]} bm_j},$$

rp′_i being the relative performance of worker i, bm_i any known benchmark value for worker i, and max_j bm_j the maximum bm_i among the N workers.

Using linear least-squares regression, we set a′, the vector of the a′_i for the N workers, as the predictor (independent variable) and let t be the dependent variable. We then try to fit a curve along the observed values in t using the regression function in (4), which can be determined by minimising the sum of squared residuals in (5).

$$t = c_0 + c_1 a' \qquad (4)$$

$$\chi^2 = \sum_{i=1}^{N} \bigl(t_i - (c_0 + c_1 a'_i)\bigr)^2 \qquad (5)$$

An example of this linear least-squares fitting methodology is presented in Fig. 1. Our objective is to assign fewer tasks to the workers that were slower and, in consequence, to minimise the overall execution time.

$$\tau = c_0 + c_1 a' \qquad (6)$$

Hence, we calculate F in (1) using the estimated (fitted) values τ from (6). Then α_i is calculated analogously to the times-only method, using (3).
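Section 4 states that the univariate case is implemented with the GNU Scientific Library routine gsl_fit_linear; the following is a minimal sketch under that assumption, with illustrative variable names. The fitted τ values then feed Eqs. (1) and (3) exactly as in the times-only case.

#include <gsl/gsl_fit.h>

/* Univariate fit sketch: from scaled availabilities a_scaled[] and
 * calibration times t[], produce the estimated times tau[] of Eq. (6). */
void fit_univariate(int N, const double a_scaled[], const double t[],
                    double tau[])
{
    double c0, c1, cov00, cov01, cov11, chisq;

    /* Least-squares fit of Eq. (4): t = c0 + c1 * a'. The returned
     * chisq is the sum of squared residuals of Eq. (5). */
    gsl_fit_linear(a_scaled, 1, t, 1, (size_t)N,
                   &c0, &c1, &cov00, &cov01, &cov11, &chisq);

    /* Eq. (6): estimated (fitted) execution times. */
    for (int i = 0; i < N; i++)
        tau[i] = c0 + c1 * a_scaled[i];
}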

Fig. 1. Univariate linear regression analysis for a 48-worker instance, using a′ as the predictor and t as the dependent variable; the fitted line is τ = 59.1628 − 47.1872 a′, with χ² = 5916.97.

3.3. Multivariate linear regression

Since processor availability is not necessarily the only determining factor, further exploration needs to take additional system parameters into account. In order to provide ground for discussion, Fig. 2 introduces a schematic representation of the relation between node-processor availability, communication latency, and execution times for the case study discussed in Section 5.1.

Fig. 2. Correlation between scaled availability (a′), latency (ℓ), and execution times (t). Darker segments in the figure correspond to shorter execution times.

It is clear from Fig. 2 that the shortest execution times, represented by the darkest segments, tend to gravitate towards the right, following the higher values of a′, while the longest times are located in the upper-left segment (lowest a′). In this particular case, the strong implication of the trend is that the execution time on a given node is determined in this application by the processor availability and influenced, to a lesser extent, by the latency.

Using multivariate linear least-squares regression, we set a′ and ℓ as the predictor vectors within the X matrix and let t be the dependent vector. Analogously to Fig. 2, we fit a surface along the observed values in t using the regression function in (7), which can be determined by minimising the sum of squared residuals in (8).

$$t = Xc \qquad (7)$$

$$\chi^2 = (t - Xc)^{T}(t - Xc) \qquad (8)$$

Analogously to the univariate case, we use τ = c_0 + c_1 ℓ + c_2 a′. A summary of the whole dynamic deal methodology is presented in Algorithm 1.

Algorithm 1. Dynamic Deal
INPUT: a divisible workload with α tasks; N workers
OUTPUT: ⟨α_1, α_2, ..., α_N⟩
1: Calibrate the N nodes with one task; store the execution times in t
2: if using a linear regression method then
      collect a_i, ℓ_i, bm_i ∀i ∈ [1, N]
      determine a′_i ∀i ∈ [1, N]
      if univariate regression then
         calculate t = f(a′)
      else
         calculate t = f(a′, ℓ)
      let τ be the estimated t
   else
      let τ ← t
3: F_i ← (1/τ_i) / Σ_{j=1}^{N} (1/τ_j), ∀i ∈ [1, N]
4: α_i ← α · F_i, ∀i ∈ [1, N]
5: return ⟨α_1, α_2, ..., α_N⟩
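The multivariate fit is likewise implemented with gsl_multifit_linear according to Section 4. A sketch, assuming a design matrix with rows [1, ℓ_i, a′_i] so that τ = c_0 + c_1 ℓ + c_2 a′, is shown below; variable names are ours.

#include <gsl/gsl_multifit.h>

/* Multivariate fit sketch for Eqs. (7)-(8): predictors are latency and
 * scaled availability; output tau[] holds the estimated times. */
void fit_multivariate(int N, const double a_scaled[], const double lat[],
                      const double t[], double tau[])
{
    gsl_matrix *X   = gsl_matrix_alloc(N, 3);
    gsl_vector *y   = gsl_vector_alloc(N);
    gsl_vector *c   = gsl_vector_alloc(3);
    gsl_matrix *cov = gsl_matrix_alloc(3, 3);
    double chisq;   /* residual of Eq. (8) */

    for (int i = 0; i < N; i++) {
        gsl_matrix_set(X, i, 0, 1.0);          /* intercept term c0     */
        gsl_matrix_set(X, i, 1, lat[i]);       /* latency term c1       */
        gsl_matrix_set(X, i, 2, a_scaled[i]);  /* availability term c2  */
        gsl_vector_set(y, i, t[i]);
    }

    gsl_multifit_linear_workspace *w = gsl_multifit_linear_alloc(N, 3);
    gsl_multifit_linear(X, y, c, cov, &chisq, w);   /* solves Eq. (7) */

    for (int i = 0; i < N; i++)
        tau[i] = gsl_vector_get(c, 0)
               + gsl_vector_get(c, 1) * lat[i]
               + gsl_vector_get(c, 2) * a_scaled[i];

    gsl_multifit_linear_free(w);
    gsl_matrix_free(cov); gsl_vector_free(c);
    gsl_vector_free(y);   gsl_matrix_free(X);
}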

4. Implementation

As a result of its skeletal structure, a TF programmer only needs to define the tuple ⟨I, O, f⟩, where I is the vector containing the tasks (|I| = α), O is the vector where the results are stored, and f is the worker function. That is to say, a worker computes a subset of O by applying f to its assigned subset of I. Then, based on the prevailing load conditions of the defined platform, the calibration phase automatically calculates F and the corresponding number of tasks per node α_i, and proceeds with the TF single-round scheduling. It requires no further interaction with the user.

Fig. 3 presents the algorithmic skeleton API implementing the TF. It provides sufficient flexibility to accommodate different options in terms of the worker function, the type and size of the input and output, the MPI communicator, and the scheduling mode. Valid scheduling modes are:


(1) SCH_TRAD: based on workqueue management (one task at a time).
(2) SCH_DEAL: single-round deal assuming equal work chunks for all nodes.
(3) SCH_DEALDYN_LR: dynamic deal with univariate linear regression.
(4) SCH_DEALDYN_MV: dynamic deal with multivariate linear regression.
(5) SCH_DEALDYN_SM: dynamic deal with times-only.

Fig. 3. Application program interface to the algorithmic skeleton of the adaptive task farm.
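Since Fig. 3 itself is not reproduced here, the following is a purely hypothetical illustration of what a call of this kind could look like; the signature is ours and should not be taken as the library's actual API.

#include <mpi.h>
#include <stddef.h>

/* Hypothetical illustration only: a task-farm skeleton call taking the
 * <I, O, f> tuple, an MPI communicator, and a scheduling mode. The real
 * API is the one shown in Fig. 3 of the paper. */
typedef void (*worker_fn)(const void *task_in, void *result_out);

int taskfarm(const void *I, size_t task_size, size_t n_tasks,  /* |I| = alpha */
             void *O, size_t result_size,
             worker_fn f, MPI_Comm comm, int sched_mode);

/* Example invocation with the dynamic deal univariate mode:
 *   taskfarm(I, sizeof(experiment), 960, O, sizeof(double),
 *            simulate_channels, MPI_COMM_WORLD, SCH_DEALDYN_LR);
 * where simulate_channels and experiment are application-defined names. */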

Note that this skeleton can even be used as a non-adaptive TF, via the standard workqueue or deal scheduling modes. It has been programmed as an ANSI C library call and compiled with the GNU gcc compiler. To support adaptiveness, it must be linked against the following libraries:

• The GNU Scientific Library, for the linear regression routines [21]. In particular, we use gsl_fit_linear and gsl_multifit_linear for the univariate and multivariate cases respectively.
• The Network Weather Service [22], for the forecasts of CPU availability (a) and latency (ℓ). NWS employs proven statistical methods to estimate resource availability.

Furthermore, it uses the Message Passing Interface for inter-node communication and parallel processing, primarily MPI_Send and MPI_Irecv for message exchange and MPI_Wait and MPI_Test for synchronisation. From previous experience with farms in grids [14], we have avoided the use of MPI collective operations.

We have chosen the BogoMips benchmark [23] as the performance metric. The BogoMips value is widely available on Linux systems and is intended to reflect the processing power of a node more accurately than the standard clock frequency. Although the initial deployment was on Linux, most of the code is portable, except for the BogoMips-dependent snippet, which is tied to the existence of the /proc/cpuinfo file. The actual task farm has been successfully deployed on other Unix flavours.

For its correct functioning, the adaptive TF requires an NWS sensor per node and an NWS system clique, as well as a single MPI_COMM_WORLD encompassing the whole configuration. In our case, the co-allocation of multiple nodes in different systems across distinct administrative domains to form the grid was achieved through LAM/MPI, but it can be achieved with any other grid-enabled MPI library.
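To make the single-round mechanics concrete, the following sketch shows a farmer-side distribution loop built on the MPI calls named above; the buffer layout, tags, and the use of MPI_Waitall (in place of the paper's MPI_Wait/MPI_Test loop) are our simplifications, not the library's actual protocol.

#include <mpi.h>
#include <stdlib.h>

/* Single-round farmer sketch: one send per worker carrying its whole
 * share alpha_i[w] of the workload, then non-blocking receives for the
 * results. Assumes rank 0 is the farmer and ranks 1..N the workers. */
void farmer_single_round(int N, const int alpha_i[],
                         const char *tasks, int task_sz,
                         char *results, int res_sz, MPI_Comm comm)
{
    MPI_Request *req = malloc((size_t)N * sizeof *req);
    long off = 0;

    /* Distribution: the farmer communicates with each worker only once. */
    for (int w = 0; w < N; w++) {
        MPI_Send(tasks + off * task_sz, alpha_i[w] * task_sz, MPI_BYTE,
                 w + 1, 0, comm);
        off += alpha_i[w];
    }

    /* Collection: post all receives up front so slow workers do not
     * serialise the gathering of results. */
    off = 0;
    for (int w = 0; w < N; w++) {
        MPI_Irecv(results + off * res_sz, alpha_i[w] * res_sz, MPI_BYTE,
                  w + 1, 1, comm, &req[w]);
        off += alpha_i[w];
    }
    MPI_Waitall(N, req, MPI_STATUSES_IGNORE);
    free(req);
}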

5. Experimental evaluation

In this section, we first introduce our case study, followed by the experimental results obtained on a departmental grid. To the best of our knowledge, we have tried to avoid common pitfalls in this evaluation [24] by:

• providing as much detail as possible on the experimental environment in order to enhance reproducibility;
• using a realistic system, open to any interactive jobs and system administration tasks;
• employing a real, representative computational cell biology application;
• calibrating the nodes without assuming any user estimate;
• using a sustainable workload;
• not running the experiments concurrently, in order to avoid contention; and
• utilising a representative number of nodes for scalability inferences.

5.1. Case study: a computational biology parameter sweep

A parameter sweep consists of a series of independent executions of a common function over a set of parameters in order to generate a set of results. Mainstream implementations take advantage of the intrinsic task parallelism, which presents virtually no inter-process communication, and have been successfully deployed employing software frameworks such as the AppLeS Parameter Sweep Template [25], the Nimrod/G broker [26], and the GridSim system [27]. Our system provides an alternative approach which exploits the foreknowledge provided by the parametric function and does not require application-dependent tuning.


Calcium regulation is a crucial cellular process which affects neuronal and cardiac muscle activity. Roughly speaking, calcium ions enter the cell through membrane channels, and the total absorption process can be represented as a set of points (states) of specific conductance, together with a series of transitions among them which depend on voltage and calcium concentration values [28]. This process can be modelled stochastically, defining a threshold based on voltage and time constraints, and aggregating individual calcium currents for a given channel population [29].

Because a cell possesses thousands of channels, simulating their stochastic behaviour implies processing a large number of random elements with different parametric conditions describing the associated currents, concentrations, base and peak depolarising voltages, and the time resolution of the experiment. Since the voltages and the peak duration can be varied without affecting the complexity, the parameter space can be explored while keeping the complexity constant at each run. This is particularly important for our divisible-workload premise. The actual model has been synthesised in a parametric algorithm where the number of channels and the time resolution can gradually regulate the model complexity on a per-experiment basis [30]. This algorithm effectively becomes f, the worker function for the adaptive TF. Each experiment generates two result files: a data file which records the calcium current values over time, and a GNUplot script to automatically produce graphs for these values.

5.2. Results

The reported results were obtained employing a departmental grid formed by two non-dedicated Beowulf clusters located across the University of Edinburgh, configured as shown in Table 1. In terms of software, all nodes ran Linux Red Hat FC3 with kernel 2.6, gcc 3.4.4, LAM/MPI 7.1.1, GSL 1.5, and NWS 2.10.1. All modules were compiled using the "-pedantic -ansi -Wall -O2" flags. It is important to mention that we have been able to co-allocate resources using the LAM/MPI capabilities, without any queue management system in place. All farmer-worker communication took place over non-dedicated Ethernet networks.

For this study, we have instantiated the parameter space with 960 experiments of similar complexity, i.e. α = 960, by varying the peak voltage, and have defined O to store the individual times for each experiment. The full instantiation is shown in Table 2. Initially, we implemented the sequential version of f, executed it on a dedicated node, and observed its performance under increasing load conditions. As expected, it degrades linearly as the system load increases.

Table 1
Beowulf clusters bw240 and bw530 hardware configuration

            bw240           bw530
Hardware    64 nodes        16 nodes
CPU         Intel P4        Intel Xeon
Memory      1 GB/node       2 GB/node
Network     2 × 100 Mb/s    1 × 100 Mb/s, 1 × 1 Gb/s Myrinet
BogoMips    3350–3555       3326–3359

Table 2
960-experiment parameter space (simulation time 0.1 s; 10 µs step)

Parameter             Fixed   Variable   Value
Number of channels      ·                10,000
Time resolution         ·                10,000
Base voltage            ·                −80 mV
Peak voltage                    ·        [−60, 60 mV]
Peak duration                   ·        0.06 s
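Purely as an illustration of how such a sweep maps onto the TF input vector I, the sketch below builds the α = 960 experiments of Table 2; the struct, its field names, and the signs of the voltage values (lost in extraction and reconstructed here) are our assumptions, not the paper's code.

/* Illustrative only: the 960-experiment parameter space of Table 2,
 * varying the peak voltage while the other parameters stay fixed. */
typedef struct {
    int    n_channels;   /* 10,000 (fixed)                           */
    int    n_steps;      /* 10,000 = 0.1 s simulated at 10 us (fixed) */
    double base_mv;      /* assumed -80 mV (fixed)                    */
    double peak_mv;      /* swept over the assumed range [-60, 60] mV */
    double peak_dur_s;   /* 0.06 s (fixed)                            */
} experiment;

enum { ALPHA = 960 };

void build_sweep(experiment I[ALPHA])
{
    for (int i = 0; i < ALPHA; i++) {
        I[i].n_channels = 10000;
        I[i].n_steps    = 10000;
        I[i].base_mv    = -80.0;
        I[i].peak_mv    = -60.0 + 120.0 * i / (ALPHA - 1);
        I[i].peak_dur_s = 0.06;
    }
}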


Fig. 4. Uniprocessor time summary (December 2005): execution times of the program without MPI, with MPI deal, and with MPI workqueue scheduling, on a dedicated node and under increasing system load (load = 1 to 5).

Fig. 5. Time comparison for the parameter sweep with 960 experiments using the dynamic deal (times-only, univariate, and multivariate) and the standard deal scheduling methods, for 6, 12, 24, and 48 workers (December 2005).


Then, we deployed a simple TF version on a one-farmer, one-worker dedicated configuration in order to compare it with the uniprocessor version. All entries represent the arithmetic mean, with a small variance, of a series of executions. All uniprocessor results are depicted in Fig. 4. The MPI version with standard deal scheduling, where the farmer assigns the 960 elements at once to the worker, performs roughly on a par with its uniprocessor counterpart (
