Performance models for desktop grids

Paolo Cremonesi and Roberto Turrin
Department of Computer Science - Politecnico di Milano, Italy
Via Ponzio, 34/5, I-20133 Milano, Italy
[email protected] - [email protected]

Keywords: desktop grid, ordered statistics, performance model, throughput, partitioning, replication

Abstract

Main characteristics of desktop grids are the large number of nodes and their heterogeneity. Application speedup on a large-scale desktop grid is limited by the heterogeneous computational capabilities of the nodes, which increase the synchronization overhead, and by the large number of nodes, which results in the serial fraction dominating performance. In this paper we present an innovative technique which can improve the throughput of traditional grid applications by merging job partitioning and job replication. We use analytical models based on ordered statistics for the performance analysis of desktop-based grid applications. The models describe the effects of resource heterogeneity, serial fraction and synchronization overheads on the application-level performance. Using the models we show how the proposed policies can be tuned with respect to the size of the grid in order to optimize the grid throughput.

1 INTRODUCTION

Desktop grids enable compute-intensive applications to run faster and more effectively. However, increasing the number of nodes that participate in a desktop grid does not automatically ensure increasing speedup and throughput for an application. The problem arises primarily because the large number of nodes in a desktop grid amplifies the inefficiencies usually associated with parallel applications, e.g., the serial fraction and the synchronization overheads (time needed to reach remote resources, scheduler working time, queuing time, and so on). Therefore, optimal performance is usually not obtained by maximum partitioning: a trade-off has to be found between partitioning and the grid overhead, which tends to grow with the number of tasks to deal with. In this work, we present a strategy to optimize the partitioning of jobs on a desktop grid infrastructure. We demonstrate that this strategy leads to a significant speed-up and to a substantial increase of the throughput of completed tasks with respect to a blind maximal partitioning strategy. The performance benefit in desktop grids comes mainly from partitioning the application into parallel tasks so that each task runs on a node independently.

However, in most cases job partitioning in desktop grids results in less than linear speedup because of the small task granularity with respect to the large synchronization overheads. When partitioning an application into a large number of parallel tasks, increasing the number of nodes may not contribute much to reducing the completion time because of the serial fraction of the application [1]. Moreover, within a desktop grid the computational capabilities of each node vary greatly. As a result, tasks running on different nodes of the grid will have a large range of completion times, increasing the overhead for task synchronization and data gathering. Such overhead increases with the number of nodes and may overcome the benefits of the parallelization [2]. Very often there exists an optimal number of nodes to be used in grid systems that minimizes the execution time [3]. Emerging concepts like job replication [4, 5] can help in achieving better performance. The basic idea of job replication in a desktop grid is the concurrent execution of multiple copies of the same job. If multiple copies of a job are executed on independent nodes, then the chance that at least one copy is completed during a specific period of time increases. As a result, the execution time is very likely reduced. Concurrent assignment of a job to multiple nodes guarantees that a particular, very slow, machine will not slow down the aggregate progress of a computation. By merging job partitioning and job replication, we obtain hybrid strategies which, in many cases, can be shown to reach better performance and availability with respect to a maximal partitioning strategy.

1.1 Related works

Several extensions to Amdahl’s law [1] have been proposed in order to better describe grid architectures. Taylor et al. [6] describe a framework for the performance analysis of grid applications. The framework adopts a modeling component based on the coupling parameter, a metric that quantifies the interaction between the computational kernels that compose an application. Muttoni et al. [3] apply Amdahl’s equations to the scalability analysis of grid applications. The analysis results in a relationship between the number of nodes in the grid and the ratio between application efficiency and architecture cost. All the above approaches have the advantage of expressing analytically the speedup in a closed form. However, none

of the above models take into account the grid overhead introduced by the partitioning of the tasks.

Task graph models are frequently used to evaluate the performance of grid applications when the program control structures can be represented by means of "series-parallel" graphs. Lee and Weissman [7] present a performance model for the analysis of grid-enabled network services, i.e., high-performance applications available on-line. The model deals with stateless data-parallel services and provides a heuristic for adaptive scheduling of service requests. Such models focus on allocating nodes to sequentially arriving tasks, while we are interested in a set of independent tasks being submitted in parallel to the grid infrastructure.

For more realistic models, queuing networks and Petri nets can be used. Many approaches describe multi-programmed and multi-tasked parallel systems executing a sequence of programs of similar task structure. The queuing network models are usually based on fork-join, open queuing networks, where a stream of series-parallel applications arrives in the system. Qin et al. [8] model the performance of cluster-based grid architectures where each cluster node is a shared-based multiprocessor. Bacigalupo et al. [9] describe a very simple queuing network model for the scalability analysis of e-commerce grid-based applications. Sun and Wu in [10] and Gong et al. in [11] describe a performance prediction and task scheduling system based on a queuing network model. The model can be applied to heterogeneous non-dedicated grid computing infrastructures (e.g., desktop-based grid computing). Li and Mascagni [12] present an interesting Petri net model for the performance analysis of grid applications when a computational replication technique is adopted to reduce task completion time and improve availability. The model assumes an unlimited number of computational nodes and ignores data gathering and synchronization overheads. Glatard et al. [13] introduce a model that takes into account the overhead originated by the variability of the task execution times. Their model, focused on production grids within enterprise networks, assumes an additive overhead. We will show in the following sections that a multiplicative overhead is better suited when modeling a desktop-based grid.

In conclusion, the above works do not seem to completely match our needs, as they make assumptions better suited for cluster grids and do not take into account the peculiar aspects of desktop grids, i.e., the large number of nodes, their heterogeneity and the synchronization overhead required for task scheduling and results gathering. The work by Kondo et al. [14, 15] is one of the few that focuses on desktop grids. The papers present measurements of a real-world desktop grid and evaluate three resource selection techniques through trace-driven simulations.

The paper is organized as follows. Section 2 presents the problem and defines the scale factor.

In Section 3 we show the scale factor for job partitioning and job replication. Then in Section 4 we extend the model to hybrid solutions using asymptotic order statistics. Finally, Section 5 presents different case studies.

Figure 1. Master-worker paradigm in a desktop grid.

2 ARCHITECTURE

In a desktop grid anyone can bring resources into the grid system, sharing them for the common goal of that grid [16]. The most well-known example, which is the original distributed computing facility of this kind, is represented by the various @home projects implemented over the BOINC infrastructure [17]. In a desktop grid, applications can be performed in the well-known master-worker paradigm [18]. The application is split up into many smaller work units; for instance, we could split the input data into lighter, independent data units that can be computed independently by the individual PCs, running the same executable but processing different input data. Local grid software runs in the background on such PCs and the owner does not need to take care of the grid activity of her computer. The central server of the grid runs the master program, which creates and distributes the tasks and processes the incoming partial results (Figure 1). These systems sustain multiple teraflops continuously by aggregating up to several millions of machines for solving a wide range of large-scale computational problems.

Computation inside a desktop grid can benefit from several techniques, such as job partitioning and job replication, referred to as allocation policies. A job corresponds to the computational work that needs to be executed; for instance, a job may represent an application. Job partitioning divides a job into tasks that can be executed independently and in parallel on a number of nodes to improve performance. A task is a partition of a job that is computationally lighter and therefore faster to execute. For instance, a task corresponds to a work unit (using the BOINC terminology). As a main drawback, processing nodes may require synchronization in order for the master to collect and assemble the final results.

The synchronization activity can be modeled with a fork-join queuing network [19], consisting of a set of identical delay service centers, each one representing a node. Referring to Figure 2, a job reaching the fork-join block instantaneously forks into n independent and identical tasks, each one assigned to one of the fork-join delay centers. Once a task is completed, it waits for all the other tasks at the join point. The job is considered completed when all the tasks in the fork-join block terminate their execution.

Figure 2. Fork-join queuing network model.

Job replication concerns the execution of the same identical job on a multitude of nodes at the same time; each copy of the original job is called a replica. This policy aims to achieve higher performance by running several replicas on more than one node and by taking the result from the first one that finishes. Replication can also be used to reach high reliability by waiting for a quorum of replicas providing consistent results; we will not treat this extension here.

The main objective of this work is to estimate the execution time of an application executed on a desktop grid with n nodes, evaluating different allocation policies able to enhance the performance, i.e., job partitioning, job replication and a combination of the two. Note that a faster execution time means a larger throughput. Let us assume the application has an average sequential execution time µ, which is the execution time expected when it is performed on a single node. The average execution time of the whole application on the grid can be defined as the function

T(n) = \frac{\mu}{S(n)}    (1)

where S(n) is a speed-up factor depending on the number of nodes n and on the allocation strategy. S(n) can be an aggregation of various factors. For example, the classical speed-up, well known in the literature as Amdahl's law [1], is due to executing partitions of the job, i.e., the tasks. In fact, splitting the job we obtain n smaller tasks, computationally lighter than the original job. Let µ̃ be the average service time of one task, which can be defined as the job service time µ, suitably scaled:

\tilde{\mu} = \mu \, \frac{1 + (n-1) f_s}{n}    (2)

where f_s is the serial fraction of the specific job as defined by Amdahl in [1]. The serial fraction represents the parallelization capability of an application. Thus, for such a simple model, we can state T(n) = µ̃, so that

S(n) = \frac{n}{1 + (n-1) f_s}    (3)
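As a quick illustration of equations (1)-(3), the following sketch (not from the paper; the parameter values are illustrative) computes the scaled task time and the resulting Amdahl speed-up for a few grid sizes.

# Illustrative sketch of equations (1)-(3); parameter values are made up, not from the paper.

def task_time(mu, n, fs):
    """Scaled average task service time, eq. (2): mu_tilde = mu * (1 + (n - 1) * fs) / n."""
    return mu * (1.0 + (n - 1) * fs) / n

def amdahl_speedup(n, fs):
    """Classical Amdahl speed-up, eq. (3): S(n) = n / (1 + (n - 1) * fs)."""
    return n / (1.0 + (n - 1) * fs)

mu, fs = 1.0, 0.05                       # sequential time and serial fraction (illustrative)
for n in (1, 10, 100, 1000):
    S = amdahl_speedup(n, fs)
    print(f"n={n:5d}  mu_tilde={task_time(mu, n, fs):.4f}  S(n)={S:.2f}  T(n)={mu / S:.4f}")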

Focusing on job partitioning, the scale factor must take into consideration Amdahl's speed-up, but also the overhead due to the synchronization activity among the n different nodes; we define such overhead O_n ≥ 1. O_n is statistically due to the non-determinism in task execution times. Thus, for job partitioning, we can state

S(n) = \frac{n}{1 + (n-1) f_s} \cdot \frac{1}{O_n}    (4)

With job replication there is no contribution due to partitioning and the consequent synchronization. In this case we define S(n) = 1/O_1, where the factor 0 < O_1 ≤ 1 models the improvement of running multiple copies of the same job and waiting for the first one completed.

In order to calculate O_1 and O_n we use ordered statistics. Let us consider a set of n independent and identically distributed (iid) variates, ordered such that

-\infty < X_1 < X_2 < \cdots < X_n < +\infty

Considering a desktop grid with n nodes, we suppose X_i to be the execution time of the i-th node. We assume that each random variable X_i has cumulative distribution function (cdf) F(x) and probability density function (pdf) f(x). We will show that the response time T(n) for a desktop grid adopting either job replication or job partitioning is equal to the expected value of the variate X_m

T(n) = E[X_m]    (5)

where m depends on the policy; for m = 1 we obtain the job replication response time, otherwise, for m = n, the job partitioning response time. The exact solution of (5) requires knowing the distribution function of X_m, whose pdf is:

f_m(x) = \frac{n!}{(n-m)!\,(m-1)!} \, F(x)^{m-1} \, [1 - F(x)]^{n-m} \, f(x)    (6)

so that (5) can be rewritten as

E[X_m] = \int_{-\infty}^{+\infty} x \, f_m(x) \, dx    (7)

In general, the integral in (7) is distribution dependent and can be difficult to solve.
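In practice, E[X_m] in (5)-(7) is easy to estimate by simulation even when the integral has no convenient closed form: sample n iid execution times, sort them, and average the m-th smallest value. A minimal sketch (the exponential node-time distribution below is only a placeholder, not an assumption of the paper):

import random

def expected_order_statistic(sample, n, m, runs=50_000):
    """Monte Carlo estimate of E[X_m] in eq. (5)/(7): the m-th smallest of n iid draws."""
    total = 0.0
    for _ in range(runs):
        draws = sorted(sample() for _ in range(n))
        total += draws[m - 1]            # m = 1: fastest node (replication); m = n: slowest (partitioning)
    return total / runs

# Placeholder distribution: exponential node times with mean 1.0 (illustrative only).
sample = lambda: random.expovariate(1.0)
print("replication,  E[X_1] ~", expected_order_statistic(sample, n=10, m=1))
print("partitioning, E[X_n] ~", expected_order_statistic(sample, n=10, m=10))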

In the next sections we use ordered statistics to model the performance of a desktop grid adopting either job partitioning or job replication. We then extend the model to hybrid policies by combining the two strategies and proving that, under particular conditions, they can outperform pure job partitioning and pure job replication.

3 ALLOCATION POLICIES

In this section, we investigate the problem analytically considering a uniform distribution for F(x), in order to demonstrate the relevance of the models. Although this distribution is not representative of all desktop grid applications, many of them [20, 4, 21, 22] have work units with an execution time that can be approximated with a uniform distribution. An on-going extension of this paper will analyze different distributions. According to these assumptions, we consider an application having a sequential execution time uniformly distributed with mean µ and interval width h. The pdf and cdf of the sequential execution time are, respectively:

f(x) = \frac{1}{h}, \qquad \mu - \frac{h}{2} \le x \le \mu + \frac{h}{2}    (8)

F(x) = \frac{x}{h} + \frac{1}{2} - \frac{\mu}{h}, \qquad \mu - \frac{h}{2} \le x \le \mu + \frac{h}{2}    (9)

We wish to study the execution time of such an application when executed on a grid with n nodes, either partitioned or replicated.
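For reference, the pdf (8) and cdf (9) translate directly into code; a small sketch (the values in the example call are illustrative):

def uniform_pdf(x, mu, h):
    """f(x) = 1/h on [mu - h/2, mu + h/2], eq. (8)."""
    return 1.0 / h if mu - h / 2 <= x <= mu + h / 2 else 0.0

def uniform_cdf(x, mu, h):
    """F(x) = x/h + 1/2 - mu/h on [mu - h/2, mu + h/2], eq. (9)."""
    if x < mu - h / 2:
        return 0.0
    if x > mu + h / 2:
        return 1.0
    return x / h + 0.5 - mu / h

print(uniform_cdf(1.0, mu=1.0, h=0.5))   # 0.5: half of the jobs finish before the mean time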

3.1 Job partitioning

We first focus our attention on job partitioning, which is the classical approach used in grids. With job partitioning, a job is split into n statistically identical tasks, each one computed on a different node. The job execution completes when all the n nodes terminate the execution of the assigned task. The problem is referred to as n-out-of-n ordered statistics. By splitting the job we obtain n smaller tasks, each one with a service time sampled from a uniform distribution with mean as defined in (2) and interval width suitably scaled as

\tilde{h} = h \, \frac{1 + (n-1) f_s}{n}    (10)

so that the variation coefficient

\frac{\sigma}{\mu} = \frac{1}{2\sqrt{3}} \, \frac{h}{\mu}

is constant, σ being the standard deviation. The scaling is motivated by Dobber et al. in [23]. We indicate with X_i the variates representing the execution time distribution of each task. Ordering such variates, the job execution time corresponds to the expected value of the variate X_n

T(n) = E[X_n] = \tilde{\mu} \, O_n, \qquad O_n \ge 1    (11)

where µ̃ and O_n are, respectively, the task expected execution time and the overhead factor previously defined. The execution time of the application is bounded by X_n, the execution time of the slowest node. Equation (6) becomes the maximum distribution function of the random variables X_i, with cdf F_n(x) and pdf f_n(x)

F_n(x) = F(x)^n, \qquad f_n(x) = n \, f(x) \, F(x)^{n-1}    (12)

We can rewrite (11) as

E[X_n] = \int_{-\infty}^{+\infty} x \, f_n(x) \, dx = n \int_{-\infty}^{+\infty} x \, f(x) \, F(x)^{n-1} \, dx    (13)

where f(x) and F(x) are taken, respectively, from (8) and (9), with mean and interval width scaled as in (2) and (10). The integral computation of (13) yields the (exact) solution:

T(n) = E[X_n] = \tilde{\mu} + \frac{\tilde{h}}{2} \, \frac{n-1}{n+1} = \frac{1 + (n-1) f_s}{n} \left( \mu + \frac{h}{2} \, \frac{n-1}{n+1} \right)    (14)

The factor O_n is thus equal to 1 + \frac{h}{2\mu} \, \frac{n-1}{n+1}.

3.2 Job replication

With job replication, a job is replicated into n replicas, each one statistically identical to the original job; each replica is sent to a node to be executed. The replica execution time distribution is that of the cloned job; thus the replica expected service time is µ. The execution completes when the fastest node terminates its execution. The problem is referred to as 1-out-of-n ordered statistics. The application execution time corresponds to the expected value of the variate X_1

T(n) = E[X_1] = \mu \, O_1, \qquad O_1 \le 1    (15)

where O_1 is the speed-up scale factor previously introduced. The minimum distribution functions of the variate X_1 are

F_1(x) = 1 - [1 - F(x)]^n, \qquad f_1(x) = n \, f(x) \, [1 - F(x)]^{n-1}    (16)

so that we can rewrite (15) as

E[X_1] = \int_{-\infty}^{+\infty} x \, f_1(x) \, dx = n \int_{-\infty}^{+\infty} x \, f(x) \, [1 - F(x)]^{n-1} \, dx    (17)

where f(x) and F(x) are again, respectively, the job service time pdf (8) and cdf (9). The integral computation of (17) yields the (exact) solution:

T(n) = E[X_1] = \mu - \frac{h}{2} \, \frac{n-1}{n+1}    (18)

Thus the factor O_1 is equal to 1 - \frac{h}{2\mu} \, \frac{n-1}{n+1}.
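A minimal numerical check of the closed forms (14) and (18): simulate the slowest of n scaled task times (partitioning) and the fastest of n replica times (replication) and compare against the formulas. This is only a verification sketch; the parameter values are illustrative.

import random

def partitioning_time(mu, h, fs, n):
    """Exact expected time under job partitioning, eq. (14)."""
    scale = (1.0 + (n - 1) * fs) / n
    return scale * (mu + (h / 2.0) * (n - 1) / (n + 1))

def replication_time(mu, h, n):
    """Exact expected time under job replication, eq. (18)."""
    return mu - (h / 2.0) * (n - 1) / (n + 1)

def simulate(mu, h, fs, n, runs=20_000):
    """Average the slowest of n scaled task times and the fastest of n replica times."""
    scale = (1.0 + (n - 1) * fs) / n
    mu_t, h_t = mu * scale, h * scale                # eq. (2) and (10)
    part = rep = 0.0
    for _ in range(runs):
        tasks = [random.uniform(mu_t - h_t / 2, mu_t + h_t / 2) for _ in range(n)]
        reps = [random.uniform(mu - h / 2, mu + h / 2) for _ in range(n)]
        part += max(tasks)                           # job partitioning waits for the slowest task
        rep += min(reps)                             # job replication takes the fastest replica
    return part / runs, rep / runs

mu, h, fs, n = 1.0, 0.5, 0.05, 50                    # illustrative values
sim_part, sim_rep = simulate(mu, h, fs, n)
print("partitioning:", round(partitioning_time(mu, h, fs, n), 4), "vs simulated", round(sim_part, 4))
print("replication :", round(replication_time(mu, h, n), 4), "vs simulated", round(sim_rep, 4))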

4 HYBRID POLICIES

In this section we consider again an application whose sequential execution time follows the uniform distribution defined in (8) and (9). Let us suppose we have a grid composed of n nodes. We now want to study the performance when the two speed-up techniques shown in Section 3 are used together on this architecture. That means performing either:

• partitioning of replica
• replication of task

In both cases we can refer again to the order statistics concepts concerning the maximum and minimum among n variates, each one representing the node service time. However, in order to analyze the two allocation policies just introduced, the expected execution times computed in (14) and (18) are not enough, because we also need to know their distribution functions. Since the distribution functions (12) and (16) are not trivial to treat and solve, we use asymptotic order statistics theory to calculate an approximate solution. The asymptotic order statistics approximation is exact when the number of random variables goes to infinity; indeed it represents a limiting attraction domain for the distribution function of order variates. A desktop grid typically involves a large number of nodes, so we assume that the asymptotic distribution correctly models the performance of such an architecture. Let us now analyze the two allocation policies separately.

4.1 Partitioning of replica

Partitioning of replica is performed by taking an application and replicating it into r replicas; each replica is in turn split into t tasks (Figure 3). Recalling that we have n nodes, it must hold that t · r ≤ n. First of all, we need to compute the execution time of each replica (and its distribution). In fact, the job is replicated into r replicas and we have to wait for the first replica to complete; that means calculating the expected execution time of the fastest among r replicas. Since a replica is composed of t tasks and it needs to wait for all the tasks to complete, the replica execution time corresponds to the execution time of the slowest among the t tasks.

Figure 3. Partitioning of replica policy.

Computing the distribution of the slowest task among t variates means calculating (12), where n = t, i.e., F_t(x) = F(x)^t. We use the asymptotic theory of the maximum [24], which shows that only three asymptotic distributions exist, to which most of the maximum distribution functions F_t(x) converge. If F_t(x) is such that, after a suitable linear transformation, a limiting distribution G(x) exists, i.e., there exist real normalizing constants a_t > 0 and b_t such that

F_t(a_t x + b_t) \xrightarrow{t \to \infty} G(x), \qquad F_t(x) \xrightarrow{t \to \infty} G\!\left( \frac{x - b_t}{a_t} \right)    (19)

then G(x) must be one of just three types [25], corresponding to the three asymptotes, namely

G^{(I)}(x) = \exp[-e^{-x}]

G^{(II)}(x; \alpha) = \begin{cases} 0 & x \le 0 \\ \exp[-x^{-\alpha}] & x > 0 \end{cases}    (20)

G^{(III)}(x; \alpha) = \begin{cases} \exp[-(-x)^{\alpha}] & x \le 0 \\ 1 & x > 0 \end{cases}

where α is a strictly positive real constant. Recall that the task service time follows a uniform distribution with mean µ̃ and interval width h̃. Using the normalizing constants a_t = h̃/t and b_t = µ̃ + h̃/2 in (19), with α = 1, it is straightforward to prove that the maximum among t uniformly distributed variates converges, as t tends to infinity, to the third limiting distribution in (20), G^{(III)}:

F_t(x) \xrightarrow{t \to \infty} \begin{cases} \exp\!\left[ \frac{t}{\tilde{h}} \left( x - \tilde{\mu} - \frac{\tilde{h}}{2} \right) \right] & x < \tilde{\mu} + \tilde{h}/2 \\ 1 & x \ge \tilde{\mu} + \tilde{h}/2 \end{cases}    (21)

Note that, since the normalizing constants a_t and b_t defined in (19) depend on the number of tasks t to be synchronized, the term t occurs on both sides of the limit (21).

The service times of the replicas in Figure 3 will be distributed according to (21). The job average execution time T(n) will be the expected execution time of the fastest replica, i.e., the expectation of the minimum among r variates distributed as (21). The expectation of the minimum can be computed using (17), where n = r and F(x) is the cdf F_t(x) described in (21). By analytically solving the integral (17) we obtain:

T(n) = \tilde{\mu} + \frac{\tilde{h}}{2} - \frac{\tilde{h}}{t} H_r = \frac{1 + (t-1) f_s}{t} \left( \mu + \frac{h}{2} - \frac{h}{t} H_r \right)    (22)

where H_r = \sum_{i=1}^{r} \frac{1}{i} is the r-th harmonic number. Referring to (1), the speed-up factor S(n) is

S(n) = \frac{t}{1 + (t-1) f_s} \left[ 1 + \frac{h}{\mu} \left( \frac{1}{2} - \frac{H_r}{t} \right) \right]^{-1}    (23)

Figure 4. Comparison between approximated and exact job expected execution time with the partitioning of replica policy.

Figure 4 shows a comparison between (22) (solid line) and the real job expected execution time (dashed line), obtained numerically as the expectation of a variate following the cdf 1 − (1 − F(x)^t)^r, where F(x) is the uniform distribution specified in (9), with mean µ̃ and interval width h̃ as defined, respectively, in (2) and (10). Fixing the number of nodes at n = 100, we plot the job expected execution time versus the number of partitions t. As expected, since the asymptotic order statistics theory is exact when the number of partitions tends to infinity, (22) is highly accurate for large values of t.
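The comparison of Figure 4 can be reproduced directly: evaluate the asymptotic approximation (22) and, numerically, the exact expectation of a variate with cdf 1 − (1 − F(x)^t)^r. A sketch under illustrative parameter values:

def hybrid_approx(mu, h, fs, t, r):
    """Asymptotic approximation (22) for the partitioning of replica policy."""
    scale = (1.0 + (t - 1) * fs) / t
    mu_t, h_t = mu * scale, h * scale
    H_r = sum(1.0 / i for i in range(1, r + 1))
    return mu_t + h_t / 2.0 - (h_t / t) * H_r

def hybrid_exact(mu, h, fs, t, r, steps=100_000):
    """Expectation of a variate with cdf 1 - (1 - F(x)^t)^r, F uniform as in (9), scaled by (2), (10)."""
    scale = (1.0 + (t - 1) * fs) / t
    mu_t, h_t = mu * scale, h * scale
    a, b = mu_t - h_t / 2.0, mu_t + h_t / 2.0
    dx = (b - a) / steps
    tail = 0.0                                        # E[X] = a + integral over [a, b] of P(X > x) dx
    for i in range(steps):
        x = a + (i + 0.5) * dx
        F = (x - a) / (b - a)
        G = 1.0 - (1.0 - F ** t) ** r
        tail += (1.0 - G) * dx
    return a + tail

mu, h, fs, n = 1.0, 0.5, 0.01, 100                    # illustrative values, n = 100 as in Figure 4
for t in (5, 20, 50, 100):
    r = n // t
    print(f"t={t:3d} r={r:3d}  approx={hybrid_approx(mu, h, fs, t, r):.4f}"
          f"  exact={hybrid_exact(mu, h, fs, t, r):.4f}")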

4.2 Replication of task

Replication of task is performed by taking an application and splitting it into t tasks; each task is in turn replicated into r replicas (Figure 5). Again we must have t · r ≤ n. In the first place we need to compute the execution time of each task (and its distribution). In fact, the job is composed of t tasks and it requires waiting for all the tasks to be computed; that means calculating the expected execution time of the slowest among t tasks. Since a task is replicated into r copies and it needs to wait only for the first copy to complete, the task execution time corresponds to the execution time of the fastest among the r replicas.

Figure 5. Replication of task policy.

In order to compute the distribution of the fastest replica, we can utilize the asymptotic order statistics of the minimum to compute (16), where n = r, i.e., F_1(x) = 1 − [1 − F(x)]^r. The asymptotic distributions G(x) of largest values are linked by the symmetry principle to the asymptotic distributions L(x) of the smallest values. Consequently, if F_1(x), after a suitable linear transformation, has a limiting distribution L(x), i.e., if there exist real constants a_r > 0 and b_r such that

F_1(a_r x + b_r) \xrightarrow{r \to \infty} L(x), \qquad F_1(x) \xrightarrow{r \to \infty} L\!\left( \frac{x - b_r}{a_r} \right)    (24)

then L(x) can be just one of the three following types:

L^{(I)}(x) = 1 - \exp[-e^{x}]

L^{(II)}(x; \alpha) = \begin{cases} 1 - \exp[-(-x)^{-\alpha}] & x \le 0 \\ 1 & x > 0 \end{cases}    (25)

L^{(III)}(x; \alpha) = \begin{cases} 0 & x \le 0 \\ 1 - \exp[-x^{\alpha}] & x > 0 \end{cases}

where α is again a strictly positive real constant. Let us note that each replica execution time follows a uniform distribution with mean µ̃ and interval width h̃, since it is a replication of a task. We can prove that the minimum among r uniformly distributed variates converges, as r tends to infinity, to the third limiting distribution in (25). In fact, with α = 1 and using the normalizing constants a_r = h̃/r and b_r = µ̃ − h̃/2 in (24), the fastest variate converges to L^{(III)}:

F_1(x) \xrightarrow{r \to \infty} \begin{cases} 0 & x \le \tilde{\mu} - \tilde{h}/2 \\ 1 - \exp\!\left[ \frac{r}{\tilde{h}} \left( \tilde{\mu} - \frac{\tilde{h}}{2} - x \right) \right] & x > \tilde{\mu} - \tilde{h}/2 \end{cases}    (26)

Since the task service time is distributed as the fastest among r replicas, we can state that the service time of each task will be distributed according to (26). Note that such a distribution is a shifted exponential distribution function. The job average execution time is the average execution time of the slowest task, i.e., the expectation of the maximum among t variates distributed as (26). By analytically solving (13), with F(x) equal to the F_1(x) described in (26) and n = t, we obtain the job expected execution time:

T(n) = \tilde{\mu} - \frac{\tilde{h}}{2} + \frac{\tilde{h}}{r} H_t = \frac{1 + (t-1) f_s}{t} \left( \mu - \frac{h}{2} + \frac{h}{r} H_t \right)    (27)

Referring to (1), the speed-up factor S(n) is

S(n) = \frac{t}{1 + (t-1) f_s} \left[ 1 - \frac{h}{\mu} \left( \frac{1}{2} - \frac{H_t}{r} \right) \right]^{-1}    (28)
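The closed forms (22) and (27) make it easy to compare the two hybrid policies over different splits of the same node budget. The sketch below tabulates both for a few (t, r) pairs with t · r ≤ n; the parameter values are illustrative, and the asymptotic formulas are less accurate when t or r is small.

def harmonic(k):
    """H_k = sum_{i=1}^{k} 1/i."""
    return sum(1.0 / i for i in range(1, k + 1))

def partition_of_replica(mu, h, fs, t, r):
    """Approximate expected execution time, eq. (22): r replicas, each split into t tasks."""
    scale = (1.0 + (t - 1) * fs) / t
    return scale * (mu + h / 2.0 - (h / t) * harmonic(r))

def replication_of_task(mu, h, fs, t, r):
    """Approximate expected execution time, eq. (27): t tasks, each replicated r times."""
    scale = (1.0 + (t - 1) * fs) / t
    return scale * (mu - h / 2.0 + (h / r) * harmonic(t))

mu, h, fs, n = 1.0, 0.5, 0.01, 100                  # illustrative values
for t in (2, 5, 10, 20, 50):
    r = n // t                                      # use the whole node budget: t * r <= n
    por = partition_of_replica(mu, h, fs, t, r)
    rot = replication_of_task(mu, h, fs, t, r)
    print(f"t={t:3d} r={r:3d}  partitioning of replica={por:.4f}  replication of task={rot:.4f}")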

5 CASE STUDIES

Job partitioning and job replication are particular cases of hybrid policies. Hybrid policies partially rely on asymptotic theory, which is a good approximation when the number of variates increases. Thus, for the two extreme cases (job partitioning and job replication), we will use the exact results obtained in Section 3. We suppose we have n grid nodes and we want to determine the best allocation policy for the application previously presented, whose sequential execution time follows a uniform distribution as defined in (8) and (9). For the sake of simplicity, where not otherwise specified, let us consider µ = 1, thus constraining the interval width h to the range (0, 2].

5.1 Job replication vs. job partitioning

In this section we compare job replication and job partitioning. Depending on the strategy, the expected execution times of the application are given, respectively, by (18) and (14). Once the variance of the job service time is known, i.e., h is fixed, the optimal choice between job replication and job partitioning depends upon the serial fraction fs and the number n of nodes used. We are now interested in finding for which values of n replication is preferable with respect to partitioning, and vice versa. We distinguish three cases, depending on the application serial fraction. With a very low serial fraction, i.e., when it holds


fs
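As a rough numerical illustration of this comparison, the following sketch evaluates (14) and (18) over a range of grid sizes and reports, for a few illustrative serial fractions, the first n (if any) at which replication becomes preferable to partitioning. The specific values of h and fs are not from the paper.

def partitioning_T(mu, h, fs, n):
    """Job partitioning expected time, eq. (14)."""
    return (1.0 + (n - 1) * fs) / n * (mu + (h / 2.0) * (n - 1) / (n + 1))

def replication_T(mu, h, n):
    """Job replication expected time, eq. (18)."""
    return mu - (h / 2.0) * (n - 1) / (n + 1)

mu, h = 1.0, 0.5                                   # mu = 1 as in the text; h is illustrative
for fs in (0.3, 0.65, 0.9):                        # illustrative serial fractions
    better = [n for n in range(2, 201)
              if replication_T(mu, h, n) < partitioning_T(mu, h, fs, n)]
    first = better[0] if better else None
    print(f"fs={fs}: replication first preferable at n={first}")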
