Preemptive Task Execution and Scheduling of Parallel Programs in Message-Passing Systems

Lin Huang and Michael J. Oudshoorn
Department of Computer Science, The University of Adelaide, Adelaide, SA 5005, Australia
{huang,[email protected]}

Abstract

The scheduling of tasks within a parallel program onto the underlying available processors has been studied for some time. To date, solutions to this problem generally assume that communication occurs only at the start or end of each parallel task, i.e., a child task can only start its execution when all its parent tasks have completed and sent data to it. This is termed "non-preemptive task scheduling". This paper examines the problem of the preemptive parallel program, which is represented by a preemptive task model. A new preemptive scheduling algorithm, named PET, is also proposed. Experiments are conducted to illustrate the performance achievable through preemptive task execution and scheduling.

Keywords: preemptive task scheduling, preemptive task execution, parallel programming.

1 Introduction

The scheduling of tasks within a parallel program aims to optimize system performance via the efficient arrangement of the tasks onto the underlying available processors within the parallel system. The task scheduling problem can be decomposed into four major aspects: the scheduling objective is the performance measure to be optimized; the task model portrays the constituent tasks and their interconnection relationships; the processor model describes the architecture of the underlying available parallel processors; and the scheduling algorithm produces a scheduling policy to distribute tasks onto the available processors. In this paper, system performance is evaluated by the parallel execution time (abbreviated PT) of the parallel tasks, i.e., the completion time of the final task of the parallel program. The target parallel architecture is regarded as composed of identical processors fully interconnected via identical networks; processors communicate with each other via message-passing. This paper concentrates on the study of the task model and the scheduling algorithm for a homogeneous distributed system. The task model is illustrated by a directed acyclic graph (DAG) in which nodes represent parallel tasks and edges represent precedence relationships between tasks; the model is weighted to reflect the execution time of the parallel tasks. The scheduling algorithm takes as input the task model and the processor model, and produces a scheduling policy which can be described by a chart with two axes: the processor axis, representing all available processors in the parallel system, and the time axis, illustrating the execution order of tasks on each processor. The task scheduling problem for parallel and distributed systems has been studied for some time [2, 6], and it has been proved that no optimal policy can be obtained within polynomial computation time [15].
Various heuristics (algorithms) have been proposed with the aim of producing an efficient scheduling policy for parallel tasks [3, 10, 13, 14]. A more detailed survey is found in [1]. The scheduling algorithm determines the attributes of tasks and processors to be considered within the task and processor models,

respectively. The most common task attributes addressed are task computation time and task communication time, both estimated prior to execution [11]. An additional attribute, execution probability, associated with task interconnections, is introduced in [12]. To date, non-preemptive task scheduling has been the primary focus in this research area. Consequently, it is a common assumption that task communication occurs only at the beginning or end of a task, and that a task cannot commence execution until it receives all required data; the task then executes to completion without interruption and transmits all output data to its dependent (child) tasks. In reality, however, task spawning and message-passing between tasks do not occur only at the start or end of a task: such operations may take place at any point within the task, including within loops and conditional statements. This paper studies two issues involved in task preemption: preemptive task execution and preemptive task scheduling. Preemptive task execution focuses on the investigation of performance improvement due to runtime preemption; it is assumed that all tasks have already been scheduled onto the underlying available processors, which are ready to run. Preemptive execution may either improve or degrade system performance, and this paper proposes an approach by which a performance improvement can be guaranteed. Preemptive task scheduling, on the other hand, concerns not only the scheduling algorithm itself but also related issues such as the task model. This paper introduces a preemptive task model to handle preemption between tasks. It also presents a new scheduling algorithm, named PET, to deal with the preemptive task model. In this paper, parallel program execution is assumed to be "deterministic", that is, application users invoke the program in the same way each time.
Consequently, the task model, including its task interconnection structure and task attribute values, remains unchanged between different executions; it can therefore be known precisely prior to execution by running the program a number of times. For the case where the task model differs between program executions, a system named ATME [12] has been developed which adaptively establishes and varies the task model based on past execution patterns, so as to reflect the model variation. These assumptions simplify the model and allow its properties to be examined and compared with non-preemptive scheduling. This paper is organized as follows. Section 2 illustrates the preemptive task model adopted in scheduling the parallel program. An example of preemptive task execution is presented in Section 3. Preemptive task execution and preemptive task scheduling are discussed in Sections 4 and 5 respectively. Section 6 provides experimental results on preemptive execution and scheduling. The paper is concluded in Section 7.

2 Preemptive Task Model

The task model of a preemptive parallel program is represented by a weighted directed acyclic graph, as shown in Figure 1. The model can be formally defined by G = (T, E, Cu, Cm, Ts, Te), where:

- T: the set of tasks in the application program.
- E: the set of task interconnections in the application program.
- Cu: the set consisting of the computation time of each task in T, i.e., the time required by each task in the program, should it run. Generally speaking, task computation time depends on two factors: the volume of task source code to be executed and the processing speed of the processor on which the task runs. In this paper it is assumed that the processing speed of all processors is identical, so the task computation time can be regarded as proportional to the volume of task source code to be executed.

Figure 1. The preemptive task model: nodes are tasks (S, A, B, C, D, E) weighted by computation time; each edge is labelled with a (communication time, preemption start point) pair.

In the case of heterogeneous systems, that is, where the processing speed and/or architecture of the processors in the distributed system is not identical, the computation time of a task can be represented as a vector rather than a scalar value, with each cell of the vector describing the computation time of the task on a particular processor. As a result, in the heterogeneous case Cu is an m × n matrix, where m is the number of tasks and n is the number of available processors. Furthermore, if there exists a performance ratio between processors, the representation of task computation time on different processors can be further simplified into the product of two matrices: one representing the computation time (a scalar value) of each task on a "standard" processor, and the other representing the processing-speed ratio of each processor against the "standard" processor.

- Cm: the set of communication attributes of each task interconnection (between a parent task and a child task) in E. Each communication attribute is a pair (communication time, preemption start point), discussed below:
  - The communication time represents the total time taken to transfer data between a parent and its child task, if such communication occurs. It can be calculated by multiplying the communication data volume by the network data-transfer rate. For the sake of simplicity, it is assumed that one packet of data is transferred in one unit of time and that all networks within the system are identical. Hence the magnitude of data transferred between parent and child task is directly proportional to the time taken for the communication. In the case of heterogeneous systems, the communication time between a pair of interconnected tasks, like the task computation time, can be represented by a matrix, with each cell giving the communication time between a pair of tasks resident on particular processors. Cm is then a matrix rather than a set of scalar values. This merely adds to the complexity of calculating and representing the communication attribute, without affecting the results and conclusions drawn in this paper.
  - The preemption start point represents the point at which message transmission from a (parent) task to a child (dependent) task may first commence. It is defined as a ratio given by the following formula:


V(p, c) = (CT(c) − CT(p)) / U(p)

where:

- p, c: a parent task p and its child task c.
- CT(t): the start time of task t.
- U(t): the computation time of task t.
- V(p, c): the preemption start point at which messages within the (parent) task p are transmitted to its child task c.

It is assumed that all subsequent communication between the parent and child tasks takes place synchronously with no delay in either task. This merely simplifies the model in the case where the communication takes place within a loop and therefore occurs several times between the two tasks.

- Ts: the start task set. It is assumed that each program has only one start task.
- Te: the exit task set. It is assumed that each program has only one exit task.

An example of how to calculate the preemption start point of a typical intertask communication follows. Arrange the execution of all tasks along a unified time axis; each task commences and completes at some point along this axis. The execution of two tasks may overlap due to the existence of multiple processors and communication placed in the middle of the (parent) task. Suppose the parent task commences execution at time t = 10 and its computation time is 20 units, so the parent task completes at t = 30. If the child task commences execution at time t = 16 (note that the communication between the parent and the child task can be embedded inside a loop or a conditional statement; here the commencement time of the child task is measured in terms of execution time, so no further attention is given to where the communication is placed), then the preemption start point of the child task within its parent task is (16 − 10)/20 = 30%. This attribute is meaningful only when data transmission does not occur solely at the end of the parent task's execution.
When this attribute is omitted from the task model, it implies the assumption that task execution is non-preemptive: data reception occurs at the beginning of a task while data transmission takes place at the end, i.e., V(p, c) = 1 for all interconnected tasks p and c. In the preemptive task model, message transmission may occur at any location within the task. For the sake of simplicity, it is still assumed that data reception takes place before any task processing actually begins. That is, a task does not commence execution until it receives all required data; however, it can transmit data at any time before its execution completes. The scheduling algorithm employed must choose how to deal with the preemptive task model. On one hand, the preemptive task model can be supplied to an existing scheduling algorithm without change, resulting in an allocation of tasks to processors that assumes the tasks exhibit non-preemptive behaviour. Alternatively, a new scheduling algorithm can be developed based on the preemptive task model to deal with preemption in the algorithm itself. The aim is to achieve a good scheduling policy so that a parallel program obtains high performance. As stated, the preemptive task model assumes no variation between program executions. Such a model can therefore be built precisely by executing the program a number of times and capturing runtime data for task attributes such as task computation time, task communication time, and preemption start point. In the case where task runtime operations (such as task spawn, data sending and data reception) may take place conditionally, as stated in [8, 12], each task interconnection requires one more attribute, known as execution probability, and the model becomes a conditional and preemptive task model.
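To make the model concrete, the graph and the preemption start point can be sketched in Python. This is a minimal illustration, not code from the paper; the names `TaskGraph` and `preemption_start_point` are invented for the example.

```python
from dataclasses import dataclass, field

@dataclass
class TaskGraph:
    """A weighted DAG G = (T, E, Cu, Cm, Ts, Te) for the preemptive task model."""
    comp_time: dict = field(default_factory=dict)   # Cu: task name -> U(t)
    edges: dict = field(default_factory=dict)       # E/Cm: (parent, child) -> (comm time, V)
    start_tasks: set = field(default_factory=set)   # Ts
    exit_tasks: set = field(default_factory=set)    # Te

def preemption_start_point(parent_start, parent_comp_time, child_start):
    """V(p, c) = (CT(c) - CT(p)) / U(p): the fraction of the parent's execution
    already completed when data may first flow to the child."""
    return (child_start - parent_start) / parent_comp_time

# Worked example from Section 2: the parent starts at t = 10 with U(p) = 20,
# and the child starts at t = 16, so V(p, c) = (16 - 10) / 20 = 0.3 (30%).
print(preemption_start_point(10, 20, 16))  # 0.3
```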

Figure 2. Performance of Figure 1 when preemption is (a) prohibited (PT = 59) and (b) permitted (PT = 49) during execution. Tasks S, A, B, C, D and E are scheduled on processors P1, P2 and P3; shaded regions indicate processor idle time.

3 An Example of Preemptive Task Execution

This section shows, through an example, that system performance can be improved if preemption is permitted at runtime, i.e., if preemptive task execution (PTE) is supported. Message-passing operations in this situation are not restricted to the start and end of the task (as is common in non-preemptive task execution, denoted NPTE), but may take place at any point in the task. In the example below, it is assumed that a task occupies computing resources until it completes its execution, i.e., no interruption of tasks takes place during execution; each task is regarded as an atomic execution unit. It is also assumed that initial data reception occurs prior to execution of the task, so tasks are not suspended waiting for data arrival. This section compares the performance of preemptive and non-preemptive task execution. Consider the preemptive task model shown in Figure 1. System performance is measured by the parallel execution time (denoted PT). The parallel tasks are scheduled onto three fully-connected identical processors as shown in Figure 2(a). This policy is obtained by applying CET, a conditional task scheduling algorithm discussed in [7], with all execution probability values (as required by CET) set to 1. Runtime performance under non-preemptive task execution is PT = 59, as displayed in Figure 2(a). On the other hand, if communication (as well as task spawning) is permitted to occur at any point of the task, i.e., preemptive task execution, then a child task may commence execution before its parent task(s) complete. The preemption start point of each child task within its parent task is shown in the preemptive task model (Figure 1). In this PTE situation, with the same scheduling policy as in the NPTE situation of Figure 2(a), the performance is PT = 49, as displayed in Figure 2(b).
As seen, in this example the performance is improved by about 17% ((59 − 49)/59). This illustrates that better performance can be achieved if intertask communication is triggered as early as possible in the parent task. It remains to be seen (Section 5) how a customized scheduling algorithm can further improve performance.
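The improvement figure can be checked with a one-line calculation. This is a trivial sketch; `relative_improvement` is not a function from the paper.

```python
def relative_improvement(npte_pt, pte_pt):
    """Relative reduction in parallel execution time (PT) when moving from
    non-preemptive (NPTE) to preemptive (PTE) execution."""
    return (npte_pt - pte_pt) / npte_pt

# Figure 2: PT falls from 59 (preemption prohibited) to 49 (permitted).
print(f"{relative_improvement(59, 49):.0%}")  # 17%
```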


4 Preemptive Task Execution

In preemptive task execution, the tasks of the parallel program are distributed and arranged, in order of execution commencement, on the underlying available host processors according to a pre-determined scheduling policy. When tasks are actually submitted for execution, preemption is permitted both on the same and on different processors, depending on the strategy used for task management. A typical feature of preemptive task execution is that the pre-determined scheduling policy is not altered while the program is executing; only the execution sequence and commencement times of tasks on each host processor may vary, due to variation in the times at which messages are transmitted. This section investigates whether any performance improvement can be achieved through PTE and, furthermore, how to achieve such an improvement. It studies two strategies for handling preemptive task execution (PTE): one strategy (named Pα) permits preemption between tasks on the same processor; the other (named Pβ) permits communication to take place at any point in a task. Section 4.1 illustrates two categories of preemptive task execution, namely α-preemption (referring to preemption in which the execution of a task is interrupted by another task on the same processor) and β-preemption (referring to preemption in which a child task commences prior to the completion of its parent task(s)). Performance gains brought about by the Pα and Pβ strategies are elaborated in Sections 4.2 through 4.5. The efficiency discrepancy between PTE and non-preemptive task execution (NPTE) is studied in Section 4.6.

4.1 Two Strategies for Preemptive Execution

Recall the usual definition of non-preemptive task execution, as adopted in [3, 13, 16, 17]: a task does not commence execution until it receives all required data; the task then executes without interruption until completion; finally, the task transmits all necessary data to its child tasks. That is to say, all data communication occurs either at the beginning or at the end of each task. Correspondingly, this paper interprets preemptive task execution as encompassing the following two aspects:

- The execution of a task may be interrupted by another task which has been distributed onto the same processor. This is termed α-preemption. Whether or not α-preemption occurs depends on the job scheduling strategy of each processor.
- Message-passing operations, as well as task spawning, may take place at any point in the task. Consequently, the commencement of a child task is not delayed until after the completion of all its parent tasks. This is termed β-preemption.

It is assumed that the job scheduling algorithm of each host processor is identical and that it arranges the execution of each assigned task in the sequence determined by the scheduling policy obtained beforehand. It is also assumed that an executing task can only be interrupted, at a point of data transmission, by another task which is ready to execute on the same processor. For the sake of simplicity, and without loss of generality, it is further assumed that all data reception operations occur at the beginning of the task, i.e., a task does not start to execute until it receives all required data. Furthermore, all data transmission operations of a task are gathered together rather than scattered throughout the task, so that a task can be interrupted at most once. Corresponding to the two aspects of preemptive execution stated above, two strategies, named Pα and Pβ respectively, are proposed to manage task execution in which preemption is allowed. α-preemption and β-preemption may reciprocally lead to each other. The Pβ strategy may result in α-preemption when the execution start time of a task is advanced significantly, so that this task may acquire CPU resources from the task currently executing. Conversely, the Pα strategy may cause β-preemption in a similar way: the variation in execution sequence among tasks on the same host processor changes the actual data transmission times within a task. In the following sections, the performance gain of preemptive task execution under the Pα and Pβ strategies is studied. System performance is measured by the parallel execution time (PT), defined as the wall-clock completion time of the final task of the program. From the perspective of processor occupation, PT equals the occupied time (task execution time plus processor idle time) of the processor on which the final (exit) task of the program is allocated. Hence, this paper focuses on the performance variation on a particular processor, so as to obtain the gain in system performance through PTE by examining the longest elapsed wall-clock time on that processor.

4.2 Processor Performance Φ(p)

The system performance of the parallel program, denoted Φ, is measured by the completion time of the exit task. Define the processor performance Φ(p) of a processor p as the completion time of the last task distributed onto p; system performance Φ is then formally represented by:

Φ = max_{p ∈ P} Φ(p)

where P is the set of all available processors in the system. An improvement in processor performance on a particular processor does not necessarily yield a gain in the system performance of the entire parallel program, which is the maximum value among all Φ(p). On the other hand, system performance can only be improved by enhancing the processor performance Φ(p) within the parallel system. Preemptive task execution affects system performance Φ by varying the processor performance Φ(p) of the available processors (especially those with maximal task completion time). For each processor in the parallel system, its performance is measured as the sum of the execution times of all tasks assigned to it plus the processor idle time. It is assumed that task execution time is determined solely by the task's source code, and that a task is composed of a number of atomic segments separated by message-passing operations; therefore, once a task is distributed onto a processor, its execution time is fixed. The processor idle time, on the other hand, is determined by the time each task on the processor waits for data; it can be reduced by an appropriate scheduling policy and execution strategy. On each processor, the two strategies Pα and Pβ change the system performance of the parallel program by altering the idle time on that processor. In this paper, it is conjectured that α-preemption incurs context-switching overhead between tasks (or task segments); α-preemption also causes variation in data-transmission time stamps due to interrupt handling between tasks.
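The definition of system performance as the maximum per-processor completion time can be sketched directly. This is an illustrative fragment with invented processor names and completion times.

```python
def system_performance(processor_completion):
    """Phi = max over p in P of Phi(p): system performance is the largest
    per-processor completion time; the exit task's processor attains it."""
    return max(processor_completion.values())

# Hypothetical per-processor completion times Phi(p):
phi = {"P1": 42.0, "P2": 59.0, "P3": 51.5}
print(system_performance(phi))  # 59.0
```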
On the other hand, the Pβ strategy is regarded as advancing the commencement time of the child task relative to non-preemptive task execution, owing to the relaxation of the constraints on where communication operations may take place. The Pβ strategy may also trigger α-preemption. On the whole, the following situations may lead to a change in processor performance on each processor, and therefore in system performance, when both the Pα and Pβ strategies apply to task execution. Two tasks Ti and Tj, allocated to the same processor p, are used to illustrate the issues.

- Situation 1: the execution of task Ti is interrupted by another task Tj while Ti is performing data transmission; that is, α-preemption occurs between Ti and Tj. In this case, the start time of Tj must be earlier than the finish time of Ti, and there is no processor idle time between Ti and Tj. The performance variation in this situation depends on the overhead incurred by context switching between tasks. It is denoted PAj(p), indicating that execution is interrupted by another task Tj on some processor p in the parallel system. Note that the performance measures introduced here and below all refer to a particular processor, so p is omitted in the following discussion.
- Situation 2: the execution start time of a task Ti is advanced due to early transmission of data from a parent task, as a result of α-preemption or β-preemption. The corresponding performance achievements are represented by PAe and PAβ respectively. The earlier start of Ti may also cause preemption between Ti and the currently executing task on the same host processor, which is covered by the discussion of PAj(p) in Situation 1.
- Situation 3: the execution start time of a task Ti is postponed due to the delayed arrival of required data from parent tasks. This may occur when execution is interrupted between tasks, i.e., under α-preemption. This performance achievement is denoted PAd.

From the above discussion, the performance change incurred by the Pα strategy is composed of three parts:

PAα(p) = PAj + PAe + PAd

Performance achievement on a processor p, denoted Δ(p), is measured as the difference in processor performance between non-preemptive and preemptive execution. A positive value of Δ(p) indicates that preemptive execution performs better than non-preemptive execution; a negative value indicates performance degradation compared to the non-preemptive case. The performance gain/loss on processor p can be calculated by:

Δ(p) = Σ_{ti,tj ∈ L(p)} PAα(p, ti, tj) + Σ_{ti,tj ∈ L(p)} PAβ(p, ti, tj)
     = Σ_{ti,tj ∈ L(p)} PAj(p, ti, tj) + Σ_{ti,tj ∈ L(p)} PAe(p, ti, tj) + Σ_{ti,tj ∈ L(p)} PAd(p, ti, tj) + Σ_{ti,tj ∈ L(p)} PAβ(p, ti, tj)

where L(p) is the set of tasks assigned to processor p. The following notation is introduced prior to the discussion of processor performance gain/loss in the various situations:

- S1(t, p): the earliest possible start time of task t on processor p at runtime, ignoring the earliest time at which p becomes available. Similarly, F(t, p) is the earliest finish time of task t on processor p at runtime, and U(t, p) is the actual execution time of task t on processor p.
- H(p): the context-switch time of processor p. It arises when task execution is interrupted by another task on the same processor and the processor must change context.
- PA(p, ti, tj): the general performance achievement (the four specific kinds are listed above) on processor p owing to preemptive execution between tasks ti and tj.

In Sections 4.3 through 4.5, the performance achievement in Situations 1, 2 and 3 is elaborated for the case where both α-preemption and β-preemption take place in task execution, that is, where both the Pα and Pβ strategies operate at runtime. The discussion is followed by a performance comparison between the different strategies for controlling preemptive task execution.

4.3 Performance Achievement PAj

As aforementioned, for each processor p, PAj(p) is incurred when the job scheduling strategy of p allows α-preemption to take place, i.e., when the execution of a task can be interrupted by another task on p. PAj(p) is measured as the difference, relative to non-preemptive execution, in the completion time of the last task on processor p due to task interruption. The magnitude of PAj(p) is determined by the number of context switches between neighbouring tasks on the same processor.

Figure 3. (a) Non-preemptive execution and (b) α-preemption on a processor p.

Figure 3 shows task execution on a processor p when α-preemption between tasks Ti and Tj is prohibited (Figure 3(a)) and permitted (Figure 3(b)). Suppose task Ti is divided into two parts, Ti1 and Ti2, by its data transmission operations, and similarly for task Tj. In preemptive task execution, if the execution start time of segment Tj1 is earlier than that of Ti2, the first part of task Tj steals CPU resources from the execution of task Ti. Considering the possible relationships between the earliest start times of the segments Ti1, Ti2, Tj1 and Tj2, there are five possible preemptive execution sequences for tasks Ti and Tj, discussed below:

1. (Ti1, Tj1, Ti2, Tj2) when S1(Tj1) < S1(Ti2) and S1(Tj2) ≥ S1(Ti2):

   PAj(p, Ti, Tj) = −2H(p)

   In addition, this execution sequence results in early data transmission from task Tj, since it is assumed that data transmission from Tj is undertaken at the end of Tj1. The performance enhancement thus achieved is discussed in Section 4.4.

2. (Tj1, Ti1, Ti2, Tj2) when S1(Tj1) < S1(Ti1) and S1(Tj2) ≥ S1(Ti2):

   PAj(p, Ti, Tj) = −H(p)

   Furthermore, such preemptive execution causes delayed data transmission from task Ti and early transmission from Tj; the consequences are discussed in Sections 4.4 and 4.5.

3. (Ti1, Tj1, Tj2, Ti2) when S1(Tj1) < S1(Ti2) and S1(Tj2) < S1(Ti2):

   PAj(p, Ti, Tj) = −H(p)

   This execution sequence also results in early transmission of data from task Tj.

4. (Tj1, Ti1, Tj2, Ti2) when S1(Tj1) < S1(Ti1) and S1(Ti1) ≤ S1(Tj2) < S1(Ti2):

   PAj(p, Ti, Tj) = −2H(p)

   Late data transmission from task Ti and early transmission from Tj may occur in this case.


Figure 4. Two tasks on processor p with processor idle time I.

5. (Tj1, Tj2, Ti1, Ti2) when S1(Tj2) < S1(Ti1):

   PAj(p, Ti, Tj) = 0

   Processor performance may nevertheless vary due to the late transmission from task Ti and early transmission from Tj.

Whichever of the above cases occurs at runtime, PAj(p) is either negative or zero. That is to say, processor performance Φ is not improved by α-preemption in Situation 1. However, when the context-switch time of a processor is small enough to be ignored, PAj(p) can be regarded as approximately 0. In addition, a byproduct of Situation 1 is that early/late data transmission may occur, bringing further performance benefits/disadvantages as discussed in Sections 4.4 and 4.5 below.
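The five execution sequences and their context-switch costs can be captured by a small classifier over the earliest segment start times S1. This is a sketch under the section's assumptions (each task split into exactly two segments, cost in multiples of H(p)); the function name and the handling of ties at equal start times are assumptions of this sketch, not from the paper.

```python
def alpha_preemption_cost(s_i1, s_i2, s_j1, s_j2, h):
    """Classify the execution sequence of two two-segment tasks Ti = (Ti1, Ti2)
    and Tj = (Tj1, Tj2) on one processor from the earliest start times S1 of
    each segment, and return (sequence, PAj), where PAj is the context-switch
    cost of the paper's five cases (0, -H(p) or -2H(p))."""
    if s_j2 < s_i1:                          # case 5: Tj finishes before Ti starts
        return ("Tj1 Tj2 Ti1 Ti2", 0.0)
    if s_j1 < s_i1:                          # Tj1 precedes Ti1 (cases 2 and 4)
        if s_i1 <= s_j2 < s_i2:              # case 4: Tj2 interrupts Ti
            return ("Tj1 Ti1 Tj2 Ti2", -2 * h)
        return ("Tj1 Ti1 Ti2 Tj2", -h)       # case 2: Tj2 runs after Ti completes
    if s_j1 < s_i2:                          # Tj1 interrupts Ti (cases 1 and 3)
        if s_j2 < s_i2:                      # case 3: Tj completes inside Ti
            return ("Ti1 Tj1 Tj2 Ti2", -h)
        return ("Ti1 Tj1 Ti2 Tj2", -2 * h)   # case 1: Ti resumed, then Tj2
    return ("Ti1 Ti2 Tj1 Tj2", 0.0)          # no alpha-preemption at all

print(alpha_preemption_cost(0, 5, 2, 8, h=0.1))  # ('Ti1 Tj1 Ti2 Tj2', -0.2)
```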

4.4 Performance Achievement PAe and PAβ

Processor performance may vary when a task Ti receives its required data earlier than predicted, permitting an earlier start time than would occur under non-preemptive execution. This is Situation 2 as stated in Section 4.2, and it may be caused by either α-preemption or β-preemption. PA2 is used to represent the sum of PAe and PAβ: PA2 = PAe + PAβ. Figure 4 shows two tasks, Ti and Tj, assigned to the same processor p. I represents the processor idle time due to task Tj waiting for data, which may come from another task Tk executing on a different processor. Denote by GR(Tj) the advancement in the start time of task Tj gained from the early arrival of data. Depending on the relationship between GR(Tj) and I, PA2(p, Ti, Tj) is analyzed as follows.

1. When GR(Tj) < I:

   PA2(p, Ti, Tj) = GR(Tj) > 0

   Processor idle time still exists between Ti and Tj (of length I − GR(Tj)), but it is less than in non-preemptive execution.

2. When GR(Tj) ≥ I:

   PA2(p, Ti, Tj) = I ≥ 0

   In this case, the idle time between Ti and Tj is completely removed due to the gain made by the early arrival of data in Tj. As a consequence, β-preemption may result in further α-preemption between Ti and Tj through task interruption: the execution sequence is then determined by the start times of the segments Ti1, Ti2, Tj1 and Tj2 (as discussed in Section 4.3).

From this discussion it can be observed that PA2(p) is non-negative in all cases, due to the early arrival of data to task Tj. That is to say, a performance gain is guaranteed in Situation 2 of preemptive execution.
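Under the reading that the Situation 2 gain equals the amount of idle time actually removed, PA2 can be sketched as follows (an illustrative helper, not code from the paper):

```python
def pa2(gr, idle):
    """Performance achievement PA2 for early data arrival (Situation 2):
    gr = GR(Tj) > 0 is the advance in Tj's start time; idle = I >= 0.
    The gain is the idle time actually removed, i.e. min(GR(Tj), I)."""
    return gr if gr < idle else idle

print(pa2(3.0, 5.0))  # 3.0 -- idle shrinks from 5 to 2, gain bounded by GR
print(pa2(7.0, 5.0))  # 5.0 -- idle completely removed; alpha-preemption may follow
```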

4.5 Performance Achievement PAd

In situation σ3 of preemptive task execution, the execution start time of a task may be delayed due to the late arrival of data. This situation is caused only by α-preemption, since β-preemption only advances the time of data transmission. Figure 4 also illustrates this situation. The same notation GR(Tj) as in Section 4.4 indicates the variation in execution start time due to the changed data arrival; in situation σ3, GR(Tj) is always less than 0. Depending on the value of I, the performance achievement PA3 in σ3 is summarized below:

1. When I > 0:

PA3(p, Ti, Tj) = −|GR(Tj)|

The start time of task Tj is postponed by |GR(Tj)|.

2. When I = 0: Let Y = S1(Tj) − F(Ti). If Y ≥ |GR(Tj)|, then

PA3(p, Ti, Tj) = 0

otherwise

PA3(p, Ti, Tj) = Y − |GR(Tj)| < 0

i.e., processor idle time is created between Ti and Tj due to the delayed arrival of data to task Tj.

Whichever case occurs at runtime, the processor performance of p in situation σ3 is non-positive; this implies a performance loss on processor p.

Sections 4.3 through 4.5 have examined the variation of processor performance under the three situations σ1, σ2 and σ3. Processor performance varies depending on the sum of all these variations. The three situations result from the two preemptive execution strategies, Pα and Pβ. The next section compares the performance of preemptive and non-preemptive execution, so as to obtain a strategy which guarantees a performance enhancement through preemptive task execution.
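The loss in situation σ3 can be sketched the same way (again illustrative; names are my own). Here `gr_abs` stands for |GR(Tj)| and `y` for the slack Y = S1(Tj) − F(Ti) used in the I = 0 case:

```python
# Performance loss PA3 in situation σ3 (late data arrival), following the
# case analysis above. The result is never positive (illustrative sketch).

def pa3_loss(idle, gr_abs, y=0.0):
    """idle: I between Ti and Tj; gr_abs: |GR(Tj)|, the delay in data
    arrival; y: slack Y = S1(Tj) - F(Ti), consulted only when I == 0."""
    if idle > 0:
        return -gr_abs     # case 1: start of Tj postponed by |GR(Tj)|
    if y >= gr_abs:
        return 0.0         # case 2: the delay is absorbed by the slack
    return y - gr_abs      # case 2: new processor idle time is created
```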

4.6 Preemptive Execution versus Non-Preemptive Execution

This section summarizes the processor performance achieved through preemptive task execution and compares it against non-preemptive task execution. Recall the assumption commonly made regarding non-preemptive task execution, as discussed in Section 1: communication between tasks takes place only at the beginning and/or the end of a task. Under this assumption, the execution of a child task is forced to wait until all its parent tasks finish. This restriction ignores the fact that the child task may commence execution as soon as it receives all its data. In addition, non-preemption is not what is actually realized in PVM [4] or any other practical runtime system: in those systems, when a task is waiting for messages from other tasks, another task on the same host processor is automatically selected to utilize the CPU. There is no "pure non-preemption" scheme in the practical sense. Depending on whether the Pα and/or Pβ strategy is permitted at runtime, there are, in general, the following strategy combinations which control the execution of preemptive tasks:

PA(p)     NPα*NPβ    Pα*Pβ    NPα*Pβ
PAj          0        ≥ 0        0
PAe          0        ≥ 0       ≥ 0
PAd          0        ≤ 0        0
PAα          0        ≥ 0        0
Δ(p)         0         ?        ≥ 0

Table 1. Performance comparison between non-preemptive and preemptive executions.

1. NPα*NPβ: neither the Pα nor the Pβ strategy is applied at runtime. Data reception takes place at the start of the task and data transmission at the end, and the execution of a task is never interrupted. This is equivalent to non-preemptive task execution and has been adequately addressed in other work [11].

2. NPα*Pβ: a task can start before the completion of its parent tasks, because data communication operations may be located within the task body (i.e., communication is not restricted to the start or end of the task), but this strategy does not permit interruption of the execution of other tasks on the same processor. That is to say, once a task gains access to the CPU, it executes until completion.

3. Pα*Pβ: the execution of a task can be interrupted by another task on the same host processor, due to α-preemption or β-preemption or both, and data transmission is permitted in the middle of task execution.

The combination of strategies Pα and NPβ does not exist. On the one hand, the study of preemptive task execution assumes that α-preemption occurs only at the point of data transmission; it is only meaningful if data transmission is placed in the middle of the task, where interruption can occur. On the other hand, the NPβ strategy restricts communication to either end of the task (reception at the beginning and transmission at the end). Consequently, Pα*NPβ is excluded from the strategies employed in handling preemptive task execution.

Table 1 compares the processor performance of an available processor under the different preemptive strategies against non-preemptive task execution. Recall that Δ(p) is the sum of the above four kinds of performance variation; by definition, the system performance of the parallel program is determined by the maximum value among all the processor performances Δ(p). When task execution follows the NPα*NPβ strategy, it is simply non-preemptive task execution.
From Table 1 it is observed that, for any available processor p in the parallel system, the processor performance Δ(p) under NPα*Pβ preemptive execution is guaranteed to be at least as good as that under non-preemptive execution. On the other hand, Pα*Pβ does not always provide performance superior to NPα*NPβ, because of the possible delays in data transmission incurred by α-preemption; a question mark is therefore placed in the corresponding Δ(p) entry. In summary, by merely applying Pβ to parallel task execution, system performance is expected to improve in comparison to non-preemptive execution.
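The conclusion drawn from Table 1 can be summarized as a small predicate over the two strategy flags (a sketch of the conclusion only, with assumed names; the paper itself gives no code):

```python
# Which strategy combinations guarantee a non-negative processor
# variation Δ(p)? Per Table 1: only NPα*Pβ does; Pα*Pβ is case-dependent,
# and NPα*NPβ is plain non-preemptive execution with Δ(p) = 0.

def gain_guaranteed(alpha_enabled, beta_enabled):
    """True iff the combination guarantees Δ(p) >= 0 with a possible
    strict gain over non-preemptive execution."""
    if alpha_enabled and not beta_enabled:
        # Pα*NPβ is excluded: α-preemption presupposes mid-task
        # data transmission.
        raise ValueError("Pa*NPb is not a valid strategy combination")
    return beta_enabled and not alpha_enabled
```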

5 Preemptive Task Scheduling

In preemptive task scheduling, the tasks of a parallel program are assumed to undertake message transmission and task spawn operations at any time. As a consequence, a task may commence execution before the completion of its parent tasks. This characteristic of parallel processing should be considered when distributing tasks onto the available processors, in order to achieve high system performance.

Several proposals to deal with non-preemptive task scheduling have been put forward, and two main techniques exist for handling this scheduling problem statically. One is cluster scheduling, which includes the work of Sarkar [13] and Yang [17]. This approach first groups the parallel tasks into "clusters" and then, in a second step, distributes them onto the available processors; further clustering may be required in the second step when the number of available processors is smaller than the number of task clusters produced. The other main strategy is list scheduling, in which each task is first assigned a priority (either explicitly or implicitly) and then allocated to an idle processor according to that priority. Algorithms falling into this category include LS [5], MH [3], ERT [10] and ETF [9].

Preemptive task scheduling stresses that the scheduling algorithm should consider the existence of preemption during task execution; this has been largely ignored in previous work. The task model is best illustrated with one additional attribute, named the preemption start point, alongside the standard attributes of task computation time and inter-task communication data addressed in many other algorithms [3, 10, 11, 13]. Such a task model reflects the execution of parallel tasks more accurately than those currently used, and a preemptive task scheduling algorithm can produce a more efficient scheduling policy than those which do not consider preemption.

A new preemptive scheduling algorithm, PET, is proposed to deal with the preemptive task model. PET originates from the list scheduling algorithm ERT [10]. At any instant, each schedulable task (or free task, i.e., a task all of whose parent tasks have been distributed onto processors) in the application program is assigned a priority value based on its "earliest start time". The PET algorithm is briefly outlined as follows:

Step 1: Initialization: All tasks are considered unallocated.
All processors are idle and available immediately. Set the current time t = 0.

Step 2: Task Scheduling:
• Let W be the set of all tasks that have not yet been scheduled and whose parent tasks have all been scheduled.
• If W is empty, then exit.
• Select a task ti in W and a processor pj with the smallest value of the "earliest start time", calculated by:

S1(ti, pj) = max(F(ti1, pj1) + V(ti1, ti) + M(ti1, ti)),

taken over every parent task ti1 of ti on its host processor pj1. Let

S2(ti, pj) = max{S1(ti, pj), A(pj)}

be the "earliest start time" of task ti on processor pj, where A(pj) is the earliest time at which pj is available. The smallest "earliest start time" is then

S0(tk, pl) = min(S2(ti, pj)),

taken over all tasks ready to be scheduled on all available processors. Task tk is then assigned to processor pl.

Step 3: Update: The earliest available time of the selected processor pl is set to its previous value plus the estimated length of time required to execute the selected task tk on pl.

             GNR < 0           GNR = 0    GNR > 0
AvePMRatio   Exec%  AveDiff%   Exec%      Exec%  AveDiff%
0.1            0       0         2          98      4
0.5            0       0         1          99     10
1.0            0       0         2          98     10
5.0            0       0         9          91      7
10.0           0       0         9          91      7

Table 2. Performance comparison between preemptive task execution and non-preemptive execution.

Step 4: Go to Step 2.

The complexity of the PET algorithm is O(mn²), where m is the number of processors in the distributed system and n is the number of tasks in the parallel program. The efficiency of PET is illustrated in Section 6 through extensive simulation experiments.

The PET algorithm schedules parallel tasks prior to program execution, taking task preemption into account. At runtime, the job scheduling policy of each available processor handles task execution so that a running task executes to completion, without interruption by other tasks assigned to the same processor. In other words, the NPα*Pβ policy is employed to manage program execution; as shown in Section 4.6, this policy guarantees a performance improvement for parallel programs. The system performance achieved by preemptive task execution and preemptive task scheduling is examined experimentally in Section 6.
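Steps 1-4 above can be condensed into a short simulation sketch. The communication model (a single cost M charged only when parent and child land on different processors, standing in for the V and M terms) and all names below are assumptions for illustration, not the authors' implementation:

```python
# Illustrative sketch of the PET list-scheduling loop described in
# Steps 1-4. Tasks are repeatedly drawn from the free set W, and the
# (task, processor) pair with the smallest earliest start time S2 wins.

def pet_schedule(tasks, parents, comp, msg, num_procs):
    """tasks: iterable of task ids; parents[t]: list of parents of t;
    comp[t]: estimated computation time; msg[(u, t)]: communication cost
    from parent u to child t. Returns {task: (proc, start, finish)}."""
    avail = [0.0] * num_procs            # A(p): earliest available time
    placed = {}                          # task -> (proc, start, finish)
    unscheduled = set(tasks)             # Step 1: nothing allocated yet
    while unscheduled:
        # W: free tasks whose parents have all been scheduled
        W = [t for t in unscheduled
             if all(u in placed for u in parents[t])]
        if not W:
            break                        # Step 2: exit when W is empty
        best = None                      # (S2, task, proc)
        for t in W:
            for p in range(num_procs):
                s1 = 0.0                 # S1: data-ready time of t on p
                for u in parents[t]:
                    u_proc, _, u_finish = placed[u]
                    cost = msg[(u, t)] if u_proc != p else 0.0
                    s1 = max(s1, u_finish + cost)
                s2 = max(s1, avail[p])   # S2 = max(S1, A(p))
                if best is None or s2 < best[0]:
                    best = (s2, t, p)
        s0, tk, pl = best                # S0: smallest earliest start
        placed[tk] = (pl, s0, s0 + comp[tk])
        avail[pl] = s0 + comp[tk]        # Step 3: update A(pl)
        unscheduled.remove(tk)
    return placed
```

With two processors, a parent of computation time 2 and two children whose communication cost (5) outweighs their computation, this sketch keeps all three tasks on one processor, as ERT-style list scheduling would.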

6 Experimental Results

The experiments conducted show the performance improvement of preemptive task execution and preemptive task scheduling. In preemptive task execution, only the Pβ strategy is applied, i.e., data transmission from a task is permitted to take place before the task completes its execution. A large number of parallel programs with representative behaviours are simulated by extracting their task attributes: task computation time, inter-task communication data volume and β-preemption start point. Parallel applications are classified according to AvePMRatio, the ratio of the average magnitudes of task computation and communication in a task model. A high value of AvePMRatio (e.g., 10.0) models computation-intensive applications, while a low value (e.g., 0.1) represents communication-intensive applications. A range of experiments with varying values of AvePMRatio is undertaken with different task numbers, different task interconnections and various numbers of processors, to obtain the average performance for each situation.

Three kinds of experimental results are studied and compared for each simulated parallel program. First, all constituent tasks are assumed to be non-preemptive and the scheduling procedure is undertaken to distribute tasks onto processors; the "non-preemptive" performance is obtained in this experiment. In the next step, tasks are allowed to be preemptive at runtime while this factor is ignored in task scheduling prior to program execution; this is "preemptive task execution" as discussed in Section 4, and the scheduling is undertaken to obtain actual system performance results. Finally, an experiment is conducted which takes the preemption start point into consideration while distributing tasks onto the underlying processors; thus "preemptive task scheduling" performance results can be captured and analyzed.
The term GNR is introduced to measure the difference in parallel execution time between preemptive task execution (PTE) and non-preemptive task execution (NPTE):

GNR = (ExecTime(NPTE) − ExecTime(PTE)) / ExecTime(NPTE)

             PNR < 0           PNR = 0    PNR > 0
AvePMRatio   Exec%  AveDiff%   Exec%      Exec%  AveDiff%
0.1            5      10         4          91      6
0.5            1       1         1          98     12
1.0            3       1         1          96     14
5.0            2       1         2          96     17
10.0           1       1         1          99     20

Table 3. Performance comparison between preemptive task scheduling and non-preemptive scheduling.

Here, only task preemption at runtime is considered. A positive value of GNR indicates that PTE is superior to NPTE, while a negative GNR shows the opposite. In the same way, the term PNR is defined as the performance discrepancy between preemptive task scheduling (PTS) and the non-preemptive case:

PNR = (ExecTime(NPTE) − ExecTime(PTS)) / ExecTime(NPTE)

Tables 2 and 3 list the experimental results of the performance comparison between PTE and NPTE, and between PTS and NPTE, respectively. Each table partitions the experimental data into three groups according to the value of GNR or PNR. For each group, at each value of AvePMRatio, the table displays the percentage of applications (Exec%) falling in that group and the average value of GNR or PNR (AveDiff%). The context-switch time between tasks on the same host has been ignored in the experiments.

The experimental results are consistent with the analysis in Section 4. From Table 2, in over 90% of the simulated parallel applications, preemptive task execution shows better performance than non-preemptive execution. As observed in Table 3, preemptive scheduling can achieve 20% better performance than non-preemptive scheduling when AvePMRatio is 10.0, although in a very limited number of cases PTS performs slightly worse than NPTE. In general, both the theoretical discussion and the extensive experiments indicate that taking preemption into account in task execution and scheduling improves system performance.
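Both ratios are the same relative-improvement formula applied to different measured times; a trivial helper (names and timings assumed for illustration) makes this explicit:

```python
# GNR and PNR as defined above: the relative saving in parallel execution
# time of a preemptive variant over the non-preemptive baseline (NPTE).

def improvement_ratio(time_npte, time_variant):
    """Positive when the preemptive variant (PTE or PTS) finishes earlier
    than the non-preemptive baseline."""
    return (time_npte - time_variant) / time_npte

# Example with assumed timings: NPTE 120 s, PTE 102 s, PTS 96 s.
gnr = improvement_ratio(120.0, 102.0)   # 0.15, i.e. PTE is 15% faster
pnr = improvement_ratio(120.0, 96.0)    # 0.20, i.e. PTS is 20% faster
```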

7 Conclusions

This paper addresses preemptive task execution and preemptive task scheduling of parallel tasks on message-passing parallel systems. While a large amount of work concentrates on task models which portray task computation time and inter-task communication magnitude, preemptive task execution and scheduling has been largely ignored. Preemptive task execution distinguishes itself from the non-preemptive case by allowing preemption between the execution of parent and child tasks; that is to say, message-passing operations are no longer restricted to either end of a task. The Pβ strategy proposed in this paper improves system performance, and task scheduling under preemptive task execution can reuse current non-preemptive algorithms. Preemptive task scheduling, on the other hand, requires the algorithm itself to deal with preemption between tasks in order to achieve high performance; the PET preemptive scheduling algorithm is presented in this paper to deal with this situation. Experiments show that system performance can be significantly improved by preemptive task execution and scheduling, as compared to non-preemptive program execution. An environment named ATME [8, 12] has been developed to efficiently automate the tedious scheduling aspects of parallel programming.

References

[1] Thomas L. Casavant and Jon G. Kuhl. A taxonomy of scheduling in general-purpose distributed computing systems. IEEE Transactions on Software Engineering, Volume 14, Number 2, pages 141-154, February 1988.

[2] Hesham El-Rewini and Hesham H. Ali. Static scheduling of conditional branches in parallel programs. Journal of Parallel and Distributed Computing, Volume 24, Number 1, pages 41-54, January 1995.

[3] Hesham El-Rewini and Ted G. Lewis. Scheduling parallel program tasks onto arbitrary target machines. Journal of Parallel and Distributed Computing, Volume 9, Number 2, pages 138-153, June 1990.

[4] Al Geist, Adam Beguelin, Jack Dongarra, Weicheng Jiang, Robert Manchek and Vaidy Sunderam. PVM: Parallel Virtual Machine. A User's Guide and Tutorial for Networked Parallel Computing. The MIT Press, Cambridge, Massachusetts, 1994.

[5] R. L. Graham. Bounds on multiprocessing time anomalies. SIAM Journal on Applied Mathematics, Volume 17, Number 2, pages 416-429, March 1969.

[6] T. C. Hu. Parallel sequencing and assembly line problems. Operations Research, Volume 9, pages 841-848, 1961.

[7] Lin Huang and Michael J. Oudshoorn. An approach to distribution of parallel programs with conditional task attributes. Technical Report TR97-06, Department of Computer Science, University of Adelaide, August 1997.

[8] Lin Huang and Michael J. Oudshoorn. ATME: A parallel programming environment for applications with conditional task attributes. In Andrzej Goscinski, Michael Hobbs and Wanlei Zhou (editors), 1997 3rd International Conference on Algorithms and Architectures for Parallel Processing, pages 275-282, December 1997. Melbourne, Australia.

[9] Jing Jang Hwang, Yuan Chieh Chow, Frank D. Anger and Chung Yee Lee. Scheduling precedence graphs in systems with interprocessor communication times. SIAM Journal on Computing, Volume 18, Number 2, pages 244-257, April 1989.

[10] Chung Yee Lee, Jing Jang Hwang, Yuan Chieh Chow and Frank D. Anger. Multiprocessor scheduling with interprocessor communication delays. Operations Research Letters, Volume 7, Number 3, pages 141-147, June 1988.

[11] Michael G. Norman and Peter Thanisch. Models of machines and computation for mapping in multicomputers. ACM Computing Surveys, Volume 25, Number 3, pages 263-302, September 1993.

[12] Michael J. Oudshoorn and Lin Huang. Conditional task scheduling on loosely-coupled distributed processors. In The 10th International Conference on Parallel and Distributed Computer Systems, pages 136-140, October 1997. New Orleans, USA.

[13] Vivek Sarkar. Partitioning and Scheduling Parallel Programs for Execution on Multiprocessors. The MIT Press, Cambridge, MA, 1989.

[14] Harold S. Stone. Multiprocessor scheduling with the aid of network flow algorithms. IEEE Transactions on Software Engineering, Volume SE-3, Number 1, pages 85-93, January 1977.

[15] J. Ullman. NP-complete scheduling problems. Journal of Computer and System Sciences, Volume 10, pages 384-393, 1975.

[16] Min You Wu and Daniel Gajski. Hypertool: a programming aid for message-passing systems. IEEE Transactions on Parallel and Distributed Systems, Volume 1, Number 3, pages 330-343, July 1990.

[17] Tao Yang. Scheduling and Code Generation for Parallel Architectures. Ph.D. thesis, Department of Computer Science, Rutgers, The State University of New Jersey, 1993.

