"task allocation and scheduling models for multiprocessor digital ...

IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 43, NO. 3, MARCH 1995


A Note on “Task Allocation and Scheduling Models for Multiprocessor Digital Signal Processing”

C. S. R. Krishnan, D. Antony Louis Piriyakumar, and C. Siva Ram Murthy

Abstract-Recently, a branch and bound algorithm for task allocation or task scheduling in multiprocessor digital signal processing based on 0-1 integer programming was proposed by Konstantinides et al. This algorithm does not consider the problem of contention in the communication links of a multiprocessor system and thus may produce unrealistic schedules. We present here a modified version of this algorithm that resolves the problem of contention, thereby producing realistic optimal schedules.

Fig. 1. (a) Task graph; (b) processor graph.

I. INTRODUCTION

Task allocation or task scheduling is a fundamental problem that has to be satisfactorily solved in order to exploit the full potential of a multiprocessor system [1]-[5]. It is the problem of allocating the tasks of a parallel program (to be executed on a multiprocessor system) to the processors in a way that minimizes its completion time. Recently, Konstantinides et al. [1] proposed a branch and bound task allocation algorithm, with simple backward and forward searching techniques from the theory of 0-1 integer programming, for multiprocessor digital signal processing. This algorithm, while scheduling nonperiodic block-type tasks onto multiprocessors that allow concurrent I/O and program execution, does not consider the problem of contention (which arises in the context of the usage of communication links of a multiprocessor system) and thus may produce unrealistic schedules. In the rest of this correspondence, we first briefly discuss this problem and then present several modifications to the algorithm in [1] for producing realistic schedules.

Manuscript received July 16, 1992; revised July 27, 1994. This work was supported by the Indian National Science Academy and the Department of Science and Technology. The associate editor coordinating the review of this paper and approving it for publication was K. Wojtek Przytula. The authors are with the Department of Computer Science and Engineering, Indian Institute of Technology, Madras, India. IEEE Log Number 9408220.

II. OCCURRENCE OF CONTENTION

The problem of contention arises when two or more parallel tasks of a program running on different processors try to communicate via the same communication link simultaneously. Contention delays the arrival of messages (data) at their destinations. The task allocation algorithm in [1] does not consider intertask communication delays due to contention. This is illustrated in the following example.

Consider the task and processor graphs of Fig. 1. In Fig. 1(a), the number beside a node denotes its execution time, and the numbers on the branches denote the intertask data transfers. For these task and processor graphs, the algorithm in [1] produces the schedule shown in Fig. 2. The Gantt chart of Fig. 2 shows that the two pairs of tasks T1 and T2, and T3 and T4, use the same communication link C for intertask communication in the time interval 35-40 time units. Since this is not possible when only one pair of tasks is allowed to use the link at any given interval of time, the schedule produced by the algorithm in [1] is not realistic.

Fig. 2. Schedule obtained by the algorithm in [1].

In general, if the number of links (I/O ports) is less than the number of tasks requiring communication at the same time, or the capacity of the link is less than the total communication required by all these tasks, then there will be contention. Moreover, contention usually occurs when intertask communication time is of the same order of magnitude as, or comparable to, task execution time.
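The overlap described above can be checked mechanically. The following is a minimal sketch, not part of the algorithm in [1]: the `Transfer` record, the function names, and the ready times and durations are all assumptions chosen so that both transfers occupy link C during the same five time units. It first detects transfers that overlap on one link and then shows the delay that results when the link serves one transfer at a time.

```python
from dataclasses import dataclass


@dataclass
class Transfer:
    src: str         # sending task
    dst: str         # receiving task
    link: str        # communication link used
    ready: float     # time at which the data is ready to be sent
    duration: float  # time the transfer occupies the link


def find_contention(transfers):
    """Return pairs of transfers that overlap in time on the same link,
    assuming every transfer starts as soon as its data is ready."""
    clashes = []
    for idx, a in enumerate(transfers):
        for b in transfers[idx + 1:]:
            overlap = (a.ready < b.ready + b.duration
                       and b.ready < a.ready + a.duration)
            if a.link == b.link and overlap:
                clashes.append((a, b))
    return clashes


def serialize(transfers):
    """Serve transfers in list order, one at a time per link.
    Each transfer starts at max(ready time, time the link becomes free)."""
    link_free = {}
    timeline = []
    for t in transfers:
        start = max(t.ready, link_free.get(t.link, 0.0))
        end = start + t.duration
        link_free[t.link] = end  # update the link's availability time
        timeline.append((t, start, end))
    return timeline


# Hypothetical numbers: both pairs want link C from time 35 for 5 units.
transfers = [
    Transfer("T1", "T2", "C", ready=35, duration=5),
    Transfer("T3", "T4", "C", ready=35, duration=5),
]
clashes = find_contention(transfers)
timeline = serialize(transfers)
for t, start, end in timeline:
    print(f"{t.src}->{t.dst} on {t.link}: {start}..{end}")
```

Under these assumed numbers, the second transfer cannot start until the first one releases the link, so its data arrive later than a contention-unaware schedule would predict.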

1053-587X/95$04.00 © 1995 IEEE

III. CRITICAL NOTE ON OPTIMALITY

Now, it will be shown, with the help of a search tree, how the algorithm in [1] does not examine all possible allocation sequences that may be optimal as well as contention free. A search tree for the given task and processor graphs gives all possible allocations and allocation sequences of tasks onto processors. A node (i, j) of the search tree represents the allocation of task Ti on processor Pj. A path from the root to a leaf gives a possible allocation of all tasks onto processors. The following are some important points concerning the search procedure adopted by the algorithm in [1]:

i) Initially, all the nodes are ‘free.’
ii) At any point of execution, a given node is either ‘free’ or ‘bound.’
iii) Task Ti is eligible for allocation on processor Pj only when node (i, j) is ‘free.’
iv) A node in the search tree that is ‘bound’ becomes ‘free’ only when its parent node becomes ‘bound.’
v) There can be many nodes corresponding to the allocation of a task onto a processor.
vi) If we ‘choose’ node (i, j), then it implies that task Ti is allocated on processor Pj.
vii) If we backtrack from node (i, j), then it implies that task Ti is deallocated from processor Pj, and node (i, j) becomes ‘bound.’

While going through the search tree, if the algorithm in [1] deallocates task Ti from processor Pj, then node (i, j) becomes free only when its parent node, say, (k, l), becomes bound. Due to this, and by iii) and v) given above, it follows that once the node (i, j) becomes bound, the different instances of that node within the subtree rooted at (k, l) are not eligible for consideration, and hence, different sequences of a particular allocation are not examined.

Fig. 3. (a) Task graph; (b) processor graph.

We now show with an example how the algorithm in [1] does not examine all possible allocation sequences. The search tree for the task and processor graphs of Fig. 3 is shown in Fig. 4. Suppose we choose the nodes (1, 0), (2, 0), and (3, 1) (in that sequence), which corresponds to the allocation of tasks T1 and T2 on processor P0 and task T3 on processor P1. Say we backtrack from (3, 1) to (2, 0) and then again from (2, 0) to (1, 0). Now, node (2, 0) becomes bound, and node (3, 1) becomes free. Suppose we now choose node (3, 1). Then, (2, 0) cannot be chosen, for it is bound and would become free only when its parent node (1, 0) becomes bound. Therefore, the sequence of allocation (1, 0), (3, 1), and (2, 0) (which also corresponds to the allocation of tasks T1 and T2 on processor P0 and task T3 on processor P1) is not examined. Thus, there is a possibility that a sequence that is not examined by the algorithm in [1] may be optimal as well as contention free.
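The example above can be replayed with a toy model of rules i)-vii). This is only a sketch of the free/bound bookkeeping as it is stated in this section, not the implementation in [1]; the class name `OriginalRule` and its methods are hypothetical. A node is keyed by (task, processor) alone, so all instances of the same node share one free/bound state.

```python
class OriginalRule:
    """Toy model of the free/bound bookkeeping in rules i)-vii)."""

    def __init__(self):
        # Maps each bound node to the parent it was backtracked to.
        # Nodes absent from the map are free (rule i).
        self.bound_under = {}

    def is_free(self, node):
        # Rule iii: a node may be chosen only while it is free.
        return node not in self.bound_under

    def backtrack(self, node, parent):
        # Rule iv: nodes bound under `node` become free once `node`
        # itself becomes bound.
        released = [c for c, p in self.bound_under.items() if p == node]
        for child in released:
            del self.bound_under[child]
        # Rule vii: backtracking from `node` makes it bound.
        self.bound_under[node] = parent


rule = OriginalRule()
# Path chosen so far: (1, 0) -> (2, 0) -> (3, 1).
rule.backtrack((3, 1), parent=(2, 0))  # back to (2, 0)
rule.backtrack((2, 0), parent=(1, 0))  # back to (1, 0)

can_choose_3_1 = rule.is_free((3, 1))        # (3, 1) was released
can_choose_2_0_after = rule.is_free((2, 0))  # (2, 0) is still bound
print(can_choose_3_1, can_choose_2_0_after)
```

In this model, node (3, 1) can be chosen directly under (1, 0), but node (2, 0) stays bound afterwards, so the sequence (1, 0), (3, 1), (2, 0) is never generated.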

A. Contention Handling

The problem of contention can be resolved by allowing only one pair of tasks to use a given link at any given interval of time. The other pairs of tasks that want to use the same link at that interval of time have to wait for that link to become available again. This necessitates the scheduling of links, i.e., updating the availability times of the links. It may be noted that the scheduling of links depends on the sequence in which the tasks are allocated to processors.

Even if the algorithm in [1] is made to resolve contention by allowing only one pair of communicating tasks to use a link at any given interval of time, i.e., if link scheduling is introduced, it still cannot guarantee an optimal schedule. This is because, unlike in [3], the scheduling of links depends not only on the allocation of tasks to processors but also on the sequence in which they are allocated. Consider the task and processor graphs of Fig. 1. Suppose at some point of execution of the algorithm in [1] with link scheduling introduced, tasks T1 and T3 are allocated to processor P0 and tasks T5 and T2 to processor P1. If the allocation sequence of the tasks is T1, T3, T5, and T2, then the link C is available for communication between tasks T1 and T2 at 55 time units, whereas if the allocation sequence is T1, T3, T2, and T5, then the link C is available for communication between tasks T1 and T2 at 15 time units itself. This shows clearly how link scheduling depends on the sequence in which tasks are allocated to processors.

B. Optimal Scheduling

As mentioned in Section III, the algorithm in [1] does not examine all the possible allocation sequences in order to guarantee optimal schedules. This problem would be resolved if we can distinguish the different instances of a particular node in the search tree. This is done by associating a set with every distinct node (i, j) in the search tree; call it the associated set of (i, j). Whenever the task Ti corresponding to a node (i, j) is deallocated from processor Pj, we do the following:

i) Add the parent node of node (i, j) in the search tree to the associated set of (i, j);
ii) remove node (i, j) from the associated sets of its child nodes that are bound (this is done because once we backtrack from an instance of node (i, j), we would never choose that instance of node (i, j) again).

If a node (k, l) belongs to the associated set of node (i, j), it implies that the node (i, j), whose parent in the search tree is node (k, l), has been examined. As no two instances of the same node can have the same parent node (by the property of a search tree), it follows that different instances of a single node are distinguished from each other. The new condition for a node (i, j) to be free or bound is as follows: Node (i, j) is free if its associated set is empty or if the last chosen node (the root of the subtree currently examined) does not occur in the associated set of (i, j); else node (i, j) is bound.

With this modification, the schedule produced by the algorithm in [1], for the task and processor graphs in Fig. 1(a) and (b), respectively, is given in Fig. 5. It may be noted that it is both optimal and contention free. At this point, it is important to note that, like any branch and bound algorithm, the modified algorithm also enumerates all possible sequences only in the worst case, as we have not changed the branch and bound condition in the algorithm.
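The associated-set rule of Section B can be sketched as a small data structure. This is an illustrative reading of steps i) and ii) and of the new free/bound condition, not the authors' implementation; the class name `AssociatedSets` and its methods are hypothetical. It replays the Section III example, in which the sequence (1, 0), (3, 1), (2, 0) must remain reachable.

```python
from collections import defaultdict


class AssociatedSets:
    """Toy model of the associated-set bookkeeping of Section B."""

    def __init__(self):
        # node -> set of parent nodes under which it was already examined
        self.assoc = defaultdict(set)

    def is_free(self, node, last_chosen):
        # Free if the associated set is empty or does not contain the
        # last chosen node; otherwise the node is bound.
        parents = self.assoc[node]
        return not parents or last_chosen not in parents

    def backtrack(self, node, parent):
        # Step i): record that `node` has been examined under `parent`.
        self.assoc[node].add(parent)
        # Step ii): drop `node` from the associated sets of its children.
        # Only nodes examined under `node` can contain it, so discarding
        # it from every other set has the same effect.
        for other, parents in self.assoc.items():
            if other != node:
                parents.discard(node)


m = AssociatedSets()
# Path chosen so far: (1, 0) -> (2, 0) -> (3, 1).
m.backtrack((3, 1), parent=(2, 0))  # back to (2, 0)
m.backtrack((2, 0), parent=(1, 0))  # back to (1, 0)

bound_under_1_0 = not m.is_free((2, 0), last_chosen=(1, 0))
free_3_1 = m.is_free((3, 1), last_chosen=(1, 0))
# After choosing (3, 1) under (1, 0), node (2, 0) is eligible again.
free_2_0_under_3_1 = m.is_free((2, 0), last_chosen=(3, 1))
print(bound_under_1_0, free_3_1, free_2_0_under_3_1)
```

Because the free/bound test depends on the last chosen node, the instance of (2, 0) directly under (1, 0) stays bound, while the instance under (3, 1) is free, which is the distinction the original rule could not make.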

We would like to present our case with a realistic digital signal processing application: computing the fast Fourier transform with decimation in time (FFT DIT). Here, we have taken the case of N = 8 and the elementary butterfly operation [6]-[7] (so named due to the shape of the flow graph), with four components of N/4-point computation [6], as

$$X[k] = \sum_{r=0}^{(N/2)-1} x(2r)\, W_{N/2}^{rk} + W_N^{k} \sum_{r=0}^{(N/2)-1} x(2r+1)\, W_{N/2}^{rk} = G(k) + W_N^{k} H(k),$$

where $W_N = e^{-j2\pi/N}$.

We map directly the flow (or task) graph given in [6] and [7] onto a realistic iPSC/2 hypercube architecture with message passing coprocessor (MPC) and virtual channel router (VCR). The communication latency in this system, for messages of length ≤ 100 bytes, is approximately 55 µs, and the multiplication and addition times (for complex numbers) are 9.6 and 2.8 µs, respectively [8], [9]. From this, the task graph is constructed as shown in Fig. 6(a). The contention schedule produced by the algorithm in [1], using a 1-D hypercube (Fig. 6(b)), is shown in Fig. 7. During the time interval 89.6 to 144.6 time units, task 5 communicates to tasks 9 and 11 simultaneously, using the same link P0-P1, leading to contention. The modifications suggested in this correspondence to the algorithm in [1] produce a schedule that is both contention free and optimal, as shown in Fig. 8. The original algorithm in [1] and the modified version of this algorithm have been implemented on a network of SUN 3/50 workstations, and the performance of these algorithms for different types of task graphs can be found in [10].

Fig. 4. Search tree.

Fig. 5. Contention-free and optimal schedule obtained by the modified algorithm for the task and processor graphs in Figs. 1(a) and 1(b), respectively.

Fig. 6. (a) Task graph; (b) processor graph.

Fig. 8. Contention-free and optimal schedule obtained by the modified algorithm.

ACKNOWLEDGMENT

The authors wish to express their sincere thanks to the associate editor, Dr. W. Przytula, and the referees for many helpful comments and suggestions.

REFERENCES

[1] K. Konstantinides, R. T. Kaneshiro, and J. R. Tani, “Task allocation and scheduling models for multiprocessor digital signal processing,” IEEE Trans. Acoust., Speech, Signal Processing, vol. 38, no. 12, Dec. 1990.
[2] E. A. Lee and D. G. Messerschmitt, “Static scheduling of synchronous data flow programs for digital signal processing,” IEEE Trans. Comput., vol. C-36, no. 2, pp. 24-35, Jan. 1987.
[3] R. P. Bianchini and J. P. Shen, “Interprocessor traffic scheduling algorithm for multiple-processor networks,” IEEE Trans. Comput., vol. C-36, no. 4, Apr. 1987.
[4] J. P. Lehoczky and L. Sha, “Performance of real-time bus scheduling algorithms,” ACM Performance Evaluation Rev., Special Issue, May 1986.
[5] E. A. Lee and D. G. Messerschmitt, “Pipeline interleaved programmable DSP's: Synchronous data flow programming,” IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-35, no. 9, Sep. 1987.
[6] A. V. Oppenheim and R. W. Schafer, Discrete-Time Signal Processing. Englewood Cliffs, NJ: Prentice-Hall, 1989, Prentice Hall Signal Processing Series.
[7] S. Y. Kung, VLSI Array Processors. Englewood Cliffs, NJ: Prentice-Hall, 1988, Prentice Hall Information and Systems Series.
[8] J.-M. Hsu, “Performance measurement and hardware support for message passing in distributed memory multicomputers,” Ph.D. thesis, UILU-ENG-91-2209, CRHC-91-5, Univ. of Illinois at Urbana-Champaign, 1991.
[9] L. Bomans and D. Roose, “Benchmarking the iPSC/2 hypercube multiprocessor,” Concurrency: Practice and Experience, vol. 1, no. 1, pp. 3-18, Sept. 1989.
[10] C. S. R. Krishnan and C. S. R. Murthy, “A modified branch and bound algorithm for multiprocessor digital signal processing,” Dept. of Comput. Sci. and Eng., Indian Inst. of Technol., Madras, Tech. Rep., June 1992.