Proceedings of the 28th Annual Hawaii International Conference on System Sciences - 1995
A Scalable Time-Sharing Scheduling for Partitionable, Distributed Memory Parallel Machines

Atsushi HORI    Munenori MAEDA    Yutaka ISHIKAWA    Takashi TOMOKIYO    Hiroki KONAKA
Tsukuba Research Center, Real World Computing Partnership 1-6-1 Takezono, Tsukuba-shi, Ibaraki 305, JAPAN
Abstract
We propose a new process scheduling queue system called the Distributed Queue Tree (DQT) for distributed memory, dynamically partitionable parallel machines. We assume that partitions can be nested dynamically and that a process in a partition can be preempted. The combination of dynamically nested partitioning and time-sharing scheduling may provide an interactive environment and higher processor utilization. The key idea of DQT is to distribute process scheduling queues to each partition. We propose a round-robin scheduling algorithm and several task allocation policies on DQT. The simulation results show that time-sharing with DQT results in better processor utilization than that available from batch scheduling in high-load situations.

1 Introduction

Dynamic partitioning, which can vary the partitioning of processor space under software control, may realize higher processor utilization, because incoming tasks are assigned only the number of processors required, and the rest of the processors can be assigned to other tasks. In this case, the external fragmentation of processor space, analogous to that seen in memory allocation [9], becomes a problem. Li and Cheng proposed the 2-D buddy strategy [6], while Zhu proposed some alternatives [12] to handle this problem in a 2-D mesh-connected parallel machine. For a hypercube-connected system, several subcube allocation strategies have also been proposed [1][2]. None, however, takes account of time-sharing; all assume batch scheduling.

Time-sharing, which provides an interactive programming environment, is a mature scheduling technique for sequential machines. For parallel machines, however, process scheduling techniques have not yet been studied well. Ousterhout proposed time-sharing scheduling algorithms [8]. Processes, however, often change their status, for example, when performing I/O. Such status changes should be reflected in the process management as soon as possible to prevent idle process allocation. In a large-scale parallel machine having a centralized process run queue, this can become a bottleneck.

A combination of time-sharing and dynamic partitioning can make processor utilization higher than that available with batch scheduling. This is because the external fragmentation that decreases processor utilization can be canceled by latecoming task(s). Thus, time-sharing on a parallel machine not only provides an interactive environment, but can also maintain higher processor utilization. A scheduling system for a large-scale parallel machine should capture the benefits of dynamic partitioning and time-sharing while avoiding the bottleneck incurred by a centralized queue system. To cope with these requirements, we propose a new time-sharing scheduling technique called the Distributed Queue Tree (DQT).

This paper is a first preliminary report on DQT. We concentrate on the details of the DQT technique and analyze the fundamental behavior of DQT by simulation. The efficiency of DQT, compared with simple batch scheduling, is also evaluated by simulation. In the next section, we clarify the assumptions and goals of our work on DQT. In Section 3, round-robin scheduling and task allocation algorithms are proposed, and a fundamental analysis of the behavior of DQT is described. Section 4 presents the results of several simulations. In Section 5, we discuss the difference between our DQT and the hierarchical process control scheme proposed by Feitelson and Rudolph [3].
2 Assumptions and goals

We assume that the target is a MIMD, message-passing, distributed memory, dynamically partitionable machine. Partitions can be nested over time. A task requires a certain number of processors simultaneously; the required number of processors is referred to as the task size. Each task is assumed to be independent of other tasks. The parallel machine is assumed to be homogeneous, such that a task may be allocated to any partition of a size at least equal to the task size. The task allocator knows nothing but the task size. The task size distribution, however, can be a priori knowledge. In our model, a process is a parallel execution entity of a task and consists of many threads. We assume that every task size is constant, and that every process lives in the same partition as that allocated at initiation. Process migration is not taken into account and falls outside the scope of this paper. A process can be suspended or resumed by the scheduler at any time. It is possible that, at any one time, only a few processors in a partition are busy while the others are idle within a running process; this internal processor utilization problem also falls outside the scope of this paper.

Given the above assumptions, we intend to develop a round-robin scheduling algorithm and task allocation algorithms. For round-robin scheduling, the main problem is how to distribute the scheduling process. For task allocation, we concentrate on how to maximize processor utilization and provide fair scheduling opportunities, while maintaining the distribution.
3 DQT

DQT is a distributed tree structure for process scheduling management. Each node of a DQT has a process run queue. Every process in the queue requires a number of processors not exceeding the partition size of the node. The DQT structure should reflect the nesting of the dynamic partitioning, and each DQT node should be distributed to a processor in the partition corresponding to the node. When a process is suspended, the process should be dequeued from the process run queue; in DQT, this queue operation is needed only in the processor that plays the role of the DQT node. Usually, the root node corresponds to the full-size partition. If a partition consists of four subpartitions, then the node corresponding to the partition should have four subnodes corresponding to the subpartitions. Figure 1 shows an example of a DQT. Each DQT node has a process run queue, represented by a rectangle on the right side of the node, with the width of the rectangle equal to the length of the queue. The root node, N0, is responsible for the entire processor space (the full partition). Each of nodes N1 and N2 is responsible for a halved partition, and each of nodes N3, N4, N5 and N6 is responsible for a quartered partition.

Figure 1: Example of DQT

Table 1: Example of Dynamic Partitioning with TSS

Table 1 shows a DQT scheduling example corresponding to the DQT in Figure 1. In this table, the jth process in the queue Q_i of the ith node is denoted as "Q_i(j)". The entire processor space is assigned to Q_0(0) at time slot 0 and to Q_0(1) at time slot 1. In time slot 2, halved partitions are assigned and two processes are running simultaneously in the adjacent partitions. In time slot 3, the right-hand halved partition is halved again, while the left-hand halved partition is left as is, since there are two processes in queue Q_1. Every process is scheduled at least once in six time slots in this case.
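To make the node numbering and the per-node run queues concrete, the following sketch models a DQT as a width-first-numbered array of nodes, anticipating the numbering formalized in the next paragraph. This is our own illustration, not code from the paper; the names DQTNode, build_dqt, and run_queue are assumptions.

```python
from collections import deque

class DQTNode:
    """One DQT node: a run queue for processes that fit its partition."""
    def __init__(self, partition_size):
        self.partition_size = partition_size  # processors managed by this node
        self.run_queue = deque()              # processes of size <= partition_size

def build_dqt(num_processors, w=2, levels=3):
    """Build a width-first-numbered DQT; node 0 is the root (full partition).

    With branching factor w, the subnodes of node i are nodes
    w*i + 1, ..., w*i + w, and each subnode manages 1/w of its
    supernode's partition.
    """
    total = sum(w ** lv for lv in range(levels))
    nodes = []
    for i in range(total):
        if i == 0:
            size = num_processors
        else:
            size = nodes[(i - 1) // w].partition_size // w  # parent index
        nodes.append(DQTNode(size))
    return nodes

# The DQT of Figure 1: nodes N0..N6 over 4 processors.
dqt = build_dqt(4)
print([n.partition_size for n in dqt])  # [4, 2, 2, 1, 1, 1, 1]
```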
A DQT can be represented as $w^h$, where $w$ is the number of branches from a node and $h$ is the number of levels (the height) of the DQT. The partition size $S_i$ corresponding to the $i$th DQT node and the partition sizes of its subnodes have the following relation:

$$S_i = w \cdot S_{wi+k}, \qquad k = 1, \ldots, w$$

assuming the 0th node to be the root node and the nodes to be ordered in width-first fashion, as shown in Figure 1. It should be noted that the height of a DQT represents the magnitude of the partition size. For example, on a parallel machine with 1,024 processors where the smallest partition consists of 8 processors, only 7 levels of binary DQT are enough. Therefore, the height of a DQT can be a configuration or design parameter of a system.

So far, DQT has been represented as a binary tree. In a 2-D mesh-connected parallel machine, however, a quad-tree can be selected, and in a 3-D mesh-connected parallel machine an oct-DQT is possible. The tree structure of DQT can be an n-ary tree, depending on the topology of the network and the partitioning within the network. Generally, the larger the $w$, the greater the possibility of internal fragmentation. More complicated DQT structures, for example, a 2-3 tree structure, other buddy structures [9], or an unbalanced tree structure, thus become possible. Such complicated DQTs, however, are a research topic for the future and fall outside the scope of this paper. For simplicity, this paper refers to binary DQTs ($2^h$) as examples.

Implementing time-sharing on DQT is more difficult than on a centralized queue system. This complexity is incurred by distributing the process run queue. A DQT node may be activated or deactivated by its supernode. Only processes in activated nodes can run simultaneously. Activated nodes also represent the partitioning at that time. No more than one node can be activated on the path from the root node to a leaf node at any time. In DQT, where to allocate a task is another problem. To achieve high processor utilization, fair scheduling, and better response, it is very important to balance a DQT. Balancing a DQT is a kind of static load balancing.

In this section, we first define terminology in the next subsection. Then, we describe the communication protocol used to realize round-robin scheduling in Subsection 3.2. In the final subsection, several task allocation policies are proposed.

3.1 Terminology
Empty DQT: If there is no process in a DQT, it is called an empty DQT.

Sub-DQT: A sub-tree of a DQT is called a sub-DQT.

Total Queue Length of a Branch (TQLB): Let us define an integer value, called the Total Queue Length of a Branch (TQLB), for each node in a DQT. The TQLB of the $i$th node ($B_i$) is recursively defined as follows:
$$B_i = \begin{cases} L_i & \text{for } i = 0 \\ L_i + B_{\lfloor (i-1)/w \rfloor} & \text{otherwise} \end{cases}$$

where $L_i$ is the queue length of the $i$th node.
Figure 2 shows an example of the TQLBs of a DQT. The maximum TQLB (Max-TQLB) is a very significant number in DQT, since every process in a DQT is guaranteed to be scheduled at least once during the Max-TQLB time slots.
Figure 2: Example of TQLB
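As a minimal sketch, the TQLB recursion and the Max-TQLB can be computed from the per-node queue lengths, assuming the width-first array layout used above; the function names are ours.

```python
def tqlb(queue_lengths, w=2):
    """B_0 = L_0; otherwise B_i = L_i + B_parent, parent = (i - 1) // w."""
    b = []
    for i, length in enumerate(queue_lengths):
        b.append(length if i == 0 else length + b[(i - 1) // w])
    return b

def max_tqlb(queue_lengths, w=2):
    """Max-TQLB: the largest TQLB over the leaf nodes. Every process is
    guaranteed to be scheduled at least once within this many time slots."""
    b = tqlb(queue_lengths, w)
    num_leaves = (len(queue_lengths) * (w - 1) + 1) // w  # leaves of a full w-ary tree
    return max(b[-num_leaves:])

# Queue lengths L_0..L_6 for a 3-level binary DQT.
print(max_tqlb([2, 2, 1, 1, 0, 1, 2]))  # branch totals 5, 4, 4, 5 -> 5
```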
Processor utilization ($U_t$): Processor utilization at time $t$ is defined as $U_t = P_t / P$, and the average processor utilization ($\bar{U}$) from time $t_1$ to time $t_2$ is defined as follows:

$$\bar{U} = \sum_{t=t_1}^{t_2} P_t \Big/ \left( P \times (t_2 - t_1 + 1) \right)$$

where $P_t$ is the number of possibly busy processors (that is, the sum of the numbers of processors in activated partitions) at the $t$th time quantum, and $P$ is the number of processors in the entire system.
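Restated as code (a sketch; the list name busy is our own, holding the sampled P_t values):

```python
def average_utilization(busy, num_processors):
    """Average utilization: the sum of P_t over P * (t2 - t1 + 1).

    busy[t] is the number of possibly busy processors at time quantum t;
    num_processors is P, the size of the entire machine.
    """
    return sum(busy) / (num_processors * len(busy))

print(average_utilization([4, 4, 2, 3], 4))  # 0.8125
```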
Fairness: Fairness in a DQT is how equal the opportunity for scheduling is for every process.

Balanced DQT: If every TQLB in a DQT that is not empty is of the same length, then the DQT is called a balanced DQT. In a balanced DQT, the processor utilization is 100%, and every process in the DQT has exactly the same scheduling opportunity.
Fulfilled DQT: If, for every node of a DQT that is not empty, the subnodes are either all root nodes of empty sub-DQTs or all root nodes of non-empty sub-DQTs, then the DQT is called a fulfilled DQT (see Figure 3 for an example). If a node's subnodes all root empty sub-DQTs but the node itself has a non-empty queue, then the subnodes will never be scheduled and processor utilization is not degraded. If any node has a subnode that roots an empty sub-DQT alongside non-empty ones, however, then an idle partition corresponding to that subnode will exist when those subnodes are activated; this situation degrades processor utilization. A utilization rate of 100% can be obtained in a fulfilled DQT. Note, however, that a fulfilled DQT might not be a balanced DQT.

Figure 3: Example of Fulfilled DQT (blank nodes indicate those having empty queues)
3.2 Round-robin scheduling
The round-robin scheduling process of DQT is distributed over the DQT nodes. Each node communicates and synchronizes only with its supernode and subnodes, although every node and every processor can communicate with each other directly. Each DQT node changes its state according to the state transition diagram shown in Figure 4.

Figure 4: State Transition of a DQT Node

The Idle state indicates that the queue of a node is empty. A Rounding node is activated and one of the processes in its queue is running. A Rounded node has been activated and all processes of that node have been scheduled at least once in the round. A Ready node has not yet been activated in the round.

Figure 5 is a diagram of the structure of the queue in a DQT node. Unlike a centralized queue, it has a current process pointer. In a Rounding DQT node, the current process pointer proceeds by one along the queue at every time quantum. When the pointer reaches the end of the queue, the node enters the Rounded state and the pointer points back to the first process in the queue. If a Rounding node is deactivated and becomes Ready, the current process pointer merely holds its current position, so that the process after the current one will be scheduled when the node is next activated by a next-turn or activate message (explained later). The enqueue operation inserts a new entry into the queue, and the dequeue operation removes the corresponding entry from the queue.

Figure 5: Queue Structure of a DQT Node

The communication messages are described below. The down-arrow (↓) and up-arrow (↑) suffixes indicate that a message goes down toward a leaf or up toward the root, respectively.

add-task (↓): When a new task arrives, an add-task message carrying the task size is sent to the root node to find an appropriate partition for the task, and a reply message will be returned with a partition ID. The task size is rounded up to a partition size greater than or equal to the task size. The message goes down the DQT until it reaches a node having a partition size equal to the (rounded) task size, and the task is then enqueued at that node; this takes O(log N) steps, where N is the number of nodes of the DQT. Which subnode to descend to is chosen by the task allocation policy.

next-turn (↓): Switch to the next process in round-robin order. This message is generated by an interval timer at each time quantum in each Rounding node. If the current process is at the end of the queue of the node, or if the queue is empty, then the message is forwarded to all the subnodes and the status becomes Rounded; otherwise, the next process in the queue of the node is scheduled. Subsequent to sending this message, one of the following reply messages is returned to the sender.

rounded (↑): This message is returned when the target subnode is in the Rounded state and has received rounded messages from every one of its own subnodes. This reply indicates that scheduling of the next process in the subnode has failed, which implies that all processes in the sub-DQT rooted at the target subnode have already been scheduled at least once.

rounding (↑): This indicates that scheduling of the next process in the subnode was successful.

If a message sender other than the root node has received rounded messages from all its subnodes, it forwards the rounded message to its supernode. If the root node has received rounded messages from all subnodes, it sends an activate message to itself to turn the DQT around again; otherwise, it sends an activate message to the node that replied rounded. This strategy keeps the processors busy, but may result in unfair scheduling opportunities, because a sub-DQT with a light load turns around more often than the parts with a heavy load. At the end of every round, next-turn messages are propagated to the leaf nodes and rounded messages go back to the root node, after which an activate message is generated to enter the new round.

activate (↓): Activate the node if its queue is not empty; if the queue is empty, forward the activate message to all subnodes. The node activation procedure consists of the following steps.

1. Send deactivate messages to all subnodes.
2. Wait until all subnodes have been deactivated.
3. Schedule the current process in the queue and change the node state to Rounding.

deactivate (↓): Deactivate the node. If the receiver node is in the Rounding state, the running process is suspended and the node state becomes Ready; otherwise, the message is forwarded to all subnodes.

info (↑): As the action subsequent to process creation or termination, this message carries information for task allocation to the supernode. It does not invoke a state transition; the type of information sent depends on the task allocation policy. This message need not be replied to and can be asynchronous. The communication delay of this message may degrade the performance of task allocation, but this does not present a severe problem. It is not necessary to send an info message every time a process is enqueued or dequeued, because a suspended process that is waiting for disk I/O completion is resumed in a much shorter time than the interval of task entry or end.

task-done: task-done is not a message for inter-node communication, but an event that may trigger a state transition of a node. The event is generated in a node to indicate that a process has been dequeued. As a result, if the queue becomes empty, the node state becomes Idle. Note that DQT scheduling and process scheduling may not be synchronized; this event may be generated in any state other than Idle.
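A sequential sketch of the next-turn handling may clarify the protocol. This is our own simplification, not the paper's implementation: the distributed messages are reduced to method calls on a tree of node objects, activation and deactivation are left out, and the class and attribute names are assumptions.

```python
from collections import deque

IDLE, READY, ROUNDING, ROUNDED = "Idle", "Ready", "Rounding", "Rounded"

class DQTSchedNode:
    def __init__(self, subnodes=()):
        self.queue = deque()     # process run queue
        self.current = 0         # current process pointer
        self.state = IDLE
        self.subnodes = list(subnodes)

    def next_turn(self):
        """Handle a next-turn message; reply 'rounding' or 'rounded'."""
        if self.queue and self.current < len(self.queue) - 1:
            self.current += 1    # schedule the next process in the queue
            self.state = ROUNDING
            return "rounding"
        # End of the queue (or an empty queue): become Rounded and
        # forward the next-turn message to all subnodes.
        self.state = ROUNDED
        self.current = 0         # pointer back to the first process
        if not self.subnodes:
            return "rounded"
        replies = [sub.next_turn() for sub in self.subnodes]
        # Reply 'rounded' only when every subnode has finished its round.
        return "rounded" if all(r == "rounded" for r in replies) else "rounding"
```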
3.3 Task allocation policies

The performance of a task allocation policy can be measured by the processor utilization and the fairness of scheduling. In a low-load situation, when the DQT is not yet fulfilled, processor utilization should be the major policy concern; in a high-load situation, when the DQT is already fulfilled, fairness should be the major concern. A single policy can be contradictory between these two situations, so a combination of complementary policies, for example, one that is good in high-load situations with another that performs well in low-load situations, is a good idea.

Various task allocation policies are possible. The procedure of the add-task message and the type of information gathered by the info message change with the policy. Complex policies that require large amounts of data result in increased overhead for gathering information when processes are enqueued and dequeued; it is, however, easy to imagine that having precise data results in a good policy. The choice of policy is thus a trade-off between the overhead and the performance of the policy. The following is a subset of the policies we have evaluated, selected because they give some insight into DQT; we add comments on the simulation results described in the next section.

Round-Robin (RR) policy: This policy is the simplest, but simulation results show that its performance is very poor. Every node remembers the direction of the subnode to which the add-task message was forwarded last time; each time an add-task message goes through, the target subnode is the one next to the remembered subnode. The order of the subnodes can be chosen arbitrarily. In this policy, no info message is generated, and no node has any information about other nodes for task allocation. The poor performance of this policy shows that the task allocation policy is very important.

(MAX) policy: Let us define integer values called sub-Max-TQLBs. The sub-Max-TQLB of the ith node ($M_i$) is recursively defined as follows:

$$M_i = \begin{cases} L_i & \text{for a leaf node} \\ L_i + \max_{1 \le k \le w} M_{wi+k} & \text{otherwise} \end{cases}$$

(MIN) policy: In contrast to the MAX policy, this policy performs well in high-load situations, but poorly in low-load situations. This is because almost every sub-Min-TQLB is zero in low-load situations.
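The RR policy described above can be sketched directly; the same descent skeleton, with a different choice rule at each node, accommodates the other policies. The class and function names below are ours, not from the paper.

```python
class AllocNode:
    """Task-allocation view of a DQT node, with the RR policy's state."""
    def __init__(self, partition_size, subnodes=()):
        self.partition_size = partition_size
        self.subnodes = list(subnodes)
        self.last = -1  # direction the last add-task message took

def add_task_rr(node, task_size):
    """Descend from the root to a node whose partition size matches the
    (rounded-up) task size, choosing subnodes in round-robin order."""
    if not node.subnodes or node.subnodes[0].partition_size < task_size:
        return node  # the task is enqueued at this node
    node.last = (node.last + 1) % len(node.subnodes)  # next direction
    return add_task_rr(node.subnodes[node.last], task_size)
```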
4 Simulation
4.1 Task allocation policies

As described in the previous section, DQT often exhibits varying behavior according to the system workload. To clarify this, we ran simulations with a ramp system workload and with constant system workloads in a low-load situation and a high-load situation.
Ramp workload

Figures 6 and 7 are graphs of the simulation results for the APA and BF policies, respectively. In these cases, a binary DQT having 7 levels (up to 128 partitions) is the target. The time quantum is one time unit, and a next-turn message is sent to the root at every time unit. Each task size is rounded up to the nearest partition size, in this case a power of 2. The processor utilization numbers shown in these simulation results are optimistic figures, since we ignore internal fragmentation and processors idled by communication delay, synchronization, and/or waiting for an I/O operation. The task size distribution is inversely proportional to the rounded size of the task. Tasks that require the full configuration (128 smallest partitions, in this case) are omitted, because they behave in the same way as a single queue system. The distribution of the ideal execution time is a uniform distribution from 500 to 19,999 time units. Let us define a fraction, called the Workload Factor ($F_W$), as:

$$F_W = \frac{\sum_{q=1}^{Q} \text{task-size}_q \times \text{task-length}_q}{P \times T}$$

where, for the $q$th task, task-size$_q$ is the number of processors in the partition where the task is processed, task-length$_q$ is the ideal task execution time, and $Q$ is the number of tasks entered during the simulation time $T$. To attain the ramp workload pattern, the task entry interval is tuned every 5% of the simulation time. Every simulation begins with an empty DQT, with the initial workload factor set to approximately 5% of the maximum workload and stepped up by 5% every 5% of the overall simulation time. The maximum workload factor at the last step is approximately 133% of the entire processing power. The workload, however, is irregular, since the task length and task size are randomized. Task entry is stopped at 2,000,000 time units, although the simulation continues until 2,200,000 time units. Neither communication delay nor process switching overhead is taken into account. If a policy cannot determine to which subnode to go, then a fixed (leftmost, for example) node becomes the target in the simulations. The task entry pattern is exactly the same in each simulation. The Real Execution Time Ratio (RETR) of the $q$th task is defined as:

$$\mathrm{RETR}_q = \frac{t_q^{\text{task-end}} - t_q^{\text{task-entry}}}{\text{task-length}_q}$$

where $t_q^{\text{task-entry}}$ is the time of the $q$th task entry, and $t_q^{\text{task-end}}$ is the time of the task end. At the end of each task execution, the RETR of the task is plotted as a tiny dot. Theoretically, the highest possible RETR does not exceed the maximum of Max-TQLB during the task execution time. The range between the maximum and minimum values of RETR in a certain time range indicates the unfairness. In the figures, the upper line represents the Max-TQLB and the lower line represents the Min-TQLB; both lines are sampled at the end of each task. Generally, less space between the Max-TQLB line and the Min-TQLB line indicates better processor utilization and fairness.
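The two measures restated as code (a sketch; the argument names are ours):

```python
def retr(entry_time, end_time, ideal_length):
    """Real Execution Time Ratio of one task: elapsed over ideal time.

    RETR is 1.0 for a task that suffered no time-sharing delay, and it
    is bounded by the maximum Max-TQLB during the task's lifetime.
    """
    return (end_time - entry_time) / ideal_length

def workload_factor(tasks, num_processors, sim_time):
    """Workload Factor F_W: offered work over total processing power.

    tasks is a sequence of (task_size, task_length) pairs; F_W > 1
    means more work is offered than the machine can possibly deliver.
    """
    offered = sum(size * length for size, length in tasks)
    return offered / (num_processors * sim_time)
```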
Figure 6: APA Policy (ramp workload, 2^7 DQT)

The APA policy is good at suppressing the Max-TQLB in both low-load and high-load situations (Figure 6). With the BF policy (Figure 7), the Max-TQLB tends to be a multiple of the number of levels of the DQT, as described above. The policy in Figure 8, "BF & APA", is a combination of the BF and APA policies: if the BF policy cannot determine to which subnode to go, then the APA policy is applied. This combination behaves like the APA policy in low-load situations, and provides a shorter Max-TQLB in high-load situations than does the BF policy.
Figure 7: BF Policy (ramp workload, 2^7 DQT)

Figure 8: BF & APA Policy (ramp workload, 2^7 DQT)

Constant workload

In Table 2, the average processor utilization ("Util." in the table) and the Max-TQLB ("Max. TQLB") obtained by simulation are compared for each policy. In these simulations, the task entry interval is constant, the simulation time being 1,000,000 time units. Both a low-load situation (average workload factor, "WF", of 0.368) and a high-load situation (average workload factor of 0.793) are simulated. The processor utilization ratio of 0.366 in the low-load situation is very close to the theoretical maximum ratio.

Table 2: Comparison of Task Allocation Policies (constant workload, 2^7 DQT)

              WF = 0.368         WF = 0.793
Policy        Util.  Max. TQLB   Util.  Max. TQLB
MAX           0.366  4           0.768  10
BF            0.363  7           0.768  8
BF & APA      0.366  3           0.776  7

In the low-load situation, the processor utilization ratios of the MAX, MIN and APA policies are almost the same, but the maxima of their Max-TQLBs are not. In most cases, in low-load or high-load situations, the larger the maximum of Max-TQLB, the higher the possibility of unfairness. The combination of the BF and APA policies provides the highest processor utilization in a low-load situation and the shortest Max-TQLB in a high-load situation among the policies in the table.

4.2 DQT-TSS vs. FCFS-Batch

Figure 9 shows a simulation result for First Come First Served (FCFS) batch scheduling having a single queue, with a ramp workload. In this simulation, the dynamic partitioning pattern and the task entry pattern are exactly the same as in the above simulations. The algorithm for task allocation is the binary buddy [5, pp. 435-455]: if an idle partition is found, the task is allocated to that partition; otherwise, the task is enqueued. In Figure 9, the length of the queue (denoted "QL"), which differs from that of the process run queue, and the RETRs are plotted. The Y-axis is log-scaled because of the larger magnitude.
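The allocation step of the batch scheduler can be sketched as a binary-buddy search. This is our own simplified counting model of the algorithm cited from Knuth [5]: it tracks only how many idle partitions of each size exist, whereas a real allocator also records partition positions so that buddies can be re-merged on release.

```python
def buddy_alloc(free_lists, size, total=128):
    """Try to allocate an idle partition of `size` processors (a power
    of two). Larger idle partitions are split into buddy halves until
    one of the requested size exists. Returns True on success; on
    failure the FCFS scheduler leaves the task in its queue.
    """
    s = size
    while s <= total and free_lists.get(s, 0) == 0:
        s *= 2                    # look for a larger idle partition to split
    if s > total:
        return False              # no idle partition anywhere
    free_lists[s] -= 1
    while s > size:               # split down, leaving one buddy idle
        s //= 2
        free_lists[s] = free_lists.get(s, 0) + 1
    return True

# 128 smallest partitions, initially merged into one full idle partition.
free = {128: 1}
print(buddy_alloc(free, 16), free)  # True {128: 0, 64: 1, 32: 1, 16: 1}
```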
Figure 9: FCFS Batch Scheduling (ramp workload, same partitioning as 2^7 DQT)
In Table 3, the processor utilization ratios for FCFS batch scheduling and the DQT policies are compared. Generally, a processor utilization ratio of 100% cannot be obtained in most cases for batch scheduling, while DQT time-sharing may achieve a 100% processor utilization ratio in high-load situations.

Table 3: Batch vs. TSS Scheduling (ramp workload, 2^7 DQT)

DQT Policy    Util.   Max. TQLB   Figure
MAX           0.544   19
MIN           0.552   16
APA           0.550   16          6
BF            0.552   17          7
BF & APA      0.554   14          8

Batch Policy  Util.   Max. QL     Figure
FCFS          0.464   798         9

Table 4 lists the simulation results for the same simulation conditions as Table 2 (constant workload). Comparing Tables 2 and 4, FCFS batch scheduling performs as well as DQT in low-load situations, but is much worse in high-load situations, especially regarding the queue length. This is because, before scheduling, larger tasks are forced to wait for the space occupied by smaller task(s), while after scheduling, larger tasks tend to block latecoming tasks. With DQT scheduling, by contrast, the external fragmentation can be canceled by latecoming task(s).

Table 4: FCFS Batch Scheduling (constant workload, same partitioning as 2^7 DQT)

WF = 0.368          WF = 0.793
Util.   Max. QL     Util.   Max. QL
0.366   13          0.687   361
5 Discussion

It is not entirely fair to compare the processor utilization ratios in Table 3, because the DQT simulations do not take time-sharing overhead into account. The time-sharing overhead can be estimated from the ratio of the process switching time to the time quantum. The time quantum of a time-sharing system should be determined as a trade-off between responsiveness and overhead. It is, however, very hard to estimate the overhead of switching a process on a parallel machine, since many factors are involved, for example, the characteristics of the communication network, communication delay, and the context switching overhead on a processor, depending on the target machine. Some kind of architectural support is required to realize efficient time-sharing. The CM-5 supports an All Fall Down mode in which the messages in the network are forced to fall down to the nearest processor's memory [11]. In RWC-1 [10], a MIMD, distributed memory parallel machine scalable up to 1,024 processors that is under development in our RWC project, some kind of hardware support will be implemented. According to our study of the effect of architectural support on time-sharing, the process switching overhead can be comparable with that of sequential machines. We believe that it is worth implementing time-sharing using DQT despite the overhead: with architectural support and DQT, the loss caused by the process switching overhead can be less than the loss of processor utilization in batch scheduling.

Feitelson and Rudolph also proposed a hierarchical process control scheme for multiprocess environments [3]. The basic concept and goals are very similar to those of DQT, but with a different approach and assumptions. Their scheme and our DQT are very similar in their distribution and scalability, but their control structure is based on an X-tree, while our DQT is based on a simple tree. The major difference is in the task allocation algorithms. In their scheme, a process creation request goes up until it reaches the appropriate level, and then tries to find a node at this level to balance the load. In DQT, on the other hand, every process creation request begins from the root DQT node. It is easier to find an optimal DQT node from the root node than with the diffusion algorithms they propose, since the root node always maintains global information. In DQT the root node could be a bottleneck, but the queueing operations that occur when processes interact with their environment are much more frequent than process creation and termination, and they are handled locally at each node; thus the possible bottleneck is not severe.

In their scheme, a thread is the unit of control, and a group of threads is the unit of gang-scheduling. We, however, are targeting a fine-grain parallel machine, such as RWC-1 [10]. Threads on such a machine are so fine that intervention of the operating system should be avoided. In our DQT, a process (a "thread group" in the terms of their paper) spreading over a partition is the unit of (gang-)scheduling, and we assume a kind of hardware support to implement gang-scheduling. As a result, the partitioning is constrained, for example, to powers of 2 processors.

The alternative structures of DQT depend on the topology of the network and the partitioning of the network of the target machine. There may be a program that requires a partition with a shape that cannot be provided by the DQT on the machine. This problem, however, can be avoided by providing a virtual topology and/or virtual processors supported by a parallel programming language (HPF is an example) or a parallel library (MPI [7] is an example).
6 Summary

We have proposed a time-sharing scheduling management scheme called the Distributed Queue Tree (DQT) for MIMD, distributed memory, dynamically partitionable parallel machines. One of the most important features of DQT is that, in high-load situations, it can provide not only better processor utilization but also a much shorter waiting time for service than batch scheduling, whereas with batch scheduling the queue length becomes catastrophic in such situations. DQT may also avoid the bottleneck that can be a severe problem with centralized process queue management.

Some fundamental behavior of DQT has been studied, and several task allocation policies have been proposed. According to the above simulation results, the combination of the BF and APA policies is the best policy in both low-load and high-load situations. It is, however, very difficult to say in general which policy is best, since there are many aspects to consider, including processor utilization, scheduling fairness, overhead, workload level, and workload shifting.

The following topics are scheduled to be studied in the future: i) to implement a practical time-sharing system, priority scheduling should be introduced; ii) to give more precise insight into DQT, its stochastic behavior should be analyzed; and iii) the task allocation policies should be further investigated by simulation and mathematical analysis.

Our DQT is going to be implemented in the operating system kernel, named SCore [4], on the massively parallel machine RWC-1 [10].
Acknowledgement

We thank Prof. Akinori Yonezawa, Chairman of the Massively Parallel Software Workshop of RWC, and all members of the workshop. We also thank Dr. Sakai, Director of the RWC Massively Parallel Architecture Laboratory, and the staff of the laboratory.

References
[1] M.-S. Chen and K. G. Shin. Subcube Allocation and Task Migration in Hypercube Multiprocessors. IEEE Transactions on Computers, 39(9):1146-1155, 1990.

[2] P.-J. Chuang and N.-F. Tzeng. A Fast Recognition-Complete Processor Allocation Strategy for Hypercube Computers. IEEE Transactions on Computers, 41(4):467-479, 1992.

[3] D. G. Feitelson and L. Rudolph. Distributed Hierarchical Control for Parallel Processing. COMPUTER, pages 65-77, May 1990.

[4] A. Hori, Y. Ishikawa, H. Konaka, M. Maeda, and T. Tomokiyo. Overview of Massively Parallel Operating System Kernel SCore. Technical Report TR-93003, Real World Computing Partnership, 1993.

[5] D. E. Knuth. The Art of Computer Programming, Volume 1: Fundamental Algorithms. Addison-Wesley, 1968.

[6] K. Li and K.-H. Cheng. A Two-Dimensional Buddy System for Dynamic Resource Allocation in a Partitionable Mesh Connected System. Journal of Parallel and Distributed Computing, 12(5):79-83, May 1991.

[7] Message Passing Interface Forum. DRAFT: Document for a Standard Message-Passing Interface, November 1993.

[8] J. K. Ousterhout. Scheduling Techniques for Concurrent Systems. In Proceedings of the Third International Conference on Distributed Computing Systems, pages 22-30, 1982.

[9] J. L. Peterson and T. A. Norman. Buddy Systems. Communications of the ACM, 20(6):421-431, June 1977.

[10] S. Sakai, K. Okamoto, H. Matsuoka, H. Hirono, Y. Kodama, and M. Sato. Super-threading: Architectural and Software Mechanisms for Optimizing Parallel Computation. In Proceedings of the 1993 International Conference on Supercomputing, pages 251-260, 1993.

[11] Thinking Machines Corporation. Connection Machine CM-5 Technical Summary, November 1992.

[12] Y. Zhu. Efficient Processor Allocation Strategies for Mesh-Connected Parallel Computers. Journal of Parallel and Distributed Computing, 16:328-337, 1992.