Comparing Gang Scheduling with Dynamic Space Sharing on Symmetric Multiprocessors Using Automatic Self-Allocating Threads (ASAT)

Charles Severance
Michigan State University
East Lansing, Michigan, USA
[email protected]

Richard Enbody
Michigan State University
East Lansing, Michigan, USA
[email protected]

Abstract

This work considers the best way to handle a diverse mix of multi-threaded and single-threaded jobs running on a single Symmetric Parallel Processing system. The traditional approaches to this problem are free scheduling, gang scheduling, or space sharing. This paper examines a less common technique called dynamic space sharing. One approach to dynamic space sharing, Automatic Self-Allocating Threads (ASAT), is compared to all of the traditional approaches to scheduling a mixed load of jobs. Performance results for ASAT scheduling, gang scheduling, and free scheduling are presented. ASAT scheduling is shown to be the superior approach to mixing multi-threaded work with single-threaded work.

1. Introduction

When a parallel processing system is processing a mix of different types of jobs, some scheduling approach is needed so that the overall utilization of the system is maximized. Operating systems on Symmetric Multiprocessors are generally capable of handling a large number of competing single-threaded processes efficiently under a wide variety of load conditions. These systems are also capable of supporting multi-threaded compute jobs very efficiently. Multi-threaded compute jobs which need periodic synchronization between their threads run best when each thread has access to dedicated CPU resources. Problems arise when these two types of jobs, single-threaded and multi-threaded, are mixed on the system. In the simplest case, the multi-threaded applications suffer poor performance because of inopportune context switches which increase the time spent waiting for suspended threads at synchronization points. There are two classic solutions to this problem. The first is called "space sharing" or partitioning, where the single-threaded and multi-threaded jobs are separated from one another. Each type of workload is given dedicated resources, and each workload can efficiently utilize its resources. The second approach is to add gang scheduling to the operating system. When gang scheduling is used, the multi-threaded job can assume that all of its threads are running simultaneously even though the job is being time-shared with the other load on the system. The operating system is careful to suspend and dispatch the threads of a process in a roughly synchronized manner. Each approach has its limitations. Because space sharing partitions resources statically, excess resources in one partition cannot be easily utilized in the other partition, and these load imbalances result in poor utilization of the overall resources. Gang scheduling can be difficult to implement in an operating system, and its overhead increases as the number of processors in these systems scales from two processors to over 100 processors. An approach which is both efficient and scalable is to use dynamic space sharing, where the allocation of resources between the single-threaded jobs and the compute jobs is dynamically altered while the system is running. In the remainder of this paper, we survey the existing dynamic space sharing approaches and then compare the performance of one approach to the performance of gang scheduling on an SGI Challenge parallel processing system.

2. Dynamic Thread Adjustment Techniques

The general approach to dynamic space sharing is to increase or reduce the number of active threads in the multi-threaded job(s) when changes in the overall system load are detected. A wide range of highly parallel applications [8] is capable of executing with a varying number of threads throughout the duration of the application. These techniques differ chiefly in the extent to which they use hardware or software and in how they trigger the thread adjustments. The Convex C-Series [2] vector/parallel supercomputers used Automatic Self-Allocating Processors (ASAP) hardware to create new threads at the beginning of each parallel section and destroy them at the end of each section. Cray Research's Autotasking [3] does not create and destroy threads at each parallel section; instead, it dynamically manages the number of executing threads through a combination of hardware, run-time software, iteration scheduling, and operating system support. Scheduler Activations [1] and Process Control [7] are somewhat similar to Autotasking in that they rely on an agreement between the operating system and the run-time library in the multi-threaded task. Automatic Self-Allocating Threads (ASAT) [5,6] and Loop-Level Process Control (LLPC) [8] do not depend on the operating system for notification about the load condition of the system. Both approaches actively track the load of the system and adjust their threads as appropriate. The primary difference between ASAT and LLPC is the way in which they determine system load. LLPC communicates the overall system load information among the LLPC-enabled processes using a shared memory location. ASAT performs a periodic barrier synchronization to determine the load condition and adjusts its threads between parallel sections in the code. In the remainder of this paper, we show performance results which compare dynamic space sharing using ASAT to gang scheduling.
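The timed-barrier idea behind ASAT can be made concrete with a small sketch. The following C/pthreads program is an illustration only: the threshold value, the use of POSIX barriers, and the print-only decision rule are our assumptions, not details taken from the ASAT run time. Each thread records its arrival time at a barrier; on an idle system the threads arrive nearly together, while on a loaded system preempted threads arrive late, so a large arrival skew is the signal to give up a thread.

    /* Illustrative sketch of ASAT-style load sensing with a timed barrier.
       Assumed details: POSIX barriers, the SKEW_LIMIT threshold, and the
       print-only "decision" are ours, not the ASAT run time's. */
    #include <pthread.h>
    #include <stdio.h>
    #include <sys/time.h>

    #define NTHREADS   4
    #define SKEW_LIMIT 0.001          /* assumed arrival-skew tolerance (s) */

    static pthread_barrier_t bar;
    static double arrival[NTHREADS];

    static double now(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec / 1.0e6;
    }

    static void *worker(void *arg)
    {
        long id = (long)arg;

        /* ...the work of one parallel section would run here... */

        arrival[id] = now();          /* record when this thread arrives */
        pthread_barrier_wait(&bar);

        if (id == 0) {                /* one thread evaluates the skew */
            double lo = arrival[0], hi = arrival[0];
            for (int i = 1; i < NTHREADS; i++) {
                if (arrival[i] < lo) lo = arrival[i];
                if (arrival[i] > hi) hi = arrival[i];
            }
            if (hi - lo > SKEW_LIMIT)
                printf("skew %.6fs: system loaded, give up a thread\n", hi - lo);
            else
                printf("skew %.6fs: system idle, a thread could be added\n", hi - lo);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];

        pthread_barrier_init(&bar, NULL, NTHREADS);
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&tid[i], NULL, worker, (void *)i);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(tid[i], NULL);
        pthread_barrier_destroy(&bar);
        return 0;
    }

In a real run time this measurement would be repeated periodically between parallel sections, and the job would actually lower or raise its active thread count rather than print a message.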

3. Performance Results

3.1 ASAT Performance Tests

In this section a series of experiments is performed which demonstrates the effectiveness of ASAT across a wide range of loop sizes and run-time settings. For comparison we use the two common commercial scheduling techniques: free and gang. We also examine how the "load" (the single-threaded jobs) is affected by the scheduling choices used by the parallel jobs.

3.2 Experiment Details

A highly parallel application is used for all the experiments. This application is compiled and executed under a range of run-time scheduling options: the entire computation can be executed in parallel or serial, gang scheduling can be turned on or off, and ASAT thread adjustment (dynamic space sharing) can be turned on or off. The following table summarizes the option settings for the various runs:

Title     Threads   Gang   Management
Single    1         N/A    N/A
ASAT      4         No     ASAT
Gang      4         Yes    Fixed
Free      4         No     Fixed

Table 1 - Types of Run-Time Choices

3.2.1 Code Structure

The basic structure of the code is a parallel inner loop within a serial outer loop:

      DO I = 1,EXCOUNT
C       Perform ASAT adjustment if appropriate
C$PAR PARALLEL
C$PAR& SHARED(A,B,C) LOCAL(J)
C$PAR PDO
        DO J = 1,GRAINSIZE
          A(J) = B(J) + C(J)
        ENDDO
C$PAR END PARALLEL
      ENDDO
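For comparison, a rough C equivalent of this kernel is sketched below. This translation is ours: the experiments in the paper used the KAI Guide directives shown above, and the OpenMP pragma here simply stands in for C$PAR PARALLEL/PDO. The comment in the outer loop marks the point where the ASAT run time would adjust the thread count.

    /* Illustrative C/OpenMP translation of the Fortran benchmark kernel.
       GRAINSIZE and EXCOUNT values are taken from the first row of
       Table 2 below; both are varied across the experiments. */
    #include <stdlib.h>

    #define GRAINSIZE 2000L     /* inner (parallel) loop length */
    #define EXCOUNT   200000L   /* outer repetitions; EXCOUNT * GRAINSIZE
                                   is held constant across experiments */

    int main(void)
    {
        double *a = malloc(GRAINSIZE * sizeof *a);
        double *b = malloc(GRAINSIZE * sizeof *b);
        double *c = malloc(GRAINSIZE * sizeof *c);
        for (long j = 0; j < GRAINSIZE; j++) { b[j] = 1.0; c[j] = 2.0; }

        for (long i = 0; i < EXCOUNT; i++) {
            /* ASAT adjustment would happen here, between parallel sections */
            #pragma omp parallel for
            for (long j = 0; j < GRAINSIZE; j++)
                a[j] = b[j] + c[j];
        }

        free(a); free(b); free(c);
        return 0;
    }

(Compiled with an OpenMP-capable compiler, e.g. cc -fopenmp; without OpenMP the pragma is ignored and the loop simply runs serially.)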

In order to test the effect on programs with different memory access patterns and loop durations, the inner loop length (GRAINSIZE) is varied. This inner loop length is called the "grain size" as it determines the granularity of the parallel sections. The number of iterations of the inner parallel loop can be adjusted from 1K to 4M. The size of the data structure used in the loop is also adjusted; varying the data structure size affects how much of the data accessed by the application actually resides in the cache of the system. In order to process the same "work", the number of outer loop executions (EXCOUNT) is decreased as the inner loop length (GRAINSIZE) is increased. The following table relates the parameters.

Grain Size   Count     Time       Data
2K           200,000   0.00035s   48K
10K          40,000    0.0022s    240K
100K         4,000     0.035s     2.4M
1M           400       0.35s      24M
4M           100       1.4s       96M

Table 2 - Parameters Relative to Grain Size
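A quick consistency check on Table 2 (our arithmetic, implied rather than stated by the table): every row performs the same total number of inner-loop iterations, EXCOUNT × GRAINSIZE = 4 × 10^8 (for example, 200,000 × 2K = 400M and 100 × 4M = 400M), and the Data column matches three 8-byte arrays of length GRAINSIZE (3 × 8 bytes × 2K = 48K, up to 3 × 8 bytes × 4M = 96M).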

3.2.2 Execution Environment

The compiler used for these tests is a Beta version of the Kuck and Associates Guide compiler with the Flow (ASAT) run-time extensions (Guide 2.00 k270721 960606). The system used for these tests is an SGI Challenge with the following attributes: IRIX 6.2; four 150 MHz R4400 processors; a 16K D-cache; a 16K I-cache; a 1M unified secondary cache; and 384 Mbytes of 2-way interleaved main memory.

3.3 Running Jobs on an Empty System

The following figures show the performance of the different jobs on an empty system for various grain sizes.

[Figure 1 - Runs on Empty System: running time in minutes (00:00 to 02:30) versus grain size (0 to 500, in K) for the Single, ASAT, Gang, and Free runs.]

As expected, in Figure 1 the parallel jobs on an empty system have essentially the same running time regardless of basic scheduling choice (ASAT, free, or gang). In general, the parallel jobs execute considerably faster than the single-threaded run. One can see the effect of the first and second levels of cache as jumps in the graph of the single-threaded run. While even the smallest loop at 2K (48K working-set size) will not completely fit into the 16K L1 cache, it fits in the L2 cache and the L1 cache can hold much of the data. Between 50K and 100K in the single-threaded run, the data structure can fit in the 1MB L2 cache. Above 200K, none of the data structure fits in any of the caches from iteration to iteration, and the application executes at main-memory speeds. To see the speedup of the parallel application over the serial application more clearly, and to factor out some of the cache effect, the vertical axis in the following figures indicates performance as a ratio relative to the single-threaded execution time on an empty system.

[Figure 2 - Speedup for Parallel Jobs on Empty System (Expanded Vertical Axis): ratio to single-threaded time on an empty system (0 to 0.5) versus grain size (0 to 500, in K) for the ASAT, Gang, and Free runs.]

In Figure 2, the benefits and effects of parallelism on this application are shown. The first observation is that the performance of ASAT tracks the performance of gang scheduling very closely. Gang scheduling has a benefit over ASAT only for very small loops.
