RC 20151 (08/07/95) Computer Science

IBM Research Report

Application Oriented Resource Management on Large Scale Parallel Systems

K. Ekanadham, J. Moreira, and V. K. Naik

IBM Research Division
T.J. Watson Research Center
Yorktown Heights, New York
LIMITED DISTRIBUTION NOTICE This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties).
IBM Research Division: Almaden, T.J. Watson, Tokyo, Zurich
APPLICATION ORIENTED RESOURCE MANAGEMENT ON LARGE SCALE PARALLEL SYSTEMS

K. Ekanadham, J. Moreira, and V. K. Naik
IBM T. J. Watson Research Center, P. O. Box 218, Yorktown Heights, NY 10598
e-mail: {eknath, moreira, [email protected]}

Reprinted with permission from Proceedings of the 1995 ICPP Workshop on Challenges for Parallel Processing, Dharma P. Agrawal, Editor, pages 56–63. Copyright © 1995 CRC Press, Boca Raton, Florida.
Abstract — Large scale parallel systems are usually shared by many users with varying demands. The traditional scheme for supporting multiprogramming has been to statically divide the resources of a parallel system into independent partitions and assign one partition to each parallel job. These partitions are then fixed in size for the duration of each job, causing two severe problems: the operating system cannot dynamically reconfigure the system to increase overall throughput, and the user applications cannot resize the corresponding partitions to take advantage of different amounts of parallelism along each phase of computation. The Distributed Resource Management System (DRMS) is presented as an approach to allow dynamic scheduling of resources during the execution of parallel programs. By establishing well-defined interactions between the executing job and the parallel system, it supports dynamic reconfiguration of processor partitions, dynamic distribution and redistribution of data, communication among cooperating applications, and various other monitoring actions. The interactions occur only at specific points in the execution of the program where the aforementioned operations can be performed efficiently.
1 INTRODUCTION
Recent years have seen a growth in the use of large scale parallel systems, those with hundreds and sometimes thousands of processors, for the execution of complex scientific codes. Examples of these systems include the IBM SP2 [7], the Cray Research T3D [2], and the Thinking Machines CM5 [10]. These parallel systems are usually installed at supercomputing centers and shared by many users with varying demands. To accommodate multiple simultaneous users, and thus improve utilization of the machine, most current large scale parallel systems divide the machine into independent groups of processors. Each group of processors, called a partition, can then execute a different parallel job. In some systems (e.g., CM5) the partitions are determined at boot-time and therefore only certain partition sizes are available to the user. In other systems (e.g., SP2 and T3D) a partition is created for each job at the time the job starts executing, which allows more flexibility. However, in all cases the partition is fixed during the job execution. This approach does not allow the operating system or other supervisory process to reconfigure a partition as the usage of the system changes. For example, when the job arrival rate is high and all jobs are treated equally, the partition size that a job can acquire tends, on average, to be small. If a large job is started during this period, then that job is constrained to use a small partition for its entire execution, even if other jobs in the system have finished and more processors have become available. In most cases, the only alternative is to restart the job from the beginning, or not to start execution at all until a sufficient number of processors is available; in either case, the system cannot be fully utilized and fairness among jobs cannot be maintained.

From an application's point of view, its own internal state changes dynamically through the execution. Typically, the computations of a large application tend to go through multiple phases, with different phases having distinct characteristics of exploitable parallelism and efficient data distribution. If an application has to execute on a fixed-size partition, then either some phases underutilize the processors or some phases do not exploit all the available parallelism. In either case, the effect is detrimental to the efficiency of the application and also to the throughput of the system. Correspondingly, an application should have access to facilities to distribute and redistribute global data dynamically among the processors of the partition. This feature allows the appropriate distributions to be used at each phase of the computation.

Clearly, to balance the resources dynamically and adaptively among a changing workload, the operating system must understand the nature of the computations and even the data structures used in each application. This interaction between applications and the operating system must take place in an efficient manner; otherwise the performance effects would be unacceptable to both users and system administrators. The conventional means of interaction between an application and the operating system are very limited and tend to have large overheads, primarily because they were designed for a different purpose than the execution of tightly coupled distributed computations. Since the application designer has the most information about the program behavior, we have taken the position that it is best to let the application writers expose the distributed computations and the data structures to the run-time system. The run-time system can in turn interface with the operating system to bring about a coordinated program execution. The more exposure the run-time system has to the course of the program execution, the more efficient the scheduling of resources can be. However, this entails additional programming effort on the part of the users and is typically resisted. Past experience with newer programming paradigms indicates that, to gain user acceptance, the additional programming effort must be relatively small and the benefits must be high. Keeping this in mind, we have designed the Distributed Resource Management System (DRMS) to efficiently manage system resources and at the same time cater to the dynamically changing demands of users. In the following, we present a functional description of DRMS and outline the application oriented approach we have taken to address some of the issues discussed above.
[Figure 1: A simple example illustrating the possible scheduling and interaction points in a program: a loop "for i = 1, n" over timesteps that performs a Relaxation and then computes an ErrorNorm in each iteration i, i+1, ...; arrows a and b mark the points before and between the two operations.]
2 USER APPLICATIONS

To explain our application oriented resource management approach, we first consider three representative applications that allow us to bring out characteristics inherent to many scientific applications. We exploit these characteristics in the design of DRMS.

The first example we consider is that of computations performed repetitively over a fixed object such as a grid. Computing the discrete solution to certain partial differential equations falls in this category. In this example, as shown in Figure 1, in each iteration a Relaxation operation is performed over the grid, followed by an error-checking operation that determines the quality of the approximate solution. The Relaxation operation is typically data parallel, and the ErrorNorm operation involves some form of global reduction. In this example, all the exploitable parallelism is encapsulated within the implementation of these two operations. Thus, the distribution of data and the scheduling of the computations within each of these operations can be done independently of each other. Furthermore, one instance of the Relaxation operation may use a different data distribution and a different scheduling of computations than another instance of the same operation. These changes have no effect on the final solution, except for differences in round-off errors. The arrows labeled 'a' and 'b' in Figure 1 indicate the points in the pseudo-program where one may conveniently change the data distribution, the computation schedule, or both.
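In outline, the computation has the following shape (a minimal sketch in the style of Figure 5; the names u and err are illustrative):

      do i = 1, n
c        point a: distribution and schedule may change here
         call Relax(u)
c        point b: or here, between the two operations
         err(i) = ErrorNorm(u)
      end do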
The second example is that of the adaptive mesh refinement process, as illustrated in Figure 2. For brevity, we explain this problem schematically. In this type of problem, the data and the computations associated with it evolve dynamically, and the data dependencies may not be fully known until run-time. The computations typically start on some coarse global level, where an approximation to the solution is computed. Based on the new information, a sub-region is identified where the solution must be refined. This process is continued until a satisfactory solution is obtained everywhere in the region of interest. In such problems, at each level of refinement, two types of computations are involved: (i) first the refined mesh must be set up and its relation to the existing levels must be established, and then (ii) the solution on that level must be computed. As pointed out in [6], the refinements can be highly irregular, and the computations at each refinement can be completed using a multigrid-like algorithm, which has a nested form of parallelism inherent to it. These types of computations exhibit a varying degree of parallelism from one section of the code to the next, as well as over the same code section from one instance to the next. The appropriate number of processors and the ideal data distribution and computation schedule for each phase cannot be determined until the mesh is defined at run-time.

The third example we consider is a representative application in the area of multi-disciplinary design optimization. Typically, these applications tend to be large-scale in terms of their resource requirements (e.g., cpu cycles and memory), and they couple more than one discipline of physics and engineering. In the example we consider, for simplicity, we have eliminated many details; refer to [1, 4, 9] for approaches similar to our example. Shown in Figure 3 is a schematic of the computations in determining optimum design parameters of a wing or of a fuselage of an airplane. In this example, computational fluid dynamics (CFD) and structural analysis computations are performed in a tightly coupled manner, and the design is optimized until certain sensitivity criteria are met. The computations begin by assuming some baseline surface geometry from which a mesh for the flow field is computed.
[Figure 2: An example illustrating the adaptive mesh refinement process, with mesh levels ranging from a coarse global level to a fine adapted level.]

[Figure 3: An example of the multi-disciplinary design optimization process. Starting from a Baseline Geometry, the Surface Geometry feeds Time Dependent Adaptive Mesh Generation for the CFD Solver; the CFD Solver (with Time Dependent Boundary Conditions) produces a Pressure Distribution; together with a Finite Element Mesh this drives Static and Dynamic Structural Analysis, whose Structural Displacements feed back into the CFD computation; Design Sensitivity Analysis and Geometry Optimization follow, iterating through a Design Convergence Test until an Optimized Design is reached.]
Using this mesh, the flow field around the body is computed using a CFD solver. This step involves many iterations during which boundary conditions have to be continuously updated as the solution evolves, and, in the case of unsteady problems, the mesh has to be recomputed for each timestep. From the CFD solution, a pressure distribution for the assumed geometry is determined and is then used as an input to drive the structural analysis part of the computations. The structural analysis part requires its own finite-element mesh for the geometry under consideration. Using the finite-element mesh and the pressure distributions, both static and dynamic stress analyses are performed. The structural displacements computed in this step are then fed into the CFD computations to further refine the solution. These steps are followed by design sensitivity analysis for optimizing the geometry. The optimization step involves iterating over a large parameter space until satisfactory design parameters are identified; this step forms the outermost loop of the entire process. The key points to note for this example are that the computations within the outermost loop are quite heterogeneous (i.e., the data structures are different, the degree of parallelism varies from one phase to the next, and the algorithms used in different phases do not necessarily have the same communication and computation patterns) and that each phase (computing the CFD mesh, computing the CFD solution, performing the structural analysis, etc.) consumes a significant amount of resources. As a result, each phase must be parallelized efficiently. Dynamic reconfiguration
of processor arrangements, run-time data redistributions, and computation scheduling are prerequisites for achieving this efficiency.

Finally, in many multi-disciplinary situations, applications have already been written and tested for solving the component problem belonging to each discipline. Using these individual applications to solve a larger problem (such as the optimization problem), without rewriting a single monolithic code, requires a tightly coupled and seamless interface between the run-time system and the applications executing on multiple separate partitions. With such an interface it becomes possible for two or more independent applications to rendezvous at appropriate points during their execution and accomplish the same results as a single monolithic (and difficult to manage) code. Again, in such situations, efficient system interfaces can be set up by letting the applications expose more information to the run-time system.
3 OBJECTIVES

The proposed solution for the aforementioned problems is a scheme that allows interaction among the executing parallel application, the user, the operating system, and a run-time system to manage processors and distributed data. This scheme is implemented in the Distributed Resource Management System (DRMS). DRMS provides the following specific functionality for the user:
Application-level dynamic load balancing capabilities:

1. Provide facilities for dynamic distribution and redistribution of data structures within user applications.

2. Provide annotations to users for specifying schedule points in parallel programs where the data redistributions can be performed effectively.

3. Provide facilities for acquiring and releasing processors by individual user applications at run-time; that is, facilities for dynamic expansion and shrinkage of a user application partition.
System-level dynamic resource allocation capabilities to balance system throughput and job turn-around time:

1. Dynamic management of resources allocated to individual applications. Resources include processors, memory, and peripherals such as data servers and visualization systems.

2. Facilities for dynamically defining and incorporating resource management policies and for the execution of these policies.

3. Ability to manage the available pool of resources on an incremental basis.
Facilities for inter-application activities:
1. Ability to make inter-application connections at run-time, thus supporting dynamic interaction among independent jobs.

2. Ability to form a logically cohesive processor partition by "merging" two independent processor partitions at certain rendezvous points.

3. Ability to uncouple a logical processor partition (formed as above) into its component partitions.
In addition, the unique concept of application-defined scheduling points in DRMS makes it possible to provide several useful facilities for controlled execution of parallel applications, such as stop, pause, restart, checkpoint, and migrate. Similarly, the DRMS design incorporates many features that are useful for efficient system administration. These include the ability to dynamically manipulate scheduling policies and to change priorities assigned to parallel applications without resorting to expensive synchronous mechanisms.
4 DRMS ARCHITECTURE

4.1 Concepts
DRMS supports dynamic changes to the executing environment of a parallel program, including the distribution of global data, as well as communication between the program and the environment. For this support to be effective, these changes and communication must be restricted to points where the state of the program is well defined and the necessary operations can be carried out efficiently. This section defines what these points are.

We consider parallel programs written in SPMD style, where a copy of the program runs on each node of a parallel machine. A program can be represented by a control-flow graph. The nodes of the graph correspond to sequential segments and the arcs correspond to flow of control, some of which may be conditional and some unconditional. Parallel execution by n processors can be characterized by n copies of this graph, each processor plodding its way through its own graph. The progress of an application can be imagined as pushing wave-fronts consisting of n points, one from each graph. We would like to select a few of these wave-fronts at which the program state can be examined or altered. We call these points schedulable and observable points (SOPs); they are central to DRMS. DRMS monitors the SOPs and provides an interactive interface at desired SOPs by pausing the entire application along that wave-front. The run-time system can interact with the user or with a resource scheduler at this point and perform actions such as data redistribution, partition resizing, communication with other processes, and data visualization. The segment of code between two (dynamically) consecutive SOPs is called a schedulable and observable quantum (SOQ). Execution within an SOQ is never interrupted for a DRMS action. For convenience, we permit only wave-fronts with certain characteristics as potential SOPs. Briefly, an SOP must be such that:

1. Either all processors reach their corresponding points on the wave-front or none do. In other words, we want to avoid a situation where some processors conditionally bypass their points while others reach them.

2. There are no outstanding messages if the application is paused along this wave-front. Otherwise archivals and restarts can get very complicated.

3. There exists a well-defined reinitialization procedure to reset the state of relevant variables which may be affected when a resource change takes place at this point. For example, when an array is redistributed onto a smaller number of processors, the local bounds for iterating over the array may have to be reset.

Our long-term goal is to develop algorithms that analyze a program and derive these points and the associated initialization procedures. Since this problem is undecidable in the most general case, we will resort to heuristics that work in common cases. For the time being, our experimentation relies on user annotations that supply all the relevant information. Some of these annotations are described later in the paper.
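As a concrete illustration, consider an SOP placed at the bottom of the timestep loop of an SPMD relaxation code, in the style of Figure 5 (the Exchange routine and this particular placement are ours, not from the paper):

      do timestep = 1, MAX_STEPS
c        all sends and receives for this step complete here (condition 2)
         call Exchange(u0)
         call Relax(u1, u0)
c        every processor reaches this point, unconditionally (condition 1)
c$DRMS$  RESIZE? num_pes, INITIALIZE {call InitParameters(P, npx, npy)}
c        the INITIALIZE clause resets local bounds after a resize (condition 3)
      end do

The annotation sits outside any conditional, after all communication for the step has completed, and names a reinitialization procedure, so all three conditions above are met.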
4.2 Organization

[Figure 4: Main functional components of DRMS: the User Interface Coordinator, the Resource Coordinator, the Job Scheduler and Analyzer, the Task Coordinator, the Tools and Utilities Coordinator, and the Performance Data Gatherer, layered over the user program, the DRMS library, the communication subsystem, the file system, and the operating system. The labeled interactions include job submission, cancellation, repartitioning, status queries, archiving, and monitoring.]

Shown in Figure 4 are the main functional components of DRMS and the primary interactions among these components. At the highest level, DRMS has three distinct
functions to accomplish: (i) resource scheduling, (ii) managing and coordinating user applications at run-time, and (iii) performance analysis for decision making. The architecture of DRMS is built to execute these three functions in a consistent fashion. Two functional components perform the resource scheduling task: the Resource Coordinator (RC) and the Job Scheduler and Analyzer (JSA). The run-time management and coordination of user applications is accomplished by the User Interface Coordinator (UIC), the Resource Coordinator (RC), and the Task Coordinator and Run-time Monitor (TC). The performance analysis component is handled by the run-time Performance Data Gatherer and the associated tools and utilities. A component of analysis is also carried out by JSA for its internal decision-making process. The system-level allocation and scheduling decisions are made by JSA based on the enforced scheduling policies. These decisions may take into account information such as application-supplied resource requests, job priorities, and individual processor utilization, as well as system-level information such as current and expected workload. New policies for making such decisions can be supplied and modified by system administrators. (For examples of dynamic and adaptive scheduling policies that can be enforced by JSA, refer to [8].) JSA does not interface directly with the user or the user program, but rather communicates its decisions to RC. RC interacts with UIC and TC. The actual execution of the scheduling and allocation decisions for a user application is carried out by TC. There is only one
logical RC and JSA for the entire DRMS. Associated with each user application is a TC. A TC consists of multiple agents, one per processor on which the user application is scheduled for execution. One of the agents acts as a master for coordination with the external world, including RC and UIC. The run-time interactions between the user application and the rest of the system, including other applications, the user, and RC, are managed by various subcomponents of TC. The main functions carried out by TC are: acquiring/releasing processors from/to RC, initializing and restarting user program execution at the appropriate points in the user programs on the allocated processors, and performing appropriate data distributions to continue program execution after a partition resize or data redistribution. The user program interacts only at the user-supplied SOPs, but TC coordinates external interactions throughout the course of program execution. For example, if JSA decides to take away some of the processors allocated to a job during the course of its execution, this decision is conveyed to the TC of that job. The TC waits until the program arrives at the next schedule point (i.e., SOP) and then rearranges the application data so that the program can run on a smaller partition. At this point, the excess processors are released to RC. At a program-directed expansion of its partition, the TC acquires the additional processors from RC and reschedules the computations on the larger processor partition after performing appropriate data redistribution. The DRMS run-time library is linked with the user code
and provides services necessary for the user application to effectively utilize the resources allocated to it. These services include:

Maintaining descriptors for the global data arrays declared by the application. These descriptors are dynamically created and modified. They contain the information necessary to translate between the local and global index spaces, and they are also used to perform redistribution operations.

Maintaining descriptors for the processor arrays declared by the application. Each virtual processor has to be mapped to a specific physical processor of the partition on which the application is running.

Maintaining descriptors for the partition. A parallel application runs on a partition of physical processors that it sees as numbered from 0 to P - 1. A partition, however, is time-variant both in size and in the set of actual processors that constitute it.

Performing automatic redistribution of data when there is a partition resizing. This involves close coordination with the TC, since data has to be transferred after new processes are created and before old processes are killed.

Performing data format conversion from the format that is appropriate for the user program to a format appropriate for archival or for communication to a visualization system.

The functions of UIC are listed in Figure 4. The user submits jobs and interacts with the system throughout the course of the program execution via UIC. This interaction is bypassed only when a direct connection is established between the user and the application. The primary function of UIC is to provide a convenient user interface, and we will not go into its details. Similarly, the performance gathering component is designed to assist users and system administrators in understanding some of the characteristics of user programs. For the sake of brevity, we do not describe it in detail; refer to [3] for a description of one of the performance estimation utilities we have developed in this context.

4.3 Annotations

SOPs are declared in the user code in the form of annotations. An annotation is a Fortran 77 comment that has special meaning for DRMS. A Fortran 77 program with SOP annotations can be processed by the DRMS preprocessor. This preprocessor replaces the SOP annotations by executable code that calls the DRMS library. The output of the preprocessor can then be compiled by a regular Fortran compiler and linked with the DRMS library to generate a DRMS executable. A program with SOP annotations that is compiled without preprocessing will have its annotations ignored and will therefore execute as a regular program. For brevity, we avoid a full description of the syntactic details of the annotations here. Instead, we give the flavor of the annotations and illustrate them through the example in Figure 5. DRMS augments Fortran 77 description capabilities by allowing the user to declare global distributed data in HPF style [5]. However, the computational code is still SPMD and can only operate on local data. Annotations for specifying processor arrangements and data distributions are very similar to their counterparts in HPF. The processor arrangement specification is extended to include a basic mapping onto physical processors. For instance, the annotation

    PROCESSORS, DIMENSION(4,4), @10 :: P
defines a 4 x 4 processor array mapped to the physical processors 10 through 25. The column-major linearization of the virtual processor numbers maps to the physical processor numbers, beginning from the given starting point. Examples of the definition and redefinition of processor arrays can be seen in lines 19 and 32 of Figure 5.
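To make the mapping concrete, the following helper (ours, not part of the DRMS library) computes the physical processor number for virtual position (i,j) of an arrangement with n1 rows starting at physical processor start:

      integer function physpe(i, j, n1, start)
c     Column-major linearization: virtual (i,j) maps to
c     start + (i-1) + n1*(j-1).  For the 4 x 4 example above,
c     physpe(1,1,4,10) = 10, physpe(4,1,4,10) = 13, and
c     physpe(1,2,4,10) = 14.  (Illustrative helper only.)
      integer i, j, n1, start
      physpe = start + (i-1) + n1*(j-1)
      return
      end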
One can also specify array sections of the processor arrangement. Thus, for example, the annotation

    REAL, DIMENSION(100,100), DISTRIBUTE(BLOCK,BLOCK) ONTO P(1:4:2,1:4) :: A
block-distributes the 100 x 100 matrix onto a logical processor grid of size 2 x 4. The first row of the logical grid is mapped to physical processors (10, 14, 18, 22) and the second row is mapped to physical processors (12, 16, 20, 24). We support block, cyclic, and cyclic(k) distributions, as well as a block-list distribution which gives a list of individual block sizes. Line 20 of Figure 5 is another example of a declaration of distributed data. In that example, the first dimensions of u0 and u1 are distributed block while the second dimensions are collapsed. Our run-time library provides a host of primitives to obtain distribution information, to convert between local and global index ranges, and other facilities for writing distributed-data code.
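The paper does not spell out the entry points of these primitives; the arithmetic behind them for a BLOCK distribution, assuming the usual HPF convention [5] of equal blocks of size ceiling(n/np) with a possibly shorter last block, is sketched below:

      subroutine blkrng(n, np, p, lo, hi)
c     Global index range owned by virtual processor p (1-based)
c     when n elements are BLOCK-distributed over np processors.
c     An illustrative sketch, not the actual DRMS interface.
c     E.g., n=100, np=2: p=1 owns 1..50 and p=2 owns 51..100.
      integer n, np, p, lo, hi, b
      b  = (n + np - 1)/np
      lo = (p-1)*b + 1
      hi = min(p*b, n)
      return
      end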
An SOP can also contain executable statements. These statements can be regular Fortran statements or additional DRMS statements that redistribute specific arrays, resize the number of processors, or reinitialize state variables. Lines 14-16 and 25-29 of Figure 5 show how to include regular Fortran code. The command

    REDISTRIBUTE (BLOCK,BLOCK) ONTO P(1:4,1:4) :: A
will redistribute the array A according to the new distribution. In the current implementation, it is assumed that storage is allocated in each node to accommodate the largest possible partition envisioned for an array; hence the redistribution can rearrange the data within the same storage area. Later, we plan to introduce a dynamic storage allocation scheme where storage is allocated on demand.
The command

    RESIZE n
reduces or increases the number of processors to n. All distributed data structures are automatically redistributed onto the n processors using the same type of distribution specification as before. When the number of processors is increased, the new processors are loaded with the image from the first processor (by default) and then the redistribution is performed.
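For example, under the BLOCK rule sketched in blkrng above (our arithmetic, not DRMS output), a RESIZE from 4 to 8 processors changes the local ranges of a 100-element BLOCK-distributed dimension as follows:

      program resize_demo
c     Local index ranges of a 100-element BLOCK dimension
c     before (np=4: 1..25, 26..50, 51..75, 76..100) and
c     after (np=8: blocks of 13, the last one 92..100).
c     Compile together with the blkrng sketch above.
      integer p, lo, hi
      do p = 1, 4
         call blkrng(100, 4, p, lo, hi)
         write(6,*) 'np=4: processor', p, 'owns', lo, '..', hi
      end do
      do p = 1, 8
         call blkrng(100, 8, p, lo, hi)
         write(6,*) 'np=8: processor', p, 'owns', lo, '..', hi
      end do
      end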
Lines 31-33 of Figure 5 exemplify a situation where the processor partition is first resized, then the processor array P is changed to a two-dimensional configuration, and finally the data arrays u0 and u1 are redistributed onto this new configuration of processors. A border region, with data overlapping that of other processors, is defined for each local section of the arrays. A conditional form of RESIZE,

    RESIZE? n
is also provided. In this case the annotation indicates a convenient (from the application's perspective) place for resizing. The resizing operation occurs only if the operating system or other supervisory process decides that the system needs to be reconfigured. The number of processors in the new partition is returned in n. The command

    INITIALIZE { Fortran code }
will execute the enclosed Fortran code. This is useful for reestablishing local loop indices and bounds when the number of processors is changed. In line 49 of Figure 5, the initialization code is executed only if a resizing is performed. Finally, lines 41 and 46 of Figure 5 illustrate the use of conditionals on SOPs: in this case, every 5 timesteps the contents of u0 or u1 are dumped to a file.
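For instance, the InitParameters routine invoked at line 49 of Figure 5 could be written along the following lines; this is our guess at its body (reusing the blkrng sketch from above), since the paper does not show it:

      subroutine InitParameters(P, npx, npy)
c     Re-derive the local loop bounds over the (n1 x n2) grid
c     after a possible resize.  myrow and mycol, this processor's
c     position in the arrangement P, would in practice come from
c     the DRMS descriptors; this is a hypothetical sketch.
      integer npx, npy, P(npx, npy)
      integer n1, n2, myrow, mycol, ilo, ihi, jlo, jhi
      common /grid/ n1, n2, myrow, mycol, ilo, ihi, jlo, jhi
      call blkrng(n1, npx, myrow, ilo, ihi)
      call blkrng(n2, npy, mycol, jlo, jhi)
      return
      end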
5 SUMMARY AND CONCLUSIONS

In this paper, we have described an application oriented resource management approach suitable for large scale parallel systems. Based on this approach, we have designed an architecture for the Distributed Resource Management System (DRMS), an outline of which is presented in this paper. By establishing well-defined interactions between the executing job and the parallel system, DRMS allows efficient dynamic scheduling of resources during the execution of parallel programs, thus achieving its goals of increasing application performance and system throughput. The application program communicates with the environment only at specific execution points called SOPs (schedulable and observable points). At those points, the parallel program is in a well-defined state, and operations such as partition resizing (changing the number of processors involved in the execution of the program) and data redistribution can be performed efficiently. Communication between cooperating applications in multi-disciplinary problems can also be performed at SOPs. Information passed among the application program, the run-time system, and the operating system at SOPs can be used to perform better scheduling of the resources in a parallel environment. Execution checkpoints, which allow a program to be stopped and restarted at the user's or system's convenience, can also be implemented at the SOPs.

It should be noted that large scale parallel systems executing parallel applications are very complex systems. It is very hard to check correctness and optimize performance in such systems. The user can benefit from a system that allows dynamic monitoring and control of a scientific application during execution. Monitoring includes display of computed values and performance parameters. Control includes the ability to start and stop execution, reconfigure the partition, change the distribution of data on-the-fly, and communicate with multiple jobs. DRMS, along with its set of annotations, makes it simpler for users to interact with and monitor applications at run-time.

Currently, we are in the process of designing and implementing DRMS on distributed memory systems. Our first target machine is the IBM SP2 parallel system. DRMS will initially support Fortran 77 SPMD codes written using message-passing calls; MPI and MPL will be the initially supported message-passing libraries.

Acknowledgements: This work is partially supported by NASA under the HPCCPT-1 Cooperative Research Agreement No. NCC2-9000.

REFERENCES

[1] M. Bhardwaj, R. Kapania, C. Byun, and G. Guruswamy. Parallel aeroelastic computations by using coupled Euler flow and wing-box structural models. Presented at Computational Aerosciences Workshop '95, NASA Ames Research Center, 1995.

[2] Cray Research Inc. Cray T3D System Architecture Overview Manual. Eagan, Minn., 1994.

[3] K. Ekanadham, V. K. Naik, and M. S. Squillante. PET: Parallel Performance Estimation Tool. In Proceedings of the 7th SIAM Conference on Parallel Processing for Scientific Computing, pp. 826-831, 1995.

[4] G. Guruswamy and C. Byun. Progress in computational aeroelasticity using high fidelity flow and structural equations on parallel computers. Presented at Computational Aerosciences Workshop '95, NASA Ames Research Center, 1995.

[5] C. H. Koelbel, D. B. Loveman, R. S. Schreiber, G. L. Steele Jr., and M. E. Zosel. The High Performance Fortran Handbook. The MIT Press, Cambridge, MA, 1994.

[6] D. Mavriplis. Multigrid solution strategies for adaptive meshing problems. ICASE Report No. 95-14, 1995.

[7] Special Issue on IBM POWERParallel Systems. IBM Systems Journal, vol. 34, no. 2, 1995.

[8] V. K. Naik, S. K. Setia, and M. S. Squillante. Performance analysis of job scheduling policies in parallel supercomputing environments. In Proceedings of Supercomputing '93, pp. 824-833, 1993.

[9] O. Storaasli, R. Gillian, and J. Housner. Algorithms for coupled aero/structural design studies on high-performance computers. Presented at Computational Aerosciences Workshop '95, NASA Ames Research Center, 1995.

[10] Thinking Machines Corp. The Connection Machine CM-5 Technical Summary. Cambridge, Mass., 1992.
 1        program Poisson_solver
 2  c
 3  c     Computes the discrete solution to the 2-D Poisson
 4  c     equation on an N x N grid.
 5  c
 6        real u0(DIM1, DIM2), u1(DIM1, DIM2)
 7        real err(MAX_ITERS)
 8        Read(5,*) N
 9  c
10  c     Under DRMS this program starts off on one processor
11  c     and then allocates processors as required
12  c
13
14  c$DRMS$ {
15  c        Read(5,*) num_pes
16  c$DRMS$ }
17
18  c$DRMS$ RESIZE num_pes
19  c$DRMS$ PROCESSORS, DIMENSION(num_pes) :: P
20  c$DRMS$ REAL, DIMENSION(BLOCK,*) ONTO P :: u0, u1
21        call InitInterior(u0)
22        call InitBoundary(u0, u1)
23
24
25  c$DRMS$ {
26  c        Read(5,*) num_pes
27  c        npx = sqrt(num_pes)
28  c        npy = sqrt(num_pes)
29  c$DRMS$ }
30
31  c$DRMS$ RESIZE num_pes
32  c$DRMS$ PROCESSORS, DIMENSION(npx, npy) :: P
33  c$DRMS$ REDISTRIBUTE(BLOCK,BLOCK), BORDERS((1,1),(1,1)) ONTO P :: u0, u1
34
35        do timestep = 1, MAX_STEPS
36
37          if(odd(timestep)) then
38            call Relax(u1, u0)            !compute u1(i,j) using 5-pt ave around u0(i,j)
39            err(timestep) = ErrorNorm(u1) !accumulates error terms
40
41  c$DRMS$     (timestep%5)? : DUMP u1 TO datafile(timestep)
42          else
43            call Relax(u0, u1)
44            err(timestep) = ErrorNorm(u0)
45
46  c$DRMS$     (timestep%5)? : DUMP u0 TO datafile(timestep)
47          end if
48
49  c$DRMS$   RESIZE? num_pes, INITIALIZE {call InitParameters(P, npx, npy)}
50        end do
51        end Poisson_solver
Figure 5: An example program with DRMS annotations.