Architecting Malleable MPI Applications for Priority-driven Adaptive Scheduling

Pierre Lemarinier, Khalid Hasanov, Srikumar Venugopal, Kostas Katrinis
IBM Research - Ireland
[email protected]

ABSTRACT

Future supercomputers will need to support both traditional HPC applications and Big Data/High Performance Analysis applications seamlessly in a common environment. This motivates traditional job scheduling systems to support malleable jobs along with allocations that can dynamically change in size, in order to adapt the amount of resources to the actual current need of the different applications. It also calls for future innovative HPC applications to adapt to this environment, and to provide some level of malleability for releasing underutilized resources to other tasks. In this paper, we present and compare two different methodologies to support such malleable MPI applications: 1) using checkpoint/restart and the SCR library, and 2) using dynamic data redistribution and the ULFM API and runtime. We examine their effects on application execution times as well as their impact on resource management.

Keywords: Interactive Supercomputing, Malleability, Scheduling, MPI, Resource Management

1. INTRODUCTION

High Performance Computing (HPC) applications usually require exclusive access to computing resources in order to deliver results in an acceptable time frame. Consequently, resource management systems for HPC machines traditionally operate as batch schedulers where users request a fixed number of resources for a specific amount of time. The scheduler then orchestrates the deployment of applications, monitors them and potentially terminates them once they reach their allocated time limit. With the convergence of high performance analysis systems and high performance computing systems into a single platform, some applications will benefit from interactivity. For instance, long-lasting scientific simulations can be partnered with dynamic visualization or short-term analysis to help domain expert scientists understand how their experiments evolve, instead of waiting for the simulation to complete and running a post-mortem analysis. This interactivity enables steering the simulation through the use of short-term analysis, and avoids wasting computing resources by enabling early cancellation of unproductive jobs. Unfortunately, with current scheduling systems, this interactive computation is possible only by reserving enough free resources for the analysis or by preempting the simulation, both of which lead to a waste of resources. In other scenarios, jobs may need to be executed urgently even when all resources have already been allocated to long-running jobs. This issue can be solved by providing a priority mechanism along with preemption techniques that involve stopping a running job momentarily. However, this can lead to resource under-utilization, as a large job can be preempted to make room for a high priority job which needs just a couple of cores. Backfilling can help alleviate this problem, but only if there are other short-running jobs queued for execution.

Malleable applications are applications that can dynamically grow or shrink to adapt to the number of resources that they are provided. In the presence of malleable applications, a scheduler can be tailored to dynamically request any such application to free some of its resources to accommodate other jobs, or to deliver more resources to these applications. Most existing MPI applications are not malleable and would need a lot of refactoring to become malleable, due to specific data structures and communication patterns. It is thus envisioned that future scheduling systems will have to accommodate both malleable and non-malleable MPI applications simultaneously. Nonetheless, malleability should be supported when developing new interactive MPI simulations. While simply migrating ranks and overloading some nodes also frees some resources, MPI allows applications to map hardware topologies to virtual communicators, and so this would hurt performance more than shrinking the number of ranks to match the provided resources and rebalancing the payload.

In this paper we present two techniques that provide the required support to enable malleability for future MPI applications. These techniques are similar to fault tolerance techniques, as they enable the capability to dynamically lose some resources. The first technique involves user level checkpoint restart mechanisms. When an application needs to grow or shrink, it stores its state into a checkpoint, then

stops. It is then automatically restarted with a new number of resources and can access its previous data from the set of checkpoint files. The second technique employs the User Level Failure Mitigation (ULFM) [4] API and runtime. The ULFM project is the implementation of an API proposed for inclusion in the future MPI-4 standard to support fault tolerance. This API adds, for instance, error codes to the MPI standard that the application developer can use to become aware of failures, and functions such as MPI_Comm_shrink(), with which the developer tells the MPI runtime what to do after a failure. In the context of malleability, we use MPI_Comm_shrink() after killing the targeted ranks to rebuild the communicator and continue the execution. Note that, contrary to the ULFM runtime, most MPI runtimes do not support recovery from node failures. We evaluate these two malleability approaches through a suite of custom applications that implement well-known parallel algorithms such as matrix multiplication and two-dimensional block-cyclic distribution. Additionally, we present a scheduling policy for dynamically requesting resources from running malleable applications. We compare the impact of the malleability techniques on job and resource management by using synthetic benchmarks and present the results. The next section reviews the state of the art in the area of malleability and its incorporation into scheduling. Then, Section 3 discusses checkpointing and ULFM in further detail. Section 3.1 presents the scheduling policy for malleable applications that we used in the evaluation. Section 4 presents our experiments and discusses the results of the evaluation. Finally, we present our conclusions in Section 5.

2. RELATED WORK

There has been intensive research on the topic of scheduling malleable parallel applications over the past few years, but not much on comparing techniques for enabling malleability at the application level. Scheduling resizeable applications involves making decisions on which jobs to shrink as well as identifying the jobs for allocating the resources freed up in such a manner. Many projects have presented frameworks that couple decision-making about resource allocation to the mechanism of shrinking or growing the malleable applications. An example of this is the ReSHAPE project presented by Sudarsan and Ribbens [20], which combines a scheduler with a library that supports dynamic resizing of iterative MPI applications. Applications can either grow or shrink at a so-called resize point, which is typically the end of an iteration. Applications contact the scheduler through the included API to provide performance data. The scheduler then uses this data and the available resources to make its decision. This means that applications have to be compiled with the ReSHAPE library in order to be malleable under this paradigm. In a later publication [21], Sudarsan and Ribbens introduce scheduling policies for applications developed using the ReSHAPE framework. The scheduling policy with the best outcome involved extracting resources from jobs whose performance would be least impacted by shrinking and allocating them to queued jobs, if present, or to running applications that would benefit the most from expansion. We thus decided to implement a similar policy in our scheduler for comparing the impact of the chosen malleability technique. Utrera et al. [22] introduced a new scheduling policy named FCFS-malleable for multi-core clusters and evaluated

its performance against backfilling. According to their study, the FCFS-malleable policy can improve the average response time by 31% over traditional EASY backfilling, depending on the execution times of the jobs and the number of processes. More recently, and similarly to ReSHAPE, Martin et al. [14] introduced the Flex-MPI framework, which uses performance predictions to reconfigure malleable MPI applications during execution. Checkpoint restart is the method used for supporting malleability. While the focus of those authors is on the interaction between an application and resource management, in this paper we focus on comparing the impact of our chosen malleability techniques, checkpoint restart and ULFM, on application execution times. Buisson et al. [5] discuss scheduling malleable applications across multiple clusters using the KOALA scheduler. Applications have to incorporate a custom framework capable of detecting synchronisation points between threads and redistributing data between threads at such points. This framework supports the SPMD model. Similarly, the Internet Operating System (IOS) project [6, 7] tackled malleability of MPI applications using its own custom middleware, the Process Checkpointing and Migration library. Malleability was implemented in applications through split and merge operations, which are handled by local agents running on individual resources. We eschew such custom frameworks and focus only on the well-known SCR library and the proposed ULFM standard API for our comparison. Prabhakaran et al. [17] extended the Maui scheduler [11] to create an adaptive batch scheduling system for handling malleable jobs that used the Charm++ [12] adaptive runtime. In contrast, our focus is solely on programming malleable applications using MPI. We present in this paper two different techniques to support malleability for MPI applications and compare, for the first time, these two approaches and their relative impact on resource utilization.

3. MALLEABILITY SUPPORT FOR MPI APPLICATIONS

Large-scale systems are faced with dynamic changes in resources due to failure of nodes, which in turn require the application to be reconfigured in order to achieve successful completion. The classic approaches to deal with such reconfiguration are based on checkpoint/restart techniques. Commonly, these consist of regularly storing the state of an application, and in the case of node failure, reading this state from the last checkpointed data once the application is restarted. While most checkpoint approaches will lead to job re-submission in order to restart with the same number of resources as the initial execution, it is also possible to tailor some applications to simply restart with a different number of resources, thus enabling them to restart right away on the remaining allocated resources. This requires data redistribution from the checkpointed data, which is easier to do from a single shared checkpoint file. Moreover, this data redistribution phase is application-specific as it can depend on the communication pattern of the parallel application. Outside of the fault tolerance context, these checkpointing techniques can be used to build malleable applications as well. This still implies that such an application has to be restarted each time it is reconfigured; however, checkpoint


phases can be triggered only when required for reconfiguration. Another major difference compared to fault tolerance resides in the requirement for such applications to also be able to dynamically grow in size after having obtained more resources. It is thus important to address the associated data redistribution and not simply count on restarting the same number of processes on a different set of resources.

System-level checkpointing libraries [8, 10] hide the low-level details of checkpointing from the user. These are not only non-portable, which can be an issue for heterogeneous systems, but they also hide the positioning of application data in the checkpoint. This leads to the need to actually restart all previously running processes, and prevents data redistribution at restart time. We thus exclude this possibility for enabling malleability. User-level checkpointing lets the application developer explicitly store the data he or she needs for restarting, specifying the data positioning in the resulting files. Though more cumbersome for the developer, this provides a method to orchestrate data redistribution directly from checkpoint files. For instance, in applications that compute on matrices distributed with a classic block distribution, it is possible to build checkpoints so that every process can compute locally in which file each block is stored and at which offset. The coordination of snapshots of different processes and the management of checkpoint files must be performed carefully in order to ensure the consistency of the global checkpoint. This is the role of the Scalable Checkpoint/Restart (SCR) library [15]. SCR provides an API for coordinating a user-level checkpoint phase. It also copies the checkpoint files transparently to predetermined locations across the storage system, such as a parallel file system, and fetches checkpoint files at restart time. Currently, SCR only fetches the checkpoint file matching the MPI rank that wrote the file; however, we modified the library to enable every rank to request and access any checkpoint file on demand. In the rest of the paper, we will refer to an application that uses SCR for user-level checkpoint management as an SCR application.

In the past few years, there have been many proposals to extend the MPI standard to support fault tolerance. User Level Failure Mitigation (ULFM) [4] is the most advanced among these propositions. It adds representative error codes to the MPI standard that an application developer can test against to discover node failures, and provides a couple of new MPI functions that let the developer give insight into how his or her application should continue its execution in case of failures. For instance, the developer can request the MPI runtime to replace the failed rank in the communicator with a new rank along with the function it should execute, or he or she can request the MPI runtime to simply remove the failed rank from the communicator and execute with one less rank. This latter behaviour is obtained by using the MPI_Comm_shrink function. Though this functionality has been designed for fault tolerance, we will make use of it to provide malleability as well when an application is requested to shrink. Upon receipt of this request, the application will first perform its data redistribution with point-to-point communication, then kill the required number of MPI processes, and finally use MPI_Comm_shrink to get a correct and working MPI communicator.
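As an illustration of the checkpoint-based approach described above, the following is a minimal sketch of a user-level checkpoint phase coordinated with SCR's classic C API (SCR_Start_checkpoint, SCR_Route_file, SCR_Complete_checkpoint). The per-rank file naming and flat block layout are hypothetical and not the implementation used in this paper.

/*
 * Minimal sketch of a user-level checkpoint phase with SCR.
 * Assumes MPI_Init() and SCR_Init() have been called at startup.
 * File name and layout are illustrative only.
 */
#include <stdio.h>
#include <mpi.h>
#include "scr.h"

void checkpoint_local_blocks(const double *blocks, size_t nbytes, int rank)
{
    char name[256];
    char path[SCR_MAX_FILENAME];
    int valid = 1;

    SCR_Start_checkpoint();                 /* open a coordinated checkpoint phase */

    /* one file per rank; after a resize, a rank can compute which
       files and offsets hold the blocks it now owns */
    snprintf(name, sizeof(name), "ckpt_rank_%d.bin", rank);
    SCR_Route_file(name, path);             /* SCR tells us where to write this file */

    FILE *f = fopen(path, "wb");
    if (f == NULL || fwrite(blocks, 1, nbytes, f) != nbytes)
        valid = 0;                          /* flag the local checkpoint as invalid */
    if (f != NULL)
        fclose(f);

    SCR_Complete_checkpoint(valid);         /* global agreement that the checkpoint is usable */
}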
The MPI-2 standard [1] already provides the MPI_Comm_spawn routine, which lets users implement an application that can dynamically increase the original number

of processes p by a specified number x to p + x, where x > 0. Later on, the spawned processes can be finalized, in which case the number of processes returns to p. However, it is not possible to shrink below p or to p + y, where 0 < y < x. Although we could start an application with a single rank and grow it one by one with MPI_Comm_spawn until reaching the expected initial size, in order to obtain fine-grained malleability, this solution would impact the starting time of an application and would require a lot of communicator management. Our second technique to support malleability for MPI applications thus consists of coupling ULFM's MPI_Comm_shrink and the standard MPI_Comm_spawn routines, as sketched below, taking advantage of the accompanying ULFM MPI runtime, which, in turn, supports dynamic removal of processes and nodes. We have developed a simple scheduler and implemented a fairly simple scheduling policy, presented in the next section, in order to better study the impact of the selected techniques for enabling malleability. Whilst it would have been more desirable to integrate malleability into a production scheduler, this would have introduced other sources of performance discrepancies.
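The sketch below illustrates the shrink/grow coupling just described. It is not the authors' code: in the Open MPI-based ULFM prototype the shrink routine is exposed as MPIX_Comm_shrink via <mpi-ext.h>, and the victim-selection and data redistribution steps, which are application-specific, are only indicated by comments.

/*
 * Sketch of shrink (exit + MPIX_Comm_shrink) and grow (MPI_Comm_spawn + merge).
 * Illustrative only; error handling omitted.
 */
#include <stdlib.h>
#include <mpi.h>
#include <mpi-ext.h>     /* MPIX_Comm_shrink in the ULFM prototype */

/* Shrink the job to new_size processes. */
MPI_Comm shrink_to(MPI_Comm comm, int new_size)
{
    int rank;
    MPI_Comm newcomm;

    MPI_Comm_rank(comm, &rank);

    /* application-specific: ranks >= new_size send their blocks to survivors */
    /* redistribute_before_shrink(comm, new_size); */

    if (rank >= new_size)
        exit(0);                       /* victims leave; the ULFM runtime reports them as failed */

    /* survivors rebuild a working communicator (in practice they may first need
       to wait until the exits have been detected by the runtime) */
    MPIX_Comm_shrink(comm, &newcomm);
    return newcomm;
}

/* Grow the job by 'extra' processes running the same binary. */
MPI_Comm grow_by(MPI_Comm comm, int extra, char *binary)
{
    MPI_Comm inter, merged;

    MPI_Comm_spawn(binary, MPI_ARGV_NULL, extra, MPI_INFO_NULL,
                   0, comm, &inter, MPI_ERRCODES_IGNORE);
    MPI_Intercomm_merge(inter, 0, &merged);   /* existing ranks keep the low ranks */

    /* application-specific: scatter blocks to the newly spawned ranks */
    /* redistribute_after_grow(merged); */
    return merged;
}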

3.1 Scheduling Policy

In this paper, we compare two techniques for realising malleability in HPC applications in terms of their effects on application runtimes as well as resource usage efficiency. In order to study the latter, we have focused on the use case of jobs with long runtimes being resized in order to accommodate short-running jobs with higher priority, wherein a candidate job is shrunk to free up resources for a high-priority job and then resized back after the latter has finished execution. Accordingly, we have implemented a queue-based scheduler that supports jobs with different priorities. Each priority is maintained as a separate queue, with the scheduler extracting jobs from the queues in order of priority. The key piece of this queue management policy is the logic for handling the common scenario wherein a high-priority job has been submitted but its resource requirements cannot be satisfied. The scheduler then has to identify jobs to shrink for extracting resources to accommodate this job, and later, to identify which job to expand to consume the extra resources. We have employed the Least Impact - Most Benefit policy for selecting running jobs to shrink and expand. The Least Impact policy for shrinking is adapted from Sudarsan and Ribbens [21]. In this policy, the list of jobs under execution is searched to identify jobs that have a lower priority than the queue under consideration, are malleable, and have not exceeded their requested wall-clock time. This list is sorted by increasing priority and, within jobs of the same priority, by decreasing wall-clock time. That is, we endeavour to find malleable jobs with the lowest priority that will be the least impacted in terms of over-running their requested wall-clock time. The intuition here is that high-priority jobs will be short-lived, and a job once contracted will be expanded later so that it can make up for the slowdown. Then, the sorted list is traversed and each job is marked to be reduced by the lower of (i) the difference between its current number of resources and its minimum number of resources, and (ii) the number of resources still required. The job is then added to the set of resize candidates. The Most Benefit policy for expansion


is the reverse of the shrinking policy, where we sort the jobs of the given priority in increasing order of the difference between their current wall-clock time and their requested wall-clock time. Each job is given the maximum number of resources available, up to the amount requested at the time of submission, and is then added to the list of resize candidates. A key aspect of this queue management logic is that the resources that can be obtained after shrinking are not committed to other jobs until the operation is completed. This is because the scheduler is decoupled from the actual performance profile of the malleable parallel applications, and the requested wall-clock time is considered to be advisory rather than an accurate user estimate of the job execution time. As we will show in Section 4, it is possible that upon resizing, a job may shrink itself to the configuration that is most suitable for data distribution and communication. Also, it is possible that some jobs may finish execution during the polling interval. Therefore, the scheduler may obtain more or fewer resources than what was taken into consideration during job allocation. In the interest of brevity, we have opted not to present further detail of the implementation of the scheduler.
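As a rough illustration only (not the authors' scheduler code), the Least Impact selection of shrink candidates described above could be sketched as follows. The Job fields and the tie-break on remaining requested wall-clock time are our interpretation.

/*
 * Sketch of Least Impact shrink-candidate selection. Illustrative only.
 */
#include <stdlib.h>

typedef struct {
    int    id;
    int    priority;     /* lower value = lower priority */
    int    malleable;
    double elapsed;      /* seconds the job has been running */
    double wallclock;    /* requested wall-clock limit (seconds) */
    int    procs;        /* currently allocated processes */
    int    min_procs;    /* minimum size the job accepts */
    int    to_release;   /* output: processes this job is asked to free */
} Job;

static int least_impact_cmp(const void *a, const void *b)
{
    const Job *x = *(Job *const *)a, *y = *(Job *const *)b;
    if (x->priority != y->priority)
        return x->priority - y->priority;              /* increasing priority */
    double rx = x->wallclock - x->elapsed;
    double ry = y->wallclock - y->elapsed;
    return (ry > rx) - (ry < rx);                      /* then decreasing remaining time */
}

/* Mark jobs to shrink until 'needed' processes are freed for a job of priority hp_prio.
 * Returns the number of processes that still cannot be satisfied. */
int select_shrink_candidates(Job *running, int n, int hp_prio, int needed)
{
    Job **cand = malloc(n * sizeof(Job *));
    int m = 0;

    for (int i = 0; i < n; i++)
        if (running[i].malleable &&
            running[i].priority < hp_prio &&
            running[i].elapsed < running[i].wallclock)   /* not over its wall-clock yet */
            cand[m++] = &running[i];

    qsort(cand, m, sizeof(Job *), least_impact_cmp);

    for (int i = 0; i < m && needed > 0; i++) {
        int spare = cand[i]->procs - cand[i]->min_procs;
        int give  = spare < needed ? spare : needed;     /* lower of spare and still-needed */
        cand[i]->to_release = give;
        needed -= give;
    }

    free(cand);
    return needed;
}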

4. EVALUATION

4.1 Experimental Testbed

The experimental study has been conducted on the Shamrock cluster of the High Performance Systems group at IBM Research Ireland. This work utilized up to 64 nodes interconnected with Gigabit Ethernet. Each node has an Intel Xeon X5670 CPU (6 cores, 12 hardware threads), 1 TB of local HDD storage, and 128 GB of RAM. In addition, each node has access to 15 TB of NFS version 3 shared storage. The operating system is the Red Hat 6.5 Linux distribution; Open MPI 1.8.7 and ULFM (based on Open MPI 1.7) have been used as the MPI implementations. SCR version 1.1-7 has been used for checkpoint/restart.

4.2 MPI Applications

We first evaluate the performance impact of resizing an application when using either the checkpoint restart or the ULFM based technique. We used two different applications: a state-of-the-art parallel matrix multiplication algorithm, and an application with no computation but a global matrix shared using a two-dimensional block-cyclic data distribution. Effectively, the latter serves the purpose of stressing data redistribution overheads during resizing.

SUMMA - Parallel Matrix Multiplication Algorithm. The parallel matrix multiplication algorithm we used in our experiments is the SUMMA [19] algorithm, which has been implemented in state-of-the-art numerical linear algebra packages such as ScaLAPACK [3] and Elemental [16]. Furthermore, because of its practicality, SUMMA is used as a baseline for implementations and optimizations of parallel matrix multiplication on specific platforms [2, 18, 9]. The SUMMA algorithm has been designed for a matrix multiplication C = A × B over a two-dimensional p = P×Q processor grid. To describe the algorithm briefly, let us assume the matrices are square and their dimensions are n×n. If we introduce a block size b, then the dimensions of the matrices can be seen as (n/b)×(n/b) and each element will be a square block of size b×b. Thus, the algorithm operates on blocks rather than single elements. The algorithm consists of n/b steps, and at each step the following operations are performed (a simplified sketch of this step loop is given at the end of this subsection):

• Each processor holding part of the pivot column of the matrix A horizontally broadcasts its part of the pivot column along the processor row.

• Each processor holding part of the pivot row of the matrix B vertically broadcasts its part of the pivot row along the processor column.

• Each processor updates each block in its C rectangle with one block from the pivot column and one block from the pivot row, so that each block c_ij, (i, j) ∈ {1, ..., n/b}, of matrix C is updated as c_ij = c_ij + a_ik × b_kj.

After n/b steps of the algorithm, each block c_ij of matrix C will be equal to Σ_{k=1}^{n/b} a_ik × b_kj. Figure 1 highlights the communication patterns in SUMMA on a 6×6 processor grid [9]. We have implemented two versions of the SUMMA algorithm in a malleable way, using first the SCR checkpoint/restart library and then the ULFM library. Both implementations use two-dimensional block distributed matrices. The shrink and grow operations in the malleable SUMMA require data redistribution before the algorithm can continue. During a grow operation the local matrix sizes decrease, and each rank before the grow operation owns data that needs to be distributed over more than one new process; it is the opposite for the shrink operation. This data redistribution has been implemented using MPI I/O operations in the SCR implementation of the malleable SUMMA. The ULFM implementation uses MPI gather and scatter operations to perform data redistribution while shrinking and growing, respectively.

2D Block Cyclic Distribution Benchmark. The 2D block cyclic distribution of matrices is widely used in state-of-the-art dense numerical linear algebra libraries such as ScaLAPACK to load balance computation and reduce data movement in parallel matrix multiplication and factorization operations. In addition, according to [13], the block-cyclic data distribution is the natural choice in signal processing applications. A 2D block cyclic distribution is illustrated in Figure 3 (left). Each process performs a certain number of multiply and add operations that scales linearly with its local data size. Shrinking the number of resources will thus redistribute the blocks of the matrix and increase each process's local data size, leading to an increase in the computation time. The initial number of operations to perform on each node is scaled so that it takes one second on rank 0 in our system.
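To make the SUMMA step loop above concrete, here is a simplified C/MPI sketch under stated assumptions: a q×q process grid with a single b×b block of A, B and C per process (so the number of steps n/b equals q), and row/column communicators created with MPI_Comm_split so that a process's rank in row_comm is its grid column and its rank in col_comm is its grid row. The function and variable names and the naive local update are ours; this is not the paper's block-cyclic implementation.

/*
 * Simplified SUMMA: one b x b block per process on a q x q grid.
 */
#include <stdlib.h>
#include <string.h>
#include <mpi.h>

void summa_one_block_per_process(const double *A, const double *B, double *C,
                                 int b, int q, int my_row, int my_col,
                                 MPI_Comm row_comm, MPI_Comm col_comm)
{
    size_t blk = (size_t)b * b;
    double *Abuf = malloc(blk * sizeof(double));
    double *Bbuf = malloc(blk * sizeof(double));

    for (int k = 0; k < q; k++) {
        /* the owner of the pivot column broadcasts its A block along the process row */
        if (my_col == k)
            memcpy(Abuf, A, blk * sizeof(double));
        MPI_Bcast(Abuf, (int)blk, MPI_DOUBLE, k, row_comm);

        /* the owner of the pivot row broadcasts its B block along the process column */
        if (my_row == k)
            memcpy(Bbuf, B, blk * sizeof(double));
        MPI_Bcast(Bbuf, (int)blk, MPI_DOUBLE, k, col_comm);

        /* local update: C = C + Abuf * Bbuf */
        for (int i = 0; i < b; i++)
            for (int l = 0; l < b; l++)
                for (int j = 0; j < b; j++)
                    C[i * b + j] += Abuf[i * b + l] * Bbuf[l * b + j];
    }

    free(Abuf);
    free(Bbuf);
}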

4.3 Checkpoint Restart vs ULFM

Figure 2a compares the overall execution time of the ULFM-based malleable SUMMA with that of the SCR-based malleable SUMMA, expressed as a speedup, in two experimental settings: 1) shrinking from 64 to 16 processes, and 2) growing from 16 to 64 processes. The SCR version uses checkpoint/restart to shrink and grow, while the ULFM version uses MPI_Comm_shrink and MPI_Comm_spawn respectively. The global memory size refers to the sum of the memory used by all the processes.


Figure 1: Horizontal communications to transmit matrix A and vertical communications to transmit matrix B in SUMMA on a 6×6 processor grid. (a) Communication flow of matrix A: the pivot column A^b_{•k} of (n/√P)×b blocks of matrix A is broadcast horizontally along each processor row. (b) Communication flow of matrix B: the pivot row B^b_{k•} of b×(n/√P) blocks of matrix B is broadcast vertically along each processor column.

Figure 2: ULFM-SUMMA vs SCR-SUMMA (a, b, c, d) and ULFM-2D-BC vs SCR-2D-BC (e, f). (a) Speedup of total run times (p64→p16 and p16→p64) against global memory size (GB). (b) Speedup of shrink and grow times (p64→p16 and p16→p64) against global memory size (GB). (c) Speedup of total run times (p16→p4 and p4→p16) against global memory size (GB). (d) Speedup of shrink and grow times (p16→p4 and p4→p16) against global memory size (GB). (e) Speedup of the ULFM-2D-BC data redistribution + shrink time over the SCR-2D-BC checkpoint/restart time, against the new size after shrinking from 64 processes. (f) SCR checkpoint time, SCR restart time and ULFM time (seconds), against the new size after shrinking from 64 processes.

The overall tendency is similar, with slightly more speedup in the case of the grow operation. The computation time of the matrix multiplication dominates over the time spent on the data redistribution. For instance, for a total memory size of 32 GB, the shrink and data redistribution represent only 4.7% of the overall execution time. Figure 2b exhibits the speedup of the time spent solely in the shrink and grow operations, in which case the improvement

is multifold for large global memory sizes. Similarly, Figure 2c and Figure 2d show the corresponding speedups for growing from 4 to 16 processes and shrinking from 16 to 4 processes. As we keep the global memory size the same across the experiments, the calculation time on 4 processes increases while the contribution of the data redistribution decreases. Therefore, the speedup of the overall execution time of the matrix multiplication with different global memory sizes does not change much: it is in the range of 2.26 to 2.68 times for shrinking, depending on the memory size, while in the case of growing the improvement varies in the range of 2.30 to 2.66 times. Our initial observations showed that the slight difference in speedups between the shrinking and growing operations is due to the differences in data redistribution between them. More precisely, the grow operation in the ULFM version uses MPI scatter to redistribute the data, while the shrink operation employs MPI gather for the same purpose. In addition, the grow operation is performed using MPI_Comm_spawn, whereas the shrink operation has been implemented using MPI_Comm_shrink. Similar differences apply to the SCR version of the malleable SUMMA, in the sense that a single file is accessed by more than one process while growing, whereas more files are accessed by a single process during the shrink operation. Table 1 shows the total execution time of the SCR version of the SUMMA algorithm for two scenarios, when the algorithm grows or shrinks during its runtime between 64 and 16 or between 16 and 4 processes. The first column shows the type of resize event, i.e., the number of processes before and after the resize within one execution cycle; the remaining columns correspond to the global memory sizes used within one execution of the application. The algorithm can be faster at small scale when its execution time is dominated by the communication cost. Indeed, this is the case when the execution time is lower while shrinking or growing between 16 and 4 processes compared to that between 64 and 16 processes for small memory sizes. Figure 2e highlights the speedup of the malleable 2D block cyclic redistribution implemented with ULFM (ULFM-2D-BC) over SCR (SCR-2D-BC). In this experiment a matrix spanning 32 GB of memory is distributed in a 2D block cyclic way over 64 processes (one process per node). We evaluate the performance of six separate executions where each time the number of processes to eliminate differs: shrinking from 64 to 32, 16, 8, 4, 2, and 1 processes. The speedup is highest in the case of the shrink from 64 to 32 processes, but decreases as we shrink to a smaller number of processes. As exhibited in Figure 2f, the checkpoint/restart time is relatively constant for any data redistribution, as it is limited by the I/O capability, whereas ULFM exchanges more data between two given peers as one shrinks further.


Table 1: Total execution time of the SCR versions of malleable SUMMA with growing and shrinking
Shrink or Grow   2GB    4GB    8GB    16GB    32GB
p64 → p16        102s   199s   406s   836s    1818s
p16 → p64        178s   210s   531s   1183s   2534s
p16 → p4         70s    156s   377s   1117s   3245s
p4 → p16         85s    173s   380s   1210s   3270s


Table 2: Job File for Interactive Simulations
Job ID   Submit Time (s)   Run Time (s)   Alloc. Procs   Min. Procs   Mem./Proc.   Priority
1        1                 2400           16             2            4.7 GB       0
2        600               400            8              2            4.7 GB       2
3        1200              40             8              2            4.7 GB       2
4        1800              40             8              2            4.7 GB       2

4.4 Interactive Simulations

In these experiments, we study the scenario of an interactive simulation. A long-lasting simulation is first started, requesting all resources from the system. Then, from time to time, a short-running job is submitted to the same system to visualize the current data state. In our experiments, both the simulation and the visualization applications are executed using our 2D Block Cyclic Distribution Benchmark. The experiment itself consists in submitting the jobs listed in Table 2. A first low priority job is submitted requesting all resources for 40 minutes. A second, high priority job is submitted after 10 minutes, requesting 50% of all resources for 400 seconds. Finally, two other high priority jobs are submitted every 10 minutes for 40 seconds, also requesting 50% of the resources. While these high priority jobs are supposed to be relatively short, the rationale for having the first one require 400 seconds is to expose a situation where this job may not have terminated by the time the next short job is submitted. This configuration is denoted SCR50 and ULFM50. The second set of experiments (ULFM25 and SCR25) is identical in every respect but one: the high priority jobs this time request only 25% of the resources, thus 4 nodes. We ran the baseline (Base) separately with the two sets of jobs (50% and 25%) but without malleability. Table 3 exhibits the total execution time of these experiments and the impact on the low priority job. As expected, the checkpoint restart technique impacts the global time more, as the time for reconfiguration increases. The resize count exposes the number of such reconfigurations, which are then described as a pattern (from this number of resources → to this number of resources) followed by the time it took to perform, in seconds. In the case of the high priority jobs requesting 25% of the resources, reconfiguration with ULFM being slightly faster than with SCR, the low priority job could be regrown to all resources before terminating, while when using SCR the final high priority job was still running. The low priority job is impacted as expected by these reconfigurations. Figure 3 (right) exhibits the time each of the high priority jobs has spent in the queue. Each job being submitted at 10-minute intervals, the first one spent more time in the queue than the others when no malleability is enabled (Base): they all have to wait for the termination of the low priority job before getting resources to execute. When malleability is provided, the time spent in the queue depends on the time to reconfigure the low priority job to free resources. It can be noted from Table 3 that the time to reconfigure from 16 to 8 processes, whether with ULFM or SCR, is less than the time to reconfigure from 16 to 12, due to extended data redistribution time.



Table 3: Effect of ULFM and SCR on scheduling (LP = Low Priority)
Job Type               SCR25          ULFM25         SCR50         ULFM50
LP job time (s)        521            482            522           482
Resize Count           4              5              4             4
Data Redistribution    16→12: 453s    16→12: 295s    16→8: 352s    16→8: 258s
Pattern: Time          12→8: 427s     12→8: 426s     8→16: 384s    8→16: 221s
                       8→16: 380s     8→16: 221s     16→8: 418s    16→8: 284s
                       16→12: 444s    16→12: 309s    8→16: 447s    8→16: 221s
                                      12→16: 252s
Total Time (s)         4516           3418           4028          3058

Figure 3: 2D block cyclic distribution of a matrix with 9×9 blocks over a 3×3 processor grid (left). Waiting times in queue when using the synthetic benchmark (right): queue time in seconds for Job 2, Job 3 and Job 4 under Base25, Base50, SCR25, SCR50, ULFM25 and ULFM50.

5. CONCLUSION

We proposed two different techniques to support malleability for MPI applications. The first method relies on user level checkpoint/restart. It does not require a fault tolerant MPI runtime, and can thus be employed with any MPI library. We used the existing SCR library to manage checkpoints and extended it to enable every rank to read from multiple checkpoint files for data redistribution. The second technique uses the ULFM API and runtime. ULFM proposes an API, intended for inclusion in the MPI standard, to help a developer cope with failures. A core function of this API is the ability to explicitly shrink a communicator to the current number of processes after failures, and its runtime supports a dynamic change in the number of ranks. We implemented both techniques for different applications and compared the performance impact of resizing with each of them. We also developed a custom scheduler with a policy to handle the shrinking and expansion of running jobs to accommodate short, high-priority jobs, and demonstrated the improvement that malleability can provide in terms of reducing the queueing time for high-priority jobs. We have also demonstrated that the ULFM approach enables faster reconfiguration, and exhibited how even the checkpoint restart solution can help a system cope with high priority jobs. Through a synthetic benchmark, we have illustrated how malleability, in conjunction with a scheduler aware of this capability, can enable the execution of interactive simulations on HPC systems. Future work will aim at extending these techniques in the presence of other types of jobs, such as Map-Reduce or Spark applications, in order to maximize resource utilization while allocating resources to both data analytics and MPI jobs simultaneously.

6. REFERENCES

[1] Message Passing Interface Forum. http://www.mpi-forum.org/. Acc.: 05-11-2015.
[2] O. Beaumont, V. Boudet, F. Rastello, and Y. Robert. Matrix multiplication on heterogeneous platforms. IEEE Transactions on Parallel and Distributed Systems, 12(10):1033–1051, Oct 2001.
[3] L. Blackford, J. Choi, A. Cleary, E. D'Azevedo, J. Demmel, I. Dhillon, J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and R. Whaley. ScaLAPACK Users' Guide. Society for Industrial and Applied Mathematics, 1997.
[4] W. Bland, A. Bouteiller, T. Herault, J. Hursey, G. Bosilca, and J. Dongarra. An Evaluation of User-Level Failure Mitigation Support in MPI. In Recent Advances in the Message Passing Interface, volume 7490 of Lecture Notes in Computer Science, pages 193–203. Springer Berlin Heidelberg, 2012.
[5] J. Buisson, O. Sonmez, H. Mohamed, W. Lammers, and D. Epema. Scheduling malleable applications in multicluster systems. In Proc. 2007 IEEE International Conference on Cluster Computing, Sept. 2007.
[6] T. Desell, K. E. Maghraoui, and C. A. Varela. Malleable Applications for Scalable High Performance Computing. Cluster Computing, 10(3):323–337, Sept. 2007.
[7] K. El Maghraoui, T. J. Desell, B. K. Szymanski, and C. A. Varela. Malleable Iterative MPI Applications. Concurrency and Computation: Practice and Experience, 21(3):393–413, Mar. 2009.
[8] P. H. Hargrove and J. C. Duell. Berkeley Lab Checkpoint/Restart (BLCR) for Linux clusters. In Journal of Physics: Conference Series, volume 46, page 494. IOP Publishing, 2006.


[9] K. Hasanov, J.-N. Quintin, and A. Lastovetsky. Hierarchical approach to optimization of parallel matrix multiplication on large-scale platforms. The Journal of Supercomputing, 71(11):3991–4014, 2015.
[10] J. Hursey, J. M. Squyres, T. I. Mattox, and A. Lumsdaine. The design and implementation of checkpoint/restart process fault tolerance for Open MPI. In Proceedings of the 21st IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE Computer Society, March 2007.
[11] D. Jackson, Q. Snell, and M. Clement. Core Algorithms of the Maui Scheduler. In D. G. Feitelson and L. Rudolph, editors, Job Scheduling Strategies for Parallel Processing, number 2221 in Lecture Notes in Computer Science, pages 87–102. Springer Berlin Heidelberg, June 2001. DOI: 10.1007/3-540-45540-X_6.
[12] L. V. Kale and S. Krishnan. CHARM++: A Portable Concurrent Object Oriented System Based on C++. In Proceedings of the Eighth Annual Conference on Object-oriented Programming Systems, Languages, and Applications, OOPSLA '93, pages 91–108, New York, NY, USA, 1993. ACM.
[13] Y. W. Lim, P. Bhat, and V. Prasanna. Efficient algorithms for block-cyclic redistribution of arrays. In Eighth IEEE Symposium on Parallel and Distributed Processing, pages 74–83, Oct 1996.
[14] G. Martin, D. E. Singh, M.-C. Marinescu, and J. Carretero. Enhancing the performance of malleable MPI applications by using performance-aware dynamic reconfiguration. Parallel Computing, 46:60–77, July 2015.
[15] A. Moody, G. Bronevetsky, K. Mohror, and B. de Supinski. Design, modeling, and evaluation of a scalable multi-level checkpointing system. In 2010 International Conference for High Performance Computing, Networking, Storage and Analysis (SC), pages 1–11, Nov 2010.
[16] J. Poulson, B. Marker, R. A. van de Geijn, J. R. Hammond, and N. A. Romero. Elemental: A new framework for distributed memory dense matrix computations. ACM Trans. Math. Softw., 39(2):13:1–13:24, Feb. 2013.
[17] S. Prabhakaran, M. Neumann, S. Rinke, F. Wolf, A. Gupta, and L. V. Kale. A Batch System with Efficient Adaptive Scheduling for Malleable and Evolving Applications. In 2015 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 429–438. IEEE, 2015.
[18] J.-N. Quintin, K. Hasanov, and A. Lastovetsky. Hierarchical parallel matrix multiplication on large-scale distributed memory platforms. In Proceedings of the 2013 42nd International Conference on Parallel Processing, ICPP '13, pages 754–762. IEEE Computer Society, 2013.
[19] R. A. van de Geijn and J. Watts. SUMMA: Scalable Universal Matrix Multiplication Algorithm. Concurrency: Practice and Experience, 9(4), April 1997.
[20] R. Sudarsan and C. Ribbens. ReSHAPE: A Framework for Dynamic Resizing and Scheduling of Homogeneous Applications in a Parallel Environment.

In International Conference on Parallel Processing (ICPP 2007), pages 44–44, 2007.
[21] R. Sudarsan and C. Ribbens. Scheduling resizable parallel applications. In 2009 IEEE International Symposium on Parallel and Distributed Processing (IPDPS), pages 1–10, May 2009.
[22] G. Utrera, S. Tabik, J. Corbalan, and J. Labarta. A job scheduling approach for multi-core clusters based on virtual malleability. In Proceedings of the 18th International Conference on Parallel Processing, Euro-Par'12, pages 191–203, Berlin, Heidelberg, 2012. Springer-Verlag.
