Scientific Application Performance on HPC, Private and Public Cloud Resources: A Case Study Using Climate, Cardiac Model Codes and the NPB Benchmark Suite
Peter E. Strazdins∗, Jie Cai#, Muhammad Atif# and Joseph Antony#

∗Research School of Computer Science, #National Computational Infrastructure, The Australian National University, Canberra, ACT, Australia

{Peter.Strazdins, Jie.Cai, Muhammad.Atif, Joseph.Antony}@anu.edu.au

Abstract—The ubiquity of on-demand cloud computing resources enables scientific researchers to dynamically provision and consume compute and storage resources in response to science needs, whereas traditional HPC compute resources are often centrally managed, with a priori CPU-time allocations and use policies. A long-term goal of our work is to assess the efficacy of preserving the user environment (compilers, support libraries, runtimes and application codes) available at a traditional HPC facility for deployment into a VM environment, which can then be used in both private and public scientific clouds. This would afford users greater flexibility in choosing hardware resources that better suit their science needs, as well as aiding them in transitioning onto private/public cloud resources. In this paper we present work-in-progress performance results for a set of benchmark kernels and scientific applications running in a traditional HPC environment, an Amazon HPC EC2 cluster and a private VM cluster: the OSU MPI micro-benchmarks, the NAS Parallel macro-benchmarks and two large scientific application codes (the UK Met Office's MetUM global climate model and the Chaste multi-scale computational biology code). We discuss parallel scalability and runtime information obtained using the IPM performance monitoring framework for MPI applications. Initial performance results indicate the importance of (a) the cluster interconnect for parallel scientific computing jobs and (b) avoiding over-subscription of cores, as this affects code scalability, especially since characteristics of the underlying hardware platform (e.g. NUMA) are hidden owing to virtualization. We were also able to successfully build application codes in a traditional HPC environment and package these into VMs which ran on private and public cloud resources.
Index Terms—Scientific Workloads, Cloud Computing, Performance Analysis, Unified Model, Chaste, IPM

I. INTRODUCTION
HPC hardware resources are often procured as a single upfront capital purchase. The expectation is that the vendor-delivered platform will cater for sustained peak loads and run continuously for the duration of its operational lifetime (usually three to five years). While this greatly simplifies administration and aids time-allocation of resources, scientific researchers are keenly investigating and using on-demand

compute/storage capabilities of private and public clouds [1], [2], [3], [4], [5]. The on-demand model of cloud resource provisioning firstly facilitates resource utilization that best fits a user's or scientific group's needs, e.g. rapidly acquiring CPU cycles in order to compute on time-sensitive events/data [1] or accessing specialized science portals [6], [7], [8]. Secondly, it suits users who often do not have parallel computing jobs which would make the most of the tightly coupled compute and storage environments often seen in HPC facilities. In turn this has motivated us towards a long-term goal of assessing the efficacy of packing a traditional HPC environment into VMs (i.e. compilers, support libraries, runtimes and application codes). Our aim is to reduce the barrier to entry for traditional HPC users in opting to use cloud-based resources, either from a private or public science cloud. As a first step in this direction, in this paper we present work-in-progress performance results for two highly memory-intensive scientific application codes: a widely used climate model code from the UK Met Office (MetUM) [9] and a multi-scale computational biology code (Chaste) [10], running in three environments: a production HPC facility, an Amazon HPC EC2 cluster and a private VM cluster. We also present timing and scalability results for the NAS Parallel Benchmarks and the OSU MPI micro-benchmarks. Where applicable we present instrumented performance data obtained using IPM [11], [3] and perform detailed performance analysis.
The rest of this paper is organized as follows. The motivations for our experimental study are given in Section II. Related work is discussed in Section III, which is followed by the experimental setup in Section IV. Evaluation results are presented in Section V, leading to our conclusions and future work in Section VI.

II. MOTIVATION
Supercomputing facilities, such as the NCI National Facility [12] in Australia, provide access for registered users to state-of-the-art cluster computers. Such facilities also provide users

with comprehensive software stacks to enable them to readily install a diverse range of complex applications. However, the supercomputing cluster is typically a highly contended resource, and users are often subjected to limiting usage quotas. Also, some user workloads may not make good use of the cluster, e.g. their communication requirements might be satisfied by a cluster with a commodity network (e.g. 10G Ethernet). Jobs dedicated to debugging and validation also typically do not require the supercomputing cluster. Even when a job is not well suited to the cloud, in times of high demand the use of a cloud as an alternative site may result in a shorter turnaround. In such situations, the users' jobs could be better run on a cheaper private cloud, or even a public cloud.
This however requires that the same comprehensive software stack can be easily replicated in the cloud environments, so that the availability of the cloud facilities comes with little extra effort to the user (and ideally would be transparent). Having the ability to package up a standard HPC working environment into VMs gives HPC centers the ability to cloudburst [13] as a means of responding to peak demand, when local resources are saturated, or when it is simply more cost-effective to do so. In order to achieve this goal, we will first investigate the feasibility of packaging the environment and the performance impacts of doing so for various workloads. This paper serves as a preliminary study in this respect. The second stage will involve developing supporting infrastructure.
Atif et al. [14] implemented a framework (ARRIVE-F) which addresses the issue of heterogeneity in compute farms by utilizing the live migration feature of virtual machine monitors. The framework carries out lightweight 'online' profiling of the CPU, communication and memory subsystems of all the active jobs in the compute farm and is able to predict the execution time of each job on each distinct hardware platform within the compute farm. Based upon the predicted execution times, the framework is able to relocate compute jobs to the best suited hardware platforms such that the overall throughput of the compute farm is increased. Experiments show that ARRIVE-F is able to improve average job waiting times by up to 33%. Using ARRIVE-F metrics, it may be possible to classify candidate workloads that could be run on a cloud resource, rather than tying up resources at a peak HPC facility, for example.

III. RELATED WORK
A considerable body of research exists on the use of virtualization and/or cloud computing in HPC environments. Researchers have tested virtualization in several scenarios to make a case for virtualization in HPC or cloud environments. Some of the related research work is discussed briefly in this section.
In [2] Ramakrishnan et al. present their experiences with a cross-site science cloud running at the Argonne Leadership Computing Facility and at the NERSC Facility. They deployed testbeds for running diverse software stacks aimed at exploring the suitability of cloud technologies for DoE science users.

Some salient findings are that (a) scientific applications have special requirements which call for tailored solutions (e.g. parallel filesystems, legacy datasets and pre-tuned software libraries), requiring special clouds designed for science needs; (b) scientific applications with minimal communication and I/O make the best fit for cloud deployment; and (c) clouds require significant end-user programming and system administrator support.
He et al. [5] find that most clouds they evaluated are optimized for business/enterprise applications. They note the importance of a good system interconnect and the ease-of-use afforded by on-demand resources for science users. Jackson et al. [1] present results from porting a complex astronomy pipeline for detecting supernovae events onto Amazon EC2. They were able to encapsulate complex software dependencies, and note that EC2-like environments present a very different resource environment in comparison to a traditional HPC center, e.g. images not booting up correctly and performance perturbations arising from co-scheduling with other EC2 users. Jackson et al. [3] conduct a thorough performance analysis using applications from the NERSC benchmarking framework. The Community Atmospheric Model from the CCSM code developed by NCAR was one of the benchmarks run. They find there is a strong correlation between communication time and overall performance on EC2 resources, i.e. applications with greater global communication suffer the most. They also find there is significant performance variability between runs.
Evangelinos et al. [15] presented a detailed comparison of MPI implementations, namely LAM/MPI, MPICH, OpenMPI and GridMPI, to test Amazon's HPC cloud. A custom atmosphere-ocean climate model application and the NAS parallel benchmarks were used to evaluate the system. It was concluded that the performance of Amazon's Xen-based cloud is below the level seen at dedicated supercomputer centers; however, performance is comparable to low-cost cluster systems. A significant performance deficiency arises from the messaging performance, which for communication-intensive applications is 50% slower than on similar 'non-cloud' compute infrastructure.
While there have been a number of papers investigating cloud environments for HPC, this paper is different in that (a) our study is motivated by the intention of using cloudbursting to extend a supercomputing facility, (b) we evaluate both public (Amazon EC2 using the StarCluster toolkit) and private cloud environments head-to-head with a supercomputing cluster and (c) we present a detailed performance analysis which includes complex memory-intensive applications.

IV. EXPERIMENTAL SETUP
Experiments are conducted on three different compute platforms, detailed in Table I. The first platform, called DCC, is a virtualized cluster deployed at the NCI National Facility (NCI-NF), housed at the Australian National University, Canberra.

TABLE I
DESCRIPTION OF THE EXPERIMENTAL PLATFORMS USED IN THIS PAPER

Platform           DCC                EC2                Vayu
# of Nodes         8                  4                  1492
Model              Intel Xeon E5520   Intel Xeon X5570   Intel Xeon X5570
Clock Speed        2.27GHz            2.93GHz            2.93GHz
CPU #Cores         8 (2 sockets)      16 (*)             8 (2 sockets)
L2 Cache           8MB (shared)       8MB (shared)       8MB (shared)
Memory per node    40GB               20GB               24GB
Operating System   CentOS 5.7         CentOS 5.7         CentOS 5.7
File System        NFS                NFS                Lustre
Interconnect       1 GigE             10 GigE            QDR IB

* Each EC2 compute instance is assigned two quad-core processors (8 physical cores); with the processors' Hyper-Threading capability these present 16 logical cores in total.

The DCC cluster is a set of guest VMs running under VMware ESX server 4.0, which provides the hypervisor. A set of eight Dell M610 blade servers run VMware ESX server and are exclusively set aside for DCC, i.e. there is one VM per blade and that VM has all of the blade's physical resources allocated to it. Each blade has two Broadcom 55710 10GigE NICs and uses two QLogic fibre channel HBAs to access dedicated fibre-channel storage for VM images. The virtual VMware switch on each blade uses two 10GigE uplinks in a channel-bonded manner, but each vNIC in the DCC VMs uses the Intel E1000 driver (i.e. a 1GigE driver). Packets from the NIC are load-balanced across the two 10GigE links of the vSwitch. All filesystems for the VMs are NFS mounted from an external storage cluster.
The second platform is the Vayu supercomputer, also hosted at the NCI-NF. It was ranked 64th in the June 2011 listing at top500.org. There are 1492 nodes consisting of Sun Oracle X6275 blade servers. Four Sun Oracle DS648 InfiniBand switches provide a fat-tree QDR IB fabric for both compute and storage. An in-house scheduler called ANUPBS is used to manage job submission, using a suspend-resume scheme.
Third, a StarCluster [16] instance was deployed on Amazon's EC2 HPC resource in Amazon's US East datacenter in Virginia. The cc1.4xlarge resource was used, which provides a large-memory HPC instance on EC2 with 10GigE between the VMs, running in a cluster placement group (a logical grouping of VMs which facilitates full-bisection 10GigE communication between the VMs in that group). StarCluster is an open-source toolkit which allows for the launching of custom scientific computing clusters on EC2. It automates the building, configuration and management of compute nodes, allowing an end-user to target standard, HPC and/or GPU compute resources on EC2.
Both DCC and EC2 are virtualized clusters. DCC uses the VMware ESX server [17] hypervisor and EC2 is based on the Xen [18] hypervisor. Each DCC virtual compute node is hosted on distinct physical hardware: i.e. an eight-core physical machine on the DCC infrastructure hosts a guest VM containing eight CPUs. No compute node is oversubscribed. The EC2 cluster nodes have eight physical cores with Hyper-Threading enabled.
On Vayu, system-wide application compilers, support libraries, runtimes and application codes are configured and installed into the /apps directory. The modules software package (http://modules.sourceforge.net/) is then used to manage versions and append appropriate environment variables. This allows us to build application codes on Vayu within a user's home/project directories and then rsync the requisite libraries and runtimes (into /apps) on a VM, and the application binaries into the home/project directories on the VM, which is then deployed either on the private VM cluster or on EC2 instances.
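The paper does not include the deployment scripts themselves; the following is a minimal sketch of the kind of packaging step described above, assuming passwordless SSH to the target VM and that the module trees under /apps are self-contained. All host names, paths and module versions here are illustrative assumptions, not the actual NCI-NF tooling.

```python
#!/usr/bin/env python
# Hypothetical sketch: replicate a Vayu-style /apps software stack and the
# user's built binaries onto a freshly booted VM, as described above.
import os
import subprocess

VM_HOST = "vm-node-0"                                   # assumed VM host name
APP_TREES = ["/apps/openmpi/1.4.3", "/apps/intel-fc/11.1.072",
             "/apps/modules"]                           # assumed module trees
USER_TREES = ["~/projects/metum", "~/projects/chaste"]  # assumed build dirs

def rsync(src, dest):
    """Mirror one directory tree onto the VM, preserving links/permissions."""
    subprocess.check_call(["rsync", "-az", "--delete", src, dest])

for tree in APP_TREES:
    # Keep the same path on the VM so module files resolve without editing.
    rsync(tree + "/", "%s:%s/" % (VM_HOST, tree))
for tree in USER_TREES:
    rsync(os.path.expanduser(tree) + "/", "%s:%s/" % (VM_HOST, tree))
```

Keeping identical paths on the VM is the design point: the same `module load` commands and binary paths then work unchanged on Vayu, the private cluster and EC2.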

V. RESULTS
The NAS Parallel Benchmark (NPB) [19] MPI suite version 3.3, the OSU MPI benchmarks [20] and two memory-intensive simulation applications (Chaste [10] and UM [9]) are used to evaluate and compare the platforms described in Table I.

A. OSU MPI Benchmarks
The OSU MPI bandwidth and latency benchmarks [20] measure the sustained message passing bandwidth and latency between two compute nodes. In Figures 1 and 2, the x-axis represents message size and the y-axis bandwidth (MB/s) and latency (µs) respectively, using a log scale.
With respect to the bandwidth tests, the Vayu cluster uses a high performance QDR InfiniBand interconnect; results show significantly (more than one order of magnitude) higher bandwidth for any given message size. This is followed by the EC2 cluster, which utilizes 10 Gigabit Ethernet, showing a peak bandwidth of ∼560MB/s for 256KB messages. DCC uses a Gigabit Ethernet interface providing a peak bandwidth of ∼190MB/s.
In the latency tests, EC2 and Vayu show the expected latencies across message sizes, but the latencies observed on DCC fluctuated from 1 byte to 512KB messages. We think this is due to CPU scheduling in the VMware hypervisor, as networking is done through a proprietary software switch.
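The OSU benchmarks themselves are C/MPI codes; as a rough sketch of the measurement pattern they embody (not the actual OSU code), a ping-pong loop between two ranks, assuming mpi4py is available, looks like the following:

```python
# Minimal ping-pong sketch of the bandwidth/latency measurement pattern
# used by micro-benchmarks such as the OSU suite (illustrative only).
# Run with exactly two ranks, one per node: mpirun -np 2 python pingpong.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
REPS = 100

for size in [2**k for k in range(0, 22)]:        # 1 byte .. 2MB
    buf = bytearray(size)
    comm.Barrier()                               # synchronize before timing
    t0 = MPI.Wtime()
    for _ in range(REPS):
        if rank == 0:
            comm.Send(buf, dest=1); comm.Recv(buf, source=1)
        elif rank == 1:
            comm.Recv(buf, source=0); comm.Send(buf, dest=0)
    t1 = MPI.Wtime()
    if rank == 0:
        rtt = (t1 - t0) / REPS                   # round-trip time per message
        print("%8d bytes: latency %10.2f us, bandwidth %8.1f MB/s"
              % (size, rtt / 2 * 1e6, size / (rtt / 2) / 1e6))
```

Note that the real OSU bandwidth test uses a window of outstanding non-blocking sends rather than strict ping-pong, so its peak bandwidth figures are higher than this sketch would report.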

[Fig. 1. OSU MPI bandwidth tests for DCC, EC2 and Vayu clusters (x-axis: message size, 1 byte to 2M; y-axis: bandwidth in MB/sec, log scale; curves: dcc GigE, EC2 10GigE, vayu QDR IB)]
[Fig. 3. NPB (class B) execution time on EC2 and Vayu normalized w.r.t. the DCC cluster. Absolute walltimes for single-process runs on DCC (sec): BT.B 1696.9, EP.B 141.5, CG.B 244.9, FT.B 327.6, IS.B 8.6, LU.B 1514.7, MG.B 72.0, SP.B 1936.1]

[Fig. 4. Speedup scalability comparison for DCC, EC2 and Vayu clusters for NPB MPI class 'B'. One panel per benchmark (BT, EP, FT, CG, IS, LU, MG, SP); x-axis: # of cores (1 to 64); y-axis: speedup, log scale; curves: dcc GigE, EC2 10GigE, vayu QDR IB]

[Fig. 2. OSU MPI latency tests for DCC, EC2 and Vayu clusters (x-axis: message size, 1 byte to 2M; y-axis: latency in microseconds, log scale; curves: dcc GigE, EC2 10GigE, vayu QDR IB)]

TABLE II
IPM REPORTED PERCENTAGE COMMUNICATION (%COMM) IN SELECTED NPB MPI BENCHMARKS

             CG                    FT                    IS
np     DCC   EC2    VU       DCC   EC2    VU       DCC   EC2    VU
2      1.5   1.2    0.9      2.5   2.1    1.9      6.3   4.6    4.4
4      5.3   3.0    1.9      3.6   3.4    2.9      8.6   7.4    8.2
8     68.3   5.1    3.8      8.3   5.4    4.2     14.2  13.5   12.9
16    85.7   9.4    8.5     59.3   7.2    7.7     82.4  19.2   22.1
32    78.0  38.8   12.5     75.7  38.2   12.5     88.3  58.9   44.4
64    90.3  58.0   21.7     84.4  55.3   20.8     98.1  84.9   68.2

B. NAS Parallel Benchmarks
The NAS Parallel Benchmarks (NPB) class 'B' are used to determine the impact of different CPU and communication loads on the three platforms. The NPB are a small set of programs designed to help evaluate the performance of parallel supercomputers. These macro-benchmarks, which are derived from computational fluid dynamics (CFD) applications, consist of five kernels and three pseudo-applications.
Figure 3 presents the single-process elapsed time of all NPB benchmarks on DCC and plots the elapsed times on Vayu and EC2 normalized with respect to that on DCC. Speedup curves for the three platforms, as measured with the NPB MPI suite, are presented in Figure 4.
In the case of the EP benchmark, where there is no communication, Vayu and DCC show almost linear speedups, whereas the results on EC2 fluctuate but maintain an upward trend. We suspect the fluctuation is due to CPU scheduling in the Xen hypervisor and system jitter brought on by the use of Hyper-Threading.
For the FT benchmark we see Vayu scaling almost linearly, whereas DCC and EC2 do not scale as well. In Table II we report the percentage of total walltime spent in communication, using IPM, for selected NPB macro-benchmarks. DCC and EC2 show a significantly higher percentage of walltime being spent in communication. Particularly for DCC, we see performance dropping from 8 processes to 16 processes. Each compute node on DCC has 8 cores, hence a 16-process NPB kernel runs on two distinct compute nodes, resulting in use of the GigE interfaces for internode communication. We notice that speedup begins to improve from 16 processes onwards. This arises because the message size for MPI_Alltoall communication decreases with an increase in the number of processes, resulting in reduced communication overhead.
Interestingly, the EC2 cluster drops in performance at 16 cores rather than the expected 32 cores, as each compute node on the EC2 cluster has 16 cores. This is due to the Hyper-Threading and communication overhead of the Xen hypervisor, as detailed in [21]. The reference also notes that when running a communication-intensive benchmark, not having a dedicated CPU/core for the hypervisor to process packets results in degraded performance. Similar trends were observed for BT, MG, LU and SP.
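To make the MPI_Alltoall effect concrete: with a fixed global problem size block-distributed over P ranks, an all-to-all exchange sends on the order of N/P² elements to each partner, so the per-message size shrinks quadratically as P grows. The array size below is an arbitrary assumption for illustration, not the NPB class 'B' geometry:

```python
# Why all-to-all message sizes shrink as process counts grow (illustrative).
# With N total elements over P ranks, MPI_Alltoall sends roughly N/P**2
# elements to each of the P-1 partners.
N = 2**24                       # assumed global element count (not NPB's)
for P in [2, 4, 8, 16, 32, 64]:
    per_partner = N // (P * P)  # elements per pairwise message
    print("P=%2d: %8d elements (%8.3f MB) per partner message"
          % (P, per_partner, per_partner * 8 / 1e6))  # assuming 8B elements
```

Smaller per-partner messages reduce the bandwidth demand on the GigE links, which is consistent with DCC's speedup recovering beyond 16 processes.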

The IS benchmark is communication intensive and does not scale well on any of the clusters. In Table II, DCC spends almost all of its walltime in communication at 64 processes, and EC2 spends ∼85%. The performance of Vayu on this benchmark is also not linear, and the curve starts to drop slightly from 32 processes; we note that the communication percentage increased to ∼45% on Vayu at that point.
For the CG benchmark, Vayu and EC2 show similar trends as for the other NPB benchmarks. In contrast, speedup drops at 8 processes for DCC. We contend this arises due to NUMA effects: as the VMware ESX hypervisor masks the NUMA topology from guest VMs, applications and supporting runtimes (e.g. OpenMPI) are unable to make judicious thread and memory placement decisions [22], [23], [24]. CG is memory bound and its inter-process communication references remote memory frequently; better speedups could be obtained if memory-aware thread placement were carried out.

C. Applications
We chose memory-intensive benchmarks for this study, using the UK Met Office Unified Model (MetUM) [9] version 7.8 and the Chaste cardiac simulation [10] version 2.1 packages. The MetUM benchmark uses an N320L70 (640 × 481 × 70) grid to model the Earth's global atmosphere. MetUM is used for operational weather forecasts in a number of countries including the UK and Australia; it is also used for climate simulations. The Chaste benchmark uses a high-resolution rabbit heart model (≈ 4 million nodes, 24 million elements); rather surprisingly, its memory usage is slightly greater than that of the MetUM benchmark. The benchmark simulates the electrical and mechanical properties of the heart after an electric stimulus is applied. In both of these benchmarks, the dominant component is solving a linear system at each timestep.
Our performance analysis methodology followed that of detailed previous studies carried out on the Vayu cluster [25], [26]. Except where otherwise noted, the benchmarks were configured as for those studies. The methodology uses low-overhead internal timers which profile the major internal sections of the application separately. We use this in conjunction with the IPM performance monitoring tool [27], which enables a per-section analysis of MPI calls.
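The internal timers referred to above belong to the applications' own instrumentation [25], [26]; as a generic sketch of the idea (a hypothetical helper, assuming mpi4py, not the MetUM or Chaste timer code), per-section wall times can be accumulated around the major phases:

```python
# Generic sketch of low-overhead per-section timing of the kind used in the
# methodology above (hypothetical helper, not the actual application timers).
from collections import defaultdict
from mpi4py import MPI

section_time = defaultdict(float)   # per-rank accumulated seconds per section

class Section:
    """Context manager that accumulates wall time for a named section."""
    def __init__(self, name):
        self.name = name
    def __enter__(self):
        self.t0 = MPI.Wtime()
    def __exit__(self, *exc):
        section_time[self.name] += MPI.Wtime() - self.t0

# Usage inside the timestep loop, e.g. around the linear solve:
#   with Section("KSp"):
#       solve_linear_system()
# At the end of the run, section_time holds per-rank totals for each section,
# which can then be correlated with IPM's per-region MPI statistics.
```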

[Fig. 5. Speedup of Chaste and its KSp solver section on the Vayu and DCC platforms (x-axis: number of cores, 8 to 64; y-axis: speedup over 8 cores; curves: vayu total (t8=1599), dcc total (t8=1017), vayu KSp (t8=938), dcc KSp (t8=579); 't8' is the execution time in seconds on 8 cores)]

To conserve compute resources, the benchmarks were run with the minimal number of iterations required to accurately project long-term simulations [25]; for this study, this also emphasizes the effects of I/O (namely, the initial reading in of the models' data files: 1.6 GB for MetUM, 1.4 GB for Chaste). Due to small micro-architectural differences on the Vayu nodes (SSE4 instructions), the benchmarks had to be compiled separately on Vayu; however, the same binaries were used for the DCC and EC2 VM instances. The following sections detail the results for both benchmarks. Each run was repeated 5 times, with the minimum time being used for the results. Except where otherwise noted, all runs had processes fully subscribing each core.

1) Chaste Cardiac Simulation: The Chaste benchmark was built with the Intel 11.1.046 icpc compiler (found to be about 50% faster than version 11.1.072), and configured for a simulation duration of 2.0ms (250 timesteps) using a conjugate-gradient linear solver. Due to the high number of dependencies in the Chaste software, we were not able to install Chaste on the EC2 cloud in the available time.
Speedup results are given for Vayu and DCC in Figure 5. We see much poorer scaling performance on DCC; it can be seen that the scaling of the KSp linear solver section determines the trends in overall behavior. It should be noted that the KSp section scales up to 1024 cores on Vayu [26]; the flat trend from 48 to 64 cores does not continue. The input mesh section was 1.37 times faster on Vayu, and scaled identically on both systems (1.25 speedup at 64 cores over 8 cores). At 8 cores, the output routine was 2.6 times faster on Vayu; surprisingly, however, its performance remained constant on DCC, but scaled inversely on Vayu.
An IPM analysis on 32 cores indicated that the benchmark spent 48% of its time in communication on DCC, and only 11% on Vayu. DCC's computation time was 1.5 times that of Vayu, which is close to the corresponding ratio of cycle times on the nodes of 1.3, and this ratio ranged between 1.0 and 1.7 across different sections. Across sections, the ratio of communication times of DCC over Vayu ranged between 2 and 19, and was 13 on the KSp section. The ratio was high on sections where MPI time was dominated by large numbers of collective communications; for example, the communication of the KSp section is entirely 4-byte all-reduce operations. IPM profiles of communication and computation times, such as those illustrated in Figure 7, indicated a greater degree and a higher irregularity of load imbalance on DCC. This, together with the difference in overall communication costs, suggests that performance on the DCC is hurt mainly by high message latencies and, to a lesser extent, by load imbalance caused by jitter.

2) MetUM Global Atmosphere Simulation: The MetUM benchmark was built with the Intel 11.1.072 ifort compiler, with a simulation duration of 2.5 hours (18 timesteps). It should be noted that the benchmark is configured to generate no output data; thus, the only I/O is from the initial reading of the dump file. For the EC2 cloud, memory constraints meant that it could not be run on fewer than 2 nodes (for 24 processes, three nodes had to be used).
The speedups for the 'warmed' simulation time (which is representative of a long time-period simulation [25]) are given in Figure 6. For the EC2 cloud, noting the constraints above, processes were evenly distributed across the nodes. The benchmark on Vayu scales almost linearly; that on DCC less so; and scaling appears poor on the EC2 cloud. The plot EC2-4 is for 4-node runs; for fewer than 64 cores, these are always significantly faster. For 32 cores, using 4 nodes versus two is almost twice as fast. An IPM analysis reveals that the difference is highly uniform across all sections, and almost the same in both communication and computation times. This indicates that little benefit was gained from Hyper-Threading, either in computation or communication. Taking this into account, EC2 may still scale beyond 4 nodes, as the tail of the EC2 plot suggests.

TABLE III
STATISTICS FOR UM FOR 32 CORES. rcomp (rcomm) IS THE RATIO OF COMPUTATION (COMMUNICATION) TIME RELATIVE TO VAYU, '%COMM' IS THE PERCENTAGE TIME SPENT IN COMMUNICATION AND '%IMBAL' IS THE PERCENTAGE OF OVERALL LOAD IMBALANCE

            Vayu    DCC     EC2     EC2-4
time (s)    303     624     770     380
rcomp       1.0     1.37    2.39    1.17
rcomm       1.0     6.71    3.53    1.
%comm       13      42      18      18
%imbal      13      4       18      19
I/O (s)     4.5     37.8    9.1     7.6

Table III gives details of the IPM analysis at 32 cores. The ratio of computation times closely reflects the ratio of clock frequencies on DCC, and this was quite uniform across all sections. On EC2, computation speed was similar to Vayu provided that the nodes were not fully subscribed. DCC also spent the most time in communication, reflecting its network of higher latency and lower bandwidth. This was reflected particularly in the sections which were dominated by a high number of collectives, where DCC spent much more MPI time in these operations.
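The 'speedup over 8 cores' metric plotted in Figures 5 and 6 is simply each run's time normalized by the same platform's 8-core time t8. A trivial sketch, where the t8 values come from the Figure 6 legend but the larger-core timing in the usage example is a made-up placeholder, not a measured result:

```python
# Speedup normalized to the 8-core run, as plotted in Figs. 5 and 6.
# t8 values are from the Fig. 6 legend; other timings here are placeholders.
t8 = {"vayu": 963.0, "dcc": 1486.0, "EC2": 812.0, "EC2-4": 646.0}

def speedup_over_8(platform, time_s):
    """S(p) = t8 / t(p); by construction S(8) == 1 for every platform."""
    return t8[platform] / time_s

print(speedup_over_8("vayu", 500.0))   # hypothetical 16-core walltime
```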

[Fig. 6. Speedup of UM ('warmed' execution time) on the Vayu, DCC and EC2 platforms (x-axis: number of cores, 8 to 64; y-axis: speedup over 8 cores; curves: vayu (t8=963), dcc total (t8=1486), EC2 total (t8=812), EC2-4 total (t8=646); 't8' is the execution time in seconds on 8 cores)]

While the overall (averaged) load imbalance was least on DCC, when considered across sections, the load imbalance was generally of a higher degree and more irregular on DCC, as was the case for Chaste. Similarly, a greater proportion of the time on DCC was spent in collective communications. Also, as for Chaste, a much lower (read-only) I/O speed was observed on DCC; it was, however, more comparable on the EC2 cloud. Both platforms used NFS to share filesystems.
Figure 7 gives per-process computation and communication time profiles at 32 cores for the 'ATM_STEP' section of the computation, as generated by the IPM tool. It can be seen that not only is the communication time in far greater proportion on DCC, it is primarily in system time. These traits were seen in other sections. For this section, we also see an imbalance in communication time across processes 8 to 23. We also see more imbalance in computation times (e.g. process 18) on DCC, which is suspected to be due to NUMA effects: NUMA affinity is enforced by the version of OpenMPI used on Vayu [25], but this is not currently possible on DCC. Surprisingly, on the EC2 cloud, the profiles resembled those on Vayu much more than those on DCC (although the communication times were in slightly higher proportion), even showing the same pattern of imbalance in the 'ATM_STEP' section. As for DCC, the reported system time closely reflects the communication time.

[Fig. 7. Time breakdown and load balance of UM for 32 cores: (a) Vayu, (b) DCC]

VI. CONCLUSION AND FUTURE WORK
In this paper, we used the OSU MPI micro-benchmarks and the NPB MPI macro-benchmarks to characterize performance features of three different platforms: a peak supercomputing facility hosted at the NCI-NF, a private cloud based virtual cluster and an Amazon EC2 cloud based HPC cluster. Importantly, two large-scale applications, MetUM and Chaste, were also evaluated and analyzed on the three platforms.

We were able to successfully create x86-64 application binaries on an HPC system (Vayu) and replicate its software dependencies into a VM, allowing us to use these binaries on a private VM cluster and on an HPC cluster running on Amazon's EC2. The only barrier we encountered was the use of non-ubiquitous features such as SSE4 instructions in one application, which can be avoided by the selection of suitable compilation switches.
Results for the OSU MPI micro-benchmarks and the NPB class 'B' macro-benchmarks were presented. The key finding, which correlates with related work, was the importance of the interconnect: communication-bound applications, especially those which used short messages, were at a disadvantage on the two virtualized platforms. This was corroborated by the per-section analysis of the two applications. We also discovered the need to avoid over-subscription of cores, as this affects code scalability, especially since characteristics of the underlying hardware platform (e.g. NUMA) are hidden owing to virtualization. While our applications were not strongly I/O intensive, the performance analysis indicated that the underlying filesystem is also important. We saw only minor effects (e.g. jitter) that were directly attributable to virtualization.
In the near future, we are planning to use metrics from the ARRIVE-F framework to assess candidate workloads, which currently run on HPC systems like Vayu, to be spawned on private/public science clouds. Using StarCluster we are also planning to cloudburst onto OpenStack-based cloud resources locally, within Australia, and onto commercial providers. Additionally, we plan to integrate Amazon EC2 spot-pricing into our local ANUPBS scheduler, to avail of price-competitive compute resources.

ACKNOWLEDGEMENTS
The authors would like to thank Michael Chapman, David Singleton, Robin Humble, Ahmed El Zein, Ben Evans and Lindsay Botten at the NCI-NF for their support and encouragement.

REFERENCES
[1] K. R. Jackson, L. Ramakrishnan, K. J. Runge, and R. C. Thomas, "Seeking supernovae in the clouds: a performance study," in Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, ser. HPDC '10. New York, NY, USA: ACM, 2010, pp. 421-429. [Online]. Available: http://doi.acm.org/10.1145/1851476.1851538
[2] L. Ramakrishnan, P. T. Zbiegel, S. Campbell, R. Bradshaw, R. S. Canon, S. Coghlan, I. Sakrejda, N. Desai, T. Declerck, and A. Liu, "Magellan: experiences from a science cloud," in Proceedings of the 2nd International Workshop on Scientific Cloud Computing, ser. ScienceCloud '11. New York, NY, USA: ACM, 2011, pp. 49-58. [Online]. Available: http://doi.acm.org/10.1145/1996109.1996119
[3] K. Jackson, L. Ramakrishnan, K. Muriki, S. Canon, S. Cholia, J. Shalf, H. Wasserman, and N. Wright, "Performance Analysis of High Performance Computing Applications on the Amazon Web Services Cloud," in Cloud Computing Technology and Science (CloudCom), 2010 IEEE Second International Conference on, Nov. 30-Dec. 3 2010, pp. 159-168.


[4] Y. Zhai, M. Liu, J. Zhai, X. Ma, and W. Chen, "Cloud versus in-house cluster: evaluating Amazon cluster compute instances for running MPI applications," in State of the Practice Reports, ser. SC '11. New York, NY, USA: ACM, 2011, pp. 11:1-11:10. [Online]. Available: http://doi.acm.org/10.1145/2063348.2063363
[5] Q. He, S. Zhou, B. Kobler, D. Duffy, and T. McGlynn, "Case study for running HPC applications in public clouds," in Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, ser. HPDC '10. New York, NY, USA: ACM, 2010, pp. 395-401. [Online]. Available: http://doi.acm.org/10.1145/1851476.1851535
[6] J. Goecks, A. Nekrutenko, J. Taylor, and The Galaxy Team, "Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences," Genome Biol., vol. 11, p. R86, 2010.
[7] J. Li, M. Humphrey, D. Agarwal, K. Jackson, C. van Ingen, and Y. Ryu, "eScience in the cloud: A MODIS satellite data reprojection and reduction pipeline in the Windows Azure platform," in Parallel & Distributed Processing (IPDPS), 2010 IEEE International Symposium on, April 2010, pp. 1-10.
[8] A. Thakar and A. Szalay, "Migrating a large science database to the cloud," in Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, ser. HPDC '10. New York, NY, USA: ACM, 2010, pp. 430-434. [Online]. Available: http://doi.acm.org/10.1145/1851476.1851539
[9] T. Davies, M. J. P. Cullen, A. J. Malcolm, M. H. Mawson, A. Staniforth, A. A. White, and N. Wood, "A new dynamical core for the Met Office's global and regional modelling of the atmosphere," Q. J. R. Meteorol. Soc., vol. 131, pp. 1759-1782, 2005.
[10] J. Pitt-Francis et al., "Chaste: A test-driven approach to software development for biological modelling," Computer Physics Communications, vol. 180, no. 12, pp. 2452-2471, 2009.
[11] D. Skinner, "Performance monitoring of parallel scientific applications," Lawrence Berkeley National Laboratory, Tech. Rep. LBNL-PUB-5503, 2005.
[12] "The NCI National Supercomputing Facility," 2012. [Online]. Available: http://nf.nci.org.au/
[13] L. Ramakrishnan, K. R. Jackson, S. Canon, S. Cholia, and J. Shalf, "Defining future platform requirements for e-Science clouds," in Proceedings of the 1st ACM Symposium on Cloud Computing, ser. SoCC '10. New York, NY, USA: ACM, 2010, pp. 101-106. [Online]. Available: http://doi.acm.org/10.1145/1807128.1807145
[14] M. Atif and P. Strazdins, "Adaptive resource remapping through live migration of virtual machines," in Proceedings of the 11th International Conference on Algorithms and Architectures for Parallel Processing - Volume Part I, ser. ICA3PP'11. Berlin, Heidelberg: Springer-Verlag, 2011, pp. 129-143. [Online]. Available: http://dl.acm.org/citation.cfm?id=2075416.2075430

[15] C. Evangelinos and C. N. Hill, "Cloud Computing for parallel Scientific HPC Applications: Feasibility of Running Coupled Atmosphere-Ocean Climate Models on Amazon's EC2," Cloud Computing and Its Applications, October 2008. [Online]. Available: http://www.cca08.org/speakers/evangelinos.php
[16] MIT STAR, "MIT StarCluster." [Online]. Available: http://web.mit.edu/star/cluster/index.html
[17] "Performance of VMware VMI," Technical Paper, VMware Inc., 2008.
[18] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield, "Xen and the art of virtualization," in Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, ser. SOSP '03. New York, NY, USA: ACM, 2003, pp. 164-177. [Online]. Available: http://doi.acm.org/10.1145/945445.945462
[19] "NAS Parallel Benchmarks," Sep 2010. [Online]. Available: http://www.nas.nasa.gov/Software/NPB/
[20] "OSU Benchmarks," Oct 2010. [Online]. Available: http://nowlab.cse.ohio-state.edu/projects/mpi-iba/
[21] M. Atif and P. Strazdins, "An evaluation of multiple communication interfaces in virtualized SMP clusters," in HPCVirt '09: Proceedings of the 3rd ACM Workshop on System-level Virtualization for High Performance Computing. ACM, 2009, pp. 9-16.
[22] R. Yang, J. Antony, A. P. Rendell, D. Robson, and P. E. Strazdins, "Profiling Directed NUMA Optimization on Linux Systems: A Case Study of the Gaussian Computational Chemistry Code," in IPDPS. IEEE, 2011, pp. 1046-1057.
[23] R. Yang, J. Antony, P. P. Janes, and A. P. Rendell, "Memory and Thread Placement Effects as a Function of Cache Usage: A Study of the Gaussian Chemistry Code on the SunFire X4600 M2," in ISPAN. IEEE Computer Society, 2008, pp. 31-36.
[24] J. Antony, P. P. Janes, and A. P. Rendell, "Exploring Thread and Memory Placement on NUMA Architectures: Solaris and Linux, UltraSPARC/FirePlane and Opteron/HyperTransport," in HiPC, ser. Lecture Notes in Computer Science, Y. Robert, M. Parashar, R. Badrinath, and V. K. Prasanna, Eds., vol. 4297. Springer, 2006, pp. 338-352.
[25] P. E. Strazdins, M. Kahn, J. Henrichs, T. Pugh, and M. Rezny, "Profiling Methodology and Performance Tuning of the Met Office Unified Model for Weather and Climate Simulations," in 25th IEEE International Parallel and Distributed Processing Symposium Workshops. Anchorage: IEEE, May 2011.
[26] P. Strazdins and M. Hegland, "Performance Analysis of a Cardiac Simulation Code Using IPM," in Proc. of the 2011 ACM/IEEE Conference on Supercomputing Workshops. ACM, 2011.
[27] N. J. Wright, W. Pfeiffer, and A. Snavely, "Characterizing Parallel Scaling of Scientific Applications using IPM," in Proc. of the 10th LCI International Conference on High-Performance Clustered Computing, Mar. 2009.
