BC-MPI: Running an MPI Application on Multiple Clusters with BeesyCluster Connectivity⋆

Paweł Czarnul

Faculty of Electronics, Telecommunications and Informatics, Gdansk University of Technology, Poland
[email protected], http://fox.eti.pg.gda.pl/~pczarnul
Abstract. A new software package, BC-MPI, which allows an MPI application to run on several clusters with various MPI implementations is presented. It uses vendor MPI implementations for communication inside clusters and exploits the multithreaded MPI_THREAD_MULTIPLE mode for handling inter-cluster communication in additional threads of the MPI application. Furthermore, a BC-MPI application can be automatically compiled and started by the BeesyCluster middleware. The latter allows users to manage and use cluster accounts via a single BeesyCluster account and WWW or Web Services. The middleware connects to clusters via SSH and does not require installation of any software on the clusters. Results of various latency and bandwidth tests for intra- and inter-cluster communication are presented for BC-MPI using OpenMPI and LAM/MPI and Infiniband or TCP within clusters.

Keywords: WAN-aware MPI, threads and MPI, grid middleware
1 Introduction

In recent years we have observed a growth of interest in both highly parallel software solutions and bridging clusters using grid technology. The former included progress towards thread-safe MPI implementations (OpenMPI), work on transparent checkpointing for MPI applications (LAM/MPI/BLCR or MPICH-V), failure-proof parallel algorithms in view of large parallel machines like IBM's BlueGene, mixed shared-memory and MPI programming, etc. The latter resulted in releases of high-level grid systems like CrossGrid ([1]) and CLUSTERIX ([2]) and also MPI implementations for WANs like MPICH-G2 ([3]) based on the grid middleware Globus Toolkit ([4]). Still, shortcomings can be identified, especially regarding:
– difficulty of installation, complex configuration and version compatibility of grid middleware,
– limited availability of high-level grid systems for a broader community,
– a usually complex process for middleware configuration, setup, distribution of account credentials, etc.

⋆ Partially covered by MNiI grant No. N516 035 31/3499.
2 Related Work and Motivations

The author proposes a software package, BC-MPI, that allows running an MPI application over several distributed clusters, using different MPI implementations on various clusters if needed. The package consists of a library, a compilation script replacing mpicc to preprocess the source and replace MPI_* calls with the ones shipped with BC-MPI, as well as (optional) TCP forwarders for inter-cluster communication.

There are MPI implementations available that allow an MPI application to span several clusters. Examples include MPICH-G2 ([3]), which uses Globus ([4]) for job control (startup, monitoring, termination) and TCP messaging ([5]). PACX-MPI ([6]) and LAM/MPI ([7]) can use Globus to couple remote clusters and start the application. Interoperable MPI ([8]) defines a protocol for communication between MPI implementations. Compared to other MPIs for WANs like MPICH-G2 or PACX-MPI, the benefits of the proposed solution include:

1. BC-MPI is designed to exploit multithreading (the MPI_THREAD_MULTIPLE model, [9]), as additional threads are used for handling inter-cluster communication in MPI processes. In BC-MPI each MPI process is multithreaded and one process per cluster serves as a proxy for inter-cluster communication, unlike in PACX-MPI, where additional processes acting as cluster proxies are created. This is discussed in Section 3; a minimal initialization sketch is given below this list.
2. BC-MPI can use any MPI implementation on the cluster side. This makes it possible to exploit additional features of the particular MPI implementation, if necessary.
3. The BC-MPI application can be started using the BeesyCluster middleware. Although the package is also meant as standalone software with manual startup on remote clusters (much like PACX-MPI in [6]), it can benefit from automatic and secure startup via BeesyCluster ([10]), which is a J2EE application and middleware to clusters exposing WWW and Web Service interfaces. Contrary to using Globus, this approach does not require installation of any parts of the middleware on remote clusters, since BeesyCluster accesses user accounts on such clusters via the JSch SSH library. As such, it only requires that the BeesyCluster user provides system logins/passwords to accounts on the remote clusters they would like to use through BeesyCluster. This makes it possible to start an MPI application on several clusters via a single account in BeesyCluster to which several registered user accounts from clusters are mapped. Complex middlewares like Globus are more prone to modifications, while plain Web Services are a well-established standard; [5] lists specific Globus versions MPICH-G2 does not operate with.
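As a minimal illustration of point 1 above (an assumed usage pattern, not BC-MPI source code), the following C sketch requests MPI_THREAD_MULTIPLE and starts one additional POSIX thread that is then allowed to call MPI; in BC-MPI such threads handle inter-cluster traffic:

/* Minimal sketch: full thread support plus one helper thread calling MPI. */
#include <mpi.h>
#include <pthread.h>
#include <stdio.h>

static void *intercluster_thread(void *arg)
{
    /* In BC-MPI a thread like this would forward/receive messages to/from
       the cluster proxy; here it only shows that MPI may be called outside
       the main thread when MPI_THREAD_MULTIPLE is granted. */
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("helper thread running in rank %d\n", rank);
    return NULL;
}

int main(int argc, char **argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE)
        fprintf(stderr, "warning: only thread level %d granted\n", provided);

    pthread_t t;
    pthread_create(&t, NULL, intercluster_thread, NULL);
    pthread_join(t, NULL);

    MPI_Finalize();
    return 0;
}

If the implementation grants only a lower level (as LAM/MPI does with MPI_THREAD_SERIALIZED), calls from additional threads have to be serialized externally, which is exactly the distinction exploited in the tests of Section 5.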
3 Architecture and Design of the Proposed Solution

3.1 Architecture of a BC-MPI Application

The architecture of a BC-MPI application is depicted in Figure 1. BC-MPI uses multithreaded MPI processes, one of which acts as a proxy for its cluster, rather than distinguished proxy processes as e.g. in PACX-MPI ([6]). If the cluster configuration requires access via a dedicated node, a TCP forwarder process can be launched there (a simple relay sketch follows this paragraph). In the case where MPI is used to forward to/receive from the proxy, the communication links are equivalent to those of PACX-MPI. However, inter-proxy communication is potentially faster in BC-MPI since it requires only TCP communication and not MPI-TCP-MPI as in PACX-MPI. Application threads use MPI for communication within one cluster. BC-MPI does not require any changes in the MPI application source code.
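The optional TCP forwarder mentioned above is, conceptually, a byte relay running on an access node. The sketch below is a simplified, single-direction illustration assuming plain (unencrypted) IPv4 TCP; it is not the actual BC-MPI forwarder, and the command-line interface and omitted error handling are simplifications:

/* Simplified TCP relay: accept one connection from the local proxy and
   forward its bytes to a remote proxy (or to the next forwarder). */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 4) {
        fprintf(stderr, "usage: %s <listen_port> <remote_ip> <remote_port>\n", argv[0]);
        return 1;
    }

    /* listen for the proxy process inside the cluster */
    int ls = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in a = {0};
    a.sin_family = AF_INET;
    a.sin_addr.s_addr = INADDR_ANY;
    a.sin_port = htons(atoi(argv[1]));
    bind(ls, (struct sockaddr *)&a, sizeof a);
    listen(ls, 1);
    int in = accept(ls, NULL, NULL);

    /* connect towards the remote cluster */
    int out = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in r = {0};
    r.sin_family = AF_INET;
    r.sin_port = htons(atoi(argv[3]));
    inet_pton(AF_INET, argv[2], &r.sin_addr);
    connect(out, (struct sockaddr *)&r, sizeof r);

    /* relay bytes from the cluster towards the remote side */
    char buf[64 * 1024];
    ssize_t n;
    while ((n = read(in, buf, sizeof buf)) > 0)
        write(out, buf, (size_t)n);

    close(in); close(out); close(ls);
    return 0;
}

A real forwarder would of course relay traffic in both directions and handle multiple connections; the point here is only that the forwarder adds one extra TCP hop, as seen in Section 5.4.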
Fig. 1: Architecture of a BC-MPI Application (each MPI process contains application, send and receive BC-MPI threads; an optional TCP forwarder connects clusters; vendor MPI is used within a cluster, TCP or MPI towards the proxy, and TCP between clusters)
3.2 BeesyCluster as a Middleware for BC-MPI

BeesyCluster, installed at the Academic Computer Center in Gdansk, Poland (at https://beesycluster2.eti.pg.gda.pl/ek/Main), can be seen as an access portal/middleware to clusters/supercomputers/PCs with WWW (Figures 2, 3) and Web Service ([10]) interfaces. The user can access and use many accounts on various clusters through one account in BeesyCluster (single sign-on). Users can run any commands on clusters, edit files, copy files and directories between clusters, queue or run tasks interactively, and publish actions such as running a parallel or sequential application (run interactively or queued on clusters) or editing a file as services visible to other users via WWW or Web Services. For the use of services, users-providers earn points which can be spent on running services published by others. Services can also be offered free of charge and can be combined into workflows as presented by the author in [11]. BeesyCluster only presumes that user accounts on clusters are accessible via SSH and does not require installation of any software to run or publish services. Further information can be found in [10] or at the aforementioned web site.

Fig. 2: BeesyCluster's File Manager in a Web Browser

Fig. 3: Task's Results in BeesyCluster

In such an environment, in the context of a BC-MPI application the user can:

1. Register existing accounts in a BeesyCluster account in seconds, without the assistance of the remote clusters' administrators.
2. Launch a WAN-aware MPI application (using BC-MPI) on these clusters via BeesyCluster. BeesyCluster can be used to upload application sources to the target clusters, compile them and start the processes of the application on the clusters. In particular, BC-MPI's MPI_Init() can use BeesyCluster's Web Services to launch processes of the application on other clusters (an illustrative sketch follows this list). BeesyCluster's Web Services can be called in a secure way using SOAP/HTTPS and use client authentication/authorization. Specifically, BeesyCluster's Web Services require an initial call to String[] auth = port.logIn(new String[] {"", "", "loginAgentID", "signerID"}) to log in, after which any of the following can be called ([10]): runCommand(auth, cluster, command) for running a command; enqueueJob(auth, cluster, jobPath, minCPU, maxCPU, resultPath, email) for queueing a task using a queueing system on the cluster (queueing details are handled transparently by BeesyCluster); retrFile(auth, cluster, remoteFileName, localFileName) for retrieving a file from the cluster; sendFile(auth, cluster, remoteFileName, localFileName) for sending a file to the cluster. The latter two can be used to download the sources of a BC-MPI application from one cluster and upload them to the others, while the first one can be used to compile the source code. Processes on the other clusters can then be started using one of the first two services. Figure 4 shows the startup sequence (download/upload omitted). One process of the MPI application acts as a proxy waiting for socket connections and forwarding traffic to other clusters. Communication between clusters in a BC-MPI application uses TCP.
3. Optionally consume services made available by other BeesyCluster users from their accounts, if proper rights have been set by the owners. This potentially extends the MPI application with the ability to run external functionality using SOAP/HTTPS.
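The listed Web Service operations are invoked from Java (BeesyCluster is a J2EE application). Purely as an illustration of the startup flow in C, the sketch below shows how an initializing BC-MPI process could delegate remote startup to an external client wrapping those calls; the bc-ws-client command and its options are hypothetical and introduced only for this example:

/* Illustrative sketch only: delegating BeesyCluster Web Service calls
   (logIn, runCommand, ...) to a hypothetical command-line wrapper. */
#include <stdio.h>
#include <stdlib.h>

/* Ask BeesyCluster, through the hypothetical wrapper, to start the
   remaining processes of the application on another cluster. */
static int start_remote_part(const char *cluster, const char *command)
{
    char cmd[512];
    snprintf(cmd, sizeof cmd,
             "bc-ws-client runCommand --cluster %s --command \"%s\"",
             cluster, command);
    return system(cmd);   /* returns the wrapper's exit status */
}

int main(void)
{
    /* e.g. the first process started on one cluster launches the rest on holk */
    if (start_remote_part("holk", "./bcmpirun -np 2 ./app") != 0)
        fprintf(stderr, "remote startup request failed\n");
    return 0;
}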
Fig. 4: Start-up of a BC-MPI Application using BeesyCluster (the BeesyCluster client starts the application on cluster fox via WWW; the BeesyCluster server starts it via SSH on fox and via Web Services and SSH on clusters parowiec and holk; holk: 288 IA-64 processors with Infiniband, parowiec: 16 Xeon processors with Fast Ethernet; BC-MPI inter-cluster communication over TCP)

4 Multithreaded Implementation

Calling MPI functions in threads other than the main one, for forwarding to or receiving from the proxy process for inter-cluster communication, requires proper threading
support from the MPI implementation (possibly different in various clusters). Thus, for the performance tests in this paper, the author tested three versions of the code in terms of multithreading:

1. OpenMPI in the MPI_THREAD_MULTIPLE mode,
2. LAM/MPI in the MPI_THREAD_SERIALIZED mode, with the BC-MPI code including special synchronization to avoid deadlocks, but only for the exemplary code,
3. BC-MPI's TCP, fully multithreaded (usable with any MPI implementation), using sockets for forwarding to/receiving from the proxy for inter-cluster communication; MPI is still used for communication within one cluster.

Apart from the main MPI application threads, the following threads are used in BC-MPI:

1. Receive threads, used to listen both to internal communication (MPI or TCP) from MPI processes and to external clusters (only in the proxy process) via TCP.
2. A sending thread: forwarding to the proxy process, or TCP communication to another cluster, is done in a separate thread. This allows the client side to continue and to implement e.g. MPI_I*send modes, as well as to continue receiving while forwarding data in proxy processes. Data is set in proper structures and a call to pthread_cond_signal() instructs the sending thread to flush the send buffer (a sketch of this hand-off is given at the end of this section).

BC-MPI currently incorporates the following optimizations:

– receiving into the user buffer when a matching receive has already been posted: if a message has not already arrived, an MPI_Recv() call inserts a receive request (with the destination buffer specified by the user) into a table and waits on a condition variable; when the message arrives, it is received directly into the user buffer,
– partitioning and forwarding packets of long messages: the proxy process receives the message it should forward in packets; a received packet is forwarded immediately in another thread while receiving of subsequent packets continues.
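The hand-off between an application thread and the sending thread described in point 2 can be sketched as follows; the names (send_slot_t, bc_post_send) are illustrative rather than BC-MPI's API, and a single message slot is used for brevity:

#include <mpi.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* One-slot hand-off between an application thread and the sending thread. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  ready;   /* signalled when a message is posted    */
    pthread_cond_t  empty;   /* signalled when the slot is free again */
    void *buf;
    int   count, dest, tag, pending;
} send_slot_t;

static send_slot_t slot = { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER,
                            PTHREAD_COND_INITIALIZER, NULL, 0, 0, 0, 0 };

/* Sending thread: performs the actual MPI send (the vendor MPI must grant
   MPI_THREAD_MULTIPLE, or calls must be serialized as done for LAM/MPI). */
static void *send_thread(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&slot.lock);
        while (!slot.pending)
            pthread_cond_wait(&slot.ready, &slot.lock);
        MPI_Send(slot.buf, slot.count, MPI_BYTE, slot.dest, slot.tag, MPI_COMM_WORLD);
        free(slot.buf);
        slot.pending = 0;
        pthread_cond_signal(&slot.empty);
        pthread_mutex_unlock(&slot.lock);
    }
    return NULL;
}

/* Called from an application thread: copies the data, signals the sender
   and returns, so the caller may proceed much like after an MPI_Isend(). */
void bc_post_send(const void *buf, int count, int dest, int tag)
{
    pthread_mutex_lock(&slot.lock);
    while (slot.pending)                  /* wait until the slot is free */
        pthread_cond_wait(&slot.empty, &slot.lock);
    slot.buf = malloc(count);
    memcpy(slot.buf, buf, count);
    slot.count = count; slot.dest = dest; slot.tag = tag;
    slot.pending = 1;
    pthread_cond_signal(&slot.ready);     /* wake the sending thread */
    pthread_mutex_unlock(&slot.lock);
}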
5 Performance Tests

For the performance tests, an MPI example for benchmarking point-to-point performance from [12] was used, modified to test latency/bandwidth between any two MPI ranks (a sketch of such a benchmark is given below). For each test described below, a pair of clusters was used to benchmark point-to-point times between MPI processes in separate clusters. Cluster holk features 288 Itanium2 processors with Infiniband, parowiec contains 16 Pentium Xeon processors with Fast Ethernet, and fox is an AthlonXP PC node; all run Linux. We use a single MPI application compiled with BC-MPI, with processes of ranks 0 to 3 on one cluster and rank 4 (and 5 in the last case) on the second cluster. Note that the inter-departmental link between cluster parowiec and fox has a speed of only 10 Mbit/s.

5.1 Inter-cluster Communication and Communication within a Node in a Cluster

Figure 5 shows latency and Figure 6 bandwidth for communication between ranks 1 and 4 of the MPI application. Communication occurs in the configuration: 1 (parowiec) - MPI (shared memory or TCP) - 0 (proxy process on parowiec) - inter-cluster inter-departmental TCP (10 Mbit/s) - 4 (proxy process on fox). Processes 0 and 1 run on the same node. For communication between 0 and 1, both the shared memory and TCP modes in MPI were used for comparison. It can be seen that LAM/MPI slightly outperforms OpenMPI (as also reported in [13]). The times are obviously dominated by the inter-cluster communication times.
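A benchmark in the spirit of the one used here (conceptually following the ping-pong example of [12], not the actual modified test code) could look as follows; the endpoint ranks are taken from the command line so that, e.g., ranks 1 and 4 residing in different clusters can be measured, and a fixed small message size is used for brevity:

/* Ping-pong between two arbitrary ranks given on the command line. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int a = (argc > 2) ? atoi(argv[1]) : 0;   /* first endpoint  */
    int b = (argc > 2) ? atoi(argv[2]) : 1;   /* second endpoint */
    const int size = 256, reps = 1000;
    char *buf = calloc(size, 1);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++) {
        if (rank == a) {
            MPI_Send(buf, size, MPI_BYTE, b, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, size, MPI_BYTE, b, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == b) {
            MPI_Recv(buf, size, MPI_BYTE, a, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, size, MPI_BYTE, a, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();
    if (rank == a)
        printf("avg one-way latency: %.1f us, bandwidth: %.3f MB/s\n",
               (t1 - t0) / (2.0 * reps) * 1e6,
               size / ((t1 - t0) / (2.0 * reps)) / 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}

The program is run with the rank numbers of the two endpoints as arguments (e.g. 1 and 4 for the configurations measured below).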
Fig. 5: Latency for Inter-cluster Communication + Communication in a Node in a Cluster (inter-cluster latency in microseconds vs. message size of 0-256 bytes; series: inter-cluster communication (TCP) 0-4, 1-OpenMPI shm (MPI_THREAD_MULTIPLE)-0-TCP-4, 1-LAM/MPI shm (MPI_THREAD_SERIALIZED)-0-TCP-4, 1-OpenMPI tcp-0-TCP-4, 1-LAM/MPI tcp-0-TCP-4)
Fig. 6: Bandwidth for Inter-cluster Communication + Communication in a Node in a Cluster (inter-cluster bandwidth in MB/s vs. message size of 0-16384 bytes; same series as in Figure 5)

5.2 Inter-cluster Communication and Communication between Nodes in a Cluster

Figure 7 shows latency and Figure 8 bandwidth for communication between ranks 1 and 4. Communication occurs in the configuration: 1 (parowiec) - MPI (TCP) or BC-MPI's TCP - 0 (proxy process on parowiec) - inter-cluster inter-departmental TCP (10 Mbit/s) - 4 (proxy process on fox). Processes 0 and 1 run on separate nodes in cluster parowiec using Fast Ethernet. For communication between 0 and 1, both the vendor MPI and BC-MPI's TCP were used for comparison. It can be seen that, as in Figure 5 (communication between 0 and 1 on a single node), LAM/MPI offers slightly better latency than OpenMPI (for TCP). Furthermore, BC-MPI's TCP implementation is slightly faster than the code using LAM/MPI with MPI_THREAD_SERIALIZED. This may be in part due to the synchronization added to BC-MPI in the latter case so that LAM/MPI can be called from many threads without deadlocks; this synchronization is specific and would work only for the MPI application used in the tests. OpenMPI was tested with MPI_THREAD_MULTIPLE, which allows MPI to be called from many threads without deadlocks.
Fig. 7: Latency for Inter-cluster Communication + Communication between Nodes in a Cluster (inter-cluster latency in microseconds vs. message size of 0-256 bytes; series: 1-OpenMPI tcp (MPI_THREAD_MULTIPLE)-0-TCP-4, 1-LAM/MPI tcp (MPI_THREAD_SERIALIZED)-0-TCP-4, 1-BC-MPI's TCP-0-TCP-4)
Fig. 8: Bandwidth for Inter-cluster Communication + Communication between Nodes in a Cluster (inter-cluster bandwidth in MB/s vs. message size of 0-16384 bytes; same series as in Figure 7)
5.3 Data Partitioning and Forwarding

For larger messages, the author tested the impact of partitioning messages and immediately forwarding the received data in rank 0 while receiving continues. Figure 9 shows the results for the same configuration as in the previous case. It indicates that for large messages the latency of sending a message of size 8 MB between processes 1 and 0 (around 0.7 s between separate nodes in cluster parowiec) can be practically hidden. A pipelining sketch of this idea follows.
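The partitioning optimization can be illustrated by the following double-buffered pipeline; it is only a sketch of the idea with an assumed packet size, using a nonblocking MPI_Isend in place of the TCP forwarding thread that BC-MPI actually uses:

#include <mpi.h>
#include <stdlib.h>

#define PKT (256 * 1024)   /* assumed packet size, not BC-MPI's actual value */

/* Proxy-side pipeline: receive packet i+1 from the application process
   while packet i is still being forwarded towards the other cluster. */
void forward_pipelined(int src, int dst, long total_bytes)
{
    char *buf[2] = { malloc(PKT), malloc(PKT) };
    int cur = 0;
    MPI_Request fwd = MPI_REQUEST_NULL;

    for (long off = 0; off < total_bytes; off += PKT) {
        int n = (int)((total_bytes - off) < PKT ? (total_bytes - off) : PKT);
        /* receive the next packet from the sending application process */
        MPI_Recv(buf[cur], n, MPI_BYTE, src, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        /* make sure the previously forwarded packet has left its buffer */
        MPI_Wait(&fwd, MPI_STATUS_IGNORE);
        /* forward immediately while the next packet is being received */
        MPI_Isend(buf[cur], n, MPI_BYTE, dst, 0, MPI_COMM_WORLD, &fwd);
        cur ^= 1;
    }
    MPI_Wait(&fwd, MPI_STATUS_IGNORE);
    free(buf[0]); free(buf[1]);
}

With two buffers, receiving packet i+1 overlaps with forwarding packet i, which is how most of the intra-cluster transfer time can be hidden, consistent with Figure 9.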
Fig. 9: Latency for Inter-cluster Communication with and without Message Partitioning (inter-cluster communication times in seconds vs. message size of 0-8192 KB; series: forwarding whole messages, partitioning when forwarding)
5.4 Testing Infiniband vs. TCP and TCP Forwarding

In this test, communication occurs between processes 1 and 5 in the configuration: 1 (parowiec) - BC-MPI's TCP or MPI (TCP) - 0 (proxy process on parowiec) - inter-cluster TCP - additional TCP forwarder on access node karawela - inter-cluster TCP - 4 (proxy process on holk) - BC-MPI's TCP or MPI (Infiniband) - 5 (holk). Processes 1 and 0, as well as 4 and 5, run on separate nodes in clusters parowiec and holk, respectively. Figures 10 and 11 show the differences in communication times for both small and large message sizes when using Infiniband within holk compared to TCP (Ethernet).
Fig. 10: Inter-cluster Communication Latency with TCP vs Infiniband on holk: Short Messages (latency in microseconds vs. message size of 0-2048 bytes; series with linear regressions: 1-BC-MPI's TCP-0-TCP-4-BC-MPI's TCP-5 and 1-MPI(TCP)-0-TCP-4-MPI(IB)-5)
Fig. 11: Inter-cluster Communication Times with TCP vs Infiniband on holk (times in seconds vs. message size of 0-4096 KB; series: 1-BC-MPI's TCP-0-TCP-4-BC-MPI's TCP-5 and 1-MPI(TCP)-0-TCP-4-MPI(IB)-5)
6 Summary and Future Work

In this paper, a new software package for bridging MPI applications across clusters was presented. While it can be used as standalone software with existing MPI implementations, it can benefit from the BeesyCluster middleware for automatic startup of an MPI application spanning clusters, as well as for spawning BeesyCluster services in parallel. Its architecture can exploit multithreading features of MPI implementations, e.g. MPI_THREAD_MULTIPLE of OpenMPI. Various performance tests of point-to-point communication between clusters were presented using BC-MPI's TCP, LAM/MPI and OpenMPI. Future work includes further implementation of BC-MPI, including collective operations, performance comparisons with other systems like MPICH-G2, PACX-MPI and LAM/MPI, as well as tests of TCP communication with encryption. The author will also focus on building an ontology for HPC computing and using it for intelligent searching of HPC services within BeesyCluster.
Acknowledgments

Calculations were carried out at the Academic Computer Center in Gdansk, Poland.
References

1. Official Crossgrid Information Portal, http://www.crossgrid.org/main.html, supported by Grant No. IST-2001-32243 of the European Commission
2. CLUSTERIX: The National Linux Cluster, http://clusterix.pcz.pl
3. Karonis, N., Toonen, B., Foster, I.: MPICH-G2: A Grid-Enabled Implementation of the Message Passing Interface. Journal of Parallel and Distributed Computing (JPDC) 63 (2003) 551-563
4. Sotomayor, B.: The Globus Toolkit 4 Programmer's Tutorial (2005), http://www.casa-sotomayor.net/gt4-tutorial/
5. Karonis, N., Toonen, B.: MPICH-G2, http://www3.niu.edu/mpi/
6. Keller, R., Müller, M.: The Grid-Computing Library PACX-MPI: Extending MPI for Computational Grids, www.hlrs.de/organization/amt/projects/pacx-mpi/
7. LAM/MPI Parallel Computing, http://www.lam-mpi.org/
8. National Institute of Standards and Technology: Interoperable MPI, http://impi.nist.gov/
9. Lusk, E., et al.: MPI-2: Extensions to the Message-Passing Interface: MPI and Threads, http://www-unix.mcs.anl.gov/mpi/mpi-standard/mpi-report-2.0/node162.htm#Node162
10. Czarnul, P., Bajor, M., Fraczak, M., Banaszczyk, A., Fiszer, M., Ramczykowska, K.: Remote Task Submission and Publishing in BeesyCluster: Security and Efficiency of Web Service Interface. In: Proc. of PPAM 2005, LNCS 3911, Springer-Verlag, Poland (2005)
11. Czarnul, P.: Integration of Compute-Intensive Tasks into Scientific Workflows in BeesyCluster. In: Proceedings of the ICCS 2006 Conference, University of Reading, UK, LNCS 3993, Springer-Verlag (2006)
12. Gropp, W., Lusk, E.: MPI Tutorial: Benchmarking Point-to-Point Performance, http://www-unix.mcs.anl.gov/mpi/tutorial/mpiexmpl/src3/pingpong/C/main.html
13. Barrett, B.: Open MPI User's Mailing List Archives, http://www.open-mpi.org/community/lists/users/2006/04/1076.php