OpusIB – Grid Enabled Opteron Cluster with InfiniBand Interconnect

Olaf Schneider, Frank Schmitz, Ivan Kondov, and Thomas Brandel

Forschungszentrum Karlsruhe, Institut für Wissenschaftliches Rechnen, Herrmann-von-Helmholtz-Platz 1, D-76344 Eggenstein-Leopoldshafen, Germany
{thomas.brandel,ivan.kondov,frank.schmitz,olaf.schneider}@iwr.fzk.de
Abstract. OpusIB is an Opteron-based cluster system with an InfiniBand interconnect. Grid middleware provides the integration into the CampusGrid and D-Grid projects. Notable details of the hardware and software equipment as well as the configuration of the cluster are presented. Performance measurements show that InfiniBand is not only well suited for message-passing based parallel applications but also competitive as a transport layer for data access in shared cluster file systems or high throughput computing. Keywords: Cluster, Grid, Middleware, File system, SAN, InfiniBand.
1 Introduction
Cluster systems with fast interconnects like Myrinet, Quadrics or InfiniBand are becoming more and more important in the realm of high performance computing (HPC). The Institute for Scientific Computing at the Forschungszentrum Karlsruhe adopted and tested InfiniBand technology very early. We started with a small test system in 2002, followed in 2003 by a Xeon-based system with 13 nodes called IWarp. The next generation of InfiniBand cluster is OpusIB, which we describe in this paper. In the following section we briefly look at the CampusGrid project, in which most activities reported here are embedded. The key facts about OpusIB's hardware and software are collected in several subsections of Sect. 3. Thereafter, in Sect. 4, we comment on some measurements which demonstrate the achievable performance with InfiniBand. We conclude the paper with a short survey of OpusIB as part of the D-Grid infrastructure.
2 The CampusGrid Project
The R&D project CampusGrid [1,2] was initiated at the Forschungszentrum Karlsruhe in 2004 with the aim to design and build a heterogeneous network of resources
for computing, data and storage. Additionally, the project gives users the opportunity to run their applications in such a heterogeneous environment. Grid technologies were selected as a state-of-the-art method to achieve these goals. The use of standard Grid middleware in our local infrastructure is advantageous because it enables the scientists of the Forschungszentrum to smoothly enter the global Grid. The project started with a testbed for the evaluation of middleware and other components. While the initial testbed was small, it already comprised all kinds of resources in our heterogeneous IT environment: clusters, SMP servers, and vector processors as well as managed storage (SAN). As the project progresses, more and more production systems will be integrated into the CampusGrid environment. In order to do so, we need a clear and smooth migration path from our classical HPC environment into the new Grid-based infrastructure. Thus, in the design of the CampusGrid architecture we have to take into account many boundary conditions we cannot (easily) change in our project, e.g. central user administration via Active Directory Services (ADS). The cluster OpusIB started as part of the CampusGrid testbed and is now growing into a fully productive system.
3 Hardware and Software of OpusIB

3.1 Overview
The name OpusIB is an abbreviation for Opteron cluster with InfiniBand. As the name implies, the cluster is assembled from dual-processor nodes with Opteron 248 processors, and the high-performance networking fabric is an InfiniBand switch (InfinIO9000 by SilverStorm). All worker nodes and most other cluster nodes run CERN Scientific Linux as operating system (64-bit version). At the time of writing there are 64 worker nodes with 128 CPUs in total and an aggregated memory of about 350 GB. All worker nodes and the switch fabric are built into water-cooled cabinets by Knürr. This technology was originally developed for the GridKa [3] cluster.

3.2 InfiniBand
InfiniBand (IB) is a general-purpose network and protocol usable for different higher-level protocols (TCP/IP, FibreChannel/SCSI, MPI, RFIO/IB) [4]. In contrast to existing interconnect devices that employ a shared-bus I/O architecture, InfiniBand is channel-based, i.e., there is a dedicated path from one communication partner to the other. Links can be aggregated; the standard defines aggregations of 4 and 12 links, called 4X and 12X. We use 4X in our installation, which means 1 GB/s of usable bandwidth in each direction. FibreChannel (FC) bridges plugged into the IB switch enable us to directly connect storage devices in the Storage Area Network (SAN) to the cluster nodes. Thus it is not necessary to equip each node with an FC host bus adapter.
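For reference, the 1 GB/s figure follows from the usual InfiniBand SDR link arithmetic (our own calculation, not given in the original text): each lane signals at 2.5 Gbit/s and the link uses 8b/10b encoding, so

\[
4 \times 2.5\,\mathrm{Gbit/s} = 10\,\mathrm{Gbit/s}
\;\xrightarrow{\;8b/10b\;}\;
8\,\mathrm{Gbit/s} = 1\,\mathrm{GB/s} \text{ of payload bandwidth per direction.}
\]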
As an off-the-shelf high-speed interconnect, InfiniBand is a direct competitor of technologies like Myrinet and Quadrics. Our decision to use InfiniBand in the cluster was mainly due to the positive experiences in recent projects (cf. [5]).

3.3 Running Two Batch Schedulers Concurrently
A peculiarity of the OpusIB cluster is that all worker nodes are managed by two different job schedulers concurrently. On the one hand, we have the OpenPBS successor TORQUE [6] together with the MAUI scheduler [7]. On the other hand, there is a mixed LoadLeveler cluster which consists of the OpusIB nodes, several PowerPC blades and some other single machines. Recently we added our AIX production system (pSeries 655 and 630) to this mixed cluster.

The reasons for running these two batch systems concurrently are numerous. First the history: when we started assembling the cluster we chose TORQUE for a couple of reasons – it is Open Source and compatible with OpenPBS, but more stable. When IBM first provided LoadLeveler with the mixed-cluster option, we decided to try it out of curiosity. Shortly before, we had received some PowerPC blades which could serve as AIX nodes in our testing environment. The Linux part of the testbed was just OpusIB. At that point, the cluster was still in a quite experimental mode of operation; thus, two batch systems did not cause any problems but were sometimes useful for tests. This configuration survived the gradual change of the cluster into productive operation. Currently, LoadLeveler works very well and is used for the majority of user jobs submitted in the classical way on the command line. On the other hand, Grid middleware more often supports TORQUE/PBS than LoadLeveler. Moreover, the combination with the MAUI scheduler is quite popular in the Grid community. Thus, TORQUE serves as a kind of reference system when using Grid middleware. A third reason is that we want to stay independent of commercial software vendors as far as possible; an Open Source solution should be available at least as a fall-back.

Running two job managers concurrently without knowledge of each other of course carries the risk of overloading nodes with too many jobs. In practice, however, we noticed that such problems occur less often than expected. The reason is probably that both schedulers take the actual workload on a node into account when making the scheduling decision. For the MAUI scheduler this behavior is triggered by setting the configuration parameter MAXLOAD: MAUI marks a node busy if the load exceeds MAXLOAD. The exact value needs some tuning – we used values between 1.1 and 2.5. LoadLeveler prefers the node with the lowest load by default. If overcommitment does occur, it is always harmful, especially if it affects the workload balance of a parallel job (since a single task is slowed down compared to all other tasks).

Recently we addressed this kind of problem by adding prolog and epilog scripts to each job. After submission, a job waits in a queue until a matching resource is available. Right before job startup the scheduler, say LoadLeveler, runs a prolog script, to which the list of processors (nodes) occupied by the job is passed (via the variable LOADL_PROCESSOR_LIST). The prolog script utilizes this information to decrease the number of job slots
in the list of available resources at the other scheduler (i.e. TORQUE). After the job has finished, the slot number is increased again in the epilog script. Thus, we dynamically reconfigure the resources of the second scheduler whenever a job is started by the first scheduler, and vice versa.
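The paper does not show the scripts themselves; the following is only a minimal sketch of how such a prolog/epilog helper could look. It assumes that LOADL_PROCESSOR_LIST contains one hostname per occupied processor, that pbsnodes -x reports the slot count in an <np> element, and that the local TORQUE version accepts changing np via qmgr – all of these assumptions must be checked against the installed versions.

#!/usr/bin/env python3
# Hypothetical LoadLeveler prolog/epilog helper (sketch only, not the script
# actually used on OpusIB).  It reduces (prolog) or restores (epilog) the
# TORQUE slot count of the nodes occupied by a LoadLeveler job.
import os
import subprocess
import sys
from collections import Counter

def current_np(node):
    # Assumption: "pbsnodes -x <node>" prints XML containing an <np> element.
    xml = subprocess.run(["pbsnodes", "-x", node],
                         capture_output=True, text=True, check=True).stdout
    return int(xml.split("<np>")[1].split("</np>")[0])

def adjust_slots(delta):
    # One entry per occupied processor, so a node used by k tasks appears k times.
    procs = os.environ.get("LOADL_PROCESSOR_LIST", "").split()
    for node, used in Counter(procs).items():
        new_np = max(0, current_np(node) + delta * used)
        # Assumption: the installed TORQUE accepts this qmgr command at runtime.
        subprocess.run(["qmgr", "-c", "set node %s np = %d" % (node, new_np)],
                       check=True)

if __name__ == "__main__":
    # Called as "prolog" right before job startup and as "epilog" afterwards.
    adjust_slots(-1 if sys.argv[1] == "prolog" else +1)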
3.4 Cluster Management Using Quattor
Automated installation and management of the cluster nodes is one of the key requirements for operating a cluster economically. At OpusIB this is done using Quattor, a software developed at CERN [8]. Some features of Quattor are:

– automated installation and configuration
– software repository (access via http)
– Configuration Data Base (CDB) server
– template language to describe setup
– Node Configuration Manager (NCM) using information from CDB
– templates to describe software setup and configuration
The open standards on which the Quattor components are based allow easy customization and addition of new functionality. For instance, creating a new node configuration component essentially amounts to writing a Perl module. In addition, the hierarchical CDB structure provides a good overview of cluster and node properties. Adding new hardware or changing the installed software on existing hardware is facilitated tremendously by Quattor – the process takes no longer than several minutes.

3.5 Kerberos Authentication and Active Directory
For the CampusGrid project it was decided to use Kerberos 5 authentication with the Active Directory server as Key Distribution Center (KDC). Thus all OpusIB nodes are equipped with Kerberos clients and a Kerberos-enabled version of OpenSSH. As a work-around for the missing Kerberos support in the job scheduling systems (PBS, LoadLeveler) we use our own modified version of PSR [9], which incorporates Kerberos 5 support. While Kerberos is responsible for authentication, the identity information stored in the passwd file still needs to be transferred to each node. For this purpose we use a newly developed Quattor component, which retrieves the necessary data via LDAP from the Active Directory and then distributes it to the cluster by the usual Quattor update mechanism.
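The Quattor component itself is a Perl module (see Sect. 3.4); purely for illustration, the following Python sketch shows the kind of LDAP query involved. The server name, base DN, bind credentials, and the assumption that the Active Directory carries POSIX attributes (uidNumber, gidNumber, unixHomeDirectory, loginShell) are hypothetical and not taken from the paper.

# Illustrative sketch (not the actual Quattor component): pull account data
# from Active Directory via LDAP and emit passwd-style lines.
# All names below (server, base DN, credentials) are placeholders.
from ldap3 import Server, Connection, ALL

server = Server("ads.example.fzk.de", get_info=ALL)
conn = Connection(server, user="EXAMPLE\\ldapreader",
                  password="secret", auto_bind=True)

conn.search(
    search_base="dc=example,dc=fzk,dc=de",
    search_filter="(&(objectClass=user)(uidNumber=*))",   # only POSIX-enabled accounts
    attributes=["sAMAccountName", "uidNumber", "gidNumber",
                "unixHomeDirectory", "loginShell"],
)

for e in conn.entries:
    # passwd format: name:x:uid:gid:gecos:home:shell (empty gecos field)
    print("%s:x:%s:%s::%s:%s" % (e.sAMAccountName, e.uidNumber, e.gidNumber,
                                 e.unixHomeDirectory, e.loginShell))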
3.6 StorNext File System
SNFS is a commercial product by ADIC [10]. It has several features which support our goal of providing seamless access to a heterogeneous collection of HPC and other resources:
– Native clients are available for many operating systems (Windows, AIX, Linux, Solaris, IRIX).
– The metadata server does not require proprietary hardware.
– Active Directory integration is part of the current version.
– We obtained very good performance results in our evaluation (cf. Sect. 4).
– The installation procedure and management are simpler than in competing products.

A drawback is that file system volumes cannot be enlarged during normal operation without a maintenance period.
3.7 Globus Toolkit 4
In the CampusGrid project we decided to use Globus Toolkit 4 (GT4) as the basic middleware. For an overview of features and concepts of GT4 we refer to Foster [11] and the documentation of the software [12]. The current configuration for the OpusIB cluster is depicted in Fig. 1. We use the usual Grid Security Infrastructure (GSI); the only extension is a component to update the grid-mapfile with data from the Active Directory.
Fig. 1. WS-GRAM for job submission on OpusIB, with identity management using Active Directory Server (ADS)
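The grid-mapfile update component mentioned above is not detailed in the paper. Purely as an illustration of the data involved: a grid-mapfile maps certificate subject DNs to local accounts, so the update essentially regenerates lines of the form shown in the sketch below. The DN and account name are invented examples; the file path is the common GSI default.

# Illustration only: write grid-mapfile entries of the form "<subject DN>" <local user>.
# The real component derives these pairs from the Active Directory data described above.
entries = {
    "/O=GermanGrid/OU=FZK/CN=Jane Doe": "jdoe",   # invented example entry
}
with open("/etc/grid-security/grid-mapfile", "w") as f:
    for dn, user in entries.items():
        f.write('"%s" %s\n' % (dn, user))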
For LoadLeveler jobs there is an additional WS-GRAM adapter and a scheduler event generator (SEG). Actually there are two of them – one running on the GT4 server mentioned above and a second running on a PowerPC machine with AIX. The latter was installed to test GT4 with AIX. The cluster monitoring data gathered by Ganglia [13] are published in MDS4, the GT4 monitoring and resource discovery system (using the Ganglia Information Provider shipped with Globus).
4 Benchmarks

4.1 Data Throughput
Right from the start, our objective in evaluating InfiniBand technology was not only the fast inter-node connection for message-passing based parallel applications. We also focused on the data throughput in the whole cluster. The large data rates achievable with InfiniBand are appropriate for accessing large amounts of data from each node concurrently via a file system or other protocols like RFIO. Preliminary results of our studies can be found in [5]. Later on we successfully tested access to storage devices via an InfiniBand-FibreChannel bridge together with a Cisco MDS9506 SAN director. These tests were part of a comprehensive evaluation process with the aim of finding a SAN-based shared file system solution with native clients for different architectures. Such a heterogeneous HPC file system is intended as a core component of the CampusGrid infrastructure. A comparison of StorNextFS (cf. Sect. 3.6) with two competitors (SAN-FS by IBM and CXFS by SGI) is given in Table 1.
Table 1. Write throughput in MB/s for different file systems and client hardware, varying file sizes and fixed record size of 8 MB
File size    SunFire with IB        p630 (AIX)
             SNFS     CXFS          SNFS    CXFS    SAN-FS
64 MB        177      93            70      50      59
1 GB         176      91            70      49      53
4 GB         175      96            73      49      52
The measurements were done using the benchmark software IOzone [14]. Write performance was always measured such that the client waits for the controller to confirm the data transfer before the next block is written. This corresponds to the behavior of NFS mounted with the option 'sync'. Due to compatibility issues it was not possible to install the SAN-FS client on our SunFire nodes with InfiniBand and Opteron processors. The reported values rely on sequentially written files of various sizes – 128 kB to 4 GB, doubling the size in each step – while the record size goes from 64 kB to 8 MB. Typically, a monotonic increase of data throughput with growing file and record size can be observed. This behavior is depicted in Fig. 2.
Fig. 2. Write performance of SNFS on SunFire with InfiniBand using IB-FC bridge (throughput in MByte/s versus record size from 64 kB to 8192 kB, for file sizes of 1 MB, 8 MB, 128 MB and 4 GB)
All measurements were done using a disk-storage system by Data Direct Networks (S2A 8500). The connection between the SAN fabric and the IB-FC bridge or, respectively, the p630 was a 2 Gigabit FibreChannel link. Thus, the overall bandwidth from the cluster to the storage is limited by the capacity of this link. For file system access with its considerable overhead we cannot expect more than about 180 MB/s (or, correspondingly, 1.5 Gbit/s). Measurements [15] show that SNFS behaves well if more than one client accesses the same file system.
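As a rough plausibility check (our own arithmetic, not from the original text): the observed throughput corresponds to

\[
180\,\mathrm{MB/s} \times 8\,\mathrm{bit/byte} = 1.44\,\mathrm{Gbit/s} \approx 1.5\,\mathrm{Gbit/s},
\]

which is roughly 90% of the approximately 200 MB/s of payload bandwidth commonly quoted for a 2 Gbit FibreChannel link per direction.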
4.2 Parallel Computing
Besides serial applications with high data throughput, the application mix on OpusIB contains typical MPI applications from several scientific domains (climate simulation, CFD, solid state physics and quantum chemistry). Parallel floating-point performance, which is relevant for the latter applications, was benchmarked using HPL [16]. It was compiled and linked on OpusIB using all available C compilers (GCC, Intel, PGI) and the libraries MVAPICH [17] and ATLAS [18]. The latter was compiled with the architecture defaults from the vendor. The tests were performed on up to 18 nodes. Figure 3 shows the measured performance. It scales linearly with the number of processors. The performance per processor is quite constant – between 3.3 and 3.7 Gflops, which corresponds to about 80% of the peak performance.
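For context (our own estimate, assuming the Opteron 248 runs at 2.2 GHz and retires two double-precision floating-point operations per cycle):

\[
P_{\mathrm{peak}} = 2.2\,\mathrm{GHz} \times 2\,\mathrm{flops/cycle} = 4.4\,\mathrm{Gflops\ per\ processor},
\qquad
\frac{3.5\,\mathrm{Gflops}}{4.4\,\mathrm{Gflops}} \approx 0.8 .
\]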
Fig. 3. Total performance of the HPL test RR00L2L2 for maximal problem size (total performance in Gflops and percentage of peak performance versus the number of processors). The HPL benchmark was compiled with the GNU C Compiler 3.2.
Fig. 4. Performance comparison of the HPL test WR00L2L2 on two processors with three different compilers (performance per processor in Gflops versus problem size, for the GNU, Intel and PGI C compilers)
Comparing the compilers, we see that the GNU C compiler performs best in our tests (cf. Fig. 4). However, for small problems (up to size 1000) the actual
choice of the compiler does not matter. For larger problems (size 18000) the PGI code is about 25% slower, while the lag of the Intel compiler is moderate. These results should not be misconstrued to mean that GCC always produces better code than its commercial competitors. Firstly, real applications do not behave exactly like the benchmark. Secondly, in our experience, PGI performs much better with Fortran code (see also [19]).
5 D-Grid
The D-Grid initiative aims at designing, building and operating a network of distributed, integrated and virtualized high-performance resources and related services which allow the processing of large amounts of scientific data and information. D-Grid currently consists of the integration project (DGI) and six community projects in several scientific domains [20]. One work package of the integration project is to build an infrastructure called Core D-Grid. As part of this core infrastructure, OpusIB should be accessible via the three middleware layers GT4, Unicore [21] and gLite [22]. The GT4 services are the same as for CampusGrid, plus an OGSA-DAI interface to a 1 TB MySQL database (cf. [23]) and integration into the MDS4 monitoring hierarchy of D-Grid. The integration with gLite suffers from the lack of a 64-bit port of gLite; thus, 32-bit versions of the LCG tools must be used. Jobs submitted via gLite are scheduled on the cluster via TORQUE. Unicore also supports TORQUE, so we can use one local job scheduler on the cluster for all three middlewares. Each middleware should run on a separate (virtual) machine which is a submitting host for the local batch system. So far, only a few nodes are equipped with the gLite worker node software. A complete roll-out of the system with access via all three middlewares (GT4, gLite, and Unicore) is scheduled for the first quarter of 2007. At that time we will be productive with an extended D-Grid infrastructure (for example, we are adding 32 nodes to OpusIB with two dual-core processors each). A detailed report about configuration details and experiences will be the subject of a separate publication.

Acknowledgments. The authors thank all colleagues for their support in the daily management of OpusIB and other systems and for helpful suggestions regarding our research activities. Great thanks are due to the developers of the Open Source software tools we use in our projects.
References

1. Institut für Wissenschaftliches Rechnen, Forschungszentrum Karlsruhe: CampusGrid (2005), http://www.campusgrid.de
2. Schneider, O.: The project CampusGrid. NUG-XVI General Meeting, Kiel (May 24-27, 2004)
3. Institut für Wissenschaftliches Rechnen, Forschungszentrum Karlsruhe: Grid Computing Centre Karlsruhe (GridKa) (2005), http://www.gridka.de
4. InfiniBand Trade Association: InfiniBand Architecture (2006), http://www.infinibandta.org
5. Schwickerath, U., Heiss, A.: First experiences with the InfiniBand interconnect. Nuclear Instruments and Methods in Physics Research A 534, 130–134 (2004)
6. Cluster Resources Inc.: TORQUE Resource Manager (2006), http://old.clusterresources.com/products/torque
7. Cluster Resources Inc.: Maui cluster scheduler (2006), http://old.clusterresources.com/products/maui
8. Quattor development team: Quattor. System administration toolsuite (2006), http://quattor.web.cern.ch
9. The LAM team: Password Storage and Retrieval System (2006), http://www.lam-mpi.org/software/psr
10. Advanced Digital Information Corporation (ADIC): StorNext File System (2006), http://www.adic.com
11. Foster, I.T.: Globus Toolkit Version 4: Software for Service-Oriented Systems. In: Jin, H., Reed, D., Jiang, W. (eds.) NPC 2005. LNCS, vol. 3779, pp. 2–13. Springer, Heidelberg (2005)
12. The Globus Alliance: Globus Toolkit 4.0 Release Manuals (2006), http://www.globus.org/toolkit/docs/4.0
13. The Ganglia Development Team: Ganglia (2006), http://ganglia.info/
14. Capps, D.: IOzone Filesystem Benchmark (2006), http://www.iozone.org
15. Schmitz, F., Schneider, O.: The CampusGrid test bed at Forschungszentrum Karlsruhe. NUG-XVII General Meeting, Exeter, GB (May 25-27, 2005)
16. Petitet, A., Whaley, R.C., Dongarra, J., Cleary, A.: HPL – A Portable Implementation of the High-Performance Linpack Benchmark for Distributed-Memory Computers (2004), http://www.netlib.org/benchmark/hpl
17. Network-Based Computing Laboratory, Dept. of Computer Science and Engg., The Ohio State University: MVAPICH: MPI over InfiniBand project (2006), http://nowlab.cse.ohio-state.edu/projects/mpi-iba
18. Whaley, R.C., Petitet, A., Dongarra, J.J.: Automated empirical optimization of software and the ATLAS project. Parallel Computing 27(1-2), 3–35 (2001), see also http://www.netlib.org/atlas
19. Research Computing, Information Technology Services, University of North Carolina: Compiler Benchmark for LINUX Cluster (2006), http://its.unc.edu/hpc/performance/compiler bench.html
20. The D-Grid Initiative: D-Grid, an e-Science-Framework for Germany (2006), http://www.d-grid.de
21. The Unicore Project: UNICORE (Uniform Interface to Computing Resources) (2005), http://www.unicore.eu
22. The EGEE Project: gLite. Lightweight Middleware for Grid Computing (2006), http://www.glite.org
23. Jejkal, T.: Nutzung der OGSA-DAI Installation auf dem Kern-D-Grid (in German) (2006), http://fuzzy.fzk.de/~GRID/DGrid/Kern-D-Grid.html