BIP-SMP: High Performance Message Passing over a Cluster of Commodity SMPs

Patrick Geoffray, Loïc Prylli, Bernard Tourancheau

RHDAC, project CNRS-INRIA ReMap, LHPC (Matra Systèmes & Information), ISTIL UCB-Lyon, 69622 Villeurbanne
LIP, project CNRS-INRIA ReMap, LHPC (ENS-Lyon), ENS-Lyon, 69634 Lyon

e-mail: [email protected]
1 Introduction

As we approach the next century, parallel machines are gradually being replaced by clusters of commodity workstations. The price of such a cluster is a fraction of the cost of a parallel computer of the same capacity (see [1]). However, connecting several PCs together with a classical network does not give you a supercomputer! To build a cheap but efficient cluster, one needs powerful hardware and software components:
- A high performance interconnection network like Myrinet.
- The best available commodity machines at this time, such as multi-processor PCs (SMPs).
- A free software operating system: Linux provides the flexibility and performance needed.
- An efficient communication software layer like BIP (Basic Interface for Parallelism) [16].

Building a cluster with commodity SMPs is very attractive as it reduces the cost of network interfaces per processor. However, the efficient support of these computational components is not obvious. Indeed, to efficiently manage such a CLUster of SMPs (CLUMP), the communication layer needs to provide multi-protocol support, allowing a process to send a message either to another process on the same physical node or to another process via the network. Therefore, we need to develop a low-level message passing protocol and a matching MPI implementation that provide intra- and inter-node communication performance, in terms of latency and bandwidth, as close as possible to the commodity hardware limits. We have designed and implemented this in BIP-SMP over commodity nodes (dual Pentium) running Linux.

In this article, we present the hardware and the software characteristics of our design. Then, we explain in detail the implementation of BIP-SMP and the original mechanism used to implement high performance message passing inside an SMP node. We show the evaluation of BIP-SMP over our CLUMP with MPICH [7], an implementation of the Message Passing Interface, and we compare it with other solutions, MPI-GM and MPI-PM/CLUMP. Finally, we present our conclusions and discuss ongoing research.

This work was supported by LHPC (Matra S&I, CNRS, ENS-Lyon, INRIA, Région Rhône-Alpes), the INRIA Rhône-Alpes project REMAP and an NSF-INRIA grant.
2 Background

2.1 Hardware components

The experimental platform is a typical cluster of 4 commodity SMPs interconnected by Myrinet [2]. Each node of this cluster is a dual Pentium II at 450 MHz, compliant with the Symmetric Multi-Processing (SMP) interface from Intel, and contains a total of 128 MB of main memory. The motherboard is based on the BX chip set, which allows us to use a 100 MHz memory bus. The operating system is Linux (an open source and efficient Unix). All of these nodes are connected by Myrinet, a high performance local area network from Myricom. One Myrinet network interface card (NIC) uses a PCI slot to connect a node to a Myrinet switch. The bandwidth provided by the network is approximately 160 MB/s, but the PCI bus limits the maximum throughput to 132 MB/s. All of the links are full-duplex and the Myrinet switch is a full cross-bar. The routing is source-based and a hardware back-pressure mechanism implements point-to-point flow control. Myrinet is a very open network: the embedded processor (LANai 4 in our cluster) can be programmed in C and the code, called the Myrinet Control Program (MCP), is downloaded to the board at initialization. The hardware architecture of this cluster is thus composed entirely of affordable components.
2.2 Software architecture

An efficient software architecture is needed to exploit such hardware to its maximum capacity, and is essential in cluster computing. Older communication layers were usually designed for low performance, unreliable networks. Using such generic software on an efficient cluster can give an incorrect impression of cluster performance for applications with strong communication requirements, as we show in the performance evaluation section (see section 5). In this section we present the BIP software and a port of MPICH over BIP that was originally restricted to one process per node. We use BIP for the network management side of BIP-SMP.
2.2.1 BIP

BIP [16] (Basic Interface for Parallelism) is a low level layer for the Myrinet network developed by our team at the University of Lyon. The goal of this project is to provide efficient access to the hardware and to allow zero memory copy communications. As BIP supplies very limited functionality, it is not meant to be used directly by the parallel application programmer. Instead, higher layers (such as MPI) are expected to provide a development environment with a good functionality/performance ratio. The API of BIP [17] is a classical message-passing interface. BIP provides both blocking and non-blocking communication. Communications are as reliable as the network: errors are detected and in-order delivery is guaranteed. BIP is composed of a user library, a kernel module and a Network Interface Card (NIC) program. The key points of the implementation are:

- User level access to the network, avoiding the system calls and memory copies implied by the classical design, becomes a key issue: the bandwidth of the network (160 MB/s in our case, 132 MB/s for the I/O bus) is of the same order as that of the memory (300 MB/s for a memory read and 160 MB/s for a copy on our machines with the BX chip set).
- Long messages follow a rendez-vous semantic: the receive must be posted before the send is completed. Messages are split into packets and the steps of the communication are pipelined.
- Small messages: since initializations and handshakes between the host and the NIC program are more expensive than a memory copy for small messages, small messages are written directly into the network board memory on the sending side, and copied into a queue in main memory on the receiving side. The size of this queue is static and the upper layers must guarantee that no overflow occurs.

The communication latency of BIP is 5 µs, and the maximal bandwidth is 126 MB/s (95% of the theoretical hardware maximum, actually set by the PCI bottleneck in our case). Half of the maximum bandwidth is reached with a message size of 4 KB (the half-bandwidth point N_{1/2}).

2.2.2 MPI-BIP

MPI-BIP [15] is a high performance implementation of MPI [7] for Myrinet-connected clusters using the BIP protocols. The current MPI-BIP implementation is based on the work done by Myricom on the MPI-GM [12] implementation. Most of the code is now shared between MPI-GM and MPI-BIP. MPICH is organized in layers and designed to ease porting to a new target. Figure 1 presents our view of the MPICH framework and the level at which we inserted the MPI-BIP specific part. Different ports choose different strategies depending on which communication system they use. We implemented our network specific layer at a non-documented interface level that we call the "Protocol Interface"; this API allows us to specify custom protocols for the different kinds of MPI messages. Each MPI message of the application is implemented with one or several messages of the underlying communication system (BIP in our case).

Figure 1: The architecture of MPI-BIP. (The MPICH stack: MPI API; generic part with collective operations and context/group management; Abstract Device Interface with generic ADI code, datatype management, heterogeneity and request queue management; the "Protocol Interface" with its "short", "eager" and "rendez-vous" protocols, where the BIP port is inserted; and the Channel Interface used by other ports such as P4, NX, MPL, TCP/IP and shared memory.)

The cost of MPI-BIP is approximately a 2 µs overhead (mainly CPU) over BIP for the latency on our cluster. Thus, the latency of the non-SMP MPI-BIP is very good, 7 µs, and the bandwidth reaches 110 MB/s.
3 Related Work

The efficient management of CLUMPs is a fairly new research topic. We have investigated the issues related to a multi-protocol message passing interface using both shared memory and the network within a CLUMP. Several projects have proposed solutions for this problem in the last few years, and BIP-SMP follows this line of research.
Projects like MPI-StarT [8] or the Starfire SMP Interconnect use uncommon SMP nodes and exotic networks, and their performance is limited. Multi-Protocol Active Messages [11] is a very efficient multi-protocol implementation of Active Messages [20] over Myrinet that uses Sun Enterprise 5000 machines as SMP nodes. Multi-Protocol AM achieves 3.5 µs of latency and 160 MB/s of bandwidth. The main restriction is the use of the Sun Gigaplane memory system [18] instead of a common PC memory bus. Polling is also a problem in Multi-Protocol AM, as polling for external messages is more expensive than for internal messages. Nevertheless, Multi-Protocol AM was the first message passing interface to efficiently manage CLUMPs. Another project, closely related to our solution, comes from a Japanese research group: MPI-PM/CLUMP [19] also uses commodity SMP nodes. We compare the performance of this software with our work in the performance evaluation section (see section 5). The design of the whole system uses techniques similar to BIP-SMP's.
Finally, one of the first message-passing interfaces to manage CLUMPs as a platform is the well-known P4 device [3] used by MPICH. P4 provides mechanisms to start multiple processes on machines and uses either message passing or shared memory copies to communicate between these processes. However, the programmer must explicitly select the appropriate library call. P4 also provides several useful reduction operations. We can also cite implementations of MPI limited to a single SMP node, like the MPICH devices ch_shmem and ch_lfshmem. The ch_lfshmem device is a lock-free shared memory device that achieves very good performance (2.4 µs and 100 MB/s on one of our SMP nodes). Concerning shared memory management, an article about concurrent access for shared memory Active Messages [10] presents very efficient solutions, such as a lock-free algorithm and a high performance lock implementation.
4 Multi-Protocol communication layer: BIP-SMP

BIP is an efficient communication layer. It delivers almost all of the hardware performance at the user level. Its constraints are easy to accept for high performance computing on a dedicated cluster: single user, no fault tolerance, basic services, a single process per node, etc. However, CLUMPs seem to be becoming the optimal clustering architecture in terms of the ratio between price and performance. In the context of SMPs, BIP as it stands is not sufficient because only one process per node is able to access the Myrinet board. This makes it impossible to support several MPI processes per SMP node (except by using special compilation techniques to regroup several MPI processes into a unique multi-threaded UNIX process [4]). We chose to keep the message-passing point of view while removing this restriction. The challenge is to support several processes per node with the same performance as multi-threaded programming. We need to manage concurrent access to the Myrinet board and to provide local communication at least at the same level of performance as BIP over Myrinet.
4.1 Concurrent access to the network

When a BIP process wants to send a message to another BIP process, it starts by asking for an entry in a queue of mapped send requests. A send request is a structure containing all of the information required to process the send. This queue maps send requests onto the Myrinet board memory, and once a structure is filled, the NIC completes the communication. In the case of a small message, the processor writes the data directly into the send request buffer; otherwise it writes the physical addresses of the memory pages used by the data, and a DMA engine on the Myrinet board gathers the data to be sent on the network. Once a send request is filled, the rest of the communication is asynchronous, managed by the LANai. The best solution is to protect concurrent access to this send request queue with a lock, and of course, the lock needs to be very efficient. We implement such a lock using a function provided by the Linux kernel, test_and_set_bit (kernel 2.2.x). This function guarantees that the memory operation is done atomically. This lock is very cheap compared to System V IPC locks or the lock mechanism of the Pthread [9] library. With BIP-SMP, two processes can then overlap the filling of their send requests, and the only operation that is serialized is obtaining a slot in the send request queue.
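As an illustration of this scheme, here is a minimal C sketch of a test-and-set lock protecting only the slot reservation. It uses GCC atomic builtins as a portable stand-in for the kernel's test_and_set_bit; the send_queue structure and the reserve_send_slot function are illustrative assumptions, not BIP's actual data structures.

```c
#include <sched.h>   /* sched_yield() */

/* One-word test-and-set lock, analogous in spirit to the kernel's
 * test_and_set_bit(): a single atomic read-modify-write per acquisition. */
typedef volatile int tas_lock_t;

static void tas_acquire(tas_lock_t *lock)
{
    /* __sync_lock_test_and_set() atomically writes 1 and returns the old
     * value; spin until the old value was 0 (the lock was free). */
    while (__sync_lock_test_and_set(lock, 1))
        sched_yield();          /* be polite to the other local process */
}

static void tas_release(tas_lock_t *lock)
{
    __sync_lock_release(lock);  /* atomic store of 0 with release semantics */
}

/* Illustrative send-request queue shared by the local processes. */
struct send_queue {
    tas_lock_t lock;            /* protects slot allocation only */
    unsigned   next_slot;
    unsigned   nb_slots;
};

/* Only the slot reservation is serialized; two processes can then fill
 * their respective send requests concurrently. */
static unsigned reserve_send_slot(struct send_queue *q)
{
    unsigned slot;
    tas_acquire(&q->lock);
    slot = q->next_slot;
    q->next_slot = (q->next_slot + 1) % q->nb_slots;
    tas_release(&q->lock);
    return slot;
}
```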
4.2 Internal communication

4.2.1 Request queues

There are several ways to move data from one process to another (figure 2). A good solution is to use shared memory to implement mailboxes. One can also move data directly from user space to user space. For efficiency reasons, we use both mechanisms: shared memory to move small buffers, with two memory copies but a small latency, and the direct copy for big messages, with a kernel overhead but a large sustained bandwidth.
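Below is a minimal sketch of the first mechanism: a file-backed mmap segment holding a flag-polled, single-producer/single-consumer mailbox. The structure layout, slot count and function names are illustrative assumptions, not BIP-SMP's actual implementation.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define SLOT_SIZE 256                 /* illustrative small-message payload */
#define NB_SLOTS  64

struct slot {
    volatile int ready;               /* 1: full (set by sender), 0: empty  */
    int          length;
    char         data[SLOT_SIZE];
};

struct mailbox {                      /* one FIFO per (sender, receiver) pair */
    struct slot slots[NB_SLOTS];
};

/* Map a file-backed shared segment (mmap of a plain file rather than
 * System V IPC); every local process maps the same file. */
static struct mailbox *map_mailbox(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    ftruncate(fd, sizeof(struct mailbox));
    return mmap(NULL, sizeof(struct mailbox), PROT_READ | PROT_WRITE,
                MAP_SHARED, fd, 0);
}

/* Sender side: first copy, into the receiver's queue (busy-waits if full). */
static void mbox_send(struct mailbox *mb, unsigned *wr, const void *buf, int len)
{
    struct slot *s = &mb->slots[*wr % NB_SLOTS];
    while (s->ready) ;                        /* simplistic flow control    */
    memcpy(s->data, buf, len);
    s->length = len;
    __sync_synchronize();                     /* data visible before flag   */
    s->ready = 1;
    (*wr)++;
}

/* Receiver side: poll the flag at the head of the FIFO, second copy. */
static int mbox_poll(struct mailbox *mb, unsigned *rd, void *buf)
{
    struct slot *s = &mb->slots[*rd % NB_SLOTS];
    if (!s->ready)
        return -1;                            /* no message pending         */
    int len = s->length;
    memcpy(buf, s->data, len);
    __sync_synchronize();
    s->ready = 0;                             /* hand the slot back         */
    (*rd)++;
    return len;
}
```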
Figure 2: Several ways to move data between processes: memory copy through an intermediate shared memory area, or direct memory copy using the Linux kernel.

First, we have implemented lock-free communication primitives using shared memory for the receive queue. Each BIP process has one FIFO receive queue per other
process on the same node. When a process wants to send a small message, it copies a header and the data into the dedicated queue in the receiving process's pool. There is a simple flow control in BIP-SMP, which can be disabled if the upper layers already provide flow control, as MPI-BIP does. To receive a message, BIP-SMP polls a memory flag at the head of each receive FIFO and, if a message is ready, copies the data into the receiver's user space. A function of the BIP-SMP API makes it possible to obtain a pointer to the buffer in the receive queue and avoid this memory copy. The lock-free paradigm is very efficient; we achieve 1.8 µs of latency with raw BIP and 3.3 µs with MPI-BIP. However, this protocol needs one queue per peer, so the amount of shared memory needed grows with the square of the number of local processes. We have also implemented a local communication layer using an efficient lock. In this case, the protocol needs only one queue per process, but the cost of the lock ensuring that only one process at a time writes into this queue limits the performance: we achieve 2.5 µs of latency with raw BIP and 4.5 µs with MPICH. We prefer the lock-free implementation because commodity SMP nodes usually contain 2 processors, sometimes 4, and our primary target is applications which run one process per processor. Thus, the number of local processes inside an SMP node will usually be 2 or 4.

We do not use System V IPC to initialize the shared memory segment. We prefer to open a file and map this unique file into each process using mmap. There is no fundamental difference, except that mmap seems more portable and that the operating system synchronizes the in-memory projection of the file with the file on disk when it has time or when the user requests it: this is very useful for debugging an application, as the shared memory content can be read with a simple file editor.

4.2.2 Bulk transfer

Another innovation in this communication layer is the direct memory copy. Moving data between threads is just a simple memory copy. However, it is impossible to do the same thing between processes: Unix protects the different user spaces and it is forbidden to access a foreign virtual memory address. Therefore, we have implemented a Linux kernel module to move data from one user space to another. This module needs to know the virtual address and the length of the source buffer and the virtual address of the receive buffer in the other process. The algorithm can be summarized as follows, for each buffer:

1. Compute the length of the next contiguous source data flit, up to the next memory page boundary.
2. Resolve the physical memory address of the source at this offset.
3. Copy the contiguous source data flit using copy_to_user.

Figure 3 shows this direct memory copy. Although the send buffer and the receive buffer are contiguous in virtual memory, manipulating physical memory addresses
breaks this contiguous point of view, and the copy operation has to take care of memory page boundaries. In the example of figure 3, the send buffer uses two memory pages and the receive buffer three memory pages. In this typical case, the kernel copies two data flits between the memory pages of process A and the virtual memory space of process B. The memory copy is done by the receiver, so we do not need to manipulate physical addresses on the target side. The overhead per fragment is quite small compared to the cost of copying a whole page.
Figure 3: A direct memory copy between two processes.

We can use the copy_to_user operation in this module because, with Linux, all of the physical memory is mapped in the kernel address space. Another problem can appear if a virtual memory page is not present in physical memory (for instance, swapped to disk). A simple solution is to lock all of the pages in physical memory before the copy and unlock them afterwards. This solution is very expensive for big messages: the kernel needs to decode all of the memory information to lock each memory page. We prefer an optimistic strategy: we do not lock the memory pages before the copy, but if a page is not physically present when the module tries to access it, the module calls the kernel function handle_mm_fault (kernel 2.2.x). This function brings the page into physical memory and returns the physical address allocated for it. There is one constraint with this direct copy module: we need to apply a small patch to the Linux kernel to add the page fault handling function to the export symbol table so that it can be used from a module. At this time, Linux developers do not export this function by default.
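The fragmentation loop of such a module can be sketched as follows. This is only an outline under several assumptions: resolve_source_page() is a placeholder for the page-table walk on the sender's side (falling back to handle_mm_fault when the page is absent), the headers and exact kernel interfaces differ between kernel versions, and error handling is elided.

```c
#include <linux/errno.h>     /* EFAULT                                      */
#include <linux/mm.h>        /* PAGE_SIZE                                   */
#include <linux/sched.h>     /* struct task_struct                          */
#include <asm/uaccess.h>     /* copy_to_user() (location varies by version) */

/* Placeholder: walk the sender's page tables and return a kernel-mapped
 * address for the page containing 'uvaddr', bringing it into physical
 * memory with handle_mm_fault() if needed. Not a real kernel function.     */
void *resolve_source_page(struct task_struct *sender, unsigned long uvaddr);

/* Copy 'len' bytes from the sender's virtual address 'src' into the
 * receiver's buffer 'dst'. The module runs in the receiver's context,
 * hence the plain copy_to_user() for the destination side.                 */
static int direct_copy(void *dst, unsigned long src, size_t len,
                       struct task_struct *sender)
{
    size_t done = 0;

    while (done < len) {
        unsigned long cur = src + done;
        /* Length of the next contiguous "flit": up to the page boundary.   */
        size_t in_page = PAGE_SIZE - (cur & (PAGE_SIZE - 1));
        size_t flit    = (in_page < len - done) ? in_page : len - done;
        void  *ksrc    = resolve_source_page(sender, cur);

        if (copy_to_user((char *)dst + done, ksrc, flit))
            return -EFAULT;            /* receiver page not writable        */
        done += flit;
    }
    return 0;
}
```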
4.2.3 Multi-protocol layer

Having inter-node and intra-node communication primitives is not enough. Another part of this work is to enable BIP-SMP to use both simultaneously and to hide this new feature behind BIP's API. We have chosen to use two independent pools of receive queues per node, one for the internal communication and the other one for Myrinet. We can then accept a message from the Myrinet network and a message from another process in shared memory at the same time without any synchronization. However, each process has to poll messages from these different queues in memory. In addition, the memory space used by the Myrinet receive queues must be contiguous in physical memory to ease the DMA management of our Myrinet firmware.
Figure 4: Architecture of BIP-SMP: under BIP's API, the multi-protocol layer dispatches between BIP over Myrinet for communication between nodes and the local BIP-SMP protocols (shared memory copy or direct memory copy) for communication between processes on the same node; MPI-BIP sits on top.

The use of BIP-SMP is completely transparent: each process receives a different logical number and everything else is hidden by the BIP multi-protocol layer, as shown in figure 4. When we send a message from one logical node to another, BIP-SMP, not the user, determines whether to use the shared memory or the Myrinet network. For a receive, BIP-SMP polls both the internal and the external receive queues. The flags indicating a new message are kept in different cache lines. Hiding the multi-protocol issue inside BIP's API allows us to reuse MPI-BIP without any change. MPI-BIP then provides a uniform high performance message-passing interface. Of course, if users want to exploit the asymmetric performance between shared memory and Myrinet [6], BIP-SMP variables are available to provide information about the location of another logical node, the number of processes on the same physical node, etc. We used MPI-BIP on top of BIP-SMP to evaluate and validate it on different parallel benchmarks and to compare its results with related work.
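A possible shape for the receive-side progress engine is sketched below; the functions shm_poll_peer and net_poll are hypothetical stand-ins for the per-peer shared-memory FIFO poll and the Myrinet receive queue poll, not BIP-SMP's actual interfaces.

```c
/* Hypothetical poll interfaces: return 0 when a message has been delivered
 * into 'm', non-zero when the corresponding queue is empty.                 */
struct message;                                 /* opaque for this sketch    */
int shm_poll_peer(int peer, struct message *m); /* intra-node FIFO, per peer */
int net_poll(struct message *m);                /* Myrinet receive queue     */

/* Poll both pools without any locking: each queue has a single consumer,
 * and the "message ready" flags of the two pools live in different cache
 * lines, so polling one pool does not disturb the other.                    */
int bip_smp_poll(int nb_local_peers, struct message *m)
{
    int p;

    if (net_poll(m) == 0)                       /* external (Myrinet) pool   */
        return 0;
    for (p = 0; p < nb_local_peers; p++)        /* internal (shared memory)  */
        if (shm_poll_peer(p, m) == 0)
            return 0;
    return -1;                                  /* nothing has arrived yet   */
}
```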
5 Experimental results

5.1 Point-to-point communications

The first tests are raw point-to-point communications. We measure the latency and the bandwidth between two processes, on the same node and on two different nodes of our cluster, using a ping-pong program and the cycle counter of the processor for precise time measurements. We can see in Table 1 that the performance of BIP-SMP and of MPI-BIP over BIP-SMP is very close to the hardware performance.
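For reference, a minimal MPI ping-pong of the kind used for such measurements might look as follows; the iteration count, message size and use of the x86 __rdtsc() intrinsic are illustrative, and this is not the exact program used for Table 1.

```c
#include <mpi.h>
#include <stdio.h>
#include <x86intrin.h>                       /* __rdtsc() on x86            */

#define ITER 1000
#define SIZE 4                               /* bytes, latency measurement  */

int main(int argc, char **argv)
{
    char buf[SIZE] = {0};
    int rank, i;
    MPI_Status st;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    unsigned long long t0 = __rdtsc();
    for (i = 0; i < ITER; i++) {
        if (rank == 0) {
            MPI_Send(buf, SIZE, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, SIZE, MPI_BYTE, 1, 0, MPI_COMM_WORLD, &st);
        } else if (rank == 1) {
            MPI_Recv(buf, SIZE, MPI_BYTE, 0, 0, MPI_COMM_WORLD, &st);
            MPI_Send(buf, SIZE, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }
    unsigned long long cycles = __rdtsc() - t0;

    if (rank == 0)   /* one-way latency = half the average round trip */
        printf("average round trip: %llu cycles\n", cycles / ITER);

    MPI_Finalize();
    return 0;
}
```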
                                         BIP-SMP    MPI-BIP over BIP-SMP
Intra-node latency (shared memory)       1.8 µs     3.3 µs
Inter-node latency (Myrinet network)     5.7 µs     7.6 µs
Intra-node bandwidth (shared memory)     160 MB/s   150 MB/s
Inter-node bandwidth (Myrinet network)   126 MB/s   107 MB/s

Table 1: Latency and bandwidth of point-to-point communications with BIP-SMP and with MPI-BIP over BIP-SMP.

We then compared the latency and the bandwidth of our MPI-BIP over BIP-SMP with those of several other related works:
- the ch_shmem and ch_lfshmem MPICH devices;
- MPICH over GM;
- MPI-PM/CLUMP (with the mpizerocopy flag).

We used the benchmark program mpptest included in the MPICH distribution. This program measures the latency and the bandwidth of a network architecture using a round-trip communication with blocking calls. We have modified the program to avoid the use of two different buffers for sending and receiving: since this benchmark does not touch the communication buffers, their contents would otherwise remain in the cache of the receiving processor after the first iteration, making the experiments irrelevant (the bandwidth of MPI-BIP over BIP-SMP can reach 300 MB/s between two processes in this case). We performed tests as a function of the packet size, between two processes on the same node for the intra-node tests and between two processes on two different nodes for the inter-node tests.

5.1.1 Comparing with MPI implementations for a single SMP node

The results in figures 5 and 10 allow us to compare MPI-BIP/SMP, in terms of point-to-point latency and bandwidth, with the ch_shmem and ch_lfshmem MPICH devices:

- ch_shmem: 14 µs and 100 MB/s for internal communications;
- ch_lfshmem: 2.4 µs and 100 MB/s for internal communications.

This demonstrates the effect of our direct memory copy mechanism on the bandwidth, and the low overhead required in BIP-SMP for multi-protocol support.
Figure 5: Latency of several intra-node MPI implementations (small scale).

Figure 6: Latency of several intra-node MPI implementations (large scale).

Figure 7: Latency of several inter-node MPI implementations (small scale).

Figure 8: Latency of several inter-node MPI implementations (large scale).

Figure 9: Bandwidth of several intra-node MPI implementations (small scale; same results as in figure 5 but expressed in terms of bandwidth).

Figure 10: Bandwidth of several intra-node MPI implementations (large scale; same results as in figure 6 but expressed in terms of bandwidth).

Figure 11: Bandwidth of several inter-node MPI implementations (small scale; same results as in figure 7 but expressed in terms of bandwidth).

Figure 12: Bandwidth of several inter-node MPI implementations (large scale; same results as in figure 8 but expressed in terms of bandwidth).
5.1.2 Comparing with MPI-PM/CLUMP

MPI-PM/CLUMP uses a zero-copy implementation between nodes over Myrinet, and also between processes on the same node. Nevertheless, the inter-node performance (figures 7, 8, 11 and 12) of MPI-BIP/SMP (7.6 µs and 107 MB/s) is a little better than MPI-PM/CLUMP's (10.4 µs and 93 MB/s). For intra-node communication, MPI-BIP/SMP (3.3 µs and 150 MB/s) is close to the ch_lfshmem MPICH device whereas MPI-PM/CLUMP (9.3 µs and 150 MB/s) is closer to the ch_shmem MPICH device. This difference in intra-node latency can be important for some applications. Overall, MPI-PM/CLUMP is a very efficient MPI implementation for clusters of SMPs.

5.1.3 Intra-node communication with GM

The inter-node results of MPI-GM (figures 7 and 12) are not at the same level as MPI-BIP/SMP's or MPI-PM/CLUMP's, due to the lower performance of GM itself. However, GM provides extended features: fair queuing between several processes and error recovery. The performance gap between MPI-GM and MPI-BIP/SMP or MPI-PM/CLUMP may be smaller with the new LANai 7 based Myrinet boards and with the development release of GM. However, the results of the intra-node tests with MPI-GM (figures 5 and 10) are disappointing: for a local communication between two processes on the same node, GM sends the message from the Myrinet board to itself via a round-trip through the first switch. A multi-protocol approach in MPI-GM, not relying on GM for intra-node communication, would be a welcome improvement.
5.2 IS/NAS parallel benchmark

The second series of measurements uses the IS kernel of the NAS parallel benchmarks 2.3 [13], class A. We ran this benchmark on several logical configurations, using either two processes per node or only one per node, with three different implementations of MPI for clusters of SMPs:

- MPI-BIP/SMP;
- MPI-PM/CLUMP;
- MPI-GM.

We can see in figure 13 that MPI-BIP/SMP is a very efficient message passing interface for CLUMPs: the speedup of the IS test is super-linear. However, we can also notice that using SMP nodes is always less efficient than using the same number of processors spread over single-processor nodes. This loss of efficiency may occur because there is no overlap between computation and communication for the internal message passing (the processor manages the communication, instead of the NIC as in the external communication case), or because contention on the memory bus is an important factor in the performance decrease.
Figure 13: NAS parallel benchmark IS with MPI-BIP/SMP, MPI-PM and MPI-GM over CLUMPs (time in seconds for the configurations 1 CPU, 1 dual PII, 2 single PIIs, 2 dual PIIs, 4 single PIIs and 4 dual PIIs).
5.3 Linpack benchmark Finally, we have tested the Linpack [5] benchmark (Table 2) over several clusters:
- "MIR": 4 dual Pentium II nodes (450 MHz, 128 MB, BX chip set), provided by LHPC, Ecole Normale Supérieure of Lyon.
- "AXP": 4 single Alpha 21164A nodes (500 MHz, 128 MB), provided by LHPC, Ecole Normale Supérieure of Lyon.
- "TORC": 8 dual Pentium II nodes (300 MHz, 256 MB, FX chip set), provided by the Innovative Computing Laboratory, University of Tennessee.

(The Myrinet equipment on AXP and MIR comes from a Myricom grant.)

We reach 1.9 Gflops on our cluster using a block size of 32 and a BLAS library optimized by ATLAS [21]. We are limited by the amount of physical memory, which prevents us from measuring the maximum performance. Linpack seems to be an application that uses the type of communication where MPI-BIP/SMP performs better than MPI-PM/CLUMP or MPI-GM. Indeed, MPI-BIP/SMP (1.9 Gflops) is about 40% more efficient than MPI-PM/CLUMP (1.31 Gflops) or MPI-GM (1.35 Gflops). We can compare these results to those of in-a-box parallel machines like the CRAY T3E or the IBM SP2: the price of the cluster ($10K) is lower for the same performance, and moreover, clusters can be updated more quickly with the fastest processors on the market.
Machine                                          Matrix size   Performance
Cluster MIR using MPI-BIP/SMP                    3300          1.9 Gflops
Cluster MIR using MPI-GM                         3300          1.35 Gflops
Cluster MIR using MPI-PM/CLUMP                   3300          1.31 Gflops
Cluster TORC using MPI-BIP/SMP                   3200          2.4 Gflops
Cluster AXP using MPI-BIP over BIP               3300          1.8 Gflops
CRAY T3E (300 MHz) with 4 procs [14]             Unknown       1.806 Gflops
IBM SP2 (77 MHz) with 8 procs [14]               Unknown       1.8 Gflops
SGI Power Challenge (75 MHz) with 8 procs [14]   Unknown       1.955 Gflops

Table 2: Linpack results on different machines.
6 Conclusion and future work

BIP-SMP is available to easily build clusters of commodity SMPs, which currently offer the best ratio between price and performance. This communication layer, combined with MPI-BIP, provides an efficient message-passing interface for a cluster of commodity SMPs. Our design achieves 3.3 µs of latency and 150 MB/s of bandwidth for intra-node MPI messages. The performance of inter-node MPI communication over Myrinet is 7.6 µs of latency and 107 MB/s of bandwidth. With this tool, clusters start to challenge black-box parallel machines, saving a lot of money on the hardware while using the best processors available at the time.

BIP-SMP proposes several contributions to improve the communication between two processes on the same physical multi-processor node. BIP-SMP is at this time implemented on Linux. Although it has mainly been tested on x86, it should run on any architecture supported by Linux without modification. As the intra-node communication side of BIP-SMP is more or less independent of BIP, we eventually plan to turn it into a high performance shared memory device for MPICH. By allowing MPICH to use this device simultaneously with another classical device (GM, BIP or PM), the intra-node issue becomes independent of the network, as it should be. Finally, we would like to test BIP-SMP on a very large cluster of SMP nodes to evaluate its scalability.
References

[1] Thomas E. Anderson, David E. Culler, David A. Patterson, and the NOW Team. A case for networks of workstations: NOW. IEEE Micro, February 1995.

[2] Nanette J. Boden, Danny Cohen, Robert E. Felderman, Alan E. Kulawik, Charles L. Seitz, Jakov N. Seizovic, and Wen-King Su. Myrinet - a gigabit-per-second local-area network. IEEE Micro, 15, February 1995. Myricom, Arcadia, CA. http://www.myri.com.

[3] Ralph M. Butler and Ewing L. Lusk. Monitors, messages, and clusters: the p4 parallel programming system. Technical report, University of North Florida and Argonne National Laboratory, 1993. http://www-fp.mcs.anl.gov/~lusk/p4/p4paper/paper.html.

[4] Franck Capello and Olivier Richard. Performance characteristics of a network of commodity multiprocessors for the NAS benchmarks using a hybrid memory model. Technical Report 1197, LRI, Université Paris-Sud, 91405 Orsay cedex, France, 1999.

[5] Jack J. Dongarra. Performance of various computers using standard linear equations software. Technical Report CS-89-85, University of Tennessee Computer Science, 1999.

[6] Patrick Geoffray, CongDuc Pham, and Bernard Tourancheau. Myrinet - a gigabit-per-second local-area network. In Cluster-Based Computing Workshop, International Conference on Supercomputing, ICS'99, Rhodes, Greece, June 1999.

[7] W. Gropp, E. Lusk, N. Doss, and A. Skjellum. A high-performance, portable implementation of the MPI message passing interface standard. Parallel Computing, 22(6):789-828, September 1996.

[8] Parry Husbands and James C. Hoe. MPI-StarT: Delivering network performance to numerical applications. In SuperComputing (SC'98), Orlando, Florida, November 1998.

[9] Xavier Leroy. The LinuxThreads library, June 1999. http://pauillac.inria.fr/~xleroy/linuxthreads/.

[10] Steven S. Lumetta and David E. Culler. Managing concurrent access for shared memory active messages. In International Parallel Processing Symposium, Orlando, Florida, April 1998.

[11] Steven S. Lumetta, Alan M. Mainwaring, and David E. Culler. Multi-protocol active messages on a cluster of SMP's. In SuperComputing (SC'97), University of California at Berkeley, August 1997.

[12] Myricom. The GM API, December 1998. http://www.myri.com/GM/doc/gm_toc.html.

[13] NASA. NAS parallel benchmark 2.3. http://science.nas.nasa.gov/Software/NPB/.

[14] Jack Dongarra, University of Tennessee, [email protected]. Linpack benchmark - parallel, April 1999. http://performance.netlib.org/performance/html/linpack-parallel.data.col0.html.

[15] L. Prylli, B. Tourancheau, and R. Westrelin. The design for a high performance MPI implementation on the Myrinet network. In EuroPVM/MPI'99, 1999. To appear.

[16] Loïc Prylli and Bernard Tourancheau. BIP: a new protocol designed for high performance networking on Myrinet. In Workshop PC-NOW, IPPS/SPDP'98, Orlando, USA, 1998.

[17] Loïc Prylli. BIP user reference manual. Technical Report TR97-02, LIP/ENS-Lyon, September 1997.

[18] A. Singhal, D. Broniarczyk, F. Cerauskis, J. Price, L. Yuan, C. Cheng, D. Doblar, S. Fosth, N. Agarwal, K. Harvey, E. Hagersten, and B. Liencres. Gigaplane: A high performance bus for large SMPs. In Hot Interconnects IV, pages 41-52, Stanford, California, August 1996.

[19] Toshiyuki Takahashi, Francis O'Carroll, Hiroshi Tezuka, Atsushi Hori, Shinji Sumimoto, Hiroshi Harada, Yutaka Ishikawa, and Peter H. Beckman. Implementation and evaluation of MPI on an SMP cluster. In Parallel and Distributed Processing - IPPS/SPDP'99 Workshops, volume 1586 of Lecture Notes in Computer Science. Springer-Verlag, April 1999.

[20] T. von Eicken. Active Messages: an Efficient Communication Architecture for Multiprocessors. PhD thesis, University of California at Berkeley, November 1993.

[21] R. Clint Whaley and Jack J. Dongarra. Automatically tuned linear algebra software. In SuperComputing (SC'98), 1998.