PROGRAMMING THE INFINIBAND NETWORK ARCHITECTURE FOR HIGH PERFORMANCE MESSAGE PASSING SYSTEMS

Vijay Velusamy*$, Changzheng Rao$
$ Mississippi State University, Department of Computer Science and Engineering, Box 9637, Mississippi State, MS 39762, USA
{vijay, cr90}@hpcl.cs.msstate.edu

Srigurunath Chakravarthi*, Jothi Neelamegam*, Weiyi Chen*, Sanjay Verma*
* MPI Software Technology, Inc., 101 S. Lafayette St., Suite 33, Starkville, MS 39759, USA
{ecap, jothi, wychen, sanjay}@mpisofttech.com

Anthony Skjellum*$#
Department of Computer and Information Sciences, University of Alabama at Birmingham, 115A Campbell Hall, 1300 University Boulevard, Birmingham, Alabama 35294, USA
[email protected]
# Corresponding author, to whom inquiries should be sent.
Abstract: The InfiniBand Architecture provides availability, reliability, scalability, and performance for server I/O and inter-server communication. The InfiniBand specification describes not only the physical interconnect and high-level management functions but also a wide range of communication services, from simple unreliable communication to network partitioning, with many options. The specification defines line protocols for message passing and describes the required behavior as a collection of verbs rather than concrete APIs. This paper discusses the various programming options that are available to benefit from the capabilities of the InfiniBand Architecture. An analysis is presented of the strategies and programming models for message passing systems that are enabled and/or enhanced by the InfiniBand infrastructure. Conclusions are offered about benefits, trends, and further opportunities for low-level InfiniBand programming.
Keywords: InfiniBand, Verbs, Message Passing Interface, Middleware, RDMA
1. INTRODUCTION
The InfiniBand Architecture is an industry standard developed by the InfiniBand Trade Association [3] to provide availability, reliability, scalability, and performance for server I/O and inter-server communication. Design principles built around communication strategies that virtualize resources among an arbitrary number of processes form the core of the standard. The InfiniBand specification defines only the line protocols for message passing, and not any APIs. This paper discusses InfiniBand Verb APIs, the VI Provider Library API [1], and certain other portable network APIs, while offering perspective on trends, convergence, and
enhancement opportunities. Using these APIs to create a fast MPI-1.2 implementation is also discussed. It is envisioned that features in MPI-2, including MPI-I/O, will be able to benefit from InfiniBand's support for storage networks, and that this, in aggregate with its benefits for message passing, could be used to build high performance computing systems. The remainder of this paper is organized as follows. Section 2 briefly describes the various programming interfaces that are available to utilize InfiniBand. Programming models applicable to InfiniBand are described in Section 3. Section 4 encapsulates some experiences in adapting InfiniBand to high performance message passing, and Section 5 summarizes the paper.
2. PROGRAMMING INTERFACES
The HCA Verbs Provider Driver is the lowest level of software in the operating system and interfaces with the hardware directly. The HCA driver consists of two parts, a loadable kernel mode driver and a user mode library (most often a dynamically loadable library module), as shown in Figure 1. Programming interfaces for InfiniBand may be broadly classified into two main categories, namely Channel Access Interfaces and Portable Interfaces. Channel Access Interfaces are generally provided by the adapter vendor and communicate directly with the adapter drivers. Interfaces that layer on top of these channel access interfaces, thereby hiding the implementation specifics, belong to the class of Portable Interfaces.
2.1 Channel Access Interfaces
The InfiniBand Architecture incorporates many of the concepts of the Virtual Interface Architecture [1], which in turn drew from Shrimp [11] and Cornell University's U-Net [12]. The Virtual Interface (VI) Architecture is a specification for high-performance networking interfaces that reduces the overhead of ubiquitous transports such as TCP/IP. VI supports both the traditional two-sided send/receive model of communication and a one-sided remote memory access model. Some of the characteristics of VI have been found to have significant impact on the scalability and performance of systems that support the Message Passing Interface (MPI) [6,7]. These include the number of connections required by a large application, the time taken to dynamically open and close connections, and support for unexpected messages. The VI model also does not support the sharing of a single pool of buffers between multiple VIs, which limits the scalability of VI-based systems.
The IBM InfiniBlue Host Channel Adapter Access Application Programming Interface is a vendor-independent interface that allows the development of both kernel-space and user-space applications based on the InfiniBand transport model [4]. It has the ability to work with different verbs interfaces, and supports multiple channel adapters from the same or different vendors. It also recommends that the implementation ensure that multiple threads may safely use the APIs, provided they do not access the same InfiniBand entity (such as a queue pair or completion queue). Other verbs APIs include those defined by Mellanox and VIEO. The Mellanox IB-Verbs API (VAPI) is Mellanox's software programmer's interface for InfiniBand verbs [5]. The Mellanox VAPI interface provides a set of operations that closely parallel the verbs of the InfiniBand standard, plus additional extension functionality in the areas of enhanced memory management and adapter property specification.
The VIEO InfiniBand Channel Abstraction Layer (CAL) Application Programming Interface provides a vendor-independent interface for InfiniBand channel adapter hardware [9]. The CAL API evidently lies under the verbs abstraction, between the Channel Interface driver software and the adapter hardware. It isolates specific hardware implementation details, providing both a common function call interface and a common data structure interface for the supported InfiniBand chipsets. An advantage of such an abstraction is simultaneous support for heterogeneous channel adapters, which potentially enhances path selection. However, CAL is not widely used at present.
2.2 Portable Interfaces
In addition to the above interfaces, which communicate with the Host Channel Adapters' provider drivers directly, there also exist certain portable interfaces that hide the channel access interface from the user. These include the Direct Access Transport and the Sandia Portals interface. Direct Access Transport (DAT) [2] is a standard set of transport-independent, platform-independent APIs designed to benefit from the remote direct memory access (RDMA) capabilities of interconnect technologies such as InfiniBand, the Virtual Interface Architecture, and RDMA over IP. DAT defines both user-level (uDAPL) and kernel-level (kDAPL) provider libraries (DAPL). The current version 1.0 of DAT provides minimal functionality to exploit RDMA performance capabilities. It falls short at present of completely supporting IB functionality such as datagrams and multicast groups; this short-term limitation has adverse effects on the scalability of large applications. However, DAT also targets systems other than InfiniBand, which enhances its long-term value. The Interconnect Software Consortium Transport API (ICSC Transport API) [14] defines specifications related to programming interfaces and protocols that enable direct interaction with RDMA-capable transports. The specifications include extensions to the UNIX Sockets API, an API that provides direct user application access to interconnect transports, and APIs that provide application access to the interconnect fabric management infrastructure. The Phase I specification covers VI Architecture networks, and the Reliable Connection and Unreliable Datagram services of IB.
Figure 1: High Level Architecture (adapted from the Linux InfiniBand Project [10])
The Sandia Portals Interface [8] is an interface for message passing between nodes in a system area network. The Portals interface is designed primarily for scalability, with the ability to support a scalable implementation of the MPI standard. To support scalability, the Portals interface maintains a minimal amount of state. Portals supports reliable, ordered delivery of messages between pairs of processes without explicit point-to-point connection establishment between them, and combines the characteristics of one-sided operations and two-sided send/receive operations. Since this interface was originally designed to support MPI, implementations of Portals over the IB infrastructure could enable rapid deployment of high performance message passing systems; it could also provide an alternative option for other communication paradigms.
2.3 Interesting Scenarios
Though the IBM HCA access API and the Mellanox Verbs API are primarily targeted at adapters based on IBM and Mellanox chipsets respectively, an interesting scenario would be to use the IBM verbs on Mellanox adapters or vice versa; evidence of such planned cross-support exists. Another interesting scenario would be a port of the VI Provider Library (VIPL) over the DAT library. Even though DAT provides a portable layer over the VI architecture, this scenario is attractive for applications that have been developed over the VI Provider Library and need to be ported to a layer that is already supported by DAT. Current studies that incorporate this idea are geared towards enumerating the shortcomings of DAT for supporting the VI Provider Library. The various possible layering strategies, along with their advantages, are listed in Table 1.
2.4 Interface Comparison
The various verbs APIs discussed above are based on the Verbs specification of the InfiniBand Architecture, and they share many similarities in semantics. Their functions are generally classified as HCA access, memory management, address management, completion queue and completion queue domains, queue pair related, end-to-end contexts, multicast groups, work request processing, and event handling. However, they have some subtle differences in the parameters associated with certain calls.
Table 1. Layering strategies (layering: advantage/potential uses)
VIPL over DAPL: makes legacy VIPL applications usable on IB and RDMA over IP.
DAPL over VIPL: useful to architect DAPL applications on legacy VIPL-based hardware (such as Giganet).
DAPL over IB Verbs: portable programming across multi-vendor IB HCAs.
Portals over IB: portable programming across multi-vendor IB HCAs; particularly suitable for MPI applications.
Portals over DAPL: makes Portals applications usable on VIPL, IB, and RDMA over IP; particularly suitable for MPI applications.
Portals over VIPL: suitable for Portals-based applications on legacy VIPL-based hardware (such as Giganet).
For example, the Query_hca function in the IBM Verbs API does not require the caller to pre-allocate memory for the attributes structure. Instead, the caller invokes the function twice: the first time with attr_size=0, so that the routine returns the actual size of the buffer required; the caller then allocates a buffer of that size and passes it to the routine. This enables efficient memory utilization, since the caller cannot otherwise estimate how much memory should be allocated. There is also certain additional functionality provided by some programming interfaces. For example, the Mellanox API supports querying the HCA's GID and P_Key tables. The Mellanox API also has a single interface for both simple data transfer and RDMA-based data transfer functions: in the post-send interface, the opcode parameter specifies the type of operation to be performed. It also supports a set_ack bit flag, which causes the last packet to have its ack_required bit set; this can be used to indicate to the corresponding end point whether the initiator expects an acknowledgement or not. DAPL provides many features that make it attractive to applications. Since the API hides the internals of the implementation, it provides a user-friendly platform for application developers. DAPL provides a common event model for notifications, connection management events, data transfer completions, asynchronous errors, and all other notifications, and it supports combining OS wait objects and DAT events. However, in certain cases this prevents the application from benefiting from features that are available to developers using the underlying Verbs/VIPL interface directly. The presence of an additional layer also has an adverse effect on the performance of the system.
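The two-call query pattern described above can be sketched as follows. This is a minimal illustration only: hca_query_attrs is a hypothetical stand-in for a vendor query verb such as IBM's Query_hca, not the actual signature, and the attribute block size is invented for the example.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a vendor "query HCA attributes" verb.
 * First call: attrs == NULL or *attr_size too small -> report required size.
 * Second call: caller supplies a buffer of that size. */
static int hca_query_attrs(void *attrs, size_t *attr_size)
{
    const size_t needed = 256;          /* size of the attribute block (illustrative) */
    if (attrs == NULL || *attr_size < needed) {
        *attr_size = needed;            /* tell the caller how much memory is required */
        return -1;                      /* "buffer too small" status */
    }
    memset(attrs, 0, needed);           /* pretend to fill in the HCA attributes */
    return 0;
}

int main(void)
{
    size_t attr_size = 0;
    void *attrs = NULL;

    /* First call with attr_size = 0: learn the required buffer size. */
    (void)hca_query_attrs(NULL, &attr_size);

    /* Allocate exactly the amount the provider asked for, then call again. */
    attrs = malloc(attr_size);
    if (attrs == NULL || hca_query_attrs(attrs, &attr_size) != 0) {
        fprintf(stderr, "query failed\n");
        free(attrs);
        return 1;
    }
    printf("HCA attributes retrieved in a %zu-byte buffer\n", attr_size);
    free(attrs);
    return 0;
}
```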
3. PROGRAMMING MODELS
InfiniBand supports the following transport models: Reliable Datagram (RD), Remote Operations (RO), Unreliable Datagram (UD), Reliable Connection (RC), Unreliable Connection (UC), and Unreliable Multicast data transfer paradigms (Fig. 2). Though InfiniBand provides support only for Unreliable Multicast, applications requiring reliable multicast, such as MPI implementations, could use a model that couples Reliable Datagrams with a selective retransmission strategy to achieve Reliable Multicast (RMC) capability. The Reliable Datagram, Remote Operations, and Unreliable Datagram models may be used to implement the Sandia Portals API, and implementations of the MPI-1 and MPI-2 standards may use such a Portals implementation to benefit from InfiniBand. DAT 1.0 and VIPL will benefit from Reliable Connection oriented data transfers, and implementations of the MPI-1 and MPI-2 standards may be layered over DAT 1.0 or VIPL. MPI-2 one-sided communication will benefit from the Remote Operations supported by InfiniBand, while Unreliable Connection, Multicast, and Reliable Multicast could be used to implement MPI collective communications.
Networks like Myrinet offer non-deadlocking routes through the Myrinet mapper [13]. Fully connected networks that can be composed from Myrinet, InfiniBand fabrics, or similar technologies should offer access to such non-interfering routes in order to allow programs to exploit full bisection bandwidth and to avoid unexpected bandwidth collapse in cases that the fabric should be able to handle without performance compromise. While this is generally well understood in older technologies such as Myrinet, it is an issue still to be resolved for InfiniBand. Issues include access to appropriate sets of routes for static/clique programs that assume all-to-all virtual topologies (such as MPI-1 programs), and access to appropriate sets of routes for dynamic programs that add and remove cliques during computation and may or may not need full virtual topology support (such as MPI-2 programs that use dynamic process management). Several kinds of issues arise, which we raise here rather than solve. First, there has to be an API that gives access to static routes, if they are precomputable for a fabric. Second, there has to be an API that allows modification of such routes. Third, if multiple route options are available for more complex InfiniBand networks, it should be possible to use these multiple routes. Finally, if a resource limitation prevents precomputed routes from being stored for large systems, a systematic understanding of the costs of dynamically caching such routes is important. While formally requiring all-to-all virtual connectivity, most real applications will use rather more limited runtime graphs, except those that actually do data transposition (corner turns), which might be effectively implemented with a fully connected graph.
The coupling of such routing information to the services of InfiniBand is also an important question that impacts overall scalability. For instance, if RD is exploited as a means to enhance scalability (either directly or in connection with Portals), then any hidden unscalability arising from routing information must also be identified and removed for large-scale networks, or else the cost of higher scalability in terms of dynamic caching must be estimated. Brightwell et al. identify in [7] reasons why connection-oriented protocols such as VIPL limit scalability: each connection uses fixed resources in the NIC, rather than each session. Similarly, the lower-level routing information may effectively reproduce such unscalable growth in resource requirements. One then has to consider whether these limitations are binding on systems of the sizes likely to be created in practice (e.g., up to 65,536 nodes in current generation superclusters). Given the emphases of this paper, it is important to mention that there may be coupling issues to existing verb layers that require new verbs, both for message passing and for connection management. MPI I/O, the I/O chapter of the MPI-2 standard, is a potential beneficiary of the RC and UD capabilities offered by InfiniBand. Collective communication and non-blocking I/O calls could take advantage of the low latency and high bandwidth offered by InfiniBand to offer extended overlapping of not only computation and communication, but also I/O.
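As an illustration of the one-sided model that stands to gain from InfiniBand Remote Operations, the following minimal MPI-2 sketch exposes a window and updates it with MPI_Put inside a fence epoch; an MPI implementation layered over IB could map such a put onto an RDMA write into registered memory. The program is illustrative and assumes at least two ranks.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Each process exposes one integer through an MPI-2 window. */
    int exposed = -1;
    MPI_Win win;
    MPI_Win_create(&exposed, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    /* Active-target synchronization: everything between the fences is an epoch. */
    MPI_Win_fence(0, win);
    if (nprocs > 1 && rank == 0) {
        int value = 42;
        /* One-sided update of rank 1's window; an IB-based MPI could
         * implement this as an RDMA write into registered memory. */
        MPI_Put(&value, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
    }
    MPI_Win_fence(0, win);

    if (rank == 1)
        printf("rank 1 window now holds %d\n", exposed);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```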
Figure 2. Layering Options for Network Programming Models (MPI-1/2 over DAT 1.0/VIPL and over Portals; MPI-2 one-sided over Remote Operations; MPI collective communication over RMC and MC; transports shown: RD, RC, UC, UD, MC, RMC)
4. HIGH PERFORMANCE MESSAGE PASSING
The following are some of our experiences in porting MPI/Pro (a commercial implementation of the MPI-1 standard) and ChaMPIon/Pro (a commercial implementation of the MPI-2 standard) to benefit from InfiniBand.
4.1 Impacts on efficient and scalable MPI implementations
Among the features provided by InfiniBand are the Reliable Connection and Reliable Datagram services. In Reliable Connection oriented services, programmers using IB verbs need to explicitly change Queue Pair (QP) states to initiate the handshake between two QPs (a sketch of these transitions is given below). Applications could exploit this fine-grained control to increase performance by overlapping handshaking and computation. It would also increase the scalability of applications in a large-scale InfiniBand network, since QPs could be multiplexed and reused, reducing the number of connections required. (Perhaps making and breaking QP connections can be used in MPI communicator creation and destruction to help use QPs efficiently, especially when the number of QPs is limited. When the number of QPs is limited, all QPs can be created in MPI_Init, and lightweight state-change operations can be used in MPI_Comm_create/dup and MPI_Comm_free.)
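The explicit QP state machine mentioned above can be sketched with the open-source libibverbs interface, used here only as a stand-in for the vendor verbs APIs discussed in this paper (which it postdates). The sketch assumes the remote LID, QP number, and starting PSNs have already been exchanged out of band, for example during MPI_Init, and the numeric attribute values are illustrative rather than tuned.

```c
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Bring a freshly created RC queue pair up to RTS, assuming the remote
 * side's LID, QP number, and starting PSN were exchanged out of band. */
static int connect_rc_qp(struct ibv_qp *qp, uint8_t port,
                         uint16_t remote_lid, uint32_t remote_qpn,
                         uint32_t remote_psn, uint32_t local_psn)
{
    struct ibv_qp_attr attr;

    /* RESET -> INIT: bind the QP to a port and set remote access rights. */
    memset(&attr, 0, sizeof(attr));
    attr.qp_state        = IBV_QPS_INIT;
    attr.pkey_index      = 0;
    attr.port_num        = port;
    attr.qp_access_flags = IBV_ACCESS_REMOTE_WRITE | IBV_ACCESS_REMOTE_READ;
    if (ibv_modify_qp(qp, &attr, IBV_QP_STATE | IBV_QP_PKEY_INDEX |
                      IBV_QP_PORT | IBV_QP_ACCESS_FLAGS))
        return -1;

    /* INIT -> RTR: identify the remote QP so receives can complete. */
    memset(&attr, 0, sizeof(attr));
    attr.qp_state           = IBV_QPS_RTR;
    attr.path_mtu           = IBV_MTU_1024;
    attr.dest_qp_num        = remote_qpn;
    attr.rq_psn             = remote_psn;
    attr.max_dest_rd_atomic = 1;
    attr.min_rnr_timer      = 12;
    attr.ah_attr.dlid       = remote_lid;
    attr.ah_attr.port_num   = port;
    if (ibv_modify_qp(qp, &attr, IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU |
                      IBV_QP_DEST_QPN | IBV_QP_RQ_PSN |
                      IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MIN_RNR_TIMER))
        return -1;

    /* RTR -> RTS: enable sends; the timeout/retry attributes relevant to
     * Section 4.3 are configured during this transition as well. */
    memset(&attr, 0, sizeof(attr));
    attr.qp_state      = IBV_QPS_RTS;
    attr.sq_psn        = local_psn;
    attr.timeout       = 14;
    attr.retry_cnt     = 7;
    attr.rnr_retry     = 7;
    attr.max_rd_atomic = 1;
    if (ibv_modify_qp(qp, &attr, IBV_QP_STATE | IBV_QP_SQ_PSN |
                      IBV_QP_TIMEOUT | IBV_QP_RETRY_CNT | IBV_QP_RNR_RETRY |
                      IBV_QP_MAX_QP_RD_ATOMIC))
        return -1;

    return 0;
}
```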
4.2 Exploiting HCA Parallelism
IB HCAs typically exploit parallelism in hardware. A direct consequence is that throughput can be maximized by pipelining messages across multiple QPs whenever possible. MPI requires message ordering within communicator and tag scope. This may limit unrestricted use of multiple QPs between MPI ranks, because message ordering is guaranteed by IB Reliable Connections only within the same QP connection. A naive MPI implementation would use a single QP for each MPI rank pair. However, there are several ways for MPI libraries to leverage concurrent messaging bandwidth through the use of multiple QPs (a striping sketch follows this list):
a. Use one QP for each process pair for each communicator. This is a conservative exploitation of parallelism that benefits MPI applications using multiple communicators. No special measures are required by the MPI library to enforce message ordering, since ordering is limited to a communicator.
b. Preallocate a fixed number (more than one) of QP connections between each rank pair, and multiplex large message transfers across multiple QPs. This exploits HCA parallelism by pipelining message transfers better. As bus limitations are removed over time, such methods may increase in value.
c. Use separate QPs for control messages (handshake messages, flow control messages) and data. This insulates control messages from being blocked behind long data transfers. An extension of this model would be to use one QP each for small, medium, large, and very large messages.
d. Use multiple QPs, one for each {tag, communicator} pair. This model extends model (a) to use even more QPs. It would benefit applications using several communicators and/or several tags, and it inherently maps well to the tag-plus-communicator ordering rule of MPI. However, handling the MPI_ANY_TAG wildcard requires the MPI implementation to add extra ordering logic.
e. Use separate QP(s) for collective communication and point-to-point communication. This model should be easy to implement because MPI dictates no ordering constraints between the point-to-point and collective message spaces. On the other hand, when the number of QPs is restricted, it limits liberal use of QPs, and performance has to be traded off against scalability in light of this resource constraint.
Still, developers of high performance cluster middleware face scalability versus bandwidth trade-offs. However, as the Reliable Datagram (RD) communication service type is widely supported by mainstream IB hardware vendors, the conflict may be addressed by providing a dynamic mapping between application communication channels and End-to-End (EE) contexts. This strategy requires sophisticated design, and it may need to be application-oriented to avoid introducing additional overhead and/or latency, but it is widely believed to be a promising approach for high performance clustering middleware.
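Model (b) above can be sketched as follows, again using libibverbs as a stand-in for the vendor verbs interfaces. The function assumes a pool of already-connected RC QPs, a locally registered buffer, a remote registered region whose address and rkey were exchanged beforehand, and a completion queue deep enough for one entry per chunk; the chunk size is left to the caller.

```c
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Stripe one large, registered send buffer across several pre-connected RC QPs.
 * Each chunk is posted as an RDMA write on a different QP so the HCA can work
 * on them concurrently. Assumes buf lies inside 'mr' and that remote_addr/rkey
 * describe a matching registered region at the target. */
static int stripe_rdma_write(struct ibv_qp **qps, int num_qps,
                             struct ibv_mr *mr, char *buf, size_t len,
                             uint64_t remote_addr, uint32_t rkey, size_t chunk)
{
    size_t off = 0;
    int i = 0;

    while (off < len) {
        size_t this_len = (len - off < chunk) ? (len - off) : chunk;

        struct ibv_sge sge = {
            .addr   = (uintptr_t)(buf + off),
            .length = (uint32_t)this_len,
            .lkey   = mr->lkey,
        };
        struct ibv_send_wr wr, *bad = NULL;
        memset(&wr, 0, sizeof(wr));
        wr.wr_id               = off;           /* lets the completion be matched */
        wr.sg_list             = &sge;
        wr.num_sge             = 1;
        wr.opcode              = IBV_WR_RDMA_WRITE;
        wr.send_flags          = IBV_SEND_SIGNALED;
        wr.wr.rdma.remote_addr = remote_addr + off;
        wr.wr.rdma.rkey        = rkey;

        /* Round-robin over the QP pool; ordering across QPs is NOT guaranteed,
         * so the MPI library must enforce tag/communicator ordering itself. */
        if (ibv_post_send(qps[i], &wr, &bad))
            return -1;
        i = (i + 1) % num_qps;
        off += this_len;
    }
    return 0;
}
```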
4.3 Timeouts and retries
IB allows a programmer to specify timeouts and the number of retries (the RNR_TIMEOUT and RNR_RETRY attributes). This can help the MPI library avoid adding explicit flow control mechanisms, which typically add overhead. Using an appropriate combination of timeout and retry values can achieve low-overhead or even no-overhead flow control. Applications that inherently do not require flow control (applications that never exhaust system buffers) never incur extra overhead, whereas this is not true even for simple credit-based flow control mechanisms.
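With generous retry settings (configured as QP attributes, as in the state-transition sketch of Section 4.1), the MPI library needs no credit protocol of its own; it merely has to recognize the completion status that signals an exhausted retry budget. The fragment below, again using libibverbs as a stand-in for the vendor verbs, shows that classification.

```c
#include <infiniband/verbs.h>
#include <stdio.h>

/* Drain one completion and classify it. With generous RNR retry settings the
 * MPI library needs no credit protocol; it only has to recognize the error
 * that signals the retry budget was exhausted. */
static int check_one_completion(struct ibv_cq *cq)
{
    struct ibv_wc wc;
    int n = ibv_poll_cq(cq, 1, &wc);
    if (n < 0)  return -1;               /* completion queue error */
    if (n == 0) return  0;               /* nothing completed yet */

    switch (wc.status) {
    case IBV_WC_SUCCESS:
        return 1;                        /* transfer done, no software flow control used */
    case IBV_WC_RNR_RETRY_EXC_ERR:       /* receiver had no buffer posted, retries exhausted */
    case IBV_WC_RETRY_EXC_ERR:           /* transport-level retries exhausted */
    default:
        fprintf(stderr, "wr %llu failed: %s\n",
                (unsigned long long)wc.wr_id, ibv_wc_status_str(wc.status));
        return -1;
    }
}
```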
4.4 Selective event completion notification
IB allows QPs to be configured so that notification of completion can be requested selectively when posting a send or receive work queue entry. Generating an event queue entry is typically fairly expensive, as it involves programmed I/O (PIO) across the PCI bus by the driver. The MPI library can therefore request notification for only every Nth send work queue entry, rather than for each one.
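A sketch of such selective signaling follows, using libibverbs as a stand-in for the vendor verbs. It assumes the QP was created with sq_sig_all set to zero, so that only work requests explicitly flagged IBV_SEND_SIGNALED generate completion entries; the signaling interval is illustrative, and the send queue must be sized so that unsignaled requests cannot overrun it.

```c
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

#define SIGNAL_INTERVAL 64   /* request a completion only every 64th send (illustrative) */

/* Post a small send, asking for a completion entry only every Nth work request.
 * The QP must have been created with ibv_qp_init_attr.sq_sig_all = 0 so that
 * unsignaled sends generate no CQ entry. 'seq' counts sends posted on this QP. */
static int post_send_selective(struct ibv_qp *qp, struct ibv_mr *mr,
                               void *buf, uint32_t len, uint64_t seq)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,
        .length = len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr, *bad = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.wr_id   = seq;
    wr.sg_list = &sge;
    wr.num_sge = 1;
    wr.opcode  = IBV_WR_SEND;

    /* Only every SIGNAL_INTERVAL-th send asks the HCA for a completion; when it
     * arrives, all earlier unsignaled sends on this RC QP are also complete. */
    if (seq % SIGNAL_INTERVAL == 0)
        wr.send_flags = IBV_SEND_SIGNALED;

    return ibv_post_send(qp, &wr, &bad);
}
```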
4.5 Fine-grained control over IB enabling efficient communications
Traditional flow control, such as credit-based flow control, is built on top of implicit message ordering requirements, often utilizing features such as fencing, immediate data, and so on. The receive side needs to send messages carrying credits to the send side as flow control notifications. In cases where data is constantly transferred one way, this credit-based flow control is unnecessary, and the credit notifications may also waste bandwidth.
The work request entries of the IB verbs contain fields such as retry, timeout, and so on. These provide more fine-grained control of communication. By specifying a proper retry count and timeout value, flow control can be made more efficient or even removed in some cases.
4.6 Atomic operations for MPI-2 implementations
The shared memory window paradigm of MPI, with window-based communicators, matches well the memory region model offered by InfiniBand; the semantics for updating such windows, versus the required MPI synchronization semantics, are an attractive match. InfiniBand supports two atomic operations: Compare & Swap and Fetch & Add. Compare & Swap instructs the remote QP to read a 64-bit value, compare it with the compare data provided, and, if equal, replace it with the swap data provided. Fetch & Add, on the other hand, instructs the remote QP to read a 64-bit value and replace it with the sum of that value and the add data provided. MPI-2 offers a rich, high-level distributed shared memory paradigm with one-sided semantics in addition to two-sided message passing. These one-sided semantics appear to work well with both the RDMA constructs of InfiniBand and its atomic operations. While the latter have not been explored extensively at all, we find that both the compare-and-swap (read-modify-write) and fetch-and-add operations can be exploited in support of certain one-sided operations. The MPI accumulate operation appears to be supported, at least in part, by such atomic operations. The direct use of distributed shared memory (DSM) for collective data reorganization in third-party memory appears promising. Finally, the use of DSM and RDMA together with MPI-2's parallel I/O chapter appears extremely promising, offering to remove intermediate TCP/IP connections between parallel I/O clients and parallel I/O servers, and to allow flexible choices for data reorganization at either the (clique/parallel) client or server side of the parallel I/O.
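A Fetch & Add posted through a verbs-style interface might look as follows; libibverbs is again used only as a stand-in, the remote counter must reside in memory registered with remote-atomic access, and the 8-byte result lands in local registered memory. An MPI-2 accumulate of MPI_SUM on a single 64-bit integer could, in principle, be mapped onto this operation.

```c
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Post a 64-bit Fetch & Add against a remote, registered counter. The previous
 * value of the counter is returned into 'result', which must lie inside the
 * local registered region 'mr'. The remote address must be 8-byte aligned and
 * the remote memory registered with IBV_ACCESS_REMOTE_ATOMIC. */
static int post_fetch_add(struct ibv_qp *qp, struct ibv_mr *mr,
                          uint64_t *result, uint64_t remote_addr,
                          uint32_t rkey, uint64_t add_value)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)result,
        .length = sizeof(uint64_t),
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr, *bad = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.wr_id                 = 1;
    wr.sg_list               = &sge;
    wr.num_sge               = 1;
    wr.opcode                = IBV_WR_ATOMIC_FETCH_AND_ADD;
    wr.send_flags            = IBV_SEND_SIGNALED;
    wr.wr.atomic.remote_addr = remote_addr;   /* 64-bit counter at the target */
    wr.wr.atomic.rkey        = rkey;
    wr.wr.atomic.compare_add = add_value;     /* the value to add atomically */
    /* wr.wr.atomic.swap is used only by Compare & Swap */

    return ibv_post_send(qp, &wr, &bad);
}
```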
4.7 Some experimental results
Our initial experiments with porting MPI/Pro over the IBM Verbs API, uDAPL, and the Mellanox VAPI included ping-pong latency and bandwidth measurements. Back-to-back configurations were used for the IBM Verbs testbed. For the experiments with the IBM Verbs API and uDAPL (an implementation over the IBM Verbs API), the configuration included Supermicro motherboards with Intel E7500 chipsets, Intel Xeon processors, and Paceline 2100 HCAs. Experiments with the Mellanox VAPI were performed on InfiniHost MTEK23108 HCAs connected by a Topspin 360 switch; this configuration included Intel Server Board SHG2 systems with the ServerWorks Grand Champion LE chipset and Intel Xeon processors. The IBM Verbs based tests were performed with two microcode versions (1.5.1 and 1.6.4), while the uDAPL tests were performed only with HCA firmware version 1.5.1.
Figure 3 shows the latencies achieved with MPI/Pro for two processes in a back-to-back configuration, and Figure 4 shows the one-way bandwidths achieved with a similar configuration. The latencies and bandwidths achieved with IBM Verbs microcode version 1.6.4 were considerably better than those with version 1.5.1. Although the bandwidths for uDAPL were comparable to those for IBM Verbs (microcode version 1.5.1), the observed latencies were slightly higher; this is attributed to the overhead associated with the uDAPL layer. Also, RDMA writes were used for long messages, as opposed to RDMA reads, which would have yielded better performance. Peak bandwidth of up to 500 MB/s was observed.
Figure 3: MPI/Pro Latency over IBM Verbs
Figure 4: MPI/Pro Bandwidth over IBM Verbs
Latency and bandwidth measurements for two processes using MPI/Pro over the Mellanox VAPI are shown in Figures 5 and 6, respectively. MPI/Pro over VAPI supports two modes, namely polling and blocking; the blocking mode is implemented using the VAPI event handling verbs. Polling latencies of 8.5 us and 11.8 us were observed for the VAPI layer and MPI/Pro, respectively, and event latencies of 33 us and 38 us were observed for VAPI and MPI/Pro, respectively. A peak bandwidth of 827 MB/s was observed for the VAPI layer. For MPI/Pro, a peak bandwidth of 802 MB/s was observed when the buffers were aligned on the page size; however, when the buffers were aligned and the same buffers were used for sending and receiving in the ping-pong tests, a peak bandwidth of 815 MB/s was observed. It was interesting to note that the best results were obtained with an MTU of 1024 bytes, whereas the maximum bandwidth was only 560 MB/s with an MTU of 2048 bytes.
Figure 5: MPI/Pro Latency over Mellanox VAPI
Figure 6: MPI/Pro Bandwidth over Mellanox VAPI (Long Messages)
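For reference, the measurements above were taken with simple ping-pong microbenchmarks of the following general shape; this sketch is illustrative (the message size and iteration count are arbitrary) and is not the actual benchmark code used.

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define ITERS 1000

/* Minimal ping-pong between ranks 0 and 1; half the round-trip time is
 * reported as the one-way latency for the given message size. */
int main(int argc, char **argv)
{
    int rank, size = (argc > 1) ? atoi(argv[1]) : 1024;  /* message size in bytes */
    char *buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    buf = malloc(size);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {
            MPI_Send(buf, size, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, size, MPI_BYTE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, size, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, size, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("%d bytes: %.2f us one-way\n", size,
               (t1 - t0) / (2.0 * ITERS) * 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}
```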
5. CONCLUSIONS
This paper compares, contrasts, and maps APIs to 4X InfiniBand hardware. Commentary on porting between APIs, and on simultaneously using multiple APIs, is given. The paper analyzes the various programming models that might be used by applications to benefit from the features provided by InfiniBand, while assessing the potential convergence of low-level APIs. Outstanding scalability concerns are also raised, notably that RC is not as scalable as desired for large clusters and that underlying route computation must be utilized. Opportunities to use RD and other InfiniBand modes together with the Portals API appear to be excellent paths to high scalability, but routing issues will be important here as well.
InfiniBand introduces many new programming options and many new ways to build high performance clusters. In the next few months, convergence of programming models and programming options, and the hardening of low-level implementations, will help identify how fully commercial offerings for production computing will evolve. The opportunity to explore the maximum scalability of InfiniBand (APIs, protocols, and routing issues) remains for immediate work. Effective scalability up to commonly used sizes is obviously possible, with issues at massive scale remaining to be explored in great depth. InfiniBand is a useful technology for implementing parallel clusters, and several of the verb APIs are particularly efficient for supporting the requirements of MPI-1 point-to-point message passing via RDMA, while offering the potential to enhance implementations of MPI-2 one-sided communication. Opportunities for enhancing parallel I/O system implementations using InfiniBand also exist.
6. ACKNOWLEDGEMENTS
Work at both Mississippi State University and MPI Software Technology, Inc. was enabled through loaner hardware from the respective commercial vendors mentioned herein, as well as from Los Alamos National Laboratory, and through superb technical support from both Paceline Systems and Mellanox concerning low-level software issues. Work at MPI Software Technology, Inc. and Mississippi State University was funded, in part, under grants from the National Science Foundation SBIR Program, Phase II and IIb; grants DMI-9983413 and DMI-0222804 are acknowledged. MPI Software Technology, Inc. was also enabled through a subcontract with the US Department of Energy, Los Alamos National Laboratory. Dr. Gary Grider of Los Alamos is specifically acknowledged as well.
7. REFERENCES
[1] Compaq, Microsoft, and Intel, "Virtual Interface Architecture Specification, Version 1.0," Technical report, December 1997.
[2] DAT Collaborative, "uDAPL API Specification, Version 1.0," http://www.datcollaborative.org.
[3] InfiniBand Trade Association, "InfiniBand Architecture Specification, Release 1.0," http://www.InfiniBandta.org.
[4] International Business Machines (IBM) Corporation, "InfiniBlue Host Channel Adapter Access Application Programming Interfaces Programmer's Guide," October 2002.
[5] Mellanox Technologies Inc., "Mellanox IB-Verbs API (VAPI)," 2001.
[6] Message Passing Interface Forum, "MPI: A Message Passing Interface Standard," The International Journal of Supercomputer Applications and High Performance Computing, vol. 8, 1994.
[7] Ron Brightwell and Arthur B. Maccabe, "Scalability Limitations of VIA-Based Technologies in Supporting MPI," Proceedings of the Fourth MPI Developer's and User's Conference, March 2000.
[8] Ron Brightwell, Arthur B. Maccabe, and Rolf Riesen, "The Portals 3.2 Message Passing Interface, Revision 1.1," Sandia National Laboratories, November 2002.
[9] Vieo Inc., "Channel Abstraction Layer (CAL) API Reference Manual, V2.0," January 2002.
[10] Intel Corporation, "Linux InfiniBand Project," http://InfiniBand.sourceforge.net.
[11] C. Dubnicki et al., "Shrimp Project Update: Myrinet Communication," IEEE Micro, Jan.-Feb. 1998, pp. 50-52.
[12] T. von Eicken et al., "U-Net: A User-Level Network Interface for Parallel and Distributed Computing," Proc. 15th Symp. Operating System Principles, ACM Press, New York, 1995, pp. 40-53.
[13] Nanette J. Boden, Danny Cohen, Robert E. Felderman, Alan E. Kulawik, Charles L. Seitz, Jakov N. Seizovic, and Wen-King Su, "Myrinet: A Gigabit-per-Second Local Area Network," IEEE Micro, 15(1):29-36, February 1995.
[14] Interconnect Software Consortium, "ICSC Transport API Specification," http://www.opengroup.org/icsc.