Parallel Distributed Computing on SCI Workstation Clusters: Early Experiences†

Hermann Hellwagner
Lehreinheit für Rechnertechnik und Rechnerorganisation / Parallelrechnerarchitektur
Institut für Informatik, Technische Universität München
D-80290 München, Germany
[email protected]

Vaidy Sunderam
Department of Math and Computer Science
Emory University, Atlanta, GA 30322, USA
[email protected]

† Appeared in: Mitteilungen - Gesellschaft für Informatik e.V., Parallel-Algorithmen und Rechnerstrukturen (PARS), Nr. 14, Dez. 1995.
* Research supported by the Applied Mathematical Sciences program, Office of Basic Energy Sciences, U.S. Department of Energy, under Grant No. DE-FG05-91ER25105, the National Science Foundation, under Award No. CCR-9118787, and the German Science Foundation under SFB 342.

Abstract

Workstation and PC clusters interconnected by SCI (Scalable Coherent Interface) are very promising technologies for high-performance cluster computing. Using commercial SBus-SCI interface cards and early system software and drivers, a two-SPARCstation cluster has been constructed for initial testing and evaluation. PVM direct task-to-task data transfers have been adapted to operate on this cluster using raw device access to the SCI interconnect (message passing), and preliminary communications performance tests have been carried out. Our early results with PVM direct data transfers over SCI show a two- to three-fold throughput improvement over the fastest Ethernet implementation. However, message-passing latencies over SCI are in the millisecond range, comparable to TCP-based communication latencies. Initial analysis suggests that performance is primarily limited by the sub-optimal functionality and efficiency available with the first-generation SBus-SCI hardware and software, in particular the SCI device driver. The utilization of shared-memory based interprocess communication promises to improve latency figures significantly, into the range of 100 microseconds.

1 Introduction

High-performance computing in heterogeneous networked environments is a standard paradigm, and far more common than hardware multiprocessors or vector supercomputers. This computing model typically employs software systems to emulate a general-purpose message-passing parallel computer on collections of interconnected computers or clusters; numerous such software systems are available, with PVM [1] [2] being the most widely used. However, such cluster systems are subject to one major drawback, viz. communications performance. The low-speed, shared, packet-switched nature of general-purpose networks and the high software overheads of Internet protocols result in minimum latencies of 1-3 ms, and bandwidths lower than 1 MB/s on Ethernet.

Recent and imminent advances in networking promise to alleviate the above deficiencies. The Scalable Coherent Interface (SCI) standard defines a point-to-point interconnect that offers backplane bus services and that can, in fact, serve as a high-speed network for LAN-based high-performance cluster computing [3]. An IEEE standard since 1992 [4] [5], SCI proposes innovative solutions to the bus signaling and bottleneck problems and to the cache-coherence problem, and is capable of delivering 1 Gb/s bandwidth and latencies in the 10 µs range in local networks.

While not yet widespread, commercial SCI products are becoming available. One such product is the SBus-SCI Adapter Card from Dolphin Interconnect Solutions that enables SPARC Unix workstations to be connected by copper media. This product includes device driver software that provides both elementary message-passing and shared-memory APIs. Using such a two-machine SCI cluster, an initial implementation of PVM direct task-to-task data transfers over the SCI interconnect has been done for initial testing and evaluation. The testbed and the PVM adaptation are described in this paper, and preliminary performance results are reported.

2 The Testbed

For SCI software development, an SCI "ring" of two SPARCstation-2 (SS-2) machines has been set up using 25 MHz Dolphin SBus-SCI adapter cards [6]. The device driver for this product (V1.4a, May 1995) supports two elementary software interfaces.

The raw channel interface (message passing) provides an ioctl system call for establishing and controlling connections between processes on different machines. Data can then be transferred across connections using write and read system calls. Raw I/O channels are simplex; separate unidirectional connections are created if both communicating processes wish to send and receive messages from each other. The SCI adapter uses programmed I/O for messages smaller than 64 bytes, or if the message is not aligned on a 64-byte boundary. For larger messages (if properly aligned), a DMA engine conducts the data transfer and enables high throughputs to be achieved. The vendor emphasizes that, to achieve low latencies, only a single intermediate kernel buffer (at the receiver side), and thus only a single intermediate copy operation, is involved in a data transfer.

The shared-memory interface consists of ioctls for constructing shared-memory segments as well as setting up address mappings in the interface so that remote segments can be accessed transparently. Processes can include shared segments in their virtual address spaces using the mmap system call. Processes can move data transparently from the local memory of one machine to local memories on other machines by simple assignment operations onto shared segments. This is effected by the DMA engine; no system call or intermediate kernel buffer is involved in this kind of data transfer.

The SCI links are advertised to have 1 Gb/s bandwidth, while the interface cards should deliver a sustained application throughput of 10 MB/s. With faster machines (SS-20), round-trip, ping-pong latencies of 8 µs have been reported [7].
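To make the raw channel usage pattern concrete, the following sketch shows the open/ioctl/write flow described above. The device path /dev/sci0, the ioctl request codes SCI_CONNECT and SCI_LISTEN, and the argument structure are hypothetical placeholders; the actual names and structures are defined by the Dolphin driver and are not reproduced here.

    /* Sketch of the raw channel (message-passing) interface usage pattern.
     * NOTE: "/dev/sci0", SCI_CONNECT, SCI_LISTEN and struct sci_conn_args
     * are hypothetical placeholders, not the real driver's definitions.   */
    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    #define SCI_CONNECT 1              /* placeholder ioctl request codes */
    #define SCI_LISTEN  2

    struct sci_conn_args {
        int remote_nodeid;             /* SCI node id of the peer machine      */
        int port;                      /* port number agreed upon by the peers */
    };

    /* Raw channels are simplex: a process that wants to both send and
     * receive opens two channels, one as sender and one as receiver.   */
    int sci_open_simplex_channel(int remote_nodeid, int port, int is_sender)
    {
        int fd = open("/dev/sci0", O_RDWR);
        if (fd < 0)
            return -1;

        struct sci_conn_args args = { remote_nodeid, port };
        if (ioctl(fd, is_sender ? SCI_CONNECT : SCI_LISTEN, &args) < 0) {
            close(fd);
            return -1;
        }
        return fd;   /* data then moves with plain write() / read() calls */
    }

Transfers of at least 64 properly aligned bytes over such a descriptor are handled by the DMA engine; shorter or unaligned transfers fall back to programmed I/O.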

3 Implementing PVM over the SBus-SCI Interconnect

The PVM system provides a general library-based message-passing interface to enable distributed-memory concurrent computing on heterogeneous clusters. Message exchange as well as synchronization, process management, virtual machine management, and other miscellaneous functions are accomplished by this library in conjunction with PVM daemons that execute on a user-defined host pool and cooperate to emulate a parallel computer. Specifically, data transfer in PVM is accomplished with the pvm_send/pvm_recv calls that support messages containing multiple data types but require explicit pvm_pk and pvm_upk calls for buffer construction and extraction. Alternatively, the pvm_psend/pvm_precv calls may be used for transfer of single-data-type buffers. Both methods account for data representation differences, and both use process task ids and message tags for addressing and discrimination. In addition, a runtime option allows message exchange to use direct process-to-process stream (i.e. TCP) connections, rather than transferring data via the daemons as is the default. Therefore, pvm_psend/pvm_precv with the direct routing option is the fastest communication method in PVM.
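For reference, the sketch below shows the fastest standard PVM path just described: direct routing enabled via pvm_setopt, and a single-datatype exchange with pvm_psend/pvm_precv. These are standard PVM 3 calls; the peer task id and the buffer size are arbitrary choices for illustration.

    /* Sketch: PVM direct task-to-task transfer of a single-datatype buffer.
     * Assumes a running PVM virtual machine and a valid peer task id.      */
    #include <pvm3.h>

    void exchange_with(int peer)
    {
        double buf[1024];              /* contents omitted in this sketch */
        int rtid, rtag, rlen;

        /* Ask for direct TCP connections instead of routing via the daemons. */
        pvm_setopt(PvmRoute, PvmRouteDirect);

        /* Typed send: no explicit pvm_pkdouble / pvm_upkdouble packing needed. */
        pvm_psend(peer, /* msgtag */ 1, (char *)buf, 1024, PVM_DOUBLE);

        /* Matching receive; actual source, tag and item count are returned. */
        pvm_precv(peer, 1, (char *)buf, 1024, PVM_DOUBLE, &rtid, &rtag, &rlen);
    }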

3.1 Implementation Strategy

Our project to implement PVM over the SBus-SCI interface consists of several phases:

1. Implement the equivalent of pvm_psend/pvm_precv as a non-intrusive library over SCI using the raw I/O interface, i.e. message passing. Complete this implementation to incorporate heterogeneity, fragmentation/reassembly, Fortran stubs, optimization, and fine tuning.

2. Incorporate the use of SCI shared memory for data transfers into this implementation. Utilize shared-memory communication or message passing, respectively, where most efficient.

3. Implement the native PVM send/receive calls with intelligent decision making to utilize SCI when available and to fall through to default protocols otherwise.

4. Investigate the feasibility and desirability of implementing PVM control and administration functionality over SCI.

5. Port the above to PC-based compute clusters interconnected via the forthcoming PCI-SCI interface card.

3.2 PVM Direct Data Transfers over SCI

In the initial phase of this project, we have completed the first part of item (1). Issues arising from this exercise, i.e. the design and implementation of pvm_scisend/pvm_scirecv counterparts to direct-routed pvm_psend/pvm_precv, are discussed below.

In the interest of compatibility, the argument lists and semantics of pvm_scisend/pvm_scirecv were designed to be identical to pvm_psend/pvm_precv; thereby programs may be trivially translated to use either of the two mechanisms. The argument list for pvm_scisend consists of the destination task id (multicast is not currently supported), a message tag, a typed buffer, the number of items to be sent, and a coded datatype indicator:

    int info = pvm_scisend(int tid, int msgtag, char *buf, int len, int datatype)

The first five parameters to pvm_scirecv are identical, with the exception that wildcards are allowed for the source task id and message tag; in addition, there are three output parameters for the return of the actual task id, message tag, and message length:

    int info = pvm_scirecv(int tid, int msgtag, char *buf, int len, int datatype, int *atid, int *atag, int *alen)
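Since the argument lists mirror pvm_psend/pvm_precv, translating the standard-PVM fragment shown earlier is mechanical. A minimal sketch follows; it assumes that the datatype codes (e.g. PVM_DOUBLE) are shared with standard PVM and that the PVM-SCI library supplies the prototypes given above.

    /* Sketch: the pvm_psend/pvm_precv fragment above, translated one-to-one. */
    #include <pvm3.h>   /* assumed source of the PVM_DOUBLE datatype code */

    /* Prototypes as given in the text (normally supplied by the PVM-SCI library). */
    int pvm_scisend(int tid, int msgtag, char *buf, int len, int datatype);
    int pvm_scirecv(int tid, int msgtag, char *buf, int len, int datatype,
                    int *atid, int *atag, int *alen);

    void exchange_with_sci(int peer)
    {
        double buf[1024];
        int rtid, rtag, rlen;

        pvm_scisend(peer, 1, (char *)buf, 1024, PVM_DOUBLE);
        pvm_scirecv(peer, 1, (char *)buf, 1024, PVM_DOUBLE, &rtid, &rtag, &rlen);
    }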

Return values from both calls are identical to those for pvm_psend/pvm_precv.

To implement pvm_scisend/pvm_scirecv over the raw I/O interface of the SBus-SCI device, the following strategy was adopted. At the first pvm_scisend attempt for a specific source-destination pair, a special connection request message is sent via the default PVM mechanism, which is assumed to be operational. The corresponding pvm_scirecv either accepts or rejects this connection request; in the latter case, the library transparently equivalences the SCI calls to regular PVM calls for all future communications for this source-destination pair. If the receiver is willing and able to accept the connection, it allocates and returns an SCI port number and nodeid in an acknowledgement message. Subsequently, the sender transmits messages by writing directly to the active end of the connection. For each message, a header containing the message length, source and destination identifiers, and a tag is first sent to permit the destination to prepare for reception. The actual message is then written, with fragmentation carried out when needed; as with pvm_psend/pvm_precv, arbitrarily large messages can be sent. Small messages are piggybacked onto headers to avoid multiple system calls and context switches.

Implementation of the pvm_scirecv function consists of accepting connections, and subsequently reading headers and data from the passive end of the connection. In addition, the receiver library performs multiplexing among many incoming SCI connections (and connection requests), performs source and message-tag based discrimination, and also reads in and buffers early messages. To preserve PVM source/tag filtering semantics and to avoid retries and possible deadlocks, all incoming messages are first read in and buffered whenever the pvm_scisend/pvm_scirecv libraries get program control.
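The sender-side framing described above can be pictured roughly as follows. The header layout, the field names, and the piggyback size are illustrative assumptions; only the overall scheme (header first, payload fragmented, small messages carried inside the header) follows the text.

    /* Illustrative framing for pvm_scisend over a raw SCI channel.
     * Field names, sizes and the piggyback layout are assumptions; the
     * 32 kB fragment size is the one mentioned in the measurements.    */
    #include <string.h>
    #include <unistd.h>

    #define SCI_FRAG_SIZE   (32 * 1024)   /* fragment size used in the tests   */
    #define SCI_PIGGY_MAX   64            /* assumed piggyback capacity (bytes) */

    struct sci_msg_hdr {
        int src_tid;                      /* PVM task id of the sender    */
        int dst_tid;                      /* PVM task id of the receiver  */
        int tag;                          /* PVM message tag              */
        int length;                       /* total payload length (bytes) */
        char piggy[SCI_PIGGY_MAX];        /* payload of small messages    */
    };

    static int sci_send_msg(int fd, int src, int dst, int tag,
                            const char *buf, int len)
    {
        struct sci_msg_hdr hdr = { src, dst, tag, len, {0} };

        if (len <= SCI_PIGGY_MAX) {       /* piggyback: one write, no extra call */
            memcpy(hdr.piggy, buf, (size_t)len);
            return write(fd, &hdr, sizeof hdr) == (ssize_t)sizeof hdr ? 0 : -1;
        }

        /* Header announces the message so the receiver can prepare ... */
        if (write(fd, &hdr, sizeof hdr) != (ssize_t)sizeof hdr)
            return -1;

        /* ... then the payload follows in fragments of at most 32 kB. */
        while (len > 0) {
            int chunk = len < SCI_FRAG_SIZE ? len : SCI_FRAG_SIZE;
            if (write(fd, buf, (size_t)chunk) != chunk)
                return -1;
            buf += chunk;
            len -= chunk;
        }
        return 0;
    }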

3.3 Implementation Issues

While the implementation scheme is, for the most part, straightforward, several significant issues had to be considered, most pertaining to the current version of the SBus-SCI device driver.

The first, and most critical, is the absence of a select or poll system call for the SCI device. In a PVM implementation catering to parallel programs where each process communicates with multiple partners, polling must be artificially implemented to avoid indefinite delays and deadlocks. In the first version, we have circumvented this problem by marking the SCI file descriptors as non-blocking and attempting to read/write, with the result values used to determine readiness.

A second issue is that, since send/receive are asynchronous, writes could potentially block, perhaps unnecessarily holding up the sender. This is currently being overcome by using an exponential backoff scheme, delaying the writer by successively increasing amounts. An alternative would be to deliver an urgent asynchronous signal to the receiver via normal PVM mechanisms to cause it to read the message.

A third (closely related) factor concerns the finite and limited buffer space available in a connection, including the sender's and receiver's buffers. When write buffers become full, writes may block even when the receiver is reading from the other end; delays between successive attempts must be carefully tuned for optimality. Currently, only limited effort has been spent on this optimization.

Yet another issue concerns simultaneous attempts by the sender and receiver to establish connections; a simple scheme based on ordering of task ids is used to give precedence to one of the entities, and to avoid deadlocks.

Finally, a number of issues pertaining to performance and the lack of experience with the hardware arose. These are discussed below.
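The two user-level workarounds, simulated polling via non-blocking descriptors and exponential backoff for blocked writers, can be sketched with standard POSIX calls as follows; the delay constants are illustrative, not the tuned values.

    /* Simulated readiness probing on a non-blocking SCI descriptor, plus
     * exponential back-off for a writer that hits full connection buffers.
     * Delay constants are illustrative only.                              */
    #include <errno.h>
    #include <fcntl.h>
    #include <unistd.h>

    /* Mark the raw channel descriptor non-blocking once, after connection setup. */
    static int sci_set_nonblocking(int fd)
    {
        int flags = fcntl(fd, F_GETFL, 0);
        return flags < 0 ? -1 : fcntl(fd, F_SETFL, flags | O_NONBLOCK);
    }

    /* Write with exponential back-off: retry with successively longer delays
     * while the connection buffers are full (write fails with EAGAIN).      */
    static int sci_write_backoff(int fd, const char *buf, int len)
    {
        useconds_t delay = 100;                 /* start at 100 microseconds */
        while (len > 0) {
            ssize_t n = write(fd, buf, (size_t)len);
            if (n > 0) {
                buf += n;
                len -= (int)n;
                delay = 100;                    /* progress: reset the delay */
            } else if (n < 0 && errno == EAGAIN) {
                usleep(delay);                  /* buffers full: back off    */
                if (delay < 100000)
                    delay *= 2;                 /* cap at 100 ms             */
            } else {
                return -1;                      /* real error                */
            }
        }
        return 0;
    }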

4 Preliminary Performance Results

The primary interest in the performance studies was to compare the PVM direct data transfers over SCI (first-cut implementations of pvm_scisend/pvm_scirecv) and Ethernet (pvm_psend/pvm_precv). However, in order to provide some insights into the performance potential and some artifacts of the SBus-SCI adapter as well, results for SCI raw channel communication are also given.

4.1 Throughput

For initial throughput tests, a simple ping program was used. Figure 1 shows the throughput achieved with several variants of send/receive functions, message lengths, and interconnect media. Except for maximum SCI throughput, each number was determined from 10 experiments with 1000 iterations each of sending/receiving a message.

Maximum SCI throughput (write/read, max.) achievable on the cluster was determined using the native write/read calls with "hardcoded" sender/receiver pairs and message lengths. In addition, the number of send/receive iterations was chosen so that the amount of data to be transferred did not exceed the kernel buffer size (around 128 kB); for instance, a 32 kB message was sent only three times. Hence, the send operations were never blocked due to buffer size limitations. The maximum throughput measured for this specific situation is 9.7 MB/s, in fact close to the 10 MB/s stated by the vendor. This result is also in line with [8].

Sustained SCI throughput (write/read, sust.) was determined using the same mechanisms, but with 1000 send/receive iterations for each message size. Thus, a write operation blocks when it encounters a full buffer. As can be seen from the figure, this stop-and-wait behaviour diminishes performance significantly.
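The throughput numbers were gathered with ping-style loops of roughly the following shape; the timing wrapper is purely illustrative, and the send_one callback is a stand-in for whichever primitive is under test (write, pvm_psend or pvm_scisend).

    /* Illustrative throughput measurement: send `iters` messages of `len`
     * bytes with the primitive under test and divide bytes by elapsed time.
     * Error handling is omitted.                                           */
    #include <sys/time.h>

    typedef int (*send_one_fn)(const char *buf, int len);

    double measure_throughput_mbs(send_one_fn send_one, const char *buf,
                                  int len, int iters)
    {
        struct timeval t0, t1;

        gettimeofday(&t0, NULL);
        for (int i = 0; i < iters; i++)
            send_one(buf, len);
        gettimeofday(&t1, NULL);

        double secs = (double)(t1.tv_sec - t0.tv_sec)
                    + (double)(t1.tv_usec - t0.tv_usec) * 1e-6;
        return ((double)len * (double)iters) / secs / 1e6;   /* MB/s */
    }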

[Figure 1: Throughput results. Log-log plot of throughput (MB/s, 0.001 to 10) versus message size (bytes, 1 to 65536), for write/read (max.), write/read (sust.), pvm_scisend/pvm_scirecv (DMA), pvm_psend/pvm_precv, and pvm_scisend/pvm_scirecv (prog. I/O).]

Sustained throughput is about half the maximum throughput, with a peak of 5.6 MB/s for messages of length 32 kB. This corresponds to recent results reported in [9].

While these figures are promising, they cannot be realized for PVM data transfers, as an operational PVM implementation necessarily has to perform multiplexing, avoid blocking, and dynamically negotiate connections with multiple partners. Incorporating these facilities into the PVM-SCI library (pvm_scisend/pvm_scirecv, DMA) diminishes performance as compared to native write/read SCI I/O by about a factor of 2: sustained throughput is about 2.4 MB/s for large messages, with a fragment size of 32 kB and DMA transfers enabled. Besides the code that realizes the improved functionality of the PVM-SCI routines, the primary causes for this degradation are: (1) the absence of select/poll mechanisms for SCI connections, i.e. the need to simulate polling using asynchronous reading and writing; (2) separate header and data transmission; (3) not fully tuned retransmission delays; and (4) sub-optimal coding. For comparison, the native PVM data transfers (pvm_psend/pvm_precv) with direct routing, using TCP/IP over Ethernet, achieve a throughput of about 1 MB/s.

The fifth curve (pvm_scisend/pvm_scirecv, prog. I/O) reveals an (undocumented) artifact of the SCI driver for pvm_scisend/pvm_scirecv transfers with programmed I/O, which achieves a throughput of at most 160 kB/s. These results were obtained using short headers of length smaller than 64 bytes. In this case, the driver uses programmed I/O for header transmission, and is stuck in this mode for all future transfers over this SCI connection. Thus, even for large, properly aligned messages, DMA transfers, which would realize high-speed transfers, are not utilized. The pvm_scisend/pvm_scirecv functions that deliver high throughput as described above have therefore been forced to utilize DMA transfers by imposing minimum header and message sizes of 64 bytes. Clearly, this is overhead and a waste of bandwidth. Hence, without this driver artifact, and with the availability of select or poll for SCI connections and further code optimization and fine tuning, we would expect sustained throughput for pvm_scisend/pvm_scirecv to be about three times higher than for pvm_psend/pvm_precv.
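The workaround just mentioned, keeping every transfer at or above the 64-byte DMA threshold and on aligned buffers, can be sketched as follows; the helper names and the use of posix_memalign are illustrative assumptions, not the library's actual code.

    /* Hypothetical helpers for keeping an SCI connection in DMA mode:
     * pad short transfers up to the 64-byte threshold (the padding is
     * pure overhead, as noted in the text) and allocate aligned buffers. */
    #include <stdlib.h>

    #define SCI_DMA_MIN 64   /* below this size, or if unaligned, the driver uses PIO */

    static int sci_padded_len(int len)
    {
        return len < SCI_DMA_MIN ? SCI_DMA_MIN : len;
    }

    static char *sci_alloc_aligned(size_t size)
    {
        void *p = NULL;
        /* 64-byte alignment so transfers qualify for the DMA engine. */
        return posix_memalign(&p, SCI_DMA_MIN, size) == 0 ? (char *)p : NULL;
    }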

4.2 Latency

Latency figures were determined using ping-pong tests for small message sizes. Table 1 presents the one-way application-to-application latencies for one-byte messages and some variants of data transfers. For comparison, the latencies of the send calls from the ping tests are also given. These represent the fixed software overheads for setting up a message for sending.

                               write/read   pvm_scisend/    pvm_scisend/      pvm_psend/
                                            pvm_scirecv     pvm_scirecv       pvm_precv
                                            (prog. I/O)     (DMA transfers)
  Send call lat. (ping)           0.085        0.433            1.087            1.604
  App-to-app lat. (pingpong)      1.618        3.376            3.569            2.299

Table 1: One-way latencies for a one-byte message (in milliseconds)

The results show high latencies for message-based communication over SCI, even in the case of the native write/read interface. Latencies for pvm_scisend/pvm_scirecv are even significantly higher than those for pvm_psend/pvm_precv. This is in part due to the poor functionality of the driver (absence of poll) and the unoptimized implementation of the calls, as explained above. Quantitatively, as indicated by the send call latencies, the PVM-SCI sends are at least about 5 times slower than their native SCI counterparts. We believe that, with improved drivers and optimized software, the latency of pvm_scisend/pvm_scirecv could be improved to that of pvm_psend/pvm_precv, but not better. The major cause is the sub-optimal implementation and performance of the driver. Apparently, message passing over SCI is not an appropriate style of communication when low latency is required. In this case, shared-memory based IPC must be used, as discussed below.

4.3 Discussion

The performance studies revealed some surprises and (undocumented) subtleties of using the Dolphin SBus-SCI adapter card and driver software for distributed parallel computing on workstation clusters. These difficulties are mainly due to limitations of the first-generation SCI interface hardware and software.

Using the raw channel interface (message passing), good throughput can be obtained provided that messages are large and properly aligned. However, the driver artifact of entering programmed I/O mode once a short or unaligned message is encountered, and of using this low-speed mode for all future communication over a given SCI connection, must be taken into account as well. Furthermore, due to the sub-optimal driver implementation, latencies for SCI messaging are high, in the range of TCP-based communication. This behaviour is clearly not what SCI promises, and not what the vendor emphasizes as a strength of the product.

For good communications performance, also in terms of latency, shared-memory IPC must be utilized as well. Shared-memory one-way latency for this product is reported in [9] to be about 15 µs, with SS-5 nodes. Thus, even with our slower SS-2 machines, latencies in the range of a few tens of microseconds are realistic.

These observations suggest the following "classical" strategy for implementing higher-level communication libraries such as PVM for our SCI cluster: use shared-memory IPC for small amounts of data to be transferred, and use message passing for large data blocks, with special care to enable DMA transfers to take place. This combination of shared memory and messaging makes implementing PVM over the SBus-SCI interface more difficult, but promises to deliver good performance both in terms of throughput and latency. As yet, it has not been incorporated into the PVM-SCI library.
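This "classical" strategy amounts to a size-based dispatch of roughly the following form. The 1 kB threshold, the slot layout in the mapped segment, and the simple full-flag handshake are illustrative assumptions; as stated above, this combination has not yet been implemented in the PVM-SCI library.

    /* Illustrative size-based dispatch: shared-memory IPC for small messages
     * (low latency), raw-channel message passing for large ones (high
     * throughput).  Threshold, slot layout and flag handshake are assumed;
     * memory-ordering issues are ignored in this sketch.                    */
    #include <string.h>

    #define SCI_SHM_THRESHOLD 1024          /* assumed cross-over point (bytes) */

    struct sci_shm_slot {                   /* hypothetical slot in a mapped segment      */
        volatile int full;                  /* set by the sender, cleared by the receiver */
        int length;
        char data[SCI_SHM_THRESHOLD];
    };

    /* Raw-channel send path, e.g. the framing sketch shown earlier. */
    int sci_send_msg(int fd, int src, int dst, int tag, const char *buf, int len);

    int sci_send_hybrid(struct sci_shm_slot *slot, int fd,
                        int src, int dst, int tag, const char *buf, int len)
    {
        if (len <= SCI_SHM_THRESHOLD) {
            /* Small message: plain stores into the mapped remote segment,
             * no system call and no intermediate kernel buffer.           */
            while (slot->full)
                ;                           /* wait for the previous message to drain */
            memcpy((char *)slot->data, buf, (size_t)len);
            slot->length = len;
            slot->full = 1;
            return 0;
        }
        /* Large message: raw channel, DMA transfers. */
        return sci_send_msg(fd, src, dst, tag, buf, len);
    }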

5 Conclusion and Further Work

We have described early experiences and results from our project aimed at high-performance distributed computing in workstation clusters using Dolphin's SCI interconnect technology. While this report is too preliminary to be used as a basis for drawing conclusions about the promise of the SCI technology and the LAMP (Local Area MultiProcessor) paradigm [3], it does provide indications of the possible potential of such environments. In a nutshell, the technology appears viable and is capable of rivaling other contenders such as FDDI, ATM, Fibre Channel, and Memory Channel. However, some problems and subtleties of the first-generation hardware and software still limit its ease of use and performance.

Our experiences also highlight the critical importance of intelligent communication software to avoid high-overhead pitfalls, including the choice of data transfer method (message passing or shared memory) and I/O mode (programmed I/O or DMA transfers), multiplexing, and connection management. Poor design choices in these areas can very easily degrade performance to worse than Ethernet levels. We are currently addressing these issues in our work. Some routes for exploration are:

- Optimized use of select/poll calls for incoming messages.
- Exponential backoff with signaling, and fine-tuned retransmission delays when write buffers are full.
- Use of SCI shared memory for small data transfers and low latencies, and use of the raw channel interface for large messages to achieve high throughput.
- Use of asynchronous signals or semaphores in SCI-based shared memory to initiate new connections.
- Reduced copying of buffers, and other implementation optimizations.

At the time of writing, Dolphin has indicated that an SBus-2 to SCI adapter card will soon become available and will significantly increase the bandwidth available to applications (30 MB/s throughput). Improvements in device driver functionality (poll) and performance will also be available. These improvements, in conjunction with efficiently implemented applications and support software (such as PVM), should enable SCI clusters to perform at levels where CPU speeds and communication speeds are well matched.

References

[1] V. S. Sunderam, G. A. Geist, J. J. Dongarra, and R. Manchek, "The PVM Concurrent Computing System: Evolution, Experiences, and Trends", Journal of Parallel Computing, 20(4), pp. 531-546, March 1994.

[2] G. A. Geist, A. Beguelin, J. J. Dongarra, W. Jiang, R. Manchek, and V. S. Sunderam, PVM: Parallel Virtual Machine: A Users' Guide and Tutorial for Networked Parallel Computing, M.I.T. Press, November 1994.

[3] D. B. Gustavson and Q. Li, "Local-Area MultiProcessor: the Scalable Coherent Interface", in: Defining the Global Information Infrastructure: Infrastructure, Systems, and Services, S. F. Lundstrom (ed.), SPIE Press, vol. 56, pp. 131-160, 1994.

[4] D. B. Gustavson, "The Scalable Coherent Interface and Related Standard Projects", IEEE Micro, 12(2), pp. 10-22, Feb. 1992.

[5] ANSI/IEEE Std 1596-1992: IEEE Standard for Scalable Coherent Interface (SCI), IEEE Computer Society Press, 1993.

[6] Dolphin Interconnect Solutions, 1 Gb/s SBus-SCI Cluster Adapter Card, White Paper, 1994. http://www.dolphinics.no.

[7] Dolphin Interconnect Solutions, Dolphin Breaks Cluster Latency Barrier with SCI Adapter Card, Press Release, 1995. http://www.dolphinics.no.

[8] H. O. Bugge and P. O. Husoy, "Dedicated Clustering: A Case Study", Proc. 4th Int'l. Workshop on SCI-based High-Performance Low-Cost Computing, Oct. 3-5, 1995, Heraklion, Crete, Greece. Available from: SCIzzL, The Association of SCI Local-Area MultiProcessor Users, Developers, and Manufacturers. Contact: [email protected].

[9] A. D. George, R. W. Todd, and W. Rosen, "A Cluster Testbed for SCI-based Parallel Processing", Proc. 4th Int'l. Workshop on SCI-based High-Performance Low-Cost Computing, Oct. 3-5, 1995, Heraklion, Crete, Greece. Available from: SCIzzL, The Association of SCI Local-Area MultiProcessor Users, Developers, and Manufacturers. Contact: [email protected].