Parallel Processing Experiments on an SCI-based Workstation Cluster

Alan George, Robert Todd, William Phipps, Michael Miars, and Warren Rosen*
High-performance Computing and Simulation (HCS) Research Laboratory
Electrical Engineering Department, FAMU-FSU College of Engineering
Florida State University and Florida A&M University

* Dr. Rosen is with the Naval Air Warfare Center, Aircraft Division, Warminster, PA.

Abstract
This paper describes the SCI-based workstation cluster system being developed at the HCS Research Lab, the parallel processing and network experiments that have been conducted, and the results achieved. Using several different input sizes and degrees of partitioning and granularity for the parallel processing algorithms employed (i.e. matrix multiply and data sorting), and experimenting with both ring- and switch-based topologies and other permutations (e.g. number of workstations), significant reductions in execution time have been achieved. These results help to illustrate what types of parallelism can be achieved today on SCI-based workstation clusters.
1. Introduction
Parallel processing has emerged as a key enabling technology for a wide variety of current and future applications spanning both the commercial and military sectors; however, these applications continue to rely heavily on the speed of the interconnection system used for processor communication. For years, advancements in processor technology have consistently outpaced those in interconnect technology in terms of performance, cost, and availability. However, new interconnect technologies such as SCI promise to close the gap and support the development of new, more powerful, and more effective parallel processing systems. One of the most practical platforms for parallel and distributed computing is the workstation cluster consisting of commercial, off-the-shelf UNIX computers connected by conventional Ethernet with TCP/IP and working together in the form of a multicomputer. While such systems can achieve near-linear speedup and a high degree of parallel efficiency for certain applications (i.e. coarse-grain applications with many large, relatively independent sections of computation and relatively light communication requirements), they are severely limited for a wide range of other parallel processing applications. The Scalable Coherent Interface [SCI93] is one of the most promising interconnects for the design of future parallel processing systems in general and workstation clusters in particular [GUST95]. One such platform is
the system being developed at the HCS Research Lab, which consists of SPARCstation computers connected by first-generation SCI/Sbus adapters in various permutations based upon ring and switch topologies [GEOR95]. In this paper we present the latest results achieved in adapting SCI/Sbus workstation clusters to the requirements of several parallel processing algorithms. By taking advantage of both shared-memory and message-passing modes of operation, execution times are significantly improved with the 1-Gbps SCI interconnect despite the Sbus bottleneck. In addition to results from and comparisons of a number of parallel processing experiments, network latency and effective throughput measurements are also provided. In the next section, an overview of the testbed facility is provided, followed by section three, which describes the software tools and techniques developed and employed for these experiments. In section four, the results of basic latency and effective throughput measurements are provided for both ring- and switch-based topologies on the testbed. These results provide an indication of the range of message sizes and cluster sizes where the SCI/Sbus-1 interconnect is most effective. Next, in section five, a wide range of experiments, measurements, and comparisons are provided for two major parallel processing algorithms, matrix multiplication and sorting, in terms of different cluster topologies, number of nodes, algorithm granularity, data set size, etc. The results of these experiments help to clarify the many strengths and weaknesses associated with this platform for parallel and distributed computing. Finally, in section six, a brief set of conclusions is presented and items of future research, many of which have already begun in the HCS Lab, are enumerated.
2. Testbed Configuration
A cluster testbed is being developed at the HCS Lab for the purpose of experimentally studying the potential advantages and limitations of SCI-based parallel and distributed processing. The machines in the cluster currently include SPARCstation-20/85 (SS20/85), SPARCstation-20/50 (SS20/50), and SPARCstation-5/85 (SS5/85) computers connected by Dolphin SCI/Sbus-1 adapters operating at a data rate of 1 Gbps per link. All systems operate under Solaris 2.4 and communicate through TCP/IP over a 10-Mbps thinwire Ethernet LAN for routine UNIX traffic (e.g. NFS). As shown in Figure 1, the baseline topology for the cluster is a simple ring.
Figure 1. Cluster in Ring Configuration (SS5/85, SS20/50, and SS20/85 nodes, 32-128 MB each, connected by the 1-Gbps/link Scalable Coherent Interface)
The key hardware element of this cluster architecture is the Dolphin SCI/Sbus-1 adapter. The functional components of the SCI/Sbus-1 adapter card are shown in Figure 2 [DOLP95][ALNE93]. These adapters currently support both message-passing and limited, non-coherent, shared-memory interprocessor communication. Both techniques are available as API (application program interface) function calls from any high-level-language or assembly-language program and have been enhanced via the HCS_LIB function library (discussed in section three).
Figure 2. SCI/Sbus-1 Adapter (courtesy of Dolphin Interconnect Inc.): Sbus FCode, Sbus controller with DMA, dual-port RAM and latches, address translation table, mailbox, packet buffer, SCI controller with MAP, and the SCI link chip with 1-Gbps transmit (TX) and receive (RX) links
Virtual shared memory on a physically distributed memory system is handled by the Address Translation Table, where 32-bit Sbus addresses are converted to 64-bit SCI addresses. The most significant 16 bits of the SCI address are used to select between 65536 distinct devices. To map this SCI “address” space into a user address space, the mmap() function is called, which makes a copy of the buffer allocated in the kernel space and places it in the user application’s virtual memory segment.
This allocated buffer is created via an ioctl() command and is mapped via the mmap() command. The DMA engine that underlies this shared-memory system provides the ability to transparently move data from local memory on one station to local memory on another. Shared-memory transfers are accomplished, at the most fundamental level, by simple assignment operations: a pointer to the shared-memory segment is assigned a value, which is then sent, via the DMA engine, to all other local copies of the shared-memory buffer on the ring. By doing this, another node on the ring has the ability to “see” that value. The message-passing API consists of normal I/O command structures such as read(), write(), and ioctl(). These commands operate on simple file descriptors which act as sockets to the Sbus itself and thus are created via open(). These transactions proceed into message FIFOs, where the link controller chip decides how to break apart the message and push it across the link. For optimal transfers, however, the write buffer should be aligned to a 64-byte boundary before passing it to the Sbus. This allows the SCI card to take direct advantage of fast 64-byte transfers without the need for breaking apart a message. Sbus commands are mapped to SCI commands and back again using the simple table shown in Table 1.
Table 1. Sbus to SCI Mapping [DOLP95]
Sbus cmd | Sbus size | SCI Request | SCI Response
read() | 20 - 23 bytes | read_sb() | response_16
write() | 20 - 23 bytes | write_sb() | response_00
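The usage pattern just described can be summarized in a short sketch. Only the general sequence follows the description above: the device path "/dev/sci0", the omitted segment-creation ioctl(), and the mmap() offset are hypothetical placeholders rather than the actual Dolphin driver interface, and posix_memalign() stands in for whatever aligned allocator (e.g. Solaris memalign()) the real programs use.

/* Illustrative sketch only: the device node, ioctl details, and mmap offset
 * below are assumptions, not the actual Dolphin SCI/Sbus-1 interface.       */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define SEG_SIZE  4096          /* size of the shared-memory segment        */
#define MSG_SIZE  8192          /* message-passing payload                  */

int main(void)
{
    int fd = open("/dev/sci0", O_RDWR);          /* hypothetical device node */
    if (fd < 0) { perror("open"); return 1; }

    /* Shared memory: the segment is created by a driver-specific ioctl()
     * (omitted here) and then mapped into the user's address space.         */
    volatile char *shm = mmap(NULL, SEG_SIZE, PROT_READ | PROT_WRITE,
                              MAP_SHARED, fd, 0 /* hypothetical offset */);
    if (shm != MAP_FAILED)
        shm[0] = 1;             /* a simple store; the DMA engine propagates
                                   it to the remote copies of the segment    */

    /* Message passing: align the buffer to 64 bytes so the adapter can use
     * full 64-byte SCI transfers without splitting the message.             */
    void *buf = NULL;
    if (posix_memalign(&buf, 64, MSG_SIZE) == 0) {
        memset(buf, 0xAB, MSG_SIZE);
        if (write(fd, buf, MSG_SIZE) < 0)        /* blocking send on the port */
            perror("write");
        free(buf);
    }

    close(fd);
    return 0;
}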
In addition to several instances of the SCI-based ring topology shown in Figure 1 (i.e. 2-node, 4-node, and 8-node rings), experiments were also conducted with a switched-ring topology, shown in Figure 3. The SCI-based switch topology consists of up to four rings of workstations connected by a Dolphin 4-way switch.
Figure 3. Cluster in Switched-Ring Configuration (SS5/85, SS20/50, and SS20/85 nodes on 1-Gbps SCI ringlets joined by a Dolphin 4-way switch, 3.2 Gbps)
A functional diagram of the Dolphin SW4 4-way SCI Cluster Switch attached to 4 external ringlets is
shown in Figure 4. Each input port of the switch contains a Dolphin Link Controller (LC-1) chip. This chip performs the standard SCI interface functions and, in addition, communicates with the LC-1 chips at the other switch ports via a back-end bus known as the B-Link. The B-Link is a 64-bit-wide, unterminated, party-line bus which operates at a 50-MHz clock rate, for an aggregate bandwidth of 3.2 Gbps. Initialization is accomplished by loading a 61-bit initialization sequence on power-up. This sequence contains such information as node Id, node interval mask (described below), and node configuration. Routing is performed via an interval addressing scheme. All nodes connected on ringlet 0 must have a node Id that lies within the range 4 to 60. Nodes on ringlet 1 must have an address in the range 68 to 124, and so on up to node Id 252 for ringlet 3. In addition, node Ids must be multiples of 4. For each incoming packet, the Link Controller at the switch port logically ANDs the target Id with its own node Id and compares the result with a 16-bit mask loaded at initialization. If the result of the compare lies within the attached ringlet's interval, the packet is passed to the Link Controller's output queue and continues around the ringlet. If the target Id lies outside the ringlet's interval, the packet is stripped from the ringlet and passed to the B-Link. Packets on the B-Link are examined by each Link Controller, and if the target Id lies within a Link Controller's interval, the packet is copied to that Link Controller's output queue for subsequent transmission on its attached ringlet.
Figure 4. Dolphin 4-way Switch
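As a concrete illustration of the interval addressing just described, the following small C function models the routing decision for a packet arriving at a switch port. It is a simplified model, not the LC-1's actual mask-and-compare logic: it uses only the published Id ranges (ringlet r owning Ids 64r + 4 through 64r + 60).

/* Simplified model of the interval-routing decision described above; the
 * real LC-1 compares against a mask loaded at initialization, but the net
 * effect for ringlet r is an Id range of 64*r + 4 .. 64*r + 60.            */
#include <stdio.h>

/* Returns 1 if the packet stays on this port's ringlet, 0 if it must be
 * stripped and forwarded over the B-Link to another port.                  */
static int stays_on_ringlet(unsigned target_id, unsigned ringlet)
{
    unsigned low  = 64 * ringlet + 4;
    unsigned high = 64 * ringlet + 60;
    return target_id >= low && target_id <= high;
}

int main(void)
{
    /* Example: target Id 72 belongs to ringlet 1, so a packet entering at
     * port 0 is passed to the B-Link and copied out at port 1.             */
    printf("target 72 on ringlet 0: %s\n",
           stays_on_ringlet(72, 0) ? "keep" : "to B-Link");
    printf("target 72 on ringlet 1: %s\n",
           stays_on_ringlet(72, 1) ? "keep" : "to B-Link");
    return 0;
}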
3. Software Tools and Techniques
The lowest-level API available for use with the Dolphin SCI/Sbus-1 cards is a set of library routines called SCI_LIB. This set of functions provides a level of abstraction to the programmer and allows the relatively easy creation of shared-memory services and message-passing support without the programmer being overly concerned with the specific idiosyncrasies of the current revision. The SCI_LIB functions are shown in Table 2. The library consists of initialization, status, and closure functions; message-passing functions for creating, connecting, and releasing ports; and functions for shared-memory creation and destruction.

Table 2. SCI_LIB Summary
Initialization: sci_init_lib(), sci_close_lib(), sci_lib_info()
Message-passing: sci_create_receive_port(), sci_receive_port_connected(), sci_connect_transmit_port(), sci_read_msg(), sci_write_msg(), sci_remove_receive_port(), sci_remove_transmit_port()
Shared-memory: sci_create_shm(), sci_map_shm(), sci_unmap_shm(), sci_remove_shm()
A higher-level set of functions for accessing the SCI/Sbus-1 adapters in either shared-memory or message-passing fashion has been developed at the HCS Research Lab. Called HCS_LIB, it adapts the SCI_LIB functions to provide an even higher layer of abstraction for added convenience, flexibility, and dependability. Table 3 summarizes these functions.

Table 3. HCS_LIB Summary
Shared-memory: safe_shared_memory_make(), safe_shared_memory_map()
Message-passing: safe_create_receive_port(), safe_connect_transmit_port(), safe_wait_for_connection(), send_block(), get_block(), send_broadcast(), get_broadcast()
Signaling: service_signals(), default_signal_handler(), ignore_signal_handler(), send_signal(), send_signal_and_wait(), set_signal_handler(), barrier_sync()
Initialization and Termination: cluster_init(), safe_exit()
The majority of shared-memory and message-passing functions in HCS_LIB are simply extensions of SCI_LIB to provide more versatility and fault tolerance in the programming environment. These functions allow
a programmer to reference an SCI node as an index into a node array, taken from the command line, rather than by specific node numbers, while also providing time-out support on failed mapping and creation attempts. The HCS_LIB message-passing functions also include broadcast support, which follows a spanning-tree structure with the node initiating the broadcast transmitting to all nodes below it. Figure 5 shows a broadcast tree for an eight-node configuration.
Figure 5. Broadcast Tree Configuration
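One spanning-tree layout consistent with Figure 5 is the classic binomial tree, in which the root reaches all P nodes in ceil(log2 P) rounds. The sketch below is an illustration of that layout rather than HCS_LIB source; it simply prints each node's children for an eight-node broadcast rooted at node 0.

/* Binomial broadcast tree: in round k, every node i < 2^k forwards the
 * message to node i + 2^k.  For P = 8 this reproduces the tree of Figure 5:
 * 0 -> {1,2,4}, 1 -> {3,5}, 2 -> {6}, 3 -> {7}.                            */
#include <stdio.h>

#define P 8   /* number of nodes in the cluster */

int main(void)
{
    for (int node = 0; node < P; node++) {
        printf("node %d sends to:", node);
        /* Find the first round in which this node already holds the data.  */
        int round = 1;
        while (round <= node)
            round <<= 1;
        /* From then on it forwards to node + round in each later round.    */
        for (; node + round < P; round <<= 1)
            printf(" %d", node + round);
        printf("\n");
    }
    return 0;
}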
One of the most important features of HCS_LIB is the signaling support, which allows a programmer to have interactive communication and semaphores between master and worker nodes with latencies in the single-digit microsecond range. These functions use small shared-memory maps to transfer integer signals from worker to master while allowing customization of signals via the set_signal_handler() command. Via special signals, the barrier_sync() function allows synchronization of all nodes in a particular parallel program such that a pseudo-synchronous execution may follow. This construct is particularly important in heterogeneous SCI clusters where correct timing is essential. P and V semaphore constructs are implemented via the send_signal_and_wait() function (used by the worker) and the service_signals() function (used by the master). This allows the master to arbitrate such high-level concepts as mutual exclusion. Finally, the initialization and termination commands in HCS_LIB allow the construction of a common environment for SCI-based parallel programs. The cluster_init() function creates all of the shared-memory maps for signaling while also fully connecting the network via message-passing ports. Once this is done, a barrier synchronization is performed to ensure all initialization has completed before control is returned to the user's program. The safe_exit() function simply catches all signals from child processes for debugging purposes while also safely terminating all shared-memory areas, message-passing connections and ports, and freeing all structure memory used for internal SCI accounting.

As a means of comparison between cluster communications via HCS_LIB over SCI/Sbus-1 adapters and TCP/IP over Ethernet, a number of high-level parallel programming and coordination languages are being employed. Among these is MPI, which is used in this study for performance comparison purposes. The Message Passing Interface (MPI) specification was created by the Message Passing Interface Forum [MPIF93] with the goal of providing a portable parallel API that allows for efficient communication, heterogeneous implementations, convenient C and FORTRAN-77 language bindings, and consistency with current message-passing paradigm practices (such as PVM, NX, p4, etc.). Portability is achieved with the creation of machine-specific implementations of MPI which consist of "black box" libraries of MPI functions that adhere to the syntax and semantics of the MPI specification. The manner in which these functions perform the necessary operations is not specified by MPI and is transparent to the application programmer. The general categories of functions are: point-to-point communication (blocking and non-blocking), collective communication, group management, process topology management, environmental management, and profiling interfacing. MPICH, available from Argonne National Laboratory and Mississippi State University, is a sockets-based TCP/IP implementation of the MPI specification which can be ported to any number of platforms.
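For readers unfamiliar with MPI, the fragment below shows the flavor of the blocking point-to-point and collective categories listed above. It is a generic example, not one of the test programs used in this study.

/* Minimal MPI example (not from the paper's test suite): blocking
 * point-to-point and collective calls of the kind compared against
 * HCS_LIB in the following sections.  Compile with mpicc and run with
 * at least two processes (e.g. mpirun -np 2 ./a.out).                 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, value = 0;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* blocking send */
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        printf("rank 1 received %d\n", value);
    }

    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);   /* collective        */
    MPI_Barrier(MPI_COMM_WORLD);                         /* synchronization  */
    printf("rank %d of %d done, value = %d\n", rank, size, value);

    MPI_Finalize();
    return 0;
}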
4. Throughput and Latency Tests
In order to gauge the potential performance benefits of the SCI/Sbus-1 interconnect in a workstation cluster environment, a number of basic benchmarking programs have been developed and their results measured and collected. In [GEOR95] these measurements were presented for a ring of SS5/85 workstations running SunOS 4.1.3. In this section we present the latest measurements for a ring of workstations running Solaris 2.4, which includes SS20/85, SS20/50, and SS5/85 systems, as well as data from switched rings of these workstations. The results of these basic experiments are provided in the form of both latency and effective throughput measurements. The shared-memory benchmark examines the latency between two machines for an n-byte payload transfer. Initialization of this test is done by creating a shared-memory area on each machine with a size of n bytes. These areas are then mapped by both the remote and local nodes to create a fully-accessible, distributed shared-memory system consisting of two nodes. The actual latency is found by measuring the time it takes for one node to write a value n bytes long to the remote buffer and then read the same value from its local buffer. The corresponding action on the remote node is the exact reverse, where a value is sought in the local node and then, upon receipt, is written immediately to the remote buffer. This two-way latency time t is averaged over m iterations and the one-way latency is found by
latency = t / (2m).

The message-passing benchmark examines the maximum sustained throughput between two nodes. This test is started by creating and connecting a transmit port on one node and a receive port on the second node. The blocking I/O functions are enabled using the appropriate ioctl() signal and the data is aligned to the nearest 64-byte boundary for easy transfer to the Sbus subsystem. Using read() and write() functions, data is pushed through this “pipe” as fast as possible in n-byte payloads. The total number of bytes transferred (i.e. n bytes * m iterations) divided by the total transfer time is the sustained effective throughput. All benchmark tests were compiled using the GNU C compiler with level -O4 optimization on the Solaris 2.4 operating system.
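The same two-way/one-way methodology can be expressed as a short ping-pong program. The sketch below uses MPI purely as a point of reference; the latency figures reported in Figures 6 and 7 come from the SCI shared-memory benchmark described above, not from this code, and the payload size and iteration count are arbitrary.

/* MPI ping-pong sketch of the latency methodology described above: time m
 * round trips of an n-byte payload and report t / (2m) as the one-way
 * latency.  (The throughput benchmark instead streams n-byte payloads one
 * way and divides the total bytes moved by the total transfer time.)       */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

#define N 128        /* payload size in bytes (arbitrary) */
#define M 10000      /* number of round-trip iterations   */

int main(int argc, char **argv)
{
    int rank;
    char buf[N];
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    memset(buf, 0, sizeof(buf));

    MPI_Barrier(MPI_COMM_WORLD);               /* start both sides together */
    double t0 = MPI_Wtime();
    for (int i = 0; i < M; i++) {
        if (rank == 0) {                       /* node 0: send, then wait   */
            MPI_Send(buf, N, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, N, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &status);
        } else if (rank == 1) {                /* node 1: wait, then echo   */
            MPI_Recv(buf, N, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status);
            MPI_Send(buf, N, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t = MPI_Wtime() - t0;               /* total two-way time t      */

    if (rank == 0)
        printf("one-way latency: %g usec (n = %d bytes, m = %d)\n",
               t / (2.0 * M) * 1e6, N, M);

    MPI_Finalize();
    return 0;
}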
Figure 6. One-way Latency Measurements between Two Identical Nodes (via shared memory): latency in microseconds versus payload size from 1 to 128 bytes, for SS5/85, SS20/50, and SS20/85 pairs in both ring and switch configurations
Figure 7. One-way Latency Measurements for Varying Ring Sizes (via shared memory): latency in microseconds versus payload size from 1 to 128 bytes, for 2-, 4-, and 8-node rings of SS5/85 and SS20/85 workstations
Figure 6 summarizes the one-way latency measurements obtained between two identical nodes (i.e. SS5/85, SS20/50, or SS20/85), either across the ring or through the switch. As expected, the switch introduces a small degree of overhead in all cases. For small payloads a latency of four microseconds can be achieved, whereas for larger payloads the latency becomes exceedingly large and it becomes advisable to change to message-passing mode. In order to judge the extent to which larger ring sizes affect two-node latencies, a series of tests was also conducted with different numbers and types of nodes on the ring. As illustrated in Figure 7, the additional delays associated with data transmission and propagation have a noticeable impact (e.g. two communicating SS20/85s with a payload size of four bytes experience a latency increase of almost two microseconds when six other nodes are on the same ring).
Figure 8. Effective Throughput Measurements between Two Identical Nodes (via message passing): effective throughput in MBytes/sec versus payload size from 64 bytes to 128 KB, for SS5/85, SS20/50, and SS20/85 pairs in both ring and switch configurations
Figure 9. Effective Throughput Measurements for Varying Ring Sizes (via message passing): effective throughput in MBytes/sec versus payload size from 64 bytes to 128 KB, for 2-, 4-, and 8-node rings of SS5/85 and SS20/85 workstations
In a similar fashion, Figures 8 and 9 show the effective throughput for two nodes communicating over different configurations. Figure 8 provides a side-by-side comparison of ring versus switch throughput and indicates the optimal transfer size to be 64 KB, where approximately 105 Mbps of sustained effective throughput is attained. Figure 9 shows the effect of ring size on effective throughput.
5. Parallel Processing Experiments
The parallel programs developed for this study were designed and implemented in C using HCS_LIB function calls over 1-Gbps SCI/Sbus for all communication. In order to compare the results with conventional workstation clusters, a comparable set of programs was also developed using MPICH over TCP/IP over Ethernet. The first parallel application program developed for these experiments is a medium-grain implementation of an N x N matrix multiply. This algorithm is designed in a master-slave fashion in which the master passes out rows of the first matrix for the slaves to multiply. A slave, when passed an input row, performs N dot products to produce the corresponding output row. This output row is passed back to the master, and the slave is then ready to request another row. This behavior typifies the concept of opportunistic load balancing (OLB), which takes distinct advantage of the heterogeneity of the cluster testbed. There are three types of communication in this algorithm: broadcast of the second input matrix, scattering of the first input matrix, and collection of the output matrix. Two scheduling techniques were investigated: equal load and OLB. The former ensures that each worker node is supplied an equal number of rows to compute, thereby forcing a heterogeneous cluster to perform like a homogeneous cluster of the slowest node. This is, of course, an approximation, since communication speeds may not be equivalent and communication and computation can be overlapped by the faster nodes; however, it is a useful method to emulate a balanced cluster. The second scheduling technique thrives on the inequities in the cluster and passes rows upon demand to the hungry nodes. This technique requires a fast signaling strategy for efficient operation, since time is otherwise wasted requesting the next row from the master. Granularity can also be varied in the matrix multiply by passing more than one row at a time in response to a request from a slave. When varying grain size from one to eight rows, it was found that no significant, sustained performance increase or decrease took place; the overall best granularity only performed better than its competitors by nominal margins in a majority of cases. The test cases for this experiment were a 256x256 matrix, a 512x512 matrix,
and a 1024x1024 matrix. Cache misses and page faults were monitored and minimized by assuming that the broadcast matrix had already been transposed before the calculation takes place, thus avoiding vertical indexing and dramatically improving calculation times.

The second parallel application developed performs a sort of an array of N double-precision, floating-point numbers. The algorithm used is based on a hybrid combination of the popular Quicksort and Mergesort algorithms. While Quicksort is among the fastest sorting algorithms on uniprocessor machines, it contains elements which are difficult to parallelize. To circumvent a considerable amount of overhead inherent in this "conquer-and-divide" scheme, an algorithm was devised such that the sort vector is divided into P equal segments, where P is the number of processors or workstations in the system. The segments are distributed, one per processor, and each processor (including the master) performs a Quicksort using the native C qsort() function. Each individually sorted segment is then passed back to the master, which performs a Mergesort to produce the list final_vector containing the sorted list of N elements. Figure 10 illustrates this sequence of events.
Figure 10. Parallel Sorting Algorithm (MP = Master Processor, WP = Worker Processor; each WP runs qsort() and the MP then performs the mergesort())
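The computation can be sketched in sequential C as follows. This is an illustration of the hybrid scheme only, not the study's code: in the parallel programs each segment's qsort() runs on a different workstation and the sorted segments return to the master over SCI, and the final combining step is shown here as a simple P-way merge. The array size and contents are arbitrary.

/* Sequential sketch of the hybrid sort of Figure 10 (illustration only; in
 * the parallel programs each segment's qsort() runs on a different node and
 * the sorted segments return to the master via message passing).            */
#include <stdio.h>
#include <stdlib.h>

#define N 16        /* total elements (2M or 4M in the experiments)  */
#define P 4         /* number of segments, one per processor         */

static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

int main(void)
{
    double v[N], out[N];
    size_t pos[P], end[P];

    for (int i = 0; i < N; i++)               /* fill with sample data       */
        v[i] = (double)((i * 37) % 100);

    /* "Worker" phase: each of the P equal segments is Quicksorted.          */
    for (int p = 0; p < P; p++) {
        pos[p] = (size_t)p * (N / P);
        end[p] = pos[p] + N / P;
        qsort(v + pos[p], N / P, sizeof(double), cmp_double);
    }

    /* "Master" phase: P-way merge of the sorted segments into out[].        */
    for (int k = 0; k < N; k++) {
        int best = -1;
        for (int p = 0; p < P; p++)
            if (pos[p] < end[p] &&
                (best < 0 || v[pos[p]] < v[pos[best]]))
                best = p;
        out[k] = v[pos[best]++];
    }

    for (int k = 0; k < N; k++)
        printf("%g%c", out[k], k == N - 1 ? '\n' : ' ');
    return 0;
}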
All of the parallel programs written in C and communicating via HCS_LIB function calls were compiled using the GNU compilers at level -O4 optimization. The MPI programs were compiled using the mpicc MPICH compiler, also at -O4 optimization. The configurations that were included in the testing are shown in Table 4. Figure 11 shows the results of the matrix multiplication experiments for a granularity of four. It was found that the differences between the switch and the simple ring topologies were minimal, and that by exploiting the heterogeneity of the cluster testbed with OLB a 1024x1024 matrix multiply could be calculated in approximately 24 seconds (as compared to almost 200
seconds on a single SS5/85). Because of the significantly different sequential execution times of the three types of workstations in the testbed, OLB guarantees that the faster nodes do not have to wait on the slower nodes to finish before they can begin their next computation. Significant speedup is achieved for small ring sizes as well as topologies with larger ring sizes and switched rings.
Table 4. Parallel Test Configurations
Ring:
2- SPARCstation-5/85s
2- SPARCstation-20/50s
2- SPARCstation-20/85s
4- SPARCstation-5/85s
2- SPARCstation-20/50s & 2- SPARCstation-20/85s
2- SPARCstation-20/50s & 2- SPARCstation-20/85s & 4- SPARCstation-5/85s
Switch:
2- SPARCstation-5/85s (ports 0-1)
2- SPARCstation-20/50s (ports 0-1)
2- SPARCstation-20/85s (ports 0-1)
4- SPARCstation-5/85s (ports 0-3)
2- SPARCstation-20/50s (ports 0-1) & 2- SPARCstation-20/85s (ports 2-3)
2- SPARCstation-20/50s (ports 0-1) & 2- SPARCstation-20/85s (ports 2-3) & 4- SPARCstation-5/85s (ports 0-3)
Figure 12 is a comparison between an MPI version of the matrix multiply running over Ethernet (MPICH) and the comparable measurements from Figure 11. The algorithm in the MPI program uses a static scheduling method such that each worker gets the same number of rows and faster nodes must wait on the slower ones to finish. As can be seen, the SCI/Sbus-1 interconnect provides dramatic increases in speed even with relatively few nodes, although the MPI algorithm still does relatively well. This is because of the "burst"-oriented traffic that is inherent to both algorithms' traffic patterns. While the MPI programs do achieve some speedup, the SCI-based parallel processing shows itself to be clearly superior despite the inherent bottlenecks associated with the relatively slow Sbus interface and first-generation node chips.
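To make the contrast between the two scheduling policies concrete, the following sketch expresses the demand-driven (OLB) row-farming scheme in MPI terms. It is an illustration only: it is neither the authors' HCS_LIB program (which uses send_block()/get_block() and the signaling functions of section three) nor the static-scheduling MPICH program measured in Figure 12, and the matrix dimension and grain size below are set arbitrarily.

/* Demand-driven (OLB) row farming for C = A x B, sketched with MPI so the
 * scheduling policy is concrete.  Illustration only; the results in Figures
 * 11 and 12 come from the authors' HCS_LIB and MPICH programs, not this code.
 * Assumes N is divisible by GRAIN and at least two MPI processes.           */
#include <mpi.h>
#include <stdlib.h>

#define N     256     /* matrix dimension (256, 512, or 1024 in the paper)  */
#define GRAIN 4       /* rows handed out per request (grain size = 4)       */
#define TAG_ROW  1    /* header: starting row index, or -1 for "no more"    */
#define TAG_DATA 2    /* GRAIN rows of A (to worker) or of C (to master)    */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Status st;
    double *Bt = malloc((size_t)N * N * sizeof(double));    /* B, transposed */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0)
        for (long i = 0; i < (long)N * N; i++) Bt[i] = 1.0;  /* sample data  */
    /* B is broadcast once, already transposed so each dot product walks     */
    /* consecutive memory on both operands (the cache point made above).     */
    MPI_Bcast(Bt, N * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (rank == 0) {                                   /* ----- master ----- */
        double *A = malloc((size_t)N * N * sizeof(double));
        double *C = malloc((size_t)N * N * sizeof(double));
        int next = 0, none = -1, done = 0;
        for (long i = 0; i < (long)N * N; i++) A[i] = (double)(i % 7);

        /* Prime every worker with one block (or tell it there is no work).  */
        for (int w = 1; w < size; w++) {
            if (next < N) {
                MPI_Send(&next, 1, MPI_INT, w, TAG_ROW, MPI_COMM_WORLD);
                MPI_Send(A + (size_t)next * N, GRAIN * N, MPI_DOUBLE,
                         w, TAG_DATA, MPI_COMM_WORLD);
                next += GRAIN;
            } else {
                MPI_Send(&none, 1, MPI_INT, w, TAG_ROW, MPI_COMM_WORLD);
            }
        }
        /* Collect results; each reply is answered with more work on demand. */
        while (done < N) {
            int row0;
            MPI_Recv(&row0, 1, MPI_INT, MPI_ANY_SOURCE, TAG_ROW,
                     MPI_COMM_WORLD, &st);
            MPI_Recv(C + (size_t)row0 * N, GRAIN * N, MPI_DOUBLE,
                     st.MPI_SOURCE, TAG_DATA, MPI_COMM_WORLD, &st);
            done += GRAIN;
            if (next < N) {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_ROW,
                         MPI_COMM_WORLD);
                MPI_Send(A + (size_t)next * N, GRAIN * N, MPI_DOUBLE,
                         st.MPI_SOURCE, TAG_DATA, MPI_COMM_WORLD);
                next += GRAIN;
            } else {
                MPI_Send(&none, 1, MPI_INT, st.MPI_SOURCE, TAG_ROW,
                         MPI_COMM_WORLD);
            }
        }
        free(A); free(C);
    } else {                                           /* ----- worker ----- */
        double arows[GRAIN * N], crows[GRAIN * N];
        int row0;
        for (;;) {
            MPI_Recv(&row0, 1, MPI_INT, 0, TAG_ROW, MPI_COMM_WORLD, &st);
            if (row0 < 0) break;                       /* no more work       */
            MPI_Recv(arows, GRAIN * N, MPI_DOUBLE, 0, TAG_DATA,
                     MPI_COMM_WORLD, &st);
            for (int r = 0; r < GRAIN; r++)            /* GRAIN rows of dots */
                for (int j = 0; j < N; j++) {
                    double sum = 0.0;
                    for (int k = 0; k < N; k++)
                        sum += arows[r * N + k] * Bt[(size_t)j * N + k];
                    crows[r * N + j] = sum;
                }
            MPI_Send(&row0, 1, MPI_INT, 0, TAG_ROW, MPI_COMM_WORLD);
            MPI_Send(crows, GRAIN * N, MPI_DOUBLE, 0, TAG_DATA, MPI_COMM_WORLD);
        }
    }
    free(Bt);
    MPI_Finalize();
    return 0;
}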
Figure 11. Parallel Matrix Multiply on SCI/Sbus Workstations (grain size = 4): execution time in seconds versus matrix size (256^2, 512^2, 1k^2) for sequential SS5-85, SS20-50, and SS20-85 runs and for the 2-, 4-, and 8-node ring and switch configurations of Table 4
Figure 12. Parallel Matrix Multiply on SCI/Sbus and Ethernet Workstations (grain size = 4): execution time in seconds versus matrix size for sequential runs and for 2-, 4-, and 8-node MPI/Ethernet and SCI/Sbus ring configurations
Figure 13. Parallel Sort on SCI/Sbus Workstations: execution time in seconds versus data set size (2M and 4M array elements) for sequential runs and for 2-, 4-, and 8-node ring and switch configurations
Figure 13 shows the results of the parallel sorting programs tested using a subset of the same configurations and topologies as the matrix
multiplication (i.e. due to the memory requirements associated with 2M- and 4M-element sorts, those configurations where an SS5/85 or SS20/50 was the master were excluded). For the 4M-element arrays, sorting times ranged from the slowest sequential time of approximately 131 seconds down to 35 seconds using four nodes in a ring topology. Whereas the sequential programs are based on a simple Quicksort implementation, the parallel programs use a combination of Quicksort and Mergesort, the latter being naturally embedded in the collection of results by the master from the workers. This relationship, coupled with the additional amount of main memory and cache brought to bear on the problem when parallel processors are used, helps explain the extraordinary speedups (some even super-linear) achieved and verified in this test. Figure 14 shows the parallel sorting algorithm implemented in C using HCS_LIB compared with the same algorithm implemented using the MPI interface over Ethernet. Because of the large amount of data that must traverse the interconnection network in a "stream" fashion, Ethernet is not suited for providing even a small fraction of the speedup achieved by the SCI/Sbus-1 system.
Figure 14. Parallel Sort on SCI/Sbus and Ethernet Workstations: execution time in seconds versus data set size (2M and 4M array elements) for the sequential SS20-85 run and for 2-, 4-, and 8-node MPI/Ethernet and SCI/Sbus ring configurations
6. Conclusions and Future Research
A series of parallel processing and network experiments has been conducted and the measurements presented. Basic latency measurements (a critical element for parallel processing) have shown that one-way latencies of less than four microseconds can be achieved and exploited. Parallel processing results illustrate that current first-generation SCI/Sbus clusters are capable of a high degree of parallel processing efficiency for applications whose granularity is not sufficiently coarse for conventional clusters. Two widely used algorithms which form an integral part of many high-performance computing applications, matrix multiply and sorting, have been developed for this study. These algorithms have used several variations, including grain size and scheduling technique, in order to help clarify performance tradeoffs.

As funding permits, future research plans related to the SCI-based cluster testbed for parallel processing can be divided into four categories: adapters, algorithms, interconnect comparisons, and system comparisons. In addition to the SCI/Sbus-1 adapters currently being employed, we anticipate the arrival of SCI/Sbus-2 cards, which promise several times the throughput and perhaps even lower latency. Given the limitations of the Sbus I/O interface, other adapters are also being evaluated, including Mbus, PCI, and processor-direct interfaces with cache-coherent, shared-memory support. In addition to parallel matrix multiply and sorting, other algorithms are currently under development for implementation over SCI-based clusters. These include fast Fourier transforms, wavelet transforms, and matrix inversion. Another effort underway involves the comparison of SCI-based clusters with those based on other interconnects, including FDDI, Fibre Channel, 100BASE-T, and ATM. These experiments are expected to take place on two fronts: first with relatively portable, high-level parallel programming and coordination tools (e.g. MPI, PVM, HPF, Linda, etc.), with or without TCP/IP, and also with applications developed to directly access the drivers for these other high-speed networks, much as has been accomplished with HCS_LIB for SCI/Sbus-1. As more advanced SCI adapters become available, these experiments will be upgraded. In addition to TCP/IP-based tools and custom code, the use of Berkeley's Active Messages is also being considered. Finally, in addition to these testbed interconnect comparisons, plans are underway for a series of system comparisons. Based on current and future configurations of our testbed, we hope to gauge the parallel efficiency, speedup, flexibility, and performance/cost ratio of the SCI cluster against commercial systems such as the IBM SP-2 and SGI PowerChallenge.
Acknowledgements
We gratefully acknowledge the support of our sponsors at the National Security Agency in Fort Meade, Maryland, the Naval Air Warfare Center, Aircraft Division, in Warminster, Pennsylvania, and the Office of Naval Research in Washington, D.C. We also wish to thank Mr. Anthony Muoio at Dolphin Interconnect Solutions for the donation of some of the hardware and software involved, and finally Mushtaq Sarwar and David Zirpoli in the HCS Research Lab and Drs. Cockburn and Roberts in the EE Department for their assistance in this effort.
References
[ALNE93] Alnes, K., "Enabling Products for Cluster Computing using SCI," Proceedings of the First International Workshop on SCI-based High-Performance Low-Cost Computing, pp. 58-64, August 1994.
[DOLP95] Dolphin Interconnect Solutions, "1 Gbit/sec SBus-SCI Cluster Adapter Card," White Paper, March 1995.
[GEOR95] George, A.D., R.W. Todd, and W. Rosen, "A Cluster Testbed for SCI-based Parallel Processing," Proceedings of the 3rd International Workshop on SCI-based High-Performance Low-Cost Computing, pp. 43-48, August 1995.
[GUST95] Gustavson, D.B. and Q. Li, "Local-Area MultiProcessor: the Scalable Coherent Interface," Proceedings of the Second International Workshop on SCI-based High-Performance Low-Cost Computing, pp. 131-154, March 1995.
[MPIF93] Message Passing Interface Forum, "MPI: A Message Passing Interface," Proceedings of Supercomputing 1993, IEEE Computer Society Press, pp. 878-883, 1993.
[SCI93] Scalable Coherent Interface, ANSI/IEEE Standard 1596-1992, IEEE Service Center, Piscataway, New Jersey, 1993.