CNI: A High-Performance Network Interface for Workstation Clusters

Prasenjit Sarkar and Mary Bailey
Department of Computer Science, University of Arizona, Tucson, AZ 85721
psarkar, [email protected]

Abstract

Networks of workstations provide an economic solution for scalable computing because they do not require specialized components. Even though recent advances have shown that it is possible to obtain high bandwidth between applications, interconnect latency remains a serious concern. In this paper we present CNI, a cluster network interface that not only provides both low latency and high bandwidth but also efficiently supports multiple programming paradigms. This is done by functionally coupling the network adaptor board more closely to the CPU without changing the standard workstation architecture. CNI results in performance gains for applications, substantially reducing communication overhead and delay.
1. Introduction

Recent advances in workstation architecture and networking technology have enabled researchers to focus on networks of workstations as an economic solution for scalable computing architectures. There has been a dramatic increase in the computing power of workstations. Moreover, gigabit networks offer high-bandwidth communication rivaling that found in many parallel computers. Because workstation clusters use off-the-shelf components, they can provide a cost-effective parallel computer. Additional advantages of networks of workstations include easy upgradability and reuse of existing machine configurations. Research on workstation network interfaces has tended to focus on delivering high bandwidth to applications [4]. The challenge now is to provide low-latency communication between applications, since latency is often the limiting factor in parallel programs [3]. There are two paths to meet this challenge. First, one could use specialized processors that emphasize ease of communication with other processors in a parallel computer. Second, the workstation architecture can be
customized with special control and data paths to couple it more closely to the networking interface, thus bringing down the node-to-node latency. However, both approaches preclude the use of off-the-shelf components, which drives up the base cost of such a configuration. Alternatively, the network interface could be enhanced without modifying the workstation architecture. This paper considers this alternative option of reducing latency by changing only the design of the network adaptor board. To avoid violating standard workstation design principles, a network of workstations must operate under several constraints that do not apply to custom-built parallel machines. First, the network interface device can access host memory only via DMA, which means that there are no custom-built data paths such as those provided by the MAGIC node controller in the Stanford FLASH [8]. Second, there are no special memory bus control signals that aid in the interaction between the CPU and the network device. For example, in the Wisconsin Typhoon system [11] and the CM-5 [5], the network device and the memory-cache subsystem maintain consistency through special signaling. Third, the network device cannot access the host machine cache directly, and data can flow from the cache to the network device only via programmed I/O. These architectural limitations also prevent explicit support for hardware cache coherence as is present in the DASH [9]. In light of the above constraints, our aim is to develop a network interface that (1) uses off-the-shelf components and can be interfaced with a standard workstation architecture, (2) provides both low latency and high bandwidth to applications, and (3) efficiently supports both the message passing and distributed shared memory paradigms for generality in programming. CNI, the cluster network interface, is a network interface that meets these goals through a combination of three techniques. First, we introduce a new mechanism in the network adaptor board, a Message Cache, that minimizes both the cost of sending messages and the cost of remote access misses for shared memory applications. The Message Cache
allows us to couple the network interface more closely with the CPU to achieve better performance. The SHRIMP network interface has a similar coupling, but the binding between the host and the network device is static and explicit, limiting its scalability [1]. In contrast, the Message Cache provides flexible, light-weight mechanisms to reduce the overhead of memory consistency protocols. Second, we use an efficient hardware implementation of Application Device Channels to give applications restricted direct access to the network adaptor board. This removes the operating system from the critical path of message sends and receives, and is compatible with standard workstation operating system scheduling policies. This mechanism uses a hardware device known as the PATHFINDER that directly routes messages to the appropriate application [10]. The PATHFINDER additionally provides the framework for message-driven processing. Third, we allow applications to install customized protocols in the network adaptor board to minimize distributed shared memory overhead. This also uses the PATHFINDER to direct messages to the correct distributed shared memory protocol on the network adaptor board. The CNI network interface implements the above mechanisms on top of the OSIRIS network interface [4]. Most of the functionality provided by the CNI is implemented through existing OSIRIS components. This not only leverages an existing design but also minimizes the incremental cost of the CNI interface. We evaluate the efficiency of CNI through execution-driven simulation. The Message Cache alone reduces communication latency by as much as 33% for page-size transfers. For application programs, the greatest reductions in execution time are found in programs where communication latency is the principal bottleneck. The performance of the CNI is most promising when applications can take advantage of the Message Cache to reduce communication latency and use Application Interrupt Handler code on the network interface to reduce communication overhead. The factor that principally limits performance is the small cell size in the underlying ATM interconnect. The remainder of the paper is organized as follows. We begin by presenting an architectural overview of the proposed network interface and discussing the incremental cost over the OSIRIS network interface. Next we demonstrate the efficiency of the interface by simulating its performance using benchmark application programs. Finally, we conclude with a discussion of our results and future directions.
[Figure 1. Schematic of network interface: the host CPU and cache, memory, OS, protocol library, and application connect over the memory bus to the network adaptor board, which contains dual-ported memory, the Message Cache (snoopy interface, cached buffers, buffer map), a TLB and RTLB, the Application Device Channels, the Application Interrupt Handlers, and the Pathfinder on the transmit and receive paths.]

2. Architecture Overview

The CNI is an ATM network adaptor board, based on the existing OSIRIS ATM adaptor board [4]. The choice of the ATM interconnect was influenced by the fact that it is an emerging high-throughput network technology which is likely to be a standard in the near future. The CNI is connected to the memory bus on the host workstation side and is capable of a maximum network throughput of 622 Mbps (STS-12). There are three major components of CNI, as shown in Figure 1: the Application Device Channels, the Application Interrupt Handlers, and the Message Cache. The Message Cache, composed of the snoopy interface, buffer map, and cached buffers, maintains consistency between the host workstation and the network adaptor board, and reduces the cost of communication primitives. The Application Device Channels give the application limited direct access to the network adaptor board. CNI uses a hardware packet classifier called the PATHFINDER to demultiplex packets to the right application and also to transfer control to application code in the memory reserved for Application Interrupt Handlers. The remainder of this section focuses on these three major components.

The CNI adds a small amount of hardware to the existing OSIRIS board. The chief addition is the hardware used for the Message Cache and the Application Device Channels. These costs are described in detail in Section 2.4.

2.1. Application Device Channels
With the availability of network protocol stacks as application-level libraries, it is possible to give an application access to the network device with the kernel providing connection setup and tear-down services [13]. This removes the kernel from the critical path of message sends and receives, and lowers the application-level latency. CNI uses a hardware implementation of a technique called Application Device Channels to give an application access to the networking device [4] and enhances it for use in a network-of-workstations environment. The concept of Application Device Channels has been implemented in software in the OSIRIS network board and is summarized here. Part of the on-board dual-ported memory is partitioned into multiple transmit, receive and free queues. When an application wants to open a connection, one triplet of transmit, receive and free queues is mapped into the application's address space to form a device channel. Protection is verified only when an application places a buffer in one of these three queues, and verification overhead is thus eliminated from the send and receive paths. The application performs send and receive operations through manipulation of these three queues, which are shared between the application and the adaptor board. The manipulations are lock-free and rely only on the atomicity of loads and stores. This precludes the need to gang-schedule application access to the network interface, which would violate standard workstation operating system scheduling policies. However, there are two key implementation differences between the CNI and OSIRIS network interfaces. First, the OSIRIS board uses the VCI field in an ATM cell to demultiplex between applications, implicitly assuming that every application will get a unique VCI. Even granting this assumption, the VCI field is too coarse-grained to handle multiple protocol actions inside an application or to transfer control to memory consistency protocols running on the network interface. An alternative would be to do the demultiplexing in software on the network interface. However, our experience with software packet classification on the ATOMIC network interface showed that the speed of classification is critically dependent on whether the packet classifier code is resident in the instruction cache of the network interface processor; measurements of classification time indicated that the classifier code had a poor cache hit ratio because of capacity conflicts with other application handlers running on the network interface. In contrast, the CNI network interface uses a hardware packet classifier called the PATHFINDER to guide packets to the appropriate application. Key features of the PATHFINDER include flexible classification programmability and the ability to handle fragmented packets. A more detailed description of the PATHFINDER can be found in [10]. Second, the OSIRIS board relies purely on host interrupts to transfer data and control to the application. In high-speed networks, interrupts are frequent. This, combined with the fact that interrupts are expensive on modern superscalar, superpipelined processors, implies that an alternative approach is needed.
Hence, in the CNI interface, the interaction between the host and the network adaptor board in the receive path involves both polling and host interrupts. The host polls the network adaptor board at a rate that depends on the rate of packet arrival. If the packet arrival rate is high, the host depends on polling to process the receive and free queues. However, if the arrival rate is low, the host depends on interrupts to indicate message arrival. This scheme gives preference to polling over interrupts because of the above-mentioned high cost of handling interrupts. While Application Device Channels greatly reduce communication latency by removing kernel overhead, recent research has revealed that the cost of data transfer across the memory bus contributes greatly to the cost of message passing. The Message Cache in the CNI is designed to tackle this overhead and is discussed next.
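Returning to the device channels themselves, the following is a minimal sketch of the lock-free queue manipulation described above: a single-producer, single-consumer descriptor ring of the kind an Application Device Channel triplet could use. The structure layout, field names, and the cni_post_send helper are illustrative assumptions for this sketch, not the actual OSIRIS/CNI queue format.

```c
#include <stdint.h>

/* Hypothetical send descriptor: virtual address and length of a host buffer. */
struct cni_desc {
    uint64_t vaddr;   /* virtual address of the message buffer           */
    uint32_t len;     /* length in bytes                                 */
    uint32_t flags;   /* e.g. bit 0 = "cache this buffer on the board"   */
};

/* One queue of a device-channel triplet, mapped into both the application's
 * address space and the adaptor's dual-ported memory.  Correctness relies
 * only on the atomicity of aligned word loads and stores: the application
 * writes only `head`, the adaptor writes only `tail`. */
#define CNI_QLEN 64u                      /* power of two */
struct cni_queue {
    volatile uint32_t head;               /* next free slot (producer)   */
    volatile uint32_t tail;               /* next slot to drain          */
    struct cni_desc   slot[CNI_QLEN];
};

/* Post a send without locks or system calls: fill the slot, then publish it
 * with a single store to `head`.  Returns 0 on success, -1 if the adaptor
 * has not yet drained earlier descriptors. */
static int cni_post_send(struct cni_queue *txq, uint64_t vaddr,
                         uint32_t len, uint32_t flags)
{
    uint32_t h = txq->head;
    if (((h + 1) & (CNI_QLEN - 1)) == txq->tail)
        return -1;                        /* queue full */
    txq->slot[h].vaddr = vaddr;
    txq->slot[h].len   = len;
    txq->slot[h].flags = flags;
    txq->head = (h + 1) & (CNI_QLEN - 1); /* single store publishes entry */
    return 0;
}
```

The receive and free queues would follow the same single-writer discipline with the roles of application and adaptor reversed, which is why no gang scheduling or kernel mediation is needed on the data path. On a weakly ordered memory system a write barrier would be required before the publishing store to `head`; the sketch omits it for brevity.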
2.2. Message Cache

There are two objectives for the Message Cache: to reduce the communication overhead between workstation nodes, and to reduce the cost of access misses for distributed shared memory applications. Our goal is to couple the network adaptor board with the host workstation without changing the workstation architecture. Thus we must add functionality to the network adaptor board that provides this coupling without exposing the CPU to frequent interrupts that degrade performance. The adaptor board maintains a set of cached buffers that are used to keep a consistent copy of host workstation memory pages. Consistency is maintained by snooping on writes from the CPU to main memory; the exact mechanism is detailed below. Keeping a consistent copy in adaptor board memory avoids the expense of a copy from the host workstation memory to the adaptor board. Each buffer corresponds to a virtual memory buffer on the host; the mapping is kept in a table called the buffer map. There are also a TLB and an RTLB, which keep mappings between host virtual and physical memory addresses and permit virtually addressed DMA operations. The buffers are managed in an approximate LRU order, and the least recently used buffer map entry is evicted in case of a capacity conflict. For the sake of operational simplicity, we have fixed the size of a buffer in the Message Cache to be the same as that of a page on the host workstation. Since the Message Cache keeps the network interface memory consistent with the host memory and not the host cache, the CPU must keep host memory consistent with the contents of its cache. For a write-through cache subsystem this is trivial, but systems with a write-back cache must ensure consistency with cache flushes before an impending message transfer. Our experience with generating code
for both message-passing and shared memory applications shows that this is not a difficult proposition. To prove the applicability of the Message Cache for write-back architectures, the performance of the CNI network architecture is evaluated in a write-back cache environment. An alternative would be to do a DMA directly from the host cache, but modern workstation architectures are yet to permit such an operation on the cache. In any case, this alternative solution fails to deal with the situation when the message is partially in the cache and partially in the host memory. There are three fundamental operations associated with the Message Cache which are detailed below:
Transmit Caching
Receive Caching
Consistency Snooping

Transmit Caching: Transmit Caching of main memory network buffers occurs when the application on the host machine transmits a packet. In the CNI network interface, applications transmit a packet by specifying the virtual address of the buffer and the buffer length in the dual-ported transmit queue on the network interface. When the transmit processor sees this virtual address at the head of the transmit queue, it does the following:

1. Using the buffer map, it checks whether there is a valid buffer in the network interface corresponding to the buffer to be transmitted.

2. If such a buffer exists, the transmit processor directly transmits this buffer over the network without DMAing the buffer from host machine memory.

3. If there is no valid buffer, the transmit processor DMAs the host buffer onto the network interface. It then checks the header of the message for a bit indicating whether this message buffer is to be cached and, if the bit is on, it creates an entry in the buffer map indicating where the host virtual memory buffer is mapped to a buffer on the network interface. Then the transmit processor transmits this buffer over the network.

Thus if the application uses the same buffer for transmitting data, the buffer needs to be DMAed from host memory onto the network adaptor board only once. This is very useful in reducing communication overhead when an application uses the same set of host buffers for message transfers and can take advantage of temporal locality in network communication.

Receive Caching: Similarly, caching on the receive side helps reduce the cost of remote accesses for distributed shared memory programs. When an application needs to receive data, it specifies a receive buffer on the application's receive queue on the network interface. When a shared memory page arrives over the network, the receive processor does the following:

1. It copies the incoming message into a buffer on the network interface.

2. It then checks the incoming message header for a bit indicating whether the message is to be cached. Shared memory pages which are likely to migrate from one host to another will always have this bit on. If the bit is on, the receive processor creates a binding between the network buffer and the host receive buffer in the buffer map.
3. The receive processor then DMAs this network buffer onto the host receive buffer.

Hence if the network interface later holds a consistent copy of the received shared memory page and the page has to be sent out to another host, the network interface can transmit it directly without having to DMA it from host memory. This potentially reduces the cost of page migration in shared memory applications.

Consistency Snooping: To take advantage of transmit and receive caching, there needs to be a consistent copy of the host memory on the network interface. This consistency is maintained by snooping, as follows:

1. Whenever the CPU does a write onto the memory bus, the CNI network interface snoops the target of the write from the bus. The write target is a physical address on the host machine.

2. Using the RTLB, the interface converts this physical address to a host virtual address and uses this virtual address to check whether a valid network buffer corresponding to the host virtual memory buffer exists in the buffer map.

3. If no such network buffer exists, the snoop operation is aborted. However, if such a network buffer is present, the contents of the buffer are modified with the data being written, keeping the cached buffer consistent with the CPU.

The Message Cache thus focuses on reducing communication latency by avoiding data transfers as much as possible. However, as network interface processors become increasingly powerful, substantial overhead can be eliminated if protocol processing is done on the network interface. This avoids the cost of control transfer to the host workstation and is implemented using the concept of Application Interrupt Handlers.
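Before moving on, the buffer-map lookup that drives both transmit caching and consistency snooping can be sketched as follows. The direct-mapped organization, structure layout, and function names (buf_map_lookup, cni_snoop_write, rtlb_translate) are assumptions made for illustration; the paper describes an approximate-LRU buffer map rather than this simplified form.

```c
#include <stdint.h>
#include <string.h>
#include <stdbool.h>

#define PAGE_SIZE 4096u
#define NUM_BUFS  8u                       /* 32 KB Message Cache / 4 KB pages */

/* One buffer-map entry: binds a host virtual page to a cached on-board buffer. */
struct buf_map_entry {
    bool     valid;
    uint64_t host_vpage;                   /* host virtual page number         */
    uint8_t  data[PAGE_SIZE];              /* cached copy of the host page     */
};

static struct buf_map_entry buf_map[NUM_BUFS];

/* Look up a host virtual address in the buffer map (direct-mapped here). */
static struct buf_map_entry *buf_map_lookup(uint64_t vaddr)
{
    uint64_t vpage = vaddr / PAGE_SIZE;
    struct buf_map_entry *e = &buf_map[vpage % NUM_BUFS];
    return (e->valid && e->host_vpage == vpage) ? e : NULL;
}

/* Assumed helper standing in for the RTLB, which maps a snooped physical
 * address back to a host virtual address. */
extern uint64_t rtlb_translate(uint64_t paddr);

/* Consistency snooping: called for every CPU write observed on the memory
 * bus.  Assumes the write does not cross a page boundary. */
static void cni_snoop_write(uint64_t paddr, const void *data, size_t len)
{
    uint64_t vaddr = rtlb_translate(paddr);
    struct buf_map_entry *e = buf_map_lookup(vaddr);
    if (e == NULL)
        return;                            /* no cached copy: abort the snoop  */
    /* Patch the cached buffer so it stays consistent with host memory. */
    memcpy(&e->data[vaddr % PAGE_SIZE], data, len);
}
```

On the transmit path the same lookup decides whether a page can be sent directly from the board (a network cache hit) or must first be DMAed from host memory.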
2.3. Application Interrupt Handlers

An increasing concern for distributed shared memory is the high overhead of memory consistency protocols. Even though coarse- and medium-grained applications show definitive speedup characteristics using high-bandwidth networks, fine-grained applications exhibit poor performance. This can be attributed to the high cost of synchronization between cooperating compute nodes. To improve performance, one must reduce either the amount of synchronization needed or its cost. We focus on reducing the latter. Our approach is similar to the technique used in the FLASH and Typhoon architectures of running user-level code in the network interface [8, 11]. However, an important distinction in Application Interrupt Handlers is the lack of support for virtual memory on the network interface. Two possible advantages of virtual memory are the ability to page in user-level code, and protection between different user processes. However, a page fault on a network interface is very expensive and is unsuitable in light of the high rate of data arrival. In such a case it is better to swap the entire user-level handler onto the network interface. In the case of protection, the network adaptor board either needs to have a separate copy of the page table or must share the page table in host machine memory. Having a separate copy introduces consistency problems which might need customized bus signals for resolution. Sharing the page table with the network adaptor board would require synchronization primitives which might affect the performance of the host machine CPU. Instead, our solution is motivated by current efforts to integrate user-level code into untrusted environments such as the network adaptor board or the operating system kernel. Application protocol code is written in a pointer-safe language environment and compiled to relocatable network interface object code. When an application opens a connection, it can specify a packet pattern for the PATHFINDER and the location and size of the protocol object code. The network interface then swaps the object code into an available free segment, and programs the PATHFINDER to activate the object code on a match of the specified pattern in an incoming message packet. Thus whenever the board receives a packet and demultiplexes it to the correct application, it also transfers control to the application's protocol object code. This can be thought of as an extension of the Active Messages principle [14] to the network interface. The protocol object code can then perform all consistency operations required to implement distributed shared memory in the network of workstations. For example, a barrier can be handled within the network adaptor board, eliminating the overhead of the application protocol stack. The network adaptor board can also contain the data structures for handling shared memory operations to avoid expensive host memory accesses. The cost of most other synchronization operations also decreases significantly. This protocol object code, or Application Interrupt Handler, also allows customized memory consistency protocols, which can significantly increase the performance of distributed shared memory programs.
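As an illustration of how an application might install such a handler, the sketch below shows a hypothetical registration call and a barrier handler meant to run entirely on the adaptor board. The cni_install_handler API, the pattern descriptor, and the barrier layout are assumptions made for this example, not an interface defined by the paper.

```c
#include <stdint.h>

/* Hypothetical pattern descriptor handed to the PATHFINDER: match `len`
 * bytes at `offset` in the packet header against `value`. */
struct cni_pattern {
    uint32_t offset;
    uint32_t len;
    uint8_t  value[16];
};

/* Hypothetical call: installs relocatable handler object code on the board
 * and programs the PATHFINDER to invoke it on a pattern match.  Returns a
 * handler id, or -1 if no free segment is available. */
int cni_install_handler(int channel, const struct cni_pattern *pat,
                        const void *obj_code, uint32_t obj_len);

/* Example handler body (compiled separately to network-interface object
 * code): a counting barrier serviced without involving the host CPU. */
struct barrier_state {
    volatile uint32_t arrived;    /* arrivals seen so far              */
    uint32_t          expected;   /* number of participating nodes     */
};

void barrier_handler(struct barrier_state *b, void (*send_release)(int node))
{
    if (++b->arrived < b->expected)
        return;                            /* wait for the remaining nodes */
    b->arrived = 0;                        /* reset for the next barrier   */
    for (uint32_t n = 0; n < b->expected; n++)
        send_release(n);                   /* release every waiting node   */
}
```

Handling the barrier on the board in this way avoids a host interrupt and a traversal of the application protocol stack for every arrival message.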
2.4. Cost-Performance Issues

This section deals with the incremental cost of the CNI over the OSIRIS interface. The CNI interface requires the following additional hardware: first, the PATHFINDER is used to support demultiplexing among the various application code modules, both on the network interface and on the host workstation. Second, the buffer map requires RTLB-like functionality to provide the translation between network buffers and memory buffers on the host. The additional hardware components have been designed for FPGA implementation, and thus the incremental cost, compared to the total cost of a workstation, is small enough to maintain the cost-performance advantage provided by networks of workstations.
3. Performance

CPU Frequency                    166 MHz
Primary Cache Access Time        1 cycle
Primary Cache Size               32K unified
Secondary Cache Access Time      10 cycles
Secondary Cache Size             1 MB unified
Cache Organization               Direct-mapped
Cache Policy                     Write-back
Memory Latency                   20 cycles
Bus Acquisition Time             4 cycles
Bus Transfer Rate                2 cycles per word
Bus Frequency                    25 MHz
Switch Latency                   500 ns
Network Processor Frequency      33 MHz
Network Latency                  150 µs
Interrupt Latency                40 ns
Message Cache Size               32 KB

Table 1. Simulation Parameters

This section discusses the performance of the CNI network interface using an execution-driven simulation of various application programs. Our measurements are based on simulating an Alpha workstation cluster connected via an ATM interconnect and the CNI. The simulation parameters are summarized in Table 1. The cache, processor and memory simulation parameters are partially derived from an Alpha workstation. The switch latencies are obtained from a 32-port banyan-network based ATM switch model.

The execution-driven simulation was done using a modified version of the Proteus simulator [2]. The performance of the CNI network interface was measured using execution-driven simulation of three distributed shared memory applications representing the spectrum of granularity: Jacobi, Water and Cholesky; the latter two are from the SPLASH benchmark suite [12]. Message passing applications were not used because we wanted to vary the granularity of the applications while keeping the programming paradigm constant. All three applications used a lazy invalidate release consistency protocol [6, 7] for memory consistency. An invalidate protocol was chosen because it has been shown that invalidate protocols work best in low-overhead environments. This protocol is assumed to run on the network interface board using the memory allocated for Application Interrupt Handlers. For the sake of simplicity, we assume that no other protocol is running in the Application Interrupt Handler memory region; this assumption is reasonable considering that it is likely that only a single parallel application will be using the cluster of workstations. A fixed portion of the processor address space was allocated to distributed shared memory, with shared addresses being mapped into this allocated memory space. An approximate LRU scheme is used to replace mappings in case the allocated processor address space runs out of mappings for shared address spaces. Because the CNI network interface faces architectural constraints in design, as mentioned in Section 1, its performance is not compared to the FLASH and Typhoon multiprocessors, which use specialized control and data paths and custom low-latency interconnects. Instead, we compare the CNI network interface based cluster of workstations to one based on a standard networking interface used in current workstation cluster configurations. By a standard networking interface, we mean one which does not have Application Device Channels, Message Caches or support for Application Interrupt Handlers. In fact, currently available networking interfaces do not support these features. Otherwise, the software and hardware configurations of the CNI-based and the standard networking interface-based workstation clusters are considered identical. In the following experiments, the network cache hit ratio refers to the ratio of the number of times a message to be transmitted is found in the Message Cache to the total number of message transmissions in the CNI network interface based cluster. This term does not apply to the standard network interface-based workstation cluster.

3.1. Applications
[Figure 2. Performance Results for Jacobi with 128 × 128 matrix: CNI and standard speedups and network cache hit ratio versus number of processors.]

[Figure 3. Performance Results for Jacobi with 256 × 256 matrix: CNI and standard speedups and network cache hit ratio versus number of processors.]
Jacobi is a coarse-grained application with two major synchronization points per iteration and a high computation/communication ratio. Each point in the strip is iteratively calculated from the values of its neighbors. The application was run with three matrix sizes: 128 × 128, 512 × 512, and 1024 × 1024. The speedups for both the CNI and the standard case, and the network cache hit ratios for the CNI case, are shown in Figures 2, 3 and 4. Because of the relatively small amount of communication, the difference in performance is not substantial; however, the CNI interface shows better performance because of the high cache hit ratio. The ratio is very high because the Message Cache is large enough to hold the shared data. Both configurations show mediocre performance for a small matrix size (128 × 128) and a large number of processors (32), but the level of degradation is less in the CNI because of reduced shared memory overhead costs. Next, we measured the sensitivity to shared memory page size with a 1024 × 1024 matrix input and 8 processors. The results shown in Figure 5 indicate that the CNI network interface is less sensitive to page size variations because of the lower cost of page transfers.
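For reference, the per-point update the Jacobi description refers to is the standard four-neighbor relaxation; the sketch below shows one iteration over a node's local strip of the matrix. The strip partitioning and array names are chosen for illustration and are not taken from the benchmark source.

```c
/* One Jacobi iteration over this node's strip of rows [lo, hi).
 * Rows lo-1 and hi are boundary rows supplied by neighboring nodes
 * (via the shared pages) before the sweep; N is the matrix dimension. */
void jacobi_sweep(int N, int lo, int hi,
                  double (*cur)[N], double (*next)[N])
{
    for (int i = lo; i < hi; i++)
        for (int j = 1; j < N - 1; j++)
            next[i][j] = 0.25 * (cur[i - 1][j] + cur[i + 1][j] +
                                 cur[i][j - 1] + cur[i][j + 1]);
}
```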
[Figure 4. Performance Results for Jacobi with 1024 × 1024 matrix: CNI and standard speedups and network cache hit ratio versus number of processors.]

Category          Time-CNI (10^9 cycles)   Time-standard (10^9 cycles)
Synch overhead    0.054                    0.063
Synch delay       0.086                    0.099
Computation       1.164                    1.165
Total             1.304                    1.330

Table 2. Overhead for 8-processor Jacobi with 1024 × 1024 matrix

[Figure 5. Page Size Sensitivity for 8-processor Jacobi with 1024 × 1024 matrix: CNI and standard speedups versus shared memory page size.]

[Figure 6. Performance Results for Water with 64 molecules: CNI and standard speedups and network cache hit ratio versus number of processors.]
Finally, the performance breakdown for an 8-processor Jacobi with a 1024 × 1024 matrix and a 2 KB shared memory page size, shown in Table 2, indicates that the CNI scheme has lower synchronization overhead as well as substantially less synchronization delay. Because of the granularity of the application, we did not expect a major difference in performance between the CNI and standard network interface-based workstation clusters. Water can be categorized as a medium-grained application. It simulates the molecular behavior of water, and was run with input sizes of 64, 216 and 343 molecules for 2 steps. In each step, the various intra- and inter-molecular forces affecting each molecule are calculated with respect to the other molecules and then the parameters of the molecule are updated. The original algorithm was modified to postpone the updates until the end of an iteration, as in [3]. Synchronization is performed by (1) acquiring a lock for updating the parameters of a molecule and (2) through barriers. Figures 6, 7 and 8 show the speedups and network cache hit
ratios for the two configurations. In contrast to the previous application, the network cache hit ratio is sensitive to the number of processors because of the nature of data sharing. However, this helps the CNI network interface show improved scalability with a large number of processors. The CNI is also less sensitive to page size, as shown in Figure 9, even though there is some false sharing with larger page sizes. Finally, an analysis of overheads for an 8-processor 216-molecule Water run, listed in Table 3, demonstrates lower synchronization overheads and delays for the CNI configuration.

Category          Time-CNI (10^9 cycles)   Time-standard (10^9 cycles)
Synch overhead    0.17                     0.30
Synch delay       2.24                     2.45
Computation       2.95                     2.95
Total             5.36                     5.70

Table 3. Overhead for 8-processor Water with 216 molecules

Cholesky is a fine-grained application that factorizes a sparse positive-definite matrix.
[Figure 7. Performance Results for Water with 216 molecules: CNI and standard speedups and network cache hit ratio versus number of processors.]

[Figure 8. Performance Results for Water with 343 molecules: CNI and standard speedups and network cache hit ratio versus number of processors.]

[Figure 9. Page Size Sensitivity for 8-processor Water with 216 molecules: CNI and standard speedups versus shared memory page size.]

[Figure 10. Performance Results for Cholesky with matrix bcsstk14: CNI and standard speedups and network cache hit ratio versus number of processors.]
Each processor modifies a column or a set of columns, called supernodes, of the matrix. Access to the columns and supernodes is synchronized through column locks. Columns or supernodes are allocated to a processor using the bag-of-tasks paradigm. Pages tend to move from the releaser to the acquirer, leading to many access misses when an invalidate protocol is used; thus caching receive buffers helped performance a great deal. Also, one page usually contains many columns, so concurrent write sharing and the use of write notices increases the parallelism and reduces the amount of data exchanged. The application was run with matrices bcsstk14 and bcsstk15, the results of which are shown in Figures 10 and 11. The bcsstk15 matrix shows better speedup because of its larger size, as shown in Figure 11. The application is very sensitive to the size of the shared memory page because of the large page migration overhead induced by an increase in page size. However, this overhead is greatly reduced in CNI due to transmit and receive caching, leading to considerably less sensitivity to the shared page size as
illustrated in Figure 12. Overall, as seen in Table 4, the synchronization overheads and delays are considerably less in the case of CNI.
3.2. Message Cache Size Sensitivity

From the above results, it is apparent that performance depends on the network cache hit ratio of the Message Cache. Since cache hit ratios are predominantly determined by the size of the cache, we conducted experiments to determine the optimal size of the Message Cache for the three applications in our benchmark. The results are shown in Figure 13 for 8-processor versions of Jacobi, Cholesky and Water with varying Message Cache sizes. For Water and Jacobi, a slight increase of the Message Cache beyond 32 KB brings the network cache hit ratio to its optimal limit, primarily because of the quantity and nature of the shared data. In the Cholesky application, the nature of data sharing causes the network cache hit ratio to saturate at 90% for a Message Cache size of 512 KB.
Category          Time-CNI (10^9 cycles)   Time-standard (10^9 cycles)
Synch overhead    3.39                     3.35
Synch delay       61.8                     65.1
Computation       21.5                     21.5
Total             85.70                    89.0

Table 4. Overhead for 8-processor Cholesky with matrix bcsstk14

[Figure 11. Performance Results for Cholesky with matrix bcsstk15: CNI and standard speedups and network cache hit ratio versus number of processors.]

[Figure 12. Page Size Sensitivity for 8-processor Cholesky with matrix bcsstk14: CNI and standard speedups versus shared memory page size.]

[Figure 13. Network Cache Hit Ratios for 8-processor versions of Jacobi, Water and Cholesky with varying Message Cache sizes.]
The optimal Message Cache size thus depends on the application as well as the number of processors, but it is evident that the 1 MB of memory onboard the OSIRIS board (and thus the CNI) may be sufficient.
3.3. Microbenchmarks

From the performance results of the above three applications, the CNI performs very well if the problem space can be distributed over the Message Caches of the workstation nodes, leading to high network cache hit ratios and low communication overheads and delays. To illustrate this point, we estimate the best possible node-to-node latency of the CNI (assuming a 100% network cache hit ratio) as compared to that of the standard network architecture. This is shown in Figure 14 and demonstrates that for a 4 KB page-size transfer, the communication latency is lower for the CNI architecture by as much as 33%. Since data is transferred
via pages in distributed shared memory programs, this reduction in communication latency will significantly reduce page transfer overhead.
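A rough back-of-the-envelope estimate, using the Table 1 parameters and assuming 4-byte bus words and ignoring per-cell and bus-acquisition overheads, shows why eliminating one host-memory DMA on a Message Cache hit is worth roughly a third of the latency:

\[
t_{\mathrm{DMA}} \approx \frac{(4096/4)\ \text{words} \times 2\ \text{cycles/word}}{25\ \mathrm{MHz}} \approx 82\ \mu\mathrm{s}, \qquad
t_{\mathrm{wire}} \approx \frac{4096 \times 8\ \text{bits}}{622\ \mathrm{Mbps}} \approx 53\ \mu\mathrm{s}.
\]

A transfer requiring a send-side DMA, network transmission, and a receive-side DMA would then take roughly 82 + 53 + 82 ≈ 217 µs, while a send-side hit in the Message Cache removes one DMA and leaves roughly 135 µs, a reduction of about 38%, consistent in magnitude with the measured 33%.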
3.4. Performance Analysis

8-processor Application            % Improvement
Jacobi with 1024 × 1024 matrix     5.69
Water with 343 molecules           13.31
Cholesky with matrix bcsstk14      25.29

Table 5. Performance Improvements using ATM with unrestricted cell size

This section analyzes the performance of the CNI architecture. Though there is a substantial reduction in communication overhead, this does not translate into a corresponding improvement in speedup. A probable hypothesis is that the network performance overshadows the reductions in overhead. To test this, we concentrated on the underlying ATM network. The cell size of ATM is 53 bytes, so every large message transfer incurs a huge fragmentation and reassembly overhead.
[Figure 14. Node-to-node latency for the CNI and standard network interface as a function of message size (bytes).]
This increases the communication overhead and delay substantially; to illustrate this point, we experimented with a hypothetical networking technology having the same characteristics as ATM but with an unlimited cell size, which implies no fragmentation and reassembly overhead. We ran the applications with this networking technology, and the improvements in performance are summarized in Table 5. Each application shows a remarkable improvement in performance, indicating that the ATM cell size is a major detriment in trying to reduce communication overhead.
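To quantify the cost the cell size imposes, consider a 4 KB page under the conventional ATM cell format of a 5-byte header and 48-byte payload per 53-byte cell (the payload/header split is not stated in the paper but is the standard format):

\[
\left\lceil \frac{4096}{48} \right\rceil = 86\ \text{cells}, \qquad
\frac{86 \times 5}{4096} \approx 10.5\%\ \text{header overhead},
\]

in addition to per-cell segmentation and reassembly work on both adaptors for each of the 86 cells. This is the overhead that an unrestricted cell size removes, yielding the improvements in Table 5.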
4. Conclusion

To enable workstation clusters to achieve high performance from standard off-the-shelf components, the functionality of the network adaptor board must be more closely coupled with the host workstation. CNI achieves this by giving the application restricted direct access to the network adaptor board, removing the operating system from the critical path of sends and receives. CNI also allows the host workstation to install code in the network adaptor board to reduce the cost of high-overhead memory consistency protocols. Finally, CNI caches transmit and receive buffers to decrease the number of costly host memory-to-network adaptor board copies. Simulations of the CNI indicate that communication latency and overhead are reduced significantly, resulting in higher performance for applications that can exploit the features of this networking interface.
References

[1] M. A. Blumrich, K. Li, et al. Virtual memory mapped network interface for the SHRIMP multicomputer. In Proceedings of the 21st Annual International Symposium on Computer Architecture, May 1994.
[2] E. Brewer and C. N. Dellarocas. PROTEUS: User documentation version 0.5. Technical report, M.I.T., 1993.
[3] A. Cox, S. Dwarkadas, P. Keleher, H. Lu, R. Rajamony, and W. Zwaenepoel. Software versus hardware shared-memory implementation: A case study. In Proceedings of the 21st Annual International Symposium on Computer Architecture, May 1994.
[4] P. Druschel, L. L. Peterson, and B. S. Davie. Experience with a high-speed network adaptor: A software perspective. In Proceedings of the 1994 SIGCOMM Symposium, Aug. 1994.
[5] C. E. Leiserson et al. The network architecture of the Connection Machine CM-5. In Proceedings of the 4th Symposium on Parallel Algorithms and Architectures, 1992.
[6] K. Gharachorloo, D. Lenoski, J. Laudon, P. Gibbons, A. Gupta, and J. Hennessy. Memory consistency and event ordering in scalable shared-memory multiprocessors. In Proceedings of the 17th Annual International Symposium on Computer Architecture, pages 15-26, Seattle, Washington, May 1990.
[7] P. Keleher, A. L. Cox, and W. Zwaenepoel. Lazy release consistency for software distributed shared memory. In Proceedings of the 19th Annual International Symposium on Computer Architecture, pages 13-21, May 1992.
[8] J. Kuskin, D. Ofelt, et al. The Stanford FLASH multiprocessor. In Proceedings of the 21st Annual International Symposium on Computer Architecture, May 1994.
[9] D. Lenoski, J. Laudon, K. Gharachorloo, A. Gupta, and J. Hennessy. The directory-based cache coherence protocol for the DASH multiprocessor. In Proceedings of the 17th Annual International Symposium on Computer Architecture, pages 148-159, May 1990.
[10] M. Bailey, B. Gopal, M. Pagels, L. Peterson, and P. Sarkar. PATHFINDER: A pattern-based packet classifier. In Proceedings of the 1st Symposium on Operating Systems Design and Implementation, Nov. 1994.
[11] S. Reinhardt, J. Larus, and D. Wood. Tempest and Typhoon: User-level shared memory. In Proceedings of the 21st Annual International Symposium on Computer Architecture, Apr. 1994.
[12] J. Singh, W. Weber, and A. Gupta. SPLASH: Stanford parallel applications for shared memory. Technical Report CSL-TR-91-469, Stanford University, Apr. 1991.
[13] C. Thekkath, T. Nguyen, E. Moy, and E. Lazowska. Implementing network protocols at the user level. In Proceedings of the 1993 SIGCOMM Symposium, Sept. 1993.
[14] T. von Eicken, D. Culler, S. Goldstein, and K. Schauser. Active Messages: A mechanism for integrated communication and computation. In Proceedings of the 19th Annual International Symposium on Computer Architecture, Gold Coast, Australia, May 1992.