Embedding High Speed ATM in Unix IP

D. Scott Alexander, C. Brendan S. Traw, and Jonathan M. Smith
University of Pennsylvania

Abstract

Penn's Asynchronous Transfer Mode (ATM) host interface architecture is a scalable hardware architecture designed for high-performance operation when coupled with appropriate software support. An early implementation for the IBM RISC System/6000 supports ATM data rates of up to 160 Mbps. With the UPWARDS communications kernel, we measured sustained application-to-application throughput of over 130 Mbps. Using this experimental software system as a basis, we were able to transparently support the AIX TCP/IP protocol stack. This paper reports our experiences, suggests some lessons, and outlines the application of those lessons to the software models for an OC-12c implementation of the host interface architecture.

1 Introduction

The design and implementation of an Asynchronous Transfer Mode (ATM) computer/network host interface at the University of Pennsylvania began in 1990 [Traw 93a], as part of the ATM/SONET infrastructure of the AURORA Gigabit Testbed [Clark 93]. The initial research goal of the interface work was to identify and experimentally verify a kernel of services that were suitable for hardware implementation. These data-movement- and formatting-intensive services include ATM Adaptation Layer (AAL) processing, segmentation-and-reassembly (SAR), and ATM demultiplexing. The architecture partitioned the protocol processing activities between the hardware host interface and software running on the host processor. The result was a well-balanced communications subsystem in which hardware performs all per-ATM-cell activities, leaving higher-level operations to the more flexible host software.

The IBM RISC System/6000 was chosen as the experimental platform for this work since IBM is an AURORA collaborator, and it offered high performance, especially much-needed memory bandwidth. The long-term goal of AURORA is to provide end-to-end gigabit-per-second networking between workstations. Concurrent with the hardware implementation, software support appropriate for high-throughput networking had to be designed and implemented. While a research-oriented platform such as Mach [Accetta 86] or Chorus [Abrossimov 88] might have been attractive given their flexibility, our view was that the major novel system software ideas could easily be tested within IBM's interpretation of UNIX, AIX.

An additional benefit was that pre-existing AIX applications would be available without porting and recompilation. Using the AIX support for loadable device drivers, a small kernel (UPWARDS) was written in the guise of a character device driver. UPWARDS has been used to test the ideas of reduced buffer copying, shared-memory application programmer interfaces (APIs), and scheduled device service. It is used daily in our laboratory as support for a variety of applications, including a telerobotics application which uses the raw ATM service. The performance is limited by the workstation I/O bus, with over 130 Mbps of bandwidth available end-to-end (AIX sending process to remote AIX receiving process).

This work was supported by the National Science Foundation and the Advanced Research Projects Agency under Cooperative Agreement NCR-8919038 with the Corporation for National Research Initiatives. Additional support was provided by Bell Communications Research under Project DAWN, by an IBM Faculty Development Award, and by Hewlett-Packard.


A TCP/IP implementation has become necessary both because of the broader range of supported applications and because of the protocol's role as a networking performance litmus test. Starting in the summer of 1993, UPWARDS was augmented to support protocol layers above ATM, including the AIX IP protocol stack. This paper reports lessons learned from our effort.

This paper is organized as follows. Section 2 provides an overview of the interface hardware subsystem. Section 3 outlines the concepts tested in the original UPWARDS implementation and assesses the results. Section 4 describes the idea of an UPWARDS virtual device, developed to support TCP/IP but considerably more general. Section 5 describes the addition of AAL3/4 support, needed to support the variable-sized datagram model of IP. Section 6 assesses the use of AIX Network Interface Drivers (NIDs) to support IP and other protocols on new devices. We provide measurements and a performance analysis in Section 7, conclusions in Section 8, and directions for future work in Section 9.

2 The Penn Host Interface

The host interface (Figure 1) attaches to the IBM Micro Channel Architecture, which serves as an I/O subsystem attachment point in the IBM RISC System/6000 [IBM 90]. The interface currently supports network physical layers with bandwidths up to 160 Mbps, such as SONET STS-3c and the Hewlett-Packard HDMP-1000 ("G-LINK") chipset [Laubach 93]. The interface is split into two logical elements, a segmenter and a reassembler. These logical elements are implemented as a pair of Micro Channel cards which perform all per-ATM-cell functions, including segmentation and reassembly, in dedicated hardware.

Figure 1: RS/6000 Host Interface (the segmenter and reassembler cards sit on the Micro Channel I/O bus via MCA bus master interfaces, alongside the workstation's CPU, memory, and XIO/IOCC; each card connects at 160 Mbps to a physical layer interface leading to the network)
The segmenter card is passed the address and size of a memory buffer containing a Protocol Data Unit (PDU). Data is transferred from the buffer into the segmenter, which splits the buffer contents into ATM payloads and transmits these payloads with appropriate ATM cell headers. The reassembler inverts this operation by receiving multiplexed streams of cells and reconstructing PDUs in its buffer memory. The buffer memory is indexed by Virtual Circuit Identifiers (VCIs), which serve as names for linked-list chains of cell-payload-sized (48-byte) buffers. When an address, size, and VCI are passed to the reassembler, an appropriate number of cell bodies are transferred into system memory at that address. In this way, the segmentation and reassembly hardware provide the illusion of variable-sized PDUs for hosts connected to an ATM network.

Neither the segmenter nor the reassembler provides facilities for interrupts. Discovering data arrival relies on a periodic polling scheme called clocked interrupts [Smith 93] rather than a traditional interrupt on data arrival. Clocked interrupts are discussed further in Section 3. Both cards function as bus masters, meaning that they are capable of autonomously transferring data to and from host memory. The IBM RISC System/6000's Micro Channel Architecture supports addressing of host memory through address translation control words (TCWs) maintained in an I/O Channel Controller (IOCC). Presuming that appropriate TCWs are set into the IOCC, arbitrarily sized data transfers can be performed, and the mapping of virtual addresses to real addresses is transparent to both the host software and the device. One implication of this arrangement is the elimination of direct scatter/gather support in either the host or the device.

There are potential advantages and disadvantages to such bus-mastered devices (we will call them DMA devices, although that is inaccurate in the MCA context). The major advantage is concurrent operation. Presuming the workstation's memory can support the bandwidth demands of both the processor and the device, data transfer and data processing can take place concurrently. This potential gain must be traded against the increased complexity in managing the data transfer (mainly due to interactions with virtual memory and the processor data cache), as well as a potential increase in latency. In addition, DMA devices are not able to employ some of the clever techniques for "free" checksums [Clark 89] possible with Programmed I/O (PIO) devices [Banks 93] [Partridge 93].
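The transfer model just described, a (buffer address, size, VCI) triple handed to each card, can be summarized in a small sketch. The structure and function names below are ours, chosen for illustration; they are not the actual UPWARDS or host interface programming interface.

    /* Illustrative only: hypothetical descriptors for the segmenter and
     * reassembler transfer model described above. */
    #include <stddef.h>
    #include <stdint.h>
    #include <sys/types.h>

    struct atm_xfer {
        uint32_t vci;   /* virtual circuit identifier naming the cell stream */
        void    *buf;   /* host buffer, assumed already mapped via IOCC TCWs */
        size_t   len;   /* PDU size in bytes (up to 64 KB) */
    };

    /* Queue a PDU for transmission: the segmenter splits buf into 48-byte
     * ATM payloads and prepends cell headers carrying vci. */
    int segmenter_send(const struct atm_xfer *x);

    /* Drain one reassembled PDU for x->vci into x->buf; the byte count is
     * only known after the transfer, so it is returned here. */
    ssize_t reassembler_receive(struct atm_xfer *x);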

3 Initial Software Work

The goal of the system software is to provide high-performance software support for the host interface design. A second (but important) goal was to provide others within the AURORA Gigabit Testbed [Clark 93] a way to use the ATM network. As such, the original UPWARDS kernel only supported "cell" mode, a limited mode of the host interface which supports 48-byte cells of data without providing any framing. It provided for initializing the host interface, moving data to and from the user, and managing DMA, and it provided system entry points such as open(), read(), write(), and ioctl() [Smith 93]. The three major concepts investigated were a new device signaling model (clocked interrupts), transparently reducing copying in the path from a user process, and alternate APIs for even higher performance. The results are reported elsewhere [Smith 93], but it is important to understand the context in which the present paper's results are set.

Our ATM host interface was one of the earliest design and implementation efforts, and was essentially a proof-of-concept effort to show that hardware SAR could offer low cost with high performance. It staked out a middle ground between the simplistic first-generation Fore Systems ATM host interfaces [Cooper 91] and the complex host interface [Davie 1993] designed and implemented by our AURORA collaborators at Bellcore. Our feature set, described in Section 2, allowed us to provide the host with a variable-sized PDU model of the network; these PDUs were transferred directly to and from user address spaces, using both the UNIX read()/write() interface and an advanced shared-memory interface which avoided the UNIX system call overhead. This overhead is considerable for small PDUs, which are unfortunately the typical case.
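For concreteness, the following user-level fragment sketches how an application might use such a character-device API. The device path and the ioctl request shown are placeholders of our own; the actual UPWARDS entry points and commands are not reproduced here.

    /* Hypothetical usage sketch of a raw "cell mode" character device in the
     * spirit of the open()/read()/write()/ioctl() entry points above.
     * SET_VCI and the device path are illustrative, not the real UPWARDS
     * interface. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/ioctl.h>

    #define SET_VCI 0x1234            /* placeholder ioctl request */

    int main(void)
    {
        char cell[48];                /* one ATM cell payload */
        int  vci = 42;                /* example virtual circuit */
        int  fd  = open("/dev/atm", O_RDWR);
        if (fd < 0) { perror("open"); return 1; }

        if (ioctl(fd, SET_VCI, &vci) < 0)              /* bind the channel */
            perror("ioctl");
        if (write(fd, cell, sizeof cell) != (ssize_t)sizeof cell)
            perror("write");                           /* send one payload */
        if (read(fd, cell, sizeof cell) < 0)
            perror("read");                            /* receive one payload */

        close(fd);
        return 0;
    }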

3.1 Performance Summary Prior to the IP Implementation

We were able to achieve over 130 Mbps, or 98% of the maximum achievable 135.5 Mbps ATM transport bandwidth of a SONET OC-3c connection. Others concurrently demonstrated the value of reduced copying in protocol stacks [Banks 93] for high-speed programmed I/O (PIO) adapters; we believed (and continue to believe) that the concurrency possible with a DMA device offers some advantages when an application workload is placed on the host; experiments we reported in [Smith 93] showed that applications could achieve high networking performance even on a loaded workstation (load average greater than 5). Effective buffer management was the key to success with the DMA adapter; process buffers were pinned to keep them available for the adapter's access. More recent software work [Druschel 93] has built on these ideas, in the context of the Mach operating system, to manage a pool of pinned buffers suitable for interfacing Mach to the per-cell DMA abstractions used for Bellcore's Osiris ATM host interface [Davie 1993].

A distinct feature of UPWARDS is the use of clocked interrupts, which we discuss next.

3.2 Clocked Interrupt Model

Changes of state (e.g., data arrival) on the interface hardware are detected with clocked interrupts. A clocked interrupt is a periodic clock-generated interrupt. It initiates a routine which polls the interface to determine any changes of state. Clocked interrupts may negatively affect the latency of the networking subsystem, but they can improve the bandwidth which can be handled under a variety of traffic types, as multiple changes of state can be detected by a single clocked interrupt.

A simple calculation shows the tradeoff. Consider a system with an interrupt service overhead of C seconds and k active channels, each with events arriving at an average rate of λ events per second. Independent of interrupt service, each event costs δ seconds to service, e.g., to transfer the data from the device. The offered traffic is λ·k events per second, and in a system based on an interrupt per event, the total overhead per second will be λ·k·(C + δ). Since the maximum number of events serviced per second will be 1/(C + δ), the relationship between the parameters is 1 > λ·k·(C + δ). Assuming that C and δ are for the most part fixed, we can increase the number of active channels and reduce the arrival rate on each, or we can increase the arrival rate and decrease the number of active channels. However, for clocked interrupts delivered at a rate of ρ per second, the capacity limit is 1 > ρ·C + λ·k·δ. Since δ is very small for small units such as characters, and C is very large, it makes sense to use clocked interrupts, especially when a reasonable value of ρ can be employed. In the case of modern workstations, C is about a millisecond. Note that as the traffic level rises, more work is done on each clock "tick," so that the data transfer rate λ·k·δ asymptotically bounds the system performance, rather than the interrupt service rate. We note that traditional interrupt service schemes can be improved, e.g., by aggregating traffic into larger packets (this reduces λ significantly, while typically causing a slight increase in δ), by using an interrupt on one channel to prompt scanning of other channels, or by masking interrupts and polling above some traffic intensity threshold.

One interesting aspect of using clocked interrupts is setting the frequency. We are exploring dynamic mechanisms for setting this rate, which are based on the network load and the latency requirements of the applications.
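The following small C program, using illustrative parameter values of our own choosing, evaluates the two capacity expressions above to show the tradeoff numerically.

    /* Compares CPU utilization of interrupt-per-event service with clocked
     * interrupts, using the inequalities from the text.  All parameter
     * values below are hypothetical examples, not measurements. */
    #include <stdio.h>

    int main(void)
    {
        double C      = 1e-3;    /* interrupt service overhead, seconds (~1 ms) */
        double delta  = 25e-6;   /* per-event data transfer cost, seconds */
        double lambda = 200.0;   /* events per second on each channel */
        double k      = 8.0;     /* active channels */
        double rho    = 500.0;   /* clocked interrupt rate, Hz */

        double u_event = lambda * k * (C + delta);     /* interrupt per event */
        double u_clock = rho * C + lambda * k * delta; /* clocked interrupts */

        /* A value of 1.0 or more means the host cannot keep up. */
        printf("interrupt-per-event utilization: %.2f\n", u_event);
        printf("clocked-interrupt utilization:   %.2f\n", u_clock);
        return 0;
    }

With these example values, per-event interrupts would require about 1.6 CPU-seconds per second of traffic (i.e., the host cannot keep up), while the clocked scheme needs only about 0.54.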

3.3 Practical Aspects of the Clocked Interrupt Scheme

Clocked interrupts were implemented using the AIX kernel's manifestation of the UNIX callout table scheme [Thompson 78], using a fast real-time clock available on all RISC System/6000 processors. Two difficulties with implementing this concept on a real machine are the overhead of a timer interrupt (which limits us to a practical rate of less than about 6000 Hz on an RS/6000 Model 580 for trivial clocked interrupts), and the embedding in an existing interrupt hierarchy. This latter constraint proved most troubling, as the interrupt can carry out variable amounts of work depending upon what is available at the reassembler, while the recommended time bounds for the various interrupt levels were short. This was solved by using the clocked interrupt to signal a kernel thread (an AIX kproc) through a sleep()/wakeup()-like mechanism. The kernel thread then initiates any necessary transfers; a transfer can take up to 4 ms (for a 64KB PDU).

The clocked interrupt scheme's polling rate is a key parameter in the design and gives rise to some non-traditional trade-offs. While less-frequent polling improves throughput and host performance, it has some potentially negative consequences for latency; for example, a 60 Hz timer would give a worst-case latency of over 16.7 milliseconds before data reached a process, far slower than desired for many LAN applications [Kanakia 88]. This is easily observable on our systems by using ping and changing the clocked interrupt rate on the system being pinged. Our system goal was high throughput, and other considerations (e.g., the distance to MIT) were expected to dominate round trip times. As we will see when analyzing TCP/IP performance, this latency effect is problematic for TCP's control algorithms as well, even in a local-area setting.
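The interrupt/thread split described above can be sketched as follows. This is an analogy written with POSIX threads rather than the AIX callout and kproc facilities; the names are ours.

    /* Sketch (not the AIX implementation): the periodic clock handler does
     * minimal work and wakes a service thread, which then polls the
     * reassembler and starts any transfers outside interrupt context. */
    #include <pthread.h>
    #include <stdbool.h>

    static pthread_mutex_t lock    = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  tick    = PTHREAD_COND_INITIALIZER;
    static bool            pending = false;

    /* Called at the clocked interrupt rate; kept short because interrupt
     * level time bounds are tight. */
    void clocked_interrupt(void)
    {
        pthread_mutex_lock(&lock);
        pending = true;
        pthread_cond_signal(&tick);            /* wakeup() analogue */
        pthread_mutex_unlock(&lock);
    }

    /* Kernel thread analogue: sleeps until ticked, then does the variable
     * amount of work (up to ~4 ms for a 64 KB PDU transfer). */
    void *service_thread(void *arg)
    {
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&lock);
            while (!pending)
                pthread_cond_wait(&tick, &lock);   /* sleep() analogue */
            pending = false;
            pthread_mutex_unlock(&lock);
            /* poll reassembler status and initiate DMA transfers here */
        }
        return NULL;
    }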

4 The /dev/atm Virtual Device

UPWARDS initially consisted of two drivers, one for the segmenter board and one for the reassembler board. This was an historical artifact of both software engineering and the AIX automatic system configuration logic. The two drivers were fully independent, and each open() provided a simplex channel. While architecturally simple, this was inconsistent with the AIX model for a network device. AIX assumes that a network device will consist of a single board driven by a single driver. Given the choice between creating a Network Interface Driver (NID) which recognized two drivers simultaneously, or an UPWARDS interface which made the two boards appear as a single (virtual) board, the latter option was chosen. This allowed us to write a fairly standard network interface driver at the cost of added complexity in the driver. NIDs are discussed in Section 6.

In some cases, the unified software could be built by just copying routines from the two independent drivers and fixing collisions in variable names. For example, read() and write() were straightforward since these entry points were each present in only one of the drivers. However, operations like open() and the initial configuration routines had to be more heavily reworked to support this melding. The driver now supports any combination of segmenters and reassemblers. This leads to the "virtual" device, /dev/atm, which is realized by two physical devices, the reassembler and the segmenter. Virtual devices, in our model, have the property that the virtual device provides an abstraction necessary for a higher layer of software structure to obtain the behavior it requires from a device. While not hard to implement in this case, the dependence on the underlying physical devices and the automatic configuration of the Micro Channel devices added complexity to the device setup. For example, robustness required considerable analysis of device absence/presence in order to bring up the virtual device, which needs a transmit/receive pair. Some segmenter and reassembler entry points were maintained for low-level debugging tasks as well as for providing simplex channels.
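The configuration check at the heart of the virtual device can be reduced to a few lines; the sketch below is our own illustration of the idea, not the UPWARDS configuration code.

    /* /dev/atm is only brought up when both underlying physical devices are
     * present, since the duplex abstraction needs a transmit/receive pair.
     * Names and structure are illustrative. */
    #include <stdbool.h>

    struct atm_virtual {
        int  n_segmenters;     /* transmit cards found on the Micro Channel */
        int  n_reassemblers;   /* receive cards found on the Micro Channel */
        bool configured;
    };

    bool atm_virtual_config(struct atm_virtual *v)
    {
        v->configured = (v->n_segmenters > 0) && (v->n_reassemblers > 0);
        return v->configured;  /* true: the duplex /dev/atm can be opened */
    }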

5 Adding Support for IP Datagrams - AAL3/4 CS-PDUs

The original UPWARDS driver code was extended to take advantage of the networking support services offered by AIX and of the advanced features of the host interface for framing and multiplexing PDUs. We wanted a general PDU interface for projects building custom transport layers. Additionally, by also building a Network Interface Driver (NID), we were able to inherit the AIX IP protocol stack and (more importantly) its attendant applications.

The first step was to add support for Convergence Sublayer PDUs (CS-PDUs) to the software. CS-PDUs are implemented as part of AAL3/4, which is a protocol layer immediately above the raw ATM (cell) layer. AAL3/4 imposes an additional 4 bytes of overhead per cell, so that the 48-byte payload is reduced to 44 bytes, in exchange for per-cell checksumming and fields to support datagram reassembly. Had the boards been implemented today, we would have used or added AAL5, which seems better suited to IP support; we have done this in a second implementation of this host interface architecture for the HP PA-RISC, described in Section 9. AAL3/4 supports connectionless traffic.

For the segmenter, support of this facility was relatively straightforward. On the reassembler, CS-PDU mode is more complicated as a consequence of the reassembly buffering scheme. A CS-PDU's presence is not indicated until it is fully assembled. Its size cannot easily be determined prior to the DMA transfer; after the transfer, the number of bytes transferred is available in a status register. Potentially, any CS-PDU may contain up to 65536 bytes. Before a transfer occurs, the status registers of the reassembler for a given virtual circuit only indicate the number of CS-PDUs available.
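To make the framing cost concrete, the fragment below works out the cell count and line-rate efficiency for the 65536-byte maximum CS-PDU mentioned above; it is a simple illustration, not code from the driver.

    /* AAL3/4 arithmetic: each 53-byte cell carries a 48-byte ATM payload,
     * of which 44 bytes remain for CS-PDU data after the 4 bytes of AAL3/4
     * per-cell overhead. */
    #include <stdio.h>

    int main(void)
    {
        const int cell_bytes   = 53;     /* ATM cell, header included */
        const int cell_payload = 44;     /* usable bytes per cell under AAL3/4 */
        const int cs_pdu       = 65536;  /* maximum CS-PDU size from the text */

        int cells = (cs_pdu + cell_payload - 1) / cell_payload;   /* ceiling */
        double efficiency = (double)cs_pdu / ((double)cells * cell_bytes);

        printf("%d-byte CS-PDU -> %d cells, %.1f%% of the line rate is payload\n",
               cs_pdu, cells, 100.0 * efficiency);
        return 0;
    }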

6 Network Interface Driver Architecture

To add a new network device to an AIX system, one must write a Network Interface Driver. This piece of code is the interface between the higher-level protocols (e.g., IP and X.25) and a device driver.

Figure 2: Network Interface Driver functional model (the IP protocol suite and other protocol suites sit above per-network NIDs for ATM, Ethernet, and other networks, each of which sits above its own device driver)
The NID provides whatever data link layer encapsulation is necessary, for transmitting mbuf chains and for formatting received data as mbuf chains. In our implementation, this means providing or checking the data at the beginning and end of an AAL3/4 CS-PDU (e.g., begin-end tags, sizes) and arranging for each IP datagram to be encapsulated in one CS-PDU. The role of the NID, as illustrated in Figure 2, is that of glue logic between low-level device support software (i.e., the UPWARDS communications kernel) and general-purpose protocol stacks (i.e., IP).

This model generally proved to be a good one. We did not have access to the AIX sources, but were able to add a first-class network device to the system. Not only were we able to inherit utilities like telnet, ftp, and xmosaic without recompilation, but system administration tools like netstat worked without modification. The primary difficulty encountered was the assumption that the new network will be IEEE 802 [Tanenbaum 88] based. Thus, DSAPs and ARPs are expected. Since 802 does not have a clear mapping into a virtual circuit environment, these features are currently not cleanly implemented. When virtual circuit tables are built dynamically, additional information will describe the mapping to the 802 parameters.
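The datagram-to-CS-PDU mapping can be sketched as follows. The mbuf layout is heavily simplified and upwards_send_cspdu() is a stand-in we invented for the driver hand-off; the real AIX NID entry points differ.

    /* Glue-logic sketch: each outgoing IP datagram (an mbuf chain) is
     * encapsulated in exactly one AAL3/4 CS-PDU and handed to the UPWARDS
     * layer on the virtual circuit chosen for the destination. */
    #include <stddef.h>
    #include <stdint.h>

    struct mbuf {                 /* heavily simplified mbuf */
        struct mbuf *m_next;
        size_t       m_len;
        char        *m_data;
    };

    /* Assumed UPWARDS hand-off, not the real interface: the driver adds the
     * AAL3/4 framing (begin/end tags, length) around the datagram. */
    extern int upwards_send_cspdu(uint32_t vci, struct mbuf *chain, size_t total);

    int atm_nid_output(struct mbuf *chain, uint32_t dst_vci)
    {
        size_t total = 0;
        for (struct mbuf *m = chain; m != NULL; m = m->m_next)
            total += m->m_len;            /* one datagram == one CS-PDU */
        return upwards_send_cspdu(dst_vci, chain, total);
    }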

7 Performance

With the initial version of UPWARDS, Smith and Traw [Smith 93] were able to show that the hardware can easily support the full bi-directional OC-3c bandwidth. In fact, with changes in implementation technology, OC-12c bandwidths can be supported by the host interface hardware [Traw 93b]. Our goal was to preserve this performance for IP-family transport layer protocols. While we did not wholly achieve this, we understand why, and the results are respectable.

We gathered end-to-end performance using ttcp, and low-level performance via kernel tracing. AIX provides an extremely flexible fine-granularity tracing feature. Trace macros are called with zero to five arguments, and are identified by a hookword. Hookwords for events to be traced are specified to a program which enables tracing for those hookwords. The trace macros test whether tracing is active for the hookword, and if so, a record is written to a kernel trace buffer. When tracing is stopped, with another command, the contents of the trace buffer are dumped to a file, which can then be formatted by interpreting the hookword and printing any arguments to the macro found in the trace dump. When tracing is not active, the trace facility costs a conditional branch. The tracing facility was heavily used in the debugging and performance tuning of the system; it provided a low-cost but flexible method for reporting the behavior of the AIX kernel.

Figure 3: Experimental Setup (two RS/6000 workstations connected back-to-back through their host interfaces over a 160 Mbps ATM transport on an STS-3c-like link)

Operation        Time (microseconds), 16KB MTU   Work Unit   Count per Direction   Direction
d_master()        379                            64KB PDU    0                     R,W
d_complete()       15                            PDU         1                     R,W
cache_inval()      22                            PDU         1                     R
mbuf copy in      267                            4KB         4                     R
mbuf copy out     239                            4KB         4                     W
DMA in           1066                            PDU         1                     R
DMA out           948                            PDU         1                     W

Table 1: Performance of various critical path operations from traces

The hardware test configuration consists of the elements shown in Figure 3. Two RS/6000 workstations are connected back-to-back via their ATM host interface subsystems. The physical layer connection is provided by a board which appears to be a SONET STS-3c link to each of the host interfaces. This board is connected to the host interfaces via a ribbon cable.

7.1 Strategy

Due to its use of mbuf chains with 4KB (page-sized) clusters, the AIX IP stack represented an almost pessimal situation for high performance with our system. Our best UPWARDS application-to-application throughput of 130 Mbps employed 64KB PDUs (a.k.a. "jumbograms") and copied directly to/from process memory. Neither of these was possible when supporting an unmodified TCP/IP stack. The key decision thus became whether to DMA directly to/from mbuf structures (a small PDU size) or to aggregate mbufs into a larger PDU for use by the adapter. Based on an analysis of various costs (see Table 1) we chose the latter strategy, as it allowed us to maintain a long-term TCW mapping (obtained with the AIX d_master() kernel service) in the IOCC. This mapping is for the buffers to and from which mbufs are copied. DMA directly to/from mbufs would have required setup and teardown of the mappings for each mbuf. An additional advantage of the large copied PDU versus direct-mapped mbufs was the amortization of many per-PDU costs over a larger number of bytes. Finally, due to some peculiarities of the device, general and robust receiver code would have to d_master() a 64KB PDU in any case, as the length of the PDU is not known until after it is copied. The major disadvantages are the reduced throughput and increased latency from the extra copy.
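The send-side half of this strategy amounts to a gather copy into a pinned, long-term-mapped staging buffer; the sketch below is our own illustration of that idea, with a simplified mbuf and no real kernel services (the receive side is shown in Figure 4 below).

    /* Aggregate-then-DMA sketch: mbuf data is copied into a pinned 64 KB
     * staging buffer whose IOCC mapping was set up once (d_master()-style),
     * so no per-mbuf mapping setup or teardown is needed. */
    #include <stddef.h>
    #include <string.h>

    #define STAGE_SIZE (64 * 1024)      /* one 64 KB PDU */

    struct mbuf {                       /* heavily simplified mbuf */
        struct mbuf *m_next;
        size_t       m_len;
        char        *m_data;
    };

    /* stage points at the pinned, long-term-mapped buffer.  Returns the PDU
     * length to hand to the segmenter. */
    size_t gather_into_stage(char *stage, const struct mbuf *chain)
    {
        size_t off = 0;
        for (const struct mbuf *m = chain;
             m != NULL && off + m->m_len <= STAGE_SIZE; m = m->m_next) {
            memcpy(stage + off, m->m_data, m->m_len);  /* the extra copy */
            off += m->m_len;
        }
        return off;
    }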

i = 0; dma_buf = -1;                           /* no DMA outstanding */
for (pdus = get_#pdus(); pdus > 0; i = (i + 1) % nbuf) {
    can_dma = DG_DONE();                       /* is the reassembler idle? */
    if (dma_buf >= 0 && can_dma) {             /* previous DMA has finished */
        buf[dma_buf].state = filled;
        dma_buf = -1;
    }
    if (buf[i].state == empty && can_dma) {
        start DMA of next PDU into buf[i];
        dma_buf = i;
        buf[i].state = filling;
    }
    if (buf[i].state == draining) {
        bcopy some of the data into the mbuf chain;
        if (done) buf[i].state = drained;
    }
    if (buf[i].state == filled) {
        initialize mbuf chain;
        buf[i].state = draining;
    }
    if (buf[i].state == drained) {
        pass mbuf chain to the NID;
        buf[i].state = empty;
        --pdus;
    }
    if (buf[i].state == filling)
        no_op;                                 /* DMA still in progress */
}

Figure 4: Reassembly Data Management Thread

The logic of the UPWARDS reassembly data management thread is given in pseudo-code in Figure 4. This thread copies datagrams (PDUs) from the reassembler card into mbuf chains for processing by the NID. The main concepts embedded in the routine are the overlap with the reassembler's memory access and the use of large (64KB) buffers. Concurrency control of the reassembler is accomplished using the can_dma flag, set from a call to the macro DG_DONE(), which checks a flag raised by the reassembler when it has finished a data transfer. Only one DMA can be outstanding at a time. While this DMA is outstanding, other useful work (e.g., data copying) can be accomplished. This can be exploited when multiple PDUs are found at the time this thread is stimulated by the clocked interrupt service code. Mbuf copying is interleaved with checking for DMA completion.

7.2 Early Results

Initial tests of TCP and UDP were conducted in December 1993 and January 1994 under AIX version 3.2.0. We measured TCP throughput over the lo0 loopback interface at a maximum of 64 Mbps. Our best measured ATM interface throughput was slightly under 25 Mbps (with a clocked interrupt rate of 1000 Hz, an MTU of 65536, and 1 megabyte of data transferred). This was less than expected, so we investigated further.

With tracing, we discovered that it was taking approximately 4 ms to transfer 64KB across the bus. Thus, it would take at least 8 ms to send a packet and receive the acknowledgment. Additionally, IP and TCP processing took approximately 2 ms.

Figure 5: UDP Throughput (throughput in Mbps versus MTU size in KB, for RS/6000 Models 530 and 580)
Since the maximum window size in TCP is 64KB, only one packet could be outstanding at a time. This imposed a maximum possible throughput of 51 Mbps; an increased window size would improve performance. We also attempted decreasing the MTU so that multiple packets could be outstanding, in hopes of getting some overlap in processing, but the gains from this overlap were found to be less than the additional cost of per-packet processing.

After these measurements were conducted, we found that AIX version 3.2.5 added support for RFC 1323 [Jacobson 92] features. In particular, a TCP option can be specified during connection setup to increase the window size by a scale factor, allowing windows of up to 1 Gbyte. (Further, timestamps can be exchanged, which enables more accurate round trip time estimation.) We upgraded our development systems to this version of the operating system.
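As a sanity check on the window-limited bound above, the short calculation below divides one 64KB window by the roughly 10 ms required to send a packet and receive its acknowledgment (about 8 ms of bus transfer time plus about 2 ms of IP and TCP processing).

    /* Window-limited TCP throughput bound: one 64 KB window per round trip. */
    #include <stdio.h>

    int main(void)
    {
        double window_bits = 65536.0 * 8.0;   /* 64 KB maximum TCP window */
        double rtt_seconds = 0.008 + 0.002;   /* bus transfers + protocol processing */
        printf("bound: %.1f Mbps\n", window_bits / rtt_seconds / 1e6);
        return 0;
    }

This evaluates to roughly 52 Mbps, in line with the 51 Mbps ceiling observed.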

7.3 Measurements and Analysis

AIX TCP/IP is heavily parameterized, with options to select the maximum socket size, buffer space limits on TCP's send and receive sizes, and upper and lower bounds on various pools of mbufs and clusters. Further, these parameters limit the window size, although in different fashions depending on whether the RFC 1323 support is employed. Small differences in the choice of parameters can (and, we have observed in practice, do) produce differences in TCP/IP throughput of over 40 to 1. The reason these dramatic differences occur is simple: the amount of buffering determines how efficiently various elements of the AIX TCP/IP stack processing chain are utilized, as well as the scheduling of the threads used to process data. This is because the buffer sizes determine both what is passed via mbuf chains and, more importantly, when it is passed. This is clear when the CPU utilization of slower instances of TCP/IP transfers is studied; it is 0-5%, as reported by ttcp. We are still trying to determine the optimum configuration of parameters for TCP/IP, which is unfortunately sensitive to the performance of the UPWARDS multiplexing as well. The best TCP/IP throughput we have been able to measure at the time of this paper, using ttcp with 16KB writes and 32KB MTUs, is 7736.01 KB/s, or 63.4 Mbps, for a quarter-megabyte transfer.

To determine what is possible, we studied the performance of UDP/IP transmission throughput for our two test machines, Models 530 and 580 of the IBM RISC System/6000. Presuming that TCP/IP is in an optimum steady state, with windows open to an appropriate size, this should give an upper bound on TCP/IP throughput. The results are given in Figure 5. All measurements were taken from the UDP/IP transmit side, using ttcp's "-u" flag, with 16KB writes.

All checksums were enabled for all tests, and the sending processor utilization was at or near 100%.

8 Conclusions

The segmenter and reassembler architecture generally works quite well for the programmer. Because it mostly hides the details of the cell level and gives the appearance of a datagram network, writing the device driver was straightforward. The clocked interrupt model, however, is probably not the best choice under the constraints forced on us by AIX. It works well for detecting data arrival, but a hybrid model which signals the driver on completion of selected data movement operations would ease concurrency control.

Another problem occurred when the reassembler did not have enough memory to hold all of the outstanding data. TCP/IP worsened this, as it learns how much bandwidth is available by pushing the link until it breaks (e.g., by dropping a packet). Reassembly buffer lockup crashes the machine. Large buffers (perhaps located on the host) are needed for high-speed devices.

The AIX NID architecture is generally a nice one. Being able to add a new network device without recompiling the kernel or consulting the kernel sources simplified the development effort. The NID was also one of the easiest pieces of software in this suite to write once the requirements became clear.

As expected, TCP extends reasonably well onto high-speed ATM networks. The major difficulties we had were related to particular design decisions in the host interface architecture which caused conflicts with TCP's learning and control algorithms. With the minor extensions necessary to circumvent small field sizes, TCP throughput is limited more by the power of the processor to move data than by the protocol itself [Clark 89]. An implementation of TCP/IP with buffer management properly integrated with UPWARDS buffer and clock management would provide considerably higher performance.

We note that our performance is quite satisfactory in spite of the significant constraints of using an unmodified vendor protocol stack. Using this stack is a significant constraint for reporting "hero" numbers on a (now) five-year-old design originally optimized for custom (ATM-based) protocol stacks. As an example, this constraint prevented us from using the "fbuf" buffer optimizations discussed in [Druschel 93] for the Bellcore Osiris interface [Davie 1993] or the "single-copy" stack used by [Banks 93], all of which control the entire path from application to device. Our reduced-copying optimizations became essentially useless, as the AIX TCP/IP stack was oblivious to their existence and therefore could not employ them. In spite of this, our performance remains competitive with even the newest commercial host interfaces, such as those implemented by Fore Systems.

The goal of using the host interface as part of the laboratory infrastructure was achieved, meaning that common applications are available without recompilation or rewriting. The only difference applications see is much faster networking. The effect on xmosaic can be quite impressive!

Regrettably, this paper does not incorporate measurements taken from the AURORA Gigabit Testbed itself. We operated TCP/IP over the infrastructure and ran some preliminary tests when the TCP/IP first became operational. As the TCP/IP was refined, several crucial components of the ATM infrastructure were removed for upgrading, rendering the infrastructure temporarily unusable. We intend to report these measurements in the near future.

9 Future Work

Another effort within our laboratory is implementing a user-level TCP/IP [Edwards 94] for the optimized UPWARDS user-level stack. Thus, the AIX TCP/IP is avoided altogether, and the fast path to the ATM network is exploited by a TCP/IP library. Using an unoptimized version of Karn's KA9Q code as a basis, a team of students was able to achieve 22 Mbps with TCP/IP checksumming on, and 32 Mbps with TCP/IP checksumming turned off. These numbers were obtained in a loopback configuration on a single RS/6000 Model 530, created by connecting the machine's segmenter to its reassembler. This suggests (presuming that sending and receiving are equally time-consuming, probably NOT a good assumption) that end-to-end performance could approach 64 Mbps (when cheating) or 44 Mbps (when playing fair). We note that this machine is a low-performance model from a five-year-old series of IBM RISC System/6000s, which were also plagued with an underperforming Input/Output Channel Controller [Traw 93a]. At the time of this writing, we have not measured the performance of this system on higher-performance models, but based on our experience with the IBM RS/6000 Model 580 equipped with the XIO, we expect considerable improvements.

Figure 6: Link Adapter Implementation (the ATM Link Adapter's segmenter and reassembler, together with CRC32 generator, CRC32 checker, and monitor, attach to the Afterburner dual-ported packet buffer on the HP 700 series workstation's SGC bus, and connect at 640 Mbps in each direction to the physical layer interface to the network)
A second implementation of the segmentation and reassembly architecture has been developed for the HP 9000/700 series workstations equipped with Afterburner cards [Dalton 93]. The Afterburner is an implementation of Van Jacobson's WITLESS architecture [Jacobson 90] developed by HP Laboratories in Bristol, England. This implementation attaches to the SGC bus of the workstation and provides a bidirectional FIFO data path as well as a control port for a network-specific Link Adapter. Link Adapters have already been designed by HP and others for FDDI and HiPPI networks. Our segmentation and reassembly architecture is the basis for an ATM Link Adapter [Traw 94] [Veen 93] (Figure 6). This second implementation of the architecture will support bidirectional 640 Mbps network connections as well as AAL5. The ATM Link Adapter/Afterburner pair will be able to interrupt the host on a variety of network events, including the completion of a CS-PDU's reassembly, as well as support the clocked interrupt approach to event management presented in this paper. A goal of this second-generation work is to facilitate the experimental comparison of a range of options for event management on a single hardware platform.

Finally, the issues involved in addressing over an ATM network have been completely ignored in the current version of our software. It needs to be modified to dynamically set up calls and to keep a mapping between virtual circuits in the network and file descriptors in the kernel. Additionally, decisions need to be made on how to map an IP address into the ATM addressing framework and on how virtual circuits should or should not be shared.


10 Acknowledgments

Gaylord Holder helped with design, debugging, and measurement.

References

[Abrossimov 88] V. Abrossimov, F. Armand, I. Boule, M. Gien, M. Guillemont, F. Herrmann, C. Kaiser, S. Langlois, P. Leonard, and W. Neuhauser, "CHORUS Distributed Operating Systems," Computing Systems, 1(4), pp. 305-370 (December 1988).

[Accetta 86] M. Accetta, R. Baron, W. Bolosky, D. Golub, R. Rashid, A. Tevanian, and M. Young, "Mach: A New Kernel Foundation for UNIX Development," in Proceedings, 1986 Summer USENIX Conference (1986).

[Banks 93] D. Banks and M. Prudence, "A High-Performance Network Architecture for a PA-RISC Workstation," IEEE Journal on Selected Areas in Communications (Special Issue on High Speed Computer/Network Interfaces), 11(2), pp. 191-202 (February 1993).

[Clark 89] David D. Clark, Van Jacobson, John Romkey, and Howard Salwen, "An Analysis of TCP Processing Overhead," IEEE Communications Magazine, 27(6), pp. 23-29 (June 1989).

[Clark 93] David D. Clark, Bruce S. Davie, David J. Farber, Inder S. Gopal, Bharath K. Kadaba, W. David Sincoskie, Jonathan M. Smith, and David L. Tennenhouse, "The AURORA Gigabit Testbed," Computer Networks and ISDN Systems, 25(6), pp. 599-621, North-Holland (January 1993).

[Cooper 91] Eric Cooper, Onat Menzilcioglu, Robert Sansom, and Francois Bitz, "Host Interface Design for ATM LANs," in Proceedings, 16th Conference on Local Computer Networks, pp. 247-258 (October 14-17, 1991).

[Dalton 93] C. Dalton et al., "Afterburner: A network-independent card provides architectural support for high-performance protocols," IEEE Network, pp. 36-43 (July 1993).

[Davie 1993] Bruce S. Davie, "The Architecture and Implementation of a High-Speed Host Interface," IEEE Journal on Selected Areas in Communications (Special Issue on High Speed Computer/Network Interfaces), 11(2), pp. 228-239 (February 1993).

[Druschel 93] Peter Druschel and Larry L. Peterson, "Fbufs: A High-Bandwidth Cross-Domain Transfer Facility," in Proceedings, Fourteenth Symposium on Operating Systems Principles (December 1993).

[Edwards 94] A. Edwards, G. Watson, J. Lumley, D. Banks, C. Calamvokis, and C. Dalton, "User-space protocols deliver high performance to applications on a low-cost Gb/s LAN," in Proceedings, 1994 SIGCOMM Conference, London, UK (1994).

[IBM 90] IBM Corporation, IBM RISC System/6000 POWERstation and POWERserver: Hardware Technical Reference, General Information Manual, IBM Order Number SA23-2643-00, 1990.

[Jacobson 90] V. Jacobson, "Tutorial Notes," SIGCOMM '90 tutorial.

[Jacobson 92] V. Jacobson, R. Braden, and D. Borman, "TCP Extensions for High Performance," RFC 1323 (May 1992).

[Kanakia 88] Hemant Kanakia and David R. Cheriton, "The VMP Network Adapter Board (NAB): High-Performance Network Communication for Multiprocessors," in Proceedings, ACM SIGCOMM '88, pp. 175-187 (August 16-19, 1988).

[Laubach 93] Mark Laubach, "Gigabit Rate Transmit/Receive Chipset: ATM Framing Specification," Hewlett-Packard, 1993.

[Partridge 93] Craig Partridge and Steve Pink, "A Faster UDP," IEEE/ACM Transactions on Networking, 1(4) (August 1993).

[Smith 93] Jonathan M. Smith and C. Brendan S. Traw, "Giving Applications Access to Gb/s Networking," IEEE Network, 7(4), pp. 44-52, Special Issue: End-System Support for High-Speed Networks (Breaking Through the Network I/O Bottleneck) (July 1993).

[Tanenbaum 88] Andrew S. Tanenbaum, Computer Networks, Second Edition, Prentice Hall, 1988.

[Thompson 78] K. L. Thompson, "UNIX Implementation," The Bell System Technical Journal, 57(6), pp. 1931-1946 (July-August 1978).

[Traw 93a] C. Brendan S. Traw and Jonathan M. Smith, "Hardware/Software Organization of a High-Performance ATM Host Interface," IEEE Journal on Selected Areas in Communications (Special Issue on High Speed Computer/Network Interfaces), 11(2), pp. 240-253 (February 1993).

[Traw 93b] C. Brendan S. Traw, "Host Interfacing at a Gigabit," Technical Report MS-CIS-93-43, CIS Department, University of Pennsylvania (April 21, 1993).

[Traw 94] C. Brendan S. Traw, "Applying Architectural Parallelism in High Performance Network Subsystems," Technical Report, CIS Department, University of Pennsylvania, 1994.

[Veen 93] J. T. van der Veen, C. Brendan S. Traw, Jonathan M. Smith, and H. L. Pasch, "Performance Modeling of a High Performance ATM Link Adapter," in Proceedings of the Second International Conference on Computer Communications and Networks, San Diego, CA (June 1993).
