Limits to Low-Latency Communication on High-Speed Networks

CHANDRAMOHAN A. THEKKATH and HENRY M. LEVY
University of Washington, Seattle

The throughput of local area networks is rapidly increasing. For example, FDDI token rings have a bandwidth that is an order of magnitude greater than the bandwidth of Ethernets, and other network technologies, such as ATM, promise an increase of yet another order of magnitude within a few years. However, in distributed systems, the latency of small-packet cross-machine communication, rather than throughput, is often of primary concern. This paper examines a number of influences on communication latency and explores the effects of newer networks and controllers on low-latency, cross-machine communication. To evaluate these effects, we designed and implemented a new RPC system targeted at providing low latency, and ported it to several hardware platforms and network combinations (DECstation 5000/200 and SPARCstation hosts; ATM, FDDI, and Ethernet networks). Our RPC system achieves user-level round-trip times of 170 microseconds on an ATM network. Comparing RPC performance on these different networks and controllers allows us to isolate the effects of the network, the controller, and the software on latency. We demonstrate that new-generation processor technology and careful software design can reduce RPC times to near the network-imposed limits; as a result, controller design becomes more crucial than ever to achieving truly low-latency communication.

Categories and Subject Descriptors: B.4.2 [Input/Output and Data Communications]: Input/Output Devices—channels and controllers; C.2.2 [Computer-Communication Networks]: Network Protocols—protocol architecture; C.2.4 [Computer-Communication Networks]: Distributed Systems—distributed applications; network operating systems; D.4.4 [Operating Systems]: Communications Management—message sending; network communication; D.4.7 [Operating Systems]: Organization and Design—distributed systems

General Terms: Design, Measurement, Performance

Additional Key Words and Phrases: ATM networks, host-network interfaces, interprocess communication, low-latency communication, network controllers, remote procedure calls, transport protocols

This work was supported in part by the National Science Foundation under grants CCR-8619663, CCR-8703049, CCR-8907666, and CCR-9200832, by the Washington Technology Center, and by Digital Equipment Corporation (the Systems Research Center and the External Research Program), Hewlett-Packard Corp., Apple Computer, and Intel Corp. C. Thekkath is supported in part by a Fulbright award; H. Levy is supported in part by a fellowship from INRIA.

Authors' address: Department of Computer Science and Engineering, University of Washington, Seattle, WA 98195.

© 1993 ACM 0734-2071/93/0500-0179 $01.50
ACM Transactions on Computer Systems, Vol. 11, No. 2, May 1993, Pages 179-203.
1. INTRODUCTION

In modern computer systems, slow communication links and high software processing overheads have imposed a stiff penalty for cross-machine communication. These high costs limit the extent to which a network of computers can be used as a solution for either performance-related or structuring-related problems in applications.

Several recent developments, however, have the potential to change the way in which networks are used:

—New local area network technologies, such as FDDI, and those using Asynchronous Transfer Mode (ATM) [2] and B-ISDN [34], offer a 10-fold bandwidth improvement over Ethernet [23]. Another order-of-magnitude improvement, to an expected 1 gigabit per second, seems close behind.

—New processor technologies and RISC architectures have already given us an order-of-magnitude improvement in processing power, with 100 MIPS processors expected within the year.

—Performance-oriented operating system research has led to extremely low-overhead communication primitives [4, 22, 27, 33]. Such low-overhead mechanisms greatly reduce the software latency that has typically limited communication performance.

Taken together, these technologies should allow us to produce distributed systems that can exploit the communication performance the new networks will be capable of providing. Low-latency communication would also allow novel distribution schemes that would not otherwise be possible. As a motivating example, consider the design of a distributed-memory paging system [11]. In a cluster of workstations, paging to the spare, idle memory of other networked nodes could be much faster than paging to disk. A distributed server could manage the idle memory in the network; on a page fault, the page fault handler would communicate with the server and transfer pages to or from remote nodes. The effectiveness of this scheme is highly influenced by the communication latency between client and server. As another example, reduced-latency communication would greatly facilitate the use of networked workstations as a loosely coupled multiprocessor or shared virtual-memory system [7, 21]. Such configurations would have significant cost-performance, flexibility, and scalability benefits over tightly coupled systems or over dedicated message-based multiprocessors (e.g., cubes). These systems are characterized by frequent exchanges of small synchronization packets and data transfers—attributes that are well served by low communication latencies.

The common thread in these examples is that latency and CPU overhead are at least as critical as network throughput. With the advent of very high-bandwidth networks, the throughput problem is relatively more tractable, leaving latency as the preeminent issue.

1.1 Paper Organization and Goals
The objectives of this paper are threefold:

(1) To evaluate the fundamental, underlying message costs for small-packet communication on new-technology networks, when compared with Ethernet technology.

(2) To describe and evaluate the design of a new low-latency RPC system in the face of different networks, controllers, and architectures.

(3) To draw lessons for the design of a new generation of network controller interfaces, based on our experience. We believe (as do others) that controllers are often poorly matched to the demands of modern distributed systems, particularly in the face of low-overhead RPC systems.

Overall, our objective is to take a system-level view of RPC communication on new-generation local area networks, examining specifically the software and hardware techniques for achieving low-latency communication. To do this, we have designed, implemented, and measured a new RPC system on several different network and workstation combinations; in particular, RPC on the MIPS R3000-based DECstation 5000/200 connected by an Ethernet, FDDI, and ATM network, and on the SparcStation 1 connected by an Ethernet.

We have used Remote Procedure Call (RPC) as the remote communication model for several reasons: (1) it is a dominant paradigm for communication in distributed applications; (2) general techniques for achieving good RPC performance are well known [5, 27]; (3) it is user-to-user, in contrast to kernel-to-kernel communication, which tends to underestimate communication latencies significantly; and (4) it is close in spirit to other communication models, such as V [8] or CSP [17], making our observations widely applicable. Our objective is not to attempt a completely new RPC design, but rather to assess the latency impact of modern networks, controllers, processors, and software. To do this required the design and implementation of a new RPC system, because existing high-performance RPC systems (such as SRC RPC [27]) do not run on the experimental hardware base we required.

The paper is organized around the objectives listed previously. Section 2 describes the network environments that we used and details the minimum cost for cross-machine communication in these environments. Section 3 examines the RPC design principles that make low-latency RPC feasible. Measurement of our RPC system on a variety of existing controllers and machines indicates how the design of the controller has a significant influence on efficient cross-machine communication. Complex host/controller interfaces appear to add to the overall overhead in two ways—latency inherent in the controller and latency required by the host software to service the controller. Finally, Section 4 examines in greater detail the implications of low-latency communication for networks, controllers, and operating system abstractions. For example, we show the latency effects of cache management, interrupt handling, and data transfer techniques.

Our experience demonstrates that new-generation processor technology and performance-oriented software design can reduce RPC times for small packets to near network- and architecture-imposed limits. For example, using DECstation 5000/200 workstations on an ATM network, we achieve a simple user-to-user round-trip RPC in 170 microseconds. Our experiments suggest that the software overhead contributed by marshaling, stubs, and the packet exchange protocol need not be the bottleneck to low-latency remote communication on modern microprocessors. However, as software overheads decrease, network controller design becomes more crucial to achieving truly low-latency communication.
2. LOWER BOUNDS FOR CROSS-MACHINE COMMUNICATION

This section evaluates the overhead added to cross-machine communication from two important sources: (1) the network hardware (the communications link) and (2) the controller, memory, and CPU interfaces. This allows us to examine our RPC software (the stubs, protocol, and runtime support) relative to the lower-bound costs for cross-machine communication on our hardware. Given different network and processor technologies, the relative importance of these components may change; by isolating the components, we can see how the performance of each layer scales with technology change. Similar measurements have been reported in the past [18, 27]. However, we wish to extend those measurements to newer high-speed networks and controllers, and to compare them with the Ethernet. In the first subsection below we briefly characterize our hardware: the various networks and the specific controller used to access each network. The second subsection describes our experimental testbed and measurement methodology. Then we analyze our performance results and discuss issues that specifically impact latency in cross-machine communication.

2.1 Overview of the Networking Environment

This subsection summarizes the capabilities of the specific networks we used and the particular controller used to access each network.

Ethernet. The Ethernet is a 10 Mbit/sec CSMA/CD local area network [1], which is accessed on both the DECstations and the SparcStations by a LANCE controller. However, the controller is packaged differently on the two machines. On the DECstations, the controller cannot do DMA to or from host memory; instead, it contains a 128-Kbyte on-board packet buffer memory into which the host places a correctly formatted packet for transmission [14]. Similarly, the controller places incoming packets in its buffer memory for the host to copy. In the case of the SparcStation, the controller uses DMA to transfer data to and from host memory. In this case, a cache flush operation is done on receives to remove old data from the cache. On both machines, packets are described by special descriptors initialized by either the host (on send) or the controller (on receive). Descriptors are kept in host memory on the SparcStations, while they are in the special on-board packet memory on the DECstations. Two message sizes were used in our experiments, a minimum-sized (60 bytes) send and receive, and a maximum-sized (1514 bytes) send and receive.
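To make this descriptor-based host/controller interface concrete, the following minimal C sketch shows the send path for a buffer-based controller of the kind used on the DECstations: the host copies a fully formatted frame into on-board packet memory, fills in a transmit descriptor, and then tells the controller to start. All structure layouts, register names, and addresses here are hypothetical illustrations, not the actual LANCE programming interface.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical transmit descriptor, kept in on-board packet memory on the
 * DECstation-style controller (in host memory on the SparcStation). */
struct tx_desc {
    volatile uint32_t buf_offset;   /* where the frame lives in board memory */
    volatile uint16_t length;       /* frame length in bytes */
    volatile uint16_t flags;        /* OWN bit: set when controller owns it */
};

#define DESC_OWN_CTRL  0x8000       /* descriptor handed to the controller */

/* Hypothetical memory-mapped regions and registers (set up at attach time). */
static volatile uint8_t  *board_pkt_mem;   /* 128-Kbyte on-board packet buffer */
static volatile struct tx_desc *tx_ring;   /* transmit descriptor ring */
static volatile uint32_t *csr_demand_tx;   /* "transmit demand" control register */

/* Send one pre-formatted Ethernet frame (headers already built by the host).
 * Programmed I/O copy into board memory, then descriptor hand-off. */
int buffer_controller_send(const void *frame, uint16_t len, unsigned slot)
{
    volatile struct tx_desc *d = &tx_ring[slot];

    if (d->flags & DESC_OWN_CTRL)
        return -1;                  /* ring slot still owned by the controller */

    /* 1. Copy the frame over the host bus into on-board packet memory. */
    memcpy((void *)(board_pkt_mem + d->buf_offset), frame, len);

    /* 2. Describe the packet and pass ownership to the controller. */
    d->length = len;
    d->flags  = DESC_OWN_CTRL;

    /* 3. Kick the controller; it moves the data from board memory to the wire. */
    *csr_demand_tx = 1;
    return 0;
}
```

The copy in step 1 is the extra trip over the host bus that distinguishes buffer-based controllers; on the SparcStation the same descriptor hand-off occurs, but the copy is replaced by a controller-initiated DMA transfer described by descriptors in host memory.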
FDDI. FDDI is a 100-Mbit/sec fiber token ring, accessed on the DECstation by the DEC FDDI controller. Like the DECstation Ethernet controller, the FDDI controller cannot perform DMA from host memory on message transmission; instead, it relies on an on-board private buffer memory. However, the FDDI controller can perform DMA transfers directly to host memory on reception of a packet from the network. The controller and host software share descriptors as described above for the DECstation Ethernet. We used an unloaded token-passing ring with two hosts. Thus, the overhead due to token passing is kept to an absolute minimum; in a more realistic environment, token-passing delay would have to be added to the overall latency. Packet sizes of 60 and 1514 bytes were chosen to facilitate direct comparison with the Ethernet.
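For contrast with the copy-based send path sketched above, the sketch below illustrates the receive side of a DMA-capable controller of this kind: the host posts empty host-memory buffers through a receive descriptor ring, and the controller DMAs each incoming packet directly into the next posted buffer. Again, every name and layout is a hypothetical illustration rather than the DEC FDDI or LANCE interface, and the cache-invalidation step stands in for the cache flush described earlier for the SparcStation.

```c
#include <stdint.h>
#include <stddef.h>

#define RX_RING_SIZE  16
#define RX_BUF_SIZE   4500           /* large enough for a maximum FDDI frame */
#define DESC_OWN_CTRL 0x8000

/* Hypothetical receive descriptor, kept in host memory and shared with the
 * controller. */
struct rx_desc {
    volatile uint32_t buf_phys;      /* physical address of the host buffer */
    volatile uint16_t length;        /* filled in by the controller on receive */
    volatile uint16_t flags;         /* OWN bit: set while controller owns it */
};

static struct rx_desc rx_ring[RX_RING_SIZE];
static uint8_t rx_buf[RX_RING_SIZE][RX_BUF_SIZE];
static unsigned rx_next;             /* next descriptor the host expects */

/* Stand-in for the platform's cache-control primitive: real code would
 * invalidate the cache lines covering [addr, addr+len), since the controller
 * wrote host memory behind the cache's back. */
static void cache_invalidate(void *addr, size_t len) { (void)addr; (void)len; }

static void rx_ring_init(void)
{
    for (unsigned i = 0; i < RX_RING_SIZE; i++) {
        /* A real driver would install the buffer's physical address here;
         * the cast is only a stand-in for that translation. */
        rx_ring[i].buf_phys = (uint32_t)(uintptr_t)rx_buf[i];
        rx_ring[i].length   = 0;
        rx_ring[i].flags    = DESC_OWN_CTRL;   /* controller owns all buffers */
    }
}

/* Called from the receive interrupt handler: hand each filled buffer to the
 * protocol layer, then recycle the descriptor back to the controller. */
static void dma_controller_rx_poll(void (*deliver)(void *pkt, size_t len))
{
    while (!(rx_ring[rx_next].flags & DESC_OWN_CTRL)) {
        struct rx_desc *d = &rx_ring[rx_next];

        cache_invalidate(rx_buf[rx_next], d->length);
        deliver(rx_buf[rx_next], d->length);

        /* Give the buffer back to the controller for the next packet. */
        d->length = 0;
        d->flags  = DESC_OWN_CTRL;
        rx_next = (rx_next + 1) % RX_RING_SIZE;
    }
}
```

The descriptor recycling keeps the controller supplied with buffers without per-packet allocation, while the invalidation step is the memory-system cost that Section 4 returns to when weighing DMA against programmed I/O.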
ATM. ATM (Asynchronous Transfer Mode) is an international telecommunication standard used to implement B-ISDN. In an ATM network, data is exchanged between entities in fixed-length parcels called cells, usually on the order of a few tens of bytes. An ATM network typically consists of a set of hosts connected by a mesh of switches that form the network. In an ATM network, user-level data is segmented into cells, routed, and then reassembled at the destination using header information contained in the cells. The particular ATM we used has 140-Mbit/sec fiber optic links and cell sizes of 53 bytes, and is accessed using FORE Systems' ATM controller [16]. The controllers on the two DECstation hosts were directly connected without going through a switch; thus there is no switch delay. Unlike the Ethernet and FDDI controllers, the ATM controller uses two FIFOs, one for transmit and the other for receive. The controller has no DMA facilities. The host simply reads/writes complete cells by accessing certain memory locations that correspond to the FIFOs. The host is notified via interrupt when cells arrive in the receive FIFO. The host has considerable flexibility in choosing how often it should be interrupted. Further, the controller does not provide any segmentation or reassembly of cells; that is the responsibility of the host software. Each ATM cell carries a payload of 44 bytes; in addition, there are 9 bytes of ATM and segmentation-related headers and trailers.
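Because this controller exposes only raw cell FIFOs, segmentation is plain host software: the sender slices a packet into 44-byte payloads, prepends the per-cell header bytes, and writes each complete cell to the transmit FIFO with programmed I/O. The sketch below illustrates that structure; the FIFO address, header layout, and framing details are simplified stand-ins rather than the FORE controller's actual interface.

```c
#include <stdint.h>
#include <string.h>

#define CELL_PAYLOAD  44             /* user bytes carried per cell            */
#define CELL_OVERHEAD  9             /* ATM + segmentation headers/trailers    */
#define CELL_SIZE     (CELL_PAYLOAD + CELL_OVERHEAD)   /* 53 bytes             */

/* Hypothetical memory-mapped transmit FIFO, mapped by platform
 * initialization (not shown): each 32-bit store pushes four more bytes of
 * the current cell into the controller. */
static volatile uint32_t *tx_fifo;

/* Build the 9 bytes of per-cell framing. A real header carries the virtual
 * circuit identifier and segmentation state; this stand-in records only a
 * packet id, the cell index, and an end-of-packet marker. */
static void build_cell_header(uint8_t *hdr, uint16_t pkt_id,
                              uint16_t cell_ix, int last)
{
    memset(hdr, 0, CELL_OVERHEAD);
    hdr[0] = (uint8_t)(pkt_id >> 8);
    hdr[1] = (uint8_t)pkt_id;
    hdr[2] = (uint8_t)(cell_ix >> 8);
    hdr[3] = (uint8_t)cell_ix;
    hdr[4] = last ? 1 : 0;
}

/* Segment one packet into cells and push them to the FIFO with programmed I/O. */
void atm_pio_send(const uint8_t *pkt, size_t len, uint16_t pkt_id)
{
    uint8_t  cell[CELL_SIZE];
    uint16_t cell_ix = 0;

    while (len > 0) {
        size_t chunk = len < CELL_PAYLOAD ? len : CELL_PAYLOAD;

        build_cell_header(cell, pkt_id, cell_ix, chunk == len);
        memcpy(cell + CELL_OVERHEAD, pkt, chunk);
        memset(cell + CELL_OVERHEAD + chunk, 0, CELL_PAYLOAD - chunk);

        /* Word-at-a-time stores: 53 bytes round up to 14 FIFO words. */
        for (size_t i = 0; i < CELL_SIZE; i += 4) {
            uint32_t w = 0;
            memcpy(&w, cell + i, (i + 4 <= CELL_SIZE) ? 4 : CELL_SIZE - i);
            *tx_fifo = w;
        }

        pkt += chunk;
        len -= chunk;
        cell_ix++;
    }
}
```

The receive side mirrors this loop: the interrupt handler drains the receive FIFO a cell at a time and uses the header to reassemble packets, which is the per-cell software cost quantified later in the paper (about 11 microseconds per cell for the implementation measured there).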
In our experiment we chose packet sizes that were an integral number of cells, as well as being close enough to the Ethernet and FDDI packet sizes for
comparison. 2.2 The Testbed In order
and Measurement
to isolate
built a minimal on the network.
the performance
stand-alone testbed, There is no operating
software is executing workstations (either through processor
an isolated rated
at
Methodology of the controller
which simply sends and receives packets system intervention since only minimal
on each machine. two DE Citations network. 18.5
The
SPECmarks,
and the network
The testbed hardware consists of two or two SparcStations) connected
DECstation and
the
uses
a 25 MHz
SparcStation
MIPS
I uses
R3000 a Spare
processor rated at 24.4 SPECmarks. The DECstations were connected in turn to an Ethernet, an FDDI ring, and an ATM network. The SparcStations were connected to an Ethernet. The DECstation network devices are connected on the 25 MHz TURBOChannel [13] while SparcStations use the 25 MHz SBUS [29]. We measured
the performance ACM
of each configuration
Transactions
on Computer
Systems,
in sending
a source
Vol. 11, No. 2, May 1993.
184
C. A, Thekkath and H, M.
.
packet from one node to the Packets are sent and received data
over the host
are enabled, packet dedicated
the
While
fashion
other from
bus is included
so both
arrival.
Levy
and
the
is generally
by disabling
in receiving a packet in response. memory, so the cost of moving the
in our measurements.
sender
it
and host
receiving
possible
network
Network
hosts
to
access
interrupts,
provide
periodic
interrupts.
These
on the DECstations (about SparcStations (once every
were
on the
Wire.
the propagation
This
time
clock
set to provide
chips
interrupts
Measurements trip
can be broken
is the transmission
because
were
time
it is negligible
on
network
in
a
time-sharing
once every 244 microseconds) and 625 microseconds). No significant
involved in fielding a timer interrupt. least 10,000 successful repetitions. The component costs for the round —Time
the
conventional
access will involve interrupt-processing overheads. Both the DECstation and the SparcStation have
interrupts
are interrupted
that
can
at 4096
Hz
1600 Hz on the processing is averaged
over
at
up as follows:
of the packet.
for the length
We ignore
of cable
we are
using. —Controller the sending has made the data
Latency. controller the data
This is the sum of two times: (1) the time taken by to begin data transfer to the network once the host available
at the receiving
to it, and (2) the delay
controller
and the time
between
the arrival
it is available
of
to the host.
—Control / Data Transfer. Data has to be moved at least once over the host bus between host memory and the network. Some of our controllers use DMA incurs transfer describe the
to move
no
data
data
over
transfer
the
host
overhead.
bus to the
However,
network;
the
overhead because it has to use special the location of the data to the controller.
actual
Latency
time
to do the
data
transfer
CPU
thus incurs
the
CPU
a control
memory descriptors to With such controllers,
is captured
in
the
Controller
item.
—Vectoring the packet
the Interrupt. arrival interrupt
The overhead —Interrupt
involved
Service.
some essential
On the receive side, host to the interrupt handler
is a function On taking
controller-specific
of the CPU
an interrupt, bookkeeping
software must vector in the device driver.
architecture.
host
software
before
the
must interrupt
perform can be
dismissed. 2.3
Performance
To determine
Analysls the latency
of the controller
between a pair of hosts controller’s status registers of data, expected simply
we ran separate
experiments
host polled the indicated arrival
a new transmission was begun. In the cases where the to copy data to the controller’s memory before transmission, programmed
giving it any data. data for the host ACM
itself,
with interrupts disabled. Each in a loop. As soon as the register
Transactions
the controller Similarly, to copy,
on Computer
to start
the data
transfer,
without
host was the host actually
when the controller indicated the arrival of new the host ignored the data and began the next
Systems,
Vol.
11, No. 2, May
1993.
Low-Latency Communication transfer.
In
started.
addition,
In these
by the host
descriptors
circumstances
per round
trip.
were very
on High-Speed prefilled
before
few machine
The time
required
Networks the
data
instructions
to execute
185
.
transfer
are executed
these
as well
as the
time This
spent on the wire was subtracted from the total measured round trip. method gives satisfactory results in most cases but has the disadvantage
that
it does not account
This
artifact
multicell
for any pipelining
is particularly
packets
transmission
are exchanged.
of data
transmission instance, in
visible
between
that
in the Typically,
its
the controller
a controller
internal
buffer
the
perform.
controller
chip
and
between host (or on-board memory) sending a multicelled packet through
host can fill the FIFO with a cell while from the FIFO onto the network.
might
case of our ATM
when
can overlap
network,
the
with
the
and the chip itself. the ATM controller,
the controller
sends
For the
the previous
cell
Table I shows the cost of round-trip message exchanges on the host/network combinations described above. A few points of clarification are in order here. First, which
in the
case of the
perform
flushing
the
overhead. from
DMA cache
While
higher-level
is kept
software
data. Second,
arrived
only
two interrupts. row
in the named
Most of the microseconds have
cache flushes
network,
it interrupted
Sum
of
in
the
necessary
locations,
means
this
Table
reported
include
In addition, each
is therefore
is the
if data that
the
cost of accessing
I does not
experiments,
cost of
Service
are not strictly
packets.
Components
the
Interrupt
the host only when
in our
controller,
receives,
pay a performance
multicell
Thus,
SparcStation
sum
this
the cost
the controller
a complete
packet
round
incurs
trip
a lower
of the
bound.
rows
above
it.
time, the sum of our component measurements is within 1–9 (about 270) of the observed round-trip time. The only exception
case of the
transfers
2.3.1
that
The performance
overestimated
ATM
network
in sending
the round-trip
time
of the amount
of overlap
Latency
and
on the
packets,
12%. The most
between
where likely
we
cause
the host memory–FIFO
transfer. It
Latency.
to Time
multicell
by about
and the FIFO-network
Throughput
Controller
is included
eventually
FIFO.
is an underestimate data
DMA
the
on packet
in uncached-memory
will
so that
had
and
memory
the
and reassembling
was programmed
is in the
controller
to host
in the case of the ATM
for segmenting
The
after
it is true
the network
FDDI
directly
is
interesting
Wire.
to
For small
compare
packets,
the
this
ratio
is 0.4 on
the DECstation Ethernet, 0.8 on the SparcStation, 7 for the FDDI controller, and 3.2 for the ATM network. For larger packets of approximately 1514 bytes, these ratios are respectively 0.02, 0.04, 0.9, and 1.0. While these numbers are specific to the controllers we use, we believe they are indicative of a trend: The
tant
bandwidths
packet
difference
exchange
between
are improving times
on
throughput
packets is the goal, then exchange on our DECstation
dramatically
Ethernet,
we can Ethernet
and
FDDI,
latency.
while and
If
latencies
ATM
show
low
latency
achieve a round-trip in only 253 pseconds;
ACM TransactIons
on Computer
Systems,
are not. the
for
impor-
small
60-byte message FDDI on similar Vol. 11, No. 2, May 1993,
186
.
C. A. Thekkath and H. M. Levy Table
I.
Hardware-Level
Round-Trip
Component Ethernet (DIK) 60/60 1514/1514 Time
on the Wue
Controller
Latency
Control/Data Vecmrmg
Transfer the Intemupt
Interrupt
Service
Sum of Components
Round Tnp Time
Measured
hardware,
despite
single
cell
lower
bound
checking
in ,useconds
ATM (DEC) 53/53 1537/1537
2442
115
2442
13
245
6
176
53
89
103
97
230
16
161
40
600
6
6
40
253
17
45s
25
25
12
12
25
25
25
25
26
26
42
42
92
140
9
20
257
3146
264
2605
267
893
73
840
253
3137
263
2611
263
894
73
7.$6
10-fold
bandwidth
is
in
about
because
whether
Times
51
which
transfers
Exchange
115
its
the same operation,
Packet
Round-Trip Time (f(seconds) Packet Size in bytes (serrdk’ecv) FDD1 (DEC) Etbcrnct (Spare) 60/60 1514/1514 60/60 1514/1514
we
longer.
4~0
73
have
the cell is part
advantage, The ATM
~seconds,
ignored
takes network
However
switching
of a larger
message
263
pseconds
is capable
this
delays, that
for
of doing
is an optimistic and
the
needs
cost
of
fragmenta-
tion/reassembly. On the other hand, the high bandwidth of ATM and FDDI is significant for large packets. For example, latency for a 1514-byte packet on the DECstation Ethernet
is 4.2 times
packets
even
larger
worse than
than 1514
ATM
and
3.5 times
worse
bytes,
the
Ethernet
situation
than
FDDI.
For
is relatively
worse, because FDDI will require fewer packet transmissions. The ATM comparison is slightly more complex; as pointed out earlier, we have ignored ATM segmentation and reassembly costs in arriving at the figures in Table I. The particular software implementation of the segmentation/reassembly we use requires
about
11 ~seconds
per cell.
If this
is included,
then
a 1537-byte
packet (29 cells) takes about 1065 ~seconds, which is about 1.2 times the FDDI time. The situation further improves in favor of FDDI for packet sizes beyond
this
2.3.2
limit.
Controller
Structure
and
Latency.
As noted
earlier,
both
the
DEC-
station and the SparcStation use the same Ethernet controller. However, their performance is not identical. Recall that on the SparcStation the controller uses DMA transfers on the host bus. The overhead for this is included in Controller Latency; the Control / Data Transfer costs include
only
the
cost
of the
instructions
to program
troller is able to overlap the data transfer the network. Thus the sum of Controller Transfer is relatively unchanged by the
the
controller.
The
con-
over the bus with the transfer on Latency and Control/ Data packet size. In contrast, the DEC-
station controller incurs a heavy latency overhead because the host first has to copy the data over the bus before the controller can begin the data transfer. Consequently, for larger packets, the SparcStation controller performance is likely to be better even though for small packets both have comparable round-trip times. Controllers that use on-board packet buffers instead of DMA or FIFO will generally incur this limitation. However, controllers with DMA or FIFO are not without problems; we will defer a more detailed discussion ACM
until
TransactIons
Section on Computer
4. Systems,
Vol.
11, NrI
2, May
1993
Low-Latency Communication It
is
also
interesting
to compare Both
various controllers. interrupt-handling
times.
the The
on High-Speed
the
Networks
interrupt-servicing
Ethernet
latency
controllers
SparcStation
figure
.
have
is slightly
187 on the
comparable
higher
because
a cache flush operation is included in the cost due to the DMA transfer. except the ATM controller use descriptor-based interfaces between the and the interface. than
Programming
the simpler
FIFO
this
interface
requires
many
on the ATM.
more
The FDDI
All host
accesses to memory controller
is the most
complex one to service and is 7– 10 times as expensive as the ATM. While figure is for a specific pair of controllers, we believe that FIFO-based trollers
will
in general
reduce
Our experiments with lowered latency, simple the more
traditional
of controllers
programming
overhead.
low-level message passing seem to suggest that for FIFO-based controllers have some advantages over
types
of controllers.
on user-to-user
However,
cross-machine
we describe the impact
a higher-level of controllers
message-passing
performance,
user-level
to
with
concerned
additional
ultimately
it is the impact
communication
the next section, RPC and explore be
this con-
that
communication in this context. cross-machine
issues
like
is crucial.
system Unlike
communication
protection
and
In
based on low-level has
input
packet
demultiplexing. 3, THE DESIGN,
LOW-LATENCY The
IMPLEMENTATION, RPC
performance
system
of
the
hardware
are two of the three
If the
performance
poor,
this
twofold. systems
of the
can easily
and
message-passing
components third
dominate
Second,
of the
low-level
of user-level
component, the overall
we wish
used
by higher-level
software
ture
of the network
controller.
the
high-level
how
might
system,
is
section
is
of this
exists for building RPC in the previous section
some of the low-latency these
latency.
RPC
cost. The purpose
to outline and
communication
interact
techniques
with
the
struc-
Design and Implementation
The RPC system tion
OF A
First, we wish to show that the technology that are so efficient that the costs described
are significant.
3.1
AND PERFORMANCE
[30]
[27] built
has demonstrated
optimization. Our where performance tions
system
system
differs
(1) We differ performed: promising RPC. (2) Our
system
for the DEC the
SRC Firefly
latency-reducing
multiprocessor
effect
of a large
RPC system closely follows the SRC is a goal, many details in the structure
must
be
from
SRC RPC in several
dictated
by
in the way stubs we use a scheme protection differs
kernel
in the way but ACM
hardware
of
design; however, of a communica-
environment.
Thus,
our
respects:
are organized that minimizes
between
(3) We do not use UDP/IP,
the
workstanumber
and
in which
and the copying user
control
instead
use network
Transactions
on Computer
way marshaling is costs without com-
spaces
as is done
transfer
is done.
datagrams Systems,
in SRC
directly.
Vol. 11, No. 2, May 1993.
188
.
These
C. A. Thekkath and H M. Levy
differences
Our
are described
low-latency
ning
the Ultrix
The
system
RPC operating
is
in assembly RPC
system
is entirely
during
face previously common byte
the
binding
by the
to impact
Multipacket
SunOS other
that
is linked
process,
server. transfers
code. The
the user’s
address
on different
imports
the
inter-
optimized
machines
for the
with
the same
code path
so as not
is reduced
by ensuring
that only the client, but not the server, swaps byte order when needed. The underlying communication in the RPC relies on a simple request response
packet
however,
there
beyond what data copying,
exchange
similar
are
fundamental
three
is required for con trol transfer,
to that
described aspects
dis-
the need to program
specially
overhead
or
high-quality
by a separate
case. Byte-swapping
systems
networked
client
between
run-
SunOS.
operating
into
the
We have
RPCS are handled
the common
not felt
generate
5000
I running
that is integrated into the kernel. clients and servers can be placed
case [4] of single-packet
order.
and
in C; we have
subsections.
DECstation
impacting
our compilers
runtime
exported
on the
Ultrix
component
space and another component Like other RPC systems, machines;
the without
because
has a runtime
in the following
and on the SparcStation
into
machines
language,
detail
is prototype
system
integrated
executing on these tributed services. Our implementation
in more
system
in Section
to RPC
and
2. In addition,
that
add
overhead
a simple message exchange: marshaling and and protocol processing. Our system achieves
its low latency by optimizing are somewhat interrelated, and
each of these areas. The optimizations we consider them in turn in the following
subsections. 3.1.1 simple specific
Marshaling procedural and depend
procedures
marshal
and
Data
interface
Copying.
for the
client
on the particular the
call
RPC
stubs
and
server.
service
arguments
into
function and
out
create Stubs
the
illusion
of a
are application-
that
is invoked.
of the
message
These packet.
Even in highly optimized RPC systems such as SRC RPC, marshaling time is significant. Marshaling overhead depends on the size and complexity of the arguments. Typically the arguments are simple—integer, booleans, or bytes; more involved data structures are used less frequently [4]; therefore simple byte copying is sufficient. Related to the cost of marshaling
is the cost of making
the network
packet
available to the controller in a form suitable for transmission. A complete transmission packet contains network and protocol headers, which must be constructed by the operating system, and user-level message text, which is assembled by the application and its runtime system. There are several strategies for assembling the packet for transmission, with the cost depending on the capability of the controller and the level of protection required in the kernel. With a controller that does scatter-gather DMA over the bus, like the SparcStation controller, the data can be first marshaled in host memory by the host (in one or more locations) and then moved over the bus by the controller. This is the scheme we use on the SparcStations—the data packet comes from user space, and the header is taken from kernel space, a techACM
Transactions
on Computer
Systems,
Vol. 11, No. 2, May
1993
Low-Latency Communication nique way
commonly
used
SparcStation
and from
in such
DMA
a fixed
range
on High-Speed
an environment
is architected, of kernel
the
virtual
is more
expensive
than
our RPC design uses mapping One general drawback with
addresses.
accesses packet
of the data and
Likewise,
again given
performs
This
the
when
the
controller with
copying
it
of the user
over
to the
approach is to relax kernel/user user space and allow the user copies
to one. However,
there
controller
ment-marshaling into
the
To do this
kernel
generate
and
optimized
rather
executed.
to the
Code for
network.
or a FIFO, into
two
technique
Another
retains
synthesis
specific
which
has been
situations
all the
used
argu-
space,
is then in the
to achieve
and
buffer into number of
address
code on the fly,
the
a buffer
copies.
that
in the user’s
two
of the
the cost of two copies. call arguments, we perform
than
we synthesize
routines
the data requires
is an alternative
in the kernel
conventional.
pieces
conthis
at least
the pieces
memory
to
a threshold,
protection and to map the packet direct access. This reduces the
benefits of the protection without incurring In an effort to minimize copying of the
only mapping
accessible to the primitives for
it involves
the
on-board
marshaling
due to the
necessitates
the host builds
transfers
special
189
DMA
sizes below
is that
over the bus: once when
technique
kernel
for packet
only selectively. a DMA scheme
a controller
straightforward
copying
.
[18, 25]. However, controller
the user’s data buffer into kernel addresses that are troller. Since the cost of using SunOS virtual-memory mapping
Networks
as is linked
past
high
to
perfor-
mance [20, 22]. Our focus is slightly different: we are more concerned with avoiding the copy cost than with generating extremely efficient code for a special situation. At bind time, when the client imports the server’s interface, the client calls into
the
directly
kernel
pointers
1 As only
procedure instructions. each input
a template
of the
simple-valued
to bytes.
procedure. involves
with
supports
Using
this
mentioned
assignments
is nothing
earlier,
more
than
passed
procedure.
as words, kernel
assembling
the task
the
procedure
contains This
right
simple
synthesized
code to check
procedure
and a
of primitive the validity
has the benefit
and reply are known in advance, the can be avoided if arguments and results
the
and
of synthesizing
sequence
approach
kernel
bytes,
a marshaling
is typically
Thus,
The
halfwords,
synthesizes
marshaling
copying.
at runtime.
since the sizes of the request general multipacket code path a single network packet. The kernel then installs
the
the
and byte
marshaling
such
template
The marshaling argument
types
as a system
of that
more fit in
call
for
subsequent use by this specific client. Thus, the stubs linked into the user’s address space do not do marshaling; they merely serve as wrappers to trap into the kernel where the marshaling is done. A client RPC sees a regular system call interface with approach has the benefit required
without
1 We do not yet have
all the usual of performing
compromising
a template
the
compiler, ACM
protection rules the minimum
safety
of a “firewall”
and so our kernel
Transactions
that go with it. This amount of copying
stubs
on Computer
between
are currently
Systems,
hand
the
user
generated.
Vol. 11, No. 2, May 1993.
190
.
C. A. Thekkath and H, M, Levy
and the kernel, or the user and the network controller. impose the overhead of probing the validity of pointers copied,
but
this
In addition RPC
system
is not a significant scheme,
provides
another
interface
into
a specialized
case. Instead
of calling
which
client calls into a fixed-kernel tors. Each descriptor describes type—IN descriptors space ing
and the
done using cally
handles
controller’s
exports
memory
compromising
this
several
cases, our general
procedure,
the
The kernel can use these directly between the user
thereby
eliminating
Unmarshaling
on the
server
above.
A server
interface
with
received packets. We have used kernel-level
common
for a more
kernel-marshaling
or FIFO,
procedures
the kernel
of the
is designed
supported. arguments
safety.
general-purpose
provides
RPCS.
most
that
scheme does data can be
entry point, passing an array of data descripa primitive data type (including its parameter
or OUT) that is directly to marshal and unmarshal
cost without
server
cost for most
to this
Our before
as described
with
different
a generic
types
template
marshaling
that
of
extra
is
typi-
arguments.
applies
and unmarshaling
copyside
The
to all types
with
of
the DECsta-
tion Ethernets, but a hybrid scheme is used with the FDDI controller. On transmissions the usual scheme is used, but on receives, the controller’s DMA engine copies the data over the host bus and hands it to the kernel. The kernel unmarshals the data either by copying or by virtual-memory mapping if the
alignments
controller. driver
performs
could
read
required, fact,
are
In this
suitable.
case, before
reassembly
only
the
A similar the
RPC
from
the
network
accessed
method, but reassembly is such ATM that we did not feel justified
is used
receives
in a page-aligned
header
and if not let the RPC layer
in a non-ATM
approach
layer
buffer.
FIFO
and
the data
via FIFOS,
this
and this
the the
Furthermore,
determine
unmarshal
a common in making
with
the packet,
ATM device
the driver
if reassembly from
would
the FIFO.
is In
be the preferred
frequent operation optimization.
on the
3.1.2 Control Transfer. Researchers have reported that context switching causes a significant portion of the overhead in RPC [27]; in addition, there is a substantial impact on processor performance due to cache misses after a context
switch
switching and finally the
[24].
the client switching
server
out—can
An
RPC
call
typically
requires
four
context
out, switching the server in, switching the client back in. Two of these—switching be
overlapped
with
the
transmission
the
switches:
server out, the client or
of the
packet.
Systems with high-performance RPC usually have lightweight processes that can be context switched at low cost, but unless there is more work to do in the client and the server, or no work elsewhere, a process context switch usually occurs. Both the DECstation that can be significant context
switches,
our
and the SparcStation have context-switching times to the latency of a small packet. To reduce the cost of RPC
system
defers
blocking
the
client
thread
on the
call. Instead, the client spin-waits for a short period before it is blocked on the sleep queue. If the service response time is very small, the reply from the server is received before the client’s spin-waiting period has expired. When ACM
TransactIons
on Computer
Systems,
Vol
11, No
2, May
1993
Low-Latency Communication there
is no other
no penalty done,
work
to be done,
to spinning
the
caller
the caller
spins
for
i.e., when
round-trip
time
are
be to block
a short
greater
than
otherwise. using
available.
the
time
relative
An estimate
an estimate.
hint,
In general,
are processes
control
trip
technique
queue
switch
a thread
This
leads
often
with
In
contrast,
packet
directly
is done
from
layer.
performance.
the lowest
directly
within
the
Protocol
Processing.
layer
trip
is
and to spin statically
response
by
times
for latency
as
if there
protocol
handler,
layering.
The
the incoming packet on the interrupts are then used to to a modular we
to the destination
interrupt
scheme
round
either
past
from
wake
in that
is
to be
turn.
arises to queue software
of control
there quantum
penalty
throughput
traditional approach is for each layer input queue of the next higher layer; unacceptable
191
work
current
expected
be obtained
their
overhead
to the
by using
trades
waiting
transfer
is empty, is useful
round-robin
if the
could
or dynamically,
this
on the run
Additional
extension
to the context
of the round
a user-supplied
to its
spinning
related
there
.
of the server ensures that the process can be improved if estimates of the
without
some threshold
Networks
queue
When
A simple
caller
the run
indefinitely.
before blocking. In most cases the low response time is never put to sleep. This approach would
on High-Speed
try
approach
but
to dispatch
the
process.
yielding
This
a path
dispatch
of very
low
latency. 3.1.3 nate
communication
overhead
of protocol
processing
costs if general-purpose
The
protocols
are used
the usual case of a homogeneous and low service response times, employed to optimize There are typically protocol
like
specialized procedure
call semantics
Several multiple
that
protocol
aspects layers
increasing
layers
provides
to provide
in RPC
a basic a close
systems:
unreliable
a transport-level transport,
approximation
and
a
of conventional
for RPC.
of protocols
of protocol
the number
In
environment, with frequent remote requests special-purpose protocols can be effectively
the latency. two protocol
UDP/IP RPC
can domifor RPC.
contribute
tend
to RPC latency.
to add to the overhead
of context
switches
As mentioned of RPC
and increasing
above,
in two
the number
ways: of data
copies between layers. Further, the primary cost of using protocols such as UDP/IP is the cost of checksums. Calculating checksums in the absence of hardware support involves manipulating a packet as a byte stream; this can nullify any advantage gained by the controller or host processor in assembling the packet using scatter-gather or wordlength operations. For efficient RPC implementations, then, the checksum must be either calculated in hardware themselves transmission
or made
optional.
to the former medium
Most
conventional
approach.
The latter
is reliable
and that
protocol approach
the transport
formats
do not lend
presupposes protocol
that
is used
the only
for routing, not for reliability. These factors argue for the use of a simpler protocol and hardware checksumming. Our RPC uses a simple and efficient unreliable transport protocol and relies on a specialized RPC protocol to provide robustness, which is similar to ACM
Transactions
on Computer
Systems,
Vol. 11, No. 2, May 1993.
192
.
C. A. Thekkath and H. M, Levy
that used by the protocols reflects the
SRC or Xerox [5] RPC systems. In general, the choice of a set of assumptions about the location of clients and
servers
and
servers
are expected
area networks heavy
loads
error
characteristics
have good packet due to overruns.
the same LAN,
raw
by the controller. controller. If the
of the
to be on the same local
network.
loss characteristics,
In the case when
network
datagrams
Typically
area network.
clients
dropping
packets
the RPC destination
are used with
and
Furthermore,
local at only is within
the checksum
provided
An erroneous packet is simply dropped by the receiving target is not on the local network, it is a straightforward
extension to use UDP/IP without checksums. The choice can be made at bind time when the client/server connect is established, and the marshaling code can be generated In addition protocol
the appropriate imposed
per se adds to the latency.
to provide achieve
to include
to the overhead
a natural
high
The primary
set of semantics
performance,
header.
by transport-level reminiscent
our protocol
on the critical fast-path of the code. Under normal error-free operation,
protocols,
purpose of simple
was implemented the server’s
of RPC
the RPC protocols
procedure
is
call.
To
so as not to intrude
response
to a call from
the
client acknowledges it at the client end. Similarly the next call from the client acknowledges the previous response. The state of an RPC is maintained by the client and server using a “call/response” record. Call records are preallocated at the client at bind time. These contain a header, most of whose fields do not change the latency
with
side and contain In
order
packets
each call.
at call time. to
header recover
on the client
These
Similarly,
are therefore
response
information from
from
dropped
and the server
prefilled
records a previous
packets,
call that
our
RPC
side. On the client
blocked for the duration of the call, retransmission the client address space, as with the original call.
so as to minimize
are retained
at the server can be reused.
transport
buffers
side, since the client
is
proceeds from the data in Thus no latency is added
due to buffering when the call is first transmitted. On the server side, the reply is nonblocking; hence a copy of the data has to be made before returning from the kernel. The copy is overlapped with the transmission on the network. Once again no latency is added to the reply path. One alternative to this would be to use Copy-on-Write, which would not affect latency but can potentially 3.2
save buffer
RPC Performance
space for large
multipacket
RPCS.
Measurements
This section examines the performance of our RPC implementation. Our goal is to show that structuring techniques, such as those we have used, yield an RPC software system so effective that the hardware costs shown in Table I are significant. We gather the data to support our analysis of controller design in the next section. Table II shows the time in microseconds for RPC calls on the various platforms.
Two
procedures
takes two integer two arguments—an an integer ACM
result.
Transactions
called
Minus
and
MaxArg
were
timed.
Minus
arguments and returns an integer result. MaxArg takes integer parameter and a variable-length array, returning The exact
on Computer
sizes of the packets
Systems,
Vol. 11, No, 2, May
exchanged 1993
varies
depending
Low-Latency Communication Table II
II.
Allocation
on High-Speed
of RPC Time
.
193
in Microseconds
I Activity
Networks
(sccont
fltherr h4inus
Fim7cr
Ethcrrrl
(Spwc
MaxArg
Minus
Maxar
Minus
DEC) Maxarg
Fl)r
All
DEC)
Minus
=MaxArg
Chen[ Call Controller Latency
28
145
59
137
46
173
25
159
26
27
45
54
48
82
8
44
Time
58
1221
58
1221
4
122
3
88
25
25
27
27
56
70
17
20
Server Packet Receipt
39
470
59
169
42
42
29
347
Server Reply
27
2-1
46
46
36
36
25
25
Controller
25
25
44
44
~ 49
82
8
44
57
57
57
57
5
5
3
3
26
27
56 ’29
17
17
49
27 49
56
29
26 29
340
2052
471
1831
371
340
2070
496
1997
380
on the WUC for Call
Intcrntpt
Time
Handhng
Latency
on the Wue
Interrupt Chcnt
on Server
for Reply
Handling Reply
hcelpl
Total Attributed Measured
on CIIcn!
T]mc
Time
on the network 48 bytes
type.
on FDDI,
Minus
causes
and 53 bytes
60 bytes
single-user
rows
which
showed
are explained
below:
—Client Call. This is the total time out a call packet. It consists of five up the argument time to validate dure,
—Time
the packet on the
based
for setting
Latency.
getting
MaxArg
Wire
This for
Call.
major
parts:
locate
the cost
2
675
on the Ethernet,
transmits
to the server
on the ATM. on the three
The reply networks.
required on the client side for sending major components: the time for setting
and
of this
server
to do a transmit.
represents
the wire.
the
This
An estimate
controller’s
is computed
of the packet
latency
in
as in Section
2.
transmission
time
after
This is a sum of the times to vector the handler in the device driver, the time in the interrupt
is dismissed.
This is the time spent on the server machine before it is handed to user code. It consists
the cost of examining correct
unmarshaling
170
of the network.
and to return
—Server Call Receipt. the call packet arrives
The
figure
Handling on Server. interrupt to the interrupt
the handler,
693
less variance.
up the controller
to and from
on the bandwidth
—Interrupt network
29 776
to the system call, the time to perform a kernel entry, the the arguments, the time spent in the marshaling proce-
and the time
—Controller
29
164
both in single-user and in multiuser modes. The different; the times reported in the tables are the
measurements
The various
29
697
to be exchanged
on the ATM.
1514 bytes on Ethernet and FDDI, and 1537 bytes from the server is 60, 48, and 53 bytes respectively Measurements were made times were not significantly
s +
and
the packet
dispatch
copying/mapping component
varies
the data
and kernel packet,
into
depending
the
data
and
the
server’s
on the
when of two
structures time
spent
address
capabilities
to in
space. of the
controller. —Server needed
Reply. This is quite similar to Client Call. This includes the time to call the correct server procedure and to set up the results for the
system call, the time spent in the kernel in checking the time to set up the controller for transmit. ACM
Transactions
on Computer
Systems,
the
arguments,
and
Vol. 11, No. 2, May 1993.
194
.
C. A. Thekkathand
H. M. Levy
—Client ceipt
Reply Receipt. This is the counterpart and takes place on the client side.
—Total
Attributed
we have been measurements A few points
Time.
This
is the total
able to attribute or by estimating about
Table
of the
of the components.
to the various from functional
II are in order
Server
Call
It is the time
activities, either specifications.
here.
First,
Re-
though
by direct
the SparcSta-
tion and the DECstationv 5000 have similar CPU performance and use identical Ethernet controllers, the software cost for doing comparable tasks varies. In general, our experience has been that SunOS extracts a higher penalty than Ultrix. We believe this is primarily due to the richer SunOS virtual-memory architecture rather than to the SparcStation’s architectural peculiarities ATM
such
network,
packet
into
Similarly,
as register
windows.
Second,
29 cells are sent to the server.
29 cells at the sender
is included
the cost of reassembling
in the case of MaxArg
The cost of segmenting in the row entitled
the cells is included
on the the user
Client
in the Server
Call. Packet
Receipt row. In the case of single-cell transmission, the segmentation and reassembly code is completely bypassed. Finally, we exploited the flexible interrupt structure of the ATM controller to ensure that in the normal case only the last cell in a multicell the Total Attributed Time Section
2, in all cases except
in the range
one, the difference
of 1–8Y0. The only
on the ATM,
where
tables,
exception
for the reasons
overestimated the total Overall, as we can Ethernet
packet caused a host interrupt. In comparing with the Measured Time we note that as in
the
experimental
error,
is in the case of a multicell
packet
mentioned
time required. see by comparing
latency
for
is within
the
earlier
the
small
in Section
FDDI
Minus
and
RPC
the
2, we have DECstation
request
on FDDI
is
12% higher than for Ethernet. On the other hand, as expected, the much higher bandwidth on FDDI becomes evident for the larger packet sizes; the MaxArg RPC request takes nearly three times longer on Ethernet than on FDDI. Comparing the performance on the ATM network with the others, we see how effective a simple controller design is for reducing latency for small packets. is a factor
However,
for larger
in reducing
tion
and reassembly
are
being
built
[12,
packets,
performance. either 32].
the software ATM
in hardware It
would
fragmentation/reassembly
controllers
that
be interesting
characteristics of these controllers with FORE controller we use. As noted earlier in Section 2, for small
perform
or by dedicated the
to
simpler
packets,
fragmenta-
on-board
processors
compare approach
the taken
the cost of “doing
latency in the
business”
(i.e., controller latency) on the network has increased with FDDI. However, compared to the low-level software latency imposed by the packet exchange, the higher-level RPC functions (stubs, context switching, and protocol overhead) add only 87 pseconds for Ethernet, pseconds for ATM. Higher-level RPC functions cost more relating to the DMA interface ACM TransactIons
controller/host the high-level on Computer
117 ~seconds on
interface. First, cost of dispatching
Systems,
Vol
11, No. 2, May
FDDI
for
for two
FDDI, reasons,
and
97
both
because of the more complex the server process and data 1993.
Low-Latency management is greater Ethernet, the code path
Communication
for the required
on High-Speed
Networks
DEC FDDI. Second, for both for RPC for these controllers
different inherent
from that used for a simple packet exchange. overhead added by RPC. However, the overhead
of FDDI
than
for the
controller/host While
Ethernet
because
of some of the
the absolute
increase
traps and context However, the bulk
contributed
switching that of the overhead
main
factors
that
use of preallocation, exploiting
simple
Our
experience
that
the
ments
and
with
also determine in Tables
Our experiments, ordinary
network
latency-impacting
AND NETWORK
RPC
on a variety
has
latency
latency.
experienced the
fact
In this
and the network
that
influence
included
controllers
and no DMA,
tradeoffs
features the
make
minimum
Transfer. the
between
allow
by an RPC.
Our
a faster
achievable.
posed by the lower the communication
access
measure-
network
we discuss
that
are capable
us to examine
these
The
the data
details
capabilities
it difficult
efficient
network
does not
the
effects
of
I/ O
(2) the overhead of transferring in more detail below.
on
indicates on
on latency.
different
sor moves the data to and from the bus, transfer primitive). These issues are (1) the
4.1.1 Data
the
DESIGN
controller
to restrict Excessive
to the user.
and
number
on the Certain
of data
unnecessary
each of these
system
movements data
movement
on Computer
Systems,
of
to one, im-
performance of between con-
troller types and copying costs. Table III summarizes the number of copies that the controller, and the user need to perform, respectively, for each type of device: Transactions
bus vary
combinations
levels of the system can lead to bad overall system. We discuss below the interaction
ACM
There
using a block the device and
We discuss
data
controller.
the
types.
controllers that (i.e., the proces-
usually without cost of servicing
of moving
of the
of scatter-gather
some of the essential
are basically two issues to be considered in choosing between provide DMA and those that support programmed 1/0 (PIO)
depending
for
different
section,
over-
are (1) the
of controllers
a significant Similarly,
the
transmission,
optimizing
reinforce
which
performance
and network
processors.
II
DMA,
is to keep
controller,
a lower
4.1 DMA versus Programmed
reduction
system’s
overheads.
the
I and
imply
the controller
DMA,
the
controller
for
to the cost of features like
(4) the speed of the host
low-latency
of the
this
computation
of
case,
communication
necessarily both
of the
is comparable
RPC adds about 130% is due to architectural
to the
FOR CONTROLLER
design
cross-machine
contribute
peculiarities
common
4. IMPLICATIONS
protocols
characteristics
may not scale with processor speeds. can be reduced with increasing CPU
(2) overlapping
the
and
Thus, there is an is more in the case
by RPC functions
as a percentage, of this overhead
speeds. A significant factor in achieving heads of memory copying to a minimum.
(3)
FDDI and is slightly
interface.
both ATM and Ethernet, low-level messages. Part
The
195
.
the kernel, PIO, DMA,
Vol. 11, No. 2, May 1993.
196
.
C. A. Thekkathand Table
III.
H. M, Levy
Number
Number
of Copies
Controller
Types
of Copies (Device/KemelAJser)
PIO (KM)
Fragment
for Various
I PIO
I DMA
I DMA (S-G)
Send Header Data
otlm 0/1/0
Ollp
I/lp
lPIO
0/1/1
1/0/1
lPIO
OIIP OIIP
0/1/0
1/1/0
1/0/0
0/1/1
Ip/1
lPP
Receive Header Data
The column PIO (KM) represents PIO with some form of kernel-level marshaling, as described in Section 3; FRPC running on the ATM controller is an example. The column DMA (S-G) represents DMA with scatter-gather. Certain implicit assumptions made in the table are clarified below.

First, we assume that the cost of moving data between the controller's on-board memory and the network itself can be made negligible, for instance by using video RAMs or some similar technique. Second, we assume that the controller's on-board packet memory, or FIFO, cannot be reliably mapped into user address spaces, so that data must be transferred between the controller and the user either by the host processor (with PIO) or by the controller (with DMA). Finally, we are ignoring the cost of demultiplexing: on receive, the header must be examined, in the kernel or on the controller, so that incoming data goes to the correct destination buffer.

Ideally, one would like to minimize the number of times data is moved on the bus between the network controller and host memory, and to keep the number of copies to one. With PIO, and with scatter-gather DMA, it is possible to keep the number of copies to one without compromising protection. With plain DMA this would not be possible in general, because application-level data could be located in multiple noncontiguous locations in memory, while the controller expects to transfer data to and from a contiguous location (or a few locations); the user would then have to marshal the data into a contiguous buffer before the controller can move it. Most controllers that do support scatter-gather (e.g., the LANCE) impose minimum and maximum size requirements on each segment and restrict the number of allowable segments. While it is possible to build controllers to overcome this restriction (for example, the Autonet controller being built at DEC SRC; Charles P. Thacker, personal communication, Sept. 1991), using several small segments to gather data comes at a price, because the controller has to set up multiple DMA transfers. A similar situation is true on the receiving side as well. As shown in Table III, using PIO with kernel-level marshaling allows the data-copying cost to be kept to the minimum possible.
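To make the PIO (KM) column concrete, the following is a minimal sketch of kernel-level marshaling into a memory-mapped transmit FIFO. It is illustrative only: the FIFO address, macro name, and argument-descriptor structure are hypothetical assumptions, not details taken from the FORE or DEC controller documentation.

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical memory-mapped transmit FIFO of a PIO controller. */
    #define TX_FIFO ((volatile uint32_t *)0xBF000000)

    struct arg_seg {            /* one argument of the RPC call */
        const uint32_t *base;   /* word-aligned user data, already validated */
        size_t nwords;
    };

    /*
     * Kernel-level marshaling with PIO: the kernel walks the caller's
     * argument descriptors and writes each word directly into the
     * controller FIFO.  No intermediate user or kernel staging buffer
     * is needed, which is why the PIO (KM) column of Table III charges
     * only a single (kernel) copy for the data.
     */
    void pio_marshal_send(const struct arg_seg *args, int nargs)
    {
        for (int i = 0; i < nargs; i++)
            for (size_t w = 0; w < args[i].nwords; w++)
                *TX_FIFO = args[i].base[w];
    }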
However, one aspect of copying that is not captured in the table is the different rate at which data is moved over the bus for PIO and DMA. Typically, word-at-a-time PIO accesses over the bus are slower than block DMA accesses; this is the case on both the DECstation TURBOchannel bus and the SPARCstation SBus. This difference is of limited concern for short packets, but beyond a certain size PIO will be slower: there is a break-even point beyond which it is more efficient to use DMA, unless the processor is required to touch each byte of the data as it crosses the bus (for instance, to generate a checksum).
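The break-even point can be illustrated with a simple cost model; the per-word and setup costs below are assumed figures for illustration, not measurements from our controllers.

    #include <stddef.h>

    /* Word-at-a-time PIO: every 32-bit word costs one bus access. */
    double pio_cost_us(size_t bytes, double us_per_pio_word)
    {
        return (bytes / 4.0) * us_per_pio_word;
    }

    /* Block DMA: a fixed setup/bookkeeping charge plus cheaper per-word moves. */
    double dma_cost_us(size_t bytes, double setup_us, double us_per_dma_word)
    {
        return setup_us + (bytes / 4.0) * us_per_dma_word;
    }

    /* With, say, 0.4 us per PIO word, 0.1 us per DMA word, and 20 us of
     * DMA setup and cache bookkeeping, the crossover falls near
     * 20 / (0.4 - 0.1) words, i.e., roughly 267 bytes; below that, PIO wins. */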
4.1.2 Cache and TLB Effects. While Table III seems to indicate that PIO and DMA can perform the same number of copies, very often DMA could be slower than PIO for reasons related to the interaction of the memory subsystem and the processor architecture. This subsection describes this effect in more detail.

In addition to the copying costs outlined above, moving data with DMA can be a source of overhead due to cache effects. If the cache does not snoop on I/O operations, cache blocks corresponding to the data could be left incoherent as a result of the controller-initiated DMA operation to memory, requiring cache flushes; if the cache is write-back, dirty entries may need to be purged before a DMA operation from memory. Table IV shows the contribution of the cache flush cost as a percentage of the total interrupt-handling cost for our DMA controllers (the DEC FDDI controller and the SPARCstation Ethernet controller). The total cost is simply the sum of the interrupt-handling cost and the time taken to flush the cache lines corresponding to the data that was transferred.

Table IV. Cache Flush Cost (times in microseconds)

                                  FDDI (DEC)        Ethernet (SPARC)
    Received packet size (bytes)  60       1514     60       1514
    Total interrupt time          46       70       21       21
    Cache flush time               7       30        3       10
    Cache flush as percentage     15       43       14       48

As the table suggests, the cache flush cost is a significant fraction of the overhead on current architectures, and it grows with the size of the received packet. In addition to the cost of executing the cache flush instructions (and the associated bookkeeping in software), flushes have a negative impact on performance by destroying locality; the situation is even worse on some architectures [15]. Thus, even though DMA reduces the number of bus accesses, the interaction of the memory subsystem and the processor can make it extract a heavier overall penalty than PIO. Newer processor/cache designs that recognize the problem might provide adequate support for coherent, controller-initiated data transfers without these costs. In the absence of snooping caches, the usual
“solution” to this problem is to allocate buffers temporarily in uncached memory before they are copied to user space. This approach, used in the stock SunOS
Ethernet driver, either incurs an extra copy or loses the benefit of cached accesses to frequently used data. Another cost of DMA is the manipulation of page tables that is often necessary. On packet arrival, the controller stores the data on a page; however, there is generally no way to guarantee that the page is mapped into the correct destination address space. Thus the kernel is faced with the option of either remapping the page into the destination address space or performing an explicit copy; on a multiprocessor, the remapping option can require an extra and expensive TLB coherency operation [6].

To summarize our experience with the various controllers, we believe that DMA capabilities without adequate support from the cache and memory subsystem can be bad for performance on modern RISC processors. We also believe that controllers that have simpler interfaces to the host have the potential for reducing these overheads.
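As a rough illustration of why the cache flush time in Table IV grows with packet size, consider the following sketch of the receive-side synchronization a kernel must perform when the cache does not snoop I/O. The line size and the flush primitive are assumptions for illustration, not details of the DECstation or SPARCstation kernels.

    #include <stddef.h>
    #include <stdint.h>

    #define CACHE_LINE 16   /* bytes per line; a machine-dependent assumption */

    /* Hypothetical single-line primitive; real kernels use machine-specific
     * instructions or MMIO sequences to write back and invalidate a line. */
    extern void wb_flush_line(volatile void *addr);

    /*
     * Make a buffer that was just written by controller DMA safe for the
     * CPU to read on a machine whose cache does not snoop I/O.  The cost
     * is one flush per cache line covering the packet, which is why the
     * cache flush time in Table IV rises with the received packet size.
     */
    void dma_recv_sync(volatile void *buf, size_t len)
    {
        volatile uint8_t *p = (volatile uint8_t *)buf;
        for (size_t off = 0; off < len; off += CACHE_LINE)
            wb_flush_line(p + off);
    }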
198
C. A. Thekkathand
.
H. M Levy Table
IV.
Cache
Flush
Cost
Received Packet Size in bytes FDDI (DEC) Ethernet (Spare) 1514 60 60 1514 Total Interrupt Cache
Percentage
,
potential simple 4.2
Overhead
for reducing
overheads.
controller/host
are two
controller.
common buffer
in memory.
other
ways
and
14
48
15
43
next
subsection
argues
data
a range
between
(possibly
the
data
to the
overhead
FIFO
user.
We have
shall
therefore
and
costs of different controller types. with the two types of controllers
interface
such
component.
as that
found
processors
Our in the
objective ATM
shows
the
cost
and
already
of servicing
through a copy or a remapping here is to compare the interfaces
the to be
descriptors
for transmits
and
discussed
in an the
restrict
ourselves
mentioned
above
to indi-
the
and (2) the controllerresearch has studied
[3, 26], and so we shall is with
interface such as that found in the Ethernet We reproduce some of the measurements table
host
overhead can be significantly reduced cost is composed of two components:
costs on RISC
second
the
and the host share
(1) the CPU-dependent cost of vectoring the interrupt dependent cost of servicing the interrupt. Previous only
use of
all) of host memory
is to use a simple
cate that the interrupt-handling by using a FIFO. Interrupt-handling
interrupt-vectoring
for the
host access it directly. A significant overhead to the processor system is the cost of servicing the
movement
interrupt-handling Our experiments
70 30
and have the controller
getting
of data
7
of transferring
alternative
receives and have the interfacing the controller interrupt
46
10
Memory Interface
used as a packet
role
21
3
The
One way is to designate The
21
interfaces.
Host / Controller
There
Time
Flush Time
‘
to
compare
a more
elaborate
or the FDDI from Section
interrupt
and
examine
a simple
controllers. 2 in Table
transferring
FIFO
descriptor V. The the
data
operation, as appropriate. Since our intent and not the network, we have ignored the
reassembly overhead on the ATM controller. As the table indicates, for transferring small
amounts
of data
to the user,
the overhead of the FIFO-based controller (10 ~seconds) is less than half that of the best-case descriptor-based controller (24 ~seconds for the DEC Ethernet). For larger packets, the balance is in favor of controllers that allow the kernel to perform page-mapping operations, because the copying cost dominates the interrupt-handling cost. The ability to map data into user spaces at low cost is one alternative to kernel-level marshaling with PIO, which retains the benefits of protection and reduced data movement without the need to synthesize code. A typical limitation with FIFO-based controllers, such as the FORE ATM, is that there is no easy way to map the memory in a protected manner simultaneously into multiple user spaces. ACM
Table V. Interrupt-Handling Cost (times in microseconds)

                         Ethernet (DEC)   Ethernet (SPARC)   FDDI (DEC)      ATM (DEC)
    Packet size (bytes)  60      1514     60      1514       60      1514    53      1537
    Interrupt time       13      13       21      21         46      70      5       10
    Copy/map-in time     11      201      9       39         10      10      5       138
    Total                24      214      30      60         56      80      10      148
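The difference between the two interface styles compared in Table V can be sketched as follows. The register address, descriptor layout, and ownership bit below are hypothetical; the sketch only illustrates a FIFO drained by PIO versus a descriptor ring filled by controller DMA.

    #include <stddef.h>
    #include <stdint.h>

    #define RX_FIFO  ((volatile uint32_t *)0xBF000400)  /* FIFO-based controller */

    struct rx_desc {              /* descriptor-based controller */
        volatile uint16_t status; /* ownership and error bits */
        volatile uint16_t length;
        uint8_t *buffer;          /* shared packet buffer in host memory */
    };
    #define DESC_OWN 0x8000       /* set while the controller owns the descriptor */

    /* FIFO interface: the interrupt handler drains words with PIO and can
     * deposit them straight into the destination buffer. */
    size_t fifo_rx_service(uint32_t *dst, size_t nwords)
    {
        for (size_t i = 0; i < nwords; i++)
            dst[i] = *RX_FIFO;            /* one bus access per word */
        return nwords * sizeof(uint32_t);
    }

    /* Descriptor interface: the handler walks the ring to find packets the
     * controller has already DMAed into host memory, then copies or remaps. */
    uint8_t *desc_rx_service(struct rx_desc *ring, int ndesc, int *cur, uint16_t *len)
    {
        struct rx_desc *d = &ring[*cur];
        if (d->status & DESC_OWN)         /* nothing received yet */
            return NULL;
        *len = d->length;
        *cur = (*cur + 1) % ndesc;
        return d->buffer;                 /* caller copies or maps it in */
    }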
In contrast, with descriptor-based controllers it is possible, in principle, to support address mapping of packet memory, irrespective of DMA support. However, typically,
on descriptor-based PIO controllers, the existence of a small amount of on-board buffer memory makes it difficult to provide address-mapping support, because that memory is a scarce resource that must be managed sparingly. For instance, the DECstation’s Ethernet controller has only 128 Kbytes of buffer memory to be used for both send and receive buffers. Mapping pages of the buffer memory into user space would be costly, because the
smallest units that can be individually mapped are 4-Kbyte pages. Reducing the number of available buffers in this way could lead to dropped packets during periods of high load. On the other hand, conservatively managing the scarce buffer resource results in the kernel making an extra copy of the data from user space into the packet buffer. With a trivial amount of controller hardware support, it is possible to solve the protection granularity problem, providing a larger number of individually protected buffers in controller memory. The basic idea is to populate only a fraction of each virtual page that refers to controller memory. As a concrete example, we consider an alternative design for a DEC Ethernet controller; Figure 1 (address mapping for Ethernet controller buffers) shows a sketch of the design. Controller memory is organized into 2-Kbyte buffers, each of which will hold a packet; this allows us to have 64 buffers in our 128-Kbyte packet buffer. To allow user processes to write directly to controller memory without sacrificing protection, the controller ignores the high-order bit of the page-offset field and concatenates the remaining offset bits with the physical page number (PFN) field of each physical address presented by the TLB. This has the effect of causing each 2-Kbyte physical page of controller memory to be doubly mapped into both the top half and the bottom half of a 4-Kbyte process virtual page, so that each buffer can be separately mapped and protected.
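The address arithmetic behind this scheme might look as follows; the page and buffer sizes match the example in the text, but the function itself is only an illustrative sketch, not the controller's actual logic.

    #include <stdint.h>

    #define PAGE_SIZE 4096u   /* host protection granularity */
    #define BUF_SIZE  2048u   /* controller packet buffer size */

    /*
     * Sketch of the double mapping described above: the controller ignores
     * the high-order bit of the 12-bit page offset, so both halves of a
     * 4-Kbyte virtual page refer to the same 2-Kbyte buffer.  Each buffer
     * therefore occupies its own page and can be mapped into a different
     * user address space with ordinary page protection.
     */
    static inline uint32_t controller_buffer_addr(uint32_t phys_addr)
    {
        uint32_t pfn    = phys_addr / PAGE_SIZE;                       /* selects the buffer  */
        uint32_t offset = (phys_addr & (PAGE_SIZE - 1)) & (BUF_SIZE - 1); /* drop top offset bit */
        return pfn * BUF_SIZE + offset;                                /* byte address in packet RAM */
    }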
Our experience with both forms of host/controller interfaces leads us to believe that a simple FIFO-based interface is ideally suited for small packets, while larger packets might be better served with a more conventional descriptor-based packet buffer. Thus, it might be beneficial to support both forms of interface on the same controller. To our knowledge, the only controller that has multiple host interfaces on board is the VMP-NAB [19]. Our experiments also suggest that, in the absence of memory system support, DMA may incur the cost of
additional copies and/or cache-flushing overheads. In such cases, it would be advantageous to use PIO with a descriptor-based packet memory that the host CPU can either copy or map into user space. This could be done either by providing enough buffers or with hardware support (e.g., as mentioned above). The decision to copy or map in will depend on whether the processor is required to touch every byte of the packet or not. For instance, if it is
necessary to calculate a software checksum, then an integrated copy-and-checksum loop (as proposed in [9] and [10]) would suggest that mapping is of limited benefit. This consideration will be even more important with future memory subsystem designs that provide support for I/O-initiated data transfers; such I/O-initiated data moves would be of benefit if the memory system provides adequate support for cache consistency and the processor does not need to mediate the transfer.
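Where the processor must touch every byte anyway, the integrated copy-and-checksum idea of [9] and [10] can be sketched as follows; this is an illustration using a generic 16-bit ones'-complement checksum, not code from those papers.

    #include <stddef.h>
    #include <stdint.h>

    /*
     * Copy 'len' bytes from 'src' to 'dst' while accumulating a 16-bit
     * ones'-complement checksum in the same pass, so the data is touched
     * only once.  Assumes word alignment and an even length, for brevity.
     */
    uint16_t copy_and_checksum(void *dst, const void *src, size_t len)
    {
        const uint16_t *s = (const uint16_t *)src;
        uint16_t *d = (uint16_t *)dst;
        uint32_t sum = 0;

        for (size_t i = 0; i < len / 2; i++) {
            uint16_t w = s[i];
            d[i] = w;          /* copy */
            sum += w;          /* checksum in the same loop */
        }
        while (sum >> 16)      /* fold carries back into 16 bits */
            sum = (sum & 0xffff) + (sum >> 16);
        return (uint16_t)~sum;
    }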
4.3 Network Types
Compared to an Ethernet, a high-speed token ring like FDDI offers greater bandwidth. However, token rings have a latency that increases with the number of stations on the network. Consider a network where the offered load is quite small; that is, on the average, only a few nodes have data to transmit at a given time. As stations are added to the network, the Ethernet latency remains practically constant, while the latency of a token ring will increase linearly due to the delay introduced by each node in reinserting tokens. This implies that even on a lightly loaded, moderately sized token ring, achieving low-latency cross-machine communication is difficult. As an example, if each station introduces a one-microsecond token rotation delay, a network of 100 stations would make it infeasible to provide low-latency communication. As load is increased, both the Ethernet and the FDDI token ring will experience greater latencies, with the FDDI reaching an asymptotic value [32]. Thus, on balance, it appears that low-latency communications are not well served by a token ring, despite high bandwidth. We
should point out that in our experiments with RPC, token rotation latency was not a problem because we used a private ring with two nodes on it. However, if we added more nodes to the ring, we would expect to see a degradation
in the latency of RPC. ATM-style networks that fragment and interleave packets have to incur the delay of fragmenting and of software reassembly, even with medium-sized packets. For instance, our experience indicates that latency begins to be impacted with packet sizes in the range of 1500 bytes. It is not immediately clear how adding fragmentation and reassembly support in hardware or in an on-board processor, as provided in [12] and [31], will affect overall latency, even though the fragmentation/reassembly itself is fast.
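The latency penalties discussed in this subsection are easy to estimate with back-of-the-envelope arithmetic; in the sketch below, the one-microsecond per-station delay and the 100-station ring come from the example above, while the per-cell reassembly cost is an assumed figure used purely for illustration.

    #include <stdio.h>

    int main(void)
    {
        /* Token ring: a fixed insertion delay accumulates per station. */
        double station_delay_us = 1.0;     /* example figure from the text */
        int    stations         = 100;
        double ring_penalty_us  = stations * station_delay_us;

        /* ATM: a 1500-byte packet becomes many cells that software must reassemble. */
        int    packet_bytes     = 1500;
        int    cell_payload     = 48;      /* ATM cell payload; AAL overhead ignored */
        double per_cell_us      = 2.0;     /* assumed reassembly cost per cell */
        int    cells            = (packet_bytes + cell_payload - 1) / cell_payload;
        double reassembly_us    = cells * per_cell_us;

        printf("ring adds ~%.0f us; %d cells add ~%.0f us of reassembly\n",
               ring_penalty_us, cells, reassembly_us);
        return 0;
    }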
5. SUMMARY AND CONCLUSIONS

Modern distributed systems require both high throughput and low latency. While faster processors help to improve both throughput and latency, it is high throughput, and not low latency, that has been the target of most newer networks and controllers.

In this paper we have explored avenues for achieving low-latency communications on new-generation networks (specifically, FDDI and ATM). We have implemented a low-latency RPC system using techniques from previous designs in addition to our own. Using newer RISC processors and performance-oriented software structures, our system achieves small-packet, round-trip, user-to-user RPC times of 170 microseconds on ATM, 340-496 microseconds on Ethernet, and 380 microseconds on FDDI. Our RPC system demonstrates that it is possible to build an RPC system whose overhead is only 1.5 times the bare hardware cost of communication for small packets.
Our experiments indicate that controllers play an increasingly crucial role in determining the overall latency in cross-machine communications and can often
be the bottleneck. However, we believe that there are alternatives to current controller designs that can provide lowered latency, facilitating software techniques that achieve excellent performance. Specifically, our experience leads us to believe that FIFO-based network interfaces are well suited for small packets, and that DMA-based descriptor controllers may have many hidden costs, depending on the memory system architecture. Hybrid controllers that provide multiple host interfaces appear to be an attractive alternative to current designs. Of course, the network itself is an important factor for performance. For instance, both of our high-throughput networks
have some peculiarities that could affect the latency of packets: e.g., the token rotation latency in the FDDI network and the fragmentation and reassembly in the ATM network. Finally, we believe that with careful design at all levels of the communication system, communication latencies can be substantially reduced, enabling entirely new approaches and applications for distributed systems.
ACKNOWLEDGMENTS
The authors thank Ed Lazowska for his numerous patient readings of earlier drafts of this paper and his comments, which added greatly to its clarity of presentation. Thanks are also due to Tom Anderson for his comments and many suggestions for improving the paper. Finally, we wish to thank the designers of the FDDI controller from DEC, who gave so willingly of their time in helping us understand many of its details.
REFERENCES
1. ADVANCED MICRO DEVICES. Am7990 Local Area Network Controller for Ethernet (LANCE). Advanced Micro Devices, Sunnyvale, Calif., 1986.
2. ROSS, F. E. FDDI—A tutorial. IEEE Commun. Mag. 24, 5 (May 1986), 10-17.
3. ANDERSON, T. E., LEVY, H. M., BERSHAD, B. N., AND LAZOWSKA, E. D. The interaction of architecture and operating system design. In Proceedings of the 4th International Conference on Architectural Support for Programming Languages and Operating Systems (Apr. 1991). ACM, New York, 108-120.
4. BERSHAD, B., ANDERSON, T., LAZOWSKA, E., AND LEVY, H. Lightweight remote procedure call. ACM Trans. Comput. Syst. 8, 1 (Feb. 1990), 37-55.
5. BIRRELL, A. D., AND NELSON, B. J. Implementing remote procedure calls. ACM Trans. Comput. Syst. 2, 1 (Feb. 1984), 39-59.
6. BLACK, D., RASHID, R., GOLUB, D., HILL, C., AND BARON, R. Translation lookaside buffer consistency: A software approach. In Proceedings of the 3rd International Conference on Architectural Support for Programming Languages and Operating Systems (Apr. 1989). ACM, New York, 113-122.
7. CHASE, J. S., AMADOR, F. G., LAZOWSKA, E. D., LEVY, H. M., AND LITTLEFIELD, R. J. The Amber system: Parallel programming on a network of multiprocessors. In Proceedings of the 12th ACM Symposium on Operating Systems Principles (Dec. 1989). ACM, New York, 147-158.
8. CHERITON, D. R. The V kernel: A software base for distributed systems. IEEE Softw. 1, 2 (Apr. 1984), 19-42.
9. CLARK, D. D., AND TENNENHOUSE, D. L. Architectural considerations for a new generation of protocols. In Proceedings of the 1990 SIGCOMM Symposium on Communications Architectures and Protocols (Sept. 1990). ACM, New York, 200-208.
10. CLARK, D. D., JACOBSON, V., ROMKEY, J., AND SALWEN, H. An analysis of TCP processing overhead. IEEE Commun. Mag. 27, 6 (June 1989), 23-36.
11. COMER, D., AND GRIFFIOEN, J. A new design for distributed systems: The remote memory model. In Proceedings of the Summer 1990 USENIX Conference (June 1990), 127-135.
12. DAVIE, B. S. A host-network interface architecture for ATM. In Proceedings of the 1991 SIGCOMM Symposium on Communications Architectures and Protocols (Sept. 1991). ACM, New York, 307-315.
13. DIGITAL EQUIPMENT CORPORATION. PMADD-AA TURBOchannel Ethernet Module Functional Specification, Rev. 1.2. Workstation Systems Engineering, Digital Equipment Corporation, 1990.
14. DIGITAL EQUIPMENT CORPORATION. TURBOchannel Hardware Specification. Digital Equipment Corporation, 1991.
15. DIGITAL EQUIPMENT CORPORATION. Alpha Architecture Reference Manual. Digital Equipment Corporation, 1992.
16. FORE SYSTEMS. TCA-100 TURBOchannel ATM Computer Interface, User's Manual. FORE Systems, Pittsburgh, Pa., 1992.
17. HOARE, C. A. R. Communicating sequential processes. Commun. ACM 21, 8 (Aug. 1978), 666-677.
18. JOHNSON, D. B., AND ZWAENEPOEL, W. The Peregrine high-performance RPC system. Tech. Rep. COMP TR91-152, Dept. of Computer Science, Rice Univ., 1991.
19. KANAKIA, H., AND CHERITON, D. R. The VMP network adapter board (NAB): High-performance network communication for multiprocessors. In Proceedings of the 1988 SIGCOMM Symposium on Communications Architectures and Protocols (Aug. 1988). ACM, New York, 175-187.
20. KEPPEL, D. A portable interface for on-the-fly instruction space modification. In Proceedings of the 4th International Conference on Architectural Support for Programming Languages and Operating Systems (Apr. 1991). ACM, New York, 86-95.
21. LI, K., AND HUDAK, P. Memory coherence in shared virtual memory systems. ACM Trans. Comput. Syst. 7, 4 (Nov. 1989), 321-359.
22. MASSALIN, H., AND PU, C. Threads and input/output in the Synthesis kernel. In Proceedings of the 12th ACM Symposium on Operating Systems Principles (Dec. 1989). ACM, New York, 191-201.
23. METCALFE, R. M., AND BOGGS, D. R. Ethernet: Distributed packet switching for local computer networks. Commun. ACM 19, 7 (July 1976), 395-404.
24. MOGUL, J. C., AND BORG, A. The effect of context switches on cache performance. In Proceedings of the 4th International Conference on Architectural Support for Programming Languages and Operating Systems (Apr. 1991). ACM, New York, 77-84.
25. MULLENDER, S. J., AND TANENBAUM, A. S. The design of a capability-based distributed operating system. Comput. J. 29, 4 (1986), 289-299.
26. OUSTERHOUT, J. K. Why aren't operating systems getting faster as fast as hardware? In Proceedings of the Summer 1990 USENIX Conference (June 1990), 247-256.
27. SCHROEDER, M. D., AND BURROWS, M. Performance of Firefly RPC. ACM Trans. Comput. Syst. 8, 1 (Feb. 1990), 1-17.
28. SPEC. SPEC Benchmark Suite Release 1.0. SPEC Newsletter, Systems Performance Evaluation Cooperative, 1990.
29. SUN MICROSYSTEMS, INC. SBus Specification B.O. Sun Microsystems, Inc., Mountain View, Calif., 1990.
30. THACKER, C. P., STEWART, L. C., AND SATTERTHWAITE, E. H., JR. Firefly: A multiprocessor workstation. IEEE Trans. Comput. 37, 8 (Aug. 1988), 909-920.
31. BRENDAN, C., TRAW, S., AND SMITH, J. M. A high-performance host interface for ATM networks. In Proceedings of the 1991 SIGCOMM Symposium on Communications Architectures and Protocols (Sept. 1991). ACM, New York, 317-325.
32. ULM, J. N. A timed-token ring local area network and its performance characteristics. In Proceedings of the 7th IEEE Conference on Local Computer Networks (Feb. 1982). IEEE, New York, 50-56.
33. VAN RENESSE, R., VAN STAVEREN, H., AND TANENBAUM, A. S. The performance of the Amoeba distributed operating system. Softw. Pract. Exper. 19, 3 (Mar. 1989), 223-234.
34. MINZER, S. E. Broadband ISDN and asynchronous transfer mode (ATM). IEEE Commun. Mag. 27, 9 (Sept. 1989), 17-24, 57.

Received July 1991; revised November 1992; accepted January 1993.