Fast Buffer Memory with Deterministic Packet Departures

Mayank Kabra, Siddhartha Saha, Bill Lin

University of California, San Diego, La Jolla, CA 92093–0407

Abstract—High-performance routers need to temporarily store a large number of packets in response to congestion. DRAM is typically used to implement the needed packet buffers, but DRAM devices are too slow to match the bandwidth requirements. To bridge the bandwidth gap, a number of hybrid SRAM/DRAM packet buffer architectures have been proposed [7]–[10]. These packet buffer architectures assume a very general model in which the buffer consists of many logically separated FIFO queues that may be accessed in random order. For example, virtual output queues (VOQs) are used in crossbar routers, where each VOQ is a logical queue corresponding to a particular output. Depending on the scheduling algorithm used, the access pattern to these logical queues may indeed be random. However, for a number of router architectures, this worst-case random access assumption is unnecessary because packet departure times are deterministic. One such architecture is the switch-memory-switch router architecture [3], [4], which efficiently mimics an output queueing switch. Another is the load-balanced router architecture [1], [2], which has interesting scalability properties. In these architectures, for best-effort routing, the departure times of packets can be deterministically calculated before the packets are inserted into the packet buffers. In this paper, we describe a novel packet buffer architecture based on interleaved memories that takes advantage of the known packet departure times to achieve simplicity and determinism. The number of interleaved DRAM banks required to implement the proposed packet buffer architecture is independent of the number of logical queues, yet the proposed architecture can achieve the performance of an SRAM implementation.

I. INTRODUCTION

High-performance routers need to temporarily store a large number of packets at each linecard in response to congestion. One of the most difficult challenges in designing high-performance routers is the implementation of packet buffers that can operate at extremely fast line rates. That is, at every time slot, a newly arriving packet may need to be written to the packet buffer, and an existing packet may need to be read from the packet buffer for departure. To provide sufficient packet buffering at times of congestion, a common rule of thumb is to size the packet buffer to be the product of the average round-trip-time and the line rate [5]. At an average round-trip-time of 250 ms on the Internet [6] and a line rate of 40 Gb/s, this rule of thumb translates to a memory requirement of over 1 GB at each linecard, which requires the use of DRAM.
However, at a line rate of 40 Gb/s and beyond, DRAM is simply too slow. Therefore, researchers have explored hybrid SRAM/DRAM memory systems that aim to provide the memory access speeds of SRAM with the density of DRAM [7]–[10]. These architectures can be used to implement a packet buffer with logically separate FIFO queues. For example, virtual output queues (VOQs) are used in crossbar routers, where each VOQ corresponds to a logical FIFO queue for buffering packets destined to a particular output. Depending on the scheduling algorithm used for the crossbar router, the assumption is that the VOQ to service at each time slot may be chosen at random. To enable the random servicing of VOQs at SRAM speeds, the methods described in [7], [8] require complex memory management algorithms based on real-time sorting and a non-trivial amount of SRAM for storing the head and tail portions of the VOQs to handle worst-case access patterns. The memory management schemes proposed in [9], [10], on the other hand, only provide statistical guarantees.

Although the assumption of servicing VOQs in random order is reasonable for crossbar-based router architectures, it is unnecessarily general for several important router architectures, in particular those in which the departure times of packets from packet buffers can be calculated exactly in advance of packet insertion when best-effort routing is considered. One such architecture is the switch-memory-switch router architecture [3], [4], which can efficiently mimic an output queueing switch. The switch-memory-switch architecture is based on a set of physically separated packet buffers that are sandwiched between two crossbar switches. When a new packet arrives at an input port, the departure time of the packet is deterministically calculated to mimic exactly the output departure order of an output queueing switch. A matching problem is then solved to decide to which packet buffer to write the packet. When the packet is inserted into a packet buffer, its departure time has already been decided.

Another router architecture is the load-balanced router architecture [1], [2]. This architecture has been shown to have interesting scalability properties and throughput guarantees.
The load-balanced router architecture is also based on a set of physically separated packet buffers that are sandwiched between two switches. However, in contrast to the switch-memory-switch architecture, there is no central scheduler, and the two switches are traffic-independent fixed-configuration switches. Instead, in this architecture, each packet buffer maintains a set of VOQs that are serviced deterministically in round-robin order. Therefore, the departure time of a packet from its packet buffer can also be calculated deterministically in advance of writing the packet into the packet buffer.

In this paper, we propose a simple deterministic memory management scheme that exploits the known departure times of packets to achieve the performance and behavior of an SRAM packet buffer implementation, but using multiple DRAM modules in an interleaved manner. With interleaved memories, a fast effective external memory bandwidth is achieved by accessing K slower memories independently. Using the known departure times of packets that need to be stored in the packet buffers, we can deterministically assign packets to interleaved DRAM modules in a manner that guarantees the avoidance of access conflicts and the retrieval of packets before their required departure times. Our proposed packet buffer architecture can support an arbitrary number of logical FIFO queues with the only assumption that the departure times of packets can be determined in advance of packet insertion. We prove that the number of interleaved DRAM modules required is a constant with respect to the number of logical queues. A simple bypass scheme is employed using a small amount of SRAM to buffer packets whose departure times are less than the round-trip latency of DRAM access. The size of this SRAM is again a constant with respect to the number of logical queues. With this simple bypass scheme, we show that our scheme can achieve the performance of an SRAM packet buffer implementation, but with the density of DRAM. Although interleaved memories have been used in previously proposed fast packet buffer schemes [7]–[10], our proposed scheme takes advantage of known departure times to achieve simplicity and determinism.

The rest of the paper is organized as follows. In Section II, we describe the packet buffer abstraction that we aim to implement, our packet buffer architecture, and its operation. In this section, we also describe the types of memory access conflicts that can occur with our architecture and prove that the number of interleaved memories required at each intermediate input linecard to guarantee the avoidance of access conflicts is independent of traffic arrival and the number of logical queues. In Section III, we address implementation issues related to the memory management control logic required for our packet buffer scheme. Finally, we conclude in Section IV.

II. ABSTRACTION AND ARCHITECTURE

A. Packet Buffer Abstraction

In this section, we first describe the packet buffer abstraction that we aim to implement. For simplicity, throughout this paper, we assume that all incoming variable-size packets are segmented into fixed-size packets, or simply packets, and reassembled when leaving the router. Therefore, we assume a packet buffer abstraction with fixed-size packets. Also, we assume that time is slotted, where a time slot is defined to be the amount of time that it takes for a packet to arrive (or depart) at a given line rate. For example, at 40 Gb/s, the time slot for a 40-byte packet is 8 ns.

At each time slot t, a new packet p may need to be inserted into the packet buffer with a future departure time $T_D(p)$, and an existing packet q may need to depart from the packet buffer in the current time slot (i.e., $T_D(q) = t$), with both occurring in the same time slot in the worst case. The departure times of packets are assumed to be known and fixed at the time of packet insertion, which dictates exactly when the packet needs to be read from the packet buffer in the future. We make no restrictions on what departure times are assigned, only that the departure times are unique to ensure that only one packet departs per time slot. Like previous work on fast packet buffers [7]–[10], our packet buffer may consist of an arbitrary number of logical queues, but instead of assuming random access, we assume deterministic access based on predetermined departure times. We refer to this abstraction as the single-write-single-read time-deterministic packet buffer model, or time-deterministic packet buffer model for short. A packet buffer architecture is said to implement a time-deterministic packet buffer if, at every time slot, a new packet can be written into the packet buffer and a packet can be read from the packet buffer in accordance with its pre-determined departure time.

B. Packet Buffer Architecture

We next describe our proposed packet buffer architecture, as depicted in Figure 1. To provide bulk storage, our architecture uses an interleaved memory architecture with multiple slower DRAM modules. In particular, let there be K memories. DRAM access latencies are usually much slower than the line rate, so multiple time slots are usually required to complete a memory transaction. Let b be the number of time slots required to complete a single write or read memory transaction. Once a memory transaction has been initiated to a memory, that memory will remain busy for b time slots until the transaction completes. To match the external line rate, we write b packets in parallel into b slower memories once every b time slots, and we also read b packets in parallel from b memories once every b time slots.
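To make the abstraction of Section II-A concrete before turning to the architecture's internals, the following is a minimal sketch of the single-write-single-read time-deterministic interface; the class and method names are ours, not the paper's:

```python
class TimeDeterministicBuffer:
    """Sketch of the time-deterministic packet buffer abstraction.

    At each time slot, at most one packet is inserted (with a known,
    unique future departure time) and at most one packet departs.
    """

    def __init__(self):
        # Map from departure time slot to the stored packet.
        self.by_departure = {}

    def insert(self, packet, t_depart):
        # Departure times are unique: at most one departure per slot.
        assert t_depart not in self.by_departure
        self.by_departure[t_depart] = packet

    def depart(self, t):
        # Return the packet whose pre-determined departure time is t,
        # if one was scheduled to leave in this time slot.
        return self.by_departure.pop(t, None)
```

Any implementation, including the interleaved-DRAM architecture that follows, must behave exactly like this reference model at its external interface.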

[Fig. 1. The proposed packet buffer architecture. Arriving packets are aggregated b at a time and written in parallel across K interleaved DRAM modules (1, 2, ..., K−1, K); retrieved packets are de-aggregated for departure; an SRAM bypass buffer with 4b − 1 locations handles packets with imminent departure times.]

We refer to b consecutive time slots as a frame slot, or simply a frame. By convention, a time slot t is said to belong to frame $f = \lceil t/b \rceil$. For example, time slots t = 1, 2, ..., b belong to frame f = 1, time slots t = b + 1, b + 2, ..., 2b belong to frame f = 2, and so on.

To facilitate parallel writing, b packets are aggregated before writing to the parallel DRAM modules. If a packet p has an arrival time $T_A(p)$, then we say that it arrives in frame $F_A(p) = \lceil T_A(p)/b \rceil$. The packet is then written into the parallel memories in frame $F_W(p) = F_A(p) + 1$, and the writing completes by the end of $F_W(p)$, b time slots later. Similarly, if a packet has a departure time $T_D(p)$, then we say that it departs in frame $F_D(p) = \lceil T_D(p)/b \rceil$. To ensure that the packet departs on time at time slot $T_D(p)$, the packet is actually read in the previous frame, $F_R(p) = F_D(p) - 1$, so that there is time to de-aggregate the packets read.

Based on this operational model, it should be clear that the minimum round-trip latency for storing and retrieving a packet to and from one of the DRAM modules is 4 frames: one frame for aggregating the packets, one frame for writing into the memories, one frame for reading from the memories, and one frame for de-aggregating the packets for departure. Thus, excluding the arrival time slot, a packet with a departure time less than or equal to 4b − 1 time slots away may not be retrieved on time for departure. Therefore, we will only write a packet into a DRAM module if its departure time is at least 4b − 1 time slots away.
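These frame calculations are simple enough to state directly in code. The following is a small sketch of the frame arithmetic under the paper's definitions (function names are ours):

```python
import math

def frame_of(t, b):
    # A time slot t belongs to frame ceil(t / b).
    return math.ceil(t / b)

def write_frame(t_arrive, b):
    # F_W(p) = F_A(p) + 1: packets are aggregated for one frame,
    # then written in parallel during the next frame.
    return frame_of(t_arrive, b) + 1

def read_frame(t_depart, b):
    # F_R(p) = F_D(p) - 1: packets are read one frame early so
    # there is time to de-aggregate them before departure.
    return frame_of(t_depart, b) - 1
```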





For each arriving packet with a departure time at least 4b − 1 time slots away, we must ensure that we can assign it to an available DRAM module. To make sure that there is always an available DRAM module to write such a packet, we must consider three types of conflicts:

• Arrival write conflicts: At the current frame f, there can be at most b packets that need to be written into the DRAM modules. That is, there can be at most b packets with $F_W(p) = f$. Each packet must be written to a different memory. We refer to this type of conflict as an arrival write conflict. For each arriving packet that needs to be written in the current frame f, there can be at most b − 1 arrival write conflicts.

• Arrival read conflicts: At the current frame f, there can be at most b packets that must be read from the DRAM modules for departure. That is, there can be at most b packets with $F_R(p) = f$. The memories assigned to the newly arriving packets cannot already hold packets that must be read in the current frame for departure. We refer to this type of conflict as an arrival read conflict. Because there can be at most b packets that must be read in the current frame for departure, there can be at most b arrival read conflicts.

• Departure read conflicts: Each arriving packet p that must be written into the DRAM modules in the current frame $F_W(p) = f$ will eventually need to be read in a future frame $F_R(p) = d$ for departure. At this future frame d, there may be other packets with the same departure frame. Therefore, the memories assigned to these other packets cannot be assigned to the current packet p. We refer to this type of conflict as a departure read conflict. Because there can be at most b packets departing in any frame, there can be at most b − 1 departure read conflicts.

Therefore, by the pigeonhole principle, we need at least $(b-1) + b + (b-1) + 1 = 3b - 1$ memories to avoid all three types of conflicts. That is, we must have K ≥ 3b − 1, which is a small constant that is independent of traffic arrival and the number of logical queues.

Finally, we must consider the case when a packet has a departure time less than 4b − 1 time slots away. To ensure that we implement the time-deterministic packet buffer abstraction described in Section II-A, we add to our architecture a small SRAM bypass buffer, as shown in Figure 1. With the SRAM bypass buffer, a packet can be written or read within a time slot. If the departure time of a packet is less than 4b − 1 time slots away, then it is written to this SRAM bypass buffer instead of the interleaved memories. This way, we can ensure that the packet departs at its pre-determined departure time. Since a packet is only written to this bypass buffer if its departure time is less than 4b − 1 time slots away, and all packets assigned to the bypass buffer will depart in the next 4b − 1 time slots, one per time slot if there is a packet with a departure time at that time slot, it is easy to see that we only need 4b − 1 packet locations in this bypass buffer. This size is also independent of traffic arrival and the number of logical queues.

Theorem 1: With 4b − 1 packet locations in the SRAM bypass buffer and K ≥ 3b − 1 DRAM modules, we can implement a single-write-single-read time-deterministic packet buffer.

Proof: By definition of single-write-single-read, each packet has a unique arrival time and a unique departure time. By definition of the architecture, packets with departure times less than 4b − 1 time slots away are written into the SRAM bypass buffer. Since at most one packet is inserted into the SRAM bypass buffer per time slot, and whenever the buffer already holds 4b − 1 packets one packet is guaranteed to depart in the current time slot, an SRAM bypass buffer with 4b − 1 locations always suffices, and each such packet is guaranteed to depart at its pre-determined departure time. For the DRAM operation, because each packet has a unique arrival time and a unique departure time, only b packets need to be written to the DRAM modules in each frame, and only b packets need to be read from the DRAM modules in each frame. Therefore, there can only be b − 1 arrival write conflicts, b arrival read conflicts, and b − 1 departure read conflicts. By the pigeonhole principle, we need one more memory to resolve all three types of conflicts. Therefore, K ≥ 3b − 1 suffices.

III. MEMORY MANAGEMENT IMPLEMENTATION

In this section, we address implementation issues related to the memory management control logic required for our packet buffer architecture.

A. DRAM Module Selection

We first consider the implementation of the memory management control logic for selecting an available DRAM module for storing a packet. In the last section, we showed that K = 3b − 1 DRAM modules suffice. To find a compatible memory for an incoming packet, we maintain a two-dimensional read transaction bitmap R that indicates which DRAM module has an already stored packet that must be read at each corresponding frame slot for departure. Each row in the bitmap corresponds to a frame slot, and each column corresponds to a DRAM module. There are exactly 3b − 1 columns in the bitmap, corresponding to the 3b − 1 DRAM modules.

[Fig. 2. Memory selection control logic. The read transaction bitmap R has one row per frame slot and one column per memory module (K = 3b − 1 columns); the write reservation bit-vector W has one bit per module. Candidate modules are computed by a bitwise NOR of W, R(f), and R(d), and a priority encoder returns the first compatible module (m = 2 in the depicted example).]

To avoid an infinite number of rows, we limit the furthest departure frame into the future that we need to consider, and we use circular indexing to map a frame slot to a row in the bitmap. In particular, let X be the furthest number of departure frames into the future that we consider. Then, via circular indexing, a frame slot f is mapped to the rth row of the bitmap, where $r = f \bmod X$.

We defer the discussion of the maximum X that we should consider to the next section. For ease of discussion in this section, we assume there is a one-to-one correspondence between a frame slot and a row in the bitmap. Given the convention that each row corresponds to a frame and each column corresponds to a memory module, we use R(f, m) to indicate whether or not the mth DRAM module has an already stored packet that must be read at the fth frame slot. That is,

$$R(f, m) = \begin{cases} 1 & \text{if yes} \\ 0 & \text{otherwise.} \end{cases}$$

We use the notation R(f) to denote the row-vector at the fth row of the bitmap. Initially, all bitmap locations are initialized to 0. This bitmap structure is depicted in Figure 2. Given this definition of the bitmap, there is a natural mapping to an SRAM implementation, for example using a 3b − 1 bit wide SRAM word and associating SRAM address location f with R(f).

In addition to the read transaction bitmap R, we maintain a write reservation bit-vector W. This bit-vector has 3b − 1 bit locations. In the current frame, we use W(m) to indicate whether the mth memory module has been assigned to an arriving packet. This bit-vector can, for example, be maintained with registers, and it is initialized to all 0's at the beginning of each frame slot to indicate that no memory selection has been made yet in the current frame.
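As a concrete illustration, here is a minimal software sketch of these two structures, assuming a hypothetical `SelectionState` container of our own naming (the hardware realization would be an SRAM for R and registers for W, as described above):

```python
class SelectionState:
    """Read transaction bitmap R and write reservation bit-vector W."""

    def __init__(self, b, X):
        self.K = 3 * b - 1          # number of DRAM modules (columns)
        self.X = X                  # number of future frames tracked (rows)
        # R[r][m] = 1 iff module m holds a packet to be read in the
        # frame that circularly maps to row r.
        self.R = [[0] * self.K for _ in range(X)]
        # W[m] = 1 iff module m is already reserved for a write
        # in the current frame.
        self.W = [0] * self.K

    def row(self, f):
        # Circular indexing: frame slot f maps to row f mod X.
        return f % self.X

    def new_frame(self):
        # W is cleared at the start of every frame slot.
        self.W = [0] * self.K
```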

At each frame f, we may need to write at most b newly arriving packets into the DRAM modules. For each of these b packets, we need to find a compatible DRAM module that avoids the three types of conflicts discussed in Section II-B, namely arrival write conflicts, arrival read conflicts, and departure read conflicts. We have b time slots to perform the memory selection for up to b packets, so we can afford to perform memory selection sequentially, one packet at a time. Moreover, we can use a simple greedy algorithm that selects the first compatible memory module for each packet. In particular, for each newly arriving packet p that must be written in the current frame f, we do the following:

• We use W to check for arrival write conflicts, and we use R(f) to check for arrival read conflicts. Let $d = F_R(p)$ be the future frame slot in which packet p needs to be read for departure. Then, we must also use R(d) to check for departure read conflicts.

• To determine if a memory module m is compatible for storing the packet, we can use the Boolean check $\overline{W(m) + R(f, m) + R(d, m)}$, which corresponds to a 3-input NOR operation. Instead of checking one memory module at a time, we can use a bitwise-NOR operation on the bit-vectors, $\overline{W + R(f) + R(d)}$, and feed the result into a priority encoder, which returns the index of the first compatible memory module. This is also depicted in Figure 2.

• Let n be the memory module selected for p. Then we set W(n) = 1 to indicate that the memory module has been selected and cannot be selected for another packet (ensuring no arrival write conflicts), and we set R(d, n) = 1 to indicate that this memory module will be busy with a read transaction at frame d.

Consider again Figure 2. Suppose we wish to find a compatible memory for a packet p with FW (p) = 36 (the arrival write frame) and FR (p) = 79 (the departure read frame), and suppose W shown corresponds to the write reservations that have already been made for this current time slot. By performing the bitwise-NOR on W , R(36), and R(79), we obtain the bit-vector 01001 · · · 1. The result of the priority encoder would be m = 2. Note that the above memory selection logic can be performed on-the-fly within one time slot as each new packet arrives. We do not have to wait for the whole frame of packets to arrive before trying to find a compatible memory module for each of them. By Theorem 1, we know that with K = 3b − 1 memory modules, a compatible memory can always be found for each arriving packet.
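As a software sketch of this selection step, building on the hypothetical `SelectionState` above (a hardware implementation would realize the bitwise NOR and priority encoder directly, as in Figure 2; note that modules are 0-indexed here, whereas the paper numbers them from 1):

```python
def priority_encode(bits):
    # Index of the first set bit, mimicking a hardware priority encoder.
    for m, bit in enumerate(bits):
        if bit:
            return m
    raise RuntimeError("no compatible module; unreachable when K = 3b - 1")

def select_module(state, f, d):
    """Greedily pick the first DRAM module compatible with writing, in
    frame f, a packet that must be read in frame d."""
    rf = state.R[state.row(f)]        # arrival read conflicts
    rd = state.R[state.row(d)]        # departure read conflicts
    # 3-input NOR per module: free iff not reserved for a write this
    # frame and not busy with a read in frame f or frame d.
    free = [1 - (state.W[m] | rf[m] | rd[m]) for m in range(state.K)]
    n = priority_encode(free)
    state.W[n] = 1                    # reserve for this frame's write
    state.R[state.row(d)][n] = 1      # module busy with a read in frame d
    return n
```

With the example of Figure 2, calling `select_module` with write frame 36 and read frame 79 would return module index 1 (the paper's m = 2 under 1-based numbering). By Theorem 1, the encoder always finds a free module when K = 3b − 1, so the exception branch is unreachable under the stated operating assumptions.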

B. Size of the Bitmap

As mentioned in the previous section, we can limit the number of rows in the bitmap to correspond to the furthest number of departure frames into the future that we wish to consider. One way to set this limit is relative to the size of the packet buffer. As mentioned, one common rule of thumb is to size the packet buffer in accordance with the product of the average round-trip-time and the line rate [5]. Suppose the size of the packet buffer is T packets. Then, under a first-come-first-serve best-effort service discipline, it is reasonable to limit the furthest departure time slot relative to the current time slot to T, which in turn corresponds to limiting the furthest departure frame relative to the current frame to approximately

$$X = \frac{T}{b}.$$

Since each row in the bitmap corresponds to 3b − 1 bits, the size of the bitmap can be roughly calculated as

$$X(3b - 1) = \frac{T(3b - 1)}{b},$$

which in turn is only about 3 bits per packet (about 1% of the size of a 40-byte packet). Assuming a round-trip-time of 250 ms and a line rate of 40 Gb/s, the common rule of thumb for sizing packet buffers would correspond to a memory requirement of about T = 3 × 10^7 packets, which would correspond to a bitmap of approximately 11 MB. With current technology, this can be implemented with a few external SRAMs.
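The arithmetic behind these figures can be checked in a few lines. This is a back-of-the-envelope sketch: b = 20 is an example value taken from Section III-C, and the 40-byte packet and 250 ms round-trip-time are the paper's working assumptions:

```python
b = 20                                  # example transaction time in slots
line_rate = 40e9                        # 40 Gb/s
rtt = 0.250                             # 250 ms average round-trip-time
packet_bits = 40 * 8                    # 40-byte packets

T = line_rate * rtt / packet_bits       # buffer size in packets
bitmap_bits = T * (3 * b - 1) / b       # X rows of (3b - 1) bits each

print(T)                                # ~3.1e7 packets
print(bitmap_bits / T)                  # ~3 bits per packet
print(bitmap_bits / 8 / 2**20)          # ~11 MB of bitmap SRAM
```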

C. Location of Packets in a DRAM Module

Section III-A described the memory management control logic for selecting a compatible memory module for writing a packet. In this section, we address several additional details that must be considered. First, when a packet p is assigned to a memory module m, we need a way to assign the actual memory location used to store the packet. Then, later at a future frame slot f, we need a way to determine 1) which packets must be read (up to b packets), 2) from which memory modules to retrieve the packets, 3) which memory location within each memory module to read, and 4) in which order to de-aggregate the packets for departure.

At frame slot f, we can use the read transaction bitmap R to answer (2). That is, we know we need to retrieve a packet from memory module m if R(f, m) = 1. By construction, there can be at most b memory modules with R(f, m) = 1.

To answer (1) and (3), we need to consider the method we use to allocate memory locations within each memory module. One approach would be to maintain a lookup table that stores for each time slot a pointer to the memory location where
the corresponding packet is stored. The lookup table could be indexed by the time slot at which a packet must be read for departure. However, such a lookup table may be too large and expensive in SRAM.

Alternatively, we can partition the packet locations in a memory module by frame slots. Suppose we need to store a packet p with $d = F_R(p)$, the frame slot in which the packet must be read for departure, and suppose the packet has been assigned to memory module m. Then, we store p in the dth packet location in memory module m. By ensuring that all conflicts have been resolved in the memory selection step, we are guaranteed that at most one packet p will occupy a given packet location in a memory module, corresponding to the frame slot $F_R(p)$. Then, for packet retrieval at frame slot f, we look up the read transaction bit-vector R(f) to determine which memory modules contain a packet that must be read in this frame, i.e., where R(f, m) = 1, and we read the packet location corresponding to f in each such memory module m. As discussed in Section III-B, we can limit the total number of frames considered to X, where X corresponds to the number of frames into the future that we need to consider, and we again use circular indexing to map a frame to a packet location in the memory. If we again let T be the size of the packet buffer in terms of the number of packets, we let X = T/b, and we consider 3b − 1 memory modules, then the total number of packet locations across the 3b − 1 memory modules is approximately 3T, which represents a memory utilization of approximately 33%. Again using an average round-trip-time of 250 ms and a line rate of 40 Gb/s, we would need approximately 3 GB of DRAM instead of 1 GB, which is still a modest amount of DRAM given current technology.

Finally, to answer (4), since the packets are retrieved by frame slots, not by time slots, we need a way to determine the order in which to de-aggregate the packets for departure. One way is to store the departure time of a packet with the packet itself in DRAM. For example, using T = 3 × 10^7 as the furthest time slot into the future that we need to consider, the departure timestamp can be represented using $\lceil \log_2 T \rceil = 25$ bits, which is acceptable relative to the packet size since this timestamp is stored in DRAM. However, since we are retrieving packets in frames of b packets, we only need to determine the relative order of the packets within their departure frame. Therefore, we can simply store a relative departure time with each packet p in DRAM: $T_D(p) \bmod b$. This relative timestamp can be represented using only $\lceil \log_2 b \rceil$ bits. For example, for b = 20, $\lceil \log_2 b \rceil = 5$ bits. This is a very small amount of storage relative to the packet size, and again this relative timestamp can be stored with the packet in DRAM.
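In other words, the DRAM address of a packet is fully determined by its departure frame and its assigned module. A minimal sketch of this address computation (function names are ours; the module assignment itself comes from the selection logic of Section III-A):

```python
import math

def dram_location(t_depart, b, X):
    # Packets are stored at the location indexed by their departure
    # read frame F_R(p) = ceil(T_D(p) / b) - 1, wrapped circularly
    # over the X frames being tracked.
    d = math.ceil(t_depart / b) - 1
    return d % X

def relative_order(t_depart, b):
    # Relative departure time within the departure frame, stored
    # alongside the packet in DRAM using only ceil(log2 b) bits.
    return t_depart % b
```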

To ensure that packets are de-aggregated in the correct order, we can simply keep a de-aggregation buffer with b locations, and each retrieved packet is stored in the location corresponding to its relative departure time.
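A small sketch of this de-aggregation buffer, under the same naming assumptions as the earlier sketches:

```python
def fill_deaggregation_buffer(retrieved, b):
    # `retrieved` holds (packet, t_depart) pairs read back for one
    # departure frame (at most b of them).
    buffer = [None] * b
    for packet, t_depart in retrieved:
        buffer[t_depart % b] = packet     # slot = T_D(p) mod b
    return buffer

def emit(buffer, t, b):
    # At time slot t, the departing packet (if any) sits in slot t mod b.
    return buffer[t % b]
```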

IV. CONCLUSIONS

In this paper, we described a novel packet buffer architecture based on interleaved memories that takes advantage of the known departure times of packets to achieve simplicity and determinism. The switch-memory-switch router architecture [3], [4] and the load-balanced router architecture [1], [2], amongst others, are example architectures in which the departure times of packets can be deterministically calculated in advance of packet insertion into packet buffers when best-effort routing is considered. Our proposed packet buffer architecture can support an arbitrary number of logical FIFO queues with the only assumption that the departure times of packets can be determined in advance of packet insertion. This is in contrast to other proposed packet buffer architectures that assume random access. We showed that the number of interleaved DRAM banks required to implement our proposed packet buffer architecture is independent of the number of logical queues, yet the proposed architecture can achieve the performance of an SRAM packet buffer implementation.

REFERENCES

[1] C. S. Chang, D. S. Lee, Y. S. Jou, "Load balanced Birkhoff-von Neumann switches, Part I: one-stage buffering," Computer Communications, vol. 25, pp. 611-622, 2002.
[2] I. Keslassy, S. T. Chuang, K. Yu, D. Miller, M. Horowitz, O. Solgaard, N. McKeown, "Scaling Internet routers using optics," Proceedings of ACM SIGCOMM, Karlsruhe, Germany, 2003.
[3] A. Prakash, S. Sharif, A. Aziz, "An $O(\log^2 N)$ parallel algorithm for output queuing," Proceedings of IEEE INFOCOM, New York, NY, June 2002.
[4] S. Iyer, R. Zhang, N. McKeown, "Routers with a single stage of buffering," Proceedings of ACM SIGCOMM, Pittsburgh, PA, Sep. 2002.
[5] C. Villamizar, C. Song, "High performance TCP in ANSNET," ACM Computer Communication Review, 1995.
[6] CAIDA, "Round-trip time measurements from CAIDA's macroscopic Internet topology monitor," available at http://www.caida.org/analysis/performance/rtt/walrus0202.
[7] S. Iyer, N. McKeown, "Designing buffers for router line cards," Stanford University HPNG Technical Report TR02-HPNG-031001, Stanford, CA, Mar. 2002.
[8] J. Garcia, J. Corbal, L. Cerda, M. Valero, "Design and implementation of high-performance memory systems for future packet buffers," Proceedings of the 36th International Symposium on Microarchitecture (MICRO-36), pp. 373-386, 2003.
[9] G. Shrimali, N. McKeown, "Building packet buffers with interleaved memories," Proceedings of the Workshop on High Performance Switching and Routing, Hong Kong, May 2005.
[10] S. Kumar, P. Crowley, J. Turner, "Design of randomized multichannel packet storage for high performance routers," Proceedings of the 13th Annual IEEE Symposium on High Performance Interconnects (HotI-13), Stanford, CA, August 2005.
