Using Programmable NICs for Time-Warp Optimization
Ranjit Noronha and Nael B. Abu-Ghazaleh
Computer Science Department, State University of New York, Binghamton, NY 13902
{rnoronha,nael}@cs.binghamton.edu
Abstract
This paper explores the optimization of Parallel Discrete Event Simulators (PDES) on a cluster of workstations with programmable Network Interface Cards (NICs). We explore reprogramming the firmware on the NIC to optimize the performance of distributed simulation. This is a new implementation model for distributed applications where: (i) application-specific communication optimizations can be implemented on the NIC; (ii) the most heavily communicating portions of the application can be migrated to the NIC; (iii) some messages can be filtered out at the NIC without burdening the host processor resources; and (iv) critical events are detected and handled early. The combined effect is to optimize the application's communication behavior as well as to reduce the load on the host processor resources. We explore this new model by implementing two optimizations to a Time-Warp simulator on the NIC: (1) the migration of the Global Virtual Time estimation algorithm to the NIC; and (2) early cancellation of messages in place upon early detection of rollbacks. We believe that the model generalizes to other distributed applications.

Keywords: Clusters, Programmable NIC, Time Warp, Parallel Discrete Event Simulation

This work was partially supported by NSF Grant EIA-9911099.

1. Introduction

The emergence and commercial success of clustering technologies using commodity components allow scalable, cost-effective parallel processing machines to be built easily [3, 4, 24, 29]. Clusters approach the performance of custom parallel machines by using high-performance local/system area networking technologies and standards (such as Myrinet [6], SCI [16], and others [5, 30, 31]) and low-overhead user-level communication protocols (such as the Basic Interface for Parallelism (BIP) [15], Illinois Fast Messages (IFM) [23], and others [11, 20]). In a typical node in a workstation cluster, the NIC resides on the I/O bus, which is connected to the system bus through a bus adapter. A message traverses an I/O bus twice: once when it is transferred (using either DMA or programmed I/O) from the sender's host buffers to its NIC buffers, and again at the receiver's end, where it is transferred from the NIC to the host. With networking technology improving rapidly, the network can now deliver messages at a rate that overwhelms the ability of the host workstation to handle them. The bottlenecks include: (i) the I/O bus: at the full network bandwidth of a 4 Gb/sec Myrinet network, 100% of a typical I/O bus bandwidth (64-bit, 66 MHz PCI bus) is consumed by network traffic; (ii) the system bus: this is a well-known bottleneck even without communication [7], and it is significantly exacerbated by network traffic [19]; and (iii) the CPU: CPU time is needed to handle the messages (interrupt handling, context switches, parsing/generating headers, buffer management, checksums, etc.). For fine-grained applications, these factors result in low effective communication bandwidth and draw resources away from the application.

Recently, NIC vendors have started developing high-end, affordable, programmable NICs [1, 2, 28]. Providing programmability on the NIC opens the door to a new system model for implementing distributed applications. More specifically, application-specific customization of the NIC becomes possible. More generally, any portion of the application may be implemented on the NIC to optimize performance in the following ways:

1. Migrate "distributed" portions of the application, defined informally as the objects that are more closely related to objects on other nodes than to the
local objects, to the NIC processor, so that they can tap into the network directly at low latency and high bandwidth. This migration reduces the traffic across the NIC-to-host interface, freeing up the host resources for local object processing;
2. Detect and react quickly to urgent or unexpected events. Normally, such events carry their urgency semantics at the application level, so they must percolate through the system and up to the application before they can be handled. Detecting and handling them at the NIC bypasses this cost;

3. Filter (or generate) messages directly on the NIC, further reducing the traffic to the host; and

4. Allow communication monitoring and profiling at a low level not available to applications under the traditional model.

Implementing intelligence in the I/O subsystem (or even specifically in the NIC) is not a new idea; in fact, the idea of DMA itself is an example of such intelligence. Our work differs in that we seek to make this intelligence application specific. There have been a small number of investigations into using the programmability features to implement application-specific communication primitives on the NIC [8, 13, 17, 32]. The SPINE operating system provides mechanisms for off-loading user-level primitives to the NIC by extending the operating system [12]. In this paper, we present initial experiments with this model using parallel discrete event simulation (PDES) as an application. We explore two optimizations to a Time-Warp simulator using this model: the migration of the Global Virtual Time estimation algorithm to the NIC, and early cancellation of outgoing messages on the NIC after detecting incoming rollback messages.

2. Time Warp Simulation

Parallel Discrete Event Simulation (PDES) [14] can potentially increase the performance and capacity of a simulation by partitioning the simulation model across a collection of concurrent simulators called Logical Processes (LPs). Each LP maintains a Local Virtual Time (LVT) and communicates with other LPs by exchanging time-stamped event messages. In optimistic simulation (Time Warp), no explicit synchronization is enforced among the LPs; each LP processes its local events in timestamp order. A causality error occurs when an arriving message has a lower timestamp than the LVT (a straggler event), forcing the LP to halt processing and roll back to the earlier virtual time of the straggler. Thus, each LP must maintain state and event histories to enable recovery from straggler events. The progress of the simulation is determined by the smallest timestamp of an unprocessed event in the simulation (taking into account messages in transit); this value is called the Global Virtual Time (GVT), and its estimation can be shown to be similar to the distributed snapshot algorithm [18]. Rollbacks to times before GVT are not possible since events cannot be generated in the past. GVT is used as a marker to garbage collect histories no longer needed for rollback. Time Warp simulations communicate heavily and at very fine granularity; this communication overhead substantially affects performance on distributed memory machines (and clusters) [9]. This makes them a suitable application for Active NIC optimization.

3. Proposed Optimizations

In deciding what optimizations to implement on the NIC, the following factors were considered: (i) NIC resources are severely limited. The processor is the equivalent of 10-year-old technology and is already saddled with its other responsibilities; furthermore, the available memory (1 Mbyte) is restrictive; and (ii) we would like to demonstrate migration of distributed functionality to the NIC as well as filtering of messages based on information available at the NIC. With the newer generation of NICs with better processors and more memory [21, 28], we expect a much larger scope of optimizations to become available. We elected to implement the following two optimizations as an initial demonstration of feasibility: migration of the GVT computation to the NIC, and early cancellation of erroneous messages on the NIC.

3.1. NIC-level GVT

GVT is the minimum over the timestamps of all unprocessed messages at all the logical processes, including those in transit. GVT places a lower bound on the progress of the simulation and therefore guarantees that no earlier events can be generated. With this knowledge the LPs can garbage collect event and state histories. GVT is also used for termination detection. GVT estimation operates concurrently with the simulation: if it is carried out aggressively, it incurs a higher overhead, but the obtained estimate is tighter, allowing more timely garbage collection. WARPED implements two GVT algorithms, pGVT [10] and Mattern's algorithm [18]. We use Mattern's algorithm because it has a lower overhead and produces good estimates. Mattern's algorithm is similar to classic two-round distributed snapshot algorithms.
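Before turning to the NIC-level design, the following minimal sketch summarizes the host-side Time Warp mechanics from Section 2 that both optimizations build on: straggler detection, rollback, and GVT-driven garbage (fossil) collection. The types and method names are our own simplifications for exposition and are not the WARPED kernel's actual interfaces.

```cpp
#include <algorithm>
#include <deque>
#include <vector>

// Illustrative types only; the real simulator's classes differ.
struct Event      { double recvTime = 0.0; /* payload omitted */ };
struct Checkpoint { double lvt = 0.0;      /* saved LP state omitted */ };

class LogicalProcess {
public:
    // Called for every arriving event message.
    void onEventMessage(const Event& e) {
        if (e.recvTime < lvt_)
            rollback(e.recvTime);                 // straggler: causality error detected
        pending_.push_back(e);                    // (re)schedule the event
    }

    // Optimistically process the lowest-timestamp pending event.
    void processNext() {
        auto it = std::min_element(pending_.begin(), pending_.end(),
            [](const Event& a, const Event& b) { return a.recvTime < b.recvTime; });
        if (it == pending_.end()) return;
        lvt_ = it->recvTime;                      // advance Local Virtual Time
        history_.push_back({lvt_});               // checkpoint for a possible rollback
        pending_.erase(it);                       // ...event handling itself omitted...
    }

    // Histories older than the GVT estimate can never be rolled back to.
    void fossilCollect(double gvt) {
        while (!history_.empty() && history_.front().lvt < gvt)
            history_.pop_front();
    }

private:
    // Restore the latest checkpoint not later than the straggler; with
    // aggressive cancellation, anti-messages would be sent here (omitted).
    void rollback(double stragglerTime) {
        while (!history_.empty() && history_.back().lvt > stragglerTime)
            history_.pop_back();
        lvt_ = history_.empty() ? 0.0 : history_.back().lvt;
    }

    double lvt_ = 0.0;                            // Local Virtual Time
    std::deque<Checkpoint> history_;              // state history for rollback
    std::vector<Event>     pending_;              // unprocessed events
};
```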
0-7695-1573-8/02/$17.00 (C) 2002 IEEE
Figure 1. Consistent Cuts (logical processes LP0-LP3 with local virtual times LVT0-LVT3; messages M1-M6 crossing cuts C1 and C2)

Figure 2. Implementation (host: keep track of WHITE messages, track the minimum timestamp of RED messages sent, calculate LVT; NIC: track V, T, and Tmin across hosts, generate and receive GVT messages, decide on GVT termination, report new GVT values)
In Figure 1, C1 represents the point at which we decide to invoke the estimate, while C2 represents a point in the future where a consistent estimate is obtained; C1 and C2 should be close together so that GVT is tightly bounded. GVT is the minimum over the timestamps of all the LPs and of the messages sent between cuts C1 and C2, including messages that cross cut C2. All processes are WHITE initially and in this state count the number of messages (WHITE messages) they send out. A designated root LP starts the process: the root process LP0 turns RED and sends a GVT token to process LP1. Process LP1, on receiving the GVT token, turns RED and forwards the token to LP2, and so on until the token returns to the root process LP0. As the GVT token circulates, each LP adds to a counter the number of WHITE messages it has sent and subtracts from it the number of WHITE messages it has received. When the token reaches LP0, all processes are RED and the counter contains the number of messages that are in transit. LP0 then initiates circulation of the token for a second round. As each LP receives the token, it subtracts the number of WHITE messages received since the token was last seen. Once the token is received by LP0 again, if the counter is 0, the token is circulated once more to inform all nodes of the GVT value. If the counter is larger than 0, additional rounds are initiated until all the WHITE messages have been received.

GVT computation is not on the critical path of the simulation and can be performed in the background. It is also not computationally intensive and does not place excessive load on the NIC. We can save bandwidth by intercepting messages at the NIC and preventing them from crossing the I/O bus. GVT information can also be piggybacked on several of the normal message fields, which carry pointer information only useful on the originating LP. In WARPED, when running Mattern's GVT algorithm entirely on the host, it is not possible to piggyback this information on an outgoing event message, since we cannot guarantee that LPn will send an event message to LP(n+1) mod m, as required by the algorithm, where m is the total number of LPs. Finally, the migration of GVT from the host to the NIC is transparent to the applications being simulated, in this case RAID and POLICE.

Figure 2 shows the division of the implementation of the algorithm between the host processor and the NIC. The host is responsible for deciding when to change color, keeping track of LVT (all objects are on the host), and tracking the minimum timestamp of all outgoing RED messages, which reduces the latency on the send side at the NIC. The NIC keeps track of the number of outstanding WHITE messages (V), the minimum timestamp received so far (Tmin, initially infinity), and the Local Virtual Time (LVT).

Consistency is a major issue introduced by this implementation model. On the receive side, messages that are seen by the NIC (causing an update in state, for example LVT) spend time in the OS, message library, and/or application buffers before they are seen by the application. The reverse is true on the send side. For example, the NIC may report LVT to be some value which is not the true LVT value because of a rollback message that has not yet been received by the host. We expect this consistency problem to arise whenever state is shared between the NIC and the host. The GVT implementation must take care of this inconsistency, or erroneous GVT values may be computed.

In the remainder of this section we describe the implementation. Initially, each LP reports its rank to the NIC through the global buffer shared between the host and the NIC. The Communication Manager (CM), the module in the LP responsible for communication, initializes to 0 the control flag that indicates whether a GVT token is piggybacked. The Mattern GVT Manager at the root (LP0) initiates GVT computation by reporting the values of V (the number of WHITE messages), Tmin (the minimum of the received timestamps), and T (the local virtual time estimate at the NIC) to the CM on the host processor and asking it to send out a control message. The CM in turn sets a bit in an outgoing event message and encodes the values of T, Tmin, and V in four unused fields of the Basic Event Message. The NIC at the root, on receiving this message for the first time, extracts the values of T, Tmin, and V and stores them temporarily. This handshaking is carried out to enforce consistency. Whenever it gets a chance, the NIC marshals the values of T, Tmin, and V into a special GVT message and forwards it to LP1's host. The NIC also has the following variables: TimewarpInitialised, set by BIP to indicate that BIP has been initialized and the rank variable has been written; GvtTokenPending, which indicates whether we are in the middle of a GVT computation; ControlMessagePending, which indicates
that a GVT control message has been received by the NIC and has been sent to the host for processing; and ReceivedHostVariables, which indicates that the pending control message was processed by the host and that the values (T, Tmin, V) have just come off the last outgoing message. On receiving a GVT token in round 0, the receiving NIC extracts the values of T, Tmin, and V. The NIC reports these values to the host processor and also requests an update. When the Mattern GVT Manager receives this message in memory, it updates the WHITE message count and changes its color. On receiving an incoming GVT message in any round other than round 0, the NIC first requests the values of V, T, and Tmin from the host and adds V to the value of V in the token. At the root LP, if the result is 0, the NIC broadcasts a new GVT message to all other logical processes. Following that, it reports the value of GVT to its own host. The full implementation details can be found elsewhere [22].
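As an illustration of the NIC-resident bookkeeping described above, the sketch below condenses the token handling into a single handler. The structure layouts and the firmware hooks (sendToNextHost, broadcastGvt, reportToHost) are assumptions for exposition, and the multi-round accounting and NIC-host handshaking of the actual implementation are omitted.

```cpp
#include <cstdint>
#include <cstdio>

enum Color : uint8_t { WHITE, RED };

struct GvtToken {                 // values carried by Mattern's control token
    int32_t V;                    // outstanding WHITE-message count (sent minus received)
    double  T;                    // running minimum of the hosts' LVT estimates
    double  Tmin;                 // minimum timestamp of RED messages sent so far
};

struct NicGvtState {
    Color   color = WHITE;
    int32_t whiteSent = 0, whiteReceived = 0;
    double  lvt    = 0.0;         // LVT as last reported by the host
    double  redMin = 1e308;       // effectively "infinity"
    bool    gvtTokenPending = false;
};

// Stand-ins for the real BIP/NIC send paths (hypothetical hooks).
static void sendToNextHost(int rank, const GvtToken& t) { std::printf("token -> LP%d (V=%d)\n", rank, t.V); }
static void broadcastGvt(double gvt)                    { std::printf("GVT broadcast: %f\n", gvt); }
static void reportToHost(double gvt)                    { std::printf("GVT to host: %f\n", gvt); }

// Invoked when a GVT token arrives at this NIC.
void onGvtToken(NicGvtState& s, GvtToken tok, int myRank, int numLPs) {
    if (s.color == WHITE) s.color = RED;            // first sight of the token: turn RED
    tok.V    += s.whiteSent - s.whiteReceived;      // account for WHITE messages still in transit
    tok.T     = (s.lvt    < tok.T)    ? s.lvt    : tok.T;
    tok.Tmin  = (s.redMin < tok.Tmin) ? s.redMin : tok.Tmin;
    if (myRank == 0 && tok.V == 0) {                // back at the root, no WHITE messages in flight
        double gvt = (tok.T < tok.Tmin) ? tok.T : tok.Tmin;
        broadcastGvt(gvt);                          // inform every other LP's NIC
        reportToHost(gvt);                          // and our own host
        s.gvtTokenPending = false;
    } else {
        sendToNextHost((myRank + 1) % numLPs, tok); // forward the token around the ring
        s.gvtTokenPending = true;
    }
}
```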
Figure 3. Early Message Cancellation ((a) message send path: WARPED buffers, MPICH buffers (64K), NIC buffer (4K), network; (b) example: an anti-message (straggler) with timestamp 100 cancels buffered messages with timestamps 102, 110, 115, and 120)
3.2. Early Message Cancellation

Time Warp is based on the optimistic strategy for parallel discrete event simulation. More specifically, Logical Processes proceed in parallel with no synchronization and use a detect-and-recover strategy to deal with causality errors. Recovery consists of restoring the simulation to an earlier valid state and sending out negative messages (anti-messages) to cancel erroneously sent messages. We use aggressive cancellation [27], where erroneous messages are canceled immediately (via anti-messages) when a causality error is detected. Our second optimization, Early Message Cancellation, explores deleting messages in the NIC buffer if a rollback message with an earlier time is detected. Such messages are overly optimistic and would have to be canceled using an anti-message once the rollback is detected. Eliminating these messages in place therefore saves the cost of sending them (and handling them at the destination), as well as the later cost of canceling them. This optimization was chosen to demonstrate how the NIC can be used to intelligently filter out some traffic and increase simulation efficiency. We believe that there are numerous opportunities for such optimizations, both for PDES and for other distributed applications; however, we are currently limited by NIC speed and the immaturity of the model and programming environment.

Any message sent from WARPED follows the path shown in Figure 3(a), spending a substantial amount of time in buffers before actually being sent out over the network. At some point, the NIC or the host might decide that an event message still in the buffers is not needed and can be dropped. The technique we use is to peek at received anti-messages at the NIC level and discard messages from the send queue based on the receive timestamp of the anti-messages. An example is shown in Figure 3(b), where messages with timestamps 120, 115, 110, and 102 can be discarded because of the anti-message received with timestamp 100.

Before we describe the implementation, we discuss some of the problems experienced with the communication layers, BIP and MPICH, when packets are dropped. First, BIP maintains sequence numbers to help in the ordering of packets, making it necessary to turn off sequence numbers while implementing packet dropping. Other approaches would require the NIC to maintain state information about dropped packets or to inform the host of packet drops, both of which are difficult in the current implementation because of the limited capabilities of the NIC and I/O bus contention. The second problem lies with the implementation of credit-based flow control in MPICH. Since additional credit is piggybacked on packets from the receiver back to the sender, dropped packets cause credit to be lost and the sender's window to close up. We address this problem by enabling sequence numbers in MPICH so that lost packets can be detected immediately and the receiver can update its estimate of the number of credits the sender has used. In addition, the NIC keeps track of the credit from dropped packets for a particular destination and updates the credit information on a subsequent packet headed for that destination. Finally, the sending window is increased, allowing the sender to send for longer periods of time and to recover in the case of a block of dropped packets.

The algorithm begins by scanning the receive queue on the NIC for anti-messages, whose receive timestamps are recorded. This timestamp is compared to the send timestamp of all outgoing messages sent before the anti-message is received at the host (the host reports
the last received anti-message timestamp to the NIC by piggybacking its receive timestamp on all outgoing messages). If the received anti-message timestamp is less than the outgoing message timestamp, the message is dropped. The event IDs of all dropped messages are recorded so that we can either prevent the sending of the corresponding anti-message at the host or drop it at the NIC. Due to space limitations, we discuss the implementation only briefly. Whenever the NIC receives a message, it checks whether it is an anti-message. If so, it records the timestamp of that message in a variable that the send queue uses to cancel messages. The simulator's Input Queue makes a note of the timestamp of the last processed anti-message, which is required by the CM on the sending side. It must record only the timestamps of anti-messages that have been processed by the NIC, that is, the timestamps of anti-messages from remote objects. This differentiation is achieved using the object IDs defined by the applications RAID and POLICE, and it would need to be recoded for a new application. The CM piggybacks the timestamp of the last received anti-message on the Next-object field of every outgoing message. This is necessary to maintain consistency: otherwise, it would not be possible to discriminate between messages generated before the anti-message was processed by the host (which should be canceled) and messages generated after it (which should not). The logic on the send-side queue of the NIC is the most complicated. Whenever we drop a positive message, we know that at some point the host will try to cancel it, and therefore we need to track the event IDs of all canceled messages. For every object on the LP we allocate a buffer of size 10, declared in the global structures of the NIC so that it can be accessed by both the host and the NIC. The host can avoid sending negative messages by consulting this buffer, while the NIC can filter out negative messages that the host sent before the corresponding positive message was dropped on the NIC. Finally, the Timewarp object, which is responsible for the generation of anti-messages, first scans the event-ID buffer on the NIC; if the event ID is present in the buffer, the anti-message is not generated. Implementation details can be found elsewhere [22].
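A simplified sketch of this send-queue filtering logic follows. The message layout, the queue container, and the unbounded dropped-ID set are illustrative assumptions (the real NIC uses a fixed per-object buffer of size 10), and the host-piggybacked anti-message timestamp is modeled as a single field.

```cpp
#include <cstdint>
#include <deque>
#include <limits>
#include <unordered_set>

struct QueuedMsg {
    uint32_t eventId;           // event identifier assigned by the simulator
    double   sendTime;          // send timestamp of the event message
    double   hostAntiTime;      // last anti-message timestamp the host had seen
                                // when it generated this message (piggybacked)
    bool     isAntiMessage;
};

class NicSendQueue {
public:
    // Receive path: remember the receive timestamp of the latest anti-message.
    void onIncomingAntiMessage(double antiTime) { lastAntiTime_ = antiTime; }

    void enqueue(const QueuedMsg& m) { pending_.push_back(m); }

    // Send path: drop messages made obsolete by the rollback, e.g. an
    // anti-message with timestamp 100 cancels queued messages with send
    // timestamps 102, 110, 115, and 120 (Figure 3(b)).
    void filter() {
        for (auto it = pending_.begin(); it != pending_.end(); ) {
            bool staleOptimistic = !it->isAntiMessage &&
                                   it->sendTime > lastAntiTime_ &&
                                   it->hostAntiTime < lastAntiTime_;   // generated before the host saw the rollback
            bool redundantAnti   = it->isAntiMessage &&
                                   dropped_.count(it->eventId) != 0;   // cancels a positive we already dropped
            if (staleOptimistic) {
                dropped_.insert(it->eventId);   // so the host's later anti-message can be filtered too
                it = pending_.erase(it);
            } else if (redundantAnti) {
                it = pending_.erase(it);
            } else {
                ++it;
            }
        }
    }

private:
    std::deque<QueuedMsg>        pending_;   // NIC send buffer (simplified)
    std::unordered_set<uint32_t> dropped_;   // event IDs of dropped positive messages
    double lastAntiTime_ = std::numeric_limits<double>::infinity();   // nothing to cancel yet
};
```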
4. Experimental Study

In this section, we present an experimental study of the proposed optimizations on a Myrinet-connected cluster. The cluster has eight nodes; each node is a 2-way SMP with Pentium III 550 MHz processors running RedHat Linux 6.2. The machines are connected by a 1.2 Gbps Myrinet switch, and the NICs use LANai4 processors (66 MHz, 1 MB dual-ported SRAM); the optimizations were implemented by reprogramming the firmware of this processor. We used the Message Passing Interface (MPI) running on top of the Basic Interface for Parallelism (BIP) suite, a lightweight user-level communication protocol for Myrinet [15, 25]. BIP runs directly on top of the hardware (it bypasses TCP/IP). The optimizations were implemented for the WARPED simulation engine, a configurable Time-Warp parallel discrete event simulator [26]. We present results using two of the applications provided with the WARPED release: RAID, which models the operation of a RAID-5 disk array, and POLICE, a simple model of a traffic police telecommunications network.

4.1. NIC-level GVT

Figure 4. RAID GVT Execution Time (simulation time in seconds versus GVT period in events, WARPED versus NIC GVT)

RAID was simulated using 10 processes sending disk I/O requests to 8 forks, which in turn forward the requests to one of the 8 disks in the simulation. There are a total of 8 LPs. Figure 4 shows the performance of the simulation with and without the NIC-level implementation of GVT. When performing GVT aggressively (GVT COUNT = 1, effectively performing GVT after every event is processed), NIC-GVT outperforms the standard implementation. As we decrease the frequency of GVT (increase GVT COUNT), the execution time of NIC-GVT increases while that of WARPED decreases, until the two implementations perform almost identically. A probable explanation of this behavior is that when GVT is performed aggressively, more GVT messages must be generated and sent in the traditional implementation. These messages take up resources (CPU and memory) and create additional contention for the I/O bus. In contrast, no additional memory has to be allocated for a message in NIC-GVT, since all the information is generated at the NIC and piggybacked on other messages. However, as we reduce the frequency of GVT computation, NIC-GVT becomes slightly slower than WARPED.
Figure 5. Police GVT Performance ((a) execution time in seconds and (b) number of GVT rounds, versus GVT period in events, WARPED versus NIC GVT, 8 processors)
This is due to the fact that the NIC has to perform GVT checks on each incoming and outgoing message, adding overhead that is rarely useful when GVT is computed infrequently. The results for the POLICE model with 8 LPs are shown in Figure 5(a). The same pattern observed for RAID is seen for POLICE as well. At highly aggressive GVT, the traditional implementation breaks down because the communication traffic overwhelms the host processor resources. Since the messages are generated by the NIC, the optimized version does not break down. As GVT is carried out less aggressively, the gap between the two implementations narrows until they are almost identical when GVT is performed very infrequently. With highly aggressive GVT, in addition to not requiring the resources for generating GVT messages and delivering them to the NIC, we found that the number of GVT rounds carried out at the NIC remained relatively constant because the NIC opportunistically forwards the GVT information (Figure 5(b)).
4.2. Early Cancellation

RAID was simulated using 16 source processes, 8 forks, and 8 disks spread across 8 LPs in the cluster. We took readings at 50000, 100000, 200000, and 400000 disk requests. The execution times scale almost linearly with the number of requests. Figure 6(a) shows the percentage speedup obtained from the optimization. A modest improvement in simulation time was obtained (less than 5%) due to the reduction in the number of messages generated. On closer examination, the percentage of messages canceled in place was small (less than 1%); we expect to be able to drop significantly more messages with a better NIC processor. Despite this small percentage, the total number of messages sent is reduced by a more appreciable amount (Figure 6(b)), due to the elimination of some rollbacks by directly canceling the erroneous messages that cause them.

Figure 8. Overall Messages Generated, including messages that will be canceled (messages sent versus number of police stations, WARPED versus direct cancellation)

The speedup obtained for POLICE was significantly higher than that for RAID at several of the simulation points (up to 27%; see Figure 7(a)). This improved speedup is due to a large percentage of the canceled messages being canceled at the NIC (Figure 7(b)). Moreover, as with RAID, the total message count (including messages that were canceled later) was reduced, apparently because of the reduction in rollbacks that results from eliminating some of the erroneous messages before they cause incorrect computation at their destination (Figure 8).
5. Conclusions and Future Work

In this paper, we investigated the optimization of a PDES simulator by programming the firmware of the network interface cards of a cluster of workstations.
Figure 6. RAID Early Cancellation ((a) performance improvement (%) and (b) messages sent, versus number of RAID disk requests, WARPED versus direct cancellation)
Figure 7. Police Early Cancellation ((a) performance improvement (%) and (b) percentage of canceled messages dropped by the NIC, versus number of police stations)
The processors on the NIC cards available for our experiments were not intended for general programmability (they are small CPUs with limited resources); therefore, we selected two lightweight optimizations in order to demonstrate the feasibility of the model and to understand the challenges and issues. As programmable cards with better processors continue to appear, a significantly larger class of optimizations may become feasible, both for simulation and for other distributed applications. Despite these limitations, the two optimizations we studied provided some improvement in the performance of our applications (in some instances a significant improvement). In the process, we learned the following lessons: (i) consistency is a recurring problem in this model whenever state is shared between the NIC and the host processor. Enforcing strong consistency via shared variables will be expensive in most cases, while relaxed consistency can be obtained by piggybacking handshaking
information on incoming and outgoing messages; and (ii) there is a need for tools and programming models that allow effective programming in this model. We are encouraged that the new NIC cards are offering mainstream OSes running on the NIC processor. We believe that the bottleneck on the transfer path between the NIC and the host processor will make offloading computation to the NIC increasingly attractive as network performance continues to increase, especially if the programmable NIC resources continue to improve. Making more resources available on the NIC will open the door to additional optimizations using this model, both for PDES and for other applications (distributed OSes, databases, file systems, etc.); this is a focus of our future research.
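As a rough illustration of lesson (i), a message header can reserve otherwise unused fields for the shared NIC/host state, so that the handshaking rides on messages that are sent anyway. The field names below are hypothetical and only loosely modeled on the fields described in Section 3.

```cpp
#include <cstdint>

// Hypothetical event-message header with piggybacked NIC/host handshaking.
struct EventMessageHeader {
    uint32_t srcObject, dstObject;   // normal event routing fields
    double   sendTime, recvTime;     // Time Warp timestamps
    // --- piggybacked handshaking state (reuses otherwise unused fields) ---
    uint8_t  senderColor;            // WHITE/RED for Mattern's GVT rounds
    double   lastProcessedAntiTime;  // lets the NIC tell pre- from post-rollback messages
    double   gvtT, gvtTmin;          // GVT token values when the control bit is set
    int32_t  gvtV;                   // outstanding WHITE-message count
    uint8_t  gvtControlBit;          // 1 if this message also carries the GVT token
};
```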
References

[1] 3Com EtherLink Server 10/100 PCI Network Interface Card with 3XP Processor. http://www.megahaus.com/tech/3com/nics/specs/3cr990svr97_spec.shtml.
[2] Alcatric 100x4 Quad Port Server Adapter. http://www.bellmicro.com/fibrechannel/connectivity/alacr/quad_port.htm.
[3] T. Anderson, D. Culler, and D. Patterson. The case for NOW (network of workstations). IEEE Micro, 15(1), Feb. 1995.
[4] D. Becker, T. Sterling, D. Savarese, J. Dorband, U. Ranawak, and C. Packer. BEOWULF: A parallel workstation for scientific computation. In International Conference on Parallel Processing, 1995.
[5] M. Blumrich, R. Alpert, Y. Chen, D. Clark, S. Damianakis, C. Dubnicki, E. Felten, L. Iftode, K. Li, M. Martonosi, and R. Shillner. Design choices in the SHRIMP system: An empirical study. In Proceedings of the Annual ACM/IEEE International Symposium on Computer Architecture, June 1998.
[6] N. Boden, D. Cohen, and W. Su. Myrinet: A gigabit-per-second local area network. IEEE Micro, 15(1), Feb. 1995.
[7] D. Burger, J. R. Goodman, and A. Kagi. Memory bandwidth limitations of future microprocessors. In 23rd International Symposium on Computer Architecture, May 1996.
[8] J. Chase, D. Anderson, A. Gallatin, A. Lebbeck, and K. Yocum. Network I/O with Trapeze. In Proceedings of 1999 Hot Interconnects, Aug. 1999.
[9] M. Chetlur, N. Abu-Ghazaleh, R. Radhakrishnan, and P. A. Wilsey. Optimizing communication in Time-Warp simulators. In Proceedings of the 12th Workshop on Parallel and Distributed Simulation, pages 64-71. Society for Computer Simulation, May 1998.
[10] L. M. D'Souza, X. Fan, and P. A. Wilsey. pGVT: An algorithm for accurate GVT estimation. In Proc. of the 8th Workshop on Parallel and Distributed Simulation (PADS 94), pages 102-109. Society for Computer Simulation, July 1994.
[11] C. Dubnicki, A. Bilas, Y. Chen, S. Damianakis, and K. Li. VMMC-2: Efficient support for reliable, connection-oriented communication. In Hot Interconnects V, Aug. 1997.
[12] M. Fiuczynski and B. Bershad. SPINE: A safe programmable and integrated network environment. In Proceedings of the Eighth ACM SIGOPS Workshop, 1998.
[13] M. Fiuczynski, R. Martin, T. Owa, and B. Bershad. On using intelligent network interface cards to support multimedia applications. In Proceedings of NOSSDAV'98, 1998. http://www.cs.washington.edu/homes/mef/research/spine/reports/nossdav98/.
[14] R. Fujimoto. Parallel discrete event simulation. Communications of the ACM, 33(10):30-53, Oct. 1990.
[15] P. Geoffray, L. Prylli, and B. Tourancheau. BIP-SMP: High performance message passing over a cluster of commodity SMPs. In Proceedings of Supercomputing (SC99), Nov. 1999.
[16] M. Ibel, K. Schauser, C. Scheiman, and M. Weis. High performance cluster computing using SCI. In Hot Interconnects V, Aug. 1997.
[17] R. Krishnamurthy, K. Schwan, R. West, and M. Rosu. A network co-processor-based approach to scalable media streaming in servers. In Proceedings of the International Conference on Parallel Processing (ICPP'00), 2000.
[18] F. Mattern. Efficient algorithms for distributed snapshots and global virtual time approximation. Journal of Parallel and Distributed Computing, 18(4):423-434, Aug. 1993.
[19] D. Mosberger, L. Peterson, and S. O'Malley. Protocol latency: MIPS and reality. Technical Report TR-9502, Department of Computer Science, The University of Arizona, Tucson, AZ, 1995.
[20] M-VIA: Virtual Interface Architecture for Linux, 2001. http://www.nersc.gov/research/FTG/via/.
[21] Myrinet, Inc. home page, 2001. http://www.myri.com.
[22] R. Noronha. Intelligent NICs: A feasibility study of improving performance of distributed applications by programming some of their components on the NIC. Master's thesis, Binghamton University, Binghamton, NY, 2001.
[23] S. Pakin, M. Lauria, and A. Chien. High performance messaging on workstations: Illinois Fast Messages (FM) for Myrinet. In Proceedings of Supercomputing (SC'95), 1995.
[24] G. Pfister. In Search of Clusters, 2nd Ed. Prentice Hall, 1998.
[25] L. Prylli. BIP messages user manual, 1998. Available at http://lhpca.univ-lyon1.fr/BIP-manual/index.html.
[26] R. Radhakrishnan, D. E. Martin, M. Chetlur, D. M. Rao, and P. A. Wilsey. An object-oriented Time Warp simulation kernel. In Proceedings of the International Symposium on Computing in Object-Oriented Parallel Environments (ISCOPE'98), volume LNCS 1505, pages 13-23. Springer-Verlag, Dec. 1998.
[27] R. Rajan and P. A. Wilsey. Dynamically switching between lazy and aggressive cancellation in a Time Warp parallel simulator. In Proc. of the 28th Annual Simulation Symposium, pages 22-30. IEEE Computer Society Press, Apr. 1995.
[28] Intelligent Ethernet interface solutions. http://www.ramix.com/tech/intelethernet.html.
[29] Scalable Computing Lab. SCL cluster cookbook: Building your own clusters for parallel computation, 1998. http://www.scl.ameslab.gov/Projects/ClusterCookbook.
[30] Virtual Interface Architecture (VIA) specification, 2001. http://www.viarch.org.
[31] M. Welsh, A. Basu, and T. von Eicken. ATM and Fast Ethernet network interfaces for user-level communication. In Proceedings of the Third High-Performance Computer Architecture Conference (HPCA'97), Feb. 1997.
[32] K. Yocum and J. Chase. Payload caching: High speed data forwarding for network intermediaries. In 2001 USENIX Conference, June 2001.