Architectural Software Support for Processing Clusters

J. Gutleber, E. Cano, S. Cittolin, F. Meijers, L. Orsini, D. Samyn
European Organization for Nuclear Research, CERN, 1211 Geneva 23, Switzerland
[email protected]

Abstract

Mainstream computing equipment and the advent of affordable multi-Gigabit communication technology permit us to address grand-challenge data acquisition and processing problems with clusters of inexpensive computers. Such networks typically incorporate heterogeneous equipment, real-time partitions and even custom devices. Vital overall system requirements are high efficiency and flexibility. In preceding projects we experienced the difficulty of covering both requirements at once. Intelligent I/O (I2O) is an industry specification that defines a uniform messaging format and execution environment for hardware- and operating-system-independent device drivers in computing systems with processor-based communication equipment. Mapping this concept to a distributed computing environment and encapsulating the details of the specification into an application-programming framework allow us to provide architectural support for (i) efficient and (ii) extensible cluster operation. This paper portrays our view of applying I2O to high-performance clusters. We demonstrate the feasibility of this approach by giving efficiency measurements of our XDAQ software framework for distributed data acquisition systems.

1. Introduction

"The biggest problem with creating distributed computing systems is devising a method of intercomputer communication that is reliable, fast and simple." - NASA contractor report 182505, "Computers in Spaceflight" [1]. This quotation expresses in a few words the ultimate goals of all cluster communication tools. The amounts of data that have to be digested by current distributed processing systems are on the order of Tbytes/s, at message rates in the hundreds of kHz [2]. Proposing successful designs for such environments thus belongs to the category of grand-challenge problems [2]. Air-traffic monitoring [3] or nuclear/particle physics data acquisition [4] systems are examples from this domain that rely on custom embedded devices and contain real-time paths [5]. Traditional monolithic approaches do not meet the requirements, and a natural way to break the current limitations is to distribute the data processing task to a multiplicity of cheap commodity computation hardware devices [6]. The strong requirements on the system's communication efficiency and the need to incorporate a variety of hardware and software modules make the selection of current off-the-shelf software infrastructures difficult [7]. Some CORBA implementations [8, 9, 10], JINI [11] or Globus [12] provide enough modularity, but they do not satisfy the efficiency requirements. Tools that provide low-level APIs to access the network, such as PVM [13], MPI [14], Active Messages [15] or Nexus [16], hardly exhibit the required degree of functionality and extensibility. We are, therefore, looking for alternative architectural software support to build high-performance clusters from COTS and custom hardware components. Software architecture involves the description of elements from which systems are built, interactions among those elements, patterns that guide their composition, and constraints on these patterns [17]. Various computer manufacturers proposed a specification that addresses exactly such requirements. The standard, called Intelligent I/O (I2O), addresses the issue of building hardware- and operating-system-independent device drivers [18]. To accomplish this goal it must provide basic functionality that allows the composition of extensible systems. I2O primarily aims at the cooperation of I/O processors with a host in a PCI segment, although there is room to extend the scheme to loosely coupled multiprocessor systems [41], i.e. any MIMD architecture [42] that is built from a set of nodes connected by a message-passing network. If we succeed in providing an efficient implementation of an I2O core that can be used in such an environment and that presents the user with a narrow component interface, we obtain a standards-based toolkit that satisfies the requirements of high-performance data-processing clusters. We designed and implemented such an application framework at CERN, the European Organization for Nuclear Research, in the scope of a Large Hadron Collider experiment that shall start operation in the year 2005. In this paper we shed light on how we map the I2O standard to general distributed systems and on the internals of our framework, and finally we report on its efficiency.

2. Toolkit requirements

From the many pervasive attributes that middleware has to cover [19], a toolkit for high-performance processing clusters must span at least three dimensions. The first covers efficiency. As clusters grow, communication between the processing elements increases even more. To exploit the benefits of modern communication devices, middleware has to allow for direct manipulation of the layers below the transport level. This leads us to the second important requirement dimension: flexibility. Middleware must be able to cope with evolving requirements, as well as software and hardware modifications. The latter cause not only different access to the wire, but also changing addressing schemes. The framework has to account for this: it should not be necessary to modify an application in case some hardware component is exchanged. The third dimension, which we have to consider, is system management [20, 21]. A successful scheme has to allow configuring all cluster components, whether the hardware, the framework or the applications, according to one common scheme. The scheme must be open to future extensions and has to include the configuration and operational modes of the system in its scope. Having a system that covers all these aspects at once seems difficult. Yet, we will show in the next section how the application of the I2O architecture to a distributed processing system can meet these requirements.

3. The I2O architecture

3.1. Overview

I2O proposes a split-processing model (see figure 1) that consists of two parts. The first, called the Operating System Module (OSM), is hardware independent and resides in the host computer. It is provided for a given host operating system and presents the application programmer with a common interface for communicating with an I2O device. The other part, called the Hardware Device Module (HDM), is responsible for interpreting messages from the OSM and controls the device accordingly.

The HDM includes the Device Driver Module (DDM) software that is downloaded into an I/O processor on an intelligent communication device. A DDM is tailored to the device that is to be controlled and can work together with any kind of OSM. Although tightly coupled to the hardware, it is embedded in an I2O execution environment and accesses the hardware solely through predefined commands. The software that runs on the I/O processor communicates with the host through a defined communication layer, the I2O messaging instance. This layer contains two queues that are outlined in figure 2. The inbound queue buffers messages that originate from the host, and the device modules post replies to the outbound queue. For efficiency reasons these queues are implemented in hardware on I2O-supporting computer architectures.

Figure 1. I2O Split Driver Model

Figure 2: IOP/host communication through pointers to I2O frames.

A DDM reacts to requests from an OSM, i.e. it can only reply to messages from the host, but it never sends unsolicited messages. Programs that reside in an IOP are, however, allowed to communicate with programs in other IOPs at any time. This is achieved in one of two ways. The execution environment can provide a messaging instance that is capable of redirecting requests (Peer Operation). The alternative is the use of direct peer-to-peer commands in the I2O HDM software modules. The latter offers more control over the communication type, but at the price of increased complexity and loss of access transparency (see figure 3).

Figure 3. Peer Communication through the Messaging Instance (a) on the same PCI segment or Peer-to-Peer operation with a Peer Transport Agent (b) across separated bus segments.

The IOPs' capability to communicate with each other is the key to the architectural support that I2O offers for high-performance data processing clusters. Instead of letting hosts talk to each other, we see all communicating nodes in the processing cluster as IOPs. Hence, all applications are embedded into an I2O execution environment. Primary and secondary hosts serve only as configuration and control points for the cluster nodes. Expanding the I2O standard PCI-bus-based communication to any distributed-memory message-passing environment is feasible, because I2O offers total transparency of the DDMs' location through a generic addressing scheme. This communication concept is also the idea behind upcoming I/O approaches, such as the InfiniBand architecture [22]: data are transferred from hosts to I/O points or remote nodes through switching fabrics using message passing and one common addressing scheme for all communication. What remains is to provide an efficient execution environment for message handling.

3.2. Events and messages

I2O applications are modeled following an event-based processing and communication scheme [23]. Such an approach gives us the flexibility that we need in ever-changing environments. In this context an event is the reception of a message. It triggers the execution of user-supplied application software that has been associated with the message at configuration time. Through this scheme we achieve extensibility, because the processing of a function is decoupled from its invocation. The procedure for a given message can be specified dynamically by downloading a software module at runtime. Furthermore, it is possible to add events, and thus functionality, by defining private messages that still follow the I2O standard format. For standard messages the user can provide software; if no code is supplied for a given event, the system can provide default procedures. This is also a way to arrive at a homogeneous view of software components with fault-tolerant behaviour.
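To illustrate this binding of messages to user code, the sketch below shows how a dispatch table could associate function codes with handlers and fall back to a default procedure. The type and member names are our own and belong neither to the I2O specification nor to the XDAQ API; this is a minimal sketch of the idea, not the framework itself.

```cpp
#include <cstdint>
#include <functional>
#include <iostream>
#include <unordered_map>

// Hypothetical, heavily simplified view of an I2O message frame.
struct MessageFrame {
    std::uint8_t  function;     // standard or private function code
    std::uint16_t targetId;     // TiD of the addressed device
    const void*   payload;      // pointer into the frame buffer
    std::size_t   payloadSize;
};

// A device module owns a local dispatch table: reception of a message
// (the "event") triggers the handler bound to its function code.
class DeviceModule {
public:
    using Handler = std::function<void(const MessageFrame&)>;

    // Handlers are bound at configuration time; unknown codes fall back
    // to a default procedure, giving fault-tolerant behaviour.
    void bind(std::uint8_t function, Handler h) { table_[function] = std::move(h); }

    void dispatch(const MessageFrame& frame) const {
        auto it = table_.find(frame.function);
        if (it != table_.end()) {
            it->second(frame);          // user-supplied code
        } else {
            defaultHandler(frame);      // system-provided default
        }
    }

private:
    static void defaultHandler(const MessageFrame& frame) {
        std::cerr << "no handler for function 0x" << std::hex
                  << int(frame.function) << ", frame ignored\n";
    }
    std::unordered_map<std::uint8_t, Handler> table_;
};

int main() {
    DeviceModule module;
    module.bind(0x80, [](const MessageFrame& f) {   // a private message code
        std::cout << "private message, " << f.payloadSize << " byte payload\n";
    });
    MessageFrame frame{0x80, 42, nullptr, 128};
    module.dispatch(frame);   // the reception of the message is the event
}
```

Because invocation is decoupled from processing, replacing the lambda bound to a code, or binding a new code altogether, requires no change to the dispatching machinery.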

An event-based processing concept has been shown to be scalable [24]. There is no need for a central place in which incoming messages have to be parsed. It is the sole responsibility of each device to know what to do with an incoming message. Each device module in this concept is an active object that contains a local dispatcher. With this approach it is not even necessary to register a new event with the executive framework; it is sufficient to add it to the device module. Another advantage of the I2O event model is that essentially every occurrence in the system is mapped to an I2O message. Even interrupts or timer expirations trigger messages that are sent to device modules, if they have registered to listen to such an event.

3.3. Applications and devices

Messages are combined into sets that form device classes. Each concrete I2O device has to implement the executive and utility events that allow the configuration and control of the device. Finally, it must implement the interface of one of the I2O device classes, e.g. the Block Storage or Tape class. Through these three interfaces it is a Device Driver Module. In our view, an application is merely a new, private "device" class. In addition to the standard messages, it provides code for all the private messages that are defined for this application class by the programmer. After all, the whole mechanism maps ideally to the component paradigm [25, 26, 27] in that it comprises some complex functionality that is exposed to the users of the component through a clean and narrow interface: I2O messages that trigger the execution of code.

3.4. Addresses and modules

A common addressing scheme is the glue between components that turns dispersed programs into a system. In fact, messages can only be used with an appropriate addressing scheme. One of the problems of existing cluster operation is the diversity of address formats that is introduced with the many communication subsystems. Having to modify an application to work with Ethernet addresses today and with Myrinet [28] port information the next day should be avoidable. Hiding the information in an adapter class is in most cases not sufficient: one would still rely on recompilation because the adapter must be changed, and different networks are configured in different ways. I2O challenges this Babylonic confusion by replacing all addressing with a unique destination identification scheme. That is, each device instance, software or hardware module is assigned a numeric identifier, the TiD (Target ID). It is unique within one I/O processor card. To communicate with a remote device, the executive creates a local TiD for the target device along with information on how to reach it.
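A minimal sketch of this addressing idea, with entirely hypothetical names (this is not the XDAQ API): the executive hands out locally unique TiDs and keeps, per TiD, whatever route information is needed, so that applications address peers by TiD only.

```cpp
#include <cstdint>
#include <map>
#include <stdexcept>
#include <string>

using Tid = std::uint16_t;   // I2O target identifier, unique within one IOP

// Hypothetical per-TiD routing record kept by the executive; the application
// never sees transport-specific addresses.
struct Route {
    bool        local;       // device lives in this executive
    std::string transport;   // e.g. "tcp", "gm" - chosen at configuration time
    std::string address;     // transport-specific address, opaque to applications
};

class Executive {
public:
    Tid registerLocal() { routes_[next_] = {true, "", ""}; return next_++; }

    // Proxy TiD for a device on a remote IOP.
    Tid registerRemote(const std::string& transport, const std::string& address) {
        routes_[next_] = {false, transport, address};
        return next_++;
    }

    // Swapping Ethernet for Myrinet changes only the stored Route,
    // never the application code that sends to a TiD.
    const Route& lookup(Tid tid) const {
        auto it = routes_.find(tid);
        if (it == routes_.end()) throw std::out_of_range("unknown TiD");
        return it->second;
    }

private:
    Tid next_ = 1;
    std::map<Tid, Route> routes_;
};

int main() {
    Executive exec;
    Tid builder = exec.registerLocal();
    Tid readout = exec.registerRemote("gm", "lanai7:port2");   // made-up address
    // Exchanging "gm" for "tcp" here changes nothing in code that sends to `readout`.
    return exec.lookup(builder).local && !exec.lookup(readout).local ? 0 : 1;
}
```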

The principle is not new: it can be compared to the Proxy pattern [29]. That is how we obtain total transparency of location. The caller never needs to know whether a device is really local or whether the call is redirected. All communication can be performed through the inbound and outbound queues of the messaging instance (see figure 4). Communication channel control, such as connection establishment or flow control, is hidden from the user. The modules that take care of performing the actual communication are designed as Device Driver Modules themselves; they are just granted a special name: the Peer Transports, which are controlled by the Peer Transport Agent. At most, we are confronted with this layer at configuration time, when tuning the system to our needs.

Figure 4: Peer Operation in the XDAQ environment. Messages are routed through the messaging instance (1, 2) to a Peer Transport Agent (3) if communication (4) with remote participants is required. On the other side, a PT (5) receives the frame and forwards it to the PTA (6), which puts it into the messaging instance inbound queue (7). From there, the message is dispatched to the DDM (8).

3.5. Wrapping up the architecture

Before we shed more light on how we mapped the I2O model to networking environments, we would like to sum up the key ideas of this model, which form the architectural support for distributed programming in our environment. The basic means of information exchange is a message that follows a standard format. All events, including signals, can be translated into such a message. Each processing node runs an executive program that routes all application-generated messages, according to their destination information, to the software or hardware device modules that are registered with the executive. Transparency of locality is achieved through the target identifier addressing scheme, in which all devices, local or remote, get a locally unique TiD from the executive. All modules, user applications, the peer transports and even the executive itself get such a TiD. Thus, they are all valid I2O devices and have to implement the standard executive and utility message handlers to be configurable and controllable. Application messages are private extensions that are foreseen in the I2O model (see figure 5). In a distributed I2O environment in which IOPs do not reside on the same bus segment, a primary host controls all processing nodes. Secondary hosts may register and subsequently apply for control rights. All communication travels through the inbound and outbound queues of the local node. The messaging layer must take care of forwarding the message either to a remote node or to the host. This model is called Peer Operation.
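The flow of figure 4 can be condensed into the following sketch. All class and function names are hypothetical; the code only illustrates how a messaging instance might either dispatch a frame locally or hand it to a Peer Transport, not how XDAQ actually implements Peer Operation.

```cpp
#include <cstdint>
#include <iostream>
#include <map>
#include <memory>
#include <set>

// One frame travelling through the system; header fields and payload omitted.
struct Frame {
    std::uint16_t targetTid;
};

// A Peer Transport wraps one concrete network (TCP, Myrinet GM, ...).
class PeerTransport {
public:
    virtual ~PeerTransport() = default;
    virtual void send(const Frame& f) = 0;
};

class PrintingTransport : public PeerTransport {
public:
    void send(const Frame& f) override {
        std::cout << "forwarding frame for TiD " << f.targetTid << " to remote IOP\n";
    }
};

class MessagingInstance {
public:
    void addLocal(std::uint16_t tid) { local_.insert(tid); }
    void addRemote(std::uint16_t tid, std::shared_ptr<PeerTransport> pt) { remote_[tid] = pt; }

    // The inbound/outbound queues are reduced to a direct call here; a real
    // implementation would enqueue the frame and let the executive dispatch it.
    void post(const Frame& f) {
        if (local_.count(f.targetTid)) {
            std::cout << "dispatching frame to local DDM " << f.targetTid << "\n";
        } else {
            remote_.at(f.targetTid)->send(f);   // the Peer Transport Agent's role
        }
    }

private:
    std::set<std::uint16_t> local_;
    std::map<std::uint16_t, std::shared_ptr<PeerTransport>> remote_;
};

int main() {
    MessagingInstance mi;
    mi.addLocal(3);
    mi.addRemote(7, std::make_shared<PrintingTransport>());
    mi.post({3});   // stays on this IOP
    mi.post({7});   // travels through the Peer Transport
}
```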

Figure 5: The I2O message frame format: a standard frame, optionally followed by a private frame extension. (* Function = FFh if the message is private; the XFunctionCode is then interpreted.)

4. I2O application framework design

We modeled the I2O Peer Operation model with a C++ framework that we call the I2O executive. Together with an operating-system-independent systems programming toolbox and application classes for our data acquisition system, it forms the XDAQ toolkit software. (We called the toolkit XDAQ, pronounced "cross duck", because it allows data acquisition modules to communicate in peer-to-peer style. In our DAQ system, n nodes talk to m other nodes in both directions, thus resulting in communication channels that cross over.) The executive accepts incoming messages and forwards them to the device classes. To avoid the efficiency loss that could be induced by an unpredictable growth of the number of threads if each and every active object were modeled as a task, the loop of control remains in the executive framework. There exist multiple dispatch tables for the device class instances, but the executive performs the dispatching. Furthermore, the executive has control over all the memory that can be accessed by the registered modules. These essential tasks give the impression of quite a rich executive. After all, the executive is very lean, as it acts only as a delegate: the Peer Transport Agent receives messages, and memory pools are used for zero-copy operation.

For scheduling the dispatching of messages we follow the algorithm given in the I2O specification. There exist seven priority levels and for each one the messages are scheduled to a FIFO. All devices are then dispatched in round-robin manner. We cannot prevent monopolization of the CPU or stalling of the system caused by a misbehaving message handler with this scheme. To do so, it would be necessary to asynchronously terminate the handler after a configured time interval has elapsed. Such a mechanism can be implemented making use of the I2O core timer facilities.

A device class is programmed in C++ by inheriting from an i2oListener class. Similar to the Java event model [30], the class inherits the interfaces of the i2oExecutive, i2oUtility and private classes. After an implementation of the combined interface has been provided, the device class is compiled and the object code is downloaded dynamically into the running executives. At this point a plugin method that is not defined by I2O is called by the executive, which allows us to register the downloaded object. The newly created class can then obtain its TiD and retrieve parameter settings from the executive. It will also query the availability of other device class instances on remote IOPs and trigger the creation of proxy TiDs.

Applications use the frameSend I2O command to send a message or frameReply to reply to a request. The message has to be built according to the standard frame format (see figure 5). To further shield users from these details, adapters can be provided that allow a remote-method-invocation style of communication. The stub part takes the call parameters and marshals them into a standard message, whereas the skeleton part scans the message and provides typed pointers to its contents. All communication employs a zero-copy scheme, as the message buffers are taken from the executive's memory pool. Memory is allocated in fixed-size blocks with a maximum length of 256 KB. Making use of I2O's Scatter-Gather Lists (SGL) or chaining blocks allows arbitrary-length information to be transmitted. Automatic garbage collection is provided, such that blocks are recycled once they are not referenced anymore.

The Peer Transports (PT) perform the actual communication. They encapsulate all details about a specific transport layer. As it is possible to configure each device instance with a route, we can use multiple transports to send and receive in parallel. This is a vital functionality that is not yet covered by other comparable middleware products. Concerning Peer Transports we distinguish two modes of operation. In polling mode, the executive periodically scans all registered PTs for pending data. In task mode, each PT has its own thread of control, reporting to the executive whenever data have arrived. To allow efficient operation in polling mode it is advisable not to use more than one PT in this mode, or to suspend other PTs during periods in which low-latency communication is required.

Otherwise a slow PT, e.g. a poll operation on a TCP socket, would negate the benefit of periodically checking a lightweight user-level network interface. Configuration and control of the executive is done through I2O executive messages. They are sent from a Tcl script that resides on the primary host to all executives in the distributed system. We chose Tcl because it is the I2O-recommended way for configuration and control. In principle, however, we can choose any configuration language, as long as we follow the I2O message format.
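To make the above more concrete, the following is a minimal sketch of what a user-written device class along these lines could look like. It is not the actual XDAQ API: the class name i2oListener and the commands frameSend and frameReply are taken from the text, but the method names, signatures and frame type shown here are assumptions for illustration only.

```cpp
#include <cstddef>
#include <cstdint>
#include <iostream>

// Assumed, simplified stand-ins for framework types; the real i2oListener,
// frame and executive interfaces of XDAQ are richer than this.
struct Frame {
    std::uint8_t function;
    const void*  payload;
    std::size_t  size;
};

class i2oListener {   // name taken from the paper, body assumed
public:
    virtual ~i2oListener() = default;
    // Executive and utility messages every device must handle in order to be
    // configurable and controllable.
    virtual void handleExecutiveMessage(const Frame&) = 0;
    virtual void handleUtilityMessage(const Frame&) = 0;
    // Private (application-defined) messages.
    virtual void handlePrivateMessage(const Frame&) = 0;

protected:
    // Stand-ins for the frameSend/frameReply commands mentioned in the text.
    void frameSend(const Frame& f)  { std::cout << "send "  << f.size << " bytes\n"; }
    void frameReply(const Frame& f) { std::cout << "reply " << f.size << " bytes\n"; }
};

// An application is "merely a new, private device class".
class EventBuilder : public i2oListener {
public:
    void handleExecutiveMessage(const Frame&) override { /* configure, start, stop */ }
    void handleUtilityMessage(const Frame&) override   { /* parameter get/set */ }
    void handlePrivateMessage(const Frame& request) override {
        // application logic, then answer the requester
        frameReply(request);
    }
};

int main() {
    EventBuilder builder;
    Frame request{0x80, nullptr, 64};
    builder.handlePrivateMessage(request);   // in XDAQ the executive would dispatch this
}
```

In the real framework the compiled object code would be downloaded into the running executive, the plugin method called, and the new instance given its TiD, as described above.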

5. Framework communication efficiency

Blackbox benchmark

To measure the basic overhead that is introduced with the additional software layer, we built a simple private device class that is instantiated on one node and continuously floods a remote instance of this class with messages. The second instance responds by replying to each received message with exactly the same content. We carried out this round-trip test with increasing payload sizes. To obtain the combined transfer and upcall latency we divided the measured values by two. We then compared these latencies to the round-trip times that we obtained from measurements with the Myrinet/GM 1.1.3 lightweight message passing system [31], which is similar to Active Messages [15]. We used a Myricom M2M-PCI64 network interface card containing a LANai 7 processor. This processor runs the standard Myrinet/GM MCP program. We implemented a peer transport based on the Myrinet GM 1.1.3 library for our XDAQ I2O executive and performed the round-trip test. The Myrinet/GM PT ran as a thread. Another PT thread was handling TCP communication for configuration and control purposes. During the test we did not perform any control or configuration.

Blackbox results

The results are depicted in figure 6. All benchmark results were averaged over 100,000 calls in each direction. The uppermost plot shows the latency results for payloads from 1 to 4096 bytes using the XDAQ framework with Myrinet/GM. The middle slope depicts the latencies from the test program using Myrinet/GM directly. The slopes are linear, as expected. This indicates the existence of a constant overhead that is independent of the payload. The difference of the two slopes contains the XDAQ framework software overhead; this is the lowest plot. It is indeed constant for all transferred payload sizes. On the deployed Pentium II 400 MHz system with a 33 MHz/32-bit PCI bus and BX chipset, the middleware takes 8.9 µsec on average (s = 0.6) per call compared to direct use of Myrinet/GM.
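For reference, the round-trip procedure just described boils down to the sketch below. roundTrip() is only a placeholder (here a sleep) standing in for a frameSend over the Myrinet/GM peer transport and the wait for the echoed reply, so the program compiles and runs but its numbers mean nothing; the figures quoted in this section were of course taken with the real framework.

```cpp
#include <chrono>
#include <cstddef>
#include <iostream>
#include <thread>

// Placeholder for one round trip: send a frame with `bytes` of payload to the
// echoing peer device and block until the identical reply has arrived.
static void roundTrip(std::size_t /*bytes*/) {
    std::this_thread::sleep_for(std::chrono::microseconds(20));
}

// One-way latency estimate: average the round-trip time over many iterations
// and divide by two, as was done for the plots in figure 6.
static double oneWayLatencyMicros(std::size_t bytes, int iterations) {
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    for (int i = 0; i < iterations; ++i) roundTrip(bytes);
    const std::chrono::duration<double, std::micro> elapsed = clock::now() - start;
    return elapsed.count() / iterations / 2.0;
}

int main() {
    for (std::size_t payload : {1, 1024, 2048, 4096})
        std::cout << payload << " B: " << oneWayLatencyMicros(payload, 1000) << " us\n";
}
```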

[Figure 6 plot: GM/XDAQ latencies in microseconds versus bytes transferred (0 to 4096), with linear fits to the GM 1.1.3 performance, the XDAQ performance and the XDAQ overhead; the fit to the XDAQ overhead is y = -7E-05x + 9.105.]

Figure 6: XDAQ with Myrinet/GM blackbox ping-pong latency results (one-way times)

Whitebox benchmark

For pinpointing the overhead in the XDAQ framework, we instrumented our code with time probes. We measure the time difference between two probes in nanoseconds. The values are then again averaged over the 100,000 calls. For testing we used the same software and hardware setup as described in the blackbox test. Below we illustrate the measurements that we obtained from the send/receive test program with our toolkit.

Whitebox results

Table 1 shows the results for receiving an event and activating the associated code on the receiver side, in µsec. All given values are the medians of 100,000 samples.

Table 1: µseconds spent in the XDAQ framework.

Activity                               Time (median in µsec)
PT GM processing                       2.92
Demultiplexing to functor              0.22
Upcall of functor                      0.47
Application (incl. frameSend)          3.6
Release frame, call postprocessing     2.49
Sum of application overhead            9.53

frameAlloc                             2.18
frameFree                              1.78
Cross check measurement                4.12

Handling an incoming message in the GM PT accounts for most of the time. This overhead does not include calls to the Myrinet/GM library; it is pure processing time in the XDAQ toolkit.

Taking a closer look at the time it takes to allocate a frame from the pool (frameAlloc), we see that most of the PT processing time is spent in the frame allocation. The memory allocation scheme used in the whitebox test is not optimized. A new allocation scheme that we tried allocates memory for the buffer pool on demand. Furthermore, it relies on a table-based matching from requested memory size to pool buffer size, so the time needed to allocate a frame shrinks dramatically for applications that use similar buffer sizes throughout their lifetimes. In a preliminary blackbox test we were able to reduce the framework overhead by another 4 µsec, to 4.9 µsec (s = 0.8) per invocation. This shows that optimizations of the toolkit's efficiency are merely implementation details. The postprocessing of a call is likewise dominated by the release of allocated memory to the common buffer pool. The table shows that demultiplexing and calling the user-provided implementation take only little additional time. Processing of the example program takes time because it also contains a frame send. From the measurements we can conclude that sending a message costs about 5 µsec for the original buffer allocation scheme. This operation consists of a frameSend (~3.6 µsec with the original buffer allocation scheme) and a frameFree (1.78 µsec). Although we used lightweight high-resolution time probes based on reading the CPU clock ticks into a reserved memory region, the sum of the measurements is slightly different from the end-to-end test (median 9.53 versus average 8.9 µsec). A standard deviation of about 0.5 µsec allows us to conclude that our measurements are consistent. Summing up, we can say that the whitebox results demonstrate the advantages of such a framework: the processing time that we see in the blackbox test does not stem from the executive's core functionality, which is the dispatching of incoming messages to user code.
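The time-probe technique mentioned above can be sketched as follows. This is not the instrumentation we actually used, merely an illustration of reading the CPU cycle counter into a preallocated buffer (here via the GCC/Clang x86 intrinsic __rdtsc) and converting tick differences to time offline.

```cpp
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <vector>
#include <x86intrin.h>   // __rdtsc on GCC/Clang for x86

// Probes are stored in a buffer reserved up front, so that taking a probe
// costs only a counter read and a store, not a memory allocation.
class TimeProbes {
public:
    explicit TimeProbes(std::size_t capacity) { ticks_.reserve(capacity); }

    void probe() { ticks_.push_back(__rdtsc()); }   // raw CPU clock ticks

    // Convert the difference between two probes to nanoseconds offline,
    // given the CPU clock frequency (400 MHz for the Pentium II used above).
    double nanosBetween(std::size_t a, std::size_t b, double cpuHz) const {
        return static_cast<double>(ticks_[b] - ticks_[a]) * 1e9 / cpuHz;
    }

private:
    std::vector<std::uint64_t> ticks_;
};

int main() {
    TimeProbes probes(100000);
    probes.probe();   // e.g. before frameAlloc
    probes.probe();   // e.g. after the upcall
    std::cout << probes.nanosBetween(0, 1, 400e6) << " ns\n";
}
```

Probes placed before and after frame allocation, dispatching and the upcall, with the per-activity medians of 100,000 samples computed afterwards, yield a breakdown like the one in Table 1.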

6. Related work

6.1. Spin and Spine

Spin [32] is an extensible operating system based on the idea of event-based processing that supports downloading code into the kernel. All code, whether system or application, consists of OS extensions written in Modula-3 that react to events. Spine [33] extends the ideas of Spin to the network interface in that an OS kernel runs on an intelligent network interface card to support the building of network-enabled applications. Although Spine supports more application security, thanks to per-function watchdogs and compile-time checking of trusted extensions, Spine is limited to the single NIC that hosts its executive: a network program can use either Myrinet or Gigabit Ethernet, depending on which NIC provides its execution environment.

XDAQ is a host- and IOP-based framework for cluster applications. Spine, in contrast, supports the outsourcing of functions that do not require host CPU power to the NIC. In this sense it provides high-level support for server applications.

6.2. High-performance CORBA

The performance of CORBA-based systems and their applicability to high-performance environments have improved tremendously over the last few years. Only now, with optimizations in the ORB core [34] and quality-of-service negotiation [35, 36], is it possible to use CORBA applications in real-time environments. With new dispatching schemes [37] and pluggable transports [38] that enable optimized access to any network device, the gap between low-level communication libraries and high-level client/server middleware has shrunk. Still, the overhead induced by an ORB core is significant (about 90 µsec [38, 39]). CORBA provides a high level of functionality. The need for an ORB implementation to be compliant with the OMG's specification, however, requires implementers to carry along the burden of this functionality. Without modifications to the standard, this fact hinders further optimizations. Even if Distributed Object Computing (DOC) middleware performance is said to be an implementation detail [38], we see that efficiency closer to the hardware can only be reached if the middleware core is configurable enough to reflect the requirements of high-performance cluster operation. It must be possible to exclude or plug in new implementations of core modules, such as the marshalling engine. The IDL-to-C++ mapping must support buffer-loaning techniques, and the support of such buffer pools should not remain a private feature of only some ORB implementations. These are just examples to clarify better what we call the need for architectural support. The CORBA 3.0 specification [40] is without any doubt on the way to becoming a strong competitor to other middleware approaches in the high-performance processing domain.

7. Ongoing work

Similar to the Spine project, we intend to use our executive not only in the main CPUs, but also in intelligent network cards. For this purpose members of our team designed a PLX IOP 480 based processor board with a local PCI board and Fast Ethernet for control and configuration purposes. The board gives I2O support through hardware FIFOs, which will allow us to provide communication efficiency measurements with and without hardware support. The VxWorks RTOS has been successfully ported to the board, and Linux will eventually be available, too.

We are now implementing a PCI Peer Transport for providing communication with the host. Another vital aspect that we are currently incorporating is efficient buffer pool management. A fast scheme has already been implemented and measured with a blackbox test. Allocators for the custom processing hardware have yet to be designed. Detailed whitebox measurements are currently being performed.

8. Conclusions

Fast, reliable and simple-to-use middleware that can be easily adapted to the needs of custom environments is needed for future processing cluster operation. Ad hoc solutions are not a means to accomplish this goal; we need architectural support that allows all requirements to be integrated. The I2O standard, although initially intended for device drivers, exhibits all the necessary properties. In this paper we have shown that combining efficiency, extensibility and transparency in processing clusters is not an unattainable task. Event-based processing allows us to build extensible systems based on software components that communicate through a standard message format. The target identifier addressing scheme supports total transparency of location. The use of specialized Peer Transports that interface to the executive just like ordinary device classes leaves room for highly optimized communication. This approach allows us to exploit any future networking technology without the need to modify the applications. Since middleware frameworks can be made available on any hardware or operating system platform, cluster processing can easily be made homogeneous, regardless of the deployment environment. Introducing an additional software layer has some cost in terms of processing time. We have shown that this overhead is acceptably small and can be further reduced with some implementation effort. Yet, it is important to note that it might not even be necessary to pursue these implementation details. The I2O architectural support, Scatter-Gather Lists or buffer pools to name but a few features, can help at the application level to eliminate a large portion of the additional processing time. After all, we should not consider this as overhead, but rather as the time that it takes to provide the user of the system with considerable added value, which would otherwise have to be implemented by the application programmer.

9. Acknowledgements

We owe thanks to H. Stockinger and W. Schleifer from the CMS collaboration for their valuable comments. We would also like to acknowledge the helpful comments of Bob Jones from the ATLAS collaboration, who carefully read drafts of the paper.

Thanks go to our team members G. Antchev, S. Erhan, D. Gigi, C. Jacobs, L. Pollet, A. Racz, N. Sinanis and P. Sphicas. We also thank the anonymous referees for their comments.

10. References

[1] J. E. Tomayko. Computers in Spaceflight, NASA contractor report CR-182505, National Aeronautics and Space Administration, Scientific and Technical Information Division, Washington, DC 20546, USA, March 1988, p. 228.
[2] J. Gutleber. "Challenges in Data Acquisition at the Beginning of the New Millennium", Proceedings of the 1st International Workshop on Real-Time Mission-Critical Systems: Grand Challenge Problems, IEEE, Phoenix, Arizona, November 30, 1999.
[3] W. C. Meilander, J. W. Baker and J. L. Potter. "In Air Traffic Control - The Solution is the Problem", Proceedings of the 1st International Workshop on Real-Time Mission-Critical Systems: Grand Challenge Problems, IEEE, Phoenix, Arizona, November 30, 1999.
[4] J. Gutleber. "Application Steering for Large Clusters of Workstations in High Energy Physics Environments", in M. H. Hamza, editor, Proceedings of the International Conference on Applied Informatics, pp. 481-484, Innsbruck, Austria, February 1999, IASTED/ACTA Press, Anaheim-Calgary-Zurich.
[5] L. R. Welch. "A Taxonomy of Real-Time Applications", Proceedings of the 1st International Workshop on Real-Time Mission-Critical Systems: Grand Challenge Problems, IEEE, Phoenix, Arizona, November 30, 1999.
[6] W. J. McCombie. "High Availability in Software using Supervised Logical Channels", Embedded Systems Programming Europe, pp. 9-16, February 1999.
[7] C. D. Gill, F. Kuhns, D. L. Levine and D. C. Schmidt. "Applying Adaptive Real-time Middleware to Address Grand Challenges of COTS-based Mission-Critical Real-Time Systems", Proceedings of the 1st International Workshop on Real-Time Mission-Critical Systems: Grand Challenge Problems, IEEE, Phoenix, Arizona, November 30, 1999.
[8] F. Kon, M. Román, P. Liu, J. Mao, T. Yamane, L. C. Magalhães, and R. H. Campbell. "Monitoring, Security, and Dynamic Configuration with the dynamic TAO Reflective ORB", IFIP/ACM International Conference on Distributed Systems Platforms and Open Distributed Processing (Middleware'2000), New York, April 3-7, 2000.
[9] S. Lo and S. Pope. "The Implementation of a High Performance ORB over Multiple Network Transports", in N. Davies, K. Raymond and J. Seitz, editors, Proceedings of the IFIP International Conference on Distributed Systems Platforms and Open Distributed Processing (Middleware 98), The Lake District, England, September 1998, IFIP, Springer Verlag.
[10] N. Wang, M. Kircher, and D. C. Schmidt. "Applying Reflective Techniques to Optimize a QoS-enabled CORBA Component Model Implementation", 24th Annual International Computer Software and Applications Conference (COMPSAC 2000), Taipei, Taiwan, October 25-27, 2000.
[11] S. Morgan. "Jini to the rescue", IEEE Spectrum, pp. 44-49, April 2000.

[12] I. Foster and C. Kesselman. "The Globus Toolkit", in I. Foster and C. Kesselman, editors, The GRID: Blueprint for a New Computing Infrastructure, chapter 11, pages 259-278, Morgan Kaufmann Publishers, Inc., San Francisco, California, USA, first edition, 1999. ISBN 1-55860-475-8.
[13] A. Chien, S. Pakin, M. Lauria, M. Buchanan, K. Hane, L. Giannini, and J. Prusakova. "High Performance Virtual Machines (HPVM): Clusters with Supercomputing APIs and Performance", in Eighth SIAM Conference on Parallel Processing for Scientific Computing (PP97), Hyatt Regency Minneapolis on Nicollet Mall Hotel, Minneapolis, Minnesota, USA, March 1997. SIAM, Society for Industrial and Applied Mathematics.
[14] M. Lauria and A. Chien. "MPI-FM: High performance MPI on workstation clusters", Journal of Parallel and Distributed Computing, 40(1):4-18, 1997.
[15] T. von Eicken, D. Culler, S. Goldstein, K. Schauser. "Active Messages: A Mechanism for Integrated Communication and Computation", in Proc. of the 19th Int'l Symposium on Computer Architecture, Gold Coast, Australia, May 1992.
[16] I. Foster, J. Geisler, C. Kesselman, S. Tuecke. "Managing Multiple Communication Methods in High-Performance Networked Computing Systems", Journal of Parallel and Distributed Computing, 40:35-48, 1997.
[17] M. Shaw and D. Garlan. Software Architecture: Perspectives on an Emerging Discipline, Prentice Hall Publishing, 1996. ISBN 0-13-182957-2.
[18] I2O Special Interest Group. Intelligent I/O (I2O) Architecture Specification v2.0, available from www.i2osig.org, 1999.
[19] P. A. Bernstein. Middleware: An Architecture for Distributed System Services, Technical Report CRL 93/16, Cambridge Research Lab, Digital Equipment Corp., 1993.
[20] P. Jardin. "Supporting Scalability and Flexibility in a Distributed Management Platform", Distrib. Syst. Engng 3:115-123, The British Computer Society, The Institution of Electrical Engineers and IOP Publishing Ltd., 1996.
[21] D. P. Ghormley, D. Petrou, and S. H. Rodrigues. "GLUnix: a global layer unix for a network of workstations", Software: Practice and Experience, 28(9):929-961, 1998.
[22] D. Pendery and J. Eunice. InfiniBand Architecture: Bridge Over Troubled Waters, Research Note, InfiniBand Trade Ass'n, April 27, 2000, available from www.infinibandta.com.
[23] B. N. Bershad, S. Savage, P. Pardyak, E. G. Sirer, M. E. Fiuczynski, D. Becker, C. Chambers, S. Eggers. "Extensibility, Safety and Performance in the SPIN Operating System", in Proceedings of the Fifteenth ACM Symposium on Operating System Principles, pp. 267-284, December 3-6, 1995, Copper Mountain Resort, Colorado, USA.
[24] P. Pardyak and B. N. Bershad. "Dynamic Binding for an Extensible System", Proceedings of the Second USENIX Symposium on Operating Systems Design and Implementation, pp. 201-212, October 29 - November 1, 1996, Seattle, WA, USA.
[25] O. Nierstrasz, S. Gibbs, and D. Tsichritzis. "Component-Oriented Software Development", Communications of the ACM, 35(9):160-164, September 1992.
[26] C. Pfister and C. Szyperski. "Why Objects Are Not Enough", in Proceedings, First International Component Users Conference (CUC'96), Munich, Germany, July 1996. SIGS Publishers.

[27] M. D. McIlroy. "Mass-produced software components", in J. M. Buxton, Peter Naur, and Brian Randell, editors, Software Engineering Concepts and Techniques, Reprinted Proceedings of the 1968 and 1969 NATO conferences, Petrocelli/Charter, pages 88-98, Garmisch, Germany, October 7-11, 1968, 1976. ACM Press.
[28] N. J. Boden, D. Cohen, R. E. Felderman, A. E. Kulawik, C. L. Seitz, J. N. Seizovic, and W.-K. Su. "MYRINET: A Gigabit per Second Local Area Network", IEEE Micro, 15(1):29-36, February 1995.
[29] M. Shapiro. "Structure and encapsulation in distributed systems: The proxy principle", in Proceedings of the 6th International Conference on Distributed Computing Systems, pages 198-204, Cambridge, Massachusetts, USA, May 1986. IEEE Computer Society Press, Washington, DC, USA.
[30] R. Wang and E. Crisostomo. "Event Bridges Across CORBA Event Service and Programming Language Event Models", The Journal of Object Oriented Programming, July/August 1999, SIGS Publications.
[31] Myricom. The GM Message Passing System, 1999, available from www.myri.com.
[32] M. E. Fiuczynski and B. N. Bershad. "An extensible protocol architecture for application-specific networking", in Proceedings of the USENIX 1996 Annual Technical Conference, January 1996.
[33] M. E. Fiuczynski and B. N. Bershad. "SPINE - A safe programmable and integrated network environment", in the Sixteenth ACM Symposium on Operating Systems Principles, Works in Progress, 1997.
[34] D. C. Schmidt, D. L. Levine, and C. Cleeland. "Architectures and Patterns for High-performance, Real-time ORB Endsystems", Advances in Computers, Academic Press, ed. Marvin Zelkowitz, 1999.

[35] F. Kuhns, D. C. Schmidt, C. O'Ryan and D. L. Levine. "Supporting High-performance I/O in QoS-enabled ORB Middleware", Cluster Computing: the Journal on Networks, Software, and Applications, to appear, 2000.
[36] D. C. Schmidt and S. Vinoski. "An Overview of the OMG CORBA Messaging Quality of Service (QoS) Framework", C++ Report, SIGS, Vol. 12, No. 3, March 2000.
[37] D. C. Schmidt and T. Suda. "The Performance of Alternative Threading Architectures for Parallel Communication Subsystems", Journal of Parallel and Distributed Computing, to appear.
[38] C. O'Ryan, F. Kuhns, D. C. Schmidt, O. Othman and J. Parsons. "The Design and Performance of a Pluggable Protocols Framework for Real-Time Distributed Object Computing Middleware", IFIP/ACM International Conference on Distributed Systems Platforms and Open Distributed Processing, April 3-7, 2000, New York, NY, USA, ACM Press, pages 372-395.
[39] I. Yuji, S. Toshiaki, I. Tooru, K. Mitsuhiro. "CrispORB: High Performance CORBA for System Area Network", Proceedings of the Eighth IEEE International Symposium on High Performance Distributed Computing, August 3-6, 1999, Redondo Beach, California, USA.
[40] C. O'Ryan, D. C. Schmidt, F. Kuhns, M. Spivak, J. Parsons, I. Pyarali and D. L. Levine. "Evaluating Policies and Mechanisms for Supporting Embedded, Real-Time Applications with CORBA 3.0", Proceedings of the Sixth IEEE Real-Time Technology and Applications Symposium (RTAS 00), Washington D.C., USA, May 31-June 2, 2000.
[41] K. Li. Shared Virtual Memory on Loosely Coupled Multiprocessors, Ph.D. thesis, Yale University, Dept. of Computer Science, YALEU/DCS/RR-492, September 1986.
[42] M. J. Flynn. "Very High Speed Computing Systems", Proc. IEEE, vol. 54, 1966, pp. 1902-1909.