Clustered Data Acquisition for the CMS Experiment

J. Gutleber1*, G. Antchev1, E. Cano1, A. Csilling1, S. Cittolin1, S. Erhan2, D. Gigi1, P. Gras1, M. Gulmini3, C. Jacobs1, F. Meijers1, E. Meschi1, A. Ninane5, A. Oh1, L. Orsini1, L. Pollet1, A. Racz1, D. Samyn1, P. Sphicas4, C. Schwick1, N. Toniolo3, L. Zangrando3

1 CERN, European Organization for Nuclear Research, Geneva, Switzerland
2 University of California, Los Angeles, USA
3 INFN – Laboratori Nazionali di Legnaro, Legnaro (Padova), Italy
4 Massachusetts Institute of Technology, USA
5 Université catholique de Louvain-la-Neuve, Louvain-la-Neuve, Belgium

* Corresponding author e-mail address: [email protected]

Abstract

Powerful mainstream computing equipment and the advent of affordable multi-Gigabit communication technology allow us to tackle data acquisition problems with clusters of inexpensive computers. Such networks typically incorporate heterogeneous platforms, real-time partitions and custom devices. One must therefore strive for a software infrastructure that efficiently combines the nodes into a single, unified resource for the user. The overall requirements for such middleware are high efficiency and configuration flexibility. Intelligent I/O (I2O) is an industry specification that defines a uniform messaging format and execution model for processor-enabled communication equipment. Mapping this concept to a distributed computing environment and encapsulating the details of the specification into an application programming framework allow us to provide run-time support for cluster operation. This paper gives a brief overview of XDAQ, a framework that we designed and implemented at CERN for the Compact Muon Solenoid experiment's prototype data acquisition system.

1 Introduction

The Compact Muon Solenoid (CMS) experiment at the CERN LHC will produce data rates far beyond those of current high-energy physics experiments. With an inter-bunch spacing of 25 ns and an average event size of 1 MB, pre-selection is performed in custom-built processors close to the detectors. Monolithic, single-computer systems would not match the experiment's efficiency and scalability requirements. CMS therefore applies a clustered approach to DAQ (figure 1).

Figure 1: The CMS data acquisition cluster. Detectors at the 40 MHz collision rate deliver about 1 Terabit/s over 50000 channels; the first-level trigger reduces this to a 100 kHz output into the Readout Units, which feed a 500+500-port switching fabric of ~800 Gbps aggregate throughput; Builder Units and a 5-10 TeraOps Filter Unit processing cluster deliver about 100 MB/s to archive and to a world-wide data grid.

The fragments of collision events are stored in buffers, the Readout Units (RU), which are connected to the inputs of a switching fabric. Each event fragment is on average 2 KB large. About 500 RUs are needed to store the data temporarily. They have to provide a peak throughput of 400 MB/s at the 100 kHz first-level trigger input rate. One design option for these buffers is to build them from commodity computer systems with high internal I/O bandwidth; custom solutions are also under study [1]. The network is a fully interconnecting, multi-staged switching fabric that has to have an aggregate throughput of about 800 Gbps. The output of the fabric connects to a multiplicity of processing nodes, called Builder Units (BU), that gather the fragments belonging to one event. Each BU interfaces to a number of workstations, the Filter Units (FU), which may request from the BU the parts of the data that belong to a single event. Thus, to the FU, the BU appears like a cache to the RUs. To perform the event filtering in time, avoiding RU buffer overflows, a processing power on the order of 10 TeraOps is needed. Once an event is accepted, it is made persistent and becomes accessible to physicists around the globe through a data grid architecture [2]. The output rate of accepted events is estimated to be 100 Hz.

In this “clustered approach”, a number of processors collaborate in event building and filtering. Middleware is used to program and operate the cluster; it is the software that provides developers with a uniform view of the system. The latter is a loose collection of computers, each one comprising a CPU, a disk, memory and network adapters. A crucial desired feature of the system is its scalability, i.e. the near-linear dependence of its performance on the number of RUs and BUs. To this end, efficient load-balancing and traffic-shaping algorithms for various event-building strategies have been developed [3].
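The quoted figures are mutually consistent, as a quick arithmetic check on the numbers above shows:

\[
500 \times 2\,\mathrm{KB} \approx 1\,\mathrm{MB}\ \text{per event},
\qquad
2\,\mathrm{KB} \times 100\,\mathrm{kHz} = 200\,\mathrm{MB/s}\ \text{per RU and direction},
\]
\[
\text{i.e.}\ 400\,\mathrm{MB/s}\ \text{peak per RU, and}\quad
500 \times 200\,\mathrm{MB/s} = 100\,\mathrm{GB/s} \approx 800\,\mathrm{Gbps}\ \text{through the fabric}.
\]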

2 The I2O architecture

Intelligent I/O (I2O) [4] is an industry standard specification that aims at the cooperation of I/O processors (IOP) and processor-enabled devices with a host on a PCI bus segment. I2O proposes a split driver model that consists of two parts. The Operating System Module (OSM) is hardware independent and resides on the host computer; it is provided for a given operating system and exposes an API to communicate with I2O devices through predefined operation codes and parameters. The Hardware Device Module (HDM) is responsible for interpreting messages from the OSM and controlling the device accordingly. The HDM sits on the device and includes the dynamically loaded Device Driver Module (DDM) software. Although the DDM is tailored to the device and thus tightly coupled to the hardware, it is embedded in a well-defined execution environment and accesses the hardware solely through a standard API. The software communicates with the host through a messaging interface that comprises a queue pair: one queue receives messages, while the other is used to send messages asynchronously to the outside world. For efficiency reasons, only pointers to fixed-size message blocks are passed. These blocks can be concatenated in various ways through Scatter-Gather Lists (SGL) to allow the transmission of arbitrary-length data.

There exists, however, an important limitation in I2O: a DDM can only reply to incoming OSM messages, but it never sends unsolicited messages. Two mechanisms are suggested to leverage inter-IOP communication. Either the executive provides a messaging instance capable of redirecting requests to remote DDMs (Peer Operation, figure 2), or the direct use of low-level peer-to-peer commands from within the DDMs is allowed. The latter approach bypasses the queuing mechanism, offering more control over the communication, at the price of increased complexity and a loss of the location transparency that is provided through a network-independent I2O addressing scheme.

Figure 2: Peer communication in XDAQ. Applications post I2O message frames to their executive's messaging layer; peer transport agents forward the frames between executives.

I2O gives enough room to extend the architecture to clustered CPUs in a networked environment. Rather than letting hosts talk to each other, we see all communicating nodes as intelligent I/O devices. Hence, all applications are Device Driver Modules (DDM), embedded in an I2O-based execution environment. We designed and implemented such a framework, called XDAQ, in C++. A copy of the executive runs on every cluster node. As depicted in figure 2, communication between devices is achieved solely through Peer Operation, i.e. through the queue-pair based messaging instance. Expanding the PCI bus-based messaging to any message-passing environment is feasible, because I2O offers transparency of a DDM's location through a generic addressing scheme. As a result, all heterogeneity is hidden from the application programmer. Finally, every component in a distributed system can be seen as a device, which can apply algorithms to data, transmit them, or store them.
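To illustrate the frame-and-queue mechanism, the following is a minimal C++ sketch of a fixed-size message frame with a scatter-gather list and the inbound/outbound queue pair. The field names and layout are simplified illustrations; they do not reproduce the actual I2O 2.0 frame format.

```cpp
#include <cstdint>
#include <queue>

// Illustrative fixed-size message frame; the real I2O header layout
// is standardized and more elaborate than this sketch.
struct MessageFrame {
    uint16_t targetTid;      // Target Identifier of the destination DDM
    uint16_t initiatorTid;   // Tid of the sender, used for the reply
    uint16_t functionCode;   // selects the function to invoke on the target
    uint16_t sglCount;       // number of scatter-gather elements in use
    struct SglElement {      // each element references one data block
        void*    address;    // start of a fixed-size data block
        uint32_t length;     // bytes used within that block
    } sgl[8];                // chaining blocks carries arbitrary-length data
};

// The queue pair of the messaging interface: frames are never copied,
// only pointers to fixed-size blocks are enqueued.
struct QueuePair {
    std::queue<MessageFrame*> inbound;   // messages received from peers
    std::queue<MessageFrame*> outbound;  // messages posted for asynchronous send
};
```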

I2O and XDAQ applications are plug-in components that follow an event-driven processing scheme [5]. In this context, an event can be any incoming information occurrence; it may be a message, a timeout, a DMA completion or an interrupt request. Events trigger the execution of a user-supplied function that has been associated with the event at configuration time. Every event can be represented as an I2O message frame, which indicates the function that is to be invoked. The software that handles a message is provided by downloading a module into the executive at run time. Through this scheme we achieve the required extensibility, because the processing of a function is decoupled from its invocation. Events are added by defining new message identifiers. The system may provide default procedures if no code is supplied for a given event. Exceptions that cannot be handled by the executive are forwarded to the application plug-in with appropriate error information. The user can achieve fault-tolerant behaviour by providing the exception-handling code; otherwise, the application is set into a failed state and outside intervention is required. The event-based processing concept has been shown to be scalable [6] and efficient [7]. The processing time per message received or sent is in the range of 1.4 µs on a 750 MHz Pentium III, independent of the transport or data format. There is no central place for parsing incoming information: each plug-in has a local dispatcher that knows what to do with a received message. Since a message contains an identifier for the target object, it simply needs to be routed to the right DDM. Hence, it is not necessary to register a new event with the framework; it is sufficient to add it to the module, and a recompilation of the whole program is avoided.

A common addressing scheme is the glue between distributed components that leads from dispersed programs to a uniform system. With the diversity of communication subsystems that DAQ systems comprise, it becomes difficult to provide this homogeneous system view. XDAQ relies on I2O addressing, which assigns a numeric identifier, the Target Identifier (Tid), to each instance of an application class (a DDM). Although a Tid is originally unique within one hardware device only, we provide unique identifiers for all applications within a cluster through the use of a common configuration system. Various networks can be plugged into the executive as Peer Transports (PT) that may work concurrently. The executive holds configuration information that allows the PTs to translate Tids into physical networking addresses. Using this approach, XDAQ allows a change of transport without the need to modify a single line of application code. It also interfaces to a reliable multicast library (FRB) for Fast Ethernet and FireWire that shows excellent performance [8]. Other PTs include GM [9] message passing over Myrinet, HTTP transport, TCP/IP streams, as well as raw Ethernet datagrams.
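The local dispatching and Tid-based routing can be sketched as follows; all class and method names here are hypothetical illustrations of the scheme, not the actual XDAQ API.

```cpp
#include <cstdint>
#include <functional>
#include <map>
#include <string>

struct MessageFrame;  // as sketched in the previous listing

// A plug-in application (DDM): it binds message identifiers to callbacks
// at configuration time and dispatches locally, so there is no central
// parser and new events require no framework recompilation.
class Application {
public:
    using Callback = std::function<void(MessageFrame*)>;

    // Associate an incoming event (message identifier) with user code.
    void bind(uint16_t functionCode, Callback cb) {
        dispatchTable_[functionCode] = cb;
    }

    // Invoked by the executive for every frame addressed to this Tid.
    void dispatch(MessageFrame* frame, uint16_t functionCode) {
        auto it = dispatchTable_.find(functionCode);
        if (it != dispatchTable_.end())
            it->second(frame);        // user-supplied handler
        else
            defaultHandler(frame);    // framework-provided default
    }

private:
    void defaultHandler(MessageFrame*) { /* e.g. log and discard */ }
    std::map<uint16_t, Callback> dispatchTable_;
};

// A peer transport translates Tids into physical network addresses using
// configuration data held by the executive; several transports (TCP/IP,
// Myrinet/GM, HTTP, ...) can be plugged in concurrently.
class PeerTransport {
public:
    virtual ~PeerTransport() = default;
    virtual void post(MessageFrame* frame, const std::string& address) = 0;
};

class Executive {
public:
    // Configuration time: record which transport and address serve a Tid.
    void configure(uint16_t tid, PeerTransport* pt, const std::string& addr) {
        routes_[tid] = Route{pt, addr};
    }

    // Run time: route a frame without the sender knowing the network.
    void send(MessageFrame* frame, uint16_t targetTid) {
        Route& route = routes_.at(targetTid);
        route.transport->post(frame, route.address);
    }

private:
    struct Route { PeerTransport* transport; std::string address; };
    std::map<uint16_t, Route> routes_;
};
```

In this sketch, switching an application from, say, TCP/IP to Myrinet amounts to a different configure() call; no application code changes, which is the location transparency described above.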

3 Configuration and Control

For configuration and control of the cluster and the plug-in applications of the XDAQ executives, we also follow the I2O specification. This means that a set of predefined functions can be used to steer the applications, switch their states and access their operational parameters. Yet, I2O prescribes binary-formatted messages that are tedious to generate and parse, especially when it comes to heterogeneous environments. We bypass these limitations by using XML as the data format for serializing the I2O commands. The World Wide Web Consortium has standardized the encapsulation of XML data definitions in a message to be sent over HTTP in the SOAP specification [10]. It fits ideally our notion of active messages that trigger the execution of a function upon reception. Using this technology, we do not need to recompile parts of the control system if we add or modify messages and parameters.
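As an illustration only (the element and namespace names below are hypothetical, not the actual XDAQ message schema; only the SOAP envelope namespace is taken from [10]), a control message of this kind could look as follows:

```xml
<SOAP-ENV:Envelope
    xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/">
  <SOAP-ENV:Body>
    <!-- Hypothetical command: configure the application with Tid 42 -->
    <xdaq:Configure xmlns:xdaq="urn:example-xdaq-control">
      <xdaq:targetTid>42</xdaq:targetTid>
      <!-- Parameters are carried by name rather than by binary offset,
           so a receiver can ignore entries it does not understand -->
      <xdaq:parameter name="bufferSize">2048</xdaq:parameter>
    </xdaq:Configure>
  </SOAP-ENV:Body>
</SOAP-ENV:Envelope>
```

Because the receiver looks up elements by name, adding a parameter or reordering elements does not break older applications, which is precisely the versioning property exploited here.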

Such flexibility is difficult to achieve even with dynamic invocation interfaces (DII), for example those provided by CORBA implementations [11]. DIIs are dependent on the implementation language, and there are always parts that must transform user messages and parameters into programming-language statements understood by the DII. XML alleviates these complexities through the use of Web tools. Applications may query the message contents through hyperlinks to grammar definitions. With this, older versions of XDAQ applications continue to function, even if the parameters sent to the older application have changed order or data definition, or are not even understood.

To give the user of the cluster a practical tool to configure applications, to keep track of configurations and to define partitions that can be used over and over again, a resource manager component has been developed. It tracks the static and dynamic system view, interfacing to a relational DBMS using SQL, to various user-interface clients, and to the XDAQ executive applications through XML/SOAP messaging (figure 3). The resource manager handles the exclusive access to resources, while it can also monitor the system's operational state. The interface to define configurations is shown in figure 4.

Figure 3: The Resource Manager. Java, C++ and Web clients connect through an Apache Web server with PHP scripts; the resource manager exchanges XML/SOAP messages with the XDAQ executives and accesses hardware and software partition definitions via SQL.

Figure 4: Graphical resource definition, showing a partition definition that contains multiple resources.
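A minimal sketch of this role, with entirely hypothetical names: the manager defines reusable, named partitions and enforces exclusive access to their resources.

```cpp
#include <map>
#include <set>
#include <stdexcept>
#include <string>

// Hypothetical interface: partitions are named, reusable groups of
// resources; the manager enforces exclusive access across clients.
class ResourceManager {
public:
    void definePartition(const std::string& name,
                         const std::set<std::string>& resources) {
        partitions_[name] = resources;
    }

    // Acquire all resources of a partition for one client, or fail.
    void acquire(const std::string& partition, const std::string& client) {
        for (const auto& r : partitions_.at(partition))
            if (owners_.count(r))
                throw std::runtime_error("resource busy: " + r);
        for (const auto& r : partitions_.at(partition))
            owners_[r] = client;
    }

    void release(const std::string& partition) {
        for (const auto& r : partitions_.at(partition))
            owners_.erase(r);
    }

private:
    std::map<std::string, std::set<std::string>> partitions_;
    std::map<std::string, std::string> owners_;  // resource -> client
};
```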

4 Conclusions

To provide the homogeneous view of a processing cluster that data acquisition applications need, more than a simple collection of computers and a network is required; it is the existence of extensible and efficient middleware that provides this service. A homogeneous peer-to-peer communication infrastructure for diverse information sources, together with a platform-independent configuration mechanism, is an appropriate means to build extensible and efficient cluster applications. Peer-to-peer communication allows applications to exchange messages asynchronously and to act as both clients and servers. XDAQ is such a middleware solution; it is used in the current prototypes for the CMS experiment at CERN, at INFN Legnaro (Italy), and at Fermilab and UCSD (USA).

5 References

[1] C. Schwick et al., “The DAQ-Column prototype of the CMS experiment”, Proc. 12th IEEE NPSS Real-Time Conference, Valencia, Spain, June 4-8, 2001, IEEE conference proceedings.
[2] W. Hoschek et al., “Data Management in an International Data Grid Project”, IEEE/ACM Int'l Workshop on Grid Computing, 17-20 Dec. 2000, Bangalore, India.
[3] G. Antchev et al., “The CMS Event Builder Demonstrator and Results with Myrinet”, Computer Physics Communications 2189, Elsevier Science North-Holland, 2001 (in print).
[4] I2O Special Interest Group, Intelligent I/O Architecture Specification 2.0, 1999 (www.i2osig.org).
[5] B. N. Bershad et al., “Extensibility, Safety and Performance in the SPIN Operating System”, Proc. Fifteenth ACM Symposium on Operating System Principles, 1995, pp. 267-284.
[6] P. Pardyak and B. N. Bershad, “Dynamic Binding for an Extensible System”, Proc. 2nd USENIX Symposium on Operating Systems Design and Implementation, 1996, pp. 201-212.
[7] J. Gutleber et al., “Architectural Software Support for Processing Clusters”, Proc. IEEE Int'l Conference on Cluster Computing, Chemnitz, Germany, 2000, pp. 153-161.
[8] I. Suzuki et al., “An Implementation of a Reliable Message Broadcast for the CMS Event Builder System”, Proc. Int'l Conference on Computing in High-Energy and Nuclear Physics, Padova, Italy, 2000 (http://chep2000.pd.infn.it/abst/abs_b232.htm).
[9] http://www.myri.com
[10] Simple Object Access Protocol (SOAP), W3C Note, May 8, 2000, http://www.w3.org/TR/SOAP/
[11] W. Emmerich, “An Overview of OMG/CORBA”, IEE Colloquium on Distributed Objects - Technology and Application (332):1-6, October 1997.