Computer Physics Communications 150 (2003) 129–139 www.elsevier.com/locate/cpc
A Beowulf-class computing cluster for the Monte Carlo production of the LHCb experiment

G. Avoni a, M. Bargiotti a, A. Bertin a, M. Bruschi a, M. Capponi a, A. Carbone a, A. Collamati b, S. De Castro a, L. Fabbri a, P. Faccioli a, D. Galli a, B. Giacobbe a, I. Lax a, U. Marconi a, I. Massa a, M. Piccinini a, M. Poli c, N. Semprini Cesari a, R. Spighi a, V.M. Vagnoni a,*, S. Vecchi a, M. Villa a, A. Vitale a, A. Zoccoli a (The LHCb Bologna Group)

a Dipartimento di Fisica, Università di Bologna and INFN, Sezione di Bologna, via Irnerio 46, I-40126 Bologna, Italy
b INFN-CNAF, viale Berti Pichat 6/2, I-40127 Bologna, Italy
c Dipartimento di Energetica, Università di Firenze and INFN, Sezione di Bologna, via Irnerio 46, I-40126 Bologna, Italy
Received 1 July 2002
Abstract

The computing cluster built at Bologna to provide the LHCb Collaboration with a powerful Monte Carlo production tool is presented. It is a performance-oriented Beowulf-class cluster, made of rack-mounted commodity components, designed to minimize operational support requirements and to provide full and continuous availability of the computing resources. In this paper we describe the architecture of the cluster and discuss the technical solutions adopted for each specialized sub-system.
© 2002 Elsevier Science B.V. All rights reserved.

PACS: 07.05.-t; 83.20.Jp; 89.80.+h

Keywords: LHCb experiment; Monte Carlo simulation; Parallel computing; Linux operating system
* Corresponding author. E-mail address: [email protected] (V.M. Vagnoni).

1. Introduction

LHCb is a dedicated experiment on b-quark physics currently under construction at the Large Hadron Collider (LHC). It will profit from the 500 µb b-hadron production cross section in the 14 TeV proton-proton collisions at the LHC (i.e. about 1% of the total visible cross section) to make precise measurements of CP violation and rare decays of B mesons [1]. By over-constraining the Cabibbo–Kobayashi–Maskawa matrix elements, LHCb will hopefully be able to observe subtle inconsistencies with the Standard Model, thereby providing indications of New Physics.

The LHCb experiment will be fully operational already at the start-up of the LHC, because it requires only a "modest" luminosity L = 2 · 10³² cm⁻² s⁻¹ to reach its full performance potential. This is two orders of magnitude lower than the nominal luminosity of the LHC
at the ATLAS and CMS interaction points, and is in fact even lower than the value foreseen for the first running period of the machine.

Starting from 2007, LHCb and the other LHC experiments are expected to generate huge quantities of data, of the order of several Petabytes (PB) per year and about 100 PB over the whole data taking period. This gigantic volume of accumulated information will be accessed and analyzed concurrently by hundreds of researchers spread across Europe, America and Asia, requiring unprecedented resources (tens of thousands of processors and hundreds of Terabytes (TB) of disk space for holding intermediate results). The computing infrastructures capable of handling such huge data sets involve costs comparable to those of the particle detectors and trigger electronics themselves. For technical and political reasons they have to be geographically distributed and hierarchically organized [2], and are planned to be integrated into a virtual computing environment by the GRID middleware [3].

The baseline LHCb computing model is based on the distributed multi-tier regional centre model [4]. Accordingly, LHCb plans to distribute the processing load between the computing facilities available at CERN and at the regional computing centres. In this early phase the computing effort is mainly concentrated on Monte Carlo simulations. The capability of producing millions of simulated events in a reliable and prompt way is of crucial importance for the success of the experiment. Simulation studies are carried out to determine the acceptance of the detector and the efficiency of the full reconstruction, to analyze the relevant physics processes and to optimize the trigger.

Commodity computing clusters are at present the most natural solution to problems which require high performance and high availability on such a large scale at a reasonable cost. They also permit a flexibility of configuration which can be tailored to the execution of jobs on parallel resources.

In the following sections, after a brief discussion of computing clusters, we describe the Linux cluster built with the aim of producing high statistics Monte Carlo samples for the LHCb experiment. It has been set up at CNAF in Bologna, the INFN department where the Italian regional computing centre for the LHC experiments is being developed.
2. Computing clusters

A computing cluster consists of a set of stand-alone computers interconnected by a local network and working together as a single integrated computing resource. In particular, a Beowulf-class system is a cluster whose nodes are personal computers, integrated by a commodity Local Area Network (LAN) and hosting an open source Unix-like operating system [5].

The usage of clusters started in the early 1990s, as a move away from the expensive and specialized proprietary parallel super-computers, and clusters were adopted as an optimal solution for different kinds of applications, since they enable high performance in the solution of computing intensive problems, high availability (through redundancy of nodes) and high bandwidth (through multiplicity of disks and I/O channels).

Computing clusters offer many advantages compared to super-computers, from both the technological and the economical point of view. Implementing clusters with commodity components which respond to widely accepted industrial standards, commonly referred to as commercial-off-the-shelf (COTS), benefits from the low prices resulting from heavy competition and mass production. The usage of COTS also provides high performance, thanks to the enormous advances in computer hardware of the last decade. Another advantage is the great flexibility in configuring systems, allowing users to implement systems according to the demands of the application.

However, the implementation of a cluster presents problems which can be solved by the use of a suitable resource management model. The different areas of the necessary support can be summarized in the following points:

• System Installation and Configuration: the local installation of the operating system and of the specific application software on every node is problematic and time consuming. This leads to the necessity of automatic procedures for remote installation or, alternatively, to the centralization of all the file systems on one or more specialized nodes of the cluster, which must export them to the other nodes.
• System Availability: every component which provides critical services has to be made redundant
to avoid Single Points of Failure (SPOFs) of the system. File servers and nodes deploying centralized services have to be duplicated, or at least configured with some level of redundancy (e.g., disk arrays configured in RAID-1 or RAID-5 [6]). In case of disruptive failures, solutions that minimize the dead time of the system need to be adopted.
• System Monitoring: automatic tools to detect hardware or software anomalies are necessary. A dedicated node can gather the information needed to reveal malfunctions, by continuously monitoring a set of critical variables (system temperature, fan speed, CPU load, network load, disk volume usage, memory occupancy, batch queue status, etc.). It should also activate alarm procedures when key parameters go out of range.
• Data Accessibility: if the data were distributed over the local disks of the computing nodes, data management would become very difficult. It is therefore preferable to make use of centralized file servers able to provide a single file hierarchy, ranging from a few TB up to tens of TB.
• System Security: protection against unauthorized access is a very important point to be considered; computing resources in educational and research centres are frequent targets, and a cluster made of several identical nodes with centralized accounts is a particularly attractive and exploitable system.
3. The LHCb-Bologna Computing Cluster

The Monte Carlo production for a High Energy Physics (HEP) experiment is performed by simulating the response of the detector to the reaction products of uncorrelated interaction events, and can be parallelized efficiently by running independent processes without inter-process communication. The Monte Carlo production task is therefore unaffected by the typical drawbacks of COTS clusters, such as high interconnection latency and low bandwidth, and can be accomplished very well by a loosely coupled computing system. The LHCb Monte Carlo data production is a typical example of a CPU-bound problem, since the I/O of the computing nodes takes only a small fraction of the processing time.
Therefore, the cluster works optimally with one Monte Carlo process running on each CPU.

Currently, the execution of a typical LHCb Monte Carlo job is realized in two steps. In the first step the RAW data are produced by the simulation program SICBMC [7], which generates the primary interactions using PYTHIA 6.134 [8] and simulates the detector response by means of the GEANT 3.21 code [9]. In the second step, executed in chain, the RAW data are read back and reprocessed to carry out the event reconstruction, using the C++ reconstruction program BRUNEL [10], based on the GAUDI framework [11]. BRUNEL produces output data files in DST (Data Summary Tape) format.

The current size of a signal event in RAW format, at the standard LHCb luminosity,¹ is about 1 MB, while in DST format it is about 2 MB.² The processing of a signal event, running the current version of the programs on a 1 GHz Intel PIII CPU, requires a computing time of about 100 s for SICBMC and about 30 s for BRUNEL. Therefore, the average I/O corresponding to one CPU is about 30 kB/s. Considering that the cluster is composed of 100 CPUs, i.e. 100 Monte Carlo processes running at the same time, one gets a total average I/O of 3 MB/s. The size of the reconstructed DST data files to be permanently stored every day for the subsequent analysis phase is about 130 GB/day.

¹ The event complexity depends on the average interaction multiplicity, which in turn depends on the luminosity.
² DST data currently also contain the RAW information for debugging purposes. RAW hits will be removed in the near future, thus reducing the DST event size to a few hundred kB.

In the current implementation of the system, the permanent storage facility is the CASTOR tape library at CERN. A procedure running in the background waits for the end of the execution of the Monte Carlo jobs and then opens a connection to a server at CERN to transfer the data files (each file typically contains 500 events, i.e. a 1 GB DST file).
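To make the above estimates explicit, the following short Python sketch reproduces the per-CPU I/O rate, the aggregate cluster throughput and the daily DST volume from the event sizes and processing times quoted in the text. The numbers are taken from this section; the script itself is purely illustrative.

```python
# Illustrative back-of-the-envelope estimate of the cluster I/O,
# using the event sizes and CPU times quoted in the text.

RAW_MB = 1.0          # size of a signal event in RAW format (MB)
DST_MB = 2.0          # size of a signal event in DST format (MB)
T_SICBMC_S = 100.0    # simulation time per event on a 1 GHz PIII (s)
T_BRUNEL_S = 30.0     # reconstruction time per event (s)
N_CPUS = 100          # CPUs in the cluster (one MC process per CPU)

# Per event, each CPU writes the RAW file, reads it back and writes the DST.
io_per_event_mb = RAW_MB + RAW_MB + DST_MB
t_per_event_s = T_SICBMC_S + T_BRUNEL_S

io_per_cpu_kb_s = io_per_event_mb * 1024.0 / t_per_event_s
io_total_mb_s = io_per_event_mb * N_CPUS / t_per_event_s

events_per_day = N_CPUS * 86400.0 / t_per_event_s
dst_per_day_gb = events_per_day * DST_MB / 1024.0

print(f"I/O per CPU:        {io_per_cpu_kb_s:.0f} kB/s")   # ~30 kB/s
print(f"Total cluster I/O:  {io_total_mb_s:.1f} MB/s")     # ~3 MB/s
print(f"DST volume per day: {dst_per_day_gb:.0f} GB/day")  # ~130 GB/day
```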
3.1. System architecture

The system makes use of the following specialized sub-systems:

• Operating System Server (OSS);
• Disk Storage Server (DSS);
• Management and Control Server (MCS);
• Computing Node (CN);
• Gateway Node (GN).

Each of these components is a computer running a flavor of the Linux operating system and is dedicated to the execution of a specific task. In the following we describe their functionalities and the technical solutions adopted for their implementation. A schematic representation of the cluster architecture is shown in Fig. 1. The whole system is arranged in slim rack-mounted boxes to deliver more capacity per unit floor area (see Fig. 2).

Fig. 1. Schematic representation of the LHCb-Bologna cluster architecture. The various components are described in the text.
3.1.1. The Operating System Server

The OSS is the fundamental component of our architecture, because it provides the operating systems to the MCS and to the CNs, allowing a centralized management of the whole cluster.
It consists of a 1U rack-mounted box,³ hosting a dual 1 GHz Intel PIII motherboard with 512 MB of ECC RAM, two on-board Fast Ethernet adapters configured in channel bonding and two hot-swappable 40 GB IDE disks mirrored by means of software RAID-1. It runs the Linux Red Hat 7.2 operating system upgraded to a 2.4.18 kernel (the latest release at the moment of writing).

The OSS is responsible for serving the Linux kernel image by means of a network boot procedure, and for storing and exporting the root filesystems to every CN and to the MCS through the Network File System (NFS), exploiting a peculiarity of the Linux kernel (root filesystem over NFS) [12]. One drawback of the network boot is the possibility of network load saturation during simultaneous downloads of the kernel image by many clients. This problem can be solved by using the Trivial File Transfer Protocol (TFTP) in conjunction with IP Multicast (MTFTP), thus allowing multiple clients to receive the same file concurrently. We adopted the Pre-boot Execution Environment (PXE) protocol, in the implementation distributed by Red Hat, which includes the usage of MTFTP [13].

³ 1U = 1.75 in = 4.45 cm (DIN 41491, IEC 60297).
The advantages of centralizing the operating systems become evident, for example, when introducing a new CN into the cluster. This procedure normally requires the installation of the operating system on the local disk, followed by a post-install configuration and the installation of the specific experiment environment and applications. In our case it requires only the execution of a simple script on the OSS, which replicates the content of a template directory and updates a few configuration files, and can be performed in a few seconds (a sketch of such a procedure is given below). This kind of configuration also offers exceptional advantages in case one needs to reconfigure or even to restore the operating systems.

The OSS further provides several other services: authentication (NIS), export of the home directories (via NFS), a time server for the synchronization of the system clocks (NTP), the services needed for the network boot (DHCP, PXE, MTFTP), syslog centralization and so on.

The centralization of the operating systems on a single server introduces a potential SPOF in the system. Mirroring the local disks in a RAID-1 configuration makes the system safe against single disk failures. In case of other types of failure (e.g., power supplies, CPUs, memory or network adapters), the two hot-swappable disk drives can be extracted and inserted into another identical machine, previously employed as a CN. To restore the whole system in case of disaster, a full backup of the disk content is regularly performed.
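As a concrete illustration of the procedure sketched above, the following Python fragment shows the kind of steps such a script has to perform: cloning a template root filesystem for the new node and registering its MAC/IP pair for the network boot. All paths, file names and the exact format of the DHCP and NFS entries are assumptions made for this example; the actual script used on the OSS is not reproduced here.

```python
#!/usr/bin/env python
"""Illustrative sketch of adding a new disk-less CN on the OSS.

Paths and configuration formats are hypothetical; they only mimic the
kind of operations described in the text (template copy plus a few
configuration updates).
"""
import shutil
import subprocess

TEMPLATE_ROOT = "/export/roots/template"   # assumed template root filesystem
ROOTS_DIR = "/export/roots"                # assumed per-node root filesystems
EXPORTS_FILE = "/etc/exports"              # NFS exports table
DHCP_INCLUDE = "/etc/dhcpd-nodes.conf"     # assumed include file for host entries

def add_node(hostname: str, mac: str, ip: str) -> None:
    node_root = f"{ROOTS_DIR}/{hostname}"

    # 1. Replicate the template root filesystem for the new node.
    shutil.copytree(TEMPLATE_ROOT, node_root, symlinks=True)

    # 2. Export the new root filesystem via NFS to that node only.
    with open(EXPORTS_FILE, "a") as exports:
        exports.write(f"{node_root} {ip}(rw,no_root_squash,sync)\n")

    # 3. Register the node for the PXE/DHCP network boot.
    with open(DHCP_INCLUDE, "a") as dhcp:
        dhcp.write(
            f"host {hostname} {{ hardware ethernet {mac}; "
            f"fixed-address {ip}; }}\n"
        )

    # 4. Make the services pick up the new configuration.
    subprocess.run(["exportfs", "-ra"], check=True)
    subprocess.run(["service", "dhcpd", "restart"], check=True)

if __name__ == "__main__":
    # Hypothetical node name and addresses, for illustration only.
    add_node("cn51", "00:30:48:12:34:56", "192.168.1.51")
```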
Fig. 2. Photograph of one of the racks composing the LHCb-Bologna cluster, hosting two KVM switches (see Section 3.1.6), the system console and several nodes with dual processor motherboards.
3.1.2. The Disk Storage Servers

The disk storage arrays act as a temporary cache until the data are transferred to the permanent storage site (in our case the CASTOR tape library at CERN). For this purpose it is important to foresee an appropriate safety margin, which ensures the continuity of the production in case of temporary unavailability of the tape library. In our case 1–2 TB are judged to be sufficient to satisfy the current requirements. A Network Attached Storage (NAS) disk server solution is adequate for our needs, because a total average I/O of 3 MB/s is easily within the means of a NAS exporting its volumes via NFS.
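The safety margin mentioned above can be quantified with a one-line estimate: at the production rate of about 130 GB/day quoted earlier, the chosen cache size corresponds to roughly one to two weeks of autonomy. The snippet below is only a restatement of that arithmetic.

```python
# How long the local disk cache can absorb the production if the
# remote tape library becomes unavailable (figures from the text).
DST_RATE_GB_PER_DAY = 130.0      # DST output of the whole cluster
for cache_tb in (1.0, 2.0):
    days = cache_tb * 1024.0 / DST_RATE_GB_PER_DAY
    print(f"{cache_tb:.0f} TB cache -> about {days:.0f} days of autonomy")
# 1 TB -> ~8 days, 2 TB -> ~16 days
```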
In order to provide system redundancy, the disk storage service is duplicated. Two NAS servers are used, which employ different technologies: one is based on an IDE disk array (RaidZone RS15-R1200 [14]), the other on a SCSI disk array (RaidTec V12 [15]). The two systems are normally used at the same time, thus distributing the overall load. In case of failure of one of them, the other is able to carry on the production until the repair.

The RaidZone NAS is hosted in a 4U rack-mounted box and is supplied with 15 × 73 GB IDE disks, of which 14 compose the RAID-5 array (about 1 TB in total), while the remaining disk is used as an automatic hot spare. The server runs Linux Red Hat 6.1 with a 2.2.17 kernel, as provided by the manufacturer. It is equipped with a dual processor motherboard with 800 MHz Intel PIII CPUs and 1 GB of ECC RAM. The network connectivity is provided by two Fast Ethernet cards configured in channel bonding.

The RaidTec NAS, also hosted in a 4U box, makes use of 8 × 169 GB SCSI disks (it can host a maximum of 12 disks), where 7 disks constitute a RAID-5 volume, while the remaining disk is employed as an automatic hot spare, for a total storage of about 1 TB. The server is equipped with a single processor motherboard based on a PowerPC 750 processor, with 512 MB of ECC RAM and two Fast Ethernet network adapters. It runs the Embedded FlashLinux 2.2 operating system, as provided by the manufacturer.

3.1.3. The Management and Control Server

The MCS is dedicated to the management of the batch queues and to the system monitoring. It also runs the procedure responsible for the data transfer to the remote tape library. The hardware is the same as that of the OSS, except that it is a disk-less machine mounting its root filesystem via NFS from the OSS. The operating system is a CERN-certified Linux Red Hat 6.1, upgraded to a 2.2.18 kernel.

The batch queue system is based on OpenPBS 2.3 [16]. PBS (Portable Batch System) is among the most frequently used batch queuing and workload management systems, and provides a single coherent interface to all the computing resources of the cluster. The MCS runs the PBS server and scheduler processes, while the CNs run the so-called machine oriented mini-servers (MOMs) which, in the PBS model, are responsible for placing jobs into execution as directed by the server, establishing resource usage limits, monitoring the job usage and notifying the server when the job is completed.
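For reference, the fragment below sketches how a Monte Carlo job might be handed to OpenPBS from the MCS: a minimal job script with standard #PBS directives is written out and submitted with qsub. The queue name, script paths and resource limits are illustrative assumptions, not the actual LHCb production settings.

```python
#!/usr/bin/env python
"""Minimal sketch of submitting one Monte Carlo job to OpenPBS.

The queue name, paths and the run scripts are placeholders; only the
#PBS directives and the qsub call follow the standard PBS interface.
"""
import subprocess

JOB_TEMPLATE = """#!/bin/sh
#PBS -N lhcb_mc_{run}
#PBS -q {queue}
#PBS -l walltime=24:00:00
#PBS -j oe
#PBS -o {logdir}/lhcb_mc_{run}.log
# The job itself: simulation followed in chain by reconstruction.
{mcdir}/run_sicbmc.sh {run} && {mcdir}/run_brunel.sh {run}
"""

def submit(run: int,
           queue: str = "mcprod",          # assumed production queue name
           mcdir: str = "/opt/lhcb/prod",  # assumed location of the MC scripts
           logdir: str = "/data/logs") -> str:
    script = f"/tmp/lhcb_mc_{run}.pbs"
    with open(script, "w") as f:
        f.write(JOB_TEMPLATE.format(run=run, queue=queue,
                                    mcdir=mcdir, logdir=logdir))
    # qsub prints the identifier of the newly queued job.
    out = subprocess.run(["qsub", script], check=True,
                         capture_output=True, text=True)
    return out.stdout.strip()

if __name__ == "__main__":
    print("submitted job", submit(run=1))
```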
The MCS collects system performance parameters from every node, such as the system temperature, CPU utilization, memory usage, network load and disk usage, and presents them in a form which can easily be displayed by the system administrators. Details of the monitoring system implementation are discussed in Section 3.2.

Another important duty accomplished by the MCS is the transfer of the DST data files (1 GB each) produced by the Monte Carlo processes to the remote tape library at CERN. The procedure runs in the background, and whenever a Monte Carlo process completes its execution the transfer starts. The available bandwidth from our production site to CERN is at the moment 100 Mbit/s, with a Round-Trip Time of about 20 ms. To fully exploit the available bandwidth, the transfer procedure has been based on BBFTP, which allows files to be transferred through several parallel TCP streams.⁴ By employing BBFTP we are able to transfer data with a throughput of up to 70 Mbit/s (see Fig. 3), i.e. 70% of the total bandwidth. The fraction of file transfer failures recorded during the last months of operation was typically limited to the few percent level.

The MCS is a potential SPOF of the system, because it provides fundamental services which are not duplicated by other components. However, its disk-less configuration allows for a fast replacement with another machine, which can be one of the machines employed as CNs.⁵ A broken MCS can also be safely turned off by means of the Ethernet-controlled power distributor (see Section 3.1.6), without the need of any physical access to the hardware.

⁴ BBFTP was written for the BaBar experiment in order to transfer big files (more than 2 GB) between SLAC (California) and the IN2P3 Computing Centre (Lyon, France) [17].
⁵ The replacement simply requires changing the configuration of the DHCP server on the OSS in order to assign the IP address of the MCS to the new machine. This is, by the way, another example of the advantages obtained by centralizing the operating systems on the OSS.
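A minimal sketch of the background transfer procedure described above is given below. It simply watches the cache area for completed DST files and hands each one to the transfer client; the directory layout and the exact BBFTP invocation are assumptions (the real options are documented in [17]), and error handling is reduced to retrying failed files at the next pass.

```python
#!/usr/bin/env python
"""Illustrative sketch of the background DST transfer loop on the MCS.

Directory names and the transfer command line are placeholders; only
the overall mechanism (poll for finished files, ship them to CERN,
then remove them from the local cache) follows the text.
"""
import glob
import os
import subprocess
import time

SPOOL_DIR = "/data/dst/done"       # assumed: finished 1 GB DST files appear here
REMOTE = "castor.cern.ch"          # assumed alias of the CERN-side server
POLL_INTERVAL_S = 60

def transfer(path: str) -> bool:
    """Ship one DST file with bbftp; return True on success."""
    # Placeholder invocation: the actual bbftp options (parallel TCP
    # streams, account, target directory) are configured as in [17].
    cmd = ["bbftp", "-e", f"put {path}", REMOTE]
    return subprocess.run(cmd).returncode == 0

def main() -> None:
    while True:
        for path in sorted(glob.glob(os.path.join(SPOOL_DIR, "*.dst"))):
            if transfer(path):
                os.remove(path)    # free the cache once safely stored
            # on failure the file is kept and retried at the next pass
        time.sleep(POLL_INTERVAL_S)

if __name__ == "__main__":
    main()
```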
Fig. 3. Data transfer throughput versus time from CNAF to CERN, recorded during one day. By employing BBFTP, up to 70 Mbit/s (over the currently available 100 Mbit/s) can be achieved.
3.1.4. The Computing Nodes

As already mentioned, every component of the cluster is arranged in rack-mounted boxes to deliver more computing power per unit floor area. Making use of 1U boxes mounted in a standard 19-inch rack and of dual processor motherboards, one is able to stack about 80 CPUs in less than 1 m².

The CNs are equipped with disk-less dual processor motherboards hosted in 1U and 2U rack-mounted boxes, with Intel PIII CPUs having clock rates ranging from 866 MHz to 1.2 GHz and 512 MB of memory. The network connectivity is provided by two Fast Ethernet adapters, but only one of them is currently used. The RAM size is dimensioned so as not to run out of memory with two processes simultaneously active on a CN (one LHCb Monte Carlo job typically requires a maximum of 140 MB of memory). The operating system running on the CNs is a CERN-certified Linux Red Hat 6.1, upgraded to a 2.2.18 kernel.

The CNs are dedicated to the execution of the Monte Carlo jobs, spawned on them by the PBS server running on the MCS through the PBS MOM processes running locally. The hardware failure of one of these nodes is not critical for the system, since the PBS server automatically removes a non-responding client from the production queues and jobs are no longer submitted to it. Therefore the failure of a CN leads only to a small reduction of the computing power of the system.
3.1.5. The Gateway Node

For security purposes we implemented a high level of network isolation. The cluster nodes, except for the GN, make use of private IP addresses, and are in this way isolated from the Internet at the IP level. Moreover, they can only access a Virtual LAN (VLAN), and are thus isolated from the rest of the physical LAN at the Ethernet level.

The GN is the only node connected both to the private cluster network and to the external world, by means of two network interface cards. To allow the cluster nodes to access the external network (e.g., to reach centralized databases, tape libraries, distributed file systems such as AFS, etc.), the GN implements a Network Address Translation (NAT) mechanism [18]. By using NAT "masquerading", every cluster node is made to appear to the Internet as having a single IP address (the GN's public one). It is the duty of the NAT software layer to decide whether an incoming packet should be dropped or forwarded to the private network, and the decision is taken on the basis of explicit rules established by the system administrator.

The GN is also the "bastion" login host of the cluster, in that an external user must first log in to the GN in order to perform an interactive login to the hidden cluster nodes. It is of course of crucial importance to monitor the GN for unwanted intrusions and to keep its operating system up-to-date with the most recent security fixes available. Such a job can be performed on a single machine with a reasonable effort.
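To make the masquerading mechanism concrete, the fragment below shows the kind of Netfilter rules involved, applied from Python via the iptables command line. The interface names, the private subnet and the default policies are assumptions made for this example; they do not reproduce the actual rule set of the GN.

```python
#!/usr/bin/env python
"""Sketch of a NAT "masquerading" setup of the kind described in the text.

eth0 (private cluster VLAN) and eth1 (public Internet side), as well as
the 192.168.1.0/24 subnet, are assumptions made for this example.
"""
import subprocess

def ipt(*args: str) -> None:
    subprocess.run(["iptables", *args], check=True)

def setup_nat() -> None:
    # Rewrite the source address of outgoing cluster traffic to the
    # GN's public address (masquerading).
    ipt("-t", "nat", "-A", "POSTROUTING", "-o", "eth1",
        "-s", "192.168.1.0/24", "-j", "MASQUERADE")

    # Forward outgoing traffic and the replies to established connections;
    # drop everything else by default.
    ipt("-P", "FORWARD", "DROP")
    ipt("-A", "FORWARD", "-i", "eth0", "-o", "eth1", "-j", "ACCEPT")
    ipt("-A", "FORWARD", "-i", "eth1", "-o", "eth0",
        "-m", "state", "--state", "ESTABLISHED,RELATED", "-j", "ACCEPT")
    # (IP forwarding must also be enabled via /proc/sys/net/ipv4/ip_forward.)

if __name__ == "__main__":
    setup_nat()
```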
The GN's hardware consists of a dual Pentium III machine with 512 MB of ECC memory, two 100 Mbit/s network cards and two 40 GB disks configured in software RAID-1 (the availability of the GN is essential, since it is a potential SPOF of the system). It runs a Linux Red Hat 7.2 operating system upgraded to a 2.4.18 kernel. The NAT rules are specified by means of Netfilter's iptables interface (version 1.2.6a) [19]. The NAT mechanism based on Linux 2.4 and iptables has proved to accomplish the required tasks in a very effective and flexible way (for example, the well-known problem of broken AFS callbacks behind a NAT has been solved by means of a static port mapping), even under a heavy and sustained load (while transferring the data files to CERN with the BBFTP tool, the two network cards of the GN work almost at full capacity).

3.1.6. Other components

Other hardware components employed to build up the cluster are the Ethernet switch, the Ethernet-controlled power distributor and the KVM (Keyboard-Video-Mouse) switches.

The network connectivity is provided by a modular Ethernet switch (HP ProCurve 4000M [20]). It can mount up to 10 hot-pluggable modules, each handling 8 Fast Ethernet channels or a single Gigabit Ethernet channel.

The control of the power supply of every node is provided by an Ethernet-interfaced power switch. We make use of the FieldPoint modular I/O system by National Instruments [21], which allows one to connect a network interface module to up to 9 relay interface modules, each having 8 independent relays. Each node of the cluster can thus be turned on or off by means of a program running on a remote machine, which communicates with the network interface of the power switch module and is easily controlled through a graphical interface.

Finally, for the centralization of the system console, we make use of a set of OmniView Pro 16 KVM switches by Belkin [22]. Each switch can handle 16 channels, and up to 16 switches can be stacked together to serve a total of 256 independent KVM channels.
3.2. Monitoring tools

When a cluster reaches a significant size, system monitoring becomes mandatory. The ability to monitor the various components of the cluster is critical to determine the source of problems that may arise during operation and to tune the system for better performance. The relevant information to be collected concerns the status of the computing hardware (CPU load, RAM occupancy, CPU temperature, fan speed), the network (network load, rates of unicast, broadcast and multicast packets), the disk storage (disk occupancy) and the batch queues (running and queued jobs).

CPU load, RAM and disk occupancy are easily accessible quantities with Linux (by means of the /proc virtual file system). To access non-standard quantities, like CPU temperatures and fan speeds, one needs to use specific drivers (e.g., the open source lm_sensors [23]), which read the information detected by the sensors located on the motherboard. The network status is accessible via the SNMP protocol [24], by querying the Ethernet switch. The process and queue status is retrieved from the batch queue system interface (PBS).

The information is gathered through monitoring agents, written in the Perl language, running on each node of the cluster. These agents are regularly activated at fixed times by the cron process. In order to discover anomalies in system operation, the parameters read by the Perl agents over a certain time range are stored in a database to obtain historical views. In the present implementation the data are stored in compressed fixed-size ASCII files containing daily, monthly and yearly histories, updated using a round-robin procedure (when a new record is written the oldest one is deleted) and located on a shared disk.

The easiest way to access the monitoring information is by using a web interface. In order to get readable graphics we chose to employ the Java applet technology, instead of putting static graphics files on the web (e.g., GIF files à la MRTG [25]). When the user opens the web page connecting to the HTTP port of the MCS, the Java applet is downloaded to the Java Virtual Machine running in the web browser. The Java applet, in turn, opens a connection to another HTTP server, always running on the MCS, which returns the monitoring data, and dynamically builds the graphics to be displayed.
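The agents used on the cluster are written in Perl; purely as an illustration of what such an agent does, the following Python sketch reads a few of the quantities listed above from the /proc virtual file system and appends them to a fixed-size history file using the round-robin policy described in the text. The file locations and the record format are assumptions.

```python
#!/usr/bin/env python
"""Illustrative monitoring agent (the production agents are Perl scripts).

It samples the CPU load and memory occupancy from /proc and appends them
to a fixed-length, round-robin history file on a shared disk. Paths and
the record format are assumptions made for this sketch.
"""
import socket
import time

HISTORY_FILE = "/monitor/history/{host}.day"   # assumed shared-disk location
MAX_RECORDS = 1440                             # one record per minute, one day

def read_loadavg() -> float:
    with open("/proc/loadavg") as f:
        return float(f.read().split()[0])      # 1-minute load average

def read_mem_used_kb() -> int:
    mem = {}
    with open("/proc/meminfo") as f:
        for line in f:
            if ":" not in line:
                continue
            key, value = line.split(":", 1)
            fields = value.split()
            if fields and fields[0].isdigit():
                mem[key.strip()] = int(fields[0])   # values are in kB
    return mem["MemTotal"] - mem["MemFree"]

def append_round_robin(path: str, record: str) -> None:
    try:
        with open(path) as f:
            lines = f.read().splitlines()
    except FileNotFoundError:
        lines = []
    lines.append(record)
    # Round-robin policy: when a new record is written, the oldest is dropped.
    lines = lines[-MAX_RECORDS:]
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")

if __name__ == "__main__":
    host = socket.gethostname()
    record = f"{int(time.time())} {read_loadavg():.2f} {read_mem_used_kb()}"
    append_round_robin(HISTORY_FILE.format(host=host), record)
```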
Fig. 4. Schematic view of the monitoring system mechanism. The upper part of the figure describes where the various monitoring agents run and which kind of information they retrieve. The lower part describes the visualization step based on the Java Applet technology.
A graphical representation of the monitoring mechanism is shown in Fig. 4. The employment of Java applets is extremely flexible and allows the user to interact with the graphical presentation. For example, the image can be modified dynamically by adding a grid or by stacking the various plots to obtain a cumulative view of additive quantities. The user can also decide to remove curves of minor interest, in order to have a clearer view of the variables pertaining to a particular machine. As an example, a typical plot showing the time dependence of the system temperatures, for a subset of the nodes, can be found in Fig. 5.

4. Conclusions

We built a Beowulf-class computing cluster, completely based on the Linux operating system, with the aim of producing high statistics Monte Carlo data samples for the LHCb Collaboration. The system, made of rack-mounted components, was set up at CNAF in Bologna, where INFN is developing the Italian regional centre for the LHC experiments.
Its architecture comprises specialized sub-systems which allow for a centralized and easy administration. The cluster exploits the remote boot of the disk-less computing nodes through Intel's PXE protocol and makes use of centralized file systems by means of the root-over-NFS Linux capability. The data produced by 50 dual processor computing nodes, simultaneously running 100 Monte Carlo jobs at 100% CPU load, are stored locally on 2 NAS disk servers (1 TB each), employed as a temporary disk cache. A high-rate data transfer to CERN is then performed using the BBFTP client/server tool, in order to store the data permanently on the CASTOR tape library for the subsequent analysis phase.

Thanks to the employment of Virtual LANs and private IP addresses, the cluster satisfies a high standard of system security. A gateway node implementing a Network Address Translation mechanism allows the hidden computing nodes to open the necessary connections to the Internet. The system has been provided with a home-made monitoring tool, based on lm_sensors, Perl agents and Java applet technology.
Fig. 5. Graphical display of the monitoring system developed for the LHCb-Bologna cluster. The plot shows the system temperatures over a period of one month in Summer 2001, for a subset of the cluster nodes. Sudden changes in the temperatures indicate the start and stop times of the MC production.
The cluster has been running for one year in a stable and reliable way, and has so far simulated several million interaction events. This approach has proved to be an optimal, cost-effective, high-performance solution for the massive Monte Carlo production of the LHCb experiment.

Finally, it can be observed that, even if the cluster was designed and optimized for Monte Carlo data production, it can be adapted with minimal changes to allow for efficient concurrent data analysis. Data analysis in general requires a much higher data throughput than the one provided by the NAS file servers. We are therefore studying the possibility of employing a parallel file system in order to achieve high-performance access to the data storage by many clients.
Acknowledgements

The authors are very grateful to the CNAF staff, and in particular to the CNAF's director F. Ruggieri,
for the prompt and valuable support. The authors are also indebted to E. van Herwijnen for his assistance in using the LHCb distributed Monte Carlo production system and to F. Harris for useful discussions and for the critical review of the manuscript.
References

[1] The LHCb Collaboration, LHCb: Technical Proposal, CERN-LHCC-98-004.
[2] The MONARC Project, Models of Networked Analysis at Regional Centres for LHC Experiments, http://monarc.web.cern.ch/MONARC.
[3] The DataGrid Project, http://eu-datagrid.web.cern.ch/eu-datagrid.
[4] J. Harvey, Computing Model—Baseline model of LHCb's distributed computing facilities, http://lhcb-comp.web.cern.ch/lhcb-comp/computingmodel/ComputingModelV3.pdf.
[5] The Beowulf Project, http://www.beowulf.org.
[6] D.A. Patterson, G. Gibson, R.H. Katz, A case for Redundant Arrays of Inexpensive Disks (RAID), in: Proceedings of the 1988 ACM SIGMOD International Conference on Management of Data, Chicago, IL, June 1–3, 1988, pp. 109–116.
[7] The LHCb Collaboration, SICB—The LHCb Geant3 based simulation program, http://lhcb-comp.web.cern.ch/lhcb-comp/SICB.
[8] T. Sjostrand, P. Eden, C. Friberg, L. Lonnblad, G. Miu, S. Mrenna, E. Norrbin, High-energy-physics event generation with PYTHIA 6.1, Comput. Phys. Comm. 135 (2001) 238.
[9] R. Brun, et al., GEANT3, Internal Report CERN DD/EE/84-1, CERN, 1987.
[10] The LHCb Collaboration, BRUNEL—The LHCb Reconstruction Program, http://lhcb-comp.web.cern.ch/lhcb-comp/Reconstruction.
[11] G. Barrand, et al., Gaudi—A software architecture and framework for building HEP data processing applications, Comput. Phys. Comm. 140 (2001) 45; The LHCb Collaboration, http://lhcb-comp.web.cern.ch/lhcb-comp/Frameworks/Gaudi.
[12] B. Jeunhomme, Network boot and exotic root HOWTO, http://www.tldp.org/HOWTO/Network-boot-HOWTO.
[13] Intel Corporation, Preboot Execution Environment (PXE) Specification, version 2.1, ftp://download.intel.com/labs/manage/wfm/download/pxespec.pdf.
[14] Consensys Corporation, http://www.raidzone.com.
[15] Raidtec Corporation, http://www.raidtec.com.
[16] Veridian Systems, Portable Batch System, http://www.openpbs.org.
[17] IN2P3 Computing Centre, bbftp home page, http://doc.in2p3.fr/bbftp.
[18] P. Srisuresh, M. Holdrege, IP Network Address Translator (NAT) Terminology and Considerations, Request for Comments 2663, 1999, http://www.ietf.org/rfc/rfc2663.txt; P. Srisuresh, K. Egevang, Traditional IP Network Address Translator (Traditional NAT), Request for Comments 3022, 2001, http://www.ietf.org/rfc/rfc3022.txt.
[19] J. Kadlecsik, H. Welte, J. Morris, M. Boucher, R. Russell, Netfilter—firewalling, NAT and packet mangling for Linux 2.4, http://www.netfilter.org.
[20] Hewlett-Packard Company, http://www.hp.com/rnd/products/switches/switch4000/summary.htm.
[21] National Instruments Corporation, FieldPoint Distributed I/O, http://www.ni.com.
[22] Belkin Components, KVM Switches, http://www.belkin.com.
[23] The lm_sensors Group, Hardware monitoring by lm_sensors, http://secure.netroedge.com/~lm78.
[24] J. Case, M. Fedor, M. Schoffstall, J. Davin, A Simple Network Management Protocol (SNMP), Request for Comments 1157, 1990, http://www.ietf.org/rfc/rfc1157.txt.
[25] T. Oetiker, D. Rand, MRTG—Multi Router Traffic Grapher, http://mrtg.hdl.com/mrtg.html.