Doctoral Thesis
Departament d’Enginyeria Electrònica
Universitat Autònoma de Barcelona
The CMS Trigger Supervisor: Control and Hardware Monitoring System of the CMS Level-1 Trigger at CERN
Ildefons Magrans de Abril
Thesis director: Dr. Claudia-Elisabeth Wulz
Tutor: Dr. Montserrat Nafría Maqueda
March 2008
Dr. Claudia-Elisabeth Wulz, leader of the CMS Trigger Group of the Institute for High Energy Physics in Vienna and Deputy CMS Trigger Project Manager,
CERTIFIES
That the dissertation The CMS Trigger Supervisor: Control and Hardware Monitoring System of the CMS Level-1 Trigger at CERN, presented by Ildefons Magrans de Abril in fulfilment of the requirements for the degree of Doctor en Enginyeria Electrònica, has been performed under her supervision.
Bellaterra, March 2008.
Dra. Claudia-Elisabeth Wulz
Abstract

The experiments CMS (Compact Muon Solenoid) and ATLAS (A Toroidal LHC ApparatuS) at the Large Hadron Collider (LHC) are the greatest exponents of the rising complexity of data handling instrumentation in High Energy Physics (HEP). Tens of millions of readout channels, tens of thousands of hardware boards and a similar number of connections are representative figures. The hardware volume, however, is not the only dimension of complexity: the unprecedentedly large number of research institutes and scientists that form the international collaborations, and the long design, development, commissioning and operational phases, are additional factors that must be taken into account.

The Level-1 (L1) trigger decision loop is an excellent example of these difficulties. This system is based on pipelined logic designed to analyze, without deadtime, the data from each LHC bunch crossing occurring every 25 ns, using special coarsely segmented trigger data from the detectors. The L1 trigger is responsible for reducing the rate of accepted crossings to below 100 kHz. While the L1 trigger is taking its decision, the full high-precision data of all detector channels are stored in the detector front-end buffers, which are only read out if the event is accepted. The Level-1 Accept (L1A) decision is communicated to the sub-detectors through the Timing, Trigger and Control (TTC) system. The L1 decision loop hardware was built by more than ten research institutes over a development and construction period of nearly ten years, and features more than fifty VME crates and thousands of boards and connections. In this context, it is mandatory to provide software tools that ease the integration and the short-, medium- and long-term operation of the experiment.

This research work proposes solutions, based on web services technologies, to simplify the implementation and operation of software control systems that manage hardware devices for HEP experiments. The main contribution of this work is the design and development of a hardware management system intended to enable the operation and integration of the L1 decision loop of the CMS experiment: the CMS Trigger Supervisor (TS). The TS conceptual design proposes a hierarchical distributed system that fits well the web-services-based model of the CMS Online SoftWare Infrastructure (OSWI). The functional scope of this system covers the configuration, testing and monitoring of the L1 decision loop hardware, and its interaction with the overall CMS experiment control system and the rest of the experiment. Together with the technical design aspects, the project organization strategy is discussed.

The main topic follows an initial investigation into the usage of the eXtensible Markup Language (XML) as a uniform data representation format for a software environment intended to implement hardware management systems for HEP experiments. This model extends the usage of XML beyond the boundaries of the control and monitoring related data and proposes its usage also for the code. This effort, carried out in the context of the CMS Trigger and Data Acquisition project, improved the overall team knowledge of XML technologies, created a pool of ideas and helped to anticipate the main TS requirements and architectural concepts.
Visual summary

This section summarizes the PhD thesis. In the original document it is a diagram of text boxes summarizing the main ideas, connected by labeled arrows; its content is reproduced below as a list. The author’s contributions to peer-reviewed journals (p), international conferences (c) and supervised master theses (t) are indicated next to the corresponding item.

Motivation (Chapter 1): unprecedented complexity related to the implementation of hardware control systems for the last generation of high energy physics experiments; very large hardware systems, human collaborations, and design, development and operational periods.

Generic solution (Chapter 2) [39]p [50]p. Development model: XML for data and code; interpreted code.

First lessons (Chapter 2): web services and the XDAQ middleware as suitable technologies; experience of developing a hardware management system for the CMS experiment.

Concrete case and main thesis goal (Chapters 1, 3): a control and monitoring system for the Level-1 (L1) trigger decision loop.

Conceptual design of the control system for the CMS L1 decision loop (Trigger Supervisor, TS) (Chapter 3) [56]p [60]c: requirements; project organization; layered design (Framework, System, Services).

Framework design (Chapter 4) [89]c [90]t: baseline technology survey; additional developments; performance measurements.

System design (Chapter 5) [95]c: design guidelines; distributed software system architecture.

Services design (Chapter 6) [97]t: configuration, interconnection test and GUI services.

Thesis achievements (Chapters 7, 8): a new software environment model that confirms XML and XDAQ; the TS design and project organization as a successful experience for future experiments; a building block of the CMS experiment; a contribution to the CMS operation; a proposal for a uniform CMS experiment control system.
Contents

ABSTRACT
VISUAL SUMMARY
CONTENTS
ACRONYMS

CHAPTER 1   INTRODUCTION
1.1 CERN AND THE LARGE HADRON COLLIDER
1.2 THE COMPACT MUON SOLENOID DETECTOR
1.3 THE TRIGGER AND DAQ SYSTEM
    1.3.1 Overview
    1.3.2 The Level-1 trigger decision loop
        1.3.2.1 Calorimeter Trigger
        1.3.2.2 Muon Trigger
        1.3.2.3 Global Trigger
        1.3.2.4 Timing, Trigger and Control System
1.4 THE CMS EXPERIMENT CONTROL SYSTEM
    1.4.1 Run Control and Monitoring System
    1.4.2 Detector Control System
    1.4.3 Cross-platform DAQ framework
    1.4.4 Sub-system Online Software Infrastructure
    1.4.5 Architecture
1.5 RESEARCH PROGRAM
    1.5.1 Motivation
    1.5.2 Goals

CHAPTER 2   UNIFORM MANAGEMENT OF DATA ACQUISITION DEVICES WITH XML
2.1 INTRODUCTION
2.2 KEY REQUIREMENTS
2.3 A UNIFORM APPROACH FOR HARDWARE CONFIGURATION CONTROL AND TESTING
    2.3.1 XML as a uniform syntax
    2.3.2 XML based control language
2.4 INTERPRETER DESIGN
    2.4.1 Polymorphic structure
2.5 USE IN A DISTRIBUTED ENVIRONMENT
2.6 HARDWARE MANAGEMENT SYSTEM PROTOTYPE
2.7 PERFORMANCE COMPARISON
2.8 PROTOTYPE STATUS

CHAPTER 3   TRIGGER SUPERVISOR CONCEPT
3.1 INTRODUCTION
3.2 REQUIREMENTS
    3.2.1 Functional requirements
    3.2.2 Non-functional requirements
3.3 DESIGN
    3.3.1 Initial discussion on technology
    3.3.2 Cell
    3.3.3 Trigger Supervisor services
        3.3.3.1 Configuration
        3.3.3.2 Reconfiguration
        3.3.3.3 Testing
        3.3.3.4 Monitoring
        3.3.3.5 Start-up
    3.3.4 Graphical User Interface
    3.3.5 Configuration and conditions database
3.4 PROJECT COMMUNICATION CHANNELS
3.5 PROJECT DEVELOPMENT
3.6 TASKS AND RESPONSIBILITIES
3.7 CONCEPTUAL DESIGN IN PERSPECTIVE

CHAPTER 4   TRIGGER SUPERVISOR FRAMEWORK
4.1 CHOICE OF AN ADEQUATE FRAMEWORK
4.2 REQUIREMENTS
    4.2.1 Requirements covered by XDAQ
    4.2.2 Requirements non-covered by XDAQ
4.3 CELL FUNCTIONAL STRUCTURE
    4.3.1 Cell Operation
    4.3.2 Cell command
    4.3.3 Factories and plug-ins
    4.3.4 Pools
    4.3.5 Controller interface
    4.3.6 Response control module
    4.3.7 Access control module
    4.3.8 Shared resource manager
    4.3.9 Error manager
    4.3.10 Xhannel
    4.3.11 Monitoring facilities
4.4 IMPLEMENTATION
    4.4.1 Layered architecture
    4.4.2 External packages
        4.4.2.1 Log4cplus
        4.4.2.2 Xerces
        4.4.2.3 Graphviz
        4.4.2.4 ChartDirector
        4.4.2.5 Dojo
        4.4.2.6 Cgicc
        4.4.2.7 Logging collector
    4.4.3 XDAQ development
    4.4.4 Trigger Supervisor framework
        4.4.4.1 The cell
        4.4.4.2 Cell command
        4.4.4.3 Cell operation
        4.4.4.4 Factories, pools and plug-ins
        4.4.4.5 Controller interface
        4.4.4.6 Response control module
        4.4.4.7 Access control module
        4.4.4.8 Error management module
        4.4.4.9 Xhannel
            4.4.4.9.1 CellXhannelCell
            4.4.4.9.2 CellXhannelTb
        4.4.4.10 CellToolbox
        4.4.4.11 Graphical User Interface
        4.4.4.12 Monitoring infrastructure
            4.4.4.12.1 Model
            4.4.4.12.2 Declaration and definition of monitoring items
        4.4.4.13 Logging infrastructure
        4.4.4.14 Start-up infrastructure
4.5 CELL DEVELOPMENT MODEL
4.6 PERFORMANCE AND SCALABILITY MEASUREMENTS
    4.6.1 Test setup
    4.6.2 Command execution
    4.6.3 Operation instance initialization
    4.6.4 Operation state transition

CHAPTER 5   TRIGGER SUPERVISOR SYSTEM
5.1 INTRODUCTION
5.2 DESIGN GUIDELINES
    5.2.1 Homogeneous underlying infrastructure
    5.2.2 Hierarchical control system architecture
    5.2.3 Centralized monitoring, logging and start-up systems architecture
    5.2.4 Persistency infrastructure
        5.2.4.1 Centralized access
        5.2.4.2 Common monitoring and logging databases
        5.2.4.3 Centralized maintenance
    5.2.5 Always on system
5.3 SUB-SYSTEM INTEGRATION
    5.3.1 Building blocks
        5.3.1.1 The TS node
        5.3.1.2 Common services
            5.3.1.2.1 Logging collector
            5.3.1.2.2 Tstore
            5.3.1.2.3 Monitor collector
            5.3.1.2.4 Mstore
    5.3.2 Integration
        5.3.2.1 Integration parameters
            5.3.2.1.1 OSWI parameters
            5.3.2.1.2 Hardware setup parameters
        5.3.2.2 Integration cases
            5.3.2.2.1 Cathode Strip Chamber Track Finder
            5.3.2.2.2 Global Trigger and Global Muon Trigger
            5.3.2.2.3 Drift Tube Track Finder
            5.3.2.2.4 Resistive Plate Chamber
            5.3.2.2.5 Global Calorimeter Trigger
            5.3.2.2.6 Hadronic Calorimeter
            5.3.2.2.7 Trigger, Timing and Control System
            5.3.2.2.8 Luminosity Monitoring System
            5.3.2.2.9 Central cell
        5.3.2.3 Integration summary
5.4 SYSTEM INTEGRATION
    5.4.1 Control system
    5.4.2 Monitoring system
    5.4.3 Logging system
    5.4.4 Start-up system
5.5 SERVICES DEVELOPMENT PROCESS

CHAPTER 6   TRIGGER SUPERVISOR SERVICES
6.1 INTRODUCTION
6.2 CONFIGURATION
    6.2.1 Description
    6.2.2 Implementation
        6.2.2.1 Central cell
        6.2.2.2 Trigger sub-systems
        6.2.2.3 Global Trigger
            6.2.2.3.1 Command interface
            6.2.2.3.2 Configuration operation and database
        6.2.2.4 Sub-detector cells
        6.2.2.5 Luminosity monitoring system
    6.2.3 Integration with the Run Control and Monitoring System
6.3 INTERCONNECTION TEST
    6.3.1 Description
    6.3.2 Implementation
        6.3.2.1 Central cell
        6.3.2.2 Sub-system cells
6.4 MONITORING
    6.4.1 Description
6.5 GRAPHICAL USER INTERFACES
    6.5.1 Global Trigger control panel

CHAPTER 7   HOMOGENEOUS SUPERVISOR AND CONTROL SOFTWARE INFRASTRUCTURE FOR THE CMS EXPERIMENT AT SLHC
7.1 INTRODUCTION
7.2 TECHNOLOGY BASELINE
7.3 ROAD MAP
7.4 SCHEDULE AND RESOURCE ESTIMATES

CHAPTER 8   SUMMARY AND CONCLUSIONS
8.1 CONTRIBUTIONS TO THE CMS GENETIC BASE
    8.1.1 XSEQ
    8.1.2 Trigger Supervisor
    8.1.3 Trigger Supervisor framework
    8.1.4 Trigger Supervisor system
    8.1.5 Trigger Supervisor services
    8.1.6 Trigger Supervisor Continuation
8.2 CONTRIBUTION TO THE CMS BODY
8.3 FINAL REMARKS

APPENDIX A   TRIGGER SUPERVISOR SOAP API
A.1 INTRODUCTION
A.2 REQUIREMENTS
A.3 SOAP API
    A.3.1 Protocol
    A.3.2 Request message
    A.3.3 Reply message
    A.3.4 Cell command remote API
    A.3.5 Cell Operation remote API
        A.3.5.1 OpInit
        A.3.5.2 OpSendCommand
        A.3.5.3 OpReset
        A.3.5.4 OpGetState
        A.3.5.5 OpKill

ACKNOWLEDGEMENTS
REFERENCES
Acronyms

ACM       Access Control Module
AJAX      Asynchronous JavaScript and XML
ALICE     A Large Ion Collider Experiment
API       Application Program Interface
ATLAS     A Toroidal LHC Apparatus
aTTS      Asynchronous Trigger Throttle System
BX        Bunch crossing
BU        Builder Unit
CCC       Central Crate Cell
CCI       Control Cell Interface
CERN      Conseil Européen pour la Recherche Nucléaire
CGI       Common Gateway Interface
CKC       ClocK crate Cell
CMS       Compact Muon Solenoid
CSC       Cathode Strip Chamber
CSCTF     Cathode Strip Chamber Track Finder
CVS       Concurrent Versions System
DAQ       Data Acquisition
DCC       DTTF Central Cell
DCS       Detector Control System
DB        DataBase
DBWG      CMS DataBase Working Group
DIM       Distributed Information Management System
DOM       Document Object Model
DT        Drift Tube
DTSC      Drift Tube Sector Collector
DTTF      Drift Tube Track Finder
ECAL      Electromagnetic CALorimeter
ECS       Experiment Control System
ERM       Error Manager
EVM       EVent Manager
FDL       Final Decision Logic
FED       Front-end Device
FLFM      First Level Function Manager
FM        Function Manager
FPGA      Field Programmable Gate Array
FRL       Front-end Readout Link board
FSM       Finite State Machine
FTE       Full Time Equivalent
FU        Filter Unit
GCT       Global Calorimeter Trigger
GMT       Global Muon Trigger
GT        Global Trigger
GTFE      Global Trigger Front-end
GTL       Global Trigger Logic
GUI       Graphical User Interface
HAL       Hardware Access Library
HCAL      Hadronic CALorimeter
HF        Forward Hadronic calorimeter
HLT       High Level Trigger
HTML      HyperText Markup Language
HTTP      HyperText Transfer Protocol
HEP       High Energy Physics
HW        HardWare
I2O       Intelligent Input/Output
JSP       Java Server Pages
LEP       Large Electron Positron collider
LHC       Large Hadron Collider
LHCb      Large Hadron Collider beauty experiment
LMS       Luminosity Monitoring System
LMSS      Luminosity Monitoring Software System
LUT       Look Up Table
L1        Level-1
L1A       Level-1 Accept signal
ORCA      Object Oriented Reconstruction for CMS Analysis
OSWI      Online SoftWare Infrastructure
PCI       Peripheral Component Interconnect bus standard
PSB       Pipeline Synchronizing Buffer
PSI       PVSS SOAP Interface
PVSS      ProzessVisualisierungs- und SteuerungsSystem
RC        Run Control
RCM       Response Control Module
RCMS      Run Control and Monitoring System
RCT       Regional Calorimeter Trigger
RF2TTC    TTC machine interface
RPC       Resistive Plate Chamber; also Remote Procedure Call
RU        Readout Unit
SRM       Shared Resources Manager
SW        SoftWare
SCADA     Supervisory Control And Data Acquisition
SDRAM     Synchronous Dynamic Random Access Memory
SEC       Service Entry Cell
SLHC      Super LHC
SLOC      Source Lines Of Code
SOAP      Simple Object Access Protocol
SRM       Shared Resource Module
SSCS      Sub-detectors Supervisory and Control Systems
sTTS      Synchronous Trigger Throttle System
TCS       Trigger Control System
TFC       Track Finder Cell
TIM       TIMing module
TOTEM     TOTal cross section, Elastic scattering and diffraction dissociation at the LHC
TPG       Trigger Primitive Generator (HF, HCAL, ECAL, RPC, CSC and DT)
TriDAS    Trigger and Data Acquisition System
TS        Trigger Supervisor
TSCS      Trigger Supervisor Control System
TSMS      Trigger Supervisor Monitoring System
TSLS      Trigger Supervisor Logging System
TSM       Task Scheduler Module
TSSS      Trigger Supervisor Start-up System
TTC       Timing, Trigger and Control System
TTCci     CMS version of the TTC VME interface module
TTCrx     A Timing, Trigger and Control Receiver ASIC for LHC Detectors
TTS       Trigger Throttle System
UA1       Underground Area 1 experiment
UDP       User Datagram Protocol
UML       Unified Modeling Language
URL       Uniform Resource Locator
VME       Versa Module Europa bus standard
WSDL      Web Service Description Language
W3C       World Wide Web Consortium
XDAQ      Cross-platform DAQ framework
XML       eXtensible Markup Language
XPath     XML Path language
XSD       XML Schema Document
XSEQ      Cross-platform SEQuencer
Chapter 1

Introduction

1.1 CERN and the Large Hadron Collider

At CERN, the European laboratory for particle physics, the fundamental structure of matter is studied using particle accelerators. The acronym CERN comes from the earlier French title “Conseil Européen pour la Recherche Nucléaire”. CERN is located on the Franco-Swiss border west of Geneva. It was founded in 1954 and is currently funded by 20 European countries. CERN employs just under 3000 people, only a fraction of whom are particle physicists. This reflects the role of CERN: it does not so much perform particle physics itself as offer its research facilities to particle physicists in Europe and, increasingly, in the whole world. About half of the world’s particle physicists, some 6500 researchers from over 500 universities and institutes in some 80 countries, use CERN’s facilities.

The latest of these facilities, designed and currently being built at CERN, is the Large Hadron Collider (LHC) [1]. It is contained in a 26.7 km circumference tunnel located underground at a depth ranging from 50 to 150 meters (Figure 1-1). The tunnel was formerly used for the Large Electron Positron (LEP) collider. The LHC consists of a superconducting magnet system with two beam channels designed to bring two proton beams into collision at a centre-of-mass energy of 14 TeV. It will also be able to provide collisions of heavy nuclei (Pb-Pb) at a centre-of-mass energy of 2.76 TeV per nucleon.

When the two counter-rotating proton bunches cross, protons within the bunches can collide, producing new particles in inelastic interactions. Such inelastic interactions are also referred to as “events”. The probability for such inelastic collisions to take place is determined by the cross section for proton-proton interactions and by the density and frequency of the proton bunches. The related quantity, which is a characteristic of the collider, is called the luminosity. The design luminosity of the LHC is 10³⁴ cm⁻²s⁻¹. The proton-proton inelastic cross section σinel depends on the proton energy. At the LHC centre-of-mass energy of 14 TeV, σinel is expected to be 70 mb (70·10⁻²⁷ cm²). Therefore, the number of inelastic interactions per second (event rate) is the product of the cross section and the luminosity: Ninel = σinel·L = 7·10⁸ s⁻¹. As the bunch crossing rate is 40 MHz, and bearing in mind that during normal operation of the LHC not all bunches are filled (only 2808 out of 3564), the average number of events per bunch crossing can be calculated as 7·10⁸ · 25·10⁻⁹ · 3564/2808 ≈ 22. The main LHC functional parameters that are most important from the experimental point of view are reported in Table 1-1.

At the energy scale and raw data rates of the LHC, the design of the detectors faces a number of new implementation challenges. LHC detectors must be capable of isolating and reconstructing the interesting events, as only a few events out of the 40 million bunch crossings per second can be recorded. Another technical challenge is the extremely hostile radiation environment.
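For reference, the event-rate arithmetic quoted above can be written out explicitly; the following lines simply restate the numbers given in the text.

\[
N_{\mathrm{inel}} = \sigma_{\mathrm{inel}} \cdot L = \left(70\times10^{-27}\,\mathrm{cm^{2}}\right)\left(10^{34}\,\mathrm{cm^{-2}s^{-1}}\right) = 7\times10^{8}\,\mathrm{s^{-1}},
\]
\[
\langle n_{\mathrm{events/BX}} \rangle = \frac{N_{\mathrm{inel}}}{40\,\mathrm{MHz}\times 2808/3564} = 7\times10^{8}\,\mathrm{s^{-1}} \times 25\times10^{-9}\,\mathrm{s} \times \frac{3564}{2808} \approx 22 .
\]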
Figure 1-1: Schematic illustration of the LHC ring with the four experimental points.
Design luminosity (L)                          10³⁴ cm⁻²s⁻¹
Bunch crossing (BX) rate                       40 MHz
Number of bunches per orbit                    3564
Number of filled bunches per orbit             2808
Average number of events per bunch crossing    22

Table 1-1: Main LHC functional parameters that are most important from the experimental point of view.

There are four collision points spread over the LHC ring which house the main LHC experiments. The two largest, the Compact Muon Solenoid (CMS, [2]) and A Toroidal LHC ApparatuS (ATLAS, [3]), are general-purpose experiments that take different approaches, in particular to the detection of muons. CMS is built around a very-high-field solenoid magnet; its relative compactness derives from the fact that a massive iron return yoke allows the muons to be detected by their bending over a relatively short distance in a very high magnetic field. The ATLAS experiment is substantially bigger and essentially relies upon an air-cored toroidal magnet system for the measurement of muons. Two more special-purpose experiments have been approved to start operation at the switch-on of the LHC machine: A Large Ion Collider Experiment (ALICE, [4]) and the Large Hadron Collider beauty experiment (LHCb, [5]). ALICE is a dedicated heavy-ion detector that will exploit the unique physics potential of nucleus-nucleus interactions at LHC energies, while the LHCb detector is dedicated to the study of CP violation and other rare phenomena in the decays of beauty particles.
1.2 The Compact Muon Solenoid detector

The CMS detector is a general-purpose, quasi-hermetic detector. This kind of particle detector is designed to observe all possible decay products of an interaction between subatomic particles in a collider, by covering as large an area around the interaction point as possible and by incorporating multiple types of sub-detectors. CMS is called “hermetic” because it is designed to let as few particles as possible escape.

There are three main components of a particle physics collider detector. From the inside out, the first is a tracker, which measures the momenta of charged particles as they curve in a magnetic field. Next come the calorimeters, which measure the energy of most charged and neutral particles by absorbing them in dense material, and finally a muon system, which measures the only type of charged particle that is not stopped in the calorimeters and can still be detected.

The concept of the CMS detector was based on the requirement of having a very good muon system whilst keeping the detector dimensions compact. In this case, only a strong magnetic field would guarantee good momentum resolution for high-momentum muons. Studies showed that the required magnetic field could be generated by a superconducting solenoid. It is also a particularity of CMS that the solenoid surrounds the calorimeter detectors. Figure 1-2 shows a schematic drawing of the CMS detector and its components, which are described in detail in the subsequent sections. Figure 1-3 shows a transverse slice of the detector, with the trajectories of different kinds of particles and the traces they leave in the different components.

The coordinate system adopted by CMS has its origin at the nominal collision point inside the experiment, the y-axis pointing vertically upward and the x-axis pointing radially inward toward the center of the LHC. Thus, the z-axis points along the beam direction toward the Jura mountains from LHC Point 5. The azimuthal angle (φ) is measured from the x-axis in the x-y plane, and the polar angle (θ) is measured from the z-axis. Pseudorapidity is defined as η = -ln tan(θ/2). The momentum and energy measured transverse to the beam direction, denoted by pT and ET, respectively, are computed from the x and y components.
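For completeness, the kinematic quantities referred to above can be written explicitly. The first two expressions follow directly from the definitions in the text, while the form given for the transverse energy is the usual convention and is stated here as an assumption rather than quoted from the thesis:

\[
\eta = -\ln\tan\frac{\theta}{2}, \qquad p_T = \sqrt{p_x^{2}+p_y^{2}}, \qquad E_T = E\sin\theta .
\]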
Figure 1-2: Drawing of the complete CMS detector, showing both the scale and complexity.
Figure 1-3: Slice through CMS showing particles incident on the different sub-detectors.

Tracker

The tracking system [6] records the helix traced by a charged particle that curves in a magnetic field, by localizing it in space in finely segmented layers of silicon detecting material. The degree to which the particle curves is inversely proportional to its momentum perpendicular to the beam, while the degree to which it drifts in the direction of the beam axis gives its momentum in that direction.

Calorimeters

The calorimeter system is installed inside the coil. It slows particles down and absorbs their energy, allowing that energy to be measured. The calorimeter is divided into two types: the Electromagnetic Calorimeter (ECAL, [7]), made of lead tungstate (PbWO4) crystals, absorbs particles that interact electromagnetically, producing electron/positron pairs and bremsstrahlung¹; the Hadronic Calorimeter (HCAL, [8]), made of interleaved copper absorber and plastic scintillator plates, detects hadrons, which interact via the strong nuclear force.

Muon system

Of all the known stable particles, only muons and neutrinos pass through the calorimeter without losing most or all of their energy. Neutrinos are undetectable and their existence must be inferred, but muons, which are charged, can be measured by an additional tracking system outside the calorimeters. A redundant and precise muon system was one of the first requirements of CMS [9]. The ability to trigger on and reconstruct muons, which provide an unmistakable signature for a large number of the new physics processes CMS is designed to explore, is central to the concept. The muon system consists of three technologically different components: Resistive Plate Chambers (RPC), Drift Tubes (DT) and Cathode Strip Chambers (CSC).
¹ Bremsstrahlung is electromagnetic radiation produced by the deceleration of a charged particle, such as an electron, when deflected by another charged particle, such as an atomic nucleus.
The muon system of CMS is embedded in the iron return yoke of the magnet. It makes use of the bending of muons in the magnetic field for transverse momentum measurements of muon tracks identified in association with the tracker. The large thickness of absorber material in the return yoke helps to filter out hadrons, so that muons are practically the only particles apart from neutrinos able to escape from the calorimeter system. The muon system consists of 4 stations of muon chambers in the barrel region (Figure 1-3 shows how the 4 stations correspond to 4 layers of muon chambers) and disks in the forward region.
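Both the tracker and the muon system infer transverse momentum from the curvature of tracks in the magnetic field. The standard relation behind this statement is not quoted in the thesis and is added here only for reference; for a singly charged particle,

\[
p_T\,[\mathrm{GeV}/c] \approx 0.3\, B\,[\mathrm{T}]\; r\,[\mathrm{m}],
\]

where B is the magnetic field and r the radius of curvature, so a stronger field bends a track of given pT more strongly and thus improves the momentum resolution.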
1.3 The Trigger and DAQ system

1.3.1 Overview

The CMS Trigger and Data Acquisition (DAQ) system is designed to collect and analyze the detector information at the LHC bunch crossing frequency of 40 MHz. The rate of events to be recorded for offline processing and analysis is of the order of 100 Hz. At the design luminosity of 10³⁴ cm⁻²s⁻¹, the LHC rate of proton collisions will be around 22 per bunch crossing, producing approximately 1 MB of zero-suppressed data² in the CMS readout system. The Level-1 (L1) trigger is designed to reduce the incoming data rate to a maximum of 100 kHz, by processing fast trigger information coming from the calorimeters and the muon chambers and selecting events with interesting signatures. The DAQ system must therefore sustain a maximum input rate of 100 kHz, corresponding to an average data flow of 100 GB/s from about 650 data sources, and must provide enough computing power for the High Level Trigger (HLT), a software trigger that reduces the rate of stored events by a factor of 1000. In CMS, all events that pass the Level-1 trigger are sent to a computer farm (the Event Filter) that performs physics selections, using the offline reconstruction software, to filter events and achieve the required output rate. The design of the CMS Data Acquisition system and of the High Level Trigger is described in detail in the Technical Design Report [10]. The architecture of the CMS Trigger and DAQ system is shown schematically in Figure 1-4.
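The throughput figures quoted above are mutually consistent, as the following back-of-the-envelope check (merely restating the numbers in the text) shows:

\[
100\,\mathrm{kHz} \times 1\,\mathrm{MB/event} \approx 100\,\mathrm{GB/s}, \qquad \frac{100\,\mathrm{kHz}}{1000} = 100\,\mathrm{Hz}.
\]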
Figure 1-4: Overview of the CMS Trigger and DAQ system architecture.
1.3.2 The Level-1 trigger decision loop

The L1 trigger [11] is custom pipelined hardware logic intended to analyze the bunch crossing data every 25 ns without deadtime, using special coarsely segmented trigger data from the muon systems and the calorimeters. The L1 trigger reduces the rate of accepted crossings to below 100 kHz.

The L1 trigger has local, regional and global components. At the bottom end, the Local Triggers, also called Trigger Primitive Generators (TPG), are based on energy deposits in calorimeter trigger towers³ and on track segments or hit patterns in muon chambers, respectively. Regional Triggers combine their information and use pattern logic to determine ranked and sorted trigger objects, such as electron or muon candidates, in limited spatial regions. The rank is determined as a function of energy or momentum and of quality, which reflects the level of confidence attributed to the L1 trigger parameter measurements, based on detailed knowledge of the detectors and trigger electronics and on the amount of information available. The Global Calorimeter and Global Muon Triggers determine the highest-rank calorimeter and muon objects across the entire experiment and transfer them to the Global Trigger, the top entity of the L1 trigger hierarchy. While the L1 trigger is taking its decision, the full high-precision data of all detector channels are stored in analog or digital buffers, which are only read out if the event is accepted. The L1 decision loop takes 3.2 μs, or 128 bunch crossings, which corresponds to the depth of the front-end buffers. The Level-1 Accept (L1A) decision is communicated to the sub-detectors through the Timing, Trigger and Control (TTC) system. Figure 1-5 shows a diagram of the L1 decision loop.

² Zero suppression consists of eliminating leading zeros. This encoding is performed by the on-detector readout electronics to reduce the data volume.
³ Each trigger tower identifies a detector region with an approximate (η,φ)-coverage of 0.087 × 0.087 rad.

Figure 1-5: The Level-1 trigger decision loop.
1.3.2.1 Calorimeter Trigger

The first stage of the Calorimeter Trigger pipeline consists of the TPGs. For triggering purposes the calorimeters are subdivided into trigger towers. The TPGs sum the transverse energies measured in the ECAL crystals or HCAL readout towers to obtain the trigger tower ET and attach the correct bunch crossing number. The TPG electronics is integrated with the calorimeter readout. The trigger primitives are transmitted through high-speed serial links to the Regional Calorimeter Trigger (RCT, [12]), which determines candidates for electrons or photons, jets and isolated hadrons, and calculates energy sums in calorimeter regions of 4 × 4 trigger towers. These objects are forwarded to the Global Calorimeter Trigger (GCT, [13]), which sends the best four objects of each category to the Global Trigger.
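Schematically, the quantities formed in this pipeline are simple transverse-energy sums; the following expressions are an illustrative restatement of the text, not formulas taken from the thesis:

\[
E_T^{\mathrm{tower}} = \sum_{i\,\in\,\mathrm{tower}} E_T^{(i)}, \qquad E_T^{\mathrm{region}} = \sum_{4\times4\ \mathrm{towers}} E_T^{\mathrm{tower}},
\]

where the first sum runs over the ECAL crystals or HCAL readout towers belonging to a trigger tower.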
1.3.2.2 Muon Trigger

All three components of the muon system (DT, CSC and RPC) take part in the trigger. The barrel DT chambers provide local trigger information in the form of track segments in the φ-projection and hit patterns in the η-projection. The endcap CSCs deliver 3-dimensional track segments. All chamber types also identify the bunch crossing of the corresponding event. The Regional Muon Trigger joins the segments into complete tracks and assigns physical parameters to them. It consists of the DT Sector Collector (DTSC, [14]), the DT Track Finders (DTTF, [15]) and the CSC Track Finders (CSCTF, [16]). In addition, the RPC trigger chambers, which have excellent timing resolution, deliver their own track candidates based on regional hit patterns. The Global Muon Trigger (GMT, [17]) then combines the information from the three sub-detectors, achieving an improved momentum resolution and efficiency compared to the stand-alone systems.
1.3.2.3 Global Trigger

The Global Trigger (GT, [18]) takes the decision to accept an event for further evaluation by the HLT, based on trigger objects delivered by the GCT and the GMT. The GT has five basic stages: input, logic, decision, distribution and readout. Three Pipeline Synchronizing Buffer (PSB) input boards receive the calorimeter trigger objects from the GCT and align them in time. The muons are received from the GMT through the backplane. An additional PSB board can receive direct trigger signals from sub-detectors or from the TOTEM experiment [19] for special purposes such as calibration; these signals are called “technical triggers”.

The core of the GT is the Global Trigger Logic (GTL) board, in which the algorithm calculations are performed. The most basic algorithms consist of applying pT or ET thresholds to single objects, or of requiring the jet multiplicities to exceed defined values. Since location and quality information is available, more complex algorithms based on topological conditions can also be programmed into the logic. The number of algorithms that can be executed in parallel is 128, and up to 64 technical trigger bits may in addition be received directly from a dedicated PSB board. The set of algorithm calculations performed in parallel is called a “trigger menu”. The results of the algorithm calculations are sent to the Final Decision Logic (FDL) board in the form of one bit per algorithm. Up to eight final ORs can be applied, and correspondingly eight L1A signals can be issued. For normal physics data taking a single trigger mask is applied, and the L1A decision is taken accordingly; the remaining L1A signals are used for commissioning, calibration and tests of individual sub-systems⁴. The distribution of the L1A decision to the sub-systems is performed by two L1A OUT output boards, provided that it is authorized by the Trigger Control System described in Section 1.3.2.4. A TIMing module (TIM) is also necessary to receive the LHC machine clock and distribute it to the boards. Finally, the Global Trigger Front-end (GTFE) board sends the GT data record to the DAQ Event Manager (EVM, Section 1.4.3), located in the surface control room; the record consists of the GPS event time received from the machine, the total L1A count, the bunch crossing number (in the range from 1 to 3564), the orbit number, the event number for each TCS/DAQ partition, all FDL algorithm bits and other information.

⁴ The sub-system concept includes the sub-detectors and the Level-1 trigger sub-systems.
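The final-decision step described above (128 algorithm bits plus up to 64 technical trigger bits, masked per partition and combined by a final OR) can be illustrated with a short C++ sketch. This is only an illustration of the logic as described in the text; the names, types and data layout are invented here and do not correspond to the actual Global Trigger firmware or online software.

#include <array>
#include <bitset>
#include <cstddef>

// Illustrative sketch: one result bit per algorithm (128) plus technical
// trigger bits (64) enter the Final Decision Logic; a per-partition mask
// selects which bits participate in the final OR, yielding up to eight
// independent L1A signals (one per TCS/DAQ partition).
constexpr std::size_t kDecisionBits = 128 + 64;
constexpr std::size_t kPartitions   = 8;

using DecisionWord = std::bitset<kDecisionBits>;

std::array<bool, kPartitions>
finalDecision(const DecisionWord& results,
              const std::array<DecisionWord, kPartitions>& masks)
{
    std::array<bool, kPartitions> l1a{};
    for (std::size_t p = 0; p < kPartitions; ++p) {
        l1a[p] = (results & masks[p]).any();  // final OR of the unmasked bits
    }
    return l1a;
}

During normal physics data taking only a single mask (a single TCS partition) is active, which matches the single-trigger-mask operation described above.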
1.3.2.4 Timing, Trigger and Control System The Timing, Trigger and Control (TTC) system provides for the distribution of the L1A and fast control signals (e.g. synchronization and reset commands, and test and calibration triggers) to the detector front-ends, depending on the status of the sub-detector readout systems and the data acquisition. The status is derived from signals provided by the Trigger Throttle System (TTS). The TTC system consists of the Trigger Control System (TCS, [20]) module and the Timing, Trigger and Control distribution network [21]. The TCS allows different sub-systems to be operated independently if required. For this purpose the experiment is subdivided into 32 partitions. A partition represents a major component of a sub-system. Each partition must be assigned to a partition group, also called a TCS partition. Within such a TCS partition all connected partitions operate concurrently. For commissioning and testing up to eight TCS partitions are available, which each receive their own L1A signals distributed in different time slots allocated by a priority scheme or in round-robin mode. During normal physics data taking there is only a single TCS partition.
4 The sub-system concept includes the sub-detectors and the Level-1 trigger sub-systems.
Sub-systems may either be operated centrally as members of a partition or privately through a Local Trigger Controller (LTC). Switching between central and local mode is performed by the TTCci (TTC CMS interface) module, which provides the interface between the respective trigger control module and the destinations for the transmission of the L1A signal and other fast commands for synchronization and control. At the destinations the TTC signals are received by TTC receivers (TTCrx). The TCS, which resides in the Global Trigger crate, is connected to the LHC machine through the TIM module, to the FDL through the GT backplane, and to 32 TTCci modules through the L1A OUT boards. The TTS, to which it is also connected, has a synchronous (sTTS) and an asynchronous branch (aTTS). The sTTS collects status information from the front-end electronics of 24 sub-detector partitions and up to eight tracker and preshower front-end buffer emulators5. The status signals, coded in four bits, denote the conditions “disconnected”, “overflow warning”, “synchronization loss”, “busy”, “ready” and “error”. The signals are generated by the Fast Merging Modules (FMM) through logical operations on up to 32 groups of four sTTS binary signals and are received by four conversion boards located in a 6U crate next to the GT central crate. The aTTS runs under control of the DAQ software and monitors the behavior of the readout and trigger electronics. It receives and sends status information concerning the eight DAQ partitions, which match the TCS partitions; this information is coded in a similar way as for the sTTS. Depending on the meaning of the status signals, different protocols are executed. For example, in case of a warning about resource usage due to excessive trigger rates, pre-scale factors may be applied in the FDL to the algorithms causing them. A loss of synchronization would initiate a reset procedure. General trigger rules for the minimal spacing of L1As are also implemented in the TCS. The total deadtime at the maximum L1 trigger output rate of 100 kHz is estimated to be below 1%. Deadtime and monitoring counters are provided by the TCS.
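A minimal sketch of how such a 4-bit status word could be decoded in software is given below. The state names are taken from the text above, but the numeric codes are placeholders and not the real TTS encoding.

```cpp
// Hypothetical decoding of the 4-bit sTTS status word.  The state names are
// taken from the text above; the numeric codes are placeholders and do not
// reproduce the real TTS encoding.
#include <cstdint>
#include <string>

enum class TTSState { Disconnected, OverflowWarning, SyncLost, Busy, Ready, Error };

TTSState decodeTTS(std::uint8_t word) {
    switch (word & 0x0F) {                 // only the low four bits are meaningful
        case 0x0: return TTSState::Disconnected;
        case 0x1: return TTSState::OverflowWarning;
        case 0x2: return TTSState::SyncLost;
        case 0x4: return TTSState::Busy;
        case 0x8: return TTSState::Ready;
        default:  return TTSState::Error;  // catch-all for error and illegal codes
    }
}

std::string toString(TTSState s) {
    switch (s) {
        case TTSState::Disconnected:    return "disconnected";
        case TTSState::OverflowWarning: return "overflow warning";
        case TTSState::SyncLost:        return "synchronization loss";
        case TTSState::Busy:            return "busy";
        case TTSState::Ready:           return "ready";
        default:                        return "error";
    }
}
```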
1.4 The CMS Experiment Control System The CMS Experiment Control System (ECS) is a complex distributed software system that manages the configuration, monitoring and operation of all equipment involved in the different activities of the experiment: Trigger and DAQ system, detector operations and the interaction with the outside world. This software system consists of the Run Control and Monitor System (RCMS), the Detector Control System (DCS), a distributed processing environment (XDAQ) and the sub-system Online SoftWare Infrastructure (OSWI). These components are described in the following sections.
1.4.1 Run Control and Monitoring System The Run Control and Monitoring System (RCMS) ([10], pp. 191-208; [22]) is one of the principal components of the ECS and the one that provides the interface to control the overall experiment in data taking operations. This software system configures and controls the online software of the DAQ components and the sub-detector control systems. The RCMS system has a hierarchical structure with eleven main branches, one per major sub-system, e.g. HCAL, the central DAQ or the L1 trigger. The basic element in the control tree is the Function Manager (FM). It consists of a finite state machine and a set of services. The state machine model has been standardized for the first level of FM’s in the control tree. These nodes are the interface to the sub-detector control software (Section 1.4.4). The RCMS system is implemented in the RCMS framework, which provides a uniform API to common tasks like storage in and retrieval from the process configuration database, state-machine models for process control, and access to the monitoring system. The framework also provides a set of services which are accessible to the FM’s. The services comprise a security service for authentication and user account management, a resource service for storing and delivering configuration information of online processes, access to remote processes via resource proxies, error handlers, a log message application to collect, store and distribute messages, and the “job control” to start, stop and monitor processes in a distributed environment.
5 Buffer emulator: Hardware system responsible for emulating the status of the front-end buffers and vetoing trigger decisions based on this status.
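The Function Manager pattern, a finite state machine plus a set of services, can be pictured with the following sketch. RCMS itself is written in Java; this C++ fragment only illustrates the pattern, and the state and input names are hypothetical rather than the standardized CMS state model.

```cpp
// Minimal sketch of a Function Manager state machine.  RCMS itself is written
// in Java; this C++ fragment only illustrates the pattern, and the state and
// input names are hypothetical rather than the standardized CMS state model.
#include <map>
#include <stdexcept>
#include <string>
#include <utility>

class FunctionManager {
public:
    FunctionManager() : state_("Initial") {
        // Transition table: (current state, input) -> next state.
        table_[{"Initial",    "configure"}] = "Configured";
        table_[{"Configured", "start"}]     = "Running";
        table_[{"Running",    "stop"}]      = "Configured";
        table_[{"Configured", "reset"}]     = "Initial";
    }

    // Fires a transition; a real FM would also call its services here
    // (resource service, log collector, job control, ...).
    void fire(const std::string& input) {
        auto it = table_.find({state_, input});
        if (it == table_.end())
            throw std::runtime_error("illegal input '" + input + "' in state " + state_);
        state_ = it->second;
    }

    const std::string& state() const { return state_; }

private:
    std::string state_;
    std::map<std::pair<std::string, std::string>, std::string> table_;
};
```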
The RCMS services are implemented in the programming language Java as web applications. The controller Graphical User Interface (GUI) is based on Java Server Pages technology (JSP, [23]). The eXtensible Markup Language (XML, [24]) data format and the Simple Object Access Protocol (SOAP, [25]) are used for inter-process communication. Finally, the job control is implemented in C++ using the XDAQ framework (Section 1.4.3).
1.4.2 Detector Control System The Detector Control System (DCS) ([10], pp. 209-222) is responsible for operating the auxiliary detector infrastructures: high and low voltage controls, cooling facilities, supervision of all gas and fluid sub-systems, control of all racks and crates, and the calibration systems. The DCS also plays a major role in the protection of the experiment from any adverse event. The DCS runs as a slave of the RCMS system during the data-taking process. Many of the functions provided by the DCS are needed at all times, and as a result the DCS must also function as the master outside data-taking periods. The DCS is organized in a hierarchy of nodes. The topmost point of the hierarchy offers global commands like “start” and “stop” for the entire detector. The commands are propagated towards the lower levels of the hierarchy, where the different levels interpret the commands received and translate them into the corresponding commands specific to the system they represent. As an example, a global “start” command is translated into a “HV ramp-up” command for a sub-detector. Correspondingly, a summary of the lower-level states defines the state of the upper levels. As an example, the state “HV on” of a sub-detector is summarized as “running” in the global state. The propagation of commands ends at the lowest level at the “devices”, which are representations of the actual hardware. The commercial Supervisory Control And Data Acquisition (SCADA) system PVSS II [26] was chosen by all LHC experiments as the supervisory system of the corresponding DCS systems. PVSS II is a development environment for SCADA systems which offers many of the basic functionalities needed to fulfill the tasks mentioned above.
1.4.3 Cross-platform DAQ framework The XDAQ framework ([10], pp. 173-190; [27]) is a domain-specific middleware6 designed for high energy physics data acquisition systems [28]. The framework includes a collection of generic components to be used in various application scenarios and specific environments with a limited customization effort. One of them is the event builder (EVB, [29]), which consists of three collaborating components: a Readout Unit (RU), a Builder Unit (BU) and an EVent Manager (EVM). The logical components and interconnects of the event builder are shown schematically in Figure 1-6. An event enters the system as a set of fragments distributed over the Front-end Devices (FED’s). It is the task of the EVB to collect the fragments of an event, assemble them and send the full event to a single processing unit. To this end, a builder network connects ~500 Readout Units (RU’s) to ~500 Builder Units (BU’s). The event data is read out by sub-detector specific hardware devices and forwarded to the Readout Units. The RU’s temporarily store the event fragments until the reception of a control message requesting a specific event fragment to be forwarded to a Builder Unit. A Builder Unit collects the event fragments belonging to a single collision event from all RU’s and combines them into a complete event. The BU exposes an interface to event data processors, called the Filter Units (FU). This interface can be used to make event data persistent or to apply event-filtering algorithms. The EVM interfaces to the L1 trigger readout electronics and controls the event building process by mediating control messages between RU’s and BU’s. All components of the DAQ, i.e. the Event Managers (8), Readout Units (~500), Builder Units (~4000) and Filter Units (~4000), are supervised by the RCMS system.
6 Middleware is a software framework intended to facilitate the connection of other software components or applications. It consists of a set of services that allow multiple processes running on one or more machines to interact across a network.
Figure 1-6: Logical components and interconnects of the event builder.
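The event-building flow depicted in Figure 1-6 can be summarized in a toy model: a Builder Unit collects the fragments of a given event from every Readout Unit and, once complete, assembles them into a full event for a Filter Unit. This is only an illustration of the data flow; it is not the XDAQ event builder API, and every name is invented for the example.

```cpp
// Toy model of the event-building step depicted in Figure 1-6: a Builder Unit
// collects the fragments of one event from every Readout Unit before handing
// the complete event to a Filter Unit.  This is not the XDAQ event builder
// API; every name is invented for the example.
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

using Fragment = std::vector<std::uint8_t>;

class BuilderUnit {
public:
    explicit BuilderUnit(std::size_t nReadoutUnits) : nRUs_(nReadoutUnits) {}

    // Called when a fragment of event 'eventId' arrives from readout unit 'ru'.
    // Returns true once fragments from all RUs have been received.
    bool addFragment(std::uint64_t eventId, std::size_t ru, Fragment f) {
        auto& frags = pending_[eventId];
        if (frags.empty()) frags.resize(nRUs_);
        frags[ru] = std::move(f);
        return ++counts_[eventId] == nRUs_;
    }

    // Concatenates all fragments into one full event and frees the buffers.
    Fragment buildEvent(std::uint64_t eventId) {
        Fragment event;
        for (auto& f : pending_[eventId])
            event.insert(event.end(), f.begin(), f.end());
        pending_.erase(eventId);
        counts_.erase(eventId);
        return event;
    }

private:
    std::size_t nRUs_;
    std::map<std::uint64_t, std::vector<Fragment>> pending_;
    std::map<std::uint64_t, std::size_t> counts_;
};
```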
1.4.4 Sub-system Online Software Infrastructure In addition to the sub-system DCS sub-tree and the Readout Units tailored to fit the specific front-end requirements, the sub-system Online SoftWare Infrastructure (OSWI) consists of Linux device drivers, C++ APIs to control the hardware at a functional level, scripts to automate testing and configuration sequences, standalone graphical setups and web-based interfaces to remotely operate the sub-system hardware. Graphical setups were developed using a broad spectrum of technologies: the Java programming language [30], C++ with the Qt library [31], or the Python scripting language [32]. Web-based applications were also developed, with the Java programming language and the Tomcat server [33] or with C++ and the XDAQ middleware. Most of the sub-detectors implemented their supervisory and control systems with C++ and the XDAQ middleware. These distributed systems are mainly intended to download and upload parameters to and from the front-end electronics. The sub-detector control systems also expose a SOAP API in order to integrate with the RCMS.
1.4.5 Architecture Figure 1-7 shows the architecture of the CMS Experiment Control System, which integrates the online software systems presented in Sections 1.4.1 to 1.4.4. Up to eight instances of the RCMS, or RCMS sessions, can exist concurrently. Each of them operates a subset of the CMS sub-detectors. An RCMS session consists of a central Function Manager (FM) that coordinates the operation of the sub-system FMs involved in the session. An RCMS session normally involves a number of sub-detectors, DAQ components and the L1 trigger. The sub-detector FM operates the sub-detector supervisory and control systems, which in turn configure the sub-detector front-end electronics. The DAQ FM configures and controls the DAQ software and hardware
components in order to set up a distributed system able to read out the event fragments from the sub-detectors, and to build, filter and record the most promising events. Finally, the L1 trigger FM drives the configuration of the L1 decision loop. The L1 trigger generates L1As that are distributed to the 32 sub-detector partitions according to the configuration of the TTC system. Up to eight exclusive subsets of the sub-detector partitions, or DAQ partitions, can be handled independently by the TTC system. Each RCMS session controls the configuration of one DAQ partition. Therefore, the L1 decision loop is a shared infrastructure among the different sessions. A software facility to control it must be able to serve up to eight RCMS sessions concurrently while avoiding inconsistent configuration operations among sessions. The design of the L1 decision loop hardware management system is the main subject of this PhD thesis.
Figure 1-7: Architecture of the CMS Experiment Control System.
1.5 Research program 1.5.1 Motivation The design and development of a software system to operate DAQ hardware devices includes the definition of sequences containing read, write, test and exception handling operations for initialization and parameterization purposes. These sequences, for instance, are responsible for downloading firmware code and for setting tunable parameters like threshold values or parameters to compensate for the accrued radiation damage. Mechanisms to execute tests on hardware devices and to detect and diagnose faults are also needed. However, choosing a programming language, reading the hardware application notes and defining configuration, testing and monitoring sequences is not enough to deal with the complexity of the latest generation of HEP experiments. The unprecedented number of hardware items, the long periods of preparation and operation, and last but not least the human context, are three complexity dimensions that need to be added to the conceptual design process.
Number: Fabjan and Fischer [34] have observed that the availability of ever-increasing sophistication, reliability and convenience in data handling instrumentation has led inexorably to detector systems of increased complexity. CMS and ATLAS are the greatest exponents of this rising complexity. The progression in channel numbers, event rates, bunch crossing rates, event sizes, and data rates in three well-known experiments belonging to
the decades of the 1980s (UA1), the 1990s (H1) and the 2000s (CMS) is shown in Table 1-2. The huge number of channels, the highly configurable DHI based on FPGAs and the distributed nature of this hardware system were unprecedented requirements to cope with during the conceptual design.

Experiment                      UA1      H1       CMS
Tracking [channels]             10^4     10^4     10^8
Calorimeter [channels]          10^4     5·10^4   6·10^5
Muons [channels]                10^4     2·10^5   10^6
Bunch crossing interval [ns]    3400     96       25
Raw data rate [bit·s^-1]        10^9     10^11    4·10^15
Tape write rate [Hz]            10       10       100
Mean event size [byte]          100k     125k     1M
Table 1-2: Data acquisition parameters for UA1 (1982), H1 (1992) and CMS [35].
Time: The preparation and operation of HEP experiments typically spans a period of many years (e.g. 1992, CMS Letter of Intent [36]). During this time the hardware and software environments evolve. Throughout all phases, integrators have to deal with system modifications [28]. In such a heterogeneous and evolving environment, a considerable development effort is required to design and implement new interfaces, synchronize and integrate them with all other sub-systems, and support the configuration and control of all parts. The long operational phases also influence the discussion about the convenience of using commercial components rather than in-house solutions. There is simply not enough manpower to build all components in-house. However, the use of commercial components carries a number of risks. First, a selected component may turn out to have insufficient performance or scalability, or simply too many bugs to be usable; significant manpower is therefore spent on selecting and validating components. Another significant risk is that the running time of the CMS experiment, at least 15 years starting from 2008, is much longer than the lifetime of most commercial software products [37].
Human: Despite the necessary and highly hierarchical structure of a collaboration of more than 2000 people, different sub-systems might implement solutions based on heterogeneous platforms and interfaces. Therefore, the design of a hardware management system should maximize the range of technologies that can be integrated. A second aspect of the human context that should guide the system design is that only some of the software project members are computing professionals: most are trained as physicists, and they often work only part-time on software.
1.5.2 Goals This research work, carried out in the context of the Trigger and Data Acquisition (TriDAS) project of the CMS experiment at the Large Hadron Collider, proposes web-based technological solutions to simplify the implementation and operation of software control systems to manage hardware devices for high energy physics experiments. The main subject of this work is the design and development of the Trigger Supervisor, a hardware management system that enables the integration and operation of the Level-1 trigger decision loop of the CMS experiment. An initial investigation into the usage of the eXtensible Markup Language (XML) as a uniform data representation format for a software environment to implement hardware management systems for HEP experiments was also performed.
Chapter 2 Uniform Management of Data Acquisition Devices with XML 2.1 Introduction In this chapter, a novel software environment model, based on web technologies, is presented. This research was carried out in the context of the CMS TriDAS project in order to better understand the difficulties of building a hardware management system for the L1 decision loop. This research was motivated by the unprecedented complexity in the construction of hardware management systems for HEP experiments. The proposed model is based on the idea that a uniform approach to manage the diverse interfaces and operations of the data acquisition devices would simplify the development of a configuration and control system and should save development time. A uniform scheme would be advantageous for large installations, like those found in HEP experiments [2][3][4][5][38] due to the diversity of front-end electronic modules, in terms of configuration, functionality and multiplicity (e.g. Section 1.3).
2.2 Key requirements This chapter proposes to work toward an environment to define hardware devices and their behavior at a logical level. The approach should facilitate the integration of various hardware sub-systems. The design should at least fulfill the following key requirements.
Standardization: The running time of the CMS experiment is expected to be at least 15 years, a much longer period than the lifetime of most commercial software products. To cope with this, the environment should maximize the usage of standard technologies. For instance, we believe that standard C++ with its standard libraries and XML-based technologies will still be used 10 years from now.
Extensibility: A mechanism to define new commands and data for a given interface must exist, without the need to change either control or controlled systems that are not concerned by the modification.
Platform independence: The specification of commands and configuration parameters must not impose a specific format of a particular operating system or hardware platform.
Communication technology independence: Hardware devices are hosted by different sub-systems that expose different capabilities and communication technologies. Choosing the technology that is most suitable for a certain platform must not require an overall system modification.
Performance: The additional benefits of any new infrastructure should not imply a loss of execution performance compared to similar solutions which are established in the HEP community.
2.3 A uniform approach for hardware configuration control and testing Taking into account the above requirements, we present a model for the configuration, control and testing interface of data acquisition hardware devices [39]. The model, shown in Figure 2-1, builds upon two principles: 1) The use of the eXtensible Markup Language (XML [24]) as a uniform syntax for describing hardware devices, configuration data, test results and control sequences.
2) An interpreted, run-time extensible, high-level control language for these sequences that provides independence from specific hosts and interconnect systems to which devices are attached.
Figure 2-1: Abstract description of the model.
This model, as compared to other approaches [40], enforces the uniform use of XML syntax to describe configuration data, device specifications, and control sequences for the configuration and control of hardware devices. This means that control sequences can be treated as data, making it easy to write scripts that manipulate other scripts and embed them into other XML documents. In addition, the unified model makes it possible to use the same concepts, tools, and persistency mechanisms, which simplifies the software configuration management of large projects7.
2.3.1 XML as a uniform syntax When designing systems composed of heterogeneous platforms and/or evolving systems, platform independence is provided by a uniform syntax, using a single data representation to describe hardware devices, configuration data, test results, and control sequences. A solution based on the XML syntax presents the following advantages.
XML is a W3C (World Wide Web Consortium) non-proprietary, platform independent standard that plays an increasingly important role in the exchange of data. A large set of compliant technologies, like XML Schema [42], DOM [43] and XPath [44], is defined. In addition, tools that support programming are available through projects like Apache [45].
XML structures can be formally specified and extended, following a modularized approach, using an XML schema definition.
7 Software Configuration Management is the set of activities designed to control change by identifying the work products that are likely to change, establishing relationships among them, defining mechanisms for managing different versions of these work products, controlling the changes imposed, and auditing and reporting on the changes made [41].
XML documents can be directly transmitted using any kind of protocol, including HTTP [46]. In this case SOAP [25], an XML-based protocol, can be used.
XML documents can be automatically converted into documentation artifacts by means of an XSLT transformation [47]. Therefore, system documentation can be automatically and consistently maintained.
XML is widely used for non-event information in HEP experiments: “XML is cropping up all over in online configuration and monitoring applications” [48].
On the other hand, XML has one big drawback: it uses by default a textual data representation, which causes much more network traffic to transfer data. Even BASE64 or Uuencoded byte arrays are approximately 1.5 times larger than a binary format. Furthermore, additional processing time is required for translating between XML and native data representations. Therefore, the current approach is not well suited for devices generating abundant amounts of real-time data, but is still valid for configuration, monitoring, and slow control purposes.
Figure 2-2: Example program in XSEQ exemplifying the basic features of the language.
2.3.2 XML based control language A control language (XSEQ: cross-platform sequencer) that processes XML documents to operate hardware devices has been syntactically and semantically specified. The language is XML based and has the following characteristics:
Extensibility: The syntax has been formally specified using XML schema. A schema document contains the core syntax of the language, describing the basic structures and constraints on XSEQ programs (e.g. variable declarations and control flow). The basic language can be extended in order to cope with user specific requirements. Those extensions are also XML schema documents, whose elements are instances of abstract elements of the core XML schema. This mechanism is one of the most important features of the language because it facilitates a modular integration of different user requirements and eases resource sharing (code and data). The usage and advantages of this feature will be discussed in Section 2.4.1.
Imperative and object oriented programming styles: The language provides standard imperative constructs just like most other programming languages in order to carry out conditions, sequencing and iteration. It is also possible to use the main object oriented programming concepts like encapsulation, inheritance, abstraction and polymorphism.
Exception handling with error recovery mechanisms.
Local execution of remote sequences with parameter passing by reference.
Non-typed scoped variables.
Additional functionalities have been added to the core syntax in the form of modular XML schema extensions, in order to fit frequently encountered use cases in data acquisition environments:
Transparent access to PCI and VME devices: This extension facilitates the configuration and control of hardware devices, following a common interface for both bus systems. This interface is designed to facilitate its extension in order to cope with future technologies.
File system access.
SOAP messaging: This allows inclusion of control sequences and configuration data into XML messages. The messages can be directly transported between remote hosts in a distributed programming environment.
DOM and XPath interface to facilitate integration in an environment where software and hardware device configuration are fully XML driven.
System command execution interface with redirected standard error and standard output to internal string objects.
In Figure 2-2 an XSEQ program is shown where basic features of the language are exemplified. In Figure 2-3 an example is given of how the hardware access is performed following the proposed model. Device specifications, configuration data and control sequences are XML documents. In this example, configuration data are retrieved through an XPath query from a configuration database.
Figure 2-3: Example of a program in XSEQ, which shows how the model is applied. Device specifications (register_table.xml), configuration data (retrieved from a configuration database accessible through an XPath query) and control sequences are all based on the uniform use of XML.
2.4 Interpreter design To enable code sharing among different platforms, we have chosen a purely interpreted approach that allows control sequences to run independently of the underlying platform in a single compile/execution cycle. In addition, the interpreted approach is characterized by small program sizes and an execution environment that provides controlled and predictable resource consumption, making it easily embeddable in other software systems. An interpreter [49] for XSEQ programs has been implemented in C++ under Linux. The pattern of the interpreter is based on the following concepts:
The source format is a DOM document already validated against the XSEQ XML schema document and the required extensions. This simplifies interpreter implementation and separates the processing into two independent phases: 1) syntactic validation and 2) execution.
Every XML command has a C++ class representation that inherits from a single class named XseqObject.
A global context accessible to all instruction objects. It contains: 1) the execution stack, which stores non-static variables; 2) the static stack, which stores static variables and is useful to retain information from previous executions; 3) the code cache, which maintains already validated DOM trees in order to accelerate the interpretation process; 4) the dynamic factory, which facilitates the run-time extension of the interpreter; and 5) debug information to properly trace the execution and to find errors.
2.4.1 Polymorphic structure Every class inherits from a single abstract class XseqObject, and it has information about how to perform its task. For example, the XSEQ conditional command is represented by the XseqIf class. This class inherits from the XseqObject class, and the execution algorithm is implemented in the overridden eval() method.
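The following sketch illustrates this pattern under simplifying assumptions: the real interpreter evaluates a validated DOM tree, whereas here the command tree is reduced to plain C++ objects so that the XseqObject base class, the overridden eval() method and the shared execution context stand out. All member and parameter names are illustrative.

```cpp
// Sketch of the interpreter pattern described above.  The real XSEQ
// interpreter walks a validated DOM tree; here the tree is reduced to plain
// C++ objects so the structure (XseqObject base class, overridden eval(),
// shared execution context) stands out.  Member names are illustrative.
#include <map>
#include <memory>
#include <string>
#include <vector>

struct Context {
    std::map<std::string, std::string> executionStack;  // non-static variables
    std::map<std::string, std::string> staticStack;     // survive between runs
    // code cache, dynamic factory and debug trace omitted for brevity
};

class XseqObject {                       // base class of every XSEQ command
public:
    virtual ~XseqObject() = default;
    virtual void eval(Context& ctx) = 0; // each command knows how to execute itself
};

class XseqIf : public XseqObject {       // models the conditional command
public:
    XseqIf(std::string var, std::string value,
           std::vector<std::unique_ptr<XseqObject>> body)
        : var_(std::move(var)), value_(std::move(value)), body_(std::move(body)) {}

    void eval(Context& ctx) override {
        if (ctx.executionStack[var_] == value_)   // condition on a scoped variable
            for (auto& cmd : body_) cmd->eval(ctx);
    }

private:
    std::string var_, value_;
    std::vector<std::unique_ptr<XseqObject>> body_;
};
```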
Figure 2-4: Example of an XSEQ program exemplifying the use of the special command that dynamically extends the interpreter (semantics) in order to execute new commands (syntax) defined in an XML schema (xsd) document.
C++ classes that implement the functionality of every language syntactic extension are grouped and compiled as shared libraries. Such libraries can be dynamically linked to the running interpreter. They are associated with a concrete syntactic language extension by means of a special XSEQ command. This facility keeps the syntactic language extensions, defined in XML schema modules, separate from the run-time interpreter extensions. It enables two different sub-systems with similar requirements but different platforms to share code by simply assigning different interpreter extensions to the same language extension. Figure 2-4 exemplifies the use of this command.
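A possible way to attach such a run-time extension is sketched below, assuming each shared library exports a registration hook that adds its command classes to the interpreter's dynamic factory. The library path and the symbol name are hypothetical; only the dlopen/dlsym mechanism itself is standard POSIX.

```cpp
// How a runtime interpreter extension could be attached to a language
// extension, assuming each shared library exports a registration function.
// The library path and symbol name are hypothetical.
#include <dlfcn.h>
#include <stdexcept>
#include <string>

class DynamicFactory;                       // registry of command constructors

// Expected signature of the entry point exported by an extension library.
using RegisterFn = void (*)(DynamicFactory&);

void loadExtension(DynamicFactory& factory, const std::string& libraryPath) {
    void* handle = dlopen(libraryPath.c_str(), RTLD_NOW);
    if (!handle)
        throw std::runtime_error(std::string("dlopen failed: ") + dlerror());

    // Look up the registration hook and let it add its command classes
    // (e.g. VME or PCI access commands) to the interpreter's factory.
    auto registerExtension =
        reinterpret_cast<RegisterFn>(dlsym(handle, "registerXseqExtension"));
    if (!registerExtension)
        throw std::runtime_error("missing registerXseqExtension symbol");
    registerExtension(factory);
}
```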
2.5 Use in a distributed environment The interpreter is also available as an XDAQ pluggable module (Section 1.4.3). XDAQ includes an executive component that provides applications with the necessary functions for communication, configuration, control and monitoring. All configuration, control and monitoring commands can be performed through the SOAP/HTTP protocol. In Figure 2-5 the use of the interpreter in an XDAQ framework is shown. This is the basic building block that facilitates the deployment of the model in a distributed environment.
Figure 2-5: Use of the interpreter in an XDAQ framework.
To operate this application, the user must provide in XML format the configuration of the physical and logical properties of the system and its components. The configuration process defines the available web services as XSEQ scripts. Once the running application is properly configured, the client can send commands through SOAP messages. Depending on the received command, the corresponding XSEQ script is executed. The SOAP message itself can be processed using the language extension to manipulate SOAP messages. Such functionality is useful when parameters must be passed remotely. Finally, every XSEQ program ends by returning a SOAP message that is forwarded by the executive to the client.
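The dispatch step can be pictured with a framework-agnostic sketch: command names are bound at configuration time to XSEQ scripts, and an incoming SOAP command triggers the execution of the corresponding script, whose result is wrapped into the SOAP reply. This is not the XDAQ/xoap API; all types and names below are placeholders.

```cpp
// Framework-agnostic sketch of the dispatch step: each SOAP command name is
// bound at configuration time to an XSEQ script, and an incoming command
// triggers the execution of that script.  This is not the XDAQ/xoap API;
// types and names are placeholders.
#include <functional>
#include <map>
#include <stdexcept>
#include <string>

struct SoapMessage { std::string command; std::string body; };

class CommandDispatcher {
public:
    // Bind a command name to the URL of the XSEQ script that implements it.
    void bind(const std::string& command, const std::string& scriptUrl) {
        scripts_[command] = scriptUrl;
    }

    // runScript stands for "fetch the script and hand it to the interpreter".
    SoapMessage dispatch(const SoapMessage& request,
                         const std::function<std::string(const std::string&,
                                                         const std::string&)>& runScript) {
        auto it = scripts_.find(request.command);
        if (it == scripts_.end())
            throw std::runtime_error("unknown command: " + request.command);
        // The script result is wrapped into the SOAP reply sent back to the client.
        return SoapMessage{request.command + "Response", runScript(it->second, request.body)};
    }

private:
    std::map<std::string, std::string> scripts_;
};
```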
2.6 Hardware management system prototype The architecture of a hypothetical hardware management system for the CMS experiment is shown in Figure 2-6. A number of application scenarios were integrated [50]. Hardware modules belonging to the Global Trigger [18], the Silicon-Tracker sub-detector [6] and the Data Acquisition system [10] participated in this demonstrator. The basic building block presented in Section 2.5 was implemented for every different platform that played the role of hardware module host. The same infrastructure was used to develop a central node which was in charge
of buffering all calls from clients, coordinating the operation of all sub-system control nodes and forwarding the responses from the different sub-system control nodes back to the client. Hardware modules were quite heterogeneous in terms of configuration, functionality and multiplicity. In addition, the control software sub-system for every sub-detector was independent from the others. Therefore, a diverse set of control software sub-systems existed. This offered a heterogeneous set of interfaces that had to be understood by a common configuration and control system. Control sequences executed by the sub-system control nodes depended on a set of language extensions. The language was augmented, following a modular approach, by means of the XML schema technology (Section 2.4.1). For a given language extension the interpreter was associated with a platform-specific support. Some language extensions were shared by several sub-systems. For instance, platform 2 and platform 3 were operating the GT crate through different PCI-to-VME interfaces. The binding command was used to associate a common GT language extension with a specific interpreter extension that knew how to use the concrete PCI-to-VME interface. The same mechanism was also used to share code between platform 3 and platform 4 in order to test PCI and VME memory boards. The default language extension to execute system commands was used to operate the Fast Merging Module board (FMM, [51]) and to forward the standard output and the standard error to XSEQ string objects. Finally, a driver to read and write registers from and to a flash memory embedded into a PCI board was implemented following the chip application notes. The homogeneous use of XML syntax to describe data, control sequences, and language extensions allowed a distributed storage of any of these documents, which could be simply accessed through their URLs. Interpreter run-time extensions could also be remotely linked and, therefore, a local binary copy was not necessary. Another advantage of this approach was that both hardware and software configuration schemes were unified, since the online software of the data acquisition system was also fully XML driven. The default SOAP extension of the control language made it possible to manipulate, send, and receive SOAP messages.
Figure 2-6: Hardware management system based on the XSEQ software environment.
2.7 Performance comparison Timing measurements have been performed on a desktop PC (Intel D845WN chipset) with a Pentium IV processor (1.8 GHz) and 256 MB SDRAM memory (133 MHz), running Linux Red Hat 7.2 with kernel version 2.4.9-31.1. The main objective of this section is to present a comparison of the existing interpreter implementation with a Tcl interpreter [52], focusing on the overhead induced by the interpreted approach when accessing hardware devices. Tcl has been chosen as a reference because it is a well-established scripting language in the HEP community and it shares many features with XSEQ: it is simple, easily extensible and embeddable. For both interpreters the same hardware access library (HAL [53]) has been used to implement the necessary extensions. This library has also been used to implement a C++ binary version of the test program for reference purposes. The test is a loop that reads consecutive memory positions of a memory module. In order to properly identify the interpreter overhead and to decouple it from the driver overhead, the real hardware access has been disabled and a dummy driver emulates all accesses. The results are shown in Table 2-1.

XSEQ       Tcl      C++
16.9 μs    16 μs    2.63 μs

Table 2-1: Comparison of average execution times (memory read) for XSEQ, Tcl and C++.
The results indicate that the overhead of the interpreted approach is of the same order of magnitude as that of the Tcl interpreter. Execution times of XSEQ can be further reduced with customized language extensions that encapsulate a specific macro behavior. For instance, a loop command with a fixed number of iterations has been implemented; it reduces the average time of the test program to 5.3 μs. However, flexibility is reduced, because the macro command cannot be modified at run time.
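The kind of micro-benchmark used above can be sketched as follows: time a loop of consecutive register reads and report the average per access. The read function is a stand-in for the HAL call (the real HAL API is not reproduced here); a dummy lambda emulates the driver, as in the measurement.

```cpp
// Sketch of the kind of micro-benchmark used above: time N consecutive
// register reads and report the average per access.  The read function is a
// stand-in for the HAL call; here a dummy lambda emulates the driver.
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <functional>

double averageReadTimeUs(const std::function<std::uint32_t(std::uint32_t)>& read,
                         std::uint32_t nReads) {
    volatile std::uint32_t sink = 0;                    // keep the reads alive
    auto start = std::chrono::steady_clock::now();
    for (std::uint32_t addr = 0; addr < nReads; ++addr)
        sink = read(4 * addr);                          // consecutive 32-bit words
    auto stop = std::chrono::steady_clock::now();
    std::chrono::duration<double, std::micro> total = stop - start;
    (void)sink;
    return total.count() / nReads;
}

int main() {
    auto dummyDriver = [](std::uint32_t addr) -> std::uint32_t { return addr ^ 0xdeadbeef; };
    std::printf("average read time: %.2f us\n", averageReadTimeUs(dummyDriver, 100000));
}
```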
2.8 Prototype status In this chapter a uniform model based on XML technologies for the configuration, control and testing of data acquisition hardware was presented. It matches well the extensibility and flexibility requirements of a long-lifetime experiment that is characterized by an ever-changing environment. The following chapters present the design and development details of the Level-1 trigger hardware management system, the Trigger Supervisor. In principle, this would be an ideal opportunity to apply XSEQ. However, the prototype status of the software, the limited resources and the reduced development time were decisive reasons to exclude this technological option from the initial survey. Therefore, the XSEQ project did not reach its final goal, which is the same as that of any other software project: to be used. On the other hand, this effort, carried out in the context of the CMS Trigger and Data Acquisition project, improved the overall team knowledge of XML technologies, created a pool of ideas and helped to anticipate the difficulties of building a hardware management system for the Level-1 trigger.
Chapter 3 Trigger Supervisor Concept 3.1 Introduction The Trigger Supervisor (TS) is an online software system. Its purpose is to set up, test, operate and monitor the components of the L1 decision loop (Section 1.3.2) on the one hand, and to manage their interplay and the information exchange with the Run Control and Monitoring System (RCMS, Section 1.4.5) on the other. It is conceived to provide a simple and homogeneous client interface to the online software infrastructure of the trigger sub-systems. Facing a large number of trigger sub-systems and a potentially highly heterogeneous environment resulting from different sub-system Application Program Interfaces (APIs), it is crucial to simplify the task of implementing and maintaining a client that allows operating several trigger sub-systems either simultaneously or in standalone mode. An intermediate node, lying between the client and the trigger sub-systems, which offers a simplified API to perform control, monitoring and testing operations, will ease the design of this client. This layer provides a uniform interface to perform hardware configurations, monitor the hardware behavior or perform tests in which several trigger sub-systems participate. In addition, this layer coordinates the access of different users to the common L1 trigger resources. The operation of the L1 decision loop will necessarily take place within the broader context of the experiment operation. In this context, the RCMS will be in charge of offering a control window from which an operator can run the experiment, and in particular the L1 trigger system. On the other hand, it is also necessary to be able to operate the L1 trigger system independently of the other experiment sub-systems. This independence of the TS will mainly be required during the commissioning and maintenance phases. Once the TS is accessed through RCMS, a scientist working on a data taking run will be presented with a graphical user interface offering choices to configure, test, run and monitor the L1 trigger system. Configuring includes setting up the programmable logic and physics parameters such as energy or momentum thresholds in the L1 trigger hardware. Predefined and validated configuration files are stored in a database and are proposed as defaults. Tests of the L1 trigger system after configuration are optional. Once the TS has determined that the system is configured and operational, a run may be started through RCMS and the option to monitor can be selected. For commissioning periods more options are available in the TS, namely the setting up of different TCS partitions and separate operations of sub-systems. The complexity of the TS is a representative example of the discussion presented in Section 1.5.1: 64 crates, O(10^3) boards with an average of 15 MB of downloadable firmware and O(10^2) configurable registers per board, 8 independent DAQ partitions, and O(10^3) links that must be periodically tested in order to assure correct connections and synchronization are figures of merit of the numeric complexity dimension; the human dimension of the project complexity is represented by a European, Asian and American collaboration of 27 research institutes in experimental physics. The long development and operational periods of this project are also challenging due to the fast pace of technology evolution. For instance, although the TS project only started in August 2004, we have already observed how one of the trigger sub-systems has been fully replaced (Global
Calorimeter Trigger, [13]) and recently a number of proposals to upgrade the trigger sub-systems for the Super LHC (SLHC, [54]) have been accepted [55]. This chapter presents the conceptual design of the CMS Trigger Supervisor (TS, [56]). This design was approved by the CMS collaboration in March 2005 as the baseline design for the L1 decision loop hardware management system. The conceptual design is not the final design but the seed of a successful project that lasted four years from conception to completion and involved people from all CMS sub-systems. Because the conceptual design takes into account the challenging context of the latest generation of HEP experiments, in addition to the functional and non-functional requirements, the description model and the concrete solution can serve as an example for future experiments of how to approach the initial steps of designing a hardware management system.
3.2 Requirements 3.2.1 Functional requirements The TS is conceived to be a central access point that offers a high-level API to facilitate setting up a concrete configuration of the L1 decision loop, launching tests that involve several sub-systems, or monitoring a number of parameters in order to check the correct functionality of the L1 trigger system. In addition, the TS should provide access to the online software infrastructure of each trigger sub-system. 1) Configuration: The most important functionality offered by the TS is the configuration of the L1 trigger system. It has to facilitate setting up the content of the configurable items: FPGA firmware, LUT's, memories and registers. This functionality should hide from the controller the complexity of operating the different trigger sub-systems in order to set up a given configuration. 2) High Level Trigger (HLT) Synchronization: In order to properly configure the HLT, it is necessary to provide a mechanism that propagates the L1 trigger configuration to the HLT, assuring a consistent overall trigger configuration. 3) Test: The TS should offer an interface to test the L1 trigger system. Two different test services should be provided: the self test, intended to check each trigger sub-system individually, and the interconnection test service, intended to check the connections among sub-systems. Interconnection and self test services involve not only the trigger sub-systems but also the sub-detectors themselves (Section 3.3.3.3). 4) Monitoring: The TS interface must enable the monitoring of the information necessary to assure the correct functionality of the trigger sub-systems (e.g., measurements of L1 trigger rates and efficiencies, simulations of the L1 trigger hardware running in the HLT farm), of sub-system specific monitoring data (e.g., data read through spy memories), and of information for synchronization purposes. 5) User management: During the experiment commissioning the different sub-detectors are tested independently, and many of them might be tested in parallel. In other words, several run control sessions, running concurrently, need to access the L1 trigger system (Section 1.4.5). Therefore, it is necessary that the TS coordinates the access to the common resources (e.g., the L1 trigger sub-systems). In addition, it is necessary to control the access to the L1 trigger system hierarchically in order to determine which users/entities (controllers) can have access to it and what privileges they have. A complete access control protocol has to be defined that should include identification, authentication, and authorization processes. Identification includes the processes and procedures employed to establish a unique user/entity identity within a system. Authentication is the process of verifying the identification of a user/entity. This is necessary to protect against unauthorized access to a system or to the information it contains. Typically, authentication takes place using a password. Authorization is the process of deciding whether a requesting user/entity is allowed to have access to a system service. A hierarchical list of users with the corresponding level of access rights, as well as the necessary information to authenticate them, should be maintained in the configuration database. The lowest-level user should only be allowed to monitor.
A medium-level user, such as a scientist responsible for the data taking during a running period of the experiment, may manage partition setups, select predefined L1 trigger menus and change thresholds, which are written directly into registers on the electronics boards. In addition to all the previously cited privileges, the highest-level user, or super user, should be allowed to reprogram logic and change internal settings of the boards. In addition to
coordinating the access of different users to common resources, the TS must also ensure that operations launched by different users are compatible. 6) Hierarchical start-up mechanism: In order to maximize sub-system independence and client decoupling (Section 3.2.2, Point 3)), a hierarchical start-up mechanism must be available (Section 3.3.3.5 describes the operational details). As will be described later, the TS should be organized in a tree-like structure, with a central node and several leaves. The first run control session or controller should be responsible for starting up the TS central node, which in turn should offer an API that provides the start-up of the TS leaves and of the online software infrastructure of the corresponding trigger sub-systems. 7) Logging support: The TS must provide logging mechanisms in order to support the users carrying out troubleshooting activities in the event of problems. Logbook entries must be time-stamped and should include all necessary information such as the details of the action and the identity of the user responsible. The log registry should be available online and should also be recorded for offline use. 8) Error handling: An error management scheme, compatible with the global error management architecture, is necessary. It must provide a standard error format, and remote error handling and notification mechanisms. 9) User support: A graphical user interface (GUI) should be provided. This should allow a standalone operation of the TS. It would also help the user to interact with the TS and to visualize the state of a given operation or the monitoring information. From the main GUI it should be possible to open specific GUIs for each trigger sub-system. These should be based on a common skeleton to be filled in by the trigger sub-system developers, following a methodology described in a document that will be provided. An adequate online help facility should be available to help the user operate the TS, since many of the users of the TS would not be experienced and may not have received detailed training. 10) Multi user: During the commissioning and maintenance phases, several run control sessions run concurrently. Each of them is responsible for operating a different TCS partition. In addition, the TS should allow standalone operations (not involving the RCMS), for instance to execute tests or monitor the L1 trigger system. Therefore, it is necessary that several clients can be served in parallel by the TS. 11) Remote operation: The possibility to program and operate the L1 trigger components remotely is essential due to the distributed nature of the CMS Experiment Control System (Section 1.4.5). It is also important to consider that, unlike in the past, most scientists can in general not be present in person at the experiment location during data taking and commissioning, but have to operate and supervise their systems remotely. 12) Interface requirements: In order to facilitate the integration, the implementation and the description of the controller-TS interface, a web service based approach [57] should be followed. The chosen communication protocol to send commands and state notifications should be the same as for most CMS sub-systems, and especially the same as already chosen for run control, data acquisition and slow control. Therefore, the Simple Object Access Protocol (SOAP) [25] and the Extensible Markup Language (XML) [24] as the representation format for exchanged data should be selected.
The format of the transmitted data and the SOAP messages is specified using the XML schema language [42], and the Web Services Description Language (WSDL) [58] is used to specify the location of the services and the methods the service exposes. To overcome the drawback that XML uses a textual data representation, which causes much network traffic to transfer data, a binary serialization package provided within the CMS online software project and I2O messaging [59] could be used for devices generating large amounts of real-time data. Due to the long time required to finish the execution of configuration and test commands, an asynchronous protocol is necessary to interface the TS. This means that the receiver of the command replies immediately acknowledging the reception, and that this receiver sends another message to the sender once the command is executed. An asynchronous protocol improves the usability of the system because the controller is not blocked until the completion of the requested command.
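A minimal sketch of such an asynchronous command exchange is given below, assuming the acknowledgement and completion messages are simple structures and the completion is delivered through a callback; the message types and names are placeholders, not the actual TS interface.

```cpp
// Minimal sketch of the asynchronous command protocol: the request is
// acknowledged at once and a separate notification is sent when the (possibly
// long) configuration or test command has finished.  Message types and the
// notify callback are placeholders, not the TS interface.
#include <functional>
#include <string>
#include <thread>

struct Ack  { std::string commandId; };
struct Done { std::string commandId; bool success; std::string detail; };

using Notify = std::function<void(const Done&)>;

Ack handleCommand(const std::string& commandId,
                  const std::function<bool()>& longRunningWork,
                  Notify notifyController) {
    // Execute the command in the background and notify the controller later,
    // so the caller is never blocked for the duration of the operation.
    std::thread([commandId, longRunningWork, notifyController]() {
        bool ok = longRunningWork();
        notifyController(Done{commandId, ok, ok ? "completed" : "failed"});
    }).detach();

    return Ack{commandId};               // immediate acknowledgement
}
```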
3.2.2 Non-functional requirements 1) Low-level infrastructure independence: The design of the TS should be independent of the online software infrastructure (OSWI) of any sub-system as far as possible. In other words, the OSWI of a concrete
sub-system should not drive any important decision in the design of the TS. This requirement is intended to minimize TS redesigns due to the evolution of the OSWI of any sub-system. 2) Sub-system control: The TS should offer the possibility of operating a concrete trigger sub-system. Therefore, the design should provide at the same time a mechanism to coordinate the operation of a number of trigger sub-systems, and a mechanism to control a single trigger sub-system. 3) Controller decoupling: The TS must operate in different environments: inside the context of the common experiment operation, but also independently of the other CMS sub-systems, such as during the phases of commissioning and maintenance of the experiment, or during the trigger sub-system integration tests. Due to the diversity of operational contexts, it is useful to facilitate the access to the TS through different technologies: RCMS, Java applications, web browsers or even batch scripts. In order to allow such a heterogeneity of controllers, the TS design must be totally decoupled from the controller, and the following requirements should be taken into account:
a. The logic of the TS should not be split between a concrete controller and the TS itself;
b. The technology choice to develop the TS should not depend on the software frameworks used to develop a concrete controller.
In addition, the logical and technological decoupling from the controller increases the evolution potential and decreases the maintenance effort of the TS. It also increases development and debugging options, and reduces the complexity of operating the L1 trigger system in a standalone way. 4) Robustness: Due to 1) the key role of the TS in the overall CMS online software architecture, and 2) the fact that a malfunction can result in significant losses of physics data as well as economic losses, the TS should be considered a critical system [60] and design decisions therefore had to be taken accordingly. 5) Reduced development time: The schedule constraints are also a non-functional requirement. The project development phase only started in May 2005, a first demonstrator of the TS system was expected to be ready four months later, and an almost final system had to be drafted for the second phase of the Magnet Test and Cosmic Challenge, which took place in November 2006, with the aim that the TS would be able to follow the monthly increasing deployment of CMS experiment components during the Global Run exercises started in May 2007. 6) Flexibility: The TS has to be designed as an open system capable of adopting non-foreseen functionalities or services required to operate the L1 decision loop or just specific sub-systems. These new capabilities must be added in a non-disruptive way, without requiring major developments. 7) Human context awareness: The TS design and development has to take into account the particular human context of the L1 trigger project. The available human resources in all sub-systems were limited and their effort was split among hardware debugging, physics related tasks and software development including online, offline and hardware emulation. In this context, most collaboration members were confronted with a heterogeneous spectrum of tasks. In addition, the most common professional profiles were hardware experts and experimental physicists with no software engineering background. The resources assigned to the TS project were also very limited; initially, and for more than one year, a single person had to cope with the design, development, documentation and communication tasks. An additional Full Time Equivalent (FTE) was incorporated into the project after this period, and a number of students have collaborated for a few months each, developing small tasks.
Figure 3-1: Architecture of the Trigger Supervisor.
3.3 Design The TS architecture is composed of a central node, in charge of coordinating the access to the different sub-systems, namely the trigger sub-systems and the sub-detectors concerned by the interconnection test service (Section 3.3.3.3), and a customizable TS leaf (Section 3.3.2) for each of them, which offers the central node a well defined interface to operate the OSWI of each sub-system. Figure 3-1 shows the architecture of the TS. Each node of the TS can be accessed independently, fulfilling the requirement outlined in Section 3.2.2, Point 2). The available interfaces and the location of each of those nodes are defined in a WSDL document. Both the central node and the TS leaves are based on a single common building block, the “control cell”. Each sub-system group will be responsible for customizing a control cell and for keeping the consistency of the available interface with the interface described in the corresponding WSDL file. The presented design is not driven by the available interface of the OSWI of a concrete sub-system (Section 3.2.2, Point 1)). Therefore, this improves the evolution potential of the low-level infrastructure and the TS. Moreover, the design of the TS is logically and technologically decoupled from any controller (Section 3.2.2, Point 3)). In addition, the distributed nature of the TS design facilitates a clear separation of responsibilities and a distributed development. The common control cell software framework could be used in a variety of different control network topologies (e.g., N-level tree or peer-to-peer graph).
3.3.1 Initial discussion on technology The development of a distributed software system like the TS requires the usage of distributed programming facilities. An initial technological survey pointed to a possible candidate: a C++ based cross-platform data acquisition framework called XDAQ, developed in-house by the CMS collaboration (Section 1.4.3). The OSWI of many sub-systems was already based on this distributed programming framework (Section 1.4.4). It was therefore an obvious option for developing the TS. The following reasons backed up this technological option:
• The software frameworks used in both the TS and the sub-systems are homogeneous.
• For faster messaging, I2O messages can be used instead of being limited to the SOAP communication protocol.
• Monitoring and security packages are available.
• XDAQ development was practically finished, and its API was considered already stable when the conceptual design was approved.
[Figure 3-2 shows the block diagram of the control cell: the HTTP, SOAP and I2O/custom interfaces, the Access Control Module (ACM), the Task Scheduler Module (TSM), the Shared Resource Manager (SRM), the Error Manager (EM), and a set of tasks.]
Figure 3-2: Architecture of the control cell.
3.3.2 Cell The architecture of the TS is characterized by its tree topology, where all tree nodes are based on a common building block, the control cell. Figure 3-2 shows the architecture of the control cell. The control cell is a program that offers the necessary functionalities to coordinate the control operations over other software systems, for instance the OSWI of a concrete trigger sub-system, an information server, or even another control cell. Each cell can work independently of the rest (fulfilling the requirement of Section 3.2.2, Point 2) ), or inside a more complex topology. The following points describe the components of the control cell.

1) Control Cell Interface (CCI): This is the external interface of the control cell. Different protocols should be available. An HTTP interface could be provided using the XDAQ facilities; this should facilitate a first entry point from any web browser. A second interface based on SOAP should also be provided in order to ease the integration of the TS with the run control or any other controller that requires a web service interface. Future interface extensions are foreseen (e.g., an I2O interface should be implemented). Each control cell should have an associated WSDL document that describes its interface. The information contained in that document instructs any user or entity how to properly operate the control cell.

2) Access Control Module (ACM): This module is responsible for identifying and authenticating every user or entity (controller) attempting to gain access, and for providing an authorization protocol. The access control module should have access to a user list, which should provide the information necessary for identification and authentication, and the privileges assigned to each controller. Those privileges should be used to check whether or not an authenticated controller is allowed to execute a given operation.

3) Task Scheduler Module (TSM): This module is in charge of managing the command requests and forwarding the answer messages. The basic idea is that a set of available operations exists that can be accessed by a given controller. Each operation corresponds to a Finite State Machine (FSM). The default set of operations is customizable and extensible. The TSM is also responsible for preventing the launching of operations that could enter into conflict with other running operations (e.g., simultaneous self test operations
within the same trigger sub-system, or interconnection test operations that cannot be parallelized). The extension and/or customization of the default set of operations could change the available interface of the control cell. In that case, the corresponding WSDL should be updated.

4) Shared Resources Manager (SRM): This module is in charge of coordinating access to shared resources (e.g., the configuration database, other control cells, or a trigger sub-system online software infrastructure). Independent locking services for each resource are provided.

5) Error Manager (ERM): This module manages all errors generated in the context of the control cell that are not solved locally, as well as those errors that could not be resolved in a control cell immediately controlled by this one. Both the error format and the remote error notification mechanism will be based on the global CMS distributed error handling scheme.

The control over which operations can be executed is distributed among the ACM for user access level control (e.g., a user with monitoring privileges cannot launch a self test operation), the TSM for control of conflicting operations (e.g., to avoid running in parallel operations that could disturb each other), and the command code of each operation (e.g., to check that a given user is allowed to set up the requested configuration). More details are given in Section 3.3.3.1.
3.3.3 Trigger Supervisor services The Trigger Supervisor services are the final functionalities offered by the TS. These services emerge from the collaboration of several nodes of the TS tree. In general, the central node is involved in all services, coordinating the operation of the necessary TS leaves. The goal of this section is to describe, for each service, what the default operations are in both the central node of the TS and the TS leaves, and how the services emerge from the collaboration of these distributed operations. It should be noted that a control cell operation is always a Finite State Machine (FSM). The main reason for using FSM’s to define the TS services is that FSM’s are a well known model for defining control systems in HEP. They are therefore a convenient tool to communicate and discuss ideas with the rest of the collaboration.
3.3.3.1 Configuration This service is intended to perform the hardware configuration of the L1 trigger system, which includes the setting of registers or Look-Up Tables (LUT’s) and downloading the L1 trigger logic into the programmable logic devices of the electronics boards. The configuration service requires the collaboration of the central node of the TS and all the TS leaves. Each control cell involved implements the operation represented in Figure 3-3.
[Figure 3-3 shows the state diagram of the configuration operation, with the states Not configured, Configuring, Configured, Enabling, Enabled and Error, and the transitions ConfigurationServiceInit(), Configure(Key), Enable(), Reconfigure(Key) and Error().]
Figure 3-3: Configuration operation.
[Figure 3-4 sketches the configuration service: the RC session (RCMS responsibility) holds the session key and the TS key and sends Configure(TS_key) to the TS central node; the central node (TS responsibility) maps TS_key to the sub-system keys (TCS_key, GM_key, GC_key, …) and sends the corresponding Configure commands to the TS leaves of the GT/TCS, Global Muon, Global Calorimeter and further sub-systems, whose keys identify sub-system configuration data such as the BC table, the throttle logic and other TCS parameters (sub-system responsibility).]
Figure 3-4: Configuration service.
Due to the asynchronous interface, it is also necessary to define transition states, such as Configuring and Enabling, which indicate that a transition is in progress. All commands are executed while the FSM is in a transition state. If applicable, an error state is reached from the transition state. Figure 3-4 shows how the different nodes of the TS collaborate in order to fully configure the L1 trigger system.
A key8 is assigned to each node. Each key maps into a row of a database table that contains the configuration information of the system. The sequence of steps that a controller of the TS should follow in order to properly use the configuration service is as follows.
1. Send a ConfigurationServiceInit() command to the central node of the TS.
2. Once the operation reaches the Not configured state, the next step is to send a Configure(Key) command, where Key identifies a set of sub-system keys, one per trigger sub-system that is to be configured. The Configure(Key) command initiates the configuration operation in the relevant TS leaves. The configure command in the configuration operation of each TS leaf will check whether or not the user is allowed to set the configuration identified by a given sub-system key. This means that each trigger sub-system has full control over who and what can be configured. This also means that the list of users in the central node of the TS will be replicated in the TS leaves.
3. Once the configuration operation of the TS leaves reaches the Configured state, the configuration operation in the central node of the TS jumps to the Configured state.
4. Send an Enable command. This fourth step is just a switch-on operation.
From the point of view of the L1 trigger system, everything is ready to run the experiment once the configuration operation reaches the Enabled state. Each trigger sub-system has the responsibility to customize the configuration operation of its own control cell and thus has to implement the commands of the FSM. The central node of the TS owns the data that relate a given L1 trigger key to the trigger sub-system keys. The presented configuration service is flexible enough to allow a full or a partial configuration of the L1 trigger system. In the latter case, the Key identifies just a subset of sub-system keys, one per trigger sub-system that is to be configured, and/or each sub-system key identifies just a subset of all the parameters that can be configured for a given trigger sub-system. The configuration database consists of separate databases for each sub-system and for the central node. Each trigger sub-system is then responsible for populating the configuration database and for assigning key identifiers to sets of configuration parameters.
8Key: Name that uniquely identifies the configuration of a given system.
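To make the key mechanism more concrete, the following sketch illustrates, in simplified C++, how a central node could resolve an L1 trigger key into per-sub-system keys and fan out the corresponding Configure commands. All names (keyMap, sendConfigure, the key values) are hypothetical illustrations and not taken from the TS code; in the real system the mapping lives in the configuration database and the commands are SOAP messages.

#include <iostream>
#include <map>
#include <string>

// Hypothetical mapping from an L1 trigger key to the sub-system keys.
// In the TS this information is stored in the central node's database.
std::map<std::string, std::map<std::string, std::string>> keyMap = {
    {"physics_v1",
     {{"GT/TCS", "TCS_key_7"}, {"GMT", "GM_key_3"}, {"GCT", "GC_key_5"}}}
};

// Placeholder standing in for the SOAP Configure(sub-system key) message
// sent to the configuration operation of a TS leaf.
void sendConfigure(const std::string& leaf, const std::string& subKey) {
    std::cout << "Configure(" << subKey << ") -> " << leaf << "\n";
}

// Central-node side of the Configure(Key) transition: resolve the key and
// forward the sub-system keys to the leaves. A partial configuration simply
// corresponds to a key that maps to a subset of the leaves or parameters.
void configure(const std::string& l1Key) {
    const auto& subKeys = keyMap.at(l1Key);
    for (const auto& entry : subKeys)
        sendConfigure(entry.first, entry.second);
}

int main() { configure("physics_v1"); }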
3.3.3.2 Reconfiguration This section complements Section 3.3.3.1. A reconfiguration of the L1 trigger system may become necessary, for example if thresholds have to be adapted due to a change in luminosity conditions. The new configuration table must be propagated to the filter farm, as required in Section 3.2.1, Point 2). The following steps show how a controller of the TS should behave in order to properly reconfigure the L1 trigger system using the configuration service.
• Once the L1 trigger system is configured, the configuration operation in the central node of the TS will be in the Enabled state.
• Send a Reconfigure(Key) command, which behaves as follows:
o Stop the generation of L1A signals.
o Send a Configure(Key) command as in Section 3.3.3.1.
o Jump to the Configured state.
• The controller is also responsible for propagating the configuration changes to the filter farm hosts in charge of the HLT and the L1 trigger simulation through the configuration/conditions database (Section 3.2.1, Point 2) ).
• Send an Enable command: This signal will be sent by the controller to confirm the propagation of configuration changes to the filter farm hosts in charge of the HLT and the L1 trigger simulation. This command will be in charge of resuming the generation of L1A signals.
Run control is in charge of coordinating the configuration of the TS and the HLT. There is no special interface between the central node of the TS and the HLT.
3.3.3.3 Testing The TS offers two different test services: the self test service and the interconnection test service. The following sections describe both. The self test service checks that each individual sub-system is able to operate as foreseen. If anything fails during the test of a given sub-system, an error report is returned, which can be used to define the necessary corrective actions. The self test service can involve one or more sub-systems. In the second, more complex case, the self test service requires the collaboration of the central node of the TS and all the corresponding TS leaves. Each control cell involved implements the same self test operation. The self test operation running in each control cell is a FSM with only two states: halted and tested. This is the sequence of steps that a controller of the TS should follow in order to properly use the self test service.
• Send a SelfTestServiceInit() command. Once the self test operation is initiated, the operation reaches the halted state (initial state).
• Send a RunTest(LogLevel) command, where the parameter LogLevel specifies the level of detail of the error report. An additional parameter type, in the RunTest() command, might be used to distinguish among different types of self test.
The behavior of the RunTest() command depends on whether it is the self test operation of the central node of the TS, or a self test operation in a TS leaf. In the central node of the TS, the RunTest() command is used to follow the above sequence for each TS leaf and to collect all error reports coming from the TS leaves. In the case of a TS leaf, the RunTest() command will implement the test itself and will generate an error report that will be forwarded to the central node of the TS. It is important to note that the error report will be generated in a standard format specified in an XML Schema Document (XSD). This should ease the automation of test reports.
The interconnection test service is intended to check the connections among sub-systems. In each test, several trigger sub-systems and sub-detectors can participate as sender(s) or receiver(s). Figure 3-5 shows a typical scenario for participants involved in an interconnection test. The example shows the interconnection test of the Trigger Primitive Generators and the Global Trigger logic.
[Figure 3-5 shows the detector front-end and Trigger Primitive Generator (TPG) acting as sender, connected through optical trigger links and the trigger sub-system chain to the Global Trigger acting as receiver, with the TCS issuing Start(L1A) and the readout path to the DAQ over S-Link.]
Figure 3-5: Typical scenario of an interconnection test.
The interconnection test service requires the collaboration of the central node of the TS and some of the TS leaves. Each control cell involved will implement the operation represented in Figure 3-6.
[Figure 3-6 shows the state diagram of the interconnection test operation, with the states Not tested, Preparing, Ready for test, Testing, Tested and Error, and the transitions ConTestServiceInit(), Prepare_test(Test_id), Start_test() and Error().]
Figure 3-6: Interconnection test operation.
This is the sequence of steps that a controller of the TS should follow in order to properly use the interconnection test service.
• Send a ConTestServiceInit() command.
• Once the operation reaches the Not tested state, the next step is to send a Prepare_test(Test_id) command. This command, as implemented in the central node of the TS, performs the following steps (a schematic sketch of this coordination is given at the end of this section):
o Retrieve from the configuration database the relevant information for the central node of the TS.
o Send a ConTestServiceInit() command to the sender(s) and receiver(s).
o Send a Prepare_test() command to the sender(s) and receiver(s).
o Wait for the Ready_for_test signal from all senders and receivers.
• Once the operation reaches the Ready for test state, the next step is to send a Start_test command.
• Wait for the results.
This is the sequence of steps that the TS leaves acting as senders or receivers should follow when they receive the Prepare_test(Test_id) command from the central node of the TS.
• Retrieve from the configuration database the relevant information for the leaf (e.g., its role as sender or receiver, and the test vectors to be sent or expected).
• Send a Ready_for_test signal to the central node of the TS.
• Wait for the Start_test() command.
• Do the test, and generate the test report to be forwarded to the central node of the TS (if the TS leaf is a receiver).
In contrast to the configuration service, the central node of the TS can already check whether a given user can launch interconnection test operations. However, the TSM of each TS leaf will still be in charge of checking whether acting as a sender/receiver is in conflict with an already running operation. Each sub-detector must also customize a control cell in order to facilitate the execution of interconnection tests that involve the TPG modules.
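As a rough illustration of the coordination described above, the following simplified C++ sketch shows how the central node's Prepare_test(Test_id) transition could drive the participating senders and receivers. The helper functions, leaf names and test identifier are hypothetical; in the TS the commands are SOAP messages and the participant list comes from the configuration database.

#include <iostream>
#include <string>
#include <vector>

// Hypothetical helper standing in for a SOAP command sent to a leaf cell.
void sendCommand(const std::string& leaf, const std::string& command,
                 const std::string& argument = "") {
    std::cout << command << "(" << argument << ") -> " << leaf << "\n";
}

// Hypothetical helper standing in for collecting the asynchronous
// Ready_for_test replies from all participants.
void waitForReadyForTest(const std::vector<std::string>& participants) {
    std::cout << "waiting for Ready_for_test from " << participants.size()
              << " participants\n";
}

// Central-node side of the Prepare_test(Test_id) transition: initialize the
// interconnection test operation of every participant and prepare it.
void prepareTest(const std::string& testId) {
    std::vector<std::string> participants = {"TPG_leaf", "GT_leaf"};  // from the DB in reality
    for (const auto& p : participants) {
        sendCommand(p, "ConTestServiceInit");
        sendCommand(p, "Prepare_test", testId);
    }
    waitForReadyForTest(participants);
}

int main() { prepareTest("TPG_to_GT"); }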
3.3.3.4 Monitoring The monitoring service is implemented either by an operation running in a concrete TS leaf, or as a collaborative service where an operation running in the central node of the TS supervises the monitoring operations running in a number of TS leaves. The basic monitoring operation is a FSM with only two states: monitoring and stop. Once the monitoring operation is initiated, the monitoring process is started. At this point, any controller can retrieve items by sending pull commands. A more advanced monitoring infrastructure should be offered in a second development phase, where a given controller receives monitoring updates following a push approach. This second approach facilitates the implementation of an alarm mechanism.
3.3.3.5 Start-up From the point of view of a controller (run control session or standalone client), the whole L1 trigger system is one single resource, which can be started by sending three commands. Figure 3-7 shows how this process is carried out. This approach will simplify the implementation of the client.
[Figure 3-7 sketches the start-up service: the RC session (RCMS responsibility) first sends Start(TS_URL) to the job control (JC) daemon, then Config_trigger_sw(TS_config_data) and Startup_trigger(TS_start_key) to the TS central node; the central node (TS responsibility) repeats the same three-command sequence, e.g. Start(GT_URL), Config_trigger_sw(GT_config_data) and Startup_trigger(GT_start_key), towards the job control daemons and TS leaves of the Global Trigger, Global Muon and Global Calorimeter sub-systems, which in turn start up their OSWI (sub-system responsibility).]
Figure 3-7: Start-up service.
The first client that wishes to operate with the TS must follow these steps:
1. Send a Start(TS_URL) command to the job control daemon in charge of starting up the central node of the TS, where TS_URL identifies the Uniform Resource Locator from where the compiled central node of the TS can be retrieved.
2. Send a Config_trigger_sw(TS_config_data) command to the central node of the TS in order to properly configure it. Steps 1 and 2 are separated to facilitate an incremental configuration process.
3. Send a Startup_trigger(TS_start_key) command to the central node of the TS. This command will send the same sequence of three commands to each TS leaf, but now the command parameters are retrieved from the configuration database register identified by the TS_start_key index. The Config_trigger_sw(TSLeaf_config_data) command that is received by the TS leaf is in charge of starting up the corresponding online software infrastructure.
The release of the TS nodes is also hierarchical. Each node of the TS (i.e., the TS central node and the TS leaves) will maintain a counter of the number of controllers that are operating on it. When a controller wishes to stop operating a given TS node, it has to request the value of the reference counter from the TS node. If it is equal to 1, the controller will send a Release_node command and wait for the answer. When a TS node receives a Release_node command it will behave like the controller outlined above in order to release the software infrastructure that is no longer needed.
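A minimal sketch of this reference-counting release protocol could look as follows (simplified C++; the class and method names are illustrative and not the TS API):

#include <iostream>
#include <string>
#include <vector>

// Illustrative node of the TS tree with a controller reference counter.
class TSNode {
public:
    explicit TSNode(std::string name) : name_(std::move(name)), controllers_(0) {}

    void attachController()        { ++controllers_; }
    int  referenceCount() const    { return controllers_; }
    void addChild(TSNode* child)   { children_.push_back(child); }

    // Called when a Release_node command is received: behave like a
    // controller towards the children (release each child if this node is
    // its last user), then release the local software infrastructure.
    void releaseNode() {
        for (TSNode* child : children_) {
            if (child->referenceCount() == 1)
                child->releaseNode();
        }
        --controllers_;
        std::cout << "released " << name_ << "\n";
    }

private:
    std::string name_;
    int controllers_;
    std::vector<TSNode*> children_;
};

int main() {
    TSNode central("TS central node"), gt("GT leaf"), gmt("GMT leaf");
    central.addChild(&gt);
    central.addChild(&gmt);
    gt.attachController(); gmt.attachController(); central.attachController();

    // The last controller checks the counter and, being the only user,
    // releases the whole tree.
    if (central.referenceCount() == 1) central.releaseNode();
}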
3.3.4 Graphical User Interface Together with the basic building block of the TS, the control cell, an interactive graphical environment should be provided to interact with it. It should feature a display that helps the user/developer to operate the control cell, in order to cope with the requirement outlined in Section 3.2.1, Point 9). Two different interfaces are foreseen:
• HTTP: The control cell should provide an HTTP interface that allows full operation of the control cell and visualization of the state of any running operation. The HTTP interface should provide an additional entry point to the control cell (Section 3.3.2), bypassing the ACM, in order to offer greater flexibility in the development and debugging phases.
• Java: A generic controller developed in Java should provide the user with an interactive window to operate the control cell through a SOAP interface. This Java application should also serve as an example of how to interact with the monitoring operations offered by the control cell and how to graphically represent the monitored items. This Java controller can be used by the RCMS developers as an example of how to interact with the TS.
3.3.5 Configuration and conditions database In this design, a dedicated configuration/conditions database per sub-system is foreseen. Different sets of firmware for the L1 trigger electronics boards and default parameters such as thresholds should be predefined and stored in the database. The information should be validated with respect to the actual hardware limitations and compatibility between different components. However, as it is shown in Figure 3-1, all these databases share the same database server provided by the CMS DataBase Working Group (DBWG). The general CMS database infrastructure, which the TS will use, includes the following components:
HW infrastructure: Servers.
SW infrastructure: Likely based on Oracle, scripts and generic GUIs to populate the databases, methodology to create customized GUIs to populate sub-system specific configuration data.
Each trigger sub-system should provide the specific database structures for storing configuration data, access control information and interconnection test parameters. Custom GUIs to populate these structures should also be delivered.
3.4 Project communication channels The development of the Trigger Supervisor required the collaboration of all trigger sub-systems, the sub-detectors and the RCMS. Other parties of the CMS collaboration are also involved in this project: the Luminosity Monitoring System (LMS), the High Level Trigger (HLT), the Online Software Working Group (OSWG) and the DataBase Working Group (DBWG). A consistent configuration of the Trigger Primitive Generator (TPG) modules of each sub-detector, the automatic update of the L1 trigger pre-scales as a function of information obtained from the LMS, the adequate configuration of the HLT and the agreement on the usage of software tools and database technologies enlarged the number of involved parties during the development of the TS. Due to the large number of involved parties and sub-system interfaces, a significant effort was dedicated to documentation and communication.
One of the problems in defining the communication channels is that they may concern different classes of consumers having fairly different backgrounds and languages: electronics engineers, physicists, programmers and technicians. Consumers can be roughly divided between the TS team and the rest. For internal use, the TS members use Unified Modeling Language (UML) [61] descriptions to model and document the status of the TS software framework: concurrency, communication mechanism, access control, task scheduling and error management. This model is kept consistent with the status of the TS software framework. This additional effort is worthwhile because it accelerates the learning curve of new team members, who are able to contribute effectively to the project in a shorter period of time; it helps to detect and remove errors; and it can be used as discussion material with other software experts, for instance to discuss the database interface with the DBWG or to justify to the OSWG an upgrade of a core library. However, this approach is no longer valid when the consumer is not a software expert. Project managers, electronics engineers or physicists must also contribute. Periodic demonstrators with all involved parties have proved to be powerful communication channels. This simple approach has facilitated the understanding of the TS for a wide range of experts and has helped in the continuous process of understanding the requirements. A practical way to communicate the status of the project has also facilitated the maintenance of a realistic development plan and manpower forecast.
3.5 Project development The development of the TS was divided into three main development layers: the framework, the system and the services. The framework is the software infrastructure that provides the main building block, or control cell, and the integration with the specific sub-system OSWI. The system is a distributed software architecture built out of these building blocks. Finally, the services are the L1 trigger operation capabilities implemented on top of the system as a collaboration of finite state machines running in each of the cells. The decomposition of the project development tasks into three layers has the following advantages:

1) Project development coordination: The division of the project development effort into three conceptually different layers facilitates the distribution of tasks between a central team and the sub-systems. In a context of limited human resources, the central team can focus on those tasks that have a project-wide scope, like project organization, communication, design and development of the TS framework, coordination of sub-system integration, sub-system support and so on. The tasks assigned to the sub-systems are those that require expert knowledge of the sub-system hardware. These tasks consist of developing the sub-system TS cells according to the models proposed by the central team, and of developing the sub-system cell operations required by the central team in order to build the configuration and test services.

2) Hardware and software upgrades: Periodic software platform and hardware upgrades are foreseen during the long operational life of the experiment. A baseline layer that hides these upgrades and provides a stable interface avoids the propagation of code modifications to higher conceptual layers. Therefore, the code and the number of people involved in updating the TS after each SW/HW upgrade are limited and well localized.

3) Flexible operation capabilities: A stable distributed architecture built on top of the baseline layer is the first step towards providing a simple methodology to create new services to operate the L1 decision loop (Section 3.2.2, Point 6) ). The simplicity of this methodology is necessary because the people in charge of defining the way of operating the experiment are in general not software experts but particle physicists with almost full-time management responsibilities.
[Figure 3-8 sketches the project organization: the development layers Framework (Chapter 4), System (Chapter 5) and Services (Chapter 6) are built on top of the prototype and concept (Chapter 3) and of the software and hardware context (Chapter 1); periodic demonstrators serve as a communication channel with all involved parties: the trigger sub-systems and sub-detectors, the Luminosity Monitoring System, the High Level Trigger, the Run Control and Monitoring System, the Database Working Group and the Online Software Working Group.]
Figure 3-8: Trigger Supervisor project organization and communication schemes.
The TS framework, presented in Chapter 4, consists of the distributed programming facilities required to build the distributed software system known as the TS system. The TS system, presented in Chapter 5, is a set of nodes and the communication channels among them; it serves as the underlying infrastructure that facilitates the development of the TS services presented in Chapter 6. Figure 3-8 shows a simplified diagram of the project organization, the communication channels and the contents of Chapters 1, 3, 4, 5, and 6.
3.6 Tasks and responsibilities The development of the TS framework, system and services can be further divided into a number of tasks. Due to the limited resources of the central TS team and, in some cases, due to the required expertise about concrete sub-system hardware, these tasks are distributed among the trigger sub-systems and the TS team.

Central team responsibilities
The tasks assigned to the central team are those that have a project-wide scope, like project organization, communication, design and development of common infrastructure, coordination of sub-system integration, sub-system support and so on. The following list describes the tasks assigned to the central team.

1) Trigger Supervisor framework development: The creation of the basic building blocks that form the TS system and that facilitate the integration of the different sub-systems is a major task which requires a continuous development process, from the prototype to the periodic upgrades in coordination with the OSWG and DBWG.

2) Coordination: The central team is responsible for discussing with each sub-system and proposing to it a model of integration with the TS system. The central team is also responsible for developing the central cell and for coordinating the different sub-systems in order to create the TS services.

3) Sub-system support: It is important to provide adequate support to the sub-systems in order to ease the integration process and the usage of the TS framework. With this aim, the project web page [62] was regularly updated with the latest version of the user’s guide [63] and the latest presentations, a series of workshops [64][65][66] was organized, and finally a web-based support management tool was set up [67].
4) Software configuration management: A set of configuration management actions was proposed by the central team in order to improve the communication of the system evolution and the coordination among sub-system development groups. A common Concurrent Versions System9 (CVS) repository for all the online software infrastructure of the L1 trigger was created, which facilitates the production and coordination of L1 trigger software releases. A generic Makefile10 was adopted to homogenize the build process of the L1 trigger software. This allowed a more automatic deployment of the L1 trigger online software infrastructure and prepared it for integration with the DAQ online software.

5) Communication: The central team was also responsible for communicating with all involved parties according to Section 3.4. The communication effort consisted of periodic demonstrators, the framework internal documentation and presentations in the collaboration meetings.

Sub-system responsibilities
The tasks assigned to the sub-systems were those that required expert knowledge of the sub-system hardware. These tasks consisted of developing the sub-system TS cells according to the models proposed by the central team, and of developing the sub-system cell operations required by the central team in order to build the configuration and test services.

Shared responsibilities
Due to an initial lack of human resources in the sub-system teams, some sub-system cells were initially prototyped by the central team: GT, GMT, and DTTF. At a later stage, the bulk of these developments was transferred to the corresponding sub-systems.
3.7 Conceptual design in perspective The TS conceptual design presented in this chapter consists of functional and non-functional requirements, a feasible architecture that fulfills these requirements, and the project organization details. These three points define the project concept. Some initial technical aspects have also been presented in order to prove the feasibility of the design: XDAQ as baseline infrastructure and GUI technologies, the usage of FSM’s, service implementation details and so on. Over three years the project scope has not been altered, proving the suitability of the initial conceptual ideas. However, some technical details have evolved towards different solutions, some have disappeared and a few have been added. The following chapters describe the final technical details of the Trigger Supervisor.
9 The Concurrent Versions System (CVS), also known as the Concurrent Versioning System, is an open-source version control system that keeps track of all work and all changes in a set of files, typically the implementation of a software project, and allows several (potentially widely-separated) developers to collaborate (Wikipedia). 10 In software development, make is a utility for automatically building large applications. Files specifying instructions for make are called Makefiles (Wikipedia).
Chapter 4 Trigger Supervisor Framework
4.1 Choice of an adequate framework The conceptual design of the Trigger Supervisor presented in Chapter 3 outlines a distributed software control system with a hierarchical topology where each node is based on a common architecture. Such a distributed system requires the usage of a distributed programming framework11 that should provide the necessary tools and services for remote communication, system process management, memory management, error management, logging and monitoring. A suitable solution had to cope with the functional and non-functional requirements presented in Chapter 3. As discussed in Section 1.4, the CMS Experiment Control System (ECS) is based on three main distributed programming frameworks, namely XDAQ, DCS and RCMS, which as official projects of the CMS collaboration will be maintained and supported during an operational phase of the order of ten years. The choice was therefore limited to these frameworks. Other external projects were not considered due to the impossibility of assuring their long-term maintenance. Among these frameworks, XDAQ had proven to be the most complete and the one able to facilitate the fast development required in Section 3.2.2, Point 5):
The Online SoftWare Infrastructure (OSWI) of all sub-systems is mainly formed by libraries written in C++ running on an x86/Linux platform. These are intended to hide hardware complexity from software experts. Therefore, a distributed programming framework based on C++ would simplify the model of integration with the sub-system OSWI’s.
When the survey took place, XDAQ was already a mature product with an almost final API which facilitated the upgrading effort.
XDAQ provides infrastructure for monitoring, logging and database access.
The RCMS and PVSSII/JCOP frameworks were not selected due to the additional complexity they would have introduced into the overall architecture. First, RCMS is written in Java, and therefore the integration of C++ libraries would have required an additional effort. Besides, RCMS was being completely re-developed when the survey took place. Regarding PVSSII, it could have been adopted if the sub-system C++ code had run within a Distributed Information Management (DIM) server [70]. This could have provided an adequate remote interface to PVSSII [71]; however, the usage of two distributed programming frameworks (PVSSII and DIM) on two different platforms (PVSSII runs on Windows and DIM on Linux) would have resulted in an undesirably complex architecture.
11 A software framework is a reusable software design that can be used to simplify the implementation of a specific type of software. If it is implemented in an object-oriented language, it consists of a set of classes and the way their instances collaborate [68][69].
Despite the fact that XDAQ was the best available option, it was not an out-of-the-box solution to implement the Trigger Supervisor and therefore further development was needed. Section 4.2 describes the requirements of the Trigger Supervisor framework. Section 4.3 describes the functional architecture. Section 4.4 discusses the implementation details. Section 4.5 presents a concrete usage guide of the framework. Finally, the performance and scalability issues are presented in Section 4.6.
4.2 Requirements This section presents the requirements of a suitable software framework to develop the TS. It is shown how the functional (Section 3.2.1) and non-functional (Section 3.2.2) requirements associated with the conceptual design motivate a number of additional developments which are not covered by XDAQ.
4.2.1 Requirements covered by XDAQ The basic software infrastructure necessary to implement the TS should fulfill a number of requirements in order to be able to serve as the core framework of the TS system. The following list presents the requirements which were properly covered by XDAQ:

1) Web services centric: The CMS online software, and more precisely the Run Control and Monitoring System (RCMS), makes extensive use of web services technologies ([10], p. 202). XDAQ is also a web services centric infrastructure. Therefore, it simplifies the integration with RCMS (Section 3.2.1, Point 12) ).

2) Logging and error management: According to Section 3.2.1, Points 7) and 8), the TS framework should provide facilities for logging and error management in a distributed environment. XDAQ provides this infrastructure, compatible with the CMS logging and error management schemes.

3) Monitoring: According to Section 3.2.1, Point 4), the TS framework should provide infrastructure for monitoring in a distributed environment.
4.2.2 Requirements not covered by XDAQ Additional infrastructure had to be designed and developed to cope with the requirements of the conceptual design:

1) Synchronous and asynchronous protocols: The TS framework should facilitate the development of distributed systems featuring both synchronous and asynchronous communication among nodes (Section 3.2.1, Point 12) ).

2) Multi-user: The nodes of a distributed system implemented with the TS framework should facilitate concurrent access by multiple clients (Section 3.2.1, Point 10) ).

However, the main additional developments were motivated by the human context (Section 3.2.2, Point 7) ) of the project and by time constraints (Section 3.2.2, Point 5) ). The remainder of this section presents a number of desirable requirements grouped according to a few generic guidelines.

Simplify the integration and support effort: The resources of the central TS team were very limited. Therefore, it was necessary to provide infrastructure that simplified the software integration and reduced the need for sub-system support.

3) Finite State Machine (FSM) based control system: A framework that guides the sub-system developer, reducing the degrees of freedom during the customization process, would simplify the software integration and reduce the support tasks. A control system model based on Finite State Machines (FSM) is well known in HEP. It was proposed in Section 3.3 as a feasible model to implement the final services of the Trigger Supervisor. FSM’s have been used in other experiment control systems [72][73][74], and are currently being used by the CMS DCS [75] and other CERN experiments [76][77]. On the other hand, a well known model alone is not enough: a concrete FSM had to be provided, with a clear specification of all states and transitions, their expected behavior, and the input/output parameter data types and names. The more complete this specification is, the easier the sub-system coordination becomes and the more it facilitates a clear separation of responsibilities among sub-systems. Some more concrete implementation details, shown in the implementation section, like a clear separation of the error management, are intended to ease the
customization and maintenance phases. In addition, the usage of a well known model would accelerate the learning curve and therefore the integration process.

4) Simple access to external services: The framework should provide facilities to access Oracle relational databases, XDAQ applications, and remote web-based services (i.e., SOAP-based and HTTP/CGI-based services) in a simple and homogeneous way. This infrastructure would ease the development of the FSM transition methods, for instance when it is necessary to access the configuration database.

5) Homogeneous integration methodology independent of the concrete sub-system OSWI: The TS framework should facilitate a common integration methodology independent of the available OSWI and the hardware setup.

6) Automatic creation of graphical user interfaces: In order to reduce the integration development time, the framework should provide a mechanism to automatically generate a GUI to control the sub-system hardware. This should also facilitate a common look and feel for all sub-system graphical setups. An operator of the L1 trigger system could therefore learn faster how to operate any sub-system.

7) Single integration software infrastructure: A single visible software framework would simplify the understanding of the integration process for the sub-systems.

Simplify software tasks during the operational phase: The framework architecture should take into account that support and maintenance tasks are foreseen during the experiment operational phase.

8) Homogeneous online software infrastructure: In addition to simplifying the understanding of the integration process for the sub-systems, a single integration software infrastructure would ease the creation of releases, the user support and the maintenance tasks. A common technological approach, aligned with the Trigger Supervisor, to design and develop sub-system expert tools, like graphical setups or command-line utilities to control a concrete piece of hardware, would also help to simplify the overall maintenance effort of the whole L1 trigger OSWI.

9) Layered architecture: From the maintenance point of view, any additional development on top of XDAQ had to be designed such that it is easy to upgrade to new XDAQ versions or even to other distributed programming frameworks.
4.3 Cell functional structure The “cell” is the main component of the additional software infrastructure motivated by the requirements not covered by XDAQ. This component serves as the main facility to integrate the sub-system’s OSWI with the Trigger Supervisor. Figure 4-1 shows the functional structure of the cell for a stable version of the TS framework. This functional structure is more detailed and it has a number of differences compared to the cell presented in the conceptual design chapter. The following sections describe in detail this architecture.
4.3.1 Cell Operation A cell operation is basically a FSM running inside the cell which can be remotely operated. In general, FSM’s are applied to HEP control problems where it is necessary to monitor and control the stable state of a system. The TS services outlined in Chapter 3 were therefore suitable candidates for this approach.
[Figure 4-1 shows the building blocks of the cell: the HTTP/CGI (GUI) and SOAP interfaces, control panel and operation plug-ins, the operation and command factories and pools, the access control and response control modules, the error management module, the monitorable items and monitoring data source handlers, the xhannels (monitor, database, cell and XDAQ) and the sub-system hardware driver.]
Figure 4-1: Architecture of the main component of the TS framework: the cell.
To use a cell operation it is necessary to initialize an operation instance. The cell provides a remote interface to create instances of cell operations. Figure 4-2 shows a cell operation example with one initial state (S1), several normal states (S2 and S3), transitions between state pairs (arrows), and one event name assigned to each transition (e1, e2, e3 and e4). Operation events are issued by the controller in order to change the current state. The state changes when a transition named after the issued event and originating in the current state is successfully executed. A transition named after the event ei has two customizable methods: ci and fi. The method ci returns a boolean value, and the method fi defines the functionality assigned to a successful transition. If ci returns false, the current state does not change and fi is not executed. If ci returns true, fi is executed and afterwards the current state of the FSM changes. A first aspect to note is that each transition has two functions (fi, ci). This design has been chosen to enforce a customization style that simplifies the implementation, understanding and maintenance of the transition code (fi), whilst facilitating a progressive improvement of the preliminary system check code (ci). For instance, reading from a database and configuring a board would be a sequence of actions defined by the transition code, whilst checking that the board is plugged in and that the database is reachable, among other possible error conditions, would be defined in the check method.
[Figure 4-2 shows the state diagram of an example cell operation with states S1, S2 and S3, events e1–e4, the transition rule “ei: if (ci) then { fi, move to next state } else { do not move }”, and the operation warning object (Warning_level = 1000, Warning_message = “no message”).]
Figure 4-2: Cell operation.
Each operation has a warning object which provides a way to monitor the status of the operation. Since it is updated with the execution of every new event, the warning object can also be used to provide feedback on the success level of the transition execution. A warning object contains a warning level and a warning message. The warning message is destined for human operators, while the warning level is a numeric value that can be processed by a remote machine controller. A number of operation-specific parameters can be set. All of them are accessible during the definition and execution of any of the fi’s and ci’s. The value of the parameters can be set by the controller when the operation is initialized or when the controller sends an event. The type of the parameters can be signed or unsigned integer, string or boolean. The return message, after executing the transition methods, always includes a payload and the operation warning object. The payload data type can be any of the parameter types. Standard operations are provided with the TS framework for the implementation of the configuration and interconnection test services. The transition methods for these operations are left empty and each sub-system is responsible for defining this code. The TS services, presented in Chapter 6, appear as a coordinated collaboration of the different sub-system specific operations. Additional operations can be created by each sub-system to ease concrete commissioning and debugging tasks. For instance, an operation can be implemented to move data from memories to spy buffers in order to check the proper information processing in a number of stages. In order to simplify the understanding of the cell operation model, the intermediate states (Section 3.3.3.1), representing the execution of the transition methods, are not visible in Figure 4-2. However, each transition has a hidden state which indicates that the transition methods are being executed.
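The transition semantics described above can be illustrated with a small generic sketch in plain C++ (this is not the actual TS framework API; all class and member names are illustrative): each transition carries a check method ci and a transition method fi, and the warning object records the outcome.

#include <functional>
#include <iostream>
#include <map>
#include <string>
#include <utility>

// Illustrative warning object: a level for machine controllers and a
// message for human operators.
struct Warning {
    int level = 1000;
    std::string message = "no message";
};

// Generic two-method transition: check (ci) guards the transition, act (fi)
// implements the work assigned to a successful transition.
struct Transition {
    std::string target;
    std::function<bool()> check;        // ci
    std::function<void(Warning&)> act;  // fi
};

class Operation {
public:
    explicit Operation(std::string initial) : state_(std::move(initial)) {}

    void addTransition(const std::string& from, const std::string& event,
                       Transition t) {
        transitions_[from + "/" + event] = std::move(t);
    }

    // Issue an event: if ci returns false the state does not change and fi
    // is not executed; otherwise fi runs and the FSM moves to the new state.
    void fireEvent(const std::string& event) {
        auto it = transitions_.find(state_ + "/" + event);
        if (it == transitions_.end() || !it->second.check()) return;
        it->second.act(warning_);
        state_ = it->second.target;
    }

    const std::string& state() const { return state_; }
    const Warning& warning() const { return warning_; }

private:
    std::string state_;
    Warning warning_;
    std::map<std::string, Transition> transitions_;
};

int main() {
    Operation op("Not configured");
    op.addTransition("Not configured", "configure",
        {"Configured",
         [] { return true; },                                    // ci: e.g. is the board reachable?
         [](Warning& w) { w.message = "board configured"; }});   // fi: do the actual work
    op.fireEvent("configure");
    std::cout << op.state() << ": " << op.warning().message << "\n";
}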
4.3.2 Cell command A cell command is a functionality of the cell which can be remotely called. Every command splits its functionality into two methods: precondition() and code(). The method precondition() returns a boolean value and the method code() defines the command functionality. If precondition() returns false, the code() method is not executed; if precondition() returns true, the code() method is executed. Cell commands can have an arbitrary number of typed parameters which can be used within the command methods. Like the cell operation, the command has a warning object. This is used to provide better feedback on the success level of the command execution. The warning object can be modified during the execution of the precondition() and/or code() methods.
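In the same spirit, a command can be pictured as follows. Again this is only an illustrative sketch in plain C++, not the real TS base class; the command name, the register value and the hardware check are hypothetical.

#include <iostream>
#include <string>

// Illustrative warning object as used by commands (level + message).
struct Warning {
    int level = 1000;
    std::string message = "no message";
};

// Illustrative cell command: code() runs only if precondition() holds.
class ReadRegisterCommand {
public:
    bool precondition() { return boardReachable_; }                // check phase
    void code(Warning& w) { w.message = "register read: 0x2A"; }   // work phase

    void execute(Warning& w) {
        if (!precondition()) {
            w.level = 3000;
            w.message = "board not reachable";
            return;
        }
        code(w);
    }

private:
    bool boardReachable_ = true;  // stands in for a real hardware check
};

int main() {
    Warning w;
    ReadRegisterCommand cmd;
    cmd.execute(w);
    std::cout << w.level << " " << w.message << "\n";
}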
4.3.3 Factories and plug-ins A number of operations and commands are provided with the TS framework. These can be extended with operation and command plug-ins. The operation factory and the command factory are meant to create instances of the available plug-ins upon request of an authorized controller. Several instances of the same operation or command can be operated concurrently.
4.3.4 Pools The cell command and operation pools are internal cell structures which store all command and operation instances, respectively. Each instance of an operation and of a command is identified by a unique name (operation_id and command_id). This identifier is used to retrieve and operate a specific instance.
4.3.5 Controller interface Compared to the functional design presented in the conceptual design (Section 3.3.2), the input interfaces were limited to SOAP and HTTP/CGI (Common Gateway Interface12). The I2O high-performance interface was not added to the definitive architecture because ultimately only slow control requests had to be served. The possibility to extend the input interface with a sub-system specific protocol was also dropped because none of the sub-systems required it. Both interfaces (SOAP and HTTP/CGI) facilitate the initialization, destruction and operation of any available command and operation. The HTTP/CGI interface also provides access to all monitoring items in the sub-system cell and in other cells belonging to the same distributed system (Section 4.4.4.12). The HTTP/CGI interface is automatically generated during the compilation phase. This simplifies the sub-system development effort and homogenizes the look and feel of all sub-system GUIs. This human-to-machine interface can be extended with control panel plug-ins (Section 4.4.4.11). A control panel is also a web-based graphical setup facilitated by the HTTP/CGI interface, but with a customized look and feel. The default, automatically generated GUI provides access to the control panels. The second interface is a SOAP-based machine-to-machine interface. It is intended to facilitate the integration of the TS with the RCMS and to provide a communication link between cells. Appendix A presents a detailed specification of this interface.
12 The Common Gateway Interface (CGI) is a standard protocol for interfacing external applications with information servers, commonly web servers. Each time a request is received, the server analyzes what the request asks for and returns the appropriate output. CGI can use the HTTP protocol as transport layer (HTTP/CGI).
4.3.6 Response control module The Response Control Module (RCM) was not introduced in the conceptual design chapter. This cell functional module is meant to handle both synchronous and asynchronous responses towards the controller side. The synchronous protocol is intended to assure an exclusive usage of the cell, while the asynchronous mode enables multi-user access and an enhanced overall system performance.
4.3.7 Access control module The Access Control Module (ACM) is intended to identify and authorize a given controller. A new controller trying to gain access to a cell will have to identify itself with a user name and a password. The ACM will check this information in the user database and will grant the controller a session identifier. This session identifier will be stored and will be accessible from any cell. The session identifier is the key to those services that are granted to a concrete user, and it has to be sent with every new controller request.
4.3.8 Shared resource manager The Shared Resource Manager (SRM), outlined in the conceptual design (Section 3.3.2), is no longer solely responsible for coordinating the access to internal or external resources. In the final design, the concurrent access to common resources, such as the sub-system hardware driver or the communication ports with external entities, is coordinated by each individual entity. The main reason for this approach is that it is not possible to assure that all requests pass through the cell.
4.3.9 Error manager The Error Manager (ERM) is meant to detect any exceptional situation that occurs when a command or operation transition method is executed and the method is not able to solve the problem locally. In this case, the ERM takes control of the method execution and sends back the reply message to the controller with textual information about what went wrong during the execution of the command or operation transition. This message is embedded in the warning object of the reply message (Appendix A).
4.3.10 Xhannel The xhannel infrastructure has been designed to gain access to external resources from the cell command and operation methods. It provides a simple and homogeneous interface to a wide range of external services: other cells, XDAQ applications and web services. This infrastructure eases the definition of the command and operation transition methods by simplifying the process of creating SOAP and HTTP/CGI messages, processing the responses and handling synchronous and asynchronous protocols.
4.3.11 Monitoring facilities
The TS monitoring infrastructure consists of a methodology to declare cell monitoring items and an additional infrastructure which facilitates the definition of the code to be executed every time that each item is being checked. The TS monitoring infrastructure is based on the XDAQ monitoring components.
4.4 Implementation The TS framework is the implementation of the additional infrastructure required in the discussion of Section 4.2 and formalized with a functional design in Section 4.3. The layered architecture of Figure 4-3 shows how the TS framework is implemented on top of the XDAQ middleware and a number of external software packages13. The TS framework, together with the XDAQ middleware, is used to implement the Trigger Supervisor system.
4.4.1 Layered architecture The L1 trigger OSWI has the layered structure shown in Figure 4-3. In this organization, the TS framework lies between a specific sub-system OSWI on the upper side, and the XDAQ middleware and other external packages on the lower side. Figure 4-4 shows the package level description of the L1 trigger OSWI. Each layer of Figure 4-3 is represented by a box in Figure 4-4 and each box includes a number of packages. The dependencies among packages are also presented in Figure 4-4. Sections 4.4.2 to 4.4.4 present each of the layers outlined in Figure 4-3.
Figure 4-3: Layered description of a Level-1 trigger online software infrastructure.
4.4.2 External packages This section describes the external packages used by the TS and XDAQ frameworks. The C++ classes contained in these packages are used to enhance the developments described in Section 4.4.
4.4.2.1 Log4cplus Inserting user notifications, also known as “log statements”, into the code is a method for debugging it (Section 3.2.1, Point 7) ). It may also be the only practicable way to debug multi-threaded applications and distributed applications at large. Log4cplus is a C++ logging software framework modeled after the Java log4j API [78]. It provides precise context about a running application. Once the log statements are inserted into the code, the generation of logging output requires no human intervention. Moreover, log output can be saved in a persistent medium to be studied at a later time. The Log4cplus package is used to facilitate the debugging of the TS system and to keep a persistent record of the run-time system behavior. This facilitates the development of post-mortem analysis tools. Logging facilities are also used to document and to monitor alarm conditions. 13 A software package in object-oriented programming is a group of related classes with a strong coupling. A software framework can consist of a number of packages.
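As an illustration of the kind of log statements meant here, a minimal log4cplus usage could look like the following. This is a generic example, not taken from the TS code; it assumes a reasonably recent log4cplus version, and the logger name and messages are purely illustrative. In a real deployment a properties file would normally configure persistent appenders instead of the basic console configuration used here.

#include <log4cplus/logger.h>
#include <log4cplus/configurator.h>
#include <log4cplus/loggingmacros.h>

int main() {
    // Default configuration: log to the console.
    log4cplus::BasicConfigurator::configure();

    log4cplus::Logger logger =
        log4cplus::Logger::getInstance(LOG4CPLUS_TEXT("ts.cell"));

    LOG4CPLUS_INFO(logger, LOG4CPLUS_TEXT("cell started"));
    LOG4CPLUS_WARN(logger, LOG4CPLUS_TEXT("configuration key not found, using default"));
    LOG4CPLUS_ERROR(logger, LOG4CPLUS_TEXT("hardware board not reachable"));
    return 0;
}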
Figure 4-4: Software packages of the Level-1 trigger online software infrastructure.
4.4.2.2 Xerces Xerces [79] is a validating XML parser written in C++. Xerces enables C++ applications to read and write XML data. An API is provided for parsing, generating, manipulating, and validating XML documents. Xerces conforms to the XML 1.1 [80] recommendation. Xerces is used to ease the parsing of the SOAP request messages in order to extract the command and parameter names, the parameter values and other message attributes.
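A minimal sketch of this kind of parsing with Xerces-C++ is shown below; the file name and the element access are illustrative only, and error handling is omitted for brevity.

#include <xercesc/util/PlatformUtils.hpp>
#include <xercesc/util/XMLString.hpp>
#include <xercesc/parsers/XercesDOMParser.hpp>
#include <xercesc/dom/DOM.hpp>
#include <iostream>

using namespace xercesc;

int main() {
    XMLPlatformUtils::Initialize();
    {
        XercesDOMParser parser;
        parser.parse("request.xml");  // e.g. a SOAP request stored on disk

        DOMDocument* doc = parser.getDocument();
        DOMElement* root = doc->getDocumentElement();

        // Transcode the root tag name (e.g. the SOAP envelope or command name).
        char* name = XMLString::transcode(root->getTagName());
        std::cout << "root element: " << name << "\n";
        XMLString::release(&name);
    }  // parser and DOM released before terminating the library
    XMLPlatformUtils::Terminate();
    return 0;
}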
4.4.2.3 Graphviz Graphviz [81] is a C++ framework for graph filtering and rendering. This library is used to draw the finite state machine of the cell operations.
4.4.2.4 ChartDirector ChartDirector [82] is a C++ framework which enables a C++ application to synthesize charts using standard chart layers. This package is used to present the monitoring information.
4.4.2.5 Dojo Dojo [83] is a collection of JavaScript functions. Dojo eases building dynamic capabilities into web pages and any other environment that supports JavaScript. The components provided by Dojo can be used to make web sites more usable, responsive and functional. The Dojo toolkit is used to implement the TS graphical user interface.
4.4.2.6 Cgicc Cgicc [84] is a C++ library that simplifies the processing of HTTP/CGI requests on the server side (the cell in our case). This package is used by the CellFramework, Ajaxell and sub-system cell packages to ease the implementation of the TS web-based graphical user interface.
4.4.2.7 Logging collector The logging collector, or log collector [85], is a software component that belongs to the RCMS framework (Section 1.4.1). It is designed and developed to collect logging information from log4j-compliant applications and to forward these logging statements to several consumers at the same time. These consumers can be an Oracle database, files or a real-time message system. The log collector is not part of the TS framework, but it is used as a component of the TS logging system, which is itself a component of the TS system.
4.4.3 XDAQ development XDAQ (pronounced “cross DAQ”) was introduced in Section 1.4.3 as a domain-specific middleware designed for high energy physics data acquisition systems. It provides platform-independent services, tools for local and remote inter-process communication, configuration and control, as well as technology-independent data storage. To achieve these goals, the framework is built upon industrial standards, open protocols and libraries. This distributed programming framework is designed according to the object-oriented model and implemented in the C++ programming language. The infrastructure facilitates the development of scalable distributed software systems by partitioning applications into smaller functional units that can be distributed over multiple processing units. In this scheme, each computing node runs a copy of an executive that can be extended at run-time with binary components. A XDAQ-based distributed system is therefore designed as a set of independent, dynamically loadable modules14, each one dedicated to a specific sub-task. The executive simply acts as a container for such modules and loads them according to an XML configuration provided by the user. A collection of C++ utilities is available to enhance the development of XDAQ components: logging, data transmission, exception handling facilities, remote access to configuration parameters, thread management, memory management and communication among XDAQ applications. Some core components are loaded by default in the executive in order to provide basic functionality. The main ones are the peer transports, which implement the communication among XDAQ applications. Another default component is the Hyperdaq web interface application, which turns an executive into a browsable web application that can visualize its internal data structure [86]. The framework supports two data formats, one based on the I2O specification [87] and the other on XML. I2O messages are binary packets with a maximum size of 256 KB, primarily intended for the efficient exchange of binary information, e.g. the data acquisition flow. Despite its efficiency, the I2O scheme is not universal and lacks flexibility. A second type of communication has therefore been chosen for tasks that require higher flexibility, such as configuration, control and monitoring. This message-passing protocol, the Simple Object Access Protocol (SOAP), relies on the standard web protocol (HTTP) and encapsulates data using the eXtensible Markup Language (XML). SOAP is a means to exchange structured data in the form of XML-based messages among computers over HTTP. XDAQ uses SOAP for Remote Procedure Calls (RPC): the SOAP message contains an XML tag that is associated with a function call, a so-called callback, on the receiver side. In this way a controller can execute procedures on remote XDAQ nodes. The XDAQ framework is divided into three packages: Core Tools, Power Pack and Work Suite. The Core Tools package contains the main classes required to build XDAQ applications, the Power Pack package consists of pluggable components to build DAQ applications, and the Work Suite package contains additional infrastructure, totally independent of XDAQ, which is intended to perform related data acquisition tasks.
14 XDAQ component, module and application are equivalent concepts.
XDAQ example A XDAQ application is a C++ class which extends the base class xdaq::Application. It can be loaded into a XDAQ executive at run-time. Unlike ordinary C++ applications, a XDAQ application does not have a main() method as an entry point; instead, it has several methods to control specific aspects of its execution. Each of these methods can be assigned to an RPC in order to facilitate its remote execution. At start-up, a XDAQ executive can be configured by passing the path of a configuration file as a command-line argument. This file uses XML to structure the configuration information of the executive hierarchically in three levels:
Partition: Each configuration file contains exactly one partition that is a collection of XDAQ executives hosting XDAQ applications.
Context: Each context defines one XDAQ executive, uniquely identified by its URL (host name and port). A partition may contain an arbitrary number of contexts. A dedicated tag inside the context element specifies the location of the shared libraries that have to be loaded in order to make the applications available.
Application: The application tag uniquely identifies a XDAQ application. Each context can contain an arbitrary number of XDAQ applications. Applications can define properties using a dedicated properties tag; these properties can be accessed at run-time.
The cell is implemented as a XDAQ component or application. Figure 4-5 shows the configuration file of the Global Trigger (GT) cell. The GT cell runs on the first host and is configured with a number of properties. The GT cell is compiled into a single library (libCell.so in the figure), whose location is given in the context of the first host. A second executive runs on a different host and contains one single application named Tstore.
Figure 4-5: Example of XDAQ configuration file: GT cell configuration file.
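To complement the description above, the following is a minimal sketch of a XDAQ application class with one SOAP callback bound as an RPC. The class and callback names are invented for illustration, and the include paths, exception specifications and registration macros follow common XDAQ usage of that era; they may differ between XDAQ releases.

#include "xdaq/Application.h"
#include "xdaq/ApplicationStub.h"
#include "xoap/MessageReference.h"
#include "xoap/MessageFactory.h"
#include "xoap/Method.h"

class ExampleApplication : public xdaq::Application
{
public:
    XDAQ_INSTANTIATOR();

    ExampleApplication(xdaq::ApplicationStub* stub) throw (xdaq::exception::Exception)
        : xdaq::Application(stub)
    {
        // Bind a SOAP callback: a message whose body contains a "Configure"
        // element triggers the onConfigure() method on this node (RPC)
        xoap::bind(this, &ExampleApplication::onConfigure, "Configure", XDAQ_NS_URI);
    }

    xoap::MessageReference onConfigure(xoap::MessageReference msg)
        throw (xoap::exception::Exception)
    {
        // ... act on the request (e.g. configure internal state) ...
        return xoap::createMessage();   // empty SOAP reply
    }
};

// Makes the class loadable by the executive as a binary module
XDAQ_INSTANTIATOR_IMPL(ExampleApplication)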
4.4.4 Trigger Supervisor framework The TS framework is the software layer built on top of XDAQ and the external packages. This software layer fills the gap between XDAQ and a suitable solution that copes with the project related human factors (Section 3.2.2, Point 7) ), time constraints (Section 3.2.2, Point 5) ) and non-covered functional requirements (Section 3.2.1, Point 10) ) discussed in the TS conceptual design. This solution has been developed according to the requirements discussed in Section 4.2 and the functional architecture presented in Section 4.3.
The components of the TS framework can be divided into two groups: the TS core framework and the sub-system customizable components. The TS core framework is the main infrastructure used by the customizable components. Figure 4-6 shows a Unified Modeling Language (UML) diagram of the most important classes of the TS core framework and a possible scenario of derived or customizable sub-system classes. This section presents the structure of the classes contained in the TS framework. Its description is organized following the same structure as the cell functional description presented in Section 4.3; the implementation of each functional module is described as a collaboration of classes using UML. The main classes that collaborate to form the cell functional modules are contained in the CellFramework package. This section also presents a number of packages developed specifically for this project: the CellToolbox package and a new library designed and developed to implement the TS Graphical User Interface. Finally, the database interfaces and the integration of the XDAQ monitoring and logging infrastructures are presented.
4.4.4.1 The cell A SubsystemCell class (or sub-system cell) is a C++ class that inherits from the CellAbstract class, which in turn is a descendant of the xdaq::Application class. The fact that a sub-system cell is a XDAQ application allows it to be added to a XDAQ partition, thus making it browsable through the XDAQ HTTP/CGI interface. The XDAQ SOAP Remote Procedure Call (RPC) interface is also available to the sub-system cell. The RPC interface, implemented in the CellAbstract class, allows remote usage of the cell operations and commands. The CellAbstract class is also responsible for the dynamic creation of communication channels between the cell and external services, also known as “xhannels”. The xhannel run-time setup is done according to an XML file known as the “xhannel list”. The CellAbstract class implements a GUI accessible through the XDAQ HTTP/CGI interface, which can be extended with custom graphical setups called “control panels”.
Figure 4-6: Components of the TS framework and sub-system customizable classes.
4.4.4.2 Cell command A cell command, presented in Section 4.3.2, is an internal method of the cell that can be executed by an external entity or controller. There are a few default commands that allow a controller to remotely instantiate, control and kill cell operations; these are presented in the following section. It is also possible to extend the default cell commands with sub-system specific ones. Figure 4-7 shows a UML diagram of the TS framework components involved in the creation of the cell command concept. The CellCommand class inherits from the CellObject class, which provides access to the CellAbstractContext object and to the Logger object. The CellAbstractContext object is shared among all instances of CellObject in a given cell, in particular among all CellCommand and CellOperation instances, and provides access to the factories and to the xhannels.
Figure 4-7: UML diagram of the main classes involved in the creation of the cell command concept.

Through a dynamic cast, it is also possible to access a sub-system specific descendant of the CellAbstractContext class (or just cell context). In some cases, the sub-system cell context gives access to a sub-system hardware driver; therefore, all CellCommand and CellOperation instances can control the hardware. The CellObject interface also facilitates access to the logging infrastructure through the logger object. Each CellCommand or CellOperation object has a CellWarning object. The CellCommand class has one public method named run(). When this method is called, a sequence of three virtual methods is executed. These virtual methods have to be implemented in the specific CellSubsystemCommand class: 1) the init() method initializes the objects that will be used in the precondition() and code() methods (Section 4.3.2); 2) the precondition() method checks the necessary conditions to execute the command; and 3) the code() method defines the functionality of the command. The warning message and level can be read or written within any of these methods. Finally, the run() method returns the reply SOAP message, which embeds an XML-serialized version of the code() method result and of the warning object.
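The following sketch illustrates the shape of a sub-system command as described above. It is not the actual TS framework API: the header name and the way the warning and the result are set are assumptions; only the three virtual methods and their roles follow the text.

#include "ts/framework/CellCommand.h"   // hypothetical header name

// Hypothetical sub-system command checking and applying a configuration key
class GTWriteKeyCommand : public CellCommand
{
public:
    // 1) init(): prepare what precondition() and code() will need, e.g. read
    //    the command parameters and locate the hardware driver through the
    //    sub-system cell context (SubsystemContext in Figure 4-7)
    virtual void init()
    {
    }

    // 2) precondition(): return false to refuse the execution of the command
    virtual bool precondition()
    {
        // e.g. check that the crate is powered and the driver is initialized
        return true;
    }

    // 3) code(): the actual functionality of the command; its result and the
    //    warning object are embedded by run() into the SOAP reply (mechanism
    //    not shown here)
    virtual void code()
    {
        // e.g. write the trigger key into the hardware and, on a non-fatal
        // problem, update the CellWarning message and level
    }
};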
4.4.4.3 Cell operation Figure 4-8 shows a UML diagram of the TS framework components involved in the creation of the cell operation concept.
Figure 4-8: UML diagram of the TS framework components involved in the creation of the cell operation concept.
Like the CellCommand class, the CellOperation class is a descendant of the CellObject class. It therefore has access to the logger object and to the cell context. The CellOperation class also inherits from toolbox::lang::Class. This XDAQ class facilitates a loop that runs in an independent thread executing a concrete job defined in the CellOperation::job() method; this is known as the “cell operation work-loop”. An important member of the CellOperation class is the CellFSM attribute, which implements the FSM defined in Section 4.3.1. The initialization code of the CellFSM class is defined in the initFSM() method of the CellSubsystemOperation class. This method defines the states, the transitions and the (fi, ci) methods associated with each transition. An external controller can interact with the CellOperation infrastructure through a set of predefined cell commands: OpInit, OpSendCommand, OpGetState, OpReset and OpKill. The OpInit::code() method triggers in the cell the creation of a new CellOperation object. Once the CellOperation object is created, the operation work-loop starts. This work-loop periodically checks a queue for new events; if a new event arrives, it is passed to the CellFSM object. The queue avoids losing any event and ensures that events are served in order. The rest of the predefined commands are treated as events on existing operation objects; therefore, their code() methods just push the command itself onto the operation queue.
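The sketch below illustrates the structure of a sub-system operation as described above. It is not the actual TS framework API: the header name, the CellFSM calls (addState, addTransition) and the member names are assumptions; only the idea of an initFSM() method defining the states, the transitions and the associated condition and transition methods follows the text.

#include "ts/framework/CellOperation.h"   // hypothetical header name

class GTConfigureOperation : public CellOperation
{
public:
    virtual void initFSM()
    {
        // States of the operation finite state machine
        fsm_.addState("halted");                       // hypothetical CellFSM API
        fsm_.addState("configured");

        // Transition "configure": checkConfigure() plays the role of the
        // condition method (ci) and doConfigure() of the transition method (fi)
        fsm_.addTransition("halted", "configured", "configure",
                           this, &GTConfigureOperation::checkConfigure,
                           &GTConfigureOperation::doConfigure);
    }

private:
    bool checkConfigure() { return true; }   // e.g. verify that a trigger key is set
    void doConfigure()    { /* write the configuration into the hardware */ }
};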
4.4.4.4 Factories, pools and plug-ins Figure 4-9 shows the components involved in the creation of the factory, the pool and the plug-in concepts. There are three types of factories: command, operation and panel factories. The factories are responsible for controlling the creation, destruction and operation of the respective items (commands, operations or control panels). Sub-system specific commands, operations and panels are also called plug-ins. The commands, operations and panels available in the factories can be extended at run-time using the corresponding add methods of the CellAbstract class (e.g. CellAbstract::addCommand()).
Figure 4-9: TS framework components involved in the creation of the factory, the pool and the plug-in concepts.
The factories also play the role of pools. Each factory keeps track of the created objects and is responsible for assigning a unique identifier to each of them. After the object creation, this identifier is embedded in the reply SOAP message and sent back to the controller (Section 4.4.4.5 and Appendix A).
4.4.4.5 Controller interface Figure 4-10 shows the components involved in the creation of the cell Controller Interface (CI). As shown in Section 4.4.4.1, sub-system cells are XDAQ applications and are therefore able to expose both an HTTP/CGI and a SOAP interface. The cell HTTP/CGI interface is defined in the CellAbstract class by overriding the default() virtual method of the xdaq::Application class. This method parses the input HTTP/CGI request, which is available as a Cgicc input argument (Section 4.4.2.6). The HTTP/CGI response is written into the Cgicc output argument at the end of the default() method and is sent back by the executive to the browser. The TS GUI is presented in Section 4.4.4.11.
Figure 4-10: Components involved in the creation of the controller interface.

A second interface is the SOAP interface. A non-customized cell is able to serve the default commands, which allow a controller to instantiate, control and kill cell operations. The cell SOAP interface and the callback routine assigned to each SOAP command are defined in the CellAbstract class. This interface is enlarged when a new command is added using the CellAbstract::addCommand() method. All SOAP commands are served by the same callback method, CellAbstract::command(). This method uses the CellCommandFactory object to create a CellCommand object and executes its public method CellCommand::run() (Section 4.4.4.2). The SOAP message object returned by the run() method is forwarded by the executive to the controller. Section 4.4.4.6 discusses in more detail the implementation of the synchronous and asynchronous interaction with the controller, and Appendix A presents the SOAP API from the controller point of view.
4.4.4.6 Response control module Figure 4-11 shows a UML diagram of the classes involved in the implementation of the Response Control Module (RCM). The RCM implements the details of the communication protocols with a cell client or controller. A given controller has two possible ways to interact with the cell: synchronous and asynchronous (Appendix A). When the controller requests a synchronous execution of a cell command, it assumes that the reply message will be sent back when the command execution has finished. The second way to interact with the cell is the asynchronous one: an empty acknowledge message is sent back immediately to the controller, and a second message is sent when the execution of the command has completed. The asynchronous protocol allows the implementation of cell clients with an improved response time and facilitates the multi-user (or multi-client) functional requirement outlined in Section 3.2.1, Point 10).
Figure 4-11: UML diagram of the classes involved in the implementation of the Response Control Module.

The asynchronous protocol facilitates the multi-user interface because the single-user SOAP interface provided by the XDAQ executive is released immediately. The synchronous protocol, on the other hand, is interesting for a controller that wants to block the access to a given cell whilst it is using it. It was shown in Section 4.4.4.5 that all SOAP commands are served by the same callback routine defined in the method CellAbstract::command(). This method uses the CellCommandFactory object to create a CellCommand object and then executes the method CellCommand::run(), which returns the SOAP reply message (Section 4.4.4.2). In the synchronous case, the CellCommand::run() method returns just after executing the code() method. In the asynchronous case, the CellCommand::run() method returns immediately after starting the execution of the code() method, which continues running in a dedicated thread. The asynchronous SOAP reply message is sent back to the controller by this thread when the code() method finishes. The thread is provided by the cell command inheritance from the toolbox::lang::Class class. Figure 4-12 shows a simplified sequence diagram of the interaction between a controller and a cell using the synchronous and asynchronous SOAP message protocols.
Figure 4-12: Simplified sequence diagram of the interaction between a controller and a cell using synchronous and asynchronous SOAP messages.
4.4.4.7 Access control module The Access Control Module (ACM) is not implemented in version 1.3 of the TS framework, although a placeholder is available. The run() method of the CellCommandPort object (Figure 4-6) is meant to hide the access control complexity.
4.4.4.8 Error management module The Error Management Module (EMM) catches all software exceptions not handled in the command and operation transition methods. When a command or transition method is executed due to a synchronous request message, the CellAbstract::command() method is responsible for catching any software exception. If one is caught, the method builds the reply message with the warning level set to 3000 (Appendix A) and a warning message describing the software exception. When the command or operation transition method is executed after an asynchronous request, all possible exceptions are caught in the same thread where the code() method runs; in this case, the thread itself builds the reply message with the adequate warning information. If the cell dies during the execution of a synchronous request, this is detected on the client side because the socket connection between the client and the cell is broken. If the request is sent in asynchronous mode, the request message is sent through a socket which is closed just after receiving the acknowledge message, and the reply message is sent through a second socket opened by the cell. Therefore, the client is not automatically informed if the cell dies, and it is the client’s responsibility to implement a time-out or a periodic “ping” routine to check that the cell is still alive.
4.4.4.9 Xhannel The xhannel infrastructure was implemented to simplify the access from a cell to external web service providers (SOAP, HTTP, etc.), for instance other cells. The cell xhannels are designed to hide the concrete details of the remote service provider protocol and to provide a homogeneous and simple interface. This infrastructure decouples the development of the external services from the cell customization process.
Four different xhannels are provided: CellXhannelCell (xhannel to other cells), CellXhannelTB (xhannel to Oracle-based relational databases), CellXhannelXdaqSimple (xhannel to access XDAQ applications through a SOAP interface) and CellXhannelMonitor (xhannel to access monitoring information collected in a XDAQ collector). Table 4-1 outlines the purpose of each of the xhannels.

Xhannel class name | Purpose (external service)
CellXhannelCell | To interact with other cells (Section 4.4.4.9.1)
CellXhannelTB | To interact with a Tstore application (Section 4.4.4.9.2)
CellXhannelXdaqSimple | To interact with a XDAQ application
CellXhannelMonitor | To interact with a monitor collector (Section 4.4.4.12)

Table 4-1: Cell xhannel types and their purpose.

Each CellXhannel class has an associated CellXhannelRequest class. The CellXhannel classes are in charge of hiding the process of sending and receiving messages, whilst the CellXhannelRequest classes are in charge of creating the SOAP or HTTP request messages and of parsing the replies. All concrete xhannel and request classes inherit from the CellXhannel and CellXhannelRequest base classes, respectively.
4.4.4.9.1 CellXhannelCell For instance, the CellXhannelCell class provides access to the services offered by remote cells, and the CellXhannelRequestCell class is used to create the SOAP request messages and to parse the replies. The CellXhannelCell class can handle both synchronous and asynchronous interaction modes. The asynchronous reply is caught because the CellXhannelCell is also a XDAQ application loaded in the same executive as the cell; a callback method in charge of processing all the asynchronous replies assigns them to the corresponding CellXhannelRequestCell object. A usage example is shown in Figure 4-13. First, the xhannel pointer is obtained from the cell context. Second, the xhannel object is used to create the request and the message (line 5). Third, the request is sent to the remote cell (line 7). Finally, when the reply is received (line 12), the request is destroyed (line 16). The definition of all xhannels available in a cell is made in an XML configuration file called the “xhannel list”. When the cell is started up, this file is processed and the xhannel objects are attached to the cell context. Figure 4-14 shows an example of an xhannel list file. The xhannel list should be referenced from the sub-system configuration file, as shown in Figure 4-5.
 1 CellXhannelCell* pXhannelCell = dynamic_cast<CellXhannelCell*>(contextCentral->getXhannel("GT"));
 2 CellXhannelRequestCell* req = dynamic_cast<CellXhannelRequestCell*>(pXhannelCell->createRequest());
 3 map<string, string> param;   // exact template arguments not legible in the original listing
 4 bool async = true;
 5 req->doCommand(currentSid_, async, "checkTriggerKey", param);
 6 try {
 7     pXhannelCell->send(req);
 8 } catch (xcept::Exception& e) {
 9     pXhannelCell->remove(req);
10     XCEPT_RETHROW(CellException, "Error sending request to Xhannel GT", e);
11 }
12 while (!req->hasResponse()) sleepmillis(100);
13 try {
14     LOG4CPLUS_INFO(getLogger(), "GT key is " + req->commandReply()->toString());
15 } catch (xcept::Exception& e) {
16     pXhannelCell->remove(req);
17     XCEPT_RETHROW(CellException, "Parsing error in the GT reply", e);
18 }
19 pXhannelCell->remove(req);
Figure 4-13: Example of how to use the xhannel to send SOAP messages to the GT cell.
Figure 4-14: Example of xhannel list file. This file corresponds to the central cell of the TS system and defines xhannels to the monitor collector, to a Tstore application and to the GT cell.
4.4.4.9.2 CellXhannelTB The CellXhannelTB class is another case of the xhannel infrastructure. It simplifies the development of the command and operation transition methods that need to interact with an Oracle database server. The CellXhannelTB provides read and write (insert and update) access to the database. Figure 4-15 shows the recommended architecture to access a relational database from a cell using this communication channel.
Figure 4-15: Recommended architecture to access a relational database from a cell.
The CellXhannelTB sends SOAP requests to an intermediate XDAQ application named Tstore, which is delivered with the XDAQ Power Pack package. Tstore allows reading and writing XDAQ table structures in an Oracle relational database. It is the agreed CMS-wide solution for the intermediate node between the sub-system online software and the central CMS database server, and it is designed to efficiently manage multiple connections with that server. The communication between Tstore and the server uses OCCI, the proprietary Oracle C++ Call Interface.
4.4.4.10 CellToolbox The CellToolbox package contains a number of classes intended to simplify the implementation of the cell. Table 4-2 presents the CellToolbox class list.

Class name | Functionality
CellException | Definition of the TS framework exception
CellToolbox | Several methods to create and parse SOAP messages
CellLogMacros | Macros to insert log statements
HttpMessenger | To send an HTTP request
SOAPMessenger | To send a SOAP message

Table 4-2: Class list of the CellToolbox package.
4.4.4.11 Graphical User Interface When a XDAQ executive is started up, a number of core components are loaded in order to provide basic functionality. One of the main core components is Hyperdaq. It provides a web interface which turns an executive into a browsable web application able to give access to the internal data structure of any XDAQ application loaded in the same executive [86]. Any XDAQ application can customize its own web interface by overriding the default() virtual method of the xdaq::Application class (Section 4.4.4.5). The web interface customization process requires developing Hypertext Markup Language (HTML) and JavaScript [88] code embedded in C++. Mixing three different languages in the same code has a cost associated with the learning curve: developers must learn two new languages, their syntax and best practices, and the testing and debugging methodology using a web browser.
Figure 4-16: Screenshot of the TS GUI. The GUI is accessible from a web browser and integrates the many services of the cell in a desktop-like fashion.

Ajaxell [89] is a C++ library intended to smooth this learning curve. It provides a set of graphical objects, or “widgets”, such as sliding windows, drop-down lists, tabs, buttons and dialog boxes. These widgets ease the development of web interfaces with a look-and-feel and responsiveness similar to stand-alone tools executed locally or through remote terminals (Java Swing, Tcl/Tk or C++ Qt; see Section 1.4.4). The web interface of the cell, implemented in the CellAbstract::default() method, uses the Ajaxell library. This is an out-of-the-box solution which does not require any additional development by the sub-systems. Figure 4-16 shows the TS GUI. It provides several controls: i) to execute cell commands; ii) to initialize, operate and kill cell operations; iii) to visualize monitoring information retrieved from a monitor collector; iv) to access the logging record for audit trails and post-mortem analysis; v) to populate the L1 trigger configuration database; vi) to request support; and vii) to download documentation. The cell web interface fulfills the requirement of automating the generation of a graphical user interface (Section 4.2.2). The default TS GUI can be extended with “control panels”. A control panel is a sub-system specific graphical setup, normally intended for expert operations of the sub-system hardware. The control panel infrastructure allows expert tools to be developed with the TS framework. This possibility opens the door for the migration of existing stand-alone tools (Section 1.4.4) to control panels, and therefore contributes to the harmonization of the underlying technologies for both the expert tools and the TS. This homogeneous technological approach has the following benefits: i) it smooths the learning curve of the operators; ii) it simplifies the overall L1 trigger OSWI maintenance; and iii) it enhances the sharing of code and experience. Implementing a sub-system control panel is equivalent to developing a SubsystemPanel class which inherits from the CellPanel class (Figure 4-6). This development consists of defining the SubsystemPanel::layout() method following the guidelines of the TS framework user’s guide and using the widgets of the Ajaxell library [90]. The Global Trigger control panel is presented as an example in Section 6.5.1.
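For illustration, a sub-system control panel skeleton could look as follows. This is not the actual TS framework API: the header name and any widget calls are assumptions; only the structure (a class deriving from CellPanel that overrides layout()) follows the description above.

#include "ts/framework/CellPanel.h"   // hypothetical header name

// Hypothetical expert panel for a sub-system crate
class GTExpertPanel : public CellPanel
{
public:
    virtual void layout()
    {
        // Build the panel from Ajaxell widgets, e.g. a tab container with one
        // tab per board and a button that reloads the board registers.
        // (Widget class names and calls are omitted here; the real Ajaxell API
        // is documented in the TS framework user's guide [90].)
    }
};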
4.4.4.12 Monitoring infrastructure The monitoring infrastructure allows the users of a distributed control system implemented with the TS framework to be aware of the state of the cells or of any of their components (e.g. CellContext, CellOperation, etc.). Once a monitoring item is declared and defined for one of the cells, it can be retrieved from any node of the system.
The TS framework actually uses the monitoring infrastructure of XDAQ plus one additional class (DataSource) to assist in the definition of the code that updates the monitoring data. The monitoring infrastructure has the following characteristics:
An interface to declare and define monitoring items (integers, strings and tables).
Centralized collection of monitoring data coming from monitoring items that belong to different cells of the distributed system.
The central collector provides an HTTP/CGI interface to consumers of monitoring data.
Visualization of the monitoring item history through tables and graphs from the GUI of any cell.
4.4.4.12.1 Model The XDAQ monitoring model is no longer based on FSMs, as had been proposed in Section 3.3.3.4. Figure 4-17 shows a distributed monitoring system implemented with the TS framework. A central node, known as the monitor collector, polls the monitoring information from each of the cells that has an associated monitor sensor. The monitor sensor forwards the requests to the cell and sends the updated monitoring information back to the collector. The collector is responsible for storing this information and for providing an HTTP/CGI interface. The GUIs of the cells use the collector interface to read updated monitoring information from any cell.
4.4.4.12.2 Declaration and definition of monitoring items The creation of monitoring items for a given cell consists of two steps: the declaration of the monitoring items and the definition of the monitoring update code. The declaration of a new monitoring item is accomplished by declaring the item in an XML file called a “flashlist”; one of these files exists per cell. The declaration step also requires inserting the path to this file in the configuration file of the corresponding monitor sensor application and of the central collector (Figure 4-18). The definition step requires creating the update code of the monitoring items using the DataSource class. The following sections present an example.
Figure 4-17: Distributed monitoring system implemented in the TS framework. The monitor collector polls the cell sensor through the sensor SOAP interface, and the system cells read monitoring data stored in the collector using the HTTP/CGI interface.
Figure 4-18: Sub-system cell configuration file configuring the cell sensor with one flashlist named flashlist1.xml.

Declaration Figure 4-19 presents an example of a flashlist. This file declares three monitoring items: item1 of type string, item2 of type int (integer) and table of type table. The monitoring items belong to an item group (or “infospace”) named monitorsource (see below, the definition of monitoring items). The name of the infospace is the same as the name of the DataSource descendant class that is used to define the update code of the monitoring items.
A dedicated tag in the flashlist embeds the definition of the parameters that the monitor collector uses to poll monitoring information from the sensors. The most important attributes are:
Attribute every: Defines the sampling period (the time unit is 1 second).
Attribute history: If true, the monitor collector stores the history of past values.
Attribute range: Defines the size of the monitoring history in time units.
Definition The classes involved in the definition of the monitoring item are shown in the UML diagram of Figure 4-20. The monitor collector is responsible for periodically sending SOAP messages to the cell sensors requesting updated monitoring data. Each monitor sensor translates the SOAP request into an internal event that is forwarded to all objects created inside a given XDAQ executive that belong to a descendant class of xdata::ActionListener.
Trigger Supervisor Framework
60
The DataSource class is a descendant of xdata::ActionListener. It is therefore able to process the incoming events by overriding the actionPerformed(xdata::Event&) method. This method is responsible for executing the MonitorableItem::refresh() method, which gets the updated value of the monitoring item. A sub-system specific descendant of DataSource is meant to contain the refresh methods for each of the monitoring items of the cell. The DataSource class is also responsible for creating the infospace object with the same name as declared in the flashlist (Figure 4-19).
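The following skeleton illustrates the idea of a sub-system monitoring handler; it is not the actual TS framework API. The header name and the way a refresh method is attached to a MonitorableItem are assumptions; only the idea of a DataSource descendant, named after the infospace declared in the flashlist and containing one refresh method per monitoring item, follows the description above.

#include <string>
#include "ts/framework/DataSource.h"   // hypothetical header name

// The class name matches the infospace name declared in the flashlist
class monitorsource : public DataSource
{
public:
    monitorsource()
    {
        // Hypothetical registration call associating the monitoring item
        // "item1" with its refresh method:
        // addMonitorable("item1", &monitorsource::refreshItem1);
    }

private:
    // Refresh method executed when the monitor sensor forwards a poll request
    std::string refreshItem1()
    {
        // e.g. read a status register through the sub-system hardware driver
        return "OK";
    }
};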
Figure 4-19: Declaration of monitoring items using a flashlist.
Figure 4-20: Components of the TS framework involved in the definition of monitoring items.
4.4.4.13 Logging infrastructure Each cell of a distributed control system implemented with the TS framework can send logging statements to a common logging database. Logging records can also be retrieved and visualized from any cell. Figure 4-21 shows the logging model for a distributed control system implemented with the TS framework. The architecture of the logging model consists of the following components:
Logging database: A relational database stores the logging information that is sent from the logging collector. The logging database is set up according to the schema proposed for the entire CMS experiment.
Logging collector: The logging collector is part of the RCMS framework (Section 4.4.2.7). It is a hub that accepts logging messages via the UDP protocol15. The collector filters the logging messages by logging level, if necessary, and relays them to other applications, databases or other instances of the logging collector.
Logging console: A XDAQ application named XS, included with the Work Suite package (Section 4.4.3), is used as a logging console to retrieve the logging information from the database. This application lists logging sessions according to their cell session identifier, i.e. the identifier of a session that a given controller has initiated with a distributed control system implemented with the TS framework. The logging console displays the logging messages, and the user can filter the messages of each session using keywords.
Logging Macros: The TS framework provides macros to emit log statements from inside the command and operation transition methods (see the sketch below). These macros accept a cell session identifier, a logger object and a message string. The cell session identifier is accessible in any command and operation, and the logger object is accessible from any descendant of the CellObject class (Section 4.4.4.2).
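As an illustration of what such a log statement looks like inside a command or transition method, the fragment below follows the LOG4CPLUS_INFO pattern already used in Figure 4-13. It is a fragment of a hypothetical command class, not a stand-alone program; the TS-specific macros of the CellLogMacros class additionally take the cell session identifier, whose exact signature is not reproduced here.

// Fragment of a hypothetical code() method of a cell command
void code()
{
    // Plain Log4cplus statement through the logger provided by CellObject
    LOG4CPLUS_INFO(getLogger(), "configuration of crate 3 started");

    // ... hardware access ...

    LOG4CPLUS_ERROR(getLogger(), "VME bus error while writing LUT memory");
}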
Figure 4-21: Logging model of a distributed control system implemented with the Trigger Supervisor framework.
15 User Datagram Protocol (UDP) is one of the core protocols (together with TCP) of the Internet protocol suite. Using UDP, programs on networked computers can send short messages sometimes known as datagrams to one another. UDP is sometimes called the Universal Datagram Protocol. UDP does not guarantee reliability or ordering in the way that the Transmission Control Protocol (TCP) does.
4.4.4.14 Start-up infrastructure The start-up infrastructure of the TS framework consists of one component, the job control (Section 1.4.1). This is a XDAQ application included as a component of the RCMS framework. The purpose of the job control application is to launch and terminate XDAQ executives. Job control is a small XDAQ application running on a XDAQ executive which is launched at boot time. It exposes a SOAP API which allows launching other XDAQ executives, each with its own set of environment variables, and terminating them. A distributed system implemented with the TS framework has a job control application running at all times in every host of the cluster. In this context, a central process manager would coordinate the operation of all job control applications running in the cluster.
4.5 Cell development model The TS framework, together with XDAQ and the external packages, forms the software infrastructure that facilitated the development of a single distributed software system to control and monitor all trigger sub-systems and sub-detectors. This section describes how to implement a cell to operate the hardware of a given sub-system. The integration of such a node into a complex distributed control and monitoring system is exemplified with the TS system presented in Chapter 5.
Steps shown in Figure 4-22: install framework; do cell; prepare cell context; prepare xhannels; do commands, operations, monitoring items and control panels; compile and test; loop.
Figure 4-22: Usage model of the TS framework. Figure 4-22 schematizes the development model associated with the TS framework. It consists of a number of initial steps common to all control nodes, and an iterative process intended to customize the functionalities of the node according to the specific operation requirements.
Install framework: The TS and XDAQ frameworks have to be installed on the CERN Scientific Linux machine where the cell should run. The installation details are described in the Trigger Supervisor framework user’s guide [63].
Do cell: Developing a cell consists of defining a class descendant of CellAbstract (Section 4.4.4.1).
Prepare cell context: The cell context, presented in Section 4.4.4.2, is an object shared among all CellObject objects that form a given cell. The CellAbstractContext object contains the logger, the xhannels and the factories. The cell context can be extended in order to store sub-system specific shared objects like a hardware driver. To extend the cell context it is necessary to define a class descendant of CellAbstractContext (e.g. SubsystemContext in Figure 4-6). The cell context object has to be created in the cell constructor and assigned to the context_ attribute. The cell context attribute can be accessed from any CellObject object, for instance a cell command or operation.
Prepare xhannel list file: The preparation of the xhannel list consists of defining the external web service providers that will be used by the cell: other cells, Tstore application to access the configuration database or any other XDAQ application (Section 4.4.4.9). Once the cell is running, the xhannels are accessible through the cell context object.
Do plug-in: Additional cell operations (Section 4.4.4.3), commands (Section 4.4.4.2), monitoring items (Section 4.4.4.12) and control panels (Section 4.4.4.11) can be gradually implemented when they are required. The details are described in the corresponding sections and in the TS framework user’s guide [63].
4.6 Performance and scalability measurements This section presents performance and scalability measurements of the TS framework. The discussion focuses on the framework factors that are most relevant to the ability to build a distributed control system complex enough to cope with the operation of O(10²) VME crates, assuming that each crate is directly operated by one cell. These factors are the remote execution of cell commands and operations using the TS SOAP API (Appendix A). The measurements are neither meant to evaluate external developments (i.e. the monitoring, database, logging and start-up infrastructures) nor the responsiveness of the TS GUI, which was presented in [90].
4.6.1 Test setup Timing and scalability tests have been carried out on the CMS PC cluster installed in the underground cavern. The tests ran on 20 identical rack-mounted PCs (Dell PowerEdge SC2850, 1U, dual Xeon 3 GHz, hyper-threading and 64-bit capable) equipped with 1 GB of memory and connected to the Gigabit Ethernet private network of the CMS cluster. All hosts run CERN Scientific Linux version 3.0.9 [91] with kernel version 2.4.21.40.EL.cernsmp and version 1.3 of the Trigger Supervisor framework. The most relevant performance factors of cell command and operation execution are presented. In order to evaluate the scalability of each factor under test, five distributed control system configurations have been set up. Table 4-3 summarizes the setups.
Setup name | # of hosts | # of Level-0 cells | # of Level-1 cells | # of Level-2 cells | Total # of cells | Notes
Central | 1 | 1 | 0 | 0 | 1 |
Central_10Level1 | 11 | 1 | 10 | 0 | 11 |
Central_10Level1_10Level2 | 20 | 1 | 10 | 10 | 21 | Level-2 cells are all in the same branch
Central_10Level1_20Level2 | 20 | 1 | 10 | 20 | 31 | Level-2 cells are distributed in 2 branches
Central_10Level1_100Level2 | 20 | 1 | 10 | 100 | 111 | Level-2 cells are equally distributed in 10 branches
Table 4-3: System configuration setups. Each table row specifies a test setup. A test setup consists of a number of cells organized in a hierarchical way. There is always one level-0 cell, or central cell, which coordinates the operation of up to 10 level-1 cells; depending on the setup, the level-1 cells also coordinate a number of level-2 cells. Figure 4-23 presents the architecture of the Central_10Level1_20Level2 setup. This setup consists of 1 central cell, 10 level-1 cells controlled by the central cell, and 10 level-2 cells controlled by each of the first and second level-1 cells.
4.6.2 Command execution This section measures the remote execution of cell commands. The study has been carried out with the Central_10Level1 setup. These tests measure the time necessary for the central cell to remotely execute a number of commands in the first level-1 cell. Each measurement starts when the first request message is sent from the central cell and finishes when the last reply arrives. The first exercise measures the time to execute commands whose code() method does nothing. Figure 4-24 shows the test results.
Figure 4-23: The Central_10Level1_20Level2 test setup consists of 1 central cell, 10 level-1 cells controlled by the central cell, and 10 level-2 cells controlled by each of the first and second level-1 cells.

The first conclusion that can be extracted from Figure 4-24 is that, in both the synchronous and the asynchronous communication case, the execution time scales linearly. A second conclusion is that there is a small time overhead due to the asynchronous protocol. For instance, the execution of 256 commands in synchronous mode takes 1.81 seconds, whilst the execution of the same number of commands in asynchronous mode takes 1.94 seconds. This overhead is due to the additional complexity of handling the asynchronous protocol in both the client (central cell) and the server (first level-1 cell). In synchronous mode the average time to execute a command is 7 ms, slightly better than the 7.7 ms obtained in asynchronous mode.
Figure 4-24: Summary of performance tests to study the remote execution of cell commands between the central cell and a level-1 cell.
However, the importance of this overhead disappears when the performance test reproduces a more realistic scenario in which the remote command executes a delay (delta). This delay in the code() method emulates, for instance, a hardware configuration sequence or a database access. Figure 4-25 summarizes the results of performance tests intended to study the remote execution of 256 cell commands between the central cell and a level-1 cell in synchronous and asynchronous mode (Y axis) as a function of delta (X axis). The results in synchronous mode increase approximately linearly with the level-1 cell command delay (delta), whilst the results in asynchronous mode remain constant as delta increases. The performance advantage of the asynchronous mode is already visible with as few as 2 messages and deltas as small as 20 milliseconds.
Figure 4-25: Summary of performance tests to study the remote execution of 256 cell commands between the central cell and a level-1 cell in synchronous and asynchronous mode.

This proves the suitability of the asynchronous protocol to improve the overall performance of a given controller. This feature is particularly appreciated during the configuration of the trigger sub-systems, because the asynchronous protocol allows the configuration process to start in parallel in all trigger sub-systems. The overall configuration time is therefore approximately the configuration time of the slowest sub-system rather than the sum of all configuration times.
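As a rough illustration with hypothetical configuration times (the real sub-system times are not quoted here): if nine sub-systems take 5 s each and one takes 30 s, the two protocols give approximately

T_{\mathrm{sync}} \approx \sum_{i} T_i = 9 \times 5\,\mathrm{s} + 30\,\mathrm{s} = 75\,\mathrm{s},
\qquad
T_{\mathrm{async}} \approx \max_{i} T_i = 30\,\mathrm{s},

where T_i is the configuration time of sub-system i.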
4.6.3 Operation instance initialization This section discusses the performance and scalability of the cell operation initialization. The test setups used for these measurements are: Central_10Level1, Central_10Level1_10Level2, Central_10Level1_20Level2 and Central_10Level1_100Level2. Each test consists of measuring the overall time necessary to initialize an operation in each node of the configuration setup. The measurement includes the operation initialization in the central cell plus the remote initialization in the sibling cells. The test finishes when the last reply message arrives at the central cell.
Figure 4-26: Total time to initialize an operation instance in all cells of a setup as a function of the number of cells.

Figure 4-26 shows the total time to initialize an operation instance in each cell as a function of the number of cells in the setup, and Figure 4-27 shows the same total time as a function of the number of cell levels in the setup. The tests were only done in the synchronous case because the operation initialization request is only available in synchronous mode (cell blocked). This interface constraint was set in order to ensure that no operation events are received before the operation instance is created.
Figure 4-27: Total time to initialize an operation instance in all cells of a setup as a function of the number of cell levels. Note that, due to the synchronous protocol, it is the number of cells in the setup that defines the total initialization time: the Central_10Level1_20Level2 and Central_10Level1_100Level2 setups have different total initialization times despite having the same number of levels (3).

The results show that the average time to initialize a cell operation is 13.4 ms. We can also conclude that the overall time to initialize one operation in each cell scales linearly with the number of cells, independently of the number of cell levels.
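As a consistency check of the linear scaling, the largest setup should take roughly the per-cell average multiplied by the number of cells:

T_{\mathrm{init}}(N) \approx N \times 13.4\,\mathrm{ms},
\qquad
T_{\mathrm{init}}(111) \approx 111 \times 13.4\,\mathrm{ms} \approx 1.49\,\mathrm{s},

which is consistent with the linear trend shown in Figure 4-26.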
4.6.4 Operation state transition This section discusses the performance and scalability of the cell operation transitions. The test setups used for these measurements are again: Central_10Level1, Central_10Level1_10Level2, Central_10Level1_20Level2 and Central_10Level1_100Level2. Each test consists of measuring the overall time necessary to execute an operation transition in each node of the configuration setup. The measurement includes the operation transition in the central cell plus the remote execution of an operation transition in the sibling cells. The test finishes when the last reply message arrives at the central cell. All cell operation transition methods have an internal delay of 1 second.
Figure 4-28: Total time to execute an operation transition in all cells of a setup as a function of the number of cells and in synchronous mode.
This time lapse, called “delta”, is defined in milliseconds and is meant to emulate a hardware configuration sequence and/or a database access. Figure 4-28 shows the total time to execute an operation transition in all cells of a setup as a function of the number of cells in the setup, in synchronous mode. It shows that, in synchronous mode, the overall execution time scales linearly with the number of cells and is therefore independent of the number of cell levels, as shown in Figure 4-29.
Figure 4-29: Total time to execute an operation transition in all cells of a setup as a function of the number of cell levels, in synchronous mode. Note that, due to the synchronous protocol, the number of cells in the setup defines the total execution time: the Central_10Level1_20Level2 and Central_10Level1_100Level2 setups have different total execution times despite having the same number of levels (3).

Figure 4-30 shows the total time to execute an operation transition in all cells of a setup as a function of the number of cells, in asynchronous mode. It shows that, in asynchronous mode, the overall execution time is much better than in the synchronous case for all test cases. This overall time equals the sum of the worst case at each level (1 second per level of the test setup). Figure 4-31 shows that in asynchronous mode the overall execution time scales linearly with the number of levels.
Figure 4-30: Total time to execute an operation transition in all cells of a setup as a function of the number of cells and in asynchronous mode.
Figure 4-31: Total time to execute an operation transition in all cells of a setup as a function of the number of cell levels and in asynchronous mode.
Chapter 5 Trigger Supervisor System 5.1 Introduction The TS system is a distributed software system, initially outlined in the TS conceptual design chapter (Section 3.3). It consists of a set of nodes and the communication channels among them. The TS system is designed to provide a stable platform, despite hardware and software upgrades, on top of which the TS services can be implemented following a well defined methodology. This approach implements the “flexibility” non-functional requirement discussed in Section 3.2.2, Point 6). This chapter is organized in the following sections: Section 5.1 is this introduction; Section 5.2 discusses the system design guidelines; Section 5.3 presents the system building blocks, the sub-system integration strategies and an overview of the system architecture; Section 5.4 describes the TS control, monitoring, logging and start-up systems. Finally, the service development process associated with the TS system is discussed in Section 5.5.
5.2 Design guidelines The TS system design principles, presented in this section, have two main sources of inspiration: i) the software infrastructure presented in Chapter 4, which consists of a number of external packages, the XDAQ middleware and the TS framework; and ii) the functional and non-functional requirements described in the TS conceptual design, with special attention to the “human context awareness” non-functional requirement (Section 3.2.2, Point 7) ), which already guided the design decisions of the TS framework.
5.2.1 Homogeneous underlying infrastructure The design of the TS system is solely based on the software infrastructure presented in Chapter 4, which consists of a number of external packages, the XDAQ middleware and the TS framework. A homogeneous underlying software infrastructure simplifies the support and maintenance tasks during the integration and operational phases. Moreover, the concrete usage of the TS framework was encouraged in order to profit from a number of facilities designed and developed to fulfill additional functional requirements and to cope with the project human factors and the reduced development time (Section 4.2.2).
5.2.2 Hierarchical control system architecture The TS control system has a hierarchical topology with a central cell that coordinates the operation of the lower-level sub-system central cells. These second-level cells are responsible for operating the sub-system crate or for coordinating a third level of sub-system cells that finally operate the sub-system crates. A hierarchical TS control system eases the implementation of the following system-level features:
1) Distributed development: Each sub-system always has one central cell exposing a well defined interface. This cell hides the implementation details of the sub-system control infrastructure from the TS central cell. This approach simplifies the role of a TS system coordinator, who then only needs to worry about the interface definition exposed by each sub-system central cell; the person responsible for the sub-system software takes care of implementing this interface. At the sub-system level, the development of the sub-system control infrastructure is further divided into smaller units following the same approach. This development methodology eased the central coordination tasks by dividing the overall system complexity into much simpler sub-systems which could be developed with minimal central coordination (a minimal sketch of this hierarchical propagation follows the list).
2) Sub-system control: The hierarchical design facilitates the independent operation of a given sub-system by operating the corresponding sub-system central cell interface. This feature fulfills the non-functional requirement outlined in Section 3.2.2, Point 2).
3) Partial deployment: The hierarchical design simplifies the partial deployment of the TS system by deploying only certain branches of it. This is useful, for instance, to create a sub-system test setup.
4) Graceful degradation: The hierarchical design facilitates a graceful degradation in line with the "Robustness" non-functional requirement stated in Section 3.2.2, Point 4). If something goes wrong during the system operation, only one branch of the hierarchy needs to be restarted.
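To make the hierarchical control idea concrete, the sketch below shows how a transition could propagate down a tree of control nodes and how a failure stays confined to one branch. It is a minimal, self-contained illustration; the class and method names are invented and do not correspond to the TS framework API, which communicates over SOAP rather than direct method calls.

#include <iostream>
#include <memory>
#include <string>
#include <vector>

// Illustrative only: a node of a hierarchical control tree. The real TS
// cells communicate over SOAP; here the propagation is a plain method call.
class ControlNode {
public:
    explicit ControlNode(std::string name) : name_(std::move(name)) {}

    void addChild(std::shared_ptr<ControlNode> child) {
        children_.push_back(std::move(child));
    }

    // Propagate a transition (e.g. "configure") down the branch rooted here.
    // Operating a sub-system alone amounts to calling this on its central node.
    bool execute(const std::string& transition) {
        std::cout << name_ << ": executing '" << transition << "'\n";
        for (const auto& child : children_) {
            if (!child->execute(transition)) {
                // Graceful degradation: only this branch needs recovery.
                std::cout << name_ << ": branch '" << child->name_ << "' failed\n";
                return false;
            }
        }
        return true;
    }

private:
    std::string name_;
    std::vector<std::shared_ptr<ControlNode>> children_;
};

int main() {
    auto central = std::make_shared<ControlNode>("central");
    auto dttf    = std::make_shared<ControlNode>("DTTF central");
    auto csctf   = std::make_shared<ControlNode>("CSCTF");
    dttf->addChild(std::make_shared<ControlNode>("DTTF crate 1"));
    dttf->addChild(std::make_shared<ControlNode>("DTTF crate 2"));
    central->addChild(dttf);
    central->addChild(csctf);

    central->execute("configure");   // global operation
    dttf->execute("configure");      // standalone sub-system operation
    return 0;
}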
5.2.3 Centralized monitoring, logging and start-up systems architecture
The TS framework uses the monitoring, logging and start-up infrastructure provided by the XDAQ middleware and the RCMS framework. This infrastructure is characterized by a centralized architecture. Therefore, the TS monitoring, logging and start-up systems cannot be purely hierarchical systems, as proposed in Section 3.3.3; this is the trade-off of reusing existing components.
5.2.4 Persistency infrastructure
The TS system requires a database infrastructure to store and retrieve configuration, monitoring and logging information. The following points present the design guidelines for this infrastructure.
5.2.4.1 Centralized access
A CMS wide architectural decision enforces the centralization of common services to access the persistency infrastructure. These common access points should provide a simple interface to the persistency infrastructure and should be responsible for managing the connections to the persistency server. The CMS database task force recommends using one single Tstore (Section 4.4.4.9.2) application for all nodes of the TS system.
5.2.4.2 Common monitoring and logging databases
The TS monitoring and logging systems (Sections 5.4.2 and 5.4.3) are based on XDAQ and RCMS infrastructure. In this context, single monitor and log collector applications periodically gather the monitoring and logging information, respectively, and provide an HTTP/CGI interface to any possible information consumer. These collectors are also responsible for storing the gathered information in the L1 trigger monitoring and logging databases. These two databases are common to all L1 trigger sub-systems.
5.2.4.3 Centralized maintenance
All TS databases are maintained in the central CMS database server (Oracle Database 10g Enterprise Edition Release 10.2.0.2, [92]) which is under the responsibility of the CMS and the CERN-IT database services.
5.2.5 Always-on system
The TS configuration and monitoring services are used to operate the L1 trigger when the experiment is running, but they are also used during the integration, commissioning and test operations of the L1 trigger in standalone mode. In addition, the TS services to test each of the L1 trigger sub-systems and to check the inter-sub-system
connections and synchronization are required outside the experiment running periods. Therefore, the TS system should always be available.
5.3 Sub-system integration
Figure 5-1 shows an overview of the TS system with the central node controlling twelve TS nodes, one per sub-system, including all L1 trigger sub-systems and sub-detectors: the Global Trigger (GT), the Global Muon Trigger (GMT), the Drift Tube Track Finder (DTTF), the Cathode Strip Chamber Track Finder (CSCTF), the Global Calorimeter Trigger (GCT), the Regional Calorimeter Trigger (RCT), the Electromagnetic Calorimeter (ECAL), the Hadronic Calorimeter (HCAL), the Drift Tube Sector Collector (DTSC), the Resistive Plate Chambers (RPC), the Tracker and the Luminosity Monitoring System (LMS). Each sub-system central node is the entry point for any controller that wishes to access only sub-system specific services. For some sub-systems, an additional level of TS nodes can be controlled by the sub-system central node.
Figure 5-1: Overview of the Trigger Supervisor system.
5.3.1 Building blocks
The following sections present the building blocks used to construct the TS system. The main role is played by the cell; in addition, the XDAQ and RCMS frameworks contribute a number of secondary elements.
5.3.1.1 The TS node
The TS node, shown in Figure 5-2, is the basic unit of a distributed system implemented with the TS framework. It has three main components: the cell, the monitor sensor and the job control. The cell is the element that has to be customized (Section 4.5); the monitor sensor is a XDAQ application intended to interact with the monitor collector, forwarding update requests to the cell and sending the updated monitoring information back to the monitor collector (Section 4.4.4.12). Finally, the job control is a building block of the start-up system (Section 4.4.4.14). The cell has two input ports exposing respectively the cell SOAP (s) and HTTP/CGI (h) interfaces and four output ports corresponding to the monitoring (mx), database (dx), cell (cx) and XDAQ (xx) xhannels (Section 4.4.4.9). The functionality of the cell is meant to be customized according to the specific needs of each sub-system. The customization process consists of implementing control panel, command and operation plug-ins, and adding monitoring items (Section 4.5). Those cells intended to directly control a sub-system crate should also embed the sub-system crate hardware driver (Section 4.5).
Figure 5-2: Components of a TS node. (s: SOAP interface, h: HTTP/CGI interface, xe: XDAQ executive, op: Operation plug-ins, c: Command plug-ins, m: monitoring item handlers, d: hardware driver, cp: control panel plug-in).
The sub-system cells are meant to act as abstractions of the corresponding sub-system hardware. These black boxes expose a stable SOAP API regardless of hardware and/or software upgrades. This provides a stable platform on top of which the TS services (Chapter 6) can be implemented, and allows the evolution of sub-system hardware and software platforms to be largely decoupled from changes in the operation capabilities offered by the TS.
5.3.1.2 Common services
The common services of the TS system, shown in Figure 5-3, are unique nodes of the distributed system which are used by all TS nodes. These nodes are the logging collector, the Tstore, the monitor collector and the Mstore.
Figure 5-3: Common service nodes. (tc: Tomcat server, u: UDP interface, x: XML local file, j: JDBC interface, xe: XDAQ executive, s: SOAP interface, o: OCCI interface, h: HTTP/CGI interface).
5.3.1.2.1 Logging collector
The logging collector or log collector [85] is a software component that belongs to the RCMS framework. It is a web application written in Java and running on a Tomcat server. It is designed and developed to collect logging information from log4j compliant applications and to distribute these logs to several consumers. These consumers can be: an Oracle database, files, other log collectors or a real time message system. The log collector is part of the TS logging infrastructure (Section 4.4.4.13).
5.3.1.2.2 Tstore
The Tstore is a XDAQ application delivered with the XDAQ Power Pack package. Tstore provides a SOAP interface which allows reading and writing XDAQ table structures in an Oracle database (Section 4.4.4.9.2). The CMS DataBase Working Group (DBWG) stated that having one single Tstore application for all cells of the TS system already ensures a suitable management of the database connections.
5.3.1.2.3 Monitor collector
The monitor collector is also a XDAQ application delivered with the XDAQ Power Pack package. This XDAQ application periodically pulls from all TS system sensors the monitoring information of all items declared in the
sub-system flashlist files. Each flashlist can be collected at regular intervals, providing the collector with a snapshot of the corresponding data values at retrieval time. Optionally, a history of data values can be buffered in memory at the collector node. This buffered data can be made persistent for later retrieval. The interface between sensor and collector is SOAP. The collector also provides an HTTP/CGI interface to read the monitoring information coming from the whole TS system. The monitor collector is part of the TS monitoring infrastructure (Section 4.4.4.12).
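A minimal sketch of the pull model just described, assuming a simplified key/value representation of a flashlist: a collector polls its sensors and keeps a bounded in-memory history of the returned snapshots. The types and names are invented for illustration; the real monitor collector exchanges SOAP messages and handles full flashlist rows.

#include <deque>
#include <functional>
#include <iostream>
#include <map>
#include <string>
#include <vector>

// Illustrative only: a "sensor" is modelled as a callable returning the
// current values of its monitoring items.
using Flashlist = std::map<std::string, double>;
using Sensor    = std::function<Flashlist()>;

class MonitorCollector {
public:
    explicit MonitorCollector(std::size_t historyDepth) : depth_(historyDepth) {}

    void addSensor(Sensor s) { sensors_.push_back(std::move(s)); }

    // One collection cycle: pull a snapshot from every sensor and buffer it.
    void collectOnce() {
        Flashlist snapshot;
        for (const auto& sensor : sensors_)
            for (const auto& [item, value] : sensor())
                snapshot[item] = value;
        history_.push_back(snapshot);
        if (history_.size() > depth_) history_.pop_front();   // bounded buffer
    }

    const std::deque<Flashlist>& history() const { return history_; }

private:
    std::size_t depth_;
    std::vector<Sensor> sensors_;
    std::deque<Flashlist> history_;
};

int main() {
    MonitorCollector collector(100);
    collector.addSensor([] { return Flashlist{{"crate.temperature", 34.5}}; });
    collector.addSensor([] { return Flashlist{{"link.errors", 0.0}}; });

    collector.collectOnce();   // in the real system this runs periodically
    for (const auto& [item, value] : collector.history().back())
        std::cout << item << " = " << value << "\n";
    return 0;
}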
5.3.1.2.4 Mstore
The Mstore is a XDAQ application delivered with the Work Suite package of XDAQ. It takes flashlist data from a monitor collector and forwards it to a Tstore application for persistent storage in a database.
5.3.2 Integration
All sub-systems use the same building blocks, presented in Section 5.3.1, to integrate with the TS system. However, each sub-system follows a particular integration model which depends on a number of parameters related to either the sub-system Online SoftWare Infrastructure (OSWI) or to the sub-system hardware setup. This section presents the definition of all integration parameters, the description of the most relevant integration models and finally a summary of all the integration exercises.
5.3.2.1 Integration parameters
This section presents the sub-system infrastructure parameters which were relevant during the integration process with the TS system. These have been separated into those related to the OSWI and those related to the sub-system hardware setup.
5.3.2.1.1 OSWI parameters
Usage of HAL
This parameter defines the low level software infrastructure used to access the sub-system custom hardware boards. The CMS recommendation for accessing VME boards is the Hardware Access Library (HAL [53]). HAL is a library that provides user-level access to VME and PCI modules in the C++ programming language. Most of the sub-systems follow the CMS recommendation to access VME boards, with the exception of the RCT and the GCT. In the GCT case, board control is provided by a USB interface and the GCT software infrastructure uses a USB access library. In the RCT case, a sub-system specific driver and user level C++ libraries were developed.
C++ API
On top of HAL or the sub-system specific hardware access library or driver, most of the sub-systems have developed a C++ library which offers a high level C++ API to control the hardware from a functional point of view.
XDAQ application
Some sub-systems have developed their own XDAQ application to remotely operate their hardware setups (Section 1.4.4). In some of these cases the sub-system XDAQ application is the visible interface to the hardware from the point of view of the cell.
Scripts
In addition to the compiled applications (i.e. C++ and XDAQ applications), some sub-systems have opted for an additional degree of flexibility, enhancing their OSWI with interpreted scripts. Python and HAL sequences are being used. Scripts are used to define test procedures but also to define configuration sequences. These configuration scripts tend to mix the configuration code with the configuration data. In the final system, configuration data is retrieved separately from the configuration database. However, during the commissioning phase, some sub-systems retrieve configuration scripts from the configuration database. This is an acceptable
practice because it helps to decouple the continuous firmware updates from the maintenance of a consistent configuration database.
5.3.2.1.2 Hardware setup parameters
Bus adapter
From the hardware point of view, the L1 trigger sub-system hardware is hosted in VME crates controlled by an x86/Linux machine. With few exceptions, the interface between the PC and the VME crate is a PCI to VME bus adapter [93].
Hardware crate types and number
These parameters tell us how many different types of crates and how many units of each type have to be controlled. It was decided to have a one-to-one relationship between cells and crates. In other words, each cell controls one single crate and each crate is controlled by only one cell. This approach enhances the reusability of the same sub-system cells in different hardware setups. For instance: 1) During the debugging phases in the home institute laboratory, and during the initial commissioning exercises, when just one or a few crates are available, a single cell controlling one single crate was developed in order to enhance the board debugging process. Afterwards, this cell was reused as part of a more complex control system. 2) During the system deployment in its final location, when the complete hardware setup must be controlled, all individual cells implemented during the debugging and commissioning exercises were reused and integrated into the corresponding sub-system control system. Exceptions to this rule are the GT, the GMT and the RPC integration models. Board level cells were discarded due to the higher complexity of the resulting distributed control system: controlling one single crate with a number of boards would already require a central cell coordinating as many cells as there are boards.
Hardware crate sharing
This parameter tells us whether or not a given sub-system crate is shared by more than one sub-system. This has to be taken into account because sharing a crate also means sharing the bus adapter.
5.3.2.2 Integration cases
The TS sub-systems presented in the following sections are examples of the main different integration cases. Each integration case corresponds to a different L1 trigger sub-system or sub-detector, and it is defined by the parameters presented in Sections 5.3.2.1.1 and 5.3.2.1.2. The result of each integration case is a set of building blocks and the communication channels among them.
5.3.2.2.1 Cathode Strip Chamber Track Finder
The hardware setup of the CSCTF is one single VME crate controlled by a PCI to VME bus adapter. The OSWI consists of C++ classes built on top of the HAL library. These classes offer a high level abstraction of the VME boards and facilitate their configuration and monitoring. The integration model for the CSCTF represents the simplest integration case. One single cell running in the CSCTF host was enough. The customization process of the CSCTF cell is based on using the C++ classes of the CSCTF OSWI to operate the crate.
5.3.2.2.2 Global Trigger and Global Muon Trigger
The integration of the GT and the GMT represents a special case because, despite being two different sub-systems, they share the same crate.
The integration model followed in this concrete case, shown in Figure 5-4, contradicts the rule of one cell per crate: two cells access the same crate. Compared to the single cell integration model, this approach has several advantages:
1) Smaller complexity: During the initial development process, we realized that the overall complexity of two individual cells was smaller than the complexity of one single cell. Therefore, this solution was easier to maintain.
2) Enhanced distributed development: The development work to integrate the GT and GMT sub-systems can be more easily split between two different developers working independently.
3) Homogeneous architecture: The interconnection test service between GT and GMT can be logically implemented like any other interconnection test service between two sub-systems hosted in different crates.
Concerning the OSWI, it consists of C++ classes built on top of HAL. Therefore, the definition of the cells' command and operation transition methods is based on using this API.
Figure 5-4: Integration model used by the GT and GMT.
5.3.2.2.3 Drift Tube Track Finder
The DTTF hardware setup consists of six identical track finder crates, one central crate and one clock crate. Due to limitations of the device driver specifications, it is not possible to have more than three PCI to VME interfaces per host. Therefore, the six track finder crates are controlled by two hosts, and an additional host controls the clock crate and the central crate. The OSWI is based on C++ classes built on top of HAL. Figure 5-5 shows the integration model followed by the DTTF. As usual, each crate is controlled by one cell. There are four different cells: 1) the track finder cell (TFC), which is in charge of controlling a track finder crate, 2) the clock crate cell (CKC), 3) the central crate cell (CCC) and 4) the DTTF central cell (DCC), which is in charge of coordinating the operation of all other cells. The DCC provides a single access point to operate all DTTF crates and simplifies the implementation of the TS central cell. The customization process of the DTTF crate cells (i.e. TFC, CKC and CCC) uses the C++ class libraries of the DTTF OSWI. Therefore, all crate cells must run in the same hosts where the PCI to VME interfaces are plugged in.
Figure 5-5: Integration model for the DTTF.
5.3.2.2.4 Resistive Plate Chamber
The OSWI of the RPC Trigger system consists of three different XDAQ applications that are used to control three different types of crates: 1) twelve RPC Trigger crates, 2) one RPC Sorter crate and 3) one RPC CCS/DCC crate. The integration model of the RPC with the TS is shown in Figure 5-6. In this case, the hardware interface is provided by XDAQ applications and these applications are operated by one cell, the RPC cell.
Figure 5-6: RPC integration model.
5.3.2.2.5 Global Calorimeter Trigger
The Global Calorimeter Trigger (GCT) hardware setup consists of one main crate and three data source card crates. The particularity of this hardware setup is that all boards are controlled independently through a USB
interface. Therefore, it is possible to control the four crates from one single host because the limitation of the CAEN driver does not exist. The OSWI consists of a C++ class library, a Python language extension and XDAQ applications. The low level OSWI for both the data source crates and the main crate is based on a C++ class library built on top of a USB driver. A second component of the GCT software is the Python extension that allows developing Python programs in order to create complex configuration and test sequences or simple hardware debugging routines without having to compile C++ code. The third component is a XDAQ application which allows remote access to the boards in the data source crates. Figure 5-7 shows the integration model followed by the GCT. This integration model maximizes the usage of the existing infrastructure. It consists of one single cell, which embeds a Python interpreter in order to execute Python sequences to configure the main crate. This same cell coordinates the operation of the data source crates through the remote SOAP interface of the GCT XDAQ applications.
Figure 5-7: GCT integration model. (d: Python interpreter, Python configuration sequences, Python extension and USB driver).
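The way the GCT cell embeds a Python interpreter to run configuration sequences can be illustrated with the standard CPython embedding API. This is only a sketch of the general technique under the stated assumption that the plain C API is used; it is not the GCT code, and the configuration sequence shown is invented.

// Sketch of embedding a Python interpreter in a C++ process, as the GCT cell
// does to run configuration sequences. Build with the Python development
// headers, e.g.: g++ embed.cpp $(python3-config --embed --cflags --ldflags)
#include <Python.h>
#include <string>

// Run one (hypothetical) configuration sequence inside the embedded interpreter.
bool runSequence(const std::string& script) {
    return PyRun_SimpleString(script.c_str()) == 0;   // 0 means success
}

int main() {
    Py_Initialize();   // start the embedded interpreter once per process

    // In the real cell the sequence would come from a file or the database
    // and would call the GCT Python extension; here it only prints.
    const std::string sequence =
        "print('configuring main crate...')\n"
        "print('done')\n";
    bool ok = runSequence(sequence);

    Py_Finalize();
    return ok ? 0 : 1;
}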
5.3.2.2.6 Hadronic Calorimeter
The HCAL sub-detector has its own supervisory and control system which is responsible for the configuration, control and monitoring of the sub-detector hardware and for handling the interaction with RCMS (Section 1.4.4). In addition to this infrastructure, an HCAL cell will provide the interface to the central cell to set the configuration key of the trigger primitive generator (TPG) hardware and to participate in the interconnection test service between the HCAL TPG and the RCT. The HCAL cell also exposes a SOAP interface that makes it easier for the HCAL supervisory software to read the information that is set by the central cell. The HCAL integration model is shown in Figure 5-8. This model is equally valid for the ECAL sub-detector.
Figure 5-8: HCAL integration model.
5.3.2.2.7 Trigger, Timing and Control System
The TTC hardware setup (Section 1.3.2.4) consists of one crate per sub-system with as many TTCci boards as there are TTC partitions assigned to the sub-system. Table 5-1 shows the TTC partitions and TTCci boards assigned to each sub-system. Some sub-systems share the same TTC crate. This is the case for: 1) DTTF and DTSC, 2) RCT and GCT, and 3) CSC and CSCTF. The GT has no TTCci board because the GTFE board receives the TTC signals from the TCS directly through the backplane.

Sub-system       | # of partitions | Partition names              | # of TTCci
Pixels           | 2               | BPIX, FPIX                   | 2
Tracker          | 4               | TIB/TID, TOB, TEC+, TEC-     | 4
ECAL             | 6               | EB+, EB-, EE+, EE-, SE+, SE- | 6
HCAL             | 5               | HBHEa, HBHEb, HBHEc, HO, HF  | 5
DT               | 1               | DT                           | 1
DTTF             | 1               | DTTF                         | 1
RPC              | 1               | RPC                          | 1
CSCTF            | 1               | CSCTF                        | 1
CSC              | 2               | CSC+, CSC-                   | 2
GT               | 1               | GT                           | 0
RCT              | 1               | RCT                          | 1
GCT              | 1               | GCT                          | 1
Totem and Castor | 2               | Totem, Castor                | 2
Totals           | 28              |                              | 27
Table 5-1: TTC partitions.
Figure 5-9: TTCci integration model.
The integration model for the TTCci infrastructure is shown in Figure 5-9. Every TTCci board is controlled by one TTCci XDAQ application. The central cell of each L1 trigger sub-system interacts with the TTCci XDAQ application through a TTCci cell. The TTCci cell retrieves the TTCci configuration information and passes it to the TTCci XDAQ application. The sub-detector TTCci boards are operated slightly differently: the sub-detector supervisory software interacts directly with the TTCci XDAQ application. The sub-detector central cell also has a TTCci cell which controls the TTCci XDAQ applications running in the sub-detector supervisory software tree. This additional control path is necessary to run TTC interconnection tests between the TCS module, located in the GT crate, and TTCci boards that belong to sub-detectors. The sub-detector TTCci cells can control more than one TTCci XDAQ application. The configuration of the L1 trigger sub-system TTCci boards is driven by the TS, whereas the configuration of the sub-detector TTCci boards is driven by the corresponding sub-detector supervisory software.
5.3.2.2.8 Luminosity Monitoring System
The Luminosity Monitoring System (LMS) provides beam luminosity information. The LMS cell uses the monitoring xhannel (Section 4.4.4.12) to retrieve information from the L1 trigger monitoring collector. This information is sent periodically to an LMS XDAQ application which gathers luminosity information from several sources and distributes it to a number of consumers, for instance the luminosity database. Figure 5-10 shows the LMS integration model.
Figure 5-10: LMS integration model.
5.3.2.2.9 Central cell
The central cell coordinates the operation of the sub-system central cells using the cell xhannel interface (Section 4.4.4.9). Figure 5-11 shows the integration model of the central cell with the rest of the sub-system central cells.
Figure 5-11: Central cell integration model.
5.3.2.3 Integration summary
Table 5-2 summarizes the most important parameters that define the integration model for each of the sub-systems, including L1 trigger sub-systems and sub-detectors.
Sub-system | HAL      | C++ API | XDAQ apps. | Scripts      | Crates (type/#)         | Shared crates                        | Cells (type/#)             | Integration case
GT         | Yes      | Yes     | No         | Yes (HAL)    | 1                       | GT/GMT                               | 1                          | Section 5.3.2.2.2
GMT        | Yes      | Yes     | No         | Yes (HAL)    | 1                       | GT/GMT                               | 1                          | Section 5.3.2.2.2
GCT        | No (USB) | Yes     | Yes        | Yes (Python) | GCA(3), GCB(1)          | No                                   | GCA(3), GCB(1), CN(1)      | Section 5.3.2.2.5
DTTF       | Yes      | Yes     | No         | No           | DA(6), DB(1), DC(1)     | DTTF crates host DTSC receiver board | DA(6), DB(1), DC(1), CN(1) | Section 5.3.2.2.3
CSCTF      | Yes      | Yes     | No         | No           | 1                       | No                                   | 1                          | Section 5.3.2.2.1
RCT        | No       | Yes     | No         | No           | RA(18), RB(1)           | No                                   | RA(18), RB(1), CN(1)       | Section 5.3.2.2.3
DTSC       | Yes      | Yes     | No         | No           | DA(10)                  | Receiver optical board in DTTF crate | DTA(10), CN(1)             | Section 5.3.2.2.3
RPC        | Yes      | Yes     | Yes        | No           | RPA(12), RPB(1), RPC(1) | No                                   | 1                          | Section 5.3.2.2.4
ECAL       | Yes      | Yes     | Yes        | No           | NA                      | NA                                   | 1                          | Section 5.3.2.2.6
HCAL       | Yes      | Yes     | Yes        | No           | NA                      | NA                                   | 1                          | Section 5.3.2.2.6
Tracker    | NA       | NA      | NA         | NA           | NA                      | NA                                   | 1                          | Section 5.3.2.2.6
LMS        | NA       | NA      | Yes        | NA           | NA                      | NA                                   | 1                          | Section 5.3.2.2.8
TTC        | Yes      | Yes     | Yes        | No           | 7                       | DTTF/DTSC, RCT/GCT, CSCTF/CSC        | 8                          | Section 5.3.2.2.7
CC         | NA       | NA      | NA         | No           | NA                      | NA                                   | 1                          | Section 5.3.2.2.9
(HAL, C++ API, XDAQ apps. and Scripts are online software related parameters; Crates and Shared crates are HW setup parameters; Cells and Integration case are TS system parameters.)
Table 5-2: Summary of integration parameters.
5.4 System integration
The TS system is formed by the integration of the local scope distributed systems presented in Section 5.3.2. The TS system itself can be described as four distributed systems with an overall scope: the TS control system, the TS monitoring system, the TS logging system and the TS start-up system. The following sections describe the node structure and the communication channels among them for each of the four TS systems.
5.4.1 Control system
The TS control system (TSCS) and the TS monitoring system (TSMS) are the main distributed systems with an overall scope. These two systems facilitate the development of the configuration, test and monitoring services
outlined in the conceptual design. Figure 5-12 shows the TSCS. It consists of the sub-system cells, one Tstore application, the sub-system relational databases and the communication channels among all these nodes. The TSCS is a purely hierarchical control system where each node can communicate only with the immediate lower level nodes. The central node of the TSCS uses its cell xhannel interface to coordinate the operation of the sub-system central cells. Sub-system central cells are responsible for coordinating the operation of all sub-system crates. The crate operation is done through an additional level of cells when the sub-system has more than one crate, or directly when the sub-system is contained in one single crate. Each sub-system has its own relational database that can be accessed from the sub-system cell using the Tstore xhannel interface. All database queries sent through the Tstore xhannel are centralized in the Tstore application. This node's task is to manage the connections with the database server and to translate the SOAP request messages into OCCI requests understandable by the Oracle database server (Section 4.4.4.9.2). The TSCS can be remotely controlled using the TS SOAP interface (Appendix A) or using the TS GUI. Both interfaces are accessible from any node of the TSCS. However, not all services are available in all nodes: the central node of the TSCS gives access to the global level services, the sub-system central nodes give access to the sub-system level services and, finally, the crate cells give access to the crate level services. The TS services are discussed in Chapter 6.
Figure 5-12: Architecture of the TS control system. (s: SOAP interface, h: HTTP/CGI interface, d: Hardware driver, cx: Cell xhannel interface (SOAP), dx: Tstore xhannel interface (SOAP), o: OCCI interface).
5.4.2 Monitoring system
The TS monitoring, logging and start-up systems are not hierarchical. These systems depend heavily on existing infrastructure provided by the XDAQ middleware or the RCMS framework, whose usage model is characterized by a centralized architecture (Section 4.4.4.12). The TS Monitoring System (TSMS), shown in Figure 5-13, is a distributed application intended to facilitate the development of the TS monitoring service. The TSMS consists of the same cells that participate in the TSCS, the sensor applications associated with each cell, one monitor collector application, one Mstore application, the Tstore application and the monitoring relational database. A TSCS cell that wishes to participate in the TSMS has to customize a class descending from the DataSource class. This class defines the code intended to create the updated monitoring information. The monitor collector periodically requests from the sensors of the TSMS, through a SOAP interface, the updated monitoring information of all items declared in the flashlist files (Section 4.4.4.12). The Mstore application is responsible for
embedding the collected monitoring information into a SOAP message and for sending it to the Tstore application in order to be stored in the monitoring database. A user of the TSCS can visualize any monitoring item of the TSMS with a web browser connected to the HTTP/CGI interface of any cell.
Figure 5-13: Architecture of the TS monitoring system.
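The cell-side customization mentioned above can be pictured as follows. The DataSource name is taken from the text, but the interface sketched here is an assumption made for illustration; the actual TS framework base class has its own signature and data types.

#include <iostream>
#include <map>
#include <string>

// Illustrative stand-in for the framework's DataSource base class: the
// framework calls update() when the sensor receives a request from the
// monitor collector.
class DataSource {
public:
    virtual ~DataSource() = default;
    virtual std::map<std::string, std::string> update() = 0;
};

// A hypothetical crate cell publishing two monitoring items.
class CrateMonitoringSource : public DataSource {
public:
    std::map<std::string, std::string> update() override {
        // In a real cell these values would be read through the hardware driver.
        return {
            {"firmwareVersion", "0x2a"},
            {"linkStatus", "OK"},
        };
    }
};

int main() {
    CrateMonitoringSource source;
    for (const auto& [item, value] : source.update())
        std::cout << item << " = " << value << "\n";
    return 0;
}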
5.4.3 Logging system
Figure 5-14 shows the TS Logging System (TSLS). The logging records are generated by any node of the TSCS and stored in the logging database. The TSLS also provides a filtering GUI embedded in the TS GUI of any cell, which allows any user to follow the execution flow of the TS system. The TS logging collector is responsible for filtering the logging information and for sending it to its final destinations, including the TS logging database. The persistent storage of logging records in the logging database facilitates the development of post-mortem analysis tools. The TS logging collector can also send the TS logging records to a number of additional destinations: i) a central CMS logging collector intended to gather all logging information from the CMS online software infrastructure, ii) an XML file and iii) a GUI-based log viewer (Chainsaw [94]).
5.4.4 Start-up system
Figure 5-15 shows the TS Start-up System (TSSS). The TSSS enables the TSCS, or any subset of its nodes, to be started up remotely. The TSSS consists of one job control application in each host of the TS cluster. Each job control application exposes a SOAP interface which allows starting or killing an application in the same host. The job control applications are installed as operating system services and are started at boot time. A central process manager coordinates the operation of the job control applications in order to start or stop the TS nodes.
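A minimal sketch of the start-up pattern described above: a central manager asks the job control of every host to start a given list of applications. The SOAP exchange is replaced by a plain function and all host and application names are hypothetical.

#include <iostream>
#include <map>
#include <string>
#include <vector>

// Illustrative only: stands in for the SOAP request sent to a host's job control.
bool sendJobControlRequest(const std::string& host, const std::string& action,
                           const std::string& application) {
    std::cout << host << ": " << action << " " << application << "\n";
    return true;   // a real manager would inspect the SOAP reply
}

// Central manager starting a subset of the TS nodes (e.g. one branch).
bool startNodes(const std::map<std::string, std::vector<std::string>>& plan) {
    for (const auto& [host, applications] : plan)
        for (const auto& app : applications)
            if (!sendJobControlRequest(host, "start", app))
                return false;
    return true;
}

int main() {
    // Hypothetical deployment plan: host -> applications to start on it.
    std::map<std::string, std::vector<std::string>> plan = {
        {"dttf-host-1", {"xdaq-executive:TFC-cell", "xdaq-executive:sensor"}},
        {"central-host", {"xdaq-executive:central-cell"}},
    };
    return startNodes(plan) ? 0 : 1;
}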
5.5 Services development process
The Trigger Supervisor Control System (TSCS) and the Trigger Supervisor Monitoring System (TSMS) provide a stable layer on top of which the TS services have been implemented following a well defined methodology [95]. Figure 5-16 schematizes the TS service development model associated with the TS system. The following description explains each of the steps involved in the creation of a new service.
Entry cell definition: The first step to implement a service is to designate the cell of the TSCS that facilitates the client interface. This cell is known as Service Entry Cell (SEC). When the service involves more than one sub-system, the SEC is the TS central cell. When the scope of the service is limited to a given sub-system, the SEC is the sub-system central cell. Finally, when the service scope is limited to a single crate, the SEC is the corresponding crate cell.
Operation states: The second step is to identify the operation states. These represent the stable states of the system under control that are to be monitored during the operation execution. For instance, a possible configuration operation intended to set up one single crate could have as many states as boards, and the successful configuration of each board could be represented as a different operation state.
Operation transition: Once the FSM states are known, the next step is to define the possible transitions among stable states and for each transition identify an event that triggers this transition.
Operation transition methods: For each FSM transition, the conditional and functional methods and their associated parameters have to be defined. These methods actually perform the system state change. In case the SEC is a crate cell, these methods use the hardware driver, located in the cell context, to modify the crate state. When the SEC is a central cell, these methods use the xhannel infrastructure to operate lower level cells and XDAQ applications, and to read monitoring information. New services may require new operations, commands and monitoring items in lower level cells. The developer of the SEC is responsible for coordinating the required developments in the lower level cells (a minimal sketch of such an operation follows this list).
Service test: The last step of the process is to test the service.
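As announced above, the outcome of these steps can be pictured as a finite state machine whose transitions carry a conditional and a functional method, mirroring the _c()/_f() naming convention used in Chapter 6. The sketch below is self-contained and purely illustrative; it is not the TS framework API.

#include <functional>
#include <iostream>
#include <map>
#include <string>
#include <utility>

// Illustrative operation FSM: each transition has a conditional method
// (may veto the transition) and a functional method (performs the change).
class Operation {
public:
    struct Transition {
        std::string target;
        std::function<bool()> conditional;   // e.g. Configure_c()
        std::function<void()> functional;    // e.g. Configure_f()
    };

    explicit Operation(std::string initial) : state_(std::move(initial)) {}

    void addTransition(const std::string& from, const std::string& event, Transition t) {
        transitions_[{from, event}] = std::move(t);
    }

    bool fire(const std::string& event) {
        auto it = transitions_.find({state_, event});
        if (it == transitions_.end() || !it->second.conditional())
            return false;                    // transition refused: state unchanged
        it->second.functional();
        state_ = it->second.target;
        return true;
    }

    const std::string& state() const { return state_; }

private:
    std::string state_;
    std::map<std::pair<std::string, std::string>, Transition> transitions_;
};

int main() {
    Operation op("halted");
    op.addTransition("halted", "configure",
        {"configured",
         [] { std::cout << "checking hardware access\n"; return true; },
         [] { std::cout << "writing registers and LUTs\n"; }});
    op.fire("configure");
    std::cout << "state: " << op.state() << "\n";   // prints "configured"
    return 0;
}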
Although changes to the L1 decision loop hardware and associated software platforms are expected during the operational life of the experiment, these changes may occur independently of the requirement for new services or the evolution of existing ones. The TS system is a software infrastructure that provides a stable abstraction of the L1 decision loop despite hardware and software upgrades. The stable layer of the TS system enables the development of new services to be coordinated solely by following a well defined methodology, with very limited knowledge of the TS framework internals and independently of hardware and software platform upgrades. This approach to coordinating the development of new L1 operation capabilities fits the professional background and experience of managers and technical coordinators well. Chapter 6 presents the result of applying this methodology to implement the configuration and interconnection test services outlined in Section 3.3.3.
Figure 5-14: Architecture of the TS logging system.
Figure 5-15: Architecture of the TS start-up system.
Figure 5-16: TS services development model (Entry cell → Operation states → Operation transitions → Operation transition methods → Service test).
Chapter 6 Trigger Supervisor Services
6.1 Introduction
The TS services are the final Trigger Supervisor functionalities developed on top of the TS control and monitoring systems. They have been implemented following the TS services development process described in Section 5.5, with the functional descriptions outlined in Section 3.3.3 as initial guidelines. The logging and start-up systems provide the corresponding final services directly and do not require any further customization beyond the system integration presented in Section 5.4. Guided by the "controller decoupling" non-functional requirement presented in Section 3.2.2, Point 3), the TS services were implemented entirely on top of the TS system and did not require the implementation of any functionality on the controller side. This approach simplified the development of controller applications, and it eased the deployment and maintenance of the TS system and services. The goal of this chapter is to describe, for each service, the functionality seen by an external controller, the internal implementation details from the TS system point of view and, finally, the service operational use cases. This chapter is organized in the following sections: Section 6.1 is the introduction, the configuration service is presented in Section 6.2, Section 6.3 is dedicated to the interconnection test service, Section 6.4 describes the monitoring service, and finally Section 6.5 presents the graphical user interfaces.
6.2 Configuration
6.2.1 Description
The TS configuration service facilitates setting up the L1 trigger hardware. It defines the content of the configurable items: FPGA firmware, LUTs, memories and registers. Figure 6-1 illustrates the client's point of view when operating the L1 trigger with this service. In general, the TS Control System (TSCS) provides two interfaces to access the TS services: a SOAP based protocol for remote procedure calls (Appendix A) and the TS GUI based on the HTTP/CGI protocol (Section 4.4.4.11 and Section 6.5). Both interfaces to the central cell expose all TS services. The following description presents the service operation instructions without the SOAP or HTTP/CGI details. Up to eight remote clients can use this service simultaneously in order to set up the L1 trigger and the TTC system (Sections 1.3.2 and 1.4.5). The first client that connects to the central cell initiates a configuration operation and executes the first transition, configure, with a key assigned to the TSC_KEY parameter. The key corresponds to a full configuration of the L1 trigger which is common for all DAQ partitions. When the configure transition finalizes, the L1 trigger system should be in a well defined working state. Additional clients attempting to operate with the configuration service have to initiate another configuration operation and also to
execute the configure transition. To avoid configuration inconsistencies, these additional clients have to provide the same configuration TSC_KEY parameter, otherwise they are not allowed to reach the configured state. All clients can execute the partition transition with a second key assigned to the TSP_KEY parameter and the run number assigned to the Run Number parameter. This key identifies the configurable parameters of the L1 decision loop which are exclusive to the DAQ partition that the corresponding client is controlling. The following list presents these parameters:
TTC vector: This 32-bit vector identifies the TTC partitions assigned to a DAQ partition.
DAQ partition: This number from 0 to 7 defines the DAQ partition.
Final-Or vector: This vector defines which algorithms of the trigger menu (128 bits) and technical triggers (64 bits) should be used to trigger a DAQ partition.
BX Table: This table defines which bunch crossings should be used for triggering and which fast and synchronization signals should be sent to the TTC partitions belonging to one DAQ partition.
Figure 6-1: Client point of view of the TS configuration service.
The enable transition starts the corresponding DAQ partition controller in the TCS module. The suspend transition temporarily stops the partition controller without resetting the associated counters. The resume transition facilitates the recovery of the normal running state. Finally, the stop transition, which can be executed from either the suspended or the enabled state, stops the DAQ partition controller and resets all associated counters.
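Summarizing the client's view, the call sequence against the central cell looks roughly as follows. The helper function stands in for the SOAP or GUI interaction (the real messages are defined in Appendix A), and the keys, session identifier and run number are invented examples.

#include <iostream>
#include <string>
#include <vector>

// Illustrative stand-in for a SOAP call to the central cell; the real
// message format is defined in Appendix A.
void callCentralCell(const std::string& request,
                     const std::vector<std::string>& parameters) {
    std::cout << request;
    for (const auto& p : parameters) std::cout << " " << p;
    std::cout << "\n";
}

int main() {
    // Hypothetical identifiers and keys, for illustration only.
    const std::string opid = "opid_1";

    callCentralCell("OpInit", {"configuration", "session_id1", opid});
    callCentralCell("configure", {opid, "TSC_KEY=physics_v1"});           // common L1 setup
    callCentralCell("partition", {opid, "TSP_KEY=partition_0", "RunNumber=4242"});
    callCentralCell("enable",    {opid});                                 // start partition controller
    callCentralCell("suspend",   {opid});                                 // pause without resetting counters
    callCentralCell("resume",    {opid});
    callCentralCell("stop",      {opid});                                 // stop and reset counters
    return 0;
}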
6.2.2 Implementation
The configuration service requires the collaboration of the TSCS nodes, the Luminosity Monitoring Software System (LMSS), the sub-detector supervisory and control systems (SSCS), and the usage of the L1 trigger configuration databases. All involved nodes are shown in Figure 6-2.
Figure 6-2: Distributed software and hardware system involved in the implementation of the TS configuration and interconnection test services.
6.2.2.1 Central cell
The role of the central cell in the configuration service is twofold: to facilitate the remote client interface presented in Section 6.2.1 and to coordinate the operation of all involved nodes. Both the interface to the client and the system coordination are defined by the configuration operation installed in the central cell (Figure 6-1). This section describes the stable states, and the functional (fi) and conditional (ci) methods of the central cell configuration operation transitions (Section 4.3.1).
Initialization()
This method stores the session_id parameter in an internal variable of the configuration operation instance. This number will be propagated to lower level cells when a cell command or operation is instantiated. The session_id is attached to every log record in order to help identify which client directly or indirectly executed a given action in a cell of the TSCS.
Configure_c()
The conditional method of the configure transition checks whether this is the first configuration operation instance. If this is the case, this method disables the isConfigured flag, iterates over all cell xhannels accessible from the central cell and initiates a configuration operation in all trigger sub-system central cells with the same session_id provided by the client. If one of these configuration operations cannot be successfully started, this method returns false, the functional method of the configure transition is not executed and the operation state stays halted. This method does not retrieve information from the configuration database. In case this is not the first configuration operation instance, this method checks whether the parameter TSC_KEY is equal to the variable TSC_KEY stored in the cell context. If it differs, the configure transition is not executed and the operation state stays halted. Otherwise, this method enables the isConfigured flag, returns true and the functional method of the configure transition is executed.
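The logic of Configure_c() described above can be condensed into a few lines of illustrative, self-contained C++. This is a paraphrase for clarity, not the cell's actual code; the context structure and helper function are assumptions.

#include <iostream>
#include <string>

// Illustrative cell context shared by all configuration operation instances.
struct CellContext {
    std::string tscKey;       // key of the configuration currently applied
    bool isConfigured = false;
    int  openOperations = 0;  // configuration operations currently instantiated
};

// Placeholder for starting configuration operations in all sub-system central
// cells; the real method iterates over the cell xhannels.
bool startSubsystemOperations(const std::string& /*sessionId*/) { return true; }

// Conditional method of the configure transition, as described in the text.
bool configureConditional(CellContext& ctx, const std::string& requestedKey,
                          const std::string& sessionId) {
    if (ctx.openOperations == 1) {             // first configuration operation
        ctx.isConfigured = false;
        return startSubsystemOperations(sessionId);
    }
    if (requestedKey != ctx.tscKey) {          // additional client, wrong key
        std::cout << "TSC_KEY mismatch: staying halted\n";
        return false;
    }
    ctx.isConfigured = true;                   // already configured with this key
    return true;
}

int main() {
    CellContext ctx;
    ctx.openOperations = 1;
    if (configureConditional(ctx, "physics_v1", "session_id1"))
        ctx.tscKey = "physics_v1";             // Configure_f() would store the key
    std::cout << "stored key: " << ctx.tscKey << "\n";
    return 0;
}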
Configure_f()
The functional method for this transition performs the following steps:
1. If the isConfigured flag is false, the method executes steps 2, 3, 4 and 5. Otherwise this method does nothing.
2. To read from the TSC_CONF table, shown in Figure 6-3, of the central cell configuration database the row with the unique identifier equal to TSC_KEY. This row contains as many identifiers as sub-systems have to be configured (sub-system keys). If a sub-system shall not be configured, the corresponding position in the TSC_KEY row is left empty.
3. To execute the configure transition in each sub-system central cell sending as a parameter the sub-system key. This transition is not executed in those sub-systems with an empty key. Section 6.2.2.2 presents the configuration operation of the sub-system central and crate cells.
4. To store in the cell context the current TSC_KEY.
Figure 6-3: The L1 configuration database structure is organized in a hierarchical way. The main table is named TSC_CONF.
Partition_c()
This method performs the following steps:
1. To read from the TSP_CONF table (Figure 6-4) the row with the unique identifier equal to TSP_KEY. This row points to the hardware configuration parameters that affect just the concrete DAQ partition, namely: the 32-bit TTC vector, the DAQ partition identifier, the 128 + 64 bit Final-Or vector and the bunch crossing table.
2. To use the GT cell commands to check that the DAQ partition and the TTC partitions are not being used. If there is an inconsistency, this method returns false, the functional method of the partition transition is not executed and the operation state stays configured. Section 6.2.2.3.1 presents the GT cell commands.
Partition_f()
This method performs the following steps:
1. To read from the TSP_CONF table the row with the unique identifier equal to TSP_KEY.
2. To execute the GT cell commands (Section 6.2.2.3.1) in order to:
   a. Set up the DAQ partition dependent parameters retrieved in the first step.
   b. Reset the DAQ partition counters.
   c. Assign the Run Number parameter to the DAQ partition.
TSP_CONF: TSP_KEY | TTC_VECTOR | FIN_OR | DAQ_PARTITION | BC_TABLE
Figure 6-4: The database table that stores DAQ partition dependent parameters is named TSP_CONF.
Enable_c()
This method checks whether this is the first configuration operation instance. If this is the case, this method disables the isEnabled flag. Otherwise, this method enables the isEnabled flag and checks in all trigger sub-system central cells that the configuration operation is in the configured state.
Enable_f()
The functional method of the enable transition performs the following steps:
1. If the isEnabled flag is disabled, the method executes steps 2 and 3. Otherwise this method only executes step 3.
2. To execute the enable transition in the configuration operation of all sub-system central cells. This enables the trigger readout links with the DAQ system and the LMS software.
3. To execute the GT cell commands to start the DAQ partition controller in the TCS module.
Suspend_c()
This method performs no checks.
Suspend_f()
This method executes in the GT cell a number of commands that simulate a busy sTTS signal (Section 1.3.2.4) in the corresponding DAQ partition. The procedure stops the generation of L1As and TTC commands in this DAQ partition. Section 6.2.2.3 presents these commands.
Resume_c()
This method performs no checks.
Resume_f()
This method executes in the GT cell a command that disables the simulated busy sTTS signal that was enabled in the functional method of the suspend transition. Section 6.2.2.3.1 presents these commands.
Stop_c()
This method performs no checks.
Stop_f()
This method executes in the GT cell the command to stop a given DAQ partition (Section 6.2.2.3.1).
Destructor()
This method is executed when the remote client finishes using the configuration operation service and destroys the configuration operation instance. The destructor method of the last configuration operation destroys the configuration operations running in the sub-system central cells. This stops the trigger readout links with the DAQ system and the LMS software.
6.2.2.2 Trigger sub-systems
Each trigger crate is configured by a configuration operation running on a dedicated cell for that crate (Section 5.3.2.1.2). A configuration operation provided by the sub-system central cell coordinates the operation over all crate cells. When a trigger sub-system consists of one single crate, the central cell and the crate cell are the same. A complete description of all integration scenarios was presented in Section 5.3.2.2. Figure 6-5 shows the configuration operation running in all trigger sub-system cells. The description of the functional and conditional methods depends on whether the cell is a crate cell or not. This is a generic description that can be applied to any trigger sub-system; it is not meant to provide the specific hardware configuration details of a concrete trigger sub-system. Specific sub-system configuration details can be checked in the code itself [96]. This section describes the stable states, and the functional (fi) and conditional (ci) methods of the trigger sub-system cell configuration operation transitions. This description includes the sub-system central and crate cell cases.
Figure 6-5: Trigger sub-system configuration operation (states: halted, configured, enabled, suspended; transitions: configure(KEY), enable, suspend, resume).
Initialization()
This method stores the session_id parameter in an internal variable of the configuration operation instance. If the current operation instance was started by the central cell, the session_id is the same as the one provided by the central cell client.
Configure_c()
If the operation runs in the trigger sub-system central cell, this method iterates over all available cell xhannels and initiates a configuration operation in all crate cells and TTCci cells (if the trigger sub-system has a TTCci board). If the operation runs in a crate cell, this method checks if the hardware is accessible using the hardware driver. If one of these configuration operations cannot be successfully started or the hardware is not accessible, this method returns false, the functional method of the configure transition is not executed and the operation state stays halted.
Configure_f()
The functional method for this transition performs the following steps:
1. To read from the trigger sub-system configuration database the row with the unique identifier equal to KEY. If the operation runs in the trigger sub-system central cell, this row contains as many identifiers as crate cells. If a crate cell is not going to be configured, the corresponding position in the KEY row is left empty. If the operation runs in a crate cell, this row contains configuration information, links to firmware or look up table (LUT) files and/or references to additional configuration database tables. Section 6.2.2.3 presents the GT configuration database example.
2. If the operation runs in the trigger sub-system central cell, this method executes in each crate cell and TTCci cell the configure transition, sending as a parameter the crate or TTCci key. If the operation runs in a crate cell, the configuration information is retrieved from the configuration database using the database xhannel and the crate is configured with this information using the hardware driver.
Enable_c()
If the operation runs in the trigger sub-system central cell, this method iterates over all available cell xhannels and checks if the current state is configured. If the operation runs in a crate cell, this method checks if the hardware is accessible using the hardware driver. If one of these configuration operations is not in the configured state or the hardware is not accessible, this method returns false, the functional method of the enable transition is not executed and the operation state stays configured.
Enable_f()
If the operation runs in the trigger sub-system central cell, this method iterates over all available cell xhannels and executes the enable transition. If the operation runs in a crate cell, this method configures the hardware in order to enable the readout link with the DAQ system.
Suspend_c()
If the operation runs in the trigger sub-system central cell, this method iterates over all available cell xhannels and checks if the current state is enabled. If the operation runs in a crate cell, this method checks if the hardware is accessible using the hardware driver. If one of these configuration operations is not in the enabled state or the hardware is not accessible, this method returns false, the functional method of the suspend transition is not executed and the operation state stays enabled.
Suspend_f()
If the operation runs in the trigger sub-system central cell, this method iterates over all available cell xhannels and executes the suspend transition. If the operation runs in a crate cell, this method configures the hardware in order to disable the readout link with the DAQ system.
Resume_c()
If the operation runs in the trigger sub-system central cell, this method iterates over all available cell xhannels and checks if the current state is suspended. If the operation runs in a crate cell, this method checks if the hardware is accessible using the hardware driver. If one of these configuration operations is not in the suspended state or the hardware is not accessible, this method returns false, the functional method of the resume transition is not executed and the operation state stays suspended.
Resume_f()
If the operation runs in the trigger sub-system central cell, this method iterates over all available cell xhannels and executes the resume transition. If the operation runs in a crate cell, this method configures the hardware in order to enable again the readout link with the DAQ system.
Destructor()
The destructor method of the trigger sub-system central cell configuration operation is executed by the destructor method of the last TS central cell configuration operation. If the operation runs in the trigger sub-system central cell, this method iterates over all available cell xhannels and destroys all configuration operations. If the operation runs in a crate cell, this method configures the hardware in order to disable the readout link with the DAQ system.
6.2.2.3 Global Trigger
The GT cell operates the GT, where L1A decisions are taken based on trigger objects delivered by the GCT and the GMT (Section 1.3.2.3). The GT cell plays a special role in the configuration of the L1 trigger: it provides a set of cell commands used by the central cell configuration operation and an implementation of the trigger sub-system configuration operation presented in Section 6.2.2.2. This section presents the interface of the GT cell [97] involved in the configuration and the interconnection test services.
6.2.2.3.1 Command interface
The GT command interface is used by the configuration and interconnection test operations running in the central cell, and also by the GT control panel (Section 6.5.1). The command interface has been designed mostly according to the needs of these clients. The commands can be classified according to the GT boards they act on: Trigger Control System (TCS), Final Decision Logic (FDL) and Global Trigger Logic (GTL).
FDL commands
The FDL is one of the GT modules that are configured during the partition transition of the central cell configuration operation (Section 6.2.2.1). FDL commands are used, for instance, to set up the Final-Or of the FDL for a given DAQ partition, to monitor the L1A rate counters for each of the 192 L1A's (FDL slices) coming from the GTL, or to apply a pre-scaler to a certain algorithm or technical trigger. A sketch of one such command is given after the command list.
NAME | TYPE | VALID VALUES
Number of slice | xdata::UnsignedShort | The number of FDL slices depends on the firmware. Currently there are 192 slices foreseen on the FDL; valid values are therefore [0:191].
DAQ partition | xdata::UnsignedShort | The number of DAQ partitions is 8; valid values are therefore [0:7].
Pre-scale factor | xdata::UnsignedLong | The pre-scale factor of a slice is determined by a 16-bit register; the range of valid values is [0:65535].
Update step size | xdata::UnsignedLong | The update step size is determined by a 16-bit register; the range of valid values is [0:65535].
Bit for refresh rate | xdata::UnsignedShort | Each of 8 bits refers to a different multiplicity defined in the firmware of the FDL; valid values are between [0:7].
Table 6-1: Description of parameters used in FDL commands.

SetFinOrMask: Description:
Each slice can be added to the Final-Or of one or more DAQ partitions. This command adds or removes a specific slice to or from a DAQ partition’s Final-Or according to the ”Enable for Final-Or” parameter.
Parameters:
Number of slice Number of DAQ partition Enable for Final-Or
Return value:
Slice number: ”Number of slice” ”enabled/disabled” for Final-Or in DAQ partition number: ”Number of DAQ partition”
GetFinOrMask: Description:
Reads out whether a slice is currently part of the Final-Or of a certain DAQ partition.
Parameters:
Number of slice Number of DAQ partition
Return value:
xdata::Boolean
SetVetoMask: Description:
Each slice can suppress a L1A for one or more DAQ partitions. This command enables or disables that mechanism for a given slice and DAQ partition.
Parameters:
Number of slice Number of DAQ partition Enable for veto
Return value:
Slice number: ”Number of slice” ”enabled/disabled” as veto for DAQ partition number: ”Number of DAQ partition”
GetVetoMask: Description:
Reads if a certain slice is currently defined as veto for a certain DAQ partition.
Parameters:
Number of slice Number of DAQ partition
Return value:
xdata::Boolean
SetPrescaleFactor: Description:
To control L1A rates that are too high, a pre-scale factor can be applied individually to each slice. Setting the factor to 0 or 1 applies no pre-scaling.
Parameters:
Number of slice Pre-scale factor
Return value:
Pre-scale factor of slice Number: ”Number of slice” set to: ”Pre-scale factor”
GetPrescaleFactor: Description:
Reads out the pre-scale factor for a certain FDL slice.
Parameters:
Number of slice
Return value:
xdata::UnsignedLong
ReadRateCounter: Description:
Reads out the rate counter for a certain slice.
Parameters:
Number of slice
Return value:
xdata::UnsignedLong
SetUpdateStepSize: Description:
Sets the common step-size for the reset period of all rate counters.
Parameters:
Update step size
Return value:
Update step size set to: ”Update step size”
SetUpdatePeriod: Description:
Sets the “update period” of the rate counters for a certain slice, based on the common update step-size. The update-period is chosen by setting a register. Each register bit corresponds to a factor the common update-period is multiplied with. An array in the code of the command maps bit numbers to multiplicities.
Parameters:
Number of slice Bit for refresh rate
Return value:
Update Period of slice Number: ”Number of slice” set to: ”multiplicity”
GetNumberOfAlgos: Description:
Depending on the version of the firmware of the FDL chip, the number of Technical Triggers (TT’s) may differ. This command gives back the number of TT’s currently implemented.
Parameters: Return value:
xdata::UnsignedShort
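As an illustration of how an FDL command validates its parameters against the ranges of Table 6-1 and returns a confirmation string, the sketch below models a SetPrescaleFactor-style command. It is a self-contained approximation: it does not derive from the real cell command plug-in base class, does not use the xdata types, and the register write is a stub.

```cpp
#include <cstdint>
#include <iostream>
#include <sstream>
#include <stdexcept>
#include <string>

// Simplified model of an FDL cell command; the real implementation derives
// from the TS framework command plug-in base class and uses xdata types.
class SetPrescaleFactorCommand {
public:
  // Parameters follow Table 6-1: slice in [0:191], factor in [0:65535].
  std::string execute(uint16_t slice, uint32_t factor) const {
    if (slice > 191)
      throw std::out_of_range("Number of slice must be in [0:191]");
    if (factor > 0xFFFF)
      throw std::out_of_range("Pre-scale factor must be in [0:65535]");
    writeRegister(slice, factor);       // stand-in for the VME write via the GT driver
    std::ostringstream msg;
    msg << "Pre-scale factor of slice Number: " << slice
        << " set to: " << factor;
    return msg.str();                   // return value string, as in the command table
  }

private:
  void writeRegister(uint16_t slice, uint32_t value) const {
    // The real command accesses the FDL pre-scaler register of the given slice.
    std::cout << "FDL[" << slice << "].prescale <- " << value << "\n";
  }
};

int main() {
  SetPrescaleFactorCommand cmd;
  std::cout << cmd.execute(42, 1000) << "\n";   // a factor of 0 or 1 means no pre-scaling
}
```

A call such as execute(42, 1000) would produce the confirmation string quoted in the command description above.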
TCS commands
The Trigger Control System (TCS) module controls the distribution of L1A's (Section 1.3.2.4). Therefore, it plays a crucial role with respect to data acquisition and the readout of the trigger components. The TCS command interface of the GT cell is used by the configuration operation running in the central cell (Section 6.2.2.1) and by the GT control panel (Section 6.5.1). This interface provides very fine-grained control over the TCS module. Assigning TTC partitions to DAQ partitions, assigning time slots, controlling the random trigger generator and the generation of fast and synchronization signals, and loading predefined bunch crossing tables separately for each DAQ partition are tasks the command interface has to cope with. The TCS commands can be grouped into commands affecting more than one DAQ partition controller (PTC) and PTC-dependent commands. Commands of the first group therefore carry the prefix ”Master”, whereas commands of the second group start with ”Ptc”. The second group of commands has the number of the PTC as a common parameter.
NAME | TYPE | VALID VALUES
DAQ partition | xdata::UnsignedShort | The number of DAQ partitions is 8; valid values are therefore [0:7].
Number of PTC | xdata::UnsignedShort | For each DAQ partition there is a PTC implemented on the TCS chip; valid values are therefore [0:7].
Detector partition | xdata::UnsignedShort | This parameter refers to one of 32 TTC partitions; valid values are between [0:31].
Time slot | xdata::UnsignedShort | The time slot for a PTC is calculated from an 8-bit value; valid values are between [0:255].
Random trigger frequency | xdata::UnsignedLong | The random frequency is calculated from a 16-bit register value; valid values are between [0:65535].
Table 6-2: Description of parameters used in TCS commands.

MasterSetAssignPart: Description:
This command assigns a TTC partition to a DAQ partition. In case the TTC partition is already part of a DAQ partition it will be assigned to the new partition anyway.
Parameters:
Detector partition DAQ partition
Return value:
Detector partition ”Detector partition” assigned to DAQ partition: ”DAQ partition”.
MasterGetAssignPart: Description:
Returns the number of the DAQ partition a certain TTC partition is part of.
Parameters:
Detector partition
Return value:
xdata::UnsignedShort
MasterSetAssignPartEn: Description:
This command enables or disables a TTC partition. Before a TTC partition can be assigned to a DAQ partition it has to be enabled.
Parameters:
Detector partition Enable partition
Return value:
Detector partition enabled/disabled
MasterGetAssignPartEn: Description:
Reads out whether or not a certain TTC partition is enabled.
Parameters:
Detector partition
Return value:
xdata::Boolean
MasterStartTimeSlotGen: Description:
Depending on the registers that define the time slots for every DAQ partition, the time slot generator switches between the DAQ partitions in round-robin mode. This command starts the time slot generator.
Parameters: Return value:
Time slot generator started.
PtcGetTimeSlot: Description:
Returns the current time slot assignment for a certain PTC.
Parameters:
Number of PTC
Return value:
xdata::UnsignedShort
PtcStartRndTrigger: Description:
Starts the random trigger generator for a specified PTC.
Parameters:
Number of PTC
Return value:
Random trigger generator started for DAQ partition controller ”number of PTC”
PtcStopRndTrigger: Description:
Stops the random trigger generator for a specified PTC.
Parameters:
Number of PTC
Return value:
Random trigger generator stopped for DAQ partition controller ”number of PTC”
PtcRndFrequency: Description:
Sets the frequency of the triggers generated by the random trigger generator for a specified PTC.
Parameters:
Number of PTC Random trigger frequency
Return value:
Random frequency of partition Group: ”number of PTC” set to: "random trigger frequency"
PtcGetRndFrequency: Description:
Reads-out the frequency of the random trigger generator for a PTC.
Parameters:
Number of PTC
Return value:
xdata::UnsignedLong
PtcStartRun: Description:
Starts a run for a PTC, by first resetting and starting the PTC and then sending a start run command pulse.
Parameters:
Number of PTC
Return value:
Run started for PTC: ”number of PTC”
PtcStopRun: Description:
Stops a run for a PTC.
Parameters:
Number of PTC
Return value:
Run stopped for PTC: ”number of PTC”
PtcCalibCycle: Description:
Starts a calibration cycle for the specified PTC.
Parameters:
Number of PTC
Return value:
Calibration cycle for DAQ partition ”number of PTC” started.
PtcResync: Description:
Manually starts a resynchronization procedure for the specified PTC.
Parameters:
Number of PTC
Return value:
Resynchronization procedure for DAQ partition “number of PTC” initialized.
PtcTracedEvent: Description:
Manually sends a traced event for a specified PTC.
Parameters:
Number of PTC
Return value:
Traced event initiated for DAQ partition “number of PTC”.
PtcHwReset: Description:
Manually sends a hardware reset to the PTC.
Parameters:
Number of PTC
Return value:
Hardware for DAQ partition ”number of PTC” has been reset.
PtcResetPtc: Description:
Resets the state machine of the PTC.
Parameters:
Number of PTC
Return value:
PTC ”number of PTC” reset.
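To make the Master/Ptc grouping tangible, the following sketch shows a client driving a few TCS commands for one DAQ partition controller. The GtCellTcs facade and its method names are hypothetical stand-ins for the SOAP cell commands listed above; parameter ranges follow Table 6-2.

```cpp
#include <cstdint>
#include <iostream>
#include <stdexcept>

// Illustrative stand-in for the GT cell TCS command interface; in reality these
// calls are SOAP cell commands executed in the GT cell.
class GtCellTcs {
public:
  void masterSetAssignPart(uint8_t ttcPartition, uint8_t daqPartition) {
    check(ttcPartition, 31, "Detector partition");
    check(daqPartition, 7, "DAQ partition");
    std::cout << "TTC partition " << int(ttcPartition)
              << " assigned to DAQ partition " << int(daqPartition) << "\n";
  }
  void ptcStartRndTrigger(uint8_t ptc, uint32_t frequencyCode) {
    check(ptc, 7, "Number of PTC");
    if (frequencyCode > 0xFFFF) throw std::out_of_range("Random trigger frequency");
    std::cout << "Random triggers started for PTC " << int(ptc)
              << " (frequency code " << frequencyCode << ")\n";
  }
  void ptcStartRun(uint8_t ptc) {
    check(ptc, 7, "Number of PTC");
    std::cout << "Run started for PTC " << int(ptc) << "\n";  // reset, start, then start-run pulse
  }

private:
  static void check(uint32_t value, uint32_t max, const char* name) {
    if (value > max) throw std::out_of_range(name);
  }
};

int main() {
  GtCellTcs gt;
  gt.masterSetAssignPart(/*TTC partition*/ 3, /*DAQ partition*/ 0);
  gt.ptcStartRndTrigger(/*PTC*/ 0, /*frequency code*/ 1024);
  gt.ptcStartRun(0);
}
```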
Other commands
This section describes a number of commands not specifically implemented for a certain type of GT module, but rather used during the initialization, for debugging, or for filling the database with register data.
NAME | TYPE | VALID VALUES
Item | xdata::String | Refers to a register item defined in the HAL “AddressTable” file for a module. If the specified item is not found, HAL throws an exception that is caught in the command.
Offset | xdata::UnsignedInteger | The offset to the register address specified by the Item parameter according to the HAL “AddressTable” file. If the offset gets too large, a HAL exception caught by the command will indicate that.
Board serial number | xdata::String | Only serial numbers of GT modules that are initialized will be accepted. The GtGetCrateStatus command returns a list of boards in the crate.
Bus adapter | xdata::String | The GT cell only accepts bus adapters of type ”DUMMY” and ”CAEN”.
“ModuleMapper” file | xdata::String | The full path to the HAL “ModuleMapper” file has to be specified. If the file is not found, a HAL exception caught by the command will inform the user.
“AddressTableMap” file | xdata::String | The full path to the HAL “AddressTableMap” file has to be specified. If the file is not found, a HAL exception caught by the command will inform the user.
Table 6-3: Description of parameters used in the auxiliary commands.

GtCommonRead: Description:
This command was written to read out register values from any GT module in the crate, which is useful for debugging. When the offset parameter is used correctly, memory lines can also be read out.
Parameters:
Item Offset Board serial number
Return value:
xdata::UnsignedLong
GtCommonWrite: Description:
Generic write access for all GT modules.
Parameters:
Item Value Offset Board serial number
Return value:
Register Value for Item: ”Item” set to: ”Value” (offset=”Offset” ) for board with serial number: ”board serial number”
GtInitCrate: Description:
The initialization of the GT crate object (the GT crate software driver) is done during start-up of the cell application. This command is used if the creation of the crate object did not work correctly, or if another type of bus adapter or different HAL files should be used. A new CellGTCrate object is instantiated only if the ”reinitialize crate” parameter is set to true.
Parameters:
Module Mapper File AddressTableMap File Bus adapter Reinitialize crate
Return value:
The GT crate has been initialized with ”bus adapter” bus adapter. Board with serial nr.: ”board1 serial number” in slot Nr. ”board1 slot number” Board with serial nr.: ”board2 serial number” in slot Nr. ”board2 slot number”
GtGetCrateStatus: Description:
During its initialization, the crate object dynamically creates associative maps holding information about the modules in the crate. This information can be read out using this command.
Parameters:
Module Mapper File AddressTableMap File Bus adapter Reinitialize crate
Return value:
The GT crate has been initialized with ”bus adapter” bus adapter. Board with serial nr.: ”board1 serial number” in slot Nr. ”board1 slot number” Board with serial nr.: ”board2 serial number” in slot Nr. ”board2 slot number”
GtInsertBoardRegistersIntoDB: Description:
This command reads out all registers of a specified GT module that are in the configuration database and inserts a row of values, with a unique identifier and optionally a description, into the corresponding GT configuration database table.
Parameters:
Board serial number Primary Key Description
Return value:
Register values have been read from the hardware and inserted into table ”Name of Register Table” with Primary Key: ”Primary Key”
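The generic read and write commands rely on the HAL address-table mechanism: a register item name is resolved to an address, an optional offset is added (for example to step through memory lines), and the access is performed on the board selected by its serial number. The sketch below models that lookup with plain standard C++ containers; it deliberately does not use the real HAL classes or a VME bus adapter, whose interfaces are not reproduced here.

```cpp
#include <cstdint>
#include <iostream>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>

// Toy model of address-table based register access, in the spirit of
// GtCommonRead/GtCommonWrite. The real commands use the HAL library and a
// VME bus adapter; everything here is an in-memory stand-in.
class GtBoard {
public:
  explicit GtBoard(std::map<std::string, uint32_t> addressTable)
      : addressTable_(std::move(addressTable)) {}

  uint32_t read(const std::string& item, uint32_t offset = 0) const {
    uint32_t addr = resolve(item) + offset;          // offset allows reading memory lines
    auto it = registers_.find(addr);
    return it == registers_.end() ? 0 : it->second;
  }
  void write(const std::string& item, uint32_t value, uint32_t offset = 0) {
    registers_[resolve(item) + offset] = value;
  }

private:
  uint32_t resolve(const std::string& item) const {
    auto it = addressTable_.find(item);
    if (it == addressTable_.end())
      throw std::runtime_error("item not found in address table: " + item);
    return it->second;
  }
  std::map<std::string, uint32_t> addressTable_;   // item name -> base address
  std::map<uint32_t, uint32_t> registers_;         // address   -> register content
};

int main() {
  GtBoard fdl({{"PRESCALE_MEM", 0x1000}, {"RATE_COUNTER", 0x2000}});
  fdl.write("PRESCALE_MEM", 500, /*offset=*/42);     // like GtCommonWrite with an Offset
  std::cout << fdl.read("PRESCALE_MEM", 42) << "\n"; // like GtCommonRead
}
```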
6.2.2.3.2 Configuration operation and database
The configuration operation of the GT cell is interesting for two reasons: it is responsible for configuring the GT hardware that is common to all DAQ partitions, and it is an example of a configuration operation defined for a trigger sub-system crate cell (Section 6.2.2.2). This section describes in detail the functional method of the configure transition of this operation and the GT configuration database.
Figure 6-6: Flow diagram of the configure transition functional method.
Configure_f()
The flow diagram for this method is shown in Figure 6-6. The method performs the following steps (a code sketch of this flow is given after the list):
1. To retrieve a row from the main table of the GT configuration database, named GT_CONFIG (Figure 6-7). This row is identified by the key that is given as a parameter to the operation. If a certain board should not be configured at all, the corresponding entry in the GT_CONFIG table has to be left empty.
2. To loop over all boards in the GT crate in order to log those not found.
Figure 6-7: Main table of the GT configuration database.
3. For all boards that are initialized, the BOARD_FIRMWARE table, shown in Figure 6-8, is retrieved. An attempt is made to load new firmware if the version number of the current firmware does not match the firmware version of the configuration.
4. The same loop is executed over all possible board memories found in the BOARD_MEMORIES table. Empty links are omitted just like above.
Figure 6-8: Each BOARD_CONFIG table references a set of sub-tables.
5. The register table for each board is retrieved. If this table is empty because of a missing link, a warning message is issued, because loading the registers is essential to put the hardware into a well defined state.
6. Finally, a sequencer file is attempted to be downloaded for every board. This sequencer file can be used to write values into a set of registers.
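The six steps above can be condensed into the following self-contained sketch of the configure flow. Database access and hardware calls are reduced to stubs and all names are illustrative; only the control flow mirrors the description.

```cpp
#include <iostream>
#include <string>
#include <vector>

// Condensed model of the GT configure_f() flow; DB and hardware are stubbed.
struct BoardConfig {                    // one non-empty entry of the GT_CONFIG row
  std::string board;
  std::string firmwareVersion;
  bool hasRegisterTable = true;
  bool hasSequencerFile = false;
};

std::vector<BoardConfig> loadGtConfigRow(const std::string& key) {   // step 1
  std::cout << "GT_CONFIG row for key " << key << "\n";
  return {{"FDL", "v2", true, true}, {"TCS", "v3", false, false}};
}

bool boardPresent(const std::string&) { return true; }               // step 2 helper
std::string currentFirmware(const std::string&) { return "v1"; }

void configureGt(const std::string& key) {
  for (const auto& cfg : loadGtConfigRow(key)) {
    if (!boardPresent(cfg.board)) {                                   // step 2: log missing boards
      std::cout << "WARN: board " << cfg.board << " not found\n";
      continue;
    }
    if (currentFirmware(cfg.board) != cfg.firmwareVersion)            // step 3: firmware
      std::cout << "loading firmware " << cfg.firmwareVersion
                << " on " << cfg.board << "\n";
    // step 4: board memories would be loaded here (empty links skipped)
    if (!cfg.hasRegisterTable)                                        // step 5: registers
      std::cout << "WARN: no register table for " << cfg.board << "\n";
    if (cfg.hasSequencerFile)                                         // step 6: sequencer file
      std::cout << "running sequencer file for " << cfg.board << "\n";
  }
}

int main() { configureGt("GT_KEY_EXAMPLE"); }
```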
6.2.2.4 Sub-detector cells
The HCAL and ECAL sub-detectors have just one cell each (Section 5.3.2.2.6). The configuration operation customized by the sub-detector cells is the same as for the trigger cells (Section 6.2.2.2). The configuration operation of the sub-detector cell only acts during the execution of the functional method of its configure transition. This method stores the sub-detector TPG configuration key in an internal variable of the sub-detector cell. However, the sub-detector cell is not responsible for actually setting up the hardware. Instead, when the sub-detector FM requires the configuration of the TPG (Section 1.4.5), the sub-detector supervisory system performs the following sequence (sketched in code after the list):
1. It reads the key using a dedicated cell command of the sub-detector cell.
2. It uses this key to retrieve the hardware configuration from the sub-detector configuration database.
3. It configures the TPG hardware.
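A compact sketch of this three-step sequence, as seen from the sub-detector supervisory system, is given below. The cell command, the database query and the TPG configuration call are hypothetical placeholders.

```cpp
#include <iostream>
#include <string>

// Hypothetical stand-ins for the sub-detector cell, configuration database
// and TPG hardware; only the three-step sequence is meaningful.
std::string readTpgKeyFromCell()   { return "ECAL_TPG_KEY_1"; }   // 1. dedicated cell command
std::string fetchTpgConfig(const std::string& key) {              // 2. sub-detector DB lookup
  return "configuration payload for " + key;
}
void configureTpgHardware(const std::string& cfg) {               // 3. hardware access
  std::cout << "configuring TPG with: " << cfg << "\n";
}

int main() {
  const std::string key = readTpgKeyFromCell();
  configureTpgHardware(fetchTpgConfig(key));
}
```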
6.2.2.5 Luminosity monitoring system
The Luminosity Monitoring System (LMS) cell implements a configuration operation which resets the LMS software (Section 5.3.2.2.8) during the functional method of its enable transition. This method announces that the trigger system is running and that the LMS readout software can be started. The destructor method of the LMS configuration operation stops the LMS software. Therefore, the LMS system will be enabled as long as there is at least one configuration operation instance running in the central cell.
6.2.3 Integration with the Run Control and Monitoring System
The Experiment Control System (ECS) presented in Section 1.4 coordinates the operation of all detector sub-systems and, among them, the L1 decision loop. The interface between the central node of the ECS and each of the sub-systems is the First Level Function Manager (FLFM), which is basically a finite state machine. Figure 6-9 shows the state diagram of the FLFM. It consists of solid and dashed ellipses symbolizing states. The solid ellipses are steady states that are exited only if a command arrives from the central node of the ECS or an error is produced. The dashed ellipses are transitional states, which execute instructions on the sub-system supervisors and self-trigger a transition to the next steady state upon completion of the work. The command Interrupt may force the transition to Error from a transitional state. The transitions themselves are instantaneous and guaranteed to succeed, as no execution of instructions takes place. The entry point of the state machine is the Initial state [98]. This FLFM has to be customized by each sub-system. This customization consists of implementing the code of the main transitional states. For the L1 decision loop, the code for the Configuring, Starting, Pausing, Resuming and Stopping states has been defined. This definition uses the TS SOAP API described in Appendix A to access the TS configuration service. In this context, the FLFM acts as a client of the TS. During the Configuring state, the FLFM instantiates a configuration operation in the central cell of the TS and executes the configure and the partition transitions. During the Starting state, the FLFM executes the enable transition. During the Pausing state, the FLFM executes the suspend transition. During the Resuming state, the FLFM executes the resume transition. Finally, during the Stopping state the FLFM executes the stop transition. The parameters TSC_KEY, TSP_KEY and Run Number are passed during the corresponding transitions (Section 6.2.2.1).
Figure 6-9: Level-1 function manager state diagram.
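The mapping between the FLFM transitional states and the TS transitions described above can be recorded in a small lookup structure, as in the sketch below. The real FLFM is implemented in the RCMS framework and drives the TS through its SOAP API; the snippet only summarizes the mapping, and the association of the Run Number parameter with a particular transition is left open here.

```cpp
#include <iostream>
#include <map>
#include <string>
#include <vector>

// FLFM transitional state -> TS configuration operation transition(s).
// The configure and partition transitions receive TSC_KEY and TSP_KEY
// respectively; the Run Number accompanies the corresponding transition.
int main() {
  const std::map<std::string, std::vector<std::string>> flfmToTs = {
      {"Configuring", {"configure (TSC_KEY)", "partition (TSP_KEY)"}},
      {"Starting",    {"enable"}},
      {"Pausing",     {"suspend"}},
      {"Resuming",    {"resume"}},
      {"Stopping",    {"stop"}},
  };
  for (const auto& [state, transitions] : flfmToTs) {
    std::cout << state << " ->";
    for (const auto& t : transitions) std::cout << " " << t;
    std::cout << "\n";
  }
}
```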
6.3 Interconnection test

6.3.1 Description
Due to the large number of communication channels between the trigger primitive generator (TPG) modules of the sub-detectors and the trigger system, and between the different trigger sub-systems, it is necessary to provide an automatic testing mechanism. The interconnection test service of the Trigger Supervisor is intended to automatically check the connections between sub-systems. From the client point of view, the interconnection test service is another operation running in the TS central cell. Figure 6-10 shows the state machine of the interconnection test operation. The client of the interconnection test service initiates an interconnection test operation in the central cell and executes the first transition, prepare, with a key assigned to the IT_KEY parameter and an optional second string assigned to the custom parameter. This transition prepares the L1 trigger hardware and the TS system for the start of the interconnection test. The start transition enables the start of the test. Finally, the client executes the analyze transition to get the test result from the sub-system central cells.
Figure 6-10: Interconnection test operation (state machine with the states halted, prepared, started and analyzed, and the transitions prepare, start, analyze and resume).
6.3.2 Implementation
The following sections describe how the TS interconnection test service is formed by the collaboration of different cell operations installed in different cells of the TS system. In addition, this service requires the collaboration of the Sub-detector Supervisory and Control Systems (SSCS) and the usage of the L1 trigger configuration databases (Figure 6-2). A single operation is necessary in the TS central cell. However, every interconnection test requires specific operations for the concrete sender and receiver sub-system central cells and crate cells.
6.3.2.1 Central cell
The role of the central cell in the interconnection test service is similar to the role it plays in the configuration service: to provide the remote client interface presented in Section 6.3.1 and to coordinate the operation of all involved nodes. Both the interface to the client and the system coordination are defined by the interconnection test operation installed in the central cell (Figure 6-10). This section describes the stable states, and the functional (fi) and conditional (ci) methods of the central cell interconnection test operation transitions.
Initialization()
This method stores the session_id parameter in an internal variable of the interconnection test operation instance. This number will be propagated to lower level cells when a cell command or operation is instantiated. The session_id is attached to every log record in order to help identify which client directly or indirectly executed a given action in a cell of the TSCS.
Prepare_c()
This method performs the following steps:
1. To read the IT_KEY row from the IT_CONF database table shown in Figure 6-11. This row contains two keys (TSC_KEY and TSP_KEY) and the names of the cell operations that have to be initiated in each of the central cells of those sub-systems involved in the interconnection test.
Figure 6-11: Main database table used by the central cell interconnection test operation (columns: IT_KEY, TSC_KEY, TSP_KEY, GT_IT_CLASS, GMT_IT_CLASS, DTTF_IT_CLASS, CSCTF_IT_CLASS, GCT_IT_CLASS, RCT_IT_CLASS, RPCTrig_IT_CLASS, ECAL_IT_CLASS, HCAL_IT_CLASS, DTSC_IT_CLASS).
2. To initiate the corresponding operation in the required trigger sub-system central cells with the same session_id provided by the central cell client. This method also initiates a configuration operation in the central cell. If one of these operations cannot be successfully started, this method returns false, the functional method of the prepare transition is not executed and the operation state stays halted.
Prepare_f()
This method performs the following steps (the overall coordination of these transitions is sketched at the end of this section):
1. To execute the configure and the partition transitions, with the TSC_KEY and TSP_KEY keys respectively, in the central cell configuration operation. This configures the TCS module in order to deliver the required TTC commands to the sender and/or to the receiver sub-systems. By reconfiguring the BX table of a given DAQ partition, the TCS can periodically send any sequence of TTC commands to a set of TTC partitions (i.e. senders, receivers or both). The usual configuration use case is that the senders wait for a BC0 signal (the TTC command that signals the beginning of an LHC orbit) to start sending patterns, whilst the receiver systems do not need any TTC signal. The configuration operation is also used to configure intermediate trigger sub-systems to work in transparent mode.
2. To execute the prepare transition in the interconnection test operation of each trigger sub-system central cell, sending the custom string parameter as a parameter. This parameter is intended to be used by the sub-system interconnection test operation (Section 6.3.2.2).
Start_c()
This method checks if the interconnection test operation of each trigger sub-system central cell is in the prepared state. This method also checks if the configuration operation of the central cell is in the partitioned state. If one of these operations is not in the expected state, this method returns false, the functional method of the start transition is not executed and the operation state stays prepared.
Start_f()
This method performs the following steps:
1. To execute the start transition in the interconnection test operation of each trigger sub-system central cell. This enables the input and output buffers on the receiver and sender sides respectively.
2. To execute the enable transition in the configuration operation of the central cell. This enables the delivery of TTC commands to the sender and receiver sub-systems.
Analyze_c()
This method checks if the interconnection test operation of each trigger sub-system central cell is in the started state. This method also checks if the configuration operation of the central cell is in the enabled state. If one of these operations is not in the expected state, this method returns false, the functional method of the analyze transition is not executed and the operation state stays started.
Analyze_f()
This method performs the following steps:
1. To execute the suspend transition in the configuration operation of the central cell. This temporarily stops the delivery of TTC commands to the sender and receiver sub-systems.
2. To execute the analyze transition in the interconnection test operation of each trigger sub-system central cell. This retrieves the test results from the sub-systems and disables the input and output buffers on the receiver and sender sides respectively. Usually, the sender returns nothing and the receiver returns the result after comparing the expected patterns with the actually received patterns.
Resume_c()
This method checks if the interconnection test operation of each trigger sub-system central cell is in the analyzed state. This method also checks if the configuration operation of the central cell is in the suspended state. If one of these operations is not in the expected state, this method returns false, the functional method of the resume transition is not executed and the operation state stays analyzed.
Resume_f()
This method performs the following steps:
1. To execute the resume transition in the interconnection test operation of each trigger sub-system central cell. This re-enables the input and output buffers on the receiver and sender sides respectively.
2. To execute the resume transition in the configuration operation of the GT cell. This re-enables the delivery of TTC commands to the sender and receiver sub-systems.
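The coordination performed by the central cell can be condensed into a sketch of the prepare, start and analyze steps: the central configuration operation sets up the TCS and the intermediate systems, and the sub-system interconnection test operations are then driven in lockstep. Class and method names are illustrative, and the sub-system names are just examples of a sender/receiver pair.

```cpp
#include <iostream>
#include <string>
#include <vector>

// Illustrative model of the central cell interconnection test coordination.
struct ConfigOperationProxy {              // central cell configuration operation
  void configure(const std::string& k) { std::cout << "configure " << k << "\n"; }
  void partition(const std::string& k) { std::cout << "partition " << k << "\n"; }
  void enable()  { std::cout << "TTC command delivery enabled\n"; }
  void suspend() { std::cout << "TTC command delivery suspended\n"; }
};

struct SubSystemItProxy {                  // sub-system central cell IT operation
  std::string name;
  void prepare(const std::string& custom) { std::cout << name << " prepared (" << custom << ")\n"; }
  void start()   { std::cout << name << " buffers enabled\n"; }
  std::string analyze() { return name + ": patterns OK"; }
};

int main() {
  ConfigOperationProxy config;
  std::vector<SubSystemItProxy> subsystems{{"GCT"}, {"GT"}};

  // Prepare_f: configure the TCS and intermediate systems, then prepare the sub-systems.
  config.configure("TSC_KEY");
  config.partition("TSP_KEY");
  for (auto& s : subsystems) s.prepare("custom");

  // Start_f: enable buffers, then enable TTC command delivery.
  for (auto& s : subsystems) s.start();
  config.enable();

  // Analyze_f: stop TTC commands, then collect the results.
  config.suspend();
  for (auto& s : subsystems) std::cout << s.analyze() << "\n";
}
```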
6.3.2.2 Sub-system cells
The interconnection test operation interface running in the trigger sub-system cells is almost the same as the one running in the TS central cell (Figure 6-10), with the difference that the IT_KEY parameter does not exist. This section describes the stable states, and the functional (fi) and conditional (ci) methods of the sub-system cells' interconnection test operation transitions. This description includes both the crate cell and the trigger sub-system central cell cases. The following method descriptions do not match a concrete interconnection test example but describe the relevant aspects common to all cases.
Initialization()
This method stores the session_id parameter in an internal variable of the interconnection test operation instance. This number will be propagated to lower level cells when a cell command or operation is instantiated. The session_id is attached to every log record in order to help identify which client directly or indirectly executed a given action in a cell of the TSCS.
Prepare_c()
If the operation runs in the sub-system central cell, this method reads the custom parameter and initiates the interconnection test operation in the crate cells involved in the test. If the operation runs in a crate cell, this method checks if the hardware is accessible. If an operation cannot be started in the crate cells or the hardware is not accessible, this method returns false, the functional method of the prepare transition is not executed and the operation state stays halted.
Prepare_f()
This method reads the custom parameter and executes the necessary actions to prepare the sub-system to perform the test according to this parameter. If the operation runs in the sub-system central cell, this method executes the prepare transition in the required interconnection test operations running in the lower level crate cells. If the operation runs in a crate cell, this method prepares the patterns to be sent or to be received.
Start_c()
If the operation runs in the sub-system central cell, this method checks if the current state of the interconnection test operation running in the crate cells is prepared. If the operation runs in a crate cell, this method checks if the hardware is accessible. If one of these checks fails, this method returns false, the functional method of the start transition is not executed and the operation state stays prepared.
Start_f()
If the operation runs in the sub-system central cell, this method executes the start transition in the interconnection test operation running in the lower level crate cells. If the operation runs in a crate cell, this method enables the input or the output buffers depending on whether the crate is on the receiver or on the sender side.
Analyze_c()
If the operation runs in the sub-system central cell, this method checks if the current state of the interconnection test operation running in the crate cells is started. If the operation runs in a crate cell, this method checks if the hardware is accessible. If one of these checks fails, this method returns false, the functional method of the analyze transition is not executed and the operation state stays started.
Analyze_f()
If the operation runs in the sub-system central cell, this method executes the analyze transition in the interconnection test operation running in the lower level crate cells, gathers the results and returns them to the central cell. If the operation runs in a crate cell, this method compares the expected patterns, prepared during the prepare transition, against the received ones and returns the result to the sub-system central cell.
Resume_c()
If the operation runs in the sub-system central cell, this method checks if the current state of the interconnection test operation running in the crate cells is analyzed. If the operation runs in a crate cell, this method checks if the hardware is accessible. If one of these checks fails, this method returns false, the functional method of the resume transition is not executed and the operation state stays analyzed.
Resume_f()
If the operation runs in the sub-system central cell, this method executes the resume transition in the interconnection test operation running in the lower level crate cells. If the operation runs in a crate cell, this method re-enables the input or the output buffers depending on whether the crate is on the receiver or on the sender side.
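On the receiver side, the essential work of Analyze_f is the comparison between the patterns prepared during the prepare transition and the patterns actually captured in the input buffers. A minimal, self-contained version of that comparison is sketched below; the data layout and the capture mechanism are placeholders.

```cpp
#include <cstdint>
#include <iostream>
#include <vector>

// Minimal pattern comparison for a receiver crate cell's Analyze_f(); the
// expected patterns come from the prepare transition, the received ones from
// the input capture buffers (both stubbed here as plain vectors).
struct TestResult {
  bool ok = true;
  std::vector<std::size_t> badWords;     // indices of mismatching words
};

TestResult comparePatterns(const std::vector<uint32_t>& expected,
                           const std::vector<uint32_t>& received) {
  TestResult r;
  if (expected.size() != received.size()) { r.ok = false; return r; }
  for (std::size_t i = 0; i < expected.size(); ++i)
    if (expected[i] != received[i]) { r.ok = false; r.badWords.push_back(i); }
  return r;
}

int main() {
  std::vector<uint32_t> expected = {0xCAFE, 0xBABE, 0x1234};
  std::vector<uint32_t> received = {0xCAFE, 0xBEEF, 0x1234};
  TestResult r = comparePatterns(expected, received);
  std::cout << (r.ok ? "link OK" : "link FAILED") << ", mismatches: "
            << r.badWords.size() << "\n";   // result returned to the sub-system central cell
}
```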
6.4 Monitoring

6.4.1 Description
The TS monitoring service provides access to the monitoring information of the L1 decision loop hardware. This service is implemented using the TSMS presented in Section 5.4.2. The HTTP/CGI interface of the monitor collector provides remote access to the monitoring information.
Event data based monitoring system
A second source of monitoring information is the event data. For instance, the GTFE board is designed to gather monitoring information from almost all boards of the GT and to send this information as an event fragment every time the GT receives a L1A. Therefore, an online monitoring system for the GT could be based on extracting this data from the corresponding event fragment. This approach would be very convenient because every event would contain precise monitoring information on the L1 hardware status for the corresponding bunch crossing (BX). In addition, this approach would not require the development of a complex monitoring software infrastructure. On the other hand, we would face two limitations:
The GT algorithm rates are accumulated in the Final Decision Logic (FDL) board and the current version of the GTFE board cannot access its memories and registers. The only way to read out the rate counters is through VME access.
The GTFE board will send event fragments only when the DAQ infrastructure is running.
These limitations could be overcome using the TS monitoring service. This service is meant to be an “always on” infrastructure (Section 5.2.5) and to provide an HTTP/CGI interface to access all monitoring items, and specifically the GT algorithm rate counters. Therefore, the TS monitoring service is the only feasible approach to read out the GT algorithm rates and to support an “always on” external system depending on this information.
6.5 Graphical user interfaces
The HTTP/CGI interface of every cell provides the generic TS web-based GUI presented in Section 4.4.4.11. This GUI is automatically generated and provides a homogeneous look and feel to control any sub-system cell, independently of the operation, command and monitoring customization details. The generic TS GUI of the DTTF, GT, GMT and RCT was extended with control panel plug-ins. The following section presents the Global Trigger control panel as an example [90].
6.5.1 Global Trigger control panel The GT control panel is integrated into the generic TS GUI of the GT cell. It uses the GT cell software in order to get access to the GT hardware. This control panel has the following features:
Monitoring and control of the GT hardware: The GT Control Panel implements the most important functionalities to monitor and control the GT hardware. That includes monitoring of the counters and the TTC detector partitions assigned to DAQ partitions, setting the time slots, enabling and disabling the TTC sub-detectors for a given DAQ partition, setting the FDL board mask, starting a run, stopping a run, starting random triggers, stopping random triggers, changing the frequency and step size for random triggers and resynchronization and resetting each of the DAQ partitions.
Configuration database population tool: The GT Control Panel allows hardware experts to create configuration entries in the configuration database without the need of any knowledge of the underlying database schema.
Access control integration: The GT Control Panel supports different access control levels. Depending on the user logged in (i.e. an expert, a shifter or a guest) the panel visualizes different information and allows different tasks to be performed.
Trigger menu generation: The GT Control Panel allows the visualization and modification of the trigger menu. The trigger menu is the high-level description of the algorithms that will be used to select the desired physics events. For each algorithm it is possible to visualize and modify the name, algorithm number, pre-scale factor, algorithm description and condition properties (i.e. threshold, quality, etc.).
Figure 6-12 presents a view of the GT control panel showing which TTC partitions (32 columns) are assigned to each of the eight DAQ partitions (8 rows). The red color means that a given TTC partition is not connected.
Figure 6-12: GT control panel view showing the current partitioning state.
Chapter 7 Homogeneous Supervisor and Control Software Infrastructure for the CMS Experiment at SLHC
This chapter presents a project proposal to homogenize the supervisory control, data acquisition, and control software infrastructure for an upgraded CMS experiment at the SLHC. Its advantage is a unique, modular development platform enabling an efficient use of manpower and resources.
7.1 Introduction
This proposal aims to develop the CMS Experiment Control System (ECS) based on a new supervisory and control software framework. We propose a homogeneous technological solution for the CMS infrastructure of Supervisory Control And Data Acquisition (SCADA [99]). The current CMS software control system consists of the Run Control and Monitoring System (RCMS), the Detector Control System (DCS), the Trigger Supervisor (TS), and the Tracker, ECAL, HCAL, DT and RPC sub-detector supervisory systems. This infrastructure is based on three major supervisor and control software frameworks: PVSSII (Section 1.4.2), RCMS (Section 1.4.1) and the TS (Chapter 4). In addition, each sub-detector has created its own SCADA software. A single SCADA software framework used by all CMS sub-systems would have advantages for the maintenance, support and operation tasks during the experiment life-cycle:
1) Overall design strategy optimization: There is an evident similarity in the technical requirements for controls amongst the different levels of the experiment control system. A common SCADA framework will allow an overall optimization of requirements, design and implementation.
2) Support and maintenance resources: The project should enable an efficient use of resources. A common SCADA infrastructure for CMS will manage the increasing complexity of the experiment control and reduce the effects of current and future constraints on manpower.
3) Accelerated learning curve: Operators and developers will benefit from a common SCADA infrastructure due to: 1) a one-time learning cost, and 2) the fact that moving between CMS control levels and sub-systems will not imply a change in technology.
This project proposal is based on the evolution of the software infrastructure used to integrate the L1 trigger sub-systems. Section 7.2 presents the project technology baseline and the criteria for its selection. Section 7.3 presents an overview of the project road map. Finally, Section 7.4 outlines the project schedule and the required human resources.
7.2 Technology baseline
The design and development of the unique underlying supervisory and control infrastructure should initially start from the software framework currently used to implement the L1 trigger control software system, i.e. the TS framework. The following paragraphs describe the principal objective criteria for which this technological baseline has been chosen:
1) Proven technology: It is used in the implementation of a supervisory and control system that coordinates the operation of all L1 trigger sub-systems, the TTC system, the LMS and, to some extent, the ECAL, HCAL, DT and RPC sub-detectors. This solution was successfully used during the second phase of the Magnet Test and Cosmic Challenge, has been used in the monthly commissioning exercises of the CMS Global Runs and is the official solution for the experiment operation.
2) Homogeneous TriDAS infrastructure and support: The TS framework is based on XDAQ, which is the same middleware used by the DAQ event builder (Section 1.4.3). This component is a key part of the DAQ system and as such it is not likely to evolve towards a different underlying middleware. Therefore, a supervisory and control software framework based on the XDAQ middleware could profit from a long term, in-house supported solution. In addition, a SCADA infrastructure based on the XDAQ middleware would homogenize the underlying technologies for the DAQ and for the supervisory control infrastructure, which would automatically reduce the overall support and maintenance effort.
3) Simplified coordination and support tasks: The TS framework is designed to reduce the gap between software experts and experimental physicists and to reduce the learning curve. Examples are the usage of models well known in HEP control systems, like finite state machines, homogeneous integration methodologies independent of the concrete sub-system Online SoftWare Infrastructure (OSWI) and hardware setup, and the automatic creation of graphical user interfaces. The result is a development methodology characterized by a modular upgrading process and one single visible software framework.
4) C++: The OSWI of all sub-systems is mainly formed by libraries written in C++ running on x86/Linux platforms. These are intended to hide the hardware complexity from software experts. Therefore, a SCADA infrastructure based on C++, like the TS framework, would simplify the complexity of the integration architecture.
7.3 Road map
This project aims to reach the technological homogenization of the CMS Experiment Control System following a progressive and non-disruptive strategy. This shall allow a gradual and smooth transition from the current SCADA infrastructure to the proposed one. An adequate approach could have the following project tasks:
1) L1 trigger incremental development: Continue with the current development and maintenance process in the L1 trigger using the proposed framework.
2) Sub-detector control and supervisory software integration: This task involves the incremental adoption of a common software framework for all sub-detectors in order to homogenize the control and supervisory software of CMS. The participating sub-detectors are ECAL, HCAL, DT, CSC, RPC and Tracker. Currently, this step is partially achieved because all sub-detectors are partially integrated with the TS system in order to: 1) automate the pattern tests between the sub-detector TPG's and the regional trigger systems, and 2) check configuration consistency between the L1 trigger and the trigger primitive generators.
3) L1 trigger emulators supervisory system: This task involves the upgrade of the supervisory software of the L1 trigger emulators to the proposed common framework. The hardware emulators of the L1 trigger have been deployed as components of the CMSSW framework [100]. This task does not involve any change in the emulator code or in the CMSSW framework.
4) High Level Trigger (HLT) supervisory system: This task involves the upgrade of the supervisory software of the HLT to the proposed common framework. In this way the components of the HLT (filter units, slice supervisors and storage managers) will be launched, configured and monitored like the other software components of the CMS online software [101]. This task does not involve any change to the supervised components.
5) Event builder supervisory system: This task involves the deployment of the event builder supervisory system as nodes of the proposed framework. The event builder supervisory software will launch all software components, and will configure and monitor the Front-End Readout Links (FRL), the Front-End Driver (FED) Builder Network, and the different slices of Event Managers (EVM), Builder Units (BU) and Readout Units (RU). This task does not involve the modification of the event builder components (Section 1.4.3).
6) Experiment Control System feasibility study and final homogenization step: This is the last stage of the homogenization process. This task involves the feasibility study to change the top layer of the ECS and, afterwards, its substitution by components of the proposed framework. This means the substitution of the Function Managers by the nodes of the proposed SCADA software. This task also involves the feasibility study and homogenization of the top software layer of the DCS in order to be supervised, controlled and monitored by the ECS (Section 1.4.2).
7.4 Schedule and resource estimates
Schedule and resource estimates have been approximated according to the COCOMO II model [102], assuming the delivery of 50000 new Source Lines Of Code (SLOC), the modification of 10000 SLOC and the reuse of 30000 SLOC, with the model parameters rated as a project of average complexity. The SLOC effort has been estimated using the development experience with the TS and RCMS frameworks. Additional assumptions are a development team working in an in-house environment, with extensive experience with related systems and a thorough understanding of how the system under development will contribute to the objectives of CMS. The four project phases are:
1) Inception: This phase includes the analysis of requirements, system definitions, specification and prototyping of user interfaces, and cost estimation.
2) Elaboration: This period is meant to define the software architecture and the test plan.
3) Construction: This phase includes the coding and testing phases.
4) Transition: This last phase includes the final release delivery and the set-up of the support and maintenance infrastructure.
Table 7-1 shows the schedule for the project phases and the required resources per phase in person-months. This estimate includes the resources to deliver the infrastructure stated in Section 7.3: all templates, standard elements and functions required to achieve a homogeneous system and to reduce as much as possible the development effort for the sub-system integration developers. This estimate does not include the sub-system integration, which follows the transition phase.
Phase | Phase effort (person-months) | Schedule (months)
Inception | 16 | 3
Elaboration | 64 | 8
Construction | 199 | 14
Transition | 32 | 13
Table 7-1: Project phases schedule and associated effort in person-months.

We summarize in Table 7-2 the top-level resource and schedule estimate of the project.

Total effort (person-months) | Schedule (months)
311 | 38
Table 7-2: Top-level project resource and schedule estimate.
Chapter 8 Summary and Conclusions
The life span of the last generation of HEP experiment projects is of the same order of magnitude as a human being's life, and the experiment's and the human being's life phases share a number of analogies.
During the conception period of a HEP experiment, key people discuss the feasibility of a new project. For instance, the initial informal discussions about CMS started in 1989 and continued for nearly three years. This period finished with a successful conceptual design (CMS Letter of Intent, 1992). In a similar way, the conception of a human being would follow a dating period and the decision to have a common life project.
Right after the conceptual design, the research and prototyping phase starts. During this period, research and prototyping tasks are performed in order to prove the feasibility of the former design. A successful culmination of this period is the release of a number of Technical Design Reports (TDR's) describing the design details, the project schedule and the organization. For the CMS experiment this period lasted until the year 2002. This second period is similar to human childhood and infancy, where the child grows up, experiments with her environment, learns the basic knowledge for life and roughly plans what she wants to be when she grows up.
The next stage in the life of a HEP experiment is the development phase. During this time, the building blocks described in the individual TDR's are produced. For the CMS experiment this period lasted approximately until early 2007. Following the analogy, this period could be compared to the formation period spent in high school and college, where the adolescent learns several different subjects.
Before being operational, the building blocks produced during the development phase need to be assembled and commissioned. The CMS commissioning exercises started in 2006 with the Magnet Test and Cosmic Challenge and continued during 2007 with a monthly, periodic and incremental commissioning exercise known as the Global Run. This is similar to what happens to recent graduates starting their careers with a trainee period in a company or research institute: they learn how to use the knowledge acquired during the formation period in order to perform a concrete task.
After a successful commissioning period the experiment is ready for operation. The CMS experiment is expected to be operational for at least 20 years. During this phase, periodic sub-system upgrades will be necessary to cope with radiation damage or with new requirements due to the SLHC luminosity upgrade. This period would be like the adult professional life, when the person is fully productive and needs to periodically undergo medical checks or refresh her knowledge in order to fit the continuous changes in the evolution of the job market.
Finally, the experiment will be decommissioned at the end of its operational life. The analogy also works in this case, because at the end of a successful career a person will also retire.
The long life span is not the only complexity dimension of the last generation of HEP experiments that finds a good analogy in the metaphor of the human being. The numerical complexity of the collaborating sub-systems is striking on both sides as well. We have discussed the time scale and complexity similarities between human beings and HEP experiments, but we can still go further in this analogy and ask: what is the experiment's genetic material?
In other words, what is the seed of a HEP experiment project? It cannot be people, because only a few collaboration members stay during the whole lifetime of the experiment. The answer is that the experiment's genetic material is the knowledge consisting of successful ideas applied in past experiments and of novel contributions from other fields which promise improved results. This set of ideas is a potential future HEP experiment. And people? Where do the members of the collaboration fit? In this analogy, the scientists, engineers and technicians are responsible for transmitting and expressing the experiment's genetic material. In other words, the collaboration members are the hosts of the experiment's DNA and are also responsible for its expression in actual experiment body parts. Therefore, even though some people are more able than others to transmit and express the experiment's DNA, nobody is essential. The metaphor between the most advanced HEP experiments and human beings serves the author to explain how this thesis contributed to CMS, and to the HEP and scientific communities. The following sections summarize the contributions of this work to both the CMS body, i.e. the experiment, and the CMS DNA, i.e. the knowledge base of the CMS collaboration and the HEP community.
8.1 Contributions to the CMS genetic base
This work encompasses a number of ideas intended to enhance the expression of a concrete CMS body part: the control and hardware monitoring system of the L1 trigger, or Trigger Supervisor (TS). A successful final design was reached not just by gathering a detailed list of functional requirements. It was necessary to understand the complexity of the task, and the most promising technologies had to be proven. The unprecedented number of hardware items, the long periods of preparation and operation, and the human and political context were presented as three complexity dimensions related to building hardware management systems for the latest generation of HEP experiments. The understanding of the problem context and the associated complexity, together with the experience acquired with an initial generic solution, guided us to the conceptual design of the Trigger Supervisor.
8.1.1 XSEQ
An initial generic solution to the thesis problem context proposed a software environment to describe configuration, control and test systems for data acquisition hardware devices. The design followed a model that matched well the extensibility and flexibility requirements of a long-lifetime experiment characterized by an ever-changing environment. The model builds upon two points: 1) the use of XML for describing hardware devices, configuration data, test results and control sequences; and 2) an interpreted, run-time extensible, high-level control language for these sequences that provides independence from a specific host platform and from the interconnect systems to which devices are attached. The proposed approach has several advantages:
The uniform usage of XML assures a long term technological investment and reduced in-house development, thanks to the large existing asset of standards and tools.
The interpreted approach enables the definition of platform-independent control sequences. Therefore, it enhances the sub-system platform upgrade process.
The syntax of an XML-based programming language (XSEQ, XML-based sequencer) was defined. It was shown how an adequate use of XML Schema technology facilitated the decoupling of syntax and semantics, and therefore enhanced the sharing of control sequences among heterogeneous sub-system platforms. An interpreter for this language was developed for the CERN Scientific Linux (SLC3) platform. It was proved that the performance of an interpreter for an XML-based programming language oriented to hardware control could be at least as good as the performance of an interpreter for a standard HEP language for hardware control. The model implementation was integrated into a distributed programming framework specifically designed for data acquisition in the CMS experiment (XDAQ). It was shown that this combination could be the architectural basis of a management system for DAQ hardware. A feasibility study of this software defined a number of standalone applications for different CMS hardware modules and a hardware management system to remotely access these heterogeneous sub-systems through a uniform web service interface.
8.1.2 Trigger Supervisor
The experience acquired during this initial research, together with the L1 trigger operation requirements, seeded the conceptual design of the Trigger Supervisor. It consists of a set of functional and non-functional requirements, the architecture design together with a few technological proposals, and the project tasks and organization details. The functional purpose of the TS is to coordinate the operation of the L1 trigger and to provide a flexible interface that hides the burden of this coordination. The required operation capabilities had to simplify the process of configuring, testing and monitoring the hardware. Additional functionalities were required for troubleshooting, error management, user support, access control and start-up purposes. The non-functional requirements were also discussed. These take into account the magnitude of the infrastructure under control, the implications of the periodic hardware and software upgrades necessary in a long-lived experiment like CMS, the particular human and political context of the CMS collaboration, the required long term support and maintenance, the limitations of the existing CMS online software infrastructure and the particularities of the operation environment of the CMS Experiment Control System. The design of the TS architecture fulfills the functional and non-functional requirements. This architecture identifies three main development layers: the framework, the system and the services. The framework is the software infrastructure that provides the main building block, or cell, and the integration with the specific sub-system OSWI. The system is a distributed software architecture built out of these building blocks. Finally, the services are the L1 trigger operation capabilities implemented on top of the system as a collaboration of finite state machines running in each of the cells. The decomposition of the project development tasks into three layers enhances the coordination of the development tasks and helps to keep a stable system, in spite of hardware and software upgrades, on top of which new operation capabilities can be implemented without software engineering expertise.
8.1.3 Trigger Supervisor framework
The TS framework is the lowest level layer of the TS. It consists of the basic software infrastructure delivered to the sub-systems to facilitate their integration. This infrastructure is based on the XDAQ middleware and a few external libraries. XDAQ was chosen among the CMS officially supported distributed programming frameworks (namely XDAQ, RCMS and JCOP) as the baseline solution because it offered the best trade-off between infrastructure completeness and fast sub-system integration. Although XDAQ was the best available option, further development was needed to reach the usability required by a community of customers with no software engineering background and limited time dedicated to software integration tasks. The cell is the main component of this additional software infrastructure. This component is an XDAQ application that needs to be customized by each sub-system in order to integrate with the Trigger Supervisor. The customization process has the following characteristics:
Based on Finite State Machines (FSM): The integration of a sub-system with the TS consists of defining FSM plug-ins. An FSM model was chosen because it is a well-known approach to defining control systems for HEP experiments and therefore accelerates the customer’s learning curve. FSM plug-ins wrap the usage of the sub-system OSWI and offer a stable remote interface despite software platform and hardware upgrades.
Simple: Additional facilities were also delivered to the sub-systems in order to simplify the customization process. The most important one is the xhannel API. It provides a simple and homogeneous interface to a wide range of external services: other cells, XDAQ applications and web services.
Automatically generated GUI: A mechanism to automatically generate the cell GUI reduced the customization time and provided a common look and feel for the graphical setups of all sub-systems. The common look and feel shortened the learning curve for new L1 trigger operators.
Remote interface: The cell provided a human and a machine interface based on the HTTP/CGI and SOAP protocols respectively, fitting well the web services based model of the CMS OSWI. This interface facilitated the remote operation of the sub-system specific FSM plug-ins and could also be extended with custom functionalities using command plug-ins.
8.1.4 Trigger Supervisor system
The intermediate layer of the TS is the TS System (TSS). It provides a stable layer on top of which the TS services have been implemented. The TS system is designed to require little maintenance and to provide a methodology to develop services that can fit present and future experiment operational requirements. In this scheme, the development of new services requires very limited knowledge of the internals of the TS framework and only needs to follow a well-defined methodology. The stable TS system, together with the associated methodology, makes it possible to accommodate new functionalities in a non-disruptive way, without requiring major developments. The TSS consists of four distributed software systems with well defined functionalities: the TS Control System (TSCS), the TS Monitoring System (TSMS), the TS Logging System (TSLS) and the TS Start-up System (TSSS). The following points describe the design principles:
Reduced number of basic building blocks: The TSS is based solely on the sub-system cells and on existing monitoring, logging and start-up components provided by the XDAQ and RCMS frameworks. Reusing XDAQ and RCMS components minimized the development effort and at the same time guaranteed long-term support and maintenance. A reduced number of basic building blocks also helped to communicate the architectural concepts.
Nodes and connections without logic: The TSCS is a collection of nodes and the communication channels among them. It does not include the logic of the L1 decision loop operation capabilities. This is implemented one layer above following a well defined methodology. The improved modularity obtained by decoupling the stable infrastructure (TSCS) from the L1 trigger operation capabilities eases the distribution of development tasks. Sub-system experts and technical coordinators were responsible for maintaining and/or implementing L1 trigger operation capabilities, whilst the TS central team focused on assuring a stable TSCS.
Hierarchical control system: It is shown how a hierarchical topology for the TSCS enhances a distributed development, facilitates the independent operation of a given sub-system, simplifies a partial deployment and provides graceful system degradation.
Well-defined sub-system integration model: The integration of each sub-system is done according to guidelines proposed by the TS central team. These are intended to maximize the deployment of the TSS in different set-ups, and to ease the hardware evolution without affecting the services layer that provides the L1 trigger operation capabilities.
8.1.5 Trigger Supervisor services
The TS services are the L1 decision loop operation capabilities. The current services implement the functionalities identified during the conceptual design. They have been implemented on top of the TS system and according to the proposed methodology. The following services were presented:
Configuration: This is the main service provided by the TS. It facilitates the configuration of the L1 decision loop. Up to eight remote clients can use this service simultaneously without risking inconsistent configurations of the L1 decision loop. The configuration information (e.g. firmware, LUTs, registers) is retrieved from the configuration database using a database identifier provided by the client. RCMS uses the remote interface provided by the central node of the TS in order to configure the L1 decision loop.
Interconnection test: It is intended to automatically check the connections between sub-systems. From the client point of view, the interconnection test service is another operation running in the TS central cell.
Logging and start-up services: They are provided by the corresponding TS logging and start-up systems and did not require any further customization process.
Monitoring: This service, facilitated by the TS monitoring system, provides access to the monitoring information of the L1 decision loop hardware. It is designed to be an “always on” source of monitoring information, regardless of the availability of the DAQ system.
Graphical User Interface (GUI): This service is facilitated by the HTTP/CGI interface of every cell. It is automatically generated and provides a homogeneous look and feel to control any sub-system cell
independent of the operations, commands and monitoring customization details. It was also shown that the generic TS GUI could be extended with subsystem specific control panels.
8.1.6 Trigger Supervisor Continuation
A continuation path for the TS was presented. The project proposal is intended to homogenize the Supervisory Control And Data Acquisition (SCADA) infrastructure of the CMS experiment. A single SCADA software framework used by all CMS sub-systems would have advantages for the maintenance, support and operation tasks during the operational life of the experiment. The proposal is based on the evolution of the TS framework. A tentative schedule and resource estimates were also presented.
8.2 Contribution to the CMS body
The main initial goal of this PhD thesis was to build a tool to operate the L1 trigger decision loop and to integrate it into the overall Experiment Control System. This objective has been achieved: the Trigger Supervisor has become a real body part of the CMS experiment and serves its purpose. Periodic demonstrators brought the TS to the first joint operation with the Experiment Control System in November 2006, during the second phase of the Magnet Test and Cosmic Challenge ([103], p. 9). It has continued improving and serving every monthly commissioning exercise since May 2007, and it is the official tool of the CMS experiment to operate the L1 decision loop ([104], p. 190). Using the introductory analogy, the CMS Experiment Control System would be the experiment brain, and the Trigger Supervisor a specialized brain module, just like the human brain is thought to be divided into specialized units, for instance to turn sounds into speech or to recognize a face. The development of the CMS Trigger Supervisor can be seen as the expression of newly added genetic material in the CMS DNA.

This thesis also has an important influence on how the CMS experiment is being controlled. The operation of the CMS experiment is shaped by the way the configuration and monitoring services of the TS allow the L1 decision loop to be operated. Continuing with the analogy, if the TS is a specialized brain module, the TS system would be the static neural net and the TS services would be the behavior pattern stored in it. The possibility to adopt new operation capabilities on top of a stable architecture, without requiring major upgrades, fits a long-lived experiment well, just like the human brain, which keeps an almost invariant neural architecture but is able to learn and adapt to its environment.
8.3 Final remarks
This thesis contributes to the CMS knowledge base and, by extension, to the HEP and scientific communities. The motivation and goals, a generic solution and finally a successful design for a distributed control system are discussed in detail. This new CMS genetic material has achieved its full expression and has become a CMS body part, the CMS Trigger Supervisor. This is the maximum impact we could initially expect inside the CMS Collaboration. A more complicated question is the impact of the presented material outside the CMS collaboration. Answering this question is like answering the question of how well the added CMS genetic material will spread. To an important extent, the chances to successfully propagate the knowledge written in this thesis depend on how well adapted CMS is to its environment; in other words, on how successful CMS will be in fulfilling its physics goals.
Appendix A
Trigger Supervisor SOAP API

A.1 Introduction
This appendix specifies the SOAP Application Program Interface (API) exposed by a Trigger Supervisor (TS) cell. The intended audience for this specification is mainly application developers who require the remote execution of cell commands and/or operations (e.g. the developer of the L1 trigger function manager, who uses the TS services provided by the TS central cell).
A.2 Requirements
Command and operation control: The protocol should allow the remote initialization, operation and destruction of cell operations and the execution of commands.
Controller identification: The protocol should enforce the identification of the controller in the cell in order to be able to classify all the logging records as a function of the controller.
Synchronous and asynchronous communication: The protocol should allow both synchronous and asynchronous communication modes. The synchronous mode is intended to ensure exclusive usage of the cell. The asynchronous mode should enable multi-user access and achieve a better overall system performance.
XDAQ data type serialization: The protocol should be able to encode different data types such as integer, string or boolean. The encoding scheme should be compatible with the XDAQ mechanism for encoding/decoding data types to/from XML.
Human and machine interaction mechanism: The protocol should embed a warning message and a warning level in each reply message. The warning information should allow a machine to assess the success level of the request.
A.3 SOAP API

A.3.1 Protocol
The cell SOAP protocol allows both synchronous and asynchronous communication between the controller and the cell. Figure A-1 shows a UML sequence diagram that illustrates the synchronous communication protocol between a controller and a cell. In this case, the controller is blocked until the reply message arrives. This protocol also blocks the cell; therefore, additional requests coming from other controllers will not be served until the cell has replied to the former controller.

Figure A-1: UML sequence diagram of a synchronous SOAP communication between a controller and a cell.

Figure A-2 shows a UML sequence diagram that illustrates the asynchronous communication protocol between a controller and a cell. In the asynchronous case, the controller is blocked only for a few milliseconds per request, until it receives the acknowledge message. The asynchronous reply is received in a parallel thread that listens on the corresponding port. In this case, the overall response time as a function of the number of SOAP request messages (n) grows as O(1) instead of O(n) (synchronous case); the total response time is only slightly longer than the longest remote call. On the cell side, each asynchronous request opens a new thread in which the command is executed. Therefore, several controllers are allowed to execute commands remotely and concurrently in the same cell. Whatever communication mechanism is used, the reply message embeds the warning information. The warning level provides the request success level to the controller; the warning message completes this information with a human-readable message.

Figure A-2: UML sequence diagram of an asynchronous SOAP communication between a controller and a cell.
A.3.2 Request message
Figure A-3 shows an example of a request message. This request executes the command ExampleCommand in a given cell.

Figure A-3: SOAP request message example.

The first XML tag (or just tag) inside the body of the SOAP message (i.e. ExampleCommand) identifies the cell command to be executed in the remote cell. The attribute async takes a boolean value and tells the cell whether this request has to be executed synchronously or asynchronously. The cid attribute is set by the controller, and the same value is set by the cell in the cid attribute of the reply message. This mechanism allows a controller to identify request-reply pairs in an asynchronous communication (cid is not necessary in the synchronous case). The sid attribute identifies a concrete controller. The value of this attribute is added to all log messages generated by the execution of the command; it is therefore possible to trace the actions of each individual controller by analyzing the logging statements. The asynchronous communication modality requires the specification of three additional tags: callbackFun, callbackUrl and callbackUrn. The values of these tags uniquely identify the controller-side callback that will handle the asynchronous reply. When async is equal to false (i.e. synchronous communication), the cid attribute and the callbackFun, callbackUrl and callbackUrn tags are not needed.

The parameters of the command are set using the param tag. The name of the parameter is defined with the name attribute; the type of the parameter is defined with the xsi:type attribute and its value is set inside the tag. Table A-1 presents the list of possible types and their correspondence with the class that facilitates the marshalling process.17
xsi:type attribute      XDAQ class
xsd:integer             xdata::Integer
xsd:unsignedShort       xdata::UnsignedShort
xsd:unsignedLong        xdata::UnsignedLong
xsd:float               xdata::Float
xsd:double              xdata::Double
xsd:Boolean             xdata::Boolean
xsd:string              xdata::String

Table A-1: Correspondence between xsi:type data types and the class that facilitates the marshalling process.

17 In the context of data transmission, marshalling or serialization is the process of transmitting an object across a network connection link in binary form. The series of bytes can be used to deserialize or unmarshall an object that is identical in its internal state to the original one.
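The original request message figure did not survive the conversion of this document to plain text. The following sketch illustrates what such a request could look like: the SOAP envelope boilerplate and namespace prefixes are assumptions, the callbackFun, callbackUrl and callbackUrn values are the fragments that survive from the original figure, and the cid, sid and parameter values are purely illustrative.

<!-- Sketch of a cell SOAP request (illustrative reconstruction, not the
     literal Figure A-3). Envelope structure and namespaces are assumptions;
     attribute and tag names follow the description in this section. -->
<soap-env:Envelope
    xmlns:soap-env="http://schemas.xmlsoap.org/soap/envelope/"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <soap-env:Body>
    <!-- Tag name = cell command to execute; async, cid and sid as described above -->
    <ExampleCommand async="true" cid="3" sid="controller-1">
      <!-- Controller-side callback that will handle the asynchronous reply -->
      <callbackFun>CommandResponse</callbackFun>
      <callbackUrl>http://centralcell.cern.ch:50001</callbackUrl>
      <callbackUrn>urn:xdaq-application:lid=13</callbackUrn>
      <!-- Command parameter; the type follows Table A-1 -->
      <param name="exampleParameter" xsi:type="xsd:string">exampleValue</param>
    </ExampleCommand>
  </soap-env:Body>
</soap-env:Envelope>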
A.3.3 Reply message
Figure A-4 shows an example of a reply message. This message is the asynchronous response sent by the cell after executing the command ExampleCommand requested with the request message of Figure A-3.
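Figure A-4 itself is not reproduced here; only the payload (“Hello World!”), the warning message and the warning level (0) survive from it. Under the same assumptions as the request sketch above, and with response element and warning tag names that are assumptions rather than the literal figure, the asynchronous reply could look roughly as follows:

<!-- Sketch of an asynchronous cell reply (illustrative reconstruction, not
     the literal Figure A-4); element names for the response and the warning
     information are assumptions. -->
<soap-env:Envelope
    xmlns:soap-env="http://schemas.xmlsoap.org/soap/envelope/"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <soap-env:Body>
    <!-- cid echoes the request so the controller can pair request and reply -->
    <ExampleCommandResponse cid="3">
      <!-- Result of the command execution -->
      <payload xsi:type="xsd:string">Hello World!</payload>
      <!-- Human-readable warning message and machine-readable warning level -->
      <warningMessage>Warning message</warningMessage>
      <warningLevel>0</warningLevel>
    </ExampleCommandResponse>
  </soap-env:Body>
</soap-env:Envelope>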
Figure A-7: Reply message to an OpInit request.
A.3.5.2 OpSendCommand
Figure A-8 shows the request message to execute an operation transition.

Figure A-8: Request message to execute an operation transition.

This request example again corresponds to a synchronous request. The value of the operation tag identifies the operation instance in which the controller wishes to execute the transition. The param tag allows updating the value of a given parameter before the execution of the transition. The name of the transition to be executed is given by the value of the command tag.
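The figure is not reproduced here; the fragments that survive from it are the operation identifier my_opid, the transition name configure, the parameter value ts_v13 and the NULL placeholders of the callback tags (synchronous request). A sketch of such a request, with the envelope, the sid value and the parameter name being assumptions, is:

<!-- Sketch of an OpSendCommand request (illustrative reconstruction, not the
     literal Figure A-8); envelope, sid and parameter name are assumptions. -->
<soap-env:Envelope
    xmlns:soap-env="http://schemas.xmlsoap.org/soap/envelope/"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <soap-env:Body>
    <OpSendCommand async="false" sid="controller-1">
      <!-- Operation instance in which the transition is executed -->
      <operation>my_opid</operation>
      <!-- Name of the transition to execute -->
      <command>configure</command>
      <!-- Parameter updated before the transition is executed -->
      <param name="configurationKey" xsi:type="xsd:string">ts_v13</param>
      <!-- Synchronous request: callback tags carry the NULL placeholder -->
      <callbackFun>NULL</callbackFun>
      <callbackUrl>NULL</callbackUrl>
      <callbackUrn>NULL</callbackUrn>
    </OpSendCommand>
  </soap-env:Body>
</soap-env:Envelope>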
Figure A-9 shows the reply message to the request of Figure A-8. The payload tag contains the result of the transition execution, which depends on the customization process; in the example the payload is “Ok” and the warning level is 0. The operation warning object is also embedded in the reply message.
Figure A-11: Reply message to an OpReset request.
A.3.5.4 OpGetState
Figure A-12 shows the request message to get the current state of an operation instance.

Figure A-12: Request message to get the current state of an operation.

Figure A-13 shows the reply message to the request of Figure A-12. The payload tag contains the current operation state18; in the example the returned state is “halted” and the warning level is 0.

18 The current state of an operation which is executing a transition corresponds to an intermediate state represented by a composed name of the form [from state]_[transition name]_[to state].
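The request and reply figures are not reproduced here; the fragments that survive are the operation identifier my_opid, the NULL callback placeholders, the returned state halted and the warning level 0. A sketch of the exchange, with the envelope, the sid value and the response element names being assumptions, is:

<!-- Sketch of an OpGetState request and its reply (illustrative
     reconstruction of Figures A-12 and A-13); envelope, sid and response
     element names are assumptions. -->
<soap-env:Envelope xmlns:soap-env="http://schemas.xmlsoap.org/soap/envelope/">
  <soap-env:Body>
    <OpGetState async="false" sid="controller-1">
      <operation>my_opid</operation>
      <!-- Synchronous request: callback tags carry the NULL placeholder -->
      <callbackFun>NULL</callbackFun>
      <callbackUrl>NULL</callbackUrl>
      <callbackUrn>NULL</callbackUrn>
    </OpGetState>
  </soap-env:Body>
</soap-env:Envelope>

<!-- Corresponding reply: the payload carries the current operation state -->
<soap-env:Envelope xmlns:soap-env="http://schemas.xmlsoap.org/soap/envelope/">
  <soap-env:Body>
    <OpGetStateResponse>
      <payload>halted</payload>
      <warningLevel>0</warningLevel>
    </OpGetStateResponse>
  </soap-env:Body>
</soap-env:Envelope>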
Figure A-15: Reply message to an OpKill request.
Acknowledgements
First of all I want to thank Claudia-Elisabeth Wulz, Joao Varela, Wesley Smith and Sergio Cittolin for granting me the privilege to lead the conceptual design and development effort of the Trigger Supervisor project. My special thanks to Marc Magrans de Abril for being the “always on” motor of the project, for his continuous will to improve, for the never ending flow of ideas and, most important, for being my brother and strongest support. This thesis work could not have reached its full expression without the hard work of so many CMS collaboration members: managers, sub-system cell developers and TS central team members built the bridge between a dream and a reality. I am grateful for the very careful reading of the manuscript by Marco Boccioli, Iñaki García Echebarría, Joni Hahkala, Elisa Lanciotti, Raúl Murillo García, Blanca Perea Solano and Ana Sofía Torrentó Coello; their suggestions improved the English and made this document readable for people other than myself. Many thanks to all my colleagues at the Institute for High Energy Physics in Vienna, as it was always a pleasure to work with them. Last but not least, I wish to thank my family for their unconditional support.
References
[1] P. Lefèvre and T. Petterson (Eds.), “The Large Hadron Collider, conceptual design”, CERN/AC/95-05.
[2] CMS Collaboration, “The Compact Muon Solenoid”, CERN Technical Proposal, LHCC 94-38, 1995.
[3] ATLAS Collaboration, “ATLAS Technical Proposal”, CERN/LHCC 94-43.
[4] ALICE Collaboration, “ALICE - Technical Proposal for A Large Ion Collider Experiment at the CERN LHC”, CERN/LHCC 95-71.
[5] LHCb Collaboration, “LHCb Technical Proposal”, CERN/LHCC 98-4.
[6] CMS Collaboration, “The Tracker System Project, Technical Design Report”, CERN/LHCC 98-6.
[7] CMS Collaboration, “The Electromagnetic Calorimeter Project, Technical Design Report”, CERN/LHCC 97-33. CMS Addendum CERN/LHCC 2002-27.
[8] CMS Collaboration, “The Hadron Calorimeter Technical Design Report”, CERN/LHCC 97-31.
[9] CMS Collaboration, “The Muon Project, Technical Design Report”, CERN/LHCC 97-32.
[10] CMS Collaboration, “The Trigger and Data Acquisition Project, Volume II, Data Acquisition & High-Level Trigger, Technical Design Report”, CERN/LHCC 2002-26.
[11] CMS Collaboration, “The TriDAS Project - The Level-1 Trigger Technical Design Report”, CERN/LHCC 2000-38.
[12] P. Chumney et al., “Level-1 Regional Calorimeter Trigger System for CMS”, in Proc. of Computing in High Energy Physics and Nuclear Physics, La Jolla, CA, USA, 2003.
[13] J. J. Brooke et al., “The design of a flexible Global Calorimeter Trigger system for the Compact Muon Solenoid experiment”, CMS Note 2007/018.
[14] R. Martinelli et al., “Design of the Track Correlator for the DTBX Trigger”, CMS Note 1999/007 (1999).
[15] J. Erö et al., “The CMS Drift Tube Track Finder”, CMS Note (in preparation).
[16] D. Acosta et al., “The Track-Finder Processor for the Level-1 Trigger of the CMS Endcap Muon System”, in Proc. of the 5th Workshop on Electronics for LHC Experiments, Snowmass, CO, USA, Sept. 1999, CERN/LHCC/99-33 (1999).
[17] H. Sakulin, “Design and Simulation of the First Level Global Muon Trigger for the CMS Experiment at CERN”, PhD thesis, University of Technology, Vienna (2002).
[18] C.-E. Wulz, “Concept of the CMS First Level Global Trigger for the CMS Experiment at LHC”, Nucl. Instr. Meth. A 473/3 231-242 (2001).
[19] TOTEM Collaboration, paper to be published in Journal of Instrumentation (JINST).
[20] CMS Trigger and Data Acquisition Group, “CMS L1 Trigger Control System”, CMS Note 2002/033.
[21] B. G. Taylor, “Timing Distribution at the LHC”, in Proc. of the 8th Workshop on Electronics for LHC and Future Experiments, Colmar, France (2002).
[22] V. Brigljevic et al., “Run control and monitor system for the CMS experiment”, in Proc. of Computing in High Energy and Nuclear Physics 2003, La Jolla, CA (2003).
[23] JavaServer Pages Technology, http://java.sun.com/products/jsp/
[24] W3C standard, “Extensible Markup Language (XML)”, http://www.w3.org/XML
[25] W3C standard, “Simple Object Access Protocol (SOAP)”, http://www.w3.org/TR/SOAP
[26] PVSS II system from ETM, http://www.pvss.com
[27] J. Gutleber and L. Orsini, “Software architecture for processing clusters based on I2O”, in Cluster Computing, New York, Kluwer Academic Publishers, Vol. 5, pp. 55-65 (2002).
[28] J. Gutleber, S. Murray and L. Orsini, “Towards a homogeneous architecture for high-energy physics data acquisition systems”, Comput. Phys. Commun. 153, Issue 2 (2003) 155-163.
[29] V. Brigljevic et al., “The CMS Event Builder”, in Proc. of Computing in High-Energy and Nuclear Physics, La Jolla, CA, March 24-28 (2003).
[30] P. Glaser et al., “Design and Development of a Graphical Setup Software for the CMS Global Trigger”, IEEE Transactions on Nuclear Science, Vol. 53, No. 3, June 2006.
[31] Qt Project, http://trolltech.com/products/qt
[32] Python Project, http://www.python.org/
[33] Tomcat Project, http://tomcat.apache.org/
[34] C. W. Fabjan and H. G. Fischer, “Particle Detectors”, Rep. Prog. Phys., Vol. 43, 1980.
[35] R. E. Hughes-Jones et al., “Triggering and Data Acquisition for the LHC”, in Proc. of the International Conference on Electronics for Particle Physics, May 1995.
[36] CMS Collaboration, “CMS Letter of Intent”, CERN/LHCC 92-3, LHCC/I 1, Oct 1, 1992.
[37] K. Holtman, “Prototyping of the CMS Storage Management”, PhD thesis, Technische Universiteit Eindhoven, Eindhoven, May 2000.
[38] CDF II Collaboration, “The CDF II Detector: Technical Design Report”, FERMILAB-PUB-96/390-E, 1996.
[39] J. Gutleber, I. Magrans, L. Orsini and M. Nafría, “Uniform management of data acquisition devices with XML”, IEEE Transactions on Nuclear Science, Vol. 51, No. 3, June 2004.
[40] M. Elsing and T. Schorner-Sadenius, “Configuration of the ATLAS trigger system”, in Proc. of Computing in High Energy and Nuclear Physics 2003, La Jolla, CA (2003).
[41] R. Pressman, “Software Engineering: A Practitioner's Approach”, McGraw-Hill, 2005.
[42] W3C standard, “XML Schema”, http://www.w3.org/XML/Schema
[43] W3C standard, “Document Object Model (DOM)”, http://www.w3.org/DOM/
[44] W3C standard, “XML Path Language (XPath)”, http://www.w3.org/TR/xpath
[45] Apache Project, http://xml.apache.org/
[46] W3C standard, “HTTP - Hypertext Transfer Protocol”, http://www.w3.org/Protocols/
[47] W3C standard, “XSL Transformations (XSLT)”, http://www.w3.org/TR/xslt
[48] G. Dubois-Felsman, “Summary DAQ and Trigger”, in Proc. of Computing in High Energy and Nuclear Physics 2003, La Jolla, CA (2003).
[49] S. N. Kamin, “Programming Languages: An Interpreter-Based Approach”, Reading, MA, Addison-Wesley, 1990.
[50] I. Magrans et al., “Feasibility study of a XML-based software environment to manage data acquisition hardware devices”, Nucl. Instr. Meth. A 546 324-329 (2005).
[51] E. Cano et al., “The Final Prototype of the Fast Merging Module (FMM) for Readout Status Processing in CMS DAQ”, in Proc. of the 10th Workshop on Electronics for LHC Experiments and Future Experiments, Amsterdam, The Netherlands, September 29 - October 03, 2003.
[52] J. Ousterhout, “Tcl and Tk Toolkit”, Reading, MA, Addison-Wesley, 1994.
[53] HAL Project, http://cmsdoc.cern.ch/~cschwick/software/documentation/HAL/index.html
[54] A. De Roeck, J. Ellis and F. Gianotti, “Physics Motivations for Future CERN Accelerators”, CERN-TH/2001-023, hep-ex/0112004.
[55] CMS SLHC web page, http://cmsdoc.cern.ch/cms/electronics/html/elec_web/common/slhc.html
[56] I. Magrans, C.-E. Wulz and J. Varela, “Conceptual Design of the CMS Trigger Supervisor”, IEEE Transactions on Nuclear Science, Vol. 53, No. 2, November 2005.
[57] W3C Web Services Activity, http://www.w3.org/2002/ws/
[58] W3C standard, “Web Services Description Language (WSDL)”, http://www.w3.org/TR/wsdl
[59] I2O Special Interest Group, “Intelligent I/O (I2O) Architecture Specification v2.0”, 1999.
[60] I. Magrans and M. Magrans, “The CMS Trigger Supervisor Project”, in Proc. of the IEEE Nuclear Science Symposium 2005, Puerto Rico, 23-29 October, 2005.
[61] Unified Modeling Language, http://www.rational.com/uml/
[62] Trigger Supervisor web page, http://triggersupervisor.cern.ch/
[63] I. Magrans and M. Magrans, “Trigger Supervisor - User’s Guide”, http://triggersupervisor.cern.ch/index.php?option=com_docman&task=doc_download&gid=32
[64] Trigger Supervisor Framework Workshop, http://triggersupervisor.cern.ch/index.php?option=com_docman&task=doc_download&gid=16
[65] Trigger Supervisor Interconnection Test Workshop, http://triggersupervisor.cern.ch/index.php?option=com_docman&task=doc_download&gid=44
[66] Trigger Supervisor Framework v1.4 Workshop, http://indico.cern.ch/getFile.py/access?resId=0&materialId=slides&confId=24530
[67] Trigger Supervisor Support Management Tool, https://savannah.cern.ch/projects/l1ts/
[68] R. E. Johnson and B. Foote, “Designing reusable classes”, Journal of Object-Oriented Programming, 1(2), pp. 22-35, 1988.
[69] L. P. Deutsch, “Design reuse and frameworks in the Smalltalk-80 system”, in Software Reusability, Volume II: Applications and Experience, pp. 57-72, 1989.
[70] C. Gaspar and M. Dönszelmann, “DIM - A Distributed Information Management System for the DELPHI Experiment at CERN”, in Proc. of the 8th Conference on Real-Time Computer Applications in Nuclear, Particle and Plasma Physics, Vancouver, Canada, June 1993.
[71] R. Jacobsson, “Controlling Electronic Boards with PVSS”, in Proc. of the 10th International Conference on Accelerator and Large Experimental Physics Control Systems, Geneva, 10-14 October 2005, P01.045-6.
[72] B. Franek and C. Gaspar, “SMI++ Object-Oriented Framework for Designing and Implementing Distributed Control Systems”, IEEE Transactions on Nuclear Science, Vol. 52, No. 4, August 2005.
[73] T. Adye et al., “The DELPHI Experiment Control”, in Proc. of the International Conference on Computing in High Energy Physics 1992, Annecy, France.
[74] A. J. Kozubal, L. R. Dalesio, J. O. Hill and D. M. Kerstiens, “A State Notation Language for Automatic Control”, Los Alamos National Laboratory report LA-UR-89-3564, November 1989.
[75] R. Arcidiacono et al., “CMS DCS Design Concepts”, in Proc. of the 10th International Conference on Accelerator and Large Experimental Physics Control Systems, Geneva, Switzerland, 10-14 Oct. 2005.
[76] A. Augustinus et al., “The ALICE Control System - a Technical and Managerial Challenge”, in Proc. of the 9th International Conference on Accelerator and Large Experimental Physics Control Systems, Gyeongju, Korea, 2003.
[77] C. Gaspar et al., “An Integrated Experiment Control System, Architecture and Benefits: the LHCb Approach”, in Proc. of the 13th IEEE-NPSS Real Time Conference, Montreal, Canada, May 18-23, 2003.
[78] Log4j Project, http://logging.apache.org/log4j/docs/index.html
[79] Xerces-C++ Project, http://xml.apache.org/xerces-c/
[80] W3C recommendation, “XML 1.1 (1st Edition)”, http://www.w3.org/TR/2004/REC-xml11-20040204/
[81] Graphviz Project, http://www.graphviz.org/
[82] ChartDirector Project, http://www.advsofteng.com/
[83] Dojo Project, http://dojotoolkit.org/
[84] Cgicc Project, http://www.gnu.org/software/cgicc/
[85] Logging Collector documentation, http://cmsdoc.cern.ch/cms/TRIDAS/RCMS/
[86] J. Gutleber, L. Orsini et al., “HyperDAQ, Where Data Acquisition Meets the Web”, in Proc. of the 10th International Conference on Accelerator and Large Experimental Physics Control Systems, Geneva, Switzerland, 10-14 Oct. 2005.
[87] I2O Special Interest Group, “Intelligent I/O (I2O) Architecture Specification v2.0”, 1999.
[88] ECMA standard-262, “ECMAScript Language Specification”, December 1999.
[89] I. Magrans and M. Magrans, “Enhancing the User Interface of the CMS Level-1 Trigger Online Software with Ajax”, in Proc. of the 15th IEEE-NPSS Real Time Conference, Fermi National Accelerator Laboratory, Batavia, IL, USA, May 2007.
[90] A. Winkler, “Suitability Study of the CMS Trigger Supervisor Control Panel Infrastructure: The Global Trigger Case”, Master thesis, Technical University of Vienna, March 2008.
[91] Scientific Linux CERN 3 (SLC3), http://linux.web.cern.ch/linux/scientific3/
[92] Oracle Corp., http://www.oracle.com/
[93] CAEN bus adapter, model VME64X - VX2718, http://www.caen.it
[94] Apache Chainsaw Project, http://logging.apache.org/chainsaw/index.html
[95] I. Magrans and M. Magrans, “The Control and Hardware Monitoring System of the CMS Level-1 Trigger”, in Proc. of the IEEE Nuclear Science Symposium 2007, Honolulu, Hawaii, October 29 - November 2, 2007.
[96] Web interface of the Trigger Supervisor CVS repository, http://isscvs.cern.ch/cgi-bin/viewcvsall.cgi/TriDAS/trigger/?root=tridas
[97] P. Glaser, “System Integration of the Global Trigger for the CMS Experiment at CERN”, Master thesis, Technical University of Vienna, March 2007.
[98] A. Oh, “Finite State Machine Model for Level 1 Function Managers, Version 1.6.0”, http://cmsdoc.cern.ch/cms/TRIDAS/RCMS/Docs/Manuals/manuals/level1FMFSM_1_6.pdf
[99] IEEE standard C37.1-1994, “IEEE standard definition, specification, and analysis of systems used for supervisory control, data acquisition, and automatic control”.
[100] CMS Collaboration, “CMS physics TDR - Detector performance and software”, CERN/LHCC 2006-001.
[101] A. Afaq et al., “The CMS High Level Trigger System”, IEEE NPSS Real Time Conference, Fermilab, Chicago, USA, April 29 - May 4, 2007.
[102] B. Boehm et al., “Software cost estimation with COCOMO II”, Englewood Cliffs, NJ: Prentice-Hall, 2000. ISBN 0-13-026692-2.
[103] CMS Collaboration, “The CMS Magnet Test and Cosmic Challenge (MTCC Phase I and II) Operational Experience and Lessons Learnt”, CMS Note 2007/005.
[104] CMS Collaboration, “The Compact Muon Solenoid detector at LHC”, to be submitted to Journal of Instrumentation.