
21st International Conference on Computing in High Energy and Nuclear Physics (CHEP2015) IOP Publishing Journal of Physics: Conference Series 664 (2015) 062001 doi:10.1088/1742-6596/664/6/062001

AGIS: Evolution of Distributed Computing information system for ATLAS

A Anisenkov1, A Di Girolamo2, M Alandes2, E Karavakis2, on behalf of the ATLAS Collaboration

1 Budker Institute of Nuclear Physics, Novosibirsk, Russia
2 CERN, European Organization for Nuclear Research, Switzerland

Email: [email protected]

Abstract. ATLAS, a particle physics experiment at the Large Hadron Collider at CERN, produces petabytes of data annually through simulation production and tens of petabytes of data per year from the detector itself. The ATLAS computing model embraces the Grid paradigm and a high degree of decentralization of computing resources in order to meet the ATLAS requirements of petabyte-scale data operations. It has evolved after the first period of LHC data taking (Run-1) in order to cope with the new challenges of the upcoming Run-2. In this paper we describe the evolution and recent developments of the ATLAS Grid Information System (AGIS), developed to integrate configuration and status information about the resources, services and topology of the computing infrastructure used by the ATLAS Distributed Computing applications and services.

1. Introduction
The ATLAS experiment [1] at the Large Hadron Collider successfully collected billions of physics collision events during the Run-1 period (from 2009 to 2013) and is ready to operate in the upcoming LHC Run-2 (from 2015 to 2018), during which tens of petabytes of data will be produced annually. All these petabyte-scale data need to be stored, processed and analyzed. The ATLAS data are distributed, processed and analyzed at more than 130 grid and cloud sites across the world, according to the ATLAS computing model [2], which is based on a worldwide Grid computing infrastructure organized in a set of hierarchical tiers. It provides all members of the ATLAS Collaboration with speedy access to all reconstructed data for analysis, and appropriate access to raw data for calibration and alignment activities. ATLAS Distributed Computing (ADC) is capable of running more than 150 000 concurrent jobs across the grid.

The variety of the ATLAS computing infrastructure requires a central information system to define the topology of computing resources and to store the different parameters and configuration data needed by the various ATLAS software components. A centralized information system helps to resolve the inconsistencies between configuration data stored in the various ATLAS services and in the different external databases. AGIS is the system designed to integrate configuration and status information about the resources, services and topology of the computing infrastructure used by ATLAS Distributed Computing applications and services. Being an intermediate middleware system between clients and external information sources such as the central gLite BDII (Berkeley Database Information Index) [3], the Grid Operations Centre Database (GOCDB) [4] and the Open Science Grid information services (MyOSG, OSG Information Management System, OIM) [5], AGIS defines the relations between experiment-specific used resources and physical distributed computing capabilities.

Being in production during LHC Run-1, AGIS became the central information system for Distributed Computing in ATLAS and it is continuously evolving to fulfill new user requests, enable enhanced operations and follow the extension of the ATLAS computing model.

In this paper we describe the major capabilities of AGIS as the result of system evolution, which can be expressed in a nutshell as the following list of actions shown in Figure 1: define, connect, collect, integrate, declare, complete, operate, distribute and finally play: full support for the ATLAS experiment.

Content from this work may be used under the terms of the Creative Commons Attribution 3.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI. Published under licence by IOP Publishing Ltd.

2. System capabilities, information and data sources
The ADC applications and services require a diversity of common information, configurations, parameters and quasi-static data originally stored in different sources. Sometimes such static data are simply hardcoded in application programs or spread over different configuration files.

The difficulty faced by the ADC applications is that ATLAS computing uses a variety of Grid infrastructures (European Grid Infrastructure (EGI) [6], Open Science Grid (OSG) [7] and the NorduGrid ARC-based infrastructure [8]) which have different information services, application interfaces, communication systems and policies. Therefore, each application has to know about the proper information source, data structure formats, application interfaces, security infrastructures and other low-level technical details to retrieve specific data. Moreover, each application has to implement its own communication logic to retrieve data from external sources, which produces a lot of code and effort duplication.
AGIS defines the topology of Distributed Computing resources and masks the heterogeneity of computing infrastructures by providing a consistent computing model definition for application services and developers.

Another difficulty in the data organization is the missing links between experiment-specific consumed resources and physical Grid capabilities: the data structures and hierarchy of resources defined in external information services do not fit well the ATLAS computing relations. For example, ATLAS can define a few sub-sites behind one physical Grid resource. The primary goal of AGIS is to facilitate, enable and define those missing relationships between the physical computing resources provided by various sites and the ones used by the experiment. By providing an abstraction layer from the physical resources, AGIS allows the experiment to define its own real organization of the resources. This is what we call the connect capability of AGIS.

Figure 1. Key capabilities of the system.


The following two capabilities, namely collect and integrate, underline the features of the system to prepopulate the database content and keep it updated with both static and dynamic information retrieved from external information sources. The system automatically collects the topology relations and site-specific information required by ATLAS, caches it and keeps it up to date, removing the external source as a direct dependency for clients but without duplicating the source information itself. The integrate action is mainly related to dynamic properties like site status, resource state, downtimes, DDM [9] / PanDA [10] blacklisting, PanDA queue dynamic properties and others.

In addition to the collect and integrate functionalities, AGIS also declares within the system various site configuration structures related to the experiment usage of its resources, i.e. Squid configuration, Frontier configuration, perfSONAR configuration and DDM access protocols.

The collected information is stored and exposed in a more convenient way to the ATLAS experiment. Supplementary data models and object relations are introduced in the system to cover any experiment-specific use cases and simplify user operations. AGIS completes, organizes, stores and generalizes the information model for the ADC applications and services.

Finally, AGIS provides functionalities to operate information via the user-oriented Web interface [11] portal and to distribute data through unified interfaces (REST style API and Web user interface).
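The collect and integrate steps described above amount to refreshing a cached copy of dynamic properties without making the external source a runtime dependency for clients. A minimal sketch of this idea follows; the function, field names and the DISABLED default are illustrative, not the actual AGIS collector code:

```python
# Illustrative sketch of an AGIS-style collector agent: it refreshes the
# cached copy of dynamic properties from an external source snapshot.
# All names and the record layout are hypothetical.

def sync_dynamic_properties(cache, source_snapshot, dynamic_fields):
    """Update only the dynamic fields of cached objects from a source snapshot.

    cache           -- dict: object name -> dict of properties (the AGIS copy)
    source_snapshot -- dict: object name -> dict of properties (external source)
    dynamic_fields  -- fields that are allowed to change on each sync
    """
    for name, props in source_snapshot.items():
        # newly injected objects start hidden, as described for AGIS
        entry = cache.setdefault(name, {"state": "DISABLED"})
        for fld in dynamic_fields:
            if fld in props:
                entry[fld] = props[fld]
    return cache

# Example: refresh CE queue status and free space for a known and a new site.
cache = {"CERN-PROD": {"state": "ACTIVE", "status": "online", "free_space_tb": 120}}
snapshot = {
    "CERN-PROD": {"status": "offline", "free_space_tb": 95},
    "NEW-SITE": {"status": "online", "free_space_tb": 10},
}
sync_dynamic_properties(cache, snapshot, dynamic_fields=("status", "free_space_tb"))
```

Only the declared dynamic fields are overwritten, so static, operator-curated attributes survive each synchronization cycle.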

Figure 2. Experiment view of physical resources.


Figure 2 shows the basic information entities stored in AGIS and represents a schematic view of the experiment physical resources. This unique AGIS information collection enables a mapping of the physical resources (GOCDB/OIM sites, CEs, SEs, LFCs, etc.) to the ATLAS activity endpoints (ATLAS sites, PanDA queue workload management endpoints, DDM spacetoken endpoints).

3. System architecture
The AGIS architecture is based on the classic client-server computing model. AGIS uses Django [12], a high-level web application framework written in Python. The Oracle DBMS is used as the default database backend. The object-relational mapping technique built into the Django framework allows access to the content of the database in terms of high-level models, thus avoiding any direct dependence on the relational database system used. Since the system provides various client interfaces, such as an Application Programming Interface (API), a Web user interface (WebUI) and a Command Line Interface (CLI), to retrieve and manage the needed data, no direct access to the database from the clients is required.

Figure 3 shows a schematic view of the AGIS architecture. To automatically populate the database with the information collected from external information sources, a set of collectors run on the main AGIS server. All interactions with the external information services are hidden. Synchronization of the AGIS content with the related sources is performed by agents that periodically communicate with the sources via standard interfaces and update the AGIS database content. For some types of information AGIS itself is the primary source. The clients are able to update information stored in AGIS through the API and the WebUI. The WebUI is mainly used to define new objects, modify existing properties and easily browse experiment-specific resources from various user-friendly views.
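The entity mapping of Figure 2, from physical Grid resources to the experiment-level objects, could be modelled in schematic form as follows. The class and field names are illustrative only, not the actual AGIS schema:

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Illustrative sketch of the AGIS entity mapping: physical Grid resources
# (GOCDB/OIM site, CEs) linked to experiment-level objects (ATLAS site,
# PanDA queues). Names and fields are hypothetical.

@dataclass
class ComputingElement:
    endpoint: str

@dataclass
class GridSite:                 # as registered in GOCDB/OIM
    name: str
    ces: List[ComputingElement] = field(default_factory=list)

@dataclass
class PandaQueue:               # workload management endpoint
    name: str
    ces: List[ComputingElement] = field(default_factory=list)

@dataclass
class AtlasSite:                # experiment view of the same resources
    name: str
    grid_site: Optional[GridSite] = None
    panda_queues: List[PandaQueue] = field(default_factory=list)

# One physical Grid site can back several ATLAS-level queues.
ce = ComputingElement("ce01.example.org:9619")
grid = GridSite("EXAMPLE-SITE", ces=[ce])
atlas = AtlasSite("EXAMPLE", grid_site=grid,
                  panda_queues=[PandaQueue("EXAMPLE_PROD", ces=[ce]),
                                PandaQueue("EXAMPLE_ANALY", ces=[ce])])
```

The key point mirrored here is that the experiment-side objects reference, rather than copy, the physical resources, so several ATLAS queues can share one CE.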
AJAX (Asynchronous JavaScript and XML) technology is used to offer efficient interactive access through the WebUI. The REST (Representational State Transfer) [13] style API and command line tools further help end users and developers to use the system conveniently.

From the point of view of data synchronization, the information stored in AGIS can be classified as static or dynamic. Dynamic data implies regular synchronization against the information source from which it is collected. Technically it can be a set of dynamic properties (for example, a CE queue's status or the free space at a site) or complete objects (e.g. downtime entries). Objects automatically injected into the system (e.g. GOCDB/OIM sites and services) are registered in the database in the DISABLED state (hidden from data export through the API); to make such an object visible, the user has to activate it via the WebUI forms.

The variety of the developed JSON (JavaScript Object Notation) REST services, together with newer filtering functionality which allows easy selection and use of multiple structures of output data (JSON presets suitable for a specific client), helps to increase the number of clients using AGIS in production. Today, the RESTful client API allows users to retrieve data in JSON and, for some applications, in XML format. For instance, the full ATLAS topology can be exported either in XML format or in JSON structures suitable for the ADC applications.

4. Recent developments
The AGIS main concept of distinguishing between 'used by' and 'hosted by' site resources (the define and connect capabilities explained above) easily allows the transparent declaration of any virtual resource, such as Cloud and High-Performance Computing (HPC) resources, which have recently become widely used in ATLAS computing.
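The JSON presets with filtering support, mentioned in the previous section, boil down to a field-selection layer over the exported objects. A minimal sketch, where the preset and field names are invented for illustration and are not the real AGIS presets:

```python
import json

# Hypothetical field presets: each client receives only the slice of the
# queue description it needs, instead of the full object.
PRESETS = {
    "brief": ("name", "status"),
    "schedconfig": ("name", "status", "corecount", "maxmemory"),
}

def export_queues(queues, preset="brief"):
    """Serialize queue records to JSON, keeping only the preset's fields."""
    fields = PRESETS[preset]
    return json.dumps([{f: q[f] for f in fields if f in q} for q in queues])

queues = [{"name": "EXAMPLE_PROD", "status": "online",
           "corecount": 8, "maxmemory": 16000}]
print(export_queues(queues, "brief"))
# → [{"name": "EXAMPLE_PROD", "status": "online"}]
```

Defining the presets server-side keeps the clients simple: they pick a preset instead of post-filtering the full export themselves.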
Following the evolution of the ATLAS computing model, AGIS is able to define a new top-level Site entry (see Figure 2, at the same level as for GOCDB or OIM) related to HPC or Cloud resources; all the remaining object definitions of computing resources then remain the same as for regular grid resources. A special resource_type attribute at the level of the PandaQueue object makes it possible for the PanDA system to identify non-grid resources and interpret them appropriately.
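A workload manager could branch on the resource_type attribute roughly as follows. The attribute name comes from the text; the dispatch values and backend names are an illustrative sketch, not PanDA's actual implementation:

```python
# Sketch of how a PanDA-like system might branch on the AGIS resource_type
# attribute to treat grid, cloud and HPC queues differently.
# The backend names returned here are hypothetical.

def submission_backend(queue):
    """Pick a submission strategy based on the queue's resource type."""
    resource_type = queue.get("resource_type", "GRID")
    if resource_type == "GRID":
        return "condor-g"        # conventional grid CE submission
    if resource_type == "CLOUD":
        return "vm-provisioner"  # provision virtual machines on demand
    if resource_type == "HPC":
        return "batch-edge"      # submit via the HPC centre's edge service
    raise ValueError(f"unknown resource_type: {resource_type}")

print(submission_backend({"name": "EXAMPLE_HPC", "resource_type": "HPC"}))
# → batch-edge
```

Because the attribute lives on the PandaQueue object, all other object definitions stay identical across grid and non-grid resources, exactly as described above.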


The evolution of the ATLAS computing model and its continuous extension required immediate schema updates and the implementation of new functionalities on the AGIS side. An example of recently implemented new types of services in AGIS is the support for the declaration of HTCondor-CE [14] computing elements, which use the next-generation gateway software for OSG sites. The AGIS collectors have been upgraded to retrieve information about HTCondor-CE entries defined in BDII, and the corresponding WebUI views have been updated to let site administrators also manage HTCondor-CE service definitions directly within AGIS.

Figure 3. Schematic view of the AGIS client-server architecture.

Figure 4 illustrates another AGIS schema update, related to the consolidation of the computing resources definition for the ATLAS workload management system. The key concepts of the consolidation consist of preventing data duplication by introducing a parent template object, removing redundant many-to-many PandaResource-PandaQueue relations (historically defined to associate many CEs to the same PandaResource) and, in the end, simplifying operations. The template-based PandaQueue definition allows the inheritance of schedconfig parameters and helps in the consistent declaration of multicore, analysis and production resources behind a PandaSite. Any schedconfig parameter can be shared or overwritten by a child PandaQueue object. Merging objects into a single entry also makes operation with SW Release tags more effective. Moreover, the new implementation provides a functional way to resolve the default PandaQueue instance for a given PandaSite; in particular, it helped to incorporate the mapping of the high-memory and multicore resources required by HammerCloud [15]. The implemented consolidation became the first step in the evolution of the computing resource definition in AGIS. The final goal, which is currently under development, is to implement a completely dynamic computing resource definition for PanDA.

Many other functional updates of the WebUI and the API have been implemented and delivered into production to enhance user operations and to improve the data management activity. In particular, this includes bulk support for regular operations on DDMEndpoint and PandaQueue objects in the WebUI, as well as the development of a new REST style API to apply bulk operations programmatically.
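The template mechanism described above amounts to parameter inheritance with child-level overrides. Schematically (the parameter names below are illustrative, not the real schedconfig schema):

```python
# Sketch of template-based PandaQueue definition: a child queue inherits the
# parent template's schedconfig parameters and may override any of them.
# Parameter names are illustrative assumptions.

def resolve_schedconfig(template, child_overrides):
    """Merge a parent template with child-level overrides (child wins)."""
    params = dict(template)      # start from the shared template
    params.update(child_overrides)  # child parameters take precedence
    return params

# One PandaSite template, specialised into multicore and analysis queues.
site_template = {"cloud": "CERN", "maxtime": 86400, "corecount": 1}
mcore_queue = resolve_schedconfig(site_template, {"corecount": 8})
analy_queue = resolve_schedconfig(site_template, {"maxtime": 43200})
```

A change to the shared template then propagates to every child queue automatically, which is precisely the duplication-avoidance benefit of the consolidation.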
Based on the new REST style bulk API for DDM objects, an automated blacklisting service for DDMEndpoint storage resources has been developed and moved into production. In its current implementation, the DDM blacklisting service takes into account site downtime information and available storage space data (physical free space or user quota limitation) and translates it into special metrics (namely AGIS, DISKSPACE and QUOTASPACE, respectively) to be submitted into AGIS via the REST API. ADC experts and shifters also use the blacklisting API to manually enable or disable space tokens through AGIS.
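The decision logic of the blacklisting service can be sketched as follows. The metric names are those given in the text (with the downtime-to-AGIS mapping inferred from the text's ordering); the threshold value and the endpoint record layout are assumptions made for this example:

```python
from datetime import datetime, timezone

# Illustrative sketch of the DDM endpoint blacklisting decision: site
# downtimes and storage space data are translated into the AGIS, DISKSPACE
# and QUOTASPACE metrics. Threshold and record layout are assumed.

FREE_SPACE_MIN_TB = 5.0  # assumed threshold, not an ATLAS-defined value

def blacklist_metrics(endpoint, now=None):
    """Return the metrics under which the endpoint should be blacklisted."""
    now = now or datetime.now(timezone.utc)
    metrics = []
    # active site downtime -> AGIS metric (mapping inferred from the text)
    if any(start <= now <= end for start, end in endpoint.get("downtimes", [])):
        metrics.append("AGIS")
    # low physical free space -> DISKSPACE metric
    if endpoint.get("free_space_tb", float("inf")) < FREE_SPACE_MIN_TB:
        metrics.append("DISKSPACE")
    # exhausted user quota -> QUOTASPACE metric
    if endpoint.get("quota_left_tb", float("inf")) < FREE_SPACE_MIN_TB:
        metrics.append("QUOTASPACE")
    return metrics
```

An endpoint with no triggered metric stays enabled; each triggered metric would be submitted to AGIS via the REST bulk API, and shifters can still override the result manually.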

Figure 4. Template-based PandaQueue object definition. Coherent descriptions of Production, Analysis, High memory and Multi Core PanDA queues.

5. Conclusions
AGIS has been developed to provide, in a single portal, the topology and resources information needed to configure the ATLAS computing applications. Being in production during LHC Run-1, AGIS became the central information system for Distributed Computing in ATLAS and represents the primary source of information for all the ADC applications and services.

The AGIS functionalities allow the ADC community, experts and shifters to configure and operate the production ADC systems and Grid applications. AGIS is continuously extending its data structures to follow the new requirements and needs of the ADC, fulfill new user requests, enable enhanced operations and follow the extension of the ATLAS computing model.

The AGIS design and the basic principles included in the architecture allow the use of the AGIS core part by several experiments. AGIS is evolving towards a common information system not coupled to a specific experiment. As a result, CMS is currently evaluating AGIS as the primary information source throughout the experiment to describe its topology (sites and services, both SEs and CEs). AGIS can be used within CMS to map the CMS network topology (PhEDEx nodes to CMS sites to SE names to gridFTP servers), store perfSONAR definitions, store user-to-physics-group associations and quotas on groups, store information about disk space at sites and manage the configuration of Condor glidein factories.

References
[1] The ATLAS Collaboration 2008 The ATLAS Experiment at the CERN Large Hadron Collider JINST 3 S08003
[2] Jones R and Barberis D 2008 The ATLAS Computing Model J. Phys.: Conf. Ser. 119 072020


[3] Grid Information system, http://gridinfo.web.cern.ch
[4] Grid Configuration Database (GOCDB), https://wiki.egi.eu/wiki/GOCDB
[5] Open Science Grid Information service (MyOSG), http://myosg.grid.iu.edu/about
[6] European Grid Infrastructure (EGI), http://www.egi.eu/
[7] Open Science Grid (OSG), http://www.opensciencegrid.org/
[8] NorduGrid project, http://www.nordugrid.org/
[9] Branco M on behalf of the ATLAS Collaboration 2008 Managing ATLAS data on a petabyte-scale with DQ2 J. Phys.: Conf. Ser. 119 062017
[10] Maeno T on behalf of the ATLAS Collaboration 2008 PanDA: Distributed production and distributed analysis system for ATLAS J. Phys.: Conf. Ser. 119 062036
[11] AGIS WebUI portal, http://atlas-agis.cern.ch
[12] Django project, http://www.djangoproject.com
[13] REST interfaces, http://en.wikipedia.org/wiki/Representational_state_transfer
[14] HTCondor-CE documentation page, https://twiki.opensciencegrid.org/bin/view/Documentation/Release3/HTCondorCEOverview
[15] Elmsheuser J on behalf of the ATLAS Collaboration 2014 Grid site testing for ATLAS with HammerCloud J. Phys.: Conf. Ser. 513 032030
