Distributed Database Services - a Fundamental Component of the WLCG Service for the LHC Experiments - Experience and Outlook

Maria Girone
CERN, 1211 Geneva 23, Switzerland
[email protected]

Abstract. Originally deployed at CERN for the construction of LEP, relational databases now play a key role in the experiments' production chains, from online acquisition through to offline production, data distribution, reprocessing and analysis. They are also a fundamental building block for the Tier0 and Tier1 data management services. We summarize the key requirements in terms of availability, performance and scalability and explain the primary solutions that have been deployed both online and offline, at CERN and outside, to meet these requirements. We describe how the distributed database services deployed in the Worldwide LHC Computing Grid met the challenges of 2008 - the two phases of CCRC'08, together with data taking from cosmic rays and the short period of LHC operation. Finally, we list the areas - both in terms of the baseline services as well as key applications and the data life cycle - where enhancements have been required for 2009, and summarize the experience gained from 2009 data taking readiness testing - aka "CCRC'09" - together with a prognosis for 2009 data taking.

1. Introduction
This paper describes the role that databases play in the overall Worldwide LHC Computing Grid (WLCG) [1], the key technologies that are in use and the service deployment model. It describes in detail the preparations taken for the foreseen run of the LHC in 2008, including the Common Computing Readiness Challenge (CCRC'08), and those foreseen for the quasi-combined 2009/2010 run.

The role of databases in WLCG is best understood through the well-known functional view shown below. Although none of these so-called "functional blocks" mentions database services explicitly, they are fundamental to all areas. This can be seen through the lists of "Critical Services" prepared by the LHC experiments for the WLCG Overview Board, where they feature explicitly amongst the most critical services of all experiments except ALICE (for whom online databases are nonetheless essential). The criticality is highlighted by the requests from more than one experiment for recovery of the service with a maximum delay of 30 minutes - something that clearly cannot be guaranteed for complex problems, but that nevertheless requires an architecture focused on redundancy and high availability.

The database architecture involved in our services [2] is based on Oracle's Maximum Availability Architecture (MAA) [3], coupled with additional elements for handling middle-tier applications that are beyond the scope of this paper. A unique feature of the deployment model is the mapping of MAA onto the WLCG tiers, where specific responsibilities are assigned to the different

tiers (as well as to specific sites), with variations across the different experiments that reflect their computing models.

Figure 1 – The WLCG Functional Model

2. The Role of Databases for LHC Data Processing and Analysis
Without providing an exhaustive list, and depending on the specific details of the individual experiments' computing models, database services are required by the experiments' online systems, for most if not all aspects of offline processing, and for simulation as well as analysis activities. Specific examples include the PVSS Supervisory Control and Data Acquisition (SCADA) system, detector conditions (e.g. COOL), alignment and geometry applications, Grid Data Management (LCG File Catalog, File Transfer Service) and Storage Management (e.g. CASTOR + SRM) services, as well as Grid infrastructure and operations tools (GridView, SAM, Dashboards, VOMS). Most, if not all, of these applications and the corresponding services are described by dedicated papers in the proceedings of this conference. It is not the purpose of this paper to cover these applications in detail - the point to retain is the ubiquitous nature of database services in the WLCG environment, and of those implemented on Oracle in particular.

3. Distributed Database Services
In the WLCG computing model the various functional blocks are mapped to the tiers in an experiment-dependent fashion. Broadly speaking, data acquisition, first-pass processing and distribution of data are the responsibility of the Tier0; reprocessing, distribution of analysis and other data to the Tier2s, and custodial storage of simulated data from the Tier2s are the responsibility of the Tier1s; and the simulation of data (physics processes plus detector response to the final-state particles) and support for analysis are the role of the Tier2s. As these functional blocks involve database services, the distributed nature of this tier model must be taken into account. Given the service levels that can realistically be supported by the various sites, (Oracle) database related services are largely confined to the Tier0 and Tier1s. In this context, distributed database services include not only agreeing on common deployment architectures and procedures - from which significant savings can be expected -

but also support for data distribution and replication for specific applications. The deployment architecture of the WLCG Distributed Database Operations activity (formerly the 3D project) [4] is shown schematically in figures 2 and 3 - both generically and for the specific case of the ATLAS experiment. Specific details of the building blocks and technologies involved are given below.
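As an illustration only, the experiment-dependent mapping of functional blocks onto tiers described above can be thought of as a simple lookup structure. The entries below paraphrase the generic description in the text and do not reflect any single experiment's computing model.

```python
# Illustrative only: a generic mapping of WLCG tiers to the responsibilities
# described in the text. Individual experiments' computing models differ in detail.
TIER_RESPONSIBILITIES = {
    "Tier0": [
        "data acquisition",
        "first-pass processing",
        "data distribution to the Tier1s",
    ],
    "Tier1": [
        "reprocessing",
        "distribution of analysis and other data to the Tier2s",
        "custodial storage of simulated data from the Tier2s",
        "Oracle database services (together with the Tier0)",
    ],
    "Tier2": [
        "simulation (physics processes and detector response)",
        "support for analysis",
    ],
}

for tier, duties in TIER_RESPONSIBILITIES.items():
    print(f"{tier}: {', '.join(duties)}")
```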

Figure 2 – The Distributed Database Deployment Architecture in WLCG

In this generic description, different techniques are used to make data available between online and offline and between the Tier0 and Tier1 sites. Three of the four LHC experiments have chosen Oracle-based technology between their online and offline systems; two of them have chosen Oracle solutions for the distribution of data between the Tier0 and Tier1s, with CMS opting for Frontier [5] as a caching solution for this purpose. This latter technique is also being evaluated by ATLAS for the distribution of data to Tier2 sites.
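To make the caching idea concrete, the snippet below sketches the general pattern of serving repeated read-only queries from a cache placed in front of the central database. It illustrates the concept only; it is not the actual Frontier client, protocol or API, and the query text and backend stub are invented for the example.

```python
# Conceptual sketch of read-only query caching in front of a database (the general
# idea behind a caching layer such as Frontier). The query, the backend stub and the
# cache policy are illustrative; this is not the Frontier protocol or client API.
import hashlib

class QueryCache:
    def __init__(self, backend):
        self.backend = backend          # callable that really executes the query
        self.store = {}                 # cache keyed by a digest of the query text

    def query(self, sql: str):
        key = hashlib.sha1(sql.encode()).hexdigest()
        if key in self.store:
            print("cache hit")
            return self.store[key]
        print("cache miss -> backend database")
        result = self.backend(sql)
        self.store[key] = result        # identical queries from other clients reuse this
        return result

def fake_backend(sql):                  # stand-in for the central Oracle service
    return [("run", 1234), ("tag", "COND-DEMO-01")]

cache = QueryCache(fake_backend)
cache.query("SELECT * FROM conditions WHERE run = 1234")   # miss: goes to the database
cache.query("SELECT * FROM conditions WHERE run = 1234")   # hit: served from the cache
```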

Figure 3 - The ATLAS Distributed Database Architecture

The ATLAS experiment currently uses the "Oracle Streams" technology - described in more detail below - both between online and offline as well as to 10 Tier1 sites, as shown in figure 3. It is also being considered for other applications and/or specialised sites, such as their "Muon calibration centres".

4. The Database Developer and Administrator Communities
In contrast to the LEP era - where the number of database applications and corresponding developers was rather small - the community has grown to the extent that database developers' workshops attract up to 100 attendees and database administrators' (DBA) workshops between 20 and 30. The strategy chosen is strongly influenced by that advocated by Tom Kyte in his popular "Effective Oracle by Design" book [6]. This book underlines the importance of the partnership between the developer and the administrator, as well as key pitfalls that should be avoided. As the production operations continuation of the WLCG Distributed Database Deployment (3D) project, regular phone conferences are held between the DBAs at the various sites, complemented by regular workshops and a daily report to the WLCG operations meeting. The emphasis is on homogeneity, with the sharing of policies and procedures, as well as standardized setups, strongly encouraged. Recently, WLCG has made a recommendation of one full-time (experienced) DBA per WLCG Tier1 site. The consequences of not having sufficient resources in this area, or of not following standard procedures, correlate strongly with service instabilities and outages at the sites concerned: it would be undesirable to attempt to prove this empirically, but it is inconceivable that such widely distributed services could have been operated at the WLCG scale without such a strategy of commonality.

Figure 4 - Impact of New Hardware and Query Optimization on ATLAS Conditions Usage

Figure 4 shows two aspects of this collaborative work. The sharp drop in CPU utilization seen on the left-hand side of the plot corresponds to a hardware upgrade. After this time the CPU load can be seen to grow steadily, negating much of the benefit of the new hardware, until a query optimization was introduced into the corresponding application ("COOL"). In particular, concurrency and stress testing are essential to ensure that applications will behave correctly under realistic workloads

and this requires close work between the application developers, the users and the DBAs responsible for providing the service.
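By way of illustration, the following is a minimal sketch of the kind of concurrency test meant here: several client threads repeatedly issue the same read query and latency percentiles are collected. The table, the query and the use of sqlite3 as a stand-in database are purely illustrative; a real test would run against the production Oracle schema with the experiment's actual access pattern.

```python
# A minimal concurrency/stress-test sketch: N worker threads repeatedly run the
# same read query and report latency percentiles. The sqlite3 database and the
# example "conditions" table are placeholders for the production schema.
import sqlite3
import threading
import time
import statistics

DB_FILE = "stress_demo.db"       # placeholder database file
N_THREADS = 8                    # concurrent "clients"
N_QUERIES_PER_THREAD = 200

def setup():
    conn = sqlite3.connect(DB_FILE)
    conn.execute("CREATE TABLE IF NOT EXISTS conditions (iov INTEGER PRIMARY KEY, payload TEXT)")
    conn.executemany("INSERT OR REPLACE INTO conditions VALUES (?, ?)",
                     [(i, f"payload-{i}") for i in range(10000)])
    conn.commit()
    conn.close()

def worker(latencies):
    conn = sqlite3.connect(DB_FILE)          # one connection per client thread
    for i in range(N_QUERIES_PER_THREAD):
        t0 = time.perf_counter()
        conn.execute("SELECT payload FROM conditions WHERE iov = ?", (i % 10000,)).fetchone()
        latencies.append(time.perf_counter() - t0)
    conn.close()

if __name__ == "__main__":
    setup()
    latencies = []
    threads = [threading.Thread(target=worker, args=(latencies,)) for _ in range(N_THREADS)]
    t_start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    elapsed = time.perf_counter() - t_start
    print(f"{N_THREADS * N_QUERIES_PER_THREAD} queries in {elapsed:.2f}s")
    print(f"median latency: {statistics.median(latencies)*1000:.2f} ms, "
          f"p95: {sorted(latencies)[int(0.95 * len(latencies))]*1000:.2f} ms")
```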

5. Key Service Requirements and Technologies
The key service requirements and the corresponding technologies that have been selected are listed below:
• Data availability, reliability and scalability, addressed by Oracle Real Application Clusters (RAC) and Automatic Storage Management (ASM);
• Data distribution, handled primarily using Oracle Streams;
• Data protection, built on Oracle Recovery Manager (RMAN), with IBM's Tivoli Storage Manager (TSM) as backend. Oracle Data Guard is also used to provide additional protection against human errors, to facilitate disaster recovery, as well as for other key service issues.

Whilst the overall setup has been described in previous CHEP papers, in summary there are some 25 Oracle RACs at CERN and around 20 at the WLCG Tier1 sites. These all have similar hardware setups and the same operating system and Oracle versions. The rolling upgrade capability offered by RAC - where one node can be taken out of service and upgraded without bringing down the entire service to the users - has been essential for service continuity. This is shown by a service unavailability of 0.04% (3.5 hours of downtime per year) measured during 2008, as compared with a server unavailability of 0.22% (19 hours per year). Neither this level of service nor the ability to support today's number of applications and users would have been affordable with the "non-RAC" model previously deployed (for financial and technical reasons), using CERN disk servers as database machines.
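As a quick sanity check on the availability figures quoted above, the conversion between a fractional unavailability and hours of downtime per year is straightforward; the short sketch below simply reproduces the arithmetic.

```python
# Convert a fractional unavailability into hours of downtime per (non-leap) year.
HOURS_PER_YEAR = 365 * 24  # 8760 hours

def downtime_hours(unavailability_percent: float) -> float:
    return unavailability_percent / 100.0 * HOURS_PER_YEAR

# Figures quoted in the text for 2008:
print(f"service: {downtime_hours(0.04):.1f} h/year")   # ~3.5 hours
print(f"server:  {downtime_hours(0.22):.1f} h/year")   # ~19 hours
```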

Figure 5 – Real Application Cluster Architecture for the CERN Physics Database Services

The current RAC configuration at CERN is shown in figure 5 and is built upon dual-CPU, quad-core DELL 2950 servers (Intel 5400-series "Harpertown" at 2.33 GHz, 16 GB of memory), with dual power supplies, mirrored local disks, four network interfaces (two private, two public), redundant Fibre Channel switches, dual HBAs and "RAID 1+0 like" storage with ASM.

In conjunction with Oracle RAC, Oracle Streams is the key technology used within the Oracle database environment for propagating changes between databases, both between the online and offline environments and between sites. Using this technology, shown in figure 6, database changes are captured from the redo log and propagated asynchronously as Logical Change Records (LCRs). Important service features include so-called "downstream capture boxes", which decouple any issues with distributing or applying changes to the Tier1 sites from the Tier0 services, as well as a largely automated procedure for handling longer (effective) downtimes of individual (or multiple) Tier1 sites. Downtimes of several days (depending on the update load) can be handled without any specific action, whereas longer outages - such as those due to the fire at the ASGC Tier1 site - require the site to be removed from the update process and resynchronised by a DBA, using transportable tablespaces, once the service returns to production. Any Tier1 site can be the "re-synchronisation source" for the target Tier1 - it only requires that the corresponding service(s) at that site are put into read-only mode during the period of recovery (the final catch-up being handled automatically once the site is re-inserted into the Streams environment).
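The decoupling provided by asynchronous, per-destination propagation can be illustrated with the conceptual sketch below: each destination site has its own queue of change records, so a slow or unreachable Tier1 simply accumulates a backlog without holding up the others. This is not Oracle Streams code; the site names and the LCR structure are invented for the example.

```python
# Conceptual sketch (not Oracle code): asynchronous, per-destination propagation of
# change records, illustrating why a slow or unreachable Tier1 only accumulates a
# backlog in its own queue and does not hold up the other sites.
from collections import deque
from dataclasses import dataclass

@dataclass
class LCR:                      # stand-in for a Logical Change Record
    scn: int                    # system change number of the captured change
    payload: str

class DownstreamCapture:
    """Captures changes once and keeps one independent queue per destination site."""
    def __init__(self, sites):
        self.queues = {site: deque() for site in sites}
        self.unreachable = set()

    def capture(self, lcr: LCR):
        # One capture, fan-out to every destination queue (asynchronous propagation).
        for q in self.queues.values():
            q.append(lcr)

    def propagate(self):
        # Push queued LCRs to every reachable site; unreachable sites just back up.
        for site, q in self.queues.items():
            if site in self.unreachable:
                continue
            while q:
                lcr = q.popleft()
                print(f"apply {lcr.scn} at {site}")

# Example: "T1-B" is down; its backlog grows while the other sites stay current.
streams = DownstreamCapture(["T1-A", "T1-B", "T1-C"])
streams.unreachable.add("T1-B")
for scn in range(3):
    streams.capture(LCR(scn, f"conditions update {scn}"))
streams.propagate()
print("backlog for T1-B:", len(streams.queues["T1-B"]))
```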

Figure 6 - Propagation of Changes using Oracle Streams

Finally, the overall architecture of Oracle Data Guard is shown in figure 7. Whilst there are some similarities with Oracle Streams, both globally and in the implementation details, there are a number of key differentiators. Most importantly, it works at the level of a complete database and targets different use cases. In particular, it is more oriented towards service continuity and protection - even if this in itself involves data distribution (e.g. by having the standby database in a different location to the primary). In particular it targets:
• Limiting database downtime in the event of:
  • multi-point hardware failures;
  • a wide range of corruptions;
  • disasters;
  • hardware upgrades;
  • human errors (within the configured redo apply lag of 24 hours);
• Ad-hoc testing of major schema upgrades or data reorganization on the standby.
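The protection against human errors rests on the configured apply lag: the standby is deliberately kept behind the primary, so a mistaken change noticed within the lag window never reaches it. The sketch below is a conceptual model of that idea only, not Oracle code; the record contents and helper names are invented for illustration.

```python
# Conceptual sketch (not Oracle code) of a standby kept behind the primary by a
# configured apply lag: redo is only applied once it is older than the lag, so a
# mistaken change spotted within that window can be discarded before it is applied.
import time
from collections import deque

APPLY_LAG_SECONDS = 24 * 3600   # the 24-hour lag mentioned in the text

class DelayedStandby:
    def __init__(self, lag_seconds=APPLY_LAG_SECONDS):
        self.lag = lag_seconds
        self.redo_queue = deque()   # (timestamp, change) shipped from the primary
        self.applied = []

    def ship_redo(self, change, ts=None):
        self.redo_queue.append((ts if ts is not None else time.time(), change))

    def discard(self, predicate):
        """Drop not-yet-applied redo matching a predicate (e.g. a known human error)."""
        self.redo_queue = deque((ts, c) for ts, c in self.redo_queue if not predicate(c))

    def apply_due_redo(self, now=None):
        now = now if now is not None else time.time()
        while self.redo_queue and now - self.redo_queue[0][0] >= self.lag:
            self.applied.append(self.redo_queue.popleft()[1])

# Example: an accidental "DROP TABLE" shipped an hour ago can still be discarded.
standby = DelayedStandby()
standby.ship_redo("INSERT conditions ...", ts=0)
standby.ship_redo("DROP TABLE conditions", ts=3600)
standby.discard(lambda change: change.startswith("DROP TABLE"))
standby.apply_due_redo(now=APPLY_LAG_SECONDS + 3600)
print(standby.applied)   # the accidental drop never reaches the standby
```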

Of significant note, Data Guard was used in the May run of CCRC'08 to allow the migration to the new hardware to proceed on time - even though some problems had been seen during the burn-in period - by retaining the previous hardware in standby mode until the concerns had been resolved.

Figure 7 – Oracle Data Guard

6. Experience from 2008
Once again, 2008 was expected to be dominated by final preparations for the first data taking from proton-proton collisions in the LHC. As such, the programme of work and the schedule were fixed by the CCRC'08 exercise in the first half of the year and by LHC commissioning and startup in the second. Significant milestones during the year were the upgrade to quad-core based servers for the Tier0 services as well as for many of those at the Tier1s, standardizing on the Oracle version (10.2.0.4, RHEL4, x86, 64-bit), continued emphasis on robust pre-production testing designed to help minimize or avoid service problems at the production stage, as well as taking over responsibility for the experiments' online database services (for which additional funding from the experiments themselves was made available to fund the needed DBAs). Overall, the services met the requirements of CCRC'08, the cosmic runs and LHC start-up. The associated service enhancements can be summarized as follows (area by area):
• Data Guard: at CERN, a physical standby for all the online and offline experiments' production database clusters, validated during CCRC'08 and deployed in production prior to the LHC start-up; and at CNAF, a logical standby for the ATLAS LFC application, to fail over from CNAF to INFN-ROMA1 [7].
• Streams: automatic split & merge procedures have been established for isolating a site and re-synchronizing it using transportable tablespaces.
• Backup and recovery: we have put in place automated test recoveries, improved the on-tape backup performance from 30 MB/s to 70 MB/s throughput with a single channel, and set up on-disk image copies to speed up recovery from physical and logical data corruption. LAN-free tape backup tests have also been performed, reaching a throughput of about 200 MB/s.
• Monitoring: a coherent tool for database (online, offline & standby) and Streams monitoring/alerts has been set up and fully integrated in our service infrastructure. This tool has now been extended to also display the database status of the Tier1 sites.
• Security: Oracle Critical Patch Updates are regularly validated and applied following a policy document [8] agreed at the Grid Deployment Board in 2006. We also use iptables firewalls to block access to the DB servers with the exception of the DB listener port (which has also been chosen to be a non-default one). For some database services (notably the online DBs) we additionally limit access to selected experiment networks only.
• Account policies: we are enforcing the use of separate reader and writer accounts. The owner account is typically locked on production databases and we have set up an automatic tool for temporary unlocking (a sketch of such a helper follows this list).
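A minimal sketch of what such a temporary-unlock helper might look like is given below. The connection string, account names, duration and the use of the cx_Oracle driver are assumptions made for illustration; the actual CERN tool is not reproduced here. The ALTER USER ... ACCOUNT UNLOCK / LOCK statements are standard Oracle SQL.

```python
# Hypothetical sketch of a "temporary unlock" helper for a locked owner account.
# Connection details, account name and duration are illustrative placeholders.
import time
import cx_Oracle   # assumes the cx_Oracle driver is installed

DSN = "db-cluster.example.org/prod"        # placeholder connection string
ADMIN_USER = "admin_user"                  # placeholder privileged account
ADMIN_PASSWORD = "changeme"

def temporarily_unlock(owner_account: str, duration_seconds: int) -> None:
    """Unlock a schema owner account for a limited time window, then re-lock it."""
    conn = cx_Oracle.connect(ADMIN_USER, ADMIN_PASSWORD, DSN)
    cur = conn.cursor()
    try:
        cur.execute(f"ALTER USER {owner_account} ACCOUNT UNLOCK")
        print(f"{owner_account} unlocked for {duration_seconds} s")
        time.sleep(duration_seconds)       # window for the scheduled intervention
    finally:
        cur.execute(f"ALTER USER {owner_account} ACCOUNT LOCK")
        print(f"{owner_account} locked again")
        cur.close()
        conn.close()

if __name__ == "__main__":
    temporarily_unlock("ATLAS_COOL_OWNER_DEMO", 3600)   # hypothetical account name
```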

7. Preparing for 2009-2010 running
It is now expected that the 2009 run of the LHC will start late in the year and continue almost without interruption into the first part of 2010, followed back-to-back by a "standard" 2010 run extended into the autumn. At the WLCG workshop prior to CHEP 2009 a "Scale Test for Experiment Production" - STEP'09 - was agreed, with a timescale of June 2009 (at least for the Tier1 reprocessing activities). This will be followed almost immediately by data taking by the experiments with cosmic events (from the beginning of July) - it being understood that the service is already in full production mode. Some key issues to be addressed during the coming year include:
• Hardware and operating system upgrades:
  – study the usage of system resources (CPU, random read IO, available storage space) and correlate it with the experiments' requests;
  – evaluation of RHEL5;
  – evaluation of future Oracle releases.
• DB security:
  – a rule-based tool to detect malicious access to the databases (see the sketch after this list).
• Client connection to the DB:
  – Coral Server deployment.
• Data life cycle:
  – "old" data taken offline, with the possibility of accessing it on demand (SAM, conditions, PVSS, etc.);
  – multiple-schema archiving, Oracle partitioning, compression.
• Regular Oracle reviews (discussed at the WLCG workshop) to follow up service requests/bugs which affect WLCG services.
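The following is a minimal sketch of the rule-engine idea behind the rule-based detection tool mentioned above (and described further below): a set of rules describing suspicious actions is applied to a stream of audit records, and matches raise an alert. The rule set, the audit-record format and the alerting are illustrative placeholders, not the actual CERN tool.

```python
# Minimal sketch of rule-based detection: rules describing suspicious actions are
# applied to a stream of audit records, and matches raise an alert. All names and
# record contents here are invented for illustration.
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class AuditRecord:
    user: str
    client_host: str
    action: str          # e.g. "SELECT", "DROP TABLE", "GRANT"
    object_name: str

@dataclass
class Rule:
    name: str
    matches: Callable[[AuditRecord], bool]   # kept as plain callables so rules stay easy to add or change

RULES = [
    Rule("DDL from non-owner account",
         lambda r: r.action in ("DROP TABLE", "TRUNCATE TABLE") and not r.user.endswith("_OWNER")),
    Rule("access from outside experiment networks",
         lambda r: not r.client_host.endswith(".example-experiment.net")),   # placeholder domain
    Rule("privilege escalation attempt",
         lambda r: r.action == "GRANT" and r.object_name == "DBA"),
]

def scan(records: Iterable[AuditRecord]):
    for record in records:
        for rule in RULES:
            if rule.matches(record):
                # In a real deployment this would notify the DBAs on duty.
                print(f"ALERT [{rule.name}]: {record}")

# Example audit records (made up for illustration):
scan([
    AuditRecord("ATLAS_READER", "node1.example-experiment.net", "SELECT", "CONDITIONS"),
    AuditRecord("ATLAS_READER", "laptop.outside.org", "DROP TABLE", "CONDITIONS"),
])
```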

In connection with the database security issue, we are working on a tool that provides security-oriented information on database access and discovers dangerous or improper actions. It will alert the DBAs if any situation rated as dangerous occurs, and it will be possible to customize it according to the experiments' needs. The tool will be based on a set of rules describing suspicious actions and an engine that searches for situations where the rules were disobeyed. The main requirement is that the rules be flexible and easy to change, and that new ones can easily be added.

In addition, the forthcoming release of Oracle - 11g R2 - offers a number of new features and enhancements that are of potential interest and will have to be studied carefully. Its eventual use in production will have to be planned taking into account the constraints of the LHC operating schedule and of the experiments' production and analysis activities.

8. Strengths, Weaknesses, Opportunities and Threats
Rather than perform a conventional S.W.O.T. analysis under this heading, we prefer to focus on the areas of main concern - it being hopefully evident from the above that a powerful and flexible distributed database service has been built up through close collaboration between the key sites as well as between the administrators, application developers and user communities. (The clear "opportunities" are to continue to offer a reliable service, scaling to increased usage and matching new requirements with minimum effort and resources.)

The threats, on the other hand, come from the likely impact on "WLCG operations" if a prolonged service degradation were to occur. Regular WLCG operations reports over several years have highlighted database and data management services (the latter typically relying on database services in some form, although not always based on Oracle) as a key cause of service degradation. At least a fraction of these degradations could be avoided - some are due to poor application design, others to suboptimal operations procedures (e.g. DB "house-keeping" operations being run either unscheduled or badly scheduled, so that the experiments' production is severely impacted). The ongoing work has already helped to minimise such events, but continued - even increased - attention in this area is clearly required. Other areas of concern are related to Oracle bugs, which have in the past caused serious and/or prolonged service degradation or even complete site downtimes (as viewed by the experiments). The proposed strategy here is a continuation and extension of the overall "by design" principle - be prepared for problems and have a well-exercised mechanism for responding to them in an appropriate manner as and when they occur. As such, we are in discussions with Oracle concerning regular service reviews - foreseen, at least initially, to take place quarterly - to follow up on existing problems and workarounds and to ensure that this worldwide distributed service is maintained at its current level of excellence. The level of criticality of the services concerned is given by the experiments' lists of critical services, where those whose problems would impact data taking are of primary concern, followed closely by those related to first-pass processing and the distribution of data to the Tier1s.

9. Conclusions
We have described the setup and operation of a complex worldwide distributed database infrastructure for the Worldwide LHC Computing Grid. This is built on a small number of well-proven technologies, including RAC + ASM for the key database services at the Tier0 and Tier1s, Streams (primarily) for the distribution of detector conditions information - key for reprocessing - and Data Guard to provide additional protection for the most critical services. Large-scale tests - such as those performed as part of CCRC'08 - show that the experiments' requirements have been met. Testing and validation - of hardware, of versions of the database software, as well as of application versions - have been shown to be key to smooth production, and this has required close cooperation between application developers and database administrators. Monitoring of database and Streams performance has been implemented, and this too has proven essential for the optimisation of a large distributed system. We are now ready to face the challenges of the upcoming STEP'09 and the 2009-2010 LHC run.

References
[1] The Worldwide LHC Computing Grid (WLCG), http://lcg.web.cern.ch/LCG/.
[2] https://twiki.cern.ch/twiki/bin/view/PSSGroup/PhysicsDatabasesSection.
[3] Oracle Maximum Availability Architecture, http://www.oracle.com/technology/deploy/availability/htdocs/maa.htm.
[4] https://twiki.cern.ch/twiki/bin/view/PSSGroup/LCG3DWiki.
[5] D. Dykstra, Improved Cache Coherency Approach for CMS Frontier, this conference.
[6] Tom Kyte, Effective Oracle by Design (Osborne ORACLE Press Series).
[7] B. Martelli, A lightweight high availability strategy for Atlas LCG File Catalogs, this conference.
[8] https://twiki.cern.ch/twiki/pub/PSSGroup/PhysicsDatabasesSection/PhyDB_database_update_policy_at_Tier_0.pdf.