PIK, Vol. 14, pp. 1–6, März 2010. Copyright © by Walter de Gruyter · Berlin · New York. DOI 10.1515/piko.2011.005
Experience with Hierarchical Storage Management based on GPFS and TSM at INFN-CNAF
A. Cavalli, S. Dal Pra, L. dell’Agnello, A. Fella, D. Gregori, L. Li Gioi, B. Martelli, A. Prosperini, P. P. Ricci, V. Sapunenko (INFN-CNAF, Viale Berti Pichat 6/2, 40127 Bologna, Italy); V. Vagnoni (INFN Sezione di Bologna, Via Irnerio 46, 40126 Bologna, Italy)
Abstract At the INFN-CNAF Tier-1 computing centre we have evaluated the integration between two IBM products, namely the General Parallel File System (GPFS) and the Tivoli Storage Manager (TSM), with the aim of implementing a high-performance Hierarchical Storage Management (HSM) system for managing the data of the experiments operating at the Large Hadron Collider. Starting from version 3.2, GPFS allows for the execution of selective migrations of GPFS disk-resident files to external tape pools by means of appropriate policies defined via the Information Lifecycle Management (ILM) interface and specific user programs. As far as recalls of files from tape to disk are concerned, an optimized tape-ordered access is mandatory, and this has been achieved by exploiting new features of the TSM HSM client, preliminarily made available in version 6.1. In this paper we summarize the work done to implement the full GPFS-TSM integration, as well as some early but significant results obtained during intensive data access activities.
1 Introduction
One of the most difficult challenges for High Energy Physics (HEP) computing lies in the need to manage the very large data sets collected by giant detectors, such as those operating at the European Organization for Nuclear Research (CERN) within the Large Hadron Collider (LHC) programme [1]. Very reliable and high-performance disk-based data access systems are required, as well as Mass Storage Systems (MSS’s) allowing for near-line archival of several PB of data per year. The INFN-CNAF computing centre hosts the Italian World-wide LHC Computing Grid (WLCG) Tier-1 site, the largest Italian computing facility employed in the LHC distributed computing infrastructure.
Corresponding author, e-mail: [email protected]
At the beginning of 2008 we started an activity focused on the integration of our pre-existing GPFS disk storage infrastructure with TSM, aimed at realizing a full Hierarchical Storage Management (HSM) system by exploiting the so-called “external pool” features introduced in GPFS version 3.2. In the WLCG model all the Tier-1 centres must provide different classes of data storage, the so-called “storage classes” (SC’s) [2]:
– Disk0-Tape1 (D0T1). Within this class, data files are stored on tape and the disk area is considered only as a temporary buffer (usually referred to as disk cache or staging area) automatically managed by the system. Usually the oldest files are automatically removed from the staging area when disk space is reclaimed for new files. Data files to be removed must have been previously copied to tape. This is in practice the typical functionality of an HSM system.
– Disk1-Tape1 (D1T1). Data files are permanently stored on disk and on tape as a backup. Obviously the amounts of disk and tape space in this case are, by definition, identical.
– Disk1-Tape0 (D1T0). Here no guaranteed copy on tape exists, i.e. data files reside uniquely on disk.
While the D1T0 SC can be implemented by a plain GPFS file system, D1T1 and D0T1 require a tape backend. In particular, the D1T1 SC only requires migrations to tape, as recalls are only foreseen in case a file system restore is needed, e.g. due to a partial or full file system loss. The D0T1 SC instead requires a fully working HSM system, with user-driven recalls from tape to disk. In this paper, after a brief description of the hardware resources we are currently using in our production environment, we summarize our studies targeted at the implementation of the SC’s mentioned above, using GPFS and TSM.
2 Disk and Tape Storage resources at CNAF
All disk storage resources are accessed through Storage Area Network (SAN) Fibre Channel (FC) switches, and Linux machines are used as disk-servers. At the moment we have roughly 2.8 PB of raw disk storage installed, primarily provided by Dell EMC CX-3-80 storage controllers making use of SATA disk arrays [3]. Our SAN infrastructure is based on Brocade switches, where two Fabric Directors (a SilkWorm 24000 with 128 2 Gbit/s ports and a 48000 with 224 4 Gbit/s ports) represent the core of the SAN, while two SilkWorm 3900 (64 ports in total) and several 8-port switches integrated in the blade enclosures hosting the disk-servers are connected as peripheral switches. The access to the storage is provided by
dedicated server machines running Scientific Linux CERN (SLC) as the operating system. Currently we have in production a total of about 200 disk-servers with redundant QLogic Host Bus Adapters (HBAs) providing connectivity to the SAN, and Gigabit Ethernet Local Area Network (LAN) connections. The disk-servers act as front-ends of the disk storage infrastructure to the computing farm, which at the moment comprises about 4000 CPU cores. As tape resources we run two tape libraries in production: a SUN L5500 silo partitioned with 2000 tape cartridges for 6 LTO-2 drives and 3500 tapes for 10 9940B drives (the total capacity is about 1 PB of uncompressed data), and a SUN SL8500 with 8 redundant robot changers, 8 T10000A drives and 20 T10000B drives. The latter library has a potential capacity of 10000 slots, i.e. 10 PB at maximum with the drive technology in use. These resources are going to grow substantially during the coming years of LHC activity. For example, in view of the 2010 LHC run, the storage is being expanded to about 8 PB of disk and 10 PB of tape space online.
3 Relevant GPFS features
GPFS is a clustered parallel file system providing scalable POSIX access to files [4]. Client machines do not need to have direct access to the SAN. Instead, they access the disk devices via the LAN by means of the so-called GPFS Network Shared Disks (NSD’s). In a few words, an NSD is an abstraction layer used for shipping I/O requests to the disk-servers. The usage of parallel I/O drastically increases the performance compared to other data access systems available in the framework of HEP computing [5]. In our production system roughly 1000 farm worker nodes access the GPFS file systems using the local area network and the NSD configuration. We currently run 9 GPFS clusters in production at CNAF, with a total of 2.2 PB of net disk space over 15 GPFS file systems online. In GPFS, data files are striped across many devices for high-performance access, and the client nodes can use read-ahead/write-behind techniques for improving I/O rates. Metadata operations are journaled to the file system, allowing for fast recovery in the event of a node failure. Byte-range locking is used to support intra-file parallel I/O operations. Distributed token management is used to grant file access, space allocation and metadata rights, allowing for distributed and parallel operation. Additionally, GPFS provides extended features that are the keys to the integration effort discussed in this paper. These include an implementation of the XDSM DMAPI [6] standard and Information Lifecycle Management (ILM) capabilities.
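As a simple illustration of the intra-file parallel I/O pattern that byte-range locking makes safe, the short C program below has several processes write disjoint 1 MiB ranges of the same file concurrently. The file path is a placeholder and the example is not specific to GPFS; it merely exercises the access pattern that the file system coordinates with byte-range locks.

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define NPROC 4
    #define CHUNK (1024 * 1024)

    int main(void)
    {
        /* Placeholder path on a shared (e.g. GPFS) file system. */
        int fd = open("/gpfs/scratch/parallel.dat", O_CREAT | O_WRONLY, 0644);
        if (fd < 0) { perror("open"); return 1; }

        for (int p = 0; p < NPROC; p++) {
            if (fork() == 0) {                 /* child p writes its own range */
                char *buf = malloc(CHUNK);
                memset(buf, 'A' + p, CHUNK);
                pwrite(fd, buf, CHUNK, (off_t) p * CHUNK);
                free(buf);
                _exit(0);
            }
        }
        for (int p = 0; p < NPROC; p++)
            wait(NULL);
        close(fd);
        return 0;
    }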
3.1 Data Management API
The XDSM DMAPI specification defines a set of file system extensions that may be used to implement an HSM system. Its most important features are:
– DMAPI events. They provide notification of file system activities to a Data Management (DM) program.
– DMAPI managed regions. They define which parts of a file trigger the notification of an event to the DM application upon access.
– DMAPI extended attributes. They are extensions of the usual UNIX attributes, and can be associated with each file system inode.
DM applications receive notifications of file system events via so-called sessions. Events can be either synchronous or asynchronous. In the former case the event blocks the thread which generated it until the DM application finishes its processing.
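As an illustration of how a DM application interacts with these primitives, the following C sketch creates a DMAPI session and loops over incoming events. It follows the XDSM interface described in [6], but error handling is minimal and the registration of event dispositions and managed regions (dm_set_disp, dm_set_region) is omitted, so it is a skeleton rather than a working HSM component; the session name is a placeholder.

    #include <stdio.h>
    #include <dmapi.h>            /* XDSM DMAPI header shipped with GPFS */

    int main(void)
    {
        char        *version;
        dm_sessid_t  sid;
        char         buf[4096];
        size_t       rlen;

        if (dm_init_service(&version) != 0 ||
            dm_create_session(DM_NO_SESSION, "demo-dm-session", &sid) != 0) {
            perror("DMAPI initialisation");
            return 1;
        }

        /* Block waiting for events delivered to this session (their
           disposition must have been registered with dm_set_disp). */
        for (;;) {
            if (dm_get_events(sid, 1, DM_EV_WAIT, sizeof(buf), buf, &rlen) != 0) {
                perror("dm_get_events");
                break;
            }
            dm_eventmsg_t *msg = (dm_eventmsg_t *) buf;

            if (msg->ev_type == DM_EVENT_READ) {
                /* A managed region of a migrated file has been read:
                   this is where a recall from tape would be triggered. */
            }

            /* Unblock the thread that generated the synchronous event. */
            dm_respond_event(sid, msg->ev_token, DM_RESP_CONTINUE, 0, 0, NULL);
        }
        return 0;
    }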
3.2 Information Lifecycle Management
Within ILM, the storage can be partitioned into so-called storage pools. To implement file placement on a specific storage pool, and data migration from one pool to another according to given criteria, ILM uses an SQL-like policy language. To generate a list of file candidates for migration, GPFS scans the file system metadata, building a result set of file attributes and path names that match the specified search criteria. In order to perform fast file system scans, GPFS is provided with optimized metadata structures. The scans can be parallelized across many nodes in the GPFS cluster, resulting in improved scalability and higher performance.
4 File migrations from GPFS to TSM
Starting from GPFS version 3.2, the concept of “external storage pool” was introduced into ILM. External storage pools allow the implementation of a policy-driven migration/recall system towards a tape storage backend such as TSM. In this case, TSM takes on the role of manager of an external storage pool [7]. However, interfacing GPFS with specific implementations of tape back-ends does not come natively, but must be realized by means of specific external programs. GPFS can start up such programs, feeding them with the necessary file lists built according to the ILM policy scripts installed for a file system. In practice, GPFS can automatically build candidate lists and pass them to the external programs, which in turn can move the data to and from tape by means of the TSM client. In TSM, “pre-migration” stands for the action of copying a file to tape while keeping the original copy also on disk. With “migration”, instead, one means the action of copying the file to tape and removing the content of the file from disk (but not the inode structure), or simply removing the content of the file from disk if the file had previously been pre-migrated to tape. When the content of a file is removed from disk, a so-called “stub” file is kept, i.e. the inode remains on disk and can be listed as a normal file. DMAPI extended attributes are added by TSM to the stub file, containing a sequence to be used as a unique key for identifying the file in the TSM database. DMAPI managed regions are set with specific events, namely READ, WRITE and TRUNCATE. When the file is in the migrated state, all three events are set on the file, so that any of these three operations will trigger a signal intercepted by a TSM DM application.
When the file is in the pre-migrated state, instead, only the WRITE and TRUNCATE events are set, as TSM must be able to invalidate the copy on tape in case the file is updated on disk. Pre-migrations are started upon a LOW_SPACE event according to the GPFS ILM policy for the file system. In our case we set the threshold for triggering LOW_SPACE to zero, which results in pre-migrating files to tape “as soon as possible” (this is a basic requirement for data custodial in WLCG data centres). The number of pre-migration threads (for each file system) running on each HSM-enabled node can be configured and must be chosen according to the available resources. For example, with 8 drives and 3 HSM nodes, the number of threads can be set to 3 per node (9 in total), so that 8 migrations run on the 8 drives while one migration is pending, waiting for a drive to become available. When the file system reaches a certain (configurable) level of occupancy, garbage collection is started and pre-migrated file contents are removed from disk. This “migration” runs over already pre-migrated files, hence it simply frees the blocks on disk owned by the file, leaving in place just an empty stub file. A sketch of an ILM policy of this kind is given below.
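To make the mechanism more concrete, the following is a minimal sketch of such a policy, written in the GPFS SQL-like policy language. It is illustrative only: the external pool name, the path of the interface script that GPFS invokes with the candidate file lists, and the threshold values are placeholders and do not reproduce the actual policy deployed at CNAF.

    /* Define the tape backend as an external pool; GPFS invokes the
       interface script with the candidate file lists (placeholder path). */
    RULE EXTERNAL POOL 'tsm' EXEC '/usr/local/bin/hsmExec'

    /* Place newly created files in the disk (system) pool. */
    RULE 'placement' SET POOL 'system'

    /* Candidate selection for (pre-)migration to tape. Values are
       illustrative: a trigger threshold of 0 means the low-space
       condition is always met, so pre-migration (last argument, 0%)
       runs as soon as possible, while file contents are actually
       freed from disk only above 85% pool occupancy. */
    RULE 'toTape' MIGRATE FROM POOL 'system'
         THRESHOLD(0,85,0)
         TO POOL 'tsm'
         WHERE FILE_SIZE > 0

In a real deployment such a policy would be applied with the mmapplypolicy command, either periodically or from a low-disk-space callback, so that the file lists are handed over to the external programs described above.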
5 File recalls from TSM to GPFS
While the implementation of file migration from disk to tape is realized using standard GPFS features, an optimal way to recall files from tape to disk has to be defined. In TSM, data files can be recalled from tape to disk in two ways:
– Selective recalls. The user (or a specific service on his behalf) asks for a file to be recalled from tape prior to the first access. This typically happens in the WLCG world, where before submitting a job to a computing node, a so-called “Storage Resource Manager” (SRM) [8] service is contacted in order to recall all the needed files from tape. Only when the recalls are over can the user’s job be submitted and executed on the worker node.
– Transparent recalls. The file is accessed by a read operation from the user application regardless of whether it is on disk or still has to be recalled from tape. If the file is not on disk, the read operation triggers, via DMAPI, the recall of the file from tape. When the recall is finished and the file is accessible on disk, control is given back to the user’s process.
INFN-CNAF has developed a specific implementation of the SRM protocol over parallel file systems [9], called StoRM [10], which in its current production release (1.5) already allows using GPFS in conjunction with TSM. An optimized access to data stored on tape is of primary importance for realizing a high-performance HSM system. Tape recalls should be grouped together and sorted according to how the files are stored within each tape, avoiding unnecessary tape seeks and minimizing mount/dismount sequences. The standard TSM behaviour consists in recalling files as soon as they are requested by users, following the same order as the requests. As the user has no knowledge of where the
files are stored, and in particular of how the files are ordered within a tape, such a procedure ends up in a very inefficient usage of the tape resources. However, recent developments of the TSM-HSM client introduced new features which can be used as tools for realizing tape-ordered file recalls. By means of such tools, we were able to implement both selective and transparent recalls, as described in the following sub-sections; an illustrative sketch of the tape-ordering idea is given below.
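The following C fragment is a minimal, purely illustrative sketch of that idea: recall requests are grouped by tape volume and sorted by their position on the tape, so that each cartridge is mounted once and read sequentially. The request structure, volume names and positions are hypothetical; this is not the actual YAMSS code.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    struct recall_req {
        char path[256];    /* file to recall                   */
        char volume[16];   /* tape cartridge holding the copy  */
        long position;     /* position of the file on the tape */
    };

    /* Order by volume first, then by on-tape position. */
    static int by_tape_order(const void *a, const void *b)
    {
        const struct recall_req *x = a, *y = b;
        int c = strcmp(x->volume, y->volume);
        if (c != 0)
            return c;
        return (x->position > y->position) - (x->position < y->position);
    }

    int main(void)
    {
        struct recall_req q[] = {
            { "/gpfs/cms/data/f1.root", "T10K23", 8120 },
            { "/gpfs/cms/data/f2.root", "T10K07",  310 },
            { "/gpfs/cms/data/f3.root", "T10K23",  150 },
        };
        size_t n = sizeof(q) / sizeof(q[0]);

        qsort(q, n, sizeof(q[0]), by_tape_order);

        /* One sorted list per tape would then be handed to a recall process. */
        for (size_t i = 0; i < n; i++)
            printf("%s %ld %s\n", q[i].volume, q[i].position, q[i].path);
        return 0;
    }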
5.1 Selective tape-ordered recalls
We implemented a selective tape-ordered recall system by means of four main commands/processes: yamssEnqueueRecall, yamssMonitor, yamssReorderRecall, and yamssProcessRecall. The system manages a FIFO queue of the files to be recalled, fetches files from the queue and builds sorted lists with optimal file ordering, and then performs the actual recalls from TSM to the GPFS file system. A sketch of the workflow is depicted in Figure 1. The description of the various commands/processes is given below:
– yamssEnqueueRecall: a command line client that allows one to insert into a FIFO queue the files that need to be recalled from tape. It can be run on any node of the computing farm and on the HSM-client storage nodes.
– yamssMonitor: runs as a service on all HSM-client storage nodes. It discovers the managed file systems mounted on HSM nodes, reads the configuration file for each file system (stored in a “system” directory of the file system itself) and triggers the needed actions (i.e. starts other processes). It loops continuously in the background and spawns all the needed recall threads according to the configuration of each mounted file system. It also auto-detects changes in the configuration files for each file system and reconfigures the actions to be taken accordingly.
– yamssReorderRecall: scheduled by yamssMonitor. If entries are available from the recall queue, it fetches them in chunks of configurable width and sorts the recall file lists according to tape ordering. It then puts the ordered file lists into a shared directory, i.e. a system directory in the GPFS file system itself. If a file list for a given tape already exists, it adds the new files to the existing list in the correct order. If no work is available (no files are present in the queue), or after having completed one reordering, the process simply exits.
– yamssProcessRecall: scheduled by yamssMonitor; it executes recalls on the ordered file lists produced by yamssReorderRecall. More than one recall thread per node can be started, according to how many parallel recalls are desired. Each thread takes control of the file list for exactly one tape. In the present implementation it simply chooses, amongst the available file lists, the one with the largest number of files to be recalled, unless a file list with one request older than a configurable time threshold exists. Such an exception is needed in order to avoid starvation of tapes with few requested files with respect to tapes with many requests. If no work is available (i.e. no file lists are ready to be processed), or once the recall of the tape is finished, the process exits. It performs the recall of the ordered file list by using the TSM-HSM client commands. It also checks whether each file has actually been recalled. In case of failures, it can re-insert the failed file recall into the queue for a
configurable number of times, in order to try again and give the recall another chance. Once the “final” failure occurs (i.e. after all the allowed retries, if any), it writes the actual time of the failure into an extended attribute of the file. This is used to inform clients that a failure occurred and that they should give up waiting for the file recall to happen.
In addition to these four, a set of administrative commands for the daily operation of the system has also been developed, e.g. commands for monitoring, stopping and starting migrations and recalls, performance reporting, etc. A non-exhaustive list of these commands with short descriptions can be found in Appendix A.
Figure 1: Workflow of the tape-ordered recall system.
5.2 Transparent tape-ordered recalls
In order to allow for transparent tape recalls, a preload C library has been written. The preload library overrides the ordinary libc “open” call. It intercepts all open calls from client applications that want to perform transparent tape-ordered recalls (i.e. client applications which do not perform selective recalls through the StoRM SRM service), and then performs the following tasks:
– Checks whether the file system is GPFS; otherwise it returns the ordinary libc open().
– Checks whether the file is on disk; if so, it returns the ordinary libc open().
– Inserts the file into the queue for tape-ordered recall using yamssEnqueueRecall.
– Polls until the file is fully on disk, until an error condition shows up, or until a configurable timeout is exceeded.
– In case of failure or timeout, open() returns an I/O error; otherwise, when the file is fully on disk, it returns the ordinary libc open().
In practice, the preload library is used to transform, on the client side, a transparent recall into a selective recall, and then wait for the actual recall to happen. The preload library mechanism will soon be replaced by appropriate DM applications. A minimal sketch of such an open() interposition is given below.
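The sketch below illustrates the interposition mechanism using LD_PRELOAD and dlsym(RTLD_NEXT, ...). The helpers file_is_on_disk() and enqueue_recall() are hypothetical placeholders for the stub-file check and for the call to yamssEnqueueRecall, the polling interval and timeout are arbitrary, and error handling is reduced to a minimum; it is not the actual CNAF preload library. It would be compiled as a shared object (e.g. with -shared -fPIC -ldl) and activated via the LD_PRELOAD environment variable.

    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <stdarg.h>
    #include <sys/types.h>
    #include <unistd.h>

    static int (*real_open)(const char *, int, ...);

    /* Hypothetical helpers: the stub-file check and the enqueueing of the
       recall (e.g. via yamssEnqueueRecall) are not implemented here. */
    static int  file_is_on_disk(const char *path) { (void) path; return 1; }
    static void enqueue_recall(const char *path)  { (void) path; }

    int open(const char *path, int flags, ...)
    {
        mode_t  mode = 0;
        va_list ap;

        if (flags & O_CREAT) {      /* open() takes a mode only with O_CREAT */
            va_start(ap, flags);
            mode = va_arg(ap, mode_t);
            va_end(ap);
        }
        if (!real_open)
            real_open = (int (*)(const char *, int, ...)) dlsym(RTLD_NEXT, "open");

        if (!file_is_on_disk(path)) {
            enqueue_recall(path);                /* selective, tape-ordered recall */
            for (int i = 0; i < 3600; i++) {     /* poll with an arbitrary timeout */
                if (file_is_on_disk(path))
                    return real_open(path, flags, mode);
                sleep(1);
            }
            errno = EIO;                         /* failure or timeout: I/O error  */
            return -1;
        }
        return real_open(path, flags, mode);
    }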
6 Test-bed and test results
A sketch of the test-bed that we set up to perform a large-scale validation of YAMSS is depicted in Figure 2. Its layout consists of the following elements:
– A GPFS file system (version 3.2.1-14) providing 100 TB of disk space served by 2 NSD disk-servers. A redundant 4 Gbit/s Fibre Channel inter-connection between servers and disk arrays was used. The backend storage hardware was an EMC CX4-960 storage array with 1 TB SATA disks. The GPFS file system was built over 14 RAID-5 LUN’s using a block size of 1 MB. The disk-servers were Dell M600 blade servers, each equipped with two quad-core Intel Xeon E5410 CPUs at 2.33 GHz, 16 GB of RAM and 2 Gigabit Ethernet links. The Operating System (OS) installed was Scientific Linux CERN (SLC) version 4.4, running a 2.6.9-67.0.15.EL x86_64 kernel.
– One dedicated machine running the TSM server version 6.1 and its DB2 database engine.
– Three HSM-client storage nodes. Each of these servers runs the TSM storage agents and the HSM client commands. Each node accessed the GPFS file system through the SAN, and the tape drives through a dedicated HBA connected to the Tape Area Network (TAN).
– Eight T10000B tape drives in a SUN SL8500 tape library connected to the TAN.
Figure 2: Scheme of the INFN-CNAF Tier-1 GPFS-TSM test-bed.
The advantage of this configuration, where the three HSM-client storage nodes also have direct access to the GPFS file system via Fibre Channel, is that the traffic between disks and tapes is completely LAN-free. We made a series of tests on this test-bed. First we performed separate sequential migrations and recalls of randomly filled files. About 10 TB of files (1 GB each) were migrated from disk to tape using a subset of the available drives (6 in total), obtaining an aggregated throughput of about 550 MiB/s, with the 6 migration processes balanced over the 3 HSM nodes. Similarly, in the recall test about 10 TB of files (2 GB each) were read from tape, achieving an aggregated throughput of about 500 MiB/s. In a subsequent test, using the same set of files as in the previous test, we performed migrations using 3 drives and recalls using 3 other drives at the same time, running one migration and one recall process on each of the 3 HSM nodes. We measured a sustained aggregate throughput of 350 MiB/s for recalls and 250 MiB/s for migrations. After these preliminary tests, we performed more realistic tests running real analysis jobs accessing the data. CMS, one of the main LHC experiments, which has a computing model characterized by an intensive usage of the D0T1 SC, performed realistic data processing in order to mimic production activities and data transfers. This test consisted in recalling and analyzing about 9000 real data files, 25 TB in total. The data files were randomly stored on 54 tapes (1 TB each), together with random garbage files, in order to reproduce a realistic situation where recalls of non-contiguous files had to happen. About 1900 jobs ran in parallel on the computing farm using transparent recalls, and their output was written to
the same GPFS file system and migrated to tape. In addition, incoming “background” data transfers with an average throughput of 85 MiB/s were performed into the file system, in order to simulate other ordinary activities of the CMS experiment while the main data analysis activity was going on. In summary, we had files migrated from disk to tape resulting from the background data transfers and from the output of the analysis jobs, and at the same time recalls from tape to disk of the files used as input to the analysis jobs. Six of the eight T10000B drives were used for recalls, while the remaining two drives were used for migrations. We recorded about 550 MiB/s from tape to disk, 100 MiB/s from disk to tape and 400 MiB/s from disk to the computing nodes, i.e. a total aggregated traffic of the order of 1 GiB/s. The throughput was similar to that obtained with the equivalent system currently in production at CNAF; however, the new system achieved this result using a very limited amount of hardware resources.
7 Conclusions
We presented an implementation of the GPFS-TSM integration developed at the INFN-CNAF WLCG Tier-1. The aim was to build an HSM system able to provide all the storage classes foreseen by the LHC distributed computing model. We achieved very good performance and high reliability in comparison with alternative systems already in production at our Tier-1, and the system proved itself ready for production. The 2010 LHC run is expected to start by mid March 2010, and for the first time in the framework of WLCG we will use GPFS and TSM, in conjunction with the SRM service provided by StoRM, for the D0T1 storage class. We are confident that this system will fulfil all the requirements and will be used with high efficiency in the coming years of LHC activity.
References
[1] I. Bird et al., “LHC Computing Grid Technical Design Report”, CERN-LHCC-2005-024.
[2] A. Carbone et al., “A Novel Approach for Mass Storage Data Custodial”, Proceedings of the 2008 Nuclear Science Symposium, Medical Imaging Conference and 16th Room Temperature Semiconductor Detector Workshop, 19–25 October 2008, Dresden, Germany.
[3] The Serial ATA (SATA) computer bus is a storage interface for connecting host bus adapters (most commonly integrated into laptop computers and desktop motherboards) to mass storage devices (such as hard disk drives and optical drives). See also http://www.ata-atapi.com/sata.html.
[4] General Parallel File System documentation. Available online: http://publib.boulder.ibm.com/infocenter/clresctr/vxrx/topic/com.ibm.cluster.gpfs.doc/gpfsbooks.html.
[5] M. Bencivenni et al., “A Comparison of Data-Access Platforms for the Computing of Large Hadron Collider Experiments”, IEEE Transactions on Nuclear Science, Vol. 55, Issue 3, Part 3, pp. 1621–1630, June 2008.
[6] “GPFS V3.2.1 Data Management API Guide”. Available online.
[7] “IBM Tivoli Storage Management Concepts”, IBM Redbooks Series, SG24-4877.
[8] F. Donno et al., “Storage Element Model for SRM 2.2 and GLUE Schema Description”, CERN.
[9] A. Carbone et al., “Performance Studies of the StoRM Storage Resource Manager”, Proceedings of the Third IEEE International Conference on e-Science and Grid Computing, 10–13 December 2007, pp. 423–430.
[10] E. Corso et al., “StoRM: A SRM Solution on Disk Based Storage System”, Proceedings of the Cracow Grid Workshop 2006 (CGW2006), Cracow, Poland, October 15–18, 2006.
Appendix A: List of administration commands
yamssDrainMigrations: Puts the system in a drain state for migrations, i.e. finishes ongoing migrations and then stops migrating new files.
yamssUndrainMigrations: Re-enables migrations from a previous drain state.
yamssDrainRecalls: Puts the system in a drain state for recalls, i.e. finishes ongoing recalls and then stops recalling new files.
yamssUndrainRecalls: Re-enables recalls from a previous drain state.
yamssGetStatus: Prints to standard output the global status of each managed file system, e.g. the number of ongoing migrations and recalls, etc.
yamssListRecalls: Lists all ongoing recalls tape by tape, with the number of files per tape (short format) or all file names (long format).
yamssLogger: Centralized logging facility. Three log files (for migrations, pre-migrations and recalls) are centralized for each managed file system.
yamssLs: “ls”-like interface which, in addition, prints the status of each file, i.e. whether the file is pre-migrated, migrated or disk-resident. It can run on nodes where the TSM-HSM clients are not installed.
yamssCleanRecallQueue: Cleans up the recall queue.
yamssMigrateStat: Prints to standard output some statistics, e.g. the number of files migrated from each HSM node, failures, retention time on disk, disk-to-tape throughput, etc. It is used by the reporting facility to send report e-mails to the system administrators.
yamssRecallStat: Prints to standard output some statistics, e.g. the number of files recalled from each HSM node, failures, the average and maximum time elapsed from when the files are inserted into the queue until they are recalled from tape, tape-to-disk throughput, etc. It is used by the reporting facility to send report e-mails to the system administrators.