Digital Libraries and Data Intensive Computing

Reagan W. Moore
San Diego Supercomputer Center
[email protected]

Abstract: Scientific data collections that represent the digital holdings of a research community are now being assembled into digital libraries. Scientists use the digital libraries to support browsing of registered material, discovery of relevant digital entities, and display of the data. This is similar to the traditional services provided by digital libraries for image and document collections. However, scientists also need the ability to manipulate entire collections as part of data intensive computing. Entire collections are accessed for analysis, streamed through a processing pipeline, and the results are registered back into the digital library. This paper examines the additional capabilities that digital libraries require in order to enable data intensive computing on scientific data collections.

I. Introduction

Scientists assemble data collections to enable use of scientific material by an entire discipline [1]. Scientists publish the data to share it with other researchers. Teachers then use the material as primary sources in education. Scientists also devise new research projects based upon the ability to analyze an entire collection to discover new properties of the physical world.

Scientists examine data collections to identify unique features or discover collection-wide properties. This may require retrieval and analysis of every element within a collection. The processing of each element within a collection is called data intensive computing [2]. The integration of data intensive computing with digital libraries requires careful attention to the minimization of data access overheads, since the time required to access the data may be greater than the time needed to analyze the data. A second example of data intensive computing is the comparison of all digital entities between two collections. This requires the streaming of data from both collections through a set of analysis procedures, with the products of the procedures registered into a digital library for future use.

Scientific data collections represent the intellectual capital of a community. A scientific community uses collections to publish scientific data, preserve standard digital reference sets, and validate new findings against previously published data. The collections contain not only the digital entities that comprise the digital holdings of the community, but also the descriptive metadata required to interpret and manipulate the holdings. Scientific data collections contain both content (data files) and context (metadata characterizing the data files). The content is analyzed to identify features within the data. The features are labeled and stored as descriptive metadata, becoming part of the context of the file. Scientific data collections thus serve as the repository for the information that a scientific discipline has assembled about its digital holdings [3].
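The access pattern described above can be summarized in a short sketch. The sketch below is illustrative only: the CollectionClient interface, its method names, and the analysis step are hypothetical placeholders, not the API of any system discussed in this paper. It simply streams every registered file through an analysis procedure and registers the derived features back into the collection as context.

    # Minimal sketch of data intensive computing over a collection:
    # read every registered file, analyze it, and register the results.
    # The client interface is a hypothetical placeholder.

    def extract_features(data: bytes) -> dict:
        """Placeholder analysis step; a real pipeline would run
        discipline-specific feature detection code here."""
        return {"size_bytes": len(data)}

    def analyze_collection(client, collection_name: str) -> None:
        for logical_name in client.list_files(collection_name):
            # Data is streamed from the archive or a disk cache.
            with client.open(logical_name) as stream:
                features = extract_features(stream.read())
            # Derived features become descriptive metadata (context) for the file.
            client.register_metadata(logical_name, features)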

Digital libraries traditionally provide mechanisms to publish documents, but they can also be used to publish scientific data. Observational and simulation data can be registered into a digital library for use by other members of the community and for use by educators, students, and the public. The characterization of features within the registered data requires the ability to move data from archives into disk caches, analyze each file, and store the results back into the collection. The manipulation and management of data is done with data grid technology, which provides efficient data transport over wide-area networks, supports replication of files between archives and disk caches, and provides a uniform naming convention for distributed files [4]. Data intensive computing applies process management technologies on top of the data grid software systems to control the execution of the feature detection software. The integration of digital libraries, data grids, and processing pipelines is needed to support data intensive computing.

Data intensive computing requirements can best be illustrated by examining real science projects [5]. We will examine digital libraries used in the Southern California Earthquake Center [6], the National Virtual Observatory [7], and the Bio-medical Informatics Research Network [8]. The first two projects are funded by the United States National Science Foundation, and the third is funded by the United States National Institutes of Health. International projects are also integrating digital library and data grid technology in support of information sharing across academic institutions. The ability to share data through data grids and publish data within digital libraries is essential for the formation of large-scale scientific collections. Representative projects include the United States Department of Energy BaBar high-energy physics project [9] and the Worldwide Universities Network [10]. Such projects are exemplars of future scientific collaborations that will support data intensive computing.

II. Scientific Data Collections

The design of a software system that is capable of supporting data intensive computing requires an understanding of the unique requirements of scientific data collections. Scientific collections are characterized by:
• Large amounts of data. The size can range from tens of Terabytes to thousands of Terabytes. The number of digital entities can be measured in the millions to tens of millions of files.
• Use of distributed storage and compute resources. Scientific data are typically distributed across multiple sites, either during the generation process or during the analysis process. An infrastructure-independent naming convention (logical name space) is needed to identify files. When a file is moved between sites, its logical name remains unchanged (a minimal sketch of such a logical name space follows this list).
• Unique set of descriptive metadata for each community. The terms and concepts used by one scientific discipline will not describe the characteristics of the data created by another academic discipline. Each discipline devises its own descriptive metadata.
• Streaming access to data. When entire collections are analyzed, data is streamed from a remote archive through a processing pipeline as the entire collection is read. Latency management mechanisms are needed to minimize the number of messages sent over wide-area networks, and to minimize the overhead associated with transmission of data.
• Use of access controls. Until the data is calibrated and verified, projects limit access to the collection to the team members. The process of publication is an assertion by a scientific team that the data are an accurate representation of the physical world. The publication of data may be subject to the same peer-review processes as used for publishing scientific papers.
• Use of discipline-specific data encoding formats. Each scientific community typically implements a different encoding standard that optimizes the ability to manipulate its data structures.
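The sketch below illustrates the logical name space idea referred to in the list above. The class, attribute names, and example logical names are illustrative assumptions, not the Storage Resource Broker interfaces: each logical name carries its replicas, descriptive context, and access controls, so moving or copying a file between sites does not change the name by which the collection knows it.

    # Minimal sketch of an infrastructure-independent logical name space.
    # The class, attributes, and example names are illustrative only.

    from dataclasses import dataclass, field

    @dataclass
    class LogicalFile:
        logical_name: str                               # name used by the collection; never changes
        replicas: list = field(default_factory=list)    # physical copies as (site, path) pairs
        metadata: dict = field(default_factory=dict)    # descriptive context
        acl: dict = field(default_factory=dict)         # access controls

        def add_replica(self, site: str, physical_path: str) -> None:
            """Register another physical copy; context and access controls
            stay attached to the logical name, not to the storage location."""
            self.replicas.append((site, physical_path))

    # A file replicated from an archive to a disk cache keeps its logical name.
    entry = LogicalFile("simulations/northridge/seismogram_0001")
    entry.add_replica("archive-site", "/hpss/run42/seis_0001.dat")
    entry.add_replica("cache-site", "/cache/run42/seis_0001.dat")
    entry.metadata["event"] = "Northridge"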

We will examine how each of these requirements arises for representative scientific collections. We will also look at the combinations of digital library, data grid, and processing pipeline software that are used to build production systems for representative science collections.

II.1 Southern California Earthquake Center – SCEC

While earthquakes cannot be reliably predicted, the effects of an earthquake can be modeled. The SCEC project analyzes the seismic hazard from the propagation of anelastic waves generated by an earthquake. A digital library is used to organize the results of simulations and to publish standard digital reference sets of observational and simulation data.

The SCEC digital library manages multiple types of data, and hence is organized as a collection hierarchy. Files that characterize the sub-surface structure of the ground (the velocity model for propagation of waves through Southern California) are stored in one sub-collection. Files that are the result of a simulation of a particular seismic event, such as the Northridge earthquake, are stored in a separate sub-collection. Multiple views are maintained on the collections, to support discovery by type of simulation analysis code or access by the seismic event that is being modeled.

The amount of data that is generated by the simulations can be enormous. A test run for anelastic wave propagation generated 1.3 million files and 10 Terabytes of data. The files were registered into the Storage Resource Broker (SRB) data grid [11] and stored into an archive at SDSC. A portal for accessing the collection was implemented based upon NSF Middleware Initiative (NMI) technology [12]. The portal provided both browsing support for examining the collection hierarchy and analysis tools for displaying seismograms associated with ground movement at a selected location.

Descriptive metadata for simulation output was based on Dublin Core for provenance information, and on SCEC-defined attributes for the input parameters to the simulation codes and the structure of the output data sets. The scientific community relies upon the Federal Geographic Data Committee standard [13] to describe geo-spatial data, and the ISO 19115:2003 standard [14] to describe geographic information and services. ISO 19115 provides information about the identification, the extent, the quality, the spatial and temporal schema, the spatial reference, and the distribution of digital geographic data.
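A descriptive metadata record of this kind might look like the sketch below. All attribute names and values are hypothetical examples chosen to illustrate the combination of Dublin Core provenance elements with project-defined attributes; they do not reproduce the actual SCEC schema.

    # Illustrative metadata record for a simulation output file.
    # Attribute names and values are hypothetical examples only.

    simulation_output_metadata = {
        # Dublin Core provenance elements
        "dc:title": "Anelastic wave propagation, Northridge scenario",
        "dc:creator": "SCEC simulation group",
        "dc:date": "2003-11-07",
        "dc:format": "application/octet-stream",
        # Project-defined attributes describing the simulation input
        "simulation:code": "finite-difference anelastic wave propagation",
        "simulation:velocity_model": "Southern California velocity model",
        "simulation:event": "Northridge",
        "simulation:grid_spacing_m": 200,
        # Structure of the output data set
        "output:variable": "surface velocity",
        "output:grid_dimensions": [1500, 1500, 400],
    }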

Since both finite element and finite difference codes are used to model the anelastic wave propagation, multiple data structures are required to define the simulation output. A standard regular geometric grid is defined to describe the velocities. Derived data products are created that convert the data to seismogram traces at selected points on the surface. The codes are then validated by comparing the accuracy of the computed seismograms against observed seismograms.

The technologies used to implement the digital library are:
• Portal, web-based interface to the digital library – NSF Middleware Initiative technology
• Composition Analysis Tool – used to develop the procedures used for analysis [15]
• Data grid – Storage Resource Broker
• Distributed computation – Globus toolkit [16]
• Compute servers – TeraGrid [17]
• Archive – High Performance Storage System (HPSS) [18]

The digital library is distributed between the Information Sciences Institute (ISI) at the University of Southern California and the San Diego Supercomputer Center, with the portal residing at ISI. The SRB data grid is used to link the two sites and provide a common name space for files. The data are stored in HPSS at SDSC.

II.2 National Virtual Observatory – NVO

The astronomy community creates massive collections of images that are taken in all-sky surveys. Each survey is conducted at a particular set of wavelengths of light. The images are analyzed to identify each observable star and galaxy down to the sensitivity limit of the survey. Star and galaxy catalogs are then created that list the location, magnitude (brightness), and color of each object. The astronomy community must therefore manage both the images that are taken during the surveys and the catalogs of derived star and galaxy locations.

The amount of data and the number of images can be very large. The Two Micron All Sky Survey [19] contains 5 million images and over 10 Terabytes of data. The Digital Palomar Observatory Sky Survey [20] has 3 Terabytes of images, and the Sloan Digital Sky Survey [21] will have 15 Terabytes of images. Future synoptic surveys that take repeated images of the entire sky to detect asteroids and comets will be even larger. The Large Synoptic Survey Telescope will generate petabytes of data per year when it is in operation. The number of objects identified in the images is also very large. The US Naval Observatory B1.0 catalog [22] holds the locations of 1 billion stars. The Sloan Digital Sky Survey will identify the locations of over 100 million galaxies.

Data from multiple sky surveys can be combined to detect new types of objects (such as brown dwarfs) or to generate statistics on the shape of galaxies [7]. The analyses can be conducted from information in the catalogs, but can also be conducted by re-analysis of each pixel in each image. The latter approach requires the streaming of images from their storage location through a processing platform. To facilitate the analyses, the collections have been replicated onto the NSF TeraGrid, which provides both compute and storage resources. The Storage Resource Broker is used to manage copies of the images on the HPSS archive and on a disk cache. The data is streamed from the disk cache to compute nodes on the TeraGrid, where the images are analyzed for galactic structure. Workflow processing systems are used to manage the execution of the analysis procedures. Sky portals are used to integrate information from the multiple catalogs, in effect distributing queries across the multiple catalogs and performing the joins needed to generate the desired query response.

The metadata used to describe the catalog entries is based on Uniform Content Descriptors. The International Virtual Observatory Alliance [23] is developing a uniform set of names that represent the minimum number of physical quantities used across astronomy catalogs. Each astronomy catalog attribute can then be tagged with the corresponding Uniform Content Descriptor. The astronomy community has created a standard data format for images, called the FITS standard [24]. This provides a way to encapsulate descriptive metadata about the location, resolution, and extent of each image along with the pixels that comprise the image.

The technologies used to implement the astronomy data collections are very similar to those used for the SCEC project:
• Portal, web-based interface to the digital library – based on WSDL services [25]
• Pegasus workflow planning system [26]
• Data grid – Storage Resource Broker
• Distributed computation – Globus toolkit and Condor [27]
• Compute servers – TeraGrid
• Archive – High Performance Storage System (HPSS) at SDSC, UniTree storage system at NCSA

The TeraGrid provides compute and storage resources at SDSC, the National Center for Supercomputing Applications (NCSA), the Pittsburgh Supercomputing Center, Caltech, and the Argonne National Laboratory. Data can be retrieved from an archive and processed at any of the TeraGrid sites.

II.3 Bio-medical Informatics Research Network – BIRN

The BIRN project promotes the sharing of data between researchers involved in neuroscience and brain imaging projects. Each site retains control of its own data, selects files for publication into the BIRN central archive, and uses data grid technology to provide access to the published data sets. The amount of data is measured in Terabytes, with the number of files measured in the millions. The BIRN data grid includes fifteen sites, with each collaborator replicating data into a central repository at SDSC. A metadata catalog is used to control access.

Since magnetic resonance images contain a picture of the patient, the HIPAA (Health Insurance Portability and Accountability Act of 1996) [28] patient confidentiality requirements must be met. These include the ability to authenticate all persons who access the images, manage access controls on all data files independently of the place where the data is stored, manage access controls on the metadata, provide audit trails for all accesses to the data independently of storage location, and support data encryption. In practice, the audit trails are used to demonstrate the amount of data sharing between the collaborating sites. The amount of remote data accessed by a researcher can be quantified, as well as the amount of the researcher's data that is distributed to other sites.

The architecture used to support BIRN is based on grid technologies, and is very similar to both the SCEC and NVO projects:
• Portal, web-based interface to the digital library
• Data grid – Storage Resource Broker
• Distributed computation – Globus and Condor
• Compute servers – TeraGrid
• Storage – Grid Bricks [29], commodity-based disk distributed to each site, and the HPSS archival storage system

II.4 International Projects

The BaBar high-energy physics experiment generates data files describing collisions of particles at very high energies. The project is an international collaboration, with researchers at the Stanford Linear Accelerator Center [30] and the Institut National de Physique Nucleaire et de Physique des Particules (IN2P3) [31] in Lyon, France. The size of the collection is on the order of 500 Terabytes. This poses multiple challenges in data management. The access latency between Lyon, France and Stanford is over 100 milliseconds. Commands that require the transmission of multiple messages (such as listing the contents of a sub-collection) will no longer provide interactive response (they take longer than one-quarter of a second). Also, teams at both institutions need the ability to create their own independent digital libraries to manage their experimental data.

The resolution is the creation of independent data grids that are able to exchange both data and metadata. Each site establishes a separate information catalog and manages data on its own storage repositories. The data grids are federated through mechanisms that manage both access controls and update consistency constraints for the sharing of data and metadata [32]. In practice, constraints are needed for the sharing of four name spaces:
1. Naming convention for storage resources. Each data grid creates logical names for the storage systems on which its data is located. Controls are established on the storage resource logical names to define whether the 2nd data grid is allowed to store data on the storage systems managed by the 1st data grid.
2. Distinguished name space for users. Each data grid establishes unique names for each user that is allowed to exercise controlled operations such as writing files, creating metadata, or updating metadata. Access controls are established on the user names to define whether users in the 2nd data grid are allowed to execute controlled operations in the 1st data grid.
3. Logical name space for digital entities. Each data grid establishes infrastructure-independent names for the files, URLs, SQL command strings, and database binary large objects that are registered into its information catalog. Access controls are established on the files to define whether digital entities in the 2nd data grid can be modified by users in the 1st data grid.
4. Catalog metadata. Each data grid defines a context for its digital entities. Access controls are established to define whether users in the 2nd data grid can modify metadata in the 1st data grid.

Sharing controls can be implemented to support peer-to-peer environments, with remote data access allowed under access control constraints, or hierarchical federations, with data and metadata replicated under master-slave consistency constraints between the grids. Peer-to-peer environments typically assume that the coordination of files and metadata between sites is done by the users. Hierarchical federations assume that the coordination of files and metadata between sites is done under system control.

The Worldwide Universities Network (WUN) is federating access between digital libraries at academic institutions in Europe, the United Kingdom, and the United States. The goal is to promote the sharing of information and data in collaborative academic research. The WUN environment is being extended to support federation between data grids that are established by other research communities. The establishment of policies to control access and update between grids is essential to protect intellectual property rights while permitting free access to published information.

III. Data Intensive Computing Environments

A set of infrastructure support requirements needed to integrate data intensive computing with digital library collections can be derived from the above projects. The driving requirements are support for distributed data, support for retrieval of large amounts of data, and support for federation of collections. These capabilities are provided by data grids [33] and include:
• Logical name spaces for digital entities. The naming of entities in distributed systems requires the ability to separate the naming convention from the entity location. The context associated with each entity can then be implemented as metadata attributes mapped onto the logical name space. As a file is moved between sites, the descriptive context and access controls remain unchanged.
• Replication of digital entities. To improve access, copies of entire collections are needed at sites where the required computing power is available. Note that a sustained data rate of 3.2 Megabytes per second is needed to read 100 Terabytes of data in a year; to read 100 Terabytes of data in a day requires an access rate of 1.1 Gigabytes per second (these rates are checked in the sketch following this list). The analysis of large scientific collections therefore requires replication onto high performance resources such as the NSF TeraGrid.
• Bulk operations. Since collections contain millions of files, mechanisms to manipulate tens of thousands of files at a time are needed to facilitate collection creation and collection retrieval. Bulk operations are needed to register files into the metadata catalog, delete files from a collection, and set access controls.
• Containers. Access to a large number of files in a distributed environment requires the use of containers when the size of the files is smaller than the product of the network bandwidth and the network latency. In that case the time to move a file over the network is dominated by the time needed to send messages between the source and destination. The aggregation of files into a container makes it possible to move many files at a time, and minimizes the number of control messages that are sent over the network. A similar situation occurs when writing files into archives; in this case the goal is to minimize the number of files that the archive must manage, and to ensure that related files are stored on the same tape.
• Federation. Interactive response over wide area networks requires access to local metadata catalogs. At the same time, coordination between metadata catalogs is needed to ensure that intellectual property rights, patient confidentiality rights, and consistency constraints are maintained.
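The data rates quoted for replication, and the bandwidth-delay criterion for containers, can be verified with a few lines of arithmetic. The bandwidth and latency figures used for the container threshold are illustrative values chosen for the example, not measurements from any of the projects described above.

    # Check of the sustained rates for replicating 100 Terabytes, and of the
    # bandwidth-delay criterion for aggregating files into containers.
    # The bandwidth and latency values are illustrative only.

    TERABYTE = 10**12  # bytes

    collection_size = 100 * TERABYTE
    rate_per_year = collection_size / (365 * 24 * 3600)   # ~3.2e6 bytes/s, i.e. ~3.2 MB/s
    rate_per_day = collection_size / (24 * 3600)          # ~1.16e9 bytes/s, i.e. ~1.1 GB/s

    # A file benefits from aggregation into a container when it is smaller than
    # the bandwidth-delay product, since per-message latency then dominates
    # the transfer time.
    bandwidth = 100e6 / 8   # 100 Megabit/s wide-area link, in bytes/s (illustrative)
    latency = 0.1           # 100 millisecond round trip (illustrative)
    container_threshold = bandwidth * latency   # ~1.25 Megabytes

    print(f"{rate_per_year / 1e6:.1f} MB/s sustained for a year")
    print(f"{rate_per_day / 1e9:.2f} GB/s sustained for a day")
    print(f"files below ~{container_threshold / 1e6:.2f} MB benefit from containers")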

Digital libraries provide the control and access mechanisms for the scientific collections. The integration of current digital library technology with data grids is needed to extend digital library standards to the management of data distributed and replicated across multiple sites. Fortunately, these standards can be implemented as interfaces to data grids, allowing the data grids to manage the distributed data while the digital library technologies provide the policies associated with document life-cycle management. The standards that are being integrated with data grids include:
• Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) [34]
• DSpace [35], digital library infrastructure to capture, store, index, preserve, and redistribute intellectual output
• Fedora [36], a flexible, extensible digital object repository
• Metadata Encoding and Transmission Standard (METS) [37]
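As an illustration of how one of these standards is exercised as an interface, the sketch below issues a standard OAI-PMH ListIdentifiers request and extracts the record identifiers exposed by a repository. The repository URL is a hypothetical placeholder; only the protocol parameters (verb, metadataPrefix) and the OAI-PMH response namespace are part of the published standard, and the sketch is not tied to the SRB or to any project described above.

    # Minimal sketch of harvesting record identifiers through OAI-PMH [34].
    # The repository endpoint is a hypothetical placeholder.

    import urllib.parse
    import urllib.request
    import xml.etree.ElementTree as ET

    OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
    BASE_URL = "http://example.org/oai"   # hypothetical repository endpoint

    def harvest_identifiers(base_url: str = BASE_URL) -> list:
        """Return the record identifiers the repository exposes in Dublin Core."""
        query = urllib.parse.urlencode({"verb": "ListIdentifiers",
                                        "metadataPrefix": "oai_dc"})
        with urllib.request.urlopen(f"{base_url}?{query}") as response:
            tree = ET.parse(response)
        return [header.findtext(f"{OAI_NS}identifier")
                for header in tree.iter(f"{OAI_NS}header")]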

The digital library control mechanisms for procedures must be integrated with both the workflow processing pipelines and the data grid federation mechanisms. An example is the application of curation processes for validating the context associated with material registered into the digital library. The curation processes can be executed for a large number of digital entities through the grid workflow system. The resulting context can then be replicated between multiple catalogs using data grid federation. For the integrated environment to work correctly, consistent policies will need to be followed within the digital library life-cycle management, workflow management, and data grid federation technologies.

IV. Vision for the Future

Digital libraries are essential technology for the organization of scientific data into collections. The integration of digital libraries with data grids provides the mechanisms needed to manage the massive amounts of distributed scientific data that are now being generated. The federation of independent digital libraries is being driven by the desire to retain local control over collections while supporting international access. The result is the need to define management control and consistency update constraints that can be imposed between digital libraries. Life-cycle management (DSpace) and knowledge management (Fedora) technologies provide mechanisms to specify management controls and consistency constraints. The integration of these emerging digital library technologies with federated data grids promises to provide the data management infrastructure needed to support international collaborations.

V. Acknowledgements

The Storage Resource Broker was developed under the technical lead of Michael Wan and Arcot Rajasekar at the San Diego Supercomputer Center. Applications of the SRB technology were done by Wayne Schroeder, George Kremenek, Sheau-Yen Chen, Charles Cowart, Lucas Gilbert, Bing Zhu, and Marcio Faerman. This work was supported in part by the NSF NPACI ACI-9619020 (NARA supplement), the NSF Digital Library Initiative Phase II Interlib project, the NSF NSDL/UCAR Subaward S0236645, the DOE SciDAC/SDM DE-FC02-01ER25486 and DOE Particle Physics Data Grid, the NSF National Virtual Observatory, the NSF Grid Physics Network, the NSF Southern California Earthquake Center, and the NASA Information Power Grid. The views and conclusions contained in this document are those of the author and should not be interpreted as representing the official policies, either expressed or implied, of the National Science Foundation, the National Archives and Records Administration, or the U.S. government.

VI. References

1. Rajasekar, A., R. Moore, "Data and Metadata Collections for Scientific Applications", High Performance Computing and Networking (HPCN 2001), Amsterdam, Holland, June 2001.
2. Moore, R., C. Baru, A. Rajasekar, R. Marciano, M. Wan, "Data Intensive Computing", in "The Grid: Blueprint for a New Computing Infrastructure", eds. I. Foster and C. Kesselman, Morgan Kaufmann, San Francisco, 1999.
3. Chen, C., "Global Digital Library Development", pp. 197-204, "Knowledge-based Data Management for Digital Libraries", Tsinghua University Press, 2001.
4. Boisvert, R., P. Tang, "The Architecture of Scientific Software", pp. 273-284, "Data Management Systems for Scientific Applications", Kluwer Academic Publishers, 2001.
5. Rajasekar, A., M. Wan, R. Moore, A. Jagatheesan, G. Kremenek, "Real Experiences with Data Grids - Case Studies in Using the SRB", International Symposium on High-Performance Computer Architecture, Kyushu, Japan, December 2002.
6. SCEC – Southern California Earthquake Center community digital library, http://www.sdsc.edu/SCEC/
7. NVO – National Virtual Observatory, http://www.us-vo.org/
8. BIRN – Biomedical Informatics Research Network, http://nbirn.net/
9. BaBar – http://www.slac.stanford.edu/BFROOT/
10. WUN – Worldwide Universities Network, http://www.wun.ac.uk/
11. SRB – The Storage Resource Broker web page, http://www.npaci.edu/DICE/SRB/
12. NMI – National Science Foundation Middleware Initiative, http://www.nsfmiddleware.org/

13. FGDC – Federal Geographic Data Committee, http://clearinghouse1.fgdc.gov/
14. ISO 19115:2003, International Organization for Standardization geographic metadata, http://www.iso.ch/iso/en/CatalogueDetailPage.CatalogueDetail?CSNUMBER=26020&ICS1=35
15. Composition Analysis Tool, http://epicenter.usc.edu/cmeportal/krr.html
16. Globus toolkit, http://www-unix.globus.org/toolkit/
17. TeraGrid distributed infrastructure for scientific research, http://www.teragrid.org/
18. HPSS – High Performance Storage System, http://www.sdsc.edu/hpss/hpss1.html
19. 2MASS – Two Micron All Sky Survey, http://www.ipac.caltech.edu/2mass/
20. DPOSS – Digital Palomar Observatory Sky Survey, http://www.astro.caltech.edu/~george/dposs/
21. SDSS – Sloan Digital Sky Survey, http://www.sdss.org/
22. USNO-B1.0 – http://ftp.nofs.navy.mil/projects/pmm/catalogs.html
23. IVOA – International Virtual Observatory Alliance, http://www.ivoa.net/
24. FITS – Flexible Image Transport System, http://fits.gsfc.nasa.gov/fits_intro.html
25. WSDL – Web Services Description Language, http://www.w3.org/TR/wsdl
26. Pegasus – Planning for Execution in Grids, http://pegasus.isi.edu/
27. Condor high throughput computing environment, http://www.cs.wisc.edu/condor/
28. HIPAA – Health Insurance Portability and Accountability Act of 1996, http://www.hep-c-alert.org/links/hippa.html
29. Rajasekar, A., M. Wan, R. Moore, G. Kremenek, T. Guptil, "Data Grids, Collections, and Grid Bricks", Proceedings of the 20th IEEE Symposium on Mass Storage Systems and Eleventh Goddard Conference on Mass Storage Systems and Technologies, San Diego, April 2003.
30. SLAC – Stanford Linear Accelerator Center, http://www.slac.stanford.edu/
31. IN2P3 – Institut National de Physique Nucleaire et de Physique des Particules, http://cc.in2p3.fr/
32. Rajasekar, A., M. Wan, R. Moore, W. Schroeder, "Data Grid Federation", PDPTA 2004, Special Session on New Trends in Distributed Data Access, June 2004.
33. Jagatheesan, A., R. Moore, "Data Grid Management Systems", NASA/IEEE MSST 2004, Twelfth NASA Goddard / Twenty-First IEEE Conference on Mass Storage Systems and Technologies, April 2004.
34. OAI-PMH – Open Archives Initiative Protocol for Metadata Harvesting, http://www.openarchives.org/OAI/openarchivesprotocol.html
35. DSpace, http://www.dspace.org/
36. Fedora – Flexible, Extensible Digital Object Repository, http://www.fedora.info/
37. METS – Metadata Encoding and Transmission Standard, http://www.loc.gov/standards/mets/