BY IAN FOSTER AND ROBERT L. GROSSMAN
DATA INTEGRATION IN A BANDWIDTH-RICH WORLD

Inexpensive storage and wide-area bandwidth (with prices for both declining at least as fast as Moore's Law) drive demand for middleware to integrate, correlate, compare, and mine local, remote, and distributed data.
Exponential advances in sensors, storage systems, and computers are producing data of unprecedented quantity and quality. Multi-terabyte and even petabyte (1,000TB) data sets are emerging as major assets. For example, the climate science community has access to hundreds of terabytes of observational data from NASA's Earth-observing system and simulation data from high-performance climate models; these data sources can yield new
insights into global change. The World-Wide Telescope linking hundreds of digital sky surveys is revolutionizing astronomy [11]. And in industry, multi-terabyte (soon to be petabyte) data warehouses of consumer transactional data are increasingly common.
IN-SPIRALING MERGER OF TWO BLACK HOLES. SWIRLING RED TENDRILS ARE OUTWARD-TRAVELING GRAVITATIONAL WAVES. (SIMULATION DATA: PETER DIENER AND THOMAS RADKE, BOTH MAX PLANCK INSTITUTE FOR GRAVITATIONAL PHYSICS, ALBERT EINSTEIN INSTITUTE/POTSDAM, GERMANY; VISUALIZATION: JOHN SHALF USING THE VISAPULT TOOL DEVELOPED BY WES BETHEL, LAWRENCE BERKELEY NATIONAL LABORATORY)
The key to deriving insight and knowledge is often the correlation of data from multiple sources, as these examples show. The traditional paradigm for such syntheses is to gather data at a single location and transform it into a common format prior to exploring it. However, the expense of this approach in terms of network resources has meant that most data is never correlated or compared to other data. In a world of more and more data, storage systems, computers, and networks, it is both necessary and feasible for system architects to think in terms of a new paradigm based on data integration—the flexible and managed federation, exploration, and processing of data from many sources.

One factor driving this new paradigm is dramatic improvements in network performance. Few Internet1 networks move data at more than a megabit per second (Mbps), taking weeks to move a terabyte. Fortunately, advances in networking technologies are ushering in an era of bandwidth abundance based on Tbps optical backbones providing routine access to end-to-end paths of 10Gbps or more—a four-orders-of-magnitude improvement (see the article by DeFanti et al. in this section). For example, in 2002 Earth science data striped over a three-node cluster was transported at a rate of 2.4Gbps between Amsterdam and Chicago, a terabyte in an hour [9].

Just as critical for effective data integration, and our focus here, is the distributed system middleware beginning to allow distributed communities, or virtual organizations, to access and share data, networks, and other resources in a controlled and secure manner. Recent advances promise to provide the required capabilities. For example, Open Grid Services Architecture (OGSA) standards and technologies provide for the secure and reliable virtualization and management of distributed data and computing resources [2, 6]. And data Web infrastructures support discovery, exploration, analysis, integration, and mining of remote and distributed data [10]. Such efforts are pioneering a new generation of distributed data discovery, access, and exploration technologies promising to transform the Internet into a data-integration platform. On it, users will be able to perform sophisticated operations on remote and distributed petascale data sets (see the sidebar "Data-Integration Technologies").

Figure 1. Major components and activities in a data-integration architecture. Happy users interact with various public or private registries, each providing a particular view of available data, to discover candidate data. They then dispatch requests (dark arrows) to access and/or explore (white circles) remote data. Each such request, along with resulting interstorage-system transfers (dashed and dotted arrows), is subject to resource management controls at various points (labeled RM), typically under the control of security and policy services.

Petascale Scenarios
The following scenarios illustrate applications impossible today but achievable over optical networks with the help of data services.

Virtual data warehouses. Today, data warehouses are centralized repositories of data used for reporting and querying. High-speed optical networks make it possible for data to instead be stored at its source. When reports are required, bandwidth can be requested, data merged from multiple sources, and reports generated using the most current data. In effect, virtual data warehouses are constructed on the fly. Here, as elsewhere, new data architectures become possible when wide-area networks are able to transport data at speeds comparable to that of a computer's backplane.

Data replication for business continuity. Businesses providing critical infrastructure for disaster recovery and business continuity are increasingly locating secondary, even tertiary, backup facilities far from primary sites. In the past, large data volumes, high transaction speeds, slow networks, and poor distributed data management infrastructure made such distributed architectures difficult or impossible. However, with all-optical networks and distributed data services, it becomes feasible to consider replicating transformations from core systems to remote backup systems. Financial exchanges, telecommunication systems, reservation systems, and dispatch and scheduling operations performed by vendors are examples of systems that can be replicated in this way.

Stream-based distributed processing of sensor data. Large centralized detectors and distributed sensor nets in such fields as physics, astronomy, seismology, and national security produce high-volume data streams requiring extensive processing prior to analysis. Today, processing is performed offline, and data sets are prepared and distributed only periodically. Optical networks and data-integration services can enable a new paradigm in which even large data sets are continuously updated, so users always have access to the most current data. Data can also be merged from multiple sources, processed in real time, and analyzed for changes, alerts, and other significant patterns.
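To make the first scenario concrete, the following is a minimal sketch in Python of building a report on the fly: several remote sources are queried in parallel and their records merged on a shared (lat, lon, time) key at request time, with no central copy maintained. The endpoint URLs, field names, and merge step are invented for illustration and do not come from the article.

```python
import concurrent.futures
import json
import urllib.request

# Hypothetical endpoints; a real deployment would discover these via a registry.
SOURCES = [
    "https://climate.example.org/records?region=midwest",
    "https://vegetation.example.org/records?region=midwest",
]

def fetch(url):
    """Pull the current records from one remote source (JSON is assumed)."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)

def build_report(sources):
    """Build a 'virtual warehouse' view on the fly: fetch the sources in
    parallel, merge records on a shared (lat, lon, time) key, and summarize.
    No central copy of the data is ever maintained."""
    with concurrent.futures.ThreadPoolExecutor() as pool:
        results = list(pool.map(fetch, sources))

    merged = {}
    for records in results:
        for rec in records:
            key = (rec["lat"], rec["lon"], rec["time"])
            merged.setdefault(key, {}).update(rec)

    return {"rows": len(merged), "sample": list(merged.values())[:5]}

if __name__ == "__main__":
    print(build_report(SOURCES))
```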
Requirements and Technologies
Distributed data sources can be diverse in their formats, schema, quality, access mechanisms, ownership, access policies, and capabilities. Overcoming this multi-tiered Tower of Babel to achieve distributed data integration requires technical solutions and standards in three closely related areas: data discovery and access; data exploration and analysis; and resource management, security, and policy (see Figure 1).

Data discovery and access. The first step in integrating data is discovering data that may be relevant, often through middleware that examines metadata.
Data-Integration Technologies
The following projects are contributing to the development of data-integration middleware:

Data Web. This open source, Web-based software supports access, exploration, analysis, integration, and mining of remote and distributed data (see www.dataspaceweb.net).
Earth System Grid. This U.S. government-funded project applies data Grid technologies to the integration of Earth system modeling data (see www.earthsystemgrid.org).
EU DataGrid. This European Union-funded project develops and applies data Grid technologies in high-energy physics and other domains (see www.eu-datagrid.org).
Globus Toolkit. This open-source software provides the basic infrastructure for many Grid deployments worldwide (see www.globus.org).
Open Grid Services Architecture. This integration of Grid and Web services technologies defines standard interfaces and behaviors for distributed system integration and management (see www.ggf.org/ogsa-wg and www.globus.org/ogsa).
OGSA Data Access and Integration. This Global Grid Forum working group defines service-oriented interfaces for manipulating distributed data sources (see www.ggf.org/6_DATA/dais.htm).
Storage Resource Broker. This data access and federation technology provides data-mediation functions for data-intensive science (see www.npaci.edu/DICE/SRB).
Semantic Web. This extension of the Web aims to give information well-defined meaning (see www.w3.org/2001/sw).
Virtual Data Toolkit. This product of the National Science Foundation's GriPhyN project (see www.griphyn.org) integrates data management and analysis technologies, including the Globus Toolkit, Condor, and Chimera.
Web Services. This widely used set of standards specifies how applications define, discover, and access network-accessible services (see www.w3.org/2002/ws).
Metadata can be represented, federated, and accessed in a variety of ways. Relevant technologies include Web services mechanisms (for example, the Web Services Description Language specifications); Grid-enabled data access and integration services [2]; directory services (such as the Lightweight Directory Access Protocol); XML and relational databases; Semantic Web technologies [5]; and text-based Web search mechanisms applied to unstructured text-based metadata.

Having identified data sets that might be relevant, the next step for the user is to access the data to see whether it is likely to be relevant and actually worth investigating. Data formats, schema, and access mechanisms span a broad range. Widely adopted access mechanisms include: the Open-source Project for a Network Data Access Protocol (OPeNDAP) in the environmental community; the Storage Resource Broker (SRB) [3] in scientific projects; Data Web protocols for data mining (the Data Space Transfer Protocol, or DSTP); and GridFTP for high-performance and striped data movement. The OGSA-based Data Access and Integration (OGSA-DAI) [2] standards emerging from the Global Grid Forum seek to integrate these and other approaches.

Data access can demand high transport performance and require parallel data access and movement. For example, if remote data is being delivered at a rate of 1Gbps, and a particular application's data-integration activity involves reading 10 local bytes per remote byte received and performing 100 operations per local byte read, then the application requires 10Gbps local read bandwidth and 1 Teraops/sec. of local computing to keep up with data delivery (a substantial and necessarily parallel resource).

Striping data using multiple network connections linking pairs of nodes in distributed clusters is becoming a core technique in high-performance data transport [10]. The GridFTP extensions to the popular FTP protocol represent a standard approach to exploiting parallelism in data transfers, allowing multiple data channels to be coordinated via FTP control channel commands. Also relevant is the work on advanced protocols described in the article by Falk et al. in this section.
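As a rough illustration of striping (this is not GridFTP, and the host name is made up), the sketch below moves a single remote object over several concurrent connections using HTTP Range requests, one contiguous stripe per channel, and reassembles the stripes in offset order at the receiver.

```python
import concurrent.futures
import urllib.request

def fetch_stripe(url, start, end):
    """Fetch one stripe of the remote object using an HTTP Range request."""
    req = urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})
    with urllib.request.urlopen(req, timeout=60) as resp:
        return start, resp.read()

def striped_download(url, size, channels=4):
    """Move one object over several concurrent connections ('stripes') and
    reassemble the stripes in offset order on the receiving side."""
    stripe = -(-size // channels)  # ceiling division
    ranges = [(i * stripe, min((i + 1) * stripe, size) - 1)
              for i in range(channels) if i * stripe < size]

    with concurrent.futures.ThreadPoolExecutor(max_workers=channels) as pool:
        parts = pool.map(lambda r: fetch_stripe(url, *r), ranges)

    return b"".join(data for _, data in sorted(parts, key=lambda p: p[0]))

if __name__ == "__main__":
    # Hypothetical 1MB object on a server that supports Range requests.
    blob = striped_download("https://data.example.org/archive/climate.nc", size=1 << 20)
    print(len(blob))
```

GridFTP coordinates its parallel channels through FTP control-channel commands; the HTTP Range approach here is only a stand-in that makes the striping pattern easy to see.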
We anticipate the emergence of data access services supporting the flexible creation and manipulation of views on data sources (whether files or tables) and access to those views using a variety of operations, including database-style operations (such as SQL "select") and other more general operations (such as attribute selection, row selection via range queries, and record selection via sampling). Integrating these mechanisms with high-performance transport protocols remains a major unresolved problem.

Data exploration and analysis. Data rendered accessible can be analyzed in detail. Here, data exploration services are needed to address the challenges inherent in finding relevant data that can be combined with local data or with other remote data to achieve new discoveries. These services can provide basic statistical summaries, enable visual exploration of data, and support standard exploratory functions (such as building clusters or computing the regression of one variable on another).

Efficient integration of distributed data requires protocols and services for managing the data records constituting data archives. Unlike files of bits, data archives of records have attributes, attribute metadata, keys, and missing values. Mechanisms for providing attribute- and record-based access to remote and distributed data include: SQL-based access methods for relational data; protocols designed to work with remote data (such as the Data Web Transfer Protocol [10], OPeNDAP, and OGSA-DAI [2]); and protocols designed to work with remote and distributed semistructured data (such as XPath). Data Webs support the exploration and mining of distributed data using templated data-mining operations.

Figure 2. Computer scientists have a good understanding of how to perform relational joins when data is at rest in a single location. An important method for data integration is to join data in motion to look for patterns across data sets. An experiment at the iGrid 2002 Conference in Amsterdam integrated (on the fly) climate data from Chicago with vegetation data in Amsterdam at transfer rates greater than 2.4Gbps, a land-speed record at the time; integration involved two distributed three-node clusters and employed the SABUL data-transport protocol.
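The view-style operations described above (attribute selection, row selection via range queries, and record selection via sampling) can be sketched over ordinary in-memory records; the class and method names below are illustrative only and are not taken from OPeNDAP, SRB, DSTP, or OGSA-DAI.

```python
import random

class RecordView:
    """A toy 'view' over a list of records (dicts), supporting the attribute
    selection, range-query, and sampling operations discussed above."""

    def __init__(self, records):
        self.records = records

    def select_attributes(self, names):
        """Keep only the named attributes of each record."""
        return RecordView([{n: r[n] for n in names if n in r} for r in self.records])

    def range_query(self, attr, low, high):
        """Keep records whose attribute value falls within [low, high]."""
        return RecordView([r for r in self.records
                           if attr in r and low <= r[attr] <= high])

    def sample(self, k, seed=0):
        """Return a reproducible random sample of up to k records."""
        rng = random.Random(seed)
        return RecordView(rng.sample(self.records, min(k, len(self.records))))

if __name__ == "__main__":
    view = RecordView([{"lat": 41.9, "lon": -87.6, "time": t, "temp": 270 + t}
                       for t in range(100)])
    subset = view.range_query("time", 10, 20).select_attributes(["time", "temp"]).sample(3)
    print(subset.records)
```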
The transformation, analysis, and synthesis performed during data integration can be complex and computationally intensive. Data-transformation primitives incorporated into data middleware cannot capture arbitrary computations but can express many common data-preparation operations [10]. More general workflow services are also required to support the integration and scheduling of arbitrary user- and community-defined transformations. Users benefit from tools that record, organize, and exploit knowledge about how these activities derive new data from old. Virtual data systems aim to capture this information so as to allow reuse of generated data, explanation of data provenance, and other activities [8].

Resource management, security, and policy. Being familiar with today's bandwidth- and data-poor world, users often assume only standard schema and access methods are required to render remote data accessible. But the distributed analysis of large quantities of data is computationally (and bandwidth) intensive, and a high-performance Internet can expose popular data resources to the risk of essentially unlimited loads. Efficient petascale data integration can require the harnessing and coordinated management of multiple computational and network resources at multiple sites. Thus, clients (and brokers acting on their behalf) need to negotiate service level agreements (SLAs) with computers, storage systems, and networks. They also need to deploy applications able to achieve desired end-to-end performance across these resources, as well as monitor performance and adapt to performance problems at either the network or SLA level [7]. For example, an application might request an end-to-end optical network plus associated computing and storage resources, use the resources to integrate remote and local data, then release them. Another effective optimization is to decouple data movement and computation so the data is staged to locations "near" (in terms of some access cost metric) to where it is required [12]. Data replication and distribution of data across the network [4] are also effective techniques.

Along with the data itself, the physical resources employed for data integration are frequently precious and thus subject to access controls. Data-integration middleware must therefore provide comprehensive security, policy, and resource management solutions. These solutions are required at multiple levels, ranging from the individual user ("Can I access this file?"), to the user community ("How many Gb-hours is this community allocated?"), and from the local ("Allocate me 1Gbps bandwidth"), to the end-to-end ("Allocate resources to achieve 10Gbps throughput for this pipeline"), to the global ("Ensure that the most popular data sets are replicated"). Security and policy solutions must address the concerns of both the institutions that own specific resources and the communities wishing to achieve distributed analysis.
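As an illustration only, a broker might screen a request against the levels of policy just listed before committing to an SLA. The policy questions come from the text; the data structures, quotas, and function names are invented for this sketch.

```python
from dataclasses import dataclass

@dataclass
class Request:
    user: str
    community: str
    dataset: str
    bandwidth_gbps: float
    gb_hours: float

# Toy policy tables standing in for real security and policy services.
USER_ACL = {("alice", "ccm3/temperature"): True}
COMMUNITY_QUOTA_GB_HOURS = {"climate": 10_000.0}
LINK_CAPACITY_GBPS = 10.0

def admit(req, community_usage_gb_hours, link_load_gbps):
    """Check a request at three levels before committing to an SLA:
    individual access rights, community allocation, and resource capacity."""
    if not USER_ACL.get((req.user, req.dataset), False):
        return False        # individual level: "Can I access this file?"
    if community_usage_gb_hours + req.gb_hours > COMMUNITY_QUOTA_GB_HOURS.get(req.community, 0.0):
        return False        # community level: allocated Gb-hours exhausted
    if link_load_gbps + req.bandwidth_gbps > LINK_CAPACITY_GBPS:
        return False        # resource level: requested bandwidth unavailable
    return True

if __name__ == "__main__":
    r = Request("alice", "climate", "ccm3/temperature", bandwidth_gbps=1.0, gb_hours=50.0)
    print(admit(r, community_usage_gb_hours=9_900.0, link_load_gbps=8.5))  # True
```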
Implications
Two examples from the sciences illustrate some practical implications and applications of these issues:

Joins of distributed Earth science data. The National Center for Atmospheric Research's Community Climate Model 3 (CCM3) is used to research CO2 warming and climate change, climate prediction and predictability, atmospheric chemistry, paleoclimate, biosphere-atmosphere transfer, and nuclear winter. Scientists regularly want to integrate their data with CCM3 data. For example, they might wish to join their historical data about vegetation levels with CCM3 data to study the effect of global climate change on certain types of vegetation.
Figure 3. The steps involved in galaxy cluster detection in Sloan data, showing the image data, a small directed acyclic graph (DAG) for the pipeline, the execution schedule for that DAG, and example output data (the galaxy cluster size distribution).
A typical data-integration operation is to join a field xi, say, temperature, from the remote CCM3 data set, which includes a key ki consisting of a latitude-longitude-time triple (ri, si, ti), with a field yi from the other data set representing a vegetation level for the same key (ri, si, ti). In this way, scientists can estimate functional relationships of the form y = f(k; x) to capture how vegetation levels change over time with changes in climatic variables.

In one study, the goal was to integrate data on the fly, without co-locating it, in order to obtain an estimate of whether such a relationship is probable, in which case more careful follow-up studies would be needed. The study was performed in conjunction with the iGrid 2002 conference in Amsterdam, The Netherlands, evaluating various algorithms for transporting and joining distributed streams of data indexed by latitude, longitude, and time (see Figure 2) [9]. One stream contained temperature and related CCM3 data, the other vegetation levels. One data set was located on a three-node cluster in Chicago, the other on a three-node cluster in Amsterdam. The DataSpace Data Web software was used to move the data across the Atlantic and perform a streaming join of it in Amsterdam.
A parallel version of the Simple Available Bandwidth Utilization Library (SABUL) protocol was used for data transport, and DSTP was used to manage keys, metadata, and data. Data was moved at a rate greater than 900Mbps per node (2.4Gbps, or 1TB/hour, with a three-node cluster) and merged at approximately half that speed [9]. This experiment demonstrated conclusively that geographical distance need not be an obstacle to data integration.
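The streaming join at the heart of this experiment can be sketched as a symmetric hash join keyed on the (latitude, longitude, time) triple from the text. The code below is an illustration in Python, not the DataSpace/DSTP implementation; it buffers unmatched records per key and emits joined pairs as soon as a key has appeared on both streams, so the data never needs to be co-located.

```python
from collections import defaultdict

def streaming_join(temps, veg):
    """Symmetric hash join over two record streams. Each record is
    (key, value) with key = (latitude, longitude, time). Unmatched records
    are buffered per key, and (key, x, y) pairs are emitted as soon as both
    sides of a key have arrived."""
    pending_x = defaultdict(list)
    pending_y = defaultdict(list)

    for (kx, x), (ky, y) in zip(temps, veg):  # consume the streams in lockstep
        pending_x[kx].append(x)
        pending_y[ky].append(y)
        for k in {kx, ky}:
            while pending_x[k] and pending_y[k]:
                yield k, pending_x[k].pop(), pending_y[k].pop()

if __name__ == "__main__":
    keys = [(41.9, -87.6, t) for t in range(5)]
    x_stream = [(k, 270.0 + i) for i, k in enumerate(keys)]       # CCM3 temperature
    y_stream = [(k, 0.30 + 0.01 * i) for i, k in enumerate(keys)] # vegetation level
    for k, x, y in streaming_join(x_stream, y_stream):
        print(k, x, y)
```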
Galaxy cluster identification in Sloan data. The Sloan Digital Sky Survey (SDSS) is a digital imaging survey that will, by 2007, have mapped a quarter of the sky in five colors with a sensitivity two orders of magnitude greater than previous large sky surveys. The SDSS data is being made available online as both a large collection (~10TB) of images and a smaller set of catalogs (~2TB) containing measurements on each of 250 million detected objects. SDSS is just one example of a growing set of digital sky survey projects that will soon yield an unprecedented international, distributed, multi-petabyte collection of digital astronomical data [11].

Another recent experiment [1] showed how this online data could be integrated with distributed computing and storage resources to perform computationally intensive analysis of unprecedented scale. The challenge was to search the Sloan database for galaxy clusters, the largest gravitationally dominated structures in the universe. Software developed for the GriPhyN project—the so-called Virtual Data Toolkit—was used to plan, then manage, the required workflow (see Figure 3), ultimately involving computational clusters at four sites across the U.S. This illustrates how even large-scale distributed data analysis tasks might become routine once appropriate infrastructure is in place.
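A workflow of this kind can be expressed as a directed acyclic graph of steps and scheduled in topological order. In the sketch below the step names are taken from Figure 3, but the dependency structure is a simplified guess and the scheduler is a toy stand-in, not Chimera or Virtual Data Toolkit syntax.

```python
# Each step maps to (inputs, outputs). The step names follow Figure 3; the
# dependency structure is a simplified guess, and the scheduler below is a
# toy stand-in for the Virtual Data Toolkit's planning tools.
STEPS = {
    "getTargetRegion": ([], ["target.fit"]),
    "getBufferRegion": ([], ["buffer.fit"]),
    "brgSearch":       (["target.fit", "buffer.fit"], ["brg.par"]),
    "bcgSearch":       (["brg.par"], ["cores.par"]),
    "getCoreBuffer":   (["cores.par"], ["coresBuffer.fit"]),
    "bcgCoalesce":     (["coresBuffer.fit"], ["clusters.par"]),
    "getCluster":      (["clusters.par"], ["clusters.fit"]),
    "getGalaxies":     (["clusters.fit"], ["galcatalog.fit"]),
    "getCatalog":      (["galcatalog.fit"], ["catalog.fit"]),
}

def topological_schedule(steps):
    """Order the steps so each runs only after the steps producing its inputs;
    steps in the same batch could be dispatched to different sites in parallel."""
    produced, remaining, order = set(), dict(steps), []
    while remaining:
        ready = [n for n, (ins, _) in remaining.items()
                 if all(i in produced for i in ins)]
        if not ready:
            raise ValueError("cycle or missing input in workflow")
        order.append(ready)
        for name in ready:
            produced.update(remaining.pop(name)[1])
    return order

if __name__ == "__main__":
    for batch in topological_schedule(STEPS):
        print(batch)
```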
Conclusion
The data tsunami already upon us offers great opportunities for new insight and knowledge but demands significant advances in middleware for integrating data from diverse distributed sources. That's why we have sought to explore here not only the state of the art but likely future directions for this middleware. Data mining emerged from statistics as a new discipline during the past decade, as large data sets became more and more common and the need for new technologies to mine them became critical. In the coming decade, data integration will emerge from distributed computing and data mining, fueled by the increasing number of distributed data sets and enabled by improving network performance. Data integration promises to have at least as great an effect as data mining has had.

References
1. Annis, J., Zhao, Y., Voeckler, J., Wilde, M., Kent, S., and Foster, I. Applying Chimera virtual data concepts to cluster finding in the Sloan Sky Survey. In Proceedings of SC2002 (Baltimore, MD, Nov. 16–22). ACM Press, New York, 2002.
2. Atkinson, M., Chervenak, A., Kunszt, P., Narang, I., Paton, N., Pearson, D., Shoshani, A., and Watson, P. Data access, integration, and management. In The Grid: Blueprint for a New Computing Infrastructure, 2nd Ed., I. Foster and C. Kesselman, Eds. Morgan Kaufmann, San Francisco, CA, 2004.
3. Baru, C., Moore, R., Rajasekar, A., and Wan, M. The SDSC Storage Resource Broker. In Proceedings of the 8th Annual IBM Centers for Advanced Studies Conference (Toronto, Canada, 1998).
4. Beck, M., Moore, T., and Plank, J. An end-to-end approach to globally scalable network storage. In Proceedings of ACM SIGCOMM '02 (Pittsburgh, PA, Aug. 19–23). ACM Press, New York, 2002.
5. Berners-Lee, T., Hendler, J., and Lassila, O. The Semantic Web. Sci. Am. 284, 5 (May 2001), 34–43.
6. Chervenak, A., Foster, I., Kesselman, C., Salisbury, C., and Tuecke, S. The Data Grid: Towards an architecture for the distributed management and analysis of large scientific data sets. J. Net. Comput. Applic. 23, 3 (July 2000), 187–200.
7. Czajkowski, K., Foster, I., and Kesselman, C. Resource and service management. In The Grid: Blueprint for a New Computing Infrastructure, 2nd Ed., I. Foster and C. Kesselman, Eds. Morgan Kaufmann, San Francisco, CA, 2004.
8. Foster, I., Voeckler, J., Wilde, M., and Zhao, Y. The Virtual Data Grid: A new model and architecture for data-intensive collaboration. In Proceedings of the Conference on Innovative Data Systems Research (Asilomar, CA, Jan. 5–8, 2003).
9. Grossman, R., Gu, Y., Hanley, D., Hong, X., Lillethun, D., Levera, J., Mambretti, J., Mazzucco, M., and Weinberger, J. Experimental studies using photonic data services at iGrid 2002. Future Gen. Comput. Syst. 19, 6 (2003).
10. Grossman, R. Standards and infrastructures for data mining. Commun. ACM 45, 8 (Aug. 2002), 45–48.
11. Szalay, A. and Gray, J. The World-Wide Telescope. Science 293 (2001), 2037–2040.
12. Thain, D., Basney, J., Son, S.-C., and Livny, M. The Kangaroo approach to data movement on the Grid. In Proceedings of the 10th IEEE International Symposium on High-Performance Distributed Computing (San Francisco, CA, Aug. 7–9). IEEE Computer Society Press, New York, 2001.

Ian Foster ([email protected]) is associate division director and senior scientist at Argonne National Laboratory, Argonne, IL, and a professor of computer science at The University of Chicago.
Robert L. Grossman ([email protected]) is director of the Laboratory of Advanced Computing and the National Center for Data Mining at the University of Illinois at Chicago and president of the Two Cultures Group, Chicago.

This work is supported in part by the Mathematical, Information, and Computational Sciences Division subprogram of the Office of Advanced Scientific Computing Research, Office of Science, SciDAC Program, U.S. Department of Energy, under Contract W-31-109-ENG-38, and by the National Science Foundation under contract ITR-0086044 (GriPhyN) and cooperative agreement ANI-0225642 (OptIPuter).

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

© 2003 ACM 0002-0782/03/1100 $5.00