Data-Oriented Distributed Computing for Science: Reality and Possibilities

Daniel S. Katz1,2 (corresponding author), Joseph C. Jacob2, Peggy P. Li2, Yi Chao2, and Gabrielle Allen1

1 Center for Computation & Technology, Louisiana State University
[email protected]
2 Jet Propulsion Laboratory, California Institute of Technology
Abstract. As is becoming commonly known, there is an explosion happening in the amount of scientific data that is publicly available. One challenge is how to make productive use of this data. This talk will discuss some parallel and distributed computing projects, centered around virtual astronomy but also including other scientific data-oriented realms. It will look at some specific projects from the past, including Montage (http://montage.ipac.caltech.edu/), Grist (http://grist.caltech.edu/), OurOcean (http://OurOcean.jpl.nasa.gov/), and SCOOP (http://scoop.sura.org/), and will discuss the distributed computing, Grid, and Web-service technologies that have successfully been used in these projects.
1 Introduction
This talk will explore a pair of related questions in computer science and computational science: “What is a Grid?” and “How can the concepts that are sometimes described as a Grid be used to do science, particularly with large amounts of data?” The reason these two questions are interesting is that the amount of scientific data that is available is exploding, and as both the data itself and the computing used to obtain knowledge from it are distributed, it is important that researchers understand how other researchers are approaching these issues.
2 Definitions and Meanings
A recently written paper [1] that attempts to understand what Grid researchers mean when they say “Grid” mentions that the term Grid was introduced in 1998 and says that since that time, many technological changes have occurred in both hardware and software. The main purpose of the paper was to take a snapshot of how Grid researchers define the Grid by asking them to:
“Try to define what are the important aspects that build a Grid, what is distinctive, and where are the borders to distributed computing, Internet computing, etc.”

More than 170 researchers were contacted, and more than 40 contributed material that was considered distinct enough to be of value to the paper. The conclusion of the paper is that the research community has a fairly homogeneous view of Grids, with a few outlying opinions, but with far less diversity of opinion than that expressed in a similar survey among industrial information technology (IT) leaders. One interesting section of the paper (3.2.1) discusses the fact that some of the people surveyed considered Grids a general form of distributed computing, while others considered Grids a special form of distributed computing, and some said that there is no line between Grids and distributed computing. The next section (3.2.2) states that Grid services are Web services with additional features. To explore these possible distinctions, this talk will examine how some real applications make use of the Grid and of Web services.
3 Projects

3.1 Montage
Montage [2] is a set of software packages that can be used to build astronomical image mosaics. Users see Montage either as a set of software executables that can be run on a single processor or on a parallel system, or as a portable, compute-intensive service. In either case, Montage delivers science-grade custom mosaics. Science-grade in this context requires that terrestrial and instrumental features are removed from images in a way that can be described quantitatively; custom refers to user-specified parameters of projection, coordinates, size, rotation, and spatial sampling. Fig. 1 shows examples of two mosaics, one without and one with background rectification. This talk will discuss the performance of the parallel Montage application and will compare it with the performance of the Montage portal on the same problems. This will show that at least some Grid software is both sufficiently mature and sufficiently high-performance to be useful to a set of real scientific applications.
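To make the executable view of Montage concrete, the sketch below drives the main pipeline stages (reprojection, background rectification, and co-addition) from Python by invoking the Montage modules. The module names are the standard Montage executables, but the directory layout, file names, and exact flags shown here are illustrative assumptions rather than a prescription from this paper.

```python
# Minimal sketch of a Montage mosaic pipeline driven from Python.
# Module names (mImgtbl, mProjExec, mAdd, ...) are the standard Montage
# executables; directory and file names are illustrative assumptions.
import subprocess

def run(cmd):
    """Run one Montage module and fail loudly if it returns an error."""
    print(" ".join(cmd))
    subprocess.run(cmd, check=True)

# 1. Build a metadata table for the raw input images.
run(["mImgtbl", "raw", "images.tbl"])

# 2. Reproject every image onto a common header (template.hdr).
run(["mProjExec", "-p", "raw", "images.tbl", "template.hdr",
     "projected", "stats.tbl"])
run(["mImgtbl", "projected", "pimages.tbl"])

# 3. Background rectification: fit and remove a background plane from each
#    image so that overlapping images agree (this is what distinguishes the
#    right-hand mosaic in Fig. 1 from the left-hand one).
run(["mOverlaps", "pimages.tbl", "diffs.tbl"])
run(["mDiffExec", "-p", "projected", "diffs.tbl", "template.hdr", "diffdir"])
run(["mFitExec", "diffs.tbl", "fits.tbl", "diffdir"])
run(["mBgModel", "pimages.tbl", "fits.tbl", "corrections.tbl"])
run(["mBgExec", "-p", "projected", "pimages.tbl", "corrections.tbl",
     "corrected"])
run(["mImgtbl", "corrected", "cimages.tbl"])

# 4. Co-add the rectified, reprojected images into the final mosaic.
run(["mAdd", "-p", "corrected", "cimages.tbl", "template.hdr", "mosaic.fits"])
```

The same sequence of stages underlies both the single-processor and parallel versions of Montage; the parallel and portal versions differ mainly in how the per-image steps are farmed out.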
3.2 Grist
The Grist project [3] aims to establish a mechanism whereby Grid services for the astronomy community may be quickly and easily deployed on the NSF TeraGrid, while meeting the requirements of both the service users (the astronomy virtual observatory community) and the Grid security administrators. In collaboration with the TeraGrid and the National Virtual Observatory (NVO), the Grist project is building the NVO Extensible Secure Scalable Service Infrastructure (NESSSI), a service-oriented architecture with the following characteristics:
– Services are created, deployed, managed, and upgraded by their developers, who are trusted users of the compute platform where their service is deployed.
– Service jobs may be initiated with Java or Python client programs run on the command line, or with a web portal called Cromlech.
– Service clients are authenticated with “graduated security” [4], which scales the size of jobs that are allowed with the level of authentication of the user (a minimal sketch of this idea follows this list). The Clarens service infrastructure [5] serves as the “gatekeeper” by managing user certificates.
– Access to private data, such as that from the Palomar-QUEST survey, is restricted via a proposed “Visa” system, which examines user certificates to determine who is authorized to access each dataset supported by a service.
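To illustrate the graduated-security idea mentioned in the list above, the following is a minimal sketch, not code from NESSSI, Clarens, or HotGrid [4], of a policy check that ties the size of an allowed job to the strength of the client's authentication; the tier names and limits are invented for illustration.

```python
# Hypothetical sketch of a "graduated security" policy check: the stronger
# the user's authentication, the larger the job they may run. The tiers and
# limits below are illustrative, not taken from NESSSI or Clarens.
from dataclasses import dataclass

# Maximum CPU-hours permitted per request, keyed by authentication level.
JOB_LIMITS = {
    "anonymous": 1,            # no credential at all
    "portal_account": 50,      # username/password on the portal
    "grid_certificate": 5000,  # a full X.509 grid credential
}

@dataclass
class JobRequest:
    user: str
    auth_level: str   # one of the keys in JOB_LIMITS
    cpu_hours: float  # estimated size of the requested job

def authorize(request: JobRequest) -> bool:
    """Allow the job only if it fits within the user's authentication tier."""
    limit = JOB_LIMITS.get(request.auth_level, 0)
    return request.cpu_hours <= limit

# Example: a portal user asking for a small job is allowed,
# but the same user asking for a survey-scale run is not.
print(authorize(JobRequest("alice", "portal_account", cpu_hours=10)))    # True
print(authorize(JobRequest("alice", "portal_account", cpu_hours=2000)))  # False
```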
Fig. 1. A Montage-produced 2MASS mosaic of a square degree of sky near the Galactic center (left, without background rectification; right, with background rectification)
Grist is using NESSSI to build the NVO Hyperatlas, which supports multiwavelength science via the construction of standard image plates at various wavelengths and pixel sizes, and associated services to construct and visualize these plates. In support of this objective, the Grist team is deploying services on the TeraGrid for computing image mosaics and image cutouts. The image cutout service will be scaled up to compute massive numbers of cutouts in a single request on the TeraGrid. This multi-cutout service will provide input data for a galaxy morphology study to be conducted with a science partner. This talk will discuss NESSSI and how it can be used to build services, including Montage, that use TeraGrid resources while properly handling authentication and authorization issues, either from a portal or from a client application.
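The multi-cutout service can be pictured as one batched request rather than many small ones. The sketch below is purely illustrative: the request structure and helper functions are assumptions, not the actual NESSSI client interface.

```python
# Illustrative sketch of a batched cutout request. The request layout and the
# submit_batch() helper are hypothetical; the real clients are the Java/Python
# programs and the Cromlech portal described above.
from typing import Dict, List

def build_cutout_batch(positions: List[Dict], size_deg: float) -> Dict:
    """Bundle many cutout specifications into one service request."""
    return {
        "service": "multi-cutout",   # assumed service name
        "size_deg": size_deg,        # cutout width/height in degrees
        "targets": [{"ra": p["ra"], "dec": p["dec"]} for p in positions],
    }

def submit_batch(request: Dict) -> str:
    """Stand-in for an authenticated submission to the service.
    In practice this would present the user's certificate (via Clarens)
    and return a job handle to poll for results."""
    return "submitted {} cutouts".format(len(request["targets"]))

# A galaxy-morphology style use: thousands of positions, one request.
galaxies = [{"ra": 180.0 + 0.01 * i, "dec": -1.0} for i in range(5000)]
print(submit_batch(build_cutout_batch(galaxies, size_deg=0.02)))
```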
3.3 OurOcean
OurOcean [6] is a JPL project that has built a portal to enable users to easily access ocean science data, run data assimilation models, and visualize both data
and models. The concept of OurOcean is to allow users with minimal resource requirements to access data and interact with models. Currently, OurOcean provides both real-time and retrospective analysis of remote sensing data and ocean model simulations in the Pacific Ocean. OurOcean covers the U.S. West Coastal Ocean with focused areas around Southern California, Central and Northern California, and Prince William Sound in Alaska. OurOcean consists of a data server, a web server, a visualization server, and an on-demand server, as shown in Fig. 2. The data server is in charge of real-time data retrieval and processing. Currently, the data server manages a MySQL database and a 5-terabyte RAID disk. The web server is an Apache 2 server with Tomcat running on a Linux workstation. The web server dispatches user requests to the visualization server and the on-demand server to generate custom plots or invoke on-demand modeling. The visualization server consists of a set of plotting programs written in GMT and Matlab. In addition, the Live Access Server (LAS) [7] is used to provide subsetting and on-the-fly graphics for 3D time-series model output. Finally, the on-demand server manages the custom model runs and the computing resources. OurOcean has a 12-processor SGI Origin 350 and a 16-processor SGI Altix cluster as modeling engines.
Fig. 2. The OurOcean hardware architecture
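The division of labor among the OurOcean servers can be illustrated with a small dispatcher in which the web tier inspects each request and hands it either to the plotting programs or to the on-demand modeling queue. This is a sketch of the pattern only; the function, tool, and queue names are assumptions, not OurOcean code.

```python
# Sketch of the OurOcean request-dispatch pattern: the web tier routes user
# requests either to the visualization server (custom plots) or to the
# on-demand server (custom model runs). Tool and queue names are assumptions.
import queue

# Requests for custom model runs wait here for the on-demand server,
# which owns the SGI modeling engines.
on_demand_queue: "queue.Queue[dict]" = queue.Queue()

def handle_request(request: dict) -> str:
    kind = request.get("type")
    if kind == "plot":
        # Visualization server: in OurOcean this step would invoke one of
        # the GMT/Matlab plotting programs; here we only form the command.
        cmd = ["plot_tool", request["dataset"], request["region"]]  # assumed tool name
        return "visualization server would run: " + " ".join(cmd)
    if kind == "model_run":
        # On-demand server: queue a custom model run; the on-demand server
        # decides when resources allow it to start.
        on_demand_queue.put(request)
        return "model run queued ({} waiting)".format(on_demand_queue.qsize())
    return "unknown request type"

# Example requests of the two kinds the web server distinguishes.
print(handle_request({"type": "plot", "dataset": "sst", "region": "socal"}))
print(handle_request({"type": "model_run", "model": "ROMS", "region": "pws"}))
```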
3.4 SCOOP
Similarly to OurOcean, the SURA Coastal Ocean Observing and Prediction (SCOOP) program [8] is developing a distributed laboratory for coastal research and applications. The project vision is to provide tools to enable communities
of scientists to work together to advance the science of environmental prediction and hazard planning for the Southeast US coast. To this end, SCOOP is building a cyberinfrastructure using a service-oriented architecture, which will provide an open, integrated network of distributed data archives, computer models, and sensors. This cyberinfrastructure includes components for data archiving, integration, translation and transport, model coupling and workflow, event notification, and resource brokering. SCOOP is driven by three user scenarios, all of which involve predicting the coastal response to extreme events such as hurricanes or tropical storms: (i) ongoing real-time predictions, (ii) retrospective analyses, and (iii) event-driven ensemble predictions. Of these use cases, event-driven ensemble prediction provides the most compelling need for Grids, and provides new challenges in scheduling and policies for emergency computing. Following an initial hurricane advisory provided by the National Hurricane Center, the SCOOP system will construct an appropriate ensemble of different models to simulate the coastal effect of the storm. These models, driven by real-time meteorological data, provide estimates of both storm surge and wave height. The ensemble covers different models (e.g., surge and wave), areas (e.g., the entire southeast or higher-resolution regions), wind forcings (e.g., NCEP, GFDL, MM5, and analytically generated winds), and other parameters. The ensemble of runs needs to be completed in a reliable and timely manner, so that results can be analyzed and information provided that could aid emergency responders. Core working components of the SCOOP cyberinfrastructure include a transport system built primarily on the Local Data Manager (LDM) [9], a reliable data archive [10], a catalogue with web service interfaces to query information about models and data locations, client tools for data location and download (getdata) [11], and various coastal models (including ADCIRC, CH3D, ELCIRC, SWAN, WAM, and WW3), which are deployed using Globus GRAM. An application portal built with GridSphere [12] provides user interfaces to the tools, and the results are disseminated through the OpenIOOS site (http://www.openioos.org/). Current development in SCOOP is focused on model scheduling, deployment, and monitoring, incorporating on-demand priority scheduling using SPRUCE [13].
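The event-driven ensemble can be viewed as a cross-product of models, domains, and wind forcings that is assembled when an advisory arrives and submitted with elevated priority. The sketch below illustrates that construction only; the member lists, the submit helper, and the priority handling are assumptions, not the SCOOP implementation (which deploys models via Globus GRAM and uses SPRUCE for priority scheduling).

```python
# Illustrative construction of an event-driven, SCOOP-style ensemble: one
# member per (model, domain, wind forcing) combination, submitted with an
# urgent-computing priority. All names and the submit() helper are
# hypothetical stand-ins.
from itertools import product

MODELS = ["ADCIRC", "SWAN", "WW3"]               # surge and wave models (subset)
DOMAINS = ["southeast_coarse", "landfall_fine"]  # whole region vs. high resolution
WINDS = ["NCEP", "GFDL", "MM5", "analytic"]      # wind forcing sources

def submit(member: dict, urgent: bool = True) -> None:
    """Stand-in for job submission; a real system would call the scheduler
    and request elevated priority for urgent runs."""
    priority = "URGENT" if urgent else "normal"
    print(f"[{priority}] {member['model']} / {member['domain']} / {member['wind']}")

def launch_ensemble(advisory_id: str) -> int:
    """Build and submit every ensemble member for one hurricane advisory."""
    members = [
        {"advisory": advisory_id, "model": m, "domain": d, "wind": w}
        for m, d, w in product(MODELS, DOMAINS, WINDS)
    ]
    for member in members:
        submit(member)
    return len(members)

print(launch_ensemble("advisory-001"), "members submitted")
```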
4 Conclusion
As the projects that have been discussed show, there are many alternative methods to effectively perform scientific calculations in a distributed computing environment. A common aspect of these projects is that they have involved a mix of computer scientists and application scientists. The combined actions of members of these communities working toward a common vision often seem to lead to a successful project. To a large extent, the names and specific technologies used to define the environment are not important, except in how they allow multiple people to effectively discuss and extend distributed scientific computing.
References

1. Stockinger, H.: Defining the Grid: A Snapshot on the Current View. J. of Supercomputing (Spec. Issue on Grid Computing), submitted June 26, 2006
2. Jacob, J.C., Katz, D.S., Berriman, G.B., Good, J., Laity, A.C., Deelman, E., Kesselman, C., Singh, G., Su, M.-H., Prince, T.A., Williams, R.: Montage: A Grid Portal and Software Toolkit for Science-Grade Astronomical Image Mosaicking. Int. J. of Computational Science and Engineering (to appear)
3. Jacob, J.C., Williams, R., Babu, J., Djorgovski, S.G., Graham, M.J., Katz, D.S., Mahabal, A., Miller, C.D., Nichol, R., Vanden Berk, D.E., Walia, H.: Grist: Grid Data Mining for Astronomy. Proc. Astronomical Data Analysis Software & Systems (ADASS) XIV (2004)
4. Williams, R., Steenberg, C., Bunn, J.: HotGrid: Graduated Access to Grid-based Science Gateways. Proc. IEEE SC|04 Conf. (2004)
5. Steenberg, C., Bunn, J., Legrand, I., Newman, H., Thomas, M., van Lingen, F., Anjum, A., Azim, T.: The Clarens Grid-enabled Web Services Framework: Services and Implementation. Proc. Comp. for High Energy Phys. (2004)
6. Li, P., Chao, Y., Vu, Q., Li, Z., Farrara, J., Zhang, H., Wang, X.: OurOcean: An Integrated Solution to Ocean Monitoring and Forecasting. Proc. MTS/IEEE Oceans'06 Conf. (2006)
7. Hankin, S., Davison, J., Callahan, J., Harrison, D.E., O'Brien, K.: A Configurable Web Server for Gridded Data: A Framework for Collaboration. Proc. 14th Int. Conf. on IIPS for Meteorology, Oceanography, and Hydrology (1998) 417–418
8. Bogden, P., Allen, G., Stone, G., Bintz, J., Graber, H., Graves, S., Luettich, R., Reed, D., Sheng, P., Wang, H., Zhao, W.: The Southeastern University Research Association Coastal Ocean Observing and Prediction Program: Integrating Marine Science and Information Technology. Proc. MTS/IEEE Oceans'05 Conf. (2005)
9. Davis, G.P., Rew, R.K.: The Unidata LDM: Programs and Protocols for Flexible Processing of Data Products. Proc. 10th Int. Conf. on IIPS for Meteorology, Oceanography, and Hydrology (1994) 131–136
10. MacLaren, J., Allen, G., Dekate, C., Huang, D., Hutanu, A., Zhang, C.: Shelter from the Storm: Building a Safe Archive in a Hostile World. Lecture Notes in Computer Science 3752 (2005) 294–303
11. Huang, D., Allen, G., Dekate, C., Kaiser, H., Lei, H., MacLaren, J.: getdata: A Grid Enabled Data Client for Coastal Modeling. Proc. High Performance Comp. Symp. (2006)
12. Zhang, C., Dekate, C., Allen, G., Kelley, I., MacLaren, J.: An Application Portal for Collaborative Coastal Modeling. Concurrency and Computation: Practice and Experience 18 (2006) 1–11
13. Beckman, P., Nadella, S., Beschastnikh, I., Trebon, N.: SPRUCE: Special PRiority and Urgent Computing Environment. University of Chicago DSL Workshop (2006)