Information Technology Implementation for a Distributed Data System Serving Earth Scientists: Seasonal to Interannual ESIP
Menas Kafatos, CEOSR/CSI, George Mason University, Fairfax, VA, [email protected]
X. Sean Wang, ISE Dept., School of IT&E & CEOSR, George Mason Univ., [email protected]
Zuotao Li, CEOSR, GMU, [email protected]
Dan Ziskin, GDAAC, NASA, GSFC, Greenbelt, MD, & CEOSR, GMU, [email protected]
Ruixin Yang, CEOSR, GMU, [email protected]
Abstract
In this article, we address the implementation of a distributed data system designed to serve Earth system scientists. A consortium led by George Mason University has recently been funded by NASA’s Working Prototype Earth Science Information Partner (WP-ESIP) program to develop, implement, and operate a distributed data and information system. The system will address the research needs of seasonal to interannual scientists whose research focus includes phenomena such as El Niño, monsoons and associated climate studies. The system implementation involves several institutions using a multitiered client-server architecture. Specifically, the consortium involves an information system of three physical sites, GMU, the Center for Ocean-Land-Atmosphere Studies (COLA) and the Goddard Distributed Active Archive Center, distributing tasks in the areas of user services, access to data, archiving, and other aspects enabled by a low-cost, scalable information technology implementation. The project can serve as a model for a larger WP-ESIP Federation to assist in the overall data information system associated with future large Earth Observing System data sets and their distribution. The consortium has developed innovative information technology techniques such as content based browsing, data mining and associated component working prototypes; analysis tools, particularly GrADS, developed by COLA and the preferred analysis tool of the working seasonal to interannual communities; and a Java front-end query engine working prototype.
1. Introduction
The U.S. Global Change Research Program (USGCRP) [1] concerns itself with a variety of science areas and associated changes related to the Earth as a system. Earth system science is an interdisciplinary science strongly coupled to data. Satellite observations are often the only way to obtain the information needed to understand individual system processes as well as processes linking different systems. Earth system science processes span many spatial scales (global, mesoscale, regional) and temporal scales (centennial, decadal, interannual, seasonal, daily, even hourly). Remote sensing instruments on board Earth observing satellites measure raw radiances at different spectral scales (from low resolution to hyperspectral resolution), which are then converted to calibrated data and in turn to geophysical variables [2].
Great progress has been made in the last two decades with the advent of the space program and associated Earth observations. Unprecedented observations of the atmosphere, land, oceans and the cryosphere have often provided sustained coverage. The next decade will, however, provide much more systematic observational coverage as NASA, one of the main USGCRP agencies, begins to launch its Earth Observing System (EOS), a series of Earth observing platforms that will produce the requisite observations and associated massive data products [3]. NASA is building an associated data information system termed EOSDIS, and the data products will be distributed among science data centers termed Distributed Active Archive Centers (DAACs) [3].
Of great importance to Earth system science is the extraction of useful information. The overall process requires understanding of the needs of science and the science users to be served. Building a large, centralized system to serve diverse user communities is expensive and difficult to implement. Moreover, innovation often comes from research efforts involving smaller teams, data systems or institutions. In its 1995 study of the USGCRP and NASA’s Mission to Planet Earth, the NRC’s Board
on Sustainable Development recommended, therefore, augmenting the current EOSDIS system with a federation of Earth science information providers [1]. NASA subsequently followed this recommendation by funding a working prototype federation program termed WP-ESIP [4].
In this paper, we examine the information technology implementation for one of the peer-reviewed and selected ESIPs, which focuses on the seasonal to interannual (S-I) science research areas, as well as associated data and services for S-I user communities. Termed Seasonal to Interannual ESIP or SIESIP, the distributed system is designed by first taking into account the science and applications needs of this particular community; data products and services are then determined; finally, in-depth development and use of tools and techniques to extract information from the data (data mining, data visualization and analysis) and the associated information technology architecture implementation are carried out. SIESIP allows integration of information across many domains and in a distributed fashion, which in turn requires sharing of data, information technology, science, and applications, forming a real-life federated system.
The S-I research areas include the El Niño/Southern Oscillation (ENSO) phenomenon; monsoons; large-scale precipitation and wind patterns; the Intertropical Convergence Zone; and the Tropical Biennial Oscillation (TBO), as well as associated influences in the tropics and extratropical regions. Over the years, S-I scientists have improved their ability to predict rainfall, sea surface temperature (SST) variations in the tropics and other associated geophysical climate variability. Even though forecasts (ranging up to a year in advance) by S-I scientists are still experimental, they are increasingly being used by planners in the tropics and even in the U.S. to mitigate harmful effects, such as floods, droughts and shifting weather patterns, as evidenced by the current strong 1997-1998 El Niño. For example, the link between ENSO and North American climate varies considerably over the continent, with the largest correlations (and hence largest potentially predictable relationships) in the southeast and along the west coast. S-I climate variations are known to have a profound impact on life all over the planet. Our ability to understand these phenomena well enough to make predictions of substantial economic and societal value remains an important goal for climate research worldwide. Also, forecasts of warm ENSO events, e.g., the current event, have improved, with lead times of 3-6 months [5].
2. Design of the System
SIESIP’s overall goal is to assist S-I climate scientists with both data and information solutions. In practical terms, the SIESIP consortium consists of three main distributed sites: George Mason University with expertise in information technology, data searches and analysis, and interdisciplinary Earth system science; the Center for Ocean-Land-Atmosphere Studies (COLA) with expertise in S-I science, user services and tools; and the NASA Goddard DAAC with expertise in data management, data archiving and user services. Each consortium member already delivers services and products or has developed working prototypes. The overall system is designed to enhance the current consortium capabilities to serve the specific Earth science S-I community and to provide an innovative information technology query engine and implementation of a working federation.
SIESIP will achieve its goals by creating new products based on existing and new satellite and station data and models. These products will serve the broad user base involved in S-I research, Tropical Rainfall Measuring Mission (TRMM) scientists, as well as application users such as agriculture. SIESIP will promote ease of use by deploying innovative products and information technology and allowing users to find and obtain data easily. SIESIP will assist students by collecting relevant data sets into a single point of access, integrating complementary data sets to enhance information, and producing needed products. A single analysis tool will be applied across diverse data sets, creating ease of use and compatible data interuse. The integration of products and information technology extends from data discovery, search, browse, selection, and access to elementary analysis, making these aspects easier for users and hence reducing cost.
Specifically, SIESIP is developing selection tools (e.g., content based search, data mining) to make data selection and access easy and to streamline data use, by achieving interoperability for the user activity that counts most: data interoperability. Unlike large, conventional information systems, which often adopt a system orientation, SIESIP adopts a product and services orientation and focuses on the information inherent in the data. The requirements of the seasonal to interannual climate variability and prediction problem are the fundamental drivers of the information technology implementation of our approach.
The Earth system science aspects of the S-I climate problem lead to several user query and access scenarios for how scientists work with observational and model output data sets. To perform interdisciplinary analysis, data sets and analysis tools must conform to the following specifications: global data sets, gridded to a uniform grid, with uniform temporal specifications, and the ability to perform a variety of functions (e.g., statistical correlation analysis) on-the-fly. At GMU, we have examples of several scenarios (see current search engine prototype, http://www.ceosr.gmu.edu/~vdadcp). For example, SIESIP’s data sets are to enable users to explore phenomena such as teleconnections between El Niños and vegetation cover in Africa, by plotting time series (e.g., Eastern Pacific SST anomalies and Sahelian precipitation) and correlations of relevant parameters (e.g., Eastern Pacific SST anomalies and seasonal area extent of the Sahara). Images of associated parameters for each plot (either multiple GIF displays or animations) will be supplied. It is expected that correlation coefficients, means, standard deviations, and other statistically derived parameters obtained from the content based browsing will form a set of new metadata in our suite of data products.
A fundamental component of SIESIP enabling S-I research is the close integration of data products and services with analysis software tools. The analysis tool at the heart of the SIESIP engine is the Grid Analysis and Display System (GrADS) [6].
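The statistically derived metadata described above (means, standard deviations, and similar summaries) can be pictured with a minimal sketch. It assumes a flat list of values for one parameter, region, and period; the function name `summarize`, the record fields, and the sample numbers are all hypothetical and are not taken from the SIESIP implementation.

```python
from statistics import fmean, pstdev

def summarize(parameter, region, period, values):
    """Build one summary-metadata record for a parameter, region, and period.

    All field names here are illustrative assumptions, not the SIESIP schema.
    """
    if not values:
        raise ValueError("no values to summarize")
    return {
        "parameter": parameter,
        "region": region,
        "period": period,
        "count": len(values),
        "mean": fmean(values),
        "std_dev": pstdev(values),  # population standard deviation
        "min": min(values),
        "max": max(values),
    }

# Invented SST values in degrees Celsius, for illustration only.
record = summarize("SST", "Eastern Pacific", "1982-12", [26.0, 27.0, 28.0, 29.0])
print(record["mean"], round(record["std_dev"], 3))
```

A record of this kind is small compared with the data it summarizes, which is what makes it usable as searchable metadata.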
3. Information Technology
This section focuses on the information technology rationale, architecture, and implementation associated with serving the scientific needs outlined above. SIESIP is building on the existing GMU search engine and current user services and data at COLA and the DAAC. Our information technology implementation is innovative and scalable, makes substantial use of new technology, and relies on WWW with a multitiered client-server system architecture that starts with easy access for the user.
3.1 Data Products
The requirement to integrate existing archives of global climate observations, new satellite measurements, and climate models presents a major information systems challenge. Many important climate data archives are either in situ data sets or gridded products derived from in situ observations. The climate model data sets are typically on regular or nearly regular grids. Satellite data are typically in the form of images registered to satellite coordinates (e.g., swath data). We may think of these as different data models. The integration of these diverse data sets will require both the production of new data products through transformation of data from one data model to another and the ability to analyze data from multiple data models simultaneously. Data sets include new gridded TRMM data products; event-driven subsets (such as hurricane events); TRMM high resolution rainfall data over land; TRMM coincidence subsets; new five-day mean interdisciplinary climate data; and diverse NOAA data sets prepared at a single site (COLA).
3.2 Three-Phase Access Model
SIESIP will provide a search and analysis engine that allows users to obtain data whether or not they know exactly what to retrieve. It lets them identify, among the available data, significant correlations and trends worthy of further analysis, and assess the data to be retrieved. Moreover, this support will allow additional communities such as process scientists and applications users to access the SIESIP data and can, therefore, provide scalable usage lessons to NASA and information technology communities. Based on this assumption, we developed a three-phase user data search model for SIESIP.
Phase 1: Using the metadata and browse images provided by the SIESIP system, the user browses the data holdings. Organizing knowledge is incorporated in the system (information-rich products).
Phase 2: The user gets a quick estimate of the type and quality of data found in phase 1. Analytical tools are applied, including statistical functions and visualization algorithms available via WWW through SIESIP. The SIESIP interface will also incorporate a spectrum of statistical data mining algorithms. We have also begun to implement tools for finding positive correlations, providing a realistic, human-aided data mining capability. We have applied this and other data mining systems [7] to ENSO teleconnections, with possible results in identifying anticorrelations with vegetation in tropical Africa and in the NE coastal U.S. [8].
Phase 3: The user has located the data sets of interest and is ready to order. If the data are available through SIESIP, it will handle the data order; otherwise, an order will be issued to the appropriate data provider on behalf of the user, or the necessary information will be forwarded to the user for this task.
Example:
(Phase 1:) User searches the metadata to identify data sets available (e.g., SST, Total Precipitable Water, etc.) and tool capabilities.
(Phase 2:) (i) User defines (or chooses from predefined regions) a spatial region such as 30°S to equator and 120°W to 80°W, and asks what data sets satisfy the conditions 25°C ≤ avg(SST) ≤ 29°C, where avg(SST) is the spatial average of SST and is calculated by the system on-the-fly. (ii) System returns data set dates satisfying the above conditions (e.g., 12/82, etc.). (iii) User refines the query by plotting correlations of SST with, say, precipitable water in the specified region at different dates. (iv) User looks for correlations in the plots produced by the system and picks data set dates showing significant (e.g., corr. coeff. > 0.7) correlations.
(Phase 3:) User orders complete data sets (L2 if available) for the above dates (in iv).
It is important to note that the three phases can be arbitrarily interleaved until the user is satisfied. For example, if after viewing the results of phase 2 the user decides to further narrow the search, the user can go back to phase 1 and make choices based on system metadata before coming back to phase 2.
[Diagram labels recovered from the figures: Knowledge Base; Thesaurus; Multidimensional Query Processor; Data Set Dictionary; Pre-computed Results; Analysis Tool Kit; Inter/Intra Net Connection; User Interface Engine (Queries & Results); Dispatcher; Users (web browsers); SIESIP Nodes; Internet connection; Middleware; SIESIP Archive Gateway; Data Conversion; CD-ROM/Tape, etc.; Inter/Intra Net; DAACs; SIESIP Central; SIESIP data-mart; NOAA; Others; Data Providers]
Figure 2: SIESIP System Details
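Steps (i) and (ii) of the phase 2 example in Section 3.2 can be sketched as follows. This is only an illustration under stated assumptions: the in-memory `SST_FIELDS` dictionary, the function names, and all data values are invented, and the regional mean is unweighted, whereas a real spatial average on a latitude-longitude grid would normally weight cells by area.

```python
from statistics import fmean

# Hypothetical data: date label -> list of (lat, lon, sst_celsius) grid cells.
# Latitudes are degrees north, longitudes degrees east (negative = west).
SST_FIELDS = {
    "1982-12": [(-10.0, -100.0, 28.5), (-20.0, -90.0, 27.0), (10.0, -100.0, 29.5)],
    "1984-12": [(-10.0, -100.0, 23.0), (-20.0, -90.0, 22.0), (10.0, -100.0, 27.0)],
}

def regional_mean(cells, lat_range, lon_range):
    """Unweighted mean of the values whose grid cell lies inside the box."""
    (lat_min, lat_max), (lon_min, lon_max) = lat_range, lon_range
    inside = [value for lat, lon, value in cells
              if lat_min <= lat <= lat_max and lon_min <= lon <= lon_max]
    return fmean(inside) if inside else None

def dates_matching(fields, lat_range, lon_range, low, high):
    """Return the dates whose regional mean lies in the closed range [low, high]."""
    hits = []
    for date, cells in sorted(fields.items()):
        avg = regional_mean(cells, lat_range, lon_range)
        if avg is not None and low <= avg <= high:
            hits.append(date)
    return hits

# Region 30°S to equator, 120°W to 80°W; condition 25°C <= avg(SST) <= 29°C.
matches = dates_matching(SST_FIELDS, (-30.0, 0.0), (-120.0, -80.0), 25.0, 29.0)
print(matches)
```

The point of the design is that only the short list of matching dates travels back to the user; the gridded fields themselves stay where they are stored.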
3.3 SIESIP Distributed Information System
We designed the SIESIP architecture as a distributed multitiered client-server system entered through commonly available Java-enabled Web browsers. It is composed of SIESIP Central and several SIESIP Nodes and Archives to allow the system to scale with increases in the number of users and the amount of data that become available.
The overall SIESIP architecture of four basic layers is shown in Figure 1. SIESIP will support an extensive hardware/software system. We have provisionally distributed the host responsibilities of the SIESIP system components among our consortium members as shown in Figure 1. Figure 2 presents SIESIP system details and their functionalities.
Figure 1: The Multitiered SIESIP Architecture
3.4 SIESIP Central
The User Interface Engine supports all three phases of our access model. Phase 1 is supported via key-term items that are interconnected and reflected as links in the dynamically generated Web pages. Key-term items and their relationships are stored in a relational database. Our description metadata are presently organized in a relational database with entities being Phenomenon, Parameters, and Platforms at the top level; Phenomenon Instance, Specific Parameters, and Instrument at the second level; Predefined Region (important for content based searches for S-I phenomena such as ENSOs), Data Product, and Contact (person) at the third level; and, finally, Statistical Summary, Data Format, and Data File at the lowest level [2]. These entities are important for constructing the metadata (Fig. 2).
Besides the description metadata, we have also developed an innovative data pyramid with associated summary (statistical) metadata (see the article in this volume by Z. Li et al. [9]). We will use database entries to dynamically generate the user interface and link user selections to the data. Example: a user is presented with a list of the phenomena we support and relevant specific parameters linked to each phenomenon. The user may then specify a phase 2 query which supplies content based criteria on a parameter. The phase 2 queries are preprocessed by the Multidimensional Query Processor, which accesses the specific parameter, statistical summary, data product, and predefined region tables. An innovative pyramid data model [9] is being implemented. The result of the preprocessing would be a time range and region of a particular data product. This result would then be converted, through the data file table, into a list of names of files to be opened for analysis.
Phase 2 queries are processed by the Multidimensional Query Processor, which directs the appropriate SIESIP nodes to process the queries. A phase 2 query may be multidimensional in nature; i.e., it specifies the ranges of several spatial and temporal dimensions and requires application of analytical tools on the selected data associated with the chosen ranges. This allows for a quick look at the data and return of results to the user. The key to the scalability of our system lies in the fact that data sets (often very large in size) need not be transferred among different nodes for a phase 2 interactive query.
Phase 3 queries are handled by the User Interface Engine via the following methods. (1) If the data are available from one of the SIESIP nodes and the data size is determined to be manageable by FTP transfer, an FTP Staging Area is allocated and the user is given an FTP link to the data set requested. (2) If the size of the data requested is large (> tens of GB) or the user wants the data sent on physical media, the request will be handled off line by the SIESIP archive. (3) If SIESIP doesn’t have the data, the user will be guided to the data providers.
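The three phase 3 delivery methods listed above amount to a small decision rule, sketched here for illustration. The names `Delivery` and `route_order` are hypothetical, and the 10 GB cutoff is an assumed placeholder: the text gives the threshold only as "tens of GB".

```python
from enum import Enum

class Delivery(Enum):
    FTP_STAGING = "FTP staging area with a link for the user"
    OFFLINE_ARCHIVE = "off-line handling by the SIESIP archive"
    EXTERNAL_PROVIDER = "user guided to the data provider"

# Assumed cutoff for what is "manageable by FTP transfer"; not from the source.
FTP_LIMIT_GB = 10.0

def route_order(held_by_siesip, size_gb, wants_physical_media=False):
    """Pick a delivery method for a phase 3 data order."""
    if not held_by_siesip:
        return Delivery.EXTERNAL_PROVIDER   # method (3)
    if wants_physical_media or size_gb > FTP_LIMIT_GB:
        return Delivery.OFFLINE_ARCHIVE     # method (2)
    return Delivery.FTP_STAGING             # method (1)

print(route_order(held_by_siesip=True, size_gb=2.5).name)
```

Checking availability first mirrors the fact that size and media preferences only matter for data SIESIP actually holds.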
3.5 SIESIP Nodes and Archives
SIESIP will contain distributed Nodes and Archives. Each SIESIP node contains an Analysis Tool Kit and a Data Mart, and the SIESIP archive serves as the permanent storage facility. The data held at a SIESIP node will change to respond to the users’ needs, and may contain temporary data products that are of interest to individual users (such as a particular subset or the result of a particular analysis). A SIESIP node can be specialized based on available analysis tools and the data types held; hence the term “data mart.” Indeed, a SIESIP node provides a facility with a fast turnaround time achieved by tailoring to the current needs of its users. Data orders are satisfied by either the SIESIP nodes (temporary data) or the SIESIP archive (permanent data).
SIESIP nodes can be added as the system grows. If the data volume remains the same but the number of users increases, we may add nodes that replicate existing ones. If data volume increases, we may add nodes to house the new data sets. This flexibility of nodes and archives is key to the scalability of our system, which can serve as a prototype for the WP-Federation. For the prototype effort, we plan two physically distributed SIESIP nodes and one SIESIP archive: GMU and COLA will each have a SIESIP node, and GDAAC will have a SIESIP archive. Initially, the GMU node will house mainly NASA data and the COLA node mainly NOAA data.
3.6 Tool Kits and Middleware
The Tool Kit provides general functions necessary for Earth science data analysis on-the-fly and should include (i) various analysis, computation, visualization, and statistical tools, (ii) some computational programs for system services, and (iii) simple and frequently used physical and mathematical models. Our primary tool is GrADS, which we will use as the heart of the SIESIP analysis engine to achieve seamless interoperability and access to diverse data sets. GrADS is already being used by climate studies scientists worldwide and allows analysis of data in both gridded and station form. GrADS analysis in phase 2 queries will allow a user to perform statistical correlations between different parameter data sets and intercomparison of different data sets for one parameter (e.g., SST), thus being useful to both S-I scientists and interdisciplinary Earth system scientists. We will support all formats that GrADS supports (e.g., HDF, netCDF, binary, etc.).
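To make the statistical correlation step concrete, the sketch below computes a Pearson correlation coefficient between two time series and compares it with the 0.7 significance threshold used in the Section 3.2 example. It is plain Python written for illustration, not GrADS script and not the SIESIP tool kit; the function name `pearson` and the sample anomaly values are invented.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length, at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return sxy / sqrt(sxx * syy)

# Invented monthly anomalies for one region, for illustration only.
sst_anomaly = [0.5, 1.0, 1.5, 2.0]
precipitable_water_anomaly = [1.0, 2.0, 3.0, 4.0]

r = pearson(sst_anomaly, precipitable_water_anomaly)
is_significant = r > 0.7  # threshold taken from the Section 3.2 example
print(round(r, 3), is_significant)
```

Because the coefficient is a single number per region and period, it can be returned to the browser without moving the underlying data sets.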
3.7 SIESIP Data Marts
A local storage and retrieval engine is used to manage a SIESIP Data Mart. The local engine employs cache-like replacement and fetching algorithms so that most accesses are predicted and satisfied by high-performance storage devices, without the need for off-site accesses. Presently, we are implementing similar (conventional) data mart engines at GMU and COLA. Future implementations may include an innovative parallel storage and retrieval engine capable of achieving the desired high performance at very low cost (at GMU).
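One way to picture a cache-like replacement policy for a data mart is the sketch below. The source does not say which policy is used, so least-recently-used (LRU) eviction is an assumption chosen for illustration, and predictive fetching is not modeled. The class `DataMartCache` and the `fetch_from_archive` callable are hypothetical names.

```python
from collections import OrderedDict

class DataMartCache:
    """Size-bounded store with least-recently-used eviction (assumed policy)."""

    def __init__(self, capacity_gb, fetch_from_archive):
        self.capacity_gb = capacity_gb
        self.fetch = fetch_from_archive   # callable: name -> (payload, size_gb)
        self.items = OrderedDict()        # name -> (payload, size_gb), oldest first
        self.used_gb = 0.0
        self.hits = 0
        self.misses = 0

    def get(self, name):
        if name in self.items:
            self.items.move_to_end(name)  # mark as most recently used
            self.hits += 1
            return self.items[name][0]
        self.misses += 1
        payload, size_gb = self.fetch(name)  # off-site access to the archive
        if size_gb > self.capacity_gb:
            return payload                # too large to keep; serve without caching
        while self.used_gb + size_gb > self.capacity_gb:
            _, (_, evicted_gb) = self.items.popitem(last=False)  # drop the oldest
            self.used_gb -= evicted_gb
        self.items[name] = (payload, size_gb)
        self.used_gb += size_gb
        return payload

# Stand-in archive: every data set is 1 GB and its payload is just a label.
cache = DataMartCache(2.0, lambda name: (f"data:{name}", 1.0))
for request in ["a", "b", "a", "c"]:
    cache.get(request)
print(list(cache.items), cache.hits, cache.misses)
```

In this run the second request for "a" is served locally, and the later request for "c" evicts "b", the least recently used entry.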
3.8 SIESIP Archive
The SIESIP Archive is designed to store the data collection for the SIESIP consortium and serve all the SIESIP nodes. SIESIP data volume is estimated at 1.5 TB and will be housed at GDAAC (jukebox configuration). Because experience has shown that nearly 90% of accesses are confined to 10% of the entire address space, we plan to maintain about 150 GB of the archive (10% of 1.5 TB) fully on line. These data will be kept in the data mart for responsive and interactive data manipulation. Data are ingested into SIESIP from multiple sources (e.g., NASA, NOAA, UDel, etc.), on varied media (e.g., FTP, CD-ROM, magnetic tape, etc.), and in multiple formats (e.g., HDF, GRIB, etc.). The initial ingest steps include securing the data on our staging disks, reformatting them, and extracting the metadata through automated processing scripts run on incoming data.
3.9 Interconnection
Commonly available Web browsers are the entrance way to the SIESIP system, mainly through the HTTP protocol. Thus the connection between users and the SIESIP distributed system is via the Internet. The connection between SIESIP central and the SIESIP nodes can be either Intranet or Internet, with reliability the major connection concern. Connection from SIESIP central (located at GMU) to both nodes can be accomplished by a LAN. This will allow us to determine for future implementations of the WP-Federation whether a highly reliable link is necessary to connect an ESIP central to its nodes. The above connection is for transmitting analysis results. The data sets themselves may be transmitted via FTP over the network, but only for ordering purposes.
Concerning the communication protocols among SIESIP sites, we will explore the possibility of using CORBA as the standard. Such an adoption may further reduce cost by using available COTS software. Equally important, this may allow an architecture more open to systems outside SIESIP and create more opportunity for interoperation.
3.10 Data Ingest and Distribution
One benefit of the SIESIP plan is a scalable, easily implemented system up and running within the first 6 months of operation. Data will be ingested into the SIESIP system from external sources. Detailed agreements will be struck with data providers concerning what manipulations of data will be acceptable (e.g., reformatting, resampling, etc.) and the degree of contact the providers wish to maintain with users. Satellite based observations will primarily be provided by NASA via GDAAC. Field experiment and in situ data, with which S-I scientists are most familiar, will be provided by NOAA centers via COLA.
4. Conclusion
In this paper, we described the architecture of the SIESIP system that supports user access to Earth science data. The architecture enables users to find the desired data through an integrated system that can search by using metadata as well as specialized analysis tools.
References
[1] Board on Sustainable Development, 1995. A Review of the U.S. Global Change Research Program and NASA’s Mission to Planet Earth/Earth Observing System. National Academy Press, 96 pp. (Washington, DC).
[2] Kafatos, M., Li, Z., Yang, R. et al. 1997. “The Virtual Domain Application Data Center: Serving Interdisciplinary Earth Scientists,” Proceedings of the Ninth International Scientific and Statistical Database, 264-276. IEEE.
[3] Asrar, G. & Greenstone, R. (eds.), 1995. 1995 MTPE EOS Reference Handbook. NASA (Washington, DC).
[4] NASA Press release, Dec. 2, 1997. “NASA Selects Earth Science Information Partners,” http://www.nasa.gov/releases/1997/.
[5] Kirtman, B.P., Huang, B., Shukla, J., Zhu, Z. 1997. “Tropical Pacific SST Prediction with a Coupled GCM,” Experimental Long Lead Forecast Bulletin, Vol. 6, No. 1, 14-15.
[6] Doty, B. E., Kinter III, J. L., Fiorino, M., Hooper, D., Budich, R., Winger, K., Schulzweide, U., Calori, L., Holt, T., and Meier, K. 1997. “The Grid Analysis and Display System (GrADS): An update for 1997,” 13th Conf. on Interactive Information and Processing Systems for Meteorology, Oceanography, and Hydrology, 356-358 (American Meteorological Society, Boston).
[7] Li, Z., Kafatos, M., Michalski, R. 1997. “Data Mining Application for El Niño Teleconnection Research,” GMU for Machine Learning Laboratory Report.
[8] Li, Z., and Kafatos, M. 1997. “Interannual Variability of Vegetation in the US and its Relation to ENSO,” (in preparation).
[9] Li, Z., Wang, X.S., Kafatos, M., and Yang, R. 1998. “A Pyramid Data Model for Supporting Content-based Browsing and Knowledge Discovery,” in these proceedings.
We acknowledge partial prototype funding support from the NASA ESDIS Project, (NAG 5-3086), from the Goddard Global Change Data Center (NCC 5-143), and particularly from the Earth Science Enterprise WP-ESIP CAN program as well as from George Mason University. Other members of the SIESIP team include P. Chan & L. Chiu (GDAAC); B. Doty & J. Kinter (COLA); T. ElGhazawi (FIT); J. McManus (GMU); C. Willmott (UDel); H. Wolf (IRMA), etc.