Information Management for Grid-Based Remote Sensing Problem Solving Environment

Giovanni Aloisio, Massimo Cafaro, Italo Epicoco, Gianvito Quarta
Center for Advanced Computational Technologies/ISUFI, University of Lecce
via per Monteroni, 73100 Lecce - Italy
{giovanni.aloisio, massimo.cafaro, italo.epicoco, gianvito.quarta}@unile.it

Abstract

The aim of this work is to design and implement a Configuration Repository for a grid-based Problem Solving Environment (PSE), specialized to describe applications and data belonging to the remote sensing field. In a distributed environment the information system plays a central role in enhancing scheduling algorithms and, more generally, in addressing the requirements of grid-aware applications. Taking into account that the grid is inherently dynamic, i.e. machine load and availability, network latency and bandwidth change continually, we present the design of a Configuration Repository for retrieving, storing and handling information. In particular, the proposed solution has been designed with the specific characteristics of remote sensing systems in mind.

Keywords: remote sensing, problem solving environment, grid computing, resource discovery, grid information system.

1. Introduction

A huge quantity of Earth Observation and geospatial data is produced daily by the numerous satellites launched by space agencies worldwide. The data are of different types, such as optical, infrared and radar images. Generally, these images are semi-finished products, and the end user processes them further in order to extract information useful in different scientific areas, such as geology, climatology, oceanography, natural disaster monitoring and prevention, and so on. After an image is acquired by a sensor installed on a remote sensing satellite, the data are transmitted to geographically distributed ground-segments, processed by means of their computing facilities in order to obtain a useful product, and then archived. Generally, the time needed to process the data is longer than the acquisition time; thus, only a few segments of these data are processed immediately. For specific applications, e.g. emergency search and rescue or natural disaster monitoring, the use of high performance resources such as supercomputers greatly reduces the time required for data processing. Even though the availability of these resources is limited, sharing them among different organizations gives a clear advantage.

A Problem Solving Environment (PSE) can be a suitable solution to handle, coordinate and share heterogeneous and distributed resources. A PSE is designed to provide transparent access to heterogeneous, distributed computing resources for collaborative computational science and engineering [1]. It is a complete and integrated computing environment for composing, compiling and running applications in a specific area. In this environment, an application can be composed from several building blocks through the definition of task or data flows.

From the architectural point of view, grid computing is today one of the better ways to achieve large-scale effective sharing of computational resources [2, 3]. The resources are shared among formal or informal consortia of individuals and/or institutions called Virtual Organizations (VO). In particular, Grid Portals [4] are designed to simplify access to a set of distributed resources (e.g. high performance computers, sophisticated instrumentation, databases) through a user-friendly interface. Consequently, grid technology is well suited to the design and implementation of a PSE [5].

In these environments, the description of resources is an important aspect. The resources involved can be computing elements, software objects and data. Moreover, many software tools for remote sensing data processing are legacy applications. In order to better coordinate the use of all of the available applications, we need to define a methodology to characterize the applications themselves in terms of input parameters and of compatible and produced data formats. Another issue to consider is that remote sensing data are produced by different sensors and can be stored in different formats. Consequently, the environment must also take into account the heterogeneity of the data.

The aim of this work is to design and implement a configuration repository for a grid-based PSE, specialized to describe applications and data belonging to the remote sensing field. For this purpose, we have designed MetaSoftware and MetaData servers that take into account all of the related aspects. The proposed configuration repository has been inserted in a prototype of a grid-based PSE and has been tested.

The remainder of this paper is organized as follows. Section 2 introduces some relevant approaches to implementing a Grid Information Service. Sections 3 and 4 present the issues related to the representation and description of the data and applications used in the remote sensing field. Section 5 describes the proposed configuration repository, highlighting the choices we made to describe data, software and computational resources. Section 6 reports some relevant implementation details. Section 7 shows a use case of the proposed configuration repository, and Section 8 discusses conclusions and future directions.

2. Describing resources in distributed environment

In distributed environments, the description and discovery of resources and services (such as a software component or a computational resource) is a fundamental issue. The resources involved can be heterogeneous, dynamic in behaviour, geographically distributed and so on. Information about these resources is needed, for instance, to allow automatic job submission, to schedule job execution according to the available scheduling algorithms, and to allow resource brokering. In this context information plays a central role.

A well-known approach to managing information can be found in the information system developed in the Globus project. The Globus Monitoring and Discovery Service (MDS-2) [6, 7] provides a large distributed collection of generic information providers that extract information from local resources. The gathered information is structured in terms of a standard data model based on LDAP. Moreover, MDS-2 provides a set of higher-level services that collect, manage and index information provided by one or more information providers. The MDS schema can be easily extended, and additional information providers can be developed in order to manage and publish an extended set of information.

In the DataGrid project [8] a relational approach to the Grid Information Service (GIS), named Relational Grid Monitoring Architecture (R-GMA) [9], has been developed. This solution uses a relational approach to structure the information about grid resources, and it is composed of three main components: Consumer, Producer and a directory service called Registry (as specified in the Grid Monitoring Architecture [10]). In R-GMA, Producers register themselves with the Registry and describe the type and structure of the information they want to make available to the Grid. A Consumer can query the Registry to find out what type of information is available and to locate the Producers that provide it. Once this information is known, the Consumer contacts the Producer directly to obtain the data.

In a distributed environment, information related to deployed software can play an important role in allowing a scheduling algorithm to perform efficiently. Moreover, the software objects deployed on a computing element can require specific libraries, environment variables and so on. With regard to software resources, the literature offers other relevant approaches meant to characterize applications.
A formal specification of the software objects in a grid setting has been derived from the BIDM standard (Basic Interoperability Data Model, IEEE standard 1420.1) by expanding the classes of objects defined in the standard itself [11]. Another approach is the Open Software Description Format (OSD) [12]. OSD is an application of the eXtensible Markup Language (XML) that provides a vocabulary for describing software packages and their dependencies.

In this work, we have designed an ad-hoc configuration repository specialized to describe applications and tools for remote sensing data processing and management. In our approach, we have considered three kinds of resources separately: hardware, software and data. For each component, we have derived and implemented an information model that describes the component itself.
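The Producer, Registry and Consumer interaction described above for R-GMA can be illustrated with a small in-memory sketch. It is not the R-GMA API: the class `RegistrySketch`, its methods, the information type `cpuLoad` and the example URLs are hypothetical names chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative in-memory registry; all names are hypothetical, not the R-GMA API.
final class RegistrySketch {
    // Maps an information type (e.g. a table name) to the producers offering it.
    private final Map<String, List<String>> producersByType = new HashMap<>();

    // A Producer registers the type of information it makes available.
    void register(String infoType, String producerUrl) {
        producersByType.computeIfAbsent(infoType, k -> new ArrayList<>()).add(producerUrl);
    }

    // A Consumer asks which producers publish a given type of information.
    List<String> lookup(String infoType) {
        return List.copyOf(producersByType.getOrDefault(infoType, List.of()));
    }

    public static void main(String[] args) {
        RegistrySketch registry = new RegistrySketch();
        registry.register("cpuLoad", "https://h1.example.org/producer");
        registry.register("cpuLoad", "https://h2.example.org/producer");
        // The consumer would then contact each returned producer directly.
        System.out.println(registry.lookup("cpuLoad"));
    }
}
```

The registry only stores where information can be found; the data themselves stay with the producers, which is the separation the Grid Monitoring Architecture prescribes.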

3. Remote sensing data distribution

Remote sensing data, acquired by ground-segments or processing facilities, come from different sensors and different space missions. Often the data format differs for each mission and each sensor, and since a ground-segment can also store the acquired data in internal formats, data management and format conversion are important issues that must be taken into account. An environment like a PSE must support several data formats and must allow conversion between them. In order to handle acquired raw data or post-processed data easily, it is fundamental to associate a set of metadata with each remote sensing product.

The Committee on Earth Observation Satellites (CEOS) is an international organization that coordinates international civil spaceborne missions for observing and studying the Earth. CEOS comprises 41 space agencies and other national and international organizations, and is recognized as the major international forum for the coordination of Earth observation satellite programs and for the interaction of these programs with users of satellite data worldwide. One of the activities of CEOS is to coordinate Earth observation data exchange, through the publication of a set of principles and the definition of a standard data format. The CEOS format is, indeed, the standard format adopted by several space agencies to distribute remote sensing data.

Another relevant standard to consider is ISO TC/211, by the International Organization for Standardization (ISO). The aim of ISO TC/211 is standardization in the field of digital geographical information, in order to establish a structured set of standards for information concerning objects or phenomena that are directly or indirectly associated with a location relative to the Earth. These standards specify methods, tools and services for data management, acquiring, processing, analyzing, accessing, presenting and transferring such data in digital/electronic form among different users, systems and locations. The work shall link appropriate standards for information technology and data where possible, and shall provide a framework for the development of sector-specific applications using geographical data.

The information model we propose aims to describe and model remote sensing products. It is based on the CEOS data format and on the ISO TC/211-19115 specification. ISO 19115:2003 defines the schema required for describing geographical information and services. It provides information about the identification, the extent, the quality, the spatial and temporal schema, the spatial reference and the distribution of digital geographical data.

4. Remote sensing software analysis

In order to extract the appropriate set of information used to characterize remote sensing applications, a set of software packages has been analyzed. The purpose of this analysis is to obtain a set of common properties and parameters that allow the design of a complete MetaSoftware schema, one that takes into account all the information needed for running applications in a distributed environment. These properties can include the required libraries, the list of supported operating systems, the execution parameters, the supported input data formats, the produced output data format and so on.

APPLICATION NAME AND DESCRIPTION

PRE-PROCESSING
  AESAR: a SAR processor used by CGS (Centro di Geodesia Spaziale) of Matera (Italy).
  EARTHVIEW SAR APP: a commercial SAR processor of Atlantis Scientific Inc. (Nepean, Ontario, Canada).

POST-PROCESSING
  EARTHVIEW INSAR: a commercial tool for interferometric SAR processing of Atlantis Scientific Inc.
  Delft object-oriented radar interferometric software (DORIS): an open source software package developed by the Delft Institute for Earth-Oriented Space Research (DEOS), Delft University of Technology. It performs interferometric SAR processing.
  Unwrapping tools: a set of software packages, developed in C, that implement various unwrapping algorithms.

UTILITY
  Basic Envisat SAR Toolbox version 3.0 (BEST): a collection of executable software tools that facilitates the use of ESA SAR data. It is freely distributed by the European Space Agency and supports header analysis, media analysis, quick look generation and so on.
  Cropping and format conversion tools: a set of routines that perform image cropping and format conversion.

Table 1. List of applications analyzed.

We have classified the applications available for remote sensing data processing into three categories:
• Pre-Processing applications: this class contains the applications used to process raw data coming directly from the sensors installed on remote sensing satellites and acquired by a ground-segment. The products obtained in this first phase are semi-finished products, and the end user has to apply further processing in order to extract relevant information;
• Post-Processing applications: this class contains the applications used to extract relevant information from the semi-finished products produced by pre-processing applications. These applications usually perform advanced processing such as filtering, classification, data analysis and so forth;
• Utility applications: this class contains the applications that perform simple data manipulation, e.g. data format conversion, data header analysis and extraction, image cropping and so on.

For each of these remote sensing software classes, we have considered a set of packages, as shown in Table 1. The analysis of these applications has allowed us to extract all the relevant information and to design the MetaSoftware schema, as detailed in the following.

5. The Configuration Repository

In this section, we show the details of our configuration repository, with regard to its three main components: the MetaSoftware schema, the MetaData schema and the extended grid information server.

5.1 The MetaSoftware schema

We have collected a set of relevant information able to completely describe a software object. The collected information mainly belongs to two classes: information characterizing the applications from a functional point of view, and information about their performance. The former is useful for resource discovery; the latter can be used by the scheduler to define a submission schedule that minimizes, for example, the completion time. This information has been structured into a relational schema, and it includes:
• application definition: information about the general properties of the application, such as name, required processor speed, required amount of memory, required disk space and interface type. The interface type indicates the mechanism to be used for remotely starting and monitoring jobs;
• application class: information about the application typology which, as shown in the previous section, can be for instance pre-processing, post-processing, utility and so on;
• data formats: information about the supported input data formats and about the data formats that the application is able to produce;
• execution command: information to be used to execute the application remotely, such as the full pathname of the executable, the list of accepted arguments and their default values, and the list of environment variables and their default values;
• required libraries: the list of required libraries, with version number, a short description and so on;
• operating system: the list of the operating systems, and related versions, on which the application can be executed;
• performance: the set of information that describes the running time as a function of the type and size of the input data, obtained from a series of experimental executions.

5.2 The MetaData schema

In order to realize a MetaData schema that covers the most important information about remote sensing data, we have considered the ISO TC/211-19115 standard. From this standard, we have derived a set of raw metadata, mainly composed of thirteen sets of information. The most important metadata considered concern: product identification and distribution, data quality, platform and mission, spectral properties, maintenance, generic information, spatial representation, reference system and other information related to the TIFF data format. This set of raw metadata is mapped into a uniform metadata set, derived from the ISO standard itself, in order to have a homogeneous set related to the following missions: ENVI, ERS1, ERS2, Radarsat1, SLR1, SLR2, SRTM. We have obtained a set of about 200 metadata that describe all of the mentioned kinds of Earth Observation products with sufficient thoroughness. This metadata set is structured in a relational schema. Moreover, the CEOS data format specification is considered in order to achieve a good description of the input and output data formats for remote sensing applications. We have considered, for each format, the files associated with the product.
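The mapping from raw, mission-specific metadata to the uniform metadata set described in Section 5.2 can be pictured as a per-mission renaming table. The sketch below is only illustrative: the raw key names (`SCENE_ID`, `ProductName` and so on) and the uniform field names are hypothetical, and are not taken from the ISO or CEOS specifications or from the actual schema.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch; every key and field name below is an assumption.
final class UniformMetadataSketch {
    // Per-mission translation tables: raw key -> uniform field name.
    private static final Map<String, Map<String, String>> RAW_TO_UNIFORM = Map.of(
        "ERS1", Map.of("SCENE_ID", "productIdentifier", "ACQ_DATE", "acquisitionDate"),
        "SRTM", Map.of("ProductName", "productIdentifier", "Date", "acquisitionDate"));

    // Renames the raw keys of one product; keys without a mapping are dropped.
    static Map<String, String> toUniform(String mission, Map<String, String> raw) {
        Map<String, String> table = RAW_TO_UNIFORM.get(mission);
        if (table == null) {
            throw new IllegalArgumentException("No mapping for mission " + mission);
        }
        Map<String, String> uniform = new HashMap<>();
        raw.forEach((key, value) -> {
            String field = table.get(key);
            if (field != null) {
                uniform.put(field, value);
            }
        });
        return uniform;
    }

    public static void main(String[] args) {
        Map<String, String> raw = Map.of("SCENE_ID", "E1-0001", "ACQ_DATE", "1996-05-02");
        System.out.println(toUniform("ERS1", raw));
    }
}
```

Once every product is expressed with the same field names, a single relational table, or a single query, can serve products of all missions.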

5.3 Computational resources information schema

The information schema related to computational resources needs to be extended in order to enhance scheduling algorithms and, more generally, to address the requirements of grid-aware applications. Taking into account that the grid is inherently dynamic, i.e. machine load and availability, network latency and bandwidth change continually, we have adopted a hybrid solution that keeps the significant aspects of the existing approaches. We have analyzed the Globus and GridLab information schemas, also considering other schemas such as the GLUE and NorduGrid schemas. As a result of this analysis, we consider the information related to computational resources to be composed of the following sets of information:
• hardware: information related to the hardware description, such as processor speed, amount of memory and amount of storage, is essential for resource management and resource brokering;
• firewall: this kind of information is strictly related to service information. Before registering a new service, it is fundamental to know dynamically the range of open ports available on a given computational resource;
• Virtual Organization: information about the VO to which a resource belongs is used to dynamically discover all of the computational resources available for a specific purpose;
• users: each computing resource can be accessed only by a set of authorized users. Discovering which resources can be accessed by a PSE user is the first step before starting the processing.
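As an illustration of how the hardware, Virtual Organization and user information listed above could drive resource discovery, the following sketch filters candidate hosts. The record `ComputeResource`, its fields and the sample values are assumptions made for this example; they are not the repository's actual schema.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Illustrative sketch; the record and its fields are hypothetical.
final class ResourceDiscoverySketch {
    // Reduced view of the information kept for a computational resource.
    record ComputeResource(String host, int cpuMhz, int memoryMb,
                           String virtualOrganization, Set<String> authorizedUsers) {}

    // Keeps the hosts of the given VO that the user may access
    // and that meet the minimum hardware requirements.
    static List<String> discover(List<ComputeResource> all, String vo, String user,
                                 int minCpuMhz, int minMemoryMb) {
        return all.stream()
            .filter(r -> r.virtualOrganization().equals(vo))
            .filter(r -> r.authorizedUsers().contains(user))
            .filter(r -> r.cpuMhz() >= minCpuMhz && r.memoryMb() >= minMemoryMb)
            .map(ComputeResource::host)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<ComputeResource> grid = List.of(
            new ComputeResource("H1", 2400, 4096, "remote-sensing", Set.of("alice")),
            new ComputeResource("H2", 1000, 1024, "remote-sensing", Set.of("alice", "bob")),
            new ComputeResource("H3", 3000, 8192, "other-vo", Set.of("alice")));
        System.out.println(discover(grid, "remote-sensing", "alice", 2000, 2048)); // prints [H1]
    }
}
```

The minimum processor speed and memory passed to `discover` would come from the application definition stored in the MetaSoftware schema.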

6. Implementation details

In this section we briefly report the implementation details and the technologies used to implement the information schema and the related access functionalities. The MetaSoftware and MetaData schemas have been structured into relational databases and implemented using the Postgres database management system. The access functionalities needed for managing these catalogues have been implemented in Java using JDBC. We have also developed the MetaSoftware and MetaData modules using XML schema definitions, for two main reasons: (a) the data belonging to the catalogues can be easily represented in dynamic HTML pages; (b) XML is the best way to implement data exchange among the heterogeneous components belonging to a grid. In the MetaSoftware catalogue, the user management functionality has been realized through Java Servlets that use the JAXP package for parsing XML documents and XSL for processing them. Moreover, for these two catalogues, a set of accessory functionalities has been realized using Java modules and Java Servlets. These functionalities allow manual and automatic ingestion of software objects and data into the system. We have adopted the information service developed in the GridLab project [13, 14] to handle all the other general information that is not related to remote sensing aspects.

The approach we adopted, based on Java, simplifies the implementation of the configuration repository as a Grid Service (according to the OGSA specification). The methods of the configuration repository service are easily identified in the Java Servlets already implemented. The grid service implementation itself does not require any specific procedure other than those required by web services. In the case of a Java grid service, all that must be done is to provide a grid-enabled implementation of all of the methods. Using the GT3 container for Java grid services, there are two options for implementing such functionality: delegation and inheritance [7].

In order to delegate to a Java class the ability to run in the GT3 container, the class must implement the org.globus.ogsa.OperationProvider interface. By doing so, all of the methods required by the GT3 container to properly manage the class are embedded into it, making it grid enabled. One important detail about delegation is that it allows the implementation of the service's methods to be spread across multiple classes, which can be very useful when dealing with legacy code.

The inheritance approach requires only the implementation of the service's interface for the service class to become grid enabled. The remaining methods, required by the grid container, are inherited from a standard class (namely the org.globus.ogsa.impl.ogsi.GridServiceImpl class) that itself implements the OperationProvider interface. The main drawback of this approach is that, since the Java language does not support multiple inheritance, the service class cannot inherit from any other class, which might be too restrictive. However, Java does allow a class to implement multiple interfaces.
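The difference between the two options can be shown without the GT3 libraries. In the sketch below, `ContainerManaged`, `ContainerBase` and `Catalogue` are stand-ins invented for this example: they play the roles of the OperationProvider interface, the GridServiceImpl base class and the service interface respectively, but they do not reproduce the real GT3 signatures.

```java
// Stand-in types only; this is NOT the GT3 API.
final class GridEnablingSketch {
    // Role of OperationProvider: what the container needs from a class.
    interface ContainerManaged { String describe(); }

    // Role of GridServiceImpl: a base class that already satisfies the container.
    abstract static class ContainerBase implements ContainerManaged {
        @Override public String describe() { return getClass().getSimpleName(); }
    }

    // The service's own interface.
    interface Catalogue { String lookup(String application); }

    // Pre-existing (legacy) code that should be exposed without being rewritten.
    static class LegacyCatalogue {
        String find(String application) { return "descriptor of " + application; }
    }

    // Inheritance: container methods come from the base class,
    // so the service cannot extend any other class.
    static class InheritedService extends ContainerBase implements Catalogue {
        @Override public String lookup(String application) { return "descriptor of " + application; }
    }

    // Delegation: the class implements the container interface itself
    // and therefore stays free to extend the legacy class.
    static class DelegatedService extends LegacyCatalogue implements ContainerManaged, Catalogue {
        @Override public String describe() { return "DelegatedService"; }
        @Override public String lookup(String application) { return find(application); }
    }

    public static void main(String[] args) {
        System.out.println(new InheritedService().lookup("A1"));
        System.out.println(new DelegatedService().lookup("A1"));
    }
}
```

Both services answer the same request; only the delegated one reuses `LegacyCatalogue`, which is the case the text has in mind for legacy code.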

7. Use case

To show how our Configuration Repository can be used, in this section we describe a use case in which the component is involved. The Configuration Repository is currently employed in a prototype PSE for remote sensing data processing [15]. This PSE is an environment that allows users to compose and run applications for processing their data. The configuration repository stores all information about the resources and can be used by several architectural components to perform service discovery, according to the user's request.

A fundamental component of a PSE is the Graphical User Interface (GUI), which allows the composition of complex applications built from single application components (workflow) [16]. The GUI initially presents to the user the services available on the system, obtained by querying the Configuration Repository. Once the user has selected a set of applications, combined them into a workflow and specified the input data, the system queries the configuration repository in order to obtain the metadata attributes related to each application involved. Through the retrieved metadata attributes, the system can verify the compatibility of the data produced at each processing step and can use them to map the high-level request, specified through the workflow, into a set of actions that perform the needed processing (see Figure 1).

The GUI for composing the workflow includes the following modules: a Java Applet that allows users to define their workflow by drag and drop operations; a Java module that maps the defined workflow onto the available grid resources; and a module in charge of verifying and reducing the represented workflow. The Java Applet stores the user-defined workflow in an XML file, compliant with an abstract workflow schema. This XML file is verified and reduced to the concrete workflow. Finally, the concrete representation of the workflow is processed by the scheduler component.
Let us now suppose that the following resources have been registered into the PSE:
• the grid is composed of four computing resources (H1, H2, H3 and H4) and two storage resources (S1 and S2);
• a SAR processor that performs image focusing (A1). Let us suppose that it supports the data formats F1 and F2 as input formats, produces data in format F3 and is available on hosts H1 and H2;
• a co-registration tool that performs the co-registration of two focused SAR images (A2). Let us suppose that it supports the data formats F3 and F4 as input formats, produces the data format F5 and is available on hosts H1 and H3;
• an interferogram generator that generates an interferogram starting from two co-registered SAR images (A3). Let us suppose that it supports the data format F5 as input format, produces the data format F6 and is available on hosts H2 and H4;
• finally, let us suppose that on the storage resource S1 a raw SAR frame (D1) in the format F1 and a focused SAR frame (D2) in the format F3 are available. On the storage resource S2 the orbital data D3 for the datasets D1 and D2 are available.
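The registrations listed above can be encoded directly, which makes the format-compatibility check on a chain of applications concrete. The sketch is a hypothetical reduction of a MetaSoftware entry to its input formats, output format and hosts; the record `Application` and the method `isCompatible` are names chosen for this example, not part of the implemented repository.

```java
import java.util.List;
import java.util.Set;

// Illustrative sketch; Application and isCompatible are hypothetical names.
final class WorkflowCheckSketch {
    // Reduced view of a MetaSoftware entry.
    record Application(String name, Set<String> inputFormats, String outputFormat, Set<String> hosts) {}

    static final Application A1 = new Application("A1", Set.of("F1", "F2"), "F3", Set.of("H1", "H2"));
    static final Application A2 = new Application("A2", Set.of("F3", "F4"), "F5", Set.of("H1", "H3"));
    static final Application A3 = new Application("A3", Set.of("F5"), "F6", Set.of("H2", "H4"));

    // Returns true when the initial format is accepted by the first step and
    // every step's output format is accepted as input by the following step.
    static boolean isCompatible(String initialFormat, List<Application> chain) {
        String current = initialFormat;
        for (Application step : chain) {
            if (!step.inputFormats().contains(current)) {
                return false;
            }
            current = step.outputFormat();
        }
        return true;
    }

    public static void main(String[] args) {
        // Raw frame D1 (format F1) through A1, A2 and A3: prints true.
        System.out.println(isCompatible("F1", List.of(A1, A2, A3)));
        // Focused frame D2 (format F3) can feed A2 directly: prints true.
        System.out.println(A2.inputFormats().contains("F3"));
        // F3 is not an input format of A1: prints false.
        System.out.println(isCompatible("F3", List.of(A1)));
    }
}
```

When the check fails, the text foresees either a warning to the user or an automatic format conversion; the `hosts` field is what a scheduler would consult next to decide where each step can run.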

Figure 1. Configuration repository used for workflow composition and submission.

Let us now suppose that a potential PSE user wants to obtain an interferogram (D4) starting from the two available datasets D1 and D2. She asks the system to discover all of the available applications. Then she composes, through the GUI, the workflow depicted in Figure 2 and submits it. After the workflow submission, the system queries the configuration repository and retrieves all the metadata attributes related to data and applications. It can verify whether the input datasets are compatible with the applications; if they are not, the system warns the user or activates, whenever possible, an automatic data format conversion. The system also discovers that the application A1 requires the orbital dataset D3. Moreover, through the metadata attributes, the system discovers where the applications are available, and on the basis of this information it can make a better choice as a function of the performance parameters, the execution machine performance and the data transfer rate.

Figure 2. User's workflow example.

As shown in this example, the configuration repository plays a crucial role in the operation of the PSE. This component provides all the needed information both to users, in order to discover resources and access the available services, and to the system, in order to perform job scheduling.

8. Conclusions and future directions

We have designed and implemented a configuration repository able to describe applications, data and resources belonging to the remote sensing field in a grid environment. The design of this component took into account the results of a preliminary analysis of the applications and standards employed in this application field. The proposed configuration repository has been employed, with clear advantage, in a grid-based problem solving environment specialized to process and manage remote sensing data.

9. Acknowledgments

This work has been developed within the GRID.IT, FIRB 2001 project, funded by the Italian Ministry of Education, University and Research (MIUR). For further information, readers should refer to the Web page http://www.grid.it.

10. References

[1] D. Walker, O. F. Rana, M. Li, M. S. Shields, and Y. Huang, "The Software Architecture of a Distributed Problem-Solving Environment", Concurrency: Practice and Experience, December 2000, Vol. 12, No. 15, pp. 1455-1480.
[2] I. Foster, C. Kesselman, and S. Tuecke, "The Anatomy of the Grid: Enabling Scalable Virtual Organizations", International Journal of High Performance Computing Applications, 2001, 15(3), pp. 200-222.
[3] I. Foster and C. Kesselman, eds., "The Grid: Blueprint for a New Computing Infrastructure", Morgan Kaufmann, 1999.
[4] G. C. Fox, "Portals for Web Based Education and Computational Science", http://newnpac.csit.fsu.edu/users/fox/documents/generalportalmay00/erdcportal.html.
[5] E. Gallopoulos, E. N. Houstis, J. Rice, "Computer as Thinker/Doer: Problem-Solving Environments for Computational Science", IEEE Computational Science and Engineering, vol. 1, n. 2, 1994.
[6] K. Czajkowski, S. Fitzgerald, I. Foster, C. Kesselman, "Grid Information Services for Distributed Resource Sharing", Proceedings of the Tenth IEEE International Symposium on High-Performance Distributed Computing (HPDC-10), IEEE Press, August 2001.
[7] The Globus Project. 1997: http://www.globus.org.
[8] The DataGrid Project. http://www.eu-datagrid.org
[9] Steve Fisher, "Relational model for information and monitoring", Technical Report GWD-Perf-7-1, GGF, 2001.
[10] Brian Tierney, Ruth Aydt, Dan Gunter, Warren Smith, Valerie Taylor, Rich Wolski, and Martin Swany, "A grid monitoring architecture", Technical Report GWD-Perf-16-1, GGF, 2001.
[11] J. Millar, University of Tennessee, "Grid software object specification", Information service, request for comment: GWD-GIS-008.
[12] The Open Software Description Format, http://www.w3.org/TR/NOTE-OSD
[13] GridLab Project: www.gridlab.org
[14] G. Aloisio, M. Cafaro, I. Epicoco, D. Lezzi, M. Mirto, S. Mocavero, "The Design and Implementation of the GridLab Information Service", in Proceedings of Grid and Collaborative Computing (GCC2003), Shanghai, 2003.
[15] G. Aloisio, M. Cafaro, I. Epicoco, G. Quarta, "A Problem Solving Environment for Remote Sensing Data Processing", Proceedings of the International Conference on Information Technology (ITCC 2004), IEEE Press, April 5 to 7, Las Vegas (Nevada) USA, Volume II, pp. 56-61, 2004.
[16] E. Deelman, J. Blythe, Y. Gil, C. Kesselman, G. Mehta, K. Vahi, A. Lazzarini, A. Arbree, R. Cavanaugh, S. Koranda, "Mapping Abstract Complex Workflows onto Grid Environments", Journal of Grid Computing, Vol. 1, No. 1, pp. 9-23, 2003.