Technical Report DHPC-008
Eric: A User and Applications Interface to a Distributed Satellite Data Repository

H.A. James and K.A. Hawick∗

Department of Computer Science, University of Adelaide, SA 5005, Australia
25 June 1997
Abstract

We describe a distributed computational infrastructure for accessing and processing a large repository of geostationary satellite data through a World Wide Web browser interface. Our repository of GMS5 satellite data is stored on a combined RAID and tape silo system, accessible from a cluster of DEC Alpha workstations, interconnected by ATM LAN technology locally (in Adelaide) and also available to another cluster of workstations (in Canberra) over Telstra’s ATM Experimental Broadband Network. Our system makes use of parallel and distributed computing software technologies, controlled using Common Gateway Interface and Java code invoked through a WWW server. We give some performance measurements for access times and processing times for various parts of our system. We discuss the rapid prototyping advantages and the constraints on functionality, scalability, performance and security that arise from using such technology.
∗ Author for correspondence, Email: [email protected], Fax: +61 8 8303 4366, Tel +61 8 8303 4519.
1 Introduction
Integrating parallel and distributed computer programs into a framework that can be easily used by applied scientists is a challenging problem. In this paper we discuss a framework that integrates parallel processing and distributed computing software for accessing and manipulating geostationary satellite imagery. We employ World Wide Web (WWW) client/server technology to allow a universal form of user interface to our system.

Land management and environmental scientists make use of multi-channel geostationary satellite imagery in analysing rainfall and vegetation coverage effects. To do this, satellite data needs to be presented in a convenient form with a number of pre-processing operations such as selection of: the particular channel; the area of interest; and the time and date of interest (or a sequence of times and dates). It is also important that, once specified, such snapshots or sequences can be easily fed into application programs which can carry out further processing to derive composite imagery or some item of metadata. We illustrate this idea using an example of computing an approximate percentage cloud cover figure for a selected geographic area, time and date. We describe our repository of geostationary satellite data in section 2 and some of the operations we wish users to be able to carry out in section 3.

We have implemented our infrastructure using the Common Gateway Interface (CGI) [7] to a standard WWW server, such as the NCSA httpd daemon run on a UNIX platform, and also Java applets which can be down-loaded and run on the WWW client browser. This architecture gives a good compromise between a system that can be rapidly prototyped using existing server-side utilities and the desired client-side functionality that we require. A number of distributed and parallel processing technologies can be embedded in our infrastructure. We use remote execution in the form of distributed rsh invocations to farm out parts of the processing to other workstations in our cluster. Parallel programs running as remote services either on the workstation farm or on a dedicated machine are also discussed. We describe the infrastructure architecture in section 5. We give a brief overview of the WWW software technology we have employed in our Eric system in section 4 and some of the parallel processing methods we have used in section 6.

We have made some measurements of the performance of various parts of our system and describe these in section 7, along with the tradeoffs that arise. There are two important data access rates that limit our system. Primarily, the system is limited by data transfer rates between the component machines that are used to provide the processing services. Our implemented system links these machines together using shared file systems and Asynchronous Transfer Mode (ATM) communications [8] running at 155 Mbits/s. The second access speed limitation arises from the available bandwidth between the WWW client and server. In the case of down-loading large images or movie sequences this can be significant, but for information services such as browsing catalogues of small sub-sampled imagery and metadata, this bandwidth is not a significant restriction. We have also experimented with accessing the system over wide area networks such as Telstra’s Experimental Broadband Network [16, 10, 12].

In section 8 we discuss the general abstract model for an infrastructure like Eric and how we plan to adapt and extend it to provide more general services.
We summarise our findings and conclusions in section 9.
2 Repository Design
In this section we describe our repository system for organising the storage of satellite imagery from the Japanese GMS5 satellite. We currently obtain this data by mirroring a NASA ftp site and we use a series of Unix shell scripts to maintain separate files for each time and date and channel of data. Files are stored in the Hierarchical Data Format (HDF) developed by NCSA [21]. This format stores the imagery as well as metadata recording the satellite orientation, calibration and position. This information tags the images in the primary repository and allows subsequent processing of the raw images by other application programs. The HDF file format is supported by a set of utilities
and programming libraries which we have integrated into our Eric infrastructure.

The Geostationary Meteorological Satellite (GMS) provides more than 24 full-hemisphere multi-channel images per day, requiring approximately 204 MBytes of storage capacity per day, or 75 GBytes per year. The GMS-5 satellite was launched in June 1995 and provides visual and infra-red data in various wavelength channels from a Visible and Infra-Red Spin Scan Radiometer (VISSR). Each data set consists of 4 image files and a number of documentation files. The GMS-5 documentation files are in plain ASCII and list relevant information such as calibration data and satellite schedules. One image represents albedo in the visible part of the spectrum and the other three are in the infra-red range. Each image is a full-disk photograph of the Earth from an altitude of 35,800 km and a position almost directly over the equator at 140 degrees east (close to Adelaide’s longitude). These images have 2291 × 2291 pixel resolution (approximately 3 km pixel edge resolution), with varying resolving power for the different channels.

Channel              Wavelength (µm)   Resolution (km)
Visual               0.5 - 0.75        1.5
Thermal IR 1         10.5 - 11.5       5.0
Thermal IR 2         11.5 - 12.5       5.0
Water Vapour IR 3    6.5 - 7.0         5.0

Table 1: GMS-5 VISSR Channels

The Visible and Infra-Red Spin Scan Radiometer (VISSR) records visible radiation through a photomultiplier tube and infra-red radiation through a HgCdTe detector, using a scanning mirror system. Signals are quantised into 64 levels (visual) and 256 levels (infra-red) prior to transmission to an earth-based receiving station. The satellite takes approximately 27.5 minutes to record visual and IR data in a 20 × 20 degree area which includes the earth disk image, using approximately 2500 mirror scan steps. Imaging swathes are 5 km by 1 scan line for infra-red and 1.25 km by 4 scan lines for visual data. The satellite is spin-stabilised and scans are synchronised with the spin rate. The observed schedule is full earth disk images hourly, with images labelled 0000 Universal Time (UT) actually observed between 2330 UT and 0000 UT. Variations in the observation schedule are made for periods when the satellite is eclipsed, for periods of solar interference, typhoon special observation periods and occasional satellite maintenance periods. More information on the GMS satellite and its sensors is given in the Japanese Meteorological Agency User Manual for the satellite [18].

A number of metadata parameters are held within each HDF file. These include:

• satellite elevation and orientation;
• latitude and longitude of the sub-satellite position;
• sensor calibration information;
• time and date stamps of image start and end record times;
• Earth radius and east and west horizon points;
• satellite misalignment matrix;
• satellite orbit parameters;
• satellite spin rate and precession matrix;
• channel sampling angles.
This information is made available by our system as supplemental textual information for a given snapshot image. It can also be fed into an application program for further processing of the imagery. In particular, the metadata giving satellite coordinate information could be used for image rectification and manipulation using longitude and latitude coordinates.

At present, the Eric system accesses raw (non-pre-processed) GMS5 data that is stored in a flat directory structure on a UNIX file system, where each file name reflects its data channel and time/date stamp. Our system stores HDF files that have been compressed using the Lempel-Ziv [23] compression algorithm. We are investigating ways to compress data more efficiently and the tradeoffs between access time and saved storage capacity that result. Each HDF file contains the image at 2291 × 2291 resolution and a quantisation level appropriate to the data channel, and also metadata sufficient to identify the satellite orbit and attitude configuration and thus allow geo-rectification, or conversion from image pixel coordinates to latitude and longitude coordinates. We plan a future capability that will either access a secondary repository of latitude/longitude geo-rectified data, or will carry out the conversion on demand.

Our present repository system involves storage of the most recent image data (approximately one half-year of data) on a 40 Gigabyte StorageWorks RAID, with older data archived on tapes controlled automatically by a 1.2 Terabyte StorageTek tape silo. We are currently developing an archive control system that will migrate data automatically to tapes both locally and to an additional remote tape silo accessible through the Experimental Broadband Network.

We envisage that a repository system such as we describe may have additional uses when coupled to other information sources such as weather services. Integrated systems for numerical weather prediction applications may assimilate satellite imagery as part of their operation and produce derived data products such as forecasts or flow fields [9].
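As a minimal illustrative sketch of the kind of file lookup this naming scheme permits (the directory path, channel codes and date format below are assumptions, not the exact convention used by Eric), a driver script could locate a compressed HDF file for a given channel and time as follows:

#!/usr/bin/perl
# Sketch: map a channel and UTC time-stamp onto a repository file name.
# The repository path and the naming pattern are illustrative assumptions.
use strict;

my $repository = "/data/gms5";             # assumed flat repository directory

sub repository_file {
    my ($channel, $yyyymmdd, $hhmm) = @_;  # e.g. ("vis", "19970625", "0000")
    my $name = "$repository/gms5_${yyyymmdd}_${hhmm}_${channel}.hdf.Z";
    return -e $name ? $name : undef;       # undef if the image is not held
}

# Example: the visible-channel image labelled 0000 UT on 25 June 1997.
my $file = repository_file("vis", "19970625", "0000");
print defined($file) ? "found $file\n" : "image not in repository\n";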
3 Imagery Operations
The original driving reason for developing our system was to allow simple browsing of the satellite imagery, choosing particular images of scientific interest. The simple queries supported under the original system were to be requests for a particular image channel (such as visual, infra-red 1, infra-red 2 or water vapour) at a particular time and date. The full images returned were however of the whole hemisphere of the Earth visible from the satellite - as shown in figure 1. More useful for scientific purposes is to request a particular image resolution and sub-area of interest. These parameters are readily specified using the forms interface described in section 4.1.
Figure 1: GMS-5 Hemisphere of View of Earth: i) Visible Spectra and ii) Water Vapour IR Spectra

Figure 2: GMS-5 View of Australia: i) Visible Spectra and ii) Cloud Mask

For visual inspection (or browsing) of snapshot images, it is useful to create ‘thumbnail images’ of
reduced resolution - effectively sub-sampled images. Our present system creates these dynamically from the primary data images, caching results from previous queries. We are investigating the tradeoffs of creating a complete secondary image repository at a standard thumbnail image size, thus saving the processing time needed to recreate thumbnails dynamically.

It is also useful for scientific analysis of the data sets to specify a query involving a sequence of images. Such a query is specified by a start and end time and date as well as a stride. The query may then be for “every image at midnight, for the last week”. For browsing a sequence of images it is useful to create a thumbnail-resolution movie sequence that can be played through the WWW interface. We employ the MPEG file format for this and integrate public-domain MPEG-creating utilities into our infrastructure. Alternatively the system can return what is essentially a vector of static images that can be processed by some application program using the same user-specified parameters. This is a convenient mechanism to specify a batch process once a processing algorithm has been specified by the user. This mechanism allows composite images, as well as the raw imagery, to be made into a movie sequence.

Reduction operations can also be incorporated into the infrastructure. A reduction operation is one which reduces a two-dimensional data item such as an image into a vector or scalar quantity. An example would be an algorithm to evaluate percentage cloud cover. We have implemented this very simplistically as a computational example. A simple definition of cloud in the image data is given by ‘cold, bright pixel values’. A simplistic algorithm would determine whether a pixel shows cloud or not by thresholding the visual channel for a brightness value and the thermal infra-red channel for a temperature value (a minimal sketch of such an operation is given at the end of this section).

We are developing an operator language that allows a user of the system (or an application programmer) to specify the sequence of image transforms and reduction operations to be applied to a single image or sequence of images from the repository. Since these operations can be implemented as high-performance computing modules, we believe our infrastructure provides a useful testbed for investigating parallel and distributed algorithms for these applications.
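The following is a minimal sketch of such a cloud-cover reduction; the pixel-array interface and the threshold values are illustrative assumptions only, not the thresholds used by Eric:

# Sketch of a percentage cloud cover reduction over a sub-image.
# Inputs are parallel arrays of visual brightness and thermal infra-red
# temperature values for the selected area; thresholds are assumptions.
sub cloud_cover_percent {
    my ($visual, $thermal_ir, $bright_min, $cold_max) = @_;
    my $cloudy = 0;
    my $total  = scalar @$visual;
    for my $i (0 .. $total - 1) {
        # 'cloud' = bright in the visual channel AND colder than the
        # temperature threshold in the thermal infra-red channel.
        $cloudy++ if $visual->[$i] >= $bright_min
                  && $thermal_ir->[$i] <= $cold_max;
    }
    return $total ? 100.0 * $cloudy / $total : 0;
}

# Example call with assumed threshold values:
# my $pct = cloud_cover_percent(\@vis_pixels, \@ir_temps, 40, 253);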
4 WWW Technology
In this section we review two widely used technologies for interfacing applications to the World Wide Web. We discuss the Common Gateway Interface (CGI) mechanism for invoking server side applications and the Java language and environment for invoking both applications that can run at the server side as well as lightweight “applets” that can run at the client side - that is on the web browser host itself. We discuss the relative merits of these two technologies.
4.1 The Common Gateway Interface
The Common Gateway Interface (CGI) [7] is a mechanism whereby a WWW server can execute selected programs on behalf of a WWW client. These so-called CGI programs can be written in effectively any language or programming system available on the computer that runs the WWW server daemon, and they run with all the permissions and access privileges that this daemon has. CGI programs are often used to create HTML output dynamically, which can depend, for example, on what a WWW client user has typed into an HTML form. This HTML output is sent back to the WWW client directly from the program and need not be accessed as an HTML file on the WWW server, as normal HTML content is. Many CGI programs are implemented as UNIX scripts for programs such as UNIX shells or for the Perl interpreted programming language [22].

There are some problems with implementing robust and secure CGI programs that carry out complex operations on the WWW server host computer. Care must be taken to ensure that the shell scripts run by the WWW server are secure and do not accidentally allow an incoming WWW client to invoke arbitrary UNIX commands or anything else that may compromise the system’s integrity. CGI is however a very convenient mechanism for rapid prototyping of user and application services through a WWW interface. In particular it is very easy to integrate existing UNIX programs together to provide a complex information service such as the pre-processing of satellite data we discuss here. Some degree of security can be obtained through the proper configuration of the WWW server software that invokes the CGI program. Standard mechanisms exist to restrict access to CGI programs to particular Internet sub-domains or to require a user/password pair before WWW browser access is granted.

Our Eric system is implemented as a master driver Perl script which can invoke a number of “auxiliary” shell scripts and other UNIX programs. It is common practice with CGI scripts that are invoked from HTML forms to write them to be re-entrant at different points depending upon the arguments passed to them. A common convention is that the program, when invoked as a URL with no arguments, will generate an HTML form that re-invokes the program itself with arguments set up in the form (a minimal sketch of this convention is given after the list below). This allows all the interfacing code to be gathered together in one script and more easily managed. It also greatly simplifies the appearance of the program’s interface in that only one master URL need be known to the WWW client user. Preserving state between successive invocations of a CGI program is however not easily supported. A finite state machine approach is necessary to ensure that an identifying state token is passed back and forth between WWW client and server. This must be set up explicitly by the CGI application programmer.

CGI program interfacing is described in detail elsewhere [7], but it is worth noting some of the CGI features we have found useful in implementing a complex integration package like the Eric system. Advantageous features include:

• rapid prototyping of interface design;
• powerful text manipulation facilities;
• straightforward embedding of existing UNIX programs and services.
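The following sketch illustrates the re-entrant convention described above. It is written with the widely available CGI.pm Perl module for brevity; Eric’s own driver script generates its HTML with direct print statements (see section 5.1), and the menu values shown are only a subset of Eric’s options:

#!/usr/bin/perl
# Sketch of a re-entrant CGI driver: invoked with no arguments it emits an
# HTML form; when the form re-invokes it with 'mainchoice' set, it dispatches.
use strict;
use CGI;

my $q      = CGI->new;
my $choice = $q->param('mainchoice');    # undef on the first invocation

print $q->header('text/html');
if (!defined $choice) {
    # First invocation: generate a form whose action is this same script URL.
    print $q->start_html('Eric'),
          $q->start_form(-action => $q->url),
          $q->popup_menu('mainchoice',
                         ['single image', 'stride image', 'create mpeg']),
          $q->submit('Go'),
          $q->end_form, $q->end_html;
} else {
    # Re-entry: dispatch on the value passed back from the form.
    print $q->start_html('Eric'), "You selected: $choice", $q->end_html;
}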
4.2 Java and Client-Side Computation
A significant limitation of the CGI approach is that all computation is done at the server-side. While this is appropriate for computationally-intensive tasks, it is inconvenient for heavily user-interactive aspects. A typical example involves the so-called “rubber-banding” selection of a sub-image from the full-Earth images available in our repository. A degree of image selection can be carried out by allowing the user to specify coordinates of an enclosing rectangular sub-area, but this is not as intuitive or convenient for the user as using the mouse to orient a rectangle over the area of interest. Rubber-banding is best carried out at the WWW client side. The only mechanism available to do this is using a Java applet which is down-loaded to the WWW browser and which can display a coarse-resolution image with the mouse controlling a viewing rectangle.
Sun’s Java language and development kit make use of a portable bytecode that implements the instruction set of an abstract virtual machine. This allows byte code to be down-loaded dynamically from a server into a WWW browser environment and run by the user on his own browser host. This allows a greater bandwidth of interaction between user and code, and on many modern PCs or workstation hosts it can also exploit significant local computational resources. Java and Java applets are well described elsewhere [4].

We have incorporated a rubber-banding Java applet into our Eric system, which communicates the chosen coordinates directly back to the server script by invoking the URL of the server script with appropriate arguments (a sketch of the corresponding server-side argument parsing is given after the list below). The effect is as though the rubber-banding functionality were embedded directly into the HTML pages generated by the server script. Some problems with using Java in this way include:

• the immaturity and poor performance of Java implementations; in particular, reliability and robust inter-operability between Java and typical WWW browsers such as Netscape are still improving;
• implementing all the classes necessary for such a system in Java involves a significant coding effort compared to use of the Perl scripting language.
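For illustration, the server side of this exchange might look as follows. The parameter names, preview image size and URL form are assumptions made for the sketch, not the actual interface between the applet and the Eric driver script:

# Sketch: server-side handling of the rectangle chosen with the rubber-banding
# applet on a coarse preview image, passed back as query-string arguments.
use strict;
use CGI;

my $q = CGI->new;     # e.g. invoked as eric.cgi?x1=10&y1=20&x2=90&y2=120
my ($x1, $y1, $x2, $y2) = map { $q->param($_) } qw(x1 y1 x2 y2);

# Scale preview-image coordinates up to the 2291 x 2291 primary image.
my $preview_edge = 229;                 # assumed thumbnail edge length in pixels
my $scale        = 2291 / $preview_edge;
my @full_res     = map { int($_ * $scale) } ($x1, $y1, $x2, $y2);
# @full_res now bounds the sub-area to be extracted from the HDF image.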
4.3 CGI vs Java
It is possible to implement the entire Eric system using Java applets at the client side, communicating with a full Java application running as a server daemon. This would allow a more elegant software infrastructure, but we have not chosen to do this for a number of reasons. Our present system was intended as a rapid prototype and, as a Perl script, was able easily to incorporate a number of pre-existing Unix utilities and programs to carry out various aspects of the image processing using system calls. We were able to incorporate a simple form of distributed load balancing across a farm of workstations simply by using remote shell (rsh) operations to handle each user of the system (a minimal sketch of this is given at the end of this subsection). The NCSA httpd WWW server daemon [20] can be configured to spawn copies of itself (up to some prescribed limit) to handle separate incoming http requests for the CGI script. By ensuring that our CGI script is effectively safe for multiple processes, with no conflicting shared temporary files between separately spawned copies of the script process, we do not degrade performance when multiple users try to access the system.

In general this combination of Perl and Java achieved our objectives, and was a good compromise between the rapid prototyping convenience of Perl and the greater elegance, but lower performance, of using Java code throughout. Another possible compromise may be to employ JavaScript code which is generated at the WWW server side and which can be downloaded and run upon the WWW client. We have yet to investigate this.
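The following is a minimal sketch of this style of rsh-based work distribution over a shared file system. The host names, the remote command path and the host-selection rule are all illustrative assumptions:

# Sketch: farm an image-processing command out to a workstation chosen from a
# small cluster using rsh. Host names and the remote command are assumptions.
use strict;

my @farm = qw(alpha1 alpha2 alpha3 alpha4);   # assumed DEC Alpha workstation farm
my $host = $farm[$$ % @farm];                 # pick a host from this CGI process id

# The shared file system means the remote job sees the same path names.
# system returns the command's exit status; non-zero indicates failure.
my $status = system("rsh", $host, "/usr/local/eric/bin/extract_channel",
                    "/data/gms5/gms5_19970625_0000_vis.hdf.Z", "vis");
print "remote job failed on $host\n" if $status != 0;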
5 Eric Architecture
In this section we discuss the overall architecture of our Eric system and some of the important issues we have identified for developing an improved system. Our original system was implemented as shown in figure 3. Eric was originally implemented as a single driver program running on the WWW server machine and invoked as a CGI program by the httpd web server daemon. The driver script was a Perl program, and some of the necessary functions were implemented as C programs or shell scripts embedded as system calls from the Perl script. This mechanism allows for very rapid prototyping but has a number of disadvantages. Perl is not a particularly good language for maintaining large software projects. Furthermore, embedded system calls are not efficient in either memory or computational startup cost, since each must employ a separately spawned UNIX process.

Our improved architecture made use of a Web server and a Java application server working together to service user requests using WWW protocols. The WWW server and Java application share a common file space and can therefore communicate using shared information. The improved architecture is illustrated in figure 4.
Figure 3: Eric: Prototype Infrastructure (components: User, Web Browser, Web Server, Common Gateway Interface, Driver Script, Programs to Manipulate Data, Cached Set of Delivery Formats (jpeg, mpeg, metadata, ...), HDF Data on RAID)

The Eric system provides users with an HTML form that can be used to request particular information products from the database of satellite imagery. A sequence of forms is used to narrow down the particular options relevant to a hierarchy of choices presented to the user. At each stage, the HTML form presented to the user has a link to help information and to the main menu of Eric options. Data is delivered back to the user directly as an embedded image, movie or text file. Once these have been transferred to the user’s browser, he can choose to store them on his own local file system for future use, as well as merely holding them in the WWW browser’s cache.

In the case of popular products, and for a system that may be serving a large community of users, there is a potential performance improvement in managing caches of final and intermediate products. Our system does this by using a data-oriented naming scheme that can be parsed at any stage by the driver script. For computationally intensive operations, this makes it possible to determine whether the result has already been computed and is in the cache. Our present system uses a simple naming scheme based on adding many details to the filenames used; a more robust server-based database approach would be preferable since it would allow better organisation of the cached data than a flat file structure.

The present Eric system makes heavy use of temporary files to handle multiple user requests and to track the intermediate results between programs chained together to provide the various services Eric offers. This requires use of process identifiers and other temporary tokens to track ownership and description of these files. This approach does not scale well. We implement a primitive form of garbage collection by assigning the UNIX cron daemon the task of periodically clearing out outdated temporary information from Eric cache areas (a minimal sketch of such a clean-up is given below).

Error handling in CGI programming in general, and Eric in particular, is not robust. Most WWW servers maintain a log of error codes should the CGI program fail to run at all, but a great deal of checking for file existence and explicit coding of error reporting is necessary to keep the user informed should a complex request fail. Reporting the exact conditions is possible, and most failures in our present system occur due to a missing image in the master database of GMS data.
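A minimal sketch of such a cron-driven clean-up script follows; the cache directory path and the expiry period are illustrative assumptions only:

# Sketch of the cache garbage collection Eric delegates to the cron daemon.
# The cache directory and the expiry age below are illustrative assumptions.
use strict;
use File::Find;

my $cache_dir = "/data/eric/cache";   # assumed cache area for derived products
my $max_age   = 7;                    # days before a cached product is discarded

find(sub {
    return unless -f $_;
    unlink $_ if -M $_ > $max_age;    # -M gives the file age in days
}, $cache_dir);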
Figure 4: Eric: Combined CGI and Java Infrastructure (components: User, Web Browser (HTML + Applet), Web Server, Common Gateway Interface, Perl Driver Script, Java Application, Programs to Manipulate Data, Cached Set of Delivery Formats (jpeg, mpeg, metadata, ...), HDF Data on RAID)
5.1 Eric Implementation
In this section we describe some of the implementation techniques for Eric. The following Perl subroutine generates the main menu for the Eric system. A series of name/value pairs are used to transfer information between the HTML form sent to the WWW client and the CGI program running on the WWW server host. For example, the tag “mainchoice” is set to one of the values: “single image”; “stride image”; “create mpeg”; or “help startup”, depending on which operation the user wishes to carry out.

sub startup_choice_form {
    &page_header;
    # Banner for the main menu page (the HTML heading tags here are a
    # reconstruction; the markup was lost from the original listing).
    print qq|
<H1>Eric Mk $version</H1>
<H2>A User and Applications Interface to a Distributed Satellite Data Repository</H2>
|;
    print