Spatial Data Reallocation Based on Multidimensional Range Queries
A Contribution to Data Management for the Earth Sciences

Hans Hinterberger, Kathrin Anne Meier, Hans Gilgen
Institute for Scientific Computing; Department of Geography, Climate Research Group
Swiss Federal Institute of Technology (ETH), Zurich
Abstract

Earth scientists, by definition, work in an interdisciplinary environment and therefore collect and disseminate data using distinct methods, depending on whether the associated information arises during field measurements, arrives via remote sensing, or represents simulation results. When calculating the global radiation over one of the oceans, for instance, it is essential that cloud data from at least three different sources can be accessed based on one and the same geographic grid. Differently gridded source data must therefore be reorganized, through interpolation for example. This can be done in different ways and we investigate a method whose practicality depends on fast spatial range queries, but which in turn provides a flexibility that makes it particularly easy to accommodate differently organized data sets - a valuable feature when unanticipated grid organizations show up.
1 The crux of managing large Earth Science data sets

Many data sets in the Earth Sciences resemble each other in two fundamental ways: they tend to become massive in size and they incorporate a spatial component relating them to the earth's geography. Fortunately, mass storage technology has progressed to a point where the accumulation of massive data sets no longer presents any major problems. The development of representational structures to manage these data in a flexible and efficient way, however, appears not to have kept pace. This shortcoming manifests itself painfully when, for instance, data have to be converted to suit a local file format, particularly if multiple data sets must be compared or combined in a particular calculation. The problem area is well known and has
been documented elsewhere (e.g. [1]). The challenge to address these difficulties comes from the by now mature insight that traditional data management systems are ill-suited to handle such diverse science and engineering requirements. Problems associated with scientific data management arise immediately in heterogeneous data environments, for instance with information used for climate research, where different types of scientific measurements - collected routinely in data sets containing typically 10^6 to 10^8 values - must be related to spatial information. Unfortunately, these data sets are usually compressed or stored in a way that makes efficient retrieval based on their spatial characteristics cumbersome and time consuming. As a result, valuable data remain unexplored and go to waste.

Our objective is to investigate methods that help render Earth Science data accessible across different disciplines. We have chosen to work with climatological data and study their management with the goal of making them better suited for global climate studies. The key to our solution is elementary but twofold: first, climatological data are transformed to higher-dimensional point data and second, information about the data's geographic source is parameterized in such a way that it can be stored and accessed explicitly. Our approach is novel in the sense that the same file structure is used to manage both types of information.

The discussion of our work is structured as follows. Section 2 briefly characterizes the type of climatological data that we deal with. In Section 3 we introduce a flexible method to interpolate climate data which depends on efficient spatial range queries. Some data management aspects are discussed in Section 4. First results obtained with global cloud data sets are listed in Section 5.
2 Characteristics of global climate data sets

Earth scientists with computers . . .
To appreciate the climatologist's interest in efficient access to heterogeneous data sets consider the following. When calculating the global radiation over one of the oceans, for example, it is essential that cloud data from at least three different sources can be accessed based on one and the same geographic grid. Examples of typical data sets are:

The ISCCP-C2 cloud data, recorded from July, 1983 to December, 1990, available in 90 monthly files in the NCSA (National Center for Supercomputing Applications) HDF format from the NASA Langley Research Center, Hampton, VA. Each file occupies 2.3 MB (compressed) or 4.5 MB (decompressed) on disk ([2],[3]).

The surface observation cloud data, organized in 12 ASCII files, totalling 70 MB ([4],[5],[6]).

The ECMWF (European Center for Medium-Range Weather Forecasts) has produced a file of 0.333 MB of GRIB coded upper-air and surface meteorological data every 12 hours since 1980 ([7],[8]).

In recent climate simulation studies with a GCM, between 120 and 600 monthly GRIB formatted files, with volumes ranging from 60 to 600 MB each, are used. It is estimated that a GRIB formatted file becomes approximately five times larger when the data are converted to ASCII code ([8]).

Such a variety of file organizations may be convenient for those producing the data sets, or maybe for global analyses involving one point in time. For integrated climatological studies, however, the following drawbacks are obvious:

1. During a temporal analysis, a large amount of temporary disk storage is needed, because all files containing data that fall into the time interval to be studied have to be read from tape or optical disk or be electronically transferred and then decompressed.
2. To calculate temporal and regional averages, only a fraction of the data read from temporary disk storage is used at any one time.
3. The application programs contain large procedural parts, only to be used for the transfer of data from the source files.
. . . bring computer scientists down to earth

From the computer scientist's point of view, we chose global climate data for our investigations because they combine just about all characteristics of Earth Science data that challenge data management. Furthermore, the authors combine their efforts in an interdisciplinary project involving climatology and computer science. Another interesting aspect of climate data is that they progress from many individual sets of scientific data to one huge set of statistical data.

Based on their method of acquisition, global climate data sets can be categorized into point data sets and gridded data sets. Point climate data sets contain values that were measured at stations on the earth's surface. Values for quantities such as temperature, humidity, global radiation, etc. are measured at a predetermined, standard height above the surface of the earth. Upper-air values of temperature, humidity, atmospheric pressure, etc., on the other hand, are measured by radiosonde or other devices producing profiles of a quantity in the air above the observing ground stations. These climate data sets are efficiently managed with a relational database management system ([9],[10]).
Gridded climate data sets

Gridded climate data sets contain interpolated values that are representative for predefined geographical regions. These typically static data sets share the following characteristics. The data are organized either in equal area grids or regular grids, both with perpendicular boundaries in a cylinder projection of the longitude and latitude circles of the earth. All cells of an equal area grid cover approximately the same area on the surface of the earth. From equal area gridded data, global averages can be easily calculated and, for a given resolution on the surface of the earth, the amount of data generated is minimal ([11]). The cells of regular grids all have the same width and the same height in a cylinder projection of the earth's surface. Consequently, the areas of the cells at the earth's surface decrease with increasing latitude. Global averages are calculated after the data have been weighted based on the area of each cell, as sketched in the example below. The values in a cell of a climate data set are spatial (related to the area inside the cell) or temporal (representative of values within a month or a season) averages or interpolations from surface, upper-air or satellite observations, or they result from model calculations.
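The area weighting just mentioned can be made concrete with a small example. The following C sketch (an illustration only, not part of the software described later in this paper) computes a global average over a regular latitude-longitude grid; each cell is weighted proportionally to sin(northern edge) minus sin(southern edge), i.e. to its share of the earth's surface. Function and variable names are assumed for illustration.

    #include <math.h>

    /* Illustrative only: global average over a regular latitude-longitude
       grid, each cell weighted by sin(northern edge) - sin(southern edge),
       i.e. by its share of the earth's surface, which shrinks polewards.  */
    #ifndef M_PI
    #define M_PI 3.14159265358979323846
    #endif
    #define DEG2RAD (M_PI / 180.0)

    double global_average(const double *values, int nlat, int nlon,
                          double cell_height_deg)
    {
        double sum = 0.0, wsum = 0.0;
        for (int i = 0; i < nlat; i++) {
            double lat_s = -90.0 + i * cell_height_deg;   /* southern cell edge */
            double lat_n = lat_s + cell_height_deg;       /* northern cell edge */
            double w = sin(lat_n * DEG2RAD) - sin(lat_s * DEG2RAD);
            for (int j = 0; j < nlon; j++) {
                sum  += w * values[i * nlon + j];
                wsum += w;
            }
        }
        return sum / wsum;
    }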
The quantities measured or calculated can be approximated with reasonably smooth functions of longitude and latitude, and the averaging or interpolation procedures eliminate the noise produced as part of their acquisition. The measurements have spatial spectra similar to red noise spectra, i.e. most of their variance is distributed over the lower spatial frequencies.

Three typical cases can be distinguished when data are retrieved from gridded data sets, namely queries for 1) spatial analysis, 2) temporal analysis, and 3) the calculation of a new quantity. To facilitate data access for these three classes of queries, climate data are embedded in search spaces having between five and eight symmetric key-attribute domains. In this way, different types of queries can be adjusted to characteristically shaped query regions. Physically, quantities are measured in the air over a grid cell on the earth's surface, at one or more levels with predetermined elevations. These "measurement boxes" typically extend from 100 to 1000 km in longitude and latitude and rise 10 km above ground. The elevation level and the latitudes and longitudes of the lower left (south-west) and upper right (north-east) vertex of a grid cell correspond to three search space dimensions, or attributes. Other attributes of the elements in a climate data set, its parameters, are the measured quantity's name, the date of acquisition (year and month) and so on. Efficient data retrieval is only possible if all dimensions of this parameter space are treated symmetrically. When values for quantities in a box are measured or calculated and stored in a data set, one physical dimension is often replaced with two search space dimensions, e.g. monthly values are identified by the coordinates year and month.

A global climate data set is usually stored and distributed as a set of files on tapes or optical disks; sometimes they are made accessible for transfer via ftp. A file contains the values of all quantities at all elevation levels for all grid cells, one record per cell, for a given month or a particular point in time. The files are either ASCII coded, or a binary, machine independent (interchange) format, such as GRIB, is used ([12],[8]).
3 Interpolation as a consequence of heterogeneous data sets

In order to complete a data set for a particular climatological study, it is often necessary to substitute missing data items by suitably combining values from two or more auxiliary data sets. This is straightforward if all auxiliary measurements have been sampled from or interpolated into the same temporal and spatial intervals. Temporal intervals are usually identical since time units for climatological studies are typically based on months or seasons. Spatial intervals, however, are often not compatible when they belong to different geographic grids, so that the various auxiliary data must be interpolated into the cells of a single, common grid.

The choice of interpolation method depends predominantly on the availability of the original measurements and on whether or not the procedures used to calculate the data sets are restricted to a particular geographic grid. The task becomes simple if a grid-independent interpolation procedure can be applied to the original measurements in order to generate new data sets based on the new common grid. Unfortunately, such recalculations can rarely be performed outside the institutions that produced the original data sets, and often they cannot even be carried out within them, for the simple reason that the original measurements are no longer available.

Several procedures have been introduced to interpolate values associated with one grid to corresponding values for another grid. Two widely applied types of methods are based on point interpolation and area weighted interpolation respectively. Whenever original measurements cannot be "resampled" or "reinterpolated" to a new grid, one resorts to methods that spatially "reallocate" existing data sets to a new common grid (e.g. [13]).

Figure 1: Point interpolation: The thin lines radiate from the center point of the destination grid cell to the center points of the start grid's contributing cells. Area weighted interpolation: The differently shaded rectangles indicate the areas of the cells in the start grid which intersect with the cell of the destination grid.
3.1 Point and Area Weighted Interpolation

When discussing interpolation methods, we refer to the geographic grid underlying the values of the existing data set as the "start grid" and designate with "destination grid" the geographic grid for which new data must be interpolated.

Point interpolation is based on the center points of the cells in both the start grid and the destination grid, as illustrated in Fig. 1. To determine the value of a cell in the destination grid, its center point is projected onto the start grid. The cells of the start grid that surround this projected point will contribute to the interpolated value only if they satisfy some neighborhood function. Given the distances to the center points of these contributing cells, an interpolation algorithm (based, for example, on inverse distances or splines) determines the amount that will be considered from each contributing cell. This procedure is repeated for each cell of the destination grid. The resulting values are then stored in a new data set.

When the data are related to area, some method based on area weighted interpolation is more appropriate. For area weighted interpolation the destination grid is also projected over the start grid. The contribution of each start grid cell to the destination grid cells which it intersects is resolved based on area rather than distance, as shown schematically in Fig. 1. The weights that determine the contribution of the value associated with a start grid cell are based on that cell's proportional contribution to the destination grid cell's area. The crucial part of this method is to find a procedure that quantitatively establishes how the cells of a start grid intersect the cells of a destination grid.

The necessary steps are often embedded in a single program, capable of handling only one type of start and destination grid. In other words, information about the two grids is cast into a particular algorithm, so that a new procedure needs to be written whenever a new combination of start and destination grid arises. This can become an impediment because it is difficult to find the required algorithms whenever a grid partition is not constructed generically, which is often the case. An algorithmic approach to area weighted interpolation is impractical for climatological studies because the grids are often complex and many arbitrary and unanticipated grid organizations must be considered during a single investigation.
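To illustrate the area weighted principle for a single destination cell, the following C sketch computes its value as the overlap-area-weighted mean of the intersecting start cells. It is a hypothetical simplification (areas are taken in the longitude-latitude cylinder projection rather than on the sphere) and not the sdr program discussed later in this paper.

    /* Hypothetical sketch of area weighted interpolation for one destination
       cell.  Each start grid cell that intersects the destination cell
       contributes its value, weighted by the shared area.                   */
    struct cell {                  /* axis-parallel cell in a cylinder projection */
        double lon_min, lat_min;   /* lower left  (south-west)  vertex */
        double lon_max, lat_max;   /* upper right (north-east)  vertex */
    };

    static double overlap(double a_min, double a_max, double b_min, double b_max)
    {
        double lo = a_min > b_min ? a_min : b_min;
        double hi = a_max < b_max ? a_max : b_max;
        return hi > lo ? hi - lo : 0.0;
    }

    /* value of the destination cell as the area weighted mean of the
       intersecting start cells                                          */
    double area_weighted_value(struct cell dest,
                               const struct cell *start, const double *value, int n)
    {
        double sum = 0.0, wsum = 0.0;
        for (int i = 0; i < n; i++) {
            double w = overlap(start[i].lon_min, start[i].lon_max,
                               dest.lon_min, dest.lon_max)
                     * overlap(start[i].lat_min, start[i].lat_max,
                               dest.lat_min, dest.lat_max);
            sum  += w * value[i];   /* contribution proportional to shared area */
            wsum += w;
        }
        return wsum > 0.0 ? sum / wsum : 0.0;   /* 0.0 if the grids do not overlap */
    }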
3.2 The Spatial Data Reallocator

To make area weighted interpolation more practical, greater flexibility when working with different geographic grids is mandatory. This becomes possible when information about these grids is maintained descriptively, rather than procedurally embedded in a particular algorithm. This is done by explicitly storing a grid's interval boundaries with a spatial data structure. The question of which cells of the start grid intersect a given destination grid cell can now be formulated as a region query, whose upper and lower bounds are given by the destination grid cell's extension. In other words, we reduce area weighted interpolation to retrieving fractional values (the differently shaded areas in Fig. 1) and reallocating them to a new cell. This approach leads to a large number of range queries, however, and the method only becomes practical if the start grid information and its associated global climate data set can be accessed efficiently, at least with respect to their spatial components.
Figure 2: Simplified representation of a global climate data transformation from the format of an arbitrary external source into a format suitable for interpolation-based reallocation.
After deciding to investigate this approach, we chose to manage the grid information with a five-dimensional grid file ([14]) as follows. Four key-attributes specify the longitude and latitude of two diagonally opposite cell vertices; the cell's identification number constitutes the fifth key-attribute. This file is called the description file and the associated file storing the global climate data set is referred to as the data file. The organization of these two file types is illustrated in Fig. 2, where the description file is labelled GG (for geographic grid) and the data file CD (for climate data).
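The record layout of such a description file might look as follows; the field names and types are our own illustration, not the actual gft/1 record format.

    /* Illustrative layout of one record in the description file (GG); the
       five key-attributes follow the text above, but field names and types
       are assumptions, not the authors' actual gft/1 record format.        */
    struct grid_cell_record {
        double lon_sw, lat_sw;   /* longitude/latitude of the south-west vertex */
        double lon_ne, lat_ne;   /* longitude/latitude of the north-east vertex */
        long   cell_id;          /* the cell's identification number (5th key)  */
    };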
The data file stores individual climatological measurements and cross-references these values to the associated geographic grid by also listing a cell's identification number or the coordinates of the two cell vertices, as stored in the description file. Clearly, other multidimensional file structures could be used to store the parameters of a geographic grid. Using the grid file, however, brings the added advantage that both the grid and the measurement data can be managed with the same file system.
Figure 3: Simplified representation of interpolation-based reallocation of global climate data (CDj / GGi) from an arbitrary geographic grid (GGi) into another arbitrary geographic grid (GGq). The newly generated records, stored in CDnew, are in grid file format, but written as a sequential file. To become useful as a multidimensional direct access file, they must be inserted with gft/1.
To illustrate the reallocation of spatial data we designate:

with GGi a geographic grid of type i
with CDj a global climate data set of type j
with CDr / GGs a global climate data set of type r organized for geographic grid s

To reallocate data organized for one geographic grid to the cells of another geographic grid, one proceeds as follows. First, the description files of the start grid and the destination grid must be available, either obtained from a grid library or newly generated. For each cell
of the destination grid a corresponding region query is formulated, based on the grid file GGq, and applied to the grid file of the start grid, designated GGi, as shown in Fig. 3. This query returns all intersecting cells so that the size of the overlapping areas can readily be calculated to obtain the weights necessary for the area weighted interpolation. The same query is subsequently applied to the data grid file CDj / GGi, providing the data necessary to compute the values for the new data set, shown as CDnew / GGq in Fig. 3.

The distinction between description file and data file greatly simplifies matters since now one only needs to generate a description file for every new type of start or destination grid - totally independent of the procedure to manipulate the grids. The geographic region covered by the destination grid can be of any size, but only values for those cells that lie inside the region covered by the start grid can be interpolated.

Efficient access to spatially reallocate climate data is important when preparing data sets for global climate studies. But these data sets are also investigated in their own right and any transformation of the data, done as part of spatial data reallocation, must be compatible with other types of queries as well, preferably at no additional cost. In Section 4 we briefly describe the data structures chosen to implement the grid file and show how the necessary speed-up and savings in storage space have been achieved.
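The region query for one destination cell can be stated directly in terms of the four vertex key-attributes: a start cell intersects the destination cell exactly when its south-west vertex lies south-west of the destination's north-east vertex and its north-east vertex lies north-east of the destination's south-west vertex. The following C sketch formulates this box-shaped query; the types and helper names are assumptions for illustration and do not represent the gft/1 interface.

    #include <float.h>

    /* Hypothetical sketch: the region query for one destination cell over the
       four vertex key-attributes of the description file.                    */
    struct interval { double lo, hi; };
    struct region_query {            /* one interval per key-attribute        */
        struct interval lon_sw, lat_sw, lon_ne, lat_ne;
    };
    struct cell { double lon_sw, lat_sw, lon_ne, lat_ne; };

    struct region_query intersecting_cells_query(struct cell dest)
    {
        struct region_query q;
        q.lon_sw = (struct interval){ -DBL_MAX, dest.lon_ne };  /* sw vertex west of dest's ne  */
        q.lat_sw = (struct interval){ -DBL_MAX, dest.lat_ne };  /* sw vertex south of dest's ne */
        q.lon_ne = (struct interval){ dest.lon_sw,  DBL_MAX };  /* ne vertex east of dest's sw  */
        q.lat_ne = (struct interval){ dest.lat_sw,  DBL_MAX };  /* ne vertex north of dest's sw */
        return q;
    }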
4 A grid file-based file structure for global climate data

The software used to manage the global climate data consists of two compact C programs, both running under UNIX. One, called gft/1 (a grid file based system, [15]), manages the description and the data files; the other, called sdr ([16]), interpolates data values that are based on one geographical grid to corresponding values of another grid. Assuming that readers are familiar with the concepts of the grid file, we restrict the remainder of our discussion to those features of gft/1 which are relevant to the problems at hand, namely fast query processing and minimal storage requirements. The main characteristics of gft/1 are:
A memory-resident region directory.
A "de atable" search space.
A compressible data bucket file.
These properties are a prerequisite to satisfy the requirements that arise with the spatial data reallocator introduced in the previous section. The reasons are:

1. The high speed with which a grid file, equipped with a deflatable, main memory resident region directory, answers arbitrary range queries.
2. Lossless compression of the data bucket files leads to smaller storage space requirements than would be necessary for the original data.
4.1 Memory-resident region directory

Early publications on the grid file (e.g. [14]) mention as one of the design goals the "two disk access principle", which states that, given a fully specified query, the first disk access retrieves the relevant directory entry and the second reads the data bucket holding the required data item. But two disk accesses are only necessary when both the data and the directory are stored on disk after the grid file has been opened for access. With the region directory (introduced in [17]), a linearly growing directory structure for the grid file has been found whose size is proportional to the number of data items inserted. Furthermore, given a constant average bucket occupancy, its size can be controlled with the volume of the data buckets so that it can be kept in central memory, reducing disk accesses to data bucket retrieval.

The fact that a k-dimensional grid file partitions the search space into disjoint, k-dimensional, box-shaped bucket regions has been exploited in the design of the region directory. Each bucket region is uniquely identified with two diagonally opposite vertices, each located at the intersection of k scale interval boundaries, such that no interval boundary is common to both vertices. The k interval boundaries which are close to the origin of their respective scales intersect at the "near vertex"; the "far vertex" comprises all other interval boundaries. The scale indices representing the two vertices, together with a bucket address, result in 2k + 1 directory entries per data bucket.
Storage structure for the region directory

One of the simplest storage structures for the region directory is a tabular list of directory entries, arranged either randomly or lexicographically ordered, for example by near vertex. gft/1 does not order its directory entries; it just appends them at the end of the list. This list is stored in a dynamic array which, through reallocation, can grow in increments of 1000 entries.
The region directory of gft/1 also stores each bucket's occupancy to facilitate certain operations, such as the run-length encoding of data buckets described in Section 4.2.
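A minimal sketch of such a region directory, assuming k = 5 key-attributes, could look as follows; entry layout, names and the occupancy field are illustrative assumptions rather than gft/1's actual data structures.

    #include <stdlib.h>

    /* Illustrative sketch: each data bucket contributes the scale indices of
       its near and far vertices plus a bucket address (2k + 1 values);
       entries are appended unordered to a dynamic array that grows in
       increments of 1000.                                                  */
    #define K 5                       /* number of key-attribute dimensions  */

    struct dir_entry {
        int  near_vertex[K];          /* scale interval indices, near vertex */
        int  far_vertex[K];           /* scale interval indices, far vertex  */
        long bucket_addr;             /* disk address of the data bucket     */
        int  occupancy;               /* number of records in the bucket     */
    };

    struct region_directory {
        struct dir_entry *entries;
        size_t used, allocated;
    };

    /* append one entry, growing the array by 1000 entries when it is full */
    int directory_append(struct region_directory *d, struct dir_entry e)
    {
        if (d->used == d->allocated) {
            size_t n = d->allocated + 1000;
            struct dir_entry *p = realloc(d->entries, n * sizeof *p);
            if (!p) return -1;
            d->entries = p;
            d->allocated = n;
        }
        d->entries[d->used++] = e;
        return 0;
    }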
"De ating" the region directory Since all bucket regions together span the entire search space, many of them must cover space which is mostly empty whenever data are not uniformly distributed. This leads to empty gridblocks, i.e. no record in the grid le has a key value from the region covered by this gridblock. We call bucket regions with empty gridblocks "in ated" and measure the degree of a region directory's in ation simply by the proportion of its empty grid blocks. Whenever a query covers only empty gridblocks of a bucket region, the corresponding bucket will be read from disk in vain. This can be prevented, however, if some or all of these gridblocks disappear from the directory, in other words, if the bucket regions become "de ated." gft/1 provides the option to request the le system to adjust every boundary of all bucket regions so that as many empty gridblocks as possible will disappear from the region directory. The bucket regions remain convex and box-shaped, but will no longer span the search space (the directory now resembles an abstract block of Emmental cheese). Fig. 4 illustrates the structure of a de ated, two-dimensional region directory. . . .. . .. . . . . . . .. . . .. . . . .. . . .. .. . .. . . .. . . .. . ..
Figure 4: A deflated region directory has its bucket region boundaries adjusted to better fit the structure of the data. This illustration shows a hypothetical, two-dimensional data set (a) that is stored in a grid file (b). The bucket region boundaries (shown with bold outlines) are subsequently adjusted to deflate the directory (c). In the deflated directory, the grid blocks falling into the cross-hatched areas are no longer accessible.
Because unnecessary disk accesses are now largely eliminated, the time required to answer queries can be reduced substantially.
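The deflation step can be pictured as shrinking every bucket region to the smallest box of scale intervals that still contains the bucket's records. The following C sketch illustrates this idea for one bucket; the structures and the treatment of interval boundaries are assumptions, not gft/1's implementation.

    /* Hypothetical sketch of "deflating" one bucket region: its boundaries
       are shrunk to the smallest box of scale intervals that still contains
       every record stored in the bucket, so that empty gridblocks disappear
       from the directory.                                                  */
    #define K 5                           /* number of key-attribute dimensions */

    struct dir_entry { int near_vertex[K], far_vertex[K]; };
    struct record    { int scale_index[K]; };   /* record keys mapped to scales */

    void deflate_region(struct dir_entry *e, const struct record *rec, int n)
    {
        if (n == 0) return;                     /* empty buckets left unchanged */
        for (int d = 0; d < K; d++) {           /* bounding box per dimension   */
            int lo = rec[0].scale_index[d], hi = lo;
            for (int i = 1; i < n; i++) {
                if (rec[i].scale_index[d] < lo) lo = rec[i].scale_index[d];
                if (rec[i].scale_index[d] > hi) hi = rec[i].scale_index[d];
            }
            e->near_vertex[d] = lo;             /* adjusted lower boundary       */
            e->far_vertex[d]  = hi + 1;         /* adjusted upper boundary       */
                                                /* (upper bound taken exclusive, */
                                                /*  an assumption of this sketch)*/
        }
    }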
4.2 Compressing the data bucket file

When designing bucket-based file management systems, it is useful to distinguish between operations on the buckets when they are in main memory and when they reside on disk. While in main memory, buckets should be available in full, i.e. not only the data items but also the empty part of a bucket, to support efficient execution of all operations. Space for at least two buckets should be allocated, to allow efficient split and merge operations. With gft/1, the size of the bucket cache can be specified (default: 12 buckets). While they reside on disk, the only requirement on the buckets is that they take up as little space as possible, both for reasons of economy and reduced I/O transfer times. The processor performance of today's workstations warrants the effort to compress data bucket files in order to reduce transfer times.
Eliminating trailing blanks

Grid files typically store data in fixed-sized buckets, with an average bucket occupancy commonly around 50%. If all records in a bucket are moved together so that they cover a contiguous region, beginning at one end of the bucket, one observes an effect similar to trailing blanks in files with fixed-length records. Considering the large size of the data sets involved, substantial amounts of secondary storage can be saved by not storing the empty part of the data buckets. It is mandatory, however, that a technique is chosen which works with dynamically growing and shrinking files. If a bucket does not change in size, it is written to its old location on disk. If a bucket changes in size, it is written to an empty location on disk whose size exactly fits the bucket's new length. It is appended at the end of the file if no such "hole" exists.
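The write rule described above can be sketched as follows; the hole list and the bookkeeping names are assumptions for illustration, not the actual gft/1 code (registering the vacated old location as a new hole is also omitted for brevity).

    #include <stdlib.h>

    /* Hypothetical sketch of the bucket placement rule: a bucket that did not
       change in size is rewritten in place; otherwise it goes into a free
       "hole" whose size exactly fits the new length, or is appended at the
       end of the file.                                                      */
    struct hole { long offset; size_t size; struct hole *next; };

    long place_bucket(long old_offset, size_t old_size, size_t new_size,
                      struct hole **holes, long *end_of_file)
    {
        if (new_size == old_size)
            return old_offset;                      /* unchanged: old location  */
        for (struct hole **h = holes; *h; h = &(*h)->next) {
            if ((*h)->size == new_size) {           /* exactly fitting hole     */
                long off = (*h)->offset;
                struct hole *used = *h;
                *h = used->next;
                free(used);
                return off;
            }
        }
        long off = *end_of_file;                    /* no hole: append to file  */
        *end_of_file += new_size;
        return off;
    }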
Run-length encoded data buckets

Eliminating trailing blanks can be generalized to replacing a sequence of N occurrences of some value V by three fields: <special character> <V> <N>. The "special character" indicates that the value following it will be repeated as often as is indicated by the value of N in the third field. This method, known as run-length encoding, is only useful when "runs" of identical values occur often and data files are either static or are part of a file system which allows incremental modifications. Run-length encoding is a natural choice to compress data bucket files because the grid file preserves order in each key-attribute domain and keeps changes to the file local to individual data buckets. In other words, it stores neighboring data values also close together physically, so that identical values are likely to be found in one and the same data bucket. Furthermore, modifications do not "ripple" through the entire file.

There are two ways to run-length encode bucket files with gft/1, allowing a trade-off between speed and compression efficiency. The "fast" method compresses buckets after lexicographically ordering the records based on the predefined key permutation. The "intelligent" method first determines the search-key permutation which will result in the largest gain of disk space and reorganizes the data buckets accordingly, before compressing them.
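A minimal run-length encoder for this scheme might look as follows; the choice of escape byte, the minimum run length and all names are assumptions, not gft/1's actual encoding.

    #include <stddef.h>

    /* Minimal run-length encoding sketch: a run of n occurrences of byte v is
       written as the three fields ESC, v, n.  Runs shorter than 4 bytes are
       copied verbatim because the encoded form would not be smaller.        */
    #define ESC 0xFE                  /* assumed escape byte                  */

    size_t rle_encode(const unsigned char *in, size_t n, unsigned char *out)
    {
        size_t o = 0;
        for (size_t i = 0; i < n; ) {
            size_t run = 1;
            while (i + run < n && in[i + run] == in[i] && run < 255)
                run++;
            if (run >= 4 || in[i] == ESC) {       /* encode as ESC, value, count */
                out[o++] = ESC;
                out[o++] = in[i];
                out[o++] = (unsigned char)run;
            } else {                              /* short run: copy literally   */
                for (size_t j = 0; j < run; j++)
                    out[o++] = in[i];
            }
            i += run;
        }
        return o;                                 /* length of the encoded bucket */
    }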
5 First results based on large cloud data sets

The data management program gft/1 and the spatial data reallocator sdr have been installed on the SUN SPARC 10 system of the Climate Research Group at ETH Zurich, where the software has been evaluated. The system is clocked at 50 MHz and equipped with 64 MByte of main memory. An evaluation of our software aims at answers to the following two questions:

1. Does the proposed method to interpolate differently gridded global climate data sets stand up to the load of large, actual data sets?
2. How does the grid file perform with respect to: a) storage space requirements, including the promised benefits of compression, and b) time required when inserting large data sets and submitting typical queries?

In other words, at present we are primarily interested in determining whether our approach is feasible and practical, and only secondarily in the performance of the chosen storage system.

We have handled data of the ISCCP (International Satellite Cloud Climatology Project) data sets consisting of NASA Langley DAAC HDF ([3]) coded files. The ISCCP data are calculated operationally from imaging radiometer measurements on geostationary and polar orbiting weather satellites. The ISCCP-C1 product reports the merged global results every three hours for 132 quantities in each ISCCP grid cell. The ISCCP-C2 product reports the monthly statistics for 72 quantities in each grid cell, summarized from the ISCCP-C1 data. The ISCCP geographic grid is organized into
equal area cells with sides that are approximately 250 km long on the surface of the earth. For our tests, 90 monthly files (from July 1983 until December 1990) have been selected from the ISCCP-C2 data set. These files contain statistics (mean, temporal and spatial standard deviations) of the cloud top pressure, the cloud top temperature, the cloud optical thickness and the cloud precipitable water as well as the frequency of occurrence and means of cloudiness at three elevation levels (low, middle, high) and of seven cloud types (cumulus, stratus, altocumulus, nimbostratus, cirrus, cirrostratus, deep convective). In addition, values for other quantities, such as mean snow or ice cover, near surface temperature and pressure, temperature and pressure at other elevation levels, etc., are stored in the files.
5.1 Evaluating spatial data reallocation

We have done spatial reallocation for ISCCP-C2 data sets of different size. As start grid we chose the ISCCP grid. It is an equal area grid with cells of size 2.5° × 2.5° at the equator. The height of the cells remains constant (2.5°); the width is a function of the latitude:

    width = 2.5° sin(2.5°) / (sin(φ + 2.5°) − sin(φ))

where φ is the latitude in radians. The start grid consists of 6596 cells. The start data is an uncompressed grid file called ISCCPdata0 which contains all 72 attributes of the ISCCP-C2 data. For our tests its size varies from approx. 3 MByte (one month = 6596 records) to approx. 127 MByte (60 months = 395'760 records). As destination grid we chose a CDIAC ([6]) grid. It is organized into 1820 equal area cells of size 5° × 5° at the equator. The exact definition is shown in Table 1.
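For illustration, the width formula can be evaluated with a few lines of C (not taken from the paper's software; function names and the printed sample values are our own):

    #include <math.h>
    #include <stdio.h>

    /* Sketch evaluating the cell width formula above for the equal area
       ISCCP grid: constant 2.5 degree height, width growing with latitude
       so that every cell covers the same area.                            */
    #ifndef M_PI
    #define M_PI 3.14159265358979323846
    #endif
    #define DEG2RAD (M_PI / 180.0)

    double isccp_cell_width(double lat_deg)   /* latitude of the cell's southern edge */
    {
        double phi = lat_deg * DEG2RAD;
        double h   = 2.5 * DEG2RAD;           /* constant cell height of 2.5 degrees  */
        return 2.5 * sin(h) / (sin(phi + h) - sin(phi));
    }

    int main(void)
    {
        /* width in degrees longitude at a few latitudes */
        printf("%.2f\n", isccp_cell_width(0.0));    /* equator: 2.50                  */
        printf("%.2f\n", isccp_cell_width(60.0));   /* about 5.2 degrees at 60 N      */
        return 0;
    }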
Figure 5: The grid cells' width as a function of the geographical latitude of the ISCCP and the CDIAC grid cells.
Results

The 1820 range queries, the calculations to reallocate ISCCPdata0 and the I/O transfers necessary to write the elements of the new data set in grid file format on a sequential file took the times shown in Fig. 6.
Figure 6: Time required to reallocate data of different sized files (ASCII format). The figure plots the time in minutes against the number of monthly batches of data.

    cell dimensions      latitude range     number of cells in zone
    lat      lon
    5°       5°          50°N to 50°S       72
    5°       10°         50° to 70°         36
    5°       20°         70° to 80°         18
    5°       40°         80° to 85°         9
    5°       360°        85° to 90°         1

Table 1: The CDIAC data set is stored in an equal area grid with a total number of 1820 grid cells.

Both start and destination grid cover the entire globe. A comparison of the two grids is shown in Fig. 5.
When working with traditional methods (cf. Sections 3 and 5.4) the same task requires approximately four hours.

Global climatological data - or other Earth Science data, for that matter - can be spatially reallocated using the method described in Section 3.2 only after they have been inserted into a grid file, as shown in Fig. 2. Consequently, it is of interest how much storage the grid file consumes, how much time the insertion process requires and, furthermore, how much additional time must be spent on the data compression discussed in Section 4.2.
5.2 Storage space requirements and the benefits of compression

Before data can be managed with a grid file, they must first be parametrized and their search space needs to be defined. A suitable choice for key and non-key attributes is not always obvious, however, and sometimes it is worthwhile to experiment with smaller, representative data sets before definitely deciding which search key combination to use. For this reason, the ISCCP-C2 data set was divided and inserted into two grid files with gft/1.
Designing the search space

The first grid file, called ISCCPdata1, consists of about half a million records, each representing the values for 28 of all the 72 attributes. The second grid file, called ISCCPdata2, stores about six million records, representing the values of the remaining 35 attributes, including quantities such as cloud top pressure, cloud top temperature, frequency of cloudiness and cloud optical thickness of the seven cloud types. These are non-key attributes.

For grid file ISCCPdata1 the following search-key attributes have been defined:

1. identification number of the cell
2. year
3. month
4. longitude of the cell's lower left vertex
5. latitude of the cell's lower left vertex
6. longitude of the cell's upper right vertex
7. latitude of the cell's upper right vertex
8. surface type (land, sea, coast)

The search-key attributes for grid file ISCCPdata2 are identical to those of ISCCPdata1, except that a ninth search-key has been included, namely the type of cloud that was observed. Note that including this additional key-attribute leads to the difference in the size of the grid files because records that are identical with respect to the first eight search keys can now be identified as being distinct, based on the different values of the ninth search key. In the grid file a given search key combination exists only once.

With two data sets of different size we were also able to compare the performance of the grid file when it can be stored entirely in central memory (grid file ISCCPdata1 fits entirely into the available 64 MB) with its performance when disk accesses are necessary. A point query - all search-keys specified with a single value - submitted to gft/1 will always require at
most one disk access. Range queries, however, can lead to an arbitrary number of disk accesses (possibly as many as there are data buckets); a high precision for answers to range queries is therefore important. The fraction of valid records retrieved depends to a large degree on the granularity of the search space in combination with the geometry of the query region. The granularity, imposed by the grid file's directory, depends on the number of key-attributes, the splitting policy and the number of data buckets created during insertion. The number of buckets created is a consequence of the data bucket's capacity and the average bucket occupancy which, to some extent, can be controlled with the system's splitting policy. Our splitting policy is such that the search space is refined first in those intervals which have the most entries. Tests with the same data but different bucket capacities have shown that, for this application, a bucket size of 100 yields the best results overall.
Results

It is important to note that the substantial savings in storage space, as shown in Table 2, are a consequence of lossless data compression. In Table 2 one can also observe that the ISCCPdata2 grid file's potential for savings in storage space is greater than that of the ISCCPdata1 grid file. The reason is that the ISCCPdata2 file has more key attributes than the ISCCPdata1 grid file, so that a better choice for run-length encoding exists.

                            ISCCPdata1      ISCCPdata2
    number of records       593'640         5'936'400
    ASCII coded             148'989 KB      469'578 KB
    gft/1 regular           201'259 KB      n.a.
    gft/1 compressed        29'734 KB       48'672 KB
    potential savings       80%             90%

Table 2: Storage space in KByte used for the two grid files, each storing ninety monthly ISCCP-C2 files. The size of an uncompressed ISCCPdata2 grid file is not available because storage space restrictions made it necessary to compress the bucket file before all data had been inserted.
5.3 The speed of the grid file system gft/1

Time required for insertion and compression

We have timed the insertion of 60 monthly files, i.e. cloud data covering a five-year period. Because of
local disk space restrictions, it was necessary to compress the grid file every time a month's worth of data had been inserted. The times required for both operations, insertion and compression, based on yearly increments, are listed in Tables 3 and 4.

    Year    ISCCPdata1      ISCCPdata2
            increments      increments
    1984    62              121
    1985    71              190
    1986    72              291
    1987    72              370
    1988    76              398

Table 3: Time spent in seconds to insert the cloud data, recorded for data volumes representing one year.

    Year    ISCCPdata1      ISCCPdata2
            increments      increments
    1984    29              114
    1985    44              151
    1986    52              282
    1987    59              362
    1988    64              488

Table 4: Time spent in seconds to compress the data bucket file.
The figures in Tables 3 and 4 show that the times for insertion and compression increase almost linearly with the number of records inserted, with only a small non-linear component.
Time required to answer queries

The grid file's symmetrical treatment of key-attributes makes it not only an attractive file system for spatial data reallocation but also for many other applications that require fast answers to higher-dimensional range queries. To substantiate this claim, we timed six different types of range queries over the two grid files mentioned above. The six range queries were composed as follows:

1. one geographic grid box, all years and all months
2. small region on the earth's surface, all years and all months
3. all geographic grid boxes, one year and one month
4. large region on the earth's surface, one year and one month
5. one geographic grid box, one month, all years
6. small region on the earth's surface, one month, all years

Queries 1 and 2 represent regions in the search space that are typical when statistics across an entire year are calculated. Such statistics are collected to detect changes that happen during time spans extending over several years. Queries of type 3 and 4 result when constructing maps. When computing seasonal statistics one typically resorts to queries as illustrated with cases 5 and 6.

    query   ISCCPdata1                        ISCCPdata2
    type    records retrieved  seconds used   records retrieved   seconds used
    1       90                 0.379          900                 0.346
    2       880 (avg.)         2.032          7'000 (avg.)        4.327
    3       6'596              6.662          65'960              32.45
    4       770 (avg.)         1.109          5'000 (avg.)        3.351
    5       8                  0.283          80                  0.152
    6       1'000 (avg.)       0.614          10'000 (avg.)       1.613

Table 5: Performance of gft/1, evaluated with six types of range queries. Each query type was submitted twenty times with different parameters and the times averaged.
Table 5 relates the number of records satisfying each query to the times required to retrieve them in grid files ISCCPdata1 and ISCCPdata2. The number of records retrieved for query types 2, 4, and 6 have been averaged because the number of cells involved changed randomly among the twenty queries. The numbers in Table 5 illustrate an important characteristic of the grid file, namely the preservation of order in the key attribute domains. Because of this characteristic, the time spent retrieving records satisfying the specifications of a range query depends primarily on the size of the search region, and much less on the size of the grid file.
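The six query types can all be expressed as one box in the search space: every key-attribute receives an interval, with fully specified attributes shrinking to a single value and unconstrained attributes spanning their whole domain. The following C sketch shows two of the query types; the attribute names and the query layout are assumptions and do not represent the gft/1 query interface.

    #include <float.h>

    /* Hypothetical sketch of the box-shaped range queries used above. */
    struct interval { double lo, hi; };

    struct range_query {
        struct interval cell_id, year, month;
        struct interval lon_sw, lat_sw, lon_ne, lat_ne;
        struct interval surface_type;
    };

    static struct interval exactly(double v)           { return (struct interval){ v, v }; }
    static struct interval anything(void)              { return (struct interval){ -DBL_MAX, DBL_MAX }; }
    static struct interval between(double a, double b) { return (struct interval){ a, b }; }

    /* query type 1: one geographic grid box, all years and all months */
    struct range_query query_type_1(double cell_id)
    {
        struct range_query q;
        q.cell_id      = exactly(cell_id);
        q.year         = anything();              /* all years  */
        q.month        = anything();              /* all months */
        q.lon_sw = q.lat_sw = q.lon_ne = q.lat_ne = anything();
        q.surface_type = anything();
        return q;
    }

    /* query type 4: a large region on the earth's surface, one year, one month */
    struct range_query query_type_4(double lon0, double lat0,
                                    double lon1, double lat1,
                                    int year, int month)
    {
        struct range_query q;
        q.cell_id      = anything();
        q.year         = exactly(year);
        q.month        = exactly(month);
        q.lon_sw       = between(lon0, lon1);     /* cells whose vertices fall */
        q.lat_sw       = between(lat0, lat1);     /* inside the region         */
        q.lon_ne       = between(lon0, lon1);
        q.lat_ne       = between(lat0, lat1);
        q.surface_type = anything();
        return q;
    }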
5.4 Queries on gft/1 compared with traditional reading from ISCCP-C2 files

The run time of any application that uses ISCCP-C2 data is the sum of the time spent on reading the data from the files, performing the calculations and then writing the results. Here, only the first term of this sum is assessed, for a traditional solution and for a gft/1 solution. For this comparison we chose the situation most favourable to the traditional solution, since it is easy to select or construct a situation or a problem and then construct a data structure that will perform better in this special, isolated situation.
Traditional files

An application program that uses all or parts of the data in one or more files reads whole files into a buffer of size 6596 by 72, since there are 6596 grid cells with 72 values in each ISCCP-C2 file. Depending on the application, additional buffers are needed, e.g. accumulators for the calculation of temporal statistics. The time to uncompress and read an ISCCP-C2 file into the buffer is 56.7 seconds with a standard deviation of 1 second on the SUN SPARC 10 (sample size 30). This time is also used if only the values in a small region are needed for calculations or mapping (e.g. the values in the 174 cells in the region between 0°E and 20°E and 0°N and 20°N). This time doubles if two files are read, and approximately 12 minutes are used to read 12 files for the calculation of a yearly statistic of only 1 or all 6596 grid cells.

gft/1
These results are compared with the execution time of queries on the grid file ISCCPdata3. The grid file ISCCPdata3 contains all attributes of the ISCCP-C2 data with the exception of the four ISCCP-C2 latitude and longitude indexes, which have been replaced by the coordinates of the lower left and upper right vertices of the cells. These vertices, the surface type and the additional attributes such as the number of the cell, year and month were defined as key attributes (cf. the definitions of ISCCPdata1 and ISCCPdata2 above). ISCCPdata3 occupies 55 MByte compressed on disk. Monthly batches of data were read from ISCCPdata3 into a 6596 by 75 buffer in 23 seconds with a standard deviation of 1.6 seconds (random sample of size 30 drawn from the 90 monthly batches of data in ISCCPdata3). Queries on parts of a monthly batch of data are faster by an order of magnitude as compared to the traditional solution; e.g. the values in the 174
grid cells in the region between 0°E and 20°E and 0°N and 20°N are read into the buffer of the application in less than 1 second. Range queries on the time dimensions and small regions of some 100 cells are faster by orders of magnitude as compared to the traditional solution. In these situations, the superior performance of the grid files is due to the symmetric keys. But also in the situations where the traditional solution delivers the best results, the grid file is faster by a factor of approximately 2.46 (56.7 s / 23 s).
Search space compression

The timing figures for range queries reported above have all been recorded with grid files whose search space had been compressed (cf. Section 4.1), for the simple reason that data retrieval in an inflated region directory takes longer in all cases. Tests with our cloud data have shown that the volume of the region directory in terms of gridblocks can be reduced by a factor of 4 to 5. As a consequence, the average time required to answer a query has been cut in half (with reductions ranging from a factor of 1.2 to 7.5).
5.5 Summary and future work

The technical and administrative difficulties that emerge when earth scientists (climatologists in our project) have to deal with large, heterogeneous data sets have motivated us to investigate methods to offer researchers better and more efficient access to scientific data sets. From a data management point of view we aimed at two fundamental goals, namely 1) to reduce storage space requirements through lossless compression, and 2) to provide fast, symmetric, multi-key access to these compressed data sets. The benefits to the climatologists are twofold: 1) data sets that were previously stored in different formats can now be kept in single files that provide flexible access over many attributes, and 2) methods to interpolate global climate data that were hitherto impractical and slow become attractive alternatives.

We will continue this project with additional work on different fronts.

1. Some of the methods discussed in this paper will be applied to other types of earth science data, such as files from the SEQUOIA 2000 benchmark.
2. More needs to be known about the relationship between a given scientific data set and the best
way to parameterize it so that the corresponding higher-dimensional search space provides the most efficient access to the data.
3. There exist more sophisticated ways to organize the region directory of a grid file; additional work is necessary along those lines.
4. The region directory of the grid file contains information about multidimensional density distributions in the data set to which it provides access. This can be utilized as an abstraction mechanism with great potential for lossy but powerful compression techniques.
Acknowledgements

The authors are grateful to Patrick Ludi for programming large parts of gft/1 and his assistance in applying the program, and Lorenzo Hutter for writing an early version of the sdr program. We also thank Fabio Poroli for the preparation of the cloud data sets and the testing of gft/1. This project is being supported in part by Swiss National Science Foundation Grant No. 21-27705.89 and Grant No. 2100-037698.93.
References

[1] J. C. French, A. K. Jones, J. L. Pfaltz (eds.). Scientific Database Management. Computer Science Report No. TR-90-22, Department of Computer Science, University of Virginia, Charlottesville, VA, 1990.
[2] W. B. Rossow, R. A. Schiffer. ISCCP Cloud Data Products. Bull. Am. Met. Soc., 72:2-20, 1991.
[3] S. Sorlie (ed.). Langley DAAC Handbook. Draft Feb. 1993. Langley Distributed Active Archive Center, NASA Langley Research Center, Hampton, VA, 1993.
[4] S. G. Warren et al. Global distribution of total cloud cover and cloud type amounts over land. NCAR TN 273+STR, National Center for Atmospheric Research, Boulder CO, 1986.
[5] S. G. Warren et al. Global distribution of total cloud cover and cloud type amounts over the ocean. NCAR TN 317+STR, National Center for Atmospheric Research, Boulder CO, 1988.
[6] C. Hahn et al. Climatological Data for Clouds over the Globe from Surface Observations. NDP-026, Carbon Dioxide Information Analysis Center, Oak Ridge, 1987.
[7] K. E. Trenberth and J. G. Olson. ECMWF global analyses 1979-1986: Circulation statistics and data evaluation. NCAR TN 300+STR, National Center for Atmospheric Research, Boulder CO, 1988.
[8] European Center for Medium-Range Weather Forecasts. Binary data representation FM92 GRIB. Technical report, European Center for Medium-Range Weather Forecasts ECMWF, Reading, GB, 1992.
[9] H. Gilgen et al. Baseline Surface Radiation Network (BSRN). Technical plan for BSRN data management, version 1.1. WMO/TD-No. 443, World Meteorological Organization, Geneva, 1993.
[10] H. Gilgen and D. Steiger. The BSRN database. In Proceedings of the Sixth International Working Conference on Scientific and Statistical Database Management, Ascona, Switzerland, pages 307-326, 1992.
[11] W. B. Rossow, L. Garder. Selection of a Map Grid for Data Analysis and Archival. J. of Clim. Appl. Met., 23:1253-1257, 1984.
[12] WMO Commission for Basic Systems CBS and Working Group on Data Management. Manual on Codes, Volume 1: International Codes, Part B: Binary Codes. WMO/TD-No. 306, World Meteorological Organization, Geneva.
[13] J. W. Hurrell and G. G. Campbell. Monthly mean global satellite data sets available in GCM history tape format. NCAR TN 371+STR, National Center for Atmospheric Research, Boulder CO, 1992.
[14] J. Nievergelt, H. Hinterberger, K. Sevcik. The Grid File: An adaptable, symmetric multi-key file structure. ACM TODS, 9:38-71, 1984.
[15] K. A. Meier. GFT/1: A Grid File Tool in the UNIX Environment. Internal report, Institute for Scientific Computing, ETH Zurich, 1993.
[16] K. A. Meier. SDR: Spatial Data Reallocation. Working paper, Institute for Scientific Computing, ETH Zurich, 1994.
[17] H. Hinterberger. Data Density: A Powerful Abstraction to Manage and Analyze Multivariate Data. Informatik-Dissertationen ETH Zurich, No. 4, Verlag der Fachvereine, Zurich, 1987.