Towards Comprehensive Database Support for Geoscientific Raster Data

N. Widmann, P. Baumann
FORWISS (Bavarian Research Center for Knowledge Based Systems)
Orleansstr. 34, D-81667 Munich, Germany
phone: +49-89-48095-200, fax: +49-89-48095-203
Email: {widmann,baumann}@forwiss.tu-muenchen.de

Abstract

Raster data in various dimensions form a geoscientific information category of ever growing importance. Hence, there will be an increasing need for versatile and fast networked database support for raster data. This need, however, is hardly satisfied by current DBMSs, as they usually fall back on BLOBs, i.e., encoded byte streams, which only allow a large raster object to be retrieved as a whole. The DBMS thus offers poor quality of service and performance: all search and processing is left to the database application. RasDaMan is a DBMS under development which aims at domain-independent, comprehensive database support for multidimensional discrete data (MDD), i.e., raster data of arbitrary size and dimension. In this contribution, we demonstrate the applicability of RasDaMan to the geoscientific field, focusing on the extensions to standard SQL that provide flexible, format-independent MDD retrieval and update.

1 Introduction

Quantization of natural spatio-temporal phenomena leads to multidimensional data representations; likewise, data sets generated from simulations and experiments are frequently analyzed best by interpreting them in multidimensional form. The term multidimensional discrete data (MDD) has been coined for this information category which, technically, has the structure of an array with individual dimension, size, and base type. Technological advances in memory, storage, and processing power increasingly make it feasible to store online, retrieve, and process large amounts of raster data; the challenge for database management systems (DBMSs) now is to offer versatile, high-level MDD services that make optimal use of this new hardware.

Geo and environmental data are particularly challenging for MDD management in several respects. A wide variety of MDD structures often has to be processed jointly; e.g., projecting a satellite image on top of a 3-D representation of a DEM requires MDD of different dimensionalities to be handled. MDD can have fixed bounds, like Landsat images, or partly variable bounds, like MOMS images. As for size, MDD vary between some kilobytes per MDD for seismic data and tens of gigabytes for the results of a climate simulation run [Sara-93]. These accumulate to tera- and petabyte databases, as raster data are produced in continuous streams by an ever increasing number of sensors, for example in the Mission to Planet Earth program [Arda-94]. The pixel (or, more generally, cell) type of MDD encompasses binary and grayscale pixels, multispectral pixels, voxels containing floating point temperature values, and many others. Notably, cell information does not necessarily denote only scalar sensor values: a reference to a data set elsewhere in the database can be associated with a cell as well.

However, current GIS technology is far from being able to meet this challenge. In industrial GIS systems, raster data are mostly stored in operating system files, thereby giving up the advantages of DBMS technology. Only very limited raster support is available in some research systems and in object-relational commercial products.

The aim of the RasDaMan¹ project is to develop a DBMS offering the same quality of service on raster data as is available on conventional alphanumeric data. On the logical level, an MDD model is offered which allows structure definition of multidimensional arrays with fixed or variable bounds as well as retrieval and manipulation through both an ODMG-conformant [Catt-96] C++ API and a declarative query language enhancing standard SQL [ISO-92] with MDD operations. On the physical level, a specialized MDD storage manager is employed together with algebraic MDD query optimization to minimize disk access and network traffic. The overall goal is that response times depend, as far as possible, on the query result size instead of the database size. Transparent integration of different storage media, in particular tertiary storage systems such as tape archives, will make it possible to keep these huge data sets available for online retrieval.

By providing high-level raster data management services, the RasDaMan project seeks to make application development faster, less error-prone, and easier to maintain; as a consequence of the resulting lightweight applications, less expensive client hardware can be employed.

In this contribution, we demonstrate how a raster DBMS like RasDaMan can be applied in geo applications. Section 2 gives an overview of related work in both the GIS and the database community. In Section 3, the RasDaMan DBMS interface is presented. Implementation work is discussed in Section 4. Finally, Section 5 concludes the paper.

2 Related Work

GISs usually support 2-D MDD in the form of images; most of the classic vector-based GISs by now support images at least as backdrops. There are also GISs whose main functionality is carried out on a raster-based representation of geographical data (e.g. GRASS, IDRISI, TNTMips) and image processing systems with a strong focus on geographical applications (e.g. EASI/PACE, ERDAS IMAGINE). What these systems lack are classical DBMS properties such as ACID semantics for modifications or concurrency control for a centralized repository of data. On the other hand, GISs offer advanced application-specific functionality that a general raster DBMS cannot provide. Coupling a GIS with RasDaMan could offer the benefits of both technologies.

While relational DBMSs can only store and process MDD as unstructured BLOBs (binary large objects), new DBMS technologies like object-relational DBMSs (e.g. Informix Universal Server [Ston-95]) and object DBMSs [Loom-95] (e.g. O2 [Banc-92]) make it possible to implement user-defined data types managing complex data. However, support for MDD is limited in two ways. First, neither ODBMSs nor ORDBMSs can offer a declarative query language for MDD of arbitrary dimensionality, as they can extend their query language only by functions and cannot add new syntax for accessing multidimensional data. Second, the data are still stored in unstructured long fields, resulting in suboptimal performance when accessing parts of an MDD.

¹ Sponsored by the European Community under Esprit research grant no. 20073.

In the Paradise project [Pate-97], a domain-specific DBMS for GIS applications is built using a toolset for implementing DBMSs. This approach offers optimized storage, and the query language handles 2-D raster data and vector data. However, we think that a general DBMS for MDD is preferable to a DBMS specific to a certain type of application.

3 The RasDaMan Interface

RasDaMan’s conceptual level offers MDD data definition, manipulation, and retrieval in a declarative style. The interface consists of the query interface RasQL and the C++ API RasLib.

3.1 RasQL

The algebraic basis of the conceptual MDD model is inspired by the AFATL image algebra [Ritt-90]. Given a base type B and n finite integer intervals I1, ..., In ⊂ Z (n > 0), an MDD a is defined as

    a = { (x, a(x)) | x ∈ I1 × ... × In, a(x) ∈ B }

Note that with this definition index ranges do not have to start with 0, as is usual in many languages; negative index ranges are admissible. In the RasDaMan implementation, B can be any C++ type, be it a scalar value, a structure, or a reference.

AFATL Image Algebra defines a rich set of operations (in fact, it has been proven that any imaging operation can be expressed), and it is not feasible to incorporate a fully-fledged image processor in the DBMS. We have therefore decided to proceed incrementally: first, a set of operations has been determined which is considered indispensable for common applications, in particular with respect to the desired performance advantages. After an evaluation, the need for further operations will be investigated. Right now, the following categories are considered:

3.1.1 Schema information. For some MDD m, the function spatial_domain(m) retrieves an array of integer pairs (lo_i, hi_i) containing m’s current lower and upper bounds for all dimensions 0 ≤ i […]

3.1.3 Induced operations. […]

Example 4: "Band 3 of all Landsat TM images, with the intensity reduced by d".

    SELECT l.data.band3 - d
    FROM LandsatImages as l

[…] 17 and a[x] < 42 using 1

Quantifiers, as special condensers, are often used to "collapse" Boolean MDDs resulting from induced comparisons.

Example 6: Implementation of quantifiers through aggregate operators. Let a be an MDD, then

    all( p(a) )  = condense and over x in spatial_domain(a) using p(a[x])
    some( p(a) ) = condense or over x in spatial_domain(a) using p(a[x])

3.1.5 Partial updates. Because of the large size of the data items, we feel that, although ODMG OQL includes only retrieval, support for partial updates is warranted, too. To this end, the SQL update statement has been included in RasQL with an extension to specify an MDD part on the left-hand side of an assignment clause.
Example 7: Inserting band 5 sensor data into an existing Landsat image can be expressed as

    UPDATE LandsatImage SET data[ *:*, *:* ].band5 = […]

3.1.6 Data independence. Data independence frees the application from the need to cope with different image formats. By default, RasDaMan delivers MDD in the client machine’s hardware representation for arrays, ready for further processing. On request, MDD are accepted or delivered in a particular data exchange format indicated in the query. Of course, the format must be able to host the structure to be encoded: a seven-band multispectral image cannot be cast into GIF, whereas a single band can.

[Figure 1: I. Linear; II. Arbitrary tiling]

Example 8: The application reads a Landsat image in BIL format from tape into the variable bilimage on the client. Function invbil() converts this into the database-internal representation before storing it:

    INSERT INTO LandsatImage VALUE data = invbil(bilimage)

Conversely, a single band of the Landsat image stored earlier is delivered in TGA format by applying function tga():

    SELECT tga(l.data.band3)
    FROM LandsatImage as l

3.1.7 Complex queries. In general, MDD expressions can be used in the select part of a query and, if the outermost expression result type is scalar, also in the where part.

Example 9: "All Spot datasets having images with less than p % clouds." Let us assume cloud detection is based on the high albedo of water clouds in the visible band with threshold value c. The corresponding query is

    SELECT s
    FROM spotImages s
    WHERE sum( s.panchromatic > c ) / card(spatial_domain(s.panchromatic)) < p / 100

Example 10: The last example shows how thematic maps can be produced. Assume a DEM is to be colored in arbitrary steps. The standard SQL CASE statement can be used conveniently with induction here.

    SELECT CASE WHEN dem.height>100 THEN 1
                WHEN dem.height […]

    […] band3); }
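The MDD definition, induced operations, and the condenser-based quantifiers of Section 3.1 can be illustrated with a small executable model. The following Python sketch is not RasDaMan code: the class `MDD` and the helpers `card`, `condense`, `all_q`, `some_q`, and `cloud_fraction` are hypothetical names chosen for illustration, and the MDD is held in a plain dictionary rather than in tiled storage. It only mirrors the semantics stated in the text: an MDD as a mapping from I1 × ... × In to cell values, `all`/`some` as condensers over `and`/`or`, and the cloud ratio of Example 9.

```python
import operator
from functools import reduce
from itertools import product


class MDD:
    """Toy MDD: maps every point of I1 x ... x In to a cell value."""

    def __init__(self, domain, fill):
        # domain: one inclusive (lo, hi) integer pair per dimension;
        # lower bounds need not be 0 and may be negative.
        self.domain = [tuple(d) for d in domain]
        self.cells = {x: fill(x) for x in self.points()}

    def points(self):
        return product(*(range(lo, hi + 1) for lo, hi in self.domain))

    def spatial_domain(self):
        return list(self.domain)

    def induce(self, f):
        # Induced operation: apply f to each cell, keeping the domain.
        return MDD(self.domain, lambda x: f(self.cells[x]))


def card(domain):
    # Number of cells in a spatial domain.
    return reduce(operator.mul, (hi - lo + 1 for lo, hi in domain), 1)


def condense(op, neutral, mdd, using):
    # Models "condense op over x in spatial_domain(mdd) using using(mdd[x])".
    return reduce(op, (using(mdd.cells[x]) for x in mdd.points()), neutral)


def all_q(mdd, p):
    return condense(operator.and_, True, mdd, p)


def some_q(mdd, p):
    return condense(operator.or_, False, mdd, p)


def cloud_fraction(mdd, c):
    # sum(m > c) / card(spatial_domain(m)), following Example 9.
    cloudy = mdd.induce(lambda v: v > c)
    return condense(operator.add, 0, cloudy, int) / card(mdd.spatial_domain())


# A 2 x 3 MDD whose first dimension starts at -1 (illustrative values).
a = MDD([(-1, 0), (0, 2)], lambda x: 10 * x[1] + 5 * (x[0] + 1))
darker = a.induce(lambda v: v - 3)  # analogous to "l.data.band3 - d"
```

In this model a query such as Example 9 reduces to comparing `cloud_fraction(m, c)` with `p / 100` for each candidate image; a real server would evaluate the same expression tile by tile instead of materializing the Boolean MDD.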

4 Internal MDD Management

4.1 Storage

The trivial approach to storing MDD in databases, i.e. linearizing the MDD and storing it as one very large 1-D array of bytes, leads to disastrous performance for any operation beyond reading or writing the whole MDD. A simple geographical operation such as retrieving a cutout of an area of interest from a 2-D satellite image then requires reading the data surrounding the cutout from disk, transferring them over the network, and performing the cutout in client memory, although most of these data are not needed. In Figure 1-I, which shows the disk pages for linearized storage of an MDD as a BLOB, four disk pages are accessed during retrieval of the marked rectangular cutout.

The storage structure for an MDD object should be designed to minimize the amount of unnecessary data accessed when an operation is executed on the object or a part of it. This works best if the subdivision strategy chosen preserves spatial proximity within the object. Subdivision into equally sized tiles was already suggested in [Tamu-80]. Subdivision of MDD into arbitrary multidimensional rectangular tiles, possibly non-aligned as shown in Figure 1-II, was suggested in [Furt-93]. In this example, only two disk pages have to be accessed to retrieve the area of interest. The position of the area of interest shown is a favorable case for the tiling chosen for this MDD; in a less favorable case, more disk pages must be accessed to retrieve an area of the same size. With arbitrary tiling, however, the storage layout of an MDD can be optimized towards certain operations. A more in-depth discussion of tiling and storage layout of MDD objects in RasDaMan can be found in [Furt-97].
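The effect of the storage layout on the number of disk pages touched by a cutout can be made concrete with a small calculation. The Python sketch below rests on simplifying assumptions that are not taken from the paper: one-byte cells, row-major linearization, and regular aligned tiles that each fill exactly one page (the equally sized tiling of [Tamu-80], not the arbitrary tiling used by RasDaMan). The function names `linear_pages` and `tiled_pages` and all sizes are hypothetical.

```python
def linear_pages(width, page_cells, rows, cols):
    """Pages touched by a cutout when the image is stored row by row."""
    (r0, r1), (c0, c1) = rows, cols
    pages = set()
    for r in range(r0, r1 + 1):
        first = (r * width + c0) // page_cells
        last = (r * width + c1) // page_cells
        pages.update(range(first, last + 1))
    return pages


def tiled_pages(tile_h, tile_w, rows, cols):
    """Pages touched when each aligned tile_h x tile_w tile is one page."""
    (r0, r1), (c0, c1) = rows, cols
    return {
        (tr, tc)
        for tr in range(r0 // tile_h, r1 // tile_h + 1)
        for tc in range(c0 // tile_w, c1 // tile_w + 1)
    }


# 32 x 32 image, 64 cells per page, 8 x 8 cutout (64 cells, one page's worth).
aligned = ((8, 15), (8, 15))    # cutout coincides with one tile
shifted = ((4, 11), (4, 11))    # same size, straddles four tiles

linear_good = len(linear_pages(32, 64, *aligned))
tiled_good = len(tiled_pages(8, 8, *aligned))
tiled_bad = len(tiled_pages(8, 8, *shifted))
```

With these numbers the linear layout touches four pages for either cutout, while the tiled layout touches one page in the favorable case and four in the unfavorable one, which is the dependence on cutout position described above.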


In RasDaMan, arbitrary tiling for MDD is implemented using a specialized multidimensional index for fast access to the tiles of an MDD. By using object-oriented techniques, different index structures can be created for different objects just by specifying a specific subclass of a general index class for the object. For example, a directory structure could be used for a small object with aligned tiles to reduce overhead, or an R+-tree for another object with a more complex tiling scheme.
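The idea of choosing an index structure per object by subclassing a general index class can be sketched as follows. This is an illustrative Python model, not the RasDaMan class hierarchy: `TileIndex`, `GridDirectoryIndex`, and `ScanIndex` are hypothetical names, and `ScanIndex` is a plain linear scan that merely stands in for a real multidimensional index such as an R+-tree. Boxes are lists of inclusive (lo, hi) pairs, one per dimension.

```python
from abc import ABC, abstractmethod
from itertools import product


def intersects(a, b):
    """True if two inclusive n-dimensional boxes overlap."""
    return all(alo <= bhi and blo <= ahi
               for (alo, ahi), (blo, bhi) in zip(a, b))


class TileIndex(ABC):
    """General index class: maps a query box to the tiles it touches."""

    @abstractmethod
    def lookup(self, query):
        ...


class GridDirectoryIndex(TileIndex):
    """Directory for aligned, equally sized tiles: pure arithmetic."""

    def __init__(self, origin, tile_shape):
        self.origin = origin
        self.tile_shape = tile_shape

    def lookup(self, query):
        ranges = [range((lo - o) // t, (hi - o) // t + 1)
                  for (lo, hi), o, t in zip(query, self.origin, self.tile_shape)]
        return sorted(product(*ranges))


class ScanIndex(TileIndex):
    """Arbitrary rectangular tiles; a linear scan stands in for an R+-tree."""

    def __init__(self, tiles):
        self.tiles = dict(tiles)  # tile id -> box

    def lookup(self, query):
        return sorted(tid for tid, box in self.tiles.items()
                      if intersects(box, query))


grid = GridDirectoryIndex(origin=(0, 0), tile_shape=(8, 8))
scan = ScanIndex({
    "A": [(0, 9), (0, 4)],      # non-aligned tiling of a 20 x 20 object
    "B": [(0, 9), (5, 19)],
    "C": [(10, 19), (0, 19)],
})
```

Both classes answer the same `lookup` call, so the rest of a storage manager modeled this way would not need to know which structure backs a given object.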

4.2 The RasDaMan Architecture

The overall architecture of the RasDaMan DBMS is shown in Figure 2. The RasDaMan API consists of RasQL and RasLib, both explained in detail in Section 3. RasDaMan follows the classical two-tier client/server architecture: a RasDaMan client connects through RasLib to the RasDaMan server, which may run on a remote machine.

Query processing is done completely at the server: the Query Evaluator parses a RasQL query and builds an operator-based query tree. This query tree is optimized in two steps: first, algebraic query rewriting based on rules derived from the AFATL Image Algebra; second, physical optimization based on tiling, clustering, and device information. At the end of query translation, a RasQL query has been broken down into a sequence of operations on tiles. Because the declarative query language expresses the "what", not the "how", the algebraic query optimizer is free to rearrange the tile access sequence during query evaluation to find the best evaluation order.

To identify the tiles involved in a query and to calculate the cost of retrieving them, the Index Manager is consulted. The Catalog Manager takes care of schema information such as base type and dimension of MDD, whereas the Device Manager is responsible for handling different storage media. Finally, the tile sets identified are retrieved from the Cache Manager, and the operations specified in the query are applied to them. An interface layer between the RasDaMan modules and the base DBMS, the Base DBMS Interface, is responsible for storage of and access to all data in secondary and tertiary storage. This makes RasDaMan easily portable between different base DBMSs and storage systems.
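How a query breaks down into a sequence of operations on tiles can be illustrated for a simple aggregate over a cutout. The Python sketch below is only a model under stated assumptions: tiles are in-memory dictionaries, the data are 2-D, and `clip`, `make_tile`, and `sum_over` are hypothetical helpers rather than RasDaMan components. It shows the principle that tiles outside the query box are never fetched and that partial results from the remaining tiles are combined.

```python
def clip(a, b):
    """Intersection of two inclusive boxes, or None if they are disjoint."""
    out = [(max(alo, blo), min(ahi, bhi))
           for (alo, ahi), (blo, bhi) in zip(a, b)]
    return None if any(lo > hi for lo, hi in out) else out


def make_tile(box, f):
    """Materialize one 2-D tile with cell values f(r, c)."""
    (r0, r1), (c0, c1) = box
    rows = [[f(r, c) for c in range(c0, c1 + 1)] for r in range(r0, r1 + 1)]
    return {"box": box, "rows": rows}


def sum_over(tiles, query):
    """Sum all cells inside query, reading only the tiles it touches."""
    total, tiles_read = 0, 0
    for tile in tiles:
        part = clip(tile["box"], query)
        if part is None:
            continue  # tile not involved in the query: never fetched
        tiles_read += 1
        (r0, r1), (c0, c1) = part
        tile_r0, tile_c0 = tile["box"][0][0], tile["box"][1][0]
        for r in range(r0, r1 + 1):
            row = tile["rows"][r - tile_r0]
            total += sum(row[c0 - tile_c0:c1 - tile_c0 + 1])
    return total, tiles_read


# A 4 x 4 object with cell value 4*r + c, split into four 2 x 2 tiles.
boxes = [[(0, 1), (0, 1)], [(0, 1), (2, 3)],
         [(2, 3), (0, 1)], [(2, 3), (2, 3)]]
tiles = [make_tile(b, lambda r, c: 4 * r + c) for b in boxes]

inside_one_tile = sum_over(tiles, [(0, 1), (0, 1)])
across_all_tiles = sum_over(tiles, [(1, 2), (1, 2)])
```

Because addition is associative and commutative, the order in which the tiles are visited does not change the result, which is the kind of freedom a declarative query leaves to the optimizer.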

Figure 2: RasDaMan system architecture outline

5 Conclusion

The ever growing demand for raster data support in geoscientific application fields makes DBMS support for this information category indispensable. MDD must become just another kind of attribute in databases. The intention of RasDaMan is to relieve application developers of many low-level, costly, repetitive tasks by providing flexible high-level operations in the DBMS. Both RasDaMan and O2, the commercial object-oriented DBMS on top of which RasDaMan currently runs, have a C++ API conforming to the object database standard ODMG. Likewise, the RasDaMan query language, RasQL, extends standard SQL so that the application is provided with a uniform, standards-based database interface for raster and conventional data. Further, RasDaMan provides database services such as transactions, multiuser synchronization, recovery, etc.

Although benchmarking has not started yet, considerable performance improvements due to the streamlined architecture and first steps in internal optimization can already be observed. Next steps include further development of the DBMS functionality, such as incorporation of additional exchange formats and work on the storage manager and optimizer, but also evaluating RasDaMan in as many raster data application fields as possible to improve our understanding of their respective needs.

Acknowledgement

In addition to the authors, the FORWISS RasDaMan team consists of Paula Furtado and Roland Ritsch. The RasDaMan project is being carried out jointly by STI s.a., GeoForschungszentrum Potsdam, the Spanish National Geographic Institute, HGM, and FORWISS.

References

[Arda-94] Philip E. Ardanuy, Robert D. Price: Mission to Planet Earth: Four Decades of Accessible and Well-Characterized Environmental Data. Proceedings of the 1994 ASPRS/ACSM, 1994.

[Banc-92] F. Bancilhon, C. Delobel, P. Kanellakis: Building an Object-Oriented Database System. Morgan Kaufmann Publishers, San Mateo, CA, 1992.

[Catt-96] R. Cattell: The Object Database Standard: ODMG-93. Morgan Kaufmann Publishers, 1996.

[Furt-93] P. Furtado, J. Teixeira: Storage Support for Multidimensional Discrete Data in Databases. Computer Graphics Forum - Special Issue on Eurographics’93 Conference, vol. 12, no. 3, pp. 89-100, September 1993.

[Furt-97] P. Furtado, R. Ritsch, N. Widmann, P. Zoller, P. Baumann: Object-Oriented Design of a Database Engine for Multidimensional Discrete Data. Proc. of the OOIS ’97 Conference, Brisbane, Australia, November 1997.

[ISO-92] The International Organization for Standardization (ISO): Database Language SQL. ISO 9075, 1992(E), 1992.

[Loom-95] M. Loomis: Object Databases: The Essentials. Addison-Wesley, 1995.

[Pate-97] Jignesh M. Patel, Jie-Bing Yu, et al.: Building A Scalable GeoSpatial Database System: Technology, Implementation, and Evaluation. Proceedings of the 1997 ACM SIGMOD, 1997.

[Ritt-90] G. Ritter, J. Wilson, J. Davidson: Image Algebra: An Overview. Computer Vision, Graphics, and Image Processing, vol. 49, no. 1, pp. 297-336, Houston, Feb. 1990.

[Sara-93] S. Sarawagi, M. Stonebraker: Efficient Organization of Large Multidimensional Arrays. Very Large Databases Conf., 1993.

[Ston-95] M. Stonebraker, D. Moore: Object-Relational DBMSs: The Next Great Wave. Morgan Kaufmann Publishers, 1995.

[Tamu-80] H. Tamura: Image Database Management for Pattern Information Processing Studies. In: S. Chang, K. Fu (ed): Pictorial Information Systems. Lecture Notes in Computer Science Vol. 80, pp. 198-227, Springer 1980.