Spatial Indexing of Global Geographical Data with HTM

Zhenhua Lv, Yingjie Hu, Haidong Zhong, Bailang Yu, Jianping Wu*
Key Laboratory of Geographic Information Science, Ministry of Education
East China Normal University
Shanghai, China
*Corresponding author: [email protected]

Bo Li, Hui Zhao
Institute of Software Engineering
East China Normal University
Shanghai, China

Abstract: Spatial indexing is one of the most important techniques in spatial data management. Many spatial indexing techniques have been developed successfully, and each has advantages for particular applications. As a spatial data structure, the Hierarchical Triangular Mesh (HTM) offers global continuity, stability, hierarchy and uniformity, and it has attracted the interest of researchers for many years. This paper investigates the use of HTM as an index for global geographical data (currently only point-like objects). The HTM is defined by recursively subdividing a unit sphere; its basic elements are spherical triangles, which are encoded in the computer system as integers called HTM codes. At the global scale, all regions on the sphere are spherical, and they can be intersected with HTM elements by means of a few equations. The spatial position of each input object can also be represented by an HTM code, so HTM codes become the bridge between query regions and input objects. Our system combines a database management system (DBMS) with a distributed file system. The major information of the input files is extracted as metadata and stored in DBMS tables, while the original files are stored on a distributed file system (HDFS), which has the potential to support parallel processing. Millions of point-like objects distributed over the globe were examined, and the experiments indicated that the performance of the system was acceptable.

Keywords: spatial indexing; hierarchical triangular mesh; spherical triangle; HDFS

I. INTRODUCTION

With the development of spatial information science, more and more spatial data need to be stored and processed promptly. Managing massive spatial data at the global scale requires efficient spatial indexing schemes. The Hierarchical Triangular Mesh (HTM) model [1] is a data structure with the features of global continuity, stability, hierarchy and uniformity. The elements of HTM are spherical triangles, and each of them can be assigned a unique integer code that carries both geographical position and spatial resolution. Goodchild [2, 3] and Song et al. [4] created Discrete Global Grids with precisely equal areas. Szalay et al. [1] and Gray [5] used HTM for spatial indexing by spherical partitioning, mapping it onto a B-Tree index in SQL Server.

In this paper we follow their theory of HTM but take a different approach to the system architecture. The major difference is that our system combines a database management system (DBMS) with a distributed file system, and that metadata describing the datasets are stored in DBMS tables. We adopted this scheme for the following reasons. First, most types of input data in our system are unstructured, which makes them poorly suited to a DBMS, especially a relational database [6]. Metadata extracted from the input datasets contain the primary information and are structured and lightweight; hence they can easily be managed in DBMS tables. Storing raw datasets and metadata separately lightens the burden on the DBMS and improves system efficiency. Second, extracting metadata from large datasets is time-consuming, because it involves a great many IO operations that slow down processing. If the IO operations are executed in parallel on different machines, efficiency rises considerably. In addition, the distributed file system provides considerable storage capacity and good fault tolerance. Moreover, fast storage of and access to data are only the basic requirements for spatial data, while distributed processing is a more advanced one. Hence, it is important to design a spatial data system that can readily support distributed processing.

Metadata in our study are managed by MySQL, while raw datasets are stored on the Hadoop Distributed File System (HDFS), which is part of the Hadoop framework. Hadoop is an open source system hosted by the Apache Software Foundation [7]. It is an implementation of the MapReduce programming model, which was first proposed by Google Inc. and has been used successfully for many different applications [8]. This model is easy to use, even for programmers without experience with parallel and distributed systems, and it hides the details of parallelization, fault tolerance, locality optimization and load balancing, letting users focus on the applications themselves.

The rest of this paper is organized as follows. Section II briefly introduces the theory of the HTM model, including how the model is created, the coding scheme and the coordinate transformations. Section III defines spherical areas and then elaborates the algorithms for intersecting them with HTM triangles, followed by a discussion of optimization and limitations. Section IV describes the implementation of the system and analyzes the experimental results. Conclusions and further work are given at the end.

II. HIERARCHICAL TRIANGULAR MESH MODEL

A discrete global grid system (DGGS) is a hierarchical structure in which the earth's surface is subdivided into a series of discrete grids, and the grids represent spatial regions or points [9]. Generally, a DGGS can be designed in five explicit steps [10]: choosing a base polyhedron, a fixed orientation of the polyhedron in the coordinate system, a transformation between the polyhedron and the earth's surface, a hierarchical partition method, and an efficient indexing method for the hierarchical facets [11].

A. Definition

HTM originates from an octahedron inscribed in a unit sphere. Both the octahedron and the unit sphere are placed in a Cartesian coordinate system. The six initial vertices of the octahedron have the coordinates given in (1).

$$ v_0 := (0, 0, 1), \quad v_1 := (1, 0, 0), \quad v_2 := (0, 1, 0), \quad v_3 := (-1, 0, 0), \quad v_4 := (0, -1, 0), \quad v_5 := (0, 0, -1) \tag{1} $$

The globe is represented by a unit sphere located in the same Cartesian coordinate system as the octahedron; the relation between them is shown in Figure 1.

Figure 1. The globe's orientation in the three-dimensional Cartesian coordinate system [5].

After the sphere is placed in the three-dimensional Cartesian coordinate system, every point on the sphere with longitude and latitude values can be represented by a unit vector v = (x, y, z). The XZ plane represents the Prime (Greenwich) Meridian, which has longitude zero. The north and south poles (latitude 90° and -90°) are v = (0, 0, 1) and v = (0, 0, -1). Consequently, the relation between geographical positions and Cartesian coordinates is given by (2):

$$ x = \cos(lat)\cos(lon), \quad y = \cos(lat)\sin(lon), \quad z = \sin(lat) \tag{2} $$

The hierarchical subdivision of the sphere starts with eight spherical triangles defined by projecting the octahedron onto the sphere. A spherical triangle is given by three points on the unit sphere connected by great circle segments; that is, each edge of a spherical triangle is a segment of a great circle, and its midpoint also lies on the sphere and can be expressed by (3):

$$ w_0 = \frac{v_1 + v_2}{|v_1 + v_2|}, \quad w_1 = \frac{v_0 + v_2}{|v_0 + v_2|}, \quad w_2 = \frac{v_0 + v_1}{|v_0 + v_1|} \tag{3} $$

where (v0, v1, v2) are the three vertices of the spherical triangle and (w0, w1, w2) are the midpoints of its edges. The vertices and the edge midpoints are connected by great circle segments to obtain the next level of subdivision, so that each spherical triangle generates four smaller spherical triangles. The new children are given by (4):

$$ Triangle_0 := (v_0, w_2, w_1), \quad Triangle_1 := (v_1, w_0, w_2), \quad Triangle_2 := (v_2, w_1, w_0), \quad Triangle_3 := (w_0, w_1, w_2) \tag{4} $$

This process is repeated as many times as needed until an appropriate spatial resolution is reached. The number of triangles is multiplied by four each time. The octahedron's edges all span 90°, and the angle that a spherical triangle spans is cut in half by each further subdivision. In this way the hierarchical triangular mesh is formed.
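Equations (2) to (4) translate almost directly into code. The following is a minimal Python sketch of the coordinate transformation and of one subdivision step; the function names are illustrative and are not taken from the authors' Java library.

```python
import math

def lonlat_to_xyz(lon_deg, lat_deg):
    """Equation (2): longitude/latitude in degrees to a unit vector."""
    lon, lat = math.radians(lon_deg), math.radians(lat_deg)
    return (math.cos(lat) * math.cos(lon),
            math.cos(lat) * math.sin(lon),
            math.sin(lat))

def midpoint(a, b):
    """Equation (3): midpoint of the great-circle edge a-b, pushed back onto the sphere."""
    s = tuple(ai + bi for ai, bi in zip(a, b))
    norm = math.sqrt(sum(c * c for c in s))
    return tuple(c / norm for c in s)

def subdivide(v0, v1, v2):
    """Equation (4): the four children of the spherical triangle (v0, v1, v2)."""
    w0, w1, w2 = midpoint(v1, v2), midpoint(v0, v2), midpoint(v0, v1)
    return [(v0, w2, w1), (v1, w0, w2), (v2, w1, w0), (w0, w1, w2)]

# One octant of the octahedron, using the vertices v0, v1, v2 from (1).
north_pole = lonlat_to_xyz(0.0, 90.0)
children = subdivide((0.0, 0.0, 1.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
```

Because every midpoint is renormalized, all vertices of all children stay on the unit sphere, which is what allows the same step to be applied recursively.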

B. HTM Coding and Transformation

As mentioned previously, each spherical triangle generates four children when subdivided. There are many schemes for coding the children; here they are labeled 0, 1, 2 and 3 in counter-clockwise order. The initial eight triangles also have fixed codes. Each triangle has a unique code called the HTM ID, which is obtained by appending its label to its parent's ID. The HTM ID uniquely defines the triangle's depth (spatial resolution) and its location on the sphere, so it can be used to identify spatial objects on the earth.

Most geospatial data have geographical positions represented by longitude and latitude pairs, or can easily be transformed into that form. A mapping from geographical positions to HTM IDs is therefore needed before HTM IDs can be associated with geo-objects. Since geographical coordinates can be transformed directly into Cartesian coordinates by (2), calculating HTM IDs from geographical coordinates is equivalent to calculating them from Cartesian coordinates. The procedure is given in Table I.

TABLE I. TRANSFORMING CARTESIAN COORDINATES TO HTM IDS

Algorithm: xyzToHtmid
Input: (x, y, z, depth), where x, y, z are the 3-d Cartesian coordinates of the input point p, and depth indicates the depth of subdivision.
Output: (htmid)
1. Decide on which face of the initial octahedron the input point p lies, and get the initial id and the vertices of the selected face. They are denoted topid and v0, v1, v2. The result htmid is initialized to topid.
2. Decrease the depth. If the value of depth is still greater than zero, continue the process; otherwise the process is finished.
3. Generate the four children of the triangle that contains p. They are denoted t0, t1, t2 and t3.
4. If one of the four triangles ti contains p, append the number i to htmid.
5. Repeat steps 2 through 4 until the process is finished.

At a fixed level of subdivision, a lon-lat point is represented by the spherical triangle identified by its HTM ID. Since a point has zero area but a spherical triangle does not, the positional deviation is bounded by the area of that triangle and depends on the depth of subdivision. The resolution of HTM is approximately one meter when the depth is twenty-one [9]. At this point, every point-like object on the sphere can be associated with an HTM ID that specifies its geographical position. However, to support queries, HTM IDs must also support intersection with arbitrary spherical areas.

III. SPHERICAL AREAS AND QUERY

There are three types of geometric areas: halfspace, convex and region. The halfspace is the basic building block of convexes and regions. A convex is the intersection of some halfspaces, and a region is in turn the union of some convexes. Every shape on the sphere can be represented by a region. Querying by a region means selecting all objects whose geographical positions lie inside the region; in other words, we need to select all HTM IDs that lie inside the region. Since a query region is represented as a list of HTM IDs, the query becomes a simple selection of integers in the database.

A. Geometry Primitives

• Halfspace

A halfspace defines a cap on the unit sphere that is sliced off by a plane, and it has the form (5):

$$ h := \{v; d\}, \quad |v| = 1, \quad -1 \le d \le 1 \tag{5} $$

where v is the normal vector of the cutting plane, pointing from the origin into the halfspace, and the scalar d is the distance from the origin to the plane along v. Figure 2 illustrates a halfspace.

Figure 2. A typical example of a halfspace [5].

The center of Shanghai, China, is located at (121.466383, 31.235406) in longitude and latitude, which can be expressed as (6).

$$ v = (-0.446331, 0.729307, 0.518555) \tag{6} $$

The query circle around Shanghai with an angular radius of one degree can be expressed as (7).

$$ h = (-0.446331, 0.729307, 0.518555, 0.999848) \tag{7} $$

The radius of this circle is sixty nautical miles (one degree equals sixty arc minutes), or approximately 111.12 kilometers.

• Convex

A convex is the intersection of some halfspaces, i.e. the intersection of caps on the unit sphere. It can be defined as (8).

$$ c := h_1 \,\&\, h_2 \,\&\, \dots \,\&\, h_n, \quad n \in N^+ \tag{8} $$

Most simple shapes, such as spherical rectangles and polygons, can be represented directly by convexes.

• Region

A region is the union of a number of convexes and can represent any area on the sphere. It can be defined as (9).

$$ r := c_1 \,|\, c_2 \,|\, \dots \,|\, c_n, \quad n \in N^+ \tag{9} $$
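The Shanghai halfspace in (7) and the definitions (8) and (9) can be checked numerically. The sketch below is a minimal Python illustration, not the authors' implementation: it assumes that the scalar d of a circular query equals the cosine of its angular radius (which reproduces the value 0.999848 in (7)), and it treats a point p as inside a halfspace when v · p > d, the same inequality the paper states later as (10). All function names are hypothetical.

```python
import math

def unit_vector(lon_deg, lat_deg):
    """Longitude/latitude in degrees to a unit vector, as in (2)."""
    lon, lat = math.radians(lon_deg), math.radians(lat_deg)
    return (math.cos(lat) * math.cos(lon),
            math.cos(lat) * math.sin(lon),
            math.sin(lat))

def halfspace_from_circle(lon_deg, lat_deg, radius_deg):
    """Build h = (vx, vy, vz, d); assumes d = cos(angular radius)."""
    return unit_vector(lon_deg, lat_deg) + (math.cos(math.radians(radius_deg)),)

def in_halfspace(h, p):
    """True when p lies inside the cap: v . p > d."""
    return h[0] * p[0] + h[1] * p[1] + h[2] * p[2] > h[3]

def in_convex(convex, p):
    """Definition (8): a convex is an intersection of halfspaces."""
    return all(in_halfspace(h, p) for h in convex)

def in_region(region, p):
    """Definition (9): a region is a union of convexes."""
    return any(in_convex(c, p) for c in region)

shanghai = halfspace_from_circle(121.466383, 31.235406, 1.0)
near = unit_vector(121.466383, 31.735406)   # 0.5 degrees north of the center
far = unit_vector(121.466383, 33.235406)    # 2 degrees north of the center
```

Here `in_halfspace(shanghai, near)` holds while `in_halfspace(shanghai, far)` does not, since only the first point is within one degree of the center.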

B. Intersecting with HTM Triangles

Given a region on the unit sphere, we compute a list of triangles (HTM IDs) that cover the region. Since a region is ultimately composed of halfspaces, we first determine the relationship between a halfspace and a spherical triangle of the HTM. A halfspace defines a cap by {vh, d}, while a spherical triangle is identified by its three corner points, i.e. three unit vectors v1, v2, v3. If all three corner points are inside the halfspace, the spherical triangle is fully inside it. Otherwise, the spherical triangle is outside or partially inside the halfspace. Any corner point vi of the triangle that lies inside the halfspace satisfies (10).

$$ v_h \cdot v_i > d \tag{10} $$
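Counting how many corners satisfy (10) gives a first, coarse classification of a triangle against a halfspace. The following Python sketch shows only that first pass, under the paper's statement that three corners inside means the triangle is fully inside; the labels and function names are assumptions made for illustration, and the zero-corner case is deliberately left undecided because it needs further tests.

```python
def dot(a, b):
    """Dot product of two 3-d vectors."""
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def corners_inside(halfspace, triangle):
    """Count the triangle corners v_i with v_h . v_i > d, as in (10)."""
    v_h, d = halfspace
    return sum(1 for v_i in triangle if dot(v_h, v_i) > d)

def first_pass(halfspace, triangle):
    """Coarse classification based on corner points only."""
    n = corners_inside(halfspace, triangle)
    if n == 3:
        return "inside"      # all corners in the cap
    if n > 0:
        return "partial"     # one or two corners in the cap
    return "undecided"       # no corner in the cap: more tests are required

# One octant face of the octahedron and three caps around the z axis.
octant = ((0.0, 0.0, 1.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
small_north_cap = ((0.0, 0.0, 1.0), 0.5)
large_north_cap = ((0.0, 0.0, 1.0), -0.5)
south_cap = ((0.0, 0.0, -1.0), 0.5)
```

With these inputs the small northern cap contains only the pole corner, the large northern cap contains all three corners, and the southern cap contains none.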

If one or two corner points of the spherical triangle are contained in the halfspace, the triangle partially intersects the halfspace. If no corner point lies inside the halfspace, the triangle may still intersect it partially, lie outside it, or contain it, and further tests are needed. If one of the triangle's sides intersects the halfspace, this is again a case of partial intersection. Otherwise, test the value of (11).

$$ (v_i \times v_j) \cdot v_h < 0, \quad (i, j) \in \{(1, 2); (2, 3); (3, 1)\} \tag{11} $$

If none of the pairs (i, j) satisfies (11), the triangle (v1, v2, v3) contains the halfspace; otherwise it is outside the halfspace.

Intersecting a convex with HTM triangles is not as easy as intersecting a halfspace, because a convex is composed of several halfspaces and may have holes, which complicates the decision process. Nevertheless, halfspace intersections remain the basic computations: intersecting with convexes or regions is inevitably built on them.

To obtain a list of HTM triangles that covers a halfspace at a fixed depth, the process starts from the eight initial spherical faces of the octahedron and proceeds to their descendants until the specified depth is reached. The HTM IDs of triangles that are fully inside the halfspace are added to the list. Triangles that partially intersect the halfspace are examined further by subdivision if the depth allows. Thus, given a query region, a list of HTM IDs that covers it can be computed.

C. Optimization

Several simplifications are possible when computing the triangle list for a region, including removing duplicate halfspaces, identifying complements and nulls, and dropping halfspaces that cover the whole sphere. In addition, to reduce the length of the list, each item in the list specifies a range of triangle IDs instead of a single ID. For more details, please refer to [1].
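The range representation mentioned above can be sketched in a few lines of Python. The sketch assumes that appending a child label to a parent ID corresponds numerically to `child = 4 * parent + label` (two bits per level), so that all descendants of a triangle at a deeper level form one contiguous integer interval. Both function names are hypothetical, and the IDs in the example are illustrative.

```python
def descendant_range(htm_id, level, target_level):
    """Inclusive range of IDs at target_level covered by htm_id found at level.

    Assumes two bits are appended to the ID per subdivision level.
    """
    shift = 2 * (target_level - level)
    return (htm_id << shift, ((htm_id + 1) << shift) - 1)

def merge_ranges(ranges):
    """Merge overlapping or directly adjacent inclusive integer ranges."""
    merged = []
    for lo, hi in sorted(ranges):
        if merged and lo <= merged[-1][1] + 1:
            merged[-1][1] = max(merged[-1][1], hi)
        else:
            merged.append([lo, hi])
    return [tuple(r) for r in merged]

# A triangle with ID 14 accepted at level 0, expressed at levels 1 and 2.
level1 = descendant_range(14, 0, 1)
level2 = descendant_range(14, 0, 2)
compact = merge_ranges([(80, 83), (56, 59), (60, 63)])
```

A triangle accepted as fully inside at a shallow level therefore contributes a single range rather than all of its deepest descendants, and neighbouring ranges collapse into one item of the list.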

D. Limitation

The HTM algorithms so far support only point-like input objects, which means that the areas of objects on the sphere are treated as zero. The types of query available to users are thus limited to ‘which objects are near a given point’ or ‘which objects are contained in a region’. If input objects have polygon-like shapes, we choose an arbitrary point inside each object to represent it, which inevitably incurs deviations.

IV. SYSTEM IMPLEMENTATION

Gray [5] implemented a database system that supports spatial queries on SQL Server 2005. In contrast, in this paper we have re-implemented the HTM library in Java and deployed it on a distributed system. In general, there are three types of architecture for managing massive data: using a file system, using a DBMS, and using a combination of both. We adopted the last for two reasons. First, it is not easy for users to manage massive data with a file system alone, because much of the work must be done by the users themselves. This work involves techniques such as searching, indexing and scheduling, which are difficult to implement but have existed in database management systems for many years. Second, a DBMS usually demands a strict data schema, and inserting raster data (especially large files) into tables is slower than storing them directly on a file system. By contrast, the metadata of input files, being strictly structured, small in size and rich in description, are more suitable for storage in tables. Additionally, extracting metadata from a large number of input files takes much time because of the many IO operations involved, but this cost is greatly alleviated if the time-consuming operations are distributed over several machines working in parallel.

The file system used here is the Hadoop Distributed File System (HDFS), a basic building block of the Hadoop framework. Hadoop is an open source system that supports the MapReduce model, which was introduced by Google Inc. as a method of solving very large problems on large clusters. In this model, applications are built from two distinct steps, the map and reduce operations. Input files are automatically split into logical chunks, and each chunk is processed independently [12]. HDFS provides high reliability and fault tolerance. Input data are mirrored to multiple storage nodes, a technique called replication. As long as one replica of a data chunk is available, the user will not notice the failure of storage nodes.

A. Logical Architecture

The system is composed of raw files, metadata, the DBMS, Hadoop, the HTM Library and the Query Application. The logical view of our system is shown in Figure 3. Metadata contain the primary and frequently used information. At a minimum they consist of keys that identify the objects, geographical positions, HTM IDs derived from the geographical positions, and file paths in the HDFS path scheme. The file paths tell the system where the original files are, which is necessary when users want to access them after getting the metadata from the database.

The HTM Library was re-implemented in Java, because the earlier one was written in C#, which was not compatible with Hadoop. Although a sixty-four-bit long integer can hold HTM IDs up to depth thirty, we used only depth twenty-one, considering the computational complexity and real-world requirements. The spatial resolution on the sphere at this depth is approximately one meter.
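The xyzToHtmid procedure of Table I, combined with an integer representation of the ID, can be sketched as follows. This is a Python illustration rather than the authors' Java code, and it rests on several assumptions: the face table and the base IDs 8 to 15 are assumed to follow the convention of [1]; appending a label is read numerically as `htmid * 4 + i`; `levels` counts subdivisions below the eight faces, which may differ by one from the `depth` counter of Table I; and choosing the face or child with the largest edge margin, instead of testing containment with a tolerance, is a robustness choice of this sketch.

```python
import math

# Octahedron vertices v0..v5 from (1).
V = [(0, 0, 1), (1, 0, 0), (0, 1, 0), (-1, 0, 0), (0, -1, 0), (0, 0, -1)]
# (base id, vertex indices) of the eight faces; assumed convention, see above.
FACES = [(8, (1, 5, 2)), (9, (2, 5, 3)), (10, (3, 5, 4)), (11, (4, 5, 1)),
         (12, (1, 0, 4)), (13, (4, 0, 3)), (14, (3, 0, 2)), (15, (2, 0, 1))]

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def midpoint(a, b):
    s = (a[0] + b[0], a[1] + b[1], a[2] + b[2])
    n = math.sqrt(dot(s, s))
    return (s[0] / n, s[1] / n, s[2] / n)

def margin(tri, p):
    """Smallest of the three edge tests; non-negative when p is inside tri."""
    a, b, c = tri
    return min(dot(cross(a, b), p), dot(cross(b, c), p), dot(cross(c, a), p))

def children(tri):
    """The four children of a triangle, in the order of (4)."""
    v0, v1, v2 = tri
    w0, w1, w2 = midpoint(v1, v2), midpoint(v0, v2), midpoint(v0, v1)
    return [(v0, w2, w1), (v1, w0, w2), (v2, w1, w0), (w0, w1, w2)]

def xyz_to_htmid(p, levels):
    """Return the integer HTM ID of point p after `levels` subdivisions."""
    base, idx = max(FACES, key=lambda f: margin(tuple(V[i] for i in f[1]), p))
    tri = tuple(V[i] for i in idx)
    htmid = base
    for _ in range(levels):
        kids = children(tri)
        i = max(range(4), key=lambda k: margin(kids[k], p))
        tri = kids[i]
        htmid = htmid * 4 + i   # append the two-bit child label
    return htmid

shanghai_id = xyz_to_htmid((-0.446331, 0.729307, 0.518555), 21)
```

Under these assumptions an ID uses four bits for the face plus two bits per level, so `shanghai_id` occupies 46 bits, and dropping its last two bits yields the ID of the parent triangle.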

Tables in MySQL should have at least five columns: key, longitude, latitude, path and htmid, which mirror the structure of the metadata. A B-Tree index on the htmid column is necessary.

Figure 3. A logical view of the system.

The process is divided into the following five steps.

Step 1. Copy the original files to HDFS. This can be done by shell commands or through the HDFS interface functions. This step may take much time, but it runs only once.

Steps 2, 3. Extract metadata from the input files stored on HDFS. A Hadoop program contains several map tasks. A map task includes three functions, namely configure, map and close. In our system, the configure function was responsible for initializing the database connections, the map function inserted the metadata into the database, and the close function cleaned up the connections. The HTM IDs were also generated in the map function.

Steps 4, 5. Users' applications request objects from the database. When the records are returned, the applications access the raw files through the paths.

B. Experiments

The environment of our experiments consists of hardware and software, which are listed in Table II and Table III respectively.

TABLE II. HARDWARE CONFIGURATION

Name            Number   Details
ControlNode     1        CPU ×8, 2.0 GHz, 4 GB memory
ComputingNode   8        CPU ×8, 2.0 GHz, 8 GB memory
Network         --       All nodes are connected by a gigabit switch
StorageDisk     30       750 GB SATA ×30

TABLE III. SOFTWARE CONFIGURATION

Name                      Version   Details
Hadoop                    0.19      Installed on each computer. The control node of the cluster was specified as the NameNode of Hadoop, while the other computing nodes were used as DataNodes.
MySQL                     5.0       Installed on the control node.
HTMLibrary                --        Implemented in Java.
QueryApp                  --        Implemented in Java.
RedHat Linux Enterprise   4 AS      Installed on each computer.

The datasets we used were point-like objects uniformly distributed on the sphere and generated automatically by a computer. Each object was contained in a grid cell of 0.01° in longitude by 0.1° in latitude. The number of objects was approximately sixty-five million, but each was small in size. All the objects were stored in two thousand text files, in which each line represented one object. Each text file thus contained about fifty thousand objects.
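The five-column metadata table and a search by HTM ID ranges can be illustrated with a small, self-contained Python sketch. It uses the standard-library `sqlite3` module purely as a stand-in for MySQL, renames the key column to `obj_key` to avoid a reserved word, and inserts three hypothetical rows whose paths and htmid values are illustrative placeholders rather than data from the experiments.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE metadata (
        obj_key   TEXT PRIMARY KEY,
        longitude REAL,
        latitude  REAL,
        path      TEXT,      -- location of the raw file (hypothetical paths)
        htmid     INTEGER    -- HTM ID of the object's position
    )""")
# The index on htmid is what turns a spatial query into integer range scans.
conn.execute("CREATE INDEX idx_htmid ON metadata (htmid)")

rows = [
    ("obj-1", 121.47, 31.23, "hdfs://example/part-0001.txt", 229),
    ("obj-2", 121.50, 31.20, "hdfs://example/part-0001.txt", 231),
    ("obj-3", -74.00, 40.71, "hdfs://example/part-0002.txt", 200),
]
conn.executemany("INSERT INTO metadata VALUES (?, ?, ?, ?, ?)", rows)

def query_by_ranges(connection, ranges):
    """Return (obj_key, path) for every object whose htmid falls in a range."""
    found = []
    for lo, hi in ranges:
        cursor = connection.execute(
            "SELECT obj_key, path FROM metadata "
            "WHERE htmid BETWEEN ? AND ? ORDER BY htmid", (lo, hi))
        found.extend(cursor.fetchall())
    return found

# A query region covered by one ID range returns the two objects inside it.
hits = query_by_ranges(conn, [(224, 239)])
```

The returned paths are what a query application would then use to open the raw files on the distributed file system.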

It took about seven hours to put all the objects into the database when only one map task was used. In this case the insertions ran sequentially and only one database connection was active, which is the most efficient mode for the database itself. As the number of map tasks increased, the time spent on IO operations was distributed over different nodes. This decreased the cost of the IO operations but introduced concurrent access to the database. Hence, there is a tradeoff between the number of map tasks and the load on the database. The best result in the experiments, obtained by tuning the number of map tasks, was four hours to put all the objects into the database; the number of map tasks was then sixteen. This number may vary considerably with the dataset, especially with the file types.

The HTM IDs were organized in B-Tree indexes, which made searching very fast. Moreover, MySQL provides caching, which greatly improves the efficiency of repeated similar searches. To avoid the effects of caching, the experiments were performed on the initial state of the database. This was done by restarting the MySQL instance before each run to ensure that no cached data were available. The worst search time for a circular region with a radius of ten nautical miles was only about seven hundred milliseconds, with about three hundred objects returned. The time increased to five seconds when the radius of the circular query region was one hundred nautical miles, with more than twenty thousand objects returned. With the cache in use, these two times dropped to about zero and to less than half a second, respectively.

V. CONCLUSION

The Hierarchical Triangular Mesh (HTM) has the features of global continuity, stability, hierarchy and uniformity, which make it suitable for indexing spatial objects on the sphere. Each element of HTM has a unique ID that contains the information of both position and resolution. By defining spherical regions on the surface, the relationships between the regions and the HTM elements are established. The most important characteristic of the HTM ID is that it is one-dimensional and therefore very suitable for a B-Tree index.

The volume of data in the experiment is small, because the objects are text records. It could grow by a factor of at least one hundred thousand if the input objects included raster images, since it is common for remote sensing images to be more than one hundred thousand times larger than such records.

Much remains to be done before the system is practically useful. First, the system should support polygon-like objects; the associated algorithms for intersecting with spherical polygons are under development. Second, few file types are well supported on HDFS, and many kinds of files are incompatible with this file system. It is urgent for us to extend the functions of HDFS to support some common types, such as TIFF, JPG and Hierarchical Data Format (HDF) [13]. Although the HDF Group provides some tools, they are not suitable for parallel access. Additionally, a friendly user interface is necessary for the system.

REFERENCES

[1] A. S. Szalay, J. Gray, et al., “Indexing the sphere with the Hierarchical Triangular Mesh,” Microsoft Technical Report MSR-TR-2005-123, 2005. http://arxiv.org/abs/cs/0701164.
[2] M. F. Goodchild, Y. Shiren, and G. Dutton, “Spatial data representation and basic operations on triangular hierarchical data structure,” National Center for Geographic Information and Analysis, Santa Barbara, Technical Report 91-8, 1991.
[3] M. F. Goodchild and Y. Shiren, “A hierarchical data structure for global geographic information systems,” CVGIP: Graphical Models and Image Processing, vol. 54, pp. 31–44, 1992.
[4] L. Song, A. J. Kimerling, and K. Sahr, “Developing an equal area global grid by small circle subdivision,” Proc. International Conference on Discrete Global Grids, Santa Barbara, CA, March 26-28, 2000.
[5] J. Gray, A. S. Szalay, and G. Fekete, “Using table valued functions in SQL Server 2005 to implement a spatial data library,” Microsoft Technical Report MSR-TR-2005-122, 2005.
[6] A. Pavlo, E. Paulson, et al., “A comparison of approaches to large-scale data analysis,” Proceedings of the 35th SIGMOD International Conference on Management of Data, Providence, Rhode Island, USA, ACM, pp. 165-178, 2009.
[7] Hadoop Website, http://hadoop.apache.org/.
[8] J. Dean and S. Ghemawat, “MapReduce: simplified data processing on large clusters,” Communications of the ACM, vol. 51, pp. 107-113, 2008.
[9] G. Dutton, “Planetary modelling via hierarchical tessellation,” Proceedings of the Ninth International Symposium on Automated Cartography (AutoCarto 9), Bethesda, MD, USA, pp. 462-471, 1989.
[10] K. Sahr, D. White, and A. J. Kimerling, “Geodesic discrete global grid systems,” Cartography and Geographic Information Science, vol. 30 (2), pp. 121-134, 2003.
[11] W. Yuan, C. Q. Cheng, A. N. Ma, and X. J. Guan, “L curve for spherical triangle region quadtrees,” Science in China Series E-Engineering and Materials Science, vol. 47, pp. 265–280, 2004.
[12] J. Venner, Pro Hadoop. USA: Apress, 2009.
[13] Hierarchical Data Format Website, http://www.hdfgroup.org/.
