A Distributed Geospatial Data Storage and Processing Framework for Large-Scale WebGIS

Yunqin Zhong 1,2*, Jizhong Ran 1, Tieying Zhang 1, Jinyun Fang 1

1 Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China

2 Graduate University of Chinese Academy of Sciences, Beijing, China
*Corresponding author, e-mail: [email protected]

Abstract-With the rapid growth of geospatial data and concurrent users, the state-of-the-art WebGIS cannot support massive data storage and processing due to poor scalability of underlying centralized systems (e.g., native file systems and SDBMS). In this paper, we propose a novel distributed geospatial data storage and processing framework for large-scale WebGIS. Our proposal contains three significant characteristics. Firstly, a scalable cloud-based architecture is designed to provide elastic storage and computation resources of a shared-nothing commodity cluster for WebGIS. Secondly, we present efficient geospatial data placement and geospatial data access refinement schemes to improve I/O efficiency. Thirdly, we propose a MapReduce-based localized geospatial computing model for parallel processing of massive geospatial data, which improves geospatial computation performance. We have implemented a prototype named VegaCI on top of the emerging Hadoop cloud platform. Comprehensive experiments demonstrate that our proposal is efficient and applicable in practical large-scale WebGIS.

Keywords-spatial cloud computing; cloud infrastructure; data management; geospatial computation; WebGIS; Hadoop;

I. INTRODUCTION

WebGIS (Web-based Geographic Information System) is the collection, administration, analysis and display of geospatial data in a user-friendly manner over the web [1]. WebGIS services, such as WMS (Web Map Service) and WFS (Web Feature Service), are widely used in the real world. Moreover, emerging domains such as IoT (Internet of Things) and LBSNS (Location-based Social Networking Service) need WebGIS to be integrated with them to provide useful sensor web services.

With the advancements of data acquisition techniques, large amounts of geospatial data have been collected from multiple data sources, such as satellite observations, remotely sensed imagery, aerial photography, and model simulations. Geospatial data are growing exponentially to the PB (Petabyte) or even EB (Exabyte) scale [2]. Moreover, typical geospatial processing for functioning WebGIS capabilities requires intrinsically complex computation, such as spatial analysis and spatial queries. Thus, large-scale WebGIS applications are both data-intensive and computing-intensive. Until recently, most WebGISs were built on top of centralized systems such as native file systems and SDBMS (Spatial Database Management System). They perform very well while dealing with relatively small geospatial datasets. However, they face grand challenges while processing massive geospatial data. Firstly, WebGIS based on SDBMS cannot manage massive geospatial data due to its poor scalability. WebGIS continues to demand more physical storage for both geospatial data and non-spatial data (e.g., attribute data). Although hardware vendors continue to expand per-disk and RAID (Redundant Array of Inexpensive Disks) capacity, the storage efficiency of SDBMS has not kept pace with the growth of geospatial data volume [2]. Thus, SDBMS cannot provide high I/O efficiency for large-scale WebGIS that involves massive data and concurrent users. Secondly, since terabyte- or petabyte-scale geospatial datasets have become commonplace, WebGIS is required to process very large geospatial datasets so that it can provide services for concurrent users. Nevertheless, present WebGIS cannot process large amounts of geospatial data efficiently due to the limited computational power of single-node SDBMS. Although a parallel DBMS [3] can improve computational capability, it requires expensive license fees and advanced hardware to support geospatial data processing. Thirdly, WebGIS on SDBMS is vulnerable to the SPOF (Single Point of Failure) problem in case of hardware errors and network partitions [1]. Thus, the state-of-the-art WebGIS cannot guarantee high-quality geospatial services for a large number of concurrent users. Motivated by these challenges, in this paper we propose a novel distributed geospatial data storage and processing framework for large-scale WebGIS. This framework provides an effective and efficient distributed storage and parallel processing paradigm for massive geospatial data. Our proposal has several significant characteristics.

This research is supported by the National High Technology Research and Development Program (863 Program) of China (No.2011AA120302 and No.2011AA120300) and funded by The CAS Special Grant for Postgraduate Research, Innovation and Practice. 978-1-4673-1104-5/12/$31.00 ©2012 IEEE
Firstly, we design a scalable cloud-based architecture to provide elastic storage and computation resources from a shared-nothing commodity cluster for WebGIS. This architectural scheme consists of six hierarchies and is based on the PaaS (Platform as a Service) layer of the cloud stack. By leveraging the advancements of the emerging open-source cloud platform Hadoop [4], WebGIS can meet the requirements of massive geospatial data. Secondly, we present efficient geospatial data placement and geospatial data access refinement schemes to improve I/O efficiency, which guarantee access efficiency, storage load balancing and disk space utilization. The massive geospatial data are organized by our geospatial map file structure and stored on the scalable HDFS (Hadoop Distributed File System) [5]. Thirdly, we present the MapReduce [6] based LGC (Localized Geospatial Computing) model to accelerate geospatial processing such as spatial analysis and spatial query computation. LGC is a fusion model which combines the geospatial computation model with the MapReduce parallel paradigm. A typical geospatial computation is divided into several small units, and these units are processed in parallel by the slave nodes on which the input data is stored; finally, the master node collects the execution results from the slave nodes. Since the geospatial objects are distributed across cluster nodes, concurrent geospatial data accesses and complex geospatial computation can be performed on many nodes simultaneously.

With the help of our cloud-based geospatial data processing framework, large-scale WebGIS can improve I/O efficiency and computing performance, and thus achieve high aggregate I/O throughput and high-performance geospatial computation for massive geospatial data and numerous concurrent users. Our contributions in this paper can be summarized as follows. Firstly, we propose a novel geospatial data processing framework for large-scale WebGIS. This framework consists of several novel schemes: (1) a scalable cloud-based architecture is presented so that WebGIS can obtain elastic storage and computation resources from a shared-nothing commodity cluster; (2) efficient geospatial data placement and geospatial data access refinement schemes are designed to improve I/O efficiency, which guarantees access efficiency, storage load balancing and disk space utilization; (3) a MapReduce-based LGC model is proposed to improve geospatial processing efficiency via parallel execution of spatial computation on many cluster nodes. Secondly, we have implemented a framework prototype named VegaCI (VegaGIS Cloud Infrastructure) on top of the Hadoop platform. VegaCI has good scalability, and its storage capacity and computation capability can be scaled near-linearly by adding more commodity servers. Moreover, VegaCI has good feasibility and manageability, and it can be seamlessly integrated with typical WebGIS services such as WMS and WFS. The geospatial access efficiency of VegaCI is about 57%-153% better than that of a DBMS cluster. Moreover, the geospatial data processing efficiency of VegaCI is about 2.29-3.67 times better than that of the compared systems, and VegaCI gains an average speedup ratio of 71.38%-85.12% on a 9-node cluster. Besides, VegaCI achieves low-latency access, and its average response performance is improved by about 69.1%-83.7%.

The rest of the paper is organized as follows. Section II describes the system architecture of the VegaCI framework. Section III presents the distributed geospatial data storage and processing scheme. Section IV presents the performance evaluation. Section V reviews the related work. Finally, we conclude this paper in Section VI.

II. ARCHITECTURE OF VEGACI FRAMEWORK

The VegaCI framework is targeted at improving geospatial storage and data processing efficiency for large-scale WebGIS. To guarantee good scalability and high efficiency, we take advantage of cloud computing technologies and distributed system techniques in its design. As shown in Fig. 1, the VegaCI framework consists of six hierarchical layers, i.e., the Linux cluster layer, the geospatial data long-term preservation layer, the geospatial data placement scheme layer, the geospatial data processing layer, the standard geospatial data storage and processing virtualization layer, and the WebGIS service layer. The system architecture of the VegaCI framework is detailed as follows.

1 VegaGIS is a large-scale GIS platform product developed by our laboratory.

Figure 1. Cloud-based architecture of VegaCI framework (Linux cluster layer with data nodes 1..n)

The Linux cluster layer is the fundamental storage and computing infrastructure, which is built on shared-nothing cluster nodes running the Linux operating system. Each node has its own storage capacity and computing power. Cluster nodes are connected to each other via a gigabit switch. The long-term geospatial data preservation layer is distributed middleware based on the Linux cluster, which provides a global uniform namespace for geospatial data manipulation. This distributed middleware contains two emerging distributed storage systems: HDFS [5] and HBase [14]. HDFS is used to preserve geospatial data, and HBase is adopted to store billions of structured geospatial records such as attribute data. The geospatial data placement scheme layer is the collection of geospatial data organization and several refinements. Firstly, the geospatial data are organized by the geospatial objects map file structure (the details are described in Section III-A). The large amounts of WKT (Well-Known Text) and WKB (Well-Known Binary) objects are modeled as key-value pairs and sequentially stored in the geospatial map file. Moreover, since the geospatial map file is preserved as data blocks on HDFS, we adopt a data compression algorithm to improve disk space utilization, and use a storage load balancing mechanism to guarantee that the geospatial data blocks are evenly distributed across cluster nodes. Besides, in order to improve access efficiency, we provide an in-memory distributed caching mechanism for hot-spot geospatial objects.

The geospatial data processing layer provides the MapReduce-based LGC (Localized Geospatial Computing) model. We have also implemented a MapReduce-based library for typical spatial analysis and spatial query algorithms. Since the geospatial data are evenly distributed, geospatial analysis and spatial query operations can be performed in parallel on cluster nodes. The master node (i.e., the JobTracker) assigns geospatial tasks to slave nodes (i.e., TaskTrackers); the TaskTrackers then process their own data and compute respective temporary results; finally, the JobTracker collects these temporary results and returns the final result to clients. The standard geospatial data storage and processing virtualization layer provides APIs (Application Programming Interfaces) for handling geospatial data, including geospatial data manipulation and geospatial processing APIs. With these APIs, WebGIS can perform geospatial storage and processing operations transparently, without caring about implementation details such as the data layout, locations of geospatial data, and communications.

The WebGIS service layer provides geospatial information services for clients, such as WMS, WFS, WCS, etc. The VegaCI framework is transparent to WebGIS applications, and the service applications invoke the standard library of the virtualization layer to process massive geospatial data. With the help of the above hierarchical layers, VegaCI can obtain elastic storage and computation resources from commodity clusters, and it can be dynamically scaled simply by adding or removing nodes. Thus, VegaCI can achieve high-efficiency geospatial data processing for large-scale WebGIS.

III. DISTRIBUTED GEOSPATIAL DATA STORAGE AND PROCESSING SCHEME

A. Geospatial Data Placement Scheme

By leveraging the advancements of the Hadoop platform, we design an efficient data placement scheme for storing the massive geospatial data required by large-scale WebGIS. Geospatial data are represented by two data models: the raster data model and the vector data model. Typical raster data are stored in a tile-based structure, where a tile can be stored as a file whose name is composed of its row number and column number. A typical tile is 256×256 pixels, and its size ranges from several kilobytes to tens of kilobytes. Vector data are stored as geometry objects such as points, lines, polygons, etc. The geospatial objects are represented in WKT format. In our scheme, both raster data and vector data are stored as key-value pairs in the geospatial map file. For raster data, the key is the tile name and the value is the tile's content; for vector data, the key is the geometry ID and the value is the WKT object. We design a geospatial map file structure named GFile for geospatial data preservation. As shown in Fig.2 (a), GFile is composed of five parts: geospatial data block, geospatial metadata block, geospatial data block index, geospatial metadata index and fixed geospatial trailer.
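The key-value mapping described above can be sketched in a few lines. This is a toy illustration; the helper names (`raster_kv`, `vector_kv`) are ours, not part of the VegaCI API:

```python
# Sketch of the key-value modeling described above (illustrative names only):
# raster tiles are keyed by a name composed of row and column number,
# vector geometries by their geometry ID, with the WKT string as value.

def raster_kv(row, col, tile_bytes):
    """Key = tile name from row/column number; value = tile content."""
    return (f"{row}_{col}", tile_bytes)

def vector_kv(geom_id, wkt):
    """Key = geometry ID; value = WKT object."""
    return (str(geom_id), wkt)

# A GFile stores both kinds of pairs sequentially in its data blocks.
pairs = [
    raster_kv(12, 34, b"\x89PNG..."),        # a 256x256 tile image
    vector_kv(1001, "POINT (116.3 39.9)"),   # a vector feature
]
```

A uniform pair representation is what lets one file format (GFile) hold both raster and vector data.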

(a) GFile structure: Geospatial Data Block | Geospatial Metadata Block | Geospatial Data Block Index | Geospatial Metadata Index | Fixed Geospatial Trailer

(b) Key-Value object structure for raster data: Row | Column | Time Stamp | Raster Type | BLOB

(c) Key-Value object structure for vector data: Geometry ID | MBR | Time Stamp | Feature Type | WKT

Figure 2. GFile and key-value object structure for geospatial data

1) Geospatial Data Block: The geospatial data block is composed of record sets, where a record contains several geospatial KeyValue objects. The KeyValue structure of geospatial objects is shown in Fig.2, where (b) shows the raster data object structure and (c) shows the KeyValue structure for vector data objects. The raster KeyValue object contains five fields: row number, column number, time stamp, raster type (e.g., spectrum identifier) and the object value represented as a BLOB. The vector KeyValue object consists of five fields: geometry ID, MBR (Minimum Bounding Rectangle), time stamp, feature type (e.g., Polygon) and WKT string.

Since raster data and vector data are stored as different KeyValue structures, we design different record structures for raster data and vector data when storing geospatial objects into a data block. As shown in Fig.3, the record structure of the geospatial data block is composed of key length, value length, KeyValue objects and their respective offsets. Access efficiency is influenced by record size: a smaller record improves random read performance, whereas a larger record improves scan and sequential read performance. Since geospatial access patterns are uncertain, we mainly focus on improving random access efficiency. Thus, the default record size is set to 32KB, which is obtained from experiments with practical WebGIS workloads.

Key Length | Value Length | Offset 1 | … | Offset n | KeyValue 1 | … | KeyValue n

Figure 3. Record structure of geospatial data block
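A byte-level sketch of this record layout is given below. The field widths (4-byte big-endian integers) and the exact field order are our assumptions for illustration; the paper does not specify them:

```python
import struct

# Hypothetical serializer for the Fig. 3 record layout: per-object key
# length and value length, then per-object offsets into the body, then
# the concatenated KeyValue objects themselves.
def pack_record(keyvalues):
    """keyvalues: list of (key: bytes, value: bytes) pairs."""
    offsets, body, pos = [], b"", 0
    for k, v in keyvalues:
        offsets.append(pos)          # offset of this object within the body
        body += k + v
        pos += len(k) + len(v)
    header = b"".join(struct.pack(">II", len(k), len(v)) for k, v in keyvalues)
    header += b"".join(struct.pack(">I", o) for o in offsets)
    return header + body
```

The stored offsets are what make random reads within a record cheap: a reader can seek straight to the n-th KeyValue object without scanning its predecessors.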

2) Geospatial Metadata Block: The metadata block contains descriptive information of the geospatial data block and bloom filter information. The basic descriptive information of a GFile includes the quaternary geocode, layer type, MBR information, first key, last key and key comparator. The bloom filter is used to determine whether a given KeyValue object is in the GFile or not.

3) Geospatial Data Block Index: The geospatial data block index contains the location items of the geospatial data blocks within the GFile. The location item of a geospatial block is composed of the block offset in the GFile, the first key and the block size.

4) Geospatial Metadata Index: The geospatial metadata index contains the location items of the metadata blocks, and each item includes the metadata block offset and block size.

5) Fixed Geospatial Trailer: The fixed geospatial trailer is used to find the offsets of the different parts in the GFile. As shown in Fig.4, the geospatial trailer is composed of the geospatial data block offset, metadata offset, data block index offset, metadata index offset, block count, metadata count, compression type and GFile version.

Figure 4. Structure of fixed geospatial trailer
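The bloom-filter membership test kept in the metadata block (Section III-A.2) can be illustrated with a toy implementation; the hash construction and parameters here are ours, not VegaCI's:

```python
import hashlib

class ToyBloomFilter:
    """Toy bloom filter like the one in the geospatial metadata block:
    answers 'definitely absent' or 'possibly present' for a given key,
    letting a reader skip GFiles that cannot contain the KeyValue object."""
    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k          # m bits, k hash functions
        self.bits = [False] * m

    def _positions(self, key):
        # Derive k pseudo-independent positions from salted MD5 digests.
        for i in range(self.k):
            h = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, key):
        for p in self._positions(key):
            self.bits[p] = True

    def might_contain(self, key):
        # False means definitely absent; True means possibly present.
        return all(self.bits[p] for p in self._positions(key))
```

Checking the filter before touching a data block avoids a disk seek for keys the GFile cannot hold, at the cost of a small false-positive rate.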

6) Geospatial Data Compression: Since the geospatial data are extremely large, we provide a compression function for GFile so as to improve disk space utilization. We provide three compression types: NON, RECORD and BLOCK. The "NON" option indicates no compression; "RECORD" indicates record-granularity compression; "BLOCK" indicates block-granularity compression, where each block is compressed. WebGIS applications can choose the appropriate option to balance the tradeoff between space utilization and access efficiency. The default compression option is the BLOCK type.
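The space/CPU tradeoff of the BLOCK option can be seen with a rough sketch; zlib here is a stand-in for whichever codec VegaCI actually employs, and the function name is ours:

```python
import zlib

def compress_block(block: bytes, compression: str = "BLOCK") -> bytes:
    """Apply block-granularity compression; 'NON' stores the block as-is."""
    if compression == "NON":
        return block
    return zlib.compress(block)

# Repetitive WKT text, typical of vector blocks, compresses very well.
block = b"POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0)) " * 100
packed = compress_block(block)
assert len(packed) < len(block)
```

Compressing whole blocks (rather than individual records) amortizes codec overhead, which is presumably why BLOCK is the default.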

B. Geospatial Data Access Refinements

1) Storage Load Balancing: Since there are massive geospatial data and concurrent user accesses in large-scale WebGIS, the VegaCI framework provides a load balancing mechanism to assign workloads to cluster nodes uniformly. With even distribution of geospatial data and user accesses across many nodes, VegaCI can avoid the SPOF (Single Point of Failure) problem, and thereby support continuous online WebGIS services with high availability.

2) Distributed Caching Mechanism for Geospatial Objects:

According to TFL (Tobler's First Law), geographically adjacent geospatial objects imply tight correlation, and clients are likely to access adjacent objects as well. We design a distributed caching mechanism to improve geospatial data access efficiency, especially for I/O-intensive workloads. VegaCI is devoted to reducing read latency when there are numerous concurrent user accesses, and can thereby guarantee real-time responses for WebGIS service requests. The geospatial KeyValue objects are resident in a distributed memory pool. The distributed memory pool is constituted by the memory pools of the cluster nodes; more specifically, each cluster node is allocated a configurable memory pool named the local cache pool, and VegaCI integrates these local pools into one larger memory pool with distributed system technology. Geospatial objects are first placed in the local cache. If the local cache pool is full, objects are transferred to the remote cache pool of another cluster node. We use a triggered strategy to load geospatial objects into the distributed cache from local disk: once a KeyValue object is accessed by clients, its adjacent objects within the same record are loaded into the distributed caching pool. We adopt the LRU (Least Recently Used) replacement policy to evict victim objects from the memory caching pool.
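A single-node sketch of the local cache pool with LRU eviction is shown below; Python's OrderedDict stands in for the real distributed pool, and the class name is illustrative:

```python
from collections import OrderedDict

class LocalCachePool:
    """Toy local cache pool with LRU replacement, as described above.
    In VegaCI the pools of all nodes federate into one distributed pool
    with remote spill-over; here we model a single node only."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self._pool = OrderedDict()

    def get(self, key):
        if key not in self._pool:
            return None  # real system: fall back to a remote pool, then disk
        self._pool.move_to_end(key)  # mark as most recently used
        return self._pool[key]

    def put(self, key, obj):
        if key in self._pool:
            self._pool.move_to_end(key)
        self._pool[key] = obj
        if len(self._pool) > self.capacity:
            self._pool.popitem(last=False)  # evict the least recently used
```

On a cache hit for one KeyValue object, the triggered strategy would additionally `put` its neighbours from the same record, exploiting Tobler's First Law.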

C. MapReduce-based Localized Geospatial Computing Model

In practical large-scale WebGIS, geospatial computation within spatial query and spatial analysis must process large amounts of geospatial data. Since such computation over massive data is both I/O-intensive and computing-intensive, typical serial processing methods cannot meet the requirements of low-latency access and real-time processing. To improve processing efficiency, we propose the MapReduce-based LGC (Localized Geospatial Computing) model for parallel processing of massive geospatial data.

The LGC model integrates the storage and computing resources of the cluster nodes, and geospatial computation is tightly coupled with data storage placement. The geospatial computation is conducted in parallel on a shared-nothing cluster, and the computing process is performed simultaneously on the local nodes where the geospatial data blocks are stored. This parallel execution mechanism differs from MPI (Message Passing Interface) based processing. Typical MPI execution is based on a shared-disk or shared-memory cluster, which needs expensive hardware to support parallelism. Moreover, the MPI master node must send the original data to slave workers for parallel execution. In contrast, our LGC model is based on a shared-nothing cluster composed of cheap commodity hardware. It has three virtues: low cost, scalability, and high availability. With the LGC model, the master node only sends processing instructions rather than geospatial data to the slave nodes, and the slave nodes transfer the processing results back to the master, thereby avoiding bulk data transmission between master and slave nodes. As shown in Fig. 5, the geospatial computation is processed in parallel on different nodes. Since the geospatial data are stored in the GFile structure, the geospatial data blocks naturally serve as the input splits of the MapReduce runtime system. A typical geospatial computation in the LGC model is divided into three basic operations: map, shuffle and reduce, and these operations can be performed in parallel by many cluster nodes. As shown in Fig.5, the geospatial processing workflow of the LGC model contains four steps:

1) Input geospatial data. The record reader resolves the geospatial key/value objects from the input split of a geospatial data block, and the key/value objects are sent to a Mapper daemon as input data.

2) Map execution. The different input splits of geospatial data blocks are processed simultaneously by Mapper daemons on the cluster nodes, and the spatial relation algorithms are encapsulated in the map function.

3) Combination process. The geographically adjacent KeyValue objects are sorted by the Hilbert value of the central point of their MBR and then combined into different groups. These groups of objects are sent to Reducer daemons for refinement processing.

4) Reduce execution. The adjacent key/value objects are processed by the geometric computation algorithm, and each reducer obtains temporary key/value objects that match the spatial relation condition. Finally, the master collects the final result and returns it to the client.

Figure 5. The MapReduce-based parallel execution procedure of the LGC model (Input: geospatial data blocks 1..N; Combine: [ID_i, {WKT_i, ...}]; Reduce: [ID, WKT]).
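The four-step workflow can be mimicked in single-process code. This is a toy stand-in: real LGC runs Mapper/Reducer daemons on the nodes holding the blocks, and the combine step sorts by the Hilbert value of the MBR centre, for which a coarse grid cell is used here as a simpler spatial key:

```python
from itertools import groupby

# Step 1: input -- key/value objects resolved from a geospatial data block.
records = [
    ("g1", "POINT (1 1)"), ("g2", "POINT (1 2)"),
    ("g3", "POINT (9 9)"), ("g4", "POINT (9 8)"),
]

def parse_point(wkt):
    x, y = wkt[len("POINT ("):-1].split()
    return float(x), float(y)

# Step 2: map -- a spatial key is computed per object. A coarse grid cell
# stands in for the Hilbert value of the MBR centre used by the paper.
def map_fn(key, wkt):
    x, y = parse_point(wkt)
    cell = (int(x) // 5, int(y) // 5)
    return (cell, (key, wkt))

mapped = [map_fn(k, v) for k, v in records]

# Step 3: combine/shuffle -- sort by the spatial key, grouping neighbours.
mapped.sort(key=lambda kv: kv[0])
groups = {cell: [kv for _, kv in grp]
          for cell, grp in groupby(mapped, key=lambda kv: kv[0])}

# Step 4: reduce -- each group is processed independently; the master
# would collect these partial results from the slave nodes.
result = {cell: sorted(k for k, _ in objs) for cell, objs in groups.items()}
```

Because grouping keys are spatial, each reducer only sees geographically adjacent objects, which is what keeps geometric computation local.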

We have implemented a fundamental MapReduce-based library for geospatial computation, including spatial analysis and spatial query operations such as spatial selection, spatial join, ANN (All Nearest Neighbor), kNN and rNN (Reverse Nearest Neighbor). Complex geospatial analyses can be assembled from the fundamental library. The details of geospatial data storage, computation and communication are transparent to applications. WebGIS applications only need to be concerned with the logical workflow of geospatial processing, and the geospatial computation tasks are automatically assigned to cluster nodes when WebGIS services invoke the computation jobs. With the help of the LGC parallel processing model, VegaCI can provide high-efficiency geospatial computation and low-latency I/O access, and thus it can meet the storage and processing requirements of large-scale WebGIS.

IV. PERFORMANCE EVALUATION

A. Experiment Environment

Our experiments are conducted on a cluster composed of 9 commodity nodes. The hardware configuration of each node consists of two quad-core Intel 2.13 GHz CPUs, 4GB DDR3 RAM, and a 15000 r/min, 300GB SAS hard disk drive. The software configuration includes CentOS 5.5, JDK 1.6.0_20, MySQL 5.1.49, PostgreSQL 9.0.5, PostGIS 1.5.3, Oracle Database 11g Release 2 and HBase 0.90.3. Besides, an industry-standard performance validation tool named LoadRunner is deployed on a server to generate concurrent user requests.

The real geospatial datasets are about 937.1GB and consist of a raster dataset and a vector dataset. The raster dataset covers six map scales and the highest resolution is 1:5000. It contains about 89,372,583 tiles and each tile is 256×256 pixels. The vector dataset contains 218,611,357 POINT, 58,149,873 LINE and 9,097,126 POLYGON objects.

B. Bulk Construction of Geospatial Index

In order to prune the search space and improve data retrieval efficiency, a geospatial index is mandatory for WebGIS. We build a Quad-tree index for the raster dataset and construct an R-tree index for the vector dataset. The bulk construction performance of VegaCI is compared with MySQL cluster, PostgreSQL cluster and Oracle DB cluster on a 5-node cluster.

Figure 6. Comparisons of geospatial index bulk construction performance.

As shown in Fig.6, VegaCI outperforms the compared systems in bulk building the geospatial index. The runtime of VegaCI is about 965s for raster index construction and 1311s for vector index construction, whereas the runtime of the compared systems is more than 3509s for the raster index and 3629s for the vector index. VegaCI performs geospatial index construction with a parallel MapReduce job, and its construction performance is more than 2.78 times better than that of the other systems.

C. Geospatial Data Processing Performance

We evaluate the geospatial data processing performance with four metrics, i.e., ANN, kNN (k = 5, 10) and rNN. VegaCI is compared with PostGIS+PostgreSQL and Oracle Spatial+Oracle DB on a 5-node cluster. The four micro-benchmarks are evaluated to verify geospatial processing efficiency, including execution time and speedup ratio. Fig.7 shows the comparison results of PostGIS, Oracle Spatial and VegaCI. VegaCI costs 378s, 219s, 253s and 381s when processing ANN, 5-NN, 10-NN and rNN, respectively. With its MapReduce-based geospatial computation mechanisms, VegaCI processes geospatial data on many nodes in parallel, and its average processing runtime is about 2.29-3.67 times less than that of the compared systems.

Figure 7. Comparison of geospatial data processing performance (ANN, kNN(k=5), kNN(k=10), rNN)

To verify the scalability of our scheme, we evaluate the four metrics on VegaCI at various cluster scales. As shown in Fig. 8, the execution time decreases rapidly with more nodes; e.g., processing ANN takes 1009s on 1 node, whereas it takes only 87s with 9 nodes. The average speedup ratio of VegaCI reaches about 71.38%-85.12% when the cluster is scaled from 1 node to 9 nodes.

Figure 8. The processing speedup ratio of VegaCI (x-axis: # of VegaCI nodes)

D. Geospatial Data Access Efficiency

We evaluate the geospatial I/O access efficiency with two metrics, i.e., bulk loading 10GB of data and randomly reading data from 16 areas. VegaCI and the compared systems (i.e., MySQL, PostgreSQL, HBase) run on a 9-node cluster.

TABLE I. GEOSPATIAL I/O ACCESS EFFICIENCY

Test Items             | Execution time (seconds)
                       | MySQL | PostgreSQL | HBase | VegaCI
bulk loading (10GB)    | 1472  | 1380       | 901   | 573
random read (16 areas) | 25    | 21         | 29    | 12

TABLE I shows the comparison results of geospatial I/O access efficiency. It can be observed that VegaCI outperforms the other systems. For instance, its bulk loading time is about 57%-153% less than that of MySQL and HBase. Besides, the read efficiency of VegaCI is about 108%-141% better than that of the compared systems. VegaCI distributes the I/O accesses to different nodes, and all nodes serve clients in parallel with low latency. Thus, VegaCI achieves better geospatial access efficiency than the DBMS clusters and HBase.

E. System Evaluation

Figure 9. Comparisons of system evaluation (response time vs. number of concurrent users, 50-300)

The system evaluation measures the response time of concurrent requests. We integrate an existing WebGIS with the cluster versions of MySQL, PostgreSQL, HBase and VegaCI on a 9-node cluster, respectively. LoadRunner simulates concurrent user requests for services. The number of users increases from 50 to 300, and the test duration is 30 minutes. As shown in Fig.9, VegaCI gains the best system response performance, and its response time is about 69.1%-83.7% less than that of the compared systems. Besides, the response time of VegaCI stays at a relatively stable level (less than 259ms) with a growing number of concurrent users, whereas the response time of the compared systems increases from 509ms to 1693ms. VegaCI achieves low-latency responses for a large number of concurrent users, and thus it can facilitate the real-time service quality of large-scale WebGIS.

V. RELATED WORK

Spatial databases [7] are the state-of-the-art underlying data infrastructure for WebGIS. However, with the rapid growth of geospatial data and concurrent users, WebGIS built on top of centralized spatial databases cannot meet the requirements of data intensity, computing intensity and concurrent access intensity [8]. Since the cloud ecosystem provides elastic computing and storage power on commodity clusters, NoSQL databases have been applied in geospatial science domains [9]. They are emerging with web-scale semi-structured data [10, 11]. Bigtable is used to store the satellite imagery for Google Earth [12]. [1] presents a method to manage a large number of small tiles on HDFS, and [13] describes a method to manage very large raster data with key-value storage systems such as HBase [14] and Cassandra [15]. [16] describes a method to process aerial image quality with MapReduce. [17] proposes a spatial keyword querying method. These works focus on raster data or textual data rather than geospatial processing issues such as spatial query and spatial analysis for vector data [18]. Our proposal provides a geospatial data storage and parallel processing framework for both vector data and raster data. It can support large-scale WebGIS that involves massive geospatial data and concurrent users.

VI. CONCLUSION

We have proposed a novel distributed geospatial data storage and processing framework for large-scale WebGIS, which involves massive geospatial data and concurrent users. Firstly, we present a cloud-based architecture through which WebGIS obtains elastic storage and computation resources from a commodity cluster. Secondly, we design efficient geospatial data placement and geospatial data access refinement schemes to improve I/O efficiency, including access efficiency, storage load balancing and disk space utilization. Thirdly, we propose the LGC model for parallel processing of massive geospatial data by leveraging MapReduce.
In addition, we have implemented a prototype named VegaCI on top of the Hadoop cloud platform. Performance evaluations of the deployed WebGIS confirm that VegaCI has excellent I/O efficiency; e.g., its geospatial access efficiency is about 57%-153% better than that of a DBMS cluster, and its response performance is improved by about 69.1%-83.7%. VegaCI achieves good scalability and high-performance geospatial processing; e.g., its processing efficiency is about 2.29-3.67 times better than that of the compared systems, and its speedup ratio reaches about 71.38%-85.12%.

Therefore, VegaCI is efficient and applicable in practical large-scale WebGIS.

REFERENCES
[1] X. Liu, et al., "Implementing WebGIS on Hadoop: A case study of improving small file I/O performance on HDFS," Proc. IEEE Int. Conf. on Cluster Computing, Cluster'09, pp. 1-8, 2009.
[2] R. Yang, et al., "Data Access and Data Systems," in Advanced Geoinformation Science, C. Yang, et al., Eds. CRC Press, 2011, pp. 127-137.
[3] D. DeWitt and J. Gray, "Parallel database systems: the future of high performance database systems," Commun. ACM, vol. 35, pp. 85-98, 1992.
[4] Apache Hadoop project. Available: http://hadoop.apache.org/
[5] K. Shvachko, et al., "The Hadoop Distributed File System," Proc. IEEE 26th Symposium on Mass Storage Systems and Technologies, MSST'10, pp. 1-10, 2010.
[6] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Commun. ACM, vol. 51, pp. 107-113, 2008.
[7] S. Shekhar and S. Chawla, Spatial Databases: A Tour. New Jersey: Prentice Hall, Inc., 2002.
[8] C. Yang, et al., "Spatial cloud computing: how can the geospatial sciences use and help shape cloud computing?," Int. Journal of Digital Earth, vol. 4, pp. 305-329, 2011.
[9] J. Levandoski, et al., "CareDB: a context and preference-aware location-based database system," Proc. Int. Conf. on Very Large Databases, VLDB'10, vol. 3, pp. 1529-1532, 2010.
[10] A. Thusoo, et al., "Hive - a petabyte scale data warehouse using Hadoop," Proc. IEEE 26th Int. Conf. on Data Engineering, ICDE'10, pp. 996-1005, 2010.
[11] A. Abouzeid, et al., "HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads," Proc. VLDB Endowment, vol. 2, pp. 922-933, 2009.
[12] F. Chang, et al., "Bigtable: A Distributed Storage System for Structured Data," ACM Trans. Comput. Syst., vol. 26, pp. 1-26, 2008.
[13] Y. Zhong, et al., "A novel method to manage very large raster data on distributed key-value storage system," Proc. IEEE 19th Int. Conf. on Geoinformatics, Geoinformatics'11, pp. 1-6, 2011.
[14] Apache HBase project. Available: http://hbase.apache.org/
[15] Apache Cassandra project. Available: http://cassandra.apache.org/
[16] A. Cary, et al., "Experience on Processing Spatial Data with MapReduce," Proc. Int. Conf. on Scientific and Statistical Database Management, SSDBM'09, pp. 1-18, 2009.
[17] X. Cao, et al., "Collective spatial keyword querying," Proc. Int. Conf. on Management of Data, SIGMOD'11, pp. 373-384, 2011.
[18] C. Yang, et al., "Geospatial Cyberinfrastructure: Past, present and future," Computers, Environment and Urban Systems, vol. 34, pp. 264-277, 2010.
