A Map-Reduce based parallel approach for improving query performance in a geospatial semantic web for disaster response Chuanrong Zhang, Tian Zhao, Luc Anselin, Weidong Li & Ke Chen

Earth Science Informatics ISSN 1865-0473 Volume 8 Number 3 Earth Sci Inform (2015) 8:499-509 DOI 10.1007/s12145-014-0179-x





RESEARCH ARTICLE


Received: 14 May 2014 / Accepted: 9 September 2014 / Published online: 20 September 2014 © Springer-Verlag Berlin Heidelberg 2014

Abstract Rapid retrieval of spatial information is critical to ensure that emergency supplies and resources can reach impacted areas in the most efficient manner. However, finding the needed spatial information efficiently remains challenging because of the intensive geocomputation processes involved and the heterogeneity of spatial data. It is cost prohibitive to query spatial information from geographical knowledge bases containing complex topological relationships. This research introduces a Map-Reduce based parallel approach for improving the query performance of a geospatial ontology for disaster response. The approach focuses on parallelizing the spatial join computations of GeoSPARQL queries and makes full use of data/task parallelism for spatial queries. The results of some initial experiments show that the proposed approach can reduce individual spatial query execution time by taking advantage of parallel processes. The proposed approach, therefore, may support a large number of concurrent spatial queries in disaster response applications.

Keywords Map-Reduce . Parallel geocomputation . Disaster response . Geospatial semantic web

Communicated by: H. A. Babaie
Published in the Special Issue "Intelligent GIServices" with Guest Editor Dr. Rahul Ramachandran

C. Zhang (*) : W. Li
Department of Geography & Center of Environmental Sciences and Engineering, University of Connecticut, Storrs, CT 06269-4148, USA
e-mail: [email protected]
W. Li e-mail: [email protected]

T. Zhao : K. Chen
Department of Computer Science, University of Wisconsin–Milwaukee, Milwaukee, WI 53201, USA
T. Zhao e-mail: [email protected]
K. Chen e-mail: [email protected]

L. Anselin
School of Geographical Sciences & Urban Planning, Arizona State University, Tempe, AZ 85287, USA
e-mail: [email protected]

Introduction

Disasters occur frequently around the world, caused either by the natural environment, such as earthquakes and floods, or by human activities, such as plane crashes, high-rise building collapses, or major nuclear facility malfunctions. Huge numbers of people have been affected, resulting in enormous damage to society. A more effective response can help save lives and reduce the damage and cost of dealing with disasters. Because disasters have a temporal and geographic footprint, geospatial data and tools are important in disaster response and management. Geospatial data have made it possible for disaster responders to obtain spatial information such as road accessibility, damaged areas, locations of injured people, and power outage maps. However, it remains challenging to integrate various spatial data, and it is difficult to find the needed geospatial information efficiently over the Web due to the required intensive geocomputational processes and the intrinsic data heterogeneity. Due to poorly adapted tools, training, and strategies, responders are increasingly ill-prepared to produce useful knowledge by querying spatial information from geographical knowledge bases that contain complex topological relationships. In addition, because the data needed for disaster response are collected and produced from different sources,


they face the heterogeneity problem. Disaster data are extremely heterogeneous, both syntactically and semantically (Zhang et al. 2010a, 2013). Experience suggests that the real barriers to emergency response and disaster management are not a lack of data as such (Donkervoort et al. 2008), but rather the difficulties in sharing and integrating heterogeneous data. Data sharing, facilitated by advances in network technologies, is hampered by the incompatibility of the variety of data models and semantics used at different sites (Ramachandran et al. 2004). The concept of the Geospatial Semantic Web was proposed to address these challenges and achieve automation in service discovery and execution (Peng and Zhang 2004; Yue 2013; Zhang et al. 2007, 2010b, c). The Geospatial Semantic Web can be seen as an extension of the current Web in which geospatial information is given well-defined meaning by ontologies, thereby enabling a semantic spatial query capability (Yue et al. 2007, 2009, 2011). However, performance remains a major challenge (Zhang et al. 2010a). Currently, a large amount of semantic data, chiefly in the context of Linked Open Data, is available in RDF (Resource Description Framework) format over the Web in many fields such as bioinformatics and the life sciences. A growing number of organizations and community-driven projects such as Wikipedia and Science Commons have begun exporting RDF data. According to the Linked Open Data Project (http://www.w3.org/wiki/SweoIG/TaskForces/CommunityProjects/LinkingOpenData), more than 52 billion triples had been published by March 2012. These RDF knowledge bases contain a large number of spatial entities (Hoffart et al. 2013). In fact, the RDF data model has recently been extended to support the representation and querying of spatial information. The recent OGC GeoSPARQL standard extends RDF and SPARQL to represent geographic information and support spatial queries (Battle and Kolas 2012).
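For illustration, a spatial entity encoded with the GeoSPARQL vocabulary might look like the following Turtle fragment (the geo: and sf: namespaces are from the OGC standard; the ex: URIs and the coordinates are hypothetical):

```turtle
@prefix geo: <http://www.opengis.net/ont/geosparql#> .
@prefix sf:  <http://www.opengis.net/ont/sf#> .
@prefix ex:  <http://example.org/disaster#> .

ex:HillCentralSchool a geo:Feature ;
    geo:hasGeometry ex:HillCentralGeom .
ex:HillCentralGeom a sf:Point ;
    geo:asWKT "POINT(-72.93 41.30)"^^geo:wktLiteral .
```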
For example, the USGS (United States Geological Survey) has published RDF triple data derived from The National Map to support spatial queries. The goal of the Geospatial Semantic Web is to use the Web as a large geospatial database to answer structured spatial queries from users. However, the huge RDF ontology datasets in the Geospatial Semantic Web, which contain all the topological relationships among the data, remain prohibitively expensive to query without the help of appropriate index structures or optimization techniques. Despite the large volume of work on querying large RDF knowledge bases (e.g. Weiss et al. 2008; Yuan et al. 2013), only a few studies focus on effectively handling spatial semantics in RDF data (e.g. Liagouris et al. 2014). Further, disaster response applications also require many users to concurrently access spatial databases through highly intensive geocomputation processes. Recent developments in distributed geographic information processing and the popularization of web and wireless devices have enabled massive


numbers of end users to access geospatial systems concurrently. However, spatial data objects are generally nested and complex, and spatial queries are based not only on the attributes of spatial objects but also on the spatial location, extent, and measurements of spatial objects contained in a reference geographical system. Spatial queries therefore require intensive disk I/O and spatial computation. This creates further challenges to conducting spatial queries concurrently, efficiently, and quickly. This research introduces a parallel approach for improving the query performance of geospatial ontologies using the Map-Reduce concept, a programming model and associated implementation for processing and generating large datasets (Dean and Ghemawat 2004). The approach focuses on parallelizing spatial join computations of GeoSPARQL queries. We expect that the results of this research will facilitate access to spatial information for multiple users through highly intensive geocomputation processes over the Web, particularly in the context of disaster response applications.
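As a minimal illustration of the Map-Reduce model just cited, the following is a single-process Python sketch of the concept (not the system described in this paper):

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Minimal Map-Reduce skeleton: map each record to (key, value)
    pairs, group the values by key, then reduce each group."""
    intermediate = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            intermediate[key].append(value)
    return {key: reducer(key, values) for key, values in intermediate.items()}

# Toy usage: count features per map layer (layer names are made up).
records = [("schools", "Hill Central"), ("streets", "Chapel St"), ("schools", "Wilbur Cross")]
counts = map_reduce(records, lambda r: [(r[0], 1)], lambda k, vs: sum(vs))
print(counts)  # {'schools': 2, 'streets': 1}
```

In a real Map-Reduce framework the mappers and reducers run on separate machines over data splits; the grouping-by-key step here stands in for the shuffle phase.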

The framework for the proposed approach

We propose a parallel approach to provide efficient geospatial ontology query processing over large spatial databases. Figure 1 illustrates the overall architecture of the proposed parallel approach using the Map-Reduce concept (Dean and Ghemawat 2004). Note that ontology is one of the pillars of the Geospatial Semantic Web and is normally written in a formal ontology language such as RDF or OWL (the Web Ontology Language, which extends the expressiveness that RDF provides). The Geospatial Semantic Web applications in this paper manage knowledge bases and data ontologies in the form of RDF. Therefore, the terms "ontologies" and "RDF ontologies" refer to the same thing and are used interchangeably in this paper. RDF is a simple model in which all data are in the form of triples. SPARQL is a query language for RDF data, and GeoSPARQL is a query language for requesting geospatial entities in RDF data. The geospatial data objects in RDF are first given spatial indices in order to speed up parallel spatial query processing. The RDF files are then broken into multiple blocks called splits. All splits are of the same size with the exception of the last one, and the split size is configurable per RDF file. Map-Reduce consists of two user-defined functions: a map function and a reduce function. Given a Map-Reduce task, each mapper is first assigned one or more splits, depending on the number of machines in the cluster. Then, each mapper reads its input as <key, value> pairs one at a time, applies the map function, and generates intermediate partial RDF ontologies (i.e. intermediate <key, value> pairs) as output. Finally, reducers


Fig. 1 The overall architecture of the proposed parallel approach

fetch the output of the mappers and merge all of the intermediate RDF ontology values that share the same intermediate key, process them together, and generate the final output: a single set of RDF ontologies. The approach focuses on parallelizing spatial join computations in GeoSPARQL queries. GeoSPARQL was proposed by the OGC (Open Geospatial Consortium) as a geographic query language for RDF data (OGC 11-052r4 2012). It extends SPARQL with a standard vocabulary for spatial information, query functions for spatial computation, and query rewriting rules to expand feature-feature queries into geometry queries (OGC 11-052r4 2012). In the proposed approach, as introduced in the concept architecture above, we first use spatial indices to partition the spatial objects referenced in an RDF ontology. Based on the data partitions, we then compute spatial joins in parallel to improve the performance of GeoSPARQL queries. Figure 2 shows the detailed query procedure using the proposed parallel approach. Four major components play important roles in the query procedure: a parser, a static analyzer, a query optimizer, and a spatial query parallelizer. GeoSPARQL is used as the query language to search for the needed geospatial information from heterogeneous data sources over the Web. The parser converts a GeoSPARQL query into an abstract syntax tree, which is checked by the static analyzer for potential errors. The result of the static

Fig. 2 The query procedure using the proposed parallel approach


analysis is an ordered collection of primitive sub-queries that are then processed by the optimizer so that they can be executed more efficiently. The optimizer divides the sub-queries into spatial and non-spatial queries. The spatial queries are potentially computationally intensive; therefore, they are parallelized based on pre-computed spatial indices, by splitting them into disjoint parallel query tasks that are answered independently. In the end, the results of the parallel query tasks are integrated with the results of the non-spatial queries to form the final answer to the original GeoSPARQL query. The main advantages of this framework are: (1) it is able to recognize and represent the implicit and explicit meaning of heterogeneous geospatial data content and can query geospatial data at the semantic level, so emergency response and disaster management applications can share and integrate interoperable data quickly; and (2) it improves the computing performance associated with intensive geospatial queries by utilizing the Map-Reduce concept, which is simple and effectively supports parallelism, so such applications can process spatial queries over massive volumes of spatial data within a reasonable amount of time. In the following sections, we introduce the primary technologies applied in the framework: GeoSPARQL, the query rewriting algorithms, and data/task parallelism.

GeoSPARQL

In the proposed framework, we use the OGC (Open Geospatial Consortium) GeoSPARQL approach for representing, accessing, and querying geospatial data on the Geospatial Semantic Web for disaster response. GeoSPARQL is an OGC standard for the representation and querying of geospatial linked data. It represents geospatial data using RDF ontologies, and it queries geospatial data by extending the general SPARQL query language to process geospatial data.
It supports both qualitative and quantitative spatial reasoning and querying from geospatial RDF data. Figure 3 illustrates the major components of GeoSPARQL (OGC 11-052r4 2012): core, geometry, topological vocabulary, geometry topology, query rewrite, and RDFS (RDF Schema) entailment. The Core component defines top-level RDFS/OWL (RDF Schema/ Web Ontology Language) classes for spatial objects. The Geometry component describes the geometry vocabulary and non-topological query functions for geometry objects. The Topological vocabulary component expresses RDF properties for asserting topological relations between spatial objects. The Geometry topology component identifies topological query functions. The Query rewrite component defines transformation rules for computing spatial relations between spatial objects based on their associated geometries. Finally, the RDFS entailment component


Fig. 3 Major components of GeoSPARQL (adapted from Zhao et al. 2014)

introduces a mechanism for matching implicit RDF triples that are derived from RDF and RDFS semantics. GeoSPARQL defines a small ontology to represent geospatial features and geometries. This ontology is fundamental to querying geospatial data. Specifically, geo:SpatialObject and geo:Feature are the two main ontology classes defined in GeoSPARQL to represent geospatial features. The single root geometry class geo:Geometry, together with the properties geo:hasGeometry and geo:defaultGeometry that associate geometries with geospatial features, is used to encode geometry information. These ontology classes can be connected to an ontology representing a domain of interest. GeoSPARQL also defines a number of topological and non-topological query predicates and functions to support queries of relationships between geospatial entities. It includes a set of terms for topological relations, such as geo:sfEquals, geo:sfDisjoint, geo:sfIntersects, geo:sfTouches, geo:sfCrosses, geo:sfWithin, geo:sfContains, and geo:sfOverlaps, which allow users to perform geospatial reasoning and formulate queries based on topological relations between spatial objects. Geospatial reasoning is critical for emergency response. For example, when concerned with damage around a given town, e.g., Mansfield, these relations allow questions such as "Which residential homes are contained within the damaged area of Mansfield?" to be answered efficiently. This query requires a topological comparison between the geometries of the residential homes and the geometry of the damaged area of Mansfield. The property geo:hasGeometry can be used to connect the two geospatial features to their geometries, and the topological relation geo:sfWithin can be used


to evaluate the topological relationships. The following sample code carries out the example query:
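A query of this form might be written as follows (an illustrative sketch only: the ex: classes and the ex:town property are assumed rather than taken from an actual dataset; geo: is the GeoSPARQL namespace):

```sparql
PREFIX geo: <http://www.opengis.net/ont/geosparql#>
PREFIX ex:  <http://example.org/disaster#>

SELECT ?home
WHERE {
  ?home a ex:ResidentialHome ;
        geo:hasGeometry ?homeGeom .
  ?area a ex:DamagedArea ;
        ex:town ex:Mansfield ;
        geo:hasGeometry ?areaGeom .
  ?homeGeom geo:sfWithin ?areaGeom .
}
```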

GeoSPARQL also supports non-topological query functions such as geof:distance, geof:buffer, geof:convexHull, geof:intersection, geof:union, geof:difference, geof:symDifference, geof:envelope, and geof:boundary. These allow users to make inferences and to link multiple datasets together to solve a given problem. In emergency response scenarios, disaster responders typically need to combine multiple data sources. For example, consider a scenario in which a hurricane has struck the Town of Groton. To take immediate rescue actions, the emergency responders need to find evacuation routes, and the evacuation routes must not go through possibly flooded areas. They therefore need to combine data such as transportation road data, non-flooded areas, and the political boundaries of the Town of Groton to identify potential evacuation routes. The non-topological function geof:union can be used to find all route features (?r) that touch the union of the non-flooded-areas feature (?flood) and the political boundary feature of the Town of Groton (?Groton). The following sample code implements the associated query:
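Such a query might be sketched as follows (illustrative only: the ex: classes and ex:townName property are assumed; geof: is the GeoSPARQL function namespace, and the geometries are compared via their WKT serializations):

```sparql
PREFIX geo:  <http://www.opengis.net/ont/geosparql#>
PREFIX geof: <http://www.opengis.net/def/function/geosparql/>
PREFIX ex:   <http://example.org/disaster#>

SELECT ?r
WHERE {
  ?r      a ex:Route ;          geo:hasGeometry/geo:asWKT ?rWKT .
  ?flood  a ex:NonFloodedArea ; geo:hasGeometry/geo:asWKT ?fWKT .
  ?Groton a ex:TownBoundary ;   ex:townName "Groton" ;
          geo:hasGeometry/geo:asWKT ?gWKT .
  FILTER (geof:sfTouches(?rWKT, geof:union(?fWKT, ?gWKT)))
}
```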

GeoSPARQL provides a way to link geospatial datasets together and extract more meaning from the spatial relations among the data. All the ontology classes and functions are derived from OGC standards, which ensures the interoperability of geospatial data. GeoSPARQL allows data to be properly indexed and queried from spatial RDF stores. In addition, it is intended to interoperate with both quantitative and qualitative spatial reasoning systems (Battle and Kolas 2012). With a quantitative spatial reasoning system, GeoSPARQL explicitly calculates distances and topological relations among concrete geometries of features. With a qualitative geospatial reasoning system, GeoSPARQL allows RCC (Region Connection Calculus) type topological inferences for features whose geometries are either unknown or cannot be made concrete (Grütter and Bauer-Messmer 2007). For example, if there are assertions that a hospital is inside the Town of Groton, and Groton is within a flooded area, a qualitative reasoning system should be able to infer through transitivity that the hospital is within the flooded area. In general, GeoSPARQL is a minimal vocabulary for the storage and querying of geospatial information; it represents geospatial features, geometries, and the spatial relationships between them, and is intended to be simple enough for Linked Open Data.
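The transitivity inference in the hospital example can be sketched as a simple fixed-point computation (an illustration of the reasoning pattern only, not an RCC implementation):

```python
def transitive_within(assertions):
    """Naive fixed-point computation of the transitive closure of a
    'within' relation, as a qualitative reasoner might derive it."""
    closure = set(assertions)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))  # within is transitive: a in b, b in d => a in d
                    changed = True
    return closure

facts = {("hospital", "Groton"), ("Groton", "flooded_area")}
print(("hospital", "flooded_area") in transitive_within(facts))  # True
```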

Query rewriting algorithms and data/task parallelism

To further improve the performance of answering GeoSPARQL queries, we store spatial and non-spatial data separately. The spatial data are stored in databases with spatial indices, while non-spatial data are stored in ontology databases. The spatial data do not need to be in ontology form, since we can define SPARQL extensions to translate ontology queries into spatial database queries. In this manner, we can carry out the spatial and non-spatial queries in parallel and then join the query results. The crucial question is how to parse and translate the original GeoSPARQL query into separate sub-queries that can be processed in parallel by distinct data sources. To address this issue, we employ algorithms from our previous work (Zhao et al. 2008), in which we developed a query answering algorithm with backtracking to find answers that satisfy specified constraints. As illustrated in Fig. 4, the query rewriting algorithms have two parts. The first part applies inference rules to the body of a GeoSPARQL query so that RDF triples with object properties are replaced by RDF triples with datatype properties; an inference rule i is applicable to a triple t if i.head matches t via a variable substitution s such that s(i.head) = t. The second part rewrites the resulting query into WFS getFeature requests. The basic idea of the query rewriting algorithms is to first reduce the query statements (in the form of triples) into more basic triples using a set of inference rules. A triple has the form: subject predicate object. The subject is an ontology instance, and the predicate is either an attribute of the instance or a relation between the subject and the object. The next step is to group the resulting triples by the subject of each triple. The last step is to replace each group of triples in the query with the data source schema (e.g. a WFS feature type) mapped to that group. The resulting query can then be partitioned into sub-queries.
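These three steps can be sketched in Python as follows (a highly simplified illustration: rules are matched on predicate names only, whereas the actual algorithm uses full variable substitution, and the WFS mapping here is a hypothetical dictionary):

```python
from collections import defaultdict

def rewrite(triples, rules, mappings):
    """Sketch of the rewriting idea: (1) replace each triple whose
    predicate matches a rule head with the rule's body predicates,
    (2) group the resulting triples by subject, (3) map each group
    to a data-source schema (e.g. a WFS feature type)."""
    expanded = []
    for s, p, o in triples:
        for head, body in rules:
            if p == head:  # simplified match on the predicate name
                expanded.extend((s, bp, o) for bp in body)
                break
        else:
            expanded.append((s, p, o))
    groups = defaultdict(list)
    for s, p, o in expanded:
        groups[s].append((p, o))
    return {s: mappings.get(s, "unknown") for s in groups}

# Toy usage with made-up rules and feature-type names.
rules = [("nearSchool", ["hasGeometry"])]
triples = [("?street", "nearSchool", "?school"), ("?school", "type", "School")]
print(rewrite(triples, rules, {"?street": "wfs:Streets", "?school": "wfs:Schools"}))
```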
Each sub-query will be sent to one or more eligible data sources until either it is answered or it has failed on all sources, in which case the user query cannot be answered. If all sub-queries are answered, then the combined results are the answer to the user query. This algorithm is similar to a


Fig. 4 Query rewriting algorithms (adapted from Zhao et al. 2008)

Input: target query q; a set of inference rules I

For each triple t in q.body
    For each inference rule i in I
        If there exists a substitution s such that s(i.head) = t
        Then replace t in q.body with s(i.body)
    End for
End for

Output: q' where q'.body has only triples in the RDF mapping

(a) Rewrite GeoSPARQL queries to WFS queries, part 1: apply inference rules to the GeoSPARQL query

Input: target query q; a set of mapping rules M
Initialize: apply part 1 to q to obtain q'; group the triples in q'.body by subject name, resulting in a set of triple groups L

For each triple group t in L
    For each mapping rule m in M
        If there exists a substitution s such that s(m.body) contains all triples in t
        Then replace t in q' with s(m.head)
    End for
End for

Output: q' where q'.body contains WFS getFeature requests

(b) Rewrite GeoSPARQL queries to WFS queries, part 2: query rewriting

local-as-view approach used for query answering in distributed database systems. Another performance bottleneck of GeoSPARQL query answering is the processing of the filter components of queries. Spatial filters are often used in GeoSPARQL queries to refine the query results; they compute quantitative data, such as distances between spatial objects, or qualitative data, such as whether two spatial objects overlap. The computation of the spatial filter functions may involve a large number of spatial objects. For example, if we want to find all groups of three schools that are within a certain distance of each other, we may need to filter O(N^3) school groups if there are N schools. One way to improve performance is to reduce the number of spatial objects that need to be processed by the spatial filters; however, such optimization requires prior knowledge of the query costs. An alternative is to parallelize the filtering computation by dividing the spatial objects into approximately equal-sized patches based on spatial indices. The filtering functions are then processed on each set of spatial objects in parallel, after which the results are combined. This approach is a form of data parallelism, in which we have to take care to process the spatial objects close to the borders of the patches. Consider the earlier example of finding groups of three schools that are next to each other. If we divide all the schools into patches based on

spatial locations where each patch has the same number of schools, we can find the three schools that are next to each other within each patch, and we also consider the schools that are close to the borders of the patches. The queries within each patch and around the borders of the patches can be processed in parallel. In general, our approach takes advantage of the inherent data and task parallelism of spatial queries by translating each user query into sub-queries that can be executed on multiple spatial data servers in parallel. Data parallelism exists in a spatial query when the same query task is performed on partitioned data in parallel without the need to synchronize. Task parallelism exists in a spatial query when the query task is transformed into several sub-query tasks that are performed independently on parallel data servers. A spatial query that exhibits data or task parallelism can be answered more efficiently with parallel processors or distributed servers. A restricted form of data/task parallelism is supported by the Map-Reduce programming model (Dean and Ghemawat 2004), in which independent tasks (mappers) are computed in parallel on partitioned data and the results of the mappers are aggregated by reducers into the final outcome. However, existing Map-Reduce frameworks such as Hadoop incur high overhead for join queries (Husain et al. 2009).


Performance evaluation

Recently Connecticut, especially its coastal areas, has experienced several disasters that resulted in severe flooding. Although local governments have recognized that GIS and spatial information play an important role in flooding disaster response and recovery, it is difficult for emergency responders to quickly search for the needed information from multiple sources because of the heterogeneity of GIS databases. To share these heterogeneous GIS databases at the semantic level, we published the original formats of GIS data (Shapefiles and PostGIS) using distributed WFS servers (GeoServer) as WFS services across different sources. We used the previously mentioned query rewriting algorithm to convert GeoSPARQL queries into WFS requests to the WFS services hosted by the GeoServers. The results of the queries, which are in JSON (JavaScript Object Notation) format, were then converted into RDF files by a Java program and loaded into memory through the Jena library API. In this environment we conducted a limited set of experiments on parallelizing GeoSPARQL queries using the RDF files. The experiments were run on a workstation with an Intel Core i5-3320M CPU at 2.6 GHz. The queries were executed both sequentially and in parallel using a Java program; the parallel version of the program used a shared memory model in which all threads have access to the same memory. Here we show some of the evaluation results of one experiment that we conducted for New Haven, Connecticut. The experiment was conducted on a dataset consisting of two map layers, one for schools and one for streets, illustrated in Fig. 5. It contains 54 schools, shown as red squares, and 3,449 streets, shown as blue lines. In this experiment, we executed two queries: one selects the nearby streets of each school in New Haven, and the other selects the nearby highways of each high school. The first query involves many more spatial features and thus takes much longer to complete.

We used the Jena library API to load the ontology that contains the spatial features of New Haven and to answer the sequential GeoSPARQL queries. The schools (including high schools) are point features, while the streets (including highways) are polyline features. To decide whether there is a "nearby" relation between a point and a polyline, we defined a customized GeoSPARQL function plugged into the Jena API to compute the distances between the spatial features. Below are the query Q1, to find the nearby streets of each school, and the query Q2, to find the nearby highways of high schools.

To parallelize the query, we separated the query statements into two sub-queries: the first consists of the triple statements, such as "?school rdf:type ct:school", and the second consists of the filter statements. In this experiment, the first sub-query was executed in parallel by sending triples with the same subject variable to the same thread. The final results of the triple-statement threads were aggregated into variable tuples. A tuple is an anonymous record of several variables, and tuples can be nested. In this case, the results of the triple-statement queries are tuples of the form ((?school, ?g1), (?street, ?g2)). The resulting tuples from the triple sub-query threads were then processed using the filter statements. We executed the filter sub-query in parallel by dividing the tuples into equal proportions and sending them to each thread. For this sub-query, we divided the tuples into p equal-sized blocks, each containing N/p streets, where N is the number of streets and p is the number of threads. While the number of threads used to execute the triple sub-query is bounded by the number of subject variables involved, the number of threads used to execute the filter sub-query is not bounded. We observed that the runtime of a GeoSPARQL query may be dominated by the filter sub-query. Therefore, it is possible to reduce the execution time of a query by increasing the number of threads executing the filter functions.
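The blocking scheme for the filter sub-query described above can be sketched as follows (a Python illustration; the actual system uses Java threads over Jena result tuples, and the distance predicate here is a stand-in for the customized "nearby" function):

```python
from concurrent.futures import ThreadPoolExecutor
from math import hypot

def parallel_filter(tuples, predicate, p):
    """Split the variable tuples into p roughly equal-sized blocks and
    apply the (expensive) filter predicate to each block on its own thread."""
    n = len(tuples)
    size = max(1, (n + p - 1) // p)  # ceiling division: block size ~= N/p
    blocks = [tuples[i:i + size] for i in range(0, n, size)]
    with ThreadPoolExecutor(max_workers=p) as pool:
        parts = pool.map(lambda b: [t for t in b if predicate(t)], blocks)
    return [t for part in parts for t in part]

# Toy usage: keep (school, street) coordinate pairs closer than 2 units.
pairs = [((0, 0), (1, 1)), ((0, 0), (5, 5)), ((2, 2), (2, 3))]
near = parallel_filter(
    pairs, lambda t: hypot(t[0][0] - t[1][0], t[0][1] - t[1][1]) < 2, p=2)
print(len(near))  # 2
```

Because the blocks are disjoint and the predicate is applied independently to each tuple, no synchronization between threads is needed, which is the data-parallelism property the text relies on.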

To avoid the high overhead that existing Map-Reduce frameworks such as Hadoop incur for join queries (Husain et al. 2009), we implemented the parallel execution of GeoSPARQL queries directly in the Java language using the Map-Reduce concept.


Fig. 5 The map layers of schools and streets in New Haven, CT

Nevertheless, the overall runtime improvement of the parallel execution is limited by the underlying parallel architecture and the overhead of threading. Table 1 shows the runtime statistics of the query Q1, where the sequential query time is over 4 s. If we divided the query into a sub-query of triple statements and a sub-query of filter statements and executed the two sub-queries in sequence, the total runtime was reduced substantially, to 488-759 milliseconds (ms), depending on how many threads were used to run the filter statements. The triple sub-queries were run on two threads only and took 56 ms in total; since the number of schools is much smaller than the number of streets, the thread querying streets took 54 ms while the thread querying schools took only 6 ms. We ran the filter sub-query using 1, 2, 4, and 8 threads; the average runtime of each thread decreased as expected, but the total runtime did not decrease significantly. This may be due to the fact that the experiment was run on an Intel i5 CPU with 2 cores and 4 hyper-threads. As we increased the number of Java threads, the runtime per thread decreased as expected, but there was no further performance gain once the number of threads was increased beyond 4. The runtime statistics of the query Q2 are shown in Table 2, where the runtime of the filter sub-query is comparable to that of the triple sub-queries. The total runtime of the two sub-queries ranges from 27 to 30 ms, substantially lower than the 256 ms runtime of the sequential GeoSPARQL query. However, as we increased the number of threads running the filter-statement sub-query, we did not obtain a significant performance gain, because the overhead of threading becomes a large portion of the workload in each thread.

Discussions

Although the experimental results are limited, they offer some insights into how the execution of GeoSPARQL queries can be improved through parallel processing. From the runtime statistics, we learned that the biggest performance gain was achieved by separating the triple statements from the filter statements that involve spatial computations, which tend to be very time consuming. Spatial indexing can help partition the inputs to filter functions for parallel processing. This provided an additional performance gain, since the inputs to filter functions are often Cartesian products of several sets of geometries, which can be very large. Partitioning the inputs to filter functions can therefore reduce the input size and runtime costs significantly. We also learned that the performance of evaluating triple statements can be improved through parallelization, although the gain may not be significant compared to that of the filter statements.

By increasing the number of threads used in processing the filter statements, we were able to reduce the total runtime, subject to the limitations of the underlying architecture. In this experiment, the CPU has 4 hyper-threads, which capped the performance gain at that of 4 threads. When we increased the number of threads beyond 4, we observed that the average runtime of each thread decreased but the total runtime remained stable. This, however, was not the main bottleneck: once the runtime of the filter sub-query was reduced, the runtime of the triple sub-queries again dominated the total runtime. It is not straightforward to speed up the triple sub-queries through parallelization, since we can assign at most one thread to one triple statement. To further improve


Table 1  Runtime statistics of Q1, finding the nearby streets of each school in New Haven (54 schools, 3,449 streets)

                            Sequential   Triple sub-queries   Filter sub-queries
                            query        (2 threads)          1 thread   2 threads   4 threads   8 threads
  Average time per thread   -            30 ms                701 ms     459 ms      408 ms      206 ms
  Total time of threads     -            56 ms                703 ms     460 ms      432 ms      434 ms
  Total time                4,086 ms     -                    759 ms     516 ms      488 ms      490 ms

(In the triple sub-queries, the thread querying schools took 6 ms and the thread querying streets took 54 ms. Each parallel total time is the 56 ms of the triple sub-queries plus the elapsed time of the filter sub-queries.)

performance of triple sub-queries, we need to partition the RDF model, which is an issue needed for further research. A lot of geospatial information is available on the Web for disaster response applications, but the current Web search engines are not yet smart enough to understand and answer the queries that disaster responders requested. Geospatial Semantic Web technologies such as RDF ontologies and GeoSPARQL make spatial data to be retrieved and understood by both human and machine, thus spatial data can be shared and utilized more efficiently by disaster response applications. Geospatial Semantic Web can deal with semantic heterogeneity problems for disaster response applications by using RDF ontologies. To make full use of Geospatial Semantic Web for disaster response applications, it is necessary to be able to compute spatial relationships from the geometry of objects to support ontology reasoning and knowledge obtaining from a variety of data sources. For simplicity reason, in this study we used a single-domain ontology to ensure semantic interoperability. However, it is impractical to develop a global ontology for all disaster management applications that support the tasks envisaged by a distributed environment like the Geospatial Semantic Web. To overcome this problem, a distributed localresponsibility service infrastructure, which is an environment with multiple independent systems and each system has its own local ontology, may be used. This distributed localresponsibility service infrastructure approach, however, brings the possibilities of conflicts and mismatches among different

local ontologies. Thus, it is necessary to develop algorithms to integrate heterogeneous local ontologies. In fact, in our previous studies we developed a Partition-Refinement algorithm (Zhang et al. 2010c) for integrating heterogeneous ontologies. The main advantage of the partition refinement algorithm is that it finds matching ontology classes and properties based on their structures. Because it makes full use of the structures of the ontologies being mapped, it allows translation of instances between different ontologies. Further, the partition refinement algorithm can handle recursive structures efficiently.

However, the currently implemented single-machine Geospatial Semantic Web systems are unlikely to handle large geographical knowledge bases efficiently. A huge ABox containing all the topological relationships of the spatial data would be prohibitively expensive to query without the help of appropriate index structures or parallel techniques. Unlike structured data, which can be handled repeatedly through a relational database management system (RDBMS), semi-structured data such as RDF data may call for ad hoc, one-time extraction, parsing, processing, indexing, and analytics in a scalable and distributed environment (Kulkarni 2010). Increasingly, distributed systems such as Cloud Computing (Cui et al. 2010; Liu et al. 2009), cyberGIS (Wang 2010; Wang et al. 2013), spatial cyberinfrastructure (Wright and Wang 2011), or geospatial cyberinfrastructure have been suggested as a solution to overcome the scalability and

Table 2  Runtime statistics of Q2, finding the nearby highways of each high school in New Haven (9 high schools, 313 highways)

                            Sequential   Triple sub-queries   Filter sub-queries
                            query        (2 threads)          1 thread   2 threads   4 threads   8 threads
  Average time per thread   -            9 ms                 15 ms      13 ms       6.5 ms      4.2 ms
  Total time of threads     -            15 ms                15 ms      14 ms       12 ms       12 ms
  Total time                256 ms       -                    30 ms      29 ms       27 ms       27 ms

(In the triple sub-queries, the thread querying high schools took 4 ms and the thread querying highways took 14 ms.)


performance problems of the currently implemented Geospatial Semantic Web systems. For example, Cloud Computing is a recent paradigm developed to search, access, and utilize large volumes of geospatial data for many geospatial science applications. In this context, Hadoop is an emerging Cloud Computing tool, supported by Amazon, the leading Cloud Computing hosting company, for searching, accessing, and utilizing large volumes of geospatial data. Hadoop provides a distributed file system in which files are saved with replication, together with an implementation of the Map-Reduce programming model for parallel processing of large amounts of data (Mazumdar 2011). However, Hadoop has high overhead for join queries and is restrictive in how data are assigned to parallel tasks and how the tasks are synchronized. These limitations make it more suitable for batch processing jobs than for real-time spatial join queries (Sun and Jin 2010).

To facilitate real-time spatial join queries and make the Geospatial Semantic Web scale well, we proposed a novel parallel approach to retrieve spatial data over the Geospatial Semantic Web for disaster response applications. The proposed approach takes a GeoSPARQL query from the user and passes it to parallel processes, which are designed and managed based on the Map-Reduce concept. The proposed parallel approach makes full use of data/task parallelism for concurrent spatial queries. We expect that the data/task parallel processing architecture will reduce individual spatial query execution time by taking advantage of parallel and distributed processes. In the context of disaster response applications, this opens up the possibility of handling a large number of concurrent spatial queries.

The proposed approach has several advantages. First, it is well suited to processing spatial data in parallel. Because the parallel processes run on independent data blocks, and processing one piece of data does not depend on the outcome of any other piece, we can easily perform spatial join queries, which are the common queries for discovering the needed information from diverse sources. Second, because our approach uses the Java language to provide the parallel computing environment, the partitioning, synchronization, and aggregation of tasks and data are performed at the language level. This reduces the high communication overhead required for spatial join queries and offers more flexibility in how data are assigned to parallel tasks and how the tasks are synchronized. Moreover, with the Java language implementation, we are able to implement more flexible parallel GeoSPARQL query algorithms. Finally, the proposed approach has a scalability advantage: the proposed system can not only handle RDF triples efficiently, it can also handle the addition of users, data, and tasks without significantly affecting performance.
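As a simplified illustration of this language-level advantage (the point representation and the distance predicate below are assumptions, not our actual classes), a spatial join can be expressed with Java parallel streams, where the runtime itself handles the partitioning of work, the synchronization of workers, and the aggregation of results:

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class StreamJoin {
    // Join: all (i, j) index pairs whose points lie within maxDist of each
    // other. The parallel stream splits the outer index range across cores;
    // collect() aggregates the partial results in encounter order.
    public static List<int[]> join(List<double[]> left, List<double[]> right, double maxDist) {
        return IntStream.range(0, left.size()).parallel().boxed()
            .flatMap(i -> IntStream.range(0, right.size())
                .filter(j -> dist(left.get(i), right.get(j)) <= maxDist)
                .mapToObj(j -> new int[]{i, j}))
            .collect(Collectors.toList());
    }

    static double dist(double[] a, double[] b) {
        return Math.hypot(a[0] - b[0], a[1] - b[1]);
    }
}
```

No explicit thread management or shared mutable state appears anywhere in this code, which is the flexibility the language-level approach buys compared with a framework such as Hadoop.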

Earth Sci Inform (2015) 8:499–509

Further study is needed to improve the proposed approach. Even with this parallel approach, significant performance degradation can still occur because of the need for communication between compute nodes. This study conducted only a limited set of experiments on the collected databases to test the proposed parallel approach. To implement a workable system across the Web, more studies are needed. The ultimate goal of a workable system for disaster applications is to allow disaster responders to transparently query geospatial information from diverse sources with incompatible data formats and semantics across the Web. However, many components of such a system have not been included in this study. For example, how should an ontology-based semantic web crawler be designed to mine information from separate data sources connected over the Web? As another example, the current partitioning scheme does not consider the distribution of the spatial data: how can we produce a balanced partitioning for a skewed distribution of the data? Similarly, how can we build a cloud service or cyberinfrastructure based on the proposed framework for real-world disaster response? How scalable will the system be when we add different types of spatial data? Will the delay in answering a spatial query increase linearly, or in other patterns, with the increase in spatial data size? With the potentially exponential rise in the amount of spatial data stored across various services, how does one keep track of data ownership and data quality? These are some of the challenges we face in further improving the proposed approach for real-world disaster response applications.
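To make the partitioning question concrete, the sketch below shows a uniform-grid spatial index of the kind used to partition filter inputs (the data layout is an illustrative assumption, not our actual index structure). Each point is hashed to a grid cell, and a query probes only the 3x3 block of cells around it; under a skewed distribution, however, some cells collect far more points than others, which is precisely the load-balancing problem raised above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class GridIndex {
    private final double cell;                            // cell edge length
    private final Map<Long, List<double[]>> buckets = new HashMap<>();

    public GridIndex(double cellSize) { this.cell = cellSize; }

    // Collision-free packing of a pair of 32-bit cell indices into a long.
    private static long pack(long ix, long iy) {
        return (ix << 32) ^ (iy & 0xffffffffL);
    }

    public void insert(double x, double y) {
        long k = pack((long) Math.floor(x / cell), (long) Math.floor(y / cell));
        buckets.computeIfAbsent(k, unused -> new ArrayList<>()).add(new double[]{x, y});
    }

    // Candidate points near (x, y): for a distance threshold no larger than
    // the cell size, scanning the 3x3 cell neighborhood covers every point
    // that close; an exact predicate then filters the candidates.
    public List<double[]> candidates(double x, double y) {
        long ix = (long) Math.floor(x / cell), iy = (long) Math.floor(y / cell);
        List<double[]> out = new ArrayList<>();
        for (long dx = -1; dx <= 1; dx++)
            for (long dy = -1; dy <= 1; dy++)
                out.addAll(buckets.getOrDefault(pack(ix + dx, iy + dy), List.of()));
        return out;
    }
}
```

A balanced scheme for skewed data would have to split dense cells or use a data-driven structure (e.g., an R-tree-style partition) instead of a fixed grid.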

Conclusions

In the context of disaster response, obtaining spatial information quickly from disparate sources is a critical need. Although advances in the Geospatial Semantic Web facilitate geospatial data sharing for disaster response, performance issues still hamper the efficient and effective utilization of spatial information. This study proposes a novel parallel approach for improving query performance in the Geospatial Semantic Web. It employs a Map-Reduce concept and uses spatial indices to partition the spatial objects referenced in an RDF ontology, followed by the computation of spatial joins in parallel, thereby improving the performance of GeoSPARQL. The initial experimental results show that the proposed parallel approach can improve GeoSPARQL query performance through the combination of spatial indexing and data/task parallelism. The proposed parallel approach can reduce the response time of computationally intensive spatial queries through parallel processing. In the future, we may consider combining the proposed parallel approach with other methods, such as compression and caching, to further improve GeoSPARQL query


performance. The proposed approach can also be adapted to existing Cloud computing services or cyberGIS to speed up GeoSPARQL queries. The proposed parallel approach may improve GeoSPARQL query performance by making full use of high-performance computer servers distributed over a wide area network. With its merits of job parallelism, the proposed parallel approach may provide an effective solution to the spatial data analysis challenges of disaster response.

Acknowledgments  Anselin's research was supported in part by award OCI-1047916, SI2-SSI from the U.S. National Science Foundation.

References

Battle R, Kolas D (2012) Enabling the geospatial semantic web with Parliament and GeoSPARQL. http://www.semantic-web-journal.net/sites/default/files/swj176_3.pdf. Accessed Jan 2014
Cui D, Wu Y, Zhang Q (2010) Massive spatial data processing model based on cloud computing model. In: Proceedings of the Third International Joint Conference on Computational Sciences and Optimization, IEEE Computer Society, Los Alamitos, CA, pp 347–350, 28–31 May 2010, Huangshan, Anhui, China
Dean J, Ghemawat S (2004) MapReduce: simplified data processing on large clusters. https://www.usenix.org/legacy/events/osdi04/tech/full_papers/dean/dean.pdf. Accessed Jan 2014
Donkervoort S, Dolan SM, Beckwith M, Northrup TP, Sozer A (2008) Enhancing accurate data collection in mass fatality kinship identifications: lessons learned from Hurricane Katrina. Forensic Sci Int Genet 2(4):354–362
Grütter R, Bauer-Messmer B (2007) Combining OWL with RCC for spatioterminological reasoning on environmental data. http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-258/paper17.pdf. Accessed Jan 2014
Hoffart J, Suchanek FM, Berberich K, Weikum G (2013) YAGO2: a spatially and temporally enhanced knowledge base from Wikipedia. Artif Intell 194:28–61
Husain MF, Doshi P, Khan L, Thuraisingham B (2009) Storage and retrieval of large RDF graph using Hadoop and Map-Reduce. In: Jaatun MG, Zhao G, Rong C (eds) CloudCom 2009, LNCS 5931, pp 680–686
Kulkarni P (2010) Distributed SPARQL query engine using MapReduce. http://www.inf.ed.ac.uk/publications/thesis/online/IM100832.pdf. Accessed Jan 2014
Liagouris J, Mamoulis N, Bouros P, Terrovitis M (2014) An effective encoding scheme for spatial RDF data. http://www.vldb.org/pvldb/vol7/p1271-liagouris.pdf. Accessed Aug 2014
Liu Y, Guo W, Jiang W, Gong J (2009) Research of remote sensing service based on cloud computing mode. Appl Res Comput 26(9):3428–3431
Mazumdar S (2011) Complex SPARQL query engine for Hadoop MapReduce. http://www.csi.ucd.ie/files/u1450/SM_Query_RDf.ps. Accessed Jan 2014
OGC 11-052r4 (2012) OGC GeoSPARQL – a geographic query language for RDF data. http://www.opengis.net/doc/IS/geosparql/1.0. Accessed Jan 2014
Peng ZR, Zhang C (2004) The roles of geography markup language, scalable vector graphics, and web feature service specifications in the development of internet geographic information systems. J Geogr Syst 6(2):95–116
Ramachandran R, Graves S, Conover H, Moe K (2004) Earth Science Markup Language (ESML): a solution for scientific data-application interoperability problem. Comput Geosci 30(1):117–124
Sun J, Jin Q (2010) Scalable RDF store based on HBase and MapReduce. In: Proceedings of Advanced Computer Theory and Engineering (ICACTE), pp 633–636, 20–22 Aug 2010. doi:10.1109/ICACTE.2010.5578937
Wang S (2010) A cyberGIS framework for the synthesis of cyberinfrastructure, GIS and spatial analysis. Ann Assoc Am Geogr 100(3):535–557
Wang S, Anselin L, Badhuri B, Crosby C, Goodchild M, Liu Y, Nyerges T (2013) CyberGIS software: a synthetic review and integration roadmap. Int J Geogr Inf Sci. doi:10.1080/13658816.2013.776049
Weiss C, Karras P, Bernstein A (2008) Hexastore: sextuple indexing for semantic web data management. PVLDB 1(1):1008–1019
Wright D, Wang S (2011) The emergence of spatial cyberinfrastructure. Proc Natl Acad Sci 108(14):5488
Yuan P, Liu P, Wu B, Jin H, Zhang W, Liu L (2013) TripleBit: a fast and compact system for large scale RDF data. PVLDB 6(7):517–528
Yue P (2013) Semantic web-based intelligent geospatial web services. Springer
Yue P, Di L, Yang W, Yu G, Zhao P (2007) Semantics-based automatic composition of geospatial web service chains. Comput Geosci 33(5):649–665
Yue P, Di L, Yang W, Yu G, Zhao P, Gong J (2009) Semantic Web Services-based process planning for earth science applications. Int J Geogr Inf Sci 23(9):1139–1163
Yue P, Gong J, Di L, He L, Wei Y (2011) Integrating semantic web technologies and geospatial catalog services for geospatial information discovery and processing in cyberinfrastructure. GeoInformatica 15:273–303
Zhang C, Li W, Zhao T (2007) Geospatial data sharing based on geospatial semantic web technologies. J Spat Sci 52(2):11–25
Zhang C, Zhao T, Li W (2010a) Automatic search of geospatial features for disaster and emergency management. Int J Appl Earth Obs Geoinf 12(6):409–418
Zhang C, Zhao T, Li W, Osleeb J (2010b) Towards logic-based geospatial feature discovery and integration using web feature service and geospatial semantic web. Int J Geogr Inf Sci 24(6):903–923
Zhang C, Zhao T, Li W (2010c) A framework for geospatial semantic web based spatial decision support system. Int J Digit Earth 3(2):111–134
Zhang C, Zhao T, Li W (2013) Towards improving query performance of Web Feature Services (WFS) for disaster response. ISPRS Int J Geo-Inf 2:67–81
Zhao T, Zhang C, Wei M, Peng Z-R (2008) Ontology-based geospatial data query and integration. Lecture Notes in Computer Science 5266 (Geographic Information Science):370–392
Zhao T, Zhang C, Anselin L, Li W, Chen K (2014) A parallel approach for improving Geo-SPARQL query performance. Int J Digit Earth (in press)
