Geoinformatica DOI 10.1007/s10707-008-0055-2

Web data retrieval: solving spatial range queries using k-nearest neighbor searches Wan D. Bae · Shayma Alkobaisi · Seon Ho Kim · Sada Narayanappa · Cyrus Shahabi

Received: 24 October 2007 / Revised: 25 July 2008 / Accepted: 28 August 2008 © Springer Science + Business Media, LLC 2008

Abstract As Geographic Information Systems (GIS) technologies have evolved, more and more GIS applications and geospatial data have become available on the web. Spatial objects in a given query range can be retrieved using a spatial range query, one of the most widely used query types in GIS and spatial databases. However, it can be challenging to retrieve these data from various web applications where access to the data is only possible through restrictive web interfaces that support only certain types of queries. A typical scenario is the existence of numerous business web sites that provide their branch locations through a limited "nearest location" web interface. For example, a chain restaurant's web site such as McDonalds can be queried to find some of the closest branch locations to the user's home address. However, even though the site has the location data of all restaurants in, for example, the state of California, it is difficult to retrieve the entire data set efficiently through such a restrictive web interface. Considering that k-Nearest Neighbor (k-NN) search is one of the most popular web interfaces for accessing spatial data on the web, this paper investigates the problem of retrieving geospatial data from the web for a given spatial range query using only k-NN searches. Based on a classification of the k-NN interfaces on the web, we propose a set of range query algorithms that completely cover the rectangular query range (completeness) while minimizing the number of k-NN searches (efficiency). We evaluated the efficiency of the proposed algorithms through statistical analysis and empirical experiments using both synthetic and real data sets.

Keywords Range queries · k-Nearest neighbor queries · Web data · Web interfaces · Web integration · GIS

The authors' work is supported in part by the National Science Foundation under award numbers IIS-0324955 (ITR), EEC-9529152 (IMSC ERC) and IIS-0238560 (PECASE), and in part by unrestricted cash gifts from Microsoft and Google. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

W. D. Bae, Department of Mathematics, Statistics and Computer Science, University of Wisconsin-Stout, Menomonie, WI, USA. e-mail: [email protected]
S. Alkobaisi, College of Information Technology, United Arab Emirates University, Al-Ain, United Arab Emirates. e-mail: [email protected]
S. H. Kim, Department of Computer Science and Information Technology, University of the District of Columbia, Washington, DC, USA. e-mail: [email protected]
S. Narayanappa, Department of Computer Science, University of Denver, Denver, CO, USA. e-mail: [email protected]
C. Shahabi, Department of Computer Science, University of Southern California, Los Angeles, CA, USA. e-mail: [email protected]

1 Introduction

Due to recent advances in Geographic Information Systems (GIS) technologies and the emergence of diverse web applications, a large amount of geospatial data has become available on the web. For example, numerous businesses release the locations of their branches on the web; the web sites of government organizations such as the U.S. Postal Service provide lists of offices close to one's residence; and various non-profit organizations publicly post large amounts of geospatial data for different purposes. There is a growing number of GIS applications, such as information mediators, that crawl or wrap various web sources into one integrated solution for their users. As an example, a fast food company may need to know the distribution of other restaurants, e.g., McDonalds, in a certain geographic region for its future business planning, or a government or environmental research organization may need to know the characteristics of certain businesses, social activities or environmental changes. These applications, however, have difficulty efficiently retrieving information for a certain region from various web sources because of restrictive web interfaces [4]. On the web, access to geospatial data is only possible through the interfaces provided by the applications. These interfaces are usually designed for one specific query type and hence cannot support general access to the data; data can be retrieved only through particular query types such as the k-NN query. For example, the McDonalds web site provides a restaurant locator service through which one can ask for the five restaurants closest to a given location. This type of web interface, which searches for a number of "nearest neighbors" of a given geographical point (e.g., a mailing address) or area (e.g., a zip code), is very popular for accessing geospatial data on the web. It nicely serves the intended goal of quick and convenient dissemination of business location information to potential customers. However, if the consumer of the data is a computer program, as in the case of web data integration utilities (e.g., wrappers) and search programs (e.g., crawlers), such an interface may be a very inefficient way of retrieving all data in a given query region. To illustrate, suppose a web crawler wants to access the McDonalds web site to retrieve all the restaurants in the state of California. Even though the site has the required information, the interface only allows the retrieval of five locations at a time; even worse, the interface needs the center of the search as input. Hence, the crawler needs to identify a set of nearest-location searches that both covers the entire state of California (for completeness) and has minimum overlap between the result sets (for efficiency). The number of results returned also varies across applications. This problem necessitates methods that utilize the supported query types to answer non-supported queries.

In this paper, we conceptualize the aforementioned problem into a more general one: supporting a spatial range query that retrieves web data in a given rectangular region using k-Nearest Neighbor (k-NN) search. Besides web data integration and search applications, the solution to this more general problem is beneficial in other application domains such as sensor networks, online maps and ad-hoc mobile networks, whose applications require a great deal of data integration for data analysis and decision support queries. Figure 1 illustrates the challenges of this general problem through an example. Suppose that we want to find the locations of all the points in a given region (rectangle), and the only available interface is the k-NN search with a fixed k (k = 3 in the example of Fig. 1). The only parameter that we can vary is the query point of the k-NN search (shown as cross marks in Fig. 1). Given a query point q, the result of the k-NN search is the set of the k points nearest to q. It defines a circle centered at q with radius equal to the distance from q to its kth nearest neighbor (i.e., the 3rd closest point in this example). The area of this circle, covering part of the rectangle, determines the searched (i.e., covered) portion of the rectangular region. To complete the range query, we need to identify a set of input locations of q, corresponding to a series of k-NN searches, whose circles cover the entire rectangle.

[Fig. 1 A range query on the web data: a 3-NN search at query point q covers part of the query rectangle R defined by Pll and Pur]

Based on the classification of k-NN interfaces on the web, we propose a set of range query algorithms: the Density-Based (DB) algorithm for the case with known statistics of the data sets, assuming the most flexible k-NN interface (i.e., k is a user input); the Quad Drill Down (QDD) and Dynamic Constrained Delaunay Triangulation (DCDT) algorithms for the case with no knowledge of the distribution of the data sets and the most restrictive k-NN interface (i.e., no control over the value of k); and the Hybrid (HB) of the two approaches, DB and QDD, for the case with bounded values of k. Our algorithms guarantee the complete coverage of any given spatial query range. The efficiency of the algorithms is measured by statistical analysis and empirical experiments using both synthetic and real data sets. The remainder of this paper is organized as follows: the problem is formally defined in Section 2 and related work is discussed in Section 3. In Section 4 we propose our algorithms for solving range queries using a number of k-NN queries. Section 5 provides the performance evaluation of the proposed algorithms: we first discuss the statistical analysis of DB and QDD, and then present the efficiency of our algorithms through empirical experiments. Finally, Section 6 concludes the paper.

2 Problem definition

A range query in spatial database applications can be formally defined as follows. Let R be a given rectangular query region represented by two points Pll and Pur, where Pll = (x1, y1) and Pur = (x2, y2) refer to the lower-left and the upper-right corner of R, respectively. Given a set of objects S and a region R, a spatial range query finds all objects of S in R [8]. A nearest neighbor (NN) query is defined similarly: given a set of objects S and a query point q, an NN query searches for a point p ∈ S such that dist(q, p) ≤ dist(q, r) for any r ∈ S; the point p is the nearest neighbor of q. One may be interested in finding more than one point, say k points, that are the nearest neighbors of the query point q. This is called the k-Nearest Neighbor (k-NN) problem, also known as the top-k selection query: given a set of objects S with cardinality(S) = N, a number k ≤ N and a query point q, a k-NN query searches for a subset S′ ⊆ S with cardinality(S′) = k such that dist(q, p) ≤ dist(q, r) for any p ∈ S′ and r ∈ S − S′ [16]. Our problem is defined as follows: given a rectangular query region R, find all spatial objects within the query range using the k-NN search interface as supported by various web applications. The complexity of this problem is unknown; therefore only approximation results are provided. In reality, more than one web site may be involved in a given range query; the general problem is then to find all objects within the query region R using k-NN searches on each of these web sources and to integrate the results from the sources into one final query result. Figure 2a and b show examples of two different k-NN interfaces, k = 3 in (a) and k = 6 in (b), where two different data sets over the same query region are retrieved from two different web sources. The optimization objective of this problem is to minimize the number of k-NN searches while retrieving all objects in a given query region. Each k-NN search incurs communication overhead between the client and the server in addition to the actual execution cost of the k-NN search at the server.

[Fig. 2 Range queries on the web data: (a) a range query using 3-NN search; (b) a range query using 6-NN search]
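The two query types just defined can be stated in a few lines of brute-force Python (our illustrative sketch, not code from the paper); a web source exposes only the k-NN side, which is exactly what makes answering the range query through it non-trivial:

```python
import math

def range_query(S, p_ll, p_ur):
    """Spatial range query: all objects of S inside the rectangle R = [p_ll, p_ur]."""
    (x1, y1), (x2, y2) = p_ll, p_ur
    return [p for p in S if x1 <= p[0] <= x2 and y1 <= p[1] <= y2]

def knn_query(S, q, k):
    """k-NN query: the k points of S closest to q, ordered by distance from q."""
    return sorted(S, key=lambda p: math.dist(p, q))[:k]
```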


The range query execution is broken down into four basic steps: 1) the client sends a request to the web server with the query point q; 2) the server executes the k-NN search using q; 3) the server returns the result of the k-NN search to the client; 4) the client prunes the result (and may invoke additional requests to the server). Let Cn be the total cost of the range query, and let Cq, Cknn, Cres and Cpru represent the costs associated with each of the four steps, where n is the total number of k-NN searches required to complete the range query. Then Cn is defined as:

$$C_n = C_{pru}(n) + \sum_{i=1}^{n}\left[C_q(i) + C_{knn}(i) + C_{res}(i)\right]$$

Steps 1 and 3 both involve communication over the network. For step 2, we assume that any of the known k-NN search algorithms [6, 7, 13, 17] is available to the web application; if a tree index structure is used, the complexity of a k-NN search is known to be O(log N + k) [7], where N is the total number of objects in the database and k is the number of objects we want to retrieve. The cost of step 4 is the CPU time required to prune the k-NN result at the client side. If additional k-NN searches are required to cover the query region, then the additional costs of Cq, Cknn and Cres must all be taken into account; Cpru, however, is required only once per query. Hence, in this paper we focus on reducing the number of k-NN searches while placing our circles so that they overlap as little as possible. Finally, the fact that we do not know the radius of each circle prior to the actual evaluation of its corresponding k-NN query makes the problem even harder. To summarize, the optimal solution should find a minimal set of query locations that minimizes the number of k-NN searches needed to completely cover the given query region. In many web applications, the client has no knowledge about the data set at the server and no control over the value of k (i.e., k is fixed by the web application). These applications return a fixed number k of nearest objects to the user's query point. A typical value of k in real applications ranges from 5 to 20 [4], so multiple queries must be evaluated to completely cover a reasonably large region. In this paper we consider all possible k-NN interfaces. Any web application supporting k-NN search can be classified based on the value of k:

1. The value of k is fixed, and the user does not have any control over k (the general problem). Many applications always return a fixed number k of nearest objects to the user's query point; different applications may have different values of k.


2. The value of k can be any positive integer determined by the user. Some web applications accept user-defined k values for their k-NN searches.

3. The value of k can be any positive integer less than or equal to a certain bound. Some web applications support k-NN search with user-defined k values but allow only a bounded range of k.

To simplify the discussion, we make the following assumptions: 1) a range query is supported by a single web source; 2) a web application supports only one type of k-NN query (fixed, flexible or bounded k value); 3) a k-NN search takes a query point q as an input parameter.

3 Related work

Some studies have discussed the problems of data integration and query processing on the web [12, 22]. A survey of the general problems in the integration of web data sources can be found in [9]; the paper discussed the challenges posed by the heterogeneous nature of the web and the different ways of describing the same data, and proposed general ideas for tackling those problems. Nie et al. [12] presented a query processing framework for web-based data integration and implemented query planning and statistics-gathering modules. The problem of querying web sites through limited query interfaces has been studied in the context of information mediation systems in [22]; based on the capabilities of the mediator, the authors proposed a way to compute the set of supported queries. Several studies have focused on performing spatial queries on the web [1, 4, 10]. In [1], the authors demonstrated an information integration application that allows users to retrieve information about theaters and restaurants in various U.S. cities, including an interactive map; their system showed how to build applications rapidly from existing web data sources and integration tools. The problem of spatial coverage using only the k-NN interface was introduced in [4]. That paper provided a quad-tree based approach: a quad-tree data structure was used to check for complete coverage by marking regions covered so far, and the authors followed a greedy approach using the coverage information to determine the next query position. Our previous work [2] provided an efficient and complete solution for the case when the value of k is fixed and users have no control over the web interface and no knowledge of the data. This paper extends our previous work and provides comprehensive solutions for the most general case, based on the classification of all possible k-NN interfaces on the web. Our proposed QDD algorithm follows a systematic way of dividing the region to be covered, which spares us the quad-tree data structure used in [4]; this systematic division of the range query provides coverage information without needing to access any data structure to decide the next query position. The problem of supporting k-NN queries using range queries was studied in [10]: the idea of successively increasing the query range until k points are found was described, assuming statistical knowledge of the data set. This is the complement of the problem we focus on. Many k-NN query algorithms have been proposed [6, 13, 17]. In [13], the authors proposed a branch-and-bound R-tree traversal algorithm to find the k nearest neighbors of a given query point. They also introduced two metrics for both searching and pruning: MINDIST, which produces the most optimistic ordering, and MINMAXDIST, which produces the most pessimistic ordering possible. A new cost model for optimizing nearest neighbor searches in low- and medium-dimensional spaces was proposed in [17]; the authors proposed a method that captures the performance of nearest neighbor queries based on approximation, using vicinity rectangles and Minkowski rectangles. In [6], given a set of points, the authors proposed two algorithms for two k-nearest-neighbor related problems: the first enumerates the k smallest distances between pairs of points in nondecreasing order, and the second finds the k nearest neighbors of each point; both are based on a Delaunay triangulation. Delaunay triangulation and Voronoi diagram based approaches have been studied for various purposes in wireless sensor networks [15, 19, 21]. In [19], the authors used the Voronoi diagram to discover coverage holes, assuming that each sensor knows the locations of its neighbors: the region is partitioned into Voronoi cells, each containing one sensor, and if a cell is not covered by its sensor, a coverage hole is found. The authors use a greedy approach that targets the largest hole (a Voronoi cell) as the next region to be covered; replacing sensors that have a fixed coverage radius while minimizing sensor movement is one of the main ideas of their paper. Similarly, our proposed DCDT algorithm greedily targets the largest uncovered region, a triangle, to be covered; however, coverage is determined by the non-fixed radius of the k-NN query. The Delaunay triangulation and Voronoi diagram were used to determine the maximal breach path (MBP) and the maximal support path (MSP) for a given sensor network in [11]. In [21], a wireless sensor network deployment method based on the Delaunay triangulation was proposed and applied to planning the positions of sensors in an environment with obstacles. Their main goal was to maximize the coverage of the sensors, not to cover the complete region: each candidate position of a sensor is generated from the current sensor configuration by calculating its score using a probabilistic sensor detection model. The proposed approach retrieved the location information of obstacles and pre-deployed sensors, then constructed a Delaunay triangulation to find candidate positions for new sensors. In our DCDT algorithm, a constrained Delaunay triangulation (CDT) is used to find the uncovered region with the largest coverage gain; the CDT is dynamically updated while covering the query region with k-NN circles.

4 Range query algorithms using k-NN search

We investigate different approaches for all the k-NN search interfaces defined in Section 2. We develop four spatial range query algorithms that utilize k-NN search to retrieve all objects in a given rectangular query region: Density Based (DB) for flexible values of k, Quad Drill Down (QDD) and Dynamic Constrained Delaunay Triangulation (DCDT) for fixed values of k, and Hybrid (HB) for bounded values of k. The notations in Table 1 are used throughout this paper, and Fig. 3 illustrates an example of a range query using a 5-NN search with these notations. Let R be a given query region and C_Rq be the circle circumscribing R, with q and Rq as its center and radius, respectively; note that q is also the center point of R.

Table 1 Notations

Notation   Description
R          A query region (rectangle)
Pll        The lower-left corner point of R
Pur        The upper-right corner point of R
q          The query point for a k-NN search
Rq         Half of the diagonal of R; the distance from q to Pll
C_Rq       Circle circumscribing R; the circle of radius Rq centered at q
Pk         The set of k nearest neighbors obtained by a k-NN search, ordered ascendingly by distance from q, e.g., P5 = {p1, p2, p3, p4, p5} in Fig. 3
r          The distance from q to the farthest point pk in Pk
Cr         k-NN circle; the circle of radius r centered at q
C′r        Tighter-bound k-NN circle; the circle of radius δ · r centered at q

Let Pk = {p1, p2, ..., pk} be the set of k nearest neighbors, ordered by distance from q, obtained by the k-NN search with q as the query point. Let dist(q, pk) = r, i.e., r is the distance from q to the farthest point pk, and let Cr be the circle of radius r centered at q, which we refer to as the k-NN circle. If r > Rq, then the k-NN search must have returned all the points inside C_Rq. Therefore, we can obtain all the points inside the region R after pruning out any points of Pk that are outside R but inside Cr; the range query is then complete. Otherwise, we need to perform additional k-NN searches to completely cover R. For example, if we use a 5-NN search for the data set shown in Fig. 3, the resulting Cr is smaller than C_Rq. Note that it is not sufficient to have r ≥ Rq in order to retrieve all the points inside R: when q has more than one kth nearest neighbor, only one of them is returned, chosen arbitrarily. Assuming four points on the four corners of R, four points strictly inside Cr, and k = 5, the returned result is P5 = {p1, p2, ..., p5}, where the 5-NN search arbitrarily picks one of the four corner points as p5 and discards the other three. The strict condition r > Rq rules this case out; as a result, we use r > Rq as the terminating condition in our algorithms.
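The terminating condition and the pruning step translate directly into code. A minimal sketch (ours, not from the paper), assuming the k-NN result is ordered ascendingly by distance from q as in Table 1:

```python
import math

def knn_covers_region(q, knn_result, r_q):
    """Terminating condition r > Rq: the k-NN circle strictly contains C_Rq,
    so every point inside R is guaranteed to be in the result."""
    r = math.dist(q, knn_result[-1])  # distance to the k-th (farthest) neighbor
    return r > r_q

def prune(knn_result, p_ll, p_ur):
    """Discard points that are inside Cr but outside the rectangle R."""
    (x1, y1), (x2, y2) = p_ll, p_ur
    return [p for p in knn_result if x1 <= p[0] <= x2 and y1 <= p[1] <= y2]
```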

[Fig. 3 A range query using 5-NN search: the 5-NN circle Cr of radius r around q (containing p1, ..., p5) lies inside the circumscribing circle C_Rq of radius Rq, so the range query over R is not yet complete]


4.1 Algorithm using flexible values of k

Some web applications support a k-NN interface with flexible values of k, so that users can choose any integer value for the k-NN search. Some also provide the total number of data objects (i.e., stores or branches) and their service area. By carefully determining the k value, a range query can be evaluated with a single k-NN search if all the objects in the given query region are found in the result of the very first k-NN search. Based on this assumption, we propose the Density Based (DB) range query algorithm, a statistical approach for retrieving web data in a given region. The main idea of DB is to find a good approximation of the k value using some statistical knowledge of the web data, i.e., the density of the data set. Assuming that the density information of the web data is available, let N be the total number of objects in a web source and A_global be the area of the minimum bounding rectangle (MBR) that includes all those objects. Two estimation methods are considered: global and local estimation. The global estimation method is as follows. The global density is N / A_global, and Rq is half the diagonal of R as described in Fig. 3. We then estimate the number of objects k inside C_Rq from the parameters A_global, N and Rq:

$$k = \pi R_q^2 \cdot N / A_{global} \qquad (1)$$

This k is used for the first k-NN search against the web data. From the k-NN result, Cr (the k-NN circle) is expected to match C_Rq. If Cr does not contain C_Rq (i.e., r ≤ Rq), extra k-NN searches are needed, and we re-estimate the k value using local estimation. The local estimation method is as follows. Let A_local = Area(Cr) = πr² be the area of the current Cr and k be the value used by the previous k-NN search. The local density is k / A_local, so the new estimate k′ is:

$$k' = \pi R_q^2 \cdot k / A_{local} \qquad (2)$$

This k′ is used for the next k-NN search against the web data. In the DB algorithm we first estimate the initial k using the global density of the data set, with the center point of the query region R as the query point q (see Fig. 3). If Cr is not larger than C_Rq, the value of k was underestimated, since Cr cannot cover the entire R; it is then necessary to increase the k value, and a new value k′ is calculated using the local density. This local estimation continues until Cr becomes larger than C_Rq, at which point the k-NN result includes all the objects inside C_Rq. Finally, pruning eliminates the objects outside R but inside C_Rq. Even though we determine the value of k by taking advantage of known statistics, there can be a significant number of underestimated/overestimated cases due to statistical variation. Underestimation of the k value incurs an expensive cost, i.e., a new k-NN search requiring all new Cpru, Cq, Cknn and Cres, while overestimation does not.¹ From the observation that a slight overestimation of the k value can absorb a significant amount of statistical variance (i.e., more range queries can be evaluated in a single k-NN search) without creating a significant overhead cost, we introduce the overestimation constant ε for a loosely bounded k value: the estimated k value is multiplied by ε (ε > 1). The ε value can be determined from empirical experiments and statistics based upon the density and distribution of the data set. Algorithm 1 describes the DB range query. The DB algorithm is evaluated through both statistical analysis and empirical experiments in Section 5, where we also demonstrate how the ε value affects its performance.

¹ Overestimation introduces a slight overhead in each cost component, but it is far lower than the overhead caused by underestimation.

Algorithm 1 DBrangeQuery(Pll, Pur, Aglobal, N)
1: ε ← a constant factor
2: Pc.x ← (Pll.x + Pur.x)/2
3: Pc.y ← (Pll.y + Pur.y)/2
4: Rq ← √((Pur.x − Pc.x)² + (Pur.y − Pc.y)²)
5: q ← Pc
6: kinit ← estKvalue(Rq, Aglobal, N)    // using Eq. (1)
7: Knn{} ← kNNSearch(q, ε · kinit)
8: r ← maxDistkNN(q, Knn{})
9: k ← kinit
10: while r ≤ Rq do
11:   clear Knn{}
12:   Alocal ← π · r²
13:   k ← estKvalue(Rq, Alocal, k)    // using Eq. (2)
14:   Knn{} ← kNNSearch(q, ε · k)
15:   r ← maxDistkNN(q, Knn{})
16: end while
17: result{} ← pruneResult(Knn{}, Pll, Pur)
18: return result{}
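For concreteness, here is a direct Python rendering of Algorithm 1 (our sketch under our own naming; knn_search(q, k) stands for the web source's hypothetical flexible-k interface, and prune() is the helper sketched earlier):

```python
import math

def db_range_query(p_ll, p_ur, a_global, n_total, knn_search, eps=1.3):
    """Sketch of Algorithm 1 (DB); eps is the overestimation constant (> 1)."""
    q = ((p_ll[0] + p_ur[0]) / 2, (p_ll[1] + p_ur[1]) / 2)
    r_q = math.dist(q, p_ur)                                # half the diagonal of R
    k = math.ceil(math.pi * r_q ** 2 * n_total / a_global)  # Eq. (1): global density
    knn = knn_search(q, math.ceil(eps * k))
    r = math.dist(q, knn[-1])
    while r <= r_q:                                         # k was underestimated
        a_local = math.pi * r ** 2
        k = math.ceil(math.pi * r_q ** 2 * k / a_local)     # Eq. (2): local density
        knn = knn_search(q, math.ceil(eps * k))
        r = math.dist(q, knn[-1])
    return prune(knn, p_ll, p_ur)
```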

4.2 Algorithms using fixed values of k

In many web applications, the client has no knowledge of the data set at the server and no control over the value of k (i.e., k is fixed). Considering a typical value of k in real applications, which ranges from 5 to 25 [4], multiple queries are in general required to completely cover a reasonably large R. Our approach to this general range query problem on the web is as follows: 1) divide the query region R into subregions R = {R1, R2, ..., Rm} such that each Ri ∈ R can be covered by a single k-NN search; 2) obtain the resulting points from the k-NN search on each subregion; 3) combine the partial results into a complete result for the original range query R. The main question is then how to divide R so that the total number of k-NN searches is minimized. We consider the following three approaches:

1. A naive approach: divide the entire R into equi-sized subregions (grid cells) such that any cell can be covered by a single k-NN search.

2. A recursive approach: conduct a k-NN search on the current query region; if the k-NN search fails to cover it, divide the current query region into equal-sized subregions and call a k-NN search for each of them. Repeat the process until all subregions are covered.


3. A greedy and incremental approach: divide R into subregions of different sizes, select the largest subregion for the next k-NN search, record the region covered by that search, and select the next largest subregion for another k-NN search.

Recall from the example in Fig. 3 that if r > Rq, the k-NN search returns all the points inside C_Rq, and the range query is complete after pruning out any points of Pk outside R; otherwise, additional k-NN searches are needed, as when the 5-NN search on the data set of Fig. 3 yields a Cr smaller than C_Rq. For the rest of this section, we focus on the latter case, which is the more realistic scenario. In the naive approach, R is divided into a grid whose cells are small enough to be covered by a single k-NN search. However, it might not be feasible to find the exact cell size with no knowledge of the data set. Even if we obtain such a grid, this approach is inefficient due to the large amount of overlap among k-NN circles and the time wasted searching empty cells, considering that most real data sets are not uniformly distributed. Therefore, we provide a recursive approach and an incremental approach in the following subsections.

4.2.1 Quad drill down (QDD) algorithm

We propose a recursive approach, the Quad Drill Down (QDD) algorithm, as a solution to the range query problem on the web using the properties of the quad-tree. A quad-tree is a tree whose nodes either are leaves or have four children, and it is one of the most commonly used data structures in spatial databases [14]. The quad-tree represents a recursive quaternary decomposition of space wherein at each level a subregion is divided into four equal-sized subregions (quadrants). The properties of the quad-tree provide a natural framework for optimizing decomposition [20]; hence, the quad-tree gives us an efficient way to divide the query region R. We adopt the partitioning idea of the quad-tree, but we do not actually construct the tree. The main idea of QDD is to divide R into equal-sized quadrants, and then recursively divide quadrants further until each subregion is fully covered by a single k-NN search so that all objects in it are obtained. An example of QDD is illustrated in Fig. 4 (with k = 3), where twenty-one k-NN calls are required to retrieve all the points in R. Algorithm 2 describes QDD. First, a k-NN search is invoked with the point q at the center of R, and the k-NN circle Cr is obtained from the result. If Cr is larger than C_Rq, it entirely covers R and all objects in R are retrieved; a pruning step then keeps only the objects inside R (the trivial case). However, if Cr is smaller than or equal to C_Rq, the query region is partitioned into four subregions by halving the width and the height of the region, and the previous steps are repeated for every new subregion. The algorithm recursively partitions the query region until each subregion is covered by its k-NN circle. The pruning step eliminates those objects that are inside a k-NN circle but outside its subregion. Statistical analysis of QDD is presented in Section 5.2, along with the empirical experiment results.

[Fig. 4 A Quad Drill Down range query: (a) a QDD range query over R with 3-NN searches at q, q1, ..., q4 and deeper query points; (b) the quad-tree representation of QDD, with q at the root, children q1, ..., q4, and deeper nodes such as q11, ..., q14 and q111, ..., q114]

4.2.2 Dynamic constrained Delaunay triangulation (DCDT) algorithm

We propose the Dynamic Constrained Delaunay Triangulation (DCDT) algorithm as another approach to solving the range query with the most restrictive k-NN interface: a greedy and incremental approach using the Constrained Delaunay Triangulation (CDT).² DCDT uses triangulations to divide the query range and keeps track of triangles covered by k-NN circles using the constrained edges of the CDT. DCDT greedily selects the largest uncovered triangle for the next k-NN search, whereas QDD follows a pre-defined order. To avoid redundant k-NN searches on the same area, DCDT includes a propagation algorithm that covers the maximum possible area within a k-NN circle. Thus, no k-NN search is wasted, because a portion of each k-NN circle is always added to the covered area of the query range.

² The terms CDT and DCDT are used both for the corresponding triangulation algorithms and for the data structures that support them in this paper.


Algorithm 2 QDDrangeQuery(Pll, Pur)
1: q ← getCenterOfRegion(Pll, Pur)
2: Rq ← getHalfDiagonal(q, Pll, Pur)
3: Knn{} ← add kNNSearch(q)
4: r ← getDistToKthNN(q, Knn{})
5: if r > Rq then
6:   result{} ← pruneResult(Knn{}, Pll, Pur)
7: else
8:   clear Knn{}
9:   P0 ← (q.x, Pll.y)
10:  P1 ← (Pur.x, q.y)
11:  P2 ← (Pll.x, q.y)
12:  P3 ← (q.x, Pur.y)
13:  Knn{} ← add QDDrangeQuery(Pll, q)    // q1
14:  Knn{} ← add QDDrangeQuery(P0, P1)    // q4
15:  Knn{} ← add QDDrangeQuery(P2, P3)    // q2
16:  Knn{} ← add QDDrangeQuery(q, Pur)    // q3
17:  result{} ← Knn{}
18: end if
19: return result{}
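A compact Python rendering of Algorithm 2 (our sketch; knn_search(q) stands for the hypothetical fixed-k web interface returning points ordered by distance from q, and prune() is the helper sketched earlier):

```python
import math

def qdd_range_query(p_ll, p_ur, knn_search):
    """Sketch of Algorithm 2 (QDD) against a fixed-k interface."""
    q = ((p_ll[0] + p_ur[0]) / 2, (p_ll[1] + p_ur[1]) / 2)
    r_q = math.dist(q, p_ur)                # half the diagonal of this region
    knn = knn_search(q)
    if math.dist(q, knn[-1]) > r_q:         # k-NN circle covers the region
        return prune(knn, p_ll, p_ur)
    result = []                             # otherwise discard and recurse
    result += qdd_range_query(p_ll, q, knn_search)                           # q1
    result += qdd_range_query((q[0], p_ll[1]), (p_ur[0], q[1]), knn_search)  # q4
    result += qdd_range_query((p_ll[0], q[1]), (q[0], p_ur[1]), knn_search)  # q2
    result += qdd_range_query(q, p_ur, knn_search)                           # q3
    return result
```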

Data Structure for DCDT Given a planar graph G with a set P of n vertices in the plane, a set of edges E, and a set of special edges C, a CDT of G is a triangulation of the vertices of G that includes C as part of the triangulation [5]; i.e., all edges in C appear as edges of the resulting triangulation. These edges are referred to as constrained edges, and they are not crossed (destroyed) by any other edges of the triangulation. The CDT is also called an obstacle triangulation or a generalized Delaunay triangulation. Figure 5 illustrates a graph G and the corresponding CDT: Fig. 5a shows a set of vertices P = {P1, P2, P3, P4, P5, P6, P7} and a set of constrained edges C = {C1, C2}, and Fig. 5b shows the resulting CDT, which includes the constrained edges C1 and C2 as part of the triangulation. We define the following data structures to maintain the covered and uncovered regions of R:

• pList: a set of all vertices of G; initially {Pll, Prl, Plr, Pur}
• cEdges: a set of all constrained edges of G; initially empty
• tList: a set of all uncovered triangles of G; initially empty

pList and cEdges are updated based on the region currently covered by a k-NN search. R is then triangulated (partitioned into subregions) using CDT with the current pList and cEdges, and every new triangulation yields a new tList. DCDT keeps track of covered and uncovered triangles (subregions): covered triangles are bounded by constrained edges (i.e., all three edges are constrained), while uncovered triangles are kept in tList, sorted in descending order of area. DCDT chooses the largest uncovered triangle in tList and calls a k-NN search using its centroid as the query point; unlike the circumcenter, the centroid always lies inside the triangle. Our algorithm thus uses a heuristic approach that picks the largest triangle from the list of uncovered triangles.
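The bookkeeping and the greedy choice can be sketched as follows (our own minimal rendering; the names follow the paper, the helper functions are ours):

```python
from dataclasses import dataclass, field

def triangle_area(tri):
    (ax, ay), (bx, by), (cx, cy) = tri
    return abs((bx - ax) * (cy - ay) - (cx - ax) * (by - ay)) / 2

def centroid(tri):
    """The centroid always lies inside the triangle, unlike the circumcenter."""
    (ax, ay), (bx, by), (cx, cy) = tri
    return ((ax + bx + cx) / 3, (ay + by + cy) / 3)

@dataclass
class DCDTState:
    pList: list = field(default_factory=list)   # all vertices of G
    cEdges: set = field(default_factory=set)    # constrained (covered) edges
    tList: list = field(default_factory=list)   # uncovered triangles

def next_query_point(state):
    """Greedy step: centroid of the largest uncovered triangle."""
    return centroid(max(state.tList, key=triangle_area))
```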

[Fig. 5 Example of constrained Delaunay triangulation: (a) constrained edges C1 and C2 over vertices p1, ..., p7; (b) the resulting CDT, in which C1 and C2 appear as triangulation edges]

The algorithm terminates when no more uncovered triangles exist. We describe the details of DCDT, shown in Algorithm 3 and Algorithm 4, in the following sections.

The First k-NN Search in DCDT DCDT invokes the first k-NN search using the center point of the query region R as the query point q; Fig. 6a shows an example with a 3-NN search. If the resulting k-NN circle Cr completely covers R (r > Rq), we prune and return the result (the trivial case). If Cr does not completely cover R, DCDT uses C′r, a slightly smaller circle than Cr, for checking covered regions in further k-NN searches. We previously discussed the possibility that the query point q has more than one kth nearest neighbor, only one of which is returned; in that case, DCDT might fail to retrieve all the points in R if it used Cr itself. Hence we define C′r with radius r′ = δ · r, where 0 < δ < 1 (δ = 0.99 is used in our experiments). DCDT then creates an n-gon inscribed in C′r. The choice of n is a tradeoff between computation time and area coverage; n = 6 (a hexagon) is used in our experiments, as shown in Fig. 6b. All vertices of the n-gon are added to pList. To mark the n-gon as a covered region, it is triangulated and the resulting edges are added as constrained edges to cEdges (getConstrainedEdges(), line 12 of Algorithm 3). The algorithm then constructs a new triangulation with the current pList and cEdges, and the newly created tList is returned (constructDCDT(), line 13 of Algorithm 3).

[Fig. 6 An example of the first k-NN call in query region R: (a) query region R with center point q and the circle C′r; (b) marking a hexagon inscribed in C′r as the covered region]


Covering Regions DCDT selects the largest uncovered triangle from tList for the next k-NN search. With the new C′r, DCDT updates pList and cEdges and creates a new triangulation. For example, if an edge lies within C′r, DCDT adds it to cEdges; if an edge intersects C′r, the partially covered part of the edge is added to cEdges. Figure 7 shows an example of how pList, cEdges and tList are updated and maintained. Δ1 (= Δmax) is the largest triangle in tList, so Δ1 is selected for the next k-NN search, and DCDT uses the centroid of Δ1 as the query point q. C′r partially covers Δ1: vertices v5 and v6 are inside C′r but vertex v3 is outside. C′r intersects Δ1 at k1 along v3v6 and at k2 along v3v5; hence, k1 and k2 are added to pList (getCoveredVertices(), line 3 of Algorithm 4). Let Rc be the covered area (polygon) of Δ1 under C′r (Fig. 7c). Then Rc is triangulated and the resulting edges k1k2, k1v6, k2v5, v5v6 and k1v5 are added to cEdges. The updated pList and cEdges are used for constructing a new triangulation of R (Fig. 7c).
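Finding cut points such as k1 and k2 amounts to intersecting the circle C′r with a triangle edge. A standard quadratic-root sketch (ours, not code from the paper; assumes a ≠ b):

```python
import math

def circle_segment_intersections(q, r, a, b):
    """Points where the circle of center q and radius r crosses segment ab."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    fx, fy = a[0] - q[0], a[1] - q[1]
    A = dx * dx + dy * dy                 # quadratic in the segment parameter t
    B = 2 * (fx * dx + fy * dy)
    C = fx * fx + fy * fy - r * r
    disc = B * B - 4 * A * C
    if disc < 0:
        return []                         # the segment's line misses the circle
    hits = []
    for t in ((-B - math.sqrt(disc)) / (2 * A), (-B + math.sqrt(disc)) / (2 * A)):
        if 0 <= t <= 1:                   # keep only roots lying on the segment
            hits.append((a[0] + t * dx, a[1] + t * dy))
    return hits
```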

Algorithm 3 DCDTrangeQuery(Pll, Pur, δ)
1: pList{} ← {Pll, Prl, Plr, Pur}; cEdges{} ← { }; tList{} ← { }
2: Knn{} ← { }    // all objects retrieved by k-NN searches
3: q ← getCenterOfRegion(Pll, Pur)
4: Knn{} ← add kNNSearch(q)
5: r ← getDistToKthNN(q, Knn{})
6: Cr ← circle centered at q with radius r
7: Rq ← getHalfDiagonal(q, Pll, Pur)
8: if r ≤ Rq then    // Cr does not cover the given region R
9:   C′r ← circle centered at q with radius δ · r
10:  Ng ← create an n-gon inscribed in C′r with q as the center
11:  pList{} ← add {all vertices of Ng}
12:  cEdges{} ← getConstrainedEdges({all vertices of Ng})
13:  tList{} ← constructDCDT(pList{}, cEdges{})
14:  while tList is not empty do
15:    mark all triangles in tList as unvisited
16:    Δmax ← getMaxTriangle()
17:    q ← getCentroid(Δmax)
18:    Knn{} ← add kNNSearch(q)
19:    r ← getDistToKthNN(q, Knn{})
20:    C′r ← circle centered at q with radius δ · r
21:    checkCoverRegion(Δmax, C′r, pList{}, cEdges{})
22:    tList{} ← constructDCDT(pList{}, cEdges{})
23:  end while
24: end if
25: result{} ← pruneResult(Knn{}, Pll, Pur)
26: return result{}


Algorithm 4 checkCoverRegion(Δ, C′r, pList{}, cEdges{})
1: if Δ is marked unvisited then
2:   mark Δ as visited
3:   localPList{} ← getCoveredVertices(Δ, C′r)
4:   localCEdges{} ← getConstrainedEdges(localPList)
5:   if localCEdges is not empty then
6:     pList{} ← add localPList{}
7:     cEdges{} ← add localCEdges{}
8:     neighbors{} ← findNeighbors(Δ)
9:     for every Δi in neighbors{}, i = 1, 2, 3 do
10:      checkCoverRegion(Δi, C′r, pList{}, cEdges{})
11:    end for
12:  end if
13: end if

As shown in Fig. 7c, the covered region of Δ1 can be much smaller than the area of C′r. In order to maximize the region covered by a single k-NN search, DCDT propagates to (visits) the neighboring triangles of Δ1 (triangles that share an edge with Δ1). DCDT marks the covered regions starting from Δmax and recursively visits neighboring triangles (findNeighbors(), lines 8–11 of Algorithm 4). Figure 8 shows an example of the propagation process: in Fig. 8a, Δmax is completely covered; its neighboring triangles (the 1-hop neighbors of Δmax) are partially covered in Fig. 8b, c and d; and in Fig. 8e, the neighbors of Δmax's 1-hop neighbors, i.e., its 2-hop neighbors, are partially covered. Finally, based on the result returned by checkCoverRegion() in Fig. 8e, DCDT constructs a new triangulation, as shown in Fig. 8f. Algorithm 4 describes the propagation process. Notice that DCDT covers some portion of R at every step, so the covered region grows as the number of k-NN searches increases. On the contrary, QDD discards the returned result of a k-NN call when it does not cover the whole subregion, which results in re-visiting the subdivisions of the same region with four additional k-NN calls (see lines 5–16 in Algorithm 2). Also note that the cost of triangulation is negligible compared to a k-NN search, because triangulation is performed in memory and incrementally.


[Fig. 7 Example of pList, cEdges and tList while covering Δmax (a–c). Step 1: pList = {v1, ..., v6}, cEdges = { }, tList = {Δ1, ..., Δ6}. Step 2: k1 and k2 are added, giving pList = {v1, ..., v6, k1, k2} and cEdges = {v5v6, v6k1, k1k2, k2v5, v5k1}. Step 3: re-triangulation yields tList = {Δ2, Δ5, Δ6, Δ7, Δ8, Δ9, Δ10, Δ11}]

[Fig. 8 Example of checkCoverRegion() (a–f): the covered region propagates from Δmax through its 1-hop and 2-hop neighbors before re-triangulation]

DCDT repeats the process until no uncovered region remains; each resulting k-NN circle Cr covers R either completely or partially. The example in Fig. 9a shows the first step of DCDT: the shaded triangles are covered subregions, and all edges of these triangles are now constrained edges. Figure 9b and c illustrate the state of DCDT after the 5th and the 9th k-NN search, respectively.

Cases of Covered Regions In this section we define the cases of the covered region Rc and discuss the details of Algorithm 4. Let Cr be the returned k-NN circle centered at q, T the given triangle that we consider covering with Cr, and a, b and c the three vertices of T. The following cases are defined for the covered region Rc; the corresponding examples are illustrated in Fig. 10.

[Fig. 9 Example of DCDT with k = 15, with coverage rates: (a) 1st k-NN call (12%); (b) 5th k-NN call (52%); (c) 9th k-NN call (92%)]

[Fig. 10 Examples of the cases for the covered region Rc (a–h): Case 1, Case 2, Case 3.1, Case 3.2, Case 4.1, Case 4.2, Case 4.3a, Case 4.3b]

Case 1: All three vertices a, b and c are inside Cr, so that T is completely covered by the k-NN search: Rc is Δabc.

Case 2: Two vertices (say a and b) are inside Cr but the third vertex (say c) is outside Cr. Then Cr has one intersection (say b′) with bc and one intersection (say a′) with ca: Rc is the polygon with edges ab, aa′, bb′ and a′b′.

Case 3: One vertex (say a) is inside Cr but the two other vertices (say b and c) are outside Cr.

Case 3.1: Cr has one intersection (say b′) with ab and one intersection (say c′) with ca: Rc is Δab′c′.


Case 3.2: Cr has one intersection (say b′) with ab, one intersection (say c′) with ca, and either one intersection (say k1) or two intersections (say k1 and k2) with bc: Rc is the polygon with edges ab′, b′k1, k1k2, k2c′ and c′a.

Case 4: All three vertices are outside Cr.

Case 4.1: The center of Cr is inside T and Cr has either no intersection or only one intersection with the edges of T: create an n-gon inscribed in Cr; Rc is the n-gon.

Case 4.2: Cr has two intersections with at least two edges of T. Let the intersection points be k1, k2, ..., ki, where 4 ≤ i ≤ 6: Rc is the polygon with edges k1k2, k2k3, ..., kik1.

Case 4.3: Cr has two intersections with one edge (say ab) of T. Let these two intersections be k1 and k2, and let m be the midpoint of k1k2.

Case 4.3a: If the center of Cr is inside T, add k1k2, create an n-gon inscribed in Cr, and find the intersections of the n-gon with ab. Let S = {k1, k2, ..., ki} be the set of points that includes 1) the vertices of the n-gon that are inside T and 2) the intersection points of the n-gon with ab: Rc is the polygon with the vertices in S.

Case 4.3b: If the center of Cr is outside T, find the intersection point m′ of Cr with the line that is perpendicular to k1k2 and passes through m: Rc is Δk1k2m′.

After Rc is defined, Tc is obtained by triangulating Rc. All vertices of Rc are then added to pList, and all edges of Tc are added to cEdges; the updated pList and cEdges are used for creating a new triangulation.
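Which top-level case applies can be decided simply by counting the triangle's vertices inside the circle; a minimal sketch (ours; the sub-cases additionally require the edge intersections computed above):

```python
import math

def classify_case(q, r, tri):
    """Top-level case of Section 4.2.2 for triangle tri against circle (q, r)."""
    inside = sum(1 for v in tri if math.dist(v, q) < r)
    return {3: "Case 1: triangle completely covered",
            2: "Case 2: one vertex outside the circle",
            1: "Case 3: two vertices outside the circle",
            0: "Case 4: all vertices outside the circle"}[inside]
```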

4.3 Algorithm using bounded values of k

Some web applications support k-NN queries with bounded values of k. We propose the Hybrid (HB) algorithm, which combines the solutions for the two previous classes of k-NN interfaces, flexible values of k and fixed values of k, to provide a solution for these applications. When the range of k values is bounded, not only a good estimation of the k value but also a good placement of the query point q must be considered.

4.3.1 Hybrid (HB) algorithm

HB first estimates the value of k based on the global density. If additional k-NN search calls are required, a new k value is calculated based on the local density. The estimation of the k value continues until either the k-NN circle covers the query region R or the k value reaches the maximum value allowed. If the k value reaches the application-defined maximum before R is covered, R is partitioned into subregions and the previous steps are applied to each of them.

The algorithm keeps recursively partitioning the query region into subregions until every subregion is covered by its k-NN circle. Considering QDD and DCDT as candidate approaches for when the estimated k value reaches the maximum before the entire region is covered: QDD divides the region into four equal-sized subregions, while DCDT divides it into irregularly shaped triangles. Notice that in QDD the density of each subregion can be assumed to be the same as the density of the entire region; this does not hold for the irregularly shaped subregions of DCDT, which would incur some overhead for calculating each subregion's density. In this paper, we therefore combine the basic approaches of DB and QDD, discussed in Sections 4.1 and 4.2, respectively. The implementation of HB is then straightforward; the details are described in Algorithm 5.

Algorithm 5 HBrangeQuery(Pll, Pur, Aglobal, N)
1: ε ← a constant factor; kmax ← a system factor
2: q ← getCenterOfRegion(Pll, Pur)
3: Rq ← getHalfDiagonal(q, Pll, Pur)
4: kinit ← estKvalue(Rq, Aglobal, N, ε)
5: Knn{} ← kNNSearch(q, kinit)
6: r ← maxDistkNN(q, Knn{}); k ← kinit
7: while r < Rq and k < kmax do
8:   clear Knn{}
9:   Alocal ← π · r²
10:  k ← estKvalue(Rq, Alocal, k, ε)
11:  Knn{} ← kNNSearch(q, k)
12:  r ← maxDistkNN(q, Knn{})
13: end while
14: if r < Rq then
15:  clear Knn{}
16:  P0 ← (q.x, Pll.y)
17:  P1 ← (Pur.x, q.y)
18:  P2 ← (Pll.x, q.y)
19:  P3 ← (q.x, Pur.y)
20:  Knn{} ← add HBrangeQuery(Pll, q, Alocal, k)
21:  Knn{} ← add HBrangeQuery(P0, P1, Alocal, k)
22:  Knn{} ← add HBrangeQuery(P2, P3, Alocal, k)
23:  Knn{} ← add HBrangeQuery(q, Pur, Alocal, k)
24:  return result{} ← Knn{}
25: else
26:  return result{} ← pruneResult(Knn{}, Pll, Pur)
27: end if

5 Performance evaluation

In this section, we evaluate the performance of our proposed algorithms.


We first discuss the statistical analysis of the DB and QDD algorithms and compare it with their experimental results. We then present the experimental results of DB, QDD and DCDT. Based on the observations from the results of DB and QDD, the behavior and performance of HB can be derived.

5.1 Datasets and experimental methodology

In this paper, we evaluate the proposed range query algorithms using both synthetic and real GIS data sets. Our synthetic data sets use both uniform (random) and skewed (exponential) distributions. For the uniform data sets, the (x, y) locations are distributed uniformly and independently between 0 and 1. The (x, y) locations for the skewed data sets are independently drawn from an exponential distribution with mean 0.3 and standard deviation 0.3. The number of points in each data set varies: 1K, 2K, 4K, and 8K points. Our real data set is from the U.S. Geological Survey in 2001, the geochemistry of consolidated sediments in Colorado in the U.S. [18]; it contains 5,410 objects (points). Our performance metric is the number of k-NN calls required for a given range query. We performed range queries using the proposed algorithms on the various data sets and measured the average number of k-NN searches needed to support a range query while varying the size of the range queries and the distribution of the data sets. In our experiments, we varied the range query size between 1% and 10% of the entire region of the data set. Different k values between 5 and 50 were used for the QDD algorithm, and the DB algorithm used ε values varying from 1.0 to 2.0. Range queries were conducted for 100 trials for each query size and the average values were reported. We present only the most illustrative subset of our experimental results due to space limitations; similar qualitative and quantitative trends were observed in all other experiments.

5.2 Statistical analysis

The data distribution in a query region may affect the performance of the algorithms, and one could devise a better solution with complete knowledge of the data set. To show the expected performance of the DB and QDD algorithms and to evaluate the impact of data distribution on them, this section presents a statistical analysis of the two algorithms. We first statistically analyze DB and QDD, and then compare the expected performance with the empirical results on our data sets.

5.2.1 Statistical analysis of DB

We generated a number of data points assuming two typical distributions of their locations, uniform and exponential. Then, while varying the size of the range queries, we analyzed how many points fall in a query region and how this impacts the performance of the algorithms. We created various sized synthetic data sets (1K, 2K, 4K, 8K, 16K points) using both uniform and exponential distributions of locations. The size of the query region was varied such that C_Rq of the given query region covered 3%, 5%, or 10% of the entire region of the data set. For each experiment, 100K range queries were issued at randomly generated query points, and the average number of points in a query region was calculated.

[Fig. 11 Data distribution of the number of points in a query region: probability vs. number of points (0–800) for the uniform and exponential data sets with 3%, 5% and 10% query sizes]

Figure 11 shows the distributions of the number of points in a query region for query sizes of 3%, 5%, and 10% on the 4K synthetic data sets. The x-axis represents the number of points inside R, and the y-axis the probability of each count. The distribution for the uniform data set is normal, while that for the exponential data set is skewed. For example, the 3% query size on the uniform data set retrieved 120 points on average, and the maximum number of points was 156. The global estimation method used in the DB algorithm decides the initial k value as k = 3% × 4000 = 120, which matches the result of the statistical analysis well. Then, if DB uses ε = 1.3 (i.e., k = 120 × 1.3 = 156), it can retrieve all the points in R in a single k-NN search. Even though the exponential data set resulted in a skewed distribution, two or three k-NN searches can be enough to retrieve all points in R: approximately 75% of the range queries returned no more points than the value of k estimated by DB. In our experimental results for DB in Section 5.3, the value of k was overestimated in most cases with the exponential data set. This shows that DB can be effective regardless of the distribution of the data set.

5.2.2 Statistical analysis of QDD

Suppose that we have a perfectly balanced quad-tree in which each node corresponds to a region where a range query is invoked at its center point. Let D be the diagonal of the initial query region R and dk be the minimum distance between any point in the data set and its kth nearest neighbor. The region represented by a node becomes smaller and smaller going down the tree; when the diagonal of each region becomes less than or equal to dk, R can be completely covered by k-NN circles so that all the points in R are retrieved. When does the diagonal of every region become less than or equal to dk? Let L be the tree level at which the diagonal of a region is at most dk (i.e., L is the leaf level of the quad-tree); since the diagonal halves at each level, L = log2(D/dk). Let Nknn be the total number of nodes of the perfect quad-tree. Then we have

$$N_{knn} \le 4^0 + 4^1 + 4^2 + \cdots + 4^L \le \sum_{i=0}^{L} 4^i \qquad (3)$$

By solving (3), we get

$$N_{knn} \le \sum_{i=0}^{\log_2(D/d_k)} 4^i = \frac{1}{3}\left(4\left(\frac{D}{d_k}\right)^2 - 1\right) \qquad (4)$$

The performance of the naive approach is given by Nknn, because the naive grid needs to be perfectly balanced, i.e., every cell must have the same size. The quad-tree representation of QDD, however, does not need to be perfectly balanced, and its total number of nodes can be smaller than Nknn (see the example in Fig. 4b). In QDD, each level of the quad-tree has a different query region size, namely the size of its quadrant; if there are j levels, from level 0 to level j − 1, then j different query region sizes are created. To analyze the performance of QDD, let us assume that the distribution of the data set is known. First, we issue random range queries to obtain the data distribution for each query size. We then calculate, for each level of the quad-tree, the probability that its quadrant cannot be covered by a single k-NN search; a non-zero probability means that the subregion must be further divided. Finally, we estimate the total number of nodes of the quad-tree. Figure 12 shows a data distribution of the number of points in a query region. Let E be the expected number of points in C_Rq, where E = N · Area(C_Rq) / A_global. Let P_B be the probability that C_Rq contains more than k points, i.e., P_B = P(E > k), and let P_Bi be that probability at the ith level of the tree. The number of nodes at the ith level is then (∏_{j=0}^{i−1} P_Bj) · 4^i, and the number of nodes at the 0th level is obviously one. Let N′knn denote the expected number of nodes in a quad-tree representation of QDD; it is quantified as:

Nknn

⎛ ⎞ L i−1 

⎝ = 1 + P B0 · 41 + ... + P B0 P B1 · ·P BL−1 · 4 L = 1 + PBj⎠ · 4i

CRq

R

j=0

Probability

i=1

PB

Aglobal Fig. 12 Data distribution in a query region

avg . k number of points

(5)

Table 2 Statistical analysis of QDD: 4K synthetic uniformly distributed data set

Level                    Query size    P_Bi (%)   k=5     k=10    k=15    k=20    k=30    k=40    k=50
Level 0                  3%            P_B0       100     100     100     100     100     100     100
Level 1                  (1/4^1)·3%    P_B1       100     100     99.96   96.27   50.00   3.73    0
Level 2                  (1/4^2)·3%    P_B2       54.48   45.52   6.35    0       0       0       0
Level 3                  (1/4^3)·3%    P_B3       36.94   0       0       0       0       0       0
Level 4                  (1/4^4)·3%    P_B4       0       0       0       0       0       0       0
Number of k-NN searches  N_knn                    341     85      85      85      21      21      5
                         N'_knn                   108     51      26      21      13      6       5
                         QDD                      115     59      29      25      16      7       3
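Equation (5) is easy to evaluate numerically once the per-level probabilities are known. As a cross-check of the table above, the following minimal sketch (the function name is ours) reproduces the N'_knn entry for k = 5 from the P_Bi column of Table 2:

    import math

    def expected_qdd_nodes(p_b):
        """p_b[i] = P_Bi: probability that a level-i quadrant is not
        covered by a single k-NN search. Implements Eq. (5)."""
        total, product = 1.0, 1.0            # level 0 contributes one node
        for i in range(1, len(p_b) + 1):
            product *= p_b[i - 1]            # product of P_B0 ... P_B(i-1)
            total += product * 4 ** i        # expected node count at level i
        return total

    # P_B0 ... P_B4 for k = 5 on the 4K uniform data set:
    print(math.ceil(expected_qdd_nodes([1.0, 1.0, 0.5448, 0.3694, 0.0])))
    # -> 108, the N'_knn entry for k = 5 in Table 2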

Since P_Bi ≤ 1 for every level i, the product ∏_{j=0}^{i−1} P_Bj becomes smaller as the tree level increases and eventually approaches 0. Therefore, N'_knn ≤ N_knn. Table 2 shows the analysis of QDD for our 4K synthetic uniformly distributed data set. The quad-tree for the 4K data set has six levels, from level 0 to level 5. Each probability P_Bi for level i is calculated from the data distribution of the corresponding range query size. We compare N_knn with N'_knn as well as with the results of the QDD algorithm. For example, when k = 15, a quadrant at level 1 is not covered by a single k-NN search with a 99.96% probability, and one at level 2 is not covered with a 6.35% probability. The perfect quad-tree with four levels (levels 0 through 3) has 85 nodes in total; therefore, the total number of k-NN calls for the Naive approach is N_knn = 85, while N'_knn = 26 for this case. The number of k-NN calls with QDD for this example was 29, as presented in Section 5.3. Overall, N'_knn achieved an average reduction of 62% compared with N_knn. We also observed that the maximum difference between QDD and N'_knn was 9%.

This statistical analysis describes the behavior of the QDD algorithm: QDD discards the returned k-NN result when the k-NN circle does not cover the given query region and then issues four additional k-NN searches for the four new subregions. Thus, we can consider skipping some levels of the QDD process to reduce unnecessary calls. If R_c is the current query region of a k-NN search, and P_s is the probability that the k-NN circle completely covers R_c, then 1 − P_s is the probability that the k-NN circle fails to cover the region; we estimate P_s from the results of the previous k-NN searches. Skipping the current level costs exactly four k-NN searches, one for each subregion at the next level, whereas attempting the current level costs one call plus, with probability 1 − P_s, the same four subregion calls. Skipping therefore pays off when the expected cost of attempting exceeds four calls:

$$1 + (1 - P_s) \cdot 4 > 4$$

Hence, QDD can skip the k-NN search for R_c iff P_s < 1/4.
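For reference, the drill-down behavior analyzed above can be sketched as follows. This is a schematic only, with all names ours: knn_search stands in for a web site's k-NN interface and is assumed to return the retrieved points together with the distance r from the query point to its kth nearest neighbor. A k-NN circle centered at a rectangle's center covers the rectangle exactly when r is at least half the rectangle's diagonal, and the analysis in Section 5.2.2 guarantees termination once the quadrant diagonal drops to d_k or below.

    import math

    def qdd(region, knn_search, results):
        """Schematic QDD: cover region = (x1, y1, x2, y2) using k-NN calls.
        knn_search(x, y) is assumed to return (points, r), where r is the
        distance from (x, y) to its kth nearest neighbor."""
        x1, y1, x2, y2 = region
        cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
        points, r = knn_search(cx, cy)            # one k-NN call at the center
        if r >= math.hypot(x2 - x1, y2 - y1) / 2:
            # The k-NN circle covers the region: keep the points inside it.
            results.extend(p for p in points
                           if x1 <= p[0] <= x2 and y1 <= p[1] <= y2)
        else:
            # Discard the result and recurse into the four quadrants
            # (duplicates along shared boundaries are removed afterwards).
            for sub in ((x1, y1, cx, cy), (cx, y1, x2, cy),
                        (x1, cy, cx, y2), (cx, cy, x2, y2)):
                qdd(sub, knn_search, results)

The level-skipping variant would simply guard the knn_search call with the P_s < 1/4 test derived above.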

5.3 Experimental evaluation

5.3.1 Results of the DB algorithm

With the statistical knowledge of the data set, DB can choose the k value so as to minimize the number of k-NN calls. As a result, DB completed every range query in one or two k-NN calls in all experiments with both the synthetic and real data sets. The impact of ε was quantified by measuring the required number of k-NN calls while varying ε.
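The estimation step just described reduces to one line of arithmetic. Below is a minimal sketch (function and parameter names are ours), assuming the total number of points and the ratio of the query area to the global area are known a priori:

    import math

    def estimate_k(n_total, area_fraction, eps):
        """DB's initial k: the globally expected number of points in the
        query region, inflated by the safety factor eps."""
        expected = n_total * area_fraction
        return math.ceil(expected * eps)

    # The 3% range query on the 4K uniform data set (Section 5.2.1):
    print(estimate_k(4000, 0.03, 1.0))  # -> 120
    print(estimate_k(4000, 0.03, 1.3))  # -> 156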

Fig. 13 Percentage of range queries completed in one or two k-NN calls by DB while varying ε: (a) 4K synthetic data sets with 3% range queries; (b) real data set with 3% range queries

As ε increased from 1.0 to 2.0, more range queries were completed in a single k-NN call. Figure 13a shows what percentage of range queries were evaluated in one k-NN call (or in two) while varying ε. For example, for the uniformly distributed data set with ε = 1.0, 57% of the range queries were covered by one k-NN search while 43% required two k-NN searches; with ε ≥ 1.2, all range queries were completed in one k-NN call. Overall, the exponential data set required a higher number of k-NN calls than the uniform data set, and the real data set showed a similar trend to the exponential data set. DB performed all range queries in one or two k-NN calls with over-estimated values of k. Because DB uses a loosely bounded k value, the obtained k-NN circle C_r is larger than C_Rq in most cases. To quantify how accurately DB estimates C_r with respect to C_Rq (i.e., the ratio r/R_q), we measured the ratio of r to R_q while varying ε (Fig. 14). The ratio was approximately 1.26 for the uniform and 1.98 for the exponential data set when ε = 1.5, and 1.48 for the uniform and 2.34 for the exponential data set when ε = 2.0. The ratios for the real data set are presented in Fig. 14b: 1.51 with ε = 1.5 and 1.77 with ε = 2.0. For both the synthetic and real data sets, the ratio of r to R_q decreased as the size of the range query increased.

Fig. 14 Accuracy of DB (r/R_q) while varying the size of the range query from 1% to 10%, for ε = 1.5 and ε = 2.0: (a) 4K synthetic data sets; (b) real data set

Fig. 15 Number of k-NN calls of QDD, DCDT, and N.C. for the 4K synthetic data sets with 3% range queries, while varying k from 5 to 50: (a) uniformly distributed data set; (b) exponentially distributed data set

5.3.2 Results of QDD and DCDT

First, we compared the performance of QDD and DCDT with the theoretical minimum given by the necessary condition (N.C.) of ⌈n/k⌉ k-NN calls, where n is the number of points in R. Figure 15a and b show the comparisons of QDD, DCDT, and N.C. for the uniformly and exponentially distributed 4K synthetic data sets, respectively. For both DCDT and QDD, the number of k-NN calls decreased rapidly as the value of k increased in the range 5–15 for the uniformly distributed data and 5–10 for the exponentially distributed data; it then decreased slowly as k became larger, approaching that of N.C. once k exceeded 35. The results for the real data set show trends similar to those of the exponentially distributed synthetic data set (Fig. 16). Note that k is determined not by the client but by the web interface, and a typical k value in real applications ranges between 5 and 20 [4]. In both the synthetic and real data sets, DCDT needed a significantly smaller number of k-NN calls than QDD.
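To make the necessary condition concrete (our illustration, using the estimate from Section 5.2.1 that a 3% query on the 4K uniform data set contains about n = 120 points): since each k-NN call returns at most k points, with k = 15 any complete algorithm needs at least

$$\left\lceil \frac{n}{k} \right\rceil = \left\lceil \frac{120}{15} \right\rceil = 8$$

k-NN calls to retrieve all points in R.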

Fig. 16 Number of k-NN calls of QDD, DCDT, and N.C. for the real data set with 3% range queries, while varying k from 5 to 50

Table 3 The average percentage reduction in the number of k-NN calls (%)

Distribution   Query size   k=5    k=10   k=15   k=20   k=25   k=30   k=35   k=40   k=45   k=50
Uniform        3%           51.3   55.9   51.7   52.0   50.0   43.8   27.3   0      0      0
Uniform        5%           56.3   53.5   56.9   56.3   55.6   52.5   54.0   43.8   35.7   0
Uniform        10%          56.2   53.6   48.2   45.0   37.5   30.0   22.2   22.2   25.0   20.0
Exponential    3%           60.5   58.8   58.6   56.9   51.1   40.0   37.5   37.5   20.0   20.0
Exponential    5%           68.4   67.2   68.8   65.7   60.7   58.3   50.0   41.7   25.0   20.0
Exponential    10%          69.4   59.7   56.5   56.7   55.8   34.4   46.9   46.4   35.0   31.3


For example, with the exponential data set (4K), DCDT showed an average reduction of 55.3% in the required number of k-NN searches compared to QDD. As discussed in Section 4.2, DCDT covers a portion of the query region with every k-NN search, while QDD revisits the same subregion when a k-NN search fails to cover the entire subregion. DCDT still required more k-NN calls than N.C.: on average, DCDT required 1.67 times more k-NN calls than N.C. However, the gap became very small when k was greater than 15.

Next, we conducted range queries with different sizes of query regions: 3%, 5%, and 10% of the entire region of the data set. Table 3 shows the average percentage reduction in the number of k-NN calls of DCDT relative to QDD for the 4K uniformly and exponentially distributed synthetic data sets. On average, DCDT achieved reductions of 50.4% and 58.2% over QDD for the uniformly and exponentially distributed data sets, respectively. The results show that the smaller the value of k, the greater the reduction rate.

In Fig. 17, we plotted the average coverage rate versus the number of k-NN calls required by DCDT on the 4K synthetic uniformly distributed data set. Figure 17a shows the average coverage rate of the DCDT algorithm while varying the value of k. For example, when k = 7, 50% of the query range can be covered by the first 8 k-NN calls, while the entire range is covered by 40 k-NN calls. The same range query required 24 calls for 90% coverage, which is approximately 60% of the total required number of k-NN searches. Figure 17b shows the average coverage rate of DCDT while varying the query size. For the 5% range query, 12 and 31 k-NN searches were required for 50% and 90% coverage, respectively.
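The 50% and 90% figures above are read off the cumulative coverage curves of Figs. 17 and 18. A small hypothetical helper (ours, not part of the paper's implementation) illustrates the computation:

    def calls_for_coverage(coverage_curve, target):
        """coverage_curve[i]: fraction of the query range covered after
        i + 1 k-NN calls (non-decreasing). Returns the number of calls
        needed to reach the target fraction, or None if never reached."""
        for calls, covered in enumerate(coverage_curve, start=1):
            if covered >= target:
                return calls
        return None

    # On the k = 7 curve of Fig. 17a one would expect, e.g.,
    # calls_for_coverage(curve, 0.5) == 8 and calls_for_coverage(curve, 0.9) == 24.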

Fig. 17 Percentage of coverage versus the number of k-NN calls for DCDT on the 4K uniformly distributed synthetic data set: (a) 3% range query while varying k (k = 5, 7, 10, 12, 15); (b) k = 10 while varying the query size (3%, 5%, 10%)

Fig. 18 Percentage of coverage versus the number of k-NN calls for DCDT on the real data set: (a) 3% range query while varying k (k = 5, 7, 10, 12, 15); (b) k = 10 while varying the query size (3%, 5%, 10%)

Figure 18a and b show the average coverage rate versus the number of k-NN calls required by DCDT on the real data set, varying the value of k in (a) and the query size in (b). When k = 7, 50% of the query range can be covered by 20% of the total required number of k-NN searches. The same range query required 16 calls for 90% coverage, which is 55% of the total required number of k-NN searches. For the 5% range query, 20% and 58% of the total required k-NN searches were needed to cover 50% and 90% of the query range, respectively.

6 Conclusions

In this paper, we introduced the problem of evaluating spatial range queries on web data using only k-nearest neighbor searches. The problem of finding a minimum number of k-NN searches to cover the query range appears to be hard even if the locations of the objects (points) are known. Based on the classification of k-NN interfaces, four algorithms were proposed to achieve the completeness and efficiency requirements of spatial range queries. The Quad Drill Down (QDD) and Dynamic Constrained Delaunay Triangulation (DCDT) algorithms were proposed for the case where nothing is known about the distribution of the data set and the k-NN interface is the most restricted (i.e., the client has no control over the value of k). We showed that both QDD and DCDT cover the entire range even in this most restricted environment while providing reasonable performance; their efficiency was evaluated using statistical analysis and empirical experiments, with DCDT providing better performance than QDD. When the data distribution is known and the k-NN interface is more flexible, the proposed DB algorithm supported any range query with only one or two k-NN searches, as demonstrated in all our experiments. The Hybrid (HB) algorithm provides the possibility of customizing our algorithms for real-world applications. We plan to extend our range query algorithms to parallel k-NN searches on multiple web sites.


This is expected to significantly improve data integration and search time; consequently, a new problem of minimizing the overlapping area can be introduced.

Acknowledgements The authors thank Prof. Petr Vojtěchovský for providing helpful suggestions and Brandon Haenlein for the valuable discussions throughout this work.

References

1. Barish G, Chen Y, DiPasquo D, Knoblock CA, Minton S, Muslea I, Shahabi C (2000) TheaterLoc: using information integration technology to rapidly build virtual applications. In: Proceedings of the international conference on data engineering (ICDE), San Diego, 28 February–3 March 2000, pp 681–682
2. Bae WD, Alkobaisi S, Kim SH, Narayanappa S, Shahabi C (2007) Supporting range queries on web data using k-nearest neighbor search. In: Proceedings of the 7th international symposium on web and wireless GIS (W2GIS 2007), Cardiff, 28–29 November 2007, pp 61–75
3. Bae WD, Alkobaisi S, Kim SH, Narayanappa S, Shahabi C (2007) Supporting range queries on web data using k-nearest neighbor search. Technical report DU-CS-08-01, University of Denver
4. Byers S, Freire J, Silva C (2001) Efficient acquisition of web data through restricted query interfaces. In: Poster proceedings of the world wide web conference (WWW10), Hong Kong, 1–5 May 2001, pp 184–185
5. Chew LP (1989) Constrained Delaunay triangulations. Algorithmica 4(1):97–108
6. Dickerson M, Drysdale R, Sack J (1992) Simple algorithms for enumerating interpoint distances and finding k nearest neighbors. Int J Comput Geom Appl 2:221–239
7. Eppstein D, Erickson J (1994) Iterated nearest neighbors and finding minimal polytopes. Discrete Comput Geom 11:321–350
8. Gaede V, Günther O (1998) Multidimensional access methods. ACM Comput Surv 30(2):170–231
9. Hieu LQ (2005) Integration of web data sources: a survey of existing problems. In: Proceedings of the 17th GI-workshop on the foundations of databases (GvD), Wörlitz, 17–20 May 2005
10. Liu D, Lim E, Ng W (2002) Efficient k nearest neighbor queries on remote spatial databases using range estimation. In: Proceedings of the international conference on scientific and statistical database management (SSDBM), Edinburgh, 24–26 July 2002, pp 121–130
11. Megerian S, Koushanfar F (2005) Worst and best-case coverage in sensor networks. IEEE Trans Mob Comput 4(1):84–92
12. Nie Z, Kambhampati S, Nambiar U (2005) Effectively mining and using coverage and overlap statistics for data integration. IEEE Trans Knowl Data Eng 17(5):638–651
13. Roussopoulos N, Kelley S, Vincent F (1995) Nearest neighbor queries. In: Proceedings of ACM SIGMOD, San Jose, May 1995, pp 71–79
14. Samet H (1985) Data structures for quadtree approximation and compression. Commun ACM 28(9):973–993
15. Sharifzadeh M, Shahabi C (2006) Utilizing Voronoi cells of location data streams for accurate computation of aggregate functions in sensor networks. GeoInformatica 10(1):9–36
16. Song Z, Roussopoulos N (2001) K-nearest neighbor search for moving query point. In: Proceedings of the international symposium on spatial and temporal databases (SSTD), Redondo Beach, 12–15 July 2001, pp 79–96
17. Tao Y, Zhang J, Papadias D, Mamoulis N (2004) An efficient cost model for optimization of nearest neighbor search in low and medium dimensional spaces. IEEE Trans Knowl Data Eng 16(10):1169–1184
18. USGS (2001) USGS mineral resources on-line spatial data. http://tin.er.usgs.gov/
19. Wang G, Cao G, La Porta T (2003) Movement-assisted sensor deployment. In: Proceedings of IEEE INFOCOM, San Francisco, 30 March–3 April 2003, pp 2469–2479
20. Wang S, Armstrong MP (2003) A quadtree approach to domain decomposition for spatial interpolation in grid computing environments. Parallel Comput 29(3):1481–1504
21. Wu C, Lee K, Chung Y (2006) A Delaunay triangulation based method for wireless sensor network deployment. In: Proceedings of ICPADS, Minneapolis, July 2006, pp 253–260
22. Yerneni R, Li C, Garcia-Molina H, Ullman J (1999) Computing capabilities of mediators. In: Proceedings of ACM SIGMOD, Philadelphia, 1–3 June 1999, pp 443–454


Wan D. Bae is currently an assistant professor in the Mathematics, Statistics and Computer Science Department at the University of Wisconsin-Stout. She received her Ph.D. in Computer Science from the University of Denver in 2007. Dr. Bae’s current research interests include online query processing, Geographic Information Systems, digital mapping, multidimensional data analysis and data mining in spatial and spatiotemporal databases.

Shayma Alkobaisi is currently an assistant professor at the College of Information Technology in the United Arab Emirates University. She received her Ph.D. in Computer Science from the University of Denver in 2008. Dr. Alkobaisi’s research interests include uncertainty management in spatiotemporal databases, online query processing in spatial databases, Geographic Information Systems and computational geometry.


Seon Ho Kim is currently an associate professor in the Computer Science & Information Technology Department at the University of District of Columbia. He received his Ph.D. in Computer Science from the University of Southern California in 1999. Dr. Kim's primary research interests include the design and implementation of multimedia storage systems and databases, spatiotemporal databases, and GIS. He co-chaired the 2004 ACM Workshop on Next Generation Residential Broadband Challenges in conjunction with the ACM Multimedia Conference.

Sada Narayanappa is currently an advanced computing technologist at Jeppesen. He received his Ph.D. in Mathematics and Computer Science from the University of Denver in 2006. Dr. Narayanappa’s primary research interests include computational geometry, graph theory, algorithms, design and implementation of databases.


Cyrus Shahabi is currently an Associate Professor and the Director of the Information Laboratory (InfoLAB) at the Computer Science Department and also a Research Area Director at the NSF's Integrated Media Systems Center (IMSC) at the University of Southern California. He received his Ph.D. degree in Computer Science from the University of Southern California in August 1996. Dr. Shahabi's current research interests include peer-to-peer systems, streaming architectures, geospatial data integration, and multidimensional data analysis. He is currently on the editorial board of ACM Computers in Entertainment magazine. He is also serving on many conference program committees, such as ICDE, SSTD, ACM SIGMOD, and ACM GIS. Dr. Shahabi is the recipient of the 2002 National Science Foundation CAREER Award and the 2003 Presidential Early Career Award for Scientists and Engineers (PECASE). In 2001, he also received an award from the Okawa Foundation.