International Journal of Geographical Information Science
ISSN: 1365-8816 (Print) 1362-3087 (Online) Journal homepage: http://www.tandfonline.com/loi/tgis20
Pattern-mining approach for conflating crowdsourcing road networks with POIs
Bisheng Yang & Yunfei Zhang
To cite this article: Bisheng Yang & Yunfei Zhang (2015) Pattern-mining approach for conflating crowdsourcing road networks with POIs, International Journal of Geographical Information Science, 29:5, 786-805, DOI: 10.1080/13658816.2014.997238
To link to this article: http://dx.doi.org/10.1080/13658816.2014.997238
Published online: 06 Mar 2015.
Full Terms & Conditions of access and use can be found at http://www.tandfonline.com/action/journalInformation?journalCode=tgis20 Download by: [Wuhan University]
Date: 18 September 2015, At: 01:24
International Journal of Geographical Information Science, 2015 Vol. 29, No. 5, 786–805, http://dx.doi.org/10.1080/13658816.2014.997238
Pattern-mining approach for conflating crowdsourcing road networks with POIs
Bisheng Yang* and Yunfei Zhang
State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, Wuhan, China
Downloaded by [Wuhan University] at 01:24 18 September 2015
(Received 24 September 2014; final version received 7 December 2014)
Crowdsourcing geospatial data, collected mainly by members of the public, have profoundly transformed data acquisition and utilization. However, such data suffer from unpredictable positional accuracy, unstructured semantic descriptions, and invalid spatial relations, which makes it difficult to conflate heterogeneous data sets collected by different professional agencies or volunteers. We thus propose a novel pattern-mining approach to conflate crowdsourcing road networks with points of interest (POIs) geometrically and semantically. The proposed method mines the geometric patterns of road networks and POIs separately and generates a pattern-related skeleton graph for each. Corresponding points are then determined between the two skeleton graphs to align POIs and road networks geometrically, and the road-related semantic data of the associated POIs and road segments are compared to check the data quality of POIs and infer the road names of the road segments. Experimental results show the advantages of the proposed method, demonstrating a functional and promising solution for enriching POIs and road networks geometrically and semantically.
Keywords: pattern mining; conflation; data enrichment; crowdsourcing; road networks
*Corresponding author. Email: [email protected]. Email of Yunfei Zhang: [email protected]
© 2015 Taylor & Francis
1. Introduction and literature review
With the emergence of GeoSensor Networks and Web 2.0 technologies, citizens have started to create and edit geographic information directly, producing what is called crowdsourcing geospatial data (Heipke 2010). On the one hand, the popularity of crowdsourcing geospatial data provides a highly efficient and low-cost solution for updating official Spatial Data Infrastructures, prompting a paradigm shift in geospatial data acquisition from the traditional top-down approach to a bottom-up approach (Jiang 2013). On the other hand, dynamically changing user-generated geographic information plays an essential role in crisis mapping and environmental monitoring (Zook et al. 2010). Because of its rich thematic information, volunteered geographic information has also been used to generate 3D city models (Goetz and Zipf 2013). The evolution of geo-crowdsourcing has thus opened a new period of user-centric, ubiquitous Geographical Information Science (GIS). In particular, in mobile location-based services (e.g., pedestrian navigation, geo-fencing), studies are ongoing to make the best of crowdsourcing road networks and geo-tagged data contributed by open mapping platforms (e.g., OpenStreetMap (OSM), Google Map Maker) and social media sources (e.g., Flickr, Twitter, Weibo) (Sun et al. 2013). As illustrated in Figure 1, despite a
Figure 1. Semantic enrichment for street panoramic imagery.
strong sense of reality, street panoramic imagery still needs to be semantically enriched with the content of crowdsourcing road networks and geo-tagged data. However, the participation of nonexperts leads to redundant data, wrong or incomplete information, and unstructured semantic descriptions, greatly reducing the availability and interoperability of crowdsourcing geospatial data. Moreover, inconsistencies between spatially related features are another obstacle to good-quality mapping (Touya and Brando-Escobar 2013). Redundant data, multiple representations of one object, and invalid spatial relations should be detected and remedied to provide integrated information with clear semantics, seamless geometry, and consistent topology (Sester et al. 2014). Figure 2 depicts an overlay of the points of interest (POIs) from Dianping (a social media site in China, www.dianping.com) and crowdsourcing road networks from OSM (www.openstreetmap.org). A large discrepancy can be seen between the roads and the geo-tagged points, denoted as white lines and dots, respectively. In particular, the larger white POIs are clearly mismatched with the neighboring thin roads but are geometrically consistent with the longer thick roads. Moreover, the road names in OSM are described in an unstructured manner
Figure 2. Invalid spatial relations and unstructured semantic description occurring to crowdsourcing road networks and geo-tagged data in social media sources.
with a mixture of Chinese characters and Pinyin. The diversity of positioning devices and the use of nonstandard vocabularies in particular make it difficult to maintain geometric and semantic consistency between road networks and POIs from authorized or volunteered geographic information (VGI) sources. Invalid spatial relations and unstructured semantic descriptions can result in unsatisfactory route planning and incorrect spatial querying. Hence, maintaining the geometric and semantic consistencies between crowdsourcing road networks and POIs is an important precondition for guaranteeing the reliability and usefulness of crowdsourcing geospatial data (Yang et al. 2014). In recent decades, numerous conflation methods have been developed to deal with the geometric and semantic heterogeneities between various geospatial data, and they have been widely applied for data enrichment, change detection, and quality assessment (Saalfeld 1988, Butenuth et al. 2007, Sheeren et al. 2009). Ruiz et al. (2011) reviewed and classified the conflation process according to matching criteria, categorization problem, and representation model. Previous studies mainly concentrated on geometric conflation, identifying corresponding objects between two different vector road or building footprint data sources (Safra et al. 2013, Yang et al. 2013, Zhang et al. 2014b). Several studies investigated schema-level fusion, exploring semantic similarities between feature categories of multisource data (Duckham and Worboys 2005, Al-Bakri and Fairbairn 2012). The automatic conflation of vector roads and ortho-imagery has also received much attention. For instance, Chen et al. (2006) and Song et al. (2009) detected and identified corresponding road intersections in vector roads and imagery in order to align the vector roads to the imagery with a piecewise rubber-sheeting transformation. González et al.
(2013) recently characterized the impacts of strong and weak metrics on conflation results and proposed the novel term 'ephemeral conflation' for occasional online conflation, particularly for visual checking. The increasing volume of crowdsourcing geospatial data raises opportunities and challenges for geographic information handling (Adams and Gahegan 2014). Different approaches have been proposed to conflate crowdsourcing road networks and POIs to enrich authoritative data (Neis and Zielstra 2014). Mooney et al. (2011) affirmed the role of VGI as a dynamically updating data source, and Du et al. (2012) demonstrated it with an ontology-based integration method. Several recent studies describe how to generate road maps or gazetteers from VGI data (Gao et al. 2014, Li et al. 2014a, Wang et al. 2014). In addition, weighted multiattribute methods have been used to match different user-generated POIs (Scheffler et al. 2012, McKenzie et al. 2014). Previous studies mainly addressed the conflation of different road networks or the discovery of attractive POIs from geo-tagged data in social media sites. Relatively little attention has been paid to conflating crowdsourcing road networks and POIs with regard to valid spatial relations and semantic enrichment. Zhang et al. (2014a) utilized node matching between two road networks to transfer the POIs of one road network to the other, but this is infeasible for aligning geo-tagged data with crowdsourcing road networks because geo-tagged data have no matched road network of their own. Yang et al. (2014) proposed a geometry-based method to integrate VGI POIs and road networks. Their method determined the corresponding points between POIs and road networks by constructing a POI Connect Graph (PCGraph) from the POIs and matching it with the road networks. However, the construction of the PCGraph assumes a linear distribution of POIs in space, which is valid only for limited, specific cases, reducing the robustness and usability of the approach.
Moreover, they did not address object-level correspondences, so semantic enrichment between geometric objects was not possible. Because of unpredictable positional accuracies, unstructured semantic descriptions, and different geometry types, it remains challenging to conflate crowdsourcing road
networks and POIs. This paper thus concentrates on one particular issue: the conflation of crowdsourcing road networks and POIs from authorized or informal sources, both geometrically and semantically, through a pattern-mining method. It is accepted that, in spite of uncertain accuracies, POIs from different sources are pattern-related with the associated road networks. We extend the work of Yang et al. (2014) by proposing a pattern-mining approach to conflate crowdsourcing road networks and POIs in both geometry and semantic data. In contrast to Yang et al. (2014), the contributions of the proposed method are as follows: (1) the road network is utilized as prior knowledge to infer the geometric patterns of geo-tagged points, reducing the effects of the spatial distribution of POIs; (2) the geometric distribution pattern of POIs is delineated without assuming a linear distribution, improving the robustness and usability of the proposed method; and (3) the object-level correspondences between POIs and road segments are constructed for semantic enrichment, producing a robust approach for conflating crowdsourcing road networks and POIs geometrically and semantically. The remainder of the paper is organized as follows. Section 2 elaborates the pattern-mining approach for conflating crowdsourcing road networks and POIs. Section 3 discusses an experimental study and demonstrates the advantages of the proposed method. Section 4 presents the conclusion and proposed future work.
2. Pattern mining for conflating road networks and POIs
Unpredictable positional accuracies, unstructured semantic descriptions, and different geometry types greatly complicate linking and aligning crowdsourcing road networks and POIs. Geometric patterns represent the geometric characteristics (e.g., shape, connectedness, density, or distribution) repeated with sufficient regularity within an object or between objects (Mackaness and Edwards 2002).
Many specific patterns (e.g., high-level strokes) are scale-independent characteristics, and preserving these patterns is one of the most important principles in map generalization, data compression, and data matching (Jiang and Claramunt 2004, Yang et al. 2011, 2014, Zhang et al. 2014a). For road networks, high-level strokes reflect the functional and structural importance of the street network. POIs normally reflect the structural hierarchies of the urban street network, as high-level strokes do. Yang et al. (2014) attempted to extract the linear geometric patterns of VGI POIs to match the associated road networks. They utilized only the geometric distribution of POIs and ignored the pattern-related similarities between POIs and road networks. We propose a pattern-mining approach to construct their pattern-related correspondences and hence conflate crowdsourcing road networks and POIs. As illustrated in Figure 3, the proposed method encompasses four components:
● Generate the high-level strokes of the road networks to compose a generalized skeleton graph for the whole road network.
● Assign POIs to k associated strokes or to null strokes using a k-means clustering algorithm to build the correspondences between POIs and strokes.
● Extract the geometric patterns of the POIs clustered by the k-means algorithm with a polygonal algorithm to generate the skeleton graph of POIs.
● Match the skeleton graphs of POIs and road networks to identify the corresponding points for aligning POIs to road networks geometrically. A semantic processing module is then activated to detect their road-related semantic inconsistencies and enrich the aligned data with the inferred semantic data.
Figure 3. The framework of the pattern-mining conflation approach.
Because of data encryption and positional errors, a rough registration is first performed as preprocessing, based on the associated attribute data (e.g., names, addresses), to resolve large discrepancies. The addresses and road names are compared between POIs and road networks to determine the associated POIs for each road segment. The associated POIs of each road are fitted with a line segment. The corresponding intersection points of the fitted line segments and of the associated roads are used as control points for an affine transformation, decreasing the systematic deviation between POIs and road networks to less than 100 m.
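The rough registration step can be illustrated with a least-squares affine fit to control point pairs. The following is only a minimal sketch under stated assumptions: the function names (`solve3`, `fit_affine`, `apply_affine`) and the control point coordinates are hypothetical, and the paper does not specify how its affine parameters are estimated. At least three non-collinear control point pairs are needed.

```python
def solve3(a, b):
    """Solve a 3x3 linear system by Gauss-Jordan elimination with partial pivoting."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][3] / m[i][i] for i in range(3)]


def fit_affine(src, dst):
    """Least-squares affine fit: x' = a*x + b*y + c, y' = d*x + e*y + f."""
    rows = [(x, y, 1.0) for x, y in src]
    # Normal equations (A^T A) p = A^T t, solved once per output coordinate.
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    params = []
    for axis in (0, 1):
        atb = [sum(r[i] * t[axis] for r, t in zip(rows, dst)) for i in range(3)]
        params.append(solve3(ata, atb))
    return params  # [[a, b, c], [d, e, f]]


def apply_affine(params, pt):
    """Apply the fitted transform to one point."""
    (a, b, c), (d, e, f) = params
    x, y = pt
    return (a * x + b * y + c, d * x + e * y + f)


# Hypothetical control points: intersections of the fitted POI lines (src)
# and the matching road intersections (dst), in metres.
src = [(0, 0), (100, 0), (0, 100), (100, 100)]
dst = [(30, -20), (130, -18), (28, 80), (128, 82)]
params = fit_affine(src, dst)
moved_poi = apply_affine(params, (50, 50))
```

With more than three control point pairs the fit averages out individual positional errors, which suits a coarse alignment that only has to bring the deviation below the 100 m search radius used later.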
2.1. Generating high-level strokes from road networks
In our proposed method, the high-level strokes of road networks are extracted to construct the skeleton graph of the road network. Strokes are linear elements with good continuity in a network, often generated by aggregating neighboring road segments that have identical road names or large angles between them (Thomson 2006). An operator that aggregates neighboring road segments according to directional continuity is used to generate the strokes. The operator starts from the node with the largest degree (e.g., node Ni). The neighboring road segments, NRi, of the node are determined, as well as the angle between each pair of neighboring road segments. The Every-Best-Fit strategy is then adopted for pairwise grouping of directionally continuous road segments. Suppose that MCi is the Every-Best-Fit combination; each road pair in MCi is traversed and labeled with an identical stroke id according to the defined rules. The pseudo code of the operator is illustrated in Figure 4. The Every-Best-Fit strategy considers the directions of all road segments and obtains a balanced concatenation combination (Zhou and Li 2012). For example, in Figure 5a, (a→d, b→e, c→f) are the final pairwise combinations determined by the Every-Best-Fit strategy. Although the directional continuity of roads (a, c) is better than that of roads (a, d), road a is still concatenated with road d to reach the maximum sum of angles. The angle between paired road segments should be larger than a threshold, Φ, to ensure directional continuity. Once all road nodes are traversed, each road segment is labeled with exactly one stroke id, and road segments with identical stroke ids are concatenated into a stroke. As illustrated in Figure 5b, many strokes generated by the aggregating process may meet the requirement of local continuity but produce a larger overall bending
Figure 4. The procedure of aggregating neighboring road segments.
Figure 5. (a) The strategy of Every-Best-Fit to aggregate neighboring road segments and (b) the parameters to split large bending strokes.
direction. Suppose that s and e are the start and end nodes of the stroke drawn as the solid line in Figure 5b, and p is the point on the stroke farthest from the line through s and e. If θ is less than the predefined Φ and the ratio of d to l is larger than R, the stroke is considered to bend strongly and is split into two strokes at point p. All generated strokes are checked individually, and those with large bending directions are iteratively split. In urban road networks, high-level roads are normally multilane roads, reflecting their importance in the functional hierarchy and their transportation capacity (Li et al. 2014b). Large numbers of POIs are likely to be distributed along high-level multilane roads, making it difficult to determine the correspondences between multilane strokes and POIs. Hence, we implement a triangulation-based collapsing algorithm to extract the centerline of multilane strokes. The aggregated strokes are traversed in descending order of stroke length to find neighboring strokes by ε-radius buffering analysis. When the angle between neighboring strokes is less than 180° − Φ and the proportion of their length within the buffering region is larger than 1 − R, the neighboring strokes are approximately parallel and are regarded as multilane strokes. The flat buffering areas of the detected multilane strokes
Figure 6. Generating high-level strokes from road networks.
are merged into a new polygon. The triangulation-based collapsing algorithm (Liu et al. 2009) is applied to the merged polygons to extract the skeleton line of the multilane strokes. In practice, post-processing is required to preserve the topology between multilane and adjacent strokes. The strokes of the road networks are thus generated by aggregating neighboring roads and collapsing multilane strokes. To improve the robustness of linking strokes and POIs, high-level strokes are selected to generate a skeleton graph to align with POIs. We select high-level strokes according to network analysis (Yang et al. 2011). Figure 6 shows the high-level strokes generated from the road networks as thick black lines.
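The Every-Best-Fit grouping described in this section can be illustrated for a single node. The sketch below is an assumption-laden illustration, not the operator of Figure 4: each incident road segment is represented by a hypothetical bearing (in degrees, pointing away from the node), the angle between two segments is taken as the smaller angle between their bearings (180° means a straight continuation), and the best combination is found by exhaustive search over pairings, which is practical only because a node has few incident segments.

```python
def angle_between(b1, b2):
    """Angle in degrees (0 to 180) between two segments leaving the same node."""
    diff = abs(b1 - b2) % 360.0
    return min(diff, 360.0 - diff)


def every_best_fit(bearings, phi=120.0):
    """Return (summed angle, pairs) for the pairing with the largest summed angle.

    Only pairs whose angle exceeds the threshold phi may be concatenated;
    segments without an admissible partner stay unpaired.
    """
    def search(remaining):
        if len(remaining) < 2:
            return 0.0, []
        first, rest = remaining[0], remaining[1:]
        best = search(rest)  # option: leave `first` unpaired
        for k, other in enumerate(rest):
            ang = angle_between(bearings[first], bearings[other])
            if ang > phi:
                total, pairs = search(rest[:k] + rest[k + 1:])
                if ang + total > best[0]:
                    best = (ang + total, [(first, other)] + pairs)
        return best

    return search(list(range(len(bearings))))


# Hypothetical node with four incident segments (bearings in degrees).
# Segment 0 continues best into segment 1 (180 degrees), but pairing 0 with 2
# and 1 with 3 gives the larger summed angle, so that combination is chosen.
total, pairs = every_best_fit([0.0, 180.0, 130.0, 20.0])
```

The example mirrors the behaviour described for Figure 5a: the locally best continuation is given up when another combination yields a larger sum of angles over all pairs at the node.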
2.2. Linking POIs with the high-level strokes
Despite the random distribution of POIs in road networks, the POIs distributed along high-level road segments show geometric patterns similar to those of the road segments. Thus, the shape of the road segments provides prior knowledge for the geometric distribution of the associated POIs. Yang et al. (2014) clustered POIs without utilizing this prior knowledge, so their method is only suitable for clustering POIs distributed along nearly straight lines. To address this limitation, a k-means clustering algorithm is implemented to partition all POIs among the k strokes and build the linkages between POIs and strokes. Let P = {pi | i = 1…n} be the POI set, S = {sj | j = 1…k} be the stroke set, and X = {xij | i = 1…n, j = 1…k} be the set of projecting vectors of POI pi in P and stroke sj in S. The element xij in X is defined as the vector from the projection of pi onto sj to the point pi, which measures the relative position between POIs and the neighboring strokes. POIs associated with the same stroke will have nearly uniform projecting vectors. A k-means algorithm finds a partition such that the squared error between the cluster center and the points in the cluster is minimized (Jain 2010). For POIs and road networks, the squared error of the projecting vectors of the POIs to the associated stroke
should be minimized. Where POIs deviate greatly from the roads, they should be partitioned to an individual cluster, the 'null stroke'. Following Tseng (2007), the objective formula of the k-means clustering method with noise is
$$\arg\min f = \sum_{i=1}^{n}\sum_{j=1}^{k} u_{ij}\,\lVert x_{ij} - c_j \rVert + \lambda\,\lVert N \rVert, \quad \text{s.t.}\ \begin{cases} \sum_{j=1}^{k} u_{ij} \le 1, & i = 1 \ldots n,\ j = 1 \ldots k, \\ N = \{\, p_i \mid p_i \notin \bigcup_{j=1 \ldots k} s_j \,\}, \end{cases} \tag{1}$$
with
$$c_j = \frac{\sum_{i=1}^{n} u_{ij}\, x_{ij}}{\sum_{i=1}^{n} u_{ij}}, \tag{2}$$
and
$$\lambda = \lambda_0\, \frac{\sum_{i=1}^{n}\sum_{j=1}^{k} u_{ij}\,\lVert x_{ij} - c_j \rVert}{\sum_{i=1}^{n}\sum_{j=1}^{k} u_{ij}}, \tag{3}$$
where uij is a binary assignment variable (uij = 1 means that POI pi is assigned to stroke sj, and uij = 0 otherwise); cj is the cluster center of stroke sj; λ is the criterion for assigning POIs to null strokes; and ||·|| denotes the squared error between two vectors or, for a set N, its number of elements. The constraint ∑uij ≤ 1 ensures that a POI is assigned to no more than one stroke. The set N contains the POIs assigned to null strokes. Many POIs are located close to the centers of street blocks and should not be assigned to any road. The parameter cj is the mean projecting vector of the POIs assigned to stroke sj, and λ0 is a predefined parameter (in the range 0.15–2.5). The k-means clustering algorithm first calculates the projecting vector of each POI to the strokes within its δ-neighborhood. Because a rough registration has already been performed, δ is uniformly set to 100 m. The cluster center of each stroke is initially set to (0, 0). All POIs are then clustered to the k strokes or to null strokes according to the framework of k-means clustering, as follows:
Step 1: For each POI pi, select the nearest stroke sj, which minimizes the squared error || xij – cj ||.
Step 2: Calculate the cluster center cj for each stroke sj, and the clustering parameter λ, according to Equations (2) and (3).
Step 3: Reassign the POIs to the nearest stroke or to the null stroke. For each POI pi, the nearest stroke sj, with the minimum squared error || xij – cj ||, is found. If the minimum squared error is less than λ, the POI is assigned to sj; otherwise, it is assigned to the null stroke.
Step 4: Repeat steps 2 and 3 until the clusters of the k strokes converge. The cluster of stroke sj has converged when the displacement of the cluster center cj between two iterations is less than a threshold.
Figure 7. (a) The initial clustering and (b) the k-means clustering.
Figure 7 shows the initial clustering to the nearest strokes and the results of k-means clustering. POIs shown in the same color are partitioned into one stroke, and those clustered as null strokes are shown as small black triangles. The POIs in the black ellipse are initially partitioned into the nearest strokes (Figure 7a) but are finally reassigned to null strokes (Figure 7b). A few POIs in the red ellipse are initially assigned to the horizontal stroke but are finally reassigned to the vertical stroke. Once the POIs are partitioned into different strokes, we build the linkage between POIs and the corresponding strokes. Since the linkage between strokes and road segments is known, the object-level correspondences between POIs and road networks are determined; these guide the geometric alignment and the checking of semantic inconsistencies between the associated POIs and road segments.
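Steps 1 to 4 of the clustering can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: strokes are plain polylines, all function and variable names are hypothetical, the comparison with λ uses `<=` so that a perfectly fitting POI is kept when λ is zero, and an iteration cap (`max_iter`) is added as a safeguard.

```python
import math


def projecting_vector(p, stroke):
    """Vector from the nearest point on the stroke polyline to POI p."""
    best, best_d2 = None, math.inf
    for (ax, ay), (bx, by) in zip(stroke, stroke[1:]):
        dx, dy = bx - ax, by - ay
        len2 = dx * dx + dy * dy
        t = 0.0 if len2 == 0 else ((p[0] - ax) * dx + (p[1] - ay) * dy) / len2
        t = max(0.0, min(1.0, t))  # clamp the projection to the segment
        v = (p[0] - (ax + t * dx), p[1] - (ay + t * dy))
        d2 = v[0] ** 2 + v[1] ** 2
        if d2 < best_d2:
            best, best_d2 = v, d2
    return best


def sq_err(v, c):
    """Squared error between a projecting vector and a cluster center."""
    return (v[0] - c[0]) ** 2 + (v[1] - c[1]) ** 2


def cluster_pois(pois, strokes, delta=100.0, lam0=1.8, tol=1e-3, max_iter=50):
    """Assign each POI to a stroke index or to None (the null stroke)."""
    k = len(strokes)
    # x[i][j]: projecting vector of POI i to stroke j, or None outside delta.
    x = []
    for p in pois:
        vecs = [projecting_vector(p, s) for s in strokes]
        x.append([v if math.hypot(v[0], v[1]) <= delta else None for v in vecs])
    centers = [(0.0, 0.0)] * k

    def nearest(i):
        cands = [(sq_err(x[i][j], centers[j]), j)
                 for j in range(k) if x[i][j] is not None]
        return min(cands) if cands else (math.inf, None)

    labels = [nearest(i)[1] for i in range(len(pois))]  # Step 1
    for _ in range(max_iter):
        old = centers
        centers = []
        for j in range(k):  # Step 2: cluster centers, Equation (2)
            members = [x[i][j] for i, lab in enumerate(labels) if lab == j]
            if members:
                centers.append((sum(v[0] for v in members) / len(members),
                                sum(v[1] for v in members) / len(members)))
            else:
                centers.append(old[j])
        errs = [sq_err(x[i][lab], centers[lab])
                for i, lab in enumerate(labels) if lab is not None]
        lam = lam0 * sum(errs) / len(errs) if errs else 0.0  # Equation (3)
        labels = []  # Step 3: reassign to the nearest stroke or the null stroke
        for i in range(len(pois)):
            err, j = nearest(i)
            labels.append(j if j is not None and err <= lam else None)
        if all(math.dist(a, b) < tol for a, b in zip(old, centers)):  # Step 4
            break
    return labels, centers


# Hypothetical data (metres): one horizontal and one vertical stroke.
strokes = [[(0, 0), (1000, 0)], [(2000, 0), (2000, 1000)]]
pois = [(100, 19), (300, 21), (500, 19), (700, 21), (500, 90),
        (1986, 100), (1984, 300), (1986, 500), (1984, 700), (1000, 500)]
labels, centers = cluster_pois(pois, strokes)
```

Because λ is recomputed from the currently assigned POIs in every iteration, the rejection threshold tightens once outliers have been removed; the sketch therefore stops as soon as the cluster centers stabilise, as Step 4 prescribes.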
2.3. Delineating the geometric patterns of POIs
After k-means clustering, the POIs are partitioned into k clusters and a null cluster. Each of the k clusters corresponds to one stroke. The geometric pattern of each cluster is extracted to form a skeleton graph for building the pattern-related correspondences between POIs and road networks. Yang et al. (2014) fitted a straight-line segment to each cluster and connected these line segments using predefined rules and tuned thresholds, which made it difficult to obtain a reliable skeleton graph of POIs. In contrast, we generate principal curves, smooth 'self-consistent' curves passing through the middle of p-dimensional data (Hastie 1984), to fit the clustered POIs. The polygonal method of Kégl et al. (2000) is used to describe the geometric pattern of the POIs assigned to an individual stroke as a principal polyline. Let X = {xi | i = 1…n} be the set of POIs assigned to the same stroke and f be a polyline consisting of k points, that is, f = {yj | j = 1…k}. The polyline is defined as the principal curve of X when it minimizes the penalized squared error
$$G_n(f) = \Delta_n(X, f) + \eta\, P(f), \tag{4}$$
and
$$\eta = \eta_0 \cdot k \cdot n^{-1/3} \cdot \Delta_n(X, f)^{1/2} \cdot r^{-1}, \tag{5}$$
where Δn(X, f) is the average squared distance between the training data X and f; P(f) is the penalty function measuring the smoothness of f; η is the penalty factor calculated by Equation (5); η0 is a predefined parameter; and r is the maximum distance of the points in X to the center of X. In practice, it is computationally difficult to obtain the minimum penalized squared error. Hence, a suboptimal strategy is adopted:
Step 1: The principal curve f is initialized as the first principal line, k = 2.
Step 2: All points of X are projected to the segments or vertices of f according to Equation (6). Vj (j = 1…k) stores the observed points projecting to the vertex yj in f. Sj (j = 1…k−1) stores the observed points projecting to the segment sj in f. All points in X are thus partitioned into 2k − 1 sets.
$$\begin{aligned} V_j &= \{\, x_i \in X \mid \Delta(x_i, f) = \Delta(x_i, y_j),\ y_j \in f \,\}, \\ S_j &= \{\, x_i \in X \mid \Delta(x_i, f) = \Delta(x_i, s_j),\ s_j = \overrightarrow{y_j y_{j+1}} \,\}. \end{aligned} \tag{6}$$
Step 3: Select the segment sm of f with the largest number of points projecting to it and insert the middle point of sm into f. Repartition the points projecting to sm into the new segments and vertex of f.
Step 4: Reshape the vertices of f by minimizing the local penalized squared error iteratively. Repeat steps 2–4 until the penalized squared error converges to a small value or the number of points in f exceeds the threshold C.
We assume that moving one vertex of f only affects the local penalized squared error at that vertex. The new position of vertex yj of f is determined to minimize the local penalized squared error. The local penalized squared error at vertex yj is
$$G_n(y_j) = \frac{1}{n}\,\Delta_n(y_j) + \eta \cdot \frac{1}{k}\,P(y_j), \tag{7}$$
where Δn(yj) and P(yj) are the local squared distance and penalty measures (see Kégl et al. 2000), and η is calculated by Equation (5). The principal curve method inserts one point into the principal curve in each iteration and recursively optimizes the position of each vertex by minimizing the local penalized squared error. Hence, the POIs in each cluster are fitted as a single curve. The fitted curves constitute the skeleton graph of POIs. Figure 8 shows the fitted principal curves over several iterations. As the number of iterations increases, the fitted curve better represents the observed points.
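The projection step (Step 2, Equation (6)) and the vertex insertion (Step 3) can be sketched as below. This is an illustrative fragment only: the function names are hypothetical, and the vertex optimization of Step 4, which requires the penalty terms of Kégl et al. (2000), is deliberately left out.

```python
def closest_on_segment(p, a, b):
    """Return (squared distance, t) for the closest point a + t*(b - a), t in [0, 1]."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    len2 = dx * dx + dy * dy
    t = 0.0 if len2 == 0 else ((p[0] - a[0]) * dx + (p[1] - a[1]) * dy) / len2
    t = max(0.0, min(1.0, t))
    qx, qy = a[0] + t * dx, a[1] + t * dy
    return (p[0] - qx) ** 2 + (p[1] - qy) ** 2, t


def partition(points, f):
    """Split points into vertex sets V[j] and segment sets S[j] of polyline f."""
    V = [[] for _ in f]
    S = [[] for _ in f[:-1]]
    for p in points:
        best = None
        for j in range(len(f) - 1):
            d2, t = closest_on_segment(p, f[j], f[j + 1])
            if best is None or d2 < best[0]:
                best = (d2, j, t)
        _, j, t = best
        if t == 0.0:      # nearest point is the start vertex of segment j
            V[j].append(p)
        elif t == 1.0:    # nearest point is the end vertex of segment j
            V[j + 1].append(p)
        else:             # nearest point lies in the interior of segment j
            S[j].append(p)
    return V, S


def insert_midpoint(f, S):
    """Insert the midpoint of the segment with the most projected points."""
    m = max(range(len(S)), key=lambda j: len(S[j]))
    a, b = f[m], f[m + 1]
    mid = ((a[0] + b[0]) / 2, (a[1] + b[1]) / 2)
    return f[:m + 1] + [mid] + f[m + 1:]


# Hypothetical cluster of four POIs and an initial two-vertex polyline (k = 2).
f = [(0.0, 0.0), (10.0, 0.0)]
points = [(-2.0, 1.0), (5.0, 3.0), (12.0, -1.0), (3.0, -2.0)]
V, S = partition(points, f)   # 2k - 1 = 3 sets in total
f = insert_midpoint(f, S)     # the polyline now has three vertices
```

Repeating `partition` after each insertion yields the sets needed to evaluate the local penalized squared error of Equation (7) at every vertex.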
2.4. Geometric adjustment and semantic enrichment
To conflate the POIs and road networks geometrically, control point pairs are identified from the skeleton graphs of POIs and road networks. As illustrated in Figure 9, the solid and dashed gray lines are the skeleton graphs of POIs and road networks, respectively. Suppose that A and A′ are two matched edges of the
Figure 8. Extracting the principal curve based on polygonal algorithm.
Figure 9. Identifying the controlling points from skeleton graphs of POIs and roads.
skeleton graphs of POIs and road networks. The corresponding intersection nodes, linked by thick black lines, are first determined as the initial controlling points (CPs). Then the corresponding vertices (linked by thin black lines) between two corresponding nodes are inserted into the CPs. Based on these CPs, a rubber-sheeting transformation is executed to align the POIs to the associated road networks geometrically. Then, the addresses of POIs and the names of the associated road segments are compared to mutually check the semantic consistency between the POIs and the road segments. Let RN = {RN0, RN1,…, RNm} be the set of road names of all road segments in the road network. The associated road segment of each POI is the nearest of the road segments that compose the stroke to which the POI was linked by k-means clustering. Suppose that road segment Ri is the associated road segment of POI Pi, Addri is the address of Pi, and Namei is the road name of Ri. Then, the semantic relation between the two associated objects Pi and Ri is described as Ti = (Namei, Addri). Based on the decision tree in Figure 10a, the set of POIs associated with one road segment is classified
Figure 10. Classification of POI types by decision tree.
into five types (i.e., the leaf nodes C1,…,C5). The corresponding rules R1,…,R6 are described in Figure 10b. For example, consider a POI Pi with Addri = 'No. 50, Wangfujing Street, Dongcheng District' and its associated road segment Ri with Namei = 'Wangfujing Street'. As the address of Pi contains the name of Ri (i.e., Namei ∩ Addri = Namei), Ti = (Namei, Addri) successively meets the rules R2, R3, and R5. Hence, the POI Pi is categorized as C3. The C3 type indicates that the POI and the associated road segment have relatively complete semantic data such as road names. The five types of POIs associated with one road segment are counted. Let Ni (i = 1,…,5) be the count of each of the five types of POIs associated with one road segment. For each road segment, the percentage of C3 type POIs associated with it is
$$R = \frac{N_3}{\sum_{i=1}^{5} N_i} \times 100\% \tag{8}$$
If R > 60%, the road segment is classified as semantically consistent; otherwise, it is classified as semantically inconsistent. For a road segment classified as semantically inconsistent, the address data of its associated POIs are processed by a Chinese word segmentation library (Pan Gu Segment, http://pangusegment.codeplex.com/) that segments Chinese and English words from sentences. The frequency of each segmented word is counted, and the three most frequent words are likely to be road name-related keywords, which are then checked manually to guarantee the correctness of the inferred road names.
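The consistency test of Equation (8) and the keyword ranking can be sketched as follows. The sketch assumes that each POI has already been classified into one of the types C1 to C5 by the decision tree of Figure 10 (whose rules are not reproduced here) and that the addresses have already been segmented into words; the function names and the sample tokens are hypothetical.

```python
from collections import Counter


def consistency_ratio(poi_types):
    """Percentage of C3-type POIs among the POIs linked to one road segment."""
    counts = Counter(poi_types)
    total = sum(counts.values())
    return 100.0 * counts["C3"] / total if total else 0.0


def is_consistent(poi_types, threshold=60.0):
    """A road segment is semantically consistent if R exceeds the threshold."""
    return consistency_ratio(poi_types) > threshold


def candidate_road_names(tokenized_addresses, top_n=3):
    """Most frequent address words, offered as candidates for manual checking."""
    freq = Counter(tok for tokens in tokenized_addresses for tok in tokens)
    return [word for word, _ in freq.most_common(top_n)]


# Hypothetical POI types for one road segment: 1 of 4 POIs is C3, so R = 25%.
types = ["C3", "C1", "C2", "C5"]
if not is_consistent(types):
    # Hypothetical output of a word segmenter applied to the POI addresses.
    tokens = [["No.", "50", "Wangfujing Street", "Dongcheng District"],
              ["Wangfujing Street", "Dongcheng District"],
              ["Wangfujing Street"]]
    candidates = candidate_road_names(tokens)
```

The ranking only proposes candidates; as stated above, the inferred road names still need to be checked manually.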
3. Experiments and results analysis
Two experimental data sets (each covering about 6 × 8 km) of Beijing and Shanghai in China were selected to validate the proposed method. The corresponding road networks were downloaded from the OSM website (www.openstreetmap.org). The POIs of Beijing were obtained from VGI data, and the POIs of Shanghai were obtained from a professional agency. The Beijing data set has 4748 POIs and 6903 road segments, and the Shanghai data set has 13,623 POIs and 4680 road segments. The threshold parameters for conflating crowdsourcing road networks and POIs are the same for both data sets and are listed in Table 1.
Table 1. The thresholds for conflating road networks and POIs.
Step                                      Threshold   Value
Generating high-level strokes             Φ           120°
Generating high-level strokes             R           0.2
Generating high-level strokes             ε           40 m
Linking POIs with strokes                 δ           100 m
Linking POIs with strokes                 λ0          1.8
Delineating geometric patterns of POIs    η0          0.4
Delineating geometric patterns of POIs    C           30
Thresholds Φ (Thomson 2006) and R (empirically set) are used in aggregating neighboring road segments and splitting large bending strokes (see Figure 5). ε is the buffering distance used to identify candidate multilane strokes, estimated as the average road width. The angle between candidate parallel strokes (180° − Φ) and the length ratio within the buffer (1 − R) are used to further identify multilane strokes. In the k-means clustering, δ is the buffering distance used to identify candidate linked strokes for POIs, estimated as the largest geometric deviation between road networks and POIs after rough alignment. λ0 and η0 are defined in Equations (3) and (5) and are fine-tuned according to Tseng (2007) and Kégl et al. (2000), respectively. C is empirically set as the largest number of points in a fitted curve.
3.1. Results of extracting the geometric patterns from road networks and POIs
The extracted geometric patterns of the road networks and POIs in Beijing and Shanghai are illustrated in Figures 11 and 12, respectively. The generated high-level strokes in the areas of Beijing and Shanghai are depicted as thick solid lines in Figures 11a and 12a, respectively. In Figures 11b and 12b and c, the POIs in an identical color are assigned to one particular stroke, and those POIs clustered with the null stroke are drawn as gray dots. The dashed lines in Figures 11 and 12 are the principal curves extracted from the clustered POIs. It can be seen from Figures 11 and 12 that the proposed k-means clustering method not only correctly recognized the associated POIs of each stroke but also efficiently eliminated those POIs far away from the strokes. As shown in the ellipse of Figure 11b, clusters with few POIs (e.g., fewer than 6) are not delineated as principal curves, to ensure the reliability of the geometric patterns extracted from POIs. On the other hand, the method may fail to cluster many neighboring POIs (e.g., the rectangle of Figure 11b) that follow the pattern of a road segment, because they are distributed far away from the generated strokes. The geometric patterns of POIs correctly keep the shape characteristics of the clustered POIs and their associated road segments. Figure 12c depicts the enlarged part of the black rectangle in Figure 12b, demonstrating that the curves correctly preserve the geometric patterns of POIs. By comparing the geometric patterns extracted from road networks and POIs, the position and object correspondences are correctly recognized, which supports the geometric adjustment and the subsequent semantic enrichment.
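The filtering step described above, in which small clusters and POIs assigned to the null stroke are excluded from principal curve fitting, can be sketched as follows. The cutoff of 6 follows the example given in the text; the function name and the dictionary layout are hypothetical.

```python
MIN_CLUSTER_SIZE = 6  # assumed cutoff, following the "fewer than 6" example


def split_clusters(assignments, min_size=MIN_CLUSTER_SIZE):
    """Group POIs by stroke and separate the clusters large enough to fit.

    assignments maps a POI id to a stroke id, or to None for the null stroke.
    Returns (clusters to fit, clusters skipped as too small, null-stroke POIs).
    """
    clusters = {}
    unassigned = []
    for poi_id, stroke_id in assignments.items():
        if stroke_id is None:
            unassigned.append(poi_id)
        else:
            clusters.setdefault(stroke_id, []).append(poi_id)
    fit = {s: ids for s, ids in clusters.items() if len(ids) >= min_size}
    skipped = {s: ids for s, ids in clusters.items() if len(ids) < min_size}
    return fit, skipped, unassigned


# Hypothetical assignment: six POIs on stroke "A", two on "B", one on none.
assignments = {f"p{i}": "A" for i in range(6)}
assignments.update({"p6": "B", "p7": "B", "p8": None})

fit, skipped, unassigned = split_clusters(assignments)
print(sorted(fit), sorted(skipped), unassigned)
```

Only the clusters in `fit` would be passed on to a principal curve fitting routine, which is outside the scope of this sketch.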
Figure 11. The extracted geometric patterns from road networks and POIs in Beijing.
Figure 12. The extracted geometric patterns from road networks and POIs in Shanghai.
3.2. Geometric adjustment and evaluation
The original POIs, the POIs after rough alignment, and the POIs after geometric adjustment are represented as black, gray, and white dots in Figure 13, respectively. Consider, for example, the original POIs in the left ellipse. These POIs are shifted into the right ellipse (gray POIs) after rough alignment, a large displacement that compensates for the data encryption errors and unstable GPS precision between the heterogeneous POIs and road networks. However, nonrigid deviation still exists between the POIs and the road segments after rough alignment. Building the object correspondences for the semantic data enrichment corrects this remaining deviation. As depicted in the right ellipse of Figure 13, the white POIs are consistently distributed along the associated road segments after geometric adjustment.
Figure 13. Spatial adjustment based on the pattern-related matching.
The spatial relations between the associated POIs and road segments are the foremost concern for mobile LBS applications (Yang et al. 2014). It is important to retain the position of a POI to the left or right of a road (Sester et al. 2014). For lack of reference road networks, the major map service websites in China (e.g., Baidu Map, Sogou Map) are taken as the ground truth. We compared the spatial relations of the POIs and road networks after geometric adjustment with those on the map service websites to quantitatively evaluate the proposed conflation method. We randomly selected approximately 10% of the POIs near the road segments and checked whether the spatial relation between each POI and its road segments is consistent with that on the map service websites. As illustrated in Figure 14a, the selected POI A′ in cyan is located on the upper-right side of the road intersection. We search for the name of A′ in Baidu Map and find one POI A in Figure 14b, which is also on the upper-right side of the same road intersection. The spatial relations between A′ and its neighboring road segments in Figure 14a are consistent with those of A in Figure 14b; hence, the POI A′ can be considered geometrically correct (green tick). Otherwise, if an inconsistent spatial relation or a large discrepancy occurs for a POI, it is marked as incorrect (red cross). If no corresponding record with the name of POI A′ is found on the map service websites, the POI is marked as uncertain (dots in yellow and black). The blue dots in Figure 14a are the POIs still to be judged manually. The numbers of correct, incorrect, and uncertain POIs found by manual inspection are listed in Table 2.
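The left/right relation checked during the manual inspection can also be expressed computationally. The sketch below is not the procedure used in the experiments, which were inspected by hand; it assumes planar coordinates with x pointing east and y pointing north, and uses the sign of a 2D cross product to decide on which side of a directed road segment a POI lies. The function names are hypothetical, and the "large discrepancy" criterion mentioned in the text is not modeled.

```python
def side_of_road(poi, start, end):
    """Return 'left', 'right', or 'on' for a POI relative to start -> end."""
    cross = (end[0] - start[0]) * (poi[1] - start[1]) - (
        end[1] - start[1]
    ) * (poi[0] - start[0])
    if cross > 0:
        return "left"
    if cross < 0:
        return "right"
    return "on"


def judge_poi(adjusted_side, reference_side):
    """Label a POI as in the inspection: correct, incorrect, or uncertain.

    reference_side is None when no matching record exists in the
    reference map service.
    """
    if reference_side is None:
        return "uncertain"
    return "correct" if adjusted_side == reference_side else "incorrect"


# A road segment heading east, with a POI to its north (left-hand side).
adjusted = side_of_road((5.0, 3.0), (0.0, 0.0), (10.0, 0.0))
print(adjusted, judge_poi(adjusted, "left"), judge_poi(adjusted, None))
```

The result depends on the digitizing direction of the segment, so both data sources would need to use the same direction for the comparison to be meaningful.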
It can be seen from Table 2 that approximately 65% of the crowdsourcing POIs in Beijing correctly preserve the spatial relations with their neighboring road segments by our proposed method. Most remaining POIs either have extremely large displacement or do not have the corresponding records in the public map services applications. In comparison, more than 85% of the authoritative POIs in Shanghai have consistent spatial relations with the map services applications.
Figure 14. Manual inspection of the spatial relations between POIs and road networks.
Table 2. Spatial accuracy of two test data sets compared to the map service websites.

            Correct         Incorrect   Uncertain   Total
Beijing     1085 (65.8%)    471         92          1648
Shanghai    1692 (85.4%)    187         102         1981
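The percentages reported in Table 2 follow directly from the counts, as the short check below shows. The counts are copied from Table 2; the dictionary layout and the function name are only illustrative.

```python
# Counts of inspected POIs taken from Table 2.
table2 = {
    "Beijing": {"correct": 1085, "incorrect": 471, "uncertain": 92},
    "Shanghai": {"correct": 1692, "incorrect": 187, "uncertain": 102},
}


def correct_share(counts):
    """Return (total inspected POIs, percentage of correct POIs)."""
    total = sum(counts.values())
    return total, round(100.0 * counts["correct"] / total, 1)


for city, counts in table2.items():
    total, share = correct_share(counts)
    print(f"{city}: {counts['correct']}/{total} = {share}%")
```

Note that the uncertain POIs are kept in the denominator, which is how the totals in Table 2 are formed.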
3.3. Results of semantic inconsistency detection
The correspondences between the POIs and the strokes are built by the k-means clustering method, and the associated road segment of a POI is determined by selecting the nearest of the road segments composing the stroke. Generally, the associated road name of a POI is the most important geo-reference information for locating it. Hence, we detected the semantic inconsistencies between POIs and road networks using the associated road names. Five types (C1…C5) of POIs were categorized (see Section 2.4), and the percentages of all types of POIs are illustrated in Figure 15, showing that most POIs have the correct associated road names. The percentage of the C3 type in Beijing (about 60%) is somewhat lower than that in Shanghai (80%), indicating that the semantic quality of crowdsourcing POIs is inferior to that of authoritative POIs. However, the semantic consistency (C3 percentage) is similar to the spatial consistency (Table 2). Thus, conflating crowdsourcing POIs and road networks can efficiently detect and eliminate incorrect geo-coding of candidate POIs. The consistent and inconsistent semantic types of road segments were counted according to the value of R in Equation (8), and the missing road names in road networks were also inferred from the addresses of the associated POIs. Figure 16 illustrates the result for the Shanghai data set, depicting the consistent (R > 60%) and inconsistent (R ≤ 60%) semantic types of road segments as red and blue lines, respectively. The thin black road segments are those to which no POIs are aligned. The labels on the road segments are the road-related words inferred from the associated POIs. Manual checking found that the segmented words ranked before the fourth order are closely related to the road name information. This provides an important cue for exploring road name information to enrich OSM data. Further selection from the road-related words is performed to correct and update the name attributes of the road segments, contributing to the semantic improvement of the road segments.
Figure 15. Five types of POIs categorized by the road name-related information.
Figure 16. Inferring the road names from the associated POIs.
4. Conclusion
Crowdsourcing geospatial data brings great opportunities and challenges to the whole GIS community (Sui et al. 2013). On the one hand, the popularity of crowdsourcing geospatial data provides highly efficient and low-cost data sources for geospatial data updating and map analysis. On the other hand, the participation of nonexperts and the lack of quality assurance are likely to lead to heterogeneous data representations. In particular, unpredictable discrepancies, inconsistent spatial relations, and unstructured semantic descriptions occur in crowdsourcing road networks and POIs. A novel approach is therefore urgently required to conflate crowdsourcing road networks and POIs. We thus propose a pattern-mining method to conflate crowdsourcing road networks and POIs geometrically and semantically. The proposed method first extracts the high-level strokes of road networks, which constitute the skeleton graph of road networks. Then the POIs are assigned to the high-level strokes by the k-means clustering method, and the geometric pattern of each cluster of POIs is delineated as a principal curve; these curves compose the skeleton graph of POIs. The skeleton graphs of POIs and road networks are then linked to find the controlling points for geometric adjustment between POIs and road networks. The associated road name of a POI is the most important geo-reference information for locating it. Hence, the road-related semantic data (e.g., road names) are compared to detect semantic inconsistency, providing a functional solution for the updating and change detection of road names. Experimental studies demonstrate that the proposed method conflates heterogeneous POIs and road networks robustly and enriches the semantic data of road networks with POIs. Future research will focus on automated data enrichment of POIs and road networks, and on the conflation and updating of various crowdsourcing POIs.
Acknowledgments
Special thanks go to the editor and the anonymous reviewers for their constructive comments, which substantially improved the quality of the paper.
Funding
This work was jointly supported by the project from 863 (no. 2012AA12A211), the Academic Award for Excellent Ph.D. Candidates funded by the Ministry of Education of China (no. 5052012619001), and the Fundamental Research Funds for the Central Universities (no. 3103005).
References
Adams, B. and Gahegan, M., 2014. Emerging data challenges for next-generation spatial data [online]. In: The proceeding of 2014 LOCATE, 7–9 April, Canberra, 118–129.
Al-Bakri, M. and Fairbairn, D., 2012. Assessing similarity matching for possible integration of feature classifications of geospatial data from official and informal sources. International Journal of Geographical Information Science, 26 (8), 1437–1456. doi:10.1080/13658816.2011.636012
Butenuth, M., et al., 2007. Integration of heterogeneous geospatial data in a federated database. ISPRS Journal of Photogrammetry and Remote Sensing, 62 (5), 328–346. doi:10.1016/j.isprsjprs.2007.04.003
Chen, C.-C., Knoblock, C.A., and Shahabi, C., 2006. Automatically conflating road vector data with orthoimagery. GeoInformatica, 10 (4), 495–530. doi:10.1007/s10707-006-0344-6
Du, H., et al., 2012. Geospatial information integration for authoritative and crowd sourced road vector data. Transactions in GIS, 16 (4), 455–476. doi:10.1111/j.1467-9671.2012.01303.x
Duckham, M. and Worboys, M., 2005. An algebraic approach to automated geospatial information fusion. International Journal of Geographical Information Science, 19 (5), 537–557. doi:10.1080/13658810500032339
Gao, S., et al., 2014. Constructing gazetteers from volunteered big geo-data based on Hadoop. Computers, Environment and Urban Systems. doi:10.1016/j.compenvurbsys.2014.02.004
Goetz, M. and Zipf, A., 2013. The evolution of geo-crowdsourcing: bringing volunteered geographic information to the third dimension. In: D. Sui, S. Elwood, and M. Goodchild, eds. Crowdsourcing geographic knowledge. Dordrecht: Springer, 139–159.
González, C.-H., López-Vázquez, C., and Bernabé, M.-Á., 2013. Ephemeral conflation. The Cartographic Journal, 50 (1), 43–48. doi:10.1179/1743277412Y.0000000014
Hastie, T., 1984. Principal Curves and Surfaces. Thesis (PhD). Stanford University.
Heipke, C., 2010. Crowdsourcing geospatial data. ISPRS Journal of Photogrammetry and Remote Sensing, 65 (6), 550–557. doi:10.1016/j.isprsjprs.2010.06.005
Jain, A.K., 2010. Data clustering: 50 years beyond K-means. Pattern Recognition Letters, 31 (8), 651–666. doi:10.1016/j.patrec.2009.09.011
Jiang, B., 2013. Volunteered geographic information and computational geography: new perspectives. In: D. Sui, S. Elwood, and M. Goodchild, eds. Crowdsourcing geographic knowledge. Dordrecht: Springer, 125–138.
Jiang, B. and Claramunt, C., 2004. Topological analysis of urban street networks. Environment and Planning B: Planning and Design, 31, 151–162. doi:10.1068/b306
Kégl, B., et al., 2000. Learning and design of principal curves. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22 (3), 281–297. doi:10.1109/34.841759
Li, J., et al., 2014a. Mining trajectory data and geotagged data in social media for road map inference. Transactions in GIS. doi:10.1111/tgis.12072
Li, Q., et al., 2014b. Polygon-based approach for extracting multilane roads from OpenStreetMap urban road networks. International Journal of Geographical Information Science. doi:10.1080/13658816.2014.915401
Liu, X., Ai, T., and Liu, Y., 2009. Road density analysis based on skeleton partitioning for road generalization. Geo-spatial Information Science, 12 (2), 110–116. doi:10.1007/s11806-009-0012-8
Mackaness, W. and Edwards, G., 2002. The importance of modelling pattern and structure in automated map generalisation. In: Joint workshop on multi-scale representations of spatial data, 7–8 July, Ottawa, ON, 1–11.
McKenzie, G., Janowicz, K., and Adams, B., 2014. A weighted multi-attribute method for matching user-generated Points of Interest. Cartography and Geographic Information Science, 41 (2), 125–137. doi:10.1080/15230406.2014.880327
Mooney, P., Sun, H., and Yan, L., 2011. VGI as a dynamically updating data source in location-based services in urban environments. In: Proceedings of the 2nd international workshop on Ubiquitous crowdsouring, 18 September, Beijing, 13–16.
Neis, P. and Zielstra, D., 2014. Recent developments and future trends in volunteered geographic information research: the case of OpenStreetMap. Future Internet, 6 (1), 76–106. doi:10.3390/fi6010076
Ruiz, J.J., et al., 2011. Digital map conflation: a review of the process and a proposal for classification. International Journal of Geographical Information Science, 25 (9), 1439–1466. doi:10.1080/13658816.2010.519707
Saalfeld, A., 1988. Conflation automated map compilation. International Journal of Geographical Information Systems, 2 (3), 217–228. doi:10.1080/02693798808927897
Safra, E., et al., 2013. Ad hoc matching of vectorial road networks. International Journal of Geographical Information Science, 27 (1), 114–153. doi:10.1080/13658816.2012.667104
Scheffler, T., Schirru, R., and Lehmann, P., 2012. Matching points of interest from different social networking sites. In: B. Glimm and A. Krüger, eds. KI 2012: advances in artificial intelligence. Berlin: Springer, 245–248.
Sester, M., et al., 2014. Integrating and generalising volunteered geographic information. In: D. Burghardt, C. Duchêne, and W. Mackaness, eds. Abstracting geographic information in a data rich world. Switzerland: Springer International Publishing, 119–155.
Sheeren, D., Mustière, S., and Zucker, J.-D., 2009. A data-mining approach for assessing consistency between multiple representations in spatial databases. International Journal of Geographical Information Science, 23 (8), 961–992. doi:10.1080/13658810701791949
Song, W., et al., 2009. Automated geospatial conflation of vector road maps to high resolution imagery. IEEE Transactions on Image Processing, 18 (2), 388–400. doi:10.1109/TIP.2008.2008044
Sui, D., Goodchild, M., and Elwood, S., 2013. Volunteered Geographic Information, the Exaflood, and the Growing Digital Divide. In: D. Sui, S. Elwood, and M. Goodchild, eds. Crowdsourcing Geographic Knowledge. Dordrecht: Springer, 1–12.
Sun, Y., et al., 2013. Road-based travel recommendation using geo-tagged images. Computers, Environment and Urban Systems. doi:10.1016/j.compenvurbsys.2013.07.006
Thom, S., 2006. Conflict identification and representation for roads based on a Skeleton. In: Proceedings of the 12th international symposium on spatial data handling, 12–14 July, Vienna, 659–680.
Thomson, R.C., 2006. The ‘stroke’ concept in geographic network generalization and analysis. In: Proceedings of the 12th international symposium on spatial data handling, 12–14 July, Vienna, 681–697.
Touya, G. and Brando-Escobar, C., 2013. Detecting level-of-detail inconsistencies in Volunteered Geographic Information data sets. Cartographica: The International Journal for Geographic Information and Geovisualization, 48 (2), 134–143. doi:10.3138/carto.48.2.1836
Tseng, G.C., 2007. Penalized and weighted K-means for clustering with scattered objects and prior information in high-throughput biological data. Bioinformatics, 23 (17), 2247–2255. doi:10.1093/bioinformatics/btm320
Wang, J., et al., 2014. A novel approach for generating routable road maps from vehicle GPS traces. International Journal of Geographical Information Science. doi:10.1080/13658816.2014.944527
Yang, B., Luan, X., and Li, Q., 2011. Generating hierarchical strokes from urban street networks based on spatial pattern recognition. International Journal of Geographical Information Science, 25 (12), 2025–2050. doi:10.1080/13658816.2011.570270
Yang, B., Zhang, Y., and Lu, F., 2014. Geometric-based approach for integrating VGI POIs and road networks. International Journal of Geographical Information Science, 28 (1), 126–147. doi:10.1080/13658816.2013.830728
Yang, B., Zhang, Y., and Luan, X., 2013. A probabilistic relaxation approach for matching road networks. International Journal of Geographical Information Science, 27 (2), 319–338. doi:10.1080/13658816.2012.683486
Zhang, M., Yao, W., and Meng, L., 2014a. Enrichment of topographic road database for the purpose of routing and navigation. International Journal of Digital Earth, 7 (5), 411–431. doi:10.1080/17538947.2012.717110
Zhang, X., et al., 2014b. Data matching of building polygons at multiple map scales improved by contextual information and relaxation. ISPRS Journal of Photogrammetry and Remote Sensing, 92, 147–163. doi:10.1016/j.isprsjprs.2014.03.010
Zhou, Q. and Li, Z., 2012. A comparative study of various strategies to concatenate road segments into strokes for map generalization. International Journal of Geographical Information Science, 26 (4), 691–715. doi:10.1080/13658816.2011.609990
Zook, M., et al., 2010. Volunteered geographic information and crowdsourcing disaster relief: a case study of the Haitian earthquake. World Medical & Health Policy, 2 (2), 7–33. doi:10.2202/1948-4682.1069