Assessing Crowdsourced POI Quality: Combining Methods Based on ...

International Journal of

Geo-Information Article

Assessing Crowdsourced POI Quality: Combining Methods Based on Reference Data, History, and Spatial Relations Guillaume Touya 1, *, Vyron Antoniou 2, *, Ana-Maria Olteanu-Raimond 1, * and Marie-Dominique Van Damme 1, * 1 2

*

Univ. Paris-Est, LASTIG COGIT, IGN, ENSG, F-94160 Saint-Mande, France Hellenic Military Geographical Service, Evelpidon 4, Athens 11362, Greece Correspondence: [email protected] (G.T.); [email protected] (V.A.); [email protected] (A.-M.O.-R.); [email protected] (M.-D.V.-D.)

Academic Editors: Alexander Zipf, David Jonietz, Linda See and Wolfgang Kainz Received: 5 December 2016; Accepted: 8 March 2017; Published: 14 March 2017

Abstract: With the development of location-aware devices and the success and high use of Web 2.0 techniques, citizens are able to act as sensors by contributing geographic information. In this context, data quality is an important aspect that should be taken into account when using this source of data for different purposes. The goal of the paper is to analyze the quality of crowdsourced data and to study its evolution over time. We propose two types of approaches: (1) use the intrinsic characteristics of the crowdsourced datasets; or (2) evaluate crowdsourced Points of Interest (POIs) using external datasets (i.e., authoritative reference or other crowdsourced datasets), and two different methods for each approach. The potential of the combination of these approaches is then demonstrated, to overcome the limitations associated with each individual method. In this paper, we focus on POIs and places coming from the very successful crowdsourcing project: OpenStreetMap. The results show that the proposed approaches are complementary in assessing data quality. The positive results obtained for data matching show that the analysis of data quality through automatic data matching is possible but considerable effort and attention are needed for schema matching given the heterogeneity of OSM and the representation of authoritative datasets. For the features studied, it can be noted that change over time is sometimes due to disagreements between contributors, but in most cases the change improves the quality of the data. Keywords: crowdsourced data quality; point of interest; crowdsourced data evolution; data matching; OpenStreetMap; intrinsic data quality assessment; extrinsic data quality assessment; spatial relations

1. Introduction Points of Interest (POIs) describe places of interest for any user of geographic information, such as touristic places, amenities, or shops, and can be easily crowdsourced by citizens. As a consequence, they are often a major part of the contributions of volunteered geographic information (VGI) projects; see for instance OpenStreetMap (OSM) and Wikimapia. However, representing a geographic place that can have an extent as large as a building using the point primitive is not a straightforward process for contributors or even for professional surveyors: what is the best location for a point representing a school that might be composed of multiple buildings? Moreover, crowdsourced data might be subject to frequent, minor or major, changes until a consensus is reached among volunteers. So, only measuring the distance between such POI data and their counterpart in a reference dataset is not particularly meaningful. However, VGI POI datasets do need to be assessed to be effectively used in

ISPRS Int. J. Geo-Inf. 2017, 6, 80; doi:10.3390/ijgi6030080

www.mdpi.com/journal/ijgi

ISPRS Int. J. Geo-Inf. 2017, 6, 80

2 of 29

applications: for instance, when the POIs are to be displayed on a map, the consistency of their location into the vague region they represent is very important to reduce the map reader cognitive load. From the beginning, assessing data quality was one of the main research topics related to VGI, and particularly OSM (see for example [1]), and is mainly performed through comparisons with authoritative reference datasets [2,3]. Many methods have then been proposed to assess VGI quality; a review can be found in [4–6], among others. The quality assessment of VGI POIs is essential as many applications are based on such data, and can be affected by low-quality POIs. POIs from OSM, Wikipedia, or social networks such as Facebook or Swarm can be used for user-centric wayfinding [7], for population analysis [8,9], for land use mapping [10], for urban analysis [11], or for the analysis of people’s perception of places [12]. Despite the aforementioned and ongoing research around VGI quality, there is no holistic and comprehensive framework for assessing the overall quality of such data (research in this direction can be found, for example, in [13,14]) and the existing methods (e.g., ISO-based quality evaluation) are not inclusive or flexible enough to accommodate this new type of data. The crowdsourced nature of VGI is fundamentally different to classic Geographic Information (GI), not least because there is now a strong social factor behind public contributions and data creation. Thus, it is not uncommon for VGI to suffer from participation biases at all levels of granularity. Moreover, as VGI comes in many flavors and from diverse sources, existing quality evaluation methods do not always provide the means to evaluate them. For, example there are both explicit (e.g., OSM, Geograph, etc.) and implicit (e.g., Flickr, Twitter, etc.) sources of GI [15,16], which, considering the advances in GI retrieval methodologies, can be used to extract meaningful data from all kinds of content available on the Web and thus create innovative spatial products (e.g., from bicycle maps to emotion maps). These spatial products are out of the scope of traditional geo-data producers such as National Mapping Agencies (NMAs) or large corporations. Additionally, VGI has matured enough to be considered as a replacement of, or as a way to enrich, authoritative data with GI, which has, so far, been missing from authoritative databases. Thus, along with the evaluation against authoritative reference data, VGI needs an autonomous and comprehensive methodology for quality evaluation. In this context, much of the research work has focused on the discovery, documentation, and development of intrinsic VGI quality indicators [4]. In such studies the efforts have focused mainly on a feature-level examination. The examination of the lineage (e.g., versions), of the topology or of the geometric characteristics (e.g., length, area), can result in a basic understanding of overall quality of a VGI dataset. Following this line of research, this paper examines different ways to evaluate VGI quality. Recent research has proposed methods to evaluate VGI POIs based on history [17] or on comparisons with reference datasets [18], but neither is enough to assess quality alone. In order to assess the potential of combining such methods on POIs, we compare alternative methods on the same dataset. Thus, our first aim is to assess the usability of these methods for VGI POI quality evaluation, and to identify what aspects of data quality each method can or cannot cover. Our second aim is to experiment with a more holistic approach by combining the different methods and identifying what more we can learn regarding quality through this combination. For example, comparing each instance of a feature history against reference data can tell us if new versions of the feature improved its quality. The remainder of this paper is structured as follows: Section 2 provides an overview of the methodology and the data used. Section 3 evaluates VGI data using their lineage; Section 4 assesses VGI quality using internal spatial relations. Section 5 discusses the quality assessment using data matching techniques by comparing them with authoritative reference data; Section 6 discusses quality evaluation using data from other VGI sources. Then, Section 7 discusses the possibilities when combining the previous methods. Finally, in Section 8, the paper provides conclusions and recommendations for future work.


3 of 29

2. Methodology and Datasets Used In order to assess the quality of crowdsourced POIs, we used two general approaches: (i) use intrinsic characteristics and measurements derived from the crowdsourced datasets at hand, and (ii) evaluate crowdsourced POIs using an external dataset, that can either be an authoritative reference or another crowdsourced dataset. More specifically, the data under evaluation are the OSM POIs for the Paris region. For the first approach, two methods are proposed for quality evaluation: using the history of each feature alone (Section 3) and analyzing the spatial and topological relationships of the OSM features (Section 4). For the second case, we examined the quality of OSM points against authoritative ones from IGN (The French Mapping Agency) by using a data matching algorithm (see Section 5) and by using geo-tagged photographs from Flickr (see Section 6). Finally, all these evaluation methods were combined to improve the analysis of data quality and generate new knowledge (see Section 7). 2.1. OpenStreetMap Dataset The OSM project is one of the prime examples of crowdsourcing GI. Since 2004, the OSM project has mobilized contributors from around the globe in order to create a freely available geographic database. Today, there are more than three million registered contributors in OSM and despite the fact that the vast majority are occasional contributors and thus only a small fraction of them is consistently contributing, on a monthly basis, there are more than 20,000 active contributors (at the time of writing); this constitutes a huge workforce. The initial aim was to enable volunteers to use their local knowledge to capture geographic data for areas they know. This has since expanded and volunteers are now mapping areas worldwide due to technological advances and the open availability of very high resolution satellite and aerial imagery. This development has boosted the expansion of OSM with the tradeoff that the influence of local knowledge is diminishing in OSM data. Nevertheless, OSM is still the most up-to-date crowdsourced global geographic database today and the data are used in diverse applications from aiding humanitarian efforts [19] to use in governmental projects [20]. The OSM data model consists of three vector-based geometric primitives. The first are nodes, which are point-based and are used to capture either individual points (e.g., landmarks) or are part of the second primitive, i.e., a way, which is used to capture either linear or polygon entities. Finally, relations are used to the capture logical collection of nodes, ways, or other relations. For each of these primitives, the OSM data model allows attributes (commonly referred to as tags) to be attached using a key:value form. While a fundamental OSM principle is that contributors are free to choose whatever attributes they want, in practice there is a wiki-based specification for the OSM project that provides suggestions and best practices about the combinations that should be used in each case. Another important element of the OSM data model is that the geographic database itself is created in a wiki-based manner. This means that all contributors can edit all existing features and that all the previous versions of a feature are available, along with details for each version (e.g., when it was created, by which user, which tags were assigned to the feature, etc.). OSM provides an Application Programming Interface (API) that enables access to the geographic data, data about the contributors, and data about the individual features and their previous versions. For this study, OSM data were downloaded for a case study area in the Paris region. The coverage of the selected area is quite complete due to the presence of an active OSM community. To download the OSM data, the Geofabrik shapefile download service was used. Although there are some differences between the OSM and Geofabrik repositories (mainly due to the exclusion of non-standard tags), these differences will have no effect on our methodology. As the scope of this paper is quality evaluation of point-encoded features, and in particular the geometry, the name and the types of the OSM points, only the layers of Places (4275 features) and POIs (192,228 features) were assessed. The Geofabrik shapefiles include, as an attribute, the original and unique OSM_ID of every OSM feature; this information was used in combination with the OSM API. For example, the entire history of the OSM feature with OSM_ID 26691437, can be accessed using the API request: http://api.openstreetmap.org/api/0.6/ node/26691437/history/. The API response is an XML file that includes all the versions of the feature


4 of 29

with OSM_ID 26691437. Through an iterative process, all the versions of each OSM features in scope were downloaded and stored in a PostgreSQL/PostGIS database. This method provided a complete timeline of the OSM edits made in the area for the data of interest. 2.2. Reference Dataset The reference data were extracted from the BD TOPO database produced by IGN. BD TOPO is a topographic dataset with a positional accuracy below 1 m. Its scope does not cover all POIs that are captured in OSM (e.g., there are no shops or restaurants), but the POI layer covers education, administration, transportation, religion, health, sports, and hydrography, which gives a sufficient overlap with OSM POIs for comparison purposes. The attribute values, originally in French, have been translated to English to enable semantic matching with OSM. The IGN POI dataset contains 6202 features. The change in IGN features is due to updates (e.g., a change in the real world, errors correction), changes in specifications or changes due to partnerships that provide data to IGN (e.g., Ministry of Education for the position of schools, RAPT for the local public transportation administration). 2.3. Flickr Dataset The development of Web 2.0 [21] has led to a bi-directional Web where the aim of many web-based applications is to provide platforms that enable users to create and publish their own content and share this with other users. This development has led to the emergence of social networking websites. One of the first examples of such websites have been photograph sharing applications, e.g., Flickr, Picasa Web, or the more recent Instagram, which urges users to share their photographs along with titles, comments, keywords (known as tags), and their location (known as geo-tagging), and then to use them as means of networking. While geography or the location of the content is not the prime feature of such applications, they can still be implicit sources of GI since newly developed GI retrieval methods can transform implicit information into geospatial content [22]. In this study, geo-tagged photographs downloaded from Flickr have been used to evaluate the validity and quality of ambiguous OSM features. By using the Flickr API, all geo-tagged photographs uploaded to the Flickr website between April and December 2015 have been downloaded, resulting in a total of 79,722 geo-tagged photographs to help determine the presence and type of OSM POIs that cannot be identified using satellite imagery (e.g., POIs located under trees or the type of buildings). 3. Assessment by Historical Analysis Research into geometric intrinsic evaluation indicators has focused on different aspects of position and topology (see also Section 4). For example, van Exel et al. [23] examined the provenance of OSM features as an indicator of their quality. Feature length and point density have been used by Ciepłuch et al. [24] to analyze OSM data quality. Keßler and de Groot [25] examined feature-level attributes that are specific to VGI (and not applicable to authoritative data) such as the number of versions, the stability against changes from other contributors, and the number of corrections or rollbacks of OSM features. Barron et al. [14] presented a framework that provides multiple intrinsic indicators, based solely on the data history of OSM, which helps in the quality assessment. Here, we examine the type, name and geometric lineage of OSM point-encoded features (i.e., Places and POIs) for the study area. Thus, the possible changes in an OSM point that are of interest for this study, are those that could affect: (i) its type, for example, a feature initially characterized as a town can later be changed into a city or village; (ii) its name, where a feature’s name can undergo either a major or a minor grammatical change (e.g., the place name ‘St. Louis’ could change to ‘Saint Louis’ or to something totally different); and (iii) its location, where a feature can be moved by the contributors by just a few centimeters up to hundreds of meters away. Thus, the possible states of an OSM point are either to be stable through its entire life-cycle or to have a change in one or more of the three abovementioned factors. For the present POI assessment by history analysis, we selected only named places and POIs (4273 and 53,052 features, respectively).

ISPRS ISPRS Int. J. Int. Geo-Inf. 2017,2017, 6, 806, 80 J. Geo-Inf.

5 of 295 of 29

the three abovementioned factors. For the present POI assessment by history analysis, we selected

3.1. Methodology and Results only named places and POIs (4273 and 53,052 features, respectively).

To evaluate the quality of OSM points (that have a name) over time, we used the history (i.e., OSM 3.1. Methodology and Results versions) of each feature so as to examine the types and level of change. The ability to replicate the evaluate of OSM points (thatenables have a both name)qualitative over time, and we used the history (i.e., state of anTo element atthe anyquality particular point in time quantitative assessment OSM versions) of each feature so as to examine the types and level of change. The ability to replicate of the type and magnitude of the changes. the state of an element at any particular point in time enables both qualitative and quantitative First, we counted the features affected by different types or combinations of changes for OSM assessment of the type and magnitude of the changes. Places and OSM POIs. Then, we examined the magnitude of change for selected cases. For the OSM First, we counted the features affected by different types or combinations of changes for OSM Places we calculated the Then, following: (i) the the percentage ofofdisplaced per feature Places and OSM POIs. we examined magnitude change forfeatures selected cases. For the type OSM (e.g., what Places percentage of towns have a positional change over the years); and (ii) the overall displacement we calculated the following: (i) the percentage of displaced features per feature type (e.g., (in meters) per feature OSM POIs we calculated: (i)years); the percentage displaced features per what percentage of type. townsFor have a positional change over the and (ii) theof overall displacement (in type; meters) type. For we calculated: (i) the changes percentage of feature displaced features feature (ii)per thefeature percentage ofOSM namePOIs changes and location per type; andper (iii) the feature type; (ii) the percentage of name changes and location changes per feature type; and (iii) the displacement (in meters) of features per feature type. displacement (in meters) of features per type. The magnitude of displacement of afeature feature is defined by the distance between the centroid The magnitude of displacement of a feature is defined by the between the centroid derived from all known positions of the place (for its entire life) anddistance its last position. Figure 1 shows derived from all known positions of the place (for its entire life) and its last position. Figure 1 shows examples of the positional change of POIs along with the entity names and the creation date of each examples of the positional change of POIs along with the entity names and the creation date of each pointpoint version. version.

Figure 1. Examples of displacement in OSM POIs (©OpenStreetMap contributors).

Figure 1. Examples of displacement in OSM POIs (©OpenStreetMap contributors).

By following this methodology, we tried to determine how named OSM POIs behave during their life time. this Whilemethodology, this is not an exhaustive analysis, it nevertheless lays OSM the groundwork for an By following we tried to determine how named POIs behave during of how types behave andit thus what assumptions be made for their understanding life time. While this different is not anfeature exhaustive analysis, nevertheless lays the can groundwork regarding their of quality time. feature types behave and thus what assumptions can be made an understanding howover different

regarding their quality over time. 3.2. Assessment of OSM Places

3.2. Assessment OSM Places are a valuable element in many different products including gazetteers, in Places of and toponyms cartography and for mobile applications, location-based services, and search engines. Thus, any kind

Places and toponyms are a valuable element in many different products including gazetteers, in of change in the data will affect the consistency of products or services that use VGI. cartography for mobile applications, location-based services, of and searchchanges engines.showed Thus, any Theand analysis regarding the different types or combinations possible that kind of change in the data will affect the consistency of products or services that use VGI. two-thirds of the features (66.8%) remained unchanged. The most common change is recorded in the The analysis regarding theofdifferent types or combinations of possible showed geographic position (12.3%) the features, followed by a combined change inchanges the position and that name (7.8%). the analysis of the percentageThe of displaced featureschange per typeisshowed thatin the two-thirds of the Interestingly, features (66.8%) remained unchanged. most common recorded different types of(12.3%) OSM places different behaviors. A case in point is the OSM feature and townname geographic position of thehave features, followed by a combined change in the position where 80% (i.e., 198 out of 248) have been geographically moved over time while only 8% of the (7.8%). Interestingly, the analysis of the percentage of displaced features per type showed that different feature locality has undergone a change in their position. The analysis of the magnitude of the types of OSM places have different behaviors. A case in point is the OSM feature town where 80% positional displacement per feature type showed that features with large spatial extent face large (i.e., 198 out of 248) have been geographically moved over time while only 8% of the feature locality has changes in their location in contrast with smaller entities. For example, 21% of the features having undergone a change in their position. The analysis of the magnitude of the positional displacement the type suburbs moved less than 100 m, whereas for only 2% of the suburbs a positional change of per feature type showed that features with large spatial extent face large changes in their location in contrast with smaller entities. For example, 21% of the features having the type suburbs moved less than 100 m, whereas for only 2% of the suburbs a positional change of more than 1000 m was recorded. In contrast, 14% of all features in the towns category have moved more than 1000 m.


6 of 29

more than 1000 m was recorded. In contrast, 14% of all features in the towns category have moved ISPRS Int. J. Geo-Inf. 2017, 6, 80 6 of 29 more than 1000 m. 3.3. Assessment Assessment of of OSM OSM POIs POIs 3.3. Only lately found theirtheir way into applications. Traditional map-making processes Only latelyhave havePOIs POIs found wayspatial into spatial applications. Traditional map-making excluded this type of information, not because of importance but rather because maps, and especially processes excluded this type of information, not because of importance but rather because maps, and paper maps, weremaps, built aswere long-lasting products aiming to support multiple applications due especially paper built asgeneric long-lasting generic products aiming to support multiple to the high cost and extensive knowledge needed to construct them [26]. However, the advent of VGI applications due to the high cost and extensive knowledge needed to construct them [26]. However, hasadvent enabledofcitizens easily capture put local information on a information map. Much on of this information the VGI hastoenabled citizensand to easily capture and put local a map. Much of comes in the form of POIs. Usually, apart from the location of the POI, contributors can add various this information comes in the form of POIs. Usually, apart from the location of the POI, contributors attributes such as its name (if applicable) or its type (e.g., bank, hospital, coffee shop, etc.). From the can add various attributes such as its name (if applicable) or its type (e.g., bank, hospital, coffee shop, POIs extracted from OSM for the study area, (i.e., 200,000), only 27.6% of them (i.e., 53,052) had a name etc.). From the POIs extracted from OSM for the study area, (i.e., 200,000), only 27.6% of them attribute. This is an expected result This sinceis POIs are part ofresult a heterogeneous spatial can include (i.e., 53,052) had a name attribute. an expected since POIs are partlayer of athat heterogeneous world-known landmarks and monuments as well as local street junctions and traffic lights spatial layer that can include world-known landmarks and monuments as well as local where street a name attribute is not applicable. junctions and traffic lights where a name attribute is not applicable. As described described in in the the methodology, methodology, the the changes changes in in location, type, name name and and any any possible possible As location, type, combinations have also been examined for POIs. The analysis shows that about 60% of features combinations have also been examined for POIs. The analysis shows that about 60% of features have have not been changed since their creation, while the most common change for POIs is a change in not been changed since their creation, while the most common change for POIs is a change in the the name (13.2%), followed by a change in the location (12.7%) or in both location and name (10.2%). name (13.2%), followed by a change in the location (12.7%) or in both location and name (10.2%). The The of rest ofchanges the changes correspond to than less than rest the correspond to less 2.0%2.0% each.each. Given that that the the two two most most frequent frequent feature feature changes changes are are the the changes changes in in name name and and in in location, location, and and Given examining this from the point of view of individual categories (i.e., OSM types), it is interesting to examining this from the point of view of individual categories (i.e., OSM types), it is interesting to see which which POI POI categories categories are are changing changing more more and and how how much much they they are are changing. changing. Figure Figure 22 shows shows the the see percentage of OSM POIs that have undergone a name and a location change grouped by POI category percentage of OSM POIs that have undergone a name and a location change grouped by POI for the 20for most This visualization of the volatility of each category be linked category thepopular 20 mostcategories. popular categories. This visualization of the volatility of each can category can to the fitness forfitness purpose various beyond the strict, tangible measurements of quality be linked to the for of purpose ofcategories various categories beyond the strict, tangible measurements of elements (as defined in [27]). For example, the station category is an extremely volatile category as quality elements (as defined in [27]). For example, the station category is an extremely volatile 59% of the have changed position andposition 76% have changed name. The analysis described in category asfeatures 59% of the features have changed and 76% have changed name. The analysis Section 5 will show that this category is a complex one, and can help to explain the high volatility in described in Section 5 will show that this category is a complex one, and can help to explain the high the feature of this category. The number of changes also reflects the popularity of the features since volatility in the feature of this category. The number of changes also reflects the popularity of the stations since are used by most usersby from this area. Any application products aiming to incorporate features stations are used most users from this area. Anyor application or products aiming to these features should take into account these possible changes. In contrast, POIs related to information incorporate these features should take into account these possible changes. In contrast, POIs related prove to be much more change. to information prove to resilient be muchto more resilient to change.

Figure 2. Percentage of POIs that had a name (x-axis) and a location (y-axis) change. The size of the Figure 2. Percentage of POIs that had a name (x-axis) and a location (y-axis) change. The size of the bubble corresponds to the intersection of the two changes. bubble corresponds to the intersection of the two changes.


7 of 29

Tables 1 and 2 show statistics of the positional movement of the ten OSM categories ranked by the highest movement. The aim is once again to understand the fitness for purpose of different categories for different applications. For example, for the station OSM features that have experienced a change in their location, the average movement is almost 40 m whereas for bakery it is a mere 4.7 m. However, in all cases the standard deviation is bigger than the (arbitrary) threshold of the 10 m accuracy that hand-held GPS devices can achieve. In parallel, the maximum positional movement recorded shows that, in some cases (e.g., supermarket or stations), the contributors might disagree on where the best place is to position the point that will represent the physical features since the area of the feature might be quite large (see the next sections for more discussion of this point). For other cases such as café, bakery, hotel, and bank, which are characterized by small and precise geographic extents, the maximum distance shows that there have been some gross errors made by contributors that have been corrected in a later version (or vice versa). However, the notion of gross error is somehow elusive and certainly not uniform in these datasets. For example, a maximum displacement of 339 m for a point that represents a hotel might be a valid movement since both points could be inside the spatial footprint of the hotel. Even in the case where one of the two positions is a wrong measurement, again a visual inspection would be needed to determine whether the magnitude of the displacement falls under the “gross error displacement” or whether it is an acceptable error. Thus, it was decided to include all the positions in the calculation of the descriptive statistics presented in Tables 1 and 2, which will also allow the reader to understand the magnitude of the volatility of such data. Table 1. The statistics of the positional movement of the top 10 OSM feature types according to the number of features. #

POI Category

Count

Mean Distance (m)

SDev. (m)

Max Distance (m)

1 2 3 4 5 6 7 8 9 10

bus_stop restaurant station bicycle_rental bank cafe supermarket pharmacy hotel bakery

3418 1248 535 499 464 452 393 364 274 265

9.2 8.6 39.9 13.8 6.4 6.0 11.7 5.7 14.7 4.7

23.6 28.7 138.5 35.8 19.8 24.8 45.5 14.1 38.0 10.9

567.0 467.4 2794.3 340.3 349.7 412.2 842.7 198.2 338.9 113.1

Table 2. The statistics of the positional movement of the top 10 OSM feature types according to the mean distance (in m) of positional movement. #

POI Category

Count

Mean Distance (m)

SDev. (m)

Max Distance (m)

1 2 3 4 5 6 7 8 9 10

station motorway_junction fuel drinking_water hotel bicycle_rental kindergarten school supermarket post_office

535 223 186 63 274 499 45 239 393 231

39.9 34.3 33.7 21.0 14.7 13.8 13.3 12.8 11.7 11.1

138.5 42.7 75.5 66.0 38.0 35.8 21.3 17.2 45.5 23.2

2794.3 321.5 659.7 372.6 338.9 340.3 108.3 122.7 842.7 250.4

4. Assessment by Analysis of Spatial Relations 4.1. Methodology Another way to carry out intrinsic quality assessments of VGI is to rely on the geographic properties of features [28]. If we find expected spatial relationships within the VGI features, e.g., a motorway junction is at the intersection of a motorway and another road, the quality tends to be


4. Assessment by Analysis of Spatial Relations

8 of 29

4. Assessment by Analysis of Spatial Relations 4.1. Methodology 4.1. Methodology

way to6,carry ISPRSAnother Int. J. Geo-Inf. 2017, 80

out intrinsic quality assessments of VGI is to rely on the geographic 8 of 29 properties of features [28]. If we find expected spatial relationships within the VGI features, e.g., a Another way to carry out intrinsic quality assessments of VGI is to rely on the geographic motorway junction is [28]. at theIf intersection of a motorway and anotherwithin road, the the VGI quality tends e.g., to bea properties of features we find expected spatial relationships features, good. On the the contrary, contrary, if we find impossible or improbable improbable spatial spatial relations relations within within the the features, features, e.g., e.g., good. On if we find impossible or motorway junction is at the intersection of a motorway and another road, the quality tends to be aa shop POI is outside a building, then the quality tends to be poor. This use of spatial relations to shopOn POI outside aifbuilding, then the quality tends to spatial be poor. This use of spatial relations to good. theiscontrary, we find impossible or improbable relations within the features, e.g., guide the quality assessment has already been implemented in [29]. However, the main problem guide quality assessment has already implemented inpoor. [29]. However, main relations problem to is a shopthe POI is outside a building, then thebeen quality tends to be This use ofthe spatial is that POIs arepoints pointsthat thatrepresent representa ageographic geographicentity entitythat thathas has aa geographic geographic extent extent often often much that POIs are much guide the quality assessment has already been implemented in [29]. However, the main problem is larger than the point, for example, do that represents aa school: in the school larger thanare the points point, so, so, example,awhere where do we we put put aa point point school: the school that POIs thatforrepresent geographic entity that that has arepresents geographic extentinoften much boundaries, at the entrance of the school, or in the middle of the school? There is no unique solution boundaries, at the entrance of the school, or in the middle of the school? There is no unique solution larger than the point, so, for example, where do we put a point that represents a school: in the school to this (Figure 3), we propose different methods for different types of in next to this problem problem (Figure 3), and and weschool, propose methods forschool? different types of POIs POIs in the the next boundaries, at the entrance of the ordifferent in the middle of the There is no unique solution sub-sections. This will improve overall quality as it will improve the consistency of the contributions. sub-sections. This will improve quality as it willmethods improvefor thedifferent consistency of of thePOIs contributions. to this problem (Figure 3), and overall we propose different types in the next sub-sections. This will improve overall quality as it will improve the consistency of the contributions.

Figure 3. Three maps where the POIs are located differently inside the buildings that host them. Figure 3. Three maps where the POIs are located differently inside the buildings that host them. Figure 3. Three maps where the POIs are located differently inside the buildings that host them.

4.2. Amenity POIs 4.2. Amenity POIs 4.2. Amenity POIs There are many types of OSM amenities but not all are facilities located inside a building. We There are many types of OSM amenities but not all are facilities located inside a building. We did did not consider amenities that canamenities be located a building, such as inside food trucks. Thus, We we There are many types butoutside not all are facilities located a building. not consider amenities that of canOSM be located outside a building, such as food trucks. Thus, we consider consider that theiramenities quality is that goodcan if they are inside a building, and if they are located on the partwe of did their not consider be alocated outside building, suchon asthe food trucks. that quality is good if they are inside building, and if athey are located part of theThus, building the building occupied by the amenity. There are two ways to capture such amenities: put the POI at consider that their quality There is good if two they ways are inside a building, if they put are located of occupied by the amenity. are to capture such and amenities: the POIon at the the part center the center of occupied the amenity, or at the entrance of the amenity. Thus, we such propose to checkput three spatial the building by the amenity. There are two ways to capture amenities: the POI at of the amenity, or at the entrance of the amenity. Thus, we propose to check three spatial relations relations (Figure 4): the center (Figure 4): of the amenity, or at the entrance of the amenity. Thus, we propose to check three spatial relations (Figure 4): • inside a building; •• inside a building; distance to the building center; • inside a building; ••• distance to center; nearest distance to the the building buildingbuilding center; boundary, as a proxy for amenity entrance. •• distance to the nearest building distance to the nearest building boundary, boundary,as asaaproxy proxyfor foramenity amenityentrance. entrance.

Figure 4. Three spatial relations are checked: inside a building, the distance to the building center, and the4.distance to therelations nearest boundary as inside a proxy for an amenity entrance (©OpenStreetMap Figure Three spatial are checked: a building, the distance to the building center, Figure 4. Three spatial relations are checked: inside a building, the distance to the building center, and contributors). and the distance to the nearest boundary as a proxy for an amenity entrance (©OpenStreetMap the distance to the nearest boundary as a proxy for an amenity entrance (©OpenStreetMap contributors). contributors).

Figure 5a shows that the building center is not always a good approximation of the amenity center, as several amenities can share a building. Thus, we also computed the Voronoï regions around the amenity points (Figure 5b) and intersected the buildings with the Voronoï regions (Figure 5c). Then,


9 of 29


9 of 29

Figure 5a shows that the building center is not always a good approximation of the amenity center, as several amenities can share a building. Thus, we also computed the Voronoï regions wearound also computed the distance to the center andintersected to the boundary for these newthe regions. Weregions selected the amenity points (Figure 5b) and the buildings with Voronoï four types of amenities (around 6500 POIs in the Paris dataset): (Figure 5c). Then, we also computed the distance to the center and to the boundary for these new regions. We selected four types of amenities (around 6500 POIs in the Paris dataset): • very small amenities: gift shops, i.e., the features tagged with “amenity = gifts”; • very small amenities: gift shops, i.e., the features tagged with “amenity = gifts”; • medium amenities: bars (“amenity = bar”), cafes (“amenity = café”), and restaurants (“amenity • medium amenities: bars (“amenity = bar”), cafes (“amenity = café”), and restaurants (“amenity = = restaurant”); restaurant”); • • large amenities: cinemas large amenities: cinemas(“amenity (“amenity==cinema”); cinema”); • • very stable amenities: hairdressers thatdo donot notmove moveoften, often, according very stable amenities: hairdressers(“amenity (“amenity == hairdressers”) hairdressers”) that according to [17]. to [17].

Figure 5. (a) One buildingmay maycover coverseveral severalamenities amenities so amenity Figure 5. (a) One building so the the centroid centroidmay maybe befar farfrom fromthe the amenity center, (b) using the Voronoï region intersected with the building gives (c) a better approximation of center, (b) using the Voronoï region intersected with the building gives (c) a better approximation of the amenity center (©OpenStreetMap contributors). the amenity center (©OpenStreetMap contributors).

The results, summarized in Table 3, show that precision is somehow related to the size of the The results, Table 3, show thattoprecision is somehow related to the Itsize amenity, whichsummarized is logical as in there is more room locate a POI inside large amenities. canofbethe amenity, is logical as there is more room to locate POI insideranges large from amenities. It 5%, can and be noticed noticedwhich that the percentage of amenities that are outsidea buildings 1.5% to thus that the percentage amenities that accuracy are outside ranges from to 5%, and thus The these these POIs have aoflow positional (giftbuildings shops are too few to 1.5% be significant here). POIs have aare lowquite positional (gift shops are amenities, too few toi.e., be significant here). The numbers numbers similar accuracy in all the small/medium they are quite far from the center,are even with Voronoï with a large variability, appears be athe pattern POIs quite similar in all theregions, small/medium amenities, i.e., and theythere are quite far to from center,with even with located close towith the boundary/entrance, inside the buildings (aroundwith two POIs meters). This can Voronoï regions, a large variability, but andjust there appears to be a pattern located close be explained by the waybut thatjust theinside amenity is visualized in the OSM This mapcan as the symbol to the boundary/entrance, the symbol buildings (around two meters). be explained appears just at the entrance of the building with this 2-m shift. by the way that the amenity symbol is visualized in the OSM map as the symbol appears just at the entrance of the building with this 2-m shift. Table 3. Summary of the amenities’ locations consistency in buildings that contain at least three POIs.

% ofamenities’ POIs Mean Distance SDev. in DC with Mean SDev. of DB with Table 3. Summary of the locations consistency buildings that Distance containtoat least three POIs. Outside to Centroid of DC Voronoï Boundary (DB) DB Buildings (in m) (DC) (m) (m) (m) (m) (m) Mean Distance to SDev. % of POIs Mean Distance SDev. DC with Gift shops 0 12.9 11.5 7.72 2.3 3 Outside to Centroid of DC Voronoï Boundary (DB) of DB Bars, cafes, Restaurants 3 10.8 10.5 7.21 2.3 2.1 Buildings (m) (DC) (m) (m) (m) (m) (m) Cinemas 5.5 16.9 18.4 9.39 4.9 4.9 Gift shops 12.9 11.5 7.72 2.3 3 Hairdressers 1.40 11.3 9.5 7.07 2.1 1.2 Bars, cafes, Restaurants 3 10.8 10.5 7.21 2.3 2.1 Cinemas 5.5 16.9 18.4 9.39 4.9 4.9 Hairdressers 11.3the same 9.5 7.07were consistent 2.1 1.2 It would be logical if 1.4 the POIs inside buildings with each

Voronoï (m) DB with 3.15 Voronoï 2.27 (m) 4.57 3.15 2.19 2.27 4.57 2.19 i.e., other,

located similarly at the center of the amenity, or at the entrance (Figure 6), because either the contributor sameiforthe thePOIs first contributors directbuildings the following Thus,each we checked It would is bethe logical inside the same werecontributors. consistent with other, i.e., this assumption for more than 4000 buildings in Paris that contain more than two POIs. located similarly at the center of the amenity, or at the entrance (Figure 6), because either the contributor

is the same or the first contributors direct the following contributors. Thus, we checked this assumption for more than 4000 buildings in Paris that contain more than two POIs.


10 of 29


10 of 29

Figure 6. Even if there is no unique location for an amenity POI inside a building, consistency in POI Figure 6. Even if there is no unique location for an amenity POI inside a building, consistency in POI location (here points should all be aligned) is better than inconsistency (©OpenStreetMap contributors). location (here points should all be aligned) is better than inconsistency (©OpenStreetMap contributors).

The results, summarized in Table 4, show good homogeneity inside a building. There is some The results, summarized in Table 4, show good homogeneity inside a building. There is some standardization: POIs are located closer to boundaries than to the center but not on the boundary. standardization: POIs are located closer to boundaries than to the center but not on the boundary. However, even with an accurate standardized location, the precision is quite low, as shown in Figure 4. However, even with an accurate standardized location, the precision is quite low, as shown in Figure 4. Table 4. Summary of the locations’ consistency of amenities in buildings that contain at least three POIs. Table 4. Summary of the locations’ consistency of amenities in buildings that contain at least three POIs. Mean of SDev Median of SDev Minimum of SDev Maximum of SDev Mean2.99 of SDev Median of SDev Minimum 0of SDev Maximum76.5 of SDev Distance to center 1.56 Distance toto boundary 2.46 2.15 38.5 Distance center 2.99 1.56 00 76.5 Distance to boundary

2.46

2.15

0

38.5

4.3. ATM POIs 4.3. ATM POIs In the city of Paris, most ATMs are located on the walls of banks and only a minority are located In athe citybuilding. of Paris, most are located on the of banks onlybea just minority located inside bank Thus,ATMs a logical positioning of walls the ATM POIs and would on theare boundary inside bank building. Thus, logical of the ATM POIs would be just onto the boundary of the abuilding polygons. We aused the positioning same methodology as that used for amenities compute the of the building polygons. We used the methodology as that forboundary amenities (Table to compute the building center, and we measured thesame distance to the center andused to the 5). There building we measured the6.6% distance to theare center and any to the boundary (Table 5). There were were 347center, ATMsand in the dataset and of points outside building: these points clearly are 347 ATMs in the dataset 6.6% of points outside any building: these clearly poor poor quality because theand building layer hasare good completeness, which waspoints confirmed byare a visual quality because the building layer has good completeness, which wason confirmed by a visual inspection inspection of those points. Sixty-five percent of the ATMs lie just the boundary of buildings, so of those points. Sixty-five percent of the ATMs lie just on the boundary of buildings, so there is a vast there is a vast majority of points with a good quality. There are an additional 6% that are both far majority pointsand with a good quality. There are an 6%randomly that are both from the center from theofcenter from the boundary, so they areadditional considered andfar badly located. We and also from theout boundary, so they areconsistency considered randomly andthat badly located. We also carried the same carried the same test for in buildings contain several ATMs; the out results show test consistency in buildings that contain the results show one that is ATMs are mostly thatfor ATMs are mostly consistently locatedseveral inside ATMs; a building but when located on the consistently located insidethere, a building but when one is located on the boundary, all are located there, boundary, all are located with very few exceptions. with very few exceptions. Table 5. Summary of the ATM locations in relation to building center and boundary. Table 5. Summary of the ATM locations in relation to building center and boundary.

Mean (m) Distance to center Mean 17.6 (m) Distance to boundary 1.1 Distance to center 17.6

Distance to boundary

1.1

Median (m) 13 (m) Median 0 13

SDev (m) 12.9(m) SDev 4.6 12.9

0

4.6

4.4. School POIs 4.4. School POIs Schools are complex functional entities composed of buildings, schoolyards, sports fields, libraries, and even chapels [30]. Ideally, need to know the complete extentsports of thefields, school and its Schools are complex functional entitieswe composed of buildings, schoolyards, libraries, components to assess the quality of the schoolthe POI position: in Figure the POI in the and even chapels [30]. Ideally, we need to know complete extent of the7,school andisitslocated components schoolyard, so not inside any building, but it is close to the school extent centroid. Unfortunately, to assess the quality of the school POI position: in Figure 7, the POI is located in the schoolyard, so not extentany polygons arebut notitavailable school POIscentroid. in this dataset: only 23% of the 438 schools are inside building, is close tofor theallschool extent Unfortunately, extent polygons are not capturedfor with a POI a polygon. The difficulty assessing POI quality due atoPOI the available all both school POIsand in this dataset: only 23% of theof438 schoolsschool are captured with both complex structures of schools is highlighted in [31], the authors of which tried to match several and a polygon. The difficulty of assessing school POI quality due to the complex structures of schools POI datasets differentofVGI or reference datasets. isschool highlighted in [31],from the authors which tried to match several school POI datasets from different VGI or reference datasets.


11 of 29


11 of 29


11 of 29

Figure 7. A school area, buildings, and a school POI: schools can be composed of several buildings

Figure 7. school A school area, andbea located school outside POI: schools can be composed of several buildings and and yards, so buildings, the POI might a building (©OpenStreetMap contributors). school yards, so the POI might be located outside a building (©OpenStreetMap contributors). When a polygon extent was available for the POI, we measured the distance to the center of this Figure 7. to A its school area,was buildings, and awas school POI: can the be composed of several polygon and boundary; when none available, we measured used building that contained the POI, When a polygon extent available for the POI,schools we the distance to buildings the center of this and school yards, so the POI mightThe be located outside a building (©OpenStreetMap contributors). as we did for amenities and ATMs. results, summarized in Table 6, show that, when the polygon polygon and to its boundary; when none was available, we used the building that contained the POI, is available, the POI is closer to the school center and so more likely outside buildings; when the as we did When for amenities and ATMs. The results, summarized in Tablethe 6, show that, when the polygon is polygon extent wascontributors available fortend the POI, we the measured distance polygon is anot available, OSM to map school POIs closer to to the the center walls, of atthis the available, the POItoisits closer to thewhen school center so more outside buildings; when thePOI, polygon polygon boundary; none wasofand available, we likely used the building thatbe contained entrance and of the school. However, with 30% POIs outside a building, it would better tothe use the is notas available, OSM contributors tend to map the school POIs closer to the walls, at the entrance we did for amenities and ATMs. The results, summarized in Table 6, show that, when the polygon real school extent, which can be computed geometrically by aggregating the nearby buildings and of is available, the POI is closer tolikelihood the school and so likely(e.g., outside buildings; when the school. However, with 30% of POIs outside a building, it would be better to use the realthe school geographic entities, using their ofcenter belonging to more the school a building that contains polygon is not available, OSM contributors tend to map the school POIs closer to the walls, at the extent, which be computed geometrically aggregating the nearby buildings andingeographic shop POIscan is unlikely to be part of the school).byThe same approach was successfully used [32] to entrance oftheir the school. with extents 30% of outside (e.g., a building, ita would be contains better to use the adjust the geometry ofHowever, suchofschool for When polygon is available, thePOIs entities, using likelihood belonging toPOIs the cartography. school a building that shop real school which can be computed geometrically by size aggregating theused nearby and the contributors tend locate the POI closer to the center, but the of the school implies low is unlikely to beextent, parttoof the school). The same approach was successfully in buildings [32]precision. to adjust geographic entities, using their likelihood of belonging to the school (e.g., a building that contains geometry of such school extents for cartography. When a polygon is available, the contributors tend to shop POIs is unlikely to be partschool of thelocations school).inThe same approach was center successfully used in [32] to Table 6. Summary of the relation to building/extent and boundary. locate the POI closer to the center, but the size of the school implies low precision. adjust the geometry of such school extents for cartography. When a polygon is available, the % Outside Mean Distance to Mean Distance to contributors tend to locate the POI closer to the center, but the size of the school implies low precision. Table 6. Summary of the schoolBuildings locations in relation to building/extent center and boundary. Center (m) Boundary (m) Schools with a polygon 40 31.6 13.3 Table 6. Summary of the school locations in relation to building/extent center and boundary. % Outside Mean Distance Mean Distance Schools without a polygon 30 19.1 to 4.7 to Center (m) Boundary (m) %Buildings Outside Mean Distance to Mean Distance to Buildings Center (m) Boundary (m) Schools with a polygon 40 31.6 13.3 4.5. Bus Stop POIs Schools without a polygon 30 19.1 4.7 13.3 Schools with a polygon 40 31.6 In Paris, bus stops can be marked either by a bus shelter, as in Figure 8, or by a sign, Schools without a polygon 30 19.1 4.7but they are always located on the pavement, so there is a minimum distance between the bus stop and the 4.5. Bus Stop POIs closest 4.5. Bus building Stop POIsboundary and between the bus stop and the road centerline. There is not enough information the OSM data to checkeither if the by busa stop on the as pavement, canacheck is are In Paris, businstops can be marked bus is shelter, in Figurebut 8, we or by sign, that but it they In Paris, bus stops can bebuilding marked either by a bus shelter, ascenterline. in Figure 8, or also by a want sign, but they are not too close to the nearest boundary or the road We to measure always located on the pavement, so there is a minimum distance between the bus stop and the closest always located on the pavement, so isbecause a minimum between the bus stop and the whether there isand a road close to thebus busthere stop,and therecenterline. isdistance no bus stop without road nearby. building boundary between the stop the road There is nota enough information closest building boundary and between the bus stop and the road centerline. There is not enough in theinformation OSM data in to the check if the stop is on the butpavement, we can check that is notthat tooitclose OSM databus to check if the bus pavement, stop is on the but we canitcheck is to the nearest building boundary or the road centerline. We also want to measure whether there is a road not too close to the nearest building boundary or the road centerline. We also want to measure close whether to the bus stop, because there is no bus stop without a road nearby. there is a road close to the bus stop, because there is no bus stop without a road nearby.

Figure 8. Bus stops in Paris are located on the pavement, so should be located far enough from a road centerline and the buildings (©OpenStreetMap contributors).

Figure 8. Bus stops in Paris are located on the pavement, so should be located far enough from a road

Figure 8. Bus stops in Paris are located on the pavement, so should be located far enough from a road centerline and the buildings (©OpenStreetMap contributors). centerline and the buildings (©OpenStreetMap contributors).


12 of 29

We used 1.5 m as a minimum distance between the bus stop point and the nearest building boundary. The minimum distance to the road centerline depends on the road type but a 2-m threshold is used here. In parallel, as large roads are captured with several lanes in Parisian OSM data, we used 10 m as a threshold for being too far from any road. Table 7 shows the results obtained on the 2022 bus stop points in the dataset. Of these, 13.5% of the bus stops are either too close to a road or a building (since they cannot be both), and are considered to be badly located. Only 0.4% are too far from any road. The results show consistent locations in the remaining bus stop points, which can be explained by the fact that bus stops are entities with a small geographic extent, and are easy to spot from high-resolution satellite imagery. Table 7. Summary of the bus stop locations in relation to buildings and road centerlines.

Distance to roads Distance to buildings

Mean (m)

SDev (m)

Maximum (m)

% Too Close

% Too Far

6.2 7.4

2.9 5.8

18.5 -

6.8 6.7

0.4 -

5. Assessment based on Data Matching with Reference Data A number of research papers dealing with VGI data quality assessment have compared crowdsourced data with reference data to infer their quality [2,3,33]. The comparison of two datasets supposes that homologous features (i.e., features representing the same object in the real world) are first identified. This process is referred to as data matching. In this paper, we propose using a data matching algorithm defined by [34] to automatically detect homologous features between OSM and IGN POIs. The method, which uses belief theory, combines different criteria based on geometry, thematic and semantic properties. 5.1. Methodology To compare IGN and OSM POIs, we first identify the types of features that are represented in both datasets. As shown in Figure 2, stations and subway entrances are types of objects that have relevant changes regarding names or positions and they are represented in both datasets. Thus, we focus the data matching on stations representing subway stations and subway entrances. Figure 9 illustrates the data model for subway station representation and defines the links between the real world and the two representations from IGN and OSM. Thus, a subway station in the real world (represented in the blue box) is characterized by a name and a certain number of subway entrances and exits. Regarding the representation of subway stations in the two datasets, two main differences can be noticed. In the IGN dataset (green box), a subway station from the real world is represented by n features. Thus, with respect to the IGN data specification, two types of mapping for each subway station are possible: (i) for subway stations without a connection between several lines, only one point feature is represented in the dataset; (ii) for subway stations that connect to at least two subway lines, at least two feature points are represented in the dataset. Even if the concept of entrance/exit does not exist in the IGN dataset, the features representing the subway station are frequently located close to an entrance. On the other hand, in OSM (purple box), a subway station in the real world is represented by m features belonging to the OSM type station. The subway station is composed of m’ entrances/exists represented by a point feature, each entrance/exit having a name. In Figure 9, dotted lines illustrate matching links (i.e., features representing the same object in the real world and numbers illustrate the cardinality of the matching links (e.g., n IGN subway stations should be matched with m OSM subway stations). Given the complexity of the representation of the two datasets, many types of data matching links are possible but we are interested in two types of data matching links in this study. First, n:m links are defined for the concept of a subway station (i.e., n IGN subway points can be matched with m OSM subway points representing the same subway station). Second, m:m’ links



13 of 29

13 of 29

are defined for a subway station and its entrances/exits (m OSM subway stations are matched with m’ OSM Finally,1:1 1:1links linksare aredefined definedforforentrances/exits entrances/exits (1 IGN subway station is OSMentrances/exits). entrances/exits). Finally, (1 IGN subway station is matched to 1 to OSM entrance/exist). matched 1 OSM entrance/exist). ISPRS Int. J. Geo-Inf. 2017, 6, 80

13 of 29

OSM entrances/exits). Finally, 1:1 links are defined for entrances/exits (1 IGN subway station is matched to 1 OSM entrance/exist).

Figure 9. Representation of subway stations in OSM and IGN datasets.

Figure 9. Representation of subway stations in OSM and IGN datasets.

Figure 10 shows an example of two different representations in the IGN and OSM datasets. For

Figure 9. Representation of subway stations in OSM and IGN datasets. Figure shows an example of two10a) different inand the two IGNpoints and in OSM datasets. example,10Duroc subway station (Figure has onerepresentations point in OSM data the IGN For example, Duroc subway station (Figure 10a)in has one(‘Duroc—sortie pointininthe OSM and two points dataset Figure that are to subway entrances OSM 3 data placeOSM Léon-Paul Fargue 10 closer shows antwo example of two different representations IGN and datasets. For in the andexample, ‘Duroc—sortie 1 boulevard des Invalides’). contrary, Saint-Sulpice subway IGN dataset thatDuroc are closer to station two subway entrances inthe OSM 3 place Léon-Paul subway (Figure 10a) hasOn one point in(‘Duroc—sortie OSMfor data and two points in thestation, IGN Fargue onedataset point isthat mapped in both IGN and OSM data but there is no link between the IGN subway station and ‘Duroc—sortie boulevard des Invalides’). On the (‘Duroc—sortie contrary, for Saint-Sulpice subway station, are1 closer to two subway entrances in OSM 3 place Léon-Paul Fargue and the OSM subway entrance. In Figure 10b two OSM entrances can be seen near the OSM subway and ‘Duroc—sortie 1 boulevard des Invalides’). On the contrary, for Saint-Sulpice subway station, one point is mapped in both IGN and OSM data but there is no link between the IGN subway station station;point one of them caninbe considered a duplicate entrance andlink should be removed. is mapped both IGN and OSM thereentrances is no between IGN subway stationsubway and the one OSM subway entrance. In Figure 10bdata twobut OSM can be the seen near the OSM and the OSM subway entrance. In Figure 10b two OSM entrances can be seen near the OSM subway station; one of them can be considered a duplicate entrance and should be removed.

station; one of them can be considered a duplicate entrance and should be removed.

Figure 10. Different representations of a subway station: (a) the two station POI in IGN are closer to OSM entrance POIs than to OSM station POI; (b) only(a) one POI POI in both datasets buttonot Figure 10. Different representations of a subway station: thestation two station in IGN are closer Figure 10. Different representations of a subway station: (a) the two station POI in IGN are closer to located at the same place (©OpenStreetMap contributors, ©BDTOPO). OSM entrance POIs than to OSM station POI; (b) only one station POI in both datasets but not

OSM entrance POIs thanplace to OSM station POI; (b) only one©BDTOPO). station POI in both datasets but not located located at the same (©OpenStreetMap contributors, data matching algorithm works as follows: for each reference feature from a dataset, the at theThe same place (©OpenStreetMap contributors, ©BDTOPO). Thelooks data for matching algorithm as follows:dataset) for eachto reference feature from aThen dataset, the algorithm candidates (fromworks the comparison match using a buffer. different algorithm looksare for compared candidates between (from the the comparison match a buffer. Then different similarity factors referencedataset) featuretoand all using selected candidates. Knowing Thesimilarity data matching algorithm works as follows: for each reference feature fromKnowing a dataset, the factors of areour compared between the reference feature and all selected the characteristics datasets (see Sections 2.1 and 2.2), three criteriacandidates. were used: position, algorithm for candidates (from the tothree match using a buffer. Then different the looks characteristics of criteria. our datasets (seecomparison Sections 2.1 dataset) and 2.2),on criteria were used:the position, toponym, and semantic The position criterion is based the distance between reference similarity factors are compared between the reference feature and all selected candidates. toponym, and semantic criteria. The position criterion is based on the distance between the reference feature and a candidate. The toponym criterion compares the name of the reference feature Knowing with the the feature and a candidate. The toponym criterion compares the name of the reference feature with the characteristics of our datasets (see Sections 2.1 and 2.2), three criteria were used: position, toponym, name of the candidate. Among the different measures tested to compare names, the normalized name of the candidate. Among the different measures tested to compare names, the normalized and semantic criteria. The criterion is based on distance the reference feature Levenshtein distance wasposition used since it was considered thethe most suitablebetween for our dataset. Finally, the Levenshtein distance was used since it was considered the most suitable for our dataset. Finally, the criterion, which compares feature types,the wasname used of to the select features feature in both with datasets and asemantic candidate. The toponym criterion compares reference thethat name of semantic criterion, which compares feature types, was used to select features in both datasets that

the candidate. Among the different measures tested to compare names, the normalized Levenshtein distance was used since it was considered the most suitable for our dataset. Finally, the semantic


14 of 29

criterion, which compares feature types, was used to select features in both datasets that represent the same type the real world, in our case: subway stations, stations, and subway entrances. ISPRS of Int. objects J. Geo-Inf. in 2017, 6, 80 14 of 29 The data matching described above was applied on different datasets: (i) subway stations from represent same type of *objects in the7,real in our case: stations, stations, and OSM subwaywith IGN with OSMthe stations (link in Figure n:mworld, cardinality) andsubway (ii) public stations from entrances. subway entrances from OSM (link ** in Figure 7, m:m’ cardinality). In the first case, the reference The data matching described above applied different datasets: subway stations from feature is IGN and the candidate features arewas from OSM,on while for the second(i)case the reference feature IGN with OSM stations (link * in Figure 7, n:m cardinality) and (ii) public stations from OSM with is OSM subway entrances and the candidate features are OSM stations. subway entrances from OSM (link ** in Figure 7, m:m’ cardinality). In the first case, the reference The thresholds used in the matching algorithm were defined with respect to data characteristics feature is IGN and the candidate features are from OSM, while for the second case the reference obtained by analyzing the distributions the distances for each Thus, the threshold for the feature is OSM subway entrances andof the candidate features are criterion. OSM stations. position criterion (Euclidian distance) and the name similarity criterion (Levenshtein distance), which The thresholds used in the matching algorithm were defined with respect to data characteristics indicates that itbyisanalyzing unlikely the for distributions two featuresoftothe represent object inThus, the real world, are obtained distancesthe for same each criterion. the threshold forequal the to (Euclidian distance) and candidate the name selection similarity was criterion distance), 220 mposition and 0.6,criterion respectively. The buffer for the set to(Levenshtein 350 m. which indicates that it is unlikely for two features to represent the same object in the real world, are

5.2. Data Matching equal to 220 mResults and 0.6, respectively. The buffer for the candidate selection was set to 350 m. The results and their evaluations (manually checked) are provided in Table 8. Thus, 5.2.data Data matching Matching Results the results for the first matching (i.e., IGN subway stations and OSM stations) are quite good. Out of The data matching results and their evaluations (manually checked) are provided in Table 8. 329 IGN subway stations, 328 were correctly matched with their homologous features in OSM stations. Thus, the results for the first matching (i.e., IGN subway stations and OSM stations) are quite good. Out of 329 IGN subway stations, 328 were correctly matched with their homologous features in OSM Table 8. Data matching results and evaluation. stations. Data Matching

Number of

Table 8. Data matching results and evaluation. Features

IGN subway stations/OSM stations Data Matching (348 IGN subway stations/ OSMstations/OSM stations) IGN704 subway stations

Matched IGN subway station Non-matched IGN subway station WithoutMatched decisionIGN IGNsubway subwaystation station

(348 entrances/OSM IGN subway stations/ Non-matched IGN subway station Matched OSM subway entrances OSM subway stations Non-matched OSM subway entrances OSM entrances/ stations) Without decision IGN subway station (794 OSM704 subway decisionOSM OSMsubway subwayentrances entrances OSMentrances/OSM stations) stations WithoutMatched OSM 704 subway (794 OSM subway entrances/ Non-matched OSM subway entrances 704 OSM stations) Without decision OSM subway entrances

329 (1 error) Number of 9 Features (2 errors) 329 (110error) 9 (2701 errors) 19 10 74 701 19 74

Precision

Recall

99.7% Precision 78% 99.7%

98% Recall 78% 98%-

78% 100% - 0% 100%0% -

78%-0% - 0% -

Figure 11 illustrates a data matching result for both IGN and OSM stations (blue link) and OSM stations and entrances (yellow linkmatching betweenresult a subway station andOSM its entrances). The subway station Figure 11 illustrates a data for both IGN and stations (blue link) and OSM stations and entrances (yellow link between a subway station and its entrances). The subway station Châtelet is represented by two features in IGN data and one feature in OSM data. The link between Châteletsubway is represented by having two features in IGN data andisone feature ininOSM data. The link between homologous stations a cardinality of 2:1 illustrated blue. Twelve OSM entrances homologous subway stations having a cardinality of 2:1 is illustrated in blue. Twelve OSM entrances are matched with the OSM subway station ‘Châtelet’. The link, with a cardinality of 1:12, is represented are matched with the OSM subway station ‘Châtelet’. The link, with a cardinality of 1:12, is by yellow lines in Figure 11. We can see that the homologous subway stations are quite far away from represented by yellow lines in Figure 11. We can see that the homologous subway stations are quite one another, i.e., 140 m and 180 m from the IGN subway station. far away from one another, i.e., 140 m and 180 m from the IGN subway station.

Figure 11. Data matching results for Châtelet subway station (©OpenStreetMap contributors, ©BDTOPO).

Figure 11. Data matching results for Châtelet subway station (©OpenStreetMap contributors, ©BDTOPO).

ISPRS Int. J. Geo-Inf. 2017, 6, 80 ISPRS Int. J. Geo-Inf. 2017, 6, 80

15 of 29 15 of 29

Among the non-matched IGN subway stations, seven of them are correct (i.e., there is no Amongsubway the non-matched stations, seven them (i.e., are correct (i.e., there is no homologous station in IGN OSM)subway and two of them are of wrong they should be matched subway station in OSM) and twomatched of them are wrongthe (i.e.,distance they should be matched to a to homologous a station from OSM). The features are not because between homologous station from OSM). The features are not matched because the distance between homologous features features is high and the names are quite different; see Figure 12a to see one of the non-matched IGN is high and the names are quite different; see Figure 12a to see one of the non-matched IGN subway subway stations. stations. For 10 of the IGN subway stations, no decision has been taken. This is due to the fact that for For 10 of the IGN subway stations, no decision has been taken. This is due to the fact that for an an IGN subway station there are many OSM candidates with the same characteristics. For example, IGN subway station there are many OSM candidates with the same characteristics. For example, in in Figure 12b the IGN subway station ‘Concorde’ has three candidates in OSM that have the same Figure 12b the IGN subway station ‘Concorde’ has three candidates in OSM that have the same name and are This is is definitely definitelyaalimitation limitationofofthe thedata data used and name and areclose closetotoIGN IGNsubway subwaystation. station. This used and thethe matching algorithm for each of the n:m cardinality links. matching algorithm for each of the n:m cardinality links.

Figure12.12. Data Data matching matching results: results: (a) Figure (a) wrongly wrongly non-matched; non-matched; (b)(b)correctly correctlynon-matched non-matched (©OpenStreetMap contributors,©BDTOPO). ©BDTOPO). (©OpenStreetMap contributors,

Of the total OSM stations, 408 were not matched. This is generally due to the fact that the OSM Of total OSM stations, 408subway were not matched. Thisother is generally due totransportation the fact that the OSM stationthe category includes not only stations but also types of public in the station only subway stations but also other types of public transportation in the Pariscategory area suchincludes as trains,not RER, etc. Paris area as trains, Thesuch second line ofRER, Tableetc. 8 summarizes the results for data matching between OSM subway entrances and OSM Of 794 OSM subwayfor entrances, 701 correspond to OSM at least one The second line subway of Tablestations. 8 summarizes the results data matching between subway OSM station. All these links have been manually checked and validated (precision = 100%). The entrances and OSM subway stations. Of 794 OSM subway entrances, 701 correspond to at least one recall indexAll is not computed for been this case because there isand no ground truth regarding the number of OSM station. these links have manually checked validated (precision = 100%). The recall entrances and exits for each subway station. Regarding the non-matched results, 19 OSM subway index is not computed for this case because there is no ground truth regarding the number of entrances entrances matched by the data matching algorithm results, but all 19 of OSM themsubway have atentrances least oneare and exits for are eachnot subway station. Regarding the non-matched corresponding OSM station. In this case the data matching algorithm fails because the subway not matched by the data matching algorithm but all of them have at least one corresponding OSM entrances are far from the corresponding station and most of the non-matched OSM subway station. In this case the data matching algorithm fails because the subway entrances are far from entrances do not have a name (attribute omission). the corresponding station and most of the non-matched OSM subway entrances do not have a name Similarly to the data matching between the OSM and IGN subway stations, the 74 subway (attribute omission). entrances are not matched because no decision was taken. This is due to the fact that for an OSM Similarly to the data matching between the OSM and IGN subway stations, the 74 subway subway entrance many OSM stations candidates exist that have the same characteristics. entrances are not matched because no decision was taken. This is due to the fact that for an OSM The two matching results define the homologous features between IGN and OSM subway subway entrance many OSM stations existexits that to have theOSM samesubway characteristics. stations and define links that assigncandidates entrances and each station. Thanks to The two matching results define the homologous features between IGN and OSM stations these two results, it is possible to define the link between IGN subway stations and subway OSM subway and define links that assign entrances and exits to each OSM subway station. Thanks to these two entrances/exits. These three links will be used to assess the quality of subway stations and subway results, it is possible to define the link between IGN subway stations and OSM subway entrances/exits. entrances from OSM. These three links will be used to assess the quality of subway stations and subway entrances from OSM. 5.3. Positional Quality Assessment Based on Data Matching Results 5.3. Positional Quality Assessment Based on Data Matching Results To assess the positional quality, the Euclidian distances between homologous objects in IGN ToOSM assess theare positional quality,Figure the Euclidian distances homologous objects in and and data first computed. 13a shows that thebetween homologous subway stations areIGN quite OSM data are first Figure 13adistance shows that the homologous stationsfeatures are quite far away from onecomputed. another. The median is equal to 35 m (50% subway of homologous arefar away fromless onethan another. median distance is equal to 35stations m (50% the of homologous features are located located 35 m The away) and for many OSM subway distance is higher than 100 m. Second, between eachOSM IGN subway subway stations station and its nearest entrance is less than 35the m distance away) and for many the distance is OSM highersubway than 100 m. Second,


16 of 29

ISPRS Int. between J. Geo-Inf. 2017, 6, 80IGN subway station and its nearest OSM subway entrance is computed. 16 of 29 the distance each Figure 13b clearly shows that IGN subway stations are closer to OSM subway entrances than OSM computed. Figure 13b clearly shows that IGN subway stations are closer to OSM subway entrances subway stations. This fits to IGN specifications (i.e., a subway station location represent an entrance) than OSM subway stations. This fits to IGN specifications (i.e., a subway station location represent and illustrates theand complexity of these types of ofthese features. the IGN subway closer to a an entrance) illustrates the complexity typesWhen of features. When the IGNstation subwayisstation subway entrance, then distances are around 80 m. Finally, when an IGN subway station is closer is closer to a subway entrance, then distances are around 80 m. Finally, when an IGN subway stationto an OSM issubway the station, distances are distributed a 50-m interval. closer tostation, an OSMthen subway then theequally distances are equallyacross distributed across a 50-m interval.

Figure 13. Comparison between homologous features: features : (a) between IGNIGN and and OSMOSM Figure 13. Comparison between homologous (a) Euclidian Euclidiandistance distance between subway stations and (b) Euclidian distance between IGN and OSM subway stations in red and the the subway stations and (b) Euclidian distance between IGN and OSM subway stations in red and distance between IGN subway stations and the nearest corresponding OSM subway entrances in green. distance between IGN subway stations and the nearest corresponding OSM subway entrances in green.

In addition, the data matching and the evaluation of results allowed us to identify the following

In addition, the data matching and the evaluation of results allowed us to identify the omissions: following omissions: •

• • •

one OSM subway station (‘Pont de Neuilly’) is not mapped. It is important to note that this station was added during 2016, after we had extracted our OSM dataset . one OSM subway station (‘Pont de Neuilly’) is not mapped. It is important to note that this station • 13 OSM subway stations do not yet have subway entrances. was during 2016, afterhave we had extracted our OSM dataset . position). • added two OSM subway stations duplicate locations (exactly the same

13 OSM subway stations do not yet have subway entrances. 5.4. Thematic Qualitystations Assessment Based on Data Matching two OSM subway have duplicate locationsResults (exactly the same position).

In contrast to the positional quality, the thematic quality seems to be better as revealed by the

5.4. Thematic Quality Assessment on Data Matching Results comparison between namesBased of homologous subway stations. To compare the names, the

Levenshtein distance was used.quality, The distance is zero when the names areto exactly the same and equals In contrast to the positional the thematic quality seems be better as revealed by the 1 when they are completely different. As can be seen in Figure 14, almost all homologous subway comparison between names of homologous subway stations. To compare the names, the Levenshtein stations have the same name (median of Levenshtein distance is zero). Of the 329 matching links, 257 distance was used. The distance is zero when the names are exactly the same and equals 1 when they have a Levenshtein distance equal to zero. When this is not the case, the following situations can be are completely observed: different. As can be seen in Figure 14, almost all homologous subway stations have the same name (median of Levenshtein distance is zero). Of the 329 matching links, 257 have a Levenshtein • For 17 subway stations, the IGN name is too detailed, such as ‘Goncourt (Hôpital Saint-Louis)’ distance equal to zero. When this is not the case, the following situations can be observed:

• • •

(IGN) versus ‘Goncourt’ (OSM), ‘Pont Neuf (la Monnaie)’ (IGN) versus ‘Pont Neuf’. The type of station is integrated in the name for OSM subway stations such as: ‘Charles de For 17 subway stations, the IGN name is too detailed, such as ‘Goncourt (Hôpital Saint-Louis)’ Gaulle Étoile (RER)’, ‘Gare de Lyon (métro 1)’. (IGN) versus ‘Goncourt’ (OSM), ‘Pont Neuf (la Monnaie)’ (IGN) versus ‘Pont Neuf’. • Different nomenclature is used: ‘Palais Royal (Musée du Louvre)’ for IGN versus ‘Palais Royal – The type of du station is integrated Musée Louvre’ for OSM. in the name for OSM subway stations such as: ‘Charles de Gaulle •

Étoile (RER)’, ‘Gare de Lyon (métro 1)’. Different nomenclature is used: ‘Palais Royal (Musée du Louvre)’ for IGN versus ‘Palais Royal – Musée du Louvre’ for OSM.


17 of 29

Figure 14. Comparison of names of homologous subway stations using the Levenshtein distance.

We have noticed that the naming convention of OSM subway entrances more or less follows a standard: Name of the subway station—Exit nn—street xxx. Thus, we were interested to see how closely this standard is followed. The following cases are observed:

• •

•

case 1: OSM subway entrances that contain the name of the OSM subway station. case 2: OSM subway entrances have the same name with the corresponding OSM subway station where the latter has at least two subway entrances. This case can help identify OSM subway entrances for which the name should be distinguished from the name of the OSM subway station (because the subway station has more than two subway entrances) but is not. case 3: OSM subway entrances for which the name contains ‘exit xx’ or ‘−xx’. The results for these three cases are given in Table 9. Table 9. Data quality analysis with respect to the observed naming standard for OSM subway entrances. Name of OSM Subway Entrances Versus Name of Subway Stations 340 OSM subway entrances have a name starting with the corresponding name of the OSM subway station 156 OSM subway entrances have a name partially starting with its corresponding OSM subway station case 1

6 OSM subway stations have a name more detailed than the name of the OSM subway entrances ‘Chaussée d’Antin—La Fayette’ (OSM subway station) versus ‘Chaussée d’Antin’ (OSM subway entrances) 5 OSM subway stations have names strictly integrated into the OSM subway entrances: ‘Denfert-Rochereau’ versus ‘Métropolitain, station Denfert-Rochereau’, ‘Saint-Michel’ versus ‘Quai Saint-Michel (Notre-Dame)’

case 2

For an OSM subway station that has at least two corresponding OSM subway entrances, 463 OSM subway entrances have a name contained in the name of the OSM subway station Among the 463 OSM subway entrances, 306 (66%) have exactly the same name as the corresponding OSM subway station 168 OSM subway entrances contain the word exit. Among those, 22 do not contain the name of the OSM subway station (e.g., ‘Pernéty’ versus ‘Pernety—sortie 2—rue Niepce’. Adding the sign ‘—‘ between the name of the subway station and the word exit has been noticed

case 3

142 OSM subway entrances that contain the name of the subway stations follow the standard: Name of the subway station—Exit nn—street xxx (e.g., ‘Bastille—sortie 1—rue de la Roquette’) 4 OSM subway entrances start with the word Exit (e.g., ‘Sortie 3—Rue Goscinny’) 148 OSM subway entrances have a number after the word Exit

In addition to the previous analyses, the data matching and the evaluation of the results allowed 1 us to identify the following omissions:

• •

for one OSM subway station the name has not been filled in; 121 OSM subways entrances have no names.


18 of 29

6. Assessment by Checking Complementary VGI Sources As discussed in the previous section, a common approach to assessing the quality of VGI data is comparing it against authoritative datasets. This method enables the use of existing quality evaluation methods and has a number of advantages that can provide a comprehensive understanding of VGI, it can reveal any patterns or biases that might exist in the data (e.g., number of contributions and level of participation in rural vs. urban areas) or it can document, in a tangible way, the agreement of VGI data with authoritative, and thus asserted and trusted, spatial data. However, there are some equally important disadvantages. First, authoritative data are not always freely available or come with restrictive licensing agreements. Secondly, in some cases, VGI datasets are more current than the authoritative ones and thus omission and commission errors are hard to assess. This problem is intensified by the fact that VGI can be more detailed regarding certain features (e.g., bicycle rental points) or in certain geographic areas (e.g., popular or touristic areas). Moreover, in many cases, the lack of specifications in VGI products leads to heterogeneous data that prohibit straightforward comparison with authoritative data. Finally, VGI projects, such as OSM, can be more accurate and more complete than authoritative data (especially in developing countries), which violates the basic assumption of using authoritative data as reference data [35]. All these reasons gradually transfer the onus of VGI quality assessment to the contributors. Although this concept is a core part of VGI projects, participation and contribution biases do not allow for a uniform quality assessment (see for example [1]). Moreover, even when contributors want to help in quality assessment, there are cases where this is not always possible, especially when this is done remotely through the use of satellite imagery for features and attributes that cannot be shown in the images (e.g., attributes, names, hidden or not visible features, etc.) and thus cannot be extracted by photo-interpretation. Finally, there are cases where, for the same or adjacent features, there are contributions that are contradictory or not topologically feasible. Again, in such cases the pool of contributors and contributions from a single VGI project might not be enough for the quality assessment of VGI. 6.1. Methodology In order to cover this gap in quality assessment, we used data from other sources of implicit VGI content to evaluate the validity and quality of OSM POIs. For this role we chose the geo-tagged photographs available from Flickr. The combined view of OSM POIs and geo-tagged photographs can help to evaluate the quality of the OSM POIs and disambiguate inconsistencies or contradictory contributions, especially when the time-stamp available in both VGI datasets are taken into consideration. This evaluation was made manually for OSM POIs that are not directly visible from satellite imagery (e.g., POIs that are under wooded areas). 6.2. Results We used the IGN dataset for the wooded areas of Paris in order to spot the OSM POIs that are not visible using satellite imagery. Those POIs cannot be evaluated or corrected by contributors unless they physically visit the location. However, taking into consideration the Pareto principle, which also applies to VGI contributions [36] and dictates that a small percentage of volunteers generate the vast majority of content, as well as the participation and spatial biases of the contributors, it is understandable that quality evaluation or update of “hidden” elements is a challenging issue. In this context, we manually examined whether the use of geo-tagged photographs can provide additional information for POIs that are not directly visible from remote sensing. Figure 15a shows the OSM map for an area in Paris (‘Musée Rodin Gardens’) where a number of POIs are under trees. Figure 15b shows the same area with the IGN polygons of the wooded areas and the position of geo-tagged Flickr photographs. The Flickr photograph is shown as a pop-up with additional attributes provided by the photographer.


19 of 29


(a)

19 of 29

(b)

Figure 15. The use of Flickr photographs to verify OSM POIs. (a) OSM map for an area in Paris with

Figure 15. The use of Flickr photographs to verify OSM POIs. (a) OSM map for an area in Paris POIs that are under trees; (b) the same area with the IGN polygons of the wooded areas and the withposition POIs that are under trees; (b) the same area with the IGN polygons of the wooded areas and of geo-tagged Flickr photographs. A Flickr photograph is shown in a pop-up as well as the position geo-tagged Flickrby photographs. A Flickr photograph is shown in a©IGN). pop-up as well as additionalofattributes provided the photographer (©OpenStreetMap contributors, additional attributes provided by the photographer (©OpenStreetMap contributors, ©IGN).

The above example show that the combination of different sources of VGI data can offer a better understanding of the underlying reality, allowing the evaluationsources or update of volunteered content. The above example show that the combination of different of VGI data can offer a better However, manual evaluation is cumbersome, and thus, as a second step, we developed an ad content. hoc understanding of the underlying reality, allowing the evaluation or update of volunteered application that uses distance between points to detect the possible combinations of geo-tagged However, manual evaluation is cumbersome, and thus, as a second step, we developed an ad hoc photographs and hidden OSM POIs and then asks the user to respond whether a particular POI application that uses distance between points to detect the possible combinations of geo-tagged (according to OSM) can be recognized in a photograph that is taken from a close distance (e.g., ‘Do photographs and hidden OSM POIs and then asks the user to respond whether a particular POI you see a bus-stop about 4 m away, in the photo below?’) where the attribute ‘bus-stop’ comes from the (according to OSM) candistance be recognized photograph is taken closeand distance (e.g., ‘Do you OSM POI and the of 4 minisa the distance that between the from OSM aPOI the geo-tagged see aphotograph. bus-stop about 4 m away, in the photo below?’) where the attribute ‘bus-stop’ comes from OSM We manually evaluated 1000 pairs (i.e., POI—geo-tagged photograph) and itthe was POI possible and theto distance of 4recognize m is theand distance between theInOSM and geo-tagged photograph. positively evaluate 78 POIs. thesePOI cases, thethe POIs could be positively We manually 1000 pairs photograph) and it was possible to positively identified evaluated and thus evaluated for(i.e., theirPOI—geo-tagged existence and validity.

recognize and evaluate 78 POIs. In these cases, the POIs could be positively identified and thus 7. Combining Methodsand validity. evaluated for theirthe existence All the presented methods give an insight into the assessment of the quality of VGI POIs from a

7. Combining the Methods different point of view, but none is really able to provide a generic framework. Reference-based methods require reference are into not always accessibleof or not at all.from All the presented methodsdata, give which an insight the assessment themight quality of exist VGI POIs History-based methods, cannot distinguish between high quality features made from the first edit a different point of view, but none is really able to provide a generic framework. Reference-based and features that require further edits to improve quality. The combination of different VGI sources methods require reference data, which are not always accessible or might not exist at all. History-based can disambiguate or verify problematic cases up to a certain level, while spatial relation-based methods, cannot distinguish between high quality features made from the first edit and features that methods are good at finding outliers such as amenities outside buildings but are dependent on require further edits to improve quality. The combination of different VGI sources can disambiguate existing data to provide a finer-grained quality assessment (e.g., school POIs would be better or verify problematic cases upwere to aavailable). certain level, methods are isgood assessed if school polygons Somewhile other spatial POIs arerelation-based hard to assess because there no at finding outliers such as amenities buildings dependent on existing dataPOIs to provide generic spatial relation that can outside be applied to them.but Forare example, in OSM the artwork are a finer-grained quality assessment (e.g., school POIs would be better assessed if school polygons pieces of art mainly located outdoors, but indoor artwork can also be added. In this context, we trywere available). Some POIsquestions: are hard to assess because there is no generic spatial relation that can be to examine theother following applied to them. For example, in OSM the artwork POIs are pieces of art mainly located outdoors, but • How can combining methods improve the results of each method? indoor artwork can also be added. In this context, we try to examine the following questions: • How can combining methods help to elicit new information?

• •

How can combining methods improve the results of each method? How can combining methods help to elicit new information?

7.1. Combining History with an Analysis of Spatial Relations

The method based on the history of editions presented in Section 3 showed that some types of

features areHistory more subject displacements than others. One way to understand the contribution 7.1. Combining with antoAnalysis of Spatial Relations patterns underneath this displacement is to use the method based on the analysis of spatial relations The method the history of editions presented in Section 3 showed some types on each versionbased of theon features where there is considerable change. We selected three that feature types of features to displacements than others. One way to understand thestops—and contribution among are the more ones subject that move the most—motorway junctions, subway stations, and bus

patterns underneath this displacement is to use the method based on the analysis of spatial relations on each version of the features where there is considerable change. We selected three feature types among the ones that move the most—motorway junctions, subway stations, and bus stops—and analyzed


20 of 29


20 of 29

whether quality ofquality the POIs over timeover in terms consistency with the analyzedthe whether the of improved the POIs improved time of in their termsspatial of their spatial consistency surrounding features. with the surrounding features. 7.1.1. Motorway Junctions 7.1.1. Motorway Junctions Motorway junctions are the features that moved the most, according to historical analysis. Motorway junctions are the features that moved the most, according to historical analysis. Their Their quality can be assessed by measuring their topological relations with the roads, e.g., motorway quality can be assessed by measuring their topological relations with the roads, e.g., motorway junction POIs should be located at the intersection of motorways and a ramp. The study of the junction POIs should be located at the intersection of motorways and a ramp. The study of the 76 76 points of the dataset showed a perfect 100% consistency with roads. The history of these points points of the dataset showed a perfect 100% consistency with roads. The history of these points shows that most of them (around 70%) have been displaced from their initial position where there was shows that most of them (around 70%) have been displaced from their initial position where there no consistency with the roads, with often large displacements (Figure 16). was no consistency with the roads, with often large displacements (Figure 16).

Figure 16. The motorway junction is displaced at the intersection of the motorway and the ramp at Figure 16. The motorway junction is displaced at the intersection of the motorway and the ramp at version 6, where the displacement is large, i.e., greater than 30 m (©OpenStreetMap contributors). version 6, where the displacement is large, i.e., greater than 30 m (©OpenStreetMap contributors).

In our case study, only the history of points were collected, so we cannot be sure that the In our case study, only the historyconsistency of points were collected, so we cannot sure that just the displacement improved the topological of motorway junctions, becausebethey might displacement improved the topological consistency of motorway junctions, because they might just have followed the displacement of the roads. This should be checked by analyzing the history of the have followedasthe displacement of the roads. This should be checked by analyzing the history of the road features well. road features as well. 7.1.2. Subway Stations 7.1.2. Subway Stations Subway stations are located at the intersection of two lines, or simply on the polylines that Subway stations are located at the intersection of two lines, or simply on the polylines that represent a subway line (Figure 17a). They do not represent the subway entrances/exits that are represent a subway line (Figure 17a). They do not represent the subway entrances/exits that are captured with the tag value ‘subway_entrance’. The topological consistency with subway lines is captured with the tag value ‘subway_entrance’. The topological consistency with subway lines is measured similarly to motorway junctions with roads, and in this case, 3.8% of the 291 points are not measured similarly to motorway junctions with roads, and in this case, 3.8% of the 291 points are not connected to any subway line. A visual inspection confirms that the consistency problem is due to a connected to any subway line. A visual inspection confirms that the consistency problem is due to a poorly specified location for the station point. poorly specified location for the station point. The study of the history of subway stations shows that this good quality was obtained over time The study of the history of subway stations shows that this good quality was obtained over time with many edits to the stations. Most of the time, the edits made by several contributors together with many edits to the stations. Most of the time, the edits made by several contributors together produced a satisfactory conclusion (Figure 17b). However, a study of instances where the current produced a satisfactory conclusion (Figure 17b). However, a study of instances where the current version still has a low quality shows interesting cases of disagreements between contributors on version still has a low quality shows interesting cases of disagreements between contributors on where where to locate the feature. The example in Figure 17b shows the evolution of the large subway to locate the feature. The example in Figure 17b shows the evolution of the large subway station station ‘Nation’ with 17 contributors having edited the feature, and nine different positions. ‘Nation’ with 17 contributors having edited the feature, and nine different positions.


21 of 29

ISPRS Int. J. Geo-Inf. 2017, 6, 80 ISPRS Int. J. Geo-Inf. 2017, 6, 80

21 of 29 21 of 29

Figure 17. 17. Two examples of of the the evolution evolution of of subway subway station station location: location: (a) put at at the the Figure Two examples (a) the the station station is is put Figure 17. Two examples of theinevolution of subway station location: (a) the station is put at the intersection of the subway lines version 5 and then never moves (only tags are then modified); (b) intersection of the subway lines in version 5 and then never moves (only tags are then modified); intersection of the subway lines in version 5 and and we thencan never moves (only tags are then modified); (b) the current version is still wrongly located, see many displacements due to contributor (b) the current version is still wrongly located, and we can see many displacements due to contributor the current version is still wrongly located, and we can see many displacements due to contributor disagreements, as as 17 17 contributors contributors edited edited this this feature feature (©OpenStreetMap (©OpenStreetMapcontributors). contributors). disagreements, disagreements, as 17 contributors edited this feature (©OpenStreetMap contributors).

7.1.3. Bus Stops 7.1.3. Bus Stops 7.1.3. Bus Stops Regarding bus 4.54.5 was used on on each version of the Regarding bus stops, stops,the themethodology methodologydescribed describedin inSection Section was used each version of Regarding bus stops, the methodology described in Section 4.5 was used on each version of not the bus bus stops. In this case, thethe combination ofofthe historical and spatial relation methods does the stops. In this case, combination the historical and spatial relation methods does not bus stops.any In specific this case, the combination of the historical and spatial relation methods not highlight pattern. There the edits edits to the the geometry movedoes point highlight any specific pattern. There are are many many cases cases where where the to geometry move aa point highlight any specific pattern. There are many cases where the edits to the geometry move a point off of of aa building building boundary boundary or or aa road road centerline centerline (version (version 33 to to version version 55 in in Figure 18), 18), but but there there are are also also off off of acases building boundary or a road centerline (version 3 to or version 5 in Figure Figure 18),cases but there arethese also many where the bus stops are put too close to a road building, and other where many cases where the bus stops are put too close to aa road or building, and other cases where these many cases where the bus stops are put too close to road or building, and other cases where these two spatial spatial relations relations are are insufficient insufficient to to determine determine whether whether the the edits edits have have improved improved or or degraded degraded the the two two spatial relations are insufficient to determine whether the edits have improved or degraded the quality. For For instance, instance, the the bus bus stops stops in in Paris Paris have have been been displaced displaced considerably considerably in in recent recent years, years, and and quality. quality. For instance, the bus stops in Paris have been displaced considerably in recent years, and different POI versions often result from this displacement (e.g., version 1 to version 3 in Figure 18). different different POI POI versions versions often often result result from from this this displacement displacement(e.g., (e.g.,version version11to toversion version33in inFigure Figure18). 18).

Figure 18. Evolution of bus stop location: the bus stop did move between version 1 and 3, but version Figure Evolution of bus stop location: the bus stop did move betweenboundary version 1 and 3, but version 5 is just18. a better placement as version 3 was too the did building Figure 18. Evolution of bus stop location: theclose bus tostop move between (©OpenStreetMap version 1 and 3, 5contributors). is just a better placement as version 3 was too close to the building boundary (©OpenStreetMap but version 5 is just a better placement as version 3 was too close to the building boundary contributors). (©OpenStreetMap contributors).

7.2. Combining History with Data Matching 7.2. Combining History with Data Matching 7.2. Combining History Data Matching In this section, wewith combine data matching with historical analysis. Our goal was to analyze how In this section, we combine data matching with historical analysis. Our goal was to analyze how homologous subway their corresponding subway entrances over time and to In this section, westations combineand data matching with historical analysis. Ourchange goal was to analyze how homologous subway stations and their corresponding subway entrances change over time and to identify patterns of changes. homologous subway stations and their corresponding subway entrances change over time and to identify patterns of changes. identify patterns of changes.


7.2.1. IGN Subway Station Evolution ISPRS Int. J. Geo-Inf. 2017, 6, 80

22 of 29

22 of 29

As mentioned in Section 2.2, a change in IGN data is possible when the following situations 7.2.1. IGN Subway Station Evolution occur: a change in the real world, error corrections, a change in data specifications, or new partnership Asfor mentioned in Section change IGN data is possible when the following situations agreements sharing open data 2.2, arisea for someinlayers. Whatever the source of information, all features occur: a change in the real world, error corrections, a change in data specifications, new 19a used to update IGN POI datasets are checked and follow the data quality validation. orFigure partnership agreements for sharing open data arise for some layers. Whatever the source of illustrates the changes in subway stations in the Paris region from 2006 to the present with respect information, all features used to update IGN POI datasets are checked and follow the data quality to the source of information. We can see that, from 2006 to 2011, position changes are mostly due to validation. Figure 19a illustrates the changes in subway stations in the Paris region from 2006 to the updates and they are coming from another IGN database (blue line). Some thematic changes (the present with respect to the source of information. We can see that, from 2006 to 2011, position cumulative is 0)due coming from paper maps line)from can also be observed from(blue 2010 line). and 2013. changesdistance are mostly to updates and they are(red coming another IGN database The interesting point to highlight is that starting 2012 (line light blue), the change in the Some thematic changes (the cumulative distancewith is 0) coming from paper maps (red line) can alsosubway be station location is high, the cumulative distance being more than 1600 m in 2014, followed by observed from 2010 and 2013. The interesting point to highlight is that starting with 2012 (line 1500 light m in 2015, blue), and tending towards less than 500 m). This is due to adistance new partnership the change in thestability subway (e.g., station location is high, the cumulative being moreagreement than 1600IGN m in and 2014,the followed 1500 mwe in 2015, tending towards (e.g., less than 500 This in between RATP.by Indeed, haveand noticed that 93% ofstability the subway stations thatm). changed is due to athe new partnership agreement between IGN and RATP. of Indeed, we have noticed that 2014 are from RATP database. Figure 19b illustrates the the variation distance between consecutive 93% of the subway stations that changed in 2014 are from the RATP database. Figure 19b illustrates changes with respect to the number of changes, which shows that the number of changes correlates the variation of distance between consecutive changes with respect to the number of changes, which positively with the median distance between consecutive changes. For example, for subway stations shows that the number of changes correlates positively with the median distance between consecutive having two location changes over time, the median distance is around 50 m and more than 300 m for changes. For example, for subway stations having two location changes over time, the median subway stations having location changes. distance is around 50four m and more than 300 m for subway stations having four location changes.

Figure 19. Evolution of subway stationsin inthe the IGN IGN dataset: of of change andand (b) location Figure 19. Evolution of subway stations dataset:(a) (a)sources sources change (b) location changes Paris region. changes overover timetime for for thethe Paris region.

7.2.2. Subway Station and Subway Entrance Location Changes

7.2.2. Subway Station and Subway Entrance Location Changes In this section, we are interested in identifying patterns of location change concerning subway

In this section, we are interested in identifying patterns of location changeinconcerning subway stations and subway entrances by analyzing the data matching links described Section 5. With stations and subway entrances by analyzing the data matching links described in Section 5. With respect respect to a feature defined as the reference, the analyses are made on the last three versions of each to a feature as the reference, analyses are last three feature: feature:defined the current version (Tn), thethe version before (Tmade n−1) andon thethe version beforeversions Tn−1 (i.e., of Tn−2each ). Table 10 illustrates the (T evolution of location for the of each feature. the current version before (Tlast ) andversions the version before Tn−1 (i.e., Tn−2 ). Table 10 n ), the version n−1three of all, we of can see that stationsofhave least one location change. illustratesFirst the evolution location forall theIGN last subway three versions eachatfeature. Concerning the OSM subway entrances, 449 of them have no location change change. over time. The First of all, we can see that all IGN subway stations have at least one location Concerning different patterns are depicted in Table 10. With respect to the location distance of changes, two the OSM subway entrances, 449 of them have no location change over time. The different patterns are types of patterns can be identified: the feature gets closer to the reference (i.e., the feature change depicted in Table 10. With respect to the location distance of changes, two types of patterns can be location and the distance between the feature and its reference is decreasing over time), and the identified: thesituation feature where gets closer to thediverges reference (i.e., feature(i.e., change location and the distance converse the feature from thethe reference the feature changes location between the feature and its reference is decreasing over time), and the converse situation where and the distance between the feature and its reference is increasing over time). For example, 109 the feature diverges themove reference theIGN feature changespoints location the distanceWhen between OSM subwayfrom stations closer(i.e., to their homologous and and 52 are diverging. the the feature and its reference is increasing over time). For example, 109 OSM subway stations move closer to their IGN homologous points and 52 are diverging. When the OSM subway station is defined as the


23 of 29

reference, the trend is different, i.e., more IGN subway stations move away from OSM (92) than are coming closer (61). The second pattern concerns changes when there is a disagreement between versions. Four types of disagreement are identified. First, a disagreement between versions that end up closer to the reference (e.g., diverge from the reference and then come closer to the reference). We 23 can see that ISPRS Int. J. Geo-Inf. 2017, 6, 80 of 29 this pattern concerns OSM subway stations (35 features) in particular. The second pattern concerns OSM subway station is defined as the reference, the trend is different, i.e., they more come IGN subway a disagreement, which moves the features away from its reference (e.g., closer to the stations move away from OSM (92) than are coming closer (61). reference and then diverge from the reference). This occurs for both OSM subway stations and The second pattern concerns changes when there is a disagreement between versions. Four entrances for 69 and 47 features, respectively. Finally, the last two types of patterns are characterized types of disagreement are identified. First, a disagreement between versions that end up closer to the by return changes: getting andreference then returning thecloser previous (17 subway reference (e.g., divergecloser from the and then to come to thelocation reference). WeOSM can see that thisstations) and diverging and thenOSM returning the previous location (14 OSM stations). Thisareflects pattern concerns subway to stations (35 features) in particular. Thesubway second pattern concerns disagreement, which of moves thesubway features station away from its reference (e.g., they come closer to the once again the complexity public mapping. reference and then diverge from the reference). This occurs for both OSM subway stations and entrances for 69location and 47 features, respectively. Finally,and the subway last two entrances types of patterns are characterized patterns for subway stations with respect to a reference. Table 10. Change by return changes: getting closer and then returning to the previous location (17 OSM subway stations) and diverging and then returning to the location (14 OSM stations). This vs. OSM Subway Station vs. previous IGN Subway Station vs. subway OSM Subway Entrance Patterns IGN Subway Station (a) OSM Subway Station (b) IGN Subway Station (c) reflects once again the complexity of public subway station mapping. Without location history 0 0 449 Table 10. Change location patterns for subway stations and subway entrances with respect to a reference. Get closer to the reference 109 61 76 Diverging from Patterns the reference Disagreement but finallyhistory get Without location closer to the reference Get closer to the reference Diverging from the reference Disagreement but finally Disagreement but finally get diverge from the reference closer to the reference

Disagreement: Get closer and Disagreement but finally return to the previous location

diverge from the reference Disagreement: Diverge and and Disagreement: Get closer return return to the to previous location the previous location Disagreement: Diverge and return to the previous location

OSM Subway 52 Station vs. IGN Subway Station (a) 35 0 109 52

IGN Subway92Station vs. OSM Subway Station (b) 09 61 92

OSM Subway Entrance vs. 60 IGN Subway Station (c) 449 7 76 60

35

9

7

69 17

69

1417 14

6

6

0

47

47

00

0

0

0

0 0

In order to examine if there is any correlation between the importance of the subway stations and to examine if there any correlation between the importance thesubway subway stations the number In oforder changes, we tested theis hypothesis that the more importantofthe station is, the and the number of changes, we tested the hypothesis that the more important the subway station higher the number of changes in the location should be. To measure the importance of theis,subway the number of changes in the location should be. To measure the importance of the station, the we higher counted the number of subway entrances. The results illustrated in Figure 20 are quite subway station, we counted the number of subway entrances. The results illustrated in Figure 20 are surprising, showing that the number of changes is not linked to the importance of the subway station. quite surprising, showing that the number of changes is not linked to the importance of the subway For example, linethe represents subway stations with 1–2 changes. We We notice that 3535subway station.the Forblue example, blue line represents subway stations with 1–2 changes. notice that stations subway with only two changes have just one subway entrance in OSM. stations with only two changes have just one subway entrance in OSM.

Figure 20. Relationship between the number of subway entrances for each subway stations and the

Figure 20. Relationship between the number of subway entrances for each subway stations and the number of changes. number of changes.


24 of 29

7.2.3. Subway Station and Subway Entrance Name Changes Concerning the name of subway stations, we have seen in Section 5 that there is a high similarity between IGN and the last version of OSM subway stations. Based on that, we are interested in the following: have the names of OSM stations improved over time or were the names correct from the beginning? Among the OSM subway stations matched with IGN subway stations a total of 391 name changes occurred:

• • • •

•

64 features were subject to small corrections (Levenshtein distance is less than 2 edits). For example: ‘Marcadet—Poissonniers’→’Marcadet—Poissoniers’ For 62 features, the change led back to the previous version’s name. This reflects a disagreement between contributors. 202 subway stations had no name tag in the first version and the names were added over time. 140 OSM subway stations changed name over two days (i.e., 2008-05-14 and 2008-05-15). This case can be either due to an automatic process that has been run on OSM subway stations, or a contributor or contributors made all these changes over a couple of days, in order to standardize the dataset. A similar case occurred in 2013 when 168 subway stations changed name during one day (i.e., 2013-11-24). In this case, one hypothesis is that the new changes come from the transportation office, which released its data as open data in 2012.

This analysis shows that the evolution of names for subway stations over time has improved the thematic quality of the names. Regarding name changes of the subway entrances, we tried to see if the changes also improved their thematic quality by harmonizing the names towards a standard name. For this analysis the last three changes in the name were considered. Among 701 OSM subway entrances matched to OSM (and indirectly to IGN) subway stations, 365 subway entrances had no changes, 256 subway entrances had two changes and 80 of them had three changes. We noticed that, in general, the name for a subway entrance is ‘Name of the subway station—Exit nn—street’. Thus, we try to identify whether the name follows this format and whether the changes that occurred have a tendency to result in this format. Thus, three relevant cases are depicted (see Table 11):

• • •

case 1: the name of the OSM subway entrance contains the name of the OSM subway station; case 2: the name of the OSM subway entrance contains the name of the IGN subway station; case 3 :the name of the OSM subway entrance contains the word ‘exit xx’. Table 11. Evolution of names of OSM subway entrances.

Case 1 Case 2 Case 3

Current Version (T)/523

Previous Version Ti+j /258

Previous Version Ti+j /64

458 383 144

117 101 17

11 11 3

As we can see, once again, the changes improve the quality of the name of the OSM subway stations. The following example illustrates the general trend of changes regarding the names of OSM subway entrances. For a subway station named ‘Porte de Montreuil’, the current name of one of its subway entrances is ‘Porte de Montreuil—sortie 1 Avenue de la Porte de Montreuil’. The name of this cited subway entrance has been changed two times before: in the first version the name was ‘Av. de la Porte de Montreuil’ and in the second version, just before the current one, the name was ‘Avenue de la Porte de Montreuil’.


25 of 29

7.3. Discussion Some of the methods, especially those based on spatial relations between features and history, can be used to improve POI quality in real time. Integrity constraints can be defined and implemented in a collaborative system in order to guide the contributor to produce consistent data, to check that a change in a feature does not introduce inconsistencies or deteriorate the quality of existing data. Although research in this direction already exists, in most cases the methods for post-creation error fixing are proposed or they guide the contributor into choosing the type of a new feature [37,38]. Regarding real-time quality assessment, there have been a few research studies conducted: one focuses on attribute assessment [15], another one proposes near-real-time comparison of addresses with references [39], and a last one identifies errors that do not respect integrity constraints [40]. In this last study only a few integrity constraints are defined and the approach was tested on a small dataset. Thus, more research is needed to define and combine many integrity constraints. The four methods presented here can all be improved. Regarding the data matching approach, the semantic schema matching, which has been done manually in this paper, can be improved by using a schema matching based on ontology. For that, more research is necessary in order to align the IGN topographic ontology [41] with OSM ontology [42]. Regarding the spatial relations method, the main problem is to interpret what the spatial relations mean in terms of quality. It would be helpful to use machine learning techniques to infer the quality of the POI given a set of spatial relations. Although the paper urges the use of holistic approaches for VGI quality assessment, the combination of assessment methods experimented with in this paper used only two of the four presented methods: history with spatial relations, and history with matching to NMA reference data. Combining all four would really increase the scope of the quality assessment. For instance, for a feature with multiple versions one could acquire a match to reference data, a spatial relations analysis for all versions, and a set of geo-tagged photographs corresponding to the dates of each version. We could then overcome the limitations of each method (e.g., the quality of the reference data, or the fuzziness of the spatial relations analysis). However, the holistic approach should not be limited to the use of the four methods presented here. For instance, methods following the social approach of [28], i.e., methods that consider the social interactions between the contributors of a crowdsourcing project, would clearly help us to understand the cases where disagreements between contributors occur. Of course, if we increase the number of methods combined to assess data quality, it will be difficult to balance the contribution of each method in the global assessment. The use of computer science techniques such as multiple criteria decision making [29], fuzzy modeling, and machine learning will become mandatory for implementing such a protocol. 7.4. Guidelines for Typical Use Cases A first typical use case is a NMA that controls a photo sharing web app and/or a vector (i.e., point feature) contribution platform. How can the crowdsourced POIs be integrated with the authoritative datasets of the NMA? Data matching can be used in this situation to link both datasets in order to identify homologous features or crowdsourced POIs that do not exist in the authoritative dataset. There can be three types of results in the matching process. First, if the crowdsourced POI is not matched, a joint analysis of its spatial relations with the authoritative dataset and of its past versions can help us to assess if it can be integrated into the authoritative dataset (i.e., the quality is good or not). If the crowdsourced POI is matched and very close to its counterpart in the authoritative dataset, then the two features can be merged, which might semantically enrich the NMA dataset. The enrichment can be either the addition of missing attributes in the authoritative POI or its enrichment with vernacular name(s). Finally, if the crowdsourced POI is matched with a corresponding authoritative POI but it is somewhat geometrically or semantically different, then both the history and the spatial relation analyses can help to decide which should be kept, or how they can be merged (e.g., use the geometry of one and the semantics of the other). Thus, in this use case, data matching is used to address the challenge of multi-source spatial data integration. History and spatial relation analyses are used


26 of 29

to evaluate data matching links and improve the decision-making step. In all cases, if geo-tagged photographs of the POIs exist, they could also be used to disambiguate and confirm certain decisions. A second use case is the derivation of maps with the cartographic background from a NMA, and the OSM POIs on top to create an enhanced map. As described in [32], the problem here is the consistency between the POIs and the authoritative background. In this use case, no matching is needed because the POIs are taken from OSM as an additional dataset. The first step is to analyze the consistency in the spatial relation between the crowdsourced POI and the map background. If consistency is poor, or inconclusive, a joint analysis of past versions and a spatial relation consistency analysis is performed with OSM data only, as outlined in Section 7.1, which can remove ambiguity and again aid the decision-making process. For example, there may be a case where a store POI has been moved through different versions to a consistent position inside the building in OSM, but the inconsistency has come from a shift in the building’s position. In this case, the easiest solution is not to displace the building, as legibility is more important than positional accuracy, but rather to automatically move the POI to a similar position in the NMA building (buildings can be matched as they exist in both datasets). Thus, in this case the history and spatial relation analysis are used to improve the quality of OSM POIs. 8. Conclusions In this paper we focused on a quality evaluation of the POIs of the most successful VGI project: OSM. The first contribution of the paper has been to propose four different methods to assess VGI POI quality. Two of these methods include a comparison to a reference dataset approach, i.e., multi-criteria matching to NMA data and the use of Flickr photographs to check POI validity. Another method follows Goodchild & Li’s [28] crowdsourcing approach by analyzing feature history. The last method follows Goodchild and Li’s [28] geographic approach by analyzing the spatial relations of the POIs with their neighboring features. On the one hand, the application of each method on the same benchmark dataset shows that the results obtained by a method can highlight a phenomenon and can be used as an input for another method to better analyze the reasons for this phenomenon. For example, the analysis of history showed that some feature types change more than other types. In order to better understand the reason, the comparison with other sources of information can be carried out on these types of features. On the other hand, each method can provide unique insights on data quality, but none is sufficient to clearly evaluate the quality of POIs from VGI. Then based on the lessons learned from the strong and weak points of each method, we tried to combine them so as to gain a better understanding of the data quality. For instance, the combination of history and spatial relations shows that multiple edits of a feature mainly tend to improve the logical consistency with other features; combining matching to NMA data and history highlights several patterns of improvement or disagreement when several contributors edit the same feature. The paper shows that a holistic approach is needed for the quality evaluation, so much so that totally new ways need to be developed compared to the existing spatial data quality evaluation methods. Thus, this analysis helped us to understand what the next steps in building a solid VGI quality evaluation framework are, how it should behave, and where the challenges lie. For example, when it comes to the use of geo-tagged images, multiple sources (not just Flickr) can be included in the evaluation process and the method can also make use of other pieces of data such as titles, tags, or descriptions. Furthermore, new algorithms should be developed in order to filter out much of the noise that is present in social networking content. Automatic image analysis and classification, as presented in [43], will greatly improve the process. Regarding the evolution of OSM places and POIs, the analysis showed that, even if in some cases the change of an OSM feature is due to disagreement between contributors, changes generally occur to improve the quality of features and reduce the heterogeneity of the thematic attributes, in a sense extending the findings of [1]. In parallel, new methods need to be developed in order to compare the evolution of names. More specifically, we realized that the existing measures that allow string


27 of 29

comparison are not always applicable for this task. For example, the distance between ‘Sully Morland’ and ‘Sully Morland-sortie 4’ is bigger than the distance between ‘Sully Morland’ and ‘Pont Marie’. This means that ‘Sully Morland’ is closer from a linguistic point of view to ‘Pont Marie’ than to ‘Sully Morland-sortie 4’. Finally, we would like to suggest that all the datasets used here become part of a benchmark for VGI quality assessment. Here, only POI data were used, but this approach could be applied to other features such as roads, rivers, buildings, and land use. Acknowledgments: The authors would like to acknowledge the support and contribution of the ENERGIC COST Action (TD1203). Author Contributions: This paper is a collaboration that started during a workshop by the ENERGIC COST Action. All four authors contributed equally to all aspects of the paper: i.e., they performed the experiments and analysis and wrote relevant sections of the paper: Vyron Antoniou worked on the historical analysis and the comparison with FlickR data; Ana-Maria Olteanu-Raimond and Marie-Dominique Van Damme worked on the matching related tasks; Guillaume Touya worked on the spatial relations analyses. GuillaumeTouya initiated the collaboration, so he is the first author; the others are listed in alphabetical order. The datasets can be obtained by contacting the authors of the paper. Conflicts of Interest: The authors declare no conflict of interest.

References 1. 2. 3. 4. 5. 6. 7.

8. 9.

10.

11. 12. 13.

14.

Haklay, M.; Basiouka, S.; Antoniou, V.; Ather, A. How many volunteers does it take to map an area well? The validity of linus’s law to volunteered geographic information. Cartogr. J. 2010, 47, 315–322. [CrossRef] Haklay, M. How good is volunteered geographical information? A comparative study of OpenStreetMap and ordnance survey datasets. Environ. Plan. B Plan. Des. 2010, 37, 682–703. [CrossRef] Girres, J.-F.; Touya, G. Quality assessment of the french OpenStreetMap dataset. Trans. GIS 2010, 14, 435–459. [CrossRef] Senaratne, H.; Mobasheri, A.; Ali, A.L.; Capineri, C.; Haklay, M. A review of volunteered geographic information quality assessment methods. Int. J. Geogr. Inf. Sci. 2017, 31, 139–167. [CrossRef] Fonte, C.C.; Bastin, L.; See, L.; Foody, G.; Lupia, F. Usability of VGI for validation of land cover maps. Int. J. Geogr. Inf. Sci. 2015, 29, 1269–1291. [CrossRef] Antoniou, V.; Skopeliti, A. Measures and indicators of VGI quality: An overview. ISPRS Ann. 2015, II-3/W5, 345–351. [CrossRef] Quesnot, T.; Roche, S. Quantifying the significance of semantic landmarks in familiar and unfamiliar environments. In Spatial Information Theory; Fabrikant, S.I., Raubal, M., Bertolotto, M., Davies, C., Freundschuh, S., Bell, S., Eds.; Springer: Berlin/Heidelberg, Germany, 2015; pp. 468–489. Bakillah, M.; Liang, S.; Mobasheri, A.; Arsanjani, J.J.; Zipf, A. Fine-resolution population mapping using OpenStreetMap points-of-interest. Int. J. Geogr. Inf. Sci. 2014, 28, 1940–1963. [CrossRef] Rodrigues, F.; Alves, A.; Polisciuc, E.; Jiang, S.; Ferreira, J.; Pereira, F.C. Estimating disaggregated employment size from points-of-interest and census data: From mining the web to model implementation and visualization. Int. J. Adv. Intell. Syst. 2013, 6, 41–52. Jiang, S.; Alves, A.; Rodrigues, F.; Ferreira, J., Jr.; Pereira, F.C. Mining point-of-interest data from social networks for urban land use classification and disaggregation. Comput. Environ. Urban Syst. 2015, 53, 36–46. [CrossRef] Bawa-Cavia, A. Sensing the urban: Using location-based social network data in urban analysis. In Proceedings of the Workshop on Pervasive Urban Applications (PURBA) 2011, San Francisco, CA, USA, 12 June 2011. Huang, H.; Gartner, G.; Turdean, T. Social media as a source for studying people’s perception and knowledge of environments. Mitteilungen Österreichischen Geogr. Ges. 2013, 155, 291–302. [CrossRef] Leibovici, D.G.; Evans, B.; Hodges, C.; Wiemann, S.; Meek, S.; Rosser, J.; Jackson, M. On data quality assurance and conflation entanglement in crowdsourcing for environmental studies. ISPRS Ann. Photogramm. Remote Sens. Spat. Inf. Sci. 2015, II-3/W5, 195–202. [CrossRef] Barron, C.; Neis, P.; Zipf, A. A comprehensive framework for intrinsic OpenStreetMap quality analysis. Trans. GIS 2014, 18, 877–895. [CrossRef]


15. 16. 17.

18. 19.

20. 21. 22. 23. 24.

25.

26. 27. 28. 29. 30. 31.

32. 33. 34. 35.

36.

28 of 29

Antoniou, V. User Generated Spatial Content: An Analysis of the Phenomenon and Its Challenges for Mapping Agencies. Ph.D. Thesis, University College London, London, UK, 2011. Craglia, M.; Ostermann, F.; Spinsanti, L. Digital earth from vision to practice: Making sense of citizen-generated content. Int. J. Dig. Earth 2012, 5, 398–416. [CrossRef] Antoniou, V.; Touya, G.; Olteanu-Raimond, A.-M. Quality analysis of the parisian OSM toponyms evolution. In European Handbook of Crowdsourced Geographic Information; Capineri, C., Haklay, M., Huang, H., Antoniou, V., Kettunen, J., Ostermann, F., Purves, R., Eds.; Ubiquity Press: London, UK, 2016; pp. 97–112. Jonietz, D.; Zipf, A. Defining fitness-for-use for crowdsourced points of interest (POI). ISPRS Int. J. Geo-Inf. 2016, 5, 149. [CrossRef] Soden, R.; Palen, L. From crowdsourced mapping to community mapping: The post-earthquake work of OpenStreetMap Haiti. In Proceedings of the 11th International Conference on the Design of Cooperative Systems, Nice, France, 27–30 May 2014; pp. 311–326. Haklay, M.; Antoniou, V.; Basiouka, S.; Soden, R.; Mooney, P. Crowdsourced Geographic Information Use in Government; World Bank Publications: Washington, DC, USA, 2014. O’reilly, T. What is Web 2.0: Design patterns and business models for the next generation of software. Commun. Strateg. 2007, 1, 17. Hollenstein, L.; Purves, R. Exploring place through user-generated content: Using Flickr tags to describe city cores. J. Spat. Inf. Sci. 2010, 1, 21–48. Van Exel, M.; Dias, E.; Fruijtier, S. The impact of crowdsourcing on spatial data quality indicators. In Extended Abstracts Volume, GIScience 2010; Purves, R., Weibel, R., Eds.; GIScience: Zurich, Switzerland, 2010. Ciepłuch, B.; Jacob, R.; Mooney, P.; Winstanley, A.C. Comparison of the accuracy of OpenStreetMap for Ireland with Google maps and Bing maps. In Proceedings of the Ninth International Symposium on Spatial Accuracy Assessment in Natural Resources and Environmental Sciences, Leicester, UK, 20–23 July 2010; p. 337. Keßler, C.; de Groot, R.T. Trust as a proxy measure for the quality of volunteered geographic information in the case of OpenStreetMap. In Geographic Information Science at the Heart of Europe; Vandenbroucke, D., Bucher, B., Crompvoets, J., Eds.; Springer: Berlin/Heidelberg, Germany, 2013; pp. 21–37. Goodchild, M.F. Commentary: Whither VGI? GeoJournal 2008, 72, 239–244. [CrossRef] International Organization for Standardization. ISO 19157: Geographic Information: Data Quality (DIS); ISO/TC211: Geneva, Switzerland, 2013. Goodchild, M.F.; Li, L. Assuring the quality of volunteered geographic information. Spat. Stat. 2012, 1, 110–120. [CrossRef] Touya, G.; Brando, C. Detecting Level-of-Detail inconsistencies in volunteered geographic information data sets. Cartographica 2013, 48, 134–143. [CrossRef] Chaudhry, O.Z.; Mackaness, W.A.; Regnauld, N. A functional perspective on map generalisation. Comput. Environ. Urban Syst. 2009, 33, 349–362. [CrossRef] Jackson, S.P.; Mullen, W.; Agouris, P.; Crooks, A.; Croitoru, A.; Stefanidis, A. Assessing completeness and spatial error of features in volunteered geographic information. ISPRS Int. J. Geo-Inf. 2013, 2, 507–530. [CrossRef] Touya, G.; Baley, M. Level of Details Harmonization Operations in OpenStreetMap Based Large Scale Maps. In Citizen Empowered Mapping; Leitner, M., Jokar Arsanjani, J., Eds.; Springer: Berlin/Heidelberg, Germany, 2016. Fan, H.; Zipf, A.; Fu, Q.; Neis, P. Quality assessment for building footprints data on OpenStreetMap. Int. J. Geogr. Inf. Sci. 2014, 28, 700–719. [CrossRef] Olteanu-Raimond, A.-M.; Mustière, S.; Ruas, A. Knowledge formalization for vector data matching using belief theory. J. Spat. Inf. Scie. 2015, 10, 21–46. [CrossRef] Vandecasteele, A.; Devillers, R. Improving volunteered geographic information quality using a tag recommender system: The case of OpenStreetMap. In OpenStreetMap in GIScience: Experiences, Research, Applications; Jokar Arsanjani, J., Zipf, A., Mooney, P., Helbich, M., Eds.; Springer: Berlin/Heidelberg, Germany, 2015; pp. 59–80. Quattrone, G.; Capra, L.; de Meo, P. There’s No Such Thing as the Perfect Map: Quantifying Bias in Spatial Crowd-Sourcing Datasets. In Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing, Vancouver, BC, Canada, 14–18 March 2015; pp. 1021–1032.


37.

38.

39. 40. 41. 42. 43.

29 of 29

Bordogna, G.; Frigerio, L.; Kliment, T.; Brivio, P.A.; Hossard, L.; Manfron, G.; Sterlacchini, S. “Contextualized VGI” Creation and Management to Cope with Uncertainty and Imprecision. ISPRS Int. J. Geo-Inf. 2016, 5, 234. [CrossRef] Scheider, S.; Keßler, C.; Ortmann, J.; Devaraju, A.; Trame, J.; Kauppinen, T.; Kuhn, W. Semantic referencing of geosensor data and volunteered geographic information. In Geospatial Semantics and the Semantic Web; Ashish, N., Sheth, A.P., Eds.; Springer: Berlin/Heidelberg, Germany, 2011; pp. 27–59. Stark, H.-J. Quality Assessment of Volunteered Geographic Information using Open Web Map Services within OpenAddresses. In Proceedings of Geospatial Crossroads @ GI_Forum ’11, Salzburg, Austria, 5–8 July 2011. Brando-Escobar, C. Un Modèle d’Operations Réconciliables Pour l’Acquisition Distribuée de Données Géographiques. Ph.D. Thesis, Université Paris-Est, Champs-sur-Marne, France, 2013. IGN BD TOPO Ontology. Available online: http://data.ign.fr/def/topo/20140416.htm (accessed on 16 January 2017). OSM Ontology. Available online: http://wiki.openstreetmap.org/wiki/OSMonto (accessed on 16 January 2017). Zhou, B.; Lapedriza, A.; Xiao, J.; Torralba, A.; Oliva, A. Learning Deep Features for Scene Recognition using Places Database. In Advances in Neural Information Processing Systems 27 (NIPS); Neural Information Processing Systems Foundation, Inc.: Saint Paul, MN, USA, 2014. © 2017 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).