Data Quality Assurance for Volunteered Geographic Information Ahmed Loai Ali & Falko Schmid GIScience 2014 Vienna - September 23-26
Volunteered Geographic Information (VGI)
[Diagram: VGI data used for crisis management, planning, event reporting, and map services; quality assurance of VGI data for services]
JOSM
Quality Assurance
Residential ? Services ?
Primary ? Secondary ? Tertiary ?
Lake ? Pond ?
Positional accuracy
http://www.mdpi.com/2220-9964/2/3/704/htm
Completeness
Attribute accuracy Park Garden
Comparison
Intrinsic data analysis
Classification
Rule-based integrity checking: Country, State/Province, City, …
Learning-based integrity checking: restaurant, playgrounds, footways, lake
boundary = administrative; admin_level = 2, 4, 6
Germany Austria
Hierarchical Consistency
admin_level 2: country
admin_level 4: state/governorate/province
admin_level 6: city/town
admin_level 8: district/neighbourhood
admin_level 10, 11: suburbs
• Exceptions exist: city states, exclaves
• But the majority follows the strict hierarchical structure
Country > State/Province > City > …
Incorrect values: admin_level != numeric value, e.g. admin_level = provinsi, kabaupaten,...
$\forall u \in U_i$ where $1 \le i \le 11$
Inconsistency:
$\forall u_a \in U_{i>1},\ \exists u_b \in U_{j>i} : u_a \subset u_b$
Duplication:
$\forall U_j, U_k \subset U_i : U_j \cap U_k = \emptyset$
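The three rules (incorrect values, inconsistency, duplication) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The toy data, the use of grid-cell sets in place of polygons, and the names `parse_level` and `check_units` are all hypothetical. The sketch also assumes that a smaller admin_level number denotes the coarser, containing unit, as in the hierarchy listing, and it simplifies the duplication rule to an overlap test between units of the same level.

```python
from itertools import combinations

# Hypothetical toy data: each unit has its raw admin_level tag and a set of
# grid cells standing in for its polygon (an assumption made for brevity).
units = [
    {"id": "country", "admin_level": "2", "cells": frozenset(range(0, 8))},
    {"id": "state_a", "admin_level": "4", "cells": frozenset(range(0, 4))},
    {"id": "state_b", "admin_level": "4", "cells": frozenset(range(3, 8))},
    {"id": "city_x", "admin_level": "6", "cells": frozenset({9})},
    {"id": "bad_tag", "admin_level": "provinsi", "cells": frozenset({1})},
]

def parse_level(raw):
    """Return the level as an int if the tag is numeric and in 1..11, else None."""
    if raw.isascii() and raw.isdigit() and 1 <= int(raw) <= 11:
        return int(raw)
    return None

def check_units(units, top_level=2):
    """Return (unit id, rule name) pairs for every rule violation found."""
    issues, valid = [], []
    # Rule 1: admin_level must be a numeric value between 1 and 11.
    for u in units:
        level = parse_level(u["admin_level"])
        if level is None:
            issues.append((u["id"], "incorrect value"))
        else:
            valid.append((level, u))
    # Rule 2 (assumed direction): every unit below the top level must lie
    # strictly inside some unit with a smaller admin_level number.
    for level, u in valid:
        if level > top_level and not any(
            pl < level and u["cells"] < p["cells"] for pl, p in valid
        ):
            issues.append((u["id"], "inconsistency"))
    # Rule 3 (simplified): units of the same level must not overlap.
    for (la, a), (lb, b) in combinations(valid, 2):
        if la == lb and a["cells"] & b["cells"]:
            issues.append((a["id"] + "/" + b["id"], "duplication"))
    return issues

for issue in check_units(units):
    print(issue)
```

With real OSM data the cell sets would be replaced by polygon containment and intersection tests, and flagged units such as city states would still need manual review.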
• 24,410 out of 259,667 entities are detected as problematic (≈ 10 %)
• Not all detected entities are actually problematic:
  • E.g. city states, independent territories
• Rule-based checking results in:
  • Potentially problematic classified entities
• Required actions:
  • Automatic correction for clearly problematic entities
  • Crowdsourced evaluation for potentially problematic entities, e.g. OSM community, Web-based tools, gamification, etc.
Classification Plausibility
? Pond
Lake
Proposed Approach
Locality: within a country
Filtering: densest cities
Sports Playground Amenity Water body Forest
Footway Flower Grass
Cultivation
100s … 1,000,000s sqm > 10s … 1,000s sqm
Learning-based Integrity Checking
• Classifier training:
  • We utilize K-Nearest Neighbours (KNN)
  • The similarity between entities is measured by Euclidean distance
• Classifier validation:
  • Relying on a single training and test set is biased
  • Relying on a single performance measure is also biased
• To avoid biased classification:
  • Mutual classifiers
  • Multiple performance measures (accuracy, AUC)
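A minimal Python sketch of the KNN step with Euclidean distance follows, written without external libraries. It is an illustration under stated assumptions, not the authors' pipeline. The two feature values per entity (log10 of the area in sqm and a count of contained footways), the two cities, and all data points are hypothetical. "Mutual classifiers" is interpreted here as training on one city and testing on another, in both directions, which is an assumption about the slide's meaning.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Majority vote among the k training entities nearest to `query` (Euclidean)."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def accuracy(train, test, k=3):
    """Share of test entities whose predicted class matches their given class."""
    hits = sum(knn_predict(train, x, k) == y for x, y in test)
    return hits / len(test)

# Hypothetical features: (log10 of area in sqm, number of footways inside).
city_a = [((4.5, 6.0), "park"), ((5.2, 9.0), "park"), ((3.9, 4.0), "park"),
          ((1.8, 0.0), "garden"), ((2.4, 1.0), "garden"), ((2.9, 1.0), "garden")]
city_b = [((4.8, 7.0), "park"), ((4.1, 5.0), "park"), ((2.1, 0.0), "garden"),
          ((2.7, 2.0), "garden"), ((3.0, 3.0), "park")]

# Mutual validation (assumed meaning): train on one city, test on the other.
print("A -> B accuracy:", accuracy(city_a, city_b))
print("B -> A accuracy:", accuracy(city_b, city_a))

# Entities whose given class disagrees with the prediction are candidates
# for an implausible classification, to be reviewed rather than auto-fixed.
flagged = [x for x, y in city_b if knn_predict(city_a, x) != y]
print("flagged entities:", flagged)
```

Because Euclidean distance is sensitive to feature scale, real features would normally be normalised before training.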
Figure: Mutual learning for classifiers in Germany. Red values mark the top 3 models with the highest performance; blue values mark biased models.
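The two performance measures named on the slides, accuracy and AUC, can disagree on the same model, which is one reason to report both. The sketch below computes them from scratch. The labels and scores are hypothetical, with a score standing for something like the share of "park" votes among the k neighbours, and the 0.5 threshold is an assumption.

```python
def accuracy_score(labels, scores, threshold=0.5):
    """Share of entities classified correctly when scores are cut at `threshold`."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def auc_score(labels, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical data: 1 = "park", 0 = "garden".
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.1]

print("accuracy:", accuracy_score(labels, scores))
print("AUC:", auc_score(labels, scores))
```

Accuracy depends on the chosen threshold, while AUC only depends on how the scores rank the entities, so a model can look weak on one measure and reasonable on the other.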
Implausible Classifications
• Small grass-covered plots classified as “park”. They might be “garden”.
• A tiny grassland classified as “park”.
• A grass area on a building roof classified as “park”. It might be “garden”.
• A roundabout classified as “garden”. The appropriate classification might be “grass”.
Results & Discussion
• The analysis of classifier accuracy indicates problematic classifications in the studied countries.
• The results show the classification accuracy of “park” & “garden” entities:
  • In cities of the UK and Germany: 70% to 90%
  • In cities of Austria: 50% to 65%
• The applicability of learning depends on the availability of large data sets with sufficient quality.
• There is no magic rule for judging the detected problematic entities.
• Crowdsourced revisions are required.
• The revisions could be used to improve and assess the developed classifiers.
• We re-checked the detected problematic entities in both scenarios 4 months later.
• A promising percentage of the problematic entities have been updated:
  • Hierarchical consistency (“administrative boundaries”): ≈ 40 % of the detected entities
  • Classification plausibility (“park” - “garden”): ≈ 8 % in both the UK & Germany, ≈ 11 % in Austria
• The results show positive indicators for the feasibility of our approaches.
Conclusion & Future Work
• The quality of VGI is a crucial issue for providing reliable services.
• Integrity checking is one way to achieve consistent classification of similar features:
  • Rule-based
  • Learning-based
• In the VGI context:
  • Every classification is possible
  • Classification accuracy is one facet of data quality, which needs strict integrity checking mechanisms
• Investigating geometric and topological properties of the entities.
• Further analysis to check which types of entities are frequently involved in a topological relation with an entity.
? contact:
[email protected]