Data Quality Assurance for Volunteered Geographic Information
Ahmed Loai Ali & Falko Schmid
GIScience 2014, Vienna, September 23-26

Volunteered Geographic Information (VGI)

[Diagram: VGI data, passed through quality assurance, supports services such as crisis management, planning, events reporting, and map services.]

[Diagram: quality assurance in JOSM. Open classification questions: Residential or Services? Primary, Secondary, or Tertiary? Lake or Pond?]

Positional accuracy
http://www.mdpi.com/2220-9964/2/3/704/htm
Completeness
Attribute accuracy: Park or Garden?
Comparison
Intrinsic data analysis

Classification

Rule-based integrity checking: Country, State/Province, City
Learning-based integrity checking: Restaurant, Playgrounds, Footways, Lake

boundary = administrative, admin_level = 2, 4, 6 (examples: Germany, Austria)

Hierarchical Consistency

admin_level   Administrative unit
2             country
4             state/governorate/province
6             city/town
8             district/neighbourhood
10, 11        suburbs

• Exceptions exist: city states, exclaves
• But the majority of entities follow the strict hierarchical structure



Incorrect Values
$\forall u \in U_i$ where $1 \le i \le 11$
Violation: admin_level is not a numeric value, e.g. admin_level = provinsi, kabupaten, ...

Inconsistency
$\forall u_a \in U_{i>1},\ \exists u_b \in U_{j>i} : u_a \subset u_b$

Duplication
$\forall U_j, U_k \subset U_i : U_j \cap U_k = \emptyset$
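The three rules above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Unit` record, the bounding-box geometry, and the function names are hypothetical simplifications (real OSM boundaries are polygons). It also assumes, following the admin_level table, that a parent unit carries a smaller admin_level number than the unit it contains, and it treats any two overlapping units on the same level as a duplication candidate.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Unit:
    """Hypothetical, simplified administrative unit."""
    name: str
    admin_level: str   # raw tag value, may be non-numeric
    bbox: tuple        # (min_x, min_y, max_x, max_y), stands in for a polygon


def level(u):
    """Return admin_level as an int in 1..11, or None if the value is invalid."""
    try:
        value = int(u.admin_level)
    except ValueError:
        return None
    return value if 1 <= value <= 11 else None


def contains(outer, inner):
    """True if box `inner` lies completely inside box `outer`."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])


def overlaps(a, b):
    """True if two boxes share interior area (touching edges do not count)."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def check(units, top_level=2):
    """Apply the three rules; return (incorrect, inconsistent, duplicated)."""
    # Rule 1: admin_level must be a numeric value between 1 and 11.
    incorrect = [u.name for u in units if level(u) is None]
    valid = [u for u in units if level(u) is not None]
    # Rule 2: every unit below the top level needs an enclosing coarser unit.
    inconsistent = [
        u.name for u in valid
        if level(u) > top_level
        and not any(level(p) < level(u) and contains(p.bbox, u.bbox) for p in valid)
    ]
    # Rule 3: units on the same level must not overlap.
    duplicated = [
        (a.name, b.name) for a, b in combinations(valid, 2)
        if level(a) == level(b) and overlaps(a.bbox, b.bbox)
    ]
    return incorrect, inconsistent, duplicated


# Made-up example data, for illustration only.
units = [
    Unit("Country A", "2", (0, 0, 10, 10)),
    Unit("State A1", "4", (0, 0, 5, 10)),
    Unit("State A2", "4", (5, 0, 10, 10)),
    Unit("State A2 copy", "4", (5, 0, 10, 10)),
    Unit("City X", "6", (20, 20, 21, 21)),       # lies outside every parent
    Unit("Region Y", "provinsi", (1, 1, 2, 2)),  # non-numeric admin_level
]
incorrect, inconsistent, duplicated = check(units)
print(incorrect, inconsistent, duplicated)
```

Entities flagged this way are only candidates: city states and exclaves, for example, would be reported by the second check although they are legitimate.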

• 24,410 out of 259,667 entities (≈ 9.4 %, roughly 10 %) are detected as problematic

• Not all detected entities are actually problematic:
  • e.g. city states, independent territories

• Rule-based checking results in:
  • Potentially problematic classified entities

• Required actions:
  • Automatic correction of clearly problematic entities
  • Crowdsourced evaluation of potentially problematic entities, e.g. by the OSM community, web-based tools, gamification, etc.

Classification Plausibility

Lake or Pond?

Proposed Approach

[Diagram: Locality: within a country. Filtering: densest cities. Entity characteristics shown: sports, playground, amenity, water body, forest, footway, flower, grass, cultivation. Area: 100s to 1,000,000s sqm > 10s to 1,000s sqm.]

Learning-based Integrity Checking
• Classifier training:
  • We use K-Nearest Neighbours (KNN)
  • The similarity between entities is measured by Euclidean distance

• Classifier validation:
  • Relying on a single training and test set is biased
  • Relying on a single performance measure is also biased

• To avoid biased classification:
  • Mutual classifiers
  • Multiple performance measures (accuracy, AUC)
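A minimal sketch of this setup, under stated assumptions, may help. The two features (log10 of the area and the number of contained features), the toy data, and all function names are hypothetical, and "mutual classifiers" is read here as training on one city and testing on another in both directions, which the slides do not spell out. The features are left unscaled for brevity.

```python
import math
from collections import Counter


def knn_score(train, query, k=3):
    """Fraction of the k nearest training entities (Euclidean) labelled 'park'."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes["park"] / k


def accuracy(train, test, k=3):
    """Share of test entities whose majority-vote label matches the given label."""
    hits = sum(
        ("park" if knn_score(train, x, k) > 0.5 else "garden") == y
        for x, y in test
    )
    return hits / len(test)


def auc(train, test, k=3):
    """Probability that a random park outscores a random garden (ties count 0.5)."""
    pos = [knn_score(train, x, k) for x, y in test if y == "park"]
    neg = [knn_score(train, x, k) for x, y in test if y == "garden"]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up entities: ((log10 area in sqm, number of contained features), label).
city_a = [
    ((4.5, 6), "park"), ((5.0, 8), "park"), ((4.0, 5), "park"),
    ((2.0, 1), "garden"), ((2.5, 0), "garden"), ((1.8, 2), "garden"),
]
city_b = [
    ((4.8, 7), "park"), ((3.9, 4), "park"),
    ((2.2, 1), "garden"), ((2.9, 2), "garden"),
    ((2.4, 3), "park"),  # small entity labelled "park": a plausibility candidate
]

# Mutual evaluation: each city serves once as training set and once as test set.
for name, train, test in [("A -> B", city_a, city_b), ("B -> A", city_b, city_a)]:
    print(name, accuracy(train, test), auc(train, test))
```

On this toy data, accuracy and AUC disagree for the A -> B direction, which is the reason for reporting more than one measure. Entities whose given label conflicts with the prediction are the candidates for an implausible classification.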

[Figure: Mutual learning for classifiers in Germany. Red values: the top 3 models with the highest performance. Blue values: biased models.]

Implausible Classifications
• Small plots of land covered by grass classified as “park”. They might be “garden”.
• A tiny grass area classified as “park”.
• A grass area on a building roof classified as “park”. It might be “garden”.
• A roundabout classified as “garden”. The appropriate classification might be “grass”.

Results & Discussion
• The analysis of classifier accuracy:
  • indicates problematic classifications in the studied countries

• The results show the following classification accuracy for “park” and “garden” entities:
  • In cities of the UK and Germany: 70% to 90%
  • In cities of Austria: 50% to 65%
• The applicability of learning depends on the availability of large data sets with sufficient quality

• There is no magic rule for judging the detected problematic entities.
• Crowdsourced revisions are required.
• The revisions could be used to improve and assess the developed classifiers.

• We re-checked the detected problematic entities in both scenarios 4 months later.

• A promising percentage of the problematic entities had been updated:
  • Hierarchical consistency (“administrative boundaries”): ≈ 40 % of the detected entities
  • Classification plausibility (“park” vs. “garden”): ≈ 8 % in both the UK and Germany, ≈ 11 % in Austria
• The results are positive indicators for the feasibility of our approaches.
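A re-check of this kind could be sketched as follows. The slides do not say how updates were detected; this example assumes two snapshots that map entity ids to OSM element version numbers, and the function name and data are hypothetical.

```python
def updated_share(flagged_ids, before, after):
    """Share of flagged entities whose version changed or that were deleted."""
    flagged = [i for i in flagged_ids if i in before]
    # A missing id in `after` means the entity was deleted in the meantime.
    changed = [i for i in flagged if after.get(i) != before[i]]
    return len(changed) / len(flagged) if flagged else 0.0


# Made-up snapshots: entity id -> version number.
before = {1: 1, 2: 3, 3: 2, 4: 1, 5: 7}
after = {1: 1, 2: 4, 3: 2, 4: 1}  # entity 2 was edited, entity 5 was deleted
share = updated_share([1, 2, 3, 4, 5], before, after)
print(f"{share:.0%} of the flagged entities were updated")
```

A changed version only shows that an entity was edited, not that its classification was corrected, so a share computed this way is an upper bound on actual fixes.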

Conclusion & Future Work
• The quality of VGI is a crucial issue for providing reliable services.
• Integrity checking is one way to achieve consistent classification of similar features:
  • Rule-based
  • Learning-based
• In the VGI context:
  • Every classification is possible
  • Classification accuracy is one facet of data quality, and it needs strict integrity-checking mechanisms
• Future work:
  • Investigating geometric and topological properties of the entities.
  • Further analysis to check which types of entities are frequently involved in a topological relation with a given entity.

? contact: [email protected]