“Nothing is created, nothing is lost, everything changes”: measuring and visualizing data quality in Europeana
Valentine Charles (1), Péter Király (2)
ELAG 2017, June 8, Athens, Greece
(1) Europeana Foundation, The Netherlands
(2) Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen
Measuring and visualizing data quality. Generic title and bad thumbnail
more examples in Report and Recommendations from the Task Force on Metadata Quality (2015)
Measuring and visualizing data quality. Multilinguality problem
★ Mona Lisa → 456 results
★ La Gioconda → 365 results
★ La Joconde → 71 results
http://www.europeana.eu/portal/en/record/90402/RP_F_00_351.html
Measuring and visualizing data quality. Et cetera
★ Same title and description: title: "VOETBAL-EREDIVISIEFEYENOORD - GO AHEAD 3-1", description: "VOETBAL-EREDIVISIEFEYENOORD - GO AHEAD 3-1"
★ Machine-readable ID in title: title: "NLD-820630-AMSTERDAM: Straatmuzikanten proberen geld te verdienen..."
★ Leftover: title: "+++EMPTY+++"
more examples in Report and Recommendations from the Task Force on Metadata Quality (2015)
Measuring and visualizing data quality. The problem
There are “good” and “bad” metadata records, but we don’t have clear metrics like this:
functional requirements: bad | acceptable | good
Measuring and visualizing data quality. Why is data quality important?
“Fitness for purpose” (QA principle)
Purposes:
★ data re-use
★ discovery
no metadata → no access to data → no data usage
more explanation: Data on the Web Best Practices, W3C Working Draft, https://www.w3.org/TR/dwbp/
Measuring and visualizing data quality. Hypothesis
by measuring structural elements we can approximate metadata record quality ≃ metadata smell
Measuring and visualizing data quality. Benefits
★ improve the metadata
★ services: good data → reliable functions
★ better metadata schema & documentation
★ propagate “good practice”
Measuring metadata quality. Proposal I. work together
Europeana Data Quality Committee
★ Interdisciplinary expert groups
★ Analysing/revising metadata schema
★ Functional requirement analysis
★ Problem catalog
★ Multilinguality
In North America: DLF Metadata Assessment Group (Christina Harlow)
Measuring and visualizing data quality. What to measure?
★ Structural and semantic features: cardinality, uniqueness, length, dictionary entry, data type conformance, multilinguality (schema-independent measurements)
★ Discovery scenarios or functional requirement analysis: requirements of the most important functions
★ Problem catalog: known metadata problems
Measuring and visualizing data quality. Dimensions and metrics
★ Completeness: degree to which all required information is present
  ○ schema completeness: # of classes and properties represented / total # of classes and properties
  ○ property completeness
  ○ population completeness
  ○ interlinking completeness
★ Availability: the extent to which data is present and ready for use
★ Licensing: granting of permission to re-use under defined conditions
...
Sources: Bruce and Hillmann, The Continuum of Metadata Quality (2004); Ochoa and Duval, Automatic Evaluation of Metadata Quality in Digital Repositories (2009); Zaveri et al., Quality Assessment for Linked Data: A Survey (2015)
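The schema completeness ratio (properties represented divided by the total number of properties) can be illustrated with a minimal Python sketch. It is not the framework's own code: the field list `SCHEMA_FIELDS`, the helper names, and the sample record are assumptions made up for this example.

```python
from typing import Any, Iterable, Mapping

# Hypothetical schema: a few Dublin Core properties chosen for illustration only.
SCHEMA_FIELDS = ["dc:title", "dc:description", "dc:creator", "dc:subject", "dc:rights"]


def is_present(value: Any) -> bool:
    """A property counts as represented if it holds at least one non-blank value."""
    if value is None:
        return False
    if isinstance(value, str):
        return bool(value.strip())
    if isinstance(value, (list, tuple)):
        return any(is_present(item) for item in value)
    return True


def schema_completeness(record: Mapping[str, Any],
                        fields: Iterable[str] = SCHEMA_FIELDS) -> float:
    """Return (# of properties represented) / (total # of properties)."""
    field_list = list(fields)
    if not field_list:
        return 0.0
    represented = sum(1 for name in field_list if is_present(record.get(name)))
    return represented / len(field_list)


# Sample record: title and creator are filled, description is blank,
# subject is an empty list, rights is missing.
record = {
    "dc:title": "Mona Lisa",
    "dc:description": "  ",
    "dc:creator": ["Leonardo da Vinci"],
    "dc:subject": [],
}
score = schema_completeness(record)
print(score)  # 2 of 5 properties are represented
```

Treating blank strings and empty lists as "not represented" is a design choice of this sketch; a stricter or looser rule would change the score.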
Measuring and visualizing data quality. Discovery scenarios (the most important functions)
★ Basic retrieval with high precision and recall
★ Cross-language recall
★ Entity-based facets
★ Date-based facets
★ Improved language facets
★ Browse by subjects and resource types
★ Browse by agents
★ Hierarchical search and facets
★ ...
Measuring and visualizing data quality. Metadata requirements
Questions
★ How do particular metadata elements support the function?
★ How can we score the fulfillment of a specific scenario?
Measuring and visualizing data quality. Problem catalog (“metadata anti-patterns”)
★ Title contents same as description contents
★ Systematic use of the same title
★ Bad string: “empty” (and variants)
★ Shelfmarks and other identifiers in fields
★ Creator not an agent name
★ Absurd geographical location
★ Subject field used as description field
★ Unicode issues, e.g. U+FFFD (�)
★ Very short description field
★ ...
Measuring and visualizing data quality. Problem definition
Description: Title contents same as description contents
Example: [link]
Motivation: Distorts search weightings
Checking Method: Field comparison
Notes: Record display: creator concatenated onto title
Metadata Scenario: Basic Retrieval
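A few entries of the problem catalog lend themselves to simple automated checks, such as the field comparison named as the checking method here. The Python sketch below is illustrative only: the function name, the problem labels, the regular expression for “empty” variants, and the 20-character threshold for a "very short" description are all assumptions, not part of the catalog.

```python
import re
from typing import Dict, List

# Assumed pattern for the "empty" bad string and variants such as "+++EMPTY+++".
EMPTY_PATTERN = re.compile(r"^\W*empty\W*$", re.IGNORECASE)


def detect_problems(record: Dict[str, str],
                    min_description_length: int = 20) -> List[str]:
    """Return labels of the catalog problems found in one record."""
    problems: List[str] = []
    title = record.get("title", "").strip()
    description = record.get("description", "").strip()

    # Field comparison: title contents same as description contents.
    if title and title == description:
        problems.append("title_same_as_description")
    # Bad string: "empty" and variants left over in the title.
    if EMPTY_PATTERN.match(title):
        problems.append("empty_placeholder_title")
    # Unicode issue: U+FFFD replacement character in either field.
    if "\ufffd" in title or "\ufffd" in description:
        problems.append("unicode_replacement_character")
    # Very short description (threshold is an arbitrary assumption).
    if description and len(description) < min_description_length:
        problems.append("very_short_description")
    return problems


football = {
    "title": "VOETBAL-EREDIVISIEFEYENOORD - GO AHEAD 3-1",
    "description": "VOETBAL-EREDIVISIEFEYENOORD - GO AHEAD 3-1",
}
print(detect_problems(football))
print(detect_problems({"title": "+++EMPTY+++", "description": ""}))
```

Checks such as "creator not an agent name" or "absurd geographical location" need external knowledge (authority files, gazetteers) and are deliberately left out of this sketch.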
Measuring and visualizing data quality. Proposal II. Tooling
“Metadata Quality Assurance Framework”: a generic tool for measuring metadata quality
★ adaptable to different metadata schemes
★ scalable (to Big Data)
★ understandable reports for data curators
★ open source
Measuring and visualizing data quality. Data processing workflow
★ ingest: OAI-PMH, Europeana API, Hadoop, NoSQL → json
★ measure: Spark, Hadoop, Java, Apache Solr → csv
★ statistical analysis: Spark, R → json, png
★ web interface: PHP, D3.js, highchart.js, NoSQL → html, svg
Measuring and visualizing data quality. Measurement
[Figure labels: overall view, collection view, record view; measurements, aggregated numbers]
Measuring and visualizing data quality. Field frequency per collection
[Screenshot with filters; annotations mark a collection where no record has an alternative title and one where every record has an alternative title]
Measuring and visualizing data quality. Multilinguality score
★ 0: text without language annotation (dc.subject: Germany)
★ 1: text with language annotation (dc.subject: Germany@en)
★ 2: text with several language annotations (dc.subject: Germany@en, Deutschland@de)
★ n: link to (multilingual) vocabulary (http://www.geonames.org/2921044/federal-republic-of-germany)
Multilinguality measurement with Juliane Stiller
Measuring and visualizing data quality. Multilinguality example
a ore:Proxy ; dc:subject “Ballet”, “Opera” . → 0
a ore:Proxy ; edm:europeanaProxy true ; dc:subject , . → 0
After dereferencing:
a skos:Concept .
skos:prefLabel "Ballett"@no, "बैले"@hi, "Ballett"@de, "Балет"@be, "Балет"@ru, "Balé"@pt, "Балет"@bg, "Baletas"@lt, "Balet"@hr, "Balets"@lv .
skos:prefLabel "Opera"@no, "ओपेरा (गीतनाटक)"@hi, "Oper"@de, "Ooppera"@fi, "Опера"@be, "Опера"@ru, "Ópera"@pt, "Опера"@bg, "Opera"@lt .
Distinct languages: 11
Tagged literals: 19
Literals per language: 1.7
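The three figures for the dereferenced concepts can be recomputed from the language tags of the skos:prefLabel literals. This is only a worked check in Python; the variable names are made up here, and the lists simply copy the tags from the two label sets in this example.

```python
from collections import Counter

# Language tags of the skos:prefLabel literals, one list per concept.
ballet_langs = ["no", "hi", "de", "be", "ru", "pt", "bg", "lt", "hr", "lv"]
opera_langs = ["no", "hi", "de", "fi", "be", "ru", "pt", "bg", "lt"]

# Count how many tagged literals exist for each language.
per_language = Counter(ballet_langs + opera_langs)

tagged_literals = sum(per_language.values())      # all language-tagged literals
distinct_languages = len(per_language)            # number of different tags
literals_per_language = tagged_literals / distinct_languages

print(distinct_languages, tagged_literals, round(literals_per_language, 1))
# prints: 11 19 1.7
```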
Measuring and visualizing data quality. Language frequency
[Chart legend: has language specification; has no language specification]
Measuring and visualizing data quality. Multilingual saturation
Measuring and visualizing data quality. Functionality matrix
[Matrix axes: elements, functionalities]
Measuring and visualizing data quality. Flexible measurement API
★ Addressing and iterating over schema elements
  ○ schema.getFields()
  ○ field.getPath(), field.getSubdimensions(), ...
★ Abstracting the metrics
  ○ metric1.measure(), metric2.measure()
  ○ metric1.getResult(), metric2.getResult()
★ Making the process configurable (turn metrics on and off)
  ○ configuration.enableMetricX()
  ○ configuration.disableMetricY()
★ Unified reporting data structure
  ○ Unified statistical analysis
  ○ Machine learning (to come)
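The design idea behind this API (metrics behind one interface, switched on and off by configuration, all writing into one report structure) can be sketched in Python. This is not the framework's actual code, which the slides describe with Java-style calls; every class and method name below (`Metric`, `Completeness`, `Cardinality`, `Configuration`, `measure_record`) is a hypothetical stand-in.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterable, List, Mapping

Record = Mapping[str, List[str]]


class Metric(ABC):
    """Common interface: measure() a record, then read get_result()."""
    name = "metric"

    def __init__(self, fields: Iterable[str]) -> None:
        self.fields = list(fields)
        self._result: Dict[str, float] = {}

    @abstractmethod
    def measure(self, record: Record) -> None:
        """Compute the metric for one record and store it internally."""

    def get_result(self) -> Dict[str, float]:
        return dict(self._result)


class Completeness(Metric):
    name = "completeness"

    def measure(self, record: Record) -> None:
        present = sum(1 for field in self.fields if record.get(field))
        ratio = present / len(self.fields) if self.fields else 0.0
        self._result = {"completeness": ratio}


class Cardinality(Metric):
    name = "cardinality"

    def measure(self, record: Record) -> None:
        self._result = {
            f"cardinality:{field}": float(len(record.get(field, [])))
            for field in self.fields
        }


class Configuration:
    """Keeps the registered metrics and which of them are switched on."""

    def __init__(self, metrics: Iterable[Metric]) -> None:
        self._metrics = {metric.name: metric for metric in metrics}
        self._enabled = set(self._metrics)

    def enable(self, name: str) -> None:
        if name not in self._metrics:
            raise KeyError(name)
        self._enabled.add(name)

    def disable(self, name: str) -> None:
        self._enabled.discard(name)

    def active_metrics(self) -> List[Metric]:
        return [m for n, m in self._metrics.items() if n in self._enabled]


def measure_record(record: Record, configuration: Configuration) -> Dict[str, float]:
    """Run every enabled metric and merge the results into one flat report."""
    report: Dict[str, float] = {}
    for metric in configuration.active_metrics():
        metric.measure(record)
        report.update(metric.get_result())
    return report


fields = ["dc:title", "dc:description"]
config = Configuration([Completeness(fields), Cardinality(fields)])
config.disable("cardinality")
report = measure_record({"dc:title": ["Mona Lisa"]}, config)
print(report)
```

Because every metric returns the same flat key-to-number mapping, the reports of many records can be written as rows of one CSV file, which is presumably what makes a unified statistical analysis step possible.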
Measuring and visualizing data quality. Batch API
client ↔ Metadata QA
measurement
★ /batch/measuring/start → sessionID
★ /batch/[recordId] (for each record) → csv
★ /batch/measuring/stop → “success” | “failure”
analysis
★ /batch/analyzing/start → “success” | “failure”
★ /batch/analyzing/status (periodically) → “in progress” | “ready”
★ /batch/analyzing/retrieve → compressed package
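The call sequence of the batch API can be sketched from the client's side. The slides give only the paths and the responses, so this Python sketch makes several assumptions: the HTTP layer is abstracted into a `call(path)` function that returns the response body as text, the way the session ID is passed back to the service is left out, and `run_batch`, `fake_service`, and the polling limits are hypothetical names and values.

```python
import time
from typing import Callable, Dict, Iterable, List, Union


def run_batch(call: Callable[[str], str],
              record_ids: Iterable[str],
              poll_seconds: float = 1.0,
              max_polls: int = 60) -> Dict[str, Union[str, List[str]]]:
    """Measure each record, then start the analysis and wait for its result."""
    # Measurement phase.
    session_id = call("/batch/measuring/start")
    csv_rows = [call(f"/batch/{record_id}") for record_id in record_ids]
    if call("/batch/measuring/stop") != "success":
        raise RuntimeError("measuring did not stop cleanly")

    # Analysis phase: start, then poll the status periodically.
    if call("/batch/analyzing/start") != "success":
        raise RuntimeError("analysis could not be started")
    for _ in range(max_polls):
        if call("/batch/analyzing/status") == "ready":
            package = call("/batch/analyzing/retrieve")
            return {"session_id": session_id, "csv": csv_rows, "package": package}
        time.sleep(poll_seconds)
    raise TimeoutError("analysis was not ready in time")


def fake_service() -> Callable[[str], str]:
    """In-memory stand-in for the service, used only to exercise the sketch."""
    statuses = iter(["in progress", "ready"])

    def call(path: str) -> str:
        if path == "/batch/measuring/start":
            return "session-1"
        if path in ("/batch/measuring/stop", "/batch/analyzing/start"):
            return "success"
        if path == "/batch/analyzing/status":
            return next(statuses)
        if path == "/batch/analyzing/retrieve":
            return "package.zip"
        # Any other path is treated as /batch/[recordId]; return a made-up CSV row.
        return f"{path.rsplit('/', 1)[-1]},0.5"

    return call


result = run_batch(fake_service(), ["r1", "r2"], poll_seconds=0)
print(result)
```

Passing the transport in as a function keeps the sketch independent of any HTTP library and of details the slides do not state, such as request methods and parameters.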
Measuring and visualizing data quality. Further steps
human analysis
★ Translate the results into documentation, recommendations
★ Communication with data providers
★ Human evaluation of metadata quality
★ Cooperation with other projects
technical
★ Incorporating into Europeana’s new ingestion tool
★ Shape Constraint Language (SHACL) for defining patterns
★ Process usage statistics
★ Measuring changes over time
★ Machine learning based classification & clustering
Measuring and visualizing data quality. Links
★ Europeana Data Quality Committee: http://pro.europeana.eu/europeana-tech/data-quality-committee
★ site: http://144.76.218.178/europeana-qa/
★ source code (GPL): http://pkiraly.github.io/about/#source-codes
★ data (CC0): http://hdl.handle.net/21.11101/0000-0001-781F-7
★ bibliography: http://zotero.org/groups/metadata_assessment
★ DLF Metadata Assessment Group: http://dlfmetadataassessment.github.io
★ contact: Valentine Charles (@valentinec89), Péter Király (@kiru), #metadata, #metadataquality
★ these slides: http://bit.ly/mq-elag2017