Towards an extensible measurement of metadata quality
Péter Király
Digital Access to Textual Cultural Heritage (DATeCH), Göttingen, 2017-06-02
Towards metadata measurement. Glossary
★ Metadata here: cultural heritage metadata (descriptions of books etc.)
★ Europeana: a metadata aggregator from 3500+ cultural heritage institutions, http://europeana.eu
★ Big Data here: 10-100 million metadata records, 100 GB - 1.5 TB
★ EDM: Europeana Data Model, Europeana’s metadata schema
★ MARC: MAchine-Readable Cataloging, a library metadata standard
Towards metadata measurement. Multilinguality problem
★ Mona Lisa → 456 results
★ La Gioconda → 365 results
★ La Joconde → 71 results
http://www.europeana.eu/portal/en/record/90402/RP_F_00_351.html
Towards metadata measurement. Generic title and bad thumbnail
more examples in Report and Recommendations from the Task Force on Metadata Quality (2015)
Towards metadata measurement. Problems with title
★ Same title and description: title: "VOETBAL-EREDIVISIEFEYENOORD - GO AHEAD 3-1", description: "VOETBAL-EREDIVISIEFEYENOORD - GO AHEAD 3-1"
★ Machine-readable ID in title: title: "NLD-820630-AMSTERDAM: Straatmuzikanten proberen geld te verdienen voor..."
★ Leftover: title: "+++EMPTY+++"
more examples in Report and Recommendations from the Task Force on Metadata Quality (2015)
Towards metadata measurement. Copy & paste cataloging
Towards metadata measurement. Why is data quality important?
“Fitness for purpose” (QA principle); purpose: to access content
no metadata → no access to data → no data usage
more explanation: Data on the Web Best Practices, W3C Working Draft, https://www.w3.org/TR/dwbp/
Towards metadata measurement. Hypothesis
by measuring structural elements we can approximate metadata record quality ≃ metadata smell
Towards metadata measurement. Purposes
★ improve the metadata
★ services: good data → reliable functions
★ better metadata schema & documentation
★ propagate “good practice”
Measuring metadata quality. Proposal I. work together
Europeana Data Quality Committee, DLF Metadata Assessment Group
★ Interdisciplinary expert groups
★ Analysing/revising metadata schema
★ Functional requirement analysis
★ Problem catalog
★ Multilinguality
Measuring metadata quality. Proposal II. Tooling
“Metadata Quality Assurance Framework”: a generic tool for measuring metadata quality
★ adaptable to different metadata schemes
★ scalable (to Big Data)
★ understandable reports for data curators
★ open source
Towards metadata measurement. Data processing workflow
★ ingest: OAI-PMH, Europeana API, Hadoop, NoSQL → json
★ measure: Spark, Hadoop, Java, Apache Solr → csv
★ statistical analysis: Spark, R → json, png
★ web interface: PHP, D3.js, highchart.js, NoSQL → html, svg
Towards metadata measurement. What to measure?
★ Structural and semantic features: cardinality, uniqueness, length, dictionary entry, data type conformance, multilinguality (schema-independent measurements)
★ Functional requirement analysis / Discovery scenarios: requirements of the most important functions
★ Problem catalog: known metadata problems
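A problem catalog can be read as a set of small, independent checks. The sketch below encodes the three title problems shown earlier (same title and description, machine-readable ID in title, leftover placeholder). The class name, the problem labels and the ID pattern are assumptions made for illustration; they are not taken from the Metadata QA framework.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical problem-catalog checks; class name, labels and patterns are illustrative. */
final class ProblemCatalogSketch {

    // Assumed ID shape, modelled only on the example "NLD-820630-AMSTERDAM:".
    private static final Pattern ID_PREFIX =
        Pattern.compile("^[A-Z]{2,}-\\d{6}-[A-Z]+:");

    // Placeholder string taken from the "+++EMPTY+++" example.
    private static final String EMPTY_MARKER = "+++EMPTY+++";

    /** Returns the labels of the known problems found in one record. */
    static List<String> detect(String title, String description) {
        String t = title == null ? "" : title.trim();
        String d = description == null ? "" : description.trim();
        List<String> problems = new ArrayList<>();
        if (t.equals(EMPTY_MARKER)) {
            problems.add("leftover placeholder");
        }
        if (!t.isEmpty() && t.equals(d)) {
            problems.add("same title and description");
        }
        if (ID_PREFIX.matcher(t).find()) {
            problems.add("machine-readable ID in title");
        }
        return problems;
    }

    public static void main(String[] args) {
        System.out.println(detect("+++EMPTY+++", null));
        System.out.println(detect(
            "NLD-820630-AMSTERDAM: Straatmuzikanten proberen geld te verdienen voor...", null));
    }
}
```

Each check returns a label rather than a score, so counts per collection can be aggregated later in the statistical analysis step.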
Towards metadata measurement. Dimensions and metrics
★ Completeness: degree to which all required information is present
  ○ CM1: schema completeness - no. of classes and properties represented / total no. of classes and properties
  ○ CM2: property completeness
  ○ CM3: population completeness
  ○ CM4: interlinking completeness
★ Availability: the extent to which data is present and ready for use
★ Licensing: granting of permission to re-use under defined conditions
...
Ngomo et al., Introduction to Linked Data and Its Lifecycle on the Web (2014)
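CM1 is a simple ratio, so it can be sketched in a few lines. The example below treats a record as a map from field name to values and counts a field as represented when it has at least one non-blank value. The record model, the class name and the field names in `main` are assumptions for illustration only.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of CM1 (schema completeness) for a single record. */
final class CompletenessSketch {

    /**
     * Number of schema fields represented in the record divided by the
     * total number of schema fields. Returns 0.0 for an empty schema.
     */
    static double schemaCompleteness(List<String> schemaFields,
                                     Map<String, List<String>> record) {
        if (schemaFields.isEmpty()) {
            return 0.0;
        }
        long represented = schemaFields.stream()
            .filter(field -> record.getOrDefault(field, List.of()).stream()
                .anyMatch(value -> value != null && !value.isBlank()))
            .count();
        return (double) represented / schemaFields.size();
    }

    public static void main(String[] args) {
        // Field names are examples only; any schema can be passed in.
        List<String> schema = List.of("dc:title", "dc:description", "dc:subject", "dcterms:alternative");
        Map<String, List<String>> record = Map.of(
            "dc:title", List.of("Mona Lisa"),
            "dc:subject", List.of("Portrait"),
            "dc:description", List.of(" "));
        System.out.println(schemaCompleteness(schema, record));
    }
}
```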
Towards metadata measurement. Metadata requirements // User scenario
As a user I want to be able to filter by whether a person is the subject of a book, or its author, engraver, printer etc.
Metadata analysis: description of relevant metadata elements and their rules
Measurement rules:
★ the relevant field values should be resolvable URIs
★ each URI should be associated with labels in multiple languages
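The two measurement rules can be turned into a per-value check. The sketch below is a deliberately offline approximation: instead of dereferencing the URI over the network, it only tests that the value is an absolute http(s) URI with a host, and it takes the labels as an already fetched map from language code to label. Both simplifications, and all names, are assumptions for illustration.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Map;

/** Hypothetical check for the two measurement rules; no network access is performed. */
final class EntityLinkRuleSketch {

    /** Syntactic stand-in for "resolvable": an absolute http(s) URI with a host. */
    static boolean looksResolvable(String value) {
        if (value == null) {
            return false;
        }
        try {
            URI uri = new URI(value);
            String scheme = uri.getScheme();
            return uri.getHost() != null
                && ("http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme));
        } catch (URISyntaxException e) {
            return false; // plain text such as a personal name with spaces
        }
    }

    /** Labels are keyed by language code; "multiple" is read here as two or more. */
    static boolean hasMultilingualLabels(Map<String, String> labelsByLanguage) {
        return labelsByLanguage.size() >= 2;
    }

    static boolean passes(String value, Map<String, String> labelsByLanguage) {
        return looksResolvable(value) && hasMultilingualLabels(labelsByLanguage);
    }

    public static void main(String[] args) {
        String uri = "http://www.geonames.org/2921044/federal-republic-of-germany";
        System.out.println(passes(uri, Map.of("en", "Germany", "de", "Deutschland")));
        System.out.println(passes("Rembrandt van Rijn", Map.of()));
    }
}
```

A production check would replace `looksResolvable` with an actual HTTP request and would need caching, since the same vocabulary URIs recur across many records.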
Towards metadata measurement. Requirements // element—function map
Europeana sub-dimensions
MARC Summary of Mapping to User Tasks http://www.loc.gov/marc/marc-functional-analysis/source/analysis.pdf
Towards metadata measurement. Measurement links
Views: overall view, collection view (aggregated statistics), record view (measurements)
Metrics: Completeness, Field cardinality, Uniqueness, Multilinguality, Language specification, Problem catalog etc.
Towards metadata measurement. Field frequency per collection
Figure: field frequency per collection, with filters. Annotated extremes: "no record has alternative title" and "every record has alternative title".
Towards metadata measurement. Distinct Languages
★ Text w/o language annotation (dc.subject: Germany): 0
★ Text w language annotation (dc.subject: Germany@en): 1
★ Text w several language annotations (dc.subject: Germany@en, Deutschland@de): 2
★ Link to (multilingual) vocabulary (http://www.geonames.org/2921044/federal-republic-of-germany): n
Multilinguality measurement with Juliane Stiller
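The scoring above amounts to counting distinct language tags among the values of a field. A minimal sketch, assuming the values arrive as strings in the `text@lang` notation used on the slide (the class name and the tag pattern are illustrative, and dereferencing a vocabulary link is out of scope here):

```java
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical counter for distinct language annotations in one field. */
final class DistinctLanguagesSketch {

    // Assumed notation: a 2-3 letter language tag, optionally with subtags, after the last '@'.
    private static final Pattern TAGGED =
        Pattern.compile("^(.+)@([A-Za-z]{2,3}(?:-[A-Za-z0-9]+)*)$");

    /** Returns the number of distinct language tags; untagged values contribute nothing. */
    static int distinctLanguages(List<String> values) {
        Set<String> languages = new TreeSet<>();
        for (String value : values) {
            Matcher m = TAGGED.matcher(value);
            if (m.matches()) {
                languages.add(m.group(2).toLowerCase(Locale.ROOT));
            }
        }
        return languages.size();
    }

    public static void main(String[] args) {
        System.out.println(distinctLanguages(List.of("Germany")));
        System.out.println(distinctLanguages(List.of("Germany@en", "Deutschland@de")));
    }
}
```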
Towards metadata measurement. Record level
a ore:Proxy ; dc:subject “Ballet”, “Opera” . → 0
a ore:Proxy ; edm:europeanaProxy true ; dc:subject , . → 0
dereferencing:
a skos:Concept . skos:prefLabel "Ballett"@no, "बैले"@hi, "Ballett"@de, "Балет"@be, "Балет"@ru , "Balé"@pt, "Балет"@bg, "Baletas"@lt, "Balet"@hr, "Balets"@lv .
skos:prefLabel "Opera"@no, "ओपेरा (गी तनाटक)"@hi, "Oper"@de, "Ooppera"@fi , "Опера"@be, "Опера"@ru, "Ópera"@pt, "Опера"@bg, "Opera"@lt .
★ Distinct languages: 11
★ Tagged literals: 19
★ Literals per language: 1.7
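The three record-level figures follow directly from the dereferenced labels: 10 labels for the first concept plus 9 for the second give 19 tagged literals, they cover 11 distinct languages, and 19 / 11 ≈ 1.7 literals per language. The sketch below recomputes them; the `Literal` class and method names are assumptions, and `slideExample()` keeps only the language tags of the labels shown above.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical record-level multilinguality metrics. */
final class RecordMultilingualitySketch {

    /** A literal with an optional language tag (null when untagged). */
    static final class Literal {
        final String value;
        final String lang;

        Literal(String value, String lang) {
            this.value = value;
            this.lang = lang;
        }
    }

    static int taggedLiterals(List<Literal> literals) {
        int count = 0;
        for (Literal literal : literals) {
            if (literal.lang != null) {
                count++;
            }
        }
        return count;
    }

    static int distinctLanguages(List<Literal> literals) {
        Set<String> languages = new HashSet<>();
        for (Literal literal : literals) {
            if (literal.lang != null) {
                languages.add(literal.lang);
            }
        }
        return languages.size();
    }

    /** Tagged literals divided by distinct languages; 0.0 when nothing is tagged. */
    static double literalsPerLanguage(List<Literal> literals) {
        int languages = distinctLanguages(literals);
        return languages == 0 ? 0.0 : (double) taggedLiterals(literals) / languages;
    }

    /** Language tags of the two dereferenced concepts on the slide. */
    static List<Literal> slideExample() {
        List<Literal> labels = new ArrayList<>();
        for (String lang : List.of("no", "hi", "de", "be", "ru", "pt", "bg", "lt", "hr", "lv")) {
            labels.add(new Literal("Ballet label", lang));
        }
        for (String lang : List.of("no", "hi", "de", "fi", "be", "ru", "pt", "bg", "lt")) {
            labels.add(new Literal("Opera label", lang));
        }
        return labels;
    }

    public static void main(String[] args) {
        List<Literal> labels = slideExample();
        System.out.println(distinctLanguages(labels));
        System.out.println(taggedLiterals(labels));
        System.out.println(literalsPerLanguage(labels));
    }
}
```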
Towards metadata measurement. Multilingual saturation I.
Measuring metadata quality. Language frequency
has language specification
has no language specification
Towards metadata measurement. Flexible measurement
API
★ Addressing and iterating over schema elements
  ○ schema.getFields()
  ○ field.getPath(), field.getSubdimensions(), ...
★ Abstracting the metrics
  ○ metric1.measure(), metric2.measure()
  ○ metric1.getResult(), metric2.getResult()
★ Making the process configurable (turn metrics on and off)
  ○ configuration.enableMetricX()
  ○ configuration.disableMetricY()
★ Unified reporting data structure
★ Unified statistical analysis
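The design points above (a common metric abstraction, switchable metrics, one unified result structure) can be illustrated with a small sketch. The method names `measure()` and `getResult()` echo the slide, but every type and signature below is hypothetical and should not be read as the actual metadata-qa-api interfaces.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical configuration: runs the enabled metrics and merges results into one flat row. */
final class MeasurementConfiguration {
    private final Map<String, Metric> metrics = new LinkedHashMap<>();
    private final Set<String> disabled = new HashSet<>();

    MeasurementConfiguration register(Metric metric) {
        metrics.put(metric.getName(), metric);
        return this;
    }

    MeasurementConfiguration enable(String name) {
        disabled.remove(name);
        return this;
    }

    MeasurementConfiguration disable(String name) {
        disabled.add(name);
        return this;
    }

    /** Unified reporting structure: "metric:fieldPath" mapped to a numeric score. */
    Map<String, Double> measure(Map<String, List<String>> record) {
        Map<String, Double> row = new LinkedHashMap<>();
        for (Metric metric : metrics.values()) {
            if (disabled.contains(metric.getName())) {
                continue;
            }
            metric.measure(record);
            metric.getResult().forEach((path, score) -> row.put(metric.getName() + ":" + path, score));
        }
        return row;
    }

    public static void main(String[] args) {
        Map<String, List<String>> record = Map.of("dc:subject", List.of("Ballet", "Opera"));
        MeasurementConfiguration config = new MeasurementConfiguration()
            .register(new FieldCardinality(List.of("dc:title", "dc:subject")));
        System.out.println(config.measure(record));
        System.out.println(config.disable("cardinality").measure(record));
    }
}

/** Hypothetical metric abstraction shared by all measurements. */
interface Metric {
    String getName();
    void measure(Map<String, List<String>> record);
    Map<String, Double> getResult();
}

/** Example metric: the number of values in each configured field. */
final class FieldCardinality implements Metric {
    private final List<String> fieldPaths;
    private final Map<String, Double> result = new LinkedHashMap<>();

    FieldCardinality(List<String> fieldPaths) {
        this.fieldPaths = fieldPaths;
    }

    public String getName() {
        return "cardinality";
    }

    public void measure(Map<String, List<String>> record) {
        result.clear();
        for (String path : fieldPaths) {
            result.put(path, (double) record.getOrDefault(path, List.of()).size());
        }
    }

    public Map<String, Double> getResult() {
        return result;
    }
}
```

Because every metric reports into the same flat row, the output maps naturally onto the csv files that the measurement step hands over to statistical analysis.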
Towards metadata measurement. Modules
★ metadata-qa-api
★ europeana-qa-api
★ europeana-qa-spark
★ marc-qa-api*
★ europeana-qa-rest
★ ddb-qa-api*
Maven coordinates (groupId, artifactId, version): de.gwdg.metadataqa, metadata-qa-api, 0.4; de.gwdg.metadataqa, europeana-qa-api, 0.4; ...
Towards metadata measurement. Batch API
client ↔ Metadata QA
measurement
★ /batch/measuring/start → sessionID
★ /batch/[recordId] (for each record) → csv
★ /batch/measuring/stop → “success” | “failure”
analysis
★ /batch/analyzing/start → “success” | “failure”
★ /batch/analyzing/status (periodically) → “in progress” | “ready”
★ /batch/analyzing/retrieve → compressed package
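The call sequence above can be sketched as a small client. To keep the example runnable without a server, the HTTP layer is replaced by an injectable function from request path to response body. How the session ID is passed back to the service is not stated on the slide; the `?sessionId=` query parameter below is purely an assumption, as are the class and method names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Hypothetical client for the batch workflow; the transport maps a request path to a response body. */
final class BatchClientSketch {

    /** Per-record csv output plus the final analysis package. */
    static final class Result {
        final List<String> csv;
        final String packageBody;

        Result(List<String> csv, String packageBody) {
            this.csv = csv;
            this.packageBody = packageBody;
        }
    }

    private final Function<String, String> transport;
    private final int maxPolls;

    BatchClientSketch(Function<String, String> transport, int maxPolls) {
        this.transport = transport;
        this.maxPolls = maxPolls;
    }

    Result run(List<String> recordIds) {
        String sessionId = transport.apply("/batch/measuring/start");
        String session = "?sessionId=" + sessionId; // assumption: not specified on the slide

        List<String> csv = new ArrayList<>();
        for (String recordId : recordIds) {
            csv.add(transport.apply("/batch/" + recordId + session));
        }
        requireSuccess("measuring/stop", transport.apply("/batch/measuring/stop" + session));
        requireSuccess("analyzing/start", transport.apply("/batch/analyzing/start" + session));

        // A real client would wait between polls; the sketch only bounds their number.
        for (int i = 0; i < maxPolls; i++) {
            if ("ready".equals(transport.apply("/batch/analyzing/status" + session))) {
                return new Result(csv, transport.apply("/batch/analyzing/retrieve" + session));
            }
        }
        throw new IllegalStateException("analysis not ready after " + maxPolls + " polls");
    }

    private static void requireSuccess(String step, String response) {
        if (!"success".equals(response)) {
            throw new IllegalStateException(step + " returned: " + response);
        }
    }

    public static void main(String[] args) {
        // Canned responses stand in for the service.
        Function<String, String> canned = path -> {
            if (path.equals("/batch/measuring/start")) return "demo-session";
            if (path.startsWith("/batch/analyzing/status")) return "ready";
            if (path.startsWith("/batch/analyzing/retrieve")) return "<compressed package>";
            if (path.startsWith("/batch/measuring/stop")
                || path.startsWith("/batch/analyzing/start")) return "success";
            return "<csv row>";
        };
        Result result = new BatchClientSketch(canned, 5).run(List.of("record-1", "record-2"));
        System.out.println(result.csv + " " + result.packageBody);
    }
}
```

Swapping the canned function for one backed by an HTTP client would turn the sketch into a working client, provided the assumed session handling matches the service.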
Towards metadata measurement. Community bibliography
★ zotero.org/groups/metadata_assessment
★ dlfmetadataassessment.github.io
Towards metadata measurement. Further steps
human analysis
★ Translate the results into documentation, recommendations
★ Communication with data providers
★ Human evaluation of metadata quality
★ Cooperation with other projects
technical
★ Incorporating into ingestion process
★ Shapes Constraint Language (SHACL) for defining patterns
★ Process usage statistics
★ Measuring changes of scores
★ Machine learning based classification & clustering
Towards metadata measurement. Links
★ Europeana Data Quality Committee // http://pro.europeana.eu/europeana-tech/data-quality-committee
★ site // http://144.76.218.178/europeana-qa/
★ source code (GPL v3.0) // http://pkiraly.github.io/about/#source-codes
★ Europeana data (CC0) // http://hdl.handle.net/21.11101/0000-0001-781F-7
★ Library of Congress data (OA) // http://www.loc.gov/cds/products/marcDist.php
★ contact: [email protected], @kiru