Towards an extensible measurement of metadata quality
Péter Király
Digital Access to Textual Cultural Heritage (DATeCH), Göttingen, 2017-06-02

Towards metadata measurement. Glossary

★ Metadata: here, cultural heritage metadata (descriptions of books etc.)
★ Europeana: a metadata aggregator from 3500+ cultural heritage institutions, http://europeana.eu
★ Big Data: here, 10-100 million metadata records, 100 GB - 1.5 TB
★ EDM: Europeana Data Model, Europeana’s metadata schema
★ MARC: MAchine-Readable Cataloging, a library metadata standard

Towards metadata measurement. Multilinguality problem



The same painting searched under three names:

★ Mona Lisa: 456 results
★ La Gioconda: 365 results
★ La Joconde: 71 results

http://www.europeana.eu/portal/en/record/90402/RP_F_00_351.html

Towards metadata measurement. Generic title and bad thumbnail

more examples in Report and Recommendations from the Task Force on Metadata Quality (2015)


Towards metadata measurement. Problems with title

★ Same title and description
title: "VOETBAL-EREDIVISIEFEYENOORD - GO AHEAD 3-1", description: "VOETBAL-EREDIVISIEFEYENOORD - GO AHEAD 3-1"

★ Machine-readable ID in title
title: "NLD-820630-AMSTERDAM: Straatmuzikanten proberen geld te verdienen voor..."

★ Leftover
title: "+++EMPTY+++"

more examples in Report and Recommendations from the Task Force on Metadata Quality (2015)
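The three title problems listed on this slide can each be turned into a simple automatic check. The sketch below is illustrative only: the function name, the record layout (a plain dict), the placeholder list and the regular expression for an ID-like prefix are all assumptions modelled on the examples shown here, not rules taken from the Task Force report.

```python
import re

# Assumption: the only placeholder known from the slide is "+++EMPTY+++".
PLACEHOLDER_TITLES = {"+++EMPTY+++"}

# Assumption: an ID-like prefix looks like "NLD-820630-AMSTERDAM:".
ID_PREFIX = re.compile(r"^[A-Z]{2,4}-\d{6}-[A-Z]+:")


def title_problems(record):
    """Return the names of the title problems found in one record (a dict)."""
    title = (record.get("title") or "").strip()
    description = (record.get("description") or "").strip()
    problems = []
    if title and title == description:
        problems.append("same title and description")
    if ID_PREFIX.match(title):
        problems.append("machine-readable ID in title")
    if title in PLACEHOLDER_TITLES:
        problems.append("leftover placeholder")
    return problems


records = [
    {"title": "VOETBAL-EREDIVISIEFEYENOORD - GO AHEAD 3-1",
     "description": "VOETBAL-EREDIVISIEFEYENOORD - GO AHEAD 3-1"},
    {"title": "NLD-820630-AMSTERDAM: Straatmuzikanten proberen geld te verdienen voor..."},
    {"title": "+++EMPTY+++"},
]
for record in records:
    print(title_problems(record))
```

Each check is independent, so further entries of a problem catalog could be added as extra `if` branches without touching the existing ones.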


Towards metadata measurement. Copy & paste cataloging


Towards metadata measurement. Why is data quality important?

“Fitness for purpose” (QA principle)
purpose: to access content

no metadata → no access to data → no data usage

more explanation: Data on the Web Best Practices, W3C Working Draft, https://www.w3.org/TR/dwbp/

Towards metadata measurement. Hypothesis

by measuring structural elements we can approximate metadata record quality ≃ metadata smell


Towards metadata measurement. Purposes

★ improve the metadata
★ services: good data → reliable functions
★ better metadata schema & documentation
★ propagate “good practice”

Measuring metadata quality. Proposal I. work together

Europeana Data Quality Committee, DLF Metadata Assessment Group

★ Interdisciplinary expert groups
★ Analysing/revising metadata schema
★ Functional requirement analysis
★ Problem catalog
★ Multilinguality

Measuring metadata quality. Proposal II. Tooling

“Metadata Quality Assurance Framework”, a generic tool for measuring metadata quality

★ adaptable to different metadata schemes
★ scalable (to Big Data)
★ understandable reports for data curators
★ open source

Towards metadata measurement. Data processing workflow

stage: technologies → output format

★ ingest: OAI-PMH, Europeana API, Hadoop, NoSQL → json
★ measure: Spark, Hadoop, Java, Apache Solr → csv
★ statistical analysis: Spark, R → json, png
★ web interface: PHP, D3.js, highchart.js, NoSQL → html, svg

Towards metadata measurement. What to measure?

★ Structural and semantic features (schema-independent measurements): cardinality, uniqueness, length, dictionary entry, data type conformance, multilinguality
★ Functional requirement analysis / Discovery scenarios: requirements of the most important functions
★ Problem catalog: known metadata problems

Towards metadata measurement. Dimensions and metrics

★ Completeness: degree to which all required information is present
  ○ CM1: schema completeness - no. of classes and properties represented / total no. of classes and properties
  ○ CM2: property completeness
  ○ CM3: population completeness
  ○ CM4: interlinking completeness
★ Availability: the extent to which data is present and ready for use
★ Licensing: granting of permission to re-use under defined conditions
...

Ngomo et al., Introduction to Linked Data and Its Lifecycle on the Web (2014)
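As a worked example of CM1, the ratio can be computed for a single record. This is a simplified sketch under stated assumptions: it counts only properties (not classes), treats a property as "represented" when it has at least one non-empty value, and uses a hypothetical five-property schema and record that do not come from the slides.

```python
def schema_completeness(record, schema_properties):
    """CM1 (simplified): share of schema properties with a non-empty value."""
    if not schema_properties:
        raise ValueError("schema_properties must not be empty")
    represented = sum(1 for name in schema_properties if record.get(name))
    return represented / len(schema_properties)


# Hypothetical schema and record, for illustration only.
SCHEMA = ["dc:title", "dc:description", "dc:subject", "dc:creator", "dcterms:alternative"]
record = {
    "dc:title": ["Mona Lisa"],
    "dc:creator": ["Leonardo da Vinci"],
    "dc:subject": [],  # present but empty, so not counted
}

score = schema_completeness(record, SCHEMA)
print(score)  # 2 of 5 properties are represented
```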

Towards metadata measurement. Metadata requirements // User scenario

As a user I want to be able to filter by whether a person is the subject of a book, or its author, engraver, printer etc.

Metadata analysis
Description of relevant metadata elements and their rules

Measurement rules
★ the relevant field values should be resolvable URIs
★ each URI should be associated with labels in multiple languages
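The two measurement rules of this scenario can be sketched as a check over the values of one field. Several things here are assumptions: the function names, the example.org URIs, the label lookup table, and above all the use of a purely syntactic test as a stand-in for "resolvable" (actually resolving a URI would need a network request).

```python
from urllib.parse import urlparse


def looks_like_http_uri(value):
    """Syntactic stand-in for 'resolvable': an http(s) URI with a host part."""
    parsed = urlparse(value)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def check_rules(field_values, labels_by_uri, min_languages=2):
    """Return (value, problem) pairs for values that break either rule."""
    issues = []
    for value in field_values:
        if not looks_like_http_uri(value):
            issues.append((value, "not a URI"))
            continue
        languages = set(labels_by_uri.get(value, {}))
        if len(languages) < min_languages:
            issues.append((value, "labels in fewer than %d languages" % min_languages))
    return issues


# Hypothetical field values and labels, for illustration only.
values = ["http://example.org/agent/1", "http://example.org/agent/2", "Rembrandt"]
labels = {
    "http://example.org/agent/1": {"en": "Rembrandt", "nl": "Rembrandt van Rijn"},
    "http://example.org/agent/2": {"en": "Unknown engraver"},
}
issues = check_rules(values, labels)
for value, problem in issues:
    print(value, "->", problem)
```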

Towards metadata measurement. Requirements // element—function map

Europeana sub-dimensions

MARC Summary of Mapping to User Tasks http://www.loc.gov/marc/marc-functional-analysis/source/analysis.pdf


Towards metadata measurement. Measurement links

Views: overall view, collection view (aggregated statistics), record view (measurements)

Metrics: Completeness, Field cardinality, Uniqueness, Multilinguality, Language specification, Problem catalog etc.

Towards metadata measurement. Field frequency per collections

[Figure: field frequency per collection, with filters. Annotated extremes: no record has alternative title; every record has alternative title]
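Field frequency per collection can be read as the share of a collection's records that carry a given field, so the two extremes named on the slide correspond to 0 and 1. A minimal sketch, assuming records are dicts with a `collection` key and using `dcterms:alternative` as the alternative title field; the collection names and records are made up.

```python
from collections import defaultdict


def field_frequency(records, field):
    """Per collection, the share of records with a non-empty value for `field`."""
    totals = defaultdict(int)
    with_field = defaultdict(int)
    for record in records:
        collection = record["collection"]
        totals[collection] += 1
        if record.get(field):
            with_field[collection] += 1
    return {name: with_field[name] / totals[name] for name in totals}


# Hypothetical records, for illustration only.
records = [
    {"collection": "A"},
    {"collection": "A"},
    {"collection": "B", "dcterms:alternative": ["La Gioconda"]},
    {"collection": "B", "dcterms:alternative": ["La Joconde"]},
    {"collection": "C", "dcterms:alternative": ["Mona Lisa"]},
    {"collection": "C"},
]
frequency = field_frequency(records, "dcterms:alternative")
print(frequency)
```

In this toy data, collection A is the "no record has alternative title" case and collection B the "every record has alternative title" case.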

Towards metadata measurement. Distinct Languages

★ Text without language annotation (dc.subject: Germany): 0
★ Text with language annotation (dc.subject: Germany@en): 1
★ Text with several language annotations (dc.subject: Germany@en, Deutschland@de): 2
★ Link to (multilingual) vocabulary (http://www.geonames.org/2921044/federal-republic-of-germany): n

Multilinguality measurement with Juliane Stiller

Towards metadata measurement. Record level

a ore:Proxy ; dc:subject "Ballet", "Opera" .
→ 0

a ore:Proxy ; edm:europeanaProxy true ; dc:subject , .
→ 0

dereferencing ↓

a skos:Concept . skos:prefLabel "Ballett"@no, "बैले"@hi, "Ballett"@de, "Балет"@be, "Балет"@ru, "Balé"@pt, "Балет"@bg, "Baletas"@lt, "Balet"@hr, "Balets"@lv .
skos:prefLabel "Opera"@no, "ओपेरा (गीतनाटक)"@hi, "Oper"@de, "Ooppera"@fi, "Опера"@be, "Опера"@ru, "Ópera"@pt, "Опера"@bg, "Opera"@lt .

★ Distinct languages: 11
★ Tagged literals: 19
★ Literals per language: 1.7
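The three record-level figures can be recomputed from the language tags of the dereferenced labels. The sketch assumes that "literals per language" is the number of tagged literals divided by the number of distinct languages, which reproduces the slide's values; the function name and the dictionary keys are stand-ins, since the concept URIs are not shown.

```python
def multilinguality(labels_by_concept):
    """Return (distinct languages, tagged literals, literals per language)."""
    tags = [lang for languages in labels_by_concept.values() for lang in languages]
    distinct = len(set(tags))
    per_language = len(tags) / distinct if distinct else 0.0
    return distinct, len(tags), per_language


# Language tags of the skos:prefLabel values listed on the slide.
labels = {
    "ballet": ["no", "hi", "de", "be", "ru", "pt", "bg", "lt", "hr", "lv"],
    "opera": ["no", "hi", "de", "fi", "be", "ru", "pt", "bg", "lt"],
}
distinct, tagged, per_language = multilinguality(labels)
print(distinct, tagged, round(per_language, 1))  # prints: 11 19 1.7
```

The same function returns zeros for a record whose values carry no language tags, which matches the first two cases on the slide.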

Towards metadata measurement. Multilingual saturation I.


Measuring metadata quality. Language frequency

has language specification

has no language specification


Towards metadata measurement. Flexible measurement

API
★ Addressing and iterating over schema elements
  ○ schema.getFields()
  ○ field.getPath(), field.getSubdimensions(), ...
★ Abstracting the metrics
  ○ metric1.measure(), metric2.measure()
  ○ metric1.getResult(), metric2.getResult()
★ Making the process configurable (turn metrics on and off)
  ○ configuration.enableMetricX()
  ○ configuration.disableMetricY()
★ Unified reporting data structure; unified statistical analysis
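The pattern behind these bullet points (iterate over schema fields, run interchangeable metrics, switch them on and off, merge everything into one report) can be sketched in a few lines. This is a Python illustration of the idea only: the slides list Java among the measurement technologies, and every class and method name below is a hypothetical snake_case mirror of the slide's names, not the actual API of the framework.

```python
class Field:
    def __init__(self, path, subdimensions=()):
        self.path = path
        self.subdimensions = tuple(subdimensions)


class Schema:
    def __init__(self, fields):
        self._fields = list(fields)

    def get_fields(self):
        return list(self._fields)


class Completeness:
    name = "completeness"

    def measure(self, record, schema):
        fields = schema.get_fields()
        present = sum(1 for field in fields if record.get(field.path))
        self._result = present / len(fields)

    def get_result(self):
        return {self.name: self._result}


class Cardinality:
    name = "cardinality"

    def measure(self, record, schema):
        self._result = {f.path: len(record.get(f.path, [])) for f in schema.get_fields()}

    def get_result(self):
        return {"%s:%s" % (self.name, path): n for path, n in self._result.items()}


class Configuration:
    def __init__(self, metrics):
        self._metrics = {metric.name: metric for metric in metrics}
        self._enabled = set(self._metrics)

    def enable(self, name):
        self._enabled.add(name)

    def disable(self, name):
        self._enabled.discard(name)

    def active_metrics(self):
        return [m for name, m in self._metrics.items() if name in self._enabled]


def measure_record(record, schema, configuration):
    """Run every enabled metric and merge the results into one flat report."""
    report = {}
    for metric in configuration.active_metrics():
        metric.measure(record, schema)
        report.update(metric.get_result())
    return report


schema = Schema([Field("dc:title"), Field("dc:subject"), Field("dcterms:alternative")])
record = {"dc:title": ["Mona Lisa"], "dc:subject": ["Ballet", "Opera"]}
configuration = Configuration([Completeness(), Cardinality()])

full_report = measure_record(record, schema, configuration)
configuration.disable("cardinality")
reduced_report = measure_record(record, schema, configuration)
print(full_report)
print(reduced_report)
```

Because every metric returns a flat dictionary, the reports of all records share one shape, which is what makes a unified statistical analysis step possible.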

Towards metadata measurement. Modules

metadata-qa-api

europeana-qa-api

europeana-qa-spark

marc-qa-api*

europeana-qa-rest

ddb-qa-api*

Maven coordinates (groupId:artifactId:version):
de.gwdg.metadataqa:metadata-qa-api:0.4
de.gwdg.metadataqa:europeana-qa-api:0.4
...

Towards metadata measurement. Batch API client

Metadata QA

measurement
★ /batch/measuring/start → sessionID
★ /batch/[recordId] (for each record) → csv
★ /batch/measuring/stop → “success” | “failure”

analysis
★ /batch/analyzing/start → “success” | “failure”
★ /batch/analyzing/status (periodically) → “in progress” | “ready”
★ /batch/analyzing/retrieve → compressed package
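The order of calls a batch client has to make can be sketched as one function. Only the endpoint paths and the response values come from the slide; the HTTP methods, the way the sessionID is sent back to the service, and the request bodies are not shown there, so the sketch takes a caller-supplied `call(path, body)` function and demonstrates it with an in-memory fake instead of a real HTTP client. All names are hypothetical.

```python
import time


def run_batch(call, records, poll_seconds=1.0, max_polls=60):
    """Drive the measurement and analysis phases through `call(path, body=None)`."""
    session_id = call("/batch/measuring/start")
    csv_rows = [call("/batch/%s" % record_id, body) for record_id, body in records]
    if call("/batch/measuring/stop") != "success":
        raise RuntimeError("measuring did not stop cleanly")
    if call("/batch/analyzing/start") != "success":
        raise RuntimeError("analysis did not start")
    for _ in range(max_polls):
        if call("/batch/analyzing/status") == "ready":
            return session_id, csv_rows, call("/batch/analyzing/retrieve")
        time.sleep(poll_seconds)
    raise TimeoutError("analysis still in progress after %d polls" % max_polls)


class FakeService:
    """In-memory stand-in for the REST service, for demonstration only."""

    def __init__(self):
        self.statuses = iter(["in progress", "ready"])

    def __call__(self, path, body=None):
        fixed = {
            "/batch/measuring/start": "session-1",
            "/batch/measuring/stop": "success",
            "/batch/analyzing/start": "success",
            "/batch/analyzing/retrieve": b"compressed package",
        }
        if path in fixed:
            return fixed[path]
        if path == "/batch/analyzing/status":
            return next(self.statuses)
        # Any other path is treated as /batch/[recordId]; the csv row is made up.
        return "%s,0.5" % path.rsplit("/", 1)[-1]


session_id, csv_rows, package = run_batch(
    FakeService(), [("rec1", "{}"), ("rec2", "{}")], poll_seconds=0
)
print(session_id, csv_rows, package)
```

Passing the transport in as a function keeps the workflow testable without a running service; a real client would supply a function that performs the HTTP requests.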

Towards metadata measurement. Community bibliography

★ zotero.org/groups/metadata_assessment
★ dlfmetadataassessment.github.io

Towards metadata measurement. Further steps

human analysis
★ Translate the results into documentation, recommendations
★ Communication with data providers
★ Human evaluation of metadata quality
★ Cooperation with other projects

technical
★ Incorporating into ingestion process
★ Shapes Constraint Language (SHACL) for defining patterns
★ Process usage statistics
★ Measuring changes of scores
★ Machine learning based classification & clustering

Towards metadata measurement. Links

★ Europeana Data Quality Committee // http://pro.europeana.eu/europeana-tech/data-quality-committee
★ site // http://144.76.218.178/europeana-qa/
★ source code (GPL v3.0) // http://pkiraly.github.io/about/#source-codes
★ Europeana data (CC0) // http://hdl.handle.net/21.11101/0000-0001-781F-7
★ Library of Congress data (OA) // http://www.loc.gov/cds/products/marcDist.php
★ contact: [email protected], @kiru