Nothing is created, nothing is lost, everything changes - ELAG 2017

“Nothing is created, nothing is lost, everything changes”: measuring and visualizing data quality in Europeana
Valentine Charles (1), Péter Király (2)
ELAG 2017, June 8, Athens, Greece

(1) Europeana Foundation, The Netherlands
(2) Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen

Measuring and visualizing data quality

Generic title and bad thumbnail

More examples: Report and Recommendations from the Task Force on Metadata Quality (2015)

Multilinguality problem

★ Mona Lisa → 456 results
★ La Gioconda → 365 results
★ La Joconde → 71 results

http://www.europeana.eu/portal/en/record/90402/RP_F_00_351.html

Et cetera

★ Same title and description
  title: "VOETBAL-EREDIVISIEFEYENOORD - GO AHEAD 3-1", description: "VOETBAL-EREDIVISIEFEYENOORD - GO AHEAD 3-1"
★ Machine-readable ID in title
  title: "NLD-820630-AMSTERDAM: Straatmuzikanten proberen geld te verdienen..."
★ Leftover
  title: "+++EMPTY+++"

More examples: Report and Recommendations from the Task Force on Metadata Quality (2015)

The problem

There are “good” and “bad” metadata records, but we don’t have clear metrics like this:

[Figure: scale labelled “functional requirements”, running from bad through acceptable to good]

Why is data quality important?

“Fitness for purpose” (QA principle)

Purposes:
★ data re-use
★ discovery

no metadata → no access to data → no data usage

More explanation: Data on the Web Best Practices, W3C Working Draft, https://www.w3.org/TR/dwbp/

Hypothesis

By measuring structural elements we can approximate metadata record quality (≃ “metadata smell”).

Benefits

★ improve the metadata
★ services: good data → reliable functions
★ better metadata schema & documentation
★ propagate “good practice”

Proposal I. Work together

Europeana Data Quality Committee
★ Interdisciplinary expert groups
★ Analysing/revising metadata schema
★ Functional requirement analysis
★ Problem catalog
★ Multilinguality

In North America: DLF Metadata Assessment Group (Christina Harlow)

What to measure?

★ Structural and semantic features
  ○ Cardinality, uniqueness, length, dictionary entry, data type conformance, multilinguality (schema-independent measurements)
★ Discovery scenarios or functional requirement analysis
  ○ Requirements of the most important functions
★ Problem catalog
  ○ Known metadata problems

Dimensions and metrics

★ Completeness: degree to which all required information is present
  ○ schema completeness: # of classes and properties represented / total # of classes and properties
  ○ property completeness
  ○ population completeness
  ○ interlinking completeness
★ Availability: the extent to which data is present and ready for use
★ Licensing: granting of permission to re-use under defined conditions
★ ...

Bruce and Hillmann, The Continuum of Metadata Quality (2004)
Ochoa and Duval, Automatic Evaluation of Metadata Quality in Digital Repositories (2009)
Zaveri et al., Quality Assessment for Linked Data: A Survey (2015)
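The completeness ratio listed under "Dimensions and metrics" can be illustrated per record. This is a minimal sketch, not the Europeana implementation: the field list `SCHEMA`, the helper names and the rule for what counts as "present" (any non-blank value) are assumptions made for the example.

```python
from typing import Iterable, Mapping


def _has_value(value: object) -> bool:
    """Treat None, blank strings and empty collections as missing."""
    if value is None:
        return False
    if isinstance(value, str):
        return bool(value.strip())
    if isinstance(value, (list, tuple, set)):
        return any(_has_value(item) for item in value)
    return True


def completeness(record: Mapping[str, object], schema_fields: Iterable[str]) -> float:
    """Number of schema fields present in the record / total number of schema fields."""
    fields = list(schema_fields)
    if not fields:
        raise ValueError("schema_fields must not be empty")
    present = sum(1 for name in fields if _has_value(record.get(name)))
    return present / len(fields)


# Hypothetical four-field subset of a schema, for illustration only.
SCHEMA = ["dc:title", "dc:description", "dc:creator", "dc:subject"]
record = {"dc:title": "Mona Lisa", "dc:description": "  ", "dc:subject": ["Portrait"]}

score = completeness(record, SCHEMA)
print(score)  # 0.5: title and subject are present, description is blank, creator is missing
```

Restricting `schema_fields` to the mandatory fields, or to the fields a given discovery scenario depends on, yields the narrower completeness variants with the same function.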

Discovery scenarios (the most important functions)

★ Basic retrieval with high precision and recall
★ Cross-language recall
★ Entity-based facets
★ Date-based facets
★ Improved language facets
★ Browse by subjects and resource types
★ Browse by agents
★ Hierarchical search and facets
★ ...

Metadata requirements

Questions
★ How do particular metadata elements support the function?
★ How can we score the fulfillment of a specific scenario?

Problem catalog (“metadata anti-patterns”)

★ Title contents same as description contents
★ Systematic use of the same title
★ Bad string: “empty” (and variants)
★ Shelfmarks and other identifiers in fields
★ Creator not an agent name
★ Absurd geographical location
★ Subject field used as description field
★ Unicode issues, e.g. U+FFFD (�)
★ Very short description field
★ ...

Problem definition

Description: Title contents same as description contents
Example: [link]
Motivation: Distorts search weightings
Checking Method: Field comparison
Notes: Record display: creator concatenated onto title
Metadata Scenario: Basic Retrieval
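A few entries of the problem catalog lend themselves to simple field checks. The sketch below is illustrative only: the record layout (`title`, `description` keys), the set of "empty" variants and the length threshold are assumptions, not values taken from the catalog.

```python
from typing import List

REPLACEMENT_CHAR = "\ufffd"  # U+FFFD, left behind by failed character decoding
EMPTY_MARKERS = {"empty", "+++empty+++"}  # assumed variants of the "empty" bad string


def detect_problems(record: dict, min_description_length: int = 20) -> List[str]:
    """Return the names of the catalog problems found in one record."""
    title = (record.get("title") or "").strip()
    description = (record.get("description") or "").strip()
    problems = []
    # Checking method "field comparison": title repeated verbatim as description.
    if title and title.casefold() == description.casefold():
        problems.append("title same as description")
    if title.casefold() in EMPTY_MARKERS:
        problems.append("bad string: empty")
    if REPLACEMENT_CHAR in title or REPLACEMENT_CHAR in description:
        problems.append("unicode issue: U+FFFD")
    if description and len(description) < min_description_length:
        problems.append("very short description")
    return problems


same = {
    "title": "VOETBAL-EREDIVISIEFEYENOORD - GO AHEAD 3-1",
    "description": "VOETBAL-EREDIVISIEFEYENOORD - GO AHEAD 3-1",
}
leftover = {"title": "+++EMPTY+++"}

print(detect_problems(same))      # ['title same as description']
print(detect_problems(leftover))  # ['bad string: empty']
```

Checks such as "creator not an agent name" or "absurd geographical location" need reference data and are out of scope for a string-level sketch like this one.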

Proposal II. Tooling

“Metadata Quality Assurance Framework”: a generic tool for measuring metadata quality

★ adaptable to different metadata schemes
★ scalable (to Big Data)
★ understandable reports for data curators
★ open source

Data processing workflow

ingest (output: json)
★ OAI-PMH
★ Europeana API
★ Hadoop
★ NoSQL

measure (output: csv)
★ Spark
★ Hadoop
★ Java
★ Apache Solr

statistical analysis (output: json, png)
★ Spark
★ R

web interface (output: html, svg)
★ PHP
★ D3.js
★ highchart.js
★ NoSQL

Measurement

[Figure labels: overall view, record view, collection view; measurements, aggregated numbers]

Field frequency per collection

[Screenshot: field frequency chart with filters; annotations mark the two extremes, “no record has alternative title” and “every record has alternative title”]

Multilinguality score

★ Text w/o language annotation (dc.subject: Germany): 0
★ Text with language annotation (dc.subject: Germany@en): 1
★ Text with several language annotations (dc.subject: Germany@en, Deutschland@de): 2
★ Link to (multilingual) vocabulary (http://www.geonames.org/2921044/federal-republic-of-germany): n

Multilinguality measurement with Juliane Stiller
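One way to read the 0 / 1 / 2 / n scale is as the number of distinct languages a field value can be served in, with a vocabulary link contributing the languages of its labels. The sketch below follows that reading, which is an assumption rather than the published scoring rule; the `resolve` callback and the `example.org` URI with its labels are hypothetical.

```python
from typing import Callable, Iterable, List, Optional, Tuple

Literal = Tuple[str, Optional[str]]  # (text, language tag or None)


def multilinguality_score(
    literals: Iterable[Literal],
    links: Iterable[str] = (),
    resolve: Optional[Callable[[str], List[Literal]]] = None,
) -> int:
    """Count distinct language tags on the literals and on the labels behind links."""
    languages = {lang for _, lang in literals if lang}
    if resolve is not None:
        for uri in links:
            # Dereference the link and add the languages of the returned labels.
            languages |= {lang for _, lang in resolve(uri) if lang}
    return len(languages)


# Hypothetical vocabulary standing in for a real dereferencing service.
VOCABULARY = {
    "http://example.org/place/germany": [
        ("Germany", "en"),
        ("Deutschland", "de"),
        ("Allemagne", "fr"),
    ]
}

print(multilinguality_score([("Germany", None)]))                         # 0
print(multilinguality_score([("Germany", "en")]))                         # 1
print(multilinguality_score([("Germany", "en"), ("Deutschland", "de")]))  # 2
print(multilinguality_score([], ["http://example.org/place/germany"],
                            lambda uri: VOCABULARY.get(uri, [])))         # 3
```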

Multilinguality example

<...> a ore:Proxy ;
    dc:subject "Ballet", "Opera" .
→ 0

<...> a ore:Proxy ;
    edm:europeanaProxy true ;
    dc:subject <...>, <...> .
→ 0

↓ dereferencing

<...> a skos:Concept .
    skos:prefLabel "Ballett"@no, "बैले"@hi, "Ballett"@de, "Балет"@be, "Балет"@ru,
        "Balé"@pt, "Балет"@bg, "Baletas"@lt, "Balet"@hr, "Balets"@lv .
    skos:prefLabel "Opera"@no, "ओपेरा (गी तनाटक)"@hi, "Oper"@de, "Ooppera"@fi,
        "Опера"@be, "Опера"@ru, "Ópera"@pt, "Опера"@bg, "Opera"@lt .

Distinct languages: 11
Tagged literals: 19
Literals per language: 1.7
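The three figures of the multilinguality example can be recomputed from the labels shown on the slide. This is a small sketch that parses the tagged literals with a regular expression; the pattern is a simplification for this example, not a full Turtle parser.

```python
import re

# The skos:prefLabel values from the slide, copied as written.
LABELS = '''
"Ballett"@no, "बैले"@hi, "Ballett"@de, "Балет"@be, "Балет"@ru,
"Balé"@pt, "Балет"@bg, "Baletas"@lt, "Balet"@hr, "Balets"@lv,
"Opera"@no, "ओपेरा (गी तनाटक)"@hi, "Oper"@de, "Ooppera"@fi,
"Опера"@be, "Опера"@ru, "Ópera"@pt, "Опера"@bg, "Opera"@lt
'''

# A quoted string followed by @ and a language tag such as "de" or "pt-BR".
TAGGED_LITERAL = re.compile(r'"([^"]*)"@([A-Za-z]+(?:-[A-Za-z0-9]+)*)')

pairs = TAGGED_LITERAL.findall(LABELS)
tagged_literals = len(pairs)
distinct_languages = {lang for _, lang in pairs}
literals_per_language = tagged_literals / len(distinct_languages)

print(tagged_literals)                   # 19
print(len(distinct_languages))           # 11
print(round(literals_per_language, 1))   # 1.7
```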

Language frequency

[Chart legend: has language specification / has no language specification]

Multilingual saturation

Functionality matrix

[Matrix axes: elements, functionalities]

Flexible measurement

API
★ Addressing and iterating over schema elements
  ○ schema.getFields()
  ○ field.getPath(), field.getSubdimensions(), ...
★ Abstracting the metrics
  ○ metric1.measure(), metric2.measure()
  ○ metric1.getResult(), metric2.getResult()
★ Making the process configurable (turn on-off metrics)
  ○ configuration.enableMetricX()
  ○ configuration.disableMetricY()
★ Unified reporting data structure
  ○ Unified statistical analysis
  ○ Machine learning (to come)
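The API outline above (abstract metrics, switchable configuration, one reporting structure) could be sketched as follows. The slide uses Java-style names; this Python version only mirrors the idea, and every class, method and metric name in it is hypothetical rather than taken from the framework's code.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class Metric(ABC):
    """Common interface: measure a record, then expose the result."""
    name = "metric"

    def __init__(self) -> None:
        self._result = None

    @abstractmethod
    def measure(self, record: dict, fields: List[str]) -> None:
        ...

    def get_result(self):
        return self._result


class Completeness(Metric):
    name = "completeness"

    def measure(self, record, fields):
        present = sum(1 for field in fields if record.get(field))
        self._result = present / len(fields)


class TitleEqualsDescription(Metric):
    name = "title_equals_description"

    def measure(self, record, fields):
        title = record.get("title")
        self._result = bool(title) and title == record.get("description")


class Configuration:
    """Registry of metrics that can be switched on and off by name."""

    def __init__(self, metrics: List[Metric]) -> None:
        self._metrics = {metric.name: metric for metric in metrics}
        self._enabled = set(self._metrics)

    def enable(self, name: str) -> None:
        if name not in self._metrics:
            raise KeyError(name)
        self._enabled.add(name)

    def disable(self, name: str) -> None:
        self._enabled.discard(name)

    def active_metrics(self) -> List[Metric]:
        return [m for n, m in self._metrics.items() if n in self._enabled]


def measure_record(record: dict, fields: List[str], configuration: Configuration) -> Dict[str, object]:
    """Run every enabled metric and collect results in one flat report."""
    report = {}
    for metric in configuration.active_metrics():
        metric.measure(record, fields)
        report[metric.name] = metric.get_result()
    return report


fields = ["title", "description", "creator"]
config = Configuration([Completeness(), TitleEqualsDescription()])
config.disable("title_equals_description")
report = measure_record({"title": "A", "description": "A"}, fields, config)
print(report)  # only the completeness entry, since the other metric is disabled
```

Because every metric writes into the same flat dictionary, the report can be serialized to one CSV row per record, which is the kind of uniform structure that later statistical analysis relies on.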

Batch API

[Sequence diagram between the client and Metadata QA; each line shows request → response]

measurement
★ /batch/measuring/start → sessionID
★ /batch/[recordId] (for each record) → csv
★ /batch/measuring/stop → “success” | “failure”

analysis
★ /batch/analyzing/start → “success” | “failure”
★ /batch/analyzing/status (periodically) → “in progress” | “ready”
★ /batch/analyzing/retrieve → compressed package
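The order of calls in the batch API can be sketched as a small client loop. The slide gives only the paths and the responses, so HTTP methods, parameters and the way the sessionID is passed back are not known here; the sketch therefore takes a caller-supplied `send` function (an assumption) instead of a real HTTP library, and `FakeServer` is a stand-in used only to exercise the sequence.

```python
import time
from typing import Callable, Iterable, List, Tuple

Send = Callable[[str], str]  # takes an endpoint path, returns the response body


def run_batch(send: Send, record_ids: Iterable[str],
              poll_seconds: float = 1.0, max_polls: int = 60) -> Tuple[str, List[str], str]:
    """Measure each record, then start the analysis and poll until it is ready."""
    session_id = send("/batch/measuring/start")
    csv_lines = [send(f"/batch/{record_id}") for record_id in record_ids]
    if send("/batch/measuring/stop") != "success":
        raise RuntimeError("measuring could not be stopped")
    if send("/batch/analyzing/start") != "success":
        raise RuntimeError("analysis could not be started")
    for _ in range(max_polls):
        if send("/batch/analyzing/status") == "ready":
            return session_id, csv_lines, send("/batch/analyzing/retrieve")
        time.sleep(poll_seconds)
    raise TimeoutError("analysis did not finish in time")


class FakeServer:
    """In-memory stand-in that answers like the responses listed on the slide."""

    def __init__(self) -> None:
        self.calls: List[str] = []
        self._status = iter(["in progress", "ready"])

    def __call__(self, path: str) -> str:
        self.calls.append(path)
        if path == "/batch/measuring/start":
            return "session-1"
        if path in ("/batch/measuring/stop", "/batch/analyzing/start"):
            return "success"
        if path == "/batch/analyzing/status":
            return next(self._status)
        if path == "/batch/analyzing/retrieve":
            return "package.zip"
        return "0.5,1,0"  # made-up csv line for a measured record


server = FakeServer()
session_id, csv_lines, package = run_batch(server, ["r1", "r2"], poll_seconds=0)
print(session_id, len(csv_lines), package)
```

Injecting `send` keeps the call order testable without a running service; a real client would replace it with a function that issues the HTTP requests and attaches the session identifier however the service expects.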

Further steps

human analysis
★ Translate the results into documentation, recommendations
★ Communication with data providers
★ Human evaluation of metadata quality
★ Cooperation with other projects

technical
★ Incorporating into Europeana’s new ingestion tool
★ Shape Constraint Language (SHACL) for defining patterns
★ Process usage statistics
★ Measuring changes over time
★ Machine learning based classification & clustering

Links

★ Europeana Data Quality Committee: http://pro.europeana.eu/europeana-tech/data-quality-committee
★ site: http://144.76.218.178/europeana-qa/
★ source code (GPL): http://pkiraly.github.io/about/#source-codes
★ data (CC0): http://hdl.handle.net/21.11101/0000-0001-781F-7
★ bibliography: http://zotero.org/groups/metadata_assessment
★ DLF Metadata Assessment Group: http://dlfmetadataassessment.github.io
★ contact: Valentine Charles (@valentinec89), Péter Király (@kiru), #metadata, #metadataquality
★ these slides: http://bit.ly/mq-elag2017