Elasticsearch as Hub for Linked Bibliographic Metadata - Meetup

Sebastian Schüpbach
Project linked-swissbib, University Library Basel

Elastic Meetup #11 - Zurich - August 31, 2016

What the Talk Is About

- “Classic” Swissbib
  - Overview
  - Architecture
- linked-swissbib
  - Overview
  - BaseLine
  - EnrichedLine
  - WorkLine
  - Public Access Interfaces

“Classic” Swissbib

Ever Heard of Swissbib?

- Metacatalogue for bibliographic metadata
- Unifies the collections of 900+ memory institutions
- 21m records
- Updated daily
- Mostly FOSS

Further reading

- Search catalogue
- Project wiki
- Code repositories on GitHub

A Layered Architecture

linked-swissbib

linked-swissbib: Going beyond “Classic” Swissbib

Data processing

- Input: Metadata records of Swissbib
- Output: 100m uniquely identifiable resources in 6 concepts
- Partially enriching/linking documents with external resources

Interfaces

- Search catalogue exploiting linked data structure
- Web API enabling reuse of the enriched data

Focus on Reusability

- Providing easy access to data via Web API
- Using RDF data model
- Linking important linked data hubs
- Integrating widely disseminated RDF vocabularies

linked-swissbib: Workflows

BaseLine: Overview

BaseLine: Metafacture as Transformation Framework

Metafacture

- Framework
  - Based on modules
  - Each module is responsible for one task in the workflow
  - Modules are chained together to build a workflow
  - The data set is “streamed” through the workflow
- Morph: DSL for transformation definitions
- Flux: DSL for entire transformation pipelines
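The module idea can be illustrated with a minimal sketch in plain Python. It does not use Metafacture, Morph, or Flux; the module names (`read_records`, `rename_fields`, `drop_empty`) and the field names are hypothetical and only show how single-task modules can be chained so that records stream through them one at a time.

```python
from typing import Callable, Dict, Iterable, Iterator

Record = Dict[str, str]
Module = Callable[[Iterable[Record]], Iterator[Record]]


def read_records(rows: Iterable[Record]) -> Iterator[Record]:
    """Source module: emit records one at a time."""
    for row in rows:
        yield dict(row)


def rename_fields(mapping: Dict[str, str]) -> Module:
    """Build a module that renames fields according to `mapping`."""
    def module(records: Iterable[Record]) -> Iterator[Record]:
        for record in records:
            yield {mapping.get(key, key): value for key, value in record.items()}
    return module


def drop_empty(records: Iterable[Record]) -> Iterator[Record]:
    """Module that removes fields with empty values."""
    for record in records:
        yield {key: value for key, value in record.items() if value}


def pipeline(source: Iterable[Record], *modules: Module) -> Iterator[Record]:
    """Chain modules so that each record streams through all of them."""
    stream: Iterable[Record] = source
    for module in modules:
        stream = module(stream)
    return iter(stream)


# Hypothetical input records; the field names are made up for illustration.
raw = [
    {"245a": "Faust", "100a": "Goethe, Johann Wolfgang von", "260c": ""},
    {"245a": "Der Prozess", "100a": "Kafka, Franz", "260c": "1925"},
]

flow = pipeline(
    read_records(raw),
    rename_fields({"245a": "title", "100a": "creator", "260c": "issued"}),
    drop_empty,
)
result = list(flow)
```

Because every module is a generator, no record is processed before the final `list(flow)` pulls it through the chain, which mirrors the streaming behaviour described on the slide.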

Further reading

- Website of Metafacture
- Metafacture on GitHub

BaseLine: Concepts

- Classic Swissbib: 1 bibliographic record = 1 document in Solr
- linked-swissbib:
  - 1 bibliographic record = x entities in 6 different concepts
  - 1 entity = 1 resource = 1 document in Elasticsearch
  - 1 concept = 1 type in Elasticsearch
- Concepts semantically structure bibliographic records (e.g. “person” for authors)
- Each resource has a dereferenceable URI
- Different bibliographic records can share a resource (e.g. authors)
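A minimal sketch of this mapping, under stated assumptions: the base URI, the concept name `bibliographicResource`, the index name `lsb`, and the hash-based `person_uri` helper are all hypothetical. Only “person” is named as a concept in the talk. The output follows the action/source line pairs of the Elasticsearch bulk API; the `_type` field reflects the mapping types available in Elasticsearch at the time of the talk, which newer versions have removed.

```python
import hashlib
import json
from typing import Dict, List

# Hypothetical URI prefix; the real project uses its own namespace.
BASE_URI = "http://example.org/"


def person_uri(name: str) -> str:
    """Derive a stable URI from a normalized name so records can share it."""
    digest = hashlib.sha1(name.strip().lower().encode("utf-8")).hexdigest()[:12]
    return f"{BASE_URI}person/{digest}"


def split_record(record: Dict) -> List[Dict]:
    """Turn one bibliographic record into several entities (concept + document)."""
    authors = [person_uri(name) for name in record["authors"]]
    entities = [{
        "concept": "bibliographicResource",  # assumed concept name
        "doc": {"@id": f"{BASE_URI}resource/{record['id']}",
                "title": record["title"], "contributor": authors},
    }]
    for name, uri in zip(record["authors"], authors):
        entities.append({"concept": "person", "doc": {"@id": uri, "name": name}})
    return entities


def to_bulk_lines(entities: List[Dict], index: str = "lsb") -> List[str]:
    """Serialize entities as action/source line pairs in bulk-API style."""
    lines = []
    for entity in entities:
        action = {"index": {"_index": index, "_type": entity["concept"],
                            "_id": entity["doc"]["@id"]}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(entity["doc"]))
    return lines


records = [
    {"id": "001", "title": "Faust I", "authors": ["Goethe, Johann Wolfgang von"]},
    {"id": "002", "title": "Faust II", "authors": ["Goethe, Johann Wolfgang von"]},
]
entities = [entity for record in records for entity in split_record(record)]
# Both records yield the same person URI, so indexing by _id keeps one document.
unique_ids = {entity["doc"]["@id"] for entity in entities}
bulk_lines = to_bulk_lines(entities)
```

Using the resource URI as `_id` is one way to let different bibliographic records share a resource: the second occurrence of the same author overwrites the first instead of creating a duplicate.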

BaseLine: Overview

Enrichment: Overview

Enrichment: Steps (Simplified)

Blocking vs. Non-Blocking

1. Normalization to N-Triples
2. Sorting by subject
3. Removing unneeded statements
4. Blocking
5. Crosswise comparison of records via comparison of fields (LIMES, Silk)
6. Enrichment of Swissbib data
7. Transformation to JSON-LD
8. Indexing
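Steps 4 and 5 can be sketched in a few lines of Python. The talk names LIMES and Silk as the comparison tools; this sketch uses neither and substitutes the standard library's `difflib.SequenceMatcher` as a stand-in similarity measure. The blocking key (first three letters of the name), the threshold, and all URIs are assumptions chosen for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import product
from typing import Dict, List, Tuple

Entity = Dict[str, str]


def blocking_key(entity: Entity) -> str:
    """Hypothetical blocking key: first three letters of the normalized name."""
    return entity["name"].lower().replace(",", "").strip()[:3]


def build_blocks(entities: List[Entity]) -> Dict[str, List[Entity]]:
    """Group entities by blocking key so only likely candidates are compared."""
    blocks: Dict[str, List[Entity]] = defaultdict(list)
    for entity in entities:
        blocks[blocking_key(entity)].append(entity)
    return blocks


def similarity(left: str, right: str) -> float:
    """Stand-in field comparison; LIMES and Silk offer their own metrics."""
    return SequenceMatcher(None, left.lower(), right.lower()).ratio()


def link(local: List[Entity], external: List[Entity],
         threshold: float = 0.9) -> List[Tuple[str, str]]:
    """Compare local and external entities crosswise, but only within a block."""
    external_blocks = build_blocks(external)
    links = []
    for key, local_block in build_blocks(local).items():
        for mine, theirs in product(local_block, external_blocks.get(key, [])):
            if similarity(mine["name"], theirs["name"]) >= threshold:
                links.append((mine["uri"], theirs["uri"]))
    return links


local = [
    {"uri": "http://example.org/person/1", "name": "Kafka, Franz"},
    {"uri": "http://example.org/person/2", "name": "Mann, Thomas"},
]
external = [
    {"uri": "http://example.com/ext/a", "name": "Kafka, Franz"},
    {"uri": "http://example.com/ext/b", "name": "Mann, Heinrich"},
    {"uri": "http://example.com/ext/c", "name": "Keller, Gottfried"},
]
links = link(local, external)
```

Blocking keeps the crosswise comparison affordable: instead of comparing every local entity with every external one, only pairs that share a key are scored.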

Further reading

- Code on GitHub

WorkLine: Work Concept Generation

WorkLine: Work Concept as a Special Case

- Aim: Group and merge entities with identical work URI
- Problem: Resource-intensive creation
- Solution: Using Apache Spark in combination with Elasticsearch
- Approach:
  1. Getting documents containing a work URI
  2. Filtering out unneeded fields
  3. Grouping tuples with same work URIs
  4. Merging field values into arrays
  5. Indexing newly created documents
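The grouping and merging logic of steps 1 to 4 can be sketched in plain Python. This is not the project's Spark job: it runs in memory on a small list, and the field names (`work`, `title`, `language`, `instance`) and the selection in `KEEP_FIELDS` are hypothetical. Step 5, indexing, is left out.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

# Hypothetical choice of fields worth keeping on a work document.
KEEP_FIELDS = ("title", "language")


def build_works(documents: Iterable[Dict]) -> List[Dict]:
    """Group documents by work URI and merge their field values into arrays."""
    grouped: Dict[str, Dict[str, list]] = defaultdict(lambda: defaultdict(list))
    for doc in documents:
        work_uri = doc.get("work")
        if not work_uri:
            continue  # step 1: only documents that carry a work URI
        fields = grouped[work_uri]  # step 3: one group per work URI
        for name in KEEP_FIELDS:  # step 2: ignore all other fields
            value = doc.get(name)
            if value is not None and value not in fields[name]:
                fields[name].append(value)  # step 4: merge values into arrays
        fields["instance"].append(doc["@id"])
    # Step 5 (indexing) is left out; the result is ready to be indexed.
    return [{"@id": uri, **fields} for uri, fields in sorted(grouped.items())]


docs = [
    {"@id": "http://example.org/resource/1", "work": "http://example.org/work/a",
     "title": "Faust", "language": "de", "publisher": "ignored"},
    {"@id": "http://example.org/resource/2", "work": "http://example.org/work/a",
     "title": "Faust", "language": "en"},
    {"@id": "http://example.org/resource/3", "title": "No work URI"},
    {"@id": "http://example.org/resource/4", "work": "http://example.org/work/b",
     "title": "Der Prozess", "language": "de"},
]
works = build_works(docs)
```

In a distributed setting the same shape would map onto a filter, a projection, a group-by-key, and a merge per group, which is where a cluster framework pays off for a resource-intensive job like this one.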

Further reading

- Code on GitHub

Data Access: Overview

Data Access: Search Interface

Goal

Search facilities beyond “classic” Swissbib using linked data

VuFind

- Library resource portal based on Zend Framework
- Modular architecture, easy to extend
- Also used for “classic” Swissbib (reuse of several components)
- Home-grown Elasticsearch module based on official PHP client

Features beyond “classic” Swissbib

- Dedicated webpages for authors and subjects
- “Knowledge cards”
- Extended auto-completion
- Ordered result pages

Data Access: Web API

Goal

Providing easy access to data for machines and acting as gateway to look up linked-swissbib URIs

Means

- API Platform:
  - PHP web framework to build API-first web applications
  - Supports Linked Data (JSON-LD) and Hydra out of the box
- Hydra:
  - Standardisation effort for a self-descriptive vocabulary for hypermedia-driven Web APIs
  - Building blocks: Core vocabulary and JSON-LD as serialisation format
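To make the combination of JSON-LD and the Hydra core vocabulary more concrete, here is a minimal sketch of a collection response built in Python. It is not the output of the linked-swissbib API or of API Platform; the URIs, the member fields, and the helper `as_hydra_collection` are hypothetical. Only the terms `hydra:Collection`, `hydra:member`, and `hydra:totalItems` come from the Hydra core vocabulary.

```python
import json
from typing import Dict, List


def as_hydra_collection(collection_uri: str, members: List[Dict]) -> Dict:
    """Wrap JSON-LD documents in a Hydra collection (illustrative only)."""
    return {
        "@context": {"hydra": "http://www.w3.org/ns/hydra/core#"},
        "@id": collection_uri,
        "@type": "hydra:Collection",
        "hydra:totalItems": len(members),
        "hydra:member": members,
    }


# Hypothetical person resources, each identified by its own URI.
persons = [
    {"@id": "http://example.org/person/1", "name": "Kafka, Franz"},
    {"@id": "http://example.org/person/2", "name": "Mann, Thomas"},
]
response = as_hydra_collection("http://example.org/person", persons)
body = json.dumps(response, indent=2)
```

A client that understands Hydra can read the collection type and the member list from the response itself, which is what makes such an API self-descriptive.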

Conclusion: Why Elasticsearch for Linked Data?

- Performance:
  - Faster indexing and faster retrieval than an RDF triplestore
  - As a document-oriented store, optimized for semi-structured data (like linked data)
- Integration:
  - Native JSON(-LD) support
  - Easy and well-documented APIs for common languages (Java, PHP, Python...)

Thank you for your attention

Further Reading

- linked-swissbib web site
- Project abstract
- linked-swissbib on GitHub
- Search server index (testing)
- Prototype of search catalogue
- Article series on linked-swissbib (German / French translation):
  - Part 1: Transformation, data modelling, indexing
  - Part 2: Linking and enrichment
  - Part 3: User interface
  - Part 4: Hydra Web API