Elasticsearch as Hub for Linked Bibliographic Metadata
Sebastian Schüpbach
Project linked-swissbib, University Library Basel
Elastic Meetup #11 - Zurich - August 31, 2016
What the Talk Is About
- “Classic” Swissbib
  - Overview
  - Architecture
- linked-swissbib
  - Overview
  - BaseLine
  - EnrichedLine
  - WorkLine
  - Public Access Interfaces
“Classic” Swissbib
Ever Heard of Swissbib?
- Metacatalogue for bibliographic metadata
- Unifies collections of 900+ memory institutions
- 21m records
- Updated daily
- Mostly FOSS
Further reading
- Search catalogue
- Project wiki
- Code repositories on GitHub
A Layered Architecture
linked-swissbib
linked-swissbib: Going beyond “Classic” Swissbib
Data processing
- Input: Metadata records of Swissbib
- Output: 100m uniquely identifiable resources in 6 concepts
- Partially enriching/linking documents with external resources
Interfaces
- Search catalogue exploiting linked data structure
- Web API enabling reuse of the enriched data
Focus on Reusability
- Providing easy access to data via Web API
- Using RDF data model
- Linking important linked data hubs
- Integrating widely disseminated RDF vocabularies
linked-swissbib: Workflows
BaseLine: Overview
BaseLine: Metafacture as Transformation Framework
Metafacture
- Framework
  - Based on modules
  - Each module is responsible for one task in the workflow
  - Modules are chained together to build a workflow
  - The data set is “streamed” through the workflow
- Morph: DSL for transformation definitions
- Flux: DSL for entire transformation pipelines
Further reading
- Website of Metafacture
- Metafacture on GitHub
BaseLine: Concepts
- Classic Swissbib: 1 bibliographic record = 1 document in Solr
- linked-swissbib:
  - 1 bibliographic record = x entities in 6 different concepts
  - 1 entity = 1 resource = 1 document in Elasticsearch
  - 1 concept = 1 type in Elasticsearch
- Concepts semantically structure bibliographic records (e.g. “person” for authors)
- Each resource has a dereferenceable URI
- Different bibliographic records can share a resource (e.g. authors)
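The mapping above can be illustrated with a minimal sketch that splits records into entities and lets two records share one person resource. This is not the project's Metafacture workflow: the base URI, the field names, the concept name "resource", and the hash-based person URI are all assumptions made for illustration.

```python
import hashlib

BASE_URI = "http://example.org"  # hypothetical base URI, not the project's real one


def person_uri(name):
    """Derive a stable URI from a normalized name (assumed scheme)."""
    digest = hashlib.sha1(name.strip().lower().encode("utf-8")).hexdigest()[:12]
    return f"{BASE_URI}/person/{digest}"


def split_record(record):
    """Turn one bibliographic record into (concept, uri, document) tuples."""
    authors = record.get("authors", [])
    author_uris = [person_uri(name) for name in authors]
    resource_uri = f"{BASE_URI}/resource/{record['id']}"
    entities = [(
        "resource",
        resource_uri,
        {"@id": resource_uri, "title": record["title"], "contributor": author_uris},
    )]
    for name, uri in zip(authors, author_uris):
        entities.append(("person", uri, {"@id": uri, "name": name}))
    return entities


def build_index(records):
    """Collect entities keyed by (concept, uri); shared resources collapse into one."""
    index = {}
    for record in records:
        for concept, uri, doc in split_record(record):
            index[(concept, uri)] = doc
    return index


records = [
    {"id": "001", "title": "Homo faber", "authors": ["Frisch, Max"]},
    {"id": "002", "title": "Stiller", "authors": ["Frisch, Max"]},
]
index = build_index(records)
```

Both sample records point to the same person URI, so the index holds two "resource" documents and a single "person" document.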
BaseLine: Overview
Enrichment: Overview
Enrichment: Steps (Simplified)
Blocking vs. Non-Blocking
1. Normalization to N-Triples
2. Sorting by subject
3. Removing unneeded statements
4. Blocking
5. Crosswise comparison of records via comparison of fields (LIMES, Silk)
6. Enrichment of Swissbib data
7. Transformation to JSON-LD
8. Indexing
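Steps 4 and 5 can be sketched as follows. This is a simplified stand-in, not how LIMES or Silk work internally: the blocking key (first letter of the name), the similarity measure (`difflib.SequenceMatcher`), the threshold, and the sample URIs are all assumptions.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import product


def blocking_key(name):
    """Assumed blocking key: first character of the normalized name."""
    return name.strip().lower()[:1]


def link(local, external, threshold=0.9):
    """Compare only records that fall into the same block.

    local, external: lists of (uri, name) pairs.
    Returns a list of (local_uri, external_uri) links.
    """
    blocks = defaultdict(lambda: ([], []))
    for uri, name in local:
        blocks[blocking_key(name)][0].append((uri, name))
    for uri, name in external:
        blocks[blocking_key(name)][1].append((uri, name))

    links = []
    for key in sorted(blocks):
        left, right = blocks[key]
        # Crosswise comparison, restricted to the current block
        for (l_uri, l_name), (r_uri, r_name) in product(left, right):
            score = SequenceMatcher(None, l_name.lower(), r_name.lower()).ratio()
            if score >= threshold:
                links.append((l_uri, r_uri))
    return links


local = [("ex:p1", "Frisch, Max"), ("ex:p2", "Dürrenmatt, Friedrich")]
external = [("ext:a", "Frisch, Max"), ("ext:c", "Meier, Anna")]
links = link(local, external)
```

Blocking keeps the number of comparisons manageable: without it, every local record would be compared with every external record.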
Further reading
- Code on GitHub
WorkLine: Work Concept Generation
WorkLine: Work Concept as a Special Case
- Aim: Group and merge entities with identical work URI
- Problem: Resource-intensive creation
- Solution: Using Apache Spark in combination with Elasticsearch
Approach:
1. Getting documents containing a work URI
2. Filtering out unneeded fields
3. Grouping tuples with the same work URI
4. Merging field values into arrays
5. Indexing newly created documents
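Steps 1 to 4 of the approach can be sketched in plain Python. This is only a single-machine stand-in for the Spark job; the field names (`work`, `title`, `language`) and the sample documents are assumptions, and step 5 (indexing) is left out.

```python
from collections import defaultdict

KEEP_FIELDS = ("title", "language")  # hypothetical fields worth keeping


def build_work_documents(documents):
    """Group documents by work URI and merge their field values into arrays."""
    grouped = defaultdict(list)
    for doc in documents:
        work_uri = doc.get("work")
        if work_uri is None:
            continue  # step 1: only documents containing a work URI
        slim = {f: doc[f] for f in KEEP_FIELDS if f in doc}  # step 2: drop other fields
        grouped[work_uri].append(slim)  # step 3: group by work URI

    works = {}
    for work_uri, members in grouped.items():
        merged = {"@id": work_uri}
        for field in KEEP_FIELDS:
            # step 4: merge distinct values of a field into a sorted array
            values = sorted({m[field] for m in members if field in m})
            if values:
                merged[field] = values
        works[work_uri] = merged
    return works


documents = [
    {"@id": "ex:r1", "work": "ex:w1", "title": "Homo faber", "language": "de", "extent": "203 p."},
    {"@id": "ex:r2", "work": "ex:w1", "title": "Homo faber", "language": "fr"},
    {"@id": "ex:r3", "title": "No work URI"},
]
works = build_work_documents(documents)
```

In a Spark job the same grouping would run distributed across a cluster, which is what addresses the resource problem named above.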
Further reading
- Code on GitHub
Data Access: Overview
Data Access: Search Interface
Goal
Search facilities beyond “classic” Swissbib using linked data
VuFind
- Library resource portal based on Zend Framework
- Modular architecture, easy to extend
- Also used for “classic” Swissbib (reuse of several components)
- Home-grown Elasticsearch module based on the official PHP client
Features beyond “classic” Swissbib
- Dedicated webpages for authors and subjects
- “Knowledge cards”
- Extended auto-completion
- Ordered result pages
Data Access: Web API
Goal
Providing easy access to data for machines and acting as a gateway to look up linked-swissbib URIs
Means
- API Platform:
  - PHP web framework to build API-first web applications
  - Supports Linked Data (JSON-LD) and Hydra out of the box
- Hydra:
  - Standardisation effort for a self-descriptive vocabulary for hypermedia-driven Web APIs
  - Building blocks: Core vocabulary and JSON-LD as serialisation format
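To give an idea of what a Hydra response looks like, here is a minimal sketch that wraps resources in a collection described with the Hydra core vocabulary. The collection URI and the member documents are invented for illustration, and the actual linked-swissbib API responses may be structured differently.

```python
import json

HYDRA_NS = "http://www.w3.org/ns/hydra/core#"  # namespace of the Hydra core vocabulary


def hydra_collection(collection_uri, members):
    """Wrap JSON-LD resources in a hydra:Collection."""
    return {
        "@context": {"hydra": HYDRA_NS},
        "@id": collection_uri,
        "@type": "hydra:Collection",
        "hydra:totalItems": len(members),
        "hydra:member": members,
    }


# Hypothetical collection URI and member, for illustration only
response = hydra_collection(
    "http://example.org/person",
    [{"@id": "http://example.org/person/1", "name": "Frisch, Max"}],
)
body = json.dumps(response, ensure_ascii=False, indent=2)
```

Because the response names its own type and members in a shared vocabulary, a generic client can walk the collection without API-specific code.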
Conclusion: Why Elasticsearch for Linked Data?
- Performance: Faster indexing and faster retrieval than an RDF triplestore
  - As a document-oriented store, optimized for semi-structured data (like linked data)
- Integration:
  - Native JSON(-LD) support
  - Easy and well-documented APIs for common languages (Java, PHP, Python...)
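Since JSON-LD is plain JSON, resources can be sent to Elasticsearch without further conversion. The sketch below builds a request body in the newline-delimited format of the bulk API, using only the standard library. The index name and sample documents are assumptions; the `_type` entry reflects the "1 concept = 1 type" model of this talk, and newer Elasticsearch versions no longer support mapping types.

```python
import json


def bulk_body(index, entities):
    """Build a bulk request body: one action line plus one source line per document.

    entities: iterable of (concept, document) pairs; the document's "@id"
    is reused as the Elasticsearch document id (an assumption of this sketch).
    """
    lines = []
    for concept, doc in entities:
        action = {"index": {"_index": index, "_type": concept, "_id": doc["@id"]}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc, ensure_ascii=False))
    # The bulk format requires a trailing newline after the last line
    return "\n".join(lines) + "\n"


entities = [
    ("person", {"@id": "http://example.org/person/1", "name": "Frisch, Max"}),
    ("work", {"@id": "http://example.org/work/1", "title": ["Homo faber"]}),
]
payload = bulk_body("lsb-demo", entities)  # "lsb-demo" is a hypothetical index name
```

The resulting string can be posted to the `_bulk` endpoint with any HTTP client or passed to one of the official language clients.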
Thank you for your attention
Further Reading
- linked-swissbib web site
- Project abstract
- linked-swissbib on GitHub
- Search server index (testing)
- Prototype of search catalogue
- Article series on linked-swissbib (German / French translation):
  - Part 1: Transformation, data modelling, indexing
  - Part 2: Linking and enrichment
  - Part 3: User interface
  - Part 4: Hydra Web API