Automation of metadata processing
CLARIN Conference in Wrocław, Poland, 15-17 October 2015
Except where otherwise noted, content on this poster is licensed under a Creative Commons Attribution 4.0 International license.
17.10.2015
Dennis Zielke, Daniel Jettka Humboldt-Universität zu Berlin, Universität Hamburg
Introduction: Repositories
• Introduction
  • HZSK-Repository (Daniel Jettka)
  • LAUDATIO-Repository (Dennis Zielke)
• Open-source technologies
• Generalized model of the data ingest process
• Role of standardized metadata in the import process
• Validation of data
• Modelling import formats and data structures
• Indexing of metadata
Introduction: HZSK-Repository
• is based on the software triad Fedora, Islandora, and Drupal
• currently contains 19 corpora of transcribed spoken language
• stores research data including texts, transcripts, audio and video data, images, metadata, and other data types
• is connected to the CLARIN-D infrastructure on several levels, e.g. to the central services Virtual Language Observatory (for metadata search) and CLARIN Federated Content Search (for searching directly in the content)
Introduction: LAUDATIO-Repository
• is an open access environment for the persistent storage of historical texts and their annotations
• currently contains historical corpora from various disciplines, with a total of 2000 texts comprising about two million word forms
• focuses on German historical texts and their linguistic annotations, including all dialects, for time periods ranging from the 9th to the 19th century
Introduction: LAUDATIO-Repository (technical)
• the technical repository infrastructure is based on generalizable software modules, such as the graphical user interface and the data exchange module between the research data and the Fedora REST API
• metadata search, including indexing and faceting, is based on the Lucene-based technology ElasticSearch
• imported corpora are stored in their original structure as a permanent and unchangeable version
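To illustrate what faceting on top of ElasticSearch can look like, the following minimal sketch builds a search request body with one terms aggregation per facet. The field names `corpus_language` and `annotation_type` are hypothetical and are not taken from the LAUDATIO index; the body uses the aggregation syntax of the Elasticsearch query DSL and is only printed here, not sent to a server.

```python
import json


def build_facet_query(text, facet_fields, size=10):
    """Return a search body with one terms aggregation per facet field."""
    # Fall back to match_all when no search text is given.
    query = {"query_string": {"query": text}} if text else {"match_all": {}}
    # Each facet becomes a terms aggregation named after its field.
    aggs = {
        field: {"terms": {"field": field, "size": size}}
        for field in facet_fields
    }
    return {"query": query, "aggs": aggs}


# Hypothetical facet fields, chosen only for illustration.
body = build_facet_query("Kräuter", ["corpus_language", "annotation_type"])
print(json.dumps(body, indent=2, ensure_ascii=False))
```

Such a body would typically be sent to the `_search` endpoint of an index; the buckets returned for each aggregation supply the facet values and their counts.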
LAUDATIO: Open-source technologies used (1)
• CakePHP 2.4 as MVC web framework (PHP5)
  • Authorization and authentication in the user management via Access Control List
• Fedora 3.6 for data storage
  • REST API for data exchange
• ElasticSearch as search engine
  • REST API for data exchange
  • Customized and versioned IndexMapping
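One possible reading of a "versioned IndexMapping" is that each mapping revision gets its own index, with an alias pointing at the current one. The sketch below follows that reading and is an assumption, not the LAUDATIO implementation: the base name `laudatio`, the field names, and the helper functions are all hypothetical. It uses the typeless mapping form and the `text`/`keyword` field types of newer Elasticsearch releases; older releases nest the properties under a type name and use different type names.

```python
import json


def versioned_index_name(base, version):
    """Derive an index name such as 'laudatio_v2' from a base name and a version."""
    return f"{base}_v{version}"


def build_index_definition(properties):
    """Wrap field definitions in an index-creation body."""
    return {"mappings": {"properties": properties}}


def alias_switch_actions(alias, old_index, new_index):
    """Build an _aliases request body that moves the alias to the new index."""
    actions = []
    if old_index:
        actions.append({"remove": {"index": old_index, "alias": alias}})
    actions.append({"add": {"index": new_index, "alias": alias}})
    return {"actions": actions}


# Hypothetical fields for corpus metadata.
definition = build_index_definition({
    "corpus_title": {"type": "text"},
    "corpus_language": {"type": "keyword"},
})
old_name = versioned_index_name("laudatio", 1)
new_name = versioned_index_name("laudatio", 2)
switch = alias_switch_actions("laudatio", old_name, new_name)
print(json.dumps(definition, indent=2))
print(json.dumps(switch, indent=2))
```

With this layout, clients only ever query the alias, so a revised mapping can be introduced by building a new index and switching the alias in a single request.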
LAUDATIO: Open-source technologies used (2)
• External PID web service (EPIC API Version 2) to assign the Persistent Identifier
• Third-party open-source libraries on GitHub
  • http://tinyurl.com/lf26u97
• Flat design (HTML5, CSS3) (coming soon)
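EPIC persistent identifiers are handles, which consist of a prefix and a suffix separated by a slash and can be resolved through the public Handle proxy. The sketch below only composes a handle and its resolver URL; it does not call the EPIC web service, and the prefix `12345` and the suffix are hypothetical values chosen for illustration.

```python
from urllib.parse import quote

# Public proxy of the Handle System.
HANDLE_PROXY = "https://hdl.handle.net/"


def make_handle(prefix, suffix):
    """Join a handle prefix and suffix into 'prefix/suffix'."""
    if not prefix or not suffix or "/" in prefix:
        raise ValueError("prefix and suffix must be non-empty; prefix must not contain '/'")
    return f"{prefix}/{suffix}"


def resolver_url(handle):
    """Return the proxy URL under which the handle can be resolved."""
    # Keep slashes, percent-encode anything that is not URL-safe.
    return HANDLE_PROXY + quote(handle, safe="/")


# Hypothetical prefix and suffix.
handle = make_handle("12345", "CORPUS-0001")
print(handle)
print(resolver_url(handle))
```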
LAUDATIO: Adopted data structure
TEI XML P5
Description of the corpus data structure using the TEI metadata standards
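As a rough illustration of how metadata can be read from a TEI P5 header, the following sketch parses a small, invented TEI document with Python's standard library and extracts the title and authors from the `titleStmt`. The sample document and the function `extract_header_fields` are assumptions made for this example; the actual LAUDATIO header structure is richer and is not reproduced here.

```python
import xml.etree.ElementTree as ET

# Namespace of TEI P5 documents.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented minimal sample, for illustration only.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Example corpus</title>
        <author>Jane Doe</author>
        <author>John Doe</author>
      </titleStmt>
      <publicationStmt><p>Unpublished sample</p></publicationStmt>
      <sourceDesc><p>Invented for illustration</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text><body><p>Sample text</p></body></text>
</TEI>"""


def extract_header_fields(xml_text):
    """Extract title and authors from teiHeader/fileDesc/titleStmt."""
    root = ET.fromstring(xml_text)
    stmt = root.find("tei:teiHeader/tei:fileDesc/tei:titleStmt", TEI_NS)
    if stmt is None:
        raise ValueError("no titleStmt found in TEI header")
    title = stmt.findtext("tei:title", default="", namespaces=TEI_NS)
    authors = [a.text.strip() for a in stmt.findall("tei:author", TEI_NS) if a.text]
    return {"title": title.strip(), "authors": authors}


fields = extract_header_fields(SAMPLE)
print(fields)
```

A dictionary of this kind could then serve as the source document for the search index.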
LAUDATIO: View/Index mapping in ElasticSearch
LAUDATIO: ElasticSearch indexing examples
[Figures: IndexMapping, ViewMapping]
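Since the original mapping examples were shown as figures, the following is only a generic sketch of how metadata documents can be prepared for indexing with the Elasticsearch Bulk API, which expects newline-delimited JSON with an action line followed by a source line. The index name, document IDs, and fields are hypothetical; older Elasticsearch releases additionally expect a `_type` in the action line.

```python
import json


def bulk_index_lines(index, docs):
    """Serialize (doc_id, source) pairs as a Bulk API payload."""
    lines = []
    for doc_id, source in docs:
        # Action line: which index and ID the following source belongs to.
        lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
        # Source line: the metadata document itself.
        lines.append(json.dumps(source, ensure_ascii=False))
    # The Bulk API requires a trailing newline.
    return "\n".join(lines) + "\n"


# Hypothetical metadata documents.
docs = [
    ("corpus-1", {"corpus_title": "Example corpus", "corpus_language": "deu"}),
    ("corpus-2", {"corpus_title": "Another corpus", "corpus_language": "deu"}),
]
payload = bulk_index_lines("laudatio_v1", docs)
print(payload, end="")
```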
LAUDATIO: Fedora object model, illustrated with the RIDGES corpus
LAUDATIO: Schema config stored in Fedora
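A configuration stored in Fedora would be held as a datastream of a Fedora object, and the Fedora 3 REST API exposes datastream content under `/objects/{pid}/datastreams/{dsID}/content`. The sketch below only builds such a URL; the base URL, the PID `laudatio:schema`, and the datastream ID `CONFIG` are hypothetical placeholders, not the identifiers used by LAUDATIO.

```python
from urllib.parse import quote


def datastream_content_url(base_url, pid, ds_id):
    """Build the Fedora 3 REST URL for the content of one datastream."""
    base = base_url.rstrip("/")
    # PIDs have the form 'namespace:id'; the colon is kept unescaped.
    safe_pid = quote(pid, safe=":")
    safe_ds = quote(ds_id, safe="")
    return f"{base}/objects/{safe_pid}/datastreams/{safe_ds}/content"


# Hypothetical base URL, PID, and datastream ID.
url = datastream_content_url("http://localhost:8080/fedora/", "laudatio:schema", "CONFIG")
print(url)
```

A GET request to such a URL returns the stored datastream, so an import or validation step can fetch the schema configuration at run time.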
If you have questions, please contact us:
• Dennis Zielke, Humboldt-Universität zu Berlin, E-Mail: [email protected]
• Daniel Jettka, Hamburg Centre of spoken language corpora, E-Mail: [email protected]