Linked data in practice in digital humanities projects
Eetu Mäkelä, D.Sc.
Assistant professor in digital humanities, University of Helsinki
Adjunct professor in computer science, Aalto University
Linked digital humanities research process
Ideal digital humanities research process
[Diagram: iterative exploration of data. Labels: raw data, processing tools, analysis tools, results, research articles]
Sources for open data in the digital humanities - the good
• Great aggregators pushing for CC0 licenses, publishing participating data: Europeana, Digital Public Library of America & The European Library
• Influential national libraries moving to co-operative open (linked) data – Library of Congress, Deutsche Nationalbibliothek, British Library, Bibliothèque nationale de France
• Museums, Galleries and Archives catching up: British Museum, Finnish National Gallery, …
• Glue available: VIAF, CIDOC-CRM, Getty AAT, TGN, ULAN, CONA, Pleiades, ...
Sources for open data in the digital humanities - the bad
• Academic libraries have a long tradition of collaborating with library service companies (primarily EBSCO Information Services, ProQuest LLC and Gale Cengage Learning) to produce services
• Often, they also participate in content creation projects, and then hold the rights for that content – e.g. Early English Books Online (ProQuest), Nineteenth Century Collections Online (Gale), State Papers Online (Gale)
• But, this is also a wider culture inside humanities, e.g. Electronic Enlightenment
Data in practice
Library catalogue contents
Leader *****ngm 22*****1a 4500
245 04 $a The Adventures of Safety Frog. $p Fire safety $h [videorecording] / $c Century 21 Video, Inc.
246 30 $a Fire safety $h [videorecording]
260 ## $a Van Nuys, Calif. : $b AIMS Media, $c 1988.
300 ## $a 1 videocassette (10 min.) : $b sd., col. ; $c 1/2 in.
500 ## $a Cataloged from contributor's data.
520 ## $a Safety Frog teaches children to be fire safe, explaining that smart kids never play with matches. She shows how smoke detectors work and explains why they are necessary. She also describes how to avoid household accidents that lead to fires and how to stop, drop, and roll if clothing catches fire.
521 ## $a Elementary grades.
530 ## $a Issued also as motion picture.
538 ## $a VHS.
650 #0 $a Fire prevention $v Juvenile films.
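To make the structure of such a catalogue record concrete, here is a minimal Python sketch that splits one field line of the textual MARC layout shown above into its tag, indicators and subfields. The function name `parse_marc_line` is hypothetical, and the sketch assumes one field per line and that no subfield value itself contains a `$` sign. Real MARC processing would normally go through a dedicated library.

```python
import re


def parse_marc_line(line: str) -> dict:
    """Split one textual MARC field line into tag, indicators and subfields.

    Assumes the layout "TAG INDICATORS $a value $b value ..." (an assumption
    about the display format, not a full MARC parser).
    """
    tag, indicators, rest = line.split(" ", 2)
    # Each subfield starts with "$" followed by a one-character code.
    subfields = [
        (code, value.strip())
        for code, value in re.findall(r"\$(\w)\s*([^$]*)", rest)
    ]
    return {"tag": tag, "indicators": indicators, "subfields": subfields}


field = parse_marc_line("260 ## $a Van Nuys, Calif. : $b AIMS Media, $c 1988.")
print(field["tag"], field["subfields"])
```

Note that the punctuation belonging to the display convention (the trailing ":" and ",") stays inside the subfield values, which is one reason such data needs cleanup before analysis.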
Documentation!!!
• 81 pages of documentation on the exact annotation practices used in the digital edition of the Potage Dyvers
• Library cataloguing standards:
  – 302 pages of ISBD
  – 750 pages AACR, 1056 pages of RDA
• 1020 pages of the SPECTRUM standard for museum cataloguing
• A single page of field descriptions in the Schoenberg database
The missing documentation
• “We changed our cataloguing standards once in the 80’s, and then a second time in 1998.”
• “Most of our older entries have actually been copied from the national library that has different cataloguing standards”
• “A lot of the publications from the middle of the 18th century are simply missing, as they were never indexed.”
• “This database was gathered based on the whimsies of what the participating researchers researched. It’s probably thus quite biased.”
Open data in the digital humanities - the ugly
● Different forms of encoding, typos
(Paris,)
Paris
[Paris,]
(Paris)
A Paris
À Paris
(Paris.)
[A Paris]
[Paris]
(Paris
Amsterdam. - et Paris
Amsterdam ; et Paris
Amsterdam. - et à Paris
Amsterdam [Paris]
(Paris. - Amsterdam
A Amsterdam [i. e. Paris]. M. DCC. LXX.
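As an illustration of the kind of cleanup these variants call for, the following Python sketch collapses the simple single-place forms into one key. The function `normalize_place` and its rules are assumptions made for this example only. It deliberately does not handle the multi-place imprints (for instance "Amsterdam. - et Paris"), which would need to be split first, and the rule that drops a leading "A"/"À" would damage place names that genuinely start with that word.

```python
import re
from collections import Counter


def normalize_place(raw: str) -> str:
    """Collapse bracket, punctuation and preposition variants of a place name."""
    text = re.sub(r"[\[\]()]", "", raw)       # drop cataloguer's brackets
    text = text.strip(" .,;:-")               # drop surrounding punctuation
    text = re.sub(r"^[AÀaà]\s+", "", text)    # drop a leading French "A"/"À"
    return text


variants = [
    "(Paris,)", "Paris", "[Paris,]", "(Paris)", "A Paris",
    "À Paris", "(Paris.)", "[A Paris]", "[Paris]", "(Paris",
]
counts = Counter(normalize_place(v) for v in variants)
print(counts)
```

Counting the normalized keys, as done here, is a quick way to check how many distinct spellings remain after each cleanup rule is added.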
Data woes: viaf.org ● Automatic conversions from “Lastname, Firstname” to “Firstname Lastname” do not always work due to bad data
Charles-Victor Prévost d'Arlincourt Charles Victor Prévôt ˜d'œ Arlincourt Charles Victor Prevot d' Arlincourt Arlincourt
http://viaf.org/viaf/41896578/
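The following Python sketch shows why a naive conversion from “Lastname, Firstname” to “Firstname Lastname” can go wrong. The function `invert_name` is hypothetical, and the input strings are assumed forms chosen for illustration, not values taken from VIAF.

```python
def invert_name(heading: str) -> str:
    """Naively turn "Lastname, Firstname" into "Firstname Lastname"."""
    if "," not in heading:
        # Nothing to invert: the heading is returned unchanged.
        return heading.strip()
    last, rest = heading.split(",", 1)
    return f"{rest.strip()} {last.strip()}"


# Assumed input forms, for illustration only.
print(invert_name("Arlincourt, Charles-Victor Prévost d'"))
print(invert_name("Arlincourt"))
```

When the particle "d'" is stored at the end of the forename part, the naive rule leaves a stray space after the apostrophe, and a heading without a comma yields only the bare surname.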
Data woes: Europeana
http://labs.europeana.eu/api/linked-open-data-data-downloads
Digital humanities research process in practice
[Diagram: iterative cleanup, exploration of data. Labels: raw data, cleanup tools, clean data, understanding data, processing tools, analysis tools, results, research articles]
Digital humanities research process in practice
[Diagram: iterative cleanup, exploration of data, with attendant tool development. Labels: raw data, cleanup tools, clean data, understanding data, processing tools, analysis tools, results, research articles]
Collaborative digital humanities
Digital humanities
Linked digital humanities research process
[Diagram: iterative integration, cleanup, exploration of data, all with attendant tool development. Labels: integration tools, cleanup tools, clean data, understanding data, processing tools, analysis tools, results, research articles]
Linked digital humanities process in practice?
[Diagram: the label "research articles??" appears three times]
What makes for good Linked Data?
● Data that is part of a wider context
● The hardest part is manifesting the network of relationships
  ○ Use existing vocabularies for attribute values!
    ■ People: VIAF, ULAN, ...
    ■ Places: GeoNames, TGN, Pleiades, ...
    ■ Concepts & General: DBpedia, AAT, LCSH, Iconclass, …
    ■ Events: ?
  ○ Everybody seems to obsess over schemas, but they are actually not that important (but do help)
    ■ CIDOC-CRM, EDM, BIBFRAME, ORG, RELATIONSHIP, FOAF, schema.org, SKOS, Geo, ...
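To illustrate what using an existing vocabulary for an attribute value can look like, here is a minimal Python sketch that writes two N-Triples statements linking a local person record to a VIAF identifier. The `example.org` URI and the helper `ntriple` are hypothetical; the VIAF URI is the one shown earlier in these slides, and the choice of `owl:sameAs` as the linking property is an assumption made for the example.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


def ntriple(subject: str, predicate: str, obj: str, literal: bool = False) -> str:
    """Render one statement in N-Triples syntax."""
    if literal:
        escaped = obj.replace("\\", "\\\\").replace('"', '\\"')
        rendered = f'"{escaped}"'
    else:
        rendered = f"<{obj}>"
    return f"<{subject}> <{predicate}> {rendered} ."


# Hypothetical local identifier for a person in a project database.
local_person = "http://example.org/person/123"

lines = [
    # The link to the shared authority file is what places the record in a wider context.
    ntriple(local_person, OWL_SAME_AS, "http://viaf.org/viaf/41896578/"),
    ntriple(local_person, RDFS_LABEL, "Charles-Victor Prévost d'Arlincourt", literal=True),
]
print("\n".join(lines))
```

The schema used for the remaining attributes matters less than this link: once the record points at a shared identifier, it can be joined with any other dataset that points at the same one.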
A second view
Linked digital humanities research process
Digital humanities workflow
0. Formulate research questions
1. Discover relevant data
2. Ingest and integrate data
3. Explore and interpret data
4. Publish interpretations, also as data
Enabling a Virtuous Cycle
0. Formulate research questions
1. Discover relevant data
2. Ingest and integrate data
3. Explore and interpret data
4. Publish interpretations, also as data
e.g. the prosopographical records in Early Modern Letters Online (EMLO) originating from research into primary sources
Digital humanities workflow
0. Formulate research questions
1. Discover relevant data: How? [tool names partly obscured in the slide: "…r, vocab.a…"]
2. Ingest and integrate data: How?
3. Explore and interpret data: How? [tool names partly obscured in the slide: "…o, Europeana 4D, VISU, Khepri, SAHA, RelFinder, ..."]
4. Publish interpretations, also as data: How?
Digital humanities workflow
• Model
• Create
• Convert
• Publish
• Discover
• Integrate
• Explore
Linked Humanities Case Studies
180 people, 33 countries, led by Oxford
EMLO as Linked Data
● SAHA
● Palladio
● EMLO, EE and D’Alembert
● EMLO and BNF
http://tinyurl.com/y9wcakhr
Aspects covered
• Model
• Create
• Convert
• (Publish)
• Discover
• Integrate
• (Explore)
Data model
[Diagram of entity types: Catalogue, Catalogue Entry, Place, Auction, Manuscript, Work, Sale, Gift, Possession, Time, Actor]
Created events and their temporal relationships:
- someone possessed the manuscript at least before the auction
- someone else may possess the manuscripts after the auction if the auction contains a sale or gift event
- provenance info creates possession events that are in the stated order
Other notable types and their relationships:
- Catalogues have entries that refer to manuscripts that may be comprised of multiple works
- Works are created, owned, sold and bought by actors
- Manuscripts also have a lot of other metadata, e.g. time, place, number of miniatures and so on
Sample queries
● Where are the manuscripts collected by Sir Thomas Phillipps from?
● How many hands have the manuscripts passed through?
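A minimal Python sketch of the event-based part of this model, and of the second sample query, might look as follows. All class and function names are hypothetical, the actors are placeholders, and the real data model is richer (catalogues, entries, works, places, times). The sketch only shows how sale events can be turned into ordered possession events and then counted.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Actor:
    name: str


@dataclass
class Possession:
    owner: Actor
    order: int  # position in the stated provenance order


@dataclass
class Manuscript:
    title: str
    possessions: list[Possession] = field(default_factory=list)

    def add_owner(self, owner: Actor) -> None:
        self.possessions.append(Possession(owner, len(self.possessions)))


def record_sale(ms: Manuscript, seller: Actor, buyer: Actor) -> None:
    """A sale implies the seller possessed the manuscript before it, the buyer after."""
    if not ms.possessions or ms.possessions[-1].owner != seller:
        ms.add_owner(seller)
    ms.add_owner(buyer)


def hands_passed_through(ms: Manuscript) -> int:
    """Count the distinct actors that have possessed the manuscript."""
    return len({p.owner for p in ms.possessions})


# Placeholder actors and manuscript, for illustration only.
a, b, c = Actor("Actor A"), Actor("Actor B"), Actor("Actor C")
ms = Manuscript("Example manuscript")
record_sale(ms, seller=a, buyer=b)
record_sale(ms, seller=b, buyer=c)
print(hands_passed_through(ms))
```

Modelling possession as its own event, rather than as an attribute of the manuscript, is what makes a question like "how many hands" a simple count.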
Aspects covered
• Model
• Create
• Convert
• (Publish)
• Discover
• (Integrate)
• (Explore)
Interfacing structured and unstructured data in sociolinguistic research on language change
LINGUISTIC QUESTIONS
• Social meaning of spelling variation in historical periods of English and Finnish
• Neologisms in early English correspondence
TOOLS AND MATERIALS
• Developing a modular research toolkit for sociolinguistic analysis
RESEARCH GROUP:
• Terttu Nevalainen (PI; University of Helsinki)
• Samuli Kaislaniemi (University of Helsinki)
• Anni Sairio (University of Helsinki)
• Taru Nordlund (PI; University of Helsinki)
• Eetu Mäkelä (Aalto University)
• Tanja Säily (University of Helsinki)
• Katja Litola (University of Helsinki)
• Poika Isokoski (PI; University of Tampere)
• Anna Merikallio (University of Helsinki)
• Johanna Utriainen (University of Helsinki)
• Harri Siirtola (University of Tampere)
Use by gender
Use by societal rank: clergy vs gentry
Use by societal rank: professionals vs others
Aspects covered
• Model
• Create
• Convert
• (Publish)
• Discover
• Integrate
• Explore
With Thea Lindquist, University of Colorado
Contextual Reader
Support close reading in an unfamiliar domain
1. Automatically give context
2. Locate other sources relevant to the topic across distributed collections, requiring as little as possible from the distributed collections
A Contextual Reader for First World War Primary Sources
Demonstrative documents:
• a primary source PDF from the CU-Boulder WWI Collection Online
• a postcard with metadata from the Great War Archive
• an encyclopedic article from 1914 - 1918 Online
A Contextual Reader for WWI Primary Sources
Under the Hood: Dynamic Entity Extraction
1. Extract content from HTML/PDF in browser
2. Call language analysis web service to generate query terms
3. Query Linked Data repository for context information
→ Items under study do not need to be formally annotated!
Vocabularies used:
1. WW1LOD
2. 1914 - 1918 Online Vocabularies
3. Europeana 1914 - 1918 Thesaurus
4. Out of the Trenches (PCDHN-LOD)
5. Trenches to Triples
6. DBpedia
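The three extraction steps can be sketched in Python as below. This is only an illustration under stated assumptions: the language analysis web service is replaced by plain lower-cased word n-grams, the Linked Data repository by a small in-memory dictionary, and all names (`extract_text`, `candidate_terms`, `find_entities`, `VOCABULARY`) and the `example.org` URIs are hypothetical.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML document."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []

    def handle_data(self, data: str) -> None:
        self.parts.append(data)


def extract_text(html: str) -> str:
    """Step 1: extract content from HTML."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def candidate_terms(text: str, max_words: int = 3) -> list[str]:
    """Step 2 stand-in: word n-grams instead of a language analysis service."""
    words = re.findall(r"\w+", text.lower())
    return [
        " ".join(words[i:i + n])
        for n in range(max_words, 0, -1)
        for i in range(len(words) - n + 1)
    ]


# Step 3 stand-in: a tiny label-to-URI index instead of a Linked Data repository.
VOCABULARY = {
    "western front": "http://example.org/ww1/place/western-front",
    "ypres": "http://example.org/ww1/place/ypres",
}


def find_entities(html: str) -> dict[str, str]:
    """Return the vocabulary entries mentioned in the document."""
    terms = candidate_terms(extract_text(html))
    return {t: VOCABULARY[t] for t in terms if t in VOCABULARY}


page = "<p>Letters from the <b>Western Front</b>, near Ypres.</p>"
entities = find_entities(page)
print(entities)
```

Because matching happens at reading time against the vocabulary, the page itself carries no annotations, which is the point made on the slide.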
Under the Hood: Query Expansion
1. Gather cross-lingual, alternate term information from vocabularies
2. Query remote web services for related content
→ Related content does not need to be formally annotated!
Repositories used:
1. CU-Boulder WWI Collection Online
2. WW1 Discovery
3. Europeana
4. Digital Public Library of America
5. The European Library
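Query expansion along these lines can be sketched as follows in Python. The vocabulary entry, its labels, the function names and the search endpoint `https://repository.example.org/search` are all assumptions for illustration; each real repository has its own query API.

```python
from urllib.parse import urlencode

# Hypothetical vocabulary entry with labels in several languages.
CONCEPT_LABELS = {
    "http://example.org/ww1/concept/trench-warfare": {
        "en": ["trench warfare"],
        "de": ["Grabenkrieg"],
        "fr": ["guerre de tranchées"],
    },
}


def expand_term(term: str) -> list[str]:
    """Step 1: gather all labels of every concept that has `term` as a label."""
    expanded: set[str] = set()
    for labels_by_language in CONCEPT_LABELS.values():
        all_labels = [l for labels in labels_by_language.values() for l in labels]
        if term.lower() in (l.lower() for l in all_labels):
            expanded.update(all_labels)
    return sorted(expanded)


def build_query_url(term: str, endpoint: str = "https://repository.example.org/search") -> str:
    """Step 2: build one keyword query for a (hypothetical) remote search service."""
    labels = expand_term(term) or [term]
    query = " OR ".join(f'"{label}"' for label in labels)
    return f"{endpoint}?{urlencode({'q': query})}"


labels = expand_term("trench warfare")
url = build_query_url("trench warfare")
print(labels)
print(url)
```

Since the expansion produces an ordinary keyword query, the remote collections only need to offer full-text search.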
A Contextual Reader for Ancient Texts
Demonstrative documents:
● an English translation of Caesar's Gallic War in the Perseus Hopper
● a Latin text by Livy
→ The language analysis step allows support for highly inflected languages; multilingual Linked Data enables crossing language boundaries
A Contextual Reader for Ancient Texts
Vocabularies used:
1. Pleiades gazetteer of ancient places
2. English and Latin DBpedias
Repositories used:
1. Pelagios datasets
2. Perseus Catalog
3. Europeana
4. Digital Public Library of America
5. The European Library
A Contextual Reader for Finnish Law
Demonstrative documents:
● a Finnish law
● a Finnish supreme court decision
● a statement by the standing committee on law on a law in preparation
● a news article concerning a law
A Contextual Reader for Finnish Law
Vocabularies used:
1. legal terminology in the Finnish Terminology Bank
2. Asseri legal vocabulary
3. Edilex legal vocabulary
4. Talentum legal vocabulary
5. Legal terminology section of the Finnish DBpedia
Repositories used:
1. Finlex consolidated legislation
2. Finlex precedents of Finnish supreme courts
3. Edilex legal news
Aspects covered
• Model
• Create
• Convert
• Publish
• Discover
• Integrate
• Explore
With Hans Wietzke, Stanford
Ancient Name Dropping
Ancient Name Dropping, with Hans Wietzke, Stanford
Detecting clusters in references to authority in ancient Greek texts on natural science – co-citations hand-curated by Hans Wietzke
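One simple way to look for clusters in co-citation data is sketched below in Python. It is not necessarily the method used in the project: it counts how often two authorities are cited in the same text and then groups authorities connected by at least `min_count` shared texts. The texts and authorities ("T1", "A", ...) are placeholders, and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(citations: dict[str, set[str]]) -> Counter:
    """Count, for each pair of authorities, the texts that cite both."""
    counts: Counter = Counter()
    for cited in citations.values():
        for pair in combinations(sorted(cited), 2):
            counts[pair] += 1
    return counts


def clusters(counts: Counter, min_count: int = 2) -> list[set[str]]:
    """Connected components of the graph of pairs co-cited at least min_count times."""
    neighbours: dict[str, set[str]] = {}
    for (a, b), n in counts.items():
        if n >= min_count:
            neighbours.setdefault(a, set()).add(b)
            neighbours.setdefault(b, set()).add(a)
    seen: set[str] = set()
    result: list[set[str]] = []
    for start in sorted(neighbours):
        if start in seen:
            continue
        component: set[str] = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        result.append(component)
    return result


# Placeholder data: which authorities each text refers to.
citations = {
    "T1": {"A", "B"},
    "T2": {"A", "B", "C"},
    "T3": {"C", "D"},
    "T4": {"C", "D"},
}
counts = cocitation_counts(citations)
found = clusters(counts, min_count=2)
print(found)
```

Raising `min_count` trades recall for cleaner clusters, so with hand-curated data a low threshold is usually a reasonable starting point.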
Aspects covered
• Model
• Create
• Convert
• (Publish)
• Discover
• Integrate
• Explore
with Dan Edelstein and Nicole Coleman, Stanford
Fibra – human scale tool for linked data that supports critical inquiry
1. Source information from linked datasets
2. Organize and add to data in order to build an argument
3. Capture both the data and the reasoning behind it so it will have context within the scholarly community
4. Publish the new knowledge to the community where it can be cited, re-used and built upon by others.
Fibra Construct
Aspects covered
• Model
• (Convert)
• Publish
• (Discover)
• Integrate
• Explore
Final view
Linked digital humanities research process
Digital Humanities Workflow
0. Formulate research questions
1. Discover relevant data: How? [tool names partly obscured in the slide: "…r, vocab.a…"]
2. Ingest and integrate data: How?
3. Explore and interpret data: How? [tool names partly obscured in the slide: "…o, Europeana 4D, VISU, Khepri, SAHA, RelFinder, ..."]
4. Publish interpretations, also as data: How?
Tools for phases of the cycle
[Diagram. Phases: Understand, Import, Edit, Reconcile, Organize, Explore, Publish. Other labels: Bulk, Local. Tools shown: Aether, vocab.at, Voyager, (ARPA), Karma, OpenRefine (twice), Breve, FiCa, Recon, Silk, Wrangler, SAHA, Snapper, Fibra, SKOSJS, VISU, LDF.fi, Palladio, Octavo, Khepri, nodegoat (twice)]
url: http://seco.cs.aalto.fi/u/jiemakel @jiemakel