Interfacing structured and unstructured data in sociolinguistic research on language change
Terttu Nevalainen, Samuli Kaislaniemi, Anni Sairio, Tanja Säily, Anna Merikallio, Taru Nordlund, Katja Litola, Johanna Utriainen, Eetu Mäkelä, Poika Isokoski, Harri Siirtola
Linguistic end research questions
• Social meaning of spelling variation in historical periods of English and Finnish
• Social variation in language productivity in early English correspondence
Subprojects – the long road from data to questions
1. From letters to data (Finnish)
2. Quality control of data (English)
3. Tools for linguistic research (English ⇒ Finnish)
4. Linguistic research
From letters to data Katja Litola, Johanna Utriainen
From handwritten letters to structured data
Corpus building
Digital letter corpus of Early Modern Finnish:
● Socially, temporally and regionally balanced corpus of letters from the long 19th century (1800-1921).
● Handwritten letters unearthed from public and private archives around Finland.
● Transcribing, checking and re-checking about one thousand handwritten letters.
Critical points:
● Working with original manuscript material is extremely laborious and time-consuming; it also involves negotiations with memory organizations, legal restrictions on publishing online, and protection of identity.
Quality control of data Samuli Kaislaniemi, Anni Sairio
Linguistic variation: give vs giue, up vs vp
[Figure: "old" =?= "new"]
What the editor says: “letters in the present edition are published … precisely as written”
Reality: checking editions vs manuscripts
Example: spelling variation in 17th-century letters
[Figure panels: All editions; Assess editions; ‘Best’ editions]
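Checking an edition against the manuscript can be sketched as a token-by-token comparison of two transcriptions. The snippet below is only an illustration, not the project's actual workflow: the function name `spelling_differences` and both sample strings are invented here, reusing the give/giue and up/vp pairs from the slide above.

```python
from difflib import SequenceMatcher


def spelling_differences(manuscript, edition):
    """Return (manuscript form, edition form) pairs where the two texts differ."""
    ms_tokens = manuscript.split()
    ed_tokens = edition.split()
    matcher = SequenceMatcher(a=ms_tokens, b=ed_tokens, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Keep every replaced, inserted or deleted stretch of tokens.
            diffs.append((" ".join(ms_tokens[i1:i2]), " ".join(ed_tokens[j1:j2])))
    return diffs


# Invented sample: a manuscript reading and a silently modernised edition.
manuscript = "I giue it vp"
edition = "I give it up"
differences = spelling_differences(manuscript, edition)
```

An edition that really is printed "precisely as written" would yield an empty list; the more pairs such a comparison returns for a sample of letters, the less suitable the edition is for studying spelling variation.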
Tools for linguistic research Harri Siirtola, Eetu Mäkelä
Tools for linguistic research
• Starting from two tools: TVE & Khepri
• Moving to develop tools in dialogue with and as part of end user linguistic research
Linguistic research (and tools for such) Tanja Säily, Eetu Mäkelä, Jukka Suomela
Case study: derivational productivity of -er and -or
● Verb + suffixes -er and -or: driver, governor, filler
● Corpora of Early English Correspondence: spelling variation, false positives
○ er(e), ar(e), or(e), our(e), owr(e), ur(e), r + plural, possessive…
○ \S*(([rR]|[eEoO]~)(=?|=?[eE]=?|[='~]*[eEiIyY]?[='~]*[sSzZ][=']*))(?![a-zA-Z'~=+])
○ 6800 candidate words, 400 000 appearances
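How such a pattern collects candidate words can be sketched in Python. This is a minimal illustration, not the tool used in the project: the pattern is taken from the slide, reading its final group as a negative lookahead `(?!...)`, while the function name `candidate_counts` and the sample sentence are invented here.

```python
import re
from collections import Counter

# Pattern from the slide; the last group is assumed to be a negative lookahead.
CANDIDATE = re.compile(
    r"\S*(([rR]|[eEoO]~)(=?|=?[eE]=?|[='~]*[eEiIyY]?[='~]*[sSzZ][=']*))"
    r"(?![a-zA-Z'~=+])"
)


def candidate_counts(text):
    """Count every lower-cased string that matches the candidate pattern."""
    return Counter(m.group(0).lower() for m in CANDIDATE.finditer(text))


# Invented sample sentence, not corpus data.
sample = "The driver and the governors wrote a letter to the fillers of vp"
counts = candidate_counts(sample)
```

On this sample the pattern picks up driver, governors and fillers, but also letter, which is not a verb + suffix formation. That kind of false positive is why the candidate list needs manual study afterwards.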
FiCa
Derivational productivity of -er and -or
● 5080 words out of 6800 irrelevant after manual study
● 153 words out of 6800 needed further study
○ 11768 individual uses
Case study: newly coined words
• Compare Corpus of Early English Correspondence words to:
• The millions of words in Eighteenth Century Collections Online, Early English Books Online, British Library Newspapers, Burney Collection, Nichols Collection
• Structured information in the Oxford English Dictionary
http://j.mp/s-makela
This presentation: http://j.mp/stratas-l