Interfacing structured and unstructured data in sociolinguistic research on language change
Terttu Nevalainen, Samuli Kaislaniemi, Anni Sairio, Tanja Säily, Anna Merikallio, Taru Nordlund, Katja Litola, Johanna Utriainen, Eetu Mäkelä, Poika Isokoski, Harri Siirtola

Linguistic end research questions
• Social meaning of spelling variation in historical periods of English and Finnish
• Social variation in language productivity in early English correspondence

Subprojects – the long road from data to questions
1. From letters to data (Finnish)
2. Quality control of data (English)
3. Tools for linguistic research (English ⇒ Finnish)
4. Linguistic research

From letters to data Katja Litola, Johanna Utriainen

From handwritten letters to structured data

Corpus building
Digital letter corpus of Early Modern Finnish:
● Socially, temporally and regionally balanced corpus of letters from the long 19th century (1800-1921).
● Handwritten letters unearthed from public and private archives around Finland.
● Transcribing, checking and re-checking about one thousand handwritten letters.

Critical points:
● Working with original manuscript material is extremely laborious and time-consuming; negotiations with memory organizations; legal restrictions to publish on-line; protection of identity.

Quality control of data Samuli Kaislaniemi, Anni Sairio

Linguistic variation: give vs giue, up vs vp
[Figure: old vs new spelling, =?=]

What the editor says: “letters in the present edition are published … precisely as written”

Reality: checking editions vs manuscripts
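
A check of an edition against its manuscript can be sketched in a few lines of Python. This is only an illustration under strong assumptions: the two transcriptions are taken to be already aligned token by token, only u/v variation (give vs giue, up vs vp) is considered, and the names `uv_key` and `modernised_tokens` are hypothetical, not part of the project's tooling.

```python
def uv_key(word: str) -> str:
    """Collapse u/v so that 'giue' and 'give' compare as the same word."""
    return word.lower().replace("v", "u")


def modernised_tokens(manuscript: list, edition: list) -> list:
    """Return (manuscript, edition) pairs that differ only in u/v usage.

    Assumes both lists are aligned token by token (illustrative simplification).
    """
    return [
        (m, e)
        for m, e in zip(manuscript, edition)
        if m != e and uv_key(m) == uv_key(e)
    ]


# Illustrative tokens built from the spellings on the slide, not corpus data.
manuscript = ["I", "giue", "vp", "the", "letter"]
edition = ["I", "give", "up", "the", "letter"]
print(modernised_tokens(manuscript, edition))
```

Any pair reported here would suggest that the editor has silently modernised the spelling, so the edition is not "precisely as written" for this feature.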

Example: spelling variation in 17th-century letters
All editions ⇒ Assess editions ⇒ ‘Best’ editions

Tools for linguistic research Harri Siirtola, Eetu Mäkelä

Tools for linguistic research
• Starting from two tools: TVE & Khepri
• Moving to develop tools in dialogue with and as part of end user linguistic research

Linguistic research (and tools for such) Tanja Säily, Eetu Mäkelä, Jukka Suomela

Case study: derivational productivity of -er and -or
● Verb + suffixes -er and -or: driver, governor, filler
● Corpora of Early English Correspondence: spelling variation, false positives
○ er(e), ar(e), or(e), our(e), owr(e), ur(e), r + plural, possessive…
○ \S*(([rR]|[eEoO]~)(=?|=?[eE]=?|[='~]*[eEiIyY]?[='~]*[sSzZ][=']*))(?![a-zA-Z'~=+])
○ 6800 candidate words, 400 000 appearances
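
To see how such a search pattern over-generates, it can be run on a short sentence. The sketch below uses Python's `re` module and reads the final group of the pattern as a negative lookahead `(?!...)`, which is an assumption about the intended expression. The characters `=`, `~` and `'` appear to be corpus markup and are left as they are. The function name `find_candidates` and the sample sentence are illustrative only.

```python
import re

# Candidate pattern from the slide; the last group is read as a
# negative lookahead (an assumption about the intended expression).
CANDIDATE = re.compile(
    r"\S*(([rR]|[eEoO]~)"
    r"(=?|=?[eE]=?|[='~]*[eEiIyY]?[='~]*[sSzZ][=']*))"
    r"(?![a-zA-Z'~=+])"
)


def find_candidates(text: str) -> list:
    """Return every whitespace-delimited stretch that the pattern accepts."""
    return [match.group(0) for match in CANDIDATE.finditer(text)]


# Illustrative sentence, not taken from the corpus.
sample = "The driver and the governor gave vp their letters"
print(find_candidates(sample))
```

On this sentence the pattern picks up "driver" and "governor" but also "their" and "letters", which are not agent nouns. False positives of this kind are why the candidate list has to be studied manually.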

FiCa

Derivational productivity of -er and -or
● 5080 words out of 6800 irrelevant after manual study
● 153 words out of 6800 needed further study
○ 11768 individual uses

Case study: newly coined words
• Compare Corpus of Early English Correspondence words to:
• The millions of words in Eighteenth Century Collections Online, Early English Books Online, British Library Newspapers, Burney Collection, Nichols Collection
• Structured information in the Oxford English Dictionary

http://j.mp/s-makela
This presentation: http://j.mp/stratas-l
