From sociolinguistic interviews to a spoken corpus of London English

ICAME 34, 22-26 May 2013, Santiago de Compostela, Spain

From sociolinguistic interviews to a spoken corpus of London English: Creating the Linguistic Innovators Corpus (LIC)
Costas Gabrielatos, Sebastian Hoffmann & Eivind Torgersen
Part of the project: Analysis of spoken London English using corpus tools

Linguistic Innovators: The English of Adolescents in London • Change and innovation in British English are supposed to originate in London, but this had so far never been tested • ESRC-funded project 2004-2007 – Research questions • Where do innovations take place? • How do they spread? • Who are the innovators?

Main findings • External factors play a large role in language change and innovation in London English – Inner city – Non-Anglo ethnicity – Male speakers – Diverse friendship network

Transcribing sociolinguistic data • Previous sociolinguistic studies in the UK – Mainly interested in phonological variation • Only partly transcribed orthographically • Qualitative analysis of grammatical features

• Sali Tagliamonte’s York datasets – All interview data transcribed orthographically – Manual analysis of morpho-syntactic features • Quantitative data

Linguistic Innovators study – Meta-data for all speakers in the sample – Transcribe complete interview data – Quantitative analysis of morpho-syntactic features • Past tense BE • Quotatives, in particular BE LIKE

– Analysis envisaged • Manual search of individual transcriptions

Transcriber software • Separate tiers in transcriptions not possible (unlike ELAN and Praat) – Backchannels in [ ] as part of other speakers’ turns – / / used to indicate overlapping speech by a maximum of two speakers

Backchannels

Mark: I don't really know any local clubs around here
Tina: there's one down there actually called {name of club} just round the corner [Mark: mm] wouldn't recommend it myself
Mark: I used to work in a bar down near Liverpool Street where my brother used to work called {name of bar} [Sue: mm] and I go there sometimes cos I stopped working there now but I still go there sometimes [Sue: mm] was there a couple of weeks ago for a party that was there
Sue: mhm have you got part-time jobs?
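
The bracket convention above lends itself to automatic processing. The following is a minimal Python sketch, not the project's own tooling, that separates embedded backchannels from the main turn. It assumes every backchannel has the exact shape `[Speaker: text]`; the function name `split_backchannels` is hypothetical.

```python
import re

# Assumption: a backchannel is written as "[Speaker: text]" inside
# another speaker's turn, as in the excerpt above.
BACKCHANNEL = re.compile(r"\[(?P<speaker>[^\]:]+):\s*(?P<text>[^\]]*)\]")


def split_backchannels(turn_text):
    """Return (main_text, backchannels) for the text of one turn.

    backchannels is a list of (speaker, text) tuples in order of occurrence.
    """
    backchannels = [
        (m.group("speaker").strip(), m.group("text").strip())
        for m in BACKCHANNEL.finditer(turn_text)
    ]
    # Remove the bracketed material and collapse the leftover whitespace.
    main_text = BACKCHANNEL.sub(" ", turn_text)
    main_text = re.sub(r"\s+", " ", main_text).strip()
    return main_text, backchannels


turn = (
    "there's one down there actually called {name of club} just round "
    "the corner [Mark: mm] wouldn't recommend it myself"
)
main, found = split_backchannels(turn)
print(main)
print(found)
```

Keeping the backchannels as separate (speaker, text) pairs means their words can later be attributed to the right speaker rather than to the holder of the turn.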

Overlapping speech

Tina: me {name of road} #1 the same yea #
Sue: #2 oh the same? oh # #1 are you on the same? #
Mark: #2 yeah she lives at the back #
Sue: is it an estate that you live on?
Mark: #1 no it's sort of /like she lives at one end/ of the road I'm at the other end #
Tina: #2 /no he lives on the estate/ #

Issues with transcriptions • Non-unique speaker names – E.g. Kelly (Kelly1, Kelly2)

• Variable transcription of some items – Yeah/yea/yeh, innit/init, blood/blud

• Inconsistent transcription of disfluencies – Uh, eh, uhm, erm, uhhrm etc.

• Inaccuracies in annotations – E.g. missing second element in { } [ ] / / ( )
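
Two of the issues above (variable spellings and missing closing delimiters) can be flagged or normalised automatically before conversion. The sketch below is illustrative only: the `VARIANTS` table and the choice of preferred spellings are assumptions, not the project's actual decisions, and the helper names are hypothetical.

```python
import re

# Assumption: preferred spellings chosen here purely for illustration.
VARIANTS = {"yea": "yeah", "yeh": "yeah", "init": "innit", "blud": "blood"}

# Paired annotation delimiters; "/" is handled separately because its
# opening and closing symbols are identical.
PAIRS = {"{": "}", "[": "]", "(": ")"}
CLOSERS = {close: open_ for open_, close in PAIRS.items()}


def normalise(text):
    """Replace known spelling variants with the preferred form."""
    def replace(match):
        word = match.group(0)
        return VARIANTS.get(word.lower(), word)
    return re.sub(r"[A-Za-z']+", replace, text)


def unbalanced(line):
    """Return a list of delimiter problems found in one transcript line."""
    problems = []
    stack = []
    for ch in line:
        if ch in PAIRS:
            stack.append(ch)
        elif ch in CLOSERS:
            if stack and stack[-1] == CLOSERS[ch]:
                stack.pop()
            else:
                problems.append(f"unexpected {ch}")
    problems.extend(f"unclosed {ch}" for ch in stack)
    if line.count("/") % 2:
        problems.append("odd number of /")
    return problems


print(normalise("yeh he went init"))
print(unbalanced("just round the corner [Mark: mm wouldn't recommend it"))
```

A check of this kind only reports candidate problems; deciding where a missing delimiter belongs still requires a human reading of the transcript.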

Conversion of text files

me (name of road)
    the same yea
    oh the same? oh
    are you on the same?
    yea she lives at the back
is it an estate that you live on?
no it's sort of
    like she lives at one end
of the road I'm at the other end
    no he lives on the estate
oh right

➟ Minimal conversion to enable corpus searches • Format allows restricted searches with Wordsmith (e.g. exclude words spoken by interviewers) • But some information is lost (e.g. exact sequence of overlaps)
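
To illustrate how speaker-tagged markup makes such restricted searches possible, here is a minimal Python sketch. It is not the LIC conversion itself: the element name `u`, the attributes `who` and `role`, and the role assigned to each speaker are all assumptions made for the example.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: each input line has the shape "Speaker: utterance".
TURN = re.compile(r"^(?P<speaker>[A-Za-z0-9]+):\s*(?P<text>.*)$")


def to_xml(lines, roles):
    """Wrap each turn in a <u> element carrying speaker and role attributes.

    The element and attribute names are hypothetical, chosen for this sketch.
    """
    root = ET.Element("interview")
    for line in lines:
        match = TURN.match(line.strip())
        if not match:
            continue  # skip anything that is not a speaker turn
        speaker = match.group("speaker")
        u = ET.SubElement(root, "u", who=speaker,
                          role=roles.get(speaker, "informant"))
        u.text = match.group("text")
    return root


def informant_words(root):
    """Collect the words of every turn not spoken by an interviewer."""
    words = []
    for u in root.iter("u"):
        if u.get("role") != "interviewer":
            words.extend(u.text.split())
    return words


lines = [
    "Sue: is it an estate that you live on?",
    "Tina: no he lives on the estate",
]
# Assumption: Sue is treated as the interviewer for the sake of the example.
root = to_xml(lines, roles={"Sue": "interviewer"})
print(ET.tostring(root, encoding="unicode"))
print(informant_words(root))
```

As the slide notes, a flat per-turn structure like this cannot record the exact sequence of overlaps; that would need time alignment or richer markup.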

Evaluation This is a “dirty hack”! Possible enhancements: • Fully TEI-compliant XML • Fully time-aligned turns, including all overlaps But there’s clearly a need for this type of data in a suitable format:

• Old recordings are today being transcribed: • E.g. Labov’s Philadelphia data from the 1970s • Automated analyses of phonetic features are possible if there is a very accurate transcription

Methodology: Searching and annotating pragmatic markers (PMs) • Automatic extraction of concordances of candidate wordforms together with metatextual data about speakers • Manual identification/annotation of PMs:

– Transcript (concordance analysis): co-text – Recordings: phonological and discourse features • Tabulation of PM instances – descriptive statistics combine manually annotated data with sociolinguistic information

➟ Standard corpus methodology
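
The automatic extraction step can be sketched as a simple keyword-in-context search that attaches speaker metadata to each hit. This is a toy Python illustration under stated assumptions, not the project's pipeline: the speaker ID `S1`, the metadata fields, and their placeholder values are invented for the example.

```python
def concordance(turns, target, metadata, width=4):
    """Find candidate instances of a word sequence with their co-text.

    turns:    list of (speaker, text) pairs
    target:   tuple of lowercase tokens, e.g. ("you", "get", "me")
    metadata: dict mapping speaker to a dict of sociolinguistic attributes
    width:    number of tokens of co-text kept on each side
    """
    hits = []
    n = len(target)
    for speaker, text in turns:
        tokens = text.lower().split()
        for i in range(len(tokens) - n + 1):
            if tuple(tokens[i:i + n]) == target:
                hits.append({
                    "speaker": speaker,
                    "left": " ".join(tokens[max(0, i - width):i]),
                    "node": " ".join(tokens[i:i + n]),
                    "right": " ".join(tokens[i + n:i + n + width]),
                    **metadata.get(speaker, {}),
                })
    return hits


# Hypothetical speaker ID and placeholder metadata values.
turns = [("S1", "it's my mum you get me . mm call me what you want")]
metadata = {"S1": {"age": "placeholder", "borough": "placeholder"}}

hits = concordance(turns, ("you", "get", "me"), metadata)
for hit in hits:
    print(hit)
```

Each hit is only a candidate wordform match. Whether it functions as a PM still has to be decided manually from the co-text and the recording.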

Problems with identification of PMs: an example

re still like and oohh i was just busting up boy when he told me and my friend (name) boy he's . he went with them innit and he's only young and i don't know how he m . he's short

• He’s short, you know? • You (do) know (name)?

(name) innit? yeh the one that just went prison
    that was in our . that's
in our class .
    oh (nickname)?
yeh he went with my uncle blad i

A case study of an emergent pragmatic marker (PM)

mm are you close to your mum? yeah and that they call me a mummy's boy . I don't care . it's my mum you get me . mm call me what you want .. I'm the one that's still at home . all the luxuries and they're out there . no money yeh each week . scraping through . mm . mm ... mm so where do you see yourself . a few years from now? ..

Emerging PM                    Age     Sex     Ethnicity  Residence
innit                          Young   Male    Non-Anglo  ------
yeah                           Young   ------  Non-Anglo  ------
you know                       Old     Male    Anglo      ------
ok                             ------  Female  Non-Anglo  ------
right                          Young   Female  Non-Anglo  Hackney
(do) (you) know what I mean    Young   Female  Anglo      Havering
if you know what I mean        Young   ------  ------     Havering
(do) you know what I’m saying  Young   Female  ------     Havering
you get me                     Young   ------  Non-Anglo  Hackney

Friendship Network Scores

Emerging PMs and linguistic innovation • Established PMs, irrespective of whether they are becoming more or less frequent, have a less marked ethnic distribution. • Innovation in PMs is in line with previous findings on innovation in phonology and grammar. – Inner-city, non-Anglo males are in the lead.

• New (emerging) PMs, like you get me, are currently used significantly more frequently within the multi-ethnic networks in which they have probably first emerged.

Conclusion • After a simple conversion, a dataset originally conceived for traditional sociolinguistic analysis can now be used for standard corpus analysis. • Sociolinguistic annotation, e.g. friendship network scores, allows for interdisciplinary analyses.