TEXT-TYPES IN SPEECH TECHNOLOGY AND LANGUAGE TEACHING

KRISTA KERGE
Tallinn University

HILLE PAJUPUU
Institute of the Estonian Language, Tallinn

ABSTRACT

Text-type determines the linguistic and paralinguistic means for conveying the message. The present study investigates how to discriminate between text-types and which types should be focused on in language teaching. One of the distinctive features of a text-type is text formality, which can be determined from its part-of-speech structure. Our formality analysis covers 28 oral and written texts. The results indicate that written texts are more formal than oral ones, and monologues more formal than dialogues. The results are applicable in foreign language teaching (FLT) as well as in language technology for the automatic identification of text-types.

Keywords: Estonian, text-types, parts of speech, formality, language teaching

RESUMEN

La necesidad de distinguir entre tipos de texto es importante en la enseñanza y tecnología del lenguaje dado que el tipo de texto determina el medio lingüístico y paralingüístico para transmitir el mensaje. Este estudio investiga cómo distinguir entre los tipos de texto y en qué tipos debería enfocarse la enseñanza de lenguas. Una característica del tipo de texto es su grado de formalidad, visible en las relaciones entre tipos de palabras en el texto. Nuestro análisis sobre 28 textos orales y escritos indica que el texto escrito es más formal que el oral, y el monólogo es más formal que el diálogo. Los resultados tienen aplicación en la enseñanza de lenguas y tecnología del lenguaje para la identificación automática de tipos de texto.

Palabras clave: Estonio, tipo de texto, tipos de palabras, formalidad, enseñanza de lenguas

BUENO ALONSO, Jorge L., Dolores GONZÁLEZ ÁLVAREZ, Úrsula KIRSTEN TORRADO, Ana E. MARTÍNEZ INSUA, Javier PÉREZ GUERRA, Esperanza RAMA MARTÍNEZ & Rosalía RODRÍGUEZ VÁZQUEZ (Eds). 2010: Analizar datos > Describir variación / Analysing data > Describing variation. Vigo: Universidade de Vigo (Servizo de Publicacións), 380-390

1. INTRODUCTION

In FLT, the skill-oriented model of literacy has gradually been replaced by an ideological (Street 1984) or, rather, socio-cultural model that defines literacy as a functional skill in a community's social practices (Gee 2008). Texts are approached as part of everyday activities (see Ivanič et al. 2009; Kucer 2009). The usage-based, action-oriented approach has been widely accepted in European language teaching (CEFR 2001: 8−16). Even so, relatively little use is made of the linguistic parameters of authentic texts; although relevant to style, they have mainly interested corpus linguists.

Speech systems are also taking increasing interest in aspects of oral and written texts. Text-type is relevant for speech-style, which, in turn, influences prosody (Theune et al. 2006; Fackrell et al. 2000). Variation in speech-style may enhance the believability of synthetic speech (Campbell 2007). Text-type is also relevant for pausing and should therefore be considered in synthetic speech (Pajupuu and Kerge 2006).

Recent studies on educated L1 and C1/L2 speaking skills indicate that, to be accepted as natural, language use should be of appropriate genre and register. Once that is fulfilled, other parameters, such as lexical richness, vocabulary range, word-formation skills, foreign accent, etc., are less vital for L2 (Pajupuu and Kerge 2009). Hence, both in language teaching and in speech technology, speech naturalness requires differentiating between text-types and training the speaker (human or machine) in genre-specific expression.

In text classification, Biber's pioneering 1988 study on the difference between oral and written texts laid the foundation for the automatic identification of text-types. In Estonia, there is a practical need for text-type discrimination (Kerge et al. 2008), but no systematic study in the field.

Among many other features, a text can be characterized by its part-of-speech structure (inventory and relative proportions), which provides information on its formality/contextuality (Michos et al. 1999; Heylighen and Dewaele 2002; Teddimann 2009). Heylighen and Dewaele (2002) have used part-of-speech relations to devise an F-measure of text formality, tested on various types of texts. For Estonian texts, pilot results show fairly good promise for that parameter (Kerge et al. 2007). Discrimination on that basis would make it possible to search other linguistic levels for further text-type parameters (see Santini 2001; Kwasnik and Crowston 2005).

According to Heylighen and Dewaele (2002), text formality depends both on the communication situation and on the personality of the text author (see Fig. 1).

[Figure 1: diagram not reproduced. Situational variables: audience size, difference in backgrounds, difference in setting, time span, feedback, need for understanding. Personality variables: femininity, introversion. Both groups act on FORMALITY, which in turn affects the linguistic variables: deictic words, non-deictic words, F-measure.]

Figure 1. Summary of the formality model. Arrows with + signs denote positive correlations, – signs denote negative correlations; to the left are the behavioral variables that affect the formality of linguistic expressions, to the right are the linguistic variables affected by formality. (Heylighen and Dewaele 2002: 335)

Our aim is (1) to show which texts clearly differ in formality, thus helping to find the best material for training a learner's sense of text-type characteristics, and (2) to decide (preliminarily) which parameters of communication contribute most to the formality score. For lack of data, personality variables are seldom discussed. The results may also prove relevant for automatic text-type identification in speech technology.

2. MATERIAL

There is always some danger in pre-defining what we study (sociolect, register/genre, functional style). One might get more objective results by treating a code as the language of a text within a defined situation (participants, circumstances), i.e. as the context-befitting adaptation of user- and usage-dependent varieties (see Kerge 2003).

Linguistic form   Text-type
oral              caretaker turns in dialogue with child
oral              child turns in dialogue with caretaker
oral              L2 dialogue (C1 speaking test)
oral              L1 dialogue (exam-like situation)
oral              L1 monologue (exam-like situation)
oral              L2 monologue (C1 speaking test)
oral              parliament speech (spontaneous monologue)
oral              live radio-dialogue (spontaneous)
oral              live TV, interview with a guest
oral/prewritten   TV-monologue (broadcast introduction)
oral/prewritten   TV-monologue (newspaper review)
written           informal e-mailing (teenagers)
written           private enterprise internal e-mailing
written           clients e-mailing to a private enterprise
written           personal letters by Kaplinski (famous writer)
written           online comments for a yellow newspaper (SL Õhtuleht)
written           online comments for a daily newspaper (Postimees)
written           public service official letters to people
written           public service official letters to other public services
written           private enterprise e-mailing to clients
written           L1 essay (exam-like situation)
written           L2 essay (C1 writing test)
written           novel by Kaplinski
written           news from a yellow newspaper (SL Õhtuleht)
written           essay by Kaplinski
written           news from a daily newspaper (Postimees)
written           laws (legal texts)
written           EU regulations translated

[Table 1 also gives, for each text-type, the discourse character (interactive/non-interactive), the discourse-type (dialogue/monologue), the author and the addressee (person/institution/public); these cell values are not recoverable from the extracted layout.]

Table 1. Research material, sorted by linguistic form

Thus, our research object is texts from defined situations, in oral (possibly pre-written) or written form, with clearly definable communicative parameters: synchronous/non-synchronous (oral or chat-room communication versus mailing); public/personal (broadcasting versus letters); direct/mediated interaction (oral dialogue versus written correspondence) or no direct interaction (print media or books); monologue/dialogue (which in e-mailing may be individual); etc. Such an approach avoids the risk of reducing genre to a pre-defined classification system of static forms, which would discount the wide range of variation within genres and decontextualize the text, as argued by Devitt (2008).

In our qualitative study we compared the contextuality/formality of 28 oral and written texts (16,805 tokens in total), representing the public, professional, educational and personal spheres, institutional and other communication, etc. (see Table 1).

3. METHOD

According to Heylighen and Dewaele (2002), a relative abundance of nouns and their modifiers increases text formality and decreases its ambiguity, while an abundance of other parts of speech increases text contextuality and ambiguity (see Figure 2).

[Figure 2: diagram not reproduced. Axis from contextual style (more deictic words, verbs, adverbs, interjections) to formal style (more nouns, articles, adjectives, adpositions); further labels: ambiguity, misinterpretation, ambiguity avoidance.]

Figure 2. Contextuality/formality axis

Their F-measure has to be adapted to each particular language. Estonian is mainly synthetic, but besides case endings there are also some analytic means (mostly postpositions). There are no articles. Parts of speech can often be determined from context only. Indexical linguistic units were classified as deictic words, including text-bound anaphors (e.g. antud 'given', eelmainitud 'above-mentioned'), pronouns and pro-adverbs (like when, there, now), but not proper names, since in the studied texts the names are well known and thus provide no cues for context. On the basis of either context or archaic form, a number of noun and verb forms were classified as adpositions (e.g. hoolimata 'despite', lit. 'not-caring') or adverbs (e.g. öösel 'at night'), etc.

We computed formality scores for all the above texts (cf. Heylighen and Dewaele 2002: 309):

F = 0.5 × [(noun freq. + adjective freq. + adposition freq.) – (deictic word freq. + verb freq. + adverb freq. + interjection freq.) + 100]

Each frequency is the percentage of the text's words that belong to the given word class.
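As an illustration only, the score can be sketched in a few lines of Python. The tag labels (`noun`, `deictic`, etc.) and the function names are assumptions made for this sketch, not part of the study's tooling; the sample percentages are those reported in Table 2 for official letters between public services.

```python
# Word classes that raise the score (formal) and lower it (contextual).
# The label strings are hypothetical; any tagset can be mapped onto them.
FORMAL = ("noun", "adjective", "adposition")
CONTEXTUAL = ("deictic", "verb", "adverb", "interjection")


def pos_percentages(tags):
    """Return the percentage share of each tag in a list of per-token tags."""
    total = len(tags)
    if total == 0:
        raise ValueError("cannot compute percentages for an empty text")
    return {tag: 100.0 * tags.count(tag) / total for tag in set(tags)}


def formality_score(freq):
    """F = 0.5 * [(formal classes) - (contextual classes) + 100]."""
    formal = sum(freq.get(k, 0.0) for k in FORMAL)
    contextual = sum(freq.get(k, 0.0) for k in CONTEXTUAL)
    return 0.5 * (formal - contextual + 100.0)


# Percentages from Table 2: public service official letters to other
# public services. Conjunctions are listed but do not enter the formula.
letters = {
    "noun": 62.4, "adjective": 6.9, "adposition": 3.1,
    "deictic": 5.6, "verb": 10.7, "adverb": 6.1,
    "interjection": 0.0, "conjunction": 5.2,
}
print(round(formality_score(letters), 1))  # 75.0
```

The result matches the F value given for that text-type in Table 2. For a tagged text, `formality_score(pos_percentages(tags))` would give the score directly, provided the tags have first been mapped to the labels above.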

Next we ran a correlation analysis between the F-measure, the communicative parameters and the part-of-speech data.

4. RESULTS AND DISCUSSION

By their F-scores, the texts were ranked along the contextuality/formality axis (see Table 2). Text-types differ significantly if their F-scores differ by at least 3 points (Heylighen and Dewaele 2002: 328).

The Estonian study indicates that the method (for methodological arguments see Heylighen and Dewaele 2002: 304−310) does work: text formality shows a strong correlation (r = 0.97) with noun content. Adpositions, which in Estonian alternate with case endings, tend to be more frequent in writing (r = 0.57); the effect is the reverse for deictics (r = -0.74) and adverbs (r = -0.62). High noun content correlates well with writing (r = 0.60), while deictics are a strong cue of orality (r = 0.75). Owing to feedback, contextual reference is normal in oral communication. The correlation of interjections with orality is surprisingly weak (r = 0.45). Higher contextuality correlates strongly with adverbs (r = 0.91), deictic words (r = 0.87) and verbs (r = 0.76). Adjectives play no part in the formality calculus (r = 0.22). For the same reason, conjunctions are not included in the original formula (ibid.: 311). Consequently, large-scale studies may lead to a formula that includes F-relevant parts of speech only.
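A correlation of this kind could be recomputed roughly as follows. This is a sketch under assumptions, not the authors' procedure: it uses a plain Pearson coefficient, the helper name `pearson_r` is illustrative, and the two lists are the noun percentages and F-scores of the 28 texts as listed in Table 2.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Noun percentages and F-scores, in the row order of Table 2.
nouns = [23.1, 18.0, 17.1, 19.0, 20.6, 22.1, 20.8, 22.0, 23.6, 27.7,
         34.5, 31.9, 32.4, 32.0, 29.9, 32.2, 37.7, 33.5, 32.0, 33.9,
         37.0, 42.9, 47.3, 48.5, 54.6, 54.7, 56.4, 62.4]
f_scores = [28.6, 29.0, 29.4, 30.2, 32.4, 34.4, 35.2, 35.5, 35.6, 37.7,
            37.8, 41.5, 42.0, 43.1, 43.3, 47.0, 47.0, 49.5, 49.5, 49.7,
            52.0, 52.8, 58.2, 60.0, 67.1, 67.6, 69.1, 75.0]

r = pearson_r(nouns, f_scores)
print(f"r(nouns, F) = {r:.2f}")
```

The same helper could be applied to any other part-of-speech column, or to a communicative parameter coded as 0/1 (e.g. oral versus written), which is one common way of correlating a binary variable with a score.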

Formality is higher for writing (r = 0.59) and for monologue (r = 0.58). In Estonian, oral text tends to be more contextual (r = 0.59), but not as strongly as in Dutch (cf. ibid.: 312). The other (mutually independent) linguistic parameters have no bearing on the F-measure.

F = formality score; the other numeric columns give the share of each word class in % of the text's words.

F     Nouns  Adject.  Adpos.  Deictics  Verbs  Adverbs  Interj.  Conj.  Text-type
28.6  23.1   4.0      1.5     24.5      26.3   15.1     3.4      2.0    caretaker turns in dialogue with child
29.0  18.0   4.1      1.5     24.9      21.1   18.6     1.1      10.5   L2 dialogue (C1 speaking test)
29.4  17.1   5.0      1.7     22.8      22.9   16.2     3.3      11.0   live radio-dialogue (spontaneous)
30.2  19.0   3.4      2.2     24.7      21.8   16.7     1.1      11.1   L1 dialogue (exam-like situation)
32.4  20.6   7.2      1.1     20.7      22.1   19.0     2.7      6.7    informal e-mailing (teenagers)
34.4  22.1   4.5      1.2     22.9      21.9   13.1     1.2      13.1   L2 monologue (C1 speaking test)
35.2  20.8   6.2      2.2     22.2      19.8   16.7     0.3      11.8   L1 monologue (exam-like situation)
35.5  22.0   5.0      2.0     22.0      22.0   14.0     0.0      13.0   novel by Kaplinski (famous writer)
35.6  23.6   4.6      2.1     20.9      19.1   17.3     1.8      10.6   live TV, interview with a guest
37.7  27.7   4.5      2.5     17.7      22.6   18.3     0.8      5.9    online comments for a yellow newspaper (SL Õhtuleht)
37.8  34.5   4.3      2.6     12.6      24.0   16.5     0.4      5.1    news from a yellow newspaper (SL Õhtuleht)
41.5  31.9   4.0      2.4     16.3      23.5   14.8     0.9      6.2    online comments for a daily newspaper (Postimees)
42.0  32.4   8.4      1.1     23.9      13.5   15.2     5.3      0.2    child turns in dialogue with caretaker
43.1  32.0   4.3      2.8     13.2      23.2   13.7     2.8      8.1    private enterprise internal e-mailing
43.3  29.9   5.2      2.3     17.3      18.1   14.0     1.5      11.8   TV-monologue (broadcast introduction)
47.0  32.2   7.7      2.5     16.7      18.9   12.6     0.3      9.1    parliament speech (spontaneous monologue)
47.0  37.7   3.5      2.7     14.5      21.6   10.9     2.9      6.2    clients e-mailing to a private enterprise
49.5  33.5   9.3      2.5     12.0      21.7   12.6     0.0      8.4    L1 essay (exam-like situation)
49.5  32.0   10.0     3.0     14.0      21.0   11.0     0.0      9.0    personal letters by Kaplinski
49.7  33.9   8.6      3.2     14.1      21.7   10.4     0.2      7.9    L2 essay (C1 writing test)
52.0  37.0   7.0      3.0     15.0      19.0   9.0      0.0      10.0   essay by Kaplinski
52.8  42.9   3.0      4.3     12.9      20.3   11.4     0.0      5.2    news from a daily newspaper (Postimees)
58.2  47.3   4.3      4.0     8.3       20.9   7.8      2.3      5.1    private enterprise e-mailing to clients
60.0  48.5   4.9      3.0     7.4       16.5   12.0     0.4      7.4    TV-monologue (newspaper review)
67.1  54.6   5.7      1.9     12.7      13.4   1.8      0.0      9.8    laws (legal texts)
67.6  54.7   5.1      5.1     12.2      12.3   5.3      0.0      5.2    EU regulations translated
69.1  56.4   5.4      4.2     9.7       14.2   4.0      0.0      6.1    public service official letters to people
75.0  62.4   6.9      3.1     5.6       10.7   6.1      0.0      5.2    public service official letters to other public services

Table 2. Formality scores and parts of speech vs. text-type
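The 3-point criterion cited above can be applied to the ranking mechanically. The following sketch is an illustration, not the procedure used in the study: the function name and the choice of sample rows are assumptions, and the F-scores are copied from Table 2. It lists the pairs of text-types whose scores are too close to count as significantly different.

```python
from itertools import combinations

THRESHOLD = 3.0  # minimum F difference treated as significant (see text)


def not_clearly_distinct(scores, threshold=THRESHOLD):
    """Return pairs of text-types whose F-scores differ by less than threshold."""
    pairs = []
    for (name_a, f_a), (name_b, f_b) in combinations(scores.items(), 2):
        if abs(f_a - f_b) < threshold:
            pairs.append((name_a, name_b))
    return pairs


# A few F-scores copied from Table 2.
sample = {
    "caretaker turns in dialogue with child": 28.6,
    "parliament speech": 47.0,
    "clients e-mailing to a private enterprise": 47.0,
    "laws": 67.1,
    "official letters to other public services": 75.0,
}
print(not_clearly_distinct(sample))
# [('parliament speech', 'clients e-mailing to a private enterprise')]
```

In this sample only the parliament speech and the clients' e-mails fall within 3 points of each other, which is why a formality score alone cannot separate those two text-types.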

Heylighen and Dewaele (2002: 326) suggest some extralinguistic factors that work towards higher formality, such as audience size and heterogeneity (psychological and/or cultural) and spatial and/or temporal distance; "contextual speech-styles will .. be more interactive" (ibid.: 301). Our Estonian material revealed no correlation between extralinguistic parameters and formality. Considering the interactive nature of mailing, interactivity has a slight correlation with dialogue (r = 0.59), without, however, raising contextuality (its correlation with F being -0.21). The reason may lie in the usage of Estonian civil servants, whose correspondence is the most formal and dense of all text-types studied, with noun content 8% higher than in laws and 14% higher than in news texts. Even letters personally addressed to citizens surpass laws in formality.

The size and heterogeneity of the audience were studied via the joint parameter public, which does not necessarily correlate with formality (r = 0.30). Spontaneity needs further investigation: an impromptu speech to parliament turns out to be more formal than a pre-written broadcast introduction. Both have a broad and heterogeneous audience (parliament speeches are broadcast); the difference may derive from the educated audience as well as from the topic of a parliament speech (a dense bill).

The method enables cross-language comparison of text-type formality rankings and of inter-type distances (see Heylighen and Dewaele 2002: 312 for Dutch, 314 for Italian, 316 for English). Unfortunately, the texts studied are too different for that. For example, informational writing (the most formal text-type in English) is found in 11 of the Estonian texts studied, whose formality ranges from 37.8 to 75.0, depending on institutionality, channel, publicity, etc. Conversation (the most contextual text-type in English) is represented by five Estonian texts, with F ranging from 28.6 to 42.0, depending on privacy, publicity, age and other situational factors, yet remaining very contextual for adult personal interaction, etc.
In language technology the F-score enables discrimination between text-types and, in certain cases, even intra-type classification (e.g. in journalistic texts, yellow-press news can be discriminated from daily-newspaper news, interviews from broadcast introductions, etc.), thus economizing on searching the texts for formality cues (cf. Michos et al. 1999). At the same time the formality score should be regarded as a raw classifier: the F-measure shows the difference or similarity of texts but fails to reveal why. Moreover, the same F-score may apply to different text-types (e.g. a parliament speech vs. a client's e-mail to a private enterprise) (cf. Teddimann 2009). And yet, in text-to-speech synthesis, the formality score can be used to vary text style from informal to formal.

5. CONCLUSION

Variation in the results of, e.g., the English corpus studies referred to above (Biber 1988, modified by Hudson) may be too wide to be applicable in language teaching. On the other hand, the more specific genres fail, in most cases, to form distinct clusters by formality score; instead, they are distributed evenly and widely along the axis (F 28−75), without indicating an unmistakable reason for their ranking. Correlation analysis reveals only isolated tendencies: in most cases a written text is more formal than an oral one, and a monologue is more formal than a dialogue. Even institutionality and publicity need not have an unambiguous effect on style.

The differences in formality scores depend on linguistic factors, while situational context factors are individually combined and affect style indirectly via this combination. The results support the usage-based approach in language teaching, questioning rigid classification of genres (e.g. Devitt 2008). Without diminishing the importance of the multiplicity of genres, this emphasizes genre-awareness in language learning. Curiously enough, our result walks through the open door of the sociocultural approach to language teaching, a door opened by the systemic-functional and sociocognitive paradigms in linguistics.

Optimal training of text-type identification requires that attention be paid to certain aim-bound aspects of communication concerning the situation and the text, i.e. whether the communication is private, personal, institutional or public; direct or mediated; interactive or not; oral or written; and, what matters most, who the addressee or target group is.

Having a private conversation, delivering a public speech and performing on a talk show are three very different things. Teenagers' e-mailing is close to adult oral conversation, which in turn is not far from the online comments for yellow papers, while the higher formality level of a newspaper elicits more formal comments, etc.

6. ACKNOWLEDGEMENTS

The study was supported by the Project SF0050023s09.

REFERENCES

Biber, D. 1988. Variation Across Speech and Writing. Cambridge: CUP.

Campbell, N. 2007. “Evaluation of speech synthesis”. Evaluation of Text and Speech Systems. Text, Speech and Language Technology 37. Eds. L. Dybkjær, H. Hemsen, W. Minker. Springer. 29–64.

CEFR = Common European Framework of Reference for Languages: Learning, Teaching, Assessment. 2001. Strasbourg: Council of Europe; CUP.

Devitt, A. 2008. Writing Genres. Carbondale: Southern Illinois University.

Fackrell, J., Vereecken, H., Martens, J.-P., Van Coile, B. 2000. “The variation of prosody with text-type”. Proc. of IEE Colloquium on 'State of the Art in Speech Synthesis'. London, UK. 5/1–5/9.

Gee, J. P. 2008. Social linguistics and literacies: Ideology in discourses. Critical perspectives on literacy and education. Third edition. London: Routledge.

Heylighen, F., Dewaele, J.-M. 2002. “Variation in the contextuality of language: An empirical measure”. Foundations of Science 7: 293–340.

Ivanič, R., Edwards, R., Barton, D., Martin-Jones, M., Fowler, Z., Hughes, B., Mannion, G., Miller, K., Satchwell, C. and Smith, J. (eds). 2009. Improving Learning in College: Rethinking Literacies Across the Curriculum. NY: Routledge.

Kerge, K. 2003. “Linguistic varieties: regular nominalization as a parameter of the complicacy of syntax in different fields of language use”. Summary. Tallinna Pedagoogikaülikooli humanitaarteaduste dissertatsioonid 10. Tallinn: TPÜ Kirjastus. 61−66.

Kerge, K., Pajupuu, H., Altrov, R. 2007. “Tekst, kontekstuaalsus ja kultuur”. Keel ja Kirjandus 8: 624–637.

Kerge, K., Pajupuu, H., Tamuri, K., Meier, H. 2008. “Keeletehnoloogia vajab žanrilist lähenemist”. Rakenduslingvistika ühingu aastaraamat, 4: 53–65.

Kucer, S. B. 2009. Dimensions of Literacy. NY: Routledge.

Kwasnik, B. H., Crowston, K. 2005. “Introduction to the special issue: Genres of digital documents”. Information Technology & People, 18 (2): 76–88.

Michos, S. E., Fakotakis, N., Kokkinakis, G. 1999. “Using functional style features to enhance information extraction from Greek texts”. Advances in Intelligent Systems. Concepts, Tools and Applications. Ed. S. G. Tzafestas. Kluwer. 134–154.

Pajupuu, H., Kerge, K. 2006. “Hingav süntesaator ja pausid tekstiliigiti”. Keel ja Kirjandus 3: 202–210.

Pajupuu, H., Kerge, K. 2009, forthcoming. “Characteristics and assessment of educated L1 and L2 dialogue”. Proc. of the XXVII Congreso de AESLA. Universidad Castilla-La Mancha.

Santini, M. 2001. “Text typology and statistics. Explorations in Italian press subgenres”. Italian Journal of Linguistics 13(2): 339–374.

Street, B. V. 1984. “Literacy in theory and practice”. Cambridge Studies in Oral and Literate Cultures, 9. Cambridge, NY, Melbourne: CUP.

Teddimann, L. 2009. “Contextuality and beyond: Investigating an online diary corpus”. Proc. of the Third International ICWSM Conference. 331–333.

Theune, M., Meijs, K., Heylen, D., Ordelman, R. 2006. “Generating expressive speech for storytelling applications”. IEEE Transactions on Audio, Speech, and Language Processing 14(4): 1137–1144.