
Universidade de Lisboa Faculdade de Letras Departamento de Linguística Geral e Românica em associação com

Universidade Técnica de Lisboa Instituto Superior Técnico Departamento de Informática

Processing Disfluencies in European Portuguese

Helena Gorete Silva Moniz

Tese orientada pelas Professoras Doutoras Ana Isabel Mata e Isabel Trancoso

Tese especialmente elaborada para a obtenção do grau de doutor em Linguística (Linguística Educacional)

2013

Esta tese foi realizada com o apoio da Fundação para a Ciência e a Tecnologia, financiamento comparticipado pelo Fundo Social Europeu e por fundos nacionais do MCTES, através da bolsa de investigação com a referência SFRH/44671/2008

Aos meus pais.

À minha família.

Resumo

A presente tese centra-se na análise de disfluências com o duplo objectivo de caracterizar os padrões regulares associados à sua produção e contribuir para o processamento automático de um conjunto mais alargado de eventos designados no inglês “structural metadata events” (Liu et al. (2006b); Ostendorf et al. (2008); Jurafsky and Martin (2009)), nomeadamente, a recuperação automática de pontuação e maiúsculas em fronteiras de frase, bem como a anotação e filtragem de disfluências. A análise apresentada tem como base o processamento automático de propriedades prosódicas em corpora de natureza distinta. Para validar a metodologia de extracção automática de propriedades prosódicas, desenhada no âmbito deste trabalho, a primeira experiência incide sobre o processamento automático de interrogativas, um tópico ainda por explorar no português.

A literatura crítica da área não é consensual relativamente ao contributo de diversas pistas linguísticas, nomeadamente lexicais e prosódicas, para a identificação de marcas de pontuação. Com o objectivo de verificar se o contributo destas pistas linguísticas varia em função da natureza específica de um corpus ou dos tipos de interrogativas, procedeu-se à análise da distribuição das interrogativas em quatro corpora distintos: noticiários televisivos, aulas universitárias, diálogos espontâneos e, para efeitos de comparação, notícias do jornal Público. Os resultados evidenciam uma correlação entre a natureza dos corpora e a frequência e distribuição de tipos de interrogativas, permitindo um claro contraste entre diálogos espontâneos e aulas universitárias, por um lado, e noticiários televisivos e notícias do jornal, por outro.
Na distribuição dos diferentes tipos, verifica-se que o corpus de aulas universitárias contém sobretudo interrogativas Qu- e tags, enquanto o de diálogos espontâneos tem uma percentagem significativa de interrogativas de sim/não, e o de notícias televisivas apresenta uma distribuição semelhante entre Qu- e interrogativas de sim/não. Os resultados da detecção automática de interrogativas demonstram que: i) quando são apenas utilizadas pistas lexicais (categoria morfológica, n-gramas de palavras mais frequentes, número e posição das palavras na frase, inter alia), apenas as interrogativas Qu- são detectadas vs. ii) quando são adicionadas pistas prosódicas (energia, duração e frequência fundamental das unidades sílaba e palavra), as interrogativas globais e as tags passam, então, a ser detectadas. Os resultados apontam, assim, para um efeito determinante da combinação de pistas linguísticas na identificação das diferentes estruturas interrogativas do Português Europeu (PE). Os resultados desta experiência constituem um dos principais contributos desta tese.

Um segundo conjunto de experiências é dedicado à predição dos sinais de pontuação mais frequentes nos corpora (vírgulas, pontos finais e pontos de interrogação) e à discriminação entre frases, ou constituintes similares a frase (do inglês “sentence-like unit”), e disfluências, num corpus de aulas universitárias. Com recurso à aplicação de acesso público Weka, utilizaram-se vários métodos de aprendizagem, sendo que as Árvores de Decisão e Regressão (CART) evidenciam os melhores resultados. Para a discriminação destas classes de eventos é determinante o seguinte conjunto de pistas linguísticas: contornos de frequência fundamental (f0), níveis de energia, duração relativa das unidades de análise e grau de confiança dessas mesmas unidades. Em primeiro lugar, as pistas que mais contribuem para a predição da reposição de fluência a seguir a uma sequência disfluente integram: i) duas palavras contíguas idênticas; ii) subida dos níveis de f0 e de energia na palavra que inicia uma reposição de fluência e um contorno estacionário de f0 na palavra anterior; iii) grau de confiança da palavra que inicia a reposição, superior ao da disfluência propriamente dita. Relativamente às pistas associadas à predição de pontos finais, estas incluem: i) contorno descendente na palavra antes de um ponto final; ii) nível estacionário de energia na mesma palavra; iii) duração relativa entre essa palavra e a seguinte; e iv) grau superior de confiança em relação à palavra seguinte. Este conjunto de pistas é ilustrativo do comportamento de uma declarativa neutra no PE. Quanto aos pontos de interrogação, estes são caracterizados por dois padrões diferenciados: i) contorno de f0 ascendente na palavra antes de um ponto de interrogação e declive (do inglês “slope”) de energia ascendente nessa e na palavra seguinte; ii) contorno de f0 estacionário na palavra antes de um ponto de interrogação e declive de energia descendente nessa mesma palavra. As vírgulas são o evento que menos depende de uma caracterização prosódica.
Nas experiências até agora realizadas para o PE, elas são sobretudo classificadas com base em pistas morfo-sintácticas, não sendo claramente desambiguadas por meio de pistas prosódicas. Este segundo conjunto de experiências constitui-se como um primeiro contributo para a sistematização das propriedades linguísticas associadas a sinais de pontuação e à reposição de fluência em PE.

O terceiro conjunto de experiências integrado nesta tese concentrou-se na investigação do comportamento prosódico das disfluências em aulas universitárias e em diálogos espontâneos. Relativamente às aulas universitárias, foram encontrados dois padrões essenciais: i) declives de f0 e de energia estatisticamente significativos entre os contextos adjacentes e a disfluência propriamente dita; ii) aumentos de f0 e de energia (marcação prosódica por contraste) entre a disfluência e a reposição da fluência para a maioria das categorias disfluentes, embora com diferentes graus de contraste. Deve notar-se que os aumentos de f0 e de energia entre a disfluência e a reposição da fluência são produzidos por todos os falantes. O primeiro padrão ilustra a forma como o falante sinaliza de forma económica as diferentes regiões, utilizando apenas uma palavra antes e depois da sequência disfluente, e pode ser interpretado como uma estratégia do falante para auxiliar os ouvintes a processar as pistas produzidas num curto intervalo de tempo. No segundo padrão, os aumentos mais elevados de f0 estão associados às categorias pausas preenchidas e apagamentos e os de energia à categoria repetições, o que aponta para combinatórias de parâmetros prosódicos ao serviço de propósitos funcionais distintos.

A estratégia de marcação prosódica por contraste de disfluência para reposição de fluência é realizada por todos os falantes. Quanto aos diálogos, e seguindo a mesma ordem de padrões: i) o contexto adjacente anterior a uma disfluência não apresenta diferenças significativas; ii) metade das categorias disfluentes é produzida com aumentos de f0 da disfluência para a reposição da fluência; há aumentos de energia constantes por falante, mas não por categoria (apagamentos e fragmentos não são produzidos com aumentos de energia). Note-se que a estratégia de marcação prosódica por contraste é realizada por 71% dos falantes. Os padrões temporais das unidades de análise são em média mais breves do que nas aulas.

A comparação inter-corpora aponta efeitos de estilo de fala na distribuição das disfluências, nos padrões temporais e mesmo na marcação prosódica por contraste da disfluência para a reposição da fluência entre aulas universitárias e diálogos. Embora as pausas preenchidas sejam a categoria mais representativa em ambos os corpora, as restantes categorias apresentam uma distribuição distinta. Nas aulas, as sequências complexas (e.g., repetições e substituições utilizadas para procura/precisão lexical) são mais frequentes do que as repetições, enquanto nos diálogos ambas têm distribuições similares. Nos diálogos, os fragmentos correspondem a mais do dobro dos fragmentos produzidos nas aulas e os apagamentos são residuais. Estas diferenças na distribuição das categorias disfluentes podem ser interpretadas em função da natureza dos diálogos em análise, nomeadamente das restrições temporais a que estão sujeitos, com recurso mais frequente a categorias como repetições e fragmentos e menos a sequências complexas e apagamentos. Os padrões temporais também apontam para a natureza mais dinâmica dos diálogos por comparação com as aulas, com produção de menos palavras, tanto em frases fluentes como em frases que contêm disfluências.
O encadeamento das interacções comunicativas num diálogo está sujeito a restrições temporais, evidentes também na duração dos silêncios, na disfluência e nos próprios contextos adjacentes. Uma vez mais, todas as unidades referidas são mais breves nos diálogos do que nas aulas. Mesmo a estratégia de marcação prosódica por contraste da disfluência para a reposição da fluência está sujeita a variação inter-corpora, sendo esta marcação mais forte nas aulas do que nos diálogos. Nas aulas, pistas de f0 e de energia são produzidas por todos os falantes, para a maioria das categorias, tanto para as disfluências como para os contextos adjacentes. O conjunto de padrões apresentado é um contributo para a diferenciação entre estilos de fala, nomeadamente entre fala espontânea e fala preparada não-lida. Espera-se que esta análise para o português europeu possa contribuir para questões de investigação ainda em aberto relativas ao impacto de pistas linguísticas distintas por tarefas, domínios e línguas.

Abstract

This thesis focuses on the analysis of disfluencies, aiming at a characterization of the regular patterns in their production in European Portuguese, and at contributing towards the fully automatic processing of structural metadata events. This analysis was strongly supported by prosodic feature processing, and involved corpora of very different characteristics. In terms of structural metadata, one of the main contributions concerns the automatic processing of interrogatives, an unexplored topic in Portuguese. When using only lexical cues in the automatic detection of interrogatives, mostly wh-questions are detected. By adding prosodic features, yes/no and tag questions are then increasingly identified, showing the advantages of combining both lexical and prosodic features. The inter-corpora analysis of interrogatives showed that there are domain-specific distributional patterns.

Prosodic features also played a dominant role in the discrimination between commas, full stops, question marks and disfluencies. Our data-driven approach revealed a very distinctive set of prosodic features for each event, going beyond the established evidence for our language. In terms of disfluencies, we analyzed university lectures and map-task dialogues, showing that the selection of specific disfluency types is corpus dependent. Pitch, energy and tempo parameters display inter-corpora similarities, showing a cross-speaking-style prosodic strategy of contrast marking in the disfluency-fluency repair, and also relative tempo symmetries regarding the length of the structured elements of a disfluency and its context. However, in the lectures, pitch and energy cues are given both for the units inside disfluent regions and between these and the adjacent contexts, showing a stronger prosodic contrast marking when compared to dialogues. As for tempo patterns, the length of the structured elements in the dialogues is shorter, reinforcing their dynamic and interactive character.
This analysis will hopefully contribute to the open debate on the relative impact of distinct linguistic features across tasks, domains and languages.

Palavras Chave

Fala, disfluências, pontuação, prosódia e processamento de fala.

Keywords

Speech, disfluencies, punctuation, prosody, and speech processing.

Acknowledgements

This work is the outcome of many generous contributions and of overwhelming support; both shaped me as an intermediary in this process. I am deeply grateful to my supervisors, Professors Ana Isabel Mata and Isabel Trancoso, for all their support, criticism, challenges, and guidance. Thank you for allowing me to understand the realms of research and to enjoy them. Thank you also for the true friendship that has bound us for quite a while.

I am also very grateful to my dissertation committee, Professors Maria do Céu Viana, David de Matos, and Inês Duarte, for their very helpful comments and suggestions. A special thanks to Céu Viana, for being a constant presence in the cooperation process between linguistics (FLUL) and automatic speech processing (INESC-ID, L2F). Much beyond that, I am grateful for her accurate questions along my research path.

To Professors Julia Hirschberg and Nick Campbell, for their scientific generosity in supervising my visit to Columbia and in guiding my short-term scientific mission at Trinity College, respectively. Those experiences will always be valuable lessons along my way. To Professor Anna Esposito, coordinator of COST-2102 (Cross-Modal Analysis of Verbal and Non-verbal Communication). Under her auspices, I did my first review and had my first professional experience abroad. Thank you for believing and for enriching my research path.

To my colleagues at the Spoken Language Laboratory at INESC-ID, thank you for the knowledge and the laughs shared. Research and life are much more fun with a generous combination of chocolates and valuable research discussions. Among my colleagues at the lab, I should say a special thanks to Fernando Batista, from whom I learned so much, for all the work done in close cooperation, for his patience, and for our scientific discussions.

Thank you Aida Cardoso, Vera Cabarrão, and Silvana Abalada, for the precise annotations and for helping me with some of the experiments conducted in this thesis. Thanks, Vera, for making days lighter. To my beautiful family, for all the sacrifices you made and for demanding just a smile. To Pedro and Tomás, my melopoeia of truth.

Contents

1 Introduction
2 State-of-the-art on disfluencies
   2.1 Typology
   2.2 Structure of a disfluent sequence
   2.3 Contrast and parallelism between disfluent regions
   2.4 Studies on disfluencies for EP
   2.5 Summary
3 State-of-the-art on prosody processing
   3.1 Prosody
   3.2 ToBI system
   3.3 AuToBI
   3.4 Prosodic and lexical cues for structural metadata
   3.5 Overview for European Portuguese
   3.6 Summary
4 Corpora
   4.1 The CORAL corpus
      4.1.1 Contents of the maps
      4.1.2 Number and type of speakers
      4.1.3 Recording conditions
      4.1.4 Corpus division
   4.2 The CPE-FACES corpus
      4.2.1 Recording conditions
      4.2.2 Subset selection
   4.3 The ALERT corpus
      4.3.1 Data collection
   4.4 The LECTRA corpus
      4.4.1 Recording conditions
      4.4.2 Corpus division
   4.5 The newspaper data from Público
   4.6 Corpora annotation
      4.6.1 Orthographic tier
      4.6.2 Disfluency tier
      4.6.3 Syntactic tier
      4.6.4 Morphological information
      4.6.5 Inter-transcriber agreement
   4.7 Corpora alignment
   4.8 Summary
5 Towards an Automatic Prosodic Description
   5.1 Recognizer output
   5.2 Adjusting phone boundaries
   5.3 Marking syllable boundaries and stress
   5.4 Adjusting word boundaries and silent pauses
      5.4.1 Impact on acoustic models
   5.5 Pitch and energy
   5.6 Integration of prosodic information in transcription files
   5.7 Extended set of prosodic features
   5.8 Towards an Automatic ToBI annotation system for EP
      5.8.1 Preliminary results
   5.9 Summary
6 Analysis of interrogatives: a case-study
   6.1 Statistical Analysis of Interrogatives
      6.1.1 Overall frequency of interrogative types in the training corpora
   6.2 Punctuation experiments for interrogatives
      6.2.1 Baseline experiments
      6.2.2 Experiments with lexical and speaker-related features
      6.2.3 Experiments with prosodic features
   6.3 Summary
7 Automatic structural metadata classification
   7.1 Data and methods
   7.2 Predicting structural metadata events
      7.2.1 Results
      7.2.2 Most salient features
   7.3 Summary
8 Disfluencies and their fluent perspective
   8.1 Definitions of fluency
   8.2 Perceptual test
   8.3 Discussion
   8.4 CART experiment
   8.5 Summary
9 Analysis of disfluencies in the LECTRA corpus
   9.1 Data and methods
   9.2 Rate of disfluencies per speaker
   9.3 Rate of disfluencies per lecture and per speaker
   9.4 Rate of disfluencies per sentence
   9.5 Patterns in the reparandum
   9.6 Prosodic analysis
      9.6.1 Overall prosodic characterization
      9.6.2 Speaker and type of disfluency
      9.6.3 Tempo characteristics
   9.7 Summary
10 Analysis of disfluencies in the CORAL corpus
   10.1 Data and methods
   10.2 Rate of disfluencies per speaker
   10.3 Rate of disfluencies per dialogue and per speaker
   10.4 Rate of disfluencies per sentence
   10.5 Prosodic analysis
      10.5.1 Overall prosodic characterization
      10.5.2 Speaker and type of disfluency
      10.5.3 Tempo characteristics
   10.6 Summary
11 Speaking style effects in the production of disfluencies
   11.1 Related work
   11.2 Inter-corpora distribution
   11.3 Inter-corpora prosodic analysis
   11.4 Summary
12 Conclusions
   12.1 Main contributions
      12.1.1 Automatic processing of prosodic cues
      12.1.2 Prosodic and lexical cues to interrogative types distinction
      12.1.3 Prosodic cues to structural metadata classification
      12.1.4 Prosodic contrast marking of disfluency/fluency repair
   12.2 Overcoming the limitations of this work
   12.3 Directions for future research
Bibliography
A Features in the LECTRA corpus per speaker
B Features in the CORAL corpus per speaker

List of Figures

2.1 Structure of a disfluent sequence.
3.1 Example of a TextGrid file annotated with ToBI.
4.1 Example of CORAL maps.
4.2 The LECTRA division.
4.3 AUDIMUS processing pipeline.
4.4 Corpus-processing.
4.5 Example of an ASR transcript segment, enriched with reference data.
4.6 Excerpt of an enriched ASR output with marked disfluencies.
4.7 Disfluency and other events alignment examples.
5.1 Example of a file containing the phones/diphones produced by the ASR system.
5.2 PCTM of monophones, marked with syllable boundary and stress.
5.3 Improvement in terms of correct word boundaries, after post-processing.
5.4 Phone segmentation before and after post-processing.
5.5 Example of an erroneous segmentation due to a fricated plosive.
5.6 Improvement of correct word boundaries, after retraining.
5.7 Pitch adjustment.
5.8 Workflow of prosodic information.
5.9 Excerpt of the final XML.
5.10 Excerpt of an input TextGrid file from LECTRA.
5.11 Excerpt of an output TextGrid file.
5.12 Excerpt of a manual output TextGrid file.
8.1 Median values for disfluencies scores.
8.2 Tonal scaling of prolongations, filled pauses and repetitions.
8.3 Felicitous example.
8.4 Infelicitous example.
8.5 CART results.
9.1 Total time and useful time between disfluencies per lecture.
9.2 Mean of words uttered between disfluent sequences per lecture.
9.3 Total words and disfluent words per lecture.
9.4 Sequences of events of the same category.
9.5 Distribution of disfluencies in the reparandum.
9.6 Distribution of disfluencies in the reparandum per speaker.
9.7 Pitch and energy slopes in the LECTRA corpus.
9.8 Pitch differences per type and speaker in LECTRA.
9.9 Energy slopes per type and speaker in LECTRA.
9.10 Duration of all the events in LECTRA.
9.11 Duration of all the events per disfluency type in LECTRA.
10.1 Total time and useful time between disfluencies per dialogue.
10.2 Total words and disfluent words per dialogue.
10.3 Mean of words uttered between disfluent sequences per dialogue.
10.4 Pitch and energy slopes in the CORAL corpus.
10.5 Pitch differences per type and speaker in CORAL.
10.6 Energy slopes per type and speaker in CORAL.
10.7 Duration of all the events in CORAL.
10.8 Duration of all the events per disfluency type in CORAL.
11.1 Pitch differences between units based on the average for university lectures.
11.2 Pitch differences between units based on the average for dialogues.
11.3 Energy slopes per type and speaker in LECTRA.
11.4 Energy slopes per type and speaker in CORAL.
11.5 Duration of the disfluency (in ms), of the adjacent words and silent pauses.

List of Tables

2.1 Prosodic properties of disfluent regions.
4.1 LECTRA corpus.
4.2 Symbols used in the orthographic tier.
4.3 Labels used in the disfluency tier.
4.4 Morphological tag set.
4.5 Evaluation of the inter-transcriber agreement.
5.1 Extended set of prosodic features.
5.2 Results for the AuToBI performance on EP data.
6.1 Overall punctuation marks frequency in the training sets.
6.2 Overall frequency of interrogative types in training corpora.
6.3 Automatic and manual classification of interrogative types in the test sets.
6.4 Baseline results, achieved with lexical features only.
6.5 Results after re-training with transcriptions and adding acoustic features.
6.6 Recovering the question mark over the LECTRA corpus, using prosodic features.
6.7 Recovering the question mark over the ALERT corpus, using prosodic features.
7.1 Corpus properties and number of metadata events.
7.2 CART classification results for prosodic features.
7.3 Confusion matrix between events.
7.4 Top most relevant features, sorted by relevance.
9.1 Overall characteristics of the LECTRA training subset.
9.2 Distribution of disfluencies per speaker in LECTRA.
9.3 Means of fluent and disfluent words per sentence in LECTRA.
9.4 Ratios per speaker in LECTRA.
10.1 Overall characteristics of the CORAL training subset.
10.2 Distribution of disfluencies per speaker.
10.3 Means of fluent and disfluent words per sentence in CORAL.
10.4 Ratios per speaker in CORAL.
11.1 Overall characteristics of lectures and dialogues.
11.2 Mean words in distinct corpora.
11.3 Distribution of disfluencies per corpora.

1  Introduction

Disfluencies are on-line editing strategies with several (para)linguistic functions. They account for a representative portion of our spoken interactions. Every day we are analysts of our own speech and that of others, monitoring distinct linguistic and paralinguistic factors in our communications, using disfluencies to make speech a more error-free system, a more edited message, and a more structured system with coherent and cohesive mechanisms. Disfluencies are an important research topic in several areas of knowledge, namely Psycholinguistics, Linguistics, and Automatic Speech Recognition, and more recently in Text-to-Speech conversion and even in Speech-to-Speech translation. Yet, whereas for several languages one can find much literature on disfluencies, for others, such as European Portuguese, the literature is quite scarce.

Detecting and filtering disfluencies is one of the hardest problems in the rich transcription of spontaneous speech. Enriching speech transcripts with structural metadata (Ostendorf et al., 2008) is of crucial importance for many speech and language processing tasks, and comprises several metadata extraction/annotation tasks besides dealing with disfluencies, such as: speaker diarization (i.e., assigning the different parts of the speech to the corresponding speakers); sentence segmentation (also known as sentence boundary detection); punctuation and capitalization recovery; topic and story segmentation, etc. Such metadata extraction/annotation technologies have recently been receiving increasing attention (Liu et al., 2006b; Jurafsky and Martin, 2009; Ostendorf et al., 2008), and demand multi-layered linguistic information. A simple segmentation method, for instance, may rely only on information about pauses. More complex methods, however, may also involve lexical cues, dialog act cues, etc.
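A pause-based segmentation method of the simple kind just mentioned can be sketched as follows. This is a minimal illustration only, not the segmentation module used in this work; the 250 ms threshold and the word/time tuples are invented for exposition:

```python
# Minimal pause-based sentence segmentation sketch.
# Each word is a (token, start_time, end_time) tuple in seconds;
# a silent gap longer than `threshold` between consecutive words
# triggers a segment boundary.

def segment_by_pauses(words, threshold=0.25):
    """Group time-aligned words into segments at long silent pauses."""
    segments, current = [], []
    for i, (token, start, end) in enumerate(words):
        current.append(token)
        next_start = words[i + 1][1] if i + 1 < len(words) else None
        if next_start is None or next_start - end > threshold:
            segments.append(" ".join(current))
            current = []
    return segments

words = [("ok", 0.0, 0.2), ("then", 0.25, 0.5),      # 0.05 s gap: same segment
         ("next", 1.2, 1.5), ("topic", 1.55, 1.9)]   # 0.70 s gap: new segment
print(segment_by_pauses(words))  # ['ok then', 'next topic']
```

More complex methods would add lexical and dialog act features on top of such a pause feature rather than thresholding it directly.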
In fact, the term structural segmentation encompasses all algorithms based on linguistic information that delimit spoken sentences (units that may not be isomorphic to written sentences), topics and stories. Inscribed in this research trend, this study targets the analysis of disfluencies in different corpora, aiming at two main objectives: to characterize the regular patterns in the production of disfluencies in European Portuguese, and to contribute towards a fully automatic processing of disfluencies and other structural metadata events. In fact, structural metadata may almost be regarded as a satellite research trend in our work, quickly growing from a side topic to a very prominent one, as the role of prosodic features extended much beyond the scope of disfluencies, becoming more and more pervasive in different automatic speech processing tasks in our research group. One of the first such tasks was word boundary delimitation, resulting in a more stable automatic speech recognition system. But the greatest impact was on recovering punctuation marks, especially in what concerns the detection of interrogatives, a hitherto unexplored topic in our language. Therefore, this work cannot be read as a book on disfluencies as an exclusive topic, since much parallel work had to be done in terms of prosodic feature processing for Portuguese.

This thesis starts with an overview of the core concepts regarding disfluencies and prosody in Chapters 2 and 3, respectively. The corpora and annotation schemas are presented in Chapter 4. The first chapters after this introductory part cover our main contributions in terms of structural metadata. The integration of prosodic information into the automatic speech recognizer output towards an automatic prosodic description is described in Chapter 5. Chapter 6 reports our experiments in integrating interrogatives into the punctuation module and in evaluating the impact of several linguistic features. Our latest work towards automatic structural metadata classification is described in Chapter 7. The remainder of the thesis is devoted to disfluencies. They can be described from two main perspectives: as speech errors that disrupt the ideal delivery of speech, or as fluent linguistic devices used to manage speech. Chapter 8 reports on two main experiments, a perceptual test and a CART, which were conducted to validate the assumption of the fluent prosodic properties of disfluencies. The next two chapters are devoted to a study of the distributional trends and the prosodic patterns of disfluencies and of their adjacent contexts: Chapter 9 does this for a corpus of university lectures, and Chapter 10 for a corpus of map-task dialogues.
Their comparison is covered in Chapter 11, aiming at verifying speaking style effects in the production of disfluencies. Finally, our conclusions and future work trends are presented in Chapter 12.

A note to our reader: much of the work presented in this thesis has already been published in international peer-reviewed venues, and the majority of the chapters are therefore a direct reflection of those publications. For the sake of clarity and ethics, we list the publications integrated in this thesis and the corresponding chapters:

1. Thomas Pellegrini, Helena Moniz, Fernando Batista, Isabel Trancoso & Ramon Astudillo, "Extension of the LECTRA corpus: classroom LECture TRAnscriptions in European Portuguese", in Speech and Corpora, Belo Horizonte, March 2012. (Chapter 4)

2. Isabel Trancoso, Rui Martins, Helena Moniz, Ana Isabel Mata & Maria do Céu Viana, "The LECTRA Corpus - Classroom Lecture Transcriptions in European Portuguese", in LREC 2008 - Language Resources and Evaluation Conference, Marrakesh, Morocco, May 2008. (Chapter 4)

3. Helena Moniz, Fernando Batista, Hugo Meinedo, Alberto Abad, Isabel Trancoso, Ana Isabel Mata & Nuno Mamede, "Prosodically-based automatic segmentation and punctuation", in Speech Prosody 2010, ISCA, Chicago, USA, May 2010. (Chapter 5)

4. Fernando Batista, Helena Moniz, Isabel Trancoso, Nuno Mamede & Ana Isabel Mata, "Extending Automatic Transcripts in a Unified Data Representation towards a Prosodic-based Metadata Annotation and Evaluation", in Journal of Speech Sciences, Luso-Brazilian Association of Speech Sciences, vol. 2, n. 2, December 2012. (Chapter 5)

5. Helena Moniz, Fernando Batista, Isabel Trancoso & Ana Isabel Mata, "Analysis of interrogatives in different domains", in A. Esposito, A. M. Esposito, R. Martone, V. Müller & G. Scarpetta (Eds.), Towards Autonomous, Adaptive, and Context-Aware Multimodal Interfaces: Theoretical and Practical Issues. Third COST 2102 International Training School, Springer Berlin / Heidelberg, Lecture Notes in Computer Science, pages 136-148, Caserta, Italy, January 2011. (Chapter 6)

6. Fernando Batista, Helena Moniz, Isabel Trancoso & Nuno Mamede, "Bilingual Experiments on Automatic Recovery of Capitalization and Punctuation of Automatic Speech Transcripts", in IEEE Transactions on Audio, Speech, and Language Processing, IEEE Signal Processing Society, vol. 20, n. 2, pages 474-485, doi: 10.1109/TASL.2011.2159594, February 2012. (Chapter 6)

7. Helena Moniz, Fernando Batista, Isabel Trancoso & Ana Isabel Mata, "Automatic structural metadata identification based on multilayer prosodic information", in DISS 2013, the 6th Workshop on Disfluency in Spontaneous Speech, KTH Royal Institute of Technology, Sweden, August 2013. (Chapter 7)

8. Helena Moniz, Isabel Trancoso & Ana Isabel Mata, "Disfluencies and the perspective of prosodic fluency", in A. Esposito, N. Campbell, C. Vogel, A. Hussain, A. Nijholt (Eds.), Development of Multimodal Interfaces: Active Listening and Synchrony, Springer Berlin / Heidelberg, Lecture Notes in Computer Science, DOI: 10.1007/978-3-642-123979, April 2010. (Chapter 8)

9. Helena Moniz, Fernando Batista, Isabel Trancoso & Ana Isabel Mata, "Prosodic context-based analysis of disfluencies", in Interspeech 2012, ISCA, Portland, Oregon, U.S.A., September 2012. (Chapter 9)

10. Helena Moniz, Fernando Batista, Ana Isabel Mata & Isabel Trancoso, "Analysis of disfluencies in a corpus of university lectures", in ExLing 2012, August 2012. (Chapter 9)


2  State-of-the-art on disfluencies

Disfluencies, e.g., filled pauses, prolongations, repetitions, substitutions, deletions, and insertions, characterize spontaneous speech and play a major role in speech structuring (Levelt, 1983; Allwood et al., 1990; Swerts, 1998; Clark and Fox Tree, 2002). They have been studied from different perspectives. For speech processing, the analysis of the regular patterns of those phenomena is crucial (Nakatani and Hirschberg, 1994; Shriberg, 1994). In automatic speech recognition (ASR), their identification accounts for more robust language and acoustic models (Liu et al., 2006a), and even in text-to-speech (TTS) synthesis these phenomena are being modeled to improve the naturalness of synthetic speech (Adell et al., 2008). Moreover, when combining ASR and TTS with machine translation to achieve spontaneous speech translation, dealing with disfluencies is one of the aspects where substantial improvements are most needed (Tomokiyo et al., 2006). Recent studies in psycholinguistics (e.g., Esposito and Marinaro, 2007) have also targeted the relation of non-verbal communication (gestures) with silent and filled pauses, highlighting the pragmatic and semantic similarities between them. The multifaceted analysis of filled pauses can also be accounted for from an emotion-oriented perspective, or even in social behavior detection, such as in Benus et al. (2006); Gravano et al. (2011); Benus et al. (2012); Ranganath et al. (2013). There are two main perspectives in the literature to describe disfluencies: i) as speech errors that disrupt the ideal delivery of speech, or ii) as fluent linguistic devices used to manage speech. For a survey on these perspectives, vide Kowal and O'Connell (2008). Disfluencies may be used for different purposes related to, e.g., speech structuring (Clark and Fox Tree, 2002), introducing new information (Arnold et al., 2003) and producing fluent strategies in second language learning (Rose, 1998).
The fluent component of these phenomena is still rather controversial, even though Heike (1981) and Allwood et al. (1990) had already pointed out the benefits of disfluencies for communicative purposes and their contribution to on-line planning efforts. Although the word disfluency still carries the depreciative connotation linked to error, the term will be used for the sake of terminological simplicity and to allow direct comparisons with other studies. For an overview of the historical perspective on the terminological aspects associated with positive/negative connotations of the terms, and of the realms of linguistic studies involved, vide Erard (2007). Cross-linguistic studies of filled pauses have pointed out language universal and language
specific regularities (Allwood et al., 1990; Eklund and Shriberg, 1998; Vasilescu and Adda-Decker, 2007), both segmental and prosodic. As in Allwood et al. (1990), we concentrate on phenomena which indicate "normal spontaneous management of speech": (self-)repairs, (self-)correction, hesitation phenomena, (self-)repetition, (self-)reformulation, substitution and editing. The experiments conducted in this work focus on prosodic features and their role in the analysis of disfluencies. Therefore, we briefly introduce the most used typology, the structure of a disfluent sequence, and the prosodic contrast and parallelism strategies in the production of disfluencies, and finally give an overview of the work done for European Portuguese.

2.1  Typology

As in other areas, terminology regarding disfluent events is rather diverse. However, in the last decades, since the influential work of Shriberg (1994), there has been a common-ground typology that speech scientists have been using, promoting direct comparisons of the results achieved in different areas. Shriberg's typology encompasses the following set of disfluent categories:

(i) Filled pauses - a schwa-like quality vowel and/or a nasal murmur in European Portuguese; in languages such as Spanish they may also be demonstratives, e.g., "este".
ou pode estar trancada (or it can be closed)

(ii) Repetitions - linguistic material repeated.
e vocês sabem que (and you know that)

(iii) Substitutions - linguistic material replaced, usually by material of the same morphological category.
que, aliás, saiu na vossa ficha (which, by the way, came out in your test)

(iv) Deletions - linguistic material abandoned, corresponding to a complete restart.
Ah, e no fim, e no fim, diz aí que vocês tinham ainda um stock de cento e cinquenta traves, (Oh, and at the end, and at the end, it says there that you still had a stock of one hundred and fifty beams,)

(v) Insertions - linguistic material inserted, usually with repetitions, to clarify an idea.
em que medida é que o padrão é útil? (in what way is the pattern useful?)


(vi) Editing terms/expressions - overt expressions regarding on-line message editing.
acabou o tempo (time ran out)

(vii) Word fragments - linguistic material truncated or incomplete.
complementar (additional)

(viii) Mispronunciations - linguistic material pronounced in an erroneous way.
pode-nos servir pronounced as [S1r'nir] instead of [s1r'vir] (can serve us)

(ix) Complex sequences - linguistic material comprising distinct disfluent categories (e.g., repetitions and substitutions).
O ano passado houve uns colegas vossos da matemática que queriam fazer o projecto quase só com strings. (Last year there were some of your colleagues from math who wanted to do the project almost only with strings.)

(x) Others - sometimes simultaneous phonetic-phonological, lexical, morphological and syntactic phenomena.

The work of Eklund (2004) provides an overview of prolongations in Swedish and in other languages, observing regularities in the segmental properties of the elongated lexical material, which provides evidence for another category per se - prolongations. Two contributions were taken from that study: besides the category prolongation, this study will also consider the disfluent event index system proposed by Eklund (2004), establishing correlations between the material to be corrected and the correction itself, and the order in which the linguistic material is uttered.

(xi) Segmental prolongations - elongated segmental linguistic material. Procedurally, prolongations can be measured and compared with linguistic material in other locations. In EP, prolongations in the sense of management of speech are often related to specific lexical items, e.g., functional words with elongated vowels in a context where we would expect reduction or elision of those vowels.
In our previous studies we have found that lexical words may also be elongated, with two effects: prolongation affecting more than the last syllable of the word, and final lengthening corresponding to an interval of more than 1 second.
E= o que é que acontecia? (pronounced as [i:]) (And= what happened?)
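For illustration only, the category set above can be written down as a small annotation label inventory. The tag names below are hypothetical, chosen for this sketch; the actual labels used in the disfluency tier of our annotation are listed in Table 4.3:

```python
# Hypothetical tag inventory for Shriberg-style disfluency categories,
# extended with Eklund's prolongation category (tags invented for this sketch).
DISFLUENCY_TYPES = {
    "FP":  "filled pause",
    "REP": "repetition",
    "SUB": "substitution",
    "DEL": "deletion",
    "INS": "insertion",
    "ET":  "editing term/expression",
    "FRG": "word fragment",
    "MIS": "mispronunciation",
    "CPX": "complex sequence",
    "PRL": "segmental prolongation",
}

def describe(tag):
    """Expand an annotation tag into its category name."""
    return DISFLUENCY_TYPES.get(tag, "unknown")

print(describe("PRL"))  # segmental prolongation
```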


Figure 2.1: Structure of a disfluent sequence. Figure extracted from Shriberg (1994).

2.2  Structure of a disfluent sequence

As Figure 2.1 shows, disfluencies have a specific structure: reparandum, interruption point, interregnum, and repair of fluency (Levelt, 1989; Nakatani and Hirschberg, 1994; Shriberg, 1994). The reparandum is the region to be repaired. The interruption point is the moment when speakers stop their production to correct the linguistic material already uttered; ultimately, it is the frontier between disfluent and fluent speech. The interregnum is an optional part that may contain silent pauses, filled pauses (uh, um) or explicit editing expressions (I mean, no). The repair is the corrected linguistic material.
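The four-part structure lends itself directly to a data representation, and disfluency filtering then reduces to keeping only the repair. This is a minimal sketch with invented field names, not the representation used in this work:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class DisfluentSequence:
    """Shriberg-style regions of a disfluent sequence.
    The interruption point falls between reparandum and interregnum."""
    reparandum: List[str]    # material to be repaired
    interregnum: List[str]   # optional: filled pauses, editing expressions
    repair: List[str]        # corrected material

def filter_disfluency(seq: DisfluentSequence) -> List[str]:
    """Return only the fluent (repaired) material, discarding the rest."""
    return seq.repair

seq = DisfluentSequence(
    reparandum=["to", "the", "left"],
    interregnum=["uh", "I", "mean"],
    repair=["to", "the", "right"],
)
print(" ".join(filter_disfluency(seq)))  # to the right
```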

It is known that each of these regions has idiosyncratic acoustic properties that distinguish them from each other (Hindle, 1983; Levelt and Cutler, 1983; Nakatani and Hirschberg, 1994; Shriberg, 1994, 2001; Liu et al., 2006a). There is in fact an edit signal process (Hindle, 1983), meaning that speakers signal an upcoming repair to their listeners. In the reparandum, the edit signal is manifested by means of repetition patterns, production of fragments, glottalizations, co-articulatory gestures, and voice quality attributes such as jitter (perturbations in the pitch period). In the interregnum, it is further marked by pause durations significantly different from those at fluent boundaries and by specific lexical items. Finally, it is marked in the repair via pitch and energy increases.

2.3  Contrast and parallelism between disfluent regions

As stated before, in a disfluent sequence there are several regions to be considered (Levelt and Cutler, 1983; Nakatani and Hirschberg, 1994; Shriberg, 1994). The possible connections between the reparandum and the repair have been explored from different perspectives in the literature. Since Levelt and Cutler (1983), there has been a binary tendency to classify the prosodic properties of (certain) disfluencies as either copying the pitch contour of the reparandum or contrasting the onset of fluency in the repair with the reparandum, by means of increases in f0 and energy. The first strategy is classified as a parallelism between the two regions and is mainly related to appropriateness (involving, for instance, repetition and insertion), whereas the second is classified as contrast marking and is productive with error corrections (mostly substitutions). The literature is not consensual about this dichotomy. For Plauché and Shriberg (1999), repetitions per se can behave as parallelistic prosodic structures (copying the pitch contour of the reparandum) and also show some degree of contrast (a rising pattern in the repetition is related to an emphasis on the new unit), although not the one reported by Levelt and Cutler (1983). For Savova and Bachenko (2003a,b), distinct categories, such as repetitions and substitutions, seem to copy the patterns of their counterparts in the reparandum. Moreover, for these authors there is only partial support for the contrastive nature of substitutions, manifested by a higher pitch range. Cole et al. (2005) sustain the parallelistic nature of both repetitions and error corrections and consider parallelism the most frequent strategy.

Table 2.1 shows the overall prosodic characteristics of the distinct regions of a disfluent sequence. In the overview presented by Savova and Bachenko (2003a), there are differences regarding pitch properties between the reparandum and the repair across domains. However, in the majority of the studies covered in that overview, results point to higher pitch levels in the repair than in the reparandum.
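Very roughly, the contrast/parallelism dichotomy could be operationalized by comparing mean f0 across the reparandum and the repair. This is a toy sketch, not a procedure from the literature or from this work; the 2-semitone margin and the f0 values are arbitrary assumptions:

```python
import math

def semitone_diff(f0_a, f0_b):
    """Pitch interval in semitones between two f0 values (Hz)."""
    return 12 * math.log2(f0_b / f0_a)

def classify_strategy(reparandum_f0, repair_f0, margin=2.0):
    """Label the repair 'contrast' if its mean f0 exceeds the
    reparandum's by more than `margin` semitones, else 'parallelism'.
    Inputs are lists of f0 samples (Hz) from each region."""
    mean = lambda xs: sum(xs) / len(xs)
    diff = semitone_diff(mean(reparandum_f0), mean(repair_f0))
    return "contrast" if diff > margin else "parallelism"

print(classify_strategy([180, 175, 170], [230, 225, 220]))  # contrast
print(classify_strategy([180, 175, 170], [182, 176, 171]))  # parallelism
```

A real classifier would, of course, also use energy, duration, and contour shape rather than a single mean-f0 margin.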

The contrast and parallelism strategies may also be regarded from a comprehension perspective (Levelt, 1983; Levelt and Cutler, 1983; Levelt, 1989). In comprehension tasks, the information available in disfluencies can help listeners compensate for disruptions and delays in spontaneous utterances (Brennan and Schober, 2001). The cues are not exclusively the presence of a (certain type of) disfluency, but also the linguistic properties of the structured regions of a disfluent event (Hindle, 1983; Nakatani and Hirschberg, 1994; Shriberg, 1994, 1999, 2001), namely the transition to the repair of fluency, which is of crucial importance for the process of understanding a message. However, the literature does not focus on how those cues may vary according to speaking style, due to underlying situational contexts and communicative purposes.


Table 2.1: Prosodic properties of disfluent regions. Table extracted from Savova and Bachenko (2003a). “rm” stands for reparandum, “df” for disfluency, “rr” for repair, and “fp” for filled pause.

2.4  Studies on disfluencies for EP

For European Portuguese, much has been said about silent pauses, filled pauses and prolongations (e.g., Moniz (2006); Moniz et al. (2007, 2008a); Veiga et al. (2011)), whereas the other categories are poorly described. Silent and filled pauses in European Portuguese (EP) were first studied by Freitas (1990). The main focus of that study was the temporal organization of discourse and the syntactic distribution of silent and filled pauses. Based on reading and spontaneous speech data, the author pointed out that, as expected, filled pauses were only uttered in spontaneous speech, making them speech style discriminating events. The syntactic distribution of silent and filled pauses was also regarded as a distinctive feature: filled pauses were mainly uttered within a phrase, while silent pauses showed two different patterns, being essentially located at syntactically higher positions (i.e., sentence and clause boundaries) in the reading corpus, and at or within phrase boundaries in the spontaneous data. In EP, the first study to present the relative frequency of different disfluency types, their distribution, and the way they may associate with each other and with different intonational and durational patterns was that of Moniz (2006). Filled pauses and segmental prolongations have also been detailed in Moniz et al. (2007) and Moniz et al. (2008b). The definition of filled pauses does not seem to be ambiguous across languages, corresponding to elongated segments (in EP, [5:]; [@:]; [1:]; [m:], or one of these vowels with the nasal coda [m:], like [5:m]). There are distinct forms for FPs: (i) an elongated central vowel only; (ii) a nasal murmur only; and (iii) a central vowel followed by a nasal murmur, spelled as aa, mm and aam, respectively, as the quality of the central vowel most often coincides with that of unstressed /a/.
Although a schwa-like quality ([5:] or [@:]) appears to be the most commonly used, in a quick survey of other speech corpora available for EP we have found some speakers consistently using the neutral vowel [1:] instead, and others using both [1:] and [5:], sometimes in the same sentence, depending on the quality of the last vowel of the previous word. Our point here is not simply that FP vocalizations may be built around central vowels and that speakers may differ in their preferences, but that FPs do not appear to behave as other words in the language. In EP, [1] and [5] correspond to reduced forms of different vowels in unstressed position (/i/, /e/, /E/ vs. /a/, respectively), and words homophonous with aa (the preposition a or the feminine determiner a) do not undergo this type of contextual variation (Moniz, 2006; Moniz et al., 2007). In EP, as in other languages, final lengthening is a cue for intonational phrase boundaries (Falé, 1995; Mata, 1999; Frota, 2009). These are not the elongations accounted for in this work. The prolongations that we have been studying are mainly elongated functional words (e.g., conjunctions and prepositions, such as [i:], e - and; [k1:], que - that; [d1:], de - of) that appear in contexts where strong reduction or deletion are the expected processes in EP. We have also been analyzing lexical/functional words elongated in sequences with self-repairs or with additional clarifications. Both of them can be automatically identified by comparing the relative
durations of the same words or segments in fluent contexts for the same speaker. The lengthening of words ending in a coronal fricative, for instance, could be obtained by prolonging the entire rhyme and/or the fricative only. Most of the time, however, the neutral vowel [1] is appended to achieve the desired effect. Contrary to regular sandhi phenomena generally observed both within and across word boundaries, the final fricative is never realized as [z], but as [Z]. Different disfluency types tend to occur in different prosodic contexts. Prosodic studies for EP (Martins, 1986; Viana, 1987; Falé, 1995; Mata, 1999; Frota, 2000; Vigário, 2003; Frota, 2009) have pointed out the need for at least two levels of phrasing, major and minor intonational phrases (IP). The main distinction is that the major IP boundary shows a wider pitch range and greater final lengthening than the minor IP boundary, indicating (as pointed out by Frota, 2000 and Viana et al., 2007) that these constituents correspond to boundaries of different strength. These two levels of boundary strength are marked with 3 (minor IP) and 4 (major IP), as in ToBI (Silverman et al., 1998). In our work we have been using break indices 3 and 4 as well. In spontaneous data, the break index 3 seems crucial to account for sentence-internal chunks, advantageous for the description of disfluencies and the way they relate to adjacent prosodic constituents. Moreover, in the joint attempt to propose a ToBI system for European Portuguese (Viana et al., 2007), the authors also pointed out the importance of having the break index 3. The use of a common annotation system is also beneficial for comparison with other languages, aiming at a cross-linguistic validation of the behavior of the so-called disfluencies.
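The duration-comparison procedure mentioned earlier in this section (contrasting a token's duration with the same speaker's fluent productions of the same word) could be sketched as a simple z-score test. This is a toy illustration with invented durations and an arbitrary threshold, not the procedure actually implemented in this work:

```python
from statistics import mean, stdev

def is_prolonged(duration, fluent_durations, z_threshold=2.0):
    """Flag a token whose duration is far above the same speaker's
    fluent durations for the same word or segment (z-score test)."""
    mu = mean(fluent_durations)
    sigma = stdev(fluent_durations)
    if sigma == 0:
        return duration > mu
    return (duration - mu) / sigma > z_threshold

# Invented durations (s) of the conjunction "e" in fluent contexts
# for one speaker:
fluent = [0.06, 0.08, 0.07, 0.09, 0.08]
print(is_prolonged(0.42, fluent))  # True: an elongated "e=" ([i:])
print(is_prolonged(0.08, fluent))  # False: a typical fluent token
```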
Different filled pauses, for instance, tend to occur in different prosodic contexts: (i) aam generally occurs at major intonational phrase boundaries; (ii) aa is most likely found at minor intonational phrase boundaries; (iii) mm occurs mainly in coda position (e.g., [qu1:m], [5:m]). Segmental prolongations are most likely found at internal clause boundaries and at intonational phrase boundaries, behaving as aa. Previous studies also pointed out that filled pauses are uttered mainly with plateau contours or with gradual falling contours, whereas segmental prolongations exhibit more complex f0 contours. Silent pauses are consistently used as a cue either to automatically recognize disfluencies (Stolcke et al., 1998) or to analyze their psycholinguistic implications (Levelt, 1989; Clark and Fox Tree, 2002). Our previous study (Moniz et al., 2008a) pointed out that more than 80% of prolongations and filled pauses are followed by silent pauses of a reasonable length, supporting the view that their presence may effectively be used by listeners as a cue to an upcoming delay. The absence of such a pause is strongly penalized as misleading information. Filled pauses and prolongations have also been targeted by Veiga et al. (2011, 2012) with two main goals: firstly, to automatically detect both disfluent categories in a broadcast news corpus; secondly, to differentiate between speech styles, i.e., spontaneous vs. prepared speech in the same domain. The authors used a combined set of segmental features in their
experiments to perform both tasks. Disfluencies have also been targeted from a stuttering perspective by Cruz (2009). Melodic patterns of stutterers vs. non-stutterers were presented, pointing out that stutterers produce more tonal events per intonational phrase, those constituents being shorter than the ones produced by non-stutterers, and also that such tonal events are distinct from those of non-stutterers, characterized by a preference for boundary tones !H% rather than L% and for simple tonal pitch accents.

2.5  Summary

This chapter addressed the working definitions of several core aspects regarding disfluencies: the most used typological categories of disfluencies, the structural regions of a disfluent sequence, and the prosodic strategies applied in the disfluency-fluency repair. An overview of the European Portuguese studies focusing on disfluencies was also given.


3  State-of-the-art on prosody processing

Prosody will be a pervasive concept throughout this work. Two main perspectives will be reviewed: the first is related to the prosodic characterization of structural metadata events, and the second is centered on the automatic modeling of prosodic properties. Therefore, this chapter focuses on the core aspects associated with prosodic parameters, the prosodic annotation system adopted and its automatic counterpart, and ends with an overview of studies conducted for European Portuguese.

3.1  Prosody

A working definition of prosody is provided by Shattuck-Hufnagel and Turk (1996):

"we specify prosody as both (1) acoustic patterns of F0, duration, amplitude, spectral tilt, and segmental reduction, and their articulatory correlates, that can be best accounted for by reference to higher-level structures, and (2) the higher-level structures that best account for these patterns ... it is 'the organizational structure of speech'." (Shattuck-Hufnagel & Turk, 1996, p. 196)

In the above definition, prosody has two components: firstly, the acoustic correlates and, secondly, their relation to the organizational structure of speech. Detailed analyses have been conducted to describe the properties of the prosodic constituents and their functions (e.g., Liberman (1975); Bruce (1977); Pierrehumbert (1980); Pierrehumbert and Hirschberg (1990); Beckman and Pierrehumbert (1986); Nespor and Vogel (1986); Bolinger (1989); Gussenhoven (2004); Ladd (1996, 2008)). Since the focus of this thesis is on the acoustic correlates of structural metadata events, we will only briefly comment on the higher-level structures described in the literature. In the Intonational Phonology framework, based on the study of Pierrehumbert (1980) and much subsequent work (e.g., Beckman and Pierrehumbert (1986); Pierrehumbert and Beckman (1988); Pierrehumbert and Hirschberg (1990); Beckman et al. (2005)), itself inspired by Liberman (1975) and Bruce (1977), the prosodic structure involves a hierarchy of prosodic constituents, ranging from the mora or syllable, the smallest constituents, to the intonational phrase or utterance, the largest ones. This hierarchical structure varies with regard to intermediate levels, namely the intermediate intonational phrase (Selkirk, 1984; Beckman and Pierrehumbert, 1986;
Nespor and Vogel, 1986; Gussenhoven, 2004; Nespor and Vogel, 2007; Ladd, 2008; Frota, 2012). Although the levels of the prosodic structure may vary within the framework, and languages may display different prosodic constituents (Beckman and Pierrehumbert, 1986; Pierrehumbert and Beckman, 1988; Jun, 2005), cross-language studies point to two important strengths of the framework: the hierarchical organization of speech and the knowledge that allows the assessment of cross-language similarities and differences. Cross-language studies have also investigated the acoustic correlates that best characterize the boundaries of sentence-like units (Vaissière, 1983; Vaissière, 2005). Features that are known to characterize higher-level structures, such as pauses at the boundary, pitch declination over sentences, post-boundary pitch and energy resets, pre-boundary lengthening, and voice quality changes, are amongst the most salient cues to detect sentence-like units. This set of prosodic properties has been used in the literature to successfully detect punctuation marks and disfluencies. By studying the acoustic correlates of sentence-like units and disfluencies in Portuguese, we expect to detect higher-level structures of speech, such as intonational phrases and utterances.

3.2

ToBI system

The seminal work of Pierrehumbert (1980) inspired the creation of an annotation system called ToBI (Silverman et al., 1998; Pitrelli et al., 1994), which stands for Tones and Break Indices. ToBI is one of the most well-known systems used to describe intonation across languages and dialects (for an overview of the original ToBI, vide Beckman et al. (2005), and for intonational comparisons between languages, vide Jun (2005)). ToBI contains 4 tiers: tones, breaks, orthographic, and miscellaneous, as illustrated in Figure 3.1. The tone tier, displaying the intonation contours decomposed into high (H) and low (L) tones, stems from the work of Pierrehumbert (1980); Beckman and Pierrehumbert (1986), and also from the work of Ladd (1983) for the analysis of downstep (!). The break tier, with the analysis of perceived disjuncture between words, builds upon the work of Price et al. (1991).

The tone tier, as established for Standard American English, consists of pitch accents (associated with accented syllables) and boundary tones (associated with phrase boundaries). Phrase boundaries are of two types: intermediate phrase and intonational phrase boundaries. The intermediate phrase consists of at least one pitch accent and a phrase accent (H-, !H-, and L-, marked with the diacritic “-”), used to describe the pitch movement between the last pitch accent and the phrase boundary. The intonational phrase is formed by one or more intermediate phrases. The intonational phrase boundary has an additional boundary tone (marked with the diacritic “%”), used to describe a final pitch movement (either H% or L%). Pitch accents can be either simple or bitonal (e.g., L*, H*, L+H*, L*+H). The star (*) diacritic marks the tone associated with the accented syllable, and the diacritic “!” is used whenever the H pitch range is compressed, resulting in a !H label.


Figure 3.1: This example was extracted from the AME ToBI transcription course, vide http://anita.simmons.edu/~tobi/iap.htm

Break indices are degrees of perceived disjuncture between words, ranging from 0 to 4. The level 0 marks the strongest link between words, with high co-articulation between two consecutive words; in European Portuguese, for example, it would be the index for a sequence like [’tESt 5’gOr5] (test now) with elision of the schwa vowel, i.e., [’tESt 5’gOr5] instead of [’tESt1 5’gOr5]. The level 1 is the common index between two connected words within a phrase. The level 2 stands for dubious interpretations (either perceived as a break 1, but displaying tonal and lengthening cues; or perceived as a 3 or 4, but without a phrase accent/boundary tone). The levels 3 and 4 represent intermediate intonational phrase boundaries and intonational phrase boundaries, respectively. The miscellaneous tier is used for comments (e.g., silence, laughter, disfluencies, inter alia), which should be temporally delimited.

3.3

AuToBI

The Automatic ToBI annotation system (AuToBI) was developed for Standard American English (SAE) by Rosenberg (2009, 2010). AuToBI is a publicly available tool1, which detects and classifies prosodic events following SAE intonational patterns. AuToBI relies on the fundamentals of the ToBI system, meaning that it predicts and classifies tones and break indices.

1 http://eniac.cs.qs.cuny.edu/andrew/autobi/


AuToBI has a modular architecture, which allows six tasks to be performed separately, and provides English trained models for spontaneous and read speech (for further details, vide Rosenberg (2009) and references therein). The six tasks are: i) detection of pitch accents; ii) classification of pitch accent types; iii) detection of intonational phrase boundaries; iv) detection of intermediate phrase boundaries; v) classification of intonational phrase ending tones; and vi) classification of intermediate phrase ending tones. In all tasks, raw and speaker z-score normalized2 pitch and energy are used. Each task relies on different sets of features in order to capture the acoustic properties of different regions of analysis.

Pitch accent detection is performed with the mean, minimum, maximum, standard deviation, and z-score of the maximum of raw and speaker-normalized pitch and intensity contours and their slopes, in order to capture pitch and energy excursions in a word-based context window of 8 words (zero, one, or two previous and following words). Spectral information regarding the energy contained in the frequency region between 2-20 bark, and the ratio of the energy in this region to the total energy of the frame, is also used. Pitch accent classification focuses mainly on the pseudo-syllable with the maximum intensity values in a word, and on its duration. The syllable contour is captured with the same features used in pitch accent detection, but now for the highest syllable in the word.

Phrase detection tasks (for both intermediate and intonational phrases) are associated with silence, pre-boundary lengthening, and pitch and energy resets. To account for these, all the features used in pitch accent detection are applied, but now the unit is the word preceding a possible boundary. Additionally, differences between two consecutive words are calculated, and the duration of the word prior to a possible boundary is also measured.
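The word-level statistics listed above can be sketched as follows. This is a simplified illustration of the kind of feature set described, not AuToBI's actual code; the function name and input format are assumptions.

```python
import statistics

def word_pitch_features(word_f0, speaker_f0):
    """Word-level pitch statistics in the spirit of the pitch accent
    detection features described above: mean, min, max, and standard
    deviation of the word's f0 contour, plus the speaker z-score of
    the word maximum (illustrative sketch only)."""
    m = statistics.mean(speaker_f0)    # speaker-level mean
    sv = statistics.stdev(speaker_f0)  # speaker-level standard deviation
    return {
        "mean": statistics.mean(word_f0),
        "min": min(word_f0),
        "max": max(word_f0),
        "std": statistics.stdev(word_f0),
        "zmax": (max(word_f0) - m) / sv,
    }
```

An analogous set would be computed over the intensity contour, and over the slopes of both contours.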
Phrase boundary tone classification is done simultaneously for intermediate and intonational phrases. Based on the assumption that every intonational phrase boundary is also an intermediate phrase boundary, they are merged into a single unit - the intonational phrase boundary - resulting in the following fixed set of labels: L-L%, L-H%, H-L%, !H-L%, H-H%. The same features used for phrase detection are required, but in this specific task they are extracted from the final 200 ms of phrase-final words. The accuracy values of the tasks range from a minimum of 54.95%, for the classification of intonational phrase boundary tones, to a maximum of 93.13%, for intonational phrase boundary detection (values reported in Rosenberg (2010)).
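Selecting the final 200 ms of a phrase-final word, the region from which boundary tone features are extracted, can be sketched as below; this toy helper assumes the word's waveform samples are available as a list and is not part of AuToBI itself.

```python
def final_region(samples, sample_rate, ms=200):
    """Return the last `ms` milliseconds of a word's samples (or the
    whole word when it is shorter), mirroring the phrase-final region
    used for boundary tone classification. Illustrative sketch only."""
    n = int(sample_rate * ms / 1000)
    return samples[-n:] if len(samples) > n else samples
```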

3.4

Prosodic and lexical cues for structural metadata

Recovering punctuation marks, capitalization, and disfluencies are three relevant MDA (Metadata Annotation) tasks. The impact of the methods and of the linguistic information on

2 z = (q − m)/sv, where q is transformed into z, m is the mean, and sv is the standard deviation.


structural metadata tasks has been discussed in the literature. Kim and Woodland (2001) and Christensen et al. (2001) report a general HMM (Hidden Markov Model) framework that allows the combination of lexical and prosodic cues for recovering full stops, commas, and question marks. A similar approach was also used by Liu et al. (2006b); Gotoh and Renals (2000); Shriberg et al. (2000) for detecting sentence boundaries. Kim and Woodland (2001) also combine 4-gram language models with a CART (Classification and Regression Tree) and conclude that prosodic information substantially improves punctuation generation results. A Maximum Entropy (ME) based method is described by Huang and Zweig (2002) for inserting punctuation marks into spontaneous conversational speech, where punctuation is treated as a tagging task and words are tagged with the appropriate punctuation. It covers three punctuation marks: comma, full stop, and question mark; the best results on the ASR output are achieved by combining lexical and prosodic features. A multi-pass linear fold algorithm for sentence boundary detection in spontaneous speech is proposed by Wang and Narayanan (2004); it uses prosodic features, focusing on the relation between sentence boundaries and break indices and duration, covering their local and global structural properties. Other recent studies have shown that the best performance for the punctuation task is achieved when prosodic, morphological, and syntactic information is combined (Liu et al., 2006b; Ostendorf et al., 2008; Shriberg et al., 2009; Favre et al., 2009; Batista et al., 2012a).

Many of the features and methods used for sentence-like unit detection may be applied to disfluency detection tasks. What is specific to the latter is that disfluencies have a specific structure, as previously described in Chapter 2.
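The tagging view of punctuation recovery can be illustrated with a deliberately naive scorer: each word receives the punctuation mark predicted to follow it. Here a pause-duration rule stands in for a trained model; the thresholds are invented for illustration, and a real system would combine this prosodic cue with lexical (e.g., n-gram) features.

```python
def tag_punctuation(tokens, pause_after):
    """Tag each token with the punctuation mark predicted to follow it,
    using only the duration (in seconds) of the following pause.
    Toy stand-in for a trained ME/HMM tagger (thresholds invented)."""
    tags = []
    for token, pause in zip(tokens, pause_after):
        if pause >= 0.5:
            tags.append(".")   # long pause: sentence boundary
        elif pause >= 0.2:
            tags.append(",")   # short pause: sentence-internal break
        else:
            tags.append("")    # no boundary
    return tags
```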
It is known that the reparandum, the interruption point, the interregnum, and the repair may display idiosyncratic acoustic properties that distinguish them from each other, as described in the edit signal theory (Hindle, 1983), meaning that speakers signal an upcoming repair to their listeners. The signal is edited in the reparandum by means of patterns of repetitions, production of fragments, glottalizations, co-articulatory gestures, and voice quality attributes, such as jitter (perturbations in the pitch period). Sequentially, it is also edited by means of pause durations significantly different from those at fluent boundaries, and by specific lexical items in the interregnum. Finally, it is edited via f0 and energy contrastive or parallelistic patterns in the repair. The main focus is thus to detect the interruption point, or the frontier between disfluent and fluent speech. Based on the edit signal theory, Nakatani and Hirschberg (1994) and Shriberg (1997, 1999) used CARTs to identify different prosodic features of the interruption point. Kim and Woodland (2004); Liu et al. (2006b) used features based on previous studies and added language models to predict both prosodic and lexical features of sentence boundaries and disfluencies.
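A minimal illustration of locating the interruption point for the simplest disfluency type, an exact repetition: the heuristic below scans for a repeated adjacent word sequence and places the candidate interruption point right after the reparandum. This is a toy lexical heuristic of our own, not one of the trained detectors cited above, which also exploit prosodic features.

```python
def repetition_interruption_point(tokens):
    """Return the index right after the reparandum of an exact adjacent
    repetition (longest repeated sequence), or None if there is none."""
    n = len(tokens)
    for length in range(n // 2, 0, -1):          # prefer longer repetitions
        for start in range(n - 2 * length + 1):
            if tokens[start:start + length] == tokens[start + length:start + 2 * length]:
                return start + length            # candidate interruption point
    return None
```

On the repetition "tem um número tem um número", the candidate interruption point falls right after the first "tem um número".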

3.5

Overview for European Portuguese

Researchers working on Portuguese intonation, on laboratory and spontaneous speech, gathered in 2007 with the aim of combining efforts towards a Portuguese_ToBI system, searching for “a unified transcription for some aspects of Portuguese intonation”. The results were built upon several previous studies (e.g., Viana 1987; Mata 1999; Frota 2000, 2002; Viana et al. 2003; Vigário 2003; Falé 2005) and were summarized in Viana et al. (2007) at the Workshop on Intonation in Ibero-Romance, PaPI 2007. The core results mostly concern the intonational properties of sentence-form types (declaratives, interrogatives, imperatives, parentheticals) in several varieties of Portuguese (the Lisbon variety; the Northern variety spoken in Braga; and the Brazilian variety spoken in São Paulo); the intonational properties of discourse functions in European Portuguese; and the prosodic levels relevant for phrasing in Portuguese. We will describe the most salient intonational and phrasing properties in line with the present work.

Regarding sentence-form types in EP, declaratives are the most studied sentence type (e.g., Viana, 1987; Vigário, 1995; Falé, 1995; Cruz-Ferreira, 1998; Frota, 2000; Viana et al., 2007). The intonational contour generally associated with a declarative is a falling one, expressed as a prenuclear H* (in the first accented syllable), a nuclear bitonal event H+L*, and a boundary tone L%. A similar intonational contour is found in wh- questions3. By contrast, the contour associated with a yes/no question is a rising one, expressed either as H* H+L* H% or (H) H+L* LH% (the latter proposed by Frota, 2002). Mata (1990) also observes falling contours in yes/no questions in spontaneous speech. As for alternative questions, only Viana (1987) and Mata (1990) have described them prosodically: the first intonational unit is described with a rising-fall-rising contour, whereas the second unit exhibits a rising-fall contour. The prosody of tags has still been only sparsely studied in EP (Mata, 1990, for high school lectures, and Cruz-Ferreira, 1998, for laboratory speech).
For Cruz-Ferreira (1998), tags are described with falling contours, while for Mata (1990) these structures are associated with rising ones. Furthermore, Falé (2005); Falé and Faria (2006) offer evidence for the categorical perception of intonational contrasts between statements and interrogatives, showing that the most striking cue associated with the perception of interrogatives is pitch range (the H% boundary tone has to be more than 2 semitones higher), whereas declaratives are mostly perceived based on final falls in the stressed syllable. The phonetic features of imperative intonation (encompassing imperatives, orders, and requests) are also targeted in Falé and Faria (2007). Higher global pitch values and a pitch contour characterized by an initial rise from the onset to the f0 peak, followed by a falling movement of large amplitude towards the end of the sentence, are the two most salient features that phonetically describe imperatives.

As for prosodic phrasing, Frota (2000); Viana et al. (2007) consider two different levels of phrasing, equating both of them to the intonational phrase (IP): the major IP and the minor IP, in line with Ladd (1996). Minor and major IPs are marked with breaks 3 and 4, respectively, and the diacritics “-” and “%” are used for boundary tones to represent the different strengths of the IP. See also Frota (2009) for a reanalysis of this proposal, ascribing both levels to the IP and marking both as “%”.

3 Cruz-Ferreira (1998) reports rising contours when the wh- question is polite.

3.6

Summary

This chapter focused on working definitions of prosody, on an overview of the core aspects of the original ToBI system, and on its automatic application, AuToBI. Finally, an overview of studies conducted for European Portuguese was also presented.


4

Corpora

This work benefits from the efforts in collecting and annotating data undertaken in recent years, both at the Spoken Language Laboratory (L2F) of INESC-ID and at the Linguistic Center of the University of Lisbon (CLUL/FLUL). The corpora used for this study are in the process of being made publicly available under the guidelines of META-NET, a European Network of Excellence dedicated to building the technological foundations of a multilingual European information society1, and of the national project COntrast and PArallelism in Speech (COPAS)2. The five main corpora used are (in chronological order): CORAL (Viana et al., 1998; Trancoso et al., 1998), CPE-FACES (Mata, 1999), ALERT (Neto et al., 2003; Meinedo et al., 2003), LECTRA (Trancoso et al., 2006, 2008), and, for the sake of comparison, newspaper data from Público.

These corpora were recorded and collected with different purposes, and represent different styles of spontaneous and prepared speech, in very distinct communicative situations. They have been used mainly for speech processing, but they cover a variety of applications, ranging from didactics of Portuguese and professional teaching practice, and prosodic studies on spontaneous and prepared non-scripted speech, to Automatic Speech Recognition and Text-to-Speech experiments. Throughout this thesis, we will point the reader to the corpus used for each specific task, and will provide further information concerning the reasons that informed our choices. In the next sections, a general overview of the five corpora is given.

4.1

The CORAL corpus

The CORAL corpus was collected in the framework of a national project sponsored by the PRAXIS XXI program, by a consortium formed by INESC (Institute of Systems Engineering and Computer), CLUL (Linguistic Center of the University of Lisbon), FLUL (Faculdade of Letters of the University of Lisbon), and FCSH-UNL (Faculty of Social Sciences and Humanities, New University of Lisbon). The purpose of this project was the collection of a spoken dialogue corpus, with several levels of labelling: orthographic, phonetic, prosodic, syntactic and semantic. 1 Vide

the official project website http://metanet4u.eu/.

2 PTDC/CLE-LIN/120017/2010.


The corpus has 9 hours of speech (61k words). A more thorough analysis can be found in Viana et al. (1998); Trancoso et al. (1998); Caseiro et al. (2002).

4.1.1

Contents of the maps

CORAL has 64 dialogues about a predetermined subject: maps. One of the participants (the giver) has a map with some landmarks and a route drawn between them; the other (the follower) also has landmarks, but no route, and consequently must reconstruct it. In order to elicit conversation, there are small differences between the two maps: one of the landmarks is duplicated in one map and single in the other; some landmarks are only present in one of the maps; and some are synonyms (e.g., curvas perigosas vs. troço sinuoso), as Figure 4.1 shows. In the 16 different maps, the names of the landmarks were chosen to allow the study of some connected speech phenomena.

Figure 4.1: Example of CORAL maps. On the left is a giver's map and on the right the corresponding follower's map. Example extracted from Viana et al. (1998).

4.1.2

Number and type of speakers

The 32 speakers were divided into 8 quartets, each taking part in 8 dialogues, totalling 64 dialogues. Given the reduced number of speakers, they were chosen to achieve an adequate gender balance, but were restricted in terms of age (under-graduate or graduate students) and


accent (Lisbon area). Speakers were recruited in pairs who knew each other, and the dialogues were arranged so that half of the conversations took place between friends and half between people who did not know each other. Furthermore, a pilot dialogue was conducted between two females in order to check the reliability of the recording conditions and the precision of the maps. Due to its richness, it was integrated into the corpus.

4.1.3

Recording conditions

The recordings took place in a soundproof room, with no visual contact between the speakers. They wore close-talking microphones and the recordings were made in stereo, directly to DAT, and later down-sampled to 16 kHz per channel. After adjusting the recording levels, no monitoring was done once the dialogues started. The signal from microphone 1 was recorded in the left channel and the signal from microphone 2 in the right channel. However, since both speakers were in the same room, the microphone of each speaker also captured the signal of the other, albeit at a much lower level.

4.1.4

Corpus division

The CORAL corpus was divided into train and test sets, with 75% (quartets 1 to 6) and 25% (quartets 7 and 8) of the data, respectively. The recordings were made in stereo with separate channels but, as stated in the previous section, each channel also contained the contribution of the interlocutor. The left and right channels were therefore later enhanced by cancelling the contribution of the respective interlocutor. This process was done for all the dialogues in the train set.

4.2

The CPE-FACES corpus

CPE-FACES (Mata, 1999) stands for Corpus of European Portuguese Spoken by Adolescents in School Context. It includes spontaneous and prepared non-scripted speech at high school (three teachers and twenty-five students), totaling 15h, all orthographically transcribed. The prepared non-scripted speech corresponds to oral presentations about a book the students had read, following specific programmatic guidelines, whereas in the spontaneous presentations they were unexpectedly asked to speak about a pleasant personal experience. This corpus of high school presentations was selected with the purpose of mirroring different levels of proficiency at the end of compulsory schooling (14-15 years old), and it also includes materials from high school teachers. Mata (1999) recorded this corpus with two main purposes: firstly, to study prosodic aspects of spontaneous speech and, secondly, to contribute to the description of prosodic patterns and their variation according to speech styles in a school context.


4.2.1

Recording conditions

The CPE-FACES corpus has audio and video recordings. For the audio recordings, the author used a UHER 400 Report Monitor recorder with a BASF LPR 35 magnetic tape, and a SENNHEISER MD 214 V-3 microphone. The recordings took place in two public schools in Lisbon during the school year of 1994/1995. The corpus was later digitized at 44.1 kHz, using 16 bits/sample, and afterwards downsampled to 16 kHz.

CPE-FACES was recently extended with the recordings of a male teacher addressing the same topic as a previously recorded female teacher - the lyric-tragic episode of the Sad Inês and of D. Pedro from the book Os Lusíadas by Luís de Camões. By adding these new recordings, we aimed at comparing the two teachers' skills and possibly different discursive strategies when teaching the same topic. The new class was recorded with a TASCAM HD-P2, a portable high definition stereo audio recorder. The sound was recorded in mono, with 16-bit samples at a rate of 44.1 kHz, in .wav format. The teacher used a Shure head-mounted microphone (a sub-miniature condenser head-worn microphone, model Beta 54).3

4.2.2

Subset selection

For the sake of comparison, and in order to add other levels of analysis to those already available for this corpus, we used the same subset described in Mata (1999), i.e., four students (ranging from 14 to 15 years old, balanced by gender) and their female teacher. As stated in section 4.2, the students were selected in order to mirror different levels of proficiency in Portuguese. The two 14-year-old students were classified with level 3 (on a scale of 0-5), whereas the two 15-year-old students were classified with level 4 (the highest given in the class).

4.3

The ALERT corpus

The ALERT corpus was collected in the scope of the European project with the same name (Neto et al., 2003; Meinedo et al., 2003). It is a corpus of European Portuguese broadcast news, originally collected for training and testing speech recognition and topic detection systems. For recent updates to this corpus and all the processing issues, vide also Meinedo et al. (2010) and Batista (2011). This collection was done in a joint collaboration between the public Portuguese television channel (RTP) and INESC-ID.

3 We would like to thank Professor Tjerk Hagemeijer for his generosity in sharing these materials with us.

4.4. THE LECTRA CORPUS

4.3.1


Data collection

A small Pilot Corpus was first collected, aiming at checking the collection process. This corpus was recorded during one week in April 2000, amounting to 5.5 hours. For the pilot corpus, the audio was recorded at 44.1 kHz, using 16 bits/sample; the following recordings were done at 32 kHz. The whole corpus was later downsampled to 16 kHz.

The corpus has 3 main parts: a Speech Recognition Corpus (SRC), a Topic Detection Corpus (TDC), and a Textual Corpus, but only the first one was used in this thesis; in fact, the much larger TDC was not manually orthographically transcribed. The SRC was collected with the purpose of training the acoustic models and of adapting the language models of the large vocabulary speech recognition system (Neto et al., 2008; Meinedo et al., 2008). This corpus was collected from October 2000 through January 2001, and includes 122 programs of different types, totalling 76h of audio data. The SRC was split into training (first 2 months, 61 hours), development (one week in December, 8 hours), and evaluation (one week in January, 6 hours) subsets. Altogether, the three subsets total 449k words.

4.4

The LECTRA corpus

The LECTRA corpus (Trancoso et al., 2008) was collected in the framework of a homonymous national project sponsored by FCT (POSC/PLP/58697/2004). The goal of LECTRA was to produce lecture transcriptions, which can be used not only for the production of multimedia lecture contents for e-learning applications, but also for enabling hearing-impaired students to have access to recorded lectures. The corpus includes seven 1-semester courses: Production of Multimedia Contents (PMC), Economic Theory I (ETI), Linear Algebra (LA), Introduction to Informatics and Communication Techniques (IICT), Object Oriented Programming (OOP), Accounting (CONT), and Graphical Interfaces (GI). All lectures were taught at Instituto Superior Técnico, Technical University of Lisbon (IST), and recorded in the presence of students, except IICT, which was recorded in another faculty of the University of Lisbon, in a quiet office environment, targeting an Internet audience. Most classes are 60-90 minutes long (with the exception of IICT classes, which are given in 30 minutes). All 7 speakers are native Portuguese speakers, and CONT is the only course given by a female speaker.

A total of 74 hours were recorded, of which 28 hours were orthographically transcribed and 10h received multilayer annotation (Trancoso et al., 2008). Recently, an additional set of 11 hours was orthographically transcribed in the scope of the Multilingual European Technology Alliance project (META-NET). A full account of all the experiments conducted with the extension of LECTRA can be found in Pellegrini et al. (2012). Table 4.1 shows the number of lectures per course and the annotated audio duration, where "V1" corresponds to the first version of the corpus, "Added" is the quantity of added data, and "V2" corresponds to the current extended version.


Acronyms | #lectures V1 | added | V2 | Duration V1 | added | V2
S1       | 5  | 3  | 8  | 2h25  | 2h30  | 4h55
S2       | 3  | 1  | 4  | 2h50  | 0h51  | 3h41
S3       | 6  | 1  | 7  | 4h40  | 1h02  | 5h42
S4       | 3  | 0  | 3  | 3h11  | -     | 3h11
S5       | 4  | 0  | 4  | 1h37  | -     | 1h37
S6       | 5  | 1  | 6  | 4h00  | 2h22  | 6h22
S7       | 2  | 5  | 7  | 2h00  | 4h09  | 6h09
Total    | 28 | 11 | 39 | 20h43 | 10h54 | 31h37

Table 4.1: LECTRA corpus.

The courses were selected in order to analyze the influence of several factors. The courses obviously differ in terms of topic. Some courses are characterized by a very high frequency of computer jargon in English (e.g., PMC, IICT, GI, OOP). Spelt or partially spelt acronyms (e.g., http) are also very frequent in these courses. Other courses, such as LA, are characterized by the presence of many mathematical variables. The courses also differ in terms of supporting materials: some use slides (PMC, ETI), some use a whiteboard (LA), and some use a mixture (OOP). The IICT course was taught in another university. It differs from the other courses in that it was targeted at an Internet audience, having been recorded by directly looking at a camera, in a quiet office environment. The annotation of the LECTRA corpus grew with the development of this thesis; therefore, the experiments reported in different chapters concern different subsets of the corpus.

4.4.1

Recording conditions

The recording conditions of the six IST courses are similar. The lapel microphone, used in almost all of them, has obvious advantages in terms of non-intrusiveness, but the high frequency of head turning causes audible intensity fluctuations. The use of a head-mounted microphone in the last 11 PMC lectures clearly mitigated this problem. However, this microphone was used with automatic gain control, causing saturation in 11% of the recordings, due to the increase of the recording sound level during the students' questions and in the segments after them.

4.4.2

Corpus division

The corpus was divided into 3 different sets: Train (78%), Development (11%), and Test (11%). Each set includes a portion of each course. The corpus separation follows a temporal criterion, where the first classes of each course were included in the training


data, and the final classes were included in the development and test sets. Figure 4.2 shows the portion of each course included in each set.
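Under the assumption that the lectures of a course are chronologically ordered, the temporal split can be sketched as follows (proportions as above; the helper name is ours, not part of the LECTRA tooling):

```python
def temporal_split(lectures, train=0.78, dev=0.11):
    """Split chronologically ordered lectures into train/dev/test sets,
    assigning the earliest lectures to training, the next ones to
    development, and the final ones to test (approximate proportions)."""
    n_train = round(len(lectures) * train)
    n_dev = round(len(lectures) * dev)
    return (lectures[:n_train],
            lectures[n_train:n_train + n_dev],
            lectures[n_train + n_dev:])
```

In the actual corpus the split is applied per course, so that every set contains material from every speaker.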

Figure 4.2: The LECTRA division.

4.5

The newspaper data from Público

From 1995 to 2004, INESC collected newspaper data from Público. Recent efforts have also been made by Batista (2011) in collecting and normalizing these texts. The subset used in this work is called PUBnews; it covers the period from 1999 to 2004, comprising around 150 million words.

4.6

Corpora annotation

CPE-FACES has its own annotation schema, described in Mata (1999). CORAL, ALERT, and LECTRA share the core annotation schema that will be described in this section. This annotation schema comprises orthographic, morpho-syntactic, and structural metadata information (Liu et al., 2006b; Ostendorf et al., 2008), i.e., disfluencies and punctuation marks, as well as paralinguistic information (laughs, coughs, etc.). Segmentation marks were also inserted for regions in the audio file that were not further analyzed (due to background noise or signal saturation). The multilayer annotation aimed at providing a suitable sample for further linguistic and speech processing analysis in the lectures domain. A full report on this can be found in Moniz et al. (2008b); Trancoso et al. (2008).


4.6.1

Orthographic tier

The orthographic manual transcriptions were done using the Transcriber and WaveSurfer tools. Automatic transcripts are used as a basis, which the transcribers then correct. At this stage, speech is segmented into chunks delimited by silent pauses, already containing audio segmentation related to speaker and gender identification and background conditions. All segmented speech chunks are manually punctuated and annotated with the set of diacritics presented in Table 4.2.

Symbols | Context of use | Examples
< >     | Auto-corrected sequences | sequences of disfluencies
[ ]     | Non-analyzable speech sequences | noisy conditions, inter alia
&&      | Delimits onomatopoeic words | &quá quá quá& (the sound made by a duck)
^       | Proper names | ^António
~       | At the right edge of the word, stands for irregular pronunciation | pode-nos servir (can serve us) pronounced as [S1r’nir] instead of [s1r’vir]
~       | At the left edge, stands for a spelled sigla or mathematical expression/variable | ~GNR / matriz ~A (matrix ~A)
@       | Acronyms | @INESC
+       | Word contractions or syncopated forms | +está (is) pronounced as [’ta] instead of [S’ta]
§       | Morphosyntactic irregular forms | depois parte destas contas §têm que ser §saldadas (afterwards part of these accounts §they will have to be settled)
%       | Filled pauses | %aa (%uh)
–       | Word fragments | complementar (additional)
=       | Excessive segmental prolongations | que= (that=) pronounced as [k1:]

Table 4.2: Symbols used in the orthographic tier.
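One practical use of these diacritics is producing a clean text stream for downstream processing. The toy function below removes a few of the Table 4.2 marks; it is our own illustrative sketch, not the project's normalization tool, and it handles only a subset of the symbols.

```python
import re

def strip_diacritics(annotated):
    """Remove some Table 4.2 annotation marks from a transcript chunk:
    filled pauses (e.g., %aa), word fragments (word–), and the
    word-level markers ^, @, ~, +, § (illustrative subset only)."""
    text = re.sub(r"%\S+", "", annotated)   # drop filled pauses
    text = re.sub(r"\S+–", "", text)        # drop word fragments
    text = re.sub(r"[\^@~+§]", "", text)    # strip word-level markers
    return re.sub(r"\s+", " ", text).strip()
```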

4.6.2

Disfluency tier

The annotation of disfluencies is provided in a separate tier (annotation file), closely following Shriberg (1994) and using basically the same set of labels. This annotation schema is based on Levelt's (1983) model, and has been successfully used, e.g., to train methods for the identification and automatic removal of disfluencies, in order to produce clean, readable texts. It also appears to be the most adequate from the point of view of linguistic research. In spite of some divergences, it is widely accepted that disfluencies have an internal structure and that three different regions need to be considered in their analysis: (i) the reparandum, containing the linguistic material to be repaired; (ii) the interregnum, a temporal region of variable length which may contain filled or unfilled pauses, as well as editing terms; and (iii) the repair itself. The reparandum is right-delimited by an interruption point, marking the moment in time at which an interruption is visible in the surface form. Following a suggestion of Eklund (2004), disfluent


items are indexed, as shown in the following example:

< tem_r1 um_r2 número_s1 tem_r1 um_r2 número_s1, não_e1. > tem_r1 um_r2 elemento_s1.

(< it has a number it has a number, no. > It has an element.) Such a solution appears to be less prone to errors than the complex bracketing used by Shriberg, in order to account for the nested structure of long disfluency sequences. Unlike Eklund, however, all items are indexed for a more direct access to eventual changes in word order and to the different strategies that may be used by speakers. The set of labels used in the disfluency tier are shown in Table 4.3. Labels

< ... > : delimits auto-corrected sequences of disfluencies
. : interruption point, the moment when the speaker interrupts to repair his/her speech
f : filled pauses, e.g., ou pode estar trancada (or it can be closed)
lm : segmental prolongations, e.g., de= (of=), pronounced as [d1:]
r : repetitions, e.g., e vocês sabem que (and you know that)
s : substitutions, e.g., são o conjunto dos ~X, ~Y (they are the set ~X, ~Y)
d : deletions, e.g., vai haver uma série de resultados, portanto, nós tínhamos a noção de ~R (there will be a series of results, therefore, we had the notion of ~R)
i : insertions, e.g., em que medida é que o padrão é útil? (in what way is the pattern useful?)
e : editing expressions, e.g., acabou o tempo (time ran out)
– : word fragments, e.g., complementar (additional)
~ : mispronunciations, e.g., pode-nos servir (can serve us), pronounced as [S1r’nir] instead of [s1r’vir]

Table 4.3: Labels used in the disfluency tier.
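The three-region analysis above can be encoded directly as a small data structure. The Python sketch below is one possible representation, for illustration only; the class and field names are assumptions, not the representation used by the actual annotation tool:

```python
from dataclasses import dataclass

@dataclass
class Disfluency:
    """One disfluent event in the three-region analysis described above.
    A sketch for illustration; field names are assumptions."""
    reparandum: list   # material to be repaired, right-delimited by the
                       # interruption point
    interregnum: list  # filled/unfilled pauses and editing terms (may be empty)
    repair: list       # the corrected material

# The indexed example above, encoded in this structure: the reparandum
# repeats "tem um número" twice, "não" is an editing expression, and the
# repair substitutes "elemento" for "número".
d = Disfluency(reparandum=["tem", "um", "número"] * 2,
               interregnum=["não"],
               repair=["tem", "um", "elemento"])
```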

4.6.3 Syntactic tier

The third level of manual annotation aimed both at providing basic syntactic information and at segmenting the speech string into Sentence-like Units (SUs), closely following the LDC guidelines (Simple Metadata Annotation Specification, Linguistic Data Consortium, version 6.2, February 2004, http://www.ldc.upenn.edu/Projects/MDE). As representing the latter in standard writing format would constitute an


unnecessary reduplication, two basic labels were used instead: SU for sentence-level breaks and SI for sentence-internal ones, as the following examples illustrate.

Vamos agora fazer uma breve introdução ao uso da Internet (SU).
(Let us now do a short introduction to Internet usage (SU).)

Se vocês não mandarem a directoria ^travel (SI), aquilo não vai funcionar (SU).
(If you do not send the directory ^travel (SI), it will not work (SU).)

4.6.4 Morphological information

Automatic part-of-speech (POS) classifications, initially produced by Marv (Ribeiro et al., 2003) and more recently by Falaposta (Batista et al., 2012b), are also applied to the corpus. Currently, LECTRA is being annotated using Falaposta, a CRF-based tagger robust to certain recognition errors, given that a recognition error may not affect all of its input features. It performs a pre-detection of numbers, Roman numerals, hyphens, enclitic pronouns, and foreign words, and uses features based on prefixes, suffixes, casing, and lowercase trigrams. It accounts for 28 part-of-speech tags, the same as Marv, processes 14k words/second, and achieves 95.6% accuracy. The tag set used in both classifiers is illustrated in Table 4.4.

4.6.5 Inter-transcriber agreement

Inter-transcriber agreement has been evaluated for ALERT (Batista, 2011) and for LECTRA (Pellegrini et al., 2012). Since we have been involved in the evaluation of the inter-transcriber agreement for LECTRA, full reports will be given for this corpus, and an overview will be provided for ALERT. The inter-transcriber agreement for ALERT was mainly evaluated based on punctuation marks, the focus of the work of Batista (2011). Cohen's Kappa (Carletta, 1996) was measured between previous annotators with no linguistic background and a fully revised version made by a linguist, whose main concern was correcting punctuation marks and adding disfluency annotation based on Moniz (2006). The striking differences are mainly due to commas, since in the old version annotators used commas to delimit sequences of disfluencies or whenever there was a silent pause, even if the placement of a comma did not respect the syntactic structure (e.g., often introducing a comma between the subject and the predicate).

As for LECTRA, three annotators (with the same linguistic background) transcribed the extended data. Due to the idiosyncratic nature of lectures as both spontaneous and prepared non-scripted speech, the annotators reported two main difficulties: punctuating speech and classifying disfluencies. Punctuation complexities are mainly associated with the fact that speech units do not always correspond to sentences in the established written sense. They


Category       Subcategory      Tag   Features
Noun           proper           Np    gender and number
               common           Nc
Verb           main             V=    mood, tense, person, and number
               auxiliary
Adjective                       A=    degree, gender and number
Pronoun        personal         Pp    person, number, case and formation
               demonstrative    Pd
               indefinite       Pi
               possessive       Po
               interrogative    Pt
               relative         Pr
               exclamative      Pe
               reflexive        Pf
Article        definite         Td    gender and number
               indefinite       Ti
Adverb                          R=    degree
Preposition                     S=    formation
Conjunction    coordinative     Cc
               subordinative    Cs
Numeral        cardinal         Mc    gender and number
               ordinal          Mo
Interjection                    I
Unique                          U     marker of mediopassive voice
Residual       foreign          Xf
               abbreviation     Xa
               acronym          Xy
               symbol           Xs
Number                          M
Punctuation                     O

Table 4.4: Morphological tag set from Ribeiro et al. (2003).

may be quite flexible, elliptic, restructured, and even incomplete (Blaauw, 1995). Punctuating speech units is therefore not always an easy task. For a more complete view on this, we used the summary of punctuation mark locations for European Portuguese described in Duarte (2000). The second main difficulty is related to the fact that specific types of disfluencies are not always easy to discriminate. To sum up, the guidelines given to our annotators were the annotation schema described in Trancoso et al. (2008) and the punctuation summary described in Duarte (2000). The general difficulty of measuring inter-transcriber agreement comes from the fact that two annotators can produce token sequences of different lengths. This is analogous to measuring speech recognition performance, where the length of the recognized word sequence is usually different from that of the reference. For that reason, the inter-transcriber agreement was calculated for pairs of annotators, considering the most experienced as the reference. The standard F1-measure and Slot Error Rate (SER) metrics were used, where each slot corresponds to


a word, a punctuation mark or a diacritic:

$$\text{F1-measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}\,, \qquad \text{SER} = \frac{\text{number of errors}}{\text{number of tokens in the reference}}\,,$$

where “tokens” correspond to words, punctuation marks and diacritics used in the reference orthographic tier, and errors comprise the number of inserted, deleted or substituted tokens. The inter-transcriber agreement of the three annotators is based on a selected sample of 10 minutes of speech from one speaker, involving more than 2000 tokens. The selection of the sample was motivated by the difficulties reported by the annotators in annotating disfluencies (e.g., complex sequences of disfluencies) and also punctuation marks. Table 4.5 reports the inter-transcriber agreement results for each pair of annotators. The table shows the number of (Cor)rect slots, (Ins)ertions, (Del)etions, (Sub)stitutions, the (F1)-measure, and the slot accuracy (SAcc), which corresponds to 1 - SER. There is an almost perfect agreement between A1 and the remaining annotators, and a substantial agreement for the pair A2-A3.

Annotators   Cor    Ins   Del   Sub   F1     SER     SAcc
A1-A2        1714   67    79    224   0.852  0.184   0.816
A1-A3        1632   38    34    351   0.808  0.210   0.790
A2-A3        1480   81    97    444   0.735  0.308   0.692

Table 4.5: Evaluation of the inter-transcriber agreement.
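The metrics can be checked directly from the slot counts in Table 4.5. In the sketch below, the reference length is approximated as Cor + Del + Sub; with this approximation the A1-A2 row reproduces the reported F1 of 0.852, and SER/SAcc come out within a rounding step of the reported 0.184/0.816 (tiny differences can stem from how the reference token count is obtained):

```python
def slot_agreement(cor, ins, dele, sub):
    """F1, SER and slot accuracy from slot counts (Cor, Ins, Del, Sub).
    The reference length is approximated here as Cor + Del + Sub."""
    n_ref = cor + dele + sub
    precision = cor / (cor + ins + sub)   # correct / hypothesis slots
    recall = cor / n_ref                  # correct / reference slots
    f1 = 2 * precision * recall / (precision + recall)
    ser = (ins + dele + sub) / n_ref
    return f1, ser, 1.0 - ser

# Counts from the A1-A2 row of Table 4.5
f1, ser, sacc = slot_agreement(1714, 67, 79, 224)
print(round(f1, 3))  # 0.852
```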

4.7 Corpora alignment

Figure 4.3 shows our in-house speech recognizer pipeline. The first module of the broadcast news processing pipeline, after jingle detection, performs audio diarization (Meinedo et al., 2008), which consists of assigning the portions of speech to a speaker and also of classifying the gender of the speakers. The second module is the automatic speech recognition module. Several other modules are also integrated in the speech recognizer pipeline, such as capitalization and punctuation, topic segmentation and indexing, and summarization. Our in-house speech recognizer (Meinedo et al., 2008), trained for the broadcast news domain, is totally unsuitable for other domains, such as the ones just presented: map-task dialogues, oral school presentations, or university lectures. Therefore, the ASR is used in forced alignment mode, in order not to bias the study with the poor results obtained with an out-of-domain recognizer. This work was initially developed in order to produce suitable data for training and evaluating the automatic recovery of punctuation marks and automatic capitalization, and more recently also disfluencies, since the efforts of annotating disfluencies for all the corpora were concluded later on. The manual orthographic transcripts include punctuation marks and capitalization



Figure 4.3: AUDIMUS processing pipeline. Figure extracted from Batista (2011), page 5.

information, which constitute our reference data. However, this is not the case for the fully automatic and force-aligned transcripts, which only include information such as word time intervals and confidence scores. The required reference must therefore be produced by means of alignments between the manual and automatic transcripts, a non-trivial task due to recognition errors. The initial step consists of transferring all relevant manual annotations to the automatically produced transcripts.

Figure 4.4: Creating a file that integrates the reference data into the ASR output.

Figure 4.4 illustrates the process of integrating the reference data in the automatic transcripts, and providing additional meta-information to the data. The alignment process requires conversion of the manual transcripts, usually available in the TRS (XML-based standard


Transcriber) format, to the STM (segment time mark) format, and of the automatic transcripts into the CTM (time-marked conversation) format. The STM format assigns time information at the segment level, while the CTM format assigns it at the word level. The alignment is performed using the NIST SCLite tool (http://www.nist.gov/speech), followed by an automatic post-processing stage that corrects possible SCLite errors and aligns special words, which can be written/recognized differently and are not handled by SCLite. This post-processing stage overcomes problems such as words like A.B.C. or C.N.N. appearing as single words in the reference data, but recognized as isolated letters.
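The transfer of reference annotations through word alignment can be sketched as follows. Here Python's difflib stands in for SCLite, and the function and variable names are assumptions for illustration; punctuation attached to reference words is copied onto hypothesis words wherever the two sequences match:

```python
import difflib

def transfer_punct(ref_words, ref_punct, hyp_words):
    """Transfer punctuation from the manual reference to the ASR words via
    word alignment (a sketch of the idea behind the SCLite-based step).
    ref_punct[i] is the mark following ref_words[i], or "" if none."""
    hyp_punct = [""] * len(hyp_words)
    sm = difflib.SequenceMatcher(a=[w.lower() for w in ref_words],
                                 b=[w.lower() for w in hyp_words])
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "equal":  # only aligned (correctly recognized) words
            for i, j in zip(range(i1, i2), range(j1, j2)):
                hyp_punct[j] = ref_punct[i]
    return hyp_punct

# Reference sentence from Figure 4.5; the ASR hypothetically drops "e"
ref = ["boa", "noite", "benfica", "e", "sporting", "estão", "sem", "treinador"]
pun = ["", ".", "", "", "", "", "", "."]
hyp = ["boa", "noite", "benfica", "sporting", "estão", "sem", "treinador"]
marks = transfer_punct(ref, pun, hyp)
```

Deleted or substituted words simply receive no mark, mirroring the fact that recognition errors make the transfer lossy.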

Figure 4.5: Example of an ASR transcript segment, enriched with reference data.

The resulting file corresponds to the ASR output, extended with: time intervals to be ignored in scoring, focus conditions, speaker information for each region, punctuation marks, capitalization, and part-of-speech information. Figure 4.5 shows an automatic transcript segment, enriched with reference data, corresponding to the sequence "Boa noite. Benfica e Sporting estão sem treinador. José Mourinho demitiu-se [do] Benfica."/Good evening. Benfica and Sporting have no coach. José Mourinho resigned from Benfica. The example illustrates two important sections: the characterization of the transcript segment and the discrimination of the word list that comprises it. As for the attributes of the segment, it is described by the following sequential information: it is a segment identified as 12; it has been automatically characterized as "clean" by the ASR system with a given confidence level (0.548); the segment temporal interval has been delimited with a very high confidence level (0.995); the speaker identity is "2001", with a low confidence in the recognition of this specific identity (0.379) but a high one in the classification of its gender as female (0.897); and the speaker is a native speaker of Portuguese. As


for the word list itself: each word element contains the lowercase orthographic form, start time, end time, and confidence level; a discrimination of the focus condition ("F3" stands for speech with music and "F0" stands for planned speech without background noise); information about the capitalized form (cap); whether or not it is followed by a punctuation mark (punct="."); and the part-of-speech tag (e.g., "A" for the adjective "Boa"/Good, "Np" for the proper noun "Mourinho"). Punctuation, part-of-speech, focus conditions, and information concerning excluded sections were updated with information coming from the manual transcripts. However, the reference transcript segments, manually created and more prone to be based on syntactic and semantic criteria, are usually different from the automatically created segments, given by the ASR and APP (audio pre-processing) modules, which are purely based on the acoustic properties of the signal. Therefore, whereas in the reference, exclusion and focus information are properties of a segment, in the ASR output such information must be assigned to each word individually. Recent efforts have also focused on including paralinguistic information (breath pausing, laughs, etc.) and disfluencies, both contained in the manual references, in the resulting enriched ASR file. Figure 4.5 shows examples of instantaneous events, such as jingles (Event name="[JINGLE_F]") and inspirations (Event name="[I]"), that appear between the words. However, other metadata annotation, covering phenomena like disfluencies, is also being incorporated in the final output. Figure 4.6 illustrates an example of a disfluency, containing the filled pause %aa. The sentence corresponds to "estávamos a falar %aa de bases de espaços lineares"/we were talking uh about linear spaces.
While marking instantaneous events corresponds to inserting a single event element, disfluency marking deserves special attention because it delimits regions, which may or may not span more than one segment. Incorporating events like disfluencies makes it possible to have other levels of analysis in a constant enrichment of the transcripts.

Figure 4.6: Excerpt of an enriched ASR output with marked disfluencies.

As previously mentioned, integrating the reference data into the ASR output is performed by means of word alignment between the two types of transcripts, which is a non-trivial task


mainly because of the recognition errors. The same difficulties are faced when transferring disfluencies and other events to the automatic transcripts.

Figure 4.7: Disfluency and other events alignment examples.

Figure 4.7 shows distinct examples of disfluency and paralinguistic event (mis)alignments, extracted from the SCLite output. Angular brackets delimit disfluent sequences and square brackets delimit paralinguistic events, e.g., [BB] (labial and coronal clicks) and [TX] (cough). The complete list of events can be found in the Transcriber menus. Examples 1 and 2 show that, apart from the filled pause (%aam/um), the remaining events are not present in the force-aligned data. Examples 3 and 4 correspond to low-energy segments that the ASR was unable to force-align. Example 3 is an aside from the teacher (Hum estes estão aqui caladinhos a aceitar o que estou a dizer/hum these here are so quiet, accepting what I am saying), while example 4 corresponds to an explanation of a slide, with the teacher's head movement distorting the capture of the speech signal (Tenho esta classe e esta outra e estas duas interagem desta maneira e a terceira [classe]/I have this class and this other one and these two interact this way and a third [class]).

4.8 Summary

This chapter addressed the description of the corpora that will be analyzed throughout this thesis, their annotation schema, and also their automatic alignment. It constitutes a crucial step for the subsequent inter-corpora analysis of the structural metadata events.

5 Towards an Automatic Prosodic Description

When studying the prosodic behaviour of disfluencies in extensive data sets, one faces difficulties regarding the prosodic annotation of such linguistic structures. Since no manual prosodic annotations were done for the whole corpora and only small subsets were available, an automatic prosodic description, flexible enough to account for disfluent structures but also for other linguistic events, was of crucial importance. An automatic prosodic description is motivated by the fact that the analysis of prosodic features is crucial to model and improve natural language processing systems in general and metadata events in particular. The literature points to a set of phenomena used to delimit boundaries in speech, even though languages may show variations in the use of this set of features. Based on cross-language comparisons and on the most productive phenomena used to delimit boundaries, there are language-independent proposals, such as those of Vaissière (1983, 2005), stating that pauses, final lengthening, pitch contours and energy peaks are amongst the most salient cues. Building on that assumption, algorithms using such features do seem to be transversal to different natural language processing systems. This chapter encompasses experiments on adjusting phones, marking syllables, and adjusting word boundaries and silent pauses, as well as experiments regarding the durational, pitch and energy characteristics of those segmental and supra-segmental units. It also encompasses initial experiments on adapting an automatic ToBI system (AuToBI) to Portuguese, aiming at a multilayered linguistic structural annotation of the corpora.

5.1 Recognizer output

The experiments performed in this chapter concern the automatic integration of prosodic features into the output of our in-house speech recognizer, previously presented in Chapter 4. Figure 5.1 shows an example of the output phone and diphone segmentation produced by the ASR system, the basis for all the work conducted.


2000_12_05-17_00_00-Noticias-7.spkr000 1 14.00 0.27 interword-pause
2000_12_05-17_00_00-Noticias-7.spkr000 1 14.27 0.01 L-m
2000_12_05-17_00_00-Noticias-7.spkr000 1 14.28 0.01 m
2000_12_05-17_00_00-Noticias-7.spkr000 1 14.29 0.04 m=u~
2000_12_05-17_00_00-Noticias-7.spkr000 1 14.33 0.01 u~
2000_12_05-17_00_00-Noticias-7.spkr000 1 14.34 0.02 u~=j~
2000_12_05-17_00_00-Noticias-7.spkr000 1 14.36 0.01 j~
2000_12_05-17_00_00-Noticias-7.spkr000 1 14.37 0.03 j~=t
2000_12_05-17_00_00-Noticias-7.spkr000 1 14.40 0.01 t
2000_12_05-17_00_00-Noticias-7.spkr000 1 14.41 0.02 t=u
2000_12_05-17_00_00-Noticias-7.spkr000 1 14.43 0.01 u
2000_12_05-17_00_00-Noticias-7.spkr000 1 14.44 0.01 u+R+
2000_12_05-17_00_00-Noticias-7.spkr000 1 14.45 0.01 L-b
2000_12_05-17_00_00-Noticias-7.spkr000 1 14.46 0.02 b
2000_12_05-17_00_00-Noticias-7.spkr000 1 14.48 0.01 b+R
2000_12_05-17_00_00-Noticias-7.spkr000 1 14.49 0.02 L-o~
2000_12_05-17_00_00-Noticias-7.spkr000 1 14.51 0.05 o~
2000_12_05-17_00_00-Noticias-7.spkr000 1 14.56 0.05 o~+R+
2000_12_05-17_00_00-Noticias-7.spkr000 1 14.61 0.02 L-d
2000_12_05-17_00_00-Noticias-7.spkr000 1 14.63 0.02 d
2000_12_05-17_00_00-Noticias-7.spkr000 1 14.65 0.06 d=i
2000_12_05-17_00_00-Noticias-7.spkr000 1 14.71 0.04 i
2000_12_05-17_00_00-Noticias-7.spkr000 1 14.75 0.01 i=A
2000_12_05-17_00_00-Noticias-7.spkr000 1 14.76 0.01 A
2000_12_05-17_00_00-Noticias-7.spkr000 1 14.77 0.01 A+R+
2000_12_05-17_00_00-Noticias-7.spkr000 1 14.78 0.06 interword-pause

Figure 5.1: Example of a file containing the phones/diphones produced by the ASR system

5.2 Adjusting phone boundaries

Our in-house speech recognizer uses, in addition to monophone units modelled by a single state, multiple-state monophone units and a fixed set of phone transition units, generally known as diphones, aimed at specifically modelling the most frequent intra-word phone transitions (Abad and Neto, 2008). The authors used a two-step method: first, a single-state monophone model is extended to multiple-state sub-phone modelling (e.g., "L-b", "b" and "b+R", where L stands for left state units and R for right state units); and secondly, a reduced set of diphone recognition units (e.g., "d=i") is incorporated to model phone transitions. This approach is supported by the view that each phone is usually considered to be constituted by three regions: an initial transitional region ("L-b"), a central steady region, known as the phone nucleus ("b"), and a final transitional region ("b+R"). The authors initially expected that modelling each of these regions independently would improve the acoustic phone modelling. Their expectations were confirmed, leading to an absolute reduction of 3% in the word error rate (from 26.8% to 23.8%). Figure 5.1 presents an excerpt of a PCTM input file, produced by the speech recognition system and containing a sequence of phones/diphones, corresponding to the sequence "muito bom dia"/good morning. The phonetic transcription uses SAMPA (Speech Assessment Methods Phonetic Alphabet).


The phones/diphones information was then converted into monophones by another tool, specially designed for that purpose. The conversion process was guided by an analysis performed on a reduced test set of 1 hour that was manually transcribed (Moniz et al., 2010). The analysis of this sample revealed several problems, namely in the boundaries of silent pauses and in their frequent misdetection, problems that affected the phone boundaries. Figure 5.2 presents an excerpt of the resulting information. Still, the existing information is insufficient for correctly assigning phone boundaries. We have used the mid-point of the phone transition, but setting more reliable monophone boundaries would then enable us to process pitch adjustments and, thus, to mark syllable boundaries and stress in a more sustainable way.

5.3 Marking syllable boundaries and stress

The recognizer output has no syllable boundaries or stress marks. This was a problem, since we know from the extensive literature on prosody that tonic and post-tonic syllables are of crucial importance to account for different prosodic aspects, such as nuclear and boundary tones or even rhythmic patterns. We built a set of syllabification rules, adapted from Mateus and d'Andrade (2000). We say adapted because we are using a set of phonetic rather than phonological syllabification rules. This is due to the specific properties of the ASR system, meaning that the output is closer to a phonetic representation than to a phonological one. However, the set of rules is quite flexible, and by changing a small subset, we would have a phonological syllabification system in line with the analysis of Mateus and d'Andrade (2000) (e.g., commenting out the rule for glides before a vowel, to obtain two syllables rather than a single one with a rising diphthong). These rules were applied to a lexicon with the pronunciations of each word in the ALERT corpus (around 100k words). The rules account fairly well for the pronunciation of native words, but they still need improvement to account for external sandhi processes, and also for words of foreign origin. Figure 5.2 illustrates an example of the syllable and stress marks.
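As a rough illustration of phonetic syllabification by onset maximization, the sketch below splits a SAMPA phone sequence into syllables. The symbol sets and the legal-onset list are simplified assumptions for illustration, not the actual rule set adapted from Mateus and d'Andrade (2000):

```python
# Illustrative SAMPA subsets for EP; NOT the thesis's actual rule set.
VOWELS = {"a", "6", "E", "e", "i", "O", "o", "u", "@", "1",
          "6~", "e~", "i~", "o~", "u~"}
GLIDES = {"j", "w", "j~", "w~"}
LEGAL_ONSETS = {("p", "R"), ("b", "R"), ("t", "R"), ("d", "R"), ("k", "R"),
                ("g", "R"), ("p", "l"), ("b", "l"), ("k", "l"), ("g", "l"),
                ("f", "R"), ("f", "l"), ("v", "R")}

def syllabify(phones):
    """Split a list of SAMPA phones into syllables by onset maximization."""
    nuclei = [i for i, p in enumerate(phones) if p in VOWELS]
    if not nuclei:
        return [phones]
    boundaries = [0]
    for left, right in zip(nuclei, nuclei[1:]):
        lo, hi = left + 1, right  # material strictly between the two nuclei
        # a glide right after the left nucleus closes that syllable
        # (falling diphthong, as in "muito")
        if lo < hi and phones[lo] in GLIDES:
            lo += 1
        # maximize the onset of the right-hand syllable
        if hi - lo >= 2 and tuple(phones[hi - 2:hi]) in LEGAL_ONSETS:
            boundaries.append(hi - 2)
        elif hi - lo >= 1:
            boundaries.append(hi - 1)
        else:
            boundaries.append(hi)
    boundaries.append(len(phones))
    return [phones[a:b] for a, b in zip(boundaries, boundaries[1:])]

# "muito" (m u~ j~ t u) and "sobre" (s o b R @)
muito = syllabify("m u~ j~ t u".split())
sobre = syllabify("s o b R @".split())
```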

5.4 Adjusting word boundaries and silent pauses

This work was done initially using a subset of the EP broadcast news corpus ALERT (Neto et al., 2003; Trancoso et al., 2003) as a preliminary evaluation set, but it was recently extended to cover all the corpora. Although the corpus used for training/development/evaluation of the speech recognizer includes 51 hours of orthographically transcribed audio, a limited subset of 1 hour was transcribed word-by-word, in order to allow us to evaluate the efficacy of the post-processing rules. With this sample we could evaluate the speech segmentation robustness,

(Work done in cooperation with Isabel Trancoso, Fernando Batista and Hugo Meinedo.)


2000_12_05-17_00_00-Noticias-7.spkr000 1 14.000 0.270 interword-pause
2000_12_05-17_00_00-Noticias-7.spkr000 1 14.270 0.040 "m
2000_12_05-17_00_00-Noticias-7.spkr000 1 14.310 0.040 u~
2000_12_05-17_00_00-Noticias-7.spkr000 1 14.350 0.035 j~
2000_12_05-17_00_00-Noticias-7.spkr000 1 14.385 0.035 #t
2000_12_05-17_00_00-Noticias-7.spkr000 1 14.420 0.030 u+
2000_12_05-17_00_00-Noticias-7.spkr000 1 14.450 0.040 "b
2000_12_05-17_00_00-Noticias-7.spkr000 1 14.490 0.120 o~+
2000_12_05-17_00_00-Noticias-7.spkr000 1 14.610 0.070 "d
2000_12_05-17_00_00-Noticias-7.spkr000 1 14.680 0.075 i
2000_12_05-17_00_00-Noticias-7.spkr000 1 14.755 0.025 #A+
2000_12_05-17_00_00-Noticias-7.spkr000 1 14.780 0.060 interword-pause

Figure 5.2: PCTM of monophones, marked with syllable boundaries (the diacritic #), stress (the diacritic ") and word boundaries (the diacritic +).

with several speakers in prepared non-scripted and spontaneous speech settings, with different strategies regarding speech segmentation and speech rate. Improving word and silent pause boundaries was the motivation for first applying post-processing rules to the baseline results, and later retraining the speech recognition models. These post-processing rules were applied off-line, and used both pitch and energy information. Pitch values were extracted using the Snack Sound Toolkit, but the only information used was the presence or absence of pitch. The energy information was also extracted off-line for each audio file. Speech and non-speech portions of the audio data were automatically segmented at the frame level with a bi-Gaussian model of the log-energy distribution. That is, for each audio sample a one-dimensional, energy-based Gaussian model with two mixtures is trained. The mixture with the lowest mean is expected to correspond to silence or background noise, and the one with the highest mean to speech. Frames of the audio file with a higher likelihood under the speech mixture are then labeled as speech, and those more likely generated by the non-speech mixture are labeled as silence. The integration of this extra information was implemented as a post-processing stage with three rules:

1. if the word starts with a plosive and is preceded by a silent pause, 60 ms of silence are integrated in the plosive sound segmentation;
2. if the word starts or ends with a fricative, the energy-based segmentation is used;
3. otherwise, words are delimited by pitch.
With these rules, we expect more adequate word boundaries than with our previous segmentation methods, without imposing thresholds on silent pause durations, recognized by Campione and Véronis (2002) as misleading cues that do not account for differences between speakers, speech rates or speech genres.

Snack Sound Toolkit: http://www.speech.kth.se/snack/
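The bi-Gaussian speech/non-speech segmentation just described can be sketched with a small EM loop over the frame log energies. The initialisation and iteration details below are illustrative assumptions, not the actual implementation:

```python
import numpy as np

def bigaussian_sns(log_energy, n_iter=50):
    """Frame-level speech/non-speech labelling with a two-component 1-D
    Gaussian mixture on log energy (a sketch; EM details are assumptions)."""
    x = np.asarray(log_energy, dtype=float)
    # initialise the two means from the data quantiles
    mu = np.array([np.quantile(x, 0.1), np.quantile(x, 0.9)])
    var = np.array([x.var(), x.var()]) + 1e-6
    w = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibility of each component for each frame
        ll = -0.5 * (x[:, None] - mu) ** 2 / var - 0.5 * np.log(2 * np.pi * var)
        resp = w * np.exp(ll)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: update weights, means and variances
        nk = resp.sum(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
        w = nk / len(x)
    speech = int(np.argmax(mu))           # highest-mean component = speech
    return resp.argmax(axis=1) == speech  # True where the frame is speech

# Synthetic check: clearly separated low/high log-energy frames
rng = np.random.default_rng(0)
log_e = np.concatenate([rng.normal(-8.0, 0.5, 200), rng.normal(-2.0, 0.5, 200)])
labels = bigaussian_sns(log_e)
```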

[Figure 5.3 plots the improved boundaries (%) for constituent-initial and constituent-final phones as a function of the boundary threshold (5 to 100 ms).]

Figure 5.3: Improvement in terms of correct word boundaries, after post-processing.

5.4.1 Impact on acoustic models

By comparing the results in terms of word boundaries before and after the post-processing stage on the limited test set of 1 hour, we have found that 9.3% of the constituent-initial phones and 10.1% of the constituent-final phones were modified. In what concerns the inter-word pauses, 62.5% of them were modified and 10.9% more were added. Figure 5.3 illustrates the improvement in terms of correct boundaries when different boundary thresholds are used. The graph shows that most of the improvements are concentrated in the interval of 5-60 ms. Our manual reference has 443.82 seconds of inter-word pauses; the modified version correctly identified 67.71 more seconds of silence than the original one, but there are still 14.81 seconds of silence that were not detected. The main achievements of our prosodically-based methods are: i) better delimitation of phones and inter-word pauses, and ii) more reliable identification of inter-word pauses. Both phone and pause durations are being used as cues to account for the segmentation of sentence-like units. The latter is particularly important, since we correctly identify more inter-word pauses related to punctuation marks, diarization, and the delimitation of disfluent sequences. Figure 5.4 shows an example of a silent pause detection corresponding to a comma. The two automatic transcriptions correspond to the results obtained before (misdetection) and after post-processing. Two properties of EP trigger erroneous segmentation: the frication of plosives (shown in Figure 5.5), such as [d] or [t], and the epenthetic vowel (in EP, [1]), both common at the end of a word followed by a pause. To the best of our knowledge, the relationship of those processes with the prosodic structure is still not well known. The use of alternative pronunciations could


Figure 5.4: Phone segmentation before (top) and after (bottom) post-processing. The original sentence is "o Infarmed analisa cerca de quinhentos [medicamentos], os que levantam mais dúvidas quanto à sua eficácia."/"Infarmed analyses about five hundred [drugs], those raising most doubts about their effectiveness." Initial and final word phones are marked with "L-" and "+R", respectively, whereas frequent phone transition units are marked with "=".

Figure 5.5: Example of an erroneous segmentation due to a fricated plosive.


[Figure 5.6 plots the improved boundaries (%) for constituent-initial and constituent-final phones as a function of the boundary threshold (5 to 100 ms), after retraining.]

Figure 5.6: Improvement of correct word boundaries, after retraining.

be a possible solution for both frication and insertion processes. In a second experiment, we retrained a new acoustic model using the modified phone boundaries. We verified that, by using this second model, the word error rate (WER) decreases from 22% to 21.5%. We have also compared the number of correct phone boundaries for a given threshold before and after the phone adjustment. Figure 5.6 shows that the phone boundaries produced by the second acoustic model are closer to the manual reference.
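The threshold curves of Figures 5.3 and 5.6 can be obtained by comparing boundary times against the manual reference at each tolerance. The sketch below is an assumption about that evaluation, not the actual scoring script; array and function names are illustrative:

```python
import numpy as np

def improved_fraction(manual, before, after, thresholds):
    """Fraction of boundaries that become correct (within a tolerance
    window of the manual reference) only after post-processing.
    All times in seconds; a sketch of the evaluation idea."""
    manual, before, after = map(np.asarray, (manual, before, after))
    out = []
    for t in thresholds:
        ok_before = np.abs(before - manual) <= t
        ok_after = np.abs(after - manual) <= t
        out.append(float((ok_after & ~ok_before).mean()))  # newly correct
    return out

# Toy check: two boundaries, post-processing moves both closer to the
# reference; tolerances of 40 ms and 60 ms.
frac = improved_fraction(manual=[1.00, 2.00],
                         before=[1.05, 2.10],
                         after=[1.01, 2.03],
                         thresholds=[0.04, 0.06])
```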

5.5 Pitch and energy

After setting the segmental and supra-segmental units, the next stage was to extract pitch and energy features for those units of analysis (duration was already given by the recognizer). Pitch (f0) and energy (E) are two important sources of prosodic information. So far, pitch and energy had only been used in heuristics for adjusting such units. By the time these experiments were conducted, that information was not available in the ASR output. For that reason, pitch and energy have been directly extracted from the speech signal, using the Snack toolkit (Sjölander et al., 1998) and the standard parameters taken from the WaveSurfer tool configuration (Sjölander and Beskow, 2000). Energy was extracted using a pre-emphasis factor of 0.97 and a Hamming window of 200 ms, while pitch was extracted using the ESPS method (auto-correlation). Algorithms for the automatic extraction of the pitch track have, however, some well-known problems: octave jumps; irregular values for regions with low pitch; disturbances in areas with micro-prosodic effects; and influences from noisy background conditions, inter alia.
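The energy extraction just described (pre-emphasis of 0.97, 200 ms Hamming window) can be sketched as follows; the hop size is an assumption, since the text does not fix it:

```python
import numpy as np

def short_time_log_energy(signal, sr, win_ms=200, hop_ms=10, preemph=0.97):
    """Short-time log energy with pre-emphasis and a Hamming window,
    mirroring the parameters quoted above (hop size is an assumption)."""
    x = np.asarray(signal, dtype=float)
    x = np.append(x[0], x[1:] - preemph * x[:-1])  # pre-emphasis filter
    win = int(sr * win_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    w = np.hamming(win)
    frames = [x[i:i + win] * w for i in range(0, len(x) - win + 1, hop)]
    # small floor avoids log(0) on silent frames
    return np.log(np.array([np.sum(f ** 2) for f in frames]) + 1e-10)

# Toy check: 0.5 s of a 100 Hz tone at 8 kHz
sr = 8000
sig = np.sin(2 * np.pi * 100 * np.arange(4000) / sr)
e = short_time_log_energy(sig, sr)
```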


Figure 5.7: Pitch adjustment. Several tasks were needed in order to solve some of these issues. We have avoided constant micro-prosodic effects and removed all the pitch values calculated for unvoiced regions. The latter is performed in a phone-based analysis by detecting all the unvoiced phones. Octavejumps were also eliminated. As to the influences from noisy conditions, the recognizer has an Audio Pre-processing or Audio Segmentation module (Meinedo and Neto, 2003), which classifies the input speech according to different focus conditions (e.g., noisy, clean), making it possible to isolate speech segments with unreliable pitch values. Figure 5.7 illustrates the process described above, where the original pitch values are represented by dots and the grey line represents the resultant pitch. The first tier is the orthographic tier, also containing POS tags; the second tier corresponds to the multiple-state monophone/diphone units, and the last tier is the resulting conversion to monophones.

5.6 Integration of prosodic information in transcription files

The segmental and supra-segmental levels just described are integrated in a diverse set of transcription files, making them available for further post-processing tasks. Figure 5.8 shows a workflow of the prosodic information added. Firstly, energy and pitch values are extracted from the speech signal. Secondly, the energy values are the basis for a Gaussian mixture model (GMM) classifier³, which distinguishes speech from non-speech regions.

³ Work done by Alberto Abad and Fernando Batista.


[Figure 5.8 diagram. Inputs: speech signal, PCTM, lexicon, and recognizer XML (excluded regions, focus conditions, punctuation, capitalization, morphology). Pipeline stages: extract energy; extract pitch; GMM classifier producing the speech/non-speech (SNS) decision; adjust diphone boundaries; produce monophones; pitch adjustment; mark syllables; add syllables & phones; add statistics; final XML.]

Figure 5.8: Workflow of prosodic information. Figure extracted from Batista (2011), page 46.


Figure 5.9: Excerpt of the final XML, containing information about the word "noite"/night.

The GMM classifier is a probabilistic model representing the presence of sub-populations (speech/non-speech) within an overall population (the speech signal). The main advantage of using the GMM classifier is a more flexible account of speech/non-speech regions: previously, we worked with a threshold of 40 dB to make that binary distinction, which was in fact a rough measure that did not account for all situations. Thirdly, based on the speech/non-speech classification and on the pitch values, the output diphones from the recognizer (given in a file with the extension .pctm) are combined to adjust the diphones and also to convert them into improved monophones. Fourthly, the resulting monophone file is used to remove pitch values from unvoiced regions. Fifthly, based on the lexicon and on the monophone file, syllables and stress are marked (the figure shows "mark syllables" for both tasks). This step results in a new pctm file. The output of the recognizer is an XML file containing information about areas to exclude (Meinedo et al., 2008), focus conditions of the speech signal (noisy, clean, etc.), punctuation and capitalization information (Batista et al., 2008), and finally, morphological classifications. All this information is combined with the resulting syllable and stress marks. Then, the refined pitch and energy values are added, and finally, all this information is stored in a final XML file, as Figure 5.8 illustrates.
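As an illustration only (the actual classifier was built by Abad and Batista, and its features and parameters are not reproduced here), a GMM-based speech/non-speech decision over frame log-energies can be sketched by comparing per-class mixture likelihoods; all parameter values below are invented for the example:

```python
import numpy as np

def gmm_loglik(x, weights, means, stds):
    """Log-likelihood of scalar x under a 1-D Gaussian mixture model."""
    comps = [w * np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))
             for w, m, s in zip(weights, means, stds)]
    return np.log(sum(comps))

# Hand-set illustrative parameters: log-energy is low for non-speech,
# higher and bimodal for speech.
NONSPEECH = dict(weights=[1.0], means=[2.0], stds=[0.5])
SPEECH = dict(weights=[0.6, 0.4], means=[5.5, 7.0], stds=[0.6, 0.8])

def is_speech(frame_energy):
    """Classify one frame by comparing class log-likelihoods."""
    return gmm_loglik(frame_energy, **SPEECH) > gmm_loglik(frame_energy, **NONSPEECH)
```

In practice the mixture parameters would be fitted on labelled frames (e.g., via expectation-maximization) rather than set by hand.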

Figure 5.9 shows an excerpt containing information about a single word: "syl" stands for syllable, "ph" for phone, "p*" for pitch, "e*" for energy, and "dur" corresponds to duration (measured in 10 ms frames). Information concerning words, syllables and phones can be found in the file, together with pitch, energy and duration information. For each unit of analysis we have calculated the minimum, maximum, average, and median, for both pitch and energy. Pitch slopes were also calculated after converting the pitch into semitone values. Different measures are provided, relating to different base values: pslope (base is the minimum pitch in the range), pmin_st_100 and pmax_st_100 (base is 100 Hz), pmin_st_spkr and pmax_st_spkr (base is the speaker minimum). Moreover, normalized values for pitch and energy based on the duration of events have also been calculated: pslope_norm and eslope_norm.

5.7 Extended set of prosodic features

An extended set of prosodic features was extracted, with the fields discriminated in Table 5.1. The set was built in order to accommodate the temporal scope of disfluent events and of their contexts. Features were calculated for the disfluent sequence itself and also for the two contiguous words before (disf-2 and disf-1) and after (disf+1 and disf+2) the disfluent sequence. Energy and f0 slopes within the words were calculated based on linear regression. Several normalizations were applied, in order to test whether all of them would prove significant. The first normalization is by Nolan (2003), and the formula corresponds to: ST = (12/log(2)) * log(x/100), where ST stands for semitone, x is the raw value to be normalized, and 100 is the reference value (100 Hz = 0 ST). A second normalization process has also proved efficient (Mata, 1999): the formula is exactly the same, but instead of 100 Hz, the reference value is the minimum pitch value of the speaker. The third normalization is a standard metric extensively used (e.g., Gravano (2009); Rosenberg (2009)), z-scores: z = (x − m)/sv, where x is a raw measurement to be normalized (e.g., the duration of a particular word), and m and sv are the mean and standard deviation, respectively. It is also important to mention that the unified ASR output representation considers and stores a minimal set of information, from which an extended set of features can be derived. For example, the following set of features was considered for the previously described task on the analysis of disfluencies: a) number of words, syllables, and phones inside and outside disfluencies; b) duration of speech with and without utterance-internal silences; c) articulation rate, rate of speech, and phonation ratio, per sentence and per speaker.
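The two pitch normalizations and the z-score can be written directly from the formulas above; this is a direct transcription, with function names of our own choosing:

```python
import math

def semitones(x_hz, ref_hz=100.0):
    """Nolan (2003): ST = (12 / log 2) * log(x / ref); ref = 100 Hz -> 0 ST.
    Passing the speaker's minimum pitch as `ref_hz` gives the second,
    speaker-relative normalization (Mata, 1999)."""
    return (12.0 / math.log(2)) * math.log(x_hz / ref_hz)

def zscore(x, mean, sd):
    """Standard z-score normalization: z = (x - m) / sv."""
    return (x - mean) / sd
```

A useful sanity check is that doubling the frequency adds exactly one octave, i.e., 12 semitones.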

5.8 Towards an Automatic ToBI annotation system for EP

As stated in Chapter 3, ToBI⁴ stands for Tones and Break Indices (Silverman et al., 1998; Pitrelli et al., 1994). Efforts have been made to adapt this system to Portuguese, namely with the results of Viana et al. (2007), supported by several studies and distinct corpora of Portuguese. However, an automatic annotation of a P_ToBI is not yet available. Therefore, this section describes our preliminary work to implement for European Portuguese⁵ the Automatic ToBI annotation system (AuToBI) developed for Standard American English (SAE), or Mainstream American English, by Rosenberg (2009, 2010).

⁴ For a complete description of the system and also for a list of all the languages and dialects already described with ToBI, vide the webpage http://www.ling.ohio-state.edu/~tobi/
⁵ We would like to thank Professors Julia Hirschberg and Andrew Rosenberg for their many helpful discussions.


#   Field description
1   file name and speaker id
2   start time
3   end time
4   confidence level
5   speech conditions
6   part-of-speech
7   orthographic transcription
8   phonetic transcription
9   pitch maximum
10  pitch minimum
11  pitch average
12  pitch median
13  pitch standard deviation
14  energy maximum
15  energy minimum
16  energy average
18  energy median
19  energy standard deviation
20  energy slope
21  energy slope normalized
22  pitch slope
23  pitch slope normalized
24  pitch normalized by 100
25  pitch normalized by minimum speaker pitch
26  pitch normalized with z-score
27  orthographic transcription of the utterance
28  orthographic transcription of the disfluent sequence

Table 5.1: Extended set of prosodic features.

5.8.1 Preliminary results

We applied the AuToBI system to the LECTRA corpus. AuToBI requires three inputs: a wave file, a Praat TextGrid file with word segmentation (as exemplified in Figure 5.10), and previously trained classifiers for the prosodic event detection and classification tasks. The last input may be substituted by a ToBI-annotated TextGrid file, which can serve as an input to train new models. For the set of experiments in this work, we used the LECTRA corpus word segmentation given by the recognizer in forced alignment mode, i.e., the automatic speech recognizer aligned the manual transcripts with the signal. We also used the existing classifiers trained with annotated data from both spontaneous and read American English corpora. By using models already trained for a completely different language, we aimed at answering two research questions: i) Can we predict and classify prosodic events in


Figure 5.10: Excerpt of an input TextGrid file from LECTRA containing word segmentation. The text corresponds to: mas agora pergunto-vos (But now let me ask you). Silences correspond to blank spaces.

European Portuguese based on trained models from American English?; ii) Which previously trained model, the spontaneous or the read one, best accounts for the prosodic patterns in LECTRA, a university lecture corpus? We could also ask: Are there substantial gains when training with European Portuguese ToBI annotated data? The experiments conducted so far cannot answer this latter question, since we do not have enough P_ToBI annotated material (at least 1h) to conduct the experiment. However, efforts are being made to accomplish this task in the near future under the scope of the COPAS project. From the resulting TextGrid files (see Figure 5.11 as an example), we analyzed a small sample from one speaker (5 minutes of speech selected from the test set). We evaluated the following binary decisions: i) accent vs. non-accent detection and ii) boundary vs. non-boundary detection. Although we manually annotated all the boundaries in the small sample, as Figure 5.12 shows, the ones being evaluated are intonational phrase boundaries (break index 4) only. The option of evaluating only the intonational phrase boundaries is supported by studies by Pitrelli et al. (1994) and Syrdal et al. (2001), which have shown that the detection of intermediate boundaries (break index 3) in American English is around 50%, meaning that they are hard to detect and evaluate. The performance results are evaluated using the standard metrics Precision and Recall:

Precision = correct / (correct + substitutions + insertions)
Recall = correct / (correct + substitutions + deletions)

where correct is the number of correctly identified prosodic events, insertions corresponds to false acceptances, and deletions corresponds to missing prosodic events.

Figure 5.11: Excerpt of an AuToBI output TextGrid file. The text is the same as in the previous figure.

Figure 5.12: Excerpt of a manual output TextGrid file. The text is the same as in the previous figures.

Prosodic events     Model        Precision  Recall
Accent detection    Read         94%        68%
                    Spontaneous  80%        58%
Boundary detection  Read         98%        99%
                    Spontaneous  75%        99%

Table 5.2: Results for the AuToBI performance on EP data.

Since we are evaluating binary decisions and not the accent and boundary inventory types, the substitutions correspond to zero events. We would expect the spontaneous model to perform better for the university lecture corpus: although the teachers prepare their courses, there is a relevant degree of spontaneity in their speech, with frequent disfluencies, backchannels, discourse markers, grunts, inter alia. However, as Table 5.2 shows, the best results are achieved for boundary detection with the read model. Even though this model was trained for American English, the recall and precision percentages are very good. Silent pauses and pitch and energy resets appear to be the most reliable array of phonetic cues contributing to these results. The lower precision obtained with the spontaneous model might be due to two factors: i) distinguishing intonational phrase breaks (IP) from disfluent endings that precede silence; ii) identifying IP boundaries that are not coincident with silence⁶. Both of these are more common in spontaneous than in read speech, and difficult for the system to classify. Regarding accent detection, again the read model performed better than the spontaneous one. The precision is high (94%), meaning that the majority of the prosodic events identified as accented are correctly so, whereas the recall (or coverage) is low, meaning that the system often misses pitch accents that were assigned by the annotator. New training models adapted for European Portuguese would possibly help to clarify differences regarding accent distribution within the intonational phrase, as well as the pitch accent types most commonly used.
There are some noticeable differences between the annotation provided by AuToBI and the manual classifications, mainly related to nuclear tones, such as H+L*, which is not accounted for in AuToBI, or to pre-nuclear L+H*, which may be realized either as this bi-tonal event or as a simple H* in European Portuguese, as Figure 5.12 shows.
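For concreteness, the two evaluation metrics used above can be computed as a direct transcription of their formulas:

```python
def precision_recall(correct, substitutions, insertions, deletions):
    """Precision and Recall as defined for prosodic event detection.
    For binary detection tasks (presence/absence), substitutions are zero."""
    precision = correct / (correct + substitutions + insertions)
    recall = correct / (correct + substitutions + deletions)
    return precision, recall
```

With 8 correct events, 2 insertions, and 2 deletions, both metrics come out at 0.8.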

5.9 Summary

This chapter focused on the steps taken towards an automatic prosodic description. From segmental to supra-segmental levels, pitch, energy and duration features were processed for phones, syllables, words, and sentence-like units. Experimental steps towards an AuToBI annotation for Portuguese were also described, showing promising results regarding its forthcoming adaptation to Portuguese.

⁶ Professor Andrew Rosenberg, personal communication.

6 Analysis of interrogatives: a case-study

The aim of this chapter is twofold: to quantify the distinct interrogative types in different domains for European Portuguese, and to discuss the weight of the linguistic features that best describe these structures, in order to detect interrogatives in speech. The automatic detection of interrogatives may be of particular interest for potential applications, such as the punctuation of automatic speech recognition (ASR) transcripts. In fact, our previous punctuation module aimed only at full stops and commas. One of the objectives of this work is the extension of this module to encompass question marks as well, by combining different types of features. European Portuguese (EP), like many other languages, has different interrogative types (Mateus et al., 2003): yes/no questions, alternative questions, wh- questions and tag questions. A yes/no question requests a yes/no answer (Estão a ver a diferença?/Can you see the difference?). In EP, this type of interrogative generally presents the same syntactic order as a statement – English may encode the yes/no interrogative with subject-auxiliary verb inversion. An alternative question presents two or more hypotheses (Acha que vai facilitar ou vai ainda tornar mais difícil?/Do you think that it will make it easier or will it make it even harder?) expressed by the disjunctive conjunction ou/or. A wh- question has an interrogative word, such as qual/what, quem/who, quando/when, onde/where, etc., corresponding to what is being asked about (Qual é a pergunta?/What is the question?). In a tag question, an interrogative clause is added to the end of a statement (Isto é fácil, não é?/This is easy, isn’t it?). This diversity may cause some interrogative types to be easier to detect than others. State-of-the-art studies on sentence boundary detection in general and on the detection of interrogatives in particular have discussed the relative weights of different types of feature. Shriberg et al. 
(2009) report that prosodic features are more relevant than lexical ones, and that better results are achieved when combining both types of features; Wang and Narayanan (2004) claim that results based only on prosodic properties are quite robust; Boakye et al. (2009), analyzing meetings, state that lexico-syntactic features are the most important ones. These diverging opinions led us to question whether the relative weights of these features should take into account the nature of the corpus, namely the most characteristic types of interrogative in each, and the ways a particular language encodes sentence-type forms. This study


addresses that question, using three distinct corpora for European Portuguese: broadcast news (the ALERT corpus, Neto et al. (2003); Meinedo et al. (2003)), classroom lectures (the first transcribed subset of the LECTRA corpus, Trancoso et al. (2006, 2008)), and map-task dialogues (the CORAL corpus, Viana et al. (1998); Trancoso et al. (1998)). The manual orthographic transcriptions of these spoken corpora were recently revised by an expert linguist, thereby removing many inconsistencies in terms of punctuation marks that affected our previous results. For the sake of comparison, we also used a newspaper text corpus with 148M words. All the corpora were subdivided into train, test, and development sets.

6.1 Statistical Analysis of Interrogatives

Table 6.1 shows the overall frequency of interrogatives and other punctuation marks in the train sets of the different corpora, taking into account the number of sentence-like units (SUs). The overall frequency of interrogatives in the data differs substantially: on the one hand, the university lectures and map-task corpora present 20.7% and 23.2%, respectively; on the other hand, the BN and newspaper corpora have only 2.1% and 1.0%, respectively. The first two corpora thus have ten times more interrogatives than the latter two, a difference interpretable in terms of the teacher's need to verify whether the students are understanding what is being said, and of the giver's concern with making his/her follower locate the right path on the map. In broadcast news, interrogatives are almost exclusively found in interviews and in transitions from anchormen to reporters.

Corpus     Type                 ?      !     .      ,      :     ;     #SUs
LECTRA     university lectures  20.7%  0.0%  41.6%  37.6%  0.0%  0.1%  6,524
CORAL      map-task dialogues   23.2%  0.4%  66.9%  8.0%   0.0%  1.4%  8,135
ALERT      broadcast news       2.1%   0.1%  58.1%  39.1%  0.5%  0.2%  26,467
Newspaper  newspaper text       1.0%   0.2%  30.7%  57.5%  2.4%  0.7%  5,841,273

Table 6.1: Overall punctuation mark frequency in the training sets.

6.1.1 Overall frequency of interrogative types in the training corpora

The automatic tagging in terms of interrogative types was done for the four corpora using the following set of heuristic rules:

1. if the interrogative sentence has one of the following items quem, qual, quais, quanto(s), quanta(s), quando, quê, a quem, o quê, por que, para que, onde, porque, porquê, o que, como, then it is classified as a wh- question;

2. if the interrogative sentence has the disjunctive conjunction ou, then it is an alternative question;


3. if the interrogative sentence has one of the following items não é, certo, não, sim, okay, humhum, está bom, está, está bem, não foi, tens, estás a ver, estão a ver, é isso, de acordo, percebido, perceberam, correcto, não é verdade prior to a question mark, then it is a tag question;

4. otherwise, it is a yes/no question.

This set of rules models the lexical expressions that may function as interrogatives, including expressions observed in the training sets that are still not fully described for EP (e.g., tag questions such as okay?, certo?, correcto?).
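A simple re-implementation of rules 1-4 could read as follows; the actual tagger may differ in tokenization and matching details, and the padded-substring test used here can miss cue words adjacent to punctuation:

```python
WH_ITEMS = ["quem", "qual", "quais", "quanto", "quantos", "quanta", "quantas",
            "quando", "quê", "a quem", "o quê", "por que", "para que",
            "onde", "porque", "porquê", "o que", "como"]
TAG_ITEMS = ["não é", "certo", "não", "sim", "okay", "humhum", "está bom",
             "está", "está bem", "não foi", "tens", "estás a ver",
             "estão a ver", "é isso", "de acordo", "percebido",
             "perceberam", "correcto", "não é verdade"]

def classify_interrogative(sentence):
    """Apply rules 1-4 in order to a sentence ending in '?'."""
    text = " " + sentence.rstrip("?").strip().lower() + " "
    if any(f" {item} " in text for item in WH_ITEMS):  # rule 1: wh- question
        return "wh"
    if " ou " in text:                                 # rule 2: alternative
        return "alternative"
    # rule 3: tag expression immediately before the question mark
    if any(text.endswith(f" {item} ") for item in TAG_ITEMS):
        return "tag"
    return "yes/no"                                    # rule 4: default
```

For instance, "Isto é fácil, não é?" falls through rules 1 and 2 and matches the sentence-final tag expression não é, while "Estão a ver a diferença?" matches nothing and defaults to yes/no.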

Corpus     Wh     Alt   Tags   Y/N    Total SUs
LECTRA     42.2%  2.2%  27.0%  28.6%  6,524
CORAL      7.5%   3.1%  12.3%  77.1%  8,135
ALERT      34.2%  5.5%  10.9%  49.4%  26,467
Newspaper  41.3%  7.8%  1.1%   49.8%  5,841,273

Table 6.2: Overall frequency of interrogative types in the training corpora (automatically produced results).

Table 6.2 shows the overall frequency of interrogative types in the training sets of the four corpora. The table shows that there are different trends in the distribution of interrogative subtypes as well. The university lectures, broadcast news, and newspaper corpora present comparable results for wh- questions. Concerning tag questions, the university lectures corpus has the highest percentage. This may be associated with the teacher's need to confirm whether the students are understanding what is being taught, and ultimately with styles of lecturing. The map-task corpus also shows a representative amount of tag questions, but yes/no questions are the most frequent type of interrogative in this corpus, mainly due to the giver's description of the map and the need to ask whether the follower is understanding the instructions. The ALERT corpus shows results that are closest to the newspaper corpus, as expected. As for alternative questions, they are quite residual across these four corpora. The test sets of the corpora were manually annotated by an expert linguist. Table 6.3 shows the frequency of each type of interrogative, before and after the manual correction. The agreement between the automatic and the manual classifications was evaluated using Cohen's kappa values (Carletta, 1996), and is shown in the last column. In terms of question mark types, the automatic classification performed best for the alternative questions (0.912 Cohen's Kappa), followed by wh- questions (0.874) and yes/no questions with similar results (0.863). The most inconsistent classification, as expected, concerns tag questions (0.782). The table reveals that the rules perform fairly well, the most noticeable difference being the classification between tag and yes/no questions in CORAL.
The low Cohen's Kappa value of this corpus is due to structures that may be classified either as elliptic yes/no questions (sim?/yes? or é?/is it?) or as tag questions (declarative + sim?/yes? or declarative + é?/is it?). The LECTRA corpus


                              Automatic                     Manual
Corpus     #SUs    #?     Wh     Alt   Tags   Y/N     Wh     Alt   Tags   Y/N     Cohen's Kappa
LECTRA     262     102    39.7%  2.2%  39.0%  19.1%   41.4%  1.0%  40.4%  17.1%   0.922
CORAL      3,406   511    9.4%   3.5%  13.5%  73.6%   10.6%  5.1%  18.2%  66.1%   0.849
ALERT      2,671   151    42.4%  2.0%  11.2%  44.4%   40.4%  2.6%  10.0%  47.0%   0.895
Newspaper  90,534  2,859  44.9%  7.3%  0.9%   46.9%   43.5%  6.3%  0.8%   49.4%   0.900

Table 6.3: Automatic and manual classification of interrogative types in the test sets.

presents the highest Cohen's Kappa value, and differences are mostly due to similar classification errors between tags and yes/no questions. The ALERT and Newspaper corpora contain very complex structures which are hard to disambiguate automatically (e.g., embedded subordinate and coordinate clauses), and they are similar in terms of Cohen's Kappa. Thus, we may conclude that broadcast news and newspaper data are more similar in what concerns the frequency of interrogative subtypes and the nature of the errors, whereas university lectures and map-task dialogues share flexible structures characteristic of spontaneous speech, such as elliptic yes/no questions and tag expressions, which are hard to identify automatically. Based on language-dependency effects (fewer lexical cues in EP than in other languages, such as English) and also on the statistics presented, one can say that, ideally, around 40% of all interrogatives in broadcast news would be identified mainly by lexical cues – corresponding to wh- questions – while the remaining ones would require the use of prosodic features to be correctly identified.

6.2 Punctuation experiments for interrogatives

This section concerns the automatic detection of question marks in the different corpora, using different combinations of features. This detection will allow the extension of our previous punctuation module, which was initially designed to deal only with the two most frequent punctuation marks: full stop and comma (Batista et al., 2008). This module is also based on maximum entropy (ME). All the experiments described in this section follow the same ME-based approach, making use of the MegaM tool (Daumé III, 2004) for training the maximum entropy models. The performance results are evaluated using the standard metrics already applied in full stop and comma detection. In the remainder of this section, we will first assess the performance of the module using only lexical information, learned from a large corpus of written data, and then study the impact of introducing prosodic features, analyzing the individual contribution of each prosodic feature on spontaneous and prepared speech. In order to avoid the impact of recognition errors, all the transcriptions used in these experiments were obtained by means of a forced alignment between the speech and the manual transcriptions, performed by the Audimus speech recognition system. The test data


of the ALERT corpus has about 1% of alignment errors, while the LECTRA test corpus has about 5.3% alignment errors. The CORAL corpus was not used in this experiment, because of the large percentage of overlapping speech, for which no manually marked time boundaries were available. The reference punctuation concerning the ALERT and LECTRA corpora was provided by the manual transcriptions of these corpora. The NIST SCLite tool¹ was used for this task, followed by a post-processing step for correcting some SCLite errors.

6.2.1 Baseline experiments

Corpora    #SUs     Correct  Wrong  Missed  Precision  Recall  F      SER
LECTRA     1,120    158      32     220     83.2%      41.8%   55.6%  66.7%
ALERT      9,552    128      25     287     83.7%      30.8%   45.1%  75.2%
Newspaper  222,127  1100     236    1740    82.3%      38.7%   52.7%  69.6%

Table 6.4: Baseline results, achieved with lexical features only.

The baseline results were achieved by training a discriminative model from the Newspaper corpus, containing about 143M words of training data. Detecting an interrogative is a binary problem (presence/absence), and each event corresponds to an entire sentence, instead of a word as in full stop and comma detection. Thus, the following features were used for a given sentence: wi, wi+1, 2wi−2, 2wi−1, 2wi, 2wi+1, 3wi−2, 3wi−1, start_x, x_end, len, where wi is a word in the sentence, wi+1 is the word that follows, and nwi±x is the n-gram of words that starts x positions after or before position i. The start_x and x_end features are used for identifying word n-grams occurring either at the beginning or at the end of the sentence, and len corresponds to the number of words in the sentence. The corresponding results are shown in Table 6.4, where Correct is the number of correctly identified sentences, Wrong corresponds to false acceptances or insertions, and Missed corresponds to missing slots or deletions. Table 6.4 reveals a precision around 83% in all corpora, but a small recall. The main conclusion is that the recall percentages using this limited set of features are correlated with the identification of a specific type of interrogative, wh- questions: recall percentages are comparable to those of the wh- question distribution across corpora, while yes/no and tag questions are only residually identified.
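The sentence-level feature set just described can be sketched as follows; the feature-string naming scheme is our own, not necessarily the one used by the module:

```python
def sentence_features(words):
    """Sketch of the lexical feature set: word unigrams, bigrams, and
    trigrams at each position, sentence-initial and sentence-final
    n-grams (here bigrams), and sentence length."""
    feats = {}
    n = len(words)
    for i, w in enumerate(words):
        feats[f"w[{i}]={w}"] = 1
        if i + 1 < n:
            feats[f"2w[{i}]={w}_{words[i+1]}"] = 1
        if i + 2 < n:
            feats[f"3w[{i}]={w}_{words[i+1]}_{words[i+2]}"] = 1
    feats[f"start_{'_'.join(words[:2])}"] = 1   # sentence-initial bigram
    feats[f"{'_'.join(words[-2:])}_end"] = 1    # sentence-final bigram
    feats["len"] = n                            # number of words
    return feats
```

Such a dictionary of binary indicators plus the length feature is the typical input representation for a maximum entropy classifier like MegaM.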

6.2.2 Experiments with lexical and speaker-related features

The second experiment consisted of re-training the previous model, created from newspaper corpora, with the transcriptions of each training corpus. The ME models were trained on the forced-aligned transcriptions for each speech corpus, bootstrapping from the initial training with newspaper text. As spoken transcriptions contain much more information concerning

¹ http://www.itl.nist.gov


each word, we have also used all the lexical and acoustic information available. Besides the previous lexical features, the following features were added: GenderChgs, SpeakerChgs, and TimeGap, where GenderChgs and SpeakerChgs correspond to changes in speaker gender and in speaker cluster from the current to the next sentence, and TimeGap corresponds to the time period between the current and the following sentence, assuming that sentence boundaries are given by the manual annotation. Lacking a better description for this heterogeneous set, these features will henceforth be designated as acoustic. Table 6.5 illustrates the results achieved with these features, revealing a significant overall performance increase, especially for the LECTRA corpus. In this corpus, there are relatively few speaker changes, thus showing the relevance of the TimeGap feature.

Corpora  #Correct  #Wrong  #Missed  Precision  Recall  F      SER
LECTRA   271       52      107      83.9%      71.7%   77.3%  42.1%
ALERT    144       27      271      84.2%      34.7%   49.1%  71.8%

Table 6.5: Results after re-training with transcriptions and adding acoustic features.

6.2.3 Experiments with prosodic features

Our next experiments aimed at analyzing again the weight and contribution of different prosodic features per se and the impact of their combination, this time for question mark detection. The same procedures used for full stop and comma were applied. Thus, features were calculated for each sentence transition, with or without a pause, using the same analysis scope as Shriberg et al. (2009) (last word, last stressed syllable, and last voiced phone of the current boundary, and first word and first voiced phone of the following boundary). The following set of features has been used: f0 and energy slopes in the words before and after a silent pause, f0 and energy differences between these units, and also the duration of the last syllable and the last phone. With this set of features, we aimed at capturing nuclear and boundary tones, energy and pitch resets, and final lengthening. This set of prosodic features had already proved useful for the detection of the full stop and comma, showing an improvement of more than 2% SER (absolute) for the ALERT corpus (Batista et al., 2010) relative to the results obtained using only lexical and acoustic features. The results of recovering question marks over the LECTRA and ALERT corpora, using prosodic features, are presented in Tables 6.6 and 6.7, respectively. Different combinations of features were added to a standard model, which uses lexical and acoustic features, with different impact depending on the corpus. When comparing with the second experiment results, some improvements were achieved, especially for the LECTRA corpus, where the combination of all features produced the best results. Our results partially agree with the ones reported in Shriberg et al. (2009), regarding the contribution of each prosodic parameter, and also the set of discriminative features used, where
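The slope and reset features mentioned above, computed by linear regression over the frames of a unit, can be sketched as follows (frame step and inputs are illustrative):

```python
import numpy as np

def slope(values, frame_step=0.01):
    """Least-squares slope of a pitch or energy track over one unit
    (e.g., the last word before a boundary). `frame_step` is the frame
    shift in seconds; unvoiced frames should be removed beforehand."""
    t = np.arange(len(values)) * frame_step
    a, _b = np.polyfit(t, values, 1)  # fit values ~ a*t + b
    return a

def reset(before, after):
    """Difference between units across a boundary (e.g., a pitch reset
    between the last word of one sentence and the first of the next)."""
    return after - before
```

A positive pitch slope on the boundary word, for instance, is a typical correlate of a rising nuclear contour in yes/no questions.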

Type of Info        Added features           Cor  Wrong  Missed  Prec   Rec    F      SER
Words               Pitch                    275  53     103     83.8%  72.8%  77.9%  41.3%
                    Energy                   266  54     112     83.1%  70.4%  76.2%  43.9%
                    Pitch, Energy            273  52     105     84.0%  72.2%  77.7%  41.5%
Syllables & phones  Pitch                    269  54     109     83.3%  71.2%  76.7%  43.1%
                    Energy                   269  49     109     84.6%  71.2%  77.3%  41.8%
                    Duration                 268  52     110     83.8%  70.9%  76.8%  42.9%
                    Pitch, Energy, Duration  268  50     110     84.3%  70.9%  77.0%  42.3%
All Combined                                 273  50     105     84.5%  72.2%  77.9%  41.0%

Table 6.6: Recovering the question mark over the LECTRA corpus, using prosodic features.

Type of Info        Added features           Cor  Wrong  Missed  Prec   Rec    F      SER
Words               Pitch                    149  27     266     84.7%  35.9%  50.4%  70.6%
                    Energy                   146  25     269     85.4%  35.2%  49.8%  70.8%
                    Pitch, Energy            147  27     268     84.5%  35.4%  49.9%  71.1%
Syllables & phones  Pitch                    151  27     264     84.8%  36.4%  50.9%  70.1%
                    Energy                   146  24     269     85.9%  35.2%  49.9%  70.6%
                    Duration                 144  28     271     83.7%  34.7%  49.1%  72.0%
                    Pitch, Energy, Duration  147  29     268     83.5%  35.4%  49.7%  71.6%
All Combined                                 146  28     269     83.9%  35.2%  49.6%  71.6%

Table 6.7: Recovering the question mark over the ALERT corpus, using prosodic features.

the most expressive feature turned out to be the f0 slope in the last word of the current boundary and between word transitions (last word of the current boundary and the starting word of the following boundary). As stated by ?, these features are language independent. Language-specific properties in our data are related to different durational patterns at the end of an intonational unit and also to different pitch slopes that may be associated with discourse functions beyond sentence-form types. Summing up, when training only with lexical features, wh- questions are expressively identified, whereas tag questions and y/n questions are quite residual, the exception in the latter case being the bigram acha que (do you think). There are still wh- questions not accounted for, mainly due to very complex structures that are hard to disambiguate automatically. When training with all the features, y/n and tag questions are better identified. We have also verified that prosodic features increase the identification of interrogatives in ALERT spontaneous speech and in the LECTRA corpus, e.g., y/n questions with a request to complete a sentence (e.g., recta das?) or the tag question não é? in the former corpus, and certo? in the latter. Even when all the information is combined, we still have questions that are not well identified, due to the following aspects:

• i) a considerable amount of questions is asked in the transition between newsreader and reporter against a noisy background (such as war scenarios);

• ii) frequent elliptic questions with reduced contexts, e.g., eu? (me?) or José?;


CHAPTER 6. ANALYSIS OF INTERROGATIVES: A CASE-STUDY

• iii) sequences with disfluencies, e.g., como é que se consegue? (how can one manage it?), contrasted with a similar question without disfluencies that was identified: Como é que conseguem isso? (how do you manage that?); • iv) sequences starting with the copulative conjunction e (and) or the adversative conjunction mas (but), which usually do not occur at the absolute start of a sentence; • v) false insertions of question marks in sequences with indirect questions, which are not marked with a question mark; • vi) sequences with more than one consecutive question, randomly chosen, e.g., ... nascem duas perguntas: quem? e porquê? (... two questions arise: who? and why?); • vii) sequences integrating parenthetical comments or vocatives, e.g., Foi acidente mesmo ou atentado, Noé? (Was it an accident or an attack, Noé?).
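The precision, recall, F-measure, and SER figures in Table 6.7 follow directly from the correct, wrong, and missed counts in its first three rows. As a sanity check, a minimal sketch using the values from the first column:

```python
# Derivation of the Table 6.7 metrics from the raw counts of one
# configuration (first column: Cor=149, Wrong=27, Missed=266).
def question_mark_metrics(cor, wrong, missed):
    """Precision, recall, F-measure and slot error rate (SER)."""
    ref = cor + missed                # question marks in the reference
    hyp = cor + wrong                 # question marks hypothesized
    prec = cor / hyp
    rec = cor / ref
    f = 2 * prec * rec / (prec + rec)
    ser = (wrong + missed) / ref      # errors per reference slot
    return prec, rec, f, ser

prec, rec, f, ser = question_mark_metrics(149, 27, 266)
print(round(100 * prec, 1), round(100 * rec, 1),
      round(100 * f, 1), round(100 * ser, 1))  # 84.7 35.9 50.4 70.6
```

Note that SER can exceed 100% when insertions are frequent, which is why it complements precision and recall here.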

6.3 Summary

We analyzed spoken dialogue, university lectures, and broadcast news corpora and, for the sake of comparison, newspaper texts. The availability of different types of corpora with significantly different percentages of interrogatives allowed us to verify that the distribution of interrogative subtypes is also quite distinctive across corpora. The set of rules specifically created for the automatic identification of interrogative subtypes in EP captures their lexical differences fairly well. Nevertheless, we extended our analysis in order to discriminate the problematic structures that were misclassified. For both the university lectures and the map-task corpora, tag questions were the most problematic type of interrogative. Experiments on the automatic detection of interrogatives for European Portuguese using only lexical cues show results that are strongly correlated with the detection of a specific type of interrogative (namely wh- questions). When acoustic and prosodic features (pitch, energy and duration) are added, yes/no and tag questions are increasingly identified, showing the advantages of combining lexical, acoustic, and prosodic features. These experiments also allowed us to identify pitch-related features as the most relevant ones for this task, when used per se. These findings compare well with the literature, e.g., Shriberg et al. (2009). The results obtained with the study of interrogatives were encouraging and motivated further experiments (Batista et al., 2012a) targeting the impact of lexical and prosodic information on the automatic detection of full stop, comma, and question mark for both English and Portuguese. The detection of full stops and commas is performed in a first step, and corresponds to segmenting the speech recognizer output stream. Question marks are detected afterwards, making use of the previously identified segmentation boundaries. Punctuation experiments conducted


by Batista et al. (2012a) focus on the usage of additional information sources and diverse linguistic structures. Two different methods were explored for improving the baseline results for full stop and comma. The first makes use of the punctuation information that can be found in large written corpora. The second consists of introducing prosodic features, in addition to the initial lexical, time-based, and speaker-based features. The linguistic structure in both languages is captured in different ways for distinct punctuation marks: commas are mostly identified by lexical features, while full stops depend mostly on prosodic ones. The most significant gains come from combining all the available features. Although the relatively small number of question marks does not allow us to observe significant differences, there is a small gain in combining all features, both for recognized Portuguese and for English aligned data. To the best of our knowledge, this is the first study for European Portuguese (EP) to quantify the distinct interrogative types and also to discuss the weight of lexical and prosodic properties within these structures, based on planned and spontaneous speech data.


7 Automatic structural metadata classification

Enriching automatic speech transcripts with structural metadata (Liu et al., 2006b; Ostendorf et al., 2008; Jurafsky and Martin, 2009), namely punctuation marks and disfluencies, can greatly improve the legibility of the string of words produced by a recognizer. This is important for so many applications that the speech recognizer often appears integrated in a pipeline that also includes several other modules, such as audio segmentation, capitalization, punctuation, and identification of disfluent regions. The task of enriching speech transcripts can be seen as a way of structuring the string of words into several linguistic units, thus providing multilayered structured information which encompasses different modules of the grammar.

Different sources of information may be useful for this task, going much beyond the lexical cues derived from the speech transcripts, or the acoustic cues provided by the audio segmentation module (e.g., speech/non-speech detection, background conditions classification, speaker diarization, etc.). In fact, one of the most important roles in the identification and evaluation of structured metadata is played by prosodic cues.

The goal of this chapter is to study the impact of prosodic information in revealing structural metadata, addressing at the same time the task of recovering punctuation marks and the task of identifying disfluencies. The former is associated with the segmentation of the string of words into speech acts, and the latter, besides other aspects, also allows the discrimination of potentially ambiguous places for a punctuation mark. Punctuating spontaneous speech is in itself a quite complex task, further complicated by the difficulty of segmenting disfluent sequences and of differentiating between those structural metadata events. Annotators of the corpus used in this study report that these tasks are the hardest to accomplish, a difficulty visible in the evaluation of the manual transcripts, since erroneous punctuation marks assigned to delimit disfluent sequences account for the majority of the errors. Furthermore, prosodic cues either for the assignment of a punctuation mark or for the signaling of a repair may be ambiguous (Batista et al., 2012a; Moniz et al., 2012).


Subset →                              train+dev      test
Time (h)                                  28:00      3:24
number of words + filled pauses          216435     24516
number of disfluencies                     8390       950
disfluencies followed by a repair          5608       720
number of full stops                       8363       861
number of commas                          22957      2612
number of question marks                   3526       498

Table 7.1: Corpus properties and number of metadata events.

7.1 Data and methods

The data used in our experiments is the LECTRA corpus (Trancoso et al., 2008), encompassing all the structural metadata events presented in Table 7.1. One important aspect that characterizes Portuguese punctuation is the high frequency of commas, which in our corpus account for more than 50% of all events. In a previous study (Batista et al., 2012a), where Portuguese and English broadcast news are compared, the percentage of commas in the former is twice that of the latter. As previously stated, our in-house speech recognizer, trained for the broadcast news domain, is totally unsuitable for the university lectures domain. The scarcity of text materials in our language for training language models for this domain motivated the decision to use the ASR in forced alignment mode. For that reason, the current experiments rely on force-aligned transcripts that still contain about 0.9% of unaligned words.
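The comma share quoted above can be checked directly against the train+dev column of Table 7.1, taking "events" to mean disfluencies plus the three punctuation marks (an assumption of this sketch):

```python
# Table 7.1, train+dev: checking that commas account for more than
# half of all punctuation-and-disfluency metadata events.
events = {"disfluencies": 8390, "full stops": 8363,
          "commas": 22957, "question marks": 3526}
total = sum(events.values())          # 43236 events overall
share = events["commas"] / total
print(f"{100 * share:.1f}%")          # 53.1%
```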

7.2 Predicting structural metadata events

Our experiments use a fixed set of purely automatic features, extracted either from the ASR output or from the speech signal itself, as stated in Chapter 5. The features involve two words before the event and one word after the event, and characterize either a word or a sequence of two consecutive words. Features involving a single word include: pitch and energy slopes; ASR confidence score; word duration; number of syllables; and number of phones. Features involving two consecutive words include: pitch and energy slope shapes; pitch and energy differences; comparison of durations and of silences before each word (dur.comp); and ratios for silences, word durations, pitch medians (pmed.ratio), and energy medians (emed.ratio). For example, eslopes:RF(cw,fw) is a shape feature that refers to the energy slope in the current word (cw) and in the following word (fw), which is Rising in cw and Falling in fw; dur.ratio(cw,fw) is a number between 0 and 1 that indicates the proportion of the duration of cw over the duration of cw+fw. Our experiments were performed using the Weka toolkit (Hall et al., 2009), and distinct statistical methods were applied, including Naïve Bayes, Logistic Regression, and Classification and Regression Trees (CART). The best results were consistently achieved using CARTs, closely followed by Logistic Regression. The remainder of this section shows the achieved results and presents an analysis of the most relevant features.

Class            Precision   Recall   F-meas.    SER
comma (,)            60.6     27.6      37.9    90.3
full stop (.)        64.1     67.6      65.8    70.2
question (?)         73.9     29.5      42.2    80.9
repair               60.8     13.1      21.6    95.4
weighted avg.        63.0     32.9      43.3    75.6

Table 7.2: CART classification results for prosodic features.

Classified as        comma (,)   full stop (.)   question (?)   repair   insertions
comma (,)                  718              76             27       51          312
full stop (.)               36             579            225       19           44
question (?)                10              35            147        1            6
repair                      15               3              4       93           38
deletions                 1823             163             95      546

Table 7.3: Confusion matrix between events (rows: classifier output; columns: reference class; the insertions column counts outputs with no reference event, and the deletions row counts reference events left unmarked).
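The word-pair features just described can be sketched as follows. The record fields and the slope-to-shape threshold are hypothetical; the thesis extracts these values from the ASR output and the speech signal:

```python
# A sketch of two of the word-pair features described above, assuming
# hypothetical word records with a duration (s) and a fitted pitch slope.
# The 0.5 slope threshold is illustrative, not the one used in the thesis.
def shape(slope, eps=0.5):
    """Map a fitted slope to R(ising), F(alling) or '-' (plateau)."""
    if slope > eps:
        return "R"
    if slope < -eps:
        return "F"
    return "-"

def pair_features(cw, fw):
    """Features over the current word (cw) and following word (fw)."""
    return {
        "pslopes": shape(cw["pslope"]) + shape(fw["pslope"]),  # e.g. 'RF'
        "dur.ratio": cw["dur"] / (cw["dur"] + fw["dur"]),      # in [0, 1]
        "equals": cw["word"] == fw["word"],                    # identical words
    }

cw = {"word": "ver", "dur": 0.30, "pslope": 1.2}
fw = {"word": "os", "dur": 0.10, "pslope": -0.9}
print(pair_features(cw, fw))
```

For this pair the sketch yields the shape 'RF' and a dur.ratio of 0.75, i.e. the current word takes three quarters of the two-word span.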

7.2.1 Results

Our experiments aim at automatically detecting structural metadata events and at discriminating between those events, using mostly prosodic features (the only exception being the feature that flags two identical contiguous words). We have considered four different classes of structural elements: full stops, commas, question marks, and disfluency repairs. Table 7.2 presents the best results achieved, using the standard metrics precision, recall, F-measure and Slot Error Rate (SER). The best performance is achieved for full stops, confirming our expectation, since prosodic information is known to be crucial for classifying those events in our language. The low results concerning commas are also justifiable: our experiments rely on prosodic features, but commas depend mostly on lexical and syntactic features (Favre et al., 2009). The performance for question marks is mainly related to their lower frequency and to the multiple prosodic patterns found for these structures. Moreover, interrogatives in our language are not commonly produced with subject-auxiliary inversion, as in English, which renders the problem of identifying interrogatives even more challenging. The worst performance, especially affected by a low recall, is achieved for repairs. While prosodic features seem to be strong cues for detecting this class, the confusion matrix presented in Table 7.3 reveals that repairs are still confused with regular words. Table 7.3 also reveals that the most ambiguous class is, without doubt, interrogatives. Our recent experiments, as well as other reported work (Levelt, 1989; Nakatani and Hirschberg, 1994; Shriberg, 1994), suggest that filled pauses and fragments serve as cues for detecting structural regions of a disfluent sequence. Supported by such facts, we have conducted


an additional experiment using filled pauses and fragments as features. These features turned out to be amongst the most informative features, increasing the repair f-measure to 48.8%, and improving the overall f-measure to 47.8%. However, the impact of fragments is lower than the one reported by Nakatani and Hirschberg (1994); Kim et al. (2004) and this may be due to the fact that fragments in our corpus represent only 6.6% of all the disfluent types.
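The per-class scores in Table 7.2 can be recomputed from the confusion matrix in Table 7.3; a minimal check for the comma class:

```python
# Recomputing the comma row of Table 7.2 from the confusion matrix in
# Table 7.3 (rows: classifier output; columns: reference class, with a
# final column counting insertions, i.e. outputs with no reference event).
rows = ["comma", "full stop", "question", "repair"]
out = {  # classifier output -> counts per reference class (+ insertions)
    "comma":     [718, 76, 27, 51, 312],
    "full stop": [36, 579, 225, 19, 44],
    "question":  [10, 35, 147, 1, 6],
    "repair":    [15, 3, 4, 93, 38],
}
deleted = [1823, 163, 95, 546]  # reference events left unmarked

i = rows.index("comma")
correct = out["comma"][i]                          # 718 true commas found
hyp = sum(out["comma"])                            # everything output as comma
ref = sum(out[r][i] for r in rows) + deleted[i]    # all reference commas
prec, rec = correct / hyp, correct / ref
ser = (hyp - correct + ref - correct) / ref        # false alarms + misses
print(round(100 * prec, 1), round(100 * rec, 1), round(100 * ser, 1))
# -> 60.6 27.6 90.3, matching the comma row of Table 7.2
```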

7.2.2 Most salient features

Equivalent experiments performed with Logistic Regression provide a good approximation of the impact of each feature. On a first inspection of Table 7.4, two pairs of structural metadata events seem more prone to mutual ambiguity: on the one hand, full stops and question marks; on the other, repairs and regular words. On closer inspection, however, a set of very informative features stands out as decisive for disambiguating such events, namely pitch and energy shapes, confidence levels of the units of analysis, and durational ratios. Features for the discrimination of a repair comprise: i) two identical contiguous words; ii) both energy and pitch increases in the following word and (mostly) a plateau contour on the preceding word; and iii) a higher confidence level for the following word than for the previous one. This set of features shows that repetitions are being identified, that repair regions are characterized by prosodic contrast marking (increases in pitch and energy) at the disfluency-fluency boundary (as in our previous studies), and also that the repair identification has a high confidence level. As for full stops, the determinant prosodic features correspond to: i) a falling pitch contour in the current word; ii) a plateau energy slope in the current word; iii) the duration ratio between the current and the following words; and iv) a higher confidence level for the current word. This characterization is the one that most resembles neutral statements in our language, with the canonical contour H+L* L%. Question marks are characterized by two main patterns: i) a rising pitch contour in the current word and a rising/rising energy slope between the current and following words; and ii) a plateau pitch contour in the current word and a falling energy slope in the current word.
The rising patterns associated with question marks are not surprising, since they are commonly associated with interrogatives. On the other hand, the fact that interrogatives also exhibit falling pitch contours is not surprising either, since such contours have been ascribed to different types of interrogatives, especially wh- questions, as described in Chapter 6. Commas, as stated in the previous section, are the event characterized by the fewest prosodic features. Being mostly identified by morphosyntactic features, commas are not clearly disambiguated with prosodic features. With regard to regular words, the most salient features are related to the absence of silent

7.2. PREDICTING STRUCTURAL METADATA EVENTS

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Feature pslopes : F − pw,cw pslopes : −− pw,cw pslopes : R− pw,cw con f cw eslopes : RFcw, f w eslopes : −− pw,cw eslopes : F − pw,cw eslopes : R−cw, f w eslopes : R− pw,cw eslopes : RFpw,cw eslopes : FFpw,cw eslopes : RRcw, f w eslopes : − Fpw,cw pslopes : RFcw, f w pslopes : F −cw, f w pslopes : FFcw, f w pslopes : R−cw, f w pslopes : RRcw, f w pslopes : FRcw, f w bsil.ratiocw, f w bsil.comp :>cw, f w emed.ratiocw, f w bsil.ratio pw,cw dur.ratiocw, f w dur.ratio pw,cw emed.ratio pw,cw pslopes : − Fpw,cw pslopes : RFpw,cw pslopes : FFpw,cw pslopes : − Fcw, f w eslopes : − Fcw, f w pslopes : −−cw, f w equals pw,cw pslopes : − Rcw, f w phonescw bsil.comp : pw,cw eslopes : − Rcw, f w eslopes : −−cw, f w pmed.ratio pw,cw eslopes : FRcw, f w pslopes : − R pw,cw eslopes : RR pw,cw eslopes : − R pw,cw eslopes : FR pw,cw eslopes : F −cw, f w pslopes : FR pw,cw pslopes : RR pw,cw equalscw, f w eslopes : FFcw, f w con f f w bsil.comp :=cw, f w bsil.comp := pw,cw dur.comp :>cw, f w dur.comp : pw,cw syls f w

none

,



 

69

.                   

?                   

 

 

repair

                                    

       

  

 



      













               



        

         

                                     

Table 7.4: Top most relevant features, sorted by relevance.


pauses, explained by the fact that, contrary to the other events, regular words within phrases are connected. The presence of a silent pause is a strong cue for the assignment of a structural metadata event.
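The repair cues listed earlier in this section can be read as a single hand-written rule. A hypothetical sketch, with invented field names and thresholds; the thesis ranks these cues via Logistic Regression rather than applying them as a rule:

```python
# The repair cues above as one illustrative rule: identical contiguous
# words, a pitch/energy increase after a plateau, and higher ASR
# confidence on the repair word. Thresholds are made up for the sketch.
def looks_like_repair_onset(pw, fw):
    """pw: word ending the candidate disfluency; fw: first repair word."""
    repeated = pw["word"] == fw["word"]               # identical contiguous words
    contrast = (fw["pslope"] > 0 and fw["eslope"] > 0 # pitch/energy increase
                and abs(pw["pslope"]) < 0.5)          # plateau just before
    confident = fw["conf"] > pw["conf"]               # ASR trusts the repair more
    return repeated or (contrast and confident)

pw = {"word": "são", "pslope": 0.1, "eslope": -0.2, "conf": 0.62}
fw = {"word": "são", "pslope": 1.4, "eslope": 0.8, "conf": 0.91}
print(looks_like_repair_onset(pw, fw))  # repeated word -> True
```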

7.3 Summary

This chapter reported experiments on the full discrimination of structural metadata events in a corpus of university lectures, a domain characterized by a high percentage of structural events, namely punctuation marks and disfluencies. Our previous work on the automatic recovery of punctuation marks indicates that specific punctuation marks display different sets of linguistic features, which motivated the discrimination of the different SU types. Our experiments, based on prosodic features, achieved considerable performance. Moreover, based on a set of complex prosodic features, we were able to point out regular sets of features associated with the discrimination of events (repairs, full stops, and question marks).

8 Disfluencies and their fluent perspective

The aim of this chapter is to validate the assumption that prosodic phrasing is crucial to perform a fluency/disfluency rating task, an assumption that we will try to support both by a perceptual experiment and by Classification and Regression Trees (CART) (Breiman et al., 1984). Our concrete goal is to find out which linguistic features are more salient when we classify all types of disfluencies as either fluent or disfluent phenomena. We want to quantify and progressively find thresholds to differentiate fluency from disfluency, and this task is harder than it seems, since fluency is a complex notion. There are two main perspectives in the literature to describe disfluencies (vide Chapter 2): i) as speech errors that disrupt the ideal delivery of speech, or ii) as fluent linguistic devices used to manage speech. Taking this into account, can we say that all disfluencies behave alike? Which linguistic features (syntactic, prosodic, morphological) play a major role in disfluency rating? Are disfluencies really disfluent when they serve functions of on-line planning, lexical search, and speech structuring? Are disfluencies really linguistic material to be deleted in order to obtain the intended message, as in a scripted version of speech, when they may in fact structure spontaneous speech? In the literature there are clear steps forward in the description of disfluencies as normal spontaneous management of speech, but little is said about their linguistic properties, more specifically their prosodic properties. Our study aims at contributing to this description. First, we describe definitions of fluency to anchor our work; second, we report the results of our perceptual test; and third, to validate these perceptual results, we use CART techniques on an extended corpus.

8.1 Definitions of fluency

The notion of fluency (for a more detailed analysis, vide Koponen (2000) and references therein) covers a wide set of different aspects both in a first language (L1) and in a second language (L2), e.g., oral proficiency, adaptation to different communicative


contexts, mastering linguistic structures in a target language, inter alia. In one of the first studies analyzing fluency (Fillmore, 1979), it is described as a satellite concept with four dimensions: i) the temporal dimension, i.e., keeping the speech flowing; ii) the syntactic and semantic dimension, concerning the coherence and logic of speech; iii) the sociopragmatic dimension, that is, the appropriate use of speech in different communicative contexts; and iv) the creative dimension, i.e., making explicit use of language and exploring its more metaphorical trends. We could state, then, that fluency is the effective way we use language, with respect to all modules of the grammar. As synthesized by Lennon (2000), a "working" definition of fluency might be: The rapid, smooth, accurate, lucid, and efficient translation of thought or communicative intention into language under the temporal constraints of on-line processing. This concept of fluency is applicable in principle to both monolinguals and multilinguals, to native speakers and learners. To what extent this working definition of fluency includes or excludes disfluencies is still unclear. The study by Wennerstrom (2000) pointed out the importance of prosody for the characterization of fluent speech. The author analyzed informal dialogues between native and non-native speakers of English and concluded that the most fluent strategies are related to phrasing and boundary tones. Thus, the more fluent non-native speakers do not interrupt their speech on a word-by-word basis; they respect the cohesion of prosodic constituents and use boundary tones that indicate continuation (e.g., plateaus with filled pauses). In line with Wennerstrom's (2000) findings, we aim at analyzing the influence of phrasing and contour type on fluency/disfluency distinctions.

8.2 Perceptual test

We have reanalyzed a perceptual experiment with 40 subjects regarding fluency/disfluency judgments, as reported in Moniz (2006); Moniz et al. (2007). For this experiment, 30 stimuli were selected from the CPE-FACES corpus (Mata, 1999), with different disfluencies in distinct prosodic contexts, and also with baselines for all the speakers (sentences without disfluencies). The guidelines given for the task were: "The excerpts you are about to hear were extracted from oral school presentations and uttered by four speakers: a teacher and three students. Listening to the stimuli, we can say that there are moments of ease of expression as well as moments without this characteristic. Help us identify both, scoring the excerpts on the scale presented." The following five-point scale was used: (1) very bad; (2) infelicitous; (3) acceptable; (4) good; or (5) very good.


Figure 8.1: Median values of the disfluency scores and standard deviations in the perceptual experiment.

Aiming at a more detailed analysis of the experimental results, we have undertaken a statistical analysis. Figure 8.1 shows the median values scored for each type of disfluency and the standard deviation values. The events with median values ≥ 3.0 are: filled pauses (FP), prolongations (PRL), substitutions (SUB), and deletions (DEL). Participants scored fragments (FRAG), complex disfluencies (VARIA), and more than two repetitions (REPs) under the median value of 3.0, the last two being the most penalized. Above a median value of 3.5, only two types of events clearly emerge: single filled pauses and prolongations (PRL and PRLs, for single and two consecutive prolongations, respectively). The implications of such a partition are quite interesting from a psycholinguistic point of view as well as from a natural speech generation perspective. The events grouped as disfluent pose comprehension difficulties, whereas the most fluent ones, e.g., vocalized elongated material, do not. If we clustered them into classes, we would distinguish the disrupting ones from those that behave as sustained linguistic material. We did not predict, however, that a single REP would be more penalized than a single DEL, based on Fox-Tree (1995a) and also on the validation of the annotator's ratings by two expert linguists. In Fox-Tree (1995a), the author pointed out that, in her experiments, repetitions did not disturb the understanding of the subsequent units while deletions did. We should say, though, that the examples of deletion we used were uttered after a prosodic break 4, had f0 restart, and a plateau contour (H* !H%). This plateau contour is the same found in filled pauses considered fluent, as well as in prolongations. For the time being, we just want to point out that the participants in the perceptual experiment seem sensitive to the prosodic properties of disfluencies.
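As an illustration of the kind of per-type summary behind Figure 8.1, a sketch with invented scores on the same five-point scale (the actual per-stimulus ratings are not reproduced here):

```python
# Per-disfluency-type medians and standard deviations, as plotted in
# Figure 8.1. The ratings below are made up for the sketch.
from statistics import median, stdev

ratings = {
    "FP":   [4, 3, 4, 5, 3],   # filled pauses
    "REPs": [2, 1, 2, 3, 2],   # more than two repetitions
}
for dtype, scores in ratings.items():
    print(dtype, median(scores), round(stdev(scores), 2))
```

With these invented scores, FP lands above the 3.0 fluency threshold and REPs below it, mirroring the partition reported above.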

74

CHAPTER 8. DISFLUENCIES AND THEIR FLUENT PERSPECTIVE

Figure 8.2: Tonal scaling of prolongations, filled pauses and repetitions.

The difference between the scores considered fluent (above 3) and the scores considered disfluent (below 3) was significant (p < 0.05, Mann-Whitney U-test), as was the difference between distinct types of disfluencies. Distinct prosodic contexts were also significant: i) the difference between disfluent events produced at break index 4 (scored as fluent) and the ones uttered within an intonational phrase (scored as disfluent); ii) the difference between disfluent events with plateau contours vs. events with falling contours. The prosodic conditions with the highest median scores are those that match the following properties: disfluencies uttered at break index 4, with f0 restart and plateau contours.
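The Mann-Whitney U statistic used above can be computed from rank sums alone; a from-scratch sketch (statistic only, no p-value), with illustrative scores rather than the thesis data:

```python
# Mann-Whitney U from rank sums, with average ranks for ties.
def mann_whitney_u(xs, ys):
    combined = sorted(xs + ys)
    ranks = {}
    i = 0
    while i < len(combined):
        j = i
        while j < len(combined) and combined[j] == combined[i]:
            j += 1                     # group of tied values
        ranks[combined[i]] = (i + 1 + j) / 2   # mean of ranks i+1 .. j
        i = j
    r1 = sum(ranks[x] for x in xs)     # rank sum of the first sample
    u1 = r1 - len(xs) * (len(xs) + 1) / 2
    return min(u1, len(xs) * len(ys) - u1)

fluent = [4, 5, 3, 4]
disfluent = [2, 1, 3, 2]
print(mann_whitney_u(fluent, disfluent))  # 0.5
```

A small U indicates little overlap between the two score distributions, which is what a significant fluent/disfluent contrast looks like.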

8.3 Discussion

Filled pauses, prolongations and repetitions have been considered by Clark and Wasow (1998) and by Clark and Fox Tree (2002) as associated with planning efforts. In corpora of school presentations and lectures, which are intrinsically associated with clarifying messages and with planning carefully what to say next, these types of disfluencies are thus worth studying in detail. Figure 8.2 shows a schematic representation in semitones (ST) for a subset of disfluencies and their prosodic contexts. For each stimulus, we have plotted the onset, maximum and offset values of the disfluent event (Onset_U, Max_U, Offset_U); the maximum and offset values of the previous constituent (Max_Prev, Offset_Prev); and the onset and maximum values of the subsequent prosodic constituent (Onset_Next, Max_Next). As Figure 8.2 shows, prolongations judged felicitous exhibit f0 rising contours with high sustained boundary tones, typically observed at the end of a prosodic constituent with continuation meaning. Filled pauses also judged fluent are uttered in a tonal space in between the adjacent prosodic constituents, have plateau contours, and behave mostly as parentheticals.
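The Hz-to-semitone conversion behind tonal-scaling plots like Figure 8.2 is the standard logarithmic one; the 100 Hz reference below is a hypothetical choice, not necessarily the one used for these plots:

```python
# Semitones relative to a reference frequency: 12 semitones per octave.
from math import log2

def hz_to_st(f0, ref=100.0):
    return 12 * log2(f0 / ref)

print(hz_to_st(200.0))            # one octave above the reference -> 12.0
print(round(hz_to_st(150.0), 2))  # about 7.02 ST
```

Working in semitones makes tonal scaling comparable across speakers with different pitch ranges, which is why the figure uses ST rather than raw Hz.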

Figure 8.3: Felicitous example: "aa é uma maneira diferente de ver os painéis" (uh it is a different way to see the panels).

When filled pauses are considered infelicitous, they are produced in a lower register with falling contours, disrupting the inter-constituent tonal scaling. As for repetitions, the examples that we tested were prosodically ill-formed and considered disfluent (e.g., repeated lexical and function words), since we did not include emphatic repetitions. The disfluent repetitions behave mostly as disfluent filled pauses, but were preceded by strong melodic breaks. Figures 8.3 and 8.4 show examples of the perceptual test stimuli that were judged felicitous and infelicitous (figures were made using Praat and Pauline Welby's scripts (Welby, 2003), which may draw the f0 contour in unvoiced portions of the signal). Figure 8.3 represents a felicitous example of a filled pause [5:] (aa, "uh") uttered at a break 4 with a plateau contour. This filled pause introduces a new topic of discussion in the teacher's presentation. The filled pause can be described as a hold while the speaker is planning/structuring her next topic. An example judged disfluent is illustrated in Figure 8.4, where the verb [s'5~w~] (são, are) is repeated. As in the first example, the repetition by itself forms a prosodic constituent, in this specific case with a falling contour. The unit disrupts the global f0 contour, and consequently the scaling between peaks of the adjacent constituents.

Figure 8.4: Infelicitous example: "a música o ballet e a dança moderna são são os principais da cultura cubana" (music, ballet, and modern dance are are the principal [aspects] of Cuban culture).

The results of this experiment partially agree with those of O'Shaughnessy (1992) and Shriberg (1999), in that filled pauses have gradual f0 falling contours. However, in our data, they may exhibit rising or plateau contours as well. As pointed out by Shriberg (1999), filled pauses tend to be uttered between the previous peak and the baseline of the speaker. A result also observed in our data is that these events are uttered in a tonal space in between adjacent prosodic constituents. The intonational pattern of neutral declaratives in EP is quite consensual, being represented as (H) H+L* L%. When this pattern is associated with uncommon phrasing options, it promotes judgments of disfluency, such as the ones exemplified in Figure 8.4. Non-falling, continuation-meaning or plateau contours are, otherwise, the ones best suited to managing disfluencies, since they clearly indicate cohesion between units. We could ultimately say that disfluencies may be judged as felicitously integrated if they are adjusted to the adjacent prosodic units and jointly make a cohesive structure. In previous work (Moniz et al., 2007), we pointed out that segmental prolongations do not seem to undergo regular external sandhi processes in EP. One of these processes occurs when the first word ends in the consonant /s/, phonetically realized as [z] when the following word starts with a vowel. In connected speech the regular behavior would be to produce [z] before the vowel, as in [d'u6z 'al~m6S] "duas almas" (two souls; example extracted from Mateus and d'Andrade, 2000). In prolongations of words ending in /s/, a different behavior was observed. As an example, take the adversative conjunction "mas" (but).
When the vowel [1:] is appended as a prolongation, it is often pronounced as [m5Z1:] instead of [m5z1:]. For filled pauses produced within an intonational phrase, these findings seem to hold as well. Examples such as "efeitos especiais aa" (special effects uh), with no silent pause or glottalization interval between the second word and the filled pause, are pronounced with [S] instead of [z] ([if'5jtuz 1Sp1sj'ajS 5:]). The regular external sandhi process is applied in the coarticulation of the first two words, but not between the last one and the filled pause [5:]. The implications of these processes and their relationship to the prosodic structure are still a matter for further study. For the time being, we want to stress that different strategies may be used by speakers: regarding intonational aspects, disfluencies may behave like other prosodic units in similar prosodic contexts, but prolongations and filled pauses have ways of being distinguished from the rest of the segmental material. The phonetic and prosodic cues used by speakers to signal managing strategies at different levels of the prosodic structure may be used to identify disfluencies in automatic speech recognition applications.
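The blocking pattern described above can be summarized as a toy rule: word-final /s/ voices to [z] before a vowel-initial word in connected speech, but before an appended prolongation or filled pause the coda surfaces as palato-alveolar [S]/[Z] instead. Labels are SAMPA-like and purely illustrative:

```python
# Toy formulation of the sandhi-blocking pattern discussed above.
def final_s_realization(next_unit):
    """next_unit: 'vowel-initial word', 'filled pause/prolongation', 'other'."""
    if next_unit == "vowel-initial word":
        return "z"           # regular sandhi: duas almas -> [d'u6z 'al~m6S]
    if next_unit == "filled pause/prolongation":
        return "S/Z"         # blocked: mas= -> [m5Z1:]; especiais aa -> [...'ajS 5:]
    return "S"               # default phrase-final coda realization

print(final_s_realization("vowel-initial word"))         # z
print(final_s_realization("filled pause/prolongation"))  # S/Z
```

The point of the sketch is only that the filled pause patterns with a phrase boundary rather than with a following word, which is exactly the cue proposed above for automatic disfluency identification.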

8.4 CART experiment

For CART experiments we used subsets of the CPE-FACES (Mata, 1999) and LECTRA (Trancoso et al., 2006, 2008) corpora. The subsets were manually annotated in a more detailed way for disfluencies and fluency ratings: 2h, for the high school corpus, and 1.5h for the university one. Fluent/disfluent judgments were added. These judgments were done by the first author of this work and then a randomly selected sample of the first corpus was also annotated by two other expert linguists, in terms of ease of expression, as felicitous or infelicitous. The agreement between the three annotators was of 95%. Our CART experiment was conducted using the SAS software 3 . We started by dividing the annotated data into training, validation and test data (60%, 20% and 20%, respectively). In our data, 56.4% of disfluencies were manually classified as disfluent and the remaining 43.6% as fluent events. The features used were: judgments of fluency/disfluency (as target feature), break indices, f 0 contour, f 0 restart, morphosyntactic information of the adjacent words, morphosyntactic information of the disfluency (including whether it corresponds to a sentence internal chunk or a complete sentence), speaker and speech situation (spontaneous and prepared non-scripted speech). Figure 8.5 is the graphical output of the main leaves. The first split in the tree is on the variable break indices. This variable allows for the distinction between disfluencies uttered within a prosodic constituent (classified most often as infelicitous), and at break indices 3 and 4 (classified mainly as felicitous). Within a constituent, 78.3% of these events are infelicitous, and the remaining 21.7% are classified as fluent devices. The latter (21.7%) are uttered either at the onset of an intonational phrase and with f 0 restart (10.4%), or at the end of a constituent with boundary tones that signal continuation (break 3) or finality (break 4), as in neutral statements in European Portuguese. 
The second split in the tree (f0 contours) shows that events produced at breaks 3 or 4 with plateau or rising contours are mainly considered fluent (90%), vs. the ones uttered in similar positions but with falling contours or with glottalization effects (72%).

Figure 8.5: CART results: D stands for disfluent/infelicitous, and F for fluent/felicitous classification.

In a second experiment, we withdrew the main feature (break indices) and retrained the tree. This would enable us to use mostly features that are more easily extracted in an automatic way. The retrained decision tree pointed out that if the f0 contour is a plateau or a rising contour and the morphosyntactic information accounts for completed chunks, then 88.7% of the events are considered fluent. If the disfluent sequence has f0 restart, then it is fluent in 70.7% of the cases. Without f0 restart and with falling contours, the events are classified as disfluent in 95.3% of the cases. Moreover, with glottalization effects and also with no f0 restart, they are classified as disfluent in 80.0% of the cases.

The test misclassification rate was 29.05% in the first experiment and 32.9% in the second, when accounting for the six most important leaves. In both experiments the classification of disfluencies is above chance level. We would expect the duration feature to be selected in the first splits, as demonstrated by Shriberg (1997), though this was not the case. Only when we consider 12 leaves does duration play a role, but the importance of that feature is not above chance level. In future work we intend to discriminate the classification by type of disfluency and check whether this feature may play a more salient role in the classification of disfluencies.
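The split selection just described can be illustrated in miniature. The thesis used the SAS CART implementation; the sketch below, with made-up toy data and feature names, only shows how a CART-style learner picks its first split by maximizing the reduction in Gini impurity:

```python
# Minimal sketch of CART first-split selection via Gini impurity.
# Toy data and feature names are illustrative, not the thesis dataset.
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, target):
    """Return the (feature, value, gain) binary split with the largest Gini gain."""
    base = gini([r[target] for r in rows])
    best = (None, None, 0.0)
    for feat in (k for k in rows[0] if k != target):
        for val in {r[feat] for r in rows}:
            left = [r[target] for r in rows if r[feat] == val]
            right = [r[target] for r in rows if r[feat] != val]
            if not left or not right:
                continue
            w = len(left) / len(rows)
            gain = base - (w * gini(left) + (1 - w) * gini(right))
            if gain > best[2]:
                best = (feat, val, gain)
    return best

# Toy data mimicking the annotation: events at breaks 3/4 judged fluent,
# events inside a prosodic constituent judged disfluent.
data = [
    {"break_index": "3/4", "f0_contour": "rising", "label": "fluent"},
    {"break_index": "3/4", "f0_contour": "plateau", "label": "fluent"},
    {"break_index": "internal", "f0_contour": "falling", "label": "disfluent"},
    {"break_index": "internal", "f0_contour": "rising", "label": "disfluent"},
    {"break_index": "internal", "f0_contour": "falling", "label": "disfluent"},
    {"break_index": "3/4", "f0_contour": "falling", "label": "fluent"},
]

feat, val, gain = best_split(data, "label")
print(feat, gain)
```

On this toy sample the break-index feature yields the largest impurity reduction, mirroring the first split reported above.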

8.5 Summary

We have reanalyzed a perceptual experiment on a small set of stimuli to test whether listeners would rate all disfluencies as disfluent events or whether some of them would be rated as fluent devices in specific prosodic contexts. Results pointed out significant differences (p < 0.05) between judgments of fluency vs. disfluency. Distinct prosodic properties of these events were also significant (p < 0.05) in their characterization as fluent devices, specifically prosodic phrasing and pitch contour. In an attempt to validate these perceptual results, we have also used CART techniques on an extended corpus of spontaneous and prepared non-scripted speech. CART results pointed out 2 splits: break indices and contour shape. The first split indicates that disfluent events uttered at breaks 3 and 4 are considered felicitous. The second one indicates that these events must have plateau or rising contours to be considered as such; otherwise they are strongly penalized. The results obtained show that there are regular trends in the production of disfluencies, namely prosodic phrasing and contour shape, reinforcing the findings of the perceptual test. Results suggest, in line with findings for other languages, that speakers control different segmental and suprasegmental properties, and they seem to do it, in many cases, in a surgical way, adequately adjusting those properties to the adjacent constituents.

9 Analysis of disfluencies in the LECTRA corpus

In the last decade, the collection and processing of lectures have gained interest, reflected in several projects, such as the Japanese project described in Furui et al. (2001), the European project CHIL (Lamel et al., 2005), and the American iCampus Spoken Lecture Processing project (Glass et al., 2007). The lectures are quite distinct, ranging from very formal seminars to quite informal ones. The common denominator is the application of speech recognition technology for enhancing accessibility for students with disabilities in the classroom domain. The LECTRA corpus (Trancoso et al., 2008) is part of that heritage. A modest subset of the LECTRA corpus was already analyzed in the previous chapter, regarding the fluent component of disfluent events. In this chapter, we will describe the LECTRA properties again regarding disfluencies, but now focusing on extending the data from just 1.30h to a train set of around 31h, moving from manually annotated data to fully automatic prosodic measures, and reporting our data-driven empirical evidence. We aim at analyzing: i) whether there are different prosodic cues for distinct types of disfluencies, and ii) whether there are correlations between the prosodic properties of the disfluencies and those of their adjacent contexts. The answer to these questions will hopefully be a step forward in two directions: contributing to a characterization of the so-called disfluencies and of the fluency repair in European Portuguese (EP), based on empirical evidence supporting linguistic regularities at different levels, and, consequently, building predictive models based on regular trends in the prosodic behavior of the disfluencies and of their contexts.

9.1 Data and methods

This work uses the extended version of the LECTRA corpus (Trancoso et al., 2006, 2008). Our in-house speech recognition system (Neto et al., 2008) was used to produce the forced-aligned transcription. The reference data was then aligned with the recognizer output using the NIST SCLite tool. The corpus was automatically annotated with part-of-speech information using Falaposta (Batista et al., 2012a). Table 9.1 presents the overall characteristics of the training subset. The total alignment error for this subset is 1.0% (1.3% and 1.6% for the development and test sets, respectively), lower than the one reported by Hazen (2006) for the same domain, although the amount of data in the latter study is significantly larger. The alignment error of the corpus ranges from 0.0% to a maximum of 6.4%, and it depends on the speaker and on specific lectures. The alignment errors are higher for speakers S2 and S4 due to the low volume of the recordings. Whereas in the first case the lecturer frequently spoke too low, in the second case the low volume was mainly caused by very frequent head turns to look at the slides. For most speakers there are also many interactions with students, which have not been transcribed, but which the recognizer also tries to align. For the other speakers, the main alignment errors may be attributed to computer jargon, acronyms, a high frequency of anglicisms, and a variety of fillers, such as the production of several linguistic structures in their weak forms: tag questions (mostly não é?/isn't it?, pronounced as [n'E] instead of the strong form [n'5~w~ 'E]), discourse markers (mostly portanto/so, pronounced as [pt'5~t] or even [t'5~t] instead of [purt'5~tu]), etc. These experiments clearly point to the need for updating the pronunciation lexica with many more spontaneous speech pronunciations.

Speaker               S1     S2     S3     S4     S5     S6     S7     Total
time (h)              3.22   2.50   4.13   2.20   1.17   5.18   5.06   24.28
#fluent words         19774  16158  28547  14883  11351  45519  40621  176853
#sentences            1377   730    1805   739    362    4016   1547   10576
#disfluent sequences  618    554    573    613    663    1376   2985   7382
#disfluent words      1283   1104   1359   1610   959    2545   5497   14357
mean disf sequences   1.47   1.80   1.52   2.01   2.33   1.34   2.83   1.95
mean disf words       3.05   3.58   3.61   5.28   3.36   2.47   5.22   3.80
%disf words           0.67   0.58   0.71   0.84   0.50   1.33   2.87   7.51
%disf                 6.5    6.8    4.8    10.8   8.4    5.6    13.5   8.10
words/bd              32.54  29.96  52.26  24.58  16.85  33.33  13.98  30.63
time/bd               16.52  14.09  19.93  8.44   6.09   9.29   4.81   11.75
useful time/bd        8.28   9.24   14.94  5.72   5.25   6.82   3.91   8.07
alignment error       0.1    2.2    0.7    3.2    0.0    0.5    0.35   1.0

Table 9.1: Overall characteristics of the LECTRA training subset. Total time per speaker; number of words, sentence-like units (#SU), and disfluent words and sequences; mean of disfluent sequences and disfluent words per sentence; percentage of disfluent sequences and words; number of words, total and useful time (in seconds) between disfluencies.

9.2 Rate of disfluencies per speaker

The percentages of disfluencies are accounted for in two different ways: i) in a relative way, considering the number of disfluent words of a given speaker divided by the total number of words in the corpus (the overall sum is 7.51%); and ii) in an absolute way, accounting for the number of disfluent words of a given speaker divided by the total number of words uttered by the same speaker (the corresponding mean for the corpus is 8.10%). As Table 9.1 shows, the total percentages of disfluencies, 7.51% and 8.10% (lines 9 and 10, respectively), are in line with the findings of Shriberg (2001), who reported an interval of 5% to 10% in human-human conversations. Means of disfluent sequences per sentence as well as of disfluent words inside those sequences are also given. Thus, there is an average of almost one sequence per sentence and one word per sequence (lines 7 and 8, respectively). However, all the analyzed variables vary considerably per speaker. Speakers 5 and 7 produce 2 to 3 disfluent sequences per sentence, with a mean of 5 words within those sequences.

With a manual correction of the dataset, the original number of disfluencies (7382) reported in Table 9.1 was reduced to 7074 (308 cases were removed), as Table 9.2 shows. This correction was related to five aspects: (i) examples of very short filled pauses without reliable pitch values, or even with no pitch detection, that were not correctly aligned; (ii) examples of automatically identified disfluencies encompassing the disfluent events as well as long stretches of fluent speech; (iii) examples of prolongations that cannot be removed, otherwise the sentence meaning would be incomplete; (iv) examples of fluent stretches of speech that are aligned as disfluencies; (v) a few examples of human errors due to not closing the angular brackets used for disfluency marking.
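The two normalizations above amount to dividing by different denominators; a small sketch, with invented counts rather than the corpus figures:

```python
# Sketch of the two disfluency-rate computations described above.
# The counts below are invented for illustration, not corpus values.

def disfluency_rates(disf_words, total_words):
    """Both arguments map speaker -> word counts."""
    corpus_total = sum(total_words.values())
    # relative: speaker's disfluent words over ALL words in the corpus
    relative = {s: 100.0 * d / corpus_total for s, d in disf_words.items()}
    # absolute: speaker's disfluent words over the speaker's OWN words
    absolute = {s: 100.0 * d / total_words[s] for s, d in disf_words.items()}
    return relative, absolute

rel, ab = disfluency_rates({"S1": 128, "S2": 55}, {"S1": 2000, "S2": 1100})
print(round(rel["S1"], 2), round(ab["S1"], 2))  # 4.13 6.4
```

Summing the relative rates over all speakers gives the corpus-wide percentage, whereas the absolute rate characterizes each speaker individually.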
Two or more consecutive disfluencies of the same nature were discriminated from a single one, in order to verify whether their prosodic behavior would be distinct from their single counterparts and also to check for speaker variation effects. This resulted in 12 categories of disfluencies, as shown in Table 9.2. Notice that a complex disfluency is composed of two or more different categories. A single filled pause and a complex sequence are the most frequent types. This result partially agrees with the majority of the studies on this topic, which report that filled pauses are the most frequent disfluent type. However, to the best of our knowledge, complex sequences are not described as being almost equally representative (with a difference of just 2.5%). The percentage of disfluencies is distributed almost equally among speakers S1 to S5, around 8-9%. Although speaker S5 only speaks for 1:37h and is the only teacher targeting an internet audience, his production of disfluencies is roughly the same as that of speakers S1 to S4, due to a high percentage of filled pauses. As for speakers 6 and 7, they spend equivalent speaking time; however, the latter utters 40% of all the disfluencies in the corpus, mostly filled pauses and complex sequences of disfluencies, whereas the former produces 18%, with a more balanced distribution by disfluency type.


Type                 S1   S2   S3   S4   S5   S6    S7    Total #  Total %
Complex              180  178  189  179  107  350   874   2057     29.1
Deletion             9    16   25   21   4    52    37    164      2.3
Deletions            19   23   46   68   1    67    41    265      3.7
Filled pause         157  82   70   84   390  270   1181  2234     31.6
Filled pauses        2    3    5    8    5    4     82    109      1.5
Fragment             33   67   48   31   10   60    191   440      6.2
Fragments            1    1    4    1    1    7     14    29       0.4
Repetition           71   77   60   79   48   245   200   780      11.0
Repetitions          46   29   29   82   11   88    71    356      5.0
Substitution         34   63   74   35   30   128   122   486      6.9
Substitutions        18   13   20   24   7    32    39    153      2.2
Editing expressions  0    0    0    0    0    0     1     1        0.0
Total #              570  552  570  612  614  1303  2853  7074
Total %              8.1  7.8  8.1  8.7  8.7   18.4  40.3

Table 9.2: Distribution of disfluency types per speaker (“S”).

9.3 Rate of disfluencies per lecture and per speaker

Figures 9.1, 9.2, and 9.3 show the means of total and useful time (measured in seconds) between disfluencies per lecture across speakers, the mean of words uttered between disfluencies, and the total number of fluent/disfluent words, respectively. For the overall discriminative results, vide Appendix A. Results show that all measures are subject to speaker and lecture variations. For instance, the average of words uttered between disfluent events (“/bd”) ranges from a maximum of 59.31 words for S3 to a minimum of 12.35 words for S7. Systematically, those two speakers contrast in the number of words uttered between disfluent events and, as expected, they maintain the same tendencies regarding the time spent speaking and the actual useful time used. When accounting for all the speakers, there are significant differences with p < 0.001 regarding all measures: words/bd ( H (6) = 26.783), time/bd ( H (6) = 27.463), and useful time/bd ( H (6) = 28.174). Even when S3, the most different speaker and the only female in the set, is removed from the grouping variable, those differences still stand: words/bd ( H (5) = 19.413), time/bd ( H (5) = 21.871), and useful time/bd ( H (5) = 21.536). However, when analyzing exclusively speakers 1, 2 and 4 there are no significant differences in all the measures: words/bd ( H (2) = 4.331, p = 0.115), time/bd ( H (2) = 5.076, p = 0.079), and useful time/bd ( H (2) = 5.079, p = 0.079). The same applies to speakers 5 and 7 regarding words/bd ( H (1) = 3.267, p = 0.071).
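For reference, the Kruskal-Wallis H statistic used throughout this section has a simple rank-based form; the sketch below is a plain implementation (not the statistics package actually used, and without the tie-correction factor), applied to toy groups:

```python
# Minimal Kruskal-Wallis H statistic: H = 12/(N(N+1)) * sum(R_i^2/n_i) - 3(N+1),
# where R_i is the rank sum of group i. Ties receive average ranks; the
# tie-correction factor applied by full statistics packages is omitted here.

def kruskal_h(*groups):
    pooled = sorted(x for g in groups for x in g)
    n = len(pooled)
    # map each value to the average of the ranks it occupies (handles ties)
    ranks = {}
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        ranks[pooled[i]] = (i + 1 + j) / 2.0  # average of ranks i+1..j
        i = j
    total = 0.0
    for g in groups:
        r = sum(ranks[x] for x in g)
        total += r * r / len(g)
    return 12.0 / (n * (n + 1)) * total - 3 * (n + 1)

print(kruskal_h([1, 2, 3], [4, 5, 6], [7, 8, 9]))  # 7.2 for fully separated groups
```

With seven speakers as groups, the statistic has 6 degrees of freedom, hence the H(6) values reported above.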

Figure 9.1: Total time and useful time (in seconds) between disfluencies (/bd), per lecture and speaker.

A lecture may be characterized by the following features (vide Appendix A): it has, on average, 5668.64 words and 473.73 disfluent words; 30.63 words are produced between disfluencies; a disfluency is produced every 11.75 seconds, or every 8.07 seconds if silent pauses are not considered.

9.4 Rate of disfluencies per sentence

In the previous section, results showed that there are differences in the distribution of disfluencies across lectures and speakers. The goal of the present section is to analyze the distribution of disfluencies per sentence. The analysis accounted for several variables measured per sentence and per speaker: number of words, syllables, and phones within (dis)fluent sentences and also within disfluent sequences; duration of (dis)fluent sentences and of disfluent sequences with and without internal silences, as Table 9.3 shows. The measures were extracted to compare fluent and disfluent sentences per speaker and also the behavior of disfluent sequences. The overall characterization of a fluent sentence corresponds to an average of 9.96 words, encompassing 18.43 syllables and 38.67 phones per sentence. When analyzing sentences with disfluencies, the average is 28.87 words, 56.75 syllables, and 120.57 phones, with 1.95 disfluent sequences, totalling 3.80 words, 5.41 syllables, and 9.55 phones. From the comparison of both structures, the most immediate outcome is that sentences with disfluencies are uttered with more words than fluent sentences, with significant differences (p < 0.001). The means of words per sentence are reflected in the duration of fluent and disfluent sentences, the latter being lengthier. Thus, the tempo characteristics of a disfluent sentence correspond to a mean duration of 10.66 seconds (6.10 for fluent sentences), and to 7.79 (4.32) seconds when silent pauses are not included. The average time spent uttering disfluencies per sentence is 1.30 seconds.

Figure 9.2: Mean of words uttered between disfluencies (/bd), per lecture and speaker.

A question for further analysis is the robustness of these overall aspects relative to speaker variation. Statistical analysis shows that speaker variation is once more reflected at the sentence level, not only at the lecture level. Thus, results show significant differences with p < 0.001 in all the measures analyzed: #fluent words within fluent SUs (H(6) = 223.699), #fluent syllables within fluent SUs (H(6) = 230.056), #fluent phones within fluent SUs (H(6) = 231.827), #fluent words within disfluent SUs (H(6) = 367.095), #fluent syllables within disfluent SUs (H(6) = 479.716), #fluent phones within disfluent SUs (H(6) = 503.519), #disfluent words within sequences (H(6) = 344.322), #disfluent syllables within sequences (H(6) = 309.507), #disfluent phones within sequences (H(6) = 286.568), duration of SUs with internal silences (H(6) = 495.544), duration of SUs without internal silences (H(6) = 476.220), duration of disfluent SUs with internal silences (H(6) = 516.969), duration of disfluent SUs without internal silences (H(6) = 688.294), and duration of disfluencies (H(6) = 558.504). Speaker 5 presents the highest values for the majority of features, whereas speaker 6 presents the lowest. The patterns of both speakers are in line with their performances in class, meaning that speaker 5 is teaching for an internet audience, whereas speaker 6 is often in dialogue with his students. As for disfluent words and disfluency duration, again speaker 6 exhibits the lowest values, while speaker 7 has the highest disfluency duration and shares with speaker 4 the highest average of disfluent words.

Figure 9.3: Total words and disfluent words, per lecture and speaker.

9.5 Patterns in the reparandum

Describing patterns in the reparandum is a way to inform statistical methods of the most relevant structures when uttering disfluencies. Several patterns may be described in the reparandum. For instance, when looking only at sequences of items of the same category, two or more consecutive deletions are the only sequences more frequent than their single counterpart. Figure 9.4 shows the production of events from the same category. The production of all the disfluent categories is concentrated around 2 to 4 items; however, it may sparsely reach a maximum of 14 items in sequence in the specific case of deletions (since there were just two cases, they are not shown in the figure). Notice that deletions and repetitions are the categories more prone to exhibit a higher number of items in sequence, a behavior that is clearly distinct from the other disfluency types. While sequences of repetitions may be relatively easy to identify by contrasting the repair with the reparandum, sequences of deletions are not, since they rely mostly on syntactic and semantic structures.

                                 S1     S2     S3     S4     S5     S6     S7     Total
Fluent SUs
#words                           10.36  14.70  10.95  10.45  21.25  8.50   8.88   9.96
#syllables                       18.30  27.64  20.50  18.71  45.00  15.61  17.46  18.43
#phones                          38.46  58.16  43.70  38.30  96.19  32.37  37.28  38.67
duration                         7.23   10.37  5.94   6.97   11.30  3.18   8.93   6.10
duration without silences        3.65   6.83   4.45   4.70   9.76   2.32   7.34   4.32
Disfluent SUs
#fluent words within             23.47  32.32  34.30  33.93  34.09  19.57  34.35  28.87
#fluent syllables within         42.96  62.87  66.01  62.96  73.69  36.67  70.52  56.75
#fluent phones within            91.56  132.34 142.33 130.01 159.21 76.37  151.20 120.57
#disfluent sequences             1.47   1.80   1.52   2.01   2.33   1.34   2.83   1.95
#disfluent words within seq      3.05   3.58   3.61   5.28   3.36   2.47   5.22   3.80
#disfluent syllables within seq  4.42   5.33   5.63   7.91   4.35   3.44   7.25   5.41
#disfluent phones within seq     8.10   9.95   11.00  14.52  6.80   6.03   12.27  9.55
duration                         11.81  15.40  12.83  11.81  12.42  5.90   11.90  10.66
duration without silences        6.13   10.16  9.83   8.05   10.72  4.17   9.72   7.79
duration of disfluencies         1.16   1.31   1.13   1.53   1.01   0.63   1.99   1.30

Table 9.3: Means of words, syllables, and phones within (dis)fluent sentences and disfluent sequences (“seq”); duration of disfluencies and of (dis)fluent sentences (in seconds), with and without internal silences, per speaker.

Figure 9.4: Number of events of the same category in sequence. On the vertical axis, the total number of events; on the horizontal axis, the number of events in sequence.

When analyzing the possible patterns of complex sequences and verifying that at least one item of the complex sequence is of the category X, repetitions and substitutions account for 56% of all the complex sequences, as Figure 9.5 shows. Sequences of repetitions and substitutions are mainly used for precise lexical search, involving jargon, words of foreign origin, and precision when translating anglicisms into Portuguese. The percentages of fragments and filled pauses are also representative, whereas deletions and editing expressions are quite residual.

Figure 9.5: Distribution of disfluencies in the reparandum. At least one event is of the category X in a complex sequence of disfluencies, on the left, and in all the dataset, on the right.

Figure 9.6: Distribution of disfluencies in the reparandum per speaker. At least one event is of the category X in a complex sequence of disfluencies, on the left, and in all the dataset, on the right.

Figure 9.6 illustrates the distribution of those events per speaker, evidencing that S5 is the only speaker who produces more filled pauses than other categories; all the remaining speakers produce more repetitions. Notice that S5 is the only speaker with an internet audience, and the selection of filled pauses may be due to not having face-to-face interactions with interlocutors. Thus, regarding the classification of speakers as repeaters or deleters, as reported in Shriberg (1994), there is an obvious preference for repetitions rather than deletions in the corpus. This preference can be related to the mechanisms of engaging attention when using repetitions (Fox-Tree, 1995b; Clark and Wasow, 1998). If we account for the production of a given item of the category X in the entire dataset, then filled pauses are more representative. Repetitions, substitutions, and fragments are an important sample to consider as well, as Figure 9.6 illustrates.

9.6 Prosodic analysis

The features described in Chapter 5 were used to perform a prosodic analysis of disfluencies and of their adjacent contexts. To summarize the main features used: pitch and energy were extracted using the Snack Sound Toolkit (http://www.speech.kth.se/snack/); durations of phones, words, and inter-word pauses were extracted from the recognizer output. A set of syllabification rules was designed for Portuguese and applied to the lexicon. Features were calculated for the disfluent sequence itself and also for the two contiguous words, before and after the disfluent sequence. The following set of features was used for each word in those regions: f0 and energy raw and normalized mean, median, maxima, minima, and standard deviation, as well as POS, number of phones, and durations. Energy and f0 slopes within the words were calculated based on linear regression.
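The slope features can be computed with an ordinary least-squares fit over the f0 or energy samples of each word; a minimal sketch with invented frame values:

```python
# Least-squares slope of a contour within a word, as used for the
# f0 and energy slope features. Frame times/values are illustrative.

def slope(times, values):
    """Slope of values over times (e.g. semitones or dB per second)."""
    n = len(times)
    mt = sum(times) / n
    mv = sum(values) / n
    num = sum((t - mt) * (v - mv) for t, v in zip(times, values))
    den = sum((t - mt) ** 2 for t in times)
    return num / den

# Rising contour sampled every 10 ms: +2 st over 40 ms -> about 50 st/s
print(slope([0.00, 0.01, 0.02, 0.03, 0.04], [0.0, 0.5, 1.0, 1.5, 2.0]))
```

A positive slope in the word after the disfluency corresponds to the rising contours discussed in the following subsections.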

9.6.1 Overall prosodic characterization

We first analyzed whether there would be an overall tendency for both f0 and energy resets in the repair region, when all the speakers and all the different types of disfluencies are accounted for. As Figure 9.7 shows, there are, in fact, pitch and energy increases from the disfluency region (“disf”, or reparandum) to the repair of fluency (“disf+1”). We should add that disfluencies are followed by silent pauses 99% of the time in our data. Because our data set is nonparametric, we tested our hypotheses with a Kruskal-Wallis test. Results show significant differences with p < 0.001 in the “disf-1”, “disf”, and “disf+1” pitch slopes (H(11) = 54.566, H(11) = 361.540, and H(11) = 47.358, respectively) and energy slopes (H(11) = 139.353, H(11) = 150.564, and H(11) = 260.691, respectively) within a word, as well as in the differences of pitch and energy among those regions (pitch and energy differences between “disf-1” and “disf”: H(11) = 429.046 and H(11) = 325.574; between “disf” and “disf+1”: H(11) = 833.248 and H(11) = 315.025, respectively). Thus, pitch and energy slopes are significantly different within the words immediately before and after the disfluencies (but not before and after that), meaning that contrasts are marked within relatively small contexts, possibly helping the listener to process useful cues in a shorter memory interval. These results have interesting implications for syntactic-prosodic mapping theories, supporting the view that a prosodic reset is an informative utterance suprasegmental planning cue (Levelt, 1989).


Figure 9.7: Pitch and energy slopes inside the disfluency (disf), the word before (disf-1), and the word after (disf+1); and differences between such units based on the average.

9.6.2 Speaker and type of disfluency

Pitch and energy increase from the disfluency to the repair region, independently of the speaker and for the majority of the disfluent types (with the exception of sequences of repetitions and of deletions), as Figures 9.8 and 9.9 show. There are, however, degrees in the pitch reset of the next unit. The highest pitch reset occurs after a filled pause or a sequence of filled pauses (more than 2 ST), significantly different (p < 0.001) from all the other disfluency types. This is, in fact, the disfluency whose subsequent prosodic context most resembles a full stop. Although filled pauses are the events that contribute the most to the disfluency/repair pitch increase, even without them pitch and energy resets are still significantly different (H(9) = 55.130 with p < 0.001, and H(9) = 178.235 with p < 0.001, respectively). We know that for EP (Moniz, 2006), as for other languages, filled pauses tend to occur mainly at major intonational boundaries, so pitch and energy resets in the subsequent units are not that surprising. The second highest pitch reset occurs after a single deletion. Again, these findings are related to the fact that the unit after a deletion, as refreshed linguistic material, is more prone to exhibit an f0 reset, which is an expected property at the beginning of a major intonational unit.
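The pitch resets reported in semitones (ST) are relative measures: the distance between two f0 values in Hz is 12·log2(f1/f0) ST, so a reset of a given size means the same perceptual step for low- and high-pitched speakers. A small sketch:

```python
# Hz-to-semitone difference, the unit used for the pitch resets above.
import math

def st_diff(f0_hz, f1_hz):
    """Signed distance from f0 to f1 in semitones."""
    return 12.0 * math.log2(f1_hz / f0_hz)

# e.g. a reset from 110 Hz to 124 Hz is about +2.07 ST,
# i.e. above the 2 ST threshold mentioned for filled pauses
print(round(st_diff(110.0, 124.0), 2))
```

An octave (doubling of f0) corresponds to exactly 12 ST, and a negative value indicates a fall rather than a reset.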


Figure 9.8: Difference between the disfluency, the previous, and the following word pitch average, per type and speaker.

As for energy, deletions and repetitions are significantly different (p < 0.001) from all the remaining types, with the highest energy slope within the repair. It is worth noting that the energy increases from the disfluency to the repair with sequences of repetitions and of deletions are not significantly different from each other (U = 38630.0, with p = 0.062). Even without repetitions and deletions, pitch and energy resets are again still significantly different (H(7) = 629.876 and H(7) = 262.442, respectively).

Additionally, the prosodic contrast strategy does not apply exclusively to error correction categories (substitutions, deletions, fragments, and complex sequences). Substitutions, e.g., when compared with other types, show similar significant pitch/energy increase differences on the onset of the repair, or even on the slope within the repair. Thus, results do not support the use of a contrast strategy exclusively on error corrections (Levelt and Cutler, 1983). There is a more general tendency towards a contrast marking strategy, regardless of the specific disfluency type.

Figure 9.9: Energy slopes (dB) inside the previous word, the disfluency, and the following word, per type and speaker.

9.6.3 Tempo characteristics

As for the tempo analysis, the averages of the different regions are represented in Figure 9.10. The disfluency is the longest event (653 ms); the silent pause between the disfluency and the following word is longer on average (345 ms) than the previous one (270 ms); and the “disf+1” word (271 ms) roughly equals the silent pause before a disfluency, these being the shortest events.

Figure 9.10: Duration of all the events in ms (disf-1: 434 ms; silent pause: 270 ms; disf: 653 ms; silent pause: 345 ms; disf+1: 271 ms).

Tempo patterns exhibit significant differences (p < 0.001) per speaker and per disfluency type in the units “disf-1”, “silent pause before”, “disf”, “silent pause after”, and “disf+1” (per speaker: H(6) = 514.752, H(6) = 286.032, H(6) = 334.792, H(6) = 883.652, and H(6) = 511.590; per disfluency type: H(11) = 880.179, H(11) = 874.084, H(11) = 2510.487, H(11) = 243.516, and H(11) = 949.304, respectively). As Figure 9.11 shows, and as expected, sequences of more than one event are lengthier than single events. The longest disfluency is a complex sequence of disfluencies and the shortest a fragment. Furthermore, there is a general tendency to produce lengthy silent pauses after a disfluency. However, there is a strikingly different pattern concerning the production of filled pauses, i.e., the previous silent pause is longer (423 ms) than the one after (262 ms). When two or more filled pauses occur, the adjacent silent pauses are exactly the same (173 ms).

Figure 9.11: Duration of all events per disfluency type.

When measuring or assessing fluency, the articulation and speech rates, as well as the phonation ratio, are of crucial importance. These measures were calculated based on Grosjean (1980) and on Cucchiarini et al. (2002). In the latter the units targeted are phones, whereas in the former they are syllables; in the present study both measures are given. Thus, the articulation rate corresponds to the number of phones or syllables divided by the duration of speech without utterance-internal silences. The speech rate is based on the number of phones or syllables divided by the duration of speech including utterance-internal silences. As for the phonation ratio, it corresponds to 100% times the duration of speech without utterance-internal silences divided by the duration of speech including utterance-internal silences. These measures were also computed excluding and including disfluent sequences, rather than counting the number of filled pauses or the number of other disfluencies per minute (Cucchiarini et al., 2002). Table 9.4 presents the calculated rates per speaker and the overall averages. The ratios per speaker are quite distinct. The articulation and speech rates, either with fluent or disfluent sentences, are higher for speaker 6 (21.18 phones and 10.50 syllables per second for the articulation rate; 18.38 and 9.09 per second for the speech rate) and lower for speaker 2 (14 and 10; 13 and 10, respectively). The highest phonation ratios are from speaker 5 (following the order presented in the table, 89.53, 88.32, and 89.16) and the lowest from speaker 1 (72.17, 65.41, and 68.61).
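The three definitions above reduce to simple ratios per sentence-like unit; a sketch with invented durations and counts (the real computation is additionally run excluding and including the disfluent stretches, as described):

```python
# Sketch of the fluency measures defined above, for one sentence-like unit.
# Durations are in seconds; the counts and values are illustrative.

def fluency_measures(n_units, dur_with_silences, dur_without_silences):
    """n_units: number of phones (Cucchiarini et al.) or syllables (Grosjean)."""
    articulation_rate = n_units / dur_without_silences
    speech_rate = n_units / dur_with_silences
    phonation_ratio = 100.0 * dur_without_silences / dur_with_silences
    return articulation_rate, speech_rate, phonation_ratio

art, sp, phon = fluency_measures(80, 5.0, 4.0)
print(art, sp, phon)  # 20.0 16.0 80.0
```

Note that the articulation rate is always at least the speech rate, and the two coincide only when the unit contains no internal silences (phonation ratio of 100%).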

Measure                            S1     S2     S3     S4     S5     S6     S7     Total
Fluent SUs
  articulation rate (phone)       18.35  13.95  15.60  18.01  16.09  21.18  16.28  18.54
  speech rate (phone)             14.04  10.94  13.12  14.78  14.47  18.38  14.72  15.66
  phonation ratio                 72.17  77.68  83.29  80.94  89.53  90.41  83.04  83.04
  articulation rate (syllable)     8.97   6.83   7.84   9.27   7.57  10.50   7.96   9.20
  speech rate (syllable)           6.89   5.36   6.64   7.67   6.81   9.09   7.15   7.77
Disfl SUs, without disfluencies
  articulation rate (phone)       15.83  13.23  15.46  16.64  14.96  18.81  15.71  16.36
  speech rate (phone)             10.71   9.35  12.66  12.74  13.23  15.17  13.72  13.20
  phonation ratio                 65.41  70.63  81.10  75.87  88.32  80.34  87.13  80.08
  articulation rate (syllable)     7.43   6.34   7.56   8.13   6.95   9.17   7.37   7.83
  speech rate (syllable)           5.03   4.49   6.23   6.22   6.14   7.38   6.44   6.32
Disfl SUs, with disfluencies
  articulation rate (phone)       14.44  12.56  14.55  15.39  14.22  17.26  14.07  15.03
  speech rate (phone)             10.13   9.15  12.11  12.16  12.68  14.25  12.52  12.39
  phonation ratio                 68.61  73.01  83.01  78.73  89.16  82.56  89.00  82.24
  articulation rate (syllable)     6.90   6.05   6.95   7.64   6.72   8.56   6.75   7.29
  speech rate (syllable)           4.84   4.41   5.79   6.05   6.00   7.08   6.01   6.02

Table 9.4: Ratios per speaker, where “S” stands for speaker. The rates for sentences with disfluencies are given both including and excluding the disfluent sequences.

Based on the prosodic parameters analyzed, there are degrees in mastering all the features. Thus, the acoustic correlates of the most proficient speaker (S6) are expressed by means of: (i) the highest energy slope within the repair; (ii) a considerable pitch increase, also in the repair; (iii) the smallest disfluency duration; and (iv) the highest articulation and speech rates. The fact that S6 has the smallest duration of speech with and without internal silences is mainly related to the rich dynamics of the interactions with the class. Despite being a theoretical course, the time spent in asking the students to discuss concepts and to give examples of them is substantial. It is interesting to note that, when asked to classify the speakers regarding “likeability”, our three annotators were unanimous in stating that speaker 6 is the most “likeable” one. The prosodic correlates of this naive classification may be linked to several distinct features, namely the highest energy slope within the repair and also a considerable pitch increase, correlates which have been frequently associated with fluency and with higher-level strategies of language use.

9.7

Summary

The contributions of this work are threefold: firstly, pitch and energy slopes are significantly different within the units immediately before and after the disfluent sequences; secondly, pitch and energy increase from the disfluency to the repair region for the majority of the disfluency types, although there are distinct degrees in the contrast made by certain ones (namely, filled pauses, deletions, and repetitions); and thirdly, those features are constant in the production of all speakers. These results can be discussed from different perspectives. Regarding the first contribution, the speaker signals the different regions in the most economical way, using just a word in the contexts adjacent to the disfluency, possibly helping the listener to process useful cues in a shorter memory interval. As for the second, there are different contrastive degrees both in terms of pitch and energy (e.g., filled pauses are the most distinct type with respect to pitch increase, and repetitions with respect to energy rising patterns). It seems, thus, that different prosodic parameters are combined to different degrees for different functional purposes. Finally, when repairing fluency, speakers overall produce both pitch and energy increases. Our analysis favors a tendency towards prosodic contrast strategies between the different regions. Thus, we could say that speakers signal different cues in sequences containing disfluencies by means of contrast.

10

Analysis of disfluencies in the CORAL corpus

With the goal of adding another domain to compare with the university lectures, CORAL, a map-task corpus, will be analyzed in the present chapter. In order to compare the same variables between both corpora, the analysis conducted in the previous chapter will be replicated here. The selection of the CORAL corpus was based on several criteria: (i) it has exactly the same annotation schema as the LECTRA corpus (the other corpus with the same schema is broadcast news, but in that domain disfluencies occur less often and the majority of the disfluent data concerns filled pauses); (ii) the corpus has a suitable sample of spontaneous dialogues (around 9.42h); (iii) in a continuum of prepared to fully spontaneous speech (Blaauw, 1995), the university corpus represents prepared non-scripted speech with stretches of spontaneous, even informal, moments, while the map-task corpus corresponds to fully spontaneous dialogues. As stated in Chapter 9, the analysis will be conducted in two main directions: the first accounts for disfluency distribution and variation factors, and the second describes the prosodic properties of disfluencies and of their contexts. Specifically, the overall distribution of disfluencies and the possible variation per speaker, per dialogue, and per sentence will be targeted, in order to establish overall tendencies and also to report idiosyncratic traits. Then, a prosodic analysis will be conducted to check whether there is an overall tendency to select prosodic strategies related to contrast rather than parallelism: i) whether there are different prosodic cues for distinct types of disfluencies and ii) whether there are correlations between the prosodic properties of the disfluencies and those of their adjacent contexts.

10.1

Data and methods

The corpus used for the experiments described in the following sections is the CORAL corpus (Viana et al., 1998; Trancoso et al., 1998). All our experiments are conducted on the train subset. The same procedure used for LECTRA was applied to CORAL. Once more, our in-house speech recognizer (Neto et al., 2008) was used to produce the forced-aligned transcription. The reference data was then aligned with the automatic transcription using the NIST SCLite tool. The corpus was automatically annotated with part-of-speech information using Falaposta (Batista et al., 2012a). Table 10.1 presents the overall characteristics of the training subset. The total alignment error for this subset is 0.2%. The alignment error of CORAL is consistently very low (at or near 0%), with the exception of three dialogues. Two of these have alignment errors of 3% and 6%, while the third has an impressive 65%. The reason for this very high rate is that the speakers are twins and the follower needs only 8 turns to conclude the dialogue, since the synchronization between them is immediate. It is the shortest dialogue of the corpus, full of laughs produced simultaneously with backchannels (affirmative answers, either assertive grunts, such as “hum”, or “very well”), stretches that are very hard to process automatically, even in forced-alignment mode. As for the other two dialogues with relatively higher alignment error rates, the errors occur mainly at the beginning of the dialogues: the speakers were not sure about their roles as either giver or follower, and this uncertainty is manifested in low confidence levels, low energy and pitch, and, consequently, higher alignment error rates.

10.2

Rate of disfluencies per speaker

The percentage of disfluencies was measured over the total sum of disfluencies from all the dialogues belonging to a given speaker, regardless of his/her interlocutor and of his/her role as either follower or giver. As stated in Chapter 9, the percentages of disfluencies are accounted for in two different ways: i) the number of disfluent words of a given speaker divided by the total number of words in the corpus (the overall sum is 8.39%); and ii) the number of disfluent words of a given speaker divided by the total number of words uttered by that same speaker (the corresponding corpus mean is 9.16%). The percentages are, thus, 8.39% and 9.16%, ranging from a minimum of 4.04% for speaker 3 to a maximum of 18.91% for speaker 16. The percentage of disfluencies is again in line with the ones reported in Shriberg (2001), with the exception of speakers 6, 15, 16, 19, and 20 (totalling 20.83% of the speakers). Means of disfluent sequences per sentence, as well as of disfluent words inside those sequences, are also given. The main conclusion is that there are fewer sequences and words per sentence than what was observed for the university lectures, ranging from one sequence (speaker 23) to two sequences (speaker 24) and from 1.27 words (speaker 4) to a maximum of 3 words (speaker 9). As for the most frequent disfluency types, represented in Table 10.2, filled pauses are the most frequent, followed by complex sequences (16.8%). For comparison purposes, Table 10.2 also includes percentages of emphatic repetitions (16.7%, summing single and multiple emphatic repetitions). Emphatic repetitions comprise several structures, the most productive being: (i) affirmative or negative backchannels (“sim, sim, sim.”/yes, yes, yes; “não, não, não.”/no, no, no) and (ii) repetition of a syntactic phrase, such as a locative prepositional phrase (“para cima, para cima”/above, above), used for precise tuning with the follower and for stressing the most important part of the instruction.
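The two percentage measures can be sketched as follows. The code is an illustrative reconstruction (not the thesis scripts), using speaker 1's counts from Table 10.1 (117 disfluent words, 1174 words uttered) and the corpus total of 42034 words as example inputs.

```python
# The two disfluency percentages described above, computed on speaker 1's
# counts from Table 10.1; the helper names are ours, for illustration only.

def pct_over_corpus(disfl_words, corpus_words):
    """Disfluent words of a speaker over all words in the corpus."""
    return 100.0 * disfl_words / corpus_words

def pct_over_speaker(disfl_words, speaker_words):
    """Disfluent words of a speaker over that speaker's own words."""
    return 100.0 * disfl_words / speaker_words

print(round(pct_over_corpus(117, 42034), 2))   # share of the whole corpus
print(round(pct_over_speaker(117, 1174), 2))   # 9.97, as in Table 10.1
```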


Spk    Time(m)  #words  #SU  #disfl  #disfl  mean  mean   %w/     %w/    words/bd  time/bd  useful/bd
                             seq     words   seq   words  corpus  spk
S1     17.48    1174    258  65      117     1.14  2.05   0.25    9.97   24.10     8.10     6.59
S2     22.12    1028    249  45      72      1.10  1.76   0.16    7.00   21.35     8.23     6.58
S3     23.04    1460    227  40      59      1.08  1.59   0.13    4.04   34.98     14.66    11.96
S4     18.96    1370    243  72      84      1.09  1.27   0.18    6.13   21.08     6.64     4.98
S5     23.67    1888    339  62      113     1.11  2.02   0.25    5.99   32.22     10.71    8.73
S6     23.94    1859    301  115     201     1.14  1.99   0.44    10.81  16.73     5.85     5.08
S7     21.08    1618    298  88      144     1.28  2.09   0.31    8.90   20.58     6.28     5.00
S8     21.56    1684    307  72      111     1.14  1.76   0.24    6.59   23.04     7.88     6.37
S9     21.04    1141    197  40      108     1.11  3.00   0.24    9.47   36.75     14.14    10.31
S10    34.33    3574    553  154     279     1.22  2.21   0.61    7.81   23.31     5.98     5.13
S11    17.71    1274    177  75      128     1.34  2.29   0.28    10.05  17.99     6.11     4.60
S12    21.81    1806    295  56      81      1.14  1.65   0.18    4.49   31.32     9.72     7.29
S13    25.41    1909    296  102     161     1.24  1.96   0.35    8.43   19.45     5.77     4.74
S14    19.27    1492    267  59      145     1.07  2.64   0.32    9.72   24.35     0.79     6.64
S15    29.06    2326    422  156     257     1.43  2.36   0.56    11.05  15.40     4.25     3.58
S16    27.3     2121    431  214     401     1.25  2.35   0.87    18.91  10.75     3.10     2.65
S17    16.34    1174    157  69      92      1.25  1.67   0.20    7.84   17.41     5.54     4.54
S18    22.43    1919    317  113     142     1.18  1.48   0.31    7.40   17.36     5.23     4.39
S19    29.00    2405    381  141     302     1.32  2.82   0.66    12.56  16.92     5.59     4.45
S20    36.11    2987    466  283     451     1.40  2.23   0.98    15.10  10.95     3.58     2.90
S21    18.64    703     105  39      59      1.44  2.19   0.13    8.39   16.55     6.50     4.89
S22    26.12    2435    434  54      122     1.10  2.49   0.27    5.01   45.05     11.94    9.57
S23    26.65    1385    297  50      88      1.00  1.76   0.19    6.35   27.45     10.14    8.09
S24    21.98    1302    170  93      133     1.63  2.33   0.29    10.22  14.99     5.90     4.50
Total  565.05   42034   7187 2257    3850    1.24  2.12   8.39    9.16   22.50     7.49     5.98

Table 10.1: Overall characteristics of the CORAL training subset. Total time per speaker; number of words, sentence-like units (#SU), and disfluent sequences and words; mean of disfluent sequences and disfluent words per sentence; percentage of disfluent words over the whole corpus and over the speaker's own words; number of words and total and useful time (in seconds) between disfluencies (“/bd”).


Emphatic repetitions in the map-task are a cue repertoire mainly used to help the follower reach a location. Thus, for the annotation of the map-task corpus the label emphatic repetition (“rp-e” or “rps-e”) had to be added, since the occurrence of such events was quite salient and productive, even in the pilot dialogue. Needless to say, emphatic repetitions are not disfluencies, but rather emphatic mechanisms used to highlight information. However, there was an obvious need to include them, to see how, in a continuum of fluency, they would be produced when compared to other events. Since we stated in Chapter 8 that disfluencies have a fluent component when they are uttered with specific prosodic properties, taking emphatic repetitions as an example of fluent mechanisms, selected for making information structure more salient, is a way to check whether they behave comparably to disfluent events.

10.3

Rate of disfluencies per dialogue and per speaker

Figures 10.1, 10.2, and 10.3 show the means of total and useful time between disfluencies per dialogue and across speakers, the total number of fluent/disfluent words, and the mean of words uttered between disfluencies, respectively. For the overall discriminated results, vide Appendix B. Results show that all measures vary per speaker and, within speaker, per dialogue as well. When accounting for all the speakers, there are significant differences with p < 0.001 regarding all measures: words/bd (H(23) = 68.237), time/bd (H(23) = 66.915), and useful time/bd (H(23) = 68.472). Taking into account the number of words uttered between disfluencies, speaker 22 utters the maximum average of words/bd (58.70), whereas speaker 20 produces the minimum (only 9.27 words). Within speakers, dialogues tend to be quite distinct from each other too (e.g., S1, S5, S9). On average, a dialogue presents the following set of features (vide Appendix B): 437.85 words and 40.10 disfluent words; 22.50 words produced between disfluencies; a disfluency produced every 7.49 seconds, or every 5.98 seconds if silent pauses are not considered.
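The H(23) values reported here and in the rest of the chapter are consistent with Kruskal-Wallis tests over the 24 speaker groups (df = k - 1 = 23). A minimal pure-Python sketch of the H statistic, run on synthetic data rather than the corpus measures, could look like this:

```python
# Kruskal-Wallis H statistic (no tie correction; the synthetic data below
# has no ties). With 24 groups the degrees of freedom are 23, matching the
# H(23) values in the text. Data are synthetic, not corpus measures.
import random

def kruskal_h(groups):
    pooled = sorted(x for g in groups for x in g)
    rank = {x: i + 1 for i, x in enumerate(pooled)}  # 1-based ranks
    n = len(pooled)
    sum_term = sum(sum(rank[x] for x in g) ** 2 / len(g) for g in groups)
    return 12.0 / (n * (n + 1)) * sum_term - 3 * (n + 1)

random.seed(0)
# One list of (synthetic) words-between-disfluencies values per speaker.
groups = [[random.gauss(20 + s, 5.0) for _ in range(30)] for s in range(24)]
print(f"H(23) = {kruskal_h(groups):.3f}")
```

The H value would then be compared against a chi-squared distribution with 23 degrees of freedom to obtain the p-value.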

10.4

Rate of disfluencies per sentence

In the previous section, results showed that there are differences in the distribution of disfluencies per speaker and across dialogues. The goal of the present section is to analyze the distribution of disfluencies per sentence. The analysis accounted for several variables measured per sentence and per speaker: number of words, syllables and phones within fluent sentences, sentences with disfluencies, and disfluent sequences; duration of fluent sentences, sentences with disfluencies, and disfluent sequences, with and without internal silences. Table 10.3 shows all the mentioned measures. The overall characterization corresponds to an average of 4.62 words, totalling 8.48 syllables and 17.85 phones, and an average of 1.24 sequences of disfluencies per sentence, containing 2.12 disfluent words. All measures are comparatively smaller


Spk   com  dl  dls  fp   fps  fg   fgs  rp   rps  sb   sbs  rp-e  rps-e  tot    %
S1    15   0   0    14   0    15   0    3    0    3    2    1     12     65     2.9
S2    5    2   0    7    0    8    0    5    3    1    0    1     13     45     2.0
S3    6    0   0    11   0    8    0    6    0    2    1    0     6      40     1.8
S4    7    0   0    48   0    7    0    4    0    2    0    0     4      72     3.2
S5    9    0   0    12   0    4    0    6    4    3    2    1     21     62     2.8
S6    21   0   0    14   0    19   1    9    3    10   1    1     36     115    5.1
S7    17   1   0    25   1    11   0    10   7    6    4    0     6      88     3.9
S8    10   0   0    26   0    8    0    6    2    6    2    1     11     72     3.2
S9    9    0   0    12   0    4    0    5    0    0    1    0     3      34     1.5
S10   32   1   3    29   0    15   0    29   11   11   4    2     17     154    6.9
S11   15   2   0    29   0    8    0    4    6    4    2    0     5      75     3.3
S12   8    0   1    12   0    11   0    5    1    10   2    0     6      56     2.5
S13   21   1   1    28   0    13   0    15   6    4    3    1     9      102    4.5
S14   3    0   2    5    0    6    0    3    2    7    0    1     30     59     2.6
S15   22   2   0    32   1    17   2    27   10   15   4    2     21     155    6.9
S16   46   2   2    51   0    21   0    23   5    16   4    2     42     214    9.5
S17   11   1   0    36   0    9    0    7    0    0    0    0     5      69     3.1
S18   6    0   1    56   1    11   0    16   3    7    0    0     12     113    5.0
S19   30   1   3    24   0    18   3    17   6    11   2    2     24     141    6.3
S20   38   0   1    36   0    32   3    92   19   12   3    8     38     282    12.6
S21   8    0   0    11   0    7    0    5    2    4    1    0     0      38     1.7
S22   10   0   1    4    0    5    0    7    4    4    0    5     14     54     2.4
S23   10   0   0    11   0    8    0    3    2    4    0    0     12     50     2.2
S24   18   0   1    51   1    4    0    6    3    3    3    0     2      92     4.1
tot   377  13  16   584  4    269  9    313  99   145  41   28    349    2247   100
%     16.8 0.6 0.7  26.0 0.2  12.0 0.4  13.9 4.4  6.5  1.8  1.2   15.5   100

Table 10.2: Distribution of disfluencies and of emphatic repetitions per speaker (“S”). The diacritic “com” stands for complex disfluencies; “dl” and “dls” for a single deletion and a sequence of deletions, respectively; “fp” and “fps” for a single filled pause or sequences of filled pauses; “fg” and “fgs” for a single fragment or sequences of fragments; “rp” and “rps” for a single repetition or sequences of repetitions; “sb” and “sbs” for a single substitution or a sequence of substitutions; and “rp-e” and “rps-e” for a single emphatic repetition or sequences of emphatic repetitions.


Figure 10.1: Total time and useful time between disfluencies, per dialogue and speaker.

Figure 10.2: Total words and disfluent words, per dialogue and speaker.

than the ones observed for the university lectures, meaning that the turns are quite short and dynamic. This fact is also corroborated by all tempo characteristics, since a fluent sentence lasts on average 1.45 seconds and a sentence with disfluencies 3.18 seconds. When silent pauses are not included, those means are lower (1.17 seconds for fluent sentences and 2.56 for sentences with disfluencies), the average time spent uttering disfluencies being 0.66 seconds. Statistical analysis shows that speaker variation is once more reflected at the sentence level, not only at the level of the dialogue per se. Results show significant differences with p < 0.001 in all the measures analyzed: #fluent words within fluent SUs (H(23) = 79.758), #fluent syllables within fluent SUs (H(23) = 104.225), #fluent phones within fluent SUs (H(23) = 97.593), #fluent words within disfluent SUs (H(23) = 93.383), #fluent syllables within disfluent SUs (H(23) = 96.249), #fluent phones within disfluent SUs (H(23) = 93.797), #disfluent words within sequences (H(23) = 94.024), #disfluent syllables within sequences (H(23) = 96.112), #disfluent phones within sequences (H(23) = 129.096), duration of SUs with internal silences (H(23) = 143.302), duration of SUs without internal silences (H(23) = 157.751), duration of disfluent SUs with internal silences (H(23) = 108.917), duration of disfluent SUs without internal silences (H(23) = 103.580), and duration of disfluencies (H(23) = 63.760). Speaker 3 produces more fluent words, syllables and phones and utters longer fluent sentences, whereas speaker 24 produces more disfluent sequences and words and also has lengthier disfluent sentences.

Figure 10.3: Mean of words uttered between disfluent sequences, per dialogue and speaker.

10.5

Prosodic analysis

The features described in Chapter 5 were used to perform a prosodic analysis of disfluencies and of their adjacent contexts in the CORAL corpus, similarly to what was done in Chapter 9 for the LECTRA corpus.

10.5.1

Overall prosodic characterization

We first analyzed whether there would be an overall tendency to both f0 and energy resets in the repair region, when all the speakers and all the different types of disfluencies are targeted. As Figure 10.4 illustrates, there are pitch and energy increases from the disfluency region (“disf” or reparandum) to the repair of fluency (“disf+1”). Results show significant differences with p < 0.001 in pitch and energy differences between the previous word (“disf-1”) and the disfluency, as well as between the disfluency and the repair itself (pitch and energy differences between “disf-1” and “disf”: H(12) = 25.768 with p < 0.05 and H(12) = 49.331; between “disf” and “disf+1”: H(12) = 95.357 and H(12) = 63.127, respectively).


       Fluent SUs           Disfluent SUs        Disfluent sequences       Fluent SU dur   Disfl SU dur   Disfl
Spk    #wrd  #syl  #ph      #wrd   #syl   #ph    #seq  #wrd  #syl  #ph    with   w/out    with   w/out   dur
S1     4.06  7.78  16.18    6.28   12.09  25.40  1.14  2.05  2.82  5.58   1.32   1.12     1.88   1.61    .49
S2     3.55  6.32  13.58    7.07   14.17  30.46  1.10  1.76  2.68  5.17   1.39   1.07     2.85   2.27    .60
S3     5.63  10.59 22.92    10.54  20.97  45.51  1.08  1.59  2.03  3.73   2.30   1.88     4.89   3.88    .60
S4     4.73  9.63  19.85    8.06   15.65  32.17  1.09  1.27  1.55  2.29   1.62   1.14     2.16   1.86    .35
S5     5.17  9.42  19.89    7.61   14.20  30.13  1.11  2.02  2.68  5.95   1.67   1.36     2.75   2.13    .65
S6     4.95  9.17  19.24    8.61   17.51  36.67  1.14  1.99  2.59  5.30   1.68   1.44     3.23   2.78    .60
S7     4.31  7.90  16.93    9.13   16.30  35.20  1.28  2.09  2.90  5.23   1.35   1.06     2.72   2.15    .68
S8     4.31  8.08  16.67    10.05  19.71  41.86  1.14  1.76  2.41  4.33   1.38   1.15     3.80   2.93    .59
S9     5.12  9.70  20.43    8.78   17.47  37.11  1.11  3.00  5.50  11.50  1.93   1.42     3.86   2.65    1.51
S10    5.35  10.09 21.37    10.24  19.56  41.52  1.22  2.21  3.34  6.57   1.37   1.16     2.68   2.31    .60
S11    4.75  8.31  17.17    12.48  24.91  53.98  1.34  2.29  3.21  5.86   1.32   1.03     4.69   3.52    .57
S12    5.09  8.82  18.29    11.33  20.06  41.76  1.14  1.65  2.00  3.84   1.55   1.17     3.79   2.76    .45
S13    4.80  8.52  17.92    10.74  20.23  43.30  1.24  1.96  2.63  4.89   1.35   1.13     3.57   2.76    .57
S14    4.59  8.32  17.58    9.44   18.27  39.35  1.07  2.64  4.11  8.73   1.43   1.19     3.33   2.70    .87
S15    4.21  7.50  16.09    9.26   17.28  36.90  1.43  2.36  3.28  6.20   1.16   .95      2.61   2.19    .64
S16    3.52  6.08  12.78    7.06   13.07  27.86  1.25  2.35  3.24  6.09   .98    .84      2.07   1.76    .72
S17    4.57  8.99  19.27    12.87  25.60  54.69  1.25  1.67  2.25  3.93   1.41   1.18     4.36   3.43    .61
S18    4.67  8.30  17.18    9.24   17.36  36.46  1.18  1.48  1.98  3.24   1.33   1.12     2.90   2.44    .69
S19    4.86  8.97  18.91    10.03  18.59  38.82  1.32  2.82  4.07  7.89   1.72   1.31     3.17   2.66    .83
S20    3.95  7.34  15.28    9.62   18.65  40.34  1.40  2.23  3.10  6.06   1.28   1.02     3.22   2.65    .69
S21    4.33  7.96  16.21    13.52  27.15  57.04  1.44  2.19  3.07  4.93   1.58   1.24     6.54   4.53    .54
S22    5.21  9.37  19.48    8.80   16.10  34.00  1.10  2.49  3.90  .87    1.34   1.08     2.35   1.92    .58
S23    4.46  8.34  18.04    5.68   10.70  23.08  1.00  1.76  2.44  4.96   1.63   1.30     1.93   1.55    .51
S24    3.47  5.82  12.13    15.96  31.74  66.72  1.63  2.33  3.11  5.40   1.13   .88      6.76   5.14    .92
Mean   4.62  8.48  17.85    9.47   18.14  38.56  1.24  2.12  2.98  5.69   1.45   1.17     3.18   2.56    .66

Table 10.3: Means of words, syllables and phones within (dis)fluent sentences and disfluent sequences (“seq”); duration of disfluencies and of (dis)fluent sentences (in seconds), with and without internal silences, per speaker.


However, when looking at pitch and energy slopes within words, there are no significant differences in the “disf-1” and “disf+1” pitch slopes (H(12) = 13.712, p = 0.319; and H(12) = 13.427, p = 0.339, respectively), only in the energy slopes of “disf-1”, “disf”, and “disf+1”, with p < 0.001 (H(12) = 38.399, H(12) = 45.256, and H(12) = 98.223, respectively), and in the “disf” pitch slope (H(12) = 51.206, p < 0.001). In the map-task dialogues the strategy of f0 and energy increases in the repair region still holds. However, when looking within words, and specifically at the slopes of those words, the behavior is quite different: only the energy slopes are significantly different for all units, and pitch slopes exclusively for disfluent events.
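The pitch differences underlying these comparisons are expressed in semitones. A small sketch of that conversion, with hypothetical f0 values rather than corpus data:

```python
# Pitch differences between adjacent units (disf-1, disf, disf+1) expressed
# in semitones, as used throughout this analysis. The f0 values below are
# hypothetical medians in Hz, for illustration only.
import math

def semitone_diff(f0_from, f0_to):
    """Signed difference from one unit to the next, in semitones (12/octave)."""
    return 12.0 * math.log2(f0_to / f0_from)

f0_before, f0_disf, f0_repair = 120.0, 110.0, 130.0  # hypothetical medians
print(semitone_diff(f0_before, f0_disf))   # negative: drop into the disfluency
print(semitone_diff(f0_disf, f0_repair))   # positive: rise into the repair
```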

Figure 10.4: Pitch and energy slopes inside the disfluency (disf), word before (disf-1), and word after (disf+1), and differences between such units based on the average.

10.5.2

Speaker and type of disfluency

There are two distinct patterns regarding pitch and energy increases from the disfluency to the repair region. Pitch increases are strongly dependent on the disfluency type and on the speaker, whereas energy increases do not vary per speaker, only per disfluency type, as Figures 10.5 and 10.6 show. 20 speakers (71% of the speakers) produce pitch increases from the disfluency to the fluency repair, and half of the categories are uttered with subsequent pitch increases. The energy increases are constant per speaker; however, they also vary per disfluency type, i.e., deletions and fragments do not exhibit an energy gain from the disfluency to the repair.


Figure 10.5: Difference between the disfluency, the previous and the following word pitch average, per type and speaker.

The highest pitch reset occurs after a single filled pause, again similarly to the university lecture corpus, and after an emphatic repetition. However, contrarily to the university lectures, in the map-task corpus the pitch reset does not encompass contiguous sequences of filled pauses, and the reset is not as striking as in the university lectures. A single deletion exhibits a high pitch and energy slope inside the disf+1 word.

As already observed for the lecture corpus, the prosodic contrast strategy is also applied in the map-task corpus, but with different nuances. Whenever the contrast is not made by the differences between “disf” and “disf+1”, there is a tendency to differentiate “disf+1” from the whole sequence. Again, results do not support the use of a contrast strategy exclusively on error corrections (Levelt and Cutler, 1983). There is a more general tendency towards a contrast marking strategy.

10.5. PROSODIC ANALYSIS

107

Figure 10.6: Energy slopes (dB) inside the previous word, disfluency, and in the following word, per type and speaker.

10.5.3

Tempo characteristics

As for the tempo analysis, the average durations of the different regions are represented in Figure 10.7. The disfluency is the longest event (522 ms), the silent pause between the disfluency and the following word is on average longer (231 ms) than the preceding one (186 ms), and the “disf+1” word (257 ms) is shorter than the “disf-1” word (318 ms).

disf-1 (318 ms), silence (186 ms), disf (522 ms), silence (231 ms), disf+1 (257 ms)

Figure 10.7: Duration of all the events in ms.

Tempo patterns exhibit significant differences (p < 0.001) per speaker and per disfluency type in the units “disf-1”, “silent pause before”, “disf”, “silent pause after”, and “disf+1” (per speaker: H(23) = 79.005, H(23) = 74.878, H(23) = 69.465, H(23) = 161.392, and H(23) = 44.599 with p < 0.01; per type: H(12) = 67.384, H(12) = 171.633, H(12) = 944.476, H(12) = 110.527, and H(12) = 222.460, respectively). Figure 10.8 shows, as expected, that sequences of more than one event are lengthier than single events. The longest disfluency is a sequence of filled pauses (for the university corpus it was a complex sequence of disfluencies) and the shortest a fragment. Another aspect that seems to differentiate the map-task from the university lectures is that the general tendency to produce lengthier silent pauses after a disfluency than before one is not manifested with the same regularity found for the lecture corpus (namely with complex sequences and fragments). Complex sequences and filled pauses, as well as emphatic repetitions, are uttered with preceding silent pauses longer than the subsequent ones. Furthermore, emphatic repetitions have in fact the largest such difference (almost 100 ms) between the adjacent silent pauses. The patterns observed for silent pauses regarding emphatic repetitions, either single or in sequences, distinguish them from repetitions per se.

Figure 10.8: Duration of all the events per disfluency type in CORAL.

Table 10.4 illustrates the speech and articulation rates, as well as the phonation ratio, per speaker. As previously described for speaker, dialogue and sentence tempo characteristics, there are also considerable differences in speech and articulation rates. All measures are given in phones and syllables per second. The measures for the fluent sentences are as follows: i) articulation rates range from a minimum of 12.49 phones per second (speaker 2) to a maximum of 21.44 (speaker 4), or from 6.18 to 12.69 syllables per second; ii) speech rate measures display the same tendencies, with speakers 2 and 4 again at the minimum and maximum, now from 10.98 to 19.17 phones or from 5.44 to 11.43 syllables per second; and iii) phonation ratios range from 85.91 (speaker 9) to 92.47 (speaker 6). As for sentences with disfluencies, the trends are: i) articulation rates from 12.86 phones (speaker 3) to 19.22 (speaker 4), or from 6.39 syllables (speaker 24) up to 9.91 (speaker 4); ii) speech rates from 11.39 (speaker 3) to 17.16 (speaker 4) phones, or from 4.96 syllables (speaker 21) to 8.86 (speaker 4); and iii) phonation ratios from 75.66 (speaker 9) up to 90.64 (speaker 19). When disfluencies are included in all measures, the means of phones and syllables decrease (by 1-2 phones and 1 syllable) and the phonation ratio rises by about 3% (from 87.29 to 90.54). Summing


       Fluent SUs                            Disfluent SUs, without disfl          Disfluent SUs, with disfl
Spk    art.ph sp.ph  phon   art.sy sp.sy    art.ph sp.ph  phon   art.sy sp.sy    art.ph sp.ph  phon   art.sy sp.sy
S1     13.51  12.09  89.63  6.89   6.14     15.07  13.57  90.36  7.32   6.61     14.13  13.13  93.22  6.84
S2     12.49  10.98  89.66  6.18   5.44     15.14  13.18  86.37  7.17   6.25     13.43  12.23  90.81  6.35   5.78
S3     12.67  11.48  89.74  6.55   5.98     12.86  11.39  87.96  6.50   5.81     11.10  9.99   90.37  5.34   4.81
S4     21.44  19.17  89.75  12.69  11.43    19.22  17.16  87.67  9.91   8.86     15.83  14.26  89.86  8.11   7.30
S5     14.28  12.76  89.59  7.37   6.61     14.38  12.54  86.71  7.00   6.12     12.74  11.54  90.75  6.04   5.45
S6     12.72  11.74  92.47  6.29   5.81     13.88  12.45  89.86  6.73   5.99     12.08  11.15  93.00  5.75   5.30
S7     15.73  14.08  88.24  8.54   7.70     16.65  14.45  86.87  8.32   7.32     13.98  12.40  89.73  6.84   6.10
S8     14.62  13.22  90.46  7.62   6.92     14.44  12.14  84.85  6.92   5.80     12.84  11.31  88.45  6.33   5.59
S9     13.59  11.62  85.91  6.42   5.48     14.20  10.72  75.66  6.81   5.10     12.69  10.63  84.20  5.98   4.98
S10    18.00  16.43  91.14  9.68   8.85     18.01  16.21  89.82  8.59   7.74     15.99  14.73  92.53  7.68   7.07
S11    17.99  15.98  87.88  10.71  9.65     15.34  12.55  82.59  7.29   6.00     14.59  12.33  85.19  7.05   5.99
S12    16.30  13.97  86.16  8.51   7.24     16.48  13.57  81.51  8.10   6.67     14.65  12.35  84.40  7.30   6.16
S13    16.42  15.00  91.28  8.63   7.95     16.37  13.93  85.09  7.78   6.63     14.70  12.89  88.14  7.10   6.24
S14    13.68  12.33  89.09  7.25   6.58     15.27  13.05  86.03  7.09   6.03     12.69  11.27  90.66  5.96   5.28
S15    15.63  14.09  90.35  7.70   6.95     18.41  16.56  89.07  9.13   8.25     14.93  13.73  92.31  7.32   6.75
S16    14.65  13.16  90.24  7.93   7.11     15.99  14.30  89.33  7.69   6.87     13.71  12.72  93.34  6.66   6.18
S17    17.40  15.73  90.30  10.05  9.04     16.75  14.31  85.06  7.89   6.75     14.49  12.75  87.81  6.99   6.15
S18    15.30  13.84  90.38  8.41   7.59     15.07  13.53  89.88  7.25   6.49     12.51  11.45  92.29  6.15   5.62
S19    14.75  13.31  89.86  7.67   6.94     15.20  13.83  90.64  7.64   6.99     13.86  12.90  93.01  6.83   6.36
S20    14.41  12.47  86.97  7.10   6.14     15.84  14.11  88.82  7.42   6.62     13.86  12.63  91.52  6.61   6.04
S21    13.21  12.16  91.31  7.95   7.30     12.94  10.01  77.77  6.40   4.96     12.45  9.97   80.47  6.78   5.54
S22    17.21  15.62  90.51  9.20   8.38     18.32  16.38  89.30  8.67   7.74     16.31  14.90  92.01  7.81   7.12
S23    13.18  11.77  89.25  6.70   6.00     14.34  12.68  88.23  6.87   6.07     13.06  11.95  92.00  6.27   5.73
S24    14.37  12.76  87.70  7.67   6.74     13.16  10.63  80.05  6.39   5.16     11.93  9.90   82.86

5.88

4.88

Mean

15.27

13.69

89.60

8.06

7.24

15.85

13.89

87.29

7.67

6.73

13.84

12.48

90.54

6.72

6.06

Table 10.4: Ratios per speaker where “ph” stands for phone; “syl” for syllable; “art” for articulation”; and “phon” for phonation ratio. up, when looking to the trends in fluent sentences, speaker 4 presents the highest values and speaker 2 the lowest; when looking to sentences with disfluencies again speaker 4 displays the highest values and speaker 3 the lowest.
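The tempo measures above can be sketched as follows. This is a minimal illustration under an assumed per-unit data layout (lists of phone and silent-pause durations); the function name and example values are hypothetical, not the toolchain used in this work.

```python
# Minimal sketch of the tempo measures in Table 10.4 (hypothetical data
# layout): phone_durations are the seconds of each uttered phone in a
# sentence-like unit, silence_durations its internal silent pauses.

def tempo_measures(phone_durations, silence_durations):
    """Speech rate and articulation rate (phones/s) and phonation ratio (%)."""
    n_phones = len(phone_durations)
    phonation_time = sum(phone_durations)            # time spent articulating
    total_time = phonation_time + sum(silence_durations)
    return {
        "speech_rate": n_phones / total_time,            # pauses included
        "articulation_rate": n_phones / phonation_time,  # pauses excluded
        "phonation_ratio": 100.0 * phonation_time / total_time,
    }

# Example: 50 phones of 100 ms each plus 0.5 s of silent pauses.
m = tempo_measures([0.1] * 50, [0.3, 0.2])
```

Syllable-based rates follow the same formulas with syllable counts. By construction, speech rate = articulation rate × phonation ratio / 100, which matches the per-speaker values in the table up to rounding.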

10.6 Summary

In the dialogues, 71% of speakers produce pitch increases from the disfluency to the repair, and half of the categories are uttered with subsequent pitch resets. Energy-increasing patterns are constant per speaker; however, they also vary per disfluency type, i.e., deletions and fragments do not exhibit an energy gain from the disfluency to the repair. The context preceding the disfluency is mostly not significant for the prosodic interplay between the regions, showing a prosodic signaling different from that of the university lectures, which will be further explored in the next chapter. Moreover, the tempo parameter varies per speaker and per disfluency type. The need to reach a final destination in a map may in fact impose temporal constraints on the editing of the signal, since all events uttered are relatively shorter than those of the university lectures. In the next chapter we will focus on the comparison between the map-task corpus and the university lectures, in order to verify the impact of speaking style effects on the production of disfluencies.

11 Speaking style effects in the production of disfluencies

This chapter explores speaking style effects in the production of disfluencies, building on the analysis conducted in Chapters 9 and 10 for university lectures and map-task dialogues. In both corpora speech is edited on-line; however, they vary in the ways speakers adjust to communicative contexts. Distributional patterns, speech and articulation rates, and prosodic disfluency-fluency repair strategies are targeted in a broader inter-corpora comparison of styles, with the aim of contributing to the differentiation between fully spontaneous and prepared non-scripted speech.

11.1 Related work

With regard to the discrimination of speaking styles, disfluencies have classically been accounted for from a limited viewpoint: the presence/absence of disfluent events predicts speech as either spontaneous or read. Having established the linguistic properties of disfluencies, cross-linguistic studies have already pointed out language-universal and language-specific regularities (Allwood et al., 1990; Eklund and Shriberg, 1998; Clark and Fox Tree, 2002; Eklund, 2004; Vasilescu and Adda-Decker, 2007), both segmental and prosodic. Recent studies are gradually focusing on various (para)linguistic properties of such events. Either per se or combined with other features, disfluencies have been shown to characterize social and emotional behavior (Gravano et al., 2011; Benus et al., 2012; Ranganath et al., 2013; Schuller et al., 2013). Studies are thus moving well beyond the classical view of presence/absence of disfluencies for the classification of spontaneous vs. read speech, embracing a diverse set of domains (e.g., speed-dating, Supreme Court hearings, etc.). The expression speaking style is complex to define. The literature (Biber, 1988; Eskénazi, 1993; Blaauw, 1995; Barry, 1995; Biber and Conrad, 2009; Hirschberg, 2000) has pointed to the role that multiple dimensions of variation play in style changes, contributing to a more comprehensive view of speaking style. For Eskénazi (1993), there are three essential axes of variation: the degree of intelligibility required by the situation, the familiarity between speaker and listener(s), and the social strata of the communicative participants. The effect that the speaker intends to have on the listener is another dimension to consider, as evidenced by Barry (1995).


As for intonational and rhythmic variation features related to speaking style changes, correlates have already been identified across tasks and domains in different languages (e.g., Blaauw (1995); Mata (1999); Hirschberg (2000)). Previous work on disfluencies for Portuguese (Moniz, 2006) has also pointed to prepared vs. spontaneous task distinctions, considering other distributional patterns beyond the mere presence/absence of disfluency events. Characterizing speaking styles plays a crucial role in several automatic speech processing areas and is still an open research topic. For instance, the discriminant properties of speaking styles (e.g., first and second person pronouns and verbal forms in dialogues vs. third person forms in broadcast news; frequent familiar vocabulary vs. technical terms; linguistic and non-linguistic vocalizations inter- and intra-corpora (Weninger and Schuller, 2012)) may be useful for language model adaptation in ASR. Recently, given well-suited data samples for different domains, speaking styles are also being modelled for more natural speech synthesis (Parlikar and Black, 2012). In the next sections, the inter-corpora distribution of disfluencies and a comparative analysis of prosodic parameters will be presented.

11.2 Inter-corpora distribution

Table 11.1 reviews the overall characteristics of the training sets of both corpora and adds the statistical significance of each feature. Features statistically significant at p < 0.001 are marked with “*” and those significant at p < 0.01 with “**”. The corpora present different properties regarding the features analyzed. Lectures display a mean of 18 words per sentence, of which on average about 1 word (1.36) is disfluent, and a speech rate of 7.81 syllables per second; 6 seconds is the mean duration of a sentence, silent pauses included. Dialogues, in turn, display a mean of 6 words per sentence (6.38), of which 0.5 words on average are disfluent, and a speech rate of 7.24 syllables per second; 1.9 seconds is the mean duration of a sentence. Thus, there is a general trend for the lecture corpus to present higher values for the majority of the features. When considering sentences with disfluencies, lectures again present the higher mean values. Sentences encompassing disfluencies are characterized by having around 2 (1.95) disfluent sequences, with 3.8 words within those sequences and a duration of 1.30 seconds. As for the dialogues, sentences with disfluencies have 1 disfluent sequence (1.24) with 2.12 words within that sequence and a duration of 0.70 seconds. What is also evident in Table 11.1 is that the number of words and the mean duration of disfluent sentences are higher than those of fluent sentences in both corpora. Previously, Moniz et al. (2007) pointed to this same trend in a corpus of high-school presentations, associating this tendency with the co-occurrence of certain types of disfluencies in syntactically complex loci.
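The Z values reported in Table 11.1 come from Mann-Whitney comparisons of a feature between the two corpora. A minimal sketch of the statistic under the normal approximation (illustrative only: the function name and toy samples are hypothetical, and no tie correction of the variance is applied):

```python
from math import sqrt

def mann_whitney_z(a, b):
    """Mann-Whitney U for sample `a` vs `b`, and Z under the normal
    approximation (average ranks for ties; no tie correction of sigma)."""
    combined = sorted([(v, "a") for v in a] + [(v, "b") for v in b])
    n = len(combined)
    rank_sum_a = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and combined[j][0] == combined[i][0]:
            j += 1                        # block [i, j) holds tied values
        avg_rank = (i + 1 + j) / 2.0      # average of 1-based ranks i+1 .. j
        rank_sum_a += avg_rank * sum(1 for k in range(i, j) if combined[k][1] == "a")
        i = j
    na, nb = len(a), len(b)
    u = rank_sum_a - na * (na + 1) / 2.0
    mu = na * nb / 2.0
    sigma = sqrt(na * nb * (na + nb + 1) / 12.0)
    return u, (u - mu) / sigma

# Toy example: fully separated samples give U = 0 and a negative Z.
u, z = mann_whitney_z([1, 2, 3], [4, 5, 6])
```

In practice one would apply this to the per-sentence feature values (e.g., words per sentence) of lectures vs. dialogues, marking significance at the thresholds used in the table.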


                                      lectures    dialogues   Z for Mann-Whitney
Overall
time (hours)                          24:28       9:41
alignment error                       1.0         0.2
#fluent words                         176853      42034
#disfluent words                      14357       3850
#sentence-like units (SUs)            10576       7187
#disfluent sentences                  3772        1817
%disfluent sentences                  35.7        25.28
%disfluent words                      7.51        8.39
mean of words                         18.08       6.38        −55.089*
mean of disfl words                   1.36        0.53        −17.885*
Fluent SUs
mean fluent words                     9.96        4.62        −40.473*
mean fluent syllables                 18.43       8.48        −40.849*
mean fluent phones                    38.67       17.85       −40.406*
speech rate                           7.77        7.24        −15.377*
articulation rate                     9.20        8.06        −25.942*
phonation ratio                       83.04       89.60       −21.631*
mean duration                         6.10        1.90        −50.258*
mean duration without silences        4.32        1.52        −49.055*
Disfl SUs
mean of disfluent sequences           1.95        1.24        −20.104*
mean disfl words                      3.80        2.12        −18.360*
mean disfl syllables                  5.41        2.98        −17.044*
mean disfl phones                     9.55        5.69        −12.564*
mean fluent words                     28.87       9.47        −37.925*
mean fluent syllables                 56.75       18.14       −37.925*
mean fluent phones                    120.57      38.56       −37.696*
speech rate                           6.02        6.06        not significant
articulation rate                     7.29        6.72        −10.412*
phonation ratio                       82.24       90.54       −24.967*
mean duration of disfluency           1.30        0.66        −14.439*
mean duration                         10.68       3.18        −36.597*
mean duration without silences        7.81        2.56        −36.152*
words between disfluencies            30.63       22.50       −3.008**
time between disfluencies             11.75       7.49        −3.598*
useful time between disfluencies      8.07        5.98        −2.935**

Table 11.1: Overall characteristics of lectures and dialogues (“*”: p < 0.001; “**”: p < 0.01).


We interpret the differences outlined above as being linked to underlying distinctions between dialogic vs. (essentially) monologic communication. In a dialogue, sentences have fewer words and are shorter than those produced by a teacher in expository lectures. The on-the-fly editing process in a map-task dialogue implies a close cooperative process between two interlocutors under strict temporal constraints, totally different from the production circumstances of a university lecture.

With respect to the average number of words uttered per sentence, a comparison can be made with previous studies for European Portuguese. Batista et al. (2012a) report an average of 22 and 21 words per sentence in corpora of Portuguese and English broadcast news, respectively. A similar result (21 words) is reported by Ribeiro and de Matos (2011) and Amaral and Trancoso (2008) for Brazilian Portuguese newspapers. Moreover, Batista (2011) points out an average of 29 words in the European Parliament Proceedings Parallel Corpus (Europarl, Koehn (2005)). When analyzing a high-school lecture, Mata (1999) reports an average of 17 words produced by a teacher within an intonational utterance. As for the study of child-adult dialogues, Mata and Santos (2010) present an average of 3 words per sentence in the questions addressed by adults to young children. Among the results just described, the last two are the closest to the averages of the lectures and dialogues analyzed here, as represented in Table 11.2.

child-directed speech     3 words
map-task dialogues        6 words
high-school lecture      17 words
university lectures      18 words
Portuguese broadcast     22 words
European parliament      29 words

Table 11.2: Mean words in distinct corpora.

In Table 11.2 there is a clear distinction between dialogues and the remaining corpora. Dialogues are built upon interactions between interlocutors, as previously mentioned; it is thus understandable that fewer words are produced per sentence. Academic presentations are associated with the need to explain several concepts in detail. To do so, teachers often use paraphrases, explicative sentences, examples to illustrate theoretical concepts, etc. The European Parliament presentations, as an oratory domain, are mostly related to a clearly structured presentation of arguments and are, therefore, the most verbose. It is also interesting to note that the number of words uttered between disfluencies in the university lectures (30.63) is close to the average reported for the parliamentary presentations.

Type             lectures %   dialogues %
Complex             29.1         20.2
Deletions            6.1          1.6
Filled pauses       33.1         31.4
Fragments            6.6         14.9
Repetitions         16.0         22.0
Substitutions        9.1          9.9
Total %            100          100

Table 11.3: Distribution of disfluencies per corpus.

Regarding the distribution of disfluent categories, as illustrated in Table 11.3, filled pauses are the most frequent type in both corpora, as they are also the most frequent type reported in the literature (e.g., Shriberg (1994); Eklund (2004)). Complex sequences and repetitions are also very frequent in both corpora. However, while lectures display a higher percentage of complex sequences (29.1%) than of repetitions (16%), in dialogues both categories have a similar distribution. Additional differences in the distribution of categories are that dialogues show twice as many fragments as lectures, and fewer deletions. The higher frequency of fragments in dialogues is further evidence of the strict time constraints of this domain, since speakers interrupt themselves as soon as they notice an error, not preserving the integrity of the word (Levelt, 1989). Speakers rarely choose a deletion, since deletions are more complex to process (Fox Tree, 1995b). A plausible explanation for the above-mentioned strategies may be that, unlike dialogue participants, teachers have more time to edit their speech, displaying strategies associated with more careful word choice and speech planning.

11.3 Inter-corpora prosodic analysis

For an easy inter-corpora comparison of pitch and energy, figures previously presented in Chapters 9 and 10 are repeated in this section. As Figures 11.1 and 11.3 illustrate, in the lectures, pitch and energy increase from the disfluency to the repair region, independently of the speaker and for the majority of the disfluent types (with the exception of sequences of more than one repetition or deletion). As Figures 11.2 and 11.4 show, in the dialogues, 71% of speakers produce the pitch increases, and half of the categories are uttered with subsequent pitch resets. Energy-increasing patterns are constant per speaker; however, they also vary per disfluency type, i.e., deletions and fragments do not exhibit an energy gain from the disfluency to the repair. What is interesting to observe is that the disfluent categories with no energy increases do in fact show a very striking difference between the disfluency and the repair: there may not be a gain from the disfluency to the repair, but the differences between those regions are very clear.

Figure 11.1: Pitch differences between units based on the average for university lectures (per disfluency type and per speaker, in semitones).

Lectures display significant differences at p < 0.001 in all units of analysis, even in the context previous to a disfluency, whereas in dialogues such cues for the context previous to a disfluency are not significantly different. Although both corpora display pitch and energy increases, inter-corpora significant differences are found in pitch and energy (p < 0.05 for energy slopes inside disfluencies and p < 0.001 for all the remaining features); the only feature with no significant differences is the pitch slope inside disfluencies (p = .171). The differences are due to the fact that lectures present higher pitch maxima than dialogues, around a semitone more in both disfluency-adjacent contexts. As for energy, dialogues display higher energy maxima, around 2 dB more in both disfluency-adjacent contexts and also within the disfluency region itself. The inter-corpora prosodic contrast marking strategy of disfluency-fluency repair does not fully agree with the one established by Levelt and Cutler (1983), since a cross-speaking-style strategy is displayed by the majority of the disfluency types and not only by error correction categories, such as substitutions. However, the patterns observed in the lectures, ascribed either to speakers or to disfluency types, do not hold with the same regularity in the dialogues. In university lectures, prosodic cues are being given to the listener both for the units inside disfluent regions and between these and the adjacent contexts, pointing to a stronger prosodic contrast marking of disfluency-fluency repair when compared to dialogues.

Figure 11.2: Pitch differences between units based on the average for dialogues (per disfluency type and per speaker, in semitones).

Pause duration can also be considered a cue to signal prosodic contrast (Vaissière, 2005). For both corpora, there are 21.24% of disfluent sequences without a previous silent pause (9.56% for dialogues and 11.68% for lectures) and only 3.93% without a subsequent silent pause (3.38% for dialogues and 0.55% for lectures). Figure 11.5 illustrates the durations of disfluencies, previous and following lexical contexts, and silent pauses. The disfluency is the longest event, the silent pause between the disfluency and the following word is longer than the previous silent pause, and the disf+1 word is shorter than the disf-1 word. Thus, a similar general trend was observed in both corpora. However, two specific properties are crucially different in dialogues: the duration of the silent pause before a disfluency is shorter than disf+1, whereas in the lectures they are practically equal; and all the averages in the dialogues are shorter than those reported for the lectures. Inter-corpora comparisons show that there are significant differences in disf-1 (U = −17.099, p < 0.001), the previous silent pause (U = −11.034, p < 0.001), and disf (U = −2.616, p < 0.01); whereas no significant difference was found for the following silent pause (U = −.627, p = 0.530) and disf+1 (U = −.792, p = 0.428). The tempo characteristics of the disfluency and adjacent contexts are further evidence of the dynamic nature of dialogues.
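The pitch differences in the figures are expressed in semitones, i.e., 12·log2 of the ratio between two f0 values. A minimal sketch (the function name is illustrative):

```python
from math import log2

def semitone_diff(f0_hz, f1_hz):
    """Pitch interval in semitones from f0_hz to f1_hz (positive = rise)."""
    return 12.0 * log2(f1_hz / f0_hz)

# An octave jump (100 Hz -> 200 Hz) is 12 semitones; the roughly
# one-semitone lecture/dialogue difference in pitch maxima reported
# above corresponds to an f0 ratio of about 6%.
```

Working in semitones rather than Hz makes pitch excursions comparable across speakers with different f0 ranges, which is why inter-corpora differences are reported on this scale.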


Figure 11.3: Energy slopes (dB) inside the previous word, disfluency, and in the following word for lectures (per disfluency type and per speaker).

[Figure: average durations (ms) of disf-1, the preceding silent pause, the disfluency, and the following silent pause, for lectures and dialogues.]