DEVELOPMENT AND EVALUATION OF JAMAICAN CREOLE SYNTHETIC SPEECH USING CLUSTER UNIT SELECTION

A Thesis Submitted in Fulfilment of the Requirement for the Degree of Master of Philosophy in Computer Science

of The University of the West Indies

by Dahlia Marie Thompson 2015

Department of Computing
Faculty of Science and Technology
Mona Campus


ABSTRACT

Development and Evaluation of Jamaican Creole Synthetic Speech Using Cluster Unit Selection

Dahlia Marie Thompson

The use and value of synthesized speech, and of applications built on it to provide and enhance basic yet crucial daily functions, is widespread and continues to increase at a steady pace. In this thesis, we propose to create quality, domain-specific Jamaican Creole synthetic speech through the adaptation of the cluster unit selection speech synthesis technique. The multi-lingual open source Festival Speech Synthesis framework developed at the Centre for Speech Technology Research, together with the Edinburgh Speech Tools and Festvox Tools, was used to accomplish our synthetic voice building objective. The variety of Jamaican Creole used within the speech corpus is an Eastern variety of the language, and the target domain is in-car street-level voice navigation. We validate the success of this pioneering Jamaican Creole synthetic voice building effort through objective and subjective assessments and also put forward a benchmark for Jamaican Creole synthesized speech. We present the results of the evaluation of navi, specifically in relation to the (i) overall quality, (ii) intelligibility, (iii) acoustic accuracy, (iv) functionality, (v) acceptability and (vi) appropriateness of the voice built.

Keywords: Dahlia Marie Thompson; navi; navigation; Jamaican Creole; Speech Synthesis; Text-to-Speech Synthesis; Concatenative Speech Synthesis; Unit Selection; synthetic speech; Festival; Festvox; open source voice creation toolkit.


Acknowledgements

I would like to thank my supervisor, Dr. Ashley Hamilton-Taylor, for his constant encouragement to ‘push forward’ despite the overwhelming number of errors at the initial stage of this project, and for taking time out of his busy schedule, particularly on weekends, to facilitate our long-distance supervisory relationship. I would also like to take the opportunity to thank Professor Hubert Devonish, who provided constant feedback and critique. Thank you too to Professor Alan W. Black of Carnegie Mellon University, for taking time out of his busy schedule to answer an unscheduled telephone call, listen to questions and provide timely feedback on the usage of the Festival Speech Synthesis platform and Festvox Tools. Thank you to the lecturers, staff and students of the Department of Computing, Faculty of Science and Technology, UWI Mona Campus, for timely peer review and critique. Thank you too to the volunteer participants who spared time out of their busy schedules to complete the evaluation surveys and provide pertinent feedback. To all the other persons who played an equally integral part in the completion of this study but who are too numerous to list, I say a hearty thank you to each and every one of you, for your ongoing support and encouragement.


Dedication

For Althea Ambrozene McLish.


Table of Contents

ABSTRACT
Acknowledgements
Dedication
Table of Figures
List of Tables
List of Acronyms
Chapter 1: Introduction and Background to Study
    1.0. Introduction
    1.1. Background to Text-to-Speech Synthesis
    1.2. The Text-to-Speech Synthesis Process
    1.3. Gaps in Existing Research
    1.4. Research Focus and Proposed Scope of Contribution
    1.5. Organisation and Presentation of Study
Chapter 2: Research Design and Methodology
    2.0. Introduction
    2.1. Research Questions
    2.2. Research Design and Process
        2.2.1. Concise Overview of Conventional Waveform Techniques
        2.2.2. Toolkit Selection for the Jamaican Creole Voice Building Project
        2.2.3. Design and Implementation of a Standard Data Set for the JC Voice
        2.2.4. Evaluation of the Jamaican Creole Synthetic Speech
Chapter 3: Adaptation of Unit Selection Voice Building to Jamaican Creole
    3.0. Introduction
    3.1. Unit Selection Voice Building Process within the Festival Framework
    3.2. Speech Corpus Design for navi
        3.2.1. Speaker Definition for the Jamaican Creole Synthetic Voice
        3.2.2. Annotation of Jamaican Creole Prompts
    3.3. Jamaican Creole Phoneme Inventory and Lexicon
    3.4. Assignment of Prosody in Corpus-Based Speech Synthesis
        3.4.1. Phrase Boundaries and Phrasal Breaks
        3.4.2. Duration Assignment
        3.4.3. Defining Intonation and Fundamental Frequency Parameters
    3.5. Speech Signal Analysis for the Jamaican Creole Synthetic Voice
        3.5.1. Pitch Mark Extraction
        3.5.2. Modification of Pitch and Timing
Chapter 4: Evaluation of Jamaican Creole Synthetic Speech
    4.0. Introduction
    4.1. Key Role of the Evaluation of Synthetic Speech
        4.1.1. Current Evaluation Methods
    4.2. Evaluation Design
    4.3. Evaluation Process
    4.4. Presentation of Objective Assessment
        4.4.1. Ongoing Assessment during Voice Building
        4.4.2. Assessing Acoustic Accuracy and Prosody Generation
    4.5. Measuring User Perception of Jamaican Creole Synthetic Speech
        4.5.1. Instrument
        4.5.2. Pilot Evaluation
        4.5.3. Formal Evaluation
    4.6. Benchmarking Jamaican Creole Synthetic Speech
Chapter 5: Voice Building in Festival: Limitations and Recommendations
    5.0. Introduction
    5.1. Limitations Observed and Recommendations
Chapter 6: Conclusion and Recommendations
References
Appendices
    Appendix 1   Installing and Compiling Festival Framework and Tools
    Appendix 2   Direction-Giving in Jamaican Creole Questionnaire
    Appendix 3   Sample Questionnaire Respondent - MB
    Appendix 4   Sample Questionnaire Respondent - CR
    Appendix 5   Jamaican Creole Navigation Prompts
    Appendix 6   Jamaican Creole Phoneme Inventory
    Appendix 7   Jamaican Creole Letter-to-Sound Rule Set
    Appendix 8   Evaluating Jamaican Creole Synthetic Speech
    Appendix 9   Sample Fall-off Survey
    Appendix 10  Sample Respondent Data

Table of Figures

Figure 1.1   Text-to-Speech Synthesis 1770 - 2000
Figure 1.2   von Kempelen’s Acoustic Mechanical Speaking Machine
Figure 1.3   Schematic Circuit of Dudley’s Manually Controlled VODER
Figure 1.4   General Text-to-Speech Synthesis Process
Figure 2.1   Inside the Corpus Based (Unit Selection) Type Synthesizer
Figure 2.2   Viterbi Search and Unit Selection Candidates
Figure 2.3   Calculating Optimal Path in Unit Selection Speech Synthesis
Figure 2.4   Demonstrating Festival Heterogeneous Relation Graph in JC
Figure 3.1   Navigating using Jamaican Creole
Figure 3.2   Satellite Map of Mona Heights, Kingston, Jamaica
Figure 3.3   NLP Module and Linguistic Analysis
Figure 3.4   Block Diagram of a Concatenative Synthesizer
Figure 3.5   Jamaican Creole Segments
Figure 3.6   Duration for JC Phoneme /a/ using Praat
Figure 3.7   Speech Segment Representing navi003.wav – 3 /chrii/
Figure 3.8   Modification of Pitch and Timing Through Cepstral Analysis
Figure 4.1   Dichotomy of Black Box-Glass Box Testing
Figure 4.2   Sample Error 1
Figure 4.3   Sample Error 2
Figure 4.4   Sample Error 3
Figure 4.5   Waveform of Natural Speech from JC Speech Corpus
Figure 4.6   Waveform of JC Synthesized Speech Output
Figure 4.7   Source Data from JC Speech Corpus
Figure 4.8   JC Synthesized Speech Output
Figure 4.9   Source Data from JC Speech Corpus
Figure 4.10  JC Synthesized Speech Output
Figure 4.11  Comparison of Speech Quality
Figure 4.12  Speech Output Similarity
Figure 4.13  Original versus Synthesized Speech
Figure 4.14  User-Based MOS of JC Synthetic Speech Quality
Figure 4.15  Comparing Speech Quality of Audio Clips
Figure 4.16  Comparison of Synthetic and Original Speech

List of Tables

Table 3.1  Distinctive Feature Matrix for Jamaican Creole Vowels
Table 3.2  Jamaican Creole Phoneme Inventory of Consonants
Table 4.1  Comparison of Voice Reports for navi26 using Praat
Table 4.2  Comparison of Voice Reports for navi34 using Praat
Table 4.3  Comparison of Voice Reports for navi104 using Praat
Table 4.4  Profile of Pilot Participants
Table 4.5  Survey Respondents and Fall-off Report
Table 4.6  Sample Comments Submitted by Participants

List of Acronyms

CALL     Computer Assisted Language Learning
CLP      Cassidy-Le Page
CMU      Carnegie Mellon University
CSTR     Centre for Speech Technology Research
DCT      Discrete Cosine Transform
DFT      Discrete Fourier Transform
DRT      Diagnostic Rhyme Test
DSP      Digital Signal Processing
EST      Edinburgh Speech Tools
HMM      Hidden Markov Model
IPA      International Phonetic Association
JC       Jamaican Creole
JLU      Jamaican Language Unit
LPC      Linear Predictive Coding
MCEP     Mel-cepstrum
MFCC     Mel-Frequency Cepstral Coefficient
MOS      Mean Opinion Score
NLP      Natural Language Processing
PCM      Pulse Code Modulation
PSOLA    Pitch Synchronous Overlap and Add
SJE      Standard Jamaican English
SS       Speech Synthesis
TTS      Text-to-Speech Synthesis
UCLR     Unit for Caribbean Language Research
US       Unit Selection
VOCODER  Voice Operated ReCorDER
VODER    Voice Operating DEmonstratoR


Chapter 1: Introduction and Background to Study

1.0. Introduction

Text-to-Speech Synthesis, or Speech Synthesis as it is also commonly called, can be defined as the “production of speech by machine, by way of the automatic phonetization of the sentences to utter” (Dutoit 1997, 13). This definition, in essence, speaks to the process of automatic reading, or the act of “getting computers to read out loud” (Taylor 2009, xix - xxii, 1). Louw extends Dutoit’s definition by referring to “the automated process of mapping a textual representation of an utterance onto a sequence of numbers representing the samples of synthesized speech” (2008, 165). A simpler yet more concise definition can be found in Ladefoged (2005), where Text-to-Speech Synthesis is presented as the artificial production or generation of human speech.

1.1. Background to Text-to-Speech Synthesis

In reviewing the definitions provided in the preceding introductory paragraph, we note important terms and preliminary concepts that underscore the subject area and help us to appreciate some of the steps required to produce machine-generated speech in the Jamaican Creole language. These include (i) the phonetization of utterances, (ii) the mapping of text onto corresponding numerical values, (iii) the provision of synthesized speech samples, as well as (iv) the automation of the overall process. We will return to each of these points throughout the study, going into more depth at varying stages of the synthetic speech generation process and the presentation of findings. First, we conclude our introduction and background, thereby laying the foundation for our study.

Figure 1.1 below, adapted from Lemmetty (1999), chronologically charts the early history of Text-to-Speech Synthesis. It presents some of the major milestones in the area of study between the late eighteenth century and the twentieth century. These milestones laid the foundation for work that continues in Speech Synthesis to this day. The chart also provides a good reference point, allowing us to compare what was required and available to generate speech in the developmental years of Speech Synthesis with what is now available to us as we seek to create synthesized speech in Jamaican Creole. We summarise this milestone chart by presenting what we perceive to be the three major phases which Speech Synthesis has undergone, namely (1) the hardware-driven years between 1770 and 1940, (2) the theory-driven and software-based years of the 1950s to 1990s, and (3) modern-day Speech Synthesis from 2000 to the present.

Figure 1.1  Text-to-Speech Synthesis 1770 - 2000

From the late 1770s to the 1930s, work in the area of Speech Synthesis focussed primarily on the creation of physical speech models and machines that emulated the human speech production act. In 1779, Christian Kratzenstein, a Danish scientist at the Russian Academy of Sciences, built the first known physical model of the human vocal tract. His successful attempt influenced the work of other researchers such as Wolfgang von Kempelen, who went on to produce the bellows-operated acoustic mechanical speaking machine some twelve years later (Lemmetty 1999). The design of von Kempelen’s mechanical speech machine, presented in Figure 1.2 below, was based on a perceived model of the human articulatory system and was able to imitate and produce roughly twenty human speech sounds (Dutoit 1997). It required a human operator to function and consisted of small pipes, which simulated the human nostrils, a main bellows to mimic the human lungs, a reed simulating the human vocal cords, and an India-rubber cup to represent the human nose and mouth.

Figure 1.2  von Kempelen’s Acoustic Mechanical Speaking Machine

Although von Kempelen’s bellows-operated system was a phenomenal breakthrough in the then-known field of Speech Synthesis, it was not until 1939 that the first device that would be considered a speech synthesizer by modern usage of the term emerged. This system was the VODER [1] (Voice Operating DEmonstratoR), a manually operated electrical speech system developed by Homer Dudley, a pioneering acoustic engineer with AT&T Bell Laboratories.

[1] The VODER, built for an educational exhibit, was the manually controlled version of Dudley’s voice coder (VOCODER). The VOCODER was a system that sought to reduce the amount of transmission bandwidth required, thereby increasing the number of calls that could be sent over a telephone line at any given time.

Figure 1.3  Schematic Circuit of Dudley’s Manually Controlled VODER

The VODER had the ability to imitate animal sounds and could ‘talk’ practically any language that the operator spoke. The schematic above helps to paint a more accurate picture of the text-to-speech synthesis process: an input source message is keyed in and passed (by the operator) through a carrier speech system, where it undergoes modulation and other modifications, such as changes in pitch, later emerging as speech (Harrington and Cassidy 1999, 148-149; Taylor 2009, 151).

The work carried out in the field of speech synthesis by these early pioneers laid the foundation for work in the late twentieth century. Researchers of that period added to these earlier successes by (1) formalising and presenting, in the 1950s, the three major techniques [2] upon which modern speech synthesis is based, namely Articulatory Synthesis, Formant or Rule-based Synthesis and Concatenative or Corpus-based Synthesis, (2) producing the first full speech synthesis system, or synthesizer, for the English language in 1968, and (3) introducing and making commercially available, in the late 1980s, devices and applications with embedded synthetic speech functions. Examples include the now vintage collectible Speak and Spell toy from Texas Instruments and the Sega arcade games.

Having surmounted the obstacles of generating synthetic speech and creating speech systems, the goal of present-day Speech Synthesis is somewhat different. As demonstrated by the work being carried out within the community at the time of this research, the objective of modern Speech Synthesis tends to focus on creating speech systems that generate synthetic speech that is less robotic, highly intelligible and more natural, which is to say of acoustic quality comparable to human speech. The goal is also to produce systems with open-ended flexibility, able to generate output regardless of whether a model utterance or source reference text was provided for that specific combination. In addition, the focus is on widening the range of speech system applications, so that those available can accurately generate and utilise synthetic speech in a wide array of domains and in as many languages as possible (Taylor 2009; Black and Lenzo 2004).

[2] The fourth formal speech synthesis technique, Hidden Markov Model (HMM)-based Speech Synthesis, did not emerge until the early twenty-first century.

1.2. The Text-to-Speech Synthesis Process

A speech synthesizer and a waveform generation technique are the two key components required to develop and generate synthesised speech. The speech synthesizer to which we refer is defined as a machine or speech system that has the ability to read the given input text out loud (Dutoit 1997); in essence, it is fully capable of transforming a given text-based input data stream into output speech that accurately corresponds to the input text provided (Taylor 2009).

In the formative years of speech synthesis, as shown in earlier paragraphs, this speech system or speech synthesizer would normally have been hardware driven. However, the invention and implementation of very-large-scale integration (VLSI) circuits in the 1970s brought about a change. VLSI circuits made it possible to implement a complete speech synthesizer on a single application-specific integrated circuit. This invention paved the way for present-day speech synthesizers to be rooted primarily in software rather than hardware. Although some dedicated speech synthesis hardware still exists today, the majority of speech systems are software-based (Dutoit 1997). In the case of our Jamaican Creole synthetic speech, we used an open source voice creation toolkit and accompanying tools to accomplish our voice building objectives.

A speech synthesizer, whether hardware or software based, is made up of two major modules, namely (1) a Natural Language Processing (NLP) module, commonly termed the Front End, and (2) a Digital Signal Processing (DSP) module, also referred to as the Back End. These two components are chiefly concerned with analysis and waveform generation respectively. In present-day software-based applications, the NLP and DSP modules are abstract components.

Module 1, the NLP module or so-called Front End, handles the Natural Language Processing that takes place within the system and is divided into two blocks used to perform Text Analysis and Linguistic Analysis. Within these two blocks, the required key tasks of text pre-processing, text normalisation, text processing, phonetic analysis and linguistic-syntactic analysis are performed. Module 2, the DSP module, commonly termed the Back End, is wholly concerned with Waveform Synthesis, the actual generation of waveforms (Dutoit 1997, 14; Black 1999, 2000; Weerasinghe et al. 2007). To illustrate the text-to-speech synthesis process, Figure 1.4 below presents the general anatomy of a basic speech synthesis engine and the general speech synthesis process.

Figure 1.4  General Text-to-Speech Synthesis Process (Text → Natural Language Processing: Text Analysis, Linguistic Analysis → Digital Signal Processing: Speech Signal Analysis, Waveform Generation → Speech)

During the general Text-to-Speech Synthesis process, the input text is first accepted into the Text Analysis block of the Natural Language Processing (NLP) module. The primary goal of this block is to identify chunks of text and then produce a list of words from the string of characters received. Within the Text Analysis block, the input text undergoes several processes. These include text normalisation, whereby the text received is analysed according to both language-specific and natural language rules pre-stored by the developer, the identification of tokens and token types and their subsequent mapping onto words, and part-of-speech (POS) tagging. The input text is tokenised using the tokenisation method pre-defined by the developer, or the speech system default where none is defined; examples of tokenisation methods include splitting on whitespace or punctuation (Schroeter 2008). The open source Festival Speech Synthesis System used in this study utilises both whitespace and punctuation as the default tokenisation methods (Taylor et al. 1998). At this level of natural language processing, the input text is further subjected to an additional normalisation process, whereby Non-Standard Words (NSWs) are converted into their textual equivalents. NSWs include numbers (ordinal, cardinal), dates, times and special characters or symbols such as the percentage sign (%) and the dollar currency sign ($) (Alam, Nath, and Khan 2007; Weerasinghe et al. 2007).
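To make the text analysis stage concrete, the short sketch below illustrates whitespace/punctuation tokenisation and the expansion of one class of non-standard words (bare digits) into Jamaican Creole number words. It is purely illustrative: the function names are our own, the digit-to-word table is deliberately incomplete (the forms wan, chrii, twenti and febyuweri are taken from examples used elsewhere in this thesis), and it does not reproduce Festival’s actual text analysis code.

```python
import re

# Illustrative (incomplete) mapping of digits to Jamaican Creole number words;
# the entries are drawn from forms used elsewhere in this thesis, and the
# table would have to be filled out for a real voice.
JC_NUMBERS = {"1": "wan", "3": "chrii", "20": "twenti", "21": "twenti wan"}

def tokenise(text):
    """Split an input string into tokens on whitespace and punctuation,
    mirroring the default behaviour described for Festival above."""
    return [tok for tok in re.split(r"[\s,.!?]+", text) if tok]

def normalise(tokens):
    """Expand non-standard words (here, bare digit strings) into their
    Jamaican Creole textual equivalents before phonetic analysis."""
    out = []
    for tok in tokens:
        if tok.isdigit() and tok in JC_NUMBERS:
            out.extend(JC_NUMBERS[tok].split())
        else:
            out.append(tok.lower())
    return out

if __name__ == "__main__":
    # An illustrative date-like prompt fragment.
    print(normalise(tokenise("febyuweri 21")))
    # -> ['febyuweri', 'twenti', 'wan']
```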

The correct interpretation or expansion of abbreviations and acronyms, where present in the language being modelled, is also done at this level. Although this is a mandatory step for synthetic voices created for English, the major contributing lexifier for Jamaican Creole, it is not an obligatory step for Jamaican Creole.

Having performed text analysis on the input text, the processed output, comprising tokenised words, is then passed to the second block of the NLP module to undergo Linguistic Analysis. The objective of the Linguistic Analysis block is to perform phonetic and prosodic analyses and Grapheme-to-Phoneme (G2P) conversion of the tokenised output received from the Text Analysis block. Using both language-specific and generic prosodic information, as well as pre-stored letter-to-sound (LTS) rules and a lexicon to predict pronunciation, a phonetic transcription of the tokenised text is produced. Desired prosodic features, such as phrasing, intonation and rhythm, which must all be pre-defined and provided during initial voice set-up by the developer, are then added (Dutoit 1997; Lemmetty 1999).

The resulting output of phonemes, encoded with prosody-rich information, is then passed on to the Digital Signal Processing (DSP) module as symbolic information. Within this module it undergoes speech signal analysis and subsequent waveform generation (Taylor 2009). The role of the DSP module or back end is to synthesize this abstract information of phones, durations and fundamental frequency (F0) received from the NLP module and to generate an acoustic digital signal corresponding to the original input text. In order to perform this task, the DSP module uses the waveform generation technique that is either selected by the developer or supported by the framework within which it is operating.
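The grapheme-to-phoneme step described above can also be illustrated with a small sketch. Festival expresses letter-to-sound rules in its own Scheme-based rule format, and our actual Jamaican Creole rule set is given in Appendix 7; the Python fragment below is only a hedged approximation of the longest-match idea, and the grapheme-to-phoneme pairs it contains are illustrative assumptions for a Cassidy/JLU-style orthography rather than the full inventory.

```python
# A minimal longest-match grapheme-to-phoneme sketch. The pairs below are
# illustrative assumptions; the real JC rule set lives in Appendix 7 and is
# written in Festival's own letter-to-sound rule format.
JC_G2P = {
    "ch": "ch", "ii": "ii", "aa": "aa",
    "a": "a", "e": "e", "i": "i", "o": "o", "u": "u",
    "b": "b", "d": "d", "f": "f", "k": "k", "l": "l", "m": "m",
    "n": "n", "p": "p", "r": "r", "s": "s", "t": "t", "w": "w", "y": "y",
}

def grapheme_to_phoneme(word):
    """Greedily match the longest known grapheme at each position and emit
    the corresponding phoneme symbol."""
    phones, i = [], 0
    while i < len(word):
        for size in (2, 1):                 # try digraphs before single letters
            chunk = word[i:i + size]
            if chunk in JC_G2P:
                phones.append(JC_G2P[chunk])
                i += size
                break
        else:
            i += 1                          # skip anything we cannot map
    return phones

print(grapheme_to_phoneme("chrii"))         # -> ['ch', 'r', 'ii']
print(grapheme_to_phoneme("twenti"))        # -> ['t', 'w', 'e', 'n', 't', 'i']
```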

1.3. Gaps in Existing Research

Virtually everywhere around us are products with computerised or machine-generated speech at their core. Synthetic voices are widely used in avatars, chat applications, bus systems with next-stop announcement features, directory assistance, at airports announcing arrivals, departures and other key flight-related information, at terminals and information-giving kiosks, in screen readers and computer navigation help systems for the visually impaired, as well as in educational talking books and toys. Immense work has been ongoing in the field of Speech Synthesis, particularly within the past decade (Black and Lenzo 2004). This increase, we believe, has been fuelled in part by the availability of open-source speech synthesis voice creation toolkits.

In a 1987 journal report, Devonish posed a rather intriguing question in the title itself: “Can computers talk Creole?” The report proposed the modification of existing operating systems to allow messages in Creole languages to be used in computer output. This was to be achieved through the customisation of the keyboard to allow for the accurate representation and keying of symbols used in vernacular languages, as well as through the use of Caribbean Creole language-specific spell checkers in word processing (Devonish 1987). Although the report does not make specific reference to or use of the terms text-to-speech synthesis and speech synthesis, the question posed and the issues debated within it have particular relevance to this research project.

A review of the literature, including copious conference proceedings on speech synthesis and speech processing, reveals that, in the true spirit of Devonish’s report, the research focus in the field of speech synthesis has widened to include non-European languages such as Sinhala, Telugu, Hindi, Amharic and Maori, to name a few (Black and Lenzo 2004). In fact, pioneering work on Artificial Intelligence, Human Language Technologies, Machine Translation and Speech Synthesis in Creole languages started in the late 1990s with Haitian Creole (Allen 1992; Allen 1998; Eskenazi et al. 1997; Lenzo, Hogan and Allen 1998).

More in-depth research in the subject area and in the literature on Speech Synthesis has revealed that whilst there are many recorded Jamaican voices currently used as add-ons or plug-ins in various dialogue and navigation systems and devices, there does not currently exist synthesized speech for the Jamaican Creole language, nor a Jamaican Creole synthetic voice in use in these devices or applications. Additionally, our research has not revealed the existence of a road map or accurate documentation on how to build synthetic voices for Creole languages, specifically those historically documented as oral languages, for which written texts in a formal orthography may not be easily available. Whilst there are copious presentations of systems created and analyses of results, there is less in the way of road maps or guides to synthetic voice creation. In fact, our review of the literature did not reveal any well-known documented source on speech synthesis for the Jamaican Creole language, nor a recommended waveform generation technique, recommended type of speech unit, or evaluation methods and benchmarks to use for Speech Synthesis in Creole languages. Recognising that there is much work to be done on Speech Synthesis and Creole languages, specifically in relation to Jamaican Creole, this research is being conducted in an attempt to start filling that gap. We propose to create quality domain-specific synthetic speech for Jamaican Creole, to provide an accurate road map and reference guide for other researchers interested in generating synthetic speech and synthetic voices for Jamaican Creole, and to set a benchmark for new voices created.

1.4. Research Focus and Proposed Scope of Contribution

In reviewing the speech synthesis applications and speech systems around us, we notice that, in general, they tend to be domain-specific rather than general-purpose. Thus, although there are open-domain or general-purpose speech applications, such as reading emails and text messages, many systems, if not most, provide speech applications that are more targeted than generic in nature (Black and Lenzo 2007). This can be seen in speech applications such as those found in talking clocks, telephone systems, weather systems, information-giving systems, navigational devices and information kiosks. In these systems, the information being provided is normally specific to a particular domain, such as telling the time, providing a requested telephone number or directory assistance, providing the current weather conditions, giving directions, or other targeted information. According to Rudnicky et al. (2000), when applications use utterances that are closer to the domain of the target database, the quality of the output synthesis is often better. Indeed, by creating a speech database that is closer to the intended target domain, the speech application tends to provide a richer match in both prosody and intonation when deployed and used within that domain.

The target domain for the proposed Jamaican Creole synthetic voice is in-car street-level voice navigation. Thus the database of standard prompts and vocabulary created for the Jamaican Creole voice is tailored for targeted use in a navigation system providing street-level local directions in one of the two languages spoken in Jamaica, namely Jamaican Creole. It should be noted, however, that although the corpus database was customised for the specific domain to enhance intelligibility and naturalness, the incorporation of letter-to-sound (LTS) rules for the Jamaican Creole phoneme inventory makes it possible for the voice to have a much wider and more general usage.

Notwithstanding the vast knowledge to be gleaned from conducting research in text, speech and language technology in Jamaican Creole, for example in relation to prosody and emotive synthesis, the scope and significance of this research is both short-term and long-term. In the short term, the initial significance of this research is as follows: (1) to successfully customise open source voice creation software and create working domain-specific synthetic speech for Jamaican Creole, using the cluster unit selection technique; (2) to provide a road map and an authoritative body of knowledge and reference source for the generation of synthetic speech in Jamaican Creole, by accurately documenting the overall cluster unit selection domain-specific synthetic speech design and creation process; (3) to propose to users of navigation systems another viable language option for receiving instructions, namely Jamaican Creole.

The long-term applications of Jamaican Creole synthetic speech are both commercial and research-oriented. Commercially, a Jamaican Creole synthetic voice or synthesized speech represents an extremely practical and viable option in local airports, providing flight and other information, as well as at information kiosks. Additionally, it would be a beneficial feature in transportation centres such as the Half-Way-Tree Transportation Centre in Kingston, Jamaica, where monolingual Jamaican Creole commuters would have the option of accessing transit information in their native tongue through a guided system, or where bilingual commuters could choose to switch between available language options to access transit information. Other significant applications can be envisaged in governmental as well as non-governmental agencies, in both urban and rural areas. These agencies, for example the National Meteorological Centre, the Office of Disaster Preparedness and Emergency Management (ODPEM), the Registrar General’s Department (RGD) and the Inland Revenue Department (IRD), to name a few, are all obligated to provide essential, time-sensitive information that is comprehensible to all. Currently, services are only offered in Standard Jamaican English, and the wait time to gain access to a human customer service representative can be tedious. We believe Jamaican Creole synthetic speech would assist these agencies to cater to the population as a whole in a more timely and responsive manner.

1.5. Organisation and Presentation of Study

The thesis is organised into six chapters, not including the list of references and the appendices of other project-related material used or derived from conducting this project.

In Chapter 1, we present an overview of speech synthesis, the background to the project and the research focus. Chapter 2 outlines the research design and methodology used in the project; in this chapter, a concise overview of current conventional waveform techniques and the specific technique and tools used within this project is provided. Using a chronological approach, Chapter 3 describes the adaptation of the speech synthesis voice building process to the Jamaican Creole language, including the design of the requisite data corpus and language-specific customisation of the Festival TTS Natural Language Processing module. Chapter 4 presents the results and findings of the Jamaican Creole synthetic voice assessment, conducted using objective and subjective methods, and addresses the issue of benchmarking for the Jamaican Creole synthetic voice. Chapter 5 summarises some key limitations observed in the framework used for voice building and puts forward some recommendations. Chapter 6 summarises and concludes the study, providing recommendations for additional work on the Jamaican Creole synthetic voice created as well as within the research area under discussion.

Chapter 2: Research Design and Methodology

2.0. Introduction

The primary objective of this study, as previously stated, is to create quality Jamaican Creole synthetic speech using open source software. The proposed synthetic speech is domain-specific, being created primarily for use in navigation. In this chapter, we present the research design and research process used to accomplish this goal. To that end, we proposed three research questions, which served as guidelines throughout the study. These questions are presented in the section below.

2.1. Research Questions

1. What current speech synthesis waveform techniques and voice creation toolkits exist, and how do we adapt and implement the synthetic speech generative principle to the Jamaican Creole language?

2. What standard data set and speaker characteristics do we propose as a model for navi, the Jamaican Creole navigation synthetic voice, and how do we go about designing and providing these?

3. What is the overall degree of acoustic quality, naturalness and acceptability of the synthetic speech created, and how does the Jamaican Creole synthetic speech measure against current industry-standard benchmarks?

2.2. Research Design and Process

Our research design and process consisted of three principal tasks, namely to (1) identify and select an appropriate current waveform generation technique and voice creation framework, (2) create a standard data set of Jamaican Creole navigational prompts and adapt the identified waveform generation technique to Jamaican Creole within the selected framework, and (3) evaluate and benchmark the synthetic voice created. These tasks correspond to, and provide answers to, each of the above-mentioned research questions. Each stage is presented in further detail in subsequent chapters.

The first step was to ascertain which conventional waveform techniques were predominantly in use, and in particular which technique or techniques resulted in higher quality and more natural synthetic speech output. In order to determine this, we examined the literature on the subject, reviewing analyses of current speech synthesis waveform techniques as well as other speech synthesis projects.

2.2.1. Concise Overview of Conventional Waveform Techniques

Conventional waveform or synthetic speech generation techniques are generally divided into three major sub-categories, commonly termed First Generation, Second Generation and Third Generation techniques. First Generation techniques are predominantly based on source-filter and speech production modelling and include methods such as Articulatory Synthesis, Formant Synthesis and Rule-based Synthesis. Second Generation synthesis uses methods that are more data-driven, for example Corpus-based Synthesis and Sinusoidal Synthesis. Third Generation techniques use a parametric or statistical machine learning approach to the data (Indumathi and Chandra 2012; Taylor 2009; Dutoit 2008; Weerasinghe et al. 2007).

First Generation speech synthesis techniques, such as Rule-based, Articulatory and Formant Synthesis, tend to have minimal memory requirements but are normally regarded as computationally intensive, with a large Central Processing Unit (CPU) demand. In addition to mathematical modelling to calculate the resulting output sound from its textual input, Articulatory Synthesis uses models of the human articulators, namely the teeth, tongue, lips, jaw and vocal cords, to simulate the airflow in human speech production. Formant or Rule-based Synthesis takes the articulatory approach one step further by modelling the resonances, which is to say the formants or spectral peaks, found in natural human speech. The von Kempelen acoustic mechanical speaking machine and the VODER, introduced in the preceding chapter, are examples of speech systems based on these First Generation approaches (Klatt 1987; Allen 1992; Dutoit 1997; Black, Taylor and Caley 1999). Although First Generation techniques are noted for providing more “accurate” synthetic speech (Lemmetty 1999, 3), they are normally viewed as generally obsolete or rarely used nowadays in commercial systems. Most modern speech systems are created using Second and Third Generation techniques, which are perceived as reliably producing more “natural sounding” synthetic speech in comparison. Second and Third Generation speech synthesis techniques include the Concatenative or Corpus-based approach and the Parametric or Hidden Markov Model-based Speech Synthesis (HTS) method (Lemmetty 1999; Black, Taylor and Caley 1999; Black and Lenzo 2007; Dutoit 1997, 2008; Zen, Nose, et al. 2007; Sproat and Olive 1999; Sproat 2008; Taylor 2009).

Concatenative or Corpus-based Speech Synthesis utilises a pre-recorded speech corpus database that is accessed at synthesis run time to provide phones (the smallest unit of speech) and diphones (the most stable part of a transition between an adjacent pair of phones) which closely match the input text for concatenation. Of the three genres of corpus-based synthesis, namely diphone-based, limited domain and unit selection, unit selection is the most widely used method. Unit Selection (US) has long been recognised by researchers and developers as being able to produce the most natural sounding synthetic speech for general synthesis (Sproat 2008; Taylor 2009). It is this Corpus-based speech synthesis technique that is used within this study.

Hidden Markov Model-based Speech Synthesis, or HTS-based Speech Synthesis, is a Third Generation speech synthesis technique. It is a parametric approach based on the statistical concept of Hidden Markov Models. The HMM approach uses both data training and synthesis components in an effort to produce more natural sounding speech, and provides the speech system with the added edge of continuous refinement through ongoing data training and usage (Zen, Nose, et al. 2007; Taylor 2009). Unlike Corpus-based Speech Synthesis, statistical waveform techniques do not require access to the recorded speech database at synthesis time; however, as Zen, Nose et al. (2007) note in a detailed description and analysis of the technique, it suffers from the inability to provide models that are both accurate and reliable.

Having reviewed First, Second and Third Generation speech synthesis techniques, including their differences, strengths and weaknesses, we were able to pinpoint the specific technique that met our objective. Of the three Corpus-based speech synthesis techniques outlined in the preceding section, Cluster Unit Selection-based Synthesis (US) is the most widely used Concatenative synthesis for general synthesis. It is widely recognised as producing the most natural sounding synthetic voices for general synthesis; in fact, Taylor describes this method as the “dominant synthesis technique in text-to-speech today” (2009, 474). There are several reasons for its popularity amongst researchers and developers. These include the creation of new voices in previously unsupported languages in less time, higher intelligibility and naturalness, as well as better speech quality of the voices created (Latacz, Mattheyses and Verhelst 2011).

Researchers such as Indumathi and Chandra (2012) place Unit Selection synthesis in the Third Generation category alongside HMM synthesis, while others such as Taylor (2009) and Dutoit (2008) place it with Second Generation techniques within the Concatenative or Corpus-based synthesis family. While they may not agree on the specific category placement, the researchers do agree on the following three points, namely that the Unit Selection speech synthesis technique (1) is an extension of second generation speech synthesis, (2) is the dominant speech synthesis technique currently in use, and (3) produces high quality and natural sounding synthetic speech. As it relates to the output quality of the synthetic speech created using this technique, it is widely appreciated that (1) the researcher has the option of storing “several instances of each unit”, (2) the system has the means of choosing from among these varying instances at “run time”, and (3) signal degradation in the final speech output generated is seemingly less when compared to other waveform generation techniques (Dutoit 2008, 447).

The key foundation upon which this synthesis genre rests is the system having a pre-recorded database or speech corpus upon which speech signal analysis can be performed. Although this dependency on a speech database results in a large footprint, the entire speech database of natural units, that is all the units as well as their various instances and speech sequences, can potentially be used during speech processing and output generation, thereby resulting in better quality (Dutoit 2008, 447; Black and Taylor 2007; Taylor 2009). It is therefore crucial that attention be paid to database preparation, as this has a direct impact on the quality of the output speech generated. Figure 2.1 below, adapted from Dutoit (2008), provides a schematic view of a unit selection based speech synthesizer, highlighting the requisite speech corpus, which remains the backbone of Concatenative or Corpus-based speech synthesis.

Figure 2.1  Inside the Corpus Based (Unit Selection) Type Synthesizer

In Unit Selection (US) speech synthesis, units which (a) belong to the same phone class and (b) share certain similarities within or across their phonetic and prosodic contexts are clustered or grouped together for concatenation when required. Once this grouping of speech units is established, a Viterbi-style search algorithm is used to determine the “optimal path”, which is to say the best sequence of units with the lowest total cost (Black and Taylor 1997; Taylor 2009; Hunt and Black 1996). Figure 2.2, adapted from Tihelka, Kala, and Matousek (2010), visually demonstrates this Viterbi search for the optimal path for two candidates.

Figure 2.2  Viterbi Search and Unit Selection Candidates

The optimal path or best sequence of units in Unit Selection is calculated from a weighted cost function with two components, namely the target cost and the join (concatenation) cost. The target cost Ct(ti, ui) estimates the difference between a database unit ui, a unit actually present in the speech corpus, and its target ti, the specification of the unit to be synthesized (Hunt and Black 1996). Figure 2.3, adapted from Hunt and Black, illustrates this “sum of target and concatenation costs” calculation as used within unit selection to find the optimal path “through the state transition network” with the lowest total cost (1996, 374-375), whereby optimal path cost = (sum of target costs) + (sum of concatenation costs).
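Written out, and following the notation of Hunt and Black (1996), the total cost of a candidate unit sequence u1 … un for targets t1 … tn is the quantity the Viterbi search minimises:

```latex
% Total cost of a candidate unit sequence (Hunt and Black 1996):
C(t_1^{n}, u_1^{n}) = \sum_{i=1}^{n} C^{t}(t_i, u_i) + \sum_{i=2}^{n} C^{c}(u_{i-1}, u_i)

% The unit selection search returns the sequence with the lowest total cost:
\hat{u}_1^{n} = \arg\min_{u_1^{n}} \, C(t_1^{n}, u_1^{n})
```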

Figure 2.3  Calculating Optimal Path in Unit Selection Speech Synthesis
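To make the search concrete, the sketch below implements the dynamic-programming idea behind Figures 2.2 and 2.3 over toy data. It is a hedged illustration only: the function names, candidate structures and cost values are our own assumptions, and it does not reproduce the actual cluster unit selection search code inside Festival.

```python
def select_units(targets, candidates, target_cost, join_cost):
    """Viterbi-style search over candidate units.

    targets     -- list of target specifications t_1 .. t_n
    candidates  -- candidates[i] is the list of database units for target i
                   (the 'cluster' of units of the right phone class)
    target_cost -- function (target, unit) -> float
    join_cost   -- function (previous_unit, unit) -> float
    Returns the unit sequence with the lowest total cost.
    """
    n = len(targets)
    # best[i][j] = (lowest cost of any path ending in candidate j at step i,
    #               index of the predecessor candidate on that path)
    best = [[(target_cost(targets[0], u), None) for u in candidates[0]]]
    for i in range(1, n):
        row = []
        for unit in candidates[i]:
            tc = target_cost(targets[i], unit)
            cost, back = min(
                (best[i - 1][k][0] + join_cost(prev, unit) + tc, k)
                for k, prev in enumerate(candidates[i - 1])
            )
            row.append((cost, back))
        best.append(row)
    # Trace back from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(n - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    return list(reversed(path))

# Toy usage: pick between two recorded instances of each phone by duration.
targets = [{"phone": "ch", "dur": 0.09}, {"phone": "r", "dur": 0.05}]
candidates = [
    [{"phone": "ch", "dur": 0.08}, {"phone": "ch", "dur": 0.12}],
    [{"phone": "r", "dur": 0.05}, {"phone": "r", "dur": 0.10}],
]
tc = lambda t, u: abs(t["dur"] - u["dur"])   # closeness to the target
jc = lambda a, b: 0.01                       # flat join penalty in this toy case
print(select_units(targets, candidates, tc, jc))
```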

2.2.2. Toolkit Selection for the Jamaican Creole Voice Building Project

In addition to selecting the specific technique that best met our criteria, we also sought to determine which speech synthesis frameworks and voice creation toolkits existed that would assist us in achieving our objective of creating cost-effective, high quality speech synthesis. In our research, we placed emphasis on open source, non-proprietary software, which we believe would allow us to develop and generate quality synthetic speech at a more economical rate. In doing so, any speech product that eventually arose from our synthetic voice creation project could be distributed to a prospective end user at a very low cost. The low development cost, we believe, would also prompt more researchers to embark on similar projects.

A digital search for information on speech synthesis frameworks, voice creation toolkits and speech systems, using keywords such as speech synthesis OR speech synthesizers OR speech synthesis software, will return an inexhaustible list of speech synthesis software and products currently available and widely in use on both a commercial and non-commercial scale. Some of the major modern systems available and widely in use, as noted in the results of an electronic search conducted using search engines and research databases, include NeoSpeech, NCH Swift Sound Software, Text Aloud, ModelTalker, InfoVox, Apple Plain Talk, DecTalk, Bell Labs TTS, AT&T, Acapela, ScanSoft, Loquendo TTS, Cepstral, SoftVoice, Lyricos, Pavarobotti, Orator, Lucent TTS, CyberTalk, eSpeak, Festival Speech Synthesis System, CSLU Toolkit, Mary TTS, Ivona TTS, Natural Voice Reader, Laureate, MBROLA, ETI Eloquence, Rhetorical and IBM WebSphere Voice. Although there are a large number of speech systems available on the market, not all offer a voice creation toolkit component. The majority of these speech systems are proprietary and do not include a voice building toolkit, a speech synthesis engine or program for use by the general public, or step-by-step documentation for developers on how to create new voices, whether for supported or new languages. The ability to purchase, or download for free, and use such a system on one’s operating system as a screen reader, or in any domain for text-to-speech purposes, is far more readily available than the ability to create a new voice within these speech systems. The fact remains that not many speech systems, such as those offered by Bell Labs TTS [3], allow for hands-on development, which is to say the ability to modify modules and the possibility of building new synthetic voices or synthesized speech in new or unsupported languages at a reasonable cost. In addition, where some toolkits do offer the option to create your own voice, there are restrictions and limitations on the languages within which you are able to work, as these systems mostly provide coverage of the major European languages or the system default language. Thus even if you wanted to create a new voice, you would be restricted to using the existing phoneme inventory available in the system. The literature or documentation accompanying these speech systems that covers voice building, or voice or synthesized speech creation, particularly in new languages, is also somewhat sparse.

[3] Lucent Technologies.

One speech system which does offer hands-on development as well as a framework for building speech synthesis systems, and which is also geared toward modification and use by the wider public, is the Festival Text-to-Speech Synthesis System provided by the Centre for Speech Technology Research (CSTR) of the University of Edinburgh. At the time of the submission of this study for examination in 2014, the Festival platform was being distributed under an X11-type license allowing “unrestricted commercial and non-commercial use alike” for voices created using this platform (The Centre for Speech Technology Research n.d.). Along with its partnering Festvox Tools and documentation on voice building, researchers and developers are guided along the path of synthetic voice building in both supported and new languages (Festvox n.d.).

The Festival Text-to-Speech (TTS) Synthesis System is an open source synthesizer, or engine, which was developed in the late 1990s at the Centre for Speech Technology Research (CSTR), University of Edinburgh, one of the leading research groups in the field of text-to-speech synthesis. The Festival toolkit version 2.1, as mentioned in the preceding paragraph, is distributed under an X11-type license. Previous versions of the toolkit, such as versions 1.43 and 1.96, had specific restrictions on the adaptation of synthetic voices created within this platform (Black and Lenzo 2007; Taylor, Black, and Caley 1998; The Centre for Speech Technology Research n.d.).

Synthetic voices created under versions earlier than 2.1 were predominantly for “educational, research and individual use” and primarily geared toward (i) researchers in speech synthesis, (ii) speech application developers, (iii) speech synthesis end users, as well as (iv) persons interested in creating language systems who wish to include synthesis output such as different voices, specific phrasing and dialogue types (Black and Lenzo 2007; Taylor, Black, and Caley 1998; The Centre for Speech Technology Research n.d.). The version of Festival and speech tools used to conduct this voice building exercise for the Jamaican Creole language, Version 2.1, was at the time of this study the “most recent version available for free and unrestricted use”, whether commercial or non-commercial (The Centre for Speech Technology Research n.d.).

The Festival toolkit sets itself apart from other available frameworks by the nature of its modular design and, more importantly, by being one of the most complete open source voice building toolkits available. As it relates to its design features, the overall make-up of the Festival toolkit makes it easy to use as well as to modify. It is modular, and its architecture is separated into (i) a core system and (ii) core modules which offer components that support front-end processing, language modelling and speech synthesis (Louw 2008). Though primarily based on a Concatenative approach, the Festival toolkit also caters to parametric synthesis by offering an HTS engine as one of the many modules in its framework.

The core system of Festival is written in C++ and is essentially fixed; however, the key modules required to perform text processing, prosodic processing and synthesis tasks can be written in either the C++ or Scheme programming languages, and can be modified without adversely affecting other modules. C++ gives the system better data abstraction, while the Scheme scripting language acts as an interface to the system, providing a more generic way to store complex data structures. It is this dual-language approach which allows Festival to operate as both a run-time system and a research platform, adding to its ongoing popularity among researchers and developers (Louw 2008; Taylor, Black, and Caley 1998).

The architectural design of the Festival Speech Synthesis System is based on intersecting relations, which are used to store its utterance structures. A relation is used to define lists, trees or groups of items, while an utterance is made up of a set of (linguistic) items, each having a set of features, such as words and phones, and being related by one or more relations. In Festival, it is possible for linguistic items to belong to more than one structural representation. For example, an item can be represented as belonging to both a word and a syllable, as well as to a segment. This allows for more efficient representation of items (Taylor, Black, and Caley 1998).

This design formalism on which Festival is built separates it from other speech synthesis architectures, which may be built on string-based processing and multi-level data structures. The Festival architecture is neither constrained nor limited to having linguistic items stored only as strings or in linear lists. Instead, it provides the option of storing information using any graph structure, such as trees. In addition, there is an explicit connection between the levels, and Festival seeks to generalise the content of an item, affording the system the ability to store “information of arbitrary complexity” (Taylor, Black, and Caley 1998). In Figure 2.4 below, we visually illustrate this heterogeneous relation graph (HRG) structure found in the Festival TTS framework (Taylor, Black, and Caley 2001), using the Jamaican Creole expression febyuweri twenti wan ‘February twenty-first’. This expression is made up of, or can be divided primarily into, three major word groupings: (1) febyuweri, alternatively febuweri, ‘February’, (2) twenti ‘twenty’, (3) wan ‘one’. All three word groupings can, however, be further sub-divided into smaller parts, splitting into syllables, all the way down to segments and/or phonemes.

Figure 2.4  Demonstrating Festival Heterogeneous Relation Graph in JC (Word, Syl Structure, Syllable and Segment relations)
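As a rough illustration of the idea behind Figure 2.4, the sketch below models items that are shared between a Word, a SylStructure and a Segment relation for the word wan. It is only a hedged Python approximation of the HRG concept: the class names, relation names and the syllabification shown are our own illustrative assumptions and do not correspond to Festival’s actual C++ classes or Scheme interface.

```python
class Item:
    """A linguistic item with a feature set. As in Festival's HRG formalism,
    the same item object may appear in more than one relation."""
    def __init__(self, **features):
        self.features = features

    def __repr__(self):
        return str(self.features.get("name"))


class Utterance:
    """An utterance holds named relations; in this sketch each relation is
    simply a list (or list of trees) of items."""
    def __init__(self):
        self.relations = {}

    def relation(self, name):
        return self.relations.setdefault(name, [])


utt = Utterance()

# Word relation for the JC expression 'febyuweri twenti wan'.
words = [Item(name=w) for w in ("febyuweri", "twenti", "wan")]
utt.relation("Word").extend(words)

# Segment relation for the word 'wan' only (illustrative).
segments = [Item(name=p) for p in ("w", "a", "n")]
utt.relation("Segment").extend(segments)

# SylStructure relation: the word item 'wan' dominates one syllable, which
# dominates the same three segment items. Items are shared, not copied.
utt.relation("SylStructure").append((words[2], [("syl", segments)]))

print(utt.relations["Word"])       # [febyuweri, twenti, wan]
print(utt.relations["Segment"])    # [w, a, n]
```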

32 This overall design concept, according to Taylor et al. (1998), makes it easier to affect changes or provide alternatives, when necessary. Hence changes that are required can be done at run-time instead of relying solely on recompilation as would have been the case if the system had been implemented in solely C or C++. This dual-language implementation combined with its modularity and modifiable modules results in an overall greater efficiency and flexibility. As such, a researcher or developer has the ability to individually alter, whether add or delete, without wreaking havoc on the overall running of the general system (Taylor, Black, and Caley 1998; Taylor et al. 1998). This design concept, without a doubt, certainly adds to the continued popularity of this open source platform as one of the more preferred speech systems and standard application of choice for text-to-speech synthesis and voice building. Over the last decade, the Festival text-to-speech system has indeed become and is commonly regarded as the “defacto standard free toolkit for speech synthesis research” (Louw 2008, 165). In fact, according to Alam, Nath and Khan (2007), Festival is one of the most complete open-source multilingual synthesis systems currently available that offers the general framework for building speech synthesis systems. It is also the starting point for other leading commercial systems, such as AT&T, Cepstral and Nuance (Clark et al 2007). This particular Jamaican Creole voice building project was conducted using Festival version 2.1, originally released in November 2010. Separate and apart from the improved support for cluster unit voices which was of major

As mentioned in preceding paragraphs, Festival version 2.1, unlike previous releases, allows the researcher and developer to incorporate the framework, and products created with it, into both "commercial products and free source products without restriction" (The Centre for Speech Technology Research n.d.).4

Festival provides the developer with the requisite speech engine, the general speech synthesis framework within which to work. However, in order to build new voices or create synthesized speech for new languages in Festival, the developer requires additional tools: the voice building toolkit offered by the partner Festvox project, currently housed at Carnegie Mellon University (CMU), together with the Edinburgh Speech Tools (EST) library. Festvox provides the tools, documentation and scripts pertinent to the voice building process (CMU). According to Black and Lenzo in the Festival documentation Building Synthetic Voices, the Festvox project was "designed to specifically address the issues of building synthetic voices for so-called minority languages or previously excluded languages …" At the time of the drafting of the Festival-Festvox documentation, Festvox had already been used to build synthetic voices for "at least 40 languages" for the Festival Speech System (2007, 761).

4 At the time of submission of this study for examination, version 2.1 was still the "most recent version available for free and unrestricted use" (CSTR, 2014).

This particular study was conducted using Festvox version 2.4, which included, among other important features, support for unit selection voice building (specifically the design, recording and automatic labelling of unit selection databases), lexicon building and grapheme-to-phoneme support, as well as support for the Cygwin environment under Windows.

In describing the key value of the EST library to the overall functioning of the Festival Speech System, particularly as it relates to the creation of new voices, Black, Taylor and Caley (1999) note that the Festival framework not only "sits on top" of the EST library but also derives much of its functionality from it. The Edinburgh Speech Tools is a library used to aid and simplify the process of building speech systems. Its core C++ classes, functions and programs are used predominantly for "manipulating the sorts of objects used in speech processing", supporting waveforms, parameter files, linguistic objects and label files. It is also within the EST system that the classification and regression tree (CART) building C++ program, wagon, is housed (The Centre for Speech Technology Research n.d.). This project used the EST version 2.1 release, the most current version at the time of the submission of this study for examination.

In addition to installed versions of Festival, Festvox and the Edinburgh Speech Tools, a working development environment such as UNIX/Linux or Cygwin (a Linux-like environment for Windows), a working GCC compiler and a speech labelling tool, such as that offered by the EMU Speech Database, are also required.

The Jamaican Creole synthetic voice building project was initially started within a UNIX environment but was ultimately ported to, and completed within, a Cygwin environment on Windows.

Having identified the specific waveform technique and voice creation framework that would be used to accomplish our goal, we then proceeded to customise and adapt this technique to Jamaican Creole within the identified framework. In Chapter 3, we document the actual customisation and adaptation of the language-specific and generic modules and templates of the open-source voice creation framework to the Jamaican Creole language, thereby providing a roadmap for the process.

2.2.3. Design and Implementation of a Standard Data Set for the JC Voice

One of the major components of any navigation system is the list of key street names, place names and points of interest, which normally amounts to hundreds, perhaps thousands, for a city or region. This is normally considered the open domain segment within a closed domain system (Black and Lenzo 2007), as the list is exhaustive, with some street names offering different pronunciation options.

Jamaica is divided into 3 counties (Cornwall, Middlesex and Surrey), 14 parishes and 4 major cities. Information gleaned from the Central Intelligence Agency (CIA) World Factbook and the Mona GeoInformatics JAMNAV project reveals that there are over 46 major towns, some 3,212 settlements, more than 22,121 kilometres of roadways, over 17,000 kilometres of navigable roads, and more than 17,000 points of interest (POIs) in Jamaica (Central Intelligence Agency n.d.; Mona GeoInformatics Institute n.d.).

There are over 17,000 place or street names in Jamaica, all of which are presently written in Standard Jamaican English, each with its oral Jamaican Creole variant(s). There is no known complete electronic or paper database of Jamaican street or place names written in Jamaican Creole, whether in the standard Cassidy-Le Page orthography or any other non-standard orthography. This is by no means surprising, as Jamaican Creole is historically documented as an oral language (Cassidy and Le Page 1976/1980). Thus there is limited literature available in the language, particularly as it relates to certain informational domains. Covering all place names would necessitate either (1) the translation and conversion of over 17,000 place names, street names and POIs from Standard Jamaican English to Jamaican Creole for use in the corpus or lexicon of the Jamaican Creole synthesizer, a research project in itself, or (2) the use of a subset database of street names which adequately represents the targeted language phonetically and prosodically. This subset database would be used in conjunction with the system's grapheme-to-phoneme (G2P) conversion module to provide accurate speech output for street names not stored. The research team decided to randomly select an area to serve as the initial data source for the speech synthesizer's navigation corpus database. This data would then be complemented by the Jamaican Creole letter-to-sound rules outlined within the NLP module, which would be used at runtime to provide pronunciations for street names not explicitly recorded or provided within the database.

In our research project, we designed and produced a standard database of Jamaican Creole navigational prompts that would be used to provide rich, natural prosodic information to our synthetic voice. The utterances that comprise the speech corpus adequately cover the proposed target domain of in-car street level voice navigation. Fully recognising the language variation that exists for Jamaican Creole, we sought to provide a set of prompts that did not favour any one variety of the language over another, but rather represented a list of formal prompts understood and accepted across the board by native Jamaican Creole speakers. In order to provide an unbiased list of standard navigational prompts, we administered a navigation exercise questionnaire to native Jamaican Creole speakers representing different varieties of the language. Using their feedback, we generated a list of prompts considered both standard and acceptable, irrespective of language variety.

2.2.4. Evaluation of the Jamaican Creole Synthetic Speech

No system or project can be fully validated without accurate testing and assessment. For the Jamaican Creole synthetic speech, we designed an evaluation process comprising both objective and user-based components. This evaluation process served to formally measure and validate the integrity of the resulting synthetic speech and to attest to its success. In addition to evaluating the JC synthetic speech, we also sought to rate it against current industry benchmark standards to see how it measured in comparison.

The evaluation process and results are presented in Chapter 4.


Chapter 3: Adaptation of Unit Selection Voice Building to Jamaican Creole

3.0. Introduction

Festival and Festvox offer an installation script that unzips, installs, compiles and configures all three distributions in one go. However, our experience has shown that it is best to run each component individually, in sequence. This ensures that each distribution is configured properly and without error, and assists with debugging. A review of several forum discussions, documentation and notes on the installation, compilation and configuration of these three distributions revealed recommendations regarding the manner and sequence in which they should be installed. The recommended order of installation was the Edinburgh Speech Tools first, Festival second and Festvox third. This order reflects the interdependence of their sub-directories and modules.

The Festival Speech Synthesis framework, EST library and Festvox tools were retrieved from the Carnegie Mellon University (CMU) website (the CSTR North American mirror). Having acquired the source distributions required for the project, namely Festival 2.1 (1.96 beta), Festvox 2.4 and Speech Tools 2.1, together with all relevant files and the source distribution compilation script, we set about unpacking them into our voice building directory. Following the download, installation and configuration of the Edinburgh Speech Tools (EST), the Festival Speech System framework and the Festvox files, the next task was to set up the requisite environment variables, voice scripts and build directories.

A new sub-directory, navi, was created at C:\cygwin\home\dahlia\fes\Festvox\Festvox\data\navi to house the voice definition files specific to the Jamaican Creole speech synthesis project. The environment variables for all three directories, ESTDIR, FESTIVAL and FESTVOXDIR, were then set to point to the researchers' specific build directory, which contained the unpacked Festival, Festvox and EST distributions. The commands below were used to achieve this, where path name refers to the path leading to the researchers' directory containing the Festival, Festvox and Speech Tools directories. Under Cygwin the environment path variables were set as outlined below.

● export ESTDIR=/path name/speech_tools
  ◦ dahlia@dahlia-PC ~/fes
    $ export ESTDIR=/cygdrive/c/cygwin/home/dahlia/fes/speech_tools/

● export PATH=/path name/festival
  ◦ dahlia@dahlia-PC ~/fes
    $ export PATH=/cygdrive/c/cygwin/home/dahlia/fes/festival/bin:$PATH

● export FESTVOXDIR=/path name/festvox
  ◦ dahlia@dahlia-PC ~/fes
    $ export FESTVOXDIR=/cygdrive/c/cygwin/home/dahlia/fes/festvox/

Setting the environment variables in essence told the system the path to the researchers' directory, that is, the location where the specific voice building files were hosted. Once the environment variables pointed to the desired directories, the build directories and set-up voice scripts relevant to the voice building process were automatically generated using the provided script: $FESTVOXDIR/src/unitsel/setup_clunits organisation_name domain_name speaker_name. Running this script created the modules and directories necessary for a text-to-speech system, copied the relevant scripts and parameters, and generated the clunit-specific voice and build file templates for the new Jamaican Creole synthetic voice. The structure of the voice building database "requires three identifiers for the voice" being built, namely (1) the institution, (2) the domain name and (3) the speaker/voice identifier initials (Black and Lenzo 2007). In the case of our Jamaican Creole voice, these were set to uwi, navi and jc respectively, denoting the University of the West Indies, navigation and Jamaican Creole. Once set-up was complete, the speech synthesis architecture and relevant modules were available for modification and actual voice building.
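To show how the three identifiers surface in the finished system, the sketch below illustrates how the generated voice would later be loaded and invoked from the Festival interpreter. It assumes the default Festvox naming convention for the uwi/navi/jc identifiers and an invented JC prompt; it is not a verbatim excerpt of the project files.

    ;; Sketch: loading and selecting the generated JC cluster unit voice
    ;; (file and function names follow the assumed default festvox convention).
    (load "festvox/uwi_navi_jc_clunits.scm")   ;; voice definition generated by setup_clunits
    (voice_uwi_navi_jc_clunits)                ;; select the JC voice
    (SayText "ton lef paahn Ligani Avinyuu")   ;; synthesize an illustrative JC prompt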

3.1. Unit Selection Voice Building Process within the Festival Framework

Within the Festival Speech Synthesis System, there are a number of key steps and processes that must be undertaken in order to successfully create unit selection based synthesized speech or build a cluster unit selection voice, particularly for new, previously unsupported languages, as was the case with our proposed Jamaican Creole synthetic voice.

The following list, as outlined by Black and Lenzo (2007) in the Festival documentation, chronologically highlights nine fundamental tasks that must be completed in order to build a new cluster unit selection voice and create synthesized speech for a new language within Festival. These tasks comprise pre-processing, processing and speech signal modification tasks:

- Step 1: Design the (corpus) database
- Step 2: Customise the synthesizer front end (NLP module)
- Step 3: Record the prompts/voice talent upon which the synthetic voice will be modelled
- Step 4: Annotate or label the prompts generated
- Step 5: Build the utterance structures for the voice database using recommended procedures
- Step 6: Extract pitch marks, and build coefficients for acoustic distances and distance tables
- Step 7: Build the clunit (cluster unit) based synthesizer by building cluster trees with wagon, using the features and acoustic distances dumped in the preceding stages
- Step 8: Build the voice description for the new synthetic voice
- Step 9: Test and tune the new cluster unit selection synthetic voice through evaluation and analysis

In conducting the actual synthetic voice building process for Jamaican Creole, we came to realise that while it may be recommended to follow each stage of the voice building process in the outlined chronological fashion, some steps can be performed before or after others without adversely affecting the output. For example, the work required at Step 3, recording and building the corpus database, can be performed along with Step 1 or before the Step 2 NLP customisation takes place. Steps 7, 8 and 9, however, cannot be run until all other steps have been satisfied, as they depend on the successful completion of the preceding steps. In the case of our JC voice, we chose, for the most part, to follow the above-outlined steps in sequence.

Within concatenative speech synthesis, the corpus or database plays an integral role in the resulting quality of the synthesized speech output. It provides the model upon which the synthetic voice is built, and therefore special attention must be paid to the preparation of the database, particularly the recording of the utterances and prompts. Among the measures we took was ensuring that the recordings for our Jamaican Creole database were similar in nature to the targeted domain. This helps to reduce the number of bad joins while improving quality and enhancing naturalness. We also made certain that the words and vocabulary used within the prompts and utterances had optimal phone coverage and were balanced both phonetically and prosodically, having at minimum one occurrence in each prosodic context, as recommended (Black and Lenzo 2007).

While not definitive, recommendations are provided regarding the size and type of data required for a unit selection database. Black and Lenzo (2007) refer to the TIMIT database in providing a rough guideline, noting a dataset of 460 phonetically balanced sentences totalling some 14,000 phones. We note, however, that the final size of the speech database depends heavily on the target domain for which the synthetic voice is being prepared. In the case of this study, the Jamaican Creole synthetic voice, although general purpose to some extent, is primarily targeted at in-car street level navigation. The utterances and prompts which comprise the speech corpus were therefore customised specifically with this target domain in mind. In order to provide an unbiased list of standard navigation prompts and utterances, we conducted a brief investigation into the nature of navigating, or providing directions, in the Jamaican Creole language.

The remaining sections of this chapter outline the customisation process that we undertook in order to provide the required components and files for the development of the JC cluster unit selection synthetic voice. We also highlight issues encountered and the steps we took to resolve them.

3.2. Speech Corpus Design for navi

In proposing a synthetic voice for Jamaican Creole with the primary target of providing in-car street level navigational prompts, it seemed only fitting to get feedback from fellow Jamaican Creole speakers. Our goal was to ascertain what features would "render directions effective from a listener's perspective", specifically when done in Jamaican Creole (Hund and Minarik 2006, 180).

Travel writer Christopher Baker advises the driver requesting directions in Jamaica to take those given by locals "with a grain of salt" (2003, 103). As Jamaicans embarking on this particular project, we could understand his well-placed satire. The primary researcher recalls having to ask a passerby for directions while conducting a child language survey in Western Jamaica in 2008. The request was met with the prompt response to "gu dong gu dong gu dong gu dong gu dong", accompanied by gestures. This response translates literally into Standard Jamaican English as 'go down, go down, go down, go down, go down', effectively meaning 'continue straight for an undetermined number of miles'. With no member of the research team being from Western Jamaica, these directions led to a fair amount of eyebrow raising and hidden smiles. How far were we supposed to go, and how would we know when to stop? Needless to say, the research team struggled with mirth at these seemingly abstract directions. The point of this recollection is that the phrases and expressions normally used to provide directions in a Jamaican context, when issued in Jamaican Creole, tend to vary as one travels from parish to parish, and even within a parish.

In the case of this specific conversation, all the participants involved were native Jamaican Creole speakers, but each individual represented a different region of the island, namely East, South and West. Thus, although the directions were being given in Jamaican Creole to native Jamaican Creole speakers, the difference in the particular terms used to give directions, and in the understanding (or lack thereof) of those terms, is noteworthy. Acknowledging this variation in direction-giving terminology, the question of which standard prompts should be used in the database for the Jamaican Creole synthetic voice becomes equally important for voice building in this domain, and should be addressed to the extent possible, particularly as it relates to the preparation of the corpus database used in the creation of this domain-specific synthetic voice. We also note that direction giving in Jamaica, and the Jamaican Creole expressions used and viewed as acceptable, is in itself an area that we would recommend for further research.

The objective of this study was to create a quality domain-specific synthetic voice for Jamaican Creole. Our aim was that this synthetic voice would potentially be used by Jamaican Creole speakers, whether first language (L1) or second language (L2) speakers, from varying sectors of the population, independent of location or region, with little or no hindrance as it relates to understanding the particular terms used to give directions. The focus for this synthetic voice should broadly allow it to be used by speakers of Jamaican Creole to receive street-level voice navigation assistance, irrespective of idiolect or region.

When giving directions, navigation speed, which is the time it takes to get from one point to another, and accuracy are both key elements. So too are cues, which are often employed when navigating from one point to another. These cues may be divided into two major categories: (i) route perspective and (ii) survey perspective. In a route perspective, salient landmarks, as well as specific street and place names, are highlighted and used to assist the navigator in getting from one location to the other. Among other things, this helps to orient the driver, allowing one to confirm that the correct route is still being followed (Hund and Minarik 2006). In a survey perspective, cardinal descriptors, namely North, South, East and West, and mileage information are employed. For example, a driver could be told to go sout paahn Ligani Avinyuu 'Drive south on Liguanea Avenue'. This perspective may also include numeric distance information to alert the driver to the exact distance to be travelled, for example aafta 600 miita ton lef 'In 600 metres, turn left'. Most modern-day navigation systems, such as JAMNAV(c)5, tend to employ a combination of both perspectives.

5 JAMNAV is the local Global Positioning System (GPS) currently available from the Mona GeoInformatics Institute. It provides navigational instructions in many Standard English varieties, among other languages. It does not offer a Jamaican Creole language option.

As attested by first-hand observation and confirmed by other native Jamaican Creole speakers, when giving directions to fellow Jamaican Creole speakers in Jamaican Creole, Jamaicans traditionally employ a route or geographically-based perspective rather than a survey perspective. In fact, as observed first hand, direction-giving is normally smattered with geographical or landmark references. Hence one can expect to ton lef aafta yu paas di kuoknat man 'Turn left after passing the coconut vendor', tek di neks rait aafta di big bredfruut chrii 'Take the next right turn after the big breadfruit tree', go paas di tord stap lait den ton rait 'Turn right after the third (set of) traffic lights', or wen taim yu paas di biesik skuul yu nuo se yu suuhn riich 'Once you have passed the Infant School you know you are nearing your destination'.

Terms denoting distance, such as 'metres' and 'kilometres', though found in modern-day street level navigation systems such as JAMNAV(c)6, are notably absent from the everyday direction-giving vocabulary used between Jamaicans when driving directions are provided in Jamaican Creole. In normal Jamaican Creole conversations, the concept of a 'metre' or a 'kilometre' is often lost on the everyday Jamaican seeking or giving directions in a local context. To illustrate, we asked a random Jamaican, a hair stylist in her late thirties from rural St. Andrew but working in Kingston, to explain what her understanding would be if someone were to tell her to 'go straight for 200 metres'.

6 Ibid.

Her response, as related, was that mi wudn riili andastan ou faar im a taak boot stil so mi wuda jos ges ou faar dem waahn mi go 'I wouldn't really understand how far s/he meant, so I would just guess how far I was expected to go'. Indeed, the concept of 'metres' was essentially lost on this particular respondent. She would have to guess, because she did not normally use such terms in everyday speech when referring to directions. This was not an isolated case: the same question was put to participants on four more occasions, including two university students, with the same results. Additionally, it is quite common for indications to ton lef 'turn left' or ton rait 'turn right' to be supplemented with non-verbal cues such as gesturing and hand signalling. A similar observation holds for cardinal points, which are rarely if ever used when Jamaican Creole is the language in which directions are sought or given. It has been noted, however, that some Jamaicans, predominantly senior Jamaicans, tend to use the terms mailz 'miles', yaad 'yards' and chien 'chains' when measuring distance in direction giving.

In the case of geographical landmarks, markers such as the kuoknat man 'coconut vendor' are often used locally by drivers to orient themselves. One option in system design would be to create the coconut vendor as a custom point of interest (POI) for the navigational device, provided that we can guarantee the coconut vendor is and will always be fixed in that particular location, which often is not the case. The kuoknat man may set up shop off Liguanea Avenue because business is booming in that specific location; however, if circumstances change, he may well next be found off Oxford Road.

The preceding information leads us to the following questions:

(1) Is direction giving in a Jamaican Creole context different from direction giving in a Standard Jamaican English context?
(2) What prompts are normally used in Jamaican Creole to provide driving directions?
(3) Does it suffice to merely provide a text-to-text translation of the Standard English prompts currently used in modern-day navigation systems?
(4) Given that modern-day satellite-based navigational systems favour a survey perspective, what adjustments, if any, would be required from a cultural point of view that is more predominantly geographically based?

In order to present a standard set of prompts within our synthetic voice that was both unbiased and generally accepted for direction giving and navigating in a local context using Jamaican Creole, we administered a questionnaire to eight randomly selected Jamaicans: 3 males and 5 females, aged between 23 and 40. The questionnaire had 6 questions, including open-ended and closed-ended questions. In Question 1 of the questionnaire (see Figure 3.1 below), the 8 participants were asked to provide what they considered the necessary Jamaican Creole prompts required to get a Jamaican Creole speaking driver from a starting point to a final destination, using a route pre-defined by the researchers.

The complete questionnaire and sample responses returned can be referenced in Appendices 2 through 4. Although the questionnaire was administered to persons who were predominantly L1 Jamaican Creole speakers with a high level of understanding and speaking ability in the language, the questions and instructions were written in Standard Jamaican English, as the respondents were not proficient in reading the formal Cassidy-Le Page (1976/1980) Jamaican Creole orthography. The participants were required to answer in Jamaican Creole, and allowance was made for them to write Jamaican Creole as they thought it should be written, since they were not fully competent in the Cassidy-Le Page writing system.

Figure 3.1

Navigating using Jamaican Creole

As it relates to direction-giving as a whole, the basic principles, irrespective of the language, are (a) to avoid the use of confusing terms and (b) to give the least confusing route at all times, ensuring the driver gets to the destination within the shortest time possible (Hund and Minarik 2006; Hund et al. 2008). Thus it makes sense to (1) specify the distance to be travelled, (2) clearly indicate the turns to be made and (3) indicate the specific side of the road (left or right) on which the destination is to be found. In specifying the distance to be travelled, one may also mention or display the length of time it will take to reach the desired destination from the starting point.

The satellite image below of Mona Heights in Kingston, Jamaica, courtesy of Google Maps, shows the preliminary area used within this project as the reference point for the database of street names, place names and POIs in the speech synthesizer's initial navigation corpus for the Jamaican Creole synthetic voice. Using native speaker competence as well as input from other native Jamaican Creole speakers, this data set of street names was first translated from SJE to JC and then appropriately tagged.


Figure 3.2

Satellite Map of Mona Heights, Kingston, Jamaica

Using the information gathered from the administered questionnaire, what follows is the proposed template for an in-car navigation system that provides turn-by-turn directions in Jamaican Creole:

i. Indicate the direction in which the driver should go or turn, as in: straight, left or right, or turn back (make a U-turn);
ii. Use cardinal numbers to indicate the distance to the upcoming or next turn;
iii. Use ordinal numbers to identify which turn the driver will take;
iv. Alert the driver to the name of the street onto which to turn or along which to go;
v. Alert the driver when nearing the destination;
vi. Alert the driver when making a wrong turn or going contrary to the directions given, and correct the driver;
vii. (Optional) Offer a compliment to the driver upon arriving safely at the destination.

3.2.1. Speaker Definition for the Jamaican Creole Synthetic Voice

In seeking to define the speaker for the JC corpus database, speaker selection, speaker characteristics and voice quality were among the factors that needed to be addressed. Another aspect that had to be considered was the recording environment. Black and Lenzo (2007) offer some general guidelines to this effect. Though made with particular reference to Festival, the tips provided are applicable to (concatenative) systems in general. One recommendation is to select a speaker with a clear intonation or speech pattern, which is to say someone who does not mumble and who does not have a speech impediment. Another is to avoid choosing a speaker who has a cold or a hangover, as these are 'temporary' states which are often difficult to reproduce should re-recording be required for specific segments. Additionally, the recommended time of day for recording sessions is the morning, though not immediately upon waking; if further recording sessions are required, they should be done at the same time of day as the initial recording. The developers do caution, however, against unrealistic expectations of the resulting voice, even with a very distinctive and professional speaker: it is rather unlikely, they state, that "a synthesizer built from recordings of a person will always sound just like that person" (Black and Lenzo 2007, 25).

While the above general guidelines for selecting a speaker and recording the corpus are useful, they are just that, general guidelines. We also believe that using a speaker who has an awareness of speech technology is an added bonus.

Additional factors had to be taken into consideration for the development of the Jamaican Creole domain-specific synthetic voice. For example, the speaker chosen had to be familiar with the Cassidy-Le Page (1976/1980) writing system for the language. This was required in order to read the data prompts provided, as they were written using the CLP phonemic writing system. The researchers also had to consider which language variety would form the basis for the synthetic voice. There are different regional varieties of Jamaican Creole spoken across the island of Jamaica. The two major, broadly encompassing varieties are Western and Eastern (Harry 2006), under which several sub-varieties may fall. There are phonological differences between these two varieties, some of which are highly disputed; one such example is the glottal fricative 'h'. The variety of Jamaican Creole used within the speech corpus is Eastern rather than Western and is based on the speech pattern of the Jamaican Creole speaker used for the recordings. Within this study, [h] is treated as phonetic rather than phonemic; therefore the lexical 'phonemic contrast does not exist' (Harry 2006, 126).

When recording the corpus to be used in a speech synthesizer, a researcher has one of two options: (1) have a professional speaker record the intended corpus in a professional studio, or (2) record directly to the computer being used to carry out the project, in a relatively quiet environment. Option 1 results in less background noise, hence ensuring a higher quality of recorded data. Option 2, however, has the advantage that the speech is recorded instantly to disk, written directly in the intended format and straight to the files to be used in the synthesis process. This results in less loss and reduces the time that may be required to transfer and segment the recordings in the case of option 1 (Black and Lenzo 2007).

The voice building setup offers a script, ./bin/prompt_them, which gives the developer or researcher the option of recording straight to disk. The resulting .wav files from this prompt-recording process are stored in the wav/ directory, to be referenced later by other modules and scripts during the voice building process. When run, the prompt_them script, housed within the bin/ directory, displays each utterance defined in the database file etc/navi.data in sequence, prompting the speaker as to which utterance is to be recorded.

Despite the seemingly straightforward nature of this script, repeated attempts to run it on different occasions did not produce the expected outcome. For example, the system would start prompting the utterances and then, without any clear indication as to why and without returning an error message from which fixes and debugging steps could be derived, it would skip several utterances and exit without prompting them all. On several occasions, the speaker doing the recordings was prompted sequentially for utterances 1 through 19, immediately followed by utterance number 38, skipping utterances 20 through 37 without any resulting error note. Owing to this lack of success in using the provided prompt_them script to record directly to disk, despite several attempts, the researchers sought an alternative route to obtain the necessary recordings for this project.

Based on a review of speech synthesis-related discussion forums and threads on this subject, one alternative was to use independent software to record on a personal computer. This option has been used with much success by other researchers and developers to obtain the requisite recordings for speech synthesis voice building projects. Using an Acer desktop with an Asus M3N78-VM motherboard, a head-mounted microphone and the Audacity software, the recordings required for this project were successfully carried out in a relatively quiet environment. The total recording time for all prompts was eight minutes and fifty-nine seconds (8 min 59 s), with the recording parameters set to produce uncompressed pulse code modulation audio output, namely Microsoft WAV 16-bit PCM. After listening to the recordings, and deleting and re-recording where necessary to achieve better audio quality, a unique wave file was extracted and saved for each individual utterance. Each file was named to match its corresponding prompt file name, for example navi001.wav. Following the renaming of each individual prompt, all recorded wave files were stored in the voice building sub-folder wav/. Once ported to the wav/ folder, the recordings were converted to the Festival-required default of 16 kHz mono using the command line ./bin/get_wavs wav/*.wav.

A compiled list of one hundred and forty-seven (147) prompts and their utterance numbers, reduced from the original geographically based list of two hundred and sixty-nine (269), was used for the Jamaican Creole speech corpus database.

This list of prompts and utterances adequately covers the phoneme inventory of the language and forms the speech database used in the creation of the Jamaican Creole synthetic voice. The prompts were prepared using the procedure for designing prompts highlighted in Black and Lenzo (2007, 35). These JC prompts, based on data gathered from native Jamaican Creole informants, are provided in Appendix 5, with each prompt categorised under its relevant sub-heading. A Standard Jamaican English (SJE) translation is provided for each prompt, and the Jamaican Creole pronunciation of the ordinal numbers is also included. The utterances were recorded in context, so that coarticulatory effects could be accurately captured and made readily available for use during synthetic voice building.
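For reference, each prompt in etc/navi.data follows the standard Festvox prompt-file format of a file identifier paired with its text. The entries below are a hedged illustration only: the file ids follow the navi001 pattern mentioned above, but the prompt texts shown are invented examples rather than the actual recorded prompts listed in Appendix 5.

    ( navi001 "ton lef paahn di neks striit" )
    ( navi002 "tek di sekan rait" )
    ( navi003 "yu riich we yu a go" )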

3.2.2. Annotation of Jamaican Creole Prompts

Having designed the speech corpus and provided the list of prompts that would be used during voice creation, we then set about annotating the recorded prompts. This required labelling and building utterance structures for the recorded prompts. Labelling is the process of aligning the synthesised prompts derived during the prompt generation process with the output produced during the prompt recording process (Taylor 2009). To obtain the labelling information used to perform alignment during voice creation, we had two clear options: either (1) to label by hand or (2) to label automatically (Black and Lenzo 2007).

Labelling by hand can be done either analytically, with the researcher applying specific rules and procedures to the data, or intuitively, relying solely on the researcher's own judgement and language and/or linguistic ability. Automatic labelling, on the other hand, uses software tools to perform analytic labelling of the data through a consistent set of labels and hidden variables within a specific module (Taylor 2009, 519-521). One such collection of software tools is the EMU Speech Database System7. Both hand labelling and automatic labelling have their merits as well as demerits, particularly as they relate to accuracy.

The Festival framework offers a defined module that can be used to facilitate and fast-track the labelling procedure required at this stage of the project. The make_labs module in the bin directory can be invoked with the script bin/make_labs prompt-wav/*.wav. This script, however, did not work for us, despite countless attempts. We surmised that the command line was obsolete, which was later confirmed after debugging the first few failed runs and reviewing forums and threads on the issue. Researchers and developers using Festival are encouraged to replace the bin/make_labs command line, as provided in the Festival-Festvox voice building documentation, with the script ./bin/do_build label in order to generate the requisite lab files.

7 Phonetic software tool available from the Centre for Language Technology, Macquarie University.

After applying JC-specific customisations to the ./bin/do_build label script, we were finally able to run it without errors and generate labels. Using the segments we had already defined for the Jamaican Creole language and the duration information provided for each segment, the system was able to locate (i) the signals representing the language-defined phones, as well as (ii) the segmental boundaries within the spoken JC prompts. Locating the required signals and boundaries in essence aligned the defined segments with the recorded and synthesised prompts, thereby generating waveform labels from the prompts (Taylor 2009; Black and Lenzo 2007). Having generated the labels and manually checked them for accuracy, we then used the following command to build the utterance structures for each of the items defined within our prompt list: festival -b Festvox/build_clunits.scm '(build_utts "etc/navi.data")'.
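To spot-check the generated utterance structures, an individual utterance can be reloaded and inspected from the Festival interpreter. The following is a sketch only: it assumes the default festival/utts output location of the build and the navi001 file id used earlier.

    ;; Sketch: reload a built utterance and print each segment with its end time.
    ;; The path festival/utts/navi001.utt assumes the default build layout.
    (set! utt1 (utt.load nil "festival/utts/navi001.utt"))
    (mapcar
     (lambda (seg)
       (format t "%s %f\n" (item.name seg) (item.feat seg "end")))
     (utt.relation.items utt1 'Segment))

Listing the segments and their end times in this way provides a quick sanity check that the labels and utterance structures line up with the recorded prompt.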

3.3. Jamaican Creole Phoneme Inventory and Lexicon

One cannot fully conduct research in a domain such as text-to-speech synthesis without being confronted with contentious questions surrounding the role, or lack thereof, of linguistics in speech synthesis or speech processing. Questions such as the following are commonplace, have been raised at speech processing conferences and are widely discussed in the resulting proceedings: (i) What is the role of linguistics in speech generation? (ii) How does linguistics fit into the process of speech generation? (iii) How much linguistic knowledge, and application of such knowledge, is required? (iv) Does one have to be a linguist to successfully generate speech?

If we recall the generic text-to-speech process and the figure presented in Chapter 1, these questions do seem appropriate, particularly when we consider that the Natural Language Processing module includes some amount of linguistic analysis.

Figure 3.3

NLP Module and Linguistic Analysis (Linguistic Analysis comprising Phonetic Analysis, Grapheme-to-Phoneme Conversion and Prosody Specification)

Taylor (2009) mentions the great cultural divide that exists between linguistics and speech technology. It has been observed that linguists appear to support first generation (rule-based) speech synthesis techniques more than second generation (corpus-based, data-driven) techniques. Indeed, there is a great difference between the role linguistics played in speech technology during the Chomskian era and the role it plays today, and researchers and developers continue to be divided on the issue. Many fervently believe that research in linguistics will be able to solve many speech technology problems.

In presenting his outlook on the future directions of speech synthesis, Taylor remarks that, while there are those who strongly believe the above, for his part, although a "solid knowledge of the basics of linguistics" proved "invaluable" in his own TTS research, "the field of linguistics, as currently practiced, will never produce any research that will facilitate a major leap forward in speech technology" (2009, 533-37). His rationale stems from what he refers to as the lack of experimentalism in linguistics as it relates to the application of speech processing theories. We strongly believe that knowledge of key areas of linguistics is an applicable advantage in speech technology, particularly acoustic phonetics in the case of unit selection-based synthesis, as in this study. It is our view, however, that the more appropriate question in this debate concerns neither the role of linguistics nor the linguist, but rather how a better understanding of human communication and technology can enable the researcher or developer to generate more intelligible, high quality synthetic speech.

Recalling two questions posed at the start of this project, namely (1) what pre-processing text and linguistic changes were required, and (2) how we would implement those changes in the Festival concatenative synthesizer platform, we get a better appreciation of the role that some amount of linguistic knowledge plays in the development of new synthetic voices.

According to Schroeter (2008, 419), and as illustrated in the adapted block diagram of the concatenative synthesizer below, the changes to the front end module fall into two primary categories. These are:

i. Changes to the module or modules responsible for text analysis, letter-to-sound conversion and prosody; and
ii. Changes to the module or modules responsible for the store of sound units.

Figure 3.4

Block Diagram of a Concatenative Synthesizer

As it relates to item ii, modification to the store of sound units is only necessary when the gender of the target speaker differs from the system's default implemented gender (Schroeter 2008). The Festival platform requires that a default voice package be installed during the set-up and implementation process, and it provides some default voice packages which the developer can access. The voice unpackaged during the set-up for the Jamaican Creole voice was a default male voice with accompanying parameters to match.

The Jamaican Creole voice we sought to build was modelled on a female speaker. Modifications were therefore required to the build parameters, specifically those related to the setting of the pitch mark arguments for a female speaker. The direct involvement of the researchers with the NLP module of the Festival platform for the Jamaican Creole language related specifically to the customisation of the NLP modules through the clunit-generated Scheme template files, as well as the creation of new Scheme files tailored from those previously generated. To satisfy the requirements for the creation of a Jamaican Creole synthetic voice within Festival, the following phonetic, pre-processing and linguistic tasks were performed:

i. Phone set definition
ii. Tokenisation and text normalisation
iii. Incorporation of token processing rules for JC numbers/digits (see the sketch after this list)
iv. Providing word pronunciation guides (lexicon and letter-to-sound rules)
v. Stress assignment (to syllables)
vi. Phrase break assignment
vii. Duration assignment (to phones)
viii. Intonation
ix. F0 contour generation
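As an illustration of item iii, the sketch below shows the general shape such a token-processing rule can take in Festival Scheme. The hook name uwi_navi_jc::token_to_words follows the usual Festvox template naming, and the digit-to-word list is illustrative only; neither is a verbatim excerpt of the rules actually shipped with the voice.

    ;; Illustrative token-processing rule for single JC digits (assumed names).
    (set! jc_digit_words
          '(("0" "ziero") ("1" "wan") ("2" "tuu") ("3" "chrii") ("4" "fuor")
            ("5" "faiv") ("6" "siks") ("7" "sevn") ("8" "iet") ("9" "nain")))

    (define (uwi_navi_jc::token_to_words token name)
      "Return a list of JC words for NAME; single digits are expanded to
    their JC names, anything else is passed through unchanged."
      (let ((entry (assoc name jc_digit_words)))
        (if entry
            (list (cadr entry))
            (list name))))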

In the case of our Jamaican Creole synthetic voice, some knowledge of linguistics, in particular phonetics and phonology, as well as first-hand knowledge of the language itself, played a key role in performing the above-mentioned pre-processing and analysis tasks, especially items (i), (iv) and (vii). The NLP module of the speech system is language dependent; thus, in order to accurately model the specific language being worked on, certain language-specific changes to the front end of the speech system are required. Schroeter notes that changes to the "TTS frontend" or Natural Language Processing module are necessary for "each new language" in order to accurately reflect the true nature of each language being added to the speech synthesizer (2008, 413-428). Indeed, simply bootstrapping an existing default English system for Jamaican Creole without language-specific modifications would not be enough if we truly wanted to provide an accurate representation of the language.

Although English is the major contributing lexifier for Jamaican Creole, the two languages are unique and distinct in their phoneme inventories. A phoneme is an abstract concept, a contrastive linguistic unit of speech used to describe the limited set of sounds found in any given language. Phonemes form a finite, language-specific inventory of functioning units used to distinguish the words of the language, and are normally listed in what is rudimentarily referred to as the alphabet of the language (Taylor 2009; Ladefoged 2005; Radford et al. 1999).

The differences between the phoneme inventory of Jamaican Creole and those of Standard English varieties such as Received Pronunciation (British English) and American English, which are often used as the default varieties in text-to-speech systems, far outweigh the apparent similarities. As such, we cannot assume an absolute one-to-one mapping between these languages. Such differences can be observed in the number of phonemes in each language, in the phonemes themselves, as well as in their distribution. The Jamaican Creole language, according to Cassidy and Le Page (1976/1980), has thirty-five phonemes, while English has around forty (Ladefoged 2005, 195). Note can also be taken of Jamaican Creole phonemes such as /gy/ as in gyal 'girl', 'young lady' and /ky/ as in kyaad 'card', as well as of the behaviour of some phonemes, such as the Jamaican Creole voiced implosives, which do not exist in English.

A phone, in linguistics, is defined as an individual speech unit (Jurafsky and Martin 2000) and is the smallest unit of sound found in any language (Ladefoged 2005). It is a concrete unit, the number of which varies from language to language, and phones are divided into two main classes: consonants and vowels (Jurafsky and Martin 2000). A phone set in this regard represents a "precise, concrete" list of sounds which can be "detected by phonetic analysis" (Radford et al. 1999, 82). Within Festival, the phone set is considered the basic building block of any synthetic voice.

It must be mentioned, however, that the term phone set, when used within the Festival Speech System, refers to the abstract set of speech sounds used to describe the sound system of a given language (Black, Taylor and Caley 1999), which is to say the phoneme inventory as defined by a linguist (Radford et al. 1999). In this study, we assumed a phoneme inventory which combined the CLP (1976/1980) and the Devonish and Harry (2004) phoneme inventories, listing a total of thirty-six phonemes. These 36 phonemes, comprising 12 phonemic vowels and 24 consonants, provide a more accurate representation of the phoneme inventory of modern-day JC. A binary feature format, listing the phonemes, the features and the value assigned for each feature, was used to represent the JC phoneme inventory in Festival. This phoneme inventory is declared within a framework of the type:

( defPhoneSet NAME FEATUREDEFS PHONEDEFS )

NAME is the unique identifying symbol of the phone set. FEATUREDEFS is a list defining the feature names and their possible values used to describe the phonemes of the language being defined. PHONEDEFS provides a list of the phonemes of the language together with their attributed feature values, as defined in FEATUREDEFS.
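The sketch below illustrates what such a declaration looks like in practice. It is an abridged, illustrative fragment only: the feature names and the handful of phones shown are assumptions chosen for clarity, not a verbatim excerpt of uwi_navi_jc_phoneset.scm (the full definition is given in Appendix 6).

    ;; Illustrative fragment of a Festival phone set definition for JC
    ;; (feature names and values here are assumed for illustration).
    (defPhoneSet
      uwi_navi_jc
      ;; FEATUREDEFS: feature names and their permitted values
      (
       (vc + -)               ;; vowel or consonant
       (vlength s l d 0)      ;; vowel length: short, long, diphthong
       (vheight 1 2 3 0)      ;; vowel height: high, mid, low
       (vfront 1 2 3 0)       ;; vowel frontness: front, centre, back
       (vrnd + - 0)           ;; lip rounding
       (ctype s f a n r l 0)  ;; stop, fricative, affricate, nasal, approximant, lateral
       (cplace b l a p v g 0) ;; bilabial, labio-dental, alveolar, palatal, velar, glottal
       (cvox + - 0)           ;; consonant voicing
      )
      ;; PHONEDEFS: each phone with one value per feature
      (
       (a   + s 3 2 - 0 0 0)  ;; short low centre unrounded vowel
       (ii  + l 1 1 - 0 0 0)  ;; long high front unrounded vowel
       (uo  + d 1 3 + 0 0 0)  ;; high back rounded diphthong
       (k   - 0 0 0 0 s v -)  ;; voiceless velar plosive
       (b   - 0 0 0 0 s b +)  ;; voiced bilabial plosive
       (pau - 0 0 0 0 0 0 -)  ;; silence
      ))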

There are three major parameters used to define vowels: tongue height, tongue advancement and the shape of the lips (Jurafsky and Martin 2000; Ladefoged 2005). Using the four feature groups of Height (high, mid, low), Frontness (front, centre, back), Roundness (+round, -round) and Length (short (S), long (L), diphthong (D)), the feature matrix below defines the 12 Jamaican Creole vowels and their value specifications as used within the Festival platform.

Table 3.1

Distinctive Feature Matrix for Jamaican Creole Vowels

            HEIGHT             FRONTNESS           ROUND    LENGTH
        High  Mid  Low     Front  Centre  Back     Round    S, L, D

  i       +    -    -        +      -      -         -         S
  e       -    +    -        +      -      -         -         S
  a       -    -    +        -      +      -         -         S
  o       -    +    -        -      -      +         +         S
  u       +    -    -        -      -      +         +         S

  Additional Feature = Length (L)
  ii      +    -    -        +      -      -         -         L
  aa      -    -    +        -      +      -         -         L
  uu      +    -    -        -      -      +         +         L

  Additional Feature = Diphthong (D)
  ie      +    -    -        +      -      -         -         D
  ai      -    -    +        -      +      -         -         D
  uo      +    -    -        -      -      +         +         D
  ou      +    -    -        -      -      +         +         D

The most commonly used features for describing consonants in general, and Jamaican Creole is no exception, are (i) voicing, which refers to the vibration of the vocal folds, (ii) place of articulation, referring to where the restriction of airflow in the vocal tract takes place, and (iii) manner of articulation, defining how this restriction is made.

Regarding Jamaican Creole consonants, the features used to reflect place of articulation are Bilabial, Labio-Dental, Alveolar, Post-Alveolar, Palatal, Velar and Labio-Velar, while those used to reflect type or manner of articulation are Nasal, Plosive, Affricate, Fricative, Approximant and Lateral Approximant (Harry 2006; Devonish and Harry 2004; Meade 1996). The consonants resulting from the combination of the phoneme inventories proposed by Cassidy and Le Page (1976/1980) and Devonish and Harry (2004) are presented in the feature matrix below.

Table 3.2

Jamaican Creole Phoneme Inventory of Consonants

               Bilabial  Labio-   Alveolar  Post-     Palatal  Velar   Labio-  Glottal
                         Dental             Alveolar                    Velar

  Nasal           m                  n                  ɲ        ŋ
  Plosive        p  b               t  d                        k  g
  Affricate                                            c  ɟ
  Fricative               f  v      s  z      ʃ  zh
  Approximant                        ɹ                   j               w
  Lateral
  Approximant                        l

In the feature matrix, consonants are characterised according to the state of the vocal folds: vibrating, hence voiced, or not vibrating, thus voiceless. Consonants depicted on the left of a cell are voiceless, while those on the right are voiced.

Within the binary feature presentation format used in Festival, each JC consonant was given a feature specification of '+' or '-', that is, plus (having that feature) or minus (not having it). The complete phoneme inventory as defined for JC in uwi_navi_jc_phoneset.scm can be referenced in Appendix 6.

In the same way that a phone(me) set illustrating the phoneme inventory of a new language must be defined within the synthesizer framework, so too must the rules dictating the pronunciation of these phonemes and the application of phonological rules. The ultimate goal of defining and applying these phonological rules is to achieve the best possible letter-to-phoneme alignment. Current approaches for implementing phonological rules and G2P conversion in speech systems include the use of (i) a large lexicon or pronunciation dictionary, (ii) defined language-dependent rules, (iii) data-driven methods (Taylor 2009; Weersignhe et al. 2007; Taylor, Black, and Caley 1998; Dutoit 1997), or (iv) a combination of these methods.

In The Sound Pattern of English, Chomsky and Halle presented the grammar rewrite rule formalism, whereby the grammar of a language is expressed as a "linearly ordered sequence of rewrite rules mapping an underlying form (the output of the syntax) to the surface representation" (1968, 390-391).

Within this formalism, a symbolic rewrite rule is conditioned by left and right contexts: the part before the dividing slash '/' specifies the rewrite itself (an input symbol and the symbol it becomes), while the part after the slash specifies the left and right contexts, expressed as regular expressions, in which the rewrite applies, as shown below:

A → B / #_X
A → A / #_

In converting a grapheme to its corresponding phoneme, the most specific rule is checked first; if no rule exists for the context in question, the second and subsequent rules are checked. Hence, in the first rule above, A becomes B when it occurs after a word boundary and before X, with the hash denoting a word boundary and the underscore serving as a placeholder for the symbol being rewritten. Should either side of the underscore be left blank, the context on that side is unrestricted; a rule with no contextual restriction is guaranteed to apply whenever it is encountered, thereby providing the default pronunciation for the grapheme defined (Chomsky and Halle 1968).

The alignment or mapping required for the segments or phonemes of a language largely depends on how well defined the mapping from orthography to pronunciation is for that particular language. While the relationship between orthography and pronunciation for a language such as English can be very complex, requiring a large number of rules and exceptions to those rules (Ladefoged 2005; Radford et al. 1999), the case is different for some languages, for example Spanish or Jamaican Creole. In Jamaican Creole, the behaviour of phonemes in relation to pronunciation and orthography, using a formal writing system such as CLP, tends to be somewhat more regular than in English (Cassidy and Le Page 1976/1980; Devonish and Harry 2004).

Recognising this almost one-to-one relation between pronunciation and orthography, we took the approach of using rewrite rules that were not overly complicated in the design of the JC synthetic voice.

Modern speech systems such as Festival, for the most part, make use of a combined grapheme-to-phoneme conversion approach, using both a lexicon and a rule set to provide the pronunciations for the language. The reasons behind this combined approach vary from developer to developer. For some developers, the LTS rule set or G2P algorithm provides the primary pronunciation guide for the language, while the lexicon serves as an exceptions dictionary handling cases that do not follow the rules defined in the LTS rule set (Black and Lenzo 2007). For other developers, the lexicon is primary, with the LTS rule set providing the fallback for entries not stored in the lexicon. Taylor proposes that an LTS rule set and a lexicon are "simply different techniques that trade off memory and processing cost". He further states that "both have their place and one is not a priori better, more sophisticated or even more accurate than the other". Additionally, in his view, the two approaches are not as distinctly separate as one might think, but instead represent different points on a continuum serving as the pronunciation guide for the language being modelled (2009, 208-210). According to the Festival documentation, a word list or lexicon, that is to say a dictionary, is often the easiest way to access the pronunciation of a word.

Within the Festival framework, which supports multiple lexicons at once, the structure of the lexicon is composed of three parts, all optional, namely: 1) a short addenda of hand-added words, 2) a compiled lexicon and 3) a method for dealing with words not found in either the addenda or the compiled lexicon (Black and Lenzo 2007). Each lexical entry provided within the addenda consists of (1) a head word, (2) a part of speech (POS) tag, and (3) a pronunciation guide which segments the word into syllables with their stress assignment (Black and Lenzo 2007). An example of the structure of such an addenda entry as applied within the Festival system is provided below. The word being defined is 'awb', the acronym of Alan W. Black, one of the developers of the Festival framework.

(lex.add.entry
 '("awb" n (((ei) 1) ((d uh) 1) ((b @ l) 0) ((y uu) 0) ((b ii) 1))))
(Black and Lenzo 2007)

The defined word is enclosed in quotation marks and is immediately followed by its corresponding part of speech (POS) tag, which is in turn followed by a breakdown of the word into syllables with stress assignment, 1 denoting primary stress and 0 denoting secondary stress or an unstressed syllable (Black and Lenzo 2007). Within the Festival framework, the letter-to-sound rule set is an indispensable part of the speech system which works in tandem with the phoneme

inventory previously defined for the language. Its primary function is to dictate the rules and environmental factors which govern the distribution of the phonemes of the language. The role of the LTS rule set or G2P conversion algorithm is to work in conjunction with the lexicon to convert the input text into an acceptable sequence of output phones (Taylor 2009). This input text may be a known word, that is to say one found in the lexicon or word list, or an unknown word not defined in the lexicon. The basic form of an LTS rule, as used within the Festival Speech Synthesis framework, follows Chomsky and Halle's (1968) grammar rewrite rules formalism:

( LEFTCONTEXT [ ITEMS ] RIGHTCONTEXT = NEWITEMS )

The interpretation of this rule is that if 'ITEMS' appear in the specified left and right context, then the output string is to contain 'NEWITEMS'. Within this rule set, NEWITEMS are written to a separate output location and therefore cannot feed further rules. Additionally, none of the positions LEFTCONTEXT, RIGHTCONTEXT or NEWITEMS is obligatory; any of them may be empty (Black, Taylor, and Caley 1999). One could aptly say that the CLP orthography, proposed for and currently used as the model for the official writing system of Jamaican Creole, provides an almost direct grapheme-to-phoneme conversion rule set for the

language, a situation somewhat similar to that of the Romance language Spanish, whose pronunciation can 'almost be predicted from its orthography' (Black and Lenzo 2007, Chap. 24). For example, in Jamaican Creole the grapheme /k/ represents only the voiceless velar plosive. Thus, although allophones result, for example the aspirated variety kh, a voiceless velar plosive phoneme in JC will always be rendered as such in its orthography. The case is quite different in some other languages, for example English, where a word such as c-a-k-e /k ee k/ has the voiceless velar plosive rendered in its orthography as both a 'c' and a 'k'. It is our belief, as demonstrated by the outcome of our JC voice, that the overall technique offered by the general Festival framework reduces the time one would be required to spend on a lexicon without impacting negatively on the output of the speech system. Within this framework, the synthetic voice builder adds "the most common words to the lexicon via simple dictionary format, explicitly giving their pronunciation by hand, and then automatically building letter to sound rules from the initial data" (Black and Lenzo 2007, 63). In the case of the Jamaican Creole synthetic voice building project, the synthetic voice created was domain-oriented. It was specifically defined by the target objective, namely a synthetic voice for a local navigation system; however, it had minimal coverage outside of its specified domain. The JC lexicon implemented within the system was tailored to fit the primary target domain. A Jamaican Creole domain-specific lexicon and letter-to-sound rules were provided

to denote pronunciation rules for the language. The lexicon was specific to the target domain for which the voice was built; however, the LTS rule set made it possible for words and utterances not previously recorded and stored within the corpus database to be produced as new utterances. The Letter-to-Sound Rule Set defined for JC can be referenced in Appendix 7.
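To make the shape of such rules concrete, the short sketch below shows how a handful of rules could be written in Festival's lts.ruleset form. It is a minimal, illustrative sketch only: the rule set name and the individual rules are simplified examples written for exposition, not the actual contents of the rule set given in Appendix 7.

;; Illustrative only: a few hand-written letter-to-sound rules in Festival's
;; lts.ruleset format ( LEFTCONTEXT [ ITEMS ] RIGHTCONTEXT = NEWITEMS ).
;; The name and rules are simplified examples, not the Appendix 7 rule set.
(lts.ruleset
 jc_example
 ( )                      ;; no letter sets are needed for this sketch
 (
  ( [ c h ] = ch )        ;; the digraph "ch" maps to the affricate ch
  ( [ a a ] = aa )        ;; doubled "aa" maps to the long vowel aa
  ( [ k ] = k )           ;; "k" always yields the voiceless velar plosive
  ( [ a ] = a )           ;; default: a single "a" maps to the short vowel a
 ))

As discussed above, the more specific rules (the digraph and the long vowel) are listed before the more general single-letter rules so that they are matched first.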

3.4. Assignment of Prosody in Corpus-Based Speech Synthesis Providing the speech system with the ability to find the correct pronunciation of the input text through the provision of a phoneme inventory, lexicon and letter-to-sound rule set is only one half of a very important text conversion coin. The other half that must also be addressed is in relation to obtaining intelligibility and naturalness for the speech output. To the extent possible, a high level of naturalness in synthetic speech can be achieved through a combination of a speech corpus and prosodic modification (Dutoit 2008). Inasmuch as it relates to the achievement of prosody for synthetic voices and how prosody is specified within the Festival framework, we must first recall the specific type of synthetic voice built within this project, namely a Cluster Unit Selection voice, essentially a corpus-based concatenative voice. Referring to Hunt and Black’s (1996) work on unit selection and concatenative speech synthesis, we draw attention to the rationale behind the usage of a corpus database for speech synthesis. In providing a speech database of pre-recorded units for our Jamaican Creole corpus-based synthetic speech, we already had at our disposal a

78 database from which we could extract and concatenate not just mere units but also their natural waveforms already encoded with prosodic information. Having this rich prosodic information readily available meant we were not required to do extensive signal modification to the database and could therefore retain more of the natural elements perceived in human speech, thereby improving the quality of the resulting synthetic speech output. Varying techniques exist to obtain prosodic information for the targeted synthetic speech. Developers often choose from amongst the current methods or combination of these methods to implement during voice building. Within the Festival framework itself, prosodic modification is normally achieved through varying methods or combination of methods. These methods include (i) the prediction of prosodic phrases, (ii) the identification of accent or boundary tone, (iii) the generation of fundamental frequency, (iv) the prediction of duration model for each language, (v) the identification of pitch marks with the addition of cepstral parameters as well as through (vi) the usage of data training algorithms (Black and Lenzo 2007). In relation to our JC Cluster Unit Selection-type based synthetic voice within the Festival framework itself, assignment of prosody was achieved by providing language specific information for (i) Phrasing Prediction, (ii) Duration Assignment, (iii) Prominence and (iv) Intonation and Fundamental Frequency (F0) (Taylor 2009; Black and Lenzo 2007; Sproat and Oliver 1999). Our manual prosodic modification was not overly extensive. By limiting the modification and manual manipulation of speech units, we were able to maintain more of the rich

79 natural prosody from the original database itself and thereby reduce speech signal degradation.

3.4.1. Phrase Boundaries and Phrasal Breaks

In regular speech, a speaker intuitively knows where and when to insert appropriate phrasal breaks. If a speaker attempted to utter continuous phrases for a long period without pause, the result would be a very breathless speaker, rather like a long distance runner who never paused for the entire duration of a marathon. Physically as well, vocal constraints enforce a maximum limit on how long a person is able to continue uttering speech before being required to pause or take a break. Human speech is made up of appropriately timed pauses grouped into intonational units which define the natural prosody found in human language. Such prosodic breaks are essential to speech communication, ensuring that the utterance produced is not only comprehensible to the listener but also relays appropriate prosodic information (Ladefoged 2005).

Currently there are two techniques supported by Festival for predicting prosodic breaks within a corpus or text. The first is a basic yet highly effective module based on a Classification and Regression Tree structure, henceforth CART; the second is a full statistical model trained from corpus data. As the second requires a much larger and more extensive database to effect an optimal training model, the default system-supported CART method was used in the JC synthetic voice creation process. The specific CART method used to predict phrasal breaks is based on a Deterministic Punctuation (DP) phrasing algorithm, whereby punctuation markers are used as indicators of phrase boundaries. Within this method, a "phrase break" is placed "at every punctuation mark" to indicate a required pause or boundary (Taylor 2009, 129-131; Black and Lenzo 2007). The Deterministic Punctuation phrasing algorithm is posited as providing a good starting point when building a voice for a new language, as it is relatively "rare that punctuations exists where there is no boundary". Hence the standard scheme phrase template file generated for the cluster unit selection Jamaican Creole voice during voice building set-up, uwi_navi_jc_phrasing, is a "good first step in defining a phrasal model for a new language" within the Festival framework (Black and Lenzo 2007, 64-65). Though arguably basic, Black and Lenzo report that its method of simple rules based on punctuation is a "good predictor of prosodic phrase boundaries" and has been used successfully in speech systems created for English and other languages (2007, 65). In choosing the system-provided DP phrasing algorithm, we therefore expected the same positive outcome for our JC voice. The template-generated phrase file defines two types of phrasal breaks, namely (i) B and (ii) BB. Phrasal break B defines a short break accompanying punctuation markers such as a comma (",") or space (" "), while phrasal break BB defines a longer break present after punctuation markers such as periods,

interrogation marks and exclamation signs ("." "?" "!"). Only minor modifications were made to the scheme file, an extract of which is provided below:

((lisp_token_end_punc in ("?" "." ":"))
 ((BB))                                   ;; indicate a longer pause/break
 ((lisp_token_end_punc in (" " "'" "\"" "," ";"))
  ((B))                                   ;; indicate a small pause/break
  ...))

After defining the usage of punctuation markers as indicators of phrasal breaks, the function lines (set! phrase_cart_tree uwi_navi_jc_phrase_cart_tree) and (Parameter.set 'Phrase_Method 'cart_tree) were used to install the phrasing CART tree within the system. Although this basic phrasal break predictor is quite reasonable for a starter or rudimentary system, it is, while acceptable for this purpose, somewhat minimalistic and under-specific in itself. While it is true that punctuation marks are apt indicators of phrase boundaries, we do have to account for those instances of written sentences and phrases where such markers are sparse, as can often be the case for many Creole languages. Black and Lenzo (2007) agree that the punctuation model, though extremely effective, is also under-predictive. They too recognise the need for additional information to provide better prosodic information. This would be

possible even without requiring a training set for the new synthetic voice being built. One option is to use function words, in conjunction with phrasal breaks, as boundary indicators (Sproat and Oliver 1999). This intonational phrasing method is based on a Deterministic Content Function (DCF) approach, whereby function words are defined as indicators of optimal phrasal breaks (Sproat and Oliver 1999; Black and Lenzo 2007). The words of any given language can be divided into two categories, namely content words and function words. Content words belong to an open set, since the vocabulary of a language can expand; words in this set are, in everyday terms, nouns, adjectives, verbs, adverbs, numerals, interjections and so on. Function words, on the other hand, belong to a closed set, that is, a list with a fixed number of items defined per language; words in this set are prepositions, pronouns, determiners, conjunctions, modal verbs, auxiliary verbs and particles (Gooden 2007; Sproat and Oliver 1999). Within the Festival framework, the possibility exists to use a combination approach, made up of punctuation markers and function words, in order to define a more appropriate phrasing model for the synthetic voice. This method works by extending the previously defined DP approach. Within this method, a

phrase break is assigned at the boundary between function words and content words, namely "every time a function word follows a content word" (Taylor 2009, 129-131). It therefore requires that the function words belonging to the particular language be defined as the "boundaries within strings of words" (Black and Lenzo 2007, 65-68). This definition is made up of probabilistic modelling using part of speech (POS) information combined with n-gram models to predict phrasal breaks (Taylor 2009). We implemented both approaches as part of our voice building process for JC. This was done by incorporating the DCF approach through the Part of Speech (POS) tagger scheme template, uwi_navi_jc_tagger. Examples of some Jamaican Creole function words defined within the Festvox-generated module uwi_navi_jc_tagger are listed below; a sketch of how such a list might be declared follows the examples.
● Conjunctions: an 'and', bot 'but'
● Determiners: di 'the', dem-ya 'these', sekan 'second', likl 'a little'
● Prepositions: chuu 'through', uova 'over', pan 'on'
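The sketch below shows one way such a function word list could be declared. It is illustrative only: the variable name uwi_navi_jc_guess_pos is ours, chosen to mirror the english_guess_pos convention found elsewhere in Festival, and the class labels and word coverage simply restate the examples above rather than the full contents of the uwi_navi_jc_tagger module.

;; Illustrative only: a guess_pos-style list of JC function words,
;; mirroring Festival's english_guess_pos convention; the variable name
;; and the word coverage are examples, not the generated tagger module.
(set! uwi_navi_jc_guess_pos
      '((cc an bot)                   ;; conjunctions: an 'and', bot 'but'
        (det di dem-ya sekan likl)    ;; determiners
        (in chuu uova pan)))          ;; prepositions

Any word not found in such a list would then be treated as a content word for the purposes of the DCF phrasing rule described above.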

Although we have defined and provided a list of JC function words to be used, predicting phrasal boundaries for the Jamaican Creole voice using POS and n-gram models is an aspect of this research for which much more can be done. In fact, a substantial amount of work remains in this regard, and thus the definition of phrasal boundaries at this stage is heavily dependent on the basic Festival CART module previously defined.


3.4.2. Duration Assignment

Merely specifying the phonemes to be used in the JC synthetic voice creation process, as well as defining phrasal breaks, was not enough. As part of the prosodic modification required for our JC synthetic voice, we also had to specify the duration assignment for each JC segment to be used in the synthetic voice, in essence deciding how long to make each one (Sproat and Olive 1999). One of the least complicated ways to provide the requisite duration models is to provide a fixed duration for each segment to be used in synthesis (Black and Lenzo 2007). The Festival framework provides a foundation through the use of specified scheme file templates. Three definitions are required in providing segment duration, namely (1) a duration modelling method, in this instance zscores, (2) a hand-specified tree used to predict the zscore durations and (3) the average duration for each phone (phoneme) in use. Using recordings of the Jamaican phoneme inventory carried out by the Jamaican Language Unit (2010, Audio Track 1) and Praat, an open-source acoustic analysis software, we identified, extracted and assigned durations for the JC segments to be used in our JC synthetic voice building process.


Figure 3.5

Jamaican Creole Segments

As digraphs /gy/, /ky/ and /ny/ were not included in the original JLU recordings, we extracted the durations for these through contextual recordings of the words gyambl ‘gamble’, kyaahn ‘can’t’ and nyam ‘eat’. In order to define the average for each JC phoneme, we selected the approximate start and end points of the segment displayed in the acoustic signal window above, zoomed in to observe the details and then proceeded to record the duration in seconds which was

measured, rounding to three decimal points. Below is the extracted representation for the phoneme /a/, out of context:

Figure 3.6

Duration for JC Phoneme /a/ using Praat

The format of recording duration information for segments in Festival is of the type segname 0.0 average, with segname denoting the name of the segment, 0.0 denoting mean and average denoting standard deviation (Black and Lenzo 2007). For silence, specified by default as pau (short for pause) within the Festival framework, we took three random sampling of pauses and computed the mean score, using this to identify the average length of a silence or a pause within the JC context. Below is an extract of the duration assignment for JC phonemes:

'(
  (pau 0.0 0.250)
  (a   0.0 0.080)
  (aa  0.0 0.080)
  (an  0.0 0.090)
  (ai  0.0 0.080)
  (b   0.0 0.060)
  …
  (zh  0.0 0.110)
 )
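The arithmetic behind these entries is straightforward. Under the zscore method, a segment's predicted duration is the stored mean plus the predicted zscore multiplied by the stored standard deviation; with the mean fixed at 0.0 and the measured average placed in the standard-deviation slot, a predicted zscore of 1.0 simply reproduces the measured average. The helper below illustrates that calculation only; the function name is ours and is not part of the generated voice files.

;; Illustrative only: turning a predicted zscore back into a duration for
;; an entry of the form (segname mean stddev), e.g. (a 0.0 0.080).
(define (zscore->duration mean stddev zscore)
  (+ mean (* zscore stddev)))

(zscore->duration 0.0 0.080 1.0)   ;; => 0.080, the measured average for /a/
(zscore->duration 0.0 0.250 1.0)   ;; => 0.250, the average silence (pau)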

3.4.3. Defining Intonation and Fundamental Frequency Parameters

There are many intonation theories and models used in modern-day speech synthesis systems. The most widely used approaches can all be viewed, to some extent, as variations of the more traditional methods defined within (1) the Dutch School, (2) the INTSINT model, (3) the British school, (4) the Fujisaki model or (5) the Autosegmental/metrical and ToBI models. Thus the principles provided in the definition of intonational models which are used to generate F0 contours may be based on (1) stylisation of F0 contours, (2) a sequence of labels, (3) dynamic features such as rise and fall, (4) the phrase and the accent or (5) a dual approach that is in part ... paralinguistic and ... extra-linguistic (Taylor 2009, 236-261).

The default intonation structure defined for cluster unit selection voice synthesis within the Festival Speech Synthesis system operates within a characterisation that differentiates between accented and non-accented syllables, and that predicts accents on stressed syllables in content words. This classification provides nominal accent times that are later converted and used to supply Fundamental Frequency (F0) values, which are in turn used to compute the required F0/pitch contour pattern for the system. Although the option exists within the Festival system to provide a more elaborate hand-specified tree, the simple yet robust and highly effective default CART tree and a hat accent structure were used to define modulation information within the new JC cluster unit selection synthetic voice. Within the accent cart structure provided as part of the basic intonation structure definition, a syllable is marked as Accented if it carries stress (a stress value of 1) and its word's POS is listed as content, as in:

((R:SylStructure.parent.gpos is content)
 ((stress is 1)
  ((Accented))
  ((NONE))))

(Taylor 2009; Black and Lenzo 2007, 69-70). Having thus predicted accent types, as well as defined the means by which these accents generate and compute the requisite F0 values within the Intonation module, the system then used this output information to build the appropriate pitch/F0 contour model by referencing the various scripts defined within our uwi_navi_jc_f0model module. The methods defined within the Festival system for predicting intonation and generating F0 contours for new speech synthesizers are general and can be used as defined, without the need for additional definitions, when building new

synthesizers. Within the framework itself, three basic Fundamental Frequency (F0) approaches are implemented, specifically (i) F0 by rule, (ii) F0 by linear regression and (iii) F0 by Tilt. Of the three approaches, F0 by rule is presented by the authors of Festival as the "most general" target method (Black and Lenzo 2007, 69-74). It is this F0 by rule approach that was implemented within the default structure generated by the system for the definition of the Jamaican Creole cluster unit selection synthetic voice. Within this basic approach, target points were predicted for each syllable; in addition, trees predicting start, mid and end points in each syllable were calculated for each accent.

(define (uwi_navi_jc_targ_func1 utt syl)
  "(uwi_navi_jc_targ_func1 utt syl)
Simple hat accents."
  (let ((start (item.feat syl 'syllable_start))
        (end (item.feat syl 'syllable_end))
        (ulen (item.feat (utt.relation.last utt 'Segment) 'segment_end))
        nstart nend fustart fuend fstart fend)
    (set! nstart (/ start ulen))
    (set! nend (/ end ulen))
    (set! fustart '130)
    (set! fuend '110)
    (set! fstart (+ (* (- fuend fustart) nstart) fustart))
    (set! fend (+ (* (- fuend fustart) nend) fustart))
    ...))  ;; remainder of the function (construction of the target list) omitted
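To make the effect of this declining baseline concrete, the sketch below applies the same linear interpolation outside the voice modules; the helper name is ours, and the values simply restate fustart = 130 Hz and fuend = 110 Hz from the extract above.

;; Illustrative only: the linear interpolation used above, mapping a
;; normalised position in the utterance (0.0 to 1.0) onto an F0 target.
(define (baseline-f0 fustart fuend position)
  (+ (* (- fuend fustart) position) fustart))

(baseline-f0 130 110 0.0)   ;; => 130, target at the very start of the utterance
(baseline-f0 130 110 0.2)   ;; => 126, a syllable starting 20% into the utterance
(baseline-f0 130 110 1.0)   ;; => 110, target at the end of the utterance

In other words, each syllable's start and end targets are read off a straight line that falls gradually from 130 Hz at the beginning of the utterance to 110 Hz at its end, giving the declining F0 baseline on which the hat-shaped accents sit.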

The intonation range and standard deviation for synthetic voices are fixed and defined on a speaker basis; hence changes and modifications to pitch are required for different speakers when implemented within a system. Within the default voice template files generated for Fundamental Frequency, the speaker mean and range were set at default values of 105 and 14 respectively. Using the Praat acoustic analysis software and a five-minute Jamaican Creole recording made by our JC female speaker, we were able to determine the average F0 parameters to be used for the new JC voice. Based on the recording, the F0 minimum or floor was determined to be 120 Hz, the F0 maximum or ceiling 500 Hz and the standard deviation 62.16 Hz. Based on the acoustic information derived from our analysis of pitch settings, the F0 mean was determined to be 234.499 Hz and the range 419.3 Hz; hence the speaker average F0 and range for the new JC voice were set to 235 Hz and 419 Hz respectively. The model F0 mean and range, set at 170 and 34 respectively, were left unchanged.
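In the generated template these speaker and model statistics are exposed as a small parameter list. The sketch below is illustrative only: the variable name is hypothetical and follows the Festvox template convention, the slot names are those used in the Festival/Festvox documentation (target_f0_mean, target_f0_std, model_f0_mean, model_f0_std), and we are assuming that the measured JC values reported above are the ones placed in the two target slots.

;; Illustrative sketch only: speaker-specific F0 parameters for the JC voice,
;; using the values reported in the text; the variable name is hypothetical.
(set! uwi_navi_jc::int_lr_params
      '((target_f0_mean 235)   ;; measured speaker mean F0 (Hz)
        (target_f0_std  419)   ;; range value adopted for the JC voice (Hz)
        (model_f0_mean  170)   ;; model defaults, left unchanged
        (model_f0_std    34)))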

3.5. Speech Signal Analysis for the Jamaican Creole Synthetic Voice

Signal processing and waveform generation are the backbone of Concatenative Speech Synthesis. In the case of the second generation, data-driven speech synthesis technique used for this study, signal analysis is done through a two-step approach, namely (1) pitch marking and pitch mark extraction and (2) coefficient building.

91 3.5.1. Pitch Mark Extraction Ladefoged speaks of the disturbance that results from speech and the air pressure that is depicted as flowing in an up and down movement during speech. This essentially creates a wave-like motion with corresponding fluctuations being ascribed to this signal (2005, 7-9). Within this disturbance and air pressure, one is able to retrieve so-called pitch-related information. Pitch, long defined as the rate of vibration of the vocal folds, has the ability to not only increase in frequency thus prolonging a syllable and causing what we term stress in words but also to cause differences in tone. Such differences in tone can be observed in some of the world’s languages, for example Cantonese, where pitch is used in a lexically contrastive manner, producing differences in word meanings (Ladefoged 2005). In Jamaican Creole, as well as many other languages, pitch is used in a manner quite different from what we have just described for Cantonese. For the most part, it is normally used to “reflect speaker emotion”, as well as to produce intonational patterns which are used to distinguish and identify “clause structure and other grammatical aspects of sentences” (Ladefoged 2005, 24). Gooden (2007) defines stress in JC as being phonologically contrastive, with the stress system of JC being weight sensitive. In Jamaican Creole, as Gooden further observes, coda consonants, long vowels and diphthongs, all contribute to syllable weight. The contributors are highlighted in bold in the following examples:

● Coda consonants: 'eng.ka 'loiter, begging to be noticed'
● Long vowels: ku.'muu.jin 'mean, miserable'
● Diphthongs: 'pie.pa 'paper'

In the majority of JC words, whether they are bi-, tri- or quadra-syllabic in nature, primary stress is normally allocated to the heavy syllable, that is the one bearing any of the above noted features; with secondary stress normally falling two syllables away from the main stress. In a case where no heavy syllable exists, primary stress is then normally allocated to the penultimate syllable, for example pu.'pa.lik ‘cartwheel’ (Gooden 2007). In the corpus designed for our JC synthetic voice, the majority of the utterances can be grouped under the category of broad focus statements. According to Gooden (2007), utterances which fall under what is termed broad focus statement in Jamaican Creole, normally present with a rise or high early in the phrase, an optional rise/high in the middle and a fall at or near the end of a phrase, where it is marked by a L% boundary tone. L% in this instance is used to signal a low boundary tone. There are different types of pitch accents which are to be found in Jamaican Creole, such as the high accent H*, the low accent L* and the rising pitch accents L* + H and L+H*. However, as noted by Gooden (2007), of all the varying types of pitch range manipulations which have been noted in the language, the most prevalent pitch accent is the falling accent H + L*, which is normally present in lexical, compound and reduplicated words.

Following closely the structure proposed for the Festival framework, adhering to the rules governing syllabification and stress assignment in Jamaican Creole and drawing on Gooden's (2007) thorough and well-documented research on pitch and stress in Jamaican Creole, we set about assigning stress values for the JC lexical entries. All entries were assigned stress values of 1, denoting primary stress, and 0, denoting secondary stress or unstressed. The lexical entries below are samples from the Jamaican Creole synthetic voice.

(lex.add.entry '("a go" vb. aux. dial (((a) 1) ((go) 0))))
(lex.add.entry '("anchuuriyom" n (((an) 0) ((chuu) 1) ((ri) 0) ((yom) 0))))
(lex.add.entry '("arieliya" n (((a) 0) ((rie) 1) ((li) 0) ((ya) 0))))
(lex.add.entry '("avinyuu" n (((a) 0) ((vi) 0) ((nyuu) 1))))
(lex.add.entry '("beguoniya" n (((be) 0) ((guo) 1) ((ni) 0) ((ya) 0))))
(lex.add.entry '("botakop" n (((bo) 0) ((ta) 1) ((kop) 0))))
…

As we progress in better understanding pitch and how it is portrayed in Jamaican Creole, we move to a more in-depth knowledge of the pitch mark and its relevance in the Corpus-based voice we built. When reference is made to a

94 pitch mark, this is in connection to the beginning of the pitch period, a specific short burst of energy, normally found within voiced segments in speech signal, which is more aptly depicted visually as a peak (Ladefoged 2005). In an attempt to illustrate this more clearly, we use Praat, in Figure 3.7 below to present the waveform and corresponding pitch information including pitch markings for a sample speech segment extracted from our synthetic voice.

Figure 3.7

Speech Segment Representing navi003.wav – 3 /chrii/

In the above figure, information relating to pitch analysis and phonetic measurements is represented in the upper half of the figure, enclosed within the

95 rectangle. In this section the waveform representation can be seen by the sharp straight lines running throughout the duration of the audio recording segment. In the lower half of the image; the spectrogram, the average pitch of the sound file selection is displayed at 188.3 Hz. Using Praat to perform a voice report analysis on the audio recording, we were able to determine a total of 52 pulses or peaks in this audio recording with the total number of periods or pitch marks being 51, minimum pitch at 177.796 Hz and maximum pitch at 226.740 Hz. In text-to-speech synthesis, correctly defined pitch marks are highly important in ensuring smooth joins and an acceptable quality in the final synthetic output. Thus the intention is to ensure enough good pitch marks can be extracted from the audio sample. Pitch marks can be extracted (1) directly from the speech recording on hand, (2) through the usage of an electroglottograph (EGG) signal or (3) automatically (Black and Lenzo 2007; Taylor 2009). The Festival framework provides among its many scripts and templates two relatively brief scripts for pitch mark extraction. One script is specifically engineered toward extracting pitch marks directly from the speech recording on hand, that is from the waveforms previously recorded for each prompt, and subsequently creating for each waveform file its corresponding pitch mark file (bin/make_pm_wave wav/*.wav). The other is for generating pitch mark information from electroglottograph signal or laryngograph (LAR) files (bin/make_pm). Having already prepared audio recordings for each prompt defined within the JC database, the decision was taken to proceed with the first

96 option, namely to extract pitch mark information from the speech recordings stored within the voice building framework and system. The pitch mark program used for both scripts is aptly titled pitchmark, which works primarily through the usage of defined high and low frequency cut off points and autocorrelation. As this program is housed within the EST distribution, it is vital that the most current EST distribution is used, ensuring that it is properly installed and configured with its environment variable ESTDIR set to point to the correct speech tools local voice building directory. Black and Lenzo (2007) refer to other crucial voice building specifications which may be defined by the researcher with respect to the varying scripts offered within the platform. Before running the Festival script, bin/make_pm_wave wav/*.wav, some minor yet crucial modifications to pitch mark parameters were required. In the case of the JC voice built, one such modification was the configuration of the default parameter arguments of the Festival pitch mark script to represent an average female speaker.

PM_ARGS='-min 0.0033 -max 0.07 -def 0.01 -wave_end -lx_lf 300 -lx_lo 111 -lx_hf 140 -lx_ho 51 -med_o 0'

In the above composite script lines, parameter information on (i) min, (ii) max, (iii) fill, (iv) def, (v) wave-end, (vi) high pass filter, (vii) high pass filter order, (viii) low pass filter and (ix) low pass filter order are depicted. The default

values provided for the speaker voice range were set to 0.0033 and 0.07 respectively. The values for the high pass and low pass filters were set to -lx_lf 300 and -lx_hf 140, and the parameter for unvoiced sections was left at its default value, def 0.01. In defining the speaker voice range as well as the frequency cut-off parameters to be used in the synthetic voice, special care was taken during the definition of the arguments and the assignment of values, because the values chosen directly influence the number of pitch marks or periods extracted and therefore have a direct impact on the quality of the resulting synthetic voice. Care had to be taken in the case of the JC voice to ensure that the defined values were set neither too high nor too low, as too many or too few pitch marks would have been the result. Equally, the quality of the pitch marks output from this phase was crucial to the next phase of the voice building process: the building of the coefficients used to calculate the required acoustic distances for the synthetic voice.

3.5.2. Modification of Pitch and Timing There are several techniques that exist for performing signal-processing modification during Speech Synthesis, all of which have their strengths as well as their weaknesses. Some of these techniques include Pitch Synchronous Overlap and Add (PSOLA), Residual-Excited Linear Prediction (RELP), Sinusoidal

98 Models, MBROLA and Synthesis from Cepstral Coefficients. All five techniques are explored in depth by Taylor, who provides a concise summary of each technique, exploring their pros and cons (2009, 412-434). In the case of the second generation data-driven concatenative approach used for our JC synthetic voice building, the required signal analysis of the concatenated waveforms within the Festival framework was achieved in an overly simplistic yet highly effective fashion. One of the techniques used by Festival is to group similar units found within the speech database. This is considered by many as not being purely second-generational in nature but which Taylor ascribes as being effective and doing “the same job as pure second-generation technique” (2009, 429). This particular method to which Taylor referred and which was used to create acoustic representations of the JC speech units is based on MelFrequency Cepstral Coefficient (MFCC). Referencing the Festival script, bin/make_mcep wav/*.wav, we applied the steps and calculations offered to provide the requisite acoustic parameters and generate the coefficients:

SIG2FV=$ESTDIR/bin/sig2fv
#SIG2FVPARAMS='-coefs melcep -delta melcep -melcep_order 12 -fbank_order 24 -shift 0.01 -factor 2.5 -preemph 0.97'

SIG2FVPARAMS='-coefs melcep -melcep_order 12 -fbank_order 24 -shift 0.01 -factor 2.5 -preemph 0.97'


$SIG2FV $SIG2FVPARAMS -otype est_binary wav/$fname.wav -o mcep/$fname.mcep -pm pm/$fname.pm -window_type hamming

The statistical information found within this mentioned script is better illustrated using the below process outline showing modification through cepstral analysis adapted from Taylor (2009, 442).

speech → pre-emphasis → DFT → mag → mel-scale filterbank → log → DCT → liftering → coefficients

Figure 3.8

Modification of Pitch and Timing Through Cepstral Analysis

Using a three step approach this Mel cepstrum technique is used to create acoustic parameters of all the similar units within the JC speech database. These three steps include (1) the removal of the tilt normally found within the spectrum, followed by (2) the smoothing of the said spectrum and mel-scaling and finally (3) the removal of higher cepstral coefficients. By performing these three successive steps, the coefficients which are required to perform clustering and join measurements for signal processing are extracted and built pitchsynchronously, which is to say without “performing any explicit source-filter

100 separation” (Black and Lenzo 2007; Taylor 2009, 415, 442). Taylor provides a brief insight as to the justification for choosing this signal processing technique over others, stating that MFCCs by the very statistical independence of coefficients are “highly amenable to robust statistical analysis” (2009, 430). Within the Festival framework, Pitch Mark extraction and Cepstral Analysis form a two-step approach to achieving requisite signal processing requirements that must be done in sequence. Any attempts to perform cepstral analysis and extract cepstral parameters before performing the extraction of pitch marks (from the recordings in this instance) would merely result in an error message when running the voice building commands. Successful and accurate cepstral analysis is wholly dependent on the pitch marks extracted during the Pitch Mark Extraction phase outlined previously and achieved through pm_wave. The fact that Cepstral Analysis is performed in this pitch synchronous manner and is itself part of a two-step process to achieving requisite pitch marking required for synthesis, could in itself be considered by some as a limitation. It was this very point that may have led to the energy contour for pitch marking prosodic modification approach for concatenative speech synthesis method proposed by Ewender and Pfister (2010). Within this approach both short term energy and fundamental wave features are combined to acquire both reliable and highly suited pitch marks for synthesis.


Chapter 4: Evaluation of Jamaican Creole Synthetic Speech

4.0. Introduction

Having successfully generated synthetic speech for Jamaican Creole using a widely recognised open source voice creation toolkit, the real question was just how effective we were. Did we succeed in generating quality synthetic speech for Jamaican Creole using open source software, and just how could we accurately rate or define our success? In the case of the Jamaican Creole synthetic voice creation project that we undertook, we had three primary objectives, namely:
1. To create quality, domain-specific, synthetic speech in the language;
2. To accurately document the synthetic voice building process;
3. To put the synthetic voice to the test, using instrumental feedback from prospective users to improve the resulting synthetic output.
The perspective of the developer or researcher is highly biased and a true evaluation of success could only be ascertained by putting the voice in the field.

4.1. Key Role of the Evaluation of Synthetic Speech

Evaluation in general allows for the verification and validation of results obtained in relation to an objective or objectives previously set forth. In seeking to evaluate synthesized speech or a new synthetic voice, there are several factors to consider, some of which may be linked to the notion of speech perception

(Radford et al. 1999), others to the very obvious question of what exactly it is we were seeking to evaluate or assess. TTS evaluation exercises are dependent on the researchers' objective and may focus on the overall system, specific aspects of the system, or the actual synthesized output itself. Having a specific evaluation objective in mind is useful in helping one to determine the course of evaluation to undertake for any given project.

4.1.1. Current Evaluation Methods

Whilst much has been written on the topic of which synthesis technique provides better quality synthesized speech output, it has long been acknowledged that far less has been written on the formal evaluation of speech synthesis systems (Taylor 2009). Indeed, there are many theories regarding how and what to test for, but no internationally accepted standard metric to follow. Taylor (2009), Lampert (2004) and van Heuven and van Bezooijen (1995) write on the proposed taxonomy of speech synthesis evaluation techniques. On one hand, there is black box testing, or testing of the system as a whole, whereby testing is conducted in the abstract, without any concrete knowledge of how the system actually functions. On the other hand, there is glass box or modular testing, where the performance of specific speech synthesis components and modules is actually taken into consideration. This taxonomy and its dichotomies, adapted

from van Heuven and van Bezooijen (1995), are represented in the figure provided below for reference.

Figure 4.1

Dichotomy of Black Box-Glass Box Testing

There are several methods advocated for and currently being used for the effective evaluation and assessment of synthetic speech output and speech systems. It is worthwhile to mention, however, that the choice of which method or combination of test methods is eventually used will tend to be primarily determined by varying factors. Some factors taken into consideration to determine which test method or combination thereof to use may include the following:

1. The primary purpose of the evaluation itself;

2. The synthetic voice building technique used throughout the project or development, for example synthesis by rule or synthesis by concatenation;
3. The particular aspect or combination of aspects being evaluated, for example front end processing, quality, accuracy, intelligibility, naturalness or suitability;
4. The researchers' or developers' preference for one method or combination of methods over others;
5. Methods given higher credence in industry and research;
6. The actual usage or application type for the synthetic voice;
7. The end user.
(Campbell 2007; Santen et al. 1998; Rosson and Cecala 1986; Cryer and Home 2010; Klatt 1987).
Some of the more commonly used techniques for evaluating speech systems and synthesized speech may incorporate global or analytic approaches and may include linguistic, acoustic and other aspects. These other aspects may relate to the complete system itself, the particular synthetic voice built, as well as the target application; we include appropriateness among these aspects. Of the evaluation tests in use, some of the more universally used and accepted are Multidimensional Scaling (MDS), Mean Opinion Score (MOS), Semantically Unpredictable Sentences or SUS Test, Haskins Sentences, Comprehension Tests, Intelligibility of Proper Names, Prosody Evaluation, Diagnostic Rhyme Test

105 (DRT), Modified Rhyme Test (MRT), Comparison Tests and Competitive Evaluation Test. Tests may be carried out with specific reference to the sentence level, phrase level, word level or phoneme level. Each technique and test mentioned in the preceding paragraph seeks to test different aspects of the synthetic speech output, ranging from perception of naturalness and quality to actual performance and usage. In this regard, researchers, depending on the primary objective may tend to give more credence to one technique over the other (Cryer and Home, 2010; Taylor, 2009; Campbell, 2007; Santen et al., 1998; van Heuven and van Bezooijen 1995; Klatt, 1987; Rosson and Cecala 1986). Multidimensional Scaling (MDS) tests seek to rate the global quality of the utterance whereby listeners are asked to rate pairs of stimuli in terms of naturalness. In Mean Opinion Score assessments, listeners rank the overall speech quality of utterances on a 5-point Likert scale. Semantically Unpredictable Sentences and Haskins Sentences are both sentence level tests that use grammatical yet meaningless sentences to evaluate speech comprehension at the sentence level. Haskins Sentences may also test comprehension at the word level. Comprehension Tests may be seen as an extension of the previously mentioned sentence level tests. They are used to evaluate the understanding of the meaning of the complete utterance, either a few sentences or paragraphs, through question and answer responses, rather than the recognition of specific phonemes or words. Proper names and names in general, such as names of persons and street names, to

106 mention a few, may have multiple pronunciations. The Intelligibility of Proper Names is a modular diagnostic evaluation type assessment that seeks to evaluate the TTS pronunciation of proper names. In Diagnostic Rhyme Testing, isolated words are used to test for intelligibility of consonants in initial position. Error rates are then averaged to provide the results. Although a commonly used method, its major limitations include the testing of only initial consonants, as well as the lack of vowel testing and prosodic features. Modified Rhyme Test seeks to extend DRT, however while it tests consonants in both initial and final positions, the other limitations highlighted for DRT still exist. Comparison Tests are used to measure acceptance of the complete system by comparing stimuli across different synthesizer versions. In Competitive Evaluation testing, different research groups run their systems on a common test data; the results of which are made public (Cryer and Home 2010; Taylor, 2009; Campbell, 2007; Santen et al., 1998; van Heuven and van Bezooijen 1995; Klatt, 1987; Rosson and Cecala 1986).

4.2. Evaluation Design

In seeking to design the evaluation process to assess the JC synthetic speech, the following questions were proposed as reference points:
i. What methods of assessment should we implement for the JC voice?
ii. How should we measure the accuracy, quality, naturalness and intelligibility of the Jamaican Creole synthetic speech?
iii. What benchmark standards should we observe or choose as reference for the Jamaican Creole synthetic speech output?
iv. How much should we depend on subjective analysis to determine and quantify the measure of success in relation to the Jamaican Creole synthetic speech?

In the case of the Jamaican Creole synthetic speech created within the Festival framework, evaluation was performed with the following four objectives in mind:
o Objective 1: Confirm the acoustic accuracy of the resulting synthetic speech through speech analysis;
o Objective 2: Obtain and assess user-based feedback on the output speech in relation to quality, intelligibility and naturalness;
o Objective 3: Propose a benchmark standard for JC synthetic speech;
o Objective 4: Document the limitations of the synthetic voice created and the framework used to complete the study.
The following sections outline and present the design process, the instruments used and the findings.

4.3. Evaluation Process

The evaluation of the JC synthetic voice created using open source software comprised both laboratory and field tests. The initial assessment was objective and was carried out in laboratory settings by the researchers. The second assessment was subjective and included user-based feedback on the JC synthetic speech. In relation to the objective evaluation of the JC synthetic speech, our primary objectives were to (1) measure the acoustic quality of the resulting speech, (2) confirm the accuracy of the phoneme inventory and (3) analyse the similarity, or lack thereof, of F0 curves in relation to prosody. The objective component of the assessment was performed by comparing and analysing the resulting synthetic speech samples and human speech sample recordings from the speech database. Praat software was used to perform the speech analysis for this assessment. The subjective user-based evaluation component of the assessment included (1) a pilot assessment that was administered to a closed group of 30 participants and (2) the formal evaluation, which was open to the public. We will first present the objective assessment, followed by the user-based subjective evaluation.

4.4. Presentation of Objective Assessment Within the Festival Speech Synthesis framework and Festvox voice creation toolkit, there are various components that are used during the creation of synthesized speech or speech systems for new languages. Some of these components may be altered without adversely affecting the overall system, whereas modification of others may affect varying sub or dependent components.

109 In the creation of synthetic speech, whether this is being carried out within the Festival framework or not, we strongly believe in and do recommend the notion of timely evaluations on a regular basis. Ongoing evaluation of the varying components and or of specific modules during the actual speech synthesis voice building process is as equally important as those tests carried out to assess the final synthesized speech outcome once the process is completed. Above all else, we believe ongoing evaluation will assist the researcher(s) and developer(s) to identify errors, effectively troubleshoot and make modifications or corrections as required before proceeding to other stages.

4.4.1. Ongoing Assessment during Voice Building In the case of the Jamaican Creole synthetic voice building project, we concentrated on three specific modules within the Natural Language Processing module during ongoing assessment. These components were (1) the Phoneme Inventory Module, (2) the Letter-to-Sound Rule set or Grapheme-to-Phoneme Conversion Module and (3) the Tokenizer Module. As these modules are interdependent, we believed they would help us to better assess whether specific language components were being properly accessed during the creation of synthetic speech for Jamaican Creole. Some of the information included in the evaluation may have already been presented during the individual presentation of these three components; however they are restated here in summary format as they form a part of the evaluation analysis and our overall findings.

To illustrate and summarise how ongoing evaluation was achieved, we present three sample errors received during Jamaican Creole voice building within the Festival framework. Using these errors, we demonstrate how we were able to evaluate our system on an ongoing basis and thus perform effective troubleshooting. If the system was not able to find information which accurately corresponded to aspects already defined within the system for the new language, and specified, say, within another module, then errors would be generated. Identification of these errors implied the need for new definitions or the modification of modules or definitions already specified. In the first two of the sample errors provided below, the system had failed to locate requisite information as pre-defined for the new voice being built, notably (i) the pronunciation of a JC digit defined within the lexicon, (ii) duration information and (iii) incorrect linking of a specified variable. Our findings later revealed that modification of rules specified within the G2P and Tokeniser modules was required in order to ensure that the requisite information could be accessed in the required format when later needed by the system, so that the voice building process could flow. These errors highlighted two crucial pieces of information for us: firstly, that there were aspects requiring fixes before we could proceed to the next step, and secondly, that the module specified for the processing of Jamaican Creole digits was indeed working. Indeed, failure to find the pronunciation of a specific digit which we knew we had specified in the database was validation that the system was attempting to access our Jamaican Creole modules at runtime as opposed to the default English modules.

Figure 4.2

Sample Error 1

Figure 4.3

Sample Error 2

112 The case of the third error below is somewhat similar to that which we have outlined previously. This error was in direct relation to the G2P inventory specified and the Jamaican Creole affricate digraph /ch/. Although we believed we had correctly specified the noted digraph and its relevant features within this module, the system still returned an error whenever it was encountered and attempt was made to use at run time. All instances of the affricate when found were returned as two separate phonemes instead of one digraph as specified. Realising an error could exist in our definition, we were forced to review, troubleshoot and make subsequent modifications to its feature matrix in order for it to be accepted and used correctly by the system during voice building.

Figure 4.4

Sample Error 3

113 It was imperative that the phoneme inventory, grapheme to phoneme conversion modules and the tokeniser were properly defined. This was necessary in order to successfully generate the list of utterances and prompts defined, to accurately build the required utterance and to achieve pronunciation modelling that had a direct correlation with the specified input language. Additionally in order to ensure correct information was being passed accurately from one module to the next, in the format required, care had to be taken during the definition of modules coupled with in-process assessment and troubleshooting as soon as issues were identified. Failure to define these modules accurately could and did result in errors, as seen in the three examples presented above. If left unresolved, these errors would in turn affect the quality of the information passed to the next module; even the final output. Additionally a failure to resolve these when identified by the system could and did initially result in other errors being generated at other stages of the voice building process. The end result if these and other errors were not properly reconciled would be an incorrect speech output at the end of the process, or the inability to proceed to subsequent modules. Based on the information we have just presented, one could safely assume that in all instances an overt error message would be provided to signal troubleshooting and fixes were required. There were instances during voice building when all the processes appeared to be working normally and all modifications had been successfully implemented. The assumption that

114 everything was working well could be made because no alarm sounded, no dialogue box appeared, no error message loaded in terminal mode. Unfortunately, this was not always the case. Indeed, there were occasions when the only time the researchers were made aware that a certain process had not been successfully implemented was at the very end, based on the output or failure to generate an output. This in itself could prove to be a deterrent but with open source software, this is quite often a regular occurrence. In such instances, counting on the researcher to recognise that the final output did not match requirements, the only route was to back trace, pinpoint the possible area requiring fixes through the use of regular expressions and queries and make the necessary modification(s). If the above failed, the next step would be to reach out to community groups and forums or the system developers themselves. This was a process that we too had to take during the course of our voice building project. It was also by embarking on this route that we discovered that although the Festival Speech System offered a limited domain component, which had been used to create domain specific synthetic voices; it was not initially created for and did not allow for new non-English based voices to be implemented.

4.4.2. Assessing Acoustic Accuracy and Prosody Generation

In this section we used sample audio recordings from the speech database as well as their synthesized counterparts to measure and compare acoustic content

of the corpus and its synthesized counterpart. This was performed in an attempt to verify the correct identification of pitch marks, their subsequent extraction and usage within our JC synthetic speech, as well as the transfer of natural prosody from the recorded corpus to the generated speech output and the overall generation of prosody in synthesized speech. Using Praat once again to perform the acoustic analysis and to account for mean square error, we compared, analysed and rated the acoustic and prosodic output of the synthetic output data versus the source corpus data.
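For reference, the mean square error between two pitch tracks is simply the average of the squared differences between corresponding F0 values. The sketch below illustrates the calculation in plain Scheme on invented numbers; it is not the Praat procedure used for the analysis.

;; Illustrative only: mean square error between two equal-length lists of
;; F0 values (Hz); the sample values are invented for the example.
(define (sum-squared-diff xs ys)
  (if (null? xs)
      0
      (+ (* (- (car xs) (car ys)) (- (car xs) (car ys)))
         (sum-squared-diff (cdr xs) (cdr ys)))))

(define (mean-square-error xs ys)
  (/ (sum-squared-diff xs ys) (length xs)))

(mean-square-error '(195 200 210) '(196 198 213))   ;; => 14/3, about 4.67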

Navigation Prompt 1 (navi26) – 300 miita ‘three hundred metres’

Figure 4.5

Waveform of Natural Speech from JC Speech Corpus


Figure 4.6

Waveform of JC Synthesized Speech Output

In the above waveforms, one can observe noticeable differences in duration between the original recording and its synthesized counterpart of 0.02284s, minimal distortion at the beginning of the synthesized clip and additional peaks. However, we recognise from the below voice reports that the overall acoustic quality of the synthesized audio clip, particularly as it related to pitch and pulses, was comparable to the original corpus recording.

Table 4.1    Comparison of Voice Reports for navi26 using Praat

                                        Original Recording           Synthesized Output
Time range of selection                 0 to 1.222250 s              0 to 1.199410 s
                                        (duration: 1.222250 s)       (duration: 1.199410 s)
Pitch
  Median pitch                          199.778 Hz                   199.428 Hz
  Mean pitch                            195.156 Hz                   195.234 Hz
  Standard deviation                    25.498 Hz                    25.695 Hz
  Minimum pitch                         147.169 Hz                   147.135 Hz
  Maximum pitch                         255.786 Hz                   255.788 Hz
Pulses
  Number of pulses                      163                          164
  Number of periods                     161                          161
  Mean period                           5.118689E-3 s                5.112964E-3 s
  Standard deviation of period          0.713752E-3 s                0.729174E-3 s
Voicing
  Fraction of locally unvoiced frames   27.731% (33 / 119)           25.000% (29 / 116)
  Number of voice breaks                1                            1
  Degree of voice breaks                8.679%                       8.838%
                                        (0.106079 s / 1.222250 s)    (0.106007 s / 1.199410 s)
Harmonicity of the voiced parts only
  Mean autocorrelation                  0.929331                     0.909761
  Mean noise-to-harmonics ratio         0.095550                     0.127029
  Mean harmonics-to-noise ratio         15.215 dB                    13.680 dB

This specific navigational prompt and audio clip pair was also used in the subjective evaluation listening test, to test whether the listener could identify any obvious differences between the synthesized audio and the original recording, enough to distinguish which was which.

Responses returned from the subjective evaluation showed that the acoustic quality was comparable and close enough for misidentification to take place, with some persons identifying the synthesized version as an original audio recording. We will return to this during the presentation of the findings of the subjective evaluation survey.

Navigation Prompt 2 (navi34) – ton lef ‘turn left’

The second acoustic comparison focused on the navigational prompt ton lef ‘turn left’. By zooming in to scrutinise the second utterance and the voice report, we came to appreciate that there was a marked disparity between the original corpus recording and its synthesized counterpart, specifically in relation to median and maximum pitch, pitch standard deviation, and standard deviation of period (pulse).

Figure 4.7: Source Data from JC Speech Corpus (waveform, 0 to 0.58 s)


Figure 4.8: JC Synthesized Speech Output (waveform, 0 to 0.5725 s)

In concatenating the synthesized version and applying the requisite labelling and pitch, unnecessary elements were flattened or omitted. Based on the auditory quality, and as verified during the user-based evaluation, those omitted, seemingly unnecessary elements did not appear to be to the detriment of the concatenated segments or the quality of the final synthesized speech output. The voice reports generated and the subsequent user-based evaluation revealed that the overall acoustic quality of the synthesized audio clip, particularly as it related to pitch and pulses, was comparable to the original corpus recording and did not affect comprehension.

Table 4.2: Comparison of Voice Reports for navi34 using Praat

                                         Original Recording                 Synthesized Output
Time range of selection                  0 to 0.812698 s, trimmed to        0.000914 to 0.572472 s
                                         0 to 0.580450 s                    (duration: 0.571558 s)
                                         (duration: 0.580450 s)
Pitch
  Median pitch                           198.209 Hz                         217.469 Hz
  Mean pitch                             202.166 Hz                         221.057 Hz
  Standard deviation                     31.862 Hz                          14.545 Hz
  Minimum pitch                          160.127 Hz                         201.769 Hz
  Maximum pitch                          271.472 Hz                         262.580 Hz
Pulses
  Number of pulses                       84                                 82
  Number of periods                      83                                 81
  Mean period                            4.945761E-3 s                      4.517168E-3 s
  Standard deviation of period           0.785011E-3 s                      0.314300E-3 s
Voicing
  Fraction of locally unvoiced frames    26.471% (45 / 170)                 35.185% (19 / 54)
  Number of voice breaks                 0                                  0
  Degree of voice breaks                 0 (0 s / 0.580450 s)               0 (0 s / 0.571558 s)
Harmonicity of the voiced parts only
  Mean autocorrelation                   0.956045                           0.900942
  Mean noise-to-harmonics ratio          0.052779                           0.158615
  Mean harmonics-to-noise ratio          16.083 dB                          14.660 dB

Navigation Prompt 3 (navi104) – diezi avinyuu ‘Daisy Avenue’

The results of the third sample prompt selected for acoustic comparison and analysis were more or less a combination of the findings presented above for the preceding two prompts. The overall acoustic quality was above average and intelligible enough for respondents to hear the difference between ‘Paisley Avenue’ and ‘Daisy Avenue’ and to respond and react appropriately. There were no marked distortion or unintelligibility factors that would prevent the user or listener from understanding what action was required. However, we believe there remains room for improvement with respect to these and the other navigational prompts used to accomplish our voice building objective.

Figure 4.9: Source Data from JC Speech Corpus (waveform, 0 to 1.149 s)


Figure 4.10: JC Synthesized Speech Output (waveform, 0 to 1.017 s)

A close comparison of the acoustic output of the synthesized and original audio samples demonstrated the overall similarity and the maintenance of prosodic content in the output of the Jamaican Creole synthesized voice, particularly in relation to pitch, pulses and mean autocorrelation. We came to appreciate that, from an acoustic point of view, prosodic quality was in general correctly transferred from the original segments to their synthesized counterparts. In addition, pitch mark extraction was accurately applied. These factors, along with the modification of pitch and timing, resulted in an overall acceptable synthesized output quality for the Jamaican Creole voice. This was later verified by the subjective responses received through the online web-based surveys.

Table 4.3: Comparison of Voice Reports for navi104 using Praat

                                         Original Recording                 Synthesized Output
Time range of selection                  0.005604 to 1.035745 s             0.001623 to 1.016553 s
                                         (duration: 1.030141 s)             (duration: 1.014930 s)
Pitch
  Median pitch                           173.042 Hz                         173.890 Hz
  Mean pitch                             174.969 Hz                         175.490 Hz
  Standard deviation                     18.029 Hz                          18.399 Hz
  Minimum pitch                          141.531 Hz                         143.391 Hz
  Maximum pitch                          214.543 Hz                         214.299 Hz
Pulses
  Number of pulses                       170                                164
  Number of periods                      167                                161
  Mean period                            5.728280E-3 s                      5.703887E-3 s
  Standard deviation of period           0.580010E-3 s                      0.602751E-3 s
Voicing
  Fraction of locally unvoiced frames    5.863% (18 / 307)                  4.082% (4 / 98)
  Number of voice breaks                 1                                  1
  Degree of voice breaks                 5.335% (0.054958 s / 1.030141 s)   3.970% (0.040288 s / 1.014930 s)
Harmonicity of the voiced parts only
  Mean autocorrelation                   0.916641                           0.886499
  Mean noise-to-harmonics ratio          0.115562                           0.166900
  Mean harmonics-to-noise ratio          13.125 dB                          11.306 dB

Nevertheless, while the overall acoustic quality of the synthesized speech output was above average and highly intelligible, we were cognisant that the acoustic quality of this, or any, synthesized voice remained an area where additional work would be recommended.

Two such areas were the smoothing of the joins at the concatenation points and the improvement of the speech style and overall pitch. Both areas showed marked variance between some original and synthesized versions and could be the source of the choppy auditory effect heard in some synthetic audio segments. The ongoing objective laboratory evaluation and acoustic analysis served as functional evaluation exercises providing (i) ongoing verification and validation of the modification of the language-specific scheme files for JC, (ii) confirmation of the correct implementation of the JC components during voice building, and (iii) evidence of the acoustic accuracy of the resulting synthetic speech, samples of which were provided in the preceding sections. In order to test the accuracy of the system in a non-laboratory setting in which the researchers' input was not primary, and to receive feedback from a prospective user demographic, we then proceeded to conduct a formal subjective listening test. Through this formal evaluation we gathered subjective responses from a wide cross-section of persons whom we viewed as potential users of our Jamaican Creole synthesized voice. The primary focus of the user-based evaluation exercise, the results of which are presented in the next section, was (1) accuracy, (2) overall output quality, (3) intelligibility, (4) appropriateness and (5) acoustic quality of the Jamaican Creole synthesized speech.

4.5. Measuring User Perception of Jamaican Creole Synthetic Speech

The user-based evaluation exercise consisted of a pilot and a formal evaluation and was conducted using a web-based listening survey. The pilot group consisted of a closed group of 30 participants selected by the researchers to complete the online survey. Participants were able to access the online survey through a specific web link provided by the researchers. All of the pilot group participants were native Jamaican Creole speakers with varying academic and social backgrounds, living both in Jamaica and in the Diaspora. The pilot served a dual purpose: first, it was used to test the validity of the instrument, and second, it served to identify areas requiring additional troubleshooting and modification to improve the output quality of the resulting JC synthetic speech before the formal evaluation was conducted. A non-disclosed filtering step was used in the formal evaluation to restrict response submissions from participants of the pilot group. Six respondents in total were disqualified from the formal evaluation and their results were not included in the final presentation. The page exit logic (unseen by respondents) was as follows:

If respondent took Eval 1, disqualify:
IF: The answer to Question (ID 85) is exactly equal to Yes
THEN: Disqualify and display:

We're sorry but you do not qualify to continue the survey at this time. You're welcome to send a message to the primary researcher at [email protected]. Thank you!

4.5.1. Instrument

A web-based listening test survey was used for this data collection exercise. The survey was built using SurveyGizmo© and administered online. The design of the survey elements took into full consideration the target application for which the synthesized speech was designed, namely in-car street level navigation. The final complete survey can be referenced in Appendix 8. Participants were not required to use names or specific markers that would identify them to the researchers. Each participant was assigned a unique ID by the system upon submission of the survey responses. In the formal evaluation, the ID numbers assigned ranged from 51 to 208; ID numbers 1 to 50 were test data, generated automatically to test the survey before it was deployed. The survey consisted of 24 short questions spread across four major sections, not including the demographic section, and included both audio content and text responses. The four sections were (1) Evaluation of Speech Quality, (2) Testing Comprehension and Intelligibility, (3) Comparing Synthetic Speech and Original Audio Recording and (4) Comparing Acoustic Quality of Speech Output. The maximum time required to complete the survey, based on the survey design and survey elements, was ten minutes.

All audio files used in the survey were converted from their original .wav format to .mp3 format in order to allow the audio to play in as many browsers as possible. Browsers supported by SurveyGizmo for this format, and used by respondents, included Internet Explorer (IE), Mozilla Firefox, Google Chrome, Safari, Android and iOS; IE 9 required Flash Player in order to play the audio files. The audio samples used were either original recordings from the speech corpus or synthesized speech captured using the Festival framework's text2wave option:

    dahlia@dahlia-PC ~/fes/Festvox/Festvox/data/navi
    $ text2wave lefraitriich.txt -o lefraitriich.wav
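The conversion step itself can be scripted. The lines below are one possible sketch of batch-converting the prompts with ffmpeg; the thesis does not record which conversion tool was actually used, and the directory and file names are illustrative.

    # Hypothetical batch conversion of survey prompts from .wav to .mp3
    for f in data/navi/*.wav; do
        ffmpeg -i "$f" "${f%.wav}.mp3"
    done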

4.5.2. Pilot Evaluation

Of the 30 respondents invited to participate in the pilot evaluation, 27 registered to complete the online survey and a total of 17 responses were submitted. Of the 17 submitted responses, 12 were complete and 5 were partial responses.

Table 4.4: Profile of Pilot Participants

Total Registered Participants      27
Completed Submissions              12
Partial Submissions                 5
Abandoned Survey/No Response       10
Male Respondents                    5
Female Respondents                  7

Of the complete responses returned, all 12 (100%) reported being native Jamaican Creole speakers with varying degrees of familiarity with synthesized speech and synthetic voices. The majority of the respondents reported being ‘Somewhat Familiar’ with synthesized speech, and no respondent reported being ‘Very Familiar’; the maximum familiarity value reported was 4 out of 5, with the average respondent having limited to working knowledge of synthesized speech and synthetic voices. The data analysis which follows assumes complete survey responses by 12 participants, 7 female and 5 male, all of whom were unpaid volunteers. The ages of the respondents ranged from 25 to 44. All of the respondents reported general satisfaction with the survey test design, the format of administration and the time required for completion. Recommendations were made regarding the synthesized voice itself and possible future work.

What follows is a presentation of the findings revealed through the administration of the pilot listening test. Information gleaned from each section is presented separately.

Mean Opinion Score (MOS) – Speech Quality of JC Synthesized Speech

In Section 1 of the listening test, respondents were prompted to ‘(Please) rate the overall speech quality’ of four randomly selected audio recordings of Jamaican Creole synthesized speech. A 5-point Likert scale was used, with the following values defined: 1 | Bad, 2 | Poor, 3 | Fair, 4 | Good, 5 | Excellent. The analysis that follows assumes 12 completed survey responses submitted by respondents for these four audio recordings.

Figure 4.11: Comparison of Speech Quality – counts of ratings (1 | Bad to 5 | Excellent) assigned to Audio Clips 1 to 4

The mean speech quality value assigned to Audio Clip 1 was 3.08, with no respondent returning the maximum value of 5 | Excellent. The majority of the respondents, 58%, rated the clip as 3 | Fair, with 25% giving a rating of Good; no respondent rated the clip as 1 | Bad. The mean value assigned to Audio Clip 2 was 3.58, a 0.50 increase over Audio Clip 1. The maximum value assigned for Audio Clip 2 was 5 | Excellent, a rating provided by 25% of the respondents. As with Audio Clip 1, no respondent rated the clip as 1 | Bad, and half (50%) of the responses rated this synthesized clip as 3 | Fair. The mean value assigned to Audio Clip 3 was 3.25, slightly lower than that of Audio Clip 2 but a slight increase over Audio Clip 1. The maximum value assigned for Audio Clip 3 was 4 | Good, a rating provided by 50% of the respondents; no respondent returned a rating of Bad for this clip. In the case of Audio Clip 4, the mean value assigned was 3.73, the highest of all four clips. The maximum value assigned for Audio Clip 4 was 5 | Excellent, as was the case for Audio Clip 2. Over a third of the respondents, 37%, rated this synthesized clip as 4 | Good, with no respondent returning a rating of 1 | Bad. Across the twelve completed responses received, no respondent reported any of the synthesized audio clips as being ‘bad’, and none of the four clips was assigned the lowest possible value of ‘1’. Of the responses returned, the most commonly assigned rating was 3 | Fair, with a count of 18.

The next most commonly assigned rating was 4 | Good, with a count of 15. Overall, the mean quality of the synthesized speech was considered to be 3 | Fair to 4 | Good, with an overall MOS of 2.35 being returned.

Intelligibility and Comprehension via Transcription Task

In response to the four audio clips and the questions which followed in this section, participants were tested on intelligibility and comprehension. For two of the audio clips, participants were required to answer ‘YES’ or ‘NO’; for the other two, they were required to transcribe the instructions they had heard. The overall results from this section indicated listener comprehension. As it relates to Audio Clip 1 and the question ‘Are you supposed to enter the highway?’, one respondent reported a lack of comprehension, being unable to “hear what comes after pan di ‘on the’”. Ten respondents, or 83.3%, answered correctly by providing a negative response; only one respondent returned an incorrect response. One of the 10 respondents who returned the correct response was also able to correctly identify the exact phrase which followed pan ‘on’, namely lili wie ‘Lily Way’. This indicated clarity and suggested that the average user would be able to follow instructions provided by the proposed voice. As it relates to the question posed after listening to Audio Clip 2, ‘After turning left, will you be on Paisley Avenue?’, 3 respondents returned a ‘YES’ response, which was incorrect. Nine respondents, or 75%, returned the required correct ‘NO’ response.

Of the 9 correct responses returned, one respondent correctly identified the street name, ‘Daisy Avenue’. This respondent was not the same participant who provided the exact phrase for Audio Clip 1, a further attestation to comprehension. All of the respondents (100%) scored a perfect mark in transcribing the Jamaican Creole synthesized navigational instructions provided, namely (1) ton lef den ton rait an yu rich ‘Turn left, then turn right and you have reached your destination’ and (2) aafta 300 miita ton lef ‘After 300 metres turn left.’

Similarity (SIM) Task

In an attempt to rate the similarity of synthesized and original audio and to judge the correct transfer and application of prosody, participants were asked to (1) listen to 3 audio clips and report which of the clips was the synthesized version, and (2) compare two original corpus clips with their synthesized versions, ranking similarity on a scale of 1 to 5, with 1 being Very Dissimilar and 5 being Very Similar.

Figure 4.12: Speech Output Similarity – counts of participants labelling each of Audio Clips 1 to 3 as Synthetic or Original

The synthesized audio clip was Audio Clip 2. As illustrated by the responses, this was not easily identifiable by the participants; in fact, 50% of the respondents thought it was an original recording. Interestingly, of the 3 audio clips, this was the only clip that received a 50/50 split between synthetic and original. The responses for Audio Clips 1 and 3 indicate that the majority of participants were, for the most part, able to identify an original recording as opposed to its synthesized version.

Figure 4.13: Original versus Synthesized Speech – participant similarity ratings (1 | Very Dissimilar to 5 | Very Similar) for two pairs: Synthetic Audio7.mp3 versus Original navi0080.mp3, and Synthetic Audio17.mp3 versus Original navi0026.mp3

As indicated by the results, the majority of the respondents returned a rating of 5, finding the recordings to be ‘Very Similar’: 42.86% for the first set of recordings and 85.71% for the second set, an average of 64.30%. No participant (0%) found the recordings to be ‘1 | Very Dissimilar’.

Appropriateness

As it relates to appropriateness, that is, suitability for the domain for which the voice was being modelled and built, namely in-car street level navigation, the comments shared by the participants demonstrated very positive feedback for the Jamaican Creole synthesized speech and also served to identify areas for additional improvement. Participants pointed to the need for “a bit more fine tuning”, “less background noise” and a reduction in speed, as well as a revisiting of the idea of using metres to express distances. The feedback in general was very encouraging. Participants reported that “This could work.”, “I don't see a problem with it” and “Navigation is a major use for synthetic voices. Therefore, one for the Jamaican Creole is very appropriate.” Feedback from one participant suggested that whereas the voice seemed to “handle turn anticipation ... street names appear slight off”. This prompted us to make pitch mark, duration and labelling modifications before the formal evaluation was deployed. For some respondents, however, such a voice for JC would simply serve a comedic value, something to be used for light-hearted moments and not to be viewed as a serious contender with value. Indeed, as one participant put it, such a system would only be used “...for simple comedic value...” and the participant would in fact “...quickly switch to a different product or voice if ... in a difficult area or in a rush”.

4.5.3. Formal Evaluation

Following a review and analysis of the results and feedback returned by the respondents who participated in the pilot evaluation, modifications were made to the database as well as to the instrument. The formal evaluation of the Jamaican Creole synthesized voice was carried out four months after the pilot assessment concluded and lasted for a period of two months. Based on the instructions we provided, we safely assumed a quasi-silent environment for each participant. Some participants initially had difficulty playing the audio clips in Mozilla Firefox and Google Chrome; however, by switching to Internet Explorer, or by switching between a mobile platform and desktop, they were able to resolve this in most instances. The survey and sample response data can be referenced in Appendices 8 through 10. The results, information and feedback gathered exceeded our expectations and validated the success of navi, quality domain-specific Jamaican Creole synthetic speech created using open source software. The results of this formal evaluation are presented using an outline similar to the one used for the presentation of the findings of the pilot evaluation in the preceding section.

(Registered) Participant Demographic Profile

A total of one hundred and fifty-six (156) participants accessed the online survey URL and registered to participate in the assessment; however, only 105 submitted demographic data. Regarding level of education, 81.0% had university level education, with 56.2% of them having some postgraduate experience; 14.3% reported high school level education, with 12.4% being high school graduates and 1.9% having some high school experience; 5 participants, or 4.7%, selected ‘other’ as level of education, 2 of whom had some college or technical training. Regarding current occupation at the time the survey was conducted, 22 reported being students, 28 selected academic, 49 identified themselves as professionals, 4 each reported being retired and unemployed, and 2 selected ‘other’. The majority of the respondents who submitted demographic data were native Jamaican Creole speakers, 67 or 63.8%, in comparison to non-native JC speakers at 38 or 36.2%. Based on a 5-point Likert scale with 1 being Very Unfamiliar and 5 being Very Familiar, thirty-five respondents reported a value of 3, being ‘somewhat familiar’ with synthetic speech, 14 reported a value of 5, being ‘very familiar’, and only 18 reported a value of 1, being ‘very unfamiliar’ with synthetic speech. The remaining 38 respondents rated themselves as being Familiar or Unfamiliar with synthetic speech.

Table 4.5: Survey Respondents and Fall-off Report

Total Registered Participants                156
Demographic Data Submitted                   105
Partial Submissions                           77
Completed Submissions                         55
Disqualified Responses                         6
Abandoned Survey (including disqualified)     24
Native JC Speakers                            67
Non-Native JC Speakers                        38

Section 1: Evaluation of Jamaican Creole Synthetic Speech Quality

Using a 5-point Likert scale we asked participants to rate the overall speech quality of four sample Jamaican Creole synthetic speech audio clips. The following values were assigned to rate quality – 1 | Bad, 2 | Poor, 3 | Fair, 4 | Good, 5 | Excellent. We present below the ratings returned by the respondents. 67 participants rated Audio Clip 1 and 68 participants rated Audio Clips 2 through 4.


Figure 4.14: User-Based MOS of JC Synthetic Speech Quality – distribution of ratings for the four audio clips

Audio Clip 1  Aafa 300 miita ton lef  ‘After/In 300 metres turn left’
              1 | Bad 6%   2 | Poor 25%   3 | Fair 40%   4 | Good 23%   5 | Excellent 6%
Audio Clip 2  ton lef pan diezi avinyuu  ‘Make a left on Daisy Avenue’
              1 | Bad 6%   2 | Poor 9%    3 | Fair 22%   4 | Good 53%   5 | Excellent 10%
Audio Clip 3  Aafa 250 miita ton lef  ‘After/In 250 metres turn left’
              1 | Bad 3%   2 | Poor 17%   3 | Fair 32%   4 | Good 34%   5 | Excellent 12%
Audio Clip 4  rait torn  ‘Make a right/Right turn’
              1 | Bad 3%   2 | Poor 6%    3 | Fair 34%   4 | Good 41%   5 | Excellent 18%

The standard deviation was 1.0, and the maximum quality rating returned for all four audio clips was 5 | Excellent. The mean value returned for these four clips was 3.35 out of 5, with the overall average quality of the synthetic audio clips being rated as 3 | Fair. Audio Clips 2 and 4 received the highest ratings in terms of quality. Forty-three of the sixty-eight respondents, or 63.2%, rated the overall speech quality of Audio Clip 2 as Good to Excellent, and 85.3%, or 58 respondents, returned a rating of Fair to Excellent. The quality of Audio Clip 4 was rated as Good to Excellent by 58.9%, or forty of the sixty-eight respondents, and 91.3% rated clip 4 as Fair to Excellent.

Figure 4.15: Comparing Speech Quality of Audio Clips – percentage distribution of ratings (1 | Bad to 5 | Excellent) for Audio Clips 1 to 4

Of the participants, 31% rated the quality of Audio 1, aafa 300 miita ton lef ‘after 300 metres turn left’, as Poor or Bad. A similar rating was provided by 20% for Audio 3, aafa 250 miita ton lef ‘after 250 metres turn left’. However, only 15% provided such ratings for Audio 2, ton lef pan diezi avinyuu ‘Turn left on Daisy Avenue’, and Audio 4, ton lef ‘turn left/left turn’. Looking at the phonemes represented in these four phrases in context, we realise that when the turning expression ‘turn + direction’ was provided in phrase-initial position, as in Audio 2, or on its own without additional information, as in Audio 4, the overall quality appeared better to the recipient. In phrase-final position, as in Audio 1 and Audio 3, the quality appeared to be reduced. This lowering of quality could potentially be traced back to the type of speech units available within the speech database during synthesis. We theorise that at run time the Viterbi search chose the unit in the database with the lowest cost, the complete phrase ‘turn left’, over other units, since it was readily available as a complete phrase, rather than seeking to concatenate ‘turn’ and ‘left’ or use the letter-to-sound rule set provided. If this theory is correct, then one possible option would be to adjust the speech database by increasing the contextual variations for the turn expression before making the voice available to the public at large, as sketched below.
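One low-cost way of doing so would be to add prompts that place the turn expression in more contexts, so that the database no longer holds a single dominant ‘turn left’ unit. The lines below sketch what such additions to the Festvox prompt list (conventionally etc/txt.done.data) might look like; the utterance IDs and the extra phrases are hypothetical examples and are not prompts from the actual navi corpus.

    # Hypothetical example: append extra turn-expression prompts to the prompt list
    echo '( navi_x01 "aafa 300 miita ton lef pan diezi avinyuu" )' >> etc/txt.done.data
    echo '( navi_x02 "ton rait den ton lef pan lili wie" )'        >> etc/txt.done.data
    echo '( navi_x03 "aafa 250 miita ton rait" )'                  >> etc/txt.done.data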

Comments received from the participants were generally positive. Some of the additional comments, including recommendations for improving the Jamaican Creole synthetic voice, are listed below for reference; some were specific to voice clarity, the speed of the voice and the possibility of creating a male speaker version of navi.

i. The accent is great. There is just a little 'warped' sound. Not sure whether that is just a technical glitch
ii. Lol this is sooooo cool go for it!
iii. I was able to hear and understand all that was said.
iv. did some research on synthetic speech and i am quite pleased with the audio speeches and the possibilities that will be realized.
v. some ran by too fast, and the voice is too soft. The one I have uses a strong mail [sic] voice
vi. The last 3 clips could have received full marks for comprehensibility. However, they were placed as 'good' for the lack of clarity.
vii. Not a problem of voice clarity, but there seems to be some overlapping

Section 2: Testing Comprehension and Intelligibility

In Section 2 of the evaluation exercise, our primary objective was to test the comprehension and intelligibility of the Jamaican Creole synthetic speech generated. Participants were asked to listen to four audio clips. For two of these, respondents were required to listen to a navigational instruction and return a value of True or False to the question that followed; for the other two, participants were required to listen to and transcribe the navigational instruction they

heard. The language in which participants wrote their responses was irrelevant and was not taken into consideration. In response to the first audio clip and the question “Should you enter the highway?”, 79.7% of the participants returned the correct value of False; the instructions were in fact to ton rait pan lili wie “Turn right on Lily Way”. The majority, 70.7%, returned the correct response of False to the second question, “After turning left, will you be on Paisley Avenue?” The street name in the question was in fact diezi avinyuu ‘Daisy Avenue’, as in ton lef pan diezi avinyuu “Turn left on Daisy Avenue”. Except for one respondent who experienced technical difficulty and did not return a response, all the responses returned for Audio Clip 3 were correct: respondents returned the correct navigational instruction of ton lef den ton rait an yu rich “Turn left then turn right and you have reached your destination”. The results were more or less the same for Audio Clip 4, with respondents returning the correct instruction of aafa 300 miita ton lef ‘After 300 metres turn left/In 300 metres turn left’. Based on the results, we note that the perceived quality of the synthetic speech did not in any way adversely affect the end user's ability to fully comprehend the instructions. Respondents who participated in the evaluation exercise comprised both native and non-native Jamaican Creole speakers. We should mention that non-native speakers did not have a problem understanding the navigation instructions used in the exercise.

In fact, non-native JC speakers provided accurate transcriptions in the comprehension and intelligibility section. This was, for us, an indication that a JC synthetic voice would have a much larger projected distribution and reach than we initially thought. One non-native JC speaker commented that it was “very appropriate to use natural speech appropriate to a speech community in such a situation because intelligibility and ease of comprehension are paramount”.

Section 3: Synthetic Speech or Original Recordings

One of the primary goals of modern day speech synthesis is to generate synthetic speech that is of high quality and natural sounding, that is to say less robotic and more human-like (Taylor 2009). In our assessment we investigated whether the Jamaican Creole synthetic speech generated was of high quality and comparable to natural sounding speech. We sought to determine whether participants would be able to recognise the difference between synthetic speech generated by a machine and original speech produced by a human being. We invited participants to listen to a total of 6 recordings and return a value of Synthetic or Original for each. The feedback was better than we expected. Only 2 of the 6 recordings used in this section, Audio Clips 3 and 5, were actual original human recordings; the other 4 audio samples, namely Clips 1, 2, 4 and 6, were in fact speech that had been synthesized and generated by the system. The feedback from the participants demonstrated, however, that the quality of the synthetic speech was comparable to that of the original

recordings. Of the respondents who evaluated these 6 audio clips, 82.1% rated Audio Clip 1, generated speech, as being an original human recording; 53.6% rated Audio Clip 2, which was also generated speech, as Original; 89.7% rated Audio Clip 4 as Original; and 63.8% rated Audio Clip 6 as Original.

Figure 4.16: Comparison of Synthetic and Original Speech – percentage of respondents labelling each clip as Synthetic or Original

              Audio 1   Audio 2   Audio 3   Audio 4   Audio 5   Audio 6
Synthetic      17.9%     46.4%     87.7%     10.3%     61.4%     36.2%
Original       82.1%     53.6%     12.3%     89.7%     38.6%     63.8%

Section 4: Comparing Acoustic Quality and Appropriateness

The test for acoustic quality was similar to the method used in Section 3. In this section, however, participants were provided with 2 pairs of audio samples side by side, a synthetic audio clip clearly identified as synthetic and its original audio recording, clearly identified as original. They were asked to listen to each pair of recordings and to rate the clips in terms of similarity on a scale of 1 to 5, with 1 being ‘Very Dissimilar’ and 5 being ‘Very Similar’. The first pair instructed participants to ton lef den ton rait an yu rich ‘Turn left then turn right and you have reached your destination’. The second pair was a demonstration of 300 miita ‘300 metres’.

In both instances, the majority of respondents ranked the audio samples as 5 | Very Similar, with 32.1% in the case of the first pair and 61.1% in the case of the second pair. The acoustic quality of the second audio pair was better than that of the first; no participant rated the second pair as 1 | Very Dissimilar, and only 3.8% rated the first pair as Very Dissimilar. To complete the evaluation, participants were reminded of the specific target domain for which the voice was designed, namely providing in-car street level voice navigation instructions. Participants were encouraged to comment freely on the Jamaican Creole synthetic voice they had heard across the various samples used during the evaluation exercise. They were asked to provide feedback first in relation to its appropriateness. We deliberately chose not to define ‘appropriateness’ in this instance, as we were interested in all responses. Comments returned by the respondents indicate that in general the JC synthetic voice matched its intended purpose, or, in the words of one participant, was “very appropriate for what it was created for”. Respondents also remarked that it was appropriate to have voice navigation in “our native tongue” and that it was “as appropriate as street level navigation in any other language”. Some participants, however, were critical and doubted the value of having a Jamaican Creole synthetic voice. For those participants, a JC synthetic voice in such a setting would be mere comic relief and would not be easily accepted by the average Jamaican.

In addition, participants questioned whether it would not be more effective to have the voice in Standard Jamaican English rather than Jamaican Creole, as they theorised that a synthetic voice in SJE would potentially have a wider reach. The response of one of the participants (presented in the table below) could potentially serve as a reply to such participants: “I want to believe i [sic] am fluent in both languages, however the Jamaican Creole speaks to me much CLEARER!” Another simple response could be that creating a synthetic voice in Jamaican Creole was merely creating a new voice in a different language, as is being done on a large scale for countless other languages. There could be a lengthy debate around such responses; however, when we embarked on this voice building project, we expected to receive feedback of this kind. In fact, we were surprised at the overwhelmingly positive and encouraging feedback on the Jamaican Creole synthetic voice. For each unique language represented and each voice created, the potential reach will continue to be wide, with end users having the option to choose from amongst the language and gender options available. In Table 4.6 below, we present some of the responses as originally submitted by the participants. Only the font was changed, from Arial to Times New Roman, to match the font used throughout this thesis.

Table 4.6: Sample Comments Submitted by Participants

Original Participant Responses

- De concept is good go for it
- I believe it's very useful to have this sort of material in JC. Hope it can be fully developed
- I find it very appropriate. I think the instructions were clear
- It can be done but to what end? Cuteness? Money? It takes all kinds
- It would be good to have a voice navigation in our native tongue.
- It's as appropriate as street level navigation in any other language. LOL
- Pretty good, sort out the Paisley sounding like Daisley and I think it's well on it's way.
- Sounds good to me!
- The synthetic voice was clear and understandable in most instances
- VERY
- Very appropriate
- Very appropriate and quite well done.
- Very appropriate. Pleasant voice quality/tone.
- Voice is clear and it seems very appropriate anyone can understand it.
- appropriate
- great
- it seems rather appropriate
- the original voice sounds clearer and is much slower
- very appropriate for what it was created for.
- I reallt think that it is a good idea. A nice alternative to the same american yanke voice you hear all the time.I think if it works out well that you should definitely enlarge your target domain. Some of the words aren't entirely clear. I can't distinguish from "Daisy Ave." and "Paisley Ave."
- The appropriateness of such a tool would largely depend on where Jamaican Creole stands in the society at the time of implementation. Additionally, the people who could afford such a tool would probably not want to hear their instructions in Jamaican Creole.
- I think this is very appropriate. I have actually pretended to be the voice of a Jamaican version of my own GPS! :-) This is an AWESOME project. I want to believe i am fluent in both languages, however the Jamaican Creole speaks to me much CLEARER! Blessings! I hope to see this endeavour become a reality in the near future!
- I will only be appropriate for those who understand Jamaican creole, and even then it has shortcomings. Instead of "afta 300 meter tun leff" Jamaicans would instead say "go dung 300 meta more den tun pon de leff" The synthetic quality is fairly decent. However wouldn't you reach a wider demographic by producing this in English?
- I think it is very appropriate to use natural speech appropriate to a speech community in such a situation because intelligibility and ease of comprehension are paramount.
- I feel I cannot comment on the synthetic voice's appropriateness, as I don't have a conception of what that means in this case. However, it's a pretty impressive approximation of human speech. I only realized which ones were synthesized when I heard the originals. The originals had better sound quality but that probably had nothing to do with whether they were intelligible or not.
- I think for GPS purposes, synthetic speech in Jamaican should be optional. Whereas the sound quality might be clear, even some Jamaicans may have difficulty understanding directions in their native language. At times, it may somewhat call for instantaneous translation/interpretation (which is a learned skill in and of itself) and this could be problematic when one needs to get to a particular destination in a timely fashion. On the other hand, I think the use of synthetic speech would be useful for visitors to Jamaica who might need help understanding Jamaican.
- This is a great piece of work. To tell the truth, the differences are so miniscule as to be imagined. I chose one audio as original over the other, based on what I perceived to be some acoustic noise (slight). Congratulations.
- the voice sounds fine and clear; there is some background noise and sometimes it seems that different phrases are nearly overlayed
- It was fairly appropriate for a JC native speaker. Some words are however unclear-highway sounded like Leeway and I heard Daisy Avenue instead of Paisley Avenue. Can be misleading.
- Appropriate enough to me and sounds good too. One could only tell the synthetic (I think) because of a very slight distortion however this is good for gps
- I understand, in this case the "target domain" are Jamaicans. This tool then would be used to help navigate them while driving, and interacting with them in a human tone rather than a computerized/ zombie-like/ impersonal tone. Is it appropriate? Well.....based on the language used (Jamaican Creole), and the stereotypical nature of our people, it is difficult to imagine us being dependent on an automation for directions. We prefer to explore and get lost and then found taking the scenic route. Particularly the males, I do not foresee using this. The cost of this device/ application may be prohibitive. Also, using the Jamaican dialect as the interface is limiting. Even though we come from Jamaica, there are differences in our patois. The device may not be able to recognize these differences, or the human operator may not be able to recognize same. Although valuing our patois heritage, I believe that the voice interface should use standard English. Some concession though is that we can use the natural Jamaican accent. The volume of items 23 and 24 were low. Hope the feedback helps. All the best

4.6. Benchmarking Jamaican Creole Synthetic Speech

Finding the benchmark, or industry-standard metric, for the evaluation of synthetic speech may prove more difficult than expected. In fact, the industry itself does not always seem to agree on a uniform standard of

measure by which all synthetic speech can be graded. Indeed, as far back as the mid-1990s, van Heuven and van Bezooijen (1995) advocated for the development of benchmarking. In the specific case of our Jamaican Creole synthetic voice, built using the open source Festival Text-to-Speech framework and Festvox tools, the literature did not reveal a benchmark advocated specifically for the JC language or for JC synthesized speech; there were no other JC synthetic voices to compare against, nor preferred tests previously carried out for other synthetic JC voices. In approaching the matter of benchmarking for the Jamaican Creole synthesized speech created using open source software, we therefore decided to align our benchmark reference point directly with the framework used to create the voice. We focused on specific output created using this particular framework, and on the testing standards and evaluation techniques used and proposed for it. To assist us in this endeavour, we used the Blizzard Challenge. The Blizzard Challenge was created to help researchers and developers in the field of speech synthesis better understand current research techniques and to compare those techniques by assessing synthesizers built on a common dataset. These challenges have become “a recognised standard in TTS testing” (Taylor 2009, 526). We researched published results of previous challenges and reviewed data from their listening tests. Following our review, we cross-

referenced, compared and analysed how our synthesized voice performed relative to the benchmark test output from recent challenges. As it relates to quality and intelligibility, data presented in various Blizzard Challenges returned MOS scores for Festival of 3.0 in 2007, 3.3 in 2008 and 2.9 in 2009. The MOS for the Festival unit selection benchmark system from CSTR in 2011 was also 2.9 (King and Karaiskos 2009; 2011). The mean MOS for the other systems which took part in the 2011 Blizzard Challenge Workshop ranged from 1.4 to 4.8, with standard deviations ranging from 0.63 to 1.10; the score of 4.8 was for the system marked ‘natural speech’ (King and Karaiskos 2011). The King and Karaiskos analysis and summary of data for the 2013 Blizzard Challenge, which was conducted just before the submission of this study, revealed a mean of 2.1 for the Festival unit selection system, with a standard deviation of 0.92, for naturalness on Task 2013-EHI, the building of a voice from a common corpus with no provided text. The means for the other nine systems that participated in the challenge ranged from 1.2 to 3.9, with natural speech receiving 4.8; some systems were hybrid, others parametric, and three were unit selection based. In seeking to set the benchmark standard for Jamaican Creole synthetic voices, we compared the benchmark means returned for Festival for the five challenge years specified above with the mean MOS returned for the overall speech quality of our Jamaican Creole synthetic voice, which was created using the Festival Speech Synthesis platform.

With respect to overall speech quality, the MOS for the Jamaican Creole synthetic voice was comparable to the results returned for Festival in the 2007 and 2008 challenges: the mean MOS for the Jamaican Creole synthetic voice was 3.3, with a standard deviation of 0.975. The proposed range for any future work conducted on Jamaican Creole synthetic speech should be at or above the range specified for our voice.


Chapter 5: Voice Building in Festival: Limitations and Recommendations

5.0. Introduction

Voice building within the Festival TTS open source framework, whilst not rocket science, is still no easy task. This we came to appreciate, having successfully completed cluster unit selection synthetic voice building for a previously unsupported language within the Festival open source framework, using the requisite Festvox Tools and the Edinburgh Speech Tools. On reflection, Text-to-Speech synthesis and synthetic voice building, especially for a new language and particularly within open source, require a combination of persistence, perseverance and the willingness to re-learn. They also require restarts, as well as the common sense to seek the assistance of others who may have journeyed the same path, including the system developers.

5.1. Limitations Observed and Recommendations

One key requirement for potential researchers and developers, above all else, is the full appreciation that within synthetic voice building, particularly using open source software, restarts are unavoidable. Stemming from this project was the full realisation that forums, discussion boards and FAQs are one's best allies, especially with respect to voice building using open source software, as was the case in this study. Thus, if one chooses the open source path, one has to be willing to spend quality time poring over FAQs and discussion boards in search of the answer, which 99% of the time does exist, even if it lies hidden deep within the recesses of over-analysed tech talk and very tedious, winding conversations.

Based on personal observations, we will also mention that some knowledge of speech processing technologies, LISP programming and familiarity with a UNIX or UNIX-type environment (in the specific case of Festival), while not prerequisites, are highly recommended before embarking on a journey of speech synthesis voice building. In addition, the researcher should have the ability and flexibility to work within different OS environments. Different speech synthesis voice creation frameworks will indicate preferences as to the manner in which tools and programs are to be installed and compiled. Particularly within the Festival framework, we came to realise that the specific manner of installation for each of the three separate components, Festival, Edinburgh Speech Tools and Festvox, was especially important. The compilation of one particular source distribution before another, and in the correct manner, was often required in order for dependent, co-dependent and cross-dependent files and links to be created successfully, thereby minimising the risk of resulting errors during the actual voice building phase. Based on our experience, the recommended manner of installation, configuration and compilation is as follows: first Festival, then EST, followed by Festvox, and not necessarily EST, Festival, Festvox as recommended in the installation script.
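As a rough illustration, the lines below sketch the configure-and-compile pattern for the three components in the order recommended above. The archive names, version placeholders and paths are illustrative only, and each distribution's own INSTALL notes remain the authoritative reference.

    # Unpack the three source distributions (names and versions are placeholders)
    tar xzf festival-x.y.tar.gz; tar xzf speech_tools-x.y.tar.gz; tar xzf festvox-x.y.tar.gz
    export ESTDIR=$(pwd)/speech_tools
    export FESTVOXDIR=$(pwd)/festvox
    ( cd festival     && ./configure && make )   # 1. Festival
    ( cd speech_tools && ./configure && make )   # 2. Edinburgh Speech Tools (EST)
    ( cd festvox      && ./configure && make )   # 3. Festvox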

Having completed our voice building journey with many attempts within both a UNIX environment (desktop and virtual machine) and a Windows environment using CYGWIN, our personal recommendation for those wishing to conduct voice building using the Festival open source software is to do so within a Windows operating system, using a UNIX-simulated environment such as CYGWIN. Although Festival was initially created, tested and vetted by its developers within a UNIX-based system, the preceding recommendation is a direct result of the many challenges we experienced while attempting to conduct successful voice building within Linux, and of a successful attempt using CYGWIN. One of the major issues we noted with conducting voice building on a UNIX platform relates to audio configuration and the repeated failure of Festival to successfully access and use the computer's audio components, sound card and lib/dev tools. Additionally, if errors ensued during a specific phase of voice building, we noted that it was easier within a CYGWIN environment than within UNIX to port and re-use some, if not all, of the files previously generated within one installation in another installation, without being required to redo some steps. While the preference of the researchers, based on voice building performed on both platforms, is for CYGWIN, we hasten to say that this is not an indication that Festival voice building within UNIX can never be accomplished; after all, the developers were successful in this regard, and so too were many other researchers. However, for the researcher or developer who could potentially be plagued with both hardware- and software-related

issues while attempting to conduct voice building within UNIX, and who has a strong desire to avoid these issues altogether, a simulated environment is still a more productive and preferred choice. One important fact that must be stated in relation to voice building within CYGWIN is the matter of editing files. Based on our personal observation, although one is essentially working on a Windows platform, the caution is to avoid using Windows-based editors such as Notepad or WordPad to edit the required files. If Festival-provided files, or files created within Festival for use during voice building, are created or edited using Windows-based editors, they are saved in Windows text file format with CR-LF (\r\n) line endings. CYGWIN's default configuration uses UNIX-format text files, with LF (\n) line endings, hence attempting to run Windows-format text files within CYGWIN will result in errors. Our recommendation, based on hands-on experience and later validated by suggestions seen across various discussion boards and forums, is to install and use one of several recommended CYGWIN editors, such as the vi editor, to edit scheme files. For this study, the vi editor in command-line mode was used to edit the requisite files.
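Where a file has already been touched by a Windows editor, the stray carriage returns can usually be detected and stripped before the file is used. The sketch below assumes the dos2unix utility available in a standard CYGWIN installation; the file name is a placeholder, not a file from this study.

    file festvox/yourvoice_phoneset.scm        # reports "with CRLF line terminators" if affected
    dos2unix festvox/yourvoice_phoneset.scm    # rewrite the file in place with UNIX LF line endings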

When errors are returned, the immediate response must be to back trace, return to the starting point and redo the required steps in order to rid the process of those errors, whether through re-installation, re-compilation or re-configuration. Having successfully undergone the complete voice building process, we observed that within the Festival framework some of the errors returned can be safely ignored, while others must be promptly addressed if one hopes to avoid further errors when attempting to run the recommended scripts and complete voice building. There is no definitive list of which errors can be skipped and which need to be addressed immediately; hence the approach we advocate is to back trace and redo all required steps until such errors have been resolved. Documentation for open source software will always remain dependent on users to maintain its accuracy and to keep the information current. Reports of errors and debugging results will also continue to play an essential role; oftentimes, it is only by these means that others wishing to use these systems are kept up to date with vital changes affecting the software, for example scripts or command lines that are obsolete, as well as their replacements. One such instance, mentioned previously, was the bin/make_labs command line. In our specific case of developing synthetic speech for Jamaican Creole within the Festival framework, we invested many hours poring over archived Festvox mailing group data, reaching out to support boards, speech synthesis related FAQs and Linux forum discussion boards. In addition, we also reached out to the developers. As a result of reaching out to the developers, we were required to change our initial research objective, deciding to develop a cluster unit selection based concatenative speech synthesis voice rather than a limited domain synthetic voice. Having failed to realise the objective of building a limited domain synthetic voice for Jamaican Creole using the Festival open source software

despite numerous attempts, we eventually took the initiative to contact one of the primary developers of the software, when nothing else seemed to work. In our telephone conversation with Professor Alan W. Black at his Carnegie Mellon University office, we outlined our lack of progress on the proposed Jamaican Creole limited domain synthetic voice despite having followed the procedure provided in the Festival voice building documentation. From this conversation with Professor Black, we gleaned that the Festival framework and Festvox toolkit were somewhat limited with respect to limited domain synthesis and the ability to create new LDom voices for new, particularly non-supported, languages. It was revealed that the inability to successfully create a new limited domain voice for Jamaican Creole was specifically due to an “issue with the Festival Limited Domain set-up” (Professor Alan W. Black, February 25, 2012). It was further disclosed that the limited domain scheme set-up for Festival was primarily created for the English language and was never intended for the purpose of creating non-English limited domain voices. The primary language support for this form of synthesis within Festival rested on it working optimally for English-based voices, or voices created using the default English module. Thus, in order for the system to work effectively and create a Jamaican Creole LDom voice, we would have had to rely on the default English phoneme set, mapping the Jamaican Creole phonemes to the system default English phonemes.


The option provided by Professor Black was to set up the voice files as a standard cluster unit voice and modify it to function as a limited domain voice. Instead of running the build_ldom.scm function to generate the required voice building files, we would run the build_clunits function instead. One of the major highlights of this overall exercise was the realisation that some of the language-specific files, such as the phone set or phoneme inventory which we had previously created and prepared for the JC limited domain voice but had not succeeded in implementing within the LDom definition, were in fact required by this new function to create the new synthetic voice. As disclosed by Professor Black, the steps involved in building unit selection voices “are basically the same as that for building a limited domain voice” (A. W. Black, personal communication, February 25, 2012), a remark echoed within the Festival documentation itself.
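For reference, a rough sketch of what the cluster unit route looks like at the command line is given below. The paths are placeholders, the voice directory layout is illustrative, and the exact setup script and build steps should be taken from the Festvox documentation shipped with the version in use.

    # Hypothetical sketch: building a cluster unit (clunits) voice instead of an LDom voice
    export ESTDIR=/path/to/speech_tools
    export FESTVOXDIR=/path/to/festvox
    # ... run the Festvox clunits setup script, then record, label and build the utterances ...
    festival -b festvox/build_clunits.scm '(build_clunits "etc/txt.done.data")'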


On an international scale, the Festival framework is regarded as one of the most well-documented open source modular voice building frameworks. In fact, many other modern day speech systems and voice building frameworks are built on a Festival-type modular foundation and in some instances even use Festival-derived modules (Clark et al 2007). Various voices have been successfully built for many different languages and continue to be built using its modules, as seen in the case of our Jamaican Creole synthetic voice. Whilst the Festival framework does receive many well-deserved accolades, it must be reiterated that the process of voice building using Festival and a Linux system is no mere feat. The process is riddled with ensuing errors and requires a high level of patience coupled with several re-starts. For synthetic voice building within Festival, or within any open source software for that matter, resolving issues will always remain a matter of research, networking and basic trial and error, troubleshooting and debugging. Having this realisation at the forefront before embarking on voice building within this (or any other) framework is essential and highly recommended.


Chapter 6: Conclusion and Recommendations

The goal of modern speech synthesis is to keep improving waveform techniques so as to generate synthetic speech that is even more intelligible and of higher quality than it is today. Reviewing the progress made in speech synthesis and waveform techniques, particularly within the last ten years alone, we are confident that this goal is achievable and not far off. The technology behind speech synthesis will continue to develop and improve at an increasing rate, and synthetic speech will continue to be incorporated into a wide variety of applications performing crucial daily functions, such as on-screen reading for the blind, direction giving in navigation devices, and providing transit, flight and other timely information to the public. Through the conclusion of this study, Jamaican Creole has, to some degree, joined the list of languages for which synthetic speech is generated to perform such daily, necessary functions. In this study we proposed to offer another viable language option for in-car street level voice navigation. Our primary objective was to create quality synthetic speech for Jamaican Creole using open source software and to document the process accurately. This we accomplished by adapting the unit selection waveform technique to Jamaican Creole within the open source Festival Text-to-Speech

voice creation toolkit, using the mandatory Festvox and Edinburgh Speech Tools, in a Cygwin environment on a Windows platform. Although we have defined and provided a listing of Jamaican Creole function words for the synthetic voice, extending the voice to predict phrasal boundaries within this DCF method, using POS and n-gram models, is an aspect of this research for which much remains to be done. At this stage, phrasal boundary definition therefore depends largely on the basic Festival CART module, which relies primarily on punctuation markers as indicators. The synthetic voice created during this project is by no means presented as advanced or complete. We present it as a rudimentary synthetic voice, still in its initial phase, for which much work remains. Although the evaluation demonstrated that the overall quality was good and appropriate for the purpose for which the voice was created, we believe we can achieve better. To improve quality and achieve more expressive, more natural sounding synthetic speech, we recommend further research concentrated on prosodic modification and linguistic analysis, particularly the assignment of pitch and the labelling of speech units. Some respondents reported audible discontinuities, warping, overlapping, and instances when the synthetic speech was too fast during the voice evaluation.

Although this did not adversely affect intelligibility and comprehension in an evaluation setting, we believe it is a limiting factor that should be addressed. We propose to resolve these specific issues by revisiting the speech database and the resulting speech units used during voice design. The results and feedback from the pilot and formal assessments conducted on synthetic speech samples of the JC voice far exceeded our expectations, with the majority of participants returning a rating of 4 | Good for overall speech quality. This feedback is a strong indicator of both the value and the potential of Jamaican Creole synthetic voices within our society. It also confirms that open source software, despite limitations such as the occasional lack of troubleshooting documentation, can indeed be used to generate quality synthetic speech. In addition to creating and evaluating the JC synthetic voice, we documented the process of this pioneering attempt, thereby creating a roadmap for future work on speech synthesis in Jamaican Creole. Although the specific open source software used in this study was the Festival framework, the general procedure and outline provided can be adapted to voice creation within other modular toolkits, whether open source or not. As part of our evaluation, we also proposed a benchmark metric for the Jamaican Creole synthetic voice, based on the widely acclaimed Blizzard Challenges: the proposed standard is a mean opinion score (MOS) of at least 3.3 out of 5.0.

We strongly believe that, despite the limitations observed, the work done here will serve as the groundwork for text-to-speech synthesis and synthetic voice creation in the Jamaican Creole language, particularly through the use of non-proprietary software, and to an extent in other Creole languages. Our next steps include ongoing fine-tuning of the current JC synthetic voice and its subsequent packaging and deployment within an online navigation simulator or app that can be accessed and used by the public at large. In addition, based on user feedback received during the evaluation, we will experiment with providing a male speaker version of navi.


References

Alam, Firoj, Promila Kanti Nath, and Mumit Khan. 2007. "Text to Speech for Bangla Language using Festival." In Proceedings of the 1st International Conference on Digital Communications and Computer Applications (DCCA 2007), 853-859. Irbid, Jordan: IEEE Press.
Allen, Jeff. 1992. "An Overview of Text-to-Speech Systems." In Advances in Speech Signal Processing, edited by Sadaoki Furui and M. Mohan Sondhi, 741-790. New York: Dekker.
Allen, Jeff. 1998. "Lexical Variation in Haitian Creole and Orthographic Issues for Machine Translation (MT) and Optical Character Recognition (OCR) Applications." Paper presented at the First Workshop on Embedded Machine Translation Systems of the Association for Machine Translation in the Americas (AMTA) Conference, Philadelphia, October 28, 1998. http://www.linkedin.com/in/jeffallen.
Baker, Christopher. 2003. Jamaica: Your Essential Guide to the 'Island in the Sun'. Footscray, Vic: Lonely Planet.
Barnard, Etienne, and Marelie Davel. 2004. "LLSTI isiZulu TTS: Evaluation Report." Local Language Speech Technology Initiative (LLSTI). CSIR, Pretoria, South Africa, September 2004. Accessed March 22, 2013. http://www.outsideecho.com/llsti/pubs/Zulu_testing.pdf.
Black, Alan, and Kevin Lenzo. 2004. "Multilingual Text-to-Speech Synthesis." In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '04) 3:761-764. doi:10.1109/ICASSP.2004.1326656.
Black, Alan, and Kevin Lenzo. 2007. "Building Synthetic Voices." Language Technologies Institute, Carnegie Mellon University, Pittsburgh, USA. January 21, 2007. Accessed February 15, 2010. http://festvox.org/festvox.
Black, Alan, Paul Taylor, and Richard Caley. 1999. "The Festival Speech Synthesis System: System Documentation." Version 1.4, June 17, 1999. Accessed December 21, 2009. http://www.cstr.ed.ac.uk/projects/festival/manual.html.
Boersma, Paul, and David Weenink. "Praat: Doing Phonetics by Computer [Computer program]." Version 5.3.34. Accessed January 12, 2012. http://www.praat.org/.
Campbell, Nick. 2007. "Evaluation of Speech Synthesis." In Evaluation of Text and Speech Systems, edited by Laila Dybkjær, Holmer Hemsen and Wolfgang Minker, 29-64. Netherlands: Springer.
Cassidy, Frederic, and Robert Le Page. 1976/1980. Dictionary of Jamaican English. Cambridge: Cambridge University Press.
Central Intelligence Agency. The World Factbook. Accessed January 12, 2010. https://www.cia.gov/library/publications/resources/the-world-factbook/geos/jm.html.
Chomsky, Noam, and Morris Halle. 1968. The Sound Pattern of English. New York: Harper and Row.
Clark, Robert A. J., Korin Richmond, and Simon King. 2007. "Multisyn: Open-Domain Unit Selection for the Festival Speech Synthesis System." Speech Communication 49:317-330.
Cryer, Heather, and Sarah Home. 2010. Review of Methods for Evaluating Synthetic Speech. Technical Report No. 8, 1-12. Birmingham: RNIB Centre for Accessible Information. Accessed August 6, 2012. https://www.rnib.org.uk/sites/default/files/2010_02_Evaluating_synthetic_speech_review.doc.
Devonish, Hubert. 1987. "Can Computers Talk Creole? Caribbean Creole Languages in the World of Micro-Computers." Caribbean Journal of Education 14-16 (1-2): 250-257.
Devonish, Hubert, and Otelemate Harry. 2004. "Jamaican Phonology." In A Handbook of Varieties of English, edited by Edgar Schneider, Kate Burridge, Bernd Kortmann, Rajend Mesthrie and Clive Upton, 450-480. Berlin: Mouton de Gruyter.
Devonish, Hubert, and Walter Seiler. 1991. "A Reanalysis of the Phonological System of Jamaican Creole." Society for Caribbean Linguistics Occasional Papers 24.
Discussion List for Building Synthetic Voices using Festvox Tools. (n.d.). Accessed December 10, 2009. http://blog.gmane.org/gmane.science.tts.Festvox.
Dutoit, Thierry. 1997. An Introduction to Text-to-Speech Synthesis. Netherlands: Kluwer Academic Publishers.
Dutoit, Thierry. 2008. "Corpus-Based Speech Synthesis." In Springer Handbook of Speech Processing, edited by Jacob Benesty, M. Mohan Sondhi and Yiteng Huang, 437-455. Berlin: Springer-Verlag.
Edwards, Alistair D.N. 1994. "ITD Technotes: Speech Synthesis." Information Technology and Disabilities 1 (2). Accessed January 11, 2012. itd.athenpro.org/volume1/number2/edwards.html.
Eskenazi, Maxine, Christopher Hogan, Jeffrey Allen, and Robert Frederking. 1997. "Issues in Database Creation: Recording New Populations, Faster and Better Labelling." In EUROSPEECH 1997, 5th European Conference on Speech Communication and Technology, 1699-1702. Rhodes, Greece, September 22-25, 1997. ISCA Archive, http://www.isca-speech.org/archive/eurospeech_1997/e97_1699.html.
Ewender, Thomas, and Beat Pfister. 2010. "Accurate Pitch Marking for Prosodic Modification of Speech Segments." In INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association, 178-181. Makuhari, Japan, September 26-30, 2010. ISCA Archive, http://www.isca-speech.org/archive/interspeech_2010/i10_0178.html.
Festival Source Distribution. Version 2.1. November 2010. Accessed January 15, 2011. http://www.cstr.ed.ac.uk/projects/festival/download.html.
Festvox. (n.d.). Accessed August 23, 2009. http://www.Festvox.org/.
Gooden, Shelome. 2007. "Intonational Phonology of Jamaican Creole: An Autosegmental Metrical Analysis." In Online Proceedings of the ICPhS Satellite Workshop on the Phonology of Understudied or Fieldwork Languages. Saarbrücken, Germany, August 5, 2007. http://www.linguistics.ucla.edu/people/jun/Workshop2007ICPhS/Papers/ShelomeGooden-paper.pdf.
Harrington, Jonathan, and Steve Cassidy. 1999. Techniques in Speech Acoustics. Netherlands: Kluwer Academic Publishers.
Harry, Otelemate. 2006. "Jamaican Creole." Journal of the International Phonetic Association 36 (1): 125-131. doi:10.1017/S002510030600243X.
Hood, Matthew. 2004. "Creating a Voice for Festival Speech Synthesis System." Bachelor of Science Thesis, Rhodes University, Grahamstown, South Africa. ResearchGate (208033437).
Hund, Alicia, and Jennifer L. Minarik. 2006. "Getting from Here to There: Spatial Anxiety, Wayfinding Strategies, Direction Type and Wayfinding Efficiency." Spatial Cognition and Computation: An Interdisciplinary Journal 6 (3): 179-201. doi:10.1207/s15427633scc0603_1.
Hund, Alicia, Kimberly Haney, and Brad Seanor. 2008. "The Role of Recipient Perspective in Giving and Following Wayfinding Directions." Applied Cognitive Psychology 22: 896-916. http://psychology.illinoisstate.edu/amhund/Publications/Hund_Minarik_2006.pdf.
Hunt, Andrew J., and Alan W. Black. 1996. "Unit Selection in Concatenative Speech Synthesis Using a Large Speech Database." In International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 373-376. doi=10.1.1.127.9132.
Indumathi, A., and E. Chandra. 2012. "Survey on Speech Synthesis." Signal Processing: An International Journal (SPIJ) 6 (5): 140-145. Accessed December 12, 2012. http://www.cscjournals.org/manuscript/Journals/SPIJ/Volume6/Issue5/SPIJ-206.pdf.
Jeanreneaud, Philippe. 2006. "Speech and Personal Navigation Devices." White Paper. Nuance Communications Inc., August 2006. Accessed January 10, 2010. http://s3.amazonaws.com/zanran_storage/www.nuance.com/ContentPages/16868490.pdf.
Jurafsky, Daniel, and James H. Martin. 2000. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Upper Saddle River, New Jersey: Prentice Hall.
King, Simon, and Vasilis Karaiskos. 2009. "The Blizzard Challenge 2009." Blizzard Challenge Workshop 2009. CSTR, University of Edinburgh, UK. Accessed May 17, 2014. http://festvox.org/blizzard/bc2009/summary_Blizzard2009.pdf.
King, Simon, and Vasilis Karaiskos. 2011. "The Blizzard Challenge 2011." Blizzard Challenge Workshop 2011. CSTR, University of Edinburgh, UK. Accessed May 17, 2014. http://festvox.org/blizzard/bc2011/summary_Blizzard2011.pdf.
King, Simon, and Vasilis Karaiskos. 2013. "The Blizzard Challenge 2013." Blizzard Challenge Workshop 2013. CSTR, University of Edinburgh, UK. Accessed August 5, 2015. http://festvox.org/blizzard/bc2013/summary_Blizzard2013.pdf.
Klatt, Dennis. 1987. "Review of Text-to-Speech Conversion for English." Journal of the Acoustical Society of America 82 (3): 737-793.
Ladefoged, Peter. 2001. A Course in Phonetics. 4th Edition. Orlando: Harcourt College Publishers.
Ladefoged, Peter. 2005. Vowels and Consonants: An Introduction to the Sounds of Language. Oxford: Blackwell Publishers.
Lampert, Andrew. 2004. "Evaluation of the MU-TALK Speech Synthesis System." ICT Report. Accessed August 6, 2010. http://www.ict.csiro.au/staff/andrew.lampert/writing/SynthesisEvaluation.pdf.
Latacz, Lukas, Wesley Mattheyses, and Werner Verhelst. 2011. "Joint Target and Join Cost Weight Training for Unit Selection Synthesis." In INTERSPEECH 2011, 12th Annual Conference of the International Speech Communication Association, 321-324. Florence, Italy, August 27-31, 2011. ISCA Archive, http://www.isca-speech.org/archive/interspeech_2011/i11_0321.html.
Lemmetty, Sami. 1999. "Review of Speech Synthesis Technology." Master's Thesis, Helsinki University of Technology, Finland. Accessed October 12, 2009. http://www.acoustics.hut.fi/publications/files/theses/lemmetty_mst/index.html.
Lenzo, Kevin, Christopher Hogan, and Jeffrey Allen. 1998. "Rapid-Deployment Text-to-Speech in the DIPLOMAT System." In The 5th International Conference on Spoken Language Processing, Incorporating the 7th Australian International Speech Science and Technology Conference, paper 0868. Sydney Convention Centre, Sydney, Australia, November 30 - December 4, 1998. ISCA Archive, http://www.isca-speech.org/archive/icslp_1998/i98_0868.html.
Louw, J. A. 2008. "Speect: A Multilingual Text-to-Speech System." In Proceedings of the 19th Annual Symposium of the Pattern Recognition Association of South Africa (PRASA), 165-186. Cape Town, South Africa. ResearchGate (228336530).
Mason, Marilyn, and Jeff Allen. 2003. "Computing in Creole Languages: The Web Simulates Growth and Development of Historically Oral Languages." Multilingual Computing and Technology Magazine 14 (1): 24-32.
Meade, Rocky R. 1996. "On the Phonology and Orthography of Jamaican Creole." Journal of Pidgin and Creole Languages 11 (2): 325-341.
Meade, Rocky R. 2001. "The Acquisition of Jamaican Phonology." PhD Dissertation, Netherlands Graduate School of Linguistics. Amsterdam: Den Haag.
Mona GeoInformatics Institute. (n.d.). Accessed January 10, 2010. http://projects.monagis.com/jamnav/?q=node/1.
Olive, Joseph. 1997. "The Talking Computer: Text to Speech Synthesis." In HAL's Legacy: 2001's Computer as Dream and Reality, edited by David Stork, 101-130. MIT Press.
Pols, Louis C.W. 1998. "Speech Synthesis Evaluation." In Survey of the State of the Art in Human Language Technology, edited by R. Cole, 429-430. Pisa: Giardini Editori e Stampatori.
Radford, Andrew, Martin Atkinson, David Britain, Harald Clahsen, and Andrew Spencer. 1999. Linguistics: An Introduction. Cambridge: Cambridge University Press.
Rosson, Mary Beth, and A.J. Cecala. 1986. "Designing a Quality Voice: An Analysis of Listeners' Reactions to Synthetic Voices." In Proceedings of CHI '86, 192-197. Boston: ACM Press.
Rudnicky, Alexander I., Christina Bennett, Alan W. Black, Ananlada Chotimongkol, Kevin Lenzo, Alice Oh, and Rita Singh. 2000. "Task and Domain Specific Modelling in the Carnegie Mellon Communicator System." In Proceedings, Sixth International Conference on Spoken Language Processing (ICSLP 2000) 2:130-134. Beijing, China, October 16-20, 2000. ISCA Archive, http://www.isca-speech.org/archive/archive_papers/icslp_2000/i00_2130.pdf.
Santen, Jan P., Louis C. Pols, Masanobu Abe, Dan Kahn, Eric Keller, and Julie Vonwiller. 1998. "Report on the Third ESCA TTS Workshop Evaluation Procedure." In Third ESCA/COCOSDA Workshop on Speech Synthesis (SSW3-1998), 329-332. Jenolan Caves House, Blue Mountains, Australia, November 26-29, 1998. ISCA Archive, http://www.isca-speech.org/archive_open/ssw3/ssw3_329.html.
Schroeter, Juergen. 2008. "Basic Principles of Speech Synthesis." In Springer Handbook of Speech Processing, edited by Jacob Benesty, M. Mohan Sondhi and Yiteng Huang, 413-428. Berlin: Springer-Verlag.
Shalonova, Ksenia. "TTS Evaluation." Local Language Speech Technology Initiative. UK: Outside Echo. Accessed March 22, 2013. http://www.outsideecho.com/llsti/pubs/TTS_eval.pdf.
Sitaram, Sunayana, Gopala Krishna Anumanchipalli, Justin Chiu, Alok Parlikar, and Alan Black. 2013. "Text-To-Speech in New Languages Without a Standardized Orthography." In Eighth ISCA Workshop on Speech Synthesis (SSW-8), 95-100. Barcelona, Spain, August 31 - September 2, 2013. ISCA Archive, http://www.isca-speech.org/archive/ssw8/ssw8_095.html.
Sproat, Richard, and Joseph Olive. 1999. "Text-to-Speech Synthesis." In Digital Signal Processing Handbook, edited by Vijay K. Madisetti and Douglas B. Williams, 46-1 - 46-11. Boca Raton: CRC Press LLC.
Sproat, Richard. 2008. "Linguistic Processing for Speech Synthesis." In Springer Handbook of Speech Processing, edited by Jacob Benesty, M. Mohan Sondhi and Yiteng Huang, 457-469. Berlin: Springer-Verlag.
Stack Overflow. (n.d.). Accessed February 12, 2010. http://stackoverflow.com.
Strom, Volker, Robert Clark, and Simon King. 2006. "Expressive Prosody for Unit-selection Speech Synthesis." In INTERSPEECH 2006, Ninth International Conference on Spoken Language Processing, 1296-1299. Pittsburgh, PA, USA, September 17-21, 2006. ISCA Archive, http://www.isca-speech.org/archive/interspeech_2006/i06_1522.html.
SurveyGizmo. Boulder, Colorado, USA. Accessed March 23, 2013. www.surveygizmo.com.
Tabet, Yousef, and Mohamed Boughazi. 2011. "Speech Synthesis Techniques: A Survey." In 7th International Workshop on Systems, Signal Processing and their Applications (WOSSPA), 67-70. Tipaza: IEEE. doi:10.1109/WOSSPA.2011.5931414.
Taylor, Paul. 2009. Text-to-Speech Synthesis. Cambridge: Cambridge University Press.
Taylor, Paul, Alan Black, and Richard Caley. 1998. "The Architecture of the Festival Speech Synthesis System." In Proceedings of the Third ESCA/COCOSDA Workshop on Speech Synthesis, 147-152. Blue Mountains, Australia, November 26-29, 1998: ISCA. https://www.era.lib.ed.ac.uk/handle/1842/1032.
Taylor, Paul, Alan Black, and Richard Caley. 2001. "Heterogeneous Relation Graphs as a Formalism for Representing Linguistic Information." Speech Communication 33 (1-2): 153-174.
The Centre for Speech Technology Research. (n.d.). The Festival Speech Synthesis System. Accessed February 13, 2010. www.cstr.ed.ac.uk/projects/festival/.
The Jamaican Language Unit. 2009. Writing Jamaican the Jamaican Way: Ou fi rait Jamiekan. Kingston: Arawak Publications.
Tihelka, Daniel, Jiri Kala, and Jindrich Matousek. 2010. "Enhancements of Viterbi Search for Fast Unit Selection Synthesis." In INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association, 174-177. Makuhari, Japan, September 26-30, 2010. ISCA Archive, http://www.isca-speech.org/archive/interspeech_2010/i10_0174.html.
Trapani, Gina. 2006. "Geek to Live: Introduction to Cygwin." June 2006. Accessed March 16, 2012. http://lifehacker.com/179514/geek-to-live-introduction-to-cygwin-part-i?tag=softwaretop.
van Heuven, Vincent J., and Renee van Bezooijen. 1995. "Quality Evaluation of Synthesized Speech." In Speech Coding and Synthesis, edited by W. Bastiaan Kleijn and Kuldip K. Paliwal, 707-738. Amsterdam: Elsevier Science.
Weerasinghe, Ruvan, Asanka Wasala, Viraj Welgama, and Kumudu Gamage. 2007. "Festival-si: A Sinhala Text-to-Speech System." In Proceedings of the 10th International Conference on Text, Speech and Dialogue, 493-499. Berlin: Springer-Verlag.
Wells, John. 1973. Jamaican Pronunciation in London. Oxford: Blackwell.
Wright, Richard, and David Nicholas. 2009. "Measuring Vowel Duration in Praat." University of Washington Phonetics Lab, June 25, 2009.
Zen, Heiga, Takashi Nose, Junichi Yamagishi, Shinji Sako, Takashi Masuko, Alan W. Black, and Keiichi Tokuda. 2007. "The HMM-based Speech Synthesis System (HTS) Version 2.0." In Sixth ISCA Tutorial and Research Workshop on Speech Synthesis (SSW6), 294-299. Bonn, Germany, August 22-24, 2007. ISCA Archive, http://www.isca-speech.org/archive_open/archive_papers/ssw6/ssw6_294.pdf.
Zen, Heiga, Keiichiro Oura, Takashi Nose, Junichi Yamagishi, Shinji Sako, Tomoki Toda, Takashi Masuko, Alan W. Black, and Keiichi Tokuda. 2009. "Recent Development of the HMM-based Speech Synthesis System (HTS)." In Proceedings of APSIPA-ASC 2009: Asia-Pacific Signal and Information Processing Association, 2009 Annual Summit and Conference, 121-130. Sapporo: APSIPA-ASC.
Zen, Heiga, Keiichi Tokuda, and Alan Black. 2009. "Statistical Parametric Speech Synthesis." Speech Communication 51 (11): 1039-1064.


Appendices

Appendix 1

Installing and Compiling Festival Framework and Tools

do_test.sh

#!/bin/sh
gcc -v
uname -a
mkdir build
cd build
tar zxvf ../speech_tools-2.0.95-beta.tar.gz
tar zxvf ../festival-2.0.95-beta.tar.gz
tar zxvf ../festlex_CMU.tar.gz
tar zxvf ../festlex_POSLEX.tar.gz
tar zxvf ../festlex_OALD.tar.gz
tar zxvf ../festvox-2.1-release.tar.gz
tar zxvf ../festvox_kallpc16k.tar.gz
tar zxvf ../festvox_rablpc16k.tar.gz
tar zxvf ../festvox_cmu_us_slt_arctic_hts.tar.gz
tar zxvf ../festvox_cmu_us_awb_cg.tar.gz
tar zxvf ../festvox_cmu_us_rms_cg.tar.gz
export FESTVOXDIR=`pwd`/festvox   # (later issue realised: this line was originally mistyped)
export ESTDIR=`pwd`/speech_tools
cd speech_tools
./configure
make
cd ../festival
./configure
make
cd ../festvox
./configure
make
cd ../speech_tools
make test
cd ../festival
make test
cd ../festvox
make test
cd ..
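Once the script above completes, a quick way to confirm that the framework and the bundled voices actually work is to start the newly built binary (festival/bin/festival) and try the default diphone voice unpacked by the script:

(voice_kal_diphone)       ;; select the kal diphone voice installed above
(SayText "hello world")   ;; should produce audible speech if audio output is configured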

Appendix 2

Direction-Giving in Jamaican Creole Questionnaire

Gender:  Male [ ]   Female [ ]

Age Groups:  18-30 [ ]   31-50 [ ]   Over 50 [ ]

Region of Birth:  Western [ ]   Central [ ]   Eastern [ ]

Current Parish of Residence __________________   Since what age __________________

Language(s) Spoken: ___________________

Instructions: If you were to provide directions to help a Jamaican Creole speaking driver get from the Terra Nova Hotel, Kingston 10 to Emancipation Park, Kingston 5, how would you go about doing it? The route to be used is: Waterloo Road - Hope Road - Half Way Tree Road - Oxford Road

1. Please provide the prompts (in Jamaican Creole) you would use. {Do not worry about the ability to write/spell accurately in Jamaican Creole, just write it how you think it should be written.} ____________________________________________________________

2. When giving directions in a local context, do you use the terms metres and kilometres?   Yes [ ]   No [ ]

   a. How would you translate 'drive for 600 metres' from Standard English into Jamaican Creole? ___________________________________________

3. What local terms must be included when giving a local driver local directions? ____________________________________________________________

4. Do you think it makes it more transparent to include:
   a. The number of stop lights the driver would have to pass?   Yes [ ]   No [ ]
   b. Points of Interest (POIs)/Landmarks so the driver knows that s/he is still on the correct route?   Yes [ ]   No [ ]
   c. Why? ____________________________________________

5. Do you think it useful to alert the driver how long the journey might take?   Yes [ ]   No [ ]

6. Would you provide an exact time or a rough estimate (bearing in mind traffic conditions etc)?   Yes [ ]   No [ ]

7. Please note any other direction-giving techniques that you consider vital to direction giving in a local context.

Appendix 3

Sample Questionnaire Respondent - MB

1. Please provide the prompts (in Jamaican Creole) you would use. {Do not worry about the ability to write/spell accurately in Jamaican Creole, just write it how you think it should be written.}

Ans: When yu exit Terra Nova, mek a right unto Waterloo road, you will continue down 'til yu reach the intasection of Waterloo and Hope roads. Mek a right at de intasection, den a lef at the intasection of Hope and Half way tree road. Continue down dat road and mek the left unto Oxford Road. De Park deh pan yu lef.

2. When giving directions in a local context, do you use the terms metres and kilometres?   Yes [ ]   No [ x ]

   a) How would you translate 'drive for 600 metres' from Standard English into Jamaican Creole? Drive fi 'bout 600 metres

3. What local terms must be included when giving a local driver local directions? N/A

4. Do you think it makes it more transparent to include:
   a) The number of stop lights the driver would have to pass?   Yes [ x ]   No [ ]
   b) Points of Interest (POIs)/Landmarks so the driver knows that s/he is still on the correct route?   Yes [ x ]   No [ ]
   c) Why? Gives the driver a sense of confidence in my directions

5. Do you think it useful to alert the driver as to how long the journey might take?   Yes [ x ]   No [ ]

6. Would you provide an exact time or a rough estimate (bearing in mind traffic conditions etc)?   Yes [ x ]   No [ ]

7. Please note any other information giving techniques that you consider vital to direction giving in a local context.
   _________________________________________________________

Appendix 4

Sample Questionnaire Respondent - CR

1. Please provide the prompts (in Jamaican Creole) you would use. {Do not worry about the ability to write/spell accurately in Jamaican Creole, just write it how you think it should be written.}

Ans: Kom out wich paat yu si di Kyaniediyan Embasi, ton rait. Kantinu jraiv schriet. Dohn torn, Yu a go si di Kuokonot Indoschri and son Rasta man we a sel mat. Jraiv til yu si di intasekshan we wataloo miit uop ruod. Wen yu riich di instasekshan ton rait. YMCA de pan yu lef. Yu a go si wan bos stap. Jraiv til yu riich di intaseekshan we uop ruod miit aaf wie chrii. Wen yu riich, ton lef. Stie pan di lef an said. Jraiv so til aaf we chrii ruod miit aksfod ruod. Stie pan di lef an said. Tek di slip ruod. Jraiv likl bit, emansipeishan paak den pan yu lef an said.

2. When giving directions in a local context, do you use the terms metres and kilometres?   Yes [ ]   No [ x ]

   a) How would you translate 'drive for 600 metres' from Standard English into Jamaican Creole? Jraiv jos likl bit go dong di ruod

3. What local terms must be included when giving a local driver local directions? Shcriet, lef, rait, likl bit, faar

4. Do you think it makes it more transparent to include:
   a) The number of stop lights the driver would have to pass?   Yes [ ]   No [ ]   Not really
   b) Points of Interest (POIs)/Landmarks so the driver knows that s/he is still on the correct route?   Yes [ x ]   No [ ]
   c) Why? Definitely, so the driver can know where he is or if he is on the right track.

5. Do you think it useful to alert the driver as to how long the journey might take?   Yes [ x ]   No [ ]

6. Would you provide an exact time or a rough estimate (bearing in mind traffic conditions etc)?   Yes [ x ] Rough estimate   No [ ]

7. Please note any other information giving techniques that you consider vital to direction giving in a local context. Alert about pot holes

Appendix 5

Jamaican Creole Navigation Prompts

Prompts Making Use of Non-Standard Words (NSW): Cardinal Numbers

JC Prompt                                SJE Translation
(navi1 "1") wan                          1; one
(navi2 "2") tuu                          2; two
(navi3 "3") chrii                        3; three
(navi4 "4") fuor                         4; four
(navi5 "5") faiv                         5; five
(navi6 "6") siks                         6; six
(navi7 "7") sevn                         7; seven
(navi8 "8") iet                          8; eight
(navi9 "9") nain                         9; nine
(navi10 "10") ten                        10; ten
(navi11 "20") twenti                     20; twenty
(navi12 "30") torti                      30; thirty
(navi13 "40") faati                      40; forty
(navi14 "50") fifti                      50; fifty
(navi15 "60") siksti                     60; sixty
(navi16 "70") sevnti                     70; seventy
(navi17 "80") ieti                       80; eighty
(navi18 "90") nainti                     90; ninety
(navi19 "100") wan ondred                100; one hundred

Prompts and utterance strings to indicate distance
Template: Ordinal Number + Mileage/Distance

JC Prompt                                SJE Translation
(navi20 "miita")                         metre
(navi21 "kilamita")                      kilometre
(navi22 "50 miita")                      fifty metres
(navi23 "150 miita")                     one hundred and fifty metres
(navi24 "200 miita")                     two hundred metres
(navi25 "250 miita")                     two hundred and fifty metres
(navi26 "300 miita")                     three hundred metres
(navi27 "350 miita")                     three hundred and fifty metres
(navi28 "400 miita")                     four hundred metres
(navi29 "450 miita")                     four hundred and fifty metres
(navi30 "500 miita")                     five hundred metres

Template indicating direction driver should take, including street name
Template to turn: Ton DIREKSHAN pan SCHRIIT NIEM. 'Turn DIRECTION on NAME OF STREET.'

JC Prompt                                SJE Translation
(navi31 "ton")                           turn
(navi32 "lef")                           left
(navi33 "rait")                          right
(navi34 "ton lef")                       turn left
(navi35 "ton rait")                      turn right
(navi36 "pan")                           on/onto
(navi37 "ton lef pan")                   Turn left onto
(navi38 "ton rait pan")                  Turn right onto
(navi39 "ton lef pan Uol Uop Ruod.")     Turn left onto Old Hope Road.
(navi40 "Ton rait pan Uol Uop Ruod.")    Turn right onto Old Hope Road.
(navi41 "Ton lef pan Lili Wie.")         Turn left onto Lily Way.
(navi42 "Ton rait pan Lili Wie.")        Turn right onto Lily Way.

Template Directionality: Jraiv DIREKSHAN 'Drive DIRECTION'
(Indicating directionality in reference to the distance to be travelled)

JC Prompt                                SJE Translation
(navi43 "Jraiv schriet")                 drive straight
(navi44 "stie ina di lef lien")          stay/keep left
(navi45 "stie ina di rait lien.")        stay/keep right
(navi46 "kip lef")                       keep left
(navi47 "kip rait")                      keep right

Use of Ordinal Numbers with markers when giving directions
Based on both first-hand observation and native speaker knowledge, the ordinal numbers most often used in Jamaican Creole are fos 'first', sekan 'second' and tord 'third'. Fuot 'fourth' is less used but is included for marginal usage.

JC Prompts                               SJE Translation
(navi48 "fos")                           first
(navi49 "sekan")                         second
(navi50 "tord")                          third
(navi51 "fuot")                          fourth
(navi52 "kom op")                        approaching/ahead
(navi53 "rait torn a kom op")            Right turn ahead
(navi54 "lef torn a kom op")             Left turn ahead
(navi55 "rait torn")                     Right turn
(navi56 "lef torn")                      Left turn

Templates using ordinal number and marker to specify the turn the driver should take:

JC Prompts                               SJE Translation
(navi57 "Tek di neks rait.")             Take the next right turn.
(navi58 "Tek di neks lef.")              Take the next left turn.
(navi59 "tek di neks rait torn")         Take the next right (turn).
(navi60 "tek di neks lef torn")          Take the next left (turn).
(navi61 "ton rait den ton lef")          Turn right then turn left.
(navi62 "ton lef den ton rait")          Turn left then turn right.
(navi63 "Tek di fos rait.")              Take the first right turn.
(navi64 "Tek di sekan lef.")             Take the second left turn.
(navi65 "Tek di tord rait.")             Take the third right turn.
(navi66 "Tek di fuot lef.")              Take the fourth left turn.

Template indicating the side the destination is on: It de pan yu SAID 'It is on your SIDE.'

JC Prompts                               SJE Translation
(navi67 "It de pan yu lef an said.")     It is on your left.
(navi68 "It de pan yu rait an said.")    It is on your right.

Utterance strings indicating nearing/arrival at destination & others

JC Prompts                               SJE Translation
(navi69 "yu suuhn mek a rait torn")      Right turn coming up/ahead.
(navi70 "yu suuhn mek a lef torn")       Left turn coming up/ahead.
(navi71 "tek di neks egzit")             Take the next exit.
(navi72 "yu suuhn kom aafa di aiwie")    Exit ahead.
(navi73 "yu suuhn riich")                You are nearing your destination.
(navi74 "yu riich we yu a go")           You have reached your destination.
(navi75 "yu mis yu torn")                You have missed your turn.
(navi76 "ton ron wen yu get a chaans")   Turn around when possible.
(navi77 "mek a yuu torn if yu kyahn mek wan")   Make a u-turn when possible.
(navi78 "likl bit muo an yu riich")      You will soon reach your destination.
(navi79 "yu a paas torn")                You are going to miss your turn.
(navi80 "ton lef den ton rait an yu riich.")    Turn left then turn right and you have reached your destination.
(navi81 "ton rait den ton lef an yu riich")     Turn right then turn left and you have reached your destination.
(navi82 "ton lef den ton lef an yu riich")      Turn left then turn left and you have reached your destination.
(navi83 "ton rait den ton rait an yu riich")    Turn right then turn right and you have reached your destination.

Utterance strings indicating the driver is making a wrong turn or is going contrary to directions given & others
A wa yu a du? A DIREKSHAN mi se yu fi go! 'What are you doing? I said you should go DIRECTION!'

JC Prompts                               SJE Translation
(navi84 "bot si ya Pupa Jiizas!")        What on earth!
(navi85 "a wa yu a du?")                 What are you doing?
(navi86 "A rait mi se yu fi go!")        I said to go right!
(navi87 "A lef mi se yu fi go!")         I said to go left!
(navi88 "A schriet mi se yu fi go!")     I said to go/continue straight!
(navi89 "Aarait den.")                   Okay then.
(navi90 "Du wa yu waan du!")             Do as you please.
(navi91 "No kos mi wen taim yu laas!")   Don't blame me when you are lost!

Utterance strings to compliment the driver upon arriving at the destination & others

JC Prompts                               SJE Translation
(navi92 "waa, yu skil man!")             You are skilled!
(navi93 "Yu a gud jraiva stil.")         You are a good driver!
(navi94 "Giv tangks an priez se wi riich sief stil.")   We should be grateful we arrived safely.

Selected Street Names and POIs to be used in the Preliminary Jamaican Creole Speech Database

JC Prompts                               SJE Translation
(navi95 "Plombiego")                     Plumbago
(navi96 "Plombiego Paat")                Plumbago Path
(navi97 "Gyaadiinya")                    Gardenia
(navi98 "Gyaadiinya Avinyuu")            Gardenia Avenue
(navi99 "Jiraniyom")                     Geranium
(navi100 "Jiraniyom Paat")               Geranium Path
(navi101 "Palmeto")                      Palmeto
(navi102 "Palmeto Avinyuu")              Palmeto Avenue
(navi103 "Diezi")                        Daisy
(navi104 "Diezi Avinyuu")                Daisy Avenue
(navi105 "Aakid")                        Orchid
(navi106 "Aakid Paat")                   Orchid Path
(navi107 "Jorbiera")                     Gerbera
(navi108 "Jorbiera Jraiv")               Gerbera Drive
(navi109 "Vailet")                       Violet
(navi110 "Vailet Avinyuu")               Violet Avenue
(navi111 "Gyaadn")                       Garden
(navi112 "Gyaadn Bulivaad")              Garden Boulevard
(navi113 "Buoganvila")                   Bouganvila/Bougainvillea
(navi114 "Buoganvila Avinyuu")           Bouganvila/Bougainvillea Avenue
(navi115 "Beguoniya")                    Begonia
(navi116 "Beguoniya Avinyuu")            Begonia Avenue
(navi117 "Anchuuriyom")                  Anthurium
(navi118 "Anchuuriyom Jraiv")            Anthurium Drive
(navi119 "Lili")                         Lily
(navi120 "Lili Wie")                     Lily Way
(navi121 "Spatuodiya")                   Spatodia
(navi122 "Spatuodiya Avinyuu")           Spatodia Avenue
(navi123 "Botakop")                      Buttercup
(navi124 "Botakop Jraiv")                Buttercup Drive
(navi125 "Kamiiliya")                    Camelia
(navi126 "Kamiiliya Wie")                Camelia Way
(navi127 "Muona")                        Mona
(navi128 "Muona Wie")                    Mona Way
(navi129 "Kaanieshan")                   Carnation
(navi130 "Kaanieshan Wie")               Carnation Way
(navi131 "Sonflowa")                     Sunflower
(navi132 "Sonflowa Wie")                 Sunflower Way
(navi133 "Yuunivorsiti Krisent")         University Crescent
(navi134 "Palmuoral")                    Palmoral
(navi135 "Palmuoral Avinyuu")            Palmoral Avenue
(navi136 "Pechuuniya")                   Petunia
(navi137 "Pechuuniya Wie")               Petunia Way
(navi138 "Muona")                        Mona
(navi139 "Muona Ruod")                   Mona Road
(navi140 "Uol Uop Ruod")                 Old Hope Road
(navi141 "Papiin")                       Papine
(navi142 "Arieliya")                     Aralia
(navi143 "Arieliya Jraiv")               Aralia Drive

Others:

JC Prompts                               SJE Translation
(navi144 "tuol chaaj")                   Toll charge
(navi145 "yu waahn tek di ai wie?")      Do you want to take the highway?
(navi146 "yu waahn tek di tuol ruod?")   Do you want to take the toll road?
(navi147 "tuol ruod")                    Toll road
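As a point of reference for anyone reproducing the build, Festvox expects prompts of this kind to be listed in a prompt file (conventionally etc/txt.done.data) as s-expressions pairing an utterance id with the text to be recorded. A hypothetical excerpt using three of the ids above might therefore look as follows; whether the digit strings or the Jamaican Creole orthographic forms were used as the recorded text is an implementation choice and is not prescribed here.

( navi34 "ton lef" )
( navi43 "Jraiv schriet" )
( navi67 "It de pan yu lef an said." )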

Appendix 6

Jamaican Creole Phoneme Inventory

(defPhoneSet
  uwi_navi
  ;;; Phone Features
  (;; vowel or consonant
   (vc + -)
   ;; vowel length: short long diphthong schwa
   (vlng s l d a 0)
   ;; vowel height: high mid low
   (vheight 1 2 3 0 -)
   ;; vowel frontness: front mid back
   (vfront 1 2 3 0 -)
   ;; lip rounding
   (vrnd + - 0)
   ;; consonant type: stop fricative affricative nasal liquid approximant
   (ctype s f a n l r 0)
   ;; place of articulation: labial alveolar palatal labio-dental
   ;;                        dental velar glottal
   (cplace l a p b d v g 0)
   ;; consonant voicing
   (cvox + - 0)
   )
  (
   ;;(pau - 0 - - - 0 0 -)  ;; silence ...
   ;; insert the phones here, see examples in
   ;; festival/lib/*_phones.scm

   ;; JC Phone Set members (36 phonemes) - dt
   ;; 1-phoneme name, 2-vc, 3-vlng, 4-vheight, 5-vfront, 6-rnd, 7-ctype,
   ;; 8-cplace, 9-cvox
   (a   + s 3 2 - 0 0 0)
   (aa  + l 3 2 - 0 0 0)
   (an  + s 1 2 - n a -)
   (ai  + d 3 1 - 0 0 0)
   (b   - 0 0 0 0 s l +)
   (ch  - 0 0 0 0 a p -)
   (chr - 0 0 0 0 a p -)
   (d   - 0 0 0 0 s a +)
   (e   + s 2 1 - 0 0 0)
   (f   - 0 0 0 0 f d -)
   (g   - 0 0 0 0 s v +)
   (gy  - 0 0 0 0 s p +)
   (h   - 0 0 0 0 f g -)
   (hn  - 0 0 0 0 n v +)
   (i   + s 1 1 - 0 0 0)
   (ie  + d 1 1 - 0 0 0)
   (ii  + l 1 1 - 0 0 0)
   (j   - 0 0 0 0 a a +)
   (k   - 0 0 0 0 s v -)
   (ky  - 0 0 0 0 s p -)
   (ks  - 0 0 0 0 f p -)
   (l   - 0 0 0 0 l a +)
   (m   - 0 0 0 0 n l +)
   (n   - 0 0 0 0 n a +)
   (ng  - 0 0 0 0 n v +)
   (ny  - 0 0 0 0 n p +)
   (o   + s 2 3 + 0 0 0)
   (ou  + d 1 3 + 0 0 0)
   (p   - 0 0 0 0 s l -)
   (r   - 0 0 0 0 r a +)
   (s   - 0 0 0 0 f a -)
   (sh  - 0 0 0 0 f p -)
   (t   - 0 0 0 0 s a -)
   (u   + s 1 3 + 0 0 0)
   (uo  + d 1 3 + 0 0 0)
   (uu  + l 1 3 + 0 0 0)
   (v   - 0 0 0 0 f d +)
   (vn  - 0 0 0 0 n d +)
   (w   - 0 0 0 0 r l +)
   (y   - 0 0 0 0 r p +)
   (z   - 0 0 0 0 f a +)
   (zh  - 0 0 0 0 f p +)
   (pau - 0 - - - 0 0 -)  ;; silence ...
   )
  )

(PhoneSet.silences '(pau))

(define (uwi_navi_jc::select_phoneset)
  "(uwi_navi_jc::select_phoneset)
Set up phone set for uwi_navi."
  (Parameter.set 'PhoneSet 'uwi_navi)
  (PhoneSet.select 'uwi_navi)
)

(define (uwi_navi_jc::reset_phoneset)
  "(uwi_navi_jc::reset_phoneset)
Reset phone set for uwi_navi."
  t)

(provide 'uwi_navi_jc_phoneset)
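A minimal, hypothetical smoke test of this file, assuming it is saved as festvox/uwi_navi_jc_phoneset.scm and is on Festival's load path, is to load it in the Festival interpreter and make it the active phone set:

(require 'uwi_navi_jc_phoneset)  ;; loads the file via the provide form above
(uwi_navi_jc::select_phoneset)   ;; makes uwi_navi the current phone set
(print (PhoneSet.list))          ;; the printed list should now include uwi_navi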

Appendix 7

Jamaican Creole Letter-to-Sound Rule Set

;; 1) Downcase mapping of JC phonemes
(lts.ruleset
 navi_downcase
 (
  ( [ A ] = a )
  ( [ I ] = i )
  ( [ U ] = u )
  ( [ E ] = e )
  ( [ O ] = o )
  ( [ AA ] = aa )
  ( [ II ] = ii )
  ( [ UU ] = uu )
  ( [ AI ] = ai )
  ( [ IE ] = ie )
  ( [ OU ] = ou )
  ( [ UO ] = uo )
  ;; consonants
  ( [ B ] = b )
  ( [ C ] H = ch )
  ( [ CH ] = ch )
  ( [ D ] = d )
  ( [ F ] = f )
  ( [ G ] = g )
  ( [ G ] Y = gy )
  ( [ GY ] = gy )
  ( [ J ] = j )
  ( [ K ] = k )
  ( [ K ] Y = ky )
  ( [ KY ] = ky )
  ( [ L ] = l )
  ( [ M ] = m )
  ( [ N ] = n )
  ( [ N ] G = ng )
  ( [ NG ] = ng )
  ( [ N ] Y = ny )
  ( [ NY ] = ny )
  ( [ P ] = p )
  ( [ R ] = r )
  ( [ S ] = s )
  ( [ S ] H = sh )
  ( [ SH ] = sh )
  ( [ T ] = t )
  ( [ V ] = v )
  ( [ W ] = w )
  ( [ Y ] = y )
  ( [ Z ] = z )
  ( [ Z ] H = zh )
 ))

;; The JC LTS Ruleset
(lts.ruleset
 jamaican   ;; name of ruleset being defined
 ;; 2) Rule sets defined
 ;; sets that will be used in the definition of the JC rules
 ((V a i u e o aa ii uu ai ie ou uo)  ;; set of all JC vowels
  (V1 a i u e o)                      ;; short vowels
  (V2 aa ii uu)                       ;; long vowels
  (V3 ai ie ou uo)                    ;; diphthongs
  (V4 i)                              ;; high front vowel
  (V5 u)                              ;; high back vowel
  (C b ch d f g g t j k ky l p r s sh t v w y z zh)  ;; consonants
  (C1 m n ng ny)  ;; set of all JC nasal consonants
  (C2 m n)        ;; nasal consonants allowed in both onset & coda position
  (C3 ng)         ;; nasal consonant only allowed in coda position
  (C4 ny)         ;; nasal consonant only allowed in onset position
 )
 ;; 3) The rules
 (
  ( [ a ] = a )
  ( [ a ] n = an )
  ( a [ n ] = an )
  ( [ i ] = i )
  ( [ i ] n = in )
  ( [ u ] = u )
  ( [ e ] = e )
  ( [ o ] = o )
  ( [ aa ] = aa )
  ( [ aa ] n = aan )
  ( a [ a ] n = aan )
  ( [ ii ] = ii )
  ( [ uu ] = uu )
  ( [ u ] u = uu )
  ( [ ai ] = ai )
  ( [ ie ] = ie )
  ( [ ou ] = ou )
  ( [ uo ] = uo )
  ;; consonants
  ( [ b ] = b )
  ( [ ch ] = ch )
  ( [ c ] h = ch )
  ( c [ h ] r = chri )
  ( c [ h ] = ch )
  ( [ d ] = d )
  ( [ d ] r = dr )
  ( [ f ] = f )
  ( [ g ] = g )
  ( [ g ] y = gy )
  ( g [ y ] = gy )
  ( [ h ] = h )
  ( [ h ] n = hn )
  ( [ j ] = j )
  ( [ k ] = k )
  ( [ k ] y = ky )
  ( [ k ] y = ky )
  ( k [ y ] = ky )
  ( [ k ] s = ks )
  ( k [ s ] = ks )
  ( [ l ] = l )
  ( [ m ] = m )
  ( [ n ] = n )
  ( [ ng ] = ng )
  ( [ n ] g = ng )
  ( [ n ] s = ns )
  ( [ ny ] = ny )
  ( n [ y ] = ny )
  ( [ p ] = p )
  ( p [ l ] = pl )
  ( [ r ] = r )
  ( [ r ] d = rd )
  ( [ r ] n = rn )
  ( [ s ] = s )
  ( [ s ] h = sh )
  ( s [ h ] = sh )
  ( [ sh ] = sh )
  ( [ s ] t = st )
  ( [ s ] p = sp )
  ( [ sp ] = sp )
  ( s [ p ] = sp )
  ( [ t ] = t )
  ( [ v ] = v )
  ( e [ v ] n = evn )
  ( [ v ] n = vn )
  ( [ w ] = w )
  ( [ y ] = y )
  ( [ z ] = z )
  ( [ z ] h = zh )
 ))

(provide 'uwi_navi_jc_lts)
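A similarly minimal, hypothetical check of the rule set, assuming it is saved as festvox/uwi_navi_jc_lts.scm, uses Festival's lts.apply, which runs a named ruleset over a single word and returns the resulting phone list:

(require 'uwi_navi_jc_lts)
(print (lts.apply "jraiv" 'jamaican))  ;; expect something like (j r ai v)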

Appendix 8

Evaluating Jamaican Creole Synthetic Speech

Instrument

Page One

The purpose of this listening test is to evaluate "synthesized speech" created for the Jamaican Creole language using Open Source Text-to-Speech Synthesis* software, and represents a crucial element of the research project. The synthesized speech you will hear was designed with a target application in mind, namely in-car street level navigation.

The test is made up of 4 sections, 24 short questions in total and can be completed in under 10 minutes. Your participation is fully voluntary and anonymous. All responses and feedback gathered will only be used within the context of data analysis and presentation in as much as it relates to this project.

You have the option to save your progress and continue at a later time. This 'Save and Continue' option is located in the footer section of the survey.

Thank you for your participation!

- *(speech synthesis is the artificial production of human speech)

New Page

Did you participate in the first evaluation exercise for Jamaican Creole synthetic speech in March 2013?* ( ) Yes ( ) No

Page Exit Logic (unseen by respondents):

If respondent took Eval 1, disqualify: IF: The answer to Question (ID 85) is exactly equal to Yes THEN: Disqualify and display: We're sorry but you do not qualify to continue the survey at this time. You're welcome to send a message to the primary researcher at [email protected]. Thank you!

DEMOGRAPHIC DATA

1) Please select your age group ( ) under 25 ( ) 25 - 34 ( ) 36 - 44 ( ) 45 - 54 ( ) 55 +

2) Please select your gender ( ) Male ( ) Female

3) What is your level of education?* ( ) Some High School ( ) High School Graduate ( ) College Graduate ( ) Some Postgraduate ( ) Postgraduate ( ) Other: _________________

4) What is your current occupation? State your area of expertise where applicable.* [ ] Student

[ ] Academic [ ] Professional [ ] Service [ ] Retired [ ] Unemployed [ ] Other Comments: ____________________________________________

5) Are you a native Jamaican Creole speaker?* ( ) Yes ( ) No

6) How familiar are you with synthetic speech? ( ) Very Unfamiliar ( ) Unfamiliar ( ) Somewhat Familiar ( ) Familiar ( ) Very Familiar

New Page

Dear Participant, All audio files used in this survey are in .mp3 format and will play in any of the following browsers: Internet Explorer (IE9*), Firefox, Chrome, Safari, Android & iOS. *IE9 requires Flash Player to play the audio files. If you do not want to download Flash, please use Firefox or Chrome. Thank you.

SECTION 1: EVALUATION OF SPEECH QUALITY

Listen and rate the overall Speech Quality of each synthesized speech audio sample using a scale of 1-5, where 1 is "Bad" and 5 is "Excellent".

7) Please rate the overall speech quality of Audio Clip 1 ( ) 1 | Bad ( ) 2 | Poor ( ) 3 | Fair ( ) 4 | Good ( ) 5 | Excellent

8) Please rate the overall speech quality of Audio Clip 2 ( ) 1 | Bad ( ) 2 | Poor ( ) 3 | Fair ( ) 4 | Good ( ) 5 | Excellent

9) Please rate the overall speech quality of Audio Clip 3 ( ) 1 | Bad ( ) 2 | Poor ( ) 3 | Fair ( ) 4 | Good ( ) 5 | Excellent

10) Please rate the overall speech quality of Audio Clip 4 ( ) 1 | Bad ( ) 2 | Poor ( ) 3 | Fair ( ) 4 | Good ( ) 5 | Excellent

11) Additional Comments on these 4 clips are welcome:

SECTION 2: TESTING COMPREHENSION AND INTELLIGIBILITY

(A) Sentence Verification: Listen to these 2 synthesized speech audio samples and answer 'TRUE' or 'FALSE'.

12) Audio 1: Should you enter the highway? ( ) True ( ) False

13) Audio 2: After turning left, will you be on Paisley Avenue? ( ) True ( ) False

(B) Intelligibility: Listen to these 2 synthesized speech audio samples and write down what you have heard. The way you write Jamaican Creole is not a consideration.

14) Audio 3: Write down the navigational instructions you just heard.*

15) Audio 4: Write down the navigational instructions you just heard.*

SECTION 3: SYNTHESIZED SPEECH OR ORIGINAL AUDIO RECORDING?

Can you recognise the difference between "Synthesized Speech" generated by a machine and "Original Speech" made by a human? Listen to each audio clip and indicate whether the sound is machine generated or original human recorded audio.

16) Is Audio 1 synthesized speech or original audio?* ( ) Synthetic ( ) Original

206 17) Is Audio 2 synthesized speech or original audio ?* ( ) Synthetic ( ) Original

18) Is Audio 3 synthesized speech or original audio ?* ( ) Synthetic ( ) Original

19) Is Audio 4 s synthesized speech or original audio ?* ( ) Synthetic ( ) Original

20) Is Audio 5 synthesized speech or original audio ?* ( ) Synthetic ( ) Original

21) Is Audio 6 synthesized speech or original audio ?* ( ) Synthetic ( ) Original

SECTION 4: COMPARING ACOUSTIC QUALITY OF SPEECH OUTPUT

22) On a scale of 1 to 5, where 1 is "Very Dissimilar" and 5 is "Very Similar", please rank how similar the acoustic quality of Synthesized Audio 1 is to Original Audio 1.

Synthetic Audio 1 vs Original Audio 1:   Very Dissimilar | Dissimilar | Somewhat Similar | Similar | Very Similar


23) On a scale of 1 to 5, where 1 is "Very Dissimilar" and 5 is "Very Similar", please rank how similar the acoustic quality of Synthesized Audio 2 is to Original Audio 2.

Synthetic Audio 2 vs Original Audio 2:   Very Dissimilar | Dissimilar | Somewhat Similar | Similar | Very Similar

APPROPRIATENESS

This synthetic voice was built with a target domain in mind, namely in-car street level navigation. Please comment first on its APPROPRIATENESS. Additional comments are welcome.

Thank You! Once again, thank you! Your assistance has been invaluable! Please send additional feedback, inquiries or suggestions to the primary researcher, Dahlia Thompson at [email protected] or dahlia.thompson (Skype).

Appendix 9

Sample Fall-off Survey

Response ID: 51 Data

Appendix 10

Sample Respondent Data

Response ID: 74 Data

Response ID: 196 Data