Development of Language Resources for Speech Application in Gujarati and Marathi

Maulik C. Madhavi, Shubham Sharma and Hemant A. Patil
Dhirubhai Ambani Institute of Information and Communication Technology (DA-IICT), Gandhinagar, Gujarat, India
E-mail: {maulik_madhavi, shubham_sharma, hemant_patil}@daiict.ac.in
Abstract—This paper discusses the development of resources, from both linguistic and signal processing perspectives, for two low resource Indian languages, viz., Gujarati and Marathi. The speech resource development covers the details of data collection, transcription at the phone and syllable levels, and the corresponding linguistic units, i.e., phones and syllables. In order to analyze performance at different fluency levels, three recording modes, viz., read, conversation, and lecture, are considered in this paper. Manual annotation of speech in terms of International Phonetic Alphabet (IPA) symbols is presented. In the later sections, we discuss speech segmentation at the syllable level and prosodic-level marking (pitch marking). The short-term energy contour is smoothed using a group-delay-based algorithm in order to detect syllable units in the speech signal. The detection rate obtained for syllable marking within a 20 % agreement duration is of the order of 60 % for read mode speech. Prosodic pitch marks are analyzed via the F0 pattern of the speech signal. The key strength of this study is the analysis across different recording modes, viz., read, conversation, and lecture. It is found that CV-type syllables (where a consonant is followed by a vowel) have the highest occurrence (more than 50 %) in both languages. Read speech is observed to perform better than spontaneous speech in terms of automatic prosodic marking.

Keywords—phonetic transcription, syllabification, pitch marking, low resource language
I. INTRODUCTION
There has been growing interest among speech research communities in developing speech corpora for low or under-resourced languages. For the development of corpora and resources for the Gujarati and Marathi languages, the speech signal is analyzed at different representation levels, viz., the phonetic level and the prosodic level. The speech signal can be processed with signal processing and linguistic processing tools. Signal processing tools exploit the information in the recorded physical speech signal; linguistic processing tools, on the other hand, extract information from the text corresponding to the recorded speech. From an acoustic-phonetic point of view, speech can be demarcated into different segments in the time-frequency domain. Linguistic abstractions, such as phonemes, allophones, and morphophonemes, are associated with the segments derived from the time-frequency domain of the speech signal [1], [2]. It is a challenging task for researchers to correlate both aspects effectively in order to develop speech technologies such as Automatic Speech Recognition (ASR). It is worthwhile to analyze and extract information from the recorded speech itself. For low resourced languages in particular, such analysis may bridge the gap between the linguistic and signal processing aspects of
speech. To start with, both aspects are investigated in this paper for the Gujarati and Marathi languages. The linguistic aspects of the collected speech data are discussed first. Here, an analysis based on International Phonetic Alphabet (IPA) symbols is performed to capture the linguistic information in the recordings [3]. In a later section, signal processing algorithms are discussed for the segmentation of syllable units and the detection of prosodic marks (viz., pitch and break marks). In earlier studies by various researchers, signal processing and machine learning tools have been combined to understand their correlation with linguistic aspects for Indian languages [4]-[6]. IPA-based symbols are automatically derived using a Hidden Markov Model (HMM)-based machine learning tool; the machine which performs this task is known as a Phonetic Engine (PE). Various issues in the Gujarati and Marathi languages related to the ambiguity between aspirated plosives vs. fricatives and the anuswara are discussed in [7]. The anuswara is denoted by the bindu (a dot as a superscript), and the corresponding sounds are nasalized.

II. DETAILS OF DATA COLLECTION AND TRANSCRIPTION

The database was collected in two Indian languages, viz., Gujarati and Marathi, which are spoken by large populations of the Gujarat and Maharashtra states of India, respectively, across various dialectal zones. The database and related statistics were reported in [8]. For each language, data was recorded in three different modes of speech, viz., read, conversation, and lecture. Read mode data was collected by having subjects read prepared text material. Conversation data was recorded in the form of question-answer exchanges between subject and interviewer. Lecture data was collected from primary schools of Gujarati and Marathi medium. All the speech data was recorded with a Zoom H4n handy recorder at 44.1 kHz, 16-bit resolution [8]. The open source WaveSurfer tool is used for the analysis of speech waveforms [9], and transcribers were trained on this tool for transcribing the speech. The speech data is divided into smaller chunks for analysis and annotation. Transcribers listen to the speech and assign the phonetic symbols responsible for generating the particular speech patterns, so that each speech unit is mapped to its corresponding sound production unit. IPA symbols are highly correlated with the speech production mechanism and hence are a good candidate for this analysis. More than 10 hours of speech data is transcribed for both languages (of which 5 hours is read mode, 2.5 hours conversation mode, and 2.5 hours lecture mode) [8]. Figure 1 shows an example of an utterance taken from Gujarati lecture mode speech.
Figure 1: Illustration of label preparation using WaveSurfer for an utterance taken from Gujarati lecture mode speech: (a) time-domain waveform, (b) F0 contour, (c) narrowband spectrogram, (d) phonetic transcription, (e) syllabification, (f) pitch marking and (g) break marking. (Sentence: અને એમનું આજે વચન એ અમર રહ્યું. /ane emnuu aaje vacana e amara rahyu/; English translation: 'and today his promise has remained immortal'.)
The figure demonstrates the manual labels at the level of phonetic symbols, syllabification, and pitch and break marks.
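As an aside, the chunking step described in Section II can be reproduced with a few lines of code. The following is a minimal Python sketch of fixed-length chunking for annotation; the 10 s chunk length and the use of the soundfile library are our assumptions, as the paper does not state how chunk boundaries were chosen:

```python
import soundfile as sf  # assumed I/O library; any WAV reader would work

def chunk_recording(path, chunk_sec=10.0):
    """Split a long recording into fixed-length chunks for annotation.
    The 10 s chunk length is an assumption; the paper does not state
    how chunk boundaries were chosen."""
    audio, fs = sf.read(path)                 # 44.1 kHz, 16-bit audio [8]
    step = int(chunk_sec * fs)
    for k in range(0, len(audio), step):
        out = f"{path[:-4]}_chunk{k // step:03d}.wav"
        sf.write(out, audio[k:k + step], fs)
```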
III. LINGUISTIC PROCESSING ASPECT
The list of phonetic symbols obtained in the transcription task is shown in Table I, which gives the broad phonetic classes, the corresponding IPA symbols, and the % coverage for both languages, viz., Gujarati and Marathi. From Table I, it can be observed that vowels cover most of the symbol occurrences in both languages, followed by semivowels, while fricatives have very low coverage in both languages.

TABLE I: BROAD PHONETIC CLASSES, THEIR RESPECTIVE IPA SYMBOLS, AND % COVERAGE IN THE GUJARATI AND MARATHI LANGUAGES (FOR 10 HOURS OF SPEECH DATA).

Broad phonetic class | IPA symbols | Gujarati (in %) | Marathi (in %)
Vowel | ə, e, i, ɔ, ɛ, o, u, ʊ, ɪ | 46.67 | 44.56
Plosive | t, k, p, d, ʧ, g, b, ʤ, ɖ, ʈ, dʰ, bʰ, ʈʰ, kʰ, ʧʰ, tʰ, gʰ, pʰ, ɖʰ | 10.43 | 9.39
Semivowels | j, l, r, ʋ, ɾ, ɭ | 24.20 | 23.65
Nasals | n, m, ɳ, ŋ | 13.02 | 15.94
Fricatives | s, ʃ, h, f, z | 5.68 | 6.51
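Given phone-level transcriptions as lists of IPA tokens, the % coverage figures of Table I can be computed by a simple counting pass. A minimal sketch, with the class inventories taken from Table I (the list-of-tokens input format is an assumption):

```python
from collections import Counter

# Broad phonetic classes and their IPA members, taken from Table I.
CLASSES = {
    "Vowel":      {"ə", "e", "i", "ɔ", "ɛ", "o", "u", "ʊ", "ɪ"},
    "Plosive":    {"t", "k", "p", "d", "ʧ", "g", "b", "ʤ", "ɖ", "ʈ",
                   "dʰ", "bʰ", "ʈʰ", "kʰ", "ʧʰ", "tʰ", "gʰ", "pʰ", "ɖʰ"},
    "Semivowels": {"j", "l", "r", "ʋ", "ɾ", "ɭ"},
    "Nasals":     {"n", "m", "ɳ", "ŋ"},
    "Fricatives": {"s", "ʃ", "h", "f", "z"},
}

def class_coverage(phones):
    """% coverage per broad phonetic class for a transcription given
    as a list of IPA tokens (the input format is an assumption)."""
    counts = Counter()
    for p in phones:
        for cls, members in CLASSES.items():
            if p in members:
                counts[cls] += 1
                break
    total = sum(counts.values())
    return {cls: 100.0 * n / total for cls, n in counts.items()}
```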
From the phonetic transcription, syllable clusters are formed. A syllable is a linguistic abstraction of speech which accommodates one syllable peak, namely a vowel. Based on their formation, syllables can be grouped into 7 types, i.e., V, CV, VC, CVC, C*V, VC*, and C*VC*, where V, C, and C* stand for a vowel, a single consonant, and more than one consonant, respectively. A consonant attached before the V is called the onset of the syllable, and a consonant attached after the V is called the coda. In general, a syllable can have one or more onset and/or coda consonants; if it contains multiple onset or coda consonants, it is said to be a complex syllable (see the classification sketch below).
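This seven-way typology can be expressed as a small classification routine. A sketch, assuming each syllable is given as a string of C/V slots (our assumed input format):

```python
def syllable_type(slots):
    """Map a syllable, given as a string of C/V slots (e.g. "CV", "CCVC"),
    to the typology of Tables II-III. Assumes exactly one vowel peak.
    Mixed forms such as "CVC*" are returned literally; how the paper
    folds them into its seven classes is not stated."""
    v = slots.index("V")                      # position of the syllable peak
    onset, coda = slots[:v], slots[v + 1:]
    o = "" if not onset else ("C" if len(onset) == 1 else "C*")
    c = "" if not coda else ("C" if len(coda) == 1 else "C*")
    return f"{o}V{c}"

# syllable_type("V") -> "V", syllable_type("CVC") -> "CVC",
# syllable_type("CCVCC") -> "C*VC*"
```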
Tables II and III show the statistics of the different kinds of syllable structures observed in both languages in the three recording modes; in these tables, R, L, and C represent read, lecture, and conversation speech, respectively.

TABLE II: STATISTICS OF THE DIFFERENT TYPES OF SYLLABLES OBTAINED IN MANUAL SYLLABIFICATION (NUMBERS INDICATE % COVERAGE) FOR THE GUJARATI LANGUAGE.

Gujarati | V | CV | VC | CVC | C*V | VC* | C*VC*
R | 6.41 | 59.52 | 2.02 | 29.01 | 2.77 | 0.22 | 0.05
L | 10.06 | 58.23 | 3.06 | 25.58 | 3.00 | 0.07 | 0.01
C | 9.08 | 61.98 | 2.08 | 24.55 | 2.14 | 0.15 | 0.03
TABLE III: STATISTICS OF THE DIFFERENT TYPES OF SYLLABLES OBTAINED IN MANUAL SYLLABIFICATION (NUMBERS INDICATE % COVERAGE) FOR THE MARATHI LANGUAGE.

Marathi | V | CV | VC | CVC | C*V | VC* | C*VC*
R | 6.79 | 56.75 | 2.02 | 30.53 | 3.84 | 0.05 | 0.02
L | 5.15 | 62.27 | 1.31 | 26.65 | 4.49 | 0.02 | 0.01
C | 8.41 | 62.85 | 1.62 | 23.36 | 3.70 | 0.04 | 0.01
From Tables II and III, it can be inferred that most syllables are of the CV type, which accounts for more than 50 % of the entire syllable coverage in both Gujarati and Marathi. This is expected, since the writing scripts of Indian languages decompose the basic graphemes into a consonant followed by a vowel, forming /CV/-type syllable units; /CV/ syllables are thus the general form of syllable units in Indian languages and are expected to be the most frequent. In addition, the number of syllables with consonants in the coda position is very small, which might be due to the difficulty of pronouncing such syllables. Likewise, complex syllables contribute very little to the syllable coverage. Next, we analyze syllable coverage by computing the number of unique syllables for every minute of recorded speech.
Figure 2: Number of unique syllables with respect to the duration of speech data.
The number of unique syllables increases as the amount of speech data increases, and at some point this growth saturates. Figure 2 shows the syllable coverage for both languages and all recording modes. Here, data points are tagged as <language>-<mode>, where <language> is Gujarati or Marathi, represented by 'G' and 'M' respectively, and <mode> is read, lecture, or conversation, represented by 'R', 'L', and 'C' respectively.
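The growth curve of Figure 2 can be computed as follows. A sketch, assuming the manual labels provide a start time for every syllable token:

```python
from collections import defaultdict

def unique_syllable_growth(timed_syllables):
    """Cumulative number of unique syllables per minute of speech.
    `timed_syllables` is a list of (start_time_in_seconds, syllable)
    pairs taken from the manual labels (assumed format)."""
    per_min = defaultdict(set)
    for t, syl in timed_syllables:
        per_min[int(t // 60)].add(syl)
    seen, growth = set(), []
    for m in range(max(per_min) + 1):
        seen |= per_min[m]
        growth.append(len(seen))  # growth[m]: unique count after m+1 minutes
    return growth
```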
IV. SIGNAL PROCESSING ASPECT

In this section, we discuss various algorithms that correlate the linguistic abstraction with the acoustic-phonetic information via signal processing tools.

A. Speech Syllabification

A speech syllable contains a vowel within it, and a vowel is found to have relatively higher short-term energy (STE) than consonant units, so STE-based information is an important cue for syllable-level segmentation. The speech segmentation at the syllable level is performed using the minimum-phase group delay approach. Detailed descriptions of the group-delay-based algorithm and of the Window Scale Factor (WSF) and γ parameters are given in [10]. For boundary detection, we use a lowpass version of the speech signal with a cut-off frequency of 500 Hz; WSF is chosen as 10 and γ is taken as 0.01, as suggested in [10]. The lowpass version of the signal is used because vowel and syllable energy is concentrated in the low-frequency zone (the vocal source vibration frequency, i.e., F0, is typically less than 500 Hz).
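A simplified sketch of this segmentation procedure is given below. It follows the spirit of the group-delay algorithm of [10] (lowpass filtering at 500 Hz, γ-compressed STE, minimum-phase group-delay smoothing with the WSF controlling the window length), but the exact windowing and peak-picking details are our assumptions rather than the published recipe:

```python
import numpy as np
from scipy.signal import butter, filtfilt, find_peaks

def gd_syllable_boundaries(speech, fs, wsf=10, gamma=0.01,
                           frame_ms=20, hop_ms=10):
    """Group-delay-style syllable segmentation sketch (after [10])."""
    # Lowpass at 500 Hz: vowel/syllable energy lies below the F0 region.
    b, a = butter(4, 500.0 / (fs / 2.0), btype="low")
    x = filtfilt(b, a, speech)

    # Gamma-compressed short-term energy (STE) contour.
    frame, hop = int(frame_ms * fs / 1000), int(hop_ms * fs / 1000)
    ste = np.array([np.sum(x[i:i + frame] ** 2)
                    for i in range(0, len(x) - frame, hop)]) + 1e-12
    e = ste ** gamma

    # Treat the symmetrized contour as a magnitude spectrum; its IDFT,
    # truncated to N/WSF causal samples, behaves like a minimum-phase
    # sequence whose group delay is a smoothed version of the contour.
    seq = np.fft.ifft(np.concatenate([e, e[::-1]])).real
    m = max(len(e) // wsf, 2)
    h = seq[:m] * np.hamming(2 * m)[m:]

    # Group delay: tau(w) = Re{X(w) conj(Y(w))} / |X(w)|^2, y[n] = n x[n].
    nfft = 2 * len(e)
    X = np.fft.fft(h, nfft)
    Y = np.fft.fft(np.arange(m) * h, nfft)
    tau = ((X * np.conj(Y)).real / (np.abs(X) ** 2 + 1e-12))[:len(e)]

    # Peaks of the smoothed contour ~ syllable nuclei; boundaries are
    # hypothesized at the minima between consecutive peaks.
    peaks, _ = find_peaks(tau)
    bounds = [p1 + np.argmin(tau[p1:p2]) for p1, p2 in zip(peaks, peaks[1:])]
    return [idx * hop / fs for idx in bounds]  # boundary times in seconds
```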
1) Performance Evaluation: Performance is evaluated for different agreement windows, which scale with the adjacent syllable durations. The evaluation metrics are the % detection rate (% DR) within the syllable agreement duration, the % over-segmentation within the agreement window (% OSWA), and the % over-segmentation outside the agreement window (% OSOA). % DR should be high, and the over-segmentation measures should be low. The x % agreement interval (AgInt) for the i-th segment is defined as:

$\zeta_i - \frac{x}{100}\,(\zeta_i - \zeta_{i-1}) \;\le\; H_i \;\le\; \zeta_i + \frac{x}{100}\,(\zeta_{i+1} - \zeta_i)$,   (1)

where the ζi are the manually marked syllable boundaries and Hi is a hypothesized boundary (HyB). It can be observed from eq. (1) that % AgInt takes the syllable duration information into account, because it is based on the syllable boundary positions. Formally, the evaluation metrics are defined from the positions of the hypothesized boundaries and the agreement intervals as follows:

$\%\,\mathrm{DR} = \frac{\#\,\text{times HyB falls within AgInt}}{\#\,\text{total reference boundaries}} \times 100\,\%$,   (2)

$\%\,\mathrm{OSWA} = \frac{\#\,\text{times HyB falls inside AgInt}}{\#\,\text{total HyB}} \times 100\,\%$,   (3)

$\%\,\mathrm{OSOA} = \frac{\#\,\text{times HyB falls outside AgInt}}{\#\,\text{total HyB}} \times 100\,\%$.   (4)
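These metrics translate directly into code. A sketch of eqs. (1)-(4); the treatment of the utterance-initial and utterance-final boundaries (skipped, as eq. (1) needs both neighbours) and of multiple hypotheses inside one interval are our assumptions:

```python
import numpy as np

def syllabification_metrics(ref, hyp, x=20.0):
    """% DR, % OSWA and % OSOA from eqs. (1)-(4). `ref` holds the manual
    boundaries, `hyp` the hypothesized ones (both in seconds). The first
    and last reference boundaries are skipped, as eq. (1) needs both
    neighbours; that choice is ours, not the paper's."""
    ref, hyp = np.sort(np.asarray(ref)), np.sort(np.asarray(hyp))
    lo = ref[1:-1] - (x / 100.0) * (ref[1:-1] - ref[:-2])   # eq. (1), left
    hi = ref[1:-1] + (x / 100.0) * (ref[2:] - ref[1:-1])    # eq. (1), right
    # eq. (2): reference boundaries with at least one HyB in their AgInt.
    detected = sum(np.any((hyp >= a) & (hyp <= b)) for a, b in zip(lo, hi))
    # eqs. (3)-(4): HyB falling inside / outside all agreement intervals.
    inside = sum(np.any((h >= lo) & (h <= hi)) for h in hyp)
    dr = 100.0 * detected / len(lo)
    oswa = 100.0 * inside / len(hyp)
    osoa = 100.0 * (len(hyp) - inside) / len(hyp)
    return dr, oswa, osoa
```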
2) Database: The corpus is prepared for the two languages, viz., Gujarati and Marathi, and the three recording modes, viz., read, conversation, and lecture. The sentences are selected such that they contain at least 10 syllables. The statistics of the number of sentences used and their durations are shown in Table IV.

TABLE IV: STATISTICS OF THE DATA USED IN THE SYLLABIFICATION TASK.

Language | Mode | Sentences | Duration (minutes)
Gujarati | Read | 2331 | 295
Gujarati | Conversation | 1619 | 200
Gujarati | Lecture | 1371 | 150
Marathi | Read | 3128 | 373
Marathi | Conversation | 899 | 90
Marathi | Lecture | 848 | 122
3) Results: The results of the segmentation task are shown in Figure 3 for Marathi and Gujarati. The value of % AgInt varies from 5 to 50 % in steps of 5 %. It can be observed that as % AgInt increases, the detection rate increases, but at the same time the over-segmentation increases, i.e., there is a tradeoff between detection rate and over-segmentation. In addition, it can be observed that the performance in read mode is better than in conversation and lecture modes. The % OSWA increases approximately exponentially with % AgInt, possibly because it counts the number of events within an interval and may follow a generalized Poisson distribution; % OSOA decreases roughly linearly with % AgInt, while % DR increases with % AgInt.

Figure 3: Performance of automatic syllabification (% DR, % OSWA, and % OSOA as functions of % AgInt).
B. Pitch Marking

A spoken language conveys linguistic as well as paralinguistic (e.g., suprasegmental) information, which relates to timing, prominence, intonation patterns, etc. Intonation is the variation in the frequency of vocal fold vibration, known as the F0 pattern, and the perceived pitch is highly correlated with the F0 pattern. The F0 pattern can be assigned different levels, viz., VL (Very Low), L (Low), H (High), and VH (Very High). Pitch marks were manually assigned based on perception, with the F0 contour observed for reference; these serve as the ground truth. The following algorithm is used for automatic marking of pitch levels:
1. The F0 contour is found for each utterance using the 0-Hz resonator [11].
2. F0 is interpolated using spline interpolation to find pitch frequencies at each sample point, to facilitate the comparison with the ground truth.
3. The mean F0 is computed for each utterance; this is known as the reference pitch (Fref).
4. Each sample point is assigned a pitch level (VL / L / H / VH) according to the following conditions:
   a. If F < 0.5 Fref, pitch-level = VL;
   b. If 0.5 Fref <= F < Fref, pitch-level = L;
   c. If Fref <= F < 1.5 Fref, pitch-level = H;
   d. If F >= 1.5 Fref, pitch-level = VH.
5. To find the accuracy, the pitch level at each ground truth point is compared with the automatically determined pitch level:
$\mathrm{Accuracy} = \frac{\#\,\text{correctly determined pitch levels}}{\#\,\text{total pitch levels}} \times 100\,\%$.   (5)
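Steps 2-5 of the algorithm can be sketched as follows. The 0.5 Fref lower threshold in step 4 is an assumption chosen to mirror the stated 1.5 Fref upper threshold; the F0 extraction itself (step 1, the 0-Hz resonator of [11]) is assumed to be available and is not reproduced here:

```python
import numpy as np
from scipy.interpolate import CubicSpline

def assign_pitch_levels(frame_times, f0, sample_times):
    """Steps 2-4: spline-interpolate F0, take the utterance mean as the
    reference pitch Fref, and assign VL/L/H/VH per sample point. The
    0.5*Fref lower threshold is an assumption mirroring the stated
    1.5*Fref upper threshold."""
    voiced = f0 > 0                           # keep voiced frames only
    f = CubicSpline(frame_times[voiced], f0[voiced])(sample_times)
    fref = f.mean()                           # step 3: reference pitch
    levels = np.select([f < 0.5 * fref, f < fref, f < 1.5 * fref],
                       ["VL", "L", "H"], default="VH")
    return levels, fref

def pitch_level_accuracy(auto_levels, gt_levels):
    """Step 5 / eq. (5): % of points whose automatic pitch level
    matches the manually marked ground truth."""
    return 100.0 * np.mean(np.asarray(auto_levels) == np.asarray(gt_levels))
```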
Further details of the automatic pitch marking algorithm are discussed in [12]. The performance of pitch marking, along with the data statistics, is shown in Table V. It can be observed that the performance in read mode is better than in lecture and spontaneous modes. The reason could be that the thresholds used in step 4 do not generalize to spontaneous and lecture mode speech, as these modes contain more variation in the F0 pattern than read speech. In addition to this, prosodic breaks can be detected using discontinuities in the F0 pattern and the STE profile (a sketch follows Table V).
TABLE V: STATISTICS OF THE DATA USED IN PITCH MARKING AND PERFORMANCE OF THE PITCH MARK DETECTION SYSTEM.

Language | Mode | Duration | Total pitch marks | Accuracy
Marathi | Read | 48 | 6354 | 63.33 %
Marathi | Lecture | 28 | 5149 | 54.08 %
Marathi | Spontaneous | 19 | 4472 | 46.82 %
Gujarati | Read | 56 | 16619 | 58.89 %
Gujarati | Lecture | 20 | 6059 | 41.98 %
Gujarati | Spontaneous | 20 | 6830 | 43.32 %
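For the break detection suggested above, a minimal sketch is given here; the paper only proposes using discontinuities in F0 and STE, so the energy threshold and minimum gap duration below are our assumptions:

```python
import numpy as np

def hypothesize_breaks(f0, ste, hop_sec=0.01, min_gap_sec=0.2):
    """Flag prosodic breaks where both the F0 contour and the STE profile
    drop out for long enough. The energy threshold and the minimum gap
    duration are assumptions; the paper gives no concrete values."""
    silent = (f0 <= 0) & (ste < 0.1 * np.median(ste))
    breaks, start = [], None
    for i, s in enumerate(np.append(silent, False)):  # sentinel ends last run
        if s and start is None:
            start = i
        elif not s and start is not None:
            if (i - start) * hop_sec >= min_gap_sec:
                breaks.append(hop_sec * (start + i) / 2.0)  # gap centre (s)
            start = None
    return breaks
```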
V. CONCLUSIONS AND FUTURE PLANS

In this paper, the linguistic and signal processing aspects of two low resource languages, viz., Gujarati and Marathi, are discussed. IPA-based transcription is a good candidate for bridging these two aspects. Different speech recording modes are considered in order to analyze performance, and a statistical analysis of syllables across the recording modes is presented. In the later sections, signal processing tools are used to mark syllable boundaries and pitch patterns. These algorithms work well on read speech in most cases, the likely reason being that read speech is well articulated. Our future plan is to mark prosodic breaks in speech under the different recording modes and to correlate them with silence. In addition, we need to investigate features which generalize the algorithms to the spontaneous (i.e., lecture and conversation) modes.

ACKNOWLEDGEMENTS

The authors would like to thank the Department of Electronics and Information Technology (DeitY), Government of India, for sponsoring the project "Development of prosodically guided phonetic engine for searching speech databases in Indian languages" and for their kind support in carrying out this research work.
REFERENCES
[1] P. Bhaskararao, "Salient phonetic features of Indian languages in speech technology," Sadhana, vol. 36, no. 5, pp. 587-599, 2011.
[2] P. Ladefoged, A Course in Phonetics, 5th Ed., Boston: Thomson Wadsworth, 2006.
[3] "International Phonetic Alphabet (IPA)," The International Phonetic Association. [Online]. Available: http://www.langsci.ucl.ac.uk/ipa/. {Last Accessed: 9th May 2014}.
[4] B. Yegnanarayana and S. V. Gangashetty, "Machine learning for speech recognition: an illustration of phonetic engine using hidden Markov models," in Proc. Int. Conf. Frontiers of Interface Between Statistics and Sciences, pp. 319-328, 2009.
[5] K. E. Manjunath, K. S. Rao and D. Pati, "Development of phonetic engine for Indian languages: Bengali and Oriya," in Oriental COCOSDA, Gurgaon, pp. 1-6, 25-27 Nov. 2013.
[6] B. D. Sarma, M. Sarma, M. Sarma and S. M. Prasanna, "Development of Assamese phonetic engine: Some issues," in Annual IEEE India Conference (INDICON), pp. 1-6, 2013.
[7] H. A. Patil, M. C. Madhavi, K. D. Malde and B. B. Vachhani, "Phonetic transcription of fricatives and plosives for Gujarati and Marathi languages," in Int. Conf. on Asian Language Processing (IALP), Hanoi, pp. 177-180, 2012.
[8] K. D. Malde, B. B. Vachhani, M. C. Madhavi, N. H. Chhayani and H. A. Patil, "Development of speech corpora in Gujarati and Marathi for phonetic transcription," in Oriental COCOSDA, Gurgaon, pp. 1-6, 25-27 Nov. 2013.
[9] "WaveSurfer software," Speech, Music and Hearing, KTH. [Online]. Available: http://www.speech.kth.se/wavesurfer/. {Last Accessed: 9th May 2014}.
[10] V. Kamakshi Prasad, T. Nagarajan and H. A. Murthy, "Automatic segmentation of continuous speech using minimum phase group delay functions," Speech Communication, vol. 42, no. 3, pp. 429-446, 2004.
[11] B. Yegnanarayana and K. S. R. Murty, "Event-based instantaneous fundamental frequency estimation from speech signals," IEEE Trans. on Audio, Speech, and Language Processing, vol. 17, no. 4, pp. 614-624, May 2009.
[12] A. Sreejith, L. Mary, K. S. Riyas, A. Joseph and A. Augustine, "Automatic prosodic labeling and broad class Phonetic Engine for Malayalam," in Int. Conf. on Control, Communication and Computing (ICCC), pp. 522-526, 2013.