Development of Marathi Language Speech Database from Marathwada Region

Pukhraj P. Shrishrimal, Ratnadeep R. Deshmukh, Vishal B. Waghmare, Sushma Borade, Pooja V. Janse
Department of Computer Science and Information Technology, Dr. Babasaheb Ambedkar Marathwada University, Aurangabad, India

Ganesh B. Janvale
MGM's Institute of Biosciences and Technology, MGM Campus, N-6, CIDCO, Aurangabad-431 003 (MS), India

Abstract: This paper describes the development of speech databases for the Marathi language as spoken in the Marathwada region of Maharashtra state. It covers two isolated word speech databases for the agriculture domain, one isolated word speech database for the travel domain, an isolated word speech database of numbers, and a continuous speech database for agriculture. The databases will be useful to researchers working on speech technologies for the Marathi language. The developed corpora contain 104,000 utterances of isolated words and 36,864 utterances of continuous spoken sentences. The corpora were recorded at a 16 kHz sampling frequency in natural environments. The corpora are annotated and will be made available to researchers.

Keywords: Corpus; Corpora; PRAAT; Marathi; Continuous speech; Isolated words.

I. INTRODUCTION

Human beings communicate by writing, by sign language, or by speech. Writing was once the most widely used method of passing messages. As technology became part of day-to-day life, the most common mode of communication shifted from writing to speech. There are around 7000 spoken languages in the world [1].

Researchers are trying to develop new interfaces for communicating with computer systems. The traditional ways of interacting with a system are a pointing device (i.e. a mouse) and textual input (a keyboard). Researchers are also working on using brain signals to control the mouse pointer. Speech can likewise be used to communicate with a system. Because speech is the most common and widely used mode of communication, it is natural for people to expect speech-based interfaces between humans and computers [2]. Researchers are trying to develop systems that can record, interpret and understand speech. Work on such systems started in the early 1960s.

In today's information era most information is in digital form and is accessible only to those who can read or understand a particular language. Language technologies can address this by providing natural interfaces that make digital content available to people speaking different languages and that facilitate the exchange of information across language barriers. Most countries are implementing e-governance systems to facilitate the work of government. Systems that understand spoken language can be very helpful in domains such as agriculture, health care, government services and tourism. Language technologies can play an even more vital role in developing countries, and specifically in India, which is a multilingual nation.

Automatic speech recognition for English and other European languages has achieved comparatively high recognition rates. Work on Indian languages started in the late 1990s and early 2000s. Few speech databases are available for Indian languages, and those that exist cover selected languages only.

This paper describes the development of a limited vocabulary continuous speech database, two isolated word speech databases for agriculture, an isolated word speech database for travel in Aurangabad, and a speech database of numbers, all in the Marathi language. This work could be of great help to researchers working in speech processing and speech recognition.

The paper is organized as follows. Section II describes the Marathi language. Section III describes the development of the text corpus. Section IV describes the steps followed for development of the speech database. Section V gives the details of the developed speech databases. Section VI describes the problems faced during corpus development, and Section VII describes the annotation of the speech database. Section VIII presents the conclusion and future work, followed by the references.

This work is sponsored by the University Grants Commission, New Delhi, under the Major Research Project scheme. Project entitled "Development of Database and Automatic Recognition System for Continuous Marathi Spoken Language for agriculture purpose in Marathwada Region". 978-1-4673-8279-3/15/$31.00 ©2015 IEEE

II. MARATHI LANGUAGE

Marathi is one of the 23 languages recognized by the Constitution of India and is written in the Devanagari script. Devanagari is the script used for writing Sanskrit, from which these languages are derived. Marathi is an Indo-Aryan language spoken by the Marathi people of western and central India. Marathi has the fourth largest number of native speakers in India [3]. Standard Marathi is the official language of Maharashtra state, which covers a vast geographical area consisting of 35 districts. The major dialects of Marathi are Standard Marathi and Warhadi Marathi [4]. Standard Marathi is based on the dialects used by academics and the print media. Indic scholars distinguish 42 dialects of spoken Marathi [5, 6].

III. DEVELOPMENT OF TEXT CORPUS

The development of a speech database requires a grammatically correct text corpus [7]. Four text corpora were developed for the speech databases: one for continuous sentences and three for isolated words in Marathi. The sentences and isolated words were selected from blogs, news articles, forums and websites on the internet.

The text corpus for continuous speech consists of 230 sentences in Marathi. The sentences are questions related to the major crops cultivated in four districts of the Marathwada region (Aurangabad, Jalna, Beed and Osmanabad). The number of sentences for each crop is shown in Table I. Table II shows the set of sentences for the turmeric crop and their IPA (International Phonetic Alphabet) transcriptions. Questions were collected in the same way for the remaining 14 crops. The sentences were taken from Marathi-language agriculture forums and blogs and from a few newspapers. The IPA transcriptions for the text corpus were prepared with reference to [8].

TABLE I. DETAILS OF THE SENTENCES ACCORDING TO CROP

| Sr. No. | Name of Crop | No. of Sentences |
|---|---|---|
| 1. | Jowar | 16 |
| 2. | Cotton | 15 |
| 3. | Sugar Cane | 15 |
| 4. | Soybean | 15 |
| 5. | Wheat | 16 |
| 6. | Pigeon Pea | 19 |
| 7. | Gram | 19 |
| 8. | Maize | 17 |
| 9. | Green Gram (Mug) | 17 |
| 10. | Black Gram | 15 |
| 11. | Safflower | 14 |
| 12. | Chili | 17 |
| 13. | Ginger | 12 |
| 14. | Turmeric | 11 |
| 15. | Custard Apple | 12 |
| | Total Sentences | 230 |
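The per-crop counts in Table I can be checked against the stated corpus size with a few lines of Python. This is only an illustrative sketch: the dictionary and function names are hypothetical, and the numbers are copied from Table I.

```python
# Sentence counts per crop, copied from Table I.
# The names below are illustrative and not part of the original work.
SENTENCES_PER_CROP = {
    "Jowar": 16, "Cotton": 15, "Sugar Cane": 15, "Soybean": 15, "Wheat": 16,
    "Pigeon Pea": 19, "Gram": 19, "Maize": 17, "Green Gram (Mug)": 17,
    "Black Gram": 15, "Safflower": 14, "Chili": 17, "Ginger": 12,
    "Turmeric": 11, "Custard Apple": 12,
}


def total_sentences(counts):
    """Return the total number of sentences across all crops."""
    return sum(counts.values())


def crops_with_fewest_sentences(counts):
    """Return the crop name(s) with the smallest sentence set, sorted."""
    lowest = min(counts.values())
    return sorted(crop for crop, n in counts.items() if n == lowest)


if __name__ == "__main__":
    print("crops:", len(SENTENCES_PER_CROP))
    print("total sentences:", total_sentences(SENTENCES_PER_CROP))
    print("smallest set:", crops_with_fewest_sentences(SENTENCES_PER_CROP))
```

The total agrees with the 230 sentences reported for the continuous text corpus, and the turmeric set shown in Table II is the smallest of the fifteen.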

TABLE II. SET OF SENTENCES FOR TURMERIC AND THEIR RESPECTIVE IPA (INTERNATIONAL PHONETIC ALPHABET)
(Only the English translations are reproduced here; the Devanagari sentences and IPA transcriptions are not legible in this copy.)

| Sr. No. | Translation |
|---|---|
| 1. | In which season should turmeric be sown? |
| 2. | What should the weather be like for planting turmeric? |
| 3. | Which varieties of turmeric should be used for sowing? |
| 4. | How many seeds per hectare are generally required for turmeric? |
| 5. | What should be the time difference between two crops of turmeric? |
| 6. | Which intercrops should be grown with turmeric? |
| 7. | How should the land be tilled before sowing turmeric? |
| 8. | How should fertilizers be used for turmeric? |
| 9. | Which are the common diseases of turmeric? |
| 10. | Which pesticides should be used for turmeric? |
| 11. | How much yield do we get per hectare of turmeric? |

Two isolated word text corpora for agriculture in Aurangabad district were developed. Each consists of 100 agriculture-related words grouped into categories. The categories and the number of words in each are shown in Table III (a) and (b).

TABLE III. NUMBER OF WORDS FOR ISOLATED WORDS SPEECH FOR AGRICULTURE PURPOSE

(a)

| Sr. No. | Category | No. of Words |
|---|---|---|
| 1. | Flowers | 31 |
| 2. | Fruits | 19 |
| 3. | Vegetables | 24 |
| 4. | Equipments | 9 |
| 5. | Grains | 10 |
| 6. | Ayurvedic Plants | 7 |
| | Total | 100 |

(b)

| Sr. No. | Category | No. of Words |
|---|---|---|
| 1. | Fruits | 12 |
| 2. | Grains | 10 |
| 3. | Vegetables | 15 |
| 4. | Fertilizers | 7 |
| 5. | Pesticides | 19 |
| 6. | Diseases | 20 |
| 7. | Equipments | 11 |
| 8. | Cashcrops | 6 |
| | Total | 100 |

An isolated word text corpus for travel in Aurangabad district was developed. It consists of 120 words naming places that can be visited in the district, grouped into categories. The categories and the number of words in each are shown in Table III.

TABLE III. NUMBER OF WORDS FOR TRAVEL & TOURIST PURPOSE

| Sr. No. | Category | No. of Words |
|---|---|---|
| 1. | Mall | 5 |
| 2. | Auditoriums | 7 |
| 3. | Cinema Halls | 11 |
| 4. | Markets | 6 |
| 5. | Temples | 4 |
| 6. | Stadium and Grounds | 6 |
| 7. | Hotels | 22 |
| 8. | Tourist Places | 20 |
| 9. | Restaurants | 39 |
| | Total Words | 120 |

Another isolated word text corpus, of numbers in the Marathi language, was developed. It consists of ten words, the numbers from 0 to 9. The text corpus of numbers in Devanagari script, along with its International Phonetic Alphabet (IPA) transcription, is shown in Table IV.

TABLE IV. TEXT CORPUS OF NUMBERS AND ITS IPA

| Sr. No. | Number | Devanagari | IPA (International Phonetic Alphabet) |
|---|---|---|---|
| 1. | 0 | शून्य | /ʃunjə/ |
| 2. | 1 | एक | /ek/ |
| 3. | 2 | दोन | /don/ |
| 4. | 3 | तीन | /tin/ |
| 5. | 4 | चार | /tsar/ |
| 6. | 5 | पाच | /pats/ |
| 7. | 6 | सहा | /səha/ |
| 8. | 7 | सात | /sat/ |
| 9. | 8 | आठ | /aʈʰ/ |
| 10. | 9 | नऊ | /nəu/ |

IV. STEPS FOLLOWED FOR DEVELOPMENT OF SPEECH DATABASE

This section describes the procedure followed to develop the continuous speech database in the Marathi language. It covers the selection of speakers, the recording environment, speech data acquisition, the software used, the file format, and the variations captured and problems observed while recording the speech samples.

A. Selection of the Speakers
The speech data was collected from people belonging to four districts of the Marathwada region: Aurangabad, Jalna, Beed and Osmanabad. The speakers were in the age group of 20 to 55 years and were selected on the basis of their age, gender, education and background.

B. Recording Environment
The speech data was collected in ordinary environments with background noise. Most samples were recorded inside a house, on an open farm, or in a small room under noisy conditions.

C. Speech Data Acquisition
The speech data was acquired using three different microphones. Two were Sennheiser headsets, the PC 350 and the PC 360, both with noise cancellation. The third was a Micro Mic C520 manufactured by KAY Pentax Private Ltd. The microphone was placed approximately 3.5 to 4 cm from the speaker's mouth. Each speaker was asked to speak the words or sentences of the developed text corpus. The samples were collected with one of these microphones connected to a laptop. PRAAT software was used to record the speech, and each recording was stored in .wav format.

The main strength of PRAAT is its graphical user interface. PRAAT provides spectral, pitch, formant and intensity analysis, as well as functions for drawing cochleagrams, spectrograms and speech signal plots. The user can also write custom scripts in PRAAT and convert them into stand-alone programs. Scripting also allows PRAAT to be controlled from another program. The user can also implement learning algorithms for training speech recognition. Most importantly, PRAAT is open source software and can be extended to suit one's needs [9].
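Since the recordings are stored as .wav files and the corpora are reported as 16 kHz, 16-bit mono, a collected file can be checked against that format before it is added to a database. The following is a minimal sketch using Python's standard `wave` module, not the authors' tooling. The function names and the `write_silence` helper, which only creates a test file, are hypothetical.

```python
import os
import tempfile
import wave

# Format reported for the corpora: 16,000 Hz, 16-bit, mono.
EXPECTED_RATE_HZ = 16000
EXPECTED_SAMPLE_WIDTH_BYTES = 2  # 16 bits per sample
EXPECTED_CHANNELS = 1


def read_wav_format(path):
    """Read the header of a PCM .wav file and return its basic format."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        return {
            "rate_hz": rate,
            "sample_width_bytes": wav.getsampwidth(),
            "channels": wav.getnchannels(),
            "duration_s": wav.getnframes() / rate,
        }


def matches_corpus_format(fmt):
    """Return True if the format is 16 kHz, 16-bit, mono."""
    return (
        fmt["rate_hz"] == EXPECTED_RATE_HZ
        and fmt["sample_width_bytes"] == EXPECTED_SAMPLE_WIDTH_BYTES
        and fmt["channels"] == EXPECTED_CHANNELS
    )


def write_silence(path, seconds=1.0, rate_hz=16000):
    """Hypothetical helper: write a 16-bit mono file of silence for testing."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate_hz)
        wav.writeframes(b"\x00\x00" * int(seconds * rate_hz))


if __name__ == "__main__":
    sample = os.path.join(tempfile.mkdtemp(), "sample.wav")
    write_silence(sample, seconds=0.5)
    info = read_wav_format(sample)
    print(info, "ok" if matches_corpus_format(info) else "format mismatch")
```

A check of this kind would catch files recorded with a different sampling rate or in stereo. It does not assess noise level or recording quality.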

 

V. DEVELOPED SPEECH CORPORA

This section describes the speech databases developed in the Marathi language from the Marathwada region. For each database it gives the sampling frequency used for recording, the number of speakers, the educational background of the speakers, and the total number of utterances collected.

A. Isolated Numeric Speech Database
The isolated numeric speech database was collected from 160 speakers of the Marathwada region aged 20 to 55 years. The speakers were mostly literate and represented different villages or taluka places of the region. Of the 160 speakers, 80 were male and 80 were female. The speech data was captured in a regular room at a sampling frequency of 16,000 Hz in 16-bit mono and stored in .wav format. The speakers were asked to speak the numbers 0 to 9 in Marathi, with 5 utterances of each digit. The database consists of 8000 utterances in total, 800 for each number.

B. Isolated Word Speech Database for Travel Purpose
The isolated word speech database for travel was collected from 100 speakers of the Marathwada region aged 20 to 55 years. The speakers were mostly literate and came from different villages or taluka places of the region. Of the 100 speakers, 50 were male and 50 were female. The speech data was captured in a regular room at a sampling frequency of 16,000 Hz in 16-bit mono and stored in .wav format. The speakers were asked to speak 120 different words related to tourist, recreational, historical and religious places, with 3 utterances of each word. The database consists of 36,000 utterances in total.

C. Isolated Word Speech Database for Agriculture Purpose
Two isolated word speech databases were developed for agriculture, each with speech samples from 100 speakers of Aurangabad district in the Marathwada region. The speakers were aged 20 to 55 years, were mostly literate, and came from different villages or taluka places of Aurangabad district. Of the 100 speakers, 50 were male and 50 were female. The speech data was captured in a regular room at a sampling frequency of 16,000 Hz in 16-bit mono and stored in .wav format. The speakers were asked to speak 100 different agriculture-related words, with 3 utterances of each word. Each database consists of 30,000 utterances [10]. With two word sets of 100 words each, the isolated word speech databases for agriculture together contain 60,000 utterances.

D. Continuous Speech Database for Agriculture Purpose
The continuous Marathi speech database is currently under development. Speech samples have been collected from approximately 753 speakers to date, and about 600 speakers remain. The target for the complete database is 1440 speakers from different villages of the Marathwada region. The speech data was collected at a sampling frequency of 16,000 Hz in 16-bit mono and stored in .wav format. Each speaker was asked to speak the sets of sentences for the crops grown in their taluka place. Generally, about 55 to 60 sentences were recorded from each speaker. The main aim of the continuous speech database is to capture the maximum phonetic variation of the Marathwada region. The continuous speech database currently contains 36,864 utterances.

In total, the databases developed during this research work consist of 104,000 utterances of isolated words and 36,864 utterances of continuous speech. The sizes of the developed speech corpora are shown graphically in Figure 1.

Fig. 1. Graphical Representation of Size of different Speech Corpora developed
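The utterance counts reported in Section V follow from multiplying the number of speakers, the vocabulary size and the number of repetitions per word. The sketch below reproduces that arithmetic. The class and variable names are illustrative assumptions, and the figures are those stated in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IsolatedCorpus:
    """One isolated word database, described by the figures in Section V."""
    name: str
    speakers: int
    words: int
    repetitions: int

    @property
    def utterances(self):
        # Every speaker utters every word the stated number of times.
        return self.speakers * self.words * self.repetitions


# Figures as reported for each isolated word database.
ISOLATED_CORPORA = [
    IsolatedCorpus("Numbers", speakers=160, words=10, repetitions=5),
    IsolatedCorpus("Travel", speakers=100, words=120, repetitions=3),
    IsolatedCorpus("Agriculture (a)", speakers=100, words=100, repetitions=3),
    IsolatedCorpus("Agriculture (b)", speakers=100, words=100, repetitions=3),
]

# Size of the continuous speech database at the time of writing.
CONTINUOUS_UTTERANCES = 36864


def total_isolated_utterances(corpora):
    """Sum the utterance counts of all isolated word databases."""
    return sum(corpus.utterances for corpus in corpora)


if __name__ == "__main__":
    for corpus in ISOLATED_CORPORA:
        print(f"{corpus.name}: {corpus.utterances}")
    print("isolated total:", total_isolated_utterances(ISOLATED_CORPORA))
    print("continuous:", CONTINUOUS_UTTERANCES)
```

The per-database products are 8000, 36,000, 30,000 and 30,000, which sum to the 104,000 isolated word utterances reported for the complete collection.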

VI. PROBLEMS FACED DURING SPEECH CORPORA DEVELOPMENT

The development of the text corpus took a long time. Information about the major crops grown in specific parts of the Marathwada region, and other related information about those crops, was difficult to find. Such information is mostly available in English, and finding the correct Marathi equivalents was a big challenge while developing the text corpus. Very few websites carry information about agricultural products and activities in Marathi; most are blogs, forums and a few websites developed by agricultural research centres. The frequently asked questions, or the information most often required by farmers, were used to prepare the set of sentences. The sentences were checked for typographic errors, which could have resulted in wrong pronunciation of a word while the speech samples were being captured.

During speech sample collection it was difficult to convince speakers to spare their time and give samples. Convincing female speakers was the most difficult part, and recording speech samples from female speakers generally required long sessions.

VII. ANNOTATION AND LABELLING OF SPEECH CORPORA

PRAAT software was used for the annotation and labelling of the speech database. PRAAT was developed at the University of Amsterdam, the Netherlands. It provides functions for speech analysis, synthesis, manipulation, labelling and segmentation, and is specifically designed for studying phonetics by computer. PRAAT provides levels of segmentation with the individual items. The labels and boundary information of a speech file are stored as a text file in a PRAAT-specific format. Every labelled segment is associated with the start and end times of the voiced part of the speech signal. This permits labelling of selected segments only, which is helpful for phonetic study and proper labelling [11].

VIII. CONCLUSION AND FUTURE SCOPE

This paper discussed the design and development of speech corpora in the Marathi language of the Marathwada region. The methodology adopted can be followed by other researchers working on language technologies for Indian languages. The development of the speech corpora is finished, and the collected corpora are now being annotated. The developed speech corpora are phonetically rich. Marathi is a language in which the pronunciation of words changes every few kilometres, and during development we tried to capture the maximum variation that occurs in the Marathwada region. The speech corpus will be made available to researchers on prior request and with the permission of the institution.

After the completion of the work, the developed database will be compared with available corpora. A robust speech recognition system will also be developed once the speech corpus is complete.

Acknowledgment

The authors would like to thank the speakers who spared their time during the recording sessions for the speech database. The authors would also like to thank the University authorities for providing the infrastructure to carry out the research.

References

[1] Statistical Summary of Languages [Online]. http://www.ethnologue.com/statistics/area#1
[2] Pukhraj Shrishrimal, R. R. Deshmukh, Vishal Waghmare, "Indian Language Speech Database: A Review", International Journal of Computer Application (IJCA), Vol. 47, No. 5, pp. 17-21, June 2012.
[3] "Abstract of Language Strength in India: 2001 Census". Censusindia.gov.in. Retrieved 2013-05-09.
[4] Ramesh Vaman Dhongde, Kashi Wali, "Marathi". London Oriental and African Language Library (John Benjamins Publishing Company) 13: 101-139. ISBN 9789027238139. (2009).
[5] K. Samudravijaya, "Multilingual Telephony Speech Corpora of Indian Languages", in Proceedings of Computer Processing of Asian Spoken Languages, Eds. S. Itahashi and C. Tseng, Consideration Books, Los Angeles, pp. 189-193 (2010).
[6] S. S. Agrawal, K. K. Arora, S. Arora, K. Samudravijaya, "Text and Speech Corpus Development in Indian Languages", ibid, pp. 94-97.
[7] P. P. Shrishrimal, R. R. Deshmukh, Vishal B. Waghmare, "Development of Isolated Words Speech Database of Marathi Words for Agriculture Purpose", Asian Journal of Computer Science and Information Technology (AJCSIT-2012), Vol. 2, No. 7, July 2012, pp. 217-218.
[8] International Phonetic Alphabet for Marathi [Online]. Source: http://en.wikipedia.org/wiki/Marathi_phonology
[9] PRAAT Tutorial [Online]. http://savethevowels.org/praat/UsingPraatforLinguisticResearchLatest.pdf
[10] Pukhraj P. Shrishrimal, Ratnadeep R. Deshmukh, Vishal B. Waghmare, "Marathi Isolated Words Speech Database for Agriculture Purpose", International Journal of Engineering Innovation & Research, Vol. 3, No. 3, May-June 2014, pp. 248-253.
[11] Santosh Gaikwad, Bharti Gawali, Suresh Mehrotra, "Creation of Marathi Speech Corpus for Automatic Speech Recognition", International Oriental COCOSDA 2013, pp. 23-27.

 