Asian Journal of Computer Science and Information Technology 2: 7 (2012) 217 – 218.

Contents lists available at www.innovativejournal.in
Asian Journal of Computer Science and Information Technology
Journal homepage: http://www.innovativejournal.in/index.php/ajcsit

DEVELOPMENT OF ISOLATED WORDS SPEECH DATABASE OF MARATHI WORDS FOR AGRICULTURE PURPOSE

P. P. Shrishrimal*, R. R. Deshmukh, Vishal B. Waghmare
Dr. Babasaheb Ambedkar Marathwada University, Aurangabad, India

ARTICLE INFO
Corresponding Author: P. P. Shrishrimal, Dr. Babasaheb Ambedkar Marathwada University, Aurangabad, India

ABSTRACT

Developing a speech database is the first step in building an Automatic Speech Recognition (ASR) system. The accuracy of speech recognition depends on the quality of the collected speech data and of the training set. This paper describes the proposed procedure for collecting isolated-word data from the farmers of the Aurangabad district in order to develop an Automatic Speech Recognition system for the Marathi language.

Key Words: Speech Database, Isolated Words, Speech Recognition, Marathi.

©2012, AJCSIT, All Rights Reserved.

INTRODUCTION

Speech is the most effective and common means of communication between humans. Human beings have long been motivated to create computers that can understand and talk like humans. To this end, researchers have tried to develop systems for the analysis and classification of speech signals. Work on speech technology for English and other European languages has achieved high accuracy rates. Work on Indian languages has not yet reached the critical level needed for use as a real communication tool, unlike work on the languages of developed countries. However, there is considerable scope to develop systems for Indian languages, which vary widely.

Information today is available only to those who can read and understand the language in which it is presented. Language technologies can play a vital role in providing natural interfaces for those who cannot understand that language. These technologies are especially important for a multilingual country like India. Hindi, written in the Devanagari script, is the national language, and 17 other languages are accepted by the Constitution of India: 1) Assamese, 2) Tamil, 3) Malayalam, 4) Gujarati, 5) Telugu, 6) Oriya, 7) Urdu, 8) Bengali, 9) Sanskrit, 10) Kashmiri, 11) Sindhi, 12) Punjabi, 13) Konkani, 14) Marathi, 15) Manipuri, 16) Kannada, and 17) Nepali [1].

The Technology Development for Indian Languages (TDIL) programme of the Department of Information Technology has initiated a nationwide project to build a voice interface in six Indian languages: Marathi, Hindi, Tamil, Telugu, Bangla and Assamese. This project is being implemented by a consortium of seven institutions: Indian Institute of Technology (IIT) Madras, IIT Bombay, IIT Kanpur, IIT Guwahati, International Institute of Information Technology (IIIT) Hyderabad, Tata Institute of Fundamental Research (TIFR) Mumbai and Centre for

Development of Advanced Computing (C-DAC) Kolkata. IIT Bombay and TIFR Mumbai are jointly developing the Marathi ASR system [2].

This paper is organized as follows. Section I describes the Marathi language. Section II describes the methodology adopted for collecting the text corpus. Section III discusses the steps for speech data collection. Section IV describes the recording procedure and the recording details to be followed. The conclusion and future work are stated in the final section.

I. MARATHI LANGUAGE

Marathi is an Indo-Aryan language spoken by the Marathi people of western and central India. It has over 68 million fluent speakers worldwide. Marathi is written in the Devanagari script, as is Hindi, the national language of India. Devanagari is also the script used for writing Sanskrit, from which these languages are derived. Marathi has the fourth largest number of native speakers in India. It is spoken throughout the state of Maharashtra, which covers a vast geographical area comprising 34 districts. The major dialects of Marathi are Standard Marathi and Warhadi Marathi. Other sub-dialects include Ahirani, Dangi, Vadvali, Samavedi, Khandeshi and Malwani. Standard Marathi, however, is the official language of the state of Maharashtra; it is based on the dialects used by academics and the print media. Indic scholars distinguish 42 dialects of spoken Marathi. Dialects bordering other major language areas share many properties with those languages, which further differentiates them from standard spoken Marathi [3, 4].

As mentioned earlier, work on the Marathi language is mainly being carried out at IIT Bombay and TIFR Mumbai. A few universities have also started working on


the development of systems for the Marathi language in Maharashtra state.

II. SELECTING TEXT CORPORA

The basic requirement for developing a speech database is a grammatically correct text corpus to be recorded from various speakers. The text corpus should be correct in terms of both typography and grammar. This section describes the text selection for the isolated words. The text corpus was generated from various agriculture-related websites, with words selected from blog articles published on those websites. In all, one hundred words were selected [5, 6]. The selected words were then classified into groups by category: names of vegetables, grains, cash crops, fruits, diseases, fertilizers, pesticides and equipment.

III. SPEECH DATA COLLECTION

This section describes the steps followed for developing the speech corpus. First, the recording medium is chosen to capture the speech signal. The database will be recorded using a microphone and a laptop, with Praat used for recording the speech signal.

A. Speaker Selection

The speech data will be collected from native speakers of the language. The selected speakers will be residents of the village and will be comfortable reading and speaking the language. Speakers will be selected to cover the full diversity of age group, gender, literacy and the language they generally speak.

B. Data Collection

Each speaker will be asked to speak the 100 words, with 5 utterances of every word. Care will be taken to select speakers from the age groups 18 – 25, 26 – 35, 36 – 45, 46 – 55 and 56 – 60. The speech data will be collected by visiting 10 villages in different talukas of Aurangabad district. In every village 10 speakers will be selected to speak the words, of whom 5 will be male and 5 will be female.
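The collection plan above can be sketched as a small script that enumerates every utterance to be recorded. This is only an illustration under stated assumptions: the counts (10 villages, 10 speakers per village with 5 male and 5 female, 100 words, 5 utterances per word) come from the text, while the identifier scheme `vXX_sXX_wXXX_tX` and the convention that speakers 1 to 5 in each village are male are hypothetical and are not specified in the paper.

```python
from itertools import product

# Figures taken from the collection plan described in the text.
N_VILLAGES = 10
SPEAKERS_PER_VILLAGE = 10  # 5 male + 5 female
N_WORDS = 100
UTTERANCES_PER_WORD = 5


def utterance_id(village, speaker, word, take):
    """Hypothetical naming scheme; the paper does not define one."""
    return f"v{village:02d}_s{speaker:02d}_w{word:03d}_t{take}"


def speaker_gender(speaker):
    """Assumption: speakers 1-5 in each village are male, 6-10 female."""
    return "M" if speaker <= SPEAKERS_PER_VILLAGE // 2 else "F"


def build_plan():
    """Enumerate one record per utterance to be recorded."""
    plan = []
    for village, speaker, word, take in product(
        range(1, N_VILLAGES + 1),
        range(1, SPEAKERS_PER_VILLAGE + 1),
        range(1, N_WORDS + 1),
        range(1, UTTERANCES_PER_WORD + 1),
    ):
        plan.append(
            {
                "id": utterance_id(village, speaker, word, take),
                "village": village,
                "speaker": speaker,
                "gender": speaker_gender(speaker),
                "word": word,
                "take": take,
            }
        )
    return plan


plan = build_plan()
total_speakers = N_VILLAGES * SPEAKERS_PER_VILLAGE
utterances_per_speaker = N_WORDS * UTTERANCES_PER_WORD

print(total_speakers)          # 100 speakers
print(utterances_per_speaker)  # 500 utterances per speaker
print(len(plan))               # 50000 utterances overall
```

A list like this could double as a recording checklist in the field, with each entry ticked off once the corresponding sound file has been saved.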
The speakers will also be selected on the basis of their educational qualification and native language.

C. Data Collection Statistics

The speech data will be collected from 100 speakers. Each speaker will be asked to speak 100 words with 5 utterances of every word, so 500 utterances will be collected from every speaker. In total, 100 × 500 = 50,000 utterances will be recorded.

IV. RECORDING PROCEDURE

The isolated words will be recorded using two different microphones in front of a computer using the PRAAT speech

software. The data will be recorded in open fields, shops, houses and closed rooms where background noise will be present. Recording the speech data in such noisy environments will be useful in future for developing a robust automatic speech recognition system. PRAAT can also be used for labeling events and segments. The main strength of PRAAT is its graphical user interface. PRAAT also provides spectral analysis, pitch analysis, formant analysis, intensity analysis, principal component analysis (PCA), cochleagram drawing and many more facilities [7].

CONCLUSION

We have presented the procedure that will be followed to develop a speech database of isolated Marathi words for agricultural purposes from the Aurangabad district of Maharashtra. The database will help cover the phonetic variations of the Aurangabad district and will be helpful in developing a robust automatic speech recognition system.

ACKNOWLEDGMENT

We would like to thank the University for providing the basic facilities for carrying out this research work. Financial assistance was provided by the University Grants Commission, New Delhi.

REFERENCES

[1] Anumanchipalli G., Chitturi R., Joshi S., Kumar R., Singh S. P., Sitaram R. N. V., Kishore S. P., 2005. Development of Indian Language Speech Databases for Large Vocabulary Speech Recognition Systems. In: Proceedings of International Conference on Speech and Computer (SPECOM), Patras, Greece, pp. 591-595.
[2] Tejas Godambe and Samudravijaya K., “Speech Data Acquisition for Voice based Agricultural Information Retrieval”, presented at the 39th All India DLA Conference, Punjabi University, Patiala, 14-16 June 2011.
[3] Samudravijaya K., 2010. Multilingual Telephony Speech Corpora of Indian Languages. In: Proceedings of Computer Processing of Asian Spoken Languages. Eds. S. Itahashi and C. Tseng, Consideration Books, Los Angeles, pp. 189-193.
[4] Agrawal S. S., K. K. Arora, S. Arora, Samudravijaya K., “Text and Speech Corpus Development in Indian Languages”, ibid, pp. 94-97.

Web references
[5] www.agrowon.com cited on 10/03/2012
[6] http://lab.cgpl.iisc.ernet.in/Atlas/.../Crop names _Indianlanguages.pdf cited on 11/03/2012
[7] http://www.fon.hum.uva.nl/praat/ cited on 09/03/2012
