
Open Source Projects and “free” software for Tamil computing

A G Ramakrishnan and Shiva Kumar H R
MILE Laboratory, Department of Electrical Engineering, Indian Institute of Science, Bangalore.

Software may give you access to its source without being distributed free of charge, and software that is distributed free of charge need not be open source. An open source project is software whose original code (program) is accessible to the user. So, a well-informed user may, depending upon the distribution licence (say, GNU GPLv3), be able to customize the software for his or her own use, or add modules by first understanding the existing programs and then modifying them or adding new ones (functions, classes or a better GUI).

This article lists the major open source projects and freely distributed software currently available for certain aspects of Tamil computing. These include Tesseract OCR from Google (originally from Hewlett-Packard Labs) [1], the Festival Speech Synthesis System from the University of Edinburgh [2], espeak [3], the HMM-based Speech Synthesis System (HTS) [4], the Sphinx speech recognition toolkit maintained by CMU [5], the HTK toolkit for speech recognition [6], the MILE Android keyboard for mobile phones [7,8] and the Indic Keyboards IME for Windows and Linux [9,10].

Highly technical open source projects need technically qualified and committed individuals, working for an extended period of time, to obtain great results. Contrary to what a casual onlooker may assume, open source projects require a great deal of management, just as any mature proprietary software product does. Standard open source projects have a sizeable hierarchy: software architects, who design the whole system; a number of contributors, who write code; and committers, who check the integrity of that code and then formally accept it into the code base. One also needs to coordinate the many voluntary contributors and divide the work among them. For heavily used software, there will also be a constant flow of questions from contributors and users, which need to be analysed and answered regularly. Hence, the project also needs people committed to servicing such requests on a regular basis.

Open source optical character recognition (OCR) project

Tesseract OCR was originally developed by HP Labs during 1985 to 1995. It was one of the top 3 engines in the 1995 UNLV Accuracy test. Since 2006, it has been extensively improved by Google. It supports about 60 languages, including Tamil. It has versions for almost all operating systems: Windows, Linux, Mac OS X, Android, iPhone, etc. The URL for this open source project is: https://code.google.com/p/tesseract-ocr/

So, this is a good platform for open source enthusiasts who want to develop a good open source OCR for Tamil. However, to the best knowledge of the authors, not a single paper has been presented on this open source OCR in any Tamil Internet conference, starting from the Chennai one in 1999 to the recent one in Pondicherry.
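For readers who want to try the engine on a scanned Tamil page, the following is a minimal sketch of calling the Tesseract command-line tool from Python. It assumes that the tesseract binary and the Tamil trained data (language code tam) are already installed; the function names and file names are hypothetical and chosen only for illustration.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, output_base, lang="tam"):
    """Return the argument list for: tesseract <image> <outputbase> -l <lang>."""
    return ["tesseract", image_path, output_base, "-l", lang]


def run_tamil_ocr(image_path, output_base):
    """Run Tesseract on one image and return the recognised text.

    Tesseract writes its result to <output_base>.txt, which is read back here.
    Assumes the binary is on PATH and the Tamil data is installed.
    """
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract binary not found on PATH")
    subprocess.run(build_tesseract_command(image_path, output_base), check=True)
    with open(output_base + ".txt", encoding="utf-8") as handle:
        return handle.read()
```

Calling run_tamil_ocr("page.png", "page") would, under these assumptions, leave the recognised text in page.txt and return it as a string.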

Technical Background required for OCR project

We have downloaded the Tamil Tesseract OCR and tested it. It is reasonably good. Further improving the performance of this system requires knowledge of digital image processing, segmentation, interpolation, orthogonal transforms, pattern recognition and n-gram language models, and a reasonable familiarity with the characteristics of the old and new Tamil script.

Open source text to speech (TTS) conversion projects

There are also very large open source speech synthesis, or text to speech conversion, projects. We give below the details and URLs of three of them.

1. Festival Speech Synthesis System from the University of Edinburgh [11]. This has been available for over twenty years [2]. It is directly supported by the original project leader, Dr. Alan Black.
2. HMM-based Speech Synthesis System (HTS) [12]. This is relatively recent [4].
3. espeak is another open source TTS project [3], which already supports Tamil; we know of blind students who use it. It uses formant synthesis and is available for both Windows and Linux.

These are all reasonably well managed projects. People around the world have developed TTS in many languages, including Tamil [13, 14], using one or more of the above open source projects. Currently, the Department of Information Technology, Government of India, through its wing Technology Development for Indian Languages, is funding large projects in TTS and ASR for many Indian languages, including Tamil. The TTS consortium project is led by Prof. Hema Murthy of Computer Science, and the ASR project by Prof. S. Umesh of Electrical Engineering, both from IIT Madras. Professors in other IITs and engineering colleges are co-investigators in these projects and work on other Indian languages.

What is required to develop a TTS of acceptable quality using one of the above frameworks is a few hours of Tamil speech from a very good speaker with good diction and pronunciation. The data can be segmented using HMM-based forced alignment, a facility that the speech synthesis frameworks also provide. However, no open source group has presented a paper at any of the INFITT conferences on creating a TTS for Tamil using any of these synthesis frameworks, probably because of the technical background required.

Technical and other background required for speech synthesis project

Anyone who wants to contribute to improving the quality of a TTS needs a good knowledge of the speech production mechanism, phonetics and Tamil phonology (grapheme to phoneme conversion rules), and digital signal processing, along with good programming ability. To resolve the actual category of a numeral (a phone number, a year, the cost of something in rupees, etc.), tools such as classification and regression trees could be used. Also, familiarity with common words borrowed from other languages (say, English) that are regularly used in Tamil will be an advantage.
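To make the numeral problem concrete, here is a small sketch of the kind of questions a decision tree might ask before a TTS front end decides how to read a digit string. The rules, thresholds, category names and the list of currency markers are all hand-written assumptions for illustration; a real classification and regression tree would learn its questions from labelled text.

```python
import re

# Hypothetical currency markers that may precede an amount in Tamil text.
CURRENCY_MARKERS = ("Rs", "Rs.", "ரூ", "ரூ.")


def classify_numeral(token, previous_word=""):
    """Guess how a digit string should be read aloud.

    The nested questions mimic the shape of a small decision tree.
    """
    digits = re.sub(r"[ ,\-]", "", token)  # drop separators such as "2,500"
    if not re.fullmatch(r"[0-9]+", digits):
        return "unknown"
    if previous_word in CURRENCY_MARKERS:
        return "currency"
    if len(digits) == 10:
        return "phone_number"
    if len(digits) == 4 and 1000 <= int(digits) <= 2100:
        return "year"
    return "cardinal"
```

For instance, classify_numeral("2014") yields "year", whereas the same digits after a currency marker, classify_numeral("2014", "Rs."), yield "currency", so the surrounding context changes how the number is spoken.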

Automated Speech Recognition (ASR) Projects

There is a major open source speech recognition toolkit called CMU Sphinx [15], maintained by Carnegie Mellon University, Pennsylvania, United States. Even commercial systems are said to have been developed based on the Sphinx speech recognition toolkit [16]. The speech recognition consortium funded by the Department of Information Technology is using this framework to develop ASR for limited domains (tourism and health care) in Tamil and many other Indian languages. The main input needed to create a Tamil speech recognition system using this toolkit is a few tens of hours of speech data. Ten open source enthusiasts can form a team and each record one hour of Tamil speech every weekend for a year; at 10 speakers × 1 hour × 52 weekends, this yields 520 hours of recorded Tamil speech, with which they can create a good ASR for Tamil.

HTK is another popular research toolkit [6] available for the development of speech recognition systems [17]. It is used regularly by many research laboratories around the world. It employs continuous density probability density functions (Gaussian mixture models) and hidden Markov models (HMM) for developing speech recognition systems. It also has modules for forced alignment based segmentation of a speech waveform, given the corresponding phone sequence, and a provision to compute language models from a text corpus and plug the resulting model into the ASR framework.

Technical Background required for automated speech recognition (ASR) project

Anyone who wants to develop an independent ASR framework needs a good knowledge of speech signal processing, speech parameterization such as mel frequency cepstral coefficients (MFCC), speaker adaptation fundamentals, phonetics and Tamil phonology, machine learning, Gaussian mixture models (GMM) and hidden Markov models (HMM), along with good programming ability and a good proficiency in fast search algorithms and software engineering.
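The forced alignment mentioned above, which segments a recording given its known phone sequence, can be illustrated with a short dynamic programming sketch. This is not the HTK or Sphinx implementation: it assumes one state per phone, no transition probabilities, and frame-level log-likelihoods that have already been computed by some acoustic model. The function names and the toy phone labels are hypothetical.

```python
def forced_align(frame_scores):
    """Align T frames to a known left-to-right sequence of P phones.

    frame_scores[t][p] is the log-likelihood of frame t under the p-th phone
    of the transcription. Every phone must cover at least one frame and no
    phone may be skipped. Returns one phone index per frame.
    """
    num_frames = len(frame_scores)
    num_phones = len(frame_scores[0])
    if num_frames < num_phones:
        raise ValueError("need at least one frame per phone")
    neg_inf = float("-inf")
    best = [[neg_inf] * num_phones for _ in range(num_frames)]
    advanced = [[False] * num_phones for _ in range(num_frames)]
    best[0][0] = frame_scores[0][0]  # the first frame must belong to phone 0
    for t in range(1, num_frames):
        for p in range(num_phones):
            stay = best[t - 1][p]
            advance = best[t - 1][p - 1] if p > 0 else neg_inf
            advanced[t][p] = advance > stay
            best[t][p] = max(stay, advance) + frame_scores[t][p]
    # Trace back from the last phone at the last frame.
    p = num_phones - 1
    path = [p]
    for t in range(num_frames - 1, 0, -1):
        if advanced[t][p]:
            p -= 1
        path.append(p)
    path.reverse()
    return path


def to_segments(path, phones):
    """Turn a per-frame path into (phone, first_frame, last_frame) tuples."""
    segments = []
    start = 0
    for t in range(1, len(path) + 1):
        if t == len(path) or path[t] != path[t - 1]:
            segments.append((phones[path[start]], start, t - 1))
            start = t
    return segments
```

Given five frames whose scores favour the first phone for two frames and the second phone for the remaining three, forced_align returns the path [0, 0, 1, 1, 1], which to_segments converts into the phone boundaries needed for a labelled speech corpus.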
Of late, deep neural networks have been replacing the GMMs in the acoustic modelling part.

Managing an open source project

This varies from one project to another but, in general, requires a considerable amount of rigorous evaluation of contributed code. One may refer to [18], for example, to get an idea of how Apache projects are managed. The terms and conditions of usage of open source software also vary from one licence to another. Some open source licences oblige users to put whatever they develop using the open source tool back in the open source domain. Other licences impose fewer constraints. For example, certain Apache licences allow users to modify the code and use it in their own products, with no obligation to release the modified code publicly. Please see [19 - 21] for more information on different open source licences.

References

1. https://code.google.com/p/tesseract-ocr/
2. http://www.cstr.ed.ac.uk/projects/festival/
3. http://espeak.sourceforge.net/
4. http://hts.sp.nitech.ac.jp/
5. http://cmusphinx.sourceforge.net/
6. http://htk.eng.cam.ac.uk/
7. H R Shiva Kumar and A G Ramakrishnan, “Open source TamilNet99 keyboard for Android,” Proc. 13th Tamil Internet Conference, Pondicherry, Sept. 19-21, 2014.
8. https://code.google.com/p/indic-keyboards-android/
9. Abhinava Shivakumar, Akshay Rao, Arun S, A G Ramakrishnan, “A Free Tamil Keyboard Interface for Business and Personal Use,” Proc. Tamil Internet 2010, Coimbatore, June 23-26, 2010, pp. 634-640.
10. https://code.google.com/p/indic-keyboards/
11. Alan Black, Paul Taylor, Richard Caley, Rob Clark, Korin Richmond, Simon King, Volker Strom, Heiga Zen, “The Festival speech synthesis system, version 1.4.2,” http://www.cstr.ed.ac.uk/projects/festival.html
12. Heiga Zen, Keiichi Tokuda, Alan W Black, “Statistical parametric speech synthesis,” Speech Communication, 2009, Vol. 51(11), pp. 1039-1064.
13. Sreekanth Majji and A G Ramakrishnan, “Festival Based Maiden TTS System for Tamil Language,” Proc. 3rd Language and Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics, Poznan, Poland, Oct. 5-7, 2007, pp. 187-191.
14. Raghava Krishnan K, S Aswin Shanmugam, Anusha Prakash, Kasthuri G R, Hema A Murthy, “IIT Madras’s Submission to the Blizzard Challenge 2014,” Blizzard Workshop, Singapore, 2014.
15. Xuedong Huang, Fileno Alleva, Hsiao-Wuen Hon, Mei-Yuh Hwang, Kai-Fu Lee, Ronald Rosenfeld, “The SPHINX-II speech recognition system: an overview,” Computer Speech & Language, Vol. 7, Issue 2, April 1993, pp. 137-148.
16. http://cmusphinx.sourceforge.net/
17. S J Young, “The HTK hidden Markov model toolkit: Design and philosophy,” CUED/F-INFENG/TR.152, September 6, 1994. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.17.8190&rep=rep1&type=pdf, accessed on Feb. 16, 2015.
18. http://www.apache.org/foundation/how-it-works.html
19. https://www.gnu.org/philosophy/open-source-misses-the-point.html
20. https://www.gnu.org/philosophy/categories.html
21. https://www.gnu.org/licenses/license-list.html
