iPGCON 2014, STES’s, SKNCOE, Pune
TEXT-TO-SPEECH CONVERSION FOR MATHEMATICAL DATA By
Sachin Kulkarni PG Student, Department of Information Technology Maharashtra Institute of Technology, Pune
+91-94220-83695
Dr. Debajyoti Mukhopadhyay Dean(R&D) MIT Group of Institutions Professor & HOD Information Technology Maharashtra Institute of Technology, Pune
+91-77091-52655
ABSTRACT Text-to-speech (TTS) technology makes written material easier to learn by delivering it in audio form. A TTS system takes text as input, converts the graphemes into phonemes, and finally converts those phonemes into speech. Graphemes are the written characters of the text; phonemes are the smallest units of sound that distinguish meaning in a language. Numerous approaches and algorithms exist for TTS conversion, and a number of automated tools are available on the market for converting written material to speech. This work surveys research papers on TTS technology and presents a novel approach. The proposed tool, “MathsSays”, aims to help legally blind students, as well as sighted users, understand mathematics better through audio. The written mathematics material is scanned as an image, the image is processed to detect text, and the detected text is extracted. The extracted text is passed to the TTS conversion component, which generates an audio file that speaks the text in the image. The challenge in this work is handling formulae, which are not in plain English form. To address this, a table is maintained that maps each formula to its respective text. Whenever a formula is encountered, it is looked up in the table and the respective text is retrieved and spoken aloud. In this way, the system supports a better understanding of mathematics. Keywords - Graphemes, MathsSays, Phonemes, TTS
I.
INTRODUCTION
Reading and writing mathematics is inherently different from reading and writing text. Representing mathematical equations and formulae is a complex task. The established techniques that help visually challenged people interact with mathematics fall into two categories: Braille-like special languages and audio methods. This work follows the audio method. The paper gives a basic overview of TTS systems and describes the newly designed tool, MathsSays. It is intended as a starting point for TTS conversion, especially for mathematics.
II.
OVERVIEW OF TTS
A text-to-speech system [1] comprises five fundamental components: Text Analysis and Detection; Text Normalization and Linearization (number conversion, abbreviation conversion, etc.); Phonetic Analysis (converts the orthographical symbols into phonological ones using a phonetic alphabet, e.g. /ow/ for “down” as well as “house”); Prosodic Modeling and Intonation (prosody is the combination of stress pattern, rhythm and intonation in speech; prosodic modelling describes the speaker's emotion); and Acoustic Processing, which involves Concatenative Synthesis (pre-recorded human voice stored in a database), Formant Synthesis (the speech is artificial and robotic) and Articulatory Synthesis (based on models of the human vocal tract, which have to be developed).
III.
EXISTING TOOLS
A. MATH-SPEAK
Mathspeak [2] is an application for reading math formulae with text-to-speech. It involves two main steps: converting the math equation to a markup language format, and preparing the Amsmath format. The written mathematical formulae are calculated by various functions and written as output into files in MathML or LaTeX notation. These standard formats can generate a textual form of the math formula that can be converted to audio by the TTS function but is not necessarily understandable to users. Amsmath is a LaTeX package. The Amsmath form of a mathematical equation is passed to TTS to produce audio. A point of concern is that Mathspeak takes only LaTeX equations as input.
B. MATH-READER
MathReader [3] reads out equations and formulae with mathematical symbols in Thai. It involves four steps. “Phrase Identification” consists of phrase segmentation and text identification. “Text Reader” performs word segmentation, syllable segmentation and unit forming (analogous to abbreviation). The “MathEx Reader” module generates a Thai syllable sequence from the math expression in three steps: MathML Conversion, Math Parsing and Math-Thai Mapping. “Math Reader” generates Thai speech from the Thai syllable sequence.
C. EASY CONVERTER
EasyConverter [4] quickly creates accessible Word documents, large print, MP3 audio, DAISY talking books and Braille. It is easy to learn for those with no experience of creating accessible documents, and is equally suited to experienced professionals looking for a single, flexible, high-quality alternative-format creation tool that meets the needs of dyslexic, visually impaired and learning-disabled students.
D. AEL DATA
AEL Data [4] was established in 2001 by a group of IT professionals with several decades of experience in IT and data conversion. AEL Data produces DAISY talking books from any source: printed text, microfilms or digital images.
E. MATH-DAISY
MathDaisy [4] works with Microsoft Word, the Save As DAISY add-in for Microsoft Word, and MathType. Installing the Save As DAISY add-in adds a “Save As DAISY” menu item to Word's File menu. This command saves the document as a DAISY Digital Talking Book. MathDaisy enhances the Word-to-DAISY conversion process by converting the equations in the document to MathML, as required by the DAISY format. If the original document contains mathematical equations, users will need to install the MathDaisy add-in to add support for mathematical equations, making them accessible to students, teachers, engineers, and scientists with disabilities.
IV.
LITERATURE REVIEW (A) TEXT DETECTION & EXTRACTION
A. REFINEMENT OF DIGITISED DOCUMENTS
This system [5] can refine a text-embedded PDF document by recognizing the PDF as images, and can be combined with other OCR systems that output recognition results as text embedded in a PDF document. The system comprises the following processes: segmentation of the ordinary text and math parts of a document; geometric position checking; word matching; and finally, recognition of the parts with mathematical formulae. The system can add mathematical information to PDF files generated by another OCR system and correct the recognition errors around mathematical formulae by utilizing the extracted text files as their recognition results. The system is not evaluated with respect to the bibliography because OCR does not recognize it.
B. FORMULA EXTRACTION FROM CHINESE DOCUMENT
This approach [6] involves two basic steps: isolated formula extraction and embedded formula extraction. Parzen windows are used to extract the isolated formulae, and Bayes' theorem is used to extract the embedded formulae from the text lines. The factors for isolated formula extraction are line height, space above and below the line, left and right indent, distance between the formula and its sequence number, and the density and proportion degree of the characters' width. All symbols appearing in Chinese documents are divided into four categories according to their shape and function: 1) Chinese characters; 2) punctuation, such as “,” and “:”; 3) English words and digits; 4) all mathematical signs. Each of these categories represents the priority variable while applying Bayes' theorem. Finally, the sample is assigned to the category whose conditional probability is largest. Because non-Chinese characters are regarded as formulae in this approach, it is inevitable that some non-mathematical symbols, such as English words, digits and punctuation, are mixed into the extraction result.
C. TEXT FINDER
In this system [7], text is first detected using texture segmentation and spatial cohesion constraints, then cleaned up and extracted. There are five major tasks in this approach: Texture Segmentation, Chip Generation (regions of connected components), Chip Scale Fusion (output chips are mapped back to the original image), Text Cleanup and Chip Refinement. The authors propose a text extraction system that works well for normal documents as well as documents with shaded or textured backgrounds.
D. ROBUST ALGORITHM FOR TEXT DETECTION IN IMAGES
This approach [8] is based on the application of a color reduction technique, a method for edge detection, and the localization of text regions using projection profile analyses and geometrical properties. The software is written entirely in Java so that the code can easily run on any platform. Specific algorithms are developed for each of the above techniques, e.g. edge detection and localization. The output of the algorithm is text boxes with a simplified background, ready to be fed into an OCR engine for subsequent character recognition.
V.
LITERATURE REVIEW (B) TEXT TO SPEECH CONVERSION
A. BANGLA TEXT TO SPEECH CONVERSION: A SYLLABIC UNIT SELECTION APPROACH
This paper [9] describes a syllabic method of unit selection synthesis for converting text into speech for the Bangla language, the sixth most spoken language in the world. It also describes the development process of text collection, text analysis and speech synthesis.
1. Text collection: a novel of Rabindranath Tagore.
2. Text analysis: the authors built their own list of syllables using every phoneme from Bangla phonology, whether or not a syllable is used in the text corpus.
3. Recording: each syllable listed in the database had its corresponding sound file, recorded in an ideal environment.
4. Split & Normalize: all the syllables were converted into waveforms and observed. The significant part of each wave represented the actual speech data; the other parts were background noise captured while recording. The significant parts were separated and stored in the database. The waveforms were searched for the maximum global amplitude A_max. Then the local maximum amplitude of a single speech datum, A_lmax, was found. The normalizing variable N is then N = A_max / A_lmax.
B. THE KLATTALK TEXT TO SPEECH CONVERSION SYSTEM
KLATTALK [10] is a real-time text-to-speech conversion system. The phonemic representation produced from the text is converted to speech by a synthesis-by-rule program. The rule program differs from others of this type in having an extensive set of segment duration rules and many detailed rules for the synthesis of consonant-vowel transitions.
1. Text to phoneme conversion:
- Formatting Preprocessor: the preprocessor must deal with arbitrary English text, including "nonstandard" input words such as digit strings, abbreviations, and special symbols.
- Letter-to-Phoneme Conversion: the system contains a set of heuristic stress rules expressed in the same formalism as the letter-to-phoneme rules. Unstressed function words are placed in the exceptions dictionary. The stress rules do well most of the time. However, a stress rule error often results in a pronunciation that is far from correct, and the listener may have difficulty in recovering from such an error.
- Exceptions Dictionary: includes a small set of unstressed function words such as "and", "of", "the", etc.
2. Phoneme to speech rules:
- Stress Rules: the phonological component assigns a feature STRESS (value = 0 or 1) to each phonetic segment in the output string. The default value is 0 (unstressed).
C. TEXT TO SPEECH CONVERSION TECHNOLOGY [11]
- Text Normalization: Converts everything in the text stream to letters. For example, “1990” becomes “nineteen ninety”.
- Dictionaries: The system stores a phonemic transcription of exact pronunciation in an exception dictionary.
- Letter-to-phoneme rules: Convert English spelling into phoneme transcriptions, which are a more exact representation of pronunciation.
- Prosody Rules: Create the intonation pattern, or rhythm, of sentences in the text. All text-to-speech systems have somewhat robotic prosodies because the computer, unlike human speakers, follows rigid rules.
- Phonetic Rules: The “fine tuning” of pronunciation takes place in the phonetic rules. For example, the
phoneme “t” is pronounced differently in “tom,” “atom,” and “cat.”
- Voice Generation: Turns phonemes into more than 5,000 smaller speech units, which are then converted to voice parameters.
- Interrupt Driver: Takes frames and sends them to the output hardware at regular intervals.
- Output Hardware: Various kinds of output hardware can be used for the final stages of TTS.
VI.
PROPOSED SYSTEM ARCHITECTURE
Figure 1: Block diagram of the proposed system
The MathsSays tool is not complex to build; its clear and unambiguous design makes it effective. The block diagram of the system architecture is given in Figure 1. The written material is the input to the Text Scanner block. The Text Scanner module scans the pages of the book and produces a PDF-Image. The image then passes to the next block, the Text Extractor. In this module, the PDF-Image is parsed and the text embedded in it is detected. The detected text is extracted so that it can be converted into speech. The third and final block is the Speech Synthesizer. This module converts the text into speech and, as a result, an audio file of the book is generated. The detailed procedure is as follows:
Written material is first scanned page by page. Scanning a page results in a PDF-Image. This image is then parsed to detect the text in it. Once the text is detected, it is extracted and forwarded to the conversion component. If the extracted text is a mathematical formula, the formula is looked up in the database table for its respective text, and that text is retrieved and forwarded to the conversion component instead. This completes the processing of one page.
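The formula lookup step of this procedure can be illustrated with a minimal sketch. It is written in Python for brevity and is not the authors' Java implementation. The table contents, the function names (`normalize`, `to_speakable`, `page_to_speech_text`), the whitespace-insensitive matching, and the fallback for formulae missing from the table are all assumptions made for illustration; the paper does not specify them.

```python
from typing import Dict, List

# Hypothetical formula table: formula string -> spoken English text.
# The entries are illustrative; the paper does not list the table contents.
FORMULA_TABLE: Dict[str, str] = {
    "a^2+b^2=c^2": "a squared plus b squared equals c squared",
    "E=mc^2": "E equals m c squared",
}


def normalize(formula: str) -> str:
    """Remove all whitespace so 'a^2 + b^2 = c^2' matches 'a^2+b^2=c^2'."""
    return "".join(formula.split())


def to_speakable(segment: str, table: Dict[str, str] = FORMULA_TABLE) -> str:
    """Return the table text for a known formula, else the segment unchanged.

    Falling back to the raw segment is an assumption; the paper does not say
    what happens when a formula is missing from the table.
    """
    return table.get(normalize(segment), segment)


def page_to_speech_text(segments: List[str]) -> str:
    """Join the speakable form of every extracted segment of one page."""
    return " ".join(to_speakable(segment) for segment in segments)
```

Here each page is assumed to arrive from the extractor as a list of text segments. The string returned by `page_to_speech_text` is what would be handed to the speech synthesizer.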
Repeating this procedure for every page processes the whole book and generates the audio file. The MathsSays app is developed in the Eclipse editor with the Android plug-in. The source code is written in core Java. The app is tested on the emulator provided with the Eclipse editor. The challenges likely to arise in this work were anticipated in advance. The most difficult one arises when a formula is encountered. The remedy is to maintain a table in the database that holds each formula and its respective text. When a formula is encountered, the table is searched and the respective text for that formula is retrieved and spoken aloud.
VII.
CONCLUSION
The proposed approach is novel in the field of text-to-speech conversion for mathematics. Research papers on TTS technology were surveyed rigorously and the basic principle behind TTS was identified: a TTS system takes text as input, converts the graphemes into phonemes and finally converts those phonemes into speech. The proposed tool, MathsSays, scans the written material as PDF, extracts the text from it and speaks it aloud.
In future, this stand-alone application can be made globally available using cloud computing technology. The system can also be made dynamic: runtime TTS conversion of mathematical formulae would be a significant improvement.
REFERENCES
[1] D. Sasirekha, E. Chandra, “Text to Speech: A Simple Tutorial,” International Journal of Soft Computing and Engineering (IJSCE), ISSN: 2231-2307, Volume-2, Issue-1, March 2012.
[2] Azadeh Nazemi, Iain Murray and Nazanin Mohammadi, Electrical and Computer Engineering Department, Curtin University, Perth, Western Australia, “Mathspeak: An Audio Method for Presenting Mathematical Formulae to Blind Students”, 5th International Conference on Human System Interactions, © 2012 IEEE.
[3] Wararat Wongkia, Kanlaya Naruedomkul, Mahidol University, Bangkok, Thailand; Nick Cercone, York University, Toronto, Canada, “Better Access to Math for Visually Impaired”, TIC-STH, 2009 IEEE.
[4] Information about the tool available at http://www.daisy.org, accessed on 26 December 2013.
[5] Toshihiro Kanhori, Tsukuba University of Technology, Japan; Masakazu Suzuki, Fukuoka, Japan, “Refinement of digitized documents through recognition of mathematical formulae”, Proceedings of the Second International Conference on Document Image Analysis for Libraries, © 2006 IEEE.
[6] Xuedong Tian, Liping Zhang, Haiyan Li and Qingxuan Shi, Hebei University, China, “Research on Mathematical Formulas Extraction from Chinese Document”, Proceedings of the 6th World Congress on Intelligent Control and Automation, Dalian, China, June 21 - 23, 2006.
[7] Victor Wu, Raghavan Manmatha, Member, IEEE, and Edward M. Riseman, Sr. Member, IEEE, “TextFinder: An Automatic System to Detect and Recognize Text In Images”, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 21, no. 11, November 1999.
[8] Julinda Gllavata, Ralph Ewerth and Bernd Freisleben, University of Marburg, Germany, “A Robust Algorithm for Text Detection in Images”.
[9] Farig Yousuf Sadeque, Samin Yasar, Md. Monirul Islam, Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology (BUET), Dhaka-1000, Bangladesh, “Bangla Text to Speech Conversion: a Syllabic Unit Selection Approach”, © 2013 IEEE, 978-1-4799-0400-6/13/$31.00.
[10] Dennis H. Klatt, Box 169, MIT Branch Post Office, Cambridge, MA 02139 (Also at Mass. Inst. of Tech.), “The KLATTALK Text to Speech conversion system”, © 1982 IEEE, CH 1746-7/82/0000 - 1589 $ 00.75.
[11] Michael H. O’Malley, Berkeley Speech Technologies, “Text-To-Speech Conversion Technology”, IEEE, August 1990.