Automatic Acoustic Speech Segmentation in Praat Using Cloud Based ASR

Matúš Pleva, Jozef Juhár
Department of Electronics and Multimedia Communications, Technical University of Košice, Letná 9, Košice, Slovakia
Email: [email protected], [email protected]

Andrew Simon Thiessen
Department of Computers and Informatics, Technical University of Košice, Letná 9, Košice, Slovakia
Email: [email protected]
Abstract—The program Praat is a versatile speech analysis program which has become a standard in the speech/language analysis scientific community. Automatic speech segmentation has been developed and used in Praat via a plugin called EasyAlign, which requires an input text (separated sentences) and clean audio recordings with quiet pauses between the sentences in order to operate. In this paper, new functions are evaluated and implemented in Praat that improve the automatic speech segmentation process and enable the processing of noisy audio recordings with no input texts available. One of them also uses our own cloud-based ASR (Automatic Speech Recognition) solution. The results of automatic speech recognition and of different audio processing steps (noise gate, de-noiser, normalization) were added to the Praat system to expand the EasyAlign functionality and improve the resulting automatic speech segmentation. The graphical output of Praat also helps to visualize the automatic speech segmentation and automatic speech recognition results for educational purposes and for subjective evaluation by system developers.

Keywords—Praat, EasyAlign, automatic speech segmentation, automatic speech recognition, audio processing, TextGrid.
I. INTRODUCTION
Speech segmentation is a vital sub-problem of the automatic speech recognition process. It divides speech into smaller segments, making further processing easier and more precise. There is also the constraint that a segment should not exceed a fixed limit (approximately 60 seconds, depending on the compilation options of the ASR server engine [1] in the cloud solution), so the control of the segmentation process and the fine-tuning of the algorithms (segment boundaries should not fall inside speech utterances) play a crucial role. Praat [2] has large potential in the field of automatic speech segmentation because of its many tools, diverse uses and programmability. For example, visualization of the LD-CELP codec using SoX and Praat is presented in [3], and [4] describes integrating HTK, WEKA and Praat; however, a local ASR is more complicated to maintain, and not all of its resources may be publicly available. This work examines and implements new approaches, or alters existing ones, to improve speech segmentation in Praat. New functions are implemented into Praat, using Praat scripts, to control, improve and expand speech processing methods. The implemented areas include speech segmentation, automatic speech recognition and audio processing.
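To illustrate the segment-length constraint, the following Praat-script sketch uses Praat's built-in silence detection to check whether any pause-delimited speech stretch exceeds an assumed 60-second limit. The threshold values, the limit and the report-only behaviour are illustrative assumptions, not the actual chunking used by the cloud solution.

# Sketch: report pause-delimited speech stretches longer than an assumed limit.
# Assumes a Sound object is selected; all numeric parameters are illustrative.
maxChunk = 60
sound = selected ("Sound")
selectObject: sound
# mark silent/sounding stretches (100 Hz minimum pitch, -25 dB silence threshold)
silences = To TextGrid (silences): 100, 0, -25, 0.3, 0.1, "silent", "sounding"
n = Get number of intervals: 1
chunkStart = Get starting point: 1, 1
for i to n
    label$ = Get label of interval: 1, i
    tEnd = Get end point: 1, i
    if label$ = "silent" or i = n
        if tEnd - chunkStart > maxChunk
            # a real script would cut the chunk at the previous pause; here we only report it
            appendInfoLine: "Speech stretch starting at ", chunkStart, " s exceeds ", maxChunk, " s"
        endif
        chunkStart = tEnd
    endif
endfor
removeObject: silences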
The theme of speech segmentation is an important one at a time when automatic processing of speech by computers is becoming more and more integrated into the everyday use of computers, tablets, smartphones, etc. TV broadcasters are also widely interested in automatic transcription of TV shows (broadcast news, discussions, etc.), because EU government regulations usually specify a minimal amount of shows with hidden subtitles for hearing-impaired viewers [5], [6]. Automatic speech segmentation is widely used in such solutions and is an important pre-processing part of audio stream analysis [7]. The Praat scripts in this paper were programmed using Praat's own scripting language. All of the software plugins and tools used in this paper are open source and can be freely obtained online, except for the access to our cloud-based ASR solution [8].
II. GOALS
The main goals of this paper are as follows. First, to analyze the current state of acoustic speech segmentation, which serves as the background research for this work. It then analyzes the different programming possibilities of the program Praat and adds new functionality, such as reducing noise in recordings or dynamically expanding recordings, using external tools. This paper also documents the work with objects and scripts in Praat and adds new TextGrids to visualize results from an external automatic speech recognition module. Further, it creates a version of the automatic speech segmentation application, both for speech segmentation with and without an input scripted text, and then compares the results with automatic speech recognition results in Slovak. Last, it analyzes and creates the possibility to process multiple audio input files in the program Praat. The results are analyzed from an educational and subjective-evaluation point of view.
III. ANALYSIS
Speech segmentation (or speech alignment) is the division of speech, written and spoken, into its smaller lexical and phonetic parts. This is done by dividing sentences into words, words into syllables and syllables into phones. Phonetic segmentation (or phonetic alignment) determines the positions and order of the different phones and syllables of speech. The audio and the written (orthographic) transcription of the speech are used to match and align the two together [9].

To improve Praat's speech segmentation process, the different aspects of speech must be considered. The basic way to improve speech segmentation in Praat is to examine speech recognition, audio quality and the speech segmentation itself within Praat. Audio quality is very important for speech segmentation: the segmentation analyzes different frequency and amplitude characteristics of speech recordings to determine the location of speech particles, and bad recordings (noisy, no pauses between speech utterances, music in the background, etc.) can make the results inaccurate and unreliable [10]. There is no way to guarantee audio quality, because of the different audio equipment used for recording speech on mobile devices. This results in a need for post-processing of the audio recordings after they have been recorded; if done right, it should improve the results of the automatic speech segmentation process in Praat. Such an audio processing function could be added to Praat. There are many different audio tools that could be used to improve audio quality; the most effective are a noise gate, a de-noiser and a normalizer. All of them deal with the dynamic properties and noise reduction of audio, which are the basis of automatic speech analysis [11].

Automatic speech recognition could be added to Praat to enable a side-by-side comparison of results and to enable automatic speech segmentation without a text input file. For normal speech segmentation in EasyAlign, which is a Praat plugin for speech segmentation, both an audio and a text input file are required. Integrating automatic speech recognition into the system allows the user to process audio for which no text transcript exists [12]. It would also be useful to process multiple audio speech files automatically, to make the whole process faster and easier. This would allow many audio files to be analyzed one by one without the user having to go through each process individually, which can take quite long in some circumstances.

To improve any area of Praat, the Praat scripting language must be used to create new Praat scripts or modify existing ones. Praat runs these scripts, which control Praat and other external programs that work together with it. The scripts can be executed from the Praat menu, directly from the Praat interface [13]. The backbone of automatic speech segmentation in this paper is the software plugin EasyAlign. It is a three-step speech segmentation process in Praat that uses an audio file and a text input file of the speech to create a Praat TextGrid with the resulting speech segmentation. All of the new functions should be closely integrated with EasyAlign within Praat [14].
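As an illustration of how such scripts can be exposed in the Praat interface, the sketch below shows a typical setup.praat of a plugin folder that registers a script as a menu command. The plugin folder name, menu placement, command title and script file are hypothetical examples, not the actual files of the EasyAlign extension described here.

# setup.praat of a hypothetical plugin folder (e.g. a "plugin_asrsegment" directory
# inside the Praat preferences directory). Praat runs every plugin's setup.praat at
# start-up, so the command registered below appears each time Praat is opened.
# Arguments: window, menu, command title, "after" anchor, depth, script to run.
Add menu command: "Objects", "New", "Recognize and segment speech...", "", 0, "recognize_segment.praat"

After restarting Praat, the new command would appear in the chosen menu of the Objects window and run the named script when selected.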
IV. PROPOSED SOLUTIONS AND RESULTS
The implementation of new functions in Praat to improve the automatic speech segmentation process was accomplished by integrating a few external programs into Praat. Four new functions were created in Praat (some of them calling external programs): a new speech recognition step in the original EasyAlign speech segmentation process (for visual comparison of the input text and the ASR output); a new Recognition and Segmentation function, which can process audio without a text input (only the speech audio recording is needed); a function that passes multiple audio files through the Recognition and Segmentation function (several audio recordings transcribed automatically offline); and an audio enhancement function (improving the basic EasyAlign macro segmentation based on the input text). All of the newly implemented functions can be executed from the EasyAlign menu in Praat.

Fig. 1. Process model of the Recognition function.

Fig. 2. Praat TextGrid of the Recognition function results.
A. Implementing Recognition Function

A new Recognition step was implemented into the existing EasyAlign speech segmentation process. It enables the user to run the speech through an ASR recognition server (provided by the cloud solution) and obtain the results of the automatic speech recognition. The user can then compare the two text variants side by side and make a subjective evaluation of the ASR engine configuration and its results while tuning the ASR solution, or for presentation and educational purposes. The recognized text is obtained by sending the audio to a speech recognition server for processing. The ortho tier of the segmentation process is sent to the server, and each interval of the tier is processed individually. The result is then sent back to Praat and inserted into a new recognized tier. The model of the whole process is shown in Fig. 1. The output is a Praat TextGrid, which shows the segmented and recognized text together with the audio. The Recognition function adds the recognized tier into the TextGrid; this TextGrid is depicted in Fig. 2.
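The sketch below illustrates this per-interval flow in Praat's scripting language. The way the audio is handed to the recognizer is an assumption here: the HTTP endpoint, the curl upload format and the single-line response file are placeholders for illustration, and the actual interface of the cloud ASR server is not shown.

# Sketch: send each interval of the ortho tier to a recognizer and store the
# hypotheses in a new "recognized" tier. The server call is a placeholder.
# Assumes a Sound and its TextGrid (ortho tier = tier 1) are both selected.
sound = selected ("Sound")
grid = selected ("TextGrid")
orthoTier = 1
selectObject: grid
# copy the ortho tier so the new tier has identical interval boundaries
Duplicate tier: orthoTier, orthoTier + 1, "recognized"
n = Get number of intervals: orthoTier
for i to n
    selectObject: grid
    start = Get starting point: orthoTier, i
    finish = Get end point: orthoTier, i
    label$ = Get label of interval: orthoTier, i
    if label$ <> ""
        # cut out the interval and save it as a temporary WAV file
        selectObject: sound
        part = Extract part: start, finish, "rectangular", 1, "no"
        Save as WAV file: "chunk.wav"
        removeObject: part
        # placeholder for the cloud ASR request (assumed HTTP endpoint)
        runSystem: "curl -s -F audio=@chunk.wav http://asr.example.org/recognize > hyp.txt"
        hyp = Read Strings from raw text file: "hyp.txt"
        hyp$ = Get string: 1
        removeObject: hyp
        selectObject: grid
        Set interval text: orthoTier + 1, i, hyp$
    endif
endfor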
Fig. 3. Process model of the Recognition and Segmentation function.

Fig. 4. Praat TextGrid of the Recognition and Segmentation function results.
B. Implementing Recognition and Segmentation Function
The Recognition and Segmentation function was implemented to allow the user to process speech audio that does not have a text input file. A model of the function is shown in Fig. 3. The basic EasyAlign plugin was developed for language analysis, so a text input (each sentence on a separate line) was necessary, because no ASR function was available in Praat. A stable connection between Praat and the external cloud ASR solution opens new opportunities for automatic speech recognition system designers. The automatic transcription of speech recordings is performed by Praat communicating with the ASR server used in the Recognition function, described in more detail in [15]. The audio is sent to the server after it has gone through a basic segmentation process. This basic speech segmentation divides the speech into smaller portions so the server can better process the audio, and it avoids the problem of processing large segments with the ASR engine used (a segment cannot be longer than 80 seconds). The segmentation program used is called DISTBIC.exe [16] and is an external program executed within the Praat script. After the segmentation, the audio is sent to the ASR recognition server with a language model adapted to broadcast news (the most general texts available) [17] for processing. The server returns a log file containing the transcript of the speech. The text is then parsed and inserted into TextGrid segments. At the beginning of the process there is an option to add a tier into the TextGrid that contains the words recognized by the server at their positions (word-based forced alignment); the data needed can be obtained from the log file (the ASR engine's forced alignment results). The outputs of the function are the TextGrid, seen in Fig. 4, and a folder containing text files with the recognized speech and a .srt subtitle file. The .srt file is generated automatically so that the ASR results can be used in an external player or in further processing, as sketched below.
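As an example of this last step, the following sketch writes a minimal .srt subtitle file from the intervals of a recognized tier. The tier number, the file name and the simplified time formatting are assumptions for illustration, not the exact export routine of the implemented function.

# Sketch: write the recognized tier of a selected TextGrid as a minimal .srt file.
# Assumes a TextGrid is selected and tier 1 holds the recognized utterances.
tier = 1
outFile$ = "output.srt"
writeFile: outFile$, ""
index = 0
n = Get number of intervals: tier
for i to n
    text$ = Get label of interval: tier, i
    if text$ <> ""
        index = index + 1
        start = Get starting point: tier, i
        finish = Get end point: tier, i
        @srtTime: start
        from$ = srtTime.result$
        @srtTime: finish
        to$ = srtTime.result$
        appendFileLine: outFile$, index
        appendFileLine: outFile$, from$, " --> ", to$
        appendFileLine: outFile$, text$
        appendFileLine: outFile$, ""
    endif
endfor

# Convert a time in seconds to the simplified HH:MM:SS,mmm stamp used in .srt files.
procedure srtTime: .t
    .h = floor (.t / 3600)
    .m = floor ((.t - .h * 3600) / 60)
    .s = floor (.t) - .h * 3600 - .m * 60
    .ms = round ((.t - floor (.t)) * 1000)
    .result$ = right$ ("0" + string$ (.h), 2) + ":" + right$ ("0" + string$ (.m), 2)
    ... + ":" + right$ ("0" + string$ (.s), 2) + "," + right$ ("00" + string$ (.ms), 3)
endproc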
C. Implementing Audio Enhancement Function
An audio enhancement function was implemented to allow the user to enhance the audio quality directly in Praat, which also results in better basic EasyAlign macro-segmentation of speech with an input text. The three audio processing tools chosen for the function were noise gating, de-noising and normalization. These audio processing functions can help the energy-based algorithm of the EasyAlign plugin work more reliably when there is noise between the sentences. The model is shown in Fig. 5.
Fig. 5. Process model of the SoX function.
The noise gate and de-noiser were implemented using an external audio processing program called Sound eXchange (SoX), which is executed from within the Praat script. The noise gate eliminates any sound in the recording below a defined threshold value, which helps remove sounds that interfere with the speech. The de-noiser removes selected noise characteristics from the recording that the user defines as noise; the user marks the noise in the Praat editor before the function is run, so in this case the process cannot run autonomously and the noise segment has to be selected for each audio file. The implemented normalization uses Praat's Scale peak function to adjust the maximum volume of the recording, which lets the user adjust the volume without worrying about distorting the audio. When running the function, the user chooses which tool to use in the SoX menu, seen in Fig. 6, together with the settings of each tool, and then executes it. The result is a new sound object in the Praat object list containing the processed audio.
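The command lines below sketch how these tools can be chained from a Praat script via runSystem. The file names and all SoX effect parameters (the compand settings acting as a gate, and the noisered reduction amount) are illustrative values, not the settings used in the implemented function, and SoX must be installed and available on the system path.

# Sketch: audio clean-up chain driven from a Praat script.
# File names and all SoX effect parameters are illustrative only.

# 1) Noise gate: a SoX "compand" with a steep transfer curve mutes everything
#    below roughly -50 dB (a common way to build a gate with SoX).
runSystem: "sox input.wav gated.wav compand .1,.2 -inf,-50.1,-inf,-50,-50 0 -90 .1"

# 2) De-noiser: build a noise profile from a user-selected noise-only stretch,
#    then subtract it from the gated recording (0.21 = reduction amount).
runSystem: "sox noise_only.wav -n noiseprof noise.prof"
runSystem: "sox gated.wav denoised.wav noisered noise.prof 0.21"

# 3) Normalization inside Praat: read the result back and scale its peak.
sound = Read from file: "denoised.wav"
Scale peak: 0.99
Save as WAV file: "enhanced.wav"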
Fig. 6. SoX menu.

V. CONCLUSION
The goal of this work, improving the program Praat for automatic speech segmentation, was accomplished using several different methods and approaches (see Table I; the original inclusion of Slovak into EasyAlign is described in [18]). By combining the fields of speech segmentation, cloud-based automatic speech recognition and audio enhancement, this work created a new modified version of Praat, and of the speech segmentation plugin EasyAlign, that is efficient in showing and analyzing the automatic speech segmentation process and its results.
TABLE I. PRAAT AUTOMATIC SPEECH SEGMENTATION FUNCTIONALITIES OF THE EASYALIGN PLUGIN AND THE NEW FUNCTIONS IMPLEMENTED: COMPARISON.

Solutions                        | Reference input text needed | Subjective quality of the segmentation | Visual transcription comparison
EasyAlign                        | Yes                         | Medium                                 | No
EasyAlign + audio enh.           | Yes                         | Higher                                 | No
EasyAlign + ASR cloud            | Yes                         | Higher                                 | Yes
DistBIC + ASR cloud              | No                          | Higher                                 | No
DistBIC + ASR cloud + Input Text | Yes                         | Higher                                 | Yes
There are many uses for the new and modified parts of Praat and EasyAlign created for automatic speech segmentation. They are most useful as a visual and analytical tool that shows the process and results of automatic speech segmentation alongside automatic speech recognition. This is especially useful for comparing automatic speech recognition results depending on the automatic speech segmentation pre-processing of the audio stream. The audio processing options that were implemented improved the speech segmentation results and expanded the range of usable recordings. The Praat scripts created are easily extensible, leaving further research and development of Praat with many future opportunities and uses. Finally, the solutions can be used to present all of the mentioned tools and functions for educational, dissemination and publicity purposes, and for the developers' subjective evaluation of the results during further improvements of the cloud-based ASR solution.
ACKNOWLEDGMENT

This publication is supported partially (50%) by the project University Science Park TECHNICOM for Innovation Applications Supported by Knowledge Technology, ITMS: 26220220182, supported by the Research & Development Operational Programme funded by the ERDF, and partially (50%) by the ITMS-26220220141 project.
REFERENCES

[1] M. Lojka, S. Ondas, M. Pleva, J. Juhar: Multi-thread parallel speech recognition for mobile applications. Journal of Electrical and Electronics Engineering, vol. 7, no. 1, pp. 81-86, 2014. ISSN 1844-6035. Available at: http://electroinf.uoradea.ro/index.php/cercetare/reviste/jeee/13-cercetare/reviste/jeee/120-vol-7-nr-1-may-2014.html
[2] P. Boersma, D. Weenink: Praat: Doing Phonetics by Computer, Version 4.0, p. 26, 2002. http://www.praat.org/
[3] L. R. Mathew, A. S. Anselam, S. S. Pillai: Analysis of LD-CELP coder output with Sound eXchange and Praat software. Proceedings of the 2014 IEEE International Conference on Advanced Communication, Control and Computing Technologies (ICACCCT), 2014, art. no. 7019305, pp. 1281-1285.
[4] S. T. Christie, S. Pakhomov: Prosody toolkit: Integrating HTK, Praat and WEKA. Proceedings of INTERSPEECH 2011, Florence, pp. 3321-3322.
[5] Accessible information television broadcasts. Indicators on political participation of persons with disabilities, EC report, 2014, p. 23. Available at: http://fra.europa.eu/sites/default/files/accessible-information-television-broadcasts en.docx
[6] J. Neves: Audiovisual Translation: Subtitling for the Deaf and Hard-of-Hearing. PhD thesis, Roehampton University, 2005, p. 358. Available at: http://roehampton.openrepository.com/roehampton/bitstream/10142/12580/1/neves%20audiovisual.pdf
[7] R. Gubka, M. Kuba, R. Jarina: Universal approach for sequential audio pattern search. 2013 Federated Conference on Computer Science and Information Systems (FedCSIS), 2013, art. no. 6644057, pp. 565-569.
[8] T. Koctur, M. Pleva, J. Juhar: Interface to intelligent audiovisual archive (in Slovak). Electrical Engineering and Informatics V, FEI TUKE, 2014, pp. 774-779. ISBN 978-80-553-1704-5.
[9] J. P. Goldman: EasyAlign: An Automatic Phonetic Alignment Tool Under Praat. Proceedings of INTERSPEECH 2011, Florence, pp. 3233-3236. Available at: http://latlntic.unige.ch/phonetique/easyalign/goldman interspeech2011 easyalign.pdf
[10] B. Edstrom: Recording on a Budget. Oxford University Press, 2011, p. 54. ISBN 978-0-19-539041-4.
[11] B. Drury: Killer Home Recording: Setting Up. RecordingReview.com, 2009, pp. 4-40.
[12] K. F. Lee: Automatic Speech Recognition: The Development of the SPHINX System. Kluwer Academic Publishers, 1989, fourth printing 1999, p. 8, SECS62. ISBN 0-89838-296-3.
[13] W. Styler: Using Praat for Linguistic Research. University of Colorado at Boulder Phonetics Lab, document version 1.4.2, Jan 16, 2014. Available at: http://savethevowels.org/praat/UsingPraatforLinguisticResearchLatest.pdf
[14] J. P. Goldman: Tutorial on EasyAlign. University of Geneva, p. 4. Available at: http://latlntic.unige.ch/phonetique/easyalign/tutorial easyalign en.php
[15] S. Ondas, M. Pleva, M. Lojka, M. Sulir, J. Juhar: Server-based speech technologies for mobile robotic applications. Journal of Electrical and Electronics Engineering, vol. 6, no. 1, pp. 95-98, May 2013. ISSN 1844-6035. Available at: http://electroinf.uoradea.ro/images/articles/CERCETARE/Reviste/JEEE/JEEE V6 N1 MAY 2013/Ondas may2013.pdf
[16] J. Zibert et al.: COST278 broadcast news segmentation and speaker clustering evaluation. Proceedings of INTERSPEECH 2005, Bonn, pp. 629-632.
[17] M. Pleva, J. Juhar: TUKE-BNews-SK: Slovak Broadcast News Corpus Construction and Evaluation. Proceedings of LREC 2014, Reykjavik, Iceland, May 2014, pp. 1709-1713. Available at: http://www.lrec-conf.org/proceedings/lrec2014/pdf/680 Paper.pdf
[18] J. Knap: Automatic speech segmentation. Master's thesis (in Slovak), Technical University of Kosice, 2011, p. 46.