Speech Recognition for Subtitling Japanese Live Broadcasts

Toru Imai, Akio Kobayashi, Shoei Sato, Shinichi Homma, Kazuo Onoe, Takeshi S. Kobayakawa
Science and Technical Research Laboratories, NHK (Japan Broadcasting Corp.)
[email protected]

Abstract

There is a great need for more TV programs to be subtitled to help hearing-impaired and elderly people watch TV. NHK has been researching automatic speech recognition for efficiently subtitling live TV programs in real time. Our speech recognition system learns frequent words and expressions expected in a program beforehand and also learns the characteristics of announcers' voices in order to reduce recognition errors. It periodically outputs recognition results while the announcer is still speaking, which shortens the delay before the text is displayed on the screen. NHK has been using speech recognition to subtitle some of its news, sports, and music shows. In news programs, speech read by an announcer in a quiet studio is automatically recognized and errors are immediately corrected by hand. Live TV programs, e.g., the Winter Olympic Games, the World Cup Football Games, and the Grand Sumo Tournaments, have been subtitled using a re-speak method, in which an announcer listens to the program content and rephrases it. The method improves the recognition accuracy and makes the subtitles easier to read, because it can generate subtitles for programs with high background noise and also allows summarization and paraphrasing of the original speakers' words.

1. Introduction

There is a great need for simultaneous subtitling of live broadcast programs for the hearing impaired and the elderly. Although a special keyboard can be used for real-time captioning in English, it is rather difficult to input Japanese characters by keyboard fast enough to keep up with the speech because of the great number of homonyms among Kanji ideograms. Therefore, NHK (Nippon Hoso Kyokai; Japan Broadcasting Corp.) has undertaken extensive research in automatic speech recognition for subtitling live TV programs in real time. Our studies are in contrast to the studies sponsored by DARPA (Defense Advanced Research Projects Agency) [1], which transcribe broadcast news in batch processing for information retrieval. Since March 2000, NHK has been subtitling broadcast news every day by using a Japanese continuous speech recognition system we developed [2]. So far, subtitling of this sort has been limited to portions of programs where an anchorperson reads in a quiet studio.


Ours was the world's first implementation of simultaneous subtitling with speech recognition. The subtitling system has the following features. Errors in the transcribed text are immediately corrected by hand with a touch panel whenever they occur. The system's language model (Time Dependent Language Model [3]) learns frequent words and expressions expected in the program beforehand in order to reduce recognition errors. The system outputs recognition results with low latency, even while the announcer is still reading a sentence, in order to shorten the delay in displaying the text at the bottom of the screen [4]. The recognition accuracy for anchorpersons' prepared speech in the studio is 98%, and the delay of the subtitles behind the speech is about 10 seconds (after correction).

Besides news programs, subtitling of other live programs, such as music or sports programs, would also be helpful to our viewers. The commentary of such programs is spoken spontaneously and emotionally, and speakers sometimes talk simultaneously. It is very difficult for current speech recognition technology to recognize such commentary with a sufficient degree of accuracy. Therefore, we use the "re-speak" method, in which a speaker listening to the original speech of the program rephrases the commentary so that it can be recognized for subtitling [5]. This speaker is in a quiet studio, not in the field, stadium, or hall where the broadcast originates. Since an acoustic model can be adapted to the re-speaker in advance and there is no background noise to consider, the recognition accuracy is much higher than when recognizing the original narrations. The method not only improves the recognition accuracy but also makes the subtitles easier to read, because it allows summarization and paraphrasing. Live music shows and sports programs have been subtitled with our speech recognition system using the re-speak method. The current recognition accuracy is approximately 95%. Any recognition error is promptly corrected manually so that the subtitles can be presented within 5 to 8 seconds. We have received favorable comments from people with hearing impairments, to the effect that this service has enhanced their enjoyment and increased the amount of information they obtain from watching TV.

In this paper, we describe our simultaneous subtitling system for news in Section 2 and the re-speak method for other live programs in Section 3.


Figure 1: Broadcast news subtitling system (blocks: electronic news scripts, printing and revising, fax or memos, training, scripts for reading, anchor's speech, speech recognition, confirmation and correction, texts, caption encoder, transmission, caption decoder).

2. Subtitling Broadcast News

NHK started simultaneous subtitled broadcasting for the evening news program “News 7” on March 27, 2000. Subtitling was later extended to “News 9” in the evening and “News at Noon”. Speech recognition is currently used for the portions where the anchorperson reads in a quiet studio; subtitles for other portions, such as field reports and spontaneous conversations, are produced by keyboard or from prepared electronic scripts. The speech recognition is followed by manual error correction (Figure 1). The anchorperson reads manuscripts which have been printed and revised from the original electronic news scripts, or which may be faxes or memos. The output of the anchorperson's microphone, free of background noise and music, is fed directly into the speech recognizer, and the speech is transcribed into Japanese Kanji characters in real time. The error correction is done in real time by four persons listening to the news. The corrected texts are encoded into the television signal and transmitted to viewers, who can view the texts on the screen with a caption decoder. Typically the texts are 15 characters per line, and two lines are displayed at the bottom of the screen every four seconds.
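
As a rough illustration of the caption page format just described (15 characters per line, two lines per page, updated about every four seconds), the following Python sketch breaks a corrected transcript into caption pages. The function name and the character-count-based wrapping are our own illustrative assumptions, not part of NHK's captioning system.

# Illustrative only: split a corrected transcript into caption pages of
# two 15-character lines, matching the display format described above.
def paginate_caption(text: str, chars_per_line: int = 15, lines_per_page: int = 2):
    lines = [text[i:i + chars_per_line] for i in range(0, len(text), chars_per_line)]
    return [lines[i:i + lines_per_page] for i in range(0, len(lines), lines_per_page)]

# Each returned page would be displayed for roughly four seconds.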

2.1. Transcribing Broadcast News

The speech recognition is based on statistical methods, with acoustic and language models trained from a large broadcast news database. A search engine outputs the word sequence with the highest score for the input speech. The details are as follows.

2.1.1. Acoustic models

The speech data are acoustically analyzed into 39 parameters (12 MFCCs with log-power and their first- and second-order regression coefficients) every 10 msec, after digitization at 16 kHz and 16 bits, with a Hamming window of 25-msec width. The acoustic models are gender-dependent, speaker-independent, state-clustered mixture triphone HMMs and were trained on broadcast news speech from NHK's announcers. Since the acoustic models are gender-dependent, we use separate recognizers for males and females.
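
The following is a minimal sketch of such a 39-dimensional front end, written in Python with librosa purely for illustration; the library choice, the FFT size, and the use of the 0th cepstral coefficient as a stand-in for the log-power term are our own assumptions rather than details of NHK's system.

import librosa
import numpy as np

def acoustic_features(wav_path):
    # 16-kHz audio, 25-msec Hamming window, 10-msec frame shift (as above).
    y, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(
        y=y, sr=sr,
        n_mfcc=13,                      # 12 cepstra plus a log-energy-like 0th term
        n_fft=512,                      # assumed FFT size
        win_length=int(0.025 * sr),
        hop_length=int(0.010 * sr),
        window="hamming",
    )
    d1 = librosa.feature.delta(mfcc, order=1)   # first-order regression coefficients
    d2 = librosa.feature.delta(mfcc, order=2)   # second-order regression coefficients
    return np.vstack([mfcc, d1, d2]).T          # one 39-dimensional vector per frame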

2.1.2. Language models

A bi-gram model is used for the first pass of the search and a tri-gram model for the second pass. The language models were trained on NHK news scripts extending back over 12 years. The unit of the model is a word, as given by a morphological analyzer [6]. To adapt the model to the latest broadcast news and capture new vocabulary and expressions, recent news scripts written within the last six hours are added to the long-term news script texts with a high weight. We call the adapted model a Time Dependent Language Model (TDLM) [3]. It can greatly reduce the perplexity and the OOV (out-of-vocabulary) rate. In everyday operation, a TDLM with a 20K-word vocabulary is created five minutes before the news show.
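
As an illustration of this kind of count-level weighting, the sketch below pools bigram counts from long-term and recent scripts, multiplying the recent counts by a weight. The weight value, the bigram-only scope, and the function names are illustrative assumptions, not NHK's actual training procedure.

from collections import Counter

def bigram_counts(sentences):
    # sentences: iterable of word lists produced by a morphological analyzer
    counts = Counter()
    for words in sentences:
        for w1, w2 in zip(words, words[1:]):
            counts[(w1, w2)] += 1
    return counts

def time_dependent_counts(long_term_sentences, recent_sentences, recent_weight=20):
    # Pool the counts, giving the freshest scripts extra weight so that new words
    # and expressions expected in today's program are well represented.
    mixed = Counter(bigram_counts(long_term_sentences))
    for bigram, c in bigram_counts(recent_sentences).items():
        mixed[bigram] += recent_weight * c
    return mixed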

2.1.3. Search engine

The search engine that finds the best word sequence for the input speech runs in two passes. The first pass, using the acoustic models and the bi-gram language model, hypothesizes the N-best sentences. It is based on a Viterbi search with a single static tree lexicon. In the second pass, the N-best sentences are rescored with the tri-gram language model. The sentence with the highest score, which is the product of the acoustic and language scores given by the models, is taken as the recognition result for the input speech.

In order to output recognition results for subtitles as quickly as possible, the decoder makes early decisions without waiting for the end of the input utterance [4]. The low-latency decoder is based on the two-pass search described above. During the first pass, the decoder periodically executes the second pass, which rescores the partial N-best word sequences obtained up to that time. If the rescored best word sequence has words in common with the previous one, those words are regarded as part of the final result. This method is not theoretically optimal, but it gives a quick response with a negligible increase in word errors.
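
The sketch below illustrates the two ideas just described, second-pass N-best rescoring and the commit-on-agreement rule, in Python. The log-domain scoring, the language-model weight, the prefix-based agreement check, and the function names are simplifying assumptions of ours rather than the actual decoder.

def rescore_nbest(nbest, trigram_logprob, lm_weight=10.0):
    # nbest: list of (word_sequence, acoustic_log_score) from the first pass.
    # The product of acoustic and language scores becomes a sum in the log domain.
    def total_score(hyp):
        words, acoustic = hyp
        return acoustic + lm_weight * trigram_logprob(words)
    return max(nbest, key=total_score)[0]

def committed_prefix(previous_best, current_best):
    # Words on which two consecutive partial rescoring passes agree, counted from
    # the start, are treated as final and can be sent on for correction and display.
    common = []
    for prev_word, cur_word in zip(previous_best, current_best):
        if prev_word != cur_word:
            break
        common.append(cur_word)
    return common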


2.2. Correcting Recognition Errors

It is very difficult to achieve errorless speech recognition. Consequently, if the simultaneous subtitles are to be error-free, it is essential for humans to detect and correct the errors in the recognition results in real time. In order to keep the delay as short as possible, we use two correction sets that operate sentence by sentence, with sentences segmented automatically at pauses in the speech. Each set has two operators: one detects and points out incorrect words with a touch panel, and the other corrects the identified words with a keyboard. The four operators must concentrate on their jobs throughout the news while the anchorperson is speaking.

2.3. Performance

The word accuracy of the speech recognition for the anchorperson's prepared speech in the studio is 98% on average, with a delay of 1 to 2 seconds. It takes 4 to 5 seconds for the operators to confirm and correct the results. The accuracy after correction exceeds 99%. The remaining errors are particles or trivial mistakes that are unlikely to affect the meaning conveyed to the viewer. Since the subtitled page is updated every 4 seconds, the total delay from the moment the speech is uttered is approximately 10 seconds.

3. Re-Speak Method for Live Programs

3.1. Re-Speak Method

The commentaries and conversations in live TV programs of music or sports are often spontaneous and emotional, and sometimes different speakers speak at the same time. If such utterances are fed directly into a speech recognizer, its output will not be accurate enough for subtitling. The reasons are background noise, unspecified speakers, and speaking styles that may not match the acoustic and language models. It is also difficult to collect enough training data (audio and text) in the same domain as the target program. Therefore, we employ the re-speak method to eliminate such problems. In the re-speak method, a speaker other than the original speakers of the target program carefully rephrases what he or she hears. We call this person the "re-speaker". The re-speaker wears headphones (Fig. 2), listens to the original soundtrack of the live TV program, and paraphrases what he or she hears where needed, so that the meaning is clearer or more acceptable than the original's and the expression is more easily recognized. The method has the following advantages for speech recognition.

3.1.1. Acoustic advantages

The re-spoken utterances have no background noise. Since a single re-speaker repeats the speech of all the speakers in a program, the speech does not overlap. The re-speaker is known in advance, so the acoustic models can be adapted before the program with a relatively large amount of adaptation data. The re-speaker does not speak emotionally but clearly and calmly, without repeating the filled pauses and hesitations in the original sound. If a recognition error occurs, the re-speaker says the same phrase again or tries a different phrase. The re-speaker can also supplement the speech by mentioning audience sounds such as applause, even when no mention is made in the original narration. These acoustic advantages improve the recognition accuracy and make the subtitles easier for hearing-impaired viewers to understand.

3.1.2. Linguistic advantages

The method makes it possible to summarize or rephrase the original narrations. Conversational speech is rephrased into a planned speech style. Subjects or postpositional particles missing from an incomplete sentence are supplemented. Inverted sentences and difficult words or phrases are avoided. These practices reduce the mismatch between the language model and the speech, and make the subtitles more accurate and more understandable.

Figure 2: Subtitling system with a re-speak method for live programs (blocks: original soundtrack, re-speaker, rephrased speech, text database, speech database, speech recognition, confirmation and correction, texts, caption encoder, transmission, caption decoder).

Apparently, the way of re-speaking affects the speech recognition performance, so it is necessary to have skillful re-speakers if the final subtitles are to be as good as possible.

3.2. Experiment [5]

Experiments were performed to examine the re-speak method's effectiveness for speech recognition. The target programs were the speed skating and ski jumping competitions of the 2002 Olympic Winter Games. Using recorded videotapes of those games, four female re-speakers simulated the commentary in two different ways. One was a precise "repeating" of what they heard; the other was a careful "rephrasing" incorporating the considerations above. Bi-gram and tri-gram language models were trained on variously weighted texts. The base texts were manuscripts of NHK's general news, to which texts related to the target programs were added with higher mixture weights at the count level. The vocabulary size was 18K for general news and 33K after adding the related texts. The acoustic models, gender-dependent and speaker-independent state-clustered mixture triphone HMMs, were initially trained on NHK's news data and then adapted to each re-speaker with three hours of speech recorded while the re-speakers practiced on programs other than the evaluated ones.

Table 1 shows the performance of the two language models for speed skating and ski jumping with the "repeat" and "rephrase" speaking strategies. The table indicates that rephrasing greatly reduces the perplexities and OOV rates. The recognition results are shown in Table 2. Rephrasing also reduced the word error rates: the word errors were reduced by 37% and 27% relative for speed skating and ski jumping, respectively.

Table 1: Performance of the language models.

program          method     perplexity   OOV
speed skating    repeat     142.2        0.28%
speed skating    rephrase   99.6         0.19%
ski jumping      repeat     144.5        0.07%
ski jumping      rephrase   96.1         0.06%

Table 2: Experimental recognition results.

program          method     word error rate
speed skating    repeat     7.6%
speed skating    rephrase   4.8%
ski jumping      repeat     6.0%
ski jumping      rephrase   4.4%
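
As a check on the relative reductions quoted above, the speed skating figure follows from Table 2 as (7.6 - 4.8) / 7.6 ≈ 0.37, i.e., a 37% relative reduction, and the ski jumping figure as (6.0 - 4.4) / 6.0 ≈ 0.27.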

3.3. Operation

Since December 2001, NHK has been using the re-speak method for automatic speech recognition and subtitling of live music shows and sports events. For example, this method of subtitling was used in its coverage of the Winter Olympic Games, the World Cup Football Games, the Grand Sumo Tournaments, and professional baseball games. The language models are adapted to each program and the acoustic models are adapted to each re-speaker. The recognition accuracy is approximately 95%, and any recognition error is promptly corrected manually, so that subtitles can be presented within 5 to 8 seconds. We have received a large number of positive responses from viewers about the simultaneous subtitling. Hearing-impaired viewers expressed delight at finally being able to enjoy programs together with their families.

4. Conclusion

This paper described NHK's simultaneous subtitling systems for news and other live TV programs with speech recognition technologies. More and more programs are being subtitled every year. We are researching an advanced speech recognition system that will be able to deal with field reporting and spontaneous conversations in news, as well as speech in live programs covering a wide variety of topics.

5. References

[1] Proceedings of the DARPA Speech Recognition Workshop, Morgan Kaufmann, 1996.
[2] A. Ando, T. Imai, A. Kobayashi, H. Isono, and K. Nakabayashi, "Real-Time Transcription System for Simultaneous Subtitling of Japanese Broadcast News Programs," IEEE Transactions on Broadcasting, 46(3): 189-196, 2000.
[3] A. Kobayashi, K. Onoe, T. Imai, and A. Ando, "Time Dependent Language Model for Broadcast News Transcription and Its Post-Correction," Proceedings of the International Conference on Spoken Language Processing, pp. 2435-2438, Dec. 1998.
[4] T. Imai, A. Kobayashi, S. Sato, H. Tanaka, and A. Ando, "Progressive 2-Pass Decoder for Real-Time Broadcast News Captioning," Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, III: 1559-1562, 2000.
[5] T. Imai, A. Matsui, S. Homma, T. Kobayakawa, K. Onoe, S. Sato, and A. Ando, "Speech Recognition with a Re-Speak Method for Subtitling Live Broadcasts," Proceedings of the International Conference on Spoken Language Processing, pp. 1757-1760, 2002.
[6] Y. Matsumoto et al., Japanese Morphological Analysis System "ChaSen" Version 1.5 Manual, 1997.

