Proceedings of the 7th IASTED International Conference Signal Processing, Pattern Recognition and Applications (SPPRA 2010) February 17 - 19, 2010 Innsbruck, Austria
IMPROVING THE READABILITY OF CLASS LECTURE AUTOMATIC SPEECH RECOGNITION RESULTS USING MULTIPLE HYPOTHESES

Yasuhisa Fujii, Kazumasa Yamamoto, Seiichi Nakagawa
Department of Information and Computer Sciences
Toyohashi University of Technology, Japan
email: {fujii,kyama,nakagawa}@slp.ics.tut.ac.jp
ABSTRACT
This paper presents a method for improving the readability of class lecture Automatic Speech Recognition (ASR) results, which have hitherto been difficult for humans to understand, even in the absence of recognition errors. This is because the speech in a class lecture is relatively casual and contains many ill-formed utterances with filled pauses, restarts, and so on. Recently, there has been extensive research on paraphrasing and correcting recognition results. However, research on improving the readability of recognition results has focused mainly on manually transcribed texts rather than ASR results. Due to the presence of many kinds of specific words and the casual style, even state-of-the-art recognizers can only achieve a 30-50% word error rate (WER) for the speech in class lectures. In this paper, we propose a novel method that utilizes multiple hypotheses of the ASR results to improve the readability of the recognition results. Experimental results show that the transcripts produced by the proposed method most closely resemble the manually paraphrased text, and a subjective test shows that the proposed method improves the readability of the ASR results under erroneous conditions where the WER is as high as 37.7%.

KEY WORDS
improving readability, confusion network, automatic speech recognition, classroom lecture speech
1 Introduction
The availability of audio transcriptions of speeches allows the contents of the speech to be more easily understood. In particular, class lectures, which are the focus of this paper, benefit from transcriptions because these transcriptions can assist the hearing impaired and can also be used in downstream processing such as summarization [1], indexing [2], browsing systems [3], and so on. Therefore, a great deal of research on transcribing class lectures is currently underway. However, automatic speech recognition of class lectures is quite difficult due to the presence of many kinds of specific words and the casual style employed, and thus state-of-the-art recognizers typically achieve a Word Error Rate (WER) of only 30-50% [4, 5, 6]. Togashi et al. showed that the accuracy of transcriptions of class lectures needs to be over 70% to obtain usable results when searching for specific information in the transcriptions [3]. In addition, Munteanu et al. showed that the performance and perceived quality of transcripts of webcast lectures are affected linearly by the WER, and that transcripts in which the WER is equal to or less than 25% would be acceptable for use in webcast archives [7]. Furthermore, the recognition results from current Automatic Speech Recognition (ASR) systems are not easily understood by humans even in the absence of recognition errors, because the speech used in class lectures is relatively casual and contains many ill-formed utterances with filled pauses, restarts, repetitions, and so on. This unreadability is due to the differences between spoken and written language styles, which in the case of conjugating languages such as Japanese are quite significant. Thus, before making transcripts available to users, their readability needs to be improved to assist readers in understanding the contents of the lecture material. Operations for correcting transcripts include removing filled pauses and repetitions, style conversion from a spoken style to a written style, and the insertion of punctuation marks, such as commas and periods, and discourse markers, such as paragraph breaks. In this context, there is currently extensive research on paraphrasing and correcting recognition results [8, 9].

In this paper, we present a novel method for improving the readability of class lecture ASR results that utilizes multiple hypotheses of the ASR results. Most methods for improving readability were developed for manual transcriptions and lack the ability to handle multiple hypotheses, and therefore suffer severe degradation when dealing with ASR results [10]. By taking multiple hypotheses into account, the proposed method is expected to perform well under erroneous conditions in which the WER is relatively high. Multiple hypotheses of the ASR results are represented by a confusion network [11]. The proposed method finds the most plausible written-style hypothesis among the multiple hypotheses in the confusion network, which is converted from the spoken language style, using a language model of the written language style.

This paper is organized as follows: Section 2 gives a brief description of the confusion network and presents a method for constructing the confusion network for our purpose. The formulation and details of our method are described in Section 3. Experimental results are discussed in Section 4. Section 5 states the conclusions and some future work.
2 Confusion Network

A number of methods, such as N-best lists, word graphs (lattices), and word trellises, are known to represent multiple hypotheses of ASR results [12, 13, 14]. In addition to these methods, the confusion network, which is a special form of lattice, was proposed by Mangu et al. [11]. While the confusion network was originally designed to obtain recognition results minimized in terms of WER, it is also used as an intermediate representation between speech recognizers and downstream modules, such as machine translation [15] and retrieval [16], and its concise structure resembles a "sausage", as depicted in Figure 1. In this study, we use a confusion network as the intermediate representation between the speech recognizer and the module for improving readability. The construction method has been slightly modified to use a word trellis rather than a word graph as in Mangu's algorithm. This method is described in the next section.

2.1 Confusion Network Construction

SPOJUS, developed in our laboratory, is used as the speech recognizer in the experiments [17]. It is based on a conventional one-pass DP algorithm and has several unique characteristics which distinguish it from other recognizers, including the combined use of linear and tree-structured lexicons and the use of a likelihood difference index for the efficient computation of inter-word context dependencies. In Mangu's algorithm, a confusion network is constructed from a word lattice [11]; however, SPOJUS does not produce a word lattice. Since it assumes a 1-best approximation in the first pass, SPOJUS generates a word trellis instead of a word lattice. Although a lattice could be generated by searching the trellis in a second pass, this would be redundant, since we do not need a lattice, but a confusion network. Hence, we need a method to construct a confusion network from a word trellis instead of a word lattice.

A word trellis contains information about the beginning and end points of each word stored in the trellis. Using this information, a huge lattice can be generated from the trellis by regarding the words stored in the trellis as edges and the beginning and end frames of each word as nodes, from which the confusion network can be constructed. However, such a lattice usually contains a huge number of edges (typically more than 0.1 million), and consequently needs a huge amount of memory to store the adjacency matrix which indicates reachability between edges and which is needed for the construction of the confusion network. To deal with this problem, we present a method for computing reachability between edges using the adjacency matrix for nodes.

Let E and V be the sets of edges and nodes in a word lattice, respectively, where each edge e ∈ E is characterized by its start node Inode(e) and end node Fnode(e). The lattice defines a partial order ≤ on the edges. For e, f ∈ E, e ≤ f iff

(1) e = f, or
(2) Fnode(e) = Inode(f), or
(3) ∃e′ ∈ E such that e ≤ e′ and e′ ≤ f.

Informally, e ≤ f means that e "comes before" f in the lattice. The conditions above state, respectively, that (1) e is equal to f, (2) the end node of e equals the start node of f, and (3) there exists an e′ that mediates between e and f. These conditions mean that if there exists a path from e to f (that is, f is reachable from e), the relationship between e and f is defined as e ≤ f. The relationship is stored using an adjacency matrix of edges. In order to use an adjacency matrix of nodes instead of edges, we replace the second and third conditions with the following single condition:

(2′) Fnode(e) ≤ Inode(f),

where ≤ is defined in the same manner as in the edge case; that is, if there exists a path by which g ∈ V can reach h ∈ V in the lattice, the relationship between g and h is defined as g ≤ h. This replacement reduces the memory requirement dramatically, and allows us to construct a confusion network from the trellis directly, because the number of nodes is significantly smaller than the number of edges. No further modifications to Mangu's algorithm are necessary to construct a confusion network from a trellis.
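As an illustration, the node-based reachability test of condition (2′) can be sketched in Python as follows. This is a minimal sketch, not the SPOJUS implementation; the Edge structure and the Floyd-Warshall-style closure are our own illustrative choices, with nodes indexed by integers (e.g., frame indices):

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Edge:
        word: str
        inode: int   # start node of the edge (e.g., begin frame)
        fnode: int   # end node of the edge (e.g., end frame)

    def node_reachability(num_nodes: int, edges: List[Edge]) -> List[List[bool]]:
        """Boolean transitive closure over nodes: reach[g][h] is True iff
        there is a path from node g to node h in the lattice."""
        reach = [[False] * num_nodes for _ in range(num_nodes)]
        for e in edges:
            reach[e.inode][e.fnode] = True
        # Floyd-Warshall-style closure: O(|V|^3), affordable because the
        # number of nodes (frames) is far smaller than the number of edges.
        for k in range(num_nodes):
            for g in range(num_nodes):
                if reach[g][k]:
                    row_g, row_k = reach[g], reach[k]
                    for h in range(num_nodes):
                        if row_k[h]:
                            row_g[h] = True
        return reach

    def precedes(e: Edge, f: Edge, reach: List[List[bool]]) -> bool:
        """Edge order e <= f (for e != f), re-expressed with the node
        matrix as in condition (2'): the end node of e equals, or can
        reach, the start node of f."""
        return e.fnode == f.inode or reach[e.fnode][f.inode]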
2.2 Posterior Probability

In a lattice, we can compute the posterior probability of a word, defined as the sum of the posteriors of all paths through that word. When the confusion network is constructed, words that appear at the same time but on different paths are merged into one class. By summing the posteriors of the words that share the same word class, we can compute the posterior of the word class. The confusion network allows a bin to be dropped by introducing a special word del. The posterior probability of this special word del is computed as follows:

p(del) = 1 - \sum_{c \in C} p(c),   (1)

where C is the bin (the set of word classes) to which the special word del belongs. For example, in Figure 1, eto(well), de(then) and del correspond to word classes, and {eto(well), de(then), del} corresponds to a bin.
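As a toy illustration of Eq. (1), assuming a bin is stored as a map from word class to posterior (the probabilities below are invented, not taken from the experiments):

    def del_posterior(bin_posteriors):
        """Posterior of the special word del for one bin (Eq. (1)):
        the probability mass not accounted for by the word classes in
        the bin is assigned to skipping the bin entirely."""
        return max(0.0, 1.0 - sum(bin_posteriors.values()))

    # E.g., a bin like the first one in Figure 1 (illustrative values):
    bin1 = {"eto(well)": 0.6, "de(then)": 0.25}
    print(del_posterior(bin1))  # 0.15 -> posterior of del in this bin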
Figure 1: Confusion Network (eto kogi wo hajime masu: Well, (I) will start the lecture).
3 Proposed Method

3.1 Formulation

The proposed method computes the probability of obtaining the written language style word sequence W from the speech feature sequence O as follows:

P(W|O) = \sum_{S} P(W, S|O)
       = \sum_{S} P(W|S, O) P(S|O)
       \approx \sum_{S} P(W|S) P(S) P(O|S)
       = P(W) \sum_{S} \left[ \frac{P(S|W)}{P(S)} \right]^{A} \left[ P(S) P(O|S) \right]^{B},   (2)

where S is a hidden variable representing the word sequences of the spoken language, and P(S), P(O|S), and P(S|W) are calculated by a language model of the spoken language, an acoustic model, and a translation model between the written language W and the spoken language S, respectively. Our method constructs a confusion network based on [·]^B in (2), and converts the spoken language network into a written language network based on [·]^A. It then finds the most plausible hypothesis in the written language style from the converted written language network using the language model P(W) of the written language. Thus, our method has three steps to find the most plausible hypothesis Ŵ. Each step is described in the following sections.

3.2 Construction of Confusion Network

The confusion network is constructed from the speech input using the method described in Section 2.1. By constructing the confusion network, we can compute the posterior probability of each word class as described in Section 2.2; however, we cannot yet compute the acoustic score of each word class. We need a criterion for assigning the acoustic score, which is used in Equation (2), to each word class.

3.2.1 Acoustic Score of Each Word Class

Each word class in each bin contains the same words that appear at the same time but on different paths. To compute the acoustic score of a word class, we cannot simply sum the acoustic scores of the words in the word class, because each word in the word class may have a different duration. Therefore, to obtain the acoustic score of each word class, we assign it the acoustic score of the word with the highest posterior probability in the word class.
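A minimal sketch of this assignment rule, assuming each member of a word class carries its posterior and acoustic score in an illustrative dict structure:

    def class_acoustic_score(members):
        """Section 3.2.1: a word class merges the same word hypothesized
        with different time alignments, so summing acoustic scores is not
        meaningful; instead, take the acoustic score of the member with
        the highest posterior probability. `members` is a list of dicts
        with 'posterior' and 'acoustic_score' keys (assumed structure)."""
        best = max(members, key=lambda m: m["posterior"])
        return best["acoustic_score"]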
3.3 Conversion of Confusion Network

The confusion network is converted from the spoken language to the written language based on P(S|W) in (2). Theoretically, we can use any kind of model to compute the sequence-to-sequence translation probability P(S|W), such as Conditional Random Fields (CRF) [18]. However, for practical reasons, we compute P(S|W) (S = (s_1, s_2, ..., s_N), W = (w_1, w_2, ..., w_N)) as follows:

P(S|W) = \prod_{i=1}^{N} P(s_i|w_i),   (3)

where P(s|w) = 1 if the word pair exists in the translation table or s = w, and 0 otherwise. In this paper, with respect to P(S|W), we deal only with the deletion of filled pauses and the insertion of commas and periods as punctuation marks.

3.3.1 Deletion of Filled Pauses

Filled pauses, which constitute filler or interjection parts of speech, are translated only from the special word del. If a bin contains many filled pauses, it tends to be deleted based on the principle of majority decision. In addition, the posterior probabilities of all filled pauses are added to the posterior probability of del:

p(del) = \sum_{w \in Filler \cup \{del\}} p(w),   (4)

where Filler denotes the set of filled pauses in a bin.

3.3.2 Insertion of Punctuation Marks

Punctuation marks are translated only from SIL or sp, corresponding to silence or a short pause, respectively. This means that all non-speech segments are candidates for punctuation marks. Figure 2 shows the confusion network converted from Figure 1, where a filled pause ("eto:well") was deleted, and a short pause ("sp") and silence ("SIL") were converted into punctuation marks (",", "."). By regarding the converted confusion network as a multiple hypotheses space of the written language, we can find the most plausible hypothesis Ŵ efficiently by simply searching the network using the language model P(W) of the written language. The search method used in our algorithm is described in the next section.
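The conversion of a single bin can be sketched as follows. The filler and pause inventories below are illustrative stand-ins (only "eto", "sp" and "SIL" appear in the paper's example), and the <del> token name is our own convention:

    FILLERS = {"eto", "ano", "e"}        # illustrative filled-pause words
    PAUSES = {"SIL": ".", "sp": ","}     # non-speech -> punctuation candidates

    def convert_bin(bin_posteriors):
        """Convert one spoken-style bin into written style (Section 3.3):
        filled pauses are folded into <del> (Eq. (4)), and silences/short
        pauses become punctuation candidates; other words pass through."""
        out = {}
        p_del = bin_posteriors.get("<del>", 0.0)
        for word, p in bin_posteriors.items():
            if word == "<del>":
                continue
            if word in FILLERS:
                p_del += p     # bins dominated by fillers tend to be dropped
            elif word in PAUSES:
                out[PAUSES[word]] = out.get(PAUSES[word], 0.0) + p
            else:
                out[word] = p
        out["<del>"] = p_del
        return out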
Figure 2: Converted Confusion Network.

3.4 Search by Stack Decoding

After the conversion of the confusion network from the spoken language style into the written language style, only the search on the confusion network remains in order to find the most plausible written-style hypothesis Ŵ. Thanks to the simplification of the translation probability P(S|W) by Equation (3), the search process can proceed bin by bin, monotonically, on the confusion network, so we can apply a bin-synchronous beam search in which sentence-level hypotheses are retained in a stack up to the beam width. In each bin, all hypotheses in the stack are popped, and each word in the current bin is connected to all the popped hypotheses. The newly generated hypotheses are all rescored by adding the written language model score of the newly connected word and the posterior of the connected word to the original score of the hypothesis. After that, if there exist hypotheses that share the same history within the range of the language model, those hypotheses are bundled together for efficient computation. Finally, the hypotheses with the highest scores are pushed back into the stack and retained up to the beam width. The process is continued until the end of the confusion network, and the hypothesis at the top of the stack is selected as the final result.

To complete the algorithm, two remaining issues must be explained. The first is that the acoustic score of each word needs to be further modified, since the ending word of a hypothesis in the stack may have an ending frame that differs from the beginning frame of the connected word. The second is that, since the domains on which the spoken and written language models are trained differ, we have to alleviate the impact of this mismatch. The solutions to these two issues are described in the following sections.
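The bin-synchronous search can be sketched as follows. This is a strong simplification of the algorithm above: it merges hypotheses on their full history rather than on the n-gram-truncated one, and omits the acoustic and deletion terms of Section 3.4.1. lm_logprob is an assumed written-LM interface, and the bins use the <del> convention of the previous sketch:

    import heapq, math

    def stack_decode(bins, lm_logprob, beam=10):
        """Bin-synchronous beam search over a converted confusion network.
        A hypothesis is (log score, word history). Each bin extends every
        hypothesis with every candidate word (<del> skips the bin), adds
        the written-LM score and the log word posterior, merges equal
        histories, and keeps the `beam` best hypotheses in the stack."""
        stack = [(0.0, ())]                       # one empty hypothesis
        for bin_posteriors in bins:
            best = {}
            for score, hist in stack:
                for word, p in bin_posteriors.items():
                    if p <= 0.0:
                        continue
                    if word == "<del>":           # drop the whole bin
                        cand = (score + math.log(p), hist)
                    else:
                        cand = (score + math.log(p) + lm_logprob(hist, word),
                                hist + (word,))
                    if cand[1] not in best or cand[0] > best[cand[1]]:
                        best[cand[1]] = cand[0]   # bundle equal histories
            stack = heapq.nlargest(beam, ((s, h) for h, s in best.items()))
        return max(stack)[1]                      # words of the top hypothesis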
3.4.1 Approximation Score of Deleted Frames

As already mentioned, the words in the bins of the confusion network may have different durations. Therefore, two words in neighboring bins do not necessarily share the ending frame (of the preceding word) and the beginning frame (of the following word). To compare two hypotheses correctly, we have to compute and assign adequate scores to the deleted regions. To deal with this problem, we introduce an approximated score for deleted frames, which we approximate by the maximum score of the ASR result. The score of deleted frames P_{del-asr}(s, e), starting at frame s and ending at frame e, is approximated as follows:

P_{del-asr}(s, e) = \hat{P}_{asr}(e) - \hat{P}_{asr}(s),   (5)

where \hat{P}_{asr}(i) denotes the score of the most plausible hypothesis at frame i. The score of the deleted frames is overestimated, giving a higher score than necessary, since the maximum score is used for the computation. This causes over-deletion of bins. Therefore, we introduce a score that takes into account the posterior probability of del to control the deletion of a word w_b in bin b. The probability p_{del}(b) that one frame is dropped in bin b is approximated by (4). The deletion score for w_b with duration dur and f deleted frames is computed as follows:

S_{w_b} = \mathrm{sigm}(p_{del}(b))^{f} \cdot \mathrm{sigm}(1 - p_{del}(b))^{dur - f},   (6)

\mathrm{sigm}(X) = \frac{1}{1 + \exp\left(\frac{X - 0.5}{\alpha}\right)}.   (7)

Incorporating (5) and (6) into (2), we can rewrite (2) as:

P'(W|O) = \prod_{b} \left\{ S_{w_b} \cdot P^{w_b}_{del\text{-}asr} \right\} \cdot P(W|O),   (8)

where P^{w_b}_{del-asr} is the sum of all P_{del-asr}(s, e) for the word w_b.
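A direct transcription of Eqs. (5)-(7) in Python; the cumulative best-path score array p_hat_asr is an assumed input:

    import math

    def deleted_frames_score(p_hat_asr, s, e):
        """Eq. (5): approximate the score of frames s..e deleted by the
        written hypothesis using the best-path ASR score over that span."""
        return p_hat_asr[e] - p_hat_asr[s]

    def sigm(x, alpha=1 / 4.2):
        """Eq. (7): a logistic curve centered at 0.5; 1/alpha = 4.2 in
        the experiments controls its sharpness."""
        return 1.0 / (1.0 + math.exp((x - 0.5) / alpha))

    def deletion_score(p_del_bin, dur, f):
        """Eq. (6): score for deleting f of the dur frames of word w_b in
        bin b, where p_del_bin approximates (via Eq. (4)) the probability
        that one frame of the bin is dropped."""
        return sigm(p_del_bin) ** f * sigm(1.0 - p_del_bin) ** (dur - f)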
3.4.2 Interpolation of Language Models

The domains of the corpora used to train the spoken and written language models may differ. In fact, in the experiments, the spoken language model is trained on a lecture speech corpus, while the written language model is trained on a newspaper corpus. To alleviate the impact of this difference, we interpolate the language models as follows:

P'(W) = \lambda P(W) + (1 - \lambda) P(S).   (9)

P'(W) is then used in (2) instead of P(W). Figure 4 shows an expected example of the search result on the converted confusion network of Figure 2 obtained by the proposed method, while Figure 3 shows the recognition result of the confusion network of Figure 1.
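A sketch of Eq. (9) as it would be applied during the search, assuming both language models expose a conditional probability interface (the function names are illustrative, not from a specific toolkit):

    import math

    def interpolated_logprob(lm_written, lm_spoken, history, word, lam=0.75):
        """Eq. (9): linear interpolation of the written and spoken LMs in
        the probability domain, with lambda = 0.75 as in the experiments.
        lm_written/lm_spoken are assumed to return P(word | history)."""
        p = lam * lm_written(history, word) + (1.0 - lam) * lm_spoken(history, word)
        return math.log(p)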
Figure 3: 1best result of the confusion network. Red arcs indicate the selected word in each bin.
Figure 4: Expected search result of the converted confusion network. Red arcs indicate the expected search path.
4 Experiments

4.1 Setup

We selected two class lectures with IDs "L11M0031" and "L11M0051" from the CJLC corpus [19] as a test set. Each lecture was about 70 minutes long. Manually transcribed texts (referred to as manual in the experimental results) are available for these lectures, and in addition, we manually prepared transcriptions paraphrased from the spoken language style into the written language style (henceforth referred to as paraphrased) for each lecture. We used the lectures in the CSJ corpus [20] for the acoustic model and trained 928 context-dependent acoustic models of Japanese syllables. Each syllable model has a Gaussian Mixture Model (GMM) to model the output distribution of the speech input feature vector, which consists of 12 MFCCs, ∆MFCC, ∆∆MFCC, ∆power and ∆∆power. The number of mixtures in the GMM is 4, and the covariance matrix forms a block structure (full covariance for each of MFCC, ∆MFCC, ∆∆MFCC, ∆power and ∆∆power). Tri-grams with Witten-Bell discounting are used for both the spoken language model and the written language model. The spoken language model is trained on the CSJ corpus (2,693 lectures), while the written language model is trained on the Mainichi newspaper corpus (75 months). The vocabulary size is 20,000, and the lexicon is the set of words with the highest frequencies in the CSJ corpus. λ in Eq. (9) and 1/α in Eq. (7) are empirically set to 0.75 and 4.2, respectively.

4.2 Evaluation Metrics

To evaluate the transcripts, we used the WER, computed as follows:

WER = \frac{D + S + I}{C + D + S},   (10)

where D, I, S, and C denote the number of deletions, insertions, substitutions, and matches, respectively. Note that the reference (correct transcript) is the transcribed text of the input utterances for Tables 1 and 4, and the paraphrased transcript for Table 3. We used both transcripts as references for Table 2.
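For reference, Eq. (10) corresponds to the standard Levenshtein-alignment computation (note that C + D + S equals the reference length); a compact sketch:

    def wer(ref, hyp):
        """Eq. (10): WER = (D + S + I) / (C + D + S), i.e., the word-level
        edit distance between reference and hypothesis divided by the
        reference length."""
        n, m = len(ref), len(hyp)
        d = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(n + 1):
            d[i][0] = i
        for j in range(m + 1):
            d[0][j] = j
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
                d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
        return d[n][m] / n

    print(wer("a b c d".split(), "a x c".split()))  # 0.5 (1 sub + 1 del over 4)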
4.3 Evaluation of Confusion Network

Table 1 gives the recognition results for the test set. In this table, MAP and ConfNet mean that the recognition results were computed using Maximum a Posteriori decoding and consensus decoding [11], respectively. Consensus decoding does not improve the recognition results at all. However, as shown in Table 2, the confusion network contained 76.5% of the words of the manual transcripts and 69.1% of the words of the paraphrased transcripts, with 3.12 words per bin on average. This was sufficient to represent a multiple hypotheses space.

Table 1: Recognition results for the test set [%].

Method    Del.   Ins.   Subs.   WER
MAP       10.1   4.3    23.5    37.7
ConfNet   11.1   3.8    23.0    37.7

Table 2: Word inclusion and density (word density: 3.12 words per bin).

Input                Word inclusion [%]
Manual transcript    76.5
Paraphrased          69.1

4.4 Evaluation of Proposed Method

4.4.1 Comparison with Transcripts

To evaluate the proposed method, we compared three different types of transcripts, namely, the transcript with the best recognition result (1best), transcripts with filled pauses removed (filler-del), and the transcripts obtained by the proposed method (proposed), with both the manual and the paraphrased transcripts. Filler-del transcripts are those from which only filled pauses are deleted based on parts of speech. Punctuation marks were removed from all transcripts for evaluation. The results are given in Table 3. Compared to the manual transcript, the 1best transcript produced the best result, while the proposed transcript produced the worst result.
Table 3: Evaluation results of the transcripts [%].

Input         Output       Del.   Ins.   Subs.   WER
Manual        1best        10.1    4.3   23.5    37.7
transcript    Filler-del   18.1    2.9   20.8    41.7
              Proposed     25.6    1.8   18.2    45.4
Paraphrased   1best        12.1   21.8   28.9    62.7
              Filler-del   14.2   14.2   27.2    55.5
              Proposed     19.4    9.3   25.1    53.7

However, when compared to the paraphrased transcript (the target of our task), the proposed transcripts produced the best result, while the 1best transcript produced the worst result. In both cases, the filler-del transcripts yielded the second-best result. These results mean that the 1best transcript most closely resembles the manual transcript, while the proposed transcript most closely resembles the paraphrased transcript. The filler-del transcripts lie somewhere in between the manual and paraphrased transcripts.

Table 4: Classification of error types [%].

Tag   Output       Match   Del.   Subs.
F     1best        57.4     5.0   37.7
      Filler-del    7.0    58.3   34.6
      Proposed      5.9    73.3   20.8
D     1best        17.8    11.2   71.1
      Filler-del   18.3    21.8   59.9
      Proposed     10.7    44.7   44.7

Table 4 gives a classification of the error types of tagged words (filled pauses and repetitions) in the manual transcripts compared with the three types of automatic transcripts. The column for insertion errors is omitted, since insertion errors could not be computed because there were no tagged words in the recognition results. The F tag is assigned to filled pauses such as (F well) and (F umm), while the D tag is assigned to repetitions of self-sufficient Japanese words such as (D koku) kokuban ((D black) blackboard). The experimental results show that the proposed method removed filled pauses better than the filler-del method, which only removes filled pauses from the 1best result based on parts of speech. The proposed method succeeded in removing 73.3% of the filled pauses, while the filler-del method was only able to remove 58.3% of them. In addition, 44.7% of the repetitions were removed by the proposed method.
4.4.2 Subjective Test

To assess whether the proposed method really improves the readability of the ASR results, we conducted a subjective test. The test was conducted on the two lectures by 6 subjects as follows:

1. Read each transcript: 1best, Filler-del, Proposed. The reading order was randomized for each subject.

2. Compare the transcripts for readability (paired comparison).

3. Read the manual transcript, and then read/understand each transcript again.

4. Compare the transcripts for understandability (paired comparison).

If a subject had no preference within a pair, they were allowed to answer "?" to indicate "I could not distinguish them". The results of the subjective test are shown in Table 5. When the Proposed is compared with the 1best in terms of readability, the Proposed outperformed the 1best for both lectures, and when the Proposed is compared with the Filler-del, the Proposed also outperformed the Filler-del. Hence, we can conclude that the proposed method is superior to the other methods in terms of readability. However, we cannot draw the same conclusion in terms of understandability. While the Proposed slightly outperformed the other methods for lecture "L11M0031" in terms of understandability as well as readability, the Proposed did not outperform the other methods for lecture "L11M0051" at all. An analysis of lecture "L11M0051" showed that its speaker used more pauses and filled pauses than usual speakers, so that our method inserted punctuation marks at inappropriate points. Too many punctuation marks appear to degrade understandability more than readability. We believe this is why the proposed method did not improve the understandability for lecture "L11M0051". The comparison between the 1best and the Filler-del showed that the two methods could not be distinguished from each other in either evaluation. This means that merely removing filled pauses is not sufficient to improve the readability and understandability of ASR results.
Table 5: Subjective test results (number of subjects preferring each transcript; "?" means no preference).

Evaluation          Lecture     Comparison pair         1best   Filler-del   Proposed   ?
Readability         L11M0031    1best − Filler-del      3       3            −          0
                                1best − Proposed        1       −            5          0
                                Filler-del − Proposed   −       2            4          0
                    L11M0051    1best − Filler-del      2       4            −          0
                                1best − Proposed        2       −            4          0
                                Filler-del − Proposed   −       2            3          1
Understandability   L11M0031    1best − Filler-del      3       3            −          0
                                1best − Proposed        2       −            3          1
                                Filler-del − Proposed   −       2            3          1
                    L11M0051    1best − Filler-del      3       3            −          0
                                1best − Proposed        5       −            0          1
                                Filler-del − Proposed   −       5            0          1
5 Conclusion

In this paper, we presented a novel method for improving the readability of class lecture ASR results that utilizes multiple hypotheses. The proposed method constructs a confusion network to represent the multiple hypotheses, and the most plausible paraphrased sentence is found from the confusion network, converted by a translation model, using a written language model. The transcripts produced by the proposed method most closely resembled the manually paraphrased transcript, and a subjective test showed that the method improved the readability of the ASR results under an erroneous condition where the word error rate was 37.7%. In the future, we aim to use more complex models for P(S|W), such as CRFs, that naturally allow more accurate translation schemes and a discriminative determination of sentence boundaries.

Acknowledgement

Part of this research was supported by the Global COE Program "Frontiers of Intelligent Sensing" from the Ministry of Education, Culture, Sports, Science and Technology, Japan.

References

[1] Y. Fujii, K. Yamamoto, N. Kitaoka, and S. Nakagawa. Class lecture summarization taking into account consecutiveness of important sentences. In Proc. Interspeech, pp. 2438-2441, Sep. 2008.

[2] C. Chelba and A. Acero. Indexing uncertainty for spoken document search. In Proc. Interspeech, pp. 61-64, Sep. 2005.

[3] S. Togashi and S. Nakagawa. A browsing system for classroom lecture speech. In Proc. Interspeech, pp. 2803-2806, Sep. 2008.

[4] J. Glass, T. J. Hazen, S. Cyphers, I. Malioutov, D. Huynh, and R. Barzilay. Recent progress in the MIT spoken lecture processing project. In Proc. Interspeech, pp. 2553-2556, Aug. 2007.

[5] S. Kogure, H. Nishizaki, M. Tsuchiya, K. Yamamoto, S. Togashi, and S. Nakagawa. Speech recognition performance of CJLC: Corpus of Japanese Lecture Contents. In Proc. Interspeech, pp. 1554-1557, Sep. 2008.

[6] T. Kawahara, Y. Nemoto, and Y. Akita. Automatic lecture transcription by exploiting presentation slide information for language model adaptation. In Proc. IEEE-ICASSP, pp. 4929-4932, 2008.

[7] C. Munteanu, G. Penn, R. Baecker, E. Toms, and D. James. Measuring the acceptable word error rate of machine-generated webcast transcripts. In Proc. Interspeech-ICSLP, pp. 157-160, Sep. 2006.

[8] T. Hori, D. Willett, and Y. Minami. Paraphrasing spontaneous speech using weighted finite-state transducers. In Proc. SSPR, pp. 210-222, Apr. 2003.

[9] K. Shitaoka, H. Nanjo, and T. Kawahara. Automatic transformation of lecture transcription into document style using statistical framework. In Proc. Interspeech, pp. 2881-2884, 2004.

[10] Y. Liu, E. Shriberg, A. Stolcke, D. Hillard, M. Ostendorf, and M. Harper. Enriching speech recognition with automatic detection of sentence boundaries and disfluencies. IEEE Transactions on Audio, Speech, and Language Processing, Vol. 14, No. 5, pp. 1526-1540, Sep. 2006.

[11] L. Mangu, E. Brill, and A. Stolcke. Finding consensus in speech recognition: word error minimization and other applications of confusion networks. Computer Speech and Language, Vol. 14, No. 4, pp. 373-400, 2000.

[12] A. Stolcke, Y. König, and M. Weintraub. Explicit word error minimization in N-best list rescoring. In Proc. Eurospeech '97, pp. 163-166, 1997.

[13] H. Ney and X. Aubert. A word graph algorithm for large vocabulary continuous speech recognition. In Proc. ICSLP '94, 1994.

[14] F. K. Soong and E. F. Huang. A tree-trellis based fast search for finding the N-best sentence hypotheses in continuous speech recognition. In Proc. ICASSP '91, 1991.

[15] N. Bertoldi and M. Federico. A new decoder for spoken language translation based on confusion networks. In Proc. ASRU, pp. 86-91, 2005.

[16] T. Hori, I. L. Hetherington, T. J. Hazen, and J. R. Glass. Open-vocabulary spoken utterance retrieval using confusion networks. In Proc. ICASSP, pp. IV-73-IV-76, 2007.

[17] J. Zhang, L. Wang, and S. Nakagawa. LVCSR based on context dependent syllable acoustic models. In Asian Workshop on Speech Science and Technology, SP2007-200, pp. 81-86, Mar. 2008.

[18] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. of the 18th International Conference on Machine Learning, 2001.

[19] M. Tsuchiya, S. Kogure, H. Nishizaki, K. Ohta, and S. Nakagawa. Developing corpus of Japanese classroom lecture speech contents. In Proc. LREC, 2008.

[20] S. Furui, K. Maekawa, and H. Isahara. A Japanese national project on spontaneous speech corpus and processing technology. In Proc. ASR2000, pp. 244-248, 2000.