Extractive Speech Summarization Using Shallow Rhetorical Structure Modeling

Justin Jian Zhang, Ricky Ho Yin Chan, and Pascale Fung, Senior Member, IEEE
Abstract—We propose an extractive summarization approach with a novel shallow rhetorical structure learning framework for speech summarization. One of the most under-utilized features in extractive summarization is hierarchical structure information: the semantically cohesive units that are hidden in spoken documents. We first present empirical evidence that rhetorical structure is the underlying semantic information, rendered in linguistic and acoustic/prosodic forms in lecture speech. A segmental summarization method, in which the document is partitioned into rhetorical units by K-means clustering, is first proposed to test this hypothesis. We show that this system produces summaries at a 67.36% ROUGE-L F-measure, a 4.29% absolute increase in performance compared with that of the baseline system. We then propose Rhetorical-State Hidden Markov Models (RSHMMs) to automatically decode the underlying hierarchical rhetorical structure in speech. Tenfold cross-validation experiments are carried out on conference speeches. We show that the system based on RSHMMs gives a 71.31% ROUGE-L F-measure, an 8.24% absolute increase in lecture speech summarization performance compared with the baseline system without RSHMMs. Our method also outperforms a baseline that uses a conventional discourse feature. We also present a thorough investigation of the relative contribution of different features and show that, for lecture speech, speaker-normalized acoustic features contribute the most, at a 68.5% ROUGE-L F-measure, compared to 62.9% for linguistic features and 59.2% for un-normalized acoustic features. This shows that the individual speaking style of each speaker is highly relevant to summarization.

Index Terms—Extractive speech summarization, lecture speech, rhetorical information.
I. INTRODUCTION
Spoken document summarization is the recognition, distillation, and presentation of spoken documents in a structured text form to the user. The challenge of spoken document summarization, beyond automatic speech recognition itself, lies largely in the lack of easily discernible structures in these documents, in the form of titles, subtitles, sentence and paragraph boundaries, punctuation, fonts, and styles, to help with the interpretation of the underlying semantic information, which in turn is easily accessible to human readers and search engines alike. On the other hand, spoken documents, such as manual or automatic speech recognition (ASR) transcriptions and extracted summaries, as shown in Fig. 1, make up for their lack of titles, subtitles, paragraph boundaries, and other structural information with other features that are embedded in the speech signal, namely acoustic, phonetic, and prosodic information. These represent how things are said, whereas the actual words spoken (the linguistic features) are what is said. Existing speech summarization systems, including our previous work, have shown that, for different spoken documents, how things are said is often as important as what is said.

Manuscript received June 24, 2008; revised July 26, 2009. First published August 25, 2009; current version published July 14, 2010. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Ruhi Sarikaya. The authors are with the Department of Electronic and Computer Engineering, Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong (e-mail: [email protected]; [email protected]; [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TASL.2009.2030951

Fig. 1. ASR transcription text and summary with flat structure.

Incorporating both acoustic and linguistic features, most spoken document summarization systems employ an extractive approach in which salient sentences or segments of the speech are extracted and compiled into a final summary [1]–[6]. Nevertheless, most existing work has failed to make adequate use of one important feature: the rhetorical structure of the spoken document. Rhetorical structure is the story flow of the document [7]. A document consists of semantically coherent units, known as rhetorical units. A rhetorical unit is a "continuous, uninterrupted span of text" with a single, coherent semantic theme. In written documents, rhetorical units are often represented as paragraphs or subparagraphs. Our previous work [6] and that of other researchers have suggested that rhetorical information also exists in spoken documents and that efficient modeling of this information is helpful to the summarization task. References [8] and [9] used the Hearst method
[10] to segment documents and detect topics for text summarization and for topic adaptation of speech recognition systems for long speech archives, respectively. Some summarization systems make use of the simplest type of rhetorical information, commonly known as discourse features, such as sentence or noun position offset from the beginning of the text [2], [11], [12]. Reference [13] applied an HMM generative framework to broadcast news speech summarization. The discourse feature works well for news reports, but not as well in other genres such as lecture presentations [6]. Our proposed work combines the idea of rhetorical structure information with an HMM probabilistic framework for summarizing lecture speech presentations.

This paper is organized as follows. Section II describes our motivation and the rhetorical structure characteristics of lecture speech. We then describe the corpus, how the reference summaries are created, and our lecture speech recognition system for automatic transcription in Section III. In Section IV, we outline the acoustic/prosodic and linguistic characteristics of lecture speech. We then describe our approach for automatically segmenting lecture speech, and the performance of the summarization system using this segmentation structure information, in Section V. Section VI details how we build Rhetorical-State Hidden Markov Models (RSHMMs) for learning the rhetorical information of lecture speech and how we improve the extractive summarization system by using this information. The experiments and results are presented in Section VII. Section VIII concludes the paper.

II. RHETORICAL STRUCTURE CHARACTERISTICS IN LECTURE SPEECH

Unlike conversational speech, lectures and presentations are planned. Like all planned speech, lecture speakers follow a relatively rigid rhetorical structure at the document level: the speaker starts with an overview of the topic to be presented, follows with the actual content in more detailed descriptions, and then concludes at the end. Within each section, there is another level of rhetorical structure. For example, the introduction section might start with the motivation, then the background, then the proposed methodology, followed by an overview of the rest of the presentation. Each of these is in turn a rhetorical unit. These coherent text spans are units of rhetorical structure. Mann and Thompson assert that the structure of every coherent text span can be described by a single rhetorical structure tree, whose top schema application creates a span encompassing the whole document [14].

For lecture speech presentations, we envision the rhetorical structure of lectures and presentations as a two-layer hierarchical text plan. As illustrated in Fig. 2, the "introduction" section may consist of the "overview" subsection and the "background" subsection; the "content" section may consist of the "method" subsection and the "experiment" subsection. In theory, all rhetorical units are connected by several kinds of rhetorical relations, such as contrast, condition, sequence, and so on. In our work, rhetorical structures are represented by Markov models and rhetorical relations by state transitions.

Fig. 2. Hierarchical text plan of lecture speech.

Considering that humans tend to summarize presentations by extracting important sentences from the introduction and conclusion sections, [1] proposed a novel summarization method based on this structural characteristic. They estimated the introduction
and conclusion section boundaries based on the Hearst method [10], using sentence cohesiveness measured by the cosine value between content word-frequency vectors, before performing summarization.

Many linguists believe that speech acoustics contribute to rhetorical and discourse structure. Reference [11] provides empirical evidence that discourses can be segmented reliably and that acoustic features are used by speakers to convey linguistic structure at the discourse level in English. There is a large body of previous work seeking to demonstrate that the acoustic/prosodic profile of speech closely models its discourse or rhetorical structure [15]–[17]. In this paper, we stipulate that a correlation between acoustic features and discourse structure exists, and that we may use the acoustic features for extracting the discourse structure.

Since lecture speeches are mostly based on presentation slides that carry the main gisting points, rather than read from a script, the content and format of the presentation slides are a faithful representation of the document-level rhetorical structure of the lecture speech. To investigate and illustrate this rhetorical structure as represented by the acoustic and linguistic features of speech, we use a PCA projection of all acoustic/phonetic, linguistic, and discourse features of the lecture speech for a visual rendering of the underlying rhetorical structure. PCA reduces the multidimensional feature vectors to two dimensions by finding the orthogonal vectors that best represent all the features. The PCA transformation is given as
$$Y = XV \quad (1)$$

where $X$ is the original feature vector matrix, $Y$ is the PCA result matrix, and $X = U\Sigma V^{T}$ is the singular value decomposition (SVD) of $X$.

Fig. 3. Visualization of lecture speech sentences represented by using only acoustic features.

Fig. 4. Visualization of lecture speech sentences represented by using only linguistic features.

Fig. 5. Visualization of lecture speech sentences represented by using all features.

By using only acoustic features, as listed in Section IV-A, we find that the sentences of the transcriptions are not separated into sections, as shown in Fig. 3. By using only linguistic features, as listed in Section IV-B, we find that the sentences of the transcriptions are segmented into two distinct parts, the introduction part and the content plus conclusion part, as shown in Fig. 4. This shows that linguistic features may help extract the rhetorical structure of the lecture speech to some extent. On the other hand, when we visualize the rhetorical structure of lecture speech by PCA on the acoustic/phonetic and linguistic features together, as shown in Fig. 5, we find that the sentences of the transcriptions are segmented into three sections rather distinctly. This shows that a more accurate underlying rhetorical structure of the lecture speeches can be obtained by using acoustic features combined with linguistic features.
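For concreteness, the 2-D projection of (1) used for Figs. 3–5 can be sketched as follows. This is a minimal numpy sketch under our own naming (the function `pca_project_2d` is hypothetical), assuming each sentence has already been encoded as one row of a feature matrix:

```python
import numpy as np

def pca_project_2d(X):
    """Project sentence feature vectors onto their first two
    principal components via SVD, as in (1)."""
    # Center the feature matrix (rows: sentences, columns: features).
    Xc = X - X.mean(axis=0)
    # SVD: Xc = U * diag(S) * Vt; rows of Vt are the principal axes.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    # Y = X V keeps the two directions of largest variance.
    return Xc @ Vt[:2].T

# Toy usage: 100 sentences, 20 acoustic + linguistic features.
feats = np.random.randn(100, 20)
coords = pca_project_2d(feats)   # shape (100, 2), one point per sentence
```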
III. CORPUS, REFERENCE SUMMARIES, AND LECTURE SPEECH RECOGNITION SYSTEM

A. Corpus and Reference Summaries

We have collected a lecture speech corpus containing the wave files of 111 presentations recorded from different speakers at the NCMMSC2005 and NCMMSC2007 conferences, together with the PowerPoint slides, manual transcriptions, and associated audio data. Each presentation lasts about 15 minutes on average and is recorded at 22 kHz with 16-bit sampling. All the lecture audio data are manually segmented and transcribed, and down-sampled to 16 kHz. In the current work, we use the 71 of the 111 presentations which have well-formatted PowerPoint slides for our experiments. Each presentation was manually divided into 83 segment units on average. The ASR system runs in multiple passes and performs unsupervised acoustic model adaptation as well as unsupervised language model adaptation [18], with 70.3% recognition accuracy.

We propose using the presentation slides to compile reference summaries automatically. The motivations behind this procedure are as follows:
• presentation slides are compiled by the authors themselves and therefore provide a good standard summary of their work;
• presentation slides contain the hierarchical rhetorical structure of the lecture speech, as the titles, subtitles, page breaks, and bullet points provide an enriched set of discourse information that is otherwise not apparent in the spoken lecture transcriptions.

We propose a Relaxed Dynamic Time Warping (RDTW) procedure to align sentences from the slides to those in the lecture speech transcriptions, resulting in automatically extracted reference summaries. First, we calculate the cosine similarity score matrix $SIM$, where $SIM(i,j)$ is the cosine similarity between sentence $i$ in the transcription and sentence $j$ in
the slides. We then obtain the distance matrix $D$, where $D(i,j) = 1 - SIM(i,j)$. We assume the number of sentences in the transcription is $N$ ($1 \le i \le N$) and the number of sentences or segments in the slides is $M$ ($1 \le j \le M$), as shown in Fig. 6(a). Next, we calculate the initial warp path $W = \{w_1, w_2, \ldots, w_K\}$, where each $w_k$ is a point $(i, j)$, as the minimum-distance warp path found by the DTW algorithm, as shown in Fig. 6(b):

$$W = \mathop{\arg\min}_{W} \sum_{k=1}^{K} D(w_k) \quad (2)$$

Fig. 6. Compiling an example reference summary.

Given that the speaker often does not follow the slide order strictly, we adopt Relaxed Dynamic Time Warping (RDTW) to relax the alignment order constraint and to find the optimal path according to (3). The transcription sentences on this path are the reference summary sentence candidates, as shown in Fig. 6(c):

$$w'_k = \mathop{\arg\min}_{j':\, |j' - j_k| \le r} D(i_k, j') \quad (3)$$

We denote the initial path as $W$, where $w_k$ is represented by $(i_k, j_k)$. We then obtain the optimal path $W'$, where $w'_k$ is represented by $(i_k, j'_k)$ and $r$ is the relaxation factor. We then select those sentences of the transcription whose similarity scores with their paired slide sentences are higher than a predefined threshold as the final reference summary sentences, as shown in Fig. 6(d). Referring to Fig. 2, we produce reference summaries that contain Introduction, Content, and Conclusion.

We have compared these reference summaries to human-labeled summaries. When asked to "select the most salient sentences for a summary," we found that inter-annotator agreement ranges from 30% to 50% only. Sometimes even a single person might choose different sentences at different times [19]. However, when instructed to follow the structure and points in the presentation slides, inter-annotator agreement increased to 80.2%. The agreement between the automatically extracted reference summaries and the human annotators also reaches 75%. Based on this high degree of agreement, we generate reference summaries by asking a human to manually correct those extracted by the RDTW algorithm. The human annotator just checks and modifies the automatically extracted reference summary according to the transcription and the corresponding PowerPoint slides. Our reference summaries therefore make for more reliable training and test data.

B. Evaluation Metrics

ROUGE [20] has been adopted as a standard evaluation metric in various summarization tasks. It is computed based on the n-gram overlap between a summary and a set of reference summaries. Sentence F-measure and ROUGE-N generally perform well in evaluating speech summarization tasks [1]. We evaluate the performance of the extractive summarization methods using ROUGE-N, an n-gram based co-occurrence measure described by (4), and ROUGE-L (summary-level longest common subsequence) precision, recall, and F-measure, described by (5), (6), and (7) [21]. When we perform extractive summarization experiments on manual transcriptions with manual sentence segmentation, the value of ROUGE-N is equal to that of ROUGE-L, because each sentence of the extracted summary contains the same number of words as the corresponding sentence in the reference summary.

$$\mathrm{ROUGE\mbox{-}N} = \frac{\sum_{S \in \{\mathrm{Ref}\}} \sum_{\mathrm{gram}_n \in S} \mathrm{Count}_{\mathrm{match}}(\mathrm{gram}_n)}{\sum_{S \in \{\mathrm{Ref}\}} \sum_{\mathrm{gram}_n \in S} \mathrm{Count}(\mathrm{gram}_n)} \quad (4)$$

$$R_{lcs} = \frac{\sum_{i=1}^{u} \mathrm{LCS}_{\cup}(r_i, C)}{m} \quad (5)$$

$$P_{lcs} = \frac{\sum_{i=1}^{u} \mathrm{LCS}_{\cup}(r_i, C)}{n} \quad (6)$$

$$F_{lcs} = \frac{(1 + \beta^2) R_{lcs} P_{lcs}}{R_{lcs} + \beta^2 P_{lcs}} \quad (7)$$

Given a reference summary of $u$ sentences containing a total of $m$ words and a candidate summary of $v$ sentences containing a total of $n$ words, $\mathrm{LCS}_{\cup}(r_i, C)$ is the LCS score of the union longest common subsequence between reference sentence $r_i$ and candidate summary $C$.
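As a concrete reference for (5)–(7), the sketch below computes a summary-level ROUGE-L F-measure. It is our own simplification, not the official ROUGE package: the union LCS is approximated by the LCS of each reference sentence against the concatenated candidate token sequence, and all names are ours:

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of token lists a and b."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]

def rouge_l(ref_sents, cand_sents, beta=1.0):
    """Summary-level ROUGE-L per (5)-(7), with LCS_union simplified
    to LCS against the concatenated candidate."""
    cand = [w for s in cand_sents for w in s]
    m = sum(len(s) for s in ref_sents)   # total words in reference summary
    n = len(cand)                        # total words in candidate summary
    lcs_sum = sum(lcs_len(r, cand) for r in ref_sents)
    r_lcs, p_lcs = lcs_sum / m, lcs_sum / n
    if r_lcs + p_lcs == 0:
        return 0.0
    return (1 + beta**2) * r_lcs * p_lcs / (r_lcs + beta**2 * p_lcs)

ref = [["the", "method", "works"], ["results", "improve"]]
cand = [["the", "method", "improves", "results"]]
print(round(rouge_l(ref, cand), 3))   # 0.667 on this toy pair
```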
C. Lecture Speech Recognition Baseline System

1) Language Modeling: Chinese word segmentation is performed on the training corpora using the HIT IR Lab Chinese Word Segmenter [30]. Vocabulary selection based on maximum likelihood [6] is then applied to the training data to obtain a word list of 6878 words. The out-of-vocabulary (OOV) rate of this word list on the test set is 1.0%. For each training data set, we built one language model with a cutoff threshold of two for the n-grams.
TABLE I PERPLEXITY FOR DIFFERENT LANGUAGE MODELS
TABLE II CHARACTER ACCURACY FOR DIFFERENT ACOUSTIC MODELS
The individual language models are then linearly interpolated and merged to form a single language model. The interpolation weights are computed with a cross-validation approach by dividing all the lecture speech transcriptions into five portions. Table I gives the bigram and trigram perplexities for the language models under different training sets. The interpolated bigram and trigram models are used in our multipass recognition system.

2) Acoustic Modeling: Our system uses tied-state cross-word triphone HMMs that are constructed by decision tree clustering. The system uses up to 3500 tied states in total, and each state contains 16 Gaussian mixture components. For every shift of 10 ms, a 25-ms window of input speech is represented by a feature vector that includes 13 MFCC cepstral parameters (including C0) and their first- and second-order derivatives. Cepstral mean normalization (CMN) is applied to each speech segment. The number of phones in the system is 67: each of the 27 Mandarin initial phones and the silence are modeled by a three-state left-to-right HMM with no state skipping, and each of the 37 Mandarin final phones, noise, and unlabeled English words are modeled by a five-state left-to-right HMM. For the Mandarin final HMMs, state transitions are added such that a minimum of three frames is allowed for matching short finals. During training, an English-to-Mandarin phone mapping is applied to the dictionary so that transcribed English words can be trained. Acoustic model training using the maximum-likelihood criterion is done for three training sets: read speech only, lecture speech only, and read speech plus lecture speech. Table II gives the performance of the acoustic models tested under the supervision of the interpolated bigram language model. Finally, we use Model4 as the acoustic model in our multipass recognition system.

3) Sentence Boundary Detection: For sentence boundary detection in lecture presentations, we first trained Gaussian mixture models (GMMs) for the silence, noise, Mandarin initial speech, Mandarin final speech, and non-English word speech events using the conventional EM algorithm, where each of the GMMs contains 256 components and is represented by
three to seven HMM states. A grammar-based Viterbi decoder is then used to find the GMM sequences with time boundaries. The GMM sequences are then relabeled with speech/non-speech labels. The final speech boundaries are obtained by merging speech labels which are nearby (within 0.2 s) and padding silence (0.1 s) at either end of the merged speech segments. The average length of the automatically created speech segments is 2.2 s, which is shorter than the 3.9 s average length of the manual segments. In addition, we adopt a rule-based segment merging model for sentence boundary adaptation. We then obtain longer sentences with an average length of 3.75 s, which is only 0.15 s shorter than the average length of the manually segmented sentences.

4) Performance of the ASR System: Our ASR system runs in multiple passes. In the first pass, a decoder performs a time-synchronous Viterbi beam search through a lexical tree to produce a 1-best result and a lattice, where context-dependent cross-word triphone HMMs and a word-based bigram language model are applied. Lattice rescoring is then performed using the trigram language model to obtain another 1-best result. A bigram branch and a trigram branch are obtained, and unsupervised acoustic model adaptation with the MLLR approach is applied to each branch. Lattice rescoring is then performed on each branch using the adapted acoustic model, producing new recognition results. The recognized text from the branches is then mixed and used for unsupervised trigram language model adaptation. A final re-decoding is done using the adapted acoustic model and the adapted trigram language model. We obtained 69.7% and 70.3% accuracy for manually and automatically segmented sentences, respectively. We built our ASR system using HTK [22].

IV. CHARACTERISTICS OF LECTURE SPEECH

A. Acoustic/Phonetic Characteristics of Lecture Speech

For summarizing broadcast news, [3] suggested that acoustic features are useful for extracting salient sentences. [23] also used acoustic features such as F0 and energy features for speech summarization of spontaneous speech. We argue that, for summarizing lecture speech, the relative contribution of acoustic/prosodic features versus linguistic features may not be the same as in either news speech or conversational speech. First, lecture speech differs greatly from broadcast news due to speaker variability. Most broadcast news consists of speech by anchors and reporters who are professionally trained to use prosody to emphasize important points [2]. Lecture speakers, on the other hand, have a wider range of speaking styles, as many are not trained speakers. Second, lecture speech is planned and is less spontaneous than conversational speech. A typical lecture speaker (in a classroom, at a conference), facing a receptive audience, often sounds dull and monotonic compared to a speaker in conversation. Unlike conversational speech, there are often long sentences in lecture speech delimited by only a short pause [24].

Acoustic/prosodic features in a speech summarization system are usually extracted from the audio data. Researchers commonly use acoustic/prosodic variation (changes in pitch, intensity, and speaking rate) and the duration of pauses to tag the important content of speech [25].
TABLE III ACOUSTIC/PROSODIC FEATURES
TABLE IV LINGUISTIC FEATURES
We also investigate these features for their efficiency in predicting the summary sentences of lecture presentations. Our acoustic feature set contains 12 features: DurationI, SpeakingRate, F0 I, F0 II, F0 III, F0 IV, F0 V, E I, E II, E III, E IV, and E V. Considering the speaker variability in lecture speech, we also extract speaker-normalized acoustic features: PF0 I, PF0 II, PF0 IV, PE I, PE II, and PE IV. We describe these features in Table III. We calculate DurationI from the annotated manual transcriptions that are aligned with the audio documents. We then obtain SpeakingRate by phonetic forced alignment with HTK [26]. Next, we extract the F0 features and energy features from the audio data using the Praat tool [27]. We perform cross-validation experiments on the training data, using a baseline summarizer as described in Section V-B, to select the appropriate features with LIBSVM [28].
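To illustrate the kind of speaker normalization behind the PF0/PE features, the sketch below z-scores each sentence-level F0 or energy statistic against that speaker's own distribution. This is a minimal sketch under our own assumption about the normalization scheme; the paper does not spell out the exact formula, and the function name is ours:

```python
import numpy as np

def speaker_normalize(values, speaker_ids):
    """Z-score each sentence-level feature value against the
    statistics of its own speaker (assumed normalization scheme)."""
    values = np.asarray(values, dtype=float)
    ids = np.asarray(speaker_ids)
    out = np.empty_like(values)
    for spk in np.unique(ids):
        mask = ids == spk
        mu, sigma = values[mask].mean(), values[mask].std()
        # Guard against a degenerate (constant) speaker distribution.
        out[mask] = (values[mask] - mu) / (sigma if sigma > 0 else 1.0)
    return out

# Toy usage: mean F0 per sentence for two speakers with different ranges.
f0_mean = [210.0, 230.0, 190.0, 120.0, 135.0, 110.0]
speakers = ["A", "A", "A", "B", "B", "B"]
pf0 = speaker_normalize(f0_mean, speakers)  # now comparable across speakers
```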
B. Linguistic Characteristics of Lecture Speech

Reference [29] showed that prosodic models outperform language models in topic segmentation for broadcast news. Previous work on broadcast news summarization has even shown that salient sentences can be found based on their acoustic and structural features alone, without linguistic features [3], [21]. However, we argue that since lecture speech is prosodically less stylistic than broadcast news, the relative contribution of linguistic features might be more important in summarization. As in text summarization, linguistic information can help us predict the summary sentences. We extract eight linguistic features from the transcriptions: Len I, Len II, Len III, TFIDF, and Cosine. We describe these features in Table IV. TFIDF is the summation of the tf-idf value of each word in the sentence. We calculate the tf and idf of the target word by (8) and (9):

$$tf_w = \frac{c_w}{\sum_{k} c_k} \quad (8)$$

with $c_w$ being the number of occurrences of the target word $w$ and the denominator being the number of occurrences of all words in a presentation, and

$$idf_w = \log \frac{N_s}{n_w} \quad (9)$$

where $N_s$ is the total number of sentences in the target presentation and $n_w$ is the number of sentences in which the word $w$ appears.

We extract all linguistic features from the manual and the ASR transcriptions, respectively. For calculating the length features, we segment the Chinese words of these transcriptions, using an off-the-shelf open-source Chinese lexical analysis system, the HIT IR Lab Chinese Word Segmenter [30], to process our corpora.

C. Discourse Characteristics of Lecture Speech

Most methods for text summarization mainly utilize linguistic and discourse features. We are interested in investigating the role of discourse features in comparison to rhetorical structure modeling using our approach. The probability distributions of words in texts can be adequately estimated by Poisson mixtures [31]. Noun words, which may be primitive organizers of written text, also follow a Poisson distribution [32]. Based on these findings, we extract the discourse feature Poisson Noun [33], described in (10), which contains discourse information, based on the section boundaries of each presentation:

$$\mathrm{PoissonNoun}(s_i) = \frac{1}{n_i} \sum_{w \in s_i} \frac{\lambda_w^{k}\, e^{-\lambda_w}}{k!} \quad (10)$$

In (10), $n_i$ is the number of noun words in the sentence $s_i$, which belongs to the section $d$; $\lambda_w$ is the frequency of the word $w$ in the section $d$; and $k$ means that the word appears for the $k$th time within the section. The Poisson Noun feature is based on the following assumptions. First, if a sentence contains new noun words, it probably contains new information; a noun word's Poisson score varies according to its position, and we use the Poisson distribution to approximate this variation. Second, if a noun word occurs frequently, it is likely to be more important than other noun words, and the sentences with these high-frequency noun words should be included in a summary.

V. APPROACH 1: SEGMENTAL SUMMARIZATION OF LECTURE SPEECH

Having seen in Section II that the rhetorical structure of lecture speech is rendered by acoustic features combined with linguistic features, we propose a first approach of segmental summarization of lecture speech by automatically extracting segment boundaries.

A. Extracting Section Boundaries by K-Means Clustering

The extraction process by K-means clustering is described in Algorithm 1.
Algorithm 1 Extracting segmental structure of lecture speech

For training data:
STEP 1: Split each presentation's slides into three sections (introduction, content, conclusion); then calculate one tfidf vector representing each section, over the word list extracted from the text content of the slides of the training data.
STEP 2: Calculate one tfidf vector representing each sentence of the transcriptions, over the word list extracted from the text content of the transcriptions of the training data.
STEP 3: Calculate the cosine similarity between the tfidf vectors of the transcription sentences and the tfidf vectors of the slide sections; then find the section boundaries based on the cosine similarity distribution.

For test data:
STEP 1: Initialize the section boundaries with the ad hoc initial parameters obtained from the training data, e.g., 30% introduction, 40% content, and 30% conclusion.
STEP 2: Extract the discourse feature Poisson Noun [33], described in (10), based on the current section boundaries.
STEP 3: Extract acoustic and linguistic features from each sentence of the transcriptions.
STEP 4: Represent each sentence by one vector containing all features and project each sentence vector into two dimensions using principal component analysis (PCA).
STEP 5: Use K-means to cluster the two-dimensional sentence vectors into three groups and produce new section boundaries.
STEP 6: Repeat from STEP 2 until the section boundaries remain the same.
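The test-time loop of Algorithm 1 (STEPs 2 through 6) can be sketched as follows with scikit-learn. This is our own sketch: the feature re-extraction of STEPs 2–3 is abstracted behind a hypothetical `build_features` callback, since the Poisson Noun feature depends on the current boundaries, and the cluster-to-section mapping heuristic is ours:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def extract_boundaries(n_sents, build_features, max_iter=20):
    """Iterate feature extraction, PCA projection, and 3-way K-means
    until the section labels stop changing (STEPs 2-6 of Algorithm 1)."""
    # STEP 1: ad hoc initial split, e.g., 30% / 40% / 30%.
    a, b = int(0.3 * n_sents), int(0.7 * n_sents)
    labels = np.array([0] * a + [1] * (b - a) + [2] * (n_sents - b))
    for _ in range(max_iter):
        X = build_features(labels)                   # STEPs 2-3: recompute features
        Y = PCA(n_components=2).fit_transform(X)     # STEP 4: project to 2-D
        new = KMeans(n_clusters=3, n_init=10).fit_predict(Y)  # STEP 5
        # K-means cluster ids are arbitrary, so map them to sections
        # by the average position of their sentences in the document.
        order = np.argsort([np.mean(np.where(new == c)[0]) for c in range(3)])
        new = np.argsort(order)[new]
        if np.array_equal(new, labels):              # STEP 6: converged
            break
        labels = new
    return labels
```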
B. Segmental Summarization of Lecture Speech

We summarize lecture speech by extracting and concatenating salient sentences or segments from the lecture speech audio and its transcription. Based on the analysis in Section II, we segment the transcriptions into three sections by K-means clustering. We then train three summarization models for the different sections. Considering that the transcriptions of the test data do not have corresponding PowerPoint slides, we split the transcriptions of the test data into three sections by using Algorithm 1. We then build three segmental summarizers and extract the summary of each section with its summarizer. Finally, we concatenate the three summaries into a global summary. For comparison, we also build a conventional single summarizer, a binary classifier that estimates whether each sentence is a summary sentence or not and produces a global summary without segmentation of the transcriptions. We find that the segmental summarizers, i.e., extractive summarization with segmental structure information, yield better performance, as shown in Table V(B): a 0.6736 ROUGE-L F-measure produced by the segmental summarizer, 4.15% higher than the best performance produced by the single summarizer, even when the discourse feature is used by the latter.

VI. APPROACH 2: RHETORICAL STATE HMM FOR LECTURE SPEECH SUMMARIZATION

A. Extracting Rhetorical Structure by RSHMMs

The previous approach of segmental summarization, described in Section V-B, showed us that rhetorical segments are indeed helpful. Looking further, as illustrated in Fig. 2, rhetorical structure is in fact a hierarchical structure. In view of this, we propose a second approach: building Rhetorical-State Hidden Markov Models whose state transitions represent several kinds of rhetorical relations, to better model this rhetorical structure.

Fig. 7. Spoken document representation with RSHMMs.

We use RSHMMs to model the underlying rhetorical structure of the transcribed document $D$, which consists of a sequence of $T$ recognized sentences from the ASR output, $D = \{s_1, s_2, \ldots, s_T\}$. Fig. 7 shows the concatenation of RSHMMs to represent a spoken document. Each RSHMM state contains a probability distribution for the input feature vector $x_t$ obtained from the acoustic and linguistic features of the sentence $s_t$. We use a mixture of multivariate Gaussians as the probability distribution:

$$b_j(x_t) = \sum_{m=1}^{M} c_{jm}\, \mathcal{N}(x_t; \mu_{jm}, \Sigma_{jm}) \quad (11)$$

where $M$ is the number of mixture components in the state, $c_{jm}$ is the weight of the $m$th component, and $\mathcal{N}(x_t; \mu_{jm}, \Sigma_{jm})$ is a multivariate Gaussian with mean vector $\mu_{jm}$ and covariance matrix $\Sigma_{jm}$ over the acoustic and linguistic features:

$$\mathcal{N}(x; \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^d |\Sigma|}} \exp\left( -\frac{1}{2}(x - \mu)^{T} \Sigma^{-1} (x - \mu) \right) \quad (12)$$

Given that the spoken documents used in this work are lecture presentations, and assuming that these presentations consistently follow a rhetorical structure containing $L$ sections, $L$ HMMs are built to represent the respective sections. Each HMM is represented by three states, roughly corresponding to the beginning, the middle, and the ending part of a rhetorical "paragraph." Each of the states contains several
Gaussian components. We trained each of the HMMs by performing Viterbi initialization followed by Baum–Welch re-estimation using the forward-backward algorithm. We then place the trained HMMs into a sequential network structure. We finally use the Viterbi algorithm to find the best rhetorical unit sequence for document $D$, with each sentence $s_t$ represented by its feature vector $x_t$. This is equal to finding the best state sequence $Q^* = \{q_1^*, \ldots, q_T^*\}$:

$$Q^* = \mathop{\arg\max}_{Q} P(Q \mid X, \lambda) \quad (13)$$

$$= \mathop{\arg\max}_{Q}\ \pi_{q_1} b_{q_1}(x_1) \prod_{t=2}^{T} a_{q_{t-1} q_t}\, b_{q_t}(x_t) \quad (14)$$

Fig. 8. Extractive summarization of lecture speech using RSHMMs.
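For illustration, a log-domain Viterbi pass over sentence feature vectors might look as follows. This is a minimal numpy sketch under our own naming, which assumes the per-state emission log-likelihoods $\log b_j(x_t)$ from the GMMs of (11) have already been computed:

```python
import numpy as np

def viterbi(log_pi, log_A, log_B):
    """Best state sequence per (13)-(14).
    log_pi: (S,) initial log-probs; log_A: (S, S) transition log-probs;
    log_B: (T, S) emission log-likelihoods log b_j(x_t)."""
    T, S = log_B.shape
    delta = log_pi + log_B[0]             # best score ending in each state
    psi = np.zeros((T, S), dtype=int)     # backpointers
    for t in range(1, T):
        scores = delta[:, None] + log_A   # (S, S): previous state -> next state
        psi[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_B[t]
    q = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):         # backtrace from the final state
        q.append(int(psi[t][q[-1]]))
    return q[::-1]                         # one state index per sentence
```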
Finally, we annotate each sentence $s_t$ of the given document with the rhetorical unit label $r_t$ which approximately maximizes the posterior:

$$r_t = f(q_t^*), \quad t = 1, \ldots, T \quad (15)$$

where $f(\cdot)$ is a mapping function from states to rhetorical units, and we have a total of $L$ rhetorical units in a single document. $X$ is the feature vector sequence representing the sentence sequence.

B. Extractive Summarization With Shallow Rhetorical Structure

The previous step in our algorithm assigns each sentence to its place in a particular rhetorical unit, again roughly corresponding to a single PowerPoint slide in a presentation. Next, we want to find the sentences to be classified as summary sentences by using the salient sentence classification function $g$. In this probabilistic framework, the extractive summarization task is equal to estimating the probability $P(y_t = 1 \mid s_t)$ that each sentence $s_t$ is a summary sentence. We propose a novel probabilistic framework, RSHMM-enhanced SVM, for the summarization process [18]. We approximate $P(y_t = 1 \mid s_t)$ with the following expression:

$$P(y_t = 1 \mid s_t) \approx g(x_t, r_t) \quad (16)$$

where $g$ is the salient sentence classification function and $r_t$ can be obtained by (15). We then predict whether sentence $s_t$ is a summary sentence or not by using a probability threshold $\theta_l$, which we set according to the compression ratio of rhetorical unit $l$:

$$P(y_t = 1 \mid s_t) > \theta_l \quad (17)$$

We model $g$ by an SVM classifier with a radial basis function (RBF) kernel, described in (18), following [28]:

$$K(x_i, x_j) = \exp\left(-\gamma \|x_i - x_j\|^2\right) \quad (18)$$

One SVM classifier is trained for each rhetorical unit of the RSHMM network. All the HMMs in our experiments are trained with HTK [26]. The extractive summarization system with rhetorical information is described in Fig. 8.
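A per-unit salient-sentence classifier in the spirit of (16)–(18) can be sketched with scikit-learn as below. This is our own sketch, not the paper's implementation (the paper uses LIBSVM [28] directly); the class name and the per-unit thresholds are ours, and the probability column indexing assumes binary labels {0, 1}:

```python
import numpy as np
from sklearn.svm import SVC

class UnitSummarizer:
    """One RBF-kernel SVM per rhetorical unit; a sentence is kept
    when its posterior exceeds the unit's threshold, per (16)-(17)."""
    def __init__(self, n_units, thresholds):
        self.models = [SVC(kernel="rbf", probability=True) for _ in range(n_units)]
        self.thresholds = thresholds      # e.g., tied to each unit's compression ratio

    def fit(self, X, units, y):
        for l, model in enumerate(self.models):
            idx = units == l
            model.fit(X[idx], y[idx])     # train on sentences of unit l only
        return self

    def summarize(self, X, units):
        keep = np.zeros(len(X), dtype=bool)
        for l, model in enumerate(self.models):
            idx = np.where(units == l)[0]
            if len(idx) == 0:
                continue
            p = model.predict_proba(X[idx])[:, 1]   # P(summary | sentence)
            keep[idx] = p > self.thresholds[l]
        return keep
```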
We show examples of extractive summaries produced by different summarization methods in Figs. 9 and 10.

VII. EXPERIMENTS AND EVALUATION

A. Experiment Setup

We choose the 71 presentations with well-formatted PowerPoint slides from the corpus described in Section III-A. Reference summaries are generated from the PowerPoint slides and presentation transcriptions using RDTW followed by manual correction. For evaluating the different summarization systems and the effect of recognition errors on extractive summarization results, we perform two groups of experiments: (I) a set of tenfold cross-validation experiments based on manual transcriptions, and (II) a held-out experiment which uses 34 presentations as training data and six presentations as test data, based on ASR transcriptions. We randomly select a subset as the test data each time. We set $M$ for the First-$M$ summarizer, which selects the first $M$ sentences of each rhetorical unit, and $N$ for the First-$N$ summarizer, which selects the first $N$ sentences of the transcription.

B. Summarization Performance

1) Experiments (I): We perform tenfold cross-validation experiments on the manual transcriptions. First, we divide the training set into ten subsets of equal size. We use nine subsets to train summarizers with several supervised methods, as listed in Table V(A), and use the remaining subset for testing. We then evaluate the performance using the ROUGE-L F-measure. The average performance of these tenfold cross-validation experiments is shown in Table V. First, Table V shows that linguistic features rank higher than acoustic features, but lower than speaker-normalized acoustic features, in most experiments. This shows that, for lecture speech, speaker-normalized acoustic features together with linguistic features produce more accurate summaries. Second, Table V(B) shows that RSHMMs with segmental SVMs yield the best performance: a 0.7131 ROUGE-L F-measure, which is 8.24% absolute higher than that of the single SVM and 8.1% absolute higher than that of the single SVM with the discourse feature. We also find that, using RSHMMs for segmentation, the First-$M$ summarizer produces the best performance: a 0.4333 ROUGE-L F-measure, higher than that of the other unsupervised summarization methods.
Fig. 9. Example of extractive summaries by using different summarization methods (a).

Fig. 10. Example of extractive summaries by using different summarization methods (b).
2) Experiments (II): We perform a held-out experiment on the ASR transcriptions. We choose 34 presentations as training data and six presentations as test data. We then evaluate summarization performance on the held-out set in terms of ROUGE-1, ROUGE-2, and ROUGE-L F-measure. The best performance produced by the different summarization methods is shown in Table VI. From Table VI, we clearly see that segmental summarization plus RSHMMs consistently outperforms the other summarization methods in terms of ROUGE-1, ROUGE-2, and ROUGE-L F-measure. This finding is in line with the results of the cross-validation experiments. We perform a statistical significance test according to Koehn's method [34]. We repeat the process of Experiments (II) 20 times. The results from the 20 comparisons show that segmental summarization plus RSHMMs consistently
outperforms the other summarization methods.

TABLE V
EXPERIMENTS (I): TENFOLD CROSS-VALIDATION EXPERIMENTS. (A) UNSUPERVISED SUMMARIZATION METHODS. (B) COMPARING UNSUPERVISED SUMMARIZATION METHODS WITH SUPERVISED ONES
TABLE VI EXPERIMENTS(II): EVALUATION OF HELD-OUT EXPERIMENT
In other words, we can conclude that segmental summarization plus RSHMMs outperforms the other methods in all 20 comparisons, i.e., with statistical significance. Furthermore, considering that the ASR accuracy is only 70.3%, our segmental summarization plus RSHMMs still produces very high performance: a ROUGE-L F-measure of 0.6302 under manual sentence segmentation. Upon error analysis, we find that 91.76% of all misrecognized sentences, i.e., those containing substitution, insertion, or deletion errors, consist of meaningless characters or words. These sentences often do not bear the core content of the lecture presentations.

VIII. CONCLUSION AND DISCUSSION

In this paper, we have shown that the rhetorical structure in lecture speech is important for summarization, and that such structure is rendered by a combination of acoustic and linguistic features. We first proposed an approach of segmental summarization by automatically segmenting lecture speech into "paragraphs" using acoustic and linguistic features. In view of the fact that rhetorical structure in speech is inherently hierarchical, we then proposed a second approach, Rhetorical-State HMMs, to help SVM classifiers automatically extract summaries from lecture speech. The performance of each of these two approaches is superior to that of the baseline summarization system without rhetorical information. In particular, in the RSHMM framework, the system produced a ROUGE-L F-measure of 0.7131, which represents an 8.24% absolute increase in lecture speech summarization performance compared to the baseline without RSHMMs. We then showed that our RSHMMs are even more helpful for the summarization task than the conventional discourse feature, yielding an 8.1% absolute increase in lecture speech summarization performance. Moreover, we showed that the contribution of speaker-normalized acoustic features is greater than that of the other features. This is due to the fact that, for lecture speech, each speaker has a particular speaking style. In addition, although extractive summarization relies on finding salient sentences in the automatic transcriptions of the lecture speech, we find that summarization performance remains very good despite a 29.7% character error rate. This is because the misrecognized words and characters are mostly function words, stop words, and filled pauses, which are not pertinent to the central message of the lecture speech.

We note that there are other methods, such as Conditional Random Fields (CRFs), that might be more suitable for modeling temporal (and hence sequential) information in a document. This approach has been applied to text summarization with certain success. We propose to investigate CRFs and other approaches in the future, to compare with RSHMMs in terms of their relative power for rhetorical structure modeling. We will also carry out further investigation into a broader set of features for lecture speech summarization tasks. Last but not least, we refer to Rhetorical Structure Theory and hope to extend the currently proposed shallow rhetorical structure models into more powerful ones.

REFERENCES
[1] M. Hirohata, Y. Shinnaka, K. Iwano, and S. Furui, “Sentence extraction-based presentation summarization techniques and evaluation metrics,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP’05), 2005, vol. 1, pp. 1065–1068. [2] S. Maskey and J. Hirschberg, “Comparing lexical, acoustic/prosodic, structural and discourse features for speech summarization,” in Proc. Interspeech’05 (Eurospeech), 2005, pp. 621–624. [3] S. Maskey and J. Hirschberg, “Summarizing speech without text using hidden Markov models,” in Proc. NAACL, 2006, pp. 89–92. [4] H. M. Wang, Y. T. Chen, H. S. Chiu, and B. Chen, “A unified probabilistic generative framework for extractive spoken document summarization,” in Proc. Interspeech’07, 2007, pp. 2805–2808. [5] C. Hori, S. Furui, R. Malkin, H. Yu, and A. Waibel, “Automatic speech summarization applied to English broadcast news speech,” in Proc. ICASSP’02, Orlando, FL, 2002, vol. 1, pp. 9–12. [6] J. J. Zhang, H. Y. Chan, and P. Fung, “Improving lecture speech summarization using rhetorical information,” in Proc. IEEE Workshop on Autom. Speech Recognition Understanding. ASRU’07, 2007, pp. 195–200. [7] P. Fung and G. Ngai, “One story, one flow: Hidden Markov Story Models for multilingual multidocument summarization,” ACM Trans. Speech Lang. Process. (TSLP), vol. 3, no. 2, pp. 1–16, 2006. [8] D. Tatar, E. Tamaianu-Morita, A. Mihis, and D. Lupsa, “Summarization by logic segmentation and text entailment,” Adv. Natural Lang. Process, Applicat., pp. 15–26, 2008. [9] Y. Nemoto, Y. Akita, and T. Kawahara, “PLSA-based topic detection in meetings for adaptation of lexicon and language model,” in Proc. Interspeech’07, 2007, pp. 602–605. [10] M. A. Hearst, “TextTiling: Segmenting text into multi-paragraph subtopic passages,” Comput. Linguist., vol. 23, no. 1, pp. 33–64, 1997. [11] C. H. Nakatani, J. Hirschberg, and B. J. Grosz, “Discourse structure in spoken language: Studies on speech corpora,” in Proc. AAAI 1995 Spring Symp. Series: Empirical Methods in Discourse Interpretation and Generation, 1995, pp. 106–112. [12] S. Maskey and J. Hirschberg, “Automatic summarization of broadcast news using structural features,” in Proc. Eurospeech’03, 2003, pp. 1173–1176. [13] Y. T. Chen et al., “Extractive Chinese spoken document summarization using probabilistic ranking models,” in Proc. ISCSLP, 2006, pp. 660–671. [14] W. C. Mann and S. A. Thompson, Rhetorical Structure Theory: A Theory of Text Organization. Los Angeles, CA: Univ . Southern California, Inf. Sci. Inst., 1987. [15] M. A. K. Halliday, Intonation and Grammar in British English. The Hague, The Netherlands: Mouton, 1967. [16] D. R. Ladd, Intonational Phonology. Cambridge, U.K.: Cambridge Univ. Press, 1996. [17] J. Hirschberg and C. H. Nakatani, “A prosodic analysis of discourse segments in direction-giving monologues,” in Proc. 34th Conf. Assoc. Comput. Linguist., 1996, pp. 286–293. [18] P. Fung, R. Chan, and J. J. Zhang, “Rhetorical-State Hidden Markov Models For Extractive Speech Summarization,” in Proc. Acoust., Speech, Signal Process. (ICASSP’08), 2008, pp. 4957–4960. [19] A. Nenkova, R. Pasonneau, and K. McKeown, “Employing EM in poolbased active learning for text classification,” ACM Trans. Speech Lang. Process., vol. 4, no. 2, May 2007. [20] C. Y. Lin, “Rouge: A package for automatic evaluation of summaries,” in Proc. Workshop Text Summarization Branches Out (WAS 2004), 2004, pp. 25–26. [21] J. J. Zhang and P. Fung, “Speech summarization without lexical features for Mandarin broadcast news,” in Proc. Conf. 
North Amer. Chapter Assoc. Comput. Linguist.: Companion Volume, Human Lang. Technol., Rochester, NY, Apr. 2007, pp. 213–216.
[22] P. C. Woodland, C. J. Leggetter, J. J. Odell, V. Valtchev, and S. J. Young, “The development of the 1994 HTK large vocabulary speech recognition system,” in Proc. ARPA Workshop Spoken Lang. Syst. Technol., 1995, pp. 104–109. [23] S. Furui, T. Kikuchi, Y. Shinnaka, and C. Hori, “Speech-to-text and speech-to-speech summarization of spontaneous speech,” IEEE Trans. Speech Audio Process., vol. 12, no. 4, pp. 401–408, 2004. [24] T. Kawahara, H. Nanjo, and S. Furui, “Automatic transcription of spontaneous lecture speech,” in Proc. IEEE Workshop on Autom. Speech Recognition and Understanding ASRU’01, 2001, pp. 186–189. [25] J. Hirschberg, “Communication and prosody: Functional aspects of prosody,” Speech Commun., vol. 36, no. 1, pp. 31–43, 2002. [26] S. Young, D. Kershaw, J. Odell, D. Ollason, V. Valtchev, and P. Woodland, The HTK Book (for HTK Version 3.0). Cambridge, U.K.: Cambridge Univ. Press, 2000. [27] P. Boersma and D. Weenink, “Praat, A System for Doing Phonetics by Computer, version 3.4,” Inst. Phonetic Sci. Univ. of Amsterdam, Report, vol. 132, p. 182, 1996. [28] C. C. Chang and C. J. Lin, “LIBSVM: A library for support vector machines,” 2001, vol. 80 [Online]. Available: http://www.csie.ntu.edu.tw/ cjlin/libsvm [29] E. Shriberg and A. Stolcke, “Prosody modeling for automatic speech recognition and understanding,” Math. Foundations Speech Lang. Process., pp. 105–114, 2004. [30] H. Z. T. L. J. Ma and X. Liao, “Chinese word segmentation with multiple postprocessors in HIT-IRLab,” in Proc. SIGHAN, 2005, pp. 172–175. [31] K. W. Church and W. A. Gale, “Poisson mixtures,” Natural Lang. Eng., vol. 1, no. 2, pp. 163–190, 1995. [32] A. F. Badalamenti, “Speech parts as Poisson processes,” J. Psycholinguist. Res., vol. 30, no. 5, pp. 497–527, 2001. [33] J. Zhang, H. Chan, P. Fung, and L. Cao, “A comparative study on speech summarization of broadcast news and lecture speech,” in Proc. Eurospeech’07, 2007, pp. 2781–2784. [34] P. Koehn, “Statistical significance tests for machine translation evaluation,” in Proc. EMNLP, 2004, vol. 4, pp. 388–395. Justin Jian Zhang received the B.Eng. degree from the School of Electrical Engineering and Automation in Tianjin University, Tianjin, China, in 2003 and the M.Eng. degree from the School of Software, Tsinghua University, Beijing, China, in 2006. He is currently pursuing the Ph.D. degree at the Human Language Technology Center, Hong Kong University of Science and Technology (HKUST), Hong Kong. He is currently with the Human Language Technology Center, HKUST. He was a member of the Data Mining Group, Institute of Information System and Engineering, Tsinghua University, from 2003 to 2006. His interests include natural language processing and speech understanding and summarization.
Ricky Ho Yin Chan received the B.Eng. and M.Phil. degrees in electronic engineering from the Hong Kong University of Science and Technology (HKUST), Hong Kong, in 1998 and 2001, respectively, and the M.Sc. degree in engineering from Cambridge University, Cambridge, U.K., in 2004. He is a member of the Human Language Technology Center and InterACT at HKUST. His research interests include using novel approaches in language modeling, LVCSR, multilingual speech recognition, and speech applications.
Pascale Fung (SM’96) received the M.S. and Ph.D. degrees in computer science from Columbia University, New York, in 1993 and 1997, respectively. She also studied at the Ecole Centrale Paris, Paris, France, and Kyoto University, Kyoto, Japan. She was formerly a Researcher at AT&T Bell Labs, BBN Systems and Technologies, and LIMSI/CNRS in France. She is currently an Associate Professor in the Department of Electronic and Computer Engineering, Hong Kong University of Science and Technology (HKUST). She cofounded the Human Language Technology Center, and is the Director of InterACT at HKUST. Her research interests include speech summarization, speech translation, and multilinguality in both speech and language processing. She is an Associate Editor of the ACM Transaction on Speech and Language Processing. She is a member of the IEEE Signal Processing Society Speech and Language Technology Committee and a founding board member of SIGDAT of the Association of Computational Linguistics.