HANDLING VERBOSE QUERIES FOR SPOKEN DOCUMENT RETRIEVAL

Shih-Hsiang Lin¹#, Ea-Ee Jan², Berlin Chen¹

¹National Taiwan Normal University, Taiwan
²IBM Thomas J. Watson Research Center, USA

{shlin, berlin}@csie.ntnu.edu.tw, [email protected]

ABSTRACT

Query-by-example information retrieval provides users with a flexible yet efficient way to accurately describe their information needs. The query exemplars are usually long, taking the form of either a partial or even a full document. However, they may contain extraneous terms that can negatively impact retrieval performance. To alleviate these negative impacts, we propose a novel term-based query reduction mechanism to improve the informativeness of verbose query exemplars. We also explore the notion of term discrimination power to automatically select a salient subset of query terms. Experiments on the TDT Chinese collection show that the proposed approach is indeed effective and promising.

Index Terms—Query-by-example, information retrieval, verbose query, term-based query reduction

1. INTRODUCTION

Query-by-example information retrieval, which attempts to retrieve relevant documents when users provide specific query exemplars describing their information needs, has gained considerable attention in recent years [1-3]. This task is especially useful for news monitoring and tracking, where the user can take an entire newswire story in text form as a query to retrieve relevant radio news stories in audio form from a spoken document collection. Alternatively, one may also take an audio clip of interest to retrieve other related audio clips. Although query-by-example information retrieval offers users a flexible yet efficient way to describe their information needs, extraneous (or off-topic) terms contained within the exemplars can sometimes deteriorate the retrieval performance. In recognition of this problem, considerable research effort has been devoted to improving the informativeness of such verbose queries. For example, Bendersky and Croft [4] applied supervised machine learning techniques to identify and re-weight query terms. Lease et al. [5] presented a similar idea, using a regression-based approach to re-weight query terms. Kumaran and Carvalho [6] analyzed all term subsets, also referred to as sub-queries, extracted from the original verbose query. A set of query-quality predictors associated with each sub-query was created after the analysis; machine learning techniques were then leveraged to select the most informative sub-queries in place of the original verbose query. Most of the above studies used supervised learning strategies to re-weight or to select informative query

# This work was conducted during the author’s 2010 summer internship at IBM Thomas J. Watson Research Center.

978-1-4577-0539-7/11/$26.00 ©2011 IEEE


terms from the verbose query. In general, such techniques require highly annotated training queries associated with relevant/irrelevant documents for model training, as well as a set of predefined features, serving as query-quality predictors, that characterize the importance of each query term. The annotation procedure in supervised learning is normally both time-consuming and tedious. Another potential deficiency of such supervised learning strategies is their lack of generalization capability, which makes them not readily applicable to new tasks or domains. In contrast to the above supervised learning approaches, we have recently proposed a “sentence-based” query reduction method that integrates extractive summarization techniques into the retrieval process to improve the informativeness of verbose queries for spoken document retrieval (SDR), without recourse to supervised training or specialized linguistic expertise [7]. To do this, we treat the reduction of a verbose query as a summarization task in which a set of informative sentences is selected from the original query exemplar, hypothesizing that the selected sentences can succinctly reflect the main theme of the original query and thus contribute positively to the retrieval process. In this paper, we extend the previous attempt to “term-based” query reduction, the major reason being that extraneous terms might still remain in the reduced queries composed by the “sentence-based” query reduction methods. For the idea to work, we represent each query exemplar as a network of query terms, where each node represents a query term and the weight of each link between two nodes represents the topical similarity between a pair of terms. We then employ a graph-based centrality algorithm to select a subset of query terms to form an informative, length-reduced query exemplar. The utility of the proposed method is verified by extensive comparisons with other methods.
The rest of this paper is organized as follows. Section 2 elucidates the proposed term-based query reduction algorithm for handling verbose queries. Section 3 describes the spoken document collection compiled for the experiments and the associated experimental setup. A series of experiments and discussions are presented in Section 4. Finally, Section 5 concludes this paper with future work.

2. TERM-BASED QUERY REDUCTION

The goal of improving the informativeness of verbose queries can be fulfilled by selecting a set of informative query terms from the original query exemplar; meanwhile, extraneous terms should be properly restrained to diminish the negative impacts they might have on the retrieval result. Conceptually, this can be cast as a summarization problem in which we want to select salient (or representative) terms from the original verbose queries that can

ICASSP 2011

succinctly describe the main theme of the information needs. To achieve this goal, a “sentence-based” query reduction method was proposed recently [7], which demonstrated significant performance improvements over the baseline retrieval model on the SDR task. However, a potential drawback of that work is that extraneous terms might still remain in the reduced queries, since the system takes sentences as the summarization unit. With this understanding at hand, we change the summarization unit from sentences to terms, and employ a graph-based centrality algorithm to find the prestige query terms that can succinctly represent (and then replace) the original verbose query exemplar. Taking a step further, for each query exemplar, we construct a network composed of all of its query terms, where each node v_i represents a query term and the weight of each link e_ij represents the topical similarity between two distinct query terms v_i and v_j. Due to the insufficiency of contextual information or lexical evidence carried by a single query term, it is very difficult to assess the topical similarity between an arbitrary pair of query terms. One may use the lexical knowledge contained in WordNet, term co-occurrence statistics, or point-wise mutual information (PMI) to determine the term-similarity relationship. In this paper, we adopt the latent semantic analysis (LSA) technique to estimate term similarity in the latent semantic space. Toward this end, we represent the whole document collection to be retrieved as an N × M word-document co-occurrence matrix, where N is the vocabulary size and M is the number of documents in the collection. Then, singular value decomposition (SVD) is conducted on the co-occurrence matrix featured with the term-frequency statistics. Each row vector u_i of the resulting left singular matrix is taken as the term semantic vector for a particular query term w_i.
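The LSA step described above can be sketched as follows, using a toy word-document matrix (the values are illustrative placeholders, not the paper's data):

```python
import numpy as np

# Toy N x M word-document co-occurrence matrix (term-frequency counts);
# a stand-in for the real collection, purely for illustration.
A = np.array([
    [2.0, 0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 2.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 2.0],
])

# SVD of the co-occurrence matrix: A = U S V^T.
U, S, Vt = np.linalg.svd(A, full_matrices=False)

# Keep the k strongest latent dimensions; row i is then the term
# semantic vector u_i for query term w_i.
k = 2
term_vectors = U[:, :k]

def cosine(u, v):
    """Cosine similarity between two term semantic vectors, as in Eq. (1)."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```

The truncation level k is a free parameter of LSA; the paper does not state the value used, so the choice above is arbitrary.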
By doing so, we can use the term semantic vectors to estimate the proximity between any two terms using measures such as the cosine similarity:

e_ij = Sim(v_i, v_j) = (u_i · u_j) / (|u_i| |u_j|),    (1)

where |u_i| is the length of u_i. After constructing the conceptualized network, a graph-based centrality algorithm [8] is then applied to obtain an association score for each query term, which will be leveraged to determine the term's importance subsequently. The above-mentioned network can in fact be viewed as a Markov chain in which the states are the query terms and the corresponding state transition distribution is given by a similarity matrix M built on top of (1). The association score of each query term can be derived by the following equation:

p = [dU + (1 − d)M]^T p,    (2)

where d is a damping factor and U is a square matrix with all of its elements equal to the reciprocal of the query length [8]; p is the stationary distribution vector of the Markov chain, each dimension of which corresponds to a distinct query term. Since this type of Markov chain is irreducible and aperiodic, we can simply use the power method to estimate the corresponding stationary distribution [8]. In other words, solving (2) is equivalent to finding the eigenvector centrality of the network.
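A minimal power-method sketch of Eq. (2); the damping factor and similarity values below are assumptions for illustration, not the paper's settings:

```python
import numpy as np

def association_scores(sim, d=0.15, iters=200):
    """Estimate the stationary distribution of Eq. (2),
    p = [dU + (1 - d)M]^T p, by the power method.

    sim : nonnegative term-similarity matrix built from Eq. (1);
    d   : damping factor (0.15 is an assumed value, not the paper's).
    """
    L = sim.shape[0]
    M = sim / sim.sum(axis=1, keepdims=True)  # row-stochastic transition matrix
    U = np.full((L, L), 1.0 / L)              # uniform matrix (1 / query length)
    T = (d * U + (1.0 - d) * M).T
    p = np.full(L, 1.0 / L)                   # start from the uniform vector
    for _ in range(iters):                    # power iteration
        p = T @ p
        p /= p.sum()                          # guard against numerical drift
    return p

# Example: three query terms; terms 0 and 1 are topically close.
sim = np.array([[1.0, 0.8, 0.1],
                [0.8, 1.0, 0.2],
                [0.1, 0.2, 1.0]])
p = association_scores(sim)
```

Because the chain is irreducible and aperiodic, the iteration converges to the unique stationary distribution regardless of the starting vector.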


The resulting stationary distribution vector p can be viewed as a term association vector for finding informative query terms. In general, a query term with a higher association score in p is more similar to the other terms, and it should no doubt be kept in the reduced query. Conversely, a query term that is more dissimilar to the others (i.e., one with a lower association score) should also be retained in the reduced query, since its semantic role cannot be represented by the others. In other words, we can define two thresholds (in accordance with the ratio of the length of the reduced query to that of the original one) to discard the query terms with middle-range scores while keeping the terms having the highest or the lowest scores. The rationale behind the method is that the query terms having higher associations with others can increase recall, while precision can be improved by retaining the query terms with much lower association scores. Even though we have developed an algorithm to automatically derive the association score for each query term, we still lack a mechanism to automatically determine the thresholds for term selection. Intuitively, an ideal query should reflect the user's information need as much as possible, so as to discriminate the relevant documents from the irrelevant ones. Following information theory, we take the entropy measure as an indicator of the informativeness of a length-reduced query: if the query contains ample information, it will have a higher entropy value; otherwise, it will have a lower entropy value. To realize this intuition, we first define the discrimination power I(q) of each query term q:

I(q) = (IDF_q · C(q, Q̃)) / Σ_{q'∈Q̃} (IDF_{q'} · C(q', Q̃)),    (3)

where IDF_q is the inverse document frequency (IDF) of the query term q, which penalizes q if it is common in the document collection, and C(q, Q̃) is the number of times that q occurs in the length-reduced query Q̃. We then employ the following equation to select, among the possible reduced queries, the one Q̃* that is the most informative:



Q̃* = arg max_{Q̃ ∈ ψ(Q)} [ −Σ_{q∈Q̃} I(q) log I(q) ],    (4)

where ψ(Q) denotes the set of all possible length-reduced queries for a verbose query Q (i.e., the candidates obtained by varying the selection thresholds). We expect that Q̃* will be effective in discriminating the relevant documents from the irrelevant ones.
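The candidate generation and entropy-based selection of Eqs. (3)-(4) can be sketched as follows; the ratio grid mirrors the 10%-50% settings described in Section 4, and the IDF values in any real use would come from the document collection:

```python
import math
from collections import Counter

def discrimination_power(reduced_query, idf):
    """I(q) of Eq. (3): IDF-weighted relative frequency of each term q
    in the length-reduced query (a list of terms)."""
    counts = Counter(reduced_query)
    denom = sum(idf[t] * c for t, c in counts.items())
    return {t: idf[t] * c / denom for t, c in counts.items()}

def entropy(reduced_query, idf):
    """Entropy objective of Eq. (4): -sum_q I(q) log I(q)."""
    I = discrimination_power(reduced_query, idf)
    return -sum(v * math.log(v) for v in I.values())

def reduction_candidates(terms, scores, ratios=(0.1, 0.2, 0.3, 0.4, 0.5)):
    """Candidate reduced queries: for each (hi, lo) ratio pair, keep the
    hi fraction of terms with the highest association scores and the lo
    fraction with the lowest, discarding the middle (cf. Section 2)."""
    order = sorted(range(len(terms)), key=lambda i: scores[i])  # ascending
    n = len(terms)
    for hi in ratios:
        for lo in ratios:
            keep = set(order[:int(n * lo)]) | set(order[n - int(n * hi):])
            if keep:  # skip degenerate empty candidates
                yield [t for i, t in enumerate(terms) if i in keep]

def best_reduction(candidates, idf):
    """Eq. (4): pick the candidate reduced query with the highest entropy."""
    return max(candidates, key=lambda q: entropy(q, idf))
```

With uniform IDF values, the entropy objective simply favors the candidate spreading probability mass over the most terms; non-uniform IDF skews the choice toward rarer terms.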

3. EXPERIMENTAL SETUP

3.1. Corpus

We used the Topic Detection and Tracking (TDT-2) collection [9] for this work. TDT is a DARPA-sponsored program in which participating sites tackle tasks such as identifying the first time a news story is reported on a given topic, or grouping news stories with similar topics from audio and textual streams of newswire data. In this paper, we used the Mandarin Chinese collection of the TDT corpora for the retrospective retrieval task, such that the statistics for the entire document collection were obtainable. The Chinese text news stories from Xinhua News

Agency were compiled to form the test queries (or query exemplars). Moreover, we extracted the title field of each text news story as a short query for comparison purposes. The Mandarin audio news stories from Voice of America news broadcasts were used as the spoken documents. Table 1 shows basic statistics of the TDT-2 corpus. To assess the recognizer performance, we spot-checked a fraction of the spoken document collection (approximately 39.90 hours); the error rates for words, characters, and syllables are 35.38%, 17.69%, and 13.00%, respectively. The retrieval results are expressed in terms of non-interpolated mean average precision (mAP) following the TREC evaluation [10].
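For reference, non-interpolated mAP can be sketched as below; the document identifiers are made up for the example:

```python
def average_precision(ranked_docs, relevant):
    """Non-interpolated average precision for one query: the mean of
    precision@k over the ranks k at which a relevant document appears,
    divided by the total number of relevant documents."""
    hits, precisions = 0, []
    for k, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """mAP over a query set; runs is a list of (ranking, relevant_set)."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```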

3.2. Retrieval Model

This paper uses probabilistic language modeling (LM) to construct the retrieval model for spoken document retrieval [11]. To do this, each document is treated as a language model for generating a given query, and the ranking of spoken documents (or the relevance measure) is based on the likelihood of the query Q being generated by each spoken document D, i.e., P(Q|D). If the query Q is treated as a sequence of terms, Q = q_1, q_2, ..., q_L, where the query terms are assumed to be conditionally independent given the document D and their order is assumed to be of no importance (i.e., the so-called "bag-of-words" assumption), the relevance measure P(Q|D) can be decomposed into a product of the probabilities of the query terms being generated by the document:

P(Q|D) = Π_{q_i∈Q} P(q_i|D)^{c(q_i,Q)},    (5)

where c(q_i, Q) is the number of times that query term q_i occurs in Q, and P(q_i|D) is the likelihood of D generating q_i (a.k.a. the document model). Here, we consider two variants of the LM approach for constructing the document model of each document D. One is the unigram language model (ULM), in which each document offers a unigram distribution for observing a query term; the distribution is estimated on the basis of the terms occurring in the document and is further interpolated with a background unigram language model for probability smoothing. The other is the document topic modeling (DTM) approach [12], which calculates the query likelihood based on the frequency of q_i occurring in a given latent topic as well as the likelihood that D generates the respective topic. Here, DTM is implemented with latent Dirichlet allocation (LDA) [13].
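A minimal sketch of the ULM variant of Eq. (5), computed in log space; the interpolation weight and the toy documents are assumptions, not the paper's configuration:

```python
import math
from collections import Counter

def query_log_likelihood(query_terms, doc_terms, collection_terms, lam=0.5):
    """log P(Q|D) per Eq. (5), with the document unigram model
    interpolated with a background (collection) model for smoothing.
    lam is an illustrative smoothing weight."""
    doc = Counter(doc_terms)
    bg = Counter(collection_terms)
    dlen, clen = sum(doc.values()), sum(bg.values())
    score = 0.0
    for q, c in Counter(query_terms).items():
        p = lam * doc[q] / dlen + (1.0 - lam) * bg[q] / clen
        if p == 0.0:            # term unseen even in the collection
            return float('-inf')
        score += c * math.log(p)
    return score

# Example ranking: the document sharing terms with the query scores higher.
query = ['apple', 'pie']
d1 = ['apple', 'pie', 'recipe']
d2 = ['car', 'engine', 'oil']
collection = d1 + d2
s1 = query_log_likelihood(query, d1, collection)
s2 = query_log_likelihood(query, d2, collection)
```

The background interpolation is what keeps s2 finite here: without it, any query term absent from a document would zero out the whole product in Eq. (5).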

4. EXPERIMENTAL RESULTS

The baseline retrieval results obtained by ULM and DTM are shown in Table 2. The retrieval results obtained when manual transcripts of the spoken documents to be retrieved are available (denoted by TD, text documents) are also listed for reference, alongside the results obtained when only the erroneous transcripts generated by speech recognition are available (denoted by SD, spoken documents). Furthermore, all retrieval models are trained without supervision. The number of topics used for DTM is set to 32, which was determined from our previous experiments [12]. Looking at Table 2, we see that the retrieval performance of ULM (which can be thought of as a kind of literal term matching) with the original verbose query exemplars is 0.639 and 0.562, respectively, for the TD and SD cases. However, when the short queries (i.e., the title fields of the original query exemplars) are

# Spoken documents: 2,265 stories (46 hrs of audio)
# Distinct test queries: 16 Xinhua text stories

                                           Min.   Max.   Med.   Mean
Document length (in characters)             23    4841    153    287
Length of verbose query (in characters)    183    2623    329    533
Length of short query (in characters)        8      27     13     14
# Relevant documents per test query          2      95     13     29

Table 1: Statistics for the TDT-2 collection used for spoken document retrieval.

                 Verbose Queries   Short Queries
ULM   TD              0.639            0.379
      SD              0.562            0.293
DTM   TD              0.644            0.438
      SD              0.590            0.407

Table 2: Baseline retrieval results (in mAP) achieved by different modeling approaches.

being used, the retrieval performance drops dramatically for both cases, leading to a relative performance degradation of about 45% compared to retrieval using the original query exemplars. Meanwhile, the performance gap between the TD and SD cases is about 7% absolute in terms of mAP when using either long or short queries, even though the word error rate (WER) for the spoken document collection is higher than 35%. Further, with the original verbose query exemplars, the baseline retrieval results for DTM (which can be thought of as a kind of concept matching) are better than those of ULM; DTM also outperforms ULM when using the short queries. We would, therefore, expect that DTM will still achieve consistent improvements over ULM when other extra information cues are properly incorporated.

In the next set of experiments, we assess the utility of the proposed "term-based" query reduction method (cf. Section 2). The experiments are designed to investigate the impact of different threshold settings on the retrieval performance. To this end, we empirically set two thresholds, each ranging from 10% to 50% in increments of 10%, for selecting the query terms that have the highest associations with others and those that have the lowest associations with others, respectively. For example, we can select the query terms with the highest associations that constitute 40% of the original query, and simultaneously select the query terms with the lowest associations that constitute 30% of the original query, to form a length-reduced query. The best results are shown in the "Manual" column of Table 3. This optimum setting demonstrates around 1-3% absolute performance improvement over the baseline results shown in Table 2.
If we further analyze the average length reduction rate of the resulting length-reduced queries under the optimum setting, we find that a reduction rate ranging from 10% to 30% can be achieved while yielding better retrieval performance. Put another way, even with fewer query terms, the length-reduced query exemplars still achieve better retrieval performance than the original ones. It should be


noted here that, unlike conventional summarization tasks, which aim at compressing the document length, the main purpose of this work is to improve the informativeness of the verbose queries (or, more specifically, the final retrieval performance). Therefore, the reduction rate is not a main concern of this research. As is evident from the retrieval results, the "term-based" query reduction method can indeed help to retain the most useful information cues regarding the topic of interest, and thus boost the retrieval performance.

In the third set of experiments, we utilize (4) to automatically determine the thresholds for informative term selection; the corresponding results are shown in the "Automatic" column of Table 3. As can be seen, the "entropy-based" selection approach (cf. Section 2) yields even better retrieval performance than that of the previous set of experiments. One possible explanation is that the two thresholds used in the previous set of experiments are set to identical values across all query exemplars (e.g., corresponding to length reduction ratios of 10%, 20%, etc., for the query terms that have the highest associations with others), whereas the "entropy-based" selection approach has the ability to automatically adjust the thresholds according to the information carried by each individual query exemplar. It is worth mentioning that each query exemplar thus has a different length reduction rate; on average, each length-reduced query contains about 315 characters (equivalently, a length reduction rate of 35% is achieved). From Table 3, it can be seen that the best retrieval results are obtained by pairing the length-reduced queries with DTM. Compared to the baseline results obtained by using the original verbose query exemplars with DTM, this yields relative mAP improvements of about 10% for both the TD and SD cases.
Going a step further, when analyzing the correlation between the entropy value and the retrieval performance, we find that the corresponding correlation coefficient is about 0.60; this to some extent reveals that the entropy measure can serve as an indicator for choosing informative length-reduced queries. In the final set of experiments, we compare the proposed "term-based" query reduction method with the widely used stop word removal method [3-4] and our previously proposed "sentence-based" query reduction method [7] (the latter two methods are also constructed in an unsupervised manner). The corresponding results are shown in Table 4. Compared to the results shown in the "Automatic" column of Table 3, we can see that the proposed "term-based" query reduction method substantially outperforms the other two methods. We have also explored coupling our proposed method with the stop word removal method; however, only moderate improvements are observed. One possible reason is that the background model used in the retrieval model (cf. Section 3.2) in some sense functions as a mechanism for de-emphasizing the contributions of stop words to document ranking.

5. CONCLUSIONS

In this paper, we have proposed a "term-based" query reduction method to improve the informativeness of verbose query exemplars for spoken document retrieval. We have also presented an "entropy-based" selection criterion to compose the reduced queries automatically. Significant improvements in retrieval effectiveness seem to verify the utility of the proposed method. A


                 Baseline   Manual   Automatic
ULM   TD          0.639      0.654     0.679
      SD          0.562      0.584     0.586
DTM   TD          0.644      0.674     0.682
      SD          0.590      0.628     0.633

Table 3: Retrieval results (in mAP) achieved by pairing ULM and DTM, respectively, with the length-reduced queries.

                 Stop Word Removal   Sentence-based Reduction [7]
ULM   TD              0.643                    0.650
      SD              0.553                    0.576
DTM   TD              0.652                    0.654
      SD              0.596                    0.611

Table 4: Retrieval results (in mAP) achieved by using the stop word removal and sentence-based reduction approaches, respectively.

striking feature of the proposed method is that no supervised training is required. Our future research directions include: 1) integrating the retrieval model or the summarization model with more elaborate representations of the speech recognition output; 2) comparing the proposed method with other supervised query reformulation methods on large-scale SDR tasks; and 3) applying the proposed reduction method to other retrieval stages such as relevance feedback.

6. REFERENCES

[1] H. Meng et al., "Mandarin-English Information (MEI): Investigating translingual speech retrieval," Computer Speech and Language, vol. 18(2), pp. 163-179, 2004.
[2] B. Chen et al., "A discriminative HMM/n-gram-based retrieval approach for Mandarin spoken documents," ACM TALIP, vol. 3(2), pp. 128-145, 2004.
[3] T. K. Chia et al., "Statistical lattice-based spoken document retrieval," ACM TOIS, vol. 28(1), Article 2, 2010.
[4] M. Bendersky and W. B. Croft, "Discovering key concepts in verbose queries," in Proc. SIGIR 2008.
[5] M. Lease and J. Allan, "Regression rank: Learning to meet the opportunity of descriptive queries," in Proc. SIGIR 2009.
[6] G. Kumaran and V. R. Carvalho, "Reducing long queries using query quality predictors," in Proc. ECIR 2009.
[7] S. H. Lin, B. Chen, and E. E. Jan, "Improving the informativeness of verbose queries using summarization techniques for spoken document retrieval," in Proc. ISCSLP 2010.
[8] G. Erkan and D. R. Radev, "LexRank: Graph-based lexical centrality as salience in text summarization," J. of Artificial Intelligence Research, vol. 22, pp. 457-479, 2004.
[9] LDC, "Project Topic Detection and Tracking," Linguistic Data Consortium, 2000. http://www.ldc.upenn.edu/Projects/TDT/.
[10] D. Harman, "Overview of the fourth Text REtrieval Conference (TREC-4)," in Proc. TREC 1995.
[11] J. M. Ponte and W. B. Croft, "A language modeling approach to information retrieval," in Proc. SIGIR 1998.
[12] B. Chen, "Latent topic modeling of word co-occurrence information for spoken document retrieval," in Proc. ICASSP 2009.
[13] D. M. Blei et al., "Latent Dirichlet allocation," J. of Machine Learning Research, vol. 3, 2003.