Language Models for XML Element Retrieval

Rongmei Li (University of Twente, Enschede, The Netherlands)
Theo van der Weide (Radboud University, Nijmegen, The Netherlands)

Abstract. In this paper we describe our participation in the INEX 2009 ad-hoc track. We participated in all four retrieval tasks (Thorough, Focused, Relevant in Context, Best in Context) and report initial findings based on a single set of measures for all tasks. In this first participation, we test two ideas: (1) evaluate the performance of a standard IR engine used for full document retrieval and for XML element retrieval; (2) investigate whether document structure can lead to more accurate and focused retrieval results. We find that: 1) full document retrieval outperforms XML element retrieval using a language model based on Dirichlet priors; 2) the element relevance score itself can be used to remove overlapping element results effectively.

1 Introduction

INEX offers a framework for cross-comparison among content-oriented XML retrieval approaches given the same test collections and evaluation measures. The INEX ad-hoc track evaluates system performance in retrieving relevant document components (e.g. XML elements or passages) for a given topic of request. The relevant results should discuss the topic exhaustively and contain as little non-relevant information as possible (i.e. be specific to the topic). The ad-hoc track includes four retrieval tasks: the Thorough task, the Focused task, the Relevant in Context task, and the Best in Context task. The 2009 collection is the English Wikipedia in XML format. The ad-hoc topics are created by the participants to represent real-life information needs. Each topic consists of five fields. The <title> field (CO query) is the same as a standard keyword query. The <castitle> field (CAS query) adds structural constraints to the CO query by explicitly specifying where to look and what to return. The <phrasetitle> field (phrase query) presents an explicitly marked-up query phrase. The <description> and <narrative> fields provide more information about the topical context. In particular, the <narrative> field is used for relevance assessment.

This paper documents our first participation in the INEX 2009 ad-hoc track. Our aims are to: 1) evaluate the performance of a standard IR engine (the Indri search engine) used for full document retrieval and XML element retrieval; 2) investigate whether document structure can lead to more accurate and focused retrieval results. We adopt the language modeling approach [2] and tailor the estimate of query term generation from a document to an XML element according to the user request. The retrieval results are evaluated as: 1) XML element retrieval; 2) full document retrieval.

The rest of the paper describes our experiments in the ad-hoc track. The pre-processing and indexing steps are given in Section 2. Section 3 explains how to convert a user query to an Indri structured query. The retrieval model and strategies are summarized in Section 4. We present our results in Section 5 and conclude the paper with a discussion in Section 6.

2 Pre-processing and Indexing

The original English XML Wikipedia collection is neither stopped nor stemmed before indexing. The 2009 collection contains 2,666,190 documents, taken on 8 October 2008, and is annotated with the 2008-w40-2 version of YAGO [3]. We mainly index the queried XML fields, as follows: category, actor, actress, adversity, aircraft, alchemist, article, artifact, bdy, bicycle, caption, catastrophe, categories, chemist, classical music, conflict, director, dog, driver, group, facility, figure, film festival, food, home, image, information, language, link, misfortune, mission, missions, movie, museum, music genre, occupation, opera, orchestra, p, performer, person, personality, physicist, politics, political party, protest, revolution, scientist, sec, section, series, singer, site, song, st, theory, title, vehicles, village.
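For illustration, field declarations of this kind can be generated for an IndriBuildIndex parameter file as in the following minimal Python sketch; the index and corpus paths and the document class are hypothetical placeholders, not our actual configuration, and only a subset of the field list above is shown.

# Generate a minimal IndriBuildIndex parameter file declaring indexed fields.
# Paths and the corpus class below are hypothetical placeholders.
FIELDS = ["bdy", "sec", "p", "st", "title", "link", "category", "image", "song"]

lines = ["<parameters>",
         "  <index>/path/to/index</index>",           # hypothetical index location
         "  <corpus>",
         "    <path>/path/to/wikipedia-2009</path>",  # hypothetical corpus location
         "    <class>trectext</class>",               # placeholder document class
         "  </corpus>"]
lines += ["  <field><name>%s</name></field>" % f for f in FIELDS]
lines.append("</parameters>")

with open("build_index.param", "w") as out:
    out.write("\n".join(lines) + "\n")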

3 Query Formulation

We handle CO queries by full article retrieval, ignoring Boolean operators (e.g. “-” or “+”) in the <title> field. For CAS queries, we adopt two different strategies to formulate our Indri structured query [4], retrieving either full articles or XML elements. The belief operator #combine is used in both cases. Neither CO queries nor CAS queries are stemmed or stopped. As for documents, we assume the query terms form a bag-of-words [1] when no phrase constraints (marked by double quotes) are present in the <title> or <castitle> field.

• CAS queries for full article retrieval: we extract all terms within “about” clauses, plus the XML tags that have semantic meaning, as our query terms. Boolean operators (e.g. “-” or “+”) in the <castitle> field are ignored. For instance, for INEX query (id=2009009) we have:

//(p|sec)[about(.//(political party|politics), election +victory australian labor party state council -federal)]

Extraction leads to the following query terms: election, victory, australian, labor, party, state, council, federal, political, party, politics.

• CAS queries for XML element retrieval: we extract all terms within “about” clauses and use the Indri belief operator #not to exclude (“-”) terms. At the same time, we add XML element constraints (e.g. restricting results to particular tags such as p or sec) and phrase constraints to the new Indri queries. For instance, for INEX query (id=2009021) we have:

//article[about(., “wonder girls”)]

The formulated Indri query looks like:

#combine[article](#1(wonder girls))

For CAS queries targeting any XML element type (noted as * in the <castitle> field), we retrieve either the article element only, or additional elements such as bdy, link, p, sec, section, st, and title. An example CAS query is //*[about(., Dwyane Wade)] (id=2009018). A rough sketch of the conversion of CAS queries to Indri queries is given below.
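To make the two strategies concrete, the following Python fragment is an illustrative sketch of the conversion (not the code we used; the helper names and the simple regular expression are ours, and phrase constraints are only noted in a comment):

import re

def cas_terms(cas_query):
    """Collect terms from about() clauses, separating '-'-excluded terms.
    The NEXI about() clause is assumed to end in ')]'."""
    positives, negatives = [], []
    for clause in re.findall(r"about\((.*?)\)\]", cas_query):
        body = clause.split(",", 1)[1] if "," in clause else clause
        for tok in body.split():
            tok = tok.strip("`'\"")
            if tok.startswith("-"):
                negatives.append(tok[1:])
            elif tok:
                positives.append(tok.lstrip("+"))
    return positives, negatives

def full_article_query(cas_query, tag_terms=()):
    """Full article strategy: a flat #combine over all about() terms plus any
    semantically meaningful tag names; Boolean markers are simply dropped."""
    pos, neg = cas_terms(cas_query)
    return "#combine(" + " ".join(pos + neg + list(tag_terms)) + ")"

def element_query(cas_query, element="article"):
    """Element strategy: restrict to one element type with #combine[element]
    and exclude '-' terms via #not; a phrase constraint would additionally be
    wrapped in Indri's ordered-window operator #1(...)."""
    pos, neg = cas_terms(cas_query)
    belief = " ".join(pos)
    if neg:
        belief += " #not(#combine(" + " ".join(neg) + "))"
    return "#combine[%s](%s)" % (element, belief)

For topic 2009021, element_query yields essentially #combine[article](wonder girls); with phrase handling added, it would produce the query #combine[article](#1(wonder girls)) shown above.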

4 Retrieval Model and Strategies

We use the run of full article retrieval as the baseline for both CO and CAS queries. The retrieval model for the baseline runs is based on the cross-entropy score between the query model and the document model, which is smoothed using Dirichlet priors. It is defined as follows:

score(D|Q) = \sum_{i=1}^{l} P_{ml}(t_i \mid \theta_Q) \cdot \log\left(\frac{tf(t_i, D) + \mu \, P_{ml}(t_i \mid \theta_C)}{|D| + \mu}\right) \qquad (1)

where l is the length of the query, and P_{ml}(t_i | \theta_Q) and P_{ml}(t_i | \theta_C) are the Maximum Likelihood (ML) estimates of the query model \theta_Q and the collection model \theta_C, respectively. tf(t_i, D) is the frequency of query term t_i in document D, |D| is the document length, and \mu is the smoothing parameter. For XML element retrieval, we compute the relevance score score(E|Q) of the queried XML field E with regard to the given CAS query Q. The smoothed document model (the term inside the log function) is tailored to compute the ML estimate of the XML element model P_{ml}(t_i | \theta_E). We set up our language model and model parameters based on the experimental results of similar tasks at INEX 2008. Here \mu is set to 500.
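As a concrete reading of Equation 1, the following Python sketch (illustrative only; the data structures are assumptions, not part of our system) scores a retrieval unit, which can be a full document or an XML element:

import math
from collections import Counter

def dirichlet_score(query_terms, unit_tf, unit_len, p_collection, mu=500.0):
    """Equation 1: sum over query terms of
    P_ml(t|theta_Q) * log((tf(t, unit) + mu * P_ml(t|theta_C)) / (|unit| + mu)).
    `unit` may be a whole document D or an XML element E."""
    q_counts = Counter(query_terms)
    q_len = float(len(query_terms))
    score = 0.0
    for t, c in q_counts.items():
        p_q = c / q_len                    # ML estimate of the query model
        p_c = p_collection.get(t, 0.0)     # ML estimate of the collection model
        smoothed = (unit_tf.get(t, 0) + mu * p_c) / (unit_len + mu)
        if smoothed > 0.0:                 # skip terms unseen in the collection
            score += p_q * math.log(smoothed)
    return score

# Hypothetical usage, with made-up counts:
# p_collection = {"wonder": 1e-5, "girls": 2e-5}
# dirichlet_score(["wonder", "girls"], {"wonder": 3, "girls": 2}, 250, p_collection)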

4.1 Baselines

Baseline runs retrieve full articles for CO or CAS queries. Only the #combine operator is used. We submitted the CAS query results for the Thorough and Focused tasks, and the CO query results for the Relevant in Context and Best in Context tasks. The baseline performance indicates how the Indri search engine performs in the setting of XML element retrieval.

4.2 Strategies for Overlap Removal

Within the ad-hoc XML retrieval track, there are four sub-tasks:

• The Thorough task requires the system to estimate the relevance of elements in the collection. It returns elements or passages in order of relevance (where specificity is rewarded). Overlap is permitted.

• The Focused task requires the system to return a ranked list of elements or passages. Overlap is removed. When results are equally relevant, users prefer shorter results over longer ones.

• The Relevant in Context task requires the system to return relevant elements or passages clustered per article. For each article, it returns an unranked set of results covering the relevant material in that article. Overlap is not permitted.

• The Best in Context task asks the system to return articles with one best entry point each. Overlap is not allowed.

Because of the hierarchical structure of an XML document, a parent element may also be considered relevant when one of its child elements is highly relevant to a given topic. As a result, we obtain a number of overlapping elements. To fulfill the overlap-free requirement of the Focused, Relevant in Context and Best in Context tasks, we adopt the following strategies to remove overlapping element paths from the result of the Thorough task (a sketch in code is given at the end of this section):

• Relevance Score: The result of the Thorough task is scanned from most to least relevant. When an overlapping element path is found within a document, the element path with the lower relevance score is removed (see Figure 1). In case overlapping elements in the same document have the same relevance score, we choose the element with the higher rank.

Fig. 1. Example result of the Focused task (qid=2009005)

Next, the overlap-free result is grouped by article. For each query, the articles are ranked by their highest relevance score. Within each article, the retrieved element paths keep their relevance rank order (see Figure 2). For the Best in Context task, we choose the most relevant XML element path of each article as our result.

Fig. 2. Example result of the Relevant in Context task (qid=2009005)

• Relevance Score and Full Article Run: In addition to the Relevance Score strategy, we combine our overlap-free result with the result of the full article run (the baseline run for the CO query). We remove XML element paths whose article does not appear in the result of the full article run (see Figure 3). The filtered result follows the rank order of our baseline run. We adopt the same strategy for the Reference task as well.

Fig. 3. Example result of the Reference task (qid=2009005)
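Under our reading of these strategies, and assuming element paths in XPath-like form (e.g. /article[1]/bdy[1]/sec[2]) so that overlap reduces to a path-prefix test, the whole procedure can be sketched in Python as follows (an illustrative reconstruction, not our actual code):

def overlaps(p, q):
    """Two element paths overlap iff one is an ancestor of the other."""
    return p == q or p.startswith(q + "/") or q.startswith(p + "/")

def remove_overlap(results):
    """Scan a Thorough run of (doc, path, score) triples, already ordered from
    most to least relevant; drop any path that overlaps an element kept earlier
    in the same document. Ties thus resolve to the higher-ranked element."""
    kept = []
    for doc, path, score in results:
        if not any(d == doc and overlaps(path, p) for d, p, _ in kept):
            kept.append((doc, path, score))
    return kept

def relevant_in_context(results, article_run=None):
    """Group the overlap-free list per article. Articles are ranked by their
    highest element score; element paths keep their relevance order. If a
    full-article run is given (Relevance Score and Full Article Run strategy),
    keep only its articles, in that run's order."""
    by_doc = {}
    for doc, path, score in remove_overlap(results):
        by_doc.setdefault(doc, []).append((path, score))
    if article_run is not None:
        order = [d for d in article_run if d in by_doc]
    else:
        order = sorted(by_doc, key=lambda d: by_doc[d][0][1], reverse=True)
    return [(d, by_doc[d]) for d in order]

def best_in_context(results):
    """Best entry point: the most relevant element path of each article."""
    return [(d, paths[0][0]) for d, paths in relevant_in_context(results)]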

5 Results

For each of the four sub-tasks, we submitted two XML element results and one extra result for the Reference task. In total, we had 12 submissions to the ad-hoc track, 9 of which were qualified runs. We additionally report further results in this paper.

5.1 Full Document Retrieval

One of our goals for the ad-hoc track is to compare the performance of the Indri search engine in full document retrieval and in XML element retrieval. The observations are used to analyze our element language models and improve our overlap removal strategies. As official runs, we submitted full document retrieval using both CO and CAS queries for all four sub-tasks. For every task except the Thorough task, one run was disqualified because of overlapping results. The qualified run for the Thorough task is an automatic run for the CAS query (see Table 1). In the same table, we include an extra run for the CO query.

For full document retrieval, the result of the CAS query is slightly worse than that of the CO query. This performance gap indicates the difference between two expressions of the same topic of interest.

Table 1. Results of full document retrieval tasks

task      run                    iP[.00]  iP[.01]  iP[.05]  iP[.10]  MAiP
thorough  article.CO additional  0.5525   0.5274   0.4927   0.4510   0.2432
thorough  article.CAS official   0.5461   0.5343   0.4929   0.4415   0.2350

5.2 XML Element Retrieval

For the official submission, we presented our results using the Relevance Score and Full Article Run strategy. All qualified runs use the CAS query. The results for the Thorough task are in Table 2. The additional runs are the original results without the help of the full article runs. The run element1.CAS returns only the article element type for any-element-type requests (noted as * in the <castitle> field), while the run element2.CAS returns all considered element types.

Table 2. Results of XML element retrieval tasks

task      run                            iP[.00]  iP[.01]  iP[.05]  iP[.10]  MAiP
thorough  element.CAS.ref official       0.4834   0.4525   0.4150   0.3550   0.1982
thorough  element1.CAS additional        0.4419   0.4182   0.3692   0.3090   0.1623
thorough  element.CAS.baseline official  0.4364   0.4127   0.3574   0.2972   0.1599
thorough  element2.CAS additional        0.4214   0.3978   0.3468   0.2876   0.1519

Full document retrieval outperforms element retrieval in locating all relevant information. The same trend shows in the performance difference between element1.CAS and element2.CAS. Our results again agree with observations from previous INEX campaigns. System-wise, the run filtered by the given reference result (element.CAS.ref) performs better than the run filtered by our own Indri full article run (element.CAS.baseline).

Focused Task The official and additional results of the Focused task are in Table 3. Our run element.CAS.baseline successfully preserves the retrieval result of the Thorough task and brings a moderate improvement. The full document runs still rank highest in the Focused task.

Table 3. Results of XML element retrieval tasks

task   run                            iP[.00]  iP[.01]  iP[.05]  iP[.10]  MAiP
focus  article.CO additional          0.5525   0.5274   0.4927   0.4510   0.2432
focus  article.CAS additional         0.5461   0.5343   0.4929   0.4415   0.2350
focus  element.CAS.ref official       0.4801   0.4508   0.4139   0.3547   0.1981
focus  element.CAS.baseline official  0.4451   0.4239   0.3824   0.3278   0.1695
focus  element1.CAS additional        0.4408   0.4179   0.3687   0.3092   0.1622
focus  element2.CAS additional        0.4228   0.3999   0.3495   0.2909   0.1527

Relevant in Context Task As explained earlier, we rank documents by their highest element score and rank element paths by their relevance score within each document. Overlapping elements are removed as required. The retrieval results are in Table 4. The full document runs still dominate the performance, and the reference run continues to boost the retrieval results.

Table 4. Results of XML element retrieval tasks

task                 run                            gP[5]   gP[10]  gP[25]  gP[50]  MAgP
relevant-in-context  article.CO additional          0.2934  0.2588  0.2098  0.1633  0.1596
relevant-in-context  article.CAS additional         0.2853  0.2497  0.2132  0.1621  0.1520
relevant-in-context  element.CAS.ref official       0.2216  0.1904  0.1457  0.1095  0.1188
relevant-in-context  element.CAS.baseline official  0.1966  0.1695  0.1391  0.1054  0.1064
relevant-in-context  element1.CAS additional        0.1954  0.1632  0.2150  0.1057  0.0980
relevant-in-context  element2.CAS additional        0.1735  0.1453  0.1257  0.1003  0.0875

Best in Context Task This task is to identify the best entry point for accessing the relevant information in a document. Our strategy is to return the element path with the highest relevance score in each document. The retrieval results are in Table 5. Using the relevance score as the only criterion has led to a promising result when we compare the original runs (element1.CAS and element2.CAS) with the run boosted by the baseline (element.CAS.baseline). However, the run containing more article returns (element1.CAS) is still better than the run with other element returns (element2.CAS).

Table 5. Results of XML element retrieval tasks

task             run                            gP[5]   gP[10]  gP[25]  gP[50]  MAgP
best-in-context  article.CO additional          0.2663  0.2480  0.1944  0.1533  0.1464
best-in-context  article.CAS additional         0.2507  0.2305  0.1959  0.1499  0.1372
best-in-context  element.CAS.ref official       0.1993  0.1737  0.1248  0.0941  0.1056
best-in-context  element1.CAS additional        0.2257  0.1867  0.1426  0.1125  0.1015
best-in-context  element2.CAS additional        0.2089  0.1713  0.1343  0.1084  0.0924
best-in-context  element.CAS.baseline official  0.1795  0.1449  0.1143  0.0875  0.0852

6 Conclusion

In our official runs, we present our baseline results and results filtered by our own document retrieval run and by the given reference run. In this paper, we provide additional results of the original document and XML element retrieval. The Indri search engine can provide a reasonable result for XML element retrieval compared

to the results of full article retrieval and the results of other participating groups. We can also use the relevance score as the main criterion to deal with the overlap problem effectively. However, the full document runs are still superior to the XML element runs. When the results of the reference and baseline runs are used, the result of the Thorough task is improved. This may imply that the search engine is able to locate relevant elements within documents effectively. Besides the accuracy of relevance estimation, retrieval performance also depends on the effective formulation of the Indri structured query. For example, the two interpretations of the wildcard in element1.CAS and element2.CAS produce different results in our experiments. The overlap removal step is another key factor that may harm retrieval performance. In our case, our run element1.CAS ranks high in the Thorough task but low in the Focused and Relevant in Context tasks. Besides using the relevance score for removing overlapping element paths, we may try other criteria, such as the location of the element within a document. This is especially important for the Best in Context task, as users tend to read a document from top to bottom.

Acknowledgments This work is sponsored by the Netherlands Organization for Scientific Research (NWO), under project number 612-066-513.

References

1. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet Allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)
2. Zhai, C., Lafferty, J.: A Study of Smoothing Methods for Language Models Applied to Information Retrieval. ACM Trans. on Information Systems 22(2), 179–214 (2004)
3. Schenkel, R., Suchanek, F.M., Kasneci, G.: YAWN: A Semantically Annotated Wikipedia XML Corpus. In: 12. GI-Fachtagung für Datenbanksysteme in Business, Technologie und Web, Aachen, Germany (March 2007)
4. Strohman, T., Metzler, D., Turtle, H., Croft, W.B.: Indri: A Language-model Based Search Engine for Complex Queries. In: Proceedings of ICIA (2005)