SIRIUS: A Lightweight XML Indexing and Approximate Search System at INEX 2005 Eugen Popovici, Gildas Ménier, Pierre-François Marteau VALORIA Laboratory, University of South-Brittany BP 573, 56017 Vannes Cedex, France {Eugen.Popovici, Gildas.Menier, Pierre-Francois.Marteau}@univ-ubs.fr
Abstract. This paper reports on SIRIUS, a lightweight indexing and search engine [6] for XML documents. The retrieval approach implemented is document oriented. It involves an approximate matching scheme for both structure and textual content. Instead of managing the matching of whole DOM trees, SIRIUS splits the document object model into a set of paths. This set is indexed using optimized data structures. In this view, a request is a path-like expression with conditions on the attribute values. In this paper, we first present the main functionalities and characteristics of this XML IR system, and then relate our experience in adapting and using it for the INEX 2005 ad-hoc retrieval task. Finally, we present and analyze the SIRIUS retrieval performance obtained during the INEX 2005 evaluation campaign and show that, despite its lightweight characteristics, SIRIUS obtained quite good precision at low recall values.
1 Introduction The widespread use of XML in digital libraries, product catalogues, scientific data repositories and across the Web has prompted the development of appropriate searching and browsing methods for XML documents. Approximate matching in XML provides the possibility of querying the information acquired by a system having an incomplete or imprecise knowledge of both the structure and the content of the XML documents [14], [15]. In this context, we propose and experiment with a lightweight model for indexing and querying XML documents. We develop a simple querying algebra implemented using fast approximate searching mechanisms for structure and textual content retrieval. In order to evaluate the expected benefits and drawbacks of this new kind of search functionality we propose algorithms and data structures whose principles are detailed hereinafter. We propose specific data structures dedicated to the indexing and retrieval of information elements embedded within heterogeneous XML databases. The indexing scheme is well suited to the characterization of various contextual searches, expressed either at a structural level or at an information content level. Search mechanisms are based on context tree matching algorithms that involve a modified Levenshtein editing distance [12] and information fusion heuristics. The implementation that is finally described highlights the mixing of structured information presented as
field/value instances and free text elements. Our approach is evaluated experimentally at the INEX 2005 workshop. The results are encouraging and suggest a number of future enhancements. The paper is organized as follows. In Section 2 we present the main functionalities and characteristics of the SIRIUS XML IR system. In Section 3 we relate our experience in adapting and using the system for the INEX 2005 ad-hoc retrieval task. In Section 4 we present the evaluation of SIRIUS retrieval on the VVCAS task of the INEX 2005 campaign. Finally, in Section 5 we summarize our conclusions and propose some future work perspectives.
2 SIRIUS XML IR System SIRIUS is a lightweight indexing and search engine for XML documents developed at the VALORIA laboratory of the University of South-Brittany [6][7]. The retrieval approach implemented in SIRIUS is document oriented. It involves an approximate matching scheme for both structure and textual content. Instead of managing the matching of whole DOM trees, SIRIUS splits the document object model into a set of paths. This set is indexed using optimized data structures. In this view, a request is a path-like expression with conditions on the attribute values. For instance /document(> date "1994")/chapter(= number 3)/John is a request aiming to extract the documents (written after 1994) with the word John in chapter number 3. We designed a matching process that takes into account mismatch errors both on the attributes and on the XML elements. The matching process uses a weighted editing distance on XML paths: this provides an approximate matching scheme able to manage jointly requests on textual content and on document structure. The search scheme is extended with a set of Boolean and IR retrieval operators, and features a set of thesaurus rewriting rules. Recently the system was extended with a specialized set of operators for extracting, indexing and searching sequential and time series data embedded in heterogeneous XML documents [8]. 2.1 Indexing Scheme An XML document is composed of a set of elements with a possible nested structure. Each XML element may be composed of a set of possibly nested XML elements, textual pieces of information (TEXT or CDATA), unordered <attribute, value> pairs, or a mixture of such items. XML documents are generally represented as rooted, ordered, and labeled trees in which each node corresponds to an element and each edge represents a parent-child relationship. XML Context. According to the tree structure, every node n inherits a path p(n) composed of the nodes that link the root to the node n.
This path is an ordered sequence of XML elements, potentially associated to unordered <attribute, value> pairs A(ni), that determines the XML context in which the node occurs. A tree node n containing textual/mixed information can be decomposed into textual sub-elements. Each string s (or word, lemma, …) of a textual sub-element is also linked to p(n). This XML context characterizes the occurrence of s within the document and can be represented as follows:

p(n) = <n1, A(n1)> / <n2, A(n2)> / … / <nk, A(nk)>, with n1 the root and nk = n .    (1)
Index Model. The indexing process involves the creation of an enriched inverted list designed for the management of these XML contexts. For this model, the entries of the inverted lists are of two kinds: the textual sub-elements s of a tree node, and the nodes n of the XML tree. For a sub-element s of a node n, three pieces of information are attached: − a link to the URI of the document, to retrieve the original document, − an index specifying the location of the sub-element within the document, − a link toward its XML context p(n). For a node n of the DOM tree, only the link to the URI of the document and a link to its XML context p(n) are required. 2.2 Searching Scheme Most of the time, for large heterogeneous databases, one cannot assume that the user knows all of the structures – even in the very optimistic case when all of the structural properties are known. Some straightforward approaches (such as the XPath search scheme [16]) may not be efficient in these cases. As the user cannot be aware of the complete XML structure of the database due to its heterogeneity, efficient searching should involve exact and approximate search mechanisms. The main structure used in XML is a tree: it seems acceptable to express a search in terms of tree-like requests and approximate matching. Tree matching processes mainly involve elastic matching or editing distances [10], [11]. In [10], the complexity of matching two trees T1 and T2 is at least O(|T1|.|T2|) where |Ti| is the number of nodes in Ti. The complexity is much higher for common subtree search [11]. This complexity is far too high for these approaches to perform well on large heterogeneous databases containing documents with a high number of elements (nodes). We proposed in [6], seemingly at the same time as the IBM team at Haifa, to focus on path matching rather than on tree matching – in a similar manner to the XML fragment approach [15].
A request is then expressed as a set of paths p(r) that are matched against the set of sub-paths p(n) in the document tree. This breaks the algorithmic complexity and seems to better correspond to the end-user needs: most data searches involve a node and its inherited sequence of elements rather than a full tree of elements. This ‘low-level’ matching only manages sub-path similarity search with conditions on the element and attribute matching. This process is used to design a higher-level request language: a full request is a tree of low-level matching goals (as leaves) with set operators as nodes. These operators are used to merge the leaf results. The whole tree is evaluated to provide a set of ranked answers. The operators are classical set operators (intersection, union, difference) or dedicated fuzzy merging processors.
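To make the path-based indexing scheme of Sect. 2.1 concrete, the following is a minimal sketch (not the SIRIUS implementation; attribute pairs and mixed-content tails are ignored for brevity) of how a document object model can be split into root-to-node paths and stored in an enriched inverted list:

```python
from collections import defaultdict
from xml.etree import ElementTree

def index_document(uri, xml_text, index=None):
    """Map every word to (document URI, word position, XML context
    p(n), i.e. the path from the root to the node containing it)."""
    if index is None:
        index = defaultdict(list)
    pos = 0

    def walk(node, path):
        nonlocal pos
        context = path + "/" + node.tag
        for word in (node.text or "").split():
            index[word.lower()].append((uri, pos, context))
            pos += 1
        for child in node:
            walk(child, context)

    walk(ElementTree.fromstring(xml_text), "")
    return index

idx = index_document(
    "d1.xml", "<document><chapter>John wrote this</chapter></document>")
# idx["john"] → [("d1.xml", 0, "/document/chapter")]
```

A path-like request such as /document/chapter/John is then answered by looking up the word entry and matching its stored context paths, rather than traversing whole DOM trees.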
Approximate Path Search. Let R be a low-level request, expressed as a path goal pR, with conditions or constraints to be fulfilled on the attributes and attribute values. We investigate the similarity between pR (coding a path with constraints) and a tree TD of an indexed document D as follows:
δ(pR, TD) = Min_i δL(pR, piD) ,    (2)
where δL is a dedicated editing distance (see [12]) and {piD} is the set of paths in D starting at the root and leading to the last element of the pR request – terminal(r). The complexity is O(l(pR).deep(TD).|{piD}|), with |{piD}| the size of the set {piD} (i.e. the number of different paths), l(p) the length of the path p, and deep(T) the deepest level of T. This complexity remains acceptable for this application, as 99% of XML documents have fewer than 8 levels and their average depth is 4 [14]. Path Similarity Computation δ(pR, TD). Let pR be the path for the structural request R and {piD} the set of root/../terminal(r) paths of the tree associated to an indexed document D. We designed an editing pseudo-distance [15] using a customised cost matrix to compute the match between a path piD and the request path pR. This scheme, also known as a modified Levenshtein distance, computes a minimal sequence of elementary transformations to get from piD to pR. The elementary transformations are: − Substitution: a node n in piD is replaced by a node n’ for a cost Csubst(n, n’). Since a node n not only stands for an XML element, but also for attributes or attribute relations, we compute Csubst(n, n’) as follows:
1. n.element ≠ n’.element: Csubst(n, n’) = ζ (full substitution)
2. n.element == n’.element: if n’.attCond(n.attributes) is true, Csubst(n, n’) = 0 (no substitution), else Csubst(n, n’) = ½ ζ (attributes substitution)
where attCond stands for a condition (stated in the request) that should apply to the attributes.
− Deletion: a node n in piD is deleted for a cost Cdel(n) = ζ.
− Insertion: a node n is inserted in piD for a cost Cins(n) = ζ.
For a sequence Seq(piD, pR) of elementary operations, the global cost GC(Seq(piD, pR)) is computed as the sum of the costs of the elementary operations. The Wagner & Fischer algorithm [15] computes the best Seq(piD, pR) (i.e. the one minimizing GC()) with a complexity of O(length(piD) * length(pR)). Let
δL(pR, piD) = Min_k GC(Seq_k(pR, piD)) .    (3)
The similarity between pR and piD is computed as follows:
δ(pR, piD) = 1 / (1 + δL(pR, piD)) .    (4)
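As an illustration only (assuming a unit cost ζ = 1 and no attribute conditions), equations (2)-(4) and the elementary costs above can be sketched as:

```python
def delta_L(p_req, p_doc, zeta=1.0):
    """Wagner & Fischer dynamic programming over two paths given as
    lists of element names.  Full substitution, deletion and
    insertion all cost zeta; a matching element costs 0 (the 1/2*zeta
    attribute-condition case is omitted here for brevity)."""
    n, m = len(p_req), len(p_doc)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * zeta
    for j in range(1, m + 1):
        d[0][j] = j * zeta
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            subst = 0.0 if p_req[i - 1] == p_doc[j - 1] else zeta
            d[i][j] = min(d[i - 1][j - 1] + subst,  # match / substitution
                          d[i - 1][j] + zeta,       # node missing in p_doc
                          d[i][j - 1] + zeta)       # extra node in p_doc
    return d[n][m]

def delta(p_req, p_doc):
    """Equation (4): similarity in (0, 1], equal to 1 for a perfect match."""
    return 1.0 / (1.0 + delta_L(p_req, p_doc))

assert delta("article/sec".split("/"), "article/sec".split("/")) == 1.0
# one extra node ("bdy") in the indexed path: distance 1, similarity 0.5
assert delta("article/sec".split("/"), "article/bdy/sec".split("/")) == 0.5
```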
Given pR and piD, the value of δ(pR, piD) tends toward 0 as the number of mismatching nodes and attribute conditions between pR and piD increases. For a perfect match, δ(pR, piD) = 1: all the elements and the conditions on attributes from the request pR match corresponding XML elements in piD. 2.3 Query Language Complex requests are built using the low-level requests pR described above and merging operators (boolean or specialized operators). Mainly, a complex request is a tree with pR requests as leaves. Each node supports an operator performing the merging of the descendant results. Currently, the following merging operators are implemented in the system for low-level management: − or, and: n-ary boolean/set operators. (or pR pR’) provides the union of the solution sets of pR and pR’. (and pR pR’) provides only the answers belonging to both answer sets. − without: this operator can be used to remove solutions from a set. For instance, (without pR pR’) delivers the set of solutions for pR minus the solutions given by pR’. − seq is mainly an extension of the pR request: it merges some of the inverted lists to provide simple sequence management. For instance, (seq warning * error) expresses the search for a sequence of text items – it also applies to structure. − in, same: express a contextual relation (same = the pR arguments have the same path, in = inside elements with the specified path). − +, same+: should be related to the or operator. The or operator is a simple set merging operator, whereas ‘+’ and ‘same+’ are dedicated operators that take into account the number of elements answered. We use a dedicated TFIDF-like function for this purpose (TFIDF stands for Term Frequency / Inverse Document Frequency, see [18]). − in+: adds structural matching information to the set of solutions. It performs a weighted linear aggregation between the conditions on structure and the set of solutions. Let argi be a complex request, or a simple (low-level) pR request.
The similarity computation δ(argroot, TD) for a complex request argroot and a tree TD of a document D is performed recursively, starting at the leaves:
δ((or arg1, …, argn), TD) = Max_i δ(argi, TD)

δ((and arg1, …, argn), TD) = Min_i δ(argi, TD)
δ((without arg1, arg2), TD) = Max{0, δ(arg1, TD) − δ(arg2, TD)}
δ((seq arg1, …, argn), TD) = 1 if arg1, arg2, …, argn occur in sequence and belong to the same context/leaf, else 0.
δ((in ctx arg1, …, argn), TD) = Min_i δ(ctx/argi, TD)

where ctx is an XML context (a path) and ctx/argi the path made by concatenating ctx and argi.
δ((in+ ctx arg1, …, argn), TD) = β δ(ctx/argi, TD) + (1 − β) δ(argi, TD)

where ctx and ctx/argi are as above, and β in [0..1] is used to emphasise the importance of the structural matching.
δ((same arg1, …, argn), TD) = Max_ctxD δ(ctxD/argi, TD)

with ctxD a path of TD.
δ((+ arg1, …, argn), TD) = τ · Σ_i λi δ(argi, TD)

where λi is a weighting factor specifying the discriminating power of argi: λi = 1 − log((1 + ND(argi)) / (1 + ND)), where ND(argi) is the number of documents in which argi occurs, ND the total number of documents in the collection, and τ a normalization constant: τ = 1 / Σ_k λk. The same+ operator is computed in a similar way:
δ((same+ arg1, …, argn), TD) = Max_ctxD τ · Σ_i λi δ(ctxD/argi, TD)
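To illustrate the '+' operator, the TFIDF-like aggregation above can be sketched as follows (an illustration under the stated formulas; the function and argument names are ours, not from the SIRIUS implementation):

```python
import math

def plus_score(similarities, nd_args, nd_total):
    """delta((+ arg1, ..., argn), TD): weighted linear aggregation
    where lambda_i = 1 - log((1 + ND(arg_i)) / (1 + ND)) favours rare
    (more discriminant) arguments, and tau normalizes the weights so
    that they sum to one."""
    lambdas = [1.0 - math.log((1.0 + nd) / (1.0 + nd_total))
               for nd in nd_args]
    tau = 1.0 / sum(lambdas)
    return tau * sum(lam * s for lam, s in zip(lambdas, similarities))

# Two arguments with equal similarity: the aggregated score is their
# common value, whatever the document frequencies are (weighted mean).
score = plus_score([1.0, 1.0], nd_args=[5, 100], nd_total=1000)
assert abs(score - 1.0) < 1e-9
```

Since the λi are normalized by τ, the result is a weighted mean of the argument similarities, biased toward the most discriminant arguments.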
3 INEX 2005 Experience The retrieval task we address at INEX 2005 is defined as the ad-hoc retrieval of XML elements. This involves the searching of a static set of documents using a new set of topics [1]. We first present several characteristics of the test collection and of the CAS topics used in the INEX 2005 ad-hoc task. Next we present how we tuned the SIRIUS retrieval approach for the VVCAS task. 3.1 INEX ad-hoc collection The INEX 2005 document collection contains over 16,800 articles from 24 IEEE Computer Society journals, covering the period 1995-2004 and totalling about 750 megabytes in its canonical form. The collection contains 141 different tag names¹ composing 7948 unique XML contexts when ignoring the attributes and attribute values. The maximum length of an indexed path is 20, while the average
¹ We calculate the statistics from the viewpoint of the retrieval system. That is, we use the XML tag equivalence classes (section 3.5). Also, the XML contexts associated to empty elements or containing only stop words do not count in our statistics.
length (calculated over the unique indexed contexts) is 8. The distribution of elements is heavily skewed towards short elements, such as italics [5]. 3.2 INEX 2005 CAS Topics CAS queries are topic statements that contain explicit references to the XML structure, and explicitly specify the contexts of the user’s interest (e.g. target elements) and/or the contexts of certain search concepts (e.g. containment conditions). A CAS topic has several parts expressing the same information need: <title>, <castitle>, <description>, and <narrative> [4]. An example of the <castitle> part of a CAS topic, expressed in the NEXI language [3], is shown in Fig. 1.

//article[about(.//bb, Baeza-Yates) and about(.//sec, string matching)]//sec[about(., approximate algorithm)]

Fig. 1. CAS topic 280.
This year, a total of 47 CAS topics were selected, of which 30 are subordinate topics of the form //A[B]², 6 are independent/standalone topics of the form //A[B] and 11 are complex topics of the form //A[B]//C[D]³ constructed using the subordinate topics. We make here some observations on the selected set of topics used for retrieval in the INEX 2005 CAS task. First, no constraint on attributes or attribute values is made. This may be explained by the fact that the collection is considered to have no attribute or attribute value of practical interest for a real end user – see “A Note on Attributes” in [3]. Second, even if the collection contains paths with an average length ranging between 6 and 8 elements, a reduced number of elements (2, at most 3) is employed by the users to express the paths of the structural constraints. Third, 21 of the total of 47 CAS topics, 9 of the 17 complex and standalone CAS topics, and 7 of the 10 assessed topics used phrase searching. 3.3 Translating INEX 2005 CAS topics to SIRIUS Query Language We use an automatic transformation of the INEX 2005 CAS topics expressed in NEXI [3] into the SIRIUS [6] recursive query language. We have two types of CAS queries: simple queries of the form //A[B] and complex queries of the form //A[B]//C[D]. For the simple type of queries, the translation process is straightforward, using the SIRIUS operators and, or, in+, same+ and seq (Fig. 2).

//article[about(.//bb, Baeza-Yates)]
(in+ [/article/bb/] (same+ (seq Baeza Yates)))
Fig. 2. Translating CAS topic 277 to SIRIUS query language.

² //A[B] = returns A tags about B [4].
³ //A[B]//C[D] = returns C descendants of A where A is about B and C is about D [4].
For translating complex queries of the form //A[B]//C[D], we introduce a new filter operator in the SIRIUS query language (Fig. 3).

( filter
  ( and ( in+ [/article/bb/] ( same+ ( seq Baeza Yates ) ) )
        ( in+ [/article/sec/] ( same+ string matching ) ) )
  ( in+ [/article/sec/] ( same+ approximate algorithm ) ) )

Fig. 3. CAS topic 280 (Fig. 1) translated to SIRIUS query language.
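A minimal sketch of one possible semantics for such a filter operator follows (not the SIRIUS implementation: results are assumed to carry weights, `is_descendant` is a hypothetical containment test, and the kept results simply average their weight with the best support weight, a simplification of the averaging rule detailed in the next paragraph):

```python
def filter_op(support, targets, is_descendant):
    """Keep from `targets` the weighted results (r_j, w_j) that are
    descendants of some result in `support`; the new weight averages
    w_j with the best weight found in the support set."""
    if not support:
        return []
    best = max(w for _, w in support)
    return [(rj, (wj + best) / 2.0)
            for rj, wj in targets
            if any(is_descendant(rj, ri) for ri, _ in support)]

support = [("d1/article", 0.8)]
targets = [("d1/article/sec[1]", 0.4), ("d2/article/sec[2]", 0.9)]
kept = filter_op(support, targets,
                 is_descendant=lambda a, b: a.startswith(b + "/"))
# only the section of the supported article d1 is kept,
# reweighted to (0.4 + 0.8) / 2 = 0.6
assert len(kept) == 1 and abs(kept[0][1] - 0.6) < 1e-9
```

With document-level containment only (as in the current implementation), `is_descendant` degenerates to a test on document identity.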
The filter operator receives as arguments two sets arg1, arg2 of weighted matching results (ri, wi). It retains from the second set all the descendants (rj, w’j) of the results occurring in the first set. For this operator, the new weight w’j associated with a selected result is calculated as the arithmetic average of the initial weight wj of the corresponding result in the second set and the maximum weight ArgMaxi(wi) found in all its ancestors and their descendants from the first set of results. This operator implies solving element containment relationships that were not supported by the SIRIUS index model. However, for all of the 11 complex CAS topics, the ancestor //A specified in the structural path is the article element. Therefore, only document-level containment conditions are checked in the current implementation of the operator. 3.4 Indexing the INEX 2005 Ad-Hoc Collection The collection is pre-processed by removing the volume.xml files and transforming all the XML files into their canonical form⁴. At indexing time, the most frequent words are eliminated using a stop list. The terms are stemmed using the Porter algorithm [17]. We index only ALPHANUMERIC words as defined in [3] (like iso-8601). We did not index numbers, the attributes, the attribute values, or empty XML elements. This allowed important performance gains both in indexing and querying time, as well as disk space savings. The index model (section 2), enhanced with a node labelling scheme based on pre-order containment intervals, was implemented using BTree structures from the Berkeley DB⁵ library. The index size is about two and a half times the initial database size. The indexing time on a Pentium IV 2.4 GHz processor with 1.5 GB of RAM for the inex-1.6 ad-hoc collection in canonical form (~750 MB) was about 50 min, with an additional one-hour post-processing phase for disk optimizations. 3.5 Structure Approximate Match for INEX Structural Equivalence Classes.
We create structural equivalence classes for the tags defined as interchangeable in a request (Paragraphs, Sections, Lists, and Headings), in conformance with [4].
⁴ Canonical XML Processor: http://www.elcel.com/products/xmlcanon.html
⁵ http://www.sleepycat.com/
Weighting Scheme for Modelling Ancestor-Descendant Relationships. The NEXI language [3] specifies a path through the XML tree as a sequence of nodes. Furthermore, the only relationship allowed between nodes in a path is the descendant relation. Therefore the XML path expressed in the request is interpreted as a subsequence of an indexed path, where a subsequence need not consist of contiguous nodes. This does not suit the weighting scheme implemented in SIRIUS [6], which allows (slight) mismatch errors between the structural query and the indexed XML paths: the ancestor-descendant relationship is penalized by the SIRIUS weighting scheme relative to a parent-child relationship. Therefore we relax the weights of the path editing distance in order to allow node deletions in the indexed paths without any penalty, Cdel(n) = 0. To illustrate this mechanism, we show in Fig. 4 the distances between the path requested in topic 277 (Fig. 2), searching for works citing Baeza-Yates, and several indexed paths retrieved by SIRIUS using the new weighting scheme.

δ(/article/bm/bib/bibl/bb/au/snm, //article//bb) = 0
δ(/article/bm/app/bib/bibl/bb/au/snm, //article//bb) = 0
δ(/article/fm/au/snm, //article//bb) = 1

Fig. 4. Example of distances between the indexed paths piD and the request path pR.
In the first two cases the request path pR is a subsequence of the retrieved paths piD and therefore the editing distance is 0, independently of the length of the two indexed paths. In the last case, where Baeza-Yates is the author of the article, the editing distance is 1, highlighting the mismatch of the requested bb node in the indexed path. The weighting scheme models an end user having precise but incomplete information about the XML tags of the indexed collection and about their ancestor-descendant relationships. It takes into account the number of matched XML nodes from the request and their order of occurrence. It heavily penalizes any mismatch relative to the information provided by the user, but is forgiving with mismatches/extra information extracted from the indexed paths. 3.6 SIRIUS VVCAS Runs. CAS queries are topic statements that contain two kinds of structural constraints: where to look (support elements), and what elements to return (target elements). When implementing a VVCAS strategy, the structural constraints in both the target elements and the support elements are interpreted as vague [1]. We submitted 2 official and 4 additional runs within the CAS ad-hoc task using the VVCAS strategy and automatic query translation (Table 1).
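The relaxed scheme can be sketched by setting the deletion cost of indexed-path nodes to zero, so that the request is matched as a (not necessarily contiguous) subsequence of the indexed path; this is an illustration, not the exact SIRIUS cost matrix:

```python
def relaxed_delta_L(p_req, p_doc, zeta=1.0):
    """Edit distance where skipping a node of the indexed path is
    free (Cdel = 0), while a missing or mismatched request node
    still costs zeta."""
    n, m = len(p_req), len(p_doc)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * zeta                        # unmatched request nodes
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            subst = 0.0 if p_req[i - 1] == p_doc[j - 1] else zeta
            d[i][j] = min(d[i - 1][j - 1] + subst,  # match / substitute
                          d[i - 1][j] + zeta,       # request node missing
                          d[i][j - 1])              # skip indexed node: free
    return d[n][m]

# The Fig. 4 examples: //article//bb matched against indexed paths.
req = ["article", "bb"]
assert relaxed_delta_L(req, "article/bm/bib/bibl/bb/au/snm".split("/")) == 0.0
assert relaxed_delta_L(req, "article/fm/au/snm".split("/")) == 1.0
```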
Table 1. SIRIUS VVCAS runs (1, 2 official runs; 3, 4, 5, 6 additional runs).

ID  Run
1.  VVCAS_contentWeight08_structureWeight02
2.  VVCAS_contentWeight05_structureWeight05
3.  VVCAS_contentWeight05_structureWeight05_unofficial
4.  VVCAS_contentWeight10_structureWeight00_unofficial
5.  VVCAS_contentWeight05_structureWeight05_SAMEPLUS_unofficial
6.  VVCAS_contentWeight10_structureWeight00_SAMEPLUS_unofficial
In the official runs 1, 2 and the additional runs 3, 4 we use the same retrieval approach, namely strict sequence search, a modified editing distance on XML paths for matching the structural constraints, and a weighted linear aggregation of the content and structure matching scores. Additional run 3 corrects a bug found in one limit case of our official runs. The difference observed on the evaluation curves between the official run 2 and the additional run 3 is minimal (see Fig. 6, Fig. 7, Tables 2 and 3). For the additional runs 5 and 6 we use an approximate matching operator instead of a strict sequence search. Strict Sequence Matching Runs: we used a strict seq operator – where strict stands for words appearing in sequence in the textual content (ignoring the stop list words) of the same XML node. Therefore, searching for the sequence “Ricardo Baeza-Yates” will, in our implementation, match an author name contained in a single XML element but completely ignore the same sequence when it is split across adjacent elements (Fig. 5).

Fig. 5. Example of a sequence allowed in the same XML element and ignored in adjacent elements.
Flexible Matching Runs: the additional runs 5 and 6 implement a flexible sequence search based on the same+ operator. These runs rank as best results the XML elements that contain all the searched terms, without taking into account their order of occurrence. XML elements that contain only part of the searched terms are also retrieved and ranked based on the number of distinct searched terms they contain, according to the weighting (IDF) of each occurring term. 4 SIRIUS Retrieval Performance Evaluation The relevance of XML IR systems in XML retrieval tasks is evaluated along two dimensions: exhaustivity and specificity. An element is exhaustive if the topic of the request is exhaustively discussed within that element, whereas an element is specific if the element is highly focussed on the topic [1].
4.1 INEX 2005 Evaluation Results We report here the system-oriented and user-oriented official INEX 2005 evaluation measures: the effort-precision/gain-recall (ep/gr) metric (Fig. 6, Table 3), the extended Q and R metrics (Table 2), and the normalized extended cumulated gain (nxCG) metric (Fig. 7), for all the submitted VVCAS runs. Details of the evaluation metrics can be found in [2].
Fig. 6. SIRIUS effort-precision/gain-recall (ep/gr) curves: a) EP/GR (overlap=off, quant.=generalized); b) EP/GR (overlap=off, quant.=strict).
Effort-precision (EP) at a given gain-recall (GR) value measures the relative effort (number of visited ranks) that the user is required to spend when scanning a system's output, compared to the effort an ideal ranking would take in order to reach the same level of cumulated gain [2].

Table 2. Extended Q and R metrics (overlap=off).

Run ID   Q strict   Q gen    R strict   R gen
1.       0.009      0.025    0.039      0.084
2.       0.009      0.025    0.039      0.084
3.       0.009      0.026    0.040      0.085
4.       0.008      0.023    0.040      0.083
5.       0.014      0.045    0.063      0.125
6.       0.013      0.042    0.063      0.113
The Q-and-R-measures are modified normalised cumulated gain measures which employ a bonus gain function that directly incorporates the rank position in the measured gain [2].
Table 3. Effort-precision measures at different levels of gain-recall.

EP vs. GR (overlap=off, quant.=strict)
Run ID   0.01/15   0.02/30   0.1/150   0.2/300   MAep     iMAep
3.       0.3302    0.291     0.1123    0.1051    0.0086   0.0338
4.       0.3165    0.285     0.0971    0.092     0.0082   0.0314
5.       0.2114    0.1786    0.1298    0.1165    0.0125   0.0523
6.       0.2011    0.1673    0.125     0.1148    0.0122   0.0491

EP vs. GR (overlap=off, quant.=generalized)
Run ID   0.01/15   0.02/30   0.1/150   0.2/300   MAep     iMAep
3.       0.3483    0.2813    0.1549    0.0253    0.0218   0.03
4.       0.275     0.228     0.1346    0.0245    0.0191   0.0253
5.       0.2474    0.217     0.1598    0.0925    0.0351   0.0465
6.       0.2306    0.213     0.145     0.093     0.0326   0.0428
For a given rank i, the value of nxCG[i] reflects the relative gain the user accumulated up to that rank, compared to the gain he/she could have attained if the system would have produced the optimum best ranking [2].
Fig. 7. The user-oriented measure of normalized extended cumulated gain (nxCG): a) nxCG (overlap=off, quant.=generalized); b) nxCG (overlap=off, quant.=strict).
4.2 Analysis of SIRIUS Retrieval Performance Using the structural information. The objective of our study was to determine to what extent the structural hints should be taken into account when implementing a VVCAS strategy. For the two official runs (1, 2) we assigned different weights for merging the content and structure relevance scores, i.e. (0.8, 0.2) and (0.5, 0.5). The coefficients used were not discriminant enough to impose a different ranking of the final results. Therefore, we conducted tests completely discarding the structural information from the CAS topics, i.e. using (1.0, 0.0) coefficients (runs 4, 6). The runs with (0.5, 0.5) weights for content and structure (runs 3, 5) outperform on average the ones based only on content matching (runs 4, 6) (see Fig. 6, Fig. 7 and Table 3). This is true for all of the INEX 2005 evaluation measures (official and additional), independently of the quantization function used. This indicates (usual disclaimers apply) that the structural hints, and jointly, the modified editing
distance on the XML paths, increase the system retrieval performance (the gain varies from 2% up to 7% on the EP measure at a fixed GR value). Sequence Search Strategy. The SIRIUS official VVCAS runs have a high effort-precision at low values of gain-recall (Fig. 6, Table 3). This behaviour is due to the fact that the runs implement strict constraints for phrase searching and the topic set was rich in this kind of hints (7 among the 10 assessed topics used sequence search). The restrictive interpretation of the seq operator improved the system precision for the first ranked results (10% to 12%). The SIRIUS official VVCAS runs were ranked several times in the top ten runs for the first 10-25 retrieved results. This is important because these results are the most likely to be browsed by an end user. At the same time, this behaviour penalized the system's overall performance. One explanation could be that the system stops returning elements when running out of good answers – i.e. we implement a strict sequence search. This has important implications, as the system does not increase its information gain until reaching the limit of 1500 returned answers by returning imprecise/less perfect results (Fig. 7 – see the lower curves). This hypothesis is also supported by the system behaviour with respect to the strict and generalized quantization functions. Our runs were better ranked by all the official evaluation measures when the strict quantization function was used (Fig. 6b, Fig. 7b). This indicates that the retrieved results are both fully specific and highly exhaustive. We submitted and evaluated two additional runs (5, 6) allowing a flexible sequence search as described in section 3.6. The superior curves of the additional run 5, based on the same+ operator (Fig. 6, Fig. 7), show an important improvement.
We lose a small amount of precision for the first ranked results, but we obtain a clearly improved overall effort-precision/gain-recall curve (Fig. 6b). Table 3 shows a gain varying from 0.5% up to 1.8% on the MAep and iMAep measures. We could further improve these results by defining a new operator combining the same+ and seq ranking strategies. 4.3 Limits of Our Approach Returning small XML elements. XML elements containing a very limited amount of text (less than 3-6 characters), either possibly highly relevant such as atl, tig (article titles) and st (section titles), or meaningless to an end user such as it (italicized text) or sub (subscript), are frequent in the INEX ad-hoc collection [5].
The query optimizer, the query engine's central component, determines the best service execution plan based on QoWS, service ratings, and matching degrees.
Fig. 8. Mixed XML element content with italicized text (i.e. one of the SIRIUS ‘relevant’ answers for topic 244).
Elements of such small size are unlikely to satisfy the information need of a topic. Indeed, in INEX 2002 and 2003, longer elements were preferred by the assessors [5]. This is not beneficial to our approach, as we consider the XML element as the basic and implicitly valid unit of retrieval, regardless of its size. With this assumption, we return the query optimizer element (Fig. 8) as a relevant answer for topic 244. Returning overlapping elements: The highly mixed nature of the INEX ad-hoc collection may lead to cases where nested/overlapping XML elements are returned as valid results by our approach. For instance, the p (paragraph) and the it (italicized text) elements of the excerpt in Fig. 8 will both be retrieved by a request aiming to extract relevant elements for the “query” term. Our official runs (1, 2) have a set overlap⁶ ranging between 0.002-0.004 for the first 20 retrieved results and 0.014-0.015 at 1500 results.
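The set-overlap measure reported above can be sketched as follows (the containment predicate on element paths is a hypothetical stand-in for the node labelling scheme of section 3.4):

```python
def set_overlap(results, contains):
    """Fraction of results that contain, or are contained by, at
    least one other element of the result set."""
    if not results:
        return 0.0
    n_overlapping = sum(
        1 for i, r in enumerate(results)
        if any(contains(r, s) or contains(s, r)
               for j, s in enumerate(results) if i != j))
    return n_overlapping / len(results)

paths = ["/article[1]/sec[1]",
         "/article[1]/sec[1]/p[1]",   # nested in the first element
         "/article[2]/sec[1]"]
overlap = set_overlap(paths, contains=lambda a, b: b.startswith(a + "/"))
# two of the three elements overlap each other
assert abs(overlap - 2 / 3) < 1e-9
```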
5 Conclusions

We evaluated the retrieval performance of the lightweight XML indexing and approximate search scheme currently implemented in the SIRIUS XML IR system [6], [7], [8]. At INEX 2005, SIRIUS retrieves relevant XML elements by approximately matching both the content and the structure of the XML documents. A modified weighted editing distance on XML paths is used to approximately match the document structure, while strict and fuzzy searching based on the IDF of the searched terms is used for content matching. Our experiments show that taking into account the structural constraints improves the retrieval performance of the system, and jointly show the effectiveness of the proposed weighted editing distance on XML paths for this task. They also show that the approximate search inside XML elements implemented by our same+ operator greatly improves the overall ranking performance compared to a strict sequence search (seq operator), except at low recall values. The complementarity of the two operators calls for the design of a new matching operator based on their combination to further improve the retrieval performance of our system.

While designing our lightweight indexing and approximate XML search system we put forward performance and implementation simplicity. The SIRIUS structural match is well adapted to managing mismatches when writing constraints on XML paths involving complex conditions on attributes and attribute values [6]. Unfortunately, this was not experimented with in the INEX 2005 campaign. SIRIUS was designed to retrieve relevant XML documents by highlighting and maintaining the relevant fragments in document order (an approach explored by the INEX CO.FetchBrowse task). This year, at the VVCAS task, we evaluated only a subset of its functionalities and proved its ability to retrieve relevant XML elements. Still, we obtained average to good quality results in the range of the first 25 ranked answers, which is quite encouraging since the first ranked elements are the ones end users will most probably browse. Future research work will include: the semantic enrichment of the requests (at XML tag and query term levels) in order to improve the overall recall of the system, index models better suited for the approximate search scheme, and new matching operators.

6 Set overlap measures the percentage of elements that either contain or are contained by at least one other element in the set.
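The weighted editing distance on XML paths central to the structural matching can be sketched as a Levenshtein distance [12] over sequences of tag names. This is a minimal illustration, not the SIRIUS implementation: the uniform unit costs below are placeholders for the tag-dependent weights the system actually uses, and the function name is hypothetical.

```python
def path_edit_distance(p1, p2, ins=1.0, dele=1.0, sub=1.0):
    """Weighted Levenshtein distance between two XML paths, treated as
    sequences of tag names. Costs are illustrative; a weighted variant
    would make them depend on the tags being edited."""
    a, b = p1.strip("/").split("/"), p2.strip("/").split("/")
    n, m = len(a), len(b)
    # d[i][j] = cost of transforming the first i tags of a into the first j of b
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + dele
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + ins
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0.0 if a[i - 1] == b[j - 1] else sub
            d[i][j] = min(d[i - 1][j] + dele,      # delete a tag
                          d[i][j - 1] + ins,       # insert a tag
                          d[i - 1][j - 1] + cost)  # match / substitute
    return d[n][m]
```

Under these unit costs, a query path such as article/sec/p is at distance 1.0 from a document path article/bdy/sec/p (one tag deletion), so an element reached through a slightly different structure can still be ranked rather than discarded.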
Acknowledgements This work was partially supported by the ACIMD – ReMIX French grant (Reconfigurable Memory for Indexing Huge Amount of Data).
References

1. Lalmas M., INEX 2005 Retrieval Task and Result Submission Specification, June 20, 2005.
2. Kazai G., Lalmas M., INEX 2005 Evaluation Metrics, November 2005, INEX 2005.
3. Trotman A., Sigurbjörnsson B., Narrowed Extended XPath I (NEXI), In Proceedings of the INEX 2004 Workshop, pp. 16-40, 2004.
4. Sigurbjörnsson B., Trotman A., Geva S., Lalmas M., Larsen B., Malik S., INEX 2005 Guidelines for Topic Development, INEX 2005.
5. Kamps J., de Rijke M., Sigurbjörnsson B., The Importance of Length Normalization for XML Retrieval, Information Retrieval, 8(4):631-654, 2005.
6. Ménier G., Marteau P.F., Information retrieval in heterogeneous XML knowledge bases, The 9th International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems, 1-5 July 2002, Annecy, France.
7. Ménier G., Marteau P.F., PARTAGE: Software prototype for dynamic management of documents and data, ICSSEA, 29 Nov.-1 Dec. 2005, Paris.
8. Popovici E., Marteau P.F., Ménier G., Information Retrieval of Sequential Data in Heterogeneous XML Databases, AMR 2005, 28-29 July 2005, Glasgow.
9. Tai K.C., The tree-to-tree correction problem, J. ACM, 26(3):422-433, 1979.
10. Wang T.L.J., Shapiro B., Shasha D., Zhang K., Currey K.M., An algorithm for finding the largest approximately common substructures of two trees, IEEE Trans. Pattern Analysis and Machine Intelligence, 20(8), August 1998.
11. Levenshtein V., Binary Codes Capable of Correcting Deletions, Insertions and Reversals, Soviet Physics Doklady, 10:707-710, 1966.
12. Wagner R., Fischer M., The String-to-String Correction Problem, Journal of the ACM, 21(1):168-173, 1974.
13. Mignet L., Barbosa D., Veltri P., The XML Web: A First Study, WWW2003, May 2003, Budapest, Hungary.
14. Carmel D., Maarek Y.S., Mandelbrod M., Mass Y., Soffer A., Searching XML documents via XML fragments, SIGIR 2003, Toronto, Canada, pp. 151-158.
15. Fuhr N., Großjohann K., XIRQL: An XML query language based on information retrieval concepts, ACM TOIS, 22(2):313-356, April 2004.
16. Clark J., DeRose S., XML Path Language (XPath) Version 1.0, W3C Recommendation, 16 November 1999, http://www.w3.org/TR/xpath.html.
17. Porter M.F., An algorithm for suffix stripping, Program, 14(3):130-137, 1980.
18. Salton G., Buckley C., Term-weighting approaches in automatic text retrieval, Information Processing and Management, 24:513-523, 1988.