A Voting Method for XML Retrieval

Gilles Hubert 1, 2

1 IRIT/SIG-EVI, 118 route de Narbonne, 31062 Toulouse cedex 4
2 ERT34, Institut Universitaire de Formation des Maîtres, 56 av. de l’URSS, 31400 Toulouse
[email protected]

Abstract. This paper describes the retrieval approach proposed by the SIG/EVI group of the IRIT research centre for the INEX’2004 evaluation. The approach uses a voting method coupled with additional processes to answer content-only and content-and-structure queries. It builds on our previous work on automatic text categorization.

1 Introduction

The development of systems that search collections of XML (eXtensible Markup Language) documents [3] has become necessary as the use of XML grows. Consequently, a growing number of systems aim to retrieve relevant components from XML documents. XML retrieval systems need to take into account both content and structural aspects. Given the variety of proposed XML retrieval systems, it is interesting to evaluate their effectiveness. To this end, the INitiative for the Evaluation of XML retrieval (INEX) provides a testbed and scoring methods that allow participants to evaluate and compare their results.

The approaches underlying the systems participating in INEX can be classified into two categories [5]: model-oriented approaches and system-oriented approaches. Model-oriented approaches notably include approaches based on language models [11], [8], [1] or other probabilistic models [14], which obtained good results in 2003. System-oriented approaches extend textual document retrieval systems by adding XML-specific processing. Various systems in this category [10], [6], [13], [16] obtained good results in 2003.

In this paper, we present an IR approach initially applied to the automatic categorization of structured documents according to concept hierarchies, and its evolution for XML retrieval, notably within the context of INEX. Section 2 briefly presents the 2004 edition of the INEX initiative. Section 3 presents the context in which the method was initiated and its first application within INEX in 2003. The evolutions made to this approach for INEX 2004 are described in section 4. Section 5 presents the submitted runs and the results obtained. Section 6 concludes with an analysis of the experiment and directions for future work.

2 The INEX initiative

2.1 Collection

The INEX documents correspond to approximately 12,000 articles from the IEEE Computer Society’s publications from 1995 to 2002, marked up in XML. All the documents conform to the same DTD. The collection gathers over eight million XML elements of varying length and granularity (e.g. title, paragraph or article).

2.2 Queries

INEX introduces two types of queries:
− CO (Content Only) queries describe the expected content of the XML elements to retrieve.
− CAS (Content and Structure) queries combine content and explicit references to the XML structure using a variant of XPath [4]. CAS topics contain indications about the structure of the expected XML elements and about the location of the expected content.

Both CO and CAS topics are made up of four parts: topic title, topic description, narrative and keywords. Within the ad-hoc retrieval task, two types of tasks are defined: (1) the CO task, using CO queries, and (2) the VCAS task, using CAS queries, for which the structural constraints are considered as vague conditions.

3 A Voting method in information retrieval

The approach we proposed is derived from a process we first defined for textual document categorisation [7], [2]. Document categorisation aims to link documents with pre-defined categories. Our approach focuses on categories organised as a taxonomy. The original aspect of our approach is that it involves a voting principle instead of a classical similarity computation. The association of a text with categories is based on the Vector Voting method [12]. The voting process evaluates the importance of the association between a given text and a given category. This method is similar to the HVV method (Hyperlink Vector Voting) used within the Web context to compute the relevance of a Web page with regard to the web sites referring to it [9]. In our context, the initial strategy considers that the more the category terms appear in the text, the stronger the link between the text and this category. Thus, this method relies on terms describing each category and their automatic extraction from the document to be categorised. The result is a list of categories annotating each document.

For INEX’2003, this categorisation process was applied as is. Every XML component was processed as a complete document. Every topic was considered as a category of a flat taxonomy. The result was a list of topics corresponding to each XML component. It was then reversed and reordered to fit the INEX result format. The results obtained for the submitted runs [15] led us to adapt the process to retrieval. The axes of this evolution have been as follows:
− invert the voting process to estimate the relevance of each XML component according to each topic,
− modify the voting function to take into account the great variations of element sizes and to handle topics rather than categories,
− integrate the aggregation aspect of an XML element (i.e. elements composed of relevant elements),
− integrate structural constraint processing for CAS topics.

4 Evolution of the voting method within INEX

4.1 INEX collection pre-processing

From the INEX collection point of view, the documents are considered as sets of text chunks identified by xpaths. For each XML component, concepts are extracted automatically and saved with the xpath identifying the XML component in which they appear and their number of occurrences in the component. Concept extraction notably involves stop word removal. Optionally, some processes can be applied to concepts, such as stemming using Porter’s algorithm. For the INEX'2004 experiments, all XML tags except text formatting tags (bold, italic, underline) have been taken into account.

From the topic point of view, although our method can use all the parts constituting CO and CAS topics, we used only the title part for the INEX'2004 experiments, as requested. For both topic types, stop words are removed and terms can optionally be stemmed using Porter’s algorithm.
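The pre-processing of section 4.1 can be illustrated with a minimal Python sketch. It is not the system used for the runs: the stop word list, the formatting tag names (b, i, u) and the helper names are hypothetical, stemming is left out, and the sketch assumes that the text of a component includes the text of its descendants.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical, deliberately tiny stop word list (the paper does not give its list).
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "for", "to", "is"}

# Text formatting tags are not indexed as components (tag names are assumed).
FORMATTING_TAGS = {"b", "i", "u"}


def tokenize(text):
    """Lower-case the text, keep alphabetic tokens and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def index_components(xml_string):
    """Map the xpath of each XML component to the term counts of its text."""
    root = ET.fromstring(xml_string)
    index = {}

    def visit(element, path):
        # itertext() yields the text of the element and of all its descendants.
        index[path] = Counter(tokenize(" ".join(element.itertext())))
        seen = Counter()  # position of each child among siblings with the same tag
        for child in element:
            if child.tag in FORMATTING_TAGS:
                continue
            seen[child.tag] += 1
            visit(child, f"{path}/{child.tag}[{seen[child.tag]}]")

    visit(root, f"/{root.tag}[1]")
    return index


doc = (
    "<article><title>XML retrieval</title>"
    "<sec><p>Voting for <b>XML</b> elements</p></sec></article>"
)
index = index_components(doc)
for xpath, counts in index.items():
    print(xpath, dict(counts))
```

Each entry of the resulting index pairs an xpath with the occurrence counts of its terms, which is the information the voting function needs for F(t,E).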

4.2 Voting function

The voting function must take into account the importance in the XML element of each term describing the topic and the importance of each term in the topic representation. We have studied different voting functions and the one providing the best results is described as follows:

\[
\mathrm{Vote}(E, T) = \sum_{t \in T} F(t, E) \cdot \frac{F(t, T)}{S(T)}
\]

where T is the topic and E is an XML element.

The factor F(t,E) measures the importance of the term t in the XML element E. F(t,E) corresponds to the number of occurrences of the term t in the element E.

The factor F(t,T)/S(T) measures the importance of the term t in the topic representation T. F(t,T) corresponds to the number of occurrences of the term t in the topic T and S(T) corresponds to the size (number of terms) of T.
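As an illustration, the voting function can be written in a few lines of Python. This is a sketch under stated assumptions: the element is represented by a dictionary of term counts, the topic by a list of terms, and S(T) is taken as the total number of term occurrences in the topic. The function name and the sample data are hypothetical.

```python
from collections import Counter


def vote(element_terms, topic_terms):
    """Vote(E, T) = sum over t in T of F(t, E) * F(t, T) / S(T)."""
    topic_counts = Counter(topic_terms)  # F(t, T) for each distinct term t
    topic_size = len(topic_terms)        # S(T), assumed to count every occurrence
    return sum(
        element_terms.get(term, 0) * count / topic_size
        for term, count in topic_counts.items()
    )


# Element containing "xml" three times, "retrieval" once and "voting" twice.
element = Counter({"xml": 3, "retrieval": 1, "voting": 2})
topic = ["xml", "retrieval", "evaluation"]
print(vote(element, topic))  # (3 + 1 + 0) / 3, i.e. about 1.33
```

Terms of the element that do not belong to the topic ("voting" here) do not contribute, and topic terms absent from the element contribute zero.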

The voting function combines two factors: the presence of a term in the element and the importance of this term in the topic.

4.3 Scoring function

The voting function is coupled with a third factor representing the importance of the topic presence within the XML element. The final function (scoring function) that computes the score of an XML element regarding a given topic is the following:

\[
\mathrm{Score}(E, T) = \mathrm{Vote}(E, T) \cdot f\!\left(\frac{NT(T, E)}{S(T)}\right)
\]

where the factor NT(T,E)/S(T) measures the presence rate of terms representing the topic in the text (importance of the topic). S(T) corresponds to the number of terms in the topic representation T and NT(T,E) corresponds to the number of terms of the topic T that appear in the XML element E.

Applying a function ƒ to the third factor (i.e. the presence rate of terms representing the topic in the text) aims at varying the influence of this factor on the scoring function. We tried different functions ƒ; for example, the initial function was the exponential:

\[
f\!\left(\frac{NT(T, E)}{S(T)}\right) = e^{\frac{NT(T, E)}{S(T)}}
\]
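The scoring function of section 4.3 can be sketched as follows, here with the initial exponential choice for ƒ. The function and parameter names are hypothetical, and the vote is passed in as a precomputed number so that the example stays self-contained.

```python
import math


def score(vote_value, matched_terms, topic_size, f=math.exp):
    """Score(E, T) = Vote(E, T) * f(NT(T, E) / S(T))."""
    presence_rate = matched_terms / topic_size  # NT(T, E) / S(T)
    return vote_value * f(presence_rate)


# Topic of 3 terms, 2 of which occur in the element, with a vote of 4/3.
print(score(4 / 3, matched_terms=2, topic_size=3))  # (4/3) * e^(2/3), about 2.60
```

Passing another callable as f changes how strongly the presence rate influences the final score without touching the vote itself.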

4.4 Additional processes for both CO and CAS topics

The scoring function is completed with the notion of coverage. The aim of the coverage is to ensure that only documents in which the topic is sufficiently represented are selected for this topic. The coverage is a threshold corresponding to the percentage of terms from a topic that appear in a text. For example, a coverage of 50% implies that at least half of the terms describing a topic have to appear in the text of a document for it to be selected.

\[
\mathrm{Score}(E, T) =
\begin{cases}
\mathrm{Vote}(E, T) \cdot f\!\left(\dfrac{NT(T, E)}{S(T)}\right) & \text{if } \dfrac{NT(T, E)}{S(T)} \geq CT \\
0.0 & \text{otherwise}
\end{cases}
\]

where CT is a real constant (CT ≥ 0.0) corresponding to the coverage threshold.

The hierarchical structure of XML has to be taken into account. The hypothesis on which our system is based is that an element containing a component selected as relevant is also relevant. Our system takes this hypothesis into account by propagating the score of an element to the elements it composes. The score propagated to the composed elements is decreased by applying a reducing factor.

∀ Ea ancestor of E and

d ( Ea , E ) ⋅ α 1.0

where R is the location path (xpath) of the element E from the root of the document X is the location path (xpath) defined as the target constraint in the topic
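The coverage rule of section 4.4 can be sketched in Python as below. The names are hypothetical, the exponential is used for ƒ only as an example, and the score propagation to ancestor elements is deliberately left out because its formula is not recoverable from the text above.

```python
import math


def covered_score(vote_value, matched_terms, topic_size, coverage_threshold, f=math.exp):
    """Return Vote * f(NT/S) when NT/S >= CT, and 0.0 otherwise."""
    presence_rate = matched_terms / topic_size  # NT(T, E) / S(T)
    if presence_rate < coverage_threshold:
        return 0.0  # the topic is not represented enough in the element
    return vote_value * f(presence_rate)


# Topic of 4 terms with a coverage threshold of 50%.
print(covered_score(3.0, matched_terms=1, topic_size=4, coverage_threshold=0.5))  # 0.0
print(covered_score(3.0, matched_terms=2, topic_size=4, coverage_threshold=0.5))  # 3 * e^0.5
```

With a threshold of 50%, an element containing one of four topic terms is discarded whatever its vote, while an element containing two of them keeps its score.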

5 Experiments

5.1 Experiment setup

Our experiments aim at evaluating the effectiveness of the changes made to the voting function and of the coefficient adjustments resulting from training performed on the INEX’2003 assessment testbed. The training phase only concerns system processes applied to both CO and CAS topics.

Three runs based on the voting method were submitted to INEX'2004. Two runs were performed on CO topics and one run was performed on CAS topics. The runs on CO topics differ in the function f used in the scoring function.

The run labelled VTCO2004TC35xp400sC-515 uses the function:

\[
\mathrm{Score}(E, T) = \mathrm{Vote}(E, T) \cdot \varphi^{\frac{NT(T, E)}{S(T)}}
\]

where ϕ=400.

The run labelled VTCO2004TC35p4sC-515 uses the function:

\[
\mathrm{Score}(E, T) = \mathrm{Vote}(E, T) \cdot \left(\frac{NT(T, E)}{S(T)}\right)^{\lambda}
\]

where λ=4.

The run on CAS topics labelled VTCAS2004C35xp200sC-515PP1 uses the function:

\[
\mathrm{Score}(E, T) = \mathrm{Vote}(E, T) \cdot \varphi^{\frac{NT(T, E)}{S(T)}}
\]

where ϕ=200.
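To give an idea of how the two families of functions behave, the following Python sketch compares them for a few presence rates. It assumes that the function of the "xp" runs is ϕ raised to the power NT(T,E)/S(T) and that the function of the "p4" run is NT(T,E)/S(T) raised to the power λ. The function names are hypothetical.

```python
def f_exponential(rate, phi):
    """phi ** (NT/S), the form assumed for the 'xp' runs (phi = 400 or 200)."""
    return phi ** rate


def f_power(rate, lam):
    """(NT/S) ** lambda, the form assumed for the 'p4' run (lambda = 4)."""
    return rate ** lam


# Presence rates NT(T, E) / S(T) from half of the topic terms to all of them.
for rate in (0.5, 0.75, 1.0):
    print(rate, f_exponential(rate, 400), f_power(rate, 4))
```

Under these assumptions, an element containing all the topic terms receives a multiplier about 20 times larger than an element containing half of them with the exponential form, and 16 times larger with the power form, so both settings strongly favour elements that cover most of the topic.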

The coefficient taking into account structural predicates associated to searched concepts was fixed to 1.0 (i.e. the vote of an element regarding a given concept is doubled when the element matches the structural constraint associated to the concept). The coefficient taking into account structural predicates for expected results was fixed to 2.0 (i.e. the score of an element matching the structural predicate is doubled). The values of these two coefficients were fixed arbitrarily.

For all submitted runs the other parameters of the scoring function were the same. Coverage threshold was fixed to 35% (i.e. more than a third of terms describing the topic must appear in the text to keep the XML component). Coefficients applied to take into account the signs '+' and '-' used to emphasise a concept or to denote an unwanted one were fixed to:
− +5.0 for concepts marked with '+' (the vote of these concepts increases the score of the elements in which they appear),
− -5.0 for concepts marked with '-' (the vote of these concepts reduces the score of the elements in which they appear),
− 1.0 for unmarked concepts.

The coefficient α used to propagate a component score through the hierarchical structure of the XML document was fixed to 0.1. The values of the parameters are those which gave the best results during a training phase done with INEX’2003 CO topics.

5.2 Results

The following table shows the preliminary results of the three runs based on the voting method:

Table 1. Results of the 3 runs performed using the voting method

Run                         Aggregate score   Rank
VTCO2004TC35xp400sC-515     0.0783            13/70
VTCO2004TC35p4sC-515        0.0775            15/70
VTCAS2004TC35xp200sC-515    0.0784            5/51

The results of the two runs for CO topics are detailed in the following table:

Table 2. Detailed results of the 2 runs for CO topics

               VTCO2004TC35xp400sC-515       VTCO2004TC35p4sC-515
Quantisation   Average precision   Rank      Average precision   Rank
strict         0.0778              18/70     0.0759              19/70
generalised    0.0683              14/70     0.0682              15/70
so             0.0559              16/70     0.0564              15/70
s3_e321        0.0395              22/70     0.0400              21/70
s3_e32         0.0508              17/70     0.0508              17/70
e3_s321        0.1456              10/70     0.1424              11/70
e3_s32         0.1106              11/70     0.1083              13/70

For CO topics, the run that obtained the best results is the run labelled VTCO2004TC35xp400sC-515. The best measures were obtained with the e3s321 quantisation: the average precision is equal to 0.1456, placing the run at the 10th rank. The run labelled VTCO2004TC35p4sC-515 obtained slightly lower values for most of the quantisations. Only the best results obtained for CO topics are presented in the following graphs, that is to say those of run VTCO2004TC35xp400sC-515 for the e3s321 quantisation.

Fig. 1. Precision/Recall curve of the CO run labelled VTCO2004TC35xp400sC-515 for e3s321 quantisation

Fig. 2. Rank of the CO run labelled VTCO2004TC35xp400sC-515 for e3s321 quantisation

For CAS topics, the run VTCAS2004TC35xp200sC-515PP1 was ranked in 5th place. The results of the run are detailed in the following table:

Table 3. Detailed results of the run for CAS topics

               VTCAS2004TC35xp200sC-515PP1
Quantisation   Average precision   Rank
strict         0.1053              5/51
generalised    0.0720              6/51
so             0.0554              9/51
s3_e321        0.0462              12/51
s3_e32         0.0644              10/51
e3_s321        0.1162              5/51
e3_s32         0.0892              5/51

The best measures were obtained for the strict, e3s321 and e3s32 quantisations, for which the run is ranked 5th. The following figures present the results corresponding to the strict quantisation and the e3s321 quantisation.

Fig. 3. Precision/Recall curve of the VCAS run labelled VTCAS2004TC35xp200sC-515PP1 for strict quantisation

Fig. 4. Rank of the VCAS run labelled VTCAS2004TC35xp200sC-515PP1 for strict quantisation

Fig. 5. Precision/Recall curve of the VCAS run labelled VTCAS2004TC35xp200sC-515PP1 for e3s321 quantisation

Fig. 6. Rank of the VCAS run labelled VTCAS2004TC35xp200sC-515PP1 for e3s321 quantisation

6 Discussion and future work

Regarding the experiments that were performed and the results obtained, we can notice that:
− The chosen functions and parameters for the scoring method tend to support exhaustivity rather than specificity. Indeed, the factor measuring the representation of the topic (i.e. NT(T,E)/S(T)) dominates in the scoring function and this factor is related to exhaustivity. It would be interesting to modify the scoring function to increase the number of elements judged as relevant regarding specificity.
− The measures obtained using INEX’2003 CO topics were globally better. This suggests that our scoring method is more effective on certain queries. It would be interesting to identify a class (or classes) of queries for which the function works better, a class (or classes) of queries for which the function is less effective, and to understand why. The function could evolve to extend its effectiveness to other kinds of queries, or different functions could be applied to different query classes.
− The values of the coefficients applied for structural constraint matching have been fixed arbitrarily. Additional experiments on INEX’2004 CAS topics will help us to adjust the values of these coefficients.
− The benefit of adding a relevance feedback process to our method remains to be evaluated. On one hand, feedback from the first ranked elements of the assessments can be performed. This is the process chosen this year in the relevance feedback track. On the other hand, we plan to integrate a feedback process using the first ranked elements of a first search using our system.

Acknowledgments

Research outlined in the paper is part of the project QUEST: Query reformulation for structured document retrieval, PAI Alliance N°05768UJ. However, this publication only reflects the author’s view.

References

1. Abolhassani, M., Fuhr, N.: Applying the Divergence from Randomness Approach for Content-Only Search in XML Documents. 26th European Conference on IR Research (ECIR), Lecture Notes in Computer Science vol. 2997 (2004) 409-419
2. Augé, J., Englmeier, K., Hubert, G., Mothe, J.: Catégorisation automatique de textes basée sur des hiérarchies de concepts. 19ième Journées de Bases de Données Avancées (BDA), Lyon (2003) 69-87
3. Bray, T., Paoli, J., Sperberg-McQueen, C. M., Maler, E., Yergeau, Y.: Extensible Markup Language (XML) 1.0 (Third Edition). W3C Recommendation, http://www.w3.org/TR/REC-xml/ (2004)
4. Clark, J., DeRose, S.: XML Path Language (XPath). W3C Recommendation, http://www.w3.org/TR/xpath.html (1999)
5. Fuhr, N., Maalik, S., Lalmas, M.: Overview of the INitiative for the Evaluation of XML Retrieval (INEX) 2003. Proceedings of the Second INEX Workshop, Dagstuhl, Germany (2004) 1-11
6. Geva, S., Leo-Spork, M.: XPath Inverted File for Information Retrieval. INEX 2003 Workshop Proceedings (2003) 110-117
7. IRAIA: Getting Orientation in Complex Information Spaces as an Emergent Behaviour of Autonomous Information Agents. European Information Societies Technology, IST-1999-10602 (2000-2002)
8. Kamps, J., de Rijke, M., Sigurbjörnsson, B.: Length normalization in XML retrieval. Proceedings of the 27th International Conference on Research and Development in Information Retrieval (SIGIR), New York NY, USA (2004) 80-87
9. Li, Y.: Toward a qualitative search engine. IEEE Internet Computing, vol. 2 n°4 (1998) 24-29
10. List, J., Mihajlovic, V., de Vries, A. P., Ramirez, G., Hiemstra, D.: The TIJAH XML-IR system at INEX 2003. INEX 2003 Workshop Proceedings (2003) 102-109
11. Ogilvie, P., Callan, J.: Using Language Models for Flat Text Queries in XML Retrieval. Proceedings of the Second INEX Workshop, Dagstuhl, Germany (2004) 12-18
12. Pauer, B., Holger, P.: Statfinder. Document Package Statfinder, Vers. 1.8 (2000)
13. Pehcevski, J., Thom, J., Vercoustre, A.M.: Enhancing Content-And-Structure Information Retrieval using a Native XML Database. Proceedings of The First Twente Data Management Workshop on XML Databases and Information Retrieval (TDM'04), Enschede, The Netherlands (2004)
14. Piwowarski, B., Vu, H.-T., Gallinari, P.: Bayesian Networks and INEX'03. Proceedings of the Second INEX Workshop, Dagstuhl, Germany (2003) 33-37
15. Sauvagnat, K., Hubert, G., Boughanem, M., Mothe, J.: IRIT at INEX 2003. Proceedings of the Second INEX Workshop, Dagstuhl, Germany (2003) 142-148
16. Trotman, A., O'Keefe, R. A.: Identifying and Ranking Relevant Document Elements. INEX 2003 Workshop Proceedings (2003) 149-154