Theoretical evaluation of XML Retrieval

Theoretical evaluation of XML Retrieval Tobias Blanke University of Glasgow

February 23, 2009


Outline

1. XML Retrieval
2. Theoretical Evaluation
3. Theoretical Evaluation of INEX XML Retrieval Models
4. XML Retrieval Evaluation Evaluation
5. Conclusion


Why bother with theory?

What does it have to do with the Credit Crunch?
Banks and hedge funds rely on highly paid mathematicians and economists, the "quants" (quantitative analysts), to evaluate risk.
They use empirical mathematical techniques that model human behaviour.
The models chosen are not necessarily the best ones, but those that are accepted by most participants: strength in numbers (http://news.bbc.co.uk/2/hi/business/7815994.stm).


Sometimes it is good to look again at the basics and behind the formulas.
For that we need a framework and a new level of abstraction.
We suggest a general framework to theoretically evaluate XML retrieval.


Part I XML Retrieval


XML retrieval

XML retrieval systems aim to provide effective access to XML document repositories. XML retrieval uses the logical structure of elements and edges between them to return more precise results to user information needs.


Evaluation of XML retrieval: INEX

INEX aims to promote research and stimulate development of XML information access and retrieval.
It is a collaborative effort.
INEX has allowed a new community in XML information access to emerge.


Relevance in XML retrieval

Aim: retrieve not only relevant elements, but those at the appropriate level of granularity: the smallest component (specificity) that is highly relevant (exhaustivity).
Specificity: the extent to which a document component is focused on the information need, while being an informative unit.
Exhaustivity: the extent to which the information contained in a document component satisfies the information need.


Part II Theoretical Evaluation

Background
Aboutness
Formalism
Rules
Reflection
Pure Type XML Retrieval


Background


Theoretical evaluation is nothing new.

Frameworks
Many frameworks exist, e.g. Embedding, Meta-theory (Probability).
Logic-based evaluation approaches to IR are generally based on Cooper's definition [4] of 'logical relevance': d → q

Aboutness Approaches
Huibers and Bruza: Investigating aboutness axioms using information fields [2]
Huibers: An Axiomatic Theory for Information Retrieval [6]
Wong, Song, Bruza: Application of aboutness to functional benchmarking in information retrieval [12]


Aboutness

Theoretical Benchmarks of Aboutness

Reason abstractly about the properties of a document:
Documents have properties.
These are objective in the sense of coming from the object rather than from the user (the subject):
"In discussing aboutness we come from the very concrete notion that index terms represent properties of documents, which we are making more abstract, whereas with relevance we have a very abstract notion which we make more concrete." [10]


Formalism

Evaluation steps

1. A formalism to translate the information representation of a particular model into a formal symbolic representation.
2. A set of reasoning rules to describe the functional behaviour of XML retrieval aboutness and to discriminate between specificity and general relevance reasoning.
3. A further investigation of the boundaries of aboutness for particular retrieval systems, called Reflection. It defines typical non-reasoning-related boundary elements of retrieval systems.
4. A comparison of the formal characteristics of an XML retrieval model with its flat document equivalent and with pure type XML retrieval. This qualifies the impact of XML structure on the aboutness behaviour.


Step 1: The Translation

Theoretical benchmarks concern the formal representation of qualitative properties of IR models.
Situation Theory:
  Situations are partial descriptions of the world and are composed of information items formalised as infons.
  Queries and documents are modelled as situations.
  XML elements are situations.
  Infons represent a model's information items, such as keywords, phrases or structural relationships.


The Map Operator


Simple Examples
⟨⟨k⟩⟩ → {⟨⟨k_1⟩⟩, ..., ⟨⟨k_n⟩⟩}
⟨⟨R, i_1, ..., i_n⟩⟩
  map(D) ≡ {⟨⟨house⟩⟩, ⟨⟨garden⟩⟩}
  map(D) ≡ {⟨⟨ElementType, Paragraph, p⟩⟩, ⟨⟨Value, garden, p⟩⟩}
  map(D) ≡ {⟨⟨ElementType, Section, s⟩⟩, ⟨⟨Parent, s, p_1⟩⟩, ⟨⟨Parent, s, p_2⟩⟩, ...}

Using Polarities
{⟨⟨ElementType, Paragraph, p; 1⟩⟩, ⟨⟨Value, garden, p; 1⟩⟩}
{⟨⟨ElementType, Paragraph, p; 1⟩⟩, ⟨⟨Value, garden, p; 0⟩⟩}
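The map operator can be illustrated with a small Python sketch. It is only one possible encoding, not the author's implementation: the names Infon and map_paragraph are hypothetical, and representing a situation as a frozen set of infons is an assumption made for illustration.

```python
from typing import FrozenSet, Iterable, NamedTuple, Tuple


class Infon(NamedTuple):
    """One information item: a relation, its arguments and a polarity."""
    relation: str
    args: Tuple[str, ...]
    polarity: int = 1  # 1: the item holds, 0: it does not hold


def map_paragraph(element_id: str, keywords: Iterable[str]) -> FrozenSet[Infon]:
    """Hypothetical map operator for a single paragraph element."""
    infons = {Infon("ElementType", ("Paragraph", element_id))}
    infons.update(Infon("Value", (k, element_id)) for k in keywords)
    return frozenset(infons)


# The paragraph p that mentions "garden", as in the example above.
situation = map_paragraph("p", ["garden"])
# The same keyword infon with polarity 0 is a different infon.
negated = Infon("Value", ("garden", "p"), polarity=0)
```

Because the polarity is part of the infon, a situation stating that "garden" holds for p is distinct from one stating that it does not.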


Framework

Situation Theory as a logic of information:
Documents and queries are situations: D □⇝ Q (D is about Q).
Situations can be composed: D ⊗ D1.
Containment (explicit or implicit): D → D1. In Boolean retrieval this corresponds to, e.g., the implication that for any valid expression x ∧ y, x is also valid.
Preclusion: D ⊥ D1 could lead to anti-aboutness: D ⊠ D1. The information cannot be meaningfully combined, e.g. penguins are not birds that fly.


Framework for XML Retrieval

Exhaustivity and Specificity Aboutness
Following Chiaramella's distinction of specificity and exhaustivity [3]:
Exhaustivity: D □⇝ Q
Specificity: Q □⇝ D

Chiaramella's MetaTheory
1. Find:
   1. Exhaustivity: preselection of document components (dc): Di ⊆ Q
   2. Specificity: selection of minimal dc: Q ⊆ Di
2. If both conditions are matched, Di is an exact match for Q.
3. Else: the process recursively looks for the minimal dc until condition 1 is no longer fulfilled.
4. Result: minimal dc
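The recursive search for a minimal document component can be sketched as follows. This is a simplified illustration rather than Chiaramella's formulation: the Component class, the covers predicate (a component passes condition 1 when it contains every query term) and the toy document are assumptions introduced here.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set


@dataclass
class Component:
    """A document component (dc) with its terms and child components."""
    name: str
    terms: Set[str]
    children: List["Component"] = field(default_factory=list)


def covers(dc: Component, query: Set[str]) -> bool:
    """Assumed condition 1: the component contains every query term."""
    return query <= dc.terms


def minimal_components(
    dc: Component,
    query: Set[str],
    cond: Callable[[Component, Set[str]], bool] = covers,
) -> List[Component]:
    """Descend while condition 1 holds; return the smallest such components."""
    if not cond(dc, query):
        return []
    smaller = [m for child in dc.children
               for m in minimal_components(child, query, cond)]
    # If no child still fulfils the condition, dc itself is minimal.
    return smaller or [dc]


p1 = Component("p1", {"xml"})
p2 = Component("p2", {"retrieval"})
sec1 = Component("sec1", {"xml", "retrieval"}, [p1, p2])
sec2 = Component("sec2", {"logic"})
article = Component("article", {"xml", "retrieval", "logic"}, [sec1, sec2])

result = [c.name for c in minimal_components(article, {"xml", "retrieval"})]
```

In this toy document the article and sec1 both fulfil the condition, but neither paragraph does on its own, so sec1 is returned as the minimal component.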


Rules

Step 2: Rules
Rules are a logical representation of how a system decides that a document is about a query.
They do not hold for all aboutness decisions, but only for particular ones.
A system supports a rule either fully, not at all, or conditionally.

Left Monotonic Union (LMU)
     D □⇝ Q
  -------------
  D ⊗ D1 □⇝ Q

Right Monotonic Union (RMU)
     D □⇝ Q
  -------------
  D □⇝ Q ⊗ Q1


Reflection

Step 3: Reflection
Huibers [6] developed a theoretical Reflection as a means to find typical boundaries of retrieval models. These typical aspects are shared among all models and are therefore qualitative properties that, like the reasoning rules, can be used to compare aboutness behaviour.
1. A top document Dj is always about any query Q: {Dj | Dj ∈ D, Dj □⇝ Q}.
2. A top query Qj is one any document D is about: {Qj | Qj ∈ Q, D □⇝ Qj}.
3. A bottom document Dj is never about any query Q: {Dj | Dj ∈ D, ¬(Dj □⇝ Q)}.
4. A bottom query Qj is one no document D is ever about: {Qj | Qj ∈ Q, ¬(D □⇝ Qj)}.
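For a finite collection, the four boundary sets can be enumerated directly. The sketch below is illustrative only: it assumes situations are frozen sets of terms and uses a hypothetical overlap-based aboutness decision (overlap_about), which is not the decision of any particular INEX system.

```python
from typing import Callable, FrozenSet, List

Situation = FrozenSet[str]
About = Callable[[Situation, Situation], bool]


def overlap_about(doc: Situation, query: Situation) -> bool:
    """Assumed aboutness decision: document and query share a term."""
    return bool(doc & query)


def top_documents(docs: List[Situation], queries: List[Situation], about: About):
    """Documents that are about every query in the sample."""
    return [d for d in docs if all(about(d, q) for q in queries)]


def bottom_documents(docs: List[Situation], queries: List[Situation], about: About):
    """Documents that are about no query in the sample."""
    return [d for d in docs if not any(about(d, q) for q in queries)]


def top_queries(docs: List[Situation], queries: List[Situation], about: About):
    """Queries that every document in the sample is about."""
    return [q for q in queries if all(about(d, q) for d in docs)]


def bottom_queries(docs: List[Situation], queries: List[Situation], about: About):
    """Queries that no document in the sample is about."""
    return [q for q in queries if not any(about(d, q) for d in docs)]


d1, d2, d3 = frozenset({"a", "b"}), frozenset({"a"}), frozenset()
q1, q2 = frozenset({"a"}), frozenset({"b"})
docs, queries = [d1, d2, d3], [q1, q2]
```

Under this toy decision the empty document d3 is a bottom document, which matches the intuition that a retrieval model's boundaries are often degenerate situations.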


Pure Type XML Retrieval

Step 4: Pure Type XML Retrieval
XML enforces a hierarchical information representation, a hierarchical inclusion ⊴ between components of the XML documents.
Pure Type XML Retrieval is the aboutness system that has purely this hierarchical inclusion as its aboutness operator.
  One can map an XML document to an XML tree by defining ancestors and descendants of XML elements, with the document being the root of the XML tree: if in a document D a document component D1 is contained by component D2, then in the corresponding XML tree D1 will be a descendant of D2. Ancestors can be defined analogously.

Definition. A document D represented by an XML tree A is about a query Q represented by XML tree B if and only if A ⊴ B, i.e., the information contained in tree B is also contained in tree A.


Complete Theoretical Evaluation
1. Formalism and Translation
2. Rules
3. Reflection


Part III Theoretical Evaluation of INEX XML Retrieval Models

XML Language Modelling
XML Vector Space Retrieval
Experimental Behaviour
Specificity Aboutness


XML Language Modelling


XML Language Modelling Aboutness Definition

Algorithm
1. Allocate the most informative XML elements across separate indexes.
2. $$P(t_i \mid e) = \lambda_e \cdot P_{mle}(t_i \mid e) + \lambda_d \cdot P_{mle}(t_i \mid d) + (1 - \lambda_e - \lambda_d) \cdot P_{mle}(t_i)$$ [11]

Aboutness Decision
Translation: map_sec(D) ≡ {⟨⟨ElementType, e, i⟩⟩, ⟨⟨Value, t, i⟩⟩ | e ∈ {sec}}
Aboutness Definition: D □⇝ Q if and only if P(t_i | e) > θ. The threshold θ is the smoothing value, which is the collection language model (1 − λ_e − λ_d) · P_mle(t_i).
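The smoothed element language model and its threshold can be tried out numerically. The sketch below is a minimal illustration under assumptions: the function names, the token lists and the default weights of 0.4 for the element and document models are hypothetical and not taken from [11].

```python
from collections import Counter
from typing import Sequence


def p_mle(term: str, tokens: Sequence[str]) -> float:
    """Maximum likelihood estimate of a term in a token sequence."""
    return Counter(tokens)[term] / len(tokens) if tokens else 0.0


def p_smoothed(term, element, document, collection, lam_e=0.4, lam_d=0.4):
    """Interpolate element, document and collection language models."""
    lam_c = 1 - lam_e - lam_d
    return (lam_e * p_mle(term, element)
            + lam_d * p_mle(term, document)
            + lam_c * p_mle(term, collection))


def is_about(term, element, document, collection, lam_e=0.4, lam_d=0.4):
    """Aboutness holds when the score exceeds the collection-only share."""
    theta = (1 - lam_e - lam_d) * p_mle(term, collection)
    return p_smoothed(term, element, document, collection, lam_e, lam_d) > theta


element = ["garden", "house"]
document = ["garden", "house", "tree", "tree"]
collection = document + ["car"] * 4
```

With these toy values "garden" scores 0.325 and passes the threshold, while "car", which occurs only in the rest of the collection, scores exactly θ and does not. The threshold therefore separates terms that occur in the element or its document from terms that do not.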


XML Language Modelling Rules

Left Monotonic Union (LMU)
     D □⇝ Q
  -------------
  D ⊗ D1 □⇝ Q

Proof. We can show that P(t_i | e) > θ ⇔ A ∩ B ≢ ∅, where D ≡ map(A) and Q ≡ map(B). Let us assume that D ≡ map(A), Q ≡ map(B) and D ⊗ D1 ≡ map(C). From the premise D □⇝ Q and this proposition, we have A ∩ B ≢ ∅; by composition, C ⊇ A. Thus C ∩ B ≢ ∅, and LMU is unconditionally supported.
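The argument can be checked by brute force on a small vocabulary. The sketch below assumes that aboutness reduces to a non-empty term overlap and that composition is set union; both are simplifications for illustration, and an exhaustive check over three terms is a sanity test, not a proof.

```python
from itertools import combinations
from typing import FrozenSet, Iterable, List


def powerset(vocabulary: Iterable[str]) -> List[FrozenSet[str]]:
    """All subsets of a small vocabulary, used as candidate situations."""
    items = sorted(vocabulary)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]


def about(doc: FrozenSet[str], query: FrozenSet[str]) -> bool:
    """Assumed aboutness decision: non-empty overlap of terms."""
    return bool(doc & query)


def compose(doc: FrozenSet[str], other: FrozenSet[str]) -> FrozenSet[str]:
    """Assumed composition of situations: set union."""
    return doc | other


def lmu_holds(vocabulary: Iterable[str]) -> bool:
    """Check Left Monotonic Union for every triple of situations."""
    situations = powerset(vocabulary)
    return all(about(compose(d, d1), q)
               for d in situations
               for q in situations if about(d, q)
               for d1 in situations)


supported = lmu_holds({"a", "b", "c"})
```

Since union never removes a shared term, no counterexample exists under these assumptions, which mirrors the unconditional support stated above.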


XML Vector Space Retrieval


XML Vector Space Retrieval Aboutness Definition

Algorithm
1. Split up XML elements into several separate indexes.
2. $$rsv(D, Q) = \frac{\sum_{t_i \in Q \cap D} w_Q(t_i) \cdot w_D(t_i) \cdot idf(t_i)}{\|Q\| \cdot \|D\|}$$
3. Automatic Query Refinement based on Lexical Affinity [7]:
   $$IG_{Q,D}(L) = H_Q(D) - \left[ \frac{|D^+|}{|D|} H_Q(D^+) + \frac{|D^-|}{|D|} H_Q(D^-) \right]$$

Aboutness Decision
Translation: map(D) = {⟨⟨ElementType, e, i⟩⟩, ⟨⟨Value, t, i⟩⟩ | e ∈ {article, abs, sec, ss1, ss2, p, p1}}
Aboutness definition: D □⇝ Q if and only if rsv(D, Q) ≥ N.
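The retrieval status value can be computed from term weights as in the sketch below. It is a hedged illustration: the dictionaries of weights, the default idf of 1.0 for unknown terms and the use of the Euclidean norm for the two normalisation factors are assumptions, not details confirmed by [7].

```python
import math
from typing import Dict


def euclidean_norm(weights: Dict[str, float]) -> float:
    """Assumed normalisation factor: Euclidean length of the weight vector."""
    return math.sqrt(sum(w * w for w in weights.values()))


def rsv(w_q: Dict[str, float], w_d: Dict[str, float], idf: Dict[str, float]) -> float:
    """Sum weighted matches over shared terms, normalised by both lengths."""
    shared = w_q.keys() & w_d.keys()
    numerator = sum(w_q[t] * w_d[t] * idf.get(t, 1.0) for t in shared)
    denominator = euclidean_norm(w_q) * euclidean_norm(w_d)
    return numerator / denominator if denominator else 0.0


def is_about(w_q, w_d, idf, n: float) -> bool:
    """Aboutness holds when the score reaches the cut-off N."""
    return rsv(w_q, w_d, idf) >= n


query = {"xml": 1.0}
element = {"xml": 3.0, "tree": 4.0}
idf = {"xml": 2.0}
score = rsv(query, element, idf)
```

Here the only shared term contributes 1 · 3 · 2 = 6 and the norms are 1 and 5, so the score is 1.2; whether the element counts as about the query depends entirely on the chosen N.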


XML Vector Space Retrieval Rules

Left Monotonic Union:
$$rsv(A, B) = \frac{f(AB)}{\|A\| \cdot \|B\|} \;\Rightarrow\; rsv(A, B, C) = \frac{f(ABC)}{\|A\| \cdot \|B\| \cdot \|C\|}$$

LMU is conditionally supported.
  In Flat Document Retrieval Vector Space [6] and Pure Type XML Retrieval, LMU is fully supported.
  Conditional support is a more conservative approach to monotonicity, with better control.
  However, the condition is chosen a priori by setting N, and is external to the aboutness decision.
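A toy example shows why support for LMU depends on the cut-off N. The sketch below makes several simplifying assumptions: term counts serve as weights, idf is ignored, composition adds the term counts of the two situations, and the score is normalised by the length of the combined vector rather than by the three separate norms of the formula above.

```python
import math
from collections import Counter


def norm(vector: Counter) -> float:
    """Euclidean length of a term-count vector."""
    return math.sqrt(sum(w * w for w in vector.values()))


def rsv(doc: Counter, query: Counter) -> float:
    """Cosine-style score over the terms shared by document and query."""
    dot = sum(doc[t] * query[t] for t in doc.keys() & query.keys())
    denominator = norm(doc) * norm(query)
    return dot / denominator if denominator else 0.0


def about(doc: Counter, query: Counter, n: float) -> bool:
    """Aboutness holds when the score reaches the cut-off N."""
    return rsv(doc, query) >= n


d = Counter({"xml": 1})
d1 = Counter({"garden": 3})
q = Counter({"xml": 1})
combined = d + d1  # assumed composition of D and D1
```

With N = 0.5 the element d is about q, but the combination of d and d1 is not, because the added terms dilute the score to about 0.32. With N = 0.3 both are about q, so whether LMU holds is decided by a value set outside the aboutness decision.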


Experimental Behaviour

Explaining Experimental Behaviour: INEX 2005 CO.Thorough

Why did XML Vector Spaces and XML Language Modelling do so well in the INEX 2005 CO.Thorough task?
Content-only queries.
XML retrieval systems did not have to eliminate overlap in information when ancestors of elements contain exactly the same information as their children.

Table: INEX 2005 CO.Thorough. Metric: ep-gr, Quantization: gen, Overlap=on

S.No  Affiliation       RunId                    MAep
1     XML Vector Space  CO-no-phrase-...         0.0867
2     XML Vector Space  CO-no-phrase-...         0.0841
4     Language Model    UAmsCOTQrelbasedIndex    0.0829
5     Language Model    UAmsCOTLengthbasedIndex  0.0802
6     Language Model    UAmsCOTElementIndex      0.0793


Explaining Experimental Behaviour

The measure ep-gr checks whether systems deliver only the most relevant elements early in the ranking, and all of them. Verify this ...
How does the support for Left Monotonic Union affect this?
Full support for the expansion from D □⇝ Q to D ⊗ D1 □⇝ Q is not necessarily optimal for delivering only relevant elements at the top of the ranking, as it might be that ¬(D1 □⇝ Q).
Other important reasoning rules to verify this behaviour:

Mix (MX)
  S □⇝ U, T □⇝ U
  ----------------
    S ⊗ T □⇝ U

Cut (CU)
  S ⊗ T □⇝ U, S □⇝ T
  --------------------
        S □⇝ U


Specificity Aboutness

Specificity Aboutness Filter

XML retrieval is about focusing the answer to an information need.
Filters are used in INEX to remove all overlapping elements but the most relevant one on any XML path.
Filters are the combination of two aboutness systems.

Definition. Let Ap, Bp be aboutness systems, let 𝒟 be a document base with documents D, and let Q be a query. The filtering function f-answer of Ap with respect to Bp is defined by: f-answer(Ap; Bp; Q; 𝒟) = answer(Ap; Q; answer(Bp; Q; 𝒟)) [6].


Filter as a second aboutness decision
1. Aboutness decision: in our framework, filters are simply another aboutness decision, e.g. brute-force filtering: D □⇝ Q if and only if rsv(D, Q) = max(rsv(D, Q)).
2. Reasoning rules.
3. Analyse f-answer [6]:
   The filtering function f-answer(Ap; Bp; q; 𝒟) can be called useless if, for all document bases 𝒟 and queries q, f-answer(Ap; Bp; q; 𝒟) = answer(Bp; q; 𝒟).
   We do not want aboutness proof systems to preclude each other: f-answer(Ap; Bp; Q; 𝒟) = ∅.
   The aboutness proof systems Ap and Bp are said to be f-equivalent if and only if f-answer(Ap; Bp; Q; 𝒟) = answer(Ap; Q; 𝒟).
   Ap and Bp 'overlap' if and only if the systems do not preclude each other and are not f-equivalent.
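The filtering function and the four properties can be sketched for a finite sample of queries and document bases. The code is illustrative only: the two aboutness decisions (term overlap for the first pass, full coverage of the query for the filter) are assumptions, and the quantification over all document bases and queries is replaced by a check over the given samples.

```python
from typing import Callable, FrozenSet, List, Tuple

Situation = FrozenSet[str]
About = Callable[[Situation, Situation], bool]


def answer(about: About, query: Situation, docs: List[Situation]) -> List[Situation]:
    """All documents that the aboutness system judges to be about the query."""
    return [d for d in docs if about(d, query)]


def f_answer(about_a: About, about_b: About, query: Situation,
             docs: List[Situation]) -> List[Situation]:
    """Filter the answer of system B through system A."""
    return answer(about_a, query, answer(about_b, query, docs))


def compare(about_a: About, about_b: About,
            samples: List[Tuple[Situation, List[Situation]]]) -> dict:
    """Evaluate the four properties on a finite sample of (query, docs) pairs."""
    rows = [(f_answer(about_a, about_b, q, docs),
             answer(about_a, q, docs),
             answer(about_b, q, docs)) for q, docs in samples]
    preclude = all(not f for f, _, _ in rows)
    f_equivalent = all(f == a for f, a, _ in rows)
    return {
        "useless": all(f == b for f, _, b in rows),
        "preclude": preclude,
        "f_equivalent": f_equivalent,
        "overlap": not preclude and not f_equivalent,
    }


def overlap_about(doc: Situation, query: Situation) -> bool:
    return bool(doc & query)      # assumed first-pass decision


def coverage_about(doc: Situation, query: Situation) -> bool:
    return query <= doc           # assumed stricter filter decision


d1, d2, d3 = frozenset({"xml", "retrieval"}), frozenset({"xml"}), frozenset({"logic"})
q = frozenset({"xml", "retrieval"})
filtered = f_answer(coverage_about, overlap_about, q, [d1, d2, d3])
properties = compare(coverage_about, overlap_about, [(q, [d1, d2, d3])])
```

On this sample the filter is not useless, because it removes d2 from the first-pass answer, and the two systems are f-equivalent, because filtering yields exactly what the stricter system would return on its own.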


Part IV XML Retrieval Evaluation Evaluation

INEX Quantisations
Agent Reasoning


INEX Quantisations

Agent Representation in INEX Quantisations

Humans are also reasoning agents.
The Situation Theory (ST) framework has the advantage that a user's need and a system's attempt to satisfy it can be expressed within the same framework.
Both are reasoning processes that follow rules and axioms.


Quantisations Reflect Agent Expectations

Combined assessment for (exhaustivity, specificity)
Agent models
1. Expert and impatient: only reward retrieval of highly exhaustive and specific elements.
2. Naive and has lots of time: reward, to a different extent, the retrieval of any relevant elements.


Agent Reasoning

Quantisation functions

Table: Quantisations in INEX 2004

Function  f(e, s)                                       User model
Strict4   f(e, s) = 1 if e = 3 and s = 3; 0 otherwise   UU
AnyRel    f(e, s) = 0 if (e, s) = (0, 0); 1 otherwise   TU

Table: Quantisations in INEX 2005

Function  f(e, s)                                       User model
Strict5   f(e, s) = 1 if e = 2 and s = 1; 0 otherwise   UU
BinExh    f(e, s) = s if e ∈ {?, 1, 2}; 0 otherwise     SDRU

User models reasoning?
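The four quantisation functions in the tables translate directly into code. The sketch below is a plain transcription; the function names and the representation of the "?" exhaustivity value as a string are assumptions made for illustration.

```python
def strict4(e, s):
    """INEX 2004 Strict4: reward only (e, s) = (3, 3)."""
    return 1 if e == 3 and s == 3 else 0


def any_rel(e, s):
    """INEX 2004 AnyRel: reward everything except (e, s) = (0, 0)."""
    return 0 if (e, s) == (0, 0) else 1


def strict5(e, s):
    """INEX 2005 Strict5: reward only e = 2 and s = 1."""
    return 1 if e == 2 and s == 1 else 0


def bin_exh(e, s):
    """INEX 2005 BinExh: return the specificity s when e is '?', 1 or 2."""
    return s if e in ("?", 1, 2) else 0
```

For example, an element assessed as (e, s) = (2, 2) in INEX 2004 earns nothing under Strict4 but full credit under AnyRel, which reflects the two agent models.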


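Agent Reasoning Models

Unanimous User (UU)
  Di □⇝ Q, Di ≡ D, Qi □⇝ D, Qi ≡ Q
  ----------------------------------
         D □⇝ Q, Q □⇝ D

Typical User (TU)
  D1 □⇝ Q         Dn □⇝ Q    Q1 □⇝ D         Qn □⇝ D
  --------, ..., --------,  --------, ..., --------
   D □⇝ Q          D □⇝ Q     Q □⇝ D          Q □⇝ D

UU: Expert and impatient
TU: Demands quick win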


One Reasoning System

Unanimous User (UU) and Left Monotonic Union (LMU)

UU:
  Di □⇝ Q, Di ≡ D
  ----------------
       D □⇝ Q

LMU:
     D □⇝ Q             D ⊗ D1 □⇝ Q
  -------------   ⇒   --------------
  D ⊗ D1 □⇝ Q          D ⊗ D1 ≢ Di


Part V Conclusion


Pro and Contra

Pro
Widened perspective: we could derive why some behaviour appears in the experimental evaluation.
We can even reflect on the pros and cons of the approach.
New insights into how models perform comparatively better.
Logic-based evaluation is definitely more open to debate.


Contra
Theoretical evaluation may be too high an abstraction to cover the details of XML retrieval models. Often the fundamental behaviour of models is very similar.
Are logics really the foundation of mathematics and therefore suited to analysing IR algorithms? We tried to address these problems by introducing some mathematics into the purely logical framework. Is that a cheat?
There are problems with analysing real, performing systems rather than approaches in principle.


Bibliography I

[1] T. Blanke and M. Lalmas. Theoretical evaluation of XML retrieval evaluation. In DIR '07, 2007.
[2] P. D. Bruza and T. W. C. Huibers. Investigating aboutness axioms using information fields. In ACM SIGIR '94, pages 112–121, New York, NY, USA, 1994. Springer-Verlag New York, Inc.
[3] Y. Chiaramella. Information retrieval and structured documents. In Lectures on information retrieval, pages 286–309. Springer-Verlag, New York, 2001.
[4] W. Cooper. A definition of relevance for information retrieval. Information Storage and Retrieval, 7:19–37, 1971.
[5] N. Fuhr, M. Lalmas, S. Malik, and G. Kazai, editors. Advances in XML Information Retrieval and Evaluation, INEX 2005, Dagstuhl, volume 3977 of Lecture Notes in Computer Science. Springer, 2006.
[6] T. W. Huibers. An Axiomatic Theory for Information Retrieval. Universiteit Utrecht, Utrecht, 1996.
[7] Y. Mass and M. Mandelbrod. Using the INEX environment as a test bed for various user models for XML retrieval. In Fuhr et al. [5], pages 187–195.


Bibliography II

[8] B. Piwowarski and M. Lalmas. Structured information retrieval and quantum theory. In 3rd Quantum Interaction Symposium, DFKI, Saarbruecken, 2009.
[9] C. J. van Rijsbergen and M. Lalmas. Information calculus for information retrieval. J. Am. Soc. Inf. Sci., 47(5):385–398, 1996.
[10] C. J. van Rijsbergen. The Geometry of Information Retrieval. Cambridge University Press, 2004.
[11] B. Sigurbjörnsson and J. Kamps. The effect of structured queries and selective indexing on XML retrieval. In Fuhr et al. [5], pages 104–118.
[12] K.-F. Wong, D. Song, P. Bruza, and C.-H. Cheng. Application of aboutness to functional benchmarking in information retrieval. ACM Trans. Inf. Syst., 19(4):337–370, 2001.
