Content and structure summarisation for accessing XML documents
Zoltán Szlávik
Thesis submitted for the degree of Doctor of Philosophy at Queen Mary, University of London
November 2008
Declaration of originality I hereby declare that this thesis, and the research to which it refers, are the product of my own work, and that any ideas or quotations from the work of other people, published or otherwise, are fully acknowledged in accordance with the standard referencing practices of the discipline.
The material contained in this thesis has not been submitted, either in whole or in part, for a degree or diploma or other qualification at the University of London or any other University.
Some parts of this work have been previously published as:
Z. Szlávik, A. Tombros, and M. Lalmas. The use of summaries in XML retrieval. In Proceedings of ECDL 2006, pages 75-86, 2006.
Z. Szlávik, A. Tombros, and M. Lalmas. Investigating the use of summarisation for interactive XML retrieval. In F. Crestani and G. Pasi, editors, Proceedings of ACM SAC-IARS '06, pages 1068-1072, 2006.
Z. Szlávik, A. Tombros, and M. Lalmas. Feature- and query-based table of contents generation for XML documents. In G. Amati, C. Carpineto, and G. Romano, editors, ECIR, volume 4425 of Lecture Notes in Computer Science, pages 456-467. Springer, 2007.
Zoltán Szlávik
Abstract

As the availability of structured documents is constantly increasing, retrieval systems able to return document portions are being developed. Structured documents, usually formatted in XML, may consist of large numbers of document portions, often organised into a hierarchical logical structure. With the high number of document portions, it is necessary to direct the attention of users of retrieval systems towards the most important document portions, and also to give overviews of the structure of documents, in other words, to show document portions in context. This thesis investigates summarisation as a means to help searchers of XML retrieval systems in the process of accessing the contents of document portions.

Two types of summarisation are investigated. First, summaries of the textual contents of document portions, called XML elements, are studied in a user-based environment. Traditionally, summarisation is associated with whole documents or document sets, but rarely with document portions. As summaries of documents have proved to be useful in whole document retrieval, it is considered worthwhile to investigate summaries of document portions in XML element retrieval. Summaries of elements are presented to searchers in the context of other elements from the document. The textual summaries of elements also reflect the searchers' information needs: they are query based.

The second type of summarisation investigated in this thesis is called structure summarisation. The automatic generation of tables of contents (ToCs), as structure summaries, is described and examined. ToC generation is studied both when searchers' queries are available (query based structure summarisation) and when they are not (query independent structure summarisation). The work presented in this thesis has made several contributions to the fields of summarisation and interactive XML retrieval.
Contents

1 Introduction  16
  1.1 The eXtensible Markup Language and summarisation  18
    1.1.1 Summarisation of the textual content of XML elements  18
    1.1.2 Summarisation of the structure of XML documents  18
List of Figures

4.5  On the left, the structure of the XML document with a summary; on the right, a section element displayed.  84
4.6  The steps of the experiment from a searcher's point of view. The middle steps are performed twice for each searcher.  87
4.7  Sample from an example document from the IEEE collection.  88
4.8  A sample topic that was used to generate one of the L tasks.  90
4.9  One of the L tasks used.  91
4.10 Average summary reading times and number of read summaries in the 24 search sessions.  98
4.11 Summary times by structural levels.  99
4.12 Summary times by XML element types.  100
4.13 Number of read summaries per search session by structural levels.  101
4.14 Number of read summaries per search session by XML element types.  102
5.1  A topic description and links to its documents.  112
5.2  Screen shot of the main screen with sliders, ToC and element display.  112
5.3  The steps of the structure summarisation experiment.  113
5.4  Various generated ToCs for the topic about the impact that Albert Einstein had on politics (w5).  115
5.5  The average difficulty of tasks indicated by searchers.  119
5.6  The average slider values after finishing with a document.  120
5.7  Number of elements in the documents and the ToC elements' ratio to it.  121
5.8  Number of elements in documents at various size categories.  121
5.10 Number of elements in documents at various depth levels.  122
5.11 ToC items at various depth levels.  122
5.12 Length distributions of ToCs for the two collections.  125
5.13 Depth level distributions of ToCs for the two collections.  125
5.14 Slider values for the ten used topics.  126
6.1  The process and stages of selecting features for structure summarisation.  133
6.2  A sample from an example run.  137
6.3  The average depth of result elements of various quality runs.  142
6.4  The average length of result elements of various quality runs.  143
6.5  The occurrence of section type elements in runs. The first sections are more often returned in results than, e.g., second, third or fourth sections.  144
6.6  The ratio of result elements that contain links to the whole document they are in.  145
6.7  The average number of elements linking to documents that are in the same Wikipedia category as their documents.  146
6.8  Overview of the query independent structure summarisation experiment.  147
6.10 The best 15 ToC evaluation results with respect to recall.  179
6.11 The best 15 ToC evaluation results with respect to precision.  180
6.12 Summariser version occurrences in the top 15 results, by training method.  181
6.13 The best 15 ToC evaluation results with respect to the F1 measure (equal weight on precision and recall); results also in the top precision results are shown in bold.  181
6.14 The best 15 ToC evaluation results with respect to the F2 measure (double weight on recall).  182
6.15 Feature frequencies in the top 15 results.  182
Acknowledgements

I would like to thank the following people, communities and organisations whose help and support made it possible to complete this thesis:

My supervisors Tassos Tombros and Mounia Lalmas, for their instructions, constructive comments and trust in me.

The DELOS Network of Excellence on Digital Libraries, for funding my PhD.

Participants of my user studies, for trying their best to facilitate my research.

The INEX community, for a stimulating research network through which my aptitude for learning increased considerably.

The members of the QMIR group, past and present, with whom we improved and developed our skills together. Special thanks to Elham, Jun and Fred, with whom we seemed to be endlessly discussing the meaning of PhD and life. There is still a lot to be learned...

Sándor Dominich, who drew my attention to the field of IR and helped me to seize the opportunity to start my PhD at Queen Mary.

My friends and aikido training partners, instructors, students, and fellow football players, who kept me fit, physically and mentally, during the years of my PhD. Special thanks to Chris, Karesz and Csabi.

My parents and brother, who gave all the support they possibly could, throughout my life.

My wife Heni, for her love, support, and for standing by me all these years. Without her I would not have lasted even two months in this Big City. I owe everything to her.
Chapter 1 Introduction
As the availability of structured documents is constantly increasing, retrieval systems able to return document portions are being developed. Structured documents, usually formatted in XML (Extensible Markup Language), may consist of large numbers of document portions, often organised into a hierarchical logical structure. With the high number of document portions, it is necessary to direct the attention of users of retrieval systems towards the most important document portions, and also to give overviews of the structure of documents, in other words, to show document portions in context. This thesis investigates summarisation as a means to help searchers of XML retrieval systems in the process of accessing the contents of document portions.

Automatic summarisation has been used to generate overviews of documents since the 1950s (Luhn, 1958; Edmundson and Wyllys, 1961). Summaries of documents, e.g. abstracts, can be very useful if, for example, searchers do not have time to read many documents (possibly returned by a search system) to find relevant content. A summary, in general, conveys the main idea of a document's content and hence facilitates a searcher's information seeking process.

Summaries are well suited to use in information retrieval systems, as well as in digital libraries. The result of retrieval is a list of documents that are ranked according to their relevance as estimated by the retrieval engine. Although a search engine may perform very well in returning a list of documents, searchers still need to choose the best document(s) from this list and read the corresponding documents' contents. An overview of the contents, i.e. a summary, is highly useful in deciding which documents are most promising for finding relevant information, especially if this summary reflects the document's content with respect to the searcher's information need
(Tombros and Sanderson, 1998). Summaries can range in length from a few words to several sentences (White et al., 2002; Turpin et al., 2007).

Another stage of the searching process is when the searcher, having chosen documents from the result list, needs to find the relevant content within those documents. This is often not supported by retrieval systems, especially those searching the Web. Thus, the idea of adding document overviews to individual documents is worth investigating, as a document overview can focus the searcher's attention towards document portions containing relevant content.

According to the classic view on summarisation, a document overview is an overview (i.e. summary) of the textual content of a document. However, if we consider printed books as examples, an overview can also be a summary of the logical structure of the book, i.e. the table of contents. The table of contents (ToC), in addition to a textual abstract, has been a very important part of the searching process where printed books are concerned, i.e. we usually browse the ToC in order to find out which section might answer our information need (provided that the information need can be satisfied within a section of the book). However, despite the obvious advantages of tables of contents (Tombros et al., 2005b; Kim and Son, 2004; Kazai and Trotman, 2007), how they ought to be created when not directly available has not been researched widely with respect to finding information within electronic documents, that is, in information retrieval. This thesis aims at investigating how ToCs can be generated automatically using the logical structure encoded into documents.

The classic view on summarisation also assumes that textual summaries are created for full documents or sets of documents (Edmundson, 1969; Kupiec et al., 1995; Stein et al., 2000; Mani, 2001).
However, it is also possible to focus the attention of searchers by using summaries of individual document portions (Amini et al., 2007). For example, if this type of summarisation is combined with the use of the table of contents, the retrieval process of searchers can further be facilitated, e.g. when summaries of document portions are also displayed in addition to section titles in ToCs. A section title in a ToC might not be indicative enough of the content of the document portion, but a summary of several sentences might help searchers decide whether reading the corresponding section to find relevant information is worthwhile. This thesis also investigates text summarisation when the text of document portions is summarised and displayed to users of information retrieval systems. The following section introduces text (content) and structure summarisation in the context of
XML document access.
1.1 The eXtensible Markup Language and summarisation
As the eXtensible Markup Language (XML) is becoming more widespread, the amount of information available in documents formatted in XML is constantly increasing. These XML documents include, among others, those on the World Wide Web, in digital libraries, and in office software (Dopichaj, 2006; Trotman et al., 2006). XML markup is often used to mark up the logical structure of documents. The logical structure is hierarchical, e.g. sections can be parts of a chapter and a section may contain several paragraphs. This makes the XML format ideal for investigating focused access to document portions (called XML elements). This thesis is not concerned with how relevant elements can be found and returned from a collection of documents (focused retrieval) but, instead, with how the searching process can be facilitated once searchers get to the relevant document.
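To make the notion of hierarchical logical structure concrete, the following sketch builds a small hypothetical document (the tag names book, chapter, section and p are invented for illustration, not taken from any collection used in this thesis) and traverses it with Python's standard `xml.etree.ElementTree` module:

```python
import xml.etree.ElementTree as ET

# A hypothetical XML document whose markup encodes logical structure:
# a book contains a chapter, which contains sections and paragraphs.
doc = """
<book>
  <title>An Example Book</title>
  <chapter>
    <title>Introduction</title>
    <section>
      <title>Background</title>
      <p>First paragraph of the background section.</p>
      <p>Second paragraph.</p>
    </section>
    <section>
      <title>Motivation</title>
      <p>A paragraph on motivation.</p>
    </section>
  </chapter>
</book>
"""

def list_elements(element, depth=0):
    """Print every element, indented by its depth in the logical structure."""
    print("  " * depth + element.tag)
    for child in element:
        list_elements(child, depth + 1)

root = ET.fromstring(doc)
list_elements(root)
```

Every node printed here (book, chapter, section, p) is a candidate retrieval unit: a focused retrieval system could, in principle, return any of them on its own.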
1.1.1 Summarisation of the textual content of XML elements
One way to facilitate the searchers' finding process is to provide them with overviews of the elements' content by summarising the textual contents of these elements. These summaries can then be appropriately displayed so that searchers can find relevant chapters, sections, etc. more easily by reading only a few sentences. Text summarisation (i.e. content summarisation) for accessing XML documents is one of the two main topics of this thesis. The document's logical structure, available through the XML markup, makes it possible to create summaries for individual elements (Amini et al., 2007). As elements of a document overlap (Kazai et al., 2004) and the document's logical structure is hierarchical, summaries of the contents of elements might be used differently from those of whole documents. This thesis investigates the usefulness of element text summaries when the searcher has already found a relevant document but wants to access the relevant content within it (i.e. the content of elements).
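As a minimal sketch of how a query based extractive summary of an element's text could be produced: the naive sentence splitting and word-overlap scoring below are simplifying assumptions for illustration, not the summarisation method studied in this thesis.

```python
def summarise_element(text, query, max_sentences=2):
    """Rank an element's sentences by query-term overlap and return
    the best-scoring ones, restored to their original order."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    query_terms = set(query.lower().split())
    scored = [(len(set(s.lower().split()) & query_terms), i, s)
              for i, s in enumerate(sentences)]
    # Keep the highest-scoring sentences, then re-sort by position.
    top = sorted(scored, key=lambda t: -t[0])[:max_sentences]
    return ". ".join(s for _, _, s in sorted(top, key=lambda t: t[1])) + "."

element_text = ("XML retrieval returns document portions. "
                "Summaries help searchers judge relevance. "
                "The weather was fine that day.")
print(summarise_element(element_text, "summaries for searchers"))
```

Because scoring is query dependent, the same element would yield a different summary for a different query, which is exactly the query based behaviour described above.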
1.1.2 Summarisation of the structure of XML documents
The hierarchical logical structure encoded into XML documents also makes it possible to automatically identify tables of contents (ToCs) for documents, which provide an overview of the document. ToCs usually do not directly accompany digital documents, so they need to be created, preferably
in an automatic manner. XML documents often contain other information marked up in exactly the same way as the logical structure (i.e. via XML tags). Such document portions include, for example, formatting instructions, which are usually not worth referring to directly in a ToC. Some sections, for example, might be too small, while others might not be relevant to the searcher's information need at all. Thus, the full XML structure should not be used to create ToCs. How to select elements that are worth including in a table of contents (i.e. structure summarisation), which provides an overview of the structure of a document, is a research problem that is also addressed in this thesis.

There have been several types of overviews proposed that aim at visualising the relationships among documents (Andrews, 1996; Sebrechts et al., 1999; Shneiderman et al., 2000; Baeza-Yates and Ribeiro-Neto, 1999, Chapter 10). To visualise the relationship of documents that are organised hierarchically, however, the use of ToCs (Remde et al., 1987; Nation, 1998) seems to be the most natural solution, as they are easy to comprehend and they have been used for centuries. More recently, ToCs have also been used in the context of XML documents (Tombros et al., 2005a; Larsen et al., 2006a; Malik et al., 2006b), where elements also form a hierarchical structure. These elements are closely related, often with overlapping content. ToCs in the context of XML documents have been used as a means to help searchers gain an overview of the structure of a document containing relevant information. However, these ToCs were static and manually defined (i.e. several types of elements above a certain depth in the logical structure were selected), and they did not reflect the searcher's information need well enough (i.e. they did not always emphasise possibly relevant elements and de-emphasise non-relevant elements), which may have limited their usefulness.
To investigate this problem, this thesis aims at creating ToCs automatically for any XML document using combinations of features of XML elements, such as the length of their textual contents, their depth in the logical structure, their relevance to the query, etc.
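The combination of element features can be sketched as a simple linear scorer over candidate elements; the feature values, weights and inclusion threshold below are invented for illustration and are not the trained values investigated later in this thesis:

```python
# Each candidate element carries a few features: length of its textual
# content, depth in the logical structure, and an estimated relevance
# to the query (all values here are made up for the example).
candidates = [
    {"path": "/book/chapter[1]",                 "length": 5200, "depth": 1, "relevance": 0.8},
    {"path": "/book/chapter[1]/section[1]",      "length": 1900, "depth": 2, "relevance": 0.9},
    {"path": "/book/chapter[1]/section[1]/p[3]", "length": 40,   "depth": 3, "relevance": 0.2},
]

WEIGHTS = {"length": 0.0002, "depth": -0.3, "relevance": 1.0}  # assumed weights
THRESHOLD = 0.5                                                # assumed cut-off

def toc_score(element):
    """Combine the element's features into a single inclusion score."""
    return sum(WEIGHTS[f] * element[f] for f in WEIGHTS)

toc = [e["path"] for e in candidates if toc_score(e) > THRESHOLD]
print(toc)  # the short, deeply nested paragraph is left out
```

With these assumed weights, long and query-relevant elements near the top of the hierarchy are kept, while a short paragraph nested deep in the structure falls below the threshold and is excluded from the ToC.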
1.2 Research questions
The work described in this thesis is incremental: the initial research problem, as the first part of the research, is to investigate text summarisation in the context of XML retrieval, where the logical structure of documents is known and presented to users as tables of contents. This step is followed by research into the automatic generation of tables of contents. The research
questions to be answered with respect to text summarisation are the following:

1. Can textual summaries be useful for XML retrieval? Do searchers find textual summaries of elements useful for gaining an overview of the contents of, e.g., sections and paragraphs of documents? Textual summaries are investigated when tables of contents (ToCs) are also displayed and textual summaries are associated with items in the ToC.

2. If textual summaries are useful, when are they so? Questions such as the following are investigated: Are they useful for every XML element, or perhaps only for those at the top levels of the logical structure? Should summaries of elements whose textual contents are often as short as the summaries themselves be displayed? As element text summarisation has not been researched widely before, it is important to examine the usefulness of summaries at various levels of the logical structure.

3. What are the requirements, according to human searchers, for the development of XML retrieval systems that display textual summaries as well as the logical structure of documents? For example, should an XML retrieval system return elements of a document grouped, or in order of relevance? How detailed should the table of contents, in the context of which textual summaries are displayed, be? What should be the labels of items in the table of contents?

After the investigation of text summarisation, as the next step of this thesis's research, the tables of contents (ToCs) themselves are studied directly. ToCs are regarded as summaries of the document's logical structure. To distinguish it from text summarisation, but also to reflect similarities to it, ToC generation is referred to as structure summarisation throughout this thesis. As there are a number of types (e.g. paragraph, section) and levels (i.e. how deep an element is in the logical structure) of XML elements in the document's logical structure, several research questions are examined in the context of structure summarisation:

4. Which XML elements should be selected for inclusion in the ToC? Is it always chapters and sections, or paragraphs as well, whose titles should appear in a ToC? Perhaps some sections are to be excluded from a ToC, while even paragraphs of other, more important, sections are worth displaying in a table of contents.

5. If the queries, i.e. searchers' formulated information needs, are known, is it important to use this information and select relevant elements for the ToC while, at the same time, hiding other elements? This type of structure summarisation is referred to as query based structure summarisation in this thesis.

6. Are any features other than relevance to the searcher's query important when an item is selected for inclusion in the ToC? For example, we may find that the length of the textual content of elements and documents, or the extent to which an element is nested within the XML structure, can affect which elements are to be displayed and accessed through a ToC.

7. What are the properties of ToCs that are obtained with the involvement of human searchers who indicate their 'ideal' ToCs? If users can determine their preferred ToC for a query, this may give indications for structure summarisation.

Users may also browse within a book to discover its contents without having any specific information need in mind. This browsing can initially be done through the table of contents, which can serve as a structural overview of the document being browsed. The corresponding research questions are related to the creation of ToCs when no relevance-to-query information is available, i.e. query independent structure summarisation:

8. How should structure summaries be created and evaluated without any user interaction? In other words, what methodology should be used for automatic ToC generation and ToC evaluation?

9. What features should be used to automatically determine whether a structural item is worth including in a general ToC? Among others, XML element length, type and title information will be investigated.

10. How many (and which combinations of) features should be used for structure summarisation? It might be that a good ToC can be generated using only one feature, e.g. length, but a combination of several features might work better for ToC generation.

This thesis aims to investigate and answer the questions listed above. The chapters of this thesis are outlined below.
1.3 Thesis outline
The remainder of this thesis is structured as follows.
Chapter 2 introduces the basic concepts of information retrieval (IR), XML, the evaluation of various retrieval system types, and the summarisation methods that are important for understanding subsequent chapters of this thesis. Summarisation is used in this thesis to create overviews of the contents of XML elements (text summarisation) and of the document structure (structure summarisation, i.e. ToC generation).

Chapter 3 presents related research and aims at providing motivation for the work of this thesis. Various overviews of hierarchical and non-hierarchical data are discussed. This chapter also discusses XML structure display methods, as well as summarisation related to the logical structure of documents and the XML format. Related work concerning the involvement of searchers in the retrieval process is also described in Chapter 3.

Chapter 4 investigates text summarisation in the context of XML retrieval. It describes a user study and analyses the results obtained through it. The two chapters following Chapter 4 are strongly based on its findings.

Chapter 5 looks into the semi-automatic generation of ToCs and their properties when query information is available. ToCs are investigated in a user-based environment, where searchers are involved in the ToC generation process by indicating which element features they think are important for selecting elements that are worth including in the ToC.

Chapter 6 investigates the selection of features for query independent structure summarisation. It also introduces a probabilistic structure summarisation method via which various features are evaluated. Chapter 6 also describes the manual creation of structural summaries, i.e. ToCs, which are then used for training and testing purposes.

Chapter 7 draws conclusions from the work of this thesis and elaborates on possible future work that might be based on its findings.
Chapter 2 Information retrieval, XML and summarisation
In this chapter, first, the field of information retrieval (IR) and the notion of structured documents are introduced (Section 2.1) as the research presented in this thesis has been carried out in the context of these two areas. The basic concepts of both IR and structured documents (more specifically, documents in XML format) are discussed in Section 2.2. This chapter also outlines the basic methodologies for the evaluation of IR systems, when considering either unstructured or structured documents. Evaluation of IR systems in an interactive (user-based) environment is also discussed, as the work of this thesis heavily relies on the involvement of human searchers (Section 2.3). Section 2.4 presents text summarisation. Summarisation is applied on XML documents in the context of IR in the research presented in this thesis. Therefore, the field of summarisation and several widely used summarisation methods are discussed with a focus on those methods used in this thesis.
2.1 Introduction
The availability of information stored in documents (e.g. in handwritten or printed books, or documents in various digital formats) and the increasing number of these documents have made it important to be able to find the documents which contain information useful for a searcher. The first appearance of written documents, and hence the need for their retrieval, dates back to around 3000 B.C. (Figure 2.1). The retrieval of documents was already needed in the first libraries (e.g. the Library of Alexandria), where it was done with the help of librarians and, possibly, through
catalogues. It was mainly in libraries that the retrieval of documents from various document collections was performed, until the appearance of digital document collections, digital libraries and the Web. In the 1940s, the storage and retrieval of information became increasingly important, as the amount of information available had started to increase at a scale never experienced before. This was later followed by an even more dramatic increase in the availability of (electronic) documents with the global use of the World Wide Web.

The field of Information Retrieval (IR) appeared in the middle of the 20th century, addressing the problem of retrieving text documents from large collections in an automatic way (by computer), based on full-text indexing of words (Luhn, 1958). Initially, systems were developed to retrieve documents from a collection of abstracts of academic papers, but later IR systems were made to search other collections such as full newspaper articles, literature, television broadcast transcripts (Fatemi et al., 2004) and the Web.

In 1998, the World Wide Web Consortium's1 (W3C) Recommendation for the Extensible Markup Language (XML) (Bray et al., 1998) was completed. This started a rapid growth of the family of standards related to XML, as well as research into the retrieval of documents formatted in XML (Fuhr and Großjohann, 2001; Fuhr et al., 2002a; Abolhassani et al., 2002; Fuhr et al., 2003, 2004). The XML format offers an encoding of structure into documents and has become widely used, for example, in legal documents2 or in content management systems (e.g. Weitzman et al., 2002). Also, XHTML3 (used for web pages) is a special case of XML and, as its use has been increasing, it further adds to the rapid growth of the availability of XML documents.

When discussing the structure of documents, it is important to distinguish between two main types of "structure".
Structured documents, in the sense used in this thesis, were called semi-structured at the beginning of research into structured documents. For example, SGML, XML, HTML, and XHTML are formats for semi-structured documents. According to the definition by Abiteboul (1997), semi-structured data are “neither raw data nor strictly typed, i.e., not table-oriented as in a relational model or sorted-graph as in object databases [which is the other type of structured documents]”. Semi-structured documents are loosely marked up, and their structure is said to be implicit (Luk et al., 2002). In this thesis, structured documents are always referred to in the semi-structured sense.

Figure 2.1: Timeline of information and retrieval systems by Fielden (2002).

The next section introduces the basic concepts of information retrieval and XML that are necessary for understanding later parts of this thesis.
2.2 Basic concepts
The expression information retrieval (IR) was first used by Calvin Mooers in 1951, who defined it as a field that “embraces the intellectual aspects of the description of information and its specification for search, and also whatever systems, technique, or machines that are employed to carry out the operation” (Mooers, 1951). The process of returning documents, e.g. books, articles or web pages, in response to a query (representing an information need) has been referred to as information retrieval, or IR, ever since. It has to be noted that Web IR is considered to be different from classical IR in the sense that it “can be defined as the application of theories and methodologies from IR to the World Wide Web (WWW). It is concerned with addressing the technological challenges facing Information Retrieval (IR) in the setting of WWW” (Bhatia and Khalid, 2008). This thesis is not concerned with Web IR, although the studies and methods discussed here could be applied to documents from the WWW as well. The next subsections introduce the main concepts used in IR, followed by an introduction to the eXtensible Markup Language (XML).

2.2.1 IR concepts
The IR process is usually regarded as the process between issuing a query and returning a list of documents in response to it. Thus, the basic task of an information retrieval system is to perform a search based on a query and return a list of references. The query is the representation of the searcher's information need; it is usually treated as a list (bag) of words called query terms. Using a retrieval model and data from documents stored in some indexing structure, IR systems aim to provide searchers with a ranked list of answers satisfying the searchers' information needs. The commonly accepted information retrieval process is visualised in Figure 2.2. Documents mostly contain textual information, but they can also contain the logical structure of the document as well as embedded multimedia information, such as videos and audio recordings. Both the query and the documents (representing the real world; the set of documents is called a document collection) are further represented (indexed) as sets of query and document terms that can be used for the matching (comparison) process.
Figure 2.2: The information retrieval process.

The matching or comparison is performed on document and query representations (using a retrieval model), and the results, a ranked list of documents (or document portions, also called document parts, components or elements), are retrieved. The retrieved answers can be used to refine the query further, automatically or by users; this is represented by the relevance feedback loop in Figure 2.2. In the information retrieval process, queries are executed using a retrieval engine. A retrieval system is based on one or more retrieval models that define how a relevance score is assigned to a document, or to an element in the document structure, with respect to how well it matches a query. A retrieval model is a mathematical framework that defines the comparison between document and query representations. After the relevance scores (Retrieval Status Values, RSVs) of documents or elements have been determined, they are usually ranked in descending order of their scores, and usually only the highest ranked ones are presented to the user. Measuring the quality of these ranked lists, i.e. the effectiveness of information retrieval systems, is described in Section 2.3. From the user's perspective, the information retrieval process as described above involves minimal user involvement: the process starts when the user's query is entered and finishes when the ranked list is returned. However, it is still part of retrieval when a user enters a query and, upon receiving the ranked list, interacts with the result documents and the ranked list itself. Interactive information retrieval (IIR) considers the user and investigates users' preferences and
behaviour while interacting with the retrieval system. IIR also looks into the ways the retrieval system is connected to its users, i.e. it investigates various interface issues and presentation methods, as well as the interaction itself. IIR is further discussed in Section 2.3.3. As the work of this thesis is also in the context of structured document retrieval, where documents are formatted in XML, the next subsection looks into the basic concepts related to the XML format and XML documents. XML allows a retrieval system to query, index and return components (in other words, portions or elements) of documents, e.g. the section components of an article (as a document).
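To make the notion of a retrieval model and RSVs concrete, the following sketch scores a small collection of bag-of-words documents against a query using TF-IDF weighting. The collection, the term weighting variant and the function name are illustrative only, not those of any particular system described in this thesis; real systems use an inverted index and more refined models.

```python
import math
from collections import Counter

def tf_idf_scores(query_terms, docs):
    """Toy retrieval model: the RSV of each document is the summed
    TF-IDF weight of the query terms it contains; documents are then
    ranked by descending RSV."""
    n = len(docs)
    df = Counter()                      # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    scores = []
    for i, doc in enumerate(docs):
        tf = Counter(doc)               # term frequencies in this document
        rsv = sum(tf[t] * math.log((n + 1) / (df[t] + 1)) for t in query_terms)
        scores.append((i, rsv))
    return sorted(scores, key=lambda s: s[1], reverse=True)

# Invented three-document collection, each document a bag of terms:
docs = [["xml", "retrieval", "evaluation"],
        ["xml", "xml", "schema"],
        ["information", "retrieval"]]
print(tf_idf_scores(["xml", "retrieval"], docs))
```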
2.2.2 XML concepts
To understand the work of this thesis, it is necessary to introduce the main properties of XML documents as well as related definitions and concepts. XML (eXtensible Markup Language) is a markup language for documents containing semi-structured information. A markup language is a mechanism to identify structure and/or semantics in a document; the XML specification defines a standard way to add markup to documents. XML was developed by the XML Working Group (originally known as the SGML Editorial Review Board) formed under the auspices of the World Wide Web Consortium (W3C), and the first recommendation for XML was accepted in 1998 (Bray et al., 1998). XML is a restricted form of SGML (Standard Generalized Markup Language; for an overview see http://www.w3.org/MarkUp/SGML/). Two of the design goals of XML were the following:

• “XML documents shall be easy to create”

• “XML documents should be human-legible and reasonably clear”

The use of tags also made it simple for computers to separate the content of a document from its structure, and to identify relationships among parts of a document. These points had a great impact on why XML is widely used for structured textual documents, in addition to documents used for information exchange over the Web.

2.2.2.1 XML documents

An XML document (for an example, see Figure 2.3) is said to be well-formed if the following are true:
• It contains one or more elements (this is further explained later).

• There is exactly one element, called the root, that is not in the content part of any other element.

• For all other elements, if the start-tag is in the content of an element, the end-tag must be in the content of the same element.

XML documents should begin with an XML declaration that specifies the version of XML being used. Figure 2.3 shows a simple XML document.

<?xml version="1.0"?>
<article id="HUN123">
  <title>Hungarian Rhapsodies</title>
  <section>
    <p>The Hungarian Rhapsodies are a set of pieces of music by Franz
    Liszt originally for solo piano.</p>
    <p>The first 15 rhapsodies were published in the year 1853, with the
    last four being added in 1882 and 1885. Numbers 14, 12, 6, 2, 5 and 9
    were arranged by Liszt for orchestra, and number 14 was also the basis
    of Liszt's Hungarian Fantasia for piano and orchestra. Some are better
    known than others, with number 2 being particularly famous.</p>
    <p>Liszt incorporated many themes which he had heard in his native
    Hungary and which he believed to be folk music, but which were in fact
    tunes written by contemporary composers, often played by Roma bands.
    The large scale structure of each was influenced by the verbunkos, a
    Hungarian dance in several parts, each with a different tempo.</p>
    [...]
  </section>
</article>

Figure 2.3: XML document example

2.2.2.2 Entities and elements
Each XML document has both a logical and a physical structure. Physically, the document is composed of units called entities. An entity may refer to other entities to cause their inclusion in the document. A document begins with a “root” or document entity. Logically, the document is composed of declarations, elements, comments, character references, and processing instructions, all of which are indicated in the document by explicit markup. The logical and physical structures must nest properly. This thesis focuses on elements.
Each XML document contains elements, which are delimited either by start-tags and end-tags or, for empty elements, by an empty-element tag. Each element has a type, identified by a name (sometimes called its “generic identifier”, GI), and may have a set of attribute specifications. Each attribute specification has a name and a value. For example, in Figure 2.3, the element article is delimited by the start-tag <article id="HUN123"> and the end-tag </article>. It also has an attribute specification: id is the name of the attribute and its value is HUN123.

2.2.2.3 Parent and child elements
The parent-child relationship of elements, which is often used in XML retrieval, is defined as follows: if an element C is in the content of another element P but not in the content of any other element that is in the content of P, then P is referred to as the parent of C, and C as a child of P. For example, in Figure 2.3, element title is a child of element article, and section is the parent of element p. The parents, grandparents, etc. of a particular element are called its ancestors; e.g. both article and section are ancestors of the p element in Figure 2.3. Similarly, the children, grandchildren, etc. of an element are called its descendants, based on the analogy between the tree structure of XML documents and a family tree. To define the allowed parent-child (ancestor-descendant) relations and various other requirements, e.g. the maximum number of child elements for a particular element type, a DTD (Document Type Definition) or an XML Schema is used (Fallside and Walmsley, 2004). The heterogeneity of the structure of the various XML document collections used in IR has also attracted interest in the research community (Szlávik and Rölleke, 2005; Frommholz and Larson, 2007).

2.2.2.4 XPath and XSL
XPath (XML Path Language) is a set of syntax rules for defining parts of an XML document. XPath is not a query language; it only addresses information in XML documents. However, as it can identify XML elements, it is also used for querying. The most important characteristics of XPath are summarised as follows (see http://www.w3schools.com/xpath/). XPath

• is a syntax for defining parts of an XML document,

• uses paths to define XML elements,

• defines a library of standard functions,

• is a major element in XSLT (XSL stands for eXtensible Stylesheet Language and T refers to ‘Transformations’),

• is not written in XML,

• is a W3C Standard.

For example, consider the XML example in Figure 2.3. /article/section selects all the section elements, while /article/section/p[1] selects only the first p (paragraph) element. XSL (eXtensible Stylesheet Language) is often used to define how to display an XML document, and it relies strongly on XPath. As XML tags, unlike HTML tags, are not predefined, it is important to define how an XML document should be displayed (e.g. in a browser) since, for instance, the computer does not know in an XML document whether
a <table> element denotes a piece of furniture or an HTML table. XSLT is a language for transforming XML documents into other XML documents. The transformation output is often an HTML document (as HTML, in its XHTML form, is part of the XML document family). XSLT is used to transform XML documents and elements into a browser-friendly format in Chapters 4, 5 and 6.

2.2.2.5 Two views on XML retrieval
It is important to distinguish between two views on XML retrieval. The data-centric view considers XML as a means of exchanging formatted data in a generic, serialised form between different applications such as databases (Fuhr et al., 2002a). Data-centric documents contain highly structured data, such as the contents of an order, an invoice, or a product catalogue. XQuery (http://www.w3.org/TR/xquery/), which is built on XPath expressions, is a query language to extract data from XML documents. XQuery is to XML what SQL is to databases, and it is rarely used by users to formulate information needs in XML information retrieval. The document-centric view focuses on structured documents in the traditional sense, i.e. the XML format is used to mark up textual documents. Such documents may, for example, contain loosely structured text, as in scientific articles or books. This thesis focuses on the document-centric view. The query language XIRQL (http://www.haifa.il.ibm.com/sigir00-xml/final-papers/KaiGross/sigir00.html) aims to combine the two XML views (Fuhr et al., 2002b; Fuhr and Großjohann, 2001). XIRQL is based on the XPath of XQuery; it extends it with weighting and ranking, relevance-oriented search, data types and vague predicates, and structural relativism. However, the query language that is more generally used in the XML IR community is NEXI (Trotman and Sigurbjörnsson, 2004). NEXI is a variant of XPath defined for content-oriented XML retrieval evaluation, and it is more focussed on querying content than many of the XML query languages. Its main feature is the use of ‘aboutness’ instead of the ‘contains’ criterion generally used in XML query languages. Despite the availability of languages able to express structured queries, text-only (content-only (Lalmas and Tombros, 2007b)) queries are still widely used in XML retrieval, and XML retrieval systems are expected to determine automatically the appropriate XML elements to return to users.
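As a small, self-contained illustration of the XPath expressions discussed in Section 2.2.2.4, Python's standard library supports an XPath subset sufficient for the Figure 2.3 examples. The document below abbreviates the figure's content:

```python
import xml.etree.ElementTree as ET

# A reduced version of the Figure 2.3 article (text abbreviated).
doc = """<article id="HUN123">
  <title>Hungarian Rhapsodies</title>
  <section>
    <p>The Hungarian Rhapsodies are a set of pieces by Franz Liszt.</p>
    <p>The first 15 rhapsodies were published in 1853.</p>
  </section>
</article>"""

article = ET.fromstring(doc)   # raises ParseError if not well-formed
print(article.get("id"))       # the attribute value: HUN123

# /article/section : all section elements (paths below are written
# relative to the article root element).
sections = article.findall("section")

# /article/section/p[1] : the first p child of each section.
first_p = article.findall("section/p[1]")
print(len(sections), len(first_p))
```

Note that the parser also enforces the well-formedness constraints of Section 2.2.2.1: removing an end-tag from the string above makes `ET.fromstring` raise an error.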
2.3 IR evaluation
IR systems, whether they return whole documents or document portions (e.g. XML elements), need to be evaluated and their performance measured. Performance can be measured in more than one way: it may be judged by effectiveness (the quality of the set of retrieved documents or elements) or by efficiency (e.g. retrieval time, size of indices). Both aspects are important in performance measurement but, in this section, attention is focused on measuring relevance, as the basis of effectiveness (Mizzaro, 1997; Cosijn and Ingwersen, 2000; Borlund, 2003a). A number of relevance definitions and types have been proposed; see (Borlund, 2003a; Saracevic, 1996) for overviews. The simplest characterisation of relevance is that “Nobody has to explain to users of IR systems what relevance is, even if they struggle (sometimes in vain) to find relevant stuff. People understand relevance intuitively” (Saracevic, 1996). Beyond this intuitive characterisation, Saracevic (1996) identifies five basic relevance types:

1. systems/algorithmic - the relation between the query and information objects (texts),

2. topical (also called intellectual topicality) - between the subject or topic expressed in a query and the subject or topic covered by information objects; associated with aboutness,

3. cognitive/pertinence - between the state of knowledge and cognitive information need of the user and information objects,

4. situational - between the situation, task or problem at hand and information objects,
5. motivational/affective - between the intents, goals and motivations of the user and information objects.

The first of these (algorithmic relevance) belongs to the so-called objective relevance class; the others are in the subjective class. Algorithmic relevance is also known as logical relevance (Cooper, 1971) and topicality, and it is what is applied in the traditional evaluation of IR systems. The algorithmic relevance values calculated by IR systems (usually values in [0,1]) are used to rank the retrieved information objects. This kind of relevance deals with how the query representation matches the contents of the retrieved information objects. Algorithmic relevance is calculated by the retrieval system in Chapter 4, and such relevance values are used to create structural summaries in Chapter 5. The other four relevance types are concerned with aboutness. These types involve degrees of intellectual interpretation carried out by human observers (users, assessors) and are context dependent (unlike algorithmic relevance). Subjective relevance is assessed by IR system users or assessors. In contrast to algorithmic relevance, their decisions about a document's relevance are not straightforward to model: there is no exact algorithm, such as a similarity function, for deciding whether a document is relevant to an information need, and a number of factors affect relevance assessments (Borlund, 2003a). In Chapters 4, 5 and 6, it is pertinence and situational relevance that appear most frequently, as searchers are asked to find relevant information for a given task. Although this thesis does not aim at evaluating IR systems, introducing basic IR evaluation is needed, as the evaluation presented in later sections and chapters is similar to, and sometimes based on, IR evaluation. It is also important to discuss how user-based (interactive) IR systems are evaluated, as this thesis also presents user studies.
The following three subsections describe how evaluation is done in standard document retrieval (Section 2.3.1), XML retrieval (2.3.2) and what is needed for appropriate interactive IR evaluation (2.3.3).
2.3.1 Standard IR evaluation
To evaluate an IR system and its processes and algorithms, test collections are used as a laboratory model of evaluation. The idea of using test collections originates from the 1960s; the first collection, which remained in use for a long time, came from the Cranfield project (Cleverdon, 1967). Widely used test collections include, for example, Medline (1.1MB in size), Time (1.5MB), ADI (0.04MB), CACM (2.2MB), NPL (3.1MB), LISA (3.4MB), CISI (2.2MB) (see http://www.dcs.gla.ac.uk/idom/ir_resources/test_collections/), REUTERS (Sanderson, 1994) and the various collections from TREC (http://trec.nist.gov/data.html). A test collection contains the following:

• a set of documents,

• a set of topics representing information needs, and

• relevance judgements/assessments for each of the topics, defining which documents are relevant to that information need.

The size of collections can vary greatly: collections in previous decades were small (see above), while more recent collections are bigger, e.g. the wt10g TREC collection is 10GB, the .gov collection is 18GB (Soboroff, 2002), and the one used for web spam detection is 420GB compressed (Castillo et al., 2006). The number of topics and the size of the document collection needed to evaluate IR systems effectively are not defined precisely; a study by Buckley and Voorhees (2000) discusses these questions. For this thesis, the aim is to use topic sets and document collections of sizes comparable to those used in related research. The two basic and widely used evaluation measures are Precision (P) and Recall (R). Precision is the ratio of the relevant items retrieved to all retrieved items, or the probability that a retrieved item is relevant. Recall is the ratio of the relevant items retrieved to all the relevant items in the document collection, or the probability that a relevant item is retrieved (shown in Equations 2.1 and 2.2).
Precision = |Retrieved ∩ Relevant| / |Retrieved|        (2.1)

Recall = |Retrieved ∩ Relevant| / |Relevant|        (2.2)
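Equations 2.1 and 2.2 translate directly into a few lines of code. The document identifiers and judgements below are invented for illustration:

```python
def precision_recall(retrieved, relevant):
    """Precision (Eq. 2.1) and recall (Eq. 2.2) for a single query,
    computed from the sets of retrieved and relevant items."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)          # |Retrieved ∩ Relevant|
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical result list and relevance judgements:
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d2", "d4", "d7"])
print(p, r)  # half of the retrieved items are relevant; 2 of 3 relevant found
```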
There are recall and precision values for each query; to evaluate a system, these values need to be averaged. There are two methods for averaging: macro-evaluation or micro-evaluation. In macro-evaluation, the individual P and R values are summed and their average calculated; in micro-evaluation, the averages of the components in the numerator and denominator are calculated first and the division performed afterwards. Macro-evaluation has the advantage of treating all the P and R values equally (i.e. they have the same weight), but it can cause problems when there are no results (division by zero). Micro-evaluation circumvents the problem of empty sets and gives every individual document an equal influence on the result. Macro-evaluation will be used in Chapter 6, in the structure summarisation evaluation. Precision and recall are sometimes combined so that a system is evaluated by a single number (van Rijsbergen, 1980, Chapter 7): e.g. the sum of precision and recall (S = P + R) can be used, or the weighted harmonic mean (called the F measure) can be calculated (Equation 2.3). With the F measure, if α = 0 (β = ∞) then only recall is used, while if α = 1 (β = 0) then only precision is taken into account. α values in the interval (0, 1) emphasise either precision or recall depending on the value of α: e.g. α = 0.2 emphasises recall over precision, while α = 0.6 means that precision is preferred to recall. The weighted F measure is used for structure summarisation evaluation in Chapter 6.
F = ((β² + 1) · P · R) / (β² · P + R) = 1 / (α/P + (1 − α)/R)        (2.3)
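Equation 2.3, in its β form, can be sketched as follows; the precision and recall values passed in are made up for illustration:

```python
def f_measure(p, r, beta=1.0):
    """Weighted harmonic mean of precision and recall (Equation 2.3).
    beta relates to the alpha weighting via beta^2 = (1 - alpha) / alpha:
    beta > 1 emphasises recall, beta < 1 emphasises precision."""
    if p == 0.0 or r == 0.0:
        return 0.0
    return ((beta ** 2 + 1) * p * r) / (beta ** 2 * p + r)

print(f_measure(0.5, 2 / 3))          # balanced F (alpha = 0.5)
print(f_measure(0.5, 2 / 3, beta=2))  # recall-emphasising (alpha = 0.2)
```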
Other single-valued measures based on precision and recall are also used to evaluate systems. Although none of these measures are used directly in this thesis, they are common in information retrieval; in addition, the XML retrieval measures through which evaluated results are analysed for the purposes of this thesis (e.g. Sections 4.2.1 and 6.3) are based on several of them. Frequently used measures are mean average precision, precision after a specific number of documents (e.g. 5 or 10) or at fixed recall levels (Buckley and Voorhees, 2000), mean break-even point, R-Precision, expected search length (Cooper, 1968) and precall (Raghavan et al., 1989). Measures concerned with user-oriented IR evaluation include relative relevance and the ranked half-life indicator (Borlund and Ingwersen, 1998). Precision and recall as introduced above consider binary relevance (i.e. a document is either relevant or not relevant to an information need). For multiple graded relevance, Kekäläinen and Järvelin (2002) introduced generalised precision and recall measures. Cumulative gain-based measurements are also used (Järvelin and Kekäläinen, 2002).
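For graded relevance, a discounted cumulated gain sketch follows, in one common formulation of the measures of Järvelin and Kekäläinen (2002); the gain values in the example are invented:

```python
import math

def dcg(gains):
    """Discounted cumulated gain of a ranked list of graded relevance
    values: the gain at rank i (1-based) is discounted by log2(i + 1),
    so relevant items found early count more."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains, start=1))

def ndcg(gains):
    """DCG normalised by the DCG of the ideal (descending) ordering,
    yielding a value in [0, 1]."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal else 0.0

# A ranking that places a highly relevant item (gain 3) at rank 3:
print(round(ndcg([2, 0, 3, 1]), 3))
```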
2.3.2 Evaluation of XML retrieval
A difference between traditional document retrieval and XML retrieval is that XML retrieval allows systems to return document portions, i.e. XML elements. Returning these parts of the
documents, i.e. the elements, is also referred to as focused retrieval (Trotman et al., 2007). This type of retrieval is particularly useful when the document collection contains longer documents, e.g. books, legal documents or user manuals, where focusing on one particular document section instead of the whole document is especially beneficial. The main forum through which researchers of the XML retrieval community can share their ideas and build test collections for XML retrieval is the Initiative for the Evaluation of XML Retrieval (INEX). INEX was set up in 2002 to “establish an infrastructure and provide means, in the form of large test collections and appropriate scoring methods, for evaluating the effectiveness of content-oriented XML retrieval systems” (Lalmas and Tombros, 2007a). XML retrieval evaluation is done similarly to that of traditional retrieval, i.e. using test collections. However, traditional IR test collections and methodology cannot be applied directly to the evaluation of content-oriented XML retrieval, as they do not consider structure (Gövert et al., 2006). With the introduction of structure into IR, various problems have arisen in XML retrieval. The retrievable units cannot be considered independent, as elements from the same document are naturally connected to one another, at least by the main topic of the document. Since, for example, chapters may contain several sections, elements can also be parts of other elements (ancestor-descendant relationships, Section 2.2.2.3), and this should not be ignored in retrieval. Elements from the same document are said to be the context of one another (Hammer-Aebi et al., 2006). The problem of deciding whether an element, e.g. a section, its descendants, i.e. several paragraphs from the section, or its ancestor, i.e.
the chapter containing the section, should be returned is called the overlap problem (Kazai et al., 2004; Clarke, 2005), and it has caused long debates among researchers in the INEX community. As the retrievable units can no longer be treated as roughly equal in size, and as elements below a certain length cannot practically carry enough relevant information, element size has also become an issue in XML retrieval (Kamps et al., 2005). Element size also plays an important role throughout this thesis. The relatively large test collections of INEX were preceded by smaller-scale efforts, such as the building of the Shakespeare test collection (Kazai et al., 2003), but these were too small to evaluate various XML retrieval methods effectively. The main document collections used at INEX are the IEEE collection (consisting of a total of 16,819 documents; further described in Section 4.3.1), the Wikipedia collection (Denoyer and Gallinari, 2006) (659,388 documents; for details see Section 5.2.2), and the Lonely Planet collection (462 documents containing 203,270
elements, 16MB in size). In XML retrieval, in addition to queries that express information needs with respect to the desired (relevant) content only, it is possible to query an XML retrieval system with respect to both content and structure. Accordingly, two main types of topics are used at INEX, reflecting two types of users with varying levels of knowledge about the structure of the searched collection:

• Content-only (CO) topics are requests that ignore the document structure and are, in a sense, the traditional topics used in IR test collections.

• Content-and-structure (CAS) topics are requests that contain conditions referring both to the content and to the structure of the sought elements. These conditions may refer to the content of specific elements (e.g. the elements to be returned must contain a section about a particular topic), or may specify the type of the requested answer elements (e.g. sections should be retrieved).

As this thesis aims at giving searchers overviews of XML documents, structural information is not used for querying in this thesis; however, it might be worth doing so in future work. There are several retrieval tasks at INEX that simulate various retrieval scenarios. The main INEX activity is the ad hoc retrieval task (renamed to thorough in 2005), which is described as a simulation of how a library might be used; it is a ‘traditional’ task, as it has been present since INEX's launch and is also used in standard document IR (Voorhees, 2001). Among the other INEX tasks, the following is of particular interest for this thesis: the Fetch & Browse task (introduced in 2005 and renamed to Relevant in Context in 2006 (Lalmas, 2005; Clarke et al., 2006, 2007)) aims first to identify relevant documents (the fetching phase), and then to identify the most exhaustive and specific elements (see below) within the fetched documents (the browsing phase).
The browsing phase can be particularly useful when investigating how users navigate within the structure of a document and which elements are suitable to display to users, thus focusing their attention on certain elements. For traditional IR, it has been sufficient to find ‘relevant’ documents; for XML IR, however, it is also necessary to identify the appropriate level of granularity of results. These two dimensions, and their possible values, have evolved since INEX started in 2002. The two main ideas have been the following: exhaustivity measures how exhaustively an element discusses the topic of the user's request, and specificity measures the extent to which an element focuses on the topic of the request (and not on other, irrelevant topics) (Lalmas and Tombros, 2007a).
To evaluate retrieval results, INEX used the inex_eval metric (Gövert and Kazai, 2003), which applies the precall measure (Raghavan et al., 1989) to XML elements. Like precision and recall (Section 2.3.1), inex_eval is based on a counting mechanism, i.e. on the numbers of retrieved and relevant elements. However, it is not ideal when so-called “near misses”, elements from which users can access relevant content, need to be considered. As a result, in 2005 INEX adopted a new metric, called XCG, which is an extension of the Cumulative Gain (CG) based measures (Kazai and Lalmas, 2006, 2005) and includes the user-oriented measure of normalised extended cumulated gain and the system-oriented effort-precision/gain-recall measures (Järvelin and Kekäläinen, 2002). Retrieval result sets produced by INEX participants, and evaluated at INEX using the above measures, are analysed and used in Chapter 6 for the purpose of structure summarisation.
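A naive sketch of how a ranked result list might be post-filtered to address the overlap problem discussed above: keep only the highest-ranked element of each ancestor-descendant chain. The element paths and their ranking are invented, and INEX systems use far more elaborate strategies; this only illustrates the idea.

```python
def remove_overlap(ranked_paths):
    """Given element paths in rank order ('/'-separated, best first),
    drop any element whose ancestor or descendant was already kept,
    so no two returned elements overlap."""
    kept = []
    for path in ranked_paths:
        overlaps = any(
            p == path or p.startswith(path + "/") or path.startswith(p + "/")
            for p in kept
        )
        if not overlaps:
            kept.append(path)
    return kept

print(remove_overlap([
    "/article/section[1]",
    "/article/section[1]/p[2]",   # descendant of an already kept element
    "/article/section[2]/p[1]",
]))  # ['/article/section[1]', '/article/section[2]/p[1]']
```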
2.3.3 Interactive IR
This thesis also involves studies with human searchers; thus, interactive IR (IIR) and its evaluation also need to be introduced in this chapter. This section discusses the requirements for effective user studies and for the evaluation of IIR systems. In the 1990s, several researchers argued that, when searchers are involved, the evaluation of IR systems needs to differ from the original Cranfield model (Cleverdon, 1967) (Saracevic, 1995; Borlund and Ingwersen, 1997). Indeed, an information retrieval system is generally not only a black box that is fed a query and returns a list of documents: it should also include an interface through which the query is obtained from the searcher and the retrieval results are presented, and the interface and its presentation methods should facilitate the searcher's searching process. Interactive information retrieval involves users (searchers) and, since a searcher's state of mind always changes during any cognitive process (including information searching, seeking, relevance assessment, etc.), these changes need to be taken into account when an IIR system is evaluated or tested. It is also important to consider that not only does the state of mind of an individual change during any process, but any two persons are also different from each other (see Sections 2.4.4, 5.4.6 and 6.6.4, where various agreement levels are mentioned or discussed). Unlike in standard IR, where retrieval systems receive scores (e.g. mean average precision, precall, etc.) and can be compared directly by comparing the absolute scores, IIR usually compares two systems or system versions, and researchers try to detect statistically significant differences between them. One system is usually called the control system or baseline; the other is called the experimental system. A framework that is widely used in IIR evaluation is described in (Borlund, 2003b). As this thesis uses users to evaluate the summarisation feature of an interface, as well as to assess structural summaries, it is imperative to follow a well-established evaluation framework. The framework presented by Borlund (2003b) has been used extensively in various interactive system evaluations and user studies (e.g. Tombros et al., 2005b; Larsen et al., 2006a; Hammer-Aebi et al., 2006) and will be used in Chapters 4 and 5. According to the framework, the procedure of IIR evaluation should include the following steps.

1. Recruitment of users (also called searchers or test persons) and informing them about the purpose of the study and their involvement in it. It is highly important that established ethical procedures are followed.

2. Designing the experiment. The system to be used has to be prepared and made ready for the experiments. It is also important to know the collection of documents to be retrieved and used, as well as some background information about the users, in order to create ‘simulated work task situations’ (see below).

3. Conducting a pilot study is recommended, as it may point out problems with the system, the work task situations or the experimental design. Although not reported separately, each study presented in this thesis involved pilot testing before the full-scale user study.

4. A brief demonstration of how the system works is also advised when the experiment with each user starts.

5. Users might be asked to assess the relevance of documents to the corresponding work task situation. This thesis does not aim at evaluating retrieval performance and, hence, does not use document relevance assessments or their corresponding measures and analysis procedures.
Assessments are used in a different context in Chapter 6, to create a structure summarisation test set. According to Borlund (2003b), “a simulated work task situation is a short ‘cover story’ that describes the situation that leads to an individual requiring to use an IR system”. Using simulated
work task situations, the aim is to gain control over the cognitive state of system users and hence to decrease the possibility of arriving at incorrect conclusions, caused by the users' states of mind, when the collected experimental data are analysed. A simulated work task situation has to describe the following to users:

• the source of the information need,
• the environment of the situation,
• the problem to be solved.

It also has to make the test person understand the objective of the search. In this thesis, simulated work task descriptions are used for the text summarisation study as well as for the query-dependent structure summarisation experiment (Chapters 4 and 5). Although in the latter users are not asked to find relevant information in a list of results, simulated work tasks are still used for control purposes.

In an experiment involving users, it is important to make sure that the situational variables are constant and all other effects are as balanced as possible. For example, it might be necessary for all users to work at the same computer (as done in the study described in Chapter 4) when they participate in a study. Some unwanted effects that need to be tackled were identified in the TREC-5 Interactive track (Over, 1996) and were addressed in the following year of the TREC series (Over, 1997). These effects were the following.

• Searcher effect. Results varied greatly for different users. This is still a problem in experiments and studies that involve human searchers. For instance, it is a problem in IR evaluation that assessors often show little agreement on which documents are relevant to the same topic. In IIR, this effect should be tackled by creating work task situations that are as clear and complete as possible.

• Topic effect. Within a single system and searcher, results varied for different topics.

• Searcher-topic interaction. Within a single system, the effect of different searchers on the results varied for different topics.
To neutralise the effects on results caused by the order of system versions (usually two versions, i.e. experimental and control systems, are used and the results obtained compared) and work task
situations (also called topics; a particular order may cause learning and fatigue effects), a Latin square design is used (Table 2.1). The example shown in Table 2.1 is taken from the user study presented in Chapter 4. The Latin square design ensures that the order of systems and tasks is permuted and, thus, that order effects are balanced out. The number of systems and tasks also defines the minimum number of users, i.e. no permutation should be left out if this can be avoided.

User     1   2   3   4   5   6   7   8   9   10  11  12
System   AB  BA  AB  BA  AB  BA  AB  BA  AB  BA  AB  BA
Task     BL  BL  LB  LB  BL  BL  LB  LB  BL  BL  LB  LB

Table 2.1: Latin square design example.

As this thesis uses summarisation in a context determined by IR, XML documents and their retrieval, as well as human searchers, the next section discusses summarisation types and methods, as well as their evaluation.
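The balanced rotation of Table 2.1 can be generated programmatically. The sketch below is a hypothetical Python helper, not part of the thesis's experimental software: it enumerates the four system-order/task-order combinations and cycles through them so that each combination is assigned to the same number of users.

```python
from itertools import product

def latin_square(n_users, systems=("A", "B"), tasks=("B", "L")):
    """Assign each user an order of systems and an order of tasks so
    that all four order combinations occur equally often (cf. Table 2.1)."""
    orders = list(product(
        [systems, systems[::-1]],  # system order: AB or BA
        [tasks, tasks[::-1]],      # task order:   BL or LB
    ))
    # cycle through the four combinations across users
    return [orders[i % len(orders)] for i in range(n_users)]

for user, (sys_order, task_order) in enumerate(latin_square(4), start=1):
    print(user, "".join(sys_order), "".join(task_order))
```

The exact interleaving in Table 2.1 differs from this enumeration order, but any assignment in which the four combinations occur equally often balances the order effects in the same way.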
2.4 Summarisation
One of the main aims of this thesis is to provide overviews of XML documents and elements to searchers. One way to create an overview is to generate a summary of the corresponding document and display it to searchers. This section discusses summarisation types as well as the methods upon which the summarisation methods introduced in Chapters 4, 5 and 6 are based.

Summaries and abstracts are created to give users a short but concise overview of a source document, presenting its objectives, scope and findings (Maizell et al., 1971). The rapidly growing amount of information and source documents requires automatic generation of summaries, and automatic text summarisation has therefore become a major research field. Automatic summarisation dates back to the 1950s (Luhn, 1958); the first survey of summarisation approaches was conducted as early as 1961 (Edmundson and Wyllys, 1961).

Early work focused on summarising whole documents, as a document was considered one unit of retrieval. Since then, however, this has become a sub-territory within the summarisation field, called single-document summarisation. Multi-document summarisation is also used, mostly to summarise document clusters (e.g. (Radev et al., 2004)). In this thesis, the focus is on summaries of single documents that consist of hierarchically organised document portions, elements, whose content can also be summarised. As element text summarisation (also referred to as focused summarisation in this thesis) has not yet been widely investigated, the classical summarisation approaches in this chapter are described in the context of single-document summarisation. Related work on focused summarisation will be covered in Chapter 3.

Summaries can be categorised into several types depending on various criteria (Ganapathiraju, 2002):

• Detail. Summaries can be indicative of what a particular subject is about, or informative about specific details of the subject. In a typical IR situation, indicative summaries are used to show what a document is about, in order to help the user decide whether reading the whole document is worthwhile for finding relevant information. An informative summary, which can also be called an abstract, can serve as a substitute for the whole document. As this thesis focuses mainly on helping users to access the relevant content within a document, the summaries described in this thesis are of an indicative nature.

• Granularity. Summaries can describe a specific subject or give an overview of a broader topic. As the source documents can also be specific or broad, both kinds of summary are present in an IR process. Both types of documents are also to be exploited in XML retrieval; thus, the summaries of this thesis can be either broad or specific, depending on the documents and, if applicable, the information needs (queries).

• Technique. Summaries can be coherent abstracts of a document or extracts (e.g. extracted sentences).
Web search snippets are usually of the extract type, as it is easier (and still quite effective) to take whole sentences or expressions than to create a coherent text from the document. Also, searchers can look for the exact phrases from the snippets in the document, and extracts can often be as useful as abstracts. This thesis uses extract summaries for its text summarisation for the reasons described above.

• Content. There are generalised and query-based summaries, also called query-independent and query-dependent. Most summaries used in IR are query-based, sometimes showing only the query words and one or two adjacent words that give context to them. In this work, there is a high emphasis on both query-independent and query-dependent summaries. Chapters 5 and 6 introduce summarisation of the document's structure; the summariser of Chapter 5 is query-based, while that of Chapter 6 is query-independent (generalised). The text summarisation method described in Chapter 4 is query-based.

• Approach. Summaries can be domain- or genre-specific, or independent. Scientific and news articles, for instance, may require different approaches when summarised. A scientific paper usually has a conclusion section at its end containing important, condensed information about the document's contents, whereas a news article usually contains the most informative sentences at its beginning and no strikingly important information at its end.

• Source. We can summarise document sets, single documents, or document portions (XML elements). An XML element's summary might need a different summarisation algorithm from that of a single document because the individual elements are not independent of one another: they are the context of one another, which might need to be considered. Also, the summary of an XML element requires a different approach from that of a document set because, although there is a relation among the documents of a set, they are not always in a hierarchical relationship and are certainly not parts of other documents in the same set.

• Target. Most of the time, summarisation focuses on the textual content of documents and tries to create an overview of the document by creating a textual summary, i.e. a text describing what the document is about. However, it is also possible to summarise other properties. This thesis focuses on the logical structure of documents and, in Chapters 5 and 6, aims to create structural summaries.
In structure summarisation, the most important portions of documents are selected and used to create structural overviews, which are also referred to as automatic tables of contents in this thesis.

With respect to technique, extraction methods tend to be less complex, as summary generation does not require understanding the contents or recreating grammatically correct sentences. Despite this, extraction methods perform relatively well. The extraction of sentences has the advantage that the sentences are grammatically correct, i.e. each sentence is a coherent sequence of words. Selecting sentences and placing them one after another in a summary might easily
result in a coherent(-looking) text. The first systematic approach to summarisation (Edmundson, 1969) forms the core of extraction methods even today. The key ideas of this approach are the following.

1. Study human-generated abstracts and specify the characteristics expected in automatically generated abstracts.

2. Generate such abstracts manually.

3. Design mathematical and logical formulations to score and pick out sentences from the documents so as to match the manually generated abstracts.

4. Iteratively improve the sentence-scoring scheme to match the automatic abstracts to the manually generated abstracts.

The following subsections describe a basic sentence extraction method, a probabilistic summarisation method and several other summarisation approaches. The first two are particularly relevant to this thesis, as they form the basis of both the content summarisation introduced in Chapter 4 and the structure summarisation used in Chapters 5 and 6.
2.4.1 Sentence extraction
Sentence extraction methods for summarisation usually consider each sentence as a possible unit to be included in the summary (Paice, 1990; Kupiec et al., 1995; Brandow et al., 1995); more rarely, paragraphs are considered (Salton and Allan, 1996). These candidate sentences then receive scores, and the highest-scoring sentences are selected for inclusion in the summary. They are displayed in their order of appearance in the source document, i.e. they are not sorted by score as search results are in information retrieval. The score is usually a linear combination of various sentence feature scores (possible features are described in detail later in this section). For instance, the score of each sentence can be computed as shown in Equation 2.4:
\[ S_i = w_1 \cdot C_i + w_2 \cdot K_i + w_3 \cdot T_i + w_4 \cdot L_i \tag{2.4} \]
where S_i is the score of sentence i; C_i, K_i and T_i are the scores of sentence i based on the number of cue words, keywords and title words it contains, respectively; L_i is the score of the sentence based on its location in the document; and w_1, w_2, w_3, w_4 are the weights for the linear combination of the above four scores.
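As an illustration, the linear combination of Equation 2.4 can be sketched as follows. The feature scorers shown (a title-keyword count and a short-length cut-off) are simplified stand-ins, and the sentences and title words are toy data; this is not the thesis's actual implementation.

```python
def summarise(sentences, feature_scorers, weights, n=3):
    """Score sentences by a linear combination of feature scores
    (Equation 2.4) and return the top-n, in document order."""
    scored = []
    for i, s in enumerate(sentences):
        score = sum(w * f(s) for w, f in zip(weights, feature_scorers))
        scored.append((score, i))
    top = sorted(scored, reverse=True)[:n]   # highest-scoring sentences...
    keep = sorted(i for _, i in top)         # ...shown in document order
    return [sentences[i] for i in keep]

# toy usage with two illustrative feature scorers
sents = [
    "Short one.",
    "XML summarisation is the topic of this work.",
    "Nothing relevant here at all today.",
    "We evaluate summarisation of XML documents thoroughly.",
]
title_words = {"xml", "summarisation"}   # hypothetical title keywords
features = [
    lambda s: sum(w.lower().strip(".") in title_words for w in s.split()),
    lambda s: 1.0 if len(s.split()) >= 5 else 0.0,   # short-length cut-off
]
print(summarise(sents, features, weights=[1.0, 0.5], n=2))
```

Note that the selected sentences are re-sorted by position before display, matching the convention described above.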
Sentence features that are generally used include the following (Ganapathiraju, 2002; Goldstein et al., 1999; Edmundson, 1969). Note that while some features, if present, bring higher scores to sentences (e.g., title keywords, 'bonus' words), others are considered negative (e.g., pronouns, 'stigma' words).

• Keyword-occurrence. Sentences containing keywords that are used most often in the document usually represent the theme of the document. This feature is the basis of the tf-idf method, which can also be used on its own (Jones, 1988; van Rijsbergen, 1980). The tf-idf model is built using the usual weighted term-frequency and inverse sentence-frequency paradigm, where sentence-frequency is the number of sentences in the document that contain the term. The resulting sentence vectors are scored by similarity to the query, and the highest-scoring sentences are picked to be part of the summary (Neto et al., 2000). This summarisation is query-specific, but it can be adapted to be generic by taking the non-stopwords that occur most frequently in the document as the query words. Since these words represent the theme of the document, the generated summaries are generic.

• Title-keyword. Sentences containing words that appear in the title (or a heading or sub-heading) are also indicative of the theme of the document and receive a higher score.

• Location heuristic. In news articles, for example, the first sentence is often the most important one; in technical articles, the last sentences of abstracts or those from conclusions report the findings of the document.

• Indicative phrases. Also called cue phrases; sentences containing key phrases such as "this report...".

• Short-length cut-off. Short sentences are usually not included in summaries.

• Upper-case word feature. Sentences containing acronyms or proper names are more likely to be included in summaries.
• Pronouns. Sentences with pronouns such as "she", "they" and "it" cannot be included in a summary unless the pronouns are expanded into the corresponding nouns.

• Redundancy in summaries. Earlier automatic summarisers normally did not consider anti-redundancy (which ensures that information repetition in the summary is minimal); however, it is an important feature of a summary, and recent systems take redundancy information into account. The redundancy score is computed dynamically as the sentences are selected, to ensure there is no repetitive information in the summary. The following are two examples of anti-redundancy scoring, applied when a new sentence is added to the summary:

1. Scale down the scores of all the sentences not yet included in the summary by an amount proportional to their similarity to the summary generated so far.

2. Recompute the scores of all remaining sentences after the words that are present both in the summary and in the query/centroid of the document have been removed.

A simplified version of this method is used to summarise the text of document portions in Chapter 4, and an adapted version, for structure summarisation, is discussed in Chapter 5 of this thesis. The following subsection describes another sentence extraction method, which is based on a probabilistic framework and introduces the training of a summariser. That method is used for structure summarisation in Chapter 6.
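The first anti-redundancy scheme above (scaling down scores in proportion to similarity with the summary so far) can be sketched as a greedy selection loop. The word-overlap (Jaccard) similarity and the penalty factor used here are illustrative assumptions, not a formulation taken from the cited work.

```python
def select_non_redundant(sentences, base_scores, n=3, penalty=1.0):
    """Greedy selection: after each pick, scale down the remaining
    sentences' scores in proportion to their (word-overlap) similarity
    to the summary built so far."""
    def words(s):
        return set(s.lower().split())

    chosen, summary_words = [], set()
    scores = dict(enumerate(base_scores))
    for _ in range(min(n, len(sentences))):
        best = max(scores, key=lambda i: scores[i])
        chosen.append(best)
        summary_words |= words(sentences[best])
        del scores[best]
        for i in scores:  # penalise sentences similar to the summary so far
            sim = (len(words(sentences[i]) & summary_words)
                   / max(len(words(sentences[i]) | summary_words), 1))
            scores[i] *= (1.0 - penalty * sim)
    return [sentences[i] for i in sorted(chosen)]
```

With `penalty=1.0`, a sentence identical to the summary so far has its score driven to zero, so a lower-scored but novel sentence is preferred over a near-duplicate.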
2.4.2 Probabilistic summarisation
Edmundson (1969) used a linear combination of scores obtained for various features and determined the feature weights by experimenting with various weight combinations to find the vector of weights that works best. Probabilistic summarisation methods, by contrast, use a training set of document-summary pairs to determine the probabilities of features occurring in positive or negative examples, and use this information to build summaries for other documents. Given a set of training documents and their extractive summaries, the summarisation process is modelled as a classification problem: sentences are classified as summary and non-summary sentences based on the features that they possess (Kupiec et al., 1995; Teufel and Moens, 1997). The features usually used to distinguish summary sentences are those listed in the previous section. According to Kupiec et al. (1995), groups of features tend to work better in text summarisation than individual features alone. Individual features and their combinations are also compared in this thesis, for structure summarisation (Chapter 6). In (Kupiec et al., 1995), Bayesian classification is used and applied as follows. For each sentence s, the probability that it will be included in a summary S given the k features F_j, j = 1..k, is computed; this can be expressed using Bayes' rule, shown in Equation 2.5:
\[ P(s \in S \mid F_1, F_2, \ldots, F_k) = \frac{P(F_1, F_2, \ldots, F_k \mid s \in S) \cdot P(s \in S)}{P(F_1, F_2, \ldots, F_k)} \tag{2.5} \]
The naive Bayes assumption is then used, which assumes that the features are statistically independent given the class (Equation 2.6). Although the features are usually not completely independent of one another, it has been shown that using the assumption still leads to good results and simplifies the formula, so that much less effort is needed to create summaries (Hand and Yu, 2001).
\[ P(s \in S \mid F_1, F_2, \ldots, F_k) = \frac{\prod_{j=1}^{k} P(F_j \mid s \in S) \cdot P(s \in S)}{\prod_{j=1}^{k} P(F_j)} \tag{2.6} \]
P(s ∈ S) is a constant, P(F_j | s ∈ S) can be estimated directly from the training set, and P(F_j) drops out when the probabilistic odds are calculated. The probability estimation is done by counting feature occurrences. Since all the features are discrete, the equation can be formulated in terms of probabilities rather than likelihoods. This yields a simple Bayesian classification function that assigns each s a score, which can be used to select sentences for inclusion in a generated summary. The sentence selection function is shown in Equations 2.7 and 2.8 (Elkan, 1997). If this number is greater than zero, sentence s is included in the summary; otherwise it is excluded.
\[ \log \frac{P(s \in S \mid F_1 = f_1, F_2 = f_2, \ldots, F_k = f_k)}{P(s \notin S \mid F_1 = f_1, F_2 = f_2, \ldots, F_k = f_k)} \tag{2.7} \]
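Putting the estimation and selection steps together, the log-odds of Equation 2.7 can be computed from a training set by counting feature occurrences in summary and non-summary sentences. The sketch below is a minimal illustration with binary features; the add-one smoothing is an assumption added here, not something stated in the source.

```python
import math

def train(examples):
    """examples: (feature_vector, in_summary) pairs; feature_vector is
    a tuple of booleans, one per feature. Returns smoothed counts."""
    k = len(examples[0][0])
    pos = [1] * k              # add-one smoothed counts of F_j given s in S
    neg = [1] * k              # ... given s not in S
    n_pos = n_neg = 0
    for fv, in_summary in examples:
        if in_summary:
            n_pos += 1
            for j, f in enumerate(fv):
                pos[j] += f
        else:
            n_neg += 1
            for j, f in enumerate(fv):
                neg[j] += f
    return pos, neg, n_pos, n_neg

def log_odds(fv, model):
    """Log-odds of Equation 2.7 under the naive Bayes assumption;
    include the sentence in the summary if the result is > 0."""
    pos, neg, n_pos, n_neg = model
    lo = math.log((n_pos + 1) / (n_neg + 1))   # (smoothed) prior odds
    for j, f in enumerate(fv):
        p = pos[j] / (n_pos + 2)               # P(F_j | s in S)
        q = neg[j] / (n_neg + 2)               # P(F_j | s not in S)
        lo += math.log(p / q) if f else math.log((1 - p) / (1 - q))
    return lo
```

A sentence possessing features that occur mostly in training-set summary sentences receives positive log-odds and is selected; one whose features occur mostly in non-summary sentences receives negative log-odds and is excluded.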
Sentence extraction can easily be adapted to structure summarisation where, instead of selecting sentences from a document, elements of the document are selected and displayed in a table of contents. Such summarisation methods are described in Chapters 5 and 6.
2.4.3 Other summarisation methods
Other summarisation methods that use classification have also been found useful. For example, linear classifiers, such as the method shown above and Support Vector Machines (Hirao et al., 2002), work as well as other, non-linear classifiers (Chuang and Yang, 2000). Some summarisation methods operate on document clusters, where sentence selection is based on the similarity of the sentences to the theme of the cluster. A sentence-level extractive summariser that takes document clusters as input is MEAD (Radev et al., 2004); another summariser that uses clustering is that of Stein et al. (2000). The SUMMARIST system (Lin, 1999) uses three stages for summarisation: topic identification, which identifies the most important (central) topics of the text, followed by topic interpretation and summary generation. Barzilay and Elhadad (1997) use lexical chains to summarise the content of documents. Graph-theoretic representation of passages provides a method for identifying such themes (Ganapathiraju, 2002). This thesis does not address the identification of the themes documents might contain; we aim at summarising any content or structure within single documents.
2.4.4 Summarisation evaluation
Similarly to IR systems and retrieval methods, summarisation methods also need to be evaluated. According to Mani's overview of summarisation evaluation (Mani, 2001), there are two types: i) intrinsic evaluation, which tests the summarisation system in and of itself, and ii) extrinsic evaluation, which tests the summarisation system based on how it affects the completion of some other task.

Evaluating summarisation systems on their own (i.e., intrinsic evaluation) has two (somewhat orthogonal) criteria: coherence and informativeness. Coherence measures how readable the summary is, while informativeness can measure how much information from the source is preserved in the summary, or how much information from a reference summary is covered by the system summary. Intrinsic evaluation can be carried out by comparing the created summaries to manually created reference summaries. The classical way is to compare them manually (Edmundson, 1969): evaluators are asked to judge the quality of the automatic abstract (originally on a 5-point scale of similarity). Other studies showed, however, that there is low agreement between such assessors (Rath et al., 1961). Automatic scoring may help to avoid this inconsistency. Several measures exist to carry out such comparisons, including the following (Mani, 2001):

• Sentence Recall measures how many of the reference summary sentences the machine summary contains. Similarly, Precision can also be used. This measure is used mainly when the machine abstract is an extract. In Chapter 6, appropriately modified versions of
these measures will be introduced and used.

• Sentence Rank can be used when the summary is specified as a ranking of sentences in terms of summary-worthiness. The sentence rankings of the machine summary and the reference summary can be compared using a correlation measure. This measure is also used to evaluate extracts.

• Utility-based measures are based on a more fine-grained approach to judging the summary-worthiness of sentences, rather than just boolean judgements. These methods allow more precise measurements of informativeness.

• Content-based measures are based on vocabulary similarity.

Another method of evaluating summarisation is to compare the created abstract with the source document: assessors are asked to determine summary informativeness in the context of the source.

The idea of extrinsic summarisation evaluation is to determine the effect of summarisation on some other task. This other task can be, for example, IR relevance assessment, where assessors are provided either with the summary of a document or with the original document, and the effectiveness of the summaries is then measured. Such a measure can be, for instance, the reduction in time needed to assess documents. Recently, mostly in the context of the Document Understanding Conferences (DUC)10, various evaluation methods and measures have been used, such as BLEU (Papineni et al., 2001), WSumACCY and ROUGE (Lin, 2004). As this thesis applies extraction-based summarisation methods to element text summarisation and document structure summarisation, evaluation methods designed to evaluate abstract-type summaries (see those listed above) are not used in the presented work. Evaluation measures based on sentence recall and precision are used in Chapter 6.
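For extracts, sentence recall and precision reduce to simple set overlap between the machine-selected and reference-selected sentences. A minimal sketch follows (the appropriately modified versions used in Chapter 6 are not reproduced here):

```python
def sentence_recall_precision(machine, reference):
    """Sentence recall and precision for extracts: machine and
    reference are collections of selected sentence indices."""
    machine, reference = set(machine), set(reference)
    overlap = len(machine & reference)
    recall = overlap / len(reference) if reference else 0.0
    precision = overlap / len(machine) if machine else 0.0
    return recall, precision
```

For example, a machine extract of sentences {1, 3, 5} against a reference extract of {1, 2, 3, 4} yields a recall of 2/4 and a precision of 2/3.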
2.5 Conclusion
In this chapter, first, the basic concepts of information retrieval (IR) and XML were introduced. To investigate and evaluate summarisation in the context of XML IR, it was also necessary to describe how information retrieval systems are evaluated (Section 2.3). XML retrieval is a relatively new area within IR, and the logical structure offered by the XML markup

10 http://duc.nist.gov
requires XML retrieval to be handled differently from traditional flat document retrieval when evaluating retrieval systems (Section 2.3.2). The evaluation of retrieval systems must also be done differently if the users of search systems are to be considered, i.e. an interactive information retrieval (IIR) system requires yet other evaluation methodologies and methods (Section 2.3.3). Evaluation in these areas, i.e. traditional IR, XML retrieval and IIR, is needed for Chapters 4, 5 and 6, as results of (XML) retrieval are used and analysed in several sections of this thesis (Sections 4.2.1 and 6.3). Also, the studies introduced later in this thesis are mainly user-based (Chapters 4 and 5 and Section 6.6).

In this chapter, summarisation was also introduced by describing various types of summaries and the particularly relevant extraction methods that are adapted for XML text and structure summarisation in Chapters 4, 5 and 6. Basic summarisation evaluation methods, which are considered sufficient for this thesis's work, were also described in Section 2.4.

The next chapter discusses previous work on overview presentation to users of retrieval systems (Section 3.2), on summarisation when structure is available (Sections 3.3 and 3.4) and, within that, on summarisation specifically connected to the XML format and human searchers (Section 3.5). The chapter aims to provide background and motivation for the work of this thesis (described in Chapters 4, 5 and 6).
Chapter 3

Summarisation and XML retrieval
After introducing the basic concepts of information retrieval (IR), XML and summarisation in Chapter 2, this chapter discusses research that is directly relevant to the work of this thesis. The discussion of related literature aims at providing more detailed background and motivation for the work presented in this thesis. This chapter presents related work on overview (i.e. summary) presentation for structured information (Section 3.2), various overview generation methods for documents formatted in XML (Sections 3.3 and 3.4) and research concerning three main themes of this thesis, i.e. summarisation, XML and human searchers, in Section 3.5.
3.1 Introduction
Automatic summarisation has been used to create short but concise overviews of documents. A summary of a document can be very useful if, for example, searchers do not have time to read the many documents (possibly returned by a search system) needed to find relevant content. A summary, in general, conveys the main idea of the document's content and hence makes the information searching process easier for searchers. Summaries are well suited for use in information retrieval systems, as well as in digital libraries. The result of retrieval is a list of documents ranked according to their relevance as estimated by the retrieval engine. Although a search engine may perform very well in returning a list of documents, searchers still need to choose the best document(s) from this list and read the corresponding documents' contents. An overview of the contents can be highly useful in deciding which documents are the most promising for finding relevant information.
Summaries have been used as overviews; they are often called "snippets" (Turpin et al., 2007) and can consist of anything from several words to several sentences. This thesis does not investigate summaries associated with result-list items, but looks at another stage of the searching process: the stage at which the searcher, having chosen documents from the result list, needs to find the relevant content within those documents. This is often not supported by retrieval systems, especially those searching the Web. Adding document overviews to individual documents may help searchers locate relevant information more effectively.

To create overviews of whole documents, the (hierarchical) logical structure of documents (often available through XML markup) can be exploited. In addition to creating overviews of whole documents, the logical structure also allows the creation of overviews for individual document portions, i.e. XML elements. How this structure can be used for overview generation is one of the main questions investigated in this thesis.

The following sections discuss related work that has been carried out in the context of overview generation and presentation. First, various visualisation methods aimed at providing overviews of hierarchical or non-hierarchical data, retrieval search results, etc. are discussed in Section 3.2, where the need for simple and intuitive display methods is emphasised. Section 3.3 describes several ways to create overviews of XML documents and discusses the interpretations of the word 'summarisation' in the context of XML documents. Among the three summarisation interpretations, emphasis is given to summarisation that deals directly with the XML format (Section 3.4). Interactive experiments in the context of XML access and retrieval are also discussed, as studies involving human searchers are closely related to the topic of this thesis (Section 3.5).
Section 3.6 aims to provide the motivation for this thesis by discussing the related work presented in this chapter.
3.2 Summarisation and structured information
In this section, various overview presentation methods are discussed. As summaries are defined as "short but concise overviews" (Maizell et al., 1971), there are numerous ways for such overviews to be created and displayed and, thus, numerous ways for the word 'summarisation' to be interpreted. It is not immediately obvious what is meant by overviews or summaries: it is possible to summarise the contents of documents, the contents of document sets, the most important keywords of documents, or even the document structure. When the structure (of documents, sets of documents, etc.) is available (e.g.
with the XML markup), creating and displaying overviews becomes more complex.

Figure 3.1: A model to illustrate overviews and summaries.

Figure 3.1 models overview (summary) types. When creating overviews, one should determine an overview Source: it is possible to create overviews for document sets (e.g. the results of a search), for documents or for document portions. When creating overviews, it is possible to Focus on the content of the source, on the structure of the source, or on both at the same time. When displaying overviews, one can choose to do so using a graphical method, text only, or a combination of graphics and text. For example, in multi-document summarisation the source is a document set, the focus is on content, and the display is textual. In this thesis, Chapter 4 uses textual summaries whose sources are document portions; the focus of the summary generation method is on content, and the display method is textual. In Chapters 5 and 6, the source is a document, the focus is on a combination of content and structure, and the display method is textual. Table 3.1 shows further examples.1

Overview                             Focus
TileBars                             Content and Structure
Partial Treemaps                     Content and Structure
(Chen and Dumais, 2000)              Content
ProfileSkim                          Content and Structure
Thumbnails                           Structure
Superbook                            Structure
(Buyukkokten et al., 2001b)          Content and Structure
WebTOC                               Structure
Summarisation at DUC                 Content
Text summarisation, Ch. 4            Content
Structure summarisation, Ch. 5-6     Structure

Table 3.1: Categorisation of several overviews.

The following sections present and discuss various overview generation and presentation methods.

1 Some overviews could be categorised differently depending on how one interprets the Focus dimension (e.g. we could say that structure summarisation focuses on both content and structure, as the XML elements' contents are also used to create structure summaries), as well as on the definition of what constitutes a graphical display (e.g. a table of contents that also contains coloured boxes indicating each element's relevance can be considered textual but, because of the boxes, graphical as well).

Figure 3.2: ProfileSkim (Harper et al., 2003).
3.2.1 Graphical and textual overviews
A number of researchers investigated various display methods for individual documents and for hierarchical or non-hierarchical data before XML became widespread and popular. This section discusses the use of various display methods for showing overviews of documents to searchers. Overviews can be presented graphically or by using text-based presentation methods. For example, TileBars (Hearst, 1995) visualise various document features such as the relative length of documents, the frequency of the topic words in the document, and the distribution of the topic words with respect to the document and to each other. Partial Treemaps (Großjohann et al., 2002) present the relative relevance of elements within a document. Thumbnails (Woodruff et al., 2002; Dziadosz and Chandrasekar, 2002) try to help searchers make better decisions by presenting the query term distributions in retrieved documents or small image-based previews of the retrieved documents. Harper et al. (2003) introduce query-based document skimming, adding a diagram to the top of the document view showing which parts of the document contain most of the user's query words (Figure 3.2). It is based on the browser's Find command and also includes term highlighting (the latter is also used in the system described in Chapter 4). Another example is the work of Byrd (1999), who uses scrollbar-based visualisation for document navigation.
Figure 3.3: The InfoCrystal display (Spoerri, 1993).
Various methods have been used to represent search results, for example, LyberWorld (Hemmje, 1995), InfoCrystal (Spoerri, 1993) (Figure 3.3) and BEAD (Chalmers and Chitson, 1992). For more visualisation examples see (Andrews, 1996; Sebrechts et al., 1999; Shneiderman et al., 2000; Baeza-Yates and Ribeiro-Neto, 1999, Chapter 10). Complex graphical user interfaces have not been used widely, mostly because searchers need time to learn their metaphors, structure and navigation (Sebrechts et al., 1999). Also, these presentation methods might require specific software or faster hardware (the latter is less of an obstacle nowadays; however, software compatibility issues are still relatively frequent, e.g. different browsers display the same content differently) (Nation, 1998). The success of popular web search engines shows that the search interface should be as simple and familiar to searchers as possible. To avoid problems originating from complexity and familiarity issues, this thesis aims at providing simple-looking (i.e. text-based) and familiar (i.e. web-based) interfaces to searchers. The most often used textual (thus, easier to learn and to use) overviews in the area of IR are summaries of the textual contents of documents. When retrieval results are presented to searchers, the natural overview that indicates the content of a document is its title (Hagerty, 1967; Saracevic, 1969). Short summaries (snippets), consisting of several words or sentences, are also
often used. Other ‘metadata’ (Larsen et al., 2006b) such as size, access path, authors and date are also often displayed to provide searchers with information about documents, thus giving an overview without having to display the whole document. Interface techniques that expose searchers to increasingly larger amounts of a document’s content, helping them decide whether visiting the whole document is worthwhile, have also been introduced (Zellweger et al., 2000; Paek et al., 2004). Marchionini and Shneiderman (1988) and Dumais et al. (2001) present summaries of the document content when the searcher moves the mouse pointer over a hyperlink. Großjohann et al. (2002) also use “tool-tip” summaries together with the Treemap presentation method. According to Dumais et al. (2001), “Subjects read the full pages mostly to confirm what they found in the summary. This significantly reduces search time because the short summaries can be read faster than a full page of text”, which shows the usefulness of summaries appearing automatically before a searcher clicks on a link to access the full document. Similar pop-up summaries have been investigated and found effective for providing implicit feedback by White et al. (2002). This technique, i.e. displaying summaries when the mouse pointer is over a link, is also used in Chapter 4 where text summarisation is investigated.
3.2.2 Overviews of multiple documents and hierarchically structured information
As XML documents have a hierarchical logical structure, related work on overviews of document sets, grouped or categorised (hierarchically or otherwise), is discussed in this section. Various methods have been proposed to create overviews of document sets. Clustering approaches such as Grouper (Zamir and Etzioni, 1999) and Scatter/Gather (Cutting et al., 1992) have been developed to better organise search results. The Vivisimo search system2 also uses clustering when search results are displayed (Koshman et al., 2006). However, clustering methods are slow, and the possibly uninformative labelling of the created clusters can make them difficult to understand. Approaches that categorise documents (Chen and Dumais, 2000; Dumais et al., 2001; Pulijala and Gauch, 2004) have also been shown to be effective. The categories (topics) of documents are automatically determined and displayed in a hierarchical, table-of-contents-like manner in the work by Lawrie (2003). For a document set, topic words are identified, which are then displayed according to the identified topic hierarchy. Corresponding documents can be accessed by browsing this hierarchy.
2 http://vivisimo.com
Figure 3.4: The PDA Power Browser (Buyukkokten et al., 2001a).
The Superbook project (Remde et al., 1987; Landauer et al., 1993) also uses the familiar concept of a (hierarchical) table of contents (often referred to as ToC in this thesis) to show an overview of documents. In addition, they highlight the current ToC item (e.g. a section) with a fisheye-like method which, being built upon a familiar concept and simple to understand, works well. When limited display space is available, e.g. on a PDA or mobile phone, simple interfaces with an overview of the contents of documents become even more important. Hierarchies are still important in this context because, for example, it might be useful to display only the highest level of the hierarchy first (i.e. similarly to ToCs where only the titles of the highest level sections are displayed), and expand the view when users select an item at this highest level (for example, see Figure 3.4). Summarisation is investigated in the context of small screen devices and retrieval results in (Sweeney and Crestani, 2006). Buyukkokten et al. (2001b,a) use text summarisation for web browsing on handheld devices (Figure 3.4). They introduce and evaluate five methods for “progressively disclosing STUs” (Semantic Textual Units). Given more display space than a PDA can afford, such as that of a desktop PC’s monitor, the latter work can potentially be extended into a hierarchical table of contents display with additional textual summaries. The hierarchy of the content is displayed in a tree structure in Docuverse (Heo et al., 1996). “Docuverse’s developer realizes that the hierarchy may become unwieldy as the number of nodes increases. This is also a problem for WebTOC and many other systems” (Nation, 1998). WebTOC automatically generates a hierarchical table of contents of a site using two different
Figure 3.5: The WebTOC system.
strategies: following existing links or using the underlying directory and file structure (Figure 3.5). The problem of an increasing number of nodes, which is closely related to increasing document length if single documents are considered, is also investigated in this thesis. When a hierarchical overview (i.e. ToC) becomes long, it might be necessary to create an overview of it, i.e. an overview (summary) of the table of contents, which is considered to be useful for searchers who want to access relevant information in as few clicks as possible. This section has discussed related work on presenting overviews of flat documents and document sets. However, none of the described work presented overviews of overlapping contents within documents. The next sections discuss overviews for XML documents where the contents within documents are structured hierarchically.
3.3 Overviews of XML documents
Having presented several overview methods, this section discusses overviews of XML documents. According to Sengupta et al. (2004), the methods for “compressing document content can be broadly classified” into three groups, namely thumbnailing, compression and summarisation. What they mean by compressing document content is similar to what this thesis refers to as overview. As they use the expression in the context of XML documents, the three groups are further explored in relation to this thesis. This section focuses on the first two groups, i.e.
thumbnailing and compression. Neither of them is the focus of the work presented in this thesis; however, it is important to discuss them in order to differentiate them from the third group, summarisation, which is the focus of this thesis. Related work with respect to summarisation of XML documents is discussed in Section 3.4.
3.3.1 Thumbnailing
Thumbnailing is a visualisation technique that helps users handle large documents. A thumbnail is typically an image that represents the layout of a document (Sengupta et al., 2004). It has recently drawn attention from the book retrieval community for several reasons: it is straightforward to create thumbnails for individual pages if the page information is available (often in the XML markup); and since the text of such pages cannot usually be read through the thumbnails, they can be used to give an overview of the general structure of a page while the page is still protected against possible copyright violations. However, this overview type is also an obvious disadvantage of thumbnails: although a thumbnail shows something (i.e. an overall layout, which can be useful, e.g., in cross-language IR where the “not seen text” is in a foreign language (Ogden and Davis, 2000)), it still does not show enough, i.e. it is not visible what the page is about unless its heading text is very large. Research has been carried out to compensate for this effect (Suh et al., 2002), but its results are difficult to apply to XML documents where there is a rich and possibly deep logical structure. Also, thumbnailing is not thought to be useful when the structure is nested, and the size of the data, i.e. XML elements, to be “thumbnailed” can be very different depending on the length and internal logical structure of the document. Thumbnails have been shown to be more effective in overview+detail interfaces than alone: the overview is given by the thumbnails on one side of the interface (usually on the left hand side), while the detail view shows a portion of the full text representation of the document on the right hand side (Suh et al., 2002). In this thesis, overview+detail display methods are used, but the overviews are provided by tables of contents and textual summaries, as they are assumed to be more informative than thumbnails, which have the several disadvantages discussed above.
3.3.2 Data compression
In this section, XML document compression is described. When one searches the literature for ‘XML summarisation’, several papers with ‘XML summarisation’ in their title appear among the results. However, the authors of these papers often interpret the word summarisation in the sense of
compression because they apply ‘XML summarisation’ with respect to the data centric view of XML. To distinguish between the two interpretations of summarisation, i.e. those of the document centric and data centric views, in this thesis I refer to summarisation when the document centric view is used and to compression when the data centric view is applied. Summarisation is described further in the next section (Section 3.4) as it is the central topic of this thesis. To give an example of compression, “semi automatic summarisation of XML collections” by Fischer and Campista (2005) describes database-like input documents (such as documents containing bibliographic data) for their summariser. They use templates to keep meaningful information, e.g. the author, and remove other, unnecessary XML elements such as the ISBN, thus ‘summarising’ the XML document. With the templates they can also aggregate information, e.g. adding numbers from elements with the same name and merging them into one, which clearly shows their focus on the data centric view. The SqueezeX compressor (Cannataro et al., 2002) considers XML documents as data storage units. An XML summary based on query statistics is created by Comai et al. (2003). In (Dalamagas et al., 2004), the tree representation of XML documents is used to generate tree structural summaries; these are summaries that focus on the structural properties of trees and do not correspond to summaries in the conventional sense of the term summarisation as used in IR research. Operations such as nesting and repetition reduction in the XML trees are used. More recently, Adiego et al. (2007) compressed structured documents by compressing the text that lies inside each different XML element type. Another compression example is that of Lin et al. (2006).
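The template-based idea behind this data centric ‘summarisation’ can be sketched as follows. The element names and the keep-list representation of the template are assumptions for illustration only; Fischer and Campista’s actual template language is richer (it also supports aggregation of repeated elements):

```python
import xml.etree.ElementTree as ET

# Hypothetical template: the set of element names to keep.
KEEP = {"title", "author"}

def prune(elem):
    """Recursively drop child elements whose tag is not in the template."""
    for child in list(elem):
        if child.tag not in KEEP:
            elem.remove(child)
        else:
            prune(child)

doc = ET.fromstring(
    "<book><title>XML in IR</title><author>A. Smith</author>"
    "<isbn>0-000-00000-0</isbn><price>25</price></book>"
)
prune(doc)
print(ET.tostring(doc, encoding="unicode"))
# keeps title and author, drops isbn and price
```

The result is a shorter XML record, not a textual summary in the IR sense, which is exactly the distinction drawn above.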
The focus of this thesis, however, is not on compressing data or storing structured documents efficiently, but on providing users of XML retrieval systems with overviews of the contents of XML documents, considering the document centric view of XML.
3.4 Summarisation of XML documents
In this section, the third group of “compressing document content” (Sengupta et al., 2004) for XML documents, XML summarisation, is discussed. Within XML summarisation, several groups can further be identified depending on the use of the XML format. For example,
• text summaries can be created for documents in XML format,
• text can also be summarised with the help of the structure offered by XML,
• the textual contents of elements, i.e. text within document portions defined by the XML markup, can be summarised,
• the logical structure itself, encoded by the XML markup, can also be summarised.
These are all overviews of XML documents, and the work of this thesis focuses on the latter two items. Approaches to XML summarisation in related work are discussed in this section. Text summarisation has already been introduced in Section 2.4 in the context of flat (i.e. not hierarchically structured) documents. Summarisation when the logical structure is present, mostly in the form of XML markup, is a relatively new research area. As XML is not widely researched in the field of summarisation, investigations should start with basic summarisation methods to understand what XML can bring to summarisation, and also whether summarisation is useful in accessing XML documents, e.g. in XML retrieval. For the above reasons, the most recent summarisation methods that do not involve structure are not considered to be within the scope of this thesis. For such summarisation methods, see, for example, the Document Understanding Conferences (DUC)3 and (Harman, 2007).
3.4.1 Summarisation of documents in XML format
Recently, several researchers have worked on summarisation where structure, either in XML or otherwise, has been involved. The involvement is often very marginal: for example, the documents, or the corresponding information needs to which the summary must relate, are formatted in XML, but information regarding the logical structure defined by the XML is often not incorporated into the summarisation methods. For example, although the documents used at the Document Understanding Conferences (DUC) are formatted in XML (Over et al., 2007), the forum itself focuses on document summarisation and is not concerned with summaries of document portions or the logical structure of the document. In the context of DUC, Litkowski (Litkowski, 2005, 2006) uses a knowledge management system that automatically parses documents and creates XML markup for NLP applications. The heavily tagged XML representations, with XML used to tag each element of each sentence with information such as the key elements of the sentence and its syntactic and semantic attributes, are then used to summarise the document. The tags, however, do not reflect the logical structure of the
3 http://www-nlpir.nist.gov/projects/duc/pubs.html
document, which is used in XML retrieval to access document portions. Documents in XML format have been summarised by Ling et al. (2007), where semi-structured summaries are generated. A semi-structured summary “consists of sentences covering specific aspects of a gene”. As these summaries are a series of sentences summarising the whole document (gene), the method is not ideal for the purposes of this thesis. Another XML-based summarisation approach is the work by Lam et al. (2002), who summarise whole e-mails (in XML) using various typical fields of e-mails. For documents in XML, Wolf et al. (2004) use the absence or presence of various “structural elements”, e.g. FAQs and Product Information Sections, to determine which document portions need to be extracted. The next section discusses summarisation when XML is used to mark the logical structure of documents.
3.4.2 Focused summarisation
In the research discussed in the previous section, the XML structure is used as a supplement to various summarisation techniques. Also, it was always full documents whose text was summarised. However, as XML allows the identification of document portions, i.e. logical units called elements, summaries of individual elements should also be considered because elements provide focus on certain parts of the document. The current definition of focused summarisation in the literature does not tend to cover document portion summarisation. Over et al. (2007) distinguish focused and generic summaries, where focused summaries are identified as those “that respond to some specific purpose or information need”. They list tasks used in DUC that involve focused summarisation: Viewpoint, Question/topic, Event, ‘Who is’ question, Complex question. Instead of focusing on a summarisation task, it is also possible to focus on specific summarisation sources, i.e. document portions (elements). “Focus can be reflected in all three classes of factors involved in summarisation” (Over et al., 2007), i.e. the input source, the intended purpose and the form of the output summary. In the case of XML documents, the input source is an element instead of a whole document. A basic idea of the above introduced sense of focused summarisation can be found in the work by Alam et al. (2003a), who combine textual summaries with the structure of the document. A textual summary of a document is created using lexical chains, then it is combined with the overall structure of the document with the aim of preserving the structure of the original document and of superimposing the summary on that structure. They introduce a classification
of XML summarisation which shows how structure and textual content may be combined. The four combination schemes are as follows.
1. Flat summary. This is created when only the title of the document is combined with the textual summary of the document. It is essentially the same as traditional text summarisation, where the title of the summarised document is usually displayed together with the textual summary itself. This type of summarisation does not require any information about the logical structure of the contents of the whole document (apart from being able to extract the title). However, the structure could be used, for example, by giving higher weights to sentences of important sections.
2. Distributed flat summary. In such summaries, “each section is given its fair share of representation”, i.e. for each section of the document a flat summary is produced, and the summary of the whole document is obtained by placing these summaries after one another. A similar approach is used by Salton and Buckley (1991), although not with documents in XML, to generate summaries. In their work, automatic hypertext link generation was used to create the structure of documents first. Edmundson (1969) also used a closely related idea as early as the 1960s.
3. Structured summary. “This is presented by combining the textual summary with overall structure of the document. This preserves the structure of the original document and superimposes the summary on that structure.” In the example by Alam et al. (2003a), the structured summary is essentially the same as the distributed flat summary with the titles of sections added. Their summariser can be applied to various section levels as sections may be nested.
4. Smart summary. “The summariser automatically recommends the best possible type of summary [...] by analysing the document structure”.
The list of schemes above lacks one important combination of content and structure.
Based on the sense of focused summarisation introduced above, this missing summary type could be called a Focused structured summary: sections, e.g. those containing more meaningful information, or those relevant to a user’s query, would be summarised in more detail, while other sections might not have summaries at all. Although Alam et al. (2003a) claim that a distributed flat summary is created to
avoid the bias of several sections’ being “heavily represented” in the summary because it harms readability, it is thought that a focused summary in the above sense might still be necessary. The work described in (Amini et al., 2005) and (Amini et al., 2007) uses structure-based features as well as other features for text summarisation. Their method offers element focused summarisation, i.e. the introduced method can summarise the text of any arbitrary XML element, e.g. section, subsection, article, etc. They use the following list of structure-based features, the choice of which is not thoroughly explained although an explanation might be beneficial to other researchers:
• the depth of the element in which the sentence is contained (e.g. section, subsection, subsubsection, etc.),
• the sibling number of the element in which the sentence is contained (e.g. 1st, middle, last),
• the number of sibling elements of the element in which the sentence is contained,
• the position in the element of the paragraph in which the sentence is contained (e.g. first, or not).
To sum up this subsection, the logical structure offered by XML can be used to create focused summaries, where focus means concentrating on document portions rather than full documents and summarising their textual contents.
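The first three of these structure-based features can be read directly off the XML tree. The sketch below is an illustration using assumed tag names, not the implementation of Amini et al.:

```python
import xml.etree.ElementTree as ET

def structure_features(root, target):
    """Depth, sibling position and sibling count of `target` within `root`."""
    # ElementTree has no parent pointers, so build a child-to-parent map first.
    parent_of = {c: p for p in root.iter() for c in p}
    # Depth: number of ancestors between the target element and the root.
    depth, node = 0, target
    while node in parent_of:
        node = parent_of[node]
        depth += 1
    parent = parent_of.get(target)
    siblings = list(parent) if parent is not None else [target]
    return {
        "depth": depth,
        "sibling_position": siblings.index(target) + 1,  # 1st, middle, last, ...
        "sibling_count": len(siblings),
    }

doc = ET.fromstring(
    "<article><sec><title>A</title></sec>"
    "<sec><subsec/><subsec/></sec></article>"
)
target = doc[1][1]                      # second subsec of the second sec
print(structure_features(doc, target))
# {'depth': 2, 'sibling_position': 2, 'sibling_count': 2}
```

The fourth feature (position of the containing paragraph within its element) follows the same sibling-position pattern, applied one level up from the sentence.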
3.4.3 Summarisation of the logical structure
In addition to textual summaries of XML documents and elements, summarisation can also focus on the logical structure itself (as another type of focus), thus giving an overview of the structure of the document, possibly in the form of a hierarchical table of contents (ToC). A user might want to gain an overview of the document via the titles of sections (subsections, subsubsections, etc.) only, or start browsing through the document’s contents via this kind of overview (i.e. the ToC). Such applications already exist, even commercially; examples include Adobe Reader’s4 table of contents function and the TeXnicCenter5 software this thesis is written with. Although these applications are able to display ToCs of various depths and details of particular sections, a detailed view of certain structural items, e.g. chapters, has to be initiated by the user as it is not offered by the applications. In other words, the ToC has to be unfolded by the user, and no emphasis is given automatically to sections, subsections, etc. of possibly higher interest. Similarly, a high number of web sites offer menu-like structural summaries of the sites, and although the menu items lead to different web pages, the whole site can easily be viewed as one large document with structure (see (Nation, 1998) for example). Structure summarisation, i.e. table of contents generation, has been used in the context of the World Wide Web and small screen devices (Rahman and Alam, 2002; Jones et al., 1999; Bickmore et al., 1999). However, as the logical structure of the document given by the HTML markup is not well defined, the ToCs are “crudely generated, based on visual clues and are therefore may become unintelligible and sometimes misleading” (Alam and Rahman, 2003). As the XML markup in XML documents mostly captures the logical structure, which is hierarchical (which is what is needed for ToCs), it is believed that ToCs are more useful if generated for XML documents and displayed on larger display units. Structure summarisation has been combined with text summarisation in (Alam et al., 2003b). They create a shallow structural overview which is associated with short summaries. The structural items are then expanded in the same window (as the display method is designed for small screen devices) when selected. In a way this is similar to this thesis’ summarisation, where the searcher first sees an overview of the document’s logical structure, then a summary of, e.g., a section is displayed when the mouse pointer is over the section title, and finally, the content of the section is displayed when the title text, appearing in the ToC, is clicked on.
4 http://www.adobe.com/products/acrobat/readstep2.html
5 http://www.texniccenter.org/
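Because the XML markup makes the section hierarchy explicit, a basic ToC can be generated by a simple recursive walk over the section elements. The tag names below (`sec`, `ss1`, `st`) are assumptions for illustration; real collections use their own DTD-specific names:

```python
import xml.etree.ElementTree as ET

# Assumed tag names for sections and section titles.
SECTION_TAGS = {"sec", "ss1", "ss2"}

def toc(elem, depth=0, lines=None):
    """Recursively collect an indented table of contents from section titles."""
    if lines is None:
        lines = []
    for child in elem:
        if child.tag in SECTION_TAGS:
            title = child.findtext("st", default="(untitled)")
            lines.append("  " * depth + title)
            toc(child, depth + 1, lines)
    return lines

doc = ET.fromstring(
    "<article><sec><st>Introduction</st></sec>"
    "<sec><st>Methods</st><ss1><st>Summarisation</st></ss1></sec></article>"
)
print("\n".join(toc(doc)))
# Introduction
# Methods
#   Summarisation
```

Unlike the HTML case criticised above, no visual clues are needed here: the nesting of the markup itself determines the ToC hierarchy.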
To stress the importance of focused summaries and structure summaries, the following section discusses the need for, and use of, ToCs and text summarisation in the context of real users, i.e. in interactive XML retrieval.
3.5 Users, summarisation and XML documents
This section presents related work done in the context of XML summarisation and interactive IR. Especially in the last decade, “there has been a growing realisation in the IR community that the interaction of searchers with information is an indispensable component of the IR process” (Tombros et al., 2005b). Interactive, i.e. user-based, IR has been extensively investigated in the Interactive track of the Text Retrieval Conferences (TREC) (Over, 1996, 1997, 1998; Hersh and Over, 1999, 2000b,a; Hersh, 2000); however, these efforts have been in the context of unstructured documents (e.g. news articles) or loosely-defined structure such as web pages. Interactive IR has received attention from the XML research community at INEX (Lalmas and Tombros, 2007b), where the Interactive track has been running since 2004, aiming at the investigation of structured documents in the context of human searchers. Prior to the INEX Interactive track (which will also be referred to as the Interactive Track or iTrack in this thesis), relatively little research had been done into the relationship between searchers and structured information. For example, before INEX, Finesilver and Reid (2003) and Lalmas and Reid (2003) studied user interaction with a test collection of Shakespeare’s plays formatted in XML; however, the collection was considerably smaller than those used in the iTrack.
3.5.1 The INEX Interactive track
This subsection presents the INEX Interactive track and the main characteristics of the systems used there. The Interactive track’s official systems and INEX participants’ interactive systems offer an opportunity to learn about searchers, XML documents and various ways of content and structure summarisation. These systems, and the corresponding analyses of user interaction, are considered useful for extracting information about the requirements of, and feedback on, possible summarisation approaches. The work presented in this thesis has been done in parallel with the INEX Interactive track, and information and findings were often exchanged, allowing me to develop interactive systems and research hypotheses for the research presented in this thesis. The Interactive track is built upon two aims: “First, to investigate the behaviour of users when interacting with components of XML documents, secondly to investigate and develop approaches for XML retrieval which are effective in user-based environments” (Tombros et al., 2005a). These two aims made the iTrack ideal for learning from, and for setting up, the research questions investigated in this thesis. The experimental methodologies followed by iTrack organisers and participants are based on the principles discussed in Section 2.3.3, which are also the basis of the user-based experiments described in this thesis. Design principles that are also applied to studies introduced later in this thesis (mainly in Chapter 4) include, among others, the following:
• The iTrack uses simulated work task situations and different types of tasks, e.g. background vs. comparison tasks (Tombros et al., 2005a); general vs. challenging (Larsen et al., 2006a); decision making, fact finding, information gathering, as well as hierarchical vs. parallel (Malik et al., 2006b).
• Two system versions, e.g. using different display methods, are compared.
• Searchers fill in questionnaires before and after the experiment, as well as before and after each task they are asked to complete.
• Interviews are conducted, and search and browsing logs are saved and analysed.
• Query term highlighting is used when the elements’ contents are displayed.
The work introduced in this thesis and the iTrack also share the document collections that are used to study user interaction. Two out of the three XML collections used in the iTrack are also used in this thesis, which allows the comparison of results and the gathering of research ideas from the iTrack and related systems. For further details about the collections used, see Sections 4.3.1 and 5.2.2. This thesis, however, does not require users to assess elements, as it is believed that this additional task would alter the natural behaviour of searchers. Assessing every accessed element “may even be experienced as obtrusive by the test persons” (Larsen et al., 2005). It was also found that the two-dimensional assessment scale (Figure 3.6) used in the iTrack 2004 was too complex for users to comprehend (Pharo and Nordlie, 2005; Pehcevski et al., 2005). The effort required to assess elements is itself a likely cause of change in natural search behaviour. Moreover, this thesis does not aim at investigating relevant elements in response to a query, but at examining text and structure summarisation, mostly in a user-based environment; thus, relevance assessments do not play a vital part in this research. The 2005-2007 iTrack official systems used non-web-based interfaces (e.g. Figure 3.7 shows the INEX 2005 iTrack interface using the Daffodil system (Fuhr et al., 2002c)).
In this thesis, web-based interfaces are used, as it is assumed that users are more comfortable with such interfaces, i.e. interfaces somewhat similar to those used daily for searching for information on the Web. The following subsection discusses findings from the Interactive track that are relevant to this thesis.
Figure 3.6: The system used at the INEX 2004 Interactive Track.
Figure 3.7: The system used at the INEX 2005 Interactive Track.
3.5.2 Relevant iTrack findings
In 2004, one of the main findings of the iTrack was that summarisation of the text of search result elements is needed when presenting the search results to searchers (Tombros et al., 2005b). The result presentation did not include any summaries of the whole document, the relevant element or the structure; only the title and author of the documents were displayed, in addition to the XPath and RSV (retrieval status value) of retrieved elements. Another finding from 2004 is that “document structure provides context”. The table of contents displayed after searchers selected a result element, i.e. in the document view, was appreciated by searchers and “seemed to provide sufficient context to searchers in order to decide on the usefulness of the document”. The iTrack systems highlighted presumably relevant elements in 2005 and 2006/7. This shows the intuitive need for focusing on relevant elements more than on other elements when displaying tables of contents (ToCs). However, these ToCs were static, i.e. they always contained all sections and subsections from the document, and only those. Hence, for example, if a single paragraph was found relevant it would not have been highlighted, as it was not possible to show individual paragraphs in the ToC display. Regarding the problem of result structure display, i.e. result elements in context, it was suggested that the hierarchical grouping of elements from a single document in the result list is superior to an elements-only result list presentation. Although this thesis does not focus on the actual display of results from multiple documents of the collection, these findings also show the usefulness of tables of contents (ToCs) for individual documents. In addition to this, “many users found the ToC of the whole article useful because it provided easy browsing, navigation, less scrolling and a quick overview of which elements may be relevant and which may be not.” (Malik et al., 2007).
As Kamps and Larsen (2006) found, in 54% of the topics used in INEX, “a presentation in context is preferred”. Betsi et al. (2006) conducted interviews with searchers and showed that “users expect the retrieved components to be accompanied by the document that contain them”, for which a table of contents can be a natural solution. Kim and Son (2004) also confirm the usefulness of ToCs, finding that the ToC is the most frequently used interface feature and also the most liked one (according to questionnaires). The xmlfind system described in (Sigurbjörnsson, 2006, Chapter 8) and (Kamps and
Figure 3.8: The xmlfind system (Kamps and Sigurbjörnsson, 2005).

Sigurbjörnsson, 2005) displays relevant elements and their ancestors grouped by documents as search results (Figure 3.8). It also adds short text summaries in the result list between elements and uses a heat map to show the most relevant elements. However, the use of XML-specific element notation in the hierarchical display, e.g. sec[1], p[5], has later received criticism from XML retrieval users who do not know the documents’ logical structure (a similar display method is used in Chapter 4, where the problem with this display technique is identified). The system displays the structure of results in such a way that, when an element is retrieved, all its ancestors are also displayed to place the element in context. The latter method is used in Chapter 5. The result presentation Sigurbjörnsson (2006) used also grouped results by documents, which can be regarded as a very selective, automatically generated ToC or structure summary. The list could be used as an overview of the document where the ToC is restricted to relevant elements only. However, when accessing the elements of a document, other, non-retrieved, elements can also be useful in a ToC to provide context to the relevant ones. The work presented in Chapter 5 aims at generating structural overviews that include relevant elements as well as other elements identified by combinations of the various information sources mentioned above. Hammer-Aebi et al. (2006) found, using the Lonely Planet collection, that most of the relevant information was found at levels 2-4 (i.e. depth levels 2-4 of the hierarchical logical structure), corresponding to sections through sub-subsections. This is in accordance with the findings in (Sigurbjörnsson, 2006, Chapter 8), where depth levels 2 and 3 were found most useful, together with the element types section and subsection, using another two, the IEEE and Wikipedia,
XML document collections.

Figure 3.9: The element retrieval system used at the INEX 2006 Interactive Track.

This information can be used to define an ideal depth of a ToC for documents, as well as to define which types of elements should be included in a ToC. How these two, and several more, information sources can be combined to build an ideal ToC is one of the research questions investigated in this thesis. In the context of iTrack 2006-2007, which aimed at comparing element retrieval with passage retrieval (Malik et al., 2006b), the table of contents feature was investigated by Kazai and Trotman (2007) (Figure 3.9 shows a screenshot of the element retrieval interface from iTrack 2006). They report that a ToC needs to be length-dependent: it might not be needed for a short document (which the users who filled in questionnaires defined differently) but it is needed for long documents (which is also not clearly defined, as a sufficiently long document can be just longer than one page or it can be a whole book). Document and element length is also investigated in this thesis in the context of structure summarisation, in other words, automatic ToC generation. In their work, Kazai and Trotman (2007) criticise ToCs as being “useful for outlining the content but not so useful for navigation”, which seems to contradict previous iTrack findings. However, they also claim that a ToC “provides an overview of the structure and improves navigation”, which indicates that although the navigation capability might be limited, it is still considerable.
In 2005, it was concluded that “searchers found document elements more useful than whole documents for tackling their information seeking tasks” (Malik et al., 2007), which shows a demand for XML retrieval and focused access. Also in the context of iTrack 2006-2007, Fachry et al. (2007) found that “element retrieval is considered more effective than the passage retrieval system”. As both systems displayed ToCs, where the passage retrieval ToC was as shallow as two depth levels (full article and a list of passages) whereas the element retrieval system’s ToC was deeper, we can view this finding as an indication that a reasonably deep hierarchical overview is useful for accessing the contents of documents. Larsen et al. (2006b) found that “searchers predominantly selected metadata as their entry point for accessing the retrieved documents”, which means they often clicked on the displayed title to access relevant content. This can also be interpreted as a need for title-driven element access, which is exactly what ToCs provide. The importance of ToCs and summaries of elements’ textual contents is emphasised in the next section, where related work presented in this chapter is discussed. Having described various overview methods and strategies as well as related work on summarisation and users in the context of XML documents, the next section discusses the above findings focusing on the theme of this thesis, i.e. content and structure summarisation for accessing XML documents.
3.6
Motivation
There are various ways to gain overviews of documents, collections of documents or any data set. Several graphical interfaces have been proposed over time; nevertheless, few of them have made a lasting impact. Various reasons may be responsible for this, chief among them that they were not simple enough to learn and use with ease. For hierarchical data, particularly if the hierarchy represents the logical structure of textual documents, an intuitive and natural way to gain an overview of the document is through a table of contents, which has been used for centuries, and a summary of the textual contents, which also has a long history if abstracts of printed scientific papers are considered. As the amount of information available on the Web and elsewhere, e.g. in digital libraries, is large and constantly increasing, and as searchers of retrieval systems expect high-quality results and quick access to relevant content even within documents, it is important to control the focus
of attention of searchers of IR systems. This focus can be achieved by giving focused overviews, summaries of documents and document portions. The following two subsections discuss related work presented above with respect to the two summarisation types investigated in this thesis, i.e. text summarisation of XML elements and summarisation of the structure of XML documents.
3.6.1
Text summarisation for interactive XML retrieval
Summarising the text of a document is an effective way to indicate the contents of the document and document portions, or to inform about the document’s main ideas. Automatic text summarisation has been extensively researched since the 1950s. More recently, several researchers, e.g. (Ling et al., 2007; Alam et al., 2003a), have found that the XML document’s structure can be useful in the process of creating textual summaries for whole documents. Amini et al. (2007) also expressed the need for creating summaries for individual document portions. One of the main aims of this thesis is to investigate the usefulness of such textual summaries, i.e. when text summarisation focuses on document portions, in a user-based environment. The logical structure of a document is often deeply hierarchical, with more than two levels of structural elements. This introduces overlapping textual contents, where e.g. the text of a paragraph is also part of the text of the section the paragraph is in. Handling overlap in IR is a difficult task, as has been shown in XML element retrieval research (Clarke, 2005; Kazai et al., 2004). For example, if a retrieval system finds that a section element and several of its child subsection elements (but not all of them!) are relevant, the result elements are said to be overlapping. Handling overlap is difficult because it usually involves a decision whether to return, e.g., the section or the subsection, as returning both might introduce unwanted redundancy of information. However, in interactive IR, with the help of textual summaries of (overlapping) elements, it might not be necessary to make an either-or decision. That is, summaries of all overlapping elements might be shown to searchers (provided that the retrieval interface is well designed) and searchers may decide whether the section or the paragraph is more relevant to them based on the content of the textual summaries.
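To illustrate what “overlapping” means here: two elements overlap exactly when one is an ancestor of the other, which can be checked directly on their paths within the document. A minimal sketch (the path syntax and helper name are illustrative, not from the thesis):

```python
def overlaps(path_a: str, path_b: str) -> bool:
    """Two XML elements overlap if one is an ancestor of the other,
    i.e. one element's path is a prefix of the other's."""
    a = path_a.rstrip("/") + "/"
    b = path_b.rstrip("/") + "/"
    return a.startswith(b) or b.startswith(a)

# A section and one of its subsections overlap; two sibling subsections do not.
print(overlaps("/article/sec[1]", "/article/sec[1]/ss1[2]"))      # True
print(overlaps("/article/sec[1]/ss1[1]", "/article/sec[1]/ss1[2]"))  # False
```

The trailing slash appended before the prefix test prevents false matches between siblings such as sec[1] and sec[10].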
In the above example, one textual summary can focus on the section, the other on the paragraph. However, this thesis does not aim at addressing the problem of overlapping elements but at investigating the usefulness of summarisation in helping searchers get to the relevant content more effectively. The focus is on the use of summaries, not on the choice between elements. Textual summaries investigated in Chapter 4 will also be
associated with the overview of the document’s logical structure, i.e. table of contents (ToC).
3.6.2
Structure summarisation for XML documents
In the INEX Interactive track it was found that the ToC is useful when searchers want to find relevant information within a document, or simply to gain an overview of the (logical structure of the) document. However, it has also been found that the use of the ToC may depend on the length of the documents and, in addition, that an element might not be included in a ToC if it is not relevant, not at the right level of granularity (e.g. it is nested too deeply in the logical structure) or not of a certain type (e.g. single-word ‘link’ type elements are rarely worth displaying in the logical structure overview). These findings should be considered when creating tables of contents. Also, ToCs need to be created automatically, and for arbitrary documents with reasonably similar structure. Considering how authors structure textual documents such as books, essays, etc., it might be reasonable to assume that documents containing textual information are structured similarly, hence similar methods can be used to generate tables of contents for these documents. Indeed, evidence for such similarity is that the three XML document collections used at the INEX Interactive track, i.e. the IEEE, Wikipedia and Lonely Planet collections, have similar structure. Although other XML collections, e.g. those used in the INEX Heterogeneous track (Frommholz and Larson, 2007), might have different structure, simple content-and-logical-structure type documents are expected to differ substantially only at deeper levels of the hierarchy, keeping the upper structural levels (e.g. section, subsection, paragraph level) fairly similar. The studies conducted in the context of the iTrack used ToCs where it was assumed that, e.g., only sections and subsections need to be selected for inclusion in the ToCs (strict type constraints).
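The three inclusion criteria just listed (relevance, granularity and element type) can be made concrete with a simple filter over candidate elements. This is an illustrative sketch only; the type set and depth threshold are assumptions, not values from the iTrack studies:

```python
# Hypothetical ToC inclusion filter combining the three criteria above:
# element type, depth of nesting, and (for query-dependent ToCs) relevance.
INCLUDED_TYPES = {"article", "sec", "ss1", "ss2", "p"}  # assumed type constraint
MAX_DEPTH = 4                                           # assumed granularity limit

def include_in_toc(elem_type: str, depth: int, relevant: bool) -> bool:
    """An element enters the ToC only if its type is allowed, it is not
    nested too deeply, and it has been judged relevant."""
    return elem_type in INCLUDED_TYPES and depth <= MAX_DEPTH and relevant

print(include_in_toc("sec", 2, True))  # a shallow, relevant section is included
print(include_in_toc("it", 5, True))   # wrong type and too deep: excluded
```

A strict conjunction like this mirrors the “strict type constraints” of the iTrack studies; later chapters of the thesis instead combine such information sources more flexibly.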
However, sometimes a certain section is not of interest (an appendix section, for instance), and thus not needed in a ToC; this needs to be considered in ToC generation as well. In addition, various document collections might mark up a section differently, e.g. section, sec, s, etc., or might use completely different names for elements of the logical structure, even in another language (e.g. fejezet, in Hungarian). In this case, information such as the length of the element or its depth in the logical structure should be used to determine whether the element is worth including in the table of contents. Indeed, research related to the iTrack also found that a ToC should depend on the length of the document and its elements, and that relevant elements are usually between depth levels two and four (Hammer-Aebi et al., 2006). To incorporate both into ToC generation, searchers are asked how important they think length and
depth are for the creation of ToCs in Chapter 5. In addition to the focused text summarisation discussed in Section 3.6.1, this thesis looks into automatic ToC generation both when it is assumed that a searcher is looking for relevant information in relation to a query (Chapter 5) and when they simply want to browse within the document (Chapter 6). ToC generation presented in this thesis is also regarded as focused summarisation for the following reasons.

• It is summarisation, because the logical structure of the document is summarised, i.e. the most important elements are selected and used to construct ToCs.

• It is focused, because a ToC can focus a searcher’s attention on sections, paragraphs, etc. that are related to their information need, or simply on generally important document portions, e.g. by not selecting too small or overly detailed elements, or uninformative sections.

The following three chapters present the work of this thesis, which is on focused summarisation, both when the focus is on the text of document portions and when it is on the logical structure of documents.
3.7
Conclusion
In this chapter, the work directly related to this thesis has been discussed. Previews and overviews of hierarchical and non-hierarchical data were described in Section 3.2. Sections 3.3 and 3.4 discussed the literature relevant to XML and summarisation. Two overview methods, i.e. thumbnailing and XML data summarisation, were described, as well as the reasons why they are not ideal for XML element retrieval and access. Summarisation of textual data and of the logical structure of XML documents was discussed in Section 3.4. Text and structure summarisation were also presented in the context of users (Section 3.5). The presented related work was discussed, and the main research directions of this thesis identified, in Section 3.6. The following chapters present the work and contributions of this thesis. Chapter 4 presents a user study which investigates the use of text summarisation in XML retrieval. Chapter 5 investigates structure summarisation through a second user study, in which searchers’ preferences for several element features, including the length and depth of elements, are examined in order to generate tables of contents. Chapter 6 presents query-independent structure summarisation, which is used to create ToCs when query information is not available. Such query-independent ToCs
can be useful for users who are simply browsing within a document, but the findings can also be used later to create query-dependent structure summaries, which may then facilitate searchers’ information finding process. The main contributions and future work are discussed in Chapter 7.
Chapter 4

Text summarisation for XML retrieval
In this chapter, the use of text summarisation in interactive XML retrieval is investigated. The chapter is based on papers published at the ACM Symposium on Applied Computing - Special track on Information Access and Retrieval Systems (SAC-IARS, 2006) (Szlávik et al., 2006b), and at the 10th European Conference on Research and Advanced Technology for Digital Libraries (Szlávik et al., 2006a).
4.1
Introduction
We saw in previous chapters that XML documents consist of XML elements, the number of which is typically much larger than the number of retrievable units in non-XML documents. For example, if a document contains ten elements (an average XML document contains many more), then the number of retrieval units (to which searchers’ focus can be directed within documents) is already ten times as large as if we considered flat documents only. In addition, elements from the same document are typically closely related (otherwise the author of the document would not have put them into the same document); often they are parts of other elements that are higher in the document structure, e.g. a paragraph is part of a section. Users of XML IR systems hence need help to understand the document and its main parts, as well as to find the relevant content within documents more easily. To make the retrieval process easier for searchers, it is essential to provide them with overviews of the contents of the retrieved elements (possibly in addition to overviews of documents). To gain an overview of the contents, summarisation of the text can be used. The use of textual summaries has been shown to be useful in interactive information retrieval (IIR) where flat documents
have been considered (Goldstein et al., 1999; Tombros and Sanderson, 1998). Summaries, usually in the form of ‘snippets’ (Turpin et al., 2007), are widely used in web search systems, where they indicate the contents of source documents. Summarisation is also now being investigated by researchers in the XML community (Comai et al., 2004; Sengupta et al., 2004; Campista, 2005; Tombros et al., 2005a); see Section 3.3 for details. For flat documents, textual summaries are usually used in the IR process when the result list is shown to searchers. The elements in these lists are references to documents that are considered relevant to the searchers’ queries. In addition to these references, summaries of the textual contents of result documents are also often displayed in the result list view, as well as other typical information about documents such as their titles or access paths. In XML retrieval, document portions are retrieved, often more than one per document. There are various methods for displaying result lists in XML retrieval, e.g. displaying result elements independently as if they were separate documents, or organising the elements returned for a document according to the logical structure (see Section 3.5 for details). When overlapping elements are retrieved, which often happens in XML retrieval, and grouped by documents (for example, see Figure 3.8 in Section 3.5), it is not straightforward how textual summaries of these elements should be displayed. Also, for instance, if several elements are returned for a document, grouping these elements and displaying summaries for each of them might give the document unnecessarily high prominence within the result list.
To avoid the above-mentioned possible problems of overlapping summaries and of summaries of multiple result elements from a document, the work presented in this chapter does not focus on the result list and summaries in the result list, but on presenting the logical structure of the whole document (i.e. a table of contents, ToC) and displaying summaries for each element in the ToC after the searcher has chosen an element from the result list. In other words, the contents of the element are displayed when a searcher selects the element from the result list, but the element is also presented in context, where the context is provided by the structural overview of the document and textual summaries of other elements from the same document. Elements are also presented to searchers of the INEX Interactive track (Tombros et al., 2005a) in a similar context (i.e. ToCs are used), but the iTrack systems lack the presentation of textual summaries when the result element’s contents are displayed. It is believed that textual summaries of elements and the display of the logical structure of
the document can help searchers gain an overview of the document and the contents of elements within the document. Textual summaries associated with ToCs are also believed to facilitate finding relevant information within XML documents. The research questions this chapter aims to answer are the following.

• Can text summarisation be useful in the XML retrieval process when searchers are browsing within documents at all? In other words, do searchers appreciate the display of element summaries when the result element is displayed or when they are browsing within XML documents?

• If so, where (at which structural levels, for which element types) should text summaries be applied? In other words, are summaries useful at all structural levels or, for example, are they only needed for sections but not for paragraphs? A summary is often several sentences long, and a paragraph is also of approximately this length; should a textual summary still be displayed for the paragraph?

• How closely are the structural display (ToC) and the use of textual summaries related to each other for an XML retrieval system user? Should summaries be displayed for each element used for the ToC, or can there be items in the ToC for which no textual summary is needed?

• What are the requirements for a good XML retrieval system that displays summaries and the structure of the XML document? According to searchers, how can the ‘elements in context with summaries’ display method be improved, and what other indications are there for the design of interactive XML retrieval systems?

To answer the questions above, an interactive information retrieval system was developed and examined using human searchers. The system is described in Section 4.2 and the experimental design is discussed in Section 4.3. The results of the study are shown and discussed in Sections 4.4-4.5. The chapter is closed with conclusions in Section 4.7.
4.2
Experimental system
In this section, the system that I developed for and used in the text summarisation study is described: the user interface with XML-specific features, the summarisation method and the used
XML search engine are discussed.

Figure 4.1: The experimental system used in the text summarisation study.

Figure 4.1 shows the system modules as well as their connections to the XML document collection and to searchers. The searcher enters a query using the user interface, which passes the query to the summariser and to the retrieval engine. The retrieval system then retrieves elements from a collection of XML documents for the query, and the interface displays the result list. When a searcher selects a result element, the system displays the content of the element as well as a ToC for the corresponding document. For each item (i.e. chapter, section, etc.) in the ToC, a summary is generated based on the searcher’s query; the summary is displayed when searchers move the mouse pointer over a ToC item.
4.2.1
User interface
The user interface of the system is web-based; it passes the query to the retrieval module, processes and displays the retrieved result list, and shows the result elements. The interface allows searchers to enter a search query and start the retrieval process by clicking on the “Search” button (Figure 4.2). Searchers can also choose the number of items per page they would like to see in the result list. The result list display is similar to standard web search interfaces (Figure 4.3) to minimise the frustration that may be caused by learning how to use a new system. An effort was made to create a result display similar to the one used at the current INEX iTrack (Tombros et al., 2005b) (Figure 4.4). Although in this study the main focus is on another part of the interface (i.e., the document overview screen), an effort was made to create as informative interface screens as possible everywhere. For each result element that is part of the currently shown result list, the following were shown:
Figure 4.2: The initial page of the interface.

• Rank. The rank of the retrieved result element. The element at rank one has the highest retrieval status value.

• Retrieval status value. Calculated by the search engine. For details, see Section 4.2.3.

• Query-based summary. A summary of the text of the retrieved element. The algorithm is described in Section 4.2.2. (The emphasis in this study is not on this particular summary display but on the display of summaries for individual elements in the ToCs, described later in this section.)

• Title. Title of the document the result element is in. The document title was chosen because retrieval was based on paragraphs, and paragraphs did not have specific title elements.

• Link to the result element. (‘View result’ link.) A separate hyperlink was introduced as a means for searchers to get to the result element. The standard ‘web way’ of linking to the result would have been to associate this link with the title of the target (document). Since the chosen title in this study was the title of the whole document, this was needed to prevent searchers’ possible confusion that could have resulted from a (document)title-(element)link association.

• Document path. Path to the result’s document. Experienced searchers could extract, e.g., the year of the document from this. It was inspired by popular web search systems where the URLs of the results are shown.

• Element size. Displayed in kilobytes. This choice is also inspired by web search systems, as the intention was to keep the look of the interface similar to what searchers are used to.
Figure 4.3: Result list view of the used system.

• ‘Mini map’. It displays the path of the element within the XML document and gives an idea of how deeply the element is nested within the document structure.

The result page also includes the possibility of submitting another query, and shows the number of results, the retrieval time and the query terms. Query terms in the titles and summaries are highlighted using a yellow background. Once searchers follow the link to the element, the element is displayed in a new window (Figure 4.5). The frame on the right shows the content of the target element with query words highlighted. On the left, the structural view (ToC) of the whole document is displayed, where the position of the currently shown element is also highlighted. The structural display is based on the XML structure of the whole document, i.e., the root element is shown at the top level, while descendants are displayed at lower levels (indented, with bullets). Each structural item (also referred to as a table of contents item, ToC item) is also a hyperlink that shows the content of the corresponding XML element in the right window when clicked. As an XML document may contain element types that serve formatting purposes only (e.g., an element corresponding to italics), a selection of element types to be displayed in the structural view
was chosen.

Figure 4.4: Result list at iTrack 2004.

Table 4.1: List of elements that have been selected for the structural overviews of documents.

Element type | Title        | Number of siblings allowed
article      | //atl        | one
fm           | Front Matter | one
abs          | Abstract     | one
sec          | //st         | more
ss1          | //st         | more
ss2          | //st         | more
p            | Paragraph    | more
bm           | Back Matter  | one
bdy          | Body         | one

Based on the analysis of the document corpus and the INEX participants’ relevance assessments using this XML document collection (Lalmas and Piwowarski, 2005), nine element types were selected for structural display, including the article, abstract, section and paragraph types (Table 4.1). These correspond to the most frequent types that were assessed highly relevant for a number of topics. To obtain the labels displayed in the ToC, the following algorithm was used. For each of these elements, if there was a descendant element that contained the title of the current element, this title was displayed (e.g. the article’s title is in its child atl (article title) element); otherwise static text was used (e.g., “Front matter” for fm). If an element could have sibling elements of the same type, e.g. there can be several paragraphs in a section (see the third column in Table 4.1), the type and sequence number of these elements were used (e.g. “Paragraph 1”, “Paragraph 2”, etc.). For this user study, four levels of structural (ToC) items were displayed. The first level is
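The labelling algorithm just described can be sketched roughly as follows. This is an illustrative reconstruction using Python’s standard ElementTree, not the thesis implementation; the tag names follow Table 4.1, and the fallback-to-numbering behaviour when no title is found is an assumption:

```python
import xml.etree.ElementTree as ET

# Static labels and title-carrying child tags per element type (cf. Table 4.1).
STATIC_LABELS = {"fm": "Front Matter", "abs": "Abstract", "p": "Paragraph",
                 "bm": "Back Matter", "bdy": "Body"}
TITLE_TAGS = {"article": "atl", "sec": "st", "ss1": "st", "ss2": "st"}
NUMBERED = {"sec", "ss1", "ss2", "p"}  # types that may have same-type siblings

def toc_label(elem: ET.Element, position: int) -> str:
    """Label for one ToC item: a descendant title if one exists,
    otherwise a static label, numbered when siblings may repeat."""
    title_tag = TITLE_TAGS.get(elem.tag)
    if title_tag is not None:
        title = elem.find(".//" + title_tag)
        if title is not None and title.text:
            return title.text.strip()
    label = STATIC_LABELS.get(elem.tag, elem.tag)
    return f"{label} {position}" if elem.tag in NUMBERED else label

doc = ET.fromstring(
    "<article><fm><atl>Example article</atl></fm>"
    "<bdy><sec><st>Introduction</st><p>Some text.</p><p>More text.</p></sec></bdy>"
    "</article>")
sec = doc.find(".//sec")
print(toc_label(doc, 1))           # Example article
print(toc_label(sec, 1))           # Introduction
print(toc_label(sec.find("p"), 1))  # Paragraph 1
```

The caller is assumed to supply each element’s position among its same-type siblings, which the real system would compute while walking the document tree.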
the whole article level, the second usually corresponds to front matter, article body and back matter, and deeper levels contain sections, subsections and paragraphs. (Note that not all sections have subsections, hence the depth of paragraph elements is not always 5 but often 4.)

Figure 4.5: On the left, the structure of the XML document with a summary; on the right, a section element displayed.

The number of levels could be changed by searchers if they wished to do so. For each ToC item shown in the hierarchical structure (ToC) on the left, an automatic summary of the corresponding element’s text was generated. This is the textual summary display this chapter focuses on. The algorithm for creating summaries for ToC items was the same as the one used in the result list display and is described in the next section. Summaries were displayed as ‘tool tips’ when the mouse pointer was over a ToC item (Dumais et al., 2001). Summaries were generated when the ToC page was loaded, and they were displayed immediately when the mouse pointer was moved over a ToC item, i.e. there was no delay that might have been caused by processing the elements’ texts to obtain summaries. Query terms in summaries were also highlighted, which is essentially the same as web search engines’ bold-faced query terms in snippets. Query term highlighting is generally appreciated by searchers (Kazai and Trotman, 2007).
4.2.2
Summarisation
As this chapter investigates whether summarisation is useful in interactive XML retrieval, and there has not been extensive research on how to generate summaries for nested XML elements (the work described in (Amini et al., 2005) and (Amini et al., 2007) was preceded by the study presented in this chapter), a summarisation method that is relatively simple, though not oversimplified, was implemented. For summary generation, sentence extraction was chosen, which is a widely used approach in text summarisation (Rath et al., 1961; Edmundson and Wyllys, 1961; Edmundson, 1969; Kupiec et al., 1995; Teufel and Moens, 1997; Lin, 1999; Lin and Hovy, 2003; White, 2005). First, both the sentence terms and query terms are stemmed using the Porter stemming algorithm (Robertson et al., 1980), and stop words are removed from both term sets. The summarisation method for selecting extract-worthy sentences of an XML element is as follows. The score of each sentence in the given element is calculated according to Equation 4.1.
S_i = Σ_{j ∈ Q} occ(j, T_i) · occ(j, Q)        (4.1)
where S_i is the score of sentence number i, Q is the set of different query terms, T_i is the set of different terms in sentence i, and occ(j, A) is the number of times term j occurs in the term set denoted by A. If two sentences have the same score, the one occurring first in the element is given the higher rank among candidate summary sentences. If the source of the summary does not contain sentences that include query terms, which can happen as summaries are generated for all elements of XML documents and some of them might not focus on the topic of the query at all, the first four sentences of the element are shown as the summary. This approach is based on the location method (Edmundson, 1969), which assumes that the first sentences in a document, or paragraph, are more indicative of its content. A maximum of four sentences (White et al., 2003) with the highest ranks are presented as extracts of the source XML elements, in order of appearance in the source element. (For stop word removal, a list of 327 very commonly used, hence not informative, words (e.g., basically) and XML-related terms (e.g., “&”) was used.)
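A minimal sketch of this sentence selector is given below. Since Q and T_i are sets of distinct terms, Equation 4.1 reduces to counting the query terms present in each sentence; stemming, the real stop-word list and robust sentence splitting are omitted for brevity, so this is an approximation of the method, not the thesis implementation:

```python
import re

def summarise(element_text, query, max_sentences=4):
    """Query-biased extract per Equation 4.1: each sentence scores the
    number of distinct query terms it contains. Falls back to the first
    few sentences (location method) if no sentence matches the query."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", element_text.strip()) if s]
    q_terms = set(re.findall(r"\w+", query.lower()))

    def score(sentence):
        return len(q_terms & set(re.findall(r"\w+", sentence.lower())))

    scored = [(score(s), i, s) for i, s in enumerate(sentences)]
    if all(sc == 0 for sc, _, _ in scored):        # location-method fallback
        return sentences[:max_sentences]
    # Highest score first; an earlier position in the element breaks ties.
    top = sorted(scored, key=lambda t: (-t[0], t[1]))[:max_sentences]
    return [s for _, _, s in sorted(top, key=lambda t: t[1])]  # order of appearance

text = ("Cats sleep a lot. Dogs bark. "
        "Cats and dogs can be friends. The weather is nice.")
print(summarise(text, "cats dogs", max_sentences=2))
```

With the query "cats dogs", the third sentence scores 2 and the first two sentences tie at 1, so the earlier one joins the extract and both are emitted in their original order.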
4.2.3
XML retrieval engine
The retrieval was based on the HySpirit retrieval framework (Rölleke et al., 2001). HySpirit is capable of indexing and retrieving XML documents and elements based on a probabilistic framework that allows various retrieval strategies to be defined. Indexing was performed at the paragraph level, i.e. there was no indexing and retrieval of higher-level elements and no aggregation of retrieval scores. Paragraphs were considered the retrievable units in the study. Indexing only the paragraphs of XML documents might not be the best retrieval approach; nevertheless, it has been used in XML retrieval (Ashoori and Lalmas, 2006; Crouch et al., 2006). Paragraphs were also chosen to avoid the effect of overlapping elements (Kazai et al., 2004). Overlapping elements may confuse searchers because it would have been possible to return, for instance, a section and its paragraphs in the same result list. Another reason for indexing paragraphs was to focus searchers’ attention on elements and on the structural overview. As searchers were assumed to be used to document retrieval, they could easily have looked for whole documents in the result list and started finding answers to their queries the way they were used to. If paragraphs are returned and the system displays these paragraphs’ contents, it is believed that searchers will not ignore the structural overview (ToC) and the element summaries shown in the document browsing window. The retrieval was based on standard tf-idf (term frequency - inverse document frequency) weighting (van Rijsbergen, 1980), treating the paragraphs as independent (i.e., in the idf calculation).
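As a rough sketch of this paragraph-level scoring, the following ranks paragraphs by a standard tf-idf formulation, treating each paragraph as an independent document for the idf computation; the exact weighting HySpirit applied is not reproduced here:

```python
import math
import re
from collections import Counter

def tfidf_rank(paragraphs, query):
    """Rank paragraph indices by a simple tf-idf score, with each
    paragraph treated as an independent document for idf purposes."""
    docs = [Counter(re.findall(r"\w+", p.lower())) for p in paragraphs]
    n = len(docs)

    def idf(term):
        df = sum(1 for d in docs if term in d)  # paragraph-level document frequency
        return math.log(n / df) if df else 0.0

    q_terms = re.findall(r"\w+", query.lower())
    scores = [sum(d[t] * idf(t) for t in q_terms) for d in docs]
    return sorted(range(n), key=lambda i: -scores[i])

paras = ["XML retrieval returns elements.",
         "Summaries describe element contents.",
         "Element retrieval and passage retrieval were compared."]
print(tfidf_rank(paras, "element retrieval"))
```

The third paragraph ranks first because it contains both query terms, with one of them repeated.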
4.3
Experimental design
In this section, the document collection and experimental methodology used are described. The main steps of the experiment are shown in Figure 4.6 and are discussed in detail in the following sections. The experimental protocol used in the iTrack is commonly used in interactive experiments. The protocol, with its use of simulated work task situations, allows researchers to study user behaviour in a close to realistic, but still controlled, scenario. This also makes it ideal to use in the studies of this thesis. Using questionnaires at various stages provides useful feedback about users, their perception of tasks and system versions, and their thoughts about their current situation at the time of filling in a particular type of questionnaire. (Footnote 4: The document collection used comprised scientific articles and is further described in Subsection 4.3.1.)

Figure 4.6: The steps of the experiment from a searcher's point of view. The middle steps are performed twice for each searcher.

One weakness of the questionnaires used in the iTrack, which is also a weakness of the studies presented in this thesis, concerns the questionnaire filled in before the experiments began: research in cognitive psychology shows that when researchers ask for age, gender and experience before a user study, users unconsciously change their behaviour to conform more closely to the behaviour 'expected' of their group (Aamodt, 2008). In retrospect, these kinds of questions should have been avoided in the studies of this thesis as well; instead, they should have been asked after a user had finished the study.
4.3.1
Document collection
The document collection used is the IEEE collection, which contains 12,107 articles, marked up in XML, from 12 magazines and 6 transactions of the IEEE Computer Society's publications, covering the period 1995-2002. On average, an article contains 1532 XML nodes and the average depth of a node is 6.9. In 2005, the collection was extended with further publications from the IEEE Computer Society; however, these additional documents are not included in the document collection used in this chapter. Figure 4.7 shows a sample from a typical XML document in the IEEE collection. The overall structure of a typical article consists of a front matter, a body, and a back matter. The front matter contains the article's metadata, such as title, author, publication information, and
abstract. Following it is the article's body, which contains the actual content of the article. The body is structured into sections, subsections and sub-subsections. These logical units start with a title, followed by a number of paragraphs. In addition, the content has markup for references (citations, tables, figures), item lists, and layout (such as emphasised and bold-faced text). The back matter contains a bibliography and further information about the article's authors.

Figure 4.7: Sample from an example document from the IEEE collection. [Excerpt: the front matter identifies the article (IEEE Intelligent Systems, Vol. 15, No. 5, September/October 2000, pp. 10-18, "Quantum Computing: The Final Frontier?", Richard J. [...], Colin P. [...]), followed by the abstract ("Quantum computing is an exciting area from a computer science viewpoint. Not only is there the possibility [...]"), body sections such as "SMALLER, FASTER, RAD HARDER", "QUANTUM ALGORITHMS FOR NASA APPLICATIONS", "Unstructured quantum search" and "Solid-State Quantum Computing Hardware", and the references in the back matter.]

The properties of the initial IEEE collection provided a suitably large collection with articles of varying depth of logical structure for this study. The collection is known and widely used in the XML retrieval community, mainly through the INEX initiative, where it was used between 2002 and 2005 for XML retrieval evaluation and user studies (Fuhr et al., 2003, 2004, 2005, 2006; Tombros et al., 2005a; Larsen et al., 2006a).
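As an aside, collection statistics such as the average node count and node depth quoted above can be computed with a simple traversal. The snippet below uses a toy article whose tag names (fm, bdy, bm, sec, ss1, para) mirror those mentioned in this chapter; the exact IEEE DTD is not reproduced here, so the structure is illustrative only:

```python
import xml.etree.ElementTree as ET

# A toy article following the IEEE-like layout described above
# (front matter, body with sections and paragraphs, back matter).
SAMPLE = """
<article>
  <fm><ti>Quantum Computing: The Final Frontier?</ti><abs><para>Abstract text.</para></abs></fm>
  <bdy>
    <sec><st>SMALLER, FASTER, RAD HARDER</st><para>Text.</para><para>More text.</para></sec>
    <sec><st>QUANTUM ALGORITHMS</st><ss1><st>Unstructured search</st><para>Text.</para></ss1></sec>
  </bdy>
  <bm><bib><para>Reference.</para></bib></bm>
</article>
"""

def node_stats(root):
    """Return (node count, average node depth), counting the root at depth 1."""
    depths = []
    def walk(elem, depth):
        depths.append(depth)
        for child in elem:
            walk(child, depth + 1)
    walk(root, 1)
    return len(depths), sum(depths) / len(depths)

count, avg_depth = node_stats(ET.fromstring(SAMPLE))
```

Run over the full collection, this kind of traversal yields the per-article averages (1532 nodes, depth 6.9) reported above.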
4.3.2
Searchers
Twelve searchers were recruited for this study. Although this is not a high number, it is believed to be sufficient for the purpose of the study. All had a computer science background, as the collection used contained articles from the field of computer science. Nine searchers were male and three were female. Although having computer science students as searchers introduces a bias, the nature of the document collection used required that participants had a science background. Participants had five different first languages. Seven were between 18 and 27 years of age, four were between 28 and 37, and one was between 38 and 47.
4.3.3
Experimental and control systems
Two versions of the developed system were used in this study. The control system was the one described above (referred to in this chapter as System Sc); the experimental system (System Se) differed in the display mode of summaries. While the control system was 'fully functional', the experimental system displayed summaries only at high levels of the hierarchical structure, i.e., the upper three levels had associated summaries, while the fourth level did not. The rationale behind this was to see whether searchers would notice the difference (i.e., that tool-tip summaries are missing for elements at the fourth depth level and deeper) and act differently. From this difference, the usefulness of showing the structure of the document and summaries can be examined. Disabling the summaries at higher levels, i.e., at depth levels two or three, was considered ineffective, as the experimental system would then have provided little information about the use of textual summaries. To avoid bias towards the use of the hierarchical structure and summarisation, a blind study was employed, i.e., searchers were not told beforehand what the purpose of the study was or whether the systems they were asked to use were different.
4.3.4
Tasks
Four tasks were created for the experiments (for details, see Appendix A.1). The aim with the tasks was to create simulated work task situations (Borlund, 2003b). As a first step in task generation, four topics were chosen from the INEX 2005 Ad-hoc track topics (Sigurbjörnsson et al., 2005). Choosing topics already in use at INEX ensured that relevant XML portions were actually present in the used collection for each chosen topic. The selected topics were then modified by adding introductory sentences that would serve as task background for the searchers performing searches using the task generated from the given topic. Task description generation also included combining sentences from various parts of the original INEX topics, e.g., the title, initial topic statement and narrative. For an example source topic, see Figure 4.8.

Figure 4.8: A sample topic that was used to generate one of the L tasks. [Topic excerpt: "speech recognition software capabilities limitations"; "commercial speech recognition software"; "Return the names of commercial speech recognition software packages, along with the capabilities and limitations of each. I'm hoping to generate transcripts from audio captured during meetings held through a VoIP tool. I'd like to compile a list of all commercially available speech recognition packages, along with a brief description of their capabilities and any experience (positive or negative) that others have had with them. To be relevant, the name of the software package must be given. Public domain systems are relevant only if they are suitable for commercial use."]

The topics were also chosen in a way that allowed two types of tasks to be created, with two tasks of each type. Background type tasks instructed searchers to look for information about a certain topic (e.g., concerns about the CIA's and FBI's monitoring of the public), while List type tasks asked searchers to create a list of products connected to the topic of their task (e.g., a list of speech recognition software, Figure 4.9). These task types are also referred to as B and L, respectively. The reason for creating more than one type of task was to avoid testing the research hypotheses under only one type of task condition. Although further task types exist, such as fact finding and decision making (Marchionini and Shneiderman, 1988; White et al., 2003), it was not thought necessary to introduce more task types for this study.

Figure 4.9: One of the L tasks used. [Task L1 - Speech Recognition Software: "Let us suppose you work for a company as a manager and you have frequent audio meetings (through a Voice over IP (VoIP) tool). You would like to keep records of these meetings on paper as well, but you don't have the time and resources to type what is discussed during the meetings. Instead, you are hoping to generate transcripts from the audio captured during these meetings. You would like to compile a list of all commercially available speech recognition packages, along with a brief description of their capabilities and any experience (positive or negative) that others have had with them. Public domain systems are good for you only if they are suitable for commercial use. Please write down a list of names of available software along with their capabilities and, if you feel like it, some comments about them."]

From each group of tasks, searchers could freely choose the one that was more interesting to them. Searchers had a maximum of 20 minutes for each task. This period is defined as a search session. The search sessions of the same searcher (i.e., one searcher had two search sessions) are defined and used in this chapter as a user session.
4.3.5
Search design
To rule out fatigue and learning effects that could affect the results, a Latin square design was used. Participants were randomly assigned to groups of four. Within groups, the system order and the task order were permuted, i.e., each searcher performed two tasks on different systems, which involved two different task types (see Table 2.1 in Section 2.3.3). An effort was made to keep situational variables constant: the same computer settings were used for each subject, the same (and only) experimenter was present, and the experiments took place in the same location.
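One way to enumerate the crossed conditions of such a design (system order x task-type order within each group of four searchers) is sketched below; the actual assignment table used is Table 2.1, which this sketch does not claim to reproduce exactly:

```python
from itertools import product

SYSTEMS = ["Sc", "Se"]    # control and experimental system
TASK_TYPES = ["B", "L"]   # Background and List task types

def group_assignments():
    """Return the four (system, task type) session sequences for one group
    of four searchers: every combination of first system and first task
    type occurs exactly once, and each searcher sees both systems and
    both task types across their two sessions."""
    conditions = []
    for sys_first, task_first in product(SYSTEMS, TASK_TYPES):
        sys_order = [sys_first] + [s for s in SYSTEMS if s != sys_first]
        task_order = [task_first] + [t for t in TASK_TYPES if t != task_first]
        conditions.append(list(zip(sys_order, task_order)))
    return conditions
```

Each of the four searchers in a group is given one of the four returned sequences, which balances order effects across the group.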
4.3.6
Data collected
Information was collected in three ways: searchers filled in questionnaires, search logs were recorded, and interviews were conducted at the end of each user session.

4.3.6.1
Questionnaire data
Questionnaires were filled in by searchers before and after each search session, and before and after each user session (Figure 4.6). In the entry questionnaires, general information was collected about searchers (age, first language, education, etc.) as well as information about their experience with computers and searching. Information about searchers' perception of the search tasks and systems was also collected in before- and after-each-task questionnaires. The questionnaires included questions about specific features of the user interface, i.e., the use of summaries and the displayed XML structure. Comments and suggestions were also recorded in the after-each-task and exit questionnaires. For details on the questionnaires, see Appendix A.2.

4.3.6.2
Interview data
After each user session, searchers were interviewed. In the interviews, their use of the system, the summaries and the XML structure was discussed in detail to understand how these had affected their searching behaviour and their satisfaction with the system.

4.3.6.3
Log data
Two types of events were logged. One type recorded searchers' actions based on their mouse clicks (e.g., when searchers clicked on the "Search" button, or opened an element for reading). The other type corresponds to the summary-viewing actions of searchers, i.e., an event was logged whenever a summary was displayed (i.e., whenever searchers moved the mouse pointer over an item in the structural overview displayed on the left-hand side). During the analysis of the summary log files, summary-viewing times shorter than half a second or longer than twenty seconds were discarded: the former probably correspond to quick mouse moves (without searchers having read the summary at all), while the latter may have recorded searcher actions when only the keyboard was used (e.g., opening another window by pressing CTRL+N and coming back to the old window; although searchers were instructed not to use this method, some of them did).
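The two cut-off values can be applied with a simple filter; the (element id, seconds) event format below is a hypothetical stand-in for the actual log format:

```python
MIN_VIEW_S = 0.5   # below this: likely an accidental mouse-over
MAX_VIEW_S = 20.0  # above this: likely keyboard-only activity elsewhere

def filter_summary_views(view_times):
    """Drop summary-viewing events outside the [0.5s, 20s] window described
    above. Events are (element_id, seconds) pairs; the ids are illustrative."""
    return [(eid, t) for eid, t in view_times
            if MIN_VIEW_S <= t <= MAX_VIEW_S]
```

Only events surviving this filter enter the reading-time analyses reported in Section 4.5.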
4.4
Questionnaire and interview analysis
In this section, the data collected in questionnaires and the analysis of the interviews are presented and discussed. The reason for reporting the analysis of the collected data in two sections (Section 4.4 and Section 4.5) is chronological: during the analysis of the data obtained from the study, the questionnaire and interview data were analysed first and, afterwards, the log analysis was performed in reflection of the query and questionnaire results. When designing the questionnaires, apart from comments written by searchers, 7-point Likert scales were used to measure searchers' perceptions, where 7 corresponds to the strongest and 1 to the weakest agreement with regard to the question asked. In this section, I refer to such scales unless otherwise stated.
Table 4.2: Questionnaire data about searchers' expertise, task and system difficulty.

Question                                                      Average   St. dev.
Level of expertise with computers                               5.92      0.90
Level of expertise with searching                               6.00      0.60
To[...]understand the nature of the searching task?             5.75      0.87
How easy do you think the task is?                              4.54      1.10
Was it easy to do the search on this topic?                     4.33      1.66
How well did you understand how to use the system?              6.17      0.83
How easy was it to learn to use the information system?         5.83      0.83
How easy was it to use the information system?                  5.67      0.89

4.4.1

Validity of results
According to the entry questionnaires, most of the searchers claimed to be experts both in working with computers and in searching (averages of 5.9 and 6 points, respectively, where 1 indicated a low and 7 a high level of expertise; Table 4.2). This indicates that they were able to understand the search tasks, as well as the search results, as both the document collection and the search tasks were computer (science) oriented. Searchers also rated their task understanding as 5.75 on average. Information about searchers' perception of the given tasks was also collected. Searchers found the difficulty of the tasks in general to be average, both before starting the search (average of 4.54, where 7 corresponds to easy) and after finishing it (4.33 on average). As they also understood how to use the two systems easily (6.17 on average), it is believed the system did not cause frustration to searchers, which would have had an unwanted effect on their searching behaviour. Task types, which may have had an effect on searchers' behaviour, were also examined (Table 4.3). L (i.e., List type) tasks appeared to be easier to start with, but searchers did not perceive a difference in task difficulty later in their search process. According to a t-test (p=0.05, two-tailed, unpaired), there is no statistically significant difference in these results either. Both task types appeared to be quite interesting (5.08 on average for both) and realistic (5.58 and 5.25), where realism was explained to searchers as how likely it was to have such a search task in real life. Searchers read about the same number of summaries regardless of task type, and there was a small, but not significant (t-test under the same conditions as above), difference in the use of the hierarchical structure (ToCs). As the understanding and ease-to-use levels are relatively high and there are no significant
Table 4.3: Questionnaire Data by Task Types.

                                                        Task B            Task L
Question                                            Avg    St. dev    Avg    St. dev
Was it easy to get started on this search?          4.50    1.51      5.25    1.48
Was it easy to do the search on this topic?         4.33    1.61      4.33    1.78
Did you have enough time[...]?                      4.58    1.24      4.00    2.30
Are you confident in your results?                  4.50    1.45      4.42    2.31
Do you feel your results are complete?              3.58    1.24      3.33    1.72
Did your previous knowledge[...]help[...]?          3.17    1.95      3.58    1.78
Have you learned much new about the topic[...]?     4.08    1.56      4.17    1.64
Was the search task interesting?                    5.08    1.08      5.08    1.08
Was the search task realistic?                      5.58    1.00      5.25    1.29
Did you use the hierarchy[...]?                     4.67    1.61      5.08    1.16
Did you read the summaries[...]in the hierarchy?    4.92    1.68      4.92    1.56
differences between the perceptions of the two task types (B and L), and as the tasks also seemed realistic, it is believed that the above observations support the reliability and validity of the results based on the data collected during the study.
4.4.2
System versions
Regarding the use of the two system versions, there is a significant difference, according to t-tests (p=0.05, two-tailed, unpaired), in the reading of summaries (Table 4.4). Searchers' appreciation of summaries at all hierarchy levels also differed between the system versions, although not significantly. The results show that, using the complete system (Sc), searchers read more summaries since, by the design of the two systems, more summaries were displayed to them. Users of system Sc liked summaries more, at all levels at which they were displayed, than those using the restricted system (Se). In the corresponding questionnaires, searchers indicated "problems" with summaries at the low (i.e., paragraph) level in 8 out of 24 search sessions. Five cases concerned system Se, where searchers reported missing summaries, while three were with system Sc, where searchers did not want to see those summaries. In the interviews, some of the searchers who did not comment on the use of summaries at low levels of the structure stated that they did, as a matter of fact, miss summaries for paragraphs when using system Se. These searchers claimed they did not comment on this issue because they did not explicitly realise that some summaries were missing (especially when they performed their search on system Se first), as they were focusing more on the retrieval performance of the systems (i.e., the quality of the result list) than on the use of structural information
Table 4.4: Questionnaire Data by System Types. * indicates statistically significant difference.

                                                      System Sc         System Se
Question                                            Avg    St. dev    Avg    St. dev
Did you have enough time[...]?                      3.92    1.73      4.67    1.92
Are you confident in your results?                  4.00    2.09      4.92    1.62
Do you feel your results are complete?              2.92    1.51      4.00    1.28
Did you use the hierarchy[...]?                     4.83    1.53      4.92    1.31
Did you read the summaries[...]in the hierarchy?   *5.42    1.56     *4.42    1.51
Did you like summaries[...]?                        5.17    1.75      5.25    1.48
Did you like summaries at all levels[...]?          5.08    1.68      4.17    1.85
and summaries (see footnote 5). Regardless of system (and task), searchers liked summaries as tool tips (5.54 on average), which is in accordance with the findings of Dumais et al. (2001).
4.4.3
Interview-based results
In addition to helping interpret the questionnaire data, comments acquired during the interviews helped identify possibly typical problems and requirements of an interactive XML retrieval system. Parallel to this research, INEX Interactive track researchers also investigated these requirements and found similar results (Tombros et al., 2005b; Pharo and Nordlie, 2005; Malik et al., 2006a). Several searchers in this study claimed that they did not like the four-level display of the structure, but they did not change the number of levels although it was possible. Searchers also said fewer levels would not have been informative enough for them. This indicates that searchers expect automatic determination of which structural elements should be displayed, and that a general 'number of displayed levels is x' approach is not suitable here. Some of the searchers who had problems with the number of structural levels displayed indicated that elements without a title should be displayed using labels other than their types and sequence numbers (e.g., Section 3). This shows that ToC labels should be as informative (i.e., indicative of the corresponding elements' contents) as possible, as several words from an element's text seem to be more useful than the element's type and sequence number, which indicate the structure but disregard the textual content. (Footnote 5: It is thought this is because most searchers associate information retrieval with web retrieval only, where links are returned and the result documents are displayed not by the actual IR system but by the browser. The reason for not paying particular attention to the document display might also be that the purpose of the study had not been revealed to searchers, as that would have compromised their searching behaviour (focusing too much on what 'they were supposed to do'). Nevertheless, as they focused more on retrieval performance, they are thought to have been acting naturally when the document view was presented. Based on the assumptions above, when analysing the results, the questionnaire data are examined in reflection of the interviews.)

Some of the searchers complained about getting summaries unrelated to the query (i.e., summaries not containing query terms). However, this was the case with summaries associated with elements that were not retrieved. In other words, summaries of elements not focusing on the topic of the query were disturbing because searchers did not know whether the element was not relevant or the summary itself was of low quality. It is thought that this problem is not necessarily related to the text summarisation method; rather, it is a question of how to display the logical structure of a document: relevant elements (as estimated by the retrieval system) should be made more prominent compared to non-relevant ones. Non-relevant elements may even need to stay hidden in the structural overview (ToC) of documents in the document view. The question of how to generate a structural overview, among others, is investigated in Chapters 5 and 6. Searchers indicated that when an element could fit in a window (right panel in Figure 4.5), showing the structure of that element in the left window (i.e., the descendants of the element) was not necessary. They said this was because they could easily find the relevant information in a window as long as they did not have to scroll down. Although 'fitting in a window' is not an exact definition of where the structural display should be less detailed, it indicates a need to consider the element's length when showing its (sub-)structure. Searchers who realised the difference between the two systems, and also remembered this during the interviews, stated that missing summaries were disturbing; they also mentioned that summaries shown at low structural levels, although they could be unnecessary, were rarely bothering.
These comments are interpreted as a strong connection between the ToC and summary display, i.e., when ToC items are shown, corresponding summaries should always be displayed. Also, it is an indication that having slightly more information displayed is more acceptable than leaving out other, important, information.
4.4.4
Discussion of questionnaire and interview based results
Based on the questionnaire data and interviews with participants of this study, it is believed that summarisation can be helpful in interactive XML retrieval. However, in order to be able to investigate particular summarisation algorithms in a retrieval system such as the one described in this chapter, the display of the ToCs has to be well controlled. It is also believed that the structural document (ToC) display and text summarisation for XML elements are strongly connected. If
the display is not well designed, development, including evaluation, of various summarisation strategies is not reliable in an interactive environment, as searchers are distracted by the effect of the display method. To control this effect, the following design guidelines, mostly concerning the ToC display, are proposed as a result of the questionnaire and interview analysis.
• ToC items should be displayed based on the estimated relevance of the corresponding element. This is because searchers do not want to waste their time being pointed at unnecessary, irrelevant information through a ToC.
• ToC items should be displayed according to the corresponding elements' size, independently of their content type, i.e., whether they are composed of figures, tables, textual or non-textual content. This is because searchers indicated a relation between the need for display and element length.
• Labels of ToC items should be as informative as possible. Titles, when available, are the ideal choice. When titles are not present, keywords of the corresponding elements should be displayed as labels. Types of elements (e.g., section, figure) are not satisfactory according to this study.
• Summaries should be displayed for each item in the structural display. Alternatively, summaries should be completely avoided, as selective summary presence may disturb searchers.
Having analysed the questionnaire and interview based results of this study, the next section presents and discusses the results of the log analysis.
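As an illustration, the first three ToC display guidelines above could be operationalised roughly as follows; the thresholds, field names and keyword-extraction output are illustrative assumptions, not part of the study:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class TocItem:
    title: Optional[str]          # element title, if any
    keywords: List[str]           # hypothetical keyword-extractor output
    elem_type: str                # e.g. "sec", "para"
    rsv: float                    # retrieval score (estimated relevance)
    length: int = 0               # element length in characters

def toc_label(item: TocItem) -> str:
    """Guideline 3: prefer the title; fall back to keywords, never the bare
    element type alone."""
    if item.title:
        return item.title
    return ", ".join(item.keywords) or item.elem_type

def show_in_toc(item: TocItem, min_rsv: float = 0.1, min_length: int = 200) -> bool:
    """Guidelines 1-2: display an item only if its element is estimated
    relevant and long enough; both thresholds here are placeholders."""
    return item.rsv >= min_rsv and item.length >= min_length
```

Chapters 5 and 6, as noted above, investigate principled ways of generating such structural overviews; this sketch only fixes arbitrary cut-offs.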
4.5
Log analysis
In this section, the analysis of the recorded log files is described. Having already analysed the questionnaire and interview data, the log data analysis is reported in reflection of the findings of Section 4.4.4. For the investigation of the logs, four groups of research questions were formed. The first group (discussed in Section 4.5.1) concerns summary reading times: how long these times were for different system versions and element types. The second group (Section 4.5.2) is about the number of summaries searchers read in their search sessions. Section 4.5.3 investigates the relation between summary reading times and the number of summaries read. The average number of summaries read and the average summary reading times of the corresponding searchers are shown in
Figure 4.10. The fourth group (Section 4.5.4) looks into the relation between multi-level XML retrieval and traditional retrieval.

Figure 4.10: Average summary reading times and number of read summaries in the 24 search sessions.
4.5.1
Summary reading times
In this section, it is reported how long searchers read an average summary, whether there were differences in reading times for summaries associated with elements at various structural levels and of various element types, and whether the average summary reading times changed when summaries were not shown at all structural levels in the ToC. The objective is, through the comparison of the two system types, to find patterns in summary reading times and draw conclusions for the design of interactive XML IR systems. Taking into account both systems Se and Sc, an average summary was displayed for 4.24 seconds with a standard deviation of 3.9. The longest viewed summary was displayed for 19.57s, while the shortest counted summary was viewed for 0.51s (note the 0.5s and 20s cut-off values explained previously). Figure 4.11 shows the distribution of summary display times by structural level for each system. Display times on Sc tended to be shorter when searchers read summaries of deeper, hence also shorter, elements, although the length of the summaries was the same (i.e., four sentences). For Se, times were more balanced. This indicates that if there are summaries for more levels and
Figure 4.11: Summary times by structural levels.

the lowest level is very short (sometimes these paragraphs are as short as the summary itself), people trust summaries of larger, i.e., higher, elements more. If the difference in size between the deepest and highest elements is not so large, times are more balanced. Figure 4.12 shows the display time distribution by XML element type. We can see, for example, that the bdy (body) element has high summary viewing times; this is the element that contains all the main text of the article. We can also see that paragraphs (para) and subsections (ss1 and ss2) have low summary reading times for Sc and none for Se (as summaries are not displayed at these levels on Se). These three element types appear at the lowest, i.e., fourth, structural level of the used collection. The two system versions (Se and Sc) were also compared to find out whether significant differences in summary reading times could be found. The comparison of the overall summary-viewing times showed a significant difference between Se and Sc (t-test, p = 0.05), i.e., the average summary viewing time for system Se (4.58s) is significantly higher than that of system Sc (3.98s). To examine where this difference comes from, the two systems were also compared by element type (e.g., whether summary reading times for sections differ between the two systems). However, there were no significant differences for comparable element types. (Footnote 6: Element types for which summaries were not displayed for one of the systems were not compared, as one of the sample groups would contain zero samples.) I also compared the
Figure 4.12: Summary times by XML element types.

two systems with respect to structural levels (e.g., whether the average summary reading time at level one is significantly different between the two systems). No significant difference was found for level one (article), level two (body, front and back matter) or level three (abstract, sections, appendix) elements. To sum up, the results showed that searchers of system Se read summaries 0.5s longer than those of system Sc. However, there were no significant differences at comparable levels or element types between the two systems. An interpretation of this result is that since Se searchers had fewer available summaries to examine, they were less confused and overloaded by the available information and could take their time reading a particular summary.
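The unpaired two-sample comparisons reported in this section can be reproduced with a standard t statistic; the sketch below uses Welch's form (which does not assume equal variances), though the thesis does not state which t-test variant was used:

```python
import math
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Unpaired two-sample t statistic in Welch's form. The statistic would
    then be compared against a critical value for a two-tailed test at
    p = 0.05, as reported in this section."""
    na, nb = len(sample_a), len(sample_b)
    va, vb = variance(sample_a), variance(sample_b)  # sample variances
    se = math.sqrt(va / na + vb / nb)                # standard error of the difference
    return (mean(sample_a) - mean(sample_b)) / se
```

Applied to the per-event viewing times of the two systems, a statistic exceeding the critical value corresponds to the significant Se vs. Sc difference reported above.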
4.5.2
Number of summaries read
This section looks into the number of summaries read by searchers. First, the average number of summaries seen by searchers in a search session is examined, followed by an analysis of the distributions of the numbers of read summaries at different structural levels and for different element types. Differences between the two systems with respect to the number of read summaries are also discussed in this section. Considering both systems together, an average searcher read 14.42 summaries in a (maximum 20-minute) search session, with a standard deviation of 10.77. This standard deviation shows considerable differences in searcher behaviour regarding summary reading. The least active summary reader read only one summary in a search session, while the most active saw
Figure 4.13: Number of read summaries per search session by structural levels.

52 summaries for at least half a second. Also, 14.42 summaries read in 20 minutes may not sound like many; however, some of the searchers finished before the 20 minutes had elapsed, and it is also believed that the moderate number of read summaries is due to the active reading of elements' contents, i.e., searchers did not browse the ToC of the document very frequently. Figure 4.13 shows that the deeper we are in the structure of the ToC, the more summaries are read, on average, in a search session. This is consistent with the nature of XML, and of all tree-like structures: the deeper we go within a tree, the more elements are available at that level. However, the log data show that the difference between the two system versions is not only based on this structural property, because when only three levels of summaries were displayed, the reading of third-level summaries (usually summaries of sections) showed higher activity than when four levels of summaries were displayed, i.e., the third level seems to be more interesting than the first and second. If only three levels of summaries are available (Se), the third level becomes the most used in terms of the number of read summaries, whereas if four levels of summaries are displayed (Sc), the number of read summaries for third- and fourth-level elements is high, presumably because searchers are looking for as specific content as possible. The next step is to find out whether this interest occurs only at deeper levels, or whether it is connected to particular element types. Contents of the same element type are supposed to contain the same amount and kind of information: e.g., paragraphs are a few sentences long, and front matters usually contain author and title information and the abstract of the paper. The log analysis shows that summaries
of sections, subsections and paragraphs are the most read (Figure 4.14), although searchers take less time to read them (see the previous section).

Figure 4.14: Number of read summaries per search session, by XML element type.

Other element types appear less promising to searchers, judging by their summary usage. We can also see in Figure 4.14 that when paragraph and subsection summaries are not available (Se), section summary reading increases dramatically. This is interpreted as an indication that, for the IEEE collection, sections, appearing mostly at depth level three, are the most promising elements to look at when answering an average query. A comparison of the overall number of viewed summaries showed that an average searcher of system Se read 12.5 summaries per search session, while an average searcher of system Sc read 16.33. In other words, participants read more summaries where more summaries were available. Although this sounds natural, the difference is not statistically significant: t-tests (p=0.05) showed no significant differences between Se and Sc in the number of read summaries at comparable levels and element types either.
4.5.3  Reading times vs. number of read summaries
In this section, the relationships between the data and findings of the previous two sections are examined. One question is whether searchers with higher summary reading times read fewer summaries in a search session. The objective is, as before, to identify patterns that could later be used in interactive XML IR system design. Users of system Se read fewer summaries than those who used system Sc, which is consistent with their having fewer summaries available. However, searchers of system Se also read summaries
for longer. This suggests that if there are fewer available summaries, searchers can focus more on each particular summary, and vice versa: if there are many summaries to view, reading can become superficial. Hence, the structural overview has to be designed so that it allows searchers to find an optimal balance between these two summary reading properties. Considering both systems and element types, a negative correlation was found between summary reading times and the number of read summaries. In other words, for searchers of both systems, the more summaries they read on a particular level, the shorter the corresponding reading times are. However, this is only an indication, as the critical value table for Pearson's correlation coefficient shows that, at the α = 0.05 level, there is no significant relationship. Also, since the number of summaries read increases when going deeper in the structure, this is viewed as an indication that, for searchers, summaries of higher-level elements are more indicative of the contents of the corresponding elements than those of lower, and also shorter, elements. With respect to the two task types, it was found that, although searchers indicated no differences in perception between B (background) and L (list) tasks, they behaved slightly differently depending on the task type. Searchers working on B tasks read fewer summaries, and their reading times were longer, with respect to both element types and depth levels (Table 4.5). This shows that different reading patterns are associated with different task types. Nevertheless, summaries were used extensively by searchers completing both types of tasks.

Task type (average number of summaries)      L       B
Overall                                    17.58   11.25
Section summaries                           7.166   4.916
Article summaries                           0.666   0.916
Para summaries                              3.25    1.25
Abs summaries                               1       0.916
Fm summaries                                1.25    0.666
Bdy summaries                               1.083   0.25
Bm summaries                                1.083   0.75
Ss(1-2) summaries                           1.416   1
App summaries                               0.75    0.583
Depth level 1                               0.666   0.916
Depth level 2                               3.416   1.666
Depth level 3                               8.75    6.416
Depth level 4                               4.75    2.25
Table 4.5: Summary reading times and number of read summaries by task types.
4.5.4  Usage of the table of contents and article level
As XML retrieval has the advantage of breaking a document down into smaller elements according to the document's logical structure, it is investigated whether searchers take advantage of this structure: do they click on items in the ToC, do they use the article (unstructured) level of a document, and how frequently do they alternate between full-article and smaller XML element views? Active usage of the non-article levels would also validate the structured-retrieval-specific findings (see previous sections). Regarding the usage of the XML structure in terms of the ToC, 58.16% of the displayed elements were the results of at least "second clicks" (also referred to as secondary clicks), i.e., more than half of the elements were displayed by clicking on an element in the ToC. For a secondary click, a searcher had to be already at the document display view, following a "first click" which took them from the result list view (Figure 4.3) to the document view (Figure 4.5). This shows that searchers actively used the ToC provided, and that they used the logical structure of the documents by browsing within the ToC. This is an indication that the system served searchers' browsing behaviour better than that reported by Tombros et al. (2005b), where one of the main problems was the lack of within-document browsing. Later, Kazai and Trotman (2007) also found that searchers appreciate the availability of ToCs. The log files show that only 25% of the searchers clicked on article elements to access the whole document, and none of these searchers clicked on an article-type link more than three times in a search session. The distribution of whole-article views did not depend on the system, i.e., three article clicks were observed for each system. This result follows naturally, since the display of the article level was not different in Sc (full) and Se (with fewer summaries). Article-level clicks show that articles made up only 3.56% of all displayed elements.
This may be misleading, as the retrieval system did not return article elements in the result list. Therefore, article usage also had to be compared against only those elements that were displayed while searchers were already in the document view (the "secondary clicks" above), i.e., elements shown immediately after a searcher clicked on a link in the result list were excluded. The updated figure shows that article elements were displayed in 6.12% of the secondary clicks. This suggests that searchers of an XML retrieval system do use the structure available in terms of the ToC and, although it was the first time they had used an XML retrieval system, they did not simply fall back on viewing the whole documents they were accustomed to, but made use of document parts.
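The first-click/secondary-click distinction can be made concrete with a small log-processing sketch. The event records and field names below are hypothetical, not the study's actual log format; the point is how the two percentages (article share among all displays vs. among secondary clicks only) are derived.

```python
# Hypothetical display events: "result_list" marks a first click from the
# result list, "toc" marks a secondary click made inside the document view.
events = [
    {"source": "result_list", "element": "section"},
    {"source": "toc",         "element": "paragraph"},
    {"source": "toc",         "element": "article"},
    {"source": "toc",         "element": "section"},
    {"source": "result_list", "element": "subsection"},
]

secondary = [e for e in events if e["source"] == "toc"]
pct_secondary = 100 * len(secondary) / len(events)

# Article share among all displays, and among secondary clicks only
pct_article_all = 100 * sum(e["element"] == "article" for e in events) / len(events)
pct_article_secondary = (
    100 * sum(e["element"] == "article" for e in secondary) / len(secondary)
)

print(pct_secondary, pct_article_all, pct_article_secondary)
```

Restricting the denominator to secondary clicks is what raises the article share from 3.56% to 6.12% in the study's data: first clicks, which can never land on an article, are removed from the comparison.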
The results from the previous sections suggest that searchers still want to have access to, and do use, the full-article level. For example, searchers read summaries of articles, and read them for longer, but they did not necessarily want to use the full articles directly, i.e., looking at the full-article summary may be enough to decide whether reading any part of the article is worthwhile.

4.5.5  Discussion of log analysis based results
Searchers in this study did indeed use the provided structure actively and did not rely on the whole article alone to identify relevant content. In addition, searchers made good use of the XML element summaries, spending a significant amount of time reading them. This supports the validity of the results, as they come from extensive usage of the provided ToCs and summaries. It further implies that the system, through the use of text summarisation, facilitated browsing at the ToC level more than at the INEX 2004 Interactive track (Tombros et al., 2005b). Regarding the use of element summaries, searchers in the study tended to read more summaries associated with elements at lower levels in the structure (e.g., summaries of paragraphs), while at the same time summaries of lower elements were read for a shorter period of time. The results also suggest that if more summaries are made available, searchers tend to read more summaries in a search session, but for a shorter time. The reading times and the number of read summaries also depend on the search/work task, as Table 4.5 shows. Given the close relation between the ToC display and summary presentation (Section 4.4.4), it is important for such XML retrieval systems to design the ToC and summary presentation together. If the ToC is too deep, searchers may lose focus, as the reading of many summaries and the short reading times at low levels indicated. Nevertheless, if the ToC is not detailed enough, users may miss potentially good links to relevant elements. The results suggest that, for the collection used, a one- or two-level ToC (containing references to the whole article, body, front and back matter) would probably be too shallow, while displaying the full fourth level (normally down to paragraph level) is sometimes too deep.
4.6  Discussion
The study described in this chapter aimed at answering research questions regarding text summarisation in interactive XML retrieval. The first research question is concerned with the usefulness of text summaries in an interactive search environment. Results of the questionnaire, interview and log analysis show that
textual summaries can indeed be useful in the above-mentioned context. Summaries were used extensively and appreciated by searchers according to all three data sources. The second research question aimed at finding out where in the logical structure textual summaries were useful. Results of the presented study show that textual summaries are useful at all structural levels displayed to searchers as tables of contents. Searchers read more summaries at deeper levels, and they also read them for a shorter time. As there were more summaries available at deeper levels, the main findings are that summaries are useful at all levels, and that the focus of investigation should also be on the structural display, i.e. how deep and detailed ToCs should be. The third research question with respect to text summarisation concerned the system and interface that displayed textual summaries to searchers. Results of the study give indications for the design of the ToC display, as well as for how to present the retrieved elements in a result list. With regard to the latter, it was found that elements from the same document returned by a retrieval system should be grouped, thus showing an overview of the most relevant elements from the document. A possible overview of the relevant elements is a ToC, and the result display and ToC display steps could possibly be merged in the future. ToCs can also contain other elements from the document that are not marked relevant by the retrieval system, to provide a more complete overview of the document's logical structure. A result that had not been expected before the study concerns the importance of ToC creation and display (as opposed to the results regarding text summarisation). ToCs should be automatically generated, and ToC generation should depend on the relevance of elements, but also on the length and depth of the elements.
Searchers prefer non-relevant elements to be displayed in the ToC in addition to relevant ones (i.e. higher recall but lower precision) over not having access to relevant elements via the ToC (i.e. higher precision but lower recall). The ToC labels, i.e. the titles of sections displayed in the ToC, should be as informative as possible when a title is not directly available. For all ToC items, textual summaries can be displayed to help searchers reach the relevant content.
4.7  Conclusions
In this chapter, text summarisation has been studied in the context of XML retrieval. The study presented used human searchers to investigate whether text summarisation was useful for the focused access of the contents of XML documents. The experimental system and design have
been presented in Sections 4.2 and 4.3, followed by the discussion of the results of the study. The results obtained from questionnaires, interviews (Section 4.4) and log files (Section 4.5) have been discussed. The results show that text summarisation can help users reach the relevant content more easily; however, the structural overview (i.e. the ToC), which also provides support for users, has to be carefully designed. This structural overview is investigated in the following two chapters.
Chapter 5

Query-based structure summarisation in an interactive environment
In this chapter, the use of structure summarisation for XML retrieval is investigated. The chapter is based on the paper published at the 29th European Conference on Information Retrieval, titled Feature- and query-based table of contents generation for XML documents (Szlávik et al., 2007).
5.1  Introduction
In Chapter 4, several problems that were considered limitations of XML retrieval systems were identified. It was found that a ToC should reflect which elements are possibly relevant to the searcher's query, i.e. ToC items of relevant elements are more important to show in the ToC than those of non-relevant ones. It was also found that the display of a ToC should depend on the length of elements, e.g. longer sections are more important to include in the ToC. Another finding of Chapter 4 is that it is important to consider how deep an element is in the document structure, i.e. for most of the XML documents currently used in XML retrieval, a one- or two-level-deep ToC is probably too shallow, while four- or five-level-deep ToCs may sometimes be too deep to help searchers with their information seeking task. In the systems used in the INEX interactive track (Tombros et al., 2005b; Larsen et al., 2006a; Malik et al., 2006b) (for screen shots of the systems see Figures 3.6, 3.7 and 3.9 in Chapter 3), ToCs had two main limitations: 1. they were static, i.e. the same ToCs for a given document were displayed for all searcher
queries, and 2. they were manually defined, i.e., before the documents were used in the systems, documents had to be analysed and several (types of) elements selected to be included in ToCs; this selection was not automatic. Based on the above, the main research questions investigated in this chapter are the following.
• Is relevance important when automatically generating a ToC? In other words, when a ToC is generated by combining various element features such as length, depth and relevance, do searchers think that relevance should be weighted as high as, or above, other features? In the study described in this chapter, relevance determined by a retrieval system (system/algorithmic relevance) is used in generating ToCs, and this relevance value is combined with other element features to support searchers in finding relevant content within XML documents (cognitive relevance).
• What is an "ideal" ToC like? Questions such as the following are investigated: what proportion of elements do searchers find ToC-worthy (for a definition of the term, see below), and how long (in terms of the number of ToC-worthy elements) do ToCs tend to be?
• Do the characteristics of ToCs depend on individual searchers, document collections, or search topics? In other words, can general conclusions for ToC generation be drawn from the study?
To answer these questions, a method to automatically generate ToCs by considering various characteristics of XML elements (e.g. element text length) as well as the searchers' (simulated) queries is introduced, and the generated ToCs are investigated. ToC generation is referred to as structure summarisation, as the table of contents can be regarded as a summary of the logical structure of the document. The ToCs investigated in this chapter are query-based (query-focused) structure summaries giving an overview of the most important and relevant document portions (i.e. sections, subsections, etc.).
The ToC generation method discussed is inspired by text summarisation (sentence extraction) systems (e.g. Edmundson (1969); Tombros and Sanderson (1998)). Features of elements, including their length, depth and relevance, are considered and combined in order to
determine whether an element should have a reference in the ToC (i.e. whether it is ToC-worthy; see below). The method can create a ToC for any XML document. It is important to note that the study presented in this chapter does not focus on the summarisation method itself but on the relative importance of the element features used (i.e. length, depth and relevance). The sizes of ToCs are also investigated in order to understand how searchers can find relevant information effectively using tables of contents. For the investigation, I created a system, and human searchers were recruited to use it (Section 5.2). Searchers were asked to consider information seeking tasks and to find relevant information within XML documents. They were allowed to adjust the importance of the element features (i.e., length, depth, relevance). By adjusting these, searchers were able to alter the characteristics of the current ToC, the aim being to generate an appropriate ToC for documents in the context of the current query. These searcher preferences were recorded and analysed along with questionnaire data. The system I designed and implemented, as well as the methodology that was followed, are described in Section 5.2, followed by a detailed description of the ToC generation algorithm used (Section 5.3). Section 5.4 presents and analyses the results, and the chapter closes with a discussion and conclusions in Sections 5.5 and 5.6.
5.1.1  Terminology
Before investigating ToC generation, it is necessary to formally define several expressions as used in this chapter, as well as in Chapter 6.
• Definition i: A ToC is a set of references that is displayed to searchers. A ToC reflects the structure of the corresponding document (usually by indentation) at the level of detail found appropriate by a structure summariser (also called a ToC generator). A ToC allows searchers to access various parts of the document by clicking on the ToC items.
• Definition ii: If a reference to an element is worth displaying in a ToC, the element is called ToC-worthy; otherwise, simply, not ToC-worthy. ToC-worthy elements form the set of ToC-worthy elements; there is one such set for each XML document at a time.
• Definition iii: ToC-worthiness is the extent to which an element is ToC-worthy. It can usually be expressed as a numerical value. If this value is above a certain threshold for an element, then the element is ToC-worthy, i.e. it is in the set of ToC-worthy elements.
The definitions of ToC-worthiness are used particularly in Section 5.3 of this chapter and throughout Chapter 6. The next section discusses the experimental system used to study structure summarisation.
5.2  Experimental setup
In this section, the experimental setup of the study is presented, including the methodology, system, XML document collections and searcher tasks used. For the investigation of query-based structure summarisation, a methodology based on simulated work task situations has been followed (Borlund, 2003b). Searchers were given work task descriptions so that they could search for relevant information. During their search, they were asked to identify the best ToC of the current document with respect to the current work task by adjusting the importance of element features.
5.2.1  Methodology and system
Searchers were asked to read the work task descriptions (an example task is shown in Figure 5.1), proceed to the document view of as many documents as they wished (Figure 5.2), and adjust their preferences for three element features (length, relevance, depth) and a threshold by moving sliders on the interface. By adjusting the sliders, searchers were able to alter the characteristics of the current ToC. When they felt that the displayed ToC was helpful enough to assist them in finding relevant information, they could move on to the next document or topic. Participants were asked to fill in questionnaires before and after the experiment, they were given a detailed introduction to their task, and no time restrictions to finish the experiment were imposed. After filling in the entry questionnaire and reading the introduction, searchers were presented with the first topic description and links to the corresponding relevant documents (Figure 5.1; for the full list of tasks, see Appendix B.1). The order of the displayed topics was randomised to avoid any effect caused by one particular order. After choosing a document, the document view was shown (Figure 5.2). This consisted of four main parts:
• on the left, sliders associated with element features were shown; these were adjusted by searchers to generate ToCs;
• on the bottom left, the generated ToC was shown;
Figure 5.1: A topic description and links to its documents.
Figure 5.2: Screen shot of the main screen with sliders, ToC and element display.

• on the right-hand side, the contents of the document or element were presented; these changed when searchers clicked on an item in the ToC;
• in the top left corner, links to the topic description, the next topic and the final page were displayed.
By clicking on the "finish" link, the exit questionnaire was shown, where information about the searchers' perception of the system and of ToC generation was recorded, e.g. the strategies searchers used when adjusting the sliders on the main screen. The steps of the experiment are summarised in Figure 5.3.
Figure 5.3: The steps of the structure summarisation experiment.

5.2.2  Document collections
Documents from two XML document collections - IEEE (further described in Section 4.3.1) and Wikipedia - were used. The Wikipedia collection (Denoyer and Gallinari, 2006) is an XML version of the Wikipedia1 articles. It consists of the full texts, marked up in XML, of 659,388 articles of the Wikipedia project, totalling more than 60 GB (4.6 GB without images) and over 30 million elements. The collection has a structure similar to the IEEE collection (Section 4.3.1). On average, an article contains 161.35 XML nodes, and the average depth of an element is 6.72. The above collections were chosen for this study (and for the other studies of this thesis) because they are widely used in the XML retrieval community, because they are known to be well formed, and because their relatively clear structure allows for studies concerned with the logical structure and textual contents of documents and elements. Other types of documents, e.g. XHTML web pages, could also be used provided that they are well-formed XML documents with a clearly document-centric aim. XHTML and other document collections can be examined as part of future work.
1 http://wikipedia.org
5.2.3  Tasks
Ten topics from the INEX Ad-hoc tracks of 2005 and 2006 (Sigurbjörnsson et al., 2005; Malik et al., 2006b) were selected for this study. Five random topics from 2005 and another five from 2006 were selected and converted into work task descriptions. This selection method ensured that five work task descriptions targeted each collection. For each topic, three to five relevant
documents were selected. Relevant documents were obtained by formulating queries from the topic descriptions, submitting these queries to the TopX system2 (Theobald et al., 2005) and selecting the most relevant documents from the result list. The retrieval status values of the elements were saved to be used in ToC generation as relevance information (Section 5.3). Documents of various sizes were selected; the shortest document contained 334 bytes of text, while the longest contained 49 KB. An effort was made to select documents from both collections for each topic; however, there were only two topics for which relevant documents were found in both collections. As a result of the topic and document selection, 33 documents and 10 topics were selected from the two collections. These provide an appropriate level of diversity, ensuring that the results of this study are not biased by topic and document selection. The next section describes how structure summaries (i.e. ToCs) were generated to be tested with the experimental design described above.
5.3  Table of contents generation
In this section, the algorithm to generate ToCs is described. The structure summarisation method is based on early text summarisation (e.g. Edmundson, 1969) and on query-biased text summarisation by Tombros and Sanderson (1998). However, instead of extracting sentences from the document, the ToC generation method selects XML elements. The ToC generation algorithm aims to identify, among a set of XML elements, those that will form the ToC. It makes use of an element score that is calculated for every element under consideration. If the score of an element is higher than a certain threshold value (described below), the element is considered a ToC element (i.e. the element is ToC-worthy; for illustrative ToC examples, see Figure 5.4). Ancestors of ToC-worthy elements, i.e., elements higher in the XML hierarchy, are also used to place the ToC elements into context. For example, a section reference in a ToC without the chapter it belongs to would be just 'floating' in the ToC. The selected elements' titles are displayed as ToC items, in the order of their appearance in the original XML document. If no title is available, the first 25 characters of the text are shown, which aims to provide a better alternative to the type-sequence-number titles (e.g. Paragraph 1) used in Chapter 4. The ancestor-descendant relation of elements is reflected, as in a standard ToC, by indentation (Figure 5.2).
2 TopX is the system used in INEX for the collection exploration phase since 2006 (Larsen and Trotman, 2006; Trotman and Larsen, 2007). It was also used as the back-end system for the iTrack in 2006 (Malik et al., 2006b).
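The title-or-first-25-characters labelling and the indentation described above can be sketched as follows. The tag names (atl for article title, st for section title) follow the IEEE collection's conventions, but the snippet is an illustration under that assumption, not the system's actual code.

```python
import xml.etree.ElementTree as ET

# A tiny illustrative document; the p element deliberately has no title.
DOC = ("<article><fm><atl>An Example Paper</atl></fm><bdy><sec>"
       "<st>Introduction</st><p>This paragraph has no title element at all.</p>"
       "</sec></bdy></article>")

TITLE_TAGS = {"st", "atl"}  # assumed title tag names

def toc_label(elem):
    """Return the element's title if it has one; otherwise fall back to the
    first 25 characters of its text (untitled containers get concatenated
    descendant text, so the fallback is crude but always available)."""
    for child in elem:
        if child.tag in TITLE_TAGS and child.text:
            return child.text
    return "".join(elem.itertext()).strip()[:25]

def render_toc(elem, depth=0, lines=None):
    """Indent each ToC item according to its depth in the XML tree."""
    if lines is None:
        lines = []
    lines.append("  " * depth + toc_label(elem))
    for child in elem:
        if child.tag not in TITLE_TAGS:  # title elements are labels, not items
            render_toc(child, depth + 1, lines)
    return lines

print("\n".join(render_toc(ET.fromstring(DOC))))
```

In the actual system only the ToC-worthy elements and their ancestors would be rendered; here every element is shown to keep the labelling logic visible.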
Figure 5.4: Various generated ToCs for the topic about the impact that Albert Einstein had on politics (w5).

The score of an element is computed using three features of the element: its depth, length and relevance. The first two are element-based features, whereas the third is query-based. These features have been shown to be important characteristics in various XML retrieval tasks (Fuhr et al., 2006), although other features can also be taken into account, as discussed in Chapter 6. The depth and length features were selected to investigate the two possible dimensions of ToCs. Depth is the 'width' of a table of contents, and helps examine whether a ToC should be detailed enough to display, e.g., section-, subsection- or paragraph-level elements. The other dimension is length, with which we can examine whether searchers tend to prefer shorter or longer ToCs. Relevance is used to further emphasize elements that are related to the searcher's information need and to de-emphasize those not related to the search task. The three features and their use are further described in the next subsections.
5.3.1  Depth score
Each element receives a depth score between zero and one, based on where it is in the structure of the document. In the case of the collections used, an article element is always at depth level one (i.e. it is the root element in the tree structure). Descendants of a depth level one element are defined as being at depth level two (e.g., /article[1]/section[4]), and so on. According to what was found in the text summarisation study (Chapter 4), elements at depth level three of a ToC are the most important for accessing the relevant content, whereas the adjacent levels (two and four) were found less important, and so on. Sigurbjörnsson (2006, Chapter 8) also found, using the IEEE collection, that searchers mostly visited level 2-3 elements while looking for relevant
information. Hammer-Aebi et al. (2006) confirmed that searchers found the highest number of relevant elements at levels two to four. Since the latter work used a different XML collection (Lonely Planet (van Zwol et al., 2005)), the importance of these levels seems to be general across XML collections. To reflect these findings, the following scoring function is used to calculate an element's depth score (Equation 5.1):
Sdepth(e) = 1.00  if depth(e) = 3,
            0.75  if depth(e) ∈ {2, 4},
            0.50  if depth(e) ∈ {1, 5},        (5.1)
            0.25  otherwise
where Sdepth(e) denotes the depth score of element e.

5.3.2  Length score
Each element receives a length score, which is also normalised to one. The normalisation is done on a logarithmic scale (Kamps et al., 2004), where the longest element of the document, i.e. the root element, receives the maximum score of one (Equation 5.2):
Slength(e) = log(TextLength(e)) / log(TextLength(root))        (5.2)
where Slength(e) is the length score of element e, root is the root element of the document structure, and TextLength denotes the number of characters of the element's text.
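As an illustration, Equation 5.2 can be sketched as a small Python function; the function and parameter names are illustrative, not taken from the system's implementation.

```python
import math

def length_score(text_length, root_text_length):
    """Log-scale length score (Equation 5.2), normalised so that the root
    element, being the longest, scores exactly one. Both lengths are
    assumed to be greater than one character."""
    return math.log(text_length) / math.log(root_text_length)

# A 200-character paragraph in a document whose root holds 50,000 characters
print(round(length_score(200, 50_000), 2))
```

The logarithm compresses the range: an element orders of magnitude shorter than the whole document still receives roughly half the maximum score, consistent with the log-scale normalisation of Kamps et al. (2004).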
5.3.3  Relevance score
A score between zero and one is used to reflect how relevant an element is to the current search topic. The scores were those given by the search engine used in INEX for collection exploration (Theobald et al., 2005), i.e., a normalised retrieval status value (RSV). The RSVs were obtained in the document selection phase (described in Section 5.2). Alternatively, INEX relevance assessments could have been used to obtain relevance scores for this study. However, as TopX's performance at INEX was relatively high, and as the conversion of INEX's two-dimensional relevance scores might have introduced another problem, this alternative source of relevance values was not used for the study of this chapter.
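The text states only that the RSVs were normalised to the [0, 1] range; a minimal min-max normalisation is one plausible way to do this (an assumption, not necessarily the normalisation actually used):

```python
def normalise_rsv(rsvs):
    """Min-max normalise a list of retrieval status values into [0, 1]."""
    lo, hi = min(rsvs), max(rsvs)
    if hi == lo:
        return [1.0 for _ in rsvs]  # all elements equally relevant
    return [(v - lo) / (hi - lo) for v in rsvs]

print(normalise_rsv([0.2, 1.4, 0.8]))
```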
5.3.4  Feature weighting
The scores of the above three features are combined so that the importance of one feature over another can be emphasized. This is done using a weighted linear combination of the feature scores (Equation 5.3). Searchers are allowed to set the weights themselves. This allows us to investigate what searchers find important in ToC generation and also to determine what weights should be used to generate ToCs based on such features.
S(e) = ∑_{f ∈ F} W(f) · S_f(e)        (5.3)
where S(e) denotes the overall score of element e, F is the set of the three features, W(f) is the weight of feature f, and S_f(e) denotes the score given to element e based on its feature f. As each feature weight and score is between zero and one, the maximum of S(e) is 3. The following subsection introduces the threshold, a cut-off value relative to the maximum achievable score (i.e. the score obtained when each feature score equals one) given any set of feature weights.
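Putting the three features together, Equation 5.3 can be sketched as follows. The depth constants and the example slider weights are assumptions for illustration (the text fixes only their ordering, with level three highest), and the element representation is a plain dictionary rather than the system's actual data structure.

```python
import math

def depth_score(depth):
    """Depth scores in the spirit of Equation 5.1: level 3 highest, then
    levels 2/4, then 1/5. The exact constants are illustrative assumptions."""
    if depth == 3:
        return 1.0
    if depth in (2, 4):
        return 0.75
    if depth in (1, 5):
        return 0.5
    return 0.25

def length_score(text_length, root_text_length):
    """Log-scale length score (Equation 5.2)."""
    return math.log(text_length) / math.log(root_text_length)

def element_score(elem, weights, root_text_length):
    """Weighted linear combination of feature scores (Equation 5.3).
    `elem` is a dict with 'depth', 'text_length' and a normalised 'rsv'."""
    scores = {
        "depth": depth_score(elem["depth"]),
        "length": length_score(elem["text_length"], root_text_length),
        "relevance": elem["rsv"],
    }
    return sum(weights[f] * scores[f] for f in scores)

weights = {"depth": 1.0, "length": 0.5, "relevance": 1.0}  # example slider settings
e = {"depth": 3, "text_length": 1_000, "rsv": 0.8}
print(round(element_score(e, weights, 50_000), 3))
```

With all weights and feature scores at their maximum of one, the score reaches its ceiling of 3, matching the bound stated above.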
5.3.5  Threshold
To determine the lowest score an element must achieve in order to be included in the ToC, a threshold value is used. Like the feature weights described above, this value is set by the searchers of the system. This further allows us to determine what the desirable size of a ToC should be. In the algorithm, if the threshold is set to 100%, only elements with the maximum depth, relevance and length scores will be included in the ToC. If the threshold is set to zero, every element with a score greater than zero will be in the ToC. A default value of 50% is used. With the help of the summarisation method described in this section, the ToCs and the searchers' preferences can be examined. The results of the study are reported in the next section.
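The threshold mechanism, together with the ancestor rule of Section 5.3, can be sketched as follows. The element paths and scores are invented for illustration; the cut-off is taken relative to the maximum achievable score, which equals the sum of the feature weights since every feature score is at most one.

```python
def toc_worthy(scores, weights, threshold_pct):
    """Return the set of ToC-worthy element paths: those whose score reaches
    threshold_pct percent of the maximum achievable score, plus their
    ancestors (so that selected elements do not 'float' without context)."""
    cutoff = (threshold_pct / 100) * sum(weights.values())
    worthy = {path for path, s in scores.items() if s >= cutoff}
    # Ancestor closure: '/article[1]/bdy[1]/sec[2]' contributes its prefixes
    for path in list(worthy):
        parts = path.strip("/").split("/")
        for i in range(1, len(parts)):
            worthy.add("/" + "/".join(parts[:i]))
    return worthy

weights = {"depth": 1.0, "length": 1.0, "relevance": 1.0}  # max score 3.0
scores = {
    "/article[1]": 1.9,
    "/article[1]/bdy[1]": 1.2,                # below cutoff, kept as ancestor
    "/article[1]/bdy[1]/sec[2]": 2.4,
    "/article[1]/bdy[1]/sec[2]/p[1]": 1.0,    # below cutoff, excluded
}
print(sorted(toc_worthy(scores, weights, 50)))   # cutoff = 1.5
```

At a 50% threshold the cutoff is 1.5, so the body element enters the ToC only because it is an ancestor of a qualifying section, while the low-scoring paragraph is left out.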
5.4  Results and analysis
In this section, the results of the query-based structure summarisation study are described and analysed. First, results regarding participation (Section 5.4.1) are shown, followed by a detailed analysis of the collected data. Slider values (i.e. searchers' feature preferences) are examined, as well as the main characteristics of the generated ToCs at the point when searchers finished with a document (Section 5.4.2). Also, whether there were differences in terms of preferences
Table 5.1: Questionnaire results about searchers' expertise, task and system difficulty.

Question                                                   Average  Std. dev.
Computer usage                                             6.89     0.40
Experience with computers                                  6.44     0.90
Experience in searching                                    6.27     0.95
Years of online searching                                  6.48     1.05
How easy was it to learn to use the system?                4.85     1.51
How well did you understand how to use the system?         4.64     1.15
To what extent did you understand your task?               4.78     1.18
How easy was it to set up the sliders [...]                3.71     1.72
among searchers (Section 5.4.3), and whether differences could be found among documents of the two collections and among the topics used (Sections 5.4.4 and 5.4.5), are reported.
5.4.1 Participation and questionnaires
Fifty searchers, mainly with a computer science background, responded to the call for participation in this study. To record information only from searchers who spent a significant amount of time on the experiment (thus providing usable data), the log data were filtered so that only session logs involving at least three different documents were kept and analysed. As a result, 31 user sessions were analysed, in which participants used an average of 7.74 topics (out of the maximum of ten) and 15.58 documents (out of the maximum of 33). This gave 483 different settings, where one setting is a set of four slider values recorded after a searcher finished with a document.

In the post-experiment questionnaires, searchers indicated how easy they had found it to learn and understand the usage of the system. On a seven-point scale, where 1 meant 'was not easy at all' and 7 meant 'it was extremely easy', they indicated averages of 4.85 and 4.64, respectively. This indicates that using the system was not an issue in our study. On the same scale, searchers indicated an average value of 3.71 for the question 'How easy was it to set up the sliders to get a good table of contents?', which shows that getting the desired ToC took some effort. In their comments, searchers revealed various understandings of the sliders, but most of them had a strategy for setting them up and, according to the logs, all of the searchers used the sliders extensively. According to the questionnaires, most of the searchers found the tasks' difficulty easy or average, which indicates that they were able to focus on the ToC creation process, as their energy was not consumed by understanding the task (Figure 5.5).
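The log-filtering rule described above (keep only sessions involving at least three different documents) can be sketched as follows. The session structure and field names are hypothetical; the thesis does not describe its log format.

```python
# Minimal sketch of the session-filtering rule used to obtain usable data.

MIN_DOCUMENTS = 3

def usable_sessions(session_logs):
    """session_logs: {session_id: list of (topic, document) setting events}.
    Keep only sessions touching at least MIN_DOCUMENTS distinct documents."""
    kept = {}
    for sid, events in session_logs.items():
        documents = {doc for _topic, doc in events}
        if len(documents) >= MIN_DOCUMENTS:
            kept[sid] = events
    return kept

logs = {
    "s1": [("t1", "d1"), ("t1", "d2"), ("t2", "d3"), ("t2", "d4")],
    "s2": [("t1", "d1"), ("t1", "d1")],   # only one distinct document
}
```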
Figure 5.5: The average difficulty of tasks indicated by searchers.

Further questionnaire results are shown in Table 5.1. For the questionnaire forms of the query-based structure summarisation experiment, see Appendix B.2.
5.4.2 Sliders and ToC characteristics

5.4.2.1 Average slider values
On average, the depth, length and threshold slider values were around the default value (50), while the relevance slider was in a higher average position (75.79). The standard deviation of the depth, relevance and length sliders was the same (25), while the threshold slider had a lower standard deviation (15) (Figure 5.6). This indicates that searchers found the relevance feature more important for selecting ToC elements than the other two, i.e. length and depth. Indeed, it may be argued that the importance of relevance for accessing the relevant content is the most intuitive of the three features, but the results indicate that the other features also needed to be considered if one wanted to place the relevant elements into context, i.e. to show related content. The average threshold slider value of 50% with a standard deviation of 15 shows that the agreement among searchers regarding the threshold was higher than for the three features. In addition, no pattern of different slider values for documents of various sizes could be found; thus, the average values appear appropriate for XML documents of any size.
Figure 5.6: The average slider values after finishing with a document.

5.4.2.2 ToC size and document length
On average, ToCs consisted of 19 elements; this is, on average, 8.16% of all the elements in an original XML document. It was further examined how the size of the ToCs relates to the size of the documents. Since there was a reasonably high correlation (correlation coefficient 0.896) between the length of the documents and the number of elements they contained, these two measures are used interchangeably as 'size' in this section. It was found that the more elements a document contained, the smaller the proportion of ToC elements to document elements was (Figure 5.7). In other words, the longer a document is, the smaller the proportion of its elements that will be included in the ToC. This shows that a long document does not necessarily need a long ToC: an overly long ToC does not help searchers gain an overview of the contents of the document, because they would first need to gain an overview of the contents of the ToC. This clearly indicates that a ToC generation algorithm has to perform particularly well for longer documents, as it must be much more selective there.

To also examine the distribution of element lengths in the ToCs and in the documents, five size categories were created based on the length of text in elements (Figures 5.8 and 5.9). Regarding element lengths in documents (Figure 5.8), most of the elements cover 10-100 characters of text, i.e. there are many elements of the size of a short sentence or a few terms. The number of even shorter elements is also relatively high (these might correspond to
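The two quantities used in this analysis, the Pearson correlation between document length and element count and the ToC-to-document size ratio, can be sketched as below. The data values are invented for illustration; on its own data the thesis reports a coefficient of 0.896.

```python
# Pearson correlation and ToC-size ratio, as used in the size analysis.
from math import sqrt

def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

def toc_ratio(toc_size, n_elements):
    """Proportion of a document's elements that made it into the ToC."""
    return toc_size / n_elements

doc_lengths   = [1_000, 5_000, 20_000, 80_000]   # characters (invented)
element_count = [   40,   180,    700,   2_600]
```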
Figure 5.7: Number of elements in the documents and the ToC elements’ ratio to it.
Figure 5.8: Number of elements in documents at various size categories.

single words), and elements longer than 10KB are rare in the examined documents; such long elements can be, e.g., the root (i.e. article) element of a long document. With respect to the length of elements included in the ToCs (Figure 5.9), the numbers of very short and very long elements in ToCs tend to be low, and the category of 100-1000 character long elements contains the highest number of ToC-worthy elements; slightly shorter and longer elements are less frequent in the ToCs. These results show that the constructed ToCs reflect the original structure of the document, i.e. there are similar length distribution tendencies in the ToCs and in the documents.
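The five size categories of Figures 5.8 and 5.9 can be sketched as a simple binning of element text lengths. The exact bin boundaries are an assumption (decade-sized bins suggested by the '10-100 characters', '100-1000' and '10KB' ranges mentioned in the text).

```python
# Binning element text lengths into five size categories (assumed boundaries).

BINS = [(0, 10), (10, 100), (100, 1_000), (1_000, 10_000), (10_000, float("inf"))]

def size_category(n_chars):
    """Index of the size category a text length falls into."""
    for i, (lo, hi) in enumerate(BINS):
        if lo <= n_chars < hi:
            return i
    raise ValueError(n_chars)

def size_histogram(lengths):
    """Count elements per size category, as plotted in Figures 5.8 and 5.9."""
    counts = [0] * len(BINS)
    for n in lengths:
        counts[size_category(n)] += 1
    return counts
```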
Figure 5.9: ToC items at various size categories.
Figure 5.10: Number of elements in documents at various depth levels.
Figure 5.11: ToC items at various depth levels.
5.4.2.3 Depth distributions
The distribution of depth levels in the ToCs and in document elements has also been examined (Figures 5.10 and 5.11). Eight depth levels were considered because there were very few, if any, elements deeper than the eighth level in a document, and none of the displayed ToCs were deeper than seven levels. The distribution of element depths in documents (Figure 5.10) follows a trend similar to the length distribution: most of the elements are between the fourth and sixth level, there is only one element at level one (the root element of a document's tree structure), and very few elements are deeper than the seventh level. The depth distribution of the ToC elements (Figure 5.11) shows the same tendency (footnote 3), with the highest number of ToC-worthy elements at levels three and four, which is consistent with the requirement for ToC building found and described in Chapter 4. In other words, the depth distributions found in the ToCs and in the documents can also be considered similar.

5.4.2.4 Length and depth characteristics
Considering both the length and depth distributions, the ToCs reflect the main characteristics of the documents (as the distributions are similar). The ToCs can therefore be viewed as extracts of the document structure: not only is the algorithm based on summarisation, but its output is also a summary (namely, of the document structure).
5.4.3 Searchers
Searchers used different slider-setting strategies to generate the best ToC. Nonetheless, the majority of them set the relevance slider high, and this emphasis on relevance was confirmed in the post-experiment questionnaires. The questionnaires show that the relevance slider was usually set first, which also underlines its importance. Some of the searchers set the length, depth and relevance sliders first and changed the threshold slider slightly from document to document. In this process, setting the most appropriate threshold was found difficult by most searchers, especially because they had an 'ideal' ToC in mind that was to be reached by adjusting the sliders. According to the questionnaires, an ideal ToC contained those, and only those, elements that had been found useful while searchers were experimenting with the settings for the ToC of the current document.

Although not all searchers followed these strategies, the generated ToCs did not differ very much between searchers: apart from a few searchers, the ToCs were no longer than twenty items, and searchers also seemed to agree in terms of length categories and depth levels. Based on the above, it seems that the size of a ToC does not significantly depend on individuals. Indeed, a ToC should rarely contain more element references than some fixed value; in the case of this study, twenty. The study shows that it is important to select the best (maximum twenty) elements appropriately.

Footnote 3: The distribution resembles a bell shape, which is usually associated with the normal distribution; however, various normality tests provided inconclusive results, so normality of the distribution cannot be claimed.
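The fixed-size cap on ToC length suggested by this observation can be sketched as follows: apply the threshold, then keep at most the twenty best-scoring elements. The function and variable names are illustrative, not from the thesis implementation.

```python
# Sketch of the suggested size cap: at most the top twenty scored elements.

MAX_TOC_ITEMS = 20

def capped_toc(scored_elements, threshold):
    """scored_elements: list of (element_id, score). Apply the threshold,
    then keep only the MAX_TOC_ITEMS best-scoring elements."""
    above = [(e, s) for e, s in scored_elements if s >= threshold]
    above.sort(key=lambda pair: pair[1], reverse=True)
    return [e for e, _ in above[:MAX_TOC_ITEMS]]

scored = [(f"e{i}", i / 30.0) for i in range(30)]   # 30 invented candidates
```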
5.4.4 Collections
In Chapter 4 the work was carried out using only one collection, the IEEE collection; comparing user behaviour across more than one collection is therefore necessary. As the study of this chapter focuses primarily on ToC generation and not on user behaviour, the amount of data available for collection comparison is limited. However, if searchers generated similar ToCs for documents from both collections, this may show that the findings of the current and previous chapters apply to various XML document collections.

For the study of this chapter, slightly more documents were used from the Wikipedia collection (20) than from the IEEE collection (13), and therefore 158 slider settings were examined for the IEEE collection and 325 for Wikipedia documents. No statistically significant differences (t-test, p = 0.05) in the average slider values between the two collections were found; settings for one collection also seemed satisfactory for the other. Not only were the slider settings similar, but the generated ToCs' characteristics showed similar tendencies as well (Figures 5.12 and 5.13).

Although the structure of the two collections' documents is similar, these results are not expected to differ for other XML documents. Documents of XML collections, by definition, must have exactly one root element, which has several child elements, and so on; it is therefore expected that the algorithm described in this chapter can be used (and extended with other features) for XML documents of other collections, too. Even if the structure of other documents seems completely different from the ones used in this chapter, as long as there is a hierarchical logical structure and textual content in elements, the ToC generation methodology presented in this chapter (and in Chapter 6) can be followed, and the ToCs are still expected to be useful.
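The two-sample t-test used to compare average slider values between the collections can be sketched in pure Python. Welch's variant is assumed here (the thesis does not say which variant was used), and the slider samples are invented.

```python
# Pure-Python sketch of a two-sample t statistic (Welch's variant assumed).
from math import sqrt

def welch_t(sample_a, sample_b):
    """Welch's t statistic for two independent samples."""
    na, nb = len(sample_a), len(sample_b)
    ma, mb = sum(sample_a) / na, sum(sample_b) / nb
    va = sum((x - ma) ** 2 for x in sample_a) / (na - 1)   # sample variances
    vb = sum((x - mb) ** 2 for x in sample_b) / (nb - 1)
    return (ma - mb) / sqrt(va / na + vb / nb)

# Hypothetical relevance-slider samples for the two collections.
ieee_sliders = [70, 80, 75, 78, 72]
wiki_sliders = [74, 77, 73, 79, 76]
```

A |t| value well below the critical value for the relevant degrees of freedom corresponds to the "no significant difference at p = 0.05" finding reported above.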
Figure 5.12: Length distributions of ToCs for the two collections.
Figure 5.13: Depth level distributions of ToCs for the two collections.
Figure 5.14: Slider values for the ten topics used.

5.4.5 Topics
It was also investigated whether there were differences in settings and ToCs among the ten topics. The distribution of the slider values did not reveal great differences (Figure 5.14): the relevance values were always higher than those of the other sliders, and although the order of the depth-length-threshold triplet differed slightly for some topics, these values were always close to the default of 50. This shows that, for the ten topics used, searchers did not need very different settings to obtain an acceptable ToC. However, this number of topics does not guarantee that a wide enough range of topics and task types was covered; to compare the results for different task types (e.g., finding background information vs. answering a question) and draw generalisable conclusions, more topics from different task types are needed.

The number of ToC elements was between 14 and 27 for 8 of the 10 topics. These numbers are around 9% of the number of elements in the documents. There was one topic (w2) for which the average number of ToC elements was as low as 7, and for another topic (w5) this number was 32. These two extreme values are closely related to the number of elements in the topics' documents: documents for w2 were very short, while the documents for w5 included the longest document. This shows that longer documents may require longer ToCs, but since the ratio of the number of elements in ToCs and documents differs for shorter and longer documents (see Figure 5.7), size differences in the ToCs should not be linearly proportional to document sizes.
5.4.6 Relevance of ToC elements
To investigate whether the ToC items are also the relevant ones of a document, sets of ToC-worthy elements were compared to human relevance assessments. Since searchers were experimenting with the ToCs, the ToCs that were displayed last for a document were considered. To see whether these ToC elements were assessed as relevant by assessors, relevance assessment data from INEX have been used (Lalmas and Piwowarski, 2005, 2006). As INEX used the two collections (and the topics selected for the study) over more than one year, the assessment data had to be extracted from two sources: the INEX 2005 assessments for the IEEE topics (Kazai and Lalmas, 2005) and the INEX 2006 assessments for the Wikipedia topics (Lalmas et al., 2007b).

INEX used two dimensions for relevance assessments: exhaustivity and specificity. To construct a list of elements that are relevant in one dimension, elements with non-zero exhaustivity levels (which also implies non-zero specificity) were chosen; this is the same method that was used to obtain the official INEX agreement levels. A limitation of this investigation is that relevance assessments could only be found for 6 topics (3 IEEE, 3 Wikipedia, out of the maximum of 10) and 18 documents (out of 35); nevertheless, the analysis is expected to provide indicative results.

Previous work on INEX assessors shows that if a topic was assessed more than once, the agreement between assessors was quite low (25%) (Trotman, 2005), which is lower than the TREC agreement (50%) (Voorhees, 2000). Pharo et al. (2006) also found low agreement between INEX Ad Hoc and Interactive Track assessors. In light of these numbers, assessor-ToC 'agreement' has been calculated for this study: 27.88% of the ToC elements were present in the relevance assessments.
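The assessor-ToC 'agreement' used here, the fraction of ToC elements that also appear among the assessed-relevant elements, can be sketched as a simple set overlap. The element identifiers below are invented.

```python
# Sketch of the assessor-ToC agreement computation: share of ToC elements
# present in the set of elements with non-zero exhaustivity at INEX.

def toc_assessment_agreement(toc_elements, assessed_relevant):
    """Fraction of ToC elements that were assessed as relevant."""
    if not toc_elements:
        return 0.0
    hits = sum(1 for e in toc_elements if e in assessed_relevant)
    return hits / len(toc_elements)

toc      = ["/article", "/article/sec[1]", "/article/sec[2]", "/article/sec[3]"]
relevant = {"/article", "/article/sec[2]"}
```

Removing the elements marked 'too small' from the ToC-worthy sets before the computation gives the recalculated figure discussed below.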
Although this result (i.e., comparing ToCs to assessments) is not directly comparable to two human assessors' relevance judgements, it shows that once a relevant document is found, the ToC algorithm (footnote 4) was able to select the relevant elements of the document at a level comparable to XML retrieval's inter-assessor agreement. On the other hand, the high-level elements of relevant documents (especially the article element) were always present both in the ToCs and in the relevance assessments (by definition of both the ToC generation algorithm and the INEX assessments), which limits the reliability of the obtained agreement value.

A possible way to see whether small elements were indeed displayed in ToCs and assessed as relevant by assessors is to use the 'too small' notion of the INEX relevance assessments. To see whether elements marked as 'too small' had been selected by the ToC generation, elements marked as such in the INEX relevance assessments were removed from the investigated ToC-worthy element sets and the agreement recalculated. The previously obtained 27.88% agreement value changed only slightly, to 27.74%, which shows that the investigated ToC-worthy element sets rarely contained too small elements, i.e., the ToC items shown to searchers were not 'too small'. This means that although the generated ToCs may have missed several larger relevant elements (as 27% can be considered low agreement), they did not tend to mark 'too small' elements as ToC-worthy. The agreement levels between two humans with respect to the manual selection of ToC-worthy elements are investigated more thoroughly in Section 6.6.4.

Footnote 4: Note that the algorithm used the RSVs of one particular search engine to obtain the relevance score, i.e., not the 'perfect' set of relevant elements has been used.
5.5 Discussion
In this chapter, a structure summarisation method based on basic text summarisation has been used to select XML elements that are ToC-worthy. The summariser also used searchers' preferences for various element features. The summarisation method introduced offers a mapping from the set of elements in the document to those in the ToC, where the most important elements of the document are selected as ToC elements and the distributions of element length and depth remain similar.

During the analysis of the collected data it was found that searchers had several strategies for obtaining the best ToCs. Although not every searcher understood the concepts of the features completely, they actively used the sliders and created ToCs that, according to the questionnaires, were good enough to make accessing the relevant content of elements easier.

In the presented study, it has been found that a ToC generation algorithm that combines various element features to select ToC-worthy elements has to consider the relevance of an element; in other words, a ToC should be query-based. This is shown by the high importance of the relevance feature as indicated by searchers. However, other element features should also be considered in ToC generation, although their importance, according to searchers, is lower. The appropriate combination of all features can yield a ToC that helps searchers reach the relevant content of the XML document.

Based on this study, it is understood that ToCs should not be large in size; longer documents should still have a relatively small ToC. Long tables of contents are believed not to be helpful enough for users if relevant information is to be found quickly within documents. Automatic
ToC generation, hence, has to be designed more carefully when a longer document's logical structure is to be summarised. Based on the analysed data, it is suggested that if a ToC algorithm selects more than a certain number (e.g., 20) of ToC elements, the top-scored elements (i.e., the top 20) should be kept regardless of the threshold value the algorithm uses. If the number of ToC elements is lower than this (e.g., 10), these elements should be used to construct the ToC. These suggestions are needed for the summarisation method introduced in this chapter as, normally, searchers cannot be asked to indicate and change threshold values. The feature and threshold weights are determined automatically in the summariser described in the next chapter.

The analysed data suggest that the summarisation method introduced can be used with more XML document collections and for various search task types and topics. The data also suggest that the size of a ToC does not significantly depend on individual searchers; the selection of the most ToC-worthy elements, possibly with respect to the user's query, is more important.

To ensure better results in terms of structure summarisation, it might be necessary to consider more than three features. Element features such as element type (e.g. section, paragraph) and the presence of a title might also give clues about the ToC-worthiness of elements. The summarisation method described in this chapter can easily be extended to incorporate additional features. As the method and the combination of features ensure that very small, and thus insignificant, elements are not selected as ToC-worthy, the introduced structure summarisation is considered useful for ToC generation. It could also be used to investigate other element features. A similar summarisation method is used in Chapter 6, where users do not need to be involved directly in order to determine feature weights.
In addition, other features are also considered in Chapter 6, and the investigation is focused on several features other than the relevance of elements.
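The claim that the weighted-combination summariser is easily extended with additional features can be sketched as follows. The feature names, scoring functions and weights below are illustrative assumptions, not part of the thesis implementation.

```python
# Sketch of an extensible weighted feature combination, adding element type
# and title presence to the features of this chapter.

def make_scorer(feature_scorers, weights):
    """feature_scorers: {name: function(element) -> score in [0, 1]}."""
    def score(element):
        return sum(weights[name] * scorer(element)
                   for name, scorer in feature_scorers.items())
    return score

scorers = {
    "length":    lambda e: min(len(e.get("text", "")) / 1000.0, 1.0),
    "has_title": lambda e: 1.0 if e.get("title") else 0.0,
    "type":      lambda e: 1.0 if e.get("tag") == "sec" else 0.3,
}
weights = {"length": 0.5, "has_title": 1.0, "type": 0.7}
score = make_scorer(scorers, weights)
```

Adding a new feature only requires a new entry in `scorers` and a weight; the rest of the pipeline (threshold, size cap) is unchanged.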
5.6 Conclusion
This chapter discussed a study of searchers' preferences regarding element features in automatic ToC generation. The features and searchers' preferences can be used to select the elements that will form the ToC. Three features have been considered: the depth and length of elements, and their relevance to the current query. A user study was conducted to investigate which of these features are considered important in ToC generation, and what the characteristics of ToCs generated according to searchers' feature preferences are. The analysis of the study has also been presented and discussed, and several conclusions drawn.
As, for the purposes of the study, feature importance (as weights) was determined manually by searchers and only three features were used, further investigation into structure summarisation is needed. The next chapter investigates various other features and introduces an improved structure summarisation method in which feature weights are determined through training.
In this chapter, the selection of XML element features and the features’ use for query independent structure summarisation is investigated. It is assumed that the document is being browsed without any particular information need in mind (or that the query of the searcher is not available, for example, when the structure summary display system is independent from the retrieval system). In this context, several element features are investigated, analysed and their effectiveness in ToC generation evaluated.
6.1 Introduction
In Chapter 5 we saw that the relevance of an element is very important for creating a ToC, but it is also necessary to consider other features that may be useful for ToC generation. Depth and length are two important features, but there might be more to take into account. The logical structure encoded in XML documents makes it possible to use and examine features that might help identify ToC-worthy elements, possibly in addition to the features mentioned above. For example, similarly to the location method of text summarisation, the first child elements of an XML element might be more important to refer to in a ToC. An element might also be more ToC-worthy if it has a child element containing the title text for the element in question, and so on.

Also, it can often be the case that a user simply wants to browse within documents and does not have a particular query or information need in mind (Dodge, 2005). In this case, relevance-to-query feature values cannot be obtained and used. When creating tables of contents for this scenario, features other than query relevance become more important. Hence, it is important
to investigate how to select these features and evaluate their appropriateness. Since relevance is more important than other features in a searching scenario, these other features can be better compared and evaluated when relevance to a query is not applicable, for example, when a user simply wants to gain an overview of a document without having any specific information need in mind.

To the best of my knowledge, no data sets (test collections) are available for structure summarisation with which features could be evaluated. As building test collections is expensive, resources need to be used as effectively as possible. For example, when a summariser is trainable, such as the one introduced later in this chapter (Section 6.4), we might consider substituting the training data set (i.e. the structure summaries the system uses to learn) with some other, already existing, training data. For instance, it might be useful to train a summariser using content-based relevance assessments from XML IR (Lalmas and Piwowarski, 2007) or to take advantage of highly scored retrieval result sets (runs) (Lalmas and Tombros, 2007a).

In this chapter, query independent structure summarisation, i.e. the scenario where no relevance-to-query information is available, is investigated. The main research questions investigated are the following.

• What features other than depth and length can be considered for structure summarisation? There are a number of possible features to investigate; it is necessary to preselect a list of features that are then evaluated for ToC generation.

• Which features are useful in query independent structure summarisation? Various features and feature combinations are to be investigated and evaluated to create high quality ToCs.

• How can a query independent structure summariser be trained? As creating training and test sets using humans is costly and time consuming, are there other data sets that can be used to train a structure summariser? Two other training methods, in addition to training with manually created ToCs, will be examined.

• Which training methods work better than others? Do the investigated training methods produce ToCs of very different quality? Does at least one of the investigated training methods result in ToCs of comparable quality to those generated by training with manually created ToCs?

The chapter is structured as follows. Section 6.2 discusses the role of metadata, as possible
Figure 6.1: The process and stages of selecting features for structure summarisation.

features, and how they can be classified in the context of XML documents. Section 6.3 describes an analysis of several pieces of metadata that might later be considered for structure summarisation as element features. Section 6.4 introduces the main contribution of this chapter, i.e. query independent structure summarisation for XML documents. In Sections 6.5 - 6.8, summariser training, the evaluation methodology and the evaluation results are presented. The chapter closes with a discussion of the summarisation results in Section 6.9 and the conclusions of the chapter (Section 6.10).
6.2 Metadata and element features
When investigating possible features that can be used in structure summarisation, it is first necessary to find ways to describe elements that might distinguish them from other, either less or more ToC-worthy, elements. In other words, descriptors are needed: data about elements. When a set of descriptors has been identified, individual descriptors can be examined and a decision made as to whether they are to be used in ToC generation as features. The initial set of descriptors can also be referred to as metadata. In this thesis, any data that can be obtained about an XML element (see Table 6.1 for examples) is referred to as metadata. The word feature is used whenever a particular piece of metadata, e.g. element length, is found to be worth investigating in ToC generation. Figure 6.1 shows how features can be selected for summarisation purposes.
6.2.1 Metadata
The most general definition of metadata is “data about data”. If we consider an element or document as data, there are a number of properties (“data”) to be found about them. Metadata can be explicitly embedded into documents, e.g. there can be an ‘author’ field in a document, but they can also be implicit; for example, the length of a document is not within the document but can easily be calculated. Metadata have also been classified as embedded, associated and
third party metadata, or as tokens versus labels, where labels are human-understandable metadata and tokens may be written in a particular language. They can be objective (author, file size) or subjective (keywords, summary) (Duval et al., 2002). The classification presented in Section 6.2.2 is based on the classes of explicit and implicit metadata, where the implicit class is split into two classes that can be better interpreted for XML documents and elements. Such a metadata classification might be useful for structure summarisation: for example, metadata from one of the possible classes might make better features for ToC generation. Although it is not among the aims of this chapter to compare the various metadata classes, the classification provides an overview of metadata (i.e. possible summarisation features). This overview can then be used to select several pieces of metadata for investigation in ToC generation.
6.2.2 Metadata classification for XML access and retrieval
For XML in particular, where elements are hierarchically structured and not only the whole document but also the individual elements can be described with various metadata, these metadata can be classified into three categories.

The first class can be called given metadata. It corresponds to the class called explicit in Section 6.2.1, i.e. metadata that are explicitly given in the XML document. For example, the title text of an element (if it exists) or its type (such as section or paragraph) are given metadata.

The second group can be called extracted metadata. Data that can be extracted from the XML element about the element belong to this class; it can be considered a subclass of the previously discussed implicit metadata class. The extracted class contains metadata such as the length and depth of an element. The retrieval status value or a textual summary of the element's text also belong to this class, although these metadata are extracted using various complex algorithms.

The third class is called derived metadata. Derived metadata are those implicit metadata that are about the element but are derived from the metadata of other, related, elements. For example, the author of an element is the same as the author of the document, but this information has to be derived from one of the metadata of the whole document.

Several examples are shown in Table 6.1. Note that some of the metadata (e.g., author) can be both given (typically for a document) and derived (in the case of elements), depending on the context in which they are used.
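Two typical pieces of extracted metadata, element depth and text length, can be computed for every element of an XML tree with the standard library, as the following sketch shows. The sample document is invented.

```python
# Minimal sketch of computing 'extracted' metadata (depth, text length)
# for every element of an XML document.
import xml.etree.ElementTree as ET

def extracted_metadata(root):
    """Return {element: {"depth": ..., "length": ...}} for the whole tree."""
    meta = {}
    def walk(elem, depth):
        text = "".join(elem.itertext())          # all text under the element
        meta[elem] = {"depth": depth, "length": len(text)}
        for child in elem:
            walk(child, depth + 1)
    walk(root, 1)   # the root element is at depth one
    return meta

doc = ET.fromstring("<article><sec><title>A</title><p>Hello</p></sec></article>")
meta = extracted_metadata(doc)
```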
Given: title text; author; attributes and their values; category; publisher; ...
Extracted: rank; retrieval status value; length; summary; connected elements via links; topic shifts; ...
Derived: parent element; sibling elements; author; version; category; parent rank; ...
Table 6.1: Examples of some given, extracted and derived metadata of XML documents and elements.

In information retrieval, several of these metadata are already selected and used as features (though they might not always be referred to as such). For example, the retrieval status value (RSV) is used to rank the results, and the length of a document is used (among many other things) to normalise values such as the term frequencies of documents. Some metadata are used to calculate other metadata (e.g. term occurrences to calculate the RSV), while others can be displayed to users directly to help them find the relevant content more easily. Classical examples of user- and display-related metadata are the title and the summary, which are displayed to help users reach the relevant content (e.g. (Turpin et al., 2007)). Displaying document categories (which are also metadata, as they describe a document) and organising the results around them has also led to improved performance over flat list presentation (Dumais et al., 2001).

Summaries, which are of particular interest to this thesis, can themselves be described as metadata, but in this chapter the emphasis is on selecting other metadata that are not necessarily useful for searchers if displayed directly (e.g. the number of child elements an element has is not considered informative) and using them in structure summarisation as features. When these selected metadata are combined and used to generate structural summaries (to be displayed to searchers as ToCs), they become useful. Nevertheless, there are many kinds of metadata that are used neither in retrieval nor in any form of summarisation, though it might be useful to do so. As the XML structure brings even more possible metadata into the overall set, it is necessary to select, analyse and filter them for the above mentioned applications.
The metadata classes of this section serve as an overview of the types of metadata that are available for use in structure summarisation, and elsewhere where metadata of elements might be used effectively. Pieces of metadata from individual classes may affect structure summarisation differently; nevertheless, as this chapter is only a first step in structure summarisation, the investigation of metadata classes is considered for future work.
6.2.3
Feature selection for structure summarisation
When developing methods for automatic ToC generation, it is important to identify features of XML elements that can be used. In Chapter 5, element length, depth and relevance were chosen and built into the structure summarisation algorithm. However, other metadata can also be selected and incorporated as features; they might prove to be even more helpful than the above mentioned three features for generating high quality tables of contents. To find out which metadata can serve as useful features for ToC generation, one can start with heuristically chosen metadata from the classification table above. However, as there is a large number of pieces of metadata that could be incorporated into a structure summariser, it is more practical to make an initial feature selection before experimenting with combinations and measuring the effectiveness of the various features. A piece of metadata is a good feature if it is useful in distinguishing ToC-worthy elements from not-ToC-worthy elements. Ideally, good features could be selected by analysing existing ToC-worthy element sets and counting the frequencies of feature occurrences in ToC-worthy and not-ToC-worthy elements. However, having no access to such element sets at the beginning of the work introduced in this chapter (as ToC generation test collections do not exist and creating a large set to experiment on is expensive), other data sets have been selected for the analysis of possible features. In the next section, a metadata analysis for structure summarisation is introduced that uses an alternative data set to explore possibly useful features. This analysis is followed by the proper evaluation of features, i.e. evaluation using real ToCs (Section 6.4), as well as an investigation into the validity of alternative data sets for feature and (query independent) summarisation evaluation.
6.3
Retrieval result analysis for feature selection
The lack of structure summarisation data creates the need to search for other, alternative, data sets. To study the various feature candidates (metadata), XML retrieval result sets have been chosen. Retrieval result sets are suitable for this purpose, as they contain lists of XML elements that are considered relevant to certain information needs. For other feature selection methods, which are not used in this thesis, one can refer to the machine learning literature, e.g. (Blum and Langley, 1997).
A retrieval result set, also called a retrieval run, in XML retrieval consists of a series of, usually ordered, lists of elements that are considered relevant by a particular retrieval system to information needs called topics. A sample of a retrieval result list is shown in Figure 6.2, where the top two elements (with rank and RSV values, identified by their corresponding document numbers and XPath expressions) for topic number 289 are shown.

Using XXXX with the XXX similarity measure [...]
wikipedia
294447 /article[1] 1 21.836609
2281211 /article[1] 2 18.775463
[...]
Figure 6.2: A sample from an example run.

In this chapter, it is assumed that the tendency of an element being relevant (in IR) indicates its being ToC-worthy (in structure summarisation).1 The assumption is based on the idea that it is the generally relevant elements that are worth including in tables of contents. For example, a section that is returned by a search engine because it was considered relevant to an information need is, generally, also worth including in the ToC of the document. Although it might be argued that a retrieval result is highly query biased, whereas the topic of this chapter is query independent summarisation, it is believed that obtaining high numbers of retrieval results and analysing generally relevant element types can lead to useful information regarding query independent structure summarisation. In other words, averaging the occurrence of features in relevant XML elements of retrieval results provides indications for the type of summarisation described in this chapter.

1 In a context where the query is not present, relevance to the query cannot be interpreted. However, if a number of elements are investigated and statistics about generally relevant elements are gathered, this information can be used to estimate the ToC-worthiness of particular elements, hence transforming relevance into ToC-worthiness.

The averaging method described above allows gaining positive information about ToC-worthy elements, i.e. what ToC-worthy elements are like. However, it is better if negative information, i.e. what not-ToC-worthy elements are like, can also be examined to increase distinguishing power. To obtain negative information about elements, it might be effective to compare elements that are in high quality retrieval results with those of low quality results. By analysing runs, we can learn what retrieved elements have in common and, in addition, what is distinctive about elements that appear in high quality runs only ("high quality" is defined later in Section 6.3.2). The following subsections describe the retrieval result sets used, the preparation of these data for the purposes of this section, and the analysis of various metadata.
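The run excerpt in Figure 6.2 suggests a simple per-result line format: document number, element XPath, rank and RSV. A minimal parser for such lines might look as follows (a sketch; the actual INEX submission format includes further fields, such as topic and collection identifiers, which are omitted here):

```python
from typing import NamedTuple

class RunEntry(NamedTuple):
    doc: str      # Wikipedia document number
    xpath: str    # element path within the document
    rank: int     # position in the result list
    rsv: float    # retrieval status value

def parse_run_lines(lines):
    """Parse 'doc xpath rank rsv' lines into RunEntry records,
    skipping headers and anything not matching the four-field shape."""
    entries = []
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # header or malformed line
        doc, xpath, rank, rsv = parts
        entries.append(RunEntry(doc, xpath, int(rank), float(rsv)))
    return entries

# The two result lines from Figure 6.2.
sample = [
    "294447 /article[1] 1 21.836609",
    "2281211 /article[1] 2 18.775463",
]
run = parse_run_lines(sample)
```

Records of this shape are all that the analyses in the following subsections need: the document number and XPath identify the element, and the rank supports cut-off based filtering.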
6.3.1
The INEX Relevant In Context result set
Submitted retrieval results, called runs, from the INEX Relevant In Context task, which "required systems to return for each article an unranked set of non-overlapping elements, covering the relevant material in the document" (Lalmas and Tombros, 2007a), have been chosen for analysis. Figure 6.2 shows a sample from one of these runs. The reasons for choosing this set are the following.

1. The Wikipedia collection (Denoyer and Gallinari, 2006) that is used in INEX (including the Relevant In Context task) has documents with various given metadata, such as categories and links to other documents within the collection. The previously used INEX collection, the IEEE collection (Malik et al., 2005), was not as rich in metadata and, in addition, XML retrieval evaluation has evolved since its usage (Lalmas et al., 2007a).

2. There are 72 different runs submitted by INEX participants. The evaluation scores of high and low quality runs are substantially different, which is useful for the analysis described in this section, as reasonable differences between elements submitted in various quality runs are expected to be observed.

3. There are 125 different information needs in the collection of these runs, which makes the size of the analysis data sufficient for the investigation reported in this section.2

4. The Relevant In Context task combines traditional document retrieval (returning a list of articles) and element retrieval (within articles, returning elements "covering the relevant material in the document"). Relevance judgments for elements within a document can also be considered as an overview of relevant elements, and thus can also be used for ToC generation.

2 In INEX 2006, only 114 of these topics were assessed by human assessors. Although the evaluation at INEX was done using these 114 assessed topics, it is believed that this number is high enough to rank runs submitted by INEX participants, and so runs for all 125 topics are analysed in this section.
6.3.2
Data preparation
To prepare the data for analysis, high and low quality runs first need to be selected. In addition, average quality runs are also selected to make sure that the information extracted from high quality results is indeed of higher than average quality. There are several official measures for evaluating the Relevant In Context runs (Lalmas et al., 2007b). Although there are no striking differences between the orderings and scores of runs obtained using the various measures, a merged list of runs was created, based on their average position in the individual run lists ordered by the evaluation measures (#MAgP, gP[5], gP[10], gP[25] and gP[50], respectively (Lalmas et al., 2007b)), to balance out the effect of individual measures. The top runs from this merged list form the set of high quality runs, the bottom runs are called low quality runs, and the middle ranked results are referred to as average runs in this chapter. The chosen runs and the selection process are shown in Appendix C.1. The absolute quality of the runs (i.e. their distance from an imaginary perfect run) naturally influences the quality of 'good' runs and the quality of information we can obtain by analysing them. However, if differences between various quality runs can be found, the analysis can give information about what to avoid in structure summarisation (and retrieval). It is also important to consider only one run per submitting INEX participant, as result sets from a particular participant are often obtained by applying various versions of the same retrieval approach, e.g. with only the parameters of their formula differing. Also, runs from the same participant often contain the same relevant elements but in different order (and such order is not of interest for the summarisation used in this thesis). Following the selection procedure described above, six runs were selected for each of the three groups of runs.
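The merging of per-measure orderings described above can be sketched as follows: each measure contributes a ranked list of run identifiers, and runs are reordered by their average position across those lists (the run names and orderings below are invented for illustration):

```python
def merge_rankings(rankings):
    """Order runs by their average position across several per-measure
    rankings; a lower average position means a better run."""
    positions = {}
    for ranking in rankings:
        for pos, run_id in enumerate(ranking):
            positions.setdefault(run_id, []).append(pos)
    avg = {run_id: sum(p) / len(p) for run_id, p in positions.items()}
    return sorted(avg, key=avg.get)

# Hypothetical orderings produced by three different measures.
by_measure_1 = ["runA", "runB", "runC"]
by_measure_2 = ["runB", "runA", "runC"]
by_measure_3 = ["runA", "runC", "runB"]
merged = merge_rankings([by_measure_1, by_measure_2, by_measure_3])
```

The top, middle and bottom slices of `merged` would then give the high, average and low quality run groups.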
After selection, the filtering of individual runs is also needed, because users of an IR system rarely access, and thus read, results below the top 20 (Jansen et al., 1998). As only the result sets themselves are examined, and it is not known whether a search interface would display all the result elements from a document or, e.g., only the top element from that document (Fuhr et al., 2002b), the filtering should be based on documents instead of elements. To do so, the elements of the top 25 documents from each run were taken. Most of the runs contained at least 25 result documents for each topic.
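The document-based cut-off described above can be sketched as follows: walk the ranked element list in order and keep an element only if its document is among the first n distinct documents encountered (n = 25 in this section; the (document, XPath) pair representation is an assumption of this sketch):

```python
def top_n_documents(results, n=25):
    """Keep only result elements belonging to the first n distinct
    documents encountered in rank order."""
    kept, seen = [], []
    for doc, xpath in results:
        if doc not in seen:
            if len(seen) < n:
                seen.append(doc)
            else:
                continue  # element of a document beyond the top n
        kept.append((doc, xpath))
    return kept

ranked = [("d1", "/article[1]"), ("d2", "/article[1]"),
          ("d1", "/article[1]/section[1]"), ("d3", "/article[1]")]
filtered = top_n_documents(ranked, n=2)
```

Note that later elements of an already-kept document are retained even if they appear after the cut-off document, which matches filtering by documents rather than by elements.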
6.3.3
Analysis of run sets
The goal of the retrieval result set analysis is to examine several metadata and to find out whether they tend to occur at different frequencies in runs of various quality, thus allowing us to obtain information about their discriminative power. The set of investigated metadata contains the following items:

• Depth of retrieved elements in the hierarchy of the logical structure, e.g. the whole article is at depth level one, while sections are mostly at levels 3 and 4.

• Length of the retrieved elements' text, in characters. Length, as well as depth, were selected as these features are known to be useful in ToC generation (Chapter 5).

• Type of retrieved elements, e.g. section, paragraph, web link, section title, list item, etc. Elements of a given type often vary in length and depth level, e.g. sections are sometimes at depth level three while they can also be at levels two or four. Although not independent from the above two features, i.e. length and depth, type information is used because searchers have previously indicated that the titles of elements in ToCs are important (Chapter 4).

• Sequence number information, e.g. for the element with XPath //section[3] the number three is recorded (section[3] identifies the third section among a number of sections having the same parent element). The sequence number is investigated to find out whether a method similar to the location method in text summarisation can be identified for structure summarisation, i.e. the first elements of a parent element might naturally be more important than the following elements.

• Out-links, i.e. the number of links pointing from the current element. Out-link information is often used in information retrieval, e.g. in the PageRank algorithm (Page et al., 1998). This feature can be useful for selecting ToC-worthy elements, as well as for element retrieval.
• Self references, i.e. whether there is a link in the current element that links back to the whole document it is in. It might be important if an element points back to the whole document. This is a highly collection specific piece of metadata; nevertheless, it was selected for the investigation in this section.

• Category links, i.e. whether there are links pointing to documents that are in the same Wikipedia category as the current document. Categorisation has been proved to be useful in IR (Chen and Dumais, 2000). Selecting category links is a way of capturing how closely an element is related to other documents from the same category. If an element has a number of such links, it might discuss a central topic of its category and thus can be considered important and ToC-worthy.

There are, obviously, many more metadata that could be investigated as possible structure summarisation feature candidates (e.g. those in Table 6.1 on page 135). However, the analysis presented in this section only aims at exploring a heuristically chosen set of metadata, as well as at finding out whether indications for ToC generation can be found by the analysis of runs. Several of the features identified above, as well as other features, will be properly evaluated in Section 6.7, where the usefulness of retrieval runs in the context of ToC generation will also be examined.

The run data are prepared for analysis by first calculating the average values of the above listed metadata over the elements of various quality runs. In addition, for several features, the feature value distributions are calculated for a random element sample from the document collection, and for the set of highest ranked elements in high quality runs. The feature value distributions in runs are reported in the remainder of this section. The results of the analysis of these seven metadata indicate that there is a uniformity among good runs with respect to several metadata.
This uniformity of good runs, which follows from the observation that the standard deviation of occurrences in low quality runs is higher, suggests that there is a general direction towards producing higher quality results. Thus, information from high quality runs can be valuable. Figure 6.3 shows that element depth is lower for high quality runs than for others (the difference is statistically significant according to t-tests at the 0.05 level). However, even low quality runs return elements that are not as deep as the collection average (estimated using 4,255,600 elements from a randomly selected set of 19,655 documents from the collection). As the top 20 elements of high quality runs show an even lower average depth, indicating a possible preference towards shallower elements, it seems justified to use depth as a feature for ToC generation. Lower element depth values, and depth level three in particular, might indicate ToC-worthiness. The feature will be further investigated and its discriminative power evaluated in Section 6.4.
Figure 6.3: The average depth of result elements of various quality runs.

Length will also be used in Section 6.4. As Figure 6.4 shows, the average returned element is significantly longer than the average element of the collection (represented by a random selection of elements). Since the average length of the top 20 high quality results is even higher, length is worth investigating as a feature in ToC generation (apart from being intuitively worth investigating, as long documents usually have longer ToCs). Figure 6.4 also shows, however, that the standard deviation of length in various quality runs is high, which means that systems returned elements of different lengths, from very short elements to long ones (e.g. whole articles, which were long themselves).

Table 6.2 shows the most frequent element types in various quality runs. Although the most frequent element types are almost the same for low, average and high quality runs, these types are clearly not the same as the most frequent types of the document collection (see the 'Random' column of most frequent elements in a collection sample). This shows that there are element types that are more likely to be relevant, hence probably useful in ToC-worthiness identification. However, identifying only these types for ToCs is not enough, as low quality runs also identified these types of elements yet were not of high quality. The top four element types of the top 20 high quality runs set will be used as one feature in Section 6.4. The feature will be binary: true if an element's type is among the top four, i.e. it is an article, section, p (paragraph) or body element, and false otherwise. Such binary features are widely used in text summarisation.
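Features of this kind (type, depth, length and sequence number) can all be computed in a single pass over a parsed document. A sketch using Python's ElementTree (the tag names follow the Wikipedia collection's conventions; the miniature document is invented for illustration):

```python
import xml.etree.ElementTree as ET

# The four top types identified from the high quality runs above.
TOC_WORTHY_TYPES = {"article", "section", "p", "body"}

def element_metadata(root):
    """Return (type, depth, text length, sequence number, top-type flag)
    for every element; depth 1 is the root, and the sequence number
    counts same-type siblings, as in the XPath step section[3]."""
    records = []
    def walk(elem, depth):
        counts = {}
        for child in elem:
            counts[child.tag] = counts.get(child.tag, 0) + 1
            length = len("".join(child.itertext()))
            records.append((child.tag, depth + 1, length,
                            counts[child.tag],
                            child.tag in TOC_WORTHY_TYPES))
            walk(child, depth + 1)
    records.append((root.tag, 1, len("".join(root.itertext())), 1,
                    root.tag in TOC_WORTHY_TYPES))
    walk(root, 1)
    return records

doc = ET.fromstring(
    "<article><body><section><p>First.</p></section>"
    "<section><p>Second.</p></section></body></article>")
meta = element_metadata(doc)
```

All four quantities are cheap to compute, in line with the requirement (stated in Section 6.5.1) of avoiding computationally expensive features.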
Figure 6.4: The average length of result elements of various quality runs.

The top four element types were selected as these are the most frequently occurring types in the latter set, and there is a drop in occurrence at the fifth most frequent element type.

High quality runs: p 38.67%, article 17.54%, normallist 4.36%, item 3.72%
Average quality runs: p 45.95%, item 5.99%, body 3.43%, article 3.38%
Low quality runs: p 48.17%, article 8.11%, item 7.78%, body 7.01%
Random sample: collectionlink 34.51%, unknownlink 8.11%, cell 6.23%, emph2 5.96%
Top20High: article 39.28%, p 22.30%, body 4.58%, normallist 2.27%

Table 6.2: The most frequently occurring element types and their occurrences in the analysed runs.

The analysis of the sequence numbers of result elements seems to reveal a "location method of structure summarisation" (an example is shown in Figure 6.5). In traditional text summarisation, the location method exploits the property of documents that the first (or, sometimes, last) sentences are usually more indicative of the contents of the document (Edmundson, 1969; Kupiec et al., 1995; Teufel and Moens, 1997). This method could be applied to structure summarisation as well. However, Figure 6.5 is misleading. Investigation of the average sequence number distribution in the collection reveals that although, e.g., first sections are returned more frequently than second sections, the proportion of these sections in the collection is the same as in the retrieval runs. This means that, according to the run analysis and the assumption of relevance and ToC-worthiness being equivalent, the sequence number is not a good indicator of ToC-worthiness and should not be used as a feature in ToC generation. However, to verify this, and to check the validity of the assumption mentioned above, the sequence number will be used in Section 6.4, where its presence is not expected to have a significant effect on the quality of the ToCs generated.
Figure 6.5: The occurrence of section type elements in runs. The first sections are more often returned in results than, e.g., second, third or fourth sections.

Table 6.3 shows the average numbers of various out-links. Low quality runs return elements containing significantly more links than higher quality runs. This shows that elements with high numbers of links might not be the best for retrieval, and possibly not for ToC generation either. As Table 6.3 also shows, the number of out-links in the elements returned by high quality runs is higher than that of average runs, so there is no clear tendency towards the significance of a high or low number of out-links. Although the out-links metadata are not used in the next phase of structure summary generation and evaluation because of this inconclusiveness, they might be interesting to investigate in the future.

Link type     High      Average   Low
Collection    18.192    11.005    137.482
Wikipedia     0.144     0.142     2.792
Redirect      0         0         2E-05
Unknown       4.241     2.492     82.662
Outside       1.329     0.723     1.822
Web           0.042     0.026     0.043
Language      1.003     0.481     0.950
All           24.952    14.871    225.752

Table 6.3: The average number of out-links.

The next piece of metadata to be investigated is the proportion of self-references, which is
defined as the ratio of elements in runs that contain links to the whole document they are in. Figure 6.6 shows that low quality runs tended to return more article level elements which had links to themselves. As no significant difference has been found for element types other than article, and ToC generation does not focus on the article level only, this metadata is not selected for the further investigation reported in the next section.
Figure 6.6: The ratio of result elements that contain links to the whole document they are in.

The last metadata analysed here is the number of links pointing to documents that are in the same category as the current document. As Figure 6.7 shows, the results are similar to those of the out-links metadata: low quality runs' elements are more scattered, with no uniformity in terms of the number of links. Unfortunately, the standard deviation of the high quality runs is also quite high, which shows that high quality runs' elements cannot be classified well based on their number of links pointing to documents in the same category, i.e. the number of these kinds of links is not a good descriptor of an element.

Figure 6.7: The average number of elements linking to documents that are in the same Wikipedia category as their documents.

To sum up, seven metadata have been investigated for ToC generation using data from information retrieval. Some of them, e.g. depth, clearly show a tendency that makes them candidate features in structure summarisation, while others, e.g. sequence numbers, do not seem to be useful based on the analysis of retrieval runs. As these seven metadata constitute only a tiny proportion of possible features, many other features and feature combinations might be worth examining in the future, if needed. The next sections introduce several other metadata and investigate their effectiveness for ToC generation with the use of a probabilistic structure summarisation method adopted from standard text summarisation.
6.4
Probabilistic structure summarisation
In the previous section, various metadata that might be used as features in structure summarisation were explored. One of the aims of this section is to answer the main question of this chapter: "which element features are useful for ToC generation?". To this end, a trainable and query independent probabilistic structure summarisation method is described. It is also investigated which data sets can be used for summariser training and which training method works better with various feature sets. The evaluation of the results is presented in Section 6.7. Figure 6.8 shows an overview of the steps of the investigation and how the data is used. As Figure 6.8 shows, three sets of data are considered for training the structure summariser: retrieval runs, IR relevance assessments and manually created ToCs. Training is done with respect to a set of features. The manual ToCs are also used for evaluating automatically generated ToCs. How the manual ToCs are created is also reported: for a number of XML documents, human users are asked to 'build' ToCs by selecting elements they think are ToC-worthy (for the definition of ToC-worthiness, see Section 5.1.1). The next section describes the central subject of this section, i.e. how structure summarisation is carried out.
Figure 6.8: Overview of the query independent structure summarisation experiment.
In Section 2.4, two basic summarisation methods were described. The first method, introduced by Edmundson (1969), has already been adapted to structure summarisation in Chapter 5. The second method, introduced by Kupiec et al. (1995), uses probabilistic sentence classification to extract the sentences that are worth including in a textual summary. The structure summariser described in this chapter is a slight modification of the latter method: it extracts elements from XML documents as opposed to sentences from flat documents. The selected elements are then used to construct a table of contents. With this summarisation method, Bayesian classification (Geiger et al., 1997) is applied as follows. For each XML element e, the probability that it will be included in a structural summary (ToC) T given k features Fj (j = 1..k) is computed. This can be expressed using Bayes' rule as shown in Equation 6.1:
P(e ∈ T | F1, F2, .., Fk) = [P(F1, F2, .., Fk | e ∈ T) · P(e ∈ T)] / P(F1, F2, .., Fk)    (6.1)
where P(e ∈ T) denotes the probability that element e is ToC-worthy, P(F1, F2, .., Fk | e ∈ T) is the probability of the k features being observed given that element e is ToC-worthy, and P(F1, F2, .., Fk) is the probability of observing the k features. The naive Bayes assumption is then applied, i.e. the features are assumed to be statistically independent with respect to ToC-worthiness (Equation 6.2).
P(e ∈ T | F1, F2, .., Fk) = [∏_{j=1}^{k} P(Fj | e ∈ T) · P(e ∈ T)] / ∏_{j=1}^{k} P(Fj)    (6.2)
P(e ∈ T) has the same value for each element. P(e ∈ T), as well as P(Fj | e ∈ T), which is the probability of the jth feature being observed given that element e is ToC-worthy, can be estimated directly from the training set. P(Fj), j = 1..k, does not need to be estimated at all because of the classification method used (see Equation 6.3 below). The probability estimation is done by counting feature occurrences; the estimation process is further described in the next section. The Bayesian classification function assigns a score to each element e which can be used to select elements for inclusion in a structural summary, as described below.
The element selection (classification) function is shown in Equation 6.3. With the help of this function, we can calculate whether an element is more likely to be ToC-worthy (one of the two classes) or not-ToC-worthy (the other class). If the number obtained using Equation 6.3 (i.e. the element's score, the log odds of ToC-worthiness) is higher than zero, i.e. the element is more likely to be ToC-worthy than not-ToC-worthy, element e is included in the table of contents.
log [P(e ∈ T | F1 = f1, F2 = f2, .., Fk = fk) / P(e ∉ T | F1 = f1, F2 = f2, .., Fk = fk)] =
log [(∏_{j=1}^{k} P(Fj = fj | e ∈ T) · P(e ∈ T)) / (∏_{j=1}^{k} P(Fj = fj | e ∉ T) · P(e ∉ T))]    (6.3)
where P(e ∉ T) denotes the probability that element e is not-ToC-worthy, and fj is the value ('bucket', explained in the next section) of feature Fj observed for element e. In the summariser introduced in Chapter 5, it was assumed, and built into the summarisation method, that if an element is found to be ToC-worthy then all its ancestors are also ToC-worthy. This served to place, e.g., a subsection into the context of the chapter and section it is in. The summariser described in this chapter, however, does not use this assumption. Instead, it is assumed that if the training is done adequately, no ancestor-descendant assumptions need to be introduced, as the summariser will be able to select the appropriate ToC-worthy elements. The trained summariser is still expected to mark the ancestors of a ToC-worthy element as ToC-worthy, but this does not need to be assumed and built into the summariser. Neither is the above assumption of aggregating ToC-worthiness used when manually selected ToC-worthy element sets are created (Section 6.6) and used for training (Section 6.5.4) and evaluation (Section 6.7). Similarly, the data for the other two training methods, introduced in the next section, are not modified. The training data are used to estimate the probabilities described above; this is done by counting occurrences. Training is explained in the next section, where three training data sets are used: retrieval runs, relevance assessments and manually created ToCs. The section also presents the features F1, F2, .., Fk that are used in the summariser.
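Equation 6.3 translates directly into code: sum the per-feature log likelihood ratios, add the log prior odds, and include the element in the ToC if the total is positive. The following sketch assumes the probability tables have already been estimated as described in the next section (the feature names and probability values here are invented for illustration):

```python
import math

def toc_worthiness_score(feature_values, p_toc, p_feat_toc, p_feat_not):
    """Log odds of ToC-worthiness (Equation 6.3). feature_values maps a
    feature name to its observed value (bucket); the two tables map
    (feature, value) pairs to estimated conditional probabilities."""
    score = math.log(p_toc) - math.log(1.0 - p_toc)  # log prior odds
    for feat, value in feature_values.items():
        score += math.log(p_feat_toc[(feat, value)])
        score -= math.log(p_feat_not[(feat, value)])
    return score

# Invented estimates for two binary features.
p_toc = 0.2
p_t = {("has_title", True): 0.9, ("top_type", True): 0.8}
p_n = {("has_title", True): 0.3, ("top_type", True): 0.4}

score = toc_worthiness_score({"has_title": True, "top_type": True},
                             p_toc, p_t, p_n)
include_in_toc = score > 0  # the decision rule of Equation 6.3
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and makes the greater-than-zero decision rule explicit.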
6.5
Training the summariser
Ideally, a summariser used to select elements for ToCs should be trained using a set of elements known to be either ToC-worthy or not-ToC-worthy. Example ToCs are such sets, as elements in the ToC are ToC-worthy and the others are not-ToC-worthy. However, as such training sets are not available and are expensive to create (because this requires human involvement), other training options are also investigated in this chapter. The first alternative data set for training is retrieval result sets, also called retrieval runs. Runs have already been used to analyse various metadata for structure summarisation in Section 6.3, where it was assumed that relevance and ToC-worthiness were equivalent; under this assumption, runs can also provide data for structure summariser training. Using the same relevance-ToC-worthiness equivalence assumption, this section also uses another training set from IR: relevance assessments. To measure and compare the effectiveness of these alternative training sets, a small set of manually created ToCs will also be used for training the structure summariser. The use of the manual ToC data set for training is described in Section 6.5.4. How the data set was created using human "ToC assessors" is described in detail in Section 6.6. ToCs were also pseudo-manually created in Chapter 5; however, although human searchers were involved in the ToC creation process, the ToCs obtained in that study are highly biased by two of the summariser's features, i.e. depth and length (which, as features, are also investigated in this chapter). To avoid any effect of this depth and length bias, the ToCs of Chapter 5 are used in this chapter neither for training nor for evaluation.

To train the summariser, the probabilities introduced in the previous section (Equation 6.2) need to be estimated by counting occurrences of features within the training documents' elements. The required four types of probabilities can be estimated as follows:

• P(e ∈ T): the number of ToC-worthy elements divided by the number of elements in all documents of the training set, i.e. the proportion of ToC-worthy elements in the training data set.

• P(e ∉ T): the number of not-ToC-worthy elements divided by the number of elements in all documents, i.e. the proportion of not-ToC-worthy elements in the training data set. It can also be calculated as P(e ∉ T) = 1 − P(e ∈ T).

• P(Fj | e ∈ T): the number of occurrences of feature Fj in ToC-worthy elements, normalised by the number of ToC-worthy elements.

• P(Fj | e ∉ T): the number of occurrences of Fj in not-ToC-worthy elements, normalised by the number of not-ToC-worthy elements.
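The four estimates above reduce to counting over a labelled training set. The following sketch illustrates this (the miniature training set is invented; real training data would come from runs, assessments or manual ToCs as described above):

```python
def estimate_probabilities(training):
    """Estimate P(e in T) and the per-class conditional probabilities
    P(F = f | class) by counting, as described above. training is a
    list of (feature dict, toc_worthy flag) pairs."""
    n_toc = sum(1 for _, worthy in training if worthy)
    n_not = len(training) - n_toc
    p_toc = n_toc / len(training)
    counts = {True: {}, False: {}}
    for features, worthy in training:
        for feat, value in features.items():
            key = (feat, value)
            counts[worthy][key] = counts[worthy].get(key, 0) + 1
    p_given_toc = {k: c / n_toc for k, c in counts[True].items()}
    p_given_not = {k: c / n_not for k, c in counts[False].items()}
    return p_toc, p_given_toc, p_given_not

# Invented miniature training set: one ToC-worthy element in four.
train = [({"top_type": True}, True), ({"top_type": True}, False),
         ({"top_type": False}, False), ({"top_type": False}, False)]
p_toc, p_t, p_n = estimate_probabilities(train)
```

The returned tables are exactly the inputs required by the Equation 6.3 decision rule; in practice, smoothing would be needed for feature values unseen in one of the classes, a detail this sketch omits.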
In order to estimate these probabilities, it is necessary to define the feature set F and what is considered ToC-worthy. The next subsection introduces the feature set, which is followed by subsections describing how ToC-worthiness can be estimated.
6.5.1 Selected features
Eight features were selected for this experiment. Two of them, depth and length, have already been part of the work described in Chapter 5. Other features, e.g. sequence number and element type, have been investigated in the metadata analysis phase of the research (Section 6.3). The remaining features were chosen to further explore the use and distinguishing abilities of other available metadata. The list of selected features is, naturally, not complete. It is possible that there are other metadata that could be selected and might serve better than the ones examined in this section. However, it is believed that the selected features, or possibly a subset of them, can lead to automatic ToCs of sufficiently high quality. Furthermore, only features that are specific to a particular element have been selected to keep this step of the research as informative as possible.

After training the summariser, it should be possible to generate ToCs for any XML document having a logical structure similar to traditional documents, i.e. containing sections (or their reasonably close equivalents) within a chapter, subsections within sections, etc. The summaries should also be generated in a short time, preferably without having to store indices of documents. Therefore, when selecting features for this experiment, computationally expensive features were avoided. That is why word-statistics based features (e.g. title word occurrence and cue words) and metadata of connected elements (via links or the logical structure) are not used here and are considered for future work only. Sufficiently high quality ToCs are expected to be generated using easy-to-compute features.

When training the summariser, some features are binary, i.e. either true or false. For example, an element does or does not have a title. Consequently, four probabilities need to be calculated during training: P(Fi=True | e ∈ T), P(Fi=False | e ∈ T), P(Fi=True | e ∉ T) and P(Fi=False | e ∉ T), where i denotes feature i.
Several other features, however, can take more than two values (i.e. they are not binary but discrete), e.g. the depth of an element can be 1, 2, etc., depending on how many 'buckets' we want to identify, i.e. how many different values are allowed for a particular feature. In these cases, twice as many probabilities as there are buckets (i.e. one set for ToC-worthiness, another for not-ToC-worthiness) need to be calculated for each feature (Equation 6.4).

P(Fij | e ∈ T), P(Fij | e ∉ T),   i = 1, .., k,   j = 1, .., li    (6.4)
where k denotes the number of features, li is the number of buckets for feature i, P(Fij | e ∈ T) is the probability that the content of the jth bucket of feature i is observed given that element e is ToC-worthy, and P(Fij | e ∉ T) is the probability that the same is observed given that element e is not-ToC-worthy. The features and their buckets used in this section are the following.

1. Depth. It has proven to be useful (Chapter 5) and, previously, users expressed their need for depth to be an important feature (Chapter 4). Ten discrete values of depth have been used in the experiment, i.e. depth levels 1, 2, ..., 10 have been considered. Previous retrieval result and table of contents analysis (Chapter 5, (Hammer-Aebi et al., 2006)) shows that from depth level 8 there are hardly any relevant or ToC-worthy elements.

2. Length. Also used in the previous chapter. In addition, it is a very important feature in retrieval, where it is used for normalisation (Kamps et al., 2004) and for filtering out elements that are too small to retrieve (Malik et al., 2005). The length categories are the same as in the previous chapters, i.e. character lengths 1-10, 11-100, 101-1000, 1001-10000 and 10000+.

3. Type. In the retrieval run analysis (Section 6.3), it was found that four types of elements, namely p (paragraph), section, article and body, tend to occur most frequently in high quality retrieval results. These elements are also expected to be helpful in deciding whether a particular element is ToC-worthy. The two buckets for element type are top and non-top, where top is true if the current element's type is one of the above mentioned four types.

4. Sequence number. Although the run analysis revealed that the sequence number did not affect the quality of runs, this feature was kept to verify the validity of the assumption made, i.e. that relevance and ToC-worthiness can be considered equivalent. If the prediction that the sequence number information is not of particular use in ToC generation proves correct, that may also suggest that the assumption is not wrong. The buckets for the sequence number feature are the following: seqnum1 (meaning that the current element's XPath is like //*[1]), seqnum2-3 (//*[2] or //*[3]), seqnum4-5 (//*[4] or //*[5]) and seqnum6+ (//*[6] or above).

5. Title. As no word-statistics based features are selected for this experiment, the title method used in text summarisation is also omitted. However, instead of counting occurrences of words from the document's title, the explicit presence of an element's title is recorded, i.e. if an element has a title, the element is counted into the hastitle bucket. To find out whether an element has a title, the collection had to be analysed before the summarisation experiment, and the list of tags denoting an element's title child element had to be compiled. Element types such as name, title and st (i.e. section title) have been noted.

6. Child elements. The number of descendants might carry information about an element's ToC-worthiness. The following buckets are assigned to this feature: children0 (no child elements), children1-5 (one to five child elements), children6-10 and children11+ (at least 11 direct descendants).

7. Grandchild elements. The same description and bucketing applies to grandchild elements as to child elements.

8. Sibling elements. This feature explores whether the number of child elements of the current element's parent can help distinguish ToC-worthy elements from others. Bucketing is the same as for the previous two features.
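The bucket assignments above can be sketched as simple functions over an element's XPath and metadata. This is an illustrative sketch: the function names, the XPath parsing, and the shared bucketing helper are assumptions, not the thesis's implementation.

```python
import re

def depth_bucket(xpath):
    # depth level = number of steps in the element's XPath (levels 1..10)
    return min(xpath.count("/"), 10)

def length_bucket(n_chars):
    # character-length buckets: 1-10, 11-100, 101-1000, 1001-10000, 10000+
    for label, upper in [("1-10", 10), ("11-100", 100),
                         ("101-1000", 1000), ("1001-10000", 10000)]:
        if n_chars <= upper:
            return label
    return "10000+"

def type_bucket(tag):
    # 'top' element types found most frequently relevant in Section 6.3
    return "top" if tag in {"p", "section", "article", "body"} else "non-top"

def seqnum_bucket(xpath):
    # sequence number of the last XPath step, e.g. .../section[3] -> 3
    m = re.search(r"\[(\d+)\]$", xpath)
    n = int(m.group(1)) if m else 1
    if n == 1:
        return "seqnum1"
    if n <= 3:
        return "seqnum2-3"
    if n <= 5:
        return "seqnum4-5"
    return "seqnum6+"

def count_bucket(n):
    # shared bucketing for the children, grandchildren and sibling features
    if n == 0:
        return "0"
    if n <= 5:
        return "1-5"
    if n <= 10:
        return "6-10"
    return "11+"
```

All of these are cheap to compute from the document tree alone, which matches the requirement that no word-statistics or stored indices be needed.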
6.5.2 Training by high quality retrieval runs
In this subsection, summariser training using high quality retrieval runs is described. For this training type, it is assumed that XML elements returned in high quality retrieval results can be used to estimate ToC-worthiness. Based on this assumption, the probabilities introduced in Section 6.4.1 can be estimated. To estimate the probabilities, a sufficiently large set of training (example) documents needs to be acquired. The training method described in this subsection takes the high quality runs of the INEX Relevant In Context retrieval result set as examples (containing 7438 documents). The relatively high number of query-focused retrieval results makes it possible to acquire sufficiently 'smoothed' data with which to train the summariser, which is supposed to select 'generally relevant' XML elements. The result set and the selection of top quality runs has already been described in
Figure 6.9: Training by runs - Depth probabilities.

Section 6.3. Elements listed in these runs (identified by their document number - XPath pairs, e.g. 573 /article[1]/body[1]/section[3]) are considered to be positive examples of ToC-worthiness for the corresponding documents, while elements of the same documents that are not listed are treated as not-ToC-worthy. It is possible that more than one INEX participant submitted elements from the same document for the same topic in their runs. Although using the same document more than once could be used to reinforce the effect of that document, in this chapter documents and elements are intended to be treated equally. Hence, when elements appearing in more than one run are found, the corresponding result element sets are merged, both to avoid using a document twice for training (keeping the result element sets separate would give the document double weight) and to avoid losing possibly relevant elements (which would happen if only one of the duplicate documents were chosen).

The training results show that, according to this training method, only 2.34% of elements are generally ToC-worthy (Table 6.4 on page 163). The corresponding probabilities are P(e ∈ T) = 0.0234 and P(e ∉ T) = 0.9765. The low general ratio of ToC-worthiness (i.e. 2.34%) is unexpected, given that the query-based structure summarisation study found that, on average, 8.16% of the elements are included in the ToCs (Section 5.4.2.2). The depth probabilities indicate that, from level three downwards, the deeper an element is the less likely it is to be included in the ToC of the document (see Figure 6.9 or, for the actual values of the various probabilities, Table 6.8 on page 177). The shape of the probability distribution in Figure 6.9 is similar to those obtained in Chapter 5 (Figure 5.11). The same can be said about the length probabilities (Figures 6.10 and 5.9, respectively).
As Figure 6.10 shows, ToC-worthy elements tend to be longer than other elements.
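The per-document merging of overlapping run results described above can be sketched as follows. The representation of run entries as (document id, XPath) pairs is an illustrative assumption based on the identifier format quoted in the text.

```python
from collections import defaultdict

def merge_runs(runs):
    """Merge element sets from several runs so that each document is used
    only once for training: elements submitted for the same document by
    different runs are unioned rather than counted twice."""
    positives = defaultdict(set)  # document id -> set of ToC-worthy XPaths
    for run in runs:
        for doc_id, xpath in run:
            positives[doc_id].add(xpath)
    return positives
```

Unioning per document gives each document a single vote while keeping every element any run marked as relevant, which is exactly the trade-off described in the text.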
Figure 6.10: Training by runs - Length probabilities.
Figure 6.11: Training by runs - Type probabilities.

The element type feature is expected to perform very well (Figure 6.11), as most of the ToC-worthy elements will probably belong to the four top, i.e. most frequently relevant (Section 6.3.3), element types (p, section, article and body), while other types of elements are likely to be not-ToC-worthy. Similarly to what was found in Section 6.3, it seems that the sequence number information cannot separate ToC-worthy elements from others (Figure 6.12). Since the probabilities are almost identical across the buckets, the feature could just as well be omitted from the feature set if we were to use only this training method, i.e. training by runs. In Figure 6.13, the probabilities associated with the title feature are reported. According to the retrieval run based training, the absence of an explicit title is a good indicator that an element is not-ToC-worthy; the presence of a title, however, does not give the same high probability that an element is ToC-worthy. The title is nevertheless expected to be a good feature when combined with other features, due to the ratio of the probabilities when the title is present (left columns in Figure 6.13).
Figure 6.12: Training by runs - Sequence number probabilities.
Figure 6.13: Training by runs - Title probabilities.
Figure 6.14: Training by runs - Children probabilities.
Figure 6.15: Training by runs - Grandchildren probabilities.

The sibling and descendant probabilities show that an element is more likely to be ToC-worthy if it has many siblings and at least some descendants (reported in Figures 6.14, 6.15 and 6.16). Having estimated the probabilities using retrieval runs, i.e. having trained the summariser by runs, the next subsection introduces another training data set.
6.5.3 Training by relevance assessments
In this subsection, relevance assessments from XML retrieval are used for probability estimation. It is assumed that XML elements assessed relevant by human relevance assessors at INEX can be used to estimate ToC-worthiness. Based on this assumption, the probabilities introduced in Section 6.4.1 can be estimated, similarly to the previous section. The assessment set used is that of INEX 2006, where documents for 114 topics have been assessed (out of the 125 topics that were created). This gives 5460 documents with relevant elements marked with which to train the summariser. The training set, therefore, is in the same format as for training by runs (see the previous section), i.e. document, XPath pairs identifying relevant
Figure 6.16: Training by runs - Siblings probabilities.
Figure 6.17: Training by assessments - Depth probabilities.

elements, are used. As with training by retrieval runs, it is still assumed that the high number of documents with query-biased relevant elements will smooth the training so that appropriate training for query independent summarisation can be carried out. The training results show that, according to this training method, 9% of elements are ToC-worthy (Table 6.4). The corresponding estimated probabilities are P(e ∈ T) = 0.0900 and P(e ∉ T) = 0.9099. This means that, in general, ToCs generated by assessment training will be longer than those generated by run training. The 9% is in accordance with the findings in Chapter 5, where the average ToC contained 8.16% of the document's elements. Unlike the run training probabilities, the assessment training probabilities do not show clear differences between relevant and not relevant elements. However, it can still be seen that deeper elements are more likely to be not-ToC-worthy (Figure 6.17), longer elements are probably better for ToCs (Figure 6.18), and top type elements are slightly more likely to be ToC-worthy (Figure 6.19). As with the run training, the sequence number still does not seem to carry information that could
Figure 6.18: Training by assessments - Length probabilities.
Figure 6.19: Training by assessments - Type probabilities.
Figure 6.20: Training by assessments - Sequence number probabilities.
Figure 6.21: Training by assessments - Title probabilities.

separate ToC-worthy elements from others (Figure 6.20); neither does the title existence feature (Figure 6.21). The probabilities concerning the three features counting the numbers of related elements at various levels, i.e. the numbers of children (Figure 6.22), grandchildren (Figure 6.23) and siblings (Figure 6.24), do not show striking differences either. However, the features together (i.e. combined) can still be useful for identifying elements that would form high quality ToCs. According to this training method, the individual features do not seem to be as discriminative as several of the run training probabilities shown in the previous section. Individual features and their combinations are further investigated in Section 6.7.
6.5.4 Training by manually created ToCs
In this subsection, summariser training using a data set of ToCs created by human ToC assessors is presented. This subsection is not concerned with how the data set was created (that is presented in Section 6.6) but with how it is used to train the summariser. The same ToC data set is also used for evaluation, which is discussed later in Section 6.7.
Figure 6.22: Training by assessments - Children probabilities.
Figure 6.23: Training by assessments - Grandchildren probabilities.
Figure 6.24: Training by assessments - Siblings probabilities.
Training by manually created ToCs is regarded as the ideal training method, as for this type of training there is no need to introduce any assumptions regarding relevance and ToC-worthiness, i.e. their equivalence for structure summarisation (for the mentioned assumptions, see Section 6.3 and the previously described training methods). When manual ToCs are available, the probabilities that can be obtained are already the desired probabilities, e.g. P(Fj | e ∈ T) instead of P(Fj | e ∈ R), where T is the set of ToC-worthy elements and R is the set of relevant elements. The training is done the same way as for the previous two trainings, i.e. by counting occurrences of features.

Creating a suitably large manual ToC set, similarly to obtaining relevance assessments, “is a very tedious and costly task” (Piwowarski and Lalmas, 2004). Therefore, the manually created data set needs to be used efficiently. For this purpose, the k-fold cross-validation method is used, where one subset of the data (i.e. documents) is used for training and another subset for evaluation. The training and evaluation is repeated for k different subsets of the data, which provides k different sets of training probabilities. To compare these training probabilities (i.e. those of manual ToC training) with those of the run and assessment trainings, the following figures report the average of the probabilities obtained in the k steps of the k-fold cross-validation method. K-fold cross-validation is further described in Section 6.7, where the other subsets of the data (i.e. the XML documents used for evaluation) are also discussed.

According to the averaged manual ToC training probabilities, the depth feature is very distinctive at depth level one, and the positive (ToC-worthy) probability is high at level 3 (Figure 6.25). The not-ToC-worthy distribution follows the shape that was also found and investigated in Chapter 5.
The length distributions show that longer elements are more likely to be ToC-worthy (Figure 6.26), type and title are reasonably good ToC-worthiness indicators (Figures 6.27 and 6.29), and the sequence number does not reveal differences between ToC-worthy and not-ToC-worthy elements (Figure 6.28). The probabilities for the numbers of children and siblings are hard to interpret (Figures 6.30 and 6.32), whereas many grandchildren indicate ToC-worthiness, and zero grandchildren not-ToC-worthiness (Figure 6.31).

Although they are treated as independent in this chapter, length and depth are clearly interdependent. It follows naturally that, for example, longer elements tend to be higher up in the hierarchy within the same document, as deeper elements are parts of other elements (i.e. of their ancestors). With respect to different documents, an element with a length of 1000 characters can be at depth level 2 in one document and at depth level 4 in another. The latter example shows
Figure 6.25: Manual ToC training - Depth probabilities.
Figure 6.26: Manual ToC training - Length probabilities. that, despite their interdependence, using both length and depth might be needed in structure summarisation. Also, naive Bayes classifiers still work relatively well with interdependent data despite the method’s assumption of their independence. The next section discusses how manual ToCs, whose probabilities have been described above (i.e. summariser training by manually created ToCs), were created using human ToC assessors. run training P(e ∈ T ) P(e ∈ / T) 0.0235 0.9765
assessment training P(e ∈ T ) P(e ∈ / T) 0.0900 0.9100
manual ToC training P(e ∈ T ) P(e ∈ / T) 0.0664 0.9335
Table 6.4: P(e ∈ T ) and P(e ∈ / T ) probabilities per training methods.
6.6 Manual structure summary (ToC) building
This section discusses how the manual ToCs, which are used for training the summariser (in Section 6.5.4) as well as for evaluation (in Section 6.7 later), were created. It presents an experiment in which human assessors were asked to select ToC-worthy elements. The experimental setup, as well as the system I created for manual ToC building, are described. This
Figure 6.27: Manual ToC training - Type probabilities.
Figure 6.28: Manual ToC training - Sequence number probabilities.
Figure 6.29: Manual ToC training - Title probabilities.
Figure 6.30: Manual ToC training - Children probabilities.
Figure 6.31: Manual ToC training - Grandchildren probabilities.
Figure 6.32: Manual ToC training - Siblings probabilities.
section also discusses agreement between ToC assessors, and compares the agreement levels to agreement values from text summarisation and information retrieval studies.
6.6.1 Experimental setup
25 participants (ToC assessors) were recruited, with various levels of computer science expertise. Similar numbers of assessors are often used in summarisation (Harman and Over, 2004)3. Before starting the assessment, participants were given an introduction in which their task was explained and the system to be used was described. After signing a consent form, they were asked to choose documents from a list of at most 20 documents, and to assess items (i.e. elements) of the ToCs for as many documents as they wished. Assessment meant that participants were asked to select ToC-worthy elements for a document by judging the elements' contents and not the texts that appear as titles (i.e. labels, e.g. “An indentation can mean t...” in Figure 6.33 below) in the ToCs. Participants assessed 322 documents. Documents for assessment had previously been selected randomly from the Wikipedia XML document collection (Denoyer and Gallinari, 2006). 42 of these documents were assessed twice; the elements selected for these documents were used to study the participants' agreement levels. The double-assessed documents were obtained by introducing overlap among the sets of documents assigned to individual assessors, e.g. assessor1 shared one or two documents with assessor2, assessor2 also shared one or two documents with assessor3, and so on. Questionnaires, interviews and click logs were not used, as the aim was to create manual ToCs for training and evaluation, not to study user behaviour. As a result, only the selected ToC-worthy elements were saved.
6.6.2 Element selection
Manual structure summary building (i.e. the selection of ToC-worthy elements) is done in a way known both in text summarisation and in IR. In classical text summarisation, humans are asked to write sentences (or a paragraph of a desired length) that summarise the document's textual content (Edmundson, 1969). For sentence extraction, they are asked to select the sentences that are most representative of the document's content. In information retrieval, assessors are asked to select documents that are relevant to certain information needs (topics). In XML IR, assessors are asked to select portions of documents that are relevant to a topic. In manual ToC building, participants

3 http://duc.nist.gov/duc2007/tasks.html
Figure 6.33: The manual ToC building interface.

are asked to select document portions, i.e. XML elements, that they consider ToC-worthy. Using the selected elements, the summariser can be trained with manual ToCs as explained in the previous section, and structure summarisation can be evaluated (Section 6.7) by comparing the element sets selected by the summarisation method to the manually created ToC-worthy element sets.
6.6.3 Element selection interface
For manual ToC building, a system with the interface shown in Figure 6.33 was used. The interface is a modified version of that of Chapter 5. The right panel is reserved for displaying the elements' contents according to stylesheets supplied with the document collection; the left panel shows the document's elements structured according to their position in the logical structure. When a ToC item on the left is clicked, the contents of the element it represents appear in the right panel. Each ToC item is also associated with a checkbox that, if ticked, indicates ToC-worthiness. When the element selection is finished, participants click on the 'Submit assessment' button, which saves their selection. Participants were also given the option of not assessing a ToC if, for whatever reason, they preferred not to (the number of assessed documents, i.e. 322, is the result of this option).
/article[1]
/article[1]/body[1]/section[1]
/article[1]/body[1]/section[2]
/article[1]/body[1]/section[2]/section[2]
/article[1]/body[1]/section[3]
/article[1]/body[1]/section[3]/section[1]
/article[1]/body[1]/section[4]
/article[1]/body[1]/section[5]

Figure 6.34: An example ToC-worthy element set for document 1095034.xml

6.6.4 Level of agreement
In this subsection the agreement between ToC assessors is discussed. As inter-assessor agreement levels in text summarisation and IR are often quite low (Trotman, 2005), it is worth investigating whether the agreement levels for ToC building are any different. There are several ways to calculate agreement levels. Hripcsak and Rothschild (2005) calculate agreement in a recall/precision manner; the F measure is also used to measure agreement, usually by balancing recall and precision (α = 0.5) (Mani et al., 1999)4. Various other methods are also available and have been used to measure agreement between two persons' data sets (Hunt, 1986; Hripcsak and Rothschild, 2005), such as Pearson's correlation coefficient, the κ measure (Cohen, 1960), overlap (Salton, 1968), positive specific agreement, mean average precision difference (Trotman and Jenkinson, 2007), average pairwise recall (Donaway et al., 2000), unigram co-occurrence in extracted sentences (Lin and Hovy, 2003), etc. (Fleiss, 1975; Voorhees, 2000). Their use depends on various factors, such as the absence or presence of negative agreement data (which, in IR, is often not available as the exact number of not relevant documents may not be known). In the case of ToC creation, the negative agreement information, that is, the number of elements that were not marked ToC-worthy by either assessor (which is used as if these elements were marked not-ToC-worthy), is known and, hence, the κ measure, which aims at capturing agreement relative to expected agreement (Cohen, 1960) (Equation 6.5), can be applied.

4 Note that measuring agreement between two assessors is very similar to measuring “agreement” between machine-generated textual summaries and manual summaries, as well as between retrieval results and relevance assessments. In those cases, however, one of the data sets is considered to be the ground truth; the higher the agreement with it, the higher the performance of a system. In the agreement calculation in this section, the two data sources to be compared are considered equally important.
Figure 6.35: Notations for two assessors’ agreement.
κ = 2(ad − bc) / [(a + c)(c + d) + (b + d)(a + b)]    (6.5)
where a is the number of cases that both assessors agree are positive, d is the number of cases both assessors agree are negative, and b and c are the numbers of cases that the two assessors disagree on (Figure 6.35). The κ agreement measure for the manual ToC building data set is κ = 0.6785, which is slightly lower than the κ = 0.72 found in text summarisation by Carlson et al. (2001), but it shows that the agreement levels in structure summarisation are comparable to those found in text summarisation. As agreement on negatives in IR and text summarisation is usually not available or not measured, overlap is often used. Overlap is defined as the ratio of the intersection (i.e. positive agreement, a) of the two assessors' sets to their union (i.e. a + b + c, Equation 6.6). This overlap measure is also referred to as the Jaccard measure or coefficient. In this chapter, overlap is calculated to measure ToC assessor agreement so that the agreement level can be compared to more typical values from IR and summarisation.
Agreement = |Selecteda ∩ Selectedb| / |Selecteda ∪ Selectedb| = a / (a + b + c)    (6.6)
where Selectedi is the set of elements selected by assessor i. Based on the 42 documents that were assessed twice, the level of agreement between ToC assessors is 57.22%. This number is higher than typical values of human assessor agreement in IR (e.g. 33% at TREC-6 (Voorhees, 2000), 49% at TREC-4 P/B and 27% at INEX 2004 (Trotman, 2005)). Also, in the context of text summarisation, for example, Hassel and Dalianis (2005) found 39.6% agreement between summary assessors, Salton and Buckley (1991) found 45.81%, elsewhere 46% has been found, and such agreement levels are generally expected in extraction type document summarisation (Mani, 2001).

Table 6.5 shows the individual agreement levels for the 42 double-assessed documents (more precisely, 40 double-assessed documents and two triple-assessed documents treated as twice double-assessed, denoted by italics). Two of the 42 documents registered 100% agreement (bold in Table 6.5), and the scores by the κ measure were always greater than or equal to those by the overlap measure. The latter is not surprising, as considering agreement on negatives naturally increases the number of cases (i.e. elements) on which assessors agree. As the results show, the agreement measured by considering positive agreement (overlap) is higher than what is generally found in IR or summarisation. However, when the agreement on negatives is also taken into account, the overall agreement level seems to fall below agreement levels reported in the IR and summarisation community5. The lower κ score can be explained by the number of elements in documents: the number of elements in the compared documents was 147 on average, which might be lower than the number of sentences in summarisable textual documents or the number of documents in IR collections (which generally contain far more documents, whose being not relevant can be more easily agreed upon). However, the agreement levels seem to be higher than those of XML retrieval (as indicated by the overlap agreement levels), despite the κ for assessed documents in XML IR not being known. Based on the findings of Salton (1968) and Voorhees (2000), it is believed that, even though there are disagreements among assessors, the results are stable.
This is because evaluation results (Section 6.7) are reported as averages over many documents, and it is assumed (based on the above cited work) that disagreements among judges affect borderline elements, and the most important ToC elements are unanimously agreed upon.
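Both agreement measures (Equations 6.5 and 6.6) can be sketched directly from two assessors' selections. The `all_elements` argument, i.e. the document's full element set, is needed only for the negative count d in κ; the set-based representation is an illustrative assumption.

```python
def agreement(selected_a, selected_b, all_elements):
    """Return (kappa, overlap) for two assessors' ToC-worthy element sets.
    kappa follows Equation 6.5; overlap is the Jaccard ratio of Equation 6.6."""
    a = len(selected_a & selected_b)                 # both marked positive
    b = len(selected_a - selected_b)                 # only assessor a positive
    c = len(selected_b - selected_a)                 # only assessor b positive
    d = len(all_elements - selected_a - selected_b)  # both (implicitly) negative
    kappa = 2 * (a * d - b * c) / ((a + c) * (c + d) + (b + d) * (a + b))
    overlap = a / (a + b + c)
    return kappa, overlap
```

As the text notes, κ rewards agreement on negatives as well, which is why it can exceed the overlap score for the same pair of assessments.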
6.7 Evaluation
In this section, the evaluation of the summariser introduced in Section 6.4.1 is described. It is discussed how trained summarisers can be evaluated, as well as the choice of evaluation methods for the three training sets of this chapter (i.e. run, assessment and manual ToC training).

5 Note that statistical significance here cannot be computed due to lack of data.
As the summarisation method used in this chapter involves training, training-based evaluation methodologies are followed. The holdout method is used for the run and assessment trainings, while cross-validation is employed for the manual ToC training (Kohavi, 1995). When a system is trained on a set of documents, documents that are used for training should not be used for testing. Although the run and assessment training methods do not use ToC-worthy element sets for training the summariser (but retrieval results and assessment data from IR), documents that are involved in the training in any way are avoided in the testing phase. Following the holdout method, the document set is divided into training and testing sets. Usually, the ratio of training to testing documents is two to one but, as testing is very expensive due to the use of manually created ToCs, the ratio in these experiments is higher (run training: 7438/322, assessment training: 5460/322). In the testing phase, summaries generated by the trained summarisers are compared to those created manually as described in Section 6.6.

The data set of manual ToCs includes ToC-worthy elements for 322 documents (Section 6.6). These 322 documents are to be used both for training (by manual ToCs) and for evaluation. If the holdout method were followed, the document set would have to be divided into two sets, i.e. training and evaluation, which would not be efficient and, due to the low number of documents used for evaluation, the results would not be reliable. The k-fold cross-validation method (Kohavi, 1995) offers a way to increase efficiency, robustness and reliability. The k-fold method splits the document set into two parts, so that one (part A) contains (k-1)/k (e.g. 9/10) and the other (part B) 1/k (e.g. 1/10) of the whole manual ToC set. Part A is then used for training and part B for testing. This splitting and testing process is repeated k times, always using the next 1/k fraction (another tenth) of the documents for testing and the rest for training. Research shows that the choice of k = 10 gives one of the most reliable results (Kohavi, 1996). Hence, the manual ToC set for the 322 documents is split into ten folds, and the results obtained in each of the ten training-testing rounds are averaged to obtain overall performance values for the corresponding summariser. The next subsection introduces the measures that are used in the testing phase, i.e. in the evaluation of structure summarisation.
6.7.2
Evaluation measures
To evaluate structure summarisation results (i.e. ToC-worthy elements selected for XML documents), recall and precision were chosen. These two measures are also used in text summarisation (more precisely, in sentence extraction) as sentence recall and sentence precision, respectively (Section 2.4.4). Recall and precision are used in this chapter because structure summarisation, as described in this thesis, can be considered as the extraction of ToC-worthy elements from the document.6 For extraction-type summaries, recall and precision are widely used and, therefore, there is no need to adopt content-based evaluation measures such as ROUGE (Lin, 2004). Overall recall and precision are calculated using macro averaging (van Rijsbergen, 1980), as it is macro averaging that is usually used in "evaluating query driven retrieval, partly because it gives equal weight to each user query" (Lewis, 1991): for each document, recall and precision values are calculated by simple counting and division, then average recall and precision values over all documents are obtained to describe the overall performance of the ToC generation method (Equations 6.7 and 6.8).
R_macro = (1/N) · ∑_{i=1}^{N} |ToCworthy_i ∩ Selected_i| / |ToCworthy_i|    (6.7)

P_macro = (1/N) · ∑_{i=1}^{N} |ToCworthy_i ∩ Selected_i| / |Selected_i|    (6.8)
where N is the number of documents whose ToCs are used in evaluation, Selected_i is the set of elements in document i that are selected by the summariser being evaluated (and corresponds to Retrieved in IR), and ToCworthy_i denotes the elements that are selected manually for document i by participants of manual structure summary building (Relevant in the formula used for IR, Section 2.3.1). For the evaluation of results obtained by using the 10-fold cross-validation method, recall and precision are calculated for each fold first, then the average recall and precision values over these folds are reported. Averaging is the standard way to obtain overall evaluation scores for systems using the k-fold method (Kohavi, 1995).

It is often desirable to use only one number to describe the performance of a system. This way, various methods can directly be compared and an ordering easily obtained. In this chapter, results by the F2 measure (with α = 0.25, i.e. β = 2, giving double emphasis to recall) are also reported in addition to recall and precision values (Equation 6.9).6

6 Note that although the titles or first words of the elements' contents are shown to users when ToCs are displayed, it is the element selection that is evaluated and not the presentation itself, as this was also described to participants who created the manual summaries.
F2 = ((2^2 + 1) · P · R) / (2^2 · P + R) = (5 · P · R) / (4 · P + R)    (6.9)
As the values of α and β (which determine the weighting between recall and precision) show, recall is emphasised over precision. The rationale behind the double emphasis on recall is the assumption that returning some not-ToC-worthy elements is not as bad as not returning ToC-worthy elements. In other words, a user prefers ToCs in which some not-ToC-worthy elements are displayed (and can be ignored) to ToCs from which important (ToC-worthy) elements are omitted (Section 4.6). Double emphasis on recall is also used in IR (van Rijsbergen, 1980). Moreover, it is simple to train a structure summariser to return a higher proportion of elements by introducing a correction factor (Section 6.8.1).
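A minimal sketch of the evaluation measures above (Equations 6.7 to 6.9); the element identifiers and set contents are invented for illustration:

```python
def macro_scores(toc_worthy, selected, beta=2.0):
    """Macro-averaged recall and precision (Equations 6.7 and 6.8), plus
    the F measure weighting recall beta times as heavily as precision
    (Equation 6.9; beta = 2 gives the F2 score used in this chapter)."""
    n = len(toc_worthy)
    recall = sum(len(t & s) / len(t) for t, s in zip(toc_worthy, selected)) / n
    precision = sum(len(t & s) / len(s) for t, s in zip(toc_worthy, selected)) / n
    f = (beta ** 2 + 1) * precision * recall / (beta ** 2 * precision + recall)
    return recall, precision, f

# Two toy documents: manually selected ToC-worthy sets vs summariser output.
toc_worthy = [{"sec1", "sec2"}, {"sec1"}]
selected = [{"sec1"}, {"sec1", "sec3"}]
r, p, f2 = macro_scores(toc_worthy, selected)
# Document 1: R = 1/2, P = 1; document 2: R = 1, P = 1/2.
assert abs(r - 0.75) < 1e-9 and abs(p - 0.75) < 1e-9
```

Per-document ratios are averaged with equal weight, following the macro-averaging rationale quoted from Lewis (1991) above.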
6.8
Results
In this section, the results of the ToC generation experiments are reported. The three training methods (i.e. run, assessment and manual ToC training) are compared, as is the performance of individual features (listed in Section 6.5.1) and of feature groups.
6.8.1
Results by training
The first set of evaluation results is generated by using all eight features. Thus, in this subsection, the overall performances of the three training methods, i.e. run, assessment and manual ToC, are compared. As Table 6.6 shows, the recall values of the assessment and manual ToC trainings are higher than those of the run training when all eight features are used for structure summary generation. This difference between the recall values is statistically significant (according to t-tests, at the p = 0.05 level). However, the three precision values are not significantly different from one another (Table 6.7). As might be expected as a result of 'proper' training data (i.e. ToC-worthy element sets), the F score of the results by manual ToC training is the highest among the three result sets. However, the other two scores are not much lower, which shows that reasonably good results can be achieved by assessment and run trainings if no explicit reference training summaries are available.

Training     Recall   Precision   F2
Run          0.7966   0.5197      0.0703
Assessment   0.8762   0.4894      0.7313
Manual       0.8637   0.5199      0.7412

Table 6.6: Recall and precision values when using all eight features.

             Run          Assessment     Manual
Run          N/a          Eq precision   Eq precision
Assessment   NEq recall   N/a            Eq precision
Manual       NEq recall   Eq recall      N/a

Table 6.7: Statistical significance of results when all eight features are used. Eq indicates equality, NEq statistical difference.

The worse recall of the run training result can possibly be tuned by changing one of the training probabilities only (i.e. P(e ∈ T) can be set higher to return higher proportions of elements) or by introducing a correction factor (ζ) to the classification equation (Equation 6.10), as the results indicate that this low recall is caused by the smaller sets of ToC-worthy elements selected by run training (see Section 6.5.2). The possible increase in recall might result in a decrease in precision, which might be worth investigating in the future.
log( [ ∏_{j=1}^{k} P(F_j = f_j | e ∈ T) · P(e ∈ T) ] / [ ∏_{j=1}^{k} P(F_j = f_j | e ∉ T) · P(e ∉ T) ] ) · ζ    (6.10)
where P(e ∉ T) denotes the probability that element e is not ToC-worthy and f_j is the value of feature F_j observed for element e. ζ ∈ (0, ∞) is the correction factor that increases the proportion of returned elements (and hence possibly recall) if greater than one; otherwise, the proportion decreases. The scores for assessment training also show that, with the introduced trainable structure summariser, it is possible to achieve recall values higher than those of manual ToC training. However, the lower precision and overall performance (measured by the F measure) indicate the opposite of what has been found for the run training, i.e. a correction factor might need to be used to increase the overall performance measured by the F score. Nevertheless, using the eight features, the results indicate that the three training methods produce ToC-worthy element sets, and thus ToCs, of comparable quality. The three training methods are further compared, considering various feature combinations, in Section 6.8.3.
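Equation 6.10 can be sketched as below. The feature likelihoods and the prior are invented numbers, not the trained values from this chapter, and the sketch assumes that an element is classified ToC-worthy when its score exceeds zero, with ζ applied to the odds so that ζ > 1 admits more elements:

```python
import math

def toc_worthiness_score(likelihoods_T, likelihoods_notT, p_T, zeta=1.0):
    """Naive-Bayes log-odds of an element being ToC-worthy (cf. Equation
    6.10). likelihoods_* hold P(F_j = f_j | e in T) and P(F_j = f_j |
    e not in T) for each of the k features; p_T is the prior P(e in T)."""
    odds = p_T / (1.0 - p_T)
    for p_t, p_n in zip(likelihoods_T, likelihoods_notT):
        odds *= p_t / p_n
    return math.log(zeta * odds)

# A low prior, partly offset by two features favouring ToC-worthiness.
score = toc_worthiness_score([0.6, 0.7], [0.2, 0.3], p_T=0.1)
boosted = toc_worthiness_score([0.6, 0.7], [0.2, 0.3], p_T=0.1, zeta=3.0)
# zeta > 1 raises every score by log(zeta), so more elements pass the
# zero threshold and recall can increase, possibly at the cost of precision.
assert boosted > score
assert abs(boosted - score - math.log(3.0)) < 1e-9
```

In this formulation the correction factor shifts the decision boundary uniformly, which matches the text's description of ζ controlling the proportion of returned elements.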
6.8.2
Individual features
The previous subsection discussed differences among training methods. This section presents results obtained by using individual features only. Individual features, as expected, do not perform
well (apart from one exception, which is discussed below). As in text summarisation, combinations of features, rather than individual features, should be used to yield better performance (Edmundson, 1969; Kupiec et al., 1995). Before examining the recall and precision values of the results, an initial analysis of the individual feature probabilities is needed (see Table 6.8 for an overview). According to the individual training probabilities, when using run training, an element is ToC-worthy only if it is at depth level one or has no siblings. This would, most of the time, return the article element when the two affected features are used. Returning the article only can be acceptable for very short documents but not for all documents in the collection. The six other features cannot produce any ToC-worthy elements by themselves; hence, at least two features are needed that, by reinforcing each other, yield ToC-worthiness values high enough that the generated ToC will not be empty. According to the individual training probabilities of the assessment training, three features can result in at least one ToC-worthy element from documents. Apart from the two mentioned above, an element can also be ToC-worthy when its textual content is longer than 10,000 characters (Table 6.8, middle columns). However, there are still five other features that cannot produce anything but an empty ToC, which is not acceptable. Taking the average of the manual ToC training probabilities over the 10 folds (as in Section 6.5.4, where training probabilities were discussed), five individual features seem to be able to produce non-empty ToCs. Depth level one (i.e. the document level) is a good indicator that an element is ToC-worthy, but the depth feature would fail to identify any good elements at other levels of the hierarchical structure.
An element is also identified as ToC-worthy if it has a title, has no siblings, has 11 or more grandchild elements, or its corresponding text is longer than 1,000 characters. Having considered the above issues, features that were proven to return only empty ToC-worthy element sets were excluded from the ToC generation experiments. These features are treated as supplementary features for other features or feature sets. Results of structure summaries using only individual features can be seen in Table 6.9, where the ID of the ToC generation is made up of two character groups as follows: the training used is denoted by either R (run), A (assessment) or M (manual); the second group identifies the used features (D - depth, Ty - type, Se - sequence number, L - length, Ti - title, C - children, Si - siblings, G - grandchildren).
Table 6.9: Individual features' evaluation results (rows: the feature value bins, i.e. depth 1-10; type top/non-top; seqnum 1, 2-3, 4-5, 6+; length 0-10, 10-100, 100-1000, 1000-10000, 10000+; title yes/no; children 0, 1-5, 6-10, 11+; siblings 0, 1-5, 6-10, 11+; grandchildren 0, 1-5, 6-10, 11+).

The recall values are relatively low compared to the results with all features used, apart from M Ti, which produces relatively high recall and F scores. This shows that if a title exists for an element, it is a good indicator of ToC-worthiness. Indeed, a traditional ToC contains the titles of, for example, the chapters and sections of a book, so the results based on the title feature are not entirely surprising. All depth and sibling results got exactly the same recall and precision values. This is not surprising for depth (D), because if only the article level is returned every time, the results will be the same. It also seems that the siblings feature (Si) returns exactly the same elements as depth. This result is surprising given that, although an article element does not have siblings, no other elements in the evaluation set seem to be without siblings. Grandchildren (G) work reasonably well. The results show that if an element has a rich and reasonably deep sub-structure (i.e. it has many descendant elements below the children level), the element is likely to be ToC-worthy. It has to be added to the investigation of individual features, though, that although the overall performance is relatively high, there is a high standard deviation in the individual recall and precision values by document, i.e. sometimes the results are almost perfect but sometimes they are very poor. This shows that using only one feature, such as Ti (title), does not yield stable results.
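The discretisation of the length feature into the bins listed above can be sketched as follows; treating each upper bound as inclusive is an assumption, as the thesis does not state how boundary values are assigned:

```python
def length_bin(n_chars):
    """Map an element's text length (in characters) to the bins used for
    the length feature: 0-10, 10-100, 100-1000, 1000-10000, 10000+."""
    for upper, label in [(10, "0-10"), (100, "10-100"),
                         (1000, "100-1000"), (10000, "1000-10000")]:
        if n_chars <= upper:
            return label
    return "10000+"

assert length_bin(5) == "0-10"
assert length_bin(2500) == "1000-10000"
# Elements longer than 10,000 characters fall into the open-ended top bin.
assert length_bin(50000) == "10000+"
```

The other discrete features (children, siblings, grandchildren, seqnum) would be binned analogously with their own boundaries.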
6.8.3
Feature groups
As previous results in the context of text summarisation suggest, groups of features might work better than individual features (Edmundson, 1969; Kupiec et al., 1995). The results presented in the previous subsection also show that individual features do not produce stable results.
To find out which feature groups work best and whether they outperform individual features, various feature groups were formed and ToCs were created using these groups. The evaluation results are discussed in this section. Usually, if recall is emphasised over precision, the emphasis on recall is double. This is reflected through the F2 measure. However, scores with respect to precision and recall are also reported in this section.

The best results with respect to recall are shown in Table 6.10. The top result, A DTyLSeTiCG, is the summariser version that is trained by retrieval assessments and uses all features but the number of siblings. As we can see, the assessment trainings usually outperform the manual and run trainings for recall (the same has been found when comparing the training methods using all eight features in Section 6.8.1). It is also visible from the results that several features have to be used for good performance, as no version with fewer than four features has reached the top 15 results with respect to recall.

ID
A DTyLSeTiCG
A DTyLSeTiSiG
A DTyLSeTiCSiG
A DTyLTiCSiG
A TyLSeTiCSiG
A DTyLTiSi
M DTySeL
A DTyLTi
M DTyLTi
M DTyLSeTiCG
M DTyLSeTiSiG
A DTyLSeTiCSi
M DTyLSeTiCSi
M DTyLTiCSiG
R DTyLTi

Table 6.10: The best 15 ToC evaluation results with respect to recall.

The best results with respect to precision are displayed in Table 6.11. Feature groups with high precision contain fewer than five features, and there is not much difference between the various training methods (denoted by R, A and M, respectively), as none of these methods dominates the top 15 results with respect to precision. Several versions that reached the top 15 have exactly the same precision values and, as their recall is also the same for all eleven of these summariser versions (including the version using the title (Ti) feature only), it is suspected that their outputs are exactly the same ToC-worthy element sets. As all these results contain the title feature, it seems that this feature is very effective for obtaining high precision, and its effect is so strong that two or three other features are often not enough to produce different output. It can also be seen
in Table 6.11 that the top four results all contain the depth (D) feature which, together with the previous finding, indicates that title and depth information are very effective for obtaining high precision values.

ID
A DSeTiSi
R DSeTiSi
R DLTi
M DSeTiSi
A TySeTiSi
M TySeTiSi
R TySeTiSi
A TySeTi
A TyTiSi
M SeTiSi
M TySeTi
M TyTiSi
R TySeTi
R TyTiSi
M Ti

Table 6.11: The best 15 ToC evaluation results with respect to precision.

When considering recall and precision with equal weights to obtain a single F value for each summariser version (the F1 score, α = 0.5, β = 1), the results show that even though the two measures are supposed to be treated as equal, summariser versions that performed very well in terms of precision dominate the overall F scores (Table 6.13). As it makes sense to allow lower precision but higher recall in ToC generation, i.e. a user would rather ignore a couple of not-ToC-worthy elements than miss out on ToC-worthy ones (Section 4.6), recall is given a higher weight in the formula of the F measure. As in IR, where recall often has a weight twice as high as precision (van Rijsbergen, 1980), α = 0.25 (β = 2) is used to calculate the F scores (introduced previously in Equation 6.9 in Section 6.7.2). As Table 6.14 shows, the combinations that have both reasonably high precision and recall values, and thus yield higher F2 scores, are not within the top 15 combinations when considering solely recall or precision, which shows the trade-off between the recall and precision measures. Also, summariser versions trained by runs, assessments and manual ToCs are distributed evenly in the F2 top 15 results (see Table 6.12), which further shows the effectiveness of training data obtained from XML retrieval. Table 6.12 also shows that summariser versions by the three training methods occur evenly in the top 15 results by precision as well. However, the run training method seems to perform worse in terms of recall, which might be a result of the low P(e ∈ T) = 0.0235 training probability. The ζ correction factor introduced in Section 6.8.1 can possibly offer a solution to the problem of lower recall.
                   Training method
                   Run   Assessment   Manual ToC
F2 Top 15          5     4            6
Recall Top 15      1     8            6
Precision Top 15   6     4            5

Table 6.12: Summariser version occurrences in the top 15 results, by training method.

ID
A TySeTiSi
M TySeTiSi
R TySeTiSi
A TySeTi
A TyTiSi
M SeTiSi
M TySeTi
M TyTiSi
R TySeTi
R TyTiSi
M Ti
M DSeTiSi
M DLTi
A DTySeTiSi
A DTyTi

Table 6.13: The best 15 ToC evaluation results with respect to the F1 measure (equal weight on precision and recall); results also in the top precision results are shown in bold.

As Table 6.15 shows, if high recall is desired, more features should be used, and depth (D), type (Ty) and length (L) are particularly useful for producing higher recall. For higher precision, fewer features are sufficient, and the title feature (Ti) is particularly effective. As the feature occurrences for the F2 scores show, high F scores are generally the results of combinations of high-recall and high-precision predictor features, and there is a trade-off between recall and precision. To sum up the general results obtained by using combinations of features, more features lead to higher recall, and fewer but carefully selected features yield high precision, but there should be an optimal combination of the two to obtain results in which not only one of these two measures is emphasised.
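The summariser-version IDs used in the tables above (a training letter plus concatenated feature codes) can be decomposed mechanically, which is how feature-frequency counts such as those in Table 6.15 can be derived. A sketch, with ID strings taken from the tables; the parsing order (two-letter codes before one-letter ones) is an assumption needed to disambiguate the concatenation:

```python
TWO_LETTER = ["Ty", "Se", "Ti", "Si"]   # checked before one-letter codes
ONE_LETTER = ["D", "L", "C", "G"]

def parse_id(version_id):
    """Split an ID such as 'A DTyLSeTiCG' into its training method
    (R = run, A = assessment, M = manual ToC) and its feature set."""
    training, codes = version_id.split()
    features, i = set(), 0
    while i < len(codes):
        for code in TWO_LETTER + ONE_LETTER:
            if codes.startswith(code, i):
                features.add(code)
                i += len(code)
                break
        else:
            raise ValueError(f"unknown feature code in {codes[i:]!r}")
    return training, features

training, features = parse_id("A DTyLSeTiCG")
assert training == "A"
# All eight features except the number of siblings (Si).
assert features == {"D", "Ty", "L", "Se", "Ti", "C", "G"}
```

Counting each feature's occurrences over the fifteen parsed IDs of a top-15 table reproduces one row of a frequency table like Table 6.15.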
6.9
Discussion
ID
R DTySeTiSiG
M DTyLTiSi
R DTyLTiSi
M DTyLTi
R DTyLTi
M DTyLSeTiSiG
M DTyLSeTiCSi
M DSeTiCSiG
A DTySeTiSi
A DTyLTi
R DTySeTiCSiG
M DTySeTiCSiG
A DTyLTiSi
R DTyLSeTiSiG
A DTyTiSi

Table 6.14: The best 15 ToC evaluation results with respect to the F2 measure (double weight on recall).

                   D    Ty   L    Se   Ti   C    Si   G
F2 Top 15          15   14   9    8    15   4    12   6
Recall Top 15      14   15   15   9    12   8    9    7
Precision Top 15   4    9    1    10   15   0    10   0

Table 6.15: Feature frequencies in the top 15 results.

The run analysis, which was used to explore several pieces of metadata (possible features for structure summarisation) in the first part of this chapter (Section 6.3), provided an estimation of the appropriateness of individual metadata for structure summarisation. Similar investigations of other metadata could also be carried out using the run data set, if necessary, as the use of the run data set also proved to be useful for summariser training. Using a retrieval data set to explore possible features also offers a larger volume of data that can be used to study features; thus, the manual ToCs are saved for evaluation. Based on the run analysis, several pieces of metadata were chosen for further investigation. Also, other pieces of metadata have been examined for ToC generation in Sections 6.4 to 6.8.

The results of Section 6.4 show that it is possible to train a summariser with data sources other than those having exactly the same structure as the evaluation data. The results obtained with the three training data sets have shown comparable performance. This shows that retrieval data sets (runs) and IR assessments can substitute for explicit training (by manual ToCs). This result is important because it may offer the opportunity of using alternative data to researchers who do not have the resources to create a sufficiently large training and evaluation data set for structure summarisation. It is expected that the same training methodology can be followed using other XML document collections, where similar results are expected to be obtained. This is because XML document collections tend to have similar structural characteristics, and most of the features used in this
section are not collection-specific. With respect to features, there may also be several other features, not used in this section, which, if taken into account, might increase the performance of the various training methods. In the previous sections describing the training and evaluation of the probabilistic structure summariser, only eight features have been investigated, and an additional three pieces of metadata were analysed previously in Section 6.3. It might be the case that other features, such as the currently unconsidered word-statistics based ones, can yield higher performance. Thus, other features can also be investigated for ToC generation as part of future work. Individual features can sometimes produce relatively high performance values, such as the title feature of manual ToC training, but these results are not stable. To obtain stable performance values, more features are needed. However, having too many features for stability will lead to lower performance values in terms of precision. As recall usually becomes lower when precision increases, a feature combination has to be determined that is appropriate for the purpose of the summariser application. In other words, depending on whether recall or precision is to be emphasised, the appropriate feature combination can be selected. In this chapter, recall is preferred over precision because it is assumed that a user would rather ignore not-ToC-worthy elements in the ToC than tolerate missing important elements. Based on this assumption, which comes from one of the findings of Chapter 4, four to seven features should be combined in a structure summariser, and the most important features are the depth of an element (D), element type (Ty), presence of title text (Ti) and number of sibling elements (Si).
As the retrieval runs and relevance assessments taken from IR could effectively be applied in structure summarisation, the results found in this section can possibly be applied in XML IR. For example, having identified that the existence of a title for an element makes it highly likely that the element is ToC-worthy, the title information could be fed back to a retrieval algorithm, for instance as a language modelling prior (Ponte and Croft, 1998; Ogilvie and Callan, 2004). Also, when selecting the appropriate element to retrieve, one might consider weighting certain elements higher, such as those referred to as 'top types' in this section.
6.10
Conclusion
In this chapter, query independent structure summarisation has been investigated. Several pieces of metadata (discussed in Section 6.2) have been selected and analysed using retrieval result sets
(i.e. runs, Section 6.3). Some of these metadata have also been selected and, together with others, used in a structure summariser for ToC generation. Section 6.4 introduced a summarisation method that can effectively be trained using various training data. Training data have been obtained by using retrieval runs, relevance assessments and manual ToCs. The summariser uses features to estimate the ToC-worthiness of elements. Eight features have been used and their effectiveness evaluated in this chapter, after seven metadata had been investigated as possible feature candidates. The creation and use of manually selected ToC-worthy element sets (i.e. ToCs) has also been presented in Section 6.6. Manual ToCs have been used for summariser training (Section 6.5) and evaluation (Section 6.7). The results of this query independent structure summarisation have been presented and discussed in Sections 6.8 and 6.9.
Chapter 7 Conclusions and future work
This thesis has investigated summarisation as a means to help searchers of XML retrieval systems in the process of accessing the contents of document portions. Two types of summarisation have been examined: summarisation of the textual content of elements, and summarisation of the logical structure of documents. In this chapter, first, the contributions this thesis has made are discussed in Section 7.1. This is followed by an outline of future work based on the work of this thesis (Section 7.2).
7.1
Contributions
The purpose of the research presented in this thesis was to investigate the use of summarisation to help searchers in accessing the contents of XML elements. The work that has been carried out included designing, conducting, and analysing the results of several user studies. The work also included developing methodologies for structure summarisation, as it has not been investigated widely before. This thesis has made use of several data sets obtained from XML retrieval, such as retrieval result sets and relevance assessments. XML documents of various document collections have also been analysed in order to obtain information that could be used for summarisation. This section presents the main contributions of this thesis. It is structured as follows. Section 7.1.1 discusses contributions with respect to text summarisation, which is followed by the contributions made with regards to structure summarisation in Section 7.1.2.
7.1.1
Content summarisation
This thesis's work on content summarisation has included the generation and presentation of summaries of the textual contents of XML elements. Various data have been obtained and analysed through a user study, and the results allowed conclusions to be drawn regarding the use of content summarisation in interactive XML retrieval. The following subsections discuss the contributions of this thesis with respect to summarisation of the content of XML elements in a user-based, interactive environment.

7.1.1.1
The usefulness of element text summarisation for interactive XML retrieval
As summarisation of the textual content of whole documents is useful in document retrieval, it is also important to establish that element text summarisation is useful in XML retrieval. In this thesis, it has been found that summarisation focusing on document portions, i.e. XML elements, can be useful in the retrieval process. Summarisation of elements can be used either when the list of retrieved elements is shown to searchers (Kamps, 2008) (which is essentially the same use of summarisation as in document retrieval) or when the retrieved elements are presented to searchers in the context of other elements from the same document. This thesis has focused on the latter use of text summarisation, i.e. when the context is provided by the structural overview of the whole document, and found that, according to searchers, summaries of the textual content of elements can help searchers reach the relevant content within XML documents effectively. The identification of the importance of such overviews led to the design of subsequent studies, which concentrated on structural overviews presented to searchers as tables of contents.

7.1.1.2
The importance of structural overviews of documents
As indicated by the findings of the text summarisation study presented in Chapter 4, it is important to provide searchers with overviews of the logical structure of documents. Content summaries can be effective when a well-designed structural overview is also presented to searchers. Structural overviews are displayed as tables of contents (ToCs) in this thesis. A ToC is helpful to searchers as it provides a general overview of the structure of the document, but it can also direct the searchers' attention to the relevant content within documents.

7.1.1.3
Recommendations for the presentation of XML document overviews
The thesis has identified several design issues with regards to interactive XML retrieval systems that display element text summaries and structural overviews of documents. To be able to investigate various systems and interfaces, it is inevitable to design them so that no unwanted
effects caused by bad design are introduced into searchers' behaviour. The findings of this thesis with respect to design include the following:

• Elements returned by an XML retrieval system should be displayed in the context of other retrieved elements from the same document, i.e. they should be grouped when presenting search results.

• A structural overview, i.e. a table of contents (ToC), can be more effective if its labels, e.g. the titles of sections in the ToC, are highly informative.

• ToCs should be automatically generated; they should include relevant elements, but their composition should also depend on the length and depth of elements. For example, a long element that is near the top of the hierarchical logical structure but not returned by the retrieval system should be included in the table of contents. Such elements serve as context for other, relevant, elements.

• Text summaries should be displayed for all items in the ToCs or, alternatively, they should be completely omitted.

• Searchers prefer not-relevant elements in the ToC to relevant elements that are not displayed, as missing relevant information is considered worse than displaying not-relevant information.
7.1.2
Structure summarisation
Regarding structure summarisation, this thesis has aimed at investigating how structure summaries, which are presented to searchers as tables of contents, can be created, evaluated and used. The importance of summarising the logical structure has been identified in the text summarisation study in Chapter 4. Structure summarisation has been investigated in Chapters 5 and 6 in a user-based environment, and also when summaries are examined without direct user involvement. The following sections present the contributions of this thesis with respect to structure summarisation.

7.1.2.1
Summarisation of the logical structure of XML documents
This thesis has challenged the static nature of tables of contents (ToCs) which are considered as summaries of the logical structure of documents. Traditionally, a ToC remains the same under
every situation, which is not sufficient for dynamically changing information needs and a constantly increasing availability of information. The usefulness of ToCs is evident, as ToCs have been used for centuries, and they are also useful in XML retrieval (Kim and Son, 2004; Kazai and Trotman, 2007). When a ToC is not available for a document, it should be created automatically, considering various features of elements, such as their relevance to the query, their depth in the logical structure, or the length of their textual contents. This thesis has presented two structure summarisation methods that have been used to create such tables of contents.

7.1.2.2
The importance of various element features for structure summarisation
As mentioned above, ToCs should not be static or manually defined. The two summarisers described in this thesis use features of elements to automatically create ToCs. This thesis has found that the most important element feature to consider is the relevance of elements to a given query. In addition to being query-based, ToCs should also contain other elements with which a better overview of the logical structure can be obtained. These additional elements provide context for relevant elements. This thesis has also investigated features that can be used to select elements that provide an appropriate context for relevant elements. Through user studies, the importance of the depth and length features has been identified. Also, various other feature candidates for structure summarisation have been examined using various data sets, and additional features have been used in a structure summariser. Through the evaluation of summaries created by using various feature combinations, the following conclusions could be drawn:

• The use of any individual feature does not produce stable results. In other words, considering only one feature, such as the type of the element or its depth, is not sufficient to create high-quality ToCs.

• The use of more features in combination yields more stable evaluation results, but too many features can lead to lower precision scores.

• The combination of four to seven features has been proposed. These features can be combined with the relevance feature and investigated further in the future.

• The features that have been useful for gaining high evaluation scores are the depth of the element (D, in Chapter 6), the type of the element (Ty), the presence of explicit title text (Ti), and the number of sibling elements (Si).
7.1.2.3 Methodologies for structure summarisation
The importance of structure summarisation has been identified in this thesis. As structure summarisation had not previously been studied extensively in the context of structured document retrieval, attention needed to be paid to the design of experiments investigating ToC generation. This thesis has introduced a way to investigate element features for this purpose. The investigation made it possible to create and analyse automatically generated ToCs, and to involve human searchers in the process.

A methodology for using alternative resources for structure summarisation has also been developed. These resources have been used to investigate element features and to provide training data for structure summarisation. Alternative data sets were needed for training because no direct training data sets were available for the automatic summarisation of the logical structure of documents; following the traditional way of training and testing, a structure summariser (i.e. a ToC generator) would have to be trained using example tables of contents. A methodology to evaluate structure summarisation has also been presented, which included creating an evaluation data set consisting of manually determined ToCs, as well as adapting and using evaluation measures from text summarisation and information retrieval.

7.1.2.4 The use of alternative training data sets
In this thesis, it has been found that alternative data sets can be used effectively to train a structure summariser. As no large sets of manually created ToCs were available for XML documents before the work presented in this thesis, and as creating such ToC data sets is expensive and time consuming, it is important to determine whether other, already existing, data sets can also be used to train a structure summariser. In this thesis, retrieval result sets and relevance assessments from XML retrieval have been used as alternative training data sets, and it has been shown that they can be used as effectively as manually created example ToC sets.

7.1.2.5 A test collection for structure summarisation evaluation
While developing an evaluation methodology for structure summarisation, this thesis also produced structure summaries (ToCs), created with the involvement of human assessors, for a number of example XML documents. This manual ToC set can be used to evaluate and compare other ToC generation methods in the future. Although the ToCs are not based on queries, the set of manually created ToCs and the corresponding XML documents form a small test collection for query independent structure summarisation.
7.1.2.6 Characteristics of “ideal” ToCs
This thesis has also examined the characteristics of generated ToCs. This information can be used as a guideline for designing structural overviews for interactive retrieval systems in the future. It has been found that a ToC is not needed if the content of a document can be displayed in a single window, i.e. when the searcher does not need to scroll in the document view to access certain parts of the document. It has also been found that, in general, a ToC should be 3-5 levels deep. A ToC should generally contain at most 20 items (section, subsection, etc. titles), because if it is too large, its effectiveness in providing an overview of the logical structure, and hence in helping searchers access the relevant content, decreases. An “ideal” ToC usually includes 2-9% of the elements of the document. ToCs do not depend significantly on individual searchers, but they should depend on the searchers’ queries.
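The numerical guidelines above (depth, item count, coverage) can be encoded as a simple check on a generated ToC. The function and parameter names below are invented for illustration; only the thresholds come from the findings reported here.

```python
# Hedged sketch: checking a generated ToC against the "ideal"
# characteristics reported above (3-5 levels deep, at most 20 items,
# covering 2-9% of the document's elements). Names are hypothetical.

def meets_toc_guidelines(n_toc_items, toc_depth, n_doc_elements):
    """Return True if a ToC satisfies the reported 'ideal' characteristics."""
    coverage = n_toc_items / n_doc_elements  # fraction of elements in the ToC
    return (toc_depth <= 5
            and n_toc_items <= 20
            and 0.02 <= coverage <= 0.09)
```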
7.2 Future work
In this section, future work arising from this thesis is outlined. First, future work with respect to content summarisation is discussed, followed by future work regarding structure summarisation and various interface issues. Finally, applications of the findings of this thesis are listed.
7.2.1 Content summarisation
As text summarisation has not been investigated widely in the context of XML retrieval, particularly with the involvement of searchers, the text summarisation presented in this thesis did not use state of the art summarisation methods, nor did it consider the logical structure when creating textual summaries. It is believed that text summarisation methods that exploit the logical structure of XML documents can be more effective in interactive XML retrieval. The use of the relationships among various XML elements, combined with state of the art (flat) text summarisation methods, is expected to facilitate the interactive searching process effectively. Structural features to consider in summarisation may include:

• A range of properties of elements within the element being summarised. For example, title words of the subsections of a section to be summarised might be more important than other words, the number of subsections might determine the length of the section’s summary, and the first sentences of subsections might be more useful to include in a section summary. Also, the use of the Resource Description Framework (RDF)1 may provide additional information about XML elements. For example, the Wikipedia documents are also available in an RDF version2 which may be used to supplement the more document centric XML documents used in this thesis.

• Properties of other, related, elements. For example, the contents of sibling elements might be used to highlight content of a section that is not discussed in subsequent sections, and words in the titles of ancestor elements might be important for finding summary sentences in an element’s textual content.

The problem of overlapping elements in XML retrieval has been investigated by researchers of the INEX community for several years (Kazai et al., 2004; Clarke, 2005). The decision whether to return, e.g. a section or its subsection(s) when all are found to be relevant is not straightforward. The tables of contents of this thesis have offered a way to address this problem by showing all important elements from a document. However, it is still not well known how searchers would react to textual summaries of overlapping elements. For example, when a summary of a section (included in the ToC) is displayed, do searchers expect the sentences of the corresponding textual summary to be those found most important in the subsections of that section (possibly also shown in the ToC), i.e. should textual summaries be created by aggregating summary sentences from descendant elements? Alternatively, should a summary concentrate on the overall content of the section without considering what the summary sentences of its subsections are? The investigation of this problem could also provide indications for XML retrieval.

Redundancy removal plays an important role in text summarisation, where the aim is to minimise redundancy within summaries, e.g. by ensuring that the information contained in the first summary sentence is not repeated in subsequent summary sentences.
As summaries of sections and subsections can be displayed one after another (e.g. when a searcher moves the mouse pointer from one section title in the ToC to another, see Chapter 4), it might be necessary to investigate whether searchers prefer redundancy removal across, e.g., summaries of subsequent (sibling) elements, so that the summary of Section Two focuses on its own content but does not include sentences discussing topics that are also mentioned in Section One.

1 http://www.w3.org/RDF/
2 http://labs.systemone.at/wikipedia3
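One simple form of the cross-sibling redundancy removal discussed above can be sketched as a greedy filter that drops a sentence if it overlaps too strongly with any sentence already kept for an earlier sibling. This is a minimal sketch under assumptions of my own (word-set Jaccard overlap, a fixed threshold); it is not a method proposed in the thesis.

```python
# Hedged sketch: removing redundancy between the summaries of sibling
# elements by skipping sentences whose word overlap with sentences kept
# for earlier siblings exceeds a threshold. All names are illustrative.

def word_overlap(a, b):
    """Jaccard overlap between the word sets of two sentences."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def deduplicate_summaries(sibling_summaries, threshold=0.5):
    """Drop sentences already covered by an earlier sibling's summary."""
    seen, result = [], []
    for summary in sibling_summaries:
        kept = [s for s in summary
                if all(word_overlap(s, prev) < threshold for prev in seen)]
        seen.extend(kept)
        result.append(kept)
    return result
```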
7.2.2 Structure summarisation
With respect to structure summarisation, several questions that are not covered in this thesis remain to be investigated. For example, Chapter 5 has found that relevance information should be given high weight when creating ToCs for documents in XML retrieval, and Chapter 6 has identified that combinations of various element features can lead to increased effectiveness in ToC generation. Based on the above, it is straightforward to combine the findings of the two chapters: investigating combinations of several features (including features not used in this thesis, e.g. those mentioned above in the context of text summarisation) together with relevance information might be worthwhile.

The INEX 2006 Interactive track (Malik et al., 2006b) has compared passage retrieval with element retrieval, where ToCs were displayed for both of these retrieval types. However, the passage retrieval ToCs have been ‘almost flat’, i.e. apart from the title of the whole document, a flat list of passages has been shown to searchers as the ToC. A normal ToC usually has a deeper hierarchical structure, through which a more complete overview of the document’s logical structure can be gained. Hence, it may be necessary to generate hierarchical ToCs for documents for which only a shallow structure is provided. Based on similarities among passages or, for instance, on changes in topic, hierarchical tables of contents can be generated for documents that are ‘almost flat’, e.g. when the document only contains a list of paragraphs that are not organised into sections, chapters, etc. Using text segmentation, ToCs could also be created for completely flat documents, i.e. where the document does not contain any explicit information about its logical structure.
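The idea of grouping an ‘almost flat’ list of passages into sections at topic shifts can be sketched as follows. This is a hypothetical illustration, not the INEX approach or a thesis method: the similarity measure (word-set Jaccard) and the threshold are assumptions, standing in for a proper text segmentation algorithm such as TextTiling.

```python
# Hedged sketch: building sections over a flat list of passages by
# starting a new section whenever the lexical similarity between
# adjacent passages drops below a threshold. Names are illustrative.

def similarity(a, b):
    """Jaccard similarity between the word sets of two passages."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def segment_passages(passages, threshold=0.2):
    """Group consecutive passages into sections at topic-shift boundaries."""
    sections, current = [], [passages[0]]
    for prev, nxt in zip(passages, passages[1:]):
        if similarity(prev, nxt) < threshold:  # topic shift: close the section
            sections.append(current)
            current = []
        current.append(nxt)
    sections.append(current)
    return sections
```

The resulting groups would form the intermediate level of a hierarchical ToC, with the passages themselves as leaves.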
7.2.3 Interface issues
As one of the findings of Chapter 5 indicates, when no title information is available, labels in the ToC need to be as informative as possible. It is necessary to investigate which label generation methods are more appreciated by searchers in order to provide more effective access to relevant content within documents. Example labels might include the first words of the content of the element, topic words (Lawrie et al., 2001), etc. This thesis has argued that the use of tables of contents is a straightforward and natural solution for giving overviews of the logical structure of documents. The display of ToCs could possibly be improved, e.g. by using methods that emphasise more important elements, rather than simply displaying ToC-worthy elements and omitting non-ToC-worthy ones.
For example, a fisheye-like display (Landauer et al., 1993) or the highlighting of elements of different quality (in a heat map manner, for instance) might also be appreciated by users. It might also be worth comparing ToC-type display with completely different visualisation methods in interactive XML retrieval.

This thesis has used a two stage model for accessing elements returned by an XML retrieval system: first the list of results is displayed, and then the content of the selected element is shown, accompanied by the structural overview of the document. These two stages might be combined, so that the result list becomes a unified table of contents. Retrieved documents or sections could be organised according to topic (e.g. by using a hierarchy of topics (Lawrie et al., 2001)) as the higher levels of the unified ToC, while elements relevant to the query could be displayed at deeper levels, hierarchically structured according to the logical structure of the selected elements.
7.2.4 Applications
As the investigations of this thesis have been carried out using two document collections, i.e. the IEEE and Wikipedia collections, several other document collections containing hierarchically structured documents might also be examined in the context of text and structure summarisation. Other collections could include collections of books, for which tables of contents, especially if query based, are expected to be particularly useful. In addition, a table of contents of a book with section summaries associated with the sections, subsections, etc. in the ToC can also help searchers find the relevant parts of books.

Although the currently existing XML document collections have reasonably similar structure, different collections usually do not share element names, e.g. a section can be called sec in one collection and s in another. Sometimes, the names and structure also differ within collections. Using the findings of this thesis, content and structure summaries that are completely independent of element names (types) can easily be created, and summaries can be investigated for heterogeneous document collections.

In the structure summarisation studies presented in this thesis, there was a possibility that an item in the ToC referred not to a section, subsection or paragraph, but to a multimedia element (in the case of the collections used, an image). Overviews considering multimedia elements could also be investigated, where, e.g. instead of displaying a textual summary, a preview of the image is shown as a tool tip “summary”, as the system adapts the summary type to that of
the content of the element. Thumbnail type previews of images are often used on the Web; these might be suitable as image type summaries.

This thesis has identified several element features that are useful for ToC generation. It has also been found that data sets from XML retrieval can successfully be used in structure summarisation. It might also be the case that findings from structure summarisation can be fed back to XML retrieval. For example, length is already used in retrieval as a prior in language modelling approaches (Hiemstra, 2003). Element type is also used, as several search engines return only, e.g., paragraphs (Ashoori and Lalmas, 2006; Crouch et al., 2006). Other features identified as useful for structure summarisation, e.g. depth or the presence of a title, could also be used in XML retrieval algorithms.
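To make the length-prior idea concrete, the following is a minimal sketch of an element ranker in the spirit of language modelling approaches such as Hiemstra (2003): a query log-likelihood with Dirichlet smoothing plus a log length prior. The smoothing constant, the exact prior form, and all names are illustrative assumptions, not a specific system.

```python
import math

# Hedged sketch: a language-modelling element score with a length prior.
# score(e) = log P(q | e) + log |e|, with Dirichlet smoothing.
# Parameter values and the prior form are assumptions for illustration.

def score_element(query_terms, element_terms, collection_freq, mu=100.0):
    """Return log P(q | e) under Dirichlet smoothing, plus a log length prior."""
    n = len(element_terms)
    log_likelihood = 0.0
    for term in query_terms:
        tf = element_terms.count(term)  # term frequency in the element
        # Dirichlet-smoothed term probability, backing off to collection stats
        p = (tf + mu * collection_freq.get(term, 1e-6)) / (n + mu)
        log_likelihood += math.log(p)
    # longer elements receive a higher prior probability of relevance
    return log_likelihood + math.log(n)
```

A similar prior could, in principle, be attached to the other ToC features mentioned above (depth, presence of a title), which is the feedback direction suggested in the text.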
7.3 Conclusions
This thesis has investigated summarisation as a means to help searchers of XML retrieval systems in the process of accessing the contents of document portions. Two types of summarisation have been investigated.

First, summaries of the textual contents of document portions, called XML elements, have been studied in a user-based environment. Traditionally, summarisation is associated with whole documents or document sets, but rarely with document portions. As summaries of documents have proved useful in whole document retrieval, it has been considered worthwhile to investigate summaries of document portions in XML element retrieval. Summaries of elements have been presented to searchers in the context of other elements from the document. The textual summaries of elements reflected the searchers’ information needs: they have been query based. The analysis of the results of this study has revealed that text summarisation can be useful in XML retrieval, provided that the interactive retrieval system, which also displays the retrieved elements in context, is well designed. This thesis has identified several recommendations for the design of interactive XML retrieval systems; among these, it has been found that the table of contents should not be static, and that ToC generation needs to consider several features of the XML elements, such as the length of their textual contents or their depth in the logical structure.

The second type of summarisation investigated in this thesis is called structure summarisation. The automatic generation of tables of contents, as structure summaries, has been described and examined. ToC generation has been studied both when searchers’ queries are available (query
based structure summarisation) and when queries are not used (query independent structure summarisation). Query based structure summarisation has been investigated in a user study where searchers were asked to indicate their preferences for various element features, including those identified previously in the text summarisation study. With the involvement of searchers, “ideal” ToCs have been generated and analysed. The analysis has shown that, when ToCs are created, relevance information is considered highly important by searchers, which indicates that the structural overview of documents should be based on the searcher’s information need if it is available.

With respect to query independent structure summarisation, in which it is assumed that the searcher’s query is not available or the document is being browsed without any particular information need in mind, several element features have been investigated and analysed, and their effectiveness in ToC generation evaluated. A trainable structure summariser has been introduced. Various data sets obtained from the area of XML retrieval have been used for training. Their effectiveness with respect to structure summarisation has been compared to that of proper training data, namely a set of example structure summaries. The example summaries have been created by human assessors, and they have also been used for evaluation. The set of these summaries, together with the source XML documents, has provided a test collection for structure summarisation.

The results of the evaluation have shown that, to generate high quality structure summaries (ToCs), several element features need to be combined. Also, it has been reported that alternative data sets, such as retrieval result sets and relevance assessments from XML retrieval, can be used effectively to train a structure summariser.
Bibliography

S. Aamodt. Welcome to Your Brain. Rider & Co, London, 2008. ISBN 1846040779.
S. Abiteboul. Querying semi-structured data. In F. N. Afrati and P. G. Kolaitis, editors, Database Theory - ICDT ’97, 6th International Conference, Delphi, Greece, January 8-10, 1997, Proceedings, volume 1186 of Lecture Notes in Computer Science, pages 1–18. Springer, 1997. ISBN 3-540-62222-5.
M. Abolhassani, N. Fuhr, N. Gövert, and K. Großjohann. HyREX: Hypermedia retrieval engine for XML. Research report, University of Dortmund, Department of Computer Science, Dortmund, Germany, 2002.
J. Adiego, G. Navarro, and P. de la Fuente. Using structural contexts to compress semistructured text collections. Information Processing and Management, 43(3):769–790, 2007. ISSN 0306-4573.
A. Alam, A. Kumar, M. Nakamura, A. F. Rahman, Y. Tarnikova, and C. Wilcox. Structured and unstructured document summarization: Design of a commercial summarizer using lexical chains. In ICDAR, pages 1147–1152. IEEE Computer Society, 2003a. ISBN 0-7695-1960-1.
H. Alam and F. Rahman. Web document manipulation for small screen devices: A review. In Web Document Analysis Workshop (WDA), 2003.
H. Alam, R. Hartono, A. Kumar, F. Rahman, Y. Tarnikova, and C. Wilcox. Web page summarization for handheld devices: A natural language approach. In ICDAR ’03: Proceedings of the Seventh International Conference on Document Analysis and Recognition, page 1153, Washington, DC, USA, 2003b. IEEE Computer Society.
M. Amini, A. Tombros, N. Usunier, M. Lalmas, and P. Gallinari. Learning to summarise XML documents by combining content and structure features (poster). In CIKM’05, Bremen, Germany, October 2005.
M. R. Amini, A. Tombros, N. Usunier, and M. Lalmas. Learning-based summarisation of XML documents. Information Retrieval, 10(3):233–255, 2007. ISSN 1386-4564.
K. Andrews. Browsing, Building, and Beholding Cyberspace: New Approaches to the Navigation, Construction, and Visualisation of Hypermedia on the Internet. PhD thesis, Graz University of Technology, 1996.
E. Ashoori and M. Lalmas. Using topic shifts in XML retrieval at INEX 2006. In Fuhr et al. (2007), pages 261–270. ISBN 978-3-540-73887-9.
R. A. Baeza-Yates and B. A. Ribeiro-Neto. Modern Information Retrieval. ACM Press / Addison-Wesley, 1999. ISBN 0-201-39829-X.
R. Barzilay and M. Elhadad. Using lexical chains for text summarization. In Intelligent Scalable Text Summarization Workshop (ISTS’97), ACL Madrid, 1997.
S. Betsi, M. Lalmas, A. Tombros, and T. Tsikrika. User expectations from XML element retrieval. In SIGIR ’06: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, pages 611–612, New York, NY, USA, 2006. ACM. ISBN 1-59593-369-7.
M. P. S. Bhatia and A. Kumar Khalid. A primer on the web information retrieval paradigm. Journal of Theoretical and Applied Information Technology, xx(xx):657–662, 2008.
T. W. Bickmore, A. Girgensohn, and J. W. Sullivan. Web page filtering and re-authoring for mobile users. Comput. J., 42(6):534–546, 1999.
A. L. Blum and P. Langley. Selection of relevant features and examples in machine learning. Artificial Intelligence, 97(1-2):245–271, 1997. ISSN 0004-3702.
P. Borlund. The concept of relevance in IR. J. Am. Soc. Inf. Sci. Technol., 54(10):913–925, 2003a. ISSN 1532-2882.
P. Borlund. The IIR evaluation model: a framework for evaluation of interactive information retrieval systems. Information Research, 8(3), 2003b. doi: http://informationr.net/ir/8-3/paper152.html.
P. Borlund and P. Ingwersen. Measures of relative relevance and ranked half-life: performance indicators for interactive IR. In Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, pages 324–331. ACM Press, 1998. ISBN 1-58113-015-5.
P. Borlund and P. Ingwersen. The development of a method for the evaluation of interactive information retrieval systems. Journal of Documentation, 53(3):225–250, 1997.
R. Brandow, K. Mitze, and L. F. Rau. Automatic condensation of electronic publications by sentence selection. Information Processing and Management, 31(5):675–685, September 1995.
T. Bray, J. Paoli, and C. M. Sperberg-McQueen. Extensible Markup Language (XML) 1.0 - W3C recommendation 10-February-1998. Technical Report REC-xml-19980210, 1998.
C. Buckley and E. M. Voorhees. Evaluating evaluation measure stability. In Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval, pages 33–40. ACM Press, 2000. ISBN 1-58113-226-3.
O. Buyukkokten, H. Garcia-Molina, and A. Paepcke. Seeing the whole in parts: text summarization for web browsing on handheld devices. In WWW ’01: Proceedings of the 10th international conference on World Wide Web, pages 652–662, New York, NY, USA, 2001a. ACM. ISBN 1-58113-348-0.
O. Buyukkokten, H. Garcia-Molina, and A. Paepcke. Text summarization of Web pages on handheld devices. In NAACL-WS2001A, 2001b.
D. Byrd. A scrollbar-based visualization for document navigation. In DL ’99: Proceedings of the fourth ACM conference on Digital libraries, pages 122–129, New York, NY, USA, 1999. ACM. ISBN 1-58113-145-3.
I. J. L. Campista. Semi-automatic summarization of XML collections. Master’s thesis, University of Duisburg-Essen, IIIS, 2005.
M. Cannataro, C. Comito, and A. Pugliese. Squeezex: Synthesis and compression of XML data. In ITCC, pages 326–331. IEEE Computer Society, 2002. ISBN 0-7695-1506-1.
L. Carlson, J. M. Conroy, D. Marcu, D. P. O’Leary, M. E. Okurowski, A. Taylor, and W. Wong. An empirical study of the relation between abstracts, extracts, and the discourse structure of texts. Proceedings of the Document Understanding Conference (DUC-2001), New Orleans, LA, Sept 13-14, 2001.
C. Castillo, D. Donato, L. Becchetti, P. Boldi, S. Leonardi, M. Santini, and S. Vigna. A reference collection for web spam. SIGIR Forum, 40(2):11–24, December 2006.
M. Chalmers and P. Chitson. Bead: explorations in information visualization. In SIGIR ’92: Proceedings of the 15th annual international ACM SIGIR conference on Research and development in information retrieval, pages 330–337, New York, NY, USA, 1992. ACM. ISBN 0-89791-523-2.
H. Chen and S. Dumais. Bringing order to the web: automatically categorizing search results. In CHI ’00: Proceedings of the SIGCHI conference on Human factors in computing systems, pages 145–152, New York, NY, USA, 2000. ACM. ISBN 1-58113-216-6.
W. T. Chuang and J. Yang. Extracting sentence segments for text summarization: A machine learning approach. In SIGIR 2000, pages 152–159, 2000.
C. Clarke, J. Kamps, and M. Lalmas. INEX 2006 retrieval task and result submission specification, April 2006. URL http://inex.is.informatik.uni-duisburg.de/2006/inex06/pdf/INEX06_Tasks_v1.pdf.
C. Clarke, J. Kamps, and M. Lalmas. INEX 2007 retrieval task and result submission specification, April 2007. URL http://inex.is.informatik.uni-duisburg.de/2007/inex07/pdf/INEX07_Tasks_v1.pdf.
C. L. A. Clarke. Controlling overlap in content-oriented XML retrieval. In SIGIR ’05: Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, pages 314–321, New York, NY, USA, 2005. ACM. ISBN 1-59593-034-5.
C. Cleverdon. The Cranfield tests on index language devices. AsLib Proceedings, 19(6):173–194, 1967.
J. Cohen. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37, 1960.
S. Comai, S. Marrara, and L. Tanca. Representing and querying summarized XML data. In V. Marík, W. Retschitzegger, and O. Stepánková, editors, DEXA, volume 2736 of Lecture Notes in Computer Science, pages 171–181. Springer, 2003. ISBN 3-540-40806-1.
S. Comai, S. Marrara, and L. Tanca. XML document summarization: Using XQuery for synopsis creation. In DEXA Workshops, pages 928–932, 2004.
W. S. Cooper. Expected search length: A single measure of retrieval effectiveness based on the weak ordering action of retrieval systems. American Documentation, 19(1):30–41, 1968.
W. S. Cooper. A definition of relevance for information retrieval. Information Storage and Retrieval, (7):19–37, 1971.
E. Cosijn and P. Ingwersen. Dimensions of relevance. Information Processing and Management, 36(4):533–550, 2000. ISSN 0306-4573.
C. J. Crouch, D. B. Crouch, M. Ganapathibhotla, and V. Bakshi. Dynamic element retrieval in a semi-structured collection. In Fuhr et al. (2007), pages 82–88. ISBN 978-3-540-73887-9.
D. R. Cutting, D. R. Karger, J. O. Pedersen, and J. W. Tukey. Scatter/Gather: a cluster-based approach to browsing large document collections. In SIGIR ’92: Proceedings of the 15th annual international ACM SIGIR conference on Research and development in information retrieval, pages 318–329, New York, NY, USA, 1992. ACM. ISBN 0-89791-523-2.
T. Dalamagas, T. Cheng, K-J. Winkel, and T. K. Sellis. Clustering XML documents using structural summaries. In W. Lindner, M. Mesiti, C. Türker, Y. Tzitzikas, and A. Vakali, editors, EDBT Workshops, volume 3268 of Lecture Notes in Computer Science, pages 547–556. Springer, 2004. ISBN 3-540-23305-9.
L. Denoyer and P. Gallinari. The Wikipedia XML Corpus. SIGIR Forum, 40(1):64–69, 2006.
M. Dodge. Information maps: tools for document exploration. Working paper, Centre for Advanced Spatial Analysis (UCL), London, UK, 2005.
R. L. Donaway, K. W. Drummey, and L. A. Mather. A comparison of rankings produced by summarization evaluation measures. In NAACL-WS2000A, pages 69–78, 2000.
P. Dopichaj. Element retrieval in digital libraries: Reality check. In SIGIR 2006 Workshop on XML Element Retrieval Methodology, 2006.
S. Dumais, E. Cutrell, and H. Chen. Optimizing search by showing results in context. In CHI ’01: Proceedings of the SIGCHI conference on Human factors in computing systems, pages 277–284, New York, NY, USA, 2001. ACM Press.
E. Duval, W. Hodgins, S. Sutton, and S. L. Weibel. Metadata principles and practicalities. D-Lib Magazine, 8(4), April 2002.
S. Dziadosz and R. Chandrasekar. Do thumbnail previews help users make better relevance decisions about web search results? In SIGIR 2002, pages 365–366, 2002.
H. P. Edmundson. New methods in automatic extracting. J. ACM, 16(2):264–285, 1969. ISSN 0004-5411.
H. P. Edmundson and R. E. Wyllys. Automatic abstracting and indexing - a survey and recommendations. Commun. ACM, 4(5):226–234, 1961. ISSN 0001-0782.
C. Elkan. Naive bayesian learning. Technical report, 1997.
K. N. Fachry, J. Kamps, M. Koolen, and J. Zhang. The University of Amsterdam at INEX 2007. In N. Fuhr, M. Lalmas, and A. Trotman, editors, Pre-Proceedings of INEX 2007, pages 388–402, 2007.
D. C. Fallside and P. Walmsley. XML schema part 0: Primer second edition, October 2004.
N. Fatemi, M. Lalmas, and T. Roelleke. How to retrieve multimedia documents described by MPEG-7. In Advanced Information Systems Engineering, 16th International Conference, CAiSE04, 2004.
N. Fielden. History of information retrieval systems & increase of information over time. In Biennial California Academic & Research Librarians (CARL) (poster), 2002.
K. Finesilver and J. Reid. User behaviour in the context of structured documents. In Sebastiani (2003), pages 104–119. ISBN 3-540-01274-5.
G. Fischer and I. J. L. Campista. A template-based approach to summarize XML collections. In Proceedings of LWA 2005, October 10-12, Saarbrücken, Germany, 2005.
J. L. Fleiss. Measuring agreement between two judges on the presence or absence of a trait. Biometrics, 31(3):651–659, 1975.
I. Frommholz and R. Larson. Report on the INEX 2006 heterogeneous collection track. SIGIR Forum, 41(1):75–78, 2007. ISSN 0163-5840.
N. Fuhr and K. Großjohann. XIRQL: A query language for information retrieval in XML documents. In Proceedings of the 24th Annual International Conference on Research and Development in Information Retrieval, pages 172–180, 2001.
N. Fuhr, N. Gövert, G. Kazai, and M. Lalmas. INEX: Initiative for the evaluation of XML retrieval. In ACM SIGIR Workshop on XML and Information Retrieval, 2002a.
N. Fuhr, N. Gövert, and K. Großjohann. HyREX: Hyper-media retrieval engine for XML. In Proceedings of the 25th Annual International Conference on Research and Development in Information Retrieval, page 449, 2002b. Demonstration.
N. Fuhr, C.-P. Klas, A. Schaefer, and P. Mutschke. Daffodil: An integrated desktop for supporting high-level search activities in federated digital libraries. In M. Agosti and C. Thanos, editors, ECDL, volume 2458 of Lecture Notes in Computer Science, pages 597–612. Springer, 2002c. ISBN 3-540-44178-6.
N. Fuhr, N. Gövert, G. Kazai, and M. Lalmas, editors. INitiative for the Evaluation of XML Retrieval (INEX). Proceedings of the First INEX Workshop. Dagstuhl, Germany, December 8–11, 2002, ERCIM Workshop Proceedings, Sophia Antipolis, France, March 2003. ERCIM.
N. Fuhr, M. Lalmas, and S. Malik, editors. INitiative for the Evaluation of XML Retrieval (INEX). Proceedings of the Second INEX Workshop. Dagstuhl, Germany, December 15–17, 2003, March 2004.
N. Fuhr, M. Lalmas, S. Malik, and Z. Szlávik, editors. Proceedings of INEX 2004, volume 3493, 2005.
N. Fuhr, M. Lalmas, S. Malik, and G. Kazai, editors. Proceedings of INEX 2005, volume 3977, 2006.
N. Fuhr, M. Lalmas, and A. Trotman, editors. Comparative Evaluation of XML Information Retrieval Systems, 5th International Workshop of the Initiative for the Evaluation of XML Retrieval, INEX 2006, Dagstuhl Castle, Germany, December 17-20, 2006, Revised and Selected Papers, volume 4518 of Lecture Notes in Computer Science, 2007. Springer. ISBN 978-3-540-73887-9.
M. K. Ganapathiraju. Relevance of cluster size in MMR based summarizer: A report. Research report, http://www-2.cs.cmu.edu/ madhavi/11-742/report.pdf, 2002.
D. Geiger, M. Goldszmidt, G. Provan, P. Langley, and P. Smyth. Bayesian network classifiers. In Machine Learning, pages 131–163, 1997.
J. Goldstein, M. Kantrowitz, V. Mittal, and J. Carbonell. Summarizing text documents: sentence selection and evaluation metrics. In SIGIR’99, pages 121–128. ACM Press, 1999. ISBN 1-58113-096-1.
N. Gövert and G. Kazai. Overview of the INitiative for the Evaluation of XML retrieval (INEX) 2002. In Fuhr et al. (2003), pages 1–17.
N. Gövert, N. Fuhr, M. Lalmas, and G. Kazai. Evaluating the effectiveness of content-oriented XML retrieval methods. Inf. Retr., 9(6):699–722, 2006. ISSN 1386-4564.
K. Großjohann, N. Fuhr, D. Effing, and S. Kriewel. Query formulation and result visualization for XML retrieval. In Proceedings ACM SIGIR 2002 Workshop on XML and Information Retrieval. ACM, 2002.
K. Hagerty. Abstracts as a basis for relevance judgment. Technical report, Chicago Univ., IL. Graduate Library School, Chicago, 1967. WP-380-5.
B. Hammer-Aebi, K. W. Christensen, H. Lund, and B. Larsen. Users, structured documents and overlap: interactive searching of elements and the influence of context on search behaviour. In Proceedings of IIiX, pages 46–55, 2006. ISBN 1-59593-482-0.
D. J. Hand and K. Yu. Idiot’s Bayes - not so stupid after all? International Statistical Review, 69(3):385–398, 2001. doi: 10.1111/j.1751-5823.2001.tb00465.x.
D. Harman. Special issue on text summarization. Information Processing and Management, 43(6):1441–1834, November 2007.
D. Harman and P. Over. The effects of human variation in DUC summarisation evaluation. In Proceedings of the ACL-04, pages 20–17, 2004.
D. J. Harper, I. Koychev, and S. Yixing. Query-based document skimming: A user-centred evaluation of relevance profiling. In Sebastiani (2003), pages 377–392. ISBN 3-540-01274-5.
204 M. Hassel and H. Dalianis. Generation of reference summaries. In Proceedings of 2nd Language and Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics, Poznan, Poland, April 2005. M. A. Hearst. TileBars: visualization of term distribution information in full text information access. In CHI ’95: Proceedings of the SIGCHI conference on Human factors in computing systems, pages 59–66, New York, NY, USA, 1995. ACM Press/Addison-Wesley Publishing Co. ISBN 0-201-84705-1. M. Hemmje. Lyberworld: a 3D graphical user interface for fulltext retrieval. In CHI ’95: Conference companion on Human factors in computing systems, pages 417–418, New York, NY, USA, 1995. ACM. ISBN 0-89791-755-3. M. Heo, E. Morse, S. Willms, and M. B. Spring. Multi-level navigation of a document space. In WEBNET 96, San Francisco, California, USA, 1996. W. R. Hersh. Trec-2002 interactive track report. In Text REtrieval Conference, 2000. W. R. Hersh and P. Over. Trec-2001 interactive track report. In Text REtrieval Conference, 2000a. W. R. Hersh and P. Over. Trec-8 interactive track report. In Text REtrieval Conference, 1999. W. R. Hersh and P. Over. Trec-9 interactive track report. In Text REtrieval Conference, pages 41–50, 2000b. D. Hiemstra. Statistical language models for intelligent xml retrieval. In H. M. Blanken, T. Grabs, H.-J. Schek, R. Schenkel, and G. Weikum, editors, Intelligent Search on XML Data, volume 2818 of Lecture Notes in Computer Science, pages 107–118. Springer, 2003. T. Hirao, Y. Sasaki, H. Isozaki, and E. Maeda. NTT’s text summarization system for DUC-2002. In DUC2002, 2002. G. Hripcsak and A. S. Rothschild. Agreement, the F-measure, and reliability in information retrieval. J Am Med Inform Assoc, 12(3):296–298, May 2005. R. J. Hunt. Percent agreement, Pearson’s correlation, and kappa as measures of inter-examiner reliability. J Dent Res, 65(2):128–130, 1986.
205 B. J. Jansen, A. Spink, J. Bateman, and T Saracevic. Real life information retrieval: A study of user queries on the web. SIGIR Forum, 32(1):5–17, 1998. K. J¨arvelin and J. Kek¨al¨ainen. Cumulated gain-based evaluation of ir techniques. ACM Trans. Inf. Syst., 20(4):422–446, 2002. ISSN 1046-8188. K. Sp¨arck Jones. A statistical interpretation of term specificity and its application in retrieval. pages 132–142, 1988. M. Jones., G. Marsden, N. Mohd-Nasir, and G. Buchanan. A site based outliner for small screen web access (poster). In 8th World Wide Web conference, Toronto, 1999. ¨ J. Kamps. Presenting semi-structured text retrieval results. In Ling Liu and M. Tamer Ozsu, editors, Encyclopedia of Database Systems (EDS). Springer-Verlag, Heidelberg, 2008. J. Kamps and B. Larsen. Understanding differences between search requests in XML element retrieval. In Proceedings of the SIGIR 2006 Workshop on XML Element Retrieval Methodology, pages 13–19, 2006. J. Kamps and B. Sigurbj¨ornsson. What do users think of an XML element retrieval system? In Fuhr et al. (2006), pages 411–421. J. Kamps, M. de Rijke, and B. Sigurbj¨ornsson. Length normalization in XML retrieval. In Proceedings of ACM SIGIR, pages 80–87, 2004. ISBN 1-58113-881-4. J. Kamps, M. De Rijke, and B. Sigurbj¨ornsson. The importance of length normalization for XML retrieval. Inf. Retr., 8(4):631–654, 2005. ISSN 1386-4564. G. Kazai and M. Lalmas. eXtended cumulated gain measures for the evaluation of contentoriented XML retrieval. ACM Trans. Inf. Syst., 24(4):503–542, 2006. ISSN 1046-8188. G. Kazai and M. Lalmas. INEX 2005 evaluation measures. In Fuhr et al. (2005), pages 16–29. G. Kazai and A. Trotman. Users’ perspectives on the usefulness of structure for XML information retrieval. In Proceedings of the 1st International Conference on the Theory of Information Retrieval, 2007. G. Kazai, M. Lalmas, and J. Reid. Construction of a test collection for the focussed retrieval of structured documents. 
In Sebastiani (2003), pages 88–103. ISBN 3-540-01274-5.
206 G. Kazai, M. Lalmas, and A. P. de Vries. The overlap problem in content-oriented XML retrieval evaluation. In SIGIR ’04: Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, pages 72–79, New York, NY, USA, 2004. ACM. ISBN 1-58113-881-4. J. Kek¨al¨ainen and K. J¨arvelin. Using graded relevance assessments in ir evaluation. J. Am. Soc. Inf. Sci. Technol., 53(13):1120–1129, 2002. ISSN 1532-2882. H. Kim and H. Son. Interactive searching behavior with structured XML documents. In Fuhr et al. (2005), pages 424–436. R. Kohavi. Wrappers for performance enhancement and oblivious decision graphs. PhD thesis, Stanford, CA, USA, 1996. R. Kohavi. A study of cross-validation and bootstrap for accuracy estimation and model selection. In IJCAI, pages 1137–1145, 1995. S. Koshman, A. Spink, and B. J. Jansen. Web searching on the Vivisimo search engine. J. Am. Soc. Inf. Sci. Technol., 57(14):1875–1887, 2006. ISSN 1532-2882. J. Kupiec, J. Pedersen, and F. Chen. A trainable document summarizer. In SIGIR’95, pages 68–73. ACM Press, 1995. ISBN 0-89791-714-6. M. Lalmas. INEX 2005 retrieval task and result submission specification, 2005. M. Lalmas and B. Piwowarski. INEX 2005 relevance assessment guide. In INEX 2005 preproceedings, pages 391–400, 2005. M. Lalmas and B. Piwowarski. INEX 2006 relevance assessment guide. In INEX 2006 preproceedings, 2006. M. Lalmas and B. Piwowarski. INEX 2007 relevance assessment guide. In INEX 2007 preproceedings, pages 454–463, 2007. M. Lalmas and J. Reid. Automatic identification of best entry points for focused structured document retrieval. In CIKM ’03: Proceedings of the twelfth international conference on Information and knowledge management, pages 540–543, New York, NY, USA, 2003. ACM.
207 M. Lalmas and A. Tombros. Evaluating XML retrieval effectiveness at INEX. SIGIR Forum, 41 (1):40–57, 2007a. ISSN 0163-5840. M. Lalmas and A. Tombros. INEX 2002 - 2006: Understanding xml retrieval evaluation. In DELOS Conference on Digital Libraries, Tirrenia, Pisa, Italy, February 2007b. M. Lalmas, G. Kazai, J. Kamps, J. Pehcevski, B. Piwowarski, and S. Robertson. Inex 2006 evaluation measures. In Comparative Evaluation of XML Information Retrieval Systems: 5th International Workshop of the Initiative for the Evaluation of XML Retrieval, INEX 2006, Dagstuhl Castle, Germany, December 17-20, 2006, Revised and Selected Papers, volume 4518 of Lecture Notes in Computer Science, pages 20–34. Springer Berlin / Heidelberg, 2007a. URL http://hal.inria.fr/inria-00174121/en/. M. Lalmas, G. Kazai, J. Kamps, J. Pehcevski, B. Piwowarski, and S. Robertson. INEX 2006 evaluation measures. In Comparative Evaluation of XML Information Retrieval Systems: Fifth Workshop of the INitiative for the Evaluation of XML Retrieval (INEX 2006), 2007b. D. Lam, S. L. Rohall, C. Schmandt, and M. K. Stern. Exploiting e-mail structure to improve summarization. Technical paper, IBM Watson Research Center, 2002. T. Landauer, D. Egan, J. Remde, M. Lesk, C. Lochbaum, and D. Ketchum. Enhancing the usability of text through computer delivery and formative evaluation: Super book project. In J. Richardson C. McKnight, A. Dillon, editor, Hypertext: A psychological perspective, pages 71–136. New York: Ellis Horwood, 1993. B. Larsen and A. Trotman. INEX 2006 guidelines for topic development. In INEX 2006 Workshop Pre-Proceedings, Dagstuhl, Germany, 2006. B. Larsen, A. Tombros, and S. Malik. Obtrusiveness and relevance assessment in interactive XML IR experiments. In Proceedings of the INEX 2005 Workshop on Element Retrieval Methodology, 2005. B. Larsen, S. Malik, and A. Tombros. The interactive track at INEX 2005. In Fuhr et al. (2006), pages 398–410. B. Larsen, A. Tombros, and S. Malik. 
Is XML retrieval meaningful to users?: searcher preferences for full documents vs. elements. In SIGIR ’06: Proceedings of the 29th annual inter-
208 national ACM SIGIR conference on Research and development in information retrieval, pages 663–664, New York, NY, USA, 2006b. ACM. ISBN 1-59593-369-7. D. Lawrie, W. B. Croft, and A. Rosenberg. Finding topic words for hierarchical summarization. In SIGIR ’01: Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, pages 349–357. ACM Press, 2001. ISBN 1-58113-331-6. D. J. Lawrie. Language models for hierarchical summarization. PhD thesis, University of Massachusetts Amherst, 2003. Director-W. Bruce Croft. D. D. Lewis. Evaluating text categorization. In In Proceedings of Speech and Natural Language Workshop, pages 312–318. Morgan Kaufmann, 1991. C-Y. Lin. ROUGE: A package for automatic evaluation of summaries. In ACL-WS2004A, 2004. C.Y. Lin. Training a selection function for extraction. In CIKM 1999, pages 55–62, 1999. C.Y. Lin and E. Hovy. The potential and limitations of automatic sentence extraction for summarization. In HLT/NAACL-WS2003A, 2003. Z. Lin, B. He, and B. Choi. A quantitative summary of XML structures. In D. W. Embley, A. Oliv´e, and S. Ram, editors, ER, volume 4215 of Lecture Notes in Computer Science, pages 228–240. Springer, 2006. ISBN 3-540-47224-X. X. Ling, J. Jiang, X. He, Q. Mei, C. Zhai, and B. Schatz. Generating gene summaries from biomedical literature: A study of semi-structured summarization. Information Processing and Management, 43(6):1777–1791, 2007. ISSN 0306-4573. K. C. Litkowski. Evolving XML summarization strategies in DUC 2005. In DUC 2005, 2005. K. C. Litkowski. Cl research summarization in DUC 2006: An easier task, an easier method? In DUC 2006, 2006. H. P. Luhn. The automatic creation of literature abstracts. IBM Journal of Research and Development, 2(2):159–165, 1958. ISSN 0018-8646. R.W.P. Luk, H. V. Leong, T. S. Dillon, A. T.S. Chan, W. B. Croft, and J. Allan. A survey in indexing and searching XML documents. J. Am. Soc. Inf. Sci. Technol., 53(6):415–437, 2002. 
ISSN 1532-2882.
209 R. E. Maizell, J. F. Smith, and T. E. R. Singer. Abstracting scientific and technical literature : an introductory guide and text for scientists, abstractors, and management. Wiley-Interscience, 1971. S. Malik, G. Kazai, M. Lalmas, and N. Fuhr. Overview of INEX 2005. In Advances in XML Information Retrieval and Evaluation, 4th International Workshop of the Initiative for the Evaluation of XML Retrieval, INEX 2005, Dagstuhl Castle, Germany, November 28-30, 2005, Revised Selected Papers, pages 1–15, 2005. S. Malik, C.-P. Klas, N. Fuhr, B. Larsen, and A. Tombros. Designing a user interface for interactive retrieval of structured documents - lessons learned from the INEX interactive track. In Proceedings of ECDL 2006, pages 291–302, 2006a. S. Malik, A. Tombros, and B. Larsen. The interactive track at INEX 2006. In Fuhr et al. (2007), pages 387–399. ISBN 978-3-540-73887-9. S. Malik, B. Larsen, and A. Tombros. Report on the INEX 2005 interactive track. SIGIR Forum, 41(1):67–74, 2007. ISSN 0163-5840. I. Mani. Summarization evaluation: An overview. In NAACL-WS2001A, 2001. I. Mani, D. House, G. Klein, L. Hirschman, T. Firmin, and B. Sundheim. The TIPSTER SUMMAC text summarization evaluation. In Proceedings of the ninth conference on European chapter of the Association for Computational Linguistics, pages 77–85, Morristown, NJ, USA, 1999. Association for Computational Linguistics. G. Marchionini and B. Shneiderman. Finding facts vs. browsing knowledge in hypertext systems. Computer, 21(1):70–80, 1988. ISSN 0018-9162. S. Mizzaro. Relevance: The whole history. Journal of the American Society of Information Science, 48(9):810–832, 1997. C. N. Mooers. Zatocoding applied to mechanical organization of knowledge. American Documentation, 2:20–32, 1951. D. A. Nation. WebTOC: a tool to visualize and quantify web sites using a hierarchical table of contents browser. In CHI ’98: CHI 98 conference summary on Human factors in computing systems, pages 185–186, New York, NY, USA, 1998. 
ACM. ISBN 1-58113-028-7.
210 J Larocca Neto, AD Santos, CAA Kaestner, and AA Freitas. Document Clustering and Text Summarization. In N Mackin, editor, Proc. 4th International Conference Practical Applications of Knowledge Discovery and Data Mining (PADD-2000), pages 41–55, London, January 2000. The Practical Application Company. W. C. Ogden and M. W. Davis. Improving cross-language text retrieval with human interactions. In HICSS ’00: Proceedings of the 33rd Hawaii International Conference on System SciencesVolume 3, page 3044, Washington, DC, USA, 2000. IEEE Computer Society. P. Ogilvie and J. Callan. Hierarchical language models for XML component retrieval. In Fuhr et al. (2005), pages 224–237. P. Over. TREC-5 interactive track report. In Text REtrieval Conference, 1996. P. Over. TREC-6 interactive track report. In Text REtrieval Conference, 1997. P. Over. TREC-7 interactive track report. In Text REtrieval Conference, pages 33–39, 1998. P. Over, H. Dang, and D. Harman. DUC in context. Information Processing and Management, 43(6):1506–1520, 2007. ISSN 0306-4573. T. Paek, S. Dumais and R. Logan. WaveLens: a new view onto internet search results. In CHI ’04: Proceedings of the SIGCHI conference on Human factors in computing systems, pages 727–734, New York, NY, USA, 2004. ACM. ISBN 1-58113-702-8. L. Page, S. Brin, R. Motwani, and T. Winograd. The pagerank citation ranking: Bringing order to the web. Technical report, Stanford Digital Library Technologies Project, 1998. C. D. Paice. Constructing literature abstracts by computer: Techniques and prospects. Information Processing and Management, 26(1):171–186, 1990. K. Papineni, S. Roukos, T. Ward, and W-J. Zhu. BLEU: A method for automatic evaluation of machine translation. Technical report, IBM, 2001. J. Pehcevski, J. A. Thom, and A-M. Vercoustre. Users and assessors in the context of INEX: Are relevance dimensions relevant? CoRR, abs/cs/0507069, 2005. N. Pharo and R. Nordlie. Context matters: An analysis of assessments of XML documents. 
In F. Crestani and I. Ruthven, editors, CoLIS, volume 3507 of Lecture Notes in Computer Science, pages 238–248. Springer, 2005. ISBN 3-540-26178-8.
211 N. Pharo, A. Trotman, S. Geva, and B. Piwowarski. Inter-assessor agreement at INEX 06. In INEX 2006 Workshop pre-Proceedings, page 273, 2006. B. Piwowarski and M. Lalmas. Providing consistent and exhaustive relevance assessments for XML retrieval evaluation. In CIKM ’04: Proceedings of the thirteenth ACM international conference on Information and knowledge management, pages 361–370, New York, NY, USA, 2004. ACM. ISBN 1-58113-874-1. J. M. Ponte and W. B. Croft. A language modeling approach to information retrieval. In SIGIR ’98: Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, pages 275–281, New York, NY, USA, 1998. ACM. ISBN 1-58113-015-5. A. Pulijala and S. Gauch. Hierarchical text classification. In International Conference on Cybernetics and Information Technologies, Systems and Applications: CITSA 2004, 2004. D. R. Radev, H. Jing, M. Styś, and D. Tam. Centroid-based summarization of multiple documents. Information Processing and Management, 40(6):919–938, 2004. ISSN 03064573. V. Raghavan, P. Bollmann, and G. S. Jung. A critical investigation of recall and precision as measures of retrieval system performance. ACM Trans. Inf. Syst., 7(3):205–229, 1989. ISSN 1046-8188. A. Rahman and H. Alam. Challenges in web document summarization: Some myths and reality. In Document Recognition and Retrieval IX, Electronic Imaging Conference, SPIE 4670-27, 2002. G. Rath, A. Resnick, and R. Savage. The Formation of Abstracts by the Selection of Sentences: Part 1: Sentence Selection by Man and Machines. American Documentation, 12(2):139–141, 1961. J. R. Remde, L. M. Gomez, and T. K. Landauer. SuperBook: an automatic tool for information exploration hypertext? In HYPERTEXT ’87: Proceedings of the ACM conference on Hypertext, pages 175–188, New York, NY, USA, 1987. ACM. ISBN 0-89791-340-X.
212 S.E. Robertson, M.F. Porter, and C.J. van Rijsbergen. New models in probabilistic information retrieval. Technical report, Computer Laboratory, Cambridge University, 1980. URL http: //tartarus.org/˜martin/PorterStemmer. T. R¨olleke, R. L¨ubeck, and G. Kazai. The hyspirit retrieval platform. In SIGIR’01, page 454, New York, NY, USA, 2001. ACM Press. ISBN 1-58113-331-6. G. Salton. Relevance assessments and retrieval system evaluation. Technical report, Ithaca, NY, USA, 1968. G. Salton and J. Allan. Automatic text decomposition and structuring. Information Processing and Management, 32(2):127–138, 1996. G. Salton and C. Buckley. Automatic text structuring and retrieval-experiments in automatic encyclopedia searching. In SIGIR ’91: Proceedings of the 14th annual international ACM SIGIR conference on Research and development in information retrieval, pages 21–30, New York, NY, USA, 1991. ACM. Mark Sanderson. Reuters Test Collection. In BSC IRSG, 1994. T. Saracevic. Comparative effects of titles, abstracts and full texts on relevance judgments. In Proceedings of the American Society for Information Science, pages 293–299, San Francisco, CA, 1969. Westport, CT: Greenwood Publishing Corporation. T. Saracevic. Evaluation of evaluation in information retrieval. In Proceedings of the 18th Annual International Conference on Research and Development in Information Retrieval (SIGIR’95), pages 137–148. ACM Press, 1995. ISBN 0-89791-714-2. T. Saracevic. Relevance reconsidered. In Proceedings of the second conference on conceptions of library and information science (CoLIS 2), pages 201–218, 1996. F. Sebastiani, editor. Advances in Information Retrieval, 25th European Conference on IR Research, ECIR 2003, Pisa, Italy, April 14-16, 2003, Proceedings, volume 2633 of Lecture Notes in Computer Science, 2003. Springer. ISBN 3-540-01274-5. M. M. Sebrechts, J. Cugini, S. J. Laskowski, J. Vasilakis, and M. S. Miller. 
Visualization of search results: A comparative evaluation of text, 2D, and 3D interfaces. In SIGIR ’99: Proceedings
213 of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, August 15-19, 1999, Berkeley, CA, USA, pages 3–10. ACM, 1999. A. Sengupta, M. Dalkilic, and J. Costello. Semantic thumbnails: a novel method for summarizing document collections. In SIGDOC’04, pages 45–51, New York, NY, USA, 2004. ACM Press. ISBN 1-58113-809-1. B. Shneiderman, D. Feldman, A. Rose, and X. Ferr´e Grau. Visualizing digital library search results with categorical and hierarchical axes. In DL ’00: Proceedings of the fifth ACM conference on Digital libraries, pages 57–66, New York, NY, USA, 2000. ACM. ISBN 1-58113231-X. B. Sigurbj¨ornsson. Focused Information Access using XML Element Retrieval. PhD thesis, Faculty of Science, University of Amsterdam, 2006. B. Sigurbj¨ornsson, A. Trotman, S. Geva, M. Lalmas, B. Larsen, and S. Malik. Inex 2005 guidelines for topic development. In INEX 2005 Workshop Pre-Proceedings, pages 375–384, Dagstuhl, Germany, November 2005. I. Soboroff. Do trec web collections look like the web? SIGIR Forum, 36(2):23–31, 2002. ISSN 0163-5840. A. Spoerri. Infocrystal: a visual tool for information retrieval. In VIS ’93: Proceedings of the 4th conference on Visualization ’93, pages 150–157, 1993. ISBN 0-8186-3940-7. G. C. Stein, T. Strzalkowski, and G. Bowden Wise. Interactive, text-based summarization of multiple documents. Computational Intelligence, 16(4):606–613, 2000. B. Suh, A. Woodruff, R. Rosenholtz, and A. Glass. Popout prism: adding perceptual principles to overview+detail document interfaces. In CHI ’02: Proceedings of the SIGCHI conference on Human factors in computing systems, pages 251–258, New York, NY, USA, 2002. ACM. ISBN 1-58113-453-3. S. Sweeney and F. Crestani. Effective search results summary size and device screen size: is there a relationship? Information Processing and Management, 42(4):1056–1074, 2006. ISSN 0306-4573.
214 Z. Szl´avik and T. R¨olleke. Building and experimenting with a heterogeneous collection. In Third Workshop of the INitiative for the Evaluation of XML retrieval INEX 2004, pages 359–368. Springer Verlag, 2005. Z. Szl´avik, A. Tombros, and M. Lalmas. The use of summaries in XML retrieval. In Proceedings of ECDL 2006, pages 75–86, 2006a. Z. Szl´avik, A. Tombros, and M. Lalmas. Investigating the use of summarisation for interactive XML retrieval. In F. Crestani and G. Pasi, editors, Proceedings of ACM SAC-IARS’06, pages 1068–1072, 2006b. Z. Szl´avik, A. Tombros, and M. Lalmas. Feature- and query-based table of contents generation for XML documents. In G. Amati, C. Carpineto, and G. Romano, editors, ECIR, volume 4425 of Lecture Notes in Computer Science, pages 456–467. Springer, 2007. ISBN 978-3-54071494-1. S. Teufel and M. Moens. Sentence extraction as a classification task. In Proceedings of the ACL’97/EACL’97 Workshop on Intelligent Scalable Text Summarization, Madrid, Spain, July November 1997. M. Theobald, R. Schenkel, and G. Weikum. An efficient and versatile query engine for TopX search. In Proceedings of VLDB, pages 625–636, 2005. A. Tombros and M. Sanderson. Advantages of query biased summaries in information retrieval. In Proceedings of ACM SIGIR, pages 2–10, 1998. ISBN 1-58113-015-5. A. Tombros, B. Larsen, and S. Malik. The interactive track at INEX 2004. In Fuhr et al. (2005), pages 422–435. A. Tombros, S. Malik, and B. Larsen. Report on the INEX 2004 interactive track. ACM SIGIR Forum, 39(1):43–49, June 2005b. ISSN 0163-5840. A. Trotman. Wanted: Element retrieval users. In Proceedings of the INEX 2005 Workshop on Element Retrieval Methodology, pages 58–64, 2005. A. Trotman and D. Jenkinson. IR evaluation using multiple assessors per topic. In Amanda Spink, Andrew Turpin, and Mingfang Wu, editors, Proceedings of The Twelfth Australasian
215 Document Computing Symposium, pages 9–16, Melbourne, Australia, December 2007. RMIT University. A. Trotman and B. Larsen. Inex 2007 guidelines for topic development. In INEX 2007 Workshop Pre-Proceedings, Dagstuhl, Germany, 2007. A. Trotman and B. Sigurbj¨ornsson. Narrowed extended XPath I (NEXI). In Fuhr et al. (2005), pages 16–40. A. Trotman, N. Pharo, and M. Lehtonen. XML-IR users and use cases. In Fuhr et al. (2007), pages 400–412. ISBN 978-3-540-73887-9. A. Trotman, S. Geva, and J. Kamps. Report on the SIGIR 2007 workshop on focused retrieval. SIGIR Forum, 41(2):97–103, 2007. ISSN 0163-5840. A. Turpin, Y. Tsegay, D. Hawking, and H. E. Williams. Fast generation of result snippets in web search. In SIGIR ’07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, pages 127–134, New York, NY, USA, 2007. ACM. ISBN 978-1-59593-597-7. C. J. van Rijsbergen. Information Retrieval. Butterworths, 2nd edition, 1980. URL http: //www.dcs.gla.ac.uk/Keith/Preface.html. R. van Zwol, G. Kazai, and M. Lalmas. INEX 2005 multimedia track. In Fuhr et al. (2006), pages 497–510. E. M. Voorhees. Overview of TREC 2001. In TREC, 2001. E. M. Voorhees. Variations in relevance judgments and the measurement of retrieval effectiveness. Information Processing and Management, 36(5):697–716, 2000. L. Weitzman, S. E. Dean, D. Meliksetian, K. Gupta, N. Zhou, and J. Wu. Transforming the content management process at IBM.com. In CHI ’02: Case studies of the CHI2002|AIGA Experience Design FORUM, pages 1–15, New York, NY, USA, 2002. ACM. R. White, I. Ruthven, and J. M. Jose. Finding relevant documents using top ranking sentences: an evaluation of two alternative schemes. In SIGIR 2002, pages 57–64, 2002.
216 R. W. White, J. M. Jose, and I. Ruthven. A task-oriented study on the influencing effects of query-biased summarisation in web searching. Information Processing and Management, 39 (5):707–733, 2003. Ryen W. White. Implicit feedback for interactive information retrieval. SIGIR Forum, 39(1): 70–70, 2005. ISSN 0163-5840. C. G. Wolf, S. R. Alpert, J. G. Vergo, L. Kozakov, and Y. Doganata. Summarizing technical support documents for search: expert and user studies. IBM Syst. J., 43(3):564–586, 2004. ISSN 0018-8670. A. Woodruff, R. Rosenholtz, J. B. Morrison, A. Faulring, and P. Pirolli. A comparison of the use of text summaries, plain thumbnails, and enhanced thumbnails for Web search tasks. JASIST, 53(2):172–185, 2002. O. Zamir and O. Etzioni. Grouper: a dynamic clustering interface to web search results. In WWW ’99: Proceedings of the eighth international conference on World Wide Web, pages 1361–1374, New York, NY, USA, 1999. Elsevier North-Holland, Inc. P. T. Zellweger, S. Harkness Regli, J. D. Mackinlay, and B.-W. Chang. The impact of fluid documents on reading and browsing: an observational study. In CHI ’00: Proceedings of the SIGCHI conference on Human factors in computing systems, pages 249–256, New York, NY, USA, 2000. ACM. ISBN 1-58113-216-6.
217
Appendix A Text summarisation
A.1 Task descriptions
The following pages contain the task descriptions used in the text summarisation user study.
Please choose one of the following two tasks. You will have a maximum of 20 minutes to complete the chosen task. If you finish earlier, please let the experimenter know.
B1 - Computer Assisted Music Composing
Let us suppose you are a composer, and you want to start using computers for composing music. You want to find out about any possible methods or ways to use computers to help you in your job. You are interested not only in particular programs but also in the possibilities computers might offer to a composer. You may want to compose using notes, so writing notes with a computer would help you. On the other hand, you would also like to use real instruments in composing, so a simple way of creating notes automatically while playing would be useful to you. There is a standard called MIDI which is related to this topic; you may try to use this information during your search. Please write down some notes, key points, comments, etc. which may help you remember the information you have found.
B2 - FBI and CIA Surveillance Concerns
Let us suppose you are a journalist. You've heard that the Central Intelligence Agency (CIA) and the Federal Bureau of Investigation (FBI) are monitoring the public in the United States, and you want to write a feature article for a newspaper about what personal privacy concerns have been raised by the surveillance done by these two agencies. To start the work, you want to collect some background information about how the surveillance affects people's privacy. At this first stage, you are not interested in the technical details. You know that there is a project called Carnivore which is related to this topic. Please write down some notes, key points, comments, etc. which may help you in writing your article.
Please choose one of the following two tasks. You will have a maximum of 20 minutes to complete the chosen task. If you finish earlier, please let the experimenter know.
L1 - Speech Recognition Software
Let us suppose you work for a company as a manager and you have frequent audio meetings (through a Voice over IP (VoIP) tool). You would like to keep records of these meetings on paper as well, but you don't have the time or resources to type up what is discussed during the meetings. Instead, you are hoping to generate transcripts from the audio captured during these meetings. You would like to compile a list of all commercially available speech recognition packages, along with a brief description of their capabilities and any experience (positive or negative) that others have had with them. Public-domain systems are useful to you only if they are suitable for commercial use. Please write down a list of names of available software along with their capabilities and, if you wish, some comments about them.
L2 - Artificial Intelligence Algorithms for Playing Chess
Let us suppose you are a programmer and you also like chess. You want to write your own program that plays chess, and you want the program to use artificial-intelligence algorithms. Therefore, you are interested in all kinds of algorithms that have been proposed for chess. You assume, however, that earlier algorithms are simpler and will be easier for you to implement. Hence, you are only interested in articles on this topic that were published before, or during, the year 2000. In order to choose from the algorithms, you want to create a list of the chess algorithm names and attach a short comment to each of them. Comments should contain information that may help you choose the best algorithm for your programming project. Please write down this list.
A.2 Questionnaires
The following questionnaires were given to searchers of the text summarisation user study.
Participant _____________ Date ___________________ Searcher condition ____________
ENTRY QUESTIONNAIRE
Background Information
1. What high school/college/university degrees/diplomas do you have (or expect to have)?
   Degree ____________ Major ____________ Date ____________
   Degree ____________ Major ____________ Date ____________
   Degree ____________ Major ____________ Date ____________
2. What is your first language? _____________________________
3. What is your occupation? ____________________________________________________
4. What is your gender? ____ Female ____ Male
5. What is your age?
   ___ 18 - 27 years
   ___ 28 - 37 years
   ___ 38 - 47 years
   ___ 48 - 57 years
   ___ 58 - 67 years
   ___ 68+
Participant _____________ Date ___________________ Searcher condition ____________ Task ID _____________________
BEFORE-EACH-TASK QUESTIONNAIRE
1. How familiar are you with the topic of this problem?
   Not at all (1) ... Somewhat (4) ... Extremely (7)
2. How easy do you think the task is?
   Very easy (1) ... Average (4) ... Very hard (7)
3. How confident are you that you will find everything you need?
   Not at all (1) ... Extremely confident (7)
Participant _____________ Date ___________________ Searcher condition ____________ Task ID _____________________
AFTER-EACH-TASK QUESTIONNAIRE
Please answer the following questions as they relate to this specific information task. Each question is answered on a seven-point scale from 1 (Not at all) through 4 (Somewhat) to 7.
1. Was it easy to get started on this search?
2. Was it easy to do the search on this topic?
3. Did you have enough time to do an effective search?
4. Are you confident in your results?
5. Do you feel your results are complete?
6. Did your previous knowledge on this topic help much in this task?
7. Have you learned much new about the topic during your search?
8. Was the search task interesting?
9. Was the search task realistic?
10. Did you use the hierarchy presented on the left of the document view?
11. Did you read the summaries associated with elements in the hierarchy?
12. Did the summaries reflect the contents of their corresponding elements?
EXIT QUESTIONNAIRE
Please consider the entire search experience that you just had when you respond to the following questions. Questions 1-3 are answered on a seven-point scale from 1 (Not at all) through 4 (Somewhat) to 7 (Extremely); questions 4-6 from 1 (Not at all) through 4 (Somewhat) to 7 (Completely).
1. How easy was it to learn to use the information system?
2. How easy was it to use the information system?
3. How well did you understand how to use the system?
4. To what extent did you understand the nature of the searching task?
5. To what extent did you find this task similar to other searching tasks that you typically perform?
6. How different did you find the systems from one another?
7. Please indicate if you have any comments regarding the hierarchy display, including number of levels, style, etc.:
Confirmation of Payment
User Study: Summarisation for XML Retrieval, 1-5 August 2005
I hereby confirm that I have received £10 (ten pounds) for participating in the user study Summarisation for XML Retrieval run by Zoltan Szlavik (PhD Student, Department of Computer Science, Queen Mary University of London).
Name (block capitals) _____________________________
Position _____________________________
Date _____________________________
Signature _____________________________
Appendix B Query based structure summarisation
B.1
Task descriptions
The following list contains the task descriptions used in the structure summarisation study. The 'i' tasks refer to those designed for the IEEE document collection, while the 'w' tasks have most of their relevant documents in the Wikipedia collection.

i1 You may have heard that microprocessor size reduction is soon expected to reach its limits. What are the problems and physical limits encountered by the microprocessor miniaturisation process? That is what you would like to know.

i2 Video games are being played by an ever increasing number of people of all ages, and the game industry is becoming a major economic player. You would therefore like to find non-technical information about how video games have affected people's lives, as well as how the games have changed the entertainment industry.

i3 Let us suppose you are a composer, and you want to start using computers in composing music. You want to know any possible methods or ways of using computers to help you in your job. You are not only interested in particular programs but also in the possibilities computers might offer to a composer. You may want to compose using notes, so writing notes with a computer would help you. On the other hand, you may also like to use real instruments in composing, so a simple way of creating notes automatically while playing would be useful to you.
i4 Let us suppose you work for a company as a manager and you have a lot of audio meetings (through a VoIP tool). You would also like to read the records of these meetings on paper, but you do not have the time and resources to type up what happened during the meeting. So, you are hoping to generate transcripts from audio captured during meetings. You would like to compile a list of all commercially available speech recognition packages, along with a brief description of their capabilities and any experience (positive or negative) that others have had with them. Public domain systems are useful to you only if they are suitable for commercial use.

i5 You have become involved in the organization of a workshop on multimedia at your department. You would like to collect information about other multimedia events (conferences, workshops, etc.) in order to decide on the most appropriate topics and dates for the workshop at your department.

w1 As you watch the evening news, you often see reports from Iraq, Iran and Afghanistan. To better understand what is happening between these countries and the western countries, and how the people of these countries think, you would like to know more about the Islam religion and its branches. As there are overlaps with Christianity and other religions, information which contrasts Islam with other religions is also relevant.

w2 You were looking at the labels of groceries and were surprised by the large number of food additives ('E-numbers' such as E102, E122, etc.). You are eager to find out which food additives are, or may be, toxic. You are also interested in those which have been prohibited.

w3 Let us suppose that you work as a teaching assistant for operating systems. This inspired your interest in microkernel operating systems, and you are looking for documents that talk about operating systems with a microkernel.
Documents or document segments talking only about operating systems or only about microkernels are not of interest to you. (A microkernel is a highly modular collection of powerful OS-neutral abstractions, upon which operating system servers can be built.)

w4 Let us suppose you need to write an essay about the off-side rule. Hence, you are interested in the definition, history and changes of the rule in football and in any other sports where the off-side rule is present.
w5 Let us suppose you are writing a movie script about Albert Einstein. You are looking for particularly interesting facts, even anecdotal ones: anything that will spice up the script and also point to other interesting involvements. At this point of the writing, you want to find out about the impact that Albert Einstein had on politics: his involvement, opinions, people he interacted with, related historical facts and anecdotes. References to the theory of relativity or other scientific work are of no interest unless they are tied to political issues.
B.2
Questionnaires
Appendix C Query independent structure summarisation
C.1
Analysed runs
The following is a list of high, average and low quality runs that were analysed and used for training in Chapter 6. The selection method for each category is summarised below the individual lists.
High quality runs:
utwente - A CO ARTorNAME
maxplanck - TOPX-CO-AllInContext-ex
rmit - zet-okai-AC
uamsterdam - all section lm
uhebrew - CO.FetchBrowseElement ON lm 0.1 0.1 0.8 2006 18 09
lip6 - OkTg-Lineaire-RkSym-100cent-SemiSu5-Oki-2-7-0.75-D

Selection: all runs within the top 10 results by various measures (order determined by average positions); one run was chosen per affiliation.
Average quality runs:
qmul - Lm LengthPrior TermWeighted F Clustered R
queensland - CO T 11 31.AllInContext
maxplanck - TOPX-CAS-AllInContext-baseline
city - All-BM25-cutoff400
utrecht - aic-And49tf5-End04tf3leaf-Element-10File
kaislau - CO.CrisTitlen.ALL IN CONTEXT

Selection: these are the runs closest to position 28, which is the middle of the ranked list of runs (the ranked list was determined by average positions based on the appropriate INEX measures); one run per affiliation.

Low quality runs:
kaiserslautern - CAS.FuzzyTitlen.ALL IN CONTEXT
IRIT - xfirm.co.relevant.01.1
uhebrew - CO.FetchBrowseElement ON lm 0.1 0.1 0.8 2006 17 02...
Wollongong - Title RelevantInContext Task2
utwente - A CO ARTorNAME STAR AiCmax fixed
justsystem - VSM 09

Selection: runs at the bottom of the ranked list of runs (based on averaging the positions of the result tables by the task's INEX measures); one run per affiliation.
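The selection procedure underlying the three lists above (rank all runs by their average position across the evaluation measures, then take runs from the top, middle and bottom of that ranking, at most one run per affiliation) can be sketched as follows. All affiliation names, run names and positions in this sketch are illustrative placeholders, not the actual INEX runs or scores.

```python
from statistics import mean

# (affiliation, run name) -> rank positions of the run under each measure.
# These values are made up purely for illustration.
positions = {
    ("affilA", "runA1"): [1, 3, 2],
    ("affilA", "runA2"): [4, 5, 6],
    ("affilB", "runB1"): [2, 1, 4],
    ("affilC", "runC1"): [9, 8, 7],
    ("affilD", "runD1"): [5, 6, 5],
}

# Order all runs by their average position (lower average = better run).
ranked = sorted(positions, key=lambda run: mean(positions[run]))

def pick(runs, k):
    """Keep at most one run per affiliation, up to k runs in total."""
    chosen, seen = [], set()
    for affiliation, name in runs:
        if affiliation not in seen:
            seen.add(affiliation)
            chosen.append((affiliation, name))
        if len(chosen) == k:
            break
    return chosen

# High quality: taken from the top of the ranking.
high = pick(ranked, 3)
# Low quality: taken from the bottom of the ranking.
low = pick(reversed(ranked), 3)
# Average quality: runs closest to the middle position of the ranking.
mid_point = len(ranked) // 2
middle = pick(sorted(ranked, key=lambda r: abs(ranked.index(r) - mid_point)), 3)
```

The one-run-per-affiliation filter mirrors the "one run per affiliation" note in each list above: without it, a single group with several strong (or weak) submissions would dominate a category.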
Low quality runs: kaiserslautern - CAS.FuzzyTitlen.ALL IN CONTEXT IRIT - xfirm.co.relevant.01.1 uhebrew - CO.FetchBrowseElement ON lm 0.1 0.1 0.8 2006 17 02... Wollongong - Title RelevantInContext Task2 utwente - A CO ARTorNAME STAR AiCmax fixed justsystem - VSM 09 Selection: runs at the bottom of the ranked list of runs (based on averaging positions of result tables by the task’s INEX measures; one run per affiliation.