IPM 641 14 December 2002 Disk used
ARTICLE IN PRESS
No. of Pages 21, DTD = 4.3.1 SPS, Chennai
Information Processing and Management xxx (2002) xxx–xxx
www.elsevier.com/locate/infoproman
A graphical user interface for the retrieval of hierarchically structured documents
Fabio Crestani a,*, Jesús Vegas b, Pablo de la Fuente b
a Department of Computer and Information Sciences, University of Strathclyde, 26 Richmond Street, Glasgow G1 1XH, UK
b Departamento de Informática, Universidad de Valladolid, Valladolid, Spain
Received 12 August 2002; accepted 19 November 2002
Abstract
Past research has proved that graphical user interfaces (GUIs) can significantly improve the effectiveness of the information access task. Our work is based on the consideration that structured document retrieval requires different graphical user interfaces from standard information retrieval. In structured document retrieval a GUI has to enable a user to query, browse retrieved documents, refine queries, and provide relevance feedback based not only on full documents, but also on specific document parts in relation to the document structure. In this paper, we present a new GUI for structured document retrieval specifically designed for hierarchically structured documents. A user task-oriented evaluation has shown that the proposed interface provides the user with an intuitive and powerful set of tools for structured document searching, retrieved list navigation, and search refinement. © 2002 Published by Elsevier Science Ltd.
Keywords: Information retrieval; Structured document retrieval; Graphical user interface; Evaluation
* Corresponding author. Tel.: +44-141-548-4303; fax: +44-141-552-5330. E-mail addresses: [email protected], [email protected] (F. Crestani), [email protected] (J. Vegas), [email protected] (P. de la Fuente).
0306-4573/02/$ - see front matter © 2002 Published by Elsevier Science Ltd. doi:10.1016/S0306-4573(02)00120-6
1. Introduction
Information retrieval (IR) systems are powerful and effective tools for accessing documents by content. A user specifies the required content using a query, often consisting of a natural language expression. Documents estimated to be relevant to the user query are presented to the user through an interface. A good user interface enables the IR process to be interactive, stimulating the user to review the retrieved documents and to reformulate the initial query, either manually or, more often, automatically, using relevance feedback. Given the complexity of the IR task and the vagueness and imprecision of the expression of the user information need and document information content, interactivity has been widely recognised as a very effective way of improving information access performance (Ingwersen, 1992).
New standards in document representation compel IR to design and implement models and tools to index, retrieve and present documents according to the given document structure. In fact, while standard IR treats documents as if they were atomic entities, modern IR needs to be able to deal with more elaborate document representations, such as documents written in SGML, HTML, or XML. These formalisms make it possible to represent and describe so-called structured documents, that is, documents whose content is organised around a well-defined structure. Examples of these documents are books and textbooks, scientific articles, technical manuals, etc. This means that documents should no longer be considered as atomic entities, but as aggregates of interrelated objects that need to be indexed, retrieved, and presented both as a whole and separately, in relation to the user's needs. In other words, given a query, an IR system must retrieve the set of document components that are most relevant to this query, not just entire documents.
In IR the area of research dealing with structured documents is known as structured document retrieval. Structured document retrieval is concerned with the design of IR techniques and tools for the indexing, retrieval and presentation of structured documents. A good survey of the state of the art of structured document retrieval can be found in Chiaramella (2001).
In structured document retrieval, the need for a good interface between user and system becomes even more pressing.
Structured documents are often long and complex, and it has been observed that a certain form of disorientation occurs in situations where the user does not understand why certain long and complex documents appear in a retrieved list (Spence, 2001). The length and structural complexity of such documents make it difficult for the user to capture the relationship between the information need and the semantic content of the document. This user disorientation makes the task of query reformulation much harder, since the user first has to understand the response of the system and then to decide whether any of the retrieved documents is a good enough representation of the information need to provide the system with precise relevance feedback. These added difficulties increase the cognitive load of the user and decrease the quality of the interaction with the system (Belew, 2000).
A possible effective approach to the disorientation problem in structured document retrieval is to provide the IR system interface with explanatory and selective feedback capabilities. In other words, the system should be able to explain to the user, at any moment, why a particular document has been estimated as relevant and where the clues of this estimated relevance lie. In addition, the user should then be able to select for relevance feedback only those parts of the document that are relevant to the information need and not the entire document. By being able to focus on the parts of the document that make it appear relevant, without losing the view of the relationships between these parts and the whole document, the user's cognitive load is reduced and the interaction is enhanced in quality and effectiveness.
Part of our current research work is concerned with the design and implementation of a graphical user interface (GUI) for structured document retrieval. The work starts from the consideration that improved structured document retrieval visualisation interfaces provide an effective contribution to the solution of the above problem. In this paper we present the design, implementation and evaluation of a GUI of a structured document retrieval system that is specifically targeted to hierarchically structured documents. This work is part of a wider project aimed at the design, implementation, and evaluation of a complete system for structured document retrieval. The system will combine a retrieval engine based on an aggregation-based approach for the estimation of the relevance of the document, with an interface with explanatory and selective feedback capabilities, specifically designed for the structured document retrieval task.
The paper is structured as follows. In Section 2 we highlight the importance and the difficulty of the design of GUIs for information access. In Section 3 we explain the characteristics of the specific information access task that we are targeting: structured document retrieval. In Section 4 we present the design and implementation of a new GUI for hierarchically structured document retrieval. The retrieval model currently used is presented in Section 5, even though this part of our work is only at a very early stage. A formative design-evaluation process is currently being used to validate and refine the interface. The results of a first evaluation phase are presented in Section 6. Finally, Section 7 summarises the conclusions of this work and provides an outline of future work.
2. Graphical user interfaces for information access
GUIs have been studied for quite a long time by researchers in the area of information access. Research in user interfaces pertains to all areas of information access, from Data Base Management Systems (Southall, 1989) to Digital Libraries, from Hypermedia (Agosti & Smeaton, 1996) to Electronic Books (Barker, 1991). A great deal of experience, originating from the Human Computer Interaction area, has been transferred into many other research areas (Carroll, Mack, & Kellog, 1988). In IR, too, a great deal of work has been devoted to the design of user interfaces.
In IR it is well recognised that information seeking is a vague and imprecise process. In fact, when users approach an information access system they often have only an imprecise understanding of what they are looking for and a vague idea of how they can find it. The interface should aid the users in the understanding and expression of the information need. This implies not simply helping the users to formulate queries, but also helping them to select among available information resources, understand search results, reformulate queries, and keep track of the progress of the whole search process.
Human–computer interfaces, and in particular GUIs, are less well understood than other aspects of IR, because of the difficulties in understanding, measuring, and characterising the motivations and behaviours of IR users. Nevertheless, established wisdom combined with more recent research results (see Baeza-Yates and Ribeiro-Nieto (1999, pp. 257–322) for an overview of both) has highlighted some very important design principles for GUIs for information access systems. GUIs for information access should:
• provide informative feedback;
• reduce working memory load;
• provide functionalities for both novice and expert users.
These design principles are particularly important for GUIs for IR systems, since the complexity of the IR task requires very complex and interactive interfaces. An important aspect of GUIs for information access systems is visualisation. Visualisation takes advantage of the fact that humans are highly attuned to images and visual information. Pictures and graphics can be captivating and appealing, especially if well designed. In addition, visual representation can communicate some information more rapidly and effectively than any other method. The growing availability of fast graphic processors, high resolution colour screens, and large monitors is increasing interest in visual interfaces for information access. However, while information visualisation is rapidly advancing in areas such as scientific visualisation, it is less so for document visualisation. In fact, visualisation of inherently abstract information is difficult, and visualisation of textual information is especially challenging. Language and its written representation, text, are the main human means of communicating abstract ideas, for which there is no obvious visual manifestation (Spence, 2001). Despite these difficulties, researchers are attempting to represent some aspects of the information access process using visualisation techniques. The main visualisation techniques used for this purpose are icons, colour highlighting, brushing and linking, panning and zooming, and focus-plus-context (Baeza-Yates & Ribeiro-Nieto, 1999, pp. 257–322). These techniques support a dynamic, interactive use that is especially important for document visualisation. In this paper we will not address these techniques in any detail. We will just use those that are most suitable to our specific objective: structured document retrieval. The distinctive characteristics of this task are described in Section 3.
3. Structured document retrieval
Many document collections contain documents that have a complex structure, even though this structure is not used by most IR systems. The inclusion of the structure of a document in the indexing and retrieval process affects the design and implementation of the IR system in many ways. First of all, the indexing process must consider the structure in the appropriate way, so that users can search the collection both by content and structure. Secondly, the retrieval process should use both structure and content in the estimate of the relevance of documents. Finally, the interface and the whole interaction have to enable the user to make full use of the document structure. In Section 3.1 we report a brief and abstract taxonomy of IR systems with regard to the use of content and structure for indexing and retrieval. This will help us to identify what kind of operations an interface for structured document retrieval should provide to the user. For a more detailed discussion of the types of operations necessary for structured document retrieval, see Navarro and Baeza-Yates (1997) and Vegas (1999).
3.1. Use of content and structure in IR systems
From a general point of view, we can classify IR systems into nine different types, depending on how they use content and/or structure in the indexing and querying processes. We can identify them with two sets of letters: one set that characterises the indexing process taking place in the system and one that characterises the querying process. We use C and S to indicate the use of the semantic content or document structure, respectively, in the indexing and querying processes. The nine types of IR systems are shown in Table 1. In this table we indicate what kind of retrieval function the system would need (the function f) to produce a retrieval set (rs, either ordered or not). The subscript of the retrieval function indicates whether the indexing involved content (c) and/or structure (s). We also indicate what kind of input the user could provide when querying (the arguments of the function f, either c or s).

Table 1. Taxonomy of IR systems with respect to the use of content/structure

                Querying
Indexing    C                  CS                    S
C           rs = f_c(c)        rs = f_c(c, s)        –
CS          rs = f_{c+s}(c)    rs = f_{c+s}(c, s)    rs = f_{c+s}(s)
S           –                  rs = f_s(c, s)        rs = f_s(s)

The characteristics of the different classes of IR systems are briefly described below:
(C,C) In this class of systems, indexing is carried out only using the content of the documents, so either the structure is not taken into account, or the documents are unstructured. Since the structure has not been considered in the indexing, querying can only be related to the content of the documents. This is the most common class of IR systems, and all the classic IR models (vector space, probabilistic, etc.) have been designed to work with this kind of system. Many examples of these systems can be found in Baeza-Yates and Ribeiro-Nieto (1999), Belew (2000) and Frakes and Baeza-Yates (1992).
(CS,CS) This class of systems indexes document content in relation to document structure, so that the index contains indications of how the content of a document is arranged in the document. The user can then query the system for content with the ability to specify the structural elements in which the content should be found. This enables a higher degree of precision in the search. Some models have been proposed for this kind of IR system, but few operational implementations exist. Some examples of systems in this class are Bordogna and Pasi (2000), Lalmas and Ruthven (1998) and Wilkinson (1994).
(S,S) This class of systems indexes documents only in relation to their structure, not content, and querying can only be related to structure. So, this class of systems is useful only when the queries are exclusively about structure. Models for these systems are very simple, and given the very limited and specific use of these systems, only a few implementations exist.
(CS,C) This class of systems indexes document content in relation to document structure, but users are not able to specify structural information in the querying. In other words, document structure is used in the querying process in an implicit way; the user is not aware of document structure, but this is used in the relevance evaluation to achieve better retrieval performance. In this category of IR systems we can include a number of advanced systems aimed at collections where the structure is specified in the document markup (e.g. SGML, HTML, XML, MPEG7, etc.). IR systems that perform passage retrieval are examples of this class of systems (Callan, 1994; Salton & Allan, 1994). Other examples can be found in multimedia IR systems (like for example Dunlop & McDonald, 2000; Hauptmann & Witbrock, 1997, Chap. 11; Kim & Oard, 2002), where a video or speech document is segmented and segments estimated to be relevant are retrieved in response to a content-based query. In addition, in this category we find most Internet crawlers, which use information about hyperlinks, titles, etc. of the HTML pages to better rank the retrieved set (see for example Cutler, Shih, & Meng, 1997).
(CS,S) This class of systems indexes document content in relation to document structure, but querying can only use the document structure. This kind of system may not seem very useful, but such systems may be valuable components of (CS,CS) systems, where there is a need to find documents with similar structure (Salton, Allan, & Buckley, 1994).
(C,CS) This class of IR systems indexes documents by content, but allows queries to specify structure too. Clearly, the system cannot answer the query completely, since the structural information is not contained in the index. The user will have to browse the documents retrieved to find those that respond to the structural requirements of the query. We say that systems belonging to this class are "incomplete" since the index contains only part of the information needed to answer the query. We are not aware of any system of this type.
(S,CS) This is another class of incomplete IR systems. In this case, the index only holds information about the structure, while the system enables querying by both structure and content. Users will have to browse the documents retrieved to find those that respond to the content requirements of the query. We are not aware of any system of this type.
From the above classification it is clear that different kinds of systems make available different functionality to the user. In this respect, we can say that an IR system is competent if all kinds of queries allowed can be answered completely, and incompetent if not. In addition, we need to consider the resources used by the system (e.g. computing time and memory space) to make available to the user the possibility of querying by content and/or structure. This is somewhat reflected in the amount of information contained in the index. We can qualify an IR system as oversize if its index contains more information than is necessary to answer the queries. On the other hand, if an IR system has the right amount of information and no more in its index we can qualify it as fitted. With regard to Table 1, the types of IR systems on the main diagonal are both competent and fitted. Systems of type (CS,C) and (CS,S) are competent, but oversize. Systems of type (C,CS) and (S,CS) are incompetent, but fitted.
Notice that only a fraction of the types of systems presented in Table 1 have been studied and implemented. These are mostly systems of type (C,C) and (CS,CS). In our work we aim at designing and implementing a system that is of type (CS,C) for novice users, with the possibility of functioning as a (CS,CS) system for more advanced users. Some initial results of this work have been presented in Crestani, de la Fuente, and Vegas (2001). This system comprises a retrieval engine that is both competent and fitted, with an interface that facilitates querying by structure and content, hiding the complexity of explicitly naming the structural elements in the query and in the relevance feedback.
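The competent/incompetent and fitted/oversize distinctions of this section can be restated as two set comparisons between what the index holds and what a query may refer to. The following Python fragment is only an illustrative sketch of that reading: the function name `classify` and the dictionary `KINDS` are hypothetical, and the two combinations that Table 1 leaves empty, (S,C) and (C,S), are deliberately returned as undefined.

```python
from typing import Optional, Tuple

# "C" = content only, "S" = structure only, "CS" = both (labels of Table 1).
KINDS = {"C": {"c"}, "S": {"s"}, "CS": {"c", "s"}}


def classify(indexing: str, querying: str) -> Optional[Tuple[str, str]]:
    """Classify an (indexing, querying) system type.

    Returns (competence, size), or None for the two combinations that
    Table 1 leaves empty, (S,C) and (C,S).
    """
    index_info = KINDS[indexing]
    query_info = KINDS[querying]
    if not index_info & query_info:
        return None  # index and queries share nothing: not covered by Table 1
    # Competent: everything a query may refer to is present in the index.
    competence = "competent" if query_info <= index_info else "incompetent"
    # Oversize: the index holds information that no query can use.
    size = "oversize" if index_info - query_info else "fitted"
    return competence, size


if __name__ == "__main__":
    for indexing in ("C", "CS", "S"):
        for querying in ("C", "CS", "S"):
            print((indexing, querying), classify(indexing, querying))
```

Under this reading the three diagonal types come out as competent and fitted, (CS,C) and (CS,S) as competent but oversize, and (C,CS) and (S,CS) as incompetent but fitted, in agreement with the text.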
3.2. Structured document retrieval operations
In order to enable querying both content and structure in a competent and fitted way, an IR system needs to possess the necessary primitives to index the documents effectively and to enable a user to specify content and structural elements in the query. In our work we have analysed both. However, given the scope of this paper we will only discuss the querying primitives and operations.
Querying by content and structure can only be achieved if the user can specify in the query what he/she is looking for, and where this should be located in the required documents. The what involves the specification of the content, while the where is related to the structure of the documents. A great amount of work in IR has gone into letting the user specify the content of the information need in the most natural way. The tendency in IR is to let the user specify the information need as freely as possible, using a natural language query (Baeza-Yates & Ribeiro-Nieto, 1999; Belew, 2000). This has been proved to be more effective in the IR task than complex query languages (e.g. Boolean or relational database systems).
Considerably less effort has been devoted to designing systems that allow the user to specify in the most natural way the structural requirements (the "where") of the information need. It is obvious that the user can give structural specifications only if he/she knows the typical structure of a document in the collection. The problem is not trivial if the document structure is very complex and if the collection is not homogeneous with regard to the document structure. The case of the Web is the most conspicuous. In this case the user is only allowed very limited kinds of structural queries, mostly related to the structural elements that are almost always present in a Web page, i.e. page title, links, etc. On the other hand, links and other structural elements are used by Web search engines in the ranking in a way that is completely transparent to the user (see for example Cutler et al., 1997; Kleinberg, 1998).
In other cases, when the document structure is poor or not homogeneous, automatic structuring techniques can be employed to present the user with a browsable structure. This was found to be particularly effective in the case of long documents (see for example Agosti, Crestani, & Melucci, 1997; Salton & Allan, 1994).
It has been recognised that the best approach to querying structured documents is to let the user specify in the most natural way both the content and the structural requirements of the desired documents (Chiaramella, 2001). This can be achieved by letting the user specify the content requirement in a natural language query, while enabling the user to qualify the structural requirements through a GUI. A GUI is well suited to show and let the user indicate structural elements of documents in the collection. In addition, as already pointed out in Section 1, the complexity and the resulting user's cognitive load of querying for structured documents can be greatly reduced by designing interfaces that have explanatory and selective feedback capabilities. Such an interface not only lets the user specify both content and structural requirements of the sought documents, but also shows the user why and where a document has been found relevant, giving the user the ability to feed back to the system relevance assessments that are similarly selective. In Section 4 we present the design and implementation of one such interface for hierarchically structured documents.
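As a purely illustrative aid, the separation between the "what" and the "where" of a query can be sketched in a few lines of Python. Everything in the fragment is an assumption made for the example: the class names `Element` and `Query`, the structural type names, the toy collection, and above all the naive word-overlap test, which merely stands in for a real relevance estimate.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Element:
    doc_id: str  # document the element belongs to
    kind: str    # structural type, e.g. "title" (hypothetical names)
    text: str    # textual content of the element


@dataclass
class Query:
    what: str                         # content requirement, free text
    where: Optional[Set[str]] = None  # structural requirement; None means anywhere


def matches(query: Query, element: Element) -> bool:
    """True if the element satisfies both the 'what' and the 'where'."""
    if query.where is not None and element.kind not in query.where:
        return False
    words = set(element.text.lower().split())
    # Naive word overlap stands in for a real relevance estimate.
    return any(term in words for term in query.what.lower().split())


def search(query: Query, elements: List[Element]) -> List[Element]:
    """Return the elements that satisfy the query, in collection order."""
    return [e for e in elements if matches(query, e)]


# Hypothetical toy collection.
COLLECTION = [
    Element("d1", "title", "structured document retrieval"),
    Element("d1", "section", "interfaces for retrieval"),
    Element("d2", "section", "colour screens and monitors"),
]

if __name__ == "__main__":
    print(len(search(Query("retrieval"), COLLECTION)))
    print(len(search(Query("retrieval", where={"title"}), COLLECTION)))
```

Leaving `where` unset corresponds to a user who specifies content only, while setting it corresponds to a user who also states the structural requirement.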
4. A graphical user interface for hierarchically structured document retrieval
The choice of targeting hierarchically structured documents for our GUI is motivated by the consideration that many of the documents found in digital libraries are of this type. All books are hierarchically structured, being organised in chapters, sections, subsections, paragraphs, and so on down to the single sentence. Most scientific articles, technical reports, manuals, product and software documentation are also hierarchically structured. It is clear that the purpose of the hierarchical structure is to enable different levels of abstraction in the presentation. If we represent this hierarchical structure with a tree, where the highest level is the root and the lowest levels are the leaves, then moving toward the leaves brings about a specialisation of the semantic content being presented.
When large and structurally complex documents, such as entire books, are searched, the user is often interested in some very specific parts of the document. Sometimes it could be a rather large part, such as an entire chapter, because the user might want the information sought to be complemented by a context. In some other cases, the user might be interested only in a few sentences. In our experience with searching by browsing entire textbooks we found that often the user does not know a priori which structural level might provide the information sought (Crestani & Ntioudis, 2002). The user is only able to identify the information and the structural level that contains it when he/she sees it.
Most retrieval browsing tools do not visualise the structure of the document or indicate to the user which element is most likely to be relevant. We believe this information can be very useful to enhance the effectiveness of structured document retrieval. Nevertheless, this information needs to be provided to the user in a low cognitive load modality, for example through the use of a GUI. In the following we present the design and current implementation of a GUI for the visualisation of the retrieval interactions of hierarchically structured documents.
4.1. Design
In Section 1 we have argued that a GUI for effective structured document retrieval should enable a user to specify the required content and structure in a natural way. In addition, the GUI should: (a) provide informative and explanatory feedback to the user; (b) capture user selective relevance feedback; (c) provide functionalities for both novice and expert users. The interface we have designed has the following characteristics:
• it enables the modelling of hierarchically structured documents, that is, of the most common type of long structured documents;
• it allows the user to forget about structural elements, if the user so wishes, but still informs the user of the use the system makes of the structure, which is a suitable interaction for novice users and for (CS,C) systems;
• it allows the user to choose the structural elements the system should consider as most important, which is a suitable interaction for advanced users and (CS,CS) systems;
• it allows the user to provide inclusive or selective relevance feedback by indicating which structural part of the document should be considered most relevant;
• it provides the user with a view of the overall search interaction with the system, giving the user a sense of where the search is going and enabling the user to go back to any past stage of the search process.
304 The main element of the GUI is a docball, that is an iconic representation of the structured 305 document. An example of a docball with a corresponding hierarchically structured document is
IPM 641 14 December 2002 Disk used
No. of Pages 21, DTD = 4.3.1 SPS, Chennai
ARTICLE IN PRESS
F. Crestani et al. / Information Processing and Management xxx (2002) xxx–xxx
i
a
h
2 c d e
1
3
g
b
3 f g h i
2
f
c
PR OO
e
F
1 a b
structured document
9
d
docball
Fig. 1. Visualisation of a structured document.
RE
CT
ED
depicted in Fig. 1. Using a docball it is possible to represent the different structural elements of the document: the whole document (the inner circle), or its sections and subsections (the outer circles), down to the smallest elements that the system designer decides to represent (the outermost circle). The docball also represents the way structural elements of the document are hierarchically nested and enables the user to select any element with the touch of the mouse. The selection process is governed by the document structure, so that the selection of a section (at any level) implies the selection of all its subsections, with the innermost circle implying the selection of the whole document. This document representation is similar in concept to the tree-maps proposed by Shneiderman (1992) and it has been proved to be intuitive to users. In addition, it enables the visualisation of the parts of the document that are estimated to be relevant, in a fashion similar to HearstÕs TileBars (Hearst, 1995) for flat structured documents. Fig. 2 depicts a single query run. A query run is constituted by a query, and a retrieved set, which is represented as a single element and as a list of retrieved documents. A query is a rectangular iconic box than when moved upon with the mouse displays the natural language expression produced by the user. A retrieved set is represented using a simplified version of the
UN CO R
306 307 308 309 310 311 312 313 314 315 316 317 318 319 320
t1,t3
query
more relevant
d2 d3 d1
less relevant
result set of documents
Fig. 2. Visualisation of a query and a retrieved set.
IPM 641 14 December 2002 Disk used 10
ARTICLE IN PRESS
No. of Pages 21, DTD = 4.3.1 SPS, Chennai
F. Crestani et al. / Information Processing and Management xxx (2002) xxx–xxx
t1,t3
F
query
sec1,d2 sec2,d2 sec1,d1 result set viewed as sections
PR OO
sec2,d3 d2
docball
view the doc/part content
Fig. 3. Visualisation of a retrieved structured document.
RE
CT
ED
docball which shows only the hierarchical levels of the structure of the documents. The level in bright colour indicates the level upon which the relevance of each document in the list has been estimated. This level, which indicates the “retrieved document specificity”, can be specified by the user. The retrieved set graphical element can be expanded into a retrieved list of documents, ranked by estimated relevance, with the number of documents in the list being set by the user. Moving the mouse over the iconic document element in the list displays some data about the document (e.g. title and author). Using the mouse, the user can select any document in the list to be expanded into a docball, as depicted in Fig. 3. The docball depicts in different shades or colours the relevance estimates for the different parts of the document. This helps explain to the user why and where the document was found to be relevant. The user can visualise the part of the document corresponding to any element of its docball. The user can then decide whether the system has estimated the relevance of that element correctly and can initiate one of two different actions: (a) the user can ask for the retrieved set to be presented according to a relevance ranking based on another level in the document structure, for a better inspection of the set (see Fig. 4) or (b) the user can ask for an automatic reformulation of the query based on relevance feedback provided on some specific element at some specific level of the structure of some inspected document (see Fig. 5). In fact, whenever the user finds an element of some document to be a good example of the information need, an automatic query reformulation process by relevance feedback can be initiated. This causes the creation of a new query representation on the horizontal line that represents the session timeline. An example of this is depicted in Fig. 5. The user can visualise the content of the new query by moving the mouse over the query box.
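As an illustration only, the reformulation step triggered by relevance feedback could be sketched as follows. The text above does not specify how the new query is built, so the expansion rule used here (append the k most frequent unseen terms of the element marked relevant) is an assumption, as are the names `reformulate`, `timeline` and `feedback_text`. The term labels mirror those that appear in Figs. 4 and 5.

```python
from collections import Counter

def reformulate(query_terms, relevant_text, k=1):
    """Hypothetical expansion rule: keep the old query terms and append the
    k most frequent terms of the relevant element not already in the query."""
    seen = set(query_terms)
    counts = Counter(w for w in relevant_text.lower().split() if w not in seen)
    return list(query_terms) + [term for term, _ in counts.most_common(k)]

# The session timeline is modelled as a plain list of queries (an assumption).
timeline = [["t1", "t3"]]

# Text of the structural element the user marked as relevant (toy data).
feedback_text = "t2 t1 t2 t3"

new_query = reformulate(timeline[-1], feedback_text, k=1)
timeline.append(new_query)  # the new query becomes the latest timeline entry
```

Keeping every query in the timeline, rather than overwriting the previous one, is what would let the user return to an earlier retrieved list later in the session.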
4.2. Implementation
A first implementation of the GUI has been carried out at the University of Valladolid. The implementation is in Java, working under the Linux operating system, but potentially portable to other operating systems, given the characteristics of Java.
Fig. 4. Visualisation of a query session. [Figure labels: time evolution; query; t1, t3; result set of documents (d1, d2, d3), from more relevant to less relevant; result set viewed as sections (sec2,d3; sec1,d2; sec2,d2; sec1,d1).]
Fig. 5. Visualisation of relevance feedback. [Figure labels: relevance feedback; t2; new query t1, t2, t3; sec2,d3; sec1,d2; sec2,d2; sec1,d1.]
We present here the current implementation of the GUI through an example of user interaction. The example refers to a user querying the same document collection used in the evaluation presented in Section 6, that is, a collection of Shakespeare plays. Fig. 6 shows the GUI for novice users. The GUI shows a query area, in which the user can write the text of the query and select the structural level of interest (in this case the act level). Just below the query area is the query history area, where the results of each query are presented. The user can set how many hits should be displayed (10 in the figure). There is also a document display area that shows the text of the document and the docball representation of the document. The specific
structural element selected in the query history area is displayed and also highlighted in the docball. The docball shows in different colours the parts of the document that are most relevant, and moving the mouse over any structural element in the docball displays the text of that element together with its relevance estimate for the query. Visualised documents can be displayed in XML, text, or XSL format using the view buttons. The user can ask that hits for the same query be displayed ranked in relation to another structural level (for example the scene level) as depicted in Fig. 7. This new interaction becomes part of the query history, so that the user can always go back to the previous retrieved list. The selection of a new structural level is highlighted in the docball area. Fig. 8 shows the user expanding the window with the query history and providing relevance feedback on one retrieved element. Pressing the search button activates the relevance feedback at that structural level, and elements similar to the one indicated as relevant by the user are retrieved and displayed in a new retrieved list (query 10). At any time in the query session the user can go back to a previous retrieved list and analyse any structural element of the retrieved document. The start and back buttons can be used to move quickly in the query history and can also be used to move in the hierarchical structure of the document. In fact, the user can zoom into a specific structural level at any time (using one of the mouse buttons), asking the system to display only relevance information related to that level of the docball. This is particularly useful in the case of documents that are highly structured and have many elements at some level (Fig. 9).
Fig. 6. Visualisation of query results at the ‘‘act’’ structural level.
Fig. 7. Visualisation of query results at the ‘‘scene’’ structural level.
Fig. 8. Query history and relevance feedback at the ‘‘scene’’ structural level.
Fig. 9. Zooming on a scene.
5. On indexing and retrieval of hierarchically structured documents
In standard IR retrievable units are fixed, so only the entire document, or, sometimes, some pre-defined parts such as chapters or paragraphs constitute retrievable units. The structure of documents, often quite complex and consisting of a varying number of chapters, sections, tables, formulae, bibliographic items, etc., is therefore “flattened” and not exploited. Classical retrieval methods lack the possibility to interactively determine the size and the type of retrievable units that best suit an actual retrieval task or user preferences. Some IR researchers are aiming at developing retrieval models that dynamically return document components of varying complexity. A retrieval result may then consist of several entry points to the same document, corresponding to structural elements, whereby each entry point is weighted according to how well it satisfies the query. Examples of work in this direction can be found in Chiaramella, Mulhem, and Fourel (1996), Frisse (1988), Lalmas and Ruthven (1998), Myaeng, Jang, Kim, and Zhoo (1998) and Roelleke (1999). Models proposed so far exploit the content and the structure of documents to estimate the relevance of document components to queries, based on the aggregation of the estimated relevance of their related components. These models have been based on various theories, such as fuzzy logic (Bordogna & Pasi, 2000), Dempster–Shafer's theory of evidence (Lalmas & Ruthven, 1998), probabilistic logic (Baumgarten, 1997; Roelleke, 1999), and Bayesian inference (Myaeng et al., 1998). What these models have in common is that the basic components of their retrieval function are variants of the standard IR
term weighting schema, which combines term frequency with inverse document frequency, often normalised taking into account document length. Evidence associated with the document structure is often encoded into one or both of these dimensions. A somewhat different approach has been presented in Roelleke, Lalmas, Kazai, Ruthven, and Quicker (2002), where evidence associated with the document structure is made explicit by introducing an “accessibility” dimension. This dimension measures the strength of the structural relationship between document components: the stronger the relationship, the more impact the content of a component has in describing the content of its related components.
The work presented in this paper is part of a wider project aimed at the design, implementation, and evaluation of a complete system for structured document retrieval. The system will combine a retrieval engine based on the aggregation-based approach to the estimation of the relevance of the document, with an interface with explanatory and selective feedback capabilities specifically designed for the structured document retrieval task. A number of approaches are currently being explored for the design and implementation of the retrieval engine. They range from heuristically driven modifications of the vector space model to take into account different term weights for different structural elements (work inspired by Wilkinson (1994)), to more theoretically driven work using a Bayesian framework to combine the relevance evidence of different structural elements. This latter direction, inspired by results presented in Bordogna and Pasi (2000), De Campos, Fernandez-Luna, and Huete (2002), Kerkouba (1985) and Lalmas and Moutogianni (2000), seems the most exciting to us. The presentation of this work is outside the scope of this paper. At this stage, our work has been mostly concerned with the design, implementation, and evaluation of the GUI.
However, in order to evaluate the GUI effectively using a task-oriented methodology we need a preliminary retrieval engine. Since we could not wait for our more methodological and theoretical work to be implemented, we decided to use a rather simplistic approach. This approach, tailored to hierarchically structured documents, consists of producing in response to a query as many rankings as there are hierarchical structural levels in the documents. The user can then choose the ranking that seems most appropriate to the level of specificity required by the information need. So, for example, if a user is interested in a paragraph about a particular topic, the entire retrieved list can be viewed ranked by the estimated relevance of paragraphs, while if the user is interested in a chapter about that topic, the entire retrieved list can be ranked by estimated relevance of chapters. Ties are resolved by ranking equally relevant structural elements by the estimated relevance of the structural level directly below in the hierarchy. No relation between the structural elements of the same document has been exploited.
This approach is rather simplistic, but at the current stage of our work it has a number of advantages. First, it can be implemented with any existing classical IR model, such as the vector space model (Salton, 1968) or any variation of the probabilistic model (Crestani, Lalmas, van Rijsbergen, & Campbell, 1998). Once the level of specificity (or granularity) of the document has been specified (e.g. chapter, section, subsection, paragraph, etc.), any classical model can index and rank the text fragments as if they were individual documents. In other words, this approach makes it possible to implement a (C,CS) system using a model designed for a (C,C) system. Secondly, indexing and ranking at different levels of specificity can be carried out using concurrent processes and the retrieved sets can be buffered for a fast presentation to the user. This enables the user to switch quickly and effortlessly from one ranked list to another, providing the kind of
specificity of presentation of results that one would expect from a (CS,CS) system without the complexity of expressing structural requirements in the query. Finally, this approach enables the evaluation of the relevance estimate for every structural element of the documents. These data are necessary for the colouring of the docballs of the GUI, which provide the user with a clear visual indication of which structural elements of a document are most relevant. We are fully aware of the inefficiencies of this approach, but at the current stage of our work we are not concerned with retrieval effectiveness issues, since we only aim at evaluating the effectiveness of the GUI. For the evaluation reported in Section 6, we implemented the retrieval engine using the vector space model. Indexing was performed using the standard normalised tf-idf weighting schema (Baeza-Yates & Ribeiro-Neto, 1999).
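The one-ranking-per-level idea described above can be illustrated with a minimal sketch. It is not the engine used in the evaluation: the idf variant `log(1 + N/df)`, the whitespace tokeniser, the tuple-based element identifiers and the function names are all assumptions made for this example. The tie-breaking rule is also an interpretation, in which an element's tie-break key is the best score among its children at the level directly below.

```python
import math
from collections import Counter

def score_elements(elements, query):
    """Cosine similarity between the query and each element, treating every
    element at this level as an individual document (tf-idf, length-normalised)."""
    counts = {eid: Counter(text.lower().split()) for eid, text in elements.items()}
    n = len(counts)
    df = Counter(term for c in counts.values() for term in c)
    idf = {term: math.log(1 + n / df[term]) for term in df}  # assumed idf variant
    q_vec = {t: tf * idf[t] for t, tf in Counter(query.lower().split()).items() if t in idf}
    q_norm = math.sqrt(sum(w * w for w in q_vec.values())) or 1.0
    scores = {}
    for eid, c in counts.items():
        vec = {t: tf * idf[t] for t, tf in c.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        dot = sum(w * vec.get(t, 0.0) for t, w in q_vec.items())
        scores[eid] = dot / (norm * q_norm)
    return scores

def rank_all_levels(levels, query):
    """levels: list of (name, {path_tuple: text}), coarsest level first.
    Returns one ranked list of element paths per structural level."""
    scored = [(name, score_elements(elems, query)) for name, elems in levels]
    rankings = {}
    for i, (name, scores) in enumerate(scored):
        below = scored[i + 1][1] if i + 1 < len(scored) else {}

        def tie_key(path):
            # Best score among this element's children one level down (assumption).
            return max((s for p, s in below.items() if p[:len(path)] == path), default=0.0)

        rankings[name] = sorted(scores, key=lambda p: (-scores[p], -tie_key(p), p))
    return rankings

# Toy collection: both chapters contain the same words, so they tie at chapter level.
levels = [
    ("chapter", {("d1",): "storm calm storm calm", ("d2",): "storm storm calm calm"}),
    ("section", {("d1", "s1"): "storm calm", ("d1", "s2"): "storm calm",
                 ("d2", "s1"): "storm storm", ("d2", "s2"): "calm calm"}),
]
rankings = rank_all_levels(levels, "storm")
```

In this toy collection the two chapters receive the same score for the query, so the order at chapter level is decided by their sections: d2 comes first because one of its sections is entirely about the query term. Since all rankings are computed up front, switching level in an interface would amount to a dictionary lookup.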
6. Evaluation
In this section we present the methodology and the results of a task and user oriented evaluation of the proposed interface.
6.1. Methodology
We are currently evaluating the proposed GUI following a typical formative design-evaluation process. The process consists of carrying out a first implementation based on the original design, evaluating this implementation, and using the results of the evaluation to refine the design, so that a new improved implementation can be carried out and re-evaluated.
The first prototype GUI has been evaluated at the University of Valladolid following a task-oriented and user-centred evaluation methodology. The evaluation methodology is very similar to the one developed in the context of the Esprit “Mira” Project (Pejtersen & Fidel, 1998). This methodology aims at putting the user at the centre of the evaluation, by analysing the user performance in the context of tasks as similar as possible to the real tasks the user carries out in real life.
The evaluation involved 10 users with varying degrees of expertise in using retrieval tools and GUIs. The users were two librarians and eight graduate students of the Department of Computer Science. All users were proficient in English and had no linguistic difficulties in carrying out the evaluation. A control group of 10 users of very similar background was involved at the University of Strathclyde to compare the results of part of the evaluation. The users at the University of Strathclyde were two members of staff of the Department of Information Science and eight graduate students. It should be clear, however, that the purpose of the control group was not to compare two different systems or two different GUIs, but simply to compare the performance of the two groups in carrying out the same tasks using different tools with quite similar features. A more controlled evaluation, gathering and comparing quantitative data, was outside the scope of this initial study, which aimed at finding out if there were major faults in the design of the GUI. A more complete evaluation is planned in the future.
The document collection used in the evaluation was the XML version of 4 Shakespeare plays. This collection is available in the public domain. Each play is hierarchically divided into acts, scenes, and speeches. There are other intermediate structural elements, but they were not considered appropriate to be used in this evaluation, since they referred to characters in the play, scene descriptions, etc., which would have required considerable background knowledge of Shakespeare plays on the part of the users. Given the small size of this collection, we decided to use the act as the default “document” level. The collection comprises 37 acts, divided into 162 scenes and 5300 speeches.
The evaluation consisted of the following three tests:
• The first test aimed at evaluating the effectiveness of the GUI during the structured document retrieval task. The evaluation was carried out in an indirect way, by giving users five tasks to carry out which involved searching the collection of documents and locating some specific document parts. The tasks were simple questions whose answer was to be found in one or more acts, scenes or speeches. Examples of questions are: “Who kisses the hand of Queen Mary?”, “What is the thing to do after killing all the lawyers?”, “In which scene Macbeth tells his doctor about the water of his land?”. Users had a limited amount of time to carry out the tasks (90 min), which started after users had followed an example of a session in which all the functionalities of the GUI were briefly explained. The control group carried out these tasks using the same IR system, but with a flat ranked list of retrieval results. Results were presented at the smallest level of granularity of the structure (speech), but the introduction of hyperlinks in the retrieved documents enabled users in the control group to jump up in the hierarchical structure to the level required, for example, from speech to scene or from scene to act.
• The second test aimed at evaluating the user understanding of the docball visual element of the GUI. Again, this was done in an indirect way, after the first test above was carried out. It involved asking users to: (a) depict a docball representing a specific document, and (b) explain the structure of a document from its docball representation. The documents and docballs given to the users were unrelated to and different from those present in the collection used in the first test. The control group did not carry out this test.
• The third test aimed at capturing users' judgements and personal considerations regarding the docball visual element and the GUI. It comprised a questionnaire with 46 short questions concerning the appearance, querying and navigation functionalities and usability of the interface. Some questions were closed questions, but most were open questions to capture as much as possible of the users' opinion of the interface. Examples of questions are: Appearance: “Is the alignment of the interface elements correct?”, “What do you think of the colour and size of the windows?”. Querying and navigation: “What is your opinion on the accessibility of the menu bar?”, “Do you find the access to GUI's windows easy using the keyboard?”. Usability: “How do you rate the difficulty in carrying out a search?”, “How do you rate the difficulty of managing the query tree?”. The control group answered a very similar questionnaire, where questions on the elements of the GUI were appropriately modified.
In Section 6.2 we present a synthesis of the results of the evaluation.
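To make the three structural levels of the collection concrete, the following sketch shows one way the act, scene and speech units could be extracted from an XML play so that each can be indexed as a retrievable unit. The element names (`PLAY`, `ACT`, `SCENE`, `SPEECH`, `LINE`), the sample text and the function `extract_levels` are assumptions made for illustration; the actual markup of the collection is not given in this paper.

```python
import xml.etree.ElementTree as ET

# Tiny made-up play using assumed tag names; not taken from the collection.
SAMPLE = """<PLAY><TITLE>Sample Play</TITLE>
<ACT><TITLE>ACT I</TITLE>
  <SCENE><TITLE>SCENE I</TITLE>
    <SPEECH><SPEAKER>A</SPEAKER><LINE>Good morrow</LINE><LINE>to you</LINE></SPEECH>
    <SPEECH><SPEAKER>B</SPEAKER><LINE>Indeed</LINE></SPEECH>
  </SCENE>
  <SCENE><TITLE>SCENE II</TITLE>
    <SPEECH><SPEAKER>A</SPEAKER><LINE>Farewell</LINE></SPEECH>
  </SCENE>
</ACT>
<ACT><TITLE>ACT II</TITLE>
  <SCENE><TITLE>SCENE I</TITLE>
    <SPEECH><SPEAKER>B</SPEAKER><LINE>Welcome back</LINE></SPEECH>
  </SCENE>
</ACT>
</PLAY>"""

def extract_levels(xml_text, play_id):
    """Return {"act": {...}, "scene": {...}, "speech": {...}} mapping a path
    tuple (play, act, scene, speech) to the spoken text of that element."""
    root = ET.fromstring(xml_text)
    levels = {"act": {}, "scene": {}, "speech": {}}
    for a, act in enumerate(root.iter("ACT"), 1):
        act_parts = []
        for s, scene in enumerate(act.iter("SCENE"), 1):
            scene_parts = []
            for p, speech in enumerate(scene.iter("SPEECH"), 1):
                # Only LINE text is kept; speaker names and titles are skipped.
                text = " ".join("".join(line.itertext()).strip()
                                for line in speech.iter("LINE"))
                levels["speech"][(play_id, a, s, p)] = text
                scene_parts.append(text)
            levels["scene"][(play_id, a, s)] = " ".join(scene_parts)
            act_parts.extend(scene_parts)
        levels["act"][(play_id, a)] = " ".join(act_parts)
    return levels

levels = extract_levels(SAMPLE, "p1")
```

The path tuples record where each element sits in the hierarchy, which is the information an interface would need to move from a speech up to its scene or act.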
6.2. Results
Because of space limitations, here we present only the major findings and the most important indications on how to improve the GUI in future work. The full details of the evaluation are reported in Diez and Rostaing (2001).
In the first test 8 out of 10 users using the GUI were able to correctly accomplish all five tasks. The remaining two users were able to accomplish only four out of five tasks within the time limit. The fastest user needed just two searches per task on average and completed the tasks in only 25 min; the slowest user required six searches to complete a task. As a comparison, only 6 out of 10 users in the control group completed the tasks correctly in the allocated time. Of the remaining four users, two accomplished four tasks and two only three tasks.
A feature of the GUI that all users utilised in carrying out the tasks was the level selection bar. This enabled users to select any structural element of a document at any level and made it easier for users to assess the relevance of the document under consideration. It was heavily used to locate the most relevant structural element of a document, since the colouring of the docball could only give a clue in that direction. Another element of the GUI often used in carrying out the tasks was the zooming, which was found to be useful in those tasks where a specific structural level of the document was requested. By zooming at that level, users could remove from the docball the visual display of relevance of other levels. On the other hand, one feature that was used rarely was relevance feedback. Only 2 out of 10 users made use of it in both the GUI and control groups. These two users were the most expert users of the two groups, that is, the librarians and the members of staff, respectively. A possible explanation for this is that users were not familiar with such a feature, which is uncommon in other information access systems such as web search engines, the most popular information access tools for students.
The second test showed that all users understood the GUI functionalities and in particular the graphical representation of the structured document. Users had almost no difficulties in drawing a docball for any specific structured document presented to them and also had no difficulties in inferring a document structure given a docball. Of course, we were quite pleased with this result, which shows that the docball metaphor at the centre of the GUI is a very natural and clear representation of the document structure. Given this result, we could conclude that the lack of completion of some tasks in the first test could not be attributed to a misunderstanding of the use of the docball metaphor. The only minor problem users found in this test was related to documents that did not conform perfectly to the hierarchical structure. Users had to think more about the task when drawing a docball for documents that had some chapters organised in sections and subsections and others without such subdivision. Similarly, they showed some hesitation in interpreting docballs for imperfectly formed documents. However, these doubts did not seem to influence the effectiveness in accomplishing the tasks in the first test, where one task actually involved a few such odd documents.
Finally, the analysis of the GUI users' comments in the questionnaire enabled us to draw the following observations:
• The presence of a single query box and button together with the possibility of identifying the query level graphically was considered very positively by most users.
• The query tree representing the history of the query sessions was considered non-intuitive and difficult to understand and use. This was mainly due to the limited amount of information it presents to the users about documents. Four users suggested adding more information about the title of the play and showing more clearly the structural path of the element presented.
• All users appreciated the docball metaphor and its use for document structural level selection and navigation. Seven users indicated their appreciation of the possibility of navigating the structure of the document using the docball.
• The response time of the system was too long and often annoyed users. This was due to the current implementation of the GUI, which is Java based and has not been optimised yet.
Therefore, this first evaluation has confirmed that the docball representation of structured documents and the GUI are intuitive and easy to use. Nonetheless, there are still a number of directions to explore to improve the usability of the GUI. Based on the results of this evaluation we are currently improving the GUI to overcome some of its limitations. A new evaluation will be carried out as soon as a new improved version of the GUI becomes available.
One of the difficulties in evaluating any system or GUI for structured document retrieval is related to the limited availability of collections of structured documents. Although there is a large number of documents that are structured, they are either not publicly available or not marked up at the appropriate level of detail (e.g. TREC documents). A possible solution to this problem would be to create a collection of structured documents from a subset of some existing test collection (see for example Wilkinson, 1994). However, this approach is likely to produce a very artificial document collection that, though valid for a search engine evaluation, would not be adequate for a user-centred and task-oriented evaluation. In addition, we found that only documents in SGML and XML have the tags necessary to clearly identify the structural elements of the documents. Documents in HTML are too irregular and not sufficiently marked up. Recently, the INEX (Initiative for the Evaluation of XML Retrieval) project has begun developing a large test collection of structured documents.¹ This collection will comprise XML marked up structured documents, queries related to content and structure, and the corresponding relevance assessments. It will be a very valuable tool for anybody like us working in this increasingly important area of research.
7. Conclusions
We presented a GUI for structured document retrieval targeted for hierarchically structured documents. An initial evaluation of the interface shows that it provides the user with an intuitive and powerful set of tools for hierarchically structured document searching, retrieved list navigation, and search refinement.
We are currently improving the GUI using the results of the evaluation. We are also designing an aggregation-based IR system for hierarchically structured documents. Although the GUI can be plugged into any structured document retrieval system, it is going to be a vital component of the structured document retrieval system that we are currently developing.
¹ See http://ls6-www.informatik.uni-dortmund.de/ir/projects/inex/ for more information.
Acknowledgements
A large part of the implementation of the GUI was carried out by Guzmán Díez Barbero, Óscar Martínez Sanz and Antonio Rostaing Bellido, students at the University of Valladolid.
This work was partially supported by The British Council under the “Acciones Integradas” framework, the Spanish CICYT program (project TEL99-0335-C04) and the FEDER program (project 1FD97-2030).
References
Agosti, M., Crestani, F., & Melucci, M. (1997). On the use of information retrieval techniques for the automatic construction of hypertexts. Information Processing and Management, 33(2), 133–144.
Agosti, M., & Smeaton, A. (Eds.). (1996). Information retrieval and hypertext. Boston, USA: Kluwer Academic Publishers.
Baeza-Yates, R., & Ribeiro-Neto, B. (1999). Modern information retrieval. Harlow, UK: Addison-Wesley.
Barker, P. (1991). Interactive electronic books. Interactive Multimedia, 2(1), 11–28.
Baumgarten, C. (1997). A probabilistic model for distributed information retrieval. In Proceedings of ACM SIGIR (pp. 258–266). Philadelphia, USA.
Belew, R. (2000). Finding out about: A cognitive perspective on search engines technology and the WWW. Cambridge, UK: Cambridge University Press.
Bordogna, G., & Pasi, G. (2000). Flexible representation and querying of heterogeneous structured documents. Kibernetika, 36(6), 617–633.
Callan, J. (1994). Passage-level evidence in document retrieval. In Proceedings of ACM SIGIR (pp. 302–310). Dublin, Ireland.
Carroll, J., Mack, R., & Kellogg, W. (1988). Interface metaphors and user interface design. In M. Helander (Ed.), Handbook of human computer interaction (pp. 67–85). Amsterdam, Holland: Elsevier Science Publisher.
Chiaramella, Y. (2001). Information retrieval and structured documents. In M. Agosti, F. Crestani, & G. Pasi (Eds.), Lecture notes in computer science: Vol. 1980. Lectures on information retrieval (pp. 291–314). Heidelberg, Germany: Springer-Verlag.
Chiaramella, Y., Mulhem, P., & Fourel, F. (1996). A model for multimedia information retrieval. Technical report FERMI/96/4, ESPRIT Basic Research Action, project number 8134 – FERMI, Department of Computing Science, Glasgow University.
Crestani, F., de la Fuente, P., & Vegas, J. (2001). Design of a graphical user interface for focussed retrieval of structured documents. In Proceedings of SPIRE 2001, symposium on string processing and information retrieval (pp. 246–249). Laguna de San Rafael, Chile.
Crestani, F., Lalmas, M., van Rijsbergen, C., & Campbell, I. (1998). Is this document relevant? ... probably. A survey of probabilistic models in information retrieval. ACM Computing Surveys, 30(4), 528–552.
Crestani, F., & Ntioudis, S. (2002). User centred evaluation of an automatically constructed hyper-textbook. Journal of Educational Multimedia and Hypermedia, 11(1), 3–19.
Cutler, M., Shih, Y., & Meng, W. (1997). Using the structure of HTML documents to improve retrieval. In Proceedings of the USENIX symposium on Internet technologies and systems (pp. 241–251). Monterey, CA, USA.
De Campos, L., Fernandez-Luna, J., & Huete, J. (2002). A layered Bayesian network model for document retrieval. In Proceedings of the BCS-IRSG European colloquium on information retrieval research (pp. 169–182). Glasgow, UK.
Diez, G., & Rostaing, A. (2001). Docball: una interfaz de consulta para recuperación de documentos XML. M.Sc. Thesis, Departamento de Informática, Universidad de Valladolid, Valladolid, Spain.
Dunlop, M., & McDonald, K. (2000). Supporting different search strategies in a video query interface. In Proceedings of the RIAO conference. Paris, France.
Frakes, W., & Baeza-Yates, R. (Eds.). (1992). Information retrieval: Data structures and algorithms. Englewood Cliffs, NJ, USA: Prentice Hall.
Frisse, M. (1988). Searching for information in a medical handbook. Communications of the ACM, 31(7), 880–886.
Hauptmann, A., & Witbrock, M. (1997). Informedia news-on-demand: Using speech recognition to create a digital video library. In M. Maybury (Ed.), Intelligent multimedia information retrieval (pp. 215–240). Menlo Park, CA, USA: AAAI Press/The MIT Press.
Hearst, M. (1995). TileBars: visualisation of term distribution information in full text information access. In Proceedings of CHI'95 (pp. 59–66). Denver, CO, USA.
Ingwersen, P. (1992). Information retrieval interaction. London, UK: Taylor Graham.
Kerkouba, D. (1985). Indexation automatique et aspects structurels des textes. In Proceedings of the RIAO conference (pp. 227–249). Grenoble, France.
Kim, J., & Oard, W. (2002). The use of speech retrieval systems: A study design. In A. Coden, E. Brown, & S. Srinivasan (Eds.), Information retrieval techniques for speech applications (pp. 87–93). Berlin, Germany: Springer-Verlag.
Kleinberg, J. (1998). Authoritative sources in a hyperlinked environment. In Proceedings of the ninth annual ACM-SIAM symposium on discrete algorithms (pp. 668–677). San Francisco, CA, USA.
Lalmas, M., & Moutogianni, E. (2000). A Dempster–Shafer indexing for the focussed retrieval of a hierarchically structured document space: Implementation and experiments on a web museum collection. In Proceedings of the RIAO conference. Paris, France.
Lalmas, M., & Ruthven, I. (1998). Representing and retrieving structured documents with Dempster–Shafer's theory of evidence: Modelling and evaluation. Journal of Documentation, 54(5), 529–565.
Myaeng, S., Jang, D., Kim, M., & Zhoo, Z. (1998). A flexible model for retrieval of SGML documents. In Proceedings of ACM SIGIR (pp. 138–145). Melbourne, Australia.
Navarro, G., & Baeza-Yates, R. (1997). Proximal nodes: A model to query document databases by content and structure. ACM Transactions on Information Systems, 15(4), 400–435.
Pejtersen, A., & Fidel, R. (1998). A framework for work centred evaluation and design: A case study of IR and the Web. Working paper for Mira Workshop, Grenoble, France.
Roelleke, T. (1999). POOL: Probabilistic object-oriented logical representation and retrieval of complex objects – A model for hypermedia retrieval. Ph.D. Thesis, Department of Computer Science, University of Dortmund, Germany.
Roelleke, T., Lalmas, M., Kazai, G., Ruthven, I., & Quicker, S. (2002). The accessibility dimension for structured document retrieval. In Proceedings of the BCS-IRSG European colloquium on information retrieval research (pp. 284–302). Glasgow, UK.
Salton, G. (1968). Automatic information organization and retrieval. New York: McGraw Hill.
Salton, G., & Allan, J. (1994). Automatic text decomposition and structuring. In Proceedings of the RIAO conference: Vol. 1 (pp. 6–20). Rockefeller University, New York, USA.
Salton, G., Allan, J., & Buckley, C. (1994). Automatic structuring and retrieval of large text files. Communications of the ACM, 37(2), 97–108.
Shneiderman, B. (1992). Tree visualization with tree-maps: 2-d space-filling approach. ACM Transactions on Graphics, 11(1), 92–99.
Southall, R. (1989). Interfaces between the designer and the document. In J. André, R. Furuta, & V. Quint (Eds.), Structured documents (pp. 119–131). Cambridge, UK: Cambridge University Press.
Spence, R. (2001). Information visualization. Harlow, UK: Addison-Wesley.
Vegas, J. (1999). Un Sistema de Recuperación de Información sobre Estructura y Contenido. Ph.D. Thesis, Departamento de Informática, Universidad de Valladolid, Valladolid, Spain.
Wilkinson, R. (1994). Effective retrieval of structured documents. In Proceedings of ACM SIGIR (pp. 311–317). Dublin, Ireland.