From Information Retrieval to Hypertext and Back Again: The Role of Interaction in the Information Exploration Interface
by
Gene Golovchinsky
A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy
Graduate Department of Mechanical and Industrial Engineering
University of Toronto
© Copyright by Gene Golovchinsky 1997
From Information Retrieval to Hypertext and Back Again: The Role of Interaction in the Information Exploration Interface
Doctor of Philosophy, 1997
Gene Golovchinsky
Graduate Department of Mechanical and Industrial Engineering, University of Toronto
Abstract

This work explores the design space of user interfaces for large-scale full-text database retrieval systems. Research suggests that elements of hypertext interfaces may be merged with traditional information retrieval (IR) algorithms to produce flexible hybrid interfaces for user-directed information exploration. This work examines the effectiveness of multiple-view newspaper-like interfaces, and describes a prototype that uses newspaper-style layouts to organize information retrieval results. Finally, it explores some possible visualization techniques designed to aid browsing performance.

The first of two experiments in this thesis examines the effectiveness of the simultaneous display of several documents retrieved by a given query. Experimental results suggest that viewed recall increases with increasing numbers of articles displayed on the screen simultaneously. Subjects' decision-making strategies appear to be independent of user interface factors. The second experiment tests differences in behavior between query-based and link-based browsing. Differences in performance are found between groups of users employing different strategies, but not between interface conditions. These results suggest that dynamic query-mediated hypertext interfaces are viable alternatives to more explicit queries, and that subjects' intrinsic strategies have significant impact on their interaction with the system and on their performance.

This work proposes an implementation of dynamic links in the WWW medium. It concludes with a discussion about the nature of hypertext interfaces and about the role of the user interface in information exploration tasks, and suggests some avenues for future research in this area.
Acknowledgments

I would like to thank my parents for instilling in me the desire for knowledge; the rest was only a matter of time.

My adviser, Professor Chignell, deserves much thanks. His support — intellectual and financial — over the past five years has made this work possible. His perceptions of the field, and of research in general, have affected my current work, and will continue to do so in the future. Finally, his unflagging sense of humor has been a welcome companion, both on and off the research field.

I would like to thank my Committee for their patience with my ideas, and for their invaluable feedback. My external reviewer, Professor Marchionini, deserves particular credit for agreeing to come to Toronto in November without first checking the weather.

The research group, that multitude sometimes known as the Hyperactives, has been wonderfully — critically! — supportive; I've learned much over the years (and over pizza), and shall miss the heated discussions. Thank you Louisa, Norma and Terrie! Thanks also are due to my friends and co-workers at GMD-IPSI in Darmstadt, Germany. This thesis would not have been the same without our times together.

Professor Meadow and Rick Kopak of the Faculty of Information Studies deserve special thanks for arranging some last-minute expert subjects for one of my experiments. Without them, I would still be haunting the public libraries of Toronto.

Finally, I would like to thank all my friends who had the choice, and yet still put up with my stupid jokes.
Table of Contents Chapter 1. Introduction........................................................................... 1 1.1 Text display .......................................................................... 2 1.2 Query formulation................................................................... 2 1.3 Research motivation................................................................. 3 1.4 Overview ............................................................................. 5 Chapter 2. Literature Review. ................................................................... 7 2.1 Information retrieval ................................................................ 8 2.1.1 Queries .................................................................... 8 2.1.1.1 Relevance feedback .......................................... 9 2.1.2 Document representations............................................... 10 2.1.2.1 Document vector model...................................... 10 2.1.2.2 Inference Networks .......................................... 10 2.1.2.3 Proximity models............................................. 11 2.1.3 Evaluation................................................................. 12 2.2 Hypertext............................................................................. 13 2.3 Information exploration............................................................. 15 2.4 Electronic newspapers .............................................................. 17 2.5 Visualization of search results ..................................................... 19 2.6 Conclusions.......................................................................... 21 Chapter 3. Experimental Prototypes ............................................................ 22 3.1 QRL................................................................................... 22 3.2 StPatTREC........................................................................... 24 3.2.1 SGML documents ....................................................... 25 3.2.2 Interface................................................................... 25 3.3 BrowsIR..................................................................... 28 Chapter 4. VOIR, The Electronic Newspaper Prototype..................................... 30 4.2 System architecture.................................................................. 31 4.3 Interface .............................................................................. 33 4.4 Query notation ....................................................................... 34 4.5 Search engine requirements ........................................................ 35 Chapter 5. Experiment 1 ......................................................................... 36 5.1 Introduction .......................................................................... 36 5.2 Experimental Design ................................................................ 37 5.3 Research Hypotheses ............................................................... 37 5.3.2 Query notation............................................................ 38 5.3.3 Expertise .................................................................. 39 5.4 Subjects............................................................................... 40 5.5 Methodology......................................................................... 40 5.5.1 Task ....................................................................... 
40 5.5.2 Software .................................................. 41 5.5.3 Procedure ................................................. 46 5.6 Dataset variables..................................................................... 46 5.6.1 Independent measures................................................... 47 5.6.2 Dependent measures..................................................... 47 5.7 Results................................................................................ 48 5.7.1 Confirmatory analysis................................................... 48 5.7.1.1 Interface hypotheses ......................................... 48 5.7.1.2 Notation hypothesis.......................................... 50 5.7.1.3 Expertise hypotheses......................................... 50 5.7.2 Exploratory analysis..................................................... 51 5.7.2.1 Other effects................................................... 51
5.7.2.4 Cluster analysis............................................... 55 5.8 Discussion............................................................................ 59 5.8.1 Page flipping and viewed recall ........................................ 59 5.8.2 Subjects' strategies ...................................................... 60 5.8.3 User interface observations............................................. 61 5.8.4 Conclusions .............................................................. 62 Chapter 6. Dynamic Hypertext Newspaper Prototype........................................ 64 6.1 Introduction .......................................................................... 64 6.1.1 Browsing context ........................................................ 65 6.1.2 Hypertext links........................................................... 65 6.2 Context-setting links ................................................................ 65 6.3 Context-specific links............................................................... 66 6.3.1 Imbedded anchors ....................................................... 66 6.3.2 Dynamic link queries .................................................... 67 6.4 Context-independent links.......................................................... 72 6.5 Visualization ......................................................................... 72 6.5.1 Global visualization...................................................... 72 6.5.2 Local visualization ....................................................... 73 6.6 Applications.......................................................................... 74 6.6.1 Dictionary of Art ......................................................... 74 6.6.2 HCI Bibliography........................................................ 75 Chapter 7. Experiment 2 ......................................................................... 76 7.1 Introduction .......................................................................... 76 7.2 Experimental Design ................................................................ 76 7.3 Research Hypotheses ............................................................... 77 7.4 Subjects............................................................................... 78 7.5 Methodology......................................................................... 78 7.5.1 Task ....................................................................... 78 7.5.2 Software .................................................................. 79 7.5.3 Procedure ................................................................. 83 7.6 Dataset variables..................................................................... 84 7.6.1 Independent measures................................................... 84 7.6.2 Dependent measures..................................................... 85 7.7 Results................................................................................ 86 7.7.1 Confirmatory analysis................................................... 86 7.7.2 Exploratory analysis..................................................... 88 7.7.2.1 Cluster analyses .............................................. 88 7.7.2.2 Query effectiveness .......................................... 91 7.7.2.3 Query strategy ................................................ 94 7.7.2.4 Subjective variables .......................................... 95 7.8 Discussion............................................................................
96 7.8.1 Effectiveness of query types............................ 99 7.8.2 Query strategy............................................................ 101 7.9 Conclusions.......................................................................... 102 Chapter 8. Further Research..................................................................... 104 8.1 Extensions............................................................................ 104 8.1.1 Linking paradigms....................................................... 104 8.1.2 Newspaper hypertext.................................................... 106 8.1.3 Negated terms ............................................................ 107 8.1.5 Static links ................................................................ 108 8.1.6 Field-oriented queries ................................................... 109 8.1.7 Semantic hypertext....................................................... 109 8.2 Applications.......................................................................... 110 8.2.1 Multi-lingual interfaces.................................................. 110
8.2.3 The Web .................................................................. 111 8.3 A framework for interactivity ...................................................... 113 Chapter 9. Conclusions. ......................................................................... 117 9.1 Introduction .......................................................................... 117 9.2 Contributions ........................................................................ 118 9.3 Summary ............................................................................. 121 References ......................................................................................... 123 Glossary............................................................................................ 134
List of Tables Table 5-1. Experimental design. ................................................................ 37 Table 5-2. Comparison of frequencies of relevant articles. .................................. 52 Table 5-3. Correlations between dependent measures. . ..................................... 53 Table 5-4. Variables used for cluster analysis. ................................................ 56 Table 5-5. Results of cluster analysis. ......................................................... 57 Table 5-7. Searcher expertise by cluster assignment crosstabulation. ...................... 57 Table 6-1. Weighting schemes for weighted–sum operator. ................................ 67 Table 6-2. Query expansion algorithms based on terms from prior queries. .............. 70 Table 6-3. Weight combinations for queries. . ................................................ 70 Table 7-1. Experimental design. ................................................................ 77 Table 7-2. Recall and precision for initial page of each topic. ............................... 83 Table 7-3. Derived query type values for type variable. ..................................... 85 Table 7-4. Questionnaire response variables. ................................................. 86 Table 7-5. ANOVA of query frequency vs. query type. ..................................... 88 Table 7-6. Recall and precision comparisons by total query count. ........................ 89 Table 7-7. Cross-tabulation of classification methods. ...................................... 90 Table 7-8. Variable means for clusters. ........................................................ 90 Table 7-9. Recall and precision comparisons by cluster. .................................... 91 Table 7-10. Clusters based on normalized frequencies of different query types. ......... 91 Table 7-11. Comparison of reader/skimmer cluster assignment. ........................... 91 Table 7-12. Average recall and precision comparisons by query type. .................... 93 Table 7-13. Means for linking strategy clusters (ltqclust). .................................. 94 Table 7-14. Means for strategy clusters (tqclust). ............................................ 95 Table 9-1. Design innovations. ................................................................. 119 Table 9-2. Experimental and methodological results. ........................................ 120
List of Figures Figure 2-1. Recent trends in information exploration interfaces. ........................... 16 Figure 3-1. DQN graphical query. ............................................................. 23 Figure 3-2. (After Golovchinsky, 1993, Figure 2) The QRL browsing process. ....... 24 Figure 3-3. StPatTREC interface. .............................................................. 26 Figure 3-4. Architecture of StPatTREC. ...................................................... 27 Figure 3-5. BrowsIR architecture. ............................................................. 29 Figure 4-1. QRL mark up architecture. ........................................................ 32 Figure 4-2. VOIR mark up architecture. ...................................................... 33 Figure 5-1. Two-article VOIR interface. ...................................................... 42 Figure 5-2. Four-article VOIR interface. ...................................................... 43 Figure 5-3. Seven-article VOIR interface. .................................................... 44 Figure 5-4. Relationship among sets used to compute recall and precision measures. . . 45 Figure 5-5. Interaction between notation and expertise. ..................................... 50 Figure 5-6. Distribution of judgments. ........................................................ 53 Figure 5-7. Distribution of mean time (in seconds) to select first relevant article......... 54 Figure 5-8. Distributions of judged recall and precision..................................... 55 Figure 5-9. Expertise-Notation interaction for the frequency of proximity slider use.. . . 58 Figure 6-1. Recall-precision tradeoff for the seven weighting schemes. .................. 69 Figure 6-2. Precision–recall tradeoffs for query expansion algorithms. ................... 71 Figure 7-1. Sample experiment 2 query condition interface................................. 80 Figure 7-2. Sample experiment 2 link condition interface. .................................. 81 Figure 7-3. Precision distribution for queries based on description of topics............. 82 Figure 7-4. Recall distribution for queries based on descriptions of topics. .............. 82 Figure 7-5. Distribution of subjects categorized by number of queries. ................... 89 Figure 7-6a. Retrieved precision vs. satisfaction with retrieved results. .................. 96 Figure 7-6b. Judged precision vs. satisfaction with retrieved results. ..................... 96 Figure 7-7. Average performance of groups of subjects between two experiments. . . . . . 98 Figure 7-8. Comparison of performance in Experiment 1 and Experiment 2. ............ 99 Figure 7-9a. Average retrieved recall vs. query type and strategy. ........................ 100 Figure 7-9b. Average retrieved precision vs. query type and strategy. .................... 100 Figure 7-9c. Average viewed recall vs. query type and strategy. .......................... 100 Figure 7-9d. Average viewed precision vs. query type and strategy. ..................... 100 Figure 7-9e. Average judged recall vs. query type and strategy. ........................... 100 Figure 7-9f. Average judged precision vs. query type and strategy. ...................... 100 Figure 7-10. Interaction plots for recall and precision by strategy. ........................ 102 Figure 8-1. Newspaper layout strategies. ..................................................... 107 Figure 8-3. Users' interaction with the system. .............................................. 115
List of Appendices Appendix A. Topics used in Experiment 1..................................................... 136 Appendix B. Instructions for Experiment 1.................................................... 140 Instructions for the CQN condition .................................................... 140 Instructions for DQN condition ........................................................ 143 Appendix C. Topics used in Experiment 2..................................................... 146 Appendix D. Instructions for Experiment 2.................................................... 152 Instructions for query condition........................................................ 152 Instructions for naive link condition ................................................... 153 Instructions for informed link condition............................................... 154 Appendix E. Questionnaires for Experiment 2 ................................................ 156 Post-topic questionnaire................................................................. 156 Post-test questionnaire, query condition .............................................. 157 Post-test questionnaire, link conditions ............................................... 158 Appendix F. Variables used in Experiment 1.................................................. 160 Appendix G. Variables used in Experiment 2 ................................................. 163
Chapter 1. Introduction

Advice to Orators
In speech it's best — though not the only way —
Indeed the best, it's true, can be the worst —
Though often I ... as I had meant to say:
Qualify later; state the premise first.
Vikram Seth

Information exploration interfaces mediate the dialog between user and machine. They elicit information seeking intent from users and display potentially relevant documents. The cycle continues as the user modifies the representation of the information need in response to the retrieved results. This process may break down if users are not able to express their information needs adequately, or if the system fails to convey sufficient information about the content and structure of the relevant data.

Failure in user-to-system communication may be caused by awkward or inappropriate query languages. Rigid syntactic rules may make it difficult — sometimes prohibitively so — for users to express their search intent at the right level of abstraction. Conversely, poor or inappropriate feedback to the user may make it difficult to understand the relationships between the retrieved information and the information goal, and may also make it difficult for the user to modify the information request in an appropriate manner.

This research examines the role of the interface in facilitating the specification of information intent and in providing adequate presentation of retrieved results. The goal of this work is to gain insight into the effect that interface and interaction style have on users' performance and behavior.

The motivation for this research stems from some empirical observations that responsive, interactive interfaces tend to encourage exploratory behavior. For example, people typically enjoy using hypertext interfaces, and are frequently intimidated by, or fail to take advantage of, query interfaces. Unfortunately, large hypertexts are prohibitively time-consuming to create, forcing interface designers (and therefore users) to resort to query-based systems. This suggests that interface acceptance and system performance may be improved if query interfaces are made to resemble hypertext interfaces.
1.1 Text display

Typically, full-text retrieval systems provide a list of retrieved titles, and allow users to page through documents, one document at a time. Although this approach may be adequate for small collections, it breaks down when the user is confronted with databases consisting of large numbers of relatively small documents, many of which may be relevant to any particular topic. One problem with displaying one article at a time is that it does not provide the user with enough information about the relationship of a document to the rest of the database. In hypertext interfaces, this may produce symptoms of disorientation: even medium-sized hypertexts have been plagued by navigation ("where have I been?") and disorientation ("where am I?") problems, and it seems likely that these problems should increase with increasing hyperbase size.

In some situations, it may be possible to use the over-arching structure of the database to provide this contextual information. SuperBook (Egan et al., 1989), for example, used the table of contents (TOC) to structure the display of retrieved documents. Document collections that do not have a natural (or obvious) hierarchical organization, collections that have multiple organizations that depend on the user's task, and large collections with broad, shallow tables of contents (e.g., encyclopedias) are poorly fitted to the SuperBook browsing model.

The newspaper metaphor is proposed here as a vehicle for organizing hypertext neighborhoods. Newspapers have been designed to provide coherence to non-linear text, and their layout and organization properties may also be adapted to display hypertext nodes. Experiment 1 (Chapter 5) tests the effects of newspaper-style displays on subjects' behavior and performance in an information exploration task.

1.2 Query formulation

Traditionally, text-based information retrieval systems have relied on Boolean queries to allow users to express their information exploration intent. While they possess some expressive power, these queries were frequently difficult for users to construct and to troubleshoot (Borgman, 1986; Reisner, 1977). Cognitive difficulties associated with Boolean queries are both syntactic (novices and occasional users find it difficult to generate syntactically correct queries) and semantic (the meaning of Boolean operators is often counter-intuitive).

Also problematic is the stilted style of interaction typically supported by query interfaces. Users construct a query, examine the results of its evaluation, and then attempt to diagnose the appropriate modifications required to retrieve relevant documents. Some systems provide feedback on the contribution of the various query components to the results of the search; others merely present the hits list (often empty or extremely long) retrieved by the query.
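One reason the semantics can feel counter-intuitive is that, in set-based retrieval, AND narrows the result set while OR widens it, which is the reverse of the everyday reading of a request like "articles about dogs and cats." The following sketch is purely illustrative (it is not one of the systems discussed in this thesis) and makes the point with a toy inverted index:

```python
# Illustrative sketch only: set-based evaluation of Boolean operators over a
# tiny inverted index. Intersection (AND) narrows the result set; union (OR)
# widens it.
docs = {
    1: "trade tariffs on steel imports",
    2: "steel industry layoffs",
    3: "tariffs and trade policy debate",
}

index = {}
for doc_id, text in docs.items():
    for term in set(text.split()):
        index.setdefault(term, set()).add(doc_id)

def boolean_and(*terms):
    return set.intersection(*(index.get(t, set()) for t in terms))

def boolean_or(*terms):
    return set.union(*(index.get(t, set()) for t in terms))

print(boolean_and("trade", "tariffs"))  # {1, 3}: only documents containing both terms
print(boolean_or("steel", "tariffs"))   # {1, 2, 3}: documents containing either term
```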
Some of these problems may be caused by notation rather than being inherent to the process. Charoenkitkarn (1996) found that novice subjects with minimal training performed better using graphical representations of Boolean queries compared with textual equivalents. His results suggest that interface may have an effect on subjects' performance, and that it may be possible in some cases to overcome syntactic difficulties of query notation. The graphical notation used by Charoenkitkarn (Golovchinsky and Chignell, 1993) was designed to eliminate syntactic errors and to provide a mechanism for interactive, iterative query refinement. These properties appear to have benefited novice users, suggesting that it is possible to make information exploration interfaces more accessible to novices by encouraging an iterative and exploratory style of interaction.

Graphical formulation of queries is one possible mechanism for eliminating syntactic errors in query formulation. Its syntax is formulated in such a way that any user-specified expression is guaranteed to be syntactically correct. Another possibility is to dispense with syntax altogether. Examples of syntax-free interfaces include natural language queries and hypertext. A variety of systems have been designed that convert user-provided phrases into queries automatically. Hypertext links simplify matters even further, reducing users' actions to choices among small sets of alternatives. It is also possible to mediate links with full-text queries. The effectiveness of such algorithms is tested in Experiment 2 (Chapter 7).

1.3 Research motivation

The brief overview of information retrieval interfaces suggests that unresolved issues exist in the way users communicate their information needs and in the way the system displays the search results. These issues are addressed in this research.

In this work, I describe the evolution of information exploration interfaces that were built to test a number of ideas about query-mediated browsing. The design of the systems described below was motivated by the belief that fast, responsive, interactive interfaces that provide the user with direct control over the browsing process should lead to more effective, and more readily accepted, information retrieval interfaces. These ideas are based in part on human-computer interaction theory (e.g., Norman, 1986), and in part on the author's experiences in designing direct-manipulation interfaces (see also Golovchinsky, 1993).

This work details an evolution of interface design. A number of research prototypes have been created by this process, each representing a qualitative improvement in interface or data-handling capacity.
These interfaces are mentioned here in chronological order: Queries-R-Links (QRL) was the initial graphical Boolean query system (Golovchinsky, 1993). StPatTREC [1] extended the interface to multiple-megabyte collections and integrated a number of other tools (Charoenkitkarn et al., 1995). BrowsIR again extended the system's ability to handle large volumes of text (multiple gigabytes), and incorporated a graphical visualization of the query history (Charoenkitkarn et al., 1995). MultiSurf implemented graphical Boolean queries and a number of graphical overview displays in a WWW client (Hasan et al., 1995). Finally, VOIR (Visualization Of Information Retrieval) introduced a newspaper-style (multiple-article) display and dynamic hypertext links as the browsing mechanism.

VOIR arranges the articles retrieved by the user into a number of rows and columns, and uses certain newspaper layout conventions (e.g., more space for more important articles, more important articles near the top of the page, etc.) to convey to the user a sense of the organization of the retrieved documents. The newspaper-like layout provides a number of advantages over conventional article-at-a-time displays and one-article-per-window displays. It provides subtle but clear cues about how the articles relate to each other with respect to the topic of interest. It allows the user to get a sense of the "neighborhood" of articles on a given topic by showing much more of the retrieved article than its title or headline. It eliminates the clutter of multiple overlapping windows. It presents a display that is familiar to most people, and uses this familiarity as a foundation for a user interface metaphor. Finally, it lends itself naturally to hypertext interaction because it draws on hypertext features of traditional newspapers with which users are familiar.

Newspapers (e.g., The New York Times) have evolved techniques for organizing information that is not part of a coherent narrative, but rather consists of many loosely-coupled, internally-coherent articles. Newspapers use space and position to group related information, and provide hypertext-like connections — structural and semantic — between articles. Structural connections usually take the form of page references on which articles continue; semantic connections may refer to sidebars, to op-ed pieces, or to parts of the newspaper where related topics are addressed.
[1] The name of this system is derived from Smalltalk (ST), the language in which it was programmed; Pat, the underlying search engine; and TREC, the Text REtrieval Conference (Harman, 1993) for which it was constructed. That the system was christened in mid-February is no coincidence.
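As a rough, hypothetical illustration of the newspaper-style allocation just described (this is not VOIR's actual layout code), the sketch below places ranked articles into a page grid, giving the top-ranked article more space and earlier articles positions nearer the top of the page. The grid size and all names are assumptions for illustration only:

```python
# Hypothetical sketch: place ranked retrieval results on a newspaper-like
# page grid. Higher-ranked articles appear nearer the top, and the lead
# article gets a wider (two-cell) slot.

def layout_page(ranked_articles, n_rows=3, n_cols=3):
    """Return (article, row, col, width_in_cells) placements, best-first."""
    free_cells = [(r, c) for r in range(n_rows) for c in range(n_cols)]  # row-major: top rows first
    placements = []
    for rank, article in enumerate(ranked_articles):
        if not free_cells:
            break                      # page is full; the rest would go on later pages
        row, col = free_cells.pop(0)
        width = 1
        if rank == 0 and free_cells and free_cells[0] == (row, col + 1):
            free_cells.pop(0)          # lead article also consumes its right-hand neighbour
            width = 2
        placements.append((article, row, col, width))
    return placements

for placement in layout_page(["lead story", "second", "third", "fourth", "fifth"]):
    print(placement)
```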
VOIR implements dynamic links by constructing queries based on a user's browsing history. It uses inverse document frequency (Salton and McGill, 1983) to identify terms that will serve as anchors, and uses the context around the selected anchor to expand the query that links the source document with a ranked collection of destination documents. Successive selections of links represent a form of relevance feedback that is based on the passage rather than on the entire document.

VOIR provides orientation via graphical summaries of retrieval histories. Overview maps do not scale well as global navigation aids, which suggests that non-spatial [2] visualizations may be more effective for orienting users. Instead of encouraging users to regard the organization of retrieved articles in spatial terms, VOIR presents data in a historical format. Thus the question "Where have I been?" becomes "What have I seen?"; the retrieval histories (depicted as bar charts) provide a visual memory aid that can display a large amount of information about the current browsing session.

1.4 Overview

The rest of this thesis is organized as follows: Chapter 2 reviews the literature related to the domain of user interfaces for information exploration tasks. It concentrates on the relationships among information exploration interfaces, including hypertext, search interfaces, and electronic newspapers. Following the review, a number of research prototypes of information retrieval systems are described in Chapter 3. These software systems were designed to test a variety of user interface issues discussed in the chapter on research motivation. They represent the evolution of our understanding of information retrieval tasks and interface issues, and have been used to collect user performance data in a number of experiments.

Chapter 4 describes the implementation of the electronic newspaper metaphor as an information exploration paradigm. A number of issues related to the display of retrieval results are discussed, and illustrated with details of VOIR, a newspaper-based information exploration interface. This chapter concentrates on the display of results; the version of VOIR it describes uses graphical Boolean queries similar to those of the prototypes in Chapter 3. VOIR extends the notation by introducing a Conjunctive Query Notation (CQN) version in addition to the previously described Disjunctive Query Notation (DQN) [3].
[2] Spatial here is taken to mean "relating to physical or metaphoric space." Maps are spatial in this sense because they are metaphors for physical spaces. Color-coding and time lines, for example, are non-spatial visualizations.
Chapter 5 describes Experiment 1, which tested the effectiveness of the parallel article display and the relative performance of CQN vs. DQN queries for novice and expert users.

Experience with the VOIR prototype suggested that dynamic, query-mediated links could be added to the interface. A version of the interface was developed to explore the possibility of creating hypertext links among articles in a large database automatically. Chapter 6 discusses this input technique, complementing the description of the interface in Chapter 4. Chapter 7 then describes Experiment 2, which was designed to compare differences in performance and behavior between query- and link-based interfaces.

Finally, Chapter 8 concludes with some applications of the newspaper metaphor and dynamic linking techniques to a variety of media, including the World Wide Web. It also compares the electronic newspaper to the navigation paradigm, and discusses the relationship of each to problems of disorientation in hypertext.
[3] This notation was called Disjunctive Normal Form (DNF) notation in my Master's Thesis work.
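Before turning to the literature, a minimal sketch of the dynamic, query-mediated linking idea introduced in Section 1.3 (and developed in Chapter 6) may be useful: high-IDF terms from the passage around a selected anchor word are combined with the anchor into an implicit query, which is then used to rank candidate destination articles. The function names and the smoothed IDF formula are illustrative assumptions, not the weighting actually used by VOIR:

```python
import math
import re

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def idf(term, doc_token_sets):
    # Smoothed inverse document frequency over an in-memory collection
    # (an assumption for this sketch, not VOIR's exact formula).
    df = sum(1 for tokens in doc_token_sets if term in tokens)
    return math.log((len(doc_token_sets) + 1) / (df + 1))

def link_query(anchor, passage, doc_token_sets, n_context=5):
    """The anchor word plus the highest-IDF terms from the surrounding passage."""
    context = sorted((t for t in set(tokenize(passage)) if t != anchor),
                     key=lambda t: idf(t, doc_token_sets), reverse=True)
    return [anchor] + context[:n_context]

def follow_link(anchor, passage, documents, top_k=5):
    """Rank destination documents by the summed IDF of query terms they contain."""
    doc_token_sets = [set(tokenize(d)) for d in documents]
    query = link_query(anchor, passage, doc_token_sets)
    scored = sorted(
        ((sum(idf(t, doc_token_sets) for t in query if t in tokens), i)
         for i, tokens in enumerate(doc_token_sets)),
        reverse=True)
    return [doc_index for score, doc_index in scored[:top_k] if score > 0]
```

Each link traversal re-runs this computation against the reader's current passage, so the "link structure" is computed on demand rather than authored in advance.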
Chapter 2. Literature Review

The integration of information retrieval and hypertext … will become commonplace
F. Halasz, 1991

The first online catalogs containing bibliographic data came into existence in the late 1960s. They revolutionized access to information, enabling a skilled librarian to search millions of records in the amount of time it previously took to walk to the paper-based catalog. Today, an average personal computer can search through hundreds of megabytes of text in a few seconds.

As computing power and storage capacity have increased, so has the amount of information that the typical user must process. In the 1960s, computers had the capacity to handle (relatively) small amounts of data in real time, and thus bibliographic records, controlled vocabularies and elaborate query languages were used to make the computer's task easier. Through the 1970s and into the 1980s, bibliographic data gave way to full-text abstracts, which in many cases have been replaced by the complete texts of documents. Collection sizes grew from hundreds to thousands to hundreds of thousands of documents. The information onslaught of the Web in the 1990s has made millions of documents available to any person almost anywhere in the world.

Paralleling the expansion in computer capacity (and perhaps fueled by it), support for a variety of information exploration tasks has evolved. No longer are users restricted to searching through limited indices with arcane vocabularies; the entire text is now available online. Not pressed by CPU time and resource sharing constraints, users may browse through these electronic repositories in ways resembling a comfortable bookstore rather than a mail-order catalog.

The technological innovation driving hardware design has outstripped the development of software user interfaces that mediate access to information. Thus many commercial systems still rely on complicated syntax-laden query languages as channels of communication between user and machine; few systems make useful recommendations to the user about possible avenues of exploration; information is often presented to readers in small increments, depriving them of perspective.

In this chapter, I will review the fundamentals of information retrieval (IR) algorithms to provide the foundation for a discussion of information exploration interfaces. This foundation includes a discussion of query languages, document representations and evaluation measures. I will present a brief overview of hypertext interfaces, and will show how traditional IR interfaces and hypertext interfaces have been integrated in the past. I will then discuss some recent work on electronic newspapers, as content and as medium. Finally, I will briefly address issues of information visualization as they pertain to information exploration.
2.1 Information retrieval

Information Retrieval is concerned with facilitating users' access to large amounts of (predominantly textual) information. Typically, IR systems address the technical issues of representing the documents that comprise the collection, of representing user-initiated queries, and of performing the calculations required to retrieve the appropriate documents. User interfaces to these systems range from command-line queries (e.g., Salton and McGill, 1983) to more sophisticated browsing environments (e.g., Remde et al., 1987; Hearst, 1995). In this section, I will review the fundamentals of electronic document representations. I will also discuss some common query paradigms and query matching algorithms. The intent of this review is to capture those aspects of information retrieval technology that impact user interface design. In particular, this review will focus on highly interactive systems.

2.1.1 Queries

Boolean queries have been used traditionally to perform set-based retrieval in full-text databases. Set-based retrieval partitions the database into two sets: the set of all documents matching the query, and the rest of the database. Often Boolean query languages have been extended to include operators other than the standard AND, OR and NOT. Proximity operators (e.g., NEAR and FOLLOWED-BY) are common extensions (e.g., Fawcett, 1989). Fuzzy-set theory has also been applied to structure query-matching calculations (Bookstein, 1981). Some query languages allow users to specify the relative importance of search terms by assigning them weights (e.g., Salton, 1971a) and by respecting field-based (structural) components of documents (e.g., Callan et al., 1992).

Although Boolean queries are potentially useful for expressing information seeking intent, novices have found them difficult to use (Borgman, 1986). Studies have shown that novices use natural language (operator-free) queries more effectively than Boolean queries (e.g., Turtle, 1994; see also Belkin and Croft (1987) for a review). Structured queries may be constructed from natural-language phrases by discarding stop words, and then using a sum or weighted sum operator on the remaining (stemmed) words (e.g., Callan et al., 1992).
2.1.1.1 Relevance feedback

Relevance feedback refers to algorithms for automatic query expansion based on feedback provided by users about the desirability of specific documents (Salton and McGill, 1983). Improvements in query performance attributed to relevance feedback have been reported widely in the literature (e.g., Salton and Buckley, 1990; Harman, 1992). Relevance feedback algorithms typically use term frequency statistics to determine which words should be added to the query. Thus terms that occur frequently, but in only a few documents, are potentially useful for query expansion. See Lee (1995) for a summary of some common term weighting and document frequency formulae.

Efthimiadis (1993) compared several weighting algorithms based on their ability to predict terms that users found representative of the search topic. These algorithms select terms for query expansion based on measures of discriminability. That is, they provide formulae for estimating the usefulness of a term based on its frequency of occurrence in the database and in specific documents. He found some small but not reliable differences among several well-known algorithms. It is not clear from his study, however, whether the terms identified by the algorithms would result in better retrieval performance (recall and/or precision). His measure relied on correlation with human judgment, but was not corroborated by any objective measures. The study does not report the expertise or training of subjects; if the participants were not expert searchers familiar with the distributions of terms within the target database, it is not clear that their decisions would necessarily lead to better system performance.

Koenemann and Belkin (1996) found that subjects preferred to use relevance feedback over term generation because they found it (cognitively) easier to rely on the system's selections rather than introducing their own words. An advantage of this method is that the potential mismatch in terminology between the user and the text is reduced, since the system generates terms from the text being searched. In effect, this introduces a form of domain-independent controlled vocabulary. Some systems have presented relevance feedback techniques to the user as looking for articles similar to one already retrieved (e.g., "Show me more articles like this one"). Relevance feedback techniques have typically focused on the document as a source of additional terms because traditionally the document has been the unit of retrieval.
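A hedged sketch of the kind of term selection described above: terms are scored by a generic tf-idf style discriminability estimate (frequent in the documents the user marked relevant, rare in the collection). It is meant only to illustrate the idea and is not one of the specific formulae compared by Efthimiadis (1993) or summarized by Lee (1995):

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "that", "for"}  # tiny illustrative list

def tokenize(text):
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]

def expansion_terms(relevant_docs, collection, n_terms=5):
    """Suggest query-expansion terms from documents the user marked relevant."""
    term_freq = Counter(t for doc in relevant_docs for t in tokenize(doc))
    collection_tokens = [set(tokenize(doc)) for doc in collection]

    def discriminability(term):
        doc_freq = sum(1 for tokens in collection_tokens if term in tokens)
        # Frequent in the relevant documents, rare in the collection (smoothed).
        return term_freq[term] * math.log((len(collection) + 1) / (doc_freq + 1))

    return sorted(term_freq, key=discriminability, reverse=True)[:n_terms]
```

The selected terms would then be added to the user's query (for example, under a weighted-sum operator) before the next retrieval pass.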
2.1.2 Document representations

There is a variety of ways in which the contents of text documents may be represented in the computer. These range from vector spaces (Salton, 1971a) to Pat trees (Gonnet et al., 1991, cited in Salminen and Tompa, 1994) to belief inference networks (Turtle and Croft, 1990). These indexing methods are designed to store pre-computed partial search results that may be combined at run time to yield the documents matching a given query. Not every word in the database is indexed: stop words, terms that occur frequently (in every, or almost every, document), are excluded because they are not effective discriminants among documents. Often words are reduced to some canonical form by stemming, that is, by removing suffixes to conflate tenses and plurals. Some systems store the canonical and the original versions of indexed terms, allowing users greater flexibility when constructing queries. The Web-based Lycos search engine (Mauldin and Leavitt, 1995) is a good example of this approach.

2.1.2.1 Document vector model

The document vector model (e.g., Salton, 1983) represents each document by a vector of numbers, each element of which represents the presence of a unique term. The numbers may be 1 or 0 (indicating presence or absence, respectively), or they may vary over [0,1] to indicate the relative importance of each term in the document. Documents may be compared for similarity by taking the cosine product of their vectors; the larger the product, the more similar the two documents are considered to be. Queries in this model are also represented by vectors. Thus a ranked results list may be computed by ordering documents based on their cosine product with the query vector. A number of alternative ranking schemes have been proposed. See Noreault et al. (1981) for a summary and empirical comparison. An advantage of this approach is that it provides not only the set of documents that match a given query, but also an estimate of the degree of match. Documents may also be clustered by topic; cluster centroids are then matched to queries to retrieve the appropriate documents (Salton, 1971b).

2.1.2.2 Inference Networks

Differing somewhat from basic probabilistic retrieval models (e.g., Robertson, 1977), inference networks are designed to incorporate multiple sources of evidence when estimating the likelihood of relevance of a particular document to the user's information needs (Haines and Croft, 1993). Inference networks are directed, acyclic graphs consisting of document (d_i), content representation (r_j), concept (c_k), and query (q) nodes. Document and query nodes are always true; representation and concept nodes may be true or false.
Representation and concept nodes may represent terms and other special index units (e.g., dates, numbers) that comprise documents and queries. Arcs between nodes represent conditional probabilities of relevance (Callan et al., 1992).

An inference network used for information retrieval consists of a pre-compiled component and a run-time component. The parsing step extracts a set of content representation nodes from a collection of documents and stores them in a collection of index files. At run time, a user's query is parsed to create a set of concept nodes. These concept nodes are used by a recursive activation algorithm to determine the probabilities (belief values) that particular documents are relevant to the query. Documents whose belief values exceed a certain threshold are ranked by their belief values and returned to the user (Callan et al., 1992).

2.1.2.3 Proximity models

A variety of information retrieval architectures based on proximity models have been proposed. These approaches typically use document clustering (e.g., Jardine and van Rijsbergen, 1971; Salton, 1971b; Sparck Jones, 1971) and factor analyses (see Deerwester et al., 1990 for a brief review). Hearst and Pedersen (1996) used Scatter/Gather techniques for dynamically classifying retrieved documents into several clusters. They found that relevant documents tended to be clustered together, and that usually one cluster contained most of the relevant articles. This approach differs from previous clustering techniques (e.g., Salton, 1971b) in that these clusters are computed on the retrieved results rather than on the entire database. Whereas prior clustering strategies were intended to improve search efficiency (Willett, 1988), Hearst and Pedersen used Scatter/Gather as a comprehension aid to help users to understand and to organize retrieval results. They found that users who had access to the clustering results in addition to similarity searches performed better than subjects without access to the clustering algorithms (Hearst and Pedersen, 1996).

Latent Semantic Indexing (LSI) algorithms (Dumais et al., 1988; Deerwester et al., 1990) use Singular Value Decomposition (SVD) analyses to organize documents into disjoint clusters in a high-dimensional space. This technique maintains a term-document matrix that represents the relative importance (e.g., frequency of occurrence) of each term for each document. Thus both terms and documents are represented by vectors, making term-document and document-document comparisons equally meaningful. Queries are represented in this scheme as linear combinations of term vectors. SVD is used to partition the term-document space into non-correlated sub-spaces; relevant articles are retrieved from the sub-space whose centroid best matches a query (Deerwester et al., 1990).
Since terms and documents are co-clustered, this technique allows the system to retrieve documents similar to matching documents even though they may not contain most (or any!) of the original search terms. This property is the result of an attempt to index the concepts that underlie the text, rather than the surface words that represent them (Dumais et al., 1988).

2.1.3 Evaluation

Recall and precision are the traditional measures of information retrieval system performance. Recall is defined as the fraction of relevant articles retrieved, and precision is the fraction of retrieved articles that are relevant (Kent et al., 1955, cited in Meadow, 1992). Although initially developed to measure the performance of retrieval algorithms, these measures may also be used to describe the combined performance of user and IR system. Recall and precision may be used to compare different interfaces, but such comparisons have to account for any bias toward recall or precision introduced by the experimental task.

A number of methods for combining recall and precision into a single measure have been proposed (see Meadow, 1992, for a summary). While these measures may be useful for query routing tasks [1], they tend to be dominated by precision when recall and precision are used to assess users' searching performance in large databases. Interactive searches tend to exhibit a bias toward precision due to limits on the number of documents subjects are willing to retrieve. Although search engines may be capable of retrieving the top 1,000 articles in response to a given query, it is unlikely that users would examine these results exhaustively, if at all. Their goal may be to find a few representative relevant articles rather than to maximize recall (Meadow, 1992). Thus interactive use tends to exhibit relatively high precision and low recall scores due to limitations on how many documents users can process during a retrieval session. Calculating overall efficiency then becomes a problem: efficiency measures tend to correlate with either precision or with recall, rather than providing a joint estimate of performance.
[1] Query routing consists of classifying incoming articles into categories defined by comprehensive query expressions. Query routing usually takes place off-line to improve precision and recall at the expense of interactivity.
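To make the ranked-retrieval setting of the following paragraphs concrete, here is a toy version of the vector-model ranking described in Section 2.1.2.1, using term-frequency vectors and the cosine measure. It is an illustration only, not the implementation of any system discussed in this thesis:

```python
import math
from collections import Counter

def term_vector(text, vocabulary):
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_documents(query, documents):
    """Return (similarity, document index) pairs, best match first."""
    vocabulary = sorted({t for text in documents + [query] for t in text.lower().split()})
    query_vec = term_vector(query, vocabulary)
    scored = [(cosine(query_vec, term_vector(doc, vocabulary)), i)
              for i, doc in enumerate(documents)]
    return sorted(scored, reverse=True)
```

Cutting such a ranked list off after a fixed number of documents is exactly the operation that trades recall against precision, as discussed next.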
A tradeoff between recall and precision is often exhibited by systems that generate ranked document sets. Limiting the number of documents retrieved by a query increases the query's precision at the expense of recall. Since it is desirable to limit the number of articles retrieved by interactive systems, and since these decisions may depend on a range of factors (including user preference), the average precision measure has been proposed to allow meaningful comparisons of different systems (Salton and McGill, 1983).

Two additional measures of performance have been described by Harman (1995). In the first, precision is measured at every relevant document in a ranked hit list. For example, if the first-, third- and tenth-ranked articles are relevant, the corresponding precision measures would be 1.0, 0.67, and 0.3. The average of these precisions over all relevant documents for each query, averaged over some sample of queries, is defined as the non-interpolated average precision (Harman, 1995). Harman also described an average precision over the first 100 documents, averaged over a set of queries (Harman, 1995).
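The worked example above (relevant documents at ranks 1, 3 and 10) can be reproduced directly. In this sketch the average is taken over the relevant documents that appear in the ranked list, following the wording in the text; this is a simplifying assumption rather than a full statement of Harman's definition:

```python
def precisions_at_relevant(relevant_ranks):
    """Precision computed at the rank of each relevant document in a hit list."""
    ranks = sorted(relevant_ranks)
    return [(found + 1) / rank for found, rank in enumerate(ranks)]

def average_precision(relevant_ranks):
    values = precisions_at_relevant(relevant_ranks)
    return sum(values) / len(values)

# Relevant documents at ranks 1, 3 and 10 give precisions of 1.0, 0.67 and 0.3,
# and an average precision of about 0.66 for this query.
print(precisions_at_relevant([1, 3, 10]))   # [1.0, 0.666..., 0.3]
print(average_precision([1, 3, 10]))        # 0.655...
```

Averaging this per-query value over a sample of queries gives the figure reported as non-interpolated average precision.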
2.2 Hypertext

Nelson defines hypertext as non-sequential writing (Nelson, 1987), but to some extent all writing is non-sequential. More accurately, hypertexts are documents (or collections of documents) that provide explicit support for non-sequential reading. These documents consist of a data structure that organizes the content material, and a user interface that mediates users' access to the contents. Non-linearity is captured by positing connections — links — between elemental data units, termed nodes. A variety of link structures for organizing material have been proposed. These include one-to-one links that form a graph (e.g., Garret et al., 1986; Halasz et al., 1987), set-based links (Parunak, 1991), Petri nets (Stotts and Furuta, 1989), etc. Similarly, a great variety of interfaces have been developed. Hypertext interfaces provide the user with responsive, interactive access to the underlying data (Nielsen, 1990).

Remde et al. (1987) enumerate several deficiencies of traditional, linear text that motivated the creation of hypertext interfaces. Among other reasons, they suggest that it is "… too hard to find information in ordinary text," that it is "… too hard to acquire information in a sequence other than that determined by the author," and that it is "… extremely difficult to integrate and update large bodies of frequently changing information from many different sources." These are criticisms of the user interface used to present linear text. Yet these criticisms have been levied against hypertext interfaces also. Conklin (1987) documented the disorientation problems; static links were found to be too constraining to the reader (Halasz, 1991); and
hypertexts have been limited by problems of scale. While database technology is capable of handling very large collections of nodes and links, the human effort required to create and maintain the links becomes prohibitively expensive for large, multi-author hyperbases (Chignell et al., 1991; Robertson et al., 1994). The problem is two-fold: the link creator must decide which aspects of a particular node to link to which aspects of other nodes, and this decision has to be revisited with each addition of a document to the database. The user is faced with the complementary problem: which link of the possible links from the current node should be selected?

Much work has been done to address authoring issues. Approaches include using existing document structure (e.g., Raymond and Tompa, 1987; Furuta et al., 1989; Glushko, 1989; Salminen et al., 1995); segmentation of text (Chignell et al., 1991; Robertson et al., 1994); using IR techniques to suggest potential semantic relationships to link authors (Bernstein, 1990); the incorporation of queries into the hypertext structure (e.g., Belkin et al., 1993); lexical techniques for link generation (e.g., Lesk, 1986); neural net (Lelu and Francois, 1992), knowledge-based (e.g., Fischer et al., 1989) and agent-based AI techniques (e.g., Hayes and Pepper, 1989; Clitherow et al., 1989); and allowing the users to construct links directly (Brown, 1987) and via relevance feedback (Boy, 1991).

Other research has focused on deciding which links to show when. Approaches include link filtering based on context used in the Trellis system (Furuta and Stotts, 1991), and guided tours of the link structure (Marshall and Irish, 1989; Zellweger, 1989). Issue-Based Information Systems (IBISs) have concentrated on encapsulating complexity in hierarchical structures (e.g., Fischer et al., 1989; Streitz et al., 1989). In most cases, however, these solutions are not able to cope with the ever increasing database size. The guided tours approach usually relies on the author to structure the information, although Guinan and Smeaton (1992) described an IR-based approach to selecting nodes and constructing guided tours for a small (551-node) database. Their approach relied on the existence of typed links to determine path sequences to display. This technique, therefore, is limited to databases that contain sufficient numbers of typed links. IBISs require not only manual authoring, but also an extensive design effort (e.g., Thüring et al., 1991), making such systems inherently small-scale. Given the current state of technology, it is not clear that lexical parsing and agent-based AI techniques can handle multi-gigabyte databases fast enough to support interactive browsing of the collection.
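For concreteness, the basic structure assumed by most of the systems above (nodes holding content, with directed and optionally typed links between them) can be sketched as follows; the class and field names are assumptions for illustration only:

```python
from collections import defaultdict

class Hypertext:
    """A minimal node-and-link store: nodes hold content; links are directed,
    optionally typed edges between node identifiers (all names illustrative)."""

    def __init__(self):
        self.nodes = {}                 # node id -> text content
        self.links = defaultdict(list)  # source id -> list of (target id, link type)

    def add_node(self, node_id, text):
        self.nodes[node_id] = text

    def add_link(self, source, target, link_type="untyped"):
        self.links[source].append((target, link_type))

    def links_from(self, node_id, link_type=None):
        """Links leaving a node, optionally filtered by type (a crude stand-in
        for the context-based link filtering discussed above)."""
        return [(target, kind) for target, kind in self.links[node_id]
                if link_type is None or kind == link_type]
```

The scaling problem discussed above lies not in storing such a structure but in populating the `links` table by hand for a large, changing collection.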
2.3 Information exploration

Information exploration interfaces may be classified along several dimensions (Waterworth and Chignell, 1991; Marchionini, 1995). For example, Waterworth and Chignell's target orientation describes the goal state of the user, and the extent to which he has well-defined information-seeking criteria. The interaction method dimension encodes the manner in which the user requests information from the system. Values along this dimension range from querying to browsing. The structural responsibility dimension (navigational vs. mediated) specifies who (user or system) is responsible for carrying out the search. Thus early IR systems were often found in the mediated-query corner of the information exploration space.

Recent developments in the IR domain have started to shift the focus away from traditional query formulation interfaces. The SMART system (Salton, 1971a), for example, relies heavily on relevance feedback to meet a user's information request. In addition to specifying key terms (mediated query), users also tell the system which of the retrieved documents are relevant. This action may be interpreted as expressing a "Show me more like this" intent on the part of the user, and thus falls in the mediated-browsing region of Figure 2-1.

Traditional hypertext interfaces — Intermedia (Haan et al., 1992), Sepia (Streitz et al., 1989), KMS (Akscyn et al., 1988), NoteCards (Halasz et al., 1987), etc. — may be classified in the navigational-browsing corner of the space. As the technology matured, it became clear that real tasks required hypertext systems to support directed search (querying) in some manner (e.g., Christophides and Rizk, 1994; Amann et al., 1994; Lucarella et al., 1993; etc.). The recent emphasis on index servers on the World Wide Web (WWW) exemplifies the incorporation of powerful search mechanisms to facilitate navigation of large hypertexts (Mauldin and Leavitt, 1994). These trends in IR and hypertext systems are illustrated in Figure 2-1. The figure suggests that the evolution of database and interface technologies is pushing traditionally disparate application domains closer together in terms of the user's interaction with the system.

There is some evidence that interfaces that support combinations of structural and query-based navigation are better than single-strategy interfaces. Pirolli et al. (1996) constructed statistically-based hierarchical cluster partitions of a document collection, and allowed subjects to navigate the datasets either by browsing the hierarchical structure or with keyword searches. They found that although keyword searches out-performed Scatter/Gather clustering
techniques, "Scatter/Gather interfaces induced a more coherent view of the text collection…" They conclude that combinations of the two approaches are expected to improve performance.
Figure 2-1. Recent trends in information exploration interfaces.

SuperBook (Remde et al., 1987) was one of the first query-mediated browsing interfaces. It allowed users to browse through the text using the hierarchical structure of the table of contents (TOC). It also used queries to select the required passages. Queries could be formed by typing, or by clicking on the desired words to add them to the query. The TOC was then annotated with the search keywords to show the number of matches for each section. Users could drill down through the hierarchical structure to display the matching text. The TOC hierarchy was still visible in a separate window, and provided the user with information about the relationship of the retrieved document to the rest of the database. Thus users could browse through the text by selecting matching passages, while at the same time being aware of the relationships between the retrieved text and the overall database structure. SuperBook was shown to be effective in supporting certain information exploration tasks.

Unfortunately, this approach was limited to coherent collections of text, that is, to texts that were organized with a common table of contents. The interactive immersion into text and the sense of information neighborhood that SuperBook implemented was made possible by the organization provided by the table of contents. Without this structure, the SuperBook paradigm breaks down. Thus, although the SuperBook approach has been shown to work for a large number of documents, there exists a class of full-text databases for which this interface is not appropriate. Document collections that do not have a natural (or obvious) hierarchical organization, collections that have multiple organizations that vary depending on the user's task, or large
collections with broad, shallow tables of contents (e.g., encyclopedias) are poorly fitted to the SuperBook browsing model.

Some research has appeared recently on using IR techniques to facilitate browsing. BRAQUE (Belkin et al., 1993) allowed users to browse without specifying queries explicitly, supported interactive query reformulation, and provided an interface for browsing through an on-line thesaurus. These facilities were designed to improve users' access to the underlying text.

The QRL system (Golovchinsky, 1993; Golovchinsky and Chignell, 1993) is another example of an information retrieval interface that incorporated some of the directness of a hypertext interface. Although it relied on Boolean queries as the basic retrieval mechanism, the system incorporated several aspects typical of hypertext interfaces. A text passage was always visible on the screen. Queries were specified graphically by selecting terms and operators within that visible passage, and the results of the search were presented in an adjacent list. When a search hit was selected (something akin to following a link), the corresponding passage was displayed, and the query was redrawn on this new passage. Thus the query served both as a dynamic anchor specification in the source (provided by the user), and in the target (redrawn by the system). The query also served as the starting point for the next iteration: the user could modify it incrementally without having to re–enter the entire expression.
2.4 Electronic newspapers Whereas the preceding sections have dealt mainly with eliciting users' information seeking intent, the following section discusses an approach for displaying search results. The ubiquity of the Web has made it possible to deliver large amounts of text to readers electronically. It is now increasingly common to find electronic versions of traditional printed newspapers on the Web. This degree of availability has created a visible demand for custom, personalized electronic newspapers catering to the interests of a particular group or individual. A number of research projects (e.g., Chesnais, et al., 1995; Crayon, 1995; Ruggerio et al., 1994) have been addressing various issues related to the electronic delivery of news, including content classification, database and network issues, and user-modeling. A distinction must be made between systems that handle news content and those that display it using the newspaper metaphor. In this work, we are concerned with the presentation style rather than with content. Shepherd et al. (1995) posited a distinction between newspaper databases (with arbitrary interfaces) and the delivery of news. They characterized electronic
news as consisting of delivery, a newspaper metaphor and the task of finding out "what is happening."

Kamba et al. (1995) described an electronic newspaper prototype that collected data from an on-line source of newspaper articles and formatted the results into a multi-column display. Articles were downloaded from a newspaper, indexed on the server machine, and distributed to the client based on requests from a Java applet. The applet displayed the text and provided a number of interface features to collect relevance feedback from the user. This feedback was used to control the number, size and layout of articles. Users could interact with this system and cause it to retrieve different articles. The system provided an interactive browsing interface, although new editions of the electronic newspaper took too long to compile to support true interactivity.

This work suggests that it is possible to use the newspaper as an information exploration metaphor. Although Kamba et al. limited themselves to newspaper content, this style of information display may be extended to other types of information. The newspaper metaphor appears ideal for organizing the presentation of hypertext data structures. Its layout and organization features may be used to convey relationships between hypertext nodes. Its implicitly ephemeral nature lends itself well to reflecting users' evolving information needs.

Although this prototype resembled a newspaper page visually, the degree of interactivity that it provided was inadequate for an interactive information-exploration interface. The system took on the order of one minute to compose a page (Kamba et al., 1995), making it impractical for interactive browsing. Also, it was difficult to specify search intent explicitly. The system collected relevance feedback information implicitly, but required the user to fill out a separate form to provide arbitrary search keywords (Bharat, 1996). It was developed to demonstrate the use of inter-operative distributed applications. While it showed the promise of newspaper-like interfaces, its implementation was too slow to serve as an interactive large-scale information exploration interface. Perhaps its strength lies in the more traditional notion of a newspaper: the system seems to be designed to perform query routing tasks. Over time, it can establish comprehensive representations of users' interests, and can produce a "morning edition" tailored for each user. Users may then browse the edition and cause the system to update its notion of each user's interests.

This work adopts a slightly different approach, concentrating on the newspaper metaphor without the same regard for delivery and task. In fact, the system described in Chapters 4 and 6
is quite independent of news content. It is suited to browsing large amounts of archived information, complementing the approach of Shepherd et al. (1995). The newspaper metaphor embodies some fundamental principles of information visualization by using spatial proximity and size to convey relatedness information. More explicit graphical means of representing documents and retrieval results are discussed in the following section.
2.5 Visualization of search results

While much research has been performed on information retrieval algorithms, relatively little effort has been directed toward user interface design for IR tasks. In particular, graphical representation of information retrieval results has received little attention. Hypertext interfaces, on the other hand, have frequently contained graphical navigation aids such as overview maps, time line visualizations, and paths (see Nielsen, 1990, for an overview). Three–dimensional visualizations of hierarchies (Robertson et al., 1991) and graphs (Fairchild et al., 1988; Noik, 1993a) have also been proposed for overview diagrams. The intent of such displays was to provide sufficient global context to the reader to prevent disorientation and to allow backtracking. Information retrieval interfaces have not benefited as much from graphical visualization techniques. Some exceptions exist, however.

One area of visualization research has focused on constructing graphical representations of database contents. This work is an outgrowth or application of more traditional information visualization research. Chalmers and Chitson (1992), for example, constructed three–dimensional representations of document clusters. They used physically–based modeling techniques to position particles (documents); physical proximity was used to represent document similarity. The computational resources available to the authors (Sun SPARC 2) and the complexity of the layout algorithms made this visualization technique impractical for large collections of data. The authors report that it took 150 minutes to lay out 301 articles. Clearly several orders of magnitude of improvement in computing time are required to make this interface useful for interactive tasks.

Wise et al. (1995) described interfaces for displaying two– and three–dimensional similarity–based clusters of documents. They visualized a collection of transcripts from CNN news stories from one week. They did not report, however, which algorithms they used, how many documents they were displaying, or how responsive their visualizations were.
Hearst and Pedersen (1996) report that their Scatter/Gather clustering algorithms can group 5,000 short documents into five clusters in about a minute on a SPARC 20. The clustering algorithm they use is O(kn), where n is the number of documents and k is the number of clusters. Thus it could be reasonably efficient for grouping a few hundred articles retrieved by a single query. It would become progressively less efficient to organize the combined results of several successive queries. These times do not include visualization (which their interface does not do), suggesting that although algorithms and computing speed are improving, we still cannot support truly interactive visualizations of large results sets. Unfortunately, it is precisely these large volumes of data that would benefit most from appropriate graphical representations.

The visualizations mentioned above attempt to characterize similarities among all documents in a collection. Given the computational resources currently available, and the properties of the algorithms required to reduce the dimensionality of the data, these visualizations typically rely on pre–compiled similarity measures to achieve near real–time performance. They do not, however, provide the user with adequate feedback regarding the results of any particular search. Although these algorithms may be used to visualize subsets of documents retrieved by a particular query, they do not help the user to understand why specific documents matched the query.

TileBars (Hearst, 1995) is an attempt to visualize the results of each query the user submits. The TileBars prototype uses position and density coding to represent degrees of relevance of each document section to each query component. A bar is used to represent each document; the bar is sub–divided into thinner horizontal bars, each corresponding to a query term set. The horizontal bars are sub–divided into vertical blocks that represent document sections. The shading density of each block represents the degree of relevance of the corresponding document section. In response to a query, the system generates a ranked list of matching documents, and displays a TileBar for each document. TileBars provide the user with a quick summary of the retrieval results, broken down by document segments. They work well for the small results sets typical of interactive settings.

One disadvantage of the TileBars display is that the length of each bar is proportional to the length of the corresponding document. TileBars are usually displayed in a scrolling list (one document per line) with document titles in a separate column to the right of the bars. Thus results sets that contain large variations in document size may generate displays in which it is difficult to match the bars with the corresponding titles.
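The TileBars encoding can be made concrete with a small illustrative sketch (in Python, for concreteness; this is not Hearst's implementation). Each row of the text-mode "bar" below corresponds to one query term set, each column to a fixed-size document section, and denser characters stand in for darker shading; the fixed-size segmentation is a stand-in for the document segmentation used by the real system.

```python
# Text-mode TileBar sketch (illustrative only; not Hearst's 1995 implementation).
# Each row corresponds to one query term set, each column to one fixed-size
# document section, and denser characters stand in for darker shading.

def sections(words, size=50):
    """Split a token list into fixed-size sections (a stand-in for the real segmentation)."""
    return [words[i:i + size] for i in range(0, len(words), size)]

def cell_count(section, term_set):
    """Count occurrences of any term from the term set within one section."""
    return sum(1 for w in section if w.lower() in term_set)

def tilebar(text, term_sets, size=50):
    """Return one row of 'shading' per term set; denser glyphs mean more matches."""
    shades = " .:#"                                   # 0, 1, 2, 3+ matches
    secs = sections(text.split(), size)
    return ["".join(shades[min(cell_count(s, ts), 3)] for s in secs)
            for ts in term_sets]

if __name__ == "__main__":
    doc = ("hypertext links connect nodes " * 30
           + "information retrieval ranks documents by relevance " * 20)
    for row in tilebar(doc, [{"hypertext", "links"}, {"retrieval", "relevance"}]):
        print(row)
```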
2.6 Conclusions

This review has focused on the state of the art in interactive information exploration interfaces. Hypertext and information retrieval interfaces have been described, and some trends toward convergence of the two paradigms have been suggested. Some visualization techniques applicable to text–based information retrieval have been described. This review suggests that there is considerable scope for user interface innovation to support interactive information exploration activities. One possible set of approaches is discussed in the following chapters.
Chapter 3 Experimental Prototypes This chapter describes the evolution of query–mediated browsing interfaces that were used to test a variety of hypotheses about information exploration behavior. The ideas motivating the construction of these programs have evolved over the course of the thesis, as has the information handling capacity of the prototypes. The first version used a single search engine, and had a text–handling capacity of about 350K; the latest uses several search engines in parallel, and can handle multiple gigabytes of text. Interface aspects have evolved also: additional tools such as dictionaries, thesauri and visualization engines have been integrated, query notation has been expanded, and a variety of layouts have been tested. The sections that follow detail the series of prototypes, charting their evolution. These systems are the predecessors of the VOIR prototype that will be described in the following chapter.
3.1 QRL

The QRL system was developed by Golovchinsky (1993) as an experimental query-mediated browsing tool. Its interface consisted of a text window, a hits list, and a few widgets to control certain search parameters. The user could click on any term in the window and add it to the query. Lines could be dragged between selected terms to form ANDed pairs. The absence of a line implied an OR. In this manner, the system allowed the user to formulate a DQN Boolean query (see Figure 3-1). Every state transition in the query (adding or deleting a term or operator) caused a new hit list to be computed. The user could select a hit by clicking on the appropriate item in the list. This would cause the file to be repositioned to the corresponding passage. The query terms would be selected, so that the user could see why the passage matched the query and could also edit the query incrementally.

The QRL software system consisted of two main components, the interface and the search engine. The search engine class encapsulated PAT, a full-text Patricia tree-based search engine from OpenText (Fawcett, 1989). UNIX pipes were used to pass data to and from PAT, which was run as a child process. The interface, written in VisualWorks Smalltalk (ParcPlace, 1993), consisted of several different areas: the text display, the hits list, and the search controls. When the system was started, it loaded the specified text and displayed it at the beginning. The hits list was empty. When the user specified a new query by clicking on a term, by dragging a line between two previously-selected terms, or by removing one of the terms or lines, the graphical
notation was translated into a binary tree: the operators occupied the interior nodes and the terms were placed on the leaves. The tree was constructed bottom-up: first the sub-trees consisting of connected (ANDed) terms were built up, and then these were joined by OR operators. The final expression was passed to the search engine class that translated it into the search engine's query syntax. The current character proximity slider setting was passed to the search engine. AND nodes were mapped onto proximity operators ("NEAR"), ORs were mapped onto set union ("+"). PAT returned a collection of offsets that corresponded to query matches. One hundred or so characters around each offset were taken as the context of the hit; the keyword in context (KWIC) results were then displayed in the hits list. If the user adjusted the proximity threshold, the previous query would be reevaluated. Selecting one of the hits caused the text to be scrolled to the corresponding offset, and the query was redrawn on the text. This query process is summarized in Figure 3-2.

Figure 3-1. DQN graphical query "(photograph and pleasure and memory) or darkroom", drawn over a passage from Italo Calvino's The Adventure of a Photographer.

Although the system proved to be an interesting browsing environment, it suffered from some major limitations. The amount of text it could search was limited by available memory. The entire text was placed in a single text view, making it difficult to scroll. It assumed that the database was a linear text, and this made it difficult to handle different logical chunks. Finally, it did not rank the hits in terms of degree of relevance to the query, presenting them in file order instead. The QRL system did serve the purpose of establishing the feasibility of browsing through text databases by forming Boolean queries. This placed the system in the query-mediated browsing area of the information exploration space (Waterworth and Chignell, 1991). Observations
about system response times also suggested that in order to be useful, such systems had to be responsive.
Figure 3-2. (After Golovchinsky, 1993, Figure 2) The QRL browsing process. The user starts at the text view and creates the query markup. The system translates the markup to a textual string that is passed to the search engine. The search engine returns a set of hits, which are parsed and displayed to the user. Selecting one of the hits causes the corresponding file passage to be displayed. The user may then iterate the query.
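The translation from the graphical DQN markup to a search-engine query, as described above, can be illustrated with a short sketch. The code below is a hypothetical Python reconstruction, not the original Smalltalk implementation; the exact PAT operator spelling (shown here as near[...] and +) is approximate.

```python
# Hypothetical reconstruction of QRL's query translation (the original was
# written in VisualWorks Smalltalk); the PAT operator spelling is approximate.

def group_terms(terms, and_links):
    """Connected components over the AND links: each component is one ANDed group."""
    adjacency = {t: set() for t in terms}
    for a, b in and_links:
        adjacency[a].add(b)
        adjacency[b].add(a)
    groups, seen = [], set()
    for term in terms:
        if term in seen:
            continue
        stack, component = [term], []
        while stack:
            t = stack.pop()
            if t in seen:
                continue
            seen.add(t)
            component.append(t)
            stack.extend(adjacency[t] - seen)
        groups.append(component)
    return groups

def to_pat_query(terms, and_links, proximity=100):
    """DQN semantics: linked terms are ANDed (proximity), unlinked groups are ORed ("+")."""
    clauses = []
    for group in group_terms(terms, and_links):
        clause = f'"{group[0]}"'
        for term in group[1:]:
            clause = f'({clause} near[{proximity}] "{term}")'   # AND -> proximity (approximate)
        clauses.append(clause)
    return " + ".join(clauses)                                   # OR -> set union

if __name__ == "__main__":
    # The query of Figure 3-1: "(photograph and pleasure and memory) or darkroom"
    terms = ["photograph", "pleasure", "memory", "darkroom"]
    links = [("photograph", "pleasure"), ("pleasure", "memory")]
    print(to_pat_query(terms, links))
```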
3.2 StPatTREC

The graphical query notation developed for QRL was then incorporated into the next–generation prototype, called StPatTREC (Charoenkitkarn et al., 1995). This system used a much larger database (several megabytes vs. about 300KB), divided the source text into logical chunks with SGML tags, and incorporated additional sources of reference information, including an on-line dictionary and a WordNet (Beckwith et al., 1991) thesaurus.
3.2.1 SGML documents

When the system was started, it scanned the database file and constructed an index of the database. Each index entry corresponded to one document, as bracketed by a pair of SGML tags. The internal structure of each document was not parsed, and initially each index entry was a document proxy object. The starting and ending offsets of each document were stored to allow hit offsets to be matched to the appropriate documents. When an offset was returned by PAT, a binary search was used to select the document bracketing the hit. If the document containing a given offset was a proxy, it was converted to the full-fledged SGML structure on the fly. The text string bracketed by the starting and ending offsets was parsed and converted to a hierarchical SGML element tree. The algorithm assumed syntactically correct SGML, and did not attempt to enforce a particular DTD.

A separate file, identified at start time, was used to store display style information. The formats specified in that file were stored in the corresponding SGML elements as they were created by the parser. These style formats specified whether or not an element should be displayed, and what font style (serif or sans-serif; bold, normal or italic), font size (very large, large, medium, or small), and font color should be used when displaying the element's text.

3.2.2 Interface

Figure 3-3 shows a screen-shot of the StPatTREC interface. In addition to the controls present in QRL, StPatTREC provided windows for a second (ranked) hits list, for WordNet and for Webster's online dictionary. It used a UNIX pipe connection to control a helper C program (written by Nipon Charoenkitkarn) that provided the data for these three supplementary windows. Communication with the helper program was asynchronous: the program could be updating the display while the user was scrolling through the text, or browsing the hits list. The WordNet and dictionary displays were intended to help the user with the query (re)formulation process; they were not used by the StPatTREC program directly.

StPatTREC was used to run the first of a series of experiments (as part of our TREC-3 participation) to test how well users could retrieve information using such query-mediated browsing systems (Charoenkitkarn et al., 1995; Charoenkitkarn, 1996, Chapter 4). The system was tested with a 530MB (150,000-article) database of Wall Street Journal articles. The results suggested that the markup interface was competitive with other interfaces participating in the TREC-3 competition.
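The offset-to-document lookup described in Section 3.2.1 amounts to a binary search over a list of document spans sorted by starting offset. The following is a minimal Python sketch of that idea; the class and function names are hypothetical, not the original Smalltalk classes.

```python
import bisect

# Illustrative sketch of the offset-to-document lookup in StPatTREC: each index
# entry records the starting and ending offsets of one SGML-delimited document,
# and a hit offset returned by PAT is resolved to the document that brackets it.

class DocumentProxy:
    """Placeholder for a document whose SGML structure has not been parsed yet."""
    def __init__(self, start, end):
        self.start, self.end = start, end
        self.element_tree = None          # parsed lazily, on first access

def find_document(index, offset):
    """Binary-search the index (sorted by starting offset) for the bracketing document."""
    starts = [doc.start for doc in index]
    i = bisect.bisect_right(starts, offset) - 1
    if i >= 0 and index[i].start <= offset < index[i].end:
        return index[i]
    return None

if __name__ == "__main__":
    index = [DocumentProxy(0, 120), DocumentProxy(120, 400), DocumentProxy(400, 950)]
    hit = find_document(index, 523)
    print(hit.start, hit.end)             # the document spanning offsets 400..950
```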
Figure 3-3. StPatTREC interface
Figure 3-4. Architecture of StPatTREC. The shaded background reflects the structure of the original QRL program.

The StPatTREC architecture is depicted in Figure 3-4. A state change in the query initiated by the user triggered the query process. The query was converted to a binary tree representation, which was in turn converted to the PAT query syntax and to the syntax of the helper program.
PAT processed the query and returned the results set consisting of a collection of offsets into the file. The Smalltalk program extracted the context of each hit to build the KWIC hits list, which was then displayed to the user. At the same time, the helper program computed the WordNet sense and definition of the last-added keyword, and returned the appropriate text for display. This information was intended to aid users in suggesting additional terms for the query; no automatic query expansion was performed by the system. The helper program also ranked the hits returned by the search engine. The ranking was based on the frequency of search terms in each document, normalized by the document length.

3.3 BrowsIR

BrowsIR was derived from StPatTREC by increasing its database-handling capacity to multiple gigabytes, and by adding visualization tools to help the user assess the state of his information exploration session. The increase over StPatTREC in text-handling capacity was achieved by constructing the run-time document index incrementally. While StPatTREC loaded the entire index (the starting and ending offsets of each SGML document within the database file) into memory, BrowsIR deferred the loading until a particular document was needed. StPatTREC held a representation of each element in the database, even if that representation was merely a proxy containing only the starting and ending offsets. BrowsIR, on the other hand, held a sparse collection of documents that had been matched by some query and viewed by the user. When the search results pointed to an offset that was not included in the currently-loaded collection of documents, another PAT query was performed to retrieve the document boundaries bracketing the hit. The text within that span was then parsed to extract the SGML structure as in StPatTREC. This deferred loading eliminated the costly startup time and allowed more efficient memory management, as only a small portion of the database was loaded during a typical browsing session.

The collection size used in the TREC interactive track increased from about half a gigabyte in TREC-3 to two gigabytes of text in TREC-4. The collection consisted of several sub-collections, including the Wall Street Journal, the San Jose Mercury News, AP newswire, the Federal Register, and others. These sub-collections were indexed separately to reduce indexing time and to distribute the databases over several disk drives. The constraints imposed by the data suggested that BrowsIR be changed to search through these collections in parallel. To handle the complete 2GB database, BrowsIR pooled the results from ten separate PAT processes, all running in parallel (Figure 3-5). The results were merged in the order they were received from the search engines. All hits from a single search engine were grouped together.
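The helper program's ranking rule (the frequency of search terms in each document, normalized by document length) can be sketched in a few lines. This is an illustrative Python reconstruction, not the original C code; the document identifiers in the example are invented.

```python
# Illustrative reconstruction of the helper program's ranking rule (the original
# was a C program): score each document by the frequency of query terms,
# normalized by document length, and sort in decreasing order of score.

def rank_hits(documents, query_terms):
    """documents: {doc_id: text}; returns a list of (doc_id, score), best first."""
    terms = {t.lower() for t in query_terms}
    scored = []
    for doc_id, text in documents.items():
        words = text.lower().split()
        if not words:
            continue
        matches = sum(1 for w in words if w in terms)
        scored.append((doc_id, matches / len(words)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

if __name__ == "__main__":
    docs = {
        "doc-1": "stock markets fell as traders sold shares",
        "doc-2": "the central bank raised rates and markets fell sharply",
    }
    print(rank_hits(docs, ["markets", "fell"]))
```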
Figure 3-5. BrowsIR architecture. The user's query is passed to multiple instances of PAT (1); the search results are returned to the interface program (2), and sent to the helper program along with the query (3), which ranks the results and sends them to the interface (4a), and also transmits the term co-occurrence information from the query to the visualization program (4b).

The helper program for BrowsIR replaced the WordNet and Dictionary databases with a socket connection to a visualization tool. It also ranked the search results produced by PAT. The visualization was based on Graphite (Noik, 1993b; Noik, 1996), a flexible, dynamic visualization system. The user was shown a graph of the cumulative history of query term use, in which nodes were terms and arcs represented conjunctions. Line thickness was used to represent multiple co-occurrences of the same terms. Experiments with BrowsIR showed that novice users took advantage of the term graph to direct their exploration (Charoenkitkarn, 1996, Chapter 5).
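The cumulative term graph shown by Graphite can be approximated by a simple co-occurrence tally: nodes are query terms, an arc is added for each conjunction of two terms, and repeated co-occurrence of the same pair increases the arc weight (line thickness in the original display). The Python sketch below is illustrative; the actual Graphite data format is not reproduced.

```python
from collections import Counter
from itertools import combinations

# Sketch of the cumulative query-term graph displayed by Graphite: nodes are
# terms, arcs are conjunctions, and the arc weight (line thickness in the
# original display) counts how often the same pair of terms was ANDed together.

def cooccurrence_graph(query_history):
    """query_history: a list of queries, each a list of ANDed term groups."""
    nodes, arcs = set(), Counter()
    for query in query_history:
        for group in query:
            nodes.update(group)
            for a, b in combinations(sorted(set(group)), 2):
                arcs[(a, b)] += 1
    return nodes, arcs

if __name__ == "__main__":
    history = [
        [["oil", "spill"], ["tanker"]],       # first query: (oil AND spill) OR tanker
        [["oil", "spill", "cleanup"]],        # second query: oil AND spill AND cleanup
    ]
    nodes, arcs = cooccurrence_graph(history)
    for (a, b), weight in sorted(arcs.items()):
        print(f"{a} -- {b}  (thickness {weight})")
```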
Chapter 4 VOIR, The Electronic Newspaper Prototype
B. Watterson, Calvin and Hobbes
4.1 Introduction

Despite considerable progress in increasing the data-handling capacity of the prototypes described in the previous chapter, the graphical Boolean query interface had remained largely unchanged. Just as in the original QRL system, the user was able to display a passage of text, mark it up, and view other — related — passages. A small font used in a large window allowed subjects to see a large portion of the retrieved article, with enough space left for auxiliary panes such as the hit list1. This, however, proved to be an inefficient use of space: often, when subjects selected an article for viewing, the article turned out not to be relevant. Although it may have been difficult to recognize that from the title or the KWIC, the first couple of paragraphs may have been sufficient. Thus the idea emerged to collect samples or previews of retrieved articles and to present a digest of these results. Following the example of the Krakatoa Chronicle electronic newspaper prototype (Kamba et al., 1995), a multiple-view newspaper-like interface was constructed. It merged the interactive queries of QRL with the immersive text views of a newspaper–like layout.

The multiple article layout was motivated by the argument that showing users portions of several articles simultaneously is likely to increase their chances of recognizing a relevant article. Although in some cases the title alone is sufficient to judge the relevance of an article, the likelihood of making a proper judgment should increase if the user can scan the first couple of paragraphs of an article. Organizing information in a way analogous to a printed newspaper
1The Graphite visualization window of BrowsIR was usually opened on a second monitor.
should also enable users to apply their newspaper scanning skills when processing the display. This immersive interface should have a particular advantage in navigational browsing scenarios (Waterworth and Chignell, 1991) when users do not have a clearly understood information target, but are instead relying on the browsing process to refine their understanding of the domain.

This system, named VOIR, is described in this chapter and in Chapter 6. The following sections will describe the newspaper-style display and the application of graphical queries in a multiple article layout. Chapter 5 will then describe an experiment created to test some design decisions related to VOIR, and then Chapter 6 will discuss the evolution of query interfaces from graphical Boolean queries to dynamic links.
4.2 System architecture VOIR is a software platform for testing query-mediated browsing interfaces. The software consists of a modular interface component (written in VisualWorks Smalltalk), and a search engine back end. In contrast to QRL, StPatTREC and BrowsIR, of which it is a logical descendant, this software uses Inquery2 (Turtle and Croft, 1990) to retrieve articles. Inquery supports a much richer query syntax and can be integrated more tightly with the interface code3. The Smalltalk interface collects input from the user, constructs appropriate queries in Inquery syntax4, passes them to Inquery, collects the search results, fetches the articles from the database and formats the display. This loop is similar to the QRL cycle, the differences
2Copyright (c) 1990-1994 by the Applied Computing Systems Institute of Massachusetts, Inc. (ACSIOM). All rights reserved. The INQUERY SYSTEM was provided by the Center for Intelligent Information Retrieval (CIIR), University of Massachusetts Computer Science Department, Amherst, Massachusetts. For more information, contact ACSIOM at 413-545-6311.

3Inquery resides in a shared library and is called via a rich API, whereas PAT was executed in a child process and communication with it was mediated by standard IO.

4The AND operator is mapped onto Inquery's #UWn(w1 w2 ... wm) operator, which finds as many of the words w1 ... wm as possible within n words of each other (specified by a slider control setting) when computing the relevance of a target document. The OR operator is mapped onto the Inquery #SYN(w1 w2 ... wm) operator, which treats the words w1 ... wm as synonyms.
lying in the query syntax into which the query tree is rendered, and in the presentation of the search results.

In order to extract hit location information for redisplaying the query, QRL constructed PAT queries to return word offsets rather than documents. Inquery, on the other hand, provided both a ranked collection of documents (in order of decreasing relevance), and the occurrences of search terms within each document. A limitation of this approach as compared with PAT was that only one cluster of search terms could be identified per document because only one search hit was returned per document. This proved to be both an advantage and a disadvantage: while PAT required additional (and computationally expensive) operations to identify the documents associated with a particular collection of offsets, Inquery provided a one-to-one mapping between hits and documents.

This subtle difference in search engine capabilities was reflected in the interface. Prototypes based on PAT allowed subjects to skip through a given document, overlaying the query on successive matching passages, but subjects did not know ahead of time how many hits corresponded to a particular article. Using Inquery solved the number of articles issue at the expense of showing matching passages. A compromise solution involved highlighting all matching terms within each document in addition to displaying the query markup on a subset of these terms.

The code that supports graphical markup had to be refactored (e.g., Johnson and Opdyke, 1993) to support connections among multiple views. Figure 4-1 shows the object relationships that handled mark up in QRL, StPatTREC and BrowsIR.
Figure 4-1. QRL mark up architecture.

When a MarkupView needs to display itself, it must obtain a reference to an instance of TextView that will perform the text measurement functions required to compute the location
and size of the markup. In the QRL scheme, a MarkupView instance would ask its model (a Markup instance) for the Application that owns it, and would get the view from the Application. VOIR changed the one-to-one relationship between the Application and the TextView, thereby requiring changes in the architecture. Figure 4-2 shows the new configuration.
Figure 4-2. VOIR mark up architecture.

In VOIR, new Article objects — rather than the Application as before — serve as models for Markup instances. Each Article acts as a model for a TextView. When the user selects a hit for markup, the corresponding Article object is used to provide content to its TextView instance.
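The reorganization from Figure 4-1 to Figure 4-2 can be summarized in a short sketch. The class names follow the figures, but the code is an illustrative Python rendering of the dependency structure rather than the VisualWorks Smalltalk implementation.

```python
# Illustrative rendering of the VOIR mark up architecture of Figure 4-2: each
# Article owns a TextView and serves as the model for a Markup; a MarkupView
# reaches the TextView through its Markup's Article. Class names follow the
# figure, but this is a Python sketch, not the VisualWorks Smalltalk code.

class TextView:
    def __init__(self, text):
        self.text = text
    def measure(self, term):
        """Stand-in for the text measurement used to place markup: a character offset."""
        return self.text.find(term)

class Article:
    """Holds one retrieved article and acts as the model for its TextView."""
    def __init__(self, text):
        self.text_view = TextView(text)

class Markup:
    """The query markup; in VOIR its model is an Article, not the Application."""
    def __init__(self, article, terms):
        self.article = article
        self.terms = terms

class MarkupView:
    def __init__(self, markup):
        self.markup = markup
    def layout(self):
        view = self.markup.article.text_view
        return {t: view.measure(t) for t in self.markup.terms}

if __name__ == "__main__":
    article = Article("Information retrieval and hypertext interfaces")
    markup = Markup(article, ["retrieval", "hypertext"])
    print(MarkupView(markup).layout())
```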
4.3 Interface The interface is designed to display several articles simultaneously. In keeping with the newspaper metaphor, different amounts of screen real estate are allocated to articles based on their relative importance. Although it is possible to make this allocation dynamic (i.e., make it depend on the scores of the documents) as was done by Kamba et al. (1995), for the purposes of this prototype the partitioning of the screen was fixed. The first — most important — article received the largest amount of space, the second was displayed to its right in a slightly smaller area, etc. The parallel presentation of search results introduced the concept of a page consisting of a group of related articles. Search results corresponding to a given query were subdivided into
groups corresponding in size to the layout capacity of the screen. Thus instead of selecting just one article for viewing, selecting an article implicitly selected all articles on that page. For example, with seven articles displayed per screen, selecting the tenth article would cause articles eight through 14 to be displayed. This binning of search results into pages implied that a linear list of hits was no longer appropriate. Instead, a list of pages was provided; a single click on a given page number would display (adjacently) the list of titles of articles comprising the selected page. Selecting one of the article titles would cause the corresponding page to be displayed, and would position the marked up query onto the selected article. Alternatively, a double-click on the page number would load all associated articles and would place the query markup on the first article.

The interface is designed to encourage an iterative style of query formulation. Every addition of a term or operator causes the query to be recomputed, and updates the list of pages and articles. When the user selects an article from the list, the query is redrawn on that article. In addition, all occurrences of query terms are highlighted in yellow to facilitate recognition of relevant passages.

Following the newspaper metaphor, successive queries for which an article was selected were termed volumes (see list in bottom right corner of Figure 5-1 in the next chapter). Intermediate queries (for which no article was selected) were not recorded in the volume history list to prevent them from overwhelming the history. A list of volumes, identified by the queries that generated them, was available to the user for back-tracking. Selecting a prior volume caused the first page of that volume to be displayed, with the query markup redrawn over the first article.
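The binning of ranked results into pages is a simple calculation; the sketch below (illustrative Python) reproduces the example given above, with seven articles per page and the tenth article selected.

```python
# Sketch of the page-binning rule: ranked search results are grouped into pages
# whose size matches the screen layout, and selecting any article implicitly
# selects the whole page that contains it.

def page_of(rank, page_size):
    """Return (page_number, first_rank, last_rank) for a 1-based article rank."""
    page = (rank - 1) // page_size          # 0-based page index
    first = page * page_size + 1
    return page + 1, first, first + page_size - 1

if __name__ == "__main__":
    print(page_of(10, 7))   # -> (2, 8, 14): selecting article 10 displays articles 8-14
```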
4.4 Query notation

VOIR uses a graphical query notation similar to that developed for QRL (Golovchinsky and Chignell, 1993). Users compose graphical queries dynamically, selecting one term or operator at a time. Each selection causes the system to re-evaluate the query, thus providing feedback on the effectiveness of the incremental change. Terms are added either by clicking on a word in the displayed text, by clicking in the margin and typing a new word into the dialog box, or by copying a previously-selected term to the margin. Operators are added by dragging a line between two previously selected words with the left (of three) mouse button pressed. Unlike QRL, which only supported Disjunctive Query Notation (DQN) queries, VOIR also supports Conjunctive Query Notation (CQN). In DQN queries, lines connecting pairs of terms
represent AND Boolean operators; absence of lines is translated as an OR. This allows users to express queries in terms of alternatives of co-occurring terms, such as "Information AND retrieval OR hypertext." The AND operator has higher precedence. CQN swaps the meaning of the connecting line: the lines connect synonymous terms, and absence of lines represents conjunctions. CQN queries are constructed by specifying groups of alternatives; to match a query, a document must match at least one of the terms in each group. Thus CQN allows subjects to express queries such as "(Hypertext OR browsing) AND (Information) AND (retrieval)." Note that because the OR operator has higher precedence in this notation, the parentheses are logically not necessary. They are included in textual representations of these graphical queries to convey the structure of the query more clearly.

The CQN was introduced based on anecdotal evidence from Charoenkitkarn (1996) that professional searchers repeatedly requested CQN rather than DQN. Charoenkitkarn attributed this preference to online search training. Experiment 1 (Chapter 5) compares the relative effectiveness of the two notations.
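The two notations differ only in how the connecting lines are interpreted, and their rendering into Inquery operators (see footnote 4) can be sketched as follows. The #UWn and #SYN spellings follow the footnote; the top-level #OR and #AND wrappers and the window size of 20 are assumptions made for this illustration, not taken from the thesis.

```python
# Illustrative rendering of DQN and CQN queries into Inquery-style operators.
# Per footnote 4, AND maps onto #UWn(...) and OR onto #SYN(...); the top-level
# #OR / #AND wrappers and the window size of 20 are assumptions for this sketch.

def render_dqn(groups, window=20):
    """DQN: linked terms form ANDed (proximity) groups; unlinked groups are alternatives."""
    clauses = [f"#UW{window}({' '.join(g)})" if len(g) > 1 else g[0] for g in groups]
    return clauses[0] if len(clauses) == 1 else f"#OR({' '.join(clauses)})"

def render_cqn(groups):
    """CQN: linked terms are synonyms; a document must match at least one term per group."""
    clauses = [f"#SYN({' '.join(g)})" if len(g) > 1 else g[0] for g in groups]
    return clauses[0] if len(clauses) == 1 else f"#AND({' '.join(clauses)})"

if __name__ == "__main__":
    # DQN example from the text: "Information AND retrieval OR hypertext"
    print(render_dqn([["information", "retrieval"], ["hypertext"]]))
    # CQN example from the text: "(Hypertext OR browsing) AND (Information) AND (retrieval)"
    print(render_cqn([["hypertext", "browsing"], ["information"], ["retrieval"]]))
```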
4.5 Search engine requirements In addition to providing a specific user interface, this prototype acts as an IR interface workbench that may be used to experiment with different display and query configurations. Finally, it should be noted that the architecture of the software is designed in a modular fashion that makes it easy to incorporate different search engines without affecting the user interface. The minimum requirements that a search engine must meet to support VOIR in this graphical querying mode5 are ranked document retrieval, a mechanism for identifying the terms that matched a given query, and a query syntax that can be mapped loosely onto Boolean operators. Thus although PAT identifies the matching passage and can represent Boolean queries (by mapping AND onto "near" and OR onto "+"), it does not support ranked retrieval, and therefore cannot be used for VOIR. The next chapter describes an experiment designed to test the effectiveness of multiple article displays and also to assess the relative effectiveness of CQN vs. DQN query notation.
5See also Chapter 7 for the requirements to support dynamic links.
Chapter 5 Experiment 1 "But by arriving here what have you demonstrated that you could not have demonstrated before through the light of reason?" "I leave the light of reason to the old theologia. Today scientia wants proof through experentia." Umberto Eco, The Island of the Day Before
5.1 Introduction

The VOIR prototype described in the previous chapter was designed as a platform for exploring the information-exploration interface design space. The system displays search results using a multi-column display, and allows users to specify queries via graphical query notation. The notation (described in detail in Chapter 3) represents search terms as rectangular selections that bound terms as they appear in the text, and represents operators as lines connecting the rectangles. It supports either Disjunctive Query Notation (DQN) or Conjunctive Query Notation (CQN) Boolean queries. The multi-column display allows readers to view several retrieved articles simultaneously, rather than having to select them one at a time. A variety of display formats may be defined, varying the number and size of the columns.

Experiment 1 was designed to obtain insight into users' behavior when they interact with multi-column information retrieval interfaces. Charoenkitkarn (1996) compared subjects' retrieval performance and behavior on a similar (one-column) interface using DQN, and found that subjects with professional information retrieval experience ("experts") had repeatedly requested conjunctive query notation. This experiment was designed in part to explore further Charoenkitkarn's results, and also to answer some additional questions: Do subjects perform better when they can see several articles simultaneously? Does query notation have an effect on performance and on behavior? Does this effect vary with expertise? Answers to these questions could have implications for design of information retrieval interfaces in general, and for similar query-mediated browsing interfaces in particular.
5.2 Experimental Design

Experiment 1 was designed to assess the usefulness of parallel presentation of search results, and to detect the differences in performance between CQN and DQN queries. Three full-screen, multi-column interfaces were created for the three display conditions of the experiment. See Figures 5-1, 5-2 and 5-3 for screenshots of the three experimental interfaces. In addition to the article columns and search results selection controls, each display included an area labeled "Search Controls," in which the user could set the term proximity setting for the AND operator, and the maximum number of articles to be retrieved by each query.

The experiment was a 3x2x2 repeated measures design. It had three within–subjects interface conditions (two, four and seven articles displayed simultaneously) and two between–subjects conditions (CQN or DQN, and subject expertise — expert or novice). The nine topics were assigned at random to different interface conditions for each subject. Topics were counterbalanced across subjects and interface conditions. The order of presentation of interface conditions was counterbalanced. Table 5-1 summarizes the experimental design.

Table 5-1. Experimental design.

Notation (between)    dqn – 12                              cqn – 12
Expertise (between)   expert – 6        novice – 6          expert – 6        novice – 6
Interface (within)    low  med  high    low  med  high      low  med  high    low  med  high
                       6    6    6       6    6    6         6    6    6       6    6    6
Topics (random)        3    3    3       3    3    3         3    3    3       3    3    3
5.3 Research Hypotheses

Experiment 1 was designed to test hypotheses1 about the effects of interface, notation and expertise on performance. In this section, each of these broad categories is subdivided into specific sub-hypotheses.
1For each of the research hypotheses outlined in this chapter and in Chapter 7, the corresponding null hypotheses are that there should be no differences concerning the effects noted.
5.3.1 Interface

Exposing subjects to a larger number of articles simultaneously is expected to increase their understanding of the range of articles available in the database. This understanding should be reflected in subjects' choice of search terms, which is expected to increase the number of relevant articles that are retrieved. Thus, the first sub-hypothesis is:

Hypothesis 1a: Subjects will obtain higher retrieved recall and precision scores when they see more articles on the screen at the same time.

Showing the subject many articles simultaneously increases the chances that a relevant article will be seen. If subjects examined search results exhaustively, there would not be a difference in performance among the output conditions. If, however, subjects exhibit a certain threshold of time or patience in using the interface, giving up on a results set after some number of page-flipping interactions, then increasing the number of articles on a page is expected to increase the chances of spotting a relevant article. Even with seven articles displayed per page, a significant portion of the leading paragraphs of each article is visible at a glance (i.e., without requiring scrolling), increasing the chances that an article will attract the reader's attention. Therefore, we have:

Hypothesis 1b: Subjects will view more articles over the course of the experimental session when they see more articles per page, resulting in higher viewed recall.

Finally, being exposed to a larger number of articles can be expected to increase subjects' understanding of the collection, and should facilitate their decision–making process. Furthermore, the reduction in the number of interactions required to display search results should decrease the cognitive load imposed by the interface, allowing subjects to concentrate more on the task. These considerations suggest that:

Hypothesis 1c: Subjects will make more, and more accurate, judgments when exposed to more articles at a time, resulting in higher judged recall, judged precision and judgment efficiency.

5.3.2 Query notation

Expert subjects in Charoenkitkarn's experiment with StPatTREC (Charoenkitkarn, 1996) repeatedly expressed the desire to formulate queries using CQN. This preference was attributed
to their training in online searching. The query notation factor was designed to assess the effect of notation on performance, and therefore to test if advantages were associated with this preference. Experts are expected to produce better queries due to their training in query formulation. Performance of novice searchers is not expected to differ between conditions. Thus we can hypothesize that:

Hypothesis 2:
Expert subjects will have higher retrieved recall and precision when using Conjunctive Query Notation compared with Disjunctive Query Notation.
5.3.3 Expertise

In addition, experts' queries may be expected to produce distributions of relevant articles in the ranked results sets that are skewed toward the beginning of the list. Articles relevant to the search topic tend to be scattered throughout the set of retrieved documents. The more representative a query is of the search topic, the higher the retrieved recall of the query will be. Furthermore, the more representative a query is, the more likely relevant articles are to be ranked high by the search engine. Given that users are not expected to examine search results exhaustively, the number of highly–ranked relevant articles should be reflected in viewed recall and precision. Experts are expected to construct queries that capture the essence of the search topic better than novices. This expectation is reflected in hypothesis 3a.

Hypothesis 3a: Experts will have higher retrieved and viewed recall and precision than novices.

Due to their training in query formulation, experts are expected to compose more sophisticated queries than novices. Rather than jumping into the text and browsing, experts are more likely to spend time constructing queries that reflect closely their understanding of the search topic. This process is expected to take longer than novices' initial query formulation, leading to longer times to find the first relevant article. Thus,

Hypothesis 3b: Expert subjects will take longer to find the first relevant article.

Finally, experts tend to be precision–oriented; novices, recall–oriented. Charoenkitkarn (1996) found that experts made fewer, more accurate, judgments than novices, leading to the hypothesis that:

Hypothesis 3c: Expert subjects will have higher judged precision than novice subjects.
5.4 Subjects

Twenty-four volunteer subjects participated in the experiment. Subjects were paid $20 upon completion of the experiment. They were divided into two groups, 11 experts and 13 novices. Although initially the experiment was designed to be balanced with respect to expertise, the difficulty of recruiting a sufficient number of expert subjects resulted in a decision to run one extra novice instead of an expert subject. Following Charoenkitkarn (1996), subjects were classified as experts if they had extensive practical online searching experience as searchers or search intermediaries, or if they had received formal training in online searching (e.g., in a Faculty of Information Studies course). Novice subjects had neither the experience nor the training of expert subjects. All subjects had used computers in the past, and all were familiar with using a mouse. Although a three-button mouse was attached to the computer, only the left and the middle buttons were active.

Two subjects (4 and 5) performed considerably worse than the remaining 22 subjects. One of these subjects was a novice, and the other an expert. Both had used Disjunctive query notation. One had failed to identify a single relevant article in five of nine topics; the other had three such failures. They had the two lowest judged recall scores, and Subject 4 had the lowest judged precision score despite only a moderate number of judgments. Subject 4 also had the third–highest mean time to select the first relevant article. Subject 5 made the fewest judgments (relevant or not), averaging about two articles selected per topic, and less than one relevant article identified per topic. These factors suggested that these subjects were not representative of the population from which other subjects had been drawn. Therefore, these subjects were replaced with two other subjects, both of whom were experts. This replacement resulted in a balanced design with respect to expertise: there were now 12 novices and 12 experts. The analyses that follow reflect the data of the new subjects.

The final 24 subjects included nine females and 15 males. Four females were novices, and five females were experts; eight males were novices and seven were experts.
5.5 Methodology

5.5.1 Task

The experimental task was based on the methodology developed by Charoenkitkarn (1996) for TREC search topics (see Appendix A for the list of topics used in this experiment). Each topic described the criteria of relevance for articles the subject was required to find using the search
interface. Subjects were required to find as many relevant articles as possible within the allotted 15 minutes. They were encouraged to be as accurate as possible while retrieving as many articles as possible. Each subject performed nine topic searches.

5.5.2 Software

This section describes in detail the user interfaces of the experimental sessions. The interfaces are based on VOIR, but have been instrumented and modified to include widgets appropriate for experimental tasks. Figures 5-1, 5-2 and 5-3 accompany the following description.

When performing searches, users had control over the following aspects of the interface: they could select terms in the text, click in the margin to add terms that were not visible on the screen, add operators by drawing lines between terms, and remove terms and/or operators previously added to the query. They could select a page for viewing by double-clicking on the page list, and they could control which article received the graphical markup by clicking on the appropriate title. They could backtrack to a previous query by selecting it from the query history ("Volumes") list. A list of all currently–used query terms was provided, along with controls to delete selected terms from the query expression. They could also change the word proximity range for the AND operator, and could restrict the maximum number of articles retrieved per query. At any time they could switch between the topic text and the results control.

In addition to the retrieved articles, a single line of text at the bottom of the area indicated how many articles were retrieved by the latest query. A textual representation of the query appeared below the title area of the window. Each retrieved article was accompanied by a horizontal bar that indicated the relative importance of that article with respect to the current query, and a check box labeled "Relevant" which was used by subjects to identify articles relevant to the search topic. Sessions were terminated automatically after 15 minutes.

The experimental software was instrumented to record a variety of events initiated by the user. Every event was accompanied by a time-stamp, a subject number and information identifying the experimental conditions under which the event occurred. User events included adding or deleting a keyword or operator, selecting an article for viewing, backtracking in the history list, adjusting the proximity score, and judging an article to be relevant. System events that were logged included the parsed query as passed to the search engine, the number of articles retrieved, and the identity numbers (IDs) of the retrieved articles.
Figure 5-1. Two-article VOIR interface
Figure 5-2. Four-article VOIR interface
Figure 5-3. Seven-article VOIR interface
Figure 5-4. Relationship among sets of documents used to compute recall and precision measures.

Articles relevant to each topic were identified from the data provided through our participation in TREC. This information allowed the computation of recall and precision scores for each subject. Three types of scores were computed (see Figure 5-4). The judged recall and judged precision scores were based on the articles subjects identified as being relevant. The viewed recall and precision scores were based on the number of relevant articles retrieved by the subject and displayed on the screen, even if they were not identified as being relevant. Finally, the retrieved recall and precision scores were based on the set of articles retrieved by the subjects' queries even if the articles were not displayed on the screen (that is, if the user never flipped to the page containing the articles).

Judged metrics were used to characterize subjects' behavior: the scores depended largely on subjects' propensity to make judgments (recall) and their ability to make accurate judgments (precision). These measures were treated independently of the retrieved and viewed equivalents to separate system performance from user behavior as much as possible. Viewed metrics were designed to assess the degree to which subjects were exposed to relevant articles, and retrieved measures were designed to compare the relative effectiveness of query notations.

One logical difference between viewed and retrieved measures is that retrieved measures reflect subjects' ability to retrieve articles, whereas viewed measures assess a combination of subjects' page flipping propensity (what fraction of the retrieved articles they view) and their ability to make high–quality queries, that is, queries that contain large numbers of relevant articles near the top of ranked results lists. The higher the proportion of retrieved articles that are viewed by subjects, the closer the viewed measures are to the retrieved measures. The lower that proportion, the better viewed measures reflect relevant document distributions across the retrieved set. For example, if articles ranked 1, 2, 3, 5, and 10 (of 15) are relevant, the retrieved precision of the set is 0.33. If the user only sees the first two two–article pages (for a total of four articles), viewed precision will be 0.75; if four pages are seen, viewed precision falls to 0.5. Viewed recall will behave similarly, but will depend on the total number of articles relevant to the topic in question.
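The worked example above can be checked directly; the short sketch below (illustrative Python) computes viewed precision as a function of the number of two-article pages examined, using the relevant ranks from the example.

```python
# Worked example from the text: of 15 retrieved articles, those ranked 1, 2, 3,
# 5 and 10 are relevant; viewed precision depends on how many pages are examined.

RELEVANT_RANKS = {1, 2, 3, 5, 10}
RETRIEVED = 15

def viewed_precision(pages_seen, page_size=2):
    viewed = pages_seen * page_size
    relevant_viewed = sum(1 for r in RELEVANT_RANKS if r <= viewed)
    return relevant_viewed / viewed

if __name__ == "__main__":
    print(round(len(RELEVANT_RANKS) / RETRIEVED, 2))   # retrieved precision: 0.33
    print(viewed_precision(2))                         # two two-article pages: 0.75
    print(viewed_precision(4))                         # four pages: 0.5
```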
Distinctions between these measures of performance may help to characterize the effects of interface independently of query effectiveness. Although recall measures are expected to be correlated (relevant articles must be retrieved to be viewed, and must be viewed to be selected as relevant), judged precision, for example, should depend much more on subjects' behavior than on system or interface parameters.

5.5.3 Procedure

Subjects completed an experimental consent form. They were then asked to read the experimental instructions (Appendix B). The instructions described features of the user interface and of the experimental task. After the instructions were read, the experimenter demonstrated the use of the various aspects of the interface. Subjects gained familiarity with the interface by searching on a practice topic. When subjects reported that they were ready, that is, when they felt comfortable with the software, the experiment proper was started.

The experiment consisted of nine topics, presented in blocks of three, each block corresponding to a different interface condition. Prior to the start of each block, the subject was shown a print-out of the screen. For each topic, a printed copy of the online topic text was provided, and was available throughout the searching session. Subjects were requested to read it prior to starting the experiment and to make whatever notes they thought appropriate. The timer started after subjects finished reading the text. They were allowed to pause as long as necessary between topics. In some cases, the experiment was split into two halves: subjects received the instructions and searched on the first few topics in one session, and then came back (up to several days later) to complete the remainder of the topics. The second session was preceded by a brief overview and practice session to refresh the subject's memory.
5.6 Dataset variables
Several variables were extracted from experimental logs. They represented experimental conditions and raw and derived performance measures. This section defines the variables used in the subsequent analysis. The data were analyzed using SAS®, and in the following text, fixed-font capitalized terms (THUS) refer to the variables used in the SAS® datasets. See Appendix F for a summary of the variables.
5.6.1 Independent measures
For each subject, the subject number (SUBJECT), interface condition (INTERF: two, four or seven articles), query notation (NOTATION: cqn or dqn), and topic sequence numbers (SEQNO, 1 to 9) were recorded. SEQNO was used to derive the variable GROUP, which coded whether the topic appeared in the first, second or third group of three topics of the experiment for each subject. For each topic, the topic number (TOPIC) and the number of relevant articles (NUMREL) were recorded. Each subject was also classified with respect to expertise (EXPERTIS: novice or expert).

5.6.2 Dependent measures
The number of times a keyword or operator was added to a query (ADDS) or removed from a query (REMOVES) during the session was recorded. The number of times the user changed the word proximity setting for a topic was represented by the PROXFREQ variable. COUNT recorded the number of queries made for each topic. The number of relevant articles retrieved (RRC), the total number of articles retrieved (RC), the number of relevant articles viewed (RVC), the total number of articles viewed (VC), the number of relevant articles selected (RSC) and the total number of articles selected (SC) were recorded for each topic.

Several additional measures of performance were derived. Retrieved recall (RRECALL = RRC / NUMREL) and precision (RPREC = RRC / RC), viewed recall (VRECALL = RVC / NUMREL) and precision (VPREC = RVC / VC), and judged recall (JRECALL = RSC / NUMREL) and judged precision (JPREC = RSC / SC) were calculated from these raw performance scores. Subjects' ability to make correct relevance judgments (judgment efficiency) was estimated by taking the ratio of the number of relevant articles selected to the number of relevant articles viewed (JUDGE = RSC / RVC). Finally, averages of the raw scores were obtained by dividing them by the number of queries, yielding AVRRC = RRC / COUNT, AVRC = RC / COUNT, AVRVC = RVC / COUNT, etc. for each topic.
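As a minimal illustration of how these ratios relate, the following Python sketch restates the derived measures using the variable names defined above; the raw counts in the usage example are invented for illustration only and are not experimental data.

```python
# Hypothetical restatement of the derived performance measures for one topic.
# Names mirror the SAS dataset variables described in section 5.6.

def derived_measures(RRC, RC, RVC, VC, RSC, SC, NUMREL, COUNT):
    return {
        "RRECALL": RRC / NUMREL,                 # retrieved recall
        "RPREC":   RRC / RC,                     # retrieved precision
        "VRECALL": RVC / NUMREL,                 # viewed recall
        "VPREC":   RVC / VC,                     # viewed precision
        "JRECALL": RSC / NUMREL,                 # judged recall
        "JPREC":   RSC / SC if SC else None,     # judged precision (undefined if nothing selected)
        "JUDGE":   RSC / RVC if RVC else None,   # judgment efficiency
        "AVRRC":   RRC / COUNT,                  # relevant articles retrieved per query
        "AVRC":    RC / COUNT,                   # articles retrieved per query
        "AVRVC":   RVC / COUNT,                  # relevant articles viewed per query
    }

# Example (invented counts): 12 relevant articles exist for the topic; the subject
# issued 3 queries, retrieved 30 articles (8 relevant), viewed 14 (6 relevant),
# and selected 5 (4 relevant).
print(derived_measures(RRC=8, RC=30, RVC=6, VC=14, RSC=4, SC=5, NUMREL=12, COUNT=3))
```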
5.7 Results

5.7.1 Confirmatory analysis
Mixed-factor multivariate analyses of variance (MANOVAs) were performed on the raw scores and on the derived recall and precision scores. Each MANOVA included interface, notation and expertise factors. Interface (INTERF) was included as a within–subjects factor, notation (NOTATION) and expertise (EXPERTIS) were between–subjects factors, and topic (TOPIC) was nested within each subject–interface combination. Within-groups error terms were used to assess the significance of the independent variables (Keppel and Zedeck, 1989; see also Cody and Smith, 1991, Chapter 7 for the SAS® statements required to perform this analysis).

Judged measures were analyzed separately from retrieved and viewed measures because SAS® MANOVA algorithms delete observations that have missing values among the dependent variables, and some subjects had failed to find relevant articles for some of the topics, causing their precision and judgment measures to be undefined. Thus two analyses were performed: one included retrieved recall and precision and viewed recall and precision, and the other used judged recall, judged precision and judgment efficiency (JUDGE) as the dependent variables. For the purposes of hypothesis testing, a probability level of 0.05 will be taken as achieving significance; probabilities in the range (0.05, 0.10] will be considered borderline–significant.

5.7.1.1 Interface hypotheses
Hypothesis 1a postulated higher retrieved recall and precision for interface conditions with more articles displayed simultaneously. A borderline–significant difference was found for retrieved recall (F[12,42]=2.55, p