Proceedings of the Annual Meeting of the American Society for Information Science, ASIS 2000, Chicago, Illinois November 13-16, 2000.
The Interaction of Result Set Display Dimensionality and Cognitive Factors in Information Retrieval Systems

P. Bryan Heidorn, Hong Cui
University of Illinois, Champaign, Illinois

Abstract
A visual information retrieval environment provides visualization features that help users manage the large result sets that are typical of many information retrieval environments. In many of these visual interfaces the spatial layout of document surrogates is used to communicate information about document-to-query interrelationships. In these systems it is difficult to determine whether the spatial layout is responsible for differences in system effectiveness or whether other system features are involved. In this study we examine the effectiveness of a two-dimensional display format compared to a more standard (one-dimensional) sorted result list. The Visual Information Browsing Environment (VIBE) (Olsen et al., 1993; Korfhage, 1997) was modified to produce two systems that varied only in result display dimensionality. The effectiveness of a display is determined by the degree to which the visual representation is consistent with the cognitive abilities of the users of a system. For this reason we also investigated the interaction of the verbal and spatial abilities of users, as measured by cognitive factors batteries, with the dimensionality of the result display. Subjects took two Factor-Referenced Cognitive Tests (Ekstrom, French, and Harman, 1976): Visualization (Paper Folding Test, VZ-2) and Verbal Fluency (Controlled Associations, FA-1). Subjects were divided into two groups, one for each display type. They used the systems to search for full-text documents describing species of plants that matched other descriptions of the same species. Interface effectiveness was measured by number of tasks solved, recall effort and median task completion time. Automated system monitoring provided detailed information about the search behavior of individual subjects.
INTRODUCTION
A visual information retrieval environment provides visualization features that help users manage the large result sets that are typical of many information retrieval environments. The multimedia information retrieval environment discussed here is an extension of the Visual Information Browsing Environment (VIBE) (Olsen et al., 1993; Korfhage, 1997) and WebVIBE (Morse and Lewis, 1997). The "new" system can also produce a more conventional one-dimensional result list. Automatic system monitoring, control and similarity functions were also added. This paper provides a brief introduction to Visual Information Retrieval Interfaces (VIRI), followed by a description of VIBE and the extensions used here. Traditional measures of recall and precision are not sufficient for measuring the system performance of interest in this work, since the subjects search for one particular document to accomplish their task. For this reason recall effort is a more appropriate measure of retrieval effectiveness (see Korfhage, 1997, for an explanation of this measure). This paper introduces an experiment that investigates the effectiveness of two-dimensional "scattergram" displays in relation to more traditional one-dimensional result lists. We also investigate the interaction of dimensionality with individual cognitive differences in spatial and verbal ability. The volume of information that can be retrieved has become a barrier to the usefulness of the information. One approach to the problem is to provide new forms of information space visualization that are tuned to the skills and experience of the individual. The information environment proposed here allows searchers to represent hundreds or thousands of documents simultaneously on a computer screen and to manipulate and navigate through this information space. This approach gives the user dynamic control of the information space.
This is qualitatively different from the relatively static picture of information space provided by concept mapping techniques such as Kohonen self-organizing maps (SOM). While the VIRI approach has promise, little work has been done on evaluating the approach's effectiveness for information retrieval tasks. In addition, these visualization interfaces frequently provide many features simultaneously, making it difficult to determine which combination of features is responsible for the observed strengths and weaknesses of the systems.
The vision behind this work is to produce a retrieval visualization environment containing features that have been verified, alone and in concert, as positively contributing to retrieval task performance. The first test arena for the technology is the field of botanical informatics. There are hundreds of thousands of botanical descriptions spread over dozens of databases around the world. Many countries are funding projects to expand these collections. At no point in the near future will there be retrieval mechanisms and standards available to allow for precise federated searching of these collections. This state of database development is consistent with the status of many scientific and non-scientific disciplines. In this type of information landscape, facilities for information exploring and visualization become more important than standard retrieval precision. WebVIBE supports a 2-D display, but searcher performance with 2-D displays has been inconsistent. Koshman (1996) compared VIBE to the text-based retrieval system AskSam. Users in that study preferred the text system but did not demonstrate performance differences. Many aspects of those two systems were different, making it impossible to determine if the 2-D interface was responsible for the differences. Morse (Morse and Lewis, 1997; Morse et al., 1998) studied the usability of several display formats including the spring model used in VIBE and WebVIBE. Subjects in that study preferred graphical displays over text-based displays. The experiment reported here controls the system and task parameters to identify the causes of this inconsistency. While it is useful to investigate a variety of factors related to features of VIRI, such as selection of color (a feature of VIBE) or screen density, it is impractical to evaluate them all simultaneously. Here we are more interested in issues characteristic of VIRI rather than of general computer systems or other character-based information retrieval systems.
There is no attempt here to review the literature on human factors in database systems or information retrieval systems. There are a number of excellent recent reviews of the design of information systems (e.g. Allen, 1996; Marchionini, 1997; Marchionini and Komlodi, 1998; Shneiderman, et al., 1997).

VIRI
There are many information retrieval tasks where large amounts of information are retrieved and displayed as a result of user queries. For example, many web search engines may return hundreds of thousands of items. Commercial services such as Dialog, the Educational Research Information Center (ERIC), or Medline also frequently return large data sets. These are traditionally displayed in one-dimensional lists ordered by some criterion. These displays convey little information to the user about the relationship of the documents to one another or to the query. In contrast, visual displays such as Bead (Chalmers and Chitson, 1992), BIRD--Browsing Interface for Retrieval of Documents (Kim and Korfhage, 1994), InforCrystal (Spoerri, 1993), Space IR (Newby, 1992), LyberWorld (Hemmje et al., 1994), and VIBE--Visual Information Browsing Environment (Olsen et al., 1993) can represent the relationships in 2-D or 3-D layouts. The addition of more dimensions leads to the loss of bibliographic information such as the title, replacing it with spatial information. The effect of this interface characteristic has not been systematically investigated. Cugini and colleagues (1996) present three different visual displays of results from a modified version of the NIST statistical text retrieval system called PRISE. The visual displays include a document spiral that converts the ranked list into an iconic display with the one-dimensional line of rank order twisted into a spiral. This allows visualization of the entire 1-D ranking space. The second display type is a three-keyword axis display where each keyword or group of keywords may be assigned to one of three axes.
The final display type is a nearest neighbor sequence, which uses a nearest neighbor algorithm to project the high-dimensional keyword space into a three-dimensional space while keeping documents that were near one another in keyword space near one another in the lower-dimensional space. Unfortunately, the displays have not yet been tested for usability. Research has examined the relative merits of different VIRI types (Morse & Lewis, 1997; Morse et al., 1998). On a two-term Boolean task, performance with text-based lists and icon lists was highest, while performance with graph and spring-display (VIBE) interfaces was lowest (Morse et al., 1998). Users preferred the spring display over text in spite of the relatively poor performance. The authors point out that this may be due to the Hawthorne effect. There was a cross-condition learning effect for graph and spring displays that had not leveled off by the end of the five trials used in the study. This indicates that familiarity with text systems may have contributed to the findings. Most people are not familiar with the use of visual metaphors in human-computer interfaces, such as those used in VIRI. Since it is not possible to train everyone on the use of information retrieval (IR) systems, it is essential to
design interfaces that can adapt to the needs and strengths of particular users. It is necessary to determine which aspects of the interface are responsible for performance differences. Sugar (1995) has noted that in order to design effective systems, we must "identify human characteristics that lead to performance differences and the interface factors with which they interact." In this study we focus on measures of users' spatial and verbal ability in order to help ascertain which cognitive abilities are most taxed by the interface, as discussed below. There is reason to believe that people with greater spatial abilities are able to make better use of VIRI.

The 1-D vs. 2-D Tradeoff
The information retrieval task can be characterized as one of information overload. Large retrieval sets are common. Much of the retrieved information is irrelevant to the information searcher's individual need. The users must, therefore, scan the retrieval set for cues that will help them recognize relevant items. The cues traditionally made available in retrieval systems include document order, title and author of the document. Retrieval sets are usually presented in a list ordered by some measure of similarity between the document and the user's query. These features -- order, author and title -- constitute the cues that the user has to categorize documents as potentially relevant or irrelevant to the information task. In this study we investigate the impact of varying the cues which subjects have in visualizing retrieval sets. The traditional text-based cues of author and title are replaced with spatial cues for keyword similarity. The tradeoff is between information that may be conveyed by an author's name and words in a title but which consumes substantial screen space, versus an iconic representation that presents relative keyword similarities at the expense of text. Textual cues require a substantial amount of screen space.
An author and title, even when truncated as in standard displays, may consume 50 characters of display space. This means that few (about 10) members of a retrieval set may be displayed at any one time. There is evidence that users do not look beyond the first list of items. A recent study showed that 58% of users do not look beyond the first 10 titles and 77% do not look beyond the first 20 (Jansen et al., 1998). Conversely, an iconic display allows each item to be displayed in approximately one character space. In addition, many icons may be superimposed without loss of information intrinsic to the icon. That is, if many documents have approximately the same relationship to the query components, they will have the same screen location and may be represented as a pile of documents. Therefore, hundreds of documents may be represented on the screen simultaneously. Which of these strategies is better depends on the relative efficiency with which individuals can exploit the different cue types: text versus space. We hypothesize that individuals with strong verbal abilities will be better able to exploit title/author information, while individuals with strong spatial abilities will be better able to exploit iconic position information. In this study we measure, for both versions, the amount of time searchers spend browsing the document title lists and, in VIBE, the amount of time spent in POI manipulation. If the WebVIBE browsing display is helpful, then the time spent in POI manipulation should be compensated for by savings in time searching title lists, because WebVIBE title lists will be shorter than in the 1-D display condition.

WebVIBE
This section describes the operation of the Visual Information Browsing Environment (VIBE) (Olsen et al., 1993; Korfhage, 1997) and the system extensions used in this study. VIBE represents documents on a two-dimensional display rather than in the one-dimensional display that is typical of commercial retrieval engines.
The placement of the documents is determined by their relationship to reference points or points of interest (POIs). One Web-based version (WebVIBE) is used in this study. An example screen from WebVIBE is provided in Figure 1: 2-D WebVIBE. In this example the user has selected four POIs: "birch or elm", "ovate or oval or oblong-obovate", "swamp or marsh or bog or wetland" and "Illinois or Ill or Indiana or Ind or Mich or Michigan". These are represented as the labeled magnet icons. Documents are represented as rectangular icons, or stacks of icons where more than one document is mapped to the same location.
Figure 1: 2-D WebVIBE

All documents are retrieved as a Boolean OR of the POI terms. Any document which contains one or more of the POI terms is retrieved and placed on the display. A threshold may be set to filter additional documents from the display. As in previous versions of VIBE, the placement of document icons in WebVIBE is determined by the ratio of similarities between the documents and the POIs. If a document is more like one POI than another, it is placed on the screen closer to the POI it is more similar to. The location of document icons on the display is defined for n reference points and similarity measures s_1, s_2, ..., s_n, where s_i is the similarity between a given document and reference point i. Where p_i is the position vector for reference point i, the location of the document icon is defined by
p_d = (Σ_i s_i p_i) / (Σ_i s_i).
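This placement rule computes the similarity-weighted centroid of the POI positions, p_d = (Σ_i s_i p_i) / (Σ_i s_i). A minimal sketch in Python (illustrative only; the function and variable names are assumptions, not the WebVIBE source):

```python
def place_document(similarities, poi_positions):
    """Place a document icon at the similarity-weighted centroid of the POIs.

    similarities:  list of non-negative similarity scores s_i, one per POI
    poi_positions: list of (x, y) screen coordinates p_i, one per POI
    """
    total = sum(similarities)
    if total == 0:
        return None  # matches no POI: the document is not displayed
    x = sum(s * p[0] for s, p in zip(similarities, poi_positions)) / total
    y = sum(s * p[1] for s, p in zip(similarities, poi_positions)) / total
    return (x, y)

# A document equally similar to two POIs is placed midway between them:
print(place_document([1.0, 1.0], [(0.0, 0.0), (100.0, 0.0)]))  # (50.0, 0.0)
```

Note that this also explains the icon stacks described above: documents with the same similarity ratios map to the same screen point.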
The similarity measure may be any measure appropriate to the document representation, such as document term frequency. The similarity measure need not be defined in terms of simple word frequency only. Where the POIs and documents are text, the cosine or another vector measure of similarity may define s. In this version of WebVIBE the similarity of an individual term to an individual document is defined as the inverse document frequency weight, tf.idf:

s_ik = f_ik [log2(N) - log2(d_k) + 1],
where k denotes the term and i the document. N is the number of documents in the collection, d_k is the number of documents containing term k, and f_ik is the absolute frequency of term k in document i. When a POI represents a Boolean conjunction (AND), the mean of the individual tf.idf scores is used; for a disjunction (OR) the scores are summed. So if the tf.idf of A is 20 and of B is 10, the tf.idf of (A AND B) = 15, while (A OR B) = 30. In the current study two dimensions are used, but versions of VIBE have been produced that work in three dimensions (Benford et al., 1995). Later work may explore the three-dimensional representation. The searcher may expand any of the document icons of interest. This is accomplished by drawing a box around the document(s). Document titles for all selected documents are displayed in a list. In the two-dimensional version of the program the document list window takes up only a small portion of the display, but in the one-dimensional version of the program this is the only result display that the subjects see, and it covers the entire display area of the screen, as can be seen in Figure 2: 1-D WebVIBE. The user may click on any title to open a window containing the full document. The document display may be found in Figure 2. This version of WebVIBE records the amount of time that the user spends viewing the 1-D or 2-D document browsing area (depending on the subject's experimental condition). The system also records the time when individual documents are opened and the time spent viewing them. As can be seen at the bottom of Figure 2, buttons allow the user to mark a document as the desired document or not (as described in the task description below).
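The tf.idf weighting and the AND/OR combination rules described above can be sketched as follows (an illustrative sketch, not the WebVIBE implementation; function names are assumptions):

```python
import math

def tfidf(f_ik, N, d_k):
    """tf.idf weight s_ik = f_ik * [log2(N) - log2(d_k) + 1] described above.

    f_ik: absolute frequency of term k in document i
    N:    number of documents in the collection
    d_k:  number of documents containing term k
    """
    return f_ik * (math.log2(N) - math.log2(d_k) + 1)

def combine_poi(scores, op):
    """Combine per-term tf.idf scores for a Boolean POI:
    AND takes the mean of the scores; OR sums them."""
    return sum(scores) / len(scores) if op == "and" else sum(scores)

# Reproducing the paper's example with tf.idf(A) = 20 and tf.idf(B) = 10:
print(combine_poi([20, 10], "and"))  # 15.0
print(combine_poi([20, 10], "or"))   # 30
```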
Figure 2: 1-D WebVIBE

In 2-D WebVIBE searchers may add, move, remove or deactivate POIs on the display. A searcher adds a POI by adding a term to the query list. The searcher then moves the mouse pointer to the desired location on the screen and clicks the right mouse button to place the POI. The system automatically selects documents from the database that
match the POI. These are located on the screen according to the formula described above. The locations of document icons already on the screen are adjusted appropriately. A searcher may select a POI with the mouse and move it to another location on the screen. When a POI is removed or deactivated using the left mouse button, documents retrieved for only that POI are removed from the screen and the positions of the remaining documents are adjusted. WebVIBE (Morse & Lewis, 1997) is the basis for the current work. In that research the investigators demonstrated that defeaturing the VIBE display to produce WebVIBE reduced some of the performance and preference problems that had been reported with VIBE (Koshman, 1996). Defeaturing removes all but the essential features from the interface. The defeaturing result highlights the question of which features of VIRI, and of VIBE in particular, lead to performance and preference differences. The University of Pittsburgh development team is studying the individual features of VIBE. The work proposed here is coordinated with and complements theirs. A multidimensional image browser derived from VIBE (Cinque et al., 1998) was developed at the Università di Roma. POIs in this system, ImageVIBE, may be user sketches. Similarity between a POI and an image icon is determined by a set of geometric scoring functions such as minimum enclosing rectangle, signature, orientation of axis and three other properties. User evaluation has not been performed on ImageVIBE. Yet another version of VIBE, VR-VIBE (Benford et al., 1995), developed at the University of Nottingham, is a 3-D version of VIBE. The main difference is the addition of a third dimension. LyberWorld, another 3-D reference point system, was developed at the Darmstadt University of Technology (Hemmje et al., 1994). There was no user evaluation or evaluation of the utility of individual features.
The current 1-D versus 2-D study could be extended to 3-D in a straightforward manner, but this is left to future research.

Cognitive Factors
There are a number of studies which indicate that subjects have differing abilities to use information in text displays, graphical displays and displays with multiple dimensionality. Chen, Czerwinski and Macredie provide an overview of cognitive factors in a special issue of JASIS (2000). The verbal scores of students in library and information science are higher than those of the general student population, but their spatial abilities are lower (Allen & Allen, 1993). Allen (1994) showed, by manipulating a text-retrieval system feature, that users with higher perceptual speed were able to identify a subject heading, learn from it, minimize response time and retrieve information faster. Leitheiser and Munro (1995) demonstrated that a GUI provided performance benefits over a command line interface for both users with high spatial ability and users with low spatial ability. However, users with high spatial ability benefited more from the use of a GUI than users with low spatial ability. In other studies spatial ability appears to be less important than experience with systems using spatial displays (Swan et al., 1998). Steinberg and colleagues (1995) have shown that users were able to adapt more quickly to the task of monitoring missiles and airplanes when using a 3-D interface rather than a 2-D visual display. To extend these findings to the design of multimedia IR systems, it is necessary to understand the relationship between spatial ability and system features such as dimensionality. Using VIBE, Koshman (1996) investigated the interaction between level of user expertise (novice, online expert, and visual expert) and interface type (graphical and text).
Koshman found that subjects classified as visual experts searched faster using the graphical interface and, more importantly, that these visual experts exhibited difficulties when using the traditional text retrieval system. People who were familiar with text-based systems had difficulty with graphical interfaces. Characteristics of the subjects interacted with system features. Koshman's study compared a Microsoft Windows version of VIBE with the Windows-based commercial system AskSam. Query formation in the two systems is similar in that both have users select search terms from a list. The systems' primary difference is in the output display format. As was the case in the project reported here, VIBE was 2-D while AskSam was 1-D. The findings from this study were consistent with those found by Koshman, except that our study controlled system differences more directly and controlled the subject population. We used subjects from multiple disciplines, as indicated by the Allen and Allen study (1993). Our study also differs in that we evaluate spatial and verbal ability with two Factor-Referenced Cognitive Tests (Ekstrom, French, and Harman, 1976) rather than defining spatial ability by experience levels, as was the case in the Koshman study. In our study, query formation (term selection) and the subjects' relevance judgment mechanism were identical between the 1-D and 2-D result
display conditions, since the same system was used for both. Automatic system monitoring was provided in WebVIBE so we can examine the amount of time spent in different activities such as term selection, POI movement, and document reading. A more detailed evaluation of the cognitive processes involved in the task can help to identify potentially relevant subject and system characteristics.

Experimental Design

Research Questions
1. In a text document browsing environment, does the dimensionality of a computer display (1-D ordered lists versus ordered 2-D plots) interact with retrieval performance?
2. In a text document browsing environment, does the dimensionality of a computer display (1-D ordered lists versus ordered 2-D plots) interact with retrieval performance differently for people with different spatial abilities?
3. For each of the system features introduced in each retrieval task, where do searchers spend their time when presented with interface features of increasing complexity?
Experimental Task
In this experiment subjects were asked to perform an object identification task. This is a form of known-item search. There were two groups. Members of one group used the 1-D WebVIBE display while members of the other group used the 2-D display. The subjects were randomly assigned to groups with the restrictions that the groups were of equal size and that equal numbers of library graduate students and non-library graduate students were in each group. The second restriction was intended to maximize the likelihood of a wide distribution of spatial and verbal abilities within the groups. Before the retrieval task began, cognitive factors research instruments were administered to the subjects. The two Factor-Referenced Cognitive Tests used in this study were Visualization (Paper Folding Test, VZ-2) and Verbal Fluency (Controlled Associations, FA-1) (Ekstrom, French, and Harman, 1976). Each subject was then given a 15-minute training session on the interface that they were using. During the training, participants found the known item (Latin name) for two species descriptions. Each subject was given field guide descriptions of up to nine objects, one at a time. The exact number of trials was limited by the number that an individual could complete in an hour. The object descriptions were of species of plants that occur in the document collection. These descriptions were drawn from field guides and included images and text. All scientific names were removed from these descriptions. The subjects then used the version of WebVIBE appropriate for their group to attempt to determine the scientific name of the species described. The interfaces were identical except that one group's result set was displayed in a conventional 1-D list format (as seen in Figure 2) while the other group's results were displayed in the two-dimensional WebVIBE browsing environment (as seen in Figure 1).
If the subjects saw a document that they believed matched the stimulus, they pressed a "This is the plant" button at the bottom of the full-text document display window to indicate that they had found it. If they were correct, they moved on to the next trial and the next object description. If they incorrectly identified an item as matching the stimulus when it did not, they were informed of this and allowed to continue the search. The subject could indicate that a full-text document did not match the stimulus by pressing a button at the bottom of the full-text document display window that read "Keep on looking." In some cases the subjects realized that there was no penalty for false positives, pressing the "This is the plant" button when it was not the plant; these subjects abandoned the use of the "Keep on looking" button. If a subject incorrectly rejected a full-text document that actually matched the stimulus, the system recorded the error but did not inform the subject of the mistake. Subjects could quit a trial at any point by pressing a "Give up on this plant" button. The system recorded the time spent using different parts of the interface as well as correct and incorrect responses. As discussed below, there were performance differences between the groups. In an analysis for a later paper we will explain some of these differences by evaluating where people spend their time in the interface. For example, subjects who use a standard result display list may spend more time in query reformulation and may open more documents.
Subjects
Forty-two subjects were paid twenty dollars for participation in the experiment, which required approximately two hours for each subject to complete. Subjects were also paid twenty-five cents for each correct answer. The number of stimuli was adjusted to fit into this time frame. Because of the findings about library student cognition discussed above (Allen & Allen, 1993), one-half of the subjects were selected from a graduate library and information science program and one-half from other graduate school departments. An entrance interview was used to exclude subjects with prior knowledge, such as botanical degrees, that might allow them to perform the tasks without using the retrieval environment. Only students who reported being proficient with the Netscape or Internet Explorer web browsers were accepted as subjects.

Document Collection
The search collection consisted of approximately 1300 species files extracted from 260 genus descriptions of plants described in the first four volumes of the Flora of North America north of Mexico (FNA, 1993a; 1993b; 1997; in press). Each genus includes an image file representing all of the species in the genus. For the purposes of this study, these images were linked to the individual species. The collection size is large enough to make it realistic and to make serial browsing of the collection impractical. It might be assumed that larger collections would on average produce larger result sets, favoring the 2-D display. The identification keys found in FNA were not provided, although automatic indexing and structuring of the keys would be a useful future project. While all species, genus and family descriptions from the Flora were included in the collection, the target stimuli (queries) were limited to species that may be identified without the aid of microscopic analysis or other laboratory equipment. The selection of this collection and task avoids the difficult issue of assigning relevance to documents.
There is good reason to believe that the results of this study can be generalized to other collections. There are currently thousands of descriptions of plants available online. Examples include the USDA PLANTS database (USDA, 1997), the Vascular Plant Image Gallery of the Texas A&M Bioinformatics Working Group Digital Library (Vascular Plant Image Gallery, 1999), and the California Flora Database (CalFlora, 1999), to name just a few. Similar descriptions exist for insects and indeed for many natural history and cultural museum artifacts. There is also no reason to believe that the two-dimensional display interacts in any special way with this full-text collection.

Results
Data were analyzed with an ANOVA. The independent variable was Display Type (1-D versus 2-D). The dependent variables include Number of Tasks Solved, Recall Effort, Median Task Completion Time, Document Surrogate Browsing Time, Document Reading Time, Query Reformulation Rate, Query Length, False Positive Error Rate, and False Negative Error Rate. Subject variables include scores from the Factor-Referenced Cognitive Tests (Ekstrom, French, and Harman, 1976): Visualization (Paper Folding Test, VZ-2) and Verbal Fluency (Controlled Associations, FA-1). The current analysis was limited to Recall Effort, Median Task Completion Time, Number of Tasks Solved, VZ-2, and FA-1.

Learning Effect
An ordering or learning effect may have direct bearing on the training time and system familiarity. In both conditions the time needed to complete tasks may be less on average for later trials than for earlier trials. The order of the tasks was randomized, so such an effect is not attributable to differences in task difficulty for individual stimuli. However, since users are generally unfamiliar with 2-D interfaces, it may take longer to learn to use this display interface.
Since the 2-D interface has more configuration options and is less familiar, the learning effect should be greatest for this condition. Therefore the improvement might be expected to level off in later trials of the more familiar 1-D interface while the users of the 2-D interface may still be learning effective strategies for 2-D exploration. One way to look at the learning effect is to plot, for each group, the mean completion time for correctly identified species from the first through the last. This may be misleading, however, since the initial trials will include completion times from people who finished very few trials successfully. These times may be exceptionally long and pull up the average. These slower subjects never get to the later trials. A better view is to look at only people who
performed well in both groups and see if they learned over time. Figure 3 is a plot for people who finished seven or more of the nine trials. There are twelve such cases in the 1-D condition and six in the 2-D condition. The plot seems to support a differential learning effect. Subjects in the 1-D condition take about 260 seconds for the first trial, then drop to between 150 and 200 seconds with no improvement after the second trial. Subjects in the 2-D condition perform the first trial in an average time of about 315 seconds and, by the seventh trial, about 100 seconds. Subjects become faster across trials as they become familiar with the 2-D system. Not shown on this graph are the eighth and ninth trials: in the 2-D condition only four individuals finished eight trials and two finished all nine. These trials took more time than trial seven but were highly variable.
[Line plot omitted: mean completion time (seconds) by stimulus order (trials 1-7) for the 1-D and 2-D groups.]
Figure 3. Learning effect.

Number of Tasks Solved, Recall Effort and Median Task Completion Time

In most retrieval studies, the relevance of individual documents is difficult or impossible to determine (Ellis, 1984; Ellis, 1996; Harter, 1992; Harter, 1996; Mizzaro, 1997; Schamber, 1994). For the current task and collection, there is exactly one relevant document. This makes it possible to use Number of Tasks Solved, Recall Effort, and Median Task Completion Time as effectiveness measures rather than recall and precision. The 1-D list, through its ranking mechanism, provides an indication of the strength of the relationship between each document and the query as a whole; no information is provided about the relationship between individual query terms and the document. The 2-D interface contains more information about the relationship of documents to the retrieval terms. It was hoped that the 2-D display format would help to deal with the information overload that is typical of many retrieval environments. The display type that allows the largest number of tasks to be solved in a fixed amount of time can be said to be the best interface. Since Number of Tasks Solved is a count, a nonparametric analysis was conducted for this measure. The Kruskal-Wallis test is the nonparametric equivalent of an ANOVA (Siegel, 1953). Under this analysis there was a significant effect, df = 1, Chi-Square = 4.619, p < .032. This is consistent with the one-way ANOVA, F(1,40) = 5.18, p = .028. However, the effect favored the 1-D display: the group means were 6.71 for 1-D and 5.00 for 2-D. The group using the traditional 1-D interface solved more problems in the time allotted than the 2-D group. As discussed below, there are two explanations for this result that hold promise for WebVIBE-like 2-D displays. Recall Effort is the number of documents that must be examined before the correct one is recognized. No statistically significant effect was found with the Kruskal-Wallis test, df = 1, Chi-Square = .746, p < .388.
This is at first surprising given the significant difference in Number of Tasks Solved. People in the 2-D condition finished fewer tasks correctly, but when they did finish they were about as efficient as the 1-D group in terms of the number of documents they needed to view before finding the correct one. The mean number of documents viewed was 3.8 in the 1-D condition and 5.0 in the 2-D condition. One explanation is that there is a ceiling effect, a tolerance limit for searching that is independent of the interface used. Recall Effort measures only successful trials, trials in which the searcher actually found the solution. If a searcher is going to find the answer for a particular task, they will find it by the time they have viewed some fixed number of documents. If it is not found by that point,
they may look at many more documents, then give up and move on to another trial. Data that can test this hypothesis were collected and will be examined in a later paper. The display type with the lowest Median Task Completion Time may be said to be superior in some respects to the other display format. Each participant finished between one and nine trials successfully. Since the medians were skewed, a log transform was applied to the data. Using a one-way ANOVA, there was no significant difference between the interfaces in the time it took to find the correct item, F(1,40) = 2.558, p < .118. This is consistent with the Recall Effort result. Variables associated with Median Task Completion Time can be used to help understand how the members of each group spent their time. Task completion time is the sum of all of the time subjects spend on different subtasks, and the distribution of time among these subtasks may be even more informative than the overall Median Task Completion Time. The relative effectiveness of the two systems is a tradeoff among the time spent browsing the list displays (present in both conditions), the time spent viewing documents, and, in the 2-D condition, the time spent studying and manipulating the POI (point of interest) positions. Since the 1-D condition lacks this last process, the only way for the Median Task Completion Time in the 2-D condition to be faster is if the list browse time and/or the document viewing time is shorter than its counterpart in the 1-D condition. This might be the case, since the lists are expected to be shorter in the 2-D case and the total number of documents examined is expected to be small. These savings may or may not be sufficient to offset the additional time spent manipulating POIs. Information about these relationships will help to focus effort in future development. As stated above, the results for Median Task Completion Time may be masked by a learning effect.
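The statistical procedures described above can be sketched in a few lines of Python with scipy. The per-subject values below are invented placeholders (the individual data were not published in this paper), so only the procedure, not the resulting statistics, mirrors the study.

```python
import numpy as np
from scipy import stats

# Hypothetical Number of Tasks Solved per subject (counts, one value per subject).
tasks_solved_1d = np.array([9, 7, 8, 6, 7, 5, 8, 6, 7, 6])
tasks_solved_2d = np.array([5, 4, 6, 3, 5, 7, 4, 5, 6, 5])

# Kruskal-Wallis: the nonparametric equivalent of a one-way ANOVA,
# appropriate because Tasks Solved is a count.
h, p_kw = stats.kruskal(tasks_solved_1d, tasks_solved_2d)
print(f"Kruskal-Wallis: H = {h:.3f}, df = 1, p = {p_kw:.3f}")

# Hypothetical Median Task Completion Times (seconds). Medians were skewed,
# so the text applies a log transform before the one-way ANOVA.
median_times_1d = np.array([150, 180, 210, 160, 300, 175, 190, 220, 170, 200])
median_times_2d = np.array([310, 150, 400, 250, 120, 500, 280, 330, 260, 210])
f, p_anova = stats.f_oneway(np.log(median_times_1d), np.log(median_times_2d))
print(f"One-way ANOVA on log(median time): F = {f:.3f}, p = {p_anova:.3f}")
```

The log transform only changes the scale on which group means are compared; the Kruskal-Wallis test needs no such transform because it operates on ranks.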
Cognitive Skill and Performance Interaction

We had expected to find an interaction between cognitive skill and the main effect of interface. The hypothesis is that subjects with higher spatial scores will be able to take better advantage of the 2-D display, while those with higher verbal ability will be able to take advantage of the text-oriented 1-D display. Koshman (1996) found that experience with text-based Boolean systems led to lower performance with VIBE, and the same effect was expected in this study. The finding in the Koshman study may, however, be partly confounded with other subject variables: Koshman's subject pool was composed of library science students, who may have lower spatial scores and higher verbal scores than the population at large (Allen & Allen, 1993). Another study found that spatial visualization is a significant factor in the use of a GUI versus a command line: subjects with high spatial visualization scores took less time to complete tasks using a GUI, but there was no significant difference for command-line tasks (Leitheiser & Munro, 1995). Another study (Swan et al., 1998) found no difference in retrieval performance with a 3-D system between librarians and the general student population, but did find a significant difference in retrieval effectiveness between a 3-D iconic interface and a more traditional ranked-list system. The same study found that librarians overwhelmingly (7-1) preferred the system with ranked lists, while the general population preferred the 3-D display (6-2). These conflicting findings may indicate a complex interaction among familiarity with traditional systems, spatial visualization skills, and profession. While this is not the primary focus of this study, we analyzed (ANOVA) major and cognitive factor scores and found no significant differences between library and information science students and the general graduate student population.
A MANOVA showed no effect of either cognitive factors test on the performance measures.

DISCUSSION

In this study, people performed a known-item search on a scientific data set. In spite of the relatively small data set, 1400 documents, the task was very challenging because of the technical nature of the vocabulary. This was demonstrated by the fact that subjects in both groups had difficulty finishing nine tasks in an hour: only nine individuals out of forty-two completed all nine tasks. Because of their unfamiliarity with the scientific vocabulary of the data set, subjects had difficulty finding descriptive terms to include in queries. Still, many subjects were able to solve the tasks. Participants used one of two interfaces. The interfaces were identical except that one provided a scattergram display of the result set showing the relationship between components of the query and the documents retrieved. It was hoped that this interface, designated 2-D, would allow people to select documents to examine more judiciously. The most important finding of this study is that the 2-D display did not help people in a known-item search. In terms of Number of Tasks Solved, the 1-D interface was better. There was no
statistically significant difference between the interfaces for the performance measures of Recall Effort or Median Task Completion Time. This raises the question: why didn't the 2-D interface help? Further analysis will be needed to address this issue. Fortunately, the performance statistics collected during this study may be sufficient to at least partially answer the question and point to ways of improving the interface. The first candidate is the subjects' difficulty in understanding the 2-D display. The participants had never seen an interface like it before, while the result display in the 1-D condition was familiar to all of the subjects, since all had used web search engines. This conjecture is supported at least in part by the fact that searchers in the 2-D condition appeared to perform better in later trials than in earlier ones, as can be seen in Figure 3 above. The implication is that later studies will need to provide much more training and experience with the newer display before the conditions are comparable with the level of experience people have with a traditional ranked-list display. This is difficult in a single-sitting experiment, since the practical limit of people's attention before fatigue sets in is about one hour.

Future Work

Other clues from the instrumentation used in this study may shed light on the failure of the 2-D interface to help people. The instrumentation records the contents of every query, every result set, and every query reformulation, as well as the time spent looking at every display window. Initial analysis of the data indicates that users of the 2-D display used more search terms with very poor discriminatory power, as measured by tf.idf. People repeatedly searched a botanical database for terms like "tree," "branch" and "leaf." This resulted in many more documents being returned for each query cycle.
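The notion of discriminatory power used here can be illustrated with the idf component of tf.idf. The four toy "documents" below are invented stand-ins for the roughly 1400 full-text species descriptions in the real collection; they show why a term like "leaf," which occurs in nearly every botanical description, contributes almost nothing to narrowing a result set.

```python
import math
from collections import Counter

# Toy stand-in for the botanical collection (the real corpus had ~1400 docs).
docs = [
    "tree with ovate leaf and smooth branch bark",
    "shrub with serrate leaf margin and hairy branch",
    "tree bearing glaucous leaf and drooping branch",
    "herb with basal leaf rosette and glabrous stem",
]

n_docs = len(docs)
df = Counter()                      # document frequency of each term
for d in docs:
    df.update(set(d.split()))       # count each term once per document

def idf(term):
    """Inverse document frequency, log(N / df). A term appearing in every
    document gets idf = 0, i.e. no discriminatory power."""
    return math.log(n_docs / df[term])

for term in ("leaf", "branch", "glaucous"):
    print(term, round(idf(term), 3))
# "leaf" is in all 4 docs -> idf = 0; "glaucous" is in 1 -> idf = log(4).
```

A query built from zero-idf terms matches essentially the whole collection, which is exactly the behavior observed in the 2-D group's logs.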
Subjects tolerated these large result sets in the scattergram more than the people in the 1-D group tolerated long result lists. While the 2-D display is designed to help people with large result sets, it could not compensate sufficiently for the large difference in result set size between the groups. We will analyze the data to see whether this hypothesis is correct. If so, it may be necessary to provide mechanisms that correct the problem, for example a controlled vocabulary or graphical feedback about poor query terms.

Acknowledgement

This material is based upon work supported by the National Science Foundation under Grant No. DBI-9982849 and the Research Review Board of the University of Illinois at Urbana-Champaign. We thank Sarai Lastra and Jun Wang for research assistance.

REFERENCES
Allen, B. (1994). Perceptual speed, learning and information retrieval performance. Proceedings of the Seventh Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 71-80. New York: Springer-Verlag.

Allen, B. (1996). Information tasks: Toward a user-centered approach to information systems. New York: Academic Press.

Allen, B., & Allen, G. (1993). Cognitive abilities of academic librarians and their patrons. College & Research Libraries, 54(1), 67-73.

Benford, S. D., Snowdon, D. N., Greenhalgh, C. M., Ingram, R. J., Knox, I., & Brown, C. C. (1995). VR-VIBE: A virtual environment for co-operative information retrieval. Computer Graphics Forum, 14(3), 349-360. NCC Blackwell. [Also Proc. Eurographics '95.]

CalFlora Database. Berkeley Digital Library Project. (http://elib.cs.berkeley.edu/calflora/) [Accessed 29 March 1999].

CalPhotos Database. California Plants & Habitats Photos, Berkeley Digital Library Project. http://elib.cs.berkeley.edu/photos/flora/ [Accessed 20 December 1998].

Chalmers, M., & Chitson, P. (1992). Bead: Explorations in information visualization. Proceedings of SIGIR '92, Copenhagen, Denmark. ACM Press, 330-337.

Chen, C., Czerwinski, M., & Macredie, R. (2000). Individual differences in virtual environments: Introduction and overview. Journal of the American Society for Information Science, 51(6), 499-507.

Cugini, J., Piatko, C., & Laskowski, S. (1996). Interactive 3D visualization for document retrieval. Paper presented at the Conference on Information, Knowledge and Management (CIKM 96), Workshop on New Paradigms in Information Visualization and Manipulation. (http://zing.ncsl.nist.gov/~cugini/uicd/viz.html) [Last accessed February 16, 1999].

Ekstrom, R. B., French, J. W., Harman, H. H., & Dermen, D. (1976). The Kit of Factor-Referenced Cognitive Tests. Princeton, NJ: Educational Testing Service.

Ellis, D. (1996). The dilemma of measurement in information retrieval research. Journal of the American Society for Information Science, 47(1), 23-36.

Ellis, D. (1984). Theory and explanation in information retrieval research. Journal of Information Science, 8, 25-38.

Flora of North America Editorial Committee. (1993). Flora of North America: North of Mexico. Volume 1: Introduction. New York: Oxford University Press.

Flora of North America Editorial Committee. (1993). Flora of North America: North of Mexico. Volume 2: Ferns and Gymnosperms. New York: Oxford University Press.

Flora of North America Editorial Committee (Eds.). (1997). Flora of North America: North of Mexico. Volume 3: Magnoliophyta: Magnoliidae and Hamamelidae. New York: Oxford University Press.

Flora of North America Editorial Committee (Eds.). (in press). Flora of North America: North of Mexico. Volume 22: Magnoliophyta: Alismatidae, Arecidae, Commelinidae. New York: Oxford University Press.

Harter, S. P. (1996). Variations in relevance assessment and the measurement of retrieval effectiveness. Journal of the American Society for Information Science, 47(1), 37-49.

Harter, S. P. (1992). Psychological relevance and information science. Journal of the American Society for Information Science, 43(9), 602-615.

Hemmje, M., Kunkel, C., & Willett, A. (1994). LyberWorld: A visualization user interface supporting fulltext retrieval. Proceedings of ACM SIGIR '94, Dublin.

Jansen, B. J., Spink, A., Bateman, J., & Saracevic, T. (1998). Real life information retrieval: A study of user queries on the web. SIGIR Forum, 32(1), 1-17.

Kim, H., & Korfhage, R. R. (1994). BIRD: Browsing Interface for the Retrieval of Documents. Proceedings of the IEEE Symposium on Visual Languages, St. Louis, 176-177.

Korfhage, R. R. (1997). Information Storage and Retrieval. New York: Wiley & Sons.

Koshman, S. L. (1996). User testing of a prototype visualization-based information retrieval system. Doctoral dissertation, University of Pittsburgh.

Leitheiser, R. L., & Munro, D. (1995). An experimental study of the relationship between spatial ability and the learning of a graphical user interface. Proceedings of the First Americas Conference on Information Systems, Association for Information Systems, August 25-27, 1995, Pittsburgh, Pennsylvania, pp. 122-124. (http://hsb.baylor.edu/ramsower/acis/papers/leitheis.htm) [Last accessed January 28, 1998].

Marchionini, G. (1997). Information Seeking in Electronic Environments. New York: Cambridge University Press.

Marchionini, G., & Komlodi, A. (1998). Design of interfaces for information seeking. In M. Williams (Ed.), Annual Review of Information Science and Technology, Volume 33. Medford, NJ: Information Today, Inc. for the American Society for Information Science.

Mizzaro, S. (1997). Relevance: The whole history. Journal of the American Society for Information Science, 48(9), 810-832.

Morse, E., & Lewis, M. (1997). Why information visualizations sometimes fail. Proceedings of the IEEE International Conference on Systems, Man and Cybernetics, Orlando, FL, October 12-15, 1997.

Morse, E., Lewis, M., Korfhage, R., & Olsen, K. (1998). Evaluation of text, numeric and graphical presentations for information retrieval interfaces: User preference and task performance measures. Proceedings of the IEEE International Conference on Systems, Man and Cybernetics, San Diego, CA, October 11-14, 1998.

Newby, G. (1992). Towards navigation for information retrieval. Unpublished doctoral dissertation, Syracuse University.

Niering, W. A., & Olmstead, N. C. (1979). The Audubon Society Field Guide to North American Wildflowers: Eastern Region. A Chanticleer Press Edition. New York: Alfred A. Knopf. 887 p.

Olsen, K. A., Korfhage, R. R., Sochats, K. M., Spring, M. B., & Williams, J. G. (1993). Visualization of a document collection: The VIBE system. Information Processing and Management, 29(1), 69-81.

Schamber, L., Eisenberg, M. B., & Nilan, M. S. (1990). A re-examination of relevance: Toward a dynamic, situational definition. Information Processing and Management, 26, 755-776.

Schamber, L. (1994). Relevance and information behavior. Annual Review of Information Science and Technology (ARIST), 29, 3-48.

Shneiderman, B., Byrd, D., & Croft, W. B. (1997). Clarifying search: A user-interface framework for text searches. D-Lib Magazine, January 1997. ISSN 1082-9873. (http://www.dlib.org/dlib/january97/retrieval/01shneiderman.html) [Last accessed February 21, 1999].

Siegel, S. (1953). Nonparametric Statistics for the Behavioral Sciences. New York: McGraw-Hill, pp. 184-194.

Spoerri, A. (1993). Visual tools for information retrieval. Proceedings of the 1993 IEEE Symposium on Visual Languages, Bergen, Norway. Los Alamitos, CA: IEEE Computer Society Press, 160-168.

Steinberg, D., DePlachett, C., Pathak, K., & Strickland, D. (1995). 3-D displays for real-time monitoring of air traffic. CHI '95 Proceedings. (http://www.acm.org/sigchi/chi95/Electronic/documnts/intpost/rks_bdy.htm) [Last accessed February 21, 1999].

Sugar, W. (1995). User-centered perspective of information retrieval research and analysis methods. In M. Williams (Ed.), Annual Review of Information Science and Technology, Vol. 30, pp. 77-109.

Swan, R., Allan, J., & Byrd, D. (1998). Evaluating a visual retrieval interface: AspInquery at TREC-6. Position paper for the CHI '98 Workshop on Innovation and Evaluation in Information Exploration Interfaces (Los Angeles, April 1998). (http://www.fxpal.com/CHI98IE/submissions/long/swan/index.htm) [Last accessed January 10, 2000].

USDA, NRCS. (1997). The PLANTS database. (http://plants.usda.gov) National Plant Data Center, Baton Rouge, LA 70874-4490 USA. [Accessed May 28, 1999].

Vascular Plant Image Gallery. Texas A&M Bioinformatics Working Group Digital Library. http://www.csdl.tamu.edu/FLORA/gallery/gallery_query.htm [Accessed May 28, 1999].