RMAP: a system for visualizing data in multidimensional relevance space

Jackie Assa, Daniel Cohen-Or, Tova Milo
Computer Science Department, School of Mathematical Sciences, Tel-Aviv University, Ramat-Aviv 69978, Israel
e-mail: {jackie,daniel,milo}@math.tau.ac.il

We describe a prototype system, RMAP, for visualizing information distribution in a multidimensional relevance space. The information displayed consists of many objects, a set of features likely to interest the user, and some function that measures the relevance level of every object to the various features. The goal is to provide the user with a comprehensible visualization of that information, where the exact relevance measures of the objects are not significant. We flatten the multidimensionality of the feature space into a 2D relevance map, capturing the inter-relations among the features. The prototype extracts information from World Wide Web query engines, automatically categorizes and clusters it, and allows the user to visualize the result.

Key words: User information – Data visualization – Relevance map – World Wide Web – Search engines

Correspondence to: J. Assa
The Visual Computer (1999) 15:217–234, Springer-Verlag 1999

1 Introduction
Many information retrieval tasks involve dealing with large sets of objects and require the analysis of the relevance of the objects to various features that are likely to be of interest to the user (Consens et al. 1994; Doemel 1994; Hemmje 1994; Hendley et al. 1995; Inselberg and Dimsdale 1990; Keim and Kriegel 1995, 1996; Neves 1997; Wise and Thomas 1995; Zizi 1995). For example, in the data mining context (Keim and Kriegel 1996, 1997), we may be given a database of objects (typically tuples) and a set of rules that potentially describe properties of the data, and we may want to analyze how relevant the rules are to various classes of data objects.

As another example, consider a Web search (Digital Equipment Corporation 1996; Excite 1996; Yahoo 1996; Shmueli and Konopnicki 1995; Mendelzon 1996). The answer to a query to a Web index server is often lengthy; much of it is not necessarily relevant, and the interesting documents are sometimes buried far down the document list. To assist the user in focusing on the relevant documents, it has been suggested that the returned list be analyzed and the documents classified according to their relevance to various features that may interest the user (Consens et al. 1994; Doemel 1994; Hemmje 1994; Hendley et al. 1995; Neves 1997; Wise and Thomas 1995; Zizi 1995).

The question is how to present such object-feature relevance information to the user. What we want is a display that helps highlight the more interesting classes of objects, filter out uninteresting objects, focus on the important information, understand the significance of various pieces of data, and grasp the relationship between them. Abstractly, the information to be displayed consists of a large number of objects, a set of features, and some function that measures the relevance level of every object to the various features. Each feature can be viewed as a dimension of the information space.
The classification of the objects, with respect to the various features, positions the objects in the corresponding “coordinates” of this multidimensional information space. Each coordinate indicates the relevance of the object to the corresponding feature. Grasping the object-feature relations has been the focus of many visualization systems that are designed to solve application problems such as data mining, web navigation, and the “Lost-in-hyperspace” problem. Figure 1a, b illustrates the problematic aspects emerging from a naive representation of such data. When each data object is represented on a separate line, the object's relation to the others is blurred.
Fig. 1a–b. The objects with their graded features, shown in tabular form and in an Internet browser answer list
Fig. 2. The features are displayed as gravitation nodes, and the objects are displayed as points centered around the features that are most relevant to the object
Furthermore, although relations for a given feature can still be noted, more complex relations, such as those consisting of two, three, or more features, cannot be viewed in these representations. Hence the naive visualization does not provide a good mechanism for capturing the data properties. Before we continue, it is important to note that for the type of information retrieval application described, the exact grades (relevance measures) of the objects are not significant. We are not dealing with exact scientific measurements here. We want to enable the user to have, at a glance, a global picture of the data distribution, thereby easing the selection of significant objects. We are therefore willing to trade accuracy for a clear display.
Fig. 3a–c. The need for composed nodes: a multidimensional relevance data; b a relevance map with no composed nodes; c a relevance map with multifeatured (composed) nodes
A better visualization paradigm provides the user with a relevance map that summarizes the given information in a more intuitive way. A simple version of such a map visualizes the features as gravitation nodes and the objects as points centered around the features that are most relevant to the object. Figure 2 shows the relevance map constructed for the information presented in Fig. 1a. The labeled circles represent features, and the dots represent objects. With this visualization, the user can immediately see that there are some objects relevant only to the feature F1, and a few to feature F2, whereas all the other objects are relevant to several features. However, mapping a hyperspace down onto a 2D map is impossible without introducing ambiguities and conflicts. For example, in the map of Fig. 2, it is not clear if the objects in the center are relevant to all the features or just
to some of them. Some of those objects may be relevant to only two diagonal features, say F4 and F1.

The solution proposed in this paper is based on a layout method that adapts the feature set to obtain a more comprehensible map. The idea is to identify the potentially ambiguous relations among objects and features, and construct new composed features to resolve them. The relevance map is then built with respect to this extended set of features, and thus provides a clearer display of the information. For example, consider the map in Fig. 3c. It enables one to distinguish between objects relevant only to the F4 and F5 features (they are located near the composed node labeled with the pair F4, F5), and documents that are relevant only to F1, F2, F3 or to all of F1, F2, F3, F4.

Unlike previous works, where the layout technique assumed that the features being displayed were almost independent, we have to deal here with interdependencies among the features (i.e., the dependencies between composed and basic features). The dependencies have to be reflected in the data representation. The layout of the extended set of features becomes an optimization problem, which we solve using simulated annealing (Davidson and Harel 1996).

The visualization technique we propose here has been tested in the context of a Web navigational aid tool. The relevance map (RMAP) prototype system was built to experiment with and demonstrate the effectiveness of relevance maps (Assa et al. 1997) as a tool by visualizing the results of Web searches.

The rest of this paper is organized as follows. In Sect. 2 we illustrate the notion of a relevance map as a mapping mechanism from hyperspaces into a 2D map. In Sect. 3 we discuss the details of the simulated annealing algorithms. In Sect. 4 we describe a specific application, a Web search system, where this technique was used. A further discussion on the system architecture continues in Sect. 5, and Sect. 6 presents examples. Section 7 describes related work in this field, and finally, in Sect. 8, we conclude.
2 The relevance map

As explained in the introduction, the layout of the objects in a 2D map should reflect their inter-relation with respect to the selected features. Since the number of features is usually much greater than two, it is a multidimensional layout problem (Consens et al. 1994; Davidson and Harel 1996; Hemmje 1996; Neves 1997). Projecting a multidimensional vector space onto a 2D map yields conflicts and ambiguities. In this section we present a method that lays out nodes (features) and points (objects) with their relative relevance (grades) on a 2D map and alleviates the ambiguity problem by introducing new feature nodes.

Each feature defines a dimension or an axis along which the object-point can be placed so that it reflects its relevance to that feature. Placing two orthogonal dimensions generates a planar 2D map, which represents the information space of the two features. Similarly, the method can be generalized to an n-dimensional feature space. However, the visualization, orientation, and understanding of n-dimensional spaces is known to be difficult (Ankerst et al. 1996; Inselberg and Dimsdale 1990). The proposed relevance map represents n-dimensional object-points on a 2D map such that their location reflects their relevance to the features. The relevance map is a 2D space where a distance function, but not a metric, is defined. That is, the proximity of two points means that they are likely to have similar data, but the distance between them cannot be compared with their distances to a third point.

A feature dimension and an associated relevance threshold partition a 2D map into two regions: one that is more relevant to the feature and the other less relevant. The relative distance of a point from the feature node expresses the relevance of the point to the feature (Fig. 4a). Placing two feature nodes in the 2D plane partitions the map into four regions where the location of a point reflects its relevance to the two features. A high relevance to one feature places the point close to the feature node. A point with high relevance to both features is placed between them (Fig. 4b). Weak relevancy to either feature places the point far from both. The features partition the plane into topological regions like Venn diagrams (Fig. 5). This scheme shows the mutual relevancy of a point to the two features, and it is a stronger expression than just the distance (grade) to a feature. Unlike Venn diagrams, where points are either in or out of a region, here the position of a point in the map expresses its relevancy to multiple features.

Venn diagrams deal well with as many as three dimensions. There are Venn diagrams for more dimensions (Ruskey 1996), but they do not have a structure for the topological regions and do not give an intuitive view. This problem is referred to as inconsistency, and is likely to occur when points have high grades for several features and need to be placed in several regions simultaneously.
The inconsistency might be resolved either by placing multiple copies of a point in all the regions while maintaining a visual link between them, or by placing it once in the most relevant region. Since the location of the point is a rough estimate of the relevance degree to the features, we can measure the “degree of inconsistency”. That is, a point which can be placed in more than one region has a degree of irrelevance to the features. By neglecting minor inconsistencies, we overcome many possible conflicts and generate a map representing a large number of dimensions. Note that the relevance map is not used for scientific measurements, but rather to increase the user's intuition of the information space. Although using topological regions formed by more than three or four features seems neither intuitive nor mathematically consistent, the method can be extended to more features.

Fig. 4. a A relevance map of a single feature, and b a relevance map of two features
Fig. 5. The Venn-diagram regions of two features

Our framework supposition is that the relevance map represents the relevance of points to the feature nodes. Thus, most of the points are placed in a way that best expresses their characteristics while maintaining a low degree of inconsistency. This supposition translates the problem to an optimization problem in which the placement is a function of the given points and feature nodes. Theoretically, the number of feature nodes in an RMAP generated for $n$ basic nodes might be up to $2^n$. Treating the problem as an optimization problem presents only the most effective feature nodes, where the effectiveness is dictated by the data itself.

An example of a five-feature placement is illustrated in Fig. 3c. This example shows the five features F1 to F5 and three more nodes. The “F1, F2, F3, F4” node is for points that are very relevant to all four of those features, and the “F1, F2, F3” node is for points that are very relevant especially to F1, F2, and F3. Similarly, node “F4, F5” is for points relevant to both features F4 and F5. The three additional composed nodes are needed because a significant portion of the points fall into these three groups, and at the same time they do not create too many inconsistencies in the map. Note that the selection of additional nodes is driven by the characteristics of the point population, and thus the map adapts itself to a “clear” display of the relevant points. Note also that the map in Fig. 3b does not provide any effective insight, while the map in Fig. 3c is more effective as it emphasizes the various dependencies among the objects with respect to the features.
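The data-driven choice of composed nodes described above can be illustrated with a minimal Python sketch. It is not the prototype's code: the function name `composed_nodes`, the grade scale in [0, 1], and both thresholds (`relevance_threshold`, `population_share`) are assumptions made for illustration. A feature combination becomes a composed node only when a large enough share of the objects is highly relevant to exactly that combination.

```python
from collections import Counter


def composed_nodes(grades, relevance_threshold=0.5, population_share=0.2):
    """Suggest composed nodes from object grades.

    grades: dict mapping object id -> dict mapping feature -> grade in [0, 1].
    Both thresholds are hypothetical tuning knobs, not values from the paper.
    """
    counts = Counter()
    for feature_grades in grades.values():
        # Features to which this object is highly relevant.
        strong = tuple(sorted(f for f, g in feature_grades.items()
                              if g >= relevance_threshold))
        if len(strong) >= 2:  # only multi-feature groups yield composed nodes
            counts[strong] += 1
    min_count = population_share * len(grades)
    return sorted(combo for combo, n in counts.items() if n >= min_count)
```

Raising `population_share` yields fewer composed nodes and a coarser map; lowering it yields more nodes at the price of more potential inconsistencies.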
3 Placement algorithm

The point placement around the feature nodes (inside the topological regions) can be perceived as celestial bodies influenced by gravitational forces. Each node affects its surrounding environment, pulling strongly on the relevant points and weakly on the irrelevant ones. The layout process and generation of the relevance map consist of two steps. First, the feature nodes are placed and then the object points. The feature placement consists of two substeps:

1. Creation of composed nodes. The system examines the classification of points, and determines whether a significant portion of the population correlates closely to more than one feature. When this portion exceeds a certain threshold, a new node is generated that represents a composition of features, referred to as a composed node. This threshold, used for controlling the generation of composed nodes, helps to define and examine the relation and distribution of the data in the result population, as described later. The system locates an n-tuple of features with enough corresponding points that are very relevant to all of the n features.
2. Placement of features. To place the nodes on the map, an undirected graph is constructed with edges defined between composed nodes and source basic nodes. For example, see the graph in Fig. 6 with the edges drawn in light gray. The edges emanate from the basic feature nodes F1–F5 to the two composed nodes. This graph is used as input for the simulated annealing algorithm, which positions the graph nodes “nicely” on the 2D map (Davidson and Harel 1996), according to some heuristic criteria. Simulated annealing is a flexible optimization method, and it is suited for large-scale combinatorial problems. This method differs from standard iterative improvement methods by allowing “uphill” moves, that is, moves that degrade, rather than improve, the temporary solution.
This provides more flexibility and enables escapes from local minima toward the preferred global minimum. The name simulated annealing derives from the method's similarity to the process in which liquids are cooled to a crystalline form, a process called annealing. The annealing algorithm is based on grading a given graph situation according to several simple factors, and allowing the annealing process to search for the global minimum of a cost that reflects the following rules.

1. Nodes are placed inside the viewing frame.
2. Nodes are distributed evenly over the map.
3. Basic nodes are distributed evenly around their composed nodes with even edge lengths.
4. Edges do not cross each other, and are away from the other nodes.

Rule 1 states that nodes should be placed inside the viewing window and far from its frame. An example of the effect this rule may have on a seven-node graph is shown in Fig. 6a. Rule 2 states that nodes should be distributed evenly over the map, and farther away from each other. This rule checks the distance between each pair of nodes in the graph, and gives a lower cost for graphs that have nodes separated as far from each other as possible. As an example, applying this rule on top of the first rule produces the result shown in Fig. 6b. Rule 3 states that a composed node should have its basic generating nodes around it and at equal distances from it. This rule is implied by the desire to see the composed node in the center of its components. It is calculated by comparing the distance between the centroid of the basic nodes and the location of the composed node and by measuring the variance of edge lengths. A simple example of how this rule contributes to the understanding of the map is shown in Fig. 6c. The last rule states that edges should be as far away as possible from other edges and nodes. Although this rule can create conflicting results in nonplanar graphs, its role is mainly to minimize crossing edges and to try to space the graph placement. For example, adding this rule produces the graph shown in Fig. 6d. Each graph position has a value that reflects its accommodation to these criteria. The simulated annealing algorithm iteratively searches for the position that minimizes the graph cost.
The convergence to the final position is quite fast; for example, on a 15-node graph, the placement of the graph nodes on an HP 715/75 workstation takes less than 2 s. Starting from an arbitrary position (Fig. 7a), the graph converges to its final position after 35 iterative steps (Fig. 7b–d; display stages 5, 30, and 33, respectively).
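The feature placement described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the cost function of Davidson and Harel (1996) or of the prototype: the cost covers only rules 1 to 3 (the edge-crossing rule 4 is omitted for brevity), and the weights, the unit-square frame, the cooling schedule, and the names `layout_cost` and `anneal` are all hypothetical. The `composed` argument maps each composed node to the basic nodes it is built from.

```python
import math
import random


def layout_cost(pos, composed, size=1.0):
    """Cost of a layout; lower is better. All weights are illustrative."""
    cost = 0.0
    # Rule 1: nodes stay inside the frame and away from its border.
    for x, y in pos.values():
        for d in (x, size - x, y, size - y):
            cost += 1e6 if d <= 0 else 0.01 / (d * d)
    # Rule 2: every pair of nodes should be far apart.
    names = list(pos)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            d2 = (pos[a][0] - pos[b][0]) ** 2 + (pos[a][1] - pos[b][1]) ** 2
            cost += 0.01 / max(d2, 1e-9)
    # Rule 3: a composed node sits at the centroid of its basic nodes,
    # and its edges have similar lengths (low variance).
    for node, members in composed.items():
        cx = sum(pos[m][0] for m in members) / len(members)
        cy = sum(pos[m][1] for m in members) / len(members)
        cost += math.hypot(pos[node][0] - cx, pos[node][1] - cy)
        lengths = [math.dist(pos[node], pos[m]) for m in members]
        mean = sum(lengths) / len(lengths)
        cost += sum((length - mean) ** 2 for length in lengths) / len(lengths)
    return cost


def anneal(pos, composed, steps=2000, t0=1.0, cooling=0.995, seed=0):
    """Move one node at a time; accept uphill moves with probability exp(-delta/t)."""
    rng = random.Random(seed)
    pos = dict(pos)
    cost = layout_cost(pos, composed)
    best, best_cost = dict(pos), cost
    names = list(pos)
    t = t0
    for _ in range(steps):
        node = rng.choice(names)
        old = pos[node]
        step = 0.02 + 0.3 * t  # larger moves while the system is hot
        pos[node] = (old[0] + rng.uniform(-step, step),
                     old[1] + rng.uniform(-step, step))
        new_cost = layout_cost(pos, composed)
        delta = new_cost - cost
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            cost = new_cost  # accept, possibly uphill
            if cost < best_cost:
                best, best_cost = dict(pos), cost
        else:
            pos[node] = old  # reject and restore
        t *= cooling
    return best, best_cost
```

The acceptance test is what separates this from plain iterative improvement: while the temperature `t` is high, moves that raise the cost are sometimes kept, which lets the layout leave a local minimum.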
Fig. 6a–d. The effect of the four rules on a simple case
Fig. 7. Four simulated annealing iterations over a graph
Once the feature nodes have been placed, the object points have to be positioned. Their initial location is around their top-graded feature nodes. The other nodes influence the position of the points by pulling them by “forces” determined with respect to their grades. Points that have no nodes influencing them are given a distinct position in the upper left-hand corner of the screen. The final position of the objects is determined by a relaxation process, in which the points are perturbed to be evenly distributed across the map. This is especially important in crowded regions. The very slight movements of the relaxation do not change the topology of the generated map, but only adjust the point positions locally for a better display.

Other visualization maps rely on a spring-set layout method, which simulates the graph as a set of springs searching for an equilibrium, such as the ones described by Zizi (1995) and Sprenger et al. (1997). These solutions provide similar results faster than simulated annealing; however, they are more likely to end in local minima.
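The grade-driven pull of the feature nodes on an object point can be approximated by a weighted average of node positions. The Python sketch below is only one possible reading of the “forces”: it treats each grade as a weight, which corresponds to the equilibrium of linear springs whose stiffness equals the grade. The function name `place_point`, the screen-style coordinates with the origin in the upper left corner, and the `neutral` position are assumptions.

```python
def place_point(grades, node_pos, neutral=(0.05, 0.05)):
    """Place one object point from its grades.

    grades:   dict mapping feature -> grade (0 means not relevant).
    node_pos: dict mapping feature -> (x, y) of the placed feature node.
    neutral:  hypothetical spot near the upper left corner for points
              that no node influences.
    """
    pulls = sorted(((g, f) for f, g in grades.items()
                    if g > 0 and f in node_pos), reverse=True)
    if not pulls:
        return neutral
    total = sum(g for g, _ in pulls)
    # Each node pulls in proportion to the grade, so the point ends up
    # closest to its top-graded nodes.
    x = sum(g * node_pos[f][0] for g, f in pulls) / total
    y = sum(g * node_pos[f][1] for g, f in pulls) / total
    return (x, y)
```

A point graded 3 for one node and 1 for another lands a quarter of the way from the first node toward the second.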
4 RMAP

To retrieve information dealing with a specific topic from the Internet, several index servers, search engines, and query languages have been developed. For examples, see Digital Equipment Corporation (1996), Excite (1996), Yahoo (1996), Shmueli and Konopnicki (1995), and Mendelzon et al. (1996). The answer from a search engine to a query is often a large set of candidate documents. Rather than presenting the answer as a long linear list, the RMAP system displays it in relevance maps that reflect the relevance of the documents to various features that are likely to be of interest to the user. The system enables the user to query standard index servers, view the result as a map, interact with the map, adjust the layout (if needed) using coarser or finer feature granularity, and access or browse the relevant data.

We believe that the prototype provides a tool for validating the intuitiveness and feasibility of the placement algorithm, as well as for confronting real-life problems such as dealing with a large noise ratio in the displayed population. The RMAP provides navigational aid to Internet surfers by connecting bidirectionally to an Internet browser. When the user selects an information item (document), the browser loads the associated page, and the document icon on the map is highlighted. In the other direction, whenever the user loads new documents through the browser, the documents are automatically classified and placed into the relevance map according to their grades. The user can also directly modify the relevance map appearance by changing the positions of (or even removing) the feature nodes and document points.
Next we describe the system architecture, followed by a description of additional navigational aids, and give examples of information maps in Sect. 6.
5 System architecture

The general architecture of the system consists of four building blocks, shown in Fig. 8. There are many ways of implementing each of the blocks. The RMAP prototype system, described in this section, is implemented by choosing one specific solution for each component. It should be noted that, although the solutions we describe here were developed in the Web context, similar ideas can be used to display and analyze data from a standard information system. The intuitive and flexible representation enables the user to explore the displayed information space fully, review several views of the space with various features, invoke new queries, and mine the information. An example of a view, with a brief look at the system's user interface, is shown in Fig. 9.
5.1 Information collection

The first component determines which data items are to be displayed to the user. We call this set of data items the result population. The result population is obtained by submitting one or more queries to a query server. In this work, we use existing Web index servers, so the result population is basically a set of links to documents. In general, the result population can be a set of tuples, complex values, objects, etc., depending on the query engine and the type of information being processed. The prototype can extract information automatically from the AltaVista search engine (Digital Equipment Corporation 1996), and with minor changes, interact with other query engines, such as the W3QS (Shmueli and Konopnicki 1995) and WebSQL (Mendelzon 1996) servers, by parsing the resulting HTML pages of these servers and extracting a list of document URLs. To analyze the documents, extract significant features, and classify the documents, the system needs to download the result population HTML pages. In some cases, downloading only a portion of these pages may be sufficient.
Fig. 8. The prototype building blocks
Fig. 9. RMAP user interface with a zoomed view of “Pooh” query results
The documents are downloaded in parallel by the system, which takes a relatively short time. For example, using a Web connection of less than 1 k/s, downloading about 200 pages takes roughly 5–8 min.
5.2 Initial feature selection and classification

The second component (1) selects the initial feature list and (2) classifies the information according to the list members by grading the relevance of each of the retrieved documents to this feature list. Since this process is similar to those used in other database/IR systems, it is implemented with the existing indexing methods and other retrieval techniques (Larson 1992; Padmini 1992; Salton 1989; Salton and McGill 1983). The extraction of features from the document context should reflect both the essence and the distribution of the result population. The system uses key words, word stems, and groups of key words extracted from the documents as features. Thus the feature list is mainly extracted from the retrieved data, and the user is able to modify it with “external” knowledge. As a grading function, it uses a combined weight of feature $F_j$ and document $D_i$ based on the product of the feature frequency and the inverse document frequency as follows:

$$d_{ij} = \frac{R_{ij} \, \log (N / O_j)}{w_i},$$

where $R_{ij}$ is the feature frequency, which represents the number of occurrences of feature $F_j$ in document $D_i$; $O_j$ is the number of documents in which feature $F_j$ occurs; $N$ is the number of loaded documents; and $w_i$ is the number of features in document $D_i$. The equation calculates the grade of a given feature for a given document by counting its occurrences and comparing the count with the number of features extracted from this document and with the number of documents in the result population that contain the feature.

The grading function takes special notice of the word position within the document. Since the information being read by RMAP is mainly HTML pages, some of the HTML markup can be used to identify words in titles and to increase their importance. Thus, a word appearing in a document title significantly increases the document grade for this feature. The grade of a group of features is calculated as the sum of the grades of its members. Other methods of measuring the relevance of features to documents, such as graph theoretic distance, Hamming distance, Levenshtein distance, and maximum posterior probability distance (Larson 1992; Salton 1989), can be used as well, although our metric produced good results.

The system automatically removes words on a predefined “stop list” from the feature list. Such a stop list contains common words such as “is” and “the” (Salton and McGill 1983; Zizi and Pediotakis 1996). The key words can be collected into groups by one of the following mechanisms:
• Automatic clustering. By applying Porter's stemming algorithm (Padmini 1992) to the feature list, all the features originating from the same stem are collected into one group.
• Manual clustering. The user collects features related to a certain field. For example, the user can collect the features “rain”, “storm”, “wind”, and “sun” into a “weather” feature cluster, according to his or her interest and understanding.
• Semiautomatic clustering. This is a novel clustering approach. The user is required to define the cluster by giving another query example, not necessarily related to the originating query. For example, the user can define the cluster feature “Disney” as the following query: “Disneyland” and “Mickey Mouse”. This query is used for downloading an additional set of documents, and extracting from them a secondary feature list. By comparing that list to the original one, the system can automatically collect features appearing in both, and compose a cluster of features that represents relevant features within a certain “field”. For example, the last query produced the following list: “Disney”, “cartoon”, “animation”, and “Donald”. Note that this list is affected by both the initial documents and the user definitions, with the WWW being used as a global thesaurus. Usually the semiautomatic clustering results are very dependent on both the query and the result population. However, in most cases it produces good results, as is shown in the next section.
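Two of the grouping mechanisms above reduce to simple set operations once the feature lists exist. The following Python sketch is illustrative only: `group_by_stem` takes the stemming function as a parameter (the toy stemmer in the usage note is a stand-in, not Porter's algorithm), and `semi_automatic_cluster` assumes that features are compared case-insensitively. Both function names are hypothetical.

```python
from collections import defaultdict


def group_by_stem(features, stem):
    """Automatic clustering: collect features that share a stem."""
    groups = defaultdict(list)
    for feature in features:
        groups[stem(feature.lower())].append(feature)
    return dict(groups)


def semi_automatic_cluster(primary_features, secondary_features):
    """Semiautomatic clustering: keep the features of the original result
    population that also occur in the feature list extracted for the
    user's second, cluster-defining query."""
    secondary = {f.lower() for f in secondary_features}
    return [f for f in primary_features if f.lower() in secondary]
```

For instance, `group_by_stem(["storm", "storms", "rain"], stem=lambda w: w.rstrip("s"))` groups the first two words together, and `semi_automatic_cluster(["Disney", "Piglet", "cartoon"], ["disney", "cartoon", "Mickey"])` keeps only the features found in both lists.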
We have also tested several other formal clustering mechanisms (Cutting et al. 1992; Hearst 1996), which were found to be less effective for our application. This is due to the vague definitions of the group contents and the lack of ability to explain the group characteristics to the user. Our method has some disadvantages in cases of multitopic pages that contain several disconnected parts, such as news flashes. However, we found it to be satisfactory in most cases. Our feature extraction method suffers from being imprecise. However, we found that the resulting visualization used for versatile information navigation and overview is satisfactory. Furthermore, this method is affected both by the user's ªexternalº knowledge, goals, and understanding and by the information context. This results in a map influenced by these factors. Other methods of automatic feature extraction and clustering are suggested in other works (Cutting et al. 1992; Kohonen 1995; Salton and McGill 1983). These either rely on a predefined thesaurus (and thus do not reflect the information and user perspective) or are rather difficult for users to relate on the basis of statistical features.
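As a rough illustration of the grading step of this section, the Python sketch below computes a grade for every document-feature pair. It assumes that the grading function reads d_ij = R_ij log(N / O_j) / w_i, that w_i counts the distinct selected features present in document D_i, and that documents are already tokenized; the extra weight for title words is left out. The function name `grade_documents` and the input layout are hypothetical.

```python
import math
from collections import Counter


def grade_documents(docs, features):
    """Grade every document against every feature.

    docs:     dict mapping document id -> list of tokens.
    features: list of feature words.
    Returns a dict mapping document id -> dict mapping feature -> grade.
    """
    n_docs = len(docs)                                   # N
    counts = {d: Counter(tokens) for d, tokens in docs.items()}
    doc_freq = {f: sum(1 for c in counts.values() if c[f] > 0)
                for f in features}                       # O_j
    grades = {}
    for d, c in counts.items():
        n_features = sum(1 for f in features if c[f] > 0)  # w_i (assumed)
        grades[d] = {
            # R_ij * log(N / O_j) / w_i, and 0 when the feature is absent
            f: c[f] * math.log(n_docs / doc_freq[f]) / n_features if c[f] else 0.0
            for f in features
        }
    return grades
```

A feature that occurs in every document gets grade 0 under this reading, since log(N / O_j) vanishes; such features do not help to tell documents apart.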
5.3 Layout of the results

Up to this point we have described the subprocesses by which a global set of features is extracted and the documents are graded with minimal user intervention. This allows the user to select the features that are used to form the relevance map. The selected features are called the Rmap features. The Rmap features are used by the algorithm described in Sect. 3 to generate composed nodes and to assist with the screen layout.

The placement of the document points is determined by the proportional distance to the three most highly ranked features and composed features. Documents that are not related to any of the selected features could have been removed from the map. However, our experiments show that placing them in a “neutral position”, currently in the upper left corner, is beneficial to the user. In many cases, examining these documents suggests new features that can be applied within the same relevance map, and it enriches the user's knowledge.

Next, a relaxation algorithm is applied to move the document points slightly farther apart to create a clearer map. The relaxation is based on an iterative perturbation of the document points within a certain distance from their original positions. Finally, associated with each document point, we display some information about the document it represents. In the current prototype this is done by displaying either the document title or some image extracted from it. Another option could be to use some thumbprint of the original page, as suggested by Card (1996).
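The relaxation step can be pictured with the following Python sketch, which pushes apart document points that lie too close together while keeping every point within a fixed radius of its original position. The pairwise push rule, the parameter values, and the names `relax`, `min_dist`, and `max_shift` are assumptions for illustration; the text only specifies an iterative perturbation within a certain distance.

```python
import math
import random


def _limit(p, origin, max_shift):
    """Keep p within max_shift of its original position."""
    dx, dy = p[0] - origin[0], p[1] - origin[1]
    d = math.hypot(dx, dy)
    if d <= max_shift:
        return p
    scale = max_shift / d
    return (origin[0] + dx * scale, origin[1] + dy * scale)


def relax(points, min_dist=0.02, max_shift=0.05, iterations=50, seed=0):
    """Spread crowded points slightly without changing the map topology."""
    rng = random.Random(seed)
    origin = list(points)
    pts = list(points)
    for _ in range(iterations):
        moved = False
        for i in range(len(pts)):
            for j in range(i + 1, len(pts)):
                dx, dy = pts[j][0] - pts[i][0], pts[j][1] - pts[i][1]
                d = math.hypot(dx, dy)
                if d >= min_dist:
                    continue
                if d == 0:  # coincident points: pick a random direction
                    angle = rng.uniform(0, 2 * math.pi)
                    ux, uy = math.cos(angle), math.sin(angle)
                else:
                    ux, uy = dx / d, dy / d
                push = (min_dist - d) / 2
                pts[i] = _limit((pts[i][0] - ux * push, pts[i][1] - uy * push),
                                origin[i], max_shift)
                pts[j] = _limit((pts[j][0] + ux * push, pts[j][1] + uy * push),
                                origin[j], max_shift)
                moved = True
        if not moved:
            break
    return pts
```

Because each point is clamped to a small disc around its starting position, well-separated points are left untouched and crowded ones move only locally.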
5.4 User interaction with the map

The prototype provides several capabilities to assist the user working with the relevance map as a navigational tool:

• Interaction with a browser. The prototype has a bidirectional synchronization with an Internet browser, which includes, on one hand, the ability to click on a document point in the map, instructing the browser to load a new page, and on the other hand, automatic placement of pages accessed by the browser in the current relevance map.
• Zoom. The system allows the user to zoom into regions of interest and reveal more details of the map by scaling up the images and titles in that region.
• Composed nodes and feature cluster separations. The user can interactively separate existing feature clusters and create more views of the same population. The same function can also separate composed nodes into their originating features, if needed.
These features provide the user with hierarchical top-down view maps in which document points located near a certain feature cluster show the distribution of document points in a different view.
5.5 The visual load

For a Web search application, the representation of the information credentials in the map is vitally important to the success of the visualization map. The object title should be contained in a minimal display bounding box. In some cases a thumbprint of the object could be created, or one of its embedded pictures might be used as an identifying icon.

Since the display size is limited, and the amount of information is considerably large, the visualization problem known as the visual load is encountered (Collaud et al. 1996). Overloaded displays are less intuitive, and their usefulness is lost. The visual load of complex maps can be alleviated by hierarchies or hyperbolic zoom (Hasan et al. 1995; Sarkar and Brown 1994). A hierarchy of features creates several levels of containment and lowers the number of objects displayed in each level. Hierarchies are useful in visually loaded regions where too many information items are closely placed. This solution is familiar from road maps, where overloaded regions are referenced to a separate zoomed-in map. Hyperbolic display zooms in on regions of interest while zooming the background out. The effect is like that of an electronic magnifying glass which uses hyperbolic deformation, known as a fish-eye view. Our layout method is independent of such techniques. However, we have implemented a zoom-in mechanism with which the user can scale up the region of interest.
6 Examples
We demonstrate the use of the system with the following examples.
6.1 Beatles
Assume the user is interested in finding information about the Beatles band and its members. The user submits the query "Beatles". The system issues the query to a Web search engine [Digital's AltaVista (1996) in our case], analyzes the returned documents, and extracts about 30 candidate features to focus on, including the band members' names. The features are then offered to the user, who selects the names of the four band members and limits the number of documents to be displayed to 50. A high threshold for the creation of composed nodes is used, and the resulting map is shown in Fig. 10. In the map, we can see five feature nodes. Four of them correspond to the actual band members (the nodes labeled Paul, George, John, Ringo). The fifth node is a composed node that was automatically created to highlight the fact that most of the documents in the center mention all four members. The placement algorithm places the composed node in the center between all the members, so that information more relevant to a particular band member can be placed between the composed node and the member. This example also shows, in the upper left corner, the documents that are not relevant to any of the features in the map.
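The effect of the central composed node can be illustrated with a small sketch. It assumes a VIBE-style rule in which a document is placed at the relevance-weighted centroid of the feature nodes; the function name, the coordinates, and the weights are hypothetical, and the placement rule actually used by RMAP may differ.

```python
def place_document(relevance, node_positions):
    """Return the relevance-weighted centroid of the feature nodes.

    relevance maps node name -> non-negative weight;
    node_positions maps node name -> (x, y).
    Returns None when the document is relevant to no node, so the
    caller can put it in a reserved area (e.g. a corner of the map).
    """
    total = sum(relevance.get(name, 0.0) for name in node_positions)
    if total == 0:
        return None
    x = sum(relevance.get(n, 0.0) * pos[0] for n, pos in node_positions.items()) / total
    y = sum(relevance.get(n, 0.0) * pos[1] for n, pos in node_positions.items()) / total
    return (x, y)


# Hypothetical layout: four members on the corners, composed node in the center.
nodes = {
    "Paul": (0.0, 0.0),
    "George": (1.0, 0.0),
    "John": (1.0, 1.0),
    "Ringo": (0.0, 1.0),
    "All four": (0.5, 0.5),
}

# A document about the whole band that also dwells on Paul lands between
# the composed node and the Paul node.
paul_heavy = place_document({"All four": 1.0, "Paul": 1.0}, nodes)
```

Under these assumptions, a document that mentions all four members but emphasizes one of them is drawn away from the center toward that member, which is the behavior described for Fig. 10.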
6.2 Pooh
This example uses information extracted from the AltaVista search engine by querying for pages containing the word "Pooh" to retrieve information related to the Winnie the Pooh books by A.A. Milne. The features found in the first 180 pages included the main characters of the books, such as Piglet, Tigger, and others, shown in Fig. 11. In this example, the user decided to create a new feature called 'Robin', which grouped together two words, 'Robin' and 'Christopher'. In this way the user collects all the information mentioning Christopher Robin. From the information space shown in Fig. 11, the user might conclude, as in the previous example, that most of the information has about the same relevance to all the characters. This, however, is not true. Consider the map in Fig. 12. Two composed nodes were automatically added to redistribute the information in the space: 'Tigger and Piglet' and 'Piglet and Robin'. The distribution of document points in the map represents their relevance to these features. As a comparison, we generated the same relevance map, but without any composed nodes, as shown in Fig. 11. There, most of the information lies in the center of the viewing area, symbolizing equal relations to all of the features, and this bears poor resemblance to the actual distribution. Figure 9 zooms in on a selected region of the map. The hyperlinks to documents are displayed as icons/images taken from the documents themselves. This provides more information about the correlation between the map presentation and the documents and gives the user a better perception of space and location.
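How composed nodes such as 'Tigger and Piglet' might be proposed can be illustrated with a small sketch. It assumes that a composed node is suggested whenever the fraction of documents mentioning both features reaches a threshold; the function name, the restriction to pairs, and the criterion itself are assumptions for illustration, and the criterion RMAP actually applies may differ.

```python
from itertools import combinations


def propose_composed_nodes(doc_features, threshold):
    """Suggest feature pairs that co-occur in many documents.

    doc_features is a list with one set of feature names per document.
    A pair is returned when it appears together in at least `threshold`
    (a fraction between 0 and 1) of all documents.
    """
    n = len(doc_features)
    if n == 0:
        return []
    counts = {}
    for feats in doc_features:
        # Sort so that each unordered pair is counted under a single key.
        for pair in combinations(sorted(feats), 2):
            counts[pair] = counts.get(pair, 0) + 1
    return sorted(pair for pair, c in counts.items() if c / n >= threshold)


# Hypothetical toy population of four documents.
docs = [
    {"Tigger", "Piglet"},
    {"Tigger", "Piglet", "Robin"},
    {"Piglet", "Robin"},
    {"Pooh"},
]
pairs = propose_composed_nodes(docs, threshold=0.5)
```

Raising the threshold suppresses composed nodes, which matches the way a high threshold is used in the examples to obtain a map without them.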
Fig. 10. Beatles relevance map
Fig. 11. Initial view of "Pooh" query map
Fig. 12. "Pooh" query with composed nodes
6.3 Mars landing
This example shows information regarding the Mars landing mission performed by NASA's spacecraft in July 1997. The query given to the AltaVista search engine was "Mars and landing" (Mars+landing in the AltaVista query format). The system loaded the first 180 answers, of which only 167 were retrievable; the remaining 13 pages were detected as redundant or outdated links. In order to define the information space, several new features were defined as groups of existing key words (features):
• To collect all the issues related to the atmospheric conditions on Mars, the user defined a group named Weather4 that comprises features found in the query "weather and atmosphere". The system automatically submitted this query to an Internet search engine and used the result population to generate a list of member features in that group, as described in Sect. 4. The key words collected for this group were Science, Water, Atmosphere, Atmospheric, Surface, Research, Students, University, Weather, Wind, Climate.
• The Mission group was automatically defined by gathering all the words built from the stem Mission (Mission, Missions, etc.).
• The Land_Site group was automatically defined by gathering all the words built from the stem Site (Site, Sites, etc.).
• The JPL group was built manually by separating the acronym of NASA's Jet Propulsion Laboratory (JPL).
Fig. 13. Initial view of "Mars and Landing" query map, without composed nodes
Fig. 14. A second view of "Mars and Landing" query map, without composed nodes
Fig. 15. The map created by "Mars and Landing" query with composed nodes
Finally, an additional feature consisting of the single word PathFinder, the name of the landing vehicle used, was added. Figure 13 displays the generated map, without adding composed nodes, rather than by having a very high threshold for the generation of composed nodes. From this, we have the impression that the JPL feature shapes the map (i.e., the information in it) by pulling most of the pages towards the middle of the map. Furthermore, we can see that the PathFinder concept has a "close" neighborhood of pages and is not affected by the JPL concept. Some answers between the Mission and JPL concepts, and especially between the Weather4 and JPL concepts, locate information related to the JPL Mission and JPL atmospheric reports, but these relations are not easily noticeable. In Fig. 14, two additional features were added:
• Viking: the name of the previous mission to Mars
• Lunar: to collect all the remaining NASA information
The new map did not become a clearer or better representation of the information space. We can see that most of the information still remains within the (JPL – Mission – Weather4) region, but is now dominated by the Viking feature, which pulls all the surrounding information. The PathFinder feature retained its members, while minimally influencing most answers located in the (JPL – Mission – Weather4 – Viking) area. In Fig. 15, the generated map includes composed nodes, of which several were automatically created:
• JPL and Weather4
• JPL and Viking
• JPL and PathFinder
• Weather4 and PathFinder
Topologically, the geometric shape induced by the information is a triangle, JPL – Weather4 – PathFinder, with an additional edge, JPL – Viking. The node placement in this map, induced by the composed nodes and the distribution of information in the answers, spreads the answers in a more intuitive manner:
1. Most of the information is located in the (JPL – Weather4 – PathFinder) triangle, and is only slightly affected by the Mission concept. Checking the information in the triangle, we can see that the document point placement is intuitive, i.e., information close to one of the vertices is more related to its concept.
2. The Viking concept is moved to the outskirts of the viewing frame, together with the document points that are strongly related to it (actually only one document point). More document points are near the Viking and JPL composed node.
3. Mission, Lunar, and Land_Site are kept on the outskirts of the viewing frame, affecting a few document points between them and the triangle mentioned above. Examining these documents, we found quotes describing the mission or the "land site", although it is obvious that these documents are not in the mainstream of the information presented.
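The group features used in these examples (an explicit word list such as 'Robin', or a stem-based group such as Mission) can be sketched as follows. The sketch assumes simple case-insensitive prefix matching for stems and an occurrence count as the relevance of a document to a group; both function names and both choices are illustrative assumptions, not a description of the prototype's code.

```python
def stem_group(vocabulary, stem):
    """Collect every word in vocabulary that starts with stem (case-insensitive)."""
    s = stem.lower()
    return sorted(w for w in vocabulary if w.lower().startswith(s))


def group_relevance(doc_words, members):
    """Count the words in doc_words that match any member of the group."""
    wanted = {m.lower() for m in members}
    return sum(1 for w in doc_words if w.lower() in wanted)


# Hypothetical vocabulary extracted from a result set.
vocabulary = ["Mission", "Missions", "Mars", "Site", "Sites"]
mission_group = stem_group(vocabulary, "Mission")

# An explicit group, as in the 'Robin' feature of the Pooh example.
robin_score = group_relevance("Christopher Robin met Pooh".split(), ["Robin", "Christopher"])
```

Plain prefix matching is crude: a stem such as Site would also capture unrelated words that merely begin with the same letters, so a real stemmer would be a safer choice.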
7 Related work
The field of information visualization has been explored by many researchers in recent years. Work in neural networks (Kohonen 1995), information retrieval (Hearst 1996), and scientific and information visualization (Ankerst et al. 1996; Inselberg and Dimsdale 1990; Keim and Kriegel 1995) provides a wide range of solutions designed to alleviate problems in displaying higher dimensions.
Neural network-based systems, designed to learn the characteristics of n-dimensional space and to display a map of relations between the objects in the space, were introduced by Kohonen (1995). Later work on this system introduced updates and an evaluation of the method, showing some capability for associative thought, but still not representing the space in a comprehensible and intuitive way (Chen et al. 1995). Generic scientific visualizations such as parallel coordinates (Inselberg and Dimsdale 1990) and pixel representations (Keim and Kriegel 1995), used in the information retrieval field, also provide a visualization that is appropriate to data mining. They allow the researcher to locate data behavior but, again, lose the intuitive relations between space and objects (Hearst 1996).
The intuitive approach described in this paper is a variation of Venn diagrams. These were used in several previous systems such as VIBE (Korfhage 1991), Lyberworld (Hemmje 1994), Cougar (Hearst 1996), Infocrystal (Spoerry 1993), Idm (Zizi 1995), and others described by Card (1996). We now briefly describe these systems and explain how our work differs from them.
VIBE is a tool for displaying high-dimensional document representations, where query terms apply gravity to the relevant documents. As mentioned by Hemmje (1994), this creates an inconsistency problem when the query has more than three terms. To minimize the inconsistencies, the system allows the user to change the location of the terms, thus regenerating the resulting space projection. Lyberworld (Hemmje 1994) extended the VIBE method by transforming the space into a relevance sphere and allowing more freedom for term placement, thus lowering the possibility of inconsistency in the positions of items. Another extension of VIBE, VR-VIBE (Benford et al. 1995), reduces the VIBE view inconsistencies by allowing the user to place the terms in three dimensions, thus creating a 3D space.
Cougar is a computer-human interface based on Venn diagrams, which was developed to enable users to assess document similarity. The interface is constrained to only a "three-ring" view (eight different topological regions), and higher dimensions are expressed by a hierarchical set of diagrams. However, this interface is not used to display the data themselves, but rather the collection of sets of data.
The Infocrystal system (Spoerry 1993; Hearst 1996) is a sophisticated interface that allows more than three dimensions of document space to be shown on the basis of a placement scheme similar to Venn diagrams. The interface also extends this method by using shapes indicating the influences on each of the entities being displayed.
Later work within the SPIRE project developed two methods, called Galaxies and Themescapes (Wise and Thomas 1995). The first displays documents in their information space as stars, located according to their relevance to a set of terms.
The second displays the documents in a 2.5D (spatial, terrain-like) map built with certain themes as gravity anchors in Themescapes. Approaching the same problem from its clustering perspective, and using cluster analysis tools and levels of detail to reduce the amount of information displayed, we can organize it to generate similar maps. However, in many of these cases, no explanation of the reasons for the placement can be given to the system users, as Assa (1998) describes.
Our solution differs from those mentioned in expanding the Venn diagram scheme to accommodate n-dimensional data, exploring the data to be visualized, and providing an optimal solution for that space by placing composed features. Furthermore, placing the terms by simulated annealing instead of by human intervention allows a better placement to be created. More navigational aids that assist the user's comprehension and understanding of information have been examined both in the IR field and by Web researchers (Card 1996). The integration of other visualization paradigms, such as hyperbolic projection, dynamic zoom, and level-of-detail control mechanisms, is discussed by Assa (1998).
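The inconsistency noted for gravity-based placement with more than three terms can be made concrete with a tiny worked example. It assumes that a document is drawn at the weighted centroid of the term anchors, which is a simplification of VIBE-style placement; the coordinates and weights are hypothetical. With four anchors in the plane, different relevance profiles can collapse onto the same point.

```python
def centroid(weights, anchors):
    """Weighted centroid of 2D anchor points."""
    total = sum(weights)
    x = sum(w * a[0] for w, a in zip(weights, anchors)) / total
    y = sum(w * a[1] for w, a in zip(weights, anchors)) / total
    return (x, y)


# Four query terms placed on the corners of the unit square.
anchors = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]

# One document is equally relevant to all four terms,
# another only to the two terms on opposite corners.
doc_all_terms = centroid([1, 1, 1, 1], anchors)
doc_two_terms = centroid([1, 0, 1, 0], anchors)
```

Both documents land at the center of the square even though their relevance profiles differ. With three non-collinear anchors this cannot happen for normalized weights, which is why the problem appears only beyond three terms, and why an explicit composed node is one way to separate such cases.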
8 Conclusion
In this paper we introduced relevance maps, a visualization paradigm to assist in analyzing the relevance of objects to various features that are likely to be of interest to the user, and a prototype, named RMAP, built to test this paradigm as an Internet navigational tool in a real-life environment. A key observation is that the exact relevance measures are not significant for the type of information retrieval applications we are dealing with, and an approximation suffices. Thus accuracy can be traded for a clearer display. This is used to improve the display by refining the set of features and introducing the notion of composed features. An optimized layout for the map is then obtained with simulated annealing. The RMAP prototype system enables users to query standard index servers, interact with the map, adjust the layout (if needed) with finer or coarser feature granularity, and access the relevant data. Better methods of extracting the features from the data and of refining the selection to get a better display are of major importance for the success of the system. We plan to investigate and experiment with more alternatives and test their adequacy for the system. Furthermore, we intend to explore the advantages of using composed nodes, as well as to examine and improve the placement algorithm, in order to show complex feature relations accurately.
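The simulated annealing step can be sketched in a few lines. The cost function below, which penalizes the squared difference between the actual and a desired distance for each pair of feature nodes, is an assumption chosen for brevity, as are the function names, the cooling schedule, and the parameter values; the cost terms used by RMAP are not reproduced here.

```python
import math
import random


def layout_cost(pos, target):
    """Sum of squared differences between actual and desired node distances."""
    cost = 0.0
    for (a, b), want in target.items():
        have = math.dist(pos[a], pos[b])
        cost += (have - want) ** 2
    return cost


def anneal_layout(nodes, target, steps=5000, t0=1.0, cooling=0.999, seed=0):
    """Place nodes in 2D by simulated annealing; returns (positions, cost)."""
    rng = random.Random(seed)
    pos = {n: (rng.random(), rng.random()) for n in nodes}
    cost = layout_cost(pos, target)
    best, best_cost = dict(pos), cost
    t = t0
    for _ in range(steps):
        n = rng.choice(nodes)
        old = pos[n]
        # Propose a random move whose size shrinks with the temperature.
        pos[n] = (old[0] + rng.uniform(-t, t), old[1] + rng.uniform(-t, t))
        new_cost = layout_cost(pos, target)
        # Always accept improvements; accept worse layouts with a
        # probability that decreases as the temperature drops.
        if new_cost <= cost or rng.random() < math.exp((cost - new_cost) / t):
            cost = new_cost
            if cost < best_cost:
                best, best_cost = dict(pos), cost
        else:
            pos[n] = old
        t = max(t * cooling, 1e-6)
    return best, best_cost


# Hypothetical input: three features that should be equally far apart.
features = ["JPL", "Weather4", "PathFinder"]
wanted = {("JPL", "Weather4"): 1.0, ("JPL", "PathFinder"): 1.0, ("Weather4", "PathFinder"): 1.0}
layout, final_cost = anneal_layout(features, wanted)
```

Because the best layout seen so far is kept, the returned cost is never worse than that of the random starting layout; occasionally accepting worse moves early on is what lets the search escape poor local arrangements.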
Working in cooperation with other Internet visualization tools, such as those of Andrews (1995), Ayers and Stasko (1995), Collaud et al. (1996), Consens et al. (1994), Doemel (1994), Gaines and Shaw (1995), Hendley et al. (1995), Mukherjea and Foley (1995), and Robertson et al. (1993), this prototype can form part of a working set of tools that combines these techniques to give a better representation of the information space.
References
Andrews K (1995) Visualising cyberspace: information visualization in the Harmony Internet browser. In: Gershon N, Eick SG (eds) Proceedings'95 Information Visualization, Atlanta, IEEE Press, 1995, pp 97–104
Ankerst M, Keim D, Kriegel H (1996) Circle segments: a technique for visually exploring large multidimensional data sets. In: Yagel R, Nielson GM (eds) Proceedings of Visualization'96, IEEE Press, 1996
Assa J (1998) RMAP – a system for visualization of information relevance maps. Master's Thesis, Tel Aviv University, Tel Aviv, Israel
Assa J, Cohen-Or D, Milo T (1997) Displaying data in multidimensional relevance space with 2D visualization maps. In: Yagel R, Hagen H (eds) Proceedings of Visualization'97, Phoenix AZ, IEEE Press, pp 127–135
Ayers E, Stasko J (1995) Using graphic history in browsing the World Wide Web. In: Proceedings of the 4th International World Wide Web Conference, Boston. http://www.w3.org/ pub/-Conferences/WWW4/Papers2/270
Benford S, Snowdon D, Greenhalgh C, Ingram R, Knox I, Brown C (1995) VR-VIBE: A virtual environment for cooperative information retrieval. In: Post F, Göbel M (eds) Proceedings of Eurographics'95, Blackwell Publishers, Maastricht, The Netherlands, pp 349–360
Card S (1996) Visualizing retrieved information: a survey. In: IEEE Comput Graph Appl, pp 63–66
Chen H, Schuffels C, Orwig R (1995) Internet categorization and search: a self-organizing approach. J Visual Commun Image Representation 7:88–102
Collaud G, Dill J, Jones C, Tan P (1996) The continuously zoomed web – a graphical navigation aid for WWW. In: Yagel R, Nielson GM (eds) Proceedings of Visualization'96, Phoenix AZ, IEEE Press
Consens M, Eigler F, Hasan M, Mendelzon A, Noik E, Ryman A, Vista D (1994) Architecture and applications of the Hy+ visualization system. IBM Syst J 33:458–476
Cutting D, Karger D, Pedersen J, Tukey J (1992) Scatter/gather: a cluster-based approach to browsing large document collections. In: Proceedings of ACM SIGIR'92, pp 318–329
Davidson R, Harel D (1996) Drawing graphs nicely using simulated annealing. ACM Trans Graph 15:301–331
Digital Equipment Corporation (1996) AltaVista: main page. http://-altavista.digital.com
Doemel P (1994) WebMap – a graphical hypertext navigation tool. In: Proceedings of the 2nd International World Wide Web Conference, Chicago. http://www.ncsa.uiuc.edu/SDG/ IT94/Proceedings/-Searching/doemel/www-fall94.html
Gaines B, Shaw M (1995) WebMap: concept mapping on the Web. In: Proceedings of the 4th International World Wide Web Conference, Boston. http://www.w3.org/pub/Conferences/WWW4/-Papers/134
Hasan M, Golovchinsky G, Noik E, Charoenkitkarn N, Chignell M, Mendelzon A, Modjeska D (1995) Visual web surfing with hy+. In: Lyons KA, Wilson GV, Wa E (eds) Proceedings of CASCON'95, Toronto, IBM Canada. ftp://-db.toronto.edu/pub/papers/cascon95-multisurf.ps.Z
Hearst M (1994) Using categories to provide context for fulltext retrieval results. In: Proceedings of the RIAO'94, New York, pp 115–130
Hemmje M (1994) Lyberworld – a visualization user interface with full text retrieval. In: Croft WB, Van Rijsbergen CJ (eds) Proceedings of SIGIR'94, Springer Verlag, Dublin, Ireland, pp 249–257
Hendley R, Drew N, Wood A, Beale R (1995) Narcissus: visualizing information. In: Proceedings'95 Information Visualization, Atlanta, pp 90–96
Excite (1996) Excite main page. http://www.excite.com
Inselberg A, Dimsdale B (1990) Parallel coordinates: a tool for visualizing multidimensional geometry. In: Proceedings of Visualization'90, IEEE Press, pp 92–103
Keim D, Kriegel H (1995) Visdb: a system for visualizing large databases. In: ACM SIGMOD International Conference on Management of Data, San Jose
Keim D, Kriegel H (1996) Visualization technique for mining large databases. J Comput Graph Stats 8:923–938
Kohonen T (1995) The self-organizing maps. In: Proceedings of the IEEE Symposium on Neural Networks'95. 78:1464–1480
Korfhage R (1991) To see or not to see – is that the query? In: Bookstein A, Chiaramella Y, Salton G (eds) Proceedings of the 14th Annual International ACM/SIGIR Conference, Chicago, pp 134–141
Larson R (1992) Evaluation of retrieval techniques in an experimental online catalog. JASIS'92, Wiley J, Sons C 43:34–53
Mendelzon A, Mihaila G, Milo T (1996) Querying the world wide. In: Proceedings of PDIS'96, pp 80–91
Mukherjea S, Foley J (1995) Visualizing the World Wide Web with the Navigational View Builder. Comput Networks ISDN Syst 27:1075–1087
Neves F Das (1997) The aleph: a tool to spatially represent user knowledge about the WWW docuverse. Proceedings of ACM Hypertext'97
Padmini S (1992) Information retrieval: data structures and algorithms. Prentice Hall, pp 12–31
Robertson G, Card S, Roberts C (1993) Information visualization using 3D interactive animation. Commun ACM 36:57–71
Ruskey F (1996) Venn diagrams. Electronic J Combinatorics 1: Nr. 3
Salton G (1989) Automatic text processing. Addison-Wesley, pp 45–89
Salton G, McGill M (1983) Introduction to modern information retrieval. McGraw Hill, pp 30–71
Sarkar M, Brown M (1994) Graphical fisheye views. Commun ACM 37:73–84
Shmueli O, Konopnicki D (1995) W3QS: a query system for the World Wide Web. In: Dayal U, Gray PMD, Nishio S (eds) Proceedings of VLDB'95, pp 54–65
Spoerry A (1993) Infocrystal: a visual tool for information retrieval & management. In: Proceedings of information knowledge and management'93, Washington DC
Sprenger TC, Gross MH, Eggenberger A, Kaufmann M (1997) A framework for physically based information visualization. In: Molnar S, Schneider B-O (eds) Proceedings of Eurographics Workshop on Visualization'97, ACM Press, pp 87–97
Wise J, Thomas J (1995) Visualizing the nonvisual: spatial analysis and interaction with information from text documents. IEEE Symposium on Information Visualization'95, pp 51–58
Zizi M (1995) Providing maps to support the early stage of design of hypermedia systems. In: Fraïssé S, Garzotto F, Isakowitz T, Nanard J, Nanard M (eds) Proceedings of IWHD'95 International Workshop on Hypermedia Design, Springer, pp 92–104
Yahoo (1996) Yahoo main page. http://www.yahoo.com
Zizi M, Pediotakis N (1996) Visual relevance analysis. Proceedings of Digital Library'96. ACM pp 63–71
TOVA MILO received her PhD degree in Computer Science from the Hebrew University, Jerusalem, in 1992. After graduating she visited the INRIA research institute in Paris for 6 months, holding the Chateaubriand Postdoctoral Fellowship. During 1993–1994 she was a Postdoctoral Fellow in the database group of the University of Toronto. In 1995 she joined the Department of Computer Science at Tel Aviv University. She has served on the program committees of many international conferences and has received grants from the Israel Science Foundation, from the Israeli Ministry of Science, and from the French Ministry of Science. Her research interests include mainly advanced database applications such as data integration, object-oriented and semistructured information, web-based applications, and the interaction between textual information and databases, focusing on both theoretical and practical aspects.
DANIEL COHEN-OR has been a senior lecturer at the Department of Computer Science since 1995. He received a BSc cum laude in both Mathematics and Computer Science (1985), an MSc cum laude in Computer Science (1986) from Ben-Gurion University, and a PhD from the Department of Computer Science (1991) at State University of New York at Stony Brook. He was a lecturer at the Department of Mathematics and Computer Science of Ben-Gurion University from 1992 to 1995. He is on the editorial advisory board of the international Computers and Graphics journal. He is the Program Co-chair of the symposium on volume visualization to be held in the year 2000. He is a member of the program committees of several international conferences on visualization and computer graphics, including IEEE Visualization and Eurographics. From 1996 to 1998 he served as the Chairman of the Central Israel SIGGRAPH Chapter. Dr. Cohen-Or has a rich record of industrial collaboration. In 1992–93 he developed a real-time flythrough with Tiltan Ltd. and IBM Israel for the Israeli Air Force. During 1994–95 he worked on the development of a new parallel architecture at Terra Ltd. In 1996–1997 he worked with MedSim Ltd. on the development of an ultrasound simulator. His research interests are in Computer Graphics, and include rendering techniques, client/server 3D graphics applications, real-time walkthroughs and flythroughs, volume graphics, and architectures and algorithms for voxel-based graphics.
JACKIE ASSA received his BSc degree in Computer Science and Mathematics from Tel-Aviv University in 1993, and an MSc in Computer Science from Tel-Aviv University in 1998. He designed and implemented software development projects in ATL (Advanced Technologies Limited) and in Orbotech's CAM department. Today he is a project manager in BackWeb Technologies Ltd, working on the enhancement of push technology. His main areas of interest are "knowledge representation and information visualization" and "application design and management".