Submitted to Proceedings of the IEEE Information Visualization Conference (InfoVis'04).

First Class Meta-Data Supporting Exploratory Analysis Mark Derthick Human-Computer Interaction Institute Carnegie Mellon University Pittsburgh, PA 15213 USA +1 412 268-8812 [email protected]

Abstract

If overview, zoom, filter, and details are sufficiently important to make them the mantra of Information Visualization [Shneiderman 1996], why aren't they first class objects in InfoVis systems? Why not specify what the overview is of, perform set operations on filters, and annotate details with usage information? While these have all been implemented as one-off features, this paper outlines techniques for persistent storage of session metadata, so that they emerge from the data architecture. The techniques rely on relational databases for their expressiveness and object-oriented database concepts for their direct correspondence to screen objects and operations. They support arbitrary specification of active datasets, set operations, derived attributes, hierarchical attributes, and reification. Much of the paper describes technical details, such as SQL and database indexes, necessary to build an efficient, expressive, and elegant semantic superstructure.

CR Categories: H.3.7 [Information Storage and Retrieval]: Digital Libraries---User issues.

General Terms: Algorithms, Human Factors.

Keywords: Meta-data, information foraging

1. Motivation

1.1 Beyond Foraging

It is only a slight caricature to say that current general purpose visualization systems display all and only the data represented in some table or view. While this view can be filtered, often there is only one current filtered subset. Information retrieval systems are also limited to considering one query at a time. The query may evolve through relevance feedback, and answers may be clustered into discrete groups, but there is no way to compare or combine the current set[s] with previous ones. Having a single dataset at a single time has been captured in the Information Patch model of the Information Foraging data exploration metaphor [Pirolli and Card 1995]. The assumption here is that analysts are like animals wandering through a habitat with discrete food patches scattered about. These differ in variety, abundance, and ease of extracting various kinds of food. While in a patch, it pays to eat high value

easily obtained food until the value/effort ratio declines sufficiently that it is worth the effort to find and travel to a new patch. In this model, neither the habitat as a whole nor the information patches need to be explicit, which simplifies users' data models. They merely navigate, and the world takes care of showing the correct view. Optimal foraging does require knowledge of average patch desirability. Information patches are more malleable and socially constructed than food patches. The cost of traveling to another patch can be made minuscule. Thus it may pay to make scouting expeditions to obtain metadata before settling in to examine data from any one patch. Just as hyenas watch vultures to obtain metadata about distant food patches, visualizations of information patches can give overviews of many alternatives simultaneously. Finally, the foraging metaphor does not consider that there may be many ways to concentrate quality information, reduce redundant or irrelevant information, and create desirable patches. Farmers put considerable effort into this, and use fertilizer and pest control.

1.2 Filtering with Scatter/Gather or Dynamic Query

The original Information Foraging paper [Pirolli and Card 1995] was illustrated with the Scatter/Gather (SG) information retrieval paradigm [Cutting et al. 1992] and an interface using it [Pirolli et al. 1996]. Query results are clustered to form disjoint information patches. Good patches are then re-clustered. This is a good fit for the foraging metaphor because patches are largely static, and merely subdivided. SG doesn't support merging diverse patches from multiple scatter operations and gathering more information relevant to their union, nor gathering data based on the difference between patches. Such non-unary operations require explicit references to patches. In domains with multiple types of entities, structured patch specifications may require explicit reference to datasets as well.
For instance, to recall contacts from a previous project with an outside company, one might query for the name of all contacts from the company who sent email in 1997. Linking the contacts to their email is an operation that requires more than subsetting a single homogeneous dataset. For structured data, Dynamic Query (DQ) [Ahlberg et al. 1992] is a widely used subsetting tool, somewhat analogous to scatter/gather in IR (a specific example of the similarity noted in [Belkin and Croft 1992]). DQ supports conjunctive constraints among attributes, each of which can be disjunctively constrained. For instance, it can pick out contacts affiliated with CMU or Pitt who are between 25 and 30 years old. However, DQ can't find stores in either the town where you live or the town where you work. It also can't compare the set of chain stores at one mall to another, using the set-difference operation. (But see [Young and Shneiderman 1993] for a system that combines DQ with set operations.)
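The conjunction-over-disjunction structure of DQ, and its limits, can be sketched in a few lines of Python. The contact records and attribute names here are made up for illustration; the point is that each attribute gets exactly one value set or range, and a record must satisfy all of them, so constraints that span two attributes (live-town or work-town) fall outside the form.

```python
# Sketch of Dynamic Query filtering: a conjunction of per-attribute
# constraints, each of which is internally disjunctive (a set of allowed
# values, or a continuous range). Records and attributes are hypothetical.

def dq_filter(records, nominal=None, ranges=None):
    """nominal: {attr: set of allowed values}; ranges: {attr: (min, max)}."""
    nominal = nominal or {}
    ranges = ranges or {}
    result = []
    for r in records:
        ok = all(r[a] in vals for a, vals in nominal.items())
        ok = ok and all(lo <= r[a] <= hi for a, (lo, hi) in ranges.items())
        if ok:
            result.append(r)
    return result

contacts = [
    {"name": "Ann", "affiliation": "CMU", "age": 27},
    {"name": "Bob", "affiliation": "Pitt", "age": 41},
    {"name": "Cat", "affiliation": "Pitt", "age": 29},
]

# "Affiliated with CMU or Pitt, aged 25-30" fits the form exactly.
young = dq_filter(contacts,
                  nominal={"affiliation": {"CMU", "Pitt"}},
                  ranges={"age": (25, 30)})
```

A cross-attribute disjunction would require a different predicate shape entirely, which is why the paper turns to explicit set operations on aggregates.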

Both SG and DQ are fast and effective for the most common cases. Rather than complicate them with more features, adding set operations on first class representations of data subsets (henceforth called aggregates rather than patches) increases the expressivity of the system as a whole. It is important to remember both the intension and extension of aggregates. For instance, a SG session that identifies interaction techniques for distortion-based visualizations from a 2004 corpus should be reusable on next year’s papers. On the other hand, the list of particular papers just found should be saved while the SG query is modified to examine some other interest. SG imposes a hierarchy on a corpus. In the ideal case, each split would be based on a meaningful attribute. For instance, a news article about baseball revenue sharing could be classified under finance and sports. Whether it appears in the business or sports section might depend on which aspect is more salient. Here there is a hierarchical relation between finance and business, baseball and sports, and an ad hoc one between sports and business. Run time hierarchy specification can be an important organizing tool. In summary, aggregates denoting intermediate results of user interaction and hierarchical relations defined by navigation paths users follow are potentially very powerful handles on data. Rich languages for characterizing these one- and two-place predicates and using them in set expressions are the plows and combines of data farming. This paper gives suggestive examples of characterizations and tools for expressive and efficient patch creation, leaving out complete algorithms for lack of space and because they are dependent on the particular exploratory operations supported by visualization systems. While implemented for our own system [Kolojejchick et al. 1997], the ideas here are meant to apply to any highly interactive and richly expressive data exploration system. 
1.3 Roadmap

Relational databases are the most common storage medium for large datasets, but are poorly suited to hierarchical data, or to algorithmically defined attributes and relationships in general. They only indirectly support the object-oriented paradigm, which is so natural for visualization systems. Run time data reorganization has not been a primary goal for databases. This paper advocates lightweight definition in infovis systems of new groups of objects, new attributes, and new hierarchies of their values. Together they support increased reuse of the interaction history and more expressive creation of desirable information patches. The next section discusses related work. After that, the techniques for defining these meta-data and their uses are presented, followed by possible extensions to the techniques. Last is a discussion of the value of this unified meta-data approach compared with the diversity of previous ideas.

2. Related Work

2.1 Capturing Interaction History

Data exploration is a constrained problem requiring few types of user operations. In ongoing research, we have identified

• Navigation (database join) [aggregate, relationship] → aggregate
• Aggregation [aggregate, attribute] → aggregate of aggregates
• DQ filtering [aggregate, attribute, value set] → aggregate
• Selection [aggregate, object] → aggregate
• Drag and Drop (union, difference) [aggregate] → aggregate
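The operation signatures above can be read as a small algebra over first-class aggregates. The following sketch renders that algebra as in-memory set operations over object IDs; the class and function names are mine, and the implementations are illustrative stand-ins for Visage's database-backed versions.

```python
# Sketch of the five exploration operations as functions over first-class
# aggregates. An Aggregate is just an immutable set of object IDs here;
# a relationship is a dict from source ID to the set of related IDs.

class Aggregate:
    def __init__(self, members):
        self.members = frozenset(members)

def navigate(agg, relationship):
    """[aggregate, relationship] -> aggregate (a database join)."""
    related = (relationship.get(m, set()) for m in agg.members)
    return Aggregate(set().union(*related))

def dq_filter(agg, attribute, value_set):
    """[aggregate, attribute, value set] -> aggregate."""
    return Aggregate(m for m in agg.members if attribute[m] in value_set)

def union(a, b):
    """Drag-and-drop union: [aggregate, aggregate] -> aggregate."""
    return Aggregate(a.members | b.members)

def difference(a, b):
    """Drag-and-drop difference: [aggregate, aggregate] -> aggregate."""
    return Aggregate(a.members - b.members)

# Hypothetical data: three calls, two callees.
calls = Aggregate({"c1", "c2", "c3"})
callee = {"c1": {"ann"}, "c2": {"bob"}, "c3": {"ann"}}
contacts = navigate(calls, callee)
```

Because every operation returns another Aggregate, results can be chained, stored, and combined, which is exactly the reuse this section argues for.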

In SG, scatter is like navigation and gather is like aggregation. Combining these two operations is so useful that Visage offers a single gesture (called recompose) for them that eliminates the need to display the potentially large intermediate result. We previously described the benefits of capturing interaction history for explanation and selective undo/redo [Derthick and Roth 2001], and for appliance creation [Derthick and Roth 2001]. This paper focuses on capturing the operands and results for ad hoc reuse. History-enriched objects [Hill and Hollan 1994] were the first use of interaction history, and were applied in domain-specific ways to multiple applications, including text editing [Hill et al. 1992], spreadsheets [Hill and Hollan 1994], and software maintenance [Eick et al. 1992]. While this paper is also domain specific to exploratory visual analysis, it advocates recording all user operations and their operands as first class meta-data independent of how they will be visualized, so is more generalizable than previous work. HITS [Hollan et al. 1991] also used a universal metadata approach, but it suffered from capturing a low level of granularity (e.g. mouse moves), storing all meta-data in a shared database, and consequent scalability problems. DataDesk [Velleman 1993] allows saving datasets created on the fly, data subsets defined by "selector variables", and visualizations. Creating new attributes (i.e. variables) to capture experiential data is very much in the spirit of this paper. However the ability to define datasets is rather limited, and selector variables are tied to datasets. Neither is as lightweight an operation as Visage's drag-and-drop gestures. DataDesk does support reusing an analysis by creating a template.

2.2 Database Technology

Database query rewriting has a long history, both as heuristics for human query writers to know and as part of query optimizers [Ullman 1989; Chakravarthy et al. 1990]. Many systems generate SQL automatically.
For instance, phpCodeGenie [Dosooye 2003] is an open source tool that generates PHP and SQL from a schema definition. There is little literature on doing this efficiently for a broad range of queries, however. Perhaps this is because the approaches have been either for simple record reads and updates, or system-specific. The approach used here is specific to Visage's operations (listed above). However these are similar to or include those of many other visualization systems, and are thus of wide enough interest to be shared through publication. Although XML and other object-oriented databases are the subject of much current research, they are not as mature as relational databases. The approach used here, mapping an object-oriented system onto a relational database, is still widely used, and common problems and solutions are well known [Ambler 2003]. Organizing entity types, attributes, and attribute values into hierarchies admits several solutions, as described in [Stoffel et al. 1998].

3. Meta-Data Techniques

The ideas described below are implemented in the Visage data exploration and visualization system, which is being developed by Maya Design Group and Carnegie Mellon University [Kolojejchick et al. 1997]. The Visage query language [Derthick et al. 1997], like the Design View of Microsoft Access, consists of a query graph plus constraints on attributes (see Figure 1). The graph specifies equi-joins on primary keys and foreign keys, and each node is labeled "node_1", "node_2", etc. Nodes represent entity type

Figure 1 Visage Visual Query Environment showing a navigation operation from 424 PhoneCalls to their 66 callees. Filtering out Contacts with an empty Company value using a DQ widget leaves 17 contacts from known companies. 72 of the calls were to these 17 contacts. The DQ widget on PhoneCall StartTime compares the distribution of these 72 calls (dark gray) to all 424 (light gray).

Figure 2 Scatterplot of Contacts' Company vs. PhoneCall's StartTime. Each mark denotes a pair.

tables, and links represent [binary] relationships. Relationships can be implemented as foreign keys in entity tables or as relationship tables with foreign keys for the domain and range. The node labels are arbitrary, except that isomorphic graphs are always labeled the same way. A DQ widget on an attribute of a node specifies a value restriction. For a continuous attribute, the restriction is a single continuous range specified with a min and max endpoint (or null for ±infinity). For discrete attributes, what the user sees is a list of values to filter on, which includes <null> if there are any null values, and <other> if there are too many (by default, more than 16) distinct values to display. Null is treated like any other value in Visage, even though it must be treated specially in SQL.

3.1 Universe/Result Aggregate Pairs

In order to chain operations together, with aggregates representing intermediate results, it must be possible to specify the universe to which a query is applied. To clarify the relationship between universe and result set, and to support part-whole comparison, Visage displays query graph nodes as Dynamic Aggregates (DA), which show the result set and universe cardinalities as a fraction [Derthick et al. 1997]. Figure 1 shows two such DAs linked by the callee relationship, forming a 2-node graph. The PhoneCall DA universe includes all 424 known phone calls. The Contacts universe was specified by navigation from the PhoneCalls, and represents those 66 contacts who were called. The result sets reflect both the navigation constraint and the DQ constraint that Contacts have a known Company. 72 of the 424 PhoneCalls were to some Contact with a known company, and 17 of the 66 Contacts have a known company. Thus Figure 1 depicts 4 first class aggregates that may be used in further exploration. They are used as examples below.

3.2 Threads

In order to retain a direct mapping between visual objects and data objects, Visage visualizations like Figure 2 that include attributes from multiple nodes display threads [Derthick et al. 1997]. A thread is a tuple of entities, one per query graph node. Threads have derived attributes like node_1_StartTime; that way, the scatterplot doesn't have to know so much about the underlying thread structure. It simply maps attribute values to graphical values like any visualization would.

3.3 Canonical Aggregation Language

SQL is an inconvenient language for manipulation because it cannot be put in canonical form. Semantically equivalent queries can take quite different syntactic forms. The advantage of the canonical language described here is that it corresponds closely to the Navigation, Sampling, and Dynamic Query operations supported by the interface, that it is often closed under set union

[{["CONTINUOUS", accessor, min1, max1, ...]        \\ intervals are sorted & do not intersect
  ["DISCRETE", accessor, [onValues], [offValues]]  \\ on, off do not intersect
  ["SAMPLE", primary key, modulus]                 \\ ignored for set operations
  ["CONSTRAINT", string]}]                         \\ not canonical!
"FALSE"

Figure 3 Canonical representation (plus the non-canonical CONSTRAINT clause) for non-structural constraints that contribute to the SQL Where clause. Each clause must be unique within its first two elements. That is, each accessor can have only one CONTINUOUS, DISCRETE, and SAMPLE clause, and CONSTRAINTs must be unique.
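One consequence of the Figure 3 invariants is that the union of two CONTINUOUS clauses on the same accessor can be normalized by merging their intervals into a sorted, non-intersecting list. A minimal sketch of that normalization (assuming finite endpoints, i.e. no null/infinity bounds, and using the flat min1, max1, ... clause layout of Figure 3):

```python
# Sketch of interval merging for the CONTINUOUS clause of Figure 3.
# Clauses are flat lists: ["CONTINUOUS", accessor, min1, max1, ...].
# Assumes finite numeric endpoints; null (unbounded) endpoints would
# need special-casing not shown here.

def merge_intervals(pairs):
    """Union of (lo, hi) intervals -> sorted, non-overlapping tuples."""
    merged = []
    for lo, hi in sorted(pairs):
        if merged and lo <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], hi)  # overlaps: extend
        else:
            merged.append([lo, hi])                 # disjoint: new interval
    return [tuple(iv) for iv in merged]

def union_continuous(a, b):
    """Union two CONTINUOUS clauses on the same accessor, canonically."""
    assert a[0] == b[0] == "CONTINUOUS" and a[1] == b[1]
    nums = a[2:] + b[2:]
    pairs = list(zip(nums[0::2], nums[1::2]))       # re-pair the flat list
    flat = [x for iv in merge_intervals(pairs) for x in iv]
    return ["CONTINUOUS", a[1]] + flat
```

Overlapping inputs collapse to one interval; disjoint inputs stay as two, which is why the text notes the union "is either one or two min/max intervals".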

Select * From <tables>, (Select * From <tables> Where <constraints>) Where <constraints>

becomes

Select * From <tables>, <tables> Where (<constraints>) And (<constraints>)

Graph: node_1_PhoneCalls
Constraints: [[CONTINUOUS, StartTime, 5-Nov-2000, null]]

plus

Graph: node_1_PhoneCalls
Constraints: [[CONTINUOUS, Duration, 300, null]]

gives

Graph: node_1_PhoneCalls
Constraints: [[CONSTRAINT, StartTime > 5-Nov-2000 Or Duration > 300]]

Figure 4 Unioning PhoneCalls made after 5 November 2000 with those that lasted more than five minutes can't be represented canonically.

and difference, and that it can be easily translated into SQL. To represent a DQ filter on a nominal attribute (like Company) there are two cases, depending on whether <other> is selected or filtered. If <other> is selected, as in Figure 1, the other selected values are treated like <other>, and the filter must remove the set of deselected values (called offValues in Figure 3). Otherwise, the filtered values are treated like <other>, and the filter must only accept the set of selected values (onValues in Figure 3). Visage also supports sliders that take samples of aggregates. This supports two tasks. First, a large dataset may be sampled to speed up exploration. Second, EDA can be performed on one subset of the data to form hypotheses, and then rigorous testing can be applied to the complement of the sample. Samples should ideally be chosen from a uniform distribution to avoid spurious patterns, but also should be reproducible so that an analysis session is consistent and so the complement can be used for validation. Visage generates database keys randomly when possible, so that mod operations (i.e. taking every nth row as a sample) generate samples that are uncorrelated with the data. They are also reproducible, of course. Confronted with a legacy database, this approach risks spurious patterns. In that case, mod degrades more gracefully than simple inequality. Figure 3 shows the syntax for constraints due to continuous or
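The randomly-assigned-keys-plus-mod sampling scheme described above can be sketched as follows. The function names are mine, and a fixed seed stands in for whatever key-assignment Visage actually performs; the properties demonstrated are the ones the text claims, namely reproducibility and a usable complement.

```python
# Sketch of Visage-style sampling: keys are assigned randomly at load
# time, so taking rows whose key is divisible by a modulus yields a
# reproducible sample uncorrelated with the data, and the complement
# can later be used to validate hypotheses formed on the sample.
import random

def assign_random_keys(rows, seed=0):
    rng = random.Random(seed)          # fixed seed -> reproducible keys
    keys = list(range(len(rows)))
    rng.shuffle(keys)
    return dict(zip(keys, rows))

def mod_sample(table, modulus):
    """The SAMPLE clause's `mod(key, m) = 0` predicate."""
    return {k: v for k, v in table.items() if k % modulus == 0}

def complement(table, modulus):
    return {k: v for k, v in table.items() if k % modulus != 0}

table = assign_random_keys([f"row{i}" for i in range(100)])
sample = mod_sample(table, 10)
rest = complement(table, 10)
```

With a legacy database whose keys correlate with the data, the same mod predicate still runs, which is the sense in which mod "degrades more gracefully" than a key inequality would.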

Figure 6 Schematic example of rewriting the From clause to use only tables. The tables and clauses must also be standardized apart using aliases.

discrete DQ and sampling. The syntax is more general than the descriptions above to account for the results of set union and difference operations. For instance, the union of two continuous DQ constraints on a single accessor is either one or two min/max intervals. Unfortunately, the language is not closed under set union or difference. If the result aggregate is not representable, the CONSTRAINT constraint can be filled with arbitrary SQL (see Figures 4 and 5).

3.4 Parsing SQL into Canonical Representation

It is possible to decompose any SQL query into a [possibly non-connected] graph plus CONSTRAINTs over that graph. First the From clause[s] are rewritten so they only contain tables (see Figure 6), and they are aliased as node_1, etc. Then, the Where clause[s] can be stored as a CONSTRAINT (converting Unions and Minuses to Or and And Not). Alternatively, to benefit from more canonicality, the Where clause can be put into conjunctive normal form. Each conjunct is written as a CONSTRAINT, or if possible, as a key equi-join link or one of the canonical constraint forms. Other clauses are dropped. Project is not necessary, as the visualization system can extract the values it needs. Sorting is done explicitly with widgets in visualizations. Group By is specified with a pseudo-relationship link to an aggregation node, and Having is specified with a DQ widget on this node. These last two constructs are beyond the scope of this paper.

3.5 Generating SQL from Canonical Representation

SQL generation occurs in stages. Aggregates have several attributes that incorporate layers of constraints. The most general layer incorporates information that is static (e.g. that the PhoneCall

Graph: node_1_PhoneCalls
Constraints: 0

minus

Graph: node_1_PhoneCalls • callee • node_2_Contacts
Constraints: [[DISCRETE, node_2_Contacts.Company, 0, ['']]]

gives

Graph: node_1_PhoneCalls
Constraints: [[CONSTRAINT,
  (NOT EXISTS
    (Select * From Contacts node5_2_Contacts, callee node5_1_callee_node5_2
     WHERE (node5_2_Contacts.Company NOT IN (''))
       AND node_1_PhoneCalls.ID = node5_1_callee_node5_2.Domain
       AND node5_2_Contacts.ID = node5_1_callee_node5_2.Range))]]

Figure 5 Example set operation on aggregates with differing query graphs: Dragging the numerator 72 out of the DA in Figure 1 leaves as the new universe the set difference of "PhoneCalls to Contacts with a known Company" from all PhoneCalls. It will have cardinality 424 − 72 = 352. The empty set is represented as 0, and the singleton set containing the empty string is represented ['']. The 'node5_' prefix results from the standardize-apart operation. The result requires the opaque CONSTRAINT clause.

Select * From <tables>, (Select * From <tables> Where <constraints>) Where <constraints>

becomes

Select * From <tables>, <tables> Where (<constraints>) And (<constraints>)

Figure 7 Schematic example of rewriting the From clause to use only tables. The tables and clauses must also be standardized apart using aliases.

DA universe contains all calls) or chained from another aggregate (e.g. the Contact DA universe, which depends on the PhoneCall DA). On top of that are constraints expressed by links from other aggregates, representing inequality joins, dynamic set-difference links, and dynamic aggregation links. All of these are beyond the scope of this paper. Finally, local DQ constraints are added. This layering reflects the operation latency hierarchy, where DQ must be fast, while chained updates are expected to be slow. The query graph affects the From and Where clauses. Each node table is listed in the From clause and aliased as the node name. Any relationship tables are also listed, aliased as the ordered pair of related nodes combined with the relationship name. Constraints implementing the joins contribute to the Where clause. For instance, the two node graph Person→parent→Person would produce

From Person node_1_Person, Person node_2_Person, parent node_1_parent_node_2
Where node_1.ID = node_1_parent_node_2.domain
  and node_2.ID = node_1_parent_node_2.range

The five types of constraints listed in Figure 3 also contribute to the Where clause. FALSE generates

Where 0 = 1

Interestingly, it is important to retain the structural constraints even in this case where they are semantically redundant. A Select query returns the empty set much more quickly with the structural constraints, at least in Oracle 8i. A CONTINUOUS constraint generates

Where ((<accessor> > <min> and <accessor> < <max>) or ...)

If a min or max is null, no inequality is generated for it. A DISCRETE constraint generates

Where (<accessor> In <onValues> and <accessor> Not In <offValues>)

A CONSTRAINT constraint generates

Where (<string>)

A SAMPLE constraint generates

Where mod(<primary key>, <modulus>) = 0

For example, the simplest of the aggregates, the 424 PhoneCalls, has a singleton graph, no constraints, and generates

SELECT *, 'PhoneCall_' & node_1.ID VUID FROM PhoneCall node_1

The projected attribute list is always of the form '*, VUID'. The visualization system can substitute a list of the desired attributes for the *, and it knows that the object key will be named VUID. This simplifies the implementation in the case where the visualization displays non-unary threads. There is one more wrinkle when structural (join) constraints are involved in the definition of an aggregate of entities. Consider the 17 selected Contacts. Since each contact may be phoned multiple times, the join may return multiple rows for him or her. Thus, the SQL must project the intermediate join result back down to attributes of Contacts, and duplicate rows must be eliminated. Further, it is inefficient (or illegal) to apply the DISTINCT keyword to LOB attributes such as long text strings or images. The conversion for the 17 Contacts is given in Figure 8. If callee were one-to-many, the DISTINCT check could be eliminated. If every contact were called, the structural constraint could be ignored entirely. Visage performs these checks for each relationship on startup so that it can optimize queries.

3.6 Read-Only DB Issues

A large database usually has many purposes, including availability and security, and storing personal exploration metadata inside the database does not further them. It may also compromise privacy and introduce scalability problems. Therefore we examine architectures in which user metadata is partitioned from the underlying data. The drawback is that metadata and base data can't be indexed together by the database for efficiency. To overcome this, we assume that the metadata will be small, and that queries involving both can be converted at run time into base terms.
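The clause-to-Where translation of Section 3.5 can be sketched as a small dispatch function. The clause shapes follow Figure 3; the exact SQL strings (spacing, keyword case) are illustrative, not Visage's verbatim output.

```python
# Sketch of Where-clause generation for the five constraint types of
# Figure 3. FALSE -> "0 = 1"; SAMPLE -> "mod(key, m) = 0"; CONTINUOUS
# emits one disjunct per interval, skipping null (unbounded) endpoints.

def where_fragment(clause):
    if clause == "FALSE":
        return "0 = 1"
    tag = clause[0]
    if tag == "CONTINUOUS":
        accessor, nums = clause[1], clause[2:]
        parts = []
        for lo, hi in zip(nums[0::2], nums[1::2]):
            conj = []
            if lo is not None:                 # null endpoint: no inequality
                conj.append(f"{accessor} > {lo}")
            if hi is not None:
                conj.append(f"{accessor} < {hi}")
            parts.append("(" + " And ".join(conj) + ")")
        return "(" + " Or ".join(parts) + ")"
    if tag == "DISCRETE":
        _, accessor, on, off = clause
        fmt = lambda vs: "(" + ", ".join(repr(v) for v in vs) + ")"
        return f"({accessor} In {fmt(on)} And {accessor} Not In {fmt(off)})"
    if tag == "SAMPLE":
        _, key, modulus = clause
        return f"mod({key}, {modulus}) = 0"
    if tag == "CONSTRAINT":
        return f"({clause[1]})"                # opaque SQL passes through
    raise ValueError(tag)
```

The structural (join) constraints and the VUID projection discussed above would be layered around these fragments by the full generator.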
This can be done either through temporary user tables in the database or externally. For instance, to find the most recently called Contact in an aggregate, the primary key[s] are extracted from the contact objects and explicitly listed in the SQL. The result set database indices are then converted back to object IDs using a hash table. A single hash table is used for all objects, which is the reason for the literal prefix ‘Contacts_’ in the query. (The ‘node_’ prefixes have been removed for clarity.)
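The single shared hash table described above, with literal type prefixes disambiguating keys from different entity tables, can be sketched as follows. The function names and example objects are hypothetical; the point is that one map serves all entity types because the prefix makes keys unique.

```python
# Sketch of the single object hash table: database keys from every
# entity type share one map, disambiguated by a type prefix such as
# 'Contacts_' (the same literal prefix concatenated into the Select
# list, so result rows arrive already prefixed).

object_table = {}   # e.g. "Contacts_17" -> imported object

def register(entity_type, db_key, obj):
    object_table[f"{entity_type}_{db_key}"] = obj

def resolve(result_rows):
    """Map rows of (prefixed_key, ...) back to in-memory objects."""
    return [object_table[row[0]] for row in result_rows]

register("Contacts", 17, {"name": "Ann"})
register("PhoneCall", 17, {"start": "5-Nov-2000"})  # same key, no clash

rows = [("Contacts_17", "2001-01-01")]              # simulated result set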

Graph: node_1_PhoneCalls • callee • node_2_Contacts Constraints: [[DISCRETE, node_2_Contacts.Company, 0, ['']]] generates Select * , 'Contacts_' & node_1.ID VUID From (Select * From (Select Distinct node15_2_Contacts.ID node15_2_Contacts_ID From PhoneCall node_1_PhoneCall, Contacts node15_2_Contacts, callee node5_1_callee_node5_2 Where (node15_2_Contacts.Company Not In('')) And node_1_PhoneCall.ID = node5_1_callee_node5_2.Domain And node15_2_Contacts.ID = node5_1_callee_node5_2.Range), Contacts node_2_Contacts Where node15_2_Contacts_ID = node_2_Contacts.ID) node_1 Figure 8 SQL for the selected 17 Contacts. The ‘node15’ prefix is from standardizing apart. Note that ‘Distinct’ is applied only to keys.

Select 'Contact_' & Contact.ID, PhoneCall.date
From Contact, PhoneCall, callee
Where PhoneCall.ID = callee.Domain
  and callee.Range = Contact.ID
  and Contact.ID In <ID list>
  and PhoneCall.StartTime =
      (Select max(PhoneCall.StartTime)
       From Contact, PhoneCall, callee
       Where PhoneCall.ID = callee.Domain
         and callee.Range = Contact.ID
         and Contact.ID In <ID list>)

The long ID lists make the queries hard for humans to read, but databases deal with such lists efficiently. Some databases limit the number of items in SQL lists, in which case they must be broken up:

...ID In <list1> or ID In <list2> or ...

The query length may also be limited, in which case the lists must first be stored in temporary tables in the database. We are assuming that the visualization system is not using an efficient out-of-memory object store, and that this imposes a limit on list lengths. This assumption is true with a vengeance in Visage. Before any queries are issued that will result in object importation, a COUNT query is first issued. If the count exceeds a threshold (usually 1000), the query is aborted and the user is informed that there is too much data to display individually.

3.7 Hierarchical Attributes

Users may want to drill down into data subsets along multiple attributes, as mentioned in Section 1.2. There are two ways to support this in a relational database. One always uses the same underlying attribute order, independent of the data. For instance, the Linnaean evolutionary tree nests attributes in the order [Kingdom, Phylum, Class, Order, Family, Genus, Species]. Less regular breakdowns occur more commonly. It may be useful to break down my email by folder. Within 'personal' a date breakdown might be most useful, while in 'ARL proposal' a breakdown by sender.company might make the most sense. These choices may be task-dependent. Therefore Visage supports runtime drill down on any nominal value in a DQ widget. Supporting drilldown on the DQ attribute would be useful for the Linnaean case.
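The list-splitting workaround for SQL list limits, described in Section 3.6, amounts to chunking the key list and joining the chunks with Or. A minimal sketch, with the cap as a parameter (some databases limit In lists to a fixed number of items):

```python
# Sketch of breaking a long key list into multiple In(...) disjuncts,
# producing the "ID In <list1> or ID In <list2> or ..." pattern from
# the text. The limit is a parameter; real databases impose their own.

def in_clauses(column, ids, limit=1000):
    chunks = [ids[i:i + limit] for i in range(0, len(ids), limit)]
    clause = " Or ".join(
        f"{column} In ({', '.join(str(i) for i in chunk)})" for chunk in chunks
    )
    return f"({clause})"

# A tiny limit makes the chunking visible:
sql = in_clauses("Contact.ID", list(range(5)), limit=2)
```

When the total query length is also capped, this rewriting no longer suffices and the temporary-table route mentioned in the text is needed instead.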
Once again, this functionality can be added with derived attributes, and the database interface doesn't have to know about the nesting. For instance, drilling down on Company = Maya in Figure 1 generates Figure 9. The DQ widget is now displaying an attribute Company_Maya_Design_Group_FullName, defined as

iif(NODEN.Company = 'Maya Design Group',
    NODEN.Company & '_NEST_' & NODEN.FullName,
    NODEN.Company)

NODEN is replaced with the correct 'node_' prefix for the query node which constrains the attribute (e.g. node_2_Contact in Figure 1). The _NEST_ separator tells the DQ widget how to lay out the histogram.

3.8 Relevance and Databases

SQL specifies binary result sets. A record either satisfies the query or not. Information retrieval relies on the non-binary notion of relevance. The most relevant set of documents is retrieved by a query. There is generally a quantitative measure of query

Figure 9 Detail from Figure 1 after drilling down to FullNames within Maya Design Group. In order to support meaningful addition of filled areas in the histogram, the Maya row now indicates the total number of contacts with a vertical line rather than a filled rectangle. Instead, the detail rows with indented text show filled areas that sum to that of Maya as seen in Figure 1. (The filled area is the same size for each detail row, since FullName always picks out a single Contact.)

relevance. One way to add IR capability to a database, therefore, would be to write

Select title From document Order By relevance(<query>, body) Desc

where relevance is a new SQL function that operates on pairs of strings. However computing relevance requires IR-style indexes and even so is rather slow. We have chosen to rely on an independent IR program to compute relevance, and cache the results so they can be combined with database queries. Conceptually, this is like adding an attribute to the document table for relevance-to-query001. Having the first-class attribute allows referring to multiple IR queries. For instance, we could write

Select title From document Where query001-relevance > query002-relevance

As usual, we refrain from modifying the relational database schema and cache values for the new attributes on explicit Visage objects. This assumes that few documents will be assigned a non-default relevance (see next section).

3.9 Truth Maintenance for Derived Attributes

In an object-oriented system, maintaining consistency of derived attribute values (often called truth maintenance (TMS) [Doyle 1979]) during updates is straightforward. When a value is accessed during a computation, a pointer to the derived value is added. If the value changes, the pointer is used to decache the derived value. In Visage, chains of derived values are often quite long, and this lazy evaluation does a good job of minimizing total computation.
Relational databases normally maintain dependencies at the view level rather than the record level, which makes it hard to minimize recomputation. Derived attributes can have two kinds of definitions in Visage. First, an attribute may have a SQL expression that is substituted for it in queries, as in the case of Company_Maya_FullName. This always requires recomputation. Second, it may be defined by a non-SQL expression like Kleene * or with an attached procedure. The database can't evaluate these, so they will have non-default values only for imported data objects. Relevance attributes are like this. The attached procedure always returns a non-negative number, and it ensures that the most relevant documents are imported. Unimported records are treated as having relevance = 0. The code that processes result sets returned by the database is given a list of attributes that aren't in the database, in addition to a

SQL query. If the database record has previously been imported, it extracts the value from the object and appends it to the results; otherwise it uses the default value.

3.10 Importing Concepts

When any entity is to be displayed in a visualization, Visage ensures that an object exists. At import time, all its attributes are read in. For threads, all the tuple elements are read in as well. The object's relationships are not imported with the attributes. When a query asks for related objects, the set is cached so relationships are also imported only once. This is not necessarily the best choice, particularly for large attribute values like images. For further discussion of synchronizing local information with a database, see [Cho and Garcia-Molina 2000].
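The record-level truth maintenance of Section 3.9 — record a dependency pointer when a value is read, follow it to decache when the value changes, recompute lazily on the next read — can be sketched with a small cell abstraction. The class and method names are mine, and the sketch assumes acyclic dependency chains.

```python
# Sketch of record-level TMS for derived attributes: reading a value
# during a derived computation records a dependency pointer; setting a
# base value decaches its dependents (transitively); recomputation is
# lazy, happening only when a decached value is next read.

class Cell:
    def __init__(self, value=None, compute=None):
        self.value, self.compute = value, compute
        self.dependents = set()          # cells whose cache read this one
        self.valid = compute is None     # base cells start valid

    def get(self, reader=None):
        if reader is not None:
            self.dependents.add(reader)  # record the dependency pointer
        if not self.valid:
            self.value = self.compute(self)  # lazy recomputation
            self.valid = True
        return self.value

    def set(self, value):
        self.value = value
        self.decache_dependents()

    def decache_dependents(self):
        for d in self.dependents:
            if d.valid:
                d.valid = False
                d.decache_dependents()   # follow the chain transitively

base = Cell(value=10)
derived = Cell(compute=lambda self: base.get(reader=self) * 2)
```

Long chains of derived values touch only the cells actually read again, which is the economy the text attributes to lazy evaluation.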

4. Future Work

The primary challenge in extending this work is scaling to multiple databases and managing meta-data over extended time periods. The separation proposed here between the expensive object representation and the large relational database at least provides a framework for approaching the problem. Each time a database is registered with Visage, the system extracts the schema, creates corresponding entity type, attribute, and relationship objects, and stores on them the source from which base-level data can be imported. If an entity type is found in multiple databases, Visage queries each one for data. This feature has never been used, however, and there is no mechanism for updating only a subset of the databases, for dealing with unavailable databases, or for the many other issues in federating databases (see [Sheth and Larson 1990]).

Eventually, the operation-level granularity of the interaction history will become overwhelming, and abstraction and forgetting will likely be necessary. A few approaches to this problem are discussed in [Derthick and Roth 2001], but much research remains to be done.
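The registration step described above, extracting a schema and building meta-objects from it, can be sketched against SQLite, whose catalog is queryable via sqlite_master and PRAGMA table_info. The register_database function and the shape of the meta-objects are invented for this sketch, not Visage's.

```python
import sqlite3

def register_database(conn):
    """Extract the schema and build meta-objects for entity types and attributes,
    recording the source from which base-level data can later be imported."""
    entity_types = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'").fetchall()
    for (table,) in tables:
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
        attrs = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
        entity_types[table] = {"source": conn, "attributes": attrs}
    return entity_types

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contact (id INTEGER, full_name TEXT, company TEXT)")
types = register_database(conn)
assert types["contact"]["attributes"] == ["id", "full_name", "company"]
```

Registering a second database would simply merge another call's result into the same dictionary; the federation issues noted in the text (stale or unavailable sources, partial updates) begin exactly at that merge step.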

5. Conclusion

Information patches are largely constructed, in contrast to food patches. Meta-data about previous operations, especially previous intermediate result aggregates, can help construct them using set operations. The intermediate results may derive from DQ, SG, or other operations common in visualization systems.

Derived attributes are a powerful and underutilized technique for capturing structure uncovered during exploration. They can often encapsulate that structure, retaining backward compatibility with existing software and/or simplifying implementation of diverse features. Examples include thread attributes that hide join structure, hierarchical attributes that hide the messiness of relational representations, and relevance attributes that capture query context. A TMS foundation supports declarative definitions and hides the often complicated dependencies among attributes. With judicious choice of "bookkeeping" attributes to cache intermediate results, a TMS framework is quite efficient.

Neither IR systems nor visualization systems have typically represented result sets as first class objects, and the consequence is poor support for extended exploration. Visage aggregates, derived attributes (including hierarchical and relevance attributes), and threads have definitions that support scaling beyond extensional representations. These meta-objects, and the relationships among them, support visualization and reuse of users' interaction history. Canonical representations keep database operations interpretable and efficient.

Unfortunately, many internal functions require two versions: one for data in the relational database, and one for explicit objects.

Previous DQ systems have been applied only to static datasets. Even in the case of Query Previews [Doan et al. 1996], a SQL expression is used only once, to read in all the data. Allowing reusable set operations on datasets and DQ-defined subsets requires maintaining the corresponding query definition. Doing this naively, by treating complete SQL queries as atomic and applying Union, Minus, or query nesting, quickly led to composite queries that took hours to execute or that exceeded input buffer constraints. Generating queries late, from a canonical language matched to the interface operations, has largely eliminated these problems, as observed during my use of the system while building domain-specific applications in public health analysis and digital video libraries. Debugging incorrect queries is also much easier.
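The difference between nesting opaque SQL strings and late query generation can be sketched as follows: filters are kept in a canonical form (here, just lists of per-attribute conditions), composed as objects, and translated into a single flat SQL query only when results are needed. The Subset class and its methods are invented names for this sketch; only intersection is shown, but Union and Minus would similarly combine canonical forms before any SQL is generated.

```python
import sqlite3

class Subset:
    """Canonical filter definition: a list of per-attribute conditions."""
    def __init__(self, clauses):
        self.clauses = clauses           # each clause is (sql_fragment, params)

    def intersect(self, other):
        # Set operations combine canonical forms, not SQL strings.
        return Subset(self.clauses + other.clauses)

    def to_sql(self, table):
        # SQL is generated late, from the canonical form, as one flat query.
        where = " AND ".join(frag for frag, _ in self.clauses) or "1"
        params = [p for _, ps in self.clauses for p in ps]
        return f"SELECT * FROM {table} WHERE {where}", params

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contact (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO contact VALUES (?, ?)",
                 [("Ada", 36), ("Alan", 41), ("Grace", 85)])
young = Subset([("age < ?", [50])])
a_names = Subset([("name LIKE ?", ["A%"])])
sql, params = young.intersect(a_names).to_sql("contact")
rows = conn.execute(sql, params).fetchall()
assert rows == [("Ada", 36), ("Alan", 41)]
```

However deeply filters are composed, the emitted query stays a single flat WHERE clause, avoiding the nesting depth and buffer-size problems described above.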

6. Acknowledgements

This work was supported by the Advanced Research and Development Activity (ARDA) under contract number MDA90402-C-0451, and through a DARPA STTR contract administered by the US Army to Maya Viz as contract number DAAH01-03-CR171 and subcontracted to CMU as contract A00784.

7. References

Ahlberg, C., Williamson, C. and Shneiderman, B. 1992. Dynamic Queries for Information Exploration: An Implementation and Evaluation. In Human Factors in Computing Systems (CHI). Monterey, CA. ACM Press, 619-626.

Ambler, S. W. 2003. The Fundamentals of Mapping Objects to Relational Databases. http://www.agiledata.org/essays/mappingObjects.html

Belkin, N. J. and Croft, W. B. 1992. Information filtering and information retrieval: two sides of the same coin? Commun. ACM 35, 12, 29-38. http://doi.acm.org/10.1145/138859.138861

Chakravarthy, U. S., Grant, J. and Minker, J. 1990. Logic-based approach to semantic query optimization. ACM Transactions on Database Systems (TODS) 15, 2, 162-207.

Cho, J. and Garcia-Molina, H. 2000. Synchronizing a Database to Improve Freshness. In Proceedings of 2000 ACM International Conference on Management of Data (SIGMOD). http://www-db.stanford.edu/~cho/papers/cho-synch.pdf

Cutting, D. R., Karger, D. R., Pedersen, J. O. and Tukey, J. W. 1992. Scatter/Gather: A cluster-based approach to browsing large document collections. In Proceedings of the SIGIR '92 Conference, 318-329.

Derthick, M., Kolojejchick, J. A. and Roth, S. 1997. An Interactive Visual Query Environment for Exploring Data. In Proceedings of the ACM Symposium on User Interface Software and Technology (UIST). Banff, Canada. ACM Press, 189-198. http://www.cs.cmu.edu/~sage/UIST97/UIST97.pdf

Derthick, M. and Roth, S. F. 2001. Enhancing Data Exploration with a Branching History of User Operations. Knowledge Based Systems 14, 1-2, 65-74. http://www.cs.cmu.edu/~sage/Papers/KBS/KBS.pdf

Derthick, M. and Roth, S. F. 2001. Example-based generation of custom data analysis appliances. In Proceedings of Intelligent User Interfaces (IUI '01). Santa Fe, NM, 60-67. http://www.cs.cmu.edu/~sage/Papers/IUI00/IUI00.pdf

Doan, K., Plaisant, C. and Shneiderman, B. 1996. Query Previews in Networked Information Systems. In Research and Technology Advances in Digital Libraries. Washington, DC. IEEE Computer Society Press, 120-129.

Dosooye, N. 2003. phpCodeGenie. Freshmeat. http://freshmeat.net/projects/phpcodegenie/

Doyle, J. 1979. A Truth Maintenance System. Artificial Intelligence 12, 231-272.

Eick, S. G., Steffen, J. L. and Sumner, E. E., Jr. 1992. Seesoft - A Tool for Visualizing Line Oriented Software Statistics. IEEE Trans. Softw. Eng. 18, 11, 957-968. http://dx.doi.org/10.1109/32.177365

Hill, W., Hollan, J., Wroblewski, D. and McCandless, T. 1992. Edit Wear and Read Wear. In Proceedings of CHI'92 Conference on Human Factors in Computing Systems. ACM Press, 3-9.

Hill, W. C. and Hollan, J. D. 1994. History-Enriched Digital Objects: Prototypes and Policy Issues. The Information Society 10, 2, 139-145.

Hollan, J., Rich, E., Hill, W., Wroblewski, D., Wilner, W., Wittenburg, K. and Grudin, J. 1991. An introduction to HITS: Human Interface Tool Suite. In Intelligent User Interfaces. ACM Press, 293-337.

Kolojejchick, J. A., Roth, S. F. and Lucas, P. 1997. Information Appliances and Tools in Visage. IEEE Computer Graphics and Applications 17, 4, 32-41. http://www.cs.cmu.edu/~sage/PDF/Appliances.pdf

Pirolli, P. and Card, S. K. 1995. Information Foraging in Information Access Environments. In ACM Conference on Human Factors in Computing Systems (CHI '95). Denver, CO, 51-58.

Pirolli, P., Schank, P., Hearst, M. A. and Diehl, C. 1996. Scatter/Gather Browsing Communicates the Topic Structure of a Very Large Text Collection. In Proceedings of the ACM SIGCHI Conference on Human Factors in Computing Systems (CHI). http://www.acm.org/sigs/sigchi/chi96/proceedings/papers/Pirolli/pp_txt.htm

Sheth, A. P. and Larson, J. A. 1990. Federated database systems for managing distributed, heterogeneous, and autonomous databases. ACM Computing Surveys 22, 3, 183-236.

Shneiderman, B. 1996. The eyes have it: A task by data type taxonomy for information visualizations. In Proceedings of the IEEE Symposium on Visual Languages. IEEE Computer Society Press, 336-343.

Stoffel, K., Davis, J. D., Rottman, G., Saltz, J., Dick, J., Merz, W. and Miller, R. 1998. A Graphical Tool for Ad Hoc Query Generation. In Proceedings of the American Medical Informatics Association Annual Symposium (AMIA), 503-507.

Ullman, J. D. 1989. Principles of Database and Knowledge-Base Systems, Vol. 1. Computer Science Press.

Velleman, P. F. 1993. Learning Data Analysis With Data Desk, Revised Edition. W.H. Freeman & Company.

Young, D. and Shneiderman, B. 1993. A graphical filter/flow representation of Boolean queries: a prototype implementation and evaluation. J. Am. Soc. Inf. Sci. 44, 6, 327-339.
