Condor: A Simple, Expressive Graphical DataBase Query Language∗ Joshua S. Hodas Robert M. Keller Ingo Muschenetz Jeffrey Polakow Amy R. Ward Will Ballard Department of Computer Science Harvey Mudd College Claremont, CA 91711
Abstract Condor is a graphical database query language which is at once intuitive and powerful. The language is based on the relational model, with queries corresponding closely to the relational calculi. The visual representation makes the traditionally difficult aspects of those languages, such as specifying join constraints, easy to understand. While intended for database use, the language allows the direct formulation of arbitrary first-order formulas, without the use of awkward equivalences and normal forms. It also provides a natural syntax for queries involving arbitrary transitive closures.
1
Introduction
The goal of the Condor¹ project is to develop a graphical database query language that is at once expressive and easy to learn and use. It is particularly important that it be usable by those with little or no background in databases after only a little training. While ease of use is difficult to quantify, three features in particular were considered crucial:
• Simple queries must have simple representations, the meanings of which are readily apparent.
• The system must be compositional: it should be possible to build up complex queries incrementally, and sub-graphs of a complex query must be understandable as sub-queries.
• The mapping between a query and a natural-language description of the query should be as straightforward as possible.
In support of these goals, the graphical model was made as simple as possible, with only a small set of graphical primitives. Further, the queries correspond to logical formulas, about which most users have a natural intuition, at least in the non-quantificational case. Finally, in order to simplify the entry of complex queries, the system includes a simple, direct representation of transitive closure, as well as a notion of macro in which an entire sub-query of arbitrary size and complexity can be reduced to a single visual element, to be used and reused in more complex queries.
∗ This project was supported by Jet Propulsion Laboratories.
¹ One target application of this system is data mining by intelligence operatives. The name of the project derives from the movie Three Days of the Condor.
Figure 1: The “flight” relationship, after being dragged to the workspace. (The relationship node has six radiating arcs, labeled departure time, arrival time, source, destination, flight number, and airline.)
The matter of expressivity is also a complicated one, as it depends on the target application. In the case of query languages, the language should be rich enough to express any query that the underlying database engine could handle. Condor is intended to be neutral with regards to the type of backend database to which it is applied. Therefore, in order to ensure that the system would be adequately expressive for all needs, we chose as a design criterion that the language have the ability to express any first-order predicate calculus formula, augmented with transitive closure. In practice, most queries in Condor will correspond closely to queries in the relational calculi. Condor can be best understood as a visual representation of those systems.
2
Core Model
2.1
Relationships and Attributes
The simplest queries in Condor look much like entity-relationship (E-R) diagrams [3]. However, where E-R diagrams distinguish between three types of objects (entities, relationships, and attributes), Condor, as with the relational model, removes the distinction between entities and relationships, calling both relationships (or relations). This correspondence with the relational model is intentional, as it is still the most obvious and natural one for most users. We contend that it is the query languages of relational databases, and in particular their textual forms, rather than the data model, that present the greater stumbling block for users.
A core Condor query is a graph with relationship and attribute nodes at the vertices. Relationship nodes are denoted by circles containing icons intended to connote the meaning of the relationship (it is up to the database administrator to choose the icon set). Alternatively, relationships may be denoted by text-labeled ovals. Attribute nodes are denoted by rectangles. Each edge of the graph connects one attribute to one relationship. In order for the query to be meaningful, every attribute must be connected to at least one relationship. The converse is not necessary, though in general it will hold. In fact, while the graph need not be connected, most meaningful query graphs will be.
A relationship is added to a query by dragging it into the workspace from a palette, or by selecting it from a pop-up menu in the workspace. When added to the workspace, the relationship appears with a number of radiating arcs corresponding to its arity. Each arc is labeled with the role that an attribute attached to that arc will play in the relationship. Figure 1 shows the “flight” relationship from an example database after it has been dragged to the workspace.
Each attribute attached to a relationship is labeled with either a constant value (in ordinary
text), or a variable name (in italic text). A constant value labeling an attribute indicates that the user wishes to know about instances of the relationship with that value for the corresponding role. Thus the simplest queries are those with no variables, which amount to yes/no questions: is there an instance of the relationship satisfying all the specified constraints on the roles? A variable labeling an attribute indicates a desire to know about all the values that the relationship takes on for that role, as constrained by the other attributes attached to the relationship. For example, Figure 2 asks for the flight numbers, if any, of United Airlines flights from New York to Paris leaving at 10 AM and arriving at 6 PM. When there is more than one variable-labeled attribute connected to a relationship the query is asking for the tuples of values for all the variables that simultaneously satisfy the query. Thus, Figure 3 asks for the airlines, flight numbers, and departure and arrival times of flights from New York to Paris. For reasons that will be discussed below, the variable names used to label attributes in a query must be unique.
Figure 2: What is the flight number of the United Airlines flight from New York to Paris leaving at 10 AM and arriving at 6 PM?
Figure 3: What are the airlines, flight numbers, and departure and arrival times of flights from New York to Paris?
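The queries of Figures 2 and 3 can be read as a filter on the constant-labeled roles followed by a report of the variable-labeled ones. The following is a minimal sketch of that reading, not the Condor implementation: the sample tuples, the `Var` class, and the `query` function are hypothetical names introduced only for illustration, and roles left out of a query are simply treated as unconstrained.

```python
# Hypothetical sample data; the role names follow Figure 1.
FLIGHTS = [
    {"airline": "United", "flight_number": 914, "source": "New York",
     "destination": "Paris", "departure_time": "10:00 AM", "arrival_time": "6:00 PM"},
    {"airline": "Air France", "flight_number": 23, "source": "New York",
     "destination": "Paris", "departure_time": "7:00 PM", "arrival_time": "3:00 AM"},
    {"airline": "United", "flight_number": 101, "source": "Los Angeles",
     "destination": "Tokyo", "departure_time": "9:00 AM", "arrival_time": "1:00 PM"},
]


class Var:
    """A variable label on an attribute node (italic text in a Condor query)."""

    def __init__(self, name):
        self.name = name


def query(relation, roles):
    """Evaluate a single-relationship query.

    `roles` maps a role name to a constant or to a Var. Constants filter the
    tuples; variables pick out the values that are reported.
    """
    results = []
    for row in relation:
        constants_match = all(
            row[role] == label
            for role, label in roles.items()
            if not isinstance(label, Var)
        )
        if constants_match:
            results.append(
                {label.name: row[role]
                 for role, label in roles.items()
                 if isinstance(label, Var)}
            )
    return results


# Figure 2: every role but the flight number carries a constant.
figure2 = query(FLIGHTS, {
    "airline": "United", "source": "New York", "destination": "Paris",
    "departure_time": "10:00 AM", "arrival_time": "6:00 PM",
    "flight_number": Var("Flight #"),
})

# A query with no variables is a yes/no question.
yes_no = bool(query(FLIGHTS, {"source": "New York", "destination": "Paris"}))
```

With the sample data above, `figure2` holds the single binding of `Flight #` to 914, and `yes_no` is true because at least one tuple satisfies the constants.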
2.2
Compound Queries
The Condor model extends straightforwardly to queries involving more than one relationship. Unless otherwise marked as discussed below, a query with multiple relationships is regarded as the conjunction of the sub-queries involving each of the individual relationships. If the query graph is not connected, then the conjunction is not particularly interesting, as the query amounts to a
Cartesian product. If, however, two or more relationships have edges leading to the same attribute node, then that node can be seen as a shared constraint between the two sub-queries. The most important, and common, case is when the shared attribute node is labeled with a variable. In that case the output will contain only those values for that variable which satisfy all of the connected relationships. This corresponds to forming the join over that attribute. The query in Figure 4 asks for the airlines, flight numbers, dates, and flight times of those flights from New York to Paris on which a passenger named John Traveler flew, as well as his seat numbers and fares on those flights.
In the last example, the linking of the two relationships on both of their common attributes corresponds to computing the natural join of the two relationships. Join operations are typically the most confusing aspect of the relational model for inexperienced users. Because the variables labeling attributes must be unique, the join constraint can only be drawn in one way, as multiple edges incident on an attribute node. This clear visual representation of the join constraint is perhaps Condor’s most important feature.
Figure 4: What are the airlines, flight numbers, dates, and flight times of those flights from New York to Paris on which a passenger named John Traveler flew, and what were his seat number and fare on those flights?
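One way to see why a shared attribute node acts as a join is to evaluate the query as a conjunction in which each variable must take the same value wherever it appears. The sketch below does this for a cut-down version of Figure 4 (the time roles are omitted for brevity). The sample rows, the `?` prefix used to mark variables, and the `conjunctive_query` function are assumptions made for illustration only.

```python
# Hypothetical sample data for the two relationships of Figure 4.
FLIGHTS = [
    {"airline": "United", "flight_number": 914,
     "source": "New York", "destination": "Paris"},
    {"airline": "Delta", "flight_number": 914,
     "source": "Boston", "destination": "Rome"},
]
BOOKINGS = [
    {"name": "John Traveler", "airline": "United", "flight_number": 914,
     "date": "1 May", "seat": "12A", "fare": 450},
    {"name": "John Traveler", "airline": "Delta", "flight_number": 914,
     "date": "9 May", "seat": "3C", "fare": 300},
]


def is_var(label):
    """In this sketch, variable labels are strings that start with '?'."""
    return isinstance(label, str) and label.startswith("?")


def conjunctive_query(atoms):
    """Evaluate a conjunction of relationships.

    `atoms` is a list of (relation, {role: constant or '?variable'}) pairs.
    Returns one dict of variable bindings per solution.
    """
    bindings = [{}]
    for relation, roles in atoms:
        extended = []
        for binding in bindings:
            for row in relation:
                candidate = dict(binding)
                for role, label in roles.items():
                    if is_var(label):
                        # A shared variable must agree across relationships.
                        if candidate.setdefault(label, row[role]) != row[role]:
                            break
                    elif row[role] != label:
                        break
                else:
                    extended.append(candidate)
        bindings = extended
    return bindings


figure4 = conjunctive_query([
    (FLIGHTS, {"source": "New York", "destination": "Paris",
               "airline": "?Airline", "flight_number": "?Flight #"}),
    (BOOKINGS, {"name": "John Traveler", "airline": "?Airline",
                "flight_number": "?Flight #", "date": "?Date",
                "seat": "?Seat #", "fare": "?Fare"}),
])
```

Because `?Airline` and `?Flight #` occur in both relationships, only the booking that agrees with the New York to Paris flight on both values survives, which is the natural-join behavior described above.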
3
Combating Clutter
The fragment of Condor presented so far is a simple and intuitive model for specifying a reasonable (but by no means complete) class of queries. However, a common problem with graphical models is that they are often fine for toy examples, but with real-world problems involving many relations and attributes, the visuals become overly cluttered. Condor addresses this problem in a variety of ways, two of which are discussed in this section.
3.1
Suppressing Attributes
One of the advantages of relational algebra and tuple relational calculus over domain relational calculus is that a query in one of the first two languages need not refer to any attributes that don’t play a role in the query. That is, when the user does not want to constrain the value of an attribute, and is not interested in the values of the attribute, the attribute is simply not included in the query and is projected away. In contrast, domain relational calculus queries must refer to every attribute of every relation that participates in the query. As presented so far, Condor would
seem to suffer from the same problem. In order to avoid cluttering a query with uninteresting attributes, Condor allows the user to mark those roles of a relation that should be ignored. The corresponding edges are then removed from the graph in order to reduce the visual complexity of the query. For example, suppose that in the query in Figure 3 the user wants the airlines and flight numbers for flights from New York to Paris, but doesn’t care about the times of the flights. Figure 5 shows the query after the departure and arrival times have been marked as ignored.
An important special case occurs when the user is not interested in the values of an attribute, but that attribute is used to link two or more relationships in a compound query. In order to act as a link, the attribute must appear in the query and be assigned a variable name label. Still, Condor allows the user to mark the attribute so that its values do not appear in the result output. Such attributes appear in greyed-out boxes in the query. Figure 6 shows a variant of the query in Figure 4 in which the user wants only the dates and airlines on which John Traveler flew from New York to Paris. All other attributes of the two relationships have been marked as ignored. The flight number is present in order to connect the two relationships, but its values will not appear in the output.
Figure 5: What are the airlines and flight numbers of flights from New York to Paris?
Figure 6: What are the airlines, and dates on which John Traveler flew from New York to Paris?
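The effect of a greyed-out linking attribute can be illustrated as a projection applied after the join. In the sketch below, the bindings are hypothetical results for the query of Figure 6, and `project` is an illustrative helper. Dropping duplicate rows is an assumption about the output (set semantics) that the text does not spell out.

```python
def project(bindings, output_vars):
    """Keep only the output variables, dropping duplicate rows."""
    seen, rows = set(), []
    for binding in bindings:
        row = tuple(binding[v] for v in output_vars)
        if row not in seen:
            seen.add(row)
            rows.append(dict(zip(output_vars, row)))
    return rows


# Hypothetical join results; "Flight #" links the two relationships
# but is marked so that it does not appear in the output.
bindings = [
    {"Airline": "United", "Flight #": 914, "Date": "1 May"},
    {"Airline": "United", "Flight #": 916, "Date": "1 May"},
    {"Airline": "Delta", "Flight #": 20, "Date": "9 May"},
]

figure6 = project(bindings, ["Airline", "Date"])
```

The two United bindings differ only in the hidden flight number, so under the set-semantics assumption they collapse to a single output row.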
3.2
Macros
The other principal means of reducing visual and conceptual clutter in a query is through the use of macros. The Condor macro facility allows the user to specify a sub-graph of the query as a meta-relationship by circling the sub-graph using a “lasso” tool. The selected sub-graph must be capable of behaving as a relationship. That is, all edges that are crossed by the drawn boundary must connect attribute nodes outside the sub-graph with relationship nodes inside the sub-graph.
Figure 7: Constructing a macro corresponding to flights from New York to Paris. The top portion of the figure shows the user selecting the sub-query that will form the macro. The bottom portion shows the macro in place after it has been named.
When a proper sub-query has been lassoed, the sub-graph is removed from the query and replaced by a single relationship node that the user then labels with a name or icon. This node is given a thick border to indicate that it is a macro relationship, and it is added to the library of macros. The complexity of the query represented by a macro is up to the user.
Figure 7 shows the user building a simple macro relationship: “Flight NY to Paris”. This macro relationship is just a customized version of the existing flight relationship, constrained to flights from New York to Paris. This macro corresponds to a selection in a relational algebra query. In contrast, the “Booked on a Flight” macro constructed in Figure 8 corresponds to a natural join.
It is important to note that variable-labeled attributes that are inside a macro will be treated as ignored and not shown in the output of queries using the macro. This is consistent with the view that the output of a query should be determined only by the visible structure of the query. In the last example, if we wished to know the flight numbers then we would encircle only the two relationships and the airline attribute with the lasso, leaving the flight number attribute outside. The flight number attribute would then be connected to the macro by a single edge with the appropriate label.
Note that a macro can always be opened up and its structure modified even after it has been added to a query. This is the reason that macros maintain a distinct representation (the thickened outline) different from that of ordinary relationships.
Figure 8: Constructing a macro corresponding to a passenger being booked on some flight. The top portion of the figure shows the user selecting the sub-query that will form the macro. The bottom portion shows the macro in place after it has been named.
Figure 9: What passengers are booked on a flight from New York to Paris, or from Los Angeles to Tokyo?
4
Function Boxes
As stated earlier, the default treatment of compound queries (those with more than one relation) is as a conjunction. The remaining logical relationships are obtained through the use of function boxes which encircle a set of relationship nodes and specify the logical treatment of that set. The use of boxes, which may be nested, allows any logical formula to be represented directly, without resorting to equivalences and normal forms.
4.1
Disjunction
A disjunction box (as with all function and quantifier boxes) is drawn by selecting the appropriate box type from a palette or pop-up menu and then drawing the box in the appropriate area. The box is labeled in the upper-left corner with the appropriate function, in this case or. The query in Figure 9 uses a disjunction box, together with the macro constructed in the last example, to ask for those passengers who have been booked either on a flight from New York to Paris, or on a flight from Los Angeles to Tokyo. Note that function boxes only affect the treatment of relationship nodes; attributes are unaffected. Therefore, the placement of attribute nodes relative to the function boxes —that is, whether they are inside or outside— is unimportant; they should be placed in the most visually appropriate position.
4.2
Conjunction
While conjunction is the default operation when nothing else is specified, an explicit representation of conjunction is still needed when the conjunction occurs as a sub-query. The conjunction box appears just the same as the disjunction box, with the label and replacing the label or. For example, if we did not wish to use the “Booked on a Flight” macro, the last query could have been drawn as in Figure 10.
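Nested function boxes can be read as a formula tree whose leaves are relationship nodes. The sketch below evaluates such a tree for the query of Figure 9. The tuple-based encoding (`"rel"`, `"and"`, `"or"`), the `holds` function, the `?` prefix for variables, and the sample rows of the “Booked On a Flight” macro are all assumptions made for this illustration.

```python
# Hypothetical rows of the "Booked On a Flight" macro relationship.
BOOKED = [
    {"name": "John Traveler", "source": "New York", "destination": "Paris"},
    {"name": "Ann Flyer", "source": "Los Angeles", "destination": "Tokyo"},
    {"name": "Bo Rider", "source": "Boston", "destination": "Rome"},
]


def holds(formula, env):
    """Evaluate a nested box structure under the variable assignment `env`."""
    kind = formula[0]
    if kind == "rel":
        _, relation, roles = formula
        # Replace variable labels by their assigned values; keep constants.
        wanted = {role: env.get(label, label) for role, label in roles.items()}
        return any(all(row[r] == v for r, v in wanted.items())
                   for row in relation)
    if kind == "and":
        return all(holds(sub, env) for sub in formula[1])
    if kind == "or":
        return any(holds(sub, env) for sub in formula[1])
    raise ValueError(f"unknown box type: {kind}")


# Figure 9: an "or" box around two uses of the macro relationship.
figure9 = ("or", [
    ("rel", BOOKED, {"name": "?Passenger",
                     "source": "New York", "destination": "Paris"}),
    ("rel", BOOKED, {"name": "?Passenger",
                     "source": "Los Angeles", "destination": "Tokyo"}),
])

passengers = [row["name"] for row in BOOKED
              if holds(figure9, {"?Passenger": row["name"]})]
```

Since boxes nest, an `"and"` box can appear inside the `"or"` box in the same way, which mirrors the macro-free form of the query.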
Figure 10: What passengers are booked on a flight from New York to Paris, or from Los Angeles to Tokyo? (Without using a macro.)
Figure 11: What are the airline and flight number of flights John Traveler took from cities other than New York?
Figure 12: What are the source, destination, airline and flight number of flights John Traveler took from cities other than New York?
4.3
Negation
Negation boxes, naturally, negate the sense of the underlying query. Unlike conjunction and disjunction boxes, which operate on an arbitrary number of sub-queries, negation boxes are unary: they must surround a single box or relationship. In order to reduce clutter, in the common case that the negation is applied to a relationship rather than to another box, the relationship is redrawn inverted black for white rather than adding a box. Figure 11 asks for the airline and flight number of flights that John Traveler took from cities other than New York. Figure 12 shows a variant of the query which also returns the source and destination of each flight. Note that this requires an additional use of the flight relation.
4.3.1
Safety
As is well known from the development of the relational calculi, the introduction of a direct representation of negation (or complementation) into a query language is problematic, as it introduces the possibility of infinite relations, which are not in the domain of database theory. Fortunately, the theory of query safety as developed in the context of domain relational calculus [6] can be applied directly to Condor, and is not detailed here. Conceptually, safety requires that negations be used only to filter out tuples from the output of another positive, finite part of the overall query.
Figure 13: Does every airline have a flight originating in Paris?
In Figure 11 values of the flight number and airline attributes of the negated flight relationship are constrained to come from the values for those attributes that occur in the booking relationship. In Figure 12, even if only the destination city was desired, the second use of the flight relationship would still be necessary, as an unconstrained use of the destination attribute on the negated relationship would be unsafe.
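The safety condition can be made concrete for the query of Figure 11: the positive booking relationship supplies a finite set of candidate (airline, flight number) pairs, and the negated flight relationship only removes candidates from that set. The sketch below is an illustration under assumed sample data; the relation contents and variable names are hypothetical.

```python
# Hypothetical sample data.
FLIGHTS = [
    {"airline": "United", "flight_number": 914,
     "source": "New York", "destination": "Paris"},
    {"airline": "Delta", "flight_number": 20,
     "source": "Boston", "destination": "Rome"},
]
BOOKINGS = [
    {"name": "John Traveler", "airline": "United", "flight_number": 914},
    {"name": "John Traveler", "airline": "Delta", "flight_number": 20},
]

# Positive, finite part: flights that John Traveler was booked on.
candidates = [(b["airline"], b["flight_number"])
              for b in BOOKINGS if b["name"] == "John Traveler"]

# The relationship that appears negated in the query: flights from New York.
from_new_york = {(f["airline"], f["flight_number"])
                 for f in FLIGHTS if f["source"] == "New York"}

# Negation acts only as a filter on the finite candidate set.
figure11 = [pair for pair in candidates if pair not in from_new_york]
```

At no point is the complement of the flight relationship enumerated, which is what keeps the result finite.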
5
Quantifier Boxes
A Condor quantifier box is visually similar to a function box. It is labeled either for all or there exists and its purpose is to control the treatment of variable-labeled attributes in the enclosed query. Rather than put the name of the variable(s) being quantified in the box label, Condor quantifier boxes are specified to quantify those variable-labeled attributes that are at the outer level of the box. Variable-labeled attributes occurring inside nested function boxes are unaffected. Since the names used in variable labels must be unique, there is no ambiguity.
5.1
Universal Quantification
While quantifiers in Condor are extensional, they are unconstrained, since they do not carry the name of a source domain. Therefore, as with negation, universal quantification introduces problems of safety. The result is that almost all uses of universal quantification are in conjunction with a range-limiting sub-query. Figure 13 shows the yes/no query “Do all the airlines have flights originating in Paris?”.
Because it is used to constrain the range of universal quantifiers, the disjunction/negation pattern used in this query is very common. Therefore, though it is logically equivalent, an implication function box is also available. We believe that most users will also find implication more intuitive than the equivalent disjunction. The implication box is binary, operating on exactly two sub-queries, and is labeled with an implication arrow indicating the direction of the implication. Figure 14 shows the same query using an implication box.
The query in Figure 15 asks for the names of passengers who have flown on all of the airlines, and is typical of the use of universal quantification and implication as in relational calculus to produce the behavior of division in the relational algebra.
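The range-limited use of universal quantification can be sketched directly. Below, the first function follows the implication form of the yes/no query (for every airline, being an airline implies having a flight from the given city), and the second follows the division-style query for passengers who have flown on all airlines. The relation contents, the `FLOWN` pairs, and the function names are hypothetical and chosen only for illustration.

```python
# Hypothetical sample data.
AIRLINES = ["United", "Delta"]          # contents of the airline relationship
FLIGHTS = [
    {"airline": "United", "source": "Paris"},
    {"airline": "Delta", "source": "Paris"},
    {"airline": "Delta", "source": "Rome"},
]
FLOWN = [                               # (passenger name, airline) pairs
    ("John Traveler", "United"),
    ("John Traveler", "Delta"),
    ("Ann Flyer", "Delta"),
]


def implies(p, q):
    """Material implication, equivalent to the disjunction (not p) or q."""
    return (not p) or q


def every_airline_flies_from(city):
    """For all Airline: airline(Airline) implies a flight from `city`."""
    domain = set(AIRLINES) | {f["airline"] for f in FLIGHTS}
    return all(
        implies(a in AIRLINES,
                any(f["airline"] == a and f["source"] == city
                    for f in FLIGHTS))
        for a in domain
    )


def flown_on_all_airlines():
    """Passengers related to every airline (relational division)."""
    passengers = {p for p, _ in FLOWN}
    return sorted(p for p in passengers
                  if all((p, a) in FLOWN for a in AIRLINES))
```

With this data, `every_airline_flies_from("Paris")` is true, `every_airline_flies_from("Rome")` is false, and `flown_on_all_airlines()` yields only John Traveler.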
Figure 14: Does every airline have a flight originating in Paris? (Using implication.)
Figure 15: What passengers have flown on all of the airlines?
Figure 16: What airlines fly from Los Angeles to Paris in one or two hops?
5.2
Existential Quantification
Existentially quantifying a variable-labeled attribute amounts to asking whether there is a value for the attribute which will satisfy the rest of the constraints in the query, but not caring about the actual value. We have already seen this behavior in the two forms of ignored roles. Both of these forms, however, assume that the implicit quantification is at the innermost possible level of the query. The existential quantifier box is provided to allow the level of the formula at which the quantification takes place to be made explicit. Nevertheless, most cases can be handled by the implicit forms, so we will not show an example here.
6
Transitive Closure
In addition to allowing the user to write any first-order query, Condor also provides a direct representation of arbitrary transitive closures in queries. While this would be useful in any query language, it is particularly useful for reducing the density of graphical queries. For example, Figure 16 shows a query asking for those airlines on which it is possible to fly from Los Angeles to Paris in one or two hops. Extending this to three or more hops quickly becomes a visual jumble. Figure 17 shows the equivalent query using Condor’s transitive closure operator. This query will return those airlines that can get a passenger from Los Angeles to Paris in any number of hops. The circular and square numbered nubs on the edge of the closure ring are placed by the user and indicate the way in which the roles of one hop connect to the roles of another. The square nubs correspond to “inputs” and the round nubs to “outputs”. The numbers determine how the nubs are paired together. The placement of variable-labeled attributes relative to the closure is significant. In the last example, by positioning the airline attribute outside the closure, we indicated that the value of that attribute must be fixed for any one series of hops. That is, we are looking for routes that do not
involve changing airlines. If, on the other hand, the user wants routes that can include a change of airlines, then the airline attribute would be positioned inside the closure circle. Note that, as with macros, the query inside the closure need not be just a single relationship. It can be an arbitrary graph with the nubs taking the place of any variable-labeled attributes.
Figure 17: What airlines fly from Los Angeles to Paris in some number of hops? (Using transitive closure.)
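The difference between placing the airline attribute outside or inside the closure can be sketched with a naive fixed-point computation. In the code below, the closure is taken separately per airline when the attribute is outside, and over all hops regardless of airline when it is inside. The sample flights and the function names are hypothetical, and the naive iteration is chosen for clarity rather than efficiency.

```python
# Hypothetical sample data: (airline, source, destination) triples.
FLIGHTS = [
    ("United", "Los Angeles", "Chicago"),
    ("United", "Chicago", "Paris"),
    ("Delta", "Los Angeles", "New York"),
    ("Air France", "New York", "Paris"),
]


def closure(pairs):
    """Transitive closure of (source, destination) pairs by fixed point."""
    reach = set(pairs)
    while True:
        new = {(a, d)
               for (a, b) in reach
               for (c, d) in pairs
               if b == c} - reach
        if not new:
            return reach
        reach |= new


def airlines_connecting(flights, src, dst):
    """Airline outside the closure: every hop uses the same airline."""
    result = set()
    for airline in {a for a, _, _ in flights}:
        hops = {(s, d) for a, s, d in flights if a == airline}
        if (src, dst) in closure(hops):
            result.add(airline)
    return result


def connected_any_airline(flights, src, dst):
    """Airline inside the closure: hops may change airlines."""
    return (src, dst) in closure({(s, d) for _, s, d in flights})


same_airline = airlines_connecting(FLIGHTS, "Los Angeles", "Paris")
with_changes = connected_any_airline(FLIGHTS[2:], "Los Angeles", "Paris")
```

In this data only United offers a single-airline route, while the Delta and Air France flights together connect the two cities only if a change of airline is allowed.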
7
Related Work
One of the earliest “graphical” interfaces, and perhaps still the most well known, is Query By Example (QBE) [7]. However, QBE’s interface, which is essentially a tabular depiction of the database schema, together with the fact that user input is all textual, makes the description of it as a “graphical system” somewhat outmoded.
Catarci, working with a variety of co-authors, has proposed several systems over the years. These systems, including Query By Diagram* (QBD*) and the Graph Model (GM), are based on variations of the E-R model [1, 2]. QBD* included a syntax for transitive closures, but it was based on a tabular presentation which did not seem to fit very well with the otherwise graphical presentation. The query graphs of GM are a hybridization of the E-R model and semantic networks. Condor queries, based on the relational model and logical formulas, are often more succinct, and, we believe, more easily understood by naive users.
GrIT, the graphical interface for the Peirce object-oriented DBMS, like Condor, is based on constructing logical formulas as queries [4]. However, it is built on Sowa’s Conceptual Graphs [5], which provide only existential quantification. While the system is still logically complete, the user must rely on logical equivalences for some constructions. The resulting queries are often far less intuitive than if full first-order logic were used.
8
Prototype Implementation
A prototype implementation of Condor has been developed using the XPCE graphics toolkit from SWI. A backend database management system was also developed in SWI-Prolog. At present a new implementation using Tcl/Tk for the interface code is under development at Jet Propulsion Laboratories. In tandem with the Grok object-oriented DBMS (also being developed at JPL) and additional interface code being written at Harvey Mudd College, this version will be able to query data residing in an Oracle™ database.
9
Future Directions
In addition to ongoing work on the implementation of Condor mentioned in the last section, there are many issues in the design of the language that remain to be investigated. In its current state Condor is a very rich and useful tool, but it corresponds closely to the various formal query languages. It thus lacks many of the crucial features that are present in real-world database systems but not in the theoretical languages. In particular, Condor does not currently support either computed relations (i.e. relational operators like < and functions like string containment) or aggregation operators. While computed relations can easily be fit into the visual structure of the language by having them appear much like any other relationship node, there may well be other denser representations that are more appropriate. Aggregation operators present their own challenges. Further work is also needed in determining visual shorthands to reduce clutter. One example of an area that needs attention is relationships constrained by two or more possible values on an attribute (e.g. flights to Paris or Rome). At present this requires a disjunction box with a separate copy of the relationship node for each of the possible constant values. It seems obvious that list-valued attributes would be more straightforward, but the best way to introduce them is not clear. Finally, as currently constituted, macros exist as syntactic sugar for the defining expression, much like a view. Each time a macro is used, the underlying query is recomputed. It is possible to imagine a second kind of macro in which the value of the relation represented is frozen at the time the macro is defined. How should this distinction be represented?
10
Conclusion
We have designed and implemented the Condor system for graphical entry of database queries. The following are the main contributions: Subject to query safety considerations, Condor provides a fully-quantified first-order predicate calculus capability, as well as a clean intuitive transitive closure construct. Macros, special handling of negation, and graphical suppression of irrelevant attributes are offered as ways of reducing the complexity of query presentation for the user’s benefit. We closed by mentioning imminently-realizable features which would be needed to make Condor a full-fledged graphical user interface for arbitrary databases.
11
Acknowledgments
Condor was designed as part of a Harvey Mudd College Computer Science Clinic project under contract from Jet Propulsion Laboratories. The authors wish to thank Walter Bunch and Dr. Nevin Bryant, both of JPL, for their advice as liaisons on this project.
References
[1] M. Angelaccio, T. Catarci, and G. Santucci. QBD*: A graphical query language with recursion. IEEE Transactions on Software Engineering, 16(10):1150–1163, 1990.
[2] T. Catarci, S.K. Chang, M.F. Costabile, S. Levialdi, and G. Santucci. A graph-based framework for multiparadigmatic visual access to databases. IEEE Transactions on Knowledge and Data Engineering, 8(3):455–475, 1996.
[3] P.P. Chen. The entity-relationship model: Toward a unified view of data. ACM Transactions on Database Systems, 1(1):9–36, 1976.
[4] P.W. Eklund, J. Leane, and C. Nowak. GrIT: An implementation of a graphical user interface for conceptual structures. Technical Report TR94-03, Computer Science Department, The University of Adelaide, February 1994.
[5] J.F. Sowa. Conceptual Structures. Addison Wesley, Reading, Mass., 1984.
[6] J.D. Ullman. Principles of Database and Knowledge-Base Systems, volume I. Computer Science Press, Rockville, Maryland, 1988.
[7] M.M. Zloof. Query-By-Example: A database language. IBM Systems Journal, 16(4):324–343, 1977.