Database Support for Taxonomy

Cédric Raguenaud, Jessie Kennedy, Peter J. Barclay
Database and Object Systems Group, School of Computing, Napier University, 219 Colinton Road, Edinburgh EH14 1DJ, United Kingdom
{cedric, jessie, pete}@dcs.napier.ac.uk

Abstract: Taxonomists classify the organisms they study in order to refer to, identify and understand them. However, the same organism may at times be classified according to different taxonomic opinions and subsequently have several alternative names or be placed in different taxa. As alternative classifications multiply, biologists are commonly faced with the need to compare and contrast taxonomies in order to identify how they differ in their organisation. Thus there is a pressing need for computer systems that are capable of handling multiple taxonomies arising from the combination of legacy data, newly described taxa, modern revisions and conflicting opinions. Many database systems have been built to handle taxonomic data. We review taxonomic databases and show that they do not provide taxonomists with the necessary tools to support their work. A new model of taxonomy (the Prometheus model) is defined in [PK99] (see footnote 1); here we compare existing database technologies against our requirements and show their strengths and weaknesses.
1. Introduction

Taxonomy provides biologists with a means of identifying, categorising and referring to the organisms they study. However, the complexity of the living world and the development of new techniques for surveying it (e.g. molecular phylogenetics) mean that one cannot simply assume a single, common reference taxonomy categorising all organisms. Hence, the same organism may at times be classified according to different taxonomic opinions and subsequently have several alternative names. Modern taxonomies are usually improvements on previous ones, but sometimes the existence of alternative or variant taxonomies reflects disagreement about how to interpret the data on which the taxonomy is based. As alternative classifications multiply, biologists are commonly faced with the need to compare and contrast taxonomies in order to identify how they differ in their organisation. A solution is to support all views of taxonomic classifications, without having to make judgements as to which is the 'correct' classification. Thus there is a pressing need for computer systems that are capable of handling multiple taxonomies arising from the combination of legacy data, newly described taxa, modern revisions and conflicting opinions.

The use of computers in taxonomy has grown rapidly over the last decade. During this period several specialist databases have been implemented specifically for handling taxonomic data. Each of these systems has modelled the taxonomic process in a somewhat different way. Extant databases are not capable of dynamically handling multiple, contradictory taxonomies, nor do they differentiate between nomenclature and taxonomy. We believe that current taxonomic database systems (e.g. [Pan93] [WAW93] [BPB93] [SLPS95]), which are usually based on the relational model, are unable to represent taxonomic data accurately and to support taxonomic working practice. The Prometheus system is based on careful analysis of taxonomists’ working practices and will allow users to switch between classifications, and to compare and contrast them in an even-handed manner.

Section 2 describes and reviews current taxonomic models, how they are implemented in database systems and why they fail to support taxonomic working practice, and concludes by describing the Prometheus taxonomic model [PK99]. Section 3 presents existing database technologies and shows why they are not suitable for supporting taxonomic work. Finally, we conclude in section 4.

Footnote 1: Some taxonomic points later in this document may not be clear to non-taxonomists. We recommend reading [PK99], which explains the theory behind the Prometheus taxonomic model and the processes involved in taxonomic work.
2. Taxonomy

Taxonomy is defined as "the study of the general principles of scientific classification" [Web99]. Taxonomists classify the organisms under study and generate a classification hierarchy depicting their presumed natural relationships. These classifications are hierarchical structures where specimens and taxa are placed according to various criteria (e.g. DNA relationships, morphological similarities). The levels (or ranks) used in generating the classification hierarchies vary for different groups of specimens and between taxonomists. Some specimens may end up classified in different ways over time. These classifications are all valid, even though more recently revised versions exist. Taxonomists do not have the concept of "correct classification": all published classifications are valid viewpoints.

When taxonomists choose a group to be studied, they collect preserved and living specimens on which to base their work. These specimens may be studied in herbaria or in the field. At the same time, they compile information about past classifications of this group from the literature. The taxonomist examines the specimens and decides on the criteria to differentiate them. Using these criteria, the specimens are classified into different groups and attributed to taxa. Finally, a name is assigned to each taxon defined in the classification. This involves application of the nomenclatural code, which in simple cases involves finding the name of the oldest published type specimen in each group. If no type specimen exists in a newly defined taxon, a new name is created.
Taxa are also assigned to ranks, which specify the level of a taxon in a classification hierarchy (figure 1; ranks are on the right side of the figure).

The classification is published for other taxonomists to use and is now considered a valid classification. If other taxonomists disagree with this classification then they must undertake a revision of the group and publish their conflicting classification in a similar way, i.e. once published a classification can never be removed.

[Figure 1: Taxonomic hierarchy. The taxa Bellis and perennis are shown with their ranks, Genus and Species, on the right.]
2.1. Taxonomic Challenges for Databases

The first challenge generated by the way taxonomists work is the management of old historical classifications. Indeed, when a classification is revised, it stays valid (for example because of references to it in the literature) even if it is not the classification that is recognised by the majority of taxonomists. A database developed for taxonomists must allow the existence of historical classifications.

The second challenge is that the choice of criteria and the way a classification is created (e.g. a revision of previous work or a new study) are largely free. Even the nomenclatural code has varied over time and hence will affect the naming of taxa. Thus it is likely that two taxonomists working on the same set of data will not produce the same classification. A system that supports the work of taxonomists must understand that the same specimens are seen differently by different taxonomists.

A third challenge is the support of a common reference taxonomy for use in the identification of species for legal or commercial purposes. These classifications are not necessarily those used by taxonomists but are important for non-specialists to avoid naming confusions. At the moment, no database system is able to store both special fixed classifications and working classifications at the same time.

These problems appear to be the consequence of the need for multiple classifications. The lack of tools that handle multiple contradictory classifications limits the compilation and comparison of useful global data. In the last decade, computers and in particular databases have become widespread in taxonomic work. These databases model taxonomy in slightly different ways; however, none is capable of capturing multiple contradictory classifications, either because the taxonomic model is inappropriate or because the underlying database technology is too limited.
2.1.1. Inappropriate taxonomic model

[PK99] describes the problems generated by existing taxonomic models, and we summarise them here for a better understanding of the problem. The vast majority of taxonomic systems do not capture multiple classifications (e.g. [Pan93] [SLPS95]). These systems are unable to represent multiple classifications because they do not distinguish between the concepts of naming and classifying. None of these systems automatically assigns names; they require the user to declare them. Because of this, it would be impossible to extend their models to accommodate multiple classifications without heavy modifications.

To our knowledge, only two systems try to represent multiple classifications: IOPI [Beren97] and HICLAS [BPB93]. The IOPI system is based on the concept of potential taxon, which consists of the combination of a taxon name plus a citation. In this system, a taxon does not directly have a name, but is related to a name. This is an improvement on previous systems. However, it appears that IOPI combines the concepts of the nomenclatural status of a name (was it validly published?) and the status of the taxon in the classification (is the taxon synonymous with another?). This is because all status information is linked to the potential taxon. We believe that this mechanism does not allow the representation and comparison of concurrent classifications. Moreover, IOPI requires the user to make a statement about the status of the taxon rather than allowing the system to work out that information. From a classification point of view, the taxa defined in IOPI may be classified under many higher rank taxa. This provides the basic mechanism for multiple classification. However, a name is still attached to this taxon, which means that the name can become incorrect in certain circumstances.
Moreover, not enough information is stored in relation to the classification process, so when taxa are classified under taxa which are also multiply classified, the concept of multiple hierarchies does not appear and the different classifications are mixed together. The IOPI approach does not allow easy classification comparisons because the different taxa defined have no common reference point.
HICLAS uses a different approach to the multiple classification problem. HICLAS is based on the concept of taxon-view. A taxon-view is a quadruple of the following elements: taxon name, author/authority, year, and publication number. HICLAS defines two types of trees: the classification tree, where taxon-views are stored, and operational trees. These trees are orthogonal. HICLAS represents classification hierarchies as parallel trees without interaction. These trees may only be related because an operational tree may link them. An operational tree captures the history of a taxon-view (its creation, moves, partitions, etc). HICLAS does not distinguish the concepts of naming and classification, which contravenes taxonomic practice [PK99]. It is also difficult to implement interaction between these hierarchies if they share data (such as a name or a specimen). Finally, it makes comparison difficult because the classifications do not share data (e.g. references to names or specimens).

In order to capture multiple classifications and support taxonomists’ work, three features appear to be necessary in the taxonomic model:
- names must be separated from the classification so that many taxa may share the same name or a taxon may have different names
- names must be derived automatically so that the name of a taxon can be generated according to a classification
- the classification hierarchy must be able to handle multiple specimen classification so that a specimen may be classified under many different taxonomic groups
2.1.2. Inappropriate database model

Another reason for inadequate taxonomic database systems is the implementation of the taxonomic model on top of a database system that does not allow the expression and the correct representation of taxonomic data. With the exception of the Taxon-Object system [SLPS95], all taxonomic databases we reviewed [Pan93] [WAW93] [BPB93] [SLPS95] [Beren97] [WO93] [Filer94] [BD92] are built on top of a relational database. In the relational model, data structures are simple flat tables. Furthermore, the operations provided on these structures are generally simple and straightforward, and do not support recursive queries. Although HICLAS and IOPI attempt to model multiple classifications, they are built on top of relational databases. We believe that this model is not suitable for representing taxonomic data because it is a flat structure and classifications are inherently hierarchical.

The relational model, by nature, only supports the concept of relation, and does not directly support the concept of hierarchy. If hierarchies are to be created and manipulated in a relational database, an application outside the database must be written especially for this purpose. This limits the support for taxonomic working. Moreover, the relational query language is limited as it does not usually provide any way of traversing a graph, which makes it difficult to define the queries taxonomists are likely to use (for example the extraction of a complete hierarchy based on some criteria). This limits the interaction with the database system to the kind of queries implemented by the application layer and prohibits extensions to the system without rewriting part of the software. This limitation may be a problem for some applications, especially graph-based applications where two-dimensional graphs are manipulated (for example it is impossible to create a hierarchy of arbitrary depth as the result of a relational query).
In order to solve this problem, some manufacturers have added a recursive clause in SQL that frees the query language from this limitation, but these extensions are manufacturer-dependent and not standardised.
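To illustrate the application-level workaround described above, the following minimal sketch stores a classification in a single flat table and rebuilds the hierarchy in code outside the database. The table name, column names and sample rows are hypothetical, and SQLite (through Python's standard sqlite3 module) is used only for convenience; it is not one of the systems reviewed here.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database holding one flat 'taxon' table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE taxon ("
        "id INTEGER PRIMARY KEY, name TEXT, rank TEXT, parent_id INTEGER)"
    )
    rows = [  # illustrative rows only
        (1, "Asteraceae", "Family", None),
        (2, "Bellis", "Genus", 1),
        (3, "perennis", "Species", 2),
        (4, "annua", "Species", 2),
    ]
    conn.executemany("INSERT INTO taxon VALUES (?, ?, ?, ?)", rows)
    return conn


def extract_hierarchy(conn, root_id):
    """Rebuild the subtree under root_id using only non-recursive queries.

    The recursion lives in the application, not in the query language.
    """
    name, rank = conn.execute(
        "SELECT name, rank FROM taxon WHERE id = ?", (root_id,)
    ).fetchone()
    # Fetch all child ids first so that no cursor is left open while recursing.
    child_ids = [
        row[0]
        for row in conn.execute(
            "SELECT id FROM taxon WHERE parent_id = ? ORDER BY id", (root_id,)
        )
    ]
    return {
        "name": name,
        "rank": rank,
        "children": [extract_hierarchy(conn, child) for child in child_ids],
    }
```

Every new kind of hierarchical question (for example a subtree filtered by rank) would need another function of this sort, which is the rewriting cost the text refers to.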
Finally, the relational model has a well-known and widely used view mechanism. Views allow the user to define alternative representations of data and perform (sometimes limited) updates. A view is also represented by a relation, with attributes and domains. This is an interesting feature for taxonomists, but once again, the possibilities of the view system are limited by the relational structure of the database. First, views are created using the query language defined in the database, and thus inherit its limitations. Moreover, views used by taxonomists would need to contain heterogeneous data that must keep their individuality. They represent part of a global schema and contain for example taxa, information about taxonomists, citations, etc. A relational view is always a relation, so it cannot represent heterogeneous data structures without restructuring them into a relation. It would be necessary to create a great number of views in order to satisfy a single request from the user.
2.2. The Prometheus Taxonomic Model

As explained previously, current taxonomic databases do not support the work of taxonomists because their model of taxonomy is incorrect. The authors have been working with taxonomists at the Royal Botanic Garden in Edinburgh on the Prometheus project, to develop a more accurate taxonomic model and implement it in a working database system. The Prometheus taxonomic model is fully described in [PK99]. We explain this model briefly here:
[Figure 2: The Prometheus Taxonomic Model (taken from [PK99]). Class diagram; labels recoverable from the figure: Specimen, Type Specimen, Type, Type Definition, Circumscription, Circumscribed Taxon, Nomenclature Taxon, Ascribed Name, Working Name, Calculated name, Placement, Higher, Epithet, Rank, Publication, Author, Validity, Conservation/Rejection, Conserved against one/many names, Conserved against all names, Rejected outright.]
This diagram represents an object-oriented model. We explain here the notation used: plain boxes represent concrete classes and dashed boxes represent abstract classes. Lines represent associations, which by default are one-to-one; any other cardinality is stated explicitly. Lines starting with a diamond shape represent aggregations and are also one-to-one by default; again, any other cardinality is stated explicitly. Bold lines represent inheritance. All relationships are directed, and are named when their role is not clear from the context.
To support the work of taxonomists fully (in particular multiple classifications), the taxonomic model must have the following properties: clearly defining the specimen as the basic unit in the system, distinguishing naming and classifying, automatically generating names, and supporting multiple classifications. The Prometheus model (shown in figure 2) contains three trees in order to implement these constraints: a nomenclature taxon tree, a circumscription taxon tree, and a rank tree.

Circumscription taxa (taxa resulting from the classification process) are arranged into classification trees. Each of the taxa of a tree is assigned to a rank, which describes its location in the hierarchy. Unlike in many other models (e.g. [Pan93], [BPB93]), names are not usually directly assigned to taxa (they are automatically calculated), although the possibility of direct assignment is included in the system in order to capture errors in historical data sets.

The nomenclature taxon tree represents the existence of names at a given rank. For example, Bellis is an occurrence of a name at genus level, and perennis an occurrence of a name at species level. The nomenclature taxon tree is multi-rooted. This is due to the rules used for the creation of names. Generally, names do not need to be related to other names, but at some levels in the hierarchy (e.g. species or sub levels), names must be composed (e.g. the binomial species name Bellis perennis). A nomenclatural taxon is a unique entity composed of many different concepts. It is therefore unique in the system, even though many nomenclatural taxa may represent the same name [PK99].

The rank tree describes levels at which taxa and names can be classified. These levels express different degrees of precision in the description of classified items. All these trees obey a set of rules defined in the nomenclatural code and by the user.
For example, from a taxonomic point of view a species nomenclatural taxon needs to be related to a genus nomenclatural taxon in order to form a binomial name. Or, from a user point of view, if a person concept is defined and is composed of a date of birth and a date of death, the date of death needs to be later in time than the date of birth. These integrity constraints cannot be represented with classical modelling notations so they do not appear in the model, but are a required feature of our system.

The circumscription taxon side of the Prometheus model uses the specimen as a basic unit. Unlike other taxonomic models, this model ignores the definition of a taxon as an entity in itself, and concentrates on its delimitation by grouping specimens. These taxa may contain both type specimens (which hold nomenclature information) and ordinary specimens. It is then the task of the system to assign names to taxa when necessary using available information about specimens. In some cases, the name of a taxon may not be the one it should have, so it is possible to force the assignment of a name to a circumscription taxon.

The fact that the Prometheus model is built on the concept of specimens allows an easier comparison (classification hierarchies share information), and allows the investigation of alternative hierarchies based on the classification of particular specimens in other parts of the classification. In this way, the model provides facilities for the construction of classifications and not only their storage. Research carried out as part of the Prometheus project has shown that it is important to be able to relate new circumscription taxa to older ones. This allows easier comparisons between overlapping classifications. There is conceptually a need for keeping track of the history of taxa, as implemented in HICLAS.
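The idea of calculating a name from the specimens a taxon contains can be sketched as follows. This is a simplified illustration, not the Prometheus implementation: it covers only the simple case mentioned in section 2 (choose the oldest published name whose type specimen falls inside the circumscription), and all class names, field names and the sample data are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Specimen:
    label: str  # hypothetical identifier, e.g. a collection number


@dataclass
class NomenclaturalTaxon:
    epithet: str
    rank: str
    year: int  # year of publication of the name
    type_specimen: Optional[Specimen] = None
    placement: Optional["NomenclaturalTaxon"] = None  # e.g. the genus of a species name

    def full_name(self) -> str:
        # A name placed in another name is composed with it (binomial).
        if self.placement is not None:
            return f"{self.placement.epithet} {self.epithet}"
        return self.epithet


@dataclass
class CircumscribedTaxon:
    rank: str
    specimens: List[Specimen] = field(default_factory=list)

    def calculated_name(self, names: List[NomenclaturalTaxon]) -> Optional[str]:
        """Return the oldest published name typified by one of our specimens.

        Returns None when no type specimen is included, i.e. when a new
        name would have to be created.
        """
        mine = set(self.specimens)
        candidates = [
            n for n in names
            if n.rank == self.rank and n.type_specimen in mine
        ]
        if not candidates:
            return None
        return min(candidates, key=lambda n: n.year).full_name()


# Illustrative data: "exemplum" is an invented later name.
s1, s2, s3 = Specimen("S1"), Specimen("S2"), Specimen("S3")
bellis = NomenclaturalTaxon("Bellis", "Genus", 1753)
perennis = NomenclaturalTaxon("perennis", "Species", 1753, s1, bellis)
exemplum = NomenclaturalTaxon("exemplum", "Species", 1800, s2, bellis)
NAMES = [bellis, perennis, exemplum]
```

Because the name is computed from the circumscription, two taxonomists who group the same specimens differently obtain different names without either classification storing a name directly.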
3. Database Support for Taxonomic Data

We have seen in the previous section that the relational model is not suitable for supporting taxonomists’ work because of its flat structure and the lack of expressiveness of its query language. This may also be true of other models. This section explains the requirements for such a database, then reviews other common database models (object-oriented and graph-based) in terms of their suitability to support taxonomic work.
3.1. Taxonomists’ needs

The taxonomic model described in [PK99] (shown in figure 2) is an object-oriented model. By trying to express the basic properties of the system in terms of this model, we may conclude that an object-oriented database system is needed where it is not really necessary. We thus need to define the needs of this model in terms of functionality rather than modelling. The requirements are as follows.

First, we need to be able to represent simple concepts such as numbers or strings, but we also need to be able to extend the system with new complex objects such as NT (nomenclatural taxon) or CT (circumscription taxon). These complex objects are closely related to complex objects in the object-oriented sense. They must be allowed any degree of complexity. These objects must obey a certain structure defined by the database designer. This structure must then be divided in two groups: a group that serves as a model for all data input in the database (types), and a group that represents the data input (instances). Instances have to conform to their type in order to be valid. All types form the schema of the database.

We have shown in previous documents that taxonomic data is highly hierarchical. These hierarchies are the direct result of the way the relationships between real world organisms are constructed (whatever the means, e.g. morphological or genetic). Not only must our system be able to capture these structures, it also needs to be able to manipulate and query them in a meaningful way.

Our model needs to be able to represent simple and multiple associations. These associations are simple relationships between objects showing that they are interacting. In our model, this can be found in the modelling of authors and publications. This relationship has no influence on the behaviour of the objects involved.

Our model also needs to be able to capture single and multiple aggregations. Indeed, this relationship between two objects is central to our model.
Since the modelling of multiple taxonomies implies the sharing of data between classifications (e.g. an epithet may occur many times in the classifications but still exists only once as a published name), our concepts are spread over a multitude of objects (see the modelling of the NT concept for example). We thus need to be able to show that a concept is constituted of a certain number of objects. We need to represent this concept using relationships because our graph model is flat (as opposed to nested), so does not allow the representation of a whole-parts concept. This relationship modifies the behaviour of the operations that can be applied to the participating objects. For example, deleting an NT implies the deletion of all participating objects if they do not participate in other concepts.

As shown in our taxonomic model, we need to represent a certain kind of synonymy. We understand synonymy as a way of grouping many kinds of objects so that they can be used interchangeably by a function of the system, or because they represent concepts that are similar in some conditions. For example, we modelled the Type concept as a super-class for NT and Specimen so that either NT or Specimen can be used as a type for a given NT. But semantically, an NT and a Specimen are not related by an isa relationship because they do not share behaviour or structure. The only reason why they are represented as inheriting from a common super-class is that both the NT and the Specimen types can become a type for an NT. We also need to represent another aspect of synonymy at instance level instead of type level: sometimes, the objects manipulated by taxonomists are duplicated because of mistakes or divergence of opinion. However, these objects still represent the same real world entity and so should be treated as such. For example, the same author may be referred to in different ways depending on the owner of a classification (e.g. “J. Kennedy” and “Jessie Kennedy” represent the same real world person, but are different objects because of spelling differences). Thus, our model needs to represent unions of types that can be involved in some relationships (and not true inheritance), and instance synonyms.

Finally, our model must offer a certain degree of integrity checking. Part of it is provided by the presence of types in an object-oriented system, but integrity on values must also be checked (usually through code writing in an object-oriented environment, e.g. a child must be less than 21 years old). These integrity constraints must be extended to inter-object relationships in order to capture the taxonomic code (e.g. an NT can have another NT as its type, but this second NT must be of a lower rank).

We summarise the requirements of our system here:
- Simple entities: for example atomic objects such as numbers or strings.
- Complex entities: for example an NT.
- Schema and instances.
- Simple associations: an Author may be linked to a publication but is not defined by it.
- Multiple associations: an entity may be linked to many entities of a given kind. For example, an Author may have written many books.
- Single aggregations (or composition): an NT is composed of an Epithet, a Validity flag, a Publication, and sometimes a type and a placement. Structurally, an aggregation is an association, but it implies a modification of the behaviour of the operators applied to it.
- Multiple aggregations: a CT may be circumscribed by many other CTs or many Specimens.
- Synonymy: the taxonomic model uses inheritance as a way of representing synonymy. But this is not a true inheritance relationship since, for example, Specimen and NT have nothing in common except that one can be used in place of the other. It is also necessary to represent this synonymy at instance level in order to group identical concepts that are represented by different objects.
- Hierarchical aspect: the data manipulated is easily represented as trees and has sometimes been modelled this way in a taxonomic database (HICLAS).
- Recursivity: the type of treatment necessarily implies a recursive exploration of the database. For example, a taxonomist may ask the database to extract a given classification.
- Integrity constraints: they are required in order to capture the rules of the taxonomic code.
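Two of these requirements lend themselves to a short illustration: instance-level synonymy and the deletion behaviour attached to aggregation. The sketch below is only one possible reading of them. It assumes that instance synonyms can be kept as equivalence classes (here with a small union-find structure) and that an aggregate can be recorded as a set of part identifiers. The class names and identifiers are hypothetical.

```python
class SynonymRegistry:
    """Groups objects that denote the same real-world entity (instance synonymy)."""

    def __init__(self):
        self._parent = {}

    def _find(self, item):
        # Each unseen item starts as the representative of its own group.
        self._parent.setdefault(item, item)
        while self._parent[item] != item:
            # Path halving keeps later look-ups short.
            self._parent[item] = self._parent[self._parent[item]]
            item = self._parent[item]
        return item

    def declare_synonyms(self, a, b):
        """Record that a and b represent the same entity."""
        self._parent[self._find(a)] = self._find(b)

    def same_entity(self, a, b):
        return self._find(a) == self._find(b)


class AggregationStore:
    """Records which parts make up each composite concept (e.g. an NT)."""

    def __init__(self):
        self._parts = {}  # composite id -> set of part ids

    def add(self, composite, parts):
        self._parts[composite] = set(parts)

    def delete(self, composite):
        """Delete a composite and return the parts deleted with it.

        A part survives if it still participates in another composite.
        """
        parts = self._parts.pop(composite)
        still_used = set().union(*self._parts.values())
        return parts - still_used
```

For instance, after `declare_synonyms("J. Kennedy", "Jessie Kennedy")` the two author objects remain distinct but `same_entity` treats them as one person, while deleting one of two NTs that share a publication leaves that publication in place.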
3.2. Object-Oriented Databases

A definition of an object-oriented database (OODB) is given in [ABD+90] and later specified in the ODMG standard [CBB+97]. OODBs have gained importance in industry over the last few years but are still used only marginally, especially for taxonomic applications. Briefly, in an OODB the basic piece of information is the object. Two kinds of objects exist: complex and simple objects. Complex objects have an internal state (a set of simpler objects), and behaviour (methods). Simple objects are values such as integers or strings. Unlike in the relational model, objects are recognisable independently from their value. Objects all have an identifier that has to be unique in the system. Objects of similar structure are gathered into classes or types. A type describes the interface of an object and its implementation, whereas a class is in addition an object factory (that is able to create new objects) and an object warehouse (it contains the set of its objects). These classes or types form inheritance hierarchies where behaviour and information may be shared between many classes or types.

There is a variety of querying mechanisms in OODBs. Early papers describe how to query and update data in an object-oriented environment. In [KKS92] attributes and methods are also objects, and a query language inspired by SQL, called XSQL, is defined. It uses path expressions (paths along the composition hierarchy, of the form object.attribute_1...attribute_n). These path expressions are able to query the schema and cope with nested structures. However, once again, this query language provides no recursive facilities over non-explicitly recursive data. The standard query language developed for object-oriented databases, OQL, is also largely inspired by SQL and uses path expressions. Because of this relationship, it does not offer facilities that support the concept of hierarchy or its extraction and modification. It is for example impossible to query a database and receive as a result a portion of a hierarchy.

Views based on the object-oriented model are a relatively recent but active field of research. Some systems provide a complex view behaviour. The SetV system [Bella96] provides mechanisms to create views over an object-oriented database. It uses the virtual schema approach like many other research systems [Run92]. The O2 view system is extended in order to support views [Bella96] and updates [Bella97] of virtual complex objects.
The update of materialised virtual complex objects (as opposed to virtual simple objects, which are simply a restriction on base objects) is made possible by creating new objects (with oids) as a result of a query, and keeping track of all the oids of objects used to create the view. The objects are materialised but their value is computed at the invocation of the query by the view. However, this system fails to provide automatic reclassification of newly created classes. [Run92] gives the answer and the algorithms described are implemented in Multiview [JR96]. When a new class is created, the inheritance hierarchy is scanned in order to find the location to insert the new class. If it is not found, intermediate classes are created or the inheritance hierarchy is modified. Multiview also solves the multiple-membership problem [KRR95] and class removal [C-TR96], which other research papers do not consider. Multiview is also able to cope with automatic view reorganisation due to the modification of a base object, whether directly or by another view.

Application of OODB to Taxonomy

Because they are more intuitively linked to the real world, objects ease the manipulation and querying of data. In the Taxon-Object system [SLPS95], objects cannot be extended, and the different levels of the classification hierarchy (e.g. family, genus, species) are predefined. This is due to the modelling of the rank concept, which is included in the taxon concept. We believe this system cannot support the work of taxonomists because it does not deal with multiple classifications and its mechanisms and data are too rigid. It is clear that the problems encountered by Taxon-Object are largely due to the taxonomic model used. However, we discovered while building the Prometheus taxonomic model and the corresponding database that object-oriented structures do not capture taxonomic data efficiently.
If we compare object-oriented systems to our requirements (see section 3.1), we see that they satisfy some of the constraints (e.g. a sense of aggregation, support for complex data, single and multiple relationships), but fail to support others. Indeed, taxonomic data is hierarchical, and hierarchies are not easily managed as concepts in an object-oriented environment. It is also highly fragmented, leading to numerous simple objects that do not offer a good object-oriented design and that exhibit a structure very close to a graph model in terms of the number of objects and relationships. An object-oriented environment also fails to capture the important concept of synonymy. Even the polymorphism and inheritance facility does not provide full support for synonymy (for example at instance level) and often requires more meaning than is in reality necessary (inheritance is not only synonymy of types, even if it can help to represent it at the price of a bad design). Moreover, object-oriented query languages do not seem to provide ways of querying taxonomic data easily. The fact that these query languages do not have the concept of hierarchical data makes it difficult to extract meaningful data sets from a taxonomic database. Although views provide interesting support for taxonomic work in that subsets of the database can be extracted for study, they suffer from the limitations of the query language that creates them. Finally, the integrity constraints implemented in object-oriented environments are not suitable for taxonomic constraints because they do not take into consideration the environment of the constraint. For example, many constraints related to taxonomic work depend on the level of a concept in the hierarchy of taxonomic ranks, and are only applicable if the concept is at certain ranks. The usual constraints defined on object-oriented databases are represented by logical expressions that are not satisfied if even one condition is false.
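The idea of a constraint that only applies at certain ranks can be sketched as follows. This is a minimal illustration in Python and not part of any system reviewed here: the `Taxon` and `RankConstraint` classes, the three-level rank list, and the two rules themselves are assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Illustrative rank order (an assumption): a lower index means a higher rank.
RANKS = ["family", "genus", "species"]


@dataclass
class Taxon:
    name: str
    rank: str
    parent: Optional["Taxon"] = None


@dataclass
class RankConstraint:
    """A rule that is evaluated only for taxa whose rank is in applies_to."""
    applies_to: frozenset
    check: Callable[[Taxon], bool]
    message: str


def has_ancestor_at(taxon: Taxon, rank: str) -> bool:
    """Walk up the hierarchy looking for an ancestor at the given rank."""
    node = taxon.parent
    while node is not None:
        if node.rank == rank:
            return True
        node = node.parent
    return False


def parent_outranks_child(taxon: Taxon) -> bool:
    """A taxon without a parent trivially satisfies this rule."""
    if taxon.parent is None:
        return True
    return RANKS.index(taxon.parent.rank) < RANKS.index(taxon.rank)


# Hypothetical rules: the first is checked for species only,
# the second is checked at every rank.
CONSTRAINTS: List[RankConstraint] = [
    RankConstraint(
        frozenset({"species"}),
        lambda t: has_ancestor_at(t, "genus"),
        "a species needs a genus above it",
    ),
    RankConstraint(
        frozenset(RANKS),
        parent_outranks_child,
        "a parent must be of higher rank than its child",
    ),
]


def violations(taxon: Taxon) -> List[str]:
    """Return the messages of the applicable constraints that fail."""
    return [
        c.message
        for c in CONSTRAINTS
        if taxon.rank in c.applies_to and not c.check(taxon)
    ]
```

The point of the sketch is the `applies_to` guard: a rule that does not concern the rank of the taxon is skipped rather than evaluated as false, which is the behaviour that a flat logical expression over the whole object does not give.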
3.3. Graph Databases

The appeal of the graph structure is that it is very flexible, is able to model any kind of data (including relational or object-oriented data) to any degree of complexity, and allows data to be represented in an easily understood way [Gemis96] [BDHS96] [CT95]. Two main approaches for this kind of structure exist: a graph model using a schema, and the semistructured model where data is not described by a schema (often because the data is imported from foreign systems, or because it is inherently unstructured like hypertext documents). First, we focus on graph databases with a schema. These databases have the advantage of offering the power of graph structures, and providing at the same time a schema that guides user applications when data is inserted in the database. As a taxonomic classification hierarchy is a tree-like structure, it would seem an intuitive choice to model it using a structure that is inherently hierarchic. Candidate graph systems are GOOD [Gemis96], Hyperlog [PL94], and Spider [RK97].

Graph-oriented databases are a recent area of research, and there is no unanimously accepted model or system. Rather, there are many proposed models that differ slightly from each other (e.g. on whether the information is stored on labelled nodes (e.g. Hyperlog), on labelled edges, or on both (e.g. GOOD, Spider)), but share the same principles based on graph theory. These graph models can be divided into three categories: models that manipulate nodes and edges; models that manage hypernodes and edges; and finally models that extend these concepts to hypernodes and hyperedges. The first category is represented by models such as Spider and GOOD, which use a classical definition of graph structures. GOOD is the only one explicitly defining nodes for representing sets. Once a schema is defined, instances of this schema can be created. The only way of manipulating the data in the database is by graph rewriting.
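Graph rewriting, the manipulation mechanism mentioned above, can be illustrated with a small textual sketch. Spider and GOOD are graphical languages, so the Python below does not reproduce their syntax; the triple representation, the `?variable` convention and the function names are assumptions made for this example. A rewrite consists of a pattern to be matched against the graph, plus the edges to delete and the edges to add for each match.

```python
from typing import Dict, Iterator, List, Optional, Set, Tuple

Triple = Tuple[str, str, str]  # (source node, edge label, target node)


def is_var(term: str) -> bool:
    """Terms starting with '?' are variables (a convention of this sketch)."""
    return term.startswith("?")


def match(pattern: List[Triple], graph: Set[Triple],
          binding: Optional[Dict[str, str]] = None) -> Iterator[Dict[str, str]]:
    """Yield every variable binding under which all pattern edges are in graph."""
    binding = binding or {}
    if not pattern:
        yield dict(binding)
        return
    first, rest = pattern[0], pattern[1:]
    for edge in sorted(graph):
        candidate = dict(binding)
        ok = True
        for term, value in zip(first, edge):
            if is_var(term):
                # Bind the variable, or fail if it is already bound elsewhere.
                if candidate.setdefault(term, value) != value:
                    ok = False
                    break
            elif term != value:
                ok = False
                break
        if ok:
            yield from match(rest, graph, candidate)


def substitute(triple: Triple, binding: Dict[str, str]) -> Triple:
    """Replace bound variables in a triple by their values."""
    s, l, t = (binding.get(term, term) for term in triple)
    return (s, l, t)


def rewrite(graph: Set[Triple], pattern: List[Triple],
            delete: List[Triple], add: List[Triple]) -> Set[Triple]:
    """Apply one rewrite: for each match, remove and then add edges."""
    matches = list(match(pattern, graph))
    result = set(graph)
    for b in matches:
        result -= {substitute(t, b) for t in delete}
    for b in matches:
        result |= {substitute(t, b) for t in add}
    return result
```

For instance, moving every taxon placed in a hypothetical `genusA` to `genusB` is a rewrite whose pattern and deletion are both `("?s", "placed_in", "genusA")` and whose addition is `("?s", "placed_in", "genusB")`. Note that the user must already know that a `placed_in` edge exists in order to write the pattern.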
The second category can be represented by Hyperlog. Hyperlog distinguishes itself from classical graph models in that the basic piece of information, the hypernode, is a directed, labelled graph. It is not a flat graph structure, but a nested graph structure. A hypernode is defined as containing a set of nodes that may be constants or variables linked by edges (edges are seen as a pair of nodes). Hyperlog is a typed system where types can be atomic types (e.g. number or string) or hypernode types (user-defined types). The types defined can be instantiated by: replacing the type label with a unique instance label (which plays the role of an oid in object-oriented systems); replacing each node with zero or more nodes of the same type; and replacing each edge n1 → n2 with zero or more edges linking instances of type n1 and n2. As Hyperlog does not define sets, it is not possible to constrain that only one edge can exist in instances of any type.
Finally, the third category is a generalisation of the concepts offered by the first two: graphs can be generalised, and hypergraphs defined explicitly [CT95]. In [CT95], hypergraphs are composed of a set of nodes and a set of edges. These edges may be simple edges linking two nodes (arcs), or hyperedges or hyperarcs (an arc being a directed edge). Hyperedges may have different forms: structure hyperedges used to define the structure of an object; extension hyperedges that group individuals sharing common features; and group edges that group different individuals (e.g. set-valued properties).

The querying approaches can be divided into two different classes: graph-rewriting approaches, and set-manipulation approaches. The first class is the least commonly used. Some models, e.g. Spider and GOOD, use graph-rewriting queries in order to access the data in the database. These languages allow a user to specify graphically the matches and the transformations required on the graph using graph rewriting. In Spider, a program contains information similar in structure to the functions or predicates of declarative textual languages. Transformations are composed of a series of graph rewrites. Each rewrite has two graphs, one to represent the graph to be matched against the database, and the other to represent the changes to be made. Similarly, in GOOD, the query language (called PaMal) uses patterns, which are directed graphs, to find the parts of the database on which the operation will be executed. PaMal has only two operations: addition (of nodes and edges) and deletion (of nodes and edges). With the addition of loops, procedures, and program constructs, this is enough to create a computationally complete language. Pure graphical languages like GOOD or Spider are computationally complete and allow an inexperienced user to create queries in an intuitive way.
However, these languages are not suitable for structure discovery, as knowledge of the structure of the database is essential to the query. Other models use the more common and intuitive set-manipulation approach. [CT95] adopts a unique approach to querying. A query does not exist as such, but a particular set of data can be retrieved from the database incrementally by interaction with the user. The technique is based on on/off operations: the user starts with the schema of the database and then switches on or off portions of the graph until only the structure of the data of interest is left. The user selects the portions of the data where this mask must be applied, and the system returns a hypergraph as a result. Hyperlog, in contrast, integrates the concept of rules in its query language. A query is defined by a template (also a directed labelled graph) to be matched against the database. Templates differ from hypernodes in that they contain variable nodes, and that their nodes can be negated. They are created from a type, which allows the search in the database to be restricted to target types. A program in Hyperlog is rule based: it contains rules that are evaluated against the database. Each rule has a body containing templates, which will be matched against the database, and a head, which defines the transformations to be performed on the matches returned by the query.

As far as we know, the models presented in this section have not yet introduced the concept of view as defined on relational or object-oriented models. It is arguable that object-oriented views may be applied to simple graph structures. However, they would be inefficient, as each node may be treated as a distinct object, which would result in a very large number of objects being manipulated and assembled. Additionally, they may be impractical: for example, a complete schema would need to be defined for each basic graph viewed, and nested views managed for hypergraphs.
Views have not been included in graph-rewriting models, possibly because the result of a query is a modification of the input graph, and not a new output graph.

Application of Graph Databases to Taxonomy

To our knowledge, no taxonomic system has been built on top of a graph-based database. We believe that graph structures support taxonomic data more efficiently than object-oriented structures. They are able to capture the highly segmented aspect of taxonomic data in a natural way and offer good support for its hierarchical aspect. Some of them also support single and multiple relationships (e.g. GOOD) or provide indirect ways of supporting them (e.g. Hyperlog with its queries). The query languages are also better suited to taxonomic work because patterns (or templates), sometimes augmented with variables or path expressions (GraphDB [Guti94]), allow an easier definition of the subgraph queried, especially for non-computer specialists. However, graph models either lack the ability to support generic types or support them only in an object-oriented manner (e.g. GOOD). In addition, the lack of view support for graphs would force taxonomists to work on large sets of data, and the lack of integrity constraints could lead to a corrupt database and would not support the implementation of the nomenclatural code.
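To make the claim about hierarchical and segmented data more concrete, the following Python sketch stores several classifications of the same taxa as labelled parent edges in one graph, extracts a hierarchy from one of them, and compares two of them. It is only an illustration under stated assumptions: the `ClassificationGraph` class, its method names, and the taxon and classification identifiers used with it are hypothetical and do not correspond to any of the systems reviewed.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple


class ClassificationGraph:
    """Taxa are nodes; each classification contributes its own parent edges."""

    def __init__(self) -> None:
        # classification name -> {child taxon -> parent taxon}
        self._parent: Dict[str, Dict[str, str]] = defaultdict(dict)

    def place(self, classification: str, child: str, parent: str) -> None:
        """Record that `child` is placed in `parent` in one classification."""
        self._parent[classification][child] = parent

    def descendants(self, classification: str, root: str) -> Set[str]:
        """Extract every taxon below `root` in the given classification."""
        children: Dict[str, Set[str]] = defaultdict(set)
        for child, parent in self._parent[classification].items():
            children[parent].add(child)
        found: Set[str] = set()
        stack = [root]
        while stack:
            node = stack.pop()
            for child in children[node]:
                if child not in found:
                    found.add(child)
                    stack.append(child)
        return found

    def differing_placements(self, a: str, b: str) -> Dict[str, Tuple[str, str]]:
        """Taxa present in both classifications but placed under different parents."""
        pa, pb = self._parent[a], self._parent[b]
        return {t: (pa[t], pb[t]) for t in pa.keys() & pb.keys() if pa[t] != pb[t]}
```

Because the taxa are shared nodes and only the placement edges differ, the comparison of two classifications reduces to a comparison of edges, without either classification being marked as the correct one.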
Non schema-based graph databases and semi-structured data

Some graph databases do not use a schema in order to create a model for the data that can be input into the system, e.g. Hygraph [Cons94] or Hypernode [LL95]. Hygraph is a hybrid between a flat and a nested graph model. It provides simple relationships between nodes and a container relationship that allows the definition of part-of relationships. An important point in Hygraph is that the model is not interpreted, therefore no semantic meaning is attached to the nodes and edges apart from their name (which can be interpreted later). Hypernode is a model similar to Hyperlog, described above, but it does not use a schema as a mould for data. The queries in Hygraph are based on patterns. These patterns consist of nodes and edges where edges are labelled by path expressions. A pattern is a subset of the database that will be matched against the data stored in order to extract information. Patterns in Hygraph allow recursive matches without explicit recursion. In Hygraph, queries are filtering queries. Filtering queries do not create new relationships, and therefore avoid many problems generated by view updates (all entities in the output are present in the input of the query). Hypernode does not offer a query language with powerful search abilities. It only provides a simple declarative language with basic operations that can manipulate a graph (add/delete nodes and edges) and hypernodes (add/delete). In addition, it provides loops and conditional statements.

Semi-structured data is derived from graph structures (e.g. Tsimmis [PG-MW95] or [BDS95]). It was born from the realisation that traditional data models, including classical graph models, were unable to capture certain types of data where the structure is absent or partial, the schema ignored or rapidly evolving, or the distinction between schema and data blurred [Abi97].
Many descriptions of semi-structured databases exist in the literature, e.g. [BDS95], Tsimmis and Lore [MAG+97]. The differences between these models are mainly that Lore uses the OEM (Object Exchange Model) data model defined for the Tsimmis project, which is a graph system manipulating nodes and labelled edges, whereas [BDS95] is a recursive model where only labelled edges are manipulated. These differences are important in that they define how the database is built (e.g. recursively for [BDS95]), and how it works (queries, views). These models have two main goals: integrating data from heterogeneous sources, and capturing data that is inherently unstructured. They are more often used to describe and store data than to modify it once inside the database. The fact that the structure of the data manipulated is unknown or highly incomplete causes problems for data manipulation, query formulation, and query optimisation. Thus it appears necessary to find structure in unstructured data and to represent it. Lore uses a complex layered structure built on top of data translators [CDSS98] and mediators such as dataguides [GW97] or representative objects [NUW+97] to achieve this aim, and other systems use similar techniques [BDFS97]. These mediators may be manually defined, or (semi-)automatically generated [AK97] [NAM97]. The discovery of the implicit structure of a set of data may be managed in two ways: firstly, by calculating the distance between the different objects found [NAM97] and grouping objects with close structure; and secondly, by adjusting paths in the database so that a target object can only be reached by a single path [GW97]. Some systems use mediators or translators to achieve the definition of more efficient and more expressive query languages over semi-structured databases. These mediators encapsulate heterogeneous data sources into one common model or translate queries into queries understandable by each different data source, as explained previously. The obvious disadvantage of these mediators/translators is that they must be developed for each different data source and may be complex.

The query languages are of three sorts: navigation-based, recursive, or pattern matching. An example of a navigation-based language is that of [MMM98], which is built on top of a system representing hypertext documents. A navigational algebra (NALG) is defined to provide relational views over HTML pages. It provides the classical operators (selection, projection, join), plus specific operators (unnest page and follow link). Optimisation of queries is done in the following way: the original query is translated into the corresponding projection-selection-join expression; this expression is converted into a computable NALG expression, which is repeatedly rewritten by applying NALG rewriting rules in order to derive a number of candidate execution plans; the cost of these alternatives is evaluated; and the best one is chosen, based on a specific cost model. The recursive query language defined in [BDHS96] is a direct consequence of the recursive structure of the database (a graph is a label pointing to another graph). This language offers powerful tools to traverse the database (the traverse function and markers) and extract subgraphs from the database graph. Simple patterns can also be defined.
However, this language is perceived as not user-friendly because of its unfamiliar appearance and behaviour. Finally, a more intuitive approach is taken in Lore [AQM+96], Tsimmis [QRS+95], and [CACS94]. These languages are based on path expressions like those of object-oriented models. Since the database structure is not expected to be known by the user, these path expressions offer great flexibility by providing variable paths and attributes. In these systems, especially Lore, the path to an object is as important as its value [MAG+97] so the query language allows querying of the schema, but avoids the use of classical indexing techniques. In order to extend the flexibility of the query, Lore’s query language Lorel provides a high level of data coercion.

The definition of views over semi-structured data is a very new area of research. A unique motivation for views when dealing with semi-structured data is that views can be used to introduce some structure [AGM+97]. A view is a subgraph of the database (possibly extended with new objects and new relationships) with references to real objects in the source database. In Lore, the creation of a view is achieved by running a query and specifying the parts of the data that are to be included in the view [AGM+97]. In a view, each object is represented by a delegate object that is a copy of the original object, so that views are independent units. If views are materialised, it becomes difficult to keep the view consistent when updates occur in the database. Views affected by a change in the database are detected by inverse path expressions from the modified object [AMR+98]. However, this view system is still immature, so no reclassification of virtual graphs has been considered.

Application to taxonomy

Hygraph is an interesting model because it can be seen as flat as well as nested (if blobs are interpreted as containers).
However, because it is a purely syntactic model (and therefore not interpreted), it does not provide all the mechanisms that taxonomic concepts require. Moreover, the absence of a schema makes it difficult to implement the nomenclatural code. Nevertheless, the query language it understands is of great interest for taxonomists because it allows the definition of path expressions, which add flexibility to queries, and recursion without explicitly representing it.
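The value of path expressions that hide recursion can be sketched in a few lines of Python. The notation below is not Hygraph's: the dotted path syntax, the `+` and `*` suffixes, and the function names are assumptions made for the example. A query such as `placed_in+` returns every taxon above a given one without the user writing a recursive query.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]  # (source node, edge label, target node)


def step(graph: Set[Triple], nodes: Set[str], label: str) -> Set[str]:
    """Follow one edge with the given label from any of the nodes."""
    return {t for (s, l, t) in graph if s in nodes and l == label}


def closure(graph: Set[Triple], nodes: Set[str], label: str) -> Set[str]:
    """Follow the label one or more times; terminates even on cyclic graphs."""
    reached: Set[str] = set()
    frontier = step(graph, nodes, label)
    while frontier:
        reached |= frontier
        frontier = step(graph, frontier, label) - reached
    return reached


def evaluate(graph: Set[Triple], start: str, path: str) -> Set[str]:
    """Evaluate a dotted path expression such as 'placed_in+' from a start node.

    Each dot-separated token is an edge label, optionally suffixed with
    '+' (one or more repetitions) or '*' (zero or more repetitions).
    Labels containing '.', '+' or '*' are not supported by this sketch.
    """
    current = {start}
    for token in path.split("."):
        if token.endswith("+"):
            current = closure(graph, current, token[:-1])
        elif token.endswith("*"):
            current = current | closure(graph, current, token[:-1])
        else:
            current = step(graph, current, token)
    return current
```

With a hypothetical graph in which `sp1` is placed in `genusA`, which is in turn placed in `familyX`, the expression `placed_in+` evaluated from `sp1` yields both `genusA` and `familyX`, whereas `placed_in` alone yields only `genusA`.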
Like graph databases, semi-structured databases support many aspects of taxonomic data. The semi-structured approach offers the advantage of supporting highly incomplete data, which may happen in some parts of taxonomic work, and would support the integration of existing databases whose structure may be unknown, incomplete, or incompatible with the main taxonomic model. However, the absence of a schema would open the door to errors and would not allow an implementation of the nomenclatural code. The view mechanism offered by Lore would provide a means to extract subgraphs from the database in order to work on a simpler environment for taxonomic work.
4. Conclusion

We have first described taxonomic data and taxonomic work. Taxonomic data is highly hierarchical and involves the sharing of common data between different hierarchies. The work taxonomists do on this data is based on the nomenclatural code that specifies the rules that must be followed, but also necessitates a lot of freedom. Taxonomists are told by the nomenclatural code how they must articulate and publish their data, but not how to choose it or on which criteria they must build their classifications. These properties call for a very flexible system that can react to some situations while leaving the user to take many decisions. We have also shown in this report how current taxonomic databases work and why they do not properly support taxonomists’ work. A basic requirement of taxonomic work is the ability to model the real world according to any taxonomist’s opinion, without formulating any judgement on the validity of this view. All but two of the taxonomic databases we have reviewed fail to support the representation of many different classifications of the same plants, which limits the ability of the system to work with taxonomists and offer them enough freedom to work as they are accustomed to. Moreover, these databases often rely on a taxonomic model that does not clearly separate the process of naming from the process of classifying. This invariably leads to the formulation of a judgement when a plant is classified. The two databases that try to represent multiple classifications (HICLAS and IOPI) do not fully support taxonomic work either. Indeed, IOPI does not differentiate between the status of a name and the status of a taxon in a classification, forcing the user to make a statement about the status of a taxon. Moreover, the name of a taxon is attached to the taxon instead of being derived from the context in which the taxon exists.
HICLAS takes a different approach, but still does not differentiate between the process of naming and the process of classifying. Nor does it represent the different classifications in a way that would allow the user to compare them (no information is shared between classifications). These systems, although more developed than many, do not fully support the work of taxonomists with the necessary freedom. The Prometheus group has thus developed a new taxonomic model that allows taxonomists to represent multiple classifications, differentiates the process of naming from the process of classifying, and models data in a way that makes it possible to compare classifications.

The analysis of the needs expressed by taxonomists allowed us to create a list of the requirements a database system supporting taxonomic work should satisfy. This list was then used to compare existing database models and assess their ability to support taxonomic data. This comparison showed that the relational model mainly lacks the ability to query taxonomic data in a way that provides meaningful results, because many taxonomic queries require the extraction of hierarchical data and therefore involve graph-traversing techniques. The object-oriented model also suffers from the inability to extract and manipulate hierarchies. These two models also fail to represent synonymy properly, both at type level (where inheritance can be considered as a substitute, although it would be a bad use of this mechanism) and at instance level (where no comparable mechanism exists). We have discovered that a graph-based database would offer a good basis for supporting the mechanisms involved in taxonomic work, but the graph models reviewed also lack some of the features we consider necessary, such as different kinds of relationships between entities. Indeed, the graph model is essentially a simple data model, and the models we have reviewed do not satisfy all our criteria. However, since it is the most flexible and simple model, we think it is the most promising. Further work includes the definition of a database model that supports the taxonomic model defined by the Prometheus group, and its development.
5. References

[ABD+90] Malcolm P. Atkinson, François Bancilhon, David J. DeWitt, Klaus R. Dittrich, David Maier, Stanley B. Zdonik, "The Object-Oriented Database System Manifesto", Proceedings of the First International Conference on Deductive and Object-Oriented Databases (DOOD'89), Kyoto Research Park, Kyoto, Japan, pp 223-240 (1989)
[Abi97] Serge Abiteboul, "Querying semi-structured data", Proceedings of the International Conference on Database Theory, Delphi, Greece, pp 1-18 (1997)
[AGM+97] S. Abiteboul, R. Goldman, J. McHugh, V. Vassalos, Y. Zhuge, "Views for Semistructured Data", International workshop on management of semi-structured data, Tucson, USA (1997)
[AK97] Naveen Ashish, Craig A. Knoblock, "Wrapper Generation for Semi-structured Internet Sources", SIGMOD Record Vol. 26 Issue 4, pp 8-15 (1997)
[AMR+98] S. Abiteboul, J. McHugh, M. Rys, V. Vassalos, J. Wiener, "Incremental maintenance for materialized views over semi-structured data", VLDB'98, Proceedings of 24th International Conference on Very Large Data Bases, New York City, New York, USA, pp 38-49 (1998)
[AQM+96] S. Abiteboul, D. Quass, J. McHugh, J. Widom, J.L. Wiener, "The Lorel Query Language for Semi-structured Data", International Journal on Digital Libraries Vol. 1 Issue 1, pp 68-88 (1996)
[BD92] B. Bartholomew, T. Duncan, "The specimen management system of California herbaria as a model for an inter-institutional distributed database system", Phytogeography and Botanical Inventory of Taiwan, Academia Sinica Monograph Series No. 12, pp 82-91 (1992)
[BDFS97] Peter Buneman, Susan B. Davidson, Mary F. Fernandez, Dan Suciu, "Adding Structure to Unstructured Data", Database Theory - ICDT '97, 6th International Conference, Delphi, Greece, pp 336-350 (1997)
[BDHS96] Peter Buneman, Susan Davidson, Gerd Hillebrand, Dan Suciu, "A query language and optimization techniques for unstructured data", Proceedings of ACM-SIGMOD International Conference on Management of Data, Montreal, Canada, pp 505-516 (1996)
[BDS95] Peter Buneman, Susan Davidson, Dan Suciu, "Programming Constructs for Unstructured Data", DBPL-5 Proceedings of the Workshop on Database Programming Languages, Gubbio, Umbria, Italy, p 12 (1995)
[Bella96] Zohra Bellahsene, "View Mechanism for Schema Evolution in Object-Oriented DBMS", 14th British National Conference on Databases, BNCOD 14, Edinburgh, Scotland, pp 18-35 (1996)
[Bella97] Zohra Bellahsene, "Updating Virtual Complex Objects", OOIS'97, 1997 International Conference on Object Oriented Information Systems, Brisbane, Australia, pp 422-432 (1997)
[Beren97] W. Berendsohn, "International Organization for Plant Information", Botanical Garden and Botanical Museum Berlin-Dahlem, http://www.bgbm.fu-berlin.de/iopi/iopimodel73/7301root.htm (1997)
[BPB93] J. H. Beach, S. Pramanik, J. H. Beaman, "Hierarchic taxonomic databases", Ch. 15 in Fortuner, R., ed., Advances in Computer Methods for Systematic Biology: Artificial Intelligence, Databases, Computer Vision, Johns Hopkins Univ. Press, Baltimore, pp 241-256 (1993)
[CACS94] Vassilis Christophides, Serge Abiteboul, Sophie Cluet, Michel Scholl, "From Structured Documents to Novel Query Facilities", Proceedings of the 1994 ACM SIGMOD International Conference on Management of Data, Minneapolis, Minnesota, USA, pp 313-324 (1994)
[CAW98] S. Chawathe, S. Abiteboul, J. Widom, "Representing and Querying Changes in Semi-structured Data", Proceedings of the Fourteenth International Conference on Data Engineering, Orlando, Florida, pp 4-13 (1998)
[CBB+97] R. G. G. Cattell, Douglas Barry, Dirk Bartels, Mark Berler, Jeff Eastman, Sophie Gamerman, David Jordan, Adam Springer, Henry Strickland, Drew Wade, "The Object Database Standard: ODMG 2.0", Morgan Kaufmann Publishers, Inc., ISBN 1-55860-463-4 (1997)
[CDSS98] Sophie Cluet, Claude Delobel, Jérôme Siméon, Katarzyna Smaga, "Your Mediators Need Data Conversion!", SIGMOD 1998, Proceedings ACM SIGMOD International Conference on Management of Data, Seattle, Washington, USA, pp 177-188 (1998)
[Cluet97] Sophie Cluet, "Modeling and Querying Semi-structured Data", Information Extraction: A Multidisciplinary Approach to an Emerging Information Technology, International Summer School, SCIE-97, Frascati, Italy, pp 192-213 (1997)
[Codd70] E. F. Codd, "A Relational Model of Data for Large Shared Data Banks", Communications of the ACM (CACM) Vol. 13 Issue 6, pp 377-387 (1970)
[Cons94] Mariano P. Consens, "Creating and Filtering Structural Data Visualizations using Hygraph Patterns", PhD Thesis, Department of Computer Science, University of Toronto (1994)
[CT95] T. Catarci, L. Tarantino, "A Hypergraph-based Framework for Visual Interaction with Databases", Journal of Visual Languages and Computing Vol. 6 Issue 2, pp 135-166 (1995)
[C-TR96] Viviane Crestana-Taube, Elke A. Rundensteiner, "Schema Removal Issues for Transparent Schema Evolution", Sixth International Workshop on Research Issues on Data Engineering, Interoperability of Non traditional Database Systems, RIDE'96, IEEE, New Orleans, Louisiana, pp 138-147 (1996)
[Filer94] D. L. Filer, "BRAHMS - Botanical Research and Herbarium Management System", A pocket introduction and Demonstration Guide, Oxford Forestry Institute, 28 pp (1994)
[Gemis96] M. Gemis, "Graph-based Languages in DBMS", PhD Thesis, Universiteit Antwerpen, Departement Wiskunde - Informatica (1996)
[Guti94] Ralf Hartmut Guting, "GraphDB: Modeling and Querying Graphs in Databases", Proc. 20th Int. Conf. on Very Large Databases, Santiago, Chile, pp 297-308 (1994)
[GW97] Roy Goldman, Jennifer Widom, "DataGuides: Enabling Query Formulation and Optimization in Semi-structured Databases", VLDB'97, Proceedings of 23rd International Conference on Very Large Data Bases, Athens, Greece, pp 436-445 (1997)
[JR96] Harumi A. Kuno and Elke A. Rundensteiner, "The MultiView OODB View System: Design and Implementation", Journal of Theory and Practice of Object Systems (TAPOS), Special Issue on Subjectivity in Object-Oriented Systems Vol. 2 Issue 3, pp 202-225 (1996)
[KKS92] Michael Kifer, Won Kim, Yehoshua Sagiv, "Querying Object-Oriented Databases", Proceedings of the 1992 ACM SIGMOD International Conference on Management of Data, San Diego, California, pp 393-402 (1992)
[KRR95] Harumi A. Kuno, Young-Gook Ra, and Elke A. Rundensteiner, "The Object-Slicing Technique: A Flexible Object Representation and Its Evaluation", Technical report, Dept. of Elect. Engineering and Computer Science, Software Systems Research Laboratory, The University of Michigan (1995)
[LL95] M. Levene, G. Loizou, "A graph-based data model and its ramifications", IEEE Transactions on Knowledge and Data Engineering, Vol. 7 Issue 5, pp 809-823 (1995)
[MAG+97] J. McHugh, S. Abiteboul, R. Goldman, D. Quass, J. Widom, "Lore: A Database Management System for Semi-structured Data", SIGMOD Record Vol. 26 Issue 3, pp 54-66 (1997)
[MMM98] Giansalvatore Mecca, Alberto O. Mendelzon, Paolo Merialdo, "Efficient Queries over Web Views", Advances in Database Technology - EDBT'98, 6th International Conference on Extending Database Technology, Valencia, Spain, pp 72-86 (1998)
[NAM97] Svetlozar Nestorov, Serge Abiteboul, Rajeev Motwani, "Inferring Structure in Semi-structured Data", SIGMOD Record Vol. 26 Issue 4, pp 39-43 (1997)
[NUW+97] S. Nestorov, J. Ullman, J. Wiener, S. Chawathe, "Representative Objects: Concise Representations of Semi-structured, Hierarchical Data", Proceedings of the 13th International Conference on Data Engineering (ICDE'97), Birmingham, U.K., pp 79-90 (1997)
[Pan93] R. J. Pankhurst, "Taxonomic Databases: The PANDORA System", Advances in Computer Methods for Systematic Biology: Artificial Intelligence, Databases, Computer Vision, Johns Hopkins University Press (1993)
[PG-MW95] Yannis Papakonstantinou, Hector Garcia-Molina, Jennifer Widom, "Object Exchange Across Heterogeneous Information Sources", Proceedings of the Eleventh International Conference on Data Engineering, Taipei, Taiwan, pp 251-260 (1995)
[PK99] M. Pullan, J. Kennedy, et al., "The Prometheus Taxonomic Model", submitted to Taxon (1999)
[PL94] A. Poulovassilis, M. Levene, "A nested-graph model for the representation and manipulation of complex objects", ACM Transactions on Information Systems Vol. 12 Issue 1, pp 35-68 (1994)
[QRS+95] Dallan Quass, Anand Rajaraman, Yehoshua Sagiv, Jeffrey D. Ullman, Jennifer Widom, "Querying Semi-structured Heterogeneous Information", Deductive and Object-Oriented Databases, Fourth International Conference, DOOD'95, Singapore, pp 319-344 (1995)
[RBP+91] James Rumbaugh, Michael Blaha, William Premerlani, Frederick Eddy, William Lorensen, "Object-Oriented Modelling and Design", Prentice Hall International Editions (1991)
[RK97] P. J. Rodgers, P. J. H. King, "A graph rewriting visual language for database programming", Journal of Visual Languages & Computing Vol. 8 Issue 5/6, pp 641-674 (1997)
[Run92] Elke A. Rundensteiner, "A Class Integration Algorithm and its Application for Supporting Consistent Object Views", Information and Computer Science Department, Univ. of California, Irvine, Technical report 92-50 (1992)
[SLPS95] H. Saarenmaa, S. Leppäjärvi, J. Perttunen, J. Saarikko, "Object-oriented taxonomic biodiversity databases on the World Wide Web", Manuscript 7 pp, IUFRO XX World Congress, Tampere, Finland, August 6-12, 1995, EFI Proceedings (1995)
[Web99] On-line Webster Dictionary, http://www.m-w.com/netdict.htm
[WAW93] R. J. White, R. Allkin, P. K. Winfield, "Systematic Databases: The BAOBAB Design and the ALICE system", Advances in Computer Methods for Systematic Biology: Artificial Intelligence, Databases, Computer Vision, Johns Hopkins University Press (1993)
[WO93] K. S. Walter, M. J. O'Neal, "BG-BASE: Software for botanical gardens and arboreta", The Public Garden, October 1993, 21-22, 34 (1993)