Mathl. Comput. Modelling Vol. 25, No. 10, pp. 109-127, 1997 Copyright © 1997 Elsevier Science Ltd Printed in Great Britain. All rights reserved 0895-7177/97 $17.00 + 0.00
Pergamon
PII: S0895-7177(97)00078-2
Basic Properties for Biological Databases: Character Development and Support J. DIEDERICH Department of Mathematics, University of California-Davis Davis, CA 95616, U.S.A.
[email protected]
(Received November 1995; accepted January 1996)
Abstract-In this paper, we examine problems and solutions for building a large set of characters for descriptive data derived from published species descriptions. The ideas presented lead in the direction of creating a kind of BioDBMS that can be used to support large integrated biological databases.
Keywords-Biological databases, Biological characters, Data modeling, Basic property, Schema design.
1. INTRODUCTION
Among the types of taxonomic databases [1], i.e., curatorial including geographical, nomenclatural, bibliographical, and morpho-anatomical (descriptive), descriptive data presents the greatest challenges to the database designer and software developer, with far less supporting software for managing the data than for the others [2]. One explanation for this may be the structure and nature of biological research. Institutions such as horticultural museums and herbaria have large collections of specimens. Consequently, they have the greatest need for creating formal electronic databases to manage their collections: to locate specimens, track loans to researchers, add new specimens, and the like. The structure of their data is, generally, not as complex as the structure of descriptive data. On the other hand, descriptive data has, generally speaking, been the province of individual researchers who, more often than not, focus on a small number of species with a limited number of external features. There is little uniformity across these databases, and sharing of electronic data can be difficult, though some software has been developed to help [3,4]. While the complexity and the semantics of the data can be managed for small databases used by experts on the few species represented, this becomes much harder when the number of species expands into the thousands and the database is to be used by nonexperts too. Biologists have made some strides in exploiting database technology [2,5-7] and have set standards to allow for data exchange for taxonomic databases [3,4,8]. Much of this work has focused on how to utilize relational DBMSs, to avoid duplication of effort, and to help avoid problems that seem to inevitably infect many hand-tailored taxonomic databases [2]. Still, database methods and techniques available for handling descriptive data are inadequate.
Recent efforts have been aimed at exploring management of taxonomic biodiversity information using object-oriented databases and the World Wide Web [9].
The work discussed in this paper, to examine problems and propose some solutions in creating large descriptive databases, is part of the NEMISYS Project [10-12], an effort to build an identification system for the approximately 4000 species of plant-parasitic nematodes. It may eventually include an equal number of nonplant-parasitic species. For the most part, the source of our data will be approximately 10,000 descriptions published in various journals over much of this century. Two important aspects of creating and using a descriptive database are data modeling and data semantics. Difficulties arise in modeling the data because the source of the data is published descriptions. One important goal in creating a descriptive database is to faithfully represent the data in the descriptions, correcting for errors and omissions whenever possible. However, standards have not been set over the years for describing species. With each author placing his or her stamp on a given description, it remains a challenge to capture and use the data from this rich and available source. Characteristics that are important for some species are not important for others, so there is no small core of characteristics on which to focus. Since standard modeling practice involves determining the data structures in advance of acquiring the data, it would be necessary to know what characteristics of species are described in thousands of descriptions and to know how they are described, a difficult if not impossible requirement. The option of forcing the data into a predetermined structure simply will not work for large collections if the goal is to faithfully record the data. Thus, it is necessary, to some extent, to create the database structures as the data is acquired. Consequently, building a complete data model prior to data acquisition is impossible in such cases; modeling must remain a dynamic process, which poses significant challenges for maintaining a consistent and uniform model.
Each new description can cause changes in the model. Yet, it is not possible to take the properties exactly as they come from the literature, since the resulting database structures would then be too chaotic to use effectively. This seems like an impossible task with contradictory goals. If the data is represented faithfully, the data structures would be too disorganized, while on the other hand, if the data is too rigidly structured, there would be a loss of information. In this paper, we introduce new modelling concepts that support character set creation and make these contradictory goals tractable. A subsequent paper [13] will explore the use of these concepts in a wide variety of practical situations and will develop guidelines for their use. What we are attempting goes somewhat counter to standard practice. Biological data is often reformulated in some fashion prior to its storage and use, usually in small individual databases, to support a particular type of activity such as identification within a particular taxon using dichotomous keys. For example [14], in some species the manipulated data may consist of two states, obviously winged and not obviously winged, for identification within a certain taxon, while for phenetic classification it may be necessary to break these down into their various types such as narrowly winged, widely winged, terete, and striate. However, a problem arises when new species are added to the system, since the structure of the data often needs to be modified to remain consistent relative to its intended use. Additionally, such manipulated data is difficult to use for purposes other than that specifically intended, and often the data does not reflect what was in the original source. Thus, data semantics have to be supported above the level of the database structures to allow for a wide variety of uses. This is analogous to database views, where different users see the data structure according to their needs.
However, database views are not sufficiently powerful to deal with the variety of uses of biological data. Clearly, the manner of supporting the semantics will affect how the data is structured. It is true that some of the complexity we have encountered is due to the nature of these microscopic roundworms, where internal as well as external parts are used in descriptions for differentiating taxa. In many other areas, only a few dozen characteristics are required for differentiating and describing species. Nevertheless, if large descriptive databases are to be constructed and maintained, many of the problems we address will have to be solved for these databases if they are to be integrated.
In the remainder of this paper, in Section 2, we present some background. In Section 3 problems are discussed that arise in creating a large list of biological characters. In Section 4, we introduce the major concept of basic property and its features to facilitate handling these problems, including the representation and use of state-based relationships in Section 5. In Section 6, we extend the idea of name extension introduced in Section 4 to another context, and in Section 7, we briefly discuss schema changes.
2. BACKGROUND
In conceptualizing descriptive data, the biologist thinks in terms of what is commonly called the “data matrix,” i.e., a taxon by character array [2]. Within the matrix are the states or numerical values of the species for that character. Some clarification may be needed with regard to our meaning of character since the term does not have a standard definition among biologists. Even the standardization of the concept of a character would go a long way towards integrating descriptive databases. At one extreme, in a biological key a character used for differentiation among species may be as complex as “esophagus with valve-like expansion about one to two head diameters posterior to base of stoma; amphid unispiral.” At the other end of the spectrum, the practice is to treat a character as a single characteristic such as Flower color or Leaflet presence, each considered as an atomic unit and stored as data within a single field in a data table as shown in Figure 1 [2]. In some cases, the character includes a state as well, as in petal pink. Also, characters are definitely not fully decomposed in a hierarchical fashion, though DELTA [4] allows a one-level breakdown into character subheadings.
| Taxon           | Character        | Data value |
| Vicia cracca    | flower colour    | blue       |
| Vicia cracca    | flower colour    | violet     |
| Lathyrus aphaca | leaflet presence | absent     |
Figure 1. Characters as a unit in a single field [2].
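To make the limitation of the single-field layout in Figure 1 concrete, the following is a minimal sketch, not anything from the cited system. The names `DecomposedRow`, `decompose`, and `taxa_with_property` are hypothetical, and the word-splitting rule is an assumption made only for illustration. It shows that a question about a property can be answered directly only once the character string is split into separate fields.

```python
from dataclasses import dataclass

# Rows as in Figure 1: (taxon, character, data value), with the character
# held as one opaque string in a single field.
SINGLE_FIELD_ROWS = [
    ("Vicia cracca", "flower colour", "blue"),
    ("Vicia cracca", "flower colour", "violet"),
    ("Lathyrus aphaca", "leaflet presence", "absent"),
]


@dataclass(frozen=True)
class DecomposedRow:
    """Hypothetical record with the character split into two fields."""
    taxon: str
    structure: str
    prop: str
    value: str


def decompose(taxon: str, character: str, value: str) -> DecomposedRow:
    # Assumption for this sketch only: the first word names the structure
    # and the remainder names the property. Real character names need
    # expert judgment, not string splitting.
    structure, prop = character.split(" ", 1)
    return DecomposedRow(taxon, structure, prop, value)


DECOMPOSED = [decompose(*row) for row in SINGLE_FIELD_ROWS]


def taxa_with_property(rows, prop):
    """Return the sorted taxa that have any value recorded for a property."""
    return sorted({row.taxon for row in rows if row.prop == prop})


print(taxa_with_property(DECOMPOSED, "colour"))
```

With the single-field rows, the same question would require parsing every character string at query time, which is where inconsistent naming begins to matter.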
We take a character to be a triple: (biological structure, property, state/value). A biological structure, or structure, can be a system, organ, organ part, tissue, etc. A property is usually an attribute of the structure such as shape, length, color, and so forth. A state is the quality of the property, such as round or pink, while a value is taken to be a numerical value. Also, structures can have substructures as well, as seen in Figure 2. This is more consistent with usage in database design, where entity (structure), attribute (property), and value are the main elements. At times we will use the terminology list of characters or characters to simply mean a list of (structure name, property) tuples, ignoring the states, which in examples will be written without the parentheses as in body, shape. We do not impose the restriction that a character must differentiate taxa, though usually it does, and a bit of nondescriptive data is usually included such as host, soil, and stage (gender). Our definition of a character will become more evident when basic properties are introduced later in this paper. An extended discussion of the definition of a character will be found in [13]. Biologists often talk about the “data matrix” for a domain even though it may in fact not exist. In nematology, for instance, a data matrix has never been constructed for large sets of species. The number of characters needed for small sets of species can usually be limited to a handful of differentiating characters. As the number of species increases, the number of characters needed increases as well, and the complexity tends to frustrate development of a data matrix of any appreciable size, rendering the notion of “data matrix” more of a concept than a reality. While it might seem conceptually possible to combine the data from data matrices for small collections into a single large one, in practice it can be difficult to do since the problems that occur with
Structures/Substructures      Property    State
Digestive system
  Oesophagus
    Median bulb               shape       slight swelling
                                          fusiform
                                          oval
                                          round
                                          quadrangular
                                          violin
Figure 2. Sample hierarchy for descriptive character data.
small sets of characters would likely be solved in very different ways, making integration difficult. Part of what we present here is to help standardize this process.
3. SCHEMA CREATION PROBLEMS
In this section, we examine some of the problems of creating a large schema of descriptive characters. In the NEMISYS project, the nematologist took charge of building the original list of characters, and he felt, with about 125 characters, excluding states, that the list was close to complete. At this size, the list seemed to be quite manageable.
Unfortunately, the number of characters continued to grow causing significant complications. The nature of the problems was not always apparent. Often there was a tension between the semantics and the structure, as one can attempt to capture too much of the semantics via the structure. It also became clear that a tool incorporating new concepts and standards would be needed to manage the list of characters. It may be somewhat surprising that the initial estimate of 125 characters would be so different from the current number of over 700, even with techniques designed to consolidate the character set. However, this illustrates the point discussed earlier that descriptions reflect their authors’ tastes as well as the fact that standards have not been set with an eye towards electronic storage and retrieval.
Testis
  Anterior part of the testis
    Spermatocytes             shape       cuboidal
                                          circular
  Posterior part of the testis
    Spermatocytes             shape       cuboidal
                                          circular
Figure 3. A possible decomposition.
There are some well-known, inherent problems in creating any hierarchical decomposition. A simple example is having to choose among decompositions according to function, i.e., should one make digestive muscles part of the digestive system or part of the muscular system? Likewise,
one might wish other groupings such as by region. These multiple views could easily be handled in a “schema tool” that supports alternative groupings of structures. While views can be quite complicated in some domains, e.g., viewing an airplane’s structure, aerodynamics, or electronics [15], this degree of complexity does not appear to be needed in descriptive databases, though it is easy to envision more complex ones that include data on physiology, ecology, and the like where they might be needed. Even within a structural decomposition, problems can arise. For example, the testis contains spermatocytes, which in some cases have different shapes depending on whether they are contained in the anterior end or posterior end. In choosing the decomposition as shown in Figure 3, there is an advantage whenever the difference is exhibited in a species, but this is not desirable when the shape is the same at both ends since the result would have to be stored twice. There are many other plausible and seemingly reasonable alternative decompositions, but all tend to lead to problems in creating and utilizing the database, and would of course create difficulties in integrating even small databases as shown in [13].
73. Numero de estambres (cuando el perianto es presente)
    1. diferente del numero de petalos o de sepalos
    2. igual al numero de petalos o de sepalos
74. Posicion de estambres (cuando el numero de estambres = numero de tepalos)
    1. no opuestos a los petalos
    2. opuestos a los petalos (o alternos a los sepalos)
75. Estambres - numero
    1. no mas de quince
    2. mas de quince
76. Numero de estambres
    1. uno
    2. dos
    3. tres
    ...
    10. diez
    11. mayor de once
    12. de quince en adelante
77. Anteras fertiles - numero
    1. 1
    2. 2
    3. 3
    ...
    10. 10
    11. 11 o mas
Figure 4. Some characters from the flora of Veracruz database [16].
CONSISTENCY AND UNIFORMITY.
As the size of the list of characters expanded, the biologist felt a sense of loss of control. One of the main problems was trying to maintain a uniform and consistent set. While one can solve this problem just by being consistent, it is not so easy to remember how similar characters were expressed in other parts of the list. For example, the property presence with possible states present/absent is one way to represent whether a part is present or not. However, the property visibility with states absent/faint/clear/conspicuous is another way, with the latter three implying the presence of the part. As another example, consider the characters and states found in Figure 4 from [16]. Here we see four ways of representing numerical values: as integers (1, 2, 3, etc.) in character 77, as strings (uno, dos, tres, etc.) in character 76, as ranges in character 75, and as comparators in character 73, though the latter two are attempts to deal with a problem discussed below. One also may confront mixed expressions in the same character such as (1, 2, 3, 4, 5, or more) as in character 77 or (1, 2, 3, 4, 5, half a dozen, about a dozen, many) similar to character 76. These are all characters that appear with one another, where consistency should be easy to observe, yet even here it has not been maintained. Also note that the expression of the character name is sometimes numero de and at other times - numero. While this may seem a trivial concern, it can have implications for supporting and using the characters. In other cases, useful information may be embedded in the character name, which makes it difficult to process the character properly. For example, the properties shape of the female and shape of the male hide from the system the fact that one character is for females and the other is for males, whereas characters found in the female genital system and the male genital system would easily be detected as gender based and could thus be used appropriately by the system, as in an identification where an unknown is known to be female.
IMPLICIT PROPERTIES. For some biological characters the property is implicit and it may be difficult to determine what the correct property is. For example, in “body 200 μm, smooth” one implicit property is the length of the body, but it is unclear what the correct property is for the state smooth. One solution is to create a property out of one of the states; in this case it could be smoothness, with smooth one of its states. We observed that when property names were not naturally associated with a character, the biologist tended to use artificial property names such as aspect, type, situation, or nature. They were often used interchangeably and at times inconsistently. For instance, in one character the property nature may be used to indicate whether the part is faint or well marked, while in another aspect is used for the same states, and in another visibility is used. It becomes even more problematic when there are multiple distinguishing implicit properties for the same part. For example, “hair curly, thick, and coarse” has three implicit properties, but it may not be easy to identify what the property names should be among the choices curliness, thickness, coarseness, body, texture, etc. This may explain in part why in practice petal pink is treated as a character, since one does not have to deal with naming the property (obviously color in this case), whether it is obvious or unknown.
PROPERTY EXPLOSION. In a descriptive database, knowing the data is necessary just to correctly specify the properties; that is, you need to know what the data is going to be prior to creating its storage structure. Unfortunately, this is difficult when there are descriptions of thousands of species, each bearing the stamp of its author. One result is an explosion of very similar properties as new data is added to the database. For example, the diameter of the body may appear to be a reasonable property to use. But the literature may incrementally reveal that the diameter of the body is measured at the vulva, at midbody, at the stylet base, or at several other positions on the body, or may indeed be the maximum diameter without specifying where the measurement is made, though an expert on the species will probably understand what is intended. Creating a new property each time is clearly not the best resolution.
CHARACTER STATE SYNONYMY. In biology, states represent an important part of the domain expertise and thus are properly part of the schema development. It may be difficult to determine if a state is a separate state or a synonym for another state, or to determine if a group of states should be logically combined into one property or kept separate in two. In particular, the language used in published descriptions is quite rich. For example, it is not obvious if widely open C, open C, very open C are synonymous with weak C or are distinct states. Experts could disagree. No standards generally exist for published descriptive terminology, and with thousands of descriptions it takes significant effort to determine valid properties and states, and to determine which are new states and which are synonymous with existing states. Thus, creation of a schema is not a short-term activity; it can only develop over time as the literature is reviewed, a time-consuming task for experts. Therefore, the design must be sufficiently flexible to change as new data is acquired, since the addition of new data can add to the list of states, which play a critical role in relationships expressed in the design, particularly with state-based relationships discussed below.
GENERAL AND SPECIFIC STATES. Another problem, similar to synonyms, involves general and specific states for a property. For example, if round, circular, and elliptical are states, the first is a general state encompassing the latter two, which are more specific. An example from our domain is the set of states ellipsoidal, oval, almost round, almost circular, subspherical, round, spherical, and quadrangular from the character median bulb, shape. In some descriptions, the author may only give the general state while other authors give specific states. If the general/specific state relationships are not represented, then the queries find all taxa with median bulb, shape = round and find all taxa with median bulb, shape = almost round will fetch different sets of species. If the relationships are represented, then the first query returns all species for those states for which round is considered a general state, while the latter query would return all those with the state almost round, but should also return those with the state round as “maybe” results. Alternatively, if two properties were used, one for the general states and another for the specific states, one called median bulb, shape and another called median bulb, general shape, then what would be returned would depend on the formulation of the queries. General and specific states may arise due to different uses of the data. For example [14], for phenetic classification a stem has the finer distinctions terete, striate, narrowly winged, or widely winged, while for identification there may be only two categories, not obviously winged and obviously winged. In essence, the latter two can be considered general states for terete, striate and narrowly winged, widely winged, respectively.
STATE-BASED RELATIONSHIPS. In standard database design methods such as the Entity-Relationship Model [17], relationships are expressed between entities. However, relationships between attributes or between attributes and states are not considered. We call these relationships state-based. Some methods allow certain state-based relationships such as value-determined classes, where the value restricted in one class is the basis for creation of a subclass. For example, restricting the class SHIP to those with cargo = ‘oil’ would create a subclass DANGEROUS SHIP [18]. Until we recognized the existence of state-based relationships, there were endless rounds of revisions in the character set, where the biologist would propose characters and the computer scientists would suggest alternatives or point out problems. Generally, the difficulty stemmed from the biologist’s attempt to capture these relationships implicitly within the characters. This embedding of information in the characters makes the information subsequently difficult to work with. In fact, many of our criticisms of current approaches in building character sets stem from the fact that too much information is improperly embedded in the structures, properties, and states. If these state-based relationships are not identified and remain implicit, then it is likely that the resulting set of characters will be poorly designed.
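The query behavior described above for general and specific states can be illustrated with a small sketch. It is only one possible reading of the text. The grouping in `GENERAL_OF`, the taxa in `RECORDS`, and the function name `query` are all hypothetical, and which states count as specific forms of round is an assumption here, a decision that in practice belongs to the domain expert.

```python
# Assumed grouping for "median bulb, shape": specific state -> general state.
GENERAL_OF = {
    "almost round": "round",
    "spherical": "round",
}

# Hypothetical taxa and the state each description records.
RECORDS = {
    "species A": "round",
    "species B": "almost round",
    "species C": "spherical",
    "species D": "quadrangular",
}


def query(state):
    """Return (definite, maybe) lists of taxa for shape = state.

    definite: taxa recorded with the state itself, or with a specific state
              for which the queried state is the general state.
    maybe:    taxa recorded only with the general state of the queried state,
              since the author may have meant the specific one.
    """
    specifics = {s for s, g in GENERAL_OF.items() if g == state}
    definite = sorted(
        taxon for taxon, recorded in RECORDS.items()
        if recorded == state or recorded in specifics
    )
    general = GENERAL_OF.get(state)
    maybe = sorted(
        taxon for taxon, recorded in RECORDS.items()
        if general is not None and recorded == general
    )
    return definite, maybe


print(query("round"))
print(query("almost round"))
```

Keeping the relationship in a separate table, instead of in a second property such as general shape, means the answer does not depend on which property the user happens to query.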
Body behind the neck      shape       kidney
                                      pear
                                      irregularly swollen
                                      spheroid
    depends on
Body                      kind        vermiform
                                      nonvermiform
Figure 5. A dependent character.
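The dependency pictured in Figure 5 can be read as an applicability rule: a character is only offered when a controlling character has one of a given set of states. The sketch below is a minimal illustration under that reading. `DEPENDS_ON` and `is_applicable` are hypothetical names, and vermiform is assumed to be the other state of body, kind.

```python
# (structure, property) -> (controlling structure, controlling property,
#                           states of the controlling property that enable it)
DEPENDS_ON = {
    ("body behind the neck", "shape"): ("body", "kind", {"nonvermiform"}),
}


def is_applicable(character, description):
    """Check whether a character applies, given the states already recorded.

    `description` maps (structure, property) pairs to a recorded state.
    Characters with no dependency rule are always applicable.
    """
    rule = DEPENDS_ON.get(character)
    if rule is None:
        return True
    structure, prop, enabling_states = rule
    return description.get((structure, prop)) in enabling_states


target = ("body behind the neck", "shape")
print(is_applicable(target, {("body", "kind"): "nonvermiform"}))
print(is_applicable(target, {("body", "kind"): "vermiform"}))
```

Because the rule lives outside the character name, the system can skip inapplicable characters during data entry or identification without parsing any text.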
Synonymy, general, and specific states are simple forms of state-based relationships. One could classify them as intra-character state-based relationships since they can be handled within a property, as discussed below. Examples of inter-character state-based relationships will be presented next.
DEPENDENT CHARACTER. It has been observed that presence or absence of a character affects its usage within the system [6,7]. For example, petiole hair length depends on petiole hairiness, which depends on petiole presence, and on leaf presence. Dependency need not be restricted to the presence or absence of another character, but can be based on one or more states in a property. In Figure 5, the property body behind the neck, shape is only applicable if body, kind = nonvermiform.
SUMMARY CHARACTER. In the literature, one often sees characters that summarize a number of others. A summary character is a high-level abstract characterization or shorthand used by the experts for other characters. For example, Stylet, type = hoplolaimid implies that other properties have their states as shown in Figure 6.
Stylet, type  = hoplolaimid
    signifies
Stylet, size  = medium to long
Stylet, kind  = robust
Cone, size    = Shaft, size
Cone, shape   = conoid
Knobs, kind   = true knobs
Knob, size    = medium to large
Figure 6. A summary character.
REDUNDANT CHARACTER. An example of a redundant character is hemizonid, distance to the phasmid, which may or may not be represented in the phasmid as the distance to the hemizonid. A more delicate situation exists when the redundancy is conditional. For example, if knobs, shape = circular, then the anterior and posterior parts of the knobs will be circular too, actually semicircular.
FUZZY STATES. In the literature, it is not unusual to find that measurements and quantities are given imprecisely. Instead of the stylet, length given as a numeric value or range such as 20.5-25.1 μm, it is given as stylet, length = short. The Annuli, number may be given as many rather than as a specific number when it is too tedious to count. These are intra-character relationships between qualitative and quantitative data. There are several other types of fuzzy properties representing inter-character relationships for comparison of states between properties such as bigger, smaller, equal to, longer, shorter, etc.
Finally, we briefly mention one other aspect of biological data that we have observed in this project and discussed in detail elsewhere [19]. Character semantics, what we call metadata, play a central role in the underlying understanding of the domain and its uses. For instance, with large character sets it is important to know which characters are easy to use in an identification and which are not, and which can be relied on as input from observers and which cannot, depending on their expertise. The difficulty arises when the metadata changes from taxon to taxon. This is unlike most metadata found in data models, where the metadata is independent of the instances stored in the database.
4. BIOLOGICAL DATABASE DESIGN AND SUPPORT
There are efforts by database researchers to examine requirements in a variety of scientific areas [20], and the new capabilities in newer generation DBMSs, such as user-defined data types, stored procedures, triggers, and rules, have made it possible to address the database requirements of scientific research, where file systems have been the principal means for data management in the past [21]. However, these new capabilities alone are insufficient since biologists do not have the expertise, time, or money to effectively exploit them [22]. It is a challenge to the database community to create database management tools [22] that can be tailored to typical scientific endeavors and to construct for each domain a unifying model [23]. In this section, we present several concepts that address the problems stated above and provide a model that is easy to work with, both for the designer and the user.
We have discussed a variety of problems that arise in creating a large set of characters for a descriptive database. While it is possible to attack each of these problems individually, the result may be a complex system that is difficult to support and makes integration extremely difficult. Our effort will be to create a framework that provides the necessary expressiveness, while at the same time being reasonably uniform and simple, both for the designers and for the users. Given the wide variety of problems discussed above, this is a significant challenge, but we feel that the concept of basic property and its features takes a major step in achieving our goals.
Basic Property
In the course of analyzing several early versions of the list of characters that the biologist had created, we observed many problems with consistency and uniformity. While many properties had certain features in common, the concept of a property was not sufficiently well defined to aid in producing a uniform and consistent set of characters. Subsequently, we developed the concept of basic property. A basic property is a property satisfying the following four general conditions.
CONDITION I. A basic property is domain independent, that is, it is not peculiar to one domain such as nematology, or ichthyology, or entomology. A basic property should be useful in multiple domains. For example, shape is domain independent while shape of the stylet is not; so is flow rate, while flow rate of blood is not.
CONDITION II. A basic property is specific to the type of data, i.e., descriptive, behavioral, ecological, etc. For example, length is specific to descriptive data while flow rate is specific to physiological data.
APPEARANCE                    MEASUREMENT
presence                      length
shape                         height
kind (distinctive trait)      width
color                         diameter
texture                       depth
arrangement                   weight
symmetry                      ratio of*
                              size

PLACEMENT                     QUANTITY
position relative to*         number
orientation                   quantity
angle with*
distance to*

*Properties that are generally relational.
Figure 7. Basic properties for descriptive data by semantic category.
For descriptive data, there are four broad semantic categories in which basic properties can be placed: APPEARANCE, MEASUREMENTS, PLACEMENTS, and QUANTITIES, as seen in Figure 7. (While commercial DBMSs support basic business data types like date and money, they do not support certain basic properties for domains like order processing. There, one might find that order number, dropship address, line item, etc., would be part of a set of basic properties that could be used in a wide variety of order processing systems.)
CONDITION III. A basic property is independent of structures and states. For example, shape of the wing is not a basic property as it contains a structural reference, wing, and is circular-shaped is not a basic property as it contains a state reference, circular.
CONDITION IV. A basic property is a template to be used in creating characters. When a character is created, an instance of a basic property is created, i.e., copied and modified, to form the character. The advantage of creating basic properties lies in promoting uniformity and in ease of use in building the set of characters. Once basic properties are created for one area and type of data, they do not have to be recreated for each of the domains with the same type of data, thus eliminating some of the redundant effort seen in creating biological databases [2].
Additional Aspects of Basic Properties
Given the four general aspects of a basic property, we now extend its definition by examining specific aspects that hold whenever a basic property is instantiated (used) to create a character, which we will refer to as an “instantiated basic property.” (We continue the numbering in the definition.) CONDITION V. RELATIONAL AND NONRELATIONAL PROPERTIES. A basic property that can be instantiated to form a relationship with another character is called relational; otherwise it is nonrelational. Basic properties that are typically instantiated in this way are marked with a “*” in Figure 7. For example, the basic properties distance to and ratio of would be meaningless when instantiated as is in a list of characters since they inherently relate structures and characters. Sometimes these relationships are referred to as landmarks (for the definition see the glossary of [12]). We emphasize that with relational basic properties the relationship is established when the property is instantiated to create a character. At that time, the system should prompt for the related character, i.e., structure or structure and property. For example, when creating a character using distance to, the system would prompt for a structure name to complete the name of the property, and when creating a character using ratio of, the system would prompt for two other properties such as height and width. The relationships can be maintained via the mechanism for state-based relationships discussed below. Note that with nonrelational basic properties, such as length, the system would not prompt for additional information, though the names of instantiated properties can be modified. This discourages creating properties such as length of the wing, which is really a combined property and structure, a poor design choice since a better decomposition would be wing, length.
Thus, the very definition of basic properties aids in uniformity and consistency within a character set, even if they are not implemented and supported.
CONDITION VI. STATES IN BASIC PROPERTIES. Basic properties may have specified states as part of their definition. For instance, presence has two states {present, absent}, which are automatically included in any instantiation. Additional states and synonyms can be added in each instantiation as appropriate. Basic properties classified as measurements and quantities also have fuzzy states included. For example, the basic property length includes the fuzzy states {very short, short, intermediate, long, very long}. Quantities have fuzzy states such as {a couple, a few, several, many, about a dozen}. Upon instantiation, changes to the list can be made. We have not addressed the complexity of issues related to the use of fuzzy states. We have only provided for fuzzy states in the list of characters and for storing fuzzy values in the data acquisition.
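The template view of Conditions IV to VI can be illustrated with a minimal Python sketch. The class names, field names, and the `instantiate` function are assumptions made for illustration only; the raised error merely stands in for the prompt that a system would issue for a relational property.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class BasicProperty:
    """A template (Condition IV); all names here are hypothetical."""
    name: str
    relational: bool = False   # Condition V
    states: tuple = ()         # Condition VI: predefined (possibly fuzzy) states

@dataclass
class Character:
    structure: str
    property_name: str
    states: list = field(default_factory=list)
    related_to: tuple = ()     # structures or properties a relational property refers to

def instantiate(bp, structure, related_to=(), extra_states=()):
    """Copy a basic property into a character and allow per-instance additions."""
    if bp.relational and not related_to:
        # Stand-in for the system prompting for the related character.
        raise ValueError(f"'{bp.name}' is relational: a related character is required")
    if not bp.relational and related_to:
        raise ValueError(f"'{bp.name}' is nonrelational: no related character expected")
    return Character(structure, bp.name,
                     list(bp.states) + list(extra_states), tuple(related_to))

PRESENCE = BasicProperty("presence", states=("present", "absent"))
LENGTH = BasicProperty("length", states=("very short", "short", "intermediate",
                                         "long", "very long"))
DISTANCE_TO = BasicProperty("distance to", relational=True)

WING_LENGTH = instantiate(LENGTH, "wing")   # the decomposition "wing, length"
```

Because the template is copied rather than shared, states added to one character do not leak back into the basic property or into other characters built from it.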
CONDITION VII. QUALITATIVE AND QUANTITATIVE PROPERTIES. There are essentially two types of data required for descriptive data: qualitative states and quantitative values. Other data types may be required for ecological, physiological, and geographical data. Storage structures are created based on the instantiated properties and the type of storage structures desired: records, relations, objects, etc. In this discussion, we will assume the descriptive data is stored in records as shown in Figure 8, that is, each record will represent data for a given character. Included but not shown are the fields to store taxon related information obtained directly from published descriptions. We do not address the complex area of nomenclature in this paper.
Fields: 1. Structure | 2. Property | 3. Name extension | 4. State | 5. Qualifier | 6. Version #
(a) Qualitative data record format and example.
Fields: 1-6 as above | 7. Value | 8. Low Range | 9. High Range | 10. Stdev
Example: Median bulb, length | 13.0 | 10.6 | 15.8 | 1.4
(b) Quantitative data record format and example.
Figure 8. Storage structures.
Field 1 in Figure 8a identifies the structure for the property in field 2. Field 3 will be discussed in condition IX. Qualitative data have a field for a state, and a field for the frequency of occurrence or for other qualifications, fields 4 and 5, Figure 8a, as states are often given with qualifiers such as “always, usually, sometimes.”
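The record formats of Figure 8 might be modelled as follows. This is only a sketch: the class and field names are assumptions, and `range_is_consistent` is a hypothetical helper that is not part of the paper. The example instance uses the median bulb length values shown in Figure 8b.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class QualitativeRecord:
    structure: str                  # field 1
    property_name: str              # field 2
    name_extension: Optional[str]   # field 3
    state: Optional[str]            # field 4 (also holds fuzzy states)
    qualifier: Optional[str]        # field 5, e.g. "usually"
    version: int = 0                # field 6; 0 means the character is not linked

@dataclass
class QuantitativeRecord(QualitativeRecord):
    value: Optional[float] = None       # field 7
    low_range: Optional[float] = None   # field 8
    high_range: Optional[float] = None  # field 9
    stdev: Optional[float] = None       # field 10

def range_is_consistent(rec):
    """Hypothetical sanity check: the value lies within the stated range."""
    return rec.low_range <= rec.value <= rec.high_range

MEDIAN_BULB_LENGTH = QuantitativeRecord(
    structure="median bulb", property_name="length",
    name_extension=None, state=None, qualifier=None,
    value=13.0, low_range=10.6, high_range=15.8, stdev=1.4,
)
```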
Using records to represent character data presents a problem whenever character data needs to be linked, which happens occasionally, but not always predictably, in the data we are working with. This may be different states for the same character (see VIII below) or for different characters. Field 6 stores a version number, with version number 0 for characters that are not linked and a different version number each time there is a link for the same set of characters. Application programs would have to handle this whenever necessary. The alternative approach of storing multiple characters per record and normalizing would be more difficult to achieve given the uncertainty of the data in the literature or in adding new species. Quantitative values can be separated into measurements (reals) and quantities (integers). Each measurement has a field for a value, field 7 of Figure 8b, representing a measurement of an individual or an average for a population. In the latter case, there would be data for high range, low range, standard deviation as these are the terms most frequently encountered in descriptions, fields 8-10, Figure 8b. There are alternative ways in which measurements are given such as when the variance, standard error, confidence interval, or normal and extreme ranges are used in place of a standard deviation. Often a conversion can be made to a standard deviation representation. A more elaborate representation of measurements may be necessary though to accommodate these alternatives. Quantitative data records also have fields for states to accommodate fuzzy states. For integer data, usually a single numeric value is given or a range is given. This can be handled using the same fields 7-9 in Figure 8b, though the data type would be integer instead of real. Integer ranges with gaps can be handled by multiple records. Occasionally, an average value for
integer data is given; an example is average family size. In this case, the value would have to be a real as well, representing an average. The resulting ambiguity is resolved by specifying the range and scale of a property. An instantiated basic property may be designated as having a scale, one of {nominal, ordinal, interval, ratio}, and a range, one of {binary, discrete, continuous}. Scale indicates whether the states or values are unordered, ordered, ordered with measurable differences (a - b), and ordered with measurable differences (a - b) and ratios (a/b), respectively. Range differentiates between discrete/binary, i.e., integers or states, and continuous data, i.e., reals. Basic properties classified under appearance and position have defaults of nominal or ordinal, and binary or discrete, those in measurements have defaults of continuous and ratio, and those in quantities have defaults of interval and discrete. One normally accepts the defaults, but can make changes for a given character. Thus, properties designated continuous would have all of the fields for measurements, those with interval and discrete would have those for quantities, and those that are ordinal or nominal and discrete would have fields for qualitative states. Those that are continuous and interval would have integer ranges and a real value and stdev.
CONDITION VIII. IMPLICIT PROPERTIES AND MULTIPLE STATE LISTS. To address the problem of implicit properties mentioned earlier and to avoid the use of multiple artificial properties such as nature, aspect, situation, etc., we use the basic property kind. (The property type, which is not a descriptive property, is a term that has special meaning in biology and should be reserved for states that indicate a biological type as shown in the example of the summary character in Figure 6.) This presents a problem though when there are multiple implicit kinds in the same structure.
Instead of using kind1, kind2, kind3, etc., we introduce the concept of multiple state lists for instances of basic properties. That is, associated with a qualitative property we allow more than a single list of states. For example, we could have one list for general states and another list for specific states within the same character, though we opt for a different approach described below, or we could have separate state lists for the character such as hair, kind, one state list including states {thin, thick}, another state list including states {coarse, smooth}, and another including states {curly, straight}. One advantage in using multiple state lists is that it provides a mechanism for decomposing complex states into more atomic states. For example, the states {low contiguous, high contiguous, high separated} might be decomposed into separate state lists {high, low}, and {separated, contiguous}. Each state list is maintained in a lexicon, defined below, and forms the basis of supporting many of the concepts discussed including intra- and inter-character state-based relationships.
CONDITION IX. NAME EXTENSIONS IN BASIC PROPERTIES. A simple mechanism, called a name extension, is a modifier of a property name. As such it can be stored as a separate field as shown in Figure 8, field 3. One example of property explosion where name extensions are appropriate is given when length is measured in different ways for the same structure such as length along the axis, length along the outer boundary, or length directly taken. Rather than creating multiple properties, we can create one property, length, with three name extensions along the axis, along the outer boundary, directly taken. Instances of basic properties maintain a list of name extensions. In the course of a query, the user can decide which, if any, of the name extensions to enforce. There may appear to be a conceptual resemblance between name extensions and relational properties.
And indeed, name extensions are used to represent relational properties. For instance, the example of property explosion diameter at the vulva, diameter at midbody, diameter at the stylet base, etc., can be represented by the property diameter with the name extensions at the vulva, at midbody, at the stylet base, etc.; see Figure 9, where the name extensions have a “-” prefix. At the same time it represents relationships with other characters. There are several advantages of name extensions. The first, and obvious, one is the convenience of consolidating multiple properties into a single property. As new name extensions are encountered for a property, they can be added to an existing property rather than having to create a new one.
Body diameter
- at stylet base = SBW
- at median bulb
- at nerve ring
- at excretory pore
- at oesophago-intestinal junction
- at beginning anterior ovary
- at midbody = MBW
- at vulva = VBW = VB
- at end posterior ovary
- at anus
- maximum = breadth
Figure 9. Name extensions in a property.
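As a rough illustration of how one consolidated property with name extensions could be queried, the sketch below stores the name extension as a separate field and enforces it only when the user asks for it. The `select` function, the record layout, the taxon labels, and the numeric values are all invented for the example and do not come from the NEMISYS data.

```python
def select(records, structure, prop, predicate, name_extension=None):
    """Return records for (structure, property) whose value satisfies predicate.

    The name extension acts as an optional extra "and" condition: it is
    enforced only when a value is supplied.
    """
    return [
        r for r in records
        if r["structure"] == structure
        and r["property"] == prop
        and predicate(r["value"])
        and (name_extension is None or r["name_extension"] == name_extension)
    ]

# Hypothetical data: taxon labels and values are made up for illustration.
RECORDS = [
    {"taxon": "A", "structure": "body", "property": "diameter",
     "name_extension": "at vulva", "value": 18.5},
    {"taxon": "B", "structure": "body", "property": "diameter",
     "name_extension": "at midbody", "value": 19.0},
    {"taxon": "C", "structure": "body", "property": "diameter",
     "name_extension": "at vulva", "value": 24.0},
]

# Ignore the name extension: any body diameter of at most 20.0 qualifies.
ANY_DIAMETER = select(RECORDS, "body", "diameter", lambda v: v <= 20.0)

# Enforce the name extension: only measurements taken at the vulva qualify.
AT_VULVA = select(RECORDS, "body", "diameter", lambda v: v <= 20.0,
                  name_extension="at vulva")
```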
A field in the storage record of each property, whether it initially has name extensions or not, can be set aside for a name extension value. Whenever a value is stored, the name extension can be stored as well. It is a simple matter in a query to enforce (or ignore) a name extension via a simple “and” condition. Operationally, the name extension could be used like a modifier that may or may not factor in a query. For example, in a query with the condition body, diameter ≤ 20.0 it would not be important to enforce a condition like body, diameter.name extension = at the vulva; indeed, it would be most desirable not to. Needless to say, in an interactive session the user would be able to make appropriate choices.
CONDITION X. LEXICONS IN BASIC PROPERTIES. The principal mechanism in an instance of a basic property to handle many of the concepts discussed, including intra- and inter-character state-based relationships, is called a lexicon. A lexicon L is a 5-tuple (S, P, CS, DS, M), where S is a structure, P is a property including specified name extensions, CS is a set of cited states, DS is a set of display states, and M is a correspondence or mapping, not necessarily a function. M maps CS to DS.
The basic rationale for a lexicon is straightforward. In the literature there may be a wide variety of states for a property. One example set of states is {straight, weak C, C, circle, closed circle, open circle, widely open C, spiral, question mark, tight spiral}. This set is designated as a set of cited states CS, since they are as cited in the literature and we assume these are the values that are stored in the database. However, for a number of reasons, this set CS may not be the set of states we would choose to display as a list of states for the character. For example, some of the terms in CS may be outdated, some may be nonstandard terminology, some could even be wrong, i.e., a bad synonym, or some might be synonyms that are less frequently used than other terms. Thus, we form a set DS, display states, that represents a set of distinct states to be displayed whenever the character is viewed. An example of a set DS for the example CS above might be {straight, weak C, C, circle, question mark, spiral, tight spiral}. Since the set of cited states is from the literature, the set would not change over time except to add new states, unless the character were to change in some fundamental way such as dividing a character in two. On the other hand, the set of display states would change. The intention is that by changing the set DS, the stored data would remain unaffected and would allow individual users to change DS as needed. A correspondence or mapping M between CS and DS is needed to relate elements of CS that are synonymous with and general states for those in DS. An example of the correspondence is shown in Figure 10. On the left-hand side are elements of DS, on the right-hand side the corresponding elements of CS with “=” showing the state itself and synonyms and “≈” showing general states. Note that C is a general state for weak C, and it is also a display state itself, and widely open C is a synonym for weak C. A lexicon can easily be represented and implemented as a table. A property may have one or more lexicons, thus multiple state lists directly correspond to multiple lexicons. Most importantly, the lexicons form the basis for inter-character state-based relationships.
S = body, P = habitus
DS             -M-   CS
straight        = straight
weak C          = weak C = widely open C ≈ C
C               = C = open circle ≈ circle
circle          = circle
spiral          = spiral
tight spiral    = tight spiral ≈ spiral
Figure 10. A lexicon.
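A lexicon in the sense of Condition X can be sketched as a small table-like object. The sketch follows one reading of Figure 10; the class name, the method names, and the decision to split the correspondence M into a synonym table and a general-state table are assumptions made for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Lexicon:
    structure: str   # S
    prop: str        # P
    cited: set       # CS: states as cited in the literature (and as stored)
    display: list    # DS: states shown to the user
    # M, split into two tables keyed by display state (an assumed layout):
    synonyms: dict = field(default_factory=dict)  # display state -> cited synonyms
    general: dict = field(default_factory=dict)   # display state -> its general states

    def display_for(self, cited_state):
        """Display state(s) under which a cited state is shown."""
        return [d for d in self.display
                if cited_state == d or cited_state in self.synonyms.get(d, ())]

    def specific_states(self, display_state):
        """Display states that have display_state as a general state."""
        return [d for d in self.display
                if display_state in self.general.get(d, ())]

    def stored_terms(self, display_state):
        """All cited terms under which data for a display state may be stored."""
        return [display_state] + sorted(self.synonyms.get(display_state, ()))

HABITUS = Lexicon(
    structure="body", prop="habitus",
    cited={"straight", "weak C", "widely open C", "C", "open circle",
           "circle", "spiral", "tight spiral"},
    display=["straight", "weak C", "C", "circle", "spiral", "tight spiral"],
    synonyms={"weak C": {"widely open C"}, "C": {"open circle"}},
    general={"weak C": {"C"}, "C": {"circle"}, "tight spiral": {"spiral"}},
)
```

Changing `display`, `synonyms`, or `general` leaves the stored cited values untouched, which is the property the text asks of a lexicon.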
5. STATE-BASED RELATIONSHIPS
The concepts described above, including basic properties, name extensions, lexicons, and state-based relationships have made it significantly easier to build a large set of characters in the NEMISYS project in a more consistent and uniform fashion than would have otherwise been possible. Lexicons, which provide a means for representing synonymy and general states, also provide a convenient and straightforward way of representing inter-character state-based relationships such as dependent, redundant, and summary characters. Again the basic idea is that these relationships are represented via correspondences between lexicons.
Representation of State-Based Relationships
The basic unit in representing inter-character state-based relationships is a triple (L1, L2, C12) representing a correspondence C12 between display states DS1 and DS2 of two lexicons L1 and L2. The mapping of a state in DS1 to multiple states in DS2 is interpreted disjunctively. One or more triples can be grouped into a collection G, which is interpreted conjunctively. A relationship is then defined by its type T and a collection {G1, G2, ..., Gn} of groups of triples, with the collection interpreted disjunctively. An example of the simplest relationship is a dependent relationship shown in Figure 5, with T = ‘Dependent’ and one grouping consisting of a single triple, i.e., G1 = {(L1, L2, C12)}, where L1 is the lexicon for S1 = body, P1 = kind, DS1 = {vermiform, intermediate, nonvermiform}, and L2 is the lexicon for S2 = body behind the neck, P2 = shape, DS2 = {kidney, pear, irregularly swollen, spheroid}, and C12(nonvermiform) = {kidney, pear, irregularly swollen, spheroid}. The other two values, intermediate and vermiform, are mapped to the empty list { }. Summary characters, using the example in Figure 6, require multiple groups of multiple triples. With L1 the lexicon for the character stylet, type, G1 includes (L1, L2, C12), where C12(hoplolaimid) = {robust} in the lexicon L2 for stylet, kind; includes (L1, L3, C13), where C13(hoplolaimid) = {conoid} in the lexicon L3 for cone, shape; includes (L1, L4, C14), where C14(hoplolaimid) = {intermediate, long} in the lexicon L4 for stylet, size, and so forth. (Note that mappings are interpreted disjunctively when mapping a state to two or more states.) The condition that cone, size = shaft, size can be handled by placing the mappings from hoplolaimid to
each of shaft, size = small and cone, size = small in G1, with the other cases of equal shaft sizes and cone sizes using additional sets Gi. Alternatively, one could include features like wild card elements to simplify and reduce the number of Gi needed in a representation. The representation of state-based relationships can be thought of as query conditions that can be used to modify queries, as discussed next.
Utilization of State-Based Relationships
Proper use of the data is very much dependent on the accuracy in the selection of the correct characters. Given the complexity of any moderate sized character set, independent of the concepts used to build the set, the user will need assistance in selecting the appropriate characters in order to operate on the data. Usually, the more concepts involved, the more difficult the task will be in building a mechanism to assist the user. However, the concepts we have introduced are represented using lexicons and name extensions in a fairly straightforward manner. This would simplify building such a mechanism, which for purposes of discussion we call a character list processor.
The types of state-based relationships we have presented, i.e., synonymy, general states, dependent, summary, redundant, and fuzzy characters, certainly do not exhaust the possibilities. The addition of new relationship types should not require specialized code or changes in existing code for a character list processor. Our approach in using state-based relationships simplifies the addition of new kinds of relationships.
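One way to see why new relationship types need no specialized code is to store every relationship as plain data: a type T plus groups of triples. The following sketch uses assumed class and function names and identifies each lexicon by its (structure, property) pair; it encodes the dependent relationship of Figure 5 as described in the text.

```python
from dataclasses import dataclass

@dataclass
class Triple:
    source: tuple   # (structure, property) identifying lexicon L1
    target: tuple   # (structure, property) identifying lexicon L2
    mapping: dict   # C12: state in DS1 -> set of states in DS2 (read disjunctively)

@dataclass
class Relationship:
    rel_type: str   # T, e.g. "Dependent" or "Summary"
    groups: tuple   # disjunction of groups; each group is a conjunction of triples

def implied_conditions(rel, character, state):
    """Return a disjunction (outer list) of conjunctions (inner lists) of
    (target character, allowed states) implied by character = state."""
    result = []
    for group in rel.groups:
        conjunction = [
            (t.target, frozenset(t.mapping[state]))
            for t in group
            if t.source == character and t.mapping.get(state)
        ]
        if conjunction:
            result.append(conjunction)
    return result

SWOLLEN = {"kidney", "pear", "irregularly swollen", "spheroid"}
DEPENDENT = Relationship(
    rel_type="Dependent",
    groups=((
        Triple(source=("body", "kind"),
               target=("body behind the neck", "shape"),
               mapping={"nonvermiform": SWOLLEN,
                        "intermediate": set(),
                        "vermiform": set()}),
    ),),
)
```

A generic processor only needs `rel_type` to look up what to do and `implied_conditions` to obtain the query fragment, so a new kind of relationship is just another value of T.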
To illustrate the need for a character list processor, we consider an identification session where an observer enters some initial observations C1, a set of characters with states/values that are connected by the usual logical connectives ‘and’ and ‘or.’ Without a character list processor, candidates would be retrieved based on C1. However, the set C1 may not be the best set to use. One would not expect the set of observations C1 by the user to translate directly into the best set of characters for the retrieval. For example, the user may not have been sufficiently general or specific. General or specific states may be needed to clarify, for example, that a posture designated as C includes the specific state weak C. The user may specify inconsistent characters without realizing it, as could be the case if the body, kind is given as intermediate but the body behind the neck, shape is observed as kidney, Figure 5. The user may not be aware that an observation represented by one character may indeed be represented for different taxa in different ways. For example, if the stylet, type = hoplolaimid is in C1, it may be necessary to retrieve also based on the summarized characters shown in Figure 6 for those taxa where the stylet, type has not been specified.
There may be circumstances other than a retrieval, where a set of characters C1 needs to be modified before an action takes place. For example, C1 might be a set of observations that need to be verified or C1 may be used to update the database. Clearly, state-based relationships play a central role in overcoming the problem of formulating or modifying C1 caused by the structure of the data. Each context requires proper selection of the characters. In utilizing the state-based relationships, we can view a character list processor, Figure 11, as follows: a set of characters C1 is taken as input and a modified set of characters C2 is produced as output. Other input includes the set of state-based relationships, a context, and a table of operations based on the context. By ‘context,’ we mean a user designated name representing the kind of activity taking place with the data. Typical names for contexts could be “Retrieval,” “Query,” or “Update,” but there may be many more that are suitable for other contexts as well, such as verification of observations. We will discuss the situation for retrievals since they are the most common context. By ‘operation,’ we mean the modification made to C1. Some generic operations are given in Figure 12. For example, if C1 contains body, habitus = C, in a retrieval C1 could be EXPANDed using specific states to include the condition body, habitus = C or body, habitus = weak C.
[Diagram: the character list processor takes C1, the state-based relationships, a context, and a table of operations as input and produces C2.]
Figure 11. Character list processor.
Likewise, if C1 contains body, kind = nonvermiform, C1 could have ADDed the condition body behind the neck, shape = kidney, or pear, or irregularly swollen, or spheroid. The ADVISE operation is used to alert the user of an existing relationship and allows the user to selectively modify C1. A SUBSTITUTE operation replaces one or more characters by other characters. Thus, the user could EXPAND, ADD, or SUBSTITUTE, and within each, add or delete disjuncts and conjuncts.
EXPAND: modify C1 by disjunctively adding conditions to C1.
ADD: modify C1 by conjunctively adding conditions to C1.
SUBSTITUTE: modify C1 by substituting conditions for conditions in C1.
ADVISE: alert the user to existing relationships.
Figure 12. Operations on character sets.
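The operations of Figure 12 can be made concrete with a deliberately simplified model in which C1 is a conjunction of conditions and each condition is a disjunction over the states of one character. This layout and the function names are assumptions; in particular, EXPAND is restricted here to adding alternative states to an existing condition.

```python
def expand(c1, character, extra_states):
    """EXPAND: disjunctively add states to the condition on one character."""
    return [(ch, states | set(extra_states)) if ch == character else (ch, states)
            for ch, states in c1]

def add(c1, character, states):
    """ADD: conjunctively add a new condition."""
    return c1 + [(character, set(states))]

def substitute(c1, old_character, new_conditions):
    """SUBSTITUTE: replace the condition on one character by other conditions."""
    kept = [(ch, st) for ch, st in c1 if ch != old_character]
    return kept + [(ch, set(st)) for ch, st in new_conditions]

def advise(c1, message):
    """ADVISE: leave C1 unchanged and hand a note back to the user."""
    return c1, message

HABITUS_C = [(("body", "habitus"), {"C"})]
EXPANDED = expand(HABITUS_C, ("body", "habitus"), ["weak C"])

NONVERMIFORM = [(("body", "kind"), {"nonvermiform"})]
ADDED = add(NONVERMIFORM, ("body behind the neck", "shape"),
            ["kidney", "pear", "irregularly swollen", "spheroid"])

STYLET = [(("stylet", "type"), {"hoplolaimid"})]
SUBSTITUTED = substitute(STYLET, ("stylet", "type"),
                         [(("stylet", "kind"), ["robust"]),
                          (("cone", "shape"), ["conoid"])])
```

Each function returns a new list, so the original C1 remains available if the user declines a proposed change.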
Whenever a new context arises it is necessary to specify what operations to carry out on a character set C1 relative to each kind of relationship in order to produce C2. The specification should not be based on instances of state-based relationships between two specific properties, but should only be based on the type of relationship, i.e., redundant, dependent, summary, etc., and on the context. Figure 13 shows some entries in the table of operations. We emphasize that it is up to the user, initially the system designers, to specify the entries in the table. A relationship can be expressed in two ways. For example, for a state s1 with a synonym s2 we can express this as “s1 has synonym s2” or alternatively “s2 is a synonym for s1.”
                      Retrieve    Update
Has synonyms          EXPAND      No Op
Is a synonym          EXPAND      ADVISE
Is general state      ADVISE      ADVISE
Has general state     ADVISE      No Op
Is dependent          ADD         ADD
Has dependent         No Op       No Op
Has summary           ADVISE      ADVISE
Is summary            EXPAND      ADVISE
Is redundant          EXPAND      ADVISE
Has redundant         EXPAND      ADVISE
Name extension        ADVISE      ADVISE
Figure 13. Specifying operations on C1.
In Figure 13, we see two types of contexts and the operations taken for each relationship we have considered. If a set of characters C1 is proposed to retrieve a set of candidates, we call this a “Retrieve” in this example, and the column below it in Figure 13 indicates our choice of operations on C1 for each relationship. For example, the first entry shows that if a state “Has Synonyms,” as in body, habitus = weak C, then on a “Retrieve” the query should be “EXPAND”-ed to include conditions body, habitus = weak C or body, habitus = widely open C. Likewise, on a “Retrieve” that includes a state that “Is A Synonym,” then the state for which it is a synonym should be “EXPAND”-ed. For “Is General State” on a “Retrieve,” the user would be “ADVISE”-d of the general state and choose to modify conditions in the query or not. Note that when a character “Is Dependent” on another character, then the condition is “ADD”-ed to contain the primary character as well, while a character that “Has Dependent” characters would yield no operation on C1. If a character in C1 is a summary character, “Is Summary,” then the summarized characters will be “EXPAND”-ed; however, if a character is in C1 and is one of several that has a summary character, “Has Summary,” then the user will be “ADVISE”-d and can choose whether or not C1 is modified and how it is done.
Conceptually speaking, we can consider state-based relationships as conjunctive and disjunctive conditional fragments that can be used to modify a set C1. In fact, our state-based relationships could be directly represented and stored in this manner for simplicity and efficiency. We do not address the implementation of character list processors in this paper, but we do point out that whenever each character in C1 is modified in producing C2, additional relationships may arise. This would typically occur in chaining of dependent relationships or when states with synonyms or general states are added. We assume that the implementation of the processor maintains a list of dependencies used to produce C2 in order to avoid undoing or redoing operations. The user should have the option of allowing the character list processor to continue until all changes are made or to review the current C2 as each change is made.
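Although the paper leaves the implementation of a character list processor open, the table-driven idea can be sketched briefly. Everything below is an assumption made for illustration: C1 is modelled as a mapping from a character to its allowed states, each relationship instance is a small dictionary, only a few “Retrieve” entries of Figure 13 are included, and ADD is realized by intersecting with any existing condition so that an empty set signals an inconsistency.

```python
# A few "Retrieve" entries of Figure 13; missing entries default to "No Op".
OPERATIONS = {
    ("Retrieve", "Has synonyms"): "EXPAND",
    ("Retrieve", "Is general state"): "ADVISE",
    ("Retrieve", "Is dependent"): "ADD",
    ("Retrieve", "Has dependent"): "No Op",
}

def process(c1, relationships, context, table=OPERATIONS):
    """Produce C2 from C1, applying each relationship instance at most once."""
    c2 = {ch: set(states) for ch, states in c1.items()}
    applied = set()   # relationships already used, to avoid redoing operations
    advice = []
    changed = True
    while changed:    # repeat so that newly added states can trigger further changes
        changed = False
        for i, rel in enumerate(relationships):
            ch, state = rel["character"], rel["state"]
            if i in applied or state not in c2.get(ch, ()):
                continue
            applied.add(i)
            op = table.get((context, rel["type"]), "No Op")
            target, states = rel["implies"]
            if op == "EXPAND":
                c2[target] = c2.get(target, set()) | states
                changed = True
            elif op == "ADD":
                c2[target] = c2[target] & states if target in c2 else set(states)
                changed = True
            elif op == "ADVISE":
                advice.append((rel["type"], ch, state))
    return c2, advice

RELATIONSHIPS = [
    {"type": "Has synonyms", "character": ("body", "habitus"), "state": "weak C",
     "implies": (("body", "habitus"), {"widely open C"})},
    {"type": "Is dependent", "character": ("body behind the neck", "shape"),
     "state": "kidney", "implies": (("body", "kind"), {"nonvermiform"})},
]

C1_EXAMPLE = {("body", "habitus"): {"weak C"},
              ("body behind the neck", "shape"): {"kidney"}}
C2_EXAMPLE, ADVICE = process(C1_EXAMPLE, RELATIONSHIPS, "Retrieve")
```

The `applied` set plays the role of the list of dependencies mentioned in the text: it guarantees termination and prevents an operation from being repeated when chaining occurs.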
6. STRUCTURAL DECOMPOSITION AND NAME EXTENSIONS
Basic properties constitute the main focus of this paper; however, some of these ideas can carry over to the structural level as well. In particular, the concept of name extension can be used with structures to improve the structural decomposition. The example in Figure 3 of spermatocytes in the anterior and posterior end of the testis can be handled by making these positional modifiers name extensions of testis; see [13] for an extended discussion of alternative decompositions including the use of name extensions. Unlike the case of name extensions in properties, the semantics and implementation issues of name extensions in structures are less straightforward, though we believe there are important advantages in using them there. A field for each structure/substructure name is needed in the record and a field for a possible name extension must be provided as well. Given that a record consists of a single character and its state or value as assumed in Figure 8, then the name extension field for each structure/substructure would apply to that datum. However, if one opted for a record format that has multiple properties per record, then another field would be necessary to indicate which properties the name extensions applied to. This would overly complicate the storage mechanism and the character list processor. This is another reason we opt for the record format of Figure 8.
7. SCHEMA CHANGES
Schema evolution has been addressed extensively. The focus has been primarily on structural changes such as adding and deleting attributes and classes. We will limit our discussion to schema changes in the context of state-based relationships since changes to the hierarchy at the property level and above would be analogous to structural changes in a schema. If a property participating in a state-based relationship were moved from one structure to another, for affected relationships we would change the name of the structure S in any lexicon L for that property. Splitting a property into two properties would be straightforward if the property contained multiple lexicons and the lexicons remained unchanged. More complex changes, where a lexicon is split into two or more lexicons, would be similar to adding and deleting states in a lexicon as discussed below.
As one might expect, some changes can be handled automatically by the system, but some would require intervention. In all cases, the system should issue a warning if relationships will be affected by the changes. In the course of using the schema, the most frequent changes involve adding states and adding properties. In the latter case, no state-based relationships exist for a new property and no changes in existing relationships are needed, though a new relationship may need to be established as would be the case for redundant or certain dependent relationships such as presence. Adding states to a lexicon is a very common situation. In some cases, the system would do nothing, such as when the new state is merely added to CS, the set of cited states, but not to DS, the set of display states. In other cases, the system could automatically extend the relationship. For instance, if a new state is added to the DS, and the DS is dependent on another property, then the system could automatically extend the relationship to the new state. For example, if another state is added to the property shape in Figure 5, which is a dependent property, then the new state along with the existing states can be assumed to be dependent on the state nonvermiform in property kind in the part body. The same would hold if a state were inserted into or outside of, but not at the boundary of, an ordered subset {ds_i, ds_{i+1}, ..., ds_{i+m}} of DS. If it were placed at the boundary, then automatically determining whether it participated in the relationship would be difficult, as would be the case if the states were unordered. Deleting states is less likely to occur from CS since these are taken from the literature. Some deletions from a DS would occur since the role of a state may change. Generally speaking, deletions can be done automatically, though warning of existing relationships should be given in case a lexicon is being divided as mentioned above. Also, a relationship could be rendered obsolete by deleting the last element in its domain or range, necessitating a warning as well.
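The boundary rule for ordered display states admits a small sketch. It encodes one possible reading of the text, namely that a state inserted strictly inside the ordered subset joins the relationship, a state inserted clearly outside does not, and a state inserted at either boundary cannot be decided automatically. The function name, the three-valued result, and the choice of subset in the example are assumptions; the state list is the fuzzy list for length.

```python
def participates_after_insert(ds, subset, position):
    """Decide whether a state inserted into ds at index `position` joins the
    relationship that involves the contiguous ordered run `subset` of ds.

    Returns True (inside the run), False (outside, away from the boundary),
    or None (at the boundary: the user has to decide).
    """
    i = ds.index(subset[0])    # index of ds_i
    j = ds.index(subset[-1])   # index of ds_{i+m}
    assert ds[i:j + 1] == list(subset), "subset must be a contiguous run of ds"
    if i < position <= j:
        return True            # lands between two members of the subset
    if position == i or position == j + 1:
        return None            # immediately before ds_i or after ds_{i+m}
    return False

LENGTH_STATES = ["very short", "short", "intermediate", "long", "very long"]
# Hypothetical relationship involving only the three middle states.
MIDDLE = ["short", "intermediate", "long"]

INSIDE = participates_after_insert(LENGTH_STATES, MIDDLE, 2)
AT_BOUNDARY = participates_after_insert(LENGTH_STATES, MIDDLE, 1)
OUTSIDE = participates_after_insert(LENGTH_STATES, MIDDLE, 0)
```

A `None` result corresponds to the case where the system should warn the user rather than extend the relationship on its own.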
8. CONCLUSION
In this paper, we have presented the key concept of basic property and its features. Whether or not a system implements basic properties per se, the concept itself can aid in creating more uniform and consistent character sets. Additionally, basic properties provide a mechanism for representing and utilizing state-based relationships, which are difficult to avoid in any large set of characters. While new generations of relational and object-oriented database systems may make it possible to implement these ideas more efficiently, it would be too much to expect each group interested in building a biological database to do its own implementation. Thus, if large scale biological databases are to exist, be used effectively, and be integrated, the problems discussed here will have to be addressed with the goal of creating a kind of generic BioDBMS. We believe our ideas will contribute to this goal, though actual design and implementation will require a large scale effort. Though much remains to be investigated in what we have presented, we believe this is a solid start in the right direction.
REFERENCES
1. R.J. Pankhurst, Database design for monographs and floras, Taxon 37, 733-746, (1988).
2. R. Allkin, R.J. White and P.J. Winfield, Handling the taxonomic structure of biological data, Mathl. Comput. Modelling 16 (6/7), 1-9, (1992).
3. M.J. Dallwitz, DELTA and INTKEY, In Advances in Computer Methods for Systematic Biology, Chapter 18, (Edited by R. Fortuner), Johns Hopkins University Press, Baltimore, (1993).
4. M.J. Dallwitz and T.A. Paine, User’s guide to the DELTA system: A general system for processing taxonomic descriptions, 3rd edition, CSIRO Aust. Div. Entomol. Rep. No. 13, pp. 1-106, (1986).
5. R. Allkin and F.A. Bisby, Editors, Databases in Systematics, Systematics Association, Vol. 26, Academic Press, London, (1984).
6. R.J. White, R. Allkin and J.P. Winfield, Systematic databases: The BAOBAB design and the Alice system, In Advances in Computer Methods for Systematic Biology, Chapter 19, (Edited by R. Fortuner), Johns Hopkins University Press, Baltimore, (1993).
7. R.J. Pankhurst, Taxonomic databases: The Pandora system, In Advances in Computer Methods for Systematic Biology, Chapter 14, (Edited by R. Fortuner), Johns Hopkins University Press, Baltimore, (1993).
8. R.J. White and R. Allkin, A language for the definition and exchange of biological data sets, Mathl. Comput. Modelling 16 (6/7), 199-223, (1992).
9. H. Saarenmaa, S. Leppäjärvi, J. Perttunen and J. Saarikko, Object-oriented taxonomic biodiversity databases on the World Wide Web, from an international workshop: Internet Applications and Electronic Information Resources in Forestry and Environmental Sciences (1-5 August 1995, European Forest Institute, Joensuu, Finland) and available through the web at http://www.efi.joensuu.fi~saarenma/oobdwww-naturelatest.htm.
10. J. Diederich and J. Milton, Expert workstations: A tool-based approach, In Advances in Computer Methods for Systematic Biology, Chapter 7, (Edited by R. Fortuner), Johns Hopkins University Press, Baltimore, (1993).
11. J. Diederich and J. Milton, NEMISYS: A computer perspective, In Advances in Computer Methods for Systematic Biology, Chapter 10, (Edited by R. Fortuner), Johns Hopkins University Press, Baltimore, (1993).
12. R. Fortuner, The NEMISYS solution to problems in nematode identification, In Advances in Computer Methods for Systematic Biology, Chapter 9, (Edited by R. Fortuner), Johns Hopkins University Press, Baltimore, (1993).
13. J. Diederich, J. Milton and R. Fortuner, Construction and integration of large character sets for nematode morpho-anatomical data, Fundamental and Applied Nematology 20 (to appear).
14. R. Allkin and F.A. Bisby, The structure of monographic databases, Taxon 37, 756-763, (1988).
15. G. Wiederhold, Views, objects, and databases, Computer 19 (12), 37-44, (Dec. 1986).
16. R. Allkin, N.P. Moreno, L. Gama Campillo and T. Mejia, Multiple uses for computer-stored taxonomic descriptions: Keys for Veracruz, Taxon 41 (3), 413-435, (1992).
17. P. Chen, The entity-relationship model: Toward a unified view of data, ACM TODS 1 (1), 9-36, (Mar. 1976).
18. M. Hammer and D. McLeod, Data description with SDM: A semantic data model, ACM TODS 6 (3), 351-386, (Sept. 1981).
19. J. Diederich and J. Milton, Creating domain specific metadata for scientific data and knowledge bases, IEEE Trans. on Know. and Data Eng. 3 (4), 421-434, (Dec. 1991).
20. Special Issue on Scientific Databases, Bulletin of the Technical Committee on Data Engineering, IEEE Computer Society, Volume 16, No. 1, Washington, DC, (1993).
21. A. Shoshani, A layered approach to scientific data management projects at Lawrence Berkeley Laboratory, In Data Engineering, IEEE Computer Society, Volume 16, No. 1, pp. 4-8, Washington, DC, (1993).
22. Y. Ioannidis, Desktop experiment management, In Data Engineering, IEEE Computer Society, Volume 16, No. 1, pp. 19-23, Washington, DC, (1993).
23. J.B. Cushing, D. Hansen, D. Maier and C. Pu, Connecting scientific programs and data using object databases, In Data Engineering, IEEE Computer Society, Volume 16, No. 1, pp. 9-13, Washington, DC, (1993).