
PECKIANA · Volume 4 (2005) · pp. 91–100 · ISSN 1618-1735

The GloMyrIS project of GBIF: Database structure and data exchange

Jörg Spelda

Abstract

The Global Myriapod Information System (GloMyrIS) consists of three databases, representing six modules: literature, taxonomy, link between both, collection, cartography and media. The present paper explains the structure of the basic tables and how exchange of data between the modules and other databases is done.

1. Introduction

GBIF, the Global Biodiversity Information Facility (http://www.gbif.org), is an initiative to undertake biodiversity informatics on a worldwide basis. Its mission is to make the world's biodiversity data freely and universally available via the Internet and to share primary scientific biodiversity data for the benefit of society, science and a sustainable future. GBIF data should provide information on taxa and names (e.g. original combination, type species of genera, type locality of species), allow rapid identification of species and show the current occurrence of taxa. Linked with other data, e.g. ecological data, this might allow the prediction of potential localities, as well as of patterns of dispersal and disappearance caused by environmental changes.

GBIF is organised nationally in nodes; in Germany (http://www.gbif.de) these nodes are defined by the taxa they deal with. One of these nodes, located at the Bavarian State Collection of Zoology in Munich (ZSM), is responsible for molluscs, arachnids and myriapods (http://www.gbif.de/evertebrata2). The project dealing with myriapods is called GloMyrIS (Spelda et al. 2004, http://www.gbif.de/evertebrata2/GloMyrIS), which stands for Global Myriapod Information System. The name is also an anagram of the pill-millipede genus Glomeris.

2. History of the GloMyrIS project

The development of the Global Myriapod Information System reaches back to the 1990s, when the need to manage and retrieve large amounts of data arose. During the author's studies, simple text files had served this purpose. While the author's thesis (Spelda 1999) was being written, a cartography database was developed, based on the database application Florein, which was at that time widely used for mapping plants in Germany. This database, built on the Dbase database development system, was converted to MS Access (version 2.0) in 1999. At the same time a taxonomic database was developed on the same platform.
Originally developed for use in collaboration with the milli-PEET project (www.fieldmuseum.org/research_collections/zoology/zoo_sites/millipeet/home.html) to provide a digital nomenclator for diplopods, it was used for oribatid mites during the OBIF project (Verhaagh et al. 2003) at the State Museum of Natural History at Karlsruhe (SMNK). OBIF was part of the EDIS umbrella project (Häuser et al. 2003, http://www.insectsonline.de) and was supported by the BIOLOG (biodiversity informatics) programme of the German Federal Ministry of Education and Research as a direct forerunner of GBIF.

The literature database was developed in 1998 directly in MS Access 2.0. It has since been restructured several times and attained its final shape during the OBIF project, when it was adapted to the field definitions of the database application ReferenceManager. Like the other databases, it is currently maintained in MS Access 2000. Based on these three databases, the GloMyrIS project was proposed and funding was applied for. In 2003 funding was granted solely for the recording of types in German collections, meaning that the further development of the other modules proceeds without funding.

3. Organisation of the GloMyrIS project

The database application of the GloMyrIS project consists of five main modules. The structures of their basic tables have proved well suited in practice and are therefore proposed here as a standard for data exchange.

3.1. The literature module

The literature module is an independent database. Its structure is adapted from the widely used literature program ReferenceManager, so as to provide a standard that allows data exchange with users of that program, as explained in a previous paper (Spelda et al. 2003). In principle, literature can be recorded in ReferenceManager, but the data then have to be transferred to an SQL database, because they must be linked directly with the taxonomic module. The main function of the literature module is to provide information such as publication dates or journal abbreviations for synonymy lists.

3.2. The taxonomic module

The taxonomic module consists of a basic table (Tab. 1) containing records that are linked to each other. A record basically consists of the fields »taxonID«, »nameTax«, »authorTax«, »stepTax« and »PreStepTax«. Additional fields hold special information such as synonymy, type species, mode of typification (e.g. monotypy or original designation, for genus group taxa) and original combination (for species group taxa). Every record is connected with one or two literature references: an obligatory one, which indicates the source of the name, and a second one, which indicates the original description. The latter is usually added later, based on the original combination. Although it is possible to enter data directly into the taxonomic module, this is not the preferred way. As a rule, the taxonomic data come from analysed references, which are recorded in the link module.
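How records of the taxonomic table might be linked to each other can be illustrated with a minimal Python sketch. The field names are those listed above, but their meaning is an assumption made here for illustration: »stepTax« is taken to hold the rank of a taxon and »PreStepTax« the »taxonID« of its parent record. The function `lineage` and the sample records are hypothetical and not part of GloMyrIS; author fields of the higher taxa are deliberately left empty.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Taxon:
    taxonID: int
    nameTax: str
    authorTax: str
    stepTax: str               # assumption: rank of the taxon
    PreStepTax: Optional[int]  # assumption: taxonID of the parent record


def lineage(taxa: Dict[int, Taxon], taxon_id: int) -> List[str]:
    """Follow the parent links upwards and return the names from the top down."""
    names: List[str] = []
    seen = set()
    current: Optional[int] = taxon_id
    while current is not None:
        if current in seen:
            raise ValueError("circular link in the taxonomic table")
        seen.add(current)
        taxon = taxa[current]
        names.append(taxon.nameTax)
        current = taxon.PreStepTax
    return names[::-1]


# Illustrative records only; author data of the higher taxa are left incomplete.
TAXA = {
    1: Taxon(1, "Glomeridae", "", "family", None),
    2: Taxon(2, "Glomeris", "", "genus", 1),
    3: Taxon(3, "Glomeris marginata", "(Villers, 1789)", "species", 2),
}

print(lineage(TAXA, 3))
```

Storing only the link to the parent keeps every record small, at the price of one lookup per taxonomic step when a full classification is needed.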


Tab. 1 Structure of the basic table of the taxonomic module.

3.3. The link module

The link module contains the most important table of the whole database application. This basic table (Tab. 2) is independent. Structural copies can be distributed among several collaborators and re-imported later. These secondary tables can be in any useful format: text files, spreadsheets or other database systems such as Dbase. The basic table contains many fields that, strictly speaking, would require their own tables in a relational database. It should be pointed out that every subset of fields works correctly, provided the fields for the taxon name (»genus«, »species«) and »literature« are included. Within the database, such reduced field sets are provided by different forms. As the basic table serves as the basis for cooperative data exchange, a thorough explanation of the fields and their function is justified.

The link module has been designed to record names at species group level from references. Higher taxa can be recorded using the genus field, but this is not yet done in practice. The fields family and author are optional. They are only needed when new taxa occur that are not yet present in the taxonomic module. It is advisable to fill in these fields as long as the taxonomic backbone has not been established. The species field has to be left empty for genera and other taxa of higher category; this is how the system discriminates between genus group and species group names. Names of infraspecific taxa are standardised: species and subspecies are separated only by a blank, a variety is preceded by »var.« and a form by »f.«. Infraspecific taxa are included in the species field in order to reduce the number of taxonomic steps, which slow down the system. Abbreviated author names should be expanded to the full name, e.g. (L., 1758) has to be changed to (Linnaeus, 1758).
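The naming conventions of the link module can be illustrated with a short Python sketch. It is not the GloMyrIS implementation: the function names, the returned labels and the one-entry abbreviation table are assumptions made for illustration, and the epithets »alpha« and »beta« are placeholders.

```python
import re

# Illustrative lookup only; a full table of author abbreviations is not given here.
AUTHOR_ABBREVIATIONS = {"L.": "Linnaeus"}


def name_group(genus: str, species: str) -> str:
    """An empty species field marks a genus group (or higher) name."""
    if not genus.strip():
        raise ValueError("the genus field is mandatory")
    return "species group" if species.strip() else "genus group"


def infraspecific_rank(species: str) -> str:
    """Classify the content of the species field by the standardised notation."""
    parts = species.split()
    if not parts:
        raise ValueError("empty species field: this is a genus group name")
    if len(parts) == 1:
        return "species"
    if parts[1] == "var.":
        return "variety"
    if parts[1] == "f.":
        return "form"
    return "subspecies"  # species and subspecies separated only by a blank


def expand_author(author: str) -> str:
    """Replace known abbreviations of author names by the full name."""
    for abbreviation, full_name in AUTHOR_ABBREVIATIONS.items():
        pattern = r"(?<![A-Za-z])" + re.escape(abbreviation)
        author = re.sub(pattern, full_name, author)
    return author


print(name_group("Glomeris", ""))
print(infraspecific_rank("alpha var. beta"))
print(expand_author("(L., 1758)"))
```

Checks of this kind could be run on a secondary table before it is re-imported, so that deviations from the conventions are caught by the collaborator who entered the data.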


Tab. 2 Structure of the basic table of the link module.



The category field (Tab. 3) is needed for queries that restrict the results to a given category, e.g. to keys in which a taxon is mentioned. Other values might be introduced, e.g. for ecological or physiological citations. The locality field can also be filled in for records of the category »description«. If the »newtax« flag is set to »true«, the locality is interpreted as the type locality. If a species has been described from several localities, a separate record of the category »type locality« has to be created for every potential type locality. A comment such as »[3 localities]« in square brackets can be entered in the corresponding description record.

The category »note« is used as a catch-all for all records that cannot be assigned to another category. If a »note« citation covers several geographic records, records of the category »locality« may specify them later. Records of both categories exist independently.

Filling in the category field might appear straightforward, but in fact it is often difficult to decide to which category a citation belongs. For example, if a faunistic treatment contains pictures of the species or some characters of the material examined, the question arises whether this is still a note or rather a description. As the aim of the category field is to find information quickly, and taxonomy is the centre of interest, it is better to place such citations in the category »description«.

Tab. 3 Standardised codes of the category field.

An important improvement for gathering data on distinct geographical units was the introduction of a categorised locality. A categorised locality is a country or a state for which an ISO 3166-1/1997 alpha-2 code exists (e.g. for Germany the states, called »Bundesländer«, are included, but not those of Austria or Italy). In some cases it was necessary to split off geographical units that have no ISO 3166-1/1997 alpha-2 code of their own. This applies especially to large islands, e.g. Corsica or Borneo, which form distinct biogeographical, but not political, units. A background table is used for data input. The field is provided with a carry flag, meaning that each new record takes over the locality of the preceding one. This avoids mistakes and allows fast input of large faunistic lists.
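The carry flag and the interpretation of type localities can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the GloMyrIS code: records are modelled as dictionaries, the category values are written out in words because the standardised codes are not reproduced here, and the names `apply_carry_flag`, `is_type_locality` and the sample records are hypothetical.

```python
from typing import Dict, List

Record = Dict[str, object]


def apply_carry_flag(records: List[Record], field: str = "locality") -> List[Record]:
    """Let every record without a locality take over that of the preceding record."""
    filled: List[Record] = []
    previous = None
    for record in records:
        row = dict(record)  # work on a copy, leave the input untouched
        if row.get(field):
            previous = row[field]
        elif previous is not None:
            row[field] = previous
        filled.append(row)
    return filled


def is_type_locality(record: Record) -> bool:
    """A locality counts as type locality if the record says so explicitly, or if
    it is a description record of a new taxon (newtax flag set to true)."""
    if record.get("category") == "type locality":
        return True
    return (
        record.get("category") == "description"
        and record.get("newtax") is True
        and bool(record.get("locality"))
    )


# Hypothetical faunistic list: only the first record of each country names it.
faunistic_list = [
    {"species": "alpha", "locality": "DE"},
    {"species": "beta", "locality": ""},
    {"species": "gamma", "locality": "AT"},
    {"species": "delta", "locality": ""},
]
print([row["locality"] for row in apply_carry_flag(faunistic_list)])
```

In an interactive form the same effect is obtained by pre-filling the field of a new record with the previous value, which the user can still overwrite.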


3.4. Update of the taxonomic module

After several data inputs have been made, either online or by import, the analysis program can be started. It first looks for new generic names and presents them as a list to the database administrator, who can check manually whether they are typing errors and, where necessary, add a precursor for genuinely new genera. If the names are accepted as new, they are added to the taxonomic module. The same procedure is then repeated for species group names. Next, the analysis program looks for incomplete author data within the taxonomic module. If it finds more recent data in the basic table, it proposes them to the user. After the taxonomic module has been updated, every record of the link table is linked with the taxonomic and the literature module.

3.5. Synonymy

The records of the category »secondary« provide the synonymy within the GloMyrIS system. In every other category, where the synonym field »istax« is left empty, the system fills it with the name of the corresponding taxon, i.e. it treats the taxon as valid. On request it provides a list of unknown taxa occurring in the field »istax«. The entries in this list have to be regarded as mistakes, as the system expects only known names in this field. The system connects all secondary records with the corresponding valid taxa and selects the latest synonymy, which is then transferred to the taxonomic module. It is possible to create virtual references for one's own interpretations by assigning a special reference code to them (e.g. 0). The system interprets this as the latest record, which overrides all others.

Recombinations require special attention. In the literature, the formal synonymisation is often omitted in recombinations and the original combination is not cited. In older papers even the author of the taxon has sometimes been changed incorrectly, and finding the original author often requires detective work. As several specialists were not aware of the true gender of a genus, different endings of the same species name may exist. The database has to treat all such combinations as synonyms as well.

3.6. Misidentification

The misidentification table is separate and requires access to the literature module, the taxonomic module and the link module during data entry. Therefore misidentifications can only be corrected inside the GloMyrIS database application. The misidentification table provides a link between the three tables. One has to select a citation, characterised by its identification code, then select the true name from the taxonomic module and, from the literature module, the reference in which the correction was published. It is possible to create several records based on the same citation, provided the references differ. During evaluation the system selects the latest record based upon a given citation.
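The rule that the latest record is selected, which applies to synonymy (3.5) as well as to misidentifications (3.6), can be sketched in Python as follows. The sketch rests on assumptions: "latest" is decided here by a publication year stored with each record, the virtual reference code 0 follows the example given for synonymy, and all field names, taxon names and reference codes are hypothetical placeholders.

```python
from typing import Dict, Hashable, List, Tuple

OWN_INTERPRETATION = 0  # virtual reference code, as in the example "(e.g. 0)"


def _priority(record: dict) -> Tuple[int, int]:
    """A virtual reference outranks every published one; otherwise the year decides."""
    if record["reference"] == OWN_INTERPRETATION:
        return (1, 0)
    return (0, record["year"])


def select_latest(records: List[dict], key: str) -> Dict[Hashable, dict]:
    """Return, for every value of `key`, the record with the highest priority."""
    latest: Dict[Hashable, dict] = {}
    for record in records:
        group = record[key]
        if group not in latest or _priority(record) > _priority(latest[group]):
            latest[group] = record
    return latest


# Hypothetical corrections of two citations (identification codes 17 and 18).
corrections = [
    {"citation": 17, "true_name": "Aus bus", "reference": "ref_a", "year": 1950},
    {"citation": 17, "true_name": "Aus cus", "reference": "ref_b", "year": 1990},
    {"citation": 18, "true_name": "Aus dus", "reference": "ref_c", "year": 1980},
    {"citation": 18, "true_name": "Aus eus", "reference": OWN_INTERPRETATION, "year": 0},
]
for citation, record in select_latest(corrections, "citation").items():
    print(citation, record["true_name"])
```

Because older records are kept and merely outranked, the history of interpretations remains available for later checks.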



4. The cartography module

The cartography module is an independent module. It is more complex than the locality part of the link module, as it contains several extensions. Its use is recommended for geographic surveys in which many species are recorded from comparatively few locations. The main difference from the link module is that the user has to fill in a separate locality table. In the ideal structure, a further table characterising the collection event (date and collector) is placed between the locality table and the record table. However, experience from the EDIS project (Häuser et al. 2003) has shown that in most cases data input in cartography databases is less error-prone if the collection event table is omitted and a new locality record, based on a former one, is created for every collection event instead. The cartography module can be linked with the literature and taxonomic modules, but has its own submodules for this purpose. Translation tables are used to allow linkage. Cartography is a separate subject that requires its own contribution. As it is not part of the present GBIF project (but may be part of a future one), no further attention is paid to this module here.

5. The collection module

Although the cartography database has its own collection module, the database application Specify is used within the Bavarian State Collection of Zoology (ZSM). The GloMyrIS database application is used to identify types by linkage. Both databases are connected by the specific name of the specimen. This is easy, as Specify is based on MS SQL Server, which allows both databases to be linked directly via ODBC. The taxonomic module of GloMyrIS analyses the collection tables of Specify and presents a list of unknown taxa to the user. The user has to check them and either correct misspelled names or look for publications in which they occur. Once this is done, the link module is connected as well. The GloMyrIS system then offers a comparison of the type material cited in the literature with the material actually present. A specialist is then able to select and mark the types in the collection modules of GloMyrIS and Specify.

6. The media module

The media module is used to link media files, especially pictures, to the other modules. For this purpose the filenames of pictures have to be standardised. The filename of a scanned page from the literature consists of the literature code, followed by an underscore and the last three digits of the page number (e.g. »carl1903x007_543.jpg«). Figures are treated in the same way: literature code, underscore, »p« plus the plate number if the figure is situated on a plate, underscore, »f« for figure, figure number (e.g. »carl1903x007_p17_f32.jpg«). Information about a picture is given in a media table as shown in Tab. 4.


Tab. 4 Structure of the basic table of the media module.
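The filename conventions of the media module can be expressed as two small Python functions. This is a sketch only: the function names are hypothetical, and two details are assumptions not stated in the text, namely that page numbers below 100 are padded with leading zeros and that the plate part is simply omitted for figures that are not on a plate.

```python
from typing import Optional


def page_scan_filename(literature_code: str, page: int, extension: str = "jpg") -> str:
    """Literature code, underscore, last three digits of the page number."""
    # Assumption: short page numbers are zero-padded to three digits.
    return f"{literature_code}_{page % 1000:03d}.{extension}"


def figure_filename(
    literature_code: str,
    figure: int,
    plate: Optional[int] = None,
    extension: str = "jpg",
) -> str:
    """Literature code, optional plate part ("p" + number), "f" + figure number."""
    parts = [literature_code]
    if plate is not None:  # assumption: no plate part for text figures
        parts.append(f"p{plate}")
    parts.append(f"f{figure}")
    return "_".join(parts) + "." + extension


print(page_scan_filename("carl1903x007", 543))
print(figure_filename("carl1903x007", 32, plate=17))
```

Both calls reproduce the examples given for the media module, »carl1903x007_543.jpg« and »carl1903x007_p17_f32.jpg«.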

6.1. How is data quality ensured?

One main improvement during the EDIS project was the introduction of markers for each dataset, which document data input and changes: who made them and when. These four fields, »DataInputEditor«, »DataInputDate«, »DataLastChangeEditor« and »DataLastChangeDate«, allow a critical evaluation and update of records. In multi-user database applications (e.g. Specify or SysTax) they are also important for access rights.

6.2. Providing data from different modules

Knowing the distribution of a taxon is one of the questions GBIF projects have to answer. Such data may come from the literature (via the link module), from the collection module and from the cartography module. SQL provides a simple way to query several databases: a union select query is created, which gathers data from all available sources. This query tells the system which fields from which tables have to be united in a common structure. In this way new tables can be created, as needed in the example in the following section.

6.3. Providing data for other databases

The most common case of data exchange occurs within cartography projects, when local projects have to be combined. The way in which this is done within the Central European Arachnologists (http://www.spiderling.de.vu) may serve as an example: simple tab-delimited text files are exchanged via e-mail. A single file contains the taxon name, grid field (alternatively, co-ordinates in latitude and longitude would be possible), date (categorised by decade there), reference (the literature source, if the data are taken from there) and source (the person who provided the data). A second file for the references has to be sent, containing the literature code in one field and the complete citation in another. These reference files are provider-specific. The database itself does not discriminate between records from the same literature source if different collaborators provide them. Duplicate records are eliminated. A map server analyses the complete set of files and prints the maps. Existing data are not updated. It is therefore difficult to correct misidentifications in old files. For this reason it might be a good idea for local workers to deliver their whole local database content each time, completely replacing the previous file of the collaborator concerned, and also to recombine all local files from time to time.

6.4. Providing data for the Internet

Although it is possible to put databases of GBIF projects directly on the Internet, this requires an enormous amount of resources for every single project. It is more effective to develop a single portal and to present the data there with a common appearance. This also guarantees interoperability with GBIF International. Such a portal has to be based on a database system that allows large numbers of users simultaneous access, such as Oracle or MS SQL Server. During the EDIS umbrella project the database application SysTax (http://www.biologie.uni-ulm.de/systax) was chosen as the portal. SysTax (Hoppe et al. 2003) is based on the Oracle database system. It is now used in three of the seven German GBIF nodes. During EDIS it was first planned to develop XML interfaces for data exchange, but it turned out that their development requires far more resources than BIOLOG provided. As a practical solution, flatfile interfaces have been developed. These flatfiles are simple ANSI text files, the only kind of file that an Oracle or 4D database system accepts without problems. Both the XML and the flatfile interfaces are based upon the standards of the TDWG (Taxonomic Database Working Group, http://www.tdwg.org). The interfaces have been developed further within GBIF, and updated versions appear from time to time (see http://www.biologie.uni-ulm.de/systax/documentation/interfaces/index.html). Any interface version can be used as long as the version is specified. The main rule of data exchange is that the provider is responsible for ensuring that the data fit one of the interface formats, while the portal is responsible for importing them correctly. Although some changes might be necessary when other databases are included, the principles can be transferred to all databases, and the interfaces developed for GBIF Germany might be useful for data exchange within the myriapodological community.

6.5. Outlook

It is an illusion to believe that all data gathering will now be coordinated by GBIF. In fact several specialists work independently, but exchange their data from time to time. It is inevitable that data will be duplicated. But this is not a fault. On the contrary, it provides a fine quality check: if everyone uses the same field conventions, e.g. those shown in this paper, the data can easily be cross-checked and errors are detected sooner.
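The union select query mentioned in section 6.2 can be illustrated with a small, self-contained Python sketch. It uses the SQLite engine from the Python standard library instead of MS Access, purely so that the example can be run anywhere. All table names, field names and sample rows are hypothetical; they only stand for the link, collection and cartography modules, whose real structures differ.

```python
import sqlite3

# The three source tables use different field names on purpose; the query
# maps them onto one common structure (taxon, locality, source).
DISTRIBUTION_QUERY = """
SELECT taxon AS taxon, locality AS locality, 'literature' AS source
  FROM link_records
UNION
SELECT species_name, site, 'collection'
  FROM collection_records
UNION
SELECT taxon_name, loc, 'cartography'
  FROM cartography_records
ORDER BY taxon, locality, source
"""


def distribution_rows():
    """Build a small in-memory database and gather the distribution data."""
    connection = sqlite3.connect(":memory:")
    connection.executescript(
        """
        CREATE TABLE link_records (taxon TEXT, locality TEXT);
        CREATE TABLE collection_records (species_name TEXT, site TEXT);
        CREATE TABLE cartography_records (taxon_name TEXT, loc TEXT);
        -- the same literature record entered twice, e.g. by two collaborators
        INSERT INTO link_records VALUES ('Glomeris marginata', 'DE');
        INSERT INTO link_records VALUES ('Glomeris marginata', 'DE');
        INSERT INTO collection_records VALUES ('Glomeris marginata', 'AT');
        INSERT INTO cartography_records VALUES ('Glomeris marginata', 'DE');
        """
    )
    rows = connection.execute(DISTRIBUTION_QUERY).fetchall()
    connection.close()
    return rows


for row in distribution_rows():
    print(row)
```

A plain UNION removes rows that are identical in all fields, so the literature record entered twice appears only once in the result; UNION ALL would keep both copies.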


7. Acknowledgements

PD Dr Roland Melzer (Munich) is thanked for valuable suggestions regarding the manuscript and Richard Desmond Kime (La Chapelle Montmoreau) for his improvements on the English text.

8. References

Häuser, C., J. Holstein & A. Steiner (2003): EDIS – Entomological Data Information System. In: Sustainable use and conservation of biological diversity. A challenge for society. Symposium report. Part A: 197

Hoppe, J. R., G. Gottsberger, F. Schweiggert, T. Stützel & D. Walossek (2003): SYSTAX – Electronic data processing for recording and analysing biodiversity data with the systematic and taxonomic database SYSTAX. In: Sustainable use and conservation of biological diversity. A challenge for society. Symposium report. Part A: 223–224

Spelda, J. (1999): Verbreitungsmuster und Taxonomie der Chilopoda und Diplopoda Südwestdeutschlands. Diskriminanzanalytische Verfahren zur Trennung von Arten und Unterarten am Beispiel der Gattung Rhymogona Cook, 1896 (Diplopoda: Chordeumatida: Craspedosomatidae). Ph.D. Thesis, University of Ulm. Part 1, 217 pp. Part 2, 324 pp.

Spelda, J., J. Rosenberg & K. Voigtländer (2003): The German Myriapod Literature Project (GERMYLIT). Afr. Invertebr. 44 (1): 325–330

Spelda, J., M. Umsöld, M. Rizerfeld, C. Pilz & R. R. Melzer (2004): The Global Myriapod Information System (GloMyrIS) – an aid for scientific research. In: Berendsohn, W. G. & S. Oehlschlaeger (eds): GBIF-D – Participation in the Global Biodiversity Information Facility: 100–101

Verhaagh, M., J. Spelda, C. Wurst, L. Beck, N. Blüthgen, W. Hanagarth, H. Höfer, F. Meyer & S. Woas (2003): Optimization of biodiversity information facilities at the State Museum of Natural History Karlsruhe (OBIF). In: Sustainable use and conservation of biological diversity. A challenge for society. Symposium report. Part A: 210–211

Author's address:
Dr Jörg Spelda
Zoologische Staatssammlung München
Münchhausenstr. 21
81247 München, Germany
e-mail: [email protected]