Biological collection databases in Europe: Status and perspective
GÜNTSCH, A.; HAHN, A.; BERENDSOHN, W.G. 1
Abstract
This paper summarizes the status of databases with information from biological surveys and collections in Europe with regard to their online accessibility. The BioCISE project analysed biological collections by means of an extensive survey. Based on the results, a scenario for a common portal to Europe’s collection information is drafted.
Introduction
The term Biological Collection covers traditional natural history collections, living collections such as zoological or botanical gardens, microbial strain collections, or commercial nurseries, as well as collections of digitised objects (e.g. image files and sound files), and observation databases resulting from floristic and faunistic mapping projects, surveys, etc. Biological collections in this broad sense form the basis of many, if not most, of the scientific results produced in the life sciences. Extensive lists of the wide range of potential uses of biological collections have been compiled (e.g. CAMPBELL 1998, LANE 1996). To name but a few: collections represent the variability of organisms (including that of extinct species); they hold the voucher specimens needed to replicate research; they are a source of information on migratory behaviour; and the spatial information linked to collection units is used to derive the past and present distribution of organisms.
The broad definition of biological collections adopted here is not in line with current databasing practice: most collection databases focus on a specific collection type. However, an extensive study of their information structures found that a single adaptable information model is sufficient to cover all biological collections (BERENDSOHN et al. 1999). Based on this finding, the BioCISE project 2 (BERENDSOHN & GÜNTSCH 2000) analysed collections and collection information systems in Europe (European Union, pre-accession states, and Israel) by conducting a comprehensive survey 3. The survey’s results are stored in a project database currently maintained by the Botanic Garden and Botanical Museum, Berlin and publicly available via the World Wide Web. A survey of user interests clearly indicated that overall access to biological collections is a requirement for many user communities, especially those in the environmental sector.
The results are used to propose the implementation of a common online interface to Europe’s collections. This portal will increase the efficiency of daily scientific work and contribute to the fulfilment of international biodiversity agreements such as the Bonn Convention 4 and the Rio Convention 5 (for a comprehensive overview of global initiatives and conventions, see BERENDSOHN et al. 1999). This article summarises the survey’s results, and provides some details with respect to zoological collections and collection databases. In conclusion, an information system is proposed which could overcome the present lack of online collection information.
1 Botanic Garden and Botanical Museum Berlin Dahlem, Department of Biodiversity Informatics and Laboratories, Königin-Luise-Str. 6-8, D-14191 Berlin, Email: [email protected]
2 BioCISE was a Concerted Action Project funded by The European Commission, DG XII, Biotechnology, under the 4th Framework Programme (BIO4-CT97-2309).
3 http://www.bgbm.fu-berlin.de/BioCISE/TheProject/Survey/default.htm
4 Convention on the Conservation of Migratory Species of Wild Animals (CMS), http://www.wcmc.org.uk/cms/
Collection databasing in Europe
The BioCISE survey covered about 2600 collection holding institutes, of which nearly 500 responded to the questionnaires. More than half (61%) of the respondents are presently equipped with an electronic collection information system or database (including lists etc. held as text or spreadsheet files). This share varies with discipline and geographical location and is generally higher for observation-based approaches (mapping projects) than for physical object collections. It is lower in countries where common access to the necessary technologies has only recently reached significant levels (e.g. in the Eastern European countries).
The high number of different software solutions in use for the capture of biological collection data (more than 60 different applications were named for the management of collections in just about 300 institutions) reflects the heterogeneity of the biological community, the fragmented institutional base, and the lack of commercial solutions. About two thirds of database owners reported that their solution was developed in-house. Stand-alone (more or less) relational systems are clearly favoured, as shown in Fig. 1, with MS Access (44%) most prevalent, followed by dBase (25%). More basic systems include “flat” structures using word processing files and spreadsheet tables. Larger databases use client-server applications; of these, the majority (70%) named Oracle as their database management system.
Fig. 1: Categories of database management systems (DBMS) in use for biological collections
5 Convention on Biological Diversity (CBD): http://www.biodiv.org/chm/conv/default.htm
The actual number of objects and observations held by European collections is unknown. The results of the survey, together with some other studies, can be used to attempt some rough estimates. There are approximately 620 zoological living collections in Europe, including zoological gardens, aquaria, and animal genetic
resource collections. According to our results and the information gathered from ISIS 6 and the Global Zoo Directory 7, a total of about 800,000 units seems a reasonable estimate. Our estimate of the number of objects in natural history collections is less exact. The very large collections often have only a rather vague idea of their holdings. For Germany, the number of zoological samples in natural history collections has been estimated to range between 50 and 80 million for invertebrates alone (NAUMANN & GREUTER 1997). London, Vienna, and Brussels together hold more than 50 million invertebrate samples. For vertebrates, the figure is given as 2.7 million for the decentralised German collections, while in Britain and France the large taxonomic facilities (Paris: 1.5 million, London: 5.5 million specimens) are likely to hold the bulk of the respective national holdings (NAUMANN & GREUTER 1997).
The number of zoological observation records is even harder to assess. A number of observation databases cover from 1 to more than 2.5 million records each, which gives an indication of the very large total to be expected from all over Europe. Examples include the Austrian invertebrate survey database ZOBODAT 8, the database on migratory birds of the University of Gdansk 9, the data collections on ringing recoveries and vertebrate mappings in Copenhagen 10, and the butterfly observation database in Wageningen 11.
Among respondents, the percentage of existing objects captured in databases varies greatly between institutions. The average capture level is rather low (5.5%), even though about one fifth of the existing databases are reported to cover all existing objects in the particular collection. It thus comes as no surprise that on-line access to collection object data is still the exception. Fig. 2 gives an indication of online accessibility of biological collections and object data according to the survey results.
The figures cannot be extrapolated to all collections, because the survey was biased towards databased collections; the fraction of 8% of collections offering object data via Internet access is likely to be much lower if all collections are considered. Quite a high percentage (60%) of the collections now providing descriptive information through the BioCISE collection catalogue had not previously been represented on the Internet at all, while others supply no collection details at object level, only descriptive information on their institution and collections. In animal collections, general web representation is slightly more common, largely owing to the web presence of natural history museums.
Generally, the degree of object level access to data depends largely on the type of collection (see Tab. 1). While in physical object collections the state of data capture in general, as well as the different technical provisions necessary for web access, seem to be the limiting factor, observations are nowadays almost universally collected in electronic form. Nevertheless, only a minor part of the roughly 9 million datasets reported by respondents from faunistic survey databases is accessible on-line (less than 1 in 80, in contrast to 20-30% in other animal collections). In contrast to other collection types, this will not be remedied through an adequate allocation of resources to technical infrastructure, curation, and data capture. Rather, the obstacle lies in the lack of solutions for questions of protection of sensitive data (e.g. endangered species) and intellectual property rights of the data providers.
Fig. 2: On-line access to information systems of institutions with biological collections through the Internet. Accessibility of object data and of descriptive information (metadata); all collections: n = 483, animal collections: n = 210.
6 International Species Information System, http://www.worldzoo.org/
7 http://www.cbsg.org/gzd.htm
8 Biologiezentrum des Oberösterreichischen Landesmuseums, Linz, Austria
9 Bird Migration Research Station, Choczewo, Poland
10 Zoological Museum Univ. Copenhagen, http://www.aki.ku.dk/zmuc/zmuc.htm
11 De Vlinderstichting - Dutch Butterfly Conservation Wageningen, http://www.bos.nl/vlinderstichting
Tab. 1: On-line accessibility of object level data for different kinds of zoological collections reported to the BioCISE survey.
Collection type               Number of databases reported   Number of entries   Of those: reported to be on-line   Ratio
Zoological gardens etc.       14*                            17,000              5,600                              1:3
Natural history collections   139                            ~ 4.5 Mio           ~ 1 Mio                            1:4 to 1:5
Observation databases         61                             ~ 9 Mio             ~ 100,000                          < 1:80
* Zoological gardens excluding those networked in the ISIS system.
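The ratios in Tab. 1 follow directly from the reported entry counts. The short Python sketch below redoes that arithmetic; the values marked "~" in the table are rounded in the source, so the results are approximate, and the function and variable names are illustrative only.

```python
# Figures transcribed from Tab. 1; the "~" entries are rounded in the source.
TAB1 = {
    "Zoological gardens etc.": (17_000, 5_600),
    "Natural history collections": (4_500_000, 1_000_000),
    "Observation databases": (9_000_000, 100_000),
}


def entries_per_online_entry(total, online):
    """Number of reported entries per entry that is reported to be on-line."""
    return total / online


def online_share_percent(total, online):
    """Share of reported entries that are on-line, in percent."""
    return 100.0 * online / total


if __name__ == "__main__":
    for name, (total, online) in TAB1.items():
        ratio = entries_per_online_entry(total, online)
        share = online_share_percent(total, online)
        print(f"{name}: about 1:{ratio:.1f}, i.e. {share:.1f}% on-line")
```

With the rounded table values, the observation databases come out at roughly 1:90, which is consistent with the "< 1:80" stated in the table.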
From meta-information to unit-level data: an outlook
Considering its high scientific relevance, the on-line availability of unit-level collection information is appallingly low. Moreover, what is available is only rarely networked, i.e. accessible through a common user interface. Although some unit-level thematic networks exist (e.g. ISIS for zoological gardens, IPGRI for plant genetic resources), this situation is unlikely to change significantly in the near future. A primary impediment is the lack of standardisation in data capture and interchange for collection and observation data (BERENDSOHN 1998). An initiative to overcome this obstacle is the European Natural History Specimen Information Network (ENHSIN 12), an EU financed project of some of the largest collections in Europe, which set out to produce standards for the interconnection of collection databases and to address issues such as intellectual property rights (IPR) for collection objects. Other such initiatives exist in the US and in Australia.
However, adequate availability of object-level data will not be achieved in the near future. Consequently, an interim system should be implemented which describes biological collections in the form of meta-information, but which at the same time allows access to those collections already providing unit level data. Figure 3 depicts the two information levels. The term collection meta-information is here defined as information describing entire collections or subcollections instead of providing access to information pertaining to single collection objects (Güntsch et al. 2000). Collection meta-information covers the following main information areas:
• Categorizations and descriptive keywords (e.g. taxonomic and geographic)
• Type of specimen (preservation), storage characteristics
• Statements on IPR and administrative properties
• Data quality assessments
12 http://www.nhm.ac.uk/science/ENHSIN
[Figure: two panels labelled Collection Meta Information Level and Unit Information Level]
Fig. 3: Unit-level vs. meta level collection information.
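The four information areas listed above, together with the optional access to unit-level data, can be pictured as one record per collection. The following Python sketch is only an illustration under that assumption: the class name, the field names, and the `unit_level_url` field are hypothetical and are not taken from the BioCISE information model.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CollectionMetaRecord:
    """Hypothetical meta-information record for one collection or subcollection."""

    institution: str
    # Categorizations and descriptive keywords
    taxonomic_keywords: List[str] = field(default_factory=list)
    geographic_keywords: List[str] = field(default_factory=list)
    # Type of specimen (preservation) and storage characteristics
    specimen_type: Optional[str] = None
    storage: Optional[str] = None
    # Statements on IPR and administrative properties
    ipr_statement: Optional[str] = None
    # Data quality assessment
    data_quality_note: Optional[str] = None
    # Assumed field: link to unit-level data, set only where a collection provides it
    unit_level_url: Optional[str] = None

    def has_unit_level_access(self) -> bool:
        """True if the collection already offers access to unit-level data."""
        return self.unit_level_url is not None


if __name__ == "__main__":
    # Invented example values, for illustration only.
    record = CollectionMetaRecord(
        institution="Example Museum",
        taxonomic_keywords=["Lepidoptera"],
        geographic_keywords=["Europe"],
        specimen_type="pinned",
    )
    print(record.has_unit_level_access())
```

Keeping the unit-level link optional mirrors the interim approach described in the text: every collection is described at the meta level, and only some records lead on to object data.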
A detailed list of attributes describing general (also non-biological) collections was compiled by the UKOLN Collection Description Working Group (Powell 1998). It is based on the "Dublin Core simple content description model for electronic resources" (Anon. 1999), an emerging internet standard for metadata.
The BioCISE collection catalogue 13 represents a first step on the way to a European collection meta-information system. Initially designed to document the survey of European collections and collection databases, it turned out to be a valuable tool that enables potential users to contact holders of collection data. A simple query interface allows for both predefined queries (country and broad collection category) and free text searches (see fig. 4).
Apart from the above-mentioned networks providing unit-level information access, several national and/or thematic networks exist which provide collection meta-information. A relatively simple procedure was developed to link the respective information systems to the BioCISE catalogue (Güntsch & Vander Velde 1998). Three national initiatives (BIODIV 14, NatureWeb 15, Polish Herbaria 16) and three thematic networks (Index Herbariorum 17, CABRI 18, IPGRI 19) exemplify this approach.
13 http://www.bgbm.fu-berlin.de/BioCISE/Database/default.htm
14 Biodiversity Resources in Belgium: http://www.br.fgov.be/BIODIV/
Fig. 4: The BioCISE collection catalogue Web interface.
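The two query modes of the catalogue interface, predefined filters on country and broad collection category plus a free text search, can be sketched as a simple filter over catalogue records. This is a minimal Python illustration and not the catalogue's actual implementation; the record layout (dictionaries with `institution`, `country`, `category`, and `keywords` keys) is assumed for the example.

```python
def search_catalogue(records, country=None, category=None, free_text=None):
    """Filter catalogue records by predefined criteria and/or free text.

    `records` is a list of dicts with the assumed keys
    'institution', 'country', 'category' and 'keywords'.
    Criteria left as None are ignored; matching is case-insensitive.
    """
    needle = free_text.lower() if free_text else None
    results = []
    for rec in records:
        if country and rec["country"].lower() != country.lower():
            continue
        if category and rec["category"].lower() != category.lower():
            continue
        # Free text: substring match against every field of the record.
        if needle and not any(needle in str(value).lower() for value in rec.values()):
            continue
        results.append(rec)
    return results


if __name__ == "__main__":
    # Invented sample records, for illustration only.
    sample = [
        {"institution": "Example Museum", "country": "Austria",
         "category": "natural history collection", "keywords": "Coleoptera"},
        {"institution": "Example Zoo", "country": "Poland",
         "category": "living collection", "keywords": "Aves"},
    ]
    for hit in search_catalogue(sample, country="austria", free_text="coleo"):
        print(hit["institution"])
```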
The co-operating networks periodically transfer files to the central system. These files contain a simple core set of attributes needed for querying from the BioCISE interface and for producing hyperlinks back to the local catalogues (fig. 5). The following data items are transmitted (one record per collection):
• Name of institution which owns the collection
• Country and city
• Simple list of (mainly taxonomic) keywords
• Parameter to establish the access (hyperlink) to detailed meta-information in the co-operating network
The transfer files are automatically converted into an SQL script populating the BioCISE database, so that system maintenance costs for the central system can be kept low. The BioCISE initiative now proposes to build on this approach to construct the portal to Europe’s collection and observation information. The development of appropriate geographic and taxonomic thesauri to build a common ground for the local indexing of collections is a prerequisite and one of the focal points of the proposed project.
15 Austrian Networking Initiative and Service for Natural History Collections and Observation Data: http://www.natureweb.at/
16 Catalogue of Scientific Herbarium Collections held by the Polish Academy of Sciences: http://www.ib-pan.krakow.pl/
17 The Herbaria of the World: http://www.nybg.org/bsci/ih/
18 Common Access to Biotechnological Resources and Information: http://www.cabri.org/
19 Directory of Germplasm Collections: http://www.cgiar.org/ipgri/doc/dbintro.htm
[Figure: diagram with the elements Local or thematic Network, Network WWW Interface, Transfer File (ASCII), BioCISE Database, BioCISE Collection Catalogue, and Hyperlink]
Fig. 5: Co-operation with local and thematic networks.
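The conversion of a transfer file into an SQL script, as described above, might look roughly like the following Python sketch. The actual transfer file layout is not specified in this paper, so the format used here (one tab-separated line per collection with the fields institution, country, city, keywords, and link parameter), the table name `collection`, and the helper names are all assumptions made for illustration. SQLite is used only to show that the generated script runs.

```python
import csv
import io
import sqlite3

# Assumed target table; the real BioCISE schema is not given in the text.
CREATE_TABLE = (
    "CREATE TABLE IF NOT EXISTS collection ("
    "institution TEXT, country TEXT, city TEXT, keywords TEXT, link_param TEXT);"
)


def sql_quote(value):
    """Quote a string as an SQL literal, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def transfer_file_to_sql(text):
    """Convert the (assumed) tab-separated transfer file into an SQL script."""
    statements = [CREATE_TABLE]
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:
            continue  # skip blank lines
        if len(row) != 5:
            raise ValueError(f"expected 5 fields, got {len(row)}: {row!r}")
        values = ", ".join(sql_quote(item.strip()) for item in row)
        statements.append(f"INSERT INTO collection VALUES ({values});")
    return "\n".join(statements)


def build_hyperlink(base_url, link_param):
    """Build the link back to the co-operating network's catalogue entry."""
    return f"{base_url}?{link_param}"


if __name__ == "__main__":
    # Invented sample data, for illustration only.
    sample = (
        "Example Museum\tBelgium\tBrussels\tAves; Mammalia\tid=42\n"
        "Jardin d'Essai\tFrance\tParis\tOrchidaceae\tid=7\n"
    )
    script = transfer_file_to_sql(sample)
    connection = sqlite3.connect(":memory:")
    connection.executescript(script)
    for institution, param in connection.execute(
        "SELECT institution, link_param FROM collection"
    ):
        print(institution, build_hyperlink("http://example.org/catalogue", param))
```

Generating a plain script rather than writing to the database directly keeps the central step small and inspectable, which fits the stated aim of low maintenance costs; a production system would more likely use parameterised statements instead of quoted literals.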
The formulation of predefined queries for the portal’s interface according to the user requirements defined in the survey is another crucial factor for the success of such a system. Central (international) costs have to be kept low to guarantee a sustained service. The proposed system therefore relies on automatic indexing of web contents in addition to the provision of metadata by networks organised (and financed) on the national level. With more unit data becoming available, meta-information will be successively enriched to become the Biological Collection Information Service for Europe.
Literature
ANON. (1999): The Dublin Core: A Simple Content Description Model for Electronic Resources. Dublin Core Metadata Initiative (ed.) [http://purl.oclc.org/dc/index.htm].
BERENDSOHN, W.G. (1998): Datenstrukturforschung und international vernetzte Umweltinformation auf der Ebene der Organisation. In: Hoppe, J., Helle, S. & Krasemann, L. (Hrsg.): Praxis der Umweltinformatik Band 7 - Vernetzte Umweltinformation. Marburg: 33-45.
BERENDSOHN, W.G., ANAGNOSTOPOULOS, A., HAGEDORN, G., JAKUPOVIC, J., NIMIS, P.L., VALDÉS, B., GÜNTSCH, A., PANKHURST, R.J. & WHITE, R.J. (1999): A comprehensive reference model for biological collections and surveys. Taxon 48: 511-562.
BERENDSOHN, W.G., HÄUSER, C.L. & LAMPE, K.-H. (1999): Biodiversitätsinformatik in Deutschland: Bestandsaufnahme und Perspektiven. Bonner Zoologische Monographien 45. Bonn.
BERENDSOHN, W.G. & GÜNTSCH, A. (2000): Resource Identification for a Biological Collection Information Service in Europe (BioCISE). Bocconea (in press).
CAMPBELL, P. (1998): 101 uses for a dead bird. Nature 394 (6698): 105.
GÜNTSCH, A. & VANDER VELDE, A. (1998 [Nov]): Collaboration of BIODIV and BioCISE [http://www.bgbm.fu-berlin.de/biocise/TheProject/IntroCollab.htm].
GÜNTSCH, A., HAHN, A. & BERENDSOHN, W.G. (2000): Repräsentation der Struktur biologischer Sammlungen als Grundlage für die Schaffung eines europäischen Metainformationssystems. GI Workshop - Umweltdatenbanken im Netz, Karlsruhe Juni 1999, Proceedings (in press).
LANE, M.A. (1996): Roles of Natural History Collections. Annals of the Missouri Botanical Garden 83(4): 536-545.
NAUMANN, C.M. & GREUTER, W. (1997): Naturwissenschaftliche Forschungssammlungen in Deutschland: Die biologischen Sammlungen. Funktion, Situation, Perspektiven. Denkschrift im Auftrag der Direktorenkonferenz naturwissenschaftlicher Forschungssammlungen Deutschlands (DNFS).
POWELL, A. (ed.) (1998 [Oct]): Metadata - Collection Level Description (Collection description working group, work in progress) [http://www.ukoln.ac.uk/metadata/cld/wg-report/].