Tschöpe & al. • Annotating biodiversity data via the Internet

TAXON 62 (6) • December 2013: 1248–1258

METHODS AND TECHNIQUES

Annotating biodiversity data via the Internet

Okka Tschöpe,1 James A. Macklin,2 Robert A. Morris,3 Lutz Suhrbier1 & Walter G. Berendsohn1

1 Botanic Garden and Botanical Museum Berlin-Dahlem, Freie Universität Berlin, Germany
2 Agriculture and Agri-Food Canada, Wm. Saunders Building, Ottawa, Ontario K1A 0C6, Canada
3 Harvard University Herbaria, 22 Divinity Avenue, Cambridge, Massachusetts 01238, U.S.A.
Author for correspondence: Walter G. Berendsohn, [email protected]

Abstract  Biological specimens in research collections provide the most important baseline information for systematic research. Traditionally, they are annotated by experts in written form, which remains directly associated with the specimens. These annotations, defined as data added at a later stage to the original data, provide an important quality control mechanism. They improve the value of herbarium specimens and are identification trails documenting the development of taxonomic concepts over time. With specimen data increasingly becoming accessible via the Internet, a general online annotation system that ensures that the traditional data sharing and documentation of specimen data is continued after the information is mobilised through digitisation, is currently missing. We lay out the prerequisites for such an annotation system including data standards, a data repository, system access, and user roles. We also introduce an exemplar solution developed in the DFG-funded AnnoSys project. AnnoSys is being implemented using the example of collection and observation data in the botanical domain as provided by the GBIF/BioCASe networks. It provides a user-friendly interface to allow researchers to produce and discover annotations. If a record has been annotated, both the annotation and the original record will be stored in a repository, linked via a persistent identifier, and will be accessible through the AnnoSys interfaces. Collection holders and scientists specifically interested in a subset of data will be informed about annotations in which they have expressed interest. We discuss AnnoSys in relation to the FilteredPush project, which pursues the same goal in facilitating and communicating online annotations, but which takes a different approach.

Keywords  annotation; AnnoSys; BioCASe; database; FilteredPush; herbarium specimen

Received: 15 Feb. 2013; revision received: 9 Sep. 2013; accepted: 10 Sep. 2013. DOI: http://dx.doi.org/10.12705/626.4 Published online “open-access” under the terms of the Creative Commons Attribution-ShareAlike (CC BY-SA) License, which permits unrestricted use, adaptation, distribution (under the same license), and reproduction in any medium or format, provided the original author and source are credited and modifications are indicated.

INTRODUCTION

Biological specimens in research collections ensure reproducibility and non-ambiguous referencing of research results relating to organisms. In addition, they are the most important basis of biological systematic research (Greuter & al., 2005). Collections world-wide hold an estimated 1.2–2.1 billion preserved specimens (Ariño, 2010). Given that a sufficient amount of fit-for-use collection data is available, this allows for, e.g., the modelling of potential species distributions (e.g., Hroudova & al., 2004; Cruz-Cardenas & al., 2012; Taylor & Kumar, 2012) and thus allows forecasting of their potential to become invasive, to distribute pathogens, or to carry out important ecosystem services such as pollination under changing conditions, e.g., climate change (Mohamed & al., 2006; Crawford & Hoagland, 2009; Molnar & al., 2012).

The original data. — When specimens are collected in the field, collectors record a number of items, describing the collection event (where, when, and by whom the specimen was collected), descriptive features of the organism collected (e.g., if it is a tree or an herb), and an identifier to connect the notes to

the physical material collected. When further processed by the collector, additional information may be added, most importantly an initial identification (i.e., a scientific name and classification). In herbarium specimens (dried and often pressed samples of macroscopic plants), this information usually ends up in a printed or handwritten paper label containing all the information. Upon mounting the specimen (fixing it to a sheet of rigid paper), the material and the collector’s information become united because the label is glued to the same sheet. Botanical collectors usually assign a sequential number to their individual collection (or set of collections from the same plant), which, together with their name and the name of the herbarium where the sheet was deposited, present a recommended and much used persistent identifier for these sheets in botanical literature and communications. When the information from herbarium specimens is digitised, the textual label data are usually recorded in a database (see Berendsohn & al., 1999, for a comprehensive description of specimen-related data). Such datasets are structurally more or less identical to data derived from species occurrence observations (e.g., from floristic mapping projects, monitoring and

Version of Record (identical to print version).


the like). The difference largely resides in the latter’s lack of additional managerial data items such as the herbarium accession number that pertain to the physical specimen. We here refer to these datasets, stemming from specimens or species occurrence observations, as the Original Data.

Annotations – definition. — We define annotations as data added at a later stage to the original data, to the specimen itself, or to a digital representation of the specimen. Essentially, we consider all data which are not original data as annotations. This also includes annotations of annotations. Traditionally, natural science collection objects are annotated in written form and curators add the annotations directly to the specimen preparation. In herbaria, annotations mostly take the form of little slips of paper that are glued to the herbarium sheet. Most frequently, annotations represent the reappraisal of the specimen by a specialist, who corrects or confirms the current identification (the name). Another very important (though less frequent) annotation refers to the status of the specimen as a nomenclatural type (i.e., the specimen that acts as a fixing point of a specific scientific name according to the Melbourne Code; McNeill & al., 2012). However, annotations may refer to any part of the original data, e.g., a correction or a specification of the locality (Bergmeier, 2011), comments on the life stage of the specimen, hints as to important morphological features and line drawings thereof, etc. (Berendsohn & Nimis, 2000). In addition, cross links to literature and to additional sampling (e.g., for molecular or micro-morphological examination) of the specimen may be recorded. The process of annotating specimens conforms to a long-standing tradition (Nevling, 1973).

Importance. — These kinds of annotations all improve the scientific value of a specimen for further research and, from an historical perspective, document the development of scientific knowledge.
With modern methods, handwritten annotations, which are commonly difficult to decipher, can be identified as to their author (Mund & Steinke, 2010) and thus document the (historical) scientific view of the authors. Such annotations also provide an interesting challenge for citizen science, because the capacity to translate or transcribe old handwriting often lies outside of the closer systematics community (Hill & al., 2012). More importantly, annotations ensure that scientists working with specimens share their scientific results with subsequent researchers working on the same specimen. Thus, specimen annotations are both a quality control mechanism that improves the value of herbarium specimens (Perkins, 2013) and an identification trail documenting the development of taxonomic concepts over time. If stored in searchable annotation databases or content management systems, the annotations can be aggregated to form a knowledge base of scientific opinions about the original data as well as the specimens. This can itself serve as an authority for subsequent annotations, e.g., the correction of a place name. The fact that herbarium collection data are being digitised ever more rapidly and are put on the Internet and used for all kinds of scientific purposes increases the urgency of keeping these records up-to-date. We do have to find a way to deal with sharing annotations because otherwise “old” data are being used for all kinds of purposes.

This paper first describes the current state of annotating herbarium specimens. It then centers on new problems in collection and annotation management brought about by databasing collections and eventually publishing specimen records on the Internet. Next, it lays out the prerequisites for an information system that tackles these problems and introduces an exemplar solution developed in the AnnoSys project. Finally, it briefly discusses how the FilteredPush project, while sharing many of the goals, differs from AnnoSys.

ANALYSING THE CURRENT STATE

Although at least some of the authors of this paper have a thorough background in herbarium management and procedures, we thought that a formal analysis of the traditional annotation workflow in herbaria would be useful. On the basis of the workflow analyses in the Herbarium Berolinense (B) and a survey with curators of 17 European herbaria about the current workflows in their herbaria, we conducted a system analysis and specification for an online annotation system for biodiversity data.

The classical workflow. — The classical workflow is illustrated in Fig. 1. A scientist wishing to study certain specimens first has to request access to these specimens. Access is permitted by the collection curator if the scientist is known or is associated with relevant institutions or authorities. Access is granted either by letting the scientist actually enter the herbarium, or by serving a loan request, i.e., sending the relevant specimens to the scientist’s institution on loan. In their loan practice, most scientific herbaria follow the guidelines given by the Committee for Recommendations in Desirable Procedures in Herbarium Practice and Ethics (Nevling, 1973). The scientist, having analysed the specimen, should document results by adding an annotation label to the specimen. Subsequently, when the specimen is returned, in most herbaria the annotation is controlled by a curator (Fig. 2) and the specimen is inserted again into the collection at its appropriate storage location. Most collections are using a systematic arrangement, i.e., the specimens are stored according to their systematic classification (e.g., family, genus, species). Unless scientists publish their results, information about their annotation activity will not be accessible outside the collection.

Databases introduce new problems. — Databasing the collection introduces a new problem for collection management.
If the annotation consisted of a new identification of the specimen, the processing may result in a new storage location for the specimen (or even a group of similar but unannotated specimens in the collection). This works as long as there is no electronic record of the specimen. If a record exists, the annotation must be recorded and linked to the original record in some way (e.g., by updating the record), because otherwise the specimen cannot be located using the database anymore. The herbaria that answered our survey mostly update their records, and many even annotate the physical specimen (Fig. 3). This is an additional new task for collection management and it is to be expected that it will outstrip resources when annotations


become more frequent. As a result, the database and the collection may increasingly lack synchronisation, rendering the data less and less useful. This is a problem at the collection management level that the future annotation system must help to solve. The Internet adds to the problem. — Collections are conscious of the fact that virtual annotations pose a problem to management, not only because of insufficient resources to maintain the traditional workflow (and thus maintain the physical collection as the central data store), but also because they will increasingly be unaware of annotations made, simply because these are residing with the electronic records elsewhere. This already poses a problem in the traditional workflow, when publications refer to specimens that have not been annotated properly by the author, or when duplicate specimens become annotated elsewhere. With electronic publication of specimen records and images, this problem is multiplied, because of the multitude of possible entry points for annotations. For example, aggregators like GBIF and BioCASE, taxonomic information portals like Scratchpads or EDIT Platform Sites, and individual on-line Floras and Faunas all cite specimens that may be annotated.


First attempts at a solution. — A number of biodiversity portals and digital herbaria have annotation mechanisms already integrated in their workflows. A few examples include the Atlas of Living Australia (Belbin, 2011), Atrium (AABP Atrium, 2012), SYNTHESYS (Güntsch & al., 2009), and JSTOR Plant Science (JSTOR, 2012). For molecular data, the DAS Writeback (Salazar & al., 2011) for protein sequences and AnnoTrack (Kokocinski & al., 2010) for annotating annotations of genome sequences exist, among others. Annotation workflows in these systems differ significantly concerning architecture, annotation options, interfaces and interoperability. While in local annotation systems like Atrium data of a specific collection, from one data provider, are annotated, in centralised annotation systems like the SYNTHESYS annotation system, data from distributed data providers are annotated and the annotations are stored in a single (central) repository. Another approach is taken by FilteredPush (Wang, 2009, see under Discussion below). SYNTHESYS, AnnoSys and FilteredPush already use standardised data as defined in domain standards but no attempt has been made to agree on a common protocol or joint data storage mechanisms. FilteredPush and AnnoSys are the first to use a joint standard for the annotation metadata.

Fig. 1. Classical workflow in herbaria.


THE NEED FOR A NEW WORKFLOW

New developments. — With the advent of databases and especially with data increasingly becoming accessible via the Internet, we must re-think our procedures for the annotation process. The traditional annotation is normally that of a specialist having had access to the physical specimen. However, data from (and images of) specimens are becoming available at an ever increasing rate. For herbarium specimens, the goal to completely image and digitise the specimens is clearly attainable, as demonstrated by the projects under way, e.g., in France (Chagnoux & Michiels, 2011) or the U.S. (e.g., Tulig & al., 2012). In addition, huge datasets of observations have become available, and while these lack the important possibility to re-examine the physical specimen, the data may be annotated and improved in quality as well. In the course of some scientific projects large quantities of data are analysed and quality-checked concerning their suitability for a certain question. For example, the taxonomic coherence of species identification is ensured or the reliability of georeferencing is tested. The issues discovered and changes made to the data, resulting from such quality checks, are indeed valuable annotations of the original data. However, currently the traditional flow of annotation information is interrupted. An on-line user of the data is not able to add an annotation to the digital record of the specimen or observation and link this back to the original source (or at least this is not possible in a convenient way). Researchers who downloaded datasets lack a convenient way to feed back their changes to the original dataset. On the other hand, citizen science projects like the Swedish Artportalen (ArtDatabanken, 2012) or eBird (Hochachka & al., 2012) add a new dimension, demonstrating how masses of annotations can be handled within a specific observation project using social networking mechanisms (Snall & al., 2011).

Duplicate specimens. — Another important aspect of an online annotation system storing and sharing annotations is the handling of duplicates: because the data on herbarium specimens usually also include the collector’s name and collection number, duplicate specimens housed by other institutions can be searched for and (as far as these data are recorded similarly) be identified. Annotations referring to a specific specimen can

Fig. 2. Results of a questionnaire about current annotation workflows in herbaria (n = 17). [Bar chart; y-axis: proportion (%) of yes/no answers. Questions: Do control procedures for physical annotations exist? If so, is control carried out by curators? Are credible corrections referring to online specimen processed? If not, are you discussing how to do that in the future?]

thus be made visible and distributed to all holders of duplicates, or be used in the data capture process for newly digitised specimens. Global infrastructures. — For more than a decade, international initiatives such as the Global Biodiversity Information Facility (GBIF) or the Biological Collection Access Service for Europe (BioCASE) have set up an open access infrastructure to connect biodiversity data from various distributed sources (Holetschek & al., 2009). The importance of the accompanying standardisation measures (of data and data services) cannot be overstated. The mobilisation effort not only starts to realise the enormous potential of such data for large-scale scientific analysis, but the accompanying standardisation is also a prerequisite of constructing a global annotation infrastructure which maintains the advantages of the traditional approaches and opens up the domain to new flexible and distributed solutions. What is needed. — Currently missing is a general online annotation system that ensures that the traditional data sharing and documentation of specimen data is continued after the information is mobilised by digitisation, and that observation datasets can be incorporated into the same system. The opportunity provided by computer-based annotation services and networks is to both maximize knowledge gain about organisms and efficiently disseminate it to everyone who has a vested interest in it (Macklin & al., 2006). Although need and acceptance of annotation procedures in the life sciences are undisputed, their use has been limited to a small group of experts (Kusber & al., 2009). Deans & al. (2012) state that although taxonomists are arguably the most active annotators of the natural world, the relevance of their products needs to be broadened by improving its accessibility. Annotations of freely accessible specimen or observation records should, therefore, be made freely accessible, wherever they have been created or stored. 
Management procedures have to be devised to put such records to good use, with the aim of improving data quality and fitness for use of the original data. In conclusion, we do need a solution that has the ability to store and access annotations referring to a physical specimen whenever and wherever these have been recorded. Such a global system needs to provide services which support the annotation process itself (data entry) as well as access to the annotation

Fig. 3. Results of a questionnaire about processing of virtual annotations (n = 15). [Bar chart; y-axis: proportion (%) of yes/no answers. Questions: Do you have a defined procedure in place on how to process virtual annotations? Do you update your collection database? Do you create an annotation label and place it on the specimen?]


for collection holders and researchers. There also should be a mechanism to document the results of quality control measures carried out on the annotation (e.g., the acceptance of an annotation by the collection holder).
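The duplicate handling described under "Duplicate specimens" above can be illustrated with a minimal Python sketch. It assumes each specimen record is a plain dictionary with the hypothetical keys `institution`, `collector` and `collector_number`; these are illustrative names, not ABCD or Darwin Core terms, and the normalisation shown is deliberately crude. It is not the AnnoSys implementation.

```python
from collections import defaultdict


def duplicate_key(record):
    """Build a comparison key from collector name and collection number.

    Assumption: lower-casing and dropping full stops is enough to make
    "A. Example" and "a example" comparable. Real data need more care.
    """
    collector = " ".join(record["collector"].lower().replace(".", " ").split())
    number = record["collector_number"].strip().lower()
    return (collector, number)


def find_duplicate_sets(records):
    """Group records that share a key and are held by different institutions."""
    groups = defaultdict(list)
    for record in records:
        groups[duplicate_key(record)].append(record)
    return [
        group
        for group in groups.values()
        if len({r["institution"] for r in group}) > 1
    ]


# Made-up example records (institution codes and names are placeholders).
records = [
    {"institution": "B", "collector": "A. Example", "collector_number": "1234"},
    {"institution": "HERB2", "collector": "a example", "collector_number": " 1234"},
    {"institution": "B", "collector": "A. Example", "collector_number": "1235"},
]
duplicate_sets = find_duplicate_sets(records)
```

An annotation made on one member of such a set could then be offered to the holders of the other members. As the text notes, this only works as far as collector data are recorded similarly across institutions.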

DEVELOPING A SOLUTION: SYSTEM REQUIREMENTS

Data standards. — Any solution involving the sharing of data requires standardisation of the format and the semantics of the data items to be shared. Fortunately, collection data have been thoroughly analysed by the natural history community over the last three decades. There are still items under discussion and some inconsistencies to be resolved (due mainly to the lack of consistent support by natural history institutions and data aggregators for the volunteer organisation devoted to standard development, the organisation for Biodiversity Information Standards, TDWG). However, we do have a solid base to build on, starting with data models developed in the 1990s (e.g., ASC 1992; BioCISE, Berendsohn & al., 1999) and continuing to data exchange standards like ABCD (Access to Biological Collection Data; Berendsohn, 2005) and DwC (Darwin Core; Wieczorek & al., 2012), which today support large-scale access to and aggregation of collection and observation data (as of 27 Dec. 2012, more than 383 million records are accessible through the GBIF network). These standards form the basic prerequisite for interoperability between disparate and locally distributed systems (Berendsohn & al., 2011). The annotation process itself requires a number of (meta-)data items that are completely covered by the current collection data standards. These are not unique to the natural history domain. As any material published on the Internet may be annotated, there are efforts under way to provide a global standard, comparable to the Dublin Core standard describing publications (DCMI, 2012). The W3C Open Annotation Data Model (Sanderson & al., 2013) specifies an interoperable framework for creating annotations. At present, the standard is still at a draft stage and no example of an implemented solution in the biodiversity domain exists.
Building on the current domain standards means a solution must build on XML-encoded data, simply because these are available in large quantities and because XML offers a simple and standardised framework for pointing to individual data items in records (the XML tags). However, it is essential that flexibility for future changes, e.g., to an RDF-based record scheme, is part of the system requirements.

The data repository. — As here defined, annotations are data added to existing data. However, data as provided by collection databases may change, for example due to the incorporation of annotations into the source database. This means that the “original data” as defined above has changed, although it may still be accessed using the same identifier. Storing only the annotation and a reference to that source is thus inadequate. Instead, all annotations conducted within the annotation system have to be stored persistently in a repository together with the original record (today: the XML-document


retrieved from the collection database or aggregator) they refer to. Recording persistent identifiers for the specimen in the collection is essential because it allows assembling all connected data: the annotation records together with the records they refer to, as well as the current state of the specimen record as newly retrieved from the source. The currently used triplets in DwC and ABCD consist of an institution ID, a collection ID and an ID for the specimen or observation (Unit ID). This has its weaknesses, because individual institutions may use literal terms as institution and collection IDs and subsequently change them, but in the networks it is largely working to uniquely identify specimens. Part of the internal annotation record is also the annotator’s personal profile information including login credentials and authorisation details. Upon registration, users will have to agree on the publication of their name, institution and (optional) e-mail address, so that the individual annotation is analogous to a traditional, physical annotation. The need for a persistent repository does not mean that there can be only a single site for storage. Protocols like BioCASe (Güntsch & al., 2006) allow one to jointly access distributed repositories to retrieve complex data. This is proven technology that works reliably as long as the underlying repositories are dependably providing data. A procedure to ensure the reliability of annotation repositories in the network must be developed (e.g., as a certification process such as the one established for World Data System members, see ICSU, 2011). System access, user roles. — People (and machines) may access the system taking different roles, and consequently have different rights to enter or change data in the system. The annotator has of course a central role as a data provider and people should be encouraged to participate. On the other hand, misuse of the system must be prevented. 
In the traditional system, quality control is first provided by selective access to the physical specimens (as discussed above). Secondly, the collection curators will normally carry out a check of the annotations themselves before returning the specimens to the collection (Fig. 2). The annotator’s identity is a significant quality criterion, e.g., for taxonomic determinations. As this does not apply to specimens freely accessible via the Internet, appropriate access and quality control mechanisms need to be established, e.g., to prevent identity theft. However, since it is desirable to encourage a wide range of users to conduct annotations, there has to be a trade-off between a strongly restricted access control and a user-friendly access to ensure maximum participation. Also, laws regulating the protection of personal data need to be considered and appropriate permissions have to be obtained from the annotator. The annotation data themselves should be put under open access, e.g., by using a CC-0 (no rights reserved, see, e.g., Keller & Mossink, 2008) license. This does not affect the issue of attribution to the annotator, since this is not covered by copyright, but is an obligation set by scientific rules and academic conventions (Agosti & Egloff, 2009). In any case, as one of the reviewers of this paper pointed out, enforcing attribution may lead to endless and fairly meaningless attributions every time a specimen was cited, and that there may be other ways to credit an annotator. For example, the annotator could


provide a reference to the published work being used, or being written by the annotator as part of the annotation. Annotators need to be informed about the further handling of their annotations and of their personal data. A second important role is that of a collection curator (and analogously, the curator of an observation dataset). Data management in a suitable system makes annotations accessible and reproducible. Our survey revealed that most of the herbarium curators process credible corrections of virtual annotations (Fig. 2), but only half of them have a defined procedure in place for how to do so (Fig. 3). Herbaria that do not process virtual annotations are discussing how to do so in the future. Over 80% of respondents update their collection database and create an annotation label and place it on the specimen when virtual annotations are made (Fig. 3). Curators need to be informed about new annotations to give them a chance to improve the dataset at the source or even the physical collection. The annotation system should also support them in this task. Notification can be via e-mail subscription or news feeds, so that curators become aware of annotations referring to their records, or by means of a query interface specific to the individual collection. The central annotation system could also provide a possibility to print annotation labels for direct use on the specimen. In reality, many annotations will never or at least not immediately be put to such uses, simply because most herbaria lack the resources to do so. However, in the course of an investigation or a loan of a specimen, all accumulated annotations referring to the specific specimen can be queried from the annotation repository and then transferred to the specimen. In addition, machine-readable services of the annotation system could provide the information once the local database is queried (e.g., by reading the barcode of a specimen that is being handled).
Such functionality has to be incorporated into the local specimen management system, but since more and more collections tend to join forces and use software jointly, this may become feasible in the future. A further step in that direction would be the direct incorporation of annotation data into the local system upon approval by the curator. Supporting this by a generic process that has been called “reverse wrapping” in the SYNTHESYS project has been investigated, but no generalised solution has been found (J. Holetschek, pers. comm.). This is analogous to the “push” in FilteredPush: that project has generally concluded that (1) generic schema mapping is not yet possible; (2) mapping a known exchange schema to an arbitrary relational database for a limited set of business operations is feasible by configuration, and (3) mapping a known exchange schema to a specific relational database is possible through a generic interface with database-specific code and configuration. Methods 2 and 3 have been demonstrated by the FilteredPush project, which found that 3 was not very flexible. The current implementations focus on method 2 and provide this functionality through an API that has been implemented on the Specify (Specify, 2013) and Symbiota (Symbiota, 2013) platforms. FilteredPush continues to explore where the boundary lies between general and specific interfaces for backend updates. Finally, there is the user of the data, who has read-only access to the annotation system. Data users need to be advised

to acknowledge the annotator’s personal effort by appropriate citation and attribution mechanisms. Since the records are under the cc-zero license, this is only a recommendation, but comparable to the citation of literature this should be part of best practice in science. Users of entire datasets will need guidance as to how to mix original and annotated data for their purposes of analysis. User (and machine) interfaces need to be designed based on criteria that can be used to search for specific data. Annotation data referring to a specific collection, a specific collector, added since a specific date etc. will be among such criteria. In botany, the combination of the principal collector’s name and collection number with the herbarium code (according to Index Herbariorum, Thiers, 2010) is widely used, so it should be possible to query the system using these data items. Motivating annotators. — The primary step to incite researchers to annotate data is the integration of annotation systems in portals serving or including specimen data, like GBIF or BioCASE. Secondarily, linking the annotation system with other tools used by researchers would facilitate the process. For example, virtual work platforms for biodiversity sciences provide users with various types of tools for their taxonomic work. The EDIT Platform for Cybertaxonomy (Berendsohn, 2010) and the EDIT Scratchpads (Smith & al., 2012) offer tools for data access, editing and management of data as well as for team collaborations and publication (Berendsohn & al., 2011). Dou & al. (2012) prototyped a semi-automated curation pipeline involving several tools and services for data normalization, cleaning, and enhancement using the scientific workflow system, Kepler. This workflow incorporated a FilteredPush actor to manage annotations, interacted with curators using Google cloud services when their expertise was required, and tracked the provenance of the data. 
Integrating an annotation system in such platforms would further encourage scientists to make annotations during their research or while curating, thus ensuring quality control and rapid information flow.
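
The search criteria named earlier in this section (herbarium code, principal collector, collection number, and the date since which annotations were added) can be illustrated with a small filter. This is only a sketch under stated assumptions: the `Annotation` class, its field names, the `search_annotations` function and the sample data are all hypothetical and are not taken from AnnoSys or FilteredPush.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Annotation:
    """One stored annotation; every field name here is hypothetical."""
    herbarium_code: str      # Index Herbariorum code of the holding institution
    collector: str           # principal collector's name
    collection_number: str   # collector's number for the gathering
    annotator: str
    created: date
    comment: str


def _same(a: str, b: str) -> bool:
    # Case-insensitive comparison that ignores surrounding whitespace.
    return a.strip().casefold() == b.strip().casefold()


def search_annotations(
    annotations: Iterable[Annotation],
    *,
    herbarium_code: Optional[str] = None,
    collector: Optional[str] = None,
    collection_number: Optional[str] = None,
    added_since: Optional[date] = None,
) -> List[Annotation]:
    """Return the annotations that satisfy every criterion that was given."""
    hits = []
    for ann in annotations:
        if herbarium_code is not None and not _same(ann.herbarium_code, herbarium_code):
            continue
        if collector is not None and not _same(ann.collector, collector):
            continue
        if collection_number is not None and not _same(ann.collection_number, collection_number):
            continue
        if added_since is not None and ann.created < added_since:
            continue
        hits.append(ann)
    return hits


# Invented sample data, for illustration only.
SAMPLE = [
    Annotation("B", "Example, A.", "1234", "annotator-1", date(2013, 3, 1), "Coordinates corrected"),
    Annotation("B", "Example, A.", "1235", "annotator-2", date(2012, 11, 20), "Identification revised"),
    Annotation("GH", "Sample, B.", "77", "annotator-1", date(2013, 6, 15), "Locality translated"),
]
```

Calling `search_annotations(SAMPLE, herbarium_code="B", collector="Example, A.", collection_number="1234")` selects a single specimen by the combination widely used in botany, while `added_since` alone supports the kind of "what is new since my last visit" query a curator might run.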

IMPLEMENTING A SOLUTION: THE ANNOSYS WORKFLOW

The AnnoSys project is a research project funded by the German Research Foundation (DFG). The project aims at developing a specification and implementation for an annotation system and data repository for networked and highly complex biodiversity data. The three-year project, currently in its second year, has developed a prototype available on its website (Tschöpe & al., 2012). AnnoSys is being implemented using the example of collection and observation data in the botanical domain as provided by the GBIF/BioCASe networks. The original data are thus delivered as ABCD or Darwin Core standardised XML documents. The aims include the creation of a user-friendly interface to allow researchers to produce and discover annotations. AnnoSys will also allow bulk annotations, i.e., adding annotations to all records of a set of collection records, e.g., to point out or correct methodical errors or to add computed data (e.g., translations, coordinates for place names with their error, etc.). If a record has been annotated, both the annotation and the original record will be stored in a repository, linked via a persistent identifier. Linking back to the current, newly retrieved version of the record allows users to detect changes made by the provider to the original record. If these changes actually follow those suggested by an annotation, the annotation record itself may be deprecated or be stored in an archive. The entire data communication in the system will be effected using the established mechanisms of the GBIF/BioCASe network: the annotation store will itself become a BioCASe provider and can thus be accessed by the GBIF network. The collection holder, manager or curator, or in fact any other person specifically interested in a subset of the data, will be informed by a variety of mechanisms about annotations to the data of their interest.

The AnnoSys workflow is illustrated in Fig. 4. A user coming from a data portal (1) needs to log in (2) to be able to contribute annotations. Users may apply for a curator role for specific collections (or specific taxonomic groups within their collection). To encourage participation in the annotation process, there will be no further restrictions after registration. Read-only users can use the system without logging in. By pushing an “annotate record” button in the portal interface, the user enters the start page of AnnoSys. After logging in successfully, users can choose between the two functions “make annotation” and “search annotation” (6). After the user has finished and saved an annotation (3), the annotated data and comments, together with the original version given by the data provider, are saved in the repository of the annotation server (4), linked via a persistent identifier. Via a message system (5), the annotation server then informs the data provider in charge, as well as the subscribers of the annotation information system, about the new annotation. Collection managers can then decide whether to accept or reject the annotation and to subsequently update their data. Rejections of an annotation can be stored as a comment to the annotation. New annotations will always refer to the latest record retrieved from the provider’s database. It is therefore very important that the physical specimen object or the original data record is uniquely identified. As discussed before, the triple ID used in ABCD and DwC to identify objects is not perfect, but because it is grounded in the physical object (or original observation) it is probably the best means to ensure retrievability of all information pertaining to a specific object.
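
The save, notify and decide steps of this workflow can be sketched in a few lines. The following is a minimal in-memory illustration, not the AnnoSys implementation: the class and method names are hypothetical, a random `urn:uuid:` string stands in for whatever persistent identifier scheme is actually used, and plain callbacks stand in for the message system.

```python
import uuid
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class StoredAnnotation:
    identifier: str                    # stand-in for a persistent identifier (assumption)
    triple_id: Tuple[str, str, str]    # (institution, collection, unit ID) of the record
    original_record: str               # provider's record as retrieved at annotation time
    annotated_elements: Dict[str, str] # element name -> proposed value
    comment: str
    status: str = "open"               # "open", "accepted" or "rejected"
    curator_note: Optional[str] = None


class AnnotationRepository:
    """Hypothetical repository covering steps (3) to (5) of the workflow."""

    def __init__(self) -> None:
        self._store: Dict[str, StoredAnnotation] = {}
        self._subscribers: List[Callable[[StoredAnnotation], None]] = []

    def subscribe(self, callback: Callable[[StoredAnnotation], None]) -> None:
        # Data providers and other interested parties register for notifications.
        self._subscribers.append(callback)

    def save(self, triple_id, original_record, annotated_elements, comment) -> StoredAnnotation:
        # Store the annotation together with the original record, then notify.
        ann = StoredAnnotation(
            identifier=f"urn:uuid:{uuid.uuid4()}",
            triple_id=tuple(triple_id),
            original_record=original_record,
            annotated_elements=dict(annotated_elements),
            comment=comment,
        )
        self._store[ann.identifier] = ann
        for notify in self._subscribers:
            notify(ann)
        return ann

    def decide(self, identifier: str, accept: bool, note: Optional[str] = None) -> StoredAnnotation:
        # A collection manager accepts or rejects; a rejection note is kept as a comment.
        ann = self._store[identifier]
        ann.status = "accepted" if accept else "rejected"
        ann.curator_note = note
        return ann

    def provider_changed(self, identifier: str, current_record: str) -> bool:
        # Compare the stored original with the newly retrieved provider record.
        return self._store[identifier].original_record != current_record
```

A plain string comparison is the crudest possible change check; a real system would more plausibly compare parsed XML elements so that whitespace or attribute order does not count as a change.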

Fig. 4. AnnoSys workflow. Numbers in circles represent different parts of the workflow: A user coming from a data portal (1) needs to log in (2). After an annotation has been made by the user (3), the annotated data are stored in the repository of the annotation server (4), together with the original version given by the data provider. A message system (5) then informs the data provider and other subscribers of the annotation system about the event. Interested collection managers can then decide on whether to accept or reject the annotation and to subsequently update their data. Instead of being referred to AnnoSys via a portal, users can also access AnnoSys directly and search for annotations via a query function (6).


DISCUSSION

The virtual annotation of biodiversity data poses multiple challenges, including the use of appropriate standards, storing original records and annotations in a repository where they are easily accessible, and supporting curatorial workflows (e.g., retrieval of annotations when specimens are accessed in the collection). AnnoSys takes a step on the way to a completely virtualised annotation procedure by contributing to a transparent and optimised information flow between portal users and data providers. By offering the best possible support for integrating annotations into local databases, AnnoSys counteracts the divergence between collections and databases that arises if annotated records are not updated in the collection databases by the curators.

In addition to usability, a reasonable reward for annotators may be of importance for the acceptance of the system. Citizen science sites often use a kind of game-like competition to encourage participation. However, this may not be very attractive to the scientist. Chavan & Penev (2011) suggested that peer-reviewed data papers should provide reward structures for professional recognition of data contributions, but this approach is mainly geared towards data compilations, not annotations. However, a periodic review of data annotation activities may offer some way of recording and acknowledging such contributions, and thus of contributing to scientists’ careers. In general, utilisation and acceptance will improve when data quality significantly increases and users are able to easily contribute to the quality improvement of biodiversity data. It is thus of highest importance to create a unified global system that can be used productively by anybody.

Unlike simple annotation systems that allow only unstructured comments associated with records (e.g., the GBIF feedback system or the Disqus (2012) platform used in JSTOR), AnnoSys allows structured annotations of specific elements of a record. This saves the curator from the work of editing the information and allows the further application and integration of annotations. It also greatly improves on the SYNTHESYS annotation system (Güntsch & al., 2009), insofar as it shields the user from the underlying XML and offers a user-friendly interface, allowing users to annotate as well as to search for annotations.

The FilteredPush system, which pursues the same goal as AnnoSys in facilitating and communicating online annotations, takes a different approach. The FilteredPush architecture is agnostic about whether data or annotations are centrally located or distributed; fully distributed deployments consist of a set of FilteredPush clients and network nodes, each node representing an access point. In the fully distributed approach annotations are distributed immediately from the issuing endpoint to any other interested endpoint connected to the FilteredPush network, typically including the original data holder, which may act on the annotation according to its own policies. FilteredPush decomposes its architecture so that producing, consuming, distributing, and storing annotations are independent modules that can be invoked either by web services or with lightweight wrappers, such as PHP or Python libraries. In its simplest form it allows only the generation of annotations, with the distribution and storage left to application programs. Its most network-centric form is dedicated to an architecture that is more complex than is needed by the collection community. Nevertheless, a common framework or library enabling AnnoSys and FilteredPush to manage annotations in a compatible manner would pave the way to seamlessly exchange annotations, first between both projects, and later under the umbrella of TDWG for the entire community. A fully distributed FilteredPush configuration consists of several access points whose distribution services are invoked by clients like Morphbank (2013) and Symbiota (2013). Whether stored annotations and notices of their arrival are controlled by these access points, or are more directly accessible, is a network node policy configuration issue. The simpler of these two kinds of systems alone does not meet the needs of AnnoSys users, and the fully distributed one is more than we need. At the time of writing, no deployments intermediate in complexity are being built.

By contrast, AnnoSys can be accessed by a central web interface and comprises a central AnnoSys repository (or a network of distributed repositories) for annotations and annotated records that can be queried externally. Thus, data are retrievable and can be integrated into other systems. Many of the differences between the approaches rest on software engineering decisions that are outside the scope of this paper. Those differences would principally be of interest to engineers charged with expanding the functionality of the respective systems.

Annotating a highly complex data scheme and thereby creating new and searchable data within distributed systems is an innovative and challenging task. AnnoSys not only contributes to quality improvement of biological collection data, but can also be used in ongoing research projects for filtering usable data from the overall system of GBIF, BioCASe and GBIF-Germany. However, the problems tackled are not restricted to GBIF-like data but also apply to other complex biological information systems like taxonomic checklists with distribution data. Our prototype is based on the XML standards ABCD and DwC and uses the RDF-based Open Annotation Core Data Model to store and exchange annotations, thus making it highly interoperable and flexible. The use of largely generic software also allows an expansion to other XML-based disciplines.
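
To give a rough idea of what a structured annotation of a single record element could look like in the Open Annotation vocabulary, the sketch below builds a JSON-LD-style Python dictionary. It is an illustration under assumptions, not the AnnoSys serialisation: the `oa:` and `cnt:` terms are used as we understand them from the 2013 community draft, while the function name, all URIs and the element path are hypothetical.

```python
import json
from datetime import datetime, timezone

# Namespaces as published for the Open Annotation and Content-in-RDF vocabularies.
OA = "http://www.w3.org/ns/oa#"
CNT = "http://www.w3.org/2011/content#"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def build_annotation(annotation_id, record_uri, element_path,
                     proposed_value, annotator_uri, when):
    """Describe a proposed new value for one element of one record."""
    return {
        "@context": {"oa": OA, "cnt": CNT, "rdf": RDF},
        "@id": annotation_id,
        "@type": "oa:Annotation",
        "oa:motivatedBy": {"@id": "oa:editing"},
        "oa:annotatedBy": {"@id": annotator_uri},
        "oa:annotatedAt": when.isoformat(),
        # Body: the value the annotator proposes, embedded as text.
        "oa:hasBody": {"@type": "cnt:ContentAsText", "cnt:chars": proposed_value},
        # Target: a specific part of the provider's record, not the whole record.
        "oa:hasTarget": {
            "@type": "oa:SpecificResource",
            "oa:hasSource": {"@id": record_uri},
            "oa:hasSelector": {"@type": "oa:FragmentSelector", "rdf:value": element_path},
        },
    }


# Hypothetical identifiers and element path, for illustration only.
example = build_annotation(
    annotation_id="http://example.org/annotations/1",
    record_uri="http://example.org/records/B/Herbarium/B100000001",
    element_path="/DataSets/DataSet/Units/Unit/Gathering/LocalityText",
    proposed_value="Crete, near Chania",
    annotator_uri="http://example.org/people/annotator-1",
    when=datetime(2013, 8, 20, 12, 0, tzinfo=timezone.utc),
)
serialised = json.dumps(example, indent=2)
```

Because the target is a specific resource with a selector, an annotation of this shape addresses one element of the record, which is the property that distinguishes structured annotations from free-text comments attached to a whole record.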

ACKNOWLEDGEMENTS

The AnnoSys Project is funded by the LIS-program (Wissenschaftliche Literaturversorgungs- und Informationssysteme) of the German Research Foundation (DFG) under the title “Ein generisches Annotationssystem für Biodiversitätsdaten” (project number BE 2283/4-1). Macklin and Morris are supported in part by U.S. National Science Foundation Grant 0960535 to Harvard University, “Filtered Push: Continuous Quality Control for Distributed Collections and Other Species-Occurrence Data.” SYNTHESYS, SYNTHESYS 2 and EDIT are projects funded by the European Commission under the 6th and 7th Framework Programmes. We thank two anonymous reviewers and the editors of Taxon for helpful comments on the original manuscript submissions.


GLOSSARY OF TECHNICAL TERMS

ABCD: “Access to Biological Collection Data”, a community data standard for specimen and observation data (laid out as an → XML → Data schema).
Annotation Ontology: a vocabulary to describe performed annotations (Ciccarese & al., 2011).
Backend: in contrast to the frontend, which usually describes the system layer closest to the user, the backend is the system end point that processes user inputs and/or generates the output of processing results. The frontend interacts directly with the end user, while the backend denotes indirectly linked services or devices that respond to end user activities or requests, e.g., a server.
BioCASe: “Biological Collection Access Service”, a transnational network of biological collections of all kinds. BioCASe techniques enable unified access to distributed and heterogeneous collection and observational databases.
BioCASe protocol: manages the querying of distributed databases on the Internet and the resulting data exchange between data providers and data portals.
Content Management System: a web-based computer program for collaborative publishing, editing and updating web content.
Data repository: a place where data are stored to be accessed by users or data services.
Data schema: a formal description of the data structure.
Data semantics: the meaning and use of data.
Data standard: defines the names of data items, their content and formatting rules in a given context. Data standards enable the exchange and sharing of data and ensure that the interacting parties have the same understanding of what the data represent.
DwC: “Darwin Core”, data exchange standard for geographic occurrences of organisms and the physical existence of biotic specimens in collections.
Dublin Core: standard vocabulary for describing documents and objects using metadata.
EDIT Platform: The EDIT (“European Distributed Institute of Taxonomy”) Platform for Cybertaxonomy is a collection of tools and services which together cover all aspects of the taxonomic workflow.
GBIF: the “Global Biodiversity Information Facility” was established by governments in 2001 to promote and facilitate the mobilisation, access, and use of biodiversity information.
Metadata: “data about data”, describe other data by providing information about an item’s content.
Open Annotation: specification for connecting annotations with resources, utilising a methodology conformal with the architecture of the World Wide Web and the Linked Data initiative (Sanderson & Van de Sompel, 2011).
Open Annotation Data Model: an RDF-based specification for annotating digital resources. It pools the two former initiatives → Annotation Ontology and → Open Annotation.
Persistent identifier: a unique identification code permanently assigned to an object, so that the object can be unambiguously referenced.
PHP: programming language.
Python: programming language.
RDF: “Resource Description Framework”, a standard model for data interchange on the Web. RDF has features that facilitate data merging even if the underlying schemas differ, and it specifically supports the evolution of schemas over time.
Scratchpads: an online virtual research environment for biodiversity, allowing users to share data and to create their own research networks.
TDWG: “Taxonomic Databases Working Group”, now “Biodiversity Information Standards”, an organisation that develops, adopts and promotes standards for the exchange of biodiversity data.
User role: defines what a user is allowed to do in a system (e.g., user, administrator, curator). A given role is linked to a certain set of permissions in the system, such as read-only access, full annotation right, update-permission.
Web service: a software function that supports interoperable machine-to-machine interaction via a network.
Wrapper: an entity (data structure or software) that encapsulates (“wraps”) another entity, so that the contained elements can exist in the other system.
XML: “Extensible Markup Language”, a simple, flexible text format for the exchange and publishing of data.

LITERATURE CITED

AABP Atrium 2012. Atrium biodiversity information system for the Andes to Amazon Biodiversity Program at the Botanical Research Institute of Texas. http://atrium.andesamazon.org/index.php (accessed 26 Dec. 2012).
Agosti, D. & Egloff, W. 2009. Taxonomic information exchange and copyright: The Plazi approach. B. M. C. Res. Notes 2(1): 1–9. http://dx.doi.org/10.1186/1756-0500-2-53
Ariño, A. 2010. Approaches to estimating the universe of natural history collections data. Biodiv. Inform. 7: 81–92.
ArtDatabanken 2012. Artportalen.se: Rapportsystem för växter, djur och svampar. www.artportalen.se (accessed 26 Dec. 2012).
ASC (Association of Systematic Collections, Committee on Computerization and Networking) 1992. An information model for biological collections (draft). Report of the Biological Collections Data Standards Workshop, August 18–24, 1992. http://cool.conservation-us.org/lex/datamodl.html (accessed 26 Dec. 2012).
Belbin, L. 2011. The atlas of living Australia’s Spatial Portal. Pp. 39–43 in: Jones, M.B. & Gries, C. (eds.), Proceedings of the Environmental Information Management Conference 2011 (EIM 2011). Santa Barbara: University of California. http://dx.doi.org/10.5060/D2NC5Z4X
Berendsohn, W.G. (ed.) 2005–. ABCD Schema 2.06 – ratified TDWG Standard. TDWG Task Group on Access to Biological Collection Data, Botanic Garden and Botanical Museum Berlin-Dahlem. http://www.bgbm.org/TDWG/CODATA/Schema/default.htm (accessed 26 Dec. 2012).
Berendsohn, W.G. 2010. Devising the EDIT Platform for Cybertaxonomy. Pp. 1–6 in: Nimis, P.L. & Vignes Lebbe, R. (eds.), Tools for Identifying Biodiversity: Progress and Problems. Trieste: Edizioni Università di Trieste. http://www.openstarts.units.it/dspace/bitstream/10077/3737/1/Berendsohn,%20bioidentify.pdf (accessed 26 Dec. 2012).
Berendsohn, W.G. & Nimis, P.L. 2000. The complexity of collection information. Pp. 13–18 in: Berendsohn, W.G. (ed.), Resource identification for a biological collection information service in Europe (BioCISE). Berlin: Botanic Garden and Botanical Museum Berlin-Dahlem.
Berendsohn, W.G., Anagnostopoulos, A., Hagedorn, G., Jakupovic, J., Nimis, P.L., Valdés, B., Güntsch, A., Pankhurst, R.J. & White, R.J. 1999. A comprehensive reference model for biological collections and surveys. Taxon 48: 511–562. http://dx.doi.org/10.2307/1224564
Berendsohn, W.G., Güntsch, A., Hoffmann, N., Kohlbecker, A., Luther, K. & Müller, A. 2011. Biodiversity information platforms: From standards to interoperability. ZooKeys 150: 71–87. http://dx.doi.org/10.3897/zookeys.150.2166
Bergmeier, E. 2011. New floristic records, confirmations and other phytogeographical notes from Crete (Greece). Willdenowia 41: 167–177. http://dx.doi.org/10.3372/wi.41.41120
Chagnoux, S. & Michiels, H. 2011. Switching to the fast track: Rapid digitization of the world’s largest herbarium. In: TDWG 2011 Annual Conference – Abstracts. http://www.tdwg.org/fileadmin/2011conference/slides/Michiels-Chagnoux_Paris-Herbarium-digitization.pdf (accessed 26 Dec. 2012).
Chavan, V. & Penev, L. 2011. The data paper: A mechanism to incentivize data publishing in biodiversity science. B. M. C. Bioinf. 12(Suppl. 15): S2. http://dx.doi.org/10.1186/1471-2105-12-S15-S2
Ciccarese, P., Ocana, M., Castro, L.J.G., Das, S. & Clark, T. 2011. An open annotation ontology for science on Web 3.0. J. Biomed. Semantics 2(Suppl. 2): S4. http://dx.doi.org/10.1186/2041-1480-2-S2-S4
Crawford, P.H.C. & Hoagland, B.W. 2009. Can herbarium records be used to map alien species invasion and native species expansion over the past 100 years? J. Biogeogr. 36: 651–661. http://dx.doi.org/10.1111/j.1365-2699.2008.02043.x
Cruz-Cardenas, G., Villaseñor, J.L., Lopez-Mata, L. & Ortiz, E. 2012. Potential distribution of humid mountain forest in Mexico. Bot. Sci. 90: 331–340.
Deans, A.R., Yoder, M.J. & Balhoff, J.P. 2012. Time to change how we describe biodiversity. Trends Ecol. Evol. 27: 78–84. http://dx.doi.org/10.1016/j.tree.2011.11.007
DCMI 2012. Dublin Core Metadata Initiative. Singapore. http://dublincore.org/ (accessed 26 Dec. 2012).
Disqus 2012. http://disqus.com/ (accessed 26 Dec. 2012).
Dou, L., Cao, G., Morris, P.J., Morris, R.A., Ludäscher, B., Macklin, J.A. & Hanken, J. 2012. Kurator: A Kepler package for data curation workflows. Procedia Computer Sci. 9: 1614–1619. http://dx.doi.org/10.1016/j.procs.2012.04.177
Greuter, W., Naumann, C.M., Steininger, F., Breyer, R., Häuser, C.L. & Haas, F. (eds.) 2005. Naturwissenschaftliche Forschungssammlungen in Deutschland: Schatzkammern des Lebens und der Erde. Frankfurt: Schweizerbart Science Publishers.
Güntsch, A., Döring, M. & Berendsohn, W.G. 2006. Mobilisierung von primären Biodiversitätsdaten: Das BioCASe Protokoll und seine Anwendung in internationalen Netzwerken. Pp. 129–138 in: Knetsch, G. (ed.), Umweltdatenbanken und Netzwerke. Dessau: Umweltbundesamt.
Güntsch, A., Berendsohn, W.G., Ciardelli, P., Hahn, A., Kusber, W.-H. & Li, J. 2009. Adding content to content – a generic annotation system for biodiversity data. Studi Trent. Sci. Nat. 84: 123–128.
Hill, A., Guralnick, R., Smith, A., Sallans, A., Gillespie, R., Denslow, M., Gross, J., Murrell, Z., Conyers, T., Oboyski, P., Ball, J., Thomer, A., Prys-Jones, R., De la Torre, J., Kociolek, P. & Fortson, L. 2012. The notes from nature tool for unlocking biodiversity records from museum records through citizen science. ZooKeys 209: 219–233. http://dx.doi.org/10.3897/zookeys.209.3472

Hochachka, W.M., Fink, D., Hutchinson, R.A., Sheldon, D., Wong, W.-K. & Kelling, S. 2012. Data-intensive science applied to broad-scale citizen science. Trends Ecol. Evol. 27: 130–137. http://dx.doi.org/10.1016/j.tree.2011.11.006
Holetschek, J., Kelbert, P., Müller, A., Ciardelli, P., Güntsch, A. & Berendsohn, W.G. 2009. International networking of large amounts of primary biodiversity data. Pp. 552–564 in: Fischer, S., Maehle, E. & Reischuk, R. (eds.), Informatik 2009: Im Fokus das Leben; Beiträge der 39. Jahrestagung der Gesellschaft für Informatik e.V. (GI). Lecture Notes in Informatics 154. Bonn: Gesellschaft für Informatik. http://subs.emis.de/LNI/Proceedings/Proceedings154/gi-proc-154-8.pdf
Hroudova, Z., Zakravsky, P. & Cechurova, O. 2004. Germination of seed of Alisma gramineum and its distribution in the Czech Republic. Preslia 76: 97–118.
ICSU 2011. World Data System: Certification of WDS members. http://icsu-wds.org/images/files/WDS_Certification_Summary_11_June_2012.pdf (accessed 10 Feb. 2013).
JSTOR 2012. JSTOR Plant Science - About. http://plants.jstor.org/ (accessed 26 Dec. 2012).
Keller, P. & Mossink, W. 2008. Reuse of material in the context of education and research. Utrecht: SURFdirect. Archived at www.webcitation.org/62KX892td (accessed 26 Dec. 2012).
Kokocinski, F., Harrow, J. & Hubbard, T. 2010. AnnoTrack – A tracking system for genome annotation. B. M. C. Genomics 11: 538. http://dx.doi.org/10.1186/1471-2164-11-538
Kusber, W.-H., Zippel, E., Kelbert, P., Holetschek, J., Güntsch, A. & Berendsohn, W.G. 2009. From cleaning the valves to cleaning the data: Case studies using diatom biodiversity data on the Internet (GBIF, BioCASE). Studi Trent. Sci. Nat. 84: 111–122.
Macklin, J.A., Rabeler, R.K. & Morris, P.J. 2006. Herbarium Networks Part II: Developing a framework for exchange of botanical specimen data to reduce duplicative effort and improve quality using a “FilteredPush”. Society for the Preservation of Natural History Collections, 2006 Annual Meeting. Abstract (presentation available at: http://wiki.filteredpush.org/w/media/a/aa/Macklin_SPNHC_2006.odp; accessed 14 Feb. 2013).
McNeill, J., Barrie, F.R., Buck, W.R., Demoulin, V., Greuter, W., Hawksworth, D.L., Herendeen, P.S., Knapp, S., Marhold, K., Prado, J., Prud’homme van Reine, W.F., Smith, G.F., Wiersema, J.H. & Turland, N.J. (eds.) 2012. International Code of Nomenclature for algae, fungi, and plants (Melbourne Code). Regnum Vegetabile 154. Königstein: Koeltz Scientific Books.
Mohamed, K.I., Papes, M., Williams, R., Benz, B.W. & Peterson, A.T. 2006. Global invasive potential of 10 parasitic witchweeds and related Orobanchaceae. Ambio 35: 281–288. http://dx.doi.org/10.1579/05-R-051R.1
Molnar, V.A., Tokolyi, J., Vegvari, Z., Sramko, G., Sulyok, J. & Barta, Z. 2012. Pollination mode predicts phenological response to climate change in terrestrial orchids: A case study from central Europe. J. Ecol. 100: 1141–1152. http://dx.doi.org/10.1111/j.1365-2745.2012.02003.x
Morphbank 2013. Morphbank: Biological Imaging. Florida State University, Department of Scientific Computing, Tallahassee. http://www.morphbank.net/ (accessed 10 Feb. 2013).
Mund, B. & Steinke, K.-H. 2010. Processing handwritten words by intelligent use of OCR results. Pp. 174–185 in: Perner, P. (ed.), Advances in data mining: Applications and theoretical aspects. Berlin: Springer. http://dx.doi.org/10.1007/978-3-642-14400-4_14
Nevling, L.I., Jr. 1973. Report of the Committee for Recommendations in Desirable Procedures in Herbarium Practice and Ethics, II. Brittonia 25: 307–310. http://dx.doi.org/10.2307/2805592
Perkins, K.D. 2013. Annotation of herbarium specimens: Recommendations. University of Florida Herbarium/Florida Museum of Natural History. http://www.flmnh.ufl.edu/herbarium/anno/ (accessed 5 Nov. 2013).


Salazar, G.A., Jimenez, R.C., Garcia, A., Hermjakob, H., Mulder, N. & Blake, E. 2011. DAS Writeback: A collaborative annotation system. B. M. C. Bioinf. 12: 143. http://dx.doi.org/10.1186/1471-2105-12-143
Sanderson, R. & Van de Sompel, H. 2011. Open Annotation: Beta Data Model Guide, 10 August 2011. http://www.openannotation.org/spec/beta/ (accessed 6 Dec. 2012).
Sanderson, R., Ciccarese, P. & Van de Sompel, H. (eds.) 2013. W3C Open Annotation Data Model, Community Draft, 08 February 2013. http://www.openannotation.org/spec/core/ (accessed 20 Aug. 2013).
Smith, V.S., Rycroft, S., Scott, B., Baker, E., Livermore, L., Heaton, A., Bouton, K., Koureas, D.N. & Roberts, D. 2012. Scratchpads 2.0: A virtual research environment infrastructure for biodiversity data. http://scratchpads.eu (accessed 4 Dec. 2012).
Snall, T., Kindvall, O., Nilsson, J. & Part, T. 2011. Evaluating citizen-based presence data for bird monitoring. Biol. Conservation 144: 804–810. http://dx.doi.org/10.1016/j.biocon.2010.11.010
Specify 2013. Specify 6. http://specifysoftware.org/ (accessed 21 Aug. 2013).
Symbiota 2013. Symbiota – Promoting Bio-Collaboration. http://symbiota.org/ (accessed 21 Aug. 2013).
Taylor, S. & Kumar, L. 2012. Sensitivity analysis of CLIMEX parameters in modelling potential distribution of Lantana camara L. PLoS ONE 7(7): e40969. http://dx.doi.org/10.1371/journal.pone.0040969


Thiers, B. 2007–. Index Herbariorum: A global directory of public herbaria and associated staff. New York: New York Botanical Garden. http://sciweb.nybg.org/science2/IndexHerbariorum.asp (accessed 26 Dec. 2012).
Tschöpe, O., Suhrbier, L., Güntsch, A. & Berendsohn, W.G. 2012. AnnoSys website. https://annosys.bgbm.fu-berlin.de (accessed 26 Dec. 2012).
Tulig, M., Tarnowsky, N., Bevans, M., Kirchgessner, A. & Thiers, B.M. 2012. Increasing the efficiency of digitization workflows for herbarium specimens. ZooKeys 209: 103–113. http://dx.doi.org/10.3897/zookeys.209.3125
Wang, Z., Dong, H., Kelly, M., Macklin, J.A., Morris, P.J. & Morris, R.A. 2009. Filtered-Push: A map-reduce platform for collaborative taxonomic data management. Pp. 731–735 in: Burgin, M., Chowdhury, M.H., Ham, C.H., Ludwig, S., Su, W. & Yenduri, S. (eds.), 2009 WRI World Congress Computer Science and Information Engineering, vol. 3. Los Alamitos, California: IEEE Computer Society. http://dx.doi.org/10.1109/CSIE.2009.948
Wieczorek, J., Bloom, D., Guralnick, R., Blum, S., Döring, M., Giovanni, R., Robertson, T. & Vieglais, D. 2012. Darwin Core: An evolving community-developed biodiversity data standard. PLoS ONE 7(1): e29715. http://dx.doi.org/10.1371/journal.pone.0029715
