Methods in Ecology and Evolution 2013, 4, 201–205
doi: 10.1111/2041-210x.12009
APPLICATION
Harmonizing, annotating and sharing data in biodiversity–ecosystem functioning research
Karin Nadrowski1*, Sophia Ratcliffe1, Gerhard Bönisch2, Helge Bruelheide3, Jens Kattge2, Xiaojuan Liu4, Lutz Maicher5,6, Xiangcheng Mi4, Michael Prilop5,6, Daniel Seifarth5, Karl Welter1,5, Sven Windisch5,7 and Christian Wirth1
1Institute of Special Botany and Functional Biodiversity Research, University of Leipzig, 04103 Leipzig, Germany; 2Max Planck Institute of Biogeochemistry, 07745 Jena, Germany; 3Institute of Botany/ Geobotany and Botanical Garden, Martin Luther University Halle Wittenberg, 06120 Halle, Germany; 4Institute of Botany, Chinese Academy of Sciences, 20 Nanxincun, Beijing, Xiangshan, 100093, China; 5Topic Maps Lab, Natural Language Processing Group, University of Leipzig, 04109 Leipzig, Germany; 6Fraunhofer-Zentrum für Mittel- und Osteuropa (MOEZ), 04109 Leipzig, Germany; and 7Business Development, ESEMOS GmbH, 04109 Leipzig, Germany
Summary
1. The integrative research field of biodiversity–ecosystem functioning (BEF) requires close collaboration between researchers from different disciplines working at different scales of time, space and taxonomic resolution. Data can describe anything from abiotic ecosystem components to organisms, parts of organisms, genetic information or element stocks and flows. Researchers prefer the convenience of spreadsheets for data preparation, which can lead to isolated data sets that are diverse in structure and follow diverging naming conventions.
2. BEFdata (https://github.com/befdata/befdata) is a new, open source web platform for the upload, validation and storage of data from a formatted Excel workbook. Metadata can be downloaded in Ecological Metadata Language (EML). BEFdata allows the harmonization of naming conventions by generating category lists from the primary data, which can be reviewed and managed via the Excel workbook or directly on the platform. BEFdata provides a secure environment during ongoing analysis; project members can only access primary data from other researchers after the acceptance of a data request.
3. Due to their generic database schema, BEFdata platforms can be used for any research domain working with tabular data. BEFdata supports the compilation of coherent data sets at the level of the primary data, allowing researchers to explicitly model correlation structures across data sets for synthesis. The EML export enables efficient publishing of data in global repositories.
Key-words: Ecoinformatics, BEF-China, cooperating research groups, web applications, Ruby on Rails, knowledge management, Ecological Metadata Language, semantic web, spreadsheets, Web of Data
*Correspondence author. E-mail: [email protected]
© 2012 The Authors. Methods in Ecology and Evolution © 2012 British Ecological Society
Introduction
In biodiversity–ecosystem functioning (BEF) research, both the predictor – biodiversity – and the dependent variables – ecosystem services and functions – represent complex concepts. The data needed to establish BEF relationships are themselves highly heterogeneous and are typically generated by collaborative, interdisciplinary research consortia assembling expertise from various disciplines ranging from molecular ecology to remote sensing (Michener & Jones 2012). The diversity of data structures and scientific disciplines poses significant challenges when merging data sets to perform overarching meta-analyses. Here, we introduce the BEFdata platform that allows researchers to manage naming conventions between data sets and to import metadata and primary data from the same spreadsheet. It includes a transparent data sharing mechanism for cooperative research projects. We use a generic data structure to accommodate the complexity of BEF research, which makes our approach useful to other scientific disciplines.
In the following, we review the challenges of managing complex data, including (1) the heterogeneity of data structures, (2) the need to manage naming conventions at the primary data level and (3) the need for transparent data sharing mechanisms.
The transdisciplinary nature, as well as the range of spatial and temporal scales typical of BEF research, is reflected in the complexity of BEF data sets. They may describe the properties of soil layers, plant traits, occurrences of individual organisms, parts of individual organisms possibly at the molecular level or aggregated properties of conceptual entities such as vegetation layers or ecosystem matter pools. Additionally, the majority of data sets are human-entered and contain fewer than 1000 rows (Heidorn 2008; Lotz et al. 2012), each with a unique data structure. Researchers prefer to use spreadsheets to prepare their data for analysis (Tenopir et al. 2011), but without proper annotation, even simple data sets can be difficult to understand.
When data sets are prepared independently in each research project, it is often easier to generate new names for physical or conceptual objects than to work with names developed by other groups. Examples of such naming conventions are the codes given to plots, species names, individual IDs or categorical parameter values. Diverging naming conventions increase the effort required to harmonize data sets a posteriori.
One way of promoting data harmonization is by prescribing fixed data structures that enforce the use of naming conventions. For example, the Diversity Workbench (Triebel 2011) offers validation against many different web services, including services for scientific species names, habitat types, institutions or geographic context. However, these represent only a small subset of the data resulting from BEF research. Another approach is to allow any type of data file to be uploaded but ensure that detailed metadata is included in a standard form. For example, Metacat (KNB 2010) uses the Ecological Metadata Language (EML) format (Fegraus et al. 2005). See Hernandez-Ernst et al. (2008) for a review of ecological information standards.
Data requests, called ‘paper proposals’ or ‘proposals’, are often used within cooperative research projects as a way to make data exchange more transparent, to help in attributing credit to data contributors and to increase trust and team spirit (Stokstad 2011). They are formulated research ideas that specify what data are needed and whose expertise should be consulted to answer a specific question. Cooperative research projects that use paper proposals include, for example, the TRY initiative (Kattge et al. 2011a), BEF-China (this article) and the Nutrient Network (Stokstad 2011). To our knowledge, there are no data management solutions that offer paper proposal mechanisms to share data sets and to keep a record of data exchange.
BEFdata platform
The ‘BEFdata’ platform (Fig. 1) was developed within the Biodiversity–Ecosystem Functioning Research Unit of the German Research Foundation (BEF-China, http://www.befchina.de, FOR 891). BEFdata is an open source web application written in Ruby on Rails (Ruby, Thomas & Heinemeier Hansson 2011) with a PostgreSQL database (PostgreSQL Global Development Group 2012). During upload, the data are harmonized against existing data sets at the primary data level. We use a generic data structure in that we store all primary data in a single ‘sheetcells’ table (Kattge et al. 2011b, Appendix 1). BEFdata provides an EML metadata export (Appendix 2). A detailed user manual is provided in Appendix 3. Information on setting up and managing the platform can be found online (https://github.com/befdata/befdata). BEFdata platforms are currently implemented by the BEF-China project (http://china.befdata.biow.uni-leipzig.de) and its Chinese partner projects (http://159.226.89.107) and by the FunDivEUROPE project (http://fundiv.befdata.biow.uni-leipzig.de).
Fig. 1. Screenshots of the welcome pages of the BEF-China group (http://china.befdata.biow.uni-leipzig.de), its Chinese partner projects (http://159.226.89.107) and the FunDivEUROPE (http://fundiv.befdata.biow.uni-leipzig.de) BEFdata platforms. Data sets and paper proposals are grouped by projects (A), by user (B) and on a separate data view (C). Primary data as well as metadata are uploaded exclusively through a formatted Excel 2003 workbook (Appendix 4) to minimize user interaction with web forms. For a user manual, see Appendix 3 or the BEFdata code repository (https://github.com/befdata/befdata).
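The idea of storing all primary data in a single ‘sheetcells’ table can be illustrated with a minimal Python sketch. It is not the BEFdata schema (which is documented in Appendix 1 and implemented in Ruby on Rails); the field names, the two example data sets and the plot and species values are hypothetical and serve only to show how one record per cell lets differently structured tables share one store.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SheetCell:
    """One cell of an uploaded table (hypothetical field names)."""
    dataset: str       # data set the cell belongs to
    column: str        # column header in the workbook
    row: int           # row number within the data set
    import_value: str  # value exactly as uploaded


def to_sheetcells(dataset, header, rows):
    """Flatten a rectangular table into one record per cell."""
    cells = []
    for row_number, row in enumerate(rows, start=1):
        for column, value in zip(header, row):
            cells.append(SheetCell(dataset, column, row_number, str(value)))
    return cells


def distinct_values(cells, column):
    """Collect the distinct values of one column name across all data sets."""
    return sorted({c.import_value for c in cells if c.column == column})


# Two data sets with different structures end up in the same list of cells.
soil = to_sheetcells("soil", ["plot", "ph"], [["P01", 4.6], ["P02", 5.1]])
trees = to_sheetcells(
    "trees",
    ["plot", "species"],
    [["P02", "Schima superba"], ["P03", "Castanopsis eyrei"]],
)
all_cells = soil + trees
plot_names = distinct_values(all_cells, "plot")
print(plot_names)
```

Because every value lives in the same structure, a list of all plot names used across data sets is a single query over the cells rather than a merge of differently shaped tables.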
Data harmonization
BEFdata platforms take a bottom-up, data-driven approach to developing naming conventions. Primary data are uploaded from the import workbook (Appendix 4). BEFdata platforms currently support text, date, number and category data types; each type has its own validation rules. Original import values are stored in the ‘sheetcells’ table and are not altered thereafter. A separate ‘categories’ table enables adherence to naming conventions across data sets (Appendix 1). Data columns from the import data are assigned to data groups, and the upload process ensures that categories are unique within data groups. Primary data of number, date and category data types are matched to existing categories within their data group during upload (Fig. 2). Having different categories available for numeric data allows the explicit definition of missing data values. Invalid values are flagged to the user for manual checking. See the user manual in Appendix 3 for further information.
The bottom-up approach to naming conventions requires a level of data management that would not be needed with fixed naming conventions. Categories and data groups can be browsed by members and managed by data owners and administrators (Fig. 2). All the categories in a data group are listed on the individual data group page. Each category also has its own page that lists all the primary data linked to the category and the original uploaded value. Administrators can rename, merge and split categories on the platform (Fig. 2). Any changes are reflected in every data set that is linked to the category. Data owners can edit their data sets and reassign data groups, which restarts the validation process. The data owner can also download the workbook at any point. Any invalid categories will be highlighted in the downloaded file, and any missing or invalid data can be corrected in the workbook and the workbook re-uploaded.
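The matching and merging steps described above can be sketched as follows. This is a simplified illustration in Python, not the validation code of the platform: the `DataGroup` class, its methods and the example values are hypothetical, and the sketch only covers a numeric column in which non-numeric entries must match a category (for example a missing-value code) or are flagged.

```python
class DataGroup:
    """Hypothetical stand-in for a data group and its category list."""

    def __init__(self, name):
        self.name = name
        # category name -> set of import values that map to it
        self.categories = {}

    def add_category(self, name, *other_values):
        self.categories[name] = {name, *other_values}

    def match(self, import_value):
        """Return the category for an import value, or None if unmatched."""
        for name, values in self.categories.items():
            if import_value in values:
                return name
        return None

    def merge(self, source, target):
        """Fold category `source` into `target`; import values stay untouched."""
        self.categories[target] |= self.categories.pop(source)


def check_number_column(values, group):
    """Sort the cells of a numeric column into numbers, category links and flags."""
    numbers, linked, flagged = [], [], []
    for row, value in enumerate(values, start=1):
        try:
            numbers.append((row, float(value)))
            continue
        except ValueError:
            pass  # not a number: try the categories of the data group
        category = group.match(value)
        if category is None:
            flagged.append((row, value))  # left for manual checking
        else:
            linked.append((row, value, category))
    return numbers, linked, flagged


ph = DataGroup("soil pH")
ph.add_category("missing", "NA", "n.d.")
numbers, linked, flagged = check_number_column(["4.6", "NA", "4,6", "n.d."], ph)

species = DataGroup("species")
species.add_category("Schima superba")
species.add_category("S. superba", "Schima sup.")
species.merge("S. superba", "Schima superba")
merged = species.match("Schima sup.")
```

In this sketch the decimal comma in `"4,6"` is neither a number nor a known category, so it is flagged, while both spellings of the missing-value code resolve to the same category. After the merge, every import value formerly linked to `"S. superba"` resolves to `"Schima superba"` without the stored import values being changed.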
Data sharing workflow: paper proposals
Access to data sets is restricted to the data owners. Members who would like to use data sets for analysis must submit a paper proposal, which contains a list of the data sets. Data sets can be added to a logged-in user’s cart, and this collection of data sets can then be used for a paper proposal. The proposal is initially reviewed by a project board to make sure that it is novel, complementary and does not compete with other activities, and then by all data set owners listed in the proposal. Once all owners have approved the proposal, proponents gain download access to the requested data sets.
Fig. 2. Data group and category pages of a BEFdata platform. Categories are unique within data groups, and a data group page lists all its categories (A). During data import, primary data are matched to existing categories. Each category links to its own page (B), listing all the primary data it is associated with (C), including their original import values (D). Administrators can rename or merge categories from the data group pages (E) and split categories from the category page (B). See the text and the user manual in Appendix 3 for further information on how to manage categories and data groups.
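The two-stage review described above (project board first, then every data set owner) can be read as a small state machine. The following Python sketch is one possible reading, not the platform's implementation: the class, the state names and the user names are hypothetical, and a rejection by the board is an assumption that the text does not spell out.

```python
from enum import Enum


class State(Enum):
    SUBMITTED = "submitted"
    BOARD_APPROVED = "board approved"
    ACCEPTED = "accepted"
    REJECTED = "rejected"  # assumed outcome of a negative board decision


class PaperProposal:
    """Hypothetical model of a paper proposal and its review steps."""

    def __init__(self, proponent, datasets):
        # datasets: mapping of data set title -> owner
        self.proponent = proponent
        self.datasets = dict(datasets)
        self.state = State.SUBMITTED
        self.pending_owners = set(self.datasets.values())

    def board_decision(self, approved):
        if self.state is not State.SUBMITTED:
            raise ValueError("the board reviews a proposal right after submission")
        self.state = State.BOARD_APPROVED if approved else State.REJECTED

    def owner_approval(self, owner):
        if self.state is not State.BOARD_APPROVED:
            raise ValueError("owners are asked only after board approval")
        self.pending_owners.discard(owner)
        if not self.pending_owners:
            self.state = State.ACCEPTED  # every listed owner has agreed

    def can_download(self, user, dataset):
        if self.datasets.get(dataset) == user:
            return True  # owners always have access to their own data
        return (
            user == self.proponent
            and dataset in self.datasets
            and self.state is State.ACCEPTED
        )


proposal = PaperProposal("alice", {"soil pH": "bob", "tree growth": "carol"})
proposal.board_decision(approved=True)
proposal.owner_approval("bob")
access_before = proposal.can_download("alice", "soil pH")
proposal.owner_approval("carol")
access_after = proposal.can_download("alice", "soil pH")
```

The proponent gains access only when the set of pending owners is empty, which mirrors the rule that all owners listed in the proposal must approve before any of the requested data sets can be downloaded.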
Discussion
BEFdata platforms allow the harmonization of both primary data and metadata for collaborating research projects. In comparison with initiatives that concentrate on managing data set metadata (for example, Metacat, KNB 2010 or BExIS, Lotz et al. 2012), the focus of BEFdata is on the primary data and specifically naming conventions within primary data. Having complex but consistent sets of primary data offers new possibilities for analysing ecosystem functioning. Current approaches to interdisciplinary synthesis in BEF research compare the regression slopes from separate analyses (Balvanera et al. 2006; Nadrowski, Wirth & Scherer-Lorenzen 2010; Maestre et al. 2012). Consistent data sets enable synthesis at the level of the primary data where the correlation structure of data points can be modelled explicitly using hierarchical modelling techniques (Ogle & Barber 2007).
The categories of BEFdata platforms are not controlled vocabularies (NISO 2005). While homonyms can be resolved because categories are nested within data groups, it is not possible to specify narrower or broader terms or to flag synonyms. However, BEFdata can make the use of existing semantic tools easier: custom naming conventions are exposed on a common platform where they can be reviewed; data and metadata are stored in one relational database, enabling seamless data and metadata interrogation; and metadata can be exported in standard EML format. A logical further step is to implement data validation against existing web services or thesauri (Nadrowski et al. 2012). The possibility of using web services to exchange information between repositories will be the subject of future BEFdata development. We are additionally evaluating the integration of BEFdata platforms into Kepler workflows (Gries & Porter 2011; Pfaff, Nadrowski & Wirth 2012).
Conclusions
BEFdata platforms are communication tools that help researchers in cooperative research projects speak the same language using shared naming conventions, while having the convenience of working with spreadsheets. Our implementation of the paper proposal process makes data use more transparent, which can increase synergies in cooperative research programs. Global data visibility can lead to new scientific collaborations, and data can be exported in EML format. BEFdata platforms do not contain prescribed domain logic and can thus be used by any scientific domain working with tabular data. With this, we hope to make data management and reuse within cooperating research projects more efficient and enjoyable.
Current managers of BEFdata platforms have profited from the speed of installation and customization (1 to 3 days for managers inexperienced with Rails applications). They continue to profit from bug fixes and new features added to the common code repository. Initial feedback from the current users has been positive. Researchers have found it especially helpful to be able to extract automatically assembled lists of names across data sets for species or plots.
Acknowledgements
This manuscript was greatly improved by the comments of two anonymous reviewers. The authors wish to thank all the members of the BEF-China project for essential help and feedback in crafting the functionality of the BEFdata platform. K.N., M.P., D.S., K.W. and S.W. were supported by the German Research Foundation (DFG) through the BEF-China project (FOR 891, sub-project ‘Data management’) of C.W. and H.B., and S.R. was supported by the EU project FunDivEUROPE (265171, Work package 1, Task I.4 ‘Data management, data quality assessment and control’) of C.W.
References
Balvanera, P., Pfisterer, A.B., Buchmann, N., He, J.-S., Nakashizuka, T., Raffaelli, D. & Schmid, B. (2006) Quantifying the evidence for biodiversity effects on ecosystem functioning and services. Ecology Letters, 9, 1146–1156.
Fegraus, E.H., Andelman, S., Jones, M.B. & Schildhauer, M. (2005) Maximizing the value of ecological data with structured metadata: an introduction to Ecological Metadata Language (EML) and principles for metadata creation. Bulletin of the Ecological Society of America, 86, 158–168.
Gries, C. & Porter, J.H. (2011) Moving from custom scripts with extensive instructions to a workflow system: use of the Kepler workflow engine in environmental information management. Environmental Information Management Conference 2011 (eds M.B. Jones & C. Gries), pp. 70–75. University of California, Santa Barbara, CA.
Heidorn, P.B. (2008) Shedding light on the dark data in the long tail of science. Library Trends, 57, 280–299.
Hernandez-Ernst, V., Poigne, A., Voss, A., Voss, H., Berendsohn, W., Giddy, J., Gebhardt, M., Hardisty, A., Schentz, H. & Magagna, B. (2008) Data & Modelling Tool Structures. Status Report on Infrastructures for Biodiversity Research. Fraunhofer IAIS, Cardiff University, St. Augustin, Germany.
Kattge, J., Díaz, S., Lavorel, S., Prentice, I.C., Leadley, P., Bönisch, G., et al. (2011a) TRY - a global database of plant traits. Global Change Biology, 17, 2905–2935.
Kattge, J., Ogle, K., Bönisch, G., Díaz, S., Lavorel, S., Madin, J., Nadrowski, K., Nöllert, S., Sartor, K. & Wirth, C. (2011b) A generic structure for plant trait databases. Methods in Ecology and Evolution, 2, 202–213.
KNB (2010) Administrator’s Guide for Metacat 193. National Center for Ecological Analysis and Synthesis (NCEAS); Knowledge Network of Biocomplexity, Santa Barbara, CA.
Lotz, T., Nieschulze, J., Bendix, J., Dobbermann, M. & König-Ries, B. (2012) Diverse or uniform? – Intercomparison of two major German project databases for interdisciplinary collaborative functional biodiversity research. Ecological Informatics, 8, 10–19.
Maestre, F.T., Quero, J.L., Gotelli, N.J., Escudero, A., Ochoa, V., Delgado-Baquerizo, M., et al. (2012) Plant species richness and ecosystem multifunctionality in global drylands. Science, 335, 214–218.
Michener, W.K. & Jones, M.B. (2012) Ecoinformatics: supporting ecology as a data-intensive science. Trends in Ecology & Evolution, 27, 85–93.
Nadrowski, K., Wirth, C. & Scherer-Lorenzen, M. (2010) Is forest diversity driving ecosystem function and service? Current Opinion in Environmental Sustainability, 2, 75–79.
Nadrowski, K., Seifarth, D., Ratcliffe, S., Wirth, C. & Maicher, L. (2012) Identifiers in e-Science platforms for the ecological sciences. Communities in New Media: Virtual Enterprises, Research Communities & Social Media Networks. Proceedings of GeNeMe 2012 (eds T. Köhler & N. Kahnwald), pp. 259–272. TUDpress, Dresden.
NISO (2005) Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies. NISO Press, Bethesda, MD.
Ogle, K. & Barber, J. (2007) Bayesian data-model integration in plant physiological and ecosystem ecology. Progress in Botany (eds K. Esser, U. Lüttge, W. Beyschlag & J. Murata), pp. 281–311. Springer, Berlin, Heidelberg, Germany.
Pfaff, C.-T., Nadrowski, K. & Wirth, C. (2012) Using Kepler Workflows In Ecology. F1000 Posters, 3, 1356.
PostgreSQL Global Development Group. (2012) PostgreSQL – the world’s most advanced open source database. URL www.postgresql.org [accessed 9 November 2012].
Ruby, S., Thomas, D. & Heinemeier Hansson, D. (2011) Agile Web Development with Rails. The Pragmatic Bookshelf, Raleigh, North Carolina.
Stokstad, E. (2011) Open-source ecology takes root across the world. Science, 334, 308–309.
Tenopir, C., Allard, S., Douglass, K., Aydinoglu, A.U., Wu, L., Read, E., Manoff, M. & Frame, M. (2011) Data sharing by scientists: practices and perceptions (C. Neylon, Ed.). PLoS ONE, 6, e21101.
Triebel, D. (2012) Diversity Workbench. Staatlichen Naturwissenschaftlichen Sammlungen Bayerns (SNSB), München. URL www.diversityworkbench.net [accessed 9 November 2012].
Supporting Information
Additional Supporting Information may be found in the online version of this article.
Appendix S1. Class diagram of the BEFdata platform. Rails applications by default provide a class for every database table. Relationships between tables are not stored as foreign keys in the database but are defined in the classes.
Appendix S2. EML document as downloaded from the BEF-China BEFdata platform (http://china.befdata.biow.uni-leipzig.de), including a version using pseudo-code that refers to BEFdata classes.
Appendix S3. The BEFdata platform user manual.
Appendix S4. Excel workbook for importing data into BEFdata platforms.
Received 20 July 2012; accepted 15 October 2012 Handling Editor: Nick Isaac