
UNIVERSITY OF SOUTHAMPTON FACULTY OF SOCIAL SCIENCES Geography and the Environment

A taxonomic-based spatio-temporal data structure for modelling biodiversity networks. An Open-Source GIS tool to help bridging the gap between ecosystem science and community ecology

by Juan M. Escamilla Mólgora

Thesis for the degree of Master of Science

May 2015

ABSTRACT

Anthropogenic activities are drastically affecting the natural cycles of matter and energy on Earth. One of the most important effects is the change from natural to human-used land cover, estimated to affect more than 50% of the Earth's land. A direct consequence is a rapid decrease in global and local biodiversity. On average, species losses threaten the stability of ecosystem productivity and therefore its services to humankind. Dynamic Global Vegetation Models (DGVMs) have been extensively used in Earth System Models to forecast the implications of anthropogenic activities. However, their uncertainty has been found to be high, and incorrect or incomplete Plant Functional Type classifications have been identified as sources of that uncertainty. Ecosystem-evolutionary models have been proposed as a way to join the holistic ecosystem perspective with the mechanistic perspective of community ecology. This work presents a theoretical and computational implementation of a data structure (called Taxonomy) that integrates species presences with temporal and spatial components in a hierarchical network built from the taxonomic classification. The natural classification gives the network an evolutionary context by representing ancestral relationships. Taxonomy includes common biodiversity measures at all taxonomic levels, as well as algebraic operations (sum and difference) and a metric to calculate distances. The Taxonomies are then incorporated cell-wise into a regular lattice called GriddedTaxonomy, which implements methods that resemble those of a raster data type, e.g., projections, map algebra and statistical summaries. To give scale context to the model, a hierarchical arrangement of GriddedTaxonomies was implemented, similar to a layer stack. This aggregation was used to generate pseudo-presence/absence lists of species and can be used to interpolate Taxonomies in the future.
The work includes case studies applied to ecology and land use change that exemplify the software. The examples use temporal and spatial filters, algebraic operations and other features. Good programming practices were applied to release the software as an open-source project.

Contents

Declaration of Authorship

Preface
  0.1 The structure of the dissertation
    0.1.1 Introduction
    0.1.2 Methods
    0.1.3 Case Study
    0.1.4 Discussions
  0.2 To the reader: Here were dragons!
  0.3 Acknowledgements

1 Introduction
  1.1 The anthropogenic impact on the biosphere
    1.1.1 On the dawn of the Anthropocene?
    1.1.2 Ecosystem services to reconcile the biosphere and human well-being
    1.1.3 How will losses of species affect the provision of ecosystem services?
  1.2 Biodiversity through the lenses of ecology
    1.2.1 Different forms for measuring diversity
    1.2.2 Biodiversity in Community Ecology
      1.2.2.1 Richness
      1.2.2.2 Relative abundance
      1.2.2.3 Evenness
      1.2.2.4 Species diversity indexes
    1.2.3 Biodiversity in Ecosystem ecology
      1.2.3.1 Ecological functional diversity
    1.2.4 Integrating global biodiversity studies with meta-analysis methodologies
  1.3 Ecosystem dynamics and community assemblages
    1.3.1 Plant functional types past present and future
    1.3.2 Ecosystem stability as a function of biodiversity
    1.3.3 The ecological synthesis
  1.4 The knowledge gap and this work's contribution
    1.4.1 Objectives of the research
  1.5 Project aims
    1.5.1 Theoretic and Implementation aims
      1.5.1.1 First stage: The Taxonomy object
      1.5.1.2 Second stage: The GriddedTaxonomy object
      1.5.1.3 Third stage: The NestedTaxonomy object
    1.5.2 Aims: Software engineering and best practices for collaboration
    1.5.3 Aims: Case Studies for exploring biodiversity patterns with the tool
    1.5.4 Research main questions
  1.6 Data use and description
    1.6.1 The Global Biodiversity Information Facility
    1.6.2 Preprocessing methodology

2 Models description and software implementation
  2.1 Specie as the atomic study unit
  2.2 Foundations
    2.2.1 Model axioms
    2.2.2 Model Data Assumptions
  2.3 Mathematical definitions
    2.3.1 Network
    2.3.2 Algebraic operations
    2.3.3 Some immediate consequences
      2.3.3.1 Taxonomic Tree operations
        The Identity element
  2.4 Computational implementation
    2.4.1 Object Relational Mapper
    2.4.2 Distributed architecture
    2.4.3 Concurrent capabilities
  2.5 The Data Structures
    2.5.1 The Taxonomy Class
    2.5.2 Description
      2.5.2.1 Attributes
      2.5.2.2 Methods
        calculateRichness
        calculateIntrinsicComplexity
        distanceToTree(taxonomy)
        generatePDI
        getFreqs
        buildInnerTree
        cache
        loadFromCache
        sum
        diff
      2.5.2.3 Example
    2.5.3 The GriddedTaxonomy Class
      2.5.3.1 Description
      2.5.3.2 Attributes
      2.5.3.3 Methods
        restoreTaxonomiesFromCache(redis wrapper)
        createShapefile(option='richness',store='out maps')
        mergeGeometries
        distanceToTree(external taxonomic forest)
        setPresenceAbundanceData(reference dict)
        summary(attr='raw')
        cache(redis wrapper,key='default')
        intrinsicPanel(with this list='')
        Pythonic iterators
        sum
        diff
      2.5.3.4 Examples
    2.5.4 The Nested Gridded Taxonomy object
      2.5.4.1 Description
      2.5.4.2 Attributes
      2.5.4.3 Methods
        The Integer Representation
        iterator for NestedGriddedTaxonomy
        setPresenceInLevels
        cache
        loadFromCache
        remap
      2.5.4.4 Examples
  2.6 Other implemented tools
    2.6.1 The Cache - No SQL - System
    2.6.2 Building Grids
      2.6.2.1 Python wrapper for generating meshes
        createGridOnThisSquare
        createRegionalNestedGrid
        Example for generating mesh
        Functions in Postgis
        Guide to generate Nested Context
    2.6.3 Exploration tools for QGIS
        Methodology for visualization in web browser
        Methodology for configuring the visualization in QGIS as an action
  2.7 Collaborative implementation
    2.7.1 Documentation and ipython Notebooks
    2.7.2 Open source repository

3 Case studies
  3.1 Biodiversity Loss and Competition analysis using the operator
      3.1.0.1 Biodiversity loss
      3.1.0.2 Methodology
      3.1.0.3 Result
      3.1.0.4 Discussion
      3.1.0.5 Conclusion
    3.1.1 Competition analysis
      3.1.1.1 Methodology
      3.1.1.2 Results
      3.1.1.3 Discussion
      3.1.1.4 Conclusion
    3.1.2 Spatial distribution of community assemblages in Brasil-Argentina using Pseudo-Presence/Absence lists
      3.1.2.1 Methodology
      3.1.2.2 Results
      3.1.2.3 Discussion
      3.1.2.4 Conclusion

4 Discussions
  4.1 The role of ecosystems and community ecology and its implications in this model
    4.1.1 The concept of specie and other axiomatic flaws
    4.1.2 Computational Complexity
    4.1.3 Discussions on methods
  4.2 About data uncertainty and how to estimate it
  4.3 Different software used by biodiversity studies
      4.3.0.1 Data included
    4.3.1 Other open-source projects related to biodiversity studies
  4.4 Criticisms to this software and future lines of development
    4.4.1 Future development work and gaps

5 Conclusions
  5.1 Aims covered
  5.2 Future Research
  5.3 Epilogue

A Appendix - The concept of specie
  A.1 Systematics and the natural method for classification
    A.1.1 classification
    A.1.2 Reasons for using the taxonomic levels to represents species on Earth

B Excerpts from BiosPYtial source code

References

List of Figures

2.1 Biospytial distributed architecture. Blue arrows indicate messages sent through the network using the HTTP protocol. Dashed lines indicate potential I/O streams.
2.2 Concurrent process in which the master node (yellow) sends messages through HTTP requests. The other nodes (biosPytial instances) process according to the parameters given by the master node.
2.3 Histogram of taxonomic composition using relative abundance and richness using the method getfreqs from the class gbif.taxonomy.Taxonomy
2.4 Visualization of Taxonomic Tree filtered by the parameters described in example. As can be seen, the Taxonomic tree has all the information of all the occurrences within that area. The data structure is an acyclic graph and therefore can be treated mathematically as such.
2.5 Visualization of a Gridded Taxonomy showing taxonomic trees at specie level for an arbitrary area. Each cell is a Taxonomy data type.
2.6 A richness by genera map exported with the method createshapefile for a region in Central Mexico. Richness ranges from dark red (high richness) to light yellow (low richness). The resolution for each cell is 5 km. Each cell contains a Taxonomy object.
2.7 Two schematics of the NestedGriddedTaxonomy. The parent layer has the information of all the taxa within the area but cannot differentiate spatial heterogeneity. The bottom layer contains partial information when compared to bigger scale but has variability in all the region.
2.8 Number of cells in each level for a Nested Gridded Taxonomy
2.9 Configuring visualisation action in QGIS
2.10 Visualisation of the embedded Taxonomy visualizer web-app
2.11 The BiosPytial documentation website! http://test.holobio.me
2.12 Spatial Analysis using ipython notebooks https://github.com/molgor/biospatial
3.1 Study region in Central Mexico overlaid with an Ecoregional thematic map
3.2 Biodiversity loss from 1980 to 2000. Here visualizing loss in species richness
3.3 Competition using species/genera ratio
3.4 Competition using species/genera ratio
3.5 Competition
3.6 Assemblages of species by the method of integer representation
3.7 Assemblages of genera by the method of integer representation
3.8 Assemblages of families by the method of integer representation
3.9 Assemblages of orders by the method of integer representation
3.10 Assemblages of classes by the method of integer representation
3.11 Assemblages of phyla by the method of integer representation
3.12 Assemblages of kingdoms by the method of integer representation

Declaration of Authorship

I, Juan M. Escamilla Mólgora, declare that the thesis entitled A taxonomic-based spatio-temporal data structure for modelling biodiversity networks. An Open-Source GIS tool to help bridging the gap between ecosystem science and community ecology and the work presented in the thesis are both my own, and have been generated by me as the result of my own original research. I confirm that:

• this work was done wholly or mainly while in candidature for a research degree at this University;
• where any part of this thesis has previously been submitted for a degree or any other qualification at this University or any other institution, this has been clearly stated;
• where I have consulted the published work of others, this is always clearly attributed;
• where I have quoted from the work of others, the source is always given. With the exception of such quotations, this thesis is entirely my own work;
• I have acknowledged all main sources of help;
• where the thesis is based on work done by myself jointly with others, I have made clear exactly what was done by others and what I have contributed myself;
• none of this work has been published before submission

Signed:.......................................................................................................................
Date:..........................................................................................................................

Preface

The present work has been an exploration of how current ecological theory can be combined with geospatial sciences. This motivation dictated a research process that followed a branching pattern with several lines of study and analysis. The root of this research is a central problem in environmental sciences: how to link ecosystem processes with biodiversity loss? This problem has been tagged as crucial to tackle because of its implications for the contemporary environmental crisis and global change scenarios [Loreau, 2010], [Wilson, 2010], [Magurran, 2013] and [Tilman et al., 2014]. Given the interdisciplinary nature of the problem and the limitations in time and resources, my work does not aim to provide a canonical use of the traditional scientific method following a single argumentative line. Instead, it draws on distinct scientific disciplines and methods, supported by theoretical and experimental studies, as a way to settle, formalize and study these relationships. Together, these will help in bridging the gap between ecosystem and community ecology, with applications ranging from biodiversity loss to environmental management and conservation. This work is an effort to put the scattered pieces together by defining a common framework based on graph theory, evolution and spatial data structures. The main contribution is the development of Open Source software that can bring interdisciplinary studies onto common ground for analysing, visualizing, operating on (in the algebraic sense; see the Methods chapter for more information) and handling spatial-ecological data structures. As a corollary, a use case has been implemented for a region in central Mexico that characterizes a plant functional type thematic map using presences, relative absences and taxonomic relationships of all the known community composition in that area.

0.1 The structure of the dissertation

As said before, this dissertation does not follow a single strict line of research but several. It was therefore necessary to go deeper into the theory of three disciplines: Biology, Mathematics and Software Engineering. From Biology, it takes elements from community and ecosystem ecology, systematics and evolution; from Mathematics, formal systems and graph theory; and from Software Engineering, data structures, web technologies, relational mappers, No-SQL and best practices for software development [Wilson et al., 2014]. This document is structured in four chapters.

0.1.1 Introduction

An introduction to the problem, starting from contemporary environmental issues and the need to bring these two ecological theories together, followed by a synopsis of the debates involved in joining them. After this comes a current synthesis of the field and the proposal from [Pavoine and Bonsall, 2011] and [Loreau, 2010] to unify them using biodiversity indexes. This proposal involves certain biological definitions that are addressed in the next sections of the chapter, together with a brief exposition of why it is important to include ancestry relationships in ecological communities. Once the knowledge gap has been defined, an open problem is raised: the need to adapt this framework into an integrative model that translates biotic interactions in space and time into a distributed software tool capable of working with standard geographical information systems (GIS). A proposal is given that uses the taxonomic classification as the essential principle for building phylogenetic relationships. The use of a global integrated repository of biological occurrences is suggested, and the data and their preprocessing are described.

0.1.2 Methods

This chapter is divided into two sections: one that defines the mathematical foundations and another that describes the technical development. The first section describes the mathematical notions of sets, relationships and graphs (networks), and the definition of the core object, Taxonomy, which describes the community assemblage, together with its attributes and methods (morphisms). It also explains how the tree data structure is built and some other mathematical properties that give the object an algebraic behaviour for performing standard operations like sum or difference. The technical section describes the software needed to achieve this: the use of a spatial database, a caching system, decentralised systems and web standards, in order to provide a robust but flexible system capable of performing global analyses in the easiest way possible.
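To make the idea of a tree-shaped object with algebraic behaviour more concrete, the following is a minimal sketch in Python. It is not the BiosPytial implementation: the class name TaxonomySketch, its methods and the example sites are hypothetical and introduced only for illustration, and the temporal and spatial components of the real Taxonomy object are left out. It assumes that each species is described by a lineage with one name per taxonomic rank, that sum behaves like a union of trees and that difference keeps the species present in the first tree but not in the second.

```python
# Illustrative sketch only: names are hypothetical, not the BiosPytial API.

RANKS = ("kingdom", "phylum", "class", "order", "family", "genus", "species")


class TaxonomySketch:
    """Taxonomic tree built from species lineages (one name per rank)."""

    def __init__(self, lineages=()):
        # Each node is stored as its full path from the root, so equal names
        # under different parents remain distinct nodes.
        self.nodes = set()
        for lineage in lineages:
            lineage = tuple(lineage)
            if len(lineage) != len(RANKS):
                raise ValueError("a lineage needs one name per rank")
            for depth in range(1, len(RANKS) + 1):
                self.nodes.add(lineage[:depth])

    def species(self):
        """Leaves of the tree: the full-length paths."""
        return {n for n in self.nodes if len(n) == len(RANKS)}

    def richness(self, rank="species"):
        """Number of distinct taxa present at the given rank."""
        depth = RANKS.index(rank) + 1
        return sum(1 for n in self.nodes if len(n) == depth)

    def edges(self):
        """Parent-to-child links; with the nodes they define the tree."""
        return {(n[:-1], n) for n in self.nodes if len(n) > 1}

    def __add__(self, other):
        # Sum: the tree spanned by the species of both operands.
        return TaxonomySketch(self.species() | other.species())

    def __sub__(self, other):
        # Difference: the tree spanned by species found here but not in other.
        return TaxonomySketch(self.species() - other.species())


# Invented example data: two snapshots of the same hypothetical site.
oak = ("Plantae", "Tracheophyta", "Magnoliopsida", "Fagales", "Fagaceae", "Quercus")
pine = ("Plantae", "Tracheophyta", "Pinopsida", "Pinales", "Pinaceae", "Pinus")

site_1980 = TaxonomySketch([
    oak + ("Quercus rugosa",),
    oak + ("Quercus laurina",),
    pine + ("Pinus montezumae",),
])
site_2000 = TaxonomySketch([oak + ("Quercus rugosa",)])

lost = site_1980 - site_2000
print(lost.richness("species"), lost.richness("genus"))
```

Storing every node as its path from the root keeps the ancestry relationships explicit, so richness can be read at any taxonomic level and the result of a sum or difference is again a tree of the same kind.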

0.1.3 Case Study

This chapter focuses on a case study in Mexico, states an open problem and derives a methodology using the current software implementation. The results are analysed and discussed in this chapter, giving priority to the methodology in order to highlight the advantages, disadvantages and future development lines.

0.1.4 Discussions

This chapter analyses the pros and cons of using the methodology, both in the abstract and in particular with the GBIF data. It highlights the important aspects to consider in relation to current debates in the science of ecosystem modelling and evolutionary ecology, particularly within niche modelling studies and the new network ecology. A special section covers contemporary problems in systematics and the debate on how clearly taxonomic classification implies evolutionary histories. It closes with some possible ways of integrating more data and a comparison with a recently published effort to bring these data types together.

0.2 To the reader: Here were dragons!

This dissertation was carried out from October 2014 to May 2015 as part of the second year of the Erasmus Mundus Masters Programme in Geo-information Science and Earth Observation for Environmental Modelling and Management. The motivation for this work was to study spatio-temporal networks and possible ways to analyse them using spatial and temporal analysis, with an application to global biodiversity patterns. The project was very ambitious: it involved a synthesis of many disciplines. During these 7 months I encountered many problems; the database was so big (ca. 200 GB) that it overwhelmed my laptop. Jason Sandler from GeoData Institute helped me by hosting it on their server. Problems arose regarding the preprocessing, the creation of spatial indexes and other spatial operations and filters. Even simple operations like point-in-polygon tests can be computationally expensive when dealing with so much information. I hired a virtual private server (VPS) for nightly processing. On average, using my laptop, the model takes 12 hours to generate the tree structures in a nested grid for an area of approximately 500,000 km². A caching system for storing that information was built to ease the analysis. In order to give scientific context and meaning to the research, it was necessary to look for similar approaches and to guide the project along a well-founded research line that could give it a long life. This research line was found at the intersection of Ecosystem Science, Ecology and Spatial Analysis. The dissertation project is not complete, but it sheds light on a field that many consider crucial for building better ecosystem models to optimize ecosystem services. These are important problems to consider at a time of global climate change and environmental crisis. When reading this dissertation, consider that I gave my best to the project and that the limitations in time and computer resources, as well as logistic problems with the Erasmus Mundus programme, were added challenges to sort out.

0.3 Acknowledgements

I would like to thank Raul Jiménez Rosenberg from Conabio (National Commission for Knowledge and Use of Biodiversity) for sharing the GBIF database with me; Jason Sandler and GeoData Institute from Southampton University for hosting and migrating the 200 GB database; my supervisor Peter M. Atkinson for his comments, support and motivation with the project; Steven de Jong and Elisabeth Addink from Utrecht University for their valuable suggestions regarding applications and future lines of research; my parents, who have supported, motivated and comforted me since my first breath; and Laura Reinelt for listening, supporting me and sometimes arguing about the science in the research. I am very grateful to the People of Mexico, whose hard work and sacrifice throughout our history have built institutions like the Consejo Nacional de Ciencia y Tecnología (CONACYT), which sponsored me and many other Mexican colleagues in the pursuit of bringing other forms of science from abroad. Finally, I would like to praise the selfless work of the Free Software and Open Source community for contributing to build a world of freedom, knowledge and justice with their brilliant software.

Juan M. Escamilla Mólgora
Southampton, UK
Contact: [email protected] http://juan.escamilla.holobio.me

Stop Mexican State repression now!
26 May, 2015

This symbol m indicates that this work has contributed to the state of the field with an innovative feature.

Chapter 1

Introduction

Nothing in biology makes sense except in the light of evolution
Dobzhansky, T. [1973]

1.1 The anthropogenic impact on the biosphere

Almost 20 years have passed since the publication of the seminal paper by Vitousek et al. [1997] on human domination of Earth's ecosystems, which described the significant alterations in those ecosystems. They reported that between one-third and one-half of the land surface had been transformed by human action since the beginning of industrialization. The massive use of industrial agricultural products resulted in more atmospheric nitrogen being fixed by human activity than by natural processes, and more than half of all accessible fresh water had been put to human use. This paper started a debate on how to measure these transformations and on how anthropogenic processes feed back into ecosystem processes, with repercussions for society itself. Since then, many publications have highlighted the importance of ecosystem functioning and the stability of biogeochemical cycles, and have contributed theoretical methodologies, debates and experimental advances. Of particular importance is the synthesis by Foley et al. [2005], which stated that the exponential increase in land use change had allowed humans to appropriate the planet's resources while undermining the capacity of ecosystems to sustain food production, maintain freshwater and forest resources, regulate climate and air quality, and ameliorate infectious diseases. They estimated that croplands and pastures occupied approximately 40% of the Earth's land surface. Sachs [2008] validated this using remote sensing methods, concluding that the maintenance of the basic biogeochemical cycles that sustain the biosphere, notably the nitrogen cycle, is highly compromised.

1.1.1 On the dawn of the Anthropocene?

The strong modification of natural cycles such as those of carbon, nitrogen, phosphorus and water unbalances the natural conditions and the stability of processes (stability meaning low variation over time) of the whole Earth System. The main contributors are CO2 emissions caused by industrialization, deforestation associated with land use change, and the extensive use of agricultural techniques, explained as a direct consequence of population growth and the ideology of globalization [May, 2007]. The current environmental transformation is of such magnitude that some authors have suggested a shift from the present geologic epoch into what they call the Anthropocene. This argument is based on the assumption that, if environmental policies like mitigation and conservation continue to be neglected, the 11,700-year-old Holocene epoch will undergo a shift that will drive the Earth System into a new epoch with a high probability of being much less hospitable to human societies (Crutzen 2002; Steffen et al. 2011; Steffen et al. 2015). Therefore, understanding the ecological implications of ecosystem dynamics and processes is essential for identifying knowledge gaps and defining thresholds and tipping points (May 2007; Cornell 2012; de Vries et al. 2013; Baum and Handoh 2014; Tilman et al. 2014). Over the last decades, researchers have addressed the questions of which ecosystem processes are being hindered by anthropogenic activities and to what extent. Even though this is an important first step, it is also necessary to understand which human and natural processes improve the ecosystems' productivity and services to humans, or make them more efficient.

1.1.2 Ecosystem services to reconcile the biosphere and human well-being

The term ecosystem services has been defined as 'the benefits people obtain from ecosystems' by the Millennium Ecosystem Assessment [2005] (MEA). This definition covers managed and natural ecosystems and includes direct and indirect benefits to humans. Examples are food supply, water, minerals, aesthetic appreciation, medicine and timber. The MEA divides the services into four groups:

• Supporting, e.g., nutrient cycling, soil formation and primary production
• Provisioning, e.g., food, fresh water, biochemicals and genetic resources
• Regulating, e.g., pollination, invasion resistance, disease regulation and natural hazard protection
• Cultural, e.g., aesthetic, spiritual, educational and recreational

The last three categories have a direct impact on human well-being. Supporting services are considered to have an indirect impact because they make the existence of the others possible. The MEA has proposed 24 broad services. Of these, 15 are assessed as currently degrading and four as improving (related to agricultural services); for the rest, there is not enough information for an evaluation, but they are mostly related to the link between biodiversity and ecosystem functioning, e.g., pollination (Millennium Ecosystem Assessment 2005; Balvanera et al. 2005; Magurran and McGill 2011, in the Foreword by May).

1.1.3

How will losses of species affect the provision of ecosystem services?

The principal effect of massive land use change is the unprecedented increase in the global extinction rate². It has been estimated to be 100 to 1000 times higher than in previous massive extinctions identified by fossil records and stratigraphic studies [Millennium Ecosystem Assessment, 2005] and [May, 2010]. The exponential increase of extinction rates happens worldwide, and there is considerable evidence that ecosystem functioning (e.g., productivity, nutrient cycling) and ecosystem stability (i.e., temporal invariability of productivity) depend on the diversity of species [Naeem, 2009]. The decline of diversity diminishes human well-being by decreasing ecosystem services [Millennium Ecosystem Assessment, 2005] and [Isbell, 2010]. Ecosystem services are not counted in conventional measures of gross domestic product (GDP); nevertheless, their value is estimated to be similar to or greater than global GDP [Costanza et al., 1997] and [May, 2007]. Understanding the relationship between community assemblage, biodiversity and ecosystem productivity is crucial to identify the most influential factors that can affect future human well-being.

1.2

Biodiversity through the lenses of ecology

Definition 1.1 (Biological Diversity). The UN Convention on Biological Diversity defines biological diversity as: ‘the variability among living organisms from all sources, including, inter alia, terrestrial, marine, and other aquatic ecosystems, and the ecological complexes of which they are part; this includes diversity within species, between species and of ecosystems.’ [United Nations, 1992]

² Extinction with respect to time.

Biodiversity is simply the variety of life. It can range from genes and microbes in a few grams of soil to all the living beings on Earth. Biodiversity is a multifaceted concept that can be defined and documented in different ways [Magurran, 2004]. Measuring it requires a clear definition of the study frame, i.e. which aspects of life are going to be analysed and at which temporal, spatial and geographic scale. Biodiversity relates directly and indirectly to ecosystem productivity. Odum [1953] and Elton [1958] conjectured that high biodiversity implies high productivity and stability within ecosystems. This assertion, perhaps intuitive, could not be proven experimentally at that time, generating a long and controversial debate in ecology that ultimately split the science into two sub-disciplines: ecosystem ecology and community ecology [Loreau, 2010]. It is a fact that organisms share resources, compete, feed, pollinate and in general build up complex networks of physical, chemical and ecological dependencies. The scope of each particular study is what determines the methodology involved [Smith and Smith, 2001] and [Krebs, 2009].

1.2.1

Different forms for measuring diversity

The number of biodiversity measurements is large and growing (Southwood and Henderson 2009; Royal Society 2003; Magurran 2004), and still there is a general trend to condemn previous diversity indexes whenever a new one is reported [Southwood and Henderson, 2009]. Even though the number of these measures is overwhelming, many of them can be classified into classical families: richness, abundance and evenness from community ecology, and functional traits from ecosystem ecology.

1.2.2

Biodiversity in Community Ecology

A community is a group of living beings that interact and share a common place in the biosphere at a certain time. Community ecology is the study of the relationships between populations of different species at different spatial and temporal scales. Community ecology studies inter-specific interactions like trophic webs, richness, abundance, distribution of species, succession dynamics, ecological niches, host-pathogen patterns, evolutionary structures and dependences (Odum 1953; Smith and Smith 2001; Futuyma 2013; Pyron 2010). Community ecology is a dynamic field of research with more than 50 years of experience. Nevertheless, it lacks a synthetic theoretical framework to organize and represent these results in a formal system of hypotheses that could yield testable predictions. Contradictions between similar hypotheses are often found because concepts are ambiguous (Loreau 2010; Pavoine and Bonsall 2011; Tilman et al. 2014).


On the other hand, its contribution to the development of a general theory of life has been invaluable. For example, several community studies have focused attention on the ratios between taxonomic levels. Their results have shown ways of determining mutualistic and competitive interactions and speciation patterns (Barraclough et al. 1998; Gotelli and Entsminger 2001; Tofts and Silvertown 2000; Webb et al. 2002; Enquist et al. 2002). This also shows the importance of unifying phylogenetic models and ecosystem models. Representing intrinsic niche adaptations gives the opportunity to explore their relationships with productivity and stability.

1.2.2.1

Richness

Richness is the number of different species or taxa in a study area. Measuring this variable is an essential objective for community ecology and conservation biology. It gives an intuitive index of community assemblages. There are examples of richness studies at small and large geographic scales (Blake and Loiselle 2000 and Rahbek and Graves 2001, respectively). There is a historic development of these measurements going from the MacArthur-Wilson equilibrium model [MacArthur, 1967] to more recent models of neutral theory [Hubbell, 2001], metacommunity structure [Holyoak et al., 2005] and biogeography [Gotelli and Entsminger, 2001]. These studies have in general concluded that species richness counts are highly sensitive to the number, size and spatial arrangement of samples.

1.2.2.2

Relative abundance

Relative abundance is a measure of how common or rare a species is relative to other species in a study area or community [Hubbell, 2001]. Many relative abundance studies have focused on developing statistical estimators and frequency distributions. The resulting patterns have long been recognized and can be broadly summarized with the statement that most species are rare [Andrewartha, 1986], [Magurran, 2004].

1.2.2.3

Evenness

Evenness is a measure of how different the abundances of species in a community are from each other [Smith and Wilson, 1996]. When all species in a community have the same relative abundance, the community is considered to be perfectly even. In reality, all natural communities are highly uneven, making the term evenness a relative statement [Magurran and McGill, 2011]. Most evenness indexes are normalized to the range [0, 1], where 0 is maximally uneven and 1 is perfectly even (i.e. maximum entropy). Evenness gives a notion of symmetry within a community and therefore relationships of conservation forces³ can be deduced. Studies of broken symmetry can be imported from modern physics to explain why species composition within a community follows anisotropic patterns.

1.2.2.4

Species diversity indexes

Species diversity is often intended to represent two aspects of diversity measures: species richness and species evenness, i.e., the degree to which the relative abundance of each species is similar to the relative abundances of the other species present in the same study area or community (Maurer and McGill, Chap. 5 in Magurran and McGill 2011). While most research on the relationship between ecosystem diversity and stability has used richness, it is the variation in species composition that provides the mechanistic basis to explain the relationship between species richness and ecosystem functioning [Cleland, 2011]. The reason is that species differ from one another in their resource use, environmental tolerances, and interactions with other species. The union of all these distinct features is called the ecological niche. It is the diversity between these niches that explains how species composition has a major influence on ecosystem functioning and stability.
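The measures introduced in this section (richness, relative abundance, evenness) can be illustrated with a minimal sketch. It assumes a plain list of species names, one entry per observed individual, and uses the Shannon index with Pielou's normalisation as one common evenness index; the function name `diversity_summary` and the choice of index are illustrative assumptions, not necessarily the measures implemented in this work.

```python
import math
from collections import Counter


def diversity_summary(occurrences):
    """Return (richness, relative abundances, Shannon index, Pielou evenness).

    `occurrences` is a list of species names, one entry per individual.
    Pielou's evenness J = H / ln(S) is undefined for a single species,
    in which case None is returned for it.
    """
    counts = Counter(occurrences)
    total = sum(counts.values())
    richness = len(counts)  # number of distinct species
    rel_abundance = {sp: n / total for sp, n in counts.items()}
    # Shannon entropy of the relative abundance distribution
    shannon = -sum(p * math.log(p) for p in rel_abundance.values())
    evenness = shannon / math.log(richness) if richness > 1 else None
    return richness, rel_abundance, shannon, evenness
```

With this normalisation a community whose species all have the same abundance scores 1, and a community dominated by a single species approaches 0.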

1.2.3

Biodiversity in Ecosystem ecology

An ecosystem is an ecological community of living beings coupled with the non-living components of their environment (abiotic factors) like water, sunlight, soil and air. Organisms transform their abiotic environment by exchanging matter and energy in several nutrient cycles and energy flows (Smith and Smith 2001; Krebs 2009; Schlesinger and Bernhardt 2013). The ecosystem concept is broader than the community concept because it includes a wide range of biological, physical, and chemical processes that connect organisms and their environment [Loreau, 2010]. Ecosystem ecology studies the effects of these interactions by measuring fluxes of energy and establishing relationships through the exchange of matter at various spatial scales. This discipline focuses on answering questions regarding fluxes of matter and energy. For example: How is energy captured, transferred and dissipated? How are limiting nutrients recycled and incorporated into the system for growth? [Loreau, 2010]. Ecosystem ecology also studies the causes and effects of climate change by developing Dynamic Vegetation Models (DVMs). These models are integrated in Earth system models (ESMs) to include interactions of atmosphere, ocean, land and ice in order to estimate the state of regional and global climate under a wide variety of conditions (Smith et al. 2001; Heavens 2013). Despite their broad use, these models treat ecological systems as static, in the sense that the adaptability and coevolutionary processes of the biological communities involved are not taken into account, raising calls for the rejuvenation of the discipline (O’Neill 2001; Loreau 2010).

³ Invariants, using the formal geometricians’ definition.

1.2.3.1

Ecological functional diversity

The features that characterize the ecological function of an organism are termed functional traits. A functional group is a group of species that share a similar arrangement of traits [Cleland, 2011]. Species from different functional groups can exhibit complementarity in their ecological niches, meaning that they consume different resources, or the same resources at different times. Complementarity has been found to be an important aspect in preserving the stability of ecosystems by showing co-evolutionary relationships (Odum 1953; Loreau and Hector 2001; Webb et al. 2002; Pavoine and Bonsall 2011). For example, flycatchers and eagles in a temperate forest are both members of the predator functional group, but their feeding habits (trophic niches) do not overlap; this allows a higher total biomass of the predator group in the system. Plants exhibit a different case because almost all plants are primary producers. They use the same set of resources (e.g., space, light, water, soil); nevertheless, not all plants compete. Different adaptations have enabled plants to take up resources optimally in different seasons (phenology) or in disruptive conditions like openings in the forest or wildfires [Smith and Smith, 2001]. As said before, modern ecosystem models use Dynamic Vegetation Models (DVMs) that simulate the dynamics of plant functional groups with a term called plant functional type (PFT) [Wullschleger et al., 2014].

1.2.4

Integrating global biodiversity studies with meta-analysis methodologies

Many biodiversity studies analyse an area and archive the occurrences of the species found. If all this information becomes public, as suggested by the open science initiative⁴, the merged data from different sources could span the entire globe and become a good representation of reality [Herther, 2012]. The information can be used in meta-analysis, by (f)using data from different sources in an attempt to understand more aspects of biological interactions. In the same way that Monte Carlo methods can work as a stochastic approximation to problems with multiple variables, meta-analysis can use the fusion of data to approximate better ecological models [Balvanera et al., 2006].

⁴ http://science.okfn.org/


Ecologists possess in their file cabinets and spreadsheets a wealth of information about the natural world. It is urgent that environmental scientists join open initiatives for collaboration because, within the context of rapid climate change, an ecological dataset collected at a certain place and time represents an irreproducible set of observations [Wolkovich et al., 2012]. Meta-analysis studies can give new meaning to the variability of life measured from different perspectives [Magurran, 2013].

1.3

Ecosystem dynamics and community assemblages

1.3.1

Plant functional types past, present and future

PFTs are a special case of functional groups for representing primary producers and their relationships. This conceptual model has proven useful in simulations of plant distribution and climate change at regional to global scales (Haxeltine and Prentice 1996; Foley et al. 1998; Sitch et al. 2008). PFTs represent basic ecological types that group distinct taxa with similar ecological requirements (ecological classes) and have been a convenient way of describing vegetation structures without having to treat large numbers of species individually [Rubel et al., 1930]. The concept of PFT can be traced back as early as the 19th century, to the classification strategies of von Humboldt (1806) and Grisebach (1884), but it was not until April 1988, in ‘A meeting on Global Vegetation Change’ held at the International Institute for Applied Systems Analysis (IIASA) in Laxenburg (Austria), that the term Plant Functional Type appeared for the first time (Prentice 1989; Steffen et al. 1992). In the early development stages, PFTs were used as a way to understand the factors that influence the distribution of vegetation [Prentice et al., 2007]. In this decade, PFTs have been used extensively in Earth System Models to simulate and predict biogeochemical cycles under different climate change scenarios and to assess system properties like stability, productivity and resilience (e.g. Sitch et al. 2003; Wullschleger et al. 2014). However, as pointed out by O’Neill [2001], their accuracy has been hindered by the static implementation of the PFT concept, i.e. it is based on a discrete set of parametric traits that result in deterministic systems. There is strong evidence that uncertainty in ecosystem models arises from several sources, including incorrect or incomplete PFT classifications and inadequate parametrizations. The field is changing rapidly as scientists consider more options for representing plant diversity in PFTs and DGVMs. This will help identify sources of uncertainty and, if possible, reduce or resolve them (Pavlick et al. 2013; Scheiter et al. 2013). It is necessary to work in multidisciplinary groups of modellers and specialists from taxonomy, biogeography, ecology and remote sensing in order to develop more accurate and reliable ecosystem models [Wullschleger et al., 2014].


1.3.2

Ecosystem stability as a function of biodiversity

There are many definitions of stability [McCann, 2000]; going into them is outside the scope of this project. An intuitive definition of a stable system is one with low variability (i.e., little deviation from its average state) through time. Productivity can be interpreted as a state variable of the system [McCann, 2000]. For example, it can be measured as gross primary productivity (GPP), net primary productivity (NPP), biomass or another ecosystem service [Balvanera et al., 2006]. There are many ways to estimate productivity from indirect measurements. Remote sensing has been used to derive productivity as a function of time in practically any place on Earth at a certain spatial resolution, e.g., [Tucker and Sellers, 1986]. These approaches can be combined in community-ecosystem-traits models to analyse spatial correlations in productivity, species composition or biodiversity [Cleland, 2011].

Ecosystem ecology and community ecology represent extrinsic and intrinsic perspectives on how organisms interact with and respond to the environment: intrinsic, referring to how organisms adapt, grow and depend on one another, and extrinsic, in the sense of how these co-dependences affect the physical environment through fluxes of matter and energy. As explained before, the split of these two disciplines started after serious studies began to contradict the idea that high biodiversity enhances ecosystem stability, a hypothesis considered true since the work of Darwin [1859] and corroborated independently and empirically by Elton [1958], MacArthur [1967] and Odum [1953]. May [1973] made mathematical simulations using linear stability analysis to model ecosystem-stability-complexity relationships. Contrary to the general belief, he found that diversity tends to destabilize community dynamics, meaning that more complexity (i.e. diversity) increases the uncertainty of the system in terms of productivity⁵. These findings contradicted the empirical evidence that highly diverse ecosystems (e.g. tropical rainforests) are stable and resilient. However, other ecologists (e.g., Gardner and Ashby 1970; Pimm and Lawton 1978, revisited by McCann 2000) found results consistent with May’s using stochastic simulations. Evidently there was a lack of knowledge: if diversity and stability were positively correlated, as happens empirically, more had to be happening than simply increasing the number of species [McCann, 2000]. These questions opened a new field of study for integrating community ecology and ecosystem ecology and prompted efforts to formalize a general ecological model based on complex systems theory that could help explain these contradictory findings (Solé and Bascompte 2006; Levin 2007).

⁵ Uncertainty is the reciprocal of stability.
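The intuitive definition of stability used in this section (little deviation of productivity from its average state through time) can be made concrete with a small sketch. It assumes one widely used convention, the temporal mean divided by the temporal standard deviation, and takes uncertainty as its reciprocal; both function names and the choice of formula are assumptions made for illustration and are not taken from this work.

```python
import statistics


def temporal_stability(productivity):
    """Mean over standard deviation of a productivity time series.

    Higher values mean less variation around the average state.
    A perfectly constant series is treated as infinitely stable.
    """
    mean = statistics.fmean(productivity)
    sd = statistics.pstdev(productivity)  # population standard deviation
    return float("inf") if sd == 0 else mean / sd


def uncertainty(productivity):
    """Reciprocal of stability (the coefficient of variation)."""
    return 1.0 / temporal_stability(productivity)
```

For two series with the same mean, the one with the smaller fluctuations gets the higher stability and the lower uncertainty.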

1.3.3


The ecological synthesis

It was not until long-term experiments started to appear that a general understanding of the internals of ecosystem dynamics began to emerge. The first was the Cedar Creek biodiversity experiment in Minnesota, USA (Tilman et al. 1997; Tilman et al. 2006). Later, similar findings were obtained with the BIODEPTH experiment in Europe [Spehn et al., 2005] and the Jena experiment [Weigelt et al., 2010]. These experiments and observations evaluate the diversity-stability relationships. Balvanera et al. [2006] carried out an important meta-analysis that shows these relationships extensively with respect to ecosystem services.

1.4

The knowledge gap and this work’s contribution

Numerous attempts to unite community ecology and ecosystem ecology in an integrative framework have been proposed (Loreau 2010; Pavoine and Bonsall 2011; Tilman et al. 2014). A theoretical framework proposed by Pavoine and Bonsall [2011] includes a new definition of biodiversity indexes that can rank ecosystems more effectively for conservation and productivity. This definition integrates not only presences of species but also phylogenetic structures and functional traits. The new framework can represent potential adaptations, competition, mutualism and other intrinsic properties within the community assemblage [Webb et al., 2002]. By adding ecosystem traits, the framework can model in greater detail the response of communities to their physical environment and energetic dynamics in scenarios of drastically changing environments [Steffen et al., 2015]. Although the theoretical model is needed, it has been hard to implement in a global context. Joining ecosystem traits, presence/absence species lists and phylogenetic information is not a trivial task and requires the effort of many people and institutions, as well as infrastructure support. Recently, several institutions have started to make their data repositories public, and attempts to integrate these findings are now possible. Two of these global databases are the Global Biodiversity Information Facility [GBIF Secretariat, 2015] and TRY, a global database of plant traits [Costanza et al., 1997]. There is an increasing stream of new data coming from different sources (e.g. science 2.0 and the Internet of Things [Herther, 2012]) which has exponentially broadened the amount of spatial data. Thanks to advances in computer science and remote sensing technologies, it is now possible to process and analyse enormous amounts of information [Ma et al., 2014].
Although there are some implementations for integrating ecosystem traits and biodiversity studies into a public platform [Hudson et al., 2014] and, independently, an increasing need for aggregating new data into existing ecosystem models [Hartig et al., 2012], there is not yet a single software implementation for representing, analysing and simulating these evolutionary-community-ecosystem objects in a Geographical Information System. Moreover, there is no open source implementation capable of joining other features to this model framework. This is the objective of this work: to develop a unified biodiversity spatial data structure that is flexible enough to fuse information from different sources and types, formal enough to be used in spatial analysis, and capable of exporting outputs to standard open formats, at this first stage.

Definition 1.2 (Data structure). An organization of information, usually in memory, for better algorithm efficiency. It may include redundant information such as length or number of nodes. Examples of data structures are: lists, arrays, matrices, hash tables and trees [Paul E. Black (editor), 2004]. Most data structures have associated algorithms to perform operations, such as search, insert, distance, sum or product, that maintain the properties of the data structure.

1.4.1

Objectives of the research

The need for a novel model of biodiversity capable of integrating evolutionary relationships, known locations of species and functional traits has been presented above. Moreover, spatially nested analysis of community phylogenetic structures has been pointed out as a method for detecting patterns of phylogenetic clustering and anomalies of dispersion across different scales [Webb et al., 2002]. Therefore the model should represent the structures in a nested form through hierarchical scales. The model should be flexible enough to let people frame their research scope by selecting which aspect of biodiversity to analyse. This can be done by implementing filter operations for selecting date ranges, taxonomic groups and custom study areas. The implementation should also guarantee the freedom to study, distribute and modify the code.

1.5

Project aims

To achieve the objectives, the proposal is first to define a simple data structure that integrates location-known presences of species. This data structure, from here on called Taxonomy, will have several attributes, including the complete taxonomic classification. Each Taxonomy has a spatial area of action, defined as the interior of a closed curve on the Earth’s surface.


1.5.1

Theoretic and Implementation aims

1.5.1.1

First stage: The Taxonomy object

1. Develop a Taxonomy data structure for biodiversity studies that takes into account spatial and temporal selections. The temporal range is 1812 to 2013 and the spatial range is the entire globe.
2. The data structure should represent evolutionary histories and common diversity measures that can be interpreted with community ecology theory.
3. The data structure should have formal foundations. By formalizing it, it is expected to behave consistently within a universe of discourse, in a similar way as algebraic structures (e.g. matrices) are consistent with linear mappings in vector spaces⁶. See definition 1.2 (Data structure). (Reached in 2.3.)
4. The data structure should have a metric (distance) definition for comparing the similarity of structures.
5. It should have a way to include information on functional traits, which will bring together ecosystem and community studies.
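The aims listed above can be pictured with a minimal sketch of a taxonomy-like structure. It assumes that each occurrence is a mapping with one name per rank, stores every taxon as its full lineage prefix so that ancestry is preserved, and uses the mean Jaccard distance across ranks as a stand-in metric. The class layout, the helper `occ` and the choice of distance are hypothetical and are not the implementation described in the following chapters.

```python
from dataclasses import dataclass, field

RANKS = ("kingdom", "phylum", "class", "order", "family", "genus", "species")


def occ(*names):
    """Hypothetical helper: build one occurrence from names ordered as RANKS."""
    return dict(zip(RANKS, names))


@dataclass(frozen=True)
class Taxonomy:
    """Set of kingdom-to-species lineages observed in one area and time window."""
    lineages: frozenset = field(default_factory=frozenset)

    @classmethod
    def from_occurrences(cls, occurrences):
        return cls(frozenset(tuple(o[r] for r in RANKS) for o in occurrences))

    def taxa_at(self, rank):
        # A taxon is identified by its lineage prefix, which keeps its ancestors.
        depth = RANKS.index(rank) + 1
        return {lineage[:depth] for lineage in self.lineages}

    def richness(self):
        """Number of distinct taxa at every taxonomic level."""
        return {rank: len(self.taxa_at(rank)) for rank in RANKS}

    def __add__(self, other):  # sum: taxa present in either structure
        return Taxonomy(self.lineages | other.lineages)

    def __sub__(self, other):  # difference: taxa present only in self
        return Taxonomy(self.lineages - other.lineages)

    def distance(self, other):
        # Illustrative metric: Jaccard distance averaged over all ranks.
        total = 0.0
        for rank in RANKS:
            a, b = self.taxa_at(rank), other.taxa_at(rank)
            union = a | b
            total += (1 - len(a & b) / len(union)) if union else 0.0
        return total / len(RANKS)
```

Storing lineage prefixes rather than bare names means that two genera with the same name in different families would still count as different taxa.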

1.5.1.2

Second stage: The GriddedTaxonomy object

The Taxonomy is later used in a two-dimensional array to give a continuous representation of the different taxonomies in a given geographic area. This 2-D array is similar to a raster data structure, with the difference that each entry (cell) is a Taxonomy. The array has been named GriddedTaxonomy, and further methods have been implemented for it, for example Map Algebra and export to OGC formats for certain Taxonomy attributes. These are the aims:

1. Create spatial structures for integrating these data structures in space in a way similar to raster data types.
2. Ability to perform spatial operations like:
   • Intersection, union and difference
   • Common raster operations: focal operations, convolutions (filters), zonal operations.
3. Export attributes and diversity measures to an open spatial data format supported by OGC.

⁶ Therefore the use of mathematical categories [Barr and Wells, 1990]
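A raster-like lattice whose cells hold taxonomic content can be sketched as follows. For brevity each cell holds a plain set of species names as a stand-in for a full Taxonomy, and the lattice is a list of rows over a longitude/latitude bounding box. The function names and the binning rule are assumptions made for illustration only.

```python
def grid_from_points(points, bbox, n_rows, n_cols):
    """Bin (lon, lat, species) points into an n_rows x n_cols lattice of sets.

    bbox is (min_lon, min_lat, max_lon, max_lat); row 0 is the northernmost row.
    Points outside the bounding box are ignored.
    """
    min_lon, min_lat, max_lon, max_lat = bbox
    grid = [[set() for _ in range(n_cols)] for _ in range(n_rows)]
    for lon, lat, species in points:
        if not (min_lon <= lon <= max_lon and min_lat <= lat <= max_lat):
            continue
        # min() keeps points lying exactly on the east/south edge in the last cell
        col = min(int((lon - min_lon) / (max_lon - min_lon) * n_cols), n_cols - 1)
        row = min(int((max_lat - lat) / (max_lat - min_lat) * n_rows), n_rows - 1)
        grid[row][col].add(species)
    return grid


def map_algebra(grid_a, grid_b, op):
    """Apply a set operation cell by cell to two lattices of equal shape."""
    return [[op(a, b) for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(grid_a, grid_b)]


def richness_raster(grid):
    """Derive an ordinary numeric raster (species count per cell)."""
    return [[len(cell) for cell in row] for row in grid]
```

Passing `set.union`, `set.intersection` or `set.difference` as `op` gives the cell-wise union, intersection and difference, and `richness_raster` yields a numeric layer that could then be exported to a standard raster format.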

1.5.1.3

Third stage: The NestedTaxonomy object

Finally, a layer stack of GriddedTaxonomies is defined under the class name NestedTaxonomy. The layer stack is ordered by spatial resolution, going from the coarsest to the finest. The structure follows a Quad-Tree structure [Worboys and Duckham, 2004] in which each level doubles the resolution of the level above, which means that the number of cells (and therefore of taxonomies) is four times bigger. That is, let $N_{tax}(z_i)$ be the number of taxonomies at level $z_i$ of the NestedTaxonomy, with $i \in \{0, 1, \ldots, n\}$. Then

$$N_{tax}(z_i) = 4\,N_{tax}(z_{i-1}) = 4^{i}.$$

Therefore the size of the structure grows exponentially, not polynomially, with the number of levels.
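The growth of the quad-tree stack can be checked numerically with a short sketch. It assumes that level 0 is a single cell, which is what the closed form 4^i implies; the cumulative count over levels 0 to n is then the geometric sum (4^(n+1) - 1) / 3. The function names are illustrative.

```python
def taxonomies_at_level(i):
    """Number of cells (taxonomies) at quad-tree level i, assuming one cell at level 0."""
    return 4 ** i


def total_taxonomies(n):
    """Cumulative number of taxonomies over levels 0..n."""
    return sum(taxonomies_at_level(i) for i in range(n + 1))


if __name__ == "__main__":
    for level in range(6):
        print(level, taxonomies_at_level(level), total_taxonomies(level))
```

Each additional level therefore costs roughly three times as many taxonomies as all coarser levels together, which is why the minimum resolution dominates the processing time.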

1.5.2

Aims: Software engineering and best practices for collaboration

As explained above, the processing time is exponential with respect to the number of scale levels and the chosen minimum resolution. However, the processes are independent: they do not need to wait for the output of another process to be executed (except for the input stream from the database). The processes can be run separately on different computers, giving the possibility of running them concurrently to cover bigger areas. The software has been designed with a decentralized processing framework. This is done with Representational State Transfer (REST), a software architecture style for creating scalable web services in decentralized infrastructures [Richardson and Ruby, 2008]. This framework makes it possible to use well-known web protocols like HTTP to receive, process and respond to a request via a TCP network, e.g. Internet, Ethernet, etc. [Tanenbaum and Wetherall, 2013]. This facilitates scaling the analysis to cover bigger areas on a heterogeneous set of computers connected through the Internet in distant geographical places. The distributed design of the software gives the possibility to collaborate with research groups from different parts of the world. Therefore, to optimize collaboration it is mandatory to focus on the following aims:

1. Implement the software in an easy to understand and popular programming language. (Reached in 2.5.)
2. Include tools for creating and instantiating the data structures in any part of the world at arbitrary resolutions.
3. Include tools for managing, visualizing and exploring the data structures across space.
4. Clear, concise and easy to access documentation.
5. Modular programming and the possibility to integrate other data analysis software (e.g., statistics and Machine Learning software). (Reached in 2.5.)
6. Standard and open input/output formats.
7. Give emphasis to best practice for scientific programming [Wilson et al., 2014] to facilitate improvement by, and the interest of, the scientific community.
8. Distribute the package on a publicly available platform with a version control system.

Almost all of these aims have been satisfied in this work. Each of them is explained in the next chapters.
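The independence of the per-area processes described in this section can be sketched on a single machine. The example splits a bounding box into tiles and processes each tile in its own worker; a local thread pool stands in for the separate computers that would answer HTTP requests in the distributed design. The function names, the tiling scheme and the use of `concurrent.futures` are assumptions made for illustration and do not describe the REST services of this work.

```python
from concurrent.futures import ThreadPoolExecutor


def split_bbox(bbox, n):
    """Split (min_x, min_y, max_x, max_y) into n x n equal tiles, row by row."""
    min_x, min_y, max_x, max_y = bbox
    dx = (max_x - min_x) / n
    dy = (max_y - min_y) / n
    return [(min_x + c * dx, min_y + r * dy, min_x + (c + 1) * dx, min_y + (r + 1) * dy)
            for r in range(n) for c in range(n)]


def process_tile(tile, points):
    """Independent unit of work: species found inside one tile (half-open bounds)."""
    min_x, min_y, max_x, max_y = tile
    return {sp for x, y, sp in points if min_x <= x < max_x and min_y <= y < max_y}


def process_region(bbox, points, n=2, workers=4):
    """Process all tiles concurrently; no tile waits for another tile's output."""
    tiles = split_bbox(bbox, n)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda tile: process_tile(tile, points), tiles))
    return dict(zip(tiles, results))
```

Because every tile only needs its own input, the same decomposition carries over to workers on different machines, with the database as the only shared resource.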

1.5.3

Aims: Case Studies for exploring biodiversity patterns with the tool

Three approaches for studying taxonomic diversity were explored in this dissertation. The objective was to show the implementation of the project’s aims and the potential of the software with simple examples. The activities were the following:

• Explore the intrinsic relationships of the taxonomic objects within a region using the taxonomic ratios within the trees. This includes an unsupervised classification of the regions based on ratio similarities.
• Infer a pseudo presence/absence taxonomic list using the aggregation of species occurrences at different scales. This includes a mapping of similar community assemblages in space.
• Change detection: taxonomic analysis of biodiversity loss using the operations of the algebraic structure.

1.5.4

Research main questions

• Is it possible to develop unified software for integrating biodiversity studies that takes into account scale, space, time, evolutionary histories and ecosystem traits? Is it possible to use this software to find new questions or answer currently open questions in the fields of ecosystem science and community ecology?
• How could new computing methods help improve the knowledge gained from global biodiversity studies?


1.6

Data use and description

1.6.1

The Global Biodiversity Information Facility

In this work I have used the Global Biodiversity Information Facility (GBIF)⁷ as the sole source of information. GBIF was conceived as a clearinghouse for information, as opposed to being a primary data provider [MA, 2007]. The GBIF is an intergovernmental organisation providing an interoperable network of biodiversity databases and information technology. It is the largest global biodiversity database. It is a confederation of the observations of more than 200 natural history institutions, each of which has the duty to protect and store biological information about a part of the Earth [Edwards, 2004]. To date, the GBIF integrates more than 500 000 000 records of species occurrences around the world. Many of them have location features as well as the complete taxonomic classification. The raw data is a CSV file in plain text format. Unzipped, the file is ca. 200 GB. For optimal performance, the data was migrated into a Database Management System (DBMS) with spatial capabilities. The system is PostgreSQL 9.1 with PostGIS 2.x, instantiated in the GeoData Institute’s infrastructure. The project created several auxiliary tables and functions; nevertheless, a single table stores the entire Occurrences data. The table has been optimized using B-tree indexes for numeric fields and the spatial index GIST (Generalized Search Tree) for the spatial column [Obe and Hsu, 2011]. The field names, their data types and the indexes used are described in Table 1.1.

1.6.2

Preprocessing methodology

The data used was a snapshot of the global GBIF database from October 2013 with a total of 415 896 295 occurrences (records) distributed worldwide. As pointed out by Yesson et al. [2007], the data contains hundreds of thousands of unresolved records. For example, there are genera without families and species without a phylum value. In order to have a unique representation per occurrence, only the records with a complete taxonomic rank and a location column were used. The rest were deleted. Table 1.2 shows the records deleted per missing field. Every field of the resulting table is therefore populated, which avoids ambiguities when measuring richness at different taxonomic scales. The total number of remaining records is approximately 333 953 000 occurrences distributed globally (ca. 80.15 %).
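The completeness rule described above can be sketched on in-memory records. The example keeps only the records that have a value for every taxonomic rank and for the location, and tallies the discarded ones by the first missing field. The underscore field names, the function name and the tallying rule are assumptions made for illustration; the sketch does not reproduce the database procedure actually used, nor the counts of Table 1.2.

```python
from collections import Counter

# Hypothetical field spellings, ordered from the highest rank to the location.
REQUIRED_FIELDS = ("kingdom", "phylum", "class", "order", "family", "genus",
                   "scientific_name", "species_id", "latitude", "longitude")


def filter_complete(records):
    """Split records into complete ones and a tally of discarded ones.

    A field counts as missing when it is absent, None or an empty string.
    Numeric zero (e.g. latitude 0.0) is a valid value and is kept.
    """
    kept, dropped = [], Counter()
    for record in records:
        missing = [f for f in REQUIRED_FIELDS if record.get(f) in (None, "")]
        if missing:
            dropped[missing[0]] += 1  # tally by the first missing field
        else:
            kept.append(record)
    return kept, dropped
```

Counting by the first missing field assigns each discarded record to exactly one category, so the per-field counts add up to the total number of deleted records.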

7

http://www.gbif.org/



Table 1.1: Field name, data type and index used for the columns of the GBIF Occurrence table. GIST is the Generalized Search Tree [Kornacker, 2000], B-tree is a balanced search tree index and N.A. means not available.

Field name         Data type                        Index used
identification     integer                          N.A.
dataset id         character                        N.A.
institution code   character                        N.A.
collection code    character                        B-tree
catalog number     character                        B-tree
kingdom            character                        N.A.
kingdom id         integer                          B-tree
phylum             character                        N.A.
phylum id          integer                          B-tree
class              character                        N.A.
class id           integer                          B-tree
order              character                        N.A.
order id           integer                          B-tree
family             character                        N.A.
family id          integer                          B-tree
genus              character                        N.A.
genus id           integer                          B-tree
scientific name    character                        N.A.
species id         integer                          B-tree
latitude           double precision                 GIST
longitude          double precision                 GIST
year               integer                          B-tree
month              integer                          B-tree
event date         timestamp
geom               geometry (Point EPSG:4326)       GIST

Table 1.2: Records deleted during preprocessing. All records lacking one of the following fields were erased. Each taxonomic level counts the records without a defined name and id, e.g. genus = {genus name, genus id}.

Without field value   Counts        Perc.
kingdom               0             0 %
phylum                1 631 710     0.39 %
class                 223 998       0.05 %
order                 397 845       0.09 %
family                253 387       0.06 %
genus                 772 239       0.18 %
species name          3 769 970     0.90 %
species id            23 239 300    5.58 %
geometry              52 238 100    12.56 %
Total deleted:        82 526 549    19.84 %

Chapter 2

Models description and software implementation

Mathematics is the backbone of modern science and a remarkably efficient source of new concepts and tools to understand the "reality" in which we participate. – Alain Connes

The aim of this research is to contribute to the understanding of the relationships between species assemblages and biodiversity. This has been addressed by developing a tool that integrates information from species presence data, geographical location, the time-stamp of data acquisition and taxonomic classification. The model relies heavily on the theory of evolution and has spatial components aggregated at several spatial scales.

2.1

Species as the atomic unit of study

Since the appearance of formal biological studies (see Appendix A), several definitions of species have been proposed, depending on the object of study and the scope of the research. This has raised epistemological debates that cannot be settled with a single general definition. In this work the Biological Species Concept is used, with the restriction of taxonomic classification.

Definition 2.1 (Biological Species). The following definitions are equivalent:
• Groups of actually or potentially interbreeding natural populations which are reproductively isolated from other such groups (Mayr 1940).
• An inclusive Mendelian population; it is integrated by the bonds of sexual reproduction and parentage (Dobzhansky and Dobzhansky 1970: 354).
• A species is a group of interbreeding natural populations that is reproductively isolated from other such groups (Mayr and Ashlock 1991).

Note that the concept of species is strongly biased by the data used. In practice it is based on natural history museum records from around the world (see the section on the data used and GBIF, page 15). Therefore, a more restrictive definition should be used in order to support further arguments on evolution and ecology.

Definition 2.2 (Taxonomic concept of species). '... a species consists of all the specimens which are, or would be, considered by a particular taxonomist to be members of a single kind as shown by the evidence or the assumption that they are as alike as their offspring or their hereditary relatives within a few generations. When there is no evidence of the hereditary relationship, the taxonomist will rely on distinctions that have been found to be effective in segregating species among other groups'. (Blackwelder 1967: 164)

2.2

Foundations

In order to give the model a formal footing, it is necessary to build a clear framework based on assumptions, hypotheses and axioms. In this work, an assumption is any limiting fact that depends on the reality of the problem, for instance any issue concerning the quality of the input data (e.g. bias in sampling, erroneous classification, non-existent data, etc.). An axiom is any proposition that cannot be proven because: i) it is self-evident (e.g. life on Earth has a spatial component), or ii) there is a scientific theory¹ that supports it as valid and, for the purposes of this work, it is not necessary to prove it (e.g. all organisms have a common ancestor). A hypothesis is any proposition that will be proven in this work. The hypotheses make use of the axioms, assumptions, other hypotheses and results from other works, namely references.

2.2.1

Model axioms

The model uses evolutionary relationships to show different aspects of the diversity of life. To embed these relationships into evolutionary biological theory, the model uses the following theoretical assumptions, disregarding the quality of the data used².

Axiom 2.3 (Common Ancestor). Every two living beings (organisms) on Earth are phylogenetically related, that is, there is a relationship of ancestry between them. This relationship implies that in the past they belonged to the same species.

This guarantees that every pair of organisms is comparable in the evolutionary sense. The axiom can be adapted to other types of relationships between organisms that preserve a comparison or partial order. An example is trophic relationships, in which case the hierarchical structure would represent a food web.

Axiom 2.4 (Uniqueness of Common Ancestor). There is only one most recent common ancestor for every two organisms.

This axiom guarantees that it is not possible to have two or more immediate common ancestors. In this sense speciation, the evolutionary process in which a new species arises, is a bifurcation from a pre-existing species. This axiom prevents ambiguity in lineages, meaning that a species has one and only one lineage (or taxonomic chain).

Axiom 2.5 (Last Universal Ancestor (LUA)). There is a universal ancestor of all living species on Earth (from now on called the biosphere). This is the most recent organism from which all living organisms on Earth descend³.

This axiom states that there is a minimal element in the set of all organisms and, as such, every organism on Earth is related to the same primordial living being.

Definition 2.6 (Taxonomic relationship). The hierarchical ordering of kingdom, phylum, class, order, family, genus and species is based on the natural system (see Appendix A). If this order acts on the entire set of species on Earth (the biosphere), with the inclusion of the LUA (Axiom 2.5), it defines a partially ordered set⁴.

A consequence of being a partially ordered set is that, for every species s, there exists a unique chain of ordered elements that joins s with a genus gn, a family f, ..., a kingdom k. For example, the species Homo sapiens (L. 1758) has the ordered chain: H. sapiens ≤ Homo ≤ Hominidae ≤ Primates ≤ Mammalia ≤ Chordata ≤ Animalia.

Every partially ordered set P can be considered as a small category⁵ whose objects are the elements of P and whose set of morphisms H(a, b), with a, b ∈ P, consists of one element if a ≤ b and is empty otherwise. This means that any set A ⊆ biosphere is partially ordered, and the taxonomic order induces a hierarchical acyclic network in which the objects are elements of the set of all species, all genera, all families, all orders, all classes, all phyla and all kingdoms in A⁶. The edges (links) are the morphisms (→) that associate immediate taxonomic membership, e.g. Homo sapiens (L. 1758) ≤ Homo ⇒ H. sapiens → Homo and, in plants for example, Solanales → Magnoliopsida.

1 e.g., the Theory of Evolution.
2 Hence the choice of naming them Axioms and not Assumptions.
3 It is estimated that this organism could have lived in the Paleoarchean Era (3.8-3.5 billion years ago) [Glansdorff et al., 2008].
4 Ergo, the biosphere is a partially ordered set. For a formal definition see: L.A. Skornyakov (originator) [2014].
5 For a definition see: M.Sh. Tsalenko (originator) [2011].
6 i.e. ⋃_{λ ∈ {sp, gn, ..., kng}} biosphere/λ
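The taxonomic order of Definition 2.6 can be illustrated with a small Python sketch. It stores only the immediate membership links (the edges of the network) and derives the chain and the order relation from them. The dictionary PARENT and the functions chain and leq are names introduced for this illustration; they are not part of the software described in this work.

```python
# Immediate taxonomic membership (child -> parent), i.e. the edges of the network.
# Only the two examples used in the text are encoded.
PARENT = {
    "Homo sapiens": "Homo",
    "Homo": "Hominidae",
    "Hominidae": "Primates",
    "Primates": "Mammalia",
    "Mammalia": "Chordata",
    "Chordata": "Animalia",
    "Solanales": "Magnoliopsida",
}


def chain(taxon):
    """Return the unique chain that joins a taxon with its known ancestors."""
    path = [taxon]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def leq(a, b):
    """True when a <= b, i.e. b lies on the chain that starts at a."""
    return b in chain(a)


print(" <= ".join(chain("Homo sapiens")))
print(leq("Homo", "Mammalia"), leq("Mammalia", "Homo"), leq("Solanales", "Primates"))
```

The last call shows why the order is only partial: Solanales and Primates are not comparable in either direction.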



This is the link between systematic taxonomy and evolutionary relationships. Classification is a continuous process performed by experts in different fields and it is always subject to modification.

Axiom 2.7 (Spatial autocorrelation). Life happens on Earth and therefore has a spatial component that satisfies Tobler's first law of geography: life in one place is related to life in another place, but nearby organisms are more related than organisms that are distant in geographical space.

This gives an interesting feature to the data structure model, in the sense that spatial autocorrelation properties can be analysed.

Axiom 2.8 (Life is conspicuous). Life is conspicuous over the whole surface of the Earth, meaning that any area⁷ on the surface of the Earth contains a non-empty set of living beings.

2.2.2

Model Data Assumptions

A model represents an abstraction of reality [Smith and Smith, 2007] and therefore involves some generalizations and details that are considered valid or not influential. The model uses as its main input the GBIF database with global range, described on page 15. The model assumptions for the GBIF data are:
• The GBIF database is valid for the axioms proposed in section 2.2.1. The data extracted from it can be used in the model, as can any other data sample with the same features. For example, a systematic sample can give a better representation of species communities in space.
• The classification (taxonomy) of occurrences in the database is accurate and preserves phylogenetic relationships (common ancestry).
• The analysis does not resolve synonymous species: different names for the same species are considered different species.
• There is no explicit information on any common sampling frame. The database is a fusion of many data samples acquired and preserved by many researchers throughout history. It includes historic information, citizen science and volunteer contributions, and therefore the sampling method is heterogeneous.
• There is a latent law of large numbers in the sample, which implies that, on average, the presence, richness and other biodiversity measures are similar to those of the natural populations.

7 In practice the area is constrained to the size given by the sample.


2.3

Mathematical definitions

2.3.1

Network

Definition 2.9 (Graph or Network). Let V(G) be a set and E(G) ⊆ V(G) × V(G). A graph G is a duple given by (V(G), E(G)). V(G) is the set of vertices of the graph and E(G) is the set of edges. An example of a graph is drawn in figure 2.1.

Definition 2.10 (Subgraph). Let G be a graph. G′ is a subgraph of G (G′ ⊆ G) if and only if V(G′) ⊆ V(G) and E(G′) ⊆ E(G).

Definition 2.11 (Connected and acyclic graph). If for every u, v ∈ V(G) there exists a path that connects them, then G is said to be connected. If that path is unique for every u, v, then G is acyclic (without cycles).

Definition 2.12 (Tree). A graph T which is connected and acyclic is called a Tree. An example is shown in figure 2.2.

Definition 2.13 (Subtree). Let T be a tree. A subtree T′ is a subgraph of T that is also a tree (i.e. it is connected and contains no cycles).
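As an illustration of Definitions 2.9 to 2.12, the following Python sketch tests whether a finite undirected graph is a tree. It uses the standard fact that a connected graph with |E| = |V| − 1 edges has no cycles. The function names are introduced here for illustration, and the two vertex and edge sets are those of the example graph and the example tree shown in the accompanying figure; edges are assumed to be listed once and without self-loops.

```python
from collections import defaultdict, deque


def is_connected(vertices, edges):
    """Breadth-first search: True when every vertex is reachable from the first one."""
    vertices = set(vertices)
    if not vertices:
        return True
    adjacency = defaultdict(set)
    for u, v in edges:  # edges are treated as undirected
        adjacency[u].add(v)
        adjacency[v].add(u)
    start = next(iter(vertices))
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen == vertices


def is_tree(vertices, edges):
    """A finite graph is a tree iff it is connected and has |V| - 1 edges."""
    return is_connected(vertices, edges) and len(edges) == len(set(vertices)) - 1


graph_vertices = ["a", "b", "c", "d", "e"]
graph_edges = [("a", "b"), ("b", "c"), ("e", "c"), ("e", "a"), ("c", "d"), ("a", "c")]

tree_vertices = [1, 2, 3, 4, 5, 6]
tree_edges = [(4, 1), (4, 2), (4, 3), (4, 5), (5, 6)]

print(is_tree(graph_vertices, graph_edges))  # connected, but it contains cycles
print(is_tree(tree_vertices, tree_edges))
```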

2.3.2

Algebraic operations

The following definitions give algebraic structure to the model, meaning that it is possible to take the sum and the difference of taxonomic tree structures in a way similar to how integers, real numbers or matrices operate arithmetically. When the Taxonomy object is extended to a regular grid the operations are preserved; in this sense the grid has properties similar to those of a raster when performing Map Algebra. Examples of this will be shown in Chapter ??.

Definition 2.14 (Semigroup). Let T be a set and m : T × T → T an associative binary operation⁸. The duple (T, m) is called a semigroup and T is called the underlying set of the semigroup. In this work, for s, t ∈ T, m(s, t) is written s + t and called the sum if m is defined as Sum, or s − t if m is defined as Difference.

Definition 2.15 (Identity element). Let T be a semigroup and e ∈ T. e is called an identity element if and only if te = et = t for all t ∈ T. There can be at most one identity element in a semigroup.

Definition 2.16 (Monoid). A monoid is a semigroup with an identity element. We will see that the taxonomic trees form a monoid under the sum operator and under the difference operator.

8 Meaning that if t, p, q ∈ T then m(m(t, p), q) = m(t, m(p, q)).



(a) Example of a simple graph with V(G) = {a, b, c, d, e} and E(G) = {(a, b), (b, c), (e, c), (e, a), (c, d), (a, c)}.
(b) Example of a tree with V(G) = {1, 2, 3, 4, 5, 6} and E(G) = {(4, 1), (4, 2), (4, 3), (4, 5), (5, 6)}. There are no cycles.

2.3.3

Some immediate consequences

Lemma 2.17. There is a unique Taxonomic Tree of all life on Earth. This tree is called the Tree of Life.

Proof. All organisms have a Common Ancestor (Axiom 2.3), so it is possible to build taxonomic relationships based on this comparison. The uniqueness of this common ancestor (Axiom 2.4) and the existence of the LUA (Axiom 2.5) imply that: i) there is just one path that connects any pair of species (vertices) and ii) the graph is connected.

Lemma 2.18 (Local Tree). For any area on Earth it is possible to derive a unique Taxonomic Tree.

Proof. Because Life is Conspicuous (Axiom 2.8), it is possible to find organisms in any place. By the Common Ancestor axiom and the Taxonomic Relationship (Definition 2.6) it is possible to build a taxonomic hierarchy among the organisms within that place. Because of the LUA axiom there is only one tree that represents these taxonomic/ancestry relationships.

Proposition 2.19. For a given area⁹ on Earth, the taxonomic tree derived from it is a subtree of the Tree of Life.

Proof. Let T be the Tree of Life and T(A) the local tree in the area A ⊆ Earth. T(A) is a tree because of Lemma 2.18. T(A) is based on the same taxonomy, given by the species in A (which are leaves in the tree); therefore all the edges of T(A) are in T. The species in A are a subset of all the species on Earth; otherwise the Earth would not be the Earth and there would exist another, greater set that could be called Earth.

Corollary 2.20. If A = Earth then T(A) = Tree of Life.

Proof. Let A = Earth. This implies that all species in A are on Earth and vice versa, so V(T(Earth)) = V(Tree of Life), and the taxonomic chain (path) of every vertex in V(T(Earth)) is the same as in V(Tree of Life) because it is unique. Therefore Tree of Life = T(Earth).

9 Any open set contained in the surface of the Earth. The Earth can be considered as a compact surface embedded in ℝ³.

2.3.3.1

Taxonomic Tree operations

The Taxonomic Tree (Taxonomy.forest) is a Tree data structure, a small category and a partially ordered set. It is possible to define arithmetic operations among trees that give algebraic structure to the Taxonomy object. These operations are monoids (Definition 2.16) and, as such, the result of operating on two Taxonomic Trees is another Taxonomic Tree, opening new ways of modelling these data structures.

Definition 2.21 (The Sum ⊕). Let A and B be two Taxonomic Trees. The sum is defined as the taxa that are in A or in B. In set notation:

A ⊕ B = {s | s ∈ A ∨ s ∈ B}

Definition 2.22 (The Difference ⊖). Let A and B be two Taxonomic Trees. The difference is defined as the taxa that are in A but not in B. In set notation:

A ⊖ B = {s | s ∈ A ∧ s ∉ B}

Note that ⊖ is not commutative. In general, for any two Taxonomic Trees A and B: A ⊖ B ≠ B ⊖ A. For example, if A = {a, b, c, d} and B = {b, c, d, e} then A ⊖ B = {a} and B ⊖ A = {e}.
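The two definitions can be checked on the example above with Python sets. This is a minimal sketch in which a Taxonomic Tree is reduced to its set of taxa; the function names tree_sum and tree_difference are introduced for illustration only.

```python
def tree_sum(x, y):
    """Taxa that are in x or in y (Definition 2.21)."""
    return x | y


def tree_difference(x, y):
    """Taxa that are in x but not in y (Definition 2.22)."""
    return x - y


A = frozenset({"a", "b", "c", "d"})
B = frozenset({"b", "c", "d", "e"})

print(sorted(tree_sum(A, B)))
print(sorted(tree_difference(A, B)), sorted(tree_difference(B, A)))
```

The sum gives the same result in either order, whereas swapping the arguments of the difference changes the result.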

The Identity element

An implication of the LUA axiom (2.5) is that all taxonomic trees have one unique root. This root is called the Last Universal Common Ancestor (LUCA) and it is also a Tree data structure, with empty sets of species, genera, families, orders, classes, phyla and kingdoms. Its only element is the common root of life. Let A be a Taxonomic Tree and Id the LUCA tree. It follows that:

A ⊕ Id = {s | s ∈ A ∨ s ∈ Id} = A
A ⊖ Id = {s | s ∈ A ∧ s ∉ Id} = A \ (A ∩ Id) = A

The LUCA tree is the identity and therefore ⊕ and ⊖ are monoids.

2.4

Computational implementation

As said before, the computational model is a data structure (1.2) capable of performing actions/functions. The model has been designed using the object-oriented approach. Three classes instantiate the models: Taxonomy, GriddedTaxonomy and Nested Gridded Taxonomy. Besides the model, some auxiliary tools have been developed for processing, managing, visualizing and handling HTTP requests. The set of all these tools, modules and classes is a software package named BiosPytial (Biodiversity Spatial Analysis in Python). The software uses the GBIF database as its main input data, but it supports geo-processing functions, making it possible to join other spatial data sources with topological and geometrical functions. The following are the main features of BiosPytial.

2.4.1


Object Relational Mapper

For optimal use, the database is hosted in a PostGIS spatial DBMS¹⁰. The implementation has prioritized decentralization, collaborative work and general-purpose analysis. In this sense, depending on a specific interface for data acquisition would constrain the user base to people familiar with that DBMS. This highlights the need to abstract the data source so that it is independent of the software used, leaving this decision to the user. Another problem is that each registry (row) in the GBIF database (Occurrence) corresponds to a single observation and has several features, including space and time components. On the other hand, there exist several open source projects for spatial and geo-processing analysis (e.g., [GDAL Development Team, 2015]), many of which use standard spatial data structures. Converting a table row into a computational object with geospatial features and behaviours is a necessary step in the development of a spatially explicit, object-based model for analysing ecological communities. To solve this it is necessary to map each row one-to-one to an object with attributes defined by the values of each column and with methods (functions/behaviours) implemented in another programming language with geoprocessing capabilities, i.e. a mapping from the table of occurrences in the PostGIS database to the class gbif.Occurrence. The translation of table rows into spatially explicit objects has been done with a programming technique called Object Relational Mapper (ORM), implemented in the Python programming language using the libraries Django¹¹ and GeoDjango¹². Using this technology it is possible to convert a row in the GBIF Occurrence table into an instance of the class Occurrence. The class Occurrence has built-in spatial capabilities for aggregating points (into the multipoint datatype) and for re-projection, among others. The class gbif.Occurrence is shown in Appendix B. The ORM implementation abstracts the DBMS, making it transparent to the user. This means that it is possible to change the technology providing the data source to any other relational database with geo-spatial functions (e.g., SpatiaLite or Oracle Spatial). SpatiaLite is a light DBMS and can be installed locally, making it better suited to analysing smaller datasets. (Margin note: novel feature.)

10 Database Management System.
11 https://docs.djangoproject.com/en/1.8/topics/db/models/
12 https://docs.djangoproject.com/en/dev/ref/contrib/gis/
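The idea of mapping each row to an object can be sketched without Django. The following example uses only the Python standard library (sqlite3 and dataclasses) and is not the Django/GeoDjango implementation used in BiosPytial: the table layout, the class attributes and the helper fetch_occurrences are simplified assumptions made for illustration.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class Occurrence:
    """One object per table row; attribute names mirror the selected columns."""
    id: int
    scientific_name: str
    latitude: float
    longitude: float
    year: int

    def as_wkt(self):
        """A behaviour attached to the object: the location as a WKT point (x = longitude)."""
        return "POINT({} {})".format(self.longitude, self.latitude)


def fetch_occurrences(connection, min_year):
    """Map every row recorded after min_year to an Occurrence instance."""
    cursor = connection.execute(
        "SELECT id, scientific_name, latitude, longitude, year "
        "FROM occurrence WHERE year > ? ORDER BY id",
        (min_year,),
    )
    return [Occurrence(*row) for row in cursor]


# In-memory database with two toy rows (hypothetical values).
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE occurrence (id INTEGER PRIMARY KEY, scientific_name TEXT, "
    "latitude REAL, longitude REAL, year INTEGER)"
)
connection.executemany(
    "INSERT INTO occurrence VALUES (?, ?, ?, ?, ?)",
    [(1, "Homo sapiens", 50.9, -1.4, 2001),
     (2, "Solanum tuberosum", 19.4, -99.1, 1965)],
)

occurrences = fetch_occurrences(connection, 1970)
print(occurrences[0].scientific_name, occurrences[0].as_wkt())
```

Because the calling code only sees Occurrence objects, the database behind fetch_occurrences could be replaced without changing the analysis code, which is the property the ORM provides.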

2.4.2

Distributed architecture

A distributed system is a software system in which components located on networked computers communicate and coordinate their actions by passing messages [Coulouris et al., 2005]. BiosPytial has a built-in web server that listens for HTTP requests and is capable of processing them and sending responses¹³ through a network. The requests can be any processing call, such as: Taxonomy instantiation, information retrieval, loading information from cache, calculating intrinsic complexities (see next sections), creating Shapefiles, statistical modelling, etc. Requests follow the REST architecture [Richardson and Ruby, 2008]. The processing functions are called by requesting a specific URI¹⁴ (or URL), and the parameters are given using GET or POST variables. An example of this methodology is used in the QGIS - BiosPytial visualization¹⁵. Figure 2.1 shows a schematic of one distributed node.
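As a sketch of how a processing call could be expressed as a URI with GET variables, the following Python snippet builds such a request string with the standard library. The host, the path /taxonomy/ and the parameter names gid and level are hypothetical; they are not the actual BiosPytial endpoints.

```python
from urllib.parse import urlencode, urlunsplit


def build_request_url(host, path, params):
    """Compose an HTTP URL whose query string carries the GET variables."""
    query = urlencode(sorted(params.items()))  # sorted for a reproducible order
    return urlunsplit(("http", host, path, query, ""))


# Hypothetical endpoint and parameters, for illustration only.
url = build_request_url("localhost:8000", "/taxonomy/", {"gid": 1, "level": "sp"})
print(url)
```

The resulting string could then be requested by any HTTP client, for instance a GIS front end or another node.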

Figure 2.1: BiosPytial distributed architecture. Blue arrows indicate messages sent through the network using the HTTP protocol. Dashed lines indicate potential I/O streams.

13 To enable BiosPytial in server mode run: python manage.py runserver [local ip/mask: port] in the BiosPytial root directory.
14 Uniform Resource Identifier.
15 See the video demonstration at: https://youtu.be/oqYYUl7ULnE


2.4.3

Concurrent capabilities

The distributed architecture and the independence from the data source provider allow procedures to be called concurrently (at the same time) for different geographic locations. As we will see, the computational complexity of building Nested Gridded Taxonomies is exponential, the spatial resolution being the most determinant factor. However, it is possible to have several instances of BiosPytial working together to process different geographic regions. The software uses the HTTP protocol and can be configured on public servers; therefore, the processes can be executed independently in different places. A schematic of this is shown in figure 2.2.

Figure 2.2: Concurrent process in which the master node (yellow) sends messages through HTTP requests. The other nodes (BiosPytial instances) process them according to the parameters given by the master node.
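The master/worker pattern of figure 2.2 can be sketched with Python's concurrent.futures. In this illustration each worker is a local function instead of a remote BiosPytial instance, and the work is reduced to counting the cells of a regular grid over a bounding box; the region names, extents and cell sizes are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def process_region(region):
    """Stand-in for the work of one node: count the grid cells of a region."""
    name, (xmin, ymin, xmax, ymax), cell_size = region
    columns = int((xmax - xmin) / cell_size)
    rows = int((ymax - ymin) / cell_size)
    return name, columns * rows


# Hypothetical regions: (name, bounding box, cell size).
regions = [
    ("north", (0.0, 5.0, 10.0, 10.0), 1.0),
    ("south", (0.0, 0.0, 10.0, 5.0), 0.5),
]

# The "master" dispatches one task per region and collects the answers.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = dict(pool.map(process_region, regions))

print(results)
```

In a real deployment each call would be an HTTP request to a different node. The toy numbers also show why resolution dominates the cost: halving the cell size multiplies the number of cells of an equal area by four.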

2.5

The Data Structures

The data structures are implemented as classes in the Python programming language. This language was chosen because of its simplicity, its vast number of libraries and its large developer community, especially in science. Python has an enormous number of open source libraries for numerical analysis, database interfaces, web services, visualization, bioinformatics, modelling, simulation, etc. Developing the software in this language and opening it to the research community should attract the attention of researchers in other areas. This will help to establish a collaborative network of people interested in studying biodiversity with a new generation of models and computational tools.

2.5.1

The Taxonomy Class

This class defines the Taxonomy object, which is the basic data structure in the software. The class is defined in the gbif.taxonomy¹⁶ module under the name class Taxonomy. The constructor is the following:

Constructor
class gbif.taxonomy.Taxonomy(biome, geometry='', id=0)

Where: biome is an instance of the class gbif.models.Occurrence, the actual GBIF Occurrence table. The class gbif.models.Occurrence inherits the properties of the ORM library and as such it has spatial geo-processing capabilities as well as methods¹⁷ for filtering objects that satisfy arbitrary conditions. Filters are equivalent to SQL queries in the object-oriented specification¹⁸. Therefore biome can represent any subset of the GBIF database. The geometry parameter is a string that should represent a geospatial polygon (closed curve) in the open standard Well Known Text (WKT)¹⁹. id is an identifier and is expected to be an integer.

16 https://github.com/molgor/biospatial/blob/master/gbif/taxonomy.py
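To make the geometry parameter concrete, the following sketch builds the WKT string of a rectangular polygon from a bounding box. The helper bbox_to_wkt and the coordinates are assumptions introduced for illustration; they are not part of the Taxonomy class.

```python
def bbox_to_wkt(xmin, ymin, xmax, ymax):
    """Return a WKT polygon for a bounding box; the ring is closed on its first corner."""
    corners = [(xmin, ymin), (xmax, ymin), (xmax, ymax), (xmin, ymax), (xmin, ymin)]
    ring = ", ".join("{} {}".format(x, y) for x, y in corners)
    return "POLYGON(({}))".format(ring)


# Hypothetical extent in longitude/latitude.
wkt = bbox_to_wkt(-2, 50, -1, 51)
print(wkt)
```

A string of this form is what the constructor expects as its geometry argument, according to the description above.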

2.5.2

Description

When instantiated, the constructor retrieves all the Occurrences that lie spatially within the interior of the polygon defined by the geometry parameter. The set of all Occurrences is then aggregated for all the taxonomic levels (i.e. Species, Genus, Family, Order, Class, Phylum and Kingdom). Each aggregation uses the identification number (id) as the grouping parameter and returns a list of composed objects (aggregates). Each aggregate represents the collapse of the Occurrences that share the same Species id, Genus id, Family id, Order id, Class id, Phylum id or Kingdom id (depending on the group-by clause). The aggregations are made for all the taxonomic levels and each is stored in its respective attribute. Each of these attributes is a Python dictionary (hash table) with the following key : value pairs:
• points : A collection of the aggregated points.
• ab : The abundance, i.e. the number of occurrences that share a specific id at a given taxonomic level.
• [taxonomic level] : One of specie, genus, family, order, class, phylum or kingdom (depending on the attribute).
• parent id : The id of the immediate ancestor common to all the aggregated objects. This item determines the branching process for building the taxonomic trees.

17 Functions, in object-oriented terminology.
18 What is really happening under the hood is that the method filter induces a generalized SELECT ... WHERE statement in the SQL database.
19 Specification in: http://www.opengeospatial.org/standards/sfa, section 7.
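The aggregation step described above can be sketched in plain Python. This is not the ORM query used by the Taxonomy class: the function aggregate, the dictionary layout of each occurrence and the toy ids are assumptions made for illustration.

```python
def aggregate(occurrences, level, parent_level):
    """Group occurrences by the id of a taxonomic level.

    Each aggregate keeps the points, the abundance ('ab'), the name of the
    taxon and the id of its immediate ancestor ('parent_id').
    """
    groups = {}
    for occ in occurrences:
        key = occ[level + "_id"]
        if key not in groups:
            groups[key] = {
                "points": [],
                "ab": 0,
                level: occ[level],
                "parent_id": occ[parent_level + "_id"],
            }
        groups[key]["points"].append((occ["longitude"], occ["latitude"]))
        groups[key]["ab"] += 1
    return groups


# Toy occurrences with hypothetical ids and coordinates.
occurrences = [
    {"species": "Homo sapiens", "species_id": 10, "genus": "Homo", "genus_id": 5,
     "family_id": 3, "longitude": -1.4, "latitude": 50.9},
    {"species": "Homo sapiens", "species_id": 10, "genus": "Homo", "genus_id": 5,
     "family_id": 3, "longitude": -1.5, "latitude": 50.8},
    {"species": "Pan troglodytes", "species_id": 11, "genus": "Pan", "genus_id": 6,
     "family_id": 3, "longitude": 29.6, "latitude": -4.7},
]

species = aggregate(occurrences, "species", "genus")
genera = aggregate(occurrences, "genus", "family")
print(species[10]["ab"], species[10]["parent_id"], genera[6]["parent_id"])
```

The parent_id stored in each aggregate is what links one level to the next when the tree is built.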


2.5.2.1

Attributes

These are the attributes of the class Taxonomy:

Occurrences : gbif.models.Occurrence
    All the occurrences within the geometry of the biome.
species : gbif.models.Occurrence.aggregated(species)
    All the occurrences aggregated by the relationship of being a member of the species S.
genera : gbif.models.Occurrence.aggregated(genera)
    All the species aggregated by the relationship of being a member of the genus G.
families : gbif.models.Occurrence.aggregated(families)
    All the genera aggregated by the relationship of being a member of the family F.
orders : gbif.models.Occurrence.aggregated(orders)
    All the families aggregated by the relationship of being a member of the order O.
classes : gbif.models.Occurrence.aggregated(classes)
    All the orders aggregated by the relationship of being a member of the class C.
phyla : gbif.models.Occurrence.aggregated(phyla)
    All the classes aggregated by the relationship of being a member of the phylum P.
kingdoms : gbif.models.Occurrence.aggregated(kingdoms)
    All the phyla aggregated by the relationship of being a member of the kingdom K.
richness : dictionary
    A dictionary containing the counts of all the aggregated objects at the different taxonomic scales.
biomeGeometry : geometry
    The geometry inherited from the geometric attribute of the biome.
gid : int
    The identification value inherited from the cell in a mesh that defines the biome (if applicable).
forest : dictionary of trees
    A dictionary that holds the Tree structures at the different taxonomic levels:
    • 'sp' : Local taxonomic tree at species level
    • 'gns' : Local taxonomic tree at genus level
    • 'fam' : Local taxonomic tree at family level
    • 'ord' : Local taxonomic tree at order level
    • 'cls' : Local taxonomic tree at class level
    • 'phy' : Local taxonomic tree at phylum level
    • 'kng' : Local taxonomic tree at kingdom level
    Each local taxonomic tree is an instance of the class ete2.TreeNode(), from a library for building, managing and analysing phylogenetic trees [Huerta-Cepas et al., 2010].
intrinsicM : numpy.Matrix
    The intrinsic matrix is calculated as the change in richness from one taxonomic level to another, giving a 7 × 7 anti-symmetric matrix. Each row (column) represents a taxonomic level. The matrix is calculated only with the information given by the Taxonomy itself, hence the name intrinsic.
vectorIntrinsic : numpy.array / list
    An array derived from a projection into ℝ of intrinsicM at different sub-matrices. The projection is the determinant: the first element is the determinant of the sub-matrix of intrinsicM given by i, j = (1, 2) and the last one is the determinant of the complete intrinsicM.
presences : a presence type
    An object whose attributes are named by the taxonomic level keywords (as in forest). Their values are lists that define the presence or absence of species based on another taxonomy. Useful when comparing with other Taxonomies and in the integer representation.
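The richness attribute and the anti-symmetry of the intrinsic matrix can be illustrated with a short sketch. It assumes that the "change in richness" between two levels is the difference r_i − r_j, which is one reading of the description above and not necessarily the formula implemented in BiosPytial; it also uses plain lists instead of numpy. The three lineages are toy values.

```python
LEVELS = ("species", "genus", "family", "order", "class", "phylum", "kingdom")


def richness(lineages):
    """Count the distinct taxa at every level; each lineage is ordered as LEVELS."""
    return {
        level: len({lineage[i] for lineage in lineages})
        for i, level in enumerate(LEVELS)
    }


def intrinsic_matrix(rich):
    """7 x 7 matrix with entries r_i - r_j (assumed reading of 'change in richness')."""
    values = [rich[level] for level in LEVELS]
    return [[a - b for b in values] for a in values]


# Toy lineages (illustrative classification).
lineages = [
    ("Homo sapiens", "Homo", "Hominidae", "Primates", "Mammalia", "Chordata", "Animalia"),
    ("Pan troglodytes", "Pan", "Hominidae", "Primates", "Mammalia", "Chordata", "Animalia"),
    ("Solanum tuberosum", "Solanum", "Solanaceae", "Solanales", "Magnoliopsida",
     "Tracheophyta", "Plantae"),
]

rich = richness(lineages)
matrix = intrinsic_matrix(rich)
print(rich)
print(matrix[0][2], matrix[2][0])
```

With this definition the matrix is anti-symmetric by construction: the entry for (species, family) is the negative of the entry for (family, species), and the diagonal is zero.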

2.5.2.2

Methods

calculateRichness

This method calculates the richness (number of different taxa within a taxonomic level) at all levels. Returns a dictionary with the taxonomic levels as keys and the respective richness as values.

calculateIntrinsicComplexity

Calculates the intrinsic matrix (i.e. taxonomic ratios based on richness). Intrinsic Complexity is a variation measure. It refers only to the current tree and not to other external objects or taxonomies.

Returns
intrinsicM : numpy.Matrix

distanceToTree(taxonomy)

This function calculates the distance from this object to the Taxonomy given as input. The method uses the Robinson-Foulds metric, which has been used to compare phylogenetic trees in molecular evolutionary studies. The metric counts the number of species removed and added in order to transform one tree into the other. The algorithm is specified in Robinson and Foulds [1981]. The metric only works for trees at the same taxonomic level. Comparing different levels, e.g. species with phyla, will give unwanted results.

Parameters
taxonomic forest : dictionary of trees
    The dictionary obtained by pruning the taxonomic tree at species level to obtain the other levels.

Returns
Distance : integer
    The distance (number of added and removed taxa) between the two taxonomies.
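A clade-based distance in the spirit of the Robinson-Foulds metric can be sketched for rooted trees as follows. This is not the ete2 routine used by distanceToTree: here a tree is a nested dictionary, every internal node contributes the set of leaves below it (including nodes with a single leaf), and the distance is the number of such sets found in one tree but not in the other. The function names and the two toy trees are assumptions for illustration.

```python
def clades(tree):
    """Return the set of clades of a rooted tree given as nested dictionaries.

    Each clade is the frozenset of leaf names below one internal node.
    Leaves are the keys whose value is an empty dictionary.
    """
    found = set()

    def leaves_below(children):
        names = set()
        for name, subtree in children.items():
            if subtree:  # internal node: record its clade
                below = leaves_below(subtree)
                found.add(frozenset(below))
                names |= below
            else:  # leaf
                names.add(name)
        return names

    leaves_below(tree)
    return found


def clade_distance(tree_a, tree_b):
    """Number of clades present in exactly one of the two trees."""
    return len(clades(tree_a) ^ clades(tree_b))


tree_a = {"Hominidae": {"Homo": {"Homo sapiens": {}}, "Pan": {"Pan troglodytes": {}}}}
tree_b = {"Hominidae": {"Homo": {"Homo sapiens": {}}, "Gorilla": {"Gorilla gorilla": {}}}}

print(clade_distance(tree_a, tree_a), clade_distance(tree_a, tree_b))
```

Both trees must be cut at the same taxonomic level for the comparison to be meaningful, as noted in the description of the method.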

generatePDI

This method calculates the partial diversity index (PDI) based on the taxonomic level and the diversity measure. The Partial Diversity Index is a diversity measure calculated at each taxonomic level.

Parameters
level : string
    The taxonomic level ∈ ['richness', 'abundance', 'relative abundance']

Returns
PDI : float

getFreqs

This method returns a histogram of richness, abundance or relative abundance. It only works in interactive environments because it makes use of the matplotlib.plot.show() function.

Parameters
sel : string
    One of the following:
    • sel : richness. The count data obtained from the richness attribute.
    • sel : abundance. The abundance function depends on the total number of occurrences within that polygon.
    • sel : rel abundance. The relative abundance function depends on the number of the lower taxonomic level.
    The taxonomic level ∈ ['richness', 'abundance', 'relative abundance']

Returns
Image : window
    See figures 2.3a and 2.3b.

Figure 2.3: Histogram of taxonomic composition using (a) relative abundance and (b) richness, obtained with the method getfreqs from the class gbif.taxonomy.Taxonomy.

buildInnerTree

Builds a Tree data structure of the hierarchical taxonomic classification. It uses the ETE2 data type, which includes methods for analysing and visualizing phylogenetic trees.

Parameters
deep : Boolean (flag)
    True means that all the partial trees are built as well. A partial tree is the pruned version at phylum, class, order, ..., or species.
only id : Boolean (flag)
    True (default) means that the full name of the taxa is appended. This is a string and can vary in length. If it is used with big data sets it will impact the amount of memory used because of the heavy load of information.

Returns
Void

cache

This method stores the Taxonomy object in the local cache system. The local cache system runs on a NoSQL key:value database called Redis. It needs a redis-python object that acts as the interface. By default it assigns the key given by the geographical extent, the id and other information generated by the method showInfo() (see²⁰).

Parameters
redis wrapper : redis.StrictRedis object
    The Redis connection to be used. It needs to have host, database and port specified (the default values should work in many situations).
key : string
    The key used for storing in the Redis backend. If none, the output of the ShowId method is used.
refresh : Boolean
    If true, the value stored under the given key is overwritten if it already exists. If false and the key/object exists, nothing is stored in the cache.

Returns
Void

20 https://github.com/molgor/biospatial/blob/master/gbif/taxonomy.py
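The refresh semantics described above can be sketched as follows. To keep the example self-contained, the Redis server is replaced by a small in-memory stand-in that exposes the get, set and exists calls of a Redis client; the functions cache and load_from_cache, the use of pickle for serialisation and the returned Boolean are assumptions made for illustration and do not reproduce the actual methods.

```python
import pickle


class InMemoryStore:
    """Stand-in for a Redis client, limited to get/set/exists."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

    def exists(self, key):
        return key in self._data


def cache(store, key, obj, refresh=False):
    """Store obj under key; an existing value is kept unless refresh is True.

    Returns True when something was written (a convenience of this sketch).
    """
    if store.exists(key) and not refresh:
        return False
    store.set(key, pickle.dumps(obj))
    return True


def load_from_cache(store, key):
    """Return the cached object, or None when the key is unknown."""
    raw = store.get(key)
    return None if raw is None else pickle.loads(raw)


store = InMemoryStore()
first = cache(store, "cell:1", {"sp": 3})
second = cache(store, "cell:1", {"sp": 4})          # ignored: key already exists
kept = load_from_cache(store, "cell:1")
third = cache(store, "cell:1", {"sp": 4}, refresh=True)
updated = load_from_cache(store, "cell:1")
print(first, second, kept, third, updated)
```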

loadFromCache

This method restores the cached values stored in the Redis backend.

Parameters
cached_taxonomy : gbif.taxonomy.Taxonomy object
The taxonomy obtained from the cache backend. The cached object can be created with create_tree_now set to False, which creates the taxonomy without making the calculations for obtaining the tree, giving an empty object with defined GeoQueryValuesSets.

Returns
Void

sum

This method overloads the sum operator (⊕) to add the current Taxonomy to another Taxonomy object, following the description in 2.21.

Parameters
Taxonomy : gbif.taxonomy.Taxonomy object
The taxonomy to sum with.

Returns
Taxonomy : gbif.taxonomy.Taxonomy object
The summed Taxonomic Tree.

diff

This method overloads the difference operator (⊖) to take the difference between the current Taxonomy and another Taxonomy object, following the description in 2.22.

Parameters
Taxonomy : gbif.taxonomy.Taxonomy object
The taxonomy to make the difference with.

Returns
Taxonomy : gbif.taxonomy.Taxonomy object
The difference Taxonomic Tree.

Chapter 2 Models description and software implementation

2.5.2.3

Example

The following example shows how to build a Taxonomy in a given area. Building the object requires the class Occurrence. The class Sketch only holds polygons and is used in this example only to define the area cell.geom. This can be changed to any polygon spatial type, for example a city, a natural protected area, etc.

Example of usage

    from gbif.taxonomy import Taxonomy
    from gbif.models import Occurrence
    from sketches.models import Sketch

    # All GBIF occurrences registered after 1970 (lazy queryset)
    biosphere = Occurrence.objects.filter(year__gt=1970)
    # The polygon with id 1 defines the area of interest
    cell = Sketch.objects.get(id=1)
    # Build the Taxonomy for the occurrences inside that polygon
    tx = Taxonomy(biosphere, cell.geom, cell.id)
    # Show the histogram of taxonomic composition
    tx.getFreqs()

The assignment to biosphere gives an example of how the entire GBIF database can be instantiated and filtered by years greater than 1970 (year__gt). When the Taxonomy is created (the assignment to tx), it uses this filtered version of the database. Therefore, the object will contain the occurrences registered after 1970 and only within the cell with id 1. The call to getFreqs gives a histogram as shown in figure ??. The forest attribute is the taxonomic data structure. It has some visualization tools inherited from the library ETE2. Figure 2.4 shows the visualization of these tree structures at different taxonomic levels.

Figure 2.4: Visualization of the Taxonomic Tree filtered by the parameters described in the example, at (a) species, (b) genus, (c) family, (d) order, (e) class, (f) phylum and (g) kingdom level. As can be seen, the Taxonomic Tree has all the information of all the occurrences within that area. The data structure is an acyclic graph and can therefore be treated mathematically as such.

2.5.3

The GriddedTaxonomy Class

The class GriddedTaxonomy (namespace: gbif.taxonomy) instantiates an object composed of a regular square grid defined in the class mesh.models.Mesh (the mesh module is part of BiosPytial). Each cell in the grid has an instance of Taxonomy; the cell boundary therefore defines its geometry parameter. The constructor is the following:

Constructor

    class gbif.taxonomy.GriddedTaxonomy(biome, mesh,
                                        upper_level_grid_id=0,
                                        generate_tree_now=False,
                                        grid_name='N.A.')

2.5.3.1

Description

biome is a Geoqueryset of GBIF occurrences (the same as in Taxonomy).

mesh is a Mesh object, an instance of the mesh.models.Mesh class.

upper_level_grid_id is a parameter reserved for the id of the parent mesh. For single use this parameter is irrelevant; it becomes important when defining Nested Grids.

generate_tree_now is a boolean flag (default: False) that, when True, generates all the Taxonomic Tree structures. If activated, the function is computationally intensive because it fetches all the objects within the grid. When this flag is False the object remains in a lazy state. If a particular GriddedTaxonomy has been stored in the cache system, the object can be instantiated in this lazy mode and the taxonomic structures loaded with the method loadTaxonomiesFromCache(redis connection).

grid_name is a string value that defines a unique name for the GriddedTaxonomy. Currently the mesh.models.Mesh implementation defines the mesh in the spatial database. The grid name in this case is set to the respective table name stored in the database (e.g. mesh"."mexico_grid64).

Figure 2.5 shows a visualization of the GriddedTaxonomy object. It looks like a regular raster spatial data type, with the difference that each cell has a Taxonomy type defined on it.

Figure 2.5: Visualization of a Gridded Taxonomy showing taxonomic trees at species level for an arbitrary area. Each cell is a Taxonomy data type.


2.5.3.2

Attributes

taxonomies : list of Taxonomies
A list of taxonomies defined under the action of the geometric constraints of each cell in the grid.
extent : numpy.array
The geographical extent of the grid.
area : float
The geographical area covered by the grid.
geometry : geometry
The geometry of the grid (WKB).
grid_name : string
The name of the corresponding table of this grid in the database.
parent_id : int
An id value to define the grid.
dArea : float
The unit area represented by a single cell.
biome : GeoqueryValuesSet
The subset of the GBIF occurrences that is defined globally in the entire grid.

2.5.3.3

Methods

restoreTaxonomiesFromCache(redis_wrapper)

Restores the attributes forest, IntrinsicM and intrinsicVector for each Taxonomy in all the grid cells. This method uses the cache system (Redis backend) by retrieving the serialized values of the taxonomies that correspond to the reference name returned by the method showId() of each Taxonomy.

createShapefile(option='richness', store='out_maps')

This function creates an ESRI Shapefile of the GriddedTaxonomy using an array of selected Taxonomy attributes.

Parameters
option : string
Currently the valid options are:
• richness (default): gives the richness at each taxonomic level within each cell.
• jacobi: gives the determinants of the IntrinsicM and its sub-matrices from the 7x7 to the 2x2. Each layer in the Shapefile is the determinant of one sub-matrix.
store : string (default out_maps)
The relative path in which the Shapefiles are going to be stored. The user running the command needs to have write permission on it.

Returns
A directory with shp, dbf and shx files.

Figure 2.6 shows a map generated from the output of this method.

Figure 2.6: A richness by genera map exported with the method createShapefile for a region in Central Mexico. The richness ranges from high (dark red) to low (light yellow). The resolution of each cell is 5 km. Each cell contains a Taxonomy object.

mergeGeometries

Creates a polygon made by the union of all the cells in the gridded taxonomy.

Returns
Boundary : polygon
The polygon derived from the merging of all cells.

distanceToTree(external_taxonomic_forest)

Calculates the distance from each taxonomy in the grid to an arbitrary taxonomic tree.

Parameters
taxonomic_forest : dictionary of taxonomic trees
The dictionary of taxonomic trees, which uses the keys ['sp','gns','fam','ord','cls','phy','kng'].

Returns
distances : list mapped by taxonomies.

setPresenceAbundanceData(reference_dict)

This method sets the presence attributes in every taxonomy in the grid. It uses a dictionary passed as a parameter, which holds the reference presence/absence data.

Parameters
reference_dict : dictionary of Presences
The dictionary has the keys 'sp','gns','fam','ord','cls','phy','kng'. The associated values are of the Presence data type, which has all the taxonomic levels and extends attributes of the BitArray class.

Returns
Void : Null
Sets the attribute presences in each taxonomy.

summary(attr='raw')

This method gives a dictionary of all the presences and absences of taxa compared with the attribute presence in each taxonomy.

Parameters
attr : string
The options for feature extraction are:
• int : the integer representation
• str : the bit string (String)
• list : the list of bits
• mapping : the mapping that relates id with presence or absence

Returns
The type depends on the selected parameter.

cache(redis_wrapper, key='default')

This method stores all the taxonomies in the cache, each one as a separate key:value entry, using the key parameter as prefix for the names.

Parameters
redis_wrapper : StrictRedis object
This is the Redis connection to be used. It needs to have host, database and port specified.
key : string
The key used for storing in the Redis backend. If None, the output of the ShowId method is used.


The currently implemented caching system cannot store Taxonomies bigger than 500 MB. For storing Taxonomies of this size it is recommended to save the objects as standard binary files using the Python library Pickle.

intrinsicPanel(with_this_list='')

Returns the IntrinsicM of all taxonomies in the form of a 4-d matrix in a Pandas type. Pandas is a software utility for performing statistical analysis.

Pythonic iterators

The pythonic way of iterating, as in list comprehensions, is a useful feature of the language that eases development and improves readability through a natural syntax. The GriddedTaxonomy incorporates this feature by defining iterators on the attribute taxonomies, which is the ordered list of Taxonomy data structures in the grid. Applying a loop over all the taxonomies is as easy as (footnote 22):

Let g_taxonomy be a GriddedTaxonomy object.

Example of iterator over all taxonomies

    for taxonomy in g_taxonomy:
        print taxonomy.richness
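The iterator behaviour described above can be sketched as follows. This is only a minimal illustration, assuming that iteration is delegated to the ordered taxonomies list; the classes TinyTaxonomy and TinyGriddedTaxonomy are hypothetical stand-ins and not the BiosPytial classes.

```python
class TinyTaxonomy(object):
    """Hypothetical stand-in that only holds a richness value."""

    def __init__(self, richness):
        self.richness = richness


class TinyGriddedTaxonomy(object):
    """Hypothetical stand-in for a grid holding one taxonomy per cell."""

    def __init__(self, taxonomies):
        self.taxonomies = list(taxonomies)

    def __iter__(self):
        # Delegate iteration to the ordered list of taxonomies.
        return iter(self.taxonomies)

    def __len__(self):
        return len(self.taxonomies)


grid = TinyGriddedTaxonomy([TinyTaxonomy(r) for r in (12, 0, 7)])
# The grid can now be used directly in loops and list comprehensions.
richness_per_cell = [taxonomy.richness for taxonomy in grid]
print(richness_per_cell)
```

Defining `__iter__` on the container is what allows the plain `for taxonomy in g_taxonomy` syntax without exposing the underlying list.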

sum

This method overloads the sum operator (⊕) to operate on two GriddedTaxonomies that share the same geographical extent. The sum is performed cell-wise, making use of the ⊕ operator defined in Taxonomy.

Parameters
GriddedTaxonomy : gbif.taxonomy.GriddedTaxonomy object
The GriddedTaxonomy to sum with.

Returns
GriddedTaxonomy : gbif.taxonomy.GriddedTaxonomy object
A new Gridded Taxonomy (footnote 23).

diff

This method overloads the difference operator (⊖) to operate on two GriddedTaxonomies that share the same geographical extent. The difference is performed cell-wise, making use of the ⊖ operator defined in Taxonomy.

Parameters
GriddedTaxonomy : gbif.taxonomy.GriddedTaxonomy object
The GriddedTaxonomy to take the difference with.

Returns
GriddedTaxonomy : gbif.taxonomy.GriddedTaxonomy object
A new Gridded Taxonomy (footnote 24).

Footnote 22: This will print on screen the richness of all taxonomies in a GriddedTaxonomy.
Footnote 23: Can contain empty parameters that need to be calculated in later steps.
Footnote 24: Can contain empty parameters that need to be calculated in later steps.


2.5.3.4

Examples

The following example shows how to build a GriddedTaxonomy for an arbitrary mesh (footnote 25). The mesh object is an instance of the class Mesh defined in the module mesh.models. Once the GriddedTaxonomy is instantiated as g_taxonomy, the example uses the method createShapefile to generate the map shown in figure 2.6.

Example of usage

    from gbif.models import Occurrence
    from mesh.models import initMesh
    from gbif.taxonomy import GriddedTaxonomy

    # One object abstracts the entire GBIF database
    biosphere = Occurrence.objects.all()
    mesh = initMesh(Intlevel=9)
    g_taxonomy = GriddedTaxonomy(biosphere, mesh, upper_level_grid_id=9999,
                                 generate_tree_now=False)
    # Export the grid as a Shapefile in the given directory
    g_taxonomy.createShapefile(store='mex_genera_richness')

2.5.4

The Nested Gridded Taxonomy object

This class generates a layer stack of GriddedTaxonomy objects. It is found in the namespace gbif.taxonomy under the name NestedTaxonomy. The class uses an object in the mesh module called NestedMesh, which generates a hierarchical set of regular grids (mesh objects), each doubling the spatial resolution of its immediate predecessor level; i.e. each cell is divided in half along the x axis and the y axis, generating 4 new cells for every cell in the upper level. The building of the NestedMesh is a recursive process which follows a Quad-Tree data structure [Worboys and Duckham, 2004], as shown in figure 2.7.
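The halving scheme described above can be illustrated with a short sketch. It works on plain bounding boxes (xmin, ymin, xmax, ymax) and is written iteratively for brevity; the function names are hypothetical and it does not reproduce the NestedMesh implementation, which stores its grids in the spatial database.

```python
def split_cell(xmin, ymin, xmax, ymax):
    """Split a square cell in half along x and y, giving four child cells."""
    xmid = (xmin + xmax) / 2.0
    ymid = (ymin + ymax) / 2.0
    return [
        (xmin, ymin, xmid, ymid),  # lower left
        (xmid, ymin, xmax, ymid),  # lower right
        (xmin, ymid, xmid, ymax),  # upper left
        (xmid, ymid, xmax, ymax),  # upper right
    ]


def nested_levels(square, n_partitions):
    """Return a list of levels; level 0 contains only the parent square."""
    levels = [[square]]
    for _ in range(n_partitions):
        children = []
        for cell in levels[-1]:
            children.extend(split_cell(*cell))
        levels.append(children)
    return levels


levels = nested_levels((0.0, 0.0, 8.0, 8.0), 3)
cells_per_level = [len(level) for level in levels]
print(cells_per_level)
```

Each entry of `levels` plays the role of one mesh in the stack: the cell side halves and the number of cells quadruples from one level to the next.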

Constructor

    class gbif.taxonomy.NestedTaxonomy(cell_id, biome, start_level=3,
                                       end_level=9, generate_tree_now=False)

2.5.4.1

Description

Instantiating an object of this class requires an enabled spatial context, i.e. a set of previously defined grids (managed by the module mesh) stored in the spatial database. This module has built-in tools for recursively generating the grids needed to build a Nesting Context. The tools are explained later in the section Building Grids. Currently, the implementation used in this work has three Nesting Contexts. One is defined in Central Mexico, from a global extent with a resolution of 5 degrees to a local range with a resolution of 5 km (bottom). The other is a region between Argentina and Brazil with a bottom resolution of 5 km. The third is a global grid with a bottom resolution of 0.5 degrees.

Footnote 25: Defined in the settings.py file.

Figure 2.7: Two schematics of the NestedGriddedTaxonomy: (a) diagram of the quad-tree data structure in the Nested GriddedTaxonomy; (b) spatial layer stack of Gridded Taxonomies. The parent layer has the information of all the taxa within the area but cannot differentiate spatial heterogeneity. The bottom layer contains partial information when compared to the bigger scale but shows the variability across the region.

Note that the number of cells grows exponentially with the number of levels needed. Therefore, generating a global set at 5 km resolution is computationally expensive and its use should be evaluated first (see figure 2.8).
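The growth in the number of cells can be checked with a few lines of arithmetic. The sketch below assumes a single parent cell and that every level quarters each cell of the previous one, as described for the NestedMesh; the function names are illustrative only.

```python
def cells_in_level(depth):
    """Number of cells 'depth' levels below a single parent cell."""
    return 4 ** depth


def cells_in_stack(depth):
    """Total number of cells in all levels from the parent down to 'depth'."""
    return sum(cells_in_level(d) for d in range(depth + 1))


# Levels 3 to 9, as in the constructor defaults: 6 successive partitions.
depth = 9 - 3
bottom_cells = cells_in_level(depth)
all_cells = cells_in_stack(depth)
# Closed form of the geometric series 1 + 4 + ... + 4**depth.
closed_form = (4 ** (depth + 1) - 1) // 3
print(bottom_cells, all_cells, closed_form)
```

For one parent cell, six partitions already give 4096 cells in the bottom level, which is why the number of levels should be chosen with care.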

Figure 2.8: Number of cells in each level for a Nested Gridded Taxonomy

To choose a Nesting Context, select the name of the scales dictionary located in the option MESH_TABLENAMESPACE in the settings.py file. Currently, you'll find two definitions: MEX_SCALES and BRAZ_SCALES. Each scales dictionary contains key : value pairs that correspond to the level number and the name of the table in the spatial database (in this case the PostGIS database). Each table is a mesh at some spatial resolution.

Once the Nesting Context has been determined it is possible to instantiate an object of the class NestedTaxonomy by defining the following parameters (see constructor in: 2.7):

• cell_id : the identification number of the cell defined in the top level.
• biome : the Geoqueryset of GBIF occurrences (the same as in Taxonomy and GriddedTaxonomy).
• start_level : int. The top level defined in the scales dictionary in the Nesting Context.
• end_level : int. The bottom level defined in the scales dictionary in the Nesting Context.
• generate_tree_now : Boolean. When True, generates all the Taxonomic Tree structures. Same function as in GriddedTaxonomy.

2.5.4.2

Attributes

levels : ordered list of GriddedTaxonomy

parent : the GriddedTaxonomy with the highest scale

parent_id : int
Integer (gid) of the parent cell id.

toplevel : int
The id level of the topmost grid.

bottomlevel : int
The id level of the grid with the highest resolution (bottom).

2.5.4.3

Methods

The Integer Representation

The integer representation is a technique for identifying specific arrangements of taxa and measuring the variability of different combinations of species. The method assigns an integer to each taxonomic level. Therefore, there is a specific integer representation for species, genera, families, orders, classes, phyla and kingdoms. The algorithm for assigning the integer value is the following:

1. Start with the parent level.
2. Sort the taxa by id (every taxon has an identification number, e.g. for species the id is the species id) and generate a list composed of the pairs (id, 1). Let this list be named Total Presences.
3. Continue to the next level.
4. Sort the taxa by id and generate a list composed of the pairs (id, p), where p = 1 if the taxon id is in that level's list and in the Total Presences list, and p = 0 if the taxon id is not present in that level's list. This list is called the Presence/Absence list for the j-th taxonomic level.
5. Let $n_j$ be the richness of taxa in the j-th taxonomic level and $i \in \{1, 2, \ldots, n_j\}$. The integer representation of the j-th taxonomic level, $k_{tax-j}$, is given by:

$$k_{tax-j} = \sum_{i=1}^{n_j} p_i 2^i$$

6. Go to step 3 until the level reaches the bottom.
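The steps above can be sketched for a single taxonomic level as follows. This is an illustrative reading of the algorithm, assuming that the position i is the 1-based rank of the taxon id in the sorted Total Presences list; the taxa ids and function names are hypothetical. Python integers have arbitrary precision, so large values are not a problem in the sketch.

```python
def presence_list(reference_ids, level_ids):
    """Pairs (id, p) over the sorted reference ids; p = 1 if the id is in the level."""
    present = set(level_ids)
    return [(taxon_id, 1 if taxon_id in present else 0)
            for taxon_id in sorted(reference_ids)]


def integer_representation(presences):
    """k = sum of p_i * 2**i, with i the 1-based position in the list."""
    return sum(p * 2 ** i for i, (_, p) in enumerate(presences, start=1))


parent_ids = [10, 3, 7]   # hypothetical taxa ids found in the parent level
child_ids = [7]           # hypothetical taxa ids found in a lower level

# Step 2: the parent compared with itself gives the Total Presences list.
total_presences = presence_list(parent_ids, parent_ids)
# Step 4: Presence/Absence list of the lower level against the same ids.
child_presences = presence_list(parent_ids, child_ids)

k_parent = integer_representation(total_presences)
k_child = integer_representation(child_presences)
print(k_parent, k_child)
```

Two cells share the same integer only if they contain the same combination of taxa, which is what makes the value usable as an identifier of an arrangement.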

iterator for NestedGriddedTaxonomy

As with the iterator in GriddedTaxonomy, the iterator defined in this class acts on all the levels. Using these two iterators in a nested loop makes it possible to operate on all the taxonomies in all the levels. For example, printing the richness of all the taxonomies within a NestedTaxonomy is as simple as the following.

Let nested_taxonomy be a NestedGriddedTaxonomy.

Example of iterator over all taxonomies in a NestedGriddedTaxonomy

    for gridded_taxonomy in nested_taxonomy:
        for taxonomy in gridded_taxonomy:
            print taxonomy.richness

setPresenceInLevels

This method sets the Presence/Absence list attribute on all the Taxonomies in all the levels defined in a NestedTaxonomy.

Parameters
external_reference_dic : dictionary of Presences
The dictionary has the keys 'sp','gns','fam','ord','cls','phy','kng'. The associated values are of the Presence data type, which has all the taxonomic levels and extends attributes of the BitArray class. The integers are very big and cannot be represented by the standard Integer class.

Returns
None : Null
Sets the attribute presences in each taxonomy.

cache

This method stores all the Taxonomy objects of all the levels in the NestedGriddedTaxonomy.

Parameters
redis_wrapper : StrictRedis object
The Redis connection to be used. It needs to have host, database and port specified.
key : string
The key prefix used for storing in the Redis backend. If None, the standard output of the ShowId() method is used.

loadFromCache

This method loads the Nested Taxonomy object stored in the cache system backend.

Parameters
redis_wrapper : StrictRedis object
The Redis connection to be used. It needs to have host, database and port specified.
key : string
The key used for retrieving the object from the Redis backend. The default value means that the object is retrieved using the standard tag.

remap

This function transforms the absolute integer representation to a shorter standard integer (8 bits) so that a map can be generated.

2.5.4.4

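Examples

Example of usage

    import redis
    from gbif.models import Occurrence
    from gbif.taxonomy import NestedTaxonomy

    # Connection to the cache system (default host, port and database)
    r = redis.StrictRedis()
    # One object abstracts the entire GBIF database
    biosphere = Occurrence.objects.all()

    nt = NestedTaxonomy(36, biosphere, start_level=3, end_level=9,
                        generate_tree_now=False)
    # Compute and cache every Taxonomy in every level
    for level in nt:
        for taxonomy in level:
            taxonomy.calculateIntrinsicComplexity()
            taxonomy.cache(r, refresh=True)

    # Later, restore the cached taxonomies of each level
    for level in nt:
        level.restoreTaxonomiesFromCache(r)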

2.6

Other implemented tools

BiosPytial is not only a definition of data structures. It also includes tools for using them and for analysing and visualizing biodiversity spatial information. These tools include interfaces to the database, a caching system, a visualization tool for use in QGIS, tools for generating grids, examples and IPython notebooks.

2.6.1

The Cache - No SQL - System

As seen in figure 2.8, the number of cells in a NestedTaxonomy, or even in a GriddedTaxonomy, can easily reach the thousands. As a consequence, building the tree structures and performing other calculations can be computationally intensive. In addition, if the data source is located on a remote server (as in the current case), bottlenecks occur not only during processing but also in data transmission. For example, generating the taxonomic trees in figure 2.6 took 6 hours using a remote server and a regular laptop. To solve this issue, BiosPytial has a built-in caching system that stores the instantiated objects in the key:value NoSQL database Redis (footnote 28). According to the Redis project homepage, it is an open-source, advanced key:value cache and store database, often referred to as a data structure server. Keys can contain strings, hashes, lists, sets, sorted sets and bitmaps. The Taxonomy object is the atomic object stored in Redis. Redis cannot store complex objects, so a process known as serialization is first applied to convert the object into a binary string that can be stored instead. The key of each Taxonomy is composed of the geometric extent in WKT format (string), the Nested Context, the level and a custom name. When the method cache is invoked on a Taxonomy, GriddedTaxonomy or NestedGriddedTaxonomy, it performs the serialization and stores all the information accordingly and in a unique way. In conclusion, the process that takes 6 hours to compute takes only 5 minutes to load from the cache system.
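The serialize-then-store pattern can be sketched as below. The sketch assumes pickle as the serializer and uses a small in-memory class, FakeRedis, as a stand-in for the connection so that it runs without a server; both choices, the helper names and the key layout are assumptions for illustration and not the BiosPytial implementation. With a real server, a redis.StrictRedis() connection exposes the same set, get and exists calls.

```python
import pickle


class FakeRedis(object):
    """In-memory stand-in exposing the set/get/exists calls used below."""

    def __init__(self):
        self._store = {}

    def set(self, key, value):
        self._store[key] = value
        return True

    def get(self, key):
        return self._store.get(key)

    def exists(self, key):
        return key in self._store


def cache_object(connection, key, obj, refresh=False):
    """Serialize obj and store it under key; keep an existing value unless refresh."""
    if connection.exists(key) and not refresh:
        return False
    connection.set(key, pickle.dumps(obj, protocol=2))
    return True


def load_object(connection, key):
    """Return the deserialized object stored under key, or None if it is missing."""
    raw = connection.get(key)
    return None if raw is None else pickle.loads(raw)


r = FakeRedis()
# Hypothetical key layout: extent in WKT, nesting context, level and a custom name.
key = "POLYGON((0 0,1 0,1 1,0 1,0 0))|mex|9|demo"
taxonomy_like = {"richness": {"sp": 12, "gns": 9}}

stored_first = cache_object(r, key, taxonomy_like)
stored_again = cache_object(r, key, {"richness": {}})  # skipped: key already exists
restored = load_object(r, key)
print(stored_first, stored_again, restored)
```

Keeping the refresh flag explicit mirrors the behaviour described for the cache method: an existing entry is only overwritten on request.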

2.6.2

Building Grids

The software makes use of grids and therefore provides the tools to build them. The grids are instantiated by the class Mesh. There is a module called mesh.tools that implements Python wrappers to create new grids and new Nested Contexts.

Footnote 28: http://redis.io

There are two ways of building the grids: either by directly invoking the SQL functions stored in the mesh schema (Postgres host database) or by using the Python wrapper built into the mesh.tools module.

2.6.2.1

Python wrapper for generating meshes

The Python wrapper is recommended for the sake of clarity and ease of use, since it reduces the need to invoke functions written in other programming languages (i.e. SQL).

createGridOnThisSquare

This function creates a grid in the database by making use of the function generategridon(polygon,t name grid division) stored in the database. The function needs a square polygon given as a parameter. If this requirement is not satisfied, the grid will not be regular and its definition may fail.

Parameters
square : Geometry WKT
The geometry. Should be a regular square.
n_partitions_in_x : integer
The number of partitions per axis in the grid with respect to the square parameter. For example, if n_partitions_in_x = 2 the grid will have 4 squares: 2 partitions in the x axis and 2 partitions in the y axis.
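The partition performed on the square can be illustrated in pure Python. The sketch below only shows the geometry of the regular grid using bounding boxes; it is not the SQL function stored in the database, and the function name grid_on_square and its arguments are hypothetical.

```python
def grid_on_square(xmin, ymin, side, n_partitions):
    """Partition a square into n_partitions x n_partitions equal cells.

    Cells are returned row by row as (xmin, ymin, xmax, ymax) tuples.
    """
    step = side / float(n_partitions)
    cells = []
    for row in range(n_partitions):
        for col in range(n_partitions):
            x0 = xmin + col * step
            y0 = ymin + row * step
            cells.append((x0, y0, x0 + step, y0 + step))
    return cells


cells = grid_on_square(0.0, 0.0, 10.0, 2)
print(len(cells), cells[0], cells[-1])
```

If the input polygon were a rectangle instead of a square, the same number of partitions per axis would produce non-square cells, which is why the square requirement matters.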

createRegionalNestedGrid

The function creates all the grids needed for analysing regions with the NestedTaxonomy module, i.e. a Nested Context. The function uses the parent square as the parameter that defines the square region in which the partition starts. The levels are derived by partitioning the top level in half along the x axis and in half along the y axis, giving four squares in the immediate successor. Each level has four times as many cells as the previous level, giving $4^n$ cells in the last level, where n is the number of levels. This structure is a Quad-tree representation.

Parameters
parent_square : geometry
The square geometric object in WKT. Should be a square.
store_prefix : string
The prefix name that the grids are going to have. Could be, for instance, the name of the main region, e.g. mex.
third positional parameter : integer
The number of levels (partitions) to build. The computational complexity of this function is exponential, meaning that a high number of levels leads to a demanding process that could crash the server or the Postgres instance. Always mind the number of levels to generate.

Returns
scales : dictionary
The same as used by the Nested Context: the dictionary of the tables built, e.g.

    scales = {
        8 : 'mesh\".\"braz_grid8a',
        9 : 'mesh\".\"braz_grid16a',
        10 : 'mesh\".\"braz_grid32a',
        11 : 'mesh\".\"braz_grid64a',
        12 : 'mesh\".\"braz_grid128a',
        13 : 'mesh\".\"braz_grid256a',
        14 : 'mesh\".\"braz_grid512a',
        15 : 'mesh\".\"braz_grid1024a',
        16 : 'mesh\".\"braz_grid2048a',
        17 : 'mesh\".\"braz_grid4096a'
    }

Example for generating mesh

Generating Nested Meshes using the mesh.tools wrapper is as simple as:

Generate Nested Grids

    from mesh.tools import createRegionalNestedGrid
    # S is assumed to be an object whose geom attribute is a square polygon
    scales = createRegionalNestedGrid(S.geom.wkt, 'New_Grid', 10)

Functions in PostGIS

The functions are defined and stored as scripts in biospatial/SQL_functions. They can also be found at https://github.com/molgor/biospatial/tree/master/SQL_functions

Guide to generate Nested Context

A complete tutorial can be found in the BiosPytial directory or here: https://github.com/molgor/biospatial/blob/master/Mesh%20generation.ipynb



2.6.3

Exploration tools for QGIS

Making use of the distributed capabilities of the software, the BiosPytial software suite incorporates a web application for visualizing Taxonomy objects in a given grid. The process has been set up in QGIS with the use of actions.

Methodology for visualization in web browser

As seen at the beginning of the chapter, BiosPytial has a small web server that receives and responds to HTTP requests. The visualization web-app can be accessed with the URI:

http://127.0.0.1:8000/getAllTrees/?gid=GID&g_l=Level&names=1

where 127.0.0.1:8000 is the IP address of the server (in this case localhost) listening on port 8000 (default configuration). The GET variables are gid, g_l and names: gid is the id number of the selected cell, g_l is the grid level according to a Nested Context, and names ∈ {0, 1}; if 1, full names are used to identify taxa, if 0, only id numbers.
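As an illustration, the request URI can be assembled programmatically from the three GET variables. The helper tree_viewer_url is hypothetical and not part of BiosPytial; the host and port defaults are the ones given above.

```python
try:
    from urllib.parse import urlencode   # Python 3
except ImportError:
    from urllib import urlencode         # Python 2


def tree_viewer_url(gid, level, names=True, host="127.0.0.1", port=8000):
    """Build the getAllTrees URI for a cell id and a grid level."""
    query = urlencode([("gid", gid), ("g_l", level), ("names", 1 if names else 0)])
    return "http://{0}:{1}/getAllTrees/?{2}".format(host, port, query)


url = tree_viewer_url(gid=36, level=16)
print(url)
```

Passing names=False produces the variant that identifies taxa by id numbers only.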

Methodology for configuring the visualization in QGIS as an action

The visualization can be embedded in QGIS as an action using QtWebKit, the internal web browser (footnote 29). To configure the action, add the following definition (see figure 2.9 for more information).

QGIS - Tree visualizer

    from PyQt4.QtCore import QUrl; from PyQt4.QtWebKit import QWebView;
    myWV = QWebView(None);
    myWV.load(QUrl('http://127.0.0.1:8000/getAllTrees/?gid=[% "gid" %]&g_l=16&names=1'));
    myWV.show()

A screenshot can be seen in figure 2.10. A video available at https://youtu.be/oqYYUl7ULnE shows dynamically how the trees can be explored across scale and space.

2.7

Collaborative implementation

In line with the proposed aims, strong emphasis was placed on sharing methodologies and on collaboration. Well-documented software and code help new developers to continue improving the code. Making it accessible to the public through a revision control system will help the diversification of the project into other tools and services, to build a better interactive / command-line modeller's suite for analysing biodiversity across scales and space.

Footnote 29: A good example of how using Open Source tools opens opportunities for new inventions!

Figure 2.9: Configuring the visualisation action in QGIS

Figure 2.10: Visualisation of the embedded Taxonomy visualizer web-app

2.7.1

Documentation and ipython Notebooks

The software has been widely documented using the Python docstring standard. In addition, the Sphinx (http://sphinx-doc.org) syntax has been used and the documentation is available as a web page hosted at test.holobio.me (see figure 2.11). The documentation web page can be built from the source code. In addition to the documentation, there are some examples written as IPython notebooks. IPython Notebook is an interactive computational environment in which it is possible to combine code, rich text, mathematics, plots and media. The case studies 1 and 2 in Chapter 3 are described using this tool and are available on-line and in the source code (see figure 2.12).


2.7.2

Open source repository

BiosPytial uses free and open source software. The code is distributed under the BSD licence. The software follows best practices for scientific programming and software engineering [Wilson et al., 2014]. It has been built using a distributed revision control system (git) and is currently hosted on GitHub under the name biospatial (https://github.com/molgor/biospatial). Bugs, issues, forks and future improvements can be posted on this site. Making the code accessible to everybody helps collaboration and the build-up of research networks on issues related to understanding global biodiversity patterns and their consequences. The open source methodology can bring new ideas from different users and developers, which gives the project more chances of surviving in the future. The BSD licence guarantees an academic use by requiring users and developers to cite the work in any publication.

Figure 2.11: The BiosPytial documentation website! http://test.holobio.me

Figure 2.12: Spatial Analysis using ipython notebooks https://github.com/molgor/biospatial

Chapter 3

Case studies

This chapter demonstrates the use of the BiosPytial data structures and tools. Both examples instantiate a NestedGriddedTaxonomy object: one in Central Mexico and the other in a small northern region on the border between Argentina and Brazil. The first example makes use of the lazy evaluation of records for filtering by time range and spatial extent, cache loading and grid operations to generate a GriddedTaxonomy of biodiversity loss. The highest resolution layer is exported as a Shapefile and an analysis of inter-specific competition is done. The second example uses the integer representation method to create pseudo-presence/absence lists; the evenness of the different arrangements is analysed, concluding that the proportion of higher taxa follows a uniformly random distribution.

3.1

Biodiversity Loss and Competition analysis using the ⊖ operator

During the last 30 years the region of Central Mexico has suffered extensive land use change as a consequence of population growth, urban planning policies and extensive agricultural activities [Lambin et al., 2001]. Most of the field data has been collected by academic or governmental entities and has been integrated in the National Inventory of Biologic Resources (NIBR) [CONABIO, 2015]. CONABIO is a member of the GBIF consortium and therefore the NIBR is part of its global database. Identifying biodiversity loss and inter-specific competition is an interesting problem that can be addressed using the methods and tools described in BiosPytial. The region extends from the Gulf of Mexico to the Pacific Ocean and includes the five major ecoregions, including the Neotropical and Nearctic macro-ecological biomes. The dataset has approximately 1.2 million occurrences that span three kingdoms: Animalia, Plantae and Fungi. All the information is integrated in the GBIF dataset. A map of the region is shown in figure 3.1.

Figure 3.1: Study region in Central Mexico overlaid with an ecoregional thematic map

3.1.0.1

Biodiversity loss

The biodiversity loss from time $t_1$ to time $t_2$ is defined as the difference in Taxonomies from $t_1$ to $t_2$. In other words:

Definition 3.1. Let $\nabla a$ be an arbitrary area (the interior of a closed polygon), $A_{t_1}(\nabla a)$ a Taxonomy sampled at a certain moment of time $t_1$ in $\nabla a$ and $B_{t_2}(\nabla a)$ a Taxonomy sampled at a different time $t_2$, where $t_1 \leq t_2$. The biodiversity loss between $t_1$ and $t_2$ in $\nabla a$ is a biodiversity measure $m$ of the Taxonomy $C_{t_2 - t_1}(\nabla a)$ calculated as:

$$C_{t_2 - t_1}(\nabla a) = B_{t_2}(\nabla a) \ominus A_{t_1}(\nabla a).$$

Extending this operation cell-wise to an entire GriddedTaxonomy gives a new GriddedTaxonomy generated by the taxa present in $t_1$ but absent in $t_2$. In this example, $m$ is richness and it was used to generate the maps explained in the next subsections.
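The idea of a cell-wise loss can be illustrated with plain sets. The sketch assumes, purely for illustration, that at a single taxonomic level the taxa lost are those present at $t_1$ and absent at $t_2$, and that richness is their count; the actual operator acts on Taxonomy trees, and the function names and data below are hypothetical.

```python
def richness_loss(taxa_t1, taxa_t2):
    """Taxa present at t1 but absent at t2, together with their count."""
    lost = set(taxa_t1) - set(taxa_t2)
    return lost, len(lost)


def gridded_loss(grid_t1, grid_t2):
    """Apply richness_loss cell-wise to two grids that share the same cell ids."""
    if set(grid_t1) != set(grid_t2):
        raise ValueError("both grids must cover the same cells")
    return {cell: richness_loss(grid_t1[cell], grid_t2[cell])[1]
            for cell in grid_t1}


# Hypothetical species sets per cell id for two time windows.
grid_1980 = {1: {"a", "b", "c"}, 2: {"a"}}
grid_2000 = {1: {"a"}, 2: {"a", "d"}}

loss = gridded_loss(grid_1980, grid_2000)
print(loss)
```

Taxa that appear only in the later window (such as "d" in cell 2) do not count as loss, so the measure is not symmetric in the two samples.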

3.1.0.2

Methodology

For the purpose of this example two selections of GriddedTaxonomies were made: one from 1980 to the present and the other from 2000 to the present. A spatial difference using the ⊖ operator was computed to generate the new GriddedTaxonomy. The richness measure was selected and a Shapefile of the entire study region was generated, giving the map in figure 3.3. No other filtering for specific taxa was used; nonetheless, the selection of the biome can be refined beforehand to suit particular needs. The following script shows the lines of code needed to generate the biodiversity loss from 1980 to 2000 in a GriddedTaxonomy, exported as richness in Shapefile format. The explanation is embedded in it in the form of comments (lines with the symbol "#").

Calculating biodiversity loss in a particular cell

# Import necessary modules
from gbif.taxonomy import NestedTaxonomy
from gbif.models import Occurrence
# Country comes from an external data source; its import is not shown in the original listing.

# Create a biosphere. One object abstracts the entire GBIF database!
biosphere = Occurrence.objects.all()

# Filter the database by year (records after 1980)
biosphere_1980 = biosphere.filter(year__gt=1980)
# Records after 2000
biosphere_2000 = biosphere.filter(year__gt=2000)

# Filter terrestrial taxonomies with an external data source
mex = Country.objects.filter(name__contains='exico').get()

# Spatial filtering, only terrestrial in Mexico.
# The filter is applied to each year selection so that the temporal window is kept.
biosphere_1980 = biosphere_1980.filter(geom__intersects=mex.geom)
biosphere_2000 = biosphere_2000.filter(geom__intersects=mex.geom)

# Create Nested taxonomies on the same area with different temporal windows
nt80 = NestedTaxonomy(36, biosphere_1980, start_level=3, end_level=10, generate_tree_now=False)
nt20 = NestedTaxonomy(36, biosphere_2000, start_level=3, end_level=10, generate_tree_now=False)

# Select level 10
l10_80 = nt80.levels[10]
l10_20 = nt20.levels[10]

# Load data from the cache system, otherwise it will take a long time.
# r is the cache connection; its creation is not shown in the original listing.
l10_80.restoreTaxonomiesFromCache(r)
l10_20.restoreTaxonomiesFromCache(r)

# l10_dif is the diversity loss
l10_dif = l10_80 - l10_20

# Generate the Shapefile for visualisation
l10_dif.createShapefile(option='richness', store='b_lost_80-00')

3.1.0.3

Result

Figure 3.2 shows the resulting biodiversity loss as the number of species lost from 1980 to 2000. Note that this figure is only a visualisation; the resulting GriddedTaxonomy holds other information (in the form of taxonomic trees) that could be analysed as well. The next example makes use of other features.

Figure 3.2: Biodiversity loss from 1980 to 2000. Here visualizing loss in species richness

3.1.0.4

Discussion

The biodiversity loss is totally influenced by the sample. An analysis assessing the representativeness of dominant taxa could be useful in determining whether the sample fairly describes the distinct populations. Field studies and validation tests should be done to assess the uncertainty of the methods.

3.1.0.5

Conclusion

The methodology shows how to perform a spatio-temporal biodiversity loss analysis using the BiosPytial software. It shows how to use the caching system to load data, as well as how to select subsets of the GBIF dataset through a single, elegant object (biosphere).

3.1.1

Competition analysis

As explained in the introduction, some authors (Gotelli and Entsminger 2001; Webb et al. 2002; Enquist et al. 2002) have studied the ratio between species richness and genus richness to estimate inter-specific competition. The assertion is based on the premise that recently speciated organisms are, on average, more likely to share a common ecological niche with their sibling species. This can be estimated from the number of distinct genera and the number of distinct species within an ecological community. If a community has a large number of species and a small number of genera, it will face high competition caused by inter-specific niche overlap. Conversely, if a community has roughly the same number of genera as species, it will have little or no competition. The Intrinsic Matrix is the complete description of taxonomic richness ratios for all taxonomic levels. In this example, the genus/species relation was used as a measure of competition. BiosPytial includes some methods to transform the IntrinsicMatrix into a data frame that can be handled by the statistical module Pandas². The data frame can be exported to CSV tables that can be loaded and joined with the richness maps to visualize the results³.

3.1.1.1

Methodology

• The entry (1, 2) of each Taxonomy defined in l10_dif was taken, i.e. the species/genus richness ratio.
• The same process was applied to the rest of the taxonomic ratios in descending order, i.e. genus/family, family/order, order/class, class/phylum and phylum/kingdom.
• A Pandas data frame was built preserving the gid of each Taxonomy, which corresponds to the same gid in the Grid. The gid is the foreign key attribute for joining the data frame to the map.
• The table was joined with the map of biodiversity loss by species richness (figure 3.2).

² http://pandas.pydata.org
³ The methodology can be found as an IPython notebook at https://github.com/molgor/biospatial/blob/master/Spatial%20Analysis%20with%20BiosPytial.ipynb
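The steps above can be sketched with the Python standard library alone. This is an illustrative assumption rather than the notebook's code: the richness counts per cell, the gid values and the loss table are invented, and a list of dictionaries stands in for the Pandas data frame (the same rows could be passed to a data frame constructor).

```python
import csv
import io

# Taxonomic levels ordered from the finest to the coarsest.
LEVELS = ["species", "genus", "family", "order", "class", "phylum", "kingdom"]


def taxonomic_ratios(counts):
    """Return the richness ratio of each level to the level above it."""
    return {
        f"{lower}/{upper}": counts[lower] / counts[upper]
        for lower, upper in zip(LEVELS, LEVELS[1:])
    }


# Hypothetical richness counts per grid cell (gid -> richness by level).
cells = {
    101: {"species": 12, "genus": 6, "family": 4, "order": 4,
          "class": 2, "phylum": 2, "kingdom": 1},
    102: {"species": 5, "genus": 5, "family": 5, "order": 3,
          "class": 3, "phylum": 2, "kingdom": 2},
}

# One row per cell, keeping gid as the key used to join with the map.
rows = [{"gid": gid, **taxonomic_ratios(counts)} for gid, counts in cells.items()]

# Export to CSV (an in-memory buffer is used here instead of a file).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()

# Join with a hypothetical richness-loss table on gid.
loss_by_gid = {101: 3, 102: 0}
joined = [{**row, "richness_loss": loss_by_gid[row["gid"]]} for row in rows]
print(joined[0]["species/genus"], joined[1]["species/genus"])
```

Under the interpretation given in the text, the first cell (twice as many species as genera) would be read as more competitive than the second (one species per genus).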

3.1.1.2

Results

The competition values were divided into five quantile classes representing null or close to null, low, medium, high and extreme competition. For comparison, the Taxonomies for 1980 onwards (figure 3.3) and 2000 onwards (figure 3.4) were used. In all the images, the highest competition is associated with populated urban areas. Nevertheless, figure 3.4 shows a decrease in the competition values. Figure 3.5 is the result of the analysis of change in biodiversity from 1980 to 2000. The results show that in general the change in competition was constant, except in certain spots that are spatially correlated with densely populated areas like Mexico City, Cuernavaca and Veracruz. Blue (negative, low) values indicate places where the competition decreased, and red (positive, high) values indicate places where it increased.

Figure 3.3: Competition using species/genera ratio

3.1.1.3

Discussion

The sample remains one of the most important issues for validating the results. Other sources of competition that are not captured by this simplistic taxonomic ratio index may be occurring and altering the internal species composition. Also, some authors, e.g. [Tilman et al., 2014], contradict this estimation of competition, arguing that redundant ecological niches are important components of ecosystem stability and that highly diverse ecosystems are in general redundant. This analysis broadens the debate about stability, competition and disturbance; experimental and in-field validations are needed.



Figure 3.4: Competition using species/genera ratio

Figure 3.5: Competition

3.1.1.4

Conclusion

Although the estimation of competition introduced here is simplistic, it gives a rough indication of important sites to characterise or study further. BiosPytial has built-in tools that help analyse these ecological processes in an easy, fast, accessible and global platform.



3.1.2

Spatial distribution of community assemblages in Brazil-Argentina using Pseudo-Presence/Absence lists

This example uses the scale aggregation feature implemented by NestedGriddedTaxonomy to represent specific arrangements of taxa in space per unit area. The analysis was made for a region in southern Brazil and northern Argentina with a geographical extent of (-66.09, -22.50, -64.68, -21.09) and an area of 1.97 square degrees. The area covers at least two biomes: tropical rainforest on the east side and temperate rainforest with mountains on the west side. The number of Occurrences available in GBIF for this region is ca. 400,000.

3.1.2.1

Methodology

A NestedTaxonomy was instantiated for the Cell id 10417, starting at level 12 and ending at level 16, using the braz_scales Nested Context (see the settings.py file). The bottom level has a cell area of 0.03. The integer representation method was used to classify arrangements of taxa. To visualize the arrangements, the resulting very large integer⁴ was reclassified to a value in the range (0, 256).
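To make the idea of an integer label per assemblage concrete, the following sketch encodes each set of taxa as a bit mask (one bit per taxon) and then maps the distinct integers to a small range of classes for colouring. The bit-mask encoding and the rank-based reclassification are assumptions made for illustration; the details of the method implemented in NestedGriddedTaxonomy may differ. The taxon names and cell ids are hypothetical.

```python
def assemblage_to_integer(taxa, taxon_index):
    """Encode a set of taxa as an integer with one bit per known taxon."""
    code = 0
    for taxon in taxa:
        code |= 1 << taxon_index[taxon]
    return code


def reclassify(codes, n_classes=256):
    """Map each distinct integer code to a class label in [0, n_classes)."""
    distinct = sorted(set(codes))
    return {code: rank % n_classes for rank, code in enumerate(distinct)}


# Hypothetical pool of taxa; each one is assigned a fixed bit position.
pool = ["Araucaria", "Cedrela", "Ilex", "Podocarpus"]
taxon_index = {name: position for position, name in enumerate(pool)}

# Hypothetical cells (cell id -> taxa present).
cells = {
    "a": {"Araucaria", "Ilex"},
    "b": {"Ilex", "Araucaria"},
    "c": {"Cedrela", "Ilex", "Podocarpus"},
}

codes = {cid: assemblage_to_integer(taxa, taxon_index) for cid, taxa in cells.items()}
labels = reclassify(codes.values())
colour_class = {cid: labels[code] for cid, code in codes.items()}
print(codes, colour_class)
```

Cells with the same assemblage receive the same integer regardless of the order of the taxa. With this encoding a pool of n taxa needs an n-bit integer, which Python handles natively but which explains why the raw values have to be reclassified before mapping; with more than 256 distinct assemblages, several of them would share a class.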

3.1.2.2

Results

The results are shown in figures 3.6 to 3.12. Each pair contains, on the left side, the map of community assemblages and, on the right, a histogram showing the distribution (counts) of communities with the same assemblage of taxa. It can be seen that the higher the taxonomic level, the closer the arrangements are to a uniform random distribution. The pattern shows a change at the family level and a normal-like distribution for phyla.

(a) Map of species assemblages; each color represents a specific arrangement of species. (b) Counts of cells that have the same arrangement of species.

Figure 3.6: Assemblages of species by the method of integer representation

⁴ Bigger than the number of atoms in the known Universe


(a) Map of genus assemblages; each color represents a specific arrangement of genera. (b) Counts of cells that have the same arrangement of genera.

Figure 3.7: Assemblages of genera by the method of integer representation

(a) Map of family assemblages; each color represents a specific arrangement of families. (b) Counts of cells that have the same arrangement of families.

Figure 3.8: Assemblages of families by the method of integer representation

(a) Map of order assemblages; each color represents a specific arrangement of orders. (b) Counts of cells that have the same arrangement of orders.

Figure 3.9: Assemblages of orders by the method of integer representation

3.1.2.3

Discussion

The colors in the maps of assemblages have been assigned randomly and ONLY show specific arrangements of taxa. An alternative method for indexing (and ordering) taxonomic arrangements has been proposed, based on the proportion of taxa through different taxonomic levels. Hopefully it will be implemented in the near future. The new index could give a more comprehensible visualization of the arrangements of species, their abundance and importance to the community. The integer representation will have problems with large areas or communities with high biodiversity (more than 1000 species). A smarter way of gathering distinct Taxonomies across scales, and of generating similar Pseudo-Presence/Absence lists from them, is needed. The next step is to use the Sum operation to up-scale and explore assemblage interpolation.

(a) Map of class assemblages; each color represents a specific arrangement of classes. (b) Counts of cells that have the same arrangement of classes.

Figure 3.10: Assemblages of classes by the method of integer representation

(a) Map of phylum assemblages; each color represents a specific arrangement of phyla. (b) Counts of cells that have the same arrangement of phyla.

Figure 3.11: Assemblages of phyla by the method of integer representation

3.1.2.4

Conclusion

These examples have briefly shown some of the possible applications of BiosPytial. They show that the software simplifies the work of quantifying biodiversity loss or other measurements, and integrates different aspects of biodiversity studies in a distributed system.


(a) Map of kingdom assemblages; each color represents a specific arrangement of kingdoms. (b) Counts of cells that have the same arrangement of kingdoms.

Figure 3.12: Assemblages of kingdoms by the method of integer representation


Chapter 4

Discussions

4.1

The role of ecosystems and community ecology and its implications in this model

Ecosystem stability is positively related to species richness, both within and among functional groups [Wardle et al., 2000]. In general, local and global species losses could threaten the stability of ecosystem productivity and therefore its services to humankind (McCann 2000; Balvanera et al. 2006). There could be redundancy in functional roles and therefore in ecosystem traits, but experiments have shown that functionally redundant species may play an important role in ensuring ecosystem stability when individual species are lost due to environmental changes [Cleland, 2011]. Niche construction generates an adaptive feedback between organisms and their environment, which can yield adaptive regulation of the abiotic environment [Kylafis and Loreau, 2008]. Moreover, when phylogenetic and ecological information is available to include all taxa in a lineage, ancestral character reconstruction of ecological traits and niche use can be examined (Cunningham et al. 1998; Webb et al. 2002). Ecosystem-evolutionary models naturally bridge the gap between the holistic perspective of ecosystem ecology and the mechanistic perspective of community ecology [Tilman et al., 2014]. Still, there are many questions to be answered. For example, the negative effects of human impacts on natural ecosystems have been studied widely, but there is not much information on how managed ecosystems are responding and how to optimize diversity together with ecosystem services (Millennium Ecosystem Assessment 2005; May 2007; Magurran 2013). Although the aim of the project, to join community ecology with ecosystem processes, could not be achieved, it is important to state that by making the spatial, temporal and taxonomic components explicit, the Taxonomy object opens the possibility of linking data in ways that were not possible before. BiosPytial is one of a few new efforts that combine methods from data mining and Big Data technologies to make better predictions of Earth System processes (see [Hartig et al., 2012], [Scheiter et al., 2013] and [Hudson et al., 2014]).

4.1.1

The concept of species and other axiomatic flaws

The species concept is the basic unit in biological classification and in taxonomic ranks. Nevertheless, this concept has been at the center of controversy for decades. Several definitions can be used depending on the research question; Mayden [1997] compiled several definitions of species that can be used instead. The classification (taxonomy) of occurrences in the database may be inaccurate and may not preserve phylogenetic relationships (common ancestry). Assuming that it does is a strong assumption, because there is not enough evidence to support it for all described species. The higher the taxonomic level, the more likely the assumption is to hold, but this depends heavily on each group. The axiom that life is conspicuous over the whole surface of the Earth is also a strong assumption for the model; in practice the area is constrained to the extent given by the sample. Its purpose is to simplify the model by supposing that the living phenomenon is continuous over the surface of the Earth. A more interesting alternative, although more complex to formulate from the mathematical point of view, is the use of a probabilistic space such as the model defined by Amari and Nagaoka [2000], which uses information geometry. The GBIF database can contain synonyms and misclassifications, and the only way to improve the taxonomic accuracy is to use other sources of taxonomic information. Several additional database initiatives with a systematic focus help to bring systematic data to the public. For example, the Tree of Life Web site is an effort to summarize hypotheses of the hierarchy of life in a single tree [Schuh, 2000]. In any case, the use of taxonomic information is important because it associates ecological functional traits and niches. Although this reconstruction may not be well suited to all groups of organisms, it provides a structural approach for including the non-uniform dependence of spatially autocorrelated species. It is possible, however, to use other structures based on other correlations. For example, other phylogenetic models can be substituted using other sources of data (such as molecular, cladistic or phylogeographic analyses).

4.1.2

Computational Complexity

The process for building the NestedGriddedTaxonomy is computationally inefficient and needs to be refactored. On the other hand, the caching system has proved effective, reducing processing time by a factor of 100. This opens a window of opportunity to develop repositories (or collaborative mechanisms) for sharing cache database dumps. This feature also eases parallel computing, since several BiosPytial modules can be instantiated, each with an exact copy of the cache database.

4.1.3

Discussions on methods

The methods diff and sum open a novel approach to analysing biodiversity patterns because the data structures can be handled in the same way as raster data structures. By abstracting the taxonomic structures to this level, it is possible to represent in a unified object the entire distribution of taxa within an area, including not only presences and abundance but also potential presences based on the information at higher scales. A direct application of the diff operation is the study of change within a certain time or even taxonomic group. This opens the possibility of automating and exhaustively analysing patterns of inter-specific dependence by removing certain key organisms and analysing the repercussions on the other assemblages. On the other hand, the sum operation gives Taxonomies the ability to be aggregated, so that spatial analysis can be done dynamically by fusing spatial regions regardless of the scale used, an important aspect when applying spatial filters (convolutions).

The integer representation method in NestedGriddedTaxonomy provides a naive approach to "label" different arrangements of taxa. A set of n taxa has 2^n possible combinations, a number that for a few hundred taxa exceeds the estimated number of atoms in the observable universe. Another method for determining important taxa is needed. An approach using a variation of Principal Component Analysis for discrete, categorical data has been suggested by Dr. Elisabeth Addink (personal communication). This will be a key method for providing novel methods for conservation and optimization of biodiversity and ecosystem stability. The method for aggregating data using the phylogenetic tree data structure is not efficient, and assessing other phylogenetic libraries is recommended.
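The link between the sum operation and spatial filters can be sketched as a focal (moving-window) aggregation. This is a simplified illustration, not BiosPytial code: the sum of Taxonomies is modelled here as a set union of taxon names, the grid is a dictionary keyed by hypothetical (row, column) indices, and the window is fixed at 3x3.

```python
def taxonomy_sum(*taxonomies):
    """Aggregate Taxonomies; here the sum is modelled as a set union."""
    result = set()
    for taxonomy in taxonomies:
        result |= taxonomy
    return result


def focal_sum(grid, n_rows, n_cols):
    """Sum every cell with its neighbours in a 3x3 moving window."""
    out = {}
    for row in range(n_rows):
        for col in range(n_cols):
            window = [
                grid.get((r, c), set())  # cells outside the grid are empty
                for r in range(row - 1, row + 2)
                for c in range(col - 1, col + 2)
            ]
            out[(row, col)] = taxonomy_sum(*window)
    return out


# Hypothetical grid with one row and three columns.
grid = {(0, 0): {"a"}, (0, 1): {"b"}, (0, 2): {"c"}}

smoothed = focal_sum(grid, n_rows=1, n_cols=3)
focal_richness = {cell: len(taxa) for cell, taxa in smoothed.items()}
print(focal_richness)
```

Because the aggregated cell is itself a (reduced) Taxonomy, any measure defined on a single cell, such as richness, can be applied unchanged to the output of the window.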

4.2

About data uncertainty and how to estimate it

Uncertainty is still an open problem for meta-analysis studies. It is difficult to calculate because the data source is a joint effort that fuses different samples from different sources at different time periods. There are issues in classification, in location, and in taxon synonyms or deprecated names. Nevertheless, as expressed by [Wolkovich et al., 2012], presence information holds irreproducible value about nature at a particular time and place. Uncertainty analysis in this sense requires a grounding framework that takes into account uncertainties in the spatial extent, in the taxonomic classification and within the sample.



Considering that most species are rare, rarefaction curves and parametric statistical inference can be used according to the scope of the research framework. Fortunately, as reviewed in the introduction, there is a good body of knowledge on species richness, abundance and the expected number of rare species (see [Andrewartha, 1986]; Magurran 2004 and Magurran and McGill 2011). These methods can be used to calculate expected values and define sampling error ranges in space. For assessing misclassification issues, other sources of information should be included. It is worth mentioning that when studying biodiversity through the lens of a particular aspect of life, the sampling frame is fundamental. Therefore, developing general purpose tools for assessing uncertainty in different aspects can become an exhaustive task. One important consequence of making the code public is that people interested in developing custom tools for assessing uncertainty can contribute by adding their own code or by giving ideas, needs or suggestions for future development. The software presents an abstraction of biodiversity information. The unit is the Occurrence data structure, gbif.models.Occurrence. If a sample has the same attributes (columns), it is possible to import it into BiosPytial and use the features of the software with the uncertainty of the sample provided. In other words, if you are not satisfied with the data provided, you can use your own data: create a table in the PostGIS database with the same structure and overwrite the table name in the settings.py file. GBIF has been assessed as deficient in many of the world's biodiversity hotspots; however, the deficiencies in data coverage can be resolved by an increased application of resources to digitize and publish data throughout these most diverse regions [Yesson et al., 2007].
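As an illustration of the rarefaction curves mentioned above, the sketch below computes the expected number of species in a random subsample of n individuals using the classical individual-based rarefaction formula. It is not part of BiosPytial, and the abundance counts are invented; it only shows how an expected value for richness could be derived from the occurrences of a single cell.

```python
from math import comb


def rarefied_richness(abundances, n):
    """Expected number of species in a random subsample of n individuals.

    Classical rarefaction: E[S_n] = sum_i (1 - C(N - N_i, n) / C(N, n)),
    where N_i is the abundance of species i and N the total abundance.
    """
    total = sum(abundances)
    if not 0 <= n <= total:
        raise ValueError("n must be between 0 and the total number of individuals")
    denominator = comb(total, n)
    return sum(1 - comb(total - count, n) / denominator for count in abundances)


# Hypothetical abundances for seven species in one cell (100 occurrences).
counts = [50, 30, 10, 5, 3, 1, 1]

# Points of the rarefaction curve for increasing subsample sizes.
curve = [(n, rarefied_richness(counts, n)) for n in (1, 10, 25, 50, 100)]
for n, expected in curve:
    print(n, round(expected, 2))
```

The curve rises from one species for a single individual to the observed richness when the whole sample is used; comparing cells at a common subsample size is one way to reduce the effect of unequal sampling effort.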

4.3

Different software used by biodiversity studies

There are several computational tools for analysing biodiversity patterns from a theoretical or practical approach. One of the most popular software packages is EstimateS [Colwell, 2005]. This software is strongly oriented towards statistical methods for assessing uncertainty, expected abundances and species presence. It uses parametric models, Bayesian inference and other tools for modelling biodiversity richness, abundance and evenness. Despite its wide acceptance within the biodiversity research community (more than 5000 citations), the software is not open source and does not provide a way to include hierarchical or non-linear structural relationships. BiosPytial tries to innovate in this field by providing a common framework for modelling organisms in a general purpose scheme, allowing researchers to choose the aspect of biodiversity that they need. The software has also been developed in a modular way in the Python programming language. BiosPytial can incorporate Machine Learning methods, statistical inference models, data mining analysis and agent-based simulation more easily because of the large repository of open-source projects written in Python or with Python wrappers. Many of these libraries have been developed by the academic or technological community and have been rigorously tested.

4.3.0.1

Data included

Another advantage of BiosPytial is that it comes with data included. Although the data has issues of uncertainty and trustworthiness, BiosPytial is one of the few biodiversity software packages that includes locations and presences of species. It is flexible enough to allow external sources to be swapped in, adapting to the data sets of the researcher.

4.3.1

Other open-source projects related to biodiversity studies

There are several open-source projects for biodiversity studies. GitHub, one of the most popular open-source hosting sites, has approximately 176 projects related to biodiversity, a mix of local projects, visualization geoportals and more than 80 empty or non-functional repositories. Within the niche of BiosPytial users there is a popular project called rgbif (a wrapper to the Global Biodiversity Information Facility for R). This software has approximately 790 downloads per month and is among the few software packages available for analysing biodiversity at global scale using GBIF. rgbif¹, nonetheless, only provides two features: i) species lookups by location, name and the rest of the GBIF standard attributes; and ii) simple map-generation functions that give a bitmap image of the distribution of the selected species. It is an extension to the R statistical package and needs an Internet connection. BiosPytial, on the other hand, covers these features through the ORM and its filter capabilities, and enhances the analysis by adding spatial and scale-dependent structures. The structures are rich in operations and carry evolutionary information. The createShapefile method in GriddedTaxonomy can represent not only points but also polygons, richness values, arrangements of taxa, evenness and relative abundances. BiosPytial can be used not only for retrieving information but also for distributed processing. To conclude, BiosPytial covers all the features of rgbif and improves on some of them; it does not need an active Internet connection and, if needed, can export to CSV, a standard input format for R. Nevertheless, rgbif has 7 active developers, has been in active development since 2013 and is part of other software suites for data fusion from other databases, e.g. National Center for Biotechnology Information, Barcode of Life Data Systems (BOLD), National Biodiversity Network (UK), IUCN Red List, etc. A good path to follow would be to detail the differences specifically and find complementarities.

¹ https://ropensci.org/tutorials/rgbif_tutorial.html



4.4

Criticisms to this software and future lines of development

The software is in its early development stages and has a large number of issues and bugs that need to be fixed. One of the main issues is that it has been built using several programming libraries, some of which need to be installed first, e.g. GDAL, GEOS and proj4. It uses other applications that can be computationally expensive for modest computers. For example, depending on the analysis, the cache system can grow to several gigabytes. There are also problems in delivering the GBIF information. Right now it is hosted at the GeoData Institute, but as soon as my student status ends, the connection to the database will end as well. Therefore, a direct connection to the official GBIF repositories or a custom REST API implementation in the GeoData infrastructure should be developed. The current state of BiosPytial is a proto-beta version, and more effort is required to build easy installation tools as well. The library for building taxonomic trees (ETE2) is not well maintained and has several issues, especially for analysing and visualizing trees in a distributed environment. While the NestedGriddedTaxonomy works, there are some major performance issues. It takes a long time to build a fine resolution grid: up to 12 hours of processing to generate the tree structures at 8 nested levels.

4.4.1

Future development work and gaps

The software roadmap should accomplish these goals, in priority order:

• Develop tools for remote acquisition of data from the GBIF infrastructure using their API instead of the independent server at the GeoData Institute. If possible, develop an entry point for data acquisition using the GeoData infrastructure.
• Develop tools for exporting analyses to open data formats like CSV.
• Develop an easy way to override the use of the GBIF database and use arbitrary samples.
• Develop mechanisms for analysing Taxonomies in a cell-wise scheme using moving windows, similar to a convolution or spatial filter.



The current implementation of NestedGriddedTaxonomy is computationally intensive. GeoHash is an efficient method for creating arbitrarily small grids in an abstract way that avoids the explicit construction of grids. This method could make the processes more efficient and remove the dependence on the PostGIS DBMS for handling the grids.
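To show why GeoHash avoids building grids explicitly, the following sketch implements the standard geohash encoding: longitude and latitude intervals are halved alternately and every five bits are written as one base-32 character. The function name and the sample coordinates are chosen for illustration and are not part of BiosPytial.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(latitude, longitude, precision=6):
    """Encode a coordinate as a geohash string with `precision` characters."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bits = 0
    bit_count = 0
    use_longitude = True  # geohash bits alternate, starting with longitude
    while len(chars) < precision:
        rng, value = (lon_range, longitude) if use_longitude else (lat_range, latitude)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1
            rng[0] = mid  # keep the upper half of the interval
        else:
            bits <<= 1
            rng[1] = mid  # keep the lower half of the interval
        use_longitude = not use_longitude
        bit_count += 1
        if bit_count == 5:  # five bits make one base-32 character
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)


# The same occurrence encoded at a coarse and at a fine level.
coarse = geohash_encode(19.43, -99.13, precision=3)
fine = geohash_encode(19.43, -99.13, precision=6)
print(coarse, fine)
```

Each additional character subdivides the parent cell, so the code of a fine cell always starts with the code of the coarser cell that contains it. Nested levels could therefore be obtained by truncating strings and grouping occurrences by prefix, without storing any grid geometry.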

Chapter 5

Conclusions

The main motivation of this work is to help study the relationships between ecosystem science and community ecology using biodiversity as the pivotal concept. Establishing a bridge between these branches of ecology has been considered a knowledge gap, but it has recently attracted attention because of its implications for Earth System dynamics under anthropogenic land cover change. Although some philosophical syntheses, methodological frameworks and models for integrating ecosystem models, community assemblages and spatial analysis have been proposed (Loreau 2010; Isbell 2010; Steffen et al. 2011; Pavoine and Bonsall 2011; Magurran and McGill 2011), their scope has been theoretical or constrained to a particular location or biodiversity aspect. Works like Hartig et al. [2012], Scheiter et al. [2013] and Hudson et al. [2014] have contributed a computational approach that could bring together these two perspectives of ecology in a spatio-temporal context. Nevertheless, those applications have considered neither general purpose analyses that account for distinct ways of measuring diversity, nor the possibility of merging other data sources adapted to the needs of each research field. This work formalizes mathematical concepts and, with them, develops a computational GIS model and some tools for studying assemblages of living beings in space and time, using the taxonomic classification as a hierarchical structure for representing similarities in ecological roles at different spatial scales. The models are developed in a dependent order. First, a basic data structure (Taxonomy) is defined that holds the presence information of the species that occurred in an arbitrary area. By making use of the taxonomic classification, a tree data structure is built to aggregate all the information on genus, family, order, class, phylum and kingdom. Therefore, each Taxonomy object is represented by a tree structure that aggregates information on location and taxonomic group in the same object. The Taxonomy object has built-in methods for analysing, visualizing and even performing algebraic operations.

At another level, the GriddedTaxonomy model couples a grid data structure with the Taxonomy class, in the sense that each cell in the grid is a Taxonomy. Operations like sum, difference and transformation to common spatial data types have been implemented as well. Several grids are defined in a model called NestedGriddedTaxonomy. This object represents the assemblage of species (or other taxa) in space with respect to scale. The model is composed of a parent Taxonomy and a list of levels, which are ordered, nested GriddedTaxonomies packed as a layer stack and using a quad-tree structure based on the scale. This model allows the estimation of potential presences based on the scale. A method for counting distinct assemblages of taxa has been developed, although for now this method only describes specific arrangements (see Discussions). The project has opened a window of opportunity to use innovative programming techniques for analysing large volumes of information. The main data source of this work is the Global Biodiversity Information Facility (GBIF) data collection, an international, global database of taxonomic presences compiled by a consortium of organizations involved in the storage, classification and research of ecological data. The software has been developed following best practices for scientific programming [Wilson et al., 2014] and the code has been released as an open-source project named BiosPytial¹.
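The nested, quad-tree arrangement of levels can be illustrated with a small sketch. It is a simplification under stated assumptions, not the NestedGriddedTaxonomy code: cells are identified by hypothetical (row, column) indices, each Taxonomy is reduced to a set of taxon names, a parent cell is obtained by halving the indices, and a potential presence is assumed to be any taxon recorded in the parent cell but not in the child cell itself.

```python
def parent_cell(cell):
    """In a quad-tree, four child cells (row, col) share one parent."""
    row, col = cell
    return (row // 2, col // 2)


def aggregate_level(fine_grid):
    """Build the next coarser level by merging the taxa of sibling cells."""
    coarse = {}
    for cell, taxa in fine_grid.items():
        coarse.setdefault(parent_cell(cell), set()).update(taxa)
    return coarse


def potential_presences(fine_grid):
    """Taxa seen in the parent cell but not recorded in the child cell."""
    coarse = aggregate_level(fine_grid)
    return {
        cell: coarse[parent_cell(cell)] - taxa
        for cell, taxa in fine_grid.items()
    }


# Hypothetical fine level: four sibling cells plus one cell of a second parent.
fine_level = {
    (0, 0): {"a"},
    (0, 1): {"b"},
    (1, 0): set(),
    (1, 1): {"a", "c"},
    (0, 2): {"d"},
}

coarse_level = aggregate_level(fine_level)
pseudo = potential_presences(fine_level)
print(coarse_level)
print(pseudo)
```

Repeating the aggregation produces the stack of levels, with the top of the stack playing the role of the parent Taxonomy. How potential presences should be weighted or validated is left open here.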

5.1

Aims covered

As described in the project aims (section 1.5.1), almost all the aims at the different stages have been achieved. The Taxonomy object has been developed with the ORM software architecture, allowing explicit filtering of Occurrences by name, location, acquisition date, etc. Its taxonomic tree data structure represents evolutionary histories and groups together taxa that, according to systematic biologists, share a common history. The data structure has mathematical foundations; for instance, it has been proved that the Taxonomy object is a monoid capable of performing algebraic operations. There was no time to link functional trait (ecosystem) information to the Taxonomies, leaving this item as future work. Almost all the aims proposed in the second stage were covered (see 1.5.1.2). It is important to mention that all the spatial data types are implemented following OGC standards and using the GDAL/OGR [GDAL Development Team, 2015] and GEOS libraries. Therefore, all these structures inherit the features, spatial operations and geoprocessing capabilities of the standard spatial data types, such as Union, Intersection and Difference. Total-area gridded operations were implemented by overloading the operators ⊕ and ⊖. Unfortunately, there was no time to implement focal or zonal cell-wise operations.

¹ https://github.com/molgor/biospatial.git



The software was written in the Python programming language, a popular and well documented language. The data source is global and the methods for generating the NestedGriddedTaxonomies can be adapted to cover any place on Earth; therefore the software has a global scope. In addition to the data structures, the software includes other tools for generating grids and for visualizing Taxonomies in an open-source GIS or a normal web browser. BiosPytial has also been developed as a Python module, i.e. it can be imported into other Python programs or libraries. The outputs can be exported to Shapefiles, CSV or other OGC standard formats. The Occurrence table and other data sources can be incorporated into BiosPytial using standard tools for migrating data to the Postgres DBMS.

5.2

Future Research

This work has opened many roads for future research. Some involve a more refined mathematical conceptualization of the framework, others involve analysis and new development tools, and others relate to direct applications in conservation and optimal management of ecosystem services. I consider it important to develop the following:

Explore and develop moving-window analysis to measure Taxonomic Distance across scales. This will help us understand how patterns of β and γ diversity are related taxonomically and in which groups they are more important.

Apply dimensionality reduction methods to the assemblage of species to obtain representative or priority taxa to conserve based on area and not on ecosystem or biome. It has been shown that randomness in the assemblage does not imply higher stability or productivity; nevertheless, there are open questions on why the specific arrangement of species follows a uniformly random distribution.

Autocorrelation analysis of groups of taxa: which groups are generally found together?

Continue formalizing the mathematical structure of Taxonomic operations. Is it possible to define the operations as an algebraic group? A new road for symmetry analysis in biodiversity could be opened by exploring conservation laws on distinct aspects of distinctness in life.

Uncertainty analysis: the sampling is inherently a complex problem because of the mixed sample frames. Methods for measuring uncertainty should be identified and assessed.

Spatial correlation of ecosystem processes. One of the main motivations for studying this topic was the possibility of joining functional traits with taxonomic relationships. This can be achieved by bringing together other sources of data in the same Taxonomy object.

Develop statistical models for interpolating or smoothing Taxonomies in a GriddedTaxonomy.

Correlate important taxa with productivity variables (e.g., remote sensing products) to identify methods for optimizing the pair (biodiversity, productivity).

Develop new methods using standard distributed protocols to potentially scale the process to cover the globe.

Explore how taxonomic networks relate to functional networks. Are there any invariants?

5.3

Epilogue

BiosPytial is an innovative attempt to unify biological information into a complex evolutionary-based data structure capable of scaling to distributed computing for analysing biodiversity patterns and trends at a global scale. The integration of information in these objects allows several aspects of biodiversity to be studied at once. To date, there is no other tool that incorporates all these features in a single solution. The software includes tools for building hierarchical spatial models that can be useful to people outside biodiversity studies. Finally, the project has been released as open-source software to promote collaboration among researchers from different backgrounds to find new answers, define tipping points and assess worldwide biodiversity loss at the dawn of the Anthropocene.

Appendix A

Appendix - The concept of species

The concept of species in biology has been an epistemological issue throughout the history of the science. Nevertheless, it is in most cases the atomic unit in biodiversity studies in ecology and evolutionary biology. It is important to take into account how this concept is treated and addressed.

A.1

Systematics and the natural method for classification

“Systematics is the science of biological classification. It embodies the study of biological diversity and provides a comparative framework to study the historical aspects of evolution” [Schuh, 2000]. Its tasks are: a) describing organisms, b) providing scientific names, c) preserving collections, d) providing classifications, e) characterizing organisms' attributes for identification, and f) investigating evolutionary histories [Schuh, 2000].

A.1.1

Classification

Systematics started with the works of the Swedish botanist and naturalist Carolus Linnaeus: Species Plantarum (1753) and Systema Naturae (natural system), 10th ed. (1758). These works are considered the starting point of modern biological classification. Linnaeus proposed a system for the hierarchical classification of organisms based on morphological and ecological similarities. This natural hierarchy has been recognized at least since the time of Aristotle, but it was not until the formalization of the theory of organic evolution by Alfred R. Wallace and Charles Darwin that natural classification could be considered a hierarchical scheme that reflects our understanding of the ancestry (phylogenetic) relationships among organisms [Schuh, 2000]. There are several schools of systematic classification, and several rules have been derived within each taxonomists' guild (e.g., botany, zoology, mycology). In every case, the principle for classification is to preserve consistency with the natural system and with the other groups of organisms [Schuh, 2000].

A.1.2

Reasons for using the taxonomic levels to represent species on Earth

The Natural Taxonomic System has existed for more than 250 years and, for historical reasons, is still in use. The system is hierarchical and is composed of 7 basic levels 1. Systematic biologists have created, expanded and removed groups (taxa) throughout history to preserve consistency. The scheme includes Kingdom, Phylum, Class, Order, Family, Genus and Species as basic ranks.

The purpose of this work is to derive conclusions about how organisms interact with the environment with respect to their natural history (evolution) and spatial conditions. In this work I made use of the taxonomic-rank relationships to reconstruct an evolutionary structure with explicit spatial data (see the Data Source section, page 15). The work is a proof of concept for implementing a spatio-temporal data structure for biodiversity analysis.

This data structure is designed to be used as a distributed modelling system in which spatial point vectors are aggregated in an ordered set of Grids. Each Grid doubles the resolution of the preceding one, spanning a quad-tree data structure in the scale (z) component (see figure). The grid is composed of a square lattice. Each partition of four points defines a cell. Each cell is defined by a simple polygon spatial data structure. This spatial data structure can be easily exported to the Well Known Text format (standardized by the OGC) or to other formats supported by the GDAL library [GDAL Development Team, 2015]. Within this simple polygon (a square), a spatial join selects the points in the biosphere that lie in its interior. Because every set A ⊆ biosphere preserves the order, A inherits the network of phylogenetic relationships. Therefore, A inherits the structure of a small category. This small category is called Taxonomy and is implemented in biospytial as the class Taxonomy (see documentation), which has several attributes and methods (see documentation).

1 Several authors have incorporated more categories above and below the level of order [McKenna and Bell, 1997], [Schuh, 2000]. This depends on the systematic schools and the taxonomic groups.
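The nested grids, the export of cells to Well Known Text and the spatial join described above can be sketched in plain Python. This is a simplified illustration under stated assumptions and not the biospytial code: the function names (`make_grid`, `cell_to_wkt`, `spatial_join`, `parent`) are hypothetical, coordinates are planar, and cells are treated as half-open squares so that each point falls in exactly one cell.

```python
def make_grid(xmin, ymin, side, level):
    """Return the cells of a square lattice with 2**level cells per side.

    Keys are (level, i, j); values are bounds (x0, y0, x1, y1).
    """
    n = 2 ** level
    step = side / float(n)
    cells = {}
    for i in range(n):
        for j in range(n):
            x0, y0 = xmin + i * step, ymin + j * step
            cells[(level, i, j)] = (x0, y0, x0 + step, y0 + step)
    return cells


def cell_to_wkt(bounds):
    """Write a square cell as a Well Known Text polygon."""
    x0, y0, x1, y1 = bounds
    ring = [(x0, y0), (x1, y0), (x1, y1), (x0, y1), (x0, y0)]
    return "POLYGON ((" + ", ".join("%g %g" % p for p in ring) + "))"


def spatial_join(points, cells):
    """Assign each (x, y, name) point to the cell that contains it."""
    joined = dict((key, []) for key in cells)
    for x, y, name in points:
        for key, (x0, y0, x1, y1) in cells.items():
            if x0 <= x < x1 and y0 <= y < y1:
                joined[key].append(name)
                break
    return joined


def parent(key):
    """Cell of the next coarser grid that contains this cell (quad-tree link)."""
    level, i, j = key
    return (level - 1, i // 2, j // 2)


points = [(0.5, 0.5, "Quercus rugosa"),
          (3.2, 0.1, "Pinus patula"),
          (3.9, 3.9, "Panthera onca")]

grid1 = make_grid(0.0, 0.0, 4.0, 1)  # 2 x 2 cells of side 2
grid2 = make_grid(0.0, 0.0, 4.0, 2)  # 4 x 4 cells of side 1
joined1 = spatial_join(points, grid1)
joined2 = spatial_join(points, grid2)
```

Each level doubles the number of cells per side, so every cell at one level has exactly four children at the next; `parent` follows that link upwards, which is how presences at a fine scale can be aggregated into coarser Taxonomies.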


Using the categorical framework for databases proposed by David Spivak, the identity morphism will be a unique identification number for each Taxonomy in a particular Grid. This id corresponds to a primary key in the relational table (the current implementation is PostGIS 2.3). Django was chosen as the ORM, among other reasons, because it gives the freedom of moving to another DBMS. The software implementation is a Python library. Each point in each Taxonomy represents the presence of a biological occurrence at a certain time. Every point belongs to a biological species. Using the taxonomic rank system, each species belongs to a genus, which belongs to (w.b.t.) a family, w.b.t. an order, w.b.t. a class, w.b.t. a phylum, w.b.t. a kingdom. Before going any further, it is important to define a common ground of truths for the system.
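The chain of ranks above (species, genus, family, order, class, phylum, kingdom) can be turned into a tree with a few lines of Python. The sketch below is an illustration under assumptions, not the biospytial implementation: records are plain dictionaries keyed by rank name, and `build_tree` and `count_taxa` are hypothetical helper names.

```python
RANKS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]


def build_tree(records):
    """Nest occurrence records into a dict-of-dicts tree, one level per rank."""
    root = {}
    for rec in records:
        node = root
        for rank in RANKS:
            # Create the taxon node on first sight, then descend into it.
            node = node.setdefault(rec[rank], {})
    return root


def count_taxa(tree, depth=0, counts=None):
    """Count the distinct taxa found at each rank (richness per level)."""
    if counts is None:
        counts = [0] * len(RANKS)
    for name, children in tree.items():
        counts[depth] += 1
        count_taxa(children, depth + 1, counts)
    return counts


records = [
    {"kingdom": "Plantae", "phylum": "Tracheophyta", "class": "Magnoliopsida",
     "order": "Fagales", "family": "Fagaceae", "genus": "Quercus",
     "species": "Quercus rugosa"},
    {"kingdom": "Plantae", "phylum": "Tracheophyta", "class": "Magnoliopsida",
     "order": "Fagales", "family": "Fagaceae", "genus": "Quercus",
     "species": "Quercus laurina"},
    {"kingdom": "Animalia", "phylum": "Chordata", "class": "Mammalia",
     "order": "Carnivora", "family": "Felidae", "genus": "Panthera",
     "species": "Panthera onca"},
]

tree = build_tree(records)
richness = dict(zip(RANKS, count_taxa(tree)))
```

Two occurrences that share a genus share the whole path from the root down to that genus, so the tree stores each higher taxon once and richness can be read off at any rank.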

Appendix B

Excerpts from BiosPYtial source code

This is the Occurrence class definition. The complete code is in: https://github.com/molgor/biospatial/blob/master/gbif/models.py

    class Occurrence(models.Model):
        """
        .. _gbif.models.occurrece:

        This is the Base class that maps the Occurrence (and further taxonomic aggregates) with the spatial enabled database.
        The current database is built on Postgis.
        It includes the field string length definition for automatic populating the database using a

        Attributes
        ----------
        id : int
            Identification value of each occurrence. Unique to any element of the GBIF dataset.
        dataset_id : int
            Identification of the collection (Currently not used)
        institution_code : int
            Identification for the institution resposible for storing, capturing or recording the oc
        collection_code : int
            Identification of the collection (Currently not used)
        catalog_number : int
            Identification for catalog number
        basis_of_record : int
            Unknown value
        scientific_name : String
            Species name in the binomial nomenclature
        kingdom : String
            Name of the kingdom to whom these occurrence belongs
        phylum : String
            Name of the phylum to whom these occurrence belongs
        _class : String
            Name of the class to whom these occurrence belongs
        _order : String
            Name of the order to whom these occurrence belongs
        family : String
            Name of the family to whom these occurrence belongs
        genus : String
            Name of the genus to whom these occurrence belongs
        specific_epithet : string
            Name of the epithet to whom these occurrence belongs
        kingdom_id : int
            Identification number for the belonging kingdom (indexed).
        phylum_id : int
            Identification number for the belonging phylum (indexed).
        class_id : int
            Identification number for the belonging class (indexed).
        order_id : int
            Identification number for the belonging order (indexed).
        family_id : int
            Identification number for the belonging family (indexed).
        genus_id : int
            Identification number for the belonging genus (indexed).
        species_id : int
            Identification number for the belonging species (indexed).
        country_code : string
            String representing the country's code
        latitude : Float
            Latitude in WGS84 (degrees)
        longitude : Float
            Longitud in WGS84 (degrees)
        year : int
            Year of record
        month : int
            Month of record
        event_date : datetime
            Timestamp of record
        state_province : String
            Name of state or province
        county : String
            Name of country
        geom : Geometric Point
            Geometric Value in WKB
        objects : models.GeoManager()
            Wrapper for GeoDjango
        """
        chars = {'l1': 15, 'l2': 15, 'l3': 25, 'l4': 100, 'l5': 60, 'l6': 70, 'l7': 100}
        id = models.AutoField(primary_key=True, db_column="id_gbif")
        id_gbif = models.IntegerField()
        dataset_id = models.CharField(db_index=True, max_length=chars['l5'], blank=True, null=True)
        institution_code = models.CharField(db_index=True, max_length=chars['l1'], blank=True, null=True)
        collection_code = models.CharField(db_index=True, max_length=chars['l1'], blank=True, null=True)
        catalog_number = models.CharField(db_index=True, max_length=chars['l2'], blank=True, null=True)
        basis_of_record = models.CharField(db_index=True, max_length=chars['l2'], blank=True, null=True)
        scientific_name = models.CharField(db_index=True, max_length=chars['l7'], blank=True, null=True)
        # scientific_name_author = models.CharField(db_index=True, max_length=chars['l4'], blank=True, null=Tr
        # taxon_id = models.IntegerField(blank=True, null=True)
        kingdom = models.CharField(db_index=True, max_length=chars['l2'], blank=True, null=True)
        phylum = models.CharField(db_index=True, max_length=chars['l3'], blank=True, null=True)
        _class = models.CharField(db_index=True, max_length=chars['l3'], blank=True, null=True)
        _order = models.CharField(db_index=True, max_length=chars['l3'], blank=True, null=True)
        family = models.CharField(db_index=True, max_length=chars['l3'], blank=True, null=True)
        genus = models.CharField(db_index=True, max_length=chars['l3'], blank=True, null=True)
        specific_epithet = models.CharField(db_index=True, max_length=chars['l4'], blank=True, null=True)
        kingdom_id = models.IntegerField(db_index=True, blank=True, null=True)
        phylum_id = models.IntegerField(db_index=True, blank=True, null=True)
        class_id = models.IntegerField(db_index=True, blank=True, null=True)
        order_id = models.IntegerField(db_index=True, blank=True, null=True)
        family_id = models.IntegerField(db_index=True, blank=True, null=True)
        genus_id = models.IntegerField(db_index=True, blank=True, null=True)
        species_id = models.IntegerField(db_index=True, blank=True, null=True)
        country_code = models.CharField(db_index=True, max_length=7, blank=True, null=True)
        latitude = models.FloatField(db_index=True, blank=True, null=True)
        longitude = models.FloatField(db_index=True, blank=True, null=True)
        year = models.IntegerField(db_index=True, blank=True, null=True)
        month = models.IntegerField(db_index=True, blank=True, null=True)
        event_date = models.DateTimeField(db_index=True, blank=True, null=True)
        # elevation_in_meters = models.FloatField(db_index=True, blank=True, null=True)
        # depth_in_meters = models.FloatField(db_index=True, blank=True, null=True)
        # verbatim_scientific_name = models.CharField(db_index=True, max_length=chars['l5'], blank=True
        # taxon_rank = models.IntegerField(db_index=True, blank=True, null=True)
        # verbatim_kingdom = models.CharField(db_index=True, max_length=chars['l3'], blank=True, null=T
        # verbatim_phylum = models.CharField(db_index=True, max_length=chars['l3'], blank=True, null=T
        # verbatim_class = models.CharField(db_index=True, max_length=chars['l3'], blank=True, null=Tr
        # verbatim_order = models.CharField(db_index=True, max_length=chars['l3'], blank=True, null=Tr
        # verbatim_genus = models.CharField(db_index=True, max_length=chars['l3'], blank=True, null=Tr
        # verbatim_family = models.CharField(db_index=True, max_length=chars['l3'], blank=True, null=T
        # verbatim_specific_epithet = models.CharField(db_index=True, max_length=chars['l3'], blank=Tru
        # verbatim_infraspecific_epithet = models.CharField(db_index=True, max_length=chars['l3'], bla
        # verbatim_latitude = models.FloatField(db_index=True, blank=True, null=True)
        # verbatim_longitude = models.FloatField(db_index=True, blank=True, null=True)
        # coordinate_precision = models.FloatField(db_index=True, blank=True, null=True)
        # maximum_elevation_in_meters = models.FloatField(db_index=True, blank=True, null=True)
        # minimum_elevation_in_meters = models.FloatField(db_index=True, blank=True, null=True)
        # elevation_precision = models.FloatField(db_index=True, blank=True, null=True)
        # minimum_depth_in_meters = models.FloatField(db_index=True, blank=True, null=True)
        # maximum_depth_in_meters = models.FloatField(db_index=True, blank=True, null=True)
        # depth_precision = models.FloatField(db_index=True, blank=True, null=True)
        # continent_ocean = models.FloatField(db_index=True, blank=True, null=True)
        state_province = models.CharField(db_index=True, max_length=chars['l5'], blank=True, null=True)
        county = models.CharField(db_index=True, max_length=chars['l5'], blank=True, null=True)
        country = models.CharField(db_index=True, max_length=chars['l5'], blank=True, null=True)
        # recorded_by = models.CharField(db_index=True, max_length=chars['l5'], blank=True, null=True)
        # locality = models.CharField(db_index=True, max_length=chars['l6'], blank=True, null=True)
        # verbatim_month = models.IntegerField(db_index=True, blank=True, null=True)
        # verbatim_year = models.IntegerField(db_index=True, blank=True, null=True)
        # day = models.IntegerField(db_index=True, blank=True, null=True)
        # verbatim_basis_of_record = models.CharField(db_index=True, max_length=chars['l4'], blank=True
        # date_identified = models.DateTimeField(db_index=True, blank=True, null=True)
        # identified_by = models.CharField(db_index=True, max_length=chars['l6'], blank=True, null=Tru
        # created = models.DateTimeField(db_index=True, blank=True, null=True)
        geom = models.PointField()
        # modified = models.DateTimeField(db_index=True, blank=True, null=True)
        objects = models.GeoManager()

        class Meta:
            managed = False
            # remote server table name
            db_table = settings.GBIF_DATATABLE
            # db_table = "gbif_occurrence"
            # Local table name
            # db_table = "mexico_gbif_subset"

        def __unicode__(self):
            """
            ..
            String representation of Occurrence

            Returns
            -------
            info : string
                Name
            """
            return u'<GBIF Occurrence: %s scientific_name: %s>\n Kingdom: %s \n,\t Phylum: %s \n,\t\t Orde

        def getfullDescription(self):
            """
            ..
            Retrieves the total description of the fields for the this registry.

            Returns
            -------
            info : string
                The information of all fields. Good for exporting raw data to CSV.
            """
            fields = self._meta.get_all_field_names()
            cadena = ["<GBIF/: Occurrence %s --%s />\n" % (self.id, self.scientific_name)]
            for f in fields:
                c = "\t <%s: %s />\n" % (f, getattr(self, f))
                cadena.append(c)
            return reduce(lambda x, y: x + y, cadena)

References

Shun'ichi Amari and Hiroshi Nagaoka. Methods of information geometry. Translations of mathematical monographs. American Mathematical Society, Providence, RI, 2000. ISBN 0821805312. Translated from the Japanese by Daishi Harada.

Herbert George Andrewartha. The ecological web: more on the distribution and abundance of animals. University of Chicago Press, 1986.

Patricia Balvanera, Claire Kremen, and Miguel Martínez-Ramos. Applying community structure analysis to ecosystem function: examples from pollination and carbon storage. Ecological Applications, 15(1):360–375, 2005.

Patricia Balvanera, Andrea B Pfisterer, Nina Buchmann, Jing-Shen He, Tohru Nakashizuka, David Raffaelli, and Bernhard Schmid. Quantifying the evidence for biodiversity effects on ecosystem functioning and services. Ecology letters, 9(10):1146–1156, 2006.

M. Barr and C. Wells. Category theory for computing science. 1990. ISBN 0-13-120486-6. URL ://INSPEC:3730127.

Timothy G Barraclough, Alfried P Vogler, and Paul H Harvey. Revealing the factors that promote speciation. Philosophical Transactions of the Royal Society B: Biological Sciences, 353(1366):241–249, 1998.

S. D. Baum and I. C. Handoh. Integrating the planetary boundaries and global catastrophic risk paradigms. Ecological Economics, 107:13–21, 2014.

R.E. Blackwelder. Taxonomy: a text and reference book. Wiley, 1967.

J. G. Blake and B. A. Loiselle. Diversity of birds along an elevational gradient in the Cordillera Central, Costa Rica. Auk, 117(3):663–686, 2000. ISSN 0004-8038.

Elsa E. Cleland. Biodiversity and ecosystem stability. Nature Education Knowledge, 3(10):14, 2011. URL http://www.nature.com/scitable/knowledge/library/biodiversity-and-ecosystem-stability-17059965.


Robert K Colwell. Estimates: Statistical estimation of species richness and shared species from samples. 2005.

CONABIO. (Mexican) National Inventory of Biologic Resources. (Mexican) National Commission for the Knowledge and Use of Biodiversity, 2015. URL http://www.conabio.gob.mx.

Alain Connes. A view of mathematics. URL http://alainconnes.org/docs/maths.pdf.

S. Cornell. On the system properties of the planetary boundaries. Ecology and Society, 17(1), 2012.

Robert Costanza, Charles Perrings, and Cutler J. Cleveland. The development of ecological economics. The international library of critical writings in economics. E. Elgar Pub. Co., Cheltenham, UK; Brookfield, VT, 1997.

George F Coulouris, Jean Dollimore, and Tim Kindberg. Distributed systems: concepts and design. Pearson Education, 2005.

Paul J. Crutzen. Geology of mankind. Nature, 415(6867):23–23, 2002.

Clifford W Cunningham, Kevin E Omland, and Todd H Oakley. Reconstructing ancestral character states: a critical reappraisal. Trends in Ecology & Evolution, 13(9):361–366, 1998.

C. Darwin. On the Origin of Species. Oxford World's Classics. OUP Oxford, 1859. ISBN 9780191607677.

W. de Vries, J. Kros, C. Kroeze, and S. P. Seitzinger. Assessing planetary and regional nitrogen boundaries related to food security and adverse environmental impacts. Current Opinion in Environmental Sustainability, 5(3-4):392–402, 2013.

T. Dobzhansky and T.G. Dobzhansky. Genetics of the Evolutionary Process. Columbia University Press, 1970. ISBN 9780231083065.

T. Dobzhansky. Nothing in biology makes sense except in light of evolution. American Biology Teacher, 35(3):125–129, 1973.

James L Edwards. Research and societal benefits of the global biodiversity information facility. BioScience, 54(6):485–486, 2004.

Charles S. Elton. The ecology of invasions by animals and plants. Methuen, London, 1958.

Brian J Enquist, John P Haskell, and Bruce H Tiffney. General patterns of taxonomic and biomass partitioning in extant and fossil plant communities. Nature, 419(6907):610–613, 2002.


J. A. Foley, R. DeFries, G. P. Asner, C. Barford, G. Bonan, S. R. Carpenter, F. S. Chapin, M. T. Coe, G. C. Daily, H. K. Gibbs, J. H. Helkowski, T. Holloway, E. A. Howard, C. J. Kucharik, C. Monfreda, J. A. Patz, I. C. Prentice, N. Ramankutty, and P. K. Snyder. Global consequences of land use. Science, 309(5734):570–574, 2005. ISSN 0036-8075.

Jonathan A Foley, Samuel Levis, I Colin Prentice, David Pollard, and Starley L Thompson. Coupling dynamic models of climate and vegetation. Global Change Biology, 4(5):561–579, 1998.

D. J. Futuyma. The evolution of evolutionary ecology. Israel Journal of Ecology and Evolution, 59(4):172–180, 2013.

Mark R Gardner and W Ross Ashby. Connectance of large dynamic (cybernetic) systems: critical values for stability. Nature, 228:784, 1970.

GBIF Secretariat. Global biodiversity infrastructure, May 2015. URL http://www.gbif.org/participation/participant-list.

GDAL Development Team. GDAL - Geospatial Data Abstraction Library, Version 1.10.0. Open Source Geospatial Foundation, 2015. URL http://www.gdal.org.

N. Glansdorff, Y. Xu, and B. Labedan. The last universal common ancestor: emergence, constitution and genetic legacy of an elusive forerunner. Biol Direct, 3:29, 2008. ISSN 1745-6150. doi: 10.1186/1745-6150-3-29. URL http://www.ncbi.nlm.nih.gov/pubmed/18613974.

Nicholas J Gotelli and GL Entsminger. Ecosim: Null models software for ecology, 2001.

August Grisebach. Die Vegetation der Erde: nach ihrer Klimatischen Anordnung. Ein Abriss der Vergleichenden Geographie der Pflanzen, volume 1. W. Engelmann, 1884.

Florian Hartig, James Dyke, Thomas Hickler, Steven I Higgins, Robert B O'Hara, Simon Scheiter, and Andreas Huth. Connecting dynamic vegetation models to data–an inverse perspective. Journal of Biogeography, 39(12):2240–2252, 2012.

Alex Haxeltine and I Colin Prentice. Biome3: An equilibrium terrestrial biosphere model based on ecophysiological constraints, resource availability, and competition among plant functional types. Global Biogeochemical Cycles, 10(4):693–709, 1996.

Nicholas Heavens. Studying and projecting climate change with earth system models. Nature Education Knowledge, 4(5):4, 2013. URL http://www.nature.com/scitable/knowledge/library/studying-and-projecting-climate-change-with-earth-103087065.


N. K. Herther. 21st-century science: citizen science and science 2.0. Online, 36(6):14–22, 2012.

Marcel Holyoak, Mathew A. Leibold, and Robert D. Holt. Metacommunities: spatial dynamics and ecological communities. University of Chicago Press, Chicago, 2005.

Stephen P Hubbell. The unified neutral theory of biodiversity and biogeography (MPB-32), volume 32. Princeton University Press, 2001.

Lawrence N Hudson, Tim Newbold, Sara Contu, Samantha LL Hill, Igor Lysenko, Adriana De Palma, Helen RP Phillips, Rebecca A Senior, Dominic J Bennett, Hollie Booth, et al. The PREDICTS database: a global database of how local terrestrial biodiversity responds to human impacts. Ecology and evolution, 4(24):4701–4735, 2014.

Jaime Huerta-Cepas, Joaquín Dopazo, and Toni Gabaldón. Ete: a python environment for tree exploration. BMC bioinformatics, 11(1):24, 2010.

Forest Isbell. Causes and consequences of biodiversity declines. Nature Education Knowledge, 3(10):54, 2010. URL http://www.nature.com/scitable/knowledge/library/causes-and-consequences-of-bio-diversity-declines-16132475.

Marcel Kornacker. Access Methods for Next-Generation Database Systems. PhD thesis, Computer Science, University of California at Berkeley, 2000.

C.J. Krebs. Ecology: the experimental analysis of distribution and abundance. Pearson Benjamin Cummings, 2009. ISBN 9780321507433.

Grigoris Kylafis and Michel Loreau. Ecological and evolutionary consequences of niche construction for its agent. Ecology letters, 11(10):1072–1081, 2008.

L.A. Skornyakov (originator). Partially ordered set. Encyclopedia of Mathematics, October, 2014. URL http://www.encyclopediaofmath.org/index.php?title=Partially_ordered_set&oldid=33633.

Eric F Lambin, Bi L Turner, Helmut J Geist, Samuel B Agbola, Arild Angelsen, John W Bruce, Oliver T Coomes, Rodolfo Dirzo, Günther Fischer, Carl Folke, et al. The causes of land-use and land-cover change: moving beyond the myths. Global environmental change, 11(4):261–269, 2001.

Simon Levin. Fragile dominion. Basic Books, 2007.

Michel Loreau. Linking biodiversity and ecosystems: towards a unifying ecological theory. Philosophical Transactions of the Royal Society of London B: Biological Sciences, 365(1537):49–60, 2010.

Michel Loreau and Andy Hector. Partitioning selection and complementarity in biodiversity experiments. Nature, 412(6842):72–76, 2001.


Lane MA. Copenhagen: Global biodiversity information facility. GBIF Strategic and Operational Plans 2007–2011, 1(1), 2007.

Yan Ma, Haiping Wu, Lizhe Wang, Bormin Huang, Rajiv Ranjan, Albert Zomaya, and Wei Jie. Remote sensing big data computing: Challenges and opportunities. Future Generation Computer Systems, 2014. ISSN 0167-739X.

Robert H MacArthur. The theory of island biogeography, volume 1. Princeton University Press, 1967.

A. E. Magurran. Open questions: some unresolved issues in biodiversity. BMC Biology, 11, 2013. ISSN 1741-7007.

Anne E. Magurran. Measuring biological diversity. Blackwell Pub., 2004.

Anne E Magurran and Brian J McGill. Biological diversity: frontiers in measurement and assessment, volume 12. Oxford University Press, Oxford, 2011.

R. M. May. Ecological science and tomorrow's world. Philosophical Transactions of the Royal Society B-Biological Sciences, 365(1537):41–47, 2010.

Robert M May. Unanswered questions and why they matter. Theoretical ecology: principles and applications, pages 205–215, 2007.

Robert McCredie May. Stability and complexity in model ecosystems, volume 6. Princeton University Press, 1973.

Richard L Mayden. A hierarchy of species concepts: the denouement in the saga of the species problem. 1997.

E. Mayr and P.D. Ashlock. Principles of Systematic Zoology. McGraw-Hill, 1991. ISBN 9780071127011.

Ernst Mayr. Speciation Phenomena in Birds. American Naturalist, 74, 1940. doi: 10.1086/280892.

Kevin Shear McCann. The diversity–stability debate. Nature, 405(6783):228–233, 2000.

Malcolm C McKenna and Susan K Bell. Classification of mammals above the species level. Columbia University Press, 1997.

Millennium Ecosystem Assessment. Ecosystems and Human Well-being: Synthesis. Island Press, 2005. ISBN 1597260401.

M.Sh. Tsalenko (originator). Small category. Encyclopedia of Mathematics, February, 2011. URL http://www.encyclopediaofmath.org/index.php?title=Small_category&oldid=12738.


Shahid Naeem. Biodiversity, ecosystem functioning, and human wellbeing: an ecological and economic perspective. Oxford biology. Oxford University Press, 2009.

Regina Obe and Leo Hsu. PostGIS in action. Manning Publications Co., 2011.

Eugene P. Odum. Fundamentals of ecology. Saunders, Philadelphia, 1953.

Robert V. O'Neill. Is it time to bury the ecosystem concept? (with full military honors, of course!). Ecology, 82(12):3275–3284, 2001. ISSN 00129658.

Paul E. Black (editor). Dictionary of algorithms and data structures. U.S. National Institute for Standards and Technology, December (online version), accessed on 14 May 2015, 2004. URL xlinux.nist.gov/dads/HTML/datastructur.html.

Ryan Pavlick, Darren T Drewry, Kristin Bohn, Björn Reu, and Axel Kleidon. The Jena diversity-dynamic global vegetation model (JeDi-DGVM): a diverse approach to representing terrestrial biogeography and biogeochemistry based on plant functional trade-offs. Biogeosciences, 10:4137–4177, 2013.

S. Pavoine and M. B. Bonsall. Measuring biodiversity to explain community assembly: a unified approach. Biol Rev Camb Philos Soc, 86(4):792–812, 2011. ISSN 1469-185X (Electronic) 0006-3231.

SL Pimm and JH Lawton. On feeding on more than one trophic level. 1978.

Colin I Prentice. Developing a global vegetation dynamics model: results of an IIASA summer workshop. 1989.

I Colin Prentice, Alberte Bondeau, Wolfgang Cramer, Sandy P Harrison, Thomas Hickler, Wolfgang Lucht, Stephen Sitch, Ben Smith, and Martin T Sykes. Dynamic global vegetation modeling: quantifying terrestrial ecosystem responses to large-scale environmental change. In Terrestrial ecosystems in a changing world, pages 175–192. Springer, 2007.

Mark Pyron. Characterizing communities. Nature Education Knowledge, 3(10):39, 2010. URL http://www.nature.com/scitable/knowledge/library/characterizing-communities-13241173.

C. Rahbek and G. R. Graves. Multiscale assessment of patterns of avian species richness. Proceedings of the National Academy of Sciences of the United States of America, 98(8):4534–4539, 2001.

L. Richardson and S. Ruby. RESTful Web Services. O'Reilly Media, 2008. ISBN 9780596554606. URL https://books.google.co.uk/books?id=XUaErakHsoAC.

David F Robinson and Leslie R Foulds. Comparison of phylogenetic trees. Mathematical Biosciences, 53(1):131–147, 1981.

Royal Society. Measuring biodiversity for conservation, 2003.

Eduard Rübel et al. Pflanzengesellschaften der Erde. 1930.

Jeffrey Sachs. Common wealth: economics for a crowded planet. Penguin Press, New York, 2008.

Simon Scheiter, Liam Langan, and Steven I Higgins. Next-generation dynamic global vegetation models: learning from community ecology. New Phytologist, 198(3):957–969, 2013.

William H Schlesinger and Emily S Bernhardt. Biogeochemistry: an analysis of global change. Academic Press, 2013.

R.T. Schuh. Biological Systematics: Principles and Applications. Cornell University Press, 2000. ISBN 9780801436758.

S Sitch, Benjamin Smith, I Colin Prentice, Almut Arneth, A Bondeau, W Cramer, JO Kaplan, Samuel Levis, W Lucht, M Thonicke Sykes, et al. Evaluation of ecosystem dynamics, plant geography and terrestrial carbon cycling in the LPJ dynamic global vegetation model. Global Change Biology, 9(2):161–185, 2003.

S Sitch, C Huntingford, N Gedney, PE Levy, M Lomas, SL Piao, R Betts, P Ciais, P Cox, P Friedlingstein, et al. Evaluation of the terrestrial carbon cycle, future plant geography and climate-carbon cycle feedbacks using five dynamic global vegetation models (DGVMs). Global Change Biology, 14(9):2015–2039, 2008.

Benjamin Smith and J Bastow Wilson. A consumer’s guide to evenness indices. Oikos, pages 70–82, 1996.

Benjamin Smith, I Colin Prentice, and Martin T Sykes. Representation of vegetation dynamics in the modelling of terrestrial ecosystems: comparing two contrasting approaches within European climate space. Global Ecology and Biogeography, 10(6):621–637, 2001.

Jo U. Smith and Pete Smith. Introduction to environmental modelling. Oxford University Press, Oxford; New York, 2007. ISBN 9780199272068 (pbk.) 0199272069 (pbk.).

R.L. Smith and T.M. Smith. Elements of Ecology. Benjamin Cummings Publishing Company, 2001. ISBN 9780321068804.

R.V. Solé and J. Bascompte. Self-Organization in Complex Ecosystems (MPB-42). Monographs in Population Biology. Princeton University Press, 2006.

Thomas Richard Edmund Southwood and Peter A Henderson. Ecological methods. John Wiley & Sons, 2009.

EM Spehn, A Hector, J Joshi, M Scherer-Lorenzen, B Schmid, E Bazeley-White, C Beierkuhnlein, MC Caldeira, M Diemer, PG Dimitrakopoulos, et al. Ecosystem effects of biodiversity manipulations in European grasslands. Ecological Monographs, 75(1):37–63, 2005.

W. Steffen, K. Richardson, J. Rockström, S. E. Cornell, I. Fetzer, E. M. Bennett, R. Biggs, S. R. Carpenter, W. de Vries, C. A. de Wit, C. Folke, D. Gerten, J. Heinke, G. M. Mace, L. M. Persson, V. Ramanathan, B. Reyers, and S. Sörlin. Planetary boundaries: Guiding human development on a changing planet. Science, 2015.

Will Steffen, Jacques Grinevald, Paul Crutzen, and John McNeill. The Anthropocene: conceptual and historical perspectives. Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, 369(1938):842–867, 2011.

William L Steffen, BH Walker, JSL Ingram, and GW Koch. Global change and terrestrial ecosystems: the operational plan. Global Change Report (Sweden), no. 21, 1992.

A.S. Tanenbaum and D.J. Wetherall. Computer Networks. Pearson custom library. Pearson Education, Limited, 2013. ISBN 9781292024226. URL https://books.google.co.uk/books?id=w_d5ngEACAAJ.

David Tilman, Johannes Knops, David Wedin, Peter Reich, Mark Ritchie, and Evan Siemann. The influence of functional diversity and composition on ecosystem processes. Science, 277(5330):1300–1302, 1997.

David Tilman, Peter B Reich, and Johannes MH Knops. Biodiversity and ecosystem stability in a decade-long grassland experiment. Nature, 441(7093):629–632, 2006.

David Tilman, Forest Isbell, and Jane M. Cowles. Biodiversity and ecosystem functioning. Annual Review of Ecology, Evolution, and Systematics, 45(1):471–493, 2014. doi: 10.1146/annurev-ecolsys-120213-091917.

Richard Tofts and Jonathan Silvertown. A phylogenetic approach to community assembly from a local species pool. Proceedings of the Royal Society of London. Series B: Biological Sciences, 267(1441):363–369, 2000.

CJ Tucker and PJ Sellers. Satellite remote sensing of primary production. International Journal of Remote Sensing, 7(11):1395–1416, 1986.

United Nations. Convention on biological diversity, 1992. URL http://www.cbd.int/convention/refrhandbook.shtml.

P. M. Vitousek, H. A. Mooney, J. Lubchenco, and J. M. Melillo. Human domination of Earth’s ecosystems. Science, 277(5325):494–499, 1997. ISSN 0036-8075.

Alexander von Humboldt. Ideen zu einer Physiognomik der Gewächse. 1806.

David A Wardle, Karen I Bonner, and Gary M Barker. Stability of ecosystem properties in response to above-ground functional group richness and composition. Oikos, 89(1):11–23, 2000.

Campbell O Webb, David D Ackerly, Mark A McPeek, and Michael J Donoghue. Phylogenies and community ecology. Annual Review of Ecology and Systematics, pages 475–505, 2002.

Alexandra Weigelt, Elisabeth Marquard, Vicky M Temperton, Christiane Roscher, Christoph Scherber, Peter N Mwangi, Stefanie von Felten, Nina Buchmann, Bernhard Schmid, Ernst-Detlef Schulze, et al. The Jena Experiment: six years of data from a grassland biodiversity experiment: Ecological Archives E091-066. Ecology, 91(3):930–931, 2010.

Edward O. Wilson. The diversity of life. Questions of science. Belknap Press of Harvard University Press, Cambridge, Mass., 1st Harvard University Press pbk. edition, 2010. ISBN 9780674058170 (pbk.) 0674058178 (pbk.). With a new preface, dated 20 May 2010.

G. Wilson, D. A. Aruliah, C. T. Brown, N. P. C. Hong, M. Davis, R. T. Guy, S. H. D. Haddock, K. D. Huff, I. M. Mitchell, M. D. Plumbley, B. Waugh, E. P. White, and P. Wilson. Best practices for scientific computing. PLoS Biology, 12(1), 2014. ISSN 1545-7885.

Elizabeth M. Wolkovich, James Regetz, and Mary I. O’Connor. Advances in global change research require open science by individual researchers. Global Change Biology, 18(7):2102–2110, 2012. ISSN 1365-2486. doi: 10.1111/j.1365-2486.2012.02693.x. URL http://dx.doi.org/10.1111/j.1365-2486.2012.02693.x.

Michael F Worboys and Matt Duckham. GIS: a computing perspective. CRC Press, 2004.

Stan D. Wullschleger, Howard E. Epstein, Elgene O. Box, Eugénie S. Euskirchen, Santonu Goswami, Colleen M. Iversen, Jens Kattge, Richard J. Norby, Peter M. van Bodegom, and Xiaofeng Xu. Plant functional types in Earth system models: past experiences and future directions for application of dynamic vegetation models in high-latitude ecosystems. Annals of Botany, 114(1):1–16, 2014.

Chris Yesson, Peter W Brewer, Tim Sutton, Neil Caithness, Jaspreet S Pahwa, Mikhaila Burgess, W Alec Gray, Richard J White, Andrew C Jones, Frank A Bisby, et al. How global is the Global Biodiversity Information Facility? PLoS One, 2(11):e1124, 2007.