Transactions in GIS, 2012, 16(2): 143–160
Research Article
Advancing Global Marine Biogeography Research with Open-source GIS Software and Cloud Computing
Ei Fujioka, Marine Geospatial Ecology Lab, Duke University
Edward Vanden Berghe, Institute of Marine and Coastal Sciences, Rutgers University
Ben Donnelly, Marine Geospatial Ecology Lab, Duke University
Julio Castillo, Universidad Simón Bolívar
Jesse Cleary, Marine Geospatial Ecology Lab, Duke University
Chris Holmes, OpenGeo
Sean McKnight, Department of Public Works, City of Durham
Patrick Halpin, Marine Geospatial Ecology Lab, Duke University
Abstract
Across many scientific domains, the ability to aggregate disparate datasets enables more meaningful global analyses. Within marine biology, the Census of Marine Life served as the catalyst for such a global data aggregation effort. Under the Census framework, the Ocean Biogeographic Information System was established to coordinate an unprecedented aggregation of global marine biogeography data. The OBIS data system now contains 31.3 million observations, freely accessible through a geospatial portal. The challenges of storing, querying, disseminating, and mapping a global data collection of this complexity and magnitude are significant. In the face of declining performance and expanding feature requests, a redevelopment of the OBIS data system was undertaken. Following an Open Source philosophy, the OBIS technology stack was rebuilt using PostgreSQL, PostGIS, GeoServer and OpenLayers. This approach has markedly improved the performance and online user experience while maintaining a standards-compliant and interoperable framework. Due to the distributed nature of the project and increasing needs for storage, scalability and deployment flexibility, the entire hardware and software stack was built on a Cloud Computing environment. The flexibility of the platform, combined with the power of the application stack, enabled rapid re-development of the OBIS infrastructure, and ensured complete standards compliance.
Address for correspondence: Ei Fujioka, Marine Geospatial Ecology Lab, Nicholas School, Duke University, LSRC A328, Duke University, Durham, NC 27708, USA. E-mail: [email protected]
© 2012 Blackwell Publishing Ltd doi: 10.1111/j.1467-9671.2012.01310.x
1 Introduction
Through the Census of Marine Life (CoML; http://www.coml.org), scientists from more than 80 countries assessed and explained the abundance, distribution, and diversity of marine life throughout the world’s oceans, past, present, and future (Ausubel 1999). Launched in 2000 with funding from the Alfred P. Sloan Foundation among others, CoML was an unprecedented multi-national scientific effort. Over its history it involved more than 2,700 scientists, supported more than 500 research cruises, and potentially identified more than 6,000 new marine species.
The Ocean Biogeographic Information System (OBIS; http://www.iobis.org), as the information portal, provided the critical gateway to an enormous database serving both scientists and laypersons (Grassle 2000, Vanden Berghe et al. 2010). OBIS continues as a single entry point to a distributed federation of databases and uses web-based technology to make its data holdings broadly accessible. Since the end of the Census, OBIS has been incorporated in the Intergovernmental Oceanographic Commission of UNESCO, under its International Oceanographic Data and Information Exchange programme.
OBIS data have been used for analysis and conservation efforts worldwide. Several scientific analyses have been driven by OBIS data (e.g. Mora et al. 2008, Tittensor et al. 2010); the data allow us to identify gaps in our knowledge and to quantify our ignorance (e.g. Webb et al. 2010); and OBIS data holdings promise to play an important role in informing science-based management of marine living resources (e.g. Williams et al. 2010, Ardron et al. 2009).
Across many scientific domains, the ability to aggregate disparate datasets enables more meaningful global analyses; it facilitates the creation of massive datasets, on a global scale, that are commensurate with the global change problems humankind is confronted with (Figure 1). But the massive scale of these databases creates its own computational and GIS challenges.
The OBIS data system now contains 31.3 million observations, representing 120,000 marine species (Figure 2). While a query and mapping interface for the OBIS database had been in place since 2000, as the Census community matured, the need for improved access and spatially explicit, multi-faceted filtering of data became apparent. As the OBIS data holdings expanded, we found that supporting such demands was not technically possible under the existing framework and that a new, innovative search interface was needed. The development of a new OBIS Portal was led by the international OBIS Project Office at Rutgers University (http://www.iobis.org), with help from the Marine Geospatial Ecology Lab at Duke University (http://mgel.env.duke.edu/) and Universidad Simón Bolívar (http://www.usb.ve), and with contracted support from OpenGeo (http://opengeo.org/).
Figure 1 Number of observations in OBIS plotted in 2 degree and 5 degree grid cells
Figure 2 The growing OBIS database
The overall objective of the OBIS Portal renovation was to make the system both more user-friendly and more powerful, striking an appropriate balance (Costello and Vanden Berghe 2006). The key to this dual objective was to broaden the number of search and query criteria that could be combined (geographic space, time, depth, environmental envelope, biological classification). Packaging these options into an interface that would allow for easy queries, while not limiting more complex queries, helped achieve this balance. Specific goals included:
1. An intuitive system to browse the biological classification and to integrate results over the hierarchy, requiring that the full classification of each species be accessible (e.g. to extract records for all species belonging to the group “Pisces” or fishes);
2. Summarized views of data holdings for efficient extraction and rendering;
3. All query results to be downloadable in common GIS formats and web service standards with enhanced interoperability for other databases or products (Edwards et al. 2000).
We realized that a key to successfully achieving these ambitious goals was to make the system more manageable, interoperable and flexible. Relying on an open-source GIS framework provided a modular set of mature GIS and database tools that could deliver the performance that was needed, in a flexible and standards compliant manner. The rapid update cycle of these tools also ensured that any new features could be folded into the system as they came online. In addition, improvements to Open Source systems developed under the OBIS renovation could be made available back to the open-source GIS community.
2 Data and Database
OBIS data arrive at the data center from data providers around the world and are stored in a spatially-enabled relational database. Following our Open Source objectives, PostgreSQL 8.4.7 and PostGIS 1.5 were chosen to store the data (Figure 4). A spatial database built on PostgreSQL + PostGIS has been proven to be robust and comparable with proprietary or commercial equivalents (Anderson and Moreno-Sanchez 2003). OBIS-SEAMAP, a sister project of OBIS, is a noteworthy example of the successful implementation of a biogeography database with PostgreSQL + PostGIS (Halpin et al. 2006).
To mitigate the issues inherent to the distributed biodiversity data and improve the consistency and accuracy of aggregated data (Lapp et al. 2011), OBIS adopts Darwin Core biodiversity data standards (Darwin Core Task Group 2009), and its extension, the OBIS Schema, and supports Distributed Generic Information Retrieval (DiGIR; DiGIR 2005), a data transfer protocol to retrieve structured data from multiple, heterogeneous databases. Data providers supply data to OBIS via a DiGIR Provider, an open-source application implementing the DiGIR protocol, in either the Darwin Core or OBIS Schema. While the data in these standards are transferred in a flat table, the data retrieved are standardized and normalized into a relational structure for efficient data processing and consistent terminology (Figure 3).
Since the two central elements of a biodiversity database are locations and species (Schnase et al. 2003), the first task of the normalization is to extract unique latitude/longitude pairs into a separate table. This location table serves to look up environmental attributes such as average temperature or salinity and relationship with jurisdictional or biogeographic classifications such as Exclusive Economic Zones or FAO Fishing Areas.
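The normalization step can be illustrated with a minimal Python sketch. It is not the OBIS implementation: the record layout, the function names and the sample coordinates are assumptions made for illustration. The sketch extracts unique coordinate pairs into a location table, keys each observation by location id, and evaluates an attribute lookup once per unique location rather than once per observation.

```python
from typing import Callable

# Hypothetical flat records as they might arrive from a provider:
# (species name, latitude, longitude). Values are illustrative only.
Observation = tuple[str, float, float]


def normalize_locations(observations: list[Observation]):
    """Split flat records into a unique-location table and keyed observations."""
    location_ids: dict[tuple[float, float], int] = {}
    keyed = []
    for species, lat, lon in observations:
        # Reuse the id if this coordinate pair was seen before.
        loc_id = location_ids.setdefault((lat, lon), len(location_ids) + 1)
        keyed.append((species, loc_id))
    return location_ids, keyed


def attach_attribute(location_ids, lookup: Callable[[float, float], str]):
    """Evaluate a (possibly expensive) lookup once per unique location."""
    return {loc_id: lookup(lat, lon) for (lat, lon), loc_id in location_ids.items()}


obs = [
    ("Tursiops truncatus", 25.0, -77.5),
    ("Megaptera novaeangliae", 25.0, -77.5),
    ("Tursiops truncatus", 40.0, -70.0),
]
locations, keyed = normalize_locations(obs)
# Stand-in for an environmental or jurisdictional lookup (hypothetical).
hemisphere = attach_attribute(locations, lambda lat, lon: "N" if lat >= 0 else "S")
```

Here three observations collapse to two locations, so the lookup runs twice instead of three times; the saving grows with the number of observations sharing a position.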
Figure 3 Conceptual diagram of the OBIS database
Table 1 Polygon sets incorporated in OBIS, and their sources
Polygon set | Reference | Source
Exclusive Economic Zones | United Nations, 1982 | http://www.vliz.be/vmdcdata/marbound/
FAO Fishing Areas | FAO, 2002 | http://www.fao.org/fishery/area/search/en
International Hydrographic Office Regional Seas | International Hydrographic Organization, 1953 | http://www.vliz.be/vmdcdata/vlimar/downloads.php
Large Marine Ecosystems | Sherman and Duda, 1999 | http://www.lme.noaa.gov
Marine Ecoregions of the World | Spalding et al., 2007 | http://www.conserveonline.org/workspaces/ecoregional.shapefile/MEOW/view.html
The association of biological observations with environmental attributes and jurisdictional or biogeographic classifications is a particularly important property for habitat modeling, marine conservation planning and fisheries management (Lourie and Vincent 2004, Redfern et al. 2006, Roff and Evans 2002, Sherman et al. 1996) and the OBIS Search Interface allows the user to extract observations based on such an association (described below). Among the environmental attributes, physical and chemical oceanographic parameters are taken from the World Ocean Atlas (WOA09, http://www.nodc.noaa.gov/OC5/WOA09/pr_woa09.html) and the depth of the ocean floor from the ETOPO 1 minute grid (http://www.ngdc.noaa.gov/mgg/global/global.html). For jurisdictional or biogeographic classifications, a series of GIS datasets were downloaded from public sources and imported into PostGIS tables (Table 1). Many of these polygons are very complex, and querying them in real-time, responding to user actions through the Search Interface, is not practical. For this reason, the lookup query was conducted offline and each unique position in the location table was qualified with attributes indicating membership in each of these polygons. This way, only an attribute query has to be addressed in response to a user action, instead of a spatial query. Furthermore, the normalized locations tally less than 10 percent of the 31.3 million observation records, significantly reducing the time needed for the lookup query.
Another set of reference polygons used are regular geographic grids at different resolutions. Standard sizes built into the database are 5 and 1 degree, and 30 and 6 arc minutes. The coding of the names of each of these squares follows the C-Squares method (Rees 2003), indicating which smaller size squares fit in larger squares. This indexing method made some of the calculations (e.g. the number of species per grid at various
resolutions) more efficient because the calculations are accomplished by an attribute query rather than a spatial one. The different-size squares are used to display maps on the Search Interface, and also to calculate a series of summary maps, with various diversity indices and statistics of the data holdings (Figure 3; but also see detailed descriptions in Section 3 below).
The other central element of a biodiversity database is taxonomy, or the study of biological classification. In the taxonomic “tree of life” all forms of life, from microbes to whales, are placed on a node tree that starts at Biota and cascades down to species or subspecies; nodes in this tree are called “Taxa” (singular “Taxon”); the level of a taxon in this tree is its “Rank”. In the case of Animalia, there are 28 ranks excluding Biota. While a single species distribution is important, many biodiversity studies are interested in a higher level taxon and its children (Vanderklift et al. 1998). Also, in the field work that contributes to OBIS, researchers are not always able to identify the observed animal at a species level (e.g. bottlenose dolphins; Lapp et al. 2011). In these cases, the practice is to identify and record the observed animal at a higher rank (e.g. Delphinidae), or even at the kingdom level (e.g. a kind of animal).
In the OBIS database, the classification is stored as a parent link with each of the taxa. For example, Tursiops truncatus (bottlenose dolphins) points to the genus Tursiops, which in turn points to the family Delphinidae, continuing up to Animalia, which has the top node Biota as its parent. Following all these parent links is time consuming; moreover, at the outset of querying the members of a higher taxon, it is unknown how many branches will be followed. For this reason a derived field, the “stored path,” is precalculated: the delimited concatenation of all unique identifiers of all ancestors, from the highest rank Biota down to the parent of the target taxon.
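A minimal sketch of the stored path idea follows, assuming a "/" delimiter, made-up taxon identifiers, a simplified classification (intermediate ranks omitted) and an in-memory SQLite table in place of PostgreSQL. None of these details come from the OBIS schema. It shows how a path precalculated from the parent links lets a single LIKE pattern retrieve every descendant of a taxon.

```python
import sqlite3

# Hypothetical taxon ids and a heavily simplified classification.
TAXA = [
    # (id, name, parent_id)
    (1, "Biota", None),
    (2, "Animalia", 1),
    (3, "Delphinidae", 2),
    (4, "Tursiops", 3),
    (5, "Tursiops truncatus", 4),
    (6, "Balaenopteridae", 2),
]


def stored_path(taxon_id, parents):
    """Concatenate ancestor ids from the root down to the parent of taxon_id."""
    chain = []
    parent = parents[taxon_id]
    while parent is not None:
        chain.append(str(parent))
        parent = parents[parent]
    chain.reverse()
    # A delimiter on both sides of every id keeps id 1 from matching id 11.
    return "/" + "".join(part + "/" for part in chain)


def build_db():
    """Create an in-memory table holding each taxon with its stored path."""
    parents = {tid: pid for tid, _name, pid in TAXA}
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE taxa (id INTEGER PRIMARY KEY, name TEXT, "
        "parent_id INTEGER, storedpath TEXT)"
    )
    conn.executemany(
        "INSERT INTO taxa VALUES (?, ?, ?, ?)",
        [(tid, name, pid, stored_path(tid, parents)) for tid, name, pid in TAXA],
    )
    return conn


def descendants(conn, taxon_id):
    """Return names of all taxa below taxon_id using one LIKE query."""
    rows = conn.execute(
        "SELECT name FROM taxa WHERE storedpath LIKE ? ORDER BY id",
        (f"%/{taxon_id}/%",),
    )
    return [name for (name,) in rows]


conn = build_db()
print(descendants(conn, 3))  # taxa below Delphinidae
```

The cost of the approach is that stored paths must be recalculated whenever a taxon is moved in the classification; the query itself needs no knowledge of how deep the subtree is.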
The stored path makes it possible to interrogate the full classification with a single LIKE query or regular expression. The database was developed before recursive queries were available in PostgreSQL; these recursive queries might be an alternative to the stored path approach. The method described above, however, does not work well for gridded summary layers. While each row in the point observation table represents an individual observation of a taxon, a row in the summary table is a grid aggregating multiple observations of a taxon. Since grids of multiple taxa (a parent taxon and its child taxa) would overlap, the results must be aggregated per grid to report the accurate number of observations in a grid. We pre-calculated, for each grid, the number of observations of all child taxa of the parent. Therefore, the summary layers have two count columns: (1) the number of observations of a taxon itself; and (2) the number of observations of the taxon plus the number of observations of all child taxa of the parent taxon.
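The two count columns can be sketched as follows. This is an illustration under assumptions, not the OBIS code: the grid labels, the counts and the shortened parent chain are invented. Each observation count is credited to the taxon itself and to every ancestor, so that a parent taxon has one pre-aggregated row per grid.

```python
from collections import defaultdict

# Hypothetical parent links (taxon -> parent) and per-grid observation counts.
PARENT = {
    "Tursiops truncatus": "Tursiops",
    "Tursiops": "Delphinidae",
    "Delphinidae": None,
}
OWN_COUNTS = {
    ("grid_A", "Tursiops truncatus"): 5,
    ("grid_A", "Tursiops"): 2,
    ("grid_A", "Delphinidae"): 1,
    ("grid_B", "Tursiops truncatus"): 3,
}


def summarize(own_counts, parent):
    """Return {(grid, taxon): (own_count, own_plus_descendants_count)}."""
    total = defaultdict(int)
    for (grid, taxon), n in own_counts.items():
        node = taxon
        while node is not None:  # credit the taxon itself and every ancestor
            total[(grid, node)] += n
            node = parent[node]
    return {key: (own_counts.get(key, 0), total[key]) for key in total}


summary = summarize(OWN_COUNTS, PARENT)
print(summary[("grid_A", "Delphinidae")])  # own count, count including children
```

Note that a parent taxon receives a row in a grid even when it has no direct observations there, which is what allows a query on the parent to read a single row per grid.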
3 Search Interface Development
The development team at Duke University had a decade of experience with online biogeography applications for OBIS-SEAMAP (Halpin et al. 2009). Selection of applications was therefore based on the team’s experience along with our Open Source objectives (Figure 4). GeoServer 2.1.0 was chosen as the map engine to provide standard geospatial web services (Open Geospatial Consortium, OGC; http://www.opengeospatial.org/).
Figure 4 System diagram of the iOBIS Search Interface
On the browser side, the Search Interface embedded OpenLayers 2.9.1, which provides a framework for the mapping interface, requesting and fetching WMS tiles to
and from GeoServer. GUI components for the search functionality were created using EXTJS 3.1, a cross-browser JavaScript library available for free for non-commercial uses (http://www.sencha.com/). EXTJS provides user interface elements comparable to those in desktop applications, such as widgets to visualize data in a spreadsheet-like gridded table. EXTJS was also used to request and fetch non-spatial data such as those for graphing and record details through Asynchronous JavaScript and XML (AJAX) calls to the server. AJAX calls are processed by PHP scripts on the server and results are returned in JavaScript Object Notation (JSON) format.
All GIS layers in GeoServer, except for the static backgrounds, point to a corresponding table or view in the PostgreSQL database. The Search Interface uses two background layers: continents and bathymetry. These are the only layers that remain static through the user experience. The static nature and huge size of the layers (bathymetry: 1.7 GB TIFF) made them suitable for caching. Initial testing showed that rendering of the bathymetry through a WMS request without caching was too slow for the Search Interface. This slow throughput would degrade further when the application was configured to use tiles. A single map window in the OpenLayers framework uses 60 tiles when tiling is enabled (it is possible to reduce the number of tiles by reducing the buffer outside the map view). The tile-caching provided by GeoWebCache, GeoServer’s default tile-caching mechanism, improved performance dramatically. In addition, the continents and bathymetry are always served together, so a GeoServer group layer was used to feed the combined layer to GeoWebCache. The use of a group layer is also beneficial in that the number of WMS requests to fetch the backgrounds is reduced by half.
One of the biggest challenges came from the inability of OGC standards to make a layer highly searchable while providing rich query options.
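The effect of the group layer on request volume can be sketched with standard WMS 1.1.1 GetMap parameters. The endpoint URL, the layer names and the tile layout below are assumptions for illustration only; the tile count of 60 is taken from the text.

```python
from urllib.parse import urlencode

BASE_URL = "http://example.org/geoserver/wms"  # hypothetical endpoint


def getmap_url(layers, bbox, size=256):
    """Build a WMS 1.1.1 GetMap URL for one tile."""
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.1.1",
        "REQUEST": "GetMap",
        "LAYERS": ",".join(layers),
        "STYLES": "",
        "SRS": "EPSG:4326",
        "BBOX": ",".join(str(v) for v in bbox),
        "WIDTH": size,
        "HEIGHT": size,
        "FORMAT": "image/png",
    }
    return BASE_URL + "?" + urlencode(params)


def tile_requests(tile_bboxes, layer_sets):
    """One request per tile and per separately requested layer set."""
    return [getmap_url(layers, bbox) for bbox in tile_bboxes for layers in layer_sets]


# 60 illustrative 30-degree tiles (10 columns by 6 rows).
tiles = [
    (-150 + 30 * c, -90 + 30 * r, -120 + 30 * c, -60 + 30 * r)
    for r in range(6)
    for c in range(10)
]
separate = tile_requests(tiles, [["continents"], ["bathymetry"]])  # two layers
grouped = tile_requests(tiles, [["backgrounds"]])  # one hypothetical group layer
print(len(separate), len(grouped))
```

With two background layers requested separately, each tile costs two requests; a single group layer needs one, and the resulting tiles can then be cached as one image.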
Most GIS data are divided into discrete layers of information – roads, parks, railways, etc. The OGC information model was created with this traditional geospatial framework in mind. Biogeographic data are somewhat different; there are literally hundreds of thousands of different species for which data are available. Even in a very simple scenario where the user extracts data for a single species (e.g. bottlenose dolphins), it is not feasible to define a map layer for each species.
Complexity grows rapidly when more search options are provided. The Search Interface allows the user to define any shape of region of interest and extract data within the region. This polygon may have more sides than a simple bounding box and depends on PostGIS’s spatial query functions (e.g. ST_WITHIN). As the region is defined by the user online, the possible shapes are virtually infinite and it is impossible to pre-define layers or cache the results for regions of interest. Other criteria that the Search Interface supports are: (1) a parent taxon and its child taxa; (2) one or more datasets; (3) a region within a jurisdictional or biogeographic classification (e.g. Bahamas EEZ); (4) various temporal resolutions (e.g. year only, year and month, or a specific date); (5) a season; and (6) a range of one or multiple oceanographic attributes such as temperature and salinity (Figure 5). For an intuitive browsing of the taxonomic tree, we used EXTJS components to combine a live search box with a tree component similar to Windows Explorer. When the user enters a portion of the species name into the box, the live search returns a candidate list with the taxonomic stored path from which the user chooses the species of interest. Upon the selection the stored path is parsed, cascading down to the target species while fetching sibling taxa at each rank to fill out the branches (Figure 6). In addition to the complexity of filtering, there is an issue with the Common Query Language (CQL) used to pass client query information back to GeoServer. It does not support commonly used query operations such as LIKE, let alone database-specific query functions/operations. Due to this limitation, we came to the conclusion that CQL is not suitable for achieving the complex queries that the Search Interface allows the user to build. 
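To make the combinatorial nature of these criteria concrete, the following sketch assembles a parameterized SQL statement from whichever criteria the user has set. It is an assumption-laden illustration: the table and column names, the "/"-delimited stored path and the `%s` placeholder style (as used by common PostgreSQL drivers) are hypothetical, and only `ST_Within` and `ST_GeomFromText` are actual PostGIS functions. Expressions of this kind (LIKE patterns, database-specific functions) are what CQL could not carry.

```python
def build_query(taxon_id=None, region_wkt=None, eez=None, years=None, temperature=None):
    """Combine optional search criteria into one parameterized SQL statement."""
    clauses, params = [], []
    if taxon_id is not None:
        # The taxon itself, or any taxon whose stored path contains it.
        clauses.append("(taxon_id = %s OR storedpath LIKE %s)")
        params += [taxon_id, f"%/{taxon_id}/%"]
    if region_wkt is not None:
        # User-drawn region of interest, tested with a PostGIS spatial predicate.
        clauses.append("ST_Within(geom, ST_GeomFromText(%s, 4326))")
        params.append(region_wkt)
    if eez is not None:
        # Precomputed membership attribute: an attribute test, not a spatial one.
        clauses.append("eez = %s")
        params.append(eez)
    if years is not None:
        clauses.append("obs_year BETWEEN %s AND %s")
        params += list(years)
    if temperature is not None:
        clauses.append("temperature BETWEEN %s AND %s")
        params += list(temperature)
    where = " AND ".join(clauses) if clauses else "TRUE"
    return f"SELECT id, geom FROM observations WHERE {where}", params


sql, params = build_query(taxon_id=3, eez="Bahamas", years=(2000, 2010))
print(sql)
print(params)
```

Values are kept out of the SQL text and passed as parameters, which is the usual way to let a database driver handle quoting.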
Figure 5 Multi-faceted data extraction on the Search Interface (e.g. humpback whale observations in the Bahamas EEZ)
Figure 6 An implementation of the taxonomic tree of life with a live search box and a tree component
Instead of writing an unmaintainable database script to overcome the limitation of CQL, the OBIS team decided to work with the core GeoServer developers at OpenGeo
to develop a standards-based approach that could handle the requirement for custom database queries. The result was new functionality in GeoServer called Parametric SQL Views. A Parametric SQL View is a geospatial layer that the GeoServer administrator defines with a SQL query statement native to the underlying database, instead of forcing the database administrator to create a database view. The query statement in a Parametric SQL View can contain placeholders that are intended to be substituted with parameter values passed by the client software. The result is a complete SQL statement reflecting the on-the-fly user inputs, which in turn is passed to the database by GeoServer (Figure 7). The remainder of the process is the same as for an ordinary OGC layer. Since a Parametric SQL View is an OGC layer, all the standard OGC services including WMS, WFS and KML can still be served. If the Parametric SQL View parameters are not included in an OGC request, default values, which should be carefully chosen for suitable performance, are used for the substitution. This gives an incredible level of control, although it also introduces the potential for SQL injection attacks if not configured properly (Boyd and Keromytis 2004). The risk of SQL injection attacks is reduced by supplying validation regular expressions that define expected parameter values. It is also recommended to set up GeoServer’s datastore so that it accesses the database with a read-only privilege.
Figure 7 Parametric SQL View definition and parameter syntax
In the implementation on the Search Interface, each time the user chooses a set of criteria, the criteria are combined and formalized as Parametric SQL View parameters, and then added to WMS requests. GeoServer receives the request, parses the passed
parameters and substitutes placeholders in the layer definition with the actual values. To accept such diverse search criteria, it is crucial for both performance and management to prepare the target table or view of the layer so that it contains all attributes subject to the search (denormalization; Figure 3). Many on-line mapping applications are centered around the display of points. This could be the default representation for biogeographic data as well, but the large number of records in OBIS makes it impossible to follow this simple path. For example, the phylum Annelida, or segmented worms, and its sub-groups are represented by more than 1 million records, and are spread over the globe. Sending one million points to the browser does not result in usable display of the data. The Global Biodiversity Information Facility (GBIF), the largest biodiversity data portal to which OBIS has been a long-time associate member, for example, provides occurrence overview maps online at one-degree resolution by default and at a smaller resolution while zooming in (GBIF 2011). GBIF advises the user to download the data for more in-depth investigations. The challenges of this huge volume of data over a global extent were: (1) extracting a large number of location data from the database and mapping individual points within an acceptable response time (e.g. 30 seconds); and (2) point locations which overlap extensively, especially at small scales, making it difficult to grasp the global distribution of a group of interest. To overcome these challenges and depict the distribution of records with precision, while achieving a desired performance against a huge number of records, we developed multi-resolution aggregated summaries. Following the C-Squares method, aggregation is conducted against the individual point locations per group of taxa and grid at predetermined resolutions, summarizing the number of records in each grid. 
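The aggregation can be sketched in a few lines of Python. This is a simplified illustration, not the OBIS procedure: cells are identified by plain integer row and column indices rather than C-Squares codes, and the sample points are invented. The resolutions are those named in the text.

```python
import math
from collections import Counter

RESOLUTIONS = (5.0, 1.0, 0.5, 0.1)  # degrees, as listed in the text


def cell_index(lat, lon, resolution):
    """Return the integer (row, col) of the grid cell containing the point."""
    return (
        math.floor((lat + 90.0) / resolution),
        math.floor((lon + 180.0) / resolution),
    )


def aggregate(points, resolution):
    """Count records per grid cell at one resolution."""
    return Counter(cell_index(lat, lon, resolution) for lat, lon in points)


# Hypothetical observation positions (latitude, longitude).
points = [(25.04, -77.52), (25.26, -77.52), (26.53, -77.47), (-33.23, 18.42)]
for res in RESOLUTIONS:
    summary = aggregate(points, res)
    print(res, len(summary), "cells,", sum(summary.values()), "records")
```

For these four points the number of occupied cells grows from two at 5 degrees to four at 0.1 degree, while the record total is unchanged; coarse layers therefore carry far fewer rows to extract and render.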
The resulting layers are gridded summaries of distribution and each cell is a square representing a fixed degree(s) latitudinally and longitudinally. One spatially-enabled table is generated per resolution. As the aggregation is based on regular grids, the resulting grids never overlap and the maximum number of records/grids can be easily known (e.g. 2,520 grids at a 5 degree resolution). The resolution is automatically determined by tracking the set zoom level of the map interface. For example, at the lowest zoom level (smallest scale), the five-degree summary layer will be displayed. Zooming one level up, the one-degree summary layer will be picked up. The match between possible zoom levels and resolutions to serve was arbitrarily decided (Table 2).
Table 2 The match between zoom levels and summary grid resolutions and the number of records and grids for Phylum Annelida at each resolution
Layer | Number of records/grids | Zoom level
Individual points | 1,191,401 |
5 degrees | 1,575 | 0
1 degree | 8,762 | 1–3
0.5 degree | 14,310 | 4–5
0.1 degree | 33,653 | 6–7
To better represent the concentration of the distribution, the grids are color-coded by the number of records in the grid. Since the
number of grids to display at a low resolution is significantly reduced compared with the individual point layer, the performance challenges of extracting and rendering them are mitigated (Table 2). Whereas it is possible that the number of distribution grids dramatically increases as the resolution becomes higher (e.g. to cover the whole globe at a 0.1 degree resolution, 6,480,000 grids are necessary), the multi-resolution aggregation addresses this issue. By adjusting the resolution to zoom level, smaller areas are to be mapped online at a higher resolution, constraining the number of grids that require display. We also generated multi-resolution summaries per dataset. Many users search for datasets that originated from specific research efforts, rather than a broad search of a taxon. The technique of the multi-resolution gridded summary along with the Parametric SQL Views allowed us to implement another scientifically valuable layer. Biodiversity indices, measures of biological diversity, have gained increasing attention from scientists, policy-makers and the general public, as biodiversity has become one of the major environmental concerns (Magurran 2004). Since no single biodiversity index serves as the most appropriate measure for all research and decision making (Duelli and Obrist 2003), we chose four indices, species richness, Simpson index, Shannon index and Hurlbert’s index to start with (Magurran 2004, Hurlbert 1971). The indices were calculated from the point observations per grid at various resolutions and put together in a PostGIS table per resolution (Figure 8). A choice of the biodiversity index is passed to GeoServer as a Parametric SQL View parameter which, in this case, is a column name rather than a SQL WHERE clause.
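The placeholder substitution behind a Parametric SQL View can be emulated in plain Python to show the role of defaults and validation expressions. This is a sketch under assumptions and not GeoServer code: the SQL statement, the table and column names and the parameter names are hypothetical. The `%name%` placeholder style and the `name:value;name:value` form of the `viewparams` request parameter follow GeoServer's SQL view conventions, without the escaping rules a real server applies.

```python
import re

# Hypothetical layer definition with two placeholders.
VIEW_SQL = "SELECT cell_id, geom, %index% AS value FROM summary_5deg WHERE taxon_id = %taxon%"

PARAMS = {
    # name: (default value, validation regular expression)
    "index": ("richness", r"^(richness|simpson|shannon|hurlbert)$"),
    "taxon": ("1", r"^\d+$"),
}


def parse_viewparams(text):
    """Parse a 'name:value;name:value' string into a dictionary."""
    pairs = (item.split(":", 1) for item in text.split(";") if ":" in item)
    return {name: value for name, value in pairs}


def render(sql, params, supplied):
    """Substitute validated values (or defaults) for each placeholder."""
    for name, (default, pattern) in params.items():
        value = supplied.get(name, default)
        if not re.fullmatch(pattern, value):
            raise ValueError(f"parameter {name!r} rejected: {value!r}")
        sql = sql.replace(f"%{name}%", value)
    return sql


print(render(VIEW_SQL, PARAMS, parse_viewparams("index:shannon;taxon:42")))
```

Because the index parameter is substituted as a column name, a whitelist pattern is the only barrier between the request and the SQL text; a value such as `1 OR 1=1` for the taxon is rejected before any statement is built.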
Figure 8 Four biodiversity indices at a 5 degree resolution and species richness at 1 and 0.5 degree resolutions (bottom two)
4 Platform Selection and Scaling
Dealing with huge volumes of geospatial data is a challenge and evaluating and improving geoprocessing performance is a popular topic in computer science (Scholten et al. 2006, Zhang et al. 2007). In addition, the teams that rebuilt the OBIS Portal were distributed geographically and institutionally. Therefore, a more flexible, scalable and open environment was required, with full access to the operating system for performance optimization and the ability to create multiple development servers for the database and portal components.
The primary software choices were made before selecting the hosting platform. The early decision to rebuild the OBIS infrastructure on an Ubuntu/PostgreSQL/GeoServer/Drupal stack allowed for several possible server environments. The two that were strongly considered were VMware virtual hosting on hardware at a local facility, and the Amazon Elastic Compute Cloud (Amazon EC2). Although adopting Cloud Computing in geospatial data infrastructure has not been explored intensively, it is a promising approach to deal with scalability and performance (Baranski et al. 2009,
2011). Both VMware and Amazon EC2 offered complete control over the operating system and software versions and would allow for snapshots of server state, and quick creation of identical development instances. Amazon EC2 was ultimately preferable for several reasons:
1. It was “neutral territory” between the teams, independent of capital hardware owned by any of the participating institutions;
2. It was self-contained, with strong online documentation allowing each of the teams to reach the same level of expertise quickly;
3. It provided the most flexibility in the allocation of base level resources: RAM and disk space could be allotted with minimal system administration effort; and hardware (or at least abstracted virtual hardware) could be scaled up to test load and performance.
The early stages of development coincided with the release of Ubuntu Lucid Lynx 10.04, a long-term support version, ideal for a project that expected to have a major pulse of development. The default versions of PostgreSQL and PostGIS were sufficient for the database server build. The portal proved more complicated, as GeoServer was not specifically packaged for Ubuntu, and the Drupal content management system required significant customization to integrate with the Search Interface. The decision was made to partition generic software packages (PostgreSQL, Apache) from the customized software that existed outside of well-known Ubuntu software repositories (GeoServer, Drupal).
The principle for building server nodes thus became:
1. Designate an empty Ubuntu 10.04 virtual image from the Amazon EC2 cloud as the “base image”;
2. Instantiate the base image, fetch a script to install and configure generic software packages;
3. Attach a second disk containing pre-configured OBIS applications to the server system;
4. For a portal server, use version control software to sync the code on the preconfigured second disk to the latest OBIS code base and website content;
5. Manually make network changes to uniquely identify the node (DNS names, firewall adjustments for connections between servers).
Updates to the OBIS database are accomplished through system administration, rather than database administration. A “data preparation” server node is built through the above process. New data are loaded into the data preparation instance of PostgreSQL and tested with development instances of the OBIS Portal. The secondary, non-OS disk is then detached and cloned. The cloned disk is attached to the production instance, and the update disk is mounted and made visible to the operating system during the quick process of shutting down and restarting the PostgreSQL process. Downtime during a data update is typically less than five minutes.
The generally un-cacheable nature of the web application was another factor in using cloud computing. Bringing the revised OBIS Portal into production coincided with the major media events surrounding the parent project, the Census of Marine Life. There was no way to estimate the amount of traffic that the Census of Marine Life media event would drive to the OBIS Portal. Prior to the events, both the production database and portal servers were scaled up to the largest size offered by Amazon EC2: 15 GB RAM, 4 virtual processors.
5 Results

One of the main outcomes of this development effort is the accumulated knowledge and skills to build a biodiversity portal infrastructure based on open-source, standards-compliant applications in a Cloud Computing environment, which could be applied to other OBIS Nodes, regional or thematic branches of OBIS, or outside products. The complexity and volume of the data and the multitude of desired access channels required powerful and flexible web-based geospatial components to deliver a user-friendly and performant interface. That such a complex and full-featured system could be constructed from disparate open-source geospatial components is a testament to the maturity and prowess of the components.

The use of the Amazon EC2 cloud enabled the development to become more nimble and to scale up to meet the expected challenges of a widely covered international release event. On the peak day during the week of the Census of Marine Life release event in London, we witnessed 3,157 visitors averaging 3:28 minutes on the page hosting the Search Interface (Figure 9) and over 11,000 visitors sitewide. Traffic was evenly distributed between Western Europe, North America and Pacific Asia, so the load was consistent through the day, rather than clumped in the peak hours of a particular region. Monitoring the site from England and the U.S. showed no degradation in service. Despite this increase, efficient database indexing, tile caching and scale-up of virtual hardware kept the site running smoothly.

Figure 9 CoML release event load on the Search Interface
6 Discussion and Conclusions

The intermittent update cycle of open-source software and the modular nature of the software stack assembled for this effort present a host of challenges and opportunities. Open-source software advances at a more rapid pace than many commercial platforms, potentially adding new features and performance benefits. Since the ongoing development effort is already pushing the limits of these packages, these benefits are often needed immediately. However, the modular nature of open-source GIS software makes the intermittent update cycles of each piece of software a constant challenge to accommodate in a production web application. As commonly used open-source GIS software stacks emerge, improved coordination of the release cycles among the coding projects would help researchers remain focused on their domain and informatics improvements instead of version compatibility management.

One of the significant contributions of the renovated OBIS infrastructure to marine science communities, built as it is on an open-source, standards-compliant framework, is the improved accessibility and interoperability that allows users or products to exploit the OBIS data more efficiently and dynamically. Providing access to diversity indices and statistics at multiple resolutions will be a valuable service, as spatial scale matters for identifying species distributions and ecological processes (Lourie and Vincent 2004, Magurran 2007, Redfern et al. 2006). Through the Search Interface, the user is able to set up exact criteria in terms of species, region and time period of interest at a desired resolution, extract the data that meet the criteria and download the data in an OGC standard format. Similarly, an online product can integrate the OBIS data dynamically through the Internet.
A notable example is the Ocean Data Viewer hosted by UNEP-WCMC, where WMS images of the biodiversity indices from OBIS are visualized in UNEP-WCMC’s own mapping application (http://data.unep-wcmc.org/). These biodiversity indices were also used in practical illustrations for identifying ecologically and biologically significant areas under the Global Ocean Biodiversity Initiative (GOBI; Ardron et al. 2009).

A summary obtained from download activity logs indicates that more than 90% of the downloads were requested in comma-separated value (CSV) format, followed by KML (4%). Although KML is now an OGC standard (Open Geospatial Consortium 2011), it is safe to assume that the intended usage of KML is very specific (i.e. loading it into Google Earth). WMS and WFS, the traditional OGC formats the Search Interface supports, represent only 5% of the downloads combined (Table 3). A similar trend is seen in OBIS-SEAMAP, a sister project of OBIS, where WMS and WFS account for only 3% of all downloads. OBIS-SEAMAP has a slightly different objective from the OBIS Portal and also supports Esri shapefiles as a download format, which account for about half of its downloads.

While standards such as those of the OGC give application developers common knowledge, approaches and methods to exchange geospatial data, end-users may not recognize or make use of the same benefits as developers. It is often the case that end-users need to write a script to parse WFS output or customize WMS images, which may be beyond their ability or a distraction from their primary objective (i.e. research analyses). Moreover, especially for WFS, the XML file tends to be large and to take a long time to generate and transfer. We estimated file sizes of WFS and CSV for all cetacean species (n = 242,870). The size of the WFS XML file is more than 360 MB, while the same data in a CSV file is less than 56 MB. Because of the size of the OBIS database, WFS files in the hundreds of megabytes are common, which causes transfer issues and may crash client applications. The size issue is partly solved by zipping the XML file, which the Search Interface implements. The XML for the cetacean example compresses to just 10 MB. These issues remind us of the importance of continuing to improve the standards (e.g. adopting a zipped version of WFS, as KMZ is for KML) and of improving tools to promote the benefits of the standards.

In general, compared with terrestrial animals, marine creatures tend to have longer migration paths and broader home ranges. For example, loggerhead sea turtles make astonishing trans-Pacific migrations (Nichols et al. 2000) and Arctic terns tirelessly fly between the North and South Poles (Egevang et al. 2010). Distributions of such species are likely to cover an entire ocean basin or even the globe.
An extent crossing the international dateline is not uncommon. Whereas popular online mapping interfaces such as Google Maps and OpenLayers are able to achieve a seamless 360-degree view of the globe, map operations and analyses (e.g. drawing a polygon across the international dateline and extracting the data within the polygon) are not easily transferable to the web. This is partly a limitation arising from geographic coordinate systems. The coordinate system used by the Search Interface, EPSG:4326, splits the globe at longitude 180/-180, and an end-user’s bounding box that traverses that longitude cannot be represented with a single polygon.
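To make the limitation concrete, the sketch below splits a bounding box that crosses longitude 180/-180 into the two boxes that EPSG:4326 requires. It is a minimal illustration and not the Search Interface’s implementation: the function names are hypothetical, and it assumes the convention that a box whose western edge is numerically greater than its eastern edge crosses the dateline.

```python
from typing import List, Tuple

# (west, south, east, north) in decimal degrees, EPSG:4326
BBox = Tuple[float, float, float, float]


def split_at_dateline(west: float, south: float,
                      east: float, north: float) -> List[BBox]:
    """Return one box, or two boxes if the extent crosses longitude 180/-180.

    Assumption: west > east means the box wraps across the dateline.
    """
    for lon in (west, east):
        if not -180.0 <= lon <= 180.0:
            raise ValueError(f"longitude out of range: {lon}")
    if south > north:
        raise ValueError("south must not exceed north")
    if west <= east:
        return [(west, south, east, north)]
    # Eastern-hemisphere part up to 180, western-hemisphere part from -180.
    return [(west, south, 180.0, north), (-180.0, south, east, north)]


def bbox_to_wkt(box: BBox) -> str:
    """Format a box as a closed WKT polygon ring (longitude latitude order)."""
    w, s, e, n = box
    ring = [(w, s), (e, s), (e, n), (w, n), (w, s)]
    coords = ", ".join(f"{x:g} {y:g}" for x, y in ring)
    return f"POLYGON(({coords}))"


# A box from 170 E to 170 W around the equator becomes two polygons.
for part in split_at_dateline(170.0, -10.0, -170.0, 10.0):
    print(bbox_to_wkt(part))
```

Each resulting polygon can then be used in its own spatial query, with the two result sets merged afterwards.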
Table 3 Download requests by format (percentage in parentheses) over the time period shown

Format            OBIS          OBIS-SEAMAP
WFS               19 (2%)       22 (< 1%)
WMS               21 (2%)       139 (2%)
KML               35 (4%)       N/A
CSV               783 (91%)     2,874 (47%)
Esri Shapefile    N/A           3,098 (51%)
Time period       138 days      671 days
This limitation could be overcome on the application side. The application could detect the current view (Pacific-centric or Atlantic-centric) and store user actions (e.g. drawing a polygon). Then it could break the operation or calculation into two, with distinct polygons for the western and eastern hemispheres. Alternatively, a more promising approach would be the geography data type, a new data type supported by PostGIS 2.0 (Obe and Hsu 2011). We will investigate this approach further.

For species that call Arctic or Antarctic waters home, map projection raises another issue. Commonly used projections such as Mercator, or an unprojected geographic coordinate system, severely distort the polar regions and make it hard to grasp species distribution or movement around the poles. When it comes to online GIS applications, it is not sufficient simply to reproject query results to a polar projection. User inputs (e.g. a bounding box) and spatial analyses also need to be handled in the polar projection. We encourage the open-source GIS community to devise a consolidated method or package that allows the user to project and operate on geospatial data in a projection of the user’s choice.

Given the complex global challenges facing the world’s oceans, marine biogeographic data are a crucial resource for scientists and policy-makers addressing these issues. Partnerships between scientists, GIS researchers, and software developers can provide the tools needed to support these efforts. Additional work is needed to improve performance, access and ease of use, which will advance the ability of GIS to address global environmental informatics challenges.
Acknowledgments

We would like to thank the Alfred P. Sloan Foundation for its unprecedented efforts to make the Census project possible. We would also like to thank all contributors organizing OBIS regional and thematic nodes. Finally, we express our deepest appreciation to data providers around the world; without them, the OBIS database would not be possible.
References

Anderson G and Moreno-Sanchez R 2003 Building Web-based spatial information solutions around open specifications and Open Source software. Transactions in GIS 7: 447–66
Ardron J, Dunn D C, Corrigan C, Gjerde K, Halpin P N, Rice J, Vanden Berghe E, and Vierros M 2009 Defining Ecologically or Biologically Significant Areas in the Open Oceans and Deep Seas: Analysis, Tools, Resources and Illustrations. Report to the CBD Expert Workshop on Scientific and Technical Guidance on the Use of Biogeographic Classification Systems and Identification of Marine Areas beyond National Jurisdiction in Need of Protection
Ausubel J H 1999 Toward a census of marine life. Oceanography 12(3): 4–5
Baranski B, Schaeffer B, and Redweik R 2009 Geoprocessing in the clouds. In Proceedings of the Free and Open Source Software for Geospatial Conference, Sydney, Australia
Baranski B, Foerster T, Schäffer B, and Lange K 2011 Matching INSPIRE quality of service requirements with hybrid clouds. Transactions in GIS 15: 125–42
Boyd S W and Keromytis A D 2004 SQLrand: Preventing SQL injection attacks. In Jakobsson M, Yung M, and Zhou J (eds) Applied Cryptography and Network Security 2004. Berlin, Springer Lecture Notes in Computer Science Vol. 3089: 292–302
Costello M J and Vanden Berghe E 2006 ‘Ocean biodiversity informatics’: A new era in marine biology research and management. Marine Ecology Progress Series 316: 203–14
Darwin Core Task Group 2009 Darwin Core. WWW document, http://www.tdwg.org/standards/450
DiGIR 2005 Distributed Generic Information Retrieval (DiGIR). WWW document, http://digir.net/
Duelli P and Obrist M K 2003 Biodiversity indicators: The choice of values and measures. Agriculture, Ecosystems and Environment 98: 87–98
Edwards J L, Lane M A, and Nielsen E S 2000 Interoperability of biodiversity databases: Biodiversity information on every desktop. Science 289(5488): 2312–14
Egevang C, Stenhouse I J, Phillips R A, Petersen A, Fox J W, and Silk J R D 2010 Tracking of Arctic terns Sterna paradisaea reveals longest animal migration. Proceedings of the National Academy of Sciences of the USA 107: 2078–81
FAO 2002 CWP Handbook of Fishery Statistical Standards: Fishing Areas. WWW document, http://www.fao.org/fishery/cwp/handbook/G/en
GBIF 2011 Maps in the GBIF Portal. WWW document, http://data.gbif.org/tutorial/maps
Grassle J F 2000 The Ocean Biogeographic Information System (OBIS): An on-line, worldwide atlas for accessing, modeling and mapping marine biological data in a multidimensional geographic context. Oceanography 13(3): 5–7
Halpin P N, Read A J, Best B D, Hyrenbach K D, Fujioka E, Coyne M S, Crowder L B, Freeman S A, and Spoerri C 2006 OBIS-SEAMAP: Developing a biogeographic research data commons for the ecological studies of marine mammals, seabirds, and sea turtles. Marine Ecology Progress Series 316: 239–46
Halpin P N, Read A J, Fujioka E, Best B D, Donnelly B, Hazen L J, Kot C Y, Urian K, LaBrecque E, Dimatteo A, Cleary J, Good C, Crowder L B, and Hyrenbach K D 2009 OBIS-SEAMAP: The world data center for marine mammal, sea bird, and sea turtle distributions. Oceanography 22: 104–15
Hurlbert S H 1971 The non-concept of species diversity: A critique and alternative parameters. Ecology 52: 577–86
International Hydrographic Organization 1953 Limits of Oceans and Seas (Third Edition). Monte Carlo, International Hydrographic Organization Special Publication No. 23
Lapp H, Morris R A, Catapano T, Hobern D, and Morrison N 2011 Organizing our knowledge of biodiversity. Bulletin of the American Society for Information Science and Technology (Online; available at http://search.proquest.com/docview/870848358?accountid=10598)
Lourie S A and Vincent A C J 2004 Using biogeography to help set priorities in marine conservation. Conservation Biology 18: 1004–20
Magurran A E 2004 Measuring Biological Diversity. Oxford, Blackwell
Magurran A E 2007 Species abundance distributions over time. Ecology Letters 10: 347–54
Mora C, Tittensor D, and Myers R 2008 The completeness of taxonomic inventories for describing the global diversity and distribution of marine fishes. Proceedings of the Royal Society B-Biological Sciences 275(1631): 149–55
Nichols W J, Resendiz A, Seminoff J A, and Resendiz B 2000 Transpacific migration of a loggerhead turtle monitored by satellite telemetry. Bulletin of Marine Science 67: 937–47
Obe R and Hsu L 2011 PostGIS 2.0: The new stuff. In Proceedings of FOSS4G, Denver, Colorado
Open Geospatial Consortium 2011 KML. WWW document, http://www.opengeospatial.org/standards/kml
Redfern J V, Ferguson M C, Becker E A, Hyrenbach K D, Good C, Barlow J, Kaschner K, Baumgartner M F, Forney K A, Ballance L T, Fauchald P, Halpin P, Hamazaki T, Pershing A J, Qian S S, Read A, Reilly S B, Torres L, and Werner F 2006 Techniques for cetacean-habitat modeling. Marine Ecology Progress Series 310: 271–95
Rees T 2003 “C-Squares”, a new spatial indexing system and its applicability to the description of oceanographic datasets. Oceanography 16: 11–19
Roff J C and Evans S M J 2002 Frameworks for marine conservation: Non-hierarchical approaches and distinctive habitats. Aquatic Conservation: Marine and Freshwater Ecosystems 12: 635–48
Schnase J L, Cushing J, Frame M, Frondorf A, Landis E, Maier D, and Silberschatz A 2003 Information technology challenges of biodiversity and ecosystems informatics. Information Systems 28: 339–45
Scholten M, Klamma R, and Kiehle C 2006 Evaluating performance in spatial data infrastructures for geoprocessing. IEEE Internet Computing 10(5): 34–41
Sherman K and Duda A M 1999 Large marine ecosystems: An emerging paradigm for fishery sustainability. Fisheries 24(12): 15–26
Sherman K, Jaworski N A, and Smayda T J 1996 The Northeast Shelf Ecosystem: Assessment, Sustainability and Management. Cambridge, MA, Blackwell Science
Spalding M D, Fox H E, Allen G R, Davidson N, Ferdana Z A, Finlayson M, Halpern B S, Jorge M A, Lombana A, Lourie S A, Martin K D, Mcmanus E, Molnar J, Recchia C A, and Robertson J 2007 Marine ecoregions of the world: A bioregionalization of coastal and shelf areas. BioScience 57: 573–83
Tittensor D P, Mora C, Jetz W, Lotze H K, Ricard D, Vanden Berghe E, and Worm B 2010 Global patterns and predictors of marine biodiversity across taxa. Nature 466(7310): 1098–101
United Nations 1982 Law of the Sea: Part V, Exclusive Economic Zone. WWW document, https://www.un.org/depts/los/convention_agreements/texts/unclos/part5.htm
Vanden Berghe E, Halpin P, Lang da Silveira F, Stocks K, and Grassle F 2010 Integrating biological data into ocean observing systems: The future role of OBIS. In Hall J, Harrison D E, and Stammer D (eds) Proceedings of OceanObs’09: Sustained Ocean Observations and Information for Society (Volume 2). Paris, European Space Agency Publication No WPP-306
Vanderklift M A, Ward T J, and Phillips J C 1998 Use of assemblages derived from different taxonomic levels to select areas for conserving marine biodiversity. Biological Conservation 86: 307–15
Webb T J, Vanden Berghe E, and O’Dor R 2010 Biodiversity’s big wet secret: The global distribution of marine biological records reveals chronic under-exploration of the deep pelagic ocean. PLoS ONE 5: 8
Williams M J, Ausubel J, Poiner I, Garcia S M, Baker D J, Clark M R, Mannix H, Yarincik Y, and Halpin P N 2010 Making marine life count: A new baseline for policy. PLoS Biology 8: 10
Zhang J, Pennington D D, and Michener W K 2007 Performance evaluations of geospatial web services composition and invocation. Transactions in GIS 12: 59–73