The Development of Databases and Database Tools for Forest Canopy Researchers: A Model for Database Enhancement in the Ecological Sciences

Judith Bayard Cushing, Nalini Nadkarni, Lois Delcambre, Keri Healy, Dave Maier, and Erik Ordway

Abstract: Solving biosphere-level problems such as global warming, decreasing biodiversity, and depletion of natural resources requires changes in the way ecological research is carried out. Historically, ecologists have collected and stored their data in individualist ways, making data sharing among collaborators and subsequent data mining difficult. Integrating database technology into the research process makes data sharing across studies and access to analysis tools easier. Unfortunately, significant barriers often prevent effective use of database technology in this domain: a lack of appropriate, domain-specific database tools, community data archives, standard data formats, and common research protocols; limited database expertise among individual researchers; few incentives to contribute data or methods to common archives; and cultural differences between ecologists and computer scientists. In this paper, we describe a model project whose aim is to build a system that will enable one subdiscipline of ecology to overcome some of these barriers, and serve both as a model system for other branches of ecology and as a model project for other ecosystem informatics research.

Section 1. Introduction. Understanding the biosphere and the way human activities affect it nearly always requires the efforts of more than a single investigator, often from different scientific disciplines. The collective analysis of data originally gathered by individuals, but subsequently stored in shared databases, can yield insights beyond inquiry of a single data set [NRC 1995, NRC 1997]. Future advances in the ecological sciences will depend upon the effective management of existing and new information, and database technology will be integral both to the management of ecological information and to the creation of new knowledge. A workshop sponsored by the National Science Foundation, the USGS, and NASA identified the need for a new field of "biodiversity and ecosystem informatics", noting challenges and opportunities for further research such as acquisition, conversion, analysis and synthesis, and dissemination of data and metadata (e.g., digital
libraries, remote sensing, mobile computing, and taxonomies) to a wide range of human users and new biodiversity and ecosystem informatics tools. The data and metadata in question were characterized as highly complex – ontologically, spatio-temporally and sociologically. Workshop participants emphasized the need for establishing deeply interdisciplinary collaboration and broad-based community support for the research agenda. Lack of harmonized protocols, resistance to depositing data and metadata in central repositories, and lack of expertise with informatics tools were noted as all contributing to the limited use of information technology among most ecological researchers [NSF BDEI 2000, http://bdi.cse.ogi.edu/, http://bio.gsfc.nasa.gov/, http://www.ecoinformatics.org ].
We have assessed barriers to using database technology for one emerging sub-field of ecology, forest canopy research, and are building prototype systems that will dissipate these barriers. The forest canopy (Figure 1) is one of the richest but most poorly studied habitats in the biosphere [Moffett 1993, Lowman & Nadkarni 1995]. Canopy-dwelling plants ("epiphytes") constitute up to half the total plant diversity of some wet tropical forests and provide crucial resources for a host of arboreal birds and mammals [Coxson & Nadkarni 1995, Lugo & Scatena 1992]. Interdisciplinary research groups such as the Global Canopy Programme are now coalescing to approach major biosphere-level problems with global canopy research projects.
Figure 1. A Forest Canopy and the Wind River Canopy Crane

With a planning grant from the National Science Foundation, we surveyed 240 canopy researchers [Nadkarni and Parker 1994] to identify the major obstacles that impede forest canopy research. The most commonly cited obstacle was not difficulty of physical access to the canopy (as we had expected), but rather the problems in management, use, and sharing of data sets. This was associated with a lack of uniformity in collecting, processing, and analyzing canopy data, the dearth of data archives, and an inability to link
data for comparative research. The relative youth of the field – with its lack of entrenched methods, legacy data sets, and conflicting camps of competing groups – provides an opportunity to integrate data management and analysis tools into the research process. Because researchers appear openly communicative and supportive of others' work and open to sharing data, we use forest canopy studies as an arena in which to generate database tools. Furthermore, because canopy researchers essentially conduct interdisciplinary ecological research, albeit limited to the forest canopy, our work is generalizable to other fields of ecology.
In this paper, we present our approaches to building a data archive for the community of forest canopy researchers. We also articulate, for computer scientists, the research and development issues that surfaced during this work. As with our previous work, we emphasize not only physical connectivity, but agreement at the semantic level [Maier 1993]. In Sections 2-5, we describe four aspects of the Canopy Database Project, respectively: 1) obstacles that canopy researchers and other ecologists face in using information technology; 2) the organization of our multi-disciplinary project; 3) a reference database we are developing for the immediate benefit of canopy researchers, aka the Big Canopy Database (BCD); and 4) a field database design and warehousing system, aka Emerald. Sections 6-8 conclude the paper with an outline of research issues, plans for future research, and conclusions.
Section 2. Obstacles to Use of Information Technology by Ecologists. Although ecologists often consult web-accessible information, they typically enter data into private data stores that are rarely published or archived. Despite increasing pressure from funding agencies, the availability of several excellent ecological data archives [www.lternet.edu], emerging tools for recording metadata [Nottrott 1999], and even opportunities to publish data in the prestigious Ecological Society of America archives [http://www.esapubs.org/esapubs/archive/archive_main.html], documenting data for archival purposes is still perceived as a time-consuming process and sometimes not even attempted [Michener 1997, Spycher 1996]. Ecologists tend to use flat files, spreadsheets and non-relational databases. Individual researchers rarely use modern database management systems, although some who are good programmers use sophisticated
statistical programs and GIS, or even write complex mathematical models [Michener 1998, Michener 2001].
In this section, we outline the typical steps in ecology research and identify data management system requirements by pointing out bottlenecks for a typical researcher. Figure 2 shows a “typical” ecological research scenario. We represent a linear process for explanation, but note that any phase n can either progress to phase n+1, or return to any previous phase.
[Figure 2 depicts the research phases as a sequence: Study Design, Field Work, Data Entry & Verification, Data Analysis, Data Sharing (within Group), Journal Publication, Data Archive.]

Figure 2. The Ecological Research Process
During the first research phase (study design), canopy ecologists locate and use information about studies that are similar to the one they are designing. They may also want to find scientific articles and field data relevant to the research sites they are considering, or look up accident reports associated with the canopy access equipment under consideration. Researchers at this stage also typically draft preliminary research protocols, articulate hypotheses, and design preliminary field data intake forms. For database use to be feasible, it must be as easy as a traditional spreadsheet, but far more powerful. Because data sets are informally conveyed, researchers who work in small laboratories are often at a disadvantage relative to those who work in teams at large universities.
During the second research phase (field work), a scientist typically uses the research protocols and field intake forms from Phase I. Because field characteristics often differ from those envisioned in the research design, the ecologist often alters the research protocols or intake forms (and hence his or her database schema) in the field. Thus, any database system used in the study design phase must have easy update functionality both for schema and data. Data entry, often into a spreadsheet, is usually a separate step for the
researcher, with data transferred from paper data intake forms to electronic forms sometimes months after data acquisition in the field. There is often little automated data validation according to domain constraints, although a researcher sometimes discovers that some data forms contain erroneous data or are inconsistent with the initial design. The researcher cannot always return immediately to the field to gather new data, so data analysis strategies or even research objectives might have to be rethought, or data extrapolated.
During data analysis (the third research phase), researchers almost always reformat data for analysis or modeling, or place intermediate or final results into data sets. This is a time-consuming process, especially if the scientist is not experienced with the analysis or modeling programs. At this point, the researcher could still discover that key parameters or data are missing, and might decide to return to the field, use another researcher's data (collected for a different purpose), extrapolate or interpolate existing data, or not conduct an originally envisioned analysis. Database technology could make these problems obvious earlier, and facilitate easier data transformation and the sharing of macros in analysis applications.
Subsequent to data verification and preliminary analysis, some laboratories now require researchers to save data sets to a common within-lab data store in formats understood by others in the laboratory, and with some documentation. This avoids potential loss of valuable field data if a researcher leaves the lab. However, documenting data even for others already familiar with the research objectives and protocols is a time-consuming process. Database technology could help facilitate this process, especially in larger laboratories where many researchers focus on related problems and where common research protocols, vocabulary, field collection procedures, and equipment are in effect.
Journal publication is the phase most critical for receiving feedback from peers, advancing the state of the field, and acquiring continued funding. Ecologists need help in preparing manuscripts for publication in several ways, e.g., by easily generating graphics from analysis applications and by gaining access to scientific citations.
Data documentation and validation are major bottlenecks in the data archiving phase for ecologists [Porter 1997, Spycher 1996]. The tasks of deciding which data to archive, writing data descriptions, and validating data against those descriptions seem disconnected from individual researchers' goals. These tasks are usually so overwhelming that they are almost always done at the end of the research cycle – in spite of the fact that data documentation could provide useful artifacts if applied earlier. Archiving data thus becomes a last-minute labor for which ecologists perceive little, if any, reward. Although technology cannot change the sociology of data archiving, applying database technology early in the research cycle could make data documenting considerably easier and data archiving a matter of pushing a button that says "publish this database".
An additional major bottleneck to data archiving is that some scientists remain hesitant to give up their data for others to use. Concerns about being appropriately credited and about proper use of the data are sociological problems being addressed by funding agencies, long-term research sites, and some journals. The Ecological Society of America, for example, now offers to publish peer-reviewed data sets. However, many scientists feel that such publication will not carry the same weight as journal publication in terms of recognition and career promotion. While technology cannot solve these problems, any database system must not only treat security as a serious requirement, but also make it very obvious how to properly use data sets and give credit.
Data mining is now almost always accomplished on a person-to-person basis, with the notable exceptions of weather data, satellite maps, and some data sets published for permanent sites such as the LTERs. Even where data are published electronically, however, there are few data standards in ecology. Thus, two data sets gathered from a single archive are likely to have different data formats, and the data miner must read each of the metadata forms with care to appropriately compare variables.
Section 3. The Canopy Database Project. To address the data management problems of canopy researchers, we organized the Canopy Database Project with a 3-year grant from the National Science Foundation. Our primary goals are to make the use of database technology more attractive with immediate productivity gains and to make data archiving
a natural byproduct of the research process by providing easy ways of collecting metadata early in the research process.
Project objectives are to: 1) carry out and record field studies that include forests of differing structural diversity and generate a range of typical forest and tree structure data; 2) provide a repository for terminology, research protocols, and data encoding schemes for canopy studies; 3) build a web-accessible database to publish forest structure data (e.g., tree height, crown volume, canopy permeability) for diverse forest types, and build tools to integrate functional data (e.g., canopy throughfall, light interception) into that database; and 4) provide other types of data useful to canopy researchers in a web-accessible database (e.g., canopy images, study site descriptions, funding sources, meetings, links to related sites).
Our strategy has been to first gather an interdisciplinary team of three groups: 1) canopy researchers willing to commit time to explore the use of database technology in their research, 2) database system developers, and 3) computer science researchers interested in scientific applications. We also sought collaboration of ecological information managers at the H. J. Andrews LTER site where forest ecological data has been successfully archived for the past 20 years.
We organized the project around five foci: 1) International communication among canopy researchers: Founded in 1994, the non-profit International Canopy Network (ICAN) provides a quarterly newsletter and email bulletin board for canopy researchers, and outreach to lay persons interested in forest canopies [Nadkarni 1995, www.evergreen.edu/ican]. The Global Canopy Programme (GCP), launched in 1999 with funds from the U.S. National Science Foundation and the European Science Foundation, organizes large-scale research efforts. These two groups connect our project to the global canopy research community. Our project in return will provide leadership in data sharing and archiving among international projects. 2) A "model" field study and publication of model data sets: This focus has two components: a) forest canopy fieldwork on forest structure and function in a 1000-yr
chronosequence that produces sample data and highly articulated research protocols replicated at several sites; b) identifying researchers who will contribute data and actively collaborate with other data contributors and with the database project. These “pathfinder projects” contribute reference materials to the BCD (3) and data structures and field data to Emerald (4). 3) The “Big Canopy Database” (BCD): This internet portal (Section 4) provides reference data to the canopy research community, contributes reference tables for Emerald (4), and research trends to the GCP (1). 4)
The Database Design and Warehouse Tool (Emerald): Emerald (Section 5) provides a vehicle for canopy scientists to more easily design, document, archive, and mine field databases. It also contributes reference material such as new citations or site data to the BCD (3) and sample data to our research effort (5).
5) Computer science research: One of the research issues we address is spatial data representation and transformation, a major barrier to data integration for ecological research (Section 7). Emerald is the basis upon which we will develop this spatial infrastructure, and the research effort provides new functionality for Emerald.
The project foci are coordinated to ensure productivity and communication across disciplines. Figure 3 shows how each part of the project provides artifacts to the others.
[Figure 3 depicts the flow of artifacts among the project foci: the pathfinder projects supply new reference material to the Big Canopy Database and data structures and field data to Emerald; the BCD supplies research trends to the Global Canopy Programme and metadata reference tables to Emerald; Emerald supplies new citations, site characteristics, and other reference material back to the BCD and sample data and problems to the spatial database research, which in turn supplies new functionality to Emerald.]

Figure 3. Organization of the Canopy Database Project
Section 4. The Big Canopy Database. This web-accessible database consolidates information and data of concern to forest canopy researchers into a portal web site
[www.evergreen.edu/canopydb ]. Figure 4 shows the types of data included. Rectangles shaded in blue show data that serve as reference in BCD and Emerald; those in green are built and viewed using the Emerald Field Database System.
[Figure 4 depicts the information components of the BCD: research projects, researchers, glossary (keywords), canopy data sets, meetings, training programs, study sites, funding sources, structural templates, analysis programs, research protocols, images, scientific citations, visualization programs, popular citations, and damage & accident reports.]

Figure 4. Information Components of The Big Canopy Database (BCD)
The first prototype BCD was implemented in HTML, but has since been converted to a database-driven application using HTML, ASP, and MS SQL Server. Our long-term (five-year) plan is to enable a team of geographically distributed curators to enter information so that the community can more easily contribute and peer-review reference data. Until that point, the BCD will be maintained by the International Canopy Network.
Because one of the major bottlenecks for researchers archiving ecological data is the time needed to describe data sets, the BCD provides source tables for metadata (field data descriptions). The considerable overlap of content between BCD data and Emerald metadata (Table 1) means that canopy researchers who deposit field data in Emerald can document those databases with a shared vocabulary by using existing reference data. Appendix I shows the conceptual design of the BCD database.
Using an NSF LTER metadata standard (that of the H. J. Andrews LTER [www.fsl.orst.edu/lter/datafr.htm]), we collect metadata in the following four categories: 1. Personnel Information: information about the people involved in the conception, design, implementation, and documentation of a study and its data tables.
2. Project Level Information: overall information about the project, including use constraints and disclaimers for data use. The term project is used to refer to large research programs that involve multiple field study databases. 3. Study (aka Database) Level Information: general information about a study (e.g., title, abstract, purpose, dates, methods, site characteristics). Where a study is not part of a project, the study-level information will also include general information about the study such as use constraints. 4. Entity and Attribute (or Variable) Level Information: detailed information about individual data tables, or files that contain GIS layers or images, i.e., metadata. We term items 1-3 scientific metadata (which scientists typically call metadata). We call item 4 table-level metadata (which computer scientists typically call metadata) to further distinguish it from items 1-3. The first two columns of Table 1 show metadata that researchers supply for Project and Study. The third column, if not blank, indicates which BCD reference table or Emerald function is available as a source for that metadata (a dash indicates a blank cell).

Project Level Metadata | Study Level Metadata | Metadata Source (BCD unless otherwise stated)
Project Name; Description (abstract & purpose) | Study Name; Description (abstract & purpose) | –
Personnel and Role | Personnel and Role | Researchers (Persons)
Start Date, Duration or End Date | Start Date, Duration or End Date | –
Administrative organization | Administrative organization | Administrative and Research Organizations
Funding Organization; Data Access Policy; Distribution Liability | Funding Organization; Access Constraints; Use Constraints | Funding Organizations; Data Access Policies & Proc.; Data Security Policies & Proc.
– | Quality Assurance Procedure; Data Entry Procedure; Data Archive Procedure | –
– | Research & Field Methodology; Measurement Frequency; Update Frequency | Research Methodologies
Related Material & Citations | Related Material & Citations | Citations & Web Sites
Geographic Location(s); Research Sites | Geographic Location(s); Research Site characteristics | Research Sites
Analytical Tools | – | Analysis & Visualization Programs
– | Taxonomy | Taxonomic Table References
– | Category and Keywords | Categories and Glossary
– | Plot Characteristics | Permanent Plots
– | For each entity in the database: Entity Name and Alias; Entity Description | Emerald Data Templates and Field Database
– | For each attribute in each table: Attribute Name; Attribute Description; Data Type [Units, Precision, Instrument] [name of code list]; Data Range | Emerald Data Templates and Field Database
– | For each enumerated code list: Name; Data Type; Data Elements | Emerald Data Templates
Table 1. Correspondence between BCD Data and Emerald Metadata

Section 5. Emerald – Ecological Database Design, Warehousing and Query. Emerald is a web-accessible database system designed for canopy researchers to integrate database technology more easily into their research [www.evergreen.edu/canopydb]. Its goals are to help scientists increase research productivity, simplify sharing data with close collaborators, and facilitate data archiving. We aim to make metadata acquisition a natural byproduct of the research process, provide data transformation tools, and make archiving as easy as pushing the "publish my data" button. Key assumptions are: 1) metadata articulation and data validation are onerous when they occur at the end of the research cycle, and 2) integration of metadata with field data earlier in the research cycle will improve researcher productivity by improving the quality and flexibility of the research data.
[Figure 5 depicts Emerald's software components: database design drawing on BCD metadata reference tables and Emerald templates; distributed (off-line) field databases that the researcher extends, populates, and analyzes; and validation and upload of field data to the Master Field Database (field data plus metadata).]

Figure 5. Emerald's Software Components
Figure 5 shows Emerald's software components. Our strategy for making field databases easier to document and comparable to other field databases is to provide building blocks for database design and to use metadata source tables (from the BCD described above). We will accomplish the former by allowing the reuse of commonly recurring domain-specific data structures (what we call "templates") as the building blocks for new databases, for importing data into a warehouse, and for composing cross-study queries. Field databases designed with Emerald would first be used in "single user" mode on a private workstation during fieldwork and analysis.
Emerald complements, not replaces, existing data archives such as the canopy crane site databases and the LTER repositories. Emerald differs from the canopy crane research site databases in that we provide information that spans several sites (e.g., a common list of researchers, lists of canopy research sites with links to the sites themselves). We differ from the LTER repositories in that we provide help in research design and database design and that we specialize these services for one specific research community (forest canopy research). Thus, for example, our citations and our researcher databases are not limited to projects whose data are stored in our database, but are meant as community-wide references. We remain compliant with the metadata requirements for data deposition at the LTER sites. Researchers using Emerald will be able to seamlessly archive data at any LTER site.
Section 5a. Emerald Functional Requirements. The system's functional requirements are articulated below. System design documents and a current development schedule are available at http://scidb.evergreen.edu/cdbdocs-public/index.html.
1. Field database repository. Emerald is a metadata repository for canopy research projects. It is modeled after that of the HJA LTER, with the capability of searching and viewing study metadata and field data, and downloading data sets for analysis. Security and privacy features allow a researcher flexibility: to publish metadata only (no field data), or make field data available to selected colleagues, or viewable but not downloadable. Scientists may also forbid release of personal information.
2. Field database design and implementation. The process of adding a project and study to the database is simple. A database archivist registers a Principal Investigator (PI) and creates a new project, with that researcher as the PI and possibly a second person as data archivist. The PI or data archivist then adds other project researchers and creates studies for that project. For each individual study, a database is designed by clicking on data templates and optionally identifying some parts of the BCD as source tables. Emerald provides some database refinement, such as choosing which attributes are included, supplying attribute-level metadata (e.g., data range values), and prepopulating the database with existing site or study data (a sketch of such attribute-level metadata appears after this list). The database can be downloaded as a SQL (or Excel) database, and a populated field database later uploaded to the repository. Validation against metadata will be accomplished via software and services from an LTER data center.
3. Field database warehouse and data mining. This part of the system will integrate field studies into a single warehouse and enable cross-study data queries. We currently distinguish repository and warehouse features. The repository includes all field databases, as uploaded by the researcher. Databases in the repository can be searched via study-level metadata (e.g., "find all the studies conducted at the Wind River Canopy Crane Research Facility by Bob Van Pelt"), and the contents of those databases viewed individually. The virtual warehouse includes only those parts of the field databases whose sub-schema matches Emerald data templates. Data in the warehouse can be queried by using data templates (e.g., "find the average diameter for all trees that have height greater than 20 m"), or a combination of data
templates and metadata (e.g., "find the average diameter for all trees of height greater than 20 m for studies in the U.S. Pacific Northwest"); a sketch of such a cross-study query also appears below.
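The attribute-level metadata captured in step 2 might, for instance, be stored roughly as follows. This is a minimal sketch, not Emerald's actual schema; all table and column names (AttributeMetadata, Tree, MinValue, and so on) are hypothetical.

```sql
-- Hypothetical sketch (table and column names are illustrative, not Emerald's schema):
-- attribute-level metadata recorded at design time, later used to validate uploaded field data.
CREATE TABLE AttributeMetadata (
    StudyID       INT          NOT NULL,
    EntityName    VARCHAR(64)  NOT NULL,   -- e.g. 'Tree'
    AttributeName VARCHAR(64)  NOT NULL,   -- e.g. 'Height'
    Description   VARCHAR(255),
    DataType      VARCHAR(32),             -- e.g. 'FLOAT'
    Units         VARCHAR(32),             -- e.g. 'm'
    MinValue      FLOAT,                   -- lower bound of the valid data range
    MaxValue      FLOAT,                   -- upper bound of the valid data range
    CodeListName  VARCHAR(64),             -- name of an enumerated code list, if any
    PRIMARY KEY (StudyID, EntityName, AttributeName)
);

-- A range check that flags out-of-range tree heights in an uploaded field database.
SELECT t.StudyID, t.StemID, t.Height
FROM   Tree t
JOIN   AttributeMetadata m
       ON  m.StudyID       = t.StudyID
       AND m.EntityName    = 'Tree'
       AND m.AttributeName = 'Height'
WHERE  t.Height < m.MinValue OR t.Height > m.MaxValue;
```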
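Similarly, a cross-study warehouse query such as the Pacific Northwest example in step 3 could be expressed against template-derived tables roughly as follows; Tree, StudyMetadata, and their columns are again hypothetical names used only for illustration.

```sql
-- Hypothetical sketch of a cross-study warehouse query (names are illustrative).
-- "Find the average diameter of all trees taller than 20 m for studies in the U.S. Pacific Northwest."
SELECT AVG(t.Diameter) AS AvgDiameter
FROM   Tree          t                         -- tree (stem) template data, integrated across studies
JOIN   StudyMetadata s ON s.StudyID = t.StudyID -- study-level metadata
WHERE  t.Height > 20.0                          -- metres, per the template's unit metadata
AND    s.Region = 'US Pacific Northwest';
```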
Section 5b. Emerald Data Templates. A template represents data collected when measuring a particular physical object in the real world (e.g., a tree or branch) and appears to users as a conceptual database primitive – a domain-specific data type. To a computer scientist, templates are collections of variables, each grouped as one or more relational tables, that can be composed into an end-user defined database. To a canopy researcher, templates are mechanisms for designing field databases by using conceptual structures. One template table (within a given template) might be defined to be related to another table within the same template. When more than one template is chosen, appropriate relationships between the template tables are induced. Templates carry with them table-level metadata that can later be exploited for validation, archiving and query, but are transparent to the end user.
Figure 6 shows a typical collection of templates a scientist might include in a database. For example, a researcher might collect data about epiphytes as percent cover per branch quadrat for each of the epiphyte species under consideration. To build a database for this study, the researcher first uses a template for branch "quadrat". Data collected for each quadrat are the epiphyte species and the percent cover on the branch quadrat for that species, as well as the date of the observation. Since the data of interest are located on a tree branch, the researcher must also collect information about branches and trees, and hence includes a branch-template and a tree-template [1] in her database. The table tree-plus holds an additional measurement on tree (not in the tree-template) that this researcher will make. Because each tree must be located in space and that space described, the researcher will be asked to include plot- and site-templates. (A sketch of the kind of schema such a composition might generate follows Figure 6.)

[1] What we call a tree-template in this paper is called a stem-template in Emerald, as per canopy researcher conventions. For simplicity of explanation, we used the term "tree".

[Figure 6 depicts the composed schema: site, plot, tree (with the 1:1 auxiliary table tree_plus), branch, and quadrat tables linked by 1:m relationships and keyed by studyID, siteID, plotID, stemID, branchID, and quadratID; the quadrat table records epiphyte species, percent cover, and observation date.]
Figure 6. A Scientist's Conceptual View of Tree-Branch-Quadrat Templates
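For illustration, the kind of relational schema that composing the site, plot, tree, branch, and quadrat templates might generate is sketched below. This is our reading of Figure 6, not Emerald's generated DDL; the names, data types, and exact attribute placement are assumptions.

```sql
-- A minimal sketch of the schema a site/plot/tree/branch/quadrat template composition
-- might generate (names and attributes are illustrative, not Emerald's actual output).
CREATE TABLE Site (
    SiteID    INT PRIMARY KEY,
    Name      VARCHAR(64),
    Latitude  FLOAT,
    Longitude FLOAT
);
CREATE TABLE Plot (
    StudyID INT, SiteID INT, PlotID INT,
    PRIMARY KEY (StudyID, SiteID, PlotID),
    FOREIGN KEY (SiteID) REFERENCES Site(SiteID)
);
CREATE TABLE Tree (                  -- the "stem" template
    StudyID INT, SiteID INT, PlotID INT, StemID INT,
    Species VARCHAR(64), X FLOAT, Y FLOAT, Height FLOAT, Diameter FLOAT,
    PRIMARY KEY (StudyID, SiteID, PlotID, StemID),
    FOREIGN KEY (StudyID, SiteID, PlotID) REFERENCES Plot(StudyID, SiteID, PlotID)
);
CREATE TABLE TreePlus (              -- the researcher's additional per-tree measurement (1:1 with Tree)
    StudyID INT, SiteID INT, PlotID INT, StemID INT,
    ObsDate DATE, ExtraMeasurement FLOAT,
    PRIMARY KEY (StudyID, SiteID, PlotID, StemID),
    FOREIGN KEY (StudyID, SiteID, PlotID, StemID)
        REFERENCES Tree(StudyID, SiteID, PlotID, StemID)
);
CREATE TABLE Branch (
    StudyID INT, SiteID INT, PlotID INT, StemID INT, BranchID INT,
    ObsDate DATE, Aspect FLOAT, Length FLOAT,
    PRIMARY KEY (StudyID, SiteID, PlotID, StemID, BranchID),
    FOREIGN KEY (StudyID, SiteID, PlotID, StemID)
        REFERENCES Tree(StudyID, SiteID, PlotID, StemID)
);
CREATE TABLE Quadrat (
    StudyID INT, SiteID INT, PlotID INT, StemID INT, BranchID INT, QuadratID INT,
    ObsDate DATE, EpiphyteSpecies VARCHAR(64), PercentCover FLOAT,
    PRIMARY KEY (StudyID, SiteID, PlotID, StemID, BranchID, QuadratID),
    FOREIGN KEY (StudyID, SiteID, PlotID, StemID, BranchID)
        REFERENCES Branch(StudyID, SiteID, PlotID, StemID, BranchID)
);
```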
Appendix II shows a conceptual view of how measurement data is organized within Emerald, which is somewhat more abstract than what the scientist would see with an Emerald-generated database.
Once a researcher has collected field data and is ready to publish or share the database, the database will be checked against the standard template tables for any schema changes and integrated into our data warehouse, aka the “Master Field Database”. The master field database schema is the database schema that would be generated if one were to include every template in a database design.
We have designed data templates for site, plot, stem (aka tree), branch, and observation variables. The observation variable template allows the definition of generic forest functional observations such as light, temperature, and percent interception of rainfall at a particular location or on a forest object. These can be instantiated to hold data values for field observations such as percent stem cover of a particular component (e.g., mistletoe), or percent branch cover of any number of canopy-dwelling species. Variables can be grouped according to whether they are measured several times, and whether they are measured and displayed as a group. (A sketch of such an instantiation follows.)
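A minimal sketch of how such a generic observation-variable template might be instantiated appears below; the ObservationVariable and Observation tables and their columns are hypothetical and do not reflect Emerald's actual design.

```sql
-- Hypothetical sketch: instantiating a generic observation-variable template.
CREATE TABLE ObservationVariable (
    VariableID   INT PRIMARY KEY,
    Name         VARCHAR(64),    -- e.g. 'PercentMistletoeCover'
    TargetEntity VARCHAR(32),    -- the object measured, e.g. 'Tree' or 'Branch'
    Units        VARCHAR(32),    -- e.g. 'percent'
    GroupName    VARCHAR(64)     -- variables measured and displayed together
);
CREATE TABLE Observation (
    ObservationID INT PRIMARY KEY,
    VariableID    INT REFERENCES ObservationVariable(VariableID),
    StudyID INT, SiteID INT, PlotID INT, StemID INT,   -- location keys from the structural templates
    ObsDate DATE,
    Value   FLOAT
);

-- Instantiate the template for "percent stem cover of mistletoe".
INSERT INTO ObservationVariable (VariableID, Name, TargetEntity, Units, GroupName)
VALUES (1, 'PercentMistletoeCover', 'Tree', 'percent', 'StemCover');
```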
Our current templates are derived from field databases for seven independent studies for which we designed databases. The studies range from those carried out by individual researchers working independently at separate sites to a group of very close collaborators using the same set of equipment and research sites. These data represent a diversity of data types – including structural data, biodiversity data, and functional data.
Section 5c. Emerald Development Issues. Emerald is implemented in Microsoft SQL Server, Java, JDBC, and Enhydra. Four open technology (implementation) issues remain:
1) Template representation and export (SQL, XML, or Java), including the articulation of rules for template composition. Templates are currently represented as XML scripts, with annotations about how variables are grouped into tables and how those tables are related. A collection of XML variables is subsequently mapped to SQL tables and to Java objects. As we articulate templates, we also develop rules for how they can be composed into database designs for field databases. We will include heuristics for suggesting additional data collection that would expand the applicability of that data set.
2) Development platform. In our first prototype, we found HTML/ASP/SQL Server technology too brittle for a flexible user interface. We therefore chose HTML/Java/Enhydra/SQL Server technology for the second version, but the web-accessibility requirement still limits user interface functionality.
3) Visualization of the end-user database design to the ecologist. The appropriate level of abstraction at which to present the templates, the subsequent database design, and the field database to the ecologist is still an open user interface question. Database entities are more abstract, with a greater degree of normalization, than most researchers prefer. We provide high-level views for the SQL Server database, but the extent to which these views will be physically implemented as non-normalized Access tables (and Excel spreadsheets) will be a matter of experimentation with users.
4) Implementation of the data warehouse. One strategy for building the "master field database" is to upload the parts of a field database that match template structures into an integrated database, and to have a separate repository of individual field databases each in its entirety. Rather than duplicate field databases, we are collaborating with Eric Simon's Caravel research team to build a virtual warehouse using Le Select [http://www-caravel.inria.fr]. This solution involves wrapping each
individual field database to articulate where its schema matches the templates, writing specialized client software to build user queries, and using Le Select to run cross-study queries. We have implemented a prototype Le Select client tool that allows a user to compose queries across several templates and then executes those queries across several field databases. (A conceptual sketch of such a cross-study query follows.)
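Conceptually, the virtual warehouse answers a template query by combining the template-conformant portions of the wrapped field databases. The sketch below illustrates the idea with an ordinary SQL view over two hypothetical per-study databases (Study17, Study42); it does not show Le Select's wrapper or client interfaces.

```sql
-- Illustrative sketch only: the actual system uses Le Select wrappers, not a static view.
-- A cross-study template query is conceptually a query over the union of the
-- template-conformant portions of each wrapped field database.
CREATE VIEW WarehouseTree AS
    SELECT 'Study17' AS StudySource, StemID, Height, Diameter FROM Study17.dbo.Tree
    UNION ALL
    SELECT 'Study42' AS StudySource, StemID, Height, Diameter FROM Study42.dbo.Tree;

-- The warehouse query is then posed against the combined view.
SELECT AVG(Diameter) AS AvgDiameter
FROM   WarehouseTree
WHERE  Height > 20.0;
```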
Section 6. Research issues. We have identified three major computer science research areas relevant to other ecosystem databases. These are consonant with those identified in the June 2000 Workshop on Biodiversity and Ecosystem Informatics.
1. Data warehouse. Data warehouse problems are well documented and an active research area [Bernstein 2000], and many of the same problems of schema integration and migration that plague business systems apply to ecosystem informatics. How do we get data from different field studies into one repository? Should that repository be physically centralized and integrated, or distributed? How can we ensure that data match the warehouse schema and are adequately described? How can the syntactic and semantic aspects of the templates (i.e., the global schema) be maintained in a flexible manner befitting researchers? How can we accommodate data that do not fit the existing global schema (templates)? How can we provide a domain-specific query capability over the data warehouse that ecologists find easy to use [http://knb.ecoinformatics.org]?
2. Data and program interoperability, data transformation, model management and workflow. Scientific databases share with traditional database systems the problems of data and program interoperability [Maier 1996]. Many of the tedious data management and analysis tasks researchers perform as they move through the research cycle are less idiosyncratic than the current use of hand-crafted methods suggests, and it seems possible to devise tools to transform data from one physical form to another. Furthermore, describing and cataloging both application programs and application macros more formally would support application invocation, a significant help for researchers. This could lead to tools that better coordinate multiple tasks performed by different processing entities, and reuse common workflow components.
3. 3D spatial data representation and transformation. Though relevant to the data warehouse and interoperability issues, the problem of spatial data representation deserves attention in its own right. This topic is addressed in Section 7.
Section 7. Future Work. Our future work on the Canopy Database Project will focus on three areas: continued development of the BCD and Emerald, computer science research to build a spatial infrastructure for ecologists, and the outreach needed to deploy a self-sustaining system. For development, future tasks are to: refine and field-test data templates and figure out how to manage their evolution over time; field-test and improve the user interface; build better tools for supporting research activities such as data validation in the field, visualization, and analysis; create an integrative data warehouse to perform cross-study queries; and increase the number of field studies in the warehouse and the scope of the metadata source tables.
We will focus our major database research efforts on developing spatial infrastructure for ecology. In this domain, most research is currently accomplished through actual measurement, observation, and study of objects in the physical world. In addition, remote sensing and satellite devices are increasing the capacity of the ecologist to gather data. These new kinds of data will increase the need for ecologists to work at different spatiotemporal scales, but will not change the need for traditional field work, nor a fundamental aspect of ecological data: the location of phenomena studied is of paramount importance. Determining the location of a physical object in the real world, and its relative position with respect to other objects of interest, is critical for the author of a data set and to others undertaking cross study and interdisciplinary analysis.
Although it may seem obvious that determining the precise location of any object simply requires a GPS, our experience has convinced us to the contrary. Because of difficulties in conducting field work, nearly all measurements are taken relative to a very local point of reference. In-depth consideration of the geometry and spatial aspects of ecological data (to meet current and future needs) is not a primary focus of the individual scientist, who typically focuses on data collection and analyses to answer one or a few focused
questions. However, failure to capture even one simple measurement can prevent a data set from being properly placed in three dimensions, and could defeat the use of various tools or prevent investigation of unanticipated questions in cross-study correlations. While there is no need for every scientist to reinvent the spatial structures and transformations needed for their questions, individual researchers often spend much time performing spatial transformations, a task that is at best time-consuming and at worst error-prone. We believe that better database infrastructure for manipulating and relating spatial objects would increase the productivity of ecology researchers and of interdisciplinary teams, allow better integration of new kinds of data, and enable more facile data mining and spatial scaling. We have dubbed our nascent infrastructure The Spatial Harness.
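As a concrete illustration of the kind of transformation involved, the sketch below converts tree positions recorded as a distance and compass azimuth from a local reference point into absolute projected coordinates. The TreeLocation and ReferencePoint tables and their columns are hypothetical; the point is simply that one missing field (here, the azimuth) makes the transformation impossible.

```sql
-- Hypothetical sketch: converting locally referenced field measurements to absolute coordinates.
-- Assumes each tree's position was recorded as a distance (m) and compass azimuth (degrees)
-- from a reference point whose projected easting/northing are known.
SELECT t.StemID,
       r.Easting  + t.DistanceM * SIN(RADIANS(t.AzimuthDeg)) AS TreeEasting,
       r.Northing + t.DistanceM * COS(RADIANS(t.AzimuthDeg)) AS TreeNorthing
FROM   TreeLocation   t
JOIN   ReferencePoint r ON r.RefPointID = t.RefPointID;
```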
To establish the canopy database in the real world, with real users, several very practical steps must be taken, many of which are sociological. We identify three tasks: a) establishing a critical mass of users and data, b) recruiting and training volunteer curators, and c) finding long term funding for the operational system. We will work with the Global Canopy Programme, ICAN, and funding agencies to organize the canopy community and help scientists publish data, data models, common vocabulary, dictionaries and “translation guides”. We will seek endowment funds and perhaps a scheme for chargeback to users for the database, and provide value-added services to the GCP foundation (e.g., tracking research proposals and projects).
Section 8. Conclusions. Although the community of canopy researchers is relatively small, canopy research is a maturing field that brings together scientists trained in a variety of disciplines. Canopy researchers are open to sharing data and many are willing to try new technology, so the field could serve as an appropriate model for other emerging fields of ecology. Database tools could help speed the journey of this emerging field to a more mature stage. However valuable databases might be in bringing fields to maturity, scientists will not use them unless those databases increase research productivity. It is thus critical to develop flexible database tools that can easily be specialized for a wide variety of ecological studies, where scientists see that efforts to use database technology will pay off.
While it is impractical to define and maintain a global schema for ecology, we predict that a relatively small number of domain-specific data structures (templates) could be used to construct individualized field databases, and that several such databases would be considerably more comparable than idiosyncratically designed databases. Generating field databases with domain-specific tools that document the resulting databases facilitates two critical productivity gains: 1) data set validation and archiving, and 2) cross-study queries for sophisticated analysis, modeling and visualization.
Building an interdisciplinary informatics team and focusing that team on problems in a specific discipline is difficult, but it is essential for computer scientists to work on real-world problems and for ecologists to attempt real work with prototype systems. Working on such a project requires considerable patience and investment because of differences in language and culture and the extra work needed to produce additional data or documentation. In this domain, it is not enough for computer scientists to work on "toy" problems that have "clean" properties. Nor does it suffice for ecologists to simply conduct their research without articulating terms and procedures. The database systems needed by ecologists cannot be produced with existing technology; however, the research needed to deliver that technology requires ecologists and computer scientists working together, alternating between development that uncovers and defines problems, research that solves those problems, and further work that "field-tests" that research. While the resulting database research may not fit cleanly into traditional molds, it is clear that without such integrative work, ecologists will not have the kinds of systems they need.
In summary, we have created database tools to help ecologists improve research productivity. We have enhanced their ability to share results of individual field studies across projects of larger theoretical, social, and economic impact. While researcher productivity is a necessary condition for ecologists to use database tools, it may not be sufficient. Integrating systems such as those we propose into the ecological research cycle will involve some changes in the way ecology is practiced, and, as mentioned tangentially in this paper, rewards for archiving data sets are not generally perceived to be in place. Although such sociological changes are beyond the scope of the work of the Canopy
Database Project, our work has suggested that both ecologists and computer scientists will play roles as change agents as these rewards are introduced into the scientific arena.
Acknowledgements. We acknowledge the contributions of programmer James Tucker, and of Bonnie Moonchild and Jay Turner of Starling Consultants. Research technician Steve Rentmeester performed the invaluable work of redesigning researcher spreadsheets into databases and translating these for the database developers. Bob Van Pelt demonstrated considerable foresight in research design for the 1000-Year Chronosequence. Brook Hatch, Neil Honomichl, Peter Boonekamp and Mike Ficker (all undergraduate computer science students at Evergreen) contributed significantly to the development effort. The Evergreen State College, the International Canopy Network staff and the Global Canopy Programme (especially Andrew Mitchell) provided exceptional support. The information managers Don Henshaw and Gody Spycher of the NSF LTER H. J. Andrews site, as well as its former director Susan Stafford, provided considerable help with metadata standards and our understanding of ecological information management. David Shaw, Jerry Franklin and Ken Bible of the Wind River Canopy Crane Research Facility deserve special mention for providing not only data and encouragement but also the venue for numerous presentations and meetings with other researchers. We would like to thank the many ecology researchers (including, but not limited to, Barbara Bond, Mark Harmon, Betsy Lyons, Hiroaki Ishii, Robert Mutzelfeldt, Roman Dial, Steve Sillett, Akihiro Sumida) who have contributed data and ideas to our effort. Finally, we would like to thank computer scientists Eric Simon and his staff at the INRIA Caravel Project, Dennis Shasha of New York University, and Phil Bernstein of Microsoft for valuable feedback on our work. This work has been supported by National Science Foundation grants BIR 9975510, BIR 9630316, BIR 9300771, and INT 9981531.
References

Bernstein, P. A. and E. Rahm. 2000. Data Warehouse Scenarios for Model Management. ER2000 Conference Proceedings. Springer-Verlag, pp. 1-15.

Coxson, D. & N. Nadkarni. 1995. Ecological roles of epiphytes in nutrient cycles of forest ecosystems. Pages 495-546 in M. Lowman & N. Nadkarni, eds. Forest Canopies. Academic Press, San Diego, California.

Lowman, M. & N. Nadkarni. 1995. Forest Canopies. Academic Press, San Diego.

Lugo, A. & F. Scatena. 1992. Epiphytes and climate change research in the Caribbean: a proposal. Selbyana 13: 123-130.

Maier, D., J. B. Cushing, T. Keller and T. Marr. 1996. Proxies in Practice: Object Architectures for Distributed Computational Workbenches. Journal of the Brazilian Computer Society 3-1.

Maier, D., J. B. Cushing, et al. 1993. Object Data Models for Shared Molecular Structures. In Computerized Chemical Data Standards: Databases, Data Interchange, and Information Systems, R. Lysakowski (ed). STP 1214, ASTM.

Michener, W., J. Brunt, J. Helly, T. Kirchner and S. Stafford. 1997. Non-spatial metadata for the ecological sciences. Ecological Applications 7:330-342.

Michener, W., J. H. Porter, and S. Stafford (eds). 1998. Data and Information Management in the Ecological Sciences: A Resource Guide.

Michener, W. and J. Brunt (eds). 2001. Ecological Data – Design, Management and Processing. Blackwell Science Methods in Ecology Series.

Moffett, M. 1993. The High Frontier: Exploring the Tropical Rain Forest Canopy. Harvard Univ. Press, Cambridge, Massachusetts.

Nadkarni, N. & G. Parker. 1994. A profile of forest canopy science and scientists – who we are, what we want to know, and obstacles we face: results of an international survey. Selbyana 15:38-50.

Nadkarni, N., G. Parker, E. D. Ford, J. Cushing, & C. Stallman. 1995. The canopy research network: a pathway for interdisciplinary exchange of scientific information on forest canopies. Northwest Science.

Nadkarni, N. & J. Cushing. 1995. Final report: designing the forest canopy researcher's workbench: computer tools for the 21st century. Intl. Canopy Network, Olympia, WA. 50 pp.

National Research Council. 1995. Finding the Forest for the Trees: The Challenge of Combining Diverse Environmental Data – Selected Case Studies. National Academy Press, Washington, D.C. 129 pages.

National Research Council. 1997. Bits of Power: Issues in Global Access to Scientific Data. National Academy Press, Washington, D.C. 235 pages.

National Science Foundation. 2001. Maier, D., E. Landis, J. Cushing, A. Frondorf, A. Silberschatz, M. Frame, J. Schnase (eds). Report on an NSF, USGS, NASA June 2000 Workshop on Biodiversity and Ecosystem Informatics. http://bio.gsfc.nasa.gov.

Nottrott, R., M. B. Jones, and M. Schildhauer. 1999. Using XML-Structured Metadata to Automate Quality Assurance Processing for Ecological Data. Proceedings of the Third IEEE Computer Society Metadata Conference. IEEE, Bethesda, MD.

Porter, J. H., D. L. Henshaw, and S. Stafford. 1997. Research Metadata in Long-Term Ecological Research (LTER). IEEE Metadata Conference.

Spycher, G., J. B. Cushing, D. L. Henshaw, S. G. Stafford, and N. Nadkarni. 1996. Solving Problems for Validation, Federation, and Migration of Ecological Databases. EcoInforma.
Appendix I. The Canopy Database Conceptual Model.
Appendix II. Conceptual Model for Emerald Field Measurement Data Structure.