Research Coordination Network Workshop I Report
27-28 August 2015, University of Hawai‘i at Mānoa, Honolulu, HI
Report: 24 November 2015
Table of Contents

Executive Summary
ECOGEO – a community focused on solutions
Workshop Goals
Workshop Outcomes
  I. Summary of workshop structure
  II. Grand ‘omics science challenges
  III. Specific cyberinfrastructure needs
  IV. Leveraging and expanding existing infrastructures
  V. Enabling and encouraging big data best practices
Recommendations: Building Environmental ‘Omics Infrastructure for Earth Sciences
Upcoming events
Appendices
  Appendix I: Workshop I Agenda
  Appendix II: Participant List
  Appendix III: Participant Use Cases and Use Case Template
  Appendix IV: Community Survey and Summary

Resources
ECOGEO RCN website: http://earthcube.org/group/ecogeo
Workshop website: http://cmore.soest.hawaii.edu/rcn2015/
Workshop agenda: http://cmore.soest.hawaii.edu/rcn2015/agenda.htm (also Appendix I)
Workshop webinar: https://vimeo.com/uhcmore/review/138035693/5223a63a63
Environmental ‘omics Resource Viewer: http://pivots.azurewebsites.net/ecogeo.html

Workshop conveners: Edward DeLong, Elisha Wood-Charlson, ECOGEO Steering Committee
Report prepared by: Edward DeLong, Elisha Wood-Charlson, and ECOGEO Workshop I participants (Appendix II)
Executive Summary

The aim of ECOGEO’s first workshop was to enable domain scientists and cyberinfrastructure experts to collaboratively discuss grand challenges in ‘omics science (Outcome II) and explore use cases that translate those challenges into cyberinfrastructure needs (Outcome III). The group also worked to outline existing resources, brainstormed how to best leverage and expand those resources (Outcome IV), and discussed ways to better establish best practices in the community (Outcome V). The workshop hosted over 50 participants from more than 20 universities, three national labs, four cyberinfrastructure centers, two NSF funded data resources, and three EarthCube funded projects. At the conclusion of the workshop, participant comments were unified and optimistic about how best to move forward as a community. Scientists and cyberinfrastructure experts were able to identify common ground and consensus on how to address some of the core ‘omics science challenges. This report summarizes the discussions and synthesis activities that generated the following recommendations, aimed at overcoming cyberinfrastructure challenges in environmental ‘omics research. In summary, the community’s recommendations include:

I. EarthCube-based solutions should integrate science drivers and challenges (e.g. end user discussions and use case scenarios) with technological and engineering solutions via continuous, iterative discussion/development cycles, from start to finish.
II. Existing ‘omics-oriented cyberinfrastructures should be extensively leveraged and integrated into EarthCube systems, with further development and support, to better meet current and future needs of the ‘omics community.
III. Data centers, databases, and analytical tools that address issues of data discovery, scalability, and community HPC access should be further developed.
IV. Data visualization and statistical analysis frameworks should be integrated into standard ‘omics analysis workflows and software.
V. The cyberinfrastructure should enable and encourage “big data” best practices and standards for the community.
VI. The ‘omics and cyberinfrastructure communities should enable and provide a platform for future ‘omics research via streamlined, accessible, state-of-the-art education, training tools, and best practices.

This report aims to provide a foundation of community support for a federated platform of interoperable cyberinfrastructures for oceanography and geobiology environmental ‘omics research. With support for improved cyberinfrastructure, interdisciplinary collaborations, and best practices and training, the ‘omics community (domain scientists and cyberinfrastructure experts) is very well positioned to move environmental ‘omics research into the future.
ECOGEO – a community focused on solutions

The EarthCube Oceanography and Geobiology Environmental ‘Omics (ECOGEO) project is a two-year, NSF funded Research Coordination Network (RCN) designed to bring together domain and cyberinfrastructure scientists and engineers with the goal of articulating the needs, challenges, solutions, and software and hardware infrastructures required to enable and advance current and future ‘omics research in the Geosciences, and in particular Ocean Science. The ECOGEO research community spans an array of disciplines, but is united in developing and applying ‘omics technologies and bioinformatic approaches to address core questions relating to the interplay of biological, geological, and chemical processes. Investigations range from high-throughput sequencing of microbial community DNA to assess taxon, gene, and metabolic pathway distributions across samples (metagenomics), to monitoring the expression of genes and/or proteins in a variety of environmental settings (metatranscriptomics and proteomics, respectively), to measuring the distribution and significance of metabolites and lipids in organisms and the environment (metabolomics, lipidomics). These methods enable researchers in biological oceanography, biogeochemistry, organic geochemistry, microbial oceanography, and geobiology to explore and inter-relate the biological, geological, and chemical (biogeochemical) world in hitherto unprecedented depth and detail. This general approach requires considerable computational hardware and software infrastructure, which relies on high performance computing, advanced networking and database capabilities, collaboration with computer scientists, bioinformaticians, software engineers, and computational biologists, and interdisciplinary support from both government and private funding agencies and foundations.

Overall goals
- Create and sustain a strategic network and community of field and cyber scientists to explore new facets of ‘omics data.
- Articulate needs, challenges, and practical solutions that address: 1) development of cyberinfrastructure, 2) integration and implementation of workflows, and 3) database and resource sustainability to support ocean and geobiology environmental ‘omics research.
- Develop a community-based framework that integrates best practices for sharing, curation, and analysis of ‘omics data, with associated “metadata”, and facilitates collaboration and training among the environmental microbiology, geobiology, and computer science disciplines.

Workshop Goals

Highlight core science and technology drivers for research using environmental ‘omics: The field of environmental ‘omics requires close collaboration between domain science and technology/cyberinfrastructure. Therefore, one of ECOGEO’s main workshop goals was to discuss key drivers from the perspective of current challenges and cyberinfrastructure needs. These drivers were identified during review of the original end-user workshop documents, participants’ use cases developed for this workshop (Appendix III), and community feedback from the ECOGEO survey (Appendix IV).
Identify solutions: Based on these challenge-focused science and technology drivers, the workshop participants also discussed ways to leverage existing resources and described gaps where new solutions could be built.

EarthCube Context

The workshop leveraged numerous EarthCube resources to enhance our discussions. Several active members of EarthCube governance and funded projects were in attendance, including (alphabetical by last name, also available in Appendix II) Emma Aronson, Science Committee; Basil Gomez, Leadership Council; Danie Kinkade, Leadership Council; Ouida Meier, CReSCyNT RCN; Ken Rubin, Science Committee; Elisha Wood-Charlson, Engagement and Liaison Teams, Science Committee, ECOGEO RCN; and Ilya Zaslavsky, GEAR Conceptual Design, Technology and Architecture Committee, CINERGI Building Block. In addition, ECOGEO has contributed, and will continue to contribute, to EarthCube’s vision. The second year of our RCN will focus on integrating the ‘omics community and our collective resources into the broader EarthCube infrastructure, with the goal of creating sustainable contributions through future EarthCube funded projects. Thus far, we have compiled a dozen domain science use cases, which will be refined at a follow-up working group meeting in early 2016 (see Path Forward), and we have been actively engaged with CINERGI to expand our environmental ‘omics resource viewer. Throughout year 2, ECOGEO will continue to have representation in EarthCube governance, including revision of best practices and core documents, as well as the development of recommendations for future EarthCube funded projects. Finally, we will communicate outcomes from our workshops to the EarthCube community through dissemination of reports, and continue to inform the broader ‘omics community about EarthCube through society town hall sessions.

Workshop Outcomes

I. Summary of workshop structure

This workshop focused on understanding the key science and technology drivers in the field and on the development of use cases to identify resource gaps, as well as to highlight their potential for training. The workshop was organized into several breakout groups, with time for reporting and discussion amongst all participants (please see Appendix I for the full agenda). The first series of breakouts on science/tech drivers focused on grand ‘omics science challenges:
- Geospatial & temporal registry for 'omics data across scales, led by D. Kinkade, V. Orphan
- Tracking synoptic ‘omics data products, led by B. Hurwitz, N. Kyrpides
- Integrated modeling of organisms (‘omics) and environmental dynamics, led by M. Follows, N. Levine

The next set of breakouts focused on a subset of participant-submitted use cases, with the aim of extracting the overarching science challenges and cyberinfrastructure needs in ‘omics research. Representative use cases were grouped by theme and are available in Appendix III.
- “Google Earth” ‘omics, led by E. Allen. Use cases by H. Alexander and B. Jenkins.
- Linking function to biogeochemical cycling in space/time, led by M. Saito. Use cases by R. Morris and J. Waldbauer.
- Using ‘omics for evolution/trait-based studies, led by E. Aronson. Use cases by D. Chivian and J. Gilbert.

The final discussion focused on the potential for and limitations of existing cyberinfrastructure. The aim of this session was to move beyond just identifying gaps in existing resources to proposing, as a community, solutions that address specific current and future cyberinfrastructure needs in ‘omics research. Breakout leads included J. Heidelberg, B. Jenkins, and D. Kinkade.
In addition to collective brainstorming, the workshop offered several presentations on resources that could be integrated into the cyberinfrastructure “ecosystem”, or at least provide fodder for on-going discussions. Presentations (listed in the agenda, Appendix I) included several plenary talks by NSF-funded Science and Technology Centers on their approaches to handling big data, as well as a presentation by Jason Leigh, who leads the Laboratory for Advanced Visualization and Applications (LAVA) at UH Mānoa. Jason and his team also hosted a Q&A and informal brainstorming session with the workshop participants, who were encouraged to explore advancements in visualization. We also had a presentation by B. Hurwitz and D. Kinkade on linking the main environmental oceanographic database, BCO-DMO, to the iMicrobe ‘omics data commons integrated in the iPlant Cyberinfrastructure, both supported by NSF. B. Tully and I. Zaslavsky presented the ECOGEO resource viewer, which was enabled by the EarthCube Building Block CINERGI. Finally, L. Teytelman described Protocols.io as a mechanism for researchers to share, modify, comment, and collaborate on laboratory and bioinformatics protocols, and F. Chavez presented MBARI’s collaboration with NOAA on a new eDNA study concept and proposed data analysis workflow.

II. Grand ‘omics science challenges

One of the inherent challenges in environmental ‘omics research is that the focus, by default, occurs at the level of microbes, since they make up most of the environmental biomass on Earth. However, many of our research questions span from sub-micron scales (viruses) to global ecosystems and models. How do you sample a micron-scale 3-D space (e.g. an algal bloom in the ocean) over a 4th dimension (time) and then extrapolate the micro-changes and environmental interactions to ecosystem biogeochemical cycling?
The current answer is, “the technology is not quite there yet”, but these sorts of analyses may be within reach in the very near future. The workshop participants discussed how to create geospatial and temporal registries for ‘omics data across different scales (Use Case: Jenkins), and how they could be visualized through a Google Earth data discovery model (Use Case: Alexander, Morris). For example, data layers could be expanded from latitude/longitude to include nutrient concentrations and global temperatures, with a “street-view”-like function that would allow for micro-to-meter scale visualization, such as activity on sinking particles in the ocean. When the main challenges extend beyond conventional scale boundaries, one must reach across scales in order to enable the next generation of research questions. Within the issue of scaling lies the fundamental challenge of understanding how biological processes interact with these scalable environmental data layers. For example, from an environmental (microbial) ‘omics perspective, the classic ecological metrics of alpha and beta species diversity may no longer be entirely relevant, in part because microbes don’t follow typical
speciation criteria. Therefore, the context of environmental interactions and functional diversity (the ability to fix nitrogen, utilize low-abundance iron, etc.) may be more relevant from a biogeochemist’s perspective than defining which strain of microbe is present (Use Case: Chivian, Gilbert, Waldbauer). In particular, trait-based metrics, as opposed to taxonomic criteria, may be more important when considering globally significant and/or societally relevant questions such as, “how are microbe-microbe and microbe-environment interactions impacting global biogeochemical cycles?” and “how can that information be used to improve climate models and projections of change?”. Currently, there is a disconnect between model predictions and data-driven observations. Therefore, we need new ways to enable more iterative cycles of observation, hypothesis generation, and hypothesis testing. These big-picture challenges require a fundamental change in the structure, availability, and scale of ‘omics-enabling cyberinfrastructures.

III. Specific cyberinfrastructure needs

Although the ‘omics community is diverse in focus and techniques, we share several common challenges that prevent the field from moving forward. These challenges were highlighted in the workshop use cases (Appendix III) as well as the community survey (Appendix IV), which was administered to the ‘omics community in late 2014. One of the most acute needs articulated by the ECOGEO community is a new mode of sequence data repository that facilitates data sharing and discovery of primary data and any associated environmental and sample processing data (“metadata”), as well as links to other data products (and their provenance), analytical software and workflows, and the infrastructures required to implement them.
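To make the idea of metadata-linked sequence queries concrete, the sketch below uses a minimal relational layout. The table names, fields, and values are hypothetical illustrations, not an actual repository schema; the point is only that metadata fields and their corresponding sequence reads can be queried together, in either direction.

```python
# Minimal sketch of a metadata-linked sequence store (hypothetical schema).
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE samples (
        sample_id TEXT PRIMARY KEY,
        depth_m   REAL,   -- environmental metadata field
        latitude  REAL
    );
    CREATE TABLE reads (
        read_id   TEXT PRIMARY KEY,
        sample_id TEXT REFERENCES samples(sample_id),
        sequence  TEXT
    );
""")
db.execute("INSERT INTO samples VALUES ('S1', 25.0, 22.75)")
db.execute("INSERT INTO samples VALUES ('S2', 500.0, 22.75)")
db.execute("INSERT INTO reads VALUES ('R1', 'S1', 'ACGTACGT')")
db.execute("INSERT INTO reads VALUES ('R2', 'S2', 'TTGACCAA')")

# Metadata -> reads: all reads from shallow (< 100 m) samples.
shallow = db.execute("""
    SELECT r.read_id, r.sequence FROM reads r
    JOIN samples s ON r.sample_id = s.sample_id
    WHERE s.depth_m < 100
""").fetchall()

# Reads -> metadata (the "vice versa"): the collection depth for a read.
depth = db.execute("""
    SELECT s.depth_m FROM samples s
    JOIN reads r ON r.sample_id = s.sample_id
    WHERE r.read_id = 'R2'
""").fetchone()[0]
```

Any real implementation would of course operate at far larger scale and with standardized metadata vocabularies, but the joint indexing of reads and metadata is the core capability the community identified.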
Such a repository would store sequence read data with its associated metadata in a manner that would allow seamless and simultaneous queries of metadata fields and their corresponding sequence reads (and vice versa). A similar repository for environmental proteomic mass spectrometry data was also identified as a core need. Such repositories would be invaluable tools for data discovery: they would promote efficient computation, reduce the requirement for transferring large data sets, and facilitate the development of downstream analysis pipelines that could be shared and standardized. Beyond repositories for raw sequence reads and mass spectrometry data, the community also needs a federated, searchable repository for the data products used as a basis for biological inference in publications, which would greatly enhance comparative analyses between studies. These products include phylogenetically resolved, population-level genome fragments assembled from metagenomes, gene/protein expression data from metatranscriptome/proteome analyses, and the sequence alignments underlying phylogenetic inferences. In addition to searchable data and data product repositories, analysis platforms that enable large-scale metagenomic data comparisons were identified as another core “ocean ‘omics” cyberinfrastructure requirement. Such platforms would support data searches driven by taxonomy and physicochemistry, thus making such large-scale comparisons feasible. However, this vision of next-generation sequence data repositories goes beyond aggregating disparate data sets currently housed in dispersed data resources (e.g. NCBI, IMG, iPlant, EBI, MG-RAST, BCO-DMO, etc.). They must also consider “dark data”, data already available in the
public domain but not readily discoverable via the commonly accessed databases. Data mining through web crawls focused on the primary scientific literature would significantly extend the volume of data gathered to promote comparative ‘omics investigations. Once ‘omics data are available in a suitable federated infrastructure with query-able and standardized metadata, it should be possible to pose a striking diversity of hypotheses.

Analysis tools, workflows, visualization, and statistics are critical to making sense of ‘omics data. Many analysis tools are disparate (developed by individual research groups), which can make them difficult to capture in a single workflow. Furthermore, most tools require computational experience to run, are not well vetted by the community (i.e. which is the best tool for certain data types, and why), and are often not maintained once a developer moves on. This computational climate prevents the continual improvement and vetting of existing tools by the community, and results in an ever-expanding collection of programs that are difficult to maintain and come up to speed on. This situation is particularly problematic for researchers without a bioinformatics team. In addition, programs are often developed independently, without workflows in mind. This leads to disparate outputs and formats that cannot be bridged between tools without the scripting skills to reformat the resulting data, and that therefore cannot be easily merged into user-defined workflows. The ideal platform would promote a federated toolkit with interoperable and standardized output formats to enable domain scientists to answer their science questions, as well as encourage continued technology and cyberinfrastructure development. Quantitative Insights Into Microbial Ecology (QIIME) and the associated Qiita database have recently been adopted by a large community as platforms for federated ribosomal RNA (rRNA) tag studies, with open-access tools and a well-supported, community-run helpdesk.
QIIME provides online tutorials that facilitate community adoption, resulting in a large user base. However, the rRNA tag data sets for which QIIME is designed have low complexity and require comparatively less processing than metagenomic data sets (rendering the development and use of these analytical tools much more straightforward). For large-scale meta-‘omic analyses, different cyber solutions will be required. A few other analysis platforms exist, such as DNAnexus, which supports the biomedical community, and the bioinformatics Apps for the life sciences that exist in the iPlant “ecosystem”. Finally, statistical and visualization tools are necessary for researchers to explore data and draw conclusions related to their core science questions. During the workshop, we were able to start this conversation with big data statistics and visualization experts. The main take-home message was that domain scientists should not struggle with these challenges alone. Just as our community has grown to educate and include bioinformaticians in the development of our research plans, it is evident that we need to expand our collaborations to also include big data statistics and visualization experts.

IV. Leveraging and expanding existing infrastructures

The ‘omics community has many distinct layers of cyberinfrastructure requirements, ranging from the physical hardware to house and serve the data, to software that allows users to process, analyze, and interface with the data. The ECOGEO workshop focused extensively on
discussing these needs and identifying existing resources that might be leveraged to accomplish our research goals. One of the core issues with ‘omics data is size. Moving large-scale sequence data sets requires significant network bandwidth and access to network platforms, such as Globus and Internet2. It is estimated that the data we have today represent only 10% of the total data that will be available in the next 5 years, given improvements in sequencing capacity and advances in the throughput of non-nucleic acid data products. While data storage, networking, and communications infrastructures will continue to evolve to help meet these “big data” needs, they also need to serve a variety of end users, from raw novices to domain experts, as well as the specialized requirements of educational, survey and monitoring, and policy-driven programs and communities. Analyzing and interpreting large collections of data will likely require a collaborative and federated approach. Cloud-based computing approaches hold great promise, but there are a number of issues that need to be addressed by the community, and the current economics of commercial cloud solutions do not appear scalable for a large and diverse community. For larger and more well-established institutions, such as the Joint Genome Institute (JGI), iPlant/iMicrobe, the Broad Institute, the J. Craig Venter Institute (JCVI), Sanger, etc., maintaining dedicated computing resources makes sense, while for small labs it probably does not. For most intermediate-size groups, a hybrid approach may be optimal, with some dedicated resources coupled with access to Cloud-based resources. In this context, federated infrastructure virtual machines, including lightweight containers, will likely be a central avenue for providing easy-to-use analysis tools. These have the potential to democratize access to software suites that may be too complex to install for researchers without dedicated computational support staff.
These researchers may be best served by access to online analysis tools offered by groups such as JGI, KBase, and iPlant/iMicrobe. Common APIs and architectures for such virtual machines will help forge the links for an interoperable and federated infrastructure. In addition, existing EarthCube “dark data” discovery projects, such as DeepDive, can be used to identify published data that are not in a public repository. The imagined next-generation data repositories will likely be built on a federated, cross-agency structure that ties together data from different providers into a common framework. Data collections from public resources, such as iMicrobe and the International Nucleotide Sequence Database Collaboration (INSDC), which includes the Sequence Read Archive (SRA), GenBank, and the European Nucleotide Archive (ENA), as well as Integrated Microbial Genomes (IMG), are currently the most used, robust, and sustainable cyber and meta-‘omics resources. These should definitely be integral, federated players in the context of any proposed meta-‘omics “cyber superstructure”. Presently, researchers in the ECOGEO community deposit raw sequence data into the SRA as part of the National Center for Biotechnology Information’s (NCBI) GenBank service, which currently houses over 19,000 environmental genomic data sets totaling > 15 TB. This resource, however, is not easily searchable, and thus prevents the integration of data sets across projects
and limits the possibility of ecosystem-level analyses. Further, SRA files at NCBI often do not contain a sufficient description of a sample’s “metadata”, which ideally includes information on the sampled environment, sample collection, processing, and data generation. This contextual metadata, in addition to the oceanographic data in BCO-DMO, is essential if data sets are to be intercomparable. The Genomic Standards Consortium (GSC) has established baselines for describing genomic, metagenomic, metatranscriptomic, and amplicon sequence data (discussed in the next section). Previously, the ocean ‘omics community relied heavily on the Community Cyberinfrastructure for Advanced Microbial Ecology Research and Analysis (CAMERA) database for data discoverability, but this platform was discontinued in 2014. The CAMERA data sets have been transferred into iMicrobe, a sub-portal within iPlant. In addition, IMG at JGI and MG-RAST at Argonne National Laboratory are genomic and metagenomic resources that have been heavily leveraged by the larger community. Many data repositories, such as GenBank, are also limited in that they do not accept processed data products and are not properly formatted for non-nucleic acid ‘omics data, such as proteomic, glycomic, metabolomic, and lipidomic data. Currently, the largest available metagenomic data integrations are provided through JGI’s IMG system. Both IMG and MG-RAST support the integration and analysis of a number of different ‘omics data sets, and support the general community by annotating and analyzing user-submitted data. Due to their long-term funding schemes, the volume of their existing data, and their positions within the international community, these systems are “pre-adapted” to be an integral part of a federated meta-‘omics “cyber superstructure”. A unique capability provided by IMG is the scale of processed data it makes publicly available.
While IMG is not a centralized resource for raw data storage, it has the potential to serve as a central resource for assembled metagenomic data sets. There is currently an effort to assemble and annotate a large part of the raw metagenomic data available through the SRA and to integrate it with the metadata curation effort of the Genomes OnLine Database (GOLD). This provides a good example of how current data centers could serve as specialized hubs in a federated, interoperable alliance, each providing different data products – in this case with IMG serving as a central repository for metagenome assemblies. Another example of existing cyberinfrastructure that is “pre-adapted” to be an integral part of a federated meta-‘omics “cyber superstructure” is the iMicrobe project, built on the iPlant cyberinfrastructure. This collaboration is emerging as a viable solution for storing user-generated and user-defined data sets in a community data commons. The iMicrobe project provides a query-able interface for data sets in iPlant by linking to BCO-DMO’s data and mapping the appropriate metadata to GSC’s MIxS-compliant terminology and other standardized ontologies to enhance data discovery and re-use. iPlant also provides the capacity for users to develop and distribute tools for use by the community. These tools are tied to freely available, high performance computing resources at iPlant and the Texas Advanced Computing Center (TACC). Presently, over 500 bioinformatics tools are available within iPlant’s discovery environment, and the iMicrobe project is developing tools specific to microbial ‘omics analyses, including metagenomic and metatranscriptomic data sets, and new analysis pipelines for uncultured viruses. iPlant and TACC are NSF-funded programs that, at the request of NSF, have expanded
their scope to the broader (non-plant) life science community, which is compatible with ECOGEO-related research questions.

V. Enabling and encouraging big data best practices

The GSC has already established a foundation for the minimal information that should accompany sequence data sets. This community-led initiative has spent 10 years creating consensus-based standard languages and formats for describing the metadata associated with a sequence data set. This includes the physical, chemical, and biological data that accompany the physically sampled environment. These environmental metadata standards include formats developed for marine, soil, human, host-associated, built-environment, and many other systems. This provides a crib sheet that helps educate people on the kinds of information they should include when they submit their data to a public database. The format promotes a standard that makes incoming data compliant with other data sets in the databases, and also makes the data machine readable and hence searchable. This includes the use of standardized ontologies (e.g. country of origin defined as USA, instead of U.S.A., US, United States, or United States of America). Variations in descriptors confound searching and make data retrieval extremely difficult; standard ontologies allow for communication between the searcher and the submitter. In addition, the ‘omics community has been asked to compile a list of requests that would make the SRA database more useful to the community, in alignment with the NIH microbiome database needs. The primary difficulty is not getting people to agree that these standards should be used; it is getting them to use them. Formatting data appropriately requires effort from the submitter. Therefore, databases, journals, and funding agencies are finding it difficult to reach consensus on the best way to motivate the community to employ such standards.
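As a toy illustration of why standardized ontology terms matter for machine searchability, a submission pipeline might normalize free-text descriptors to a canonical term before deposit. The synonym table below is invented for illustration and is not an actual GSC vocabulary:

```python
# Illustrative sketch: normalizing free-text metadata descriptors to a
# canonical ontology term so records remain machine-searchable.
# The synonym table is hypothetical, not an actual GSC vocabulary.

COUNTRY_SYNONYMS = {
    "usa": "USA",
    "u.s.a.": "USA",
    "us": "USA",
    "united states": "USA",
    "united states of america": "USA",
}

def normalize_country(value: str) -> str:
    """Map a free-text country descriptor onto its canonical term.

    Unrecognized values are returned unchanged so a curator can review them.
    """
    return COUNTRY_SYNONYMS.get(value.strip().lower(), value.strip())
```

With such normalization in place, a single query for "USA" retrieves every record regardless of how the submitter originally spelled the descriptor, which is exactly the searchability the standardized ontologies aim to provide.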
GenBank and EBI have adopted the GSC ‘gold star’ standard, making data sets that comply with the minimal information standards (e.g. MIMS, MIxS, etc.) more data rich and the search for data sets easier. The hope is that if data sets comply with this standard, and are given a gold star, they will be more frequently used and more regularly cited, and hence will encourage more researchers to employ these standards.

Recommendations: Building Environmental ‘Omics Infrastructure for Earth Sciences

These recommendations aim to enable our community to build the necessary data discovery repositories, with federated and efficient frameworks for data integration and interoperability, to establish best practices and workflows, and to develop functional platforms for analysis, visualization, and statistics.

Recommendation I. EarthCube-based solutions should integrate science drivers and challenges (e.g. end user discussions and use case scenarios) with technological and engineering solutions via continuous, iterative discussion/development cycles, from start to finish. Without frequent communication regarding the needs of the larger end user community, EarthCube funded projects will not be able to adapt, limiting their usefulness and sustainability. Development cycles should strive for rapid iterations of beta testing, end user feedback, and engineering redesign.
Recommendation II. Existing ‘omics-oriented cyberinfrastructures should be extensively leveraged and integrated into EarthCube systems, with further development and support, to better meet current and future needs of the ‘omics community. The true potential of ‘omics-based science is limited by the community’s need for a well-developed, federated, interoperable, and distributed “cyber superstructure”. Currently, IMG and iMicrobe are the most extensive, robust, and sustainable cyber meta-‘omic resources. These platforms should be integrated with other data projects (e.g. BCO-DMO, EarthCube funded projects) and supported as leaders in the development of a federated and interoperable meta-‘omics “cyber superstructure”.

Recommendation III. Data centers, databases, and analytical tools that address issues of data discovery, scalability, and community HPC access should be further developed. Data discovery through accessible repositories, semantic integration of associated metadata, and scalable analyses are crucial for the ‘omics community to address many globally significant and/or societally relevant questions. This level of data integration will require ingenuity and collaboration between domain scientists, cyberinfrastructure developers, statisticians, and visualization experts. As resources, tools, and experts become available, the ‘omics community should support the development of innovative ideas.

Recommendation IV. Data visualization and statistical analysis frameworks should be integrated into standard ‘omics analysis workflows and software. Conversations with data visualization and statistics experts have already started, but stronger integration, through the development of collaborations and interdisciplinary projects, will be necessary to expand and enable the full potential of ‘omics data.

Recommendation V.
The cyberinfrastructure should enable and encourage “big data” best practices and standards for the community. The ongoing efforts of the GSC to create a ‘gold star’ standard for user-submitted data sets should continue to be supported and adopted by the community. With a concerted effort by leaders in this field, from single investigators to established institutions, these standards can be adopted by the ‘omics community over time. This level of cohesion should also be extended to the development of workflows describing data processing and data products. As a community, we should explore existing tools, such as Galaxy, Protocols.io, and EarthCube’s GeoSoft project, for curation and dissemination of protocols, software, and scripts. In addition, international collaboration should be further developed and encouraged, with the assistance of the EarthCube Liaison Team. For example, European efforts in metagenomics and microbiome studies (e.g. Marine Ecological Genomics and EBI Metagenomics) have goals and objectives parallel to those described here.

Recommendation VI. The ‘omics and cyberinfrastructure communities should enable and provide a platform for future ‘omics research via streamlined, accessible, state-of-the-art education, training tools, and best practices. The complex network of ‘omics research requires individuals to be savvy in field, laboratory, and computer-based techniques. In order to continue pushing the science forward, the ‘omics community should strive to develop
and disseminate educational training tools, such as training workflows, demonstration videos, interactive workshops, and training courses. Effective knowledge transfer to the next generation of ‘omics researchers, developers, and innovators will be necessary to position them to take ‘omics science into the future. Through EarthCube, ECOGEO will develop a foundation of training videos, but proper development, assessment, and improvement will require support and a large community effort.

Upcoming events

The ECOGEO RCN has several activities planned for the remaining year of NSF funding (through August 2016). In addition to hosting a second workshop (late spring/early summer 2016) as funded in the original award (1440066), supplementary funds were granted by the NSF Division of Ocean Sciences (OCE) for additional activities. In January/February, the ECOGEO RCN will run a small working group focused on creating 12 complete EarthCube use cases. Prior to Workshop I, participants were asked to submit use cases to be reviewed and discussed during the workshop. Due to time constraints, we were only able to review six use cases, but we are keen to work with the TAC Use Case Working Group to flesh out all 12 use cases, including integration into EarthCube resources where possible, and then contribute them to the EarthCube use case repository. In addition, the ECOGEO RCN will be hosting a Town Hall at the 2016 ASLO/AGU/TOS Ocean Sciences Meeting in New Orleans, LA. The Town Hall will be held on 25 February from 12:45-13:45 in the Ernest N. Morial Convention Center (217-219). The Town Hall is intended to introduce the OSM community to EarthCube and the ongoing efforts of the ECOGEO RCN. Because we already have representation on the EarthCube Engagement Team, several “Introduction to EarthCube” resources are already under development.
Our final workshop will focus on creating instructional webinars that demonstrate ‘omics tools and data portals, as well as implementing the developed use cases. The main goal is to train the next generation of ‘omics researchers and develop ways for them to integrate their research with EarthCube’s ongoing mission to enable data science through cyberinfrastructure.
Appendices (pages 12-40)
Appendix I: Workshop I Agenda
Appendix II: Participant List
Appendix III: Participant Use Cases and Use Case Template
Appendix IV: Community Survey and Summary
(updated 15 Sep 2015)
Workshop I – Agenda See also - http://cmore.soest.hawaii.edu/rcn2015/agenda.htm
27 August 2015, East-West Center (EWC), UH Manoa

0745-0800  Depart hotel for EWC
0800-0830  Morning coffee, light breakfast at EWC
0830-1200  Big Data, Big Ideas - Joint session w/ STC directors meeting (Keoni Auditorium)
0830-0845  Advanced Networking Critical Infrastructure for Big Data and Global Collaborative Science – David Lassner, President, University of Hawaii
0845-0945  See the Angel, and Other Thoughts on Breakthrough Science – Hon. Daniel S. Goldin, Founder & Chairman, Intellisis Corporation; 9th NASA Administrator, Retired
0945-1015  Coffee Break
1015-1035  Geophysical Sensors, Ice Sheets – Prasad Gogineni, Director CReSIS
1035-1055  Life Science Applications – Ananth Grama, Associate Director CSoI
1055-1115  The Role of STCs and BIO Centers in the Face of Big Data – Erik Goodman, Director BEACON
1115-1135  Big Data Visualization – Jason Leigh, Information and Computer Sciences, UH Manoa
1135-1200  Open mic – Moderator Ed DeLong, ECOGEO PI & C-MORE Co-Director
1200-1300  Lunch at EWC (w/ STCs for group discussion, Bottom Floor)
1300-1330  ECOGEO only (Asia Room, 2nd floor): Opening Remarks; Goals and Agenda
1330-1430  Breakout I (Asia, Sarimanok, Kaniela): Science and CI drivers
1430-1500  Report I: Science/CI drivers (Asia)
1500-1530  Coffee Break
1530-1630  Breakout II (Asia, Sarimanok, Kaniela): Use cases
1630-1700  Report II: Use cases (Asia)
1700       Depart EWC for hotel
1800       Dinner at Waikiki Aquarium (w/ STC meeting) – Bring your name tag!
28 August 2015, Morning – C-MORE Hale, Late-morning – EWC

0745-0800  Depart hotel for C-MORE Hale
0800-0830  Morning coffee, light breakfast at C-MORE Hale
0830-0900  Debrief for Community Telecom
0900-1030  Community Telecom – live video stream, interactive Q&A webinar
           • Overview of workshop
           • Panel presentation of breakouts: Science/CI drivers, Use cases
           • Open community forum for discussion and feedback
1030-1100  Coffee Break at EWC (Asia)
1100-1130  Discussion re: Community Telecom, remaining agenda items (Asia)
1130-1220  Presentation and discussion: Linking environmental and sequence databases – Bonnie Hurwitz, iMicrobe; Danie Kinkade, BCO-DMO
1230-1300  Lunch with presentation on ECOGEO Resource Viewer (Bottom Floor) – Benjamin Tully, USC/C-DEBI; Ilya Zaslavsky, UCSD/EarthCube CINERGI BB
1300-1400  Discussion and brainstorm on data visualization (Asia) – Jason Leigh, UH; Khairi Reda, UH/Argonne NL; Madhi Belcaid, UH/HIMB
1400-1500  Breakout III (Asia, Sarimanok, Kaniela): Final list - CI needs, potential solutions
1500-1530  Coffee Break
1530-1630  Report III: Final list of CI needs, potential solutions - addressed in the final report
1630-1700  Final Presentations – Lenny Teytelman, ZappyLab – Protocols.io; Francisco Chavez, MBARI – eDNA workflow
1700-1715  Outline of workshop report, ECOGEO RCN’s next steps
1715       Depart EWC for hotel – Aloha and Mahalo!
Workshop Participant List
(updated 14 Sep 2015)

Last, First – Institution
Alexander, Harriet – MIT
Allen, Andrew – JCVI, UCSD
Allen, Eric – SIO, UCSD
Alm, Eric – MIT
Amend, Jan – USC, CDEBI
Aronson, Emma – UC Riverside, EarthCube
Belcaid, Madhi – UH (HIMB)
Bender, Sara – GBMF
Buchan, Alison – U Tenn
Chavez, Francisco – MBARI
Chivian, Dylan – LBNL
Cleveland, Sean – UH (ITS)
Crump, Byron – Oregon State
DeLong, Edward – UH (ECOGEO Lead PI)
Dhyrman, Sonya – Columbia
Follows, Michael – MIT
Gomez, Basil – UH, EarthCube
Grethe, Jeffrey – UCSD
Hallam, Steven – UBC
Heidelberg, John – USC, C-DEBI
Hurwitz, Bonnie – U Arizona, iMicrobe
Jacobs, Gwen – UH (ITS)
Jenkins, Bethany – U Rhode Isl
Kinkade, Danie – BCO-DMO, EarthCube
Kyrpides, Nikos – JGI
Leigh, Jason – UH (Visualization)
Levine, Naomi – USC
Mackey, Katherine – UC Irvine
Matsen, Frederick – Hutchinson
Meier, Ouida – UH, EarthCube - CReSCyNT
Merrill, Ron – UH (ITS)
Moran, Mary Ann – U Georgia
Murray, Alison – DRI/U Nevada
Nahorniak, Jasmine – Oregon State
Neuer, Susanne – Arizona State
Orphan, Victoria – Cal Tech
Polson, Shawn – U Delaware
Reda, Khairi – UH, Argonne NL
Rubin, Ken – UH, EarthCube
Saito, Mak – WHOI
Schanzenbach, David – UH (ITS)
Seracki, Michael – NSF
Stanzione, Dan – TACC
Teske, Andreas – UNC
Teytelman, Lenny – ZappyLab, Protocols.io
Tully, Ben – USC, C-DEBI
Waldbauer, Jacob – U Chicago
Wood-Charlson, Elisha – UH (ECOGEO Communications)
Zaslavsky, Ilya – UCSD
Zeigler Allen, Lisa – JCVI, UCSD
Zinser, Erik – U Tenn
Aloha 2015 ECOGEO Workshop Participants,

We are really looking forward to having you join us in Hawai‘i on 27-28 August for the first ECOGEO RCN workshop. In order to prepare for the workshop, the organizers (Ed, Elisha, and the ECOGEO Steering Committee) would greatly appreciate having your research group contribute a single Use Case related to your work in environmental ‘omics. As our first workshop is focused on core issues in ‘omics research, many of the invited participants (list available on the website) represent the senior research/PI level. Therefore, we ask that you use this Use Case development opportunity to involve your research group in the conversation. Below are a few points to help provide some direction, but don’t hesitate to contact Elisha if you have questions or would like feedback.

1. Please draft a Use Case that highlights a current challenge/limitation for your ‘omics research (see the provided Use Case as an example).
2. The provided Use Case represents a current big-picture ‘omics question/challenge. Depending on your Use Case, this may or may not be appropriate. Any level of focus and/or complexity is welcome.
3. We encourage input from all research groups: Science and Tech/CI!
Please submit your Use Case NO LATER than 10 August 2015! (Elisha:
[email protected])

Prior to the workshop, we will review the submitted Use Cases with the aim of collecting and preparing representative examples for 1) focused discussion on solving challenges and 2) progressing each Use Case towards functionality in research and training.

Mahalo, looking forward to seeing you all very soon! Please refer to the website for logistics and documents related to the workshop.

Cheers!
Elisha Wood-Charlson and Ed DeLong
Use Case Template (revised from EarthCube version 1.1)

Summary Information Section
Use Case Name
Contact(s)
Overarching Science Driver (these can be refined during the workshop)
Science Objectives, Outcomes, and/or Measures of Success
Key people and their roles
Basic Flow
Describe steps to be followed. Document as a list here and/or as a diagram (see use case example)
1. …
2. …
3. …

Critical Existing Cyberinfrastructure
o
o
o
Sidenote: Please identify (**) ‘omics tool(s) listed here that are not easily accessible and may be good candidate(s) for a community CI application.
Critical Cyberinfrastructure Not in Existence
o
o
o
Activity Diagram (more detail than basic flow, including inputs/outputs, incorporating tech/CI)
Please list particulars that come to mind, but don’t focus on completing the story. This can be expanded during the workshop.
Problems/Challenges (any barriers to successful completion of use case)
For each one, list:
- The challenge
- What, if any, efforts have been undertaken to fix these problems?
- What recommendations do you have for tackling this problem?
1. …
2. …
3. …

References (links to background or useful source material)
Notes (any additional information that does not fit in a previous category)
Use Case Template (revised from EarthCube version 1.1)

Summary Information Section
Use Case Name: Cosmopolitan species physiological response and strain variability across ecological gradients
Contact(s): Harriet Alexander ([email protected])
Overarching Science Driver (these can be refined during the workshop): Understand the role of species-level genome variability in the success of a species complex across environmental gradients.
Science Objectives, Outcomes, and/or Measures of Success:
- Aggregate available meta-omic datasets that contain an organism or sequence of interest
- Create an analysis workflow for pulling out target species from within meta-omic datasets
Key people and their roles: Sonya Dyhrman (lead PI), Harriet Alexander
Basic Flow
1. Select all available metagenomic/metatranscriptomic datasets based on location within the water column (euphotic zone, 1% surface irradiance)
2. Query selected datasets for the presence of an organism of interest (based on query sequence, genome, or transcriptome) within the ‘omic dataset
3. Extract metadata, sequences associated with the taxonomic query, and information associated with the sequences (e.g. relative sequence abundance)
4. Run expression, statistical, and alignment analyses locally
5. Visualize data locally
Critical Existing Cyberinfrastructure
o iMicrobe, NCBI SRA, IMG, EBI, JGI
o Python, iPython, Amazon cloud for HPC, virtual machines
o Bioinformatic tools for mapping sequences (BWA, Bowtie), assembling sequences (Trinity, Velvet, Abyss), clustering sequences (CD-HIT), taxonomically binning sequences (ClaMS, PhyloPythia, ESOM)
Sidenote: Please identify (**) ‘omics tool(s) listed here that are not easily accessible and may be good candidate(s) for a community CI application.
Critical Cyberinfrastructure Not in Existence
o Portal similar to JGI or EBI that can be used to browse meta-omic data without having to download it locally
o Standardized data format for environmental sequence data collected on different platforms
o Some means of linking ‘omic datasets based on organisms/genes present
Activity Diagram (more detail than basic flow, including inputs/outputs, incorporating tech/CI)
Please list particulars that come to mind, but don’t focus on completing the story. This can be expanded during the workshop.
Problems/Challenges (any barriers to successful completion of use case)
1. How can we unify the type of sequence data that is made available from environmental studies? What types of data should be required?
   a. We should decide upon what types of data should be required.
2. The computational time and memory required to specifically query against tens to hundreds of large ‘omic data sets is not feasible to do locally.
   a. Many groups have started to use a combination of cloud computing (e.g. Amazon cloud) and virtual machines to perform analyses. If the databases provided through EarthCube could be made to streamline into such a platform, analyses might be made easier.
3. In an ideal world, every time a meta-omic dataset were added to the overarching database, that dataset would be queried against all other environmental/culture datasets. For example, genes would be clustered with like genes from other environments, species common across environments would be highlighted, and patterns of k-mer abundance might be tracked and correlated. The goal here would be to create a synthesis that places the data within the new dataset into greater context and consequently makes further analyses more streamlined.
   a. This particular challenge is still a bit far off from being solved. Work needs to be done to improve the computational tools currently available to make such efforts more tractable.
References (links to background or useful source material)
Notes (any additional information that does not fit in a previous category)
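The selection and query steps (1-3 of the basic flow above) amount to a metadata filter over a dataset catalogue. A minimal sketch of that filter, using an invented in-memory catalogue (the record fields, depth criterion, and taxon abundances are illustrative assumptions, not a real iMicrobe or IMG schema):

```python
# Sketch: filter a mock catalogue of meta-omic datasets by water-column
# metadata, then pull records that report a taxon of interest.
# All field names and values are illustrative, not a real repository API.

def select_datasets(catalogue, max_depth_m, taxon):
    """Return datasets at or above `max_depth_m` that contain `taxon`."""
    hits = []
    for record in catalogue:
        if record["depth_m"] <= max_depth_m and taxon in record["taxa"]:
            hits.append({
                "dataset_id": record["dataset_id"],
                "depth_m": record["depth_m"],
                # relative abundance travels with the sequence metadata (step 3)
                "rel_abundance": record["taxa"][taxon],
            })
    return hits

catalogue = [
    {"dataset_id": "MG-001", "depth_m": 25,
     "taxa": {"Ostreococcus": 0.04, "Prochlorococcus": 0.31}},
    {"dataset_id": "MG-002", "depth_m": 500,
     "taxa": {"Ostreococcus": 0.01}},
    {"dataset_id": "MT-003", "depth_m": 80,
     "taxa": {"Prochlorococcus": 0.12}},
]

print(select_datasets(catalogue, max_depth_m=150, taxon="Prochlorococcus"))
```

The point of the sketch is that once metadata are standardized (challenge 1 above), this kind of selection is trivial; today the hard part is assembling a catalogue with consistent fields at all.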
Use Case Template (revised from EarthCube version 1.1)

Summary Information Section
Use Case Name: Trait-based modelling of community response to changing conditions.
(Overall goal: integrate time-series biogeochemical measurements, meta-transcriptomics, 16S profiling, metagenomic assemblies, isolate genomes, isolate metabolite dependencies, and sometimes isolate metabolomics and meta-metabolomics.)
(Note: this is not one of my own experiments; rather, I have several collaborations pursuing questions using such rich data. Example systems include Desert Crust and Mediterranean Grassland Rhizosphere, but similar experimental designs are also used in aquatic environments.)
Contact(s): Dylan Chivian (
[email protected])
Overarching Science Driver (these can be refined during the workshop) Understand key functional genes and the roles of the trait-guild member species in adaptation to a perturbed environment.
Science Objectives, Outcomes, and/or Measures of Success 1. Identify key functional genes in perturbation response. 2. Link key functional genes to species. 3. Model trait-guild member species and their interactions.
Key people and their roles Dylan Chivian, Ulas Karaoz - Science and CI Eoin Brodie - PI Trent Northen - PI Basic Flow Describe steps to be followed. Document as a list here and/or as a diagram (see use case example) 1. Assembly and annotation of isolate genomes. 2. Assembly, annotation, binning, and assessment of MG-derived genomes. 3. Meta-transcriptomic abundance calculations against isolate and MG-derived genomes. 4. Trait-guild member assignment.
5. Integration of metabolomic and meta-metabolomic data into species models.
6. Time-series models of community adaptation.
7. Stats and visualization.
Critical Existing Cyberinfrastructure
o KBase/RAST/ModelSEED/MG-RAST, M-suite, QIIME, IMG, IMG/M, ggKbase, iMicrobe, PathwayTools, MicrobesOnline, metaMicrobesOnline
o R, MeV, SparCC, kallisto, bowtie, Cytoscape (analysis and viz)
Critical Cyberinfrastructure Not in Existence
o Easy access to rapid metagenomic assembly and binning.
o Easy integration of metabolomics data into metabolic modeling.
o Meaningful compartmentalized metabolic models and interaction networks.
o Easy trait-guild modeling and viz.
o Easy time-series trait-based modeling.
Activity Diagram (more detail than basic flow, including inputs/outputs, incorporating tech/CI)
This will be done during the workshop.
Problems/Challenges (any barriers to successful completion of use case)
1. Tools and data formats are sometimes inconsistent. One-stop shopping would be nice.
2. Ease of use for non-coder biologists is desirable.
3. Information-rich but clear data and analysis visualizations are hard to make. These need to be made more available to biologists.
References (links to background or useful source material)
Notes (any additional information that does not fit in a previous category)
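The meta-transcriptomic abundance calculation in step 3 of the basic flow above is, at its core, a length-normalized read count. A minimal TPM (transcripts per million) sketch, with invented read counts and gene lengths standing in for real mapping output:

```python
# Sketch: transcripts-per-million (TPM) from raw read counts mapped against
# reference genes (e.g. bowtie/kallisto-style output). The counts and gene
# lengths below are invented for illustration.

def tpm(counts, lengths_bp):
    """counts/lengths_bp are dicts keyed by gene ID; lengths in base pairs."""
    # 1) normalize each count by gene length (reads per kilobase)
    rpk = {g: counts[g] / (lengths_bp[g] / 1000) for g in counts}
    # 2) rescale so the values sum to one million
    scale = sum(rpk.values()) / 1_000_000
    return {g: v / scale for g, v in rpk.items()}

counts = {"nifH": 300, "amoA": 150, "recA": 600}
lengths = {"nifH": 900, "amoA": 750, "recA": 1200}
abundance = tpm(counts, lengths)
print(abundance)
```

Because TPM values sum to a fixed total per sample, they are comparable across time points, which is what the time-series models in step 6 need.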
Use Case Template (revised from EarthCube version 1.1)

Summary Information Section
Use Case Name: Using population genomes to analyse taxon-specific functional constraints
Contact(s): Jack A. Gilbert ([email protected]), Naseer Sangwan ([email protected]), Chris Marshall ([email protected]), Melissa Dsouza ([email protected]), Pamela Weisenhorn ([email protected])
Overarching Science Driver: To understand how translational fine-tuning shapes microbial genome evolution in natural environments
Science Objectives, Outcomes, and/or Measures of Success:
(i) Create a habitat-specific database of population-level orthologous genes with pre-calculated metrics, i.e. codon bias and dN/dS.
(ii) Create new workflows and analysis pipelines to compute codon bias and dN/dS values across fragmented metagenome assemblies representing complex environments, e.g. soil/sediment.
(iii) Create new normalization methods for accurate correlation between dN/dS and codon bias values of population-level genes.
Key people and their roles:
Jack A. Gilbert: Lead PI
Naseer Sangwan: Postdoctoral researcher
Chris Marshall: Postdoctoral researcher
Pamela B. Weisenhorn: Postdoctoral researcher
Melissa Dsouza: Postdoctoral researcher
Basic Flow
1. Quality trimming and de novo assembly of shotgun metagenome datasets
2. Binning metagenome contigs into population genomes (pan-genomes)
3. Gene calling on contig bins representing population genomes
4. Identification of orthologous genes between population genomes
5. Cross-validation of orthologous genes (i.e. length cut-off, sequencing errors)
6. Calculating pairwise dN/dS and codon bias values
7. Normalization and calculation of pairwise correlation between dN/dS and codon bias profiles
8. Demarcate & functionally characterize protein pairs w/ positive and/or negative selection
Critical Existing Cyberinfrastructure
o Alignable Tight Genome Clusters (ATGC) database of prokaryote genomes (has genomes of cultured isolates)
o Integrated Microbial Genomes (IMG) (e.g. can be used to pull orthologous genes)
o MicroScope pipeline (e.g. *has size limit for annotation*)
Critical Cyberinfrastructure Not in Existence
o Central database of population genomes, i.e. reconstructed from metagenomes
o Unique algorithms for calculating codon bias and dN/dS across short protein sequences
o Accurate normalization method that can handle the average genome size variation across populations
Activity Diagram
This can be targeted during the workshop.
Problems/Challenges
1. How to access the habitat-specific gene pool information?
   Recommendation: Create a comprehensive portal that can store such datasets.
2. High-throughput methods to screen orthologous genes across multiple population genomes
   a. Some methods exist, but they are specific to genome sequences of cultured microbes.
   b. Recommendation: develop new methods, or modify existing methods, to target genome bins representing a mix of strains or species.
3. How to calculate accurate rates of evolution and codon bias on short protein sequences
   a. Some methods exist, but they are not validated for errors and biases arising during metagenome data analysis, e.g. length variation, average genome size variation, etc.
   b. Recommendation: develop new methods to calculate and normalize the dN/dS and codon bias profiles of population genomes, e.g. considering average genome size variations.
References
- Ran W, Kristensen DM, Koonin EV. (2014). Coupling Between Protein Level Selection and Codon Usage Optimization in the Evolution of Bacteria and Archaea. mBio 5:e00956-14.
- Nielsen, R.
(2005). Molecular signatures of natural selection. Annu Rev Genet 39:197-218.
Notes
Use Case Template (revised from EarthCube version 1.1)

Summary Information Section
Use Case Name: Linking global models of nutrient limitation to gene expression of nutrient-specific responses in diatoms
Contact(s): Bethany Jenkins, University of Rhode Island; Joselynn Wallace, PhD candidate, University of Rhode Island
Overarching Science Driver (these can be refined during the workshop): Linking global biogeochemical models to in-situ measurements and meta-omics
Science Objectives, Outcomes, and/or Measures of Success: Compile micro- (trace metals, vitamins) and macro- (N, P, Si) nutrient concentration measurements, CTD depth profiles, measures of biodiversity and metagenomics, and gene-specific expression or metatranscriptome data into a queryable database.
Key people and their roles:
GEOTRACES, PDC, K. Buck – trace metal concentration and distribution, Fe speciation
BDJ, PDC, K. Thamatrakoln? – gene-specific expression (genetic markers of Si and Fe limitation of diatoms), metagenomics and transcriptomics
Basic Flow
1. Use global models predicting the role of nutrient limitation on primary production of key phytoplankton taxa to select an oceanic region of interest.
2. Filter by depth horizon
3. Retrieve historical macro- and micronutrient measurements collected from this region and filter data by concentration of a given nutrient
4. Retrieve ‘omics datasets from this region (this is the crux of this pipeline: matching the nutrient data with the ‘omics data and finding relevant ‘omics data)
5. Compile locations of nutrient measurements at a range of selected values with ‘omic data availability of metagenomes and metatranscriptomes
6. Determine from metagenomics data whether target organisms or taxa are present at target nutrient values
7. Filter metatranscriptome data by taxonomy to only retrieve transcripts from the target taxonomic group (2nd crux of pipeline: need to interface with phylogenetics infrastructure)
8. Use downstream measures to search for specific genes (e.g. BLAST)
Critical Existing Cyberinfrastructure
o World Ocean Database (Atlas) (https://www.nodc.noaa.gov/OC5/indprod.html)
o BCO-DMO (http://www.bco-dmo.org/)
o GEOTRACES International Data Assembly Center
o PANGAEA archive (http://doi.pangaea.de/10.1594/PANGAEA.840721)
o iMicrobe (http://imicrobe.us/)
o EBI metagenomics (https://www.ebi.ac.uk/metagenomics/)
o European Nucleotide Archive (http://www.ebi.ac.uk/ena)
o NCBI (http://www.ncbi.nlm.nih.gov/)
o QIIME (http://qiime.org/)
Sidenote: Please identify (**) ‘omics tool(s) listed here that are not easily accessible and may be good candidate(s) for a community CI application.
Critical Cyberinfrastructure Not in Existence
o Centralized or cross-referenced queryable repository of global model/map overlays, nutrient and in-situ measurements, and associated ‘omics data
o Integrated taxonomic pipelines for ‘omics data
Activity Diagram (more detail than basic flow, including inputs/outputs, incorporating tech/CI)
Please list particulars that come to mind, but don’t focus on completing the story. This can be expanded during the workshop.
1. Global map of nutrient limitation for diatoms
2. Define region of Fe limitation in N equatorial Atlantic
3. Query db of mixed layer depth samples from specified region with measured Fe values below specified level. Return data with Fe and all other measured nutrient and profiling data (e.g. temp, salinity, etc.)
4. Query database (same or different) with metagenomics information that is cross-referenced to samples
5. Apply taxonomic filtering to data (requires integrated pipeline for taxonomic classification)
6. Retrieve metatranscriptome data for sample containing taxonomic targets
Problems/Challenges (any barriers to successful completion of use case)
1. Cross-referencing of data (BCO-DMO): having an “accession number” for each sample that is recapitulated through all data records, so records can be housed in different databases but search engines can query by record and then for specific types of associated data.
2. Discoverability of ‘omics data: data currently live in a variety of repositories (NCBI, EBI, iMicrobe), and submissions don’t presently contain links to metadata records. ‘Omics data may need to live in a separate mirrored repository to facilitate retrieval.
References (links to background or useful source material)
Global model images from J. Keith Moore, Keith Lindsay, Scott C. Doney, Matthew C. Long, and Kazuhiro Misumi, 2013: Marine Ecosystem Dynamics and Biogeochemical Cycling in the Community Earth System Model [CESM1(BGC)]: Comparison of the 1990s with the 2090s under the RCP4.5 and RCP8.5 Scenarios. J. Climate, 26, 9291-9312.
Notes (any additional information that does not fit in a previous category)
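Steps 3-5 of the basic flow above amount to a join between a nutrient-measurement table and an ‘omics catalogue on a shared sample identifier, which is exactly the "accession number" cross-referencing described in challenge 1. A minimal sketch (sample IDs, field names, and the Fe threshold are all invented for illustration):

```python
# Sketch: join nutrient measurements to 'omics dataset accessions on a
# shared sample identifier, keeping only samples below a nutrient threshold.
# The shared-ID convention is the hypothetical piece: today this
# cross-referencing is largely manual.

def match_nutrients_to_omics(nutrients, omics, nutrient, max_value):
    """Return (sample_id, measured value, omics accessions) for samples where
    `nutrient` is at or below `max_value` and 'omics data exist."""
    omics_by_sample = {}
    for rec in omics:
        omics_by_sample.setdefault(rec["sample_id"], []).append(rec["accession"])
    matches = []
    for m in nutrients:
        if (m["nutrient"] == nutrient and m["value"] <= max_value
                and m["sample_id"] in omics_by_sample):
            matches.append((m["sample_id"], m["value"],
                            omics_by_sample[m["sample_id"]]))
    return matches

nutrients = [
    {"sample_id": "ST01-25m", "nutrient": "Fe_nM", "value": 0.08},
    {"sample_id": "ST02-25m", "nutrient": "Fe_nM", "value": 0.60},
]
omics = [
    {"sample_id": "ST01-25m", "accession": "MTX-0001"},
    {"sample_id": "ST01-25m", "accession": "MGX-0002"},
]

# Only the low-Fe sample with sequence data comes back
print(match_nutrients_to_omics(nutrients, omics, "Fe_nM", max_value=0.2))
```

The join itself is trivial; the missing cyberinfrastructure is the agreement that both repositories carry the same sample identifier.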
Use Case Template (revised from EarthCube version 1.1)

Summary Information Section
Use Case Name: Systems analysis linking information from metagenomic, metatranscriptomic, and metaproteomic datasets with key physical and chemical parameters.
Contact(s): Robert M. Morris, University of Washington ([email protected])
Overarching Science Driver (these can be refined during the workshop): To identify the key environmental parameters controlling the activities of microbial communities across an ocean gradient in organic and inorganic nutrients
Science Objectives, Outcomes, and/or Measures of Success:
- Synchronize community ‘omics datasets (ID, location, time, replicate, annotations!!!)
- Extract information across datasets (genes, transcripts, proteins with same annotations)
Key people and their roles:
Virginia Armbrust, Adam Martini (genomics)
Mary Ann Moran (transcriptomics)
Robert Morris (proteomics)
Basic Flow
1. Identify samples with matching datasets (physical, chemical, biological)
2. Download and retrieve appropriate datasets (‘omics, metals, nutrients, etc.)
3. Synchronize biological ‘omics datasets (annotate using standard annotations)
4. Identify categories for comparison (CEG paths, EC numbers, taxonomy, etc.)
5. Extract data for comparative analyses
6. Determine genetic potential, gene regulation, and expressed protein functions
7. Multivariate analysis of biological activity with physical and chemical parameters
Critical Existing Cyberinfrastructure
o **Standard annotation database developed by Mary Ann Moran
o Data archives (BCO-DMO, NCBI, MG-RAST, SILVA-RDP-Greengenes for 16S)
o Comet: an open-source MS/MS sequence database search tool
o KBase: a systems biology knowledgebase (mostly genomic at this point)
Sidenote: Please identify (**) ‘omics tool(s) listed here that are not easily accessible and may
be good candidate(s) for a community CI application.
Critical Cyberinfrastructure Not in Existence
o Database to host datasets
o Tools for comparative analyses of ‘omics datasets (establishes links)
o File conversion and export capabilities
Activity Diagram (more detail than basic flow, including inputs/outputs, incorporating tech/CI)
Will be done at the workshop.
Problems/Challenges (any barriers to successful completion of use case)
1. What data are available?
   A) BCO-DMO does some of this, but is not specialized for the large biological datasets generated by genomics, transcriptomics, and proteomics.
   B) The host site should have a fairly uniform summary diagram with links to available data, data that is coming, and data available through other sources.
2. Data are in different formats (raw files, processed files, annotated/unannotated)
   A) Existing data archives (above) do this, but they can be very difficult to navigate, and the file formats are not always consistent (for meta-‘omics data).
   B) Some standards regarding data formats should be established.
3. Some datasets have been deduplicated and some have not
   A) Many sites offer both versions.
   B) This is particularly challenging when annotations don’t match. Decisions about annotation (in addition to sequence similarity) will impact this.
4. Multiple files are available (size-fractionated, replicates, etc.)
   A) This is often done, but naming schemes are not uniform.
   B) Develop some standards for naming when data are deposited so that the user will know if there are replicates, different size fractions, etc.
5. Processed files from published results are oftentimes unavailable
   A) Not always required.
   B) Users should be able to save and export data at different stages of analyses.
References (links to background or useful source material)
Notes (any additional information that does not fit in a previous category)
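Step 5 of the basic flow above, extracting genes, transcripts, and proteins that carry the same annotation, is a set intersection on a shared annotation key once the datasets have been synchronized (step 3). A minimal sketch, with illustrative annotation IDs and abundances:

```python
# Sketch: link metagenomic, metatranscriptomic, and metaproteomic entries by
# a shared functional annotation (step 5 above). Assumes all three tables
# were annotated against one standard vocabulary, which is exactly the
# synchronization problem this use case describes; IDs/values are invented.

def link_by_annotation(genes, transcripts, proteins):
    """Each argument maps annotation ID -> abundance; return rows for
    annotations observed at all three 'omics levels."""
    shared = set(genes) & set(transcripts) & set(proteins)
    return {ann: (genes[ann], transcripts[ann], proteins[ann])
            for ann in sorted(shared)}

genes = {"K00370": 12, "K02588": 5, "K01601": 40}          # metagenome
transcripts = {"K00370": 310, "K01601": 95}                # metatranscriptome
proteins = {"K00370": 2.1, "K01601": 7.7, "K10944": 0.4}   # metaproteome

print(link_by_annotation(genes, transcripts, proteins))
```

The linked (gene, transcript, protein) triples are the input to step 7's multivariate analysis against physical and chemical parameters.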
Use Case Template (revised from EarthCube version 1.1)

Summary Information Section
Use Case Name: Increasing identification rates for peptide mass spectra from ocean metaproteome datasets
Contact(s): Jacob Waldbauer
Overarching Science Driver (these can be refined during the workshop): Develop a clearer picture of protein-level gene expression patterns and regulation for quantitative understanding of metabolic & biogeochemical processes
Science Objectives, Outcomes, and/or Measures of Success:
- Develop an ‘Ocean Metaproteome Atlas’ for comparative analysis of protein-level expression in oceanographic context
- Compare community spatiotemporal gene expression patterns between transcript & protein levels, and examine relationships with activities of biogeochemical interest
- Ultimately, develop a sufficiently mechanistic & quantitative picture of expression regulation & consequent metabolic activity in marine microbes to contribute to predictive biogeochemical models of ocean carbon & nutrient cycling
Key people and their roles
Basic Flow (describe steps to be followed; document as a list here and/or as a diagram, see use case example)
1. Collate & integrate ocean metaproteome datasets
2. Extract potentially informative peptide fragmentation spectra
3. Develop refined sequence databases for PSM searching
4. Sequence peptides by database searching, de novo, spectral library, and/or hybrid methods
5. Control FDR on putative sequence IDs in an integrated statistical framework
6. Assign gene ID, function, and/or taxon to identified peptides
7. Compare & visualize identified peptides across metaproteome samples
8. Contribute identified spectra to a community spectrum library
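As a toy illustration of steps 2-4 of this flow, the sketch below filters out sparse spectra and performs a stand-in "search" by precursor-mass lookup. A real pipeline would call dedicated search engines at each stage; all data, masses, and names here are invented:

```python
# Minimal sketch of steps 2-4 of the basic flow above. The spectra and the
# matching logic are stand-ins for real fragmentation data and search engines.

def filter_informative(spectra, min_peaks=10):
    """Step 2: keep only spectra with enough fragment peaks to be informative."""
    return [s for s in spectra if len(s["peaks"]) >= min_peaks]

def search_spectra(spectra, database):
    """Steps 3-4: toy peptide-spectrum matching by exact precursor-mass lookup."""
    matches = []
    for s in spectra:
        peptide = database.get(round(s["precursor_mass"], 2))
        if peptide is not None:
            matches.append({"spectrum": s["id"], "peptide": peptide})
    return matches

# Invented inputs: two spectra (one too sparse to search) and a mass->peptide table.
spectra = [
    {"id": "sp1", "precursor_mass": 1045.53, "peaks": list(range(25))},
    {"id": "sp2", "precursor_mass": 800.40, "peaks": [1, 2, 3]},
]
database = {1045.53: "PEPTIDER"}

informative = filter_informative(spectra)
psms = search_spectra(informative, database)
```

The point of the sketch is the shape of the dataflow: each stage consumes the previous stage's output, which is what makes the flow amenable to a shared pipeline platform.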
Critical Existing Cyberinfrastructure
o Peptide-spectrum matching, spectral library searching, and de novo sequencing algorithms (of varying speed/parallelizability)**
Sidenote: Please identify (**) 'omics tool(s) listed here that are not easily accessible and may be good candidate(s) for a community CI application.
Critical Cyberinfrastructure Not in Existence
o Sharing/integration platform for raw metaproteome data in open format(s)
o Automated pipeline/expert system for generating/optimizing proteome search databases
o Integrated system for linking peptide IDs with annotation/taxonomy systems
o Community spectral library of confident peptide IDs
Activity Diagram (more detail than basic flow, including inputs/outputs, incorporating tech/CI)
Please list particulars that come to mind, but don't focus on completing the story. This can be expanded during the workshop.
Problems/Challenges (any barriers to successful completion of use case). For each one, list: the challenge; what, if any, efforts have been undertaken to fix it; and your recommendations for tackling it.
1. Sharing metaproteomic mass spec data
   a. Currently, the PRIDE and MassIVE repositories are active, but there is little ability to integrate oceanographically relevant metadata.
   b. Recommendation: work with ProteomeXchange and/or MassIVE (CCMS, UCSD) to develop an ocean-specific metaproteome repository.
2. Focusing on the most (potentially) informative spectra
   a. Recommendation: develop generalized criteria (via machine learning?) for the sequence-information content of fragmentation spectra; this will cull large amounts of uninformative data.
3. Arriving at a consensus, FDR-controlled sequence ID & annotation from multiple sequencing methods/annotation streams
   a. Recommendation: allow the Metaproteome Atlas to maintain multiple scored ID candidates for a given spectrum, applying parsimony and/or pathway logic at the protein and/or organism levels.
References (links to background or useful source material)
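For the FDR-control theme in point 3, a widely used approach (not specific to this use case) is target-decoy competition: spectra are searched against real plus decoy (e.g. reversed) sequences, and the FDR among accepted matches is estimated as decoy hits divided by target hits. A minimal sketch, with invented scores:

```python
def target_decoy_cutoff(psms, fdr_max=0.01):
    """Lowest score cutoff whose accepted set keeps estimated FDR <= fdr_max.

    psms: (score, is_decoy) pairs; a higher score means a better match.
    Estimated FDR = decoy hits / target hits among PSMs at or above the cutoff.
    """
    cutoff = None
    targets = decoys = 0
    # Walk PSMs from best to worst score, tracking the cumulative decoy rate.
    for score, is_decoy in sorted(psms, key=lambda p: p[0], reverse=True):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if decoys / max(targets, 1) <= fdr_max:
            cutoff = score  # accepting everything down to this score is OK
    return cutoff

# Invented example: six PSMs, three of them decoy hits.
example = [(10, False), (9, False), (8, True), (7, False), (6, True), (5, True)]
```

The recommendation in point 3 goes further than this sketch: maintaining multiple scored candidates per spectrum would mean deferring the final accept/reject decision until parsimony or pathway logic can be applied.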
Notes (any additional information that does not fit in a previous category)
Summary of ECOGEO's community survey
EarthCube's Oceanography and Geobiology Environmental 'Omics (ECOGEO) Research Coordination Network (RCN) created a survey to assess current community needs and challenges with respect to 'omics research. The survey was available from Nov 2014 – Jan 2015 and had a total of 105 respondents. Of those, ~90 gave feedback on a majority of the questions, while 30-60 responded to the open-ended questions. Results from this survey are summarized below.
Overview
The main areas of 'omics research currently being explored by our community are metagenomics, 16S/18S taxonomy, and correlating 'omics data with environmental data (Figure 1). In addition, the majority of our research community regularly collects samples for processing (~85%), conducts in-depth analysis on the output data (~72%), and uses the data for comparative 'omics (62%) (n=96, with more than one selection possible). However, our community's engagement with 'omics data ranges from doing limited analysis (~47%) to using the data to develop workflows (~40%).
Figure 1. Areas of 'omics research (n=97, more than one selection possible)
Accessing data
Most 'omics users are able to submit data sets and associated metadata for archival, and to search reference databases by sequence similarity or annotation (Figure 2). However, we struggle to search by associated metadata/project characteristics, and we face real challenges in accessing unique data sets not in the main reference databases (a.k.a. "dark data").
Figure 2. Resources already in use or would like to use.
Community Workflows
Feedback on idealized workflows provided good fodder for use case development, the need for which was highlighted in Figure 2 ("would like to use" – case studies, interactive webinars). Therefore, we are asking the 2015 workshop participants to submit a use case prior to the workshop, one that highlights a current challenge in 'omics research. We have provided a template form and an example use case focused on using metadata to retrieve targeted data sets for further exploration. During the workshop, we will discuss several representative use cases to 1) highlight areas that we, as a community, need to focus on to move our research forward, and 2) establish a repository of training tools for the next generation of 'omics researchers.
Barriers to Research
The general consensus on the barriers to 'omics research moving forward is summarized in a few key, big-picture points (below). During the 2015 ECOGEO workshop, we will tackle these at a finer scale in an attempt to move solutions forward.
1. Data standards – including quality measures and a way to index data sets that will link samples to environmental metadata, across different types of 'sequencing', and throughout various sequence analysis and annotation stages.
2. A central repository of raw and processed data (see #1) that is searchable (see Figure 2) and downloadable with compatible/standardized output, while also offering online tools and compute power for processing (and archiving) assemblies, comparative analyses, annotations, visualizations, and statistics.
3. Regular annotation updates on existing databases, with the potential to request notifications if data sets of interest gain new information.
4. Training – use cases/workflows, training webinars, and a user-friendly GUI.
5. Last, but far from least – longevity!
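As a concrete illustration of the index described in barrier 1, one record per sample might link environmental metadata to each 'omics data product and its processing stage. All field names and identifiers below are invented for illustration, not a proposed standard:

```python
import json

# Invented example of a sample-level index record linking environmental
# metadata to 'omics data products and their processing stages.
record = {
    "sample_id": "EXAMPLE-0001",
    "environment": {"latitude": 22.75, "longitude": -158.0, "depth_m": 25,
                    "date_collected": "2015-08-27"},
    "data_products": [
        {"omics_type": "metagenome", "stage": "raw",
         "location": "archive:INVENTED-1"},
        {"omics_type": "metagenome", "stage": "assembled",
         "location": "archive:INVENTED-2"},
        {"omics_type": "metatranscriptome", "stage": "annotated",
         "location": "archive:INVENTED-3"},
    ],
}

# A repository built on such records could answer metadata-driven queries,
# e.g. "annotated products from samples shallower than 50 m":
hits = [p["location"]
        for p in record["data_products"]
        if p["stage"] == "annotated" and record["environment"]["depth_m"] < 50]
```

Because the record is plain JSON-serializable structure, the same index could support the searchable, downloadable repository described in barrier 2.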
Earth Cube Oceanography and Geobiology Environmental 'Omics
ECOGEO is a brand-new, NSF-funded Research Coordination Network (RCN) housed within the EarthCube platform. Please visit http://workspace.earthcube.org/ecogeo for more information and to join our listserv! The mission of this RCN is to identify community needs and develop the necessary plans to create a federated cyberinfrastructure to enable ocean and geobiology environmental 'omics. This survey is designed to address the first part of our mission. We are gathering information regarding the current usage of and community needs for 'omics research in the oceanography and geobiology communities. This brief research survey should take 5-15 minutes of your time, depending on your level of feedback. Your participation is greatly appreciated, but it is also voluntary, and you can choose to not answer any question. This survey is anonymous and poses no foreseeable risks to you. Please do not include any personal information in your responses. If you have any questions or concerns regarding this survey, please contact Dr. Elisha Wood-Charlson at the University of Hawai'i at Manoa ([email protected]). If you have questions regarding your rights as a participant, please contact the University of Hawai'i at Manoa Human Studies Program ([email protected]). This study has been reviewed and approved by the University of Hawaii Institutional Review Board (#...).
*1. By selecting "Yes", you are indicating your consent to participate in this survey.
( ) Yes    ( ) No
2. What area(s) of 'omics research do you typically work in? (select all that apply)
- Genomics
- Single cell genomics
- Metagenomics
- Transcriptomics
- Metatranscriptomics
- Proteomics
- Metaproteomics
- Metabolomics
- Correlating 'omics data with environmental data
- Phylogenetics
- 16S, 18S; Taxonomy
- Modeling
- Other (please specify)
3. What area(s) of 'omics sample and data processing do you typically engage in? (select all that apply)
- Collect samples and process for sequencing
- Limited analysis of processed 'omics data (e.g. post-QC/QA)
- In-depth analysis (e.g. single data set assembly, annotation, pathways, etc.)
- Workflow development
- Analytical and/or statistical tool development
- Use 'omics data in modeling
- Comparative 'omics (e.g. across 'omic types, complex data sets, integration with metadata)
- Other (please specify)
4. Please indicate 'omics-associated RESOURCES you use or would like to use. (For each resource, mark "Already use" and/or "Would like to use.")
- Submission of sequence data and metadata for archival services
- Access to unique data sets not available in other sequence repositories
- Search for user-submitted samples by description or project characteristics
- Search for user-submitted samples by sequence similarity (e.g. BLAST, RapSearch)
- Search for user-submitted samples by annotation (e.g. gene function, taxonomy)
- Search for data sets by metadata (e.g. latitude/longitude, date collected, lead PI)
- Access to reference datasets (e.g. non-redundant and RefSeq from NCBI)
- Case studies for training
- Interactive webinars
- Other resources, or additional comments (please specify)
5. Please indicate 'omics-associated TOOLS you use or would like to use. (For each tool, mark "Already use" and/or "Would like to use.")
- Initial data processing (e.g. QC/QA, trimming)
- BLAST and BLAST-like workflows (e.g. RapSearch)
- Assembly tools (e.g. RayMeta, Newbler)
- Annotation tools (e.g. Pfam, COG/KOG, TIGRFAM, NCBI's PRK)
- Phylogenetically-based annotation services (e.g. MEGAN)
- Workflow pipelines (e.g. Clustering, RAMMCAP, Redundancy filter)
- Comparative pathway analysis (e.g. KEGG, pFAM)
- Statistical tools
- Visualization tools
- Other resources, or additional comments (please specify)
6. If you currently have favorite tools/resources, please list them and explain why they are working for you.

7. To put the previous questions in a research context, please describe your idealized data analysis workflow that would best achieve your main science goals using 'omics data sets. What do you want 'omics data to do in order to answer your scientific questions?
8. Please identify the community needs for storage, management, analysis, sharing, integration, and visualization of 'omics data that you feel are immediate vs. should be considered in future development with a longer-term vision. (For each item, mark "Immediate" and/or "Long-term.")
- Storage of raw data (akin to the NCBI Short Read Archive for sequence data)
- Storage of processed data (e.g., translated proteins or assembled contigs)
- Storage of data used for biological inference (e.g., differential gene/protein expression)
- Linking different 'omics for a single sample
- Sustainable curation
- Access to high-performance computational resources
- Access to user-submitted data
- Analysis workflows
- Annotation tools
- Comparative pathway tools
- Comparative 'omics tools
- Statistical tools
- Visualization tools
- Case studies for training
- Other (please specify)

9. Any additional thoughts regarding Question 8?
10. Please comment on what you perceive to be the PRIMARY NEEDS surrounding 'omics research for the oceanography and geobiology communities.

11. Please comment on what you perceive to be the MAJOR INFRASTRUCTURE BARRIERS for improving 'omics research in the oceanography and geobiology communities.