The Parallel System for Integrating Impact Models and Sectors (pSIMS)

Joshua Elliott, Computation Institute, University of Chicago and Argonne National Laboratory, [email protected]
Neil Best, Computation Institute, University of Chicago and Argonne National Laboratory, [email protected]
Michael Glotter, Department of Geophysical Sciences, University of Chicago, [email protected]
David Kelly, Computation Institute, University of Chicago and Argonne National Laboratory, [email protected]
Michael Wilde, Computation Institute, University of Chicago and Argonne National Laboratory, [email protected]
Ian Foster, Computation Institute, University of Chicago and Argonne National Laboratory, [email protected]
ABSTRACT
We present a framework for massively parallel simulations of climate impact models in agriculture and forestry: the parallel System for Integrating Impact Models and Sectors (pSIMS). This framework comprises a) tools for ingesting large amounts of data from various sources and standardizing them to a versatile and compact data type; b) tools for translating this standard data type into the custom formats required for point-based impact models in agriculture and forestry; c) a scalable parallel framework for performing large ensemble simulations on various computer systems, from small local clusters to supercomputers and even distributed grids and clouds; d) tools and data standards for reformatting outputs for easy analysis and visualization; and e) a methodology and tools for aggregating simulated measures to arbitrary spatial scales such as administrative districts (counties, states, nations) or relevant environmental demarcations such as watersheds and river basins. We present the technical elements of this framework and the results of an example climate impact assessment and validation exercise that involved large parallel computations on XSEDE.
Categories and Subject Descriptors
I.6.8 [Simulation and Modeling]: Parallel – parallel simulation of climate vulnerabilities, impacts, and adaptations.

General Terms
Algorithms, Performance, Design, Languages.

Keywords
Climate change Vulnerabilities, Impacts, and Adaptation (VIA); Parallel computing; Data processing and standardization; Crop modeling; Multi-model; Ensemble; Uncertainty.

1. INTRODUCTION AND DESIGN
Understanding the vulnerability and response of human society to climate change is necessary for sound decision-making in climate policy. However, progress on these research questions is hampered by the need to integrate science and information products across vastly different spatial and temporal scales. Biophysical and environmental responses to global change generally depend strongly on environmental (e.g., soil type), socioeconomic (e.g., farm management), and climatic factors that can vary substantially across regions at high spatial resolution.
Global Gridded Crop Models (GGCMs) are designed to capture this spatial heterogeneity and simulate crop yields and climate impacts at continental or global scales. Site-based GGCMs, like those described here, aggregate local high-resolution simulations and are often limited by data availability and quality at the scales required by a large-scale campaign. Obtaining the data inputs necessary for a comprehensive high-resolution assessment of crop yields and climate impacts typically requires tremendous effort by researchers, who must catalog, assimilate, test, and process multiple data sources with vastly different spatial and temporal scales and extents. Accessing, understanding, scaling, and integrating diverse data typically involves a labor-intensive and error-prone process that creates a custom data processing pipeline. A comparably complex set of transformations must often be performed once impact simulations have been completed to produce information products for a wide range of stakeholders, including farmers, policy-makers, markets, and agro-business interests. Thus, simulation outputs, like inputs, must be available at all scales and in familiar and easy-to-use formats.

To address these challenges, and thus facilitate access to high-resolution climate impact modeling, we are developing a suite of tools, data, and models called the parallel System for Integrating Impact Models and Sectors (pSIMS). Using an integrated multi-model, multi-sector simulation approach, pSIMS leverages automated data ingest and transformation pipelines and high-performance computing to enable researchers to address key challenges. We present in this paper the pSIMS structure, data, and methodology; describe a prototype use case, validation methodology, and key input datasets; and summarize features of the software and computational architecture that enable large-scale simulations.
Figure 1: Schematic of the pSIMS workflow. 1) Data ingest takes data from numerous publicly available archives in arbitrary file formats and data types. 2) The standardization step reconstitutes each of these datasets into the highly portable point-based .psims.nc4 format. 3) The specification step defines a simulation campaign for pSIMS by choosing a .psims.nc4 dataset or subset and one or more climate impact models (and the requisite custom input and output data translators) from the code library. 4) In the translation step, individual .psims.nc4 files (each representing one grid cell in the spatial simulation) are converted into the custom file formats required by the models. 5) In the simulation step, individual simulation calls within the campaign are managed by Swift on the available high-performance resource. 6) In the output reformatting step, model outputs (dozens or even hundreds of time-series variables from each run) are extracted from model-specific custom output formats and translated into a standard .psims.out format. 7) Finally, in the aggregation step, output variables are masked as necessary and aggregated to arbitrary decision-relevant spatial or temporal scales.
Our goals in developing this framework are fourfold, to: 1) develop tools to assimilate relevant data from arbitrary sources; 2) enable large ensemble simulations, at high resolution and with continental/global extent, of the impacts of a changing climate on primary industries; 3) produce tools that can aggregate simulation output consistently to arbitrary boundaries; and 4) characterize uncertainties inherent in the estimates by using multiple models and simulating with large ensembles of inputs. Figure 1 shows the principal elements of pSIMS, including both automated steps (the focus of this paper) and those that require an interactive component (such as specification and aggregation). The framework is separated into four major components: data ingest and standardization, user-initiated campaign specification, parallel implementation of specified simulations, and aggregation to arbitrary decision-relevant scales.

pSIMS is designed to support integration of any point-based climate impact model that requires standard daily weather data as inputs. We developed the framework prototype with versions 4.0 and 4.5 of the Decision Support System for Agrotechnology Transfer (DSSAT) [15]. DSSAT 4.5 supports models of 28 distinct crops using two underlying biophysical models (CERES and CropGRO). Additionally, we have prototyped and tested a parallel version of the CenW [16] forest growth simulation model and are actively integrating additional crop yield and climate impact models, starting with the widely used Agricultural Production Systems Simulator (APSIM) [23]. As of this writing, continental- to global-scale simulation experiments ranging in resolution from 3 to 30 arcminutes have been conducted on four crop species and one tree species (Pinus radiata) using a variety of weather/climate inputs and scenarios.
2. THE pSIMS DATA INPUT PIPELINE
The minimum weather data requirements for crop and climate impact models such as DSSAT, APSIM, and CenW are typically:
- Daily maximum temperature (degrees C at 2 m above the ground surface)
- Daily minimum temperature (degrees C at 2 m)
- Daily average downward shortwave radiation flux (W/m2 at the ground surface)
- Total daily precipitation (mm/day at the ground surface)
In addition, it is sometimes desirable to include daily average wind speed and surface humidity if available. Hundreds of observational datasets, model-based reanalyses, and multi-decadal climate model simulation outputs at regional and global scales exist that can be used to drive climate impact simulations. For some use cases, one or another data product may be demonstrably superior, but most often a clear understanding of the range of outcomes and uncertainty requires simulations with an ensemble of inputs. For this reason, input data requirements for high-resolution climate impact experiments can be large, and data processing and management can be challenging.

One major challenge in dealing with long daily time series of high-resolution weather data is that input products are typically stored one or more variables at a time in large spatial raster files segmented into annual, monthly, or sometimes even daily or sub-daily time slices. Point-based impact models, on the other hand, typically require custom ASCII data types that encode long time series of daily data for all required variables for a single point into one file. The process of accessing hundreds or thousands of spatial raster files O(10^5–10^6) times to extract the time series for each point is time-consuming and expensive [20]. To ameliorate this challenge, we have established a NetCDF4-based data type for the pSIMS framework, identified by a .psims.nc4 extension, and tools for translating existing archives into this format (Figure 2). Each .psims.nc4 file represents a single grid cell or simulation site within the study area. Its contents are one or more 1 x 1 x T arrays, one per variable, where T is the number of time steps. The time coordinates of each array are explicit in the definition of their dimensions, a strategy that facilitates near-real-time updates from upstream sources. The spatial coordinates of each array are also explicit in the definition of their dimensions, which facilitates downstream spatial merging. Variables are grouped and named so as to avoid namespace collisions with variables from other sources in downstream merging; e.g., "narr/tmax" refers to the daily maximum temperature variable extracted from the North American Regional Reanalysis (NARR) dataset and "cfsrr/tmax" to the daily maximum temperature from the NOAA Climate Forecast System Reanalysis and Reforecast (CFSRR) dataset.
Figure 2: Expanded schematic of data types and processing pipeline from Figure 1. The steps in the pipeline are as follows. 1) Data is ingested from arbitrary sources in various file formats and data types. 2) If necessary, the data is transformed to a geographic projection. 3) For each land grid cell, the full time series of daily data is extracted (in parallel) and converted to the .psims.nc4 format. 4) Each set of .psims.nc4 files extracted from a dataset is then organized into an archive for long-term storage. 5) If the input dataset is still being updated, we ingest and process the updates at regular intervals and 6) append the updates to the existing .psims.nc4 archive.
Because a given .psims.nc4 file contains information about a single point location, the longitude and latitude vectors used as array dimensions are both of length one; the typical weather variable is therefore represented as a 1 x 1 x T array, where T is the number of time steps in the source data. In contrast, arrays containing forecast variables have two time dimensions: the time the forecast is made and the time of the prediction. This use of two time dimensions makes it possible to follow the evolution of a series of forecasts made over a period of time for a particular future date, as for example when forecasts of crop yields at a particular location are refined over time as more information becomes available.

pSIMS input files are named according to their (row, col) tuple in the global grid and organized in a similar directory structure (e.g., /narr/361/1168/361_1168.psims.nc4), so that each terminal directory holds data for a single site. This improves the performance of parallel reads and writes on shared filesystems like GPFS and minimizes clutter while browsing the tree. Because files represent points, the resolution is set only by the organization of the enclosing directory tree and is recorded in an archive metadata file, as well as in the metadata of each .psims.nc4 file.

Climate data from observations and simulations is widely available in open and easily accessible archives, such as those served by the Earth System Grid (ESG) [30], the NOAA NOMADS servers [26], and NASA's Modeling and Assimilation Data and Information Services Center (MDISC). These datasets have substantial metadata, often based on Climate and Forecasting (CF) conventions [8], and are frequently identified by a unique Digital Object Identifier (DOI) [21], making provenance and tracking relatively straightforward. Such data is often archived at uniform grid resolutions, but scales can vary from a few kilometers to one or more degrees and frequencies from an hour to a month. Furthermore, different map projections mean that transformations are sometimes necessary before data can be used.
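To make the single-site layout described above concrete, the short Python sketch below (using the netCDF4 library) writes a minimal .psims.nc4-style file containing one 1 x 1 x T daily maximum temperature array in a source-specific group. The file name, coordinate values, and placeholder data are assumptions for illustration only; the actual pSIMS translators may differ in detail.

    # Minimal sketch of a single-site .psims.nc4-style file (illustrative only).
    import numpy as np
    from netCDF4 import Dataset

    ndays = 10958                                 # e.g., daily records for 1980-2009
    tmax = 20.0 + 10.0 * np.random.rand(ndays)    # placeholder daily Tmax series

    with Dataset("361_1168.psims.nc4", "w", format="NETCDF4") as nc:
        # One point per file: longitude and latitude dimensions have length one.
        nc.createDimension("longitude", 1)
        nc.createDimension("latitude", 1)
        nc.createDimension("time", ndays)

        lon = nc.createVariable("longitude", "f8", ("longitude",))
        lat = nc.createVariable("latitude", "f8", ("latitude",))
        time = nc.createVariable("time", "f8", ("time",))
        lon[:] = -88.25                           # explicit (assumed) coordinates
        lat[:] = 41.75
        time[:] = np.arange(ndays)
        time.units = "days since 1980-01-01"

        # A source-specific group avoids namespace collisions in downstream merges.
        grp = nc.createGroup("narr")
        v = grp.createVariable("tmax", "f4", ("latitude", "longitude", "time"),
                               zlib=True)         # the 1 x 1 x T array
        v.units = "degrees C"
        v[0, 0, :] = tmax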
3. AN EXAMPLE ASSESSMENT STUDY
We now present an example of a maize yield assessment and validation exercise conducted in the conterminous US at five-arcminute spatial resolution. For each grid cell in the study region, we simulate eight management configurations (detailing fertilizer and irrigation application) from 1980 to 2009 using the CERES-Maize model (Figure 3). Simulations are driven by observation- and reanalysis-based data products comprising NCEP Climate Forecast System Reanalysis (CFSR) temperatures [13], NOAA Climate Prediction Center (CPC) US Unified Precipitation [23], and the NASA Surface Radiation Budget (SRB) solar radiation dataset [27].
Figure 3: Time-series yields for a single site for four fertilizer application rates with (solid) and without (dashed) irrigation.

The gridded yield for crop i in grid cell x at time t takes the form

$Y_i(x,t) = Y_i\big(Q(x,t),\, S(x),\, M_i(x,t)\big)$,

where we denote explicitly the dependence of yield on local climate (Q; including temperature, precipitation, solar radiation, and CO2), soils (S), and farm management (M; including planting dates, cultivars, fertilizers, and irrigation). Figure 4 shows summary measures (median and standard deviation) over the historical simulation period for a single management configuration (rainfed with high nitrogen fertilizer application rates) from this campaign.

High-resolution yield maps allow researchers to visualize the changing patterns of crop yields. However, such measures must typically be aggregated at some decision-relevant environmental or political scale, such as county, watershed, state, river basin, nation, or continent, for use in models or decision-maker analyses. To this end, we are developing tools to aggregate yields and climate impact measures to arbitrary spatial scales. For a given individual region R, crop i, and time t, the basic problem of rescaling local yields (Y) to regional production (P) takes the form

$P_i(R,t) = \sum_{x \in R} W_i(x,t)\, Y_i(x,t)$,

where the weighting function for aggregation, $W_i(x,t)$, is the area in location x at time t that is engaged in the production of crop i.
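As an illustration of this aggregation, the following Python sketch computes regional production and area-weighted average yield from gridded arrays; the array names and the rasterized region mask are assumptions for illustration, not the actual pSIMS aggregation tools.

    # Illustrative aggregation of gridded yields to a region R (assumed inputs):
    #   yields    -- (T, NY, NX) simulated yields Y_i(x, t)
    #   harvested -- (T, NY, NX) harvested area W_i(x, t) in each cell
    #   region_id -- (NY, NX) integer raster labelling counties/states/basins
    import numpy as np

    def regional_production(yields, harvested, region_id, region):
        """P_i(R, t) = sum over x in R of W_i(x, t) * Y_i(x, t)."""
        mask = (region_id == region)
        return np.nansum((yields * harvested)[:, mask], axis=1)

    def regional_yield(yields, harvested, region_id, region):
        """Area-weighted average yield, for comparison with survey statistics."""
        mask = (region_id == region)
        prod = np.nansum((yields * harvested)[:, mask], axis=1)
        area = np.nansum(harvested[:, mask], axis=1)
        return prod / area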
A further use of aggregation is to produce measures of yield that can be compared with survey data at various spatial scales for model validation and uncertainty quantification. For example, yield measures are aggregated to the county and state level and compared with survey data from the US National Agricultural Statistics Service (NASS) [3] by calculating time-series correlations between simulated and observed values (Figure 5). Next we consider RMSE measures for the simulated vs. observed yields at the various scales. Figure 6 plots the simulated vs. observed yields at the county and state level: in total, 60,423 observations over 30 years and 2,373 counties and 1,225 observations over 41 states, respectively. The root mean square error (RMSE) between observed yields and prototype simulated yields is 25% of the sample mean at the county level, and an unconstrained linear fit of the simulation vs. observation results has an R2 of 0.6. At the state level, the RMSE is 15.7% of the sample mean, and at the national level it is reduced to 10.8% (13.6 bushels/acre) over the test period. Given the simplicity of the experiment, these results compare favorably with other recent spatial simulation/validation exercises at the state level.
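A minimal sketch of these validation statistics is given below, assuming simulated and observed county-level yields have already been aligned into arrays of shape (counties, years); the function and variable names are hypothetical.

    # Illustrative validation metrics (assumed inputs: sim, obs with shape
    # (n_counties, n_years), already aligned and masked for missing data).
    import numpy as np

    def timeseries_correlations(sim, obs):
        """One correlation coefficient per county, as mapped in Figure 5."""
        return np.array([np.corrcoef(s, o)[0, 1] for s, o in zip(sim, obs)])

    def rmse_percent(sim, obs):
        """Pooled RMSE, also expressed as a percentage of the observed mean."""
        rmse = np.sqrt(np.mean((sim - obs) ** 2))
        return rmse, 100.0 * rmse / np.mean(obs)

    def r_squared(sim, obs):
        """R^2 of an unconstrained linear fit of simulated vs. observed yields."""
        x, y = obs.ravel(), sim.ravel()
        slope, intercept = np.polyfit(x, y, 1)
        residuals = y - (slope * x + intercept)
        return 1.0 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)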
Figure 4: Simulation outputs of a single pDSSAT campaign. Maps of the median (top) and standard deviation (bottom) of simulated potential rainfed/high-input maize yield.

Figure 5: Time-series correlations between county-level aggregates of simulated maize yields and NASS statistics from 1980-2009 for pDSSAT simulations using CFSR temperatures, CPC precipitation, and SRB solar radiation. Only counties for which NASS records a minimum of six years of data with an average of more than 500 maize-cultivated hectares per year are shown.

4. USE OF Swift PARALLEL SCRIPTING
Each pSIMS model run requires the execution of 10,000 to 120,000 fairly small serial jobs, one per geographic grid cell, each using one CPU core for a few seconds to a few minutes. The Swift [29] parallel scripting language has made it straightforward to write and execute pSIMS runs, using highly portable and system-independent scripts such as the example in Listing 1. Space does not permit a detailed description of this program, but in brief, it defines a machine-independent interface to the DSSAT executable, loads a list of grids on which that executable is to be run, and then defines a set of DSSAT tasks, one per grid.

The Swift language is implicitly parallel, high-level, and functional. It automates the difficult and science-distracting tasks of distributing tasks and data across multiple remote systems, and of retrying failing tasks and restarting failing workflow runs. Its runtime automatically manages the execution of tens of thousands of small single-core or multi-core jobs, and dynamically packs those jobs tightly onto multiple nodes of the allocated computing resources to maximize system utilization. Swift automates the acquisition of nodes; inter-job data dependencies; throttling; scheduling and dispatch of work to cores; and retry of failing jobs. Swift is used in many science domains, including earth systems, energy economics, theoretical chemistry, protein science, and graph analysis. It is a simple scripting system for users to install and use, and it enables many new classes of users to get on board and run productively in a remarkably short time.
The key to Swift’s ability to execute large numbers of small tasks efficiently on large parallel computers is its use of a two-level scheduling strategy. Internally, Swift launches a pilot job called a “coaster” on each node within a resource pool [12]. The Swift runtime then manages the dispatch of application invocations, plus any data that these tasks require, to those coasters, which manage their execution on compute nodes. As tasks finish, Swift schedules more work to those nodes, achieving high CPU utilization even for fine-grained workloads. Individual tasks can be serial, OpenMP, MPI, or other parallel applications. Swift makes computing location independent, allowing us to run pSIMS on a variety of grids, supercomputers, clouds, and clusters, with the same pSIMS script used on multiple distributed sites and diverse computing resources. Figure 7 shows a typical execution scenario, in which pSIMS is run across two University of Chicago campus resources: the UChicago Campus Computing Cooperative (UC3) [6] and the UChicago Research Computing Center cluster “Midway.” Over the past two years the RDCEP team has
performed a large amount of impact modeling using this framework. In the past 10 months alone, more than 80 pSIMS impact runs have been performed, totaling over 5.6 million DSSAT runs and a growing number of runs with other models such as CenW: see Table 1.
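The two-level scheduling idea described above can be pictured with a generic sketch, which is not Swift's implementation: long-lived workers (analogous to coasters) repeatedly pull short simulation tasks from a shared queue, so per-task scheduling overhead is paid once per worker rather than once per task.

    # Generic illustration of pilot-job style task farming (not Swift's code).
    import multiprocessing as mp
    import subprocess

    def worker(task_queue):
        # A long-lived worker, analogous to a coaster, pulls many short tasks.
        while True:
            cmd = task_queue.get()
            if cmd is None:                  # sentinel: no more work
                return
            subprocess.run(cmd, shell=True)  # run one small simulation task

    if __name__ == "__main__":
        tasks = ["echo simulating grid cell %d" % i for i in range(1000)]
        queue = mp.Queue()
        workers = [mp.Process(target=worker, args=(queue,)) for _ in range(8)]
        for w in workers:
            w.start()
        for t in tasks:
            queue.put(t)
        for _ in workers:
            queue.put(None)                  # one sentinel per worker
        for w in workers:
            w.join()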
Listing 1: Swift script for the pSIMS ensemble simulation study

    // 1) Define RunDSSAT as an interface to the DSSAT program
    // a) Specify the function prototype
    app (file tar_out, file part_out) RunDSSAT (string x_file, file scenario[],
        file weather[], file common[], file exe[], file perl[], file wrapper)
    // b) Specify how to invoke the DSSAT executable
    {
        bash "-c" @strcats(" ",
            "chmod +x ./RunDSSAT.sh ; ./RunDSSAT.sh ", x_file, @tar_out,
            @arg("years"), @arg("num_scenarios"), @arg("variables"),
            @arg("crop"), @arg("executable"),
            @scenario, @weather, @common, @exe, @perl);
    }

    // 2) Read the list of grids over which computation is to be performed
    string gridLists[] = readData("gridList.txt");
    string xfile = @arg("xfile");

    // 3) Invoke RunDSSAT once for each grid; different calls are concurrent
    foreach g,i in gridLists {
        // a) Map the various input and output files to Swift variables
        // (file mapper expressions omitted)
        file tar_output;
        file part_output;
        file scenario[];
        file weather[];
        file common[];
        file exe[];
        file perl[];
        file wrapper;
        // b) Call RunDSSAT
        (tar_output, part_output) = RunDSSAT(xfile, scenario, weather,
            common, exe, perl, wrapper);
    }
Figure 6: Simulated county-level (top) and state-level (bottom) average corn yields plotted against USDA survey data. Over 1980-2009, there are 60,423 data points from 2,373 counties and 1,225 data points from 41 states. The red line is y = x, with an RMSE between the simulated and observed yields of 26 bu/A (25% of the sample mean) at the county level and 18.6 bu/A (15.7% of the sample mean) at the state level. The black line is the best-fit linear model, with R2 = 0.60 at the county level and 0.73 at the state level.
We have run pSIMS over the UC3 campus computing cooperative [6], the Open Science Grid [22], and the UChicago Midway research cluster; the XSEDE clusters Ranger and its successor Stampede; and, in the past, on Beagle, PADS, and Raven. Through Swift's interface to the Nimbus cloudinit.d [5], we have run on the NSF FutureGrid and can readily run on commercial cloud services.
Table 1: Summary of campaign execution by project, including the total number of jobs in each campaign, the total number of simulation units (jobs x scenarios x years), the total model CPU time, and the total size of the outputs generated.

    Project          Campaigns   Jobs (M)   Sim Units (Billion)   CPU Hours (K)   Output data (TBytes)
    NARCCAP USA      16          1.3        0.47                  13              1.9
    ISI-MIP Global   80          4.14       4.38                  216             11.8
    Prediction 2012  2           0.2        0.24                  2               0.5
Qualitatively, our experience with the pSIMS framework and Swift has been a striking confirmation that portable scripts can indeed enable the use of diverse and distributed HPC resources for many-task computing by scientists with no prior experience in HPC and with little time to learn and apply such skills. After the scripts were initially developed and solidified by developers on the Swift team, execution and extension of the framework was managed by the project science team, who had no prior experience with Swift and only modest prior experience in shell and Perl scripting.
Figure 7: Typical Swift configuration for pSIMS execution.
Whether due to bugs, human error, machine interruption, or other issues, science workflows must often be restarted or re-run. Swift provides a simple and intuitive re-execution mechanism to pick up interrupted simulations where they left off and complete the simulation campaign. pSIMS has proven its value in enabling entire projects (millions of jobs) to be re-executed when application code errors were discovered.

Swift also enables us to take advantage of new systems and of systems with idle capacity (e.g., over holidays). We recently demonstrated such multi-system runs using a similar geospatially gridded application for analyzing MODIS land use data. The user was able to dispatch 3,000 jobs from a single submit host (on Midway) to five different computer systems (UC3, Midway, Beagle, MWT2, and OSG). The user initially validated correct execution by processing 10 files on a basic login host, and then scaled up to a 3,000-file dataset, changing only the dataset name and a site-specification list to reach the additional resources. Swift's transparent multi-site capability expanded the scope of these computations from one node to hundreds or thousands of cores (about 500 cores were used for most production runs, and tests of full-scale runs were executed on up to 4,096 cores on the XSEDE Ranger system). The user did not need to check which sites were busy or adjust arcane scripts to reach these resources.

The throughput achieved by Swift varies according to the nature of the system on which it is run. For example, Figure 8 shows an execution profile for a heavily loaded University of Chicago UC3 system. The "load" (number of concurrently running tasks) varies greatly over time, due to other users' activity. At the other end of the scale, on systems such as the (former) TACC Ranger and its successor Stampede, where local login hosts are not suited for running even modest multi-core jobs such as the Swift Java client, Swift provides the flexibility of running both the client and all its workers within a single multi-node HPC job. The performance of such a run is shown in Figure 9. This performance profile reflects the absence of competition for cluster resources at the time of execution: because the allocated partition was almost completely idle, Swift was able to quickly ramp up the system and keep it fully utilized for a significant portion of the run. Similar flexibility exists to run Swift from an interactive compute node and to send application invocation tasks to compute nodes composed from one or more separate HPC jobs.
Figure 8: Task execution progress for the ~120,000-task simulation campaign described in Section 3. This particular run was performed on the University of Chicago’s UC3 system. Time is in seconds.
Figure 9: Results from a large scale pSIMS run on XSEDE’s Ranger supercomputer. Using 256 nodes (4096 cores), the full run completes in 16 minutes.
Swift has also been instrumental in the preparation of input data for our DSSAT simulation campaigns. In the case of the NARR data it was necessary to process hundreds of thousands of six-hour weather reports in individual gridded binary (GRB) files. Swift managed the format conversion, spatial resampling, aggregation to daily statistics, and collation into annual netCDF files as precursors to the individual point-based time series used in the simulation. By exposing generic command line utilities for manipulating the various file types as Swift functions it is possible to express the transformations required in an intuitive fashion.
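As a sketch of the daily-statistics step only, and not of the actual NARR pipeline, the fragment below uses the xarray library to reduce six-hourly fields to daily maxima, minima, sums, and means; the file and variable names are assumptions.

    # Illustrative reduction of six-hourly weather fields to daily statistics.
    import xarray as xr

    ds = xr.open_dataset("narr_2009_6hourly.nc")   # hypothetical merged input

    daily = xr.Dataset({
        "tmax":   ds["air_temperature"].resample(time="1D").max(),
        "tmin":   ds["air_temperature"].resample(time="1D").min(),
        "precip": ds["precipitation"].resample(time="1D").sum(),
        "srad":   ds["shortwave_radiation"].resample(time="1D").mean(),
    })

    # One annual NetCDF file, later split into per-point .psims.nc4 time series.
    daily.to_netcdf("narr_2009_daily.nc")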
5. FUTURE OBJECTIVES
Ongoing efforts related to the development of this framework fall into two general themes: facilitating the application of other biophysical crop production models and enabling the management of simulation campaigns through web-based portals. The process of transforming a new weather input dataset is a painstaking one that has so far been difficult to generalize, so our approach has been to decouple the data preparation phase of the processing pipeline from the simulation and summarizing phases. The point of contact between these activities is the .psims.nc4 format. As an intermediate representation of the data that drives our simulations, it gives us a target for implementing translations both from upstream data archives and into the formats expected by the various crop modelling packages.

We are also integrating pSIMS into two different web-based eScience portals to improve usability and expand access to this powerful tool. The Framework to Advance Climate, Economics, and Impacts Investigations with Information Technology (FACE-IT) is a prototype eScience portal and workflow engine based on Galaxy [11]. As a first step towards integration of pSIMS with FACE-IT, we have prototyped a pSIMS workflow in Galaxy: see Figure 10. Working with researchers from iPlant and TACC, we have also prototyped pSIMS as an application in the iPlant Discovery Environment [18]: see Figure 11. The latter application has been run on Stampede and Ranger.

Figure 10: Prototype Galaxy-based FACE-IT portal.

Figure 11: Prototype pSIMS integration in iPlant.

6. DISCUSSION
The parallel System for Integrating Impact Models and Sectors (pSIMS) is a new framework for efficient implementation of large-scale assessments of climate vulnerabilities, impacts, and adaptations across multiple sectors and at unprecedented scales. pSIMS delivers two advances in software and applications for climate impact modeling: 1) a high-performance data ingest and processing pipeline that generates a standardized and evolving collection of socio-economic, environmental, and climate datasets based on a portable and efficient new data format; and 2) a code base that enables large-scale, high-resolution simulations of the impacts of a changing climate on primary production (agriculture, livestock, and forestry) using arbitrary point-based climate impact models.
Both advances are enabled by high-performance computing leveraged with the Swift parallel scripting language. pSIMS also contains tools for data translation of both the input and output formats of various models; specifications for integrating translators developed in AgMIP; and tools for aggregation and scaling of the resulting simulation outputs to arbitrary spatial and temporal scales relevant for decision support, validation, and downstream model coupling. This framework has been used for high-resolution crop yield and climate impact assessments at the US [10] and global levels [9, 25].

ACKNOWLEDGMENTS
We thank Pierre Riteau and Kate Keahey for help running pSIMS on cloud resources with Nimbus; Stephen Welch of AgMIP, Dan Stanzione of iPlant and the Texas Advanced Computing Center (TACC), and John Fonner and Matthew Vaughn of TACC for help executing Swift workflows on Ranger and Stampede, and under the iPlant portal; and Ravi Madduri of Argonne and UChicago for help placing pSIMS under the Galaxy portal. This work was supported in part by the National Science Foundation under grants SBE-0951576 and GEO-1215910. Swift is supported in part by NSF grant OCI-1148443. Computing for this project was provided by a number of sources, including the XSEDE Stampede machine at TACC, the University of Chicago Computing Cooperative, the University of Chicago Research Computing Center, and the NIH, with resources provided by the Computation Institute and the Biological Sciences Division of the University of Chicago and Argonne National Laboratory under grant S10 RR029030-01. NASA SRB data were obtained from the NASA Langley Research Center Atmospheric Sciences Data Center NASA/GEWEX SRB Project. CPC US Unified Precipitation data were obtained from the NOAA/OAR/ESRL PSD, Boulder, Colorado, USA [2].

REFERENCES
1. AgMIP source code repository. [Accessed April 1, 2013]; Available from: https://github.com/agmip.
2. Earth System Research Laboratory, Physical Sciences Division. [Accessed April 1, 2013]; Available from: http://www.esrl.noaa.gov/psd/.
3. National Agricultural Statistics Service, County Data Release Schedule. [Accessed April 1, 2013]; Available from: http://1.usa.gov/bJ4VQ6.
4. Bondeau, A., Smith, P.C., Zaehle, S., Schaphoff, S., Lucht, W., Cramer, W., Gerten, D., Lotze-Campen, H., Müller, C., Reichstein, M. and Smith, B. Modelling the role of agriculture for the 20th century global terrestrial carbon balance. Global Change Biology, 13(3):679-706, 2007.
5. Bresnahan, J., Freeman, T., LaBissoniere, D. and Keahey, K. Managing Appliance Launches in Infrastructure Clouds. TG'11: 2011 TeraGrid Conference: Extreme Digital Discovery, Salt Lake City, UT, USA, 2011.
6. Bryant, L. UC3: A Framework for Cooperative Computing at the University of Chicago. Open Science Grid Computing Infrastructures Community Workshop, http://1.usa.gov/ZXum6q, University of California Santa Cruz, 2012.
7. Deryng, D., Sacks, W.J., Barford, C.C. and Ramankutty, N. Simulating the effects of climate and agricultural management practices on global crop yield. Global Biogeochemical Cycles, 25(2), 2011.
8. Eaton, B., Gregory, J., Drach, B., Taylor, K., Hankin, S., Caron, J., Signell, R., Bentley, P., Rappa, G., Höck, H., Pamment, A. and Juckes, M. NetCDF Climate and Forecast (CF) Metadata Conventions, version 1.5. Lawrence Livermore National Laboratory, http://cf-pcmdi.llnl.gov/documents/cf-conventions/1.5/cf-conventions.html, 2010.
9. Elliott, J., Deryng, D., Müller, C., Frieler, K., Konzmann, M., Gerten, D., Glotter, M., Flörke, M., Wada, Y., Eisner, S., Folberth, C., Foster, I., Gosling, S., Haddeland, I., Khabarov, N., Ludwig, F., Masaki, Y., Olin, S., Rosenzweig, C., Ruane, A., Satoh, Y., Schmid, E., Stacke, T., Tang, Q. and Wisser, D. Constraints and potentials of future irrigation water availability on global agricultural production under climate change. Proceedings of the National Academy of Sciences, Submitted to ISI-MIP Special Issue, 2013.
10. Elliott, J., Glotter, M., Best, N., Boote, K.J., Jones, J.W., Hatfield, J.L., Rosenzweig, C., Smith, L.A. and Foster, I. Predicting Agricultural Impacts of Large-Scale Drought: 2012 and the Case for Better Modeling. RDCEP Working Paper No. 13-01, http://ssrn.com/abstract=2222269, 2013.
11. Goecks, J., Nekrutenko, A., Taylor, J. and The Galaxy Team. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol, 11(8):R86, 2010.
12. Hategan, M., Wozniak, J. and Maheshwari, K. Coasters: Uniform Resource Provisioning and Access for Clouds and Grids. Fourth IEEE International Conference on Utility and Cloud Computing (UCC '11), Washington, DC, USA, 2011, IEEE Computer Society, 114-121.
13. Higgins, R.W. Improved US Precipitation Quality Control System and Analysis. NCEP/Climate Prediction Center ATLAS No. 6 (in preparation), 2000.
14. Izaurralde, R.C., Williams, J.R., McGill, W.B., Rosenberg, N.J. and Jakas, M.C.Q. Simulating soil C dynamics with EPIC: Model description and testing against long-term data. Ecological Modelling, 192(3-4):362-384, 2006.
15. Jones, J.W., Hoogenboom, G., Porter, C.H., Boote, K.J., Batchelor, W.D., Hunt, L.A., Wilkens, P.W., Singh, U., Gijsman, A.J. and Ritchie, J.T. The DSSAT cropping system model. European Journal of Agronomy, 18(3-4):235-265, 2003.
16. Kirschbaum, M.U.F. CenW, a forest growth model with linked carbon, energy, nutrient and water cycles. Ecological Modelling, 118(1):17-59, 1999.
17. Leemans, R. and Solomon, A.M. Modeling the potential change in yield and distribution of the earth's crops under a warmed climate. Environmental Protection Agency, Corvallis, OR (United States). Environmental Research Lab., 1993.
18. Lenards, A., Merchant, N. and Stanzione, D. Building an environment to facilitate discoveries for plant sciences. 2011 ACM Workshop on Gateway Computing Environments (GCE '11), 2011, ACM, 51-58.
19. Liu, J., Williams, J.R., Zehnder, A.J.B. and Yang, H. GEPIC – modelling wheat yield and crop water productivity with high resolution on a global scale. Agricultural Systems, 94(2):478-493, 2007.
20. Malik, T., Best, N., Elliott, J., Madduri, R. and Foster, I. Improving the efficiency of subset queries on raster images. HPDGIS '11: Second International Workshop on High Performance and Distributed Geographic Information Systems, Chicago, Illinois, USA, 2011, ACM, 34-37.
21. Paskin, N. Digital Object Identifiers for scientific data. Data Science Journal, 4:12-20, 2005.
22. Pordes, R., Petravick, D., Kramer, B., Olson, D., Livny, M., Roy, A., Avery, P., Blackburn, K., Wenaus, T., Würthwein, F., Foster, I., Gardner, R., Wilde, M., Blatecky, A., McGee, J. and Quick, R. The Open Science Grid. Scientific Discovery through Advanced Computing (SciDAC) Conference, 2007.
23. Probert, M.E., Dimes, J.P., Keating, B.A., Dalal, R.C. and Strong, W.M. APSIM's water and nitrogen modules and simulation of the dynamics of water and nitrogen in fallow systems. Agricultural Systems, 56(1):1-28, 1998.
24. Rosenzweig, C., Jones, J.W., Hatfield, J.L., Ruane, A.C., Boote, K.J., Thorburn, P., Antle, J.M., Nelson, G.C., Porter, C., Janssen, S., Asseng, S., Basso, B., Ewert, F., Wallach, D., Baigorria, G. and Winter, J.M. The Agricultural Model Intercomparison and Improvement Project (AgMIP): Protocols and pilot studies. Agricultural and Forest Meteorology, 170:166-182, 2013.
25. Rosenzweig, C. and others. Assessing agricultural risks of climate change in the 21st century in a global gridded crop model intercomparison. Proceedings of the National Academy of Sciences, Submitted to ISI-MIP Special Issue, 2013.
26. Rutledge, G.K., Alpert, J. and Ebisuzaki, W. NOMADS: A Climate and Weather Model Archive at the National Oceanic and Atmospheric Administration. Bulletin of the American Meteorological Society, 87(3):327-341, 2006.
27. Stackhouse, P.W., Gupta, S.K., Cox, S.J., Mikovitz, J.C., Zhang, T. and Hinkelman, L.M. The NASA/GEWEX Surface Radiation Budget Release 3.0: 24.5-Year Dataset. GEWEX News, 21(1):10-12, 2011.
28. Waha, K., van Bussel, L.G.J., Müller, C. and Bondeau, A. Climate-driven simulation of global crop sowing dates. Global Ecology and Biogeography, 21(2):247-259, 2012.
29. Wilde, M., Foster, I., Iskra, K., Beckman, P., Zhang, Z., Espinosa, A., Hategan, M., Clifford, B. and Raicu, I. Parallel Scripting for Applications at the Petascale and Beyond. IEEE Computer, 42(11):50-60, 2009.
30. Williams, D.N., Ananthakrishnan, R., Bernholdt, D.E., Bharathi, S., Brown, D., Chen, M., Chervenak, A.L., Cinquini, L., Drach, R., Foster, I.T., Fox, P., Fraser, D., Garcia, J., Hankin, S., Jones, P., Middleton, D.E., Schwidder, J., Schweitzer, R., Schuler, R., Shoshani, A., Siebenlist, F., Sim, A., Strand, W.G., Su, M. and Wilhelmi, N. The Earth System Grid: Enabling Access to Multi-Model Climate Simulation Data. Bulletin of the American Meteorological Society, 90(2):195-205, 2009.