Environmental Modelling & Software 61 (2014) 174–190
http://dx.doi.org/10.1016/j.envsoft.2014.07.015


Automating data-model workflows at a level 12 HUC scale: Watershed modeling in a distributed computing environment

Lorne Leonard, Christopher J. Duffy
Department of Civil & Environmental Engineering, The Pennsylvania State University, 212 Sackett Building, University Park, PA 16802, USA

Article info

Article history: Received 17 January 2014; Received in revised form 28 June 2014; Accepted 31 July 2014; Available online.

Keywords: Distributed hydrological model; Data workflows; Data-model workflows; Model workflows; Provenance; Essential terrestrial variables; HydroTerre; PIHM; Geographic information science; Data as a service; Model as a service

Abstract

The prototype discussed in this article retrieves Essential Terrestrial Variable (ETV) web services and uses data-model workflows to transform ETV data for hydrological models in a distributed computing environment. The ETV workflow is a service layer to hundreds of terabytes of national datasets bundled for fast data access in support of watershed modeling using the United States Geological Survey (USGS) Hydrological Unit Code (HUC) level-12 scale. The ETV data has been proposed as the Essential Terrestrial Data necessary to construct watershed models anywhere in the continental USA (Leonard and Duffy, 2013). Here, we present the hardware and software system designs to support the ETV, data-model, and model workflows using High Performance Computing (HPC) and service-oriented architecture. This infrastructure design is an important contribution to both how and where the workflows operate. We describe details of how these workflow services operate in a distributed manner for modeling CONUS HUC-12 catchments using the Penn State Integrated Hydrological Model (PIHM) as an example. The prototype is evaluated by generating data-model workflows for every CONUS HUC-12 and creating a repository of workflow provenance for every HUC-12 (~100 km²) for use by researchers as a strategy to begin a new hydrological model study. The concept of provenance for data-model workflows developed here assures reproducibility of model simulations (e.g. reanalysis) from ETV datasets without storing model results, which we have shown would require many petabytes of storage. © 2014 Elsevier Ltd. All rights reserved.

Software availability

Name: HydroTerre
Developer: Lorne Leonard, Department of Civil Engineering & Penn State Institutes of Energy and the Environment, The Pennsylvania State University
Contact information: Christopher J. Duffy & Lorne Leonard, Department of Civil & Environmental Engineering, The Pennsylvania State University, 212 Sackett Building, University Park, PA 16802, USA
Software required: Internet browser (later versions are recommended)
Program language: C++, C#, Microsoft SQL, ArcGIS, Silverlight, COM, HTML, JavaScript
Availability and cost: Any user can access HydroTerre web applications at no cost at: http://www.hydroterre.psu.edu

1. Introduction

The HydroTerre web services (www.hydroterre.psu.edu) provide the Essential Terrestrial Variable (ETV) datasets to create common hydrological models anywhere in the continental United States (CONUS) (Leonard and Duffy, 2013). This service allows web users to download data for applications within their own computing environment. The datasets are provided using standard Geographic Information Science (GIS) formats appropriate for end-users to transform data for their needs, goals, and computing environment. In this article, we demonstrate the feasibility of automating data-transformation workflows at scales consistent with the United States Geological Survey (USGS) level-12 Hydrological Unit Codes (HUC-12) (USGS, 2013; Seaber et al., 1987) to be consumed in spatially distributed hydrological models. The Penn State Integrated Hydrological Model (PIHM) (Qu and Duffy, 2007) is demonstrated here, but the workflows serve as a template for other models to adapt and become new services. The focus of this article is the data transformation process, not particular model results. The goal is to demonstrate that workflows empower modelers to rapidly create watershed models anywhere in the CONUS, and to demonstrate how the approach can serve as a data provenance resource for HUC-12 models.


It is shown that the way these services are coupled is critical for web service performance. An explanation of both the hardware and software architecture is required to explain how the software components operate. Section 1 introduces the reasons for using data-model workflows within hydrological modeling. Section 2 provides an overview of the architecture, while Section 3 explains details of the data-model workflows. Our prototype web application to create and evaluate data-model workflows is discussed in Section 4. Section 5 demonstrates the feasibility of using the prototype to create provenance data-model workflows using distributed computing environments with CONUS HUC-12 catchments.

1.1. Why data-model workflows?

The ETV services provide both spatial data (soils, land cover, etc.) at the HUC-12 scale, and time-series North American Land Data Assimilation System (NLDAS, 2011) climate forcing for a period of 30 years, one climate normal (Arguez and Vose, 2011). Catchment data is made available to the user by selecting an HUC-12, which returns an email to the user with a web link to the data to be downloaded as a first step in the data workflow (Leonard and Duffy, 2013). The following example demonstrates how web-based users access ETV data that has been transformed to support the PIHM input datasets. This transformation is hidden from the user to avoid numerous data-processing and time-consuming steps. Transforming data is not simply about converting one file format to another. The data-model workflow is about preparing data to be consumed in a hydrological model. This involves simplification of catchment boundaries, stream delineation, and generating meshes that represent the physical processes being investigated by the modeler. One goal here is to standardize the model data step and to minimize errors due either to the original data or to the user's method itself. Often these steps are not easy to reproduce, as the reasons why the modeler made decisions are not recorded or transferred to other users. Reducing the time that modelers invest in manually changing individual parameters and input data to generate quality hydrological models is another major goal of the workflow. The data-model workflow can also assure that model provenance is not lost. Overall, the fundamental reason for workflows is to capture all data processing steps and to enable reproducibility and provenance of data transformations. Capturing these steps requires capturing user interaction within the web application. The workflows described here do not restrict users from downloading the data transformation results and using the data offline. However, the emphasis of these workflows is reproducibility and rapid prototyping, so users can retrieve a personal copy at the end of the process. Furthermore, the data inputs are easily re-created from the stored parameters of the user interaction to replicate the entire workflow process. This may appear trivial for a few case studies. However, hundreds of terabytes of data storage are already necessary for the ETV web services, and assuming 1–10 GB of storage is required for 30 years of input data per HUC-12, hundreds of terabytes of disk storage would also be required to keep the data-transformation steps for the entire CONUS.
Data workflows eliminate the need to keep the data transformation results and only store the modelers' input values to the versioned workflows.

1.2. What is automated within the data-model workflows?

A user selecting an HUC-12 initiates the data-model workflow. There are three phases.


The first is the ETV data workflow, which is responsible for efficiently selecting, projecting, clipping, and extracting data within the HUC-12 catchment. The sources of the ETV data have been transformed from their original file formats to databases so that data can be consumed in multiple ways and be accessible in a distributed computing environment. What is being automated in this phase is the entire data selection process pertaining to the spatial context of the selected HUC-12 and the climate forcing time series. Metadata is included with the selection process, including the data sources, the data versions used, and the software version used. The second phase is the transformation of the generated ETV data into PIHM input file formats. There are two steps to this phase. The first is the physical representation of the boundaries and stream network of the catchment as a discrete mesh. Each mesh element has properties (soil, geology, land cover, and climate-forcing variables) assigned. The desktop PIHMgis assists in the creation of these files, but requires user intervention for the geometry creation of the catchment and stream delineation (Bhatt et al., 2008). In this paper, these steps have been automated as a web service and are controlled by user-defined variables that control the simplification process. It is noted that automation of the workflow requires extensive High Performance Computing (HPC) services executing tasks in parallel. Terrain, soil, geology, land cover, and forcing data from the ETV workflow are automatically assigned to mesh cells. Default values are suitable for the initial trial. The second step is to generate model parameters that define the initial conditions and calibration values to control and calibrate the hydrological model. Default values are assigned to these parameters so that a user can, in principle, generate everything necessary for an initial hydrological model: data, parameterization, execution, and calibration. Clearly, the workflow will be of greatest value to expert users familiar with the usual methods of model implementation, although novice users and educational applications will also be important use cases. The third phase is the web-based user interface that captures the steps involved in phases one and two by storing data-model and model parameters. All workflow parameters are stored as database objects for fast retrieval, provenance, and reproducibility. This happens automatically when a user submits a task and is a critical step for sharing parameters and models with other users and stakeholders.

1.3. Constraints

This article focuses on the data-model transformation process and the use of a distributed computing environment to evaluate the input data generated by the transformation. In a future paper, the problems of model implementation, calibration, uncertainty assessment, and validity will be explored. In the prototype phase, the data workflows presented here are restricted to one HUC-12 selected by the web user; executing model workflows would be restricted to expert users, but data-model workflows would be available. The next phase of this prototype will deal with issues associated with scaling up to a network of HUC-12 watersheds, as discussed in Leonard and Duffy (in press), where the flow direction between HUC-12s requires validation to verify the hierarchy and that upstream HUC-12s are calibrated.

1.4. Related work

In recent years a number of hydrological service-oriented applications have been developed.
The AWARE project geo-portal application (Granell et al., 2010) supports two hydrological models: the Snowmelt Runoff Model (Martinec et al., 1994) for daily stream flow forecasts in mountain basins, and the TUW-HBV model (Parajka et al., 2005), a semi-lumped rainfall-runoff model.


Goodall et al. (2008, 2011) consider service-oriented computing as a strategy for integrating independent water resource models, and Horsburgh et al. (2009) have applied the concept to publishing environmental data. Nativi et al. (2013) discuss model services and web services to consume data from the Global Earth Observation System of Systems (GEOSS) with the SWAT hydrologic model (SWAT, 2013; EnviroGRIDS, 2013). With regard to integrating data and model workflows, we point the reader to Turuncoglu et al. (2013), who discuss coupling the Earth System Modeling Framework (ESMF) with the Regional Ocean Modeling System (ROMS) and the Weather Research and Forecasting Model (WRF). Clearly, hydrological service-oriented applications will be essential to the next generation of model applications.

2. System design

This section describes the computer hardware and software that form the foundation for automation of data-model workflows. The hardware has been structured to efficiently serve the large volumes of data required to support data-model workflows anywhere in the CONUS. The data-model processing is distributed within the data tier of the HydroTerre system, and the hydrological modeling is distributed to other HPC systems to compute PIHMs. Clearly, an efficient and robust service-oriented architecture (Section 2.2) is critical to support the rapid prototyping and delivery of data-model workflows.

2.1. Hardware and administration layers

The data-model workflows are implemented in a three-tier hardware layer system (Fig. 1). The web interface tier hosts the web applications and services. ESRI's ArcGIS server software development kits (SDK) (ESRI, 2014) support the GIS web applications, and Microsoft SQL Server (Microsoft, 2014e) is used to store, create, and query spatial datasets. Microsoft SQL Server is also used to store and query datasets in the second tier. Components of the data workflows that retrieve the forcing data are implemented on this tier and form the first layer of the distributed computing system, as data queries are executed on multiple compute nodes. All the components are executed in parallel for maximum performance, with run times ranging from minutes to many hours depending on the catchment size. The reader is referred to Leonard and Duffy (2013) for further details about compute times and data sizes of the ETV datasets. The web and data tiers are tightly coupled via a private fast router to minimize performance loss when retrieving datasets between servers. The data-model workflows reside on these two tiers and are explained further in Section 2.2. Both ETV and data-model workflow results, compressed and zipped, reside on the web tier, accessible by web interfaces and applications. Web users gain access to these results via a public network connection. The model support tier also gains access to the zipped data using the same public network. The model support tier operates in two modes due to administration restrictions. The first is manual mode, where service applications are not allowed to operate, or service units restrict public access. For example, the data-model workflows have been tested using the Extreme Science and Engineering Discovery Environment (XSEDE, 2014a), which operates on service units.

Fig. 1. Three-tier hardware layer system to support data-model and model workflows. Tier-one supports web applications, tier-two supports the data services, and tier-three supports the model development. Both tiers one and two support the data-model workflows.

In manual mode, the user is required to log in to the compute environment and execute the tasks under their own account. With XSEDE, a Python script retrieves the data-model workflow results and submits the PIHM jobs with the Portable Batch System (PBS) (Henderson, 1995) in a Linux environment. The second mode is automated. Using a specified PIHM account, a custom PIHM dispatcher application runs continuously and uses web services to retrieve PIHM jobs. For example, the data-model workflows have been tested using Penn State University's CyberSTAR cluster (CyberSTAR, 2014). In this mode, the user does not need to log in to the compute environment and the PIHMs are automatically dispatched to the compute nodes. Job management is achieved via the data tier, with all compute nodes accessing job tasks from the data tier via user-project databases. In both modes, statistics about the data and the model performance are returned to the web tier database. The main difference between the two modes is how access to the model results is gained. In the manual mode, access to model results is restricted and extra steps are required to move them. In the automated mode, the model results are automatically sent back to the web interface tier. The other significant difference is that the computer names are recorded when a PIHM job is executed. Using the automated mode, the exact workflow can easily be replicated on the exact distributed computer, while with the manual mode, more time-consuming steps are required to replicate the workflow. The ability to replicate the entire workflow is extremely useful for debugging, for example when trying to determine whether data is being lost or corrupted via network connections, or whether hardware and/or operating systems are causing problems with the model jobs. To the reader this may appear unnecessary, but when hundreds of thousands of PIHM jobs are running within different types of HPC environments, having this information is valuable for understanding failure. The issues outlined with the manual mode can be overcome by working with the HPC management to address security issues. In this article, we will assume that the model support tier is automated and all tiers operate as a distributed computing system. The web tier is the Graphical User Interface (GUI) to all aspects of the workflows, with the data and model tiers operating without any user intervention. How the three tiers work together when a web user starts the HydroTerre web application is discussed in Section 2.2, starting with an overview map of the service-oriented software architecture.

2.2. Overview map of service-oriented architecture

In previous sections, the ETV, data-model, and PIHM-model workflows have been presented abstractly as individual objects to represent their main functionality. In fact, the workflows are hundreds of discrete pieces of software that provide application functionality to other applications that constitute the HydroTerre data-model workflows. The workflows are accessible as private service-oriented architecture (SOA) (Microsoft, 2014c; Bell, 2008, 2010) services using the common communication techniques of the Simple Object Access Protocol (SOAP) (World Wide Web Consortium, 2014b), Representational State Transfer (REST) (Fielding and Taylor, 2002), the Web Services Description Language (WSDL) (World Wide Web Consortium, 2014c), and the Database Markup Language (DBML) (Microsoft, 2014a). Section 3 discusses specific details about the workflows and their components.
However, it is important to first provide the reader with an overview map of the SOA and explain the significant paths behind the web application that occur when a user selects an HUC-12 to execute a data-model workflow. When a user visits the prototype application (see footnote 1) via a web browser, they are accessing internet services hosted on the web interface tier. The HydroTerre website user interface has been developed with Silverlight (Microsoft, 2014d) and the ArcGIS server SDK. The user interface is responsible for selecting, querying, creating, and retrieving Microsoft SQL Server datasets for display within the web application (Fig. 2A). All data displayed and used in controls reside in databases on the data tier; the user interface is data driven. The main communication methods between the user interface and the data tier, and between the data tier and the workflow service layer (Fig. 2B), are SOAP, REST, WSDL, and DBML. The choice of communication technique depends on where the data resides, which tier it is on, and system administration constraints. However, the overriding choice depends on balancing performance with disk, memory, and the number of Central Processing Units (CPUs). A tool that operates with high performance on one server type does not necessarily operate the same way on another, due to differences in hardware configurations. Thus, some of the workflow tools called by the user interface have multiple versions simply due to where the tool resides. The servers shown in Fig. 1 are not identical and perform differently. For example, one of the data servers has 512 GB of memory but relatively slow disk, so the forcing tool was developed to take advantage of the large amount of memory. The same tool also resides on the other data server, which has 128 GB of memory but fast disk; therefore, the forcing tool was modified to take advantage of the fast disk and use little memory. The data tier (Fig. 2C) has two categories. The first consists of the ETV datasets, and the reader is referred to Leonard and Duffy (2013) for further details about their function and computational complexity.

1 http://www.hydroterre.psu.edu/Development/HydroTerre_Leonard_Models/HydroTerre_Models.aspx.


Fig. 2. The service-oriented architecture for data-model workflows consists of three layers: the first layer is the web-based user interface, which is supported by a data tier layer and a workflow service layer.

A metadata repository is connected to the ETV datasets, which stores information about the ETV datasets' properties, versions, and technical attributes. The metadata repository is queried by the web application and informs users of the available ETV datasets by populating the graphical user interface controls. The second category contains databases that store all the parameters chosen by users once they have chosen to execute the workflows. Section 3 will discuss these parameters in more detail. These databases are queried by the user interface to populate data controls so that users can interrogate the workflow results (success or failure), inspect provenance, create a clone of a user's parameters to tweak them, and re-submit the workflow. When a user selects an HUC-12 and submits a job to execute the workflows, a new table row with the fields shown in Table 1 is created with a Globally Unique Identifier (GUID) (Microsoft, 2014b) primary key. Each row contains the HUC identification key and the user's email address, and each workflow is stored as a separate Extensible Markup Language (XML) (World Wide Web Consortium, 2014a) document. Thus, via the web user interface, queries against HUC-12s or email addresses can be made to populate the data controls and replicate the workflow parameters. See Appendix A for more specific details about the database schema of the project object that stores these values. Recall that only the parameters for workflows are stored, not the results. Therefore, a user cannot simply download the ETV or data-model results from a previous job, due to the large amount of disk storage that would be required to store results. The task will need to be executed again, but at the HUC-12 scale the time to re-create it is minimal and requires minimal effort from the user. Section 4 will demonstrate the simplicity of creating and reproducing workflows. Assuming a user is either creating a new HUC-12 job or re-executing one, the execution of workflow services is shown in Fig. 2D, from ETV to data-model to PIHM. There are potentially thousands of points of failure when executing the workflows to create or share data. To empower the web user to resolve an error returned during workflow execution, whether due to administration or to parameter issues, a meaningful error object is returned to the user via the web interface. The error object has a unique key, the software tool name and version, the operating system, the computer name, a time stamp, and the reason for the error. See Appendix B for further details about the error object. Furthermore, the number of possible errors is compounded since tasks are distributed on various operating systems and developed with different computer languages. To simplify the explanation of data automation workflows for HUC-12s in a distributed computing environment, the software versions are omitted, as is the computer language each piece of software has been developed with. Briefly, the software languages used in both Linux and Windows environments include C++, C, SQL, and Python; Windows environments additionally include C#, JavaScript, Silverlight, and ASPX. These differences occur because Windows Server is the operating system for the web and data tiers, while both Linux and Windows operating systems are used in the model tier. However, it should be noted that there are multiple workflow versions that have been designed and optimized for PIHM in specific HPC environments.
This is due to the emphasis on performance, which is constrained by the various computing environments and management practices. Due to management practices beyond our control, the prototype web services are restricted from the public to protect data and the security of resources. Other models will have the same difficulties and constraints. Here, we want to emphasize the importance of reproducibility, provenance, and rapid hypothesis testing, by using data-driven automated workflows.

Table 1
The HydroTerre National Job object stored when users execute workflows. The email, name, HUC, and date objects enable SQL queries for filtering and identifying jobs. The HPC properties object stores information related to the compute nodes where the workflows are executed. HUC properties store information about the HUC catchment. The Model and Data Properties objects store parameters returned by the workflows. UI Properties contain all the parameters used to execute the data and model workflows. Job and Workflow properties store parameters used and returned by the workflows on the compute nodes.

Column name         | Type                     | Description
JobID_Nat           | nvarchar                 | Project GUID key
JobID_Data          | nvarchar                 | Data workflow GUID key
SubmitJob           | datetime                 | Time user submitted project job
DeleteJob           | datetime                 | When project was deleted
Last_Accessed       | datetime                 | When project was last accessed
Email_Address       | nvarchar                 | User email address
Project_Name        | nvarchar                 | Automated project name
Pretty_Name         | nvarchar                 | User specified project name
HUC_Name            | nvarchar                 | USGS HUC name (not unique)
HUC_ID              | nvarchar                 | USGS HUC identification
HPC_Properties      | nvarchar (XML document)  | HPC XML object (Appendix A1)
HUC_Properties      | nvarchar (XML document)  | HUC XML object (Appendix A2)
Model_Properties    | nvarchar (XML document)  | Model XML object (Appendix A3)
Data_Properties     | nvarchar (XML document)  | Data XML object (Appendix A4)
UI_Properties       | nvarchar (XML document)  | User interface XML object (Appendix A5)
Job_Properties      | nvarchar (XML document)  | Job Project XML object (Appendix A6)
Workflow_Properties | nvarchar (XML document)  | Workflow XML object (Appendix A7)
Status_DWF          | int                      | Data workflow status (Appendix B1)
Status_MWF          | int                      | Model workflow status (Appendix B2)
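As a concrete illustration of the job object in Table 1, the sketch below composes a new job row and writes it to the project database. It is a minimal, hypothetical example: the pyodbc connection string, the table name HydroTerre_National_Job, the XML payloads, and the status codes are assumptions, and only a subset of the columns is shown (the full schema is in Appendix A).

# Minimal sketch of creating the Table 1 job row when a user submits an HUC-12 job.
# Table name, connection string, XML payloads, and status codes are assumptions.
import datetime
import uuid
import pyodbc

def create_job_row(conn_str, huc_id, huc_name, email, ui_properties_xml, data_properties_xml):
    """Insert a new project job keyed by a GUID and return that GUID."""
    job_guid = str(uuid.uuid4())          # JobID_Nat: project GUID primary key
    data_guid = str(uuid.uuid4())         # JobID_Data: data workflow GUID
    submitted = datetime.datetime.utcnow()
    sql = ("INSERT INTO HydroTerre_National_Job "
           "(JobID_Nat, JobID_Data, SubmitJob, Email_Address, HUC_Name, HUC_ID, "
           "UI_Properties, Data_Properties, Status_DWF, Status_MWF) "
           "VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)")
    with pyodbc.connect(conn_str) as cn:
        cn.cursor().execute(sql, job_guid, data_guid, submitted, email, huc_name,
                            huc_id, ui_properties_xml, data_properties_xml, 0, 0)
        cn.commit()
    return job_guid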


3. Workflow services

Section 2 gave an overview of how the digital infrastructure contributes to data-model workflows. Section 3 describes details of how the workflow services operate together to transform standard GIS datasets at an HUC-12 catchment scale to be modeled in a hydrological model (PIHM). The ETV workflow (Section 3.1) is responsible for retrieving GIS datasets at the HUC-12 scale anywhere in the CONUS and is model independent. The data-model workflow (Section 3.2) transforms the ETV workflow results and creates an unstructured mesh with physical properties for the user-selected HUC-12. The PIHM workflow (Section 3.3) consumes the data-model workflow outcomes within a distributed computing environment.

3.1. ETV workflow

The ETV workflow is a service with the sole purpose of retrieving data rapidly for any HUC-12 in the CONUS with minimal interaction from the user. The web application asks the user to select an HUC-12, provide an email address, and specify the forcing period. These are the core user inputs (orange in the web version) shown in Fig. 3A. The workflow inputs also require the job start time and a unique project GUID, both (gray in the web version) generated by the web application without user intervention. The web application also initiates a new project object (unique via its GUID) in the project database in the data tier (Fig. 3B) and assigns these input values. These are the essential user inputs necessary to execute the ETV workflow. The base software to execute the ETV workflow (Fig. 3C) is ArcGIS server. The HUC-12 key is used to query the HUC from the national database to create a bounding box. This box is then used to clip and extract soil, geology, land cover, and elevation from the national datasets as described in Leonard and Duffy (2013). This process produces Tagged Image File Format (TIFF) (Adobe Developers Association, 1992) and text files. When possible, existing ArcGIS tools were used for common GIS operations such as select and clip, if the performance was adequate and the tools took advantage of the HPC environment. However, new tools were needed, in particular to handle the distributed forcing datasets on the data tier. As described in Section 2.2, the forcing tool was designed to take advantage of HPC resources in ways that some of the ArcGIS tools are not designed to do. For example, many of the standard ArcGIS tools are 32-bit only, but all the servers described in Fig. 1 are 64-bit. The Microsoft SQL datasets are located on multiple servers, with each server having different system properties. Essentially, each forcing variable per year is stored as a separate database. The forcing tool (Fig. 3D) is responsible for identifying which forcing cells overlap the HUC-12, then, in parallel, querying the forcing variables and generating an XML file. For further details about ETV workflow performance at the HUC-12 scale, the reader is referred to Leonard and Duffy (2013). The data generation steps (light blue in the web version) shown in Fig. 3C are executed in parallel. After each data step, the error object (Appendix B) assigned to the GUID project object is updated with the status returned by the ETV software tool. If an error is returned by a data step, the error object is updated and the user is informed at the web interface. If all the data steps complete, the results are compressed and zipped. Then, an email is sent to the specified address with a web link, where the user can download the zipped file. At the same time, a completion time stamp is generated and the error object is updated on the GUID project object (Appendix A) created when the workflow started.
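The fan-out used by the forcing tool can be pictured with the short sketch below. It is an illustration of the pattern only: the query_forcing_cells helper, the variable list, and the XML layout are hypothetical placeholders, not the production tool, which runs against the per-variable, per-year SQL Server databases described above.

# Sketch of the forcing-tool pattern: query each forcing variable in parallel for
# the cells overlapping the selected HUC-12, then serialize one XML document.
# query_forcing_cells(), the variable list, and the XML layout are hypothetical.
from concurrent.futures import ThreadPoolExecutor
import xml.etree.ElementTree as ET

FORCING_VARIABLES = ["precipitation", "temperature", "wind", "radiation", "humidity"]

def build_forcing_xml(huc_id, cell_ids, years, query_forcing_cells, out_path):
    root = ET.Element("Forcing", attrib={"HUC_ID": huc_id})
    with ThreadPoolExecutor(max_workers=len(FORCING_VARIABLES)) as pool:
        futures = {var: pool.submit(query_forcing_cells, var, cell_ids, years)
                   for var in FORCING_VARIABLES}
        for var, future in futures.items():
            node = ET.SubElement(root, "Variable", attrib={"name": var})
            for timestamp, value in future.result():   # rows returned by the query
                item = ET.SubElement(node, "Value", attrib={"time": str(timestamp)})
                item.text = str(value)
    ET.ElementTree(root).write(out_path, xml_declaration=True, encoding="utf-8")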

3.2. Data-model workflow

The ETV workflow is an independent service that provides data, downloaded via web links, for any model. Conversely, the data-model workflow is dependent and consumes the ETV workflow service as data inputs. The results from the data-model workflow presented in this article are for PIHM, but the workflow, or its results, can be adapted for other models. The user inputs are an extension of the inputs presented in Section 3.1, and it is assumed that the ETV workflow is valid. With the same philosophy as the ETV workflow, the goal is to minimize the number of user inputs necessary to control the processing tools within the data-model workflow and to keep the process data-driven.

Fig. 3. The Essential Terrestrial Variable (ETV) workflow expects six variables (a) from the web application operated on the web tier. These variables are stored in a project database (b) that keeps project settings for all enquiries. The main processes executed in the workflow (c) are queries and clipping of datasets within the selected HUC-12 catchment boundary, which are compressed into a zip file; the web user is emailed a link to the datasets.


Four additional user inputs (orange in the web version), as shown in Fig. 4A, control the catchment topology. These inputs control the catchment boundary and stream network that form the derived mesh topology. Thus, these inputs control the level of detail of the catchment's unstructured mesh. The user input catchment simplification controls the maximum allowed offset within the polyline segments by removing small variations and unnecessary bends while preserving the polyline shape (Alves Dal Santo and Führ, 2010). The user input stream simplification applies the same technique to the stream network generated by Terrain Analysis Using Digital Elevation Models (TauDEM) (Tarboton, 2011). The stream raster threshold controls the TauDEM accumulation threshold that determines the beginning of streams. The remaining user input, Shewchuk's Triangle software input flags, controls the mesh element area of the Delaunay mesh (Shewchuk, 1997). These workflow inputs are included in the GUID project object, created by the ETV workflow, located in the project database (Fig. 4B). In addition, these inputs have default values so a novice user can generate data and gain experience. The data-model workflow consumes the user input values and the ETV workflow results. There are two categories of transformation applied to the ETV datasets. The first category is data processing, which has two functions. The first function is to generate unique lookup tables with derived data, in parallel, for soil, geology, and land cover within the HUC-12 catchment boundary. Appendix C explains these XML file structures. The second function creates the catchment and stream networks. One version of the data-model workflow simply uses the HUC-12 catchment and stream network provided by the NHD datasets to create the topology. The alternative version creates the catchment and stream network using raster and vector child-workflow techniques to create the topology, as shown in Fig. 4C. The raster child-workflow uses elevation to determine flow direction, accumulation, and contributing area using TauDEM (Tarboton, 2011). The vector child-workflow transforms the contributing area (catchment boundary) and stream network from TauDEM to vector geometry.
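As an illustration of the raster and topology child-workflow steps, the sketch below drives TauDEM and Shewchuk's Triangle from Python. It assumes TauDEM 5.x command-line tools and the triangle executable are on the PATH; the file names, MPI process count, and flag values are illustrative assumptions rather than the exact HydroTerre implementation.

# Sketch of the raster child-workflow (TauDEM) and mesh generation (Triangle).
# Assumes TauDEM 5.x tools and Shewchuk's 'triangle' executable are on PATH;
# file names, process counts, and flag values are illustrative only.
import subprocess

def run(cmd):
    subprocess.run(cmd, check=True)

def raster_child_workflow(dem, stream_threshold, nproc=4):
    mpi = ["mpiexec", "-n", str(nproc)]
    run(mpi + ["pitremove", "-z", dem, "-fel", "dem_fel.tif"])              # fill pits
    run(mpi + ["d8flowdir", "-fel", "dem_fel.tif", "-p", "dem_p.tif",
               "-sd8", "dem_sd8.tif"])                                      # D8 flow direction
    run(mpi + ["aread8", "-p", "dem_p.tif", "-ad8", "dem_ad8.tif"])         # flow accumulation
    run(mpi + ["threshold", "-ssa", "dem_ad8.tif", "-src", "dem_src.tif",
               "-thresh", str(stream_threshold)])                           # stream raster

def topology_child_workflow(poly_file, triangle_flags="pq30a100000"):
    # -p reads the simplified catchment/stream PSLG (.poly); the q and a flags
    # control mesh quality and maximum element area (the user-supplied Triangle flags).
    run(["triangle", "-" + triangle_flags, poly_file])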


Both methods in the second function create vector topology that is simplified using the user inputs (Fig. 4A) described earlier to control the level of detail, producing an unstructured mesh (Triangulated Irregular Network) (Shewchuk, 1997). These results are critical for the second category, file generation. Each cell of the unstructured mesh is assigned default initial values to create the initialization XML file. Additionally, each mesh cell is assigned an index into lookup tables that describe the physical parameters (soil, geology, land cover) and is then stored as the attribute XML file. The lookup tables for the physical parameters are also stored as individual files, independent from the mesh, for other uses within PIHM. After each software step, the error object (Appendix B) assigned to the GUID project object is updated with the status returned by the tool. If the tools return an error, the workflow is canceled and the user is informed via the web interface. If the workflow succeeds, a completion time stamp is assigned and an email is sent to the user (if requested) with a link to download the results for their personal use. However, the intention of the data-model workflow is rapid prototyping with the model workflow by testing parameters and refining the processes. The results from each workflow are discarded within a user-defined period (Section 4.1), since repeating the process is efficient using the stored provenance data. Once a user has finished refinement, they are expected to download a personal copy.

3.3. PIHM workflow

The PIHM workflow is dependent and consumes the data-model workflow service as data inputs. The user inputs are an extension of the inputs presented in Sections 3.1 and 3.2. Unlike the ETV and data-model services, there is no goal of limiting the user inputs that control PIHM. There are four general user input categories (orange in the web version), as shown in Fig. 5A, that control and execute PIHM on a distributed compute environment. The first category, HPC credentials, is validated against encrypted user and password credentials stored within the User Project Database in the data tier (Fig. 5B). In the manual case, for example on XSEDE HPC resources, the PIHMs are not executed; only the data-model workflows are created and the user is expected to access the results via a script.

Fig. 4. The data-model workflow expects four additional variables (a) from the web application operated on the web tier. These variables are stored in a project database (b) that records project settings for all enquiries. The main processes executed in the workflow (c) are three child-workflows (Raster, Vector, and Topology) that process and convert the ETV datasets into XML files for PIHM. These XML files are compressed into a zip file and the web user is emailed a link to the datasets.


Fig. 5. The model workflow expects four additional variable categories (a) from the web application operated on the web tier. These variables are stored in a project database (b) that keeps project settings for all enquiries and validates user credentials. The main processes executed in the workflow (c) are submitting PIHM jobs, compressing the model results into a zip file, and sending the web user an email with a link to the results.

In the automated case, for example using CyberSTAR HPC resources, the models are executed within that compute environment using local HydroTerre credentials (i.e. not the web user credentials). The HPC credentials, provided using the web interface tier, initiate the workflow processes only by updating the project database to indicate that the model workflow is ready to begin. This simplifies security policies amongst different compute domains and management practices. Thus, we have different job scheduling tools per HPC environment that query the project databases and retrieve which model to execute next. Sections 5.1 and 5.2 demonstrate the importance of discovering error sources. The second category, initialization values, specifies the initial state conditions for each mesh cell with physical parameters. The third category, parameter values, specifies print and solver controls for the Sundials solver (Hindmarsh et al., 2005) used in PIHM to solve partial and ordinary differential equations; the reader is referred to Qu and Duffy (2007) for further details. The fourth and last category, calibration values, is used for calibrating physical parameters. For categories three and four, the reader is referred to the PIHM website (PIHM, 2014) for further details. Default values are assigned to all parameters within these model categories to minimize what a beginner needs to start a PIHM. These values are unlikely to be accurate, but the process is useful for determining whether the model parameters or the data-model workflow setup is the cause of model failures. Assuming a successful data-model workflow, the user accesses the distributed compute environment to execute PIHM. Fig. 5C shows the main steps to complete a hydrological PIHM. The web user submits a job, the user credentials are checked against the user database, and the model workflow is added to the project database. If the manual option is selected, an email is sent to the user specifying where they can retrieve the data-model workflow results via a Python script.
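One way to picture the per-environment job scheduling tools is the polling loop sketched below. It is a hypothetical outline: the table and column names follow Table 1, but the status codes and the run_pihm helper are placeholders rather than the production dispatcher application.

# Hypothetical outline of an automated-mode dispatcher: poll the project database
# for the next queued PIHM job, run it, and record the outcome for provenance.
# Table/column names follow Table 1; status codes and run_pihm are placeholders.
import socket
import time
import pyodbc

def dispatch_loop(conn_str, run_pihm, poll_seconds=30):
    node = socket.gethostname()           # recorded so a run can be replicated later
    while True:
        with pyodbc.connect(conn_str) as cn:
            cur = cn.cursor()
            row = cur.execute(
                "SELECT TOP 1 JobID_Nat, HUC_ID FROM HydroTerre_National_Job "
                "WHERE Status_MWF = 0 ORDER BY SubmitJob").fetchone()
            if row is None:
                time.sleep(poll_seconds)
                continue
            job_id, huc_id = row
            try:
                run_pihm(job_id, huc_id)  # download inputs and execute PIHM
                status = 1                # assumed 'succeeded' code
            except Exception:
                status = -1               # assumed 'failed' code
            cur.execute("UPDATE HydroTerre_National_Job "
                        "SET Status_MWF = ?, Workflow_Properties = ? "
                        "WHERE JobID_Nat = ?",
                        status, "<Node>" + node + "</Node>", job_id)
            cn.commit()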

Fig. 6. User settings to initiate the distributed compute system on HydroTerre.

If the automated option is selected, an email is sent (if requested) after the model has been executed. An estimated time of completion is provided to the user based on the mesh size. The job is canceled (this can be overridden) if the time threshold is met; experience suggests that in this case it is highly probable that the PIHM solver physics is not converging. Any other type of error (Appendix B), for example administration or data issues, is reported to the GUID project database, and the user interface is updated to inform the user. If the model does succeed, the results are compressed and zipped from the scratch directory on the distributed computer. Then an email is sent to the specified address with a web link where the user can download the PIHM results as a zipped file. The model output files are automatically deleted from the compute node after a specified period.
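In the manual mode, the steps a user performs on the HPC login node look roughly like the sketch below. It is a schematic stand-in for the Python script mentioned above: the download handling, the PBS resource requests, and the PIHM command line are illustrative assumptions.

# Schematic stand-in for the manual-mode script: fetch the zipped data-model
# workflow results and submit a PIHM run through PBS. The download handling,
# PBS resource requests, and PIHM command line are illustrative assumptions.
import os
import subprocess
import urllib.request
import zipfile

PBS_TEMPLATE = """#!/bin/bash
#PBS -N pihm_{job_id}
#PBS -l nodes=1:ppn=8,walltime=48:00:00
cd $PBS_O_WORKDIR
./pihm {job_id}
"""

def fetch_and_submit(result_url, job_id):
    archive = job_id + ".zip"
    urllib.request.urlretrieve(result_url, archive)     # link emailed by HydroTerre
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(job_id)                           # unpack the PIHM input files
    script = os.path.join(job_id, "run.pbs")
    with open(script, "w") as fh:
        fh.write(PBS_TEMPLATE.format(job_id=job_id))
    subprocess.run(["qsub", "run.pbs"], check=True, cwd=job_id)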

4. Prototype to create data-model workflows

At the website www.hydroterre.psu.edu, under the services tab, a stand-alone demonstration to execute the ETV workflow independent of any model is available to the reader. Here we present a prototype web application (see footnote 2) that does not treat the ETV workflow as a stand-alone service, but couples it with both the data-model and model workflow services. This prototype consumes private web services (due to administration restrictions) based on Sections 2 and 3, which are summarized in Appendix D. Operating the web application requires user credentials to access remote HPC resources (Section 4.1) and a strategy to select catchments from the HUC-12 to the CONUS state scale (Section 4.2).

2 www.hydroterre.psu.edu/Development/HydroTerre_Leonard_Models/HydroTerre_Models.html.


Fig. 7. Web application interface to select HUC-12s within the CONUS. The user can select all HUC-12s within a US state (a) or county (b) or select individual HUC-12s (c) to construct a selection list (d).

Section 4.3 details how users consume the workflows described in Section 3 to set up data-model and model workflows with any CONUS HUC-12.

4.1. User credentials

To access the distributed compute system described in Section 3.3, the prototype web application requires user credentials consisting of an identification and password, as shown in Fig. 6. In addition, the user can specify a project name and when to delete any workflow results. Email settings are available for the user to define which workflow results, if any, they wish to retrieve via email. Both the project and email settings are helpful when submitting large numbers of compute jobs.

4.2. Selecting HUC-12s

The user selects HUC-12s via the application interface shown in Fig. 7. The user can select all HUC-12s within US state boundaries (Fig. 7A), or they can drill down to the county level (Fig. 7B). Otherwise, the user can select individual HUC-12s (Fig. 7C) anywhere within the CONUS by zooming into their area of interest. As the user selects HUC-12s, a main selection list is created (Fig. 7D), which will use the same ETV, data-model, and PIHM workflows with parameters applied in the same manner.

4.3. Setup data-model and model workflows

After the user has created a selection list of HUC-12s to model, and assuming the user credentials are valid (Section 4.1), the next step is to define the data workflow.

At present, there is only one ETV workflow, so the user is not required to select a version to use within their model. However, the web user needs to select which data-model workflow they wish to use, as highlighted in Fig. 8A. In this prototype, the user can choose to create stream networks using TauDEM, or they can use NHD streams, as discussed in Section 3.2. The user can define data-model workflow parameters by clicking on the interface button highlighted in Fig. 8B to reveal the user interface control (Fig. 8C). Any changes will be applied to the HUC-12 selection list highlighted in Fig. 8D. After defining the data-model workflow properties, the user can select which PIHM workflow version they wish to use and which HPC resource to use (Fig. 9A). As discussed in Section 3.3, the user can define and control PIHM by clicking on the interface button highlighted in Fig. 9B to reveal the user interface control (Fig. 9C). All the selected HUC-12s will use the same user-defined parameters (Fig. 9D). Assuming the user has valid credentials and has defined the data-model workflows, the user initiates the process by clicking on the submit model button (Fig. 10A), which adds the project objects to the workflow submission list (Fig. 10B). The project object indicates the user's email, the project name, the HUC identification, and when the job was added to the submission list. The user can investigate all the workflow settings by clicking on the appropriate buttons highlighted in Fig. 10C. The status of the workflows (Fig. 10D) is indicated to the user with four colors: white indicates the workflow was not requested by the user, orange (in the web version) indicates the workflow has started, green (in the web version) indicates the workflow succeeded, and red (in the web version) indicates the workflow failed. When a job workflow (ETV, data-model, and model) has succeeded, the next workflow starts on the next available computing environment.


Fig. 8. The user selects which data-model workflow (a) they wish to apply. To change the data-model workflow, the user clicks on the data settings button (b) and changes variables in the interface (c). Workflow settings are then applied to the HUC-12 selection (d).

As described in Section 2, the ETV and data-model workflows execute on the HydroTerre distributed computing system, while the PIHMs are distributed on the user-selected HPC resources (Fig. 9A). When a workflow has failed, the reason for failure is available to the user by clicking on the status button (Fig. 10D), which reveals a dialog (Fig. 10E) using the error codes from Appendix B. This provides the user valuable information about ways to fix the problem. The main reason for failure using this prototype is poor meshes, which requires the user to modify the catchment simplification parameters (Section 3.2). This prototype web application simplifies the process of fixing these errors by letting the user select an existing project and clone it; the user then changes parameters (Fig. 10C) and resubmits the workflow processes (Fig. 10F). The submission list captures the provenance data associated with all the workflows. All the parameters chosen by users are kept in the user project databases, allowing users to query other modelers' choices involved with any CONUS HUC-12. Unfortunately, model results are not stored permanently, due to the large amount of disk resources that would be required. To store one climate normal (30 years of forcing) of PIHM results per CONUS HUC-12 requires three petabytes of disk storage. However, by storing only the provenance steps, the need for large amounts of disk storage is reduced. Thus, this prototype can store thousands of workflow steps per HUC-12 across the CONUS, providing a new resource for modelers using the system to gain insight when starting a new model study and to download a refined model.
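Because only the parameters are stored, recovering a starting point for a given HUC-12 reduces to a query against the job table. The sketch below is a minimal example assuming the Table 1 schema, a pyodbc connection, and a hypothetical status code marking success; the returned UI and data property documents can then be cloned into a new submission.

# Sketch of querying stored provenance for an HUC-12 so that an earlier parameter
# set can be cloned and resubmitted. Table/column names follow Table 1; the
# status code used to mark success is an assumption.
import pyodbc

def previous_successful_workflows(conn_str, huc_id):
    sql = ("SELECT JobID_Nat, SubmitJob, UI_Properties, Data_Properties "
           "FROM HydroTerre_National_Job "
           "WHERE HUC_ID = ? AND Status_DWF = 1 "       # 1 = assumed 'succeeded' code
           "ORDER BY SubmitJob DESC")
    with pyodbc.connect(conn_str) as cn:
        return cn.cursor().execute(sql, huc_id).fetchall()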

5. Creating provenance data-model workflows using distributed compute environments

The web application discussed in Section 4 has been used to create provenance datasets, a new data product for users, on a distributed compute environment with both automated and manual strategies. Section 5.1 discusses the distributed compute environments used to evaluate the CONUS HUC-12 data-model workflows. Section 5.2 discusses these results and how provenance improves our strategy to automate and evaluate HUC-12 data-model workflows at the CONUS scale.

5.1. Distributed compute environments

Two distributed compute environments were used to evaluate data-model workflows using CONUS HUC-12s. The first environment was the XSEDE HPC resource Trestles at the San Diego Supercomputer Center (SDSC) (XSEDE, 2014b, 2014c), used to evaluate the HUC-12s within the states of Utah and Pennsylvania. The web application and services discussed in Section 4 were used to select and distribute the data-model workflows.


Fig. 9. The user selects which model workflow (a) they wish to apply. To change the model workflow, the user clicks on the model settings button (b) and can change variables in the interface (c). Workflow settings are then applied to the HUC-12 selection (d).

Due to administration restrictions with XSEDE resources, the data-model results were retrieved using the manual strategy discussed in Section 3.3, with a Python script that retrieved the input data and executed the PIHM. Both states required hundreds of gigabytes of data-model datasets and produced terabytes of model results. The data and model results were discarded, but the provenance to create the workflows (Sections 2 and 3) is stored and is accessible via the web application shown in Section 4. Table 2 summarizes the creation times and dataset sizes for the states of Utah and Pennsylvania using the XSEDE resource Gordon. The second environment was the HPC resource CyberSTAR at The Pennsylvania State University (CyberSTAR, 2014), used to evaluate all 90,762 CONUS HUC-12s with the automated strategy discussed in Sections 2.2, 3.2, and 3.3. Each CONUS HUC-12 was appended to the submission list (Section 4.2), and six CyberSTAR server nodes continuously downloaded and executed the PIHM input files. Table 3 summarizes the creation time and data sizes for all CONUS HUC-12s. Unlike the first environment, the entire process was automated and, once a model result was generated, the data-model and model results were deleted.

The automated process has been repeated twelve times to improve the software and hardware infrastructure. More than a million ETV and data-model workflows have been generated to collect provenance data regarding each HUC-12 in the CONUS. The CONUS HUC-12 meshes that were successful with default settings are shown in Fig. 11. Recall that these model results are not calibrated and the default settings may not be appropriate. However, the provenance data is useful for understanding where the workflows fail and, with the error objects (Sections 2.2 and 3.1) (Appendix B), where exactly within the workflows they failed.

5.2. Evaluation of CONUS data-model workflow provenance

All the HUC-12s per CONUS state (Section 4.2) were selected using default settings that may not be appropriate for understanding the catchment.


Fig. 10. The web user submits the model workflows (a) that are then inserted into the workflow submission list (b). Web users can investigate existing workflow project settings (c) and by hovering the mouse cursor over the status bar (d) the workflow status appears (e) for quick investigation of any error sources. Users can share, repair, and redo existing workflows with the controls (f).

Table 2
The HUC-12s were selected using the web application discussed in Section 4 for the states of Utah and Pennsylvania. On the HydroTerre compute environment, the data-model workflows took 3.5 h, creating 650,000 files totaling 80 GB for one year of forcing, for the HUC-12s in the state of Utah. The data-model workflows took 22 h, creating 370,000 files totaling 256 GB for ten years of forcing, for the HUC-12s in the state of Pennsylvania. The data-model workflow results were then retrieved and executed with PIHM on XSEDE's HPC resource, Trestles, at the San Diego Supercomputer Center. For the state of Utah, 2 TB of model results were generated, consuming 16,000 service units and 48 h to compute. The state of Pennsylvania required 150,000 service units and 48 h to compute, generating 11.5 TB of model data.

Spatial selection | Number of HUC-12s | Data-model workflow: creation time | Files generated | Size of files | Model workflow: forcing duration | Service units | Size of files | Compute time
Utah              | 2558              | 3.5 h  | 650,000 | 80 GB  | 1 year   | 16,000  | 2 TB    | 48 h
Pennsylvania      | 1451              | 22 h   | 370,000 | 256 GB | 10 years | 150,000 | 11.5 TB | 48 h

However, the purpose is not to create valid hydrological models, but to evaluate where in the data-model workflow errors occur: whether the data is the error source, or the data transformation process itself. All CONUS HUC-12 ETV workflows were successful, but the data-model and model workflows did not have the same success, as summarized in Table 4 and illustrated as white gaps in Fig. 11. Appendix E lists the success rates for each CONUS state using default values and using the TauDEM catchment generation workflow (Section 3.2).

Averaged across CONUS states, the data-model workflow succeeded 57.81% of the time; Fig. 11 shows the success rate per state spatially, while Fig. 12 shows it in tabular form. It is immediately apparent that the states of North Dakota (ND), Rhode Island (RI), and Vermont (VT) have disproportionately low success rates. Checking the error object codes returned (Appendix B) shows that these states have more soil error codes than others. Investigating the soil datasets in these regions reveals missing values used by the soil algorithms. Issues with soil data also occur in South Dakota (SD), New Hampshire (NH), and Massachusetts (MA). Thus, the soil algorithms need modification to address these circumstances when they arise. Note that the data-model workflow preserves the original form of the national datasets.

Table 3
The ETV and data-model workflows took 80 h to prepare the PIHM inputs for two months of forcing data. Approximately 25 million input files were generated and deleted during the process, creating close to 450 GB of input data. The model jobs took two weeks, generating 10 TB of results, which were immediately deleted once the outcomes (failure or success) were added to the project objects as discussed in Sections 3.2 and 3.3.

Spatial selection | Number of HUC-12s | Data-model workflow: creation time | Files generated | Size of files | Model workflow: forcing duration | Size of files | Compute time
CONUS             | 90,762            | 80 h | 25 million | 450 GB | 2 months | 10 TB | 2 weeks


Fig. 11. CONUS HUC-12 mesh size results using PIHM on the HPC CyberSTAR with automated workflows and default settings. Using XSEDE HPC resources, the states of Utah (A) and Pennsylvania (B) were modeled with PIHM using a manual strategy. These are not calibrated model results, but they demonstrate where data-model and model workflows succeeded (mesh produced) or failed (white space). Only the provenance workflow parameters are kept, so that users of the web application (Section 4) have a starting point for a new data-model and model workflow.

Table 4
Data-model and PIHM workflow results. On average across the CONUS states using default values, the data-model workflows succeeded 57.81% of the time. Of those that failed, stream network issues caused 26.91% of the errors, poor meshes caused 5.14% of the errors, and the remaining errors (10.14%) were due to hardware and operating system failures. Of the successful data-model workflows, 28.83% succeeded in the model workflow, with poor meshes being the main reason for failure.

              | HUC-12 count | Data-model workflow: success (%) | Failed (%) | Stream network failed (%) | Mesh failed (%) | Remaining error (%) | Model workflow: success (%) | Failed (%)
State average | 1852.29      | 57.81 | 42.19 | 26.91 | 5.14 | 10.14 | 28.83 | 71.17

Exploring all the state error codes returned by the data-model workflows, the most common sources of errors were stream delineation (error code 704) and mesh generation (error code 701). Part of the reason for these failures is the use of default values (Section 3.2) during the selection process (Section 4.2). Using the TauDEM tool suite requires user intervention to appropriately create both the catchment and stream networks at each HUC-12. Thus, an expert user refining these simplification values will improve the success rates, which are automatically shared with all users in the submission list, as discussed in Section 4.3. The remaining errors predominantly occurred during file generation, when hardware and operating system resources were overwhelmed, for example through bottlenecks such as memory corruption and failed disk operations. Of the successful data-model workflows (57.81%) distributed on the PSU CyberSTAR HPC system, 28.83% succeeded using default settings (Section 3.3). Poor quality meshes, with large cell skewness and aspect ratios (slivers), account for 50% of the model workflows that failed (Cheng et al., 2012). As identified in the data-model workflow, further investigation is required into the mesh generation techniques. About 10% of the model workflows were programmatically canceled due to slow performance, as simulation steps were evaluated but did not finish in a reasonable amount of time.

as simulation steps were evaluated but did not finish in a reasonable amount of time. The most probable reason is due to poor meshes. With the remaining 40%, the Sundials solver (Hindmarsh et al., 2005) did not converge (based on experience) and the simulation time-step remained at zero until the estimated time of completion was reached and the program was forced to quit. The reasons for these failures are again mesh quality due to using default settings for stream delineation and catchment boundary simplifications. Using the state of Pennsylvania as a case study (Table 5), the stream and catchment simplification values were altered in twelve different combinations to improve the datamodel workflow results. The default values used for the CONUS HUC-12s resulted with 67.13% success (Appendix E), while the average result increased to 73.67% trying different combinations. Except for one case, there was an improvement in all scenarios. As has been stated, stream (73.13%) and mesh (13.72%) generation is the main reasons for failure. Further interrogation to the high values in the remaining error column were predominately due to the vector simplification (Section 3.2) before the stream and mesh generation process. The remaining test (Table 6) conducted within the state of Pennsylvania was to use the best simplification values returned in Table 5 (35,000 and 100) and use the NHD stream networks (Section 3.2) instead of the TauDEM results. The data-model workflow slightly increased to 92.97% from 82.62%. From the 7.03% that failed,


Fig. 12. Data-model workflow success rate (light blue) per CONUS state using default values and the TauDEM catchment strategy. The causes of failure include stream failure (dark blue), mesh failure (red), and remaining issues (green). The black line represents the average success rate (57.81%). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Table 5
Data-model and PIHM workflow results with different simplification values for stream and catchment parameters within the state of Pennsylvania. On average, the data-model workflows succeeded 73.67% of the time. Of those that failed, stream network issues on average caused 73.13% of the errors, poor meshes caused 13.72% of the errors, and the remaining errors (13.15%) were due to other data issues or hardware and operating system failures. From the successful data-model workflows, 36.67% succeeded in the model workflow, with poor meshes accounting for 45.21% of the failures on average and the remainder failing to converge.

Columns 3-7 report the data-model workflow (Success, Failed, Stream network failed, Mesh failed, Remaining error); columns 8-11 report the model workflow (Success, Failed, Poor mesh, Failed to converge). All values are percentages.

Stream | Catchment | Success | Failed | Stream network failed | Mesh failed | Remaining error | Success | Failed | Poor mesh | Failed to converge
5000 | 100 | 67.69 | 32.31 | 83.07 | 12.97 | 3.96 | 33.84 | 66.16 | 43.83 | 56.17
10,000 | 100 | 70.23 | 29.77 | 83.81 | 12.73 | 3.46 | 37.28 | 62.72 | 45.49 | 54.51
15,000 | 100 | 73.49 | 26.51 | 80.05 | 14.56 | 5.39 | 38.80 | 61.20 | 48.32 | 51.68
20,000 | 100 | 77.19 | 22.81 | 80.67 | 13.90 | 5.44 | 41.21 | 58.79 | 51.93 | 48.07
50,000 | 100 | 71.68 | 28.32 | 31.14 | 6.32 | 62.54 | 39.35 | 60.65 | 40.79 | 59.21
100,000 | 100 | 59.85 | 40.15 | 84.81 | 12.50 | 2.69 | 25.50 | 74.50 | 39.96 | 60.04
20,000 | 250 | 74.38 | 25.62 | 77.28 | 15.61 | 7.10 | 39.42 | 60.58 | 51.07 | 48.93
20,000 | 500 | 73.13 | 26.87 | 74.40 | 18.46 | 7.15 | 36.18 | 63.82 | 51.50 | 48.50
20,000 | 1000 | 70.16 | 29.84 | 71.38 | 22.62 | 6.00 | 34.11 | 65.89 | 48.00 | 52.00
35,000 | 1 | 81.48 | 18.52 | 71.44 | 14.69 | 13.88 | 46.31 | 53.69 | 50.83 | 49.17
35,000 | 10 | 82.12 | 17.88 | 69.74 | 9.56 | 20.69 | 20.74 | 79.26 | 17.13 | 82.87
35,000 | 100 | 82.62 | 17.38 | 69.79 | 10.70 | 19.51 | 47.35 | 52.65 | 53.68 | 46.32
Average | | 73.67 | 26.33 | 73.13 | 13.72 | 13.15 | 36.67 | 63.33 | 45.21 | 54.79
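The twelve combinations in Table 5 suggest a simple automation pattern. The following C# sketch shows how such a sweep over stream and catchment simplification values might be scripted; the SubmitDataModelJobs method, its state-code parameter, and its return value are hypothetical placeholders and do not reproduce the prototype's private Job/Model services.

using System;
using System.Collections.Generic;

class SimplificationSweep
{
    // The stream/catchment simplification pairs evaluated in Table 5.
    static readonly (int stream, int catchment)[] Combinations =
    {
        (5000, 100), (10000, 100), (15000, 100), (20000, 100), (50000, 100), (100000, 100),
        (20000, 250), (20000, 500), (20000, 1000), (35000, 1), (35000, 10), (35000, 100)
    };

    static void Main()
    {
        var results = new List<(int stream, int catchment, double success)>();
        foreach (var (stream, catchment) in Combinations)
        {
            // Hypothetical call: queue one data-model workflow per Pennsylvania HUC-12
            // with these simplification values and return the fraction that succeed.
            double success = SubmitDataModelJobs("PA", stream, catchment);
            results.Add((stream, catchment, success));
        }

        // Rank the combinations so the best-performing values can be reported.
        results.Sort((a, b) => b.success.CompareTo(a.success));
        foreach (var r in results)
            Console.WriteLine($"stream={r.stream} catchment={r.catchment} success={r.success:P2}");
    }

    // Placeholder standing in for job submission and provenance queries;
    // it returns a dummy value so the sketch compiles.
    static double SubmitDataModelJobs(string state, int streamThreshold, int catchmentTolerance)
    {
        return 0.0;
    }
}

Ranking the combinations in this way mirrors how refined simplification values are shared back to all users through the submission list (Section 4.3).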

Table 6
Data-model and PIHM workflow results using NHD streams and catchments within the state of Pennsylvania. The data-model workflows succeeded 92.97% of the time. Of those that failed, stream network issues on average caused 37.25% of the errors, poor meshes caused 40.19% of the errors, and the remaining errors (22.56%) were due to other data issues or hardware and operating system failures. From the successful data-model workflows, 58.34% succeeded in the model workflow, with poor meshes accounting for 41.29% of the failures on average and the remainder failing to converge.

State: PA
HUC-12 count: 1451
Data-model workflow: success 92.97; failed 7.03 (of the failures: simplification failed 37.25; mesh failed 40.19; remaining error 22.56)
Model workflow: success 58.34; failed 41.66 (of the failures: poor mesh 41.29; failed to converge 58.71)


simplifying the mesh inputs and the mesh generation accounted for 77.44% of the failures. The remaining 22.56% were caused by soil generation and operating system failures. There was an improvement in the model workflow results, to 58.34% from 47.35%. Again, poor meshes accounted for 41.29% of the failures, and the remainder were due to failure of the model to converge. Thus, the next step to improve and refine the CONUS HUC-12 results is to improve the mesh generation techniques used in the data-model workflows.

6. Conclusion

In this paper, we presented a prototype for automating data-model workflow services that transform CONUS Essential Terrestrial Variables (ETVs typically used in hydrological model studies) into model input datasets and then execute the hydrological model in a distributed compute environment. The services are a World Wide Web-based application that enables researchers and modelers to retrieve ETV and data-model data. By balancing hardware and software configurations, we demonstrated the feasibility of transforming data sources from several federal agencies, amounting to hundreds of terabytes of disk storage, by implementing a workflow prototype for HUC-12 catchments within the CONUS. We demonstrated the effectiveness of data-model workflows using automated and manual strategies by distributing the data-model datasets on High Performance Computing environments with CONUS HUC-12s. By using distributed compute environments, a provenance workflow dataset was developed that is accessible by users to develop watershed models anywhere in the CONUS. The provenance for data-model workflows developed here assures reproducibility of model simulations from ETV datasets without storing model results, which we have shown would require many petabytes of storage.


7. Future direction

This research focuses on the important issue of eliminating hurdles involved with using physics-based models such as PIHM in an HPC environment. The approach demonstrates how automated web-based data access and workflows allow seamless allocation of resources with minimal interaction from the user, and supports shared software, data, and HPC resources. The next phase of the research is to improve automation of domain decomposition (e.g. quality numerical mesh generation) and to scale up the simulation domain from HUC-12 to major river basins. The challenge will require adaptation of existing HUC-12 workflows to evolve the software to seamlessly allocate much larger data requirements within the existing prototype. At both the HUC-12 and major-river-basin scale, the next step is to incorporate the visualization resources necessary for implementation, calibration, and uncertainty assessment in a complete ETV, data-model and model workflow. Coupling data-model, model, and visualization workflows is an important step towards providing numerical watershed predictions as a product in the form of a dynamic watershed atlas that provides surface and groundwater budgets from model simulations with data and analysis provenance. We believe visual analytics will be an important resource to decide what model workflow results are essential to store in the watershed atlas.

Acknowledgments

This research was supported in part by the National Science Foundation through XSEDE resources provided by the XSEDE Science Gateways program (TG-EAR120019), NSF EarthCube (GEO-44417482), NSF INSPIRE (IIS-1344272), EPA (96305901), and NOAA (NA10OAR4310166). The authors would like to acknowledge the support from the Institute for CyberScience Director Padma Raghavan and Penn State Institutes for Energy and the Environment Director Tom Richard at The Pennsylvania State University.

Appendix A. Project database objects

The HydroTerre National Job object stores user workflows from the web application described in Section 4. The object contains provenance for the user interface selections for the ETV, data-model and model workflows. From this object, the web application can query and reproduce the entire workflow process. The sections below describe details about the workflow objects first described in Section 2.2 and the processes in Section 3. The schema diagrams are available to view at http://www.hydroterre.psu.edu/Development/Help_Model/AppendixA.aspx. Storing data as XML objects increases the flexibility of versioning and reproducibility between data-model workflows.

Appendix A1. HPC XML object

The high performance computing (HPC) object stores information about the hardware resources, such as type, resource, and institute, indicating what HPC is available to the user.

Appendix A2. HUC XML object

The Hydrological Unit Code (HUC) object stores information about the HUC, such as spatial extent, location, county, city and unique identifications.

Appendix A3. Model XML object

The Model object stores URLs assigned to services at various HPC resources identified in Appendix A1. This object stores keys and job identifications for model workflows.

Appendix A4. Data XML object

The Data object stores user-defined parameters to control the data-model workflows and populate the data-model graphical user interface.

Appendix A5. User interface XML object

The user interface object stores user-defined parameters to control the model workflows and is used to populate the model graphical user interface.

Appendix A6. Job project XML object

The Job project object contains properties of all the workflow job settings, such as keys, start and finish times, directory settings and job status.

Appendix A7. Workflow XML object

The workflow object contains properties of all the workflows and is used to populate the graphical user interface.
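For illustration only, the following C# sketch shows one way such provenance objects could be modeled and serialized to XML; the class and property names here are hypothetical simplifications and do not reproduce the actual HydroTerre schemas linked above.

using System;
using System.IO;
using System.Xml.Serialization;

// Hypothetical, simplified stand-ins for the HPC, HUC and job project objects
// described in Appendices A1, A2 and A6; not the published HydroTerre schemas.
public class HpcResource
{
    public string Type { get; set; }        // e.g. cluster type
    public string Resource { get; set; }    // e.g. queue or host name
    public string Institute { get; set; }
}

public class HucExtent
{
    public string HucId { get; set; }       // twelve-digit HUC code
    public double MinLongitude { get; set; }
    public double MinLatitude { get; set; }
    public double MaxLongitude { get; set; }
    public double MaxLatitude { get; set; }
}

public class JobProject
{
    public Guid Key { get; set; }
    public DateTime Start { get; set; }
    public DateTime Finish { get; set; }
    public string Directory { get; set; }
    public string Status { get; set; }
    public HpcResource Hpc { get; set; }
    public HucExtent Huc { get; set; }
}

public static class ProvenanceXml
{
    // Serialize a job project to XML so the workflow provenance can be
    // versioned, queried and replayed by the web application.
    public static string ToXml(JobProject project)
    {
        var serializer = new XmlSerializer(typeof(JobProject));
        using (var writer = new StringWriter())
        {
            serializer.Serialize(writer, project);
            return writer.ToString();
        }
    }
}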


Appendix B. Error object

The Error object is used to communicate between the workflows and the web application interface. Its fields are:

Error code | Unique error code (Appendix B1 and B2)
Workflow Type | Key to workflow type and version
Pretty Message | Meaningful message shown to the user
Message | Actual message returned from software and/or hardware

B1: Data workflow

Error code range | Message
-100 to -1 | Hardware problems
0 to 499 | Workflow status
500 to 599 | Raster processing
600 to 699 | Vector processing
700 to 799 | Topology processing
800 to 899 | XML generation
900 to 999 | Image processing
1000 to 1200 | Soil & geology processing

B2: PIHM workflow

Error code range | Message
2050 to 2059 | Invalid element
2040 to 2049 | Time job canceled
2030 to 2039 | Threshold job canceled
2000 to 2029 | Sundials errors
100 to 1999 | PIHM file errors
0 to 99 | Workflow status
-1 to -100 | Hardware problems
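The ranges above lend themselves to a simple lookup when building the Pretty Message returned to the web application. The C# sketch below is a minimal illustration that mirrors Appendices B1 and B2; it is not the prototype's actual error-handling code.

public static class WorkflowErrors
{
    // Classify a data-model workflow error code using the ranges in Appendix B1.
    public static string DescribeDataWorkflowCode(int code)
    {
        if (code < 0) return "Hardware problems";
        if (code <= 499) return "Workflow status";
        if (code <= 599) return "Raster processing";
        if (code <= 699) return "Vector processing";
        if (code <= 799) return "Topology processing";   // e.g. 704: stream delineation, 701: mesh generation
        if (code <= 899) return "XML generation";
        if (code <= 999) return "Image processing";
        if (code <= 1200) return "Soil & geology processing";
        return "Unknown data-model workflow code";
    }

    // Classify a PIHM model workflow error code using the ranges in Appendix B2.
    public static string DescribePihmCode(int code)
    {
        if (code < 0) return "Hardware problems";
        if (code <= 99) return "Workflow status";
        if (code <= 1999) return "PIHM file errors";
        if (code <= 2029) return "Sundials errors";
        if (code <= 2039) return "Threshold job canceled";
        if (code <= 2049) return "Time job canceled";
        if (code <= 2059) return "Invalid element";
        return "Unknown model workflow code";
    }
}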

Appendix C. Data-model generated files

The data-model workflows generate XML file structures to share data with models: a file structure accessible with object-oriented code, under version control, and operating system agnostic. The XML schema diagrams for the Unstructured Mesh, Attribute, Initialization, Soil, Land cover, Geology, and Forcing files are available at http://www.hydroterre.psu.edu/Development/Help_Model/AppendixC.aspx. All these files are generally structured as Object File Information that indicates when and where the file originated from, followed by a list of input files and/or parameters used to create the output object list.
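As a rough illustration of this layout, the C# sketch below models the generic wrapper: an Object File Information header plus input and output lists. The type and property names are hypothetical and simplified relative to the published schema diagrams.

using System;
using System.Collections.Generic;

// Hypothetical stand-in for the generic data-model file wrapper described above.
public class ObjectFileInfo
{
    public string FileName { get; set; }
    public DateTime Created { get; set; }        // when the file originated
    public string OriginWorkflow { get; set; }   // which workflow step produced it
    public string OriginResource { get; set; }   // where (e.g. which HPC resource) it was produced
}

public class DataModelFile
{
    public ObjectFileInfo Info { get; set; }
    public List<string> InputFiles { get; set; } = new List<string>();
    public List<string> Parameters { get; set; } = new List<string>();
    public List<ObjectFileInfo> Outputs { get; set; } = new List<ObjectFileInfo>();
}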

Appendix D. Private web services used by the prototype

The prototype web application discussed in Section 4 uses private web services, such as those described in Sections 2 and 3, with XML schemas available at http://www.hydroterre.psu.edu/Development/Help_Model/AppendixD.aspx. The prototype uses three web services: (1) the Job service is used to create project jobs that handle the workflows, as shown in Section 4.1; (2) the Model service is used to create HUC-12 submission lists and to update/submit workflow properties to the stored job objects, as shown in Section 2.2; and (3) the Project service is used to populate the project list discussed in Section 4.3.
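A client of such a service might look like the following C# sketch. The endpoint path, payload schema, and response handling are hypothetical placeholders, since the actual service contracts are private to the prototype.

using System;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

// Illustrative client only: "Services/Job.svc/Create" and the XML payload
// are invented for this sketch, not the prototype's actual Job-service contract.
public static class JobServiceClient
{
    private static readonly HttpClient Http = new HttpClient
    {
        BaseAddress = new Uri("http://www.hydroterre.psu.edu/")   // public site; service paths are private
    };

    public static async Task<string> CreateProjectJobAsync(string hucId)
    {
        // A minimal XML payload naming the HUC-12 to model; the schema is hypothetical.
        string payload = $"<CreateJob><HucId>{hucId}</HucId></CreateJob>";
        using (var content = new StringContent(payload, Encoding.UTF8, "text/xml"))
        {
            HttpResponseMessage response = await Http.PostAsync("Services/Job.svc/Create", content);
            response.EnsureSuccessStatusCode();
            return await response.Content.ReadAsStringAsync();   // e.g. the new job key
        }
    }
}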

Appendix E. Success rates for workflows evaluated within the CONUS using default values and TauDEM catchment generation workflow

Data-model workflow columns: Success, Failed, River issues, Mesh issues, Remaining error. Model workflow columns: Success, Failed. All values are percentages.

State | HUC-12 count | Success | Failed | River issues | Mesh issues | Remaining error | Success | Failed
AK | 0 | 0.00 | 100.00 | Not CONUS State | | | 0.00 | 100.00
HI | 0 | 0.00 | 100.00 | Not CONUS State | | | 0.00 | 100.00
Average | 1852.29 | 57.81 | 42.19 | 26.91 | 5.14 | 10.14 | 28.83 | 71.17
AL | 1482 | 64.71 | 35.29 | 25.57 | 3.51 | 6.21 | 35.25 | 64.75
AZ | 3287 | 57.68 | 42.32 | 32.64 | 7.42 | 2.25 | 21.26 | 78.74
AR | 1557 | 60.76 | 39.24 | 30.76 | 5.27 | 3.21 | 34.25 | 65.75
CA | 4450 | 57.03 | 42.97 | 30.13 | 9.46 | 3.37 | 25.06 | 74.94
CO | 3157 | 62.40 | 37.60 | 29.62 | 5.29 | 2.69 | 28.68 | 71.32
CT | 184 | 54.89 | 45.11 | 35.87 | 3.80 | 5.43 | 32.67 | 67.33
DE | 103 | 63.11 | 36.89 | 25.24 | 5.83 | 5.83 | 35.38 | 64.62
FL | 1349 | 41.88 | 58.12 | 28.61 | 9.04 | 20.46 | 20.53 | 79.47
GA | 1845 | 64.61 | 35.39 | 26.88 | 4.50 | 4.01 | 32.63 | 67.37
ID | 2743 | 61.83 | 38.17 | 27.63 | 4.85 | 5.69 | 25.00 | 75.00
IA | 1701 | 55.38 | 44.62 | 24.81 | 3.00 | 16.81 | 33.97 | 66.03
IL | 1856 | 63.04 | 36.96 | 27.86 | 5.28 | 3.83 | 32.05 | 67.95
IN | 1585 | 60.88 | 39.12 | 27.57 | 4.16 | 7.38 | 29.95 | 70.05
KS | 2056 | 52.24 | 47.76 | 39.83 | 6.18 | 1.75 | 29.80 | 70.20
KY | 1303 | 65.31 | 34.69 | 29.01 | 3.76 | 1.92 | 29.26 | 70.74
LA | 1266 | 51.82 | 48.18 | 33.65 | 10.27 | 4.27 | 22.41 | 77.59
MA | 247 | 47.77 | 52.23 | 17.41 | 6.07 | 28.74 | 27.97 | 72.03


Appendix E (continued)

State | HUC-12 count | Success | Failed | River issues | Mesh issues | Remaining error | Success | Failed
MD | 365 | 70.96 | 29.04 | 22.74 | 3.84 | 2.47 | 29.34 | 70.66
ME | 1044 | 69.92 | 30.08 | 20.69 | 4.21 | 5.17 | 26.30 | 73.70
MI | 1841 | 59.26 | 40.74 | 25.04 | 4.94 | 10.76 | 28.23 | 71.77
MN | 2486 | 53.70 | 46.30 | 26.67 | 5.11 | 14.52 | 32.06 | 67.94
MO | 1981 | 62.39 | 37.61 | 29.38 | 4.90 | 3.33 | 31.55 | 68.45
MS | 1347 | 62.29 | 37.71 | 28.43 | 5.94 | 3.34 | 7.27 | 92.73
MT | 4343 | 62.42 | 37.58 | 27.49 | 4.97 | 5.11 | 30.10 | 69.90
NE | 2103 | 56.92 | 43.08 | 34.19 | 5.47 | 3.42 | 27.90 | 72.10
NM | 3152 | 57.23 | 42.77 | 33.28 | 6.41 | 3.08 | 23.61 | 76.39
NV | 2562 | 54.29 | 45.71 | 31.15 | 10.34 | 4.22 | 23.36 | 76.64
NH | 333 | 45.95 | 54.05 | 18.32 | 3.90 | 31.83 | 29.41 | 70.59
NJ | 274 | 57.30 | 42.70 | 31.75 | 7.30 | 3.65 | 32.48 | 67.52
NM | 3152 | 57.23 | 42.77 | 33.28 | 6.41 | 3.08 | 23.61 | 76.39
NY | 1664 | 64.72 | 35.28 | 23.02 | 4.45 | 7.81 | 29.62 | 70.38
NC | 1766 | 68.63 | 31.37 | 24.18 | 4.30 | 2.89 | 28.71 | 71.29
ND | 1912 | 16.11 | 83.89 | 13.23 | 2.41 | 68.25 | 27.60 | 72.40
OH | 1538 | 67.56 | 32.44 | 25.62 | 4.68 | 2.15 | 32.63 | 67.37
OK | 2079 | 59.16 | 40.84 | 29.73 | 3.80 | 7.31 | 35.37 | 64.63
OR | 3116 | 67.04 | 32.96 | 25.96 | 4.85 | 2.15 | 21.54 | 78.46
PA | 1451 | 67.13 | 32.87 | 26.81 | 4.14 | 1.93 | 33.78 | 66.22
RI | 56 | 25.00 | 75.00 | 17.86 | 3.57 | 53.57 | 28.57 | 71.43
SC | 978 | 63.50 | 36.50 | 28.43 | 4.91 | 3.17 | 32.69 | 67.31
SD | 2410 | 45.48 | 54.52 | 27.34 | 3.94 | 23.24 | 34.95 | 65.05
TN | 1152 | 68.49 | 31.51 | 24.91 | 2.52 | 4.08 | 36.25 | 63.75
TX | 6439 | 61.30 | 38.70 | 30.70 | 5.17 | 2.83 | 28.05 | 71.95
UT | 2558 | 59.07 | 40.93 | 28.77 | 7.04 | 5.12 | 23.83 | 76.17
VT | 264 | 10.61 | 89.39 | 6.06 | 2.27 | 81.06 | 32.14 | 67.86
VA | 1268 | 72.00 | 28.00 | 21.77 | 3.94 | 2.29 | 31.76 | 68.24
WA | 1984 | 65.78 | 34.22 | 25.45 | 4.94 | 3.83 | 31.95 | 68.05
WV | 785 | 69.81 | 30.19 | 24.59 | 3.95 | 1.66 | 27.74 | 72.26
WI | 1803 | 65.06 | 34.94 | 27.68 | 3.55 | 3.72 | 33.42 | 66.58
WY | 2385 | 60.96 | 39.04 | 31.15 | 6.08 | 1.80 | 20.56 | 79.44
Total | 90,762 | | | | | | |

References

Adobe Developers Association, 1992. TIFF Revision 6.0. Adobe Developers Association, Mountain View, p. 121. Retrieved from: partners.adobe.com/public/developer/en/tiff/TIFF6.pdf.
Alves Dal Santo, M., Führ, C., 2010. Polygonal line simplifying methods applied to GIS. In: Facing the Challenges, Building the Capacity, pp. 11–16 (Sydney, Australia).
Arguez, A., Vose, R.S., 2011. The definition of the standard WMO climate normal: the key to deriving alternative climate normals. Bull. Am. Meteorol. Soc. 92 (6), 699–704. http://dx.doi.org/10.1175/2010BAMS2955.1.
Bell, M., 2008. Service-oriented Modeling: Service Analysis, Design, and Architecture. John Wiley & Sons, Hoboken, N.J. Retrieved from: http://www.books24x7.com/marc.asp?bookid=24356
Bell, M., 2010. SOA Modeling Patterns for Service-oriented Discovery and Analysis. John Wiley & Sons, Hoboken, N.J. Retrieved from: http://public.eblib.com/EBLPublic/PublicView.do?ptiID=477774
Bhatt, G., Kumar, M., Duffy, C.J., 2008. Bridging the gap between geohydrologic data and distributed hydrologic modeling. In: Sánchez-Marrè, M., Béjar, J., Comas, J., Rizzoli, A., Guariso, G. (Eds.), iEMSs 2008: International Congress on Environmental Modelling and Software, Integrating Sciences and Information Technology for Environmental Assessment and Decision Making. International Environmental Modelling and Software Society (iEMSs), p. 8.
Cheng, C., Dey, T.K., Shewchuk, J., 2012. Delaunay Mesh Generation. Chapman & Hall/CRC, Boca Raton, Fla., London, p. 410.
CyberSTAR, 2014. A Scalable Terascale Advanced Resource for Discovery through Computing. Retrieved January 01, 2014, from: http://www.ics.psu.edu/infrast/.
EnviroGRIDS, 2013. WP4 Hydrological Models. Retrieved August 13, 2013, from: http://envirogrids.net/.
ESRI, 2014. ArcGIS Server. Retrieved January 01, 2014, from: http://www.esri.com/software/arcgis/arcgisserver.
Fielding, R.T., Taylor, R.N., 2002. Principled design of the modern web architecture. ACM Trans. Internet Technol. 2 (2), 115–150. http://dx.doi.org/10.1145/514183.514185.
Goodall, J., Horsburgh, J., Whiteaker, T., Maidment, D., Zaslavsky, I., 2008. A first approach to web services for the national water information system. Environ. Model. Softw. 23 (4), 404–411. http://dx.doi.org/10.1016/j.envsoft.2007.01.005.
Goodall, J.L., Robinson, B.F., Castronova, A.M., 2011. Modeling water resource systems using a service-oriented computing paradigm. Environ. Model. Softw. 26 (5), 573–582. http://dx.doi.org/10.1016/j.envsoft.2010.11.013.

Granell, C., Díaz, L., Gould, M., 2010. Service-oriented applications for environmental models: reusable geospatial services. Environ. Model. Softw. 25 (2), 182–198. http://dx.doi.org/10.1016/j.envsoft.2009.08.005.
Henderson, R.L., 1995. Job scheduling under the portable batch system. In: Job Scheduling Strategies for Parallel Processing, vol. 949, pp. 279–294. http://dx.doi.org/10.1007/3-540-60153-8_34.
Hindmarsh, A.C., Brown, P.N., Grant, K.E., Lee, S.L., Serban, R., Shumaker, D.E., Woodward, C.S., 2005. SUNDIALS. ACM Trans. Math. Softw. 31 (3), 363–396. http://dx.doi.org/10.1145/1089014.1089020.
Horsburgh, J.S., Maidment, D.R., Whiteaker, T., Zaslavsky, I., Piasecki, M., 2009. Development of a community hydrologic information system. In: Anderssen, L.T., Braddock, R.S., Newham, R.D. (Eds.), 18th World IMACS Congress and MODSIM09 International Congress on Modelling and Simulation, pp. 988–994 (Cairns, Australia).
Leonard, L., Duffy, C., 2014. HydroTerre: selecting up-stream level-12 HUCs using depth-first graphs anywhere in the Continental USA. In: 11th International Conference on Hydroinformatics (HIC), New York City, USA.
Leonard, L., Duffy, C.J., 2013. Essential terrestrial variable data workflows for distributed water resources modeling. Environ. Model. Softw. 50, 85–96. http://dx.doi.org/10.1016/j.envsoft.2013.09.003.
Martinec, J., Rango, A., Roberts, R., 1994. The Snowmelt Runoff Model (SRM) User's Manual, p. 29 (Berne, Switzerland).
Microsoft, 2014a. Database Markup Language. Retrieved January 01, 2014, from: http://msdn.microsoft.com/en-us/library/bb399400%28v=vs.110%29.aspx.
Microsoft, 2014b. Globally Unique Identifier. Retrieved January 01, 2014, from: http://msdn.microsoft.com/en-us/library/aa373931(VS.85).aspx.
Microsoft, 2014c. Service-oriented Architecture. Retrieved January 01, 2014, from: http://msdn.microsoft.com/en-us/library/bb977471.aspx.
Microsoft, 2014d. Silverlight. Retrieved January 01, 2014, from: http://www.microsoft.com/silverlight/.
Microsoft, 2014e. SQL Server. Retrieved January 01, 2014, from: https://www.microsoft.com/en-us/sqlserver/default.aspx.
Nativi, S., Mazzetti, P., Geller, G.N., 2013. Environmental model access and interoperability: the GEO model web initiative. Environ. Model. Softw. 39, 214–228. http://dx.doi.org/10.1016/j.envsoft.2012.03.007.
NLDAS, 2011. North American Land Data Assimilation System. Retrieved December 05, 2010, from: http://ldas.gsfc.nasa.gov/nldas/NLDAS2forcing.php.
Parajka, J., Merz, R., Blöschl, G., 2005. A comparison of regionalisation methods for catchment model parameters. Hydrol. Earth Syst. Sci. 9 (3), 157–171. http://dx.doi.org/10.5194/hess-9-157-2005.


PIHM, 2014. Penn State Integrated Hydrologic Model. Retrieved January 01, 2014, from: http://www.pihm.psu.edu.
Qu, Y., Duffy, C.J., 2007. A semidiscrete finite volume formulation for multiprocess watershed simulation. Water Resour. Res. 43 (8), 1–18. http://dx.doi.org/10.1029/2006WR005752.
Seaber, P.R., Kapinos, F.P., Knapp, G.L., 1987. Hydrologic Unit Maps. U.S. G.P.O., Washington; Denver, CO, p. 66 (For sale by the Books and Open-File Reports Section, U.S. Geological Survey).
Shewchuk, J.R., 1997. Delaunay Refinement Mesh Generation. Carnegie Mellon, Pittsburgh, Pa.
SWAT, 2013. Soil and Water Assessment Tool. Retrieved January 06, 2013, from: http://swat.tamu.edu/.
Tarboton, D.G., 2011. TauDEM Hydrology Research Group. Retrieved February 03, 2011, from: http://hydrology.usu.edu/taudem/taudem5.0/index.html.
Turuncoglu, U.U., Dalfes, N., Murphy, S., DeLuca, C., 2013. Toward self-describing and workflow integrated earth system models: a coupled atmosphere–ocean modeling system application. Environ. Model. Softw. 39, 247–262. http://dx.doi.org/10.1016/j.envsoft.2012.02.013.
USGS, 2013. USGS HUC. Retrieved from: http://water.usgs.gov/GIS/huc.html.
World Wide Web Consortium, 2014a. Extensible Markup Language. Retrieved January 01, 2014, from: http://www.w3.org/TR/xml11/#charsets.
World Wide Web Consortium, 2014b. Simple Object Access Protocol. Retrieved January 01, 2014, from: http://www.w3.org/TR/soap12-part1/.
World Wide Web Consortium, 2014c. Web Services Description Language. Retrieved January 01, 2014, from: http://www.w3.org/TR/wsdl.
XSEDE, 2014a. Extreme Science and Engineering Discovery Environment. Retrieved January 01, 2014, from: https://www.xsede.org.
XSEDE, 2014b. San Diego Supercomputer Center Gordon. Retrieved January 01, 2014, from: https://www.xsede.org/web/guest/sdsc-gordon.
XSEDE, 2014c. San Diego Supercomputer Center Trestles. Retrieved January 01, 2014, from: https://www.xsede.org/web/guest/sdsc-trestles.
