HYDROML: CONCEPTUAL DEVELOPMENT OF A HYDROLOGIC MARKUP LANGUAGE
MICHAEL PIASECKI (1), LUIS BERMUDEZ (2)
(1) Assistant Professor, Dept. of Civil, Architectural & Environmental Engrg.
(2) Graduate Assistant, Dept. of Civil, Architectural & Environmental Engrg.
Drexel University, 3141 Chestnut Street, Philadelphia, PA 19104, USA
TEL: +(215) 895-1721; FAX: +(215) 895-1363
[email protected]
ABSTRACT
This paper presents the conceptual development of a markup language specific to the hydrologic community. The predicted vast increase in data volume, together with the heterogeneity of the data used across the many disciplines in hydrology, calls for a standardized description of data in order to facilitate storage, querying, analysis, retrieval and exchange among data-holding sites and end users. To this end, a so-called metadata standard needs to be adopted from which the specific community metadata needs can be derived, in this case the ISO metadata norm 19115. In conjunction with a formal syntactic structure or language convention like XML, a hydrologic markup language can be developed. The ensuing difficulties and challenges, like the diversity of data holdings and formats, are highlighted, and an approach is presented that conceptualizes the need to define data ontologies and a common vocabulary. Finally, some examples are presented to illustrate the use of semantic and syntactic standards in creating data descriptions for hydrologic data.
KEYWORDS
Metadata Standards, Markup language, XML, HYDROML, Hydrologic Information Science
INTRODUCTION
The development and expansion of the existing information infrastructure has recently been swept to the top of the US geosciences community’s agenda, culminating in a number of high-level funded efforts that span the disciplines. Initiatives like the Geoscience Network (GEON, 2002), the Thematic Real-time Environmental Data Distributed Service (THREDDS, 2002), and the National Virtual Ocean Data System (NVODS, 2002), to name just a few, have all set out to improve data access, distribution, exchange and archival in their respective communities. For the hydrologic science community, the Consortium of Universities for the Advancement of Hydrologic Science, Inc. (CUAHSI, 2002) has taken on the leading role in developing the key elements of the community’s future information system.
In order to obtain a better understanding of the community’s data needs, a survey was conducted. Many issues emerged, like the need for benchmark data sets, better data models that relate and describe hydrologic processes, community modeling expertise with calibrated and identified test scenarios, and knowledge discovery in databases (KDD). The biggest concern, however, was the lack of availability of and access to existing datasets, and the time individual researchers and scientists have to spend to search, inspect, collect, and then reformat data for their respective needs.
The problem manifests itself on many levels. First, predictions by the National Oceanic and Atmospheric Administration, NOAA, show (Zevin, 2000) that its data repository is expected to grow by 100% every two years. The sheer increase in volume is slated to aggravate the difficulties that currently surround the data infrastructure. Secondly, the heterogeneity of the hydrologic science data world poses an enormous difficulty in trying to unify and standardize data descriptions and data formats. Data sets vary considerably in their origin, because hydrology encompasses many areas that cover subsurface, surface, and atmospheric environs, including all the interfaces between these. Third, even within a small subgroup of the community, the absence of standards makes it considerably harder to share data and results, because common expectations, intent, and means to describe data, or even the use of the same semantics, are sorely missing. For example, it should be relatively simple to identify a unique descriptor for measured stream gauge data. However, as shown in Table 1, a search among gauge data providers revealed the use of 5 different words or word combinations to identify the very same data set.
Table 1 Various names used to identify gage measurements
Measurement | Organization | Link
Gage height | USGS | http://waterdata.usgs.gov/nwis/discharge
Gage | State of Colorado Division of Water Res. | http://dwr.state.co.us/Hydrology/Ratings/ALAWIGCO02.txt
Water depth | Montana Natural Resource Informat. Sys. | http://nris.state.mt.us/wis/
Stage | US Army Corps of Engineers, New Orleans | http://www.mvn.usace.army.mil/eng/edhd/Wcontrol/miss.htm
Water level | Delaware River and Bay PORTS | http://coops.nos.noaa.gov/dbports/AllWL_db.html
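The vocabulary mismatch of Table 1 can be illustrated with a small lookup that maps each provider label to one shared descriptor. This is only a sketch: the canonical name `gage_height` and the function `normalize` are hypothetical choices made for illustration, not part of any agreed standard.

```python
# Hypothetical thesaurus: the five provider labels of Table 1 mapped to a
# single canonical descriptor. The name "gage_height" is an assumption.
CANONICAL_TERM = "gage_height"

SYNONYMS = {
    "gage height": CANONICAL_TERM,   # USGS
    "gage": CANONICAL_TERM,          # State of Colorado
    "water depth": CANONICAL_TERM,   # Montana NRIS
    "stage": CANONICAL_TERM,         # US Army Corps of Engineers
    "water level": CANONICAL_TERM,   # Delaware River and Bay PORTS
}


def normalize(label: str) -> str:
    """Return the canonical descriptor for a provider label.

    Matching ignores case and repeated or surrounding whitespace. Unknown
    labels raise KeyError so that gaps in the thesaurus are noticed.
    """
    key = " ".join(label.lower().split())
    return SYNONYMS[key]
```

A query for any of the five labels then resolves to the same descriptor, which is the behavior a community thesaurus would need to provide.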
While the example of Table 1 represents only a single instance, it nevertheless serves well to illustrate the problem at hand. This problem stems from the lack of a standard to describe data, both in its semantic and syntactic structure. The former refers to the use of agreed-upon descriptors or vocabulary for data (as in Table 1), while the latter addresses how data is stored, i.e. whether it is, for example, in ASCII or binary and what data storage format convention is used. As a result, it has been recognized that the adoption of a common metadata standard, or a standard for data to describe data, must be one of the future priorities of the hydrologic science community in order to overcome the existing inadequacies in data storage, description, querying, retrieval and ultimately exchange and sharing among members of the hydrologic community.
METADATA-STANDARDS
Metadata is by definition ‘data about data’ or ‘data that makes data useful’ (Grötschell and Lügger, 1998). Unfortunately, and this is part of the current state of hydrologic information science, metadata means different things to different people. Strictly speaking, scientists and researchers have used metadata for a long time, though perhaps unknowingly. Every comment line or header in a data file, and every attempt to find a descriptive chain of acronyms or abbreviations that conveys in a file name what the file contains, is an attempt at using metadata, albeit in a somewhat disorganized fashion. It is therefore worthwhile to highlight a few concepts and conventions that surround the use of metadata.
METADATA
Metadata is required to fulfill a number of tasks. First, it needs to provide a means by which a user can find a data set of interest. Hence, metadata descriptions must contain information that can be queried, for example when the data was collected, by whom, where, spanning what period in time, and of course a unique descriptor of the data variable, to name just a few important areas. Secondly, metadata elements must contain information about how the data is stored, for example in ASCII or binary format (digital), in paper form, or on tape only, and of course what storage format has been used (for example HDF, GRIB, or netCDF). Third, it must contain information about the purpose for which the data set has been collected, to give the user an idea of the context in which the data set exists. In addition, metadata must provide information about issues such as data quality assurance, calibration values for the measurement device, the sampling interval, zero or reference levels for the data, and, last but not least, pointers that direct the user to additional information about the data set (for instance a report associated with a data collection campaign) and of course to the physical location of the data itself.
It is clear that all of the above demands on metadata pose a considerable challenge in defining a standard that serves all needs. In order to bring some structure to this challenge, researchers have attempted to categorize metadata depending on the specific function of a desired metadata element. Metadata classes can be subdivided into Content, Syntactic, and Semantic Metadata, as illustrated in Table 2.
Table 2 Categories of Metadata
Category | Subcategory | Description
Content metadata | General | Same for all the project
Content metadata | Specific (Item) | Different for different data types
Syntactic metadata | Content part | Information of the data structure
Syntactic metadata | Syntactic part | Structure of the data
Semantic metadata | | Data description
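The three categories of Table 2 can be sketched as a single record type. This is a minimal illustration under stated assumptions: all class and field names are hypothetical and are not taken from ISO 19115 or any other standard. The semantic part (the data values) is optional here because the data is often kept in a separate file.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ContentMetadata:
    """'Search' metadata: elements a user can query (hypothetical fields)."""
    # General part: the same for the whole project.
    title: str
    collector: str
    location: str
    purpose: str
    # Specific part: differs from one data type to the next.
    variable: str
    n_points: int
    sampling_interval_s: float


@dataclass
class SyntacticMetadata:
    """'Use' metadata: how to locate and read the data (hypothetical fields)."""
    # Content part.
    data_pointer: str
    units: str
    datum: str
    # Syntactic part.
    storage_format: str  # e.g. "ASCII" or "netCDF"
    layout: str          # e.g. "rows" or "columns"


@dataclass
class MetadataRecord:
    content: ContentMetadata
    syntactic: SyntacticMetadata
    # Semantic part: the data instances; None when stored in a separate file.
    values: Optional[List[float]] = None

    def matches(self, variable: str, location: str) -> bool:
        """Answer a simple query using the content metadata only."""
        return (self.content.variable == variable
                and self.content.location == location)
```

Because `matches` reads only the content part, a catalog could be searched without ever opening the data file itself.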
Content metadata, sometimes also called “search” metadata, contains all the elements that can be queried and hence used to find data sets. It is commonly divided into two subcategories: a general part for information like title, time, collector, subject, purpose, and location, and a specific part that deals with, for example, the number of collected data points, time intervals, period covered, and other specific features of the data set. Syntactic metadata, sometimes referred to as “use” metadata, is divided into two parts as well. The content subcategory contains information on where to find the data set (location pointer), reference levels and datum, the unit system in use, conversion conventions, and whether the data resides on a file system or is extracted from a database. The syntactic part describes the structure of the data, for example whether an array is stored in columns or rows. Finally, the semantic metadata section contains the data itself, i.e. instances of the data items that make up the entire data file. The data does not necessarily need to be part of the metadata file; it can be, and very often is, separated out into a “pure” data file, because it can make up 99% of the entire file size. In general it is advantageous to separate the information about the data from the data itself and to parse only the metafile for relevant information.
STANDARDS
To bring some order to the data description task, several initiatives have been formed during the past few years, resulting in a number of published metadata standards. The most important ones are listed below:
• International Organization for Standardization, ISO, Technical Committee 211, 19115 norm
• Dublin Core Metadata Initiative, DCMI
• Federal Geographic Data Committee, FGDC
• ANZLIC
• Global Change Master Directory, GCMD
• EOSDIS Core System, ECS
Some of these standards originated from specific user groups (sometimes global), like the Dublin Core standard (DCMI, 2002), which has its main clientele in the library community. Others started as initiatives limited to certain countries or continents, like the FGDC (FGDC, 2002) and ANZLIC (ANZLIC, 2002), and yet others are initiatives started by large organizations, like the GCMD (GCMD, 2002) and ECS, which are hosted by NASA. The only truly international standard is the ISO norm 19115 (ISO, 2002). The standards vary considerably in their coverage. The Dublin Core, for example, has only very few elements but can be extended by a number of add-on qualifiers.
On the other hand, the FGDC and the ISO standards are quite comprehensive in their scope and content, which is the result of trying to cater to a much broader spectrum of the scientific community. While each of the standards has certain advantages and disadvantages, all of them have one thing in common: they have been developed with the realization that any standard needs to be flexible or generic enough that community-specific standards can be derived from it. Hence, in addition to the generic nature of many of the defined element classes, some standards provide rules and provisions for user-specific extensions. Perhaps the most general standard of the group is the ISO 19115 norm. In fact, other standards, like ANZLIC and FGDC, orient themselves toward ISO 19115, and considering that the GCMD is FGDC compliant, it is fair to say that the standards listed above are currently in the process of seeking compliance with the ISO 19115 norm. Hence, the ISO standard is slated to become the global reference standard in the future.
HYDROLOGIC MARKUP LANGUAGE
Before we define what a hydrologic markup language, HYDROML, must be able to do, it is worthwhile to examine what actually comprises a markup language and how it differs from a metadata standard. Like any language, it needs to be constructed from two components: the first is the set of grammar rules (syntactic structure) and the second is the dictionary (semantics) that provides the ‘words’. The latter can be derived from the metadata standards described earlier, while the former is provided through the use of the Extensible Markup Language, XML (W3C-XML, 2002). That XML is called a language is, in our context, actually a little misleading because it does not really contain any semantics at all. However, it has become the syntactic vehicle of choice not only for developing markup languages but also for any type of data description and transfer between two entities that seek to exchange information. Its advantages are that it is verbose, machine and human readable, independent of content and presentation, and that it permits validation against so-called schemas. It has been published by the World Wide Web Consortium, W3C, and has since its inauguration in 1996 seen a number of updates, reaching recommendation status in 1998. Perhaps the most commonly known markup language today is HTML (or XHTML), which is not extensible at all but can be read by browsers.
While quite a number of markup languages are currently being developed and used in a variety of areas, two stand out as perhaps the ones most closely related to the hydrologic sciences: the Geography Markup Language (GML, 2002) and the Earth Science Markup Language (ESML, 2002). The former has been developed by ISO (norm 19136) and the OpenGIS community, while the latter originated from an effort to incorporate data elements from the earth observation community and is therefore, unsurprisingly, in large part based on the ECS of NASA. While neither of the two markup languages serves the purpose of the hydrologic community, both feature certain advantages that will be made use of in developing HYDROML.
The objective of HYDROML therefore is to provide a general framework or backbone for the future data needs of the hydrologic community that can be filled and extended over the years as the specific data needs and data description definitions emerge. HYDROML will also have to be easy to understand, yet comprehensive enough to cover the data sets that the various areas produce. In addition, it must include a thesaurus that permits semantic mappings, as the community uses a diverse vocabulary to describe the same or similar processes or data sets.
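The role of XML as a purely syntactic carrier can be illustrated with a short sketch that writes a metadata record and reads it back. The element names (`hydroml`, `contentMetadata`, `syntacticMetadata` and their children) are hypothetical, since no HYDROML schema is defined here. The check for required elements is only a stand-in for real schema validation, which the Python standard library does not provide.

```python
import xml.etree.ElementTree as ET

# Sending side: build a small record. All element names are assumptions
# made for illustration; they do not come from a published schema.
record = ET.Element("hydroml")
content = ET.SubElement(record, "contentMetadata")
ET.SubElement(content, "title").text = "Stream gauge record"
ET.SubElement(content, "variable").text = "gage_height"
syntactic = ET.SubElement(record, "syntacticMetadata")
ET.SubElement(syntactic, "storageFormat").text = "ASCII"
ET.SubElement(syntactic, "units").text = "m"

# Serialize to text, as it would be passed between two parties.
xml_text = ET.tostring(record, encoding="unicode")

# Receiving side: parse the text and query it by element path.
parsed = ET.fromstring(xml_text)
variable = parsed.findtext("contentMetadata/variable")
storage_format = parsed.findtext("syntacticMetadata/storageFormat")

# Stand-in for schema validation: report which required elements are absent.
REQUIRED = (
    "contentMetadata/title",
    "contentMetadata/variable",
    "syntacticMetadata/storageFormat",
)


def missing_elements(root: ET.Element) -> list:
    """Return the required element paths that are not present under root."""
    return [path for path in REQUIRED if root.find(path) is None]
```

The XML layer only fixes how the record is written and parsed. The meaning of `variable` or `units` still has to come from an agreed metadata standard.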
Finally, HYDROML needs a structure or ontology that permits the data creator and end user to seamlessly navigate through the meta-structure of the hydrologic data realm, i.e. a navigation system that guides the user to data assemblies describing relevant processes.
The development of HYDROML consists of three steps:
1) Creation of a hydrologic domain ontology.
2) Development of a specification to describe hydrologic metadata.
3) Creation of tools to facilitate the use of HYDROML.
The domain ontology is by far the most difficult component to develop, because the field of hydrology is quite diverse in nature and concerns processes that take place in the subsurface, on the surface, in the atmosphere, and in all interfaces that connect these domains. While some processes and accompanying data sets can be isolated, many cannot. This adds a considerable level of complexity to the ontology design and to some extent precludes the creation of isolated data models. There are a number of options that can be pursued.
First, a possible approach is to isolate the horizontal and vertical water balance and develop ontologies along these water balance boundaries. The advantage is that the hydrologic cycle can be divided into sub-cycles whose processes are logically linked to each other, which in turn creates a data model that can be used to model an ontology. On the other hand, processes taking place in the interfaces, like the streambed, would seem to preclude a separation because of the interaction of the horizontal and vertical water cycle. The feasibility of this approach is currently being investigated and may lead to a promising framework along which to model HYDROML.
A second starting point for defining the hydrology domain ontology is to examine the geospatial extent of the process domain, i.e. the use of geospatial subsetting. The most obvious approach is to use the main unit in surface hydrology: the watershed. In the US, the US Geological Survey developed a hierarchic division of hydrologic units, which are classified into four levels: regions (21), sub-regions (222), accounting units (352), and cataloging units (2150). The advantages are obvious, as this hierarchical system permits the creation of a data tree along which smaller scale data models can be developed. One such approach is ArcHydro (an ArcView system), which features a data model ontology that describes surface hydrology processes, like the linkage of basin runoff to streamflow and of stream-gauge measurements to topographic features like stream cross-sections in order to compute discharge (Maidment, 2002). While this approach is very appealing, it does not take into account that important processes like precipitation (atmosphere) and groundwater flow (sub-surface) take place in domains that do not follow the delineation of a watershed, which is basically a topographical feature. Instead, they are bounded by meso-scale or even global circulation patterns (meteorology) and by the geology underlying land masses (groundwater flow), neither of which coincides with the watershed boundaries. Current research efforts are focusing on ways to overcome this mismatch and on investigating the possibilities of subsetting sub-surface and atmospheric processes on a watershed level.
A third approach is to omit all attempts to develop a high-level ontology and to concentrate on the data itself, i.e. to treat each data set as a stand-alone entity that is not related to any process at all. If one accepts the general subdivision into sub-surface, surface and atmospheric data and then proceeds to subdivide each of these three divisions into raw data and processed data that can be the result of field, laboratory or numerical model observations, then a relatively straightforward data ontology emerges, as shown in Figure 1.
Processed data can simply be data that has been subjected to QA/QC methods, but it can also be added-value data, for example the conversion of NEXRAD level III data (itself raw data) to spatially and temporally variant rainfall data. These distinctions are important because the descriptions of numerical model data, field observations and processed data may differ from each other. For example, descriptions of numerical model data need to explain the model used in the simulation; descriptions of field observation data need descriptors that identify the device used to collect the data; and processed data may need, for example, a descriptor of what interpolation formula was used to fill in a missing value. This approach is also very appealing because it automatically permits the inclusion of the geospatial watershed subsetting idea for the ‘Surface’ data.
[Figure 1 Hydrologic Data Model. Node labels: Hydrologic data; Sub-Surface; Surface; Raw data; Processed data; Recorded measurement; Lab observation; Field observation; Numerical model; QA/QC corrected; “added-value” data; Reformatted data]
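The low-level data ontology of Figure 1 can be sketched as a small nested structure. This is an illustration under stated assumptions: the three environs follow the subdivision described in the text, the leaf labels are taken from the figure, and the exact assignment of leaves to the raw and processed branches is an assumption, as is the function name `classify`.

```python
# Environs as described in the text (sub-surface, surface, atmosphere).
ENVIRONS = ("sub-surface", "surface", "atmosphere")

# Assumed tree: every environ carries the same raw/processed subdivision.
ONTOLOGY = {
    environ: {
        "raw": ("field observation", "lab observation", "numerical model"),
        "processed": ("QA/QC corrected", "added-value", "reformatted"),
    }
    for environ in ENVIRONS
}


def classify(environ: str, stage: str, kind: str) -> str:
    """Return the ontology path of a data set, or raise ValueError.

    The path places a data set in the tree without relating it to any
    hydrologic process, which is the point of the third approach.
    """
    try:
        kinds = ONTOLOGY[environ][stage]
    except KeyError as exc:
        raise ValueError(f"unknown branch: {exc}") from None
    if kind not in kinds:
        raise ValueError(f"unknown kind for {environ}/{stage}: {kind!r}")
    return f"{environ}/{stage}/{kind}"
```

A stream gauge record would be filed as `classify("surface", "raw", "field observation")`, and each path could later be given its own set of required descriptors, such as the measuring device for field observations or the model used for numerical model data.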
Once the ontology has been defined, data descriptions need to be identified. In this case GML provides a solid base from which to start, all the more as it incorporates the ISO 19115 standard and is encoded in XML. While GML is very generic in nature, it can be used to derive descriptive elements for use in HYDROML at the grass-roots level. In addition, the inclusion of the 19115 norm ensures that the metadata classes can be extended to accommodate data description needs that are not part of the 19115 norm or cannot be derived from it. In this regard ESML provides very valuable elements, as it is focused on the description of popular self-describing data storage formats like HDF, GRIB, and netCDF. These descriptions, which are purely syntactic in nature, can help in closing the gap to the ISO 19115 norm, which consists mostly of content metadata elements.
For the markup language to become accepted as a standard that will be utilized by members of the hydrologic community, extensive tools need to be provided to ease the burden of creating metadata descriptions. Whatever the ontology will be in the end, it is imperative that an editor, possibly web-based, be made available to the user community that aids in navigating through the data structure so that individuals can find the proper location for their specific metadata requests. There are a number of editors available, both freeware and commercial, that can be modified to serve this purpose. Finally, it must be clear that the development of HYDROML cannot be carried out by a single individual. While the framework can be outlined and syntactic metadata can be agreed upon relatively free of conflict, the development of acceptable content and semantic metadata elements is an arduous and sometimes controversial task that will require a high degree of community involvement.
SUMMARY
We have outlined the conceptual development of a specific markup language for the hydrologic community. The motivation for this research is the spreading recognition that, in order to overcome the obstacles that researchers and scientists around the world face in storing, describing, searching, analyzing, retrieving and exchanging data, an information structure needs to be put into place that deals solely with how data and information are processed. We have outlined how a specific markup language can help pave the way to a solution for the hydrologic community, i.e. the need for adopting a global metadata standard like ISO 19115 and the ensuing community-specific derivations that result from it. Also, the adoption of XML as the encoding standard will elevate the community-specific metadata standard to a markup language that permits the use of schemas for validation, the extension of the language to accommodate the heterogeneous nature and diversity of the hydrologic community, and the passing of data streams between users in a machine-readable format.
The definition of the data ontology is the most difficult task, as there is no single best solution. We have outlined and briefly discussed three different approaches that all have some advantages and disadvantages. The biggest obstacle is that there is no clear separation between the various environs within which water cycles. A first step may therefore be to start from a low-level ontology that merely focuses on the data sets themselves without attempting to relate them to each other or to link them to specific hydrologic processes. The classification of hydrologic data into three environs, each with raw and processed data and further subdivisions, is a straightforward concept that promises to achieve the objectives. The addition of GML and ESML components provides a good starting point both to set up a simple ontology and to create a comprehensive set of content and syntactic metadata elements.
ACKNOWLEDGEMENTS
This work is funded by the National Oceanographic Partnership Program (NOPP) under grant number NAG 13-00040. In addition, the work has been supported through collaborative efforts with Rainer Lehfeldt, Christof Lippert, and Frank Sellerhoff of SMILE CONSULT GMBH in Hannover, Germany.
REFERENCES
CUAHSI (2002), Retrieved Nov. 2002 from http://www.cuahsi.org/
DCMI (2002), Dublin Core Metadata Initiative. Retrieved Sep. 2002 from http://dublincore.org/
ESML (2002), Earth Science Markup Language. Retrieved Dec. 2002 from http://esml.itsc.uah.edu/
FGDC (2002), Federal Geographic Data Committee. Retrieved Nov. 2002 from http://www.fgdc.gov/
GCMD (2002), Global Change Master Directory. NASA. Retrieved Nov. 2002 from http://gcmd.gsfc.nasa.gov/index.html
GEON (2002), The Geosciences Network, San Diego Super Computing Center. Retrieved Jan. 2003 from http://www.geongrid.org/index.html
GML (2002), Geographical Markup Language. Retrieved Jan. 2003 from http://www.opengis.net/gml/01-029/GML2.html
Grötschell M. and Lügger J. (1998), Scientific Information Systems and Metadata. Konrad-Zuse-Zentrum für Informationstechnik, Berlin, October 1998.
ISO (2002), International Organization for Standardization. Retrieved Dec. 2002 from http://www.iso.ch/
Maidment D. R. (2002), Arc Hydro: GIS for Water Resources. ESRI, California.
NVODS (2002), National Virtual Oceanographic Data System. Retrieved Dec. 2002 from http://www.po.gso.uri.edu/tracking/vodhub/vodhubhome.html
THREDDS (2002), Thematic Realtime Earth Data Distributed Servers. Retrieved Dec. 2002 from http://www.unidata.ucar.edu/projects/THREDDS/Overview/Home.htm
Zevin S. (2000), NOAA's data archives: data policy implications. Presentation to Earth Observation Data Policy and Europe (EOPOLE) Project, European Commission ENV4CT97-0760, Workshop 5, January 2000.