Using Characteristics of Computational Science Schemas for Workflow Metadata Management *

Scott Jensen and Beth Plale
Computer Science Department, Indiana University
[email protected], [email protected]

* This work was supported under NSF cooperative agreements ATM-0331480 and EIA-0202048.

Abstract

Computational science workflows are generating an ever-increasing volume of data products. Metadata for these workflows is communicated using one or more discipline-specific schemas and is not static but instead is subject to frequent updates and additions. In contrast to general XML data, the unique uses for scientific metadata allow further optimization. We propose a general metadata catalog for storing scientific metadata that is optimized for community science use and communicates metadata as XML using the schemas of scientific domains. In this paper we show that our hybrid approach outperforms the well-known inlining approach to storing XML when applied to scientific metadata.

1. Introduction

Computational science uses advanced computing capabilities to solve complex problems [4], and while cyberinfrastructure (CI) provides the middleware needed to carry out the workflows used in computational science, it must also provide scientists the ability to store and record the metadata needed for reuse of scientific workflows and data products [3]. As noted in a recent study funded by the UK e-Science Core Programme [21], "metadata is key to being able to share results" (emphasis theirs). In scientific grids, this metadata is generally communicated as XML conforming to an agreed community schema [9,24,1,28]. The CI must provide scientists with flexible query capabilities over their metadata, and it must be able to communicate that metadata efficiently as XML [31]. When managing non-scientific personal data, a "pay as you go" approach of only collecting metadata as needed has been advocated for personal dataspaces [16]. Although the authors of the iMeMex dataspace system have suggested their approach could also be applied to scientific data [6], as noted by Gray et al. in [17], metadata regarding scientific models is ephemeral and cannot be reconstructed at a later date. Additionally, there is an increased emphasis on auto-generating metadata because the task often goes undone if left to users [21]. The CI must be able to (1) catalog metadata regarding workflows as they execute, (2) communicate using the XML schemas of varied scientific communities, and (3) provide good query performance over XML for interactive queries.

The Globus MCS metadata catalog [10] and the Storage Resource Broker (SRB) [25] both provide the ability to catalog metadata - including the flexibility to add additional basic metadata as simple name/value pairs (MCS) or name-value-unit triplets (SRB). Both approaches use a relational database for metadata storage, but neither communicates metadata based on community XML schemas. In [31] the authors note that the mapping between XML and MCS was cumbersome and slow when used to catalog metadata in the Earth System Grid (ESG). However, aside from performance issues with communicating via XML, the MCS researchers found good query performance using a relational database. Additionally, there has been extensive research in the database community on using relational databases to both store and query XML, leveraging decades of research on optimizing relational database systems ([30,15]; see [20] for a summary).

Storing XML in a relational database usually takes one of two approaches: (1) storing the entire XML document as a Character Large Object (CLOB), or (2) extracting the values contained in the leaf elements of the document using a process commonly referred to as "shredding". There are multiple approaches to shredding XML for storage in a relational database, but one of the most common for schema-based XML is referred to as inlining [30]. Both the CLOB and shredding approaches exhibit limitations when cataloging the metadata of scientific workflows. While the CLOB approach is

[Figure 1 diagram: Receive Workflow Notifications; Publish Workflow Notifications; Monitor My Workflows; Compose Workflow; Search Results; Query For My Workflow Input; Record Workflow Output Files and Data]

Figure 1: Role of metadata management in the LEAD cyberinfrastructure

efficient for static data [11], it is not suitable for cataloging scientific metadata that will be updated frequently during workflow execution. The CLOB approach also performs poorly for complex queries without additional shredding to create indexes. Inlining addresses the shortcomings of the CLOB approach in querying the database since all of the leaf elements of the XML document are stored as atomic values in the relational database. However, the XML document must be reconstructed in response to queries, so building the query response is slower with inlining than under the CLOB approach (where the document is already stored as a single value). The inlining approach also suffers from a tight coupling of the XML schema and relational database schema [33]. This tight coupling restricts its ability to respond to schema changes or changes in usage patterns. Under the inlining approach, even minor changes to the XML schema mandate changes not only to the relational database schema but also to the temporary tables and SQL queries used to retrieve the metadata as well as the code used to add XML tags to the result. Research on storing XML in a relational database has focused on general solutions and not specifically on storing the metadata of scientific workflows. A characteristic of e-Science metadata schemas is that they are composed of distinct concepts which are then related to lower level elements in a “part of” relationship. As an example, the FGDC schema [14], mandated for the geospatial data of U.S. government agencies, consists of 7 major sections, some of which contain independent sub-concepts. Likewise, the ISO 19115 metadata standard [18] consists of 11 major concepts which are then composed of sub-concepts. While both of these metadata schemas focus on spatial data, the same concept and sub-concept pattern can be found in schemas ranging from astronomy [24] to the social sciences [27]. 
This concept characteristic inherent in e-Science schemas can be used to achieve the efficient updates of the shredding approach (but without the tight schema coupling) while preserving the fast query response of the CLOB approach. We propose an efficient and flexible approach to storing schema-based XML metadata in a relational database that stores each concept as a CLOB and additionally shreds the concepts to support complex queries. We achieve fast query performance while also achieving quick construction of query responses. We refer to this hybrid of the CLOB and shredding approaches as the hybrid approach. Since e-Science metadata schemas are decomposable into distinct independent concepts, updates can be performed at the concept level - avoiding the excessive duplication of data that would occur if it were applied as a general XML solution [11]. Since query responses are not constructed from the shredded relational data as under the inlining approach, a generic database structure can be used that does not exhibit the tight schema coupling of the inlining approach. However, unlike generic XML storage approaches such as the edge table used for storing schema-less XML [15], the schema's concepts provide a global ordering and are used for indexing [19]. The hybrid approach is implemented as the myLEAD personal workspace in the cyberinfrastructure of the Linked Environments for Atmospheric Discovery (LEAD) project [23]. The myLEAD workspace allows scientists to review and query the metadata of their workflows. In LEAD, metadata is communicated using the XML LEAD Metadata Schema (LMS), which is a profile of the FGDC standard [14]. The contribution of this paper is the hybrid model for storing workflow metadata in a relational database. We assess the validity of the approach experimentally by comparing the hybrid and inlining approaches. The evaluation is performed using a realistic workload drawn from the LEAD cyberinfrastructure for meteorology forecasting. The myLEAD software is currently in use as a component of the LEAD system.
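The concept-level hybrid layout can be sketched as follows. This is a minimal illustration only: the table layout, concept names, and ordering values are our own invention, not the actual myLEAD schema. Each concept's XML fragment is stored intact as a CLOB, its leaf values are shredded into a generic element table used only for search, and query responses are assembled by concatenating CLOBs in global/local order.

```python
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE concept (
    id         INTEGER PRIMARY KEY,
    object_id  TEXT,     -- global ID of the file/experiment
    name       TEXT,     -- concept name from the schema
    global_ord INTEGER,  -- schema-wide ordering of concepts
    local_ord  INTEGER,  -- ordering among sibling concepts
    clob       TEXT      -- the concept's XML fragment, stored whole
);
CREATE TABLE element (   -- shredded leaf values, used only for search
    concept_id INTEGER REFERENCES concept(id),
    name  TEXT,
    value TEXT
);
""")

# Hypothetical global ordering derived from the schema.
GLOBAL_ORDER = {"temporalCoverage": 1, "distribution": 2}

def insert_concept(object_id, fragment, local_ord=0):
    """Store the fragment whole as a CLOB and shred its leaves."""
    root = ET.fromstring(fragment)
    cur = conn.execute(
        "INSERT INTO concept (object_id, name, global_ord, local_ord, clob) "
        "VALUES (?, ?, ?, ?, ?)",
        (object_id, root.tag, GLOBAL_ORDER[root.tag], local_ord, fragment))
    cid = cur.lastrowid
    for leaf in root.iter():
        if len(leaf) == 0 and leaf.text:      # leaf element with a value
            conn.execute("INSERT INTO element VALUES (?, ?, ?)",
                         (cid, leaf.tag, leaf.text))

def full_metadata(object_id):
    # The response is built by concatenating CLOBs in global/local order;
    # the shredded element table is never touched when answering this query.
    rows = conn.execute(
        "SELECT clob FROM concept WHERE object_id = ? "
        "ORDER BY global_ord, local_ord", (object_id,))
    return "<metadata>%s</metadata>" % "".join(r[0] for r in rows)

insert_concept("obj1", "<distribution><format>netCDF</format></distribution>")
insert_concept("obj1",
               "<temporalCoverage><begin>2007-01-01</begin></temporalCoverage>")
print(full_metadata("obj1"))
```

Note how an update arriving mid-workflow touches only the affected concept's row and its shredded leaves; the rest of the document is never rewritten.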


The remainder of this paper is organized as follows. Section 2 discusses the myLEAD architecture in which the model is tested, and Section 3 provides an overview of the inlining approach. The experimental evaluation appears in Section 4. Section 5 discusses related work, and Section 6 concludes with future work.

2. Framework for a Metadata Catalog

The myLEAD metadata catalog used in the LEAD cyberinfrastructure (LEAD-CI) [12] is seeing considerable use by meteorology scientists and students, so it is a good research platform. The metadata catalog is accessed by users through a web browser portal (LEAD Portal in Figure 1), where a user browses and searches their personal workspace using metadata stored in myLEAD. The search interface is built dynamically at the start of each user session based on concept definitions stored in the metadata catalog. Through the portal, users also define new projects and configure new experiments (workflows) within those projects. The portal calls myLEAD to store metadata regarding projects and experiments as they are defined. The metadata catalog is also accessed through a web service (myLEAD Agent in Figure 1), which is used to register both the inputs and results of workflows and to add metadata as workflows execute. Workflow nodes call the myLEAD Agent to transfer and store data files in a repository, and the agent calls myLEAD to store metadata regarding each file. The agent also listens on the message bus for notifications regarding executing workflows and stores the notifications as metadata in myLEAD.

3. The Inlining Approach to Storing XML

The inlining approach essentially starts at the root node of an XML document and includes child elements in the same relation as long as their cardinality cannot exceed one and there is no recursive relationship - those subtrees are put into separate relations [30]. The relational schema under inlining can be optimized based on usage patterns - such as the cardinality of elements, the frequency of optional elements, and the types of queries issued [7,8]. In implementing inlining, we applied those optimizations that would impact the results - repetition splits of elements with cardinality greater than one and the inlining of optional elements. Query results under inlining are constructed using the sorted outer union approach [29], which is considered one of the most efficient approaches for reconstructing XML [11]. A potential problem with the sorted outer union is the wide tuples required to generate the query result (the number of columns in the result). For the LMS, the sorted outer union approach requires the union of 27 queries and results in tuples containing 102 columns. While the sorted outer union generates a result set that is approximately in schema order, the query result still needs to be tagged and assembled into an XML document. For this comparison we have implemented the constant-space tagger approach advocated in [29] as the most efficient alternative. Under the hybrid approach, on the other hand, metadata is partitioned into concepts - both general concepts such as temporal coverage or distribution that are common in many schemas and also domain-specific metadata such as the grid spacing and terrain specifications used in LEAD. Since metadata concepts are stored as CLOBs, query responses are constructed from these CLOBs using a static global ordering with a local ordering for sibling metadata concepts. As noted in [34], a global/local ordering is more efficient than the Dewey ordering used for the inlining approach.

4. Experimental Evaluation

Our hypothesis in developing the hybrid model is that the concept notion that is common in scientific metadata schemas can be used to optimize query performance for scientific workflow data. To evaluate our hypothesis, we compared the hybrid approach to the inlining approach using the following performance metrics:
- Query response time
- Insert performance
- Database size
- Scalability

Adding metadata for a new object, adding new metadata to an existing object, and querying for the metadata of an object are the most frequent operations performed. Although we measure the space requirements (i.e., database size), the responsiveness and scalability of the system are more critical than the footprint of the database due to the availability of inexpensive storage. All of the tests use metadata generated by actual meteorological forecasting workflows from the workspaces of scientists using the LEAD grid. The scalability tests are performed by scaling the workload based on a prior study of the expected workload in the LEAD grid [22]. Each workflow run in the LEAD grid is represented as an experiment in myLEAD and contains files representing input data, intermediate results, and the results of that workflow. Files can also be aggregated into a hierarchy of collections within an experiment. In the LEAD grid, the file level has a moderate level of metadata in that key terms from the Climate Forecasting CF-1.0 [13] controlled vocabulary are used to characterize input data products, but most of the metadata regarding a workflow, including configuration
parameters and notifications, is stored at the experiment level. To provide a comparable platform, we implemented the inlined and hybrid approaches using the same version of the middleware stack - mainly OGSA-DAI and Globus.
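The inlining rule described in Section 3 can be sketched as a toy decision procedure over a hand-built schema tree. This is our own illustration, not the published algorithm in full: recursive schema types are not detected here, and the optimizations mentioned above (repetition splits, inlining of optional elements) are omitted; only the cardinality test is shown.

```python
# Each schema node is (name, max_occurs, children). A child is folded into
# its parent's relation while its cardinality cannot exceed one; any
# repeating subtree starts a new relation of its own.
def inline(node, relations, current=None):
    name, max_occurs, children = node
    if current is None or max_occurs > 1:
        current = []                      # this subtree gets its own relation
        relations.append((name, current))
    current.append(name)                  # column(s) contributed by this element
    for child in children:
        inline(child, relations, current)
    return relations

# Hypothetical miniature schema: an experiment with a title and repeating files.
schema = ("experiment", 1, [
    ("title", 1, []),
    ("file", 2, [                         # cardinality > 1: separate relation
        ("name", 1, []),
        ("size", 1, []),
    ]),
])

for rel_name, cols in inline(schema, []):
    print(rel_name, cols)
```

Run on this schema, the procedure yields an `experiment` relation holding the inlined `title` and a separate `file` relation for the repeating subtree, which is exactly the tight coupling between XML schema and relational schema discussed in Section 1.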

4.2 The Data Sets

In [22] the LEAD-CI is expected to have 125 active users at any one time. We derived a realistic data space by beginning with 15 rich user workspaces and synthesizing additional users by replicating each of these workspaces 25 times, for a total of 375 users. The microbenchmark tests use the workspace of the actual LEAD user with the largest workspace. Each iteration of the workspace contains metadata for 314 objects (a file, collection, experiment, or project is each a separate metadata object). Of these, 270 represent files, which are aggregated into 30 collections, which in turn are contained in 9 experiments belonging to 4 projects. Currently collections and projects are used for logical groupings and contain only minimal metadata, so the test cases focus on files and experiments. Within each object, metadata is grouped into metadata concepts - where each concept contains sub-concepts and metadata elements (the data values contained in the leaf elements of the LMS). The average file object contains 9 metadata concepts and 27 metadata elements. In contrast, the experiments range in size from 62 to 271 metadata concepts and from 483 to 1,240 metadata elements. This difference is due to model configuration parameters and critical notifications about a workflow being stored as metadata of the experiment. As additional sources of metadata for computational workflows become available, the size of the metadata documents for experiments will increase. However, the entire metadata for an experiment is not inserted at a single point in time. As each node of an experiment's workflow executes, additional metadata becomes available and is added incrementally to the catalog. The footprint for the hybrid database is larger due to the storage of the metadata concepts as XML CLOBs in addition to shredding the metadata. However, the space required for the metadata is dwarfed by the storage requirements for the actual binary data [22].

4.3 The Scalability Workload

Workflows in the LEAD grid workload were characterized in [22] as belonging to one of four basic categories: Educational Experiments (50%), Canonical Experiments (10%), Ensemble Experiments (1%), and Data Import (39%). Each of the 125 active users is projected to run 4 workflows each day (over a 12-hour period), so we defined 125 users and populated their workspaces using the workspaces of 15 active LEAD users as described in Section 4.2. The scalability base case simulates the grid workload for a 20-minute time span based on the projected frequency of each category of workflow. The number of files generated for each workflow varies based on the category of the experiment. Educational experiments are projected to generate an average of 16 files, whereas canonical experiments consist of 8 stages (nodes in the workflow graph) and generate an average of 9 files in each stage. Ensemble experiments execute multiple canonical workflows in parallel, and the average ensemble is projected to consist of 100 canonical experiments. The data import is not actually a workflow, but represents a user importing 15 files into their workspace from a public data catalog or other external source. Data products from a node in the workflow graph are available as that stage finishes, so metadata for files generated by each stage in a canonical experiment is written in a burst. For educational experiments, all metadata is written at the end of the experiment. Canonical experiments are projected to require 12 hours to run, so every 90 minutes on average one of the 8 stages of the experiment will complete and insert metadata for the 9 files generated in that stage. Ensemble experiments write 100 batches of 9 files in the same 90-minute time frame. In a 20-minute period, the base workload creates 24 new experiments, adds 112 files to educational experiments, 99 files to canonical experiments, 801 files to ensemble experiments, and 75 files for data imports - a total of 1,087 files. A realistic query workload was calculated based on the 76.9:23.1 query-to-insert ratio of the Transaction Processing Performance Council's TPC-E benchmark [35]. The recently issued TPC-E benchmark is the closest parallel to a metadata catalog in that it simulates customers entering orders and querying for the status of their accounts and pending orders. Using the projected LEAD workload and the TPC-E ratio, the base workload inserts a total of 17,374 metadata concepts containing 65,631 elements and queries for 32,344 metadata concepts containing 142,688 elements. For each insert or query, the time it entered the work queue, the time when processing started, and the time when processing completed are logged along with the type of operation and object type. The workload is increased in multiples of the projected workload until performance indicates that the saturation point has been reached.


Figure 2: Inlined vs. hybrid - inserts and queries

4.4 Test Environment

For both the hybrid and inlined approaches, the server runs on a dual 2-core 2.0 GHz AMD Opteron with 16 GB of memory running RHEL 4. Both approaches use OGSA-DAI version 6.0 for OGSI, which uses Globus Toolkit version 3.2.1 (WS-Core) and Jakarta Tomcat Server version 4.1.31. The database is MySQL 5.0.18. The client code runs on a separate server connected by Gigabit Ethernet.

4.5 Query and Insert Performance

The most common metadata catalog activities are inserting metadata, adding metadata to an existing object, and querying for the full metadata of an object. Since the process for adding metadata to an existing object is essentially the same as inserting a new object under both the inlining and hybrid approaches, we focus on inserting new objects and querying for those objects. Inserts occur primarily when a workflow node registers files or adds metadata to an existing experiment, but a user can also import data into their workspace interactively from a public data catalog or from outside of the grid. The query for an object's metadata is issued interactively by a user when browsing their workspace. Users do this on a frequent basis not only to search their workspace, but also to review the status of ongoing workflows. Since this is done interactively, response time is critical.

4.5.1 Inserting Metadata

Using the workspace of the most active LEAD user from the data set described in Section 4.2, Figure 2(a) compares the average execution time for inserting the metadata of a file and of an experiment under both the hybrid and inlining approaches. The additional time required to insert the metadata for an experiment in comparison to a file is due to experiments containing significantly more metadata. The XML metadata inserted for each experiment ranged from 200 KB to 682 KB. We found that total processing time is closely related to the size of the metadata, but the best predictor of total processing time for both approaches is the number of metadata concepts being added. The hybrid approach requires an additional insert for the CLOBs, so we also looked at the additional cost of the CLOB insert, but it was 6% or less of the total time under the hybrid approach. The greatest difference in processing time between the two approaches was the shredding of the document. Under the hybrid approach, the shredding of the XML must parse out the CLOBs in addition to the metadata elements. The shredding is performed using XSLT, and under the inlining approach the XSLT directly writes the SQL insert statements. We intend to explore whether a variation of this could be used with the hybrid approach to reduce the insert cost. However, due to the interactive nature of querying for documents, the insert process is the less time-critical of the two activities.

4.5.2 Querying for Metadata

The most frequent query in myLEAD is for an object's full metadata based on its global ID. When a user browsing their workspace hierarchy in the portal clicks on an object, the portal issues a query based on the object's global ID. The query testing uses this fundamental query. The query benchmark uses the base data of 375 simulated users. To test query performance, we averaged the performance over 10 iterations of a suite of queries that retrieve the metadata of experiments and files. Each iteration targets different objects to prevent the querying of cached metadata, but the mix of metadata concepts is the same in each iteration. As shown in Figure 2(b), average query execution time is significantly lower under the hybrid approach.


Figure 3: Inlining and hybrid - scaled workload

Our analysis of the query execution time for files identified two factors that cause the performance of the hybrid approach to be significantly better. First, the inlining approach requires building the temporary tables used to execute the sorted outer union query. Second, the execution of the query itself is more efficient, since the hybrid approach queries a single table to retrieve the CLOBs needed to build the query response, whereas the sorted outer union query under the inlining approach must first run insert queries to populate the temporary tables and then union the results. These two factors constitute 97% of the performance difference in querying for a file's metadata. The difference in execution time illustrates the efficiency gained by only having to query the CLOB table under the hybrid approach, as opposed to querying all of the shredded data tables under inlining. Experiments have significantly more metadata than files and a greater variance in the volume of metadata stored for each experiment, resulting in a significantly wider variance in execution time. The largest experiment in the benchmark (based on the size of the XML document generated) is more than three times the size of the smallest experiment. Figure 2(c) shows the processing time when querying for the metadata of the smallest and largest experiments in the benchmark. The average time required to execute the query for experiments under inlining is nearly four times that of the hybrid approach. In addition, the processing time required for the sorted outer union query used under the inlining approach increases at a faster rate than the size of the query response document. This is in contrast to the hybrid approach, where the increase in query processing time relative to document size is sub-linear. The cost to tag the resulting document using the inlining constant-space tagger [29] is proportional to the size of the XML document generated (as expected), but the cost of tagging under the hybrid approach is nearly constant - less than 1 ms greater when tagging the response for an experiment than for a file. Under the hybrid approach, tags for all metadata concepts are already in the CLOBs containing the XML fragments, so the only additional tagging is for the schema hierarchy above the concept level - which is the same for both files and experiments.

Figure 4: Saturation workload
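The XSLT-driven shredding described in Section 4.5.1 can be approximated in a few lines of Python. This is a stand-in for the actual transform, with hypothetical table and element names; it walks a metadata fragment and emits the INSERT statements the inlining-style transform would write.

```python
import xml.etree.ElementTree as ET

def shred_to_sql(xml_text, object_id):
    """Walk the fragment and emit one INSERT per leaf element, the way
    the XSLT transform writes INSERTs directly under inlining.
    (String-formatted SQL is for illustration only; real code would use
    parameterized statements.)"""
    stmts = []
    for leaf in ET.fromstring(xml_text).iter():
        if len(leaf) == 0 and leaf.text:
            stmts.append(
                "INSERT INTO element (object_id, name, value) "
                "VALUES ('%s', '%s', '%s');" % (object_id, leaf.tag, leaf.text))
    return stmts

fragment = "<spatial><north>40.5</north><south>35.0</south></spatial>"
for s in shred_to_sql(fragment, "exp42"):
    print(s)
```

Under the hybrid approach the same pass would additionally have to capture each concept subtree as a CLOB string, which is the extra shredding cost measured above.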

4.6 Scalability

In this section we evaluate the scalability of the inlining and hybrid approaches under increasing workloads, with multiple users concurrently inserting metadata and executing queries. To test the scalability of the system, we analyze total response time for inserting files in batch mode and querying for the metadata of experiments. These two operations represent the most common operations based on the projected grid workload. A comparison of scalability under an increasing workload is shown in Figure 3. Both insert and query operations reach the saturation point by 4 times the projected workload under inlining. While inlining appears to show improvement beyond that point, the improvement is due to the slowest operations timing out and failing. Under the hybrid approach, inserts are saturated at 9 times the projected workload but queries are not. Even at 10 times the projected workload, hybrid queries outperform the inlining approach under the initial workload. A deeper examination of the two approaches at saturation is shown in Figure 4. Saturation could be caused by a lack of system resources or by collisions as inserts acquire a write lock, so we analyzed the performance of each approach at the workload that causes that approach to exhibit saturation. For the hybrid approach this is 9 times the base workload, whereas for the inlined approach it is 4 times the base workload. Figure 4(a) shows that inlining degrades quickly when inserting metadata for experiments and that the hybrid approach performs 10 times better on these inserts even though it is subject to more than twice the workload. Experiments have a broader range of metadata than files, containing hundreds more concepts, revealing a significant limitation of the inlining approach. Further, querying is often an interactive operation requiring good response time, but the inlining approach exhibits significantly degraded and inconsistent performance. In contrast, query response time under the hybrid approach barely registers in Figure 4(b). In summary, the overall better performance of the hybrid approach is attributable both to the reduction in collisions between queries and inserts and to a significant reduction in the complexity of constructing query responses.
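The saturation analysis above can be illustrated with a toy detector. This is entirely our own scaffolding, not the LEAD test driver: offered load is scaled in multiples of a base workload, a backlogged queue stands in for the catalog once demand exceeds capacity, and the first multiple at which modeled latency jumps is reported.

```python
# Toy saturation detector with made-up capacity and rate figures.
CAPACITY = 5.0     # operations the service can complete per second
BASE_RATE = 1.0    # offered operations per second at 1x workload

def avg_latency(multiple, seconds=60):
    """Model average latency for a run at the given workload multiple."""
    offered = BASE_RATE * multiple * seconds
    completed = min(offered, CAPACITY * seconds)
    backlog = offered - completed        # work still queued at end of run
    # Latency grows with the backlog each request waits behind.
    return (1.0 + backlog) / CAPACITY

def saturation_multiple(threshold=2.0, max_multiple=10):
    """Return the first workload multiple whose latency exceeds threshold."""
    for m in range(1, max_multiple + 1):
        if avg_latency(m) > threshold:
            return m
    return None

print(saturation_multiple())
```

A real harness would of course measure wall-clock latencies of concurrent inserts and queries (as the logged queue/start/completion times in Section 4.3 allow), but the detection logic is the same: scale the multiple and watch for the latency knee.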

5. Related Work

In addition to the discussion in Section 1, a recently published alternative for managing scientific data is Maitri [32]. This approach differs from ours in that it sits between the scientific tools and data libraries. Maitri maintains metadata regarding the data repository (e.g., whether there is an index over the data a user is looking for) in contrast to metadata regarding the data itself (such as the model configuration data in myLEAD). Commercial databases have recently added handling of XML data [26]. IBM's DB2 added an XML data type that can contain an entire XML document. Instead of being targeted at managing metadata, these databases are aimed at handling a wide spectrum of XML, including schema-less documents, and at allowing queries based on XPath or XQuery. In DB2, XML documents are updated as whole documents [5], so metadata attributes added during a workflow would require the entire document to be updated. Researchers have also proposed alternatives to the sorted outer union query that can be used to reconstruct XML subtrees for query responses [2]. However, similar to commercial solutions, such techniques target generic XML solutions, requiring more queries to build the response than are needed using the hybrid approach.

6. Future Work

Our current research is focused on a generic metadata catalog that could be configured for other domains based on annotations in the schema and the structure of the schema. We posit that syntactic and semantic clues in a metadata schema could be used to suggest a starting point for defining the metadata concepts addressed in scientific metadata schemas. The hybrid approach generalizes beyond LEAD. All FGDC profiles divide metadata concepts into sections, compound elements, and elements, with compound elements representing higher-level concepts that cannot be contained in a single element [14], so the application to FGDC profiles is a straight extension. The ISO 19115 international standard for geographic information is similarly a straight extension because it takes an approach similar to the FGDC's. Profiles of ISO 19115 have been adopted by many metadata initiatives worldwide, such as the ANZLIC Metadata Profile [1] and Sea-Search [28]. A longer-term focus of our research is on interfaces that allow a scientist to query their workspace using detailed concepts contained in the metadata. A hybrid approach to metadata storage can provide a generic interface to complex concepts.

7. Acknowledgements

We thank the meteorological researchers at the University of Oklahoma and at other institutions using LEAD for generating the actual forecasting workload used in this analysis, as well as the LEAD researchers working on the workflow engine for enabling them to run their forecasting experiments. We also want to thank the reviewers for their helpful feedback.
8. References

[1] ANZLIC the Spatial Information Council, "ANZLIC Metadata Profile: An Australian/New Zealand Profile of AS/NZS ISO 19115:2005, Geographic Information - Metadata", Draft Version 1.1, 2007.
[2] C. Artem, M. Atay, S. Lu, and F. Fotouhi, "XML Subtree Reconstruction from Relational Storage of XML Documents", Data & Knowledge Engineering, vol. 62, no. 2, pp. 199-218, 2007.
[3] D. Atkins, et al., "Revolutionizing Science and Engineering through Cyberinfrastructure", Report of the National Science Foundation Blue-Ribbon Advisory Panel on Cyberinfrastructure, January 2003.


[4] M. R. Benioff, E. D. Lazowska, et al., "Computational Science: Ensuring America's Competitiveness", President's Information Technology Advisory Committee, 2005.
[5] K. Beyer, et al., "DB2 Goes Hybrid: Integrating Native XML and XQuery with Relational Data and SQL", IBM Systems Journal, vol. 45, no. 2, pp. 271-298, 2006.
[6] L. Blunschi, J. P. Dittrich, O. R. Girard, S. K. Karakashian, and M. A. V. Salles, "A Dataspace Odyssey: The iMeMex Personal Dataspace Management System", CIDR 2007, January 2007.
[7] P. Bohannon, J. Freire, P. Roy, and J. Simeon, "From XML Schema to Relations: A Cost-based Approach to XML Storage", ICDE, 2002.
[8] S. Chaudhuri, Z. Chen, K. Shim, and Y. Wu, "Storing XML (with XSD) in SQL Databases: Interplay of Logical and Physical Designs", IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 12, pp. 1595-1609, 2005.
[9] L. Cinquini, "Metadata Development for the Earth System Grid", NIEeS Workshop, 2002.
[10] E. Deelman, G. Singh, et al., "Grid-based Metadata Services", in Proceedings of the 16th International Conference on Scientific and Statistical Database Management (SSDBM), pp. 393-402, 2004.
[11] D. Draper, "Mapping Between XML and Relational Data", in XQuery from the Experts, H. Katz, Ed., Addison-Wesley, 2004, pp. 309-352.
[12] K. Droegemeier, et al., "Service-oriented Environments for Dynamically Interacting with Mesoscale Weather", Computing in Science and Engineering, IEEE Computer Society Press and American Institute of Physics, vol. 7, no. 6, pp. 12-29, 2005.
[13] B. Eaton, J. Gregory, B. Drach, K. Taylor, and S. Hankin, "NetCDF Climate and Forecast (CF) Metadata Conventions", Version 1.0, 2003.
[14] Federal Geographic Data Committee, "Content Standard for Digital Geospatial Metadata Workbook Version 2.0", May 2000.
[15] D. Florescu and D. Kossmann, "Storing and Querying XML Data Using an RDBMS", IEEE Data Eng. Bull., vol. 22, no. 3, pp. 27-34, 1999.
[16] M. Franklin, A. Halevy, and D. Maier, "From Databases to Dataspaces: A New Abstraction for Information Management", SIGMOD Record, vol. 34, no. 4, pp. 27-33, 2005.
[17] J. Gray, A. S. Szalay, A. R. Thakar, C. Stoughton, and J. vandenBerg, "Online Scientific Data Curation, Publication, and Archiving", Tech. Rep. MSR-TR-2002-74, Microsoft, 2002.
[18] International Organization for Standardization, "Geographic Information - Metadata, ISO 19115:2003", 2003.
[19] S. Jensen, B. Plale, S. L. Pallickara, and Y. Sun, "A Hybrid XML-relational Grid Metadata Catalog", in Proceedings of the 2006 International Conference Workshops on Parallel Processing, pp. 15-24, 2006.
[20] R. Krishnamurthy, R. Kaushik, and J. F. Naughton, "XML-to-SQL Query Translation Literature: The State of

[21]

[22]

[23]

[24]

[25] [26]

[27] [26] [29]

[30]

[31]

[32]

[33] [34]

[35]

8

the Art and Open Problems”, in Proceedings of the 1st International XML Database Symposium, pp. 1-18, 2003. S. Newhouse, J. M. Schopf, A. Richards, and M. P. Atkinson, “Study of User Priorities for e-Infrastructure for e-Research (SUPER)”, in Proceedings of the UK e-Science All Hands Conference, 2007. B. Plale, “Workload Characterization and Analysis of Storage and Bandwidth Needs of LEAD Workspace”, LEAD TR 001 version 3.0, Linked Environments for Atmospheric Discovery (LEAD), 2007. B. Plale, D. Gannon, J. Alameda, B. Wilhelmson, S. Hampton, A. Rossi, and K. Droegemeier, “Active Management of Scientific Data”, IEEE Internet Computing Special Issue on Internet Access to Scientific Data, vol. 9, no. 1, pp. 27-34, 2005. R. Plante, et al., “VOResource: an XML Encoding Schema for Resource Metadata”, Version 1.02, 2006. At: http://www.ivoa.net/Documents/cover/VOResource20061107.html A. Rajasekar, “Managing Metadata in SRB”, SRB Workshop, 2006. At: http://www.sdsc.edu/srb/Workshop/Talks M. Rys, D. Chamberlin, and D. Florescu, “XML and Relational Database Management Systems: the Inside Story”, in Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, pp. 945-947, 2005. J. Ryssevik, “The Data Documentation Initiative (DDI) Metadata Specification”. At: http://www.ddialliance.org Sea-search European Directory of Marine Environmental Data, 2007. At: http://www.sea-search.net/edmed/welcome.html J. Shanmugasundaram, E. Shekita, R. Barr, M. Carey, B. Lindsay, H. Pirahesh, and B. Reinwald, “Efficiently Publishing Relational Data as XML Documents”, The VLDB Journal, vol. 10, no. 2-3, pp. 133-154, 2001. J. Shanmugasundaram, K. Tufte, C. Zhang, G. He, D. J. DeWitt, and J. F. Naughton, “Relational Databases for Querying XML Documents: Limitations and Opportunities”, in Proceedings of the 25th International Conference on Very Large Data Bases, pp. 302-314, 1999. G. Singh, S. Bharathi, A. Chervenak, E. Deelman, C. Kesselman, M. Mahohar, S. Pail, amd L. 
Pearlman, “A Metadata Catalog Service for Data Intensive Applications”, in Proceedings of the 2003 ACM/IEEE Conference on Supercomputing, 2003. R. R. Sinha, S. Mitra, and M. Winslett, “Maitri: Format Independent Data Management for Scientific Data”, 3rd International Workshop on Storage Network Architecture and Parallel I/Os (SNAPI), 2005. D, Suciu, “On Database Theory and XML”, SIGMOD Record, vol. 30, no. 3, pp. 39-45, 2001. I. Tatarinov, S. D. Viglas, K. Beyer, J. Shanmugasundaram, E. Shekita, and C. Zhang, “Storing and Querying Ordered XML Using a Relational Database System”, in Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, pp. 204-215, 2002. Transaction Processing Performance Council, “TPC Benchmark E Standard Specification Version 1.2.0”, 2007.