Specifying and Iterating over Virtual Datasets

Luc Moreau, Yong Zhao, Ian Foster, Jens Voeckler and Michael Wilde

October 12, 2004

Abstract

We are concerned with the following problem: How do we allow a community of researchers to define, implement, and share computational procedures able to process diverse data stored in many different formats? Standard data formats and data access APIs can help, but are not general solutions due to their assumption of homogeneity. We propose a new approach to this problem based on a separation of concerns between logical and physical structure. We use XML Schema as a type system for expressing the logical structure of datasets, and define a separate notion of a mapping that combines declarative and procedural elements to describe physical representations. Thus, for example, a collection of environmental data might be mapped variously to a set of files, a relational database, or a spreadsheet, but can look the same in all three cases to the user who accesses the data via its logical structure. This separation of concerns allows us to specify scientific workflows that operate over complex datasets with, for example, selector and iterator constructs being used to select and compute on sets of dataset elements, regardless of whether the sets in question are files in a directory, tables in a database, or columns in a spreadsheet.

1 Introduction

The Virtual Data system [9] is a computational software architecture [8] that provides an integrated treatment of data, the computational procedures that manipulate such data, and the computations that apply those procedures to data. It allows for the representation of data that does not yet exist, being defined only by computational procedures that are activated on demand. In such a virtual data system, data, procedures and computations are all first-class entities and can be published, discovered and manipulated. Such a model of computation facilitates large-scale collaborations between scientists, by making all these first-class entities accessible to a community of users, as illustrated by the GriPhyN project (www.griphyn.org).


So far, work in this area has principally focused on the design of a workflow language, the virtual data language, allowing the specification of computational procedures [8], their deployment on grids using advance planning techniques (such as Euryale and Pegasus [6]), and the recording of provenance, which captures how data was derived [8]. The resulting system is available as the Virtual Data Toolkit, which has been deployed successfully in numerous scientific applications.

However, the virtual data language suffers from some weaknesses that prevent it from being adopted widely in large-scale scientific applications. First, a pattern of work commonly adopted by scientists is difficult to replicate in the virtual data language: given a large dataset, identify a number of elements that satisfy some criteria, and iterate over all of them in order to perform some computations that generate further data. Currently, the virtual data language has no mechanism by which it can identify elements in a dataset and iterate over them. This limitation is circumvented by writing ad-hoc scripts, outside the virtual data language, which breaks the virtual data approach since it prevents true compositionality of workflows and automatic provenance gathering. Second, practical experience has shown that datasets tend to be encoded in rather complex formats, which may change over time with new data releases. Computational procedures that were designed for a given format may no longer be operational once the format changes. Such format dependency prevents code reuse over datasets that are conceptually identical, but physically represented differently. In other words, while the virtual data language aims at representing data that does not exist by computational procedures, it is currently unable to abstract away from the physical representation that such datasets will take, once generated.

The purpose of this paper is to address these two issues by introducing a notion of virtual dataset, allowing for the description and processing of datasets that do not exist, independently of their concrete physical representation, and allowing the specification of computational procedures capable of selecting subsets of datasets and iterating over them, in a manner that is agnostic of physical representation. Specifically, our contribution is to introduce XDTM, the XML Dataset Typing and Mapping system, which is a two-level description of datasets: a type system that characterises the abstract structure of datasets, complemented by a physical representation description that identifies how datasets are physically laid out in their storage. The type system that we propose to use is XML Schema [17, 3], which has the benefit of supporting powerful standardised query languages that we use in our selection methods.

This paper is organised as follows. In Section 2, we review some scientific applications making use of the virtual data language, and explain limitations that prevent easy manipulation of complex scientific datasets; we also review existing systems which, while not providing a final answer to our problem, inspired our solution. Section 3 then describes how a type system can be used to specify the structure of datasets. Leveraging XML technologies, we show in Section 4 how the XPath notation can be adopted to identify dataset elements. Then, Section 5 focuses on the physical representation of datasets. Both the type system and physical representation of datasets are used in a proposed extension of the virtual data language to perform selections and iterations over virtual datasets, in a manner that is independent of their actual physical representation (Section 6). Finally, Section 7 overviews our prototype implementation.

2 Background

First, we review some scientific applications that have strong requirements for virtual datasets, and that have driven the design of our system. We then discuss existing technologies that are capable of describing complex hierarchical data, but fall short of providing a satisfactory solution for virtual datasets.

2.1 Some Applications

The Sloan Digital Sky Survey [16] maps in detail one-quarter of the entire sky, determining the positions and absolute brightnesses of more than 100 million celestial objects. It also measures the distances to more than a million galaxies and quasars. Data Release 2, DR2, is the second major data release and provides images, imaging catalogs, spectra, and redshifts for download.

[Figure 1: An Excerpt of DR2. A directory tree rooted at imaging, containing Run directories (e.g. 94, 1239, 3478); Run 1239 contains Rerun directory 6, whose objcs directory contains Camcol directory 1, holding FITS files fpAtlas-001239-1-0011.fit ... fpAtlas-001239-1-0166.fit, fpBIN-001239-g1-0011.fit ... fpBIN-001239-z1-0166.fit, fpFieldStat-001239-1-0011.fit ... fpFieldStat-001239-1-0166.fit, fpM-001239-g1-0011.fit ... fpM-001239-z1-0166.fit, fpObjc-001239-1-0011.fit ... fpObjc-001239-1-0166.fit, and psField-001239-1-0011.fit ... psField-001239-1-0166.fit.]

Figure 1 displays an excerpt of DR2, which can be found at http://das.sdss.org/DR2/data.

We focus here on the directory imaging appearing at the root of the distribution. Inside it, we find a collection of directories, each of them containing information about a Run, which is a fraction of a strip (from pole to pole) observed in a single contiguous observing scan. Within a Run, we find Reruns, which are reprocessings of an imaging run (the underlying imaging data is the same; just the software version and calibration may have changed). Both Run and Rerun directories are named by their number: in this paper, we take Run 1239 and Rerun 6 as an example to focus the discussion. Within a Rerun directory, we find a series of directories, including objcs, object catalogs containing lists of celestial objects that have been cataloged by the survey. Within objcs, we find six subdirectories, each containing a Camcol, i.e., the output of one camera column as part of a Run. Finally, in each Camcol, we find collections of files in the FITS format, the Flexible Image Transport System, a standard method of storing astronomical data. For each detected object, the atlas image comprises the pixels that were detected as part of the object in any filter. The fpObjc files are FITS binary tables containing catalogs of detected objects output by the frames pipeline. For these files, we see that the Run number (1239) and other information are directly encoded in the file name. We refer the reader to the DR2 definition for further details.

From this brief analysis of a DR2 subset, we can see that DR2 is a complex hierarchical dataset, in which metadata is sometimes directly encoded in filenames, as illustrated for Run, Rerun, and Camcol directories and for all FITS files. Furthermore, DR2 is available in multiple forms: some institutions have DR2 available from their file system; it can also be accessed through the rsync and http protocols; alternatively, scientists sometimes package up subsets, such as specific Runs, directly as tar balls for rapid exchange and experimentation.

Such datasets are used as follows. The "Near Earth Object Search" is a typical scientific process in which the sky has been divided into fields, and one needs to iterate over them, performing some computation aimed at detecting objects. In order to achieve such a task, the scientists would need to identify in their workflow the kind of fields they wish to operate over, and start an operation over each of them.

QuarkNet (quarknet.fnal.gov) is an educational programme for high school students and teachers that seeks to train them in some of the research techniques for studying the structure of matter and the fundamental forces of nature. In this context, simple workflows are defined to process experimental data captured by sensors scattered around the country. In a specific application, it was observed that different data formats were adopted for 2002 and 2003, and that each year both ASCII and spreadsheet versions were supported. On the other hand, the computational procedures, programmed in Perl, made assumptions about the format they were intended to operate on, preventing the reuse of code from one year to the next. Within this context, a common task performed in high energy physics is the selection of a range of events, and their dispatching to specific compute nodes for further processing while balancing the load across the different nodes.

Bioinformatics [15] and medical [12] applications also have similar patterns of work, requiring comparison of a biological sequence or of medical data with each individual entry in a database of sequences or data, respectively.

2.2 XML Schema

XML Schema, standardised by the World Wide Web Consortium [17, 3], provides a means for defining the structure, content and semantics of XML documents. We assert in this paper that XML Schema can also be used to express the structure of datasets manipulated by scientific applications. We refer the reader to the XML Schema specifications for definitions of the key concepts.

Increasingly, XML is being used as a way of representing many different kinds of information that may be stored in diverse systems. This use of XML does not imply that all information needs to be converted into XML, but rather that the information can be viewed as if it were XML. Much of this information would traditionally be regarded as data rather than as documents; nevertheless, XML presents this data as documents. In this context, the Document Object Model (DOM) [11] is an application programming interface for well-formed XML documents: it defines the logical structure of documents and a programmatic way of accessing and manipulating them.

2.3 Related Data Formats

The Data Format Description Language (DFDL) embodies a descriptive approach: one chooses an appropriate data representation for an application based on its needs, and then describes the format using DFDL [2]. DFDL is not a data format but a language for describing any data format. In particular, its purpose is to describe legacy data files, and hence it is an important technique for supporting backward compatibility as formats evolve. DFDL has an ambitious scope: it aims to describe complex binary formats, including compression formats such as zip. DFDL relies on an abstract data model that specifies data structures independently from their physical representation on disk or in a stream. Primitive types include integers, floats, strings, etc., whereas compound structures include vectors, arrays, and user-specified compositions. Complementary to the abstract data model, DFDL includes a mapping to and from physical representations (e.g. integers can be required to be represented by 4 bytes in little-endian format). DFDL uses a subset of XML Schema to describe abstract data models, which are augmented by information specifying how the data model is represented physically. Specifically, mappings are declared as restrictions of primitive data types annotated by the concrete types they must be mapped to; the XML Schema is then decorated by scoped annotations specifying which mapping applies to the different types. While DFDL aims at separating mappings from abstract models, the actual annotations have to be embedded inside an XML Schema to define a specific mapping for a specific schema; therefore, the two are not separable in practice.


The METS schema [13] is a standard for encoding descriptive, administrative, and structural metadata regarding objects within a digital library. In particular, it contains a structural map, which outlines the hierarchical structure of digital library objects, and links the elements of that structure to the content files and metadata that pertain to each element. The METS standard represents a document structurally as a series of nested "div" elements, that is, as a hierarchy (e.g., a book, composed of chapters, composed of sub-chapters, themselves composed of text). Every div node in the structural map hierarchy may be connected to content files that represent that div's portion of the whole document. As opposed to DFDL or our proposed approach, METS does not separate an abstract data structure from its mapping to a physical representation, nor does it support the kind of physical aggregations encountered in scientific applications, such as complex directory structures, tar balls, etc.

Hierarchical Data Format 5, HDF5 [7], is a file format (and associated software library) for storing large and complex scientific and engineering data. HDF5 relies on a custom type system allowing arbitrary nestings of multi-dimensional arrays, compound structures similar to records, and primitive data types. The HDF5 library incorporates a virtual file layer that allows applications to specify particular file storage media such as network, memory, or remote file systems, or to specify special-purpose I/O mechanisms such as parallel I/O. The virtual file layer bears some similarity with the mapping of DFDL, which is separate from the abstract data model, but focuses on run-time access to data rather than on physical encoding. While HDF5 is not capable of describing legacy datasets that are not encoded as HDF5, some tools allow conversion to and from XML [10]. Specifically, an XML Document Type Definition (DTD) can be used to describe HDF5 files and their contents.

Microsoft ADO DataSet [1] is a "memory-resident representation of data that provides a consistent relational programming model regardless of the data source." Such "DataSets" can be read from XML documents or saved into XML documents; likewise, DataSet schemas can be converted into XML Schemas and vice versa. The approach bears some similarity with ours since it offers a uniform way of interacting with (in-memory copies of) datasets, with a specific focus on relational tables; however, the approach does not support interactions with arbitrary legacy formats. Conversion to and from XML also provides support for Web Services invocation. To some extent, the commercial product Xmlspy [19] provides a symmetric facility, by allowing the conversion of relational databases to XML (and vice versa), using XML Schema as the common model for interacting with data stored in different databases. Again, no support is provided for datasets encoded in non-relational legacy formats.

2.4 Two-Level Descriptions in myGrid

The myGrid project [18], which aims at facilitating the composition of Web Services into complex bioinformatics workflows, has recognised that the interfaces of some existing Web Services are poorly specified and do not lend themselves naturally to meaningful composition. To address this problem, myGrid explicitly stores in a registry [14] annotations that semantically describe the functionality, inputs and outputs of services, and uses such annotations to reason about service compositions. Hence, myGrid service descriptions consist of two levels: the interface of a service is typically defined by a service provider and encoded using the Web Service Description Language (WSDL), whereas its separate semantic description, potentially provided by the user or any other third party, provides further information on the service functionality. The latter information is used for reasoning over service composition, while the former is used to construct service invocations.

In the spirit of the multi-level explicit representation of information in myGrid, we propose to make the physical representation, i.e., the format, of datasets an explicit notion, coexisting with the abstract structure of the dataset, i.e., its type. Similarly to DFDL, we strive to specify format separately from type, so that a dataset of a given type can be encoded into different physical representations. Workflows that operate on datasets can then be specified in terms of the datasets' types rather than their formats, and can ultimately be executed independently of the datasets' physical representations.

3 A Type System for Virtual Datasets

We now introduce the type system that we use to specify the abstract structure of datasets. As in programming languages such as C or Java, we consider two forms of aggregation: arrays, which aggregate elements and allow us to refer to them by an index, and records (or structures), which also group elements but enable us to refer to them by name. In addition, creating and naming new types is a necessary feature that provides abstraction and expressiveness. As far as atomic types are concerned, i.e., types of datasets that cannot be decomposed into smaller elements, we consider the usual primitive types found in programming languages, such as int, float, string, boolean, etc. Furthermore, in scientific applications there exists a plethora of formats for encoding well-defined datasets, such as FITS files in astronomy, or DICOM for medical imaging. We want to be in a position where such established formats can be regarded as opaque, i.e., scientists are not required to specify their contents, which means that such datasets should also be viewed as primitive types.

Figure 2 illustrates a type-based specification of the Sloan DR2 structure, which we overviewed in Figure 1. Each element of the dataset is identified by a name, and this element has a structure that we characterise by a type. In this paper, our naming conventions are the following: element names are identifiers in lowercase letters, whereas type identifiers start with a capital letter. Furthermore, in Figure 2, types are used as labels of shaded boxes encapsulating the whole structure of the element they characterise. For example, at the root, the element named imaging has a structure defined by type Imaging, which is composed of a sequence (represented by the three dots) of elements run, themselves of type Run. The number of run elements in this sequence is unbounded, hence the range 0..∞ appearing below the element run.

A difficulty of DR2 is that some metadata is encoded in file or directory names. The structure of such metadata, i.e., metadata key and metadata type, needs to be explicit in the type system. Therefore, this amounts to providing the type system with attributes and associated values, i.e., key-value pairs, which can be attached to the datasets specified by the type system. Note that the graphical notation in Figure 2 does not make such attributes explicit; we will come back to these when covering the corresponding textual notation.

[Figure 2: Type of Dataset in SDSS DR2. A tree of typed elements: imaging (type Imaging) contains a sequence of run elements (type Run, 0..∞); run contains rerun (type ReRun, 0..∞); rerun contains directories astrom, calibChunks, corr, dplogs, fangs, fields, gangs, logs, nfcalib, objcs, photo, psFangs, ssc, sq and Zoom; objcs contains camcol (type Camcol, 1..6); camcol contains fpAtlas, fpBIN, fpFieldStat, fpM, fpObjc, psBB and psField, each with range 0..∞.]

We note from this example that the type system provides a uniform mechanism for representing the structure of datasets both "within" and "outside" files. In fact, at the level of the type system, no construct distinguishes between what is inside a file and what is outside. Such a uniform approach allows us to express workflows that operate both on datasets and on file contents. We note, however, that it is not mandatory for system designers to express the structure of a file's contents in the type system; rather, it is an opportunity that can be used when convenient for a given application. In particular, it may not be desirable to describe the structure of binary formats, and we may prefer to consider such types as opaque. This specific point highlights a crucial difference with DFDL, which seeks to describe all binary formats, for legacy reasons.

The type system must be language independent, since it should be capable of describing the structure of datasets independently of any programming or workflow language. All the features we discussed above (and more), namely array- and record-like aggregation, type naming and attributes, are supported by XML Schema [17, 3]. We therefore propose to adopt XML Schema, since its standardised nature brings along other benefits. For completeness, we show in Figure 3 an excerpt of the XML Schema for DR2 (for conciseness, we keep namespaces implicit in this paper). The schema specifies that a dataset is composed of a single element named imaging of type Imaging. The complex type Imaging is a sequence of elements run, of type Run, with a variable (and possibly unbounded) number of occurrences. The complex type Run is a sequence of elements rerun, of type Rerun. Note that a mandatory attribute is specified for type Run, with name number and a value required to be a non-negative integer. So, while every Run directory of Figure 1 was encoded by its number (e.g. 1239), in the abstract representation, each run element is named "run" and is given an attribute number with a value (e.g. 1239).
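The following is a minimal reconstruction of this schema excerpt, consistent with the description above; element and type names are taken from the text, while the exact layout and facets of the original figure may differ:

    <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
      <xsd:element name="imaging" type="Imaging"/>
      <xsd:complexType name="Imaging">
        <xsd:sequence>
          <xsd:element name="run" type="Run"
                       minOccurs="0" maxOccurs="unbounded"/>
        </xsd:sequence>
      </xsd:complexType>
      <xsd:complexType name="Run">
        <xsd:sequence>
          <xsd:element name="rerun" type="Rerun"
                       minOccurs="0" maxOccurs="unbounded"/>
        </xsd:sequence>
        <xsd:attribute name="number" type="xsd:nonNegativeInteger"
                       use="required"/>
      </xsd:complexType>
    </xsd:schema>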

Figure 3: XML Schema Excerpt for DR2

As a result, since we use XML Schemas to describe the structure of datasets, we can take the conceptual view that there is an XML document that represents a dataset and that is encoded into a physical representation. Navigating or iterating over a dataset can thus be expressed using the conceptual XML view of such a dataset. We shall come back to this specific point in Section 6, when discussing iterators. Note that the XML document is only a conceptual view of a dataset: we do not expect every dataset to be transformed into XML, since this would defeat the purpose of what we are trying to achieve here, i.e., to deal with complex scientific data formats as currently used in everyday practice.
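As an illustration of this conceptual view, the DR2 excerpt of Figure 1 corresponds to an XML document along the following lines (a sketch only; the attributes shown on fpObjc are those used in the XPath examples of Section 4, and the DR2 root element is the one assumed there):

    <DR2>
      <imaging>
        <run number="1239">
          <rerun number="6">
            <objcs>
              <camcol number="1">
                <fpObjc field="110"/>
                ...
              </camcol>
            </objcs>
          </rerun>
        </run>
      </imaging>
    </DR2>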

Since XML Schemas are standardised by the W3C, tools for editing and validating schemas exist; query languages such as XPath and XQuery are specifically conceived to operate on XML documents; and the Document Object Model [11] offers a standard programmatic interface to such documents. We shall exploit these benefits in the rest of the paper. Before turning to the problem of declaring the physical representation of datasets whose abstract structure has been specified by XML Schema, we show how the XPath notation can be used to identify elements of datasets.

4 XPath Notation for Identifying Elements in Datasets

Since datasets with complex physical representations can be seen as XML documents whose structure is specified by an XML Schema, we can leverage existing XML technologies and apply them to such datasets. Specifically, we show in this section how the XPath notation can be applied to datasets in order to identify some of their internal elements. We adopt XPath 1.0 [4] since it meets our requirements; however, we do not preclude the use of more recent versions of XPath or XQuery. We note that XPath is used in two different circumstances in our proposal: on the one hand, it is used to identify elements in mappings to physical representations; on the other, it is used as a selector language. Both aspects are developed in the following sections.

In XPath, elements of XML documents are identified by names, and nested elements are separated by a "slash operator". Hence, by using the notation /a/b for a given dataset, we would identify all elements named b that occur within an element named a in the dataset. For elements with multiple occurrences, we can refer to a specific occurrence by using its index in square brackets. Hence, /a[3]/b[4] would refer to the fourth element named b inside the third element named a. We note that XPath indices start at 1.

Concretely, looking back at Figure 2, the XPath expression /DR2/imaging/run identifies all elements run, within an element imaging, itself within an element DR2, which we assume denotes the root of the DR2 distribution. Likewise, any fpObjc element can be referred to as /DR2/imaging/run/rerun/objcs/camcol/fpObjc, or simply as //fpObjc if we do not wish to provide an explicit full path leading to the element.

Figure 4 shows more complex sample XPath expressions over the Sloan Digital Sky Survey. In Figure 4.a, we use a simple directory-like navigation of the hierarchy, allowing us to identify the 10th fpBIN file in the third camera column that comes from the first run and first rerun. By using double slashes, one can avoid specifying a complete path: for instance, Figure 4.b shows how to select all fpBIN files, whatever their depth in the survey. Some datasets are qualified by attributes, which can be used in queries to identify precisely the elements that need to be selected; Figure 4.c illustrates how to select those fpObjc files that have field value 110 and that come from run number 1239 and rerun number 6.

(a) Directory-like navigation of datasets:

    /DR2/imaging/run[1]/rerun[1]/objcs/camcol[3]/fpBIN[10]

(b) Short-cuts:

    /DR2//fpBIN

(c) Use of attributes:

    /DR2/imaging/run[@number='1239']/rerun[@number='6']/objcs/camcol[1]/fpObjc[@field='110']

Figure 4: XPath Selection over Virtual Datasets
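Such expressions are ordinary XPath and can be evaluated by any XPath engine over a DOM view of a dataset; our implementation relies on the Jaxen engine (Section 7). The fragment below is a minimal sketch of such an evaluation, assuming a DOM Document dr2 that exposes the dataset; the class SelectExample and the way the document is obtained are hypothetical:

    import java.util.List;
    import org.jaxen.dom.DOMXPath;
    import org.w3c.dom.Document;

    public class SelectExample {
        // Evaluate the query of Figure 4.c over a DOM view of DR2,
        // returning the list of matching fpObjc nodes.
        public static List select(Document dr2) throws Exception {
            DOMXPath xpath = new DOMXPath(
                "/DR2/imaging/run[@number='1239']/rerun[@number='6']"
                + "/objcs/camcol[1]/fpObjc[@field='110']");
            return xpath.selectNodes(dr2);
        }
    }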

5 Physical Representation of Datasets

In this section, we present our approach for specifying the physical representation of datasets, inspired by requirements derived from our applications.

First, since the structure of a dataset is specified by an XML Schema and since a given dataset may have different physical representations, it is appropriate to characterise the physical representation in a document that is distinct from the schema, so that the schema can be shared by multiple representations. We expect the physical representation specification to refer to the schema (using XML namespaces and conventions), but we do not want to adopt the DFDL approach, which annotates the schema directly with physical representation information, and therefore prevents the schema from being reused for multiple representations of the same dataset.

Second, like DFDL, we recognise the need to contextualise physical representations, so that we can allow a given type that occurs multiple times in a dataset to be encoded into multiple physical representations depending on its position in the dataset. Whereas DFDL relies directly on the block structure of XML Schemas to specify a notion of scope, within which a given type is encoded in a given manner, we need to provide another mechanism, since we have just precluded the use of annotations and therefore cannot take advantage of their natural scoping.

Third, our focus is on representing datasets, i.e., aggregations of data, rather than on specifying the format of arbitrary (binary) data files (for example, for legacy reasons). Hence, we do not expect a single formalism for representing physical formats to be able to characterise arbitrary files in a descriptive manner. Instead, we support a mix of declarative and procedural representations; concretely, we opt for a pair of conversion methods specifying how to read the physical representation of a dataset into the abstract one and, vice versa, how to write the abstract representation into a physical one. We anticipate that a library of such converters could be made available to the user and that new converters would be permitted to be defined, thereby providing extensibility. Such an operational view of conversion to and from physical representations (i) allows for direct reuse of legacy converters for pre-existing formats; (ii) does not have to wait for DFDL to be fully standardised; and (iii) can still leverage DFDL, since a declarative DFDL format description should allow for the automatic derivation of readers (i.e., parsers) and writers.
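The paper does not give the converter interface itself; as a minimal sketch, assuming the abstract representation is exposed through DOM (as in Section 7), such a pair of conversion methods might be declared as follows (all names are hypothetical):

    import org.w3c.dom.Element;

    // Hypothetical pair of conversion methods between a physical
    // representation and the abstract (DOM) one.
    interface Converter {
        // Read a physical representation, identified by a location
        // (e.g. a path or URL), into an abstract DOM element.
        Element read(String location) throws Exception;

        // Write an abstract DOM element into a physical representation
        // at the given location.
        void write(Element element, String location) throws Exception;
    }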

[Figure 5: Implementation of SDSS DR2. The type tree of Figure 2 annotated with physical representations: imaging and objcs are represented as plain directories; run, rerun and camcol as directories with names given by the attribute "number"; fpObjc elements as files whose names are constructed from attribute values (e.g., fpObjc-001239-1-0011.fit).]

Figure 5 illustrates, for some of the datasets in DR2, the kind of description we would like to express to denote specific physical representations. Some of these data elements (such as imaging and objcs) would be represented directly as directories. On the other hand, the elements run, rerun and camcol would be represented as directories that are named by the value of their attribute number. Finally, fpObjc elements should be encoded directly as FITS files, an astronomy file format, with a specific naming convention, in which the name is constructed from several attribute values associated with such datasets.

Our requirements indicate that it is not desirable to add annotations to an XML Schema directly. Instead, we need a mechanism by which we can refer to an element in a dataset, and indicate how it should be converted. Furthermore, XML Schema does not require element names to be unique in a schema; we therefore need a mechanism capable of referring uniquely to an element in a dataset. Such a notation already exists: given that our dataset structures are specified by XML Schema, we can use the XPath notation, which, as shown in Section 4, can identify the path that leads to a given element in a dataset. Therefore, the mapping between abstract and concrete representations is specified by a set of pairs:

    ⟨mappings⟩ ::= ⟨map⟩ | ⟨mappings⟩ ⟨map⟩
    ⟨map⟩ ::= ⟨ ⟨xpath expression⟩, ⟨implementation description⟩ ⟩

(While identifying elements by their names is precise, it may be too fine-grained in some circumstances: some datasets may contain many elements of the same type, as specified by their XML Schema, and it may then be preferable to refer to such elements by their types. We therefore also offer an alternate form of ⟨map⟩, in which an implementation description is associated with a type: ⟨map⟩ ::= ⟨ ⟨type⟩, ⟨implementation description⟩ ⟩.)

We expect the XPath expressions in a given mapping to denote different elements in a dataset, so that the order of the individual ⟨map⟩ entries can be arbitrary, since the XPath expressions do not conflict. There may be cases in which we need to allow different physical representations of a given element for the different positions at which it appears in a dataset. As an illustration, all fpObjc elements in run 1239 could take a given representation r1, while all the others would take another representation r2. A mapping could be specified as:

    ⟨/DR2/imaging/run[@number='1239']//fpObjc, r1⟩
    ⟨/DR2/imaging/run[@number!='1239']//fpObjc, r2⟩

Section 7 will introduce XDTM, the XML Dataset Typing and Mapping system, which provides an XML notation for such mappings so that they can be made explicit with datasets, and published with their XML Schemas; it will also describe the notation adopted to describe physical representations.

6 Virtual Dataset Selectors and Iterators

From our scientific applications, we have identified two fundamental operations over datasets. The first operation, termed selection, constructs a new dataset by identifying some elements of a dataset. The second consists of iterating over dataset elements, applying computational procedures whose results form a new dataset. The purpose of this section is to present the sort of constructs that would be available to programmers. The key benefit of separating the abstract structure of a dataset from its physical representation is that we can specify workflows in terms of the former, independently of the latter. Therefore, workflows become independent of the physical representation of datasets, which makes them more reusable. First, we focus on the query language that allows us to express queries over virtual datasets; we then see how such a language can be used in selectors and iterators over virtual datasets.

Every XDTM dataset is characterised by its structure and physical representation. Therefore, programmatically, we need to be able to declare a virtual dataset variable by providing a name, the dataset structure specified in terms of an XML Schema type, and optionally a physical representation. If the physical representation is not provided, the variable is polymorphic in its physical representation, meaning that it can be bound to any physical representation for the given type. Figure 6 illustrates pseudo code to operate over virtual datasets; the first three statements declare variables dr2, tmp and result, respectively of type DR2, FpObjcFiles and Results. The variable dr2 is not only declared to be an element of type DR2, but the dataset bound to it also has a physical representation specified by a given mapping file; for convenience, the contents of the mapping file are presented in Figure 8 and will be discussed in the next section.

Virtual dataset variables may be assigned like any other variables. If an l-variable for a virtual dataset was declared with a specific physical representation, the workflow system will check that the r-value to be assigned to the l-variable has the required physical representation. Through assignment, the workflow language also provides a physical representation casting operation that converts a dataset of a given representation into a dataset of another representation, but with the same type. We anticipate that such representation conversions will be drawn from previously declared converters, and that ultimately such "representation casting" could be performed automatically by the system.

A select expression takes an XPath expression and a virtual dataset expression and returns a dataset. An iterator expression applies a computational procedure to all elements of a virtual dataset, and returns the dataset consisting of all the results produced by the individual invocations of the procedure. Computational procedures are defined as functions mapping a dataset type (and an optional dataset physical representation) into a dataset type and a physical representation. We expect the workflow language to verify that a procedure is applied to datasets of suitable type and physical representation; if necessary, casting into the appropriate physical representation may be performed by the workflow language. We anticipate that workflow language implementations will be able to execute such iterations in parallel, when circumstances are appropriate.

Statement 4 of Figure 6 shows how a select expression can search for all fpObjc files with field 110 in run number 1239. The set of such files is bound to variable tmp, which itself appears in statement 5 and is used in an iterator invoking a transformation tr on each of its fpObjc files; the result is finally accumulated in the dataset bound to variable result.

    1  defineVirtualDataSet dr2 http://www.griphyn.org/SDSS#DR2            //type
                                http://luc.org/mapping-dr2-http.xml        //mapping
    2  defineVirtualDataSet tmp http://www.griphyn.org/SDSS#FpObjcFiles    //type
    3  defineVirtualDataSet result http://www.other.org#Results            //type
    4  tmp=select "/DR2//run[@number='1239']//fpObjc[@field='110']" in dr2
    5  result=forall v in tmp call tr v

Figure 6: Pseudo code for Programming with Virtual Datasets

The Chimera system [8] maintains a virtual data catalog, which is a catalog of all produced datasets and of metadata entered by the user. The XPath syntax allows us to select elements according to their position in virtual datasets, but also according to metadata stored in the virtual data catalog, as illustrated by the following query:

    /DR2//run[@number='1239']//fpAtlas[vdc:metadata('isProducedBy')='Luc']

which selects all fpAtlas files in run number 1239 such that the virtual data catalog contains a metadata entry with key isProducedBy and value Luc.

7 Implementation

In order to study the practicality of XDTM, the XML Dataset Typing and Mapping system, separating structural type from physical representation, we have implemented the selector described in Section 6, and have experimented with it over scientific datasets, including the queries of Figure 4. We present some preliminary performance results.

Our Java implementation makes use of the Jaxen XPath engine (jaxen.org), which is capable of processing XPath queries over DOM objects. Hence, the aim of our implementation is to materialise datasets as DOM objects, so that they can be navigated by Jaxen like any other DOM object, even though they refer to the actual physical representation of datasets. To this end, our implementation provides two elements: a hierarchy of classes implementing the standardised DOM interfaces [11], and a syntax in which the physical representation of datasets is described. We now discuss each of these in turn.

Figure 7 summarises the key classes required to support Virtual Datasets. We find a VDS class for each kind of physical representation of a dataset: directory in a file system (VDSDirectoryElement), reference to an HTTP server through a URL (VDSURLElement), file containing an opaque or a fully described dataset (VDSFileElement), line within a file (VDSLineElement), and conjunction of multiple representations (VDSGroupElement). Likewise, we are in the process of making relational databases and WS-Resource Property values [5] available as possible physical representations of datasets, via the classes VDSDatabaseElement and VDSGridServiceElement respectively. In addition, we need to cater for immediate values such as primitive types, which we abstract by the class VDSImmediateElement.

[Figure 7: Virtual Dataset Document Object Model Classes. VDSDomNode implements org.w3c.dom.Node; VDSDomElement implements org.w3c.dom.Element; VDSComplexElement is specialised by VDSDirectoryElement, VDSURLElement, VDSFileElement, VDSLineElement, VDSGroupElement, VDSDatabaseElement, VDSGridServiceElement and VDSImmediateElement.]

These classes offer a common way of navigating and creating datasets, as specified by the DOM interface, namely by allowing access to and creation of node children, attributes, parents, and siblings. These classes take care of performing such operations directly inside the physical representation, without requiring their user to consider the details of such representations. While such classes are crucial for supporting the concept of Virtual Dataset, they are not directly available to the programmer. Instead, the specifier of an XDTM dataset or a workflow designer refers to these classes indirectly, by writing a document that expresses the physical representation of datasets, according to a syntax that we now introduce.

Figure 8 presents an excerpt of the mappings between the abstract and physical representations of DR2. Each individual mapping identifies an element in DR2 using an XPath expression, and specifies its physical representation by one of the XML elements directory, file, line, url, etc.; such XML elements correspond to the classes that we have implemented to support Virtual Datasets. For each kind of physical representation, we indicate how the name of the representation is derived: Self refers to physical representations that have the same physical name as their corresponding element in the abstract representation; RepresentRun is a specific naming method that uses the run attribute to encode the physical name; RepresentFpObjc is used for encoding/decoding the fpObjc files. In the case of the fpObjc file, the mapping also specifies that the file type is "opaque", which means that we do not aim to describe the internal physical representation of this dataset.
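The excerpt below is a reconstruction of the kind of document shown in Figure 8. The element names (directory, file), naming methods (Self, RepresentRun, RepresentFpObjc) and the "opaque" file type come from the description above, while the attribute names and overall XML syntax are assumptions, as the exact XDTM notation may differ:

    <mapping schema="http://www.griphyn.org/SDSS#DR2">
      <map path="/DR2/imaging">
        <directory naming="Self"/>
      </map>
      <map path="/DR2/imaging/run">
        <directory naming="RepresentRun"/>
      </map>
      <map path="/DR2//fpObjc">
        <file naming="RepresentFpObjc" type="opaque"/>
      </map>
    </mapping>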

Figure 8: Excerpt of Mapping to/from Physical Representation

Note that all naming methods (Self, RepresentRun, RepresentFpObjc, ...) have a similar functionality. When a physical dataset is read, they are used to convert its physical name into the element name (and possibly some attributes) in the abstract representation. Symmetrically, when an abstract dataset is written into a physical dataset, these naming methods are used to reconstruct the physical names from the abstract element. New naming methods can be defined by the programmer, provided they comply with some predefined interfaces. Likewise, other physical representations can be supported if they are made available through a DOM interface. A specific physical format is XML, which can naturally be exposed through DOM; this provides a quick, though admittedly not the most efficient, way to integrate physical formats that offer conversion to and from XML.
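The predefined interfaces with which naming methods must comply are not given in the text; the sketch below is a plausible rendition of what a naming method such as RepresentRun might look like (the interface name, method names and signatures are all assumptions):

    import java.util.Map;

    // Hypothetical interface for naming methods: converts between the
    // physical name of an entry and its abstract element name/attributes.
    interface NamingMethod {
        // On reading: derive the abstract element name from a physical
        // name, recording any metadata encoded in it as attributes.
        String toAbstract(String physicalName, Map<String, String> attributes);

        // On writing: reconstruct the physical name from the abstract
        // element name and its attributes.
        String toPhysical(String elementName, Map<String, String> attributes);
    }

    // Run directories are named by their run number (e.g. "1239"), so the
    // abstract element is always "run" with a "number" attribute.
    class RepresentRun implements NamingMethod {
        public String toAbstract(String physicalName, Map<String, String> attributes) {
            attributes.put("number", physicalName);
            return "run";
        }
        public String toPhysical(String elementName, Map<String, String> attributes) {
            return attributes.get("number");
        }
    }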

Figure 9 summarises the key components of the architecture. A dataset is characterised by its type and its physical representation, which identify the DOM class that "encapsulates" all information regarding that dataset. The DOM interface supports common methods to navigate and create elements in DOM objects. This interface is exploited by an XPath library, which is invoked when the workflow engine has to execute selectors and iterators over datasets; similarly, this interface is used by workflow instructions for creating and saving datasets.

[Figure 9: Implementation Overview. XPath queries are handled by the Jaxen XPath library, and assignment and write requests by workflow instructions; both operate through the DOM interface over datasets characterised by their type and physical representation, backed by Grid Service, DB, File, URL, DIR and Group elements.]

A number of interesting issues need to be addressed in order to make such selectors robust enough to operate over large scientific datasets in the context of the Grid.

1. Our implementation avoids copying a physical dataset into an in-memory abstract representation of the dataset, which would be navigable through its DOM interface, since such a copy could not be held in memory for large datasets. Therefore, our implementation of such DOMs minimises the memory footprint by keeping only the information that permits navigating the dataset as requested by the XPath engine.

2. We rely on the XPath engine to issue the necessary requests to navigate a dataset, depending on the XPath query provided as input. Our DOM implementation only navigates datasets following requests from the XPath engine; it constructs new DOM elements on the fly, and it avoids caching any intermediary result, to prevent memory leaks and to minimise the memory footprint.

3. Writing a virtual dataset into its physical representation on a file system should be performed incrementally, as the dataset is constructed, if we wish to avoid creating an in-memory copy before writing to the file system.

4. Iterators may require a long time to select elements in a large dataset and to invoke computational procedures over each of them. Such iterators may have to be run in parallel to improve efficiency. Furthermore, as the length of computations increases, the probability of failures becomes non-negligible. In order to recover from such failures, the system may have to restore a previous state of the iterator: therefore, we are seeking to transform an XPath engine so that it can checkpoint its internal state, and can restart a computation from a previously checkpointed state.

In order to evaluate the XDTM approach, we considered a dataset from a functional magnetic resonance imaging (fMRI) medical application. The dataset was made available through both a file system and an HTTP server. Over such a dataset, we executed queries that selected a set of medical images satisfying some criteria. The same query was run, using XDTM, over both physical representations. In addition, for reference, we used a native implementation of the query that ran directly over the file system representation; since this native implementation made assumptions about the physical representation, it could only run over the file system, whereas the XDTM approach could run over both. Table ... summarises the execution ... MORE TO COME HERE

8 Conclusion

In this paper, we have presented the XML Dataset Typing and Mapping system, a new approach for characterising legacy datasets used in scientific applications. The fundamental idea of XDTM is that it separates the description of a dataset's structure from the description of its physical representation. We adopt the XML Schema type system to characterise a dataset's structure, while we introduce a context-aware mapping of dataset types to physical representations. This separation of concerns, structure vs. physical representation, facilitates the definition of workflows, since they can be specified in terms of dataset structures but can operate on multiple representations of them. Not only have we shown that such a modular approach can help scientists to operate on concrete scientific datasets in astronomy, high energy physics and other disciplines, but we have also shown that it can be implemented in a memory-conscious and robust manner, to accommodate long-term Grid computations operating over large datasets. Our future work is to specify a new release of the Virtual Data Language capable of operating over such Virtual Datasets.

9 Acknowledgements

This research is funded in part by the Overseas Travel Scheme of the Royal Academy of Engineering, EPSRC project (reference GR/S75758/01), and by the National Science Foundation grant 86044 (GriPhyN).


References

[1] ADO.NET DataSet. http://msdn.microsoft.com/library/default.asp?url=/library/en-us/cpguide/html/cpcontheadonetdataset.asp.

[2] Mike Beckerle and Martin Westhead. GGF DFDL Primer. Technical report, Global Grid Forum, 2004.

[3] Paul V. Biron and Ashok Malhotra. XML Schema Part 2: Datatypes. http://www.w3.org/TR/xmlschema-2/, May 2001.

[4] James Clark and Steve DeRose. XML Path Language (XPath) Version 1.0. http://www.w3.org/TR/xpath, November 1999.

[5] Karl Czajkowski, Donal Ferguson, Ian Foster, Jeffrey Frey, Steve Graham, Igor Sedukhin, David Snelling, Steve Tuecke, and William Vambenepe. The WS-Resource Framework, March 2004.

[6] Ewa Deelman, James Blythe, Yolanda Gil, Carl Kesselman, Gaurang Mehta, Karan Vahi, Kent Blackburn, Albert Lazzarini, Adam Arbree, Richard Cavanaugh, and Scott Koranda. Mapping abstract complex workflows onto grid environments. Journal of Grid Computing, 1:25–39, 2003.

[7] National Center for Supercomputing Applications (NCSA) at the University of Illinois at Urbana-Champaign (UIUC). HDF5 Wins 2002 R&D 100 Award. http://hdf.ncsa.uiuc.edu/HDF5/RD100-2002/All About HDF5.pdf.

[8] I. Foster, J. Voeckler, M. Wilde, and Y. Zhao. Chimera: A virtual data system for representing, querying and automating data derivation. In Proceedings of the 14th Conference on Scientific and Statistical Database Management, Edinburgh, Scotland, July 2002.

[9] I. Foster, J. Voeckler, M. Wilde, and Y. Zhao. The virtual data grid: a new model and architecture for data-intensive collaboration. In Proceedings of the Conference on Innovative Data Systems Research, Asilomar, CA, January 2003.

[10] The HDF5 XML Information Page, January 2004.

[11] Arnaud Le Hors, Philippe Le Hégaret, Lauren Wood, Gavin Nicol, Jonathan Robie, Mike Champion, and Steve Byrne. Document Object Model (DOM) Level 3 Core Specification Version 1.0. http://www.w3.org/TR/DOM-Level-3-Core/, April 2004.

[12] Kelvin K. Leung, Rolf A. Heckemann, Nadeem Saeed, Keith J. Brooks, Jacky B. Buckton, Kumar Changani, David G. Reid, Daniel Rueckert, Joseph V. Hajnal, Mark Holden, and Derek L.G. Hill. Analysis of serial MR images of joints, 2004.

[13] METS: An Overview and Tutorial. http://www.loc.gov/standards/mets/METSOverview.v2.html, 2003.

[14] Simon Miles, Juri Papay, Terry Payne, Keith Decker, and Luc Moreau. Towards a protocol for the attachment of semantic descriptions to grid services. In The Second European Across Grids Conference, page 10, Nicosia, Cyprus, January 2004.

[15] Alex Rodriguez, Dinanath Sulakhe, Elizabeth Marland, Veronika Nefedova, Natalia Maltsev, Michael Wilde, and Ian Foster. A Grid-Enabled Service for High-Throughput Genome Analysis, 2004.

[16] Sloan Digital Sky Survey. www.sdss.org/dr2.

[17] Henry S. Thompson, David Beech, Murray Maloney, and Noah Mendelsohn. XML Schema Part 1: Structures. http://www.w3.org/TR/xmlschema-1/, May 2001.

[18] Chris Wroe, Carole Goble, Mark Greenwood, Phillip Lord, Simon Miles, Juri Papay, Terry Payne, and Luc Moreau. Automating experiments using semantic data on a bioinformatics grid. IEEE Intelligent Systems, pages 48–55, 2004.

[19] Altova: XML development, data mapping, and content authoring tools. http://www.xmlspy.com, 2004.
