An XML-Based Distributed Metadata Server (DIMES) Supporting Earth Science Metadata

Ruixin Yang, Xinhua Deng, Menas Kafatos, Changzhou Wang, X. Sean Wang
Center for Earth Observing & Space Research, School of Computational Sciences, George Mason University, Fairfax, VA 22030, USA
[email protected]

Abstract

With explosively increasing volumes of remote sensing, model and other Earth science data available, and with the popularity of the Internet, scientists now face the challenge of publishing and finding interesting data sets effectively and efficiently. Metadata have been recognized as a key technology to ease the search and retrieval of Earth science data. In this paper, we discuss the DIstributed MEtadata Server (DIMES) prototype system. Designed to be flexible yet simple, DIMES uses XML to represent, store, retrieve and interoperate metadata in a distributed environment. DIMES accepts metadata in any well-formed XML format and thus assumes the “tree” semantics of metadata entries. Additional domain knowledge can be represented as specific links through the XML ID/IDREF(S) mechanism. DIMES provides a number of mechanisms, including the “nearest neighbor search”, to navigate and to search metadata. Though started for the Earth science community, DIMES can easily be extended to serve scientific communities in other disciplines.

1. Introduction

Applications and products of Earth observing and remote sensing technologies have been shown to be crucial to our global social, economic, and environmental well-being. Earth observing from space produces very large amounts of data at ever-expanding rates. The Earth Observing System (EOS) satellite Terra alone is adding more than half a terabyte of Earth science data each day [1], and other Earth observing platforms and computer weather and climate models are producing or will produce even more massive data products. Distributed information systems with effective search capability are needed to help scientists perform their research more efficiently. The Earth Science Information Partners (ESIP) program was thus established in 1997 by NASA to meet this need [2]. We, at CEOSR/GMU (Center for Earth Observing & Space Research/George Mason University), have been developing a distributed information system, the Seasonal to Interannual Earth Science Information Partner (SIESIP, URL: http://www.siesip.gmu.edu) [3]. SIESIP is a federated system and a part of the larger federation of ESIPs. The system provides services for data searching, analysis and ordering.

From our experience, we have concluded that a flexible metadata standard for representation and exchange is necessary in order to ensure the interoperability, integration, and spread of metadata uses. Metadata come in very diverse formats, since different data providers and data users usually define their own metadata schemas. Even the contents of metadata may differ significantly among those created by different users. How to handle the metadata therefore becomes a challenge to the designers and developers of distributed information systems. XML (Extensible Markup Language) provides an encoding standard for metadata. In the development of the SIESIP system, we have successfully implemented a Distributed Metadata Server, using XML, to support metadata interoperation and ingestion.

The rest of this article is organized as follows. We first briefly introduce the SIESIP system in section 2. In section 3, we discuss the XML DIMES server in detail; in particular, we present our novel metadata model in DIMES, apply the model to support metadata interoperability, and demonstrate the metadata search capability. In section 4, we summarize the XML metadata ingestion, cleaning and mapping process. Finally, we describe our long-term plan for a system supporting both data and metadata.

2. SIESIP System Overview

SIESIP is one of the 14 science-focused ESIP-2 Federation projects that were selected by NASA [2] or have since joined the federation. We are developing this distributed information system to serve data and other information products to Earth scientists, especially seasonal-to-interannual (SI) and other climate researchers. The system is therefore science driven and built specifically to serve science communities. By design, it has a “user-centric” emphasis based on different science needs, as opposed to the “data-centric” emphasis followed in traditional (often government-run) data systems such as NASA’s EOSDIS. The system will provide data and information products as well as data visualization, analysis and user support capabilities.

The SIESIP system adopts a client-server architecture. The emphasis of development efforts to date has been on the server side, to support various queries for information and data products. On the client side, some Web-based graphical interfaces have been developed as prototypes [4, 5, 6]. Because the communication protocols to the SIESIP servers are well defined, users can also develop their own client tools to access the servers.

A three-level system access model is currently supported; we term it the three-phase data access model. Phase 1 is metadata-based system access; phase 2 uses online data analysis to get a quick estimate on real data; and phase 3 is data ordering through SIESIP. With this three-phase model, the server side of SIESIP can be considered a collection of service and product providers accessible through the Internet. Currently, queries of all three phases are supported by SIESIP servers. A certain level of data interoperability is also achieved by using ftp and DODS (Distributed Oceanographic Data System) technologies [7]. In this paper, the emphasis is on phase 1, metadata queries. We use an XML-based system to support metadata interoperability.

3. XML-Based Distributed Metadata Server (DIMES)

The Extensible Markup Language (XML) is an ideal standard for describing ASCII-based data, since data encoded in XML are understandable by both human users and computers. Most metadata for Earth science data are in ASCII format and can therefore be migrated easily to XML. At present, most work on XML-based metadata aims to define XML structures (tags and relations) for specific disciplines. We, on the other hand, try to provide an XML-based software solution that supports a wide variety of metadata. DIMES, which includes a metadata model, an XML query tool, and an XML ingester and cleaner, is a successful example of such a solution.

3.1. Metadata Model

As metadata vary in their contents and formats, the need for a flexible standard for metadata representation and exchange to ensure metadata interoperability is indisputable [8]. A common weakness of many existing Earth science distributed information systems is the lack of metadata interoperability support. To support metadata interoperability, the system should facilitate the search of metadata from different sources and providers without requiring major, if any, development effort from the user. This requires a flexible metadata model capable of capturing diverse sources and formats of metadata. XML provides a solution for this need. We therefore developed our metadata model using XML technology.

A naive way to integrate metadata from heterogeneous sources is to represent the metadata from the different sources in XML format without additional restrictions, and to pull all these XML documents together as the metadata repository. Though this approach requires minimum effort, it does not effectively support the search of metadata, since there is no common agreement among the different XML documents. In DIMES, we introduce some special attributes for the XML elements representing metadata concepts, and hence require some additional semantic enforcement. However, we follow the philosophy of minimizing the semantic requirements: special attributes, if present in the XML (metadata) documents, are recognized and utilized automatically during the search of metadata. For XML documents without the special attributes, the tree structure of the XML is used instead.

An XML document naturally presents data or knowledge as a DOM (Document Object Model) tree. Similarly, we interpret the metadata recorded in an XML document as a derived metadata tree, as follows: an element with a unique ID attribute in the DOM tree, together with all its subelements without ID attributes (which thus form a sub-tree), corresponds to a metadata node. In other words, a node is an ID-identifiable XML element and its “non-node” subelements. A node N1 is a child of another node N2 if the element corresponding to N1 is a descendant of the element corresponding to N2. The child/parent relationship between two metadata nodes (or the descendant/ancestor relationship between ID-identifiable elements) often reflects the type/instance relationship between concepts. For instance, the EL-NINO node may have a child node EL-NINO-1991, indicating that EL-NINO-1991 is a node of type EL-NINO.
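The derivation of metadata nodes from the DOM tree can be illustrated with a minimal sketch. It is not the DIMES implementation: the sample document, the element names and the function `derive_nodes` are hypothetical, and the sketch assumes that the parent recorded for a node is its nearest ID-identifiable ancestor.

```python
import xml.etree.ElementTree as ET

# Hypothetical metadata document; element names are illustrative only.
DOC = """
<metadata>
  <phenomenon ID="n1">
    <name>EL-NINO</name>
    <event ID="n2">
      <name>EL-NINO-1991</name>
      <period><start>1991</start><end>1992</end></period>
    </event>
  </phenomenon>
</metadata>
"""


def derive_nodes(root, id_attr="ID"):
    """Return {node_id: {"parent": id or None, "elements": [tag, ...]}}."""
    nodes = {}

    def walk(elem, owner):
        node_id = elem.get(id_attr)
        if node_id is not None:
            # An ID-identifiable element starts a new metadata node; its
            # parent is the nearest enclosing metadata node (an assumption).
            nodes[node_id] = {"parent": owner, "elements": [elem.tag]}
            owner = node_id
        elif owner is not None:
            # A subelement without an ID belongs to the enclosing node.
            nodes[owner]["elements"].append(elem.tag)
        for child in elem:
            walk(child, owner)

    walk(root, None)
    return nodes


nodes = derive_nodes(ET.fromstring(DOC))
for node_id, info in nodes.items():
    print(node_id, info["parent"], info["elements"])
```

Elements outside any ID-identifiable element, such as the document root here, are simply not assigned to a node in this sketch.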
However, a single instance may have multiple types, so it is desirable to allow multiple parents for a node. Unfortunately, the basic XML model does not support multiple parents for a single element. Hence, we introduce the special attribute node_types (with DTD type IDREFS) to record additional parents of a node. We treat the structural child/parent relationship in the same way as the child/parent relationship defined by node_types. In addition, we propose another attribute, type_instances, to record the reverse relationship of node_types. For example, if the node EL-NINO-1991 has ID=1 and the node PHENOMENA-1991 has ID=2, then we can add node_types=2 to EL-NINO-1991 and type_instances=1 to PHENOMENA-1991. Though redundant, the type_instances attribute is helpful in finding the nodes whose type is a given type (node).

In addition to the type/instance relationship, nodes can be related in other ways. We thus introduce a third special IDREFS attribute, namely refer_to. For simplicity, we assume that the refer_to relationship is symmetric, i.e., if a node A refers to a node B, then B also refers back to A. This attribute thus assumes only minimal semantics for linking a pair of related nodes.

Figure 1. The node relation graph

The last special IDREFS attribute we introduce is inline_types. Intuitively, a node represents a piece of identifiable metadata information, and different nodes may be presented to the user separately. In practice, many nodes share common information, and such information is isolated and represented as a separate node, even though that node by itself does not provide a complete piece of knowledge. For example, many data sets have the same TEMPORAL-COVERAGE, so the TEMPORAL-COVERAGE itself becomes a node. We can define the node type TEMPORAL-COVERAGE as an inline node type of the node type data set by using the inline_types attribute. Then, when a temporal coverage node is related to a data set node, the temporal coverage node is inline for the data set node. When presenting or searching the metadata, the system assumes that an inline node is conceptually a coherent part of its containing nodes.

Figure 1 demonstrates the node relationships described above. Lines with arrows denote the special links we introduced in our metadata model. These links are utilized during the search of the metadata, as we elaborate in the following sections. Different data providers might use additional attributes or links to represent special semantics that satisfy their own metadata requirements, and those semantics probably differ from one provider to another. Though it is possible to add more types of links with specific semantics, such a list of link types is endless. Instead, DIMES is designed to make minimum assumptions about data providers and to provide maximum flexibility for interoperating heterogeneous metadata without major modification of existing systems.
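As an illustration of how the special attributes could be turned into a searchable relation graph, the following sketch reads node_types, type_instances and refer_to from a small document. The document, the element names and `build_graph` are hypothetical. The IDs are written as n1, n2, and so on, because XML ID values must be names and cannot begin with a digit. The inline_types attribute is left out to keep the sketch short.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical document using the special IDREFS attributes of the model.
DOC = """
<metadata>
  <phenomenon ID="n1">
    <name>EL-NINO</name>
    <event ID="n2" node_types="n3" refer_to="n4">
      <name>EL-NINO-1991</name>
    </event>
  </phenomenon>
  <yearly_phenomena ID="n3" type_instances="n2">
    <name>PHENOMENA-1991</name>
  </yearly_phenomena>
  <data_set ID="n4"><name>SSTA</name></data_set>
</metadata>
"""

# Attribute name -> relation label used on the graph edges.
LINK_ATTRS = (
    ("node_types", "type/instance"),
    ("type_instances", "type/instance"),
    ("refer_to", "refer_to"),
)


def build_graph(root):
    """Return {node_id: {(relation, neighbor_id), ...}} with symmetric edges."""
    graph = defaultdict(set)

    def link(a, b, relation):
        graph[a].add((relation, b))
        graph[b].add((relation, a))

    def walk(elem, owner):
        node_id = elem.get("ID")
        if node_id is not None:
            graph[node_id]  # register the node even if it has no links
            if owner is not None:
                link(owner, node_id, "parent/child")
            for attr, relation in LINK_ATTRS:
                # IDREFS values are whitespace-separated lists of IDs.
                for target in elem.get(attr, "").split():
                    link(node_id, target, relation)
            owner = node_id
        for child in elem:
            walk(child, owner)

    walk(root, None)
    return graph


graph = build_graph(ET.fromstring(DOC))
for node_id in sorted(graph):
    print(node_id, sorted(graph[node_id]))
```

Because edges are stored in sets, the redundant node_types/type_instances pair between n2 and n3 collapses into a single undirected edge, and the refer_to link is symmetric without being declared on both nodes.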

3.2. XML Query Engine

In addition to the XML-based metadata model and the corresponding XML metadata documents, another major part of our XML server is an XML Query Engine, which is responsible for answering queries against the XML metadata. The main design goal of DIMES is flexibility, and the query engine is designed using the same strategy. Instead of defining fixed queries for specific purposes, we identify a few fundamental types of queries, and use these queries to answer complicated user requests.

3.2.1. Basic Queries. The simplest query is to find a node by its ID. Once the user has finished navigating and has located an interesting node, he or she needs more information about that node. Other basic queries for an Earth science data system find data sets based on regular conditions such as spatial resolution, temporal resolution, spatial coverage and temporal coverage, and on textual conditions such as key words or free text search. In our XML-based approach, these queries can be answered by evaluating the conditions on each node, including its inline nodes.

3.2.2. Nearest Neighbor Search. In DIMES, we introduce a powerful yet simple search mechanism that takes advantage of the XML structure and our extended metadata semantics (the special IDREFS attributes). We call it the nearest neighbor (or shortest-path) search [9]. Intuitively, if we consider each node as a concept in the metadata, the nearest neighbor search finds the closest neighbor concepts of a given one. This is especially useful for supporting metadata interoperability: when various kinds of metadata from diverse data sources are incorporated, the system provides a way for the user to discover metadata information that is closely related to his or her existing knowledge or focus point. As the nearest neighbor search does not require any a priori knowledge about the metadata, it provides a good starting point for ordinary users.
This also gives data providers a tool to accommodate special metadata and to make them searchable without modifying the metadata structure.

In order to specify the semantics of the query formally, we first need to define some concepts. The distance between two nodes is measured by the minimum number of relations (type/instance, parent/child, or refer_to) needed to connect the two nodes. A distance can be defined with or without a given relation list. If a relation list is given, only relations in the list can be used to connect the two nodes. Because we require the minimum number, we do not count the loops of links introduced by the symmetric refer_to relationship or by the paired type_instances and node_types relationships. For a given node, its nearest neighbor (closest node) in a given group is the node of the group at the shortest distance. There may be more than one nearest neighbor at the same shortest distance.

A simple nearest neighbor query, with one or more given nodes and a target node type, finds the list of nodes of the target type that are closest to the given node(s). The query is processed by a breadth-first search starting from the given nodes. The types of these nodes are checked first. If no node is of the target type, all the nodes that are connected to these nodes by a single relation are checked next, and the process goes on until at least one target node is found or all nodes have been checked. This algorithm ensures the shortest search path from the entry nodes to the target nodes, regardless of how one traverses those nodes. The approach has obvious advantages over the routine table-joining query in an RDBMS, and it is free of any hassle about how to join tables to reach a target table.

The flexibility of our design is reflected in the distance definition above. Data providers with special metadata may define their own relations. If they think the specially defined relations are more suitable for the distance measurement, they can simply use the new relations for the nearest neighbor queries.

3.2.3. Tree Expand Query. The tree expand query is a special variation of the nearest neighbor search. As mentioned before, the XML metadata contain a set of inter-linked nodes, and each node has some nearest neighbor nodes. If we choose one node as a root, take all its nearest neighbors as the first-level branches, and so on, we obtain a tree presentation of all the nodes. In other words, the tree expand query re-organizes a graph (nodes linked by different relations) into a tree with a specified root node. Specifically, a query with a list of nodes as its input outputs a tree according to the graph of nodes. In the result tree, only the nearest neighbors of one of the given nodes are shown. The input list has the property that each node is the nearest neighbor of its previous node. Duplicate nodes are also removed from the result tree. In practice, the tree expand query is used to present the metadata in a way that users can navigate easily and understand quickly.
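The breadth-first processing of the nearest neighbor query and the re-organization of the graph into a tree can be sketched as follows. This is a simplified illustration, not the DIMES query engine: the graph, the node types and the function names are hypothetical, and `expand_tree` expands the whole graph from a single root rather than taking a list of nodes as input.

```python
from collections import deque

# Hypothetical relation graph: node -> [(relation, neighbor)], stored symmetrically.
GRAPH = {
    "EL-NINO": [("parent/child", "EL-NINO-1991"), ("refer_to", "SSTA")],
    "EL-NINO-1991": [("parent/child", "EL-NINO"),
                     ("type/instance", "PHENOMENA-1991")],
    "PHENOMENA-1991": [("type/instance", "EL-NINO-1991")],
    "SSTA": [("refer_to", "EL-NINO"), ("parent/child", "CEOSR")],
    "CEOSR": [("parent/child", "SSTA"), ("parent/child", "PRECIP")],
    "PRECIP": [("parent/child", "CEOSR")],
}
NODE_TYPE = {
    "EL-NINO": "phenomenon", "EL-NINO-1991": "phenomenon",
    "PHENOMENA-1991": "category", "SSTA": "data_set",
    "CEOSR": "repository", "PRECIP": "data_set",
}


def nearest_neighbors(graph, node_type, starts, target_type, relations=None):
    """Return (distance, nodes of target_type closest to the start nodes)."""
    seen = set(starts)
    level = list(starts)
    distance = 0
    while level:
        hits = [n for n in level if node_type[n] == target_type]
        if hits:
            return distance, hits  # every hit lies at the same distance
        next_level = []
        for node in level:
            for relation, neighbor in graph[node]:
                # With a relation list, only the listed relations may be used.
                if relations is not None and relation not in relations:
                    continue
                if neighbor not in seen:
                    seen.add(neighbor)
                    next_level.append(neighbor)
        level = next_level
        distance += 1
    return None, []  # no node of the target type is reachable


def expand_tree(graph, root):
    """Re-organize the graph into a breadth-first tree {node: [children]}."""
    tree = {root: []}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for _, neighbor in graph[node]:
            if neighbor not in tree:  # duplicate nodes are not shown again
                tree[neighbor] = []
                tree[node].append(neighbor)
                queue.append(neighbor)
    return tree


print(nearest_neighbors(GRAPH, NODE_TYPE, ["EL-NINO-1991"], "data_set"))
print(nearest_neighbors(GRAPH, NODE_TYPE, ["EL-NINO-1991"], "data_set",
                        relations={"parent/child"}))
print(expand_tree(GRAPH, "SSTA"))
```

Tracking visited nodes is what keeps the symmetric refer_to links and the paired type/instance links from being counted as loops. Restricting the relation list changes the answer: in this sample, no data set is reachable from EL-NINO-1991 through parent/child links alone.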

3.3. Applications of DIMES

Complex queries can be answered using the above fundamental queries. For example, in order to find Phenomena and Parameters that are closely related in North America during the 1990s, we can first use basic queries to find all recorded Phenomena scoped over North America during the 1990s, and then search for their nearest neighbors of type Parameters.

The exact usage of DIMES depends on the clients. Any client can use the DIMES query engine as long as the XML-based protocol is followed. The base communication mechanism with the engine is a TCP/IP socket. Obviously, we cannot expect users to use the engine through the socket only. Therefore, a layer is added to accept XML queries through the HTTP protocol. For demonstration purposes, we have developed two web-based prototypes for exploring the capabilities of DIMES.

A web-based DIMES client usually includes a web interface, an XML Translator and an XML-to-HTML Mapper. The web interface is for users to submit queries. When a Web user submits a query, the query is passed to a specific XML Translator, which automatically translates it into one or more queries of the predefined types in XML format and sends them to the XML query engine. A specific XML-to-HTML Mapper converts the output from XML format into an HTML page and returns the result to the user. We have used Java Servlet technology and XSLT for the translator and the mapper. Since the mapping is a dynamic process, the resulting interface also depends on the query results. Even for the initial interface, we use the content-driven mechanism to define the interface based on the content of the XML document.

3.3.1. Regular search against DIMES. Figure 2 is a screen snapshot of a regular search interface of DIMES (http://esip.gmu.edu/~siesip/xml/search/interface.cgi). The appearance of the interface and the search categories are similar to those of other Earth science data search engines. The unique feature is that the content of each selection list is created dynamically from the content of the DIMES XML metadata. For example, the 2.5X2.5 degrees spatial resolution is one of the existing resolutions in the metadata; therefore, it appears in the spatial resolution list. This feature is most important for the textual searches.

Figure 2. A web-based search interface of DIMES.

Our textual search supports both free text search and field text search. As mentioned before, data providers may define their own attribute names for describing special metadata. The new attributes are new fields. The dynamically created field list makes the data-provider-specified fields searchable through the field search. Of course, the free text search works on any new fields without any changes to either the client or the server. To achieve this flexibility efficiently, the dynamic lists are usually created as batch jobs after each change to the metadata details.

3.3.2. Metadata navigation. Another prototype using the DIMES query engine is a metadata navigation system, shown in Figure 3 (URL: http://esip.gmu.edu:8080/siesip/servlet/SiesipDataTree). Its design is based on interactions with Earth scientists in the SIESIP project. After extensive and iterative interactions with the scientists on the Advisory Board of SIESIP, we defined major scenarios of how scientists navigate, browse and query Earth science metadata and data sets. By simulating how users would query our XML metadata files in these scenarios, we developed this content-driven metadata navigation prototype.
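The dynamically created selection and field lists of Section 3.3.1 could be produced by a batch job along the following lines. This is only a sketch under stated assumptions: it treats every leaf element as a searchable field, whereas DIMES may derive its fields differently, and the sample document, its field names and `build_field_lists` are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical metadata; field names and values are illustrative only.
DOC = """
<metadata>
  <data_set ID="d1">
    <spatial_resolution>2.5X2.5 degrees</spatial_resolution>
    <source>CEOSR</source>
  </data_set>
  <data_set ID="d2">
    <spatial_resolution>1X1 degrees</spatial_resolution>
    <source>CEOSR</source>
  </data_set>
  <data_set ID="d3">
    <spatial_resolution>2.5X2.5 degrees</spatial_resolution>
    <provider_note>reprocessed</provider_note>
  </data_set>
</metadata>
"""


def build_field_lists(root):
    """Map each leaf element name (field) to its sorted distinct values."""
    values = defaultdict(set)
    for elem in root.iter():
        text = (elem.text or "").strip()
        if len(elem) == 0 and text:  # leaf element carrying a value
            values[elem.tag].add(text)
    return {field: sorted(vals) for field, vals in values.items()}


field_lists = build_field_lists(ET.fromstring(DOC))
for field in sorted(field_lists):
    print(field, field_lists[field])
```

A provider-defined field such as provider_note shows up in the field list without any change to the code, which is the content-driven behavior described above.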

Figure 3. A screen shot of the metadata navigation prototype.

The relations among the nodes in the metadata are organized into a metadata tree that users can browse. The tree structure groups nodes such as “data sets” by various categories. Each category can be considered a dimension in a “multiple dimensional” database, and users can start browsing through any dimension. In addition to the tree browsing, users can use the nearest neighbor search to find data sets closely related to the current search criterion. For instance, if a user is on the SIESIP/CEOSR repository page and wants to find the data sets located at CEOSR, the user can issue the “Find Data Sets” command. The XML metadata server will return the data sets at CEOSR by using the nearest neighbor search.

The applications and advantages of our XML metadata are numerous. First of all, the model allows us to incorporate domain knowledge from scientists. For example, we have a phenomenon node EL-NINO, and we link this node to the data set node SSTA (Sea Surface Temperature Anomaly) by using a refer_to link. The addition of this link is based on domain knowledge. The knowledge allows users to search for data sets from the point of view of a specific scientific scenario, such as “tell me what data sets are related to the El Nino phenomenon.”

4. Metadata Ingestion

One major task in developing an operational metadata server is to develop a tool kit for metadata ingestion. The tool kit includes software components for both metadata-to-XML ingestion and XML-to-XML cleaning and merging. Although our metadata approach is flexible with respect to both metadata contents and metadata formats, ingesting the initial metadata into XML documents searchable by the server software is an unavoidable step. Under SIESIP, we developed tools to ingest massive metadata from the comma-separated values published daily by the GES DAAC. We have also developed ingesting tools for GrADS [10] control files, i.e., metadata in GrADS format. The ingestion is an automated process, although scientific domain knowledge, which is not recorded in the original metadata, is added manually.

Once the outside metadata are converted and ingested into our system, we need to clean them and merge them into our metadata repository. Cleaning is necessary to maintain the consistency and efficiency of the repository. We designed a general tool for this task. The XSLT technique is used to separate the domain knowledge from the programming logic of the tool. The tool is also useful for cleaning up existing XML documents and for importing external XML files into our XML metadata repository.
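The metadata-to-XML step for comma-separated values can be sketched as below. The column names, the sample rows and the function `csv_to_xml` are hypothetical and do not reflect the actual layout of the GES DAAC files; the sketch also assumes that each column name is a valid XML element name.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical CSV export; real provider files will use different columns.
CSV_TEXT = """short_name,parameter,start_date,end_date
DS_A,Sea Surface Temperature,1981-11-01,1999-12-31
DS_B,Precipitation,1987-07-01,
"""


def csv_to_xml(text, id_prefix="ds"):
    """Turn each CSV row into an ID-identifiable data_set element."""
    root = ET.Element("metadata")
    rows = csv.DictReader(io.StringIO(text))
    for index, row in enumerate(rows, start=1):
        # The generated ID makes the row a metadata node of the model.
        node = ET.SubElement(root, "data_set", ID=f"{id_prefix}{index}")
        for column, value in row.items():
            if value:  # empty cells produce no element
                ET.SubElement(node, column).text = value
    return root


root = csv_to_xml(CSV_TEXT)
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

Domain knowledge such as refer_to links would still have to be added afterwards, since it is not present in the source rows.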

5. Conclusion

We have successfully built a Distributed Metadata Server (DIMES) in SIESIP. DIMES uses XML as a vehicle to communicate with clients and as a standard to encode metadata at a conceptual level. DIMES has a novel metadata model, an XML query engine, and an associated toolkit for metadata ingestion, cleaning and mapping. Our metadata model has some unique features, such as support for domain knowledge, flexibility, and the nearest neighbor search. Metadata are ingested from the comma-separated values (CSV) that the GES DAAC publishes online daily, from model output data at COLA, and from the data sets served by the Distributed Oceanographic Data System (DODS), another federation data access system. After the initial ingestion of metadata into XML formats, we clean and map them into DIMES. The use of XML can be widely applied to distributed information systems, allowing true federations to operate.

For DIMES, we chose not to use a “standard” approach, such as the RDF from the XML community and other standards efforts, in order to make metadata easier for scientists to enter and handle. It should be noted, however, that any well-formed XML is accepted as metadata by DIMES, and additional tools that work with metadata in these standards can cooperate with other metadata in the same DIMES server. In some sense, our work is closely related to the idea of “mediators” introduced for federated databases [11], whose goal is to accommodate various metadata sources in a unified framework.

We plan to use XML to maintain annotations. Scientists usually have their own strong opinions on specific data sets. With XML metadata, we intend to allow users to add their notes to the metadata on the fly. The added annotations will become part of the metadata and can be used by their authors to search the data at later times. The remarks can also be useful for other users, since the annotation authors are usually domain experts. Of course, adding annotations involves the common update consistency problem. However, as the level of update concurrency is expected to be very low for an Earth science information system, a simple mechanism that ensures only that no update is lost can be adopted.

Our long-term goal is to integrate software components with existing data servers to build the Scientific Data and Information Super Servers (SDISS), which are defined here as servers that support interactive access to metadata, data, and domain knowledge. Figure 4 is the system architecture of SDISS. The flexibility achieved by using XML without a predefined standard makes system porting across science disciplines much easier. DIMES is an enabling system, rather than a solution system. To fully use the capabilities supported by SDISS, we will also develop an end-to-end, one-stop data information system serving Earth science communities. In this way, users can use their familiar environment to search and access data in the same way.

Acknowledgments

We acknowledge partial funding support from the NASA ESDIS ESTO prototype Project (NAG5-8607), from the Earth Science Enterprise WP-ESIP CAN Program (NCC 5-306), as well as from George Mason University.

References

[1] G. Asrar and R. Greenstone, eds., “EOS Reference Handbook,” NASA (Washington, D.C.), 1999.
[2] NASA, “NASA Selects Earth Science Information Partners,” NASA press release, Dec. 2, 1997, http://www.nasa.gov/releases/1997/.
[3] M. Kafatos, X. Wang, Z. Li, R. Yang, and D. Ziskin, “Information Technology Implementation for a Distributed Data System Serving Earth Scientists: Seasonal to Interannual ESIP,” in Proceedings of the 10th International Conference on Scientific and Statistical Database Management (M. Rafanelli and M. Jarke, eds.), pages 210-215, IEEE Computer Society, 1998.
[4] M. Kafatos, et al., “Data Exchanges and Interoperability in Distributed Earth Science Information Systems,” Proceedings of 11th Intl. Conf. on Scientific & Statistical Database Management, IEEE, page 228 (invited to chair this panel), 1999.
[5] R. Yang, Z. Li, and M. Kafatos, “Remote Sensing Data Access Via WWW-Enabled Search and Analysis Engines,” in Proceedings of the 4th Pacific Ocean Remote Sensing Conference (PORSEC 98), Qingdao, China, July 28-31, 1998.
[6] R. Yang, C. Wang, M. Kafatos, X.S. Wang, and T. El-Ghazawi, “Remote Data Access via the SIESIP Distributed Information System,” in Proceedings of the 11th International Conference on SSDBM, p. 284, IEEE Computer Society, 1999.
[7] DODS: Distributed Oceanographic Data System. http://www.unidata.ucar.edu/packages/dods/
[8] T. Vetterli, A. Vaduva, and M. Staudt, “Metadata Standards for Data Warehousing: Open Information Model vs. Common Warehouse Metamodel,” SIGMOD Record, Vol. 29, no. 3, pages 68-75, 2000.
[9] R. Goldman, N. Shivakumar, S. Venkatasubramanian, and H. Garcia-Molina, “Proximity Search in Databases,” Proceedings of the Twenty-Fourth VLDB, 1998.
[10] B. E. Doty, J. L. Kinter III, M. Fiorino, D. Hooper, R. Budich, K. Winger, U. Schulzweide, L. Calori, T. Hol, and K. Meier, “The Grid Analysis and Display System (GrADS): An update for 1997,” 13th Conf. on Interactive Information and Processing Systems for Meteorology, Oceanography, and Hydrology, pages 356-358, 1997.
[11] G. Wiederhold, “Mediators in the architecture of future information systems,” IEEE Computer, March 1992, pages 38-49.

