Proceedings of the 32nd Hawaii International Conference on System Sciences - 1999

Modeling and Navigation of Large Information Spaces: A Semantics Based Approach

Sudha Ram
Dept. of Management Information Systems
The University of Arizona, Tucson, AZ 85721
[email protected]

Shankaranarayanan G.
MIS Dept., School of Management
Boston University, Boston, MA 02215
[email protected]

Abstract

In this paper we present techniques for modeling the semantics of large information spaces and for navigating them. This information space represents heterogeneous data stored in different formats and distributed across multiple locations on the Internet. We also describe a prototype system called SEMQUEST (SEMantics based QUEry SysTem) that employs graph-based algorithms and allows users to interactively explore, manipulate, and relate data in large information spaces to their interests. It provides users with the flexibility to understand what is available in the information space, determine which parts are relevant, and query/retrieve underlying data using a visual framework. SEMQUEST also allows users to share software modules, permitting them to reuse data analysis/visualization code. We demonstrate system use with global climate change data collected by centers across the world. We believe this research serves as a foundation for future work in integrating information sources across the WWW.

1. Introduction and Motivation

Research on global climate change attempts to investigate and understand the effects of various phenomena, such as global warming and the greenhouse effect, on the earth's climate. The destruction of vegetation, ocean effects, and the burning of fossil fuels also contribute to changes in climate. Scientists and researchers working on global climate change develop research models that help predict the effect of one or more of these causes on the earth's climate. A good example is the recent weather phenomenon, El Niño, a disruption of the ocean-atmosphere system in the tropical Pacific, having important consequences for weather around the globe. The United States has taken a leadership role in supporting international agreements and policy making to protect the environment and decrease the rate of global warming. The quality of global climate change models and the predictions based on these models depends on the data that forms the input to these models.

There are a number of problems in managing the data and models that support research and policy making of the kind described above. These problems may be categorized based on (i) variety and volume of data, (ii) identifying relevant data for each user, (iii) combining data from multiple sources, (iv) reuse of existing software for preprocessing data, and (v) integrating data islands. In this paper we describe techniques for intuitively modeling and navigating an information space that includes all details about the various types of data. Our techniques combine the use of a semantic model and graph theory to address the problems outlined below:

(i) Variety and Volume of Data: An extraordinary amount of data (on the order of several terabytes per day) is stored and processed by scientists and researchers working on global climate change. To utilize this data, it is essential for the researcher to understand what is available, how it has been collected, and who is responsible for managing and distributing the data. Let us consider a subset of the global climate change data, which includes atmospheric and hydrology data. Atmospheric data includes data on cloud properties, surface conditions, wind properties, vegetation, and radiation. Cloud data such as cloud density, cloud water content, and cloud top temperature are measured by instruments or probes mounted on satellites and aircraft. These are constantly transmitted to earth stations located in different parts of the globe, managed by NASA. Other data include profiles of the atmosphere such as temperature, humidity, and wind velocity measured by radio-sonde balloons, vegetation measured by satellites or ground crews, and precipitation, stream flows, snowfall, and run-off data measured by earth stations. If a researcher is interested in modeling the effects of El Niño over the western United States, he/she might want access to cloud, profile, and

0-7695-0001-3/99 $10.00 (c) 1999 IEEE


precipitation data over this region. Each of these types of data is collected in different ways, stored in distributed locations, and managed by different agents. The researcher needs to be made aware of what data is available and given enough detail about the data to judge whether to explore it further.

(ii) Identifying Relevant Data for each User: Prior to requesting data, users are often interested in sampling it to ensure that it fits their requirements and that there are not many missing values. While some datasets (aggregated collections of data) have sample data available, many do not. Users may waste considerable time and effort requesting datasets and then discovering that they have little relevance to their research. Users also need to explore a large dataset and narrow their search down to smaller subsets that best suit their specific requirements.

(iii) Combining Data from Multiple Sources: If the researcher (the user of the data) is successful in identifying the data relevant to the task at hand, the next hurdle is in accessing and combining data from multiple sources. A dataset contains different types of data (vegetation data, precipitation data, etc.). Each type of data in a dataset may be temporally averaged (yearly, monthly, weekly, daily, and hourly). For example, a new precipitation station that has become operational at the start of 1997 may measure rainfall data on a daily basis. This data could be available in one dataset (dataset1). An old station may have measured the same data on an hourly basis and stored it in a different dataset (dataset2). A researcher interested in the rainfall pattern in this decade needs to partition the data for 1990-96 from dataset2, convert it to a daily average, and combine it with the data in dataset1 to get the information required. Similarly there may be differences in measurement units, precision and other semantics.
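The partition, convert, and combine steps in the rainfall example can be sketched as follows. This is a minimal illustration in Python, not the Fortran modules the researchers actually use; the function names, the in-memory record layout, and the sample values are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import date, datetime


def partition_by_years(hourly, first_year, last_year):
    """Keep only the hourly readings whose year lies in [first_year, last_year]."""
    return [(ts, v) for ts, v in hourly if first_year <= ts.year <= last_year]


def hourly_to_daily_mean(hourly):
    """Average hourly readings into one value per calendar day."""
    buckets = defaultdict(list)
    for ts, value in hourly:
        buckets[ts.date()].append(value)
    return {day: sum(vals) / len(vals) for day, vals in buckets.items()}


def combine(daily_old, daily_new):
    """Merge two daily series; on overlapping days the newer series wins."""
    merged = dict(daily_old)
    merged.update(daily_new)
    return dict(sorted(merged.items()))


# Hypothetical sample values standing in for dataset2 (hourly) and dataset1 (daily).
dataset2_hourly = [
    (datetime(1989, 6, 1, 0), 9.0),      # outside 1990-96, dropped by the partition
    (datetime(1996, 12, 31, 0), 0.0),
    (datetime(1996, 12, 31, 12), 2.0),
    (datetime(1996, 12, 31, 23), 4.0),
]
dataset1_daily = {date(1997, 1, 1): 1.5}

old_part = partition_by_years(dataset2_hourly, 1990, 1996)
series = combine(hourly_to_daily_mean(old_part), dataset1_daily)
```

Differences in units or precision, which the text also mentions, would need a further conversion step before the merge; that step is omitted here.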
While most datasets are stored in Hierarchical Data Format (HDF), some others may have a different format for storage and yet others may not be formatted at all. This further complicates access to the data.

(iv) Reuse of Existing Software for Preprocessing the Data: Most datasets are very large and usually need to be partitioned based upon the requirements of the user. The burden of partitioning the data is on the users. Each user has different needs and hence knows best how to partition the data. The software modules that partition the data are usually written in Fortran by the users and form an important, but unpublished, part of the researcher's work. Users also develop small test modules that sample the dataset and submit it to the model used in the research, or display the data in a graphical format in an attempt to gauge the quality of the data. These modules again form a key component of the researcher's work and take a fair amount of time and resources to develop. Information about such (potentially) reusable components needs to be made available to other users to help save them considerable time and effort.

(v) Integrating Data Islands: Typically each researcher has a collection of datasets, tools for preprocessing datasets, partitioned datasets that they use (or have used), and information on the location of some datasets that are of interest to him/her. It is not unusual for this information to reside on his/her personal computer. These valuable pieces of information remain unknown to the rest of the research community, thus forming data islands. Another user, and sometimes the same person, may spend time and effort in re-developing the same or similar software modules to access the data, unaware that a perfectly good re-usable module is available right next door! There may be many such data islands and these islands, when bridged, will save the research community a considerable amount of time, effort, and money.

All the problems described above indicate a need for intuitively modeling and navigating an information space that contains details about the various types of data, the associated software for processing, partitioning, and sampling the data, and the results from using them. Such an information space can also link the different data islands. As new datasets and software tools become available, these should be linked to the existing information space so that the users are made aware of the existence of the newly added components. As the Internet has become a very powerful medium that links researchers all over the globe, it is important that the information space be accessible over the web.

The problem of managing large volumes of data in a heterogeneous, distributed environment is not restricted to scientific data (and databases) such as those described above. Such problems have also been investigated in the areas of engineering and manufacturing data environments [6] and in health administration systems [13].
Our approach is applicable to industries and organizations where the data supporting the automation of entire business processes are stored in multiple individual databases, resulting in data islands. In this paper we present a way to represent and search large information spaces and to locate and retrieve the underlying heterogeneous distributed data. We use a semantic model to represent the information space. The semantics are based on the perspectives of multiple users. The remainder of the paper is organized as follows: a summary of relevant research using semantic models and graphical interfaces is presented in section 2. Section 3 describes our information space model and graph-based algorithms for navigating the model. Section 4 presents the architecture of a prototype system called SEMQUEST developed at the University of Arizona and illustrates the use of the system. Finally, section 5 provides a summary and discussion of future research.


2. Existing Approaches for Retrieving Heterogeneous Data

One solution to accessing data in a heterogeneous system is to develop a global or a federated schema. A number of research methods to integrate the schema in heterogeneous database systems have been proposed and compared [1][12]. These methods can be applied if the set of heterogeneous databases to be integrated is known a priori. In the scenario described in this paper, this set is not known beforehand and no schemas exist for any of the data.

Another proposed solution is the development of tools that can query multiple databases, providing a uniform interface for retrieving data. Most of these tools use semantic or functional data models. Some of them provide graphical interfaces and query languages, some incorporate artificial intelligence techniques, while others employ relational algebra and data transformation operations to query one or more databases. An adaptation of the graphical query language CANDID has been proposed for the extended Entity Relationship (ER) model [23]. CANDID has been designed for general semantic models and offers an interface that permits schema definition operations and updating. Other graphical interfaces for querying include the Interface for Semantic Information System (ISIS) [10] and SNAP [2] based on semantic models, G-WHIZ [11] based on a functional model, and systems based on relational models [17]. These query facilities have been demonstrated using single relational databases and their capability for querying multiple databases has not been evaluated. Moreover, these systems are effective only when the underlying database has a well-defined schema. A database visualization system that uses a graph model has been described in [3]. This system incorporates multiple paradigms for the interaction between the data model and the database it represents. While its flexibility is highlighted, its applicability to querying multiple databases has not been shown.
DATAPLEX, a heterogeneous database management system, integrates relational (INGRES) and non-relational (IMS hierarchical) databases and uses SQL for uniform data access [6]. DIRECT, a query facility for multiple databases, provides extensibility and autonomy by defining a uniform query language and a mechanism for identifying semantic equivalence and resolving semantic differences [16]. However, these systems do not support graphical interfaces to help users visualize and explore the information space. DEVise is a data exploration system that allows users to integrate several tabular datasets from local and remote sources [14]. It uses a declarative approach to query the information in the datasets and appears to be restricted in that it can integrate tabular datasets only. Database visualization systems for object-oriented databases have also been proposed [7]. The application of these systems to other kinds of data sources is not shown.

These systems are insufficient to query the global climate change data described in section 1. Most of the datasets containing the global climate change data are not tabular and, in fact, some do not have any format. Object-oriented systems are too specific for querying global climate change data. All of the above languages are declarative in nature, like SQL. When applied to scientific databases, declarative languages do not meet all of the requirements because scientific research requires a close integration of data analysis with data retrieval. This observation forms the basis of a graphical query interface proposed for scientific databases [25]. The query interface defines a procedure-oriented graphical query language using dataflow graphs, providing visual capabilities as well. Artificial intelligence based techniques, such as the rule-based inference engine that forms a component of KORTEX, have also been explored [13]. Such systems prove useful when non-experts perform data retrieval tasks and have been shown to improve performance of the database management system. The facilities, however, tend to be domain specific and do not support multiple databases.

Currently there are a number of existing systems supported by federal agencies that provide access to scientific data. The important ones among these are: (1) The Earth Observing System Data and Information System (EOSDIS) is an integral part of NASA's Earth Science Enterprise [8]. EOSDIS Version 0 Information Management System (IMS) is a co-operative effort between various data archives around the country and the Goddard Space Flight Center (GSFC). EOSDIS allows users to search for and order data from various data centers in a single session using the Internet.
However, users have to know the name of the dataset they require, or have to browse through an endless list to select the specific datasets. The global change master directory provides the capability to browse through a mass of metadata associated with this system. (2) The U.S. Geological Survey (USGS) provides access to hydrology and water resources data over the Internet using the USGS-Web system [26]. Of interest to global climate change research are the mean water flow, peak water flow, and precipitation data that are collected by gauging stations and precipitation stations in the country. This data is accessible only if the identity of the specific station is known. However, hydrologists, when looking for stations in a specific basin or county, are required to spend valuable time in selecting from many independent lists of stations, and manually collating information. (3) CLIMVIS (Climate Visualization System) is an interactive graphing tool that allows visual browsing of climate data available online at the National Climatic Data Center (NCDC) [5].


The research presented in this paper incorporates lessons learned from several of the above experiences and systems. Users' ability to access information is enhanced if they are able to visualize abstract information and formulate queries graphically [24]. The objective of our system is to allow users to model, explore, manipulate, and relate large information spaces to their interests in an interactive manner. It provides the users with the flexibility to explore the information space as well as to query and retrieve the data using a visual framework. Another important functionality required for scientific data retrieval is the data analysis component. All of the analytical functions and modeling performed are complex and require complex software modules. In addition, users make use of visualization tools to preview and observe the data using software components that they develop. Our system provides the functionality for users to share these software modules and components, thereby providing the ability to analyze the data as well. Our system also links existing systems and data sources such as EOSDIS and individual researcher’s datasets and software.

3. The Information Space

The information space is represented using a semantic model with various kinds of abstractions described later in this section. We also discuss navigation through the information space using graph-based algorithms.

3.1 A Semantic Model for Representing the Information Space

The semantic model used in this paper is based on the traditional abstractions of classification, aggregation, generalization and association. In particular it proposes concepts to represent constraints on relationships between subclasses and formally defines complex object classes such as Composites and Groups/Aggregates [18], [19]. The elementary classes of the semantic model include Simple Strong, Simple Weak, Interaction Weak, Subclasses, Composites, Groups, Aggregates, Attributes, and Domains of attributes. Each of these is related to one or more classes using relationships that include interaction, subclass, composite, group and aggregate besides attribute and domain relationships. Details of the semantic model and its constructs are available in [19].

The information space for atmospheric and hydrology data is modeled in figure 1. This model was developed over the past 2 years and involved extensive interactions with researchers in atmospheric sciences and researchers in hydrology at The University of Arizona [20]. A number of valuable lessons were learned in the process of developing this information space model. Information systems researchers commonly use conceptual models such as the Entity Relationship Model [4]. Communicating this idea to scientific researchers was our first major hurdle. Understanding their requirements and translating them into the semantic model was the second hurdle. The process also helped us appreciate the importance of using extensions to traditional ER models to adequately represent concepts in scientific data. Some of these extended constructs and their use are illustrated in the following examples.

Figure 1 shows a part of the information space representing atmospheric science and hydrology data. A dataset is a collection of different types of data. A single such dataset could contain multiple types of data like wind data, humidity data, and temperature profile data measured over a specific time period over a specific region. Surface measurements such as earth surface temperature data (called skin temperature) and precipitation data measured over the same region, over the same time period, may also reside in the same dataset. Each dataset is often viewed along different temporal dimensions. For instance, precipitation could be averaged daily, skin temperature weekly, and temperature profiles monthly. Constructs in traditional ER models are insufficient to capture this information. One of the extensions available in our semantic model is a construct for defining a complex class known as a Composite. A Composite relationship defines a new class that has other classes as its members. Members of the Composite class are members of some other class, taken as a whole. Composite classes are needed to define properties that describe a whole class [19]. This construct was also useful in representing the fact that Atmospheric Properties over a given region, such as energy balance, are determined by the profile measurements and surface conditions over that region.

The class STATIONS represents the set of all stations that measure and collect hydrology data.
Since this model was developed to capture information about the gauging and precipitation stations, STATIONS may be sub-classified as GAUGING stations and PRECIPITATION stations. Each type of station measures a particular type of data. Hence we indicate that gauging stations collect FLOW data (which includes Daily Mean Discharge data and Peak Flow data) while the precipitation stations measure precipitation data. The entity class DATA represents the composite collection of both types of data in addition to the atmospheric science data. The class STATIONS is identified by the attribute Station number and each station is associated with a specific hydrological unit code (HUC). HUCs are defined at four different levels. For convenience and ease of understanding we have named these levels Large Basins, Basins, Watersheds, and Subwatersheds. Large basins (level 2) represent a very large water drainage area like the Lower Colorado River Basin. Large


basins are in turn made up of a number of basins (level 4), which in turn are collections of watersheds (level 6). A watershed is a collection of sub-watersheds (level 8). A set of individual stations is associated with a sub-watershed. As stations may also be associated with counties (and states), this fact needs to be modeled as well. Also, gauging stations are located on rivers (or streams, creeks, washes, etc.). Thus we associate GAUGING STATIONS with the class WFLOW, representing all types of flowing bodies of water, by depicting a relationship between the two. To depict the relationships between different HUC levels we have employed a complex class construct called Grouping [19]. Grouping is another extension to the set of constructs available in the traditional ER model. Members of a grouping class are collections of members from some other classes. In a grouping class, each member entity class is a collection of members from some other single entity class called the base class or component class. Other standard constructs such as the relationship "loc_in" between STATIONS and COUNTIES depict how the various entity classes are related.
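The nesting of HUC levels that the Grouping construct captures can be illustrated with a short Python sketch. It assumes that the level numbers 2, 4, 6, and 8 in the text count the leading digits of an 8-digit hydrologic unit code, so that each coarser grouping is a prefix of the finer one. The station identifiers and code values below are hypothetical, and the helper names are not part of the system described here.

```python
from collections import defaultdict

# Assumption: each level is identified by this many leading digits of the HUC.
LEVELS = {
    "LARGE BASINS": 2,
    "BASINS": 4,
    "WATERSHEDS": 6,
    "SUBWATERSHEDS": 8,
}


def huc_ancestry(huc8):
    """Return the code of the enclosing unit at every level for one 8-digit HUC."""
    if len(huc8) != 8 or not huc8.isdigit():
        raise ValueError("expected an 8-digit hydrologic unit code")
    return {name: huc8[:digits] for name, digits in LEVELS.items()}


def group_stations(stations, level):
    """Group station numbers by the HUC prefix that identifies `level`."""
    digits = LEVELS[level]
    groups = defaultdict(list)
    for station_number, huc8 in stations.items():
        groups[huc8[:digits]].append(station_number)
    return dict(groups)


# Hypothetical stations mapped to hypothetical sub-watershed codes.
stations = {"S1": "15050301", "S2": "15050302", "S3": "15060101"}

by_basin = group_stations(stations, "BASINS")
```

Each value in `by_basin` is a collection of members drawn from the single base class of stations, which is the property the Grouping construct is meant to express.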

[Figure 1 diagram not reproduced; only its labels survive. Legible entity classes: SATELLITE PROBES, AIRCRAFT PROBES, PROBES, PROBE TYPES, FIELD PROGRAMS, REGIONS, COUNTRIES, COUNTIES, ATMOSPHERIC PROPERTIES, CLOUDS, CLOUD CHARACT., WIND, PROFILES, TEMPERATURE, HUMIDITY, SURFACE CONDITIONS, LAND, OCEAN, PRECIPITATION, DATA, STATIONS, GAUGING STATIONS, PRECIPTN. STATIONS, STREAM FLOW, DAILY MEAN DISCHARGE, PEAK FLOW, WFLOW, SUBWATERSHEDS, WATERSHEDS, BASINS, LARGE BASINS. Legible relationship labels: Identify, Determine, Measure, Measured_by, Associated_with, Measured_Over, Loc_in, Belong_to, Loc_on, Collect.]

Figure 1: Semantic Model for Atmospheric Science and Hydrology

3.2 Navigating the Information Space

Navigating the semantic model is essential for the user to explore the types of data available. It also helps non-experts refine their requirements should they start with no specific requirement in mind. Our information space is essentially a graph. A simple graph G = (V, E) is a representation that consists of a set of vertices V and a set of edges E that connect the vertices in set V. The edges are all binary, each edge connecting exactly a pair of vertices. If the edges have directions associated, the graph is called a directed graph. The semantic model described in this research (and other similar semantic models) can be represented as a directed graph G = (V, E) where V = {entity classes in the semantic model} and E = {the relationships in the semantic model}. These edges have directionality that captures the semantics of the information being modeled. Interaction relationships are bi-directional while subclass and complex class relationships are uni-directional, directed towards the subclass or complex classes representing the class hierarchy, dependency, and inheritance information. In any directed graph G = (V, E), a vertex Y ∈ V is reachable from vertex X ∈ V, if there is a directed path

from X to Y in G. This path traces a sequence of vertices and edges in G. These edges capture the semantic relationships between the vertices. This gives us a method for navigating the graph by tracing a chain of related entity classes and their information. To perform this navigation, we use a modified algorithm based on the depth first search (DFS) algorithm for directed graphs [9], [15]. DFS for directed graphs attempts to “mark” all the vertices in the graph. Our modified algorithm (called SNAFU - Selecting Nodes Accessible, For you) stops when the set of vertices, reachable from a starting vertex, is found. For example, in the semantic model shown in figure 1, if the class STATIONS is selected, the algorithm will traverse to GAUGING STATIONS or PRECIPITATION STATIONS or both. The algorithm would also be able to reach STREAM FLOW data (and/or PRECIPITATION data). The algorithm can further identify DAILY MEAN DISCHARGE data and/or PEAK FLOW data. In another direction, with the same starting point (i.e. STATIONS), the algorithm can reach SUBWATERSHEDS and through that eventually LARGE BASINS. Similarly, COUNTIES can be reached via the relationship between it and the entity class STATIONS and WFLOW can be reached via the


relationship between WFLOW and GAUGING STATIONS. Thus all the entity classes that are reachable from the starting entity class STATIONS will be identified by the algorithm.

We can prove that SNAFU ultimately reaches all the nodes reachable from a given starting node. SNAFU is a modification of the DFS for directed graphs; hence the proof given below is based on the proof for the DFS algorithm [15]. DFS for directed graphs examines every out-bound edge from a given node. If such edges exist, it traverses these edges, marking them as "used". If no such edges exist, the DFS backtracks through each previous stage (node) of the search, looking for "new" edges at each stage. When it has backtracked all the way to the starting point without finding any "new" edge, the DFS algorithm jumps to a new unmarked starting node and starts all over. The DFS component in SNAFU is modified to terminate at this point and does not start over. In addition, SNAFU utilizes the metadata from the semantic model during navigation.

Theorem: Given a directed graph G = (V, E), SNAFU identifies all the nodes accessible from a given starting point.

Proof: Let node n ∈ V be an unmarked node in G. Since the semantic graph G is connected, there must be an edge e ∈ E incident on n that connects n to the rest of G, (G – {n}), which includes the starting vertex. If e is an out-bound edge from n, n is not reachable from the rest of G using edge e and hence is of no interest to the algorithm (or to us). If e is inbound into n, then let m be the other end point of the edge, where m ∈ (G – {n}). If m is part of a directed path from the starting node, m is reachable and therefore marked. In that case, n cannot be an unmarked node. If m is not part of a directed path from the starting node, m is not reachable from the starting point. Since n is reached only from m using edge e, n is not reachable from the starting point either.

Navigating the model based on reachable vertices alone is not sufficient.
The problem arises because of the directionality associated with the relationships. For instance, if SATELLITE PROBES is selected as the starting node there are no reachable nodes. Therefore, to make the algorithm practical, we use some additional information (in the form of metadata) to identify that SATELLITE PROBES is a subclass of PROBES. Once this is known, the algorithm is able to expand the search by navigating through the semantic model. We demonstrate the use of this algorithm through examples in section 4.
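The navigation described in this section can be sketched in Python as a depth-first search that stops once the vertices reachable from the start have been marked, followed by a fallback that consults subclass metadata when nothing is reachable. This is only an approximation of the idea, not the SNAFU implementation: the function names, the adjacency-list layout, the `superclass_of` table, and the edge directions (a simplified guess at figure 1 in which every relationship is stored one way) are assumptions.

```python
def reachable(graph, start):
    """Depth-first search that marks only vertices reachable from `start`.

    Unlike a full DFS, it never restarts from a new unmarked vertex.
    """
    marked = set()
    stack = [start]
    while stack:
        vertex = stack.pop()
        if vertex in marked:
            continue
        marked.add(vertex)
        stack.extend(graph.get(vertex, ()))
    marked.discard(start)
    return marked


def navigate(graph, superclass_of, start):
    """Return reachable classes; if none, retry from the superclass in the metadata."""
    found = reachable(graph, start)
    current = start
    while not found and current in superclass_of:
        current = superclass_of[current]
        found = (reachable(graph, current) | {current}) - {start}
    return found


# Simplified, partly hypothetical fragment of the model as an adjacency list.
graph = {
    "STATIONS": ["GAUGING STATIONS", "PRECIPITATION STATIONS",
                 "SUBWATERSHEDS", "COUNTIES"],
    "GAUGING STATIONS": ["STREAM FLOW", "WFLOW"],
    "PRECIPITATION STATIONS": ["PRECIPITATION"],
    "STREAM FLOW": ["DAILY MEAN DISCHARGE", "PEAK FLOW"],
    "SUBWATERSHEDS": ["WATERSHEDS"],
    "WATERSHEDS": ["BASINS"],
    "BASINS": ["LARGE BASINS"],
    "PROBES": ["SATELLITE PROBES", "AIRCRAFT PROBES"],
}
# Metadata: subclass -> superclass (assumed representation).
superclass_of = {"SATELLITE PROBES": "PROBES", "AIRCRAFT PROBES": "PROBES"}

from_stations = navigate(graph, superclass_of, "STATIONS")
from_satellite = navigate(graph, superclass_of, "SATELLITE PROBES")
```

Starting from STATIONS, the plain search already reaches LARGE BASINS and PEAK FLOW. Starting from SATELLITE PROBES, the plain search finds nothing, and the metadata lookup lets the search continue from PROBES.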

4. Architecture and Use of the System

We have implemented a prototype software system, SEMQUEST, accessible through the Web to model and navigate the information space, and access the underlying data. This system is implemented using the C language on a UNIX platform. The components of SEMQUEST are described briefly followed by examples of how it is used.

4.1 Architecture of the System

The architecture of SEMQUEST is shown in figure 2. The information space manager is the module that serves to integrate the components. This module interacts with the graphical user interface to help the user design the information space model and query the data in local and remote data sources. The data repository includes a metadata repository and a data dictionary. The metadata associated with the semantic model plays an important role in retrieving data. For example, the attribute Station number associated with the entity class Station is the key for accessing information on stations. The class of Precipitation stations will have this attribute as well, owing to the fact that it is a specialization of the class Stations. These semantics, which are explicitly modeled in the semantic model, serve to facilitate retrieval of data associated with the objects in the semantic model.

To capture the metadata, we have designed a repository for the metadata using a state-of-the-art relational DBMS. It consists of a set of 22 relations designed to capture the metadata corresponding to the semantic model. The system is capable of designing and storing multiple projects, each identified by a unique system generated project number. The objects (entities, relationships) are associated with each project and are identified by an object handle and the project number. A sample subset of the relations in the repository, along with a brief description of each, is listed in table 1. The graphical query interface that displays the semantic model permits the user to view the metadata of any entity class if the user clicks the right mouse button on the entity class displayed.

To access the data through the information space, it is essential that we define an association between the metadata and the data. The actual data associated with the model is stored in different distributed locations. Some data is served on the Internet by systems described in section 2.
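The way inherited attributes such as Station number could be resolved from the metadata can be sketched as follows. The actual repository is a set of 22 relations in a relational DBMS whose schema is not given here, so this Python sketch substitutes an in-memory dictionary keyed by project number and object handle; the class `EntityClass`, the function `all_attributes`, and the attribute Frequency on precipitation stations are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class EntityClass:
    """One entity-class record, identified by project number and object handle."""
    project_number: int
    object_handle: str
    attributes: List[str] = field(default_factory=list)
    superclass: Optional[str] = None   # handle of the generalization, if any


def all_attributes(repository, project_number, handle):
    """Collect a class's own attributes plus those inherited from its superclasses."""
    collected = []
    while handle is not None:
        entity = repository[(project_number, handle)]
        collected = entity.attributes + collected   # inherited attributes come first
        handle = entity.superclass
    return collected


# Hypothetical contents for project 1.
records = [
    EntityClass(1, "STATIONS", ["Station number"]),
    EntityClass(1, "PRECIPITATION STATIONS", ["Frequency"], superclass="STATIONS"),
]
repository = {(r.project_number, r.object_handle): r for r in records}

attrs = all_attributes(repository, 1, "PRECIPITATION STATIONS")
```

Because PRECIPITATION STATIONS is a specialization of STATIONS, the lookup returns Station number along with the subclass's own attributes, so the same key can be used to retrieve data for either class.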
To link this data with the metadata we have developed a data dictionary. The data dictionary is a set of relations that store information on the attributes for which data is available, the datasets that contain this data, the properties (temporal and spatial dimensions) of the datasets, and the access methods. More details on the metadata repository and data dictionary are available in [22]. In some cases, the location information may also be stored as part of the dictionary. For example, the station number is an attribute of the entity class station and is therefore part of the metadata associated with the semantic model. The URL of the USGS web server that serves information on the stations is stored in the dictionary. The identifiers of the


specific station(s) that the users select from the query interface are parsed by the system and submitted to the appropriate web server.

The information space manager interacts with the metadata repository and the data dictionary. It is capable of retrieving data that is requested by the user(s) as well as updating information. Should a user choose to add a new object (entity class, relationship, and/or property) to the information space, he/she can enter the new information using an interface provided by our system. This information is then incorporated into the data repository and is available for access by all users. Thus the system is capable of dynamically managing the addition of new sources.
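The role of the data dictionary in routing a user's selections to the right source could be sketched as follows. This is only an illustration: the `DictionaryEntry` layout, the dataset name, the placeholder location `https://example.org/stations`, and the query parameter name `station` are all assumptions and do not describe the actual dictionary relations or any real server interface.

```python
from dataclasses import dataclass
from typing import Dict, List
from urllib.parse import urlencode


@dataclass(frozen=True)
class DictionaryEntry:
    """One data dictionary row (hypothetical layout)."""
    attribute: str      # attribute from the semantic model
    dataset: str        # dataset that holds values for the attribute
    access_method: str  # e.g. "http" for a remote web server
    location: str       # base URL or path of the data source


# Placeholder location; the real server address is not reproduced here.
DATA_DICTIONARY: Dict[str, DictionaryEntry] = {
    "Station_Number": DictionaryEntry(
        attribute="Station_Number",
        dataset="station_records",
        access_method="http",
        location="https://example.org/stations",
    ),
}


def build_request(attribute: str, selected_ids: List[str]) -> str:
    """Turn the user's selections into a request for the right source."""
    entry = DATA_DICTIONARY[attribute]
    if entry.access_method != "http":
        raise ValueError(f"unsupported access method: {entry.access_method}")
    # The query parameter name "station" is an assumption for illustration.
    query = urlencode({"station": ",".join(selected_ids)})
    return f"{entry.location}?{query}"


request_url = build_request("Station_Number", ["S001", "S002"])
```

Keeping the location in the dictionary rather than in the semantic model means a source can move without any change to the model that users navigate.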

4.2 Examples of System Use

We now present some examples to show how the system is used. Since the information space for our data is large and complex, in these examples we display only the relevant sections of the model. The web page that serves as the starting point of this system is shown in figure 3. Access to the system is currently protected by a password; however, any user can request one. Once the user has logged in, the design and query interface for modeling and accessing data is displayed along with a tool-bar that holds the modeling constructs. This interface can also be used to design or update the information space model. The modeling interface consists of two parts: the tool-bar that displays the constructs available in the semantic model, and a drawing area to define (display) the model.

The interface functions in two modes: the design mode and the query mode. In the design mode, the user can use the mouse to create/define the objects in the information space by dragging and dropping the appropriate constructs from the tool-bar onto the drawing area. As each semantic object is created, the system prompts the user for the information (metadata) associated with the object. A user may also edit an existing model using the menu options or add data sources to the system. When a model is displayed, the system is in the design mode (default). To use the data retrieval facility, the user must switch to the query mode. In the query mode, the user can navigate the information space to retrieve the underlying data. To query the data, users specify the information space they wish to have displayed using the menus above the drawing area. For example, the information space for hydrology data is shown in figure 4. Functionality, rather than performance, has been the focus in the development of this prototype system.

[Figure 2 diagram. Components shown: Graphical User Interface to Query and Design Model (with sample entities Ent1, Ent2, Ent3, relationships rel1, rel2, and subclasses Sub1, Sub2); INFORMATION SPACE MANAGER; Data Repository (Meta Data Repository and Data Dictionary); Local Databases (local access); Geographically Distributed Data Storages (access over the network and internet).]

Figure 2. Architecture of SEMQUEST

Table 1. Sample set of relations in the Metadata Repository

Relation Name    Stores metadata for
ADSUB            Attribute defined sub classes
ATTRIB           Attributes of all entity classes
ATTCONST         Inter-attribute constraints
ATTDEFCOMP       Attribute defined composite classes
DOMAIN           Domains of attributes
ENTCLASS         Master table of all entity classes
ENTREL           Interaction relationships including weak class relationships
ENUMCOMP         Composite classes defined based on enumeration by user(s)
GRPAGG           Grouping and Aggregate classes
PROJECT          Project information


Figures 3 and 4. Web Interface for SEMQUEST and Modeling Interface with the Hydrology model

Figures 5 and 6. Interfaces for Guiding User Query and Selecting Attributes for a Selected Entity

Example 1: Accessing Existing Data Servers for Hydrology Data (figures 4, 5, 6, 7, and 8). Consider a user who is interested in water discharge data measured in Navajo County in Arizona. USGS serves data on water flow and precipitation collected by the different stations. Users can go directly to the USGS web site (if they know about it) and access this data, provided they know the identity of specific stations. More often, users require information on a set of stations that may be located in a particular county, or along a particular river and its tributaries in a particular river basin. This information is not directly available through the USGS web site, and users end up wading through large datasets and collating information manually. Our system provides a more convenient, user-oriented way to access this data. By selecting the entity class COUNTIES, the user is provided with the query guidance shown in figure 5. This interface enables the user to navigate the model, taking advantage of the semantics of the information space. Should the user wish to retrieve all the stations in a particular county, he/she can select the attribute option in this interface and select the attributes from the list shown (figure 6). The system then retrieves the data associated with the selected attribute(s) and lists the counties (data for the attribute County_Name), from which the user can select one or more. The stations located in the selected county are then retrieved and displayed as shown in figure 7. Once the user chooses the station (by name or number) for which data is required, the system submits this query to the USGS web service and displays the returned data on the user's workstation (figure 8).

Example 2: Re-Using Software/Data Modules for Atmospheric Science Data (figures 9 through 15). Users develop a number of software modules for partitioning large datasets, analyzing data, and visualizing sample data. These modules and their locations are stored in the data dictionary, enabling other users to share these modules and information. Figure 9 displays a subset of the information space for atmospheric science data.
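The three navigation steps of Example 1 (list the counties, list the stations in the selected county, pick a station to query) could be sketched in Python as shown here. The station records, station names, and identifiers are invented for illustration, and the in-memory list merely stands in for data that the system retrieves through the information space; only the attribute name County_Name comes from the text.

```python
from typing import Dict, List

# Hypothetical sample data; station names and numbers are invented.
STATIONS: List[Dict[str, str]] = [
    {"Station_Number": "S001", "Station_Name": "Upper Creek", "County_Name": "Navajo"},
    {"Station_Number": "S002", "Station_Name": "Lower Creek", "County_Name": "Navajo"},
    {"Station_Number": "S003", "Station_Name": "East Fork", "County_Name": "Apache"},
]


def list_counties(stations: List[Dict[str, str]]) -> List[str]:
    """Step 1: values of County_Name offered to the user."""
    return sorted({s["County_Name"] for s in stations})


def stations_in(stations: List[Dict[str, str]], counties: List[str]) -> List[Dict[str, str]]:
    """Step 2: stations located in the selected county or counties."""
    wanted = set(counties)
    return [s for s in stations if s["County_Name"] in wanted]


def select_station(candidates: List[Dict[str, str]], name_or_number: str) -> str:
    """Step 3: pick a station by name or number and return its number,
    which is what would be submitted to the remote data server."""
    for s in candidates:
        if name_or_number in (s["Station_Number"], s["Station_Name"]):
            return s["Station_Number"]
    raise LookupError(f"no station matches {name_or_number!r}")


counties = list_counties(STATIONS)
navajo = stations_in(STATIONS, ["Navajo"])
chosen = select_station(navajo, "Lower Creek")
```

The point of the sketch is that the user never needs to know a station identifier in advance; it is derived by navigating from the county.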


Figures 7 and 8. Interfaces showing list of Stations in a Selected County, and USGS data on a Selected Station

Figures 9 and 10. Subset of Semantic Model for Atmospheric Data and Interface for Selecting Attributes

Consider a user who wants to know which datasets contain information on a particular property of land surfaces, say skin temperature. By following a procedure similar to the one described in example 1, illustrated in figures 10, 11, and 12, the user can view the list of datasets that meet his/her criteria. The interface shown in figure 12 provides the user with options to identify the location of datasets, download a dataset to the local machine, run programs that can partition, analyze, or visualize the data in these datasets, and view remote datasets. If the user wants to visualize data in a particular dataset, he/she can select the dataset and view the list of available software modules that can run against it (shown in figure 13). Almost all of these software modules are designed to accept parameters such as time period ranges and latitude/longitude co-ordinates. When a particular module is selected for execution, the system prompts the user for these parameters (sample shown in figure 14). SEMQUEST then executes the module and displays the results textually or in graphical form, as shown in figure 15.
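A minimal sketch of the module re-use described in Example 2 is given below: a registry lists the modules that can run against a dataset, the parameters a user is prompted for are validated, and the chosen module is executed. The registry layout, the dataset name `skin_temperature_sample`, the `partition` function, and the record fields are hypothetical and stand in for the shared modules stored in the data dictionary.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Record = Dict[str, float]


@dataclass
class Parameters:
    """Values the user is prompted for before a module runs."""
    years: Tuple[int, int]    # inclusive time period range
    lat: Tuple[float, float]  # inclusive latitude range
    lon: Tuple[float, float]  # inclusive longitude range

    def validate(self) -> None:
        for low, high in (self.years, self.lat, self.lon):
            if low > high:
                raise ValueError("range start must not exceed range end")


def partition(records: List[Record], p: Parameters) -> List[Record]:
    """Keep only records inside the requested time window and bounding box."""
    return [
        r for r in records
        if p.years[0] <= r["year"] <= p.years[1]
        and p.lat[0] <= r["lat"] <= p.lat[1]
        and p.lon[0] <= r["lon"] <= p.lon[1]
    ]


Module = Callable[[List[Record], Parameters], List[Record]]

# Registry: dataset name -> modules that can run against it (hypothetical names).
MODULES: Dict[str, Dict[str, Module]] = {
    "skin_temperature_sample": {"partition": partition},
}


def run_module(dataset: str, module: str, records: List[Record], p: Parameters) -> List[Record]:
    """Validate the user's parameters, then run the selected module."""
    p.validate()
    return MODULES[dataset][module](records, p)


# Invented sample records for illustration only.
sample = [
    {"year": 1995, "lat": 32.2, "lon": -110.9, "skin_temp": 301.5},
    {"year": 1997, "lat": 45.0, "lon": -93.0, "skin_temp": 280.1},
]
params = Parameters(years=(1994, 1996), lat=(30.0, 35.0), lon=(-115.0, -105.0))
result = run_module("skin_temperature_sample", "partition", sample, params)
```

Listing the modules per dataset is one way to answer the question "which programs can run against this dataset" without inspecting the programs themselves.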

5. Discussion and Conclusion

We have developed a system called SEMQUEST for data retrieval from independent and distributed sources of data using a semantic information space. Our system addresses problems in locating, understanding, and retrieving large amounts of heterogeneous data using a visual interface and graph-based algorithms. There are several issues that need further investigation. To use this system for updating metadata, users have to be knowledgeable in semantic modeling and in working with databases. At this time, these operations can only be performed by database administrators (DBAs). All research centers, including the ones we have been involved with in this research, have people to support their computer and data operations. These individuals, with some training in semantic data modeling, will be capable of supporting the system. Developing a model of the information space


requires a considerable amount of initial, one-time effort to understand and define the model as well as to populate the data repository. Subsequent changes and updates are incremental and can be performed with a fair amount of ease.

Currently our system can be accessed over the web using X-terminals or computers that can emulate an X-terminal. This is because X-Window based systems are the standard environment for the researchers in the atmospheric sciences and hydrology at the University of Arizona. Having the modeling and data retrieval tool along with the data repository on the same server makes this system fairly server intensive, because a large amount of processing is performed on the server side. An effort is underway to re-develop this system using Java. This will allocate a part of the processing load to the client system, thus decreasing the load on the server.

An important research issue that was identified in the process of constructing the information space model described here is that of "schema evolution". The structure of the information space model is changing constantly. Each time a new satellite is launched, new types of data become available, and this needs to be reflected in the model. With improving technology and renewed interest in global climate research, information about the earth's atmosphere and surface is frequently updated as new relationships are discovered. Such information should be made available to all users immediately. Support for dynamic evolution of the information model is therefore an important issue. We are currently investigating this issue to define techniques for automating the evolution process. More details on schema evolution can be found in [21].

Improving the graphical user interface with an icon-based representation was a suggestion highlighted during our interaction with the scientific research community. We are currently working on incorporating this into our system interface. Tailoring the display to accommodate different types of users is also being investigated. For example, the entity classes that represent the different water collection areas are shown as sub-watersheds, watersheds, basins, and river basins in accordance with one set of users. Other users understand the terms level-8, level-6, level-4, and level-2 watersheds for the same set of entity classes. We are exploring the use of ontologies and dictionaries to resolve this issue.
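One simple way a dictionary could tailor displayed names to different user groups is sketched below. It is not the approach under investigation, only an illustration of the idea. The entity class identifiers (`WCA_1` to `WCA_4`) and vocabulary names are hypothetical, and the pairing of terms (sub-watershed with level-8, and so on) assumes that the two lists in the text are given in corresponding order.

```python
from typing import Dict

# Entity class identifiers are hypothetical. The pairing of terms follows the
# order in which the two vocabularies are listed in the text (an assumption).
VOCABULARIES: Dict[str, Dict[str, str]] = {
    "descriptive": {
        "WCA_1": "sub-watershed",
        "WCA_2": "watershed",
        "WCA_3": "basin",
        "WCA_4": "river basin",
    },
    "level": {
        "WCA_1": "level-8 watershed",
        "WCA_2": "level-6 watershed",
        "WCA_3": "level-4 watershed",
        "WCA_4": "level-2 watershed",
    },
}


def display_name(entity_class: str, vocabulary: str) -> str:
    """Label to show for an entity class under a user's preferred vocabulary."""
    return VOCABULARIES[vocabulary][entity_class]


def translate(term: str, source: str, target: str) -> str:
    """Map a term from one vocabulary to the equivalent term in another."""
    for entity_class, label in VOCABULARIES[source].items():
        if label == term:
            return VOCABULARIES[target][entity_class]
    raise KeyError(term)


label_for_hydrologist = display_name("WCA_1", "level")
equivalent = translate("river basin", "descriptive", "level")
```

Because both vocabularies are keyed by the same entity class, the underlying model stays unchanged and only the labels differ per user group.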

Figures 11 and 12. Interfaces for Selecting Datasets Associated with Attributes and Accessing Dataset Information

Figures 13, 14, and 15. Accessing Programs, Passing Parameters to Programs, and Results from Program Execution


REFERENCES

1. Batini, C., Lenzerini, M., & Navathe, S. - A Comparative Analysis of Methodologies For Database Schema Integration, ACM Computing Surveys, vol. 18, no. 4, 1986, pp. 323-364.
2. Bryce, D. & Hull, R. - SNAP: A Graphics Based Schema Manager, Proceedings of the 2nd IEEE International Conference on Data Engineering, IEEE, New York, 1986, pp. 151-164.
3. Catarci, T., Chang, S., Costabile, M. F., Levialdi, S. & Santucci, G. - A Graph-Based Framework for Multiparadigmatic Visual Access to Databases, IEEE Transactions on Knowledge and Data Engineering, vol. 8, no. 3, June 1996, pp. 455-475.
4. Chen, P. P. - The Entity Relationship Model - Towards a Unified View of Data, ACM Trans. on Database Systems, vol. 1, no. 1, March 1976, pp. 9-36.
5. Climate Visualization System at www.ncdc.noaa.gov/onlineprod, 1996.
6. Chung, C. - DATAPLEX: An Access to Heterogeneous Distributed Databases, Communications of the ACM, vol. 33, no. 1, 1990, pp. 70-80.
7. Cruz, I. F., Averbuch, M., Lucas, W. T., Radzyminski, M. & Zhang, K. - Delaunay: A Database Visualization System, Proceedings of the ACM SIGMOD International Conference on Management of Data (ed. Joan M. Peckham), June 1997, pp. 510-513.
8. EOSDIS Information Management System Manual, prepared by Hughes STX Corporation for EOSDIS IMS, NASA, 1994.
9. Even, Shimon - Graph Algorithms, Computer Software Engineering Series, Computer Science Press, Potomac, Maryland, 1979.
10. Goldman, K. T., Goldman, S. A., Kanellakis, P. C. & Zdonik, S. B. - ISIS: Interface for a Semantic Information System, Proceedings of the ACM SIGMOD International Conference on the Management of Data, New York, 1995.
11. Heiler, S. & Rosenthal, A. - G-WHIZ, a Visual Interface for the Functional Model with Recursion, Proceedings of the VLDB Conference, Stockholm, 1985, pp. 209-218.
12. Hull, R. - Managing Semantic Heterogeneity in Databases: A Theoretical Perspective, Proceedings of ACM SIGMOD Conference on Principles of Database Systems (PODS'97), Tucson, Arizona, 1997, pp. 51-61.
13. Kerschberg, L. & Baum, R. - KORTEX: An Expert Database System Shell for a Knowledge-Based Entity Relationship Model, Proceedings of the 8th International Conference on ER-Approach (ER Approach to Database Design and Querying), Ed. Lochovsky, L., 1990, pp. 255-268.
14. Livny, M., Ramakrishnan, R., Beyer, K., Chen, G., Donjerkovic, D., Lawande, S., Myllymaki, J. & Wenger, K. - DEVise: Integrated Querying and Visual Exploration of Large Datasets, Proceedings of the ACM SIGMOD International Conference on Management of Data (ed. Joan M. Peckham), June 1997, pp. 301-312.
15. Manber, Udi - Introduction to Algorithms - A Creative Approach, Addison Wesley Publishers, Reading, Mass., 1989.
16. Merz, U. & King, R. - DIRECT: A Query Facility for Multiple Databases, ACM Transactions on Information Systems, vol. 12, no. 4, 1994, pp. 339-359.
17. Miranda, S. & Nsonde, J. - LAGRIF: A Pictorial Non-Programmer Oriented Request Language for a Relational Database Management System, Proceedings of the International Conference on Databases - Improving Database Usability and Responsiveness, Jerusalem, 1982, pp. 173-204.
18. Ram, S. & Storey, V. - Grouping and Composite; Extending the Realm of Semantic Modeling, Proc. of the 26th Hawaii International Conference on System Sciences - HICSS 26, Maui, Jan. 1993, pp. 212-218.
19. Ram, S. - Intelligent Database Design Using the Unifying Semantic Model, Information and Management 29, 1995, pp. 191-206.
20. Ram, S. & Ramesh, V. - Collaborative Conceptual Schema Design - A Process Model and a Prototype System, forthcoming in ACM Transactions on Office Information Systems.
21. Ram, S. & Shankarnarayan, G. - Managing Schema Evolution: Taxonomy and Operations using a Semantic Model, Working paper, Dept. of Management Information Systems, Univ. of Arizona, Tucson, AZ, 1998.
22. Ram, S. & Shankarnarayan, G. - Data Repository for the USM, Report, Department of Management Information Systems, Univ. of Arizona, Tucson, AZ, 1998.
23. Schneider, M. & Trepid, C. - A Graphical Query Language Based on an Extended ER Model, Proceedings of the 8th International Conference on ER-Approach (ER Approach to Database Design and Querying), Ed. Lochovsky, L., 1990, pp. 403-417.
24. Spoerri, Anselm - Visual Tools for Information Retrieval, Proceedings of the IEEE Symposium on Visual Languages, 1993, IEEE Computer Society Press, Los Alamitos, CA, pp. 160-168.
25. Tjan, B. S., Breslow, L., Dogru, S., Rajan V., Reick, K., Slagle, J. R. and Poliac, M. O. - A Data-Flow Graphical User Interface for Querying a Scientific Database, Proceedings of the IEEE Symposium on Visual Languages, 1993, IEEE Computer Society Press, Los Alamitos, CA, pp. 49-54.
26. United States Geological Survey Water Resources Information at waterdata.usgs.gov/nwis-w, 1997.
