Integrating Diverse Information Repositories: A Distributed ... - CiteSeerX

Integrating Diverse Information Repositories: A Distributed Hypertext Approach 1 John Noll and Walt Scacchi University of Southern California Los Angeles, CA 90089-1421 USA (213) 740-4782, (213) 740-8494 Fax (jnoll, scacchi)@pollux.usc.edu

Abstract

Today's networked software engineering environment is characterized by a multitude of autonomous, heterogeneous information repositories, a variety of incompatible user interfaces, diverse, unconventional data types, including text, graphics, and possibly video and sound, rapid change, both in structure and content, and multiple ways of viewing relationships among the same information items. Existing information storage mechanisms fail to combine diverse data types/models, complex objects and storage structures, personal views and organizations of shared objects, access to distributed, heterogeneous repositories, and ease of evolution. This paper examines these issues and describes a Distributed Hypertext architecture that provides transparent access to autonomous, heterogeneous software object repositories, resulting in both a powerful organizational tool and a simple yet eective integration mechanism. Keywords: Hypertext, Distributed Systems, Heterogeneous Distributed Object Management, Distributed Software Engineering Environments, Distributed Hypertext

1

Revised version appears in IEEE COMPUTER, Vol.24(12), December 1991, pp.38-45.

1 Introduction This paper addresses the problem of integration: how to provide transparent access to heterogeneous information repositories, while maintaining their autonomy. In today's networked computing environment, powerful workstations enable a variety of sophisticated, graphics oriented applications; local area networks (LANs) connect these workstations to le servers; these LANs are themselves linked by wide area networks, enabling access to information around the globe. This abundance leads to a diversity of access protocols, storage managers, data formats, and user interfaces, rendering much data inaccessible to any single user simply because there are too many things to know before one can even begin. Furthermore, nding an item once is not necessarily the same as nding it again. The original discovery process may be lengthy and involved, or even accidental; once discovered, interesting objects should be organized so they can be easily retrieved again. The key problems we address are how to support multiple, heterogeneous repositories, each under separate and autonomous administration; a variety of incompatible interfaces; diverse, unconventional data types; and dierent ways of viewing relationships among the same information items. In this paper, we present a radically dierent solution to these problems. Based on our Distributed Hypertext (DHT) Architecture, our solution combines transparent access to autonomous, heterogeneous information repositories with a powerful,

exible organization technique. This is accomplished without change to the structure or content of participating repositories.

2 Related work To what extent do existing systems address this problem? They can be broadly classi ed into two approaches: distributed lesystems and heterogeneous databases. Distributed le systems provide access to les distributed among a network of le servers, which may be of dierent architectures. The resulting le space can be extremely large. A problem with le systems is that a le's name must serve several functions: it indicates the location of a le in the le system hierarchy; it expresses the le's purpose; it often contains an indication of the le's type in the form of le name extensions; and it may include some notion of the relationship of a le to other les in the same directory. This overloading can lead to complex, cumbersome lenames that are dicult to manage and evolve. Attributed le systems [10] attempt to solve this problem by attaching attributes to les; these enable additional information, such as type, creation and access methods, relationships, etc. to be expressed. In addition, they provide a way of locating relevant les by associative search against attribute values. File systems share the common drawback that relationships among objects can only be expressed by grouping them into directories, or, in the case of attribute based systems, by attaching attributes that point to other les. Directories can only express membership in a set; attributes force relationships that are really separate entities to be expressed as part of one or more other objects. Database management systems address this relationship problem by providing primitive to model relationships explicitly. Heterogeneous database systems add the ability to integrate 1

many separate databases into logically uni ed worlds. This can result in a single global schema, or many application speci c schemas, as in federated architectures [5]. The chief problem with heterogeneous database systems is how to provide transactions while preserving autonomy. A transaction is typically controlled by a transaction manager. If the transaction involves objects from more than one database, then some or all participating databases become subordinate to the transaction manager, violating their autonomy. Summarizing, we notice that le systems oer exibility and ease of use, but organization is limited to hierarchies of directories. Databases oer many high level features, including powerful organization primitives and multiple views, but must sacri ce autonomy for transactions. Yet, there appears to be a gap between the two; this gap can be lled by DHT, as we describe next.

3 Distributed Hypertext: A solution Hypertext is a simple concept for organizing and viewing information. Chunks of text are stored in objects called nodes, which may be individual les, entries in a database, or possibly generated on demand at each access. These nodes may be logically connected with named relationships called links. Links may represent such concepts as semantic relationships between nodes, a logical progression from one node to another, citations in an article, or cross references. They may be anchored, in which case the endpoints of the link are represented by an icon or other indicator in the contents of the linked nodes (see Figure 2). Usually, the links are one-way; the resulting structure forms a directed graph called a corpus. Users navigate through the corpus by following links from node to node. Our integration solution is based on the hybrid nature of hypertext. Conklin describes the \essence" of Hypertext as a combination of three components: a \database method", a particular way of accessing information by following links to information nodes; a \representation scheme", a technique for structuring information, and an \interface modality", a style of user interaction based on direct manipulation of link \buttons" [2][p.33]. We exploit these three features of hypertext to achieve integration while preserving autonomy. Speci cally, we address the following issues. Organization and Flexibility. Hypertext oers a simple, natural method for organizing textual data; it can be extended to handle multiple media types (i.e., Hypermedia2). User interaction is simple and straightforward, based on a common set of commands applying to all object types. Yet, the data elements can be organized into complex structures by linking them together. Furthermore, data elements typically have attributes, so the hypertext can be searched by querying against node attributes, links, or contents. Transparency. Schatz describes three types of transparency that must be provided by an information environment: type transparency, the ability to manipulate an object using the same interaction technique independent of its type; location transparency, whereby an object should be accessible in the same fashion whether local or remote; and scale transparency, which requires objects to behave the same whether there are 100 or 100,000 objects in the system [8]. In a heterogeneous environment, source transparency must also be addressed; objects should be accessed uniformly regardless of the type of repository that manages them. 2

Henceforth, we will use the terms hypertext and hypermedia interchangeably.

2

Hypermedia systems have demonstrated type transparency by presenting seamless access to diverse media types, including text, graphics, sound, and images. Distributed hypertext systems have demonstrated location transparency (see Section 4). It remains to be seen whether hypertext can achieve scale transparency in a very large system containing hundreds of thousands of objects. Our approach is discussed in Section 5. Autonomy. Hypertext transactions are simpler than those of databases. A hypertext transaction typically involves reading or editing a node, or traversing a link. These operations are performed on a single object: a node or a link. Thus, managing any single transaction is the responsibly of a single repository. This makes hypertext ideally suited to integrating autonomous information sources, since transactions do not cross administrative boundaries and thus can be supported without global control. Hypertext nodes and links can be combined to form complex structures. This capability can be exploited to accommodate design autonomy, whereby local databases preserve the ability to maintain their structure and content to support existing applications. By providing dynamic gateway processes, local objects can be transformed into hypertext structures, and hypertext operations can be transformed into local queries. The underlying database structure remains unchanged.

4 Sidebar: Survey of Distributed Hypertext Systems Most hypertext systems are single-user, centralized designs [2]. There are several examples, however, of distributed hypertext systems. Most of these employ a single server to store the links and nodes for a hypertext; client programs running on workstations connected by a local-area network access this server to manipulate portions of the hypertext. Examples include Neptune [3], which stores nodes and links in a single database, and Intermedia [11], DIF [4], and KMS [1], which store links in a database and nodes in les. Multiple server designs allow nodes to reside on several dierent server machines, and links to be made between such nodes. This category includes Sun's Link Service [7], Distributed DIF, and VNS [9], which store links and node locations in a database, and PlaneText [2], which stores both links and nodes in Unix les. The ultimate category includes systems like Telesophy [8], and Xanadu [6]. These are intended to link the knowledge base of a large organization, community, or nation.

5 An Architecture for Distributed Hypertext To implement the solution outlined above, we begin with a DHT architecture to serve as the infrastructure for distributed information management. This architecture is based on a client-server model, and includes the following four components: A Common Hypertext Data Model. From the client point of view, all objects in the information space are described in terms of a common model, that de nes both the structure of information and the operations allowed on objects. The data model has three basic object components: nodes, the container objects; links, representing relationships between nodes or sections within nodes; and attributes of links or nodes, which serve to identify various properties of objects. 3

Client applications

LAN Gateway

Software Archive

LAN Gateway

OODB

Bibliography Database

Gateway

Gateway

Ingres RDB

Wide Area Network in-memory database

file system

Name Server

LAN Gateway

Gateway

Client applications

Link Server

B-tree

B-tree

Annotation Server

Figure 1: Components of the Distributed Hypertext Architecture A set of operations on these components de nes how clients access nodes and links. These operations are: Create an object; Update the contents or attributes of a node, or the endpoints or attributes of links; Delete nodes and links; and Traverse a link from one node to another. A Communication Protocol. All components communicate via a common application protocol, which implements the operations de ned by the data model and provides a mechanism for moving objects between clients and servers. Servers. The content and structure of the hypertext are managed by servers. Servers have two components: a gateway process that transforms hypertext operations into local access operations, and local information objects into hypertext nodes or links; and an information repository that contains the information to be accessed. These information repositories are the entities to be integrated; they may be le systems, databases, or special purpose storage managers. From the repository point of view, the gateway process appears to be another local application that accesses the repository's data. In this manner, existing databases can be incorporated into the DHT architecture without having to copy their data or modify their schemas. Also, existing applications continue to function as before. Client Applications. Clients implement applications and speci c styles of user interaction; they serve as a user interface to the primitive operations provided by the hypertext data model. These components are shown in Figure 1.

5.1 Summary

The architecture described above solves the integration problem by implementing hypertext in a distributed, heterogeneous environment. By doing so, we take advantage of the transparency, organization, and exibility inherent to hypertext, while exploiting the hypertext data access modality to preserve repository autonomy. Heterogeneity and Integration. Integration is accomplished by the common data model, which allows diverse information from diverse sources to be interpreted in a uniform manner by client applications. One of the key advantages to this integration strategy is that 4

a source of information can participate without having to coordinate with other sources. This is because of the nature of hypertext nodes as separate and distinct entities, and the representation of relations via links that are separate from nodes. The result is that integrating a new repository is only a task of translating objects into the common data model, and implementing a server to accept and process protocol messages. Adding a new server does not aect existing nodes, links, or servers. Transparency. The hypertext data model describes the structure of objects and the operations on objects; the communication protocol implements these operations. Therefore, any client can manipulate any object using the same set of operations, regardless of source or type. Transactions and Autonomy. Repositories maintain complete control over the type and number of objects exported, who may access them, and what operations may be performed. Only those objects exported are known outside of the repository; there is no requirement to export a repository's entire schema. Also, there is no need for a particular server to provide write access to its database, either to users or system functions. Furthermore, existing applications access the database in the same way, because the server handles all translations between local and common object descriptions. Transactions are restricted to read or update operations on a single node or link, and create operations that invoke a server's constructor procedure. Therefore, there is no need for a global transaction manager that might infringe upon local control and authority, since any transaction is managed solely by the aected server. The result of this approach is to render a complex, evolving object space comprehensible by applying a simple set of structuring operations.

6 Implementation This section describes how the Distributed Hypertext System (DHT) works in our prototype implementation. We start with an overview of the implementation, then proceed to a discussion of how hypertext functions are carried out from the client and server point of view.

6.1 System Overview

The goal of our prototype system is to demonstrate how existing heterogeneous repositories can be integrated using the DHT architecture. Therefore, we built several servers to provide access to a variety of repositories ranging from simple hash tables to objectoriented databases. Speci cally, servers were constructed to access an existing software project database implemented using the Ingres relational database management system; an object-oriented database containing bibliographic entries; a name server to managing addresses of the other servers in the system; and the Unix le system. Additionally, we implemented link and annotation servers using a B-tree storage manager. The link and annotation servers store annotations linked to other objects, as well as user's personal links. These servers enable user's to attach comments to other nodes in the system, and build personal link networks. We have several clients that provide users with access to the servers listed above, including a hypertext browser providing a means for navigating links, 5

viewing and editing nodes, and creating annotations to nodes, and several shell commands for creating and retrieving nodes from speci c servers. See Figure 1 for an overview of these components, and the accompanying sidebar for details on implementing a DHT server.

6.2 Using the System

User's engage in three basic activities: browsing links, viewing and editing nodes, and creating new objects. These tasks are carried out using a window-oriented browser/editor. Servers support these activities by exporting data objects, providing type descriptions of those objects, and creating new objects in their storage managers in response to user's (or client's) requests. We will examine each of these activities below.

6.2.1 Browsing

Browsing involves repeated execution of the following steps by a client process: 1. Locate appropriate link servers. As discussed previously, links are not necessarily stored as part of the nodes that they link. Therefore, a client managing the browsing process must rst determine what server to contact to retrieve the list of links originating at a node. This server may be speci ed with each command, or in a default con guration le. By convention, each user has a default con guration that speci es, among other things, where to look for links. 2. Retrieve the list of links associated with a node. Once the link server has been speci ed, the client simply sends a message to the server requesting the list of links emanating from the desired node. 3. Filter the list according to attributes speci ed by the user. Applications and users may be interested in only a subset of links associated with a node. Therefore, a user may specify a lter to apply to the list of links. 4. Present the resulting links to the user. This step is application dependent as well as link dependent. Some links may simply represent a relationship between two nodes; others are \anchored", representing a relationship between a subcomponent or section of one node to one in another node. The latter style of link is best presented by emphasizing the link's \anchor" in the displayed node (see Figure reflink-example, where anchors are shown in boldface type), while the former can be selected from a menu-style list. 5. Retrieve and display the endpoint of a chosen link. Once the links have been presented, either as highlighted anchors or menu items, the user can select one for traversal. This step can take place only after a client determines that it has sucient local resources to display a node. This is because certain node types require special hardware or software; a video object, for example, requires special playback and display hardware whereas text can be shown on any terminal. To facilitate this step, object identi ers contain a type eld, so clients know an object's type before retrieving the whole object. Once it determines that it can present the contents of a node, the client process sends a message to the server that stores the node, requesting that the it be returned.

6

Figure 2: Displaying Links

6.2.2 Creating New Objects

Object (node or link) creation is complicated by the need for servers to have absolute control over the creation process in order to ensure consistency of the local repository. This means that users cannot arbitrarily create nodes or links at any server. Rather, to allow servers to control access and meet repository consistency constraints, we provide object \constructors", local procedures that create complex objects from client-supplied arguments. Constructor arguments are speci ed in object type descriptions. When a user requests to create an object, he must specify the type of object to create. The client then retrieves the appropriate type description, and interprets the constructor argument speci cation. The user is then asked to supply values for each of the constructor arguments. Figure 3 shows a type descriptions with constructor arguments.

7 Sidebar: Implementing a server Servers must perform three tasks to support browsing: export data objects, provide type descriptions, and execute object constructors. Here we describe how these tasks were implemented for our Software Archive server. The server provides access to an existing Ingres database containing components of software projects including speci cations, manuals, and source code. This database has been used to archive student projects collected over the past three years. The archive is partitioned into more than 80 Projects, each of which is then further partioned into seven subsections called Forms. Each Form corresponds to a phase in the software lifecycle and contains numerous Modules that are the actual contents of a project. The database schema comprises two relations: a Projects relation which stores the project's name, le system location, and login names of users having access to the project; and a Modules relation which stores information about each module in a project. These 7

Relation: column name project form name instance type heading author status

Modules type length c 20 c 32 c 12 integer 4 c 12 c 64 c 16 integer 4

defclass Module {DHT Node} {attributes {{name type:string {instance type:int {type type:string {heading type:string {status type:int {author type:string {constructor {{name type:string {instance type:int {heading type:string {contents type:string {parent type:oid

Relation: column name name home

Projects type length c 20 text 1024

Relation: column name project name

Engineers type length c 20 c 16

defclass Project {DHT Composite} {attributes {{name type:string {home type:string {engineers type:string {constructor {{name type:string {home type:string {engineers type:string defclass Form {DHT Composite} {attributes {{name type:string {constructor {{}}}

card: card: card: card: card: card:

1 1 1 1 1 1

desc: desc: desc: desc: desc: desc:

"module name"} "instance number"} "module type"} "module description"} "module status"} "module author"}} }

card: card: card: card: card:

1 1 1 1 1

desc: desc: desc: desc: desc:

"module name"} "instance number"} "module description"} "module contents"} "module parent"}} }

card: 1 desc: "project name"} card: 1 desc: "root directory"} card: * desc: "project developers"}} } card: 1 desc: "project name"} card: 1 desc: "root directory"} card: * desc: "project developers"}} }

card: 1 desc: "Form name"}}}

Figure 3: Software Archive Relations and their associated DHT type de nitions.

8

relations are depicted in Figure 3. This database, though simple, presents a number of interesting problems to an integrator. All projects have the same Forms, and each Form has a set of \core" modules that must always be present. This core set may be augmented by additional modules at the discretion of project team members. Any module added to the database must be associated with a Project and Form within that project. These constraints can cause update anomalies if users are allowed unrestricted authority to create views of the database. DHT solves this problem through object constructors that create components of complex objects in a locally consistent manner. The Module constructor, for example, requires that the client specify the \parent" Form that is to contain the Module, as well as values for some of the Modules attributes. The constructor procedure supplies values for the remaining attributes (status, type, and author), and determines appropriate values for the project and form els of the Modules relation from the parent argument supplied by the client. The resulting values are appended to the Modules relation. If the parent object does not exist, however, or the user is not assigned to the project, the creation fails and an error message is returned. The process of creating a server for this database involved four steps: 1. Deciding the types of nodes and links to be exported, and writing the type descriptions. 2. Determining which operations to support. 3. Writing SQL queries to implement these operations, which will be called from the server process. These include the constructor queries to create new projects and modules. 4. Linking these components with the server library, and registering the new server with the DHT nameserver. There are three basic types to export from the archive database: Projects, Forms, and Modules. The rst two are directory objects that serve as the source for links to their children. Modules are container objects that have editable contents. The type descriptions for each of these are shown in Figure 3. The archive server supports the full range of DHT read operations, and creation of Project and Module objects. Queries implementing these operations must therefore map DHT concepts and objects into SQL queries against record attributes. For example, a \get links" request for a form object is translated into an SQL query that retrieves all modules that are part of the the speci ed object: SELECT DISTINCT project, form, name, instance FROM modules WHERE project =:_project AND form =:_form;

Of particular interest is the implementation of the Module constructor. From the DHT perspective, this operation creates a new node, and links it to the speci ed form. From the database perspective, it adds a new record to the Modules table, with the parent eld set to the speci ed form. These queries are embedded into a set of C language functions that are linked into a server template provided as part of the DHT library. Once this process is complete, the 9

new server is ready to run; the last step is to add its name to the nameserver so that it can register its address upon startup. The entire development process was completed in two days, and comprises about 300 lines of C and SQL code.

8 Conclusion We began this paper with a statement of four requirements for information integration: transparency, autonomy, exibility, and organization. Our Distributed Hypertext architecture meets these requirements by extending hypertext to a distributed, heterogeneous environment. We chose hypertext as a solution strategy because of the its multifaceted nature: it combines a user interaction technique, a data representation method, and a data storage mechanism [2]. The user interaction facet provides transparency; users interact with information by creating nodes and links and by browsing the resulting linked information space. Browsing is conducted using a common set of tools regardless of the types of nodes contained in the information space. The data representation facet provides the organization and exibility to construct multiple views and complex structures by linking nodes. This simple organization technique is both powerful and extremely exible; views can be changed or removed entirely simply by adding or removing links. Yet the linked nodes remain unaected by such changes. Contrast this with database management systems, in which dierent views can produce critical update problems, and changes in schema structure can aect numerous data instances. Finally, the data storage facet provides the key to implementing an integrated hypertext in an autonomous and heterogeneous environment. This is because transactions under the hypertext data storage model can be supported without compromising the autonomy of participating information repositories. Furthermore, a hypertext storage mechanism can manage diverse data types and can be implemented on a variety of storage managers. Thus, the hypertext based approach results in two gains: a simple integration strategy that preserves repository autonomy, and a powerful organizational tool that allows many users and applications to construct personal views of shared objects. The chief weakness of our solution is the necessity to implement a new gateway for each new repository; this is a weakness we share with other integration strategies. However, our experience has been that new gateways can be implemented in as little time as one day, so this is less of a drawback than it might appear. Our future work focuses reusable mechanisms for integrating new servers as well as continuing to gather experience with the prototype system. We also plan to implement more sophisticated clients to perform software process modeling, and servers to integrate repositories for software engineering environments. Acknowlegements. Christina Chen, Hongyan Luo, and Vien Chan participated in the implementation of the prototype system. We would like to thank Peter Danzig, Deborah Estrin, Doug Fang, Shahram Ghandeharizadeh, Steve Hotz, and Kim Korner for their many helpful comments. This work has been supported in part by contracts and grants from 10

AT&T, Northrop Inc., Paci c Bell, and the Oce of Naval Technology through the Naval Ocean Systems Center. No endorsement is implied.

References [1] Robert M. Akscyn, Donald L. McCracken, and Elise A. Yoder. KMS: A distributed hypermedia system for managing knowledge in organizations. Communications of the ACM, 31(7):820{835, July 1988. [2] Je Conklin. Hypertext: An introduction and survey. Computer, 20(9):17{41, September 1987. [3] Norman M. Delisle and Meyer D. Schwartz. Neptune: a hypertext system for CAD applications. In SIGMOD record, v15, n2 (June): pp. 132-142, pages 168{186, New York, 1986. ACM. [4] Pankaj K. Garg and Walt Scacchi. A hypertext system for software life cycle documents. IEEE Software, 7(3):90{99, May 1990. [5] Dennis Heimbigner and Dennis McLeod. A federated architecture for information management. ACM Trans. on Oce Information Systems, 3(3):253{278, July 1985. [6] Ted Nelson. Managing immense storage. Byte, 13(1):225{233, 1988. [7] Amy Pearl. Sun's link service: a protocol for open linking. In Hypertext '89 Proceedings, pages 137{146, New York, 1989. ACM. [8] Bruce R Schatz. Telesophy: A system for manipulating the knowledge of a community. In GLOBECOM '87 Proceedings, Tokyo, November, 1987, pages 1181{1186, New York, 1989. ACM. [9] Frank M. Shipman III. Distributed hypertext for collaborative research: the Virtual Notebook system. In Hypertext '89 Proceedings, pages 29{135, New York, 1989. ACM. [10] Marvin Theimer, Luis-Felipe Cabrera, and Jim Wyllie. QuickSilver support for access to data in large, geographically dispersed systems. In The 9th International Conference on Distributed Computing Systems, pages 28{35. IEEE Computer Society, June 1989. [11] Nicole Yankelovich, Bernard J. Haan, Norman K. Meyrowitz, and Steven M. Drucker. Intermedia: the concept and the construction of a seamless information environment. Computer, 21(1):81{96, January 1988.

11

Integrating Diverse Information Repositories: A Distributed ... - CiteSeerX

Integrating Diverse Information Repositories: A Distributed ... - CiteSeerX

Suggest Documents

Integrating diverse information resources in a case

Integrating diverse information resources into ... - Semantic Scholar

Distributed Repositories of Highly Expressive Reusable ... - CiteSeerX

Distributed Repositories of Highly Expressive Reusable ... - CiteSeerX

integrating diverse learning styles

Institutional repositories: review and an information ... - CiteSeerX

Integrating Information to Bootstrap Information Extraction ... - CiteSeerX

Session T4C INTEGRATING INFORMATION ... - CiteSeerX

Integrating FLOSS repositories on the Web - DERI

Cooperative Information Gathering: A Distributed Problem ... - CiteSeerX

Information Technology Implementation for a Distributed ... - CiteSeerX

A Distributed Decentralised Information Storage and ... - CiteSeerX

A Distributed Decentralised Information Storage and ... - CiteSeerX

Cooperative Information Gathering: A Distributed Problem ... - CiteSeerX

Integrating Text Mining into Distributed Bioinformatics ... - CiteSeerX

Integrating geographically distributed development teams ... - CiteSeerX

Integrating geographically distributed development teams ... - CiteSeerX

Ranking Distributed Knowledge Repositories - Semantic Scholar

Accessing Distributed Learning Repositories ... - Semantic Scholar

Integrating Distributed Simulation Objects

Integrating Distributed Simulation Objects

Lago Distributed Network Of Data Repositories

Integrating Heterogeneous Metadata into a Distributed Multimedia ...

A Solid State Transformer Integrating Distributed