Towards Large-Scale Information Integration

Kenneth M. Anderson

Susanne A. Sherba

William V. Lepthien

University of Colorado Dept. of Computer Science 430 UCB Boulder CO, 80309-0430 USA




ABSTRACT

Software engineers confront many challenges during software development. One challenge is managing the relationships that exist between software artifacts. We refer to this task as information integration, since establishing a relationship between documents typically implies that an engineer must integrate information from each of the documents to perform a development task. In the past, we have applied open hypermedia techniques and technology to address this challenge. We now extend this work with the development of an information integration environment. We present the design of our environment along with details of its first prototype implementation. Furthermore, we describe our efforts to evaluate the utility of our approach. Our first experiment involves the discovery of keyword relationships between text-based software artifacts. Our second experiment examines the code of an open source project and generates a report on how its module relationships have evolved over time. Finally, our third experiment develops the capability to link code claiming to implement W3C standards with the XHTML representation of the standards themselves. These experiments combine to demonstrate the promise of our approach. We conclude by asserting that the process of software development can be significantly enhanced if more tools made their relationships available for integration.

1.

INTRODUCTION

Software developers face significant information management challenges during software development. A key problem is relationship management, which refers to the task of tracking the myriad ways in which software artifacts can be related. For instance, a code module is related both to a design document that describes its position within a larger software system and a protocol specification that specifies the way in which the module is to be accessed. These documents, in turn, are related to a set of functional and non-functional requirements that may be specified in one or more requirements documents. Each of these documents participates in versioning relationships with previous versions of itself, and all of these versioning structures participate in a configuration being managed by a configuration management system. These examples are just the tip of the iceberg.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ICSE ’02 Orlando, Florida Copyright 2002 ACM 1-58113-472-X/02/05 ...$5.00.

The reason that relationship management is so difficult is that the majority of the relationships in a software development project are implicit. That is, there are no formalisms or tools that allow software developers to completely track all of the possible relationships that exist between their software artifacts. For instance, if a developer solves a maintenance problem by using information from a set of documents, then implicitly there is a relationship between the problem being solved and the set of information that was used to solve it. Rarely is this type of relationship explicitly tracked in a software development project.

An additional problem is that explicitly specified relationships are often isolated from each other by incompatible formats and tools. For instance, given a particular version of a system, a testing tool might maintain a set of relationships that describe which test cases apply to which modules, but this set of relationships cannot be accessed or exploited by the tool that is versioning the modules in the first place. Or, a requirements tool might maintain a set of relationships that track the internal dependencies of a set of requirements, but offer no mechanism for a design tool to link these relationships with the relationships that it maintains over its design documents. These isolated islands of information make gaining a global picture of software development difficult; a software engineer is only able to examine a particular set of relationships when accessing the tool that maintains them. This inability to gain a global perspective on software development is a problem that has plagued software engineering since its inception.
Indeed, Fred Brooks talks about this problem in “No Silver Bullet” [8] when he describes invisibility as one of the essential difficulties of software development. However, we believe that the inability of software tools to share relationships is an accidental difficulty (to adopt Brooks’ terminology) that can be addressed by appropriate techniques and mechanisms. We present an approach to address this issue and report on the results of evaluating our techniques. While techniques and tools for addressing the relationship management problem will not provide a “silver bullet,” we believe they can significantly help software engineers cope with the complexity of software development. Our approach to the relationship management problem is to provide an environment to support the task of information integration. We refer to our environment (and its underlying framework) as InfiniTe (pronounced “infinity”).

Information integration is an overloaded term. For instance, the database community uses the phrase to describe schema integration tasks, while the intelligence community uses it to refer to the analysis of a large set of distributed information to obtain a global picture of national or world events. We define information integration to be the task of discovering, creating, maintaining, and evolving explicit sets of relationships between software artifacts.¹ This definition is similar in spirit to the other definitions but, at the same time, refers to an explicit and well-defined software development task.

As suggested by our definition, we view the task of information integration as a set of distinct processes that must be supported by any environment claiming to provide information integration services:

• The discovery process is one in which implicit relationships are recognized, either by humans or by automation, and converted into an explicit form that can be tracked and maintained by the environment.

• The creation process is one in which a well-defined rule (or set of rules) is used to automate the creation of a set of relationships between a specific set of documents or to automate the creation of links between existing sets of relationships (as is the case when design relationships are linked to requirements relationships to establish requirements traceability). It also refers to the process in which a developer manually establishes a relationship between a set of artifacts using mechanisms provided by the environment. While it is important that the manual creation of relationships be supported, our research focus is on the degree of automation that can be provided by the environment, since automation is required to make information integration practical in large-scale software development.

• The maintenance process refers to the task of manipulating and analyzing a set of relationships once it has been created. For instance, after a set of relationships has been created, it will be useful to name the set and save it in a persistent store. In addition, it should be possible to export the set out of the environment in a format that facilitates its use in other contexts. Once exported, the set may be manipulated and updated by external tools. As such, it should also be possible for a set of exported relationships to be imported back into the environment. Indeed, the environment’s creation process must allow for the situation in which a particular type of relationship can only be generated external to the environment and then imported for use in other maintenance or evolution tasks.

• The evolution process refers to the task of tracking changes to a set of relationships. Indeed, one difficult task of software development is determining how a set of relationships has evolved over the course of a project. For instance, a software engineer may want to know when a particular type of relationship, e.g., uses relationships between modules, experiences an explosion in the number of instances between versions of a system. This type of explosion may indicate that a particular module lost coherence between versions or that a design decision was poorly implemented. An information integration environment therefore needs to provide services that allow a developer to view the changes in a set of relationships over time. In addition, it must allow software engineers to modify and extend a set of relationships to help advance the state of a software project. For instance, a developer may want to make a copy of a set of relationships that specifies the architecture of a system and then modify those relationships to test alternative arrangements.

Finally, an information integration environment must provide ways for a developer to view the sets of relationships it maintains. (This requirement was implicit in the discussion above, but needs to be mentioned explicitly.) Ideally, it should be possible to view the relationships as “close” to the software artifacts as possible. That is, if a developer must abandon his favorite set of tools to access and view relationships being maintained by the environment, then it is less likely that a developer will actually use the environment. Our past research [2, 4] in open hypermedia [14] has addressed this particular problem and we intend to exploit the technology and lessons learned from that work to address this problem in this new research context. In particular, we intend to export relationships created in InfiniTe into an open hypermedia system such that engineers can view and navigate these relationships while using standard tools.

The rest of this paper is organized as follows. We first present the InfiniTe framework and discuss our prototype implementation of it. Next, we evaluate our claims for the framework, and then conclude.

¹ We refer to sets of relationships without making reference to the specific type of relationship included in the set. It is our goal to support multiple types of relationships generically and thus there is no need to distinguish between different types when discussing our environment’s capabilities.

2.

FRAMEWORK

Our approach is centered around an environment consisting of elements designed to support the task of information integration. The environment and its elements are conceptual entities that make up the InfiniTe framework (see Fig. 1). Before we present each element of the framework, it is important to consider the insight that led to the development of the framework in the first place.

The key contribution of this framework is that it provides a mechanism for dealing with the heterogeneity of the software artifacts found in software environments and the types of relationships that exist between these artifacts. In particular, the framework takes the approach of translating software artifacts from their native formats into documents stored in a repository that makes use of a uniform file format. These documents can be examined in a generic way to search for relationships that exist between the original artifacts. Relationships are also stored in a uniform format to facilitate the creation of generic relationship operations. Metadata can be associated with both documents and relationships to keep track of typing information, references to external artifacts, etc. Thus, the repository establishes a homogeneous environment while supporting a heterogeneous mix of software artifacts and relationships.

Figure 1: The elements of the InfiniTe framework. (Diagram labels: Integrator, Translator, Data Source, End-User, Context, Open Hypermedia Layer, Information Integration Environment.)

The elements of the InfiniTe framework include:

• Users: Users play two very important roles in the InfiniTe framework. They act as consumers of the environment’s information. Additionally, users provide information to the environment.

• Data Source: A data source is any entity that provides information to the InfiniTe environment. Data sources include traditional software artifacts such as requirements/design documents, code, test cases, etc. but may also include non-traditional software artifacts such as humans, databases, workflow systems, event notification systems, and Web-based information.

• Translator: A translator is a computational entity that either imports information into the environment (translating the data source’s native format into the environment’s uniform format) or exports information out of the environment (either back into a data source or as a new file). When importing information into the environment, the translator stores the information as a document within the repository.

• Integrator: An integrator is a computational entity that generates new information within the environment. Integrators access information contained in documents or contexts and make use of relationships between these entities to explore the information space. Typically, integrators are created to discover a specific type of relationship between sets of documents and to explicitly record their existence. They may also be used to generate reports over the information stored in the environment, such as reporting how a set of relationships has evolved over a project’s lifetime.

• Repository: The environment maintains a repository (indicated by the large oval in Fig. 1) that stores environment-related information (documents, contexts, and relationships) in a uniform format. The repository is the mechanism that enables the homogeneous information space within the environment, despite the fact that most environment information is drawn from a highly heterogeneous set of data sources. In addition, the repository provides a mechanism for associating (and accessing) metadata with any document, context, or relationship. Metadata takes the form of an unbounded set of attribute-value pairs. Finally, the repository must maintain an address for each of its documents, contexts, and relationships.

• Documents: Documents (indicated by the black circles in Fig. 1) represent information imported into the environment from data sources or information generated by integrators. Documents are read-only and are stored in a uniform format that allows them to be generically processed by the environment’s integrators. The design rationale for making documents read-only is discussed below in Section 2.1.5.

• Context: A context is a mechanism for partitioning InfiniTe’s information space. It is similar to the notion of an open hypermedia composite [9]. Contexts can contain documents and other contexts and can participate in relationships that span contexts. Furthermore, metadata can be associated with contexts to specify how a particular context is to be used.

• Relationship: A relationship is an n-ary set of document or context addresses. What it means to “traverse” a relationship is left to each individual integrator. For instance, an integrator may scan a relationship looking for references to other contexts and ignore all document addresses while another integrator may process each reference in turn.

• Open Hypermedia Layer: The environment maintains a connection to an open hypermedia system that allows relationships generated in the environment to be viewed as navigational relationships within the native editing environment of the original data source (provided the editing application is integrated with the open hypermedia system). Due to space constraints, we cannot provide an introduction to open hypermedia. Interested readers are referred to [4, 14].
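The element definitions above can be read as a small data model. The following Java sketch is one possible reading and not the prototype's actual classes: the names `FrameworkModel`, `Entity`, `Document`, `Context`, and `Relationship` are hypothetical, and representing addresses and metadata values as plain strings is an assumption.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of the framework's repository elements; not the prototype's classes. */
final class FrameworkModel {

    /** Anything the repository stores has an address and attribute-value metadata. */
    static abstract class Entity {
        final String address;                                       // assumed: a plain string
        final Map<String, String> metadata = new LinkedHashMap<>(); // unbounded attribute-value pairs
        Entity(String address) { this.address = address; }
    }

    /** Documents are read-only, so the content is fixed at construction time. */
    static final class Document extends Entity {
        private final String content;
        Document(String address, String content) { super(address); this.content = content; }
        String content() { return content; }
    }

    /** A context holds references (addresses) to documents and to other contexts. */
    static final class Context extends Entity {
        final Set<String> members = new LinkedHashSet<>();
        Context(String address) { super(address); }
    }

    /** A relationship is an n-ary set of document or context addresses. */
    static final class Relationship extends Entity {
        final Set<String> endpoints;
        Relationship(String address, String... endpoints) {
            super(address);
            this.endpoints =
                Collections.unmodifiableSet(new LinkedHashSet<>(Arrays.asList(endpoints)));
        }
    }

    public static void main(String[] args) {
        Document design = new Document("doc-1", "design notes");
        Document code = new Document("doc-2", "class Parser {}");

        Context analysis = new Context("ctx-1");
        analysis.metadata.put("purpose", "change-impact analysis");
        analysis.members.add(design.address);
        analysis.members.add(code.address);

        Relationship implementedBy = new Relationship("rel-1", design.address, code.address);
        implementedBy.metadata.put("type", "implemented-by");
        System.out.println(implementedBy.address + " links " + implementedBy.endpoints);
    }
}
```

Storing only addresses in contexts and relationships, rather than object references, mirrors the framework's requirement that every element be reachable through the repository namespace.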

2.1

Framework Discussion

In this section, we provide more detail and insight into the elements of the framework and expand on some of the possible interactions between them.

2.1.1

Users

Users are an important part of any software system. Within InfiniTe, a user’s role as a source of information for the environment is very important. There are many relationships between the artifacts of a software project that exist solely in the minds of its developers. Hence, these relationships are implicit and highly ephemeral in nature. For instance, a developer may modify a design document and realize that a corresponding change must be made in a related requirements document. This change-impact relationship exists only in the developer’s mind, which is a tenuous place indeed. If the developer is interrupted before the corresponding change is made, it is likely that the relationship will be lost and the artifacts will contain an inconsistency. To address this problem, the framework provides a mechanism for users to specify implicit relationships explicitly within the environment. In the scenario above, after the design document is modified, the developer can translate the new version into the environment. The environment may then display a list of documents that are traditionally updated whenever this design document changes. This notification then reminds the developer to make the corresponding change to the requirements document.

2.1.2

Data Source

Data sources provide information to InfiniTe. However, as indicated in Fig. 1, the environment may flow information back to a data source. This situation is especially likely when the format of the data source supports hypermedia linking. For instance, a Web document may be updated with new links after being processed by the environment.

2.1.3

Integrators and Translators

In the definition of integrators and translators, the phrase “computational entity” was carefully chosen. It is easy to think of integrators and translators as batch processors, which execute only long enough to perform a single task. While the majority of integrators and translators may indeed take this form, the framework places no restrictions on the architecture of an integrator or translator. Indeed, we have already constructed a translator which is a client of an event notification system. When invoked, this translator subscribes to a particular set of events and then runs continuously waiting for event notifications to arrive. Similarly, some integrators are envisioned to function in much the same manner as web crawlers, continuously roaming the information space looking for relationships that match their interests.

2.1.4

Repository

The repository is responsible for managing InfiniTe’s information space. In particular, it must maintain a namespace that provides each of its elements with an address. By “address,” we refer to a mechanism that allows documents, contexts, and relationships to be referenced by other entities. It must also be possible to reference the internal contents of each entity. Thus a context may refer to a document stored in another context, or a relationship may contain a reference to a paragraph within a document. The framework only specifies that a namespace must exist; implementations are free to use any addressing scheme they choose.

Conceptually, the repository is a centralized source of information accessible to all integrators and translators. However, there is no requirement that the repository be centralized when implemented. Indeed, modern software development typically involves globally-distributed software teams that must coordinate their activities to make progress on shared development tasks [10]. As such, any implementation of the framework intended for production software environments needs to provide support for distribution (and not just for the repository).

2.1.5

Documents

Documents are read-only information that have either been translated into the environment from an external data source or created by an integrator. Note, not all integrators may be able to understand the information contained within a document. However, the environment’s uniform file format allows for a limited form of reflection, similar to CORBA’s dynamic invocation interface [13]. Before processing a document, an integrator can use the reflection mechanism to query whether it contains particular structures. Thus, an integrator designed to look for def-use relationships in source code may be unable to process a document translated from a configuration management system. But, an integrator looking for keywords may ignore the semantic structure of a document and just search its text for “hits.” A major design decision for the framework is that a document’s information is read-only. We made this decision for several reasons. Firstly, a document typically represents a data source external to the environment at the time it was translated. An important goal of our research is to produce an environment that can provide a global view of a software project to its developers. One aspect of this global view is the ability to show how the project evolved over time. If we allow modifications to translated documents, then this type of history information is jeopardized. If a modification is required on a document, the modification should be made on the original data source and then the original data source should be translated once again to produce a new document. The new document and the original document can then be related using an “is-a-version-of” relationship. We apply the same reasoning to documents that are created by integrators. That is, these documents represent the result of applying an integrator to a particular set of documents at a particular point in time. 
An integrator’s output can be an essential part of a project’s history and requires the same protections that translated documents enjoy. Secondly, if we modify a document to include, for instance, a set of relationships generated by an integrator, we may prevent the document from being processed by other integrators, since the modifications may obscure the presence of other types of relationships that the original document contained. As such, InfiniTe stores any relationships generated by an integrator for a document externally. The open hypermedia field has demonstrated the importance of external links for more than a decade. See [14] for details.

2.1.6

Contexts

Contexts are the mechanism by which the environment’s information space is partitioned. Three additional issues to discuss with respect to contexts are the global context, how metadata can be used to configure the use of a context, and how links between contexts can be used.

When a document is translated into InfiniTe, it is placed in a global context (not shown in Fig. 1). Both translators and integrators may then include the document in other contexts by adding a reference to the document in the target context. The use of references allows a document to participate in multiple contexts and enables scalability since each document is stored only once.

With respect to contexts and metadata, metadata can indicate how a context is to be used. For instance, a developer may create a context for performing a change-impact analysis on a set of software artifacts and use metadata to record that purpose along with other information such as the context’s creator and creation time.

Finally, contexts can be linked to other contexts. This allows contexts to be related to each other in various ways including time-based relationships. For example, suppose the developer above wants to perform another change-impact analysis on the same set of software artifacts one week later. Assuming that there have been some changes to the artifacts between sessions, he creates a new context to contain the new versions of the artifacts and links this new context back to the original context. This capability, then, allows software developers to track the history of a project explicitly. A developer can jump to the most recent context of a particular analysis and then follow links between contexts to see how the results of the analysis have evolved over time.
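The time-based linking of contexts described above can be illustrated with a short Java sketch. It is a minimal illustration under stated assumptions: the class `ContextHistory`, its methods, and the context names are hypothetical, and each context is assumed to link back to at most one earlier context.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper that follows "links back" from a recent context to earlier ones. */
final class ContextHistory {
    // newer context name -> name of the earlier context it links back to
    private final Map<String, String> previous = new LinkedHashMap<>();

    /** Records that the newer context links back to the older one. */
    void linkBack(String newer, String older) {
        previous.put(newer, older);
    }

    /** Returns the chain of contexts from the given one back to the oldest. */
    List<String> history(String latest) {
        List<String> chain = new ArrayList<>();
        String current = latest;
        // Stop at the first context with no predecessor; the contains() check guards against cycles.
        while (current != null && !chain.contains(current)) {
            chain.add(current);
            current = previous.get(current);
        }
        return chain;
    }

    public static void main(String[] args) {
        ContextHistory analyses = new ContextHistory();
        analyses.linkBack("impact-week2", "impact-week1");
        analyses.linkBack("impact-week3", "impact-week2");
        System.out.println(analyses.history("impact-week3"));
    }
}
```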

2.1.7

Relationships

There are two additional issues concerning relationships that need to be discussed. Firstly, relationships are stored externally to the documents and contexts they address. As such, the repository must provide a service that, given a particular document (or context), retrieves all relationships that reference it. Secondly, since relationships are stored in the repository as documents, it is possible to create “links between links.” This capability is important for linking sets of relationships maintained by different software tools, such as linking test cases for particular modules (maintained by a testing tool) with particular configurations of a system (maintained by a configuration management tool).
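One way to picture the lookup service and “links between links” is a reverse index from an address to the relationships that mention it. The Java sketch below is an assumption about how such a service could be built, not a description of the prototype; `RelationshipIndex`, its methods, and all addresses are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical reverse index from an address to the relationships that reference it. */
final class RelationshipIndex {
    // referenced address -> addresses of the relationships that mention it
    private final Map<String, Set<String>> referencedBy = new LinkedHashMap<>();

    /** Registers a relationship (stored externally to its targets) and indexes each target. */
    void add(String relationshipAddress, String... targets) {
        for (String target : Arrays.asList(targets)) {
            referencedBy.computeIfAbsent(target, k -> new LinkedHashSet<>())
                        .add(relationshipAddress);
        }
    }

    /** Returns all relationships that reference the given document, context, or relationship. */
    Set<String> referencing(String address) {
        return Collections.unmodifiableSet(
            referencedBy.getOrDefault(address, Collections.emptySet()));
    }

    public static void main(String[] args) {
        RelationshipIndex index = new RelationshipIndex();
        index.add("rel-tests", "test-case-7", "module-parser");    // from a testing tool
        index.add("rel-config", "module-parser", "release-1.2");   // from a CM tool
        // Because relationships have addresses too, a relationship can link other relationships.
        index.add("rel-trace", "rel-tests", "rel-config");

        System.out.println(index.referencing("module-parser"));
        System.out.println(index.referencing("rel-tests"));
    }
}
```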

2.2

Summary

We believe that the InfiniTe framework provides an excellent conceptual foundation upon which to design and implement an information integration environment that can help software developers address the relationship management problem. We have constructed an initial prototype of the framework that has provided considerable feedback on the approach and has led to some initial evaluation experiments that have produced positive results (see Section 4 for details). As such, the framework’s elements have evolved from an initial set consisting of just data sources, translators, and integrators to the final list presented above based on our experience implementing the framework and attempting to evaluate its utility. While we do not anticipate further additions to the elements of the framework, we do not consider the list closed. If additional feedback or experience indicates a need for a new conceptual element, we will certainly evolve the framework to meet those needs.

3.

PROTOTYPE IMPLEMENTATION

We have constructed an implementation of InfiniTe to gain insight into the framework.² Furthermore, this prototype has served as a vehicle for conducting a series of experiments to evaluate our approach to information integration. The architecture of the prototype (shown in Fig. 2) conforms to an increasingly common Web-based architecture for processing and presenting shared information. The prototype is implemented as a set of Java servlets running in the Tomcat application server [15]. Tomcat functions as a traditional Web server but also acts as a servlet engine mapping a portion of its URL namespace into invocations on Java servlets, e.g., a URL of the form is translated into a request on a servlet that handles context services for the repository. This URL is asking the servlet to display the contents of the context dev contained within the context infinite.

² The prototype is available at .

Each servlet (our prototype consists of five servlets) supports a generic style of interaction:

1) A URL is mapped to an operation that manipulates the information in the repository in some way.

2) The servlet may access or modify the repository directly, or it may invoke a translator or integrator that then accesses or modifies the repository. The repository consists of a set of extensible markup language [18], or XML, files.

3) The file that was accessed, modified, or generated as a result of the invoked operation is then targeted for display. Since support for XML in existing web browsers is uneven, the document is paired with an XSLT stylesheet [21] and passed to an XSLT processor for conversion to HTML.

4) The generated HTML is then passed back to the web browser (via the servlet) for display.

The advantage of this architecture is that support for distributed access to a shared information space is provided “for free,” by leveraging the already widely-deployed infrastructure of the Web. In addition, the interface to the environment is a Web browser, an increasingly common tool in a software engineer’s tool chest. Furthermore, the reuse of Web protocols (namely HTTP [11]) enables access to the environment by many types of clients as long as they access the correct URLs. We intend to document these URLs and publish them as a form of application program interface (API) to InfiniTe. Finally, the use of XSLT to perform the conversion of XML into HTML drastically simplifies the implementation of a servlet which can then focus primarily on carrying out the semantics of its assigned operations.
We now describe the prototype in more detail, giving special attention to the implementation choices we made with respect to the requirements imposed by the framework.
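The XML-to-HTML step of the interaction style described above can be sketched with the standard `javax.xml.transform` API. This is a minimal illustration and not the prototype's servlet code: the class `XmlToHtml`, the `context` and `document` element names, and the stylesheet are all assumptions made for the example.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Hypothetical sketch: pair an XML file with an XSLT stylesheet and produce HTML. */
final class XmlToHtml {
    // Assumed stylesheet: renders a context as a heading plus a list of document ids.
    static final String STYLESHEET =
          "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='html'/>"
        + "<xsl:template match='/context'>"
        + "<html><body><h1><xsl:value-of select='@name'/></h1><ul>"
        + "<xsl:for-each select='document'><li><xsl:value-of select='@id'/></li></xsl:for-each>"
        + "</ul></body></html>"
        + "</xsl:template></xsl:stylesheet>";

    /** Runs the stylesheet over the XML text and returns the generated HTML. */
    static String render(String xml, String stylesheet) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(stylesheet)));
            StringWriter html = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(html));
            return html.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSLT conversion failed", e);
        }
    }

    public static void main(String[] args) {
        // Assumed context file; the prototype's real XML vocabulary is not shown in the text.
        String contextXml = "<context name='dev'><document id='12'/><document id='15'/></context>";
        System.out.println(render(contextXml, STYLESHEET));
    }
}
```

In a servlet, the returned string would be written to the HTTP response, which keeps presentation logic in the stylesheet and out of the Java code.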

3.1

Servlets

There are five servlets that make up the prototype: services, documents, contexts, integrators, and translators. The services servlet generates InfiniTe’s HTML-based user interface. The user selects choices presented by the services servlet to import and export information, create and manipulate contexts and documents, and invoke integrators. The other four servlets are invoked as needed by the services servlet to carry out the requests of the user. Thus, when a user indicates that a new context should be created, the services servlet routes a request to the context servlet to actually create the context.

3.2

Uniform File Format

We have chosen XML to serve as the uniform file format of the InfiniTe environment. In particular, contexts and documents are stored as XML files. Relationships are also stored as XML files but they make use of a particular format defined by the XML Linking Language (XLink) [16]. The decision to use XML frees us from having to develop our own proprietary format and allows us to leverage a wide range of free (and often open source) XML support tools including parsers, stylesheet processors, and editors. In addition, XML is being widely adopted as an import/export format by many software development and desktop publishing tools. This situation reduces the need for third-party translators, since the exported XML of a software tool can be directly incorporated into InfiniTe.

Figure 2: The architecture of the prototype of the InfiniTe environment. (Diagram components: Web Server (Servlet), Integrators/Translators, Local File System, Repository, Stylesheet Repository, XSLT Processor, HTML Files, Client Applications, Open Hypermedia System, XLinks. Edge labels: Request, Response, Invokes, Manipulates, Reads, Generates, I/O, Exports, Imports, Links.)

Figure 3: The repository configuration file.

We can directly incorporate third-party XML into the environment because we made an implementation decision that we would not require a single document type definition, or DTD [18], for XML documents stored in the repository. (DTDs are used to define the set of tags that can appear in an XML document.) This decision was made to support heterogeneity; imposing a single DTD over the entire repository would have been impractical (since different documents are used for different tasks, such as storing information about contexts, storing metadata, specifying parameters for an integrator or translator, etc.) and would require the creation of a significant number of translators to translate both non-XML and XML information into a document consistent with the repository-specific DTD.

Furthermore, recall that the framework specified the need for a reflection mechanism to allow integrators to query the structures of documents. We make use of the Document Object Model [6], or DOM, that has been defined for XML documents to provide this functionality. That is, a document can be parsed into a DOM tree by the repository and then an integrator can issue queries against this tree to ask if a particular tag exists within a document.

The choice of XML as the uniform file format fulfills the responsibilities imposed by the framework. A heterogeneous set of tools can be supported outside the environment, while the explicitly defined syntax of well-formed XML documents provides the desired homogeneity within the environment and enables generic processing of the environment’s information using standard XML tools.
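The DOM-based reflection check described above can be sketched with the JDK's built-in XML parser. The class `StructureQuery` and the `module` and `uses` element names are hypothetical; only the idea of asking a DOM tree whether a tag exists comes from the text.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

/** Hypothetical sketch of the reflection query an integrator might issue before processing. */
final class StructureQuery {

    /** Returns true if the XML text contains at least one element with the given tag name. */
    static boolean containsTag(String xml, String tagName) {
        try {
            Document dom = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            return dom.getElementsByTagName(tagName).getLength() > 0;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse document", e);
        }
    }

    public static void main(String[] args) {
        // Assumed document translated from source code; tag names are illustrative only.
        String xml = "<module name='parser'><uses target='lexer'/></module>";

        // An integrator looking for "uses" structures can process this document...
        System.out.println("has <uses>: " + containsTag(xml, "uses"));
        // ...while one looking for requirement structures would skip it.
        System.out.println("has <requirement>: " + containsTag(xml, "requirement"));
    }
}
```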

3.3

Repository

The repository consists of XML files distributed across directories in a file system. Each directory is used to categorize XML documents according to function, e.g., there is a directory for contexts, another for documents, relationships, and metadata, and one each for integrators, translators, and stylesheets. In addition, there is one XML file that is devoted to tracking configuration information about the repository (see Fig. 3). This file keeps track of the available integrators and translators and the next available file id (each document of the repository receives a unique id).

Note that a distinction has been made between translators that import information into the environment and those that export information out of the environment (see Fig. 3). This distinction was made primarily to provide context-sensitive support in the environment’s user interface, e.g., when a user initiates an export operation, they are presented with a list of export-only translators. This list is generated by an XSLT stylesheet, as shown in Fig. 4. The stylesheet processes the repository configuration file looking for all translator tags whose type attribute has the value “export.” All such tags are added to an HTML input form that is being generated by the stylesheet which, when displayed in a user’s web browser, presents them with the desired list.

The information contained in the repository is accessed via a repository API implemented as a set of Java classes. Integrators, translators, and the five servlets of the prototype use this API to create, access, and manipulate documents, contexts, relationships, and metadata.
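The stylesheet-driven list of export translators can be illustrated with a minimal Java sketch. It is not the prototype's stylesheet: the layout of the configuration file (a `repository` root with `translator` elements carrying `name` and `type` attributes), the form's `action` target, and the class name `ExportFormSketch` are assumptions made for illustration. Only the selection rule, translator tags whose type attribute is “export,” comes from the text above.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class ExportFormSketch {
    // Hypothetical configuration layout; the real file (Fig. 3) may differ.
    static final String SAMPLE_CONFIG =
        "<repository>"
      + "<translator name='TextImport' type='import'/>"
      + "<translator name='XLinkExport' type='export'/>"
      + "<nextid>7</nextid>"
      + "</repository>";

    // Assumed stylesheet: one radio button per export translator.
    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='html' indent='no'/>"
      + "<xsl:template match='/'>"
      + "<html><head><title>Infinite Export</title></head><body>"
      + "<form method='post' action='translators'>"
      + "<xsl:for-each select=\"//translator[@type='export']\">"
      + "<input type='radio' name='translator' value='{@name}'/>"
      + "<xsl:value-of select='@name'/><br/>"
      + "</xsl:for-each>"
      + "<input type='submit'/>"
      + "</form></body></html>"
      + "</xsl:template></xsl:stylesheet>";

    // Applies the stylesheet to a configuration document and returns the HTML.
    static String renderExportForm(String configXml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter html = new StringWriter();
            transformer.transform(
                new StreamSource(new StringReader(configXml)),
                new StreamResult(html));
            return html.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("stylesheet could not be applied", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(renderExportForm(SAMPLE_CONFIG));
    }
}
```

Because the filtering lives in the XPath predicate, adding a translator to the configuration file is enough to change the generated form; no Java code needs to be touched.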

3.3.1

Namespace

Figure 4: An XSLT stylesheet for processing the repository configuration file.

The repository API implements a namespace, as required by the framework, using a two-level scheme. It defines a global context that contains all documents and contexts. Contexts are arranged as a forest of trees within the global context. At any level of the tree, sibling context names must be unique. The namespace is traversed by combining names together using the dot operator, similar to the mechanism used in Java to traverse packages and classes. Examples of legal context names include global.infinite.dev.ken and xml.opensource.apache.xerces. (Note: the global context is assumed when not specified explicitly.) These names (and the examples which follow) can be passed as parameters to various repository API operations. For example, passing either of the previous two names to the LoadContext() operation would cause the corresponding context to be loaded into memory.

Once a particular context is specified, a file is accessed by appending its file id. In our prototype, file ids consist of the string “file” and a unique integer. Thus a file can be accessed using an address like infinite.dev.ken.file0. Finally, the contents of a file can be accessed using the pound operator followed by a quoted XPath expression [19]. Thus, the author of a document might be accessed using an expression like global.infinite.dev.file1#xpointer(//author). This makes use of a simple XPath expression that retrieves a tag named author from the document. (A detailed discussion of XPath is out of scope.)

Note that addresses like this are typically hidden from the users of the prototype. Users think in terms of documents, contexts, and relationships and while they may care that a relationship points to the author of a particular document, they do not necessarily care how the reference is specified.
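To make the address format concrete, the following Java sketch splits an address into its context path, file id, and XPath expression. The class `RepositoryAddress` and its fields are hypothetical and are not part of the repository API; the sketch also assumes that any final segment of the form “file” plus digits is a file id, which would misread a context that happened to carry such a name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class RepositoryAddress {
    final List<String> contextPath; // always begins with "global"
    final String fileId;            // e.g. "file0"; null for a context-only address
    final String xpath;             // e.g. "//author"; null when no fragment is given

    private RepositoryAddress(List<String> contextPath, String fileId, String xpath) {
        this.contextPath = Collections.unmodifiableList(contextPath);
        this.fileId = fileId;
        this.xpath = xpath;
    }

    static RepositoryAddress parse(String address) {
        String path = address;
        String xpath = null;

        // Everything after the pound operator is expected to be xpointer(...).
        int hash = address.indexOf('#');
        if (hash >= 0) {
            String fragment = address.substring(hash + 1);
            if (!fragment.startsWith("xpointer(") || !fragment.endsWith(")")) {
                throw new IllegalArgumentException("unsupported fragment: " + fragment);
            }
            xpath = fragment.substring("xpointer(".length(), fragment.length() - 1);
            path = address.substring(0, hash);
        }
        if (path.isEmpty()) {
            throw new IllegalArgumentException("empty address");
        }

        List<String> parts = new ArrayList<>(Arrays.asList(path.split("\\.")));

        // Assumption: a trailing "file<integer>" segment names a document.
        String fileId = null;
        if (parts.get(parts.size() - 1).matches("file\\d+")) {
            fileId = parts.remove(parts.size() - 1);
        }

        // The global context is assumed when not specified explicitly.
        if (parts.isEmpty() || !parts.get(0).equals("global")) {
            parts.add(0, "global");
        }
        return new RepositoryAddress(parts, fileId, xpath);
    }

    public static void main(String[] args) {
        RepositoryAddress a = parse("global.infinite.dev.file1#xpointer(//author)");
        System.out.println(a.contextPath + " " + a.fileId + " " + a.xpath);
    }
}
```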

3.3.2

Contexts

Contexts are implemented using a directory structure on the file system of the prototype’s host machine. That is, within the repository’s contexts directory is a subdirectory called “global.” Each new context is created as a subdirectory under global, or one of its children. The semantics of the file system ensure that each level of the directory structure consists of uniquely named contexts. Each context directory contains an XML file that stores information about the context, including a pointer to each context document, pointers to relationships that reference context documents (see below), and a pointer to a set of metadata that describes the context. When the repository API is interpreting a repository address, it traverses the file system hierarchy to see if the referenced context exists. If so, it can then access the context’s XML file to access the context’s documents.
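The mapping from context names to directories can be sketched in a few lines of Java. The class `ContextDirectories`, its method names, and the use of a scratch directory are assumptions for illustration only; the prototype's repository API is not reproduced here.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

class ContextDirectories {
    // Maps a dotted context name onto a directory below the contexts root.
    static Path resolve(Path contextsRoot, String contextName) {
        // The global context is assumed when not specified explicitly.
        String qualified =
            contextName.equals("global") || contextName.startsWith("global.")
                ? contextName
                : "global." + contextName;
        Path dir = contextsRoot;
        for (String segment : qualified.split("\\.")) {
            if (segment.isEmpty()) {
                throw new IllegalArgumentException("empty context name segment");
            }
            dir = dir.resolve(segment);
        }
        return dir;
    }

    // A context exists if its directory exists.
    static boolean exists(Path contextsRoot, String contextName) {
        return Files.isDirectory(resolve(contextsRoot, contextName));
    }

    // Creates the context and any missing ancestors; the file system
    // guarantees that sibling names stay unique.
    static Path create(Path contextsRoot, String contextName) {
        try {
            return Files.createDirectories(resolve(contextsRoot, contextName));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Hypothetical helper: a scratch contexts directory for experiments.
    static Path tempRoot() {
        try {
            return Files.createTempDirectory("contexts");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path root = tempRoot();
        create(root, "infinite.dev.ken");
        System.out.println(exists(root, "global.infinite.dev")); // parent context
    }
}
```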

3.3.3

Documents

Documents are stored as XML files in the repository’s documents directory. Each file may specify a particular DTD for XML validation purposes, but, as described above, a DTD is not required and no attempt is made to make documents conform to a repository-specific DTD. As specified by the framework, documents are given read-only permissions using file system operations. This prevents rogue integrators from violating this key design decision.

3.3.4

Relationships

As mentioned above, relationships are stored as documents, external to the entities that they relate, and are encoded as XLinks. The repository keeps track of relationship documents through the use of contexts. When an integrator creates a set of relationships, it does so with respect to a particular context. After the integrator has stored the relationships inside a document, the integrator adds a reference to this document in the context’s XML file. Documents which contain relationships that span contexts are referenced by each context touched by the relationships. These implementation choices provide a very useful feature, i.e., the ability to associate different sets of relationships with the same document based on context. Thus, a code artifact may have links to design documents in a requirements traceability context, but may have links to documentation in a maintenance context.
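The effect of keeping relationships per context can be shown with a small in-memory model. This is a sketch only: the prototype stores relationships as XLink documents referenced from each context's XML file, whereas the hypothetical class `ContextRelationships` below simply keeps a map from context name to links, and the document names are invented.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ContextRelationships {
    // One directed relationship between two repository documents.
    static final class Link {
        final String from;
        final String to;

        Link(String from, String to) {
            this.from = from;
            this.to = to;
        }
    }

    // Context name -> relationships recorded with respect to that context.
    private final Map<String, List<Link>> byContext = new HashMap<>();

    void add(String context, String from, String to) {
        byContext.computeIfAbsent(context, k -> new ArrayList<>())
                 .add(new Link(from, to));
    }

    // Targets linked from a document, as seen from one context only.
    List<String> targets(String context, String document) {
        List<String> result = new ArrayList<>();
        List<Link> links = byContext.getOrDefault(context, Collections.<Link>emptyList());
        for (Link link : links) {
            if (link.from.equals(document)) {
                result.add(link.to);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        ContextRelationships repo = new ContextRelationships();
        repo.add("traceability", "Parser.java", "design.xml");
        repo.add("maintenance", "Parser.java", "manual.xml");
        System.out.println(repo.targets("traceability", "Parser.java"));
        System.out.println(repo.targets("maintenance", "Parser.java"));
    }
}
```

The same code artifact yields a different set of links depending on the context that is consulted, which is the feature described above.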

3.3.5

Metadata

The repository implements metadata as another type of document that can be contained in the environment. (“One person’s metadata is another person’s data.”) We developed a simple XML format to represent the set of attribute-value pairs specified by the framework. The root tag of a metadata document is a tag named atts and it can contain an arbitrary number of attribute tags, each containing name and value tags. The repository API provides operations for creating metadata sets along with operations for adding, removing, and accessing each individual attribute. Metadata is associated with a document using a relationship. This allows generic sets of metadata to be created which can be associated with more than one document. In addition, using relationships to associate metadata with documents allows metadata to be associated with relationships and even other metadata sets, since both are types of documents.
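A metadata document in this format can be built and queried with the standard DOM API. The sketch below follows the element names given above (atts, attribute, name, value); the class `MetadataSet` and its method names are hypothetical stand-ins for the repository API operations, whose actual signatures are not given in this paper.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class MetadataSet {
    private final Document doc = newDocument();
    private final Element root;

    MetadataSet() {
        root = doc.createElement("atts");
        doc.appendChild(root);
    }

    private static Document newDocument() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Adds an attribute, replacing any existing attribute of the same name.
    void put(String name, String value) {
        remove(name);
        Element attribute = doc.createElement("attribute");
        Element nameTag = doc.createElement("name");
        nameTag.setTextContent(name);
        Element valueTag = doc.createElement("value");
        valueTag.setTextContent(value);
        attribute.appendChild(nameTag);
        attribute.appendChild(valueTag);
        root.appendChild(attribute);
    }

    // Returns the value stored under a name, or null if it is absent.
    String get(String name) {
        Element attribute = find(name);
        return attribute == null
            ? null
            : attribute.getElementsByTagName("value").item(0).getTextContent();
    }

    boolean remove(String name) {
        Element attribute = find(name);
        if (attribute == null) {
            return false;
        }
        root.removeChild(attribute);
        return true;
    }

    int size() {
        return root.getElementsByTagName("attribute").getLength();
    }

    private Element find(String name) {
        NodeList attributes = root.getElementsByTagName("attribute");
        for (int i = 0; i < attributes.getLength(); i++) {
            Element attribute = (Element) attributes.item(i);
            String current = attribute.getElementsByTagName("name").item(0).getTextContent();
            if (name.equals(current)) {
                return attribute;
            }
        }
        return null;
    }
}
```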

3.3.6

Open Hypermedia

The repository has no direct integration with an open hypermedia system. Rather, we have developed a translator that can export relationships (see Section 4) which can be imported by an open hypermedia system and used to establish navigational links between software artifacts.

3.3.7

Summary

Having made these choices, our prototype’s repository meets the obligations defined in the framework. It provides users, integrators, and translators with access to its contents via the repository API. In addition, each repository element can be referenced using the repository’s two-level naming scheme. Furthermore, our repository has implemented the functionality specified for contexts, documents, relationships, and metadata. Finally, a translator exists that can export relationships to be used by an open hypermedia system as a first step towards supporting the open hypermedia layer specified by the InfiniTe framework.

3.4

Integrators and Translators

Integrators and translators are Java classes that must implement a simple interface consisting of only two operations: setParameters(Object[] params) and run(). We now provide an example of invoking an integrator using this interface. However, each step is equally applicable (with only negligible differences) to invoking a translator.

To invoke an integrator, a user selects from a list of available integrators. Once selected, the integrators servlet uses the name of the integrator to find an XML file stored in the integrators directory of the repository. Thus, if the name of an integrator is “KeywordIntegrator,” then the servlet looks for a file named “KeywordIntegrator.xml.” This file specifies the parameters required by the integrator. Examples of parameters include file and directory names, context names, and information specific to a task, such as a list of keywords for the keyword integrator.

This XML file is transformed into an HTML form using an XSLT stylesheet. Each type of parameter is mapped to an appropriate HTML input widget. Thus, a file parameter is transformed into an input field that is accompanied by a “Browse” button. This button brings up a dialog that allows a user to browse the file hierarchy to locate the desired file. Note that it is the Web browser that provides this browsing functionality. The stylesheet is simply leveraging the power of HTML forms to get the information required by the integrator. Once the user has entered the actual values for each parameter, the user submits the form for processing by the integrators servlet. (The stylesheet generates an HTML form that is preconfigured with a URL that sends the completed form to the integrators servlet.)

Now that the servlet has the parameters, it is ready to invoke the Java class which implements the integrator’s functionality. The integrator’s XML file specifies the Java class name to be used. This name is passed to Java’s Class.forName() method to dynamically load a class descriptor whose newInstance() method is called to retrieve an instance of the specified class. Once an instance is obtained, the servlet calls the setParameters() method, passing the submitted parameters via an object array. Then, the servlet calls the run() method and its job is complete. The integrator runs in a separate thread and performs its assigned task.

The advantage of this design is that it is straightforward to add integrators and translators to, and remove them from, the environment. Such changes can be made in an incremental and dynamic fashion without having to restart Tomcat or InfiniTe’s servlets. The steps to add an integrator are:

1. Add the integrator’s name to the repository configuration file. This ensures that the new integrator will appear in the list of available integrators.
2. Add the integrator’s XML file to the integrators directory. This file specifies the integrator’s parameters and the name of its associated Java class file.
3. Add the specified Java class file into a directory that is included in the environment’s Java class path. This ensures that Java’s Class.forName() method can load the class dynamically.

This mechanism provides rapid prototyping capabilities to the InfiniTe environment. It is straightforward to add an integrator that starts out simple, and incrementally add to its functionality by modifying its associated XML file or modifying and recompiling its Java source code and replacing its old class file with the newly generated one.
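The invocation steps above can be sketched as follows. The two operations of the interface come from the text; the interface name `Integrator`, the `EchoIntegrator` example class, and the `IntegratorLauncher` helper are hypothetical. Two further assumptions are labeled in the comments: the sketch uses `getDeclaredConstructor().newInstance()` because `Class.newInstance()` is deprecated in current Java releases, and it starts the worker thread in the launcher, since the text does not say where the prototype creates the thread.

```java
import java.util.Arrays;

interface Integrator {
    void setParameters(Object[] params);
    void run();
}

// Hypothetical integrator that only records the parameters it was given.
class EchoIntegrator implements Integrator {
    static volatile String lastRun;
    private Object[] params = new Object[0];

    public void setParameters(Object[] params) {
        this.params = params.clone();
    }

    public void run() {
        lastRun = Arrays.toString(params);
    }
}

class IntegratorLauncher {
    // Loads the named class dynamically and returns a fresh instance.
    static Integrator instantiate(String className) {
        try {
            Class<?> descriptor = Class.forName(className);
            // Modern replacement for the deprecated Class.newInstance().
            return (Integrator) descriptor.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException | ClassCastException e) {
            throw new IllegalArgumentException("cannot load integrator " + className, e);
        }
    }

    // Assumption: the caller starts the thread in which the integrator runs.
    static Thread launch(String className, Object[] params) {
        final Integrator integrator = instantiate(className);
        integrator.setParameters(params);
        Thread worker = new Thread(integrator::run, "integrator-" + className);
        worker.start();
        return worker;
    }

    // Convenience for experiments: launch and block until the run completes.
    static void launchAndWait(String className, Object[] params) {
        Thread worker = launch(className, params);
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}
```

Because the class is looked up by name at invocation time, a recompiled class file placed on the class path can be picked up without changing the launcher, subject to the class-loading behavior of the hosting servlet container.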

3.5

Future Plans

The prototype of the InfiniTe environment is in a functional state and has served as a foundation for three evaluation experiments that are described in Section 4. Now that we have completed these experiments, our future plans involve mainly adding new integrators and translators to 1) expand the types of software artifacts that the InfiniTe prototype can process and 2) expand InfiniTe’s relationship management and discovery capabilities. With respect to the prototype itself, we see two possible changes. The first change is to migrate the current repository implementation to a database to enable data and processing scalability. We have used relational databases in our past open hypermedia work [1] but, given the nature of our current implementation, we intend to explore the capabilities of XML databases. The second change is to explore the creation of InfiniTe clients that can access the environment outside of a Web browser. For instance, it is difficult to generate overview maps of the environment’s repository using HTML. We envision creating a map client that runs alongside a user’s web browser and augments their access into the environment’s information space.

4.

EVALUATION

We have performed three evaluation experiments of InfiniTe’s capabilities. We have not yet performed evaluations of InfiniTe’s user interface because our focus has been mainly on functional concerns and the user interface to the environment is by no means finalized. In this section, we briefly present each experiment and discuss the lessons learned.

4.1

Keyword Relationships

Our first experiment involves discovering relationships between text-based software artifacts through the use of keywords. The software artifacts chosen for this experiment were the source code files of the InfiniTe prototype itself. In addition to searching for keywords, we wanted to implement a complete “round-trip” scenario for the prototype involving the following steps: 1) Artifacts are translated into the environment and 2) processed by an integrator to produce a set of relationships. 3) These relationships are exported out of the environment and 4) imported into an open hypermedia system. Finally, 5) the original artifacts are viewed and the generated relationships appear as navigable hyperlinks courtesy of the open hypermedia system. This scenario touches all aspects of the InfiniTe framework and represents a “proof-of-concept” of the approach.

In particular, we developed a translator that takes as input a text file and converts it into an InfiniTe document stored in the repository. The translator stores a pointer back to the original artifact via metadata. After importing a subset of the prototype’s code into InfiniTe, a keyword integrator is used to search for keywords representing InfiniTe concepts (e.g., documents, contexts, integrators, etc.). The integrator generates a set of XPointers [20] that indicate the location of each keyword in each document. A second integrator is run to generate an index for the keyword search and a set of relationships that link each instance of a keyword to a listing in the generated index. In addition, “guided tour” relationships are generated such that a developer can traverse from one instance of a keyword to the next for each document included in the search. These relationships are stored as XLinks and directly include the XPointers generated by the first integrator.

A translator then exports the keyword index into an external text file and the relationships as an external XLink file. This translator uses the metadata stored by the first translator to ensure that the exported XLinks refer back to the original source code files. Next, an importer for the Chimera open hypermedia system [4] processes the exported XLinks and converts them into Chimera hypermedia links. Finally, a Chimera-integrated text editor is used to view the generated index and Chimera provides access to a set of links that allows developers to view how the source code files are related to each other via the various keywords.

The key difficulty in implementing this experiment was establishing a mapping function between the XML documents stored in the repository and the external software artifacts. This mapping function ensures that keywords found in the XML documents match up with the keywords found in the artifacts when viewing the relationships as open hypermedia links. Details of the mapping are documented in [3].

This experiment demonstrates the feasibility of the approach and also provides a useful function. We personally used the relationships generated between the source code files to keep track of module coherence (“Hey, that concept is not supposed to be referenced in that file!”) and found the round-trip processing time of a new version of the prototype to be fast enough (on the order of one to two minutes) to make use of these integrators and translators as we worked on the development of the prototype itself.
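The keyword index and the “guided tour” relationships can be sketched in plain Java. This is a simplified illustration, not the prototype's integrators: it records character offsets where the prototype generates XPointers, it uses case-sensitive substring matching, and the class `KeywordIndexSketch` with its nested `Hit` type is hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class KeywordIndexSketch {
    // One occurrence of a keyword; the offset stands in for an XPointer.
    static final class Hit {
        final String document;
        final int offset;

        Hit(String document, int offset) {
            this.document = document;
            this.offset = offset;
        }

        @Override
        public String toString() {
            return document + "@" + offset;
        }
    }

    // Builds keyword -> occurrences, in document order and then text order.
    static Map<String, List<Hit>> index(Map<String, String> documents, List<String> keywords) {
        Map<String, List<Hit>> index = new LinkedHashMap<>();
        for (String keyword : keywords) {
            if (keyword.isEmpty()) {
                continue; // an empty keyword would match everywhere
            }
            List<Hit> hits = new ArrayList<>();
            for (Map.Entry<String, String> doc : documents.entrySet()) {
                String text = doc.getValue();
                for (int at = text.indexOf(keyword); at >= 0; at = text.indexOf(keyword, at + 1)) {
                    hits.add(new Hit(doc.getKey(), at));
                }
            }
            index.put(keyword, hits);
        }
        return index;
    }

    // A guided tour links each occurrence to the next one.
    static List<Hit[]> guidedTour(List<Hit> hits) {
        List<Hit[]> tour = new ArrayList<>();
        for (int i = 0; i + 1 < hits.size(); i++) {
            tour.add(new Hit[] {hits.get(i), hits.get(i + 1)});
        }
        return tour;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("A.java", "context and context");
        docs.put("B.java", "no match; one context");
        List<String> keywords = new ArrayList<>();
        keywords.add("context");
        System.out.println(index(docs, keywords));
    }
}
```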

4.2

Project Evolution

The second experiment addresses the need to track relationships over the lifetime of a software development project. For this experiment, we downloaded the software artifacts of the Slashcode open source project. Slashcode is the software that runs the popular slashdot.org website. These artifacts consisted, for each version of the system, of a mailing list archive and a set of Perl code. Our first idea was to develop an integrator that searched for relationships between the source code and the mailing list. This idea fell through, however, when it was revealed that the mailing list has very little to do with the development of the Slashcode software! (The mailing list is primarily a place for users of the software to post installation and troubleshooting questions.)

However, since Slashcode is a system that has been under development for a number of years, there are a large number of releases (>10) available for analysis. As such, we developed a translator that translates each Perl source code file into InfiniTe. The translator does not preserve all of the information contained in the source code. Instead, it simply stores, for each file, its header information, the number of subroutines defined in that file, and a copy of each subroutine’s source code. Each version of the system is stored in a separate InfiniTe context and each such context is linked together using versioning relationships. We then developed an integrator that can traverse these contexts, comparing each one with its successor. It computes differences between versions, such as when a subroutine appeared or changed or when a module disappeared. A second translator processes this information to generate an HTML report of the differences between each version of Slashcode.

The benefits of such a report are almost immediately obvious. The picture that emerges from processing Slashcode is one of a stable system that changes very little from release to release. This makes sense since the Slashcode software had a high degree of “burn-in” running slashdot.org before the source code was released as an open source project.
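The comparison of successive versions can be illustrated with a short Java sketch. It assumes each release has already been reduced to a map from subroutine name to source text, which mirrors the information the translator is described as storing; the class `VersionDiffSketch`, the report wording, and the sample subroutine names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

class VersionDiffSketch {
    // Compares two releases, each given as subroutine name -> source text.
    static List<String> diff(Map<String, String> older, Map<String, String> newer) {
        List<String> report = new ArrayList<>();
        for (String name : new TreeSet<>(older.keySet())) {
            if (!newer.containsKey(name)) {
                report.add("removed " + name);
            } else if (!older.get(name).equals(newer.get(name))) {
                report.add("changed " + name);
            }
        }
        for (String name : new TreeSet<>(newer.keySet())) {
            if (!older.containsKey(name)) {
                report.add("added " + name);
            }
        }
        return report;
    }

    // Walks a chain of releases, comparing each one with its successor.
    static List<String> diffAll(List<Map<String, String>> releases) {
        List<String> report = new ArrayList<>();
        for (int i = 0; i + 1 < releases.size(); i++) {
            for (String line : diff(releases.get(i), releases.get(i + 1))) {
                report.add("release " + i + " -> " + (i + 1) + ": " + line);
            }
        }
        return report;
    }
}
```

A stable system of the kind described above would show up as a short or empty report for most pairs of adjacent releases.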

4.3

Linking of Code and Web Standards

The World Wide Web Consortium (W3C) is an organization devoted to the development of Web-based standards. Each W3C standard is published at using the XML version of HTML, known as XHTML. We were interested in whether open source tools claiming to implement W3C standards actually reference the standards within their code. To answer this question, we developed a set of translators and integrators to process W3C specifications and Java-based software tools. Our approach to this problem is divided into the following steps:

1. A W3C core specification is read by a translator and parsed for the URLs of any subdocuments. These URLs are stored in a document in an InfiniTe context created for this task.
2. The source code of a Java program claiming to implement the W3C specification is downloaded and stored in InfiniTe by a translator.
3. For each of the subdocuments identified in step 1, an integrator processes the specification to identify its key concepts. Key concepts are indicated by use of the tag in W3C specifications. The integrator creates an index document (in the same context created in step 1) to store these concepts for later processing.
4. A second integrator uses the index to process the entire W3C specification looking for the sections in which key concepts are defined and discussed. This information is stored in a third document as a set of XPointers and URLs that reference the W3C specification.
5. A third integrator uses the source code from step 2 to generate JavaDoc index references. This information is stored in a fourth document as a set of XPointers and URLs into the generated JavaDoc documentation.
6. Finally, a translator exports an HTML version of the index containing links to the specification and the JavaDoc documentation, using the information generated by the integrators in steps 3, 4, and 5.
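Step 3 of the list above, collecting key concepts from a specification, can be sketched with the standard DOM parser. The tag that marks key concepts is not preserved in this text, so the sketch takes the tag name as a parameter; the `b` tag used in the usage example, the sample markup, and the class `ConceptIndexSketch` are purely illustrative assumptions.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.SortedSet;
import java.util.TreeSet;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class ConceptIndexSketch {
    // Returns the distinct, sorted text of every element named conceptTag.
    static SortedSet<String> concepts(String xhtml, String conceptTag) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("specification is not well-formed XML", e);
        }
        SortedSet<String> terms = new TreeSet<>();
        NodeList marked = doc.getElementsByTagName(conceptTag);
        for (int i = 0; i < marked.getLength(); i++) {
            String term = marked.item(i).getTextContent().trim();
            if (!term.isEmpty()) {
                terms.add(term);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // Hypothetical fragment; "b" is a stand-in for the real concept tag.
        String sample = "<html><body><p>A <b>node</b> may carry an "
            + "<b>attribute</b>; every <b>node</b> has an owner.</p></body></html>";
        System.out.println(concepts(sample, "b"));
    }
}
```

A real specification carries a DOCTYPE declaration, which a default parser may try to fetch over the network; the sample fragment omits it to keep the sketch self-contained.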
It is now possible for a developer to traverse from the exported index to the W3C specification or the JavaDoc documentation to see how key concepts are defined and implemented for a particular specification and tool. An example of the output of this process is available at . This file was produced by processing the DOM specification [6] on the W3C website and the source code of the Xerces XML parser. Xerces is an open source tool developed by the Apache Foundation. With this generated index, it is possible to see where each key concept of the DOM specification is defined and how that concept is implemented in Xerces. We intend to provide this index to the Xerces development community and solicit their feedback with respect to the utility of this information. In addition, we also plan to process other standards-based open source tools once the work with Xerces is complete.

4.4

Summary

In total, the experiments combine to indicate that our approach to information integration provides utility to software developers. As a result, we intend to further develop the prototype and expand its capabilities to integrate information from a wide range of software tools and artifacts.

5.

RELATED WORK

We now briefly review several related systems. Systems included in the discussion below either employ similar techniques or address a similar problem domain.

5.1

GeoWorlds

GeoWorlds [22] is an information environment that allows users to retrieve documents from the Web, organize them into collections, and then analyze them in a variety of ways. GeoWorlds is strictly focused on the World Wide Web and can only import information from Web-based data sources. We intend to support both remote and local information sources, with particular attention to supporting legacy, third-party, data formats. This will allow our environment to be applied to both existing and new software development projects.
In addition, GeoWorlds services are focused more on information analysis while our focus is on relationship management. Our environment will thus have greater capabilities for discovering, viewing, and manipulating relationships than what is found in GeoWorlds.

We should note that Fig. 3 indicates that we have integrated InfiniTe with GeoWorlds. Indeed, we have developed a translator that can import the results of a GeoWorlds keyword search over a set of Web-based documents. These results are stored as an XML document that indicates which Web-based documents contained instances of the keywords that were specified in the search. We have also developed an integrator that can link these results into keyword searches performed by InfiniTe’s keyword integrator. This translator/integrator pair thus allows software developers to explore the Web using GeoWorlds searching for documents related to their development project and then see how those documents relate to the development project’s software artifacts. For example, it is possible using this method to relate a section of a requirements document that deals with security properties to a website that contains information on techniques that can help achieve the desired properties.

5.2

xlinkit

The second related system, xlinkit [17], is a link generation engine that allows consistency relationships to be specified over software artifacts. The basic idea is that a software engineer writes consistency rules for a set of documents and then submits those rules along with a set of documents. (Documents must be converted to XML before the link generation engine can process them.) The link generation engine then checks the documents to see if they follow the submitted consistency rules. As output, the engine generates a report that displays the results of the analysis.
Our environment can be used to track consistency relationships over software artifacts (assuming an integrator has been developed for this purpose), but it is also intended to support a broader spectrum of relationship types. For instance, we intend to build integrators that can aid the process of generating requirements traceability links, similar to the results we achieved with Northrop Grumman using only the Chimera open hypermedia system [1]. Rather than providing a rule-based language for a single relationship type, our environment will provide APIs to software engineers that will allow them to construct their own translators and integrators to manage the relationships relevant to their software development projects. However, rule-based languages are helpful in automatic link generation; indeed the experience with xlinkit demonstrates the benefits of this technique. In fact, we plan to leverage the results of the xlinkit experience, along with other work in hypermedia link generation, to create a generic rule-based integrator that can support various rule sets via a plug-in mechanism.

In addition, our use of open hypermedia will allow the relationships discovered in the environment to be viewable within the native editing context of the original software artifacts. Thus, while both of these systems require a translation step into XML, our approach will allow information to flow back to the original artifacts.

5.3

DOORS

DOORS is a requirements management tool used by over 1000 companies [7]. It is designed to capture, link, trace, analyze and manage a wide range of information to ensure a project’s compliance with specified requirements and standards. It is related to our approach in that it attempts to support a wide range of input and output formats (including Microsoft Word, RTF, Interleaf, Framemaker, text, spreadsheets and HTML) and it supports the manual creation of links between documents.
Our techniques go a step further to help automate the creation of links and to provide tools to manage their evolution. In addition, our environment provides an API for developers to create third-party translators that can be used to expand the range of data types supported by the InfiniTe environment.

5.4

LaSSIE

LaSSIE [5] is an information system that attempts to integrate architectural, conceptual, and code views of a large software system into a knowledge base for use by developers. The knowledge stored in the knowledge base is intended to serve as an index into a library of reusable components. LaSSIE’s information is accessed by users submitting queries to the knowledge base and interpreting the results. Our goals for InfiniTe are similar in spirit to the LaSSIE system. Our approach is different in that we do not rely on a knowledge base to store the information contained in our environment. Indeed, we view knowledge bases as another type of artifact in a software development project and would advocate the development of translators to extract information from a project’s knowledge bases to see if any relationships exist between that information and other project artifacts.

5.5

ViewPoints

ViewPoints [12] is a framework for addressing the problem of software development by multiple stakeholders each making use of different requirements techniques and notations. A viewpoint is defined as “a loosely-coupled, locally managed object encapsulating representation knowledge, development process knowledge and partial specification knowledge about a system and its domain.” [12] Thus, each requirements specification employed by a stakeholder is represented as a ViewPoint. ViewPoints provides a communication model to specify inter-ViewPoint relationships. Each relationship is governed by an inter-ViewPoint rule that describes the relationship at an abstract level and enables instances of that relationship to be checked for consistency, transformation of information, etc.
The types of relationships that can be specified include relationships between different development techniques, relationships between different tools, relationships between specification fragments, and protocols of interaction and behavior between stakeholders.

The ViewPoints framework is important because it provides excellent examples of the types of relationships that need to be handled in software development projects. Its focus on the requirements phase points to the need for InfiniTe to support relationships specific to a particular software development phase. Therefore, we intend to integrate explicit support for process into the InfiniTe environment in the future. One initial approach to addressing this problem is to allow users to define sets of integrators and translators specific to a particular phase and allow users to activate and deactivate sets as needed. Of course, membership of sets may overlap since some integrators and translators will be useful in all phases of software development.

6.

CONCLUSIONS

We conclude by calling on the software engineering community to consider the benefits of producing tools that can export information concerning their relationships. All such relationships can potentially be incorporated into an information integration environment using the techniques described in this paper. As the types of information stored in the environment grow, the ability of software engineers to gain a global picture of a software development project increases. We intend to pursue the work described in this paper to increase the sophistication of InfiniTe’s relationship management tools as well as to provide better support for each process described in the Introduction. Indeed, one avenue we are pursuing is to integrate an integrated development environment (IDE) into InfiniTe such that the invocation of translators and integrators is hidden from the developer.
Thus, for instance, as a developer is writing a software module, InfiniTe may be performing keyword searches on it in the background and dynamically linking it to related software artifacts. Since IDEs are an example of a common software tool, such an integration can bring information integration services directly to a developer’s fingertips.

7.

ACKNOWLEDGMENTS

This material is based upon work sponsored by the NSF under Award Number CCR-99-88517.

8.

REFERENCES

[1] K. M. Anderson. Issues of data scalability in open hypermedia systems. The New Review of Hypermedia and Multimedia, 5:151–178, December 1999.
[2] K. M. Anderson. Supporting industrial hyperwebs: Lessons in scalability. In Proc. of the 21st Int’l Conf. on Software Engineering, pages 573–582, May 1999.
[3] K. M. Anderson and S. A. Sherba. Using open hypermedia to support information integration. In Proc. of the 7th Int’l Workshop on Open Hypermedia Systems, August 2001.
[4] K. M. Anderson, R. N. Taylor, and E. J. Whitehead, Jr. Chimera: Hypermedia for heterogeneous software development environments. ACM Trans. on Information Systems, 18(3):211–245, July 2000.
[5] P. Devanbu, R. Brachman, P. Selfridge, and B. Ballard. LaSSIE: A knowledge-based software information system. Communications of the ACM, 34(5):34–49, May 1991.
[6] Document object model (DOM) level 2 core specification.
[7] Telelogic DOORS.
[8] F. P. Brooks, Jr. No silver bullet - essence and accident in software engineering. In Proc. of the IFIP 10th World Computing Conference, pages 1069–1076, 1986.
[9] K. Grønbæk. Composites in a dexter-based hypermedia framework. In Proc. of the 6th ACM Conf. on Hypertext, pages 59–69, September 1994.
[10] J. D. Herbsleb, A. Mockus, T. A. Finholt, and R. E. Grinter. An empirical study of global software development: Distance and speed. In Proc. of the 23rd Int’l Conf. on Software Engineering, pages 81–90, May 2001.
[11] Hypertext transfer protocol – HTTP/1.1.
[12] B. Nuseibeh, J. Kramer, and A. Finkelstein. Expressing the relationships between multiple views in requirements specification. In Proc. of the 15th Int’l Conf. on Software Engineering, pages 187–196, May 1993.
[13] R. Orfali, D. Harkey, and J. Edwards. The Essential Distributed Objects Survival Guide. John Wiley & Sons, Inc., 1996.
[14] K. Østerbye and U. K. Wiil. The flag taxonomy of open hypermedia systems. In Proc. of the 7th ACM Conf. on Hypertext, pages 129–139, March 1996.
[15] The Jakarta Site - Jakarta Tomcat.
[16] XML linking language (XLink) version 1.0.
[17] xlinkit.com - link generation engine.
[18] Extensible markup language (XML) 1.0 (second edition).
[19] XML path language (XPath) version 1.0.
[20] XML pointer language (XPointer) version 1.0.
[21] XSL transformations (XSLT).
[22] K. Yao, I. Ko, R. Eleish, and R. Neches. Asynchronous information space analysis architecture using content and structure-based service brokering. In Proc. of the 2000 ACM Conf. on Digital Libraries, pages 133–142, May 2000.
