Improving Reverse Engineering Through the Use of Multiple Knowledge Sources

P.J. Layzell, M.J. Freeman, P. Benedusi¹
University of Manchester Institute of Science and Technology, PO Box 88, Manchester M60 1QD, United Kingdom. Tel: 0161 200 3331

Abstract

With the growing awareness of the importance of software maintenance has come a re-evaluation of software maintenance tools. Such tools range from source code analysers to semi-intelligent tools which seek to reconstruct system designs and specification documents from source code. However, it is clear that relying solely upon source code as a basis for reverse engineering has many problems. These problems include poor abstraction, which leads to over-detailed specification models, and the inability to link other parts of a software system, such as documentation and user expertise, to the underlying code. This paper describes the work of the Esprit DOCKET project, which has developed a prototype environment to support the development of a system model linking user-oriented, business aspects of a system to operational code, using a variety of knowledge source inputs: code, documents and user expertise. The aim is to provide a coherent model to form the basis for system understanding and to support the software change and evolution process.

¹ Address: CRIAI, Piazzale Enrico Fermi, 80055 Portici, Naples, Italy.

1. Introduction

It is widely recognised that the software industry has entered a new phase in which a major proportion of IT resources are being directed towards the maintenance, enhancement and support of existing systems, with general agreement that over 50% of software costs relate to maintenance-type activities [Parikh & Zvegintzov, 1983], and in some cases as high as 85%. A number of tools are commercially available which assist maintainers to support large, commercial applications; however, these primarily focus upon a syntactic analysis of source code and hence fail to adequately address the semantic content of such representations [Layzell & Macaulay, 1990]. The result is that current maintenance support tools offer little apparent benefit to software maintainers. This paper describes the work undertaken by the Esprit DOCKET² project, which aims to build upon the current research and tool base and provide an intelligent reverse engineering toolset, capable of addressing semantic-related issues. This is achieved by supplementing existing source code analysis techniques with document analysis and end-user interaction, enabling the effective interpretation of source code. DOCKET uses a repository to contain interrelated system information, from business requirements through to the implementation of those requirements in source code and supporting documentation. A populated repository thus provides the means to improve the output of existing reverse engineering tools and enables software maintainers to accept change requests expressed at any level (business requirement, system specification or implementation) and map them to the appropriate level.

2. Background

Much of the motivation for the DOCKET project was derived from a published study [Layzell & Macaulay, 1990] which made an in-depth analysis of a number of industrial organisations and their practices and perceptions of software maintenance. The aim of the study was to identify how management perceived the software maintenance process and what their attitude was towards tool support of that process. The study identified a number of interesting themes outside the scope of this paper, including the perception and image of software maintenance and the changing nature of the relationship between IT departments and their host organisations. Two issues were of specific relevance, however. First, some support/maintenance departments believed that users needed a clearer understanding of the link between changing business requirements and the consequent need for software maintenance. Secondly, there was considerable sympathy for reducing the divide between the processes of development and support, either through complete integration of the functions or through closer cooperation. This implies a holistic view of the entire IT function and requires modification to existing methods, procedures and tools to recognise this integration. In spite of the comprehensive range of maintenance tools currently available, none of the organisations studied used them or similar tools. Some support teams had investigated the use of source code analysers and reverse engineering tools, but these have had little impact on the maintenance process. This was primarily due to a perceived lack of intelligence in the tools, although commonly this related to the inability of source code analysers to address many of the semantic issues involved in reverse engineering.

² DOCKET stands for the Document and Code Knowledge Elicitation Toolset.

3. The Organisation-IT Relationship

The conclusion from the original software maintenance study was the identification of an artificial divide between an organisation and its IT function, in particular in its support (or maintenance³) activity. Even in the more successful organisations studied, where an integrated model of IT was adopted, it was clear that explicit and long-lasting relationships between the organisation's operational domain and knowledge about IT systems were neither held by the organisation nor maintained as each evolved. The particular problem identified by the study was that domain knowledge, IT system knowledge and the relationship between the two were needed to carry out effective maintenance, and that contemporary maintenance tools lacked this. Most organisations studied recognised that the maintenance process has to understand complex relationships between organisational structures, business requirements, company policies and practices, and operational systems. So emerged the requirement for an integrated domain and system knowledge base to maintain these descriptions and their relationships. Such a knowledge base could then be used:

• to provide intelligence for conventional software maintenance tools
• to act as a repository which can be queried by new or inexperienced maintenance staff
• to act as a source of knowledge which can be consulted on an open learning basis by new staff in order to orientate themselves to the organisation's business and thus achieve higher staff motivation
• to enable third party contractors to provide a better service.

As part of the startup of the DOCKET project in 1990, many of the issues raised by the original study were revisited, only to be confirmed. Indeed, with the advent of enterprise modelling and business process re-engineering, the scale of the problem appeared more widespread as organisations expected a greater degree of cooperation between business and IT units. Whilst some work has been conducted in developing more reliable reverse engineering tools, which explore the relationship between organisation and IT system and output models at higher levels of abstraction (e.g. [Karakostas, 1992], [Avellis et al, 1991], [Sneed et al, 1988]), they do not recognise that knowledge sources other than source code can play an effective role in the reverse engineering process. The result of this situation is typified by one of the DOCKET partners, who is responsible for systems supporting fiscal control. Such systems are typically driven by governmental fiscal law and codes of practice developed with finance ministries. The scale of the maintenance problem is quantified by an operational system of 30 million lines of COBOL and C code, processing 1.6 million transactions per day from 1,800 office terminals. This system requires approximately 40 maintenance changes per day. Unfortunately, such changes are often expressed in highly complex legal language, with the resulting, significant overhead that changes must be mapped through a series of representations and understandings from the application domain to the technical delivery environment. Part of the aim of the DOCKET project has been to map documentation onto operational systems, so that when changes to domain documentation occur, the software which may need to be changed can be readily identified.

4. The DOCKET Approach

The approach of the DOCKET project was to identify the need to improve the intelligence of reverse engineering tools to enable more effective use of computer-based support in the software maintenance process. However, unlike contemporary tools which aim at producing more intelligent outputs from existing source code analysis (e.g. [Benedusi et al, 1990]), DOCKET adopted the approach of enriching the inputs to the process.

³ The term support was used by many organisations in order to present a more positive image than that engendered by the term maintenance.

Project Objectives

The objective of the DOCKET project was to research and develop ways in which intelligence can be added to the software maintenance process and its supporting tools. The hypothesis of the project was that such intelligence can be added by the construction of a system knowledge base which is derived from a variety of sources, including source code, documentation and interaction with system users. Within this overall aim, the following detailed objectives can be identified:

a. to demonstrate the feasibility of integrating existing reverse engineering output with higher-level software engineering and knowledge-based representations, through explicit linking of the different levels of representation
b. to investigate and develop text analysis tools which enable the animation of system documentation and its structural analysis in order to generate a user-oriented view of a system
c. based upon the feasibility of (a) and (b), to produce a method for maintenance knowledge acquisition, partially supported by computer-based tools
d. to provide a system knowledge base which permits change requests to be expressed at any level of representation (business, specification, design, implementation or interface), with the ability to trace them through to the appropriate representational level
e. to provide a system which captures system and domain knowledge as part of a natural software production process, so as to minimise the overhead of capturing such knowledge for use in the software maintenance process, thus enabling the easier introduction of new technology such as methods, techniques and CASE tools
f. to evaluate the overall approach in a live application development environment.

It was recognised that many technical solutions fail because of the overhead cost that must be incurred by their use, and so the project sought to develop knowledge acquisition tools which form an integral part of the software maintenance and understanding process and thereby minimise the cost of using such tools.

The DOCKET Hypothesis

The basic DOCKET hypothesis was that maintainers need to understand software applications in the large: that is, to be able to relate operational source code to clear statements of business objective, purpose and process. This can be achieved by improving the reliability and quality of output from the reverse engineering process; by providing an integrated view of system components; and by introducing knowledge with higher levels of abstraction, whilst recognising the unreliability of knowledge sources other than source code. The system knowledge sources on which DOCKET draws are: source code; test cases; documentation (including user manuals, programming documentation, specifications, designs, change requests, bug reports etc.); and human experts (including system users, business analysts and IT system experts). The analysis of this knowledge involves the use of meta-knowledge: programming language grammars, code clichés and patterns, text heuristics and business domain knowledge. The integration of such knowledge provides a resource for wide-system understanding which will facilitate system-scale maintenance through the provision of knowledge at different levels of abstraction, with the ability to walk through and between views from multiple knowledge sources. This is of particular importance to those organisations whose software systems are large and have little or no structure, either because they were developed before the introduction of structured techniques, or because prolonged maintenance has resulted in degradation of structure. Documentation is often inadequate, out-of-date or nonexistent; some large systems currently in use are more than 20 years old, so that personnel involved in the original design of the system have moved on; indeed, there may have been several turnovers in staff since the system was introduced.

DOCKET Knowledge Sources

In order to provide the wider understanding of a software system, DOCKET identified a number of potential knowledge sources and classified them by their level of linguistic representation and formality. This latter classification facet is particularly important since it must be recognised that human experts may be mistaken and documentation can be notoriously out of date. Three general categories of knowledge source were defined, as follows.

Formal Knowledge Sources

Formal knowledge refers to documentation-based descriptions represented in a formal, machine-processable language. Whilst natural language documents may fall into this category, a pragmatic distinction is made between source code programs written using well defined languages and natural language documents whose semantics are more difficult to comprehend. Formal knowledge sources thus include source code (excluding program comments), structured designs and specification documents represented in notations such as ER or DFD models, together with any formalised representation of test cases. Formal knowledge typically contributes to the understanding of a system's implementation and design limitations, and its analysis can provide the basic information for knowledge acquisition about an operational software system. It is in effect a set of information extraction processes, consisting of static analysis of sources (object code analysis is excluded) and testing (including object code). The output of these processes is a language-independent, low-level representation of the information extracted from the operational software system. This means that the information extractors also perform a first-level abstraction from the original sources, whose goal is to preserve the original information content while decoupling all subsequent analysis tools within the DOCKET toolset from the syntactic details of the sources. This enables existing source code analysers to be employed within the toolset, provided that a translator is written to convert to the intermediate representation. Once processed, the intermediate representations of a source program can be further analysed by an information abstraction process which, by combining the results of low-level source code analysis and additional knowledge, aims to reconstruct the following typology of targets:

• representations of software which can be used directly by the maintainer, and possibly compared with artefacts of the forward engineering process; among them, possible targets are skeletons of the formal documents of widespread software design and development methodologies, such as JSP or Warnier diagrams, data flow diagrams and structure charts
• representations of the information extracted from code at a level of aggregation, synthesis and abstraction such that the population of DOCKET's knowledge base and the integration with other sources of knowledge would be facilitated.

In first-generation reverse engineering tools, the issues of reconstructing relevant links between software components and the selection and application of appropriate information hiding, aggregation and encapsulation techniques are addressed only mechanically, and any choice is made on the basis of standard, implicit and generic default assumptions. Even many of the existing knowledge-based prototypes tend to search in the code for occurrences of very general programming clichés and concepts. The DOCKET goals include second-generation intelligent reverse engineering tools, which drive and refine the production of abstractions, as well as the search for occurrences of specific concepts in the operational system, based on multiple sources of information which are specific to the user and application domain.
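For illustration, such a language-independent representation might hold extracted facts as node and edge tables, as in the C sketch below; the type names and categories are assumptions made for this example, not the project's actual CSI scheme (introduced in section 5).

```c
#include <stdio.h>

/* Hypothetical language-independent representation of extracted code
 * facts: a graph held as two flat tables, one for nodes (program
 * units, data items) and one for edges (calls, uses). The categories
 * are illustrative; the project's actual CSI scheme is richer. */

typedef enum { NODE_PROGRAM, NODE_PARAGRAPH, NODE_DATA_ITEM } NodeKind;
typedef enum { EDGE_CALLS, EDGE_READS, EDGE_WRITES } EdgeKind;

typedef struct { int id; NodeKind kind; const char *name; } Node;
typedef struct { EdgeKind kind; int from; int to; } Edge;

/* Facts as a COBOL unit-level analyser might emit them, already
 * decoupled from COBOL syntax (names invented for the example). */
static const Node nodes[] = {
    { 0, NODE_PROGRAM,   "UPDCUST"  },
    { 1, NODE_PARAGRAPH, "READ-REC" },
    { 2, NODE_DATA_ITEM, "CUST-REC" },
};
static const Edge edges[] = {
    { EDGE_CALLS, 0, 1 },   /* UPDCUST performs READ-REC */
    { EDGE_READS, 1, 2 },   /* READ-REC reads CUST-REC   */
};

int main(void)
{
    /* Later analysers consume only these tables, never the source. */
    for (size_t i = 0; i < sizeof edges / sizeof edges[0]; i++)
        printf("%s -> %s (kind %d)\n",
               nodes[edges[i].from].name, nodes[edges[i].to].name,
               edges[i].kind);
    return 0;
}
```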

Informal Knowledge Sources

Informal knowledge is defined in DOCKET to be anything represented in a semi-structured or informal notation, other than directly from humans, such as system documentation, domain documentation and change requests. This gives the source certain characteristics in terms of consistency, although not necessarily accuracy, and as such it can be distinguished from human experts. The knowledge obtainable from domain documentation, such as a shareholders' report, is of a general nature and relates to the domain in which the computer system being analysed by DOCKET operates, rather than to the system itself. System documentation, such as system and program specifications or user guides, on the other hand, relates to the design and implementation aspects of a computerised system and can therefore form the basis of a more formalised view of the internal aspects of a system. Finally, the knowledge obtainable from change requests relates to alterations requested by the users of a system. These could arise either because the system is operating in a manner other than that intended or because of requested enhancements or alterations to the system.

It was established at an early stage of the project that a complete syntactic and semantic analysis of text was not feasible, given the quantities of documentation to be processed and the current state of the art in computational linguistics. Short-cuts which still achieve a sufficient degree of understanding have to be used in the analysis process to locate the key concepts and events. Thus, there is a critical distinction to be made between the text processing approach taken by DOCKET and the more detailed analysis techniques seen in most natural language processing and understanding systems. A UMIST study into the way in which maintainers worked revealed a number of different attitudes towards the usefulness of documentation to the maintenance process and identified some of the facilities that are desirable in an automated documentation system. The study showed that maintainers, whilst relying upon source code for system understanding, would make use of documentation if the search space of the documentation could be reduced, by making the documentation available on-line and by maintaining explicit links to aspects of an operational system.

Dynamic Knowledge Sources

The term dynamic knowledge was used to relate to system knowledge obtained directly from humans. Within this classification, a number of subdivisions exist: domain experts, system IT experts and user experts.
The knowledge obtainable from a domain expert, such as an administrator, concerns general aspects of the domain and is likely to contribute to the higher-level understanding of a system, such as its goals, its general components and how it relates to the business. Thus in a taxation system, such a domain expert might be a tax lawyer or a ministry of finance official. Knowledge obtainable from a system IT expert, such as a systems analyst or a system designer, concerns the design and implementation aspects of a computer-based system. In essence, such knowledge provides an insight into the design decisions and software structuring that may have been employed during system construction, even if the expert was not involved in the original construction of the system. The knowledge obtainable from user experts, such as clerical staff, concerns the user's view of the use and implementation of a system. As such it is likely to contribute to the knowledge about the overall functionality of a system, its interface and its use. It also particularly highlights links between operational software and organisational need.

5. The DOCKET Architecture

Figure 1 shows a conceptual view of the DOCKET system architecture, complemented by a method of use described in section 7. The architecture consists of three main phases: importation, extraction and abstraction, and integration.

Importation Phase

In the importation phase, raw knowledge about the system from original knowledge sources (termed external sources in DOCKET) is processed into internal, machine-readable forms using input tools.

Formal Source Capture

In the case of formal sources, capture is through a set of source code analysers. In the DOCKET project, analysers for COBOL and C are supported, which convert code units to a language-independent representation scheme called CSI, representing information in graphs and tables. The CSI language also provides a basis for the batch import and export of knowledge to and from CASE tools.

Informal Source Capture

For documents, industry-standard scanning and text conversion tools are used, followed by a tool to convert text to the standard markup language, SGML [Goldfarb, 1989]. The markup tool is written in C and runs in a PC environment.

Dynamic Source Capture

It is widely reported that people find it hard to know what they know. DOCKET therefore needed to create the role of a knowledge facilitator to acquire knowledge from the experts and to coordinate the input and representation of the knowledge within DOCKET. The knowledge facilitator acquires knowledge by conducting informal interviews with experts, as a means of acquiring overviews of the domain and the system, and by acquiring specific knowledge from experts. The gathered facts are then formalised through a set of tools supporting graphical hierarchy and network models. These models include: organisation hierarchies, goal hierarchies, activity and task hierarchies, dialogue graphs and conceptual graphs [Sowa, 1984]. The model editors are implemented on a Sun/4 workstation using a mixture of C and Prolog, with Motif handling interface issues. C was selected to implement the main model editor operations, which output Prolog predicates representing edited models. Additional Prolog statements are then used to automatically verify the models, with C handling any error reporting. Techniques of card sorting were found to be particularly helpful in identifying key system and domain concepts which could then be used to facilitate document and text analysis [Burton and Shadbolt, 1987].
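The paper states only that the editors are written in C and emit Prolog predicates for verification by Prolog rules. A minimal sketch of that hand-off, with assumed predicate names (goal/1, belongs_to/2, subgoal_of/2), might look as follows.

```c
#include <stdio.h>

/* Hypothetical emitter: writes an edited goal hierarchy out as
 * Prolog facts, so that separate Prolog rules can verify the model.
 * The predicate names are illustrative assumptions. */

static void emit_goal(FILE *out, const char *id, const char *unit)
{
    fprintf(out, "goal(%s).\nbelongs_to(%s, %s).\n", id, id, unit);
}

static void emit_subgoal(FILE *out, const char *child, const char *parent)
{
    fprintf(out, "subgoal_of(%s, %s).\n", child, parent);
}

int main(void)
{
    /* A two-level goal hierarchy from a (hypothetical) editing session. */
    emit_goal(stdout, "g2", "u3");
    emit_goal(stdout, "g4", "u1");
    emit_subgoal(stdout, "g4", "g2");
    return 0;
}
```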

Extraction and Abstraction Phase

In the extraction and abstraction phase, analysis tools produce an abstracted, structured form of the essential knowledge from each source. The term structured form is used to distinguish between how a knowledge source represents something and what it is actually representing, i.e. a distinction is made between form and content.

Formal Source Analysis

For source code, the structured form is an enhanced form of the CSI language, which contains higher-level concepts beyond the immediate source representation. A system-level abstraction subsystem produces the basic system-level representation of the operational software, starting from the forms produced by unit-level analysers. Further abstractions are the system graphs, which are built by the corresponding tool. Filtering and enrichment tools improve the production of reverse engineering models, both by enriching their contents and the construction process, and by making them more effective through the addition of powerful information discrimination functions. The filtering tool is supported by technical domain knowledge about the influences of the operating environment, DBMS, TP monitor, user interface sublanguages and utility libraries on software composition. This makes it possible to hide application-independent aspects of the system, and reduces the scope of search for application-specific aspects, which provide better candidates for integration with other knowledge sources and for the population of the global system model. The enrichment tool combines knowledge from test cases with the model produced by static analysis. This may enhance the model with further interrelationships between the components, and contribute to the functional interpretation and semantic assignment of software structures.
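As a rough illustration of the filtering idea, the sketch below hides components whose names carry environment-supplied prefixes; the prefix table is an invented stand-in for the technical domain knowledge described above.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical filter: technical domain knowledge reduced to a table
 * of name prefixes contributed by the TP monitor, DBMS and utility
 * libraries. Components matching a prefix are treated as
 * application-independent and hidden from later integration steps. */

static const char *env_prefixes[] = { "CICS", "DB2", "UTIL", NULL };

static int application_specific(const char *component)
{
    for (int i = 0; env_prefixes[i] != NULL; i++)
        if (strncmp(component, env_prefixes[i],
                    strlen(env_prefixes[i])) == 0)
            return 0;                 /* environment-supplied: hide */
    return 1;                         /* candidate for integration  */
}

int main(void)
{
    const char *components[] = { "CICSSEND", "UPDCUST", "UTILSORT" };
    for (int i = 0; i < 3; i++)
        if (application_specific(components[i]))
            printf("keep %s\n", components[i]);
    return 0;
}
```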

Informal Source Analysis

Central to the processing of informal sources is the ability to handle natural language. Bearing in mind the requirement that short-cuts need to be made in the analysis process, there were a number of different techniques that could be used, including those based around sublanguage [Grishman et al, 1986] and the text-skimming ideas employed by automatic indexing and abstracting systems. Although the use of pure sublanguage techniques in the detailed form described in the literature was found to be inappropriate for use in DOCKET, a system using a combination of some of the ideas of sublanguage and of text skimming is able to provide a solution to the problem. A set of heuristics, based on indicator phrases and textual format, is used which allows the key features of a text to be located by pattern matching followed by a very restricted syntactic analysis, rather than requiring the complete syntactic and semantic analysis typically associated with NLU systems [Paice, 1981]. These heuristics include the following items:

• the location and marking of headings and subheadings, which often introduce new concepts and events
• the examination of text rendered in particular typographical styles such as boldface, italics and special fonts, since these are often used to highlight important items
• the location and marking of lists of items: these may occur in several forms and can provide information about both the structure and the meaning of concepts; they can be easily identified from format clues and often have a preceding explanatory sentence
• the identification of concept definitions, which are often introduced by indicator phrases such as this is defined as… or a primary objective is…
• the identification of events, which are often introduced by indicator words such as if and when
• more detailed examination of sentences mentioning an already identified key concept, since such sentences are often used to introduce a new key concept.
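A minimal sketch of the indicator-phrase style of heuristic is given below, assuming a small hand-built phrase table; the real DOCKET tools combined such matches with format clues and a restricted syntactic analysis rather than plain substring search.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical indicator-phrase scan: sentences containing one of the
 * phrases are flagged as likely concept definitions or events, without
 * any full syntactic or semantic analysis. */

static const struct { const char *phrase; const char *tag; } indicators[] = {
    { "is defined as",          "concept-definition" },
    { "a primary objective is", "concept-definition" },
    { "when ",                  "event"              },
    { "if ",                    "event"              },
};

int main(void)
{
    const char *sentences[] = {
        "A taxable person is defined as any registered trader.",
        "A surcharge notice is issued when a return is late.",
        "The office is in Portici.",
    };
    for (int s = 0; s < 3; s++)
        for (size_t i = 0; i < sizeof indicators / sizeof indicators[0]; i++)
            if (strstr(sentences[s], indicators[i].phrase))
                printf("[%s] %s\n", indicators[i].tag, sentences[s]);
    return 0;
}
```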

Due to the complex, time-consuming nature of the process of extracting information from textual sources, as well as the unrestricted nature of its inputs, the text analysis element of DOCKET is less able to contribute a comprehensive model of a software system than source code analysis or dynamic source analysis. The analysis approach taken is essentially to locate and mark significant concepts and events in the text and to provide a list of these to the other DOCKET processing streams in support of their activities. Full linguistic analysis is recognised to be a major research task in its own right and well beyond the scope of the DOCKET project. A fuller discussion of some of the linguistic processing issues can be found in [Black et al, 1992].

Dynamic Source Analysis

For human expert representations, the structured form consists of all input models converted to a set of standard conceptual graphs.
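For illustration, a conceptual graph in the Sowa style can be held as concept nodes joined through named relation nodes; the encoding and the example graph below ("a clerk processes a return") are assumptions made for this sketch.

```c
#include <stdio.h>

/* Hypothetical encoding of a conceptual graph [Sowa, 1984]: concept
 * nodes joined through named relation nodes. The graph below renders
 * "a clerk processes a return":
 *   [PROCESS] -(AGNT)-> [CLERK], [PROCESS] -(OBJ)-> [RETURN]. */

typedef struct { const char *type; } Concept;
typedef struct { const char *name; int from; int to; } Relation;

static const Concept concepts[] = { {"PROCESS"}, {"CLERK"}, {"RETURN"} };
static const Relation relations[] = {
    { "AGNT", 0, 1 },   /* the agent of PROCESS is CLERK   */
    { "OBJ",  0, 2 },   /* the object of PROCESS is RETURN */
};

int main(void)
{
    for (size_t i = 0; i < sizeof relations / sizeof relations[0]; i++)
        printf("[%s] -(%s)-> [%s]\n",
               concepts[relations[i].from].type, relations[i].name,
               concepts[relations[i].to].type);
    return 0;
}
```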

[Figure 1: The DOCKET Conceptual Architecture — a three-phase pipeline (importation; extraction and abstraction; integration) in which input tools convert external forms (formal, informal and dynamic sources) into internal forms, analyser tools produce structured forms, and a consolidation and resolution tool populates the global system model; a model administration tool, concept thesaurus, browser tool and register of sources support the DOCKET analyst and end user, with batch conversion links to external reverse engineering and CASE tools.]

Integration Phase

In the integration phase, a consolidation and resolution tool integrates concepts from the different structured forms to populate a global system model, whilst resolving any conflicts arising in the knowledge supplied from the different structured forms. This latter process is not fully automated; intervention is possible by a DOCKET analyst via a model administration tool.

Other Architectural Features

At the heart of the architecture is the global system model, which incorporates and links concepts relevant to the system at different levels of abstraction, from an organisational view of the system down to an implementation view. This is described in more detail in the next section. A browser allows ad hoc access to the knowledge in the global system model and navigation through it by means of a hypertext-like facility, whilst a concept thesaurus provides a centralised list of all known concepts.

6. Knowledge Representation Issues

The Global System Model

At the heart of the DOCKET architecture lies a modelling formalism capable of representing any system description, from its implementation view through to a high-level organisational view. Analysis of existing work revealed that four levels are identifiable [Black et al, 1987; van Griethuysen, 1982; Olle et al, 1988]. The world level represents a real-world view of an information system, identifying the long-term objectives of an organisation. This level also contains a description of the functional areas of the organisation and maps these to its structure and human resources. External entities which influence the organisation are included, capturing their effect on the organisation's activities. The conceptual level represents an implementation-independent, abstract view of an information system. It contains concepts capable of representing the three different information system perspectives of activity, data modelling and behavioural modelling. The design level represents a pure functional decomposition of the target, computer-based information system, including user interface aspects. The implementation level represents a physical view of the implemented software system. It is included in the global system model in order to allow DOCKET users to reason about the physical system components and the relationship of these components to the higher levels of representation.

Existing information system modelling schemata were analysed and a set of common concepts was identified covering the four levels. This work formed the basis of the development of a global system model which would form the core of the DOCKET repository. The global system model is key to the integration aspects of the DOCKET architecture and can be regarded as a multi-tiered meta-model capable of representing aspects of an information system in a fashion independent of the original notation. It can maintain links between the different abstract levels of description of a system, so that an explicit link can be made between, say, a data file, the conceptual-level entity that the file implements and an external institution. Links can also be made to the knowledge source origins of an element of the global system model, thereby providing traceability to the knowledge source. Figure 2 provides an entity-relationship model summary of the model's concepts and relationships.

The global system model is implemented in the DOCKET architecture using SML (System Modelling Language), based upon the Conceptual Modelling Language produced by the Esprit DAIDA project. It is an object-centred language which supports the organisation of knowledge into general classes (concepts) and particular instances of classes (objects). Like similar languages, it supports the notions of classification, specialisation and aggregation.
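SML itself is not detailed in this paper, but an object-centred scheme with classification, specialisation and cross-level implemented_by links might be approximated as in the following sketch; all names and fields are illustrative assumptions.

```c
#include <stdio.h>

/* Hypothetical object-centred encoding of global system model
 * elements: each object names its class (classification), an optional
 * more general object (specialisation) and the lower-level object
 * that implements it (the implemented_by abstraction path). */

typedef struct GsmObject {
    const char *class_name;                  /* e.g. ENTITY, DATA STRUCTURE */
    const char *name;
    const struct GsmObject *is_a;            /* specialisation link  */
    const struct GsmObject *implemented_by;  /* abstraction path     */
} GsmObject;

static const GsmObject cust_dat = { "DATA STRUCTURE", "CUST.DAT", 0, 0 };
static const GsmObject customer = { "ENTITY", "customer", 0, &cust_dat };

int main(void)
{
    /* Walk down the abstraction path from the conceptual level. */
    for (const GsmObject *o = &customer; o; o = o->implemented_by)
        printf("%s (%s)\n", o->name, o->class_name);
    return 0;
}
```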

[Figure 2: The Global System Model — an entity-relationship summary in which world-level concepts (ORGANISATION UNIT, GOAL, USER GOAL, ACTIVITY, RESOURCE, AGENT, USER, EXTERNAL INSTITUTION, EXTERNAL EVENT) are related through conceptual-level concepts (ENTITY, ATTRIBUTE, RELATIONSHIP, OBJECT, PROCESS, ACTION, STATE, INTERNAL EVENT) down to design- and implementation-level concepts (FUNCTION, DIALOGUE FUNCTION, DIALOGUE STRUCTURE FUNCTION, USER INTERFACE OBJECT, MODULE, DATA, DATA STRUCTURE) by relationships such as implements/implemented_by, triggers/triggered_by, contains/contained_in and calls/called_by.]

Concept Hierarchy

In addition to the main global system meta-model, a concept abstraction hierarchy (figure 3) was produced to allow for the partial classification of concepts. The concept abstraction hierarchy contains all the global system model concepts as leaf nodes on the tree. The concepts are then abstracted to higher-level, more general concepts. This additional structuring allows DOCKET analysis tools to start to insert concepts into the global system model even when a full classification of a concept is not possible. This is particularly the case with text analysis tools, which, without full semantic analysis, cannot easily classify the term book without reference to other knowledge sources (is book a verb, i.e. a process, or a noun, i.e. an object?). The concept hierarchy also plays an important role, together with a set of permitted relations between concepts, in validating the contents of conceptual graphs generated by the dynamic source analysis tools.
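A sketch of partial classification under the hierarchy follows, assuming the hierarchy is held as (node, parent) pairs: a term that cannot yet be fully classified is parked at the most general node known to cover it, to be refined later against other knowledge sources.

```c
#include <stdio.h>

/* Hypothetical partial classification: the hierarchy is a table of
 * (node, parent) pairs; a new term is inserted at whatever level the
 * analysing tool can justify, defaulting to the root CONCEPT when no
 * fuller classification is possible (e.g. "book", which may turn out
 * to be a PROCESS or an OBJECT). */

struct entry { const char *node; const char *parent; };
static struct entry hierarchy[16] = {
    { "PHENOMENON", "CONCEPT" },
    { "ARTIFACT",   "CONCEPT" },
    { "PROCESS",    "PHENOMENON" },
    { "OBJECT",     "ARTIFACT" },
};
static int used = 4;

static void classify(const char *term, const char *best_known)
{
    hierarchy[used].node = term;
    hierarchy[used].parent = best_known;
    used++;
}

int main(void)
{
    classify("invoice", "OBJECT");   /* full classification possible  */
    classify("book",    "CONCEPT");  /* ambiguous: parked at the root */
    for (int i = 0; i < used; i++)
        printf("%s -> %s\n", hierarchy[i].node, hierarchy[i].parent);
    return 0;
}
```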

[Figure 3: Concept Hierarchy — every global system model concept appears as a leaf beneath the root CONCEPT, which divides into PHENOMENON, ARTIFACT and TIME, further subdivided into real-world, conceptual, static, dynamic and recorded categories covering concepts such as ACTIVITY, FUNCTION, MODULE, DATA, OBJECT, GOAL, STATE, AGENT, PROCESS, ATTRIBUTE, RELATIONSHIP and USER.]

Design and Usage Issues of the Global System Model

In this section, we review some of the issues arising from the population of the global system model.

Reliability

In the course of populating the global system model, a variety of pieces of information are received from many different sources and with different levels of reliability. Consequently, information held in the global system model is tagged with a reliability weighting. Thus facts derived from source code analysis are treated as intrinsically more reliable than those from text analysis or simple expert assertions. However, such measures must themselves be treated with caution, since point-scale reliability measures have themselves been shown to be unreliable. This problem is naturally compounded when more than one person is assigning these ratings. Thus machine-driven arbitration between concepts is not desirable. The compromise adopted is to store links to concepts that support and refute the consistency of a stored fact. The DOCKET analyst must then decide. Indicators can be given when, for example, the overall links indicate that there are now more for than against links for a concept.
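The support/refute compromise might be pictured as follows, with assumed fields: each stored fact carries links for and against, and the tool merely raises an indicator; arbitration stays with the DOCKET analyst.

```c
#include <stdio.h>

/* Hypothetical reliability record: rather than a machine-arbitrated
 * point score, a fact keeps counts of supporting and refuting concept
 * links and the toolset only raises an indicator for the analyst. */

typedef struct {
    const char *fact;
    int supporting;   /* links from concepts that support the fact */
    int refuting;     /* links from concepts that refute it        */
} FactReliability;

static void indicate(const FactReliability *f)
{
    if (f->supporting > f->refuting)
        printf("%s: more links for than against\n", f->fact);
    else
        printf("%s: flag for the DOCKET analyst\n", f->fact);
}

int main(void)
{
    FactReliability f = { "CUST.DAT implements entity customer", 3, 1 };
    indicate(&f);
    f.refuting = 4;       /* a newly analysed document disagrees */
    indicate(&f);
    return 0;
}
```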

Traceability

When maintainers identify concepts in the global system model, they will want to know where they came from or where to get more information; comprehensive links must therefore be kept to correspond model facts with the DOCKET structured forms from which they have been derived. This is the issue of traceability, which allows one to verify the substance of such sources of information and to see them in their natural context. Traceability takes the form of a unique identifier for each source of information (structured form), with an additional pointer to a reference inside the source itself.

Abstraction Paths

Many of the concepts in the global system model are linked by the implemented_by relationship. These relationships are fundamental to the model and its support of the reverse engineering (and forward engineering) process. It is through these paths that the products of reverse engineering can be abstracted to higher levels. For example, being able to map the file name CUST.DAT up to the entity customer allows part of the conceptual model of the domain to be reconstructed. Thus the abstraction paths provide one of the most powerful mechanisms for reasoning about a software system, locating missing concepts at different levels of the model (activities that are not supported by functions, unnecessary functions that do not support any actions or processes, etc.). Furthermore, they show the feasible and profitable reverse engineering paths which may be taken. Some of the concepts are not linked up to higher levels; this means that the lower-level concepts cannot realistically be abstracted up to a higher-level concept. The additional DOCKET sources of knowledge are of particular importance here, as without them there would be no way of obtaining these higher-level concepts at all.
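The use of abstraction paths for locating gaps can be sketched as a scan over implemented_by links, with hypothetical tables standing in for the repository: any activity with no implementing function is reported for investigation.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical gap scan over abstraction paths: activities at the
 * world level are checked for an implemented_by link to some
 * function; those without one are reported as unsupported. */

static const char *activities[] = { "register trader", "issue refund" };
static const struct { const char *function; const char *implements; }
links[] = {
    { "REG-TRADER", "register trader" },
};

int main(void)
{
    for (int a = 0; a < 2; a++) {
        int supported = 0;
        for (size_t l = 0; l < sizeof links / sizeof links[0]; l++)
            if (strcmp(links[l].implements, activities[a]) == 0)
                supported = 1;
        if (!supported)
            printf("activity not supported by any function: %s\n",
                   activities[a]);
    }
    return 0;
}
```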

Versioning and Variants

The description of any information system is subject to modification: it has to reflect the evolution of the organisation and its information system. Whenever a change in the information system or its environment is detected, the global system model must be updated. This presents a major design issue for the model, for which four alternatives exist, as shown in figure 4.

[Figure 4: Approaches to Versioning in the Global System Model — from left to right: populate once and populate-and-overwrite (one GSM), populate many times (discrete GSMs), and populate continuously (the GSM evolves).]

The first case on the far left is the application of reverse engineering where the current information system is analysed and the design recaptured. From here either the design is kept and used to restore design documents as a baseline for further maintenance, or the system is reimplemented from the resulting global system model. In either case the global system model is only created once; there is a point where work on it stops, the model is treated as if complete (and immediately out of date) and is thrown away after use. The next case is the one found in most contemporary programming environments, where only the currently saved version of a system is worked on. This does not mean that backups are not kept; they are kept, but they are not reasoned about. This prevents historical reasoning such as: why is this sub-system included, and what depended on it in the past? The third case arises when many discrete snapshots of the global system model are taken. After the latest version is established, all new changes to the global system model are collected but not included until the next version is established. This batch method means that the global system model will be out of date and will lose information concerning the time-ordering of changes between selected checkpoints. In addition, for any object in the global system model that changes more than once between checkpoints, information will be lost. The final case is the never-delete option, where objects in the model are never deleted, only updated. The inadequacy of the first three approaches and the clear advantages of the last case, the continuous modelling of a system, were decisive in its selection as the best option. Whilst being the most technically difficult, it maps best to the maintainers' needs and hence was the version model selected in the DOCKET project.
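The continuous, never-delete option might be realised as an append-only version chain per model object, as sketched below with assumed fields: an update adds a new version rather than overwriting, preserving the time-ordering of changes for historical reasoning.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical append-only versioning for a global system model
 * object: updates prepend a new version, so every earlier state and
 * its time-ordering remain available for historical reasoning. */

typedef struct Version {
    int number;
    char description[64];
    struct Version *previous;   /* never deleted, only superseded */
} Version;

static Version *update(Version *current, const char *description)
{
    Version *v = malloc(sizeof *v);
    if (v == NULL) return current;              /* sketch: keep old state */
    v->number = current ? current->number + 1 : 1;
    strncpy(v->description, description, sizeof v->description - 1);
    v->description[sizeof v->description - 1] = '\0';
    v->previous = current;
    return v;
}

int main(void)
{
    Version *module = update(NULL, "batch refund sub-system added");
    module = update(module, "refund calculation amended");
    for (Version *v = module; v; v = v->previous)
        printf("v%d: %s\n", v->number, v->description);
    return 0;
}
```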

7. The DOCKET Method

In addition to the basic toolset, the DOCKET project has developed a method of use for the toolset. This method can be regarded as a more general technique for managing an organisation's software assets, although clearly it is assisted by the toolset. The method consists of six phases, from initial planning through to the exploitation of the DOCKET global system model. These phases are summarised in figure 5.

[Figure 5: The DOCKET Method — six phases: a domain definition phase and an inventory phase (locate knowledge) lead into an initial population phase; the task-oriented phase then cycles through identifying knowledge requirements, locating knowledge, acquiring additional knowledge where necessary and performing the task; a consolidation phase and a reuse/export phase complete the method.]

Domain Definition Phase

The overall objective of the domain definition phase is to identify the domain for which DOCKET will be used and the applications contained therein. DOCKET is intended to support only a single domain, although it can support many applications within that domain. Therefore, in this initial phase, strategic decisions must be made to determine what should be supported. The tasks within this phase aim to identify possible domains and applications and, through a procedure of analysing the critical relationships between domain and application, select a target domain and application pool. Factors to be considered include: domain and application complexity, application stability and rate of change, and personnel turnover.

Inventory Phase

The overall objective of the inventory phase is to identify the main knowledge sources within the selected domain and application set and produce a detailed inventory of such sources. As a precursor to populating the DOCKET knowledge bases, the DOCKET analyst needs to identify all the available knowledge sources and their interdependencies, so that a proper knowledge elicitation plan can be constructed. This will subsequently improve the knowledge capture processes, ensuring maximum integration in the global system model and minimum unnecessary redundant knowledge capture (although some duplication will be desirable).

Initial Population Phase

The overall objective of the initial population phase is to prime the various DOCKET repositories prior to their use in task-oriented activities. The initial population of the repositories will be based upon the knowledge acquisition plan defined in the previous phase. It is in this phase that the majority of the DOCKET tools are used, including the initial knowledge source capture tools. As part of this process, a systematic approach to ensuring the consistency of the various DOCKET models must be employed. For example, in the global system model there is a rule which states that:

If a goal belongs to an organisation_unit then no subgoals of that goal may belong to a parent of that organisation_unit.

Now if a particular organisation unit (U3) has a goal (G2), and that goal has a subgoal (G4) which is the goal of an organisation unit (U1) at a higher level than the original organisation unit (U3), then this rule would be violated. Under these circumstances, one of the offending relationships must be removed. This should be achieved by reference to the original input, through the traceability links maintained when data were inserted into the global system model. Another situation which may arise is one where an entity or relationship is missing from the global system model. For example, every goal must have a relationship with at least one activity of the type determines, i.e. a goal must be achieved through some activity. This type of constraint is imposed by the relationship cardinality constraints shown for each relationship type in the global system model documentation. Violation of these constraints indicates that something is missing from the global system model and can be used to formulate questions to dynamic sources of the form, "what activities contribute to achieving the goal ...". Clearly, it is in this checking procedure that the DOCKET toolset can save the most time and ensure consistency and model quality.
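Checks of this kind lend themselves to mechanisation. The sketch below tests the goal/organisation-unit rule for the scenario just described, with hypothetical lookup tables standing in for the repository; a violation is found because U1 is an ancestor of U3.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical check of the rule: if a goal belongs to an
 * organisation unit, no sub-goal of that goal may belong to a parent
 * of that unit. Tables stand in for the DOCKET repository. */

static const struct { const char *unit; const char *parent; } units[] = {
    { "U3", "U2" }, { "U2", "U1" }, { "U1", NULL },
};

static const char *parent_of(const char *u)
{
    for (int i = 0; i < 3; i++)
        if (strcmp(units[i].unit, u) == 0) return units[i].parent;
    return NULL;
}

static int violates(const char *goal_unit, const char *subgoal_unit)
{
    /* Walk upwards from the goal's unit; a match means the sub-goal
     * sits with an ancestor of the goal's own unit. */
    for (const char *p = parent_of(goal_unit); p; p = parent_of(p))
        if (strcmp(p, subgoal_unit) == 0) return 1;
    return 0;
}

int main(void)
{
    /* G2 belongs to U3; its sub-goal G4 belongs to U1. */
    if (violates("U3", "U1"))
        printf("rule violated: sub-goal G4 belongs to U1, "
               "an ancestor of U3\n");
    return 0;
}
```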

Task-Oriented Phase

Within the task-oriented phase of DOCKET, emphasis changes from the general acquisition of knowledge about the application and its domain to a more task-oriented approach. This change is reflected in the manner in which knowledge is acquired and in who drives that acquisition process. The main tasks are as follows.

Identify Knowledge Requirements

The completion of a maintenance task will depend on the utilisation of information pertinent to that task. The first step is therefore to recognise the knowledge requirements of the task in hand, and to identify the repositories in the DOCKET system where the relevant information may be located. The type of maintenance or system-understanding activity will largely determine the nature and type of knowledge required.

Locate Knowledge

A successful retrieval request on the relevant repositories will produce the information required for the current maintenance task. If this is achieved, then the DOCKET user may proceed to perform the task in hand.

Acquire Additional Knowledge

If the required information is not currently available from the DOCKET repositories, this step is intended to acquire the missing knowledge. This will involve the use of the DOCKET toolset in any of the input, analysis or consolidation steps. The precise activities will be based upon the task in hand and the current state of the DOCKET repositories.

Perform Task

Depending on the nature of the maintenance task being performed, the knowledge contained in the DOCKET system may be enhanced by its actual execution. Examples include the establishment of new links in the repositories or the insertion of new information into an experience base.

Consolidation Phase

At any stage in the use of the DOCKET system, the knowledge content of the repositories may be checked for consistency and completeness. This process is independent of any maintenance tasks and is intended to (a) produce a coherent global system model of the current target system, where this has not yet been achieved in the task-oriented phase, and (b) identify those areas which require further knowledge elicitation. This step is to be performed either as a clarification procedure during the use of DOCKET within a given domain, or as a prerequisite for the exportation of the repositories to a new domain.

Reuse/Export Phase

The contents of the DOCKET repositories may be reused within a new domain, as part of the initial population phase for that domain, or exported for use by other software tools (e.g. a CASE development tool). In either situation, it is assumed that the information to be reused is in a consistent state, produced by the consolidation phase.


8. Benefits of DOCKET and the Global System Model

The use of a comprehensive information system meta-model, such as the global system model, provides a number of benefits.

Change Impact Analysis

The comprehensive links between all interdependent concepts in the global system model allow the impact of proposed changes to given concepts or groups of concepts to be estimated with greater accuracy. For example, if an activity at the world level changes, the global system model can be used to determine the likely impact on the design and implementation levels of the model. By tracing the links back through to the structured forms and ultimately the original knowledge sources, the precise programs and documents which may be subject to change can be identified.

Better Planning and Management of System Extensions

Central to the global system model is the inclusion of the additional perspective of how a software system is embedded in its business environment. As a result, maintainers can more easily identify both the activities which a system supports and those it does not. In other words, the global system model can form the basis of a goodness-of-fit measurement between the information system and the underlying domain it is intended to serve. This has the beneficial effect of highlighting, and subsequently allowing relatively easy analysis of, potential extensions to a system to improve such goodness of fit.

Effective Prioritisation of Maintenance

Maintenance effort can be appropriately prioritised by making use of the enterprise model of the environment in which a software system is situated, so that change requests can be prioritised on the basis of those which affect the most crucial aspects of the business. Linking requirements to areas of the business gives maintainers a genuine feel for what is important or must be done quickly.

Redocumentation

Specification, design and implementation documents could be regenerated from the global system model, along with many of the functional and non-functional requirements. The assistance this would provide to a technical author is invaluable. Furthermore, the information provided would be consistent (having been subjected to extensive checking in the global system model), and large parts of the documentation could be generated automatically and re-generated appropriately upon each change.

Training

Maintainers who are new to a project spend a considerable period of unproductive time familiarising themselves with its software. The provision of a model of the whole system, with the ability to browse through its components in a full business context, is advantageous to even the most experienced maintainers. Use of this facility could proceed in parallel with the maintenance of the system, allowing maintainers to quickly and accurately familiarise themselves with the system components to be maintained. The learning time of maintainers new to the system could be reduced considerably, especially if the browser mentioned above was available on-line.

9. Evaluation and Conclusions

As part of the evolution of the DOCKET project, a number of evaluations have taken place, both to prove the underlying concepts and to develop additional approaches. In particular, an experiment was conducted on the enhancement of a well-structured, 39,000-line, COBOL-based administrative system. The aim was to identify whether DOCKET could reduce the knowledge required by maintenance programmers in order to make the required system changes. Although the experiment cannot be fully described here, a detailed research report is available from the authors.

The analysis of source code was straightforward. More time-consuming was the knowledge capture from domain experts, requiring a total of 6 hours of interviews with domain and system experts, followed by a further 3 hours of data input to the system representing the essential knowledge captured from the experts. Documentation analysis was made simple by the fact that all documentation was stored under Microsoft Word, with appropriate headings marked using style sheets.

A series of simple changes were required to be made to the analysed system. Maintainers unfamiliar with the source code system were required to identify the number of possible modules to be changed. Starting with the assumption that all 103 modules required changing, they used the global system model to trace changes in activities through to the data structures and modules to be changed. Whilst all 8 modules actually requiring change were identified, a further 7 were also selected. Although the overhead of 9 hours of knowledge gathering and modelling is relatively high, a subsequent extension to the system of a further 8,500 lines of COBOL code required only minor modifications to the domain expert knowledge base, totalling 15 minutes.

One of the major findings of the experiment was the role of test cases in assisting system understanding. Together with links from the source code to the accompanying user manual, system release notes and domain expertise, DOCKET was able to provide a sound basis for scoping, planning and executing the required changes, minimising the time required for system comprehension. Other experiments have also been conducted, where attention has been focused more upon the DOCKET method and the extent to which such a method of maintenance management can be introduced into a traditional programming environment. This work is ongoing.

The work of the full DOCKET consortium ended as a collaborative venture in January 1993; however, the fruits of the research are starting to appear within the tools and techniques delivered and supported by consortium members, the aim of which has been to bridge the knowledge gap between organisational need and supporting IT systems.

Acknowledgements

The work described in this paper was partially funded by the European Commission under the Esprit research and development programme (project 5111). The DOCKET project consortium consisted of four industrial partners and two academic partners: Computer Logic (Greece), CRIAI (Italy), Software Engineering Service (Germany), SOGEI (Italy), UMIST (UK) and Universidade Portucalense (Portugal). The authors would like to thank all the project team for their contributions to this paper and their work and support during the project.


References

Avellis, G., Iacobbe, A., Palmisano, D., Semeraro, G. and Tinelli, C. (1991) An Analysis of Incremental Assistant Capabilities of a Software Evolution Expert System, Proc. IEEE Conference on Software Maintenance, Sorrento, Italy, October 1991, pp. 220-227.

Benedusi, P., Benvenuto, V. and Caporaso, M.G. (1990) Maintenance and Prototyping at the Entity-Relationship Level: A Knowledge-Based Support, Proc. IEEE Conference on Software Maintenance, San Diego, November 1990, pp. 161-169.

Black, W.J., Gianetti, A., Layzell, P.J. and Stanitsas, G. (1992) Exploitation of System Documentation for Design Recovery, Proc. Int. Conference on Software Engineering and its Applications, Toulouse, December 1992.

Black, W.J., Layzell, P.J., Loucopoulos, P. and Sutcliffe, A.G. (1987) AMADEUS Project: Final Report, Esprit project 1229(1252), UMIST, UK.

Burton, M. and Shadbolt, N. (1987) Knowledge Engineering, Technical Report 87-2-1, Department of Psychology, University of Nottingham, Nottingham, UK.

Goldfarb, C.F. (1989) The SGML Handbook, Oxford University Press, Oxford.

van Griethuysen, J.J. (ed.) (1982) Concepts and Terminology for the Conceptual Schema and the Information Base, ISO/TC97/SC5.

Grishman, R., Hirschman, L. and Nhan, N.T. (1986) Discovery Procedures for Sublanguage Selectional Patterns: Initial Experiments, Computational Linguistics, Vol. 12, No. 3, pp. 205-215, July-September.

Karakostas, V. (1992) Intelligent Search and Acquisition of Business Knowledge from Programs, Journal of Software Maintenance, Vol. 4, No. 1, pp. 1-18, March 1992.

Layzell, P.J. and Macaulay, L. (1990) An Investigation into Software Maintenance: Perception and Practices, Proc. IEEE Conference on Software Maintenance, San Diego, pp. 130-140.

Olle, T.W., Hagelstein, J., Macdonald, I.G., Rolland, C., Sol, H.G., van Assche, F. and Verrijn-Stuart, A.A. (1988) Information Systems Methodologies: A Framework for Understanding, Addison-Wesley.

Paice, C.D. (1981) The Automatic Generation of Literature Abstracts: An Approach Based on the Identification of Self-Indicating Phrases, in: Information Retrieval Research, (eds.) R.N. Oddy, S.E. Robertson, C.J. van Rijsbergen and P.W. Williams, pp. 172-191, Butterworths, London.

Parikh, G. and Zvegintzov, N. (1983) Tutorial on Software Maintenance, IEEE Press.

Sneed, H. and Jandrasics, G. (1988) Inverse Transformations of Software from Code to Specification, Proc. IEEE Conference on Software Maintenance, Phoenix, AZ, pp. 102-108.

Sowa, J.F. (1984) Conceptual Structures, Addison-Wesley.
