Program Comprehension in a Reuse Reengineering Environment

Andrea De Lucia

Dep. of "Informatica e Sistemistica", University of Naples "Federico II", Via Claudio 21, 80125 Naples, Italy [email protected]

Malcolm Munro

Centre for Software Maintenance, University of Durham, South Road, DH1 3LE Durham, UK [email protected]

Abstract

Program comprehension is the most expensive activity of software maintenance. The different phases of a reuse reengineering process involve comprehension activities for understanding the structure of existing systems, the functionality implemented by a reuse-candidate module and the reengineering effort. We present an integrated environment, implemented in Prolog, for the reuse reengineering of existing C systems. Different tools developed in the RE2 project are integrated in the environment by sharing a fine-grained representation for C programs, the Combined C Graph (CCG). Different views of a system can be abstracted and visualised from the data-base of Prolog facts implementing its CCG representation. Software metric tools evaluate the reengineering costs, while reengineering operations are expressed as transformation rules and a symbolic executor supports the production of the module's specification.

Keywords: Program Comprehension, Reverse Engineering, Reengineering, Reuse, Program Representation.

1 Introduction

Program comprehension is a very expensive activity which precedes each maintenance operation. The comprehension of an existing system consumes from 50% up to 90% of its maintenance time [34]. This is particularly true for legacy systems, which evolve because of their economic value and usefulness in the industrial field [29]. Because of the difficulty of coping with such a system [2], a large amount of time is required even for the comprehension of a small part of it. Several approaches to program comprehension have been proposed in the literature [32]. In this paper we consider the program comprehension issue from the reuse and reengineering perspective.

Reuse reengineering processes exploit reverse engineering and reengineering techniques [13] for identifying and extracting software components from existing systems with the aim of populating repositories of reusable modules [1]. A reuse reengineering reference paradigm has been defined in the RE2 project [10] as a framework where available methods, techniques and tools can be used and experiments can be repeated. The RE2 paradigm decomposes a reuse reengineering process into five sequential phases called Candidature, Election, Qualification, Classification and Storage, and Search and Display, as depicted in figure 1. The RE2 project deals with the first three phases, which concern the production of reusable modules from existing systems. The latter two phases populate the repository and set up the environment for the retrieval and the reuse of modules during the development of new systems.

[Figure 1: The RE2 Reference Paradigm. Phases: Candidature, Election, Qualification, Classification & Storage, Search & Display]

The Candidature phase produces sets of software components by using source code analysis and reverse engineering techniques. Each set of software components is a candidate to make up a reusable module. The first step of this phase is the definition of a candidature criterion, i.e. the criterion to be applied to produce a first approximation to the set of the reuse-candidate modules. This involves the definition of the model on which the criterion is applied and of the information needed to make up an instance of the model.

The Election phase transforms reuse-candidate modules into reusable modules by decoupling each set of software components from the external environment and clustering them into a module according to a predefined template. A Concept Assignment [3] process can be required before the Election phase in order to select the subset of reuse-candidate modules that can be associated with human-oriented concepts. Only such modules will be decoupled and reengineered.

The Qualification phase involves reverse engineering activities for the production of the specifications of the modules obtained in the Election phase.

Each of the phases above involves program comprehension activities. In particular, the Candidature phase makes it possible to understand the architectural design of the existing system being analysed or the part of a system which implements a particular functionality. Moreover, code comprehension is the main activity of the concept assignment process. The Election phase requires the understanding of the complexity of the reuse-candidate modules and the evaluation of the reengineering costs. Finally, although automatic tools can support the production of a formal specification of a reusable module during the Qualification phase, the comprehension of this specification and of the code is necessary to produce an informal, more intuitive description of the implemented functionality. This phase also uses the results from the previous concept assignment process.

The integration of the reuse reengineering activities in a single, flexible environment is fundamental to enabling the software engineer to control and understand the process. In this paper we propose to use a fine-grained program representation for C programs, called the Combined C Graph [28], to integrate reuse reengineering tools. The Combined C Graph (CCG) of a system is produced by analysis tools in the form of a data-base of Prolog facts. The data-base is shared by different reverse engineering tools written in Prolog for abstracting different views of the system in the form of reuse-candidate modules. A graphical tool can be used to visualise the abstracted information and make the comprehension of the system more intuitive for the user. Election and qualification tools for producing software metrics, restructuring and reengineering candidate modules and performing symbolic execution [27] are also integrated in the environment.

The paper is organised as follows. In section 2 the different candidature criteria developed in the RE2 project for the comprehension of existing systems are recalled. Section 3 presents the integration of the reuse reengineering activities in a Prolog environment based on the CCG representation. Concluding remarks are given in section 4, while the Appendix shows the architecture of the CCG Prolog data-base.

2 Program Comprehension and Candidature Criteria

Candidature criteria can be divided into structural criteria and specification driven criteria [18].

Structural criteria extract the sets of reuse-candidate components by exploiting structural properties based on a metric model or on the type of the abstraction to be identified. Basili and Caldiera propose a criterion based on metrics [1]. In the RE2 project structural criteria have been proposed for the identification of both functional [19] and data abstractions [7, 8, 9]. These criteria produce a large set of candidate modules when applied to a system. Hence, they can be used in program comprehension to understand the way a software system has been designed and implemented [9, 12, 15].

Specification driven criteria presuppose the existence of a full or partial specification of an abstraction and search existing systems for pieces of code which implement the specification. For each specification they produce one module. A specification driven criterion has been proposed in the RE2 project for the identification of reusable functions [18]. Specification driven criteria are useful in the comprehension process to understand whether and where a system implements an abstraction.

Structural criteria and specification driven criteria can be combined in a two step process. In the first step structural criteria can be applied to understand which part of a system has been designed and implemented following an approach based on functional or data abstraction. In the second step specification driven criteria can be applied to identify specific abstractions and possibly to obtain a further refinement of the sets of components extracted in the first step.

In the following subsections we describe some of the candidature criteria developed in the RE2 project. We do not discuss the details of these criteria, which can be found in the referenced literature. The candidature criteria are language independent and have been validated by several experiments conducted on systems written in different imperative languages.

2.1 Aggregating Procedures on the Call Graph

Cimitile and Visaggio [19] propose a candidature criterion for the identification of functional abstractions based on the aggregation of procedures (footnote 1) on the call graph. In particular, the criterion exploits the dominance tree [22] of the call graph of a system.

A Call Directed Graph (CDG) of a program is a flowgraph $(PP, CALL, s)$ where $PP$ is the set of procedures, $s \in PP$ is the main procedure and $CALL$ describes the activation relation on $PP \times PP \setminus \{s\}$. The Call Directed Acyclic Graph (CDAG) is obtained from the program CDG by collapsing each strongly connected subgraph (footnote 2) into a single node.

In a CDAG a procedure $p_x$ dominates a procedure $p_y$ if and only if each path from $s$ to $p_y$ contains $p_x$. The dominance relation is the reflexive and transitive closure of the direct dominance relation. A procedure $p_x$ directly dominates a procedure $p_y$ if and only if $p_x$ dominates $p_y$ and all the other procedures that dominate $p_y$ dominate $p_x$ too. A procedure $p_x$ strongly and directly dominates a procedure $p_y$ if and only if $p_x$ directly dominates $p_y$ and $p_x$ is the only procedure that calls $p_y$.

The direct dominance relation can be represented as a tree, called the Direct Dominance Tree (DDT), whose root is the main procedure $s$. The Strong and Direct Dominance Tree (SDDT) is obtained from the DDT by marking all the edges representing the strong and direct dominance relation. The set of the subtrees of an SDDT can be divided into two subsets: the subset MET of the subtrees containing only marked edges and the subset UMET of the subtrees containing at least one unmarked edge. The Reduction of the Strong Direct Dominance Tree (RSDDT) is a tree obtained from the SDDT by collapsing each subtree in MET into a single node.

Four rules have been proposed to aggregate procedures into reuse-candidate modules and to identify the "uses" and "is composed of" relationships [20] between them:

1. The set of procedures represented by the nodes of a strongly connected subgraph of a CDG is a candidate to constitute a reusable module.
2. The set of procedures represented by the nodes of a subtree $t \in MET$ is a candidate to constitute a reusable module represented by the root of $t$.
3. The set of procedures represented by the nodes of a subtree $t \in UMET$ linked to the root of $t$ by a marked edge is a candidate to constitute a reusable module. This module uses the modules represented by the nodes in $t$ which are linked to the root by an unmarked edge.
4. Each of the marked (unmarked) edges of the RSDDT is a candidate to constitute an "is composed of" ("uses") relationship between the modules represented by the nodes that the edge links.

The criterion has been validated in both Pascal [14] and COBOL [12] environments. We are currently carrying out experiments using C systems.

Footnote 1: We refer to the primitives that programming languages provide to implement functional abstractions, e.g. program, procedure, function, subroutine, with the generic name "procedure".

Footnote 2: A strongly connected subgraph of a CDG contains at least one cycle involving all its nodes. This cycle is due to the presence of recursion between the procedures of the program.
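The dominance definitions above can be illustrated with a minimal Python sketch. It is not the RE2 implementation (which is written in Prolog); the function names `dominators` and `sddt` and the sample call graph are assumptions made for illustration. The sketch assumes an acyclic call graph in which every procedure other than the root has at least one caller and is reachable from the root.

```python
from collections import defaultdict


def dominators(nodes, calls, root):
    """Iteratively compute the dominator set of every node of a CDAG.

    Assumes every non-root node has at least one caller.
    """
    callers = defaultdict(set)
    for caller, callee in calls:
        callers[callee].add(caller)
    dom = {n: set(nodes) for n in nodes}
    dom[root] = {root}
    changed = True
    while changed:
        changed = False
        for n in nodes:
            if n == root:
                continue
            # n is dominated by itself and by whatever dominates all its callers.
            common = set.intersection(*(dom[c] for c in callers[n]))
            new = {n} | common
            if new != dom[n]:
                dom[n], changed = new, True
    return dom, callers


def sddt(nodes, calls, root):
    """Map each procedure to (direct dominator, whether the edge is marked)."""
    dom, callers = dominators(nodes, calls, root)
    tree = {}
    for n in nodes:
        if n == root:
            continue
        strict = dom[n] - {n}
        # The strict dominators of n form a chain; the direct dominator is
        # the one closest to n, i.e. the one with the most dominators itself.
        direct = max(strict, key=lambda d: len(dom[d]))
        # Marked (strong) edge: the direct dominator is the only caller of n.
        tree[n] = (direct, callers[n] == {direct})
    return tree


# Hypothetical call graph: c is called by both a and b, so main dominates it
# directly but not strongly.
NODES = ["main", "a", "b", "c", "d"]
CALLS = [("main", "a"), ("main", "b"), ("a", "c"), ("b", "c"), ("a", "d")]
TREE = sddt(NODES, CALLS, "main")
```

In this example the edge from `a` to `d` is marked, so under rule 2 the subtree rooted at `a` would be a candidate module, while the unmarked edge from `main` to `c` would suggest a "uses" relationship under rule 4.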

2.2 Identifying Objects

A logic based method for identifying objects (or data structures) has been proposed by Canfora et al. [10]. This approach looks for the set of procedures $PP$, the set of global variables $DD$ and the set of references of procedures to global variables, which can be expressed by a relation $DAT \subseteq PP \times DD$. It is worth noting that this relation can also be represented as a bipartite graph called the variable-reference graph [8]. The criterion is based on two rules:

1. Two global variables $d_1$ and $d_2$ contribute to define the same candidate object if they belong to a reference cobweb, i.e. if and only if (footnote 3) $(d_1, d_2) \in DSDAT = (trans(DAT) \circ DAT)^* \subseteq DD \times DD$.
2. A procedure $c$ defines one of the methods of the object to which a global variable $d$ belongs if $c$ directly references one of the variables in the cobweb around $d$, i.e. if and only if $(d, c) \in PPDAT = (trans(DAT) \circ DAT)^* \circ trans(DAT) \subseteq DD \times PP$.

A different criterion has been proposed and validated in a C environment [8]. It treats undesired pairs in $DAT$, called coincidental and spurious connections (footnote 4), that produce clusters of procedures and functions implementing more than one object. The candidature criterion considers for each $p \in PP$ the subgraph generated by clustering together the set $DD(p)$ of global variables that $p$ references and the set $PP(p)$ of procedures that only access these data items. These sets can be defined as:

$$DD(p) = PostSet(p)$$

$$PP(p) = \bigcup_{d \in DD(p)} P(d, p)$$

where

$$P(d, p) = \{ p_i \in PP \mid p_i \in PreSet(d) \wedge PostSet(p_i) \subseteq PostSet(p) \}$$

$$PreSet(d) = \{ p \in PP \mid (p, d) \in DAT \}$$

$$PostSet(p) = \{ d \in DD \mid (p, d) \in DAT \}$$

For such a subgraph the index $IC(p)$ defining its internal connectivity and the variation $\Delta IC(p)$ in the internal connectivity, due to the possible clustering with respect to procedures in $PP(p)$, are calculated:

$$IC(p) = \frac{\sum_{d \in PostSet(p)} \# P(d, p)}{\sum_{d \in PostSet(p)} \# PreSet(d)}$$

$$\Delta IC(p) = IC(p) - \sum_{d \in PostSet(p)} \frac{\# \{ p_i \mid PostSet(p_i) = \{d\} \}}{\# PreSet(d)}$$

where $\#A$ denotes the number of elements in the set $A$. The procedures whose associated $\Delta IC$ is greater than a given threshold are used to generate clusters. All the other routines are considered to introduce coincidental or spurious connections and are sliced or deleted, respectively. Moreover, some of the data items are merged into a single item. These operations generate a new variable-reference graph on which the indexes are recalculated and the operations reexecuted. The process ends when the graph is partitioned into a set of isolated sub-graphs, each of which consists of one data node (corresponding to a set of global variables) and a collection of procedures that only access it. Each one of these isolated sub-graphs defines a candidate object.

Footnote 3: $trans(R)$ and $R^*$ denote the transpose and the reflexive transitive closure of the relation $R$, respectively.

Footnote 4: A procedure which implements more than one function, each function logically belonging to a different object, generates coincidental connections. A procedure which implements system specific operations by directly accessing the supporting data structure of more than one object generates spurious connections. Coincidental connections can be eliminated by slicing the procedure and isolating the different functions. Spurious connections can be eliminated by deleting the procedure from the set $PP$.
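As a worked illustration of the two indexes, the following Python sketch computes IC(p) as the ratio between the summed sizes of P(d, p) and the summed sizes of PreSet(d) over the variables d referenced by p, and the variation as IC(p) minus the sum, over the same variables, of the fraction of referencing procedures that access only d. This reading of the formulas, the function names and the sample `DAT` relation are assumptions made for the example; exact fractions are used so that the values can be checked by hand.

```python
from fractions import Fraction


def pre_set(dat, d):
    """Procedures that reference the global variable d."""
    return {p for (p, v) in dat if v == d}


def post_set(dat, p):
    """Global variables referenced by the procedure p."""
    return {v for (q, v) in dat if q == p}


def p_set(dat, d, p):
    """Procedures referencing d that access only variables also used by p."""
    return {pi for pi in pre_set(dat, d) if post_set(dat, pi) <= post_set(dat, p)}


def ic(dat, p):
    """Internal connectivity of the subgraph clustered around p."""
    variables = post_set(dat, p)
    numerator = sum(len(p_set(dat, d, p)) for d in variables)
    denominator = sum(len(pre_set(dat, d)) for d in variables)
    return Fraction(numerator, denominator)


def delta_ic(dat, p):
    """Variation of the internal connectivity due to clustering around p."""
    penalty = sum(
        Fraction(
            len({pi for pi in pre_set(dat, d) if post_set(dat, pi) == {d}}),
            len(pre_set(dat, d)),
        )
        for d in post_set(dat, p)
    )
    return ic(dat, p) - penalty


# Hypothetical variable-reference relation: (procedure, global variable).
DAT = {
    ("push", "stack"), ("push", "top"),
    ("pop", "stack"), ("pop", "top"),
    ("init", "stack"),
    ("log", "stack"), ("log", "queue"),
}
```

For this relation, `ic(DAT, "push")` is 5/6 and `delta_ic(DAT, "push")` is 7/12, whereas `log`, which touches the data of two different structures, scores 3/5 and 7/20. With a suitable threshold, `log` would be the procedure suspected of introducing a coincidental or spurious connection.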

2.3 Abstract Data Types Candidature Criteria

A logic based method for identifying abstract data types has been proposed by Canfora et al. [7]. This approach looks for the set of procedures $PP$, the set of user defined data types $TT$ and the set of uses of user defined data types in the headings of the procedures (footnote 5), which can be expressed by the relation $TYP \subseteq PP \times TT$. To take into account the relationships possibly existing among the different user defined data types used in the heading of a procedure, this relation is refined by the relation $STYP \subseteq TYP$ containing the pairs $(p, t)$ such that $p$ does not use any super-type (footnote 6) of $t$ in the interface. The criterion is based on two rules:

1. Two user defined data types $t_1$ and $t_2$ contribute to define the same candidate abstract data type if they belong to a cobweb of formal parameter declarations, i.e. if and only if $(t_1, t_2) \in ABTYP = (trans(STYP) \circ STYP)^* \subseteq TT \times TT$.
2. A procedure $p$ defines one of the operators of the abstract data type to which a user defined data type $t$ belongs if $p$ uses one of the user defined data types in the cobweb around $t$ to declare a formal parameter, i.e. if and only if $(t, p) \in PPTYP = (trans(STYP) \circ STYP)^* \circ trans(STYP) \subseteq TT \times PP$.

The criterion above has been applied to a set of case studies in a Pascal environment [9]. Although the experiment gave satisfactory results, some modules were too large and difficult to understand and to associate with data abstractions. This was caused in particular by a massive use of user defined sub-range or enumeration types in the headings of procedures. Such user defined types were used both in procedures that could not be associated with operators of an abstract data type and in procedures that could be associated with operators of different abstract data types. The criterion was also affected by another problem which caused lack of precision: procedures contributing to implement an abstract data type, but not using any user defined data type (for example, procedures implementing a subfunction of an operator of an abstract data type), were not selected into the candidate module. To overcome these problems, an improved criterion has been proposed by Canfora et al. [11].

Footnote 5: A procedure $p$ uses a user defined type $t$ if and only if $t$ is used to define the type of a formal parameter or a return value of $p$ [7].

Footnote 6: The user defined data type $t_1$ is a super-type of the user defined data type $t_2$ if $t_2$ is used to define $t_1$ [30].

The problem of complex modules has been solved by identifying and deleting the user defined sub-range or enumeration types that caused large clusters of procedures and user defined data types. In this way complex modules can be divided into simpler ones that can be associated with human-oriented concepts. To include all the procedures involved in the implementation of an ADT, the previous method has been combined with an iterative process based on the SDDT of the system. At each step, the process deletes from the CDAG the procedures that do not belong to any candidate module and that are strongly and directly dominated by the main procedure. Moreover, new call edges are inserted between the main procedure and the procedures that do not have any incoming edge, and the SDDT is reconstructed. The last SDDT obtained in this way consists only of procedures that implement candidate ADTs. Moreover, the dominance relationships on this SDDT can be used to identify the structure of a module implementing an ADT (the procedures exported, i.e. the operators of the ADT, and those only used in the body for implementation purposes) and the "uses" relationships between modules. The details of the improved criterion can be found in [11]. The criterion has been applied to the same set of case studies used to validate the former criterion [15]. The experiments produced a greater number of reusable modules of higher quality. We are currently applying the criterion to software systems written in C.
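The two rules of the original criterion [7] amount to finding the connected components of the bipartite graph linking procedures to the types used in their headings. The Python sketch below illustrates this with a small union-find structure; it covers only the cobweb rules, not the later refinements based on sub-range types and the SDDT. The function name `candidate_adts` and the sample pairs are hypothetical.

```python
def candidate_adts(styp):
    """Group types into cobwebs and attach the procedures that use them.

    styp: iterable of (procedure, type) pairs, playing the role of STYP.
    Returns a sorted list of (types, procedures) pairs, one per cobweb.
    """
    styp = list(styp)
    parent = {}

    def find(t):
        # Union-find lookup with path halving.
        parent.setdefault(t, t)
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    # Rule 1: types used in the heading of the same procedure share a cobweb.
    first_type = {}
    for proc, typ in styp:
        if proc in first_type:
            parent[find(typ)] = find(first_type[proc])
        else:
            first_type[proc] = typ
            find(typ)

    # Rule 2: a procedure is an operator of the cobweb of any type it uses.
    modules = {}
    for proc, typ in styp:
        types, procs = modules.setdefault(find(typ), (set(), set()))
        types.add(typ)
        procs.add(proc)
    return sorted((sorted(ts), sorted(ps)) for ts, ps in modules.values())


# Hypothetical headings: push(Stack, Item), pop(Stack), enqueue(Queue), ...
STYP = [
    ("push", "Stack"), ("push", "Item"), ("pop", "Stack"),
    ("enqueue", "Queue"), ("dequeue", "Queue"),
]
ADTS = candidate_adts(STYP)
```

Here `Stack` and `Item` end up in the same candidate abstract data type because `push` uses both. This also shows how a widely used sub-range or enumeration type could merge otherwise unrelated modules into one large cluster.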

2.4 Specification Driven Program Slicing

A specification driven candidature criterion based on program slicing [36] and symbolic execution [27] has been proposed by Cimitile et al. [18]. The aim of the criterion is the isolation of a program slice which implements the specification of a functional abstraction. The interface specification is given as a function mapping the set of input data to the set of output data, while the semantics is expressed in terms of two first order logic formulas, called precondition and postcondition [23]. These two formulas express the condition which must hold on the input data for the application of the function and the relation between the input and the output data after the execution of the function, respectively.

A new definition of slice is proposed, based on the consideration that a piece of code implementing a functional abstraction should have a unique entry point on the control flow. A slicing criterion of a procedure $p$ is defined as a triple $\langle s_{in}, s_{out}, V_{out} \rangle$ where $s_{in}$ and $s_{out}$ are statements in $p$, $s_{in}$ dominates $s_{out}$ and $V_{out}$ is a subset of the variables of $p$. A slice on such a slicing criterion consists of all the statements and predicates of $p$ that lie on a control flow path from $s_{in}$ to $s_{out}$ and that might affect the values of the variables in $V_{out}$ just before the statement $s_{out}$ is executed.

The first step of the criterion entails the use of symbolic execution and theorem proving techniques to identify a slicing criterion. The procedure is symbolically executed and each statement is associated with the symbolic state [27] holding before its symbolic execution, which contains the precondition of the statement (the invariant assertion (see e.g. [25]) holding before its execution). Once a statement has been annotated with its entry symbolic state, the equivalence of the statement's precondition with the precondition and postcondition of the specification is checked. A statement whose precondition is equivalent to the input precondition is a candidate to be the statement $s_{in}$ of the slicing criterion, while a statement whose precondition is equivalent to the input postcondition is a candidate to be the statement $s_{out}$ of the slicing criterion. If $s_{in}$ also dominates $s_{out}$, the slicing criterion $\langle s_{in}, s_{out}, V_{out} \rangle$ is produced as output.

[Figure 2: A logic based paradigm for reverse engineering tools. Components: Source Code, Extractor, Direct Rel., Prolog Environment, Abstract. Prolog Rules, Summ. Rel., Visual Tool]

Human interaction is required to associate the data of the specification with the program variables and in particular to define the set of variables $V_{out}$ corresponding to the output data of the functional abstraction. Human interaction can also be required to provide solutions for undecidable problems such as finding invariant assertions for loops and proving implications between predicates.

The second step of the criterion entails the extraction of the slice. This is done by exploiting algorithms based on dependence graphs [31]. The criterion has been validated in a C environment [17]. An enhanced version of the Combined C Graph [28] is used as program representation for both symbolic execution [16] and program slicing.
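The extraction step can be sketched in Python as follows. This is a deliberately simplified illustration, not the algorithm of [31]: it assumes that the control flow graph, the data and control dependences of each statement, and the set of statements whose definitions of the variables in V_out reach s_out have already been computed. The names `extract_slice`, `CFG`, `DEPENDS_ON` and `SEEDS` and the sample program are hypothetical.

```python
def reachable(start, successors):
    """Nodes reachable from start by following the given successor map."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(successors.get(node, ()))
    return seen


def extract_slice(cfg, depends_on, seeds, s_in, s_out):
    """Statements between s_in and s_out that may affect the seed definitions."""
    predecessors = {}
    for node, succs in cfg.items():
        for succ in succs:
            predecessors.setdefault(succ, set()).add(node)
    # Statements lying on some control flow path from s_in to s_out.
    on_path = reachable(s_in, cfg) & reachable(s_out, predecessors)
    # Backward closure over dependence edges, confined to those statements.
    result, stack = set(), list(seeds)
    while stack:
        node = stack.pop()
        if node in on_path and node not in result:
            result.add(node)
            stack.extend(depends_on.get(node, ()))
    return result


# Hypothetical procedure, one statement per node:
# 1: x = read()      2: sum = 0       3: i = 1        4: while (i <= x)
# 5: sum = sum + i   6: i = i + 1     7: log = x * 2  8: print(sum)
CFG = {1: [2], 2: [3], 3: [4], 4: [5, 7], 5: [6], 6: [4], 7: [8]}
DEPENDS_ON = {4: {1, 3, 6}, 5: {2, 3, 4, 5, 6}, 6: {3, 4, 6}, 7: {1}, 8: {2, 5}}
SEEDS = {2, 5}  # definitions of sum that reach statement 8
SLICE = extract_slice(CFG, DEPENDS_ON, SEEDS, s_in=2, s_out=8)
```

With s_in = 2, s_out = 8 and V_out = {sum}, the sketch keeps statements 2 to 6. Statement 1 is dropped because it precedes s_in, and statement 7 because it does not affect `sum`.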

3 Integrating the Reuse Reengineering Process

While each of the criteria described in the previous section has been successfully validated on sample systems implemented according to the type of abstraction searched for, a complex system can be implemented following different decomposition approaches. In this case, each of the candidature criteria can only give a partial view of the system. To obtain a full comprehension of the system being analysed, the maintainer needs to switch easily among different integrated abstractor tools in order to identify the criterion which best fits the part of the system under consideration.

Candidature criteria have been implemented in the RE2 project following the paradigm described by Canfora et al. in [6] (see figure 2). A static code analysis tool is used to extract the set of direct relations (i.e., the initial set of relations on which the criterion is based) as a data-base of Prolog facts. The criterion is then implemented as a tool written in Prolog [35] which abstracts the set of summary relations (i.e., the final set of relations which defines the criterion). Finally, a visual tool displays the information abstracted by reverse engineering.

However, although the implementation paradigm is the same, different prototypes of static analysis tools have been implemented, each of which only extracts the direct relations needed for applying one criterion. This is a limitation for the integration of the abstractors in a reuse reengineering environment and for the full comprehension of the system being analysed. In order to integrate the abstractors implementing the different criteria, the same code would have to be statically analysed more than once. This problem can be overcome if the same static code analysis tool extracts, for a given language, the direct relations needed for the application of all the criteria. Moreover, to understand the identified reuse-candidate modules and their evolution during the successive phases of the reuse reengineering process, it would be desirable to provide automatic support for:

• the extraction of the software components implementing a module;
• the concept assignment process;
• the evaluation of the effort needed for the restructuring and reengineering operations;
• the restructuring and reengineering operations;
• the qualification of the reusable modules.

[Figure 3: An integrated reuse reengineering environment. Components: Source Code, Extractor, Interm. Repres. Facts, Prolog Tool Library, Prolog Environment, Visual Tool, System View Facts, Reusable Module Facts, Code Generator, Reusable Module Code]

The integration of these activities in a single flexible environment is fundamental to enabling the software engineer to control the process. This can be achieved by extracting from the source code a fine-grained program representation that can be shared by all the tools of the environment. The information contained in such a representation should be sufficient for recovering and visualising different views of a program and for generating the code of the reusable modules (see figure 3). In particular, the Prolog language allows the implementation of the program representation as a data-base of Prolog facts which can be easily manipulated for abstracting new sets of facts (for example, during the Candidature or the Qualification phase), for changing (deleting and/or adding) sets of facts (for example, during the Election phase), or for producing software metrics. Moreover, the information needed by visual tools, which are essential to make the comprehension of the system more intuitive, can be easily obtained from the data-base of Prolog facts. In the following subsections we present an integrated reuse reengineering environment for systems written in C, based on the Combined C Graph [28] program representation.
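The idea of several tools sharing one data-base of facts can be illustrated with a small sketch. The environment itself is written in Prolog; the Python below only mimics the pattern, and the predicate names (`calls`, `refers`, `param`, `shares_data`) and the three tool functions are hypothetical, not the actual CCG schema.

```python
# A shared fact base: each fact is a tuple (predicate, arg1, arg2).
FACTS = frozenset({
    ("calls", "main", "push"),
    ("calls", "main", "pop"),
    ("calls", "push", "grow"),
    ("refers", "push", "stack"),
    ("refers", "pop", "stack"),
})


def query(facts, predicate):
    """Argument tuples of all the facts with the given predicate."""
    return {fact[1:] for fact in facts if fact[0] == predicate}


def abstract_view(facts):
    """Candidature-style tool: derive summary facts from direct ones."""
    refs = query(facts, "refers")
    return {
        ("shares_data", p, q)
        for (p, d) in refs
        for (q, e) in refs
        if d == e and p < q
    }


def fan_in(facts, procedure):
    """Metric-style tool: number of call sites targeting a procedure."""
    return sum(1 for (_, callee) in query(facts, "calls") if callee == procedure)


def decouple(facts, procedure, variable):
    """Election-style tool: turn a global reference into a parameter.

    Deletes one fact and adds another, returning a new fact base.
    """
    changed = set(facts) - {("refers", procedure, variable)}
    changed.add(("param", procedure, variable))
    return frozenset(changed)


VIEW = abstract_view(FACTS)
ELECTED = decouple(FACTS, "push", "stack")
```

Each function stands for a separate tool, and all of them read the same facts, so the source code needs to be analysed only once. The abstraction adds derived facts, the metric only reads, and the Election-style change replaces facts.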

3.1 The Combined C Graph

The Combined C Graph is a fine-grained representation for C programs introduced by Kinloch and Munro [28] to provide a solution for two problems:

1. combining the features of different program representations into a single unified intermediate representation for a maintenance environment, along the same lines as Harrold and Malloy [21];
2. understanding problems induced by pointers and by expressions containing embedded side-effects (resulting from assignment operators, increment and decrement operators, the comma operator and function calls) and control flows (due to the short-circuit evaluation of boolean expressions [26]).

The first problem has been solved by: (i) designing a representation for C functions (FCCG) which consists of superimposing several types of intraprocedural edges (including control and data dependences) on a control flow graph and (ii) interconnecting the different FCCGs by various interprocedural edges, such as call interface edges (including binding edges between actual and formal parameters and between return statements and calling sites) and interprocedural data dependences. The second problem has been solved by: (i) considering pointer-induced aliasing during data flow analysis; (ii) providing an explicit representation on the FCCGs for embedded side-effects and control flows.

Whenever a statement contains embedded side-effects or control flows, an additional node is created for each subexpression containing a definition or a possible change in the control flow. Two edge types are used to relate the extra nodes of an expression: expression-use edges and lvalue definition edges. An expression-use edge from node $p$ to node $q$, $p \rightarrow_{eu} q$, indicates the evaluation of an expression at node $p$ followed by a use of the resulting value at $q$. For example, this situation occurs when the expression on the right-hand side of an assignment contains a side-effect or an embedded control flow. An lvalue is an expression referring to a named region of storage on the left-hand side of an assignment. An lvalue definition edge from node $p$ to node $q$, $p \rightarrow_{ld} q$, indicates the evaluation of an lvalue at node $p$ followed by a write to the corresponding storage location at node $q$. This situation occurs when the operand on the left-hand side of an assignment contains a side-effect.

While useful for program understanding and for recovering all the information needed by a structural candidature criterion, the representation could not be used for specification driven program slicing, because it lacked the syntactic and semantic information necessary to perform symbolic execution. The CCG has therefore been extended [16] with the abstract syntax trees of the functions in a C program. The syntax tree and the control flow graph of an FCCG are linked together by semantic edges. For example, each node of the control flow graph is linked to the root of the subtree of the corresponding expression. Figure 4 shows the CCG corresponding to the following simple program: main() { int a, b; a = 1 + (b = 0); while(a