
The Journal of Systems and Software 42 (1998) 153–164

An integrated environment for reuse reengineering C code

Gerardo Canfora a,*, Andrea De Lucia a, Malcolm Munro b,1

a Department of 'Ingegneria dell'Informazione ed Ingegneria Elettrica', University of Salerno, Faculty of Engineering at Benevento, Palazzo Bosco Lucarelli, Piazza Roma, 82100 Benevento, Italy
b Centre for Software Maintenance, University of Durham, Durham DH1 3LE, UK

Received 10 December 1996; received in revised form 10 November 1997

Abstract

The paper presents an integrated environment implemented in Prolog for reuse reengineering existing C systems. Different tools developed in the RE2 project are integrated in the environment through sharing a fine-grained representation for C programs, the Combined C Graph (CCG). Different views of a system can be abstracted and visualised from the data-base of Prolog facts implementing its CCG representation. Software metric tools evaluate the reengineering costs, while reengineering operations are expressed as transformation rules and a symbolic executor allows the production of the reusable module's specification. © 1998 Elsevier Science B.V. All rights reserved.

* Corresponding author. Tel.: +39 824 305804; fax: +39 824 21866; email: [email protected].
1 [email protected].
0164-1212/98/$19.00 © 1998 Elsevier Science B.V. All rights reserved. PII: S0164-1212(98)10006-7

1. Introduction

Existing software systems record a great deal of knowledge and expertise which is sometimes not available anywhere else than in the source code. Extracting and reengineering reusable modules from existing systems is a key means for salvaging this knowledge [4]. Reuse reengineering processes exploit reverse engineering and reengineering techniques for identifying and extracting software components from existing systems with the aim of populating repositories of reusable modules.

A reuse reengineering reference paradigm has been defined in the RE2 project [10] as a framework where available methods, techniques and tools can be used and experiments can be repeated. The RE2 paradigm decomposes a reuse reengineering process into five sequential phases called Candidature, Election, Qualification, Classification and Storage, and Search and Display. The RE2 project deals with the first three phases, which address the production of the reusable modules from existing systems. The latter two phases populate the repository and set up the environment for the retrieval and the reuse of modules during the development of new systems.

The Candidature phase produces sets of software components by using source code analysis and reverse engineering techniques. Each set of software components is a candidate to make up a reusable module. The first step of this phase is the definition of a candidature criterion, i.e. the criterion to be applied to produce a first approximation to the set of the reuse-candidate modules. This involves the definition of the model to which the criterion is applied and of the information needed to make up an instance of the model. The Election phase transforms reuse-candidate modules into reusable modules by decoupling each set of software components from the external environment and clustering it into a module according to a predefined template. The Qualification phase involves reverse engineering activities for the production of the specifications of the modules obtained in the Election phase.

Integrating the reuse reengineering activities in a single, flexible environment is fundamental to enable the software engineer to control and understand the process. In this paper we propose the use of a fine-grained program representation for C programs, called Combined C Graph (CCG) [25], to integrate reuse reengineering tools. The focus is on the integration of the reverse engineering tools developed within the RE2 project to support the candidature phase. The CCG of a system is produced by code analysis tools in the form of a data-base of Prolog facts. The data-base is shared by different reverse engineering tools written in Prolog for abstracting different types of reuse-candidate modules. A graphical tool visualises different views of the system to make its comprehension more intuitive to the user. Election and qualification tools for producing


software metrics, restructuring and reengineering the candidate modules, and performing symbolic execution are also integrated in the environment.

The remainder of the paper is organised as follows. Section 2 discusses related work and Section 3 recalls different candidature criteria developed in the RE2 project. Section 4 illustrates the overall architecture of the RE2 reuse reengineering environment and introduces the CCG program representation. Sections 5 and 6 discuss the integration of candidature tools and tools supporting the election and qualification phases, respectively. Finally, Section 7 discusses experimental results and gives some concluding remarks.

2. Related work

Two main approaches have been proposed in the literature to achieve tool integration. The first approach consists of using a core program representation shared by all the tools in the environment. This idea is common to several meta-environments for the generation of software development tools, such as Gandalf [21] and IPSEN [19]. However, these meta-environments have been conceived for the generation of software production environments and scarcely support reengineering and reuse of existing software. The main feature of Gandalf is the ability to semi-automatically generate families of software development environments: the ALOE generator takes a description of the language to be manipulated as an input and produces a syntax driven editor for that language. An abstract syntax tree is shared among the tools of the generated environments. Whilst Gandalf is mainly oriented to the production of programming environments, IPSEN supports the generation of tools manipulating high level documents. Graph schemes, defined in an object oriented fashion, are used to describe the documents, while a tool specification language, PROGRESS, allows the description of the intended manipulations in the form of graph rewriting rules.
The second approach to tool integration consists of defining a formalism to describe the communication interfaces among tools. An example of such a formalism is the Interface Description Language (IDL) [26]. In the software development environments based on IDL, tool integration is achieved by describing the interfaces (i.e., the data structures through which the tools communicate) between the components of the environment. Each tool manipulates an internal representation suitable for the specific application and includes a reader and a writer to communicate with other tools in the environment. The reader translates data from the IDL format to the internal form; the writer produces the output data in the IDL format from the internal representation. This approach derives from the need to integrate existing software tools each using its own internal program representation.

The RE2 environment encompasses features from both these approaches. A core representation is used to store all program information, thus avoiding the waste of storage space due to the duplication of pieces of information in different representations. New tools developed within the environment are designed to directly exploit the CCG program representation, while the integration of existing tools is achieved by means of bridging scripts that derive the particular representations they require from CCG.

The idea of sharing a common program representation to integrate tools has also been largely exploited to build maintenance and reverse engineering environments. The GENOA/GENII system [18] is an environment for producing code analysers. The GENOA sub-system contains a specification language to generate reverse engineering tools for a given programming language. The creation of a tool consists of specifying a traversal of the program syntax tree and the output to be produced during the traversal. The GENII sub-system customises the environment for a particular programming language. A pitfall of this system is the inability to modify the internal representation and to reuse the results of an analysis for further manipulations within the environment.

Ward and Bennett [28] apply formal methods to the maintenance and evolution of legacy systems. Their approach exploits formal program transformations based on the theory of program equivalence and infinitary logic to restructure and simplify legacy code and to abstract high level specifications. The program is translated into an equivalent abstract syntax tree based form expressed in a wide spectrum language, WSL, in which all subsequent operations are performed. The tools of the environment are written in Meta-WSL, an extension of WSL used for representing formal transformations.
Approaches based on transformations are mainly oriented to the translation of the original program into more abstract forms that make reverse engineering and reengineering operations easier. However, unlike the GENOA/GENII system [18], these environments do not provide report-generator building facilities. The RE2 environment comprises restructuring and reengineering tools based on the transformation of the underlying program representation and facilities for generating documents and reports. In addition, it also includes a flexible system for querying source code.

3. Candidature criteria

Candidature criteria have been defined in the RE2 project following the paradigm described by Canfora et al. in [5]. A set of direct relations defines the system model needed to apply the criterion while a set of summary relations describes the collections of components that form the reuse-candidate modules. Summary relations are obtained by combining direct relations into expressions showing how the reuse-candidate modules can be derived from an instance of the system's model.

Candidature criteria can be divided into structural criteria and specification driven criteria [17]. Structural criteria extract the sets of reuse-candidate components by exploiting structural properties based on a metric model or on the type of the abstraction to be identified. These criteria produce a large set of candidate modules when applied to a system. A Concept Assignment Process [3] selects the subset of reuse-candidate modules that can be associated with human-oriented concepts. Only such modules will be passed to the Election phase. Basili and Caldiera [4] propose a criterion based on metrics. Examples of structural criteria developed in the RE2 project include the identification of functional and data abstractions:

• Aggregating procedures on the call graph. Cimitile and Visaggio [16] propose a candidature criterion for the identification of functional abstractions based on the aggregation of procedures² on the call graph. A single direct relation, namely $CALL \subseteq PP \times PP$, defines the system's model; $PP$ denotes the set of procedures in the system. The criterion applies the concept of dominance among the nodes of a directed graph [23] to the call graph. A set of heuristic rules allows the aggregation of procedures into reuse-candidate modules and the identification of use and composition relationships among them.

• Identifying objects. Canfora et al. [10] propose a method to identify candidate objects based on a model of the system called variable-reference graph [12]. This is a bipartite graph described by the direct relation $DAT \subseteq PP \times DD$, which represents the references that procedures ($PP$) make to global variables ($DD$) in the system.
The criterion identifies the maximal connected sub-graphs of the variable-reference graph; each of the identified sub-graphs contains the global variables that define the state of a candidate object and the procedures that implement the methods.

• Abstract data types candidature criteria. A similar criterion has also been used to identify abstract data types [6]. The criterion exploits a bipartite graph to model the system and identifies the candidate abstract data types in the form of maximal connected sub-graphs. The graph is described by the direct relation $TYP \subseteq PP \times TT$, which represents the uses of user-defined data-types ($TT$) in the headings of procedures ($PP$) to declare formal parameters and/or return values.

Specification driven criteria presuppose the existence of a full or partial specification of an abstraction and search existing systems for pieces of code that implement the specification. For each specification they produce one module. As an example, Wilde et al. [29] propose the use of test cases as a specification to locate user functionality in existing code. The RE2 project has developed several specification driven criteria for identifying functional abstractions:

• Specification driven program slicing. A specification driven candidature criterion based on program slicing [30] and symbolic execution [24] has been proposed by Cimitile et al. [17]. The aim of the criterion is the isolation of a program slice that implements the specification of a functional abstraction. A set of direct relations is used to implement a fine-grained program model based on an attributed abstract syntax tree. The criterion assumes that a formal specification of the function to be isolated is given in the form of a precondition and a postcondition. Symbolic execution and theorem proving techniques are exploited to derive a slicing criterion from the precondition and postcondition. The slice is then extracted by computing the transitive closure of the data and control dependencies [20] captured in the direct relations.

• Identifying conditioned components. Canfora et al. [9] propose a candidature criterion to isolate conditioned software components that implement reuse-candidate functions, once these have been fully or partially specified in terms of binding conditions. These are either the function's precondition and postcondition, or conditions that trigger the execution of the code implementing the function in the context of the program. An example of a binding condition that is not a function's precondition is the check on a variable carrying the selection from a menu of functions. The criterion uses a set of direct relations that implement a dependence-based program model. Binding conditions are mapped onto predicates on the program's variables, and the code components whose execution is bound by these predicates are then isolated by exploiting program dependencies.

Candidature criteria have been implemented according to the scheme shown in Fig. 1. A static code analysis tool (extractor) is used to extract the set of direct

² The generic name "procedure" refers to any type of primitive that programming languages provide to implement functional abstractions, e.g. programs, procedures, functions, and subroutines.


Fig. 1. A logic based scheme for reverse engineering tools.



relations as a data-base of Prolog facts, while Prolog rules (abstractor) are used to produce the summary relations. Finally, a visual tool displays the information abstracted by reverse engineering. However, although the implementation scheme is the same, different prototypes of static analysis tools have been implemented, each of which only extracts the direct relations needed for applying one criterion. This is a limitation for the integration of the abstractors in a reuse reengineering environment and for the full comprehension of the system being analysed.

4. An integrated reuse reengineering environment

The problem of integrating reverse engineering tools that support the candidature phase of a reuse reengineering process can be solved if the same static code analysis tool extracts, for a given language, the direct relations needed for the application of all the criteria. This can be achieved by extracting from source code a fine-grained program representation and using it to derive all the direct relations needed to apply the different candidature criteria. Such a representation would also allow the integration of tools that support activities in the election and qualification phases, such as:

• the extraction of the software components implementing a module;
• the concept assignment process;
• the evaluation of the effort needed for restructuring and reengineering operations;
• the restructuring and reengineering operations;
• the qualification of the reusable modules.

Fig. 2 shows the overall architecture of the RE2 integrated reuse reengineering environment, which is built around a unique program representation for C code called CCG [25]. CCG is extracted in the form of a set of Prolog facts by a static code analyser (extractor) realised using the compiler-writing facilities Lex and Yacc. The Prolog tool library contains both the bridging procedures that derive the direct relations needed to apply any particular candidature criterion from CCG and the rules that realise the summary relations defining the criterion (abstractor). It also comprises facilities to extract the portion of CCG associated with a reuse-candidate module and to restructure and reengineer it; the result is a new set of facts from which the code of the reusable module is then (re-)generated.

Fig. 2. RE2 integrated reuse reengineering environment.

Finally, the Prolog tool library contains facilities to produce design level documents aimed at supporting the comprehension of the modules extracted and their meaning within the context of the original system. This is of considerable help to the software engineer during the concept assignment process and in the qualification phase, when a specification has to be associated with the reusable modules. In particular, facilities have been implemented to produce hierarchical Data Flow Diagrams (according to the method proposed by Benedusi et al. [1]) and Structure Charts (according to the method proposed in Ref. [2]). Furthermore, Prolog rules have been implemented that answer queries of software engineers during the reengineering and evolution of a reusable module. These queries concern inter-procedural data-flow and analysis of the impact of a change [5].

Several reasons have motivated the choice of Prolog as an implementation platform (see [11] for a detailed discussion). Prolog programs can be viewed as a powerful extension to relational databases, the extra power coming from the ability to specify (recursive) rules. Moreover, the declarative style of Prolog is well suited to describing software analysis, and its prototyping and meta-programming capabilities allow the rapid implementation and evolution of analysis tools.

4.1. The combined C graph

The CCG is a fine-grained representation for C programs introduced by Kinloch and Munro [25] to provide a solution to two problems:

1. combining the features of different program representations into a unique unified intermediate representation for a maintenance environment, along the same lines as Harrold and Malloy [22];
2. understanding problems induced by pointers and expressions containing embedded side-effects (resulting from assignment operators, increment and decrement operators, comma operator and function calls) and control flows (due to the short-circuit evaluation of the boolean expressions).

The first problem has been solved by: (i) designing a representation for C functions (FCCG) which consists of superimposing several types of intraprocedural edges (including control and data dependences) on a control flow graph and (ii) interconnecting the different FCCGs by various interprocedural edges, such as call interface edges (including binding edges between actual and formal parameters and between return statements and call sites) and interprocedural data dependencies. The second problem has been solved by: (i) considering pointer-induced aliasing during data flow analysis and (ii) providing explicit representation on the FCCGs for embedded side-effects and control flows.

Whenever a statement contains embedded side-effects or control flows, an additional node is created for each subexpression containing a definition or a possible change in the control flow. Two edge types are used to relate the extra nodes of an expression: expression-use edges and lvalue definition edges. An expression-use edge from node p to node q, $p \rightarrow_{eu} q$, indicates the evaluation of an expression at node p followed by a use of the resulting value at q. For example, this situation occurs when the expression in the right-hand side of an assignment contains a side-effect or an embedded control flow. An lvalue is an expression referring to a named region of storage in the left-hand side of an assignment. An lvalue definition edge from node p to node q, $p \rightarrow_{ld} q$, indicates the evaluation of an lvalue at node p followed by a writing to the corresponding storage location at node q. This situation occurs when the operand in the left-hand side of an assignment contains a side-effect.

While useful for program understanding and for recovering all the information needed by a structural candidature criterion, the representation could not be used for specification driven program slicing, because of the absence of the syntactic and semantic information necessary to perform symbolic execution. The CCG has therefore been extended [14] with the abstract syntax trees of the functions in a C program. The syntax tree and the control flow graph of a FCCG are linked together by semantic edges. For example, each node of the control flow graph is linked to the subtree's root of the corresponding expression. Fig. 3 shows the CCG corresponding to the following simple program:

    main()
    {
        int a, b;
        a = 1 + (b = 0);
        while (a < 10)
            b += double(&a);
    }

    /* "double" is a reserved word in C; a compilable version needs another name */
    int double(int *p)
    {
        return (*p)++ * 2;
    }

For the sake of simplicity, variable and formal parameter declarations and variable references in the syntax tree are represented by the corresponding identifiers, without semantic information. Fig. 3(a) shows the abstract syntax tree, while the control flow graph and the other edges are depicted in Fig. 3(b). The same node identifier in the abstract syntax tree and in the control flow graph expresses a relation between the corresponding syntax tree node and control flow node. Appendix A shows the structure of the data base of Prolog facts representing a CCG.
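To make the role of the extra nodes and of the expression-use edges more tangible, the following toy sketch splits one expression statement into nodes. It is written in Python rather than in the Prolog of the RE2 tools, and it is only an illustration under stated assumptions: the tuple encoding of expressions, the function name `split_statement`, the operator names `assign` and `post_incr`, and the `<nK>` placeholders are all invented here and do not reflect the format produced by the CCG extractor. Lvalue definition edges, function calls and short-circuit control flow are not modelled.

```python
import itertools

# Operators treated as embedded side effects in this toy encoding (assumption).
SIDE_EFFECT_OPS = {"assign", "post_incr"}


def split_statement(expr):
    """Return (nodes, eu_edges) for one expression statement.

    nodes maps a node id to the text evaluated at that node;
    eu_edges lists (p, q) pairs: the value computed at p is used at q.
    """
    nodes, eu_edges = {}, []
    ids = itertools.count(1)

    def visit(e, is_root=False):
        # Leaves (identifiers, constants) carry no side effect.
        if not isinstance(e, tuple):
            return str(e), []
        op, *args = e
        parts = [visit(a) for a in args]
        texts = [text for text, _ in parts]
        pending = [p for _, ps in parts for p in ps]
        if op == "assign":
            text = f"{texts[0]} = {texts[1]}"
        elif op == "post_incr":
            text = f"{texts[0]}++"
        else:  # plain binary operator such as "+" or "*"
            text = f"{texts[0]} {op} {texts[1]}"
        if is_root or op in SIDE_EFFECT_OPS:
            # A side-effecting subexpression (or the statement itself)
            # gets its own node; pending values flow into it.
            node_id = next(ids)
            nodes[node_id] = text
            eu_edges.extend((p, node_id) for p in pending)
            return f"<n{node_id}>", [node_id]
        return text, pending

    visit(expr, is_root=True)
    return nodes, eu_edges


# a = 1 + (b = 0);
nodes, eu_edges = split_statement(("assign", "a", ("+", 1, ("assign", "b", 0))))
print(nodes)
print(eu_edges)
```

In this encoding the embedded assignment `b = 0` receives a node of its own, and a single expression-use edge links it to the node of the enclosing assignment to `a`.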


5. Integrating the candidature tools

The first step for the integration of reverse engineering tools that support the application of candidature criteria consists of recovering the direct relations from CCG. CCG has already been used to apply specification driven candidature criteria [15,17]: in this case the direct relations needed to apply the criterion are explicitly contained in CCG. For structural candidature criteria, simple procedures can be defined to recover the relevant direct relations. For example, the relations CALL and DAT, defined above, are implemented by the clauses:

    proc_call(Fun1, Fun2) :-
        call(_, Fun1, _, _, Fun2, 0).

and

    proc_ref_glob_var(Function, Var) :-
        st_node(File, Function, id, Node),
        id_decl(File, Function, Node, obj_loc(_, @external, sc([],@), Var)).

respectively. The first rule states that a function Fun1 calls the function Fun2 if a call edge exists between a call site in the FCCG of Fun1 and the entry node of the FCCG of Fun2. The second rule asserts that a function Function references the global variable Var if a node Node of type id exists in the syntax tree of the FCCG of Function (fact st_node) and the corresponding variable identifier Var is externally declared (fact id_decl). Similarly, the relation TYP is implemented by the following procedure:

    proc_use_type_in_interface(Function, Type) :-
        node(File, Function, _, formal, _, [ex(def, ParForm)]),
        object(File, Function, sc([],@), Type, ParForm, _),
        type(_, _, _, _, Type, _).
    proc_use_type_in_interface(Function, [Tag_Type, Name]) :-
        node(File, Function, _, formal, _, [ex(def, ParForm)]),
        object(File, Function, sc([],@), [Tag_Type, Name], ParForm, _),
        member(Tag_Type, [struct, union, enum]).
    proc_use_type_in_interface(Function, Type) :-
        object(File, _, sc([],@), Type, Function, [@fun|_]),
        type(_, _, _, _, Type, _).
    proc_use_type_in_interface(Function, [Tag_Type, Name]) :-
        object(File, _, sc([],@), [Tag_Type, Name], Function, [@fun|_]),
        member(Tag_Type, [struct, union, enum]).



Fig. 3. A sample CCG. (a) Abstract syntax trees. (b) Other CCG edges.

The first two rules correspond to the declaration of a formal parameter ParForm in the interface of a function Function; the type of ParForm has either been defined by means of a typedef statement, or it is the name of a structure, union, or enumeration. The other two rules deal with return values. Once the direct relations needed to apply a criterion have been computed, the candidate modules may be derived by executing the Prolog rules that implement the summary relations. Section 5.1 shows an example of implementation of summary relations.
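As a rough illustration of the two steps just described (direct relations first, summary relations second), the sketch below derives a DAT-like relation from a handful of made-up facts and then groups procedures and global variables into the maximal connected sub-graphs used by the object-identification criterion of Section 3. It is written in Python rather than in the Prolog of the RE2 environment; the three-field fact layout, the names `direct_relation_dat` and `candidate_objects`, and the sample data are assumptions introduced for this example and are much simpler than the real CCG facts.

```python
from collections import defaultdict

# Simplified stand-ins for CCG facts: (function, identifier, storage class).
# The real id_decl facts carry more fields; this shape is an assumption.
ID_REFS = [
    ("push", "stack", "external"), ("push", "top", "external"),
    ("pop", "stack", "external"), ("pop", "top", "external"),
    ("pop", "tmp", "local"),
    ("log_msg", "logfile", "external"), ("flush_log", "logfile", "external"),
]


def direct_relation_dat(id_refs):
    """DAT: (procedure, global variable) pairs for references to externals."""
    return {(fun, var) for fun, var, storage in id_refs if storage == "external"}


def candidate_objects(dat):
    """Connected components of the bipartite variable-reference graph.

    Each component is returned as (state variables, methods).
    """
    adjacency = defaultdict(set)
    for proc, var in dat:
        adjacency[("P", proc)].add(("D", var))
        adjacency[("D", var)].add(("P", proc))

    seen, objects = set(), []
    for start in sorted(adjacency):
        if start in seen:
            continue
        # Depth-first visit of one component.
        component, stack = set(), [start]
        seen.add(start)
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        state = sorted(name for kind, name in component if kind == "D")
        methods = sorted(name for kind, name in component if kind == "P")
        objects.append((state, methods))
    return objects


DAT = direct_relation_dat(ID_REFS)
for state, methods in candidate_objects(DAT):
    print("state:", state, "methods:", methods)
```

With the sample facts, the local variable `tmp` is discarded when DAT is derived, and two candidate objects emerge: one around `logfile` and one around `stack` and `top`.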

5.1. An example of summary relations

The candidature criterion chosen to exemplify how summary relations can be implemented in the RE2 reuse reengineering environment is the criterion for identifying objects outlined in Section 3. The criterion consists of two summary relations, $DSDAT \subseteq DD \times DD$ and $PPDAT \subseteq DD \times PP$, which produce the state variables and the methods of a candidate object, respectively. State variables and methods are identified based on two rules:

1. Two global variables $d_1$ and $d_2$ contribute to define the same candidate object if they belong to a reference cobweb, i.e. if $(d_1, d_2) \in DSDAT \equiv (\mathrm{trans}(DAT) \circ DAT)^{*}$, where $\mathrm{trans}(R)$ and $R^{*}$ denote the transpose and the reflexive transitive closure of the relation $R$, respectively, and relations are composed from left to right.


2. A procedure $p$ defines a method of the object to which a global variable $d$ belongs if $p$ directly references one of the variables in the cobweb around $d$, i.e. if $(d, p) \in PPDAT \equiv (\mathrm{trans}(DAT) \circ DAT)^{*} \circ \mathrm{trans}(DAT)$.

The relation DSDAT is implemented by the following procedure:

    obj_state(D1, D2) :-
        obj_state(D1, D2, [D1]).
    obj_state(D, D, _).
    obj_state(D1, D2, Visited) :-
        trans_dat_dat(D1, Di),
        not(member(Di, Visited)),
        obj_state(Di, D2, [Di|Visited]).

which verifies whether or not the global variables D1 and D2 belong to the same reference cobweb. Executing the procedure obj_state requires that the pairs of global variables in the relation $\mathrm{trans}(DAT) \circ DAT$, i.e. the pairs of distinct global variables referenced by a same procedure, have been pre-computed and asserted:

    trans_dat_dat :-
        proc_ref_glob_var(Pi, D1),
        proc_ref_glob_var(Pi, D2),
        D1 \== D2,
        not(trans_dat_dat(D1, D2)),
        assert(trans_dat_dat(D1, D2)),
        fail.
    trans_dat_dat.

Similarly, the relation PPDAT is implemented by the clause:

    obj_method(D, P) :-
        obj_state(D, Di),
        proc_ref_glob_var(P, Di).

which verifies whether or not the procedure P defines one of the methods of the object that includes the global variable D as a state variable. In addition to the Prolog procedures that implement the summary relations DSDAT and PPDAT, further procedures have been implemented to simplify querying the system. As an example, the procedure obj(D, D_Set, P_Set) builds an object around a global variable D, i.e. returns the set D_Set of all global variables which belong to the reference cobweb that includes D, and the set P_Set of all the procedures that reference one or more variables in the cobweb.

6. Integrating the election and qualification tools

The code that corresponds to each of the reuse candidate modules identified by using the candidature tools has to be decoupled from the rest of the system and clustered into an actually reusable module.
Such reengineering operations can be performed on the portion of CCG relative to the candidate module, from which the code of the new module can then be automatically generated. This requires the preliminary extraction of the CCG sub-graph that corresponds to the code of a candidate module. As an example, the portion of CCG corresponding to the global variables that define the state of a candidate object obj(D, D_Set, P_Set) can be extracted by formulating the following query:

    get_obj_state(D, State_Var_decls) :-
        findall(object(File, @external, Scope, Type, D1, Access),
                (obj(D, D_Set, P_Set),
                 object(File, @external, Scope, Type, D1, Access),
                 member(D1, D_Set)),
                State_Var_decls).

Similarly, the CCG sub-graphs for each of the methods of the candidate object can be obtained by formulating queries to retrieve the facts associated with the functions in the set P_Set.

Extracting the CCG sub-graph that corresponds to the code of a candidate module may also be useful to support software engineers during the concept assignment process, when the candidate modules produced by structural criteria have to be associated with human oriented concepts. The RE2 reuse reengineering environment comprises a graph visualisation facility that computes the optimal layout of a diagram based on the typology of the graph. The visualisation of the CCG sub-graph of a candidate module has proven useful in helping software engineers to achieve a deep understanding of the fine details of the source code.

Understanding a piece of software and associating it with human oriented concepts is an iterative process of guessing, constructing hypotheses and verifying them (see [27] for an overview and a comparison of cognition models which apply to software comprehension). The RE2 environment supports this process by allowing hypotheses to be automatically verified once they have been (re-)formulated in terms of queries on the CCG.
As an example, the following program helps a software engineer to verify the existence of a swap between variables V1 and V2 in a function F (N1, N2 and N3 are the three control flow nodes involved in the swap):

    swap(V1, V2, F, N1, N2, N3) :-
        node(_, F, N1, expr, @, _), assign(F, N1, Temp, V1),
        node(_, F, N2, expr, @, _), assign(F, N2, V1, V2),
        node(_, F, N3, expr, @, _), assign(F, N3, V2, Temp),
        path(N1, N2, _), path(N2, N3, _),
        flow(_, F, N1, _, F, N3).

    assign(Fun, CF_Node, V1, V2) :-
        st_cf_link(_, Fun, expr_stat, ST_Node, CF_Node),
        st_node(_, Fun, assign, ST_Node),
        get_var(left, Fun, ST_Node, V1),
        get_var(right, Fun, ST_Node, V2).

    get_var(Where, Fun, Father, V) :-
        st_link(_, Fun, Where, Father, Node),
        st_node(_, Fun, id, Node),
        id_decl(_, Fun, Node, V).

The procedure path verifies the existence of a path between two nodes of a control flow graph, while the clause assign checks whether or not a statement copies the value from one variable to another. The latter uses the clause get_var to navigate the syntax tree.

The effort required to decouple a candidate module from the original environment and to reengineer it according to a predefined template may sometimes exceed the cost of developing the reusable module from scratch. The decision on whether or not reuse reengineering is a cost-effective choice depends on the effort required in the election phase. Querying the CCG furnishes useful information for estimating this effort. For example, it allows us to capture the coupling between a software component belonging to a candidate module and a software component that does not belong to it; common coupling⁴ between two functions that do not belong to the same candidate module is captured by the following simple Prolog clause:

    common_coupled(Fun, Fun1) :-
        mod_fun(Module, Fun),
        id_decl(_, Fun, _, obj_loc(File, @external, sc([],@), Var)),
        id_decl(_, Fun1, _, obj_loc(File, @external, sc([],@), Var)),
        % the negation comes last, so that Fun1 is bound when it is evaluated
        not mod_fun(Module, Fun1).

A further example is the case of a function belonging to a candidate module and referencing a global variable that does not belong to the module. References of this type have to be replaced with references to formal parameters, once these have been added to the interface of the function. Actual parameters also have to be added in any call to this function.
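The common-coupling query can be pictured on toy data as follows. The sketch is in Python rather than in the Prolog of the RE2 environment, and it is only an approximation under stated assumptions: the dictionary `MODULE_OF` stands in for the mod_fun facts, the list `GLOBAL_REFS` stands in for the references to external variables recorded by id_decl facts, and the function name `common_coupled` and all sample names are invented for the example.

```python
from collections import defaultdict

# Hypothetical assignment of functions to candidate modules (cf. mod_fun).
MODULE_OF = {"push": "stack_mod", "pop": "stack_mod", "report": "io_mod"}

# Hypothetical (function, global variable) references (cf. id_decl on externals).
GLOBAL_REFS = [
    ("push", "top"), ("pop", "top"),
    ("report", "top"), ("report", "logfile"),
]


def common_coupled(module_of, global_refs):
    """Return sorted (fun, fun1, var) triples: fun belongs to a candidate
    module, fun1 does not belong to that module, and both reference var."""
    users = defaultdict(set)
    for fun, var in global_refs:
        users[var].add(fun)

    triples = set()
    for var, funs in users.items():
        for fun in funs:
            if fun not in module_of:
                continue  # only functions of a candidate module are of interest
            for fun1 in funs:
                if module_of.get(fun1) != module_of[fun]:
                    triples.add((fun, fun1, var))
    return sorted(triples)


for fun, fun1, var in common_coupled(MODULE_OF, GLOBAL_REFS):
    print(f"{fun} and {fun1} share the global variable {var}")
```

Counting the triples that involve a given module gives a crude indicator of how much decoupling work the module would need; the actual cost model used in the environment is not specified here.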
The effort required to decouple a function F belonging to a candidate module M from an external global variable V depends on the number N_Ref of references to V and the number N_Calls of calls to F:

    decoupling_effort(F, V, Effort) :-
        mod_fun(M, F),
        not mod_dat(M, V),
        findall(ref(F, Node, V),
                id_decl(_, F, Node, obj_loc(_, _, _, V)),
                Ref_set),
        card(Ref_set, N_Ref),
        findall(f_call(F1, Node, F),
                call(_, F1, Node, _, F, 0),
                Call_set),
        card(Call_set, N_Calls),
        cost_function(N_Ref, N_Calls, Effort).

4 Two procedures are common coupled if they share global data [31].

The procedure card returns the number of elements in a list and cost_function implements some form of cost function.

Decoupling and other reengineering operations are implemented as transformations on the CCG. As an example, a global variable Var referenced by a function Fun may be transformed into a formal parameter of the function with the following steps: (i) searching for the references of the function Fun to the variable Var (by querying the fact id_decl); (ii) creating an object fact for the formal parameter (this must be a pointer if the variable is also defined by the function); (iii) inserting a formal node in the control flow graph; and (iv) changing all the references to Var in the syntax tree and in the control flow graph into references to the new formal parameter.

The last two activities in the production of the reusable modules are the generation of the code and the qualification of the reusable modules. In the RE2 environment, the code of the reusable modules is generated by the back-end of a compiler that uses the CCG as an intermediate representation, and a symbolic executor allows formal specifications to be recovered [14]. Symbolic execution consists of using symbolic values as the input of a program. Accordingly, states are associated with logic formulas (called path-conditions [24]) which carry information about the execution paths traversed. Because of conditional statements, multiple executions can be generated. Each execution is represented by a token (implemented as a Prolog fact) which moves on the CCG control flow sub-graph and carries the symbolic state. The different executions are then merged into a single final symbolic state, from which the function postcondition is derived. The specification produced by symbolic execution can be analysed by a software engineer with domain expertise, together with the information gathered during the concept assignment process, in order to provide an informal description of the reusable module.
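The token-based view of symbolic execution can be illustrated on a two-branch control flow graph. The sketch below is an assumption-laden simplification and not the RE2 symbolic executor: cfg_node/2 and cfg_edge/3 are hypothetical, simplified stand-ins for the CCG control flow facts, the token is passed as arguments rather than stored as a fact, and the executions are merged simply by collecting one path-condition/state pair per path. It encodes the C fragment `if (x > 0) y = x; else y = 0 - x;`.

```prolog
:- use_module(library(lists)).

% cfg_node(Node, Statement) and cfg_edge(From, To, Label): simplified CFG.
cfg_node(n1, test(x > 0)).
cfg_node(n2, assign(y, x)).
cfg_node(n3, assign(y, 0 - x)).
cfg_node(n4, exit).

cfg_edge(n1, n2, true).
cfg_edge(n1, n3, false).
cfg_edge(n2, n4, uncond).
cfg_edge(n3, n4, uncond).

% sym(+Expr, +State, -SymExpr): replace program variables in a binary
% expression by their current symbolic values.
sym(E, _, E) :- number(E), !.
sym(V, State, S) :- atom(V), !, memberchk(V-S, State).
sym(E, State, S) :-
    E =.. [Op, A, B],
    sym(A, State, SA),
    sym(B, State, SB),
    S =.. [Op, SA, SB].

% update(+Var, +Value, +State, -NewState): rebind Var in the state.
update(Var, Value, State, [Var-Value|Rest]) :-
    selectchk(Var-_, State, Rest).

% branch(+Label, +Cond, -Conjunct): conjunct added to the path condition.
branch(true,  C, C).
branch(false, C, not(C)).

% run(+Node, +State, +PC, -FinalState, -FinalPC): move one token from
% Node to the exit; backtracking yields one solution per path.
run(Node, State, PC, State, PC) :-
    cfg_node(Node, exit).
run(Node, State, PC, FState, FPC) :-
    cfg_node(Node, assign(Var, Expr)),
    sym(Expr, State, Value),
    update(Var, Value, State, State1),
    cfg_edge(Node, Next, uncond),
    run(Next, State1, PC, FState, FPC).
run(Node, State, PC, FState, FPC) :-
    cfg_node(Node, test(Cond)),
    sym(Cond, State, SymCond),
    cfg_edge(Node, Next, Label),
    branch(Label, SymCond, Conjunct),
    run(Next, State, [Conjunct|PC], FState, FPC).

% postcondition(-Cases): PathCondition-State pairs for inputs x0 and y0.
postcondition(Cases) :-
    findall(PC-State,
            run(n1, [x-x0, y-y0], [], State, PC),
            Cases).
```

The query postcondition(Cases) returns two cases: under the path condition x0 > 0 the variable y holds x0, and under not(x0 > 0) it holds 0 - x0, which together describe an absolute-value computation.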

7. Concluding remarks

The paper presents the integrated environment developed within the RE2 project for reuse reengineering existing C systems. The integration of different tools is achieved by sharing a single fine-grained program representation, called CCG [25]. The RE2 environment has been tested in two pilot reuse reengineering projects on medium-sized C systems. The two pilot projects applied a structural candidature criterion (namely the method to identify objects and abstract them into classes) and a specification driven criterion (namely the specification driven slicing method to isolate functional abstractions).


The system chosen for applying the structural criterion for identifying objects was a Microsoft Windows based hypertext re-documentation system called SYSDOC, whose size is over 4000 LOC [7]. The system assists maintenance programmers in the incremental re-documentation of programs coded in languages such as COBOL and PL/1. An analysis of the system design documentation suggested the presence of meaningful data abstractions in one of the three sub-systems of SYSDOC, namely the file manager sub-system. Applying the criterion to this sub-system produced four candidate objects, three of which implemented file buffering for three different displaying facilities. A detailed analysis of the code of the three candidates revealed that they were instances of an implicit class consisting of eight state variables and nine methods. This class was made explicit and encapsulated into a module. The creation of this module entailed re-engineering operations to extract the code corresponding to one of the three instances and to de-couple it by removing all links to the rest of the system. It also required the generalisation of the set of global variables defining the object's state into a type definition and the implementation of the constructor and destructor methods. A module was also created for the fourth candidate object, which handled file names, extensions and path-names. The modules produced were successfully used to reengineer the SYSDOC system. As a result, the size of the system was reduced and its structure was simplified. This made the system easier to understand and reduced the maintenance effort (primarily by reducing the chance of introducing new errors due to the need for making changes to seemingly unrelated portions of code), as demonstrated by a subsequent maintenance process that included both corrective and perfective actions.

The second experiment [15] was conducted on a subsystem of the CCG extractor [25].
This subsystem had undergone two perfective maintenance operations that had increased its size from 3412 LOC to 5481 LOC, while not significantly changing the overall number of functions. Most of the new code was located in four pre-existing functions belonging to the module that produced the output of the analyser in the form of CCG Prolog facts. As a consequence, these functions did not comply with the internal quality standard that imposed 150 LOC as the maximum size of a function. They were therefore selected for the application of a specification driven slicing process with the aim of isolating meaningful sub-functions. The analysis of the code and the documentation of these functions suggested the existence of 18 well defined functionalities, which were specified in terms of preconditions and postconditions. The symbolic executor integrated in the RE2 environment helped to map each specification onto a slicing criterion, and the slicer automatically computed the slice implementing the corresponding functionality. The slices identified were extracted and reengineered into new functions. The original system was also reengineered by replacing the slices identified within the four functions with function calls. The case study demonstrated that even though the original design of a system followed functional decomposition principles, maintenance operations (in particular perfective maintenance) can add new functionalities (in terms of code) to existing functions. The case study also revealed the existence of duplicated code in different functions and even in the same function, due to the insertion of the code implementing a functionality at different points. The duplicated code was eliminated as a result of the reengineering phase.

In addition to these two pilot projects, whose aim was essentially to assess the strength of the RE2 integrated environment, several other case studies have been conducted to validate the single candidature criteria and the related tools. Examples include experiments in identifying abstract data types [8] and functional abstractions based either on the aggregation of procedures on the call graph [13] or on conditioned slicing [9]. At present, the RE2 environment is being tested on larger systems jointly with industrial partners.

The RE2 integrated environment is currently realised in Prolog. The main motivations for this choice were the prototyping and metaprogramming capabilities that allow tools to be rapidly built and easily evolved. Moreover, the interactive query mechanism of Prolog helps in the solution of specific problems. As an example, during the definition of the interfaces of one of the classes resulting from the SYSDOC program, a simple Prolog query revealed that two interface functions contributed to the implementation of the same method (they were always used together), namely reading a particular line from a file. The drawback of Prolog is that it suffers from technological problems related to time and space efficiency.
This limits the scalability of the environment to very large systems: the CCG of the four functions reengineered in the second experiment (about 1000 LOC) consisted of about 7000 abstract syntax tree nodes. This limitation can be overcome by using different technological platforms, mainly for the implementation of the CCG repository. Future work will be in the direction of migrating the environment towards the Software Refinery toolkit [32], whose object oriented repository is well suited to implementing a fine-grained model such as the CCG.

Appendix A. CCG Prolog facts

In the following, the CCG Prolog facts for the representation of declarations, abstract syntax tree, control flow graph, interprocedural edges and program dependences are illustrated.



A.1. Declarations

The data base contains the following facts for representing source code files and the symbol table:
· file(File-Name, File-ID) associates a source code file name with a unique identifier;
· type(File-ID, Function, sc(Stmt-Block, Storage-Specifier), Type-Specifier, Name, Access-List) for user-defined type declarations;
· tag(File-ID, Function, sc(Stmt-Block, Storage-Specifier), Tag-Type, Name, Member-List) for struct, union or enum definitions;
· object(File-ID, Function, sc(Stmt-Block, Storage-Specifier), Type-Specifier, Name, Access-List) for variable and function declarations.

In the facts type, tag and object, the terms File-ID, Function, Stmt-Block and Storage-Specifier are sufficient to determine the scope of the declaration. In the facts type and object, the terms Type-Specifier and Access-List express the type of the item identified by the term Name. In the fact tag, the term Tag-Type can assume the values struct, union or enum, Name is the name of the structured item and Member-List is a list of the item's components (enumerated constants, or terms of type mem(Type-Specifier, Name, Access-List) for struct and union).

A.2. Abstract syntax tree

Two types of facts are used for representing the nodes and the edges of the Abstract Syntax Tree (AST):
· st_node(File-ID, Function, Node-Type, Node-ID)
· st_link(File-ID, Function, Edge-Type, Node-From, Node-To)
where File-ID, Function and Node-ID identify a node of the AST, and Node-Type and Edge-Type denote the types of a node and an edge, respectively. Moreover, each node representing a compound statement (Node-Type is group) is associated with its scope by the fact
· scope(File-ID, Function, Stmt-Block, Node-ID)
while each node corresponding to an identifier (Node-Type is id) is associated with its declaration by the fact
· id_decl(File-ID, Function, Node-ID, obj_loc(File-ID, Function, sc(Stmt-Block, Storage-Specifier), Name))
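To make the declaration and syntax tree facts concrete, the sketch below shows one way the C fragment `int total; void add(int n) { total = total + n; }` might be encoded. It is an illustration only: the node and edge type names other than id and group, the empty Access-List values, the storage specifier used for the parameter and the two helper predicates id_type/4 and global_refs/2 are all assumptions, and @external and @ are written as quoted atoms for SWI-Prolog.

```prolog
% file(File-Name, File-ID)
file('counter.c', f1).

% object(File-ID, Function, sc(Stmt-Block, Storage), Type, Name, Access-List)
object(f1, '@external', sc([], '@'), int, total, []).
object(f1, add,         sc([], '@'), int, n,     []).

% st_node(File-ID, Function, Node-Type, Node-ID)
st_node(f1, add, group,  s1).
st_node(f1, add, assign, s2).
st_node(f1, add, id,     s3).   % total (assignment target)
st_node(f1, add, plus,   s4).
st_node(f1, add, id,     s5).   % total (operand)
st_node(f1, add, id,     s6).   % n

% st_link(File-ID, Function, Edge-Type, Node-From, Node-To)
st_link(f1, add, stmt,  s1, s2).
st_link(f1, add, left,  s2, s3).
st_link(f1, add, right, s2, s4).
st_link(f1, add, left,  s4, s5).
st_link(f1, add, right, s4, s6).

% id_decl(File-ID, Function, Node-ID, obj_loc(File-ID, Function, sc(..), Name))
id_decl(f1, add, s3, obj_loc(f1, '@external', sc([], '@'), total)).
id_decl(f1, add, s5, obj_loc(f1, '@external', sc([], '@'), total)).
id_decl(f1, add, s6, obj_loc(f1, add,         sc([], '@'), n)).

% id_type(?Fun, ?Node, ?Name, ?Type): declared type of the object that an
% identifier node refers to, found by joining id_decl with object.
id_type(Fun, Node, Name, Type) :-
    st_node(File, Fun, id, Node),
    id_decl(File, Fun, Node, obj_loc(DFile, DFun, Scope, Name)),
    object(DFile, DFun, Scope, Type, Name, _).

% global_refs(+Fun, -Names): distinct external objects referenced by Fun.
global_refs(Fun, Names) :-
    findall(Name,
            id_decl(_, Fun, _, obj_loc(_, '@external', _, Name)),
            All),
    sort(All, Names).
```

For instance, id_type(add, s6, Name, Type) binds Name to n and Type to int, and global_refs(add, Names) yields the single global total.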

A.3. Control flow graph

Two types of facts are used for representing nodes and edges of the Control Flow Graph (CFG):
· node(File-ID, Function, Node-ID, Node-Type, Node-Qual, Expr-List)
· edge(File-ID, Function, Node-From, Node-To, Edge-Label)
where File-ID, Function and Node-ID identify a node of the CFG, Node-Type is the type of the node and Node-Qual is an identifier that qualifies the node, or @ if undefined. Expr-List is a list of terms ex(Access, Elem) that expresses the accesses (definitions and/or uses) to the variables or constants of an expression, while Edge-Label denotes the label of a control flow edge (true, false, uncond). Expression-use and lvalue definition edges are respectively represented by the facts
· expuse(File-ID, Function, Node-From, Node-To)
· lvaldef(File-ID, Function, Node-From, Node-To)
while links between syntax tree and control flow graph nodes are expressed by facts of type
· st_cf_link(File-ID, Function, Edge-Type, Node-AST, Node-CFG)

A.4. Interprocedural edges

Interprocedural edges consist of call edges between the call site and the entry of the called procedure, parameter-binding edges between actual and formal parameters, and return expression-use edges between a return node and the node using the expression evaluated by the called function:
· call(File-ID1, Calling-Fun, Call-Node, File-ID2, Called-Fun, 0)
· bind(File-ID1, Calling-Fun, Actual, File-ID2, Called-Fun, Formal)
· return_expuse(File-ID2, Called-Fun, Return-Node, File-ID1, Calling-Fun, Call-Site)

A.5. Program dependencies

The control dependencies are represented by facts of the type:
· control(File-ID, Function, Node-From, Node-To, Edge-Label)
where Edge-Label can be true or false. The data flow dependencies are represented by facts of the type:
· flow(File-ID1, Function1, Node1, File-ID2, Function2, Node2)
that model both intraprocedural and interprocedural dependencies.
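As an illustration of how the dependence facts control and flow can be queried, the sketch below computes a simple backward closure over them for a summation loop. It is not the RE2 slicer: the facts are invented, the helper predicates depends/2, backward_slice/2 and slice_closure/3 are hypothetical, and it is assumed that both fact types are directed from the node depended upon (Node-From, Node1) to the dependent node (Node-To, Node2).

```prolog
:- use_module(library(lists)).

% Hypothetical function sum:
%   n2: s = 0;   n3: i = 1;   n4: while (i <= n)
%   n5:   s = s + i;   n6:   i = i + 1;   n7: return s;

% flow(File1, Fun1, Node1, File2, Fun2, Node2): Node2 uses a value
% defined at Node1.
flow(f1, sum, n2, f1, sum, n5).
flow(f1, sum, n2, f1, sum, n7).
flow(f1, sum, n5, f1, sum, n5).
flow(f1, sum, n5, f1, sum, n7).
flow(f1, sum, n3, f1, sum, n4).
flow(f1, sum, n3, f1, sum, n5).
flow(f1, sum, n3, f1, sum, n6).
flow(f1, sum, n6, f1, sum, n4).
flow(f1, sum, n6, f1, sum, n5).
flow(f1, sum, n6, f1, sum, n6).

% control(File, Fun, From, To, Label): To executes only if From takes
% the branch Label.
control(f1, sum, n4, n5, true).
control(f1, sum, n4, n6, true).

% depends(+Fun/Node, -Fun1/Node1): direct data or control dependence.
depends(F/N, F1/N1) :- flow(_, F1, N1, _, F, N).
depends(F/N, F/N1)  :- control(_, F, N1, N, _).

% backward_slice(+Criterion, -Slice): the criterion node and every node
% it transitively depends on, as a sorted list.
backward_slice(Start, Slice) :-
    slice_closure([Start], [], Visited),
    sort(Visited, Slice).

% slice_closure(+Worklist, +Visited, -AllVisited): worklist traversal
% that skips nodes already visited, so cyclic dependences terminate.
slice_closure([], Visited, Visited).
slice_closure([N|Rest], Visited, Out) :-
    (   memberchk(N, Visited)
    ->  slice_closure(Rest, Visited, Out)
    ;   findall(P, depends(N, P), Preds),
        append(Preds, Rest, Work),
        slice_closure(Work, [N|Visited], Out)
    ).
```

Taking the loop condition as criterion, backward_slice(sum/n4, Slice) returns the nodes n3, n4 and n6, that is, only the statements that maintain the loop counter.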

References

[1] P. Benedusi, A. Cimitile, U. De Carlini, A reverse engineering methodology to reconstruct hierarchical data flow diagrams for software maintenance, Proceedings of the International Conference on Software Maintenance, Miami, Florida, USA, IEEE Computer Soc. Press, Silver Spring, MD, 1989, pp. 180–191.
[2] P. Benedusi, A. Cimitile, U. De Carlini, Reverse engineering processes, design document production and structure charts, J. Syst. Software 19 (3) (1992) 225–245.
[3] T.J. Biggerstaff, B.G. Mitbander, D. Webster, Program understanding and the concept assignment problem, Comm. ACM 37 (5) (1994) 72–83.
[4] G. Caldiera, V.R. Basili, Identifying and qualifying reusable software components, IEEE Computer 24 (2) (1991) 61–70.
[5] G. Canfora, A. Cimitile, U. De Carlini, A logic-based approach to reverse engineering tools production, IEEE Trans. Software Engrg. 18 (12) (1992) 1053–1064.
[6] G. Canfora, A. Cimitile, M. Munro, A reverse engineering method for identifying reusable abstract data types, Proceedings of Working Conference on Reverse Engineering, Baltimore, Maryland, USA, IEEE Computer Soc. Press, Silver Spring, MD, 1993, pp. 73–82.
[7] G. Canfora, A. Cimitile, M. Munro, C.J. Taylor, Extracting abstract data types from C programs: A case study, Proceedings of the International Conference on Software Maintenance, IEEE Computer Soc. Press, Silver Spring, MD, 1993, pp. 200–209.
[8] G. Canfora, A. Cimitile, M. Munro, M. Tortorella, Experiments in identifying reusable abstract data types in program code, Proceedings of the Second Workshop on Program Comprehension, Capri, Italy, IEEE Computer Soc. Press, Silver Spring, MD, 1993, pp. 36–45.
[9] G. Canfora, A. Cimitile, A. De Lucia, G.A. Di Lucca, Software salvaging based on conditions, Proceedings of the International Conference on Software Maintenance, Victoria, Canada, IEEE Computer Soc. Press, Silver Spring, MD, 1994, pp. 424–433.
[10] G. Canfora, A. Cimitile, M. Munro, RE2: reverse engineering and reuse re-engineering, J. Software Mainten.: Res. Prac. 6 (2) (1994) 53–72.
[11] G. Canfora, A. Cimitile, M. Tortorella, Prolog for software maintenance, Proceedings of the International Conference on Software Engineering and Knowledge Engineering, Rockville, Maryland, USA, 1995, pp. 478–486.
[12] G. Canfora, A. Cimitile, M. Munro, An improved algorithm for identifying objects in code, Software Prac. Exper. 26 (1) (1996) 24–48.
[13] A. Cimitile, A.R. Fasolino, P. Maresca, Reuse-reengineering and validation via concept assignment, Proceedings of the International Conference on Software Maintenance, Montreal, Canada, IEEE Computer Soc. Press, Silver Spring, MD, 1993, pp. 216–225.
[14] A. Cimitile, A. De Lucia, M. Munro, Qualifying reusable functions using symbolic execution, Proceedings of the Second Working Conference on Reverse Engineering, Toronto, Canada, IEEE Computer Soc. Press, Silver Spring, MD, 1995, pp. 178–187.
[15] A. Cimitile, A. De Lucia, M. Munro, Identifying reusable functions using specification driven program slicing: A case study, Proceedings of the International Conference on Software Maintenance, Opio, Nice, France, IEEE Computer Soc. Press, Silver Spring, MD, 1995, pp. 124–133.
[16] A. Cimitile, G. Visaggio, Software salvaging and the dominance tree, J. Syst. Software 28 (2) (1995) 117–127.
[17] A. Cimitile, A. De Lucia, M. Munro, A specification driven slicing process for identifying reusable functions, J. Software Mainten.: Res. Prac. 8 (3) (1996) 145–178.
[18] P. Devanbu, GENOA/GENII - a customizable, language and front-end independent code analyzer, Proceedings of the Fourteenth International Conference on Software Engineering, Melbourne, Australia, IEEE Computer Soc. Press, Silver Spring, MD, 1992, pp. 307–319.


[19] G. Engels, C. Lewerentz, M. Nagl, W. Schäfer, A. Schürr, Building integrated software development environments part I: Tool specification, ACM Trans. Software Engrg. Methodologies 1 (2) (1992) 135–167.
[20] J. Ferrante, K.J. Ottenstein, J. Warren, The program dependence graph and its use in optimization, ACM Trans. Program. Lang. Syst. 9 (3) (1987) 319–349.
[21] A.N. Habermann, D. Notkin, Gandalf: Software development environments, IEEE Trans. Software Engrg. 12 (12) (1986) 1117–1127.
[22] M.J. Harrold, B. Malloy, A unified interprocedural program representation for a maintenance environment, IEEE Trans. Software Engrg. 19 (6) (1993) 584–593.
[23] M.S. Hecht, Flow Analysis of Computer Programs, Elsevier, New York, 1977.
[24] J.C. King, Symbolic execution and program testing, Comm. ACM 19 (7) (1976) 385–394.
[25] D.A. Kinloch, M. Munro, Understanding C programs using the combined C graph representation, Proceedings of the International Conference on Software Maintenance, Victoria, Canada, IEEE Computer Soc. Press, Silver Spring, MD, 1994, pp. 172–180.
[26] R. Snodgrass, The Interface Description Language, Computer Science Press, Rockville, MD, 1989.
[27] A. von Mayrhauser, A.M. Vans, Program comprehension during software maintenance and evolution, IEEE Computer 28 (8) (1995) 44–55.
[28] M.P. Ward, K.H. Bennett, Formal methods to aid the evolution of software, Internat. J. Software Engrg. and Knowledge Engineering 5 (1) (1995) 25–47.
[29] N. Wilde, J.A. Gomez, T. Gust, D. Strasburg, Locating user functionality in old code, Proceedings of the International Conference on Software Maintenance, Orlando, Florida, IEEE Computer Soc. Press, Silver Spring, MD, 1992, pp. 200–205.
[30] M. Weiser, Program slicing, IEEE Transactions on Software Engineering SE-10 (4) (1984) 352–357.
[31] E. Yourdon, L.L. Constantine, Structured Design: Fundamentals of a Discipline of Computer Program and System Design, Prentice Hall, Englewood Cliffs, NJ, 1979.
[32] Reasoning Systems, REFINE User's Guide, Palo Alto, CA, 1989.

Gerardo Canfora received the Laurea degree in Electronic Engineering from the University of Naples "Federico II", Italy, in 1989. From 1990 to 1991 he was with the Italian National Research Council (CNR). During 1991 he was at the Department of "Informatica e Sistemistica" of the University of Naples Federico II, Italy. From 1992 to 1993 he was a visiting researcher at the Centre for Software Maintenance of the University of Durham, UK. Since 1993 he has been an Assistant Professor in Computer Science at the Department of "Ingegneria dell'Informazione ed Ingegneria Elettrica" of the University of Salerno, Faculty of Engineering at Benevento, Italy. His research interests include software maintenance, program comprehension, reverse engineering, reuse, reengineering, and migration. He has served on the program committees of a number of international conferences and he was program co-chair of the 1997 International Workshop on Program Comprehension.

Andrea De Lucia received the Laurea degree in Computer Science from the University of Salerno, Italy, in 1991, the M.Sc. degree in Computer Science from the University of Durham, UK, in 1995, and the Ph.D. in Electronic Engineering and Computer Science from the University of Naples Federico II, Italy, in 1996. From 1991 to 1992 he was at the Department of "Informatica e Applicazioni" of the University of Salerno, Italy, where his research was funded by a scholarship from the Italian National Research Council (CNR). From 1992 to 1996 he was at the Department of "Informatica e Sistemistica" of the University of Naples Federico II, Italy. From 1994 to 1995 he was also a visiting postgraduate student at the Centre for Software Maintenance of the University of Durham, UK. Since 1996 he has been an Assistant Professor in Computer Science at the Department of "Ingegneria dell'Informazione ed Ingegneria Elettrica" of the University of Salerno, Faculty of Engineering at Benevento, Italy. His research interests include reverse engineering, reuse, reengineering, migration, program comprehension, and visual languages.

Malcolm Munro is Senior Lecturer in Computer Science at the University of Durham, UK. He is a founder member of the Centre for Software Maintenance and has been an active researcher in software maintenance since 1987. His particular research topics are program comprehension, software visualisation, reverse engineering, and reuse reengineering. He has served on the program and organising committees of a number of international conferences.