Reverse Engineering Using Association Rules

O. Maqbool, A. Karim, H.A. Babri, and M. Sarwar
Computer Science Department, Lahore University of Management Sciences, DHA Lahore, Pakistan

Abstract

Software systems need to evolve as business requirements, technology and environment change. Very often, these changes to the software are not documented, so it becomes difficult to understand and manage such systems. To gain system understanding when documentation is non-existent or incomplete, we use reverse engineering. In this paper, we explore the use of data mining for software reverse engineering, i.e. given the source files of a software system, we use association rule mining algorithms and tools to gain insight about the software. Our purpose is to determine whether association rule mining can be used for finding interesting patterns and associations within the software that can lead to program understanding and, if required, re-structuring. We apply association rule mining to a test system and present our results. Finally, we analyze our results and suggest modifications to improve the structure of the software.

Index Terms: Reverse Engineering, Program Understanding, Data Mining, Association Rules

1. Introduction

Legacy systems are old software systems that are crucial to the operation of a business. These systems are expected to have undergone changes in their lifetime due to changes in requirements, business conditions and technology. It is quite likely that such changes were made without proper regard to software engineering principles. The result is often a deteriorated structure, which is unstable but cannot be discarded because it is costly to do so. Another reason for retaining these legacy systems is that they embed business knowledge which is not documented elsewhere.

Since it is often not feasible to discard a system and develop a new one, techniques must be employed to improve the structure of the existing system. An effective strategy for change must be devised; re-engineering is one such strategy. Re-engineering is a process that transforms or re-implements legacy systems to make them more maintainable. The re-engineering option should be chosen when system quality has been degraded by regular change, but change is still required, i.e. the system under consideration has low quality but a high business value, and the re-engineering effort is less risky and less costly than system replacement.

The re-engineering effort starts with gaining an understanding of the software system, a process known as reverse engineering. Gaining system understanding is difficult because documentation for the system is often not available and source code files are the only source of information regarding the system. In this paper, we explore the use of association rule mining for software reverse engineering, i.e. given the source files of a software system, we use association rule mining algorithms and tools to gain understanding and insight about the software. The understanding gained allows suggestions for making subsequent changes and optimizations to the source code for better maintainability.

In recent years, there has been growing interest in the application of data mining techniques to gain better understanding of software systems. Researchers have applied data mining techniques in different contexts, e.g. to recover the architecture of software legacy systems [1] - [3], to discover patterns for re-using software library components [4], [5], and to support software system maintenance [6], [7]. In this paper, our focus is on the use of association rule mining for discovering patterns within the source code that are helpful in system understanding and improvement.

The organization of our paper is as follows. In section 2 we present an overview of association rule mining. Section 3 details our approach. Section 4 gives the results of applying association rule mining to a test system. Finally, we present the conclusions.

2. An overview of association rule mining

Association rule mining is a data mining technique that finds interesting association or correlation relationships among a large set of data items [8]. Traditionally, association rule mining has been employed as a useful tool to support business decision making by discovering interesting relationships among business transaction records.

To illustrate the concept of association rule mining, consider a set of data items $I = \{i_1, i_2, \ldots, i_n\}$. Let $D$ be a set of transactions, with each transaction $T$ being a subset of $I$, i.e. $T \subseteq I$. An association rule is an implication of the form $A \Rightarrow B$, where $A \subset I$, $B \subset I$ and $A \cap B = \emptyset$.

As an example, consider a set of computer accessories (CDs, memory sticks, microphones and speakers) that are available at a certain store. These accessories form the set of items $I$ of interest to us. Every sale made represents a transaction $T$. Suppose the sales made are represented in the form of the following set of transactions $D$:

Transaction ID    Items Sold
T1                CD, memory stick
T2                CD
T3                Microphone, speaker
T4                CD, speaker, microphone
T5                Memory stick, microphone, speaker

Association rules in the above case represent the items that tend to be sold together, e.g. the association rule $CD \Rightarrow Speaker$ states that those who buy CDs also tend to buy speakers. A number of such association rules may exist in a given set of transactions and not all of them may be of interest. To find interesting rules, support and confidence thresholds are used. Support represents the percentage of transactions in $D$ which contain both $A$ and $B$. Confidence is the percentage of transactions in $D$ containing $A$ that also contain $B$. Mathematically:

$Support(A \Rightarrow B) = P(A \cup B)$
$Confidence(A \Rightarrow B) = P(B \mid A)$

Here $P(A \cup B)$ denotes the probability that a transaction contains all the items of both $A$ and $B$. An association rule is said to be strong if it satisfies both a minimum support threshold and a minimum confidence threshold.

Another measure of interest is coverage. The coverage of an association rule is the proportion of transactions in $D$ that have the items specified on the left hand side of the rule. Mathematically:

$Coverage(A \Rightarrow B) = P(A)$

For the association rule $CD \Rightarrow Speaker$, support is 1/5, confidence is 1/3 and coverage is 3/5. An interesting association rule in this case is $Microphone \Rightarrow Speaker$, for which support is 3/5, confidence is 1 and coverage is 3/5.
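As a minimal sketch, the three measures can be computed directly from the example transactions by counting. The function name rule_measures is our own choice for illustration; exact fractions are used so that the values can be compared with those quoted above.

```python
from fractions import Fraction

# Transactions D from the computer-accessories example.
transactions = {
    "T1": {"CD", "memory stick"},
    "T2": {"CD"},
    "T3": {"microphone", "speaker"},
    "T4": {"CD", "speaker", "microphone"},
    "T5": {"memory stick", "microphone", "speaker"},
}


def rule_measures(lhs, rhs, transactions):
    """Return (support, confidence, coverage) of the rule lhs => rhs."""
    lhs, rhs = set(lhs), set(rhs)
    total = len(transactions)
    # Transactions containing every item on the left hand side.
    n_lhs = sum(1 for items in transactions.values() if lhs <= items)
    # Transactions containing every item of both sides.
    n_both = sum(1 for items in transactions.values() if (lhs | rhs) <= items)
    support = Fraction(n_both, total)
    coverage = Fraction(n_lhs, total)
    # Confidence is undefined when the left hand side never occurs.
    confidence = Fraction(n_both, n_lhs) if n_lhs else None
    return support, confidence, coverage


cd_rule = rule_measures({"CD"}, {"speaker"}, transactions)
mic_rule = rule_measures({"microphone"}, {"speaker"}, transactions)
```

Under these definitions, cd_rule holds the values 1/5, 1/3 and 3/5, and mic_rule holds 3/5, 1 and 3/5, in the order support, confidence, coverage.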

3. Our rule mining approach

3.1. Item selection

The first step in our approach is to identify items to be used in mining. For generating results from source code files, these items should be ‘facts’ about the software that exist in the form of ‘transactions’. Most of the legacy software systems that exist have been developed using the structured approach, with functions or routines forming basic components. Moreover, in legacy software, the use of global variables is often widespread, leading to difficulty in understanding the code. In view of these facts, we decided to use both functions and global variables as items. We also decided to use user defined types, because user defined types become potential objects when code is to be re-structured as object-oriented code. Thus the guiding principle for item selection is to select items and identify association rules which facilitate understanding of the code and allow suggestions for re-structuring the code for greater maintainability.

Three sets of transactions based on these items are used, together with a fourth, consolidated set:
- The set of functions along with global variables accessed by them
- The set of functions along with user defined types accessed by them
- The set of functions along with function calls to other functions
- A consolidated set consisting of functions with the global variables and user defined types accessed and the function calls made
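The transaction sets described above can be sketched as follows, with each function acting as one transaction. The function, global and type names are invented for illustration, and the kind prefix added by tag is our own device for keeping items of different kinds apart in the consolidated set.

```python
# Hypothetical facts about three functions; all names are invented.
globals_accessed = {
    "draw_box": {"cur_x", "cur_y"},
    "draw_line": {"cur_x"},
    "init_canvas": set(),
}
types_accessed = {
    "draw_box": {"Box"},
    "draw_line": {"Line", "Point"},
    "init_canvas": set(),
}
calls_made = {
    "draw_box": {"draw_line"},
    "draw_line": set(),
    "init_canvas": {"draw_box", "draw_line"},
}


def tag(kind, names):
    """Prefix item names with their kind so that, for example, a global and
    a function with the same name remain distinct items."""
    return {f"{kind}:{name}" for name in names}


def consolidated(globals_accessed, types_accessed, calls_made):
    """Build the fourth set: one transaction per function holding every
    global, type and called function associated with it."""
    functions = set(globals_accessed) | set(types_accessed) | set(calls_made)
    return {
        f: tag("global", globals_accessed.get(f, ()))
        | tag("type", types_accessed.get(f, ()))
        | tag("function", calls_made.get(f, ()))
        for f in sorted(functions)
    }


all_items = consolidated(globals_accessed, types_accessed, calls_made)
```

The three input dictionaries correspond to the first three transaction sets, and all_items to the consolidated one.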

3.2. Identification of interesting association rules

In this step, we identify the interesting associations between global variables, types and functions. As pointed out previously, those association rules will be of interest to us that help in understanding the code and in improving its structure and maintainability. In the tables below, we list these association rules, detail their implication and offer suggestions for improvement. We use both minimum and maximum thresholds of the three measures, coverage, support and confidence. We consider the range 0.7 to 1.0 as high for any measure and 0.0 to 0.3 as low.

It is relevant to note that if we employ user defined types, functions and global variables as items, and use coverage, support and confidence criteria with values zero, low, high and one, the number of possible association rules exceeds 1500. The 13 association rules listed below represent a small subset of the total possible association rules. They have been selected because they represent meaningful patterns that provide insight into the software system under study. Patterns were first adopted by the software community as a way of documenting recurring solutions to design problems [9]. The patterns presented below help in identifying problems in legacy software and also offer suggestions for improvement.

Pattern 1: Reduce level of coupling

LHS                        RHS        Coverage
Global (Single)            Function   Low

Implication: Only a small number of functions use the global variable on the LHS.
Suggestion: Rather than using the variable as a global variable, pass it as a parameter within the relevant functions.

Pattern 2: Increase efficiency

LHS                        RHS        Coverage
Global (Single/Multiple)   Function   High

Implication: Global variable(s) on LHS is used by most of the functions.
Suggestion: Place the global variable in a ‘register’, but do this carefully, keeping all constraints in view. Some languages may not allow such placement at all.

LHS                        RHS        Confidence
Function (Single)          Function   High

Implication: Functions are called together.
Suggestion: Place the functions in the same file.

Pattern 3: Localize structures

LHS                        RHS        Confidence
Global (Multiple)          Function   One

Implication: The global variables are used by one function only.
Suggestion: Place global variables in one local structure.

Pattern 4: Identify utilities

LHS                        RHS        Coverage
Function (Single)          Function   High

Implication: Function on LHS is called by most of the functions.
Suggestion: Treat the function as a utility function.

Pattern 5: Increase modularity

LHS                        RHS        Confidence
Global (Single/Multiple)   Global     High

Implication: Globals are used together.
Suggestion: Combine the globals into a structure.

LHS                        RHS        Confidence
Type (Single/Multiple)     Type       High

Implication: Types are used together.
Suggestion: Combine the types into a structure.

Pattern 6: Reduce memory requirements

LHS                        RHS        Confidence
Global (Single)            Global     Zero

Implication: Globals are never used together.
Suggestion: Combine the globals into a union structure.

LHS                        RHS        Confidence
Type (Single/Multiple)     Type       Zero

Implication: Types are never used together.
Suggestion: Combine the types into a union structure.

Pattern 7: Beware of side effects

LHS                          RHS      Confidence
Function (Single/Multiple)   Global   High

Implication: The functions have high coupling.
Suggestion: If any one of the functions is changed, take care how it affects the global variable.

Pattern 8: Strengthen encapsulation

LHS                          RHS      Confidence
Function (Single/Multiple)   Type     High

Implication: The functions access the same type in most cases.
Suggestion: If we are considering converting a ‘structured’ system to an ‘object-oriented’ system, the type is a candidate class and the functions are its member functions.

LHS                          RHS      Confidence
Global (Single/Multiple)     Type     One

Implication: The global and type occur together in all cases.
Suggestion: If we are considering converting a ‘structured’ system to an ‘object-oriented’ system, the type is a candidate class and the global variable is a static data member.

LHS                          RHS      Confidence
Type (Multiple)              Global   One

Implication: The types and global variable occur together always.
Suggestion: If we are considering converting a ‘structured’ system to an ‘object-oriented’ system, the types are candidate classes and the global variable is a static data member.

4. Experiments and results

4.1 The test system

The software we have chosen for applying association rule mining is Xfig version 3.2.3, which is an open source drawing utility that runs under the X Window System. It has been written in C, and consists of around 75,000 lines of code. The design documentation of Xfig is not available, although the user manual and other useful information are available at the Xfig site [10]. The Xfig system consists of five major subsystems, whose source code files can be identified by their names. Some relevant statistics of these subsystems are provided in the table below:

System      Purpose                Files   Functions
d_* files   Drawing tasks          10      94
e_* files   Editing tasks          19      369
f_* files   File related tasks     15      139
u_* files   Utility files          18      422
w_* files   Window related tasks   30      637

In this paper we present the results for the d_* files subsystem only. The results and analysis can be extended to the other four subsystems as well. The source files for the Xfig system have been parsed using the Rigi tool and relevant ‘facts’, which represent the transaction set related to the system, have been stored in an exchange format called the ‘Rigi Standard Format (RSF)’ [11]. Facts of interest to us include:
- Global variables accessed by a function
- User defined types accessed by a function
- Functions called by a function
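The sketch below shows one way such facts could be grouped into per-function transactions, assuming the facts are available as whitespace-separated triples of the form "relation source target". The relation names (call, accesses_global, accesses_type) and the sample facts are placeholders chosen for illustration; they are not taken from the Rigi parser output for Xfig.

```python
from collections import defaultdict

# Hypothetical sample of triples: "relation source target".
# The relation names are placeholders, not the names Rigi emits.
SAMPLE_FACTS = """\
call draw_box draw_line
accesses_global draw_box cur_x
accesses_global draw_line cur_x
accesses_type draw_line Line
"""


def parse_triples(text):
    """Group triples into {relation: {function: set_of_targets}}."""
    facts = defaultdict(lambda: defaultdict(set))
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        relation, source, target = parts
        facts[relation][source].add(target)
    return facts


facts = parse_triples(SAMPLE_FACTS)
```

Each inner dictionary, for example facts["accesses_global"], is then one transaction set in which every function is a transaction and the targets are its items.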

4.2 Analysis of results

In this section, we present the results of applying association rule mining to the d_* files subsystem of Xfig and analyze the results using the patterns identified in section 3.2. The d_* files subsystem consists of 94 functions. Moreover, the Xfig system makes use of 1746 global variables and 828 user defined types.
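Checking a rule against a pattern amounts to computing a measure and comparing it with the ranges of section 3.2. The following minimal sketch does this for the coverage of global variables, which is the measure used by Pattern 1 and the first rule of Pattern 2. The data are hypothetical, and the label "medium" for values between the low and high ranges is our own assumption, since the paper names only zero, low, high and one.

```python
def level(value):
    """Map a measure in [0, 1] to the labels used in section 3.2."""
    if value == 0:
        return "zero"
    if value == 1:
        return "one"
    if value <= 0.3:
        return "low"
    if value >= 0.7:
        return "high"
    return "medium"  # assumed label for values between the two ranges


def global_coverage(globals_by_function):
    """Coverage of the rule global => function: the share of functions
    (transactions) that access each global variable."""
    total = len(globals_by_function)
    counts = {}
    for accessed in globals_by_function.values():
        for g in accessed:
            counts[g] = counts.get(g, 0) + 1
    return {g: n / total for g, n in counts.items()}


# Hypothetical data: five functions and the globals each one accesses.
globals_by_function = {
    "f1": {"mode", "zoom"},
    "f2": {"mode"},
    "f3": {"mode"},
    "f4": {"mode"},
    "f5": set(),
}

# A coverage of exactly one also lies in the high range, so it maps to Pattern 2.
PATTERN_FOR_LEVEL = {"low": "Pattern 1", "high": "Pattern 2", "one": "Pattern 2"}

report = {
    g: PATTERN_FOR_LEVEL.get(level(cov), "none")
    for g, cov in global_coverage(globals_by_function).items()
}
```

In this example, zoom is accessed by one of five functions and falls under Pattern 1, while mode is accessed by four of five and falls under Pattern 2.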

Application of Pattern 1

LHS                        RHS        Coverage
Global (Single)            Function   Low

In the Xfig software, there are a number of global variables satisfying this condition. This shows that many global variables are being used by very few functions and can be passed as parameters. The fact that they have been defined as global shows poor design or a design that has deteriorated.

Application of Pattern 2

LHS                        RHS        Coverage
Global (Single/Multiple)   Function   High

In the Xfig software, there are no global variables satisfying this condition. This shows that no global variable is used by most of the functions. This indicates a better design than one where many functions access a global variable, thus increasing coupling. So the system exhibits some degree of common coupling, but this coupling is not between all functions.

LHS                        RHS        Confidence
Function (Single)          Function   High

In the Xfig software, some functions satisfy this condition. For example, make_sfactors and spline_drawing_selected are always called together. It would be helpful to place the functions in the same file to decrease the overhead of accessing the functions from two different files.

Application of Pattern 3

LHS                        RHS        Confidence
Global (Multiple)          Function   One

In the Xfig software, no group of multiple global variables satisfies this condition. However, there are some global variables that are used by only one function. For example, the global variable cur_arctype^* and function d_line occur together in all cases. In this case, the global variable may be localized.

Application of Pattern 4

LHS                        RHS        Coverage
Function (Single)          Function   High

In the Xfig software, there are no functions satisfying this condition, which implies that no function is called frequently by many functions. This shows that there are no functions which can be termed as utility functions.

Application of Pattern 5

LHS                        RHS        Confidence
Global (Single/Multiple)   Global     High

In the Xfig software, there are a number of globals that occur together. For example, canvas_font^*, canvas_middlebut_proc^* and work_psflag^* occur together with confidence = 1. It is feasible to make a structure out of them, for restricted access.

LHS                        RHS        Confidence
Type (Single/Multiple)     Type       High

In the Xfig software, some types satisfy this condition, for example F_line and F_point. This shows that these types always occur together and can be turned into a structure or class for a meaningful interpretation and restricted access if required.

Application of Pattern 6

LHS                        RHS        Confidence
Global (Single)            Global     Zero

In the Xfig software, there are no global variables satisfying this condition. This shows that no two global variables are such that if one is accessed, the other is not accessed. In other words, no two global variables are mutually exclusive.

LHS                        RHS        Confidence
Type (Single/Multiple)     Type       Zero

In the Xfig software, there are no types satisfying this condition. This shows that no two types are such that if one is accessed, the other is not accessed. In other words, no two types are mutually exclusive.

Application of Pattern 7

LHS                          RHS      Confidence
Function (Single/Multiple)   Global   High

In the Xfig software, some functions and global variables satisfy this condition. For example, the new_text function and the global variable new_t^* have high coupling. This does not indicate a good design, since a change in the global variable cannot be traced to a particular function.

Application of Pattern 8

LHS                          RHS      Confidence
Function (Single/Multiple)   Type     High

In the Xfig software, some functions and types satisfy this condition. For example, add_subspline_point function accesses type F_sfactor in most cases. In case Xfig is to be re-structured as an object-oriented system, F_sfactor is a candidate class and add_spline_point is a candidate member function of this class.

LHS                          RHS      Confidence
Global (Single/Multiple)     Type     One

In the Xfig software, some global variables and types satisfy this condition. For example, the global variable cur_boxradius^* and type F_line occur together in all cases. In case Xfig is to be re-structured as an object-oriented system, F_line is a candidate class and cur_boxradius^* is a candidate static member for this class.

LHS                          RHS      Confidence
Type (Multiple)              Global   One

In the Xfig software, some types and global variables satisfy this condition, for example, the type F_arc and global variable center_point^* occur together in all cases. In case Xfig is to be re-structured as an object-oriented system, F_arc is a candidate class and center_point^* is a candidate static member for this class.
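Rules with confidence one, such as the global and type pairs above, can be found by enumerating single-item rules over a consolidated transaction set. The sketch below is a brute-force illustration and not the mining tool used in our experiments. The transactions and item names are hypothetical, and the "global:" and "type:" prefixes are an assumed convention for marking the kind of each item.

```python
from itertools import permutations


def single_item_rules(transactions):
    """Yield (lhs, rhs, confidence) for every ordered pair of distinct items.
    Every item occurs in at least one transaction, so the divisor is nonzero."""
    items = sorted(set().union(*transactions.values()))
    for lhs, rhs in permutations(items, 2):
        with_lhs = [t for t in transactions.values() if lhs in t]
        both = sum(1 for t in with_lhs if rhs in t)
        yield lhs, rhs, both / len(with_lhs)


def static_member_candidates(transactions):
    """Return global => type rules whose confidence is one."""
    return [
        (lhs, rhs)
        for lhs, rhs, conf in single_item_rules(transactions)
        if conf == 1 and lhs.startswith("global:") and rhs.startswith("type:")
    ]


# Hypothetical consolidated transactions, one per function.
transactions = {
    "make_shape": {"global:radius", "type:Shape"},
    "edit_shape": {"global:radius", "type:Shape", "type:Point"},
    "move_point": {"type:Point"},
}

candidates = static_member_candidates(transactions)
```

Here the only rule returned is global:radius => type:Shape, because every function that accesses radius also accesses Shape. Enumerating all ordered pairs is quadratic in the number of items, so for a system with thousands of globals and types a frequent-itemset algorithm would be the more practical choice.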

5. Conclusions

In this paper we explored the application of association rule mining to the problem of understanding a software system given only the source code. We applied association rule mining to analyze the structure of Xfig, which is a reasonably large software system consisting of around 75,000 lines of code. Extracting interesting and meaningful association rules was helpful in gaining insight about the software’s overall structure. A manual inspection would have taken a much longer time.

Our results show that association rule mining can be applied to find interesting associations between functions, types and global variables within the source files. These associations can be used to gain deeper understanding of the code, and may be used to restructure the code for maintainability. In some cases, the associations found can be helpful in re-modularizing the code, e.g. in converting a structured design to an object-oriented design.

It will be interesting to pursue the mining of associations between items other than the ones explored here. We have considered global variables, user defined types and functions in this paper. Associations between other items, such as input/output parameters of functions or set/get of variables, may also provide insight into the software. Moreover, it may be useful to identify other interesting association rules that shed light on other aspects of the software system, leading to a more complete picture.

References

[1] C. Tjortjis, L. Sinos, P. Layzell, “Facilitating Program Comprehension by Mining Association Rules from Source Code”, 11th IEEE International Workshop on Program Comprehension (IWPC'03), May 2003.
[2] C. Montes de Oca, D.L. Carver, “A Visual Representation Model for Software Subsystem Decomposition”, Working Conference on Reverse Engineering (WCRE'98), October 1998.
[3] K. Sartipi, K. Kontogiannis, F. Mavaddat, “Architectural Design Recovery Using Data Mining Techniques”, Conference on Software Maintenance and Reengineering (CSMR'00), February 2000.
[4] A. Michail, “Data Mining Library Reuse Patterns in User-Selected Applications”, 14th IEEE International Conference on Automated Software Engineering, October 1999.
[5] A. Michail, “Data Mining Library Reuse Patterns Using Generalized Association Rules”, Proceedings of the 22nd International Conference on Software Engineering, June 2000.
[6] J.S. Shirabad, T.C. Lethbridge, S. Matwin, “Supporting Maintenance of Legacy Software with Data Mining Techniques”, Proceedings of the 2000 Conference of the Centre for Advanced Studies on Collaborative Research, November 2000.
[7] J.S. Shirabad, T.C. Lethbridge, S. Matwin, “Supporting Software Maintenance by Mining Software Update Records”, International Conference on Software Maintenance (ICSM'01), November 2001.
[8] J. Han, M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.
[9] S. Demeyer, S. Ducasse, O. Nierstrasz, Object-Oriented Reengineering Patterns, Morgan Kaufmann, 2003.
[10] http://www.xfig.org
[11] J. Martin, K. Wong, B. Winter, H.A. Müller, “Analyzing Xfig Using the Rigi Tool Suite”, The Seventh Working Conference on Reverse Engineering (WCRE'00), November 2000.

Acknowledgement

We would like to thank the Lahore University of Management Sciences for providing funding for this research. Thanks also to Asim Qureshi for his help in the initial phases of this work.