Clustering and Lexical Information Support for the Recovery of Design Pattern in Source Code

Simone Romano¹, Giuseppe Scanniello¹, Michele Risi², Carmine Gravino²

¹ Dipartimento di Matematica e Informatica, University of Basilicata, Viale dell’Ateneo, I-85100, Potenza, Italy
e-mail: [email protected]
² Facoltà di Scienze MM. FF. NN., University of Salerno, Via Ponte Don Melillo, I-84084, Fisciano (SA), Italy
e-mail: {gravino, mrisi}@unisa.it
Abstract- We propose an approach that leverages lexical information and fuzzy clustering to reduce the number of design pattern instances that existing approaches based on structural information (i.e., navigating the dependencies among software elements) erroneously recover from source code. To assess the effectiveness of the techniques, we present the results of a case study conducted on four open source software systems implemented in Java. The data analysis shows that the use of lexical information and fuzzy clustering improves the correctness of the results achieved by existing design pattern recovery approaches based on structural information, while preserving the number of design pattern instances correctly identified.

Keywords- Design Patterns, Fuzzy clustering, Maintenance

I. INTRODUCTION

In recent years, several approaches have been proposed for recovering design pattern instances¹ from source code. They differ in the kind of representation used for coding the design patterns, the type of analysis (a static analysis of only the structural aspects of the patterns, or a combination of static and dynamic analyses), and the kind of support they provide for recovering design pattern instances (i.e., manual, semi-automatic, or automatic) [1][4][5][13]. To assess the validity of a design pattern recovery approach, and possibly compare it with other approaches, the recovered instances can be classified, according to the confusion matrix of TABLE I, as True Positives (TP), False Positives (FP), False Negatives (FN), and True Negatives (TN). For example, a design pattern instance is classified as FP if an approach recovers that instance, but it is not an actual instance [11]. The goal of a design pattern recovery approach is to minimize the number of FP and FN and, at the same time, to maximize the number of TP and TN. Classifying instances as FN and TN requires a manual analysis of the entire source code, which is practically impossible for large software systems. Thus, classifying the recovered design pattern instances as TP and FP is the only feasible way to assess a design pattern recovery approach [1].

In this paper, we propose an approach that leverages lexical information and fuzzy clustering. The approach reduces the number of FP and preserves the number of TP obtained by existing design pattern recovery approaches based on structural information [1][13]. Our proposal is inspired by data mining approaches, where the data space is often clustered before information is retrieved [9], and by program comprehension and reverse engineering approaches, where lexical and structural information are combined to improve the results [8][12]. The effectiveness of our approach has been assessed through a case study conducted on four well-known and widely investigated open source software systems implemented in Java: JHotDraw, JUnit, QuickUML, and MapperXML.

¹ A design pattern is a general reusable solution to a commonly occurring problem in software design and it includes a name, an intent, a problem, its solution, some examples, and so on [2]. However, almost all the proposed recovery approaches refer to design pattern instances considering only the solutions of design patterns, or design motifs [4]. This is also the case for the work presented here.

TABLE I. THE CONFUSION MATRIX USED TO CLASSIFY RECOVERED INSTANCES OF DESIGN PATTERNS.

                                     Recovered
Actual                   Actual instance     Not actual instance
Actual instance          True Positive       False Negative
Not actual instance      False Positive      True Negative
II. THE PROPOSED APPROACH

We combine lexical and structural information to recover design pattern instances. Lexical information is used to group classes that implement similar services, while structural information is used to recover design pattern instances in each group and in the entire system. In particular, the design pattern recovery process consists of the following steps:

1. Text extraction and text normalization. The text (i.e., comments and source code) of the software entities (i.e., classes and interfaces in our case) is extracted. The text is then normalized using different techniques, including stop word removal and stemming.
2. Lexical analysis. An information retrieval (IR) engine is used to index the corpus. This index is later used to determine similarity measures between entities. We used the Vector Space Model (VSM) technique [7] in this work.
3. Clustering. Classes are clustered using their similarity, computed on the basis of the index. Entities in the same cluster are lexically similar, while entities in different clusters are lexically dissimilar. Interface classes are added to each cluster after the clustering. We used the Fanny clustering algorithm [6] in this work.
4. Design Pattern Recovery. Design pattern instances are recovered by applying existing design pattern recovery tools to each cluster and to the entire system. We used tools available on the web that exploit only structural information [1][13].

A. Text extraction and text normalization

The source code files are treated as plain text, which is normalized by performing the following steps: (i) deleting non-letters (i.e., operators, punctuation, numbers, etc.), (ii) splitting terms composed of two or more words (e.g., “design_pattern” is turned into “design” and “pattern”), and (iii) eliminating the words in a stop list (i.e., common English words that are not usually useful for searching) and those shorter than three characters. Finally, stemming is applied to reduce words to their root forms (e.g., both the words “designing” and “designer” lead to the common root “design”).

B. Lexical analysis

We use the Vector Space Model (VSM) [7], one of the most popular IR techniques in the text retrieval field.
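As a rough illustration of the normalization described in Section II.A, whose output is the text that is then indexed, the following Python sketch applies steps (i) to (iii) and a stemming pass. It is a sketch under assumptions rather than the prototype's code: the stop list is a tiny illustrative one, `naive_stem` is a crude suffix stripper standing in for a real stemmer, camelCase splitting is added as an assumption alongside the underscore splitting mentioned in the text, and all function names are hypothetical.

```python
import re

# Illustrative stop list (assumption): the paper does not say which list was used.
STOP_WORDS = {"the", "and", "for", "that", "this", "with", "are", "not"}

# Crude suffix stripping (assumption) standing in for a real stemmer.
SUFFIXES = ("ing", "er", "ed", "s")


def split_identifier(token):
    """Split a token on underscores and on camelCase boundaries."""
    parts = []
    for piece in token.split("_"):
        parts.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", piece))
    return parts


def naive_stem(word):
    """Strip the first matching suffix if at least three letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(text):
    """Return the list of normalized terms extracted from raw text."""
    # (i) delete non-letters: keep only runs of letters and underscores
    tokens = re.findall(r"[A-Za-z_]+", text)
    terms = []
    for token in tokens:
        # (ii) split terms composed of two or more words
        for word in split_identifier(token):
            word = word.lower()
            # (iii) drop stop words and words shorter than three characters
            if len(word) < 3 or word in STOP_WORDS:
                continue
            terms.append(naive_stem(word))
    return terms
```

With these assumptions, `normalize("design_pattern")` yields the two terms "design" and "pattern", and both "designing" and "designer" are reduced to "design", mirroring the examples given in the text.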
VSM is based on the occurrences of words in the documents of the corpus. It can be easily applied to any kind of corpus since it does not use a predefined vocabulary or grammar. VSM constructs a term-by-document matrix, where a generic entry denotes the occurrence of a term in a given document. The assumed model is called bag-of-words, since each document is represented as a multi-set of words, disregarding all information about their order or syntactic structure. To determine the weights, we use the term frequency-inverse document frequency (tf-idf) model, which is defined for every term $t_i$ and entity $e_j$ as:

$$\mathit{tf\_idf}(t_i, e_j) = \mathit{tf}(t_i, e_j) \cdot \log \frac{n}{1 + \mathit{df}(t_i)}$$

where $\mathit{tf}(t_i, e_j)$ is the number of occurrences of the term $t_i$ in the entity $e_j$, $\mathit{df}(t_i)$ is the number of entities containing the term $t_i$, and $n$ is the number of classes and interfaces. For example, tf_idf is equal to zero when $t_i$ does not appear in the entity, while it is highest when $t_i$ occurs many times within a small number of entities (which lends a high discriminating power to those entities). To compare pairs of classes, we use the distance measure:

$$d(e_1, e_2) = 1 - \frac{V(e_1) \cdot V(e_2)}{\|V(e_1)\| \, \|V(e_2)\|}$$

where $e_1$ and $e_2$ are entities, $V(e_1)$ and $V(e_2)$ are the vectors corresponding to $e_1$ and $e_2$, and $\|\cdot\|$ is the Euclidean norm. The adopted measure ranges from 0 to 2, where 0 means that the software entities are lexically equal, while 2 means that the entities are completely different.

C. Clustering

Clustering algorithms can be soft/fuzzy (i.e., an entity can belong to one or more clusters) or hard (i.e., an entity belongs to exactly one cluster). We use a fuzzy clustering algorithm since a class can take part in different design pattern instances. In particular, we select a variant of the fuzzy c-means clustering algorithm, namely Fanny [6], which is well known and widely used in several fields. Fanny groups entities into a given number of clusters and, as in fuzzy logic, an entity has a membership degree to one or more clusters. The clustering process is carried out through an iterative optimization of the following function:

$$\sum_{v=1}^{nc} \frac{\sum_{i=1}^{m} \sum_{j=1}^{m} u_{iv}^{r} \, u_{jv}^{r} \, d(e_i, e_j)}{2 \sum_{j=1}^{m} u_{jv}^{r}}$$

where $e_i$ and $e_j$ are pairs of entities selected from the set of all the entities (i.e., classes) to cluster. The size of this set is $m$, $nc$ is the number of clusters to identify, and $u_{iv}$ is a non-negative value specifying the membership of the entity $e_i$ to the cluster $v$. The memberships of a given entity $e_i$ sum to 1. The membership exponent $r$ can assume values between 1 and ∞. When $r$ is close to 1, the algorithm behaves like its hard version (i.e., the k-means clustering algorithm), while for larger values of $r$ the fuzziness of each cluster is higher. The clustering process stops when the following inequality holds:

$$\max_{i,\; v = 1..nc} \left| u_{iv}^{(t+1)} - u_{iv}^{(t)} \right| < \varepsilon$$

where $t$ indicates the iteration, up to a maximum number of iterations, and $\varepsilon$ represents the termination criterion, whose value ranges from 0 to 1. Fanny produces a membership matrix $u_{iv}$ ($i = 1..m$, $v = 1..nc$), which is then used to get the clusters. To this end, we use a threshold on the membership values equal to $1/nc$. Furthermore, we use 1.7 as the value of $r$, 1e-18 as the value of $\varepsilon$, and 5000 as the maximum number of iterations. We have chosen these values since they are common in the literature [6]. The choice of $nc$, instead, is an open issue: we tried different values and selected the one that produced the best results. The definition of a heuristic to reduce the search space for $nc$ is subject of future work.

The interface classes are not used in the clustering. This is because the lexemes of the interfaces are useful to build the corpus, but they do not help to discriminate among clusters. We verified this assumption on all the systems used in the case study (details are not provided for space reasons). Finally, since interfaces are very relevant [2], they are added to each identified cluster.

D. Design Pattern Recovery

Although our approach is general, we use the tools DPR [1] and Pattern4 [13]. These tools use structural information to recover design pattern instances and have been selected since they are well known and freely available for download. Behavioral approaches are not considered since their results strongly depend on the execution of the systems. The use of this kind of approach is subject of future work. DPR uses a design pattern library, expressed in terms of visual grammars, and LR-parsing techniques. Pattern4, instead, uses graph representations of the software and of the design patterns to be retrieved, and then employs a graph similarity algorithm to retrieve design pattern instances. We employ these tools as black boxes.

These tools are applied to each group of classes obtained in the Clustering step and to the entire system. The final set of design pattern instances is obtained by intersecting the set of instances recovered in all the clusters with the set of instances recovered in the entire system. The intersection has the effect of removing FP while preserving TP. This result has been experimentally observed on all the software systems used in the case study. The intersection of the recovered instances also has the effect of combining lexical and structural information, because the design pattern recovery tools work on structural information, while the clusters are obtained through lexical analysis.

III. CASE STUDY
To automate the design pattern recovery, we have implemented a prototype of a supporting system as an Eclipse plug-in. To assess the validity of the underlying techniques, we applied this prototype to four open source Java software systems: JHotDraw 5.1 (8 KLOC and 144 classes, 19 interface classes); JUnit 3.7 (3 KLOC and 78 classes, 10 interface classes); QuickUML 2001 (12.7 KLOC and 152 classes, 13 interface classes); MapperXML 1.9.7 (23 KLOC and 217 classes, 23 interface classes). We selected these systems since they have been widely used to assess the effectiveness of design pattern recovery approaches (see, e.g., [1][4][13]). Further, the design pattern instances of these systems are documented and available in a public dataset, i.e., the P-MARt repository [3][10].

TABLE II summarizes the results achieved in terms of FP and TP by applying DPR and Pattern4 alone. The results achieved by applying these tools in combination with our techniques are shown on the right-hand side of the table. We grouped the results with respect to the software systems under study and the kind of design pattern to recover.

With regard to DPR, the techniques reduce the number of FP by 32% to 69% across the systems. In particular, the best result was achieved on QuickUML, while the worst result was obtained on JHotDraw. In all the cases, the number of TP remains the same with and without our approach. Note that DPR is able to identify only a few design pattern instances classifiable as TP. This is true for all the systems with the exception of JHotDraw. The low number of recovered TP is due both to the systems, which were developed without exploiting design patterns, and to the tools used to recover design patterns.

For Pattern4, the number of FP is reduced by 9% to 23%, indicating that the techniques are less effective with this tool. The best and the worst results were achieved on MapperXML and JUnit, respectively. As with DPR, only a few design pattern instances classifiable as TP were recovered. This is true for all the systems with the exception of JHotDraw.

The data analysis shows that when the design pattern recovery tools identify a large number of FP and a small number of TP, the application of the techniques considerably reduces the number of FP. The results also indicate that, on the Command design pattern, the techniques strongly reduced the number of FP on all the systems and with both the design pattern recovery tools. The best results have been achieved on QuickUML for DPR (the number of FP is 70% lower) and on MapperXML for Pattern4 (the number of FP is 55% lower). For JHotDraw, the improvement related to the use of the techniques is less clear.

A. Further Analysis

To assess the benefit deriving from the lexemes within the comments, we also conducted the case study with a modified Text extraction and text normalization phase, which built the corpus considering only the source code. We observed that better results were achieved on JHotDraw, while on the other systems we mainly got worse results. This finding suggests that for software written by explicitly using design patterns (i.e., through an appropriate naming convention), as is the case for JHotDraw, the comments introduce noise that reduces the benefit deriving from the application of the techniques. For the other systems, instead, the comments play a relevant role in reducing the number of FP. We plan to further investigate this aspect in the future.

IV. DISCUSSION AND CONCLUSION
We have proposed an approach that leverages lexical information and fuzzy clustering to reduce the number of design pattern instances erroneously recovered by existing approaches based on structural information. The use of these techniques has led to an improved precision (i.e., TP/(TP+FP)) of the results. As a matter of fact, if a tool produces results with a very low precision, excessive post-processing is needed to discard the FP. This makes the approach useless, whatever the number of TP. Another remarkable characteristic of our proposal is the combined use of lexical and structural information. Most of the existing approaches rely solely on structural information and ignore textual information [4][13].
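The intersection described in Section II.D and the precision measure TP/(TP+FP) mentioned above can be illustrated with a small Python sketch. It is an illustration under stated assumptions, not the prototype's code: reading "the set of instances recovered in all the clusters" as the union of the per-cluster results is an interpretation, the recovery tools are not invoked (their outputs are hand-written hypothetical sets), and the function names and the instance representation (pattern name plus a frozenset of class names) are hypothetical. The TP and FP values are those of the DPR State row for JHotDraw in TABLE II.

```python
def intersect_instances(per_cluster_results, whole_system_results):
    """Keep instances recovered in at least one cluster AND in the whole system."""
    in_clusters = set().union(*per_cluster_results)
    return in_clusters & set(whole_system_results)


def precision(tp, fp):
    """Precision = TP / (TP + FP); 0.0 when nothing was recovered."""
    recovered = tp + fp
    return tp / recovered if recovered else 0.0


# Hypothetical instances: (pattern name, participating classes).
a = ("Command", frozenset({"ClassA", "ClassB"}))
b = ("Command", frozenset({"ClassC", "ClassD"}))
c = ("State", frozenset({"ClassE", "ClassF"}))

clusters = [{a}, {c}]   # hypothetical tool output on each cluster
system = {a, b, c}      # hypothetical tool output on the entire system
kept = intersect_instances(clusters, system)  # b is discarded

# DPR, JHotDraw, State (TABLE II): TP = 2, FP drops from 82 to 45.
before = precision(2, 82)
after = precision(2, 45)
```

In this example the instance `b`, found only when the tool runs on the whole system, is filtered out. With the TABLE II values, precision for that row rises from about 0.024 to about 0.043 because the number of FP falls while the number of TP is unchanged.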
TABLE II. EXPERIMENTAL RESULTS

                                        DPR          Techniques + DPR
Software     Design pattern          TP     FP          TP     FP
JHotDraw     State                    2     82           2     45
             Strategy                 9     75           9     68
             Adapter                  5     33           5     20
             Command                  9    181           9    135
             Composite                2      5           2      3
             Template Method          2      9           2      9
JUnit        State                    0     10           0      7
             Strategy                 0     10           0      7
             Adapter                  0      5           0      2
             Command                  0     28           0     14
             Composite                1      1           1      1
             Template Method          0      2           0      1
QuickUML     State                    0     34           0     16
             Strategy                 0     34           0     16
             Adapter                  0     61           0     20
             Command                  0    144           0     43
             Composite                0      1           0      0
             Template Method          0      7           0      5
MapperXML    State                    0     37           0     15
             Strategy                 1     36           1     25
             Adapter                  0      8           0      4
             Command                  0     75           0     27
             Template Method          4      4           4      3

                                      Pattern4       Techniques + Pattern4
Software     Design pattern          TP     FP          TP     FP
JHotDraw     Decorator                1      2           1      1
             State/Strategy           8     36           8     36
             Adapter/Command         10     13          10     13
             Composite                1      0           1      0
             Observer                 1      2           1      2
             Prototype                1      2           1      2
             Singleton                2      0           2      0
             Template Method          1      4           1      4
             Factory Method           1      1           1      0
JUnit        Decorator                0      1           0      1
             State/Strategy           0      8           0      7
             Adapter/Command          0      7           0      4
             Composite                1      0           1      0
             Observer                 1      0           1      0
             Template Method          0      1           0      1
QuickUML     State/Strategy           0     15           0     10
             Adapter/Command          0     11           0      9
             Composite                1      0           1      0
             Singleton                1      0           1      0
             Template Method          0      5           0      5
MapperXML    State/Strategy           1     22           1     11
             Adapter/Command          0     11           0      5
             Observer                 0     11           0     10
             Singleton                3      0           3      0
             Template Method          1      3           1      3

REFERENCES

[1] A. De Lucia, V. Deufemia, C. Gravino, M. Risi, “Design Pattern Recovery through Visual Language Parsing and Source Code Analysis”, Jour. of Syst. & Softw., 82(7), 2009, pp. 1177–1193.
[2] E. Gamma, R. Helm, R. Johnson, J. Vlissides, Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley Professional, 1 edition, November 1994.
[3] Y.-G. Guéhéneuc, “P-MARt: Pattern-like Micro Architecture Repository”, in Proc. of EuroPLoP Focus Group on Pattern Repositories, 2007.
[4] Y.-G. Guéhéneuc, G. Antoniol, “DeMIMA: A Multilayered Approach for Design Pattern Identification”, IEEE Trans. on Softw. Eng., 34(5), 2008, pp. 667–684.
[5] J. Ka-Yee Ng, Y.-G. Guéhéneuc, G. Antoniol, “Identification of behavioural and creational design motifs through dynamic analysis”, Jour. of Softw. Mainten., 22(8), 2010, pp. 597–627.
[6] L. Kaufman, P.J. Rousseeuw, Finding Groups in Data – An Introduction to Cluster Analysis. Wiley.
[7] C. D. Manning, P. Raghavan, H. Schütze, Introduction to Information Retrieval. Cambridge University Press, 2008. http://nlp.stanford.edu/IR-book/information-retrieval-book.html.
[8] C. McMillan, M. Grechanik, D. Poshyvanyk, Q. Xie, C. Fu, “Portfolio: a search engine for finding functions and their usages”, in Proc. of Intl. Conf. on Softw. Eng., 2011, pp. 1043–1045.
[9] P. Berkhin, Survey of Clustering Data Mining Techniques. Techniques, 10, 1-56. Springer. Retrieved from http://www.springerlink.com/index/x321256p66512121.pdf
[10] P-MARt: http://ptidej.dyndns.org/downloads/pmart/.
[11] G. Salton and M. McGill, Introduction to Modern Information Retrieval, McGraw Hill, 1983.
[12] G. Scanniello, A. D’Amico, C. D’Amico, T. D’Amico, “Using the Kleinberg Algorithm and Vector Space Model for Software System Clustering”, in Proc. of Intl. Conf. on Prog. Compr., 2010, pp. 180–189.
[13] N. Tsantalis, A. Chatzigeorgiou, G. Stephanides, S. T. Halkidis, “Design Pattern Detection using Similarity Scoring”, IEEE Trans. on Softw. Eng., 32(11), 2006, pp. 896–909.