
JOURNAL OF SOFTWARE: EVOLUTION AND PROCESS J. Softw. Evol. and Proc. (2014) Published online in Wiley Online Library (wileyonlinelibrary.com). DOI: 10.1002/smr.1674

Design pattern detection using a DSL-driven graph matching approach

Mario Luca Bernardi1,*,†, Marta Cimitile2 and Giuseppe Di Lucca1

1 Department of Engineering, University of Sannio, Benevento, Italy
2 Faculty of Jurisprudence, Unitelma Sapienza University, Rome, Italy

ABSTRACT
Knowledge about design pattern (DP) instances improves program comprehension and reengineering of object-oriented systems. In particular, it helps to discover developer design decisions and trade-offs that are often undocumented. This work describes an approach to automatically detect DPs in existing object-oriented systems by tracing the systems’ source code components to the roles they play in the patterns. In the proposed approach, DPs are modeled based on their high-level structural properties (e.g., inheritance, dependency, invocation, delegation, type nesting, and membership relationships), which are checked, by source code parsing, against the system structure and components. Moreover, the approach can also detect pattern variants, defined by overriding the pattern properties. This paper presents a description of the approach, provides a brief description of the supporting tool, and discusses the results of the experiments carried out to validate it. The approach was validated on seven systems of an open benchmark that contains systems of increasing sizes. For five additional systems, the results have been compared with those of a similar approach from the literature. The obtained results, the identified DP variants, and the effectiveness of the approach are thoroughly presented and discussed. Copyright © 2014 John Wiley & Sons, Ltd.

Received 29 October 2013; Revised 17 June 2014; Accepted 2 July 2014

KEY WORDS: design pattern detection; object-oriented systems; graph matching; domain-specific languages; model-driven development

*Correspondence to: Mario Luca Bernardi, Department of Engineering, University of Sannio, Benevento, Italy.
† E-mail: [email protected]

1. INTRODUCTION
Design patterns (DPs) were first introduced in [1] as general repeatable solutions to commonly occurring problems in software design. Several works [2, 3] show how software quality greatly improves by implementing DPs and documenting their adoption. In the last 20 years, as the number of pattern-based systems and frameworks increased, the (semi-)automatic detection of pattern instances in object-oriented (OO) software systems has become more critical for improving program comprehension, maintenance, and reuse [4]. When design documentation is not available (or not up to date), DP detection (DPD) can help program comprehension by providing useful insights into the software architecture, the underlying design choices, and the role played by each code component in a DP [5, 6]. This is even more true in the case of bad or incomplete documentation: the lack of adequate documentation may make it hard to understand which design solutions were adopted and which code components implement them. Finally, searching a software project for DPs can also be used to assess the quality of the source code [3, 5].

To address these issues, several methodologies, approaches, and tools have been proposed in the literature in the last 20 years [7]. Most of these approaches take into account a fixed and limited set of properties to specify a pattern. In particular, they do not consider behavioral properties, which are crucial in the characterization of object DPs. Moreover, most of these approaches are very sensitive to structural differences in the searched patterns because the pattern specifications are embedded in the detection algorithm.

This paper proposes a detection approach that addresses these issues. The detection algorithm is based on a meta-model representing both the software system and the searched DPs through a set of high-level properties that is wider than in existing approaches. These properties relate to the source code elements, the static relationships among them, and their behavior. Each system is considered as an instance of this meta-model and is represented as a graph of elements and their properties. DP identification is performed by matching each pattern graph with the overall system graph and by annotating the elements of the type hierarchy with information on the roles they play in the pattern.

An advantage of the proposed approach over most of the existing ones is that it also allows easy specification of variant forms of the classic DPs (as coded in the literature). This is an important issue to address because it is well known that DPs are present in real-world systems in many variants [8, 9]. Our detection approach is driven by a set of pattern models written using a domain-specific language (DSL) defined to model the structure of both the software system and the DPs. It organizes such DP models as a hierarchy of declarative specifications. In particular, a DP variant can be expressed as a set of changes to an existing specification by adding, removing, or relaxing properties. Hence, it is possible to write a new pattern specification by deriving it from an existing one (to detect a variant) or by writing it from scratch (to detect a new kind of pattern), with no impact on the mining algorithm.
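The idea of declarative specifications that can be overridden to obtain variants can be illustrated with a minimal Python sketch. It is not the DSL or the matching algorithm of the proposed approach, whose syntax is not shown in this section. It assumes a hypothetical encoding in which the system is a set of (relation, source, target) facts and a specification is a set of required properties among roles. The relation names, the `derive` and `detect` helpers, and the toy types are all illustrative assumptions.

```python
from itertools import permutations

# Hypothetical system graph: each fact is a (relation, source type, target type) triple.
SYSTEM = {
    ("inherits", "Circle", "Shape"),
    ("inherits", "Square", "Shape"),
    ("inherits", "BorderDecorator", "Shape"),
    ("has_field_of_type", "BorderDecorator", "Shape"),
    ("delegates", "BorderDecorator", "Shape"),
}

# Hypothetical specification: role names plus properties required among the roles.
DECORATOR = {
    "roles": ("Component", "Decorator"),
    "properties": {
        ("inherits", "Decorator", "Component"),
        ("has_field_of_type", "Decorator", "Component"),
        ("delegates", "Decorator", "Component"),
    },
}


def derive(base, remove=(), add=()):
    """Build a variant by overriding the property set of an existing specification."""
    properties = (set(base["properties"]) - set(remove)) | set(add)
    return {"roles": base["roles"], "properties": properties}


def detect(spec, system):
    """Return every role-to-type binding whose required properties all hold."""
    types = sorted({t for _, src, dst in system for t in (src, dst)})
    found = []
    # Naive enumeration of bindings; a stand-in for a real graph-matching engine.
    for combo in permutations(types, len(spec["roles"])):
        binding = dict(zip(spec["roles"], combo))
        if all((rel, binding[src], binding[dst]) in system
               for rel, src, dst in spec["properties"]):
            found.append(binding)
    return found


# A relaxed variant: the delegation property is removed, nothing else changes.
RELAXED_DECORATOR = derive(DECORATOR, remove={("delegates", "Decorator", "Component")})
```

With these toy facts, `detect(DECORATOR, SYSTEM)` binds Component to `Shape` and Decorator to `BorderDecorator`. If the delegation fact is removed from the system, the base specification no longer matches, whereas `RELAXED_DECORATOR` still does, and `detect` itself is unchanged in both cases.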
An Eclipse-based tool, called Design Pattern Finder (DPF), has been developed to provide automatic support for the approach (in the following, DPF is used to refer concisely to the proposed approach, not just the tool). The approach has been assessed by applying it to seven systems of an open benchmark proposed in [10] and [11]. For five additional systems, we compared our results with the ones obtained using the DPD tool proposed in [8]. In this case, the results have been validated by experts in order to evaluate precision and recall, considering the true positives of both tools as a gold standard (GS).

This paper improves and extends our previous preliminary investigation reported in [12, 13]. The main improvements are as follows: (i) a wider set of DSL specifications allowing the detection of new DPs not considered before; (ii) a new version of the DPF prototype tool provided with a user interface; and (iii) more experiments, with the related discussion of the results obtained with the DPF tool on a larger set of systems and their comparison with the results provided by both an open benchmark and the DPD approach.

The paper is structured as follows. In Section 2, relevant related work is discussed. Section 3 describes the meta-model and DSL defined to represent the system and pattern structure, together with the detection approach. Section 4 concisely describes the catalog of the DP specifications, while the DPF tool is briefly described in Section 5. Section 6 reports the experiment setup, whereas the results are discussed in Section 7. Section 8 contains concluding remarks and briefly discusses future work. The Appendix reports and describes some DSL specifications of the most relevant DPs and their variants.
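The gold-standard evaluation mentioned above can be made concrete with a short worked sketch. It assumes, as one possible reading, that the GS is the set of expert-validated instances reported by either tool. The instance labels and the reported sets are invented for illustration and are not results from the paper.

```python
def precision_recall(reported, gold):
    """Precision and recall of a set of reported instances against a gold standard."""
    true_positives = len(reported & gold)
    precision = true_positives / len(reported) if reported else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical pattern instances reported by the two tools.
dpf_reported = {"A", "B", "C", "D"}
dpd_reported = {"B", "C", "E"}

# Hypothetical outcome of the expert validation: "D" is rejected as a false positive.
validated = {"A", "B", "C", "E"}

# Gold standard: the validated true positives of both tools.
gold = (dpf_reported | dpd_reported) & validated

dpf_scores = precision_recall(dpf_reported, gold)  # (0.75, 0.75)
dpd_scores = precision_recall(dpd_reported, gold)  # (1.0, 0.75)
```

Because such a GS contains only instances found by at least one of the two tools, instances missed by both are not counted, so the recall values computed this way are upper bounds on the true recall.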

2. RELATED WORK
The problem of mining DPs in existing OO systems has been addressed in several works, and different methods and techniques have been proposed to support it. Reviews of current techniques and tools for discovering architecture and DPs from OO systems are provided in [14] and [15]. In the latter work, the authors classify pattern recovery techniques by the type of analysis used and the searching method adopted.

2.1. Type of analysis
The type of analysis can be classified as structural analysis, behavioral analysis, semantic analysis, and formal specification/composition analysis.


Structural analysis approaches recover the structural relationships from the available source code artifacts. They focus on recovering structural DPs such as Adapter, Proxy, and Decorator. These approaches consider interclass relationships to identify patterns’ structural properties [16]. An example is in [17], where a structural analysis of DPs is proposed for C++ systems. Moreover, in [18], the source code is parsed using a third-party commercial tool called Understand for C++. The tool extracts the entities and the references from C++ source code and stores them in a database. Queries are performed on the database to extract different properties of patterns. In their experimentation, the authors recovered Singleton, Factory, Template, Observer, and Decorator patterns from a version control system.

Behavioral analysis approaches adopt dynamic analysis, machine learning, and static program analysis techniques to extract patterns’ behavioral aspects. They can be used together with structural analysis when patterns are structurally identical or have a weak structure (e.g., State and Strategy patterns are structurally identical). The limitation of these approaches is the high number of false positives when the number of execution traces increases [19].

Semantic analysis approaches complement structural and behavioral analysis, reducing the false-positive rate in the recognition of different patterns. They use naming conventions and annotations that contain the role information about the classes and methods [19, 20]. This analysis can be used to recover patterns having similar static and behavioral properties (e.g., Bridge and Strategy patterns). Different techniques are used for semantic analysis. In [19], three options are discussed, and the authors conclude that naming conventions are the most appropriate and feasible option.

Finally, formal specification/composition analysis of DPs includes approaches based on formal pattern specifications. It is important to supplement different detection approaches by formally specifying different patterns [21–24]. Moreover, DPs have different implementation variants, and a formal specification of patterns can help to specify the possible variations in different patterns as well as overcome the challenges of capturing their semantics. These approaches use formal specification languages to specify DPs, supported by tools that validate the correctness and completeness of the specifications [22]. In [25, 26], an extension of the UML sequence diagram is proposed, allowing designers to define and visualize the pattern roles and the different types of interaction groups for a DP. Other approaches [27, 28] propose definitions of pattern variants composed of reusable feature types. A feature type is detected in the source code with the search technique that best fits its characteristics, so different technologies are used for detecting different parts of a pattern. In particular, in [28], four new variants of standard patterns (Abstract Factory, Decorator, Adapter, and Proxy) are defined during the analysis of different source applications. This approach is similar to ours because it explicitly represents the concept of variants. It presents a model based on a set of properties (called feature types by the authors) that together are capable of representing a DP. Our approach defines variants using a DSL that models a wide set of structural and behavioral properties (such as delegation, object creation, calls, and dependencies) and uses inheritance to reuse previous specifications. This makes it easy to express and mine DPs with complex behavioral relationships.

2.2. Searching methods
Searching methods can be classified as database queries, constraint resolvers, metrics, eXtended Positional Grammar (XPG) formalism and parsing, UML structures and matrices, and miscellaneous approaches.

Several pattern recovery techniques use database queries [20, 29–31] for extracting patterns.
They produce an intermediate representation of the source code (i.e., abstract semantic graph, abstract syntax tree, XML Metadata Interchange (XMI), meta-data, and UML structures) and then use Structured Query Language (SQL) queries to extract pattern-related information. Performance in these cases depends on the underlying database and can scale very well, but the queries are limited to the information available in the intermediate representations. Existing SQL-based approaches usually store a limited set of properties related to the source code elements. This is mainly because the relational model is not well suited to expressing and querying complex graph-based structures. Moreover, they are used for structural and creational DPs, and they only partially support behavioral DP identification. These issues are discussed in [31]. A notable exception is [20], which proposes a meta-model that takes into account a reasonably complete set of structural and behavioral properties. With respect to it, our meta-model does the following: (i) includes the possibility to express fields, methods, and constraints on them; (ii) takes into account delegation and dependence properties; and (iii) allows the use of the defined DSL to build higher-level properties starting from the low-level ones in order to mine complex structures (requiring no changes to the search engine).

Several tools and frameworks to identify idioms, macro-patterns, DPs, and design defects use explanation-based constraint programming techniques. For example, in [32], the authors recover patterns using a multilayered approach that focuses on ensuring an optimal recall rate, but precision and performance are low.

Metric-based techniques compute program-related metrics (e.g., generalizations, aggregations, associations, and interface hierarchies) from different source code representations and compare their values with source code DP metrics. These techniques [33–36] are computationally efficient because they reduce the search space through filtration [37]. Their limitation is that they have been evaluated on only a small number of patterns. Moreover, their precision and recall are low.

The XPG formalism and parsing techniques use a Scalable Vector Graphics (SVG) format for the intermediate representation of the source code and represent DPs in a visual language by mapping the grammar of each pattern onto the graph representation. They give a precise visualization but are limited to structural DPs. Moreover, to the best of our knowledge, the existing experiments are limited to a few patterns [38] and do not report recall rates.

UML structure and matrix techniques [3, 8, 19, 34, 39, 40] allow us to represent structural and behavioral information of software systems.
They apply different techniques to match DP template metrics with the matrices generated for the system. In [39], a semi-automatic approach to identify micro-architectures in the source code is proposed. The approach is based on information organized in three layers: two layers are used to recover an abstract model of the source code (including binary class relationships), and a third layer is used to identify DPs in the abstract model. A DP detection methodology based on similarity scoring between graph vertices is proposed in [8]. The approach is also able to recognize patterns that are modified from their standard representation. It exploits the fact that patterns reside in one or more inheritance hierarchies in order to reduce the size of the graphs to which the algorithm is applied. These approaches are computationally efficient and have good precision and recall rates. Their limitation is that they fail to extract the implementation variants of similar DPs. Furthermore, they are limited to a small number of patterns.

Finally, there are some well-known techniques that cannot be classified in the aforementioned categories (e.g., fuzzy reasoning, bit-vector compression, minimum key structure method, predicate and rho calculus, dynamic analysis using run-time execution traces, formal methods based on semantics, machine learning-based approaches, and concept analysis) but are a good complement to the structural methods cited earlier [41–43]. For example, in [44], De Lucia et al. present some case studies of recovering structural DPs from OO source code, and in [45], they propose a model-checking approach to analyze the behavior of pattern instances both dynamically and statically. In [46], a tool for DP detection and software architecture reconstruction is proposed, in which an approach mixing structural and metric techniques is used to detect pattern instances. More recently, in [47], DP recovery is obtained using an ontology formalism. DP restrictions are formalized and translated into rules that are executed on a knowledge base populated with semantic descriptions of a library code.

Some studies have also focused on the formalization of empirical evaluation criteria [19, 44, 48]. Each applied technique should be evaluated using well-defined criteria, and different authors have proposed taxonomies and related frameworks to perform such evaluations.

The approach we propose is based on a system meta-model, and a DSL, that is able to represent elements down to statements and expressions. This allows us to reason about structural and behavioral properties that can be used to do the following: (i) improve search space reduction and (ii) distinguish between patterns that have the same structure but behave (or are used) in different ways [8]. Thus, the type of analysis includes structural and behavioral analysis, with a graph-matching searching method.
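Point (ii) can be illustrated with a small Python sketch in which two candidates share the same structural properties and are told apart by one behavioral property. The distinguishing rule used here (a concrete state replaces the current state of its context) is an assumption chosen for illustration, not a rule taken from the paper. The relation names, the `classify` helper, and the example types are likewise hypothetical.

```python
# Structural properties shared by State and Strategy (roles: Context, Abstract, Concrete).
SHARED_STRUCTURE = (
    ("has_field_of_type", "Context", "Abstract"),
    ("delegates", "Context", "Abstract"),
    ("inherits", "Concrete", "Abstract"),
)

# ASSUMPTION (illustrative only): in State, a concrete state changes the
# current state of its context; in Strategy, it does not.
STATE_BEHAVIOR = ("changes_state_of", "Concrete", "Context")


def holds(prop, binding, facts):
    """Check one role-level property against the facts of the system."""
    relation, source_role, target_role = prop
    return (relation, binding[source_role], binding[target_role]) in facts


def classify(binding, facts):
    """Return 'State', 'Strategy', or None if the shared structure is absent."""
    if not all(holds(p, binding, facts) for p in SHARED_STRUCTURE):
        return None
    return "State" if holds(STATE_BEHAVIOR, binding, facts) else "Strategy"


# Hypothetical facts extracted from a system.
FACTS = {
    ("has_field_of_type", "TcpConnection", "TcpState"),
    ("delegates", "TcpConnection", "TcpState"),
    ("inherits", "TcpOpen", "TcpState"),
    ("changes_state_of", "TcpOpen", "TcpConnection"),
    ("has_field_of_type", "Sorter", "SortPolicy"),
    ("delegates", "Sorter", "SortPolicy"),
    ("inherits", "QuickSort", "SortPolicy"),
}

tcp = {"Context": "TcpConnection", "Abstract": "TcpState", "Concrete": "TcpOpen"}
sorter = {"Context": "Sorter", "Abstract": "SortPolicy", "Concrete": "QuickSort"}
```

Under these assumptions, `classify(tcp, FACTS)` yields `"State"` and `classify(sorter, FACTS)` yields `"Strategy"`, although both bindings satisfy exactly the same structural properties.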


2.3. Comparison with the proposed approach
As discussed earlier, in recent years a large number of DP mining approaches and tools have been implemented. Accordingly, several studies compare and describe existing approaches [49, 44, 50]. In Table I, starting from these works and from the analysis of the literature, we summarize the most relevant approaches to DP mining. For each approach, the table reports the following: the authors, a summary of the adopted strategy, the supporting tool (if any), the programming language of the mined source code, and the searched DPs. Moreover, the list of the systems used in the tool experimentation and the obtained average precision are shown.

The main advantage of the DPF approach over most of the existing ones is that it allows us to identify variant forms of the classic DPs (as known in the literature). This is a particularly important issue because DPs are present in real-world systems in many variants [8, 9]. With respect to the existing approaches, usually based on a fixed subset of relationships among source code elements, our approach is based on a richer meta-model taking into account both structural relationships and behavioral ones (such as delegation, calls, dependence, and object creation). Moreover, the “override” capability of the declarative DSL specifications makes the approach flexible in detecting new patterns, domain-specific patterns, and architectural ones.

As shown in the table, there are tools for discovering patterns from Java source code and tools for mining patterns from C++. Starting with the C++ mining tools, in [30] the authors use Datrix as an intermediate format to express a wide set of source code properties in order to model DPs. DPF extends the set of such properties and introduces a DSL in order to express pattern specifications in a declarative fashion. Moreover, inheritance among specifications allows us to mine variants without changes to the pattern detection algorithm. In [50], a Rational Rose C++ analyzer enables the extraction of DP UML diagrams from C++ source code. As the authors observed, the language used to write the detection cannot express several key behavioral properties, and they also observed the following: (i) the approach is not capable of distinguishing among patterns with the same structure but different behavior; (ii) it is difficult to extend the approach to consider new kinds of properties that the model does not take into account. Another C++ code mining tool is Pat [31], where the fundamental idea for the automated search is to represent both patterns and designs in Prolog. Unfortunately, some information that would also be relevant for a precise search for pattern instances is not completely extracted by the structural analysis proposed there. DPF overcomes the limitations of the tools in [50] and [31] because it is independent of any modeling approach or tool, and its meta-model and DSL allow a very broad variety of structural and behavioral properties to be modeled declaratively, so that the original properties can be easily extended without changing the mining algorithm.

Another DP mining tool is Crocopat [51]. This tool works exclusively on the Rigi Standard Format (RSF) input format. To use Crocopat on Java projects, a tool that parses Java source code and creates the Crocopat input format from it is needed. Crocopat is compared with Columbus [52] in [53]. This tool is versatile and easily extensible (given the high number of available plug-ins), but unlike our proposed approach, it requires that the systems are annotated with DP Markup Language to describe DPs.

Among the Java mining tools, FUJABA [41] and SPQR [9] were considered. These tools are based on a very similar decomposition method. DPF, with respect to FUJABA, introduces an explicit DSL to express pattern specifications declaratively. Moreover, to improve scalability by reducing the search space, our approach allows variants to be defined using inheritance. An approach to DP identification using a high-performance bit-vector algorithm is proposed in [43]. The approach is more efficient in terms of space and compactness of representation but is based on a restricted set of properties with respect to our approach. Comparing DPF with DeMIMA [39] is quite complicated because DeMIMA is a very different multilayer approach. Based on the precision results shown in the table, we can hypothesize that DPF offers a higher precision because it can be configured using a set of variables. The MARPLE tool [46] is based on an interesting graph-matching technique using an attributed relational graph in which types are nodes and microstructures are associated with edges. Compared with the DPF approach, MARPLE is less capable of reducing the search space, and according to the literature, it has been validated on only a few DPs. The DPRE tool [44] is based upon the XPG formalism to express patterns. Once the grammar is defined, the Visual Programming Environment Generator produces a visual editor and an XpLR


Table I. An overview of design pattern detection approaches.

| Reference | Detection approach | Tool | Mined code | Recovered DPs | Case studies | Precision |
|---|---|---|---|---|---|---|
| Keller et al. [30] | Minimum key structure | SPOOL | C++ | Template method, factory method and bridge | 2 industrial systems, ET++ | - |
| Philippow et al. [50] | Minimum key structure | | C++ | GoF | Student projects | 100% for Singleton, Interpreter |
| Kramer and Prechelt [31] | Class structure | Pat | C++ | Adapter, bridge, proxy, composite, decorator | NME, LEDA, zApp | 14–50% |
| Beyer et al. [5] | Predicate calculus | Crocopat | Java/C++ | Composite, mediator | 2 Mozilla, JWAM, wxWindows | - |
| Dong et al. [19] | Matrix and weights | DP-Miner | Java | Adapter/command, bridge, composite, strategy/state | Java AWT, JEdit, JHotDraw 6.0b1 | 91–100% |
| Balanyi and Ferenc [52] | Class structure | Columbus | C++ | Reclassified GoF | 2 Jikes, Leda, StarOffice, StarOffice Writer | |
| Heuzeroth et al. [42] | Predicates on abstract syntax trees | | Java | Observer, mediator, chain of responsibility, visitor, and decorator | Swing | |
| Niere et al. [41] | Cliché recognition and graph transformation, with fuzzy logic | FUJABA | Java | GoF | AWT library | |
| Antoniol et al. [35] | Cliché matching with software metrics in the class structure | | C++ | Adapter, bridge, proxy, composite, decorator (1995) | LEDA, libg++, galib, mec, socket | |
| Kim and Boldyreff [36] | Metrics | | C++ | GoF | 3 systems (no info) | |
| Olsson and Shi [40] | Class structure, exploiting inter-class relationships | PINOT | Java | Reclassified GoF | ANT, AWT, JHotDraw, Swing | |
| Kaczor et al. [43] | Bit-vector based on string representation | | Java | Abstract factory and composite | JHotDraw, QuickUML, Juzzle | |
| Smith and Stotts [9] | Elemental design patterns and rho calculus | SPQR | Java | Decorator | - | |
| Tsantalis et al. [8] | Class structure expressed as matrices, exploiting graph similarity algorithm | | Java | Composite, adapter/command, decorator, observer, state/strategy, prototype, visitor | JHotDraw, JRefactory, JUnit | |
| De Lucia et al. [44] | XPG formalism and LR-based | DPRE | Java | Adapter, bridge, composite, proxy, and decorator | | |
| Bergenti [3] | Class structure | IDEA | UML | Template, Proxy, Bridge, Composite, Decorator, Adapter | | |
| Arcelli [46] | Basic elements and metrics | MARPLE | Java | Abstract Factory, Composite, Visitor | | |
| Vokac [18] | Semi-formal diagrams translated into queries | No name | C++ | GoF | | |
| France et al. [22] | UML | No name | C++ | Abstract Factory, Bridge, Decorator, Singleton, Observer, Composite, and Visitor | | |
| Stencel [67] | Static analysis techniques and SQL | D3 | Java | GoF | | |
| Gueheneuc [39] | UML-like multilayered approach | DeMIMA | Java | GoF | | |
| Proposed approach | DSL-driven graph matching | DPF | Java | GoF and their variants | | |
