Towards a Prioritization of Code Debt: A Code Smell Intensity Index

Francesca Arcelli Fontana∗, Vincenzo Ferme†, Marco Zanoni∗, Riccardo Roveda∗
∗Department of Informatics, Systems and Communication, University of Milano-Bicocca, Milano, Italy
[email protected], [email protected]
†Faculty of Informatics, University of Lugano (USI), Switzerland
[email protected]
Abstract—Code smells can be used to capture symptoms of code decay and potential maintenance problems that can be avoided by applying the right refactoring. They can be seen as a source of technical debt. However, tools for code smell detection often provide far too many and diverse results, and identify many false positive code smell instances. In fact, these tools are rooted in initial and rather informal code smell definitions, which leaves room to interpret their results in different ways. In this paper, we provide an Intensity Index, to be used as an estimator to determine the most critical instances, prioritizing the examination of smells and, potentially, their removal. We apply the Intensity Index to the detection of six well-known and common smells, and we report their Intensity distribution from an analysis performed on 74 systems of the Qualitas Corpus, showing how Intensity could be used to prioritize code smell inspection.
Refactoring can be expensive, in particular if we detect many code smell instances that have to be first inspected and then removed, but only if they represent real problems. Hence, we have to focus first of all 1) on the real smells and 2) on the most critical ones. With respect to the first issue, the different tools that have been developed for code smell detection may report several false positive instances, which then have to be inspected. We addressed this aspect by defining different kinds of filters for six different smells, with the aim of considering only real smells [9]. In this paper, we focus our attention on the second issue, i.e., the identification of the most critical smells to be removed first. In particular, to face this problem we define an Intensity Index to quantify how critical each code smell instance is, compared to other instances of the same code smell. This index allows reporting detection results to the user in a prioritized way, according to the metrics used in its computation, without considering other features, e.g., related to the context or domain.
I. INTRODUCTION

As outlined by Falessi et al. [1], “progress on managing technical debt takes advantage of existing work in code quality analysis and software measurement”. Software quality analysis and assessment can be addressed by following different directions, e.g., by detecting issues related to bad code or design, or by computing different kinds of metrics in different categories [2]. We focus here our attention on managing technical debt [3], [4], by considering the debt that can be caused by code problems or anomalies, such as those represented by code smells. Code smells [5] are symptoms of problems at the code or design level that can be resolved through the right refactoring steps. Code smells may constitute a form of debt because, e.g., as observed by Macia [6] through her empirical analysis, if some smells are introduced in the early stages of development, it is better to remove them as soon as possible, otherwise they can lead to more serious problems both at the code and at the architectural level. Moreover, the combination of some smells can be particularly harmful [7], and some smells are related to change-proneness and/or fault-proneness [8]; hence, they represent a source of debt. From the point of view of managing technical debt, code smells have the advantage that they usually point to specific and localized issues, associated with specific refactorings. This allows, e.g., building action plans based on the code smells to be solved, with advantages in the estimation of maintenance costs. On the contrary, if we consider only metric values, it is difficult to decide the actions to undertake to improve the system.
The computation of the code smell Intensity Index is integrated in a code smell detection tool that we have developed, called JCodeOdor, but our Intensity Index can also be applied on top of any other detection tool based on the computation of metric thresholds. Our detection strategies, as in many other tools for code smell detection, are based on the computation of a set of metrics (see Section II). The choice of the right thresholds for these metrics is a complex and sensitive task; by allowing the use of softened or hardened thresholds for the metrics, the user can decide to identify a larger or smaller set of code smells. We have computed the Intensity Index for six smells [5], [10] that we detect: God Class, Data Class, Brain Method, Shotgun Surgery, Dispersed Coupling, Message Chains. We decided to focus our attention on these smells because they are among the most related to fault- or change-proneness, they are among the most common [11], and they can be detected using metric-based detection rules. We detected the smells on 74 systems of the Qualitas Corpus [12] and we show the Intensity distribution of the six smells, to underline the effectiveness of this index in prioritizing the most critical smells to be removed.

The paper is organized as follows: in Section II, we outline some related work on code smell detection. In Section III, we define hardened/softened metric thresholds, and we report the applied detection strategies. In Section IV, we describe the Intensity Index. In Section V, we provide results on the effectiveness of the code smell Intensity. In Section VI, we outline possible threats to the validity of our work. Finally, in Section VII, we conclude and outline future developments.
Table I. CODE SMELL DETECTION TOOLS AND APPROACHES

Tool/approach | Latest version | License | Supported languages | Detection approach | Detected smells | Other features
BSDT Eclipse Plugin [13] | no more available | Research prototype | Java | Metrics-based | 7 | Link to source code
Checkstyle [14] | 6.8.1 (2015) | Open Source | Java | Metrics-based | 3 | -
DECOR [15] | 1.0 (2009) | Research prototype | Java | Dedicated specification language | 6 | -
Fica [16] | 2012 | Research prototype | C | Machine Learning | 1 | Clone ranking
iPlasma [10] | 6.1 (2009) | Research prototype | Java, C++ | Metrics-based | 11 | Views
jCOSMO [17] | no more available | Research prototype | Java | Metrics-based | 3 | Views
JDeodorant [18] | 7.3.2 (2014) | Research prototype | Java | Refactoring opportunities | 3 | Refactoring Application
MLCS [19] | 2013 | Research prototype | Java | Machine Learning | 4 | -
PMD [20] | 5.3.2 (2015) | Open Source | Java, JavaScript, XML, XSL | Metrics-based | 4 | Link to source code
Stench Blossom [21] | 1.0.4 (2009) | Research prototype | Java | Metrics-based | 8 | Views
HIST [22] | not available | Research prototype | Java | Change history analysis | 5 | -
DCPP [23] | not available | Research prototype | model-based | Design change propagation analysis | 2 | -
II. RELATED WORK

Many tools for code smell detection have been proposed, both commercial tools and research prototypes. In this section, we consider only some open source and research tools, and we describe the approaches that introduce some kind of Intensity or Severity index. Tools exploit different techniques to detect code smells: some are metrics-based [10], [14], [24], while others use a dedicated specification language [15], use program analysis to identify refactoring opportunities [18], [25], [26], or use a machine learning approach [19], [27], [28]. In Table I, we outline the characteristics of different tools. We can observe that many tools are able to detect only a few code smells, that only one tool, JDeodorant, provides refactoring facilities, and that many detection approaches are metrics-based. Other commercial tools have been developed for code smell detection, e.g., CodeVizard, Borland Together, and inFusion. Several tools have been developed only for Duplicate Code smell detection [29]. Among the reported tools, none provides a code smell Intensity similar to the one we propose. We are aware that, among the commercial tools (not reported in Table I), only inFusion (an extension of iPlasma) provides a Severity index, defined as [30]: “[. . . ] computed by measuring how many times the value of a chosen metric exceeds a given threshold”. Code smell Intensity in our approach, instead, is computed considering the values of all the metrics used in the detection strategy, giving the same weight to all metrics. Hence, to compute the Intensity index, we take into account all the features used as hints of the presence of a smell.

Stench Blossom [21] provides a code smell severity as a visual effect on a petal corresponding to a code smell: increased severity corresponds to increased petal size. It uses the count of the number of detection rule violations to determine the length of the petal; each rule violation is given equal weight. This is different from our Intensity index, because we compute a single measure aggregating the values of the detection rule violations compared with a reference distribution. Recently, Mkaouer et al. [31], with the aim of suggesting refactoring solutions, proposed a code smell Severity, that is, a level assigned to a code smell by a developer. This assessment can change over time. They also define the code smell Importance of a class that contains a code smell, related to the number and size of the features that the class supports. This property can also vary over time, as classes are added, deleted, or split. In the same direction, Vidal et al. [32] present a semi-automated approach for prioritizing code smells before deciding on suitable refactorings for them. The approach is based on three criteria: the stability of the component in which the smell was found, the subjective assessment that the developer makes of each kind of smell using an ordinal scale, and the related modifiability scenarios. The two previous approaches consider a different kind of severity with respect to the one we describe, which is completely related to metric values and thresholds and not to subjective evaluations by developers or other features. Moreover, our method is automatic. In another direction, Zazworka et al. [33] investigate how design debt, in the form of God Classes, affects the maintainability and correctness of software. They do not consider the criticality of each single smell, but they assert that the God Class smell is related to technical debt. Zhao et al. [34] propose a hierarchical approach to identify and prioritize refactorings, which are ranked based on the predicted improvement to the maintainability of the software. They analyzed two systems. They do not define an Intensity Index for code smells.
III. METRICS-BASED DETECTION STRATEGIES

We have detected the six code smells through a tool we have developed, called JCodeOdor, which exploits the Eclipse JDT to analyze Java systems.
Table II. CODE SMELL DETECTION STRATEGIES (the complete names of the metrics are given in Table IV; LABEL(n) means that LABEL has value n for that smell)

Code Smell | Detection Strategy
God Class | LOCNAMM ≥ HIGH(176) ∧ WMCNAMM ≥ MEAN(22) ∧ NOMNAMM ≥ HIGH(18) ∧ TCC ≤ LOW(0.33) ∧ ATFD ≥ MEAN(6)
Data Class | WMCNAMM ≤ LOW(14) ∧ WOC ≤ LOW(0.33) ∧ NOAM ≥ MEAN(4) ∧ NOPA ≥ MEAN(3)
Brain Method | (LOC ≥ HIGH(33) ∧ CYCLO ≥ HIGH(7) ∧ MAXNESTING ≥ HIGH(6)) ∨ (NOLV ≥ MEAN(6) ∧ ATLD ≥ MEAN(5))
Shotgun Surgery | CC ≥ HIGH(5) ∧ CM ≥ HIGH(6) ∧ FANOUT ≥ LOW(3)
Dispersed Coupling | CINT ≥ HIGH(8) ∧ CDISP ≥ HIGH(0.66)
Message Chains | MaMCL ≥ MEAN(3) ∨ (NMCS ≥ MEAN(3) ∧ MeMCL ≥ LOW(2))
Table III. METRIC THRESHOLD NAMES AND PERCENTILES

Threshold | Default | Hardened | Softened
VERY-LOW | 10% | 10% | 10%
LOW | 25% | 30% | 20%
MEAN | 50% | 60% | 40%
HIGH | 75% | 80% | 70%
VERY-HIGH | 90% | 90% | 90%
It is not an Eclipse plugin, which allows a simple usage in batch scenarios, and it can be used as a library or as a standalone tool for extracting metrics. The analyzed systems are represented in an abstract model, which also records the extracted metrics. JCodeOdor queries the model, requesting metrics (or other structural information) on a particular entity or set of entities, to be used for code smell detection. Then it applies the detection strategies (reported in Table II), computes the Intensity index, and applies different kinds of filters to the code smell detection results. We have identified different types of filters for each smell (e.g., Library, Test, Parser class), through which we can identify domain-dependent or design-dependent smells that are not symptoms of real problems [9].

A. Hardened and Softened Metric Thresholds

Our detection approach introduces the possibility to select, in addition to the default setting, hardened or softened thresholds for the metrics. Hardened thresholds provide a stricter detection, collecting fewer code smells; conversely, softened thresholds produce more results than the default setting. Exploiting these thresholds, the user can decide to expand or restrict the result set, e.g., to check if there are other instances or to make a first filtering of the results. For metrics representing direct measures (not ratios), thresholds are computed on the distribution of each metric over a large dataset. The threshold values that we report in this paper are estimated on the 74 systems used for the experiments. The dataset for calculating the thresholds of a metric for a particular code smell is composed of the metric values of only the entities that are relevant for that smell. For example, for the Brain Method code smell, only methods having a non-empty body are used to compose the dataset, since, e.g., an abstract method cannot be a highly complex method.

For every dataset, the distribution of the metric values is computed, creating a quantile function Q_F(p) = inf{x ∈ R : p ≤ F(x)}, where F(x) = P(X ≤ x) and X is a random variable representing the metric. For a probability 0 < p ≤ 1, Q_F(p) returns the minimum value of x for which p ≤ F(x). The returned value represents the value below which random draws from the given distribution would fall p · 100 percent of the time. On the quantile function, five values are selected, representing five significant points often used in statistics, as reported in Table III, e.g., for the creation of box plots; each point has a name and is associated to a position (percentile) in the distribution. Table III assumes that the value of the metric gets worse when it is higher and is compared with the threshold using a ≥ operator. In the opposite case, when the metric gets worse for lower values, the hardened and softened percentiles are swapped. The actual value assigned to each of the five points is specific for each metric-smell combination. In fact, a prefiltering step of the quantile function is applied before deriving the thresholds, removing the low tail of the distribution. All the computed metrics present a highly skewed distribution, where most values are concentrated in the lower part. These distributions are non-normal and difficult to characterize using standard thresholds. The applied filtering procedure is non-parametric, and automatically excludes the values (in the distribution) with a frequency higher than the average. The detailed description of the automatic metric thresholds derivation is out of the scope of this paper and can be found in [35].
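As an illustration of this derivation, the following sketch (not the JCodeOdor implementation; class and method names are ours) computes the five named thresholds of Table III for one metric from its smell-specific dataset, using the empirical quantile function defined above; the prefiltering of the low tail is omitted for brevity.

import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative derivation of the named thresholds of Table III for one metric. */
public class ThresholdDerivation {

    // Percentiles of Table III for VERY-LOW, LOW, MEAN, HIGH, VERY-HIGH.
    static final double[] DEFAULT_PERCENTILES  = {10, 25, 50, 75, 90};
    static final double[] HARDENED_PERCENTILES = {10, 30, 60, 80, 90};
    static final double[] SOFTENED_PERCENTILES = {10, 20, 40, 70, 90};

    /** Empirical quantile Q_F(p) = inf{x : p <= F(x)} over the observed values. */
    static double quantile(double[] sortedValues, double percentile) {
        int n = sortedValues.length;
        // Smallest 0-based index i such that (i + 1) / n >= percentile / 100.
        int i = (int) Math.ceil(n * percentile / 100.0) - 1;
        return sortedValues[Math.max(0, Math.min(i, n - 1))];
    }

    /** Thresholds of one metric for one smell, from the values of the relevant entities only. */
    static Map<String, Double> deriveThresholds(double[] metricValues, double[] percentiles) {
        double[] sorted = metricValues.clone();
        Arrays.sort(sorted);
        String[] names = {"VERY-LOW", "LOW", "MEAN", "HIGH", "VERY-HIGH"};
        Map<String, Double> thresholds = new LinkedHashMap<>();
        for (int k = 0; k < names.length; k++) {
            thresholds.put(names[k], quantile(sorted, percentiles[k]));
        }
        return thresholds;
    }
}

For a metric that gets worse for lower values, the hardened and softened percentile sets would simply be swapped when calling deriveThresholds, as stated above.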
B. Detection Strategies

In Table II we briefly outline the detection strategies of the six code smells we have considered, according to the definitions proposed in the book of Lanza and Marinescu [36], and the definition of the Message Chains smell given by Fowler [5]. The metrics used in the detection strategies are reported in Table IV; custom metrics are prefixed with a * in the table. Our approach is similar to the one proposed by Lanza and Marinescu [10], but we defined some new metrics, e.g., to capture more specific smell characteristics, or to exclude getters and setters from the computation. Each detection strategy is a logical composition of predicates.
Table IV. METRICS USED FOR CODE SMELL DETECTION (custom metrics are prefixed with *)

Short Name | Long Name
ATFD | Access To Foreign Data
*ATLD | Access To Local Data
CC | Changing Classes
CDISP | Coupling Dispersion
CINT | Coupling Intensity
CM | Changing Methods
CYCLO | McCabe Cyclomatic Complexity
FANOUT | Number of Called Classes
LOC | Lines Of Code
*LOCNAMM | Lines of Code Without Accessor or Mutator Methods
*MaMCL | Maximum Message Chain Length
MAXNESTING | Maximum Nesting Level
*MeMCL | Mean Message Chain Length
*NMCS | Number of Message Chain Statements
NOAM | Number Of Accessor Methods
NOLV | Number Of Local Variables
*NOMNAMM | Number of Not Accessor or Mutator Methods
NOPA | Number Of Public Attributes
TCC | Tight Class Cohesion
*WMCNAMM | Weighted Methods Count of Not Accessor or Mutator Methods
WOC | Weight Of Class
Table V. DEFAULT THRESHOLDS FOR ALL SMELLS

Smell | Metric | VERY-LOW | LOW | MEAN | HIGH | VERY-HIGH
God Class | LOCNAMM | 26 | 38 | 78 | 176 | 393
God Class | WMCNAMM | 11 | 14 | 22 | 41 | 81
God Class | NOMNAMM | 7 | 9 | 13 | 21 | 30
God Class | TCC | 0.25 | 0.33 | 0.5 | 0.66 | 0.75
God Class | ATFD | 3 | 4 | 6 | 11 | 21
Data Class | WMCNAMM | 11 | 14 | 21 | 40 | 81
Data Class | WOC | 0.25 | 0.33 | 0.5 | 0.66 | 0.75
Data Class | NOPA | 1 | 2 | 3 | 5 | 12
Data Class | NOAM | 2 | 3 | 4 | 7 | 13
Brain Method | LOC | 11 | 13 | 19 | 33 | 59
Brain Method | CYCLO | 3 | 4 | 5 | 7 | 13
Brain Method | MAXNESTING | 3 | 4 | 5 | 6 | 7
Brain Method | NOLV | 4 | 5 | 6 | 8 | 12
Brain Method | ATLD | 3 | 4 | 5 | 6 | 11
Shotgun Surgery | CC | 2 | 3 | 4 | 5 | 10
Shotgun Surgery | CM | 2 | 3 | 4 | 6 | 13
Shotgun Surgery | FANOUT | 2 | 3 | 4 | 5 | 6
Dispersed Coupling | CINT | 3 | 4 | 5 | 8 | 12
Dispersed Coupling | CDISP | 0.25 | 0.33 | 0.5 | 0.66 | 0.75
Message Chains | MaMCL | 2 | 3 | 3 | 4 | 7
Message Chains | MeMCL | 2 | 2 | 3 | 4 | 5
Message Chains | NMCS | 1 | 2 | 3 | 4 | 5
Each predicate is based on an operator that compares a metric with a threshold. Detection strategies use only the LOW, MEAN and HIGH thresholds. The VERY-LOW and VERY-HIGH thresholds are used for the computation of Intensity, which is described in Section IV. In Table V we report all the extracted thresholds for the metrics used in the detection strategies [35].
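To make the structure of a detection strategy concrete, the sketch below (illustrative only; the container class and method names are ours) encodes the God Class rule of Table II as a boolean predicate over precomputed metric values, with the threshold values hard-coded from the table.

/** Hypothetical holder for the class-level metrics of Table IV. */
class ClassMetrics {
    double locnamm;  // LOCNAMM
    double wmcnamm;  // WMCNAMM
    double nomnamm;  // NOMNAMM
    double tcc;      // TCC
    double atfd;     // ATFD
}

class GodClassStrategy {
    /** God Class rule of Table II: LOCNAMM >= HIGH(176) AND WMCNAMM >= MEAN(22)
     *  AND NOMNAMM >= HIGH(18) AND TCC <= LOW(0.33) AND ATFD >= MEAN(6). */
    static boolean matches(ClassMetrics m) {
        return m.locnamm >= 176
            && m.wmcnamm >= 22
            && m.nomnamm >= 18
            && m.tcc <= 0.33
            && m.atfd >= 6;
    }
}

The other strategies of Table II follow the same pattern; the OR branches (e.g., in Brain Method and Message Chains) become nested boolean expressions.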
IV. CODE SMELL INTENSITY

Code smell Intensity is an index we defined with the aim of capturing the “amount” of smell of each code smell instance. It is based on the idea that we need to prioritize the inspection and resolution of code smells, taking into account all the characteristics of the smell, which are captured by the metrics used in the detection strategy. To allow the user to intuitively understand the information related to code smell Intensity, we decided to present it using a numeric value in the range 1–10 (where 10 is the most critical value) and a label that provides a semantic description of the Intensity. We have five Intensity levels and corresponding value ranges: 1) Very Low: [1, 3.25); 2) Low: [3.25, 5.5); 3) Mean: [5.5, 7.75); 4) High: [7.75, 10); 5) Very High: [10, 10].

The Intensity computation, similarly to the computation of thresholds, relies on the metrics distribution, i.e., on percentiles. We refer to the distribution of the metrics used in the detection strategy of the code smell (represented by 100 percentiles, as described in Section III-A) and we:
1) Compute the Intensity for each metric used by the detection strategy: a) we equally distribute five points on the 100 percentiles of the quantile function; each point corresponds to an Intensity level. The first point is set to the metric threshold’s percentile, while the last point is set to the extreme threshold (VERY-LOW or VERY-HIGH) in the direction defined by the comparator. b) We compare the metric’s value with the values corresponding to the five Intensity points (obtained using the distribution), and we choose the worst Intensity level exceeded by the value of the metric.
2) Compute the average of the metrics’ Intensity levels, to obtain a single value for the code smell. We use the semantics of the logical operators composing the detection strategy: metrics in AND are always considered, metrics in OR are considered only if they exceed their threshold.
The obtained index is represented by a value in the range 1–10 and a corresponding label, chosen among the five we defined. We can average the single Intensity values because they are normalized in the range 1–10, and they incorporate (and are not subject to) the distribution properties. The algorithm that computes the Intensity is independent from the specific code smell and relies only on the detection rule and the corresponding metrics distribution. It is possible to calculate the Intensity of code smells detected by any tool or approach based on metrics and threshold comparisons, like the one by Lanza and Marinescu [10]. The only requirement is to know the metric distribution, in order to estimate the distance between the thresholds and the metric values.

When we compute the Intensity of each of the metrics used by the detection strategy, we also compute a ratio between the metric value and the threshold value:

ER = ExceedingRatio = MetricValue / ThresholdValue

We then sum the ratios of each metric and round (half to even) the value, to obtain an index representing the closest
approximation of how many times the metrics used by the detection strategy exceed their thresholds. We use this index when we have to show the detection results to the user and there are code smells having the same Intensity; this second index is used as an additional sorting criterion.

We provide an example of the computation of the Intensity on a Shotgun Surgery instance, namely the setAttribute(CrawlerSettings, Attribute) method defined in the ComplexType class of Heritrix v1.14.4 (see Table X), having the following metric values: CC: 8; CM: 10; FANOUT: 6. These metric values satisfy the constraints defined in the detection strategy for Shotgun Surgery (reported in Table II; threshold values are reported in Table V). For each metric, we choose five equally distributed points on the quantile function Q_F(p), starting from the threshold and ending at VERY-HIGH (all the comparisons are “≥”). For FANOUT, the starting threshold is LOW (25th percentile). Figure 1 shows how we choose the five points for the FANOUT metric in Shotgun Surgery’s detection strategy. The five points highlighted in red in Figure 1 correspond to five percentiles p in the metric distribution. We map these five percentiles to the actual values using the Q_F(p) function, obtaining what we call the IntensityValues set. In our running example, these five points correspond to:
• Very Low: percentile = 25 → metric value = 2;
• Low: percentile = 41 → metric value = 2;
• Mean: percentile = 58 → metric value = 3;
• High: percentile = 74 → metric value = 4;
• Very High: percentile = 90 → metric value = 6.
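The following sketch summarizes the computation just illustrated (it is our own illustration, not the JCodeOdor code): for each metric it places five equally distributed Intensity points between the threshold’s percentile and the extreme percentile, maps them to metric values through the quantile function, and keeps the worst level reached; the per-metric values are then averaged into the smell Intensity, and the Exceeding Ratio is the rounded (half to even) sum of the value/threshold ratios, as defined above.

import java.util.List;

/** Illustrative computation of the Intensity index and of the Exceeding Ratio (ER). */
class IntensityComputation {

    // Lower bounds of the five Intensity ranges: Very Low, Low, Mean, High, Very High.
    static final double[] LEVEL_VALUES = {1.0, 3.25, 5.5, 7.75, 10.0};

    /**
     * Intensity of a single metric.
     *
     * @param quantile            quantile function of the metric, indexed by percentile 1..100
     * @param thresholdPercentile percentile of the threshold used in the rule
     *                            (25 for FANOUT >= LOW in the Shotgun Surgery example)
     * @param extremePercentile   percentile of VERY-HIGH (or VERY-LOW for "<=" comparisons)
     * @param value               metric value measured on the inspected entity
     * @param greaterIsWorse      true when the rule uses ">=", false when it uses "<="
     */
    static double metricIntensity(double[] quantile, int thresholdPercentile,
                                  int extremePercentile, double value, boolean greaterIsWorse) {
        double level = LEVEL_VALUES[0];
        for (int k = 0; k < 5; k++) {
            // Five equally distributed points between the threshold and the extreme percentile:
            // for FANOUT these are the percentiles 25, 41, 58, 74, 90 listed above.
            int p = (int) Math.round(
                    thresholdPercentile + k * (extremePercentile - thresholdPercentile) / 4.0);
            double pointValue = quantile[p - 1];
            boolean exceeded = greaterIsWorse ? value >= pointValue : value <= pointValue;
            if (exceeded) {
                level = LEVEL_VALUES[k]; // keep the worst level reached by the metric value
            }
        }
        return level;
    }

    /** Smell Intensity: average of the per-metric Intensity values (step 2 above). */
    static double smellIntensity(List<Double> metricIntensities) {
        return metricIntensities.stream()
                .mapToDouble(Double::doubleValue).average().orElse(LEVEL_VALUES[0]);
    }

    /** Exceeding Ratio of a smell: rounded (half to even) sum of MetricValue / ThresholdValue. */
    static long exceedingRatio(double[] metricValues, double[] thresholdValues) {
        double sum = 0;
        for (int k = 0; k < metricValues.length; k++) {
            sum += metricValues[k] / thresholdValues[k];
        }
        return (long) Math.rint(sum); // Math.rint rounds half to even
    }
}

With the distributions used in the paper, this yields High (7.75) for CC and CM and Very High (10) for FANOUT, hence a smell Intensity of (7.75 + 7.75 + 10)/3 = 8.5, as detailed in the remainder of this section.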
Table VI. STATISTICS ABOUT THE 74 QUALITAS CORPUS SYSTEMS

N. Systems | LOC | N. Packages | N. Classes | N. Methods
74 | 6,785,568 | 3,420 | 51,826 | 404,316
Figure 2. Intensity distribution chart on 74 systems. The chart reports the percentage of code smell instances at each Intensity level, for each of the six smells:

Intensity level | God Class | Data Class | Brain Method | Shotgun Surgery | Dispersed Coupling | Message Chains
Very High | 19.18% | 22.49% | 8.11% | 5.84% | 13.05% | 0.00%
High | 54.33% | 64.92% | 64.16% | 45.67% | 21.30% | 40.69%
Mean | 22.77% | 10.16% | 23.10% | 32.19% | 39.02% | 24.29%
Low | 3.47% | 2.37% | 4.59% | 14.69% | 26.63% | 35.02%
Very Low | 0.25% | 0.06% | 0.04% | 1.61% | 0.00% | 0.00%
Figure 1. Intensity points selection

The actual metric values are compared to the five points, to define the Intensity label for each metric in the detection rule, using the following criterion:

arg max_{x ∈ IntensityValues} {MetricValue ≥ x}

In the formula, in case the detection rule uses “≤” as comparison, the comparison operator is ≤, too. In our example, the Intensity labels are: CC = High, CM = High, FANOUT = Very High. The Intensity of the code smell is computed as the average of the Intensity of the single metrics. In particular, each Intensity label is associated to the lower bound of the corresponding value range (reported above), i.e., High = 7.75, Very High = 10. The overall Intensity value is then computed as (7.75 + 7.75 + 10)/3 = 8.5. The value 8.5 is associated to the “High” Intensity label, i.e., it falls in the range [7.75, 10).

As we described, the computation of Intensity depends on the threshold values. For this reason, the selection of hardened or softened thresholds for the detection also affects the Intensity value. Hardening the thresholds defines a smaller range of percentiles where the Intensity points are placed, while softening has the opposite effect, i.e., it creates a larger range and more distant labels.

V. EXPERIMENTAL EVALUATION

In this section, we outline the information we obtain through the code smell Intensity index and its distribution on the six smells in the 74 systems. Table VI summarizes some statistics about the analyzed systems, which are heterogeneous in terms of domain, functionality and size.

A. Intensity Distribution

Figure 2 shows the Intensity distribution for each of the six analyzed code smells in the 74 systems. JCodeOdor reports the results sorted by code smell Intensity. The distribution for God Class, Data Class, Brain Method, and Shotgun Surgery is similar, although with different percentages. The most common Intensity level is High, the rarest is Very Low. Very Low is the rarest also for Dispersed Coupling and Message Chains. The low number of code smells with a Very Low Intensity is evidence of the fact that the metrics, and the corresponding thresholds we use in the detection strategies, characterize well the entities affected by code smells.
Message Chains has its Intensity labels in the reduced range Low–High. The distributions of the metrics we use for the detection of this code smell do not allow a better separation of Intensity values. In fact, the three metrics we use in the detection strategy (MaMCL, MeMCL, and NMCS; see Table IV) are distributed over a small range of values. From the Intensity distribution, we can observe that Intensity can be used to sort code smell detection results and show them to the users in priority order. A user having only the time to investigate the most critical problems can refer, for example, only to Very High code smells. In this way, they can check only nearly 10% of the detection results, which are the most critical (only in terms of the values of the metrics used by the detection strategies).

Table VII. BRAIN METHODS DETECTED IN HERITRIX (MAXN.: MAXNESTING)

Class | Method | Intensity | ER | LOC (≥33) | CYCLO (≥7) | MAXN. (≥6) | NOLV (≥6) | ATLD (≥5)
ExtractorHTML | processGeneralTag(CrawlURI,CharSequence,CharSequence) | 9.0 (High) | 27 | 216 | 46 | 6 | 30 | 14
MirrorWriterProcessor | uriToFile(String,CrawlURI) | 9.0 (High) | 23 | 110 | 37 | 6 | 28 | 18
JobConfigureUtils | handleJobAction(CrawlJobHandler,HttpServletRequest,HttpServletResponse,String,String,String) | 9.0 (High) | 21 | 118 | 40 | 6 | 21 | 12
RegexpHTMLLinkExtractor | processGeneralTag(CharSequence,CharSequence) | 9.0 (High) | 20 | 125 | 30 | 6 | 17 | 13
LogReader | tail(RandomAccessFile,int) | 9.0 (High) | 13 | 85 | 17 | 6 | 15 | 0
JobConfigureUtils | checkAttribute(ModuleAttributeInfo,ComplexType,CrawlerSettings,HttpServletRequest,boolean) | 9.0 (High) | 12 | 76 | 17 | 7 | 15 | 1
RecoveryJournal | importQueuesFromLog(File,CrawlController,int,CountDownLatch) | 9.0 (High) | 12 | 74 | 12 | 8 | 14 | 4
LogReader | getByRegExpr(InputStreamReader,String,int,boolean,int,int,long) | 9.0 (High) | 12 | 67 | 13 | 6 | 18 | 0
LogReader | getByRegExpr(InputStreamReader,String,String,boolean,int,int,long) | 9.0 (High) | 12 | 64 | 13 | 6 | 17 | 0
RecoveryJournal | importCompletionInfoFromLog(File,CrawlController,boolean) | 9.0 (High) | 11 | 60 | 14 | 7 | 12 | 4
RecoveryLogMapper | load(String) | 8.0 (High) | 14 | 85 | 18 | 6 | 10 | 6
ExtractorUniversal | extract(CrawlURI) | 8.0 (High) | 11 | 83 | 16 | 6 | 10 | 4
RegexRule | format(Matcher,String,StringBuffer) | 8.0 (High) | 11 | 64 | 19 | 6 | 10 | 1
CrawlController | processBdbLogs(File,String) | 8.0 (High) | 10 | 51 | 13 | 6 | 12 | 2
PublicSuffixes | buildRegex(String,StringBuilder,SortedSet) | 8.0 (High) | 10 | 56 | 13 | 6 | 11 | 0
WorkQueueFrontier | next() | 7.0 (Mean) | 13 | 82 | 14 | 7 | 6 | 9
WorkQueueFrontier | initialize(CrawlController) | 7.0 (Mean) | 9 | 50 | 9 | 6 | 3 | 8
ComplexType | replaceComplexType(CrawlerSettings,ComplexType) | 6.0 (Mean) | 11 | 44 | 10 | 6 | 9 | 5
CrawlJobHandler | loadProfiles() | 6.0 (Mean) | 9 | 38 | 10 | 7 | 9 | 4
FetchHTTP | loadCookies(String) | 6.0 (Mean) | 9 | 52 | 11 | 6 | 6 | 3
BdbMultipleWorkQueues | getFrom(FrontierMarker,int) | 6.0 (Mean) | 9 | 43 | 8 | 6 | 11 | 2
SelfTestCase | filesFoundInArc() | 6.0 (Mean) | 9 | 35 | 10 | 6 | 10 | 0
AbstractFrontier | enforceBandwidthThrottle(long) | 5.0 (Low) | 11 | 41 | 7 | 6 | 7 | 6
CachedBdbMap | getDisk(Object) | 4.0 (Low) | 9 | 48 | 7 | 7 | 6 | 0
Table VIII. GOD CLASSES DETECTED IN HERITRIX (LOC': LOCNAMM, WMC': WMCNAMM, NOM': NOMNAMM)

Class | Intensity | ER | LOC' (≥176) | WMC' (≥22) | ATFD (≥6) | TCC (≤0.33) | NOM' (≥18)
CrawlController | 10.00 (Very High) | 39 | 1,834 | 235 | 34 | 0.18 | 89
CachedBdbMap | 10.00 (Very High) | 31 | 1,802 | 148 | 37 | 0 | 42
CrawlJobHandler | 10.00 (Very High) | 29 | 1,228 | 173 | 26 | 0.25 | 55
AbstractFrontier | 10.00 (Very High) | 28 | 1,120 | 143 | 32 | 0.18 | 60
StatisticsTracker | 10.00 (Very High) | 28 | 1,080 | 147 | 27 | 0.13 | 62
Heritrix | 9.55 (High) | 44 | 2,453 | 325 | 16 | 0.22 | 101
FetchHTTP | 9.55 (High) | 28 | 1,434 | 172 | 17 | 0.18 | 43
WriterPoolProcessor | 9.10 (High) | 18 | 573 | 75 | 18 | 0.22 | 33
CrawlSettingsSAXSource | 7.75 (High) | 30 | 433 | 49 | 35 | 0.03 | 18
AdaptiveRevisitFrontier | 7.75 (High) | 28 | 1,274 | 158 | 18 | 0.26 | 69
WorkQueueFrontier | 7.75 (High) | 25 | 1,315 | 134 | 13 | 0.27 | 51
SettingsHandler | 7.75 (High) | 15 | 509 | 64 | 6 | 0.19 | 30
ToeThread | 7.75 (High) | 15 | 523 | 49 | 11 | 0.25 | 21
WARCWriterProcessor | 5.95 (Mean) | 31 | 584 | 73 | 6 | 0.02 | 19
Processor | 3.70 (Low) | 18 | 217 | 25 | 7 | 0.04 | 19
B. Intensity Report Example

To better understand the effect of Intensity on the report of detection results, we consider here as an example one system among the 74, Heritrix v1.14.4. We manually inspected the results, to relate Intensity with code characteristics. Tables VII, VIII, IX, and X report all the Brain Method, God Class, Data Class, and Shotgun Surgery code smells detected in Heritrix. For Dispersed Coupling and Message Chains, no results were found in this system. Every line reports the name of the affected class or method, the Intensity, and the values and thresholds of each metric relevant to the detection. Results are sorted (descending) first by Intensity value and then by Exceeding Ratio (ER), with the most urgent code smells at the top of the list. The value of each metric relevant to the detection, compared with its threshold, gives the user a way to inspect the reasons why the smell is reported as critical.
Table IX. DATA CLASSES DETECTED IN HERITRIX (WMC': WMCNAMM)

Class | Intensity | ER | WOC (≤0.33) | WMC' (≤14) | NOPA (≥3) | NOAM (≥4)
MatchesFilePatternDecideRule | 10.00 (Very High) | 13 | 0.08 | 9 | 12 | 0
LowDiskPauseProcessor | 9.25 (High) | 10 | 0.11 | 9 | 8 | 0
RegexpLineIterator | 8.50 (High) | 9 | 0.2 | 4 | 4 | 0
StringIntPair | 7.75 (High) | 20 | 0.2 | 1 | 0 | 4
DomainScope | 7.00 (Mean) | 6 | 0.25 | 12 | 3 | 0
From the report of detected God Classes (Table VIII), we can see that the Intensity value spans the range (3.7, 10), a large portion of the allowed interval for the Intensity. This range is justified by the huge differences in the respective metric values. For example, the Heritrix and Processor classes have very different Intensity values, suggesting that Heritrix is more urgent to analyze.
Considering the respective metric values, we get, e.g., 2,453 vs 217 LOC and 325 vs 25 WMCNAMM, i.e., they define very different class profiles. The value of each metric used in the detection rule, compared to its threshold, supports the user in dealing with the code smell removal, offering a deeper view of the reasons for the detection. From Table VIII, we can also understand the effect of the ER that we use to sort detection results. If we look, for example, at the CrawlSettingsSAXSource and ToeThread classes, they have the same Intensity value and, as a consequence, the same Intensity label. The ER gives more importance to CrawlSettingsSAXSource, moving it upward in the detection results, because it has two very strong metric outliers: ATFD and TCC. This example highlights the role of the two indexes in the ranking of results: Intensity takes care of the distribution of the values of all the metrics used in the detection strategy, while the ER is strongly impacted by outliers and gives more importance to single extreme metric values. Choosing ER as the main sorting criterion, the results would have been different and mainly driven by some extreme characteristics of the affected smell. We prefer to consider all the factors captured by the detection strategies, which have been chosen to characterize the presence of the smell. Looking at the Data Class detection results reported in Table IX, we can find the same kind of examples we described for God Class. If we look at the code of the MatchesFilePatternDecideRule class (Intensity 10) in Listing 1 and the DomainScope class (Intensity 7) in Listing 2, the first is a mere data container with many public attributes (NOPA metric four times over the threshold of 3) and no logic (WOC metric more than four times under the threshold of 0.33), while the second has a low WOC but contains some logic inside the focusAccepts(Object) method, and has fewer public attributes (NOPA) and accessor methods (NOAM) with respect to the first one. The results regarding Brain Method and Shotgun Surgery, reported in Table VII and Table X, contain further examples of the way our sorting criterion gives relevance to smells with different characteristics. For Brain Method, the use of the ER as a second sorting index is even more evident, because there are more detection results with the same Intensity value, and we can better differentiate them according to strong outliers.

Table X. SHOTGUN SURGERY DETECTED IN HERITRIX

Class | Method | Intensity | ER | CC (≥5) | CM (≥6) | FANOUT (≥3)
ComplexType | setAttribute(CrawlerSettings,Attribute) | 8.50 (High) | 8 | 8 | 10 | 6
Processor | getController() | 7.75 (High) | 20 | 32 | 55 | 3
ANVLRecord | addLabelValue(String,String) | 5.50 (Mean) | 8 | 5 | 16 | 3

Listing 1. Intensity 10 Data Class

package org.archive.crawler.deciderules;
/* REMOVED: Import lines */
public class MatchesFilePatternDecideRule extends MatchesRegExpDecideRule {
    private static final long serialVersionUID = -4182743018517062411L;
    /* REMOVED: one private field */
    /* REMOVED: string values */
    public static final String ATTR_USE_PRESET = "";
    public static final String IMAGES_PATTERNS = "";
    public static final String AUDIO_PATTERNS = "";
    public static final String VIDEO_PATTERNS = "";
    public static final String MISC_PATTERNS = "";
    public static final String ALL_DEFAULT_PATTERNS = "";
    public static final String ALL = "";
    public static final String IMAGES = "";
    public static final String AUDIO = "";
    public static final String VIDEO = "";
    public static final String MISC = "";
    public static final String CUSTOM = "";

    public MatchesFilePatternDecideRule(String name) {
        super(name);
        /* REMOVED: assignment lines */
    }

    protected String getRegexp(Object o) {
        try {
            String patternType = (String) getAttribute(o, ATTR_USE_PRESET);
            if (patternType.equals(ALL)) {
                return ALL_DEFAULT_PATTERNS;
            } else if (patternType.equals(IMAGES)) {
                return IMAGES_PATTERNS;
            } else if (patternType.equals(AUDIO)) {
                return AUDIO_PATTERNS;
            } else if (patternType.equals(VIDEO)) {
                return VIDEO_PATTERNS;
            } else if (patternType.equals(MISC)) {
                return MISC_PATTERNS;
            } else if (patternType.equals(CUSTOM)) {
                return super.getRegexp(o);
            } else {
                assert false : "Unrecognized pattern type " + patternType
                        + ". Should never happen!";
            }
        } catch (AttributeNotFoundException e) {
            logger.severe(e.getMessage());
        }
        return null;
    }
}
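A minimal sketch of the ranking criterion used in Tables VII–X (the record type and names are ours, for illustration): results are ordered by descending Intensity, with the Exceeding Ratio as a tiebreaker.

import java.util.Comparator;
import java.util.List;

/** Hypothetical record for a reported code smell instance. */
class SmellResult {
    final String entity;         // affected class or method
    final double intensity;      // Intensity index in the range 1-10
    final long exceedingRatio;   // ER, secondary sorting criterion

    SmellResult(String entity, double intensity, long exceedingRatio) {
        this.entity = entity;
        this.intensity = intensity;
        this.exceedingRatio = exceedingRatio;
    }
}

class SmellRanking {
    /** Sorts detection results: Intensity first, ER second, both descending. */
    static void rank(List<SmellResult> results) {
        results.sort(Comparator.comparingDouble((SmellResult r) -> r.intensity)
                .thenComparingLong(r -> r.exceedingRatio)
                .reversed());
    }
}

With this ordering, CrawlSettingsSAXSource (Intensity 7.75, ER 30) is listed before ToeThread (Intensity 7.75, ER 15), as discussed above.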
VI. THREATS TO VALIDITY

A possible threat to the construct validity of this work lies in the metrics collection phase. We implemented metric detection directly in our tool, so the values of the metrics we gathered could differ from the ones collected by other tools. We tested each metric detector with reference code to mitigate this issue. A possible threat to internal validity could derive from the usage of the same set of 74 systems for defining the metric thresholds and for the experimental evaluation. To avoid biasing the analyses we made and described on Heritrix, we re-estimated the relevant thresholds by removing Heritrix from the initial set of systems (obtaining a set of 73 systems), and the thresholds did not change. We applied the same procedure to every analyzed system when producing the data. We experienced a very small number of cases in which threshold values changed by only a single unit, while in most cases the thresholds did not change at all (w.r.t. the values reported in Table II). The resulting difference in the reported code smells is negligible (≤ 0.01%). Therefore, using the thresholds reported in Table II would not have changed the results.
Listing 2. Intensity 7 Data Class

package org.archive.crawler.scope;
/* REMOVED: Import lines */
public class DomainScope extends SeedCachingScope {
    private static final long serialVersionUID = 648062105277258820L;
    private static final Logger logger =
            Logger.getLogger(DomainScope.class.getName());
    /* REMOVED: string values */
    public static final String ATTR_TRANSITIVE_FILTER = "";
    public static final String ATTR_ADDITIONAL_FOCUS_FILTER = "";
    public static final String DOT = "";
    Filter additionalFocusFilter;
    Filter transitiveFilter;

    public DomainScope(String name) {
        super(name);
        /* REMOVED: assignment lines */
    }

    protected boolean transitiveAccepts(Object o) {
        return this.transitiveFilter.accepts(o);
    }

    protected boolean focusAccepts(Object o) {
        UURI u = UURI.from(o);
        if (u == null) {
            return false;
        }
        String seedDomain = null;
        String candidateDomain = null;
        try {
            candidateDomain = u.getHostBasename();
        } catch (URIException e1) {
            /* REMOVED: string values */
            logger.severe("" + u);
        }
        if (candidateDomain == null) {
            return false;
        }
        Iterator iter = seedsIterator();
        while (iter.hasNext()) {
            UURI s = (UURI) iter.next();
            try {
                seedDomain = s.getHostBasename();
            } catch (URIException e) {
                /* REMOVED: string values */
                logger.severe("" + s);
            }
            if (seedDomain == null) {
                continue;
            }
            if (seedDomain.equals(candidateDomain)) {
                checkClose(iter);
                return true;
            }
            seedDomain = DOT + seedDomain;
            if (seedDomain.regionMatches(0, candidateDomain,
                    candidateDomain.length() - seedDomain.length(),
                    seedDomain.length())) {
                checkClose(iter);
                return true;
            }
        }
        checkClose(iter);
        return false;
    }

    protected boolean additionalFocusAccepts(Object o) {
        return additionalFocusFilter.accepts(o);
    }
}
A threat to the external validity can be related to the dataset composition. The Qualitas Corpus is composed only of Java open source systems, and this could affect the generalization of our approach to commercial and non-Java systems.
VII. CONCLUSION AND FUTURE DEVELOPMENT

Starting from the assumption that code smells can represent a class of code quality issues that contribute to debt, in this paper we addressed the problem of finding the most critical smells, to support the prioritization of the smells to be inspected and removed first through refactoring. Refactoring is an expensive task, and prioritization can save time, allowing one to consider only (or first) the most critical problems. We described the Code Smell Intensity index, introduced to prioritize and identify the most critical smells. It exploits thresholds and metric distributions to assign different Intensity levels to each detected smell. We define Intensity with the aim of approximating the actual severity of detected code smells. By analyzing the distribution of the Intensity index, we observed that only about 10% of smells belong to the highest Intensity level. This means that users can effectively use Intensity to reduce the number of instances to inspect. We showed how the Intensity and Exceeding Ratio indexes are effective in ranking code smell instances on a system, providing examples of how they synthesize the values of all the metrics used in the detection.

As future developments, we plan to perform experiments with developers to collect feedback regarding the prioritization of detection results obtained using Intensity, and to reach a proper experimental validation of its reliability. In fact, refactoring decisions are influenced by many contextual factors that cannot be decided by considering code only. We are also interested in improving our code smell Intensity in different directions: 1) by weighting the metrics in the detection strategy, to be able to give them different relative importance; 2) by studying the correlations among the metrics in detection strategies, and incorporating this knowledge in the index; 3) by considering not only metric thresholds, but also other features useful to prioritize the smells, e.g., number of defects, change frequency, cost of program comprehension. Moreover, we started investigating [37] whether code smell relations or co-occurrences of code smells lead to architectural smells. Hence, we would like to enhance our sorting criteria by taking into account both Intensity and related or co-occurring smells. If a smell is related to or co-occurs with several other smells, it represents a more critical smell, and it could be useful to analyze it before other ones. We also aim to extend the detection to other smells; the proposed detection approach is general enough to be applied to most code smells. We plan to provide a set of customized views [37], representing all aspects related to the presence of potential code/design problems, useful to locate them and to find the best refactoring solution. Through the identification of the most critical smells and the correlations existing among them, we could also try to define a kind of Refactoring Advisor, as we defined for duplicated code [38] following a different approach specific to clone detection.
REFERENCES
[1] D. Falessi, P. Kruchten, R. L. Nord, and I. Ozkaya, “Technical debt at the crossroads of research and practice: report on the fifth international workshop on managing technical debt,” ACM SIGSOFT Software Engineering Notes, vol. 39, no. 2, pp. 31–33, 2014.
[2] S. Chidamber and C. Kemerer, “A metrics suite for object oriented design,” IEEE Trans. Softw. Eng., vol. 20, no. 6, pp. 476–493, Jun. 1994.
[3] Z. Li, P. Avgeriou, and P. Liang, “A systematic mapping study on technical debt and its management,” Journal of Systems and Software, vol. 101, no. 0, pp. 193–220, 2015.
[4] P. Kruchten, “Strategic management of technical debt: Tutorial synopsis,” in Proceedings of the 12th International Conference on Quality Software (QSIC 2012). IEEE, Aug. 2012, pp. 282–284.
[5] M. Fowler, Refactoring: Improving the Design of Existing Code. Boston, MA, USA: Addison-Wesley, 1999.
[6] I. Macía Bertrán, “On the detection of architecturally-relevant code anomalies in software systems,” Ph.D. dissertation, PUC-Rio, Departamento de Informática, Rio de Janeiro, 2013.
[7] M. Abbes, F. Khomh, Y.-G. Guéhéneuc, and G. Antoniol, “An empirical study of the impact of two antipatterns, Blob and Spaghetti Code, on program comprehension,” in Proc. 15th European Conf. Software Maintenance and ReEng. Oldenburg, Germany: IEEE, Mar. 2011, pp. 181–190.
[8] F. Khomh, M. D. Penta, Y.-G. Guéhéneuc, and G. Antoniol, “An exploratory study of the impact of antipatterns on class change- and fault-proneness,” Empirical Software Engineering, vol. 17, no. 3, pp. 243–275, Aug. 2011.
[9] F. Arcelli Fontana, V. Ferme, and M. Zanoni, “Filtering code smells detection results,” in Proceedings of the 37th International Conference on Software Engineering (ICSE 2015). Florence, Italy: IEEE, May 2015, poster track.
[10] M. Lanza and R. Marinescu, Object-Oriented Metrics in Practice. Springer-Verlag, 2006.
[11] V. Ferme, “JCodeOdor: A software quality advisor through design flaws detection,” Master’s thesis, University of Milano-Bicocca, Milano, Italy, Sep. 2013, http://essere.disco.unimib.it/reverse/files/VFermeMsCThesis2013.pdf.
[12] E. Tempero, C. Anslow, J. Dietrich, T. Han, J. Li, M. Lumpe, H. Melton, and J. Noble, “The Qualitas Corpus: A curated collection of Java code for empirical studies,” in Proc. 17th Asia Pacific Software Eng. Conf. Sydney, Australia: IEEE, Dec. 2010, pp. 336–345.
[13] P. Danphitsanuphan and T. Suwantada, “Code smell detecting tool and code smell-structure bug relationship,” in Proc. 2012 Spring Congr. Eng. and Technology (S-CET). Xi’an, China: IEEE, May 2012, pp. 1–5.
[14] (2012, November) Checkstyle. [Online]. Available: http://checkstyle.sourceforge.net/index.html
[15] N. Moha, Y.-G. Guéhéneuc, L. Duchien, and A.-F. L. Meur, “DECOR: A method for the specification and detection of code and design smells,” IEEE Trans. Softw. Eng., vol. 36, no. 1, pp. 20–36, Jan. 2010.
[16] J. Yang, K. Hotta, Y. Higo, H. Igaki, and S. Kusumoto, “Filtering clones for individual user based on machine learning analysis,” in Proc. 6th Int. Workshop on Software Clones (IWSC 2012). Zurich, Switzerland: IEEE, 2012, pp. 76–77.
[17] E. van Emden and L. Moonen, “Java quality assurance by detecting code smells,” in Proc. 9th Working Conf. Reverse Eng. (WCRE 2002). Richmond, Virginia, USA: IEEE, 2002, pp. 97–106.
[18] N. Tsantalis, T. Chaikalis, and A. Chatzigeorgiou, “JDeodorant: Identification and removal of type-checking bad smells,” in Proc. 12th European Conf. Software Maintenance and ReEng. (CSMR 2008). IEEE, Apr. 2008, pp. 329–331.
[19] F. Arcelli Fontana, M. Zanoni, A. Marino, and M. V. Mäntylä, “Code smell detection: towards a machine learning-based approach,” in Proc. 29th IEEE Intern. Conf. on Software Maintenance (ICSM 2013), ERA Track. Eindhoven: IEEE, September 2013, pp. 396–399.
[20] (2012, November) PMD. [Online]. Available: http://pmd.sourceforge.net/
[21] E. Murphy-Hill, N. Carolina, and A. P. Black, “An interactive ambient visualization for code smells,” in Proc. 5th Int. Symp. Software Visualization (SOFTVIS ’10). Salt Lake City, Utah, USA: ACM, October 2010, pp. 5–14.
[22] F. Palomba, G. Bavota, M. Di Penta, R. Oliveto, A. De Lucia, and D. Poshyvanyk, “Detecting bad smells in source code using change history information,” in Proc. 28th IEEE/ACM Int. Conf. Automat. Software Eng. (ASE 2013). Silicon Valley, CA: IEEE, Nov. 2013, pp. 268–278.
[23] A. A. Rao and K. N. Reddy, “Detecting bad smells in object oriented design using design change propagation probability matrix,” in Proc. Int. MultiConference of Engineers and Computer Scientists (IMECS 2008). Newswood Limited, Mar. 2008.
[24] M. Munro, “Product metrics for automatic identification of ”bad smell” design problems in Java source-code,” in Proc. 11th IEEE Int. Software Metrics Symp. (METRICS’05). Como, Italy: IEEE, 2005, p. 15.
[25] M. Fokaefs, N. Tsantalis, and A. Chatzigeorgiou, “JDeodorant: Identification and removal of Feature Envy bad smells,” in Proc. Int. Conf. Software Maintenance (ICSM 2007). IEEE, Oct. 2007, pp. 519–520.
[26] J. Lee, D. Lee, D.-K. Kim, and S. Park. (2012, Apr.) A semantic-based approach for detecting and decomposing god classes. [Online]. Available: http://arxiv.org/abs/1204.1967v1
[27] A. Maiga, N. Ali, N. Bhattacharya, A. Sabane, Y.-G. Guéhéneuc, and E. Aimeur, “SMURF: A SVM-based incremental anti-pattern detection approach,” in Proc. 19th Working Conf. Reverse Eng. (WCRE 2012). Kingston, Ontario, Canada: IEEE, Oct. 2012, pp. 466–475.
[28] N. Maneerat and P. Muenchaisri, “Bad-smell prediction from software design model using machine learning techniques,” in Proc. 8th Int. Joint Conf. Computer Science and Software Eng. (JCSSE 2011). Nakhon Pathom, Thailand: IEEE, May 2011, pp. 331–336.
[29] D. Rattan, R. K. Bhatia, and M. Singh, “Software clone detection: A systematic review,” Information & Software Technology, vol. 55, no. 7, pp. 1165–1199, 2013.
[30] R. Marinescu, “Assessing technical debt by identifying design flaws in software systems,” IBM Journal of Research and Development, vol. 56, no. 5, pp. 9:1–9:13, 2012.
[31] M. W. Mkaouer, M. Kessentini, S. Bechikh, and M. Ó Cinnéide, “A robust multi-objective approach for software refactoring under uncertainty,” in Search-Based Software Engineering, ser. Lecture Notes in Computer Science, C. Le Goues and S. Yoo, Eds. Springer International Publishing, 2014, vol. 8636, pp. 168–183.
[32] S. A. Vidal, C. Marcos, and J. A. Díaz-Pace, “An approach to prioritize code smells for refactoring,” Automated Software Engineering, pp. 1–32, Dec. 2014.
[33] N. Zazworka, M. A. Shaw, F. Shull, and C. Seaman, “Investigating the impact of design debt on software quality,” in Proceedings of the 2nd Workshop on Managing Technical Debt (MTD ’11). Honolulu, USA: ACM, May 2011, pp. 17–23.
[34] L. Zhao and J. Hayes, “Rank-based refactoring decision support: two studies,” Innovations in Systems and Software Engineering, vol. 7, no. 3, pp. 171–189, 2011.
[35] F. Arcelli Fontana, V. Ferme, M. Zanoni, and A. Yamashita, “Automatic metric thresholds derivation for code smell detection,” in Proc. of the 6th Intern. Workshop on Emerging Trends in Software Metrics (WETSoM 2015). Florence, Italy: IEEE, May 2015, co-located with ICSE 2015.
[36] M. Lanza, S. Ducasse, H. Gall, and M. Pinzger, “CodeCrawler: an information visualization tool for program comprehension,” in ICSE ’05: Proc. 27th Int. Conf. Software Engineering. St. Louis, MO, USA: ACM, 2005, pp. 672–673.
[37] F. Arcelli Fontana, V. Ferme, and M. Zanoni, “Towards assessing software architecture quality by exploiting code smell relations,” in IEEE Proceedings of the SAM 2015 workshop, co-located with ICSE 2015. Florence, Italy: IEEE, May 2015.
[38] F. Arcelli Fontana, M. Zanoni, and F. Zanoni, “A duplicated code refactoring advisor,” in Agile Processes in Software Engineering and Extreme Programming, ser. Lecture Notes in Business Information Processing, C. Lassenius, T. Dingsøyr, and M. Paasivaara, Eds. Helsinki: Springer International Publishing, May 2015, vol. 212, pp. 3–14.