2009 European Conference on Software Maintenance and Reengineering
Establishing Traceability Links between Unit Test Cases and Units under Test

Bart Van Rompaey and Serge Demeyer
Lab On REengineering, University of Antwerp
Middelheimlaan 1, 2020 Antwerpen, Belgium
[email protected], [email protected]
Abstract

Coding and testing are two activities that are tightly intermingled in agile software development, requiring developers to frequently shift between production code and test artefacts. Unfortunately, links between these artifacts are typically only implicitly present in the source code, forcing developers towards time-consuming code inspections. In this work, we evaluate the potential of six traceability resolution strategies involving test naming and design conventions, static call graphs, call behavior before asserts, lexical analysis and version log mining to make the relation between developer test cases and units under test explicit. The results show that test conventions yield highly accurate results, yet in their absence capturing the destination type of calls right before assert statements appears to be a valuable alternative. Combining these strategies allows the user to find a balance between improved applicability and accuracy.¹

¹ An expanded version of this paper appears in the Ph.D. dissertation of the first author.

1 Introduction

Coding and testing are two activities that are tightly intermingled in many agile software development methodologies. Unit tests especially require frequent adaptation to reflect changes in the production code to keep an effective regression suite. As a consequence, developers need to identify the set of test cases relevant to a working set of production components.

In a context of legacy systems and reengineering, tests are valuable as a source of up-to-date documentation [11]. Tests can be perceived as examples of how to use part of a system [15]. Indeed, as software tests rely on a running system, they show how parts of a system are executed and as such how they are supposed to be used. Tests are therefore an excellent repository for developers trying to understand parts of a system. Being stored in the same version control system as the production source code, unit tests are easy to consult and edit, during forward engineering as well as refactoring.

Unfortunately, current traceability between source code and test artefacts is poor. Even when some IDEs support the creation of a test case stub based upon a given production class, a hard, explicit link between a production class and a test case is seldom maintained. As such, identifying this change set remains a manual process of code inspection and text searching (name matching). In more standardized unit testing environments such as xUnit [17, Chapter 3], a developer seeks his way by following design guidelines and conventions that characterize these environments, such as the physical storage location of source and test artifacts, naming conventions, source code references, etc.

The goal of this work consists of investigating which information from system artefacts (including the system's history) helps to establish traceability links between production classes and xUnit test cases in object-oriented systems. We hypothesize that (i) a test case naming convention; (ii) explicit fixture declaration; (iii) static test call graphs; (iv) run-time traces; (v) lexical analysis; and (vi) co-evolution logs are all viable means to
reveal such links. For these six sources we evaluate the applicability to sample sets of test cases as well as the accuracy of the retrieved links. In the next sections, we first consider related work (Section 2). Then, we expand upon each of the resolution strategies in Section 3 before describing an experiment (Section 4) and its results (Section 5). We discuss threats to validity in Section 6. After interpreting the experimental results in Section 7, we evaluate the gains in applicability and accuracy obtained by combining resolution strategies (Section 8). Finally, we wrap up in Section 9.
2 Related Work

In this related work section, we first discuss current practices for navigating between code and developer test artifacts in integrated development environments, followed by the remedies proposed in research. Next, we consider related work in program comprehension through developer tests. Finally, we look at related work on traceability.

Several sources have pointed to unit tests as a form of live and up-to-date documentation of a software system [11, 5, 15], and as such a worthwhile starting point for newcomers to the (sub)system. However, not all tests are alike, as there are tests of different granularity levels as well as intentions [14, 18]. This makes it difficult for developers to understand the purpose of a test case and, as such, the thoroughness with which production code is verified. Worse, in current xUnit-style testing there is no fixed structure, nor does an explicit link exist between a production unit and a test case. As such, developers can have a hard time finding the right tests for a programming task.

Today's integrated development environments offer little support to the developer to browse between developer tests and production code. The Eclipse Java environment (http://www.eclipse.org) suggests, when creating a new test case via a wizard, to provide the corresponding unit(s) under test. The link is then documented by means of a Javadoc annotation. Unfortunately, developers are free not to use this wizard, let alone to complete the unit under test. Eclipse also offers a "referring tests" feature that provides a list of (JUnit) test cases that statically 'use' a production class. There are currently no means to store the developers' choice from this list. To compensate for these drawbacks, Bouillon et al. implemented a JUnit Eclipse plugin with static call graphs per test case to help developers indicate the intended units under test [7]. Furthermore, they raise this traceability link from inside a comment string to the program itself by using a Java annotation construct. Their main goal lies in pinpointing error locations in production code as a result of a failing test. To improve test structure and interconnectivity, Gaelli et al. propose to organize tests as examples of their units under test [15]. Moreover, developers can reuse units under test to build higher-level tests. Unfortunately, applying these techniques to existing test suites requires an extensive amount of manual effort.

Several techniques aim to understand and characterize test suites. Gaelli et al. propose a unit test taxonomy, making a distinction between the scope of the test (one method versus several methods) and the intent (e.g., asserting that some behavior works versus ensuring that other behavior does not work) [14]. Using an automated approach based on naming convention and test structure, they automatically categorize tests with high precision and moderate recall. Building on the idea that tests may serve well as documentation, Van Geet and Zaidman propose a call trace-based approach to evaluate unit tests for this purpose [27]. Applied to the Apache Ant build system software, they observe that the median number of executed methods per test is more than 200, which makes them conclude that the test suite of this particular project is not well suited for documentation purposes. For such reasons, Cornelissen et al. propose to use sequence diagrams to visualize the information extracted from a test run, to help developers gain knowledge about the inner workings of a software system [9]. The use of filtering techniques and stack depth limitations improves the scalability of the approach. Kanstrén associates testing levels with call trace information, to reveal the scope of test cases, identify shortcomings in the test suite or remove redundant test cases [18]. The reverse engineering techniques behind these works are related to our static and dynamic analysis strategies.

Traceability in software has been extensively studied. The following works concern traceability in test artefacts or use traceability techniques that we adopt in this work as well. Wilde et al. proposed to locate requirements in old code by mapping functionalities onto program components by profiling test case execution [29]. Two heuristics for the mapping identified 40% and 50% of the subroutines the developers indicated. Gall et al. [16] studied the history of a software project to deduce which modules are modified together, as indicated in release history systems. As such, logical coupling captures artifacts that actually change together, as opposed to design documents that indicate what is expected to co-change. Ying et al. and Zimmerman et al. have investigated change patterns
in version control system logs to predict and guide changes [30, 32]. Lubsen [21] uses frequent set mining on the change log to characterize the amount of unit testing in the development process. We explore the use of version control logs to link production classes with developer test cases. Sneed combines static and dynamic linking to connect test cases and code [25]. First, code functions were mapped to requirements model functions using name matching. After an additional manual linking step, 85% of the links were resolved; the link between a test case and a requirement function was already established. In the second approach, Sneed links test cases with code functions by associating test case time stamps with code function time stamps. For a comprehensive overview of information retrieval work applied to software artifacts, see De Lucia et al. [10]. Recently, in a case study on an industrial system, Lormans et al. investigate, among other things, the relationship between requirements and test cases, by applying Latent Semantic Indexing (LSI) to both kinds of textual documents [20]. The many resulting false positive links lead them to conclude that there is a mismatch between both sources. We include LSI as one of the techniques in this paper.
3 Traceability Resolution Strategies

We propose a series of automatable traceability resolution strategies. Next to strategies that start from the source code, we also use information derived from the running test suite or from version control system logs.

Naming Convention (NC). Considering the intention that is behind naming conventions, we highly value the name of a test case as a source for a resolution strategy. Indeed, by including the name of the unit under test (UUT) in the name of the test case (TC), for example by prepending or appending "Test" to the name of the UUT, a developer communicates what the purpose of that test case is. Several tutorials and books describe this naming convention [6, 13, 12, 23]. As a result, we can expect widespread usage of these conventions. The resolution strategy consists of first identifying production and test files in the source code tree, then matching them using typical naming schemes. On the downside, this strategy does not create traceability links for (i) units under test with no test case name containing their name and (ii) test cases with a name that does not entail a known type. Moreover, this approach risks ruling out types that contribute to the unit under test other than the one that matches the naming convention: the resulting set of types for this strategy is always a singleton.
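To make this strategy concrete, the following minimal sketch (a hypothetical helper, not the authors' actual script) matches test class names against production class names using the "Test" prefix/suffix scheme described above:

```java
import java.util.*;

/** Minimal sketch of naming-convention matching (hypothetical helper). */
public class NamingConventionMatcher {

    /** Strip a leading or trailing "Test" from a test case name, if present. */
    static String candidateUutName(String testCaseName) {
        if (testCaseName.startsWith("Test")) {
            return testCaseName.substring("Test".length());
        }
        if (testCaseName.endsWith("Test")) {
            return testCaseName.substring(0, testCaseName.length() - "Test".length());
        }
        return null; // no convention-based candidate
    }

    /** Map each test case to the production class whose name matches, if any. */
    static Map<String, String> match(Collection<String> testCases,
                                     Set<String> productionClasses) {
        Map<String, String> links = new HashMap<>();
        for (String tc : testCases) {
            String candidate = candidateUutName(tc);
            if (candidate != null && productionClasses.contains(candidate)) {
                links.put(tc, candidate); // NC always yields a singleton UUT set
            }
        }
        return links;
    }

    public static void main(String[] args) {
        Set<String> production = new HashSet<>(Arrays.asList("Money", "MoneyBag"));
        List<String> tests = Arrays.asList("MoneyTest", "TestMoneyBag", "SmokeTest");
        System.out.println(match(tests, production)); // e.g. {TestMoneyBag=MoneyBag, MoneyTest=Money}
    }
}
```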
Fixture Element Types (FET). Similar to the naming convention, another way for developers to express the unit under test explicitly consists of declaring fixture elements as instance variables of the test case. These are exposed to all test commands in the test case and can be initialized in the setUp method. Making the unit under test explicit in the form of fixture elements is advocated in the literature as well [6]. To identify the UUT for a test case, we first identify the set of types of the fixture elements. We then apply a filtering operation to reduce this set, by selecting the type(s) associated with the most fixture elements. This strategy falls short in case no objects are declared as explicit fixture elements.

Static Call Graph (SCG). Thirdly, we hypothesize that we can derive the unit under test by inspecting method invocations in the test case. In contrast to name- or fixture-based techniques, which are merely indicators resulting from developers pursuing explicitness, a static call graph-based approach reveals references to production classes in the test case implementation. We are, however, not sure whether these classes are also tested. To identify the unit under test of a test case, we collect all production classes that are directly called by the test case, i.e., classes that are the destination of an outgoing method invocation. We select the set of production classes that is referenced most. In a test such as testSimpleAdd (sketched below), the Money class is referenced twice, by means of the constructor and the add operation. The drawback of this approach, therefore, is the potentially large set of helper and data object types that will be included in the unit under test set in case there is no dominantly called production class.
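The referenced example is not reproduced in this text; the following JUnit 3 sketch, modeled on the classic Money example (a reconstruction, not the original figure), illustrates what both FET and SCG pick up: the fixture fields declare Money as the dominant fixture type, and testSimpleAdd references Money twice, via the constructor and the add operation.

```java
import junit.framework.TestCase;

/** Minimal production class, after the classic JUnit Money example (reconstruction). */
class Money {
    private final int amount;
    private final String currency;

    Money(int amount, String currency) {
        this.amount = amount;
        this.currency = currency;
    }

    Money add(Money other) {
        return new Money(amount + other.amount, currency);
    }

    public boolean equals(Object o) {
        return o instanceof Money
                && ((Money) o).amount == amount
                && ((Money) o).currency.equals(currency);
    }

    public int hashCode() {
        return 31 * amount + currency.hashCode();
    }
}

/** Sketch of the corresponding JUnit 3 test case. */
public class MoneyTest extends TestCase {

    // Explicit fixture elements: FET selects Money, the type bound to the most fixture fields.
    private Money f12chf;
    private Money f14chf;

    protected void setUp() {
        f12chf = new Money(12, "CHF");
        f14chf = new Money(14, "CHF");
    }

    public void testSimpleAdd() {
        // SCG counts two direct references to Money: the constructor and add().
        Money expected = new Money(26, "CHF");
        assertEquals(expected, f12chf.add(f14chf));
    }
}
```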
Last Call Before Assert (LCBA). To address SCG's drawbacks, we propose to circumvent the problem of helper methods and data types by looking at what happens right before the assert statement. To compare the actual outcome with the expected outcome, we reason, the test case needs to call the unit under test to retrieve the actual status change. Again, we risk ending up with a large set of units under test in case of a test style where developers write many asserts per test command. Van Deursen et al. call this test writing style "Assertion Roulette" [26].
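A minimal sketch of the LCBA heuristic over a recorded call sequence is shown below. The trace representation (a flat list of events with a callee type and a flag marking assert calls) is an assumption for illustration; it is not the trace format produced by the tooling described in Section 4.

```java
import java.util.*;

/** Sketch of the Last Call Before Assert heuristic on an assumed flat call trace. */
public class LastCallBeforeAssert {

    /** One recorded invocation inside a test command (assumed representation). */
    record CallEvent(String calleeType, boolean isAssert) {}

    /**
     * Collect the destination types of the production calls that immediately
     * precede each assert; the most frequent ones form the candidate UUT set.
     */
    static Set<String> candidateUuts(List<CallEvent> trace, Set<String> productionTypes) {
        Map<String, Integer> votes = new HashMap<>();
        String lastProductionCallee = null;
        for (CallEvent event : trace) {
            if (event.isAssert()) {
                if (lastProductionCallee != null) {
                    votes.merge(lastProductionCallee, 1, Integer::sum);
                }
            } else if (productionTypes.contains(event.calleeType())) {
                lastProductionCallee = event.calleeType();
            }
        }
        int max = votes.values().stream().max(Integer::compare).orElse(0);
        Set<String> result = new HashSet<>();
        votes.forEach((type, count) -> { if (count == max && max > 0) result.add(type); });
        return result;
    }

    public static void main(String[] args) {
        List<CallEvent> trace = List.of(
                new CallEvent("Money", false),   // new Money(...)
                new CallEvent("Money", false),   // add(...)
                new CallEvent("Assert", true));  // assertEquals(...)
        System.out.println(candidateUuts(trace, Set.of("Money"))); // [Money]
    }
}
```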
Lexical Analysis (LA). A fifth resolution strategy relies neither on programmer discipline nor on particular call patterns. Instead, it relies on the vocabulary that developers use inside source code, i.e., the natural language used in type names, identifiers, strings, comments, etc. Our assumption is that a test case and the corresponding unit under test contain very similar vocabulary. To calculate the similarity between two files (one test and one production), we rely on Latent Semantic Indexing (LSI) as the information retrieval technique. The production file with the highest similarity to a test case, we hypothesize, is the unit under test. Note that there is a substantial amount of vocabulary in test cases that does not reappear in the unit under test, such as test, setUp, TestCase, expect, result, assert, etc.

Co-Evolution (Co-Ev). In the last resolution strategy, we start from the version control system (VCS) of a software system, such as CVS, SVN, Perforce or SourceSafe. Such a system captures changes that are made to the system throughout its history and keeps versions. Test cases and their corresponding unit under test, we reason, ought to change together over time, as a change to the unit under test requires some modifications to the test case as well. To identify the unit under test here, we look for the production file(s) that have been changed together with a test case most often according to the version control change log. With this approach, we risk wrongly identifying production files that change very frequently as the unit under test of a variety of test cases. Moreover, this approach requires that developers use the version control system in such a way that changes to production and test code are indeed committed at the same time. Also, unless developers practice testing during development, the co-evolution is not captured in the VCS.
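As an illustration, the following sketch counts, for one test file, how often each production file appears in the same revision, and keeps the most frequent one(s) as candidate UUT. The revision sets are assumed to have been extracted beforehand from the version control log (e.g., via the log commands mentioned in Section 4); the representation is hypothetical.

```java
import java.util.*;

/** Sketch of the Co-Evolution heuristic: count co-changes of production files with a test file. */
public class CoChangeCounter {

    /**
     * @param revisions       each entry is the set of file paths changed in one revision
     * @param testFile        the test case file under consideration
     * @param productionFiles the known production files
     * @return the production file(s) most frequently changed together with the test file
     */
    static Set<String> candidateUuts(List<Set<String>> revisions,
                                     String testFile,
                                     Set<String> productionFiles) {
        Map<String, Integer> coChanges = new HashMap<>();
        for (Set<String> changedFiles : revisions) {
            if (!changedFiles.contains(testFile)) continue;
            for (String file : changedFiles) {
                if (productionFiles.contains(file)) {
                    coChanges.merge(file, 1, Integer::sum);
                }
            }
        }
        int max = coChanges.values().stream().max(Integer::compare).orElse(0);
        Set<String> result = new HashSet<>();
        coChanges.forEach((file, n) -> { if (n == max && max > 0) result.add(file); });
        return result;
    }

    public static void main(String[] args) {
        List<Set<String>> revisions = List.of(
                Set.of("src/Money.java", "test/MoneyTest.java"),
                Set.of("src/Money.java", "src/MoneyBag.java", "test/MoneyTest.java"),
                Set.of("src/MoneyBag.java"));
        System.out.println(candidateUuts(revisions, "test/MoneyTest.java",
                Set.of("src/Money.java", "src/MoneyBag.java"))); // [src/Money.java]
    }
}
```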
4 Experimental Setup

In the experiment described in this section, we apply each of the proposed resolution strategies to a set of test cases of a system under study and compare the outcome against the result of an evaluation performed by a developer.

4.1 Human Test Oracle

To compare the accuracy of the proposed strategies, we compare the retrieved relations with a human oracle, i.e., we consider developers manually establishing traceability links as the objective baseline for this experiment. This task of establishing traceability links is part of the questionnaire shown in Table 1. In the main question of the survey (question 5), we ask developers to associate a list of test cases with a unit under test. This list of at most thirty test cases is randomly selected from the developer's system. To verify how realistic resolution strategies providing a singleton result are, we first force the participant to provide a single production class. Next, additional classes adding to the unit under test may be mentioned. Then, we offer the possibility to comment on noteworthy aspects of the considered test case.

1. Which testing conventions and guidelines exist in your project?
   • About the location and naming of unit tests?
   • About unit test implementation?
2. Which editor/IDE are you using to develop software? How do you make use of testing tools in that setting (testing framework, browsing, coverage, etc.)?
3. How are you involved in the (unit) testing of the system? How much experience do you have with unit testing and testing frameworks such as JUnit?
4. How do you perceive the context switches while practicing unit testing?
5. Please provide, for each test case in the given list, the corresponding unit under test (a production class).
   • First, give the production class the test case focuses on most (you are forced here to mention a single class!)
   • Secondly, list any other production class that is the target of this test (don't mention production classes needed to set up an environment to test the actual unit under test).
   • Thirdly, you may provide any additional comments (e.g., "this is really an integration test between classes x, y and z", "this is a stress test", "this test case is dead code", etc.).
6. How did you identify the unit under test for a given test case? E.g., naming conventions, location, domain knowledge, code inspection, use of asserts, etc.
7. How do you evaluate the difficulty of this exercise? How much time did it take?

Table 1. Developer Survey.
4.2 Evaluation Procedure

To quantify the result of each resolution strategy, we use the following three applicability and accuracy criteria:

• Applicability. We consider a resolution strategy applicable to a test case when the strategy returns a set of unit(s) under test that is not empty. This means that the applicability of a resolution strategy $i$ to a set of test cases $TC$ is computed as

$$\mathit{applicability} = \frac{|\{tc \in TC \mid \mathit{retrievedUUT}_{tc} \neq \emptyset\}|}{|TC|}.$$

The Naming Convention strategy, for example, requires that there exists a production file in the project whose name matches the test case. For test cases without a matching production name, the NC strategy does not apply.
• Accuracy. To measure accuracy, we calculate and compare precision and recall, two evaluation measures from information retrieval. These measures tell us, given a set of computed traceability links, how they compare to the links that humans would introduce. In this context, precision is defined as the fraction of the retrieved units under test that are relevant to the developers' information need, or

$$\mathit{precision} = \frac{|\mathit{relevantUUT} \cap \mathit{retrievedUUT}|}{|\mathit{retrievedUUT}|}.$$

Recall is defined as the fraction of the UUTs relevant to the test case that are successfully retrieved, or

$$\mathit{recall} = \frac{|\mathit{relevantUUT} \cap \mathit{retrievedUUT}|}{|\mathit{relevantUUT}|}.$$

In case a resolution strategy is not applicable to a test case, we cannot compute the precision due to the empty retrievedUUT set. Therefore, we only evaluate the accuracy of test cases a strategy applies to. For comparison reasons, we aggregate each of these measurements to the test suite level (a set of test cases). For applicability, we calculate the percentage of test cases a strategy applies to. For this same set of test cases, we calculate mean precision and mean recall.
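The measures above are straightforward set computations; the following sketch (a hypothetical helper using the definitions as given, with illustrative example data) computes applicability, mean precision and mean recall for one strategy over a set of test cases:

```java
import java.util.*;

/** Sketch: applicability, mean precision and mean recall for one resolution strategy. */
public class TraceabilityMetrics {

    /** Precision of one query: |relevant ∩ retrieved| / |retrieved|. */
    static double precision(Set<String> relevant, Set<String> retrieved) {
        Set<String> hit = new HashSet<>(retrieved);
        hit.retainAll(relevant);
        return (double) hit.size() / retrieved.size();
    }

    /** Recall of one query: |relevant ∩ retrieved| / |relevant|. */
    static double recall(Set<String> relevant, Set<String> retrieved) {
        Set<String> hit = new HashSet<>(retrieved);
        hit.retainAll(relevant);
        return (double) hit.size() / relevant.size();
    }

    /**
     * Aggregates to the test suite level: applicability over all test cases,
     * mean precision and mean recall over the applicable test cases only.
     */
    static void report(Map<String, Set<String>> relevantPerTc,
                       Map<String, Set<String>> retrievedPerTc) {
        int applicable = 0;
        double precisionSum = 0, recallSum = 0;
        for (String tc : relevantPerTc.keySet()) {
            Set<String> retrieved = retrievedPerTc.getOrDefault(tc, Set.of());
            if (retrieved.isEmpty()) continue;   // strategy not applicable to this test case
            applicable++;
            precisionSum += precision(relevantPerTc.get(tc), retrieved);
            recallSum += recall(relevantPerTc.get(tc), retrieved);
        }
        double applicability = (double) applicable / relevantPerTc.size();
        System.out.printf("applicability=%.2f meanPrecision=%.2f meanRecall=%.2f%n",
                applicability,
                applicable == 0 ? 0 : precisionSum / applicable,
                applicable == 0 ? 0 : recallSum / applicable);
    }

    public static void main(String[] args) {
        Map<String, Set<String>> oracle = Map.of(
                "MoneyTest", Set.of("Money"),
                "MoveTest", Set.of("Move"));
        Map<String, Set<String>> strategy = Map.of(
                "MoneyTest", Set.of("Money", "MoneyBag"),  // precision 0.5, recall 1.0
                "MoveTest", Set.<String>of());             // not applicable
        report(oracle, strategy);  // applicability=0.50 meanPrecision=0.50 meanRecall=1.00
    }
}
```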
4.3 Data Collection

The experimental data is collected as follows. We apply Naming Convention using a simple script that matches class names. It can be tailored to the actual naming convention applied within a project, as stated by the documentation or by the developers. For the FET and SCG resolution strategies, we build up a model from the source code using our fact extraction tool chain Fetch (http://www.lore.ua.ac.be/Research/Artefacts/). We make this model test-aware by refining model elements that we detect as being test concepts. We use the idioms and interaction patterns of JUnit 3 and 4 to identify test entities such as test cases, test commands, fixture elements, etc.

To match files using similar vocabulary, we first extract natural language from the source code (except for programming language keywords). The set of terms per file is post-processed with (i) camel case splitting (e.g., TestCase becomes Test and Case); (ii) lower case conversion; and (iii) Porter's stemming algorithm [24]. The Latent Semantic Indexing implementation of Hapax computes the similarity between a test case file and each of the production source files [19].

For the co-evolution based approach, we look at file co-changes throughout history as captured in the revision history of the version control system. The VCS's log functionality provides this information. We decided to use dynamic analysis for the LCBA resolution strategy, so as to better deal with (i) polymorphism; (ii) conditional logic in test cases; and (iii) abstractions such as separate verification mechanisms. We obtain a call trace by means of the method-level tracing functionality (the "-Xtrace" option) of the IBM JDK 5.0/6.0 virtual machines.
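A minimal sketch of the first two term post-processing steps (camel case splitting and lower case conversion) is shown below; stemming is omitted here, since the original relies on an existing Porter stemmer implementation.

```java
import java.util.*;

/** Sketch of term normalization: camel case splitting and lower case conversion. */
public class TermNormalizer {

    /** Split an identifier on lower-to-upper case transitions, e.g. TestCase -> [Test, Case]. */
    static List<String> splitCamelCase(String identifier) {
        return Arrays.asList(identifier.split("(?<=[a-z0-9])(?=[A-Z])"));
    }

    /** Normalize a list of raw identifiers into lower-cased terms. */
    static List<String> normalize(List<String> identifiers) {
        List<String> terms = new ArrayList<>();
        for (String id : identifiers) {
            for (String part : splitCamelCase(id)) {
                terms.add(part.toLowerCase());
            }
        }
        return terms;  // a Porter stemmer would be applied to these terms next
    }

    public static void main(String[] args) {
        System.out.println(normalize(List.of("TestCase", "setUp", "retrievedUUT")));
        // [test, case, set, up, retrieved, uut]
    }
}
```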
4.4 Sample Projects

In selecting sample projects we are hindered by the following limitations. First of all, we need access to the source code, to verify the presence of a considerable test suite and then to apply the automated resolution strategies. Moreover, the co-evolution resolution strategy requires access to the version control system as well. Finally, the implementations we made are currently targeted towards systems developed in Java.

We sent out the questionnaire to developers of ten open source Java software systems that we included because the system had been subject to experimentation by us or by fellow researchers. In addition, we explored SourceForge in search of systems with a considerable JUnit test suite. We successfully collected results for three systems: JPacman, ArgoUML and Mondrian.

The JPacman system is a teaching example at TU Delft used during a course about software testing. Its implementation is an example of best-practice Java, JUnit, design-by-contract, etc. It has been developed using a test-intensive XP-style process, featuring unit and integration tests achieving a high level of test coverage. The source repository consists of 21 production classes and 11 test cases, totalling 2.3 kSLOC. Secondly, we studied the 0.24 release of ArgoUML, an open source UML modeling tool. This release counts 1588 classes, 163 of which are test cases. In total, the system consists of 137 kSLOC. Thirdly, the Mondrian Online Analytical Processing (OLAP) server supports interactive analysis of large datasets stored in SQL databases. This system totals 153 kSLOC, consisting of 1262 classes (112 of which are test cases).

JPacman and ArgoUML use SVN as version control system, while Mondrian uses Perforce. Hence, to calculate the co-evolution approach we run "svn log" for the former two projects and "p4 filelog" for Mondrian, and then post-process the output using custom scripts. The comparison of the small teaching system against the two larger, long-living open source systems is worthwhile to identify the impact of best practices and conventions, including the potential degradation of adherence to them over time.
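The custom post-processing scripts are not included in the paper; the following sketch shows one possible way to group a Subversion change log into per-revision file sets, assuming the verbose log format ("svn log -v", whose "Changed paths:" blocks list the files touched per revision). The resulting sets can be fed to a co-change counter such as the one sketched in Section 3; the file name used is illustrative.

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;

/** Sketch: group "svn log -v" output into the set of files changed per revision. */
public class SvnLogGrouper {

    static List<Set<String>> revisions(List<String> logLines) {
        List<Set<String>> revisions = new ArrayList<>();
        Set<String> current = null;
        boolean inChangedPaths = false;
        for (String line : logLines) {
            if (line.startsWith("----")) {                  // revision separator
                if (current != null && !current.isEmpty()) revisions.add(current);
                current = new HashSet<>();
                inChangedPaths = false;
            } else if (line.startsWith("Changed paths:")) {
                inChangedPaths = true;
            } else if (inChangedPaths) {
                // changed-path lines look like "   M /trunk/src/Money.java"
                String trimmed = line.trim();
                if (trimmed.isEmpty()) {
                    inChangedPaths = false;                 // blank line ends the block
                } else if (current != null) {
                    int slash = trimmed.indexOf('/');
                    if (slash >= 0) current.add(trimmed.substring(slash));
                }
            }
        }
        if (current != null && !current.isEmpty()) revisions.add(current);
        return revisions;
    }

    public static void main(String[] args) throws IOException {
        // e.g. produced beforehand with: svn log -v > svn-log.txt   (path is illustrative)
        List<String> lines = Files.readAllLines(Paths.get("svn-log.txt"));
        System.out.println(revisions(lines).size() + " revisions with changed files");
    }
}
```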
5 Results

In this section, we first present the results of the questionnaire. Then, we report on the applicability and accuracy of the proposed resolution strategies. For each system under study, one developer completed the questionnaire. For JPacman, this was the single author of the system. For ArgoUML, the main maintainer participated. Lastly, the founder and project lead of Mondrian completed our questionnaire. The latter two developers evaluated a random set of 30 test cases, while the JPacman developer was limited to evaluating the full set of 11 test cases.

5.1 Manual Evaluation

Q1. All three developers reported guidelines on naming and locating test cases. In JPacman, for most model-layer production classes an accordingly-named test case is foreseen. The names of test cases in ArgoUML are prescribed to start with 'Test' or 'GUITest', depending on their use of the GUI layer. In Mondrian, tests should be suffixed with 'Test'. In all three systems, test cases have a particular functional focus, yet often require other parts of the system to be executable. In addition to the test affix, most test case names should bear the name of a production class as the unit under test, according to the three participants.

Q2. An IDE (Eclipse or IntelliJ) is used in all three projects at least some of the time. Apache Ant is commonly used for automated building and testing.

Q3. The test experience of the interviewed developers varies from having written all (JPacman) test cases, over a large part (Mondrian), to 'some' of the test cases (ArgoUML). As such, we consider these developers suitable to ask about the purpose of the test cases.

Q4. With this question, we intended to capture how developers perceive switching back and forth between the tasks (and their artefacts) of coding and testing. The participants, however, used this question to target context switches in general. Because of that, the answers don't add to the experiment.

Q5. In general, the participants did not seem to have much of a problem providing a single unit under test for each test case, except for (i) dependencies of the tests on particular lower layer subsystems; and (ii) two exceptions. Firstly, the answers to question 5.3 revealed how developers decided that for certain test cases it was not possible or advisable to determine a unit under test: i.e., test fixture classes as super classes of the actual test cases, test helper classes and integration tests for a whole (sub)system. Removing these from the list of presented test cases, the JPacman set was reduced from 11 to 9, ArgoUML's suite from 30 to 27 and Mondrian's from 30 to 23. Secondly, one test case in JPacman was targeted at the interaction between two production classes. As such, the developer could not possibly choose one above the other.

Q6. To identify the unit under test associated with a test case, the developers mention the use of naming conventions and knowledge about domain and design. One developer also mentions test case location and code inspection.

Q7. Completing the survey took 10 to 20 minutes.

5.2 Traceability Resolution Strategies

Table 2 shows that LA is maximally applicable for all test cases across projects. Furthermore, SCG, LCBA and Co-Ev appear to have a high applicability for JPacman and ArgoUML. The applicability of NC, FET and LCBA, however, varies strongly over the projects.
Applicability   NC     FET    SCG    LCBA   LA     Co-Ev
JPacman         78%    78%    89%    100%   100%   100%
ArgoUML         85%    33%    93%    89%    100%   74%
Mondrian        39%    13%    57%    35%    100%   100%

Precision       NC     FET    SCG    LCBA   LA     Co-Ev
JPacman         100%   86%    63%    53%    11%    28%
ArgoUML         100%   56%    16%    21%    3.7%   .081%
Mondrian        100%   17%    19%    65%    13%    18%

Recall          NC     FET    SCG    LCBA   LA     Co-Ev
JPacman         100%   100%   50%    100%   11%    44%
ArgoUML         100%   56%    16%    50%    3.7%   45%
Mondrian        100%   33%    23%    75%    13%    30%
Table 2. Applicability, mean precision and mean recall of the six resolution strategies applied to JPacman, ArgoUML and Mondrian.

The precision results per test case (see Figure 1) show a strongly bipolar data distribution: precision per test case is often either 0 (no relevant retrieved item) or 1 (all retrieved items are relevant). The fact that most relevantUUT sets are singletons explains this outcome. As a consequence, recall results are even more bipolar. Despite this data distribution, we rely on the notion of mean precision and mean recall.

Figure 1. Bipolar distribution of precision results over the categories 0, 1 and the interval 0 < x < 1. The percentage of test cases the strategies are not applicable to can be derived from Table 2.

From Table 2, we observe that Naming Convention achieves 100% precision and recall for all projects and is by far the most accurate strategy. The LA strategy scores very poorly in precision and recall, while the results for FET vary most across projects.

For comparison reasons, we calculate results in two additional ways. First, the aggregation formulas for
precision and recall by Antoniol et al. [3] (see Aggregate Pr./Re. in Table 3) are calculated as the division of the sums of the dividends and divisors of the individual precision or recall calculations (as opposed to the average of the divisions, as we calculated). Secondly, we calculate precision and recall for the subsets of test cases all strategies apply to. The general outcome remains the same, with the relative score between strategies being respected. The recall scores are typically lower due to the incorporation of all queries with an empty retrievedUUT set. The trends in the calculation on the subset of test cases all techniques apply to remain the same as well. However, these results are not relevant due to the small size of these subsets (1 to 5 test cases). This does indicate that combining strategies is a worthwhile approach to investigate.
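For clarity, the two aggregation schemes compared above can be written as follows (a restatement of the description; $n$ is the number of applicable test cases and $tc$ ranges over them — notation assumed):

$$\mathit{mean\ precision} = \frac{1}{n}\sum_{tc}\frac{|\mathit{relevantUUT}_{tc} \cap \mathit{retrievedUUT}_{tc}|}{|\mathit{retrievedUUT}_{tc}|}, \qquad \mathit{aggregate\ precision} = \frac{\sum_{tc}|\mathit{relevantUUT}_{tc} \cap \mathit{retrievedUUT}_{tc}|}{\sum_{tc}|\mathit{retrievedUUT}_{tc}|},$$

and analogously for recall.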
6 Threats to Validity

Regarding construct validity, we have not been able to collect manually established traceability links from more than one developer per project. This impacts the correctness of the manual evaluation due to the possibility of inexperience, lack of motivation or human error. We tried to contact more developers per system but did not get a response. We counter the experience argument by the selection of participants: two out of three are project initiators and leads, while the third is the lead maintainer. We obtained quick responses from the participants, even after requests for clarification, which leads us to believe that the participants were indeed properly motivated. One of the participants praised us for the thorough preparation of the questionnaire tailored to his project, in contrast to many other questionnaires that are sent to his and other open source projects. The ArgoUML developer has been involved in experiments of ours before and was willing to be involved again. He even replied a second time when we noticed that he had forgotten to respond to some questions. The presence of human errors remains an issue. For one system under study, the participant corrected one of his answers after we identified a reference to a non-existing class. We moreover do not know whether humans tend to agree about the links that need to be established. Earlier studies observed that humans do not necessarily agree about the presence of code smells [22, 28].

For external validity, we identified the following three topics. First, for each of the traceability resolution strategies we made certain configuration decisions. For the Co-Ev strategy, e.g., we select those classes that changed most frequently together with the test case file rather than selecting the X most frequent ones or a percentage. For the LA strategy, we did not apply a similarity threshold, resulting in 100% applicability yet potentially weak similarity between the test case and the unit under test. As a consequence, the validity of our conclusions preferring one strategy over another is limited to the particular configuration of these strategies in this work.

Secondly, with all developers stating that their project adheres to naming conventions, it may not be surprising that the NC strategy has a high accuracy. The fact that the three systems have naming conventions is coincidence rather than a deliberate choice; most Java/JUnit systems do adopt these naming conventions. All of the seven other projects that we prepared questionnaires for adhered to naming conventions to a certain extent. Bruntink and van Deursen remark that all five of their systems under study exhibit this property as well [8]. Moreover, we see how the applicability of the NC strategy is rather low for the larger and longer-living systems in this experiment. Interestingly, it seems that during software evolution, these conventions tend to disappear.

Thirdly, the external validity of the experiment beyond Java/JUnit projects is a concern. Most of the applied strategies are, however, environment neutral, such as Co-Ev, LA and NC, which makes them at least applicable in other contexts. SCG, FET and LCBA benefit from uniform test framework concepts such as test methods, fixture declarations and asserts. Most object-oriented xUnit implementations (JUnit, CppUnit, NUnit, PyUnit) adhere to the stated conventions. As for C, there are multiple unit testing frameworks yet none of them has become dominant. Framework concepts are often implemented using preprocessor constructs. Recognizing these constructs, e.g., at run time, may prove to be harder.
Aggregate Pr./Re.   NC          FET         SCG        LCBA        LA          Co-Ev
JPacman (9)         100%/70%    70%/70%     75%/60%    42%/100%    11%/10%     19%/60%
ArgoUML (27)        100%/82%    78%/25%     19%/18%    7.9%/43%    3.7%/3.6%   0.0041%/29%
Mondrian (23)       100%/39%    25%/4.3%    19%/13%    43%/26%     13%/12%     0.067%/28%

Maximum common subset of test cases
JPacman (5)         100%/100%   90%/100%    60%/50%    66%/100%    20%/20%     41%/60%
ArgoUML (5)         100%/100%   100%/100%   0%/0%      4.0%/20%    20%/20%     0%/0%
Mondrian (1)        100%/100%   0%/0%       0%/0%      20%/100%    0%/0%       0%/0%

Table 3. First series of results: aggregate precision and recall. Second series: average precision and recall for the subset of test cases all strategies apply to. The number in parentheses after each system under study represents the number of test cases in each set.
7 Interpretation

In this section, we reconsider the results in the context of each of the projects' characteristics. Furthermore, we look back at the extraction process as well.

Project-specific Factors. We learned that the applicability of Naming Convention and Fixture Element Types is sensitive to project conventions and developer discipline. Given such conventions, these strategies yield high precision and recall (JPacman and ArgoUML). Interestingly, some test cases in Mondrian carry a name starting with "BUG", followed by a number: a clear traceability link with the concerned bug report. Such a test case may be created to document the bug at a time when a more precise defect location is not yet known. A minority of Mondrian tests contains test case-level Javadoc links to the unit under test. We did not observe this in the two other projects. The Fixture Element Type strategy not only relies on the presence of an explicit fixture, it furthermore relies on specific typing of fixture elements. In the case of ArgoUML, fixture elements are frequently declared as of type java.lang.Object, the most generic super class of all classes in a Java system. From this static declaration, we cannot derive the actual, more specific type using FET's approach.

The experiment does not provide enough evidence to conclude that the Co-Evolution strategy works better for projects emphasizing unit testing in the development methodology. We can merely observe how precision varies largely between JPacman (28%) and ArgoUML (0.081%), despite the short version history (245 revisions) for JPacman versus more than 9 years of ArgoUML history (12043 revisions). We attribute this to the agile development style of the JPacman developer versus a more phased testing approach, as testified by the ArgoUML developer in earlier work [31]. For similar reasons, SCG works best for JPacman. The precision of the resolution strategies LCBA and Co-Ev is in many cases hindered by large retrievedUUT sets. Especially when the actual unit under test is typically not changed in the same revision as the test case, many other classes that have accidentally been changed together with that test case the same (small) number of times end up in the result set as well. In some cases, part of these accidental files can be removed when we have more information about the test suite. E.g., in the case of ArgoUML the developer indicated that all the UML test cases depend on the model subsystem. Unfortunately, removing references to model improves the precision of the Co-Ev strategy for ArgoUML by only 0.001%.

LCBA Applicability. During the experiment, we noticed that not all the resolution strategies are equally easy to apply. Especially the LCBA resolution strategy based on dynamic analysis required (i) some manual preparation; and (ii) a long-running analysis. As tests are often executed using build scripts, we had to reverse engineer the build system and inject tracing options at the right places. Secondly, method-level tracing is time consuming even when using filters (e.g., we did not log library calls). The resulting log file for ArgoUML counted 3 GB, for Mondrian even 143 GB (!). From an analysis point of view, build system or IDE support to run individual test cases could reduce some of these scalability issues. Moreover, collecting the test suite run-time trace of Mondrian took several iterations, as a number of tests do not terminate when the Java virtual machine's trace option is switched on. We identify this as an example of the Hawthorne effect, i.e., the change in behavior of an object of study during observation. Andrews notes, in a context of testing using log file analysis, that logging
may provoke subtle errors as well as largely affect a program's efficiency [2]. In this case, the tracing may either evoke timing issues in the interaction between the test case and the database used during testing, or the efficiency may have been reduced so drastically that the program does not end in reasonable time.
8 Combining Resolution Strategies

In this section we verify to what extent we can improve the applicability and accuracy of traceability resolution by combining strategies. Resolution strategies are used sequentially, by (i) applying a strategy to the test cases the previous strategy was not applicable to; and (ii) replacing the result of a previous strategy for a test case with that of the additional strategy when it improves the accuracy. We look for combinations of strategies that have a higher applicability and/or accuracy than single strategies. The application sequence is not important for this setup. Table 4 depicts the most interesting strategies or combinations of strategies. We observe that combining strategies indeed improves the applicability, yet trade-offs must be made between higher precision and recall on one side and higher applicability on the other. We observe that the resolution strategy NC is always involved. Often, FET and LCBA also play a role, while LA does not occur at all. Figure 2, annotated with the trade-off points, presents the results graphically.
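A minimal sketch of this sequential combination scheme is shown below (hypothetical interfaces and toy strategies). It implements rule (i) only; rule (ii), replacing an earlier result when a later strategy is more accurate, additionally requires the developer oracle and is therefore omitted here.

```java
import java.util.*;
import java.util.function.Function;

/** Sketch of sequentially combining resolution strategies (hypothetical interfaces). */
public class CombinedStrategy {

    /** A resolution strategy maps a test case to a (possibly empty) set of candidate UUTs. */
    interface Strategy extends Function<String, Set<String>> {}

    /** Rule (i): apply each next strategy only to test cases the earlier ones did not resolve. */
    static Map<String, Set<String>> resolve(List<String> testCases, List<Strategy> chain) {
        Map<String, Set<String>> links = new HashMap<>();
        for (String tc : testCases) {
            for (Strategy strategy : chain) {
                Set<String> uuts = strategy.apply(tc);
                if (!uuts.isEmpty()) {          // first applicable strategy wins
                    links.put(tc, uuts);
                    break;
                }
            }
        }
        return links;
    }

    public static void main(String[] args) {
        // Toy strategies standing in for NC and LCBA.
        Strategy nc = tc -> tc.endsWith("Test")
                ? Set.of(tc.substring(0, tc.length() - 4)) : Set.of();
        Strategy lcba = tc -> Set.of("Money");  // pretend the trace always points at Money
        System.out.println(resolve(List.of("MoveTest", "Bug1234"), List.of(nc, lcba)));
        // e.g. {MoveTest=[Move], Bug1234=[Money]}
    }
}
```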
Project     Point   Strategy                  App.    Pre.    Rec.
Pacman      P1      NC                        78%     100%    100%
            P2      NC+FET/SCG, SCG+FET       100%    94%     100%
ArgoUML     A1      NC                        85%     100%    100%
            A2      NC+FET                    89%     96%     96%
            A3      NC+Co-Ev                  96%     89%     92%
            A4      NC+LCBA                   100%    87%     93%
            A5      NC+LCBA+Co-Ev             100%    87%     96%
Mondrian    M1      NC                        39%     100%    100%
            M2      NC+FET/LCBA               48%     86%     91%
            M3      NC+FET+LCBA               57%     77%     85%
            M4      NC+FET+SCG+LCBA           78%     56%     61%
            M5      NC+FET+Co-Ev              100%    49%     57%

Table 4. Applicability (App.), Mean Precision (Pre.) and Mean Recall (Rec.) of optimal strategy combinations.
Figure 2. Combining strategies to optimize applicability, mean precision and mean recall: (a) precision, (b) recall.
9 Conclusion

In this paper, we verified the applicability and accuracy of a set of automated resolution strategies to establish traceability links between JUnit test cases and production classes as units under test. Based on an experiment involving the developers of three software projects, we cannot point out a single strategy that has a high applicability, precision and recall. We do observe, on the one hand, how a strategy based on test case naming conventions results in high precision and recall, yet depends upon project guidelines and developer discipline. On the other hand, strategies such as Last Call Before Assert, Lexical Analysis or Co-Evolution have a high applicability, but score poorer in accuracy. As a result, combining strategies relying on developer conventions (Naming Convention and Fixture Element Type) with high-applicability approaches (mainly Last Call Before Assert) provides the best overall result. Experimentation showed the complementarity of five out of the six resolution strategies proposed in this work.
This outcome suggests that we can establish such traceability links in an automated manner with acceptable applicability and accuracy, yet we have to (i) invest in tools that combine resolution strategies; and (ii) configure such tools based on knowledge about coding conventions, development methodology and test suite design for the project we apply the technique to.
References

[1] ANSI/IEEE 1008-1987 standard for software unit testing. Technical report, ANSI/IEEE, 1986.
[2] J. Andrews. Testing using log file analysis: tools, methods and issues. In Proceedings of the 13th International Conference on Automated Software Engineering (ASE98), pages 157–166, 1998.
[3] G. Antoniol, G. Canfora, G. Casazza, A. De Lucia, and E. Merlo. Recovering traceability links between code and documentation. IEEE Transactions on Software Engineering, 28(10):970–983, October 2002.
[4] K. Beck. Simple smalltalk testing: With patterns. The Smalltalk Report, 4(2):16–18, 1994.
[5] K. Beck. Test-Driven Development: By Example. Addison-Wesley, 2003.
[6] K. Beck and E. Gamma. Test infected: Programmers love writing tests. Java Report, 7(3):51–56, 1998.
[7] P. Bouillon, J. Krinke, N. Meyer, and F. Steimann. EzUnit: A framework for associating failed unit tests with potential programming errors. In Proceedings of the 8th International Conference on Agile Processes in Software Engineering and eXtreme Programming (XP), Springer LNCS 4536, pages 101–104, 2007.
[8] M. Bruntink and A. van Deursen. An empirical study into class testability. Journal of Systems and Software, 79(9):1219–1232, 2006.
[9] B. Cornelissen, A. van Deursen, L. Moonen, and A. Zaidman. Visualizing testsuites to aid in software understanding. In Proceedings of the 11th European Conference on Software Maintenance and Reengineering (CSMR), pages 213–222. IEEE, 2007.
[10] A. De Lucia, F. Fasano, R. Oliveto, and G. Tortora. Recovering traceability links in software artifact management systems using information retrieval methods. ACM Transactions on Software Engineering and Methodology, 16(4), September 2007.
[11] S. Demeyer, S. Ducasse, and O. Nierstrasz. Object-Oriented Reengineering Patterns. Morgan Kaufmann, 2003.
[12] M. Feathers. Working Effectively with Legacy Code. Prentice Hall, 2005.
[13] M. Fewster and D. Graham. Software Test Automation: Effective Use of Test Execution Tools. 1999.
[14] M. Gaelli, M. Lanza, and O. Nierstrasz. Towards a taxonomy of SUnit tests. In Proceedings of the 13th International Smalltalk Conference (ISC'05), September 2005.
[15] M. Gaelli, R. Wampfler, and O. Nierstrasz. Composing tests from examples. Journal of Object Technology, 6(9):71–86, October 2007.
[16] H. Gall, K. Hajek, and M. Jazayeri. Detection of logical coupling based on product release history. In Proceedings of the International Conference on Software Maintenance (ICSM), pages 190–197, Washington, DC, USA, 1998. IEEE Computer Society.
[17] P. Hamill. Unit Test Frameworks. O'Reilly, 2004.
[18] T. Kanstrén. Towards a deeper understanding of test coverage. Journal of Software Maintenance and Evolution: Research and Practice, 20(1):59–76, 2008.
[19] A. Kuhn, S. Ducasse, and T. Gîrba. Semantic clustering: Identifying topics in source code. Information and Software Technology, 49(3):230–243, March 2007.
[20] M. Lormans, A. van Deursen, and H.-G. Groß. An industrial case study in reconstructing requirements views. Empirical Software Engineering, 13(6):727–760, 2008.
[21] Z. Lubsen. Studying co-evolution of production and test code using association rule mining. Master's thesis, TU Delft, the Netherlands, 2008.
[22] M. Mäntylä, J. Vanhanen, and C. Lassenius. Bad smells – humans as code critics. In Proceedings of the 20th IEEE International Conference on Software Maintenance (ICSM'04), pages 399–408, 2004.
[23] G. Meszaros. xUnit Test Patterns: Refactoring Test Code. Addison-Wesley, 2007.
[24] M. Porter. An algorithm for suffix stripping. Program, 14(3):130–137, 1980.
[25] H. Sneed. Reverse engineering of test cases for selective regression testing. In Proceedings of the European Conference on Software Maintenance and Reengineering, pages 69–74, Los Alamitos, CA, USA, 2004. IEEE Computer Society.
[26] A. van Deursen, L. Moonen, A. van den Bergh, and G. Kok. Refactoring test code. In M. Marchesi and G. Succi, editors, Proceedings of the 2nd International Conference on Extreme Programming and Flexible Processes in Software Engineering (XP2001), 2001.
[27] J. Van Geet and A. Zaidman. A lightweight approach to determining the adequacy of tests as documentation. In Proceedings of the 2nd Workshop on Program Comprehension through Dynamic Analysis, pages 21–26, 2006.
[28] B. Van Rompaey, B. Du Bois, S. Demeyer, and M. Rieger. On the detection of test smells: A metrics-based approach for general fixture and eager test. IEEE Transactions on Software Engineering, 33(12), 2007.
[29] N. Wilde, J. Gomez, T. Gust, and D. Strasburg. Locating user functionality in old code. In Proceedings of the International Conference on Software Maintenance, pages 200–205, 1992.
[30] A. Ying, G. Murphy, R. Ng, and M. Chu-Carroll. Predicting source code changes by mining change history. IEEE Transactions on Software Engineering, 30(9):574–586, 2004.
[31] A. Zaidman, B. Van Rompaey, S. Demeyer, and A. van Deursen. Mining software repositories to study co-evolution of production and test code. In Proceedings of the 1st IEEE International Conference on Software Testing, Verification and Validation (ICST 2008), pages 220–229, 2008.
[32] T. Zimmerman, P. Weissgerber, S. Diehl, and A. Zeller. Mining version histories to guide software changes. IEEE Transactions on Software Engineering, 31(6):429–445, 2005.