Empir Software Eng (2009) 14:513–539 DOI 10.1007/s10664-008-9091-7
Developing search strategies for detecting relevant experiments
Oscar Dieste & Anna Grimán & Natalia Juristo
Published online: 12 November 2008
© Springer Science + Business Media, LLC 2008
Editor: Per Runeson
Abstract Our goal is to analyze the optimality of search strategies for use in systematic reviews of software engineering experiments. Study retrieval is an important problem in any evidence-based discipline, but this question has not yet been examined for evidence-based software engineering. We have run several searches exercising different terms denoting experiments to evaluate their recall and precision. Based on our evaluation, we propose using a high recall strategy when there are plenty of resources or the results need to be exhaustive. For any other case, we propose optimal, or even acceptable, search strategies. As a secondary goal, we have analysed trends and weaknesses in the terminology used in articles reporting software engineering experiments. We have found that it is impossible for a search strategy to retrieve 100% of the experiments of interest (as happens in other experimental disciplines) because of the shortage of reporting standards in the community.

Keywords Evidence-based software engineering · Systematic review · Controlled experiment
O. Dieste · N. Juristo
Facultad de Informática, Universidad Politécnica de Madrid, Campus de Montegancedo, 28660 Boadilla del Monte, Spain
O. Dieste e-mail: [email protected]
N. Juristo e-mail: [email protected]
A. Grimán (*)
Universidad Simón Bolívar, Valle de Sartenejas, Baruta, Apartado Postal 89000, Caracas 1080A, Venezuela
e-mail: [email protected]
1 Introduction

Evidence-based practice originated in the field of medicine with what is known as evidence-based medicine (EBM) (EBM working group 1992). This evidence-based practice involves identifying some clinical aspect of interest and then generating reliable evidence on which to base recommendations for medical practice. The inputs for generating such evidence are the results of controlled experiments, which are combined by statistical meta-analysis techniques. To assure the reliability of the evidence, the whole process, from the identification of experiments to the generation of evidence, must be systematic in order to rule out bias and guarantee process repeatability. In EBM this whole systematic process is known as systematic review (SR) (Higgins and Green 2006).

The concept of evidence-based software engineering (EBSE) emerged through an analogy with EBM (Kitchenham et al. 2004; Dybå et al. 2005). EBSE involves the performance of SRs to combine empirical results related to a software technology or process. SRs identify relevant empirical studies on the basis of a search strategy. Defining a suitable search strategy for detecting relevant empirical studies involves several decisions: (1) selecting the appropriate sources of information (i.e. bibliographic databases or digital libraries), (2) selecting the fields of the article in which to search for the terms, (3) defining the right search string to identify empirical studies of interest, and (4) running the search.

Because of the rapid growth of publications, identifying and retrieving empirical studies of interest is becoming more and more difficult. Finding search strategies that retrieve as many relevant experiments as possible is critical for the success of a SR. This problem has been dealt with in EBM by stressing the importance of taking a systematic approach to study retrieval as a way of gathering unbiased information and of achieving cost-effectiveness (Petitti 2000). Leaving relevant results out of a SR could lead to the generation of inaccurate evidence (Lajeunesse and Forbes 2003). This is especially true if the universe of detected empirical studies is small, which is often the case in software engineering (SE).

We have studied different strategies by searching in different databases and analysing the papers retrieved. To evaluate the strategies we compared the experiments retrieved by every search with a manual and exhaustive review. By evaluating the results of different strategies we have been able to derive useful recommendations on searching to perform SRs.

Section 2 presents definitions related to search and search optimality necessary to understand the evaluation of search strategies. Section 3 discusses the need for a gold standard in order to evaluate search strategies. Section 4 describes the search universe used in our searches. Section 5 discusses the selection of the search fields. Section 6 analyses the different search strategies. In Section 7 we validate the results by searching in different bibliographic databases. Section 8 compiles the search recommendations that can be derived from the evaluation of strategies that we have performed. Section 9 discusses the influence that establishing a proper search universe has on the search results. Section 10 discusses trends and weaknesses in the standardization of terms when reporting SE experiments. Section 11 generalizes our results to other types of studies. Finally, Section 12 presents our conclusions.
2 Optimality of a Search Strategy

As the few reviews to be found in SE have repeatedly made clear, the results of identifying primary studies for SR are unsatisfactory. In each of these cases, we find that huge numbers
of articles were retrieved. After a sizeable selection/rejection effort, only a few articles then turned out to be relevant for the study. For example, (Davis et al. 2006) searched bibliographic databases and retrieved over 4,000 articles on elicitation techniques, of which only 74 were considered relevant and just 26 were included in the review. Both (Jørgensen 2004) and (Mendes 2005) also reported discouraging results with respect to the search for SE experiments. Searches for experiments on software development effort estimation in (Jørgensen 2004) retrieved 100 articles, of which only 17 were relevant for the SR. In (Mendes 2005), 343 articles were retrieved in a web engineering search, of which under 50% were used. Figures like these are evidence of the need to improve SE searches to retrieve as many relevant articles as possible at the least cost to reviewers.

Other SRs fail to specify what databases were searched or how many articles were found, how they were found, etc. They kick off with the final set of relevant articles from which the evidence is to be gathered. This is the case, for example, of the SRs reported in (Juristo et al. 2004; Sjøberg et al. 2005; Dybå et al. 2006, 2007a, b; Hannay et al. 2007; Jørgensen and Shepperd 2007; Kampenes et al. 2007; Kitchenham et al. 2007). The paper that we present here aims to help researchers realize how important it is to specify the bibliographic search conducted for a SR, as the search (and its quality) has a decisive impact on the quality of the SR results.

The concept of optimal search is based on two search properties: recall and precision. Recall and precision can be used to evaluate search results (van Rijsbergen 1979). Notice that, in the context of our research, the relevant material is the set of articles reporting SE experiments. Accordingly, recall and precision are ratios that relate the relevant material retrieved by the search, the relevant material not retrieved by the search, and the irrelevant material retrieved by the search. Recall refers to the proportion of the whole set of relevant material that the search retrieves; precision refers to the proportion of the retrieved material that is relevant. The diagram in Fig. 1 shows the relationship between search universe, recall and precision. D is the search universe; (A+B) represents the search results; A is the set of relevant articles retrieved by the search; B is the set of irrelevant articles retrieved by the search; C represents the set of relevant articles not detected by the search. Recall and precision can be formulated as ratios between A, B and C: Recall = A/(A+C); Precision = A/(A+B).

The smaller C (relevant articles that the search missed) is, the higher the recall score is. If C is zero (all relevant articles have been retrieved), search recall is 100%. Therefore, a search with low recall detects very few relevant articles. A SR conducted on the basis of such a search will miss too many relevant experiments and generate inaccurate evidence. The smaller B (irrelevant articles gathered in the search) is, the greater precision is. When B is zero (no irrelevant material is detected), search precision is 100%. A search with low precision will lead to a lot of irrelevant articles being retrieved. A SR conducted on the basis of such a search requires a huge manual review effort to identify and reject the irrelevant articles in the retrieved material.

Fig. 1 Search results vs. relevant material
When search recall improves (a lot of relevant material is retrieved), precision tends to decrease (a lot of irrelevant material is also retrieved). That is, a strategy that increases recall, typically by adding more terms to the search string, also increases the probability of retrieving irrelevant material that incidentally contains such terms. For example, experiment should be a search string with lower recall than experiment OR empirical study, as the latter string locates articles that contain the term experiment plus those that do not contain the term experiment but do contain empirical study. However, the likelihood of retrieving irrelevant studies that incidentally contain either of the terms experiment or empirical study grows (and, therefore, precision drops) as more terms are added to the search string.

Deciding whether a search result is good enough is a relative issue. A search result is acceptable or not depending on the results of other searches. In other words, the evaluation of a search's results depends on the best possible results that can be obtained. There are no absolute optimum values for recall and precision; the best attainable values depend mainly on the search topic. Table 1 shows a categorization of search strategies based on their recall and precision results. An optimal search can be defined as a search that strikes a balance between high recall and high precision (Petitti 2000). Sometimes, depending on the goals of the SR and the available resources, we might want to use non-optimal searches. For instance, a high recall search strategy maximizes the quantity of relevant articles retrieved, although it retrieves many irrelevant articles in return. On the other hand, a high precision strategy minimizes the number of retrieved articles with the drawback of missing a good number of relevant articles. But, generally, the goal will be to get an optimal, or at least an acceptable, search strategy. Therefore, a trade-off is necessary between recall and precision.

Table 1 Search strategy categorization

Strategy type    Goal
High recall      Maximum recall despite poor precision
High precision   Maximum precision despite poor recall
Optimum          Maximize both recall and precision
Acceptable       Good enough recall and precision

Recall and precision take into account the number of relevant articles retrieved, the number of irrelevant articles retrieved and the number of relevant articles missed, but they do not account for the number of irrelevant articles missed (value D in Fig. 1). However, this value has a considerable effect on searches over larger search universes. The parameter associated with value D is termed fall-out and, as used throughout this paper, is calculated as D/(B+D). Fall-out measures the proportion of irrelevant articles not retrieved or, in other words, how effective the strategy is at rejecting irrelevant articles. Like the recall and precision parameters, fall-out should be as high as possible.
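To make the three properties concrete, here is a minimal Python sketch (our own illustration, not a tool used in the study) that computes recall, precision and fall-out as the ratios are used in this paper, interpreting D as the irrelevant articles that the search does not retrieve. The example counts are the SRCH1 figures reported later (352 articles retrieved, 69 of them among the 90 gold standard articles, in a universe of roughly 5,453 articles).

```python
def search_properties(retrieved_relevant, retrieved_irrelevant,
                      missed_relevant, missed_irrelevant):
    """Search properties as used in this paper (A, B, C, D refer to Fig. 1)."""
    a, b, c, d = (retrieved_relevant, retrieved_irrelevant,
                  missed_relevant, missed_irrelevant)
    recall = a / (a + c)       # share of all relevant articles that were found
    precision = a / (a + b)    # share of retrieved articles that are relevant
    fallout = d / (b + d)      # share of irrelevant articles that were rejected
    return recall, precision, fallout

# Illustration with the SRCH1 figures from Section 5:
A = 69                  # relevant articles retrieved
B = 352 - 69            # irrelevant articles retrieved
C = 90 - 69             # relevant articles missed
D = 5453 - (A + B + C)  # irrelevant articles not retrieved
print(search_properties(A, B, C, D))  # approx. (0.766, 0.196, 0.947)
```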
3 Establishing a Gold Standard

To be able to calculate a search strategy's properties, we need to know how much relevant material (A+C in Fig. 1) there is in a set of publications (D in Fig. 1). Unfortunately, it is impossible to find out the value of these parameters beforehand. So a gold standard is necessary to calculate a search's recall and precision.
We have used the set of 103 articles analysed in (Sjøberg et al. 2005) as our gold standard (the complete catalog of these articles is published in (Kampenes et al. 2007)). Sjøberg et al. manually selected the 103 articles from a universe of 5,453 articles printed in the 12 publications shown in Table 2. The 5,453 articles in the universe were all published from 1993 to 2002 (inclusive). The set of 103 articles can be used as a gold standard because the relevant material was manually identified from the 5,453 articles: "one researcher systematically read the titles and abstracts; when it was unclear from the title or abstract whether the article described a controlled experiment, the same researcher and another person read the entire article" (Sjøberg et al. 2005).

Table 2 Publications used in the gold standard

Journals                                                                  Number of articles in the gold standard
Journal of Systems and Software (JSS)                                     24
Empirical Software Engineering (EMSE)                                     22
Transactions on Software Engineering (TSE)                                17
Information and Software Technology (I&ST)                                8
IEEE Software                                                             4
Software Maintenance and Evolution (SM&E)                                 2
ACM Transactions on Software Engineering Methodology (TOSEM)              1
Software Practice and Experience (SP&E)                                   0
IEEE Computer                                                             0

Conferences                                                               Number of articles in the gold standard
International Conference on Software Engineering (ICSE)                   12
IEEE International Symposium on Software Metrics (METRICS)                10
IEEE International Symposium on Empirical Software Engineering (ISESE)    3

The definition of SE experiment used in (Sjøberg et al. 2005) considers only studies where the experimental units were individuals or teams conducting one or more development tasks. Adopting this survey as our gold standard means using this definition of relevant material. So, we are searching for articles that report this type of experiment. This means that the findings of our study are applicable to the search for controlled experiments. To generalize our findings, the research needs to be extended to other types of empirical studies. However, we will try to extend the discussion to other types of studies wherever feasible.

Running a SR to find empirical evidence about a particular piece of SE knowledge involves running a search intersecting two aspects: (a) the type of empirical study we are looking for (any or specific), e.g. we focus on controlled experiments in this case; and (b) the item of knowledge that we want to find evidence about, e.g. the efficiency of different testing techniques, the effectiveness of two specific requirements elicitation techniques or the effectiveness of inspection meetings. The research reported here focuses on part (a) of the search. There is also a need for research into part (b), which we intend to undertake at a later stage.
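The last paragraph describes a SR search as the intersection of a study-type facet and a knowledge-item facet. The sketch below shows one way such a query string could be assembled; the helper function and the topic terms are illustrative assumptions, not part of the original study, and the resulting string would still need to be adapted to the syntax of the chosen bibliographic database.

```python
def build_query(study_type_terms, topic_terms):
    """Combine the two facets of a SR search: (a) study type and (b) knowledge item.

    Each facet is OR-ed internally; the two facets are AND-ed together.
    """
    study_facet = " OR ".join(f'"{t}"' for t in study_type_terms)
    topic_facet = " OR ".join(f'"{t}"' for t in topic_terms)
    return f"({study_facet}) AND ({topic_facet})"

# Facet (a): terms denoting controlled experiments (the facet analysed in this paper).
study_type = ["experiment", "experimental study", "experimental comparison"]
# Facet (b): an illustrative knowledge item (hypothetical, not taken from the paper).
topic = ["testing technique", "test case selection"]

print(build_query(study_type, topic))
```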
4 Establishing a Search Universe

Having selected the publications and the time period for exploration, the next step is to choose the search universe. We have analysed the different document databases hosting the gold standard publications. For the set of publications that we intended to search, we had a choice between SCOPUS™, IEEEXplore®, ACM Digital Library, SpringerLink, ISI Web of Knowledge and ScienceDirect®. We also considered the option of searching each publication directly, but this option calls for a bigger effort at integrating the search results. Therefore, systematic reviewers are more likely to use bibliographic databases. We ran a pilot study to select one of the bibliographic databases.

Other authors (Brereton et al. 2007; Bailey et al. 2007; Dybå et al. 2007b) have identified some of the limitations of bibliographic databases as regards the identification of SE studies. Although (Brereton et al. 2007) do not set out to analyse the limitations of bibliographic databases, they discuss, as a spin-off of their SR, a number of lessons learned, some of which imply database limitations. Specifically, they pinpoint problems related to the organizational model and the use of Boolean expressions in the databases. The problems identified in (Brereton et al. 2007) do not match the ones we detected because their analysis pursued a different goal.
Note that their experiences with some digital libraries (IEEEXplore, ACM Digital Library and ScienceDirect) and other indexing services (Google Scholar, Citeseer library, Keele University's electronic library, Inspec and EI Compendex) led them to conclude that "current software engineering search engines are not designed to support systematic literature reviews".

On the other hand, the objective in (Bailey et al. 2007) was to analyse the overlap between database results for SE studies. The authors focused their study on analysing a range of bibliographic databases and indexing services (IEEEXplore, ACM Digital Library, ScienceDirect, Google Scholar, Citeseer and Web of Science) to determine what overlap there was between them in different fields of SE and the impact of such an overlap on SRs. As a spin-off, the authors identified some search limitations, such as inconsistent user interfaces, limitations placed on the number of results displayed and the number of times some search engines were either unavailable because they were busy or crashed. Again, some of these limitations are complementary to ours, whereas others are the same.

Dybå et al. (2007b) also addressed several of these problems when performing their systematic review. In their work eight electronic databases were searched (ACM Digital Library, IEEEXplore, Compendex, ISI Web of Science, Kluwer Online, ScienceDirect, SpringerLink and Wiley InterScience Journal Finder), which revealed a lack of standardization as well as overlap among the databases. As in our case, they experienced strong limitations with ACM Digital Library in using Boolean expressions and confining searches to specific fields (title, abstract and keywords).

Our study also found that there were significant differences among databases that could affect search properties. The two key limitations that we found were:
- Limited bibliographic resources. Some databases are confined to just the publications of one publisher or a limited group of publishers. They do not cover a broad spectrum of publications. This applies to IEEEXplore® and ACM Digital Library (which index a small group of journals and conference proceedings edited by IEEE and ACM, respectively), but especially to SpringerLink and ScienceDirect®, which index an even smaller group of journals published by Springer and Elsevier, respectively. This is an impediment for SR searches because, to achieve acceptable coverage, each search has to be applied over again on different databases for later combination.
- Incomplete article abstracts or full texts. The abstract or the full text is not always available in the bibliographic databases. This is an important weakness, as the full text of the articles is necessary for a SR. IEEEXplore and ACM Digital Library especially were both found to have this problem.
Other issues that may have an impact on the result of searches are confined to just one bibliographic database. We review them organized by database:

- IEEEXplore has the following limitations:
  - In Basic Search mode:
    - It searches all fields to retrieve articles. As we will see later, it is important to be able to confine the search to titles and abstracts to achieve satisfactory precision.
    - It does not search both the singular and plural of the specified terms. This means that the plural form of each term has to be entered separately.
  - In Advanced Search mode:
    - The maximum number of hits for a search is confined to 500 documents. This might restrict the number of relevant articles obtained from the search universe.
    - The results are only partially exportable, i.e. you can select the results on the displayed page but not all the search results. This makes the operation of exporting to text files or a Reference Manager (a useful SR utility) very inefficient.
- ACM Digital Library, and especially its Advanced Search modality, has the following weaknesses:
  - Searches are not user definable. You can only search pre-established fields (title, abstract, review, all information). Therefore, you cannot run a search in the mode that we have found to be the most interesting (title and abstract). This is a serious drawback taking into account that searching just the title or just the abstract will detract from search recall, whereas precision will fall if all fields are searched.
  - The number of results is confined to 200 documents per search. This restricts the SR, as it is a small quantity compared to the search universe, which can amount to thousands of articles.
  - The search algorithm fails when trying to search alternative values (values linked by an OR clause) for one field. In these cases, the search results are unreliable and articles that do not comply with the specified search condition are retrieved. This adds to the review effort.
  - The results of a search cannot be exported to either text or Reference Manager format. This is a significant impediment in a SR.
  - It retrieves all the terms that have the same root as the searched term, e.g. a search with the term experiment will retrieve articles containing the terms experimental, experimentation, experimentally, etc. This search mode has a high recall; however, it eats into precision. Therefore, it is preferable to leave it to the researcher to decide whether to search by root or by word.
- ISI Web of Knowledge is widely used in a range of disciplines and, in particular, is a gateway to MEDLINE® and its powerful search engine for identifying clinical studies. The ISI Web of Knowledge has a number of limitations for searching SE experiments. We feel that many of these limitations are a consequence of ISI's beginnings and mission. ISI (Institute for Scientific Information) was selling its citation index and other products for gauging scientific productivity long before the Web came into existence. In this respect, the Web support, although it has been extended, still focuses on authors, journals, years and other fields that have nothing to do with content searches. This means that:
  - The term cannot be searched in the abstract or keywords. This is a significant limiting factor for our study and for any SR, as we will see later.
  - Although it can be used to search more than one database, it does not provide enough help about the coverage of each one or the overlap between them. This makes it difficult to decide which one to choose.
  - The command semantics in the advanced search is not very straightforward, and the system does not provide enough help about commands.
  - There is no consistency in the search fields across different databases, e.g. you can search the Web Citation Index database by the full text, topic, author, title and year published fields, whereas if you select the Web of Science database the fields you have to choose from (topic, author, group of author, publication name, year published, address, language and document type) are not the same. Additionally, if you decide to search all the available databases at the same time, the fields that you can use are confined to: topic, author, publication name, and year published.
  - Often you cannot follow a link to the full text of the article directly from the interface; the link is provided via another application. This means that if you need to read the full text of the article, it takes longer to retrieve.
Taking into account the limitations that we have detected, we recommend that the following criteria be considered when selecting a bibliographic database. We have found that they have a significant impact on the result of SE experiment searches:

- Coverage. There should be a sizeable number of indexed outlets and a big overlap between the selected database and other well-known databases in the area. This applies to databases containing ACM, IEEE, Springer and Elsevier publications, among others. Additionally, take into account the number and type of areas covered by the indexed publications, giving priority to databases that extend to different areas akin to SE.
- Content update. The publications indexed in the database should be regularly updated. This will result in a great number of outlets across an equally wide-ranging number of years.
- Text completeness. The full text, and not only the abstract, should be available.
- Search quality. Take into account how easy it is to build the searches through fields and commands.
- Quality of the results. The precision of the results is a critical concern. Unfortunately, not all the digital libraries run the searches properly. This results in absolute anarchy, because reviewers assume that the results are reliable, and in an ordinary review (that is, not a study like this) reviewers are unlikely to notice the errors (which we actually chanced upon).
- Versatility of results exportation. As a lot of work will be done on the detected articles in a SR, it would make things much easier if the digital library search results could be exported to a text file, reference manager or other resource. This way, the reviewer can add any information of interest, add results from another digital library or process the information as required. If the digital library does not provide this functionality, the reviewer has to generate this file manually.
- Usability. The search engine should be easy to understand and operate, including a user-friendly and consistent interface, as well as help for the user and interaction with other applications (e.g. EndNote, ProCite reference managers, etc.).
We selected SCOPUS™ as the database for our research, as it has fewer weaknesses than the others. It is a database developed by the publisher Elsevier that can be accessed at http://www.scopus.com. It has a wide coverage of publications in the field of computer science (and other areas), maintains a full and consistent database, and provides a reliable and friendly search engine and a range of result exportation facilities. SCOPUS™ includes ten of the 12 gold standard publications. It does not include the ISESE and METRICS conference proceedings. Therefore, we removed the 13 papers presented at those conferences from the set of 103 articles, and we were left with a gold standard of 90 articles reporting controlled SE experiments.

In this work we are interested in analysing the capability of the different terms to identify articles reporting controlled experiments. We do not want to get results that are valid for only one database. Therefore, the results we got from SCOPUS have been validated on IEEEXplore® and ACM Digital Library. We compare the results of searching the different databases in Section 7 in order to check the generality of the results of our work.
5 Selecting Search Fields

After having determined which database to use, the next step for defining a search strategy is to select the fields of the article to be searched. Therefore, our first strategy analysis activity was to test which fields (title, abstract, keywords or all document fields) provide more relevant material. Once we have selected the fields, we will go on to analyse the effectiveness of the different search strings.

The first field that we searched was the title. To do this, we used the search string experiment, because it denotes the type of empirical study that we want to retrieve: controlled experiments. Searching just the title retrieved 43 articles, of which 17 were in the gold standard. Therefore, searching for the term experiment in the title has a recall of 18.8% and a precision of 39.5%, as shown in Table 3. While precision is extremely high (compared with the precision that we will get in later tests), the strategy of searching the title retrieves very few relevant articles out of the whole set of relevant material (recall).

Then we used the same term to search just the abstract. This search retrieved 341 articles, 50 of which were in the gold standard. So, search recall is 55.6% and precision is 14.6%. The number of relevant articles retrieved is low and so is precision in this case. Reviewers would have to read the abstracts of 291 articles that would then be rejected. In other words, six of every seven articles retrieved by the search are irrelevant.

Searching titles and abstracts retrieved a total of 352 articles. Comparing these search results against the gold standard, the search identified 69 relevant articles and failed to capture the other 21 experiments that the gold standard contains. Hence, the strategy has a recall of 76.6% and a precision of 19.6%. Both the recall level and the precision are quite acceptable.

Searching the title, abstract and keywords got a total of 357 articles, of which 69 were in the gold standard. This search retrieves the same number of relevant articles as the last one (title–abstract search) and, therefore, the recall value for this search is also 76.6%. However, precision is worse than for the title and abstract search (19.3%), as an extra five irrelevant articles were
retrieved. This means that we have to read an extra three in every 1,000 articles, none of which would be relevant. Bearing in mind that there could be hundreds of thousands of articles in the repository, this seemingly tiny difference could turn out to be relevant after all.

Finally, we searched for the same term in all the document fields. In this case, 612 articles were retrieved. Of these, 74 articles matched the gold standard, whereas the other 538 were irrelevant. Accordingly, this strategy has a recall of 82.2%, which is quite an acceptable value, but a precision of 12%, which is very low. Compared with the title and abstract search, precision drops from 19.6% to 12% for a search-all-fields strategy. With a drop like this we will retrieve 260 more papers and only another five relevant experiments among them. In other words, precision is almost halved and the number of irrelevant articles is doubled. This means that seven out of every eight articles retrieved by the search will be rejected. This is a very high cost for any review process.

Note, importantly, that when we search the full text of an article, we really have little control over which fields the search covers. The parts of the document covered by the search can vary, e.g. include or exclude the references section, depending on the search engine. It may also be the case that the database contains only the abstract rather than the full text of the article. In this case, the search will be confined to more restricted fields than the full text (abstract and title).

Table 3 summarizes the results. Searching titles and abstracts seems to be the best strategy. Therefore, the searches analysed below were run on these two fields.

Table 3 Scores for different fields

Field                            # articles retrieved   # articles matching gold standard   Recall (%)   Precision (%)
Title                            43                     17                                  18.8         39.5
Abstract                         341                    50                                  55.6         14.6
Title or abstract                352                    69                                  76.6         19.6
Title or abstract or keywords    357                    69                                  76.6         19.3
All                              612                    74                                  82.2         12
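As an illustration of how the field restriction can be expressed in practice, the snippet below sketches field-restricted queries corresponding to the five rows of Table 3, using Scopus-style field codes (TITLE, ABS, KEY, ALL). The exact codes and their behaviour should be verified against the documentation of the database actually used; the snippet is only meant to make the compared strategies concrete.

```python
# Scopus-style field-restricted queries for the five strategies compared in Table 3.
# Verify the field codes against the search engine you actually use.
field_strategies = {
    "Title":                       "TITLE(experiment)",
    "Abstract":                    "ABS(experiment)",
    "Title or abstract":           "TITLE(experiment) OR ABS(experiment)",
    "Title, abstract or keywords": "TITLE(experiment) OR ABS(experiment) OR KEY(experiment)",
    "All fields":                  "ALL(experiment)",
}

for field, query in field_strategies.items():
    print(f"{field}: {query}")
```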
6 Analysis of Search Strategies

Although the search for the term experiment in the title and abstract (which we call SRCH1) has acceptable recall and precision scores, we have tried to improve its precision and, if possible, also increase its recall. To do this, we took three steps: (1) we ran further searches using terms that appear recurrently in the titles and abstracts of the gold standard articles that were not retrieved using the term experiment (Section 6.1); (2) we ran searches combining each of these terms with experiment (Section 6.2); and (3) we ran searches combining all the terms together (Section 6.3).
Because of the definition of SE experiment used in (Sjøberg et al. 2005), which we mentioned in Section 3, the precision scores for the strategies we analyze here might vary. Precision for the searches we exercise might be lower if some of the articles including the terms used in the searches are not considered experiments under a stricter definition
of the term. On the contrary, precision might be higher if a more relaxed definition of experiment is used. But this does not affect our analysis of strategy results, since all the strategies are equally affected by the definition. Therefore, the tested strategies can be expected to behave with respect to each other as we have observed, irrespective of the definition used for experiment.

6.1 One-Term Searches

We already analysed the most obvious search term in view of the type of empirical study that we are looking for: experiment (SRCH1). The terms that we examine to find out whether they can improve the results of experiment are the recurrent terms within the set of gold standard experiments not retrieved by SRCH1: empirical study, experimental study, empirical evaluation, experimentation, experimental comparison, experimental analysis, experimental evidence, experimental setting and empirical data. Table 4 shows how each of these terms behaves:

- One of the most frequent terms we found within the gold standard articles not retrieved by experiment is empirical study. Since this term refers to both controlled experiments and other study types, we did not use it as our first option. After searching by empirical study, the scores are rather low.
- Another term that appears fairly often is experimental study. The result is low recall but fairly high precision.
- SRCH4 analyses the behaviour of the term empirical evaluation. Its recall and precision scores are very low.
- We analysed the term experimentation. This is a search with rather low recall and precision.
- We examined the term experimental comparison as the search labelled SRCH6, which is a search with very low recall but very high precision.
- We also investigated the terms experimental analysis, experimental evidence and experimental setting. The behaviour of all three terms was similar: the recall value is quite low, whereas precision is high in all three cases.
- Another term that we examined was empirical data. This term obtained very low scores for both properties.
Table 4 One-term search strategy results and properties

Search   Search term              # articles retrieved   # articles matching gold standard   Recall (%)   Precision (%)   Fall-out (%)
SRCH1    Experiment               352                    69                                  76.6         19.6            94.7
SRCH2    Empirical study          87                     9                                   10           10              98.5
SRCH3    Experimental study       8                      4                                   4.4          50              99.9
SRCH4    Empirical evaluation     26                     2                                   2            7.7             99.5
SRCH5    Experimentation          34                     3                                   3.3          8.8             99.4
SRCH6    Experimental comparison  9                      7                                   7.7          77.7            99.9
SRCH7    Experimental analysis    2                      1                                   1.1          50              99.9
SRCH8    Experimental evidence    2                      1                                   1.1          50              99.9
SRCH9    Experimental setting     2                      1                                   1.1          50              99.9
SRCH10   Empirical data           21                     2                                   2.2          9.5             99.6
It is clear from this analysis that experiment is the only term to have a good enough recall to be used in a search on its own. For the other terms, either the recall (in the case of experimental study, experimental comparison, experimental analysis, experimental evidence and experimental setting) or both the recall and the precision (in the case of empirical study, empirical evaluation, experimentation and empirical data) are too low for them to be used as a single search term.

6.2 Two-Term Searches

In an attempt to improve the results of searching with experiment, we looked at whether any of the above terms could improve its results when combined with it. Table 5 shows the properties of the searches combining each term with experiment. These results were obtained by searching each combination of terms in the title and the abstract. Table 5 also indicates the changes in recall (ΔRe) and precision (ΔPe) with respect to the results for the term experiment alone. Let us look at how each of these combinations behaves:
- Combining empirical study with experiment increases recall but decreases precision with respect to experiment. This combination provides seven more relevant articles than the search with just the term experiment. However, another 70 irrelevant articles are retrieved in return. Therefore, the term empirical study should be added to the term experiment only if we want to maximize recall, but not if we want to optimize the search.
- Combining experimental study and experiment produces an increase in recall and a precision (19.9%) slightly higher than for experiment alone. Experimental study adds two relevant articles to experiment while retrieving proportionally fewer irrelevant articles, which is a definite improvement.
- Combining empirical evaluation with experiment, we get two more relevant articles and an extra 24 irrelevant articles. So, empirical evaluation helps to maximize the recall of experiment, but does not optimize the strategy.
- The combination of experimentation with experiment slightly increases recall in exchange for a small drop in precision. Experimentation provides one more relevant article and an extra 31 irrelevant articles. It appears to be a questionable option for improving experiment.
- The combination of experimental comparison with experiment increases recall and precision. This is a definite improvement.
- If we combine experimental analysis with experiment, we find that the gain in recall is small but precision is no worse.
- By combining experimental evidence with experiment, we get the same increase in recall as in the last case (1.1%), although the increase in precision is lower.
- The combination of experimental setting and experiment improves recall and precision with respect to experiment. Again this is a modest gain, but there is no loss of precision.
- Finally, after combining empirical data with experiment, we found that its contribution to recall is low and comes at a small cost in precision. This means that another 16 articles have to be reviewed to detect one more relevant experiment. Therefore, the combination of empirical data and experiment degrades rather than improves the search.

Table 5 Properties of the search strategies combining experiment with another term

Search   Combination    Search terms                            # articles retrieved   # matching gold standard   Recall (%)   Precision (%)   Fall-out (%)   ΔRe    ΔPe
SRCH11   SRCH1+SRCH2    Experiment OR empirical study           429                    76                         84.4         17.7            93.4           +7.8   −1.9
SRCH12   SRCH1+SRCH3    Experiment OR experimental study        356                    71                         78.8         19.9            94.6           +2.2   +0.3
SRCH13   SRCH1+SRCH4    Experiment OR empirical evaluation      375                    71                         78.8         18.9            94.3           +2.2   −0.7
SRCH14   SRCH1+SRCH5    Experiment OR experimentation           374                    70                         77.7         18.7            94.3           +1.1   −0.9
SRCH15   SRCH1+SRCH6    Experiment OR experimental comparison   353                    70                         77.7         19.8            94.7           +1.1   +0.2
SRCH16   SRCH1+SRCH7    Experiment OR experimental analysis     353                    70                         77.7         19.8            94.7           +1.1   +0.2
SRCH17   SRCH1+SRCH8    Experiment OR experimental evidence     354                    70                         77.7         19.7            94.7           +1.1   +0.1
SRCH18   SRCH1+SRCH9    Experiment OR experimental setting      354                    70                         77.7         19.7            94.7           +1.1   +0.1
SRCH19   SRCH1+SRCH10   Experiment OR empirical data            368                    70                         77.7         19              94.4           +1.1   −0.6
Therefore, combining experiment with certain terms (experimental study, experimental analysis, experimental comparison, experimental evidence, experimental setting) leads to an improvement in both recall and precision. Notice that all the terms that improve experiment share the property of being of the type experimental X. Combining experiment with empirical study, empirical evaluation, experimentation, and empirical data degrades either precision, recall or both. Notice that all the terms that degrade experiment are of the type empirical X, plus the term experimentation. In pursuit of our objective of optimizing experiment, let us now try to combine experiment with more than one of these terms to study whether we can improve the results even further.

6.3 Multi-term Searches

In view of the pattern of terms that improve and degrade the search term experiment, we tried out two multi-term strategies: combining experiment with all the terms used (i.e. all the synonyms of experiment used in gold standard articles) and combining experiment with only the terms that improve its performance. Table 6 shows the properties of the multi-term searches. It indicates the changes to recall and precision compared with experiment (ΔRe, ΔPe).

Analysing the first combination (SRCH20), we found that there is a sizeable increase in recall, whereas precision is lower than for experiment. However, when combining experiment with its closest synonyms of the type experimental X (SRCH21), both recall and precision are higher than for experiment. The improvement in precision means that 108 articles, only eight of which reported relevant experiments, no longer have to be reviewed. Taking into account these results, we hypothesized that the combination of experiment with any possible synonym of the type experimental X might improve the search even further. To confirm this, we analysed a strategy that combined the terms experiment and experimental (SRCH22), finding that precision falls. This is equivalent to having to review another 172 articles with hardly any additional relevant articles found. Consequently, this generalized search degrades the optimized strategy (SRCH21).
Table 6 Properties of aggregated search strategies

SRCH20 (SRCH1+SRCH2+SRCH3+SRCH4+SRCH5+SRCH6+SRCH7+SRCH8+SRCH9+SRCH10): Experiment OR empirical study OR experimental study OR empirical evaluation OR experimentation OR experimental comparison OR experimental analysis OR experimental evidence OR experimental setting OR empirical data
  articles retrieved: 491; matching the gold standard: 83; recall: 92.2%; precision: 16.9%; fall-out: 92.3%; ΔRe: +15.6; ΔPe: −2.7

SRCH21 (SRCH1+SRCH3+SRCH6+SRCH7+SRCH8+SRCH9): Experiment OR experimental study OR experimental comparison OR experimental analysis OR experimental evidence OR experimental setting
  articles retrieved: 362; matching the gold standard: 75; recall: 83.3%; precision: 20.7%; fall-out: 94.6%; ΔRe: +6.7; ΔPe: +1.1

SRCH22: Experiment OR experimental
  articles retrieved: 534; matching the gold standard: 76; recall: 84.4%; precision: 14.2%; fall-out: 91.4%; ΔRe: +7.8; ΔPe: −5.4

SRCH23 (SRCH1+SRCH2+SRCH3+SRCH4+SRCH5+SRCH6+SRCH7+SRCH8+SRCH9): Experiment OR empirical study OR experimental study OR empirical evaluation OR experimentation OR experimental comparison OR experimental analysis OR experimental evidence OR experimental setting
  articles retrieved: 477; matching the gold standard: 83; recall: 92.2%; precision: 17.4%; fall-out: 92.6%; ΔRe: +15.6; ΔPe: −2.2

SRCH24 (SRCH1+SRCH2+SRCH3+SRCH4+SRCH6+SRCH7+SRCH8+SRCH9+SRCH10): Experiment OR empirical study OR experimental study OR empirical evaluation OR experimental comparison OR experimental analysis OR experimental evidence OR experimental setting OR empirical data
  articles retrieved: 470; matching the gold standard: 83; recall: 92.2%; precision: 17.7%; fall-out: 92.7%; ΔRe: +15.6; ΔPe: −1.9
Apart from these key combinations, we tried out different mixes of terms in case any combination provided better results. The following combinations merit a special mention:

- The combination of experiment, empirical study, experimental study, empirical evaluation, experimentation, experimental comparison, experimental analysis, experimental evidence and experimental setting (SRCH23) is a search achieving a maximum recall of 92.2%. This combination has the same recall as SRCH20 but it is better, as the drop in precision compared with the term experiment is smaller than for SRCH20. The 2% loss in precision means that another 125 articles will have to be reviewed to get an extra 15 relevant articles.
- Removing the term experimentation from SRCH20 (SRCH24) results in a small loss of precision with respect to experiment; however, its precision is still no better than SRCH21's.
The fall-out of all the analysed searches is always greater than 91%. In Section 8 we will use these three properties (recall, precision and fall-out) of the search strategies that we exercised to generate recommendations on searching for SE experiments. But first let us check that the results we have obtained are not database-dependent.
7 Validating Results with Other Bibliographic Databases

Table 7 shows the results of the analysed searches when they were run using the IEEEXplore and ACM Digital Library (DL for short) bibliographic databases. We found that the relative behaviour of the search terms and their combinations is, in most cases, similar, although the values are not exactly the same. The reasons for the observable differences are:
- The SCOPUS, IEEEXplore and ACM DL gold standards are necessarily different. The gold standard for each of the two new databases is made up of the subset of the original gold standard articles that appear in the publications available in that database. Specifically, the gold standard for IEEEXplore is a subset of 33 articles (approx. a third of the original) and for ACM DL it is 13 articles (approx. a third of the IEEEXplore gold standard). Because of these changes in gold standard size, each article retrieved by a strategy has a different weight in the recall calculation. Whereas each relevant article represents 1.1% of the gold standard in SCOPUS (100%/90=1.1%), a relevant article represents 3.03% for IEEEXplore (100%/33=3.03%) and 7.7% (100%/13=7.7%) for ACM DL. Therefore, variations in the number of retrieved articles cause appreciable differences in recall. However, as we are comparing strategies, the important point is that the relative differences in recall should be unchanged.
- One search engine may retrieve articles that another does not, depending on what data about the article the repository stores. As mentioned in Section 4, IEEEXplore and ACM DL do not always contain the abstract of the articles that they index. Consequently, searches cannot retrieve the articles that only have the search terms in the abstract. Of the 33 articles that are in the IEEEXplore gold standard subset, three (approx. 10%) do not have an abstract. This percentage was higher for ACM DL: four out of the 13 articles in the ACM gold standard (approx. 31%) do not have an abstract. This leads to lower recall and precision values for these databases than for SCOPUS.
- As specified in Section 4, the ACM DL search algorithm behaves differently from the other two. For example, the term experiment retrieves any variation of this term.
Table 7 Comparison of search results and properties. For each database the three figures are Recall (%) / Precision (%) / Fall-out (%).

Search   Combination                 Search term                                        SCOPUS               IEEEXplore           ACM DL
SRCH1    –                           Experiment                                         76.6 / 19.6 / 94.7   69.6 / 15.7 / 97.7   53.8 / 10.7 / 98.9
SRCH2    –                           Empirical study                                    10 / 10 / 98.5       12.1 / 7.5 / 99.1    0 / 0 / 99.7
SRCH3    –                           Experimental study                                 4.4 / 50 / 99.9      9 / 75 / 99.9        0 / 0 / 100
SRCH4    –                           Empirical evaluation                               2 / 7.7 / 99.5       3 / 10 / 99.8        0 / 0 / 99.9
SRCH5    –                           Experimentation                                    3.3 / 8.8 / 99.4     0 / 0 / 99.6         53.8 / 10.7 / 98.9
SRCH6    –                           Experimental comparison                            7.7 / 77.7 / 99.9    6 / 66.7 / 99.9      0 / 0 / 100
SRCH7    –                           Experimental analysis                              1.1 / 50 / 99.9      0 / 0 / 99.9         0 / 0 / 100
SRCH8    –                           Experimental evidence                              1.1 / 50 / 99.9      0 / 0 / 99.9         0 / 0 / 99.9
SRCH9    –                           Experimental setting                               1.1 / 50 / 99.9      0 / 0 / 100          0 / 0 / 100
SRCH10   –                           Empirical data                                     2.2 / 9.5 / 99.6     6 / 12.5 / 99.7      15.3 / 50 / 99.9
SRCH11   SRCH1+SRCH2                 Experiment OR empirical study                      84.4 / 17.7 / 93.4   75.7 / 13.4 / 97.0   53.8 / 9.4 / 98.7
SRCH12   SRCH1+SRCH3                 Experiment OR experimental study                   78.8 / 19.9 / 94.6   72.7 / 16.3 / 97.7   53.8 / 10.7 / 98.9
SRCH13   SRCH1+SRCH4                 Experiment OR empirical evaluation                 78.8 / 18.9 / 94.3   72.7 / 15.6 / 97.6   53.8 / 10.6 / 98.9
SRCH14   SRCH1+SRCH5                 Experiment OR experimentation                      77.7 / 18.7 / 94.3   69.7 / 14.5 / 97.4   53.8 / 10.7 / 98.9
SRCH15   SRCH1+SRCH6                 Experiment OR experimental comparison              77.7 / 19.8 / 94.7   72.7 / 16.3 / 97.7   53.8 / 10.7 / 98.9
SRCH16   SRCH1+SRCH7                 Experiment OR experimental analysis                77.7 / 19.8 / 94.7   69.7 / 15.7 / 97.7   53.8 / 10.7 / 98.9
SRCH17   SRCH1+SRCH8                 Experiment OR experimental evidence                77.7 / 19.7 / 94.7   69.7 / 15.4 / 97.6   53.8 / 10.7 / 98.9
SRCH18   SRCH1+SRCH9                 Experiment OR experimental setting                 77.7 / 19.7 / 94.7   69.7 / 15.7 / 97.7   53.8 / 10.7 / 98.9
SRCH19   SRCH1+SRCH10                Experiment OR empirical data                       77.7 / 19 / 94.4     72.7 / 15 / 97.5     61.5 / 11.7 / 98.9
SRCH20   SRCH1+SRCH2+...+SRCH10      All ten terms (see Table 6)                        92.2 / 16.9 / 92.3   84.4 / 12.8 / 96.5   61.5 / 10.2 / 98.7
SRCH21   SRCH1+3+6+7+8+9             Experiment plus experimental X terms (see Table 6) 83.3 / 20.7 / 94.6   75.7 / 16.5 / 97.6   53.8 / 10.7 / 98.9
SRCH22   –                           Experiment OR experimental                         84.4 / 14.2 / 91.4   75.7 / 11.7 / 96.4   53.8 / 10.9 / 98.7
SRCH23   SRCH1+SRCH2+...+SRCH9       All terms except empirical data (see Table 6)      92.2 / 17.4 / 92.6   84.4 / 13.5 / 96.6   53.8 / 9.4 / 98.7
SRCH24   SRCH1+2+3+4+6+7+8+9+10      All terms except experimentation (see Table 6)     92.2 / 17.6 / 92.7   84.4 / 14.2 / 96.9   61.5 / 10.2 / 98.7
Additionally, in ACM DL the term experimentation retrieves other terms with the same root. This explains why all the searches of ACM DL that include terms whose root is experiment provide the same recall and precision values in Table 7.
8 Recommendations on Search Strategies for Retrieving SE Experiments

Depending on researchers' interests, the aim of a SR might be to maximize the amount of relevant material retrieved (high recall search), minimize the non-relevant material retrieved (high precision search) or optimize the retrieval of relevant material (high recall and high precision search). Although a high recall search strategy is capable of identifying more relevant material than any other type of search, there is a trade-off between the number of potentially eligible articles that such a strategy can retrieve and the additional effort required to determine whether or not they are relevant. Therefore, the selection of the best strategy depends on the characteristics of the SR and the resources available. We have obtained the following recommendations from the results of the searches we have exercised:
- Using just the term experiment to search titles and abstracts to retrieve controlled SE experiments is not a bad strategy at all. Although quite a few experiments are left out (25% in our case), the number of irrelevant articles retrieved is one of the lowest (a precision of 19.6% is quite good). Therefore, for a quick search, do not hesitate to use this strategy.
- To increase the number of relevant papers retrieved, close and accepted synonyms of experiment (that authors may have used instead of that term) should be added to the search. The most commonly used synonyms that do not detract from precision are: experimental study, experimental comparison, experimental analysis, experimental evidence, and experimental setting. Combining these terms with experiment, we get good recall (83.3%) and precision (20.7%) scores, overlooking only 16.7% of the gold standard articles (15 articles). The corresponding search strings are sketched after this list.
- Adding other more general terms like experimentation, empirical study, and empirical evaluation to these synonyms of the term experiment detects more relevant experiments (92.2%). However, there is a cost in terms of an increase in the number of irrelevant articles. Therefore, this search should only be used when there are plenty of SR resources for investment in detecting and rejecting irrelevant articles.
- None of the synonyms of experiment, not even empirical study, should be used alone as a search strategy, because they omit the huge majority of experiments.
- The selected search terms should be tracked down in titles and abstracts. Confining the search to only one of these fields omits many relevant articles. Additionally, searching all the fields of the document does not significantly increase the number of relevant articles, whereas it does require a big effort to select material.
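As a convenience, the search strings corresponding to the recommended strategies are sketched below. The term lists are taken directly from the strategies evaluated above (SRCH1, SRCH21 and SRCH23); writing them as Python constants is only for readability, and the strings should be applied to the title and abstract fields of the chosen database.

```python
# Quick search (SRCH1): the single term, searched in title and abstract.
QUICK = '"experiment"'

# Optimal search (SRCH21): experiment plus the experimental X synonyms that do not
# detract from precision (recall 83.3%, precision 20.7% on our gold standard).
OPTIMAL = (
    '"experiment" OR "experimental study" OR "experimental comparison" OR '
    '"experimental analysis" OR "experimental evidence" OR "experimental setting"'
)

# High recall search (SRCH23): adds the more general terms, detecting 92.2% of the
# gold standard at the cost of many more irrelevant articles.
HIGH_RECALL = (
    OPTIMAL + ' OR "experimentation" OR "empirical study" OR "empirical evaluation"'
)

print(QUICK, OPTIMAL, HIGH_RECALL, sep="\n")
```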
Apart from these guidelines, after having run a pilot study on all the bibliographic databases described in Section 4, we can suggest a number of improvements that would help to more effectively identify important experiments. Two very specific improvements would lead to better searches and hence more effective SRs: (1) add the study type as a mandatory descriptor, and (2) use controlled vocabulary. However, it would take a separate study to identify all the potential improvements to bibliographic databases, which is outside the scope of this paper.
9 Importance of the Search Universe

For this research the search universe was imposed by the gold standard we needed for evaluating the search results. When a researcher is retrieving experiments for a SR, however, the selection of the search universe is a key issue, since the search universe has an influence on strategy recall and precision. The relevance of the material retrieved depends on the sources (journals, conferences, etc.) being searched. To minimize the selection effort, the searches should be run on sources of interest, composed of publications whose scope includes the target SE empirical studies.

Establishing what the sources of interest are is an important decision. For instance, in the review that we used as our gold standard (Sjøberg et al. 2005) the researchers decided that the sources of interest were confined to highly reputed publications. We have looked at whether such a constraint omits a sizeable number of relevant articles. To analyse this point, we compared the gold standard that we have been using in our work against another two SRs (Juristo et al. 2004; Davis et al. 2006). In these two reviews the researchers wanted to get as large a search universe as possible and to have access to any possibly relevant article (although a broader range of study types was considered in these two surveys, the definition of the term experiment is the same as the one in the gold standard). Therefore these two SRs did not place any limitation on the universe based on the quality of sources.

The first survey (Juristo et al. 2004) reviewed 24 articles in the testing area over a period of 25 years. Out of these 24, 16 experiments belong to the 1993–2002 period. Of these 16 experiments only eight were part of our gold standard (Sjøberg et al. 2005) search universe. Therefore, 50% of the relevant experiments for this SR were published in other journals or conferences and not in the reputed publications selected by (Sjøberg et al. 2005) as the search universe. The publications in which the relevant experiments not belonging to the gold standard universe appeared are shown in Table 8.

The second SR (Davis et al. 2006) included a total of 21 articles in the field of requirements elicitation, of which 10 controlled experiments belonged to the 1993–2002 period. None of these ten experiments were found in the gold standard search universe. In this case, the experiments were published in the journals and conferences listed in Table 8.

Table 8 Publications with relevant experiments for (Juristo et al. 2004) and (Davis et al. 2006) not belonging to the (Sjøberg et al. 2005) search universe (number of articles in parentheses)

(Juristo et al. 2004)
Journals: Software Quality Journal (1)
Conferences: International Symposium on Software Testing and Analysis (1), International Symposium on Foundations of Software Engineering (2), International Conference on Software Maintenance (1), International Symposium on Software Reliability Engineering (2), European Software Engineering Conference (1)

(Davis et al. 2006)
Journals: Journal of Economic Psychology (1), Journal of Management Information Systems (1), Knowledge Acquisition (1), Human Factors (1), Expert Systems (2), Information Systems Research (1), Journal of Experimental Psychology: Applied (1)
Conferences: International Conference on Requirements Engineering (1), International Conference on Automated Software Engineering (1)

These data let us estimate the amount of relevant material missed by using a restricted search universe. The three reviews together identified 121 experiments for the 1993–2002 time period (the 18 articles covered by the above two reviews, but left out by the gold standard, plus the 103 articles belonging to the gold standard). This means that the publications of repute considered in the gold standard leave aside 15% of the experiments existing between 1993 and 2002. Actually, the real percentage is probably greater than 15%, because the searches in (Juristo et al. 2004; Davis et al. 2006) are restricted to the fields of testing and requirements elicitation, respectively, and we can accordingly expect there to be more experiments in other SE areas outside the gold standard universe.

Note that the experiments that the gold standard omits are not necessarily of poor quality. Most of the omitted articles were published in special-purpose publications on the domain of the experiments. Therefore, this is an important point to take into account when selecting the search universe for SRs. This leads us to make the following recommendations:
- If the search universe for a SR is confined to general SE publications of repute, a sizeable number of relevant experiments will not be retrieved.
- The universe should include not only general SE publications but also publications specific to the topic of the SE experiments (e.g. requirements conferences and journals, or quality and testing conferences and journals).
- Depending on the specific topic or technology of interest (such as elicitation techniques), publications from other areas apart from SE should be explored (in the case of elicitation, areas like information systems, economics, artificial intelligence, etc.).
- Consequently, we recommend the use of bibliographic databases that index as many outlets as possible. The ideal thing would be a digital library that indexed all SE- and/or computer science-related journals and conferences. Unfortunately, there is no such digital library at present. Failing this, repositories like SCOPUS can be very helpful, as they integrate a range of databases and a very large number of outlets.
Notice that we do not recommend "expanding the search universe", but "trying to consider all the relevant sources". Confining searches to publications that are representative of the aspect that the SR is analysing is a key issue. Increasing the search universe indiscriminately is not a good strategy: the fall-out value and the recall score tend to remain unchanged when the search universe grows, but precision drops sharply. Even though the goal is to get more relevant material, expanding the search universe leads to a slight increase in relevant articles in return for a spectacular growth in the number of irrelevant articles. A bigger universe returns more irrelevant articles because the search string can appear in more articles by chance. Consequently, search precision falls in bigger universes, whereas recall is almost unchanged (the number of relevant articles and the gold standard are only a little bigger). Fall-out tends to remain unchanged or even drop slightly, depending on the ratio between the growth of the search universe and the increase in the irrelevant articles retrieved by the search.

To illustrate the behaviour of precision as the search universe grows, Table 9 shows a simulation using a search strategy with a set recall score (92.2%, similar to SRCH20), a set fall-out score (92%, an optimal fall-out value) and a gold standard of 150 articles. Precision drops from 25.69% in a universe of 5,000 articles (a value close to the universe used in our gold standard) to 0.17% for a universe of 1,000,000 articles, which is not unusual in bibliographic databases like SCOPUS, IEEE Xplore or the ACM DL unless constraints are placed on the journals and conferences to be searched.
Table 9 Simulation of search properties with different sizes of search universes

# articles in      # returned by the search   # relevant   A     B        C    D         Recall   Precision   Fall-out
search universe    (recall=92.2%)             papers                                     (%)      (%)         (%)
5,000              490                        150          138   400      12   4,450     92.20    25.69       91.75
10,000             840                        150          138   800      12   9,050     92.20    14.74       91.88
20,000             1,540                      150          138   1,600    12   18,250    92.20    7.96        91.94
50,000             3,640                      150          138   4,000    12   45,850    92.20    3.34        91.98
100,000            7,140                      150          138   8,000    12   91,850    92.20    1.70        91.99
200,000            14,140                     150          138   16,000   12   183,850   92.20    0.86        91.99
500,000            35,140                     150          138   40,000   12   459,850   92.20    0.34        92.00
1,000,000          70,140                     150          138   80,000   12   919,850   92.20    0.17        92.00
Consequently, whereas confining the search to reputed general SE publications leads to many relevant articles being overlooked, searching an entire bibliographic database leads to a great deal of effort being needed to weed out irrelevant results (a precision of 0.17% means that only about one in every 600 retrieved articles is relevant).
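As a complement to Table 9, the following minimal sketch (ours, not part of the tooling used in this study) reproduces the precision trend under the simulation's stated conditions. It assumes recall fixed at 92.2%, a gold standard of 150 articles, and the irrelevant articles retrieved held at a constant 8% share of the universe; the function and parameter names are illustrative.

```python
# Illustrative reproduction of the precision trend in Table 9.
# Assumptions (ours): recall fixed at 92.2%, gold standard of 150 articles,
# and irrelevant retrieved articles amounting to a constant 8% of the universe.

def simulate(universe_size: int, gold: int = 150,
             recall: float = 0.922, irrelevant_rate: float = 0.08):
    relevant_retrieved = recall * gold                                    # A
    irrelevant_retrieved = irrelevant_rate * universe_size                # B
    irrelevant_not_retrieved = universe_size - gold - irrelevant_retrieved  # D
    precision = relevant_retrieved / (relevant_retrieved + irrelevant_retrieved)
    # "Fall-out" as reported in Table 9: share of irrelevant articles NOT retrieved.
    fallout = irrelevant_not_retrieved / (universe_size - gold)
    return precision, fallout

for n in (5_000, 10_000, 100_000, 1_000_000):
    p, fo = simulate(n)
    print(f"universe={n:>9,}  precision={p:6.2%}  fall-out={fo:6.2%}")
# Precision falls from about 25.7% at 5,000 articles to about 0.17% at
# 1,000,000 articles, while recall and fall-out barely change.
```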
10 Terminological Trends in Articles Reporting SE Experiments

Compared to other disciplines, in SE we use rather different terms to refer to the concept of a controlled experiment. To get a satisfactory recall, any search strategy should be composed of several keywords, and it takes an unmanageable number of terms to get a search with 100% recall. At the same time, this would cause a considerable drop in precision, calling for unwarranted effort at the selection stage. We have examined the evolution of this terminology over time to check whether we are converging towards a definite term, which would be very helpful for retrieving relevant articles. We focused the analysis on the most recurrent terms in the gold standard articles (experiment, empirical study, experimental study, empirical evaluation and experimental comparison).

Figure 2 shows the use of these terms in the gold standard articles. Although the predominant term is experiment, there is no clear indication that its use has become generalized over the years. The same applies to the other terms: none appear to be gaining ground as the generally accepted term in ESE for referring to a controlled experiment. The promiscuity of terms used to refer to controlled experiments since 1993 cannot be said to be decreasing. For example, the three articles reporting controlled experiments in 1995 all used the term experiment, that is, experiment was the term used in 100% of the articles; however, it was used in only seven articles (63.6%) reporting experiments in 1997. In 2001 the term was used in 11 articles (84.6%); however, it was used by only four articles (57.1%) in 2002.

The use of the term experimental comparison is neither especially significant (with the exception of 1993) nor very constant over the years; it hardly ever appears or has fallen into disuse. Experimental study appears in 2000 only, when it was used by one article (4%) reporting experiments, and then disappeared, even though experimental study is one of experiment's closest synonyms. Empirical evaluation is a term used sporadically by some authors. Finally, a set of other miscellaneous terms (experimental analysis, experimental evidence, experimental setting, and empirical data) covers from 9% to 22% of all articles.
Fig. 2 Number of papers using a term in title and abstract (per year, 1993–2002; terms plotted: experiment, empirical study, experimental study, empirical evaluation, experimental comparison, other terms)
With respect to empirical study, we find that it is a term with a modest but constant presence over the years as a way of referring to controlled experiments. In other words, although it is a broader term than experiment, it is its most commonly used synonym.

Summarizing, terminology use is fairly erratic: there was no standard usage between 1993 and 2002. Note that this lack of standardization is a major impediment to effective searching and, consequently, to SRs. In other, more mature areas, like medicine, the terminology in experimental publications is much more standardized, and searches achieve a recall of 92% for a precision of 48%, and even a recall of 99% if precision can be sacrificed a little (22%) (Straus et al. 2005).

Another aspect we looked at concerns how indicative the titles and abstracts of the articles are. We analysed how many authors used the term experiment in the title when they published an experiment. In our gold standard set, we found that only 16 articles (18%) used the term experiment in the title. However, a large proportion of articles (55.6%) did use the term experiment in the abstract, even if it did not appear in the title. As a whole, this term was explicitly mentioned in either the title or the abstract of 66 of the gold standard articles (73.6%). The remainder (24 articles) use either one of the synonyms analysed in this paper or no reference term at all in either the title or the abstract. This analysis lends strength to our recommendation that experimenters performing SRs search a combination of title and abstract rather than confining searches to just the article title. Whereas searching only the title might be good practice in other, more mature experimental disciplines, it is not in ESE. Analysing the use of terms in titles and abstracts over time (see Fig. 3), we find that the situation did not improve during the reference period used in this work.

Fig. 3 Ratio between articles with the term experiment in title and abstract (per year, 1993–2002; categories: used in title or title and abstract, used only in abstract, not used)

Diverse standardization initiatives and recommendations have emerged since 1998. After analyzing 600 published papers, Zelkowitz and Wallace (1998) concluded that experimentation is a term frequently used incorrectly in the computer science community, and they established a hierarchy of twelve different experimental approaches for validating technology. Other authors have proposed guidelines for carrying out the entire experimental process (Kitchenham et al. 2002) and for writing experimental reports (Singer 1999). However, such initiatives appear to have had little effect before 2003. Our findings confirm the need for community efforts aimed at proposing and adopting this type of guidelines. As mentioned above, we believe that the use of structured abstracts
as suggested by (Jedlitschka and Pfahl 2005) as a reporting standard for SE experiments should have a positive impact on search results in existing databases. Authors in the SE community have started to use structured abstracts more or less systematically. It will be interesting to see what effect this has as their use becomes widespread; we were unable to assess this in this paper because the articles in the gold standard were published before the proposal by (Jedlitschka and Pfahl 2005).
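To make the title-and-abstract recommendation concrete, the following is a minimal illustrative sketch (not the procedure or data used in this study; the article records, gold standard set and term list are hypothetical placeholders) of how one might check which gold standard articles a term-based search over title and abstract retrieves, compared with a title-only search.

```python
# Minimal sketch: recall of a term-based search over title and abstract
# versus title only. Articles and gold standard below are placeholders.

from dataclasses import dataclass

@dataclass
class Article:
    ident: str
    title: str
    abstract: str

# A subset of the terms denoting a controlled experiment discussed above.
TERMS = ["experiment", "empirical study", "experimental study",
         "empirical evaluation", "experimental comparison"]

def matches(article: Article, terms=TERMS, fields=("title", "abstract")) -> bool:
    """True if any search term appears in any of the selected fields."""
    text = " ".join(getattr(article, f).lower() for f in fields)
    return any(term in text for term in terms)

def recall(retrieved_ids: set, gold_ids: set) -> float:
    """Proportion of gold standard articles that the search retrieved."""
    return len(retrieved_ids & gold_ids) / len(gold_ids)

if __name__ == "__main__":
    corpus = [
        Article("a1", "A controlled experiment on pair programming", "..."),
        Article("a2", "Assessing inspection techniques", "We report an empirical study ..."),
        Article("a3", "A new testing tool", "Tool description without evaluation."),
    ]
    gold = {"a1", "a2"}  # articles known to report experiments
    hits_title_abs = {a.ident for a in corpus if matches(a)}
    hits_title = {a.ident for a in corpus if matches(a, fields=("title",))}
    # Title+abstract retrieves both gold articles (recall 1.0); title alone
    # misses a2 (recall 0.5), illustrating the recommendation above.
    print(recall(hits_title_abs, gold), recall(hits_title, gold))
```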
11 Generalization of the Results

Our survey focused on a particular type of study: controlled experiments. However, there is something to be gained from considering whether our analysis and recommendations are applicable to other types of empirical studies.

As far as our recommendations on the terms in the search string are concerned, it would be risky to venture that case study, quasi-experiment or any other observational study is likely to behave similarly to experiment. We can surmise that case study is likely to behave similarly to experiment insofar as experimenters use a variety of terms to refer to this type of study. We have also found that the term empirical study could be one of the terms used as a synonym and that it is therefore applicable for retrieval. There is no evidence, however, that this term will produce acceptable or optimal results in terms of recall and precision. Likewise, we are unable to suggest a search string for case studies. As regards quasi-experiments, there is even less chance of extending our results, as we are unable to learn anything about the behaviour of this term from the data gathered in our study.

Despite the above points, it is reasonable to expect that, generally, retrieval will be better for each study type insofar as the search string contains terms that are very close synonyms of the term representing the study type. A survey of a representative group of SE studies, such as we did here for the term experiment, would be necessary to identify synonyms for case study and quasi-experiment. On the other hand, we can easily extend to any study type our recommendations on:
– Bibliographical database and search universe selection. As our study did not use databases specialized in controlled experiments (which, unlike in medicine, do not exist in our field), the analysis and recommendations about database selection for searching SR-relevant empirical studies are applicable irrespective of the study type. Also, there are no SE outlets specialized in the publication of controlled experiments or any other study type. Therefore, our recommendations on how to optimally select the universe of publications to be searched are also independent of the SR study type.
– Selection of search fields. When we selected the search fields to be used in our strategy, we aimed to include the fields that usually mention the terms related to the study type. We tried this out with the term experiment and got optimum results when we searched the title and abstract. This search behaviour is likely to be similar when we want to retrieve case studies or quasi-experiments.
– Use of a controlled vocabulary. Our study analysed how the non-standardized use of terms in ESE affects the results of experiment searches. Even though we have not conducted a similar analysis for case studies and quasi-experiments, it is only logical to expect that our recommendation on using a more standardized vocabulary will benefit the results of searches generally and of any SR, irrespective of the type of evidence it is looking for.
12 Conclusions

We find that optimizing a search strategy for retrieving relevant SE experiments is not a straightforward issue. One way of increasing recall is to include many terms related to experiment, but an important problem that then has to be tackled is the loss of precision as recall increases. In mature fields with a standardized use of terminology, searches can use very specific terms to push a strategy's recall close to 100%, but this approach does not manage to produce high-recall searches in experimental SE.

With respect to the terms used in searches for SE systematic reviews, the term experiment has very acceptable scores as a search strategy. Therefore, it is a term that will produce good results if used alone. The same does not apply to any other synonym of experiment. Supplementing the term experiment with closely related terms (e.g. experimental study, experimental comparison, experimental analysis, experimental evidence, and experimental setting) can improve recall without any loss in precision. However, adding terms indiscriminately in an effort to maximize search recall would lead to a drop in precision.

Authors do not use the term experiment consistently in the titles of their publications. Therefore, searches should be applied to both title and abstract and not confined to just the title. This highlights the need to establish guidelines within the empirical SE community encouraging authors to use more indicative titles and abstracts.

Our study alerts experimenters to the negative side of confining searches for SRs to sources of repute or with high impact factors, as a considerable number of relevant articles could slip through the net. For some SE topics, the search should also explore publications in related fields, like information systems, psychology, economics, quality, artificial intelligence, etc. Nevertheless, the search universe should not be enlarged indiscriminately; it should be limited to publications that are likely to contain relevant articles. Under no circumstances should the search be run on all the articles contained in the different bibliographical databases.

Acknowledgements We would like to thank Dag Sjøberg and Jo Hannay, of Simula Research Laboratory, for providing the references to the 103 articles that were used to formalize the gold standard used in this research.
References

Bailey J, Zhang C, Budgen D (2007) Search engine overlaps: do they agree or disagree? Proceedings of the 2nd International Workshop on Realising Evidence-Based Software Engineering REBSE'07. Minneapolis, USA, 1–6
Brereton P, Kitchenham B, Budgen D, Turner M, Khalil M (2007) Lessons from applying the systematic literature review process within the software engineering domain. J Syst Softw 80(4):571–583. doi:10.1016/j.jss.2006.07.009
Davis A, Dieste O, Juristo N, Moreno A (2006) Effectiveness of requirements elicitation techniques: empirical results derived from a systematic review. Proceedings of the 14th IEEE International Requirements Engineering Conference RE'06. Minneapolis, USA, 179–188
Dybå T, Kitchenham B, Jørgensen M (2005) Evidence-based software engineering for practitioners. IEEE Softw 22:58–65. doi:10.1109/MS.2005.6
Dybå T, Kampenes V, Sjøberg D (2006) A systematic review of statistical power in software engineering experiments. Inf Softw Technol 48(8):745–755. doi:10.1016/j.infsof.2005.08.009
Dybå T, Arisholm E, Sjøberg D, Hannay J, Shull F (2007a) Are two heads better than one? On the effectiveness of pair programming. IEEE Softw 24(6):10–13. doi:10.1109/MS.2007.158
Dybå T, Dingsøyr T, Hanssen GK (2007b) Applying systematic reviews to diverse study types: an experience report. Proceedings of the First International Symposium on Empirical Software Engineering and Measurement (ESEM 2007). Madrid, Spain, 225–234
EBM Working Group (1992) Evidence-based medicine: a new approach to teach the practice of medicine. J Am Med Inform Assoc 268(17):2420–2425
Hannay J, Sjøberg D, Dybå T (2007) A systematic review of theory use in software engineering experiments. IEEE Trans Softw Eng 33(2):87–107
Higgins J, Green S (2006) Cochrane handbook for systematic reviews of interventions 4.2.6 [updated September 2006]. In: The Cochrane library (vol 4). Wiley, Chichester, UK
Jedlitschka A, Pfahl D (2005) Reporting experiments in software engineering. Proceedings of the International Symposium on Empirical Software Engineering (ISESE'05). Noosa Heads, Australia, 95–104
Jørgensen M (2004) A review of studies on expert estimation of software development effort. J Syst Softw 70(1–2):37–60
Jørgensen M, Shepperd M (2007) A systematic review of software development cost estimation studies. IEEE Trans Softw Eng 33(1):33–53
Juristo N, Moreno A, Vegas S (2004) Reviewing 25 years of testing technique experiments. Empir Softw Eng 9:7–44
Kampenes VB (2007) Quality of design, analysis and reporting of software engineering experiments – a systematic review. Ph.D. dissertation Nr 671, University of Oslo. ISSN 1501-7710
Kampenes VB, Dybå T, Hannay JE, Sjøberg DIK (2007) A systematic review of effect size in software engineering experiments. Inf Softw Technol 49(11–12):1073–1086
Kitchenham B, Pfleeger S, Pickard L, Jones P, Hoaglin D, Emam K, Rosenberg J (2002) Preliminary guidelines for empirical research in software engineering. IEEE Trans Softw Eng 28:721–734
Kitchenham B, Dybå T, Jørgensen M (2004) Evidence-based software engineering. Proceedings of the 26th International Conference on Software Engineering (ICSE'04). Scotland, UK, 273–284
Kitchenham BA, Mendes E, Travassos GH (2007) Cross versus within-company cost estimation studies: a systematic review. IEEE Trans Softw Eng 33(5):316–329
Lajeunesse MJ, Forbes M (2003) Variable reporting and quantitative reviews: a comparison of three meta-analytical techniques. Ecol Lett 6:448–454
Mendes E (2005) A systematic review of web engineering research. Proceedings of the ACM/IEEE International Symposium on Empirical Software Engineering. Noosa Heads, Australia, 498–507
Petitti D (2000) Meta-analysis, decision analysis and cost-effectiveness analysis. Oxford University Press, Oxford
Singer J (1999) Using the American Psychological Association (APA) style guidelines to report experimental results. Proceedings of the Workshop on Empirical Studies in Software Maintenance, Oxford, UK, 2–5
Sjøberg D, Hannay J, Hansen O, Kampenes V, Karahasanovic A, Liborg N, Rekdal A (2005) A survey of controlled experiments in software engineering. IEEE Trans Softw Eng 31:733–753
Straus SE, Richardson W, Glasziou P, Haynes RB (2005) Evidence-based medicine. How to practice and teach EBM. Elsevier, Oxford
van Rijsbergen CJ (1979) Information retrieval. Department of Computer Science, University of Glasgow, Glasgow
Zelkowitz MV, Wallace DR (1998) Experimental models for validating technology. IEEE Comput 31(5):23–31
Oscar Dieste is a researcher with the Computing School of the UPM. His research interests include empirical software engineering and requirements engineering. He received his B.S. in computing from Coruña University and his Ph.D. from Castilla-La Mancha University.
Anna Grimán Padua is an associate professor of information systems and software engineering with the Process and Systems Department at Simón Bolívar University (USB) in Venezuela. She has a B.S. in Computing and an M.Sc. in Information Systems. Since 1997 she has been a researcher and enterprise consultant with the Information Systems Research Laboratory (LISI-USB). Since 2006 she has been a postgraduate student with the Empirical Software Engineering group at the Technical University of Madrid.
Natalia Juristo is a full professor with the Computing School at the Technical University of Madrid (UPM) in Spain. She has been the Director of the UPM M.Sc. in Software Engineering for 10 years. She has served on several program committees, including ICSE, RE, ISESE and SEKE, has been program chair for SEKE and ISESE, and has been general chair for SEKE and SNPD. She has been guest editor of special issues in several journals, including IEEE Software, the Journal of Systems and Software, Data and Knowledge Engineering, and the International Journal of Software Engineering and Knowledge Engineering. Dr. Juristo has been a member of several editorial boards, including IEEE Software and the Journal of Empirical Software Engineering. She has a B.S. and a Ph.D. in Computing from UPM.