Empirical Evaluation of Software Testing Techniques in an Open Source Fashion

Sheikh Umar Farooq
Department of Computer Sciences, University of Kashmir, India
[email protected]

SMK Quadri
Department of Computer Sciences, University of Kashmir, India
[email protected]

ABSTRACT

Testing technique selection and evaluation remains a key issue in software testing. Industry practitioners need concrete evidence to select appropriate testing techniques in the software testing life cycle (STLC). Despite the large number of empirical studies that attempt to study the applicability conditions of testing techniques and allied factors, we are still without realistic and generalized results, as the studies lack a formal foundation and are not complete in all respects. In addition to varying significantly in the parameters they take into consideration, many existing studies report contradictory results. Researchers stress replication of these studies under a common set of guidelines, yet attempts to aggregate the results of such replications have not been fruitful so far. To bridge the gap between researchers and industry professionals, we propose carrying out the evaluation of testing techniques on a large scale, under a unified framework, in an open-source fashion, so that realistic and generalized results are obtained in a shorter span of time.

Categories and Subject Descriptors
D.2.0 [Software Engineering]: General

General Terms
Experimentation

Keywords
Aggregation, Empirical Studies, Comparison of testing techniques, Evaluation, Experimentation, Open source, Replication.

1. MOTIVATION
Over the last decade, it has become quite clear that software engineering is fundamentally an empirical discipline: software development practices and technologies must be thoroughly investigated by empirical means in order to be understood, evaluated, and deployed in proper contexts [1]. Researchers are aiming for extensive and exhaustive empirical research in all areas to underpin software engineering, since one of the bases for development in any discipline is empirical verification of knowledge [2]. Empirical studies are crucial for investigating test techniques in order to compare and improve software testing techniques and practices [3]. In fact, there is no other way to assess testing techniques, since all of them are, to various extents, based on heuristics and simplifying assumptions [4]. Although the aim of empirical software engineering is to provide evidence for selecting the appropriate technology, results from empirical research only rarely seem to find their way to industry practitioners. For years, it has been argued that decision makers in industry hesitate to introduce new technologies if evidence about their benefits and risks is not available, not convincing [5], not communicated in the right language [6], or lacking in relevance and rigor [7]. The same holds true for empirical evidence about the applicability and practicality of software testing techniques. Although we currently have a multitude of software testing techniques that can reveal faults, we do not have adequate practical information about them. Despite the number of studies conducted to evaluate these techniques, we are still without realistic and generalized results. The majority of experimental studies have significant limitations with respect to the programs, subjects and methods used, and the studies conducted so far vary significantly in the frameworks used and the parameters they take into consideration. Researchers therefore stress replication of these studies under a common set of guidelines, as proposed by [8] [9] [10], since meaningful results cannot be deduced from a single experiment [11]. However, most experiments still use different experiment plans and reporting mechanisms, which makes aggregation difficult if not impossible, as comparing non-identical replications has always been a complex issue. In addition, such replications are time consuming, especially when industry needs immediate, strong empirical evidence regarding the applicability conditions of testing techniques. To bridge this communication gap between researchers and industry professionals, we present an informal proposal to carry out testing technique experimentation on a large scale, under a unified framework, in an open-source fashion, so that we can arrive at realistic and generalized results in a shorter span of time.

2. RESEARCH STATUS AND PROBLEM DESCRIPTION
Over the years, the quality of the average empirical study in software engineering has been increasing. In particular, there have been several empirical studies on software testing techniques [12] [13] [14], and specific guidelines and introductions on how to conduct experiments in software engineering are discussed in [15] [16]. A large number of empirical studies, including replications, have been conducted to study software testing techniques empirically. However, summarizing the results of the studies conducted so far to evaluate software testing techniques, we observed the following:

1. Most of the information available about the techniques is focused on how to apply them, not on their applicability conditions: practical information such as suitability, effectiveness, efficiency, strengths and weaknesses. The absolute benefits and drawbacks of each of the techniques are still quite unknown or at best unclear [12].

2. Although some of the results extracted from the experiments conducted so far are interesting, they indicate that research in this area has focused on specific questions and hypotheses rather than on building a larger picture of the available techniques and when to select them. The experimental results are conflicting, the experiments lack a formal foundation, and the studies differ considerably in the parameters they take into consideration [13].

3. The experimental studies on software testing techniques conducted so far do not provide a basis for making strong conclusions about different software testing techniques. The results are largely inconclusive and do not reveal much information; as a result, we cannot generalize the results of software testing technique evaluation experiments. Recent surveys comparing various software testing techniques also conclude that further empirical research in software testing is needed, and that much more replication has to be conducted before general results can be stated [12] [13].

Many studies have indeed been replicated several times by researchers. However, there are still many issues with those replicated studies:

1. Experimental replications are not carried out under a common framework. Even though the ultimate goal of each replication is to contribute to the knowledge base of software testing techniques, the planning and execution of the replications actually deviate from that goal, because each replication uses different experiment plans, and different data collection and reporting mechanisms, as decided by the researchers who execute it.

2. There is no standard lab package for experiments. A few lab packages exist, such as the one built by K&L for comparing three defect detection techniques; however, such experimental packages do not divulge all the details that are relevant for replication, making such knowledge a subjective matter. In addition, we rely on experiment packages that hardly reflect reality. As such, relevant information about an experiment, needed either for replication or for aggregation with other experiments, is not fully available or usable. Researchers mostly alter or build packages at their own discretion, including whatever information they consider appropriate for a replication.

3. There is no common result reporting framework. The results of each replication are independently gathered, analyzed and reported differently, according to the researcher in charge of the experiment. Aggregation of results is not strictly kept in mind during replication of experiments. This hampers the process of creating a unified view of all the results.

Taking into account all these problems, we believe that evaluation needs to be carried out in an effective and efficient way, so that it benefits research as well as industry.

3. PROPOSED EVALUATION APPROACH
Empirical studies on large scale artifacts, within real world contexts, and replicated by several professional testers are needed to attain generalized and valid results. However, when researchers perform replications of experiments, they should keep in mind that there is always a need to combine (aggregate) the results, not only to see similarities or differences but to abstract a common (global) result representative of all the experiments. Aggregation (in software engineering terms) is synthesis: organizing, summarizing and generalizing the results of multiple experiments to generate pieces of knowledge or evidence that can become facts or be used in real world software development [11] (a toy illustration of pooling results across replications is given below). In the case of software testing technique evaluation, aggregation of results has not been fruitful, because the replications (whether similar or dissimilar) are not standardized and carried out properly. In general, software engineering does not appear to be well suited to such replications, because it works with complex, experimentally immature contexts [17]. Context differences usually oblige SE experimenters to adapt experiments for replication. As key experimental conditions are as yet unknown, slight changes in replications have led to differences in the results that prevent verification. There is no standard agreement yet on terminology, typology, purposes, operation and other replication issues [11], and there are still many uncertainties about how to proceed with replications of SE experiments. Should replicators reuse the baseline experiment materials? What elements of the experimental configuration can be changed for the experiment to still be considered a replication rather than a new experiment? [18]. A possible way to overcome such difficult challenges could be to combine the efforts of several research groups currently conducting separate experimentation, joining forces to carry out an experiment on a large scale using a common benchmark/framework. A common standard is required to standardize the evaluation process of such experiments. We can also factorize a large experiment into pieces distributed among several laboratories [19]. The idea is similar to launching an "Open Experiment" initiative, in the way that some Open Source projects have been successfully conducted. However, not all open-source projects are successful, and experimentation, to be credible, needs very careful planning and control. We therefore propose that, in order to make the experimentation effective, testing technique experimentation should be carried out on a large scale, under a unified framework, in an open-source fashion, and in a carefully planned and well-coordinated manner, so that realistic and generalized results are obtained in a shorter span of time. To ensure that experimentation at all locations is carried out using the same framework, and uses the same programs, techniques and guidelines for subjects and other elements, we should plan the experiment in advance and make the framework and lab package freely available to all without hiding any details. The framework and experimental setup can be decided in advance and implemented later at different locations by different groups, all using the same framework and lab package. In fact, the experiment can also be carried out by different people at different sites at the same time, using the same framework, based on the concept of remote labs.
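As a toy illustration of the pooling idea above (and only an illustration: the sites, numbers, field names and the simple sample-size-weighted mean used here are assumptions made for exposition, and a real aggregation would rely on proper meta-analytic methods with effect sizes and heterogeneity analysis), a central agency might combine the fault-detection effectiveness reported by several replications as follows:

# Illustrative sketch only: pooling fault-detection effectiveness across
# replications with a sample-size-weighted mean. All data are hypothetical.
from dataclasses import dataclass
from typing import List

@dataclass
class ReplicationResult:
    site: str              # hypothetical replication site identifier
    technique: str         # e.g. "branch testing", "code reading"
    subjects: int          # number of participants in the replication
    effectiveness: float   # fraction of seeded faults detected (0..1)

def pooled_effectiveness(results: List[ReplicationResult], technique: str) -> float:
    """Sample-size-weighted mean effectiveness for one technique."""
    relevant = [r for r in results if r.technique == technique]
    total_subjects = sum(r.subjects for r in relevant)
    if total_subjects == 0:
        raise ValueError(f"no replications reported for {technique!r}")
    return sum(r.effectiveness * r.subjects for r in relevant) / total_subjects

# Hypothetical data from three replications of the same baseline experiment.
results = [
    ReplicationResult("site-A", "branch testing", subjects=24, effectiveness=0.55),
    ReplicationResult("site-B", "branch testing", subjects=40, effectiveness=0.61),
    ReplicationResult("site-C", "branch testing", subjects=16, effectiveness=0.48),
]

print(f"pooled effectiveness: {pooled_effectiveness(results, 'branch testing'):.2f}")

Even such a simplistic pooling is only meaningful when the replications report comparable measures collected under the same framework, which is precisely the motivation for the approach proposed in this section.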


Our proposed approach is as follows:

1. Plan and define a standard framework for testing technique evaluation at the overall SE community level. This framework defines the experiment plan, including the experimental design, the defect-detection techniques to be evaluated, the programs to be used, subject characteristics (including selection criteria) and other allied elements. Besides that, we should also define the experiment procedures, the data collection and analysis standards, and the validation procedures. This planning can be done at an event such as a conference like ESEM or EASE, or a workshop like 'Conducting Empirical Studies in Industry', by a group of stakeholders in empirical software engineering, especially those dealing with testing technique evaluation, including people from groups such as ISERN, ESERNET, SERG, the Fraunhofer Center for Experimental Software Engineering, and others.

2. Designate a committee or group that will coordinate and monitor the overall experimentation process, including the design and development of the experiment plan, the lab packages, and the mechanism for collecting and aggregating the results obtained from experiments carried out by different researchers.

3. Implement the full experiment as a series of sub-experiments carried out at different locations, at the same or different times, using the framework and artifacts defined by the central or controlling agency, as shown in Figure 1.

4. Report the standardized results in the prescribed format to the central controlling agency, which continuously monitors and aggregates the results obtained from the n experiments to present a larger picture of the applicability and other conditions of the testing techniques (a minimal sketch of such a standardized report is given below).

In this approach, we go back to basics and determine what exactly we want in the long term. The key to this approach is that we first plan and gather requirements at the overall top level and establish the architecture for the complete experiment. We can then carry out experiments at different locations using the approach stated above. An experiment, in this approach, is a subset of the complete experiment. The overall result obtained, therefore, will be the integration of the results obtained from the individual experiments carried out at the various locations. Before the execution of an experiment, we have to make sure that everything conforms: every element must mean the same thing in every sub-experiment carried out at different locations by different researchers. This ensures that the results of the various experiments (replications) are comparable, which will certainly help us in building a large and realistic knowledge base of testing techniques. The approach is shown in Figure 1. We should also ensure that the subjects, programs and other elements are true representatives of industry, so that the results obtained can be applied by industry practitioners. We need to build standardized and better laboratory packages that represent actual software engineering practice; carrying out experiments on such packages will help in deriving realistic results. In addition, we need to realize that success is not limited to replications that produce results similar to the original experiment; a replication that produces results different from those of the original experiment can also be viewed as successful [20]. In order to make different replications successful and effective, we need to alter and analyze different variables of the experiment. These variations should also be defined clearly in the framework, so that they do not create problems in the execution of the experiment and the aggregation of results later on.
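Below is a minimal, purely hypothetical sketch of what such a standardized result report from one sub-experiment could look like in machine-readable form. Every field name and value here is an assumption introduced for illustration; the actual schema would be defined centrally as part of the shared framework and lab package.

# Purely illustrative: one sub-experiment's standardized report, as it might be
# submitted to the central agency. All field names and values are hypothetical.
import json

REQUIRED_FIELDS = {
    "experiment_id", "site", "framework_version",
    "technique", "program", "subjects", "faults_seeded",
    "faults_detected", "effort_person_hours",
}

report = {
    "experiment_id": "open-exp-001",   # identifier of the overall experiment
    "site": "site-A",                  # replicating laboratory
    "framework_version": "1.0",        # version of the shared framework/lab package
    "technique": "branch testing",     # defect-detection technique under evaluation
    "program": "sample-program",       # program taken from the shared lab package
    "subjects": 24,                    # number of participants
    "faults_seeded": 20,               # faults seeded in the program
    "faults_detected": 11,             # faults found by the subjects
    "effort_person_hours": 36.5,       # total effort spent
}

def validate(report: dict) -> None:
    """Reject reports that omit fields required by the (hypothetical) schema."""
    missing = REQUIRED_FIELDS - report.keys()
    if missing:
        raise ValueError(f"report is missing required fields: {sorted(missing)}")

validate(report)
print(json.dumps(report, indent=2))  # what a site would send to the central agency

A fixed schema of this kind would let the coordinating agency check conformance mechanically before attempting any aggregation of results across sites.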

Figure 1 – Proposed Experiment Approach

4. CONCLUSION AND FUTURE WORK
This paper presents an initial, informal proposal for evaluating software testing techniques in a large scale experiment carried out in an open-source fashion. The goal of this paper is to focus our efforts on closing the huge gap between research and industry as soon as possible, so that research results are put into practice. We should realize that even though replications are very promising, they need a good infrastructure and very careful planning and control. In this paper we have only deliberated on what should be done, without specifying exactly how it could be done. The feasibility of this proposal should be checked by weighing the pros and cons, and the possible limitations, of carrying out experimentation in the proposed way. We also need to understand that, in order to succeed, we need a very strong collaboration between research and industry, so that we come to know what industry really requires.


5. REFERENCES
[1] Empirical Software Engineering: Applied Software Engineering Research and Best Industry Practice. http://www.cs.umd.edu/~basili/EMSE-leaflet.pdf. Last accessed 14 Jan 2014.

[2] Sjoberg, D. I., Dyba, T., & Jorgensen, M. (2007, May). The future of empirical methods in software engineering research. In Future of Software Engineering, 2007. FOSE '07 (pp. 358-378). IEEE.

[3] Condori-Fernandez, N., & Vos, T. (2013). PANEL: Successful Empirical Research in Software Testing with Industry. In Proceedings of the Industrial Track of the Conference on Advanced Information Systems Engineering 2013 (CAiSE'13), co-located with the 25th International Conference on Advanced Information Systems Engineering, València, Spain, June 21, 2013.

[4] Briand, L. C. (2007, September). A critical analysis of empirical research in software testing. In Empirical Software Engineering and Measurement, 2007. ESEM 2007. First International Symposium on (pp. 1-8). IEEE.

[5] Pfleeger, S. L., & Menezes, W. (2000). Marketing technology to software practitioners. IEEE Software, 17(1), 27-33.

[6] Glass, R. L. (2006). The Academe/Practice Communication Chasm - Position Paper. Dagstuhl Seminar on Empirical SE, 27.06-30.06.06 (06262), Participant Materials. http://www.dagstuhl.de/Materials/Files/06/06262/06262.GlassRobert.ExtAbstract!.pdf. Accessed 12 January 2013.

[7] Ivarsson, M., & Gorschek, T. (2011). A method for evaluating rigor and industrial relevance of technology evaluations. Empirical Software Engineering, 16(3), 365-395.

[8] Jedlitschka, A., Ciolkowski, M., & Pfahl, D. (2008). Reporting experiments in software engineering. In Guide to Advanced Empirical Software Engineering (pp. 201-228). Springer London.

[9] Carver, J. C. (2010, May). Towards reporting guidelines for experimental replications: A proposal. In RESER 2010: Proceedings of the 1st International Workshop on Replication in Empirical Software Engineering Research, Cape Town, South Africa (Vol. 4).

[10] Juristo, N., & Gómez, O. S. (2012). Replication of software engineering experiments. In Empirical Software Engineering and Verification (pp. 60-88). Springer Berlin Heidelberg.

[11] Jedlitschka, A., & Ciolkowski, M. (2004, August). Towards evidence in software engineering. In Empirical Software Engineering, 2004. ISESE '04. Proceedings. 2004 International Symposium on (pp. 261-270). IEEE.

[12] Juristo, N., Moreno, A. M., & Vegas, S. (2004). Reviewing 25 years of testing technique experiments. Empirical Software Engineering, 9(1-2), 7-44.

[13] Juristo, N., Moreno, A., Vegas, S., & Shull, F. (2009). A look at 25 years of data. IEEE Software, 26(1), 15-17.

[14] Farooq, S. U., & Quadri, S. M. K. (2013). Empirical Evaluation of Software Testing Techniques - Need, Issues and Mitigation. Software Engineering - An International Journal, 41-51.

[15] Kitchenham, B. A., Pfleeger, S. L., Pickard, L. M., Jones, P. W., Hoaglin, D. C., El Emam, K., & Rosenberg, J. (2002). Preliminary guidelines for empirical research in software engineering. IEEE Transactions on Software Engineering, 28(8), 721-734.

[16] Wohlin, C., Runeson, P., Höst, M., Ohlsson, M. C., Regnell, B., & Wesslén, A. (2012). Experimentation in Software Engineering. Springer Publishing Company, Incorporated.

[17] Juristo, N., Vegas, S., Solari, M., Abrahao, S., & Ramos, I. (2012). A process for managing interaction between experimenters to get useful similar replications. Information and Software Technology.

[18] Juristo, N. (2013, October). Towards understanding replication of software engineering experiments. In Empirical Software Engineering and Measurement, 2013 ACM/IEEE International Symposium on (p. 4). IEEE.

[19] Bertolino, A. (2004). The (im)maturity level of software testing. ACM SIGSOFT Software Engineering Notes, 29(5), 1-4.

[20] Shull, F. J., Carver, J. C., Vegas, S., & Juristo, N. (2008). The role of replications in empirical software engineering. Empirical Software Engineering, 13(2), 211-218.
