To achieve the goal of test efficiency, a set of criteria having an impact on the test cases needs to be identified. The analysis of several industrial case studies, as well as the state of the art reviewed in this thesis, indicates that the dependency between integration test cases is one such criterion, with a direct impact on test execution results. Other criteria of interest include requirement coverage and test execution time. In this doctoral thesis, we introduce, apply and evaluate a set of approaches and tools for test execution optimization at the industrial integration testing level in embedded software development. Furthermore, ESPRET (Estimation and Prediction of Execution Time) and sOrTES (Stochastic Optimizing of Test Case Scheduling) are our proposed supportive tools for predicting the execution time and scheduling manual integration test cases, respectively. All methods and tools proposed in this thesis have been evaluated on industrial testing projects at Bombardier Transportation (BT) in Sweden. As a result of the scientific contributions made in this doctoral thesis, employing the proposed approaches has reduced redundant test execution failures by up to 40% with respect to the current test execution approach at BT. Moreover, an increase in requirements coverage of up to 9.6% is observed at BT. In summary, the application of the proposed approaches in this doctoral thesis has been shown to give considerable gains by optimizing test schedules in system integration testing of embedded software development.
Mälardalen University Doctoral Dissertation 281 Sahar Tahvili MULTI-CRITERIA OPTIMIZATION OF SYSTEM INTEGRATION TESTING
Optimizing the software testing process has received much attention over the last few decades. Test optimization is typically seen as a multi-criteria decision making problem. One aspect of test optimization involves test selection, prioritization and execution scheduling. Having an efficient test process can result in the satisfaction of many objectives such as cost and time minimization. It can also lead to on-time delivery and better quality of the final software product.
Sahar Tahvili is a researcher at RISE (Research Institutes of Sweden) and a member of the software testing laboratory at Mälardalen University. In 2014, she received an M.Phil. in Applied Mathematics with an emphasis on optimization from Mälardalen University. Her research focuses on advanced methods for testing complex software-intensive systems, the design of decision support systems (DSS), mathematical modelling and optimization. She also has a background in Aeronautical Engineering. Since October 2016, Sahar has held a licentiate degree in software engineering from Mälardalen University, with the thesis titled "A Decision Support System for Integration Test Selection".
ISBN 978-91-7485-414-5 ISSN 1651-4238
2018
Address: P.O. Box 883, SE-721 23 Västerås, Sweden
Address: P.O. Box 325, SE-631 05 Eskilstuna, Sweden
E-mail: [email protected]  Web: www.mdh.se
Multi-Criteria Optimization of System Integration Testing Sahar Tahvili
Mälardalen University Press Dissertations No. 281
MULTI-CRITERIA OPTIMIZATION OF SYSTEM INTEGRATION TESTING
Sahar Tahvili 2018
School of Innovation, Design and Engineering
Copyright © Sahar Tahvili, 2018 ISBN 978-91-7485-414-5 ISSN 1651-4238 Printed by E-Print AB, Stockholm, Sweden
Mälardalen University Press Dissertations No. 281
MULTI-CRITERIA OPTIMIZATION OF SYSTEM INTEGRATION TESTING
Sahar Tahvili
Academic dissertation to be publicly defended, for the degree of Doctor of Technology in Computer Science at the School of Innovation, Design and Engineering, on Friday, 21 December 2018, at 13.15 in Lambda, Mälardalen University, Västerås. Faculty opponent: Professor Franz Wotawa, Graz University of Technology.
School of Innovation, Design and Engineering
Abstract
Optimizing the software testing process has received much attention over the last few decades. Test optimization is typically seen as a multi-criteria decision making problem. One aspect of test optimization involves test selection, prioritization and execution scheduling. Having an efficient test process can result in the satisfaction of many objectives such as cost and time minimization. It can also lead to on-time delivery and a better quality of the final software product. To achieve the goal of test efficiency, a set of criteria having an impact on the test cases needs to be identified. The analysis of several industrial case studies, as well as the state of the art reviewed in this thesis, indicates that the dependency between integration test cases is one such criterion, with a direct impact on test execution results. Other criteria of interest include requirement coverage and test execution time. In this doctoral thesis, we introduce, apply and evaluate a set of approaches and tools for test execution optimization at the industrial integration testing level in embedded software development. Furthermore, ESPRET (Estimation and Prediction of Execution Time) and sOrTES (Stochastic Optimizing of Test Case Scheduling) are our proposed supportive tools for predicting the execution time and scheduling manual integration test cases, respectively. All methods and tools proposed in this thesis have been evaluated on industrial testing projects at Bombardier Transportation (BT) in Sweden. As a result of the scientific contributions made in this doctoral thesis, employing the proposed approaches has reduced redundant test execution failures by up to 40% with respect to the current test execution approach at BT. Moreover, an increase in requirements coverage of up to 9.6% is observed at BT. In summary, the application of the proposed approaches in this doctoral thesis has been shown to give considerable gains by optimizing test schedules in system integration testing of embedded software development.
ISBN 978-91-7485-414-5 ISSN 1651-4238
Abstract
Optimizing the software testing process has received much attention over the last few decades. Test optimization is typically seen as a multi-criteria decision making problem. One aspect of test optimization involves test selection, prioritization and execution scheduling. Having an efficient test process can result in the satisfaction of many objectives such as cost and time minimization. It can also lead to on-time delivery and a better quality of the final software product. To achieve the goal of test efficiency, a set of criteria having an impact on the test cases needs to be identified. The analysis of several industrial case studies, as well as the state of the art reviewed in this thesis, indicates that the dependency between integration test cases is one such criterion, with a direct impact on test execution results. Other criteria of interest include requirement coverage and test execution time. In this doctoral thesis, we introduce, apply and evaluate a set of approaches and tools for test execution optimization at the industrial integration testing level in embedded software development. Furthermore, ESPRET (Estimation and Prediction of Execution Time) and sOrTES (Stochastic Optimizing of Test Case Scheduling) are our proposed supportive tools for predicting the execution time and scheduling manual integration test cases, respectively. All methods and tools proposed in this thesis have been evaluated on industrial testing projects at Bombardier Transportation (BT) in Sweden. As a result of the scientific contributions made in this doctoral thesis, employing the proposed approaches has reduced redundant test execution failures by up to 40% with respect to the current test execution approach at BT. Moreover, an increase in requirements coverage of up to 9.6% is observed at BT. In summary, the application of the proposed approaches in this doctoral thesis has been shown to give considerable gains by optimizing test schedules in system integration testing of embedded software development.
Keywords: Software Testing, Optimization, Integration Testing, Decision Support System, Dependency, Test Scheduling, Requirement Coverage
Sammanfattning (Swedish Summary)
Optimizing and improving the software testing process has received much attention over the last few decades. Test optimization is typically viewed as a multi-criteria decision support problem. Aspects of test optimization include test selection, prioritization and scheduling. Having an efficient way of executing the various test cases can satisfy many objectives, such as cost and time minimization. It can also lead to on-time delivery and better quality of the final software product. To achieve this goal, a set of criteria that have an impact on the test cases must be identified. The analysis in this thesis of several industrial case studies and state-of-the-art methods indicates that the dependency between integration test cases is a critical criterion, with a direct impact on the test results. Other important criteria are requirement coverage and test execution time. In this doctoral thesis, we introduce, apply and evaluate a set of methods and tools for test optimization at the industrial integration level of embedded software. In addition, ESPRET (Estimation and Prediction of Execution Time) and sOrTES (Stochastic Optimizing of Test Case Scheduling) are our supportive tools for predicting execution time and for scheduling manual integration testing. All proposed methods and tools in this thesis have been evaluated on industrial testing projects at Bombardier Transportation (BT) in Sweden. Finally, applying the proposed methods in this doctoral thesis has led to improvements in the form of a reduction of failures from redundant test cases of up to 40% compared with the current approach at BT, and an increase in requirement coverage of up to 9.6%.
Keywords: Software Testing, Optimization, Integration Testing, Decision Support System, Dependency, Test Scheduling, Requirement Coverage
Populärvetenskaplig sammanfattning (Popular Science Summary)
The role of software in the progress of society cannot be overlooked, and it has a direct impact on our daily lives. Improving the quality of software products has become increasingly important for software companies over the past decades. Achieving high-quality software products requires balancing the effort spent on design and verification activities during the development process. Software testing therefore becomes an important tool that helps satisfy end users' needs and maintain a high quality of the final product. Quality assurance of software products drives large investments in software testing research. Software testing is performed manually or automatically, and the transition to automated testing has quickly become widespread in industry. Since automated testing today cannot fully exploit human intuition, inductive reasoning and inference, manual testing still plays an important role. Testing is often performed at several levels, such as unit, integration, system and acceptance testing. The appropriate testing approach (either manual or automated) depends on several parameters, such as the quality requirements, the size and complexity of the product, and the testing level. Integration testing is the level of the testing process where individual software modules are combined and tested as a group, and it can often be the most complex level. Integration testing is usually performed after unit testing, when all modules have been tested and approved separately. To test a product manually, a set of test case specifications must be created. A test case specification textually describes a scenario and how the product should behave given specified input parameters. Usually, a large set of test cases is required to test a product. Executing all test cases for a product manually requires time and resources. Therefore, test selection, prioritization and scheduling have received much attention in the software testing domain. In this doctoral thesis, we propose several optimization techniques for selecting, prioritizing and scheduling manual test cases for execution. All proposed optimization approaches in this thesis have been evaluated on industrial testing projects at Bombardier Transportation (BT) in Sweden.
In memory of Professor Maryam Mirzakhani (1977–2017), the first and only woman to win the Fields Medal in mathematics.
There’s just one life to live, there’s no time to wait, to waste. Josh Alexander
Acknowledgments
I would like to express my sincere gratitude to my main supervisor Markus Bohlin for the continuous support of my Ph.D. studies and related research, and for his patience, motivation, and immense knowledge. A very special thank you goes to my assistant supervisors: Wasif Afzal, whose guidance helped me throughout the research and writing of this thesis, and Mehrdad Saadatmand, for all his help and support. I have learned so much from all of you, both personally and professionally; working with you made me grow as a researcher. I am very grateful to my former supervisors Daniel Sundmark, Stig Larsson, Sergei Silvestrov, Tofigh Allahviranloo and Jonas Biteus; I have been extremely lucky to have supervisors who cared so much about my work and responded to my questions and queries promptly. I would also like to thank my additional co-authors Leo Hatvani, Rita Pimentel and Michael Felderer, and my master thesis students Sharvathul Hasan Ameerjan, Marcus Ahlberg and Eric Fornander; working with you is a great pleasure. Thanks also to Narsis Aftab Kiani at Karolinska Institute and Mohammad Mehrabi for brainstorming and effective discussions. My deepest gratitude goes to my family, Mohammad, Shabnam, Saeed, Sara and Sepeher Tahvili, and my friends Shahab Darvish, Neda Kazemie, Lotta Karlsson, Jonas Österberg and Linnea Siem, who have always been there for me no matter what. Without them I could have never reached this far. I am thankful to Razieh Matini, Iraj and Siavash Mesdaghi; I consider myself extremely blessed to be a part of your family. There is no way I could ever thank you enough for being my second family in Sweden. My sincere thanks also go to my manager, Helena Jerregard, who has always supported me throughout the work on this thesis. RISE SICS is a great workplace that I very much enjoy being part of. Furthermore, thanks to all my
colleagues at RISE SICS Västerås: Malin Rosqvist, Petra Edoff, Zohreh Ranjbar, Pasqualina Potena, Linnéa Svenman Wiker, Björn Löfvendahl, Daniella Magnusson, Markus Borg, Gunnar Widforss, Tomas Olsson, Kristian Sandström, Anders Wikström, Stefan Cedergren, Joakim Fröberg, Alvaro Aranda Munoz, Niclas Ericsson, Daniel Flemström, Martin Joborn, Cecilia Hyrén, Ksenija Komazec, Petter Wannerberg, Backer Sultan, Mats Tallfors, Peter Wallin, Thomas Nessen, Elsa Kosmack Vaara, Helena Junegard, Jawad Mustafa and also Barrett Michael Sauter for proofreading this doctoral thesis. A special thanks to Ola Sellin, Stefan Persson, Kjell Bystedt, Anders Skytt, Johan Zetterqvist and the testing team at Bombardier Transportation, Västerås, Sweden. In closing, I would like to express my sincere appreciation to Mariana Cook, a fine arts photographer with a kind heart, who granted the permission to use the image of Maryam Mirzakhani for the printed version of this doctoral thesis. The work presented in this doctoral thesis has been funded by RISE SICS, ECSEL and VINNOVA (through projects MegaM@RT2, XIVT, TESTOMAT and IMPRINT) and also through the ITS-EASY program at Mälardalen University. Sahar Tahvili November 2018 Stockholm
List of Publications Studies Included in the Doctoral Thesis1 Study A. Dynamic Test Selection and Redundancy Avoidance Based on Test Case Dependencies, Sahar Tahvili, Mehrdad Saadatmand, Stig Larsson, Wasif Afzal, Markus Bohlin and Daniel Sundmark, The 11th Workshop on Testing: Academia-Industry Collaboration, Practice and Research Techniques (TAIC PART’16), 2016, IEEE. Study B. Cost-Benefit Analysis of Using Dependency Knowledge at Integration Testing, Sahar Tahvili, Markus Bohlin, Mehrdad Saadatmand, Stig Larsson, Wasif Afzal and Daniel Sundmark, The 17th International Conference on Product-Focused Software Process Improvement (PROFES’16), 2016, Springer. Study C. ESPRET: a Tool for Execution Time Estimation of Manual Test Cases, Sahar Tahvili, Wasif Afzal, Mehrdad Saadatmand, Markus Bohlin and Sharvatul Hasan Ameerjan, Journal of Systems and Software (JSS), volume 146, pages 26-41, 2018, Elsevier. Study D. Functional Dependency Detection for Integration Test Cases, Sahar Tahvili, Marcus Ahlberg, Eric Fornander, Wasif Afzal, Mehrdad Saadatmand and Markus Bohlin, Companion of the 18th IEEE International Conference on Software Quality, Reliability, and Security (QRS’18), 2018, IEEE. 1 The included studies have been reformatted to comply with the doctoral thesis layout and minor typos have been corrected and marked accordingly.
Study E. Automated Functional Dependency Detection Between Test Cases Using Text Semantic Similarity, Sahar Tahvili, Leo Hatvani, Michael Felderer, Wasif Afzal and Markus Bohlin, Submitted to The 12th IEEE International Conference on Software Testing, Verification and Validation (ICST'19), 2019, IEEE. Study F. sOrTES: A Supportive Tool for Stochastic Scheduling of Manual Integration Test Cases, Sahar Tahvili, Rita Pimentel, Wasif Afzal, Marcus Ahlberg, Eric Fornander and Markus Bohlin, Journal of IEEE Access, 2018, In revision.
Additional Peer-Reviewed Publications, not Included in the Doctoral Thesis
Licentiate Thesis
1. A Decision Support System for Integration Test Selection, Sahar Tahvili, Licentiate Thesis, ISSN 1651-9256, ISBN 978-91-7485-282-0, October 2016, Mälardalen University.
Journal
1. On the global solution of a fuzzy linear system, Tofigh Allahviranloo, Arjan Skuka, Sahar Tahvili, Journal of Fuzzy Set Valued Analysis, volume 14, pages 1-8, 2014, ISPACS.
Conferences & Workshops
1. Solving complex maintenance planning optimization problems using stochastic simulation and multi-criteria fuzzy decision making, Sahar Tahvili, Sergei Silvestrov, Jonas Österberg, Jonas Biteus, The 10th International Conference on Mathematical Problems in Engineering, Aerospace and Sciences (ICNPAA'14), 2014, AIP. 2. A Fuzzy Decision Support Approach for Model-Based Trade-off Analysis of Non-Functional Requirements, Mehrdad Saadatmand, Sahar
Tahvili, The 12th International Conference on Information Technology: New Generations (ITNG'15), 2015, IEEE. 3. Multi-Criteria Test Case Prioritization Using Fuzzy Analytic Hierarchy Process, Sahar Tahvili, Mehrdad Saadatmand, Markus Bohlin, The 10th International Conference on Software Engineering Advances (ICSEA'15), 2015, IARIA. 4. Towards Earlier Fault Detection by Value-Driven Prioritization of Test Cases Using Fuzzy TOPSIS, Sahar Tahvili, Wasif Afzal, Mehrdad Saadatmand, Markus Bohlin, Daniel Sundmark, Stig Larsson, The 13th International Conference on Information Technology: New Generations (ITNG'16), 2016, Springer. 5. An Online Decision Support Framework for Integration Test Selection and Prioritization (Doctoral Symposium), Sahar Tahvili, The 25th International Symposium on Software Testing and Analysis (ISSTA'16), 2016, ACM. 6. Towards Execution Time Prediction for Test Cases from Test Specification, Sahar Tahvili, Mehrdad Saadatmand, Markus Bohlin, Wasif Afzal, Sharvathul Hasan Ameerjan, The 43rd Euromicro Conference on Software Engineering and Advanced Applications (SEAA'17), 2017, IEEE. 7. Cluster-Based Test Scheduling Strategies Using Semantic Relationships between Test Specifications, Sahar Tahvili, Leo Hatvani, Michael Felderer, Wasif Afzal, Mehrdad Saadatmand, Markus Bohlin, The 5th International Workshop on Requirements Engineering and Testing (RET'18), 2018, ACM.
Contents

I  Thesis

1  Introduction
   1.1  Thesis Overview

2  Background
   2.1  Software Testing
   2.2  Integration Testing
   2.3  Test Optimization
        2.3.1  Test Case Selection
        2.3.2  Test Case Prioritization
        2.3.3  Test Case Scheduling
   2.4  Multiple-Criteria Decision-Making (MCDM)
        2.4.1  Requirement Coverage
        2.4.2  Execution Time
        2.4.3  Fault Detection Probability
        2.4.4  Test Case Dependencies

3  Research Overview
   3.1  Goal of the Thesis
   3.2  Technical Contributions
        3.2.1  Overview of the Proposed Approach
        3.2.2  Discussion of Individual Contributions
   3.3  Research Process and Methodology

4  Conclusions and Future Work
   4.1  Summary and Conclusion
   4.2  Future Work

Bibliography

II  Included Papers

5  Paper A: Dynamic Test Selection and Redundancy Avoidance Based on Test Case Dependencies
   5.1  Introduction
   5.2  Background and Preliminaries
        5.2.1  Motivating Example
        5.2.2  Main definitions
   5.3  Approach
        5.3.1  Dependency Degree
        5.3.2  Test Case Prioritization: FAHP
        5.3.3  Offline and online phases
   5.4  Industrial Case Study
        5.4.1  Preliminary results of online evaluation
   5.5  Discussion & Future Extensions
        5.5.1  Delimitations
   5.6  Related Work
   5.7  Summary & Conclusion
   Bibliography

6  Paper B: Cost-Benefit Analysis of Using Dependency Knowledge at Integration Testing
   6.1  Introduction
   6.2  Background
   6.3  Decision Support System for Test Case Prioritization
        6.3.1  Architecture and Process of DSS
   6.4  Economic Model
        6.4.1  Return on Investment Analysis
   6.5  Case Study
        6.5.1  Test Case Execution Results
        6.5.2  DSS Alternatives under Study
        6.5.3  ROI Analysis Using Monte-Carlo Simulation
        6.5.4  Sensitivity Analysis
   6.6  Discussion and Threats to Validity
   6.7  Conclusion and Future Work
   Bibliography

7  Paper C: ESPRET: a Tool for Execution Time Estimation of Manual Test Cases
   7.1  Introduction
   7.2  Background and Related Work
   7.3  Description of the Proposed Approach
        7.3.1  Parsing and Historical Data Collection
        7.3.2  The Algorithm for Estimating the Maximum Execution Time
        7.3.3  Regression Analysis for Prediction of the Actual Execution Time
        7.3.4  System Architecture, Implementation and Database Creation
   7.4  Empirical Evaluation
        7.4.1  Unit of Analysis and Procedure
        7.4.2  Case Study Report
        7.4.3  Model Validation
        7.4.4  Model Evaluation Using Unseen Data
   7.5  Threats to Validity
   7.6  Discussion and Future Extensions
   7.7  Conclusion
   Bibliography

8  Paper D: Functional Dependency Detection for Integration Test Cases
   8.1  Introduction
   8.2  Background and Related Work
   8.3  Dependency Detection at Integration Testing
        8.3.1  Basic concepts definitions
        8.3.2  Implemented Method Details
   8.4  Empirical Evaluation
        8.4.1  Unit of Analysis and Procedure
        8.4.2  Case study report and results
   8.5  Discussion and Future Extensions
   8.6  Summary and Conclusion
   8.7  Acknowledgements
   Bibliography

9  Paper E: Automated Functional Dependency Detection Between Test Cases Using Text Semantic Similarity
   9.1  Introduction
   9.2  Background
   9.3  Related Work
   9.4  The proposed Approach
        9.4.1  Feature Vector Generation
        9.4.2  Clustering Feature Vectors
   9.5  Empirical Evaluation
        9.5.1  Industrial Case Study
        9.5.2  Ground Truth
   9.6  Results
        9.6.1  Comparing the Clustering Results with the Ground Truth
        9.6.2  Performance Metric Selection
        9.6.3  Metric Comparison
        9.6.4  Random Undersampling strategy for imbalanced datasets
   9.7  Threats to Validity
   9.8  Discussion and Future Work
   9.9  Conclusion
   Bibliography

10  Paper F: sOrTES: A Supportive Tool for Stochastic Scheduling of Manual Integration Test Cases
    10.1  Introduction
    10.2  Background and Related work
         10.2.1  Test case selection
         10.2.2  Test case prioritization
         10.2.3  Test Case Stochastic Scheduling
         10.2.4  Related Work
    10.3  Proposed Approach
    10.4  sOrTES - Stochastic Optimizing of Test case Scheduling
         10.4.1  The Extraction Phase
         10.4.2  Functional Dependencies Detection
         10.4.3  Requirement Coverage Measurement
         10.4.4  The Scheduling phase
         10.4.5  Model Assumptions and Problem Description
    10.5  Empirical Evaluation
         10.5.1  Unit of Analysis and Procedure
         10.5.2  Case Study Report
    10.6  Performance evaluation
         10.6.1  Performance comparison between sOrTES and Bombardier
         10.6.2  Performance comparison including a History-based test case prioritization approach
    10.7  Threats to Validity
    10.8  Discussion and Future work
    10.9  Conclusion
    Bibliography
I Thesis
Chapter 1
Introduction
The role of software is important in our daily lives and to the progress of society in general. Over the past few decades, improving the quality of software products has become a unique selling point for software companies. Achieving high-quality software products is possible through a balanced integrative approach of design and verification activities during the software development life cycle (SDLC) process [2]. Considering these facts, software testing becomes a major player in the product development process, which can satisfy the end users' needs and also ensure high quality of the final product [2]. Software testing research faces many challenges, such as test effectiveness [3], and in this regard the concept and nature of software testing have changed. The transition from manual to automated testing and continuous changes in the testing procedure have quickly become widespread in industry. However, human intuition, inductive reasoning and inference cannot be fully covered by the current form of test automation, and therefore manual testing still plays a vital role in software testing [4]. Achieving a more effective testing process often comes down to dividing it into several levels, such as the unit, integration, system and acceptance testing levels. Proposing an appropriate testing procedure (either manual or automated) depends on several parameters, such as the quality of requirements, the size and complexity of the product and also the testing level. Integration testing can be considered the most complex testing phase in some practical scenarios. Integrating unit-tested modules and testing their behavior can result in a huge increase in complexity [5]. Verification of the interactions between software modules has the objective of checking the correctness of
several modules of a system under test at least once, which ultimately results in a more complicated testing process compared with other testing levels such as unit and acceptance testing. Having a large set of test cases for testing a product manually at such a complex testing level proves the need for optimization methods in the software testing domain [6]. Test optimization is a multi-faceted topic consisting of writing effective requirement specifications, creating more effective test cases, executing the subset of test cases which is required for a product release, ranking test cases for execution, etc. [7]. In this doctoral thesis we investigate methods to optimize the manual integration testing process through reducing unnecessary redundant test execution failures. Moreover, the proposed optimization approaches in this thesis lead to increased requirement coverage in each testing cycle. While several works advocate test optimization in the software testing domain [8], [9], to the best of our knowledge, this is the first attempt to provide an automated supportive tool which schedules manual integration test cases for execution stochastically. The work most closely related to ours is that by Nardo et al. [10], Elbaum et al. [11] and Yoo and Harman [7], which address several approaches for test case selection, prioritization and scheduling. In an efficient test optimization process, several factors such as the test objective function, the test constraints and also the test case properties and features (e.g. execution time, requirement coverage) need to be identified at an early stage of testing. Testing time minimization is a crucial objective for test optimization, which is always demanded by industry as one of several optimization goals. Decreasing the total testing time can lead to on-time delivery of the final product, thereby leading to cost minimization. Furthermore, maximizing requirement coverage can be considered another promising objective for test optimization. On the other hand, identifying and measuring the test case properties in advance is a challenging task, especially for manual integration test cases, which are usually described by testers in written text and are recorded together in a test specification document. However, since the testing constraints (e.g. allocated testing budget, testing deadline, delivery plan) need to be satisfied by any proposed test optimization approach [12], the testing constraints should be determined before applying the test optimization techniques. To address all of the above-mentioned optimization aspects, the test optimization problem must be converted into a multiple-criteria decision making (MCDM) problem, where each test case property represents a criterion in the optimization model. In this thesis, we propose several MCDM methods for solving the test optimization
problem in the following forms: test case selection, prioritization and test scheduling. All of the proposed approaches in this thesis are applied and evaluated on a set of industrial testing projects at Bombardier Transportation (BT), a large railway equipment manufacturer in Sweden. In summary, applying the proposed optimization approaches in this doctoral thesis has resulted in an increase in requirements coverage of up to 9.6% and, simultaneously, a reduction of redundant test execution failures of up to 40% with respect to the current execution approach at BT.
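To make the multi-criteria view above concrete, the following minimal sketch ranks a handful of test cases by a weighted sum of three of the criteria discussed in this chapter: requirement coverage, execution time and the number of unresolved dependencies. The sketch is written in Python (the language later used for sOrTES); the test cases, weights and scoring function are invented for illustration and are not the models used in Studies A to F.

    # Illustrative only: weighted-sum ranking of hypothetical test cases.
    # The weights and normalization are assumptions for this example, not the
    # multi-criteria models (e.g. FAHP, TOPSIS) applied in the included studies.
    test_cases = {
        # name: (requirements covered, execution time in minutes, unresolved dependencies)
        "TC_brake_01": (4, 30, 0),
        "TC_doors_03": (3, 20, 1),
        "TC_radio_02": (1, 10, 0),
    }

    weights = {"coverage": 0.5, "time": 0.3, "dependencies": 0.2}

    def score(reqs, minutes, deps):
        # Higher coverage is better; shorter time and fewer open dependencies are better.
        return (weights["coverage"] * reqs
                - weights["time"] * (minutes / 10.0)
                - weights["dependencies"] * deps)

    ranking = sorted(test_cases, key=lambda tc: score(*test_cases[tc]), reverse=True)
    print(ranking)  # ['TC_brake_01', 'TC_doors_03', 'TC_radio_02']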
1.1
Thesis Overview
This doctoral thesis consists of two main parts. The first part provides an introduction to the overall work, while Chapter 2 gives background information on the research conducted in this thesis in the area of software testing. Chapter 3 presents an overview of the research, which consists of the thesis goals, the research challenges and corresponding technical and industrial contributions, followed by the research methodology. Finally, Chapter 4 draws the final conclusions and presents an outlook on future work. A collection of six research studies (Study A-to-F) constitutes the second part of the thesis that describes the research results. A summary of the included research studies in this doctoral thesis is as follows: Study A: Dynamic Test Selection and Redundancy Avoidance Based on Test Case Dependencies, Sahar Tahvili, Mehrdad Saadatmand, Stig Larsson, Wasif Afzal, Markus Bohlin and Daniel Sundmark, The 11th Workshop on Testing: Academia-Industry Collaboration, Practice and Research Techniques (TAIC PART’16), 2016, IEEE. Abstract: Prioritization, selection and minimization of test cases are wellknown problems in software testing. Test case prioritization deals with the problem of ordering an existing set of test cases, typically with respect to the estimated likelihood of detecting faults. Test case selection addresses the problem of selecting a subset of an existing set of test cases, typically by discarding test cases that do not add any value in improving the quality of the software under test. Most existing approaches for test case prioritization and selection suffer from one or more drawbacks. For example, to a large extent, they utilize static analysis of code for that purpose, making them unfit for higher levels of testing such as integration testing. Moreover, they do not exploit the
possibility of dynamically changing the prioritization or selection of test cases based on the execution results of prior test cases. Such dynamic analysis allows for discarding test cases that do not need to be executed and are thus redundant. This paper proposes a generic method for prioritization and selection of test cases in integration testing that addresses the above issues. We also present the results of an industrial case study where initial evidence suggests the potential usefulness of our approach in testing a safety-critical train control management subsystem. Statement of Contribution: The first author is the main contributor of the study focusing on test case selection and prioritization based on dependency between manual integration test cases and other testing criteria; co-authors helped in study design and in writing related work. Study B: Cost-Benefit Analysis of Using Dependency Knowledge at Integration Testing, Sahar Tahvili, Markus Bohlin, Mehrdad Saadatmand, Stig Larsson, Wasif Afzal and Daniel Sundmark, The 17th international conference on Product-Focused Software Process Improvement (PROFES’16), 2016, Springer. Abstract: In software system development, testing can take considerable time and resources, and there are numerous examples in the literature of how to improve the testing process. In particular, methods for selection and prioritization of test cases can play a critical role in using testing resources efficiently. This paper focuses on the problem of selecting and ordering of integration-level test cases. Integration testing is performed to evaluate the correctness of several units in composition. Furthermore, for reasons of both effectiveness and safety, many embedded systems are still tested manually. To this end, we propose a process for ordering and selecting test cases based on the test results of previously executed test cases, which is supported by an online decision support system. To analyze the economic efficiency of such a system, a customized return on investment (ROI) metric tailored for system integration testing is introduced. Using data collected from the development process of a large-scale safety-critical embedded system, we perform Monte Carlo simulations to evaluate the expected ROI of three variants of the proposed new process. The results show that our proposed decision support system is beneficial in terms of ROI at system integration testing and thus qualifies as an important element in improving the integration testing process. Statement of Contribution: The first author is the main contributor of
this study focusing on both theoretical and industrial results, with the other co-authors having academic and industrial advisory roles. The simulation part was primarily the contribution of the second author. The first author developed the models, the concept, and also conducted the industrial case study. However, the writing process was an iterative contribution of all authors. Study C: ESPRET: a Tool for Execution Time Estimation of Manual Test Cases, Sahar Tahvili, Wasif Afzal, Mehrdad Saadatmand, Markus Bohlin and Sharvatul Hasan Ameerjan, Journal of Systems and Software (JSS), volume 146, pages 26-41, 2018, Elsevier. Abstract: Manual testing is still a predominant and important approach for validation of computer systems, particularly in certain domains such as safety-critical systems. Knowing the execution time of test cases is important when performing test scheduling, prioritization and progress monitoring. In this work, we present, apply and evaluate ESPRET (EStimation and PRediction of Execution Time) as our tool for estimating and predicting the execution time of manual test cases based on their test specifications. Our approach works by extracting timing information for various steps in manual test specification. This information is then used to estimate the maximum time for test steps that have textual specifications but have not been previously executed. As part of our approach, natural language parsing of the specifications is performed to identify word combinations to check whether existing timing information on various test steps is already available or not. Since executing test cases on the several machines may take varying amounts of time, a set of regression models are used to predict the actual execution time for test cases. Finally, an empirical evaluation of the approach and tool has been performed on a railway use case at Bombardier Transportation (BT) in Sweden. Statement of Contribution: The first author is the main contributor of the study, with the co-authors having academic and industrial advisory roles. The first author developed the prediction models, cross validation, the concept, and performed the industrial and also the evaluation case study. The last author implemented ESPRET as an automated supportive tool. Study D: Functional Dependency Detection for Integration Test Cases, Sahar Tahvili, Marcus Ahlberg, Eric Fornander, Wasif Afzal, Mehrdad Saadatmand and Markus Bohlin, Companion of the 18th IEEE International Conference on Software Quality, Reliability, and Security (QRS’18), 2018, IEEE.
Abstract: This paper presents a natural language processing (NLP) based approach that, given software requirements specification, allows the functional dependency detection between integration test cases. We analyze a set of internal signals to the implemented modules for detecting dependencies between requirements and thereby identifying dependencies between test cases such that: module 2 depends on module 1 if an output internal signal from module 1 enters as an input internal signal to the module 2. Consequently, all requirements (and thereby test cases) for module 2 are dependent on all the designed requirements (and test cases) for module 1. The dependency information between requirements (and thus corresponding test cases) can be utilized for test case selection and prioritization. We have implemented our approach as a tool and the feasibility is evaluated through an industrial use case in the railway domain at Bombardier Transportation, Sweden. Statement of Contribution: The first author is the main contributor of this paper, while the co-authors having academic and industrial advisory roles. The first author conducted the industrial case study, evaluation of the proposed approach and developed the models and concepts. The second and third authors implemented the dependency extractor algorithms. Study E: Automated Functional Dependency Detection Between Test Cases Using Text Semantic Similarity, Sahar Tahvili, Leo Hatvani, Michael Felderer, Wasif Afzal and Markus Bohlin, Submitted to 12th IEEE International Conference on Software Testing, Verification and Validation (ICST’19), 2018, IEEE. Abstract: Knowing about dependencies and similarities between test cases is beneficial for prioritizing them for cost-effective test execution. This holds especially true for the time consuming, manual execution of integration test cases written in natural language. Test case dependencies are typically derived from requirements and design artifacts. However, such artifacts are not always available, and the derivation process can be very time-consuming. In this paper, we propose, apply and evaluate a novel approach that derives test case similarities and functional dependencies directly from the test specification documents written in natural language, without requiring any other data source. The approach first applies a deep-learning based language processing method to detect text-semantic similarities between test cases and then groups them based on two clustering algorithms HDBSCAN and FCM. The correlation between test case text-semantic similarities and their functional dependencies is
evaluated in the context of an on-board train control system from Bombardier Transportation AB in Sweden. For this system, the functional dependencies between the test cases were previously derived and are, in this paper, compared against the results of the new approach. The results show that of the two evaluated clustering algorithms, HDBSCAN has better performance than FCM or a dummy classifier. The classification methods’ results are of reasonable quality and especially useful from an industrial point of view. Finally, performing a random undersampling approach to correct the imbalanced data distribution results in an F1 Score of up to 75% and an accuracy of up to 80% when applying the HDBSCAN clustering algorithm. Statement of Contribution: The first author is the main contributor of the study focusing on theoretical, experimental and industrial results. The simulation part was primarily the contribution of the second author. The first author also developed the models, the concept and performed the industrial case study. The other co-authors have academic and industrial advisory roles. Study F: sOrTES: A Supportive Tool for Stochastic Scheduling of Manual Integration Test Cases, Sahar Tahvili, Rita Pimentel, Wasif Afzal, Marcus Ahlberg, Eric Fornande, Markus Bohlin, Journal of IEEE Access (IEEE-Access), 2018, In revision. Abstract: The main goal of software testing is to detect as many hidden bugs as possible in the final software product before release. Generally, a software product is tested through executing a set of test cases, which can be performed manually or automatically. The number of test cases which are required to test a software product depends on several parameters such as: the product type, size and complexity. Executing all test cases with no particular order can lead to waste of time and resources. Test optimization can provide a partial solution for saving time and resources which can lead to the final software product being released earlier. In this regard, test case selection, prioritization and scheduling can be considered as possible solutions for test optimization. Most of the companies do not provide direct support for ranking test cases on theirs own servers. In this paper we introduce, apply and evaluate sOrTES as our decision support system for manual integration of test scheduling. sOrTES is a Python based supportive tool which schedules manual integration test cases which are written in a natural language text. The feasibility of sOrTES is studied by an empirical evaluation which has been performed on a railway use-case at Bombardier Transportation in Sweden. The empirical evaluation indicates that
around 40% of test execution failures can be avoided by using the execution schedules proposed by sOrTES, which leads to an increase in the requirements coverage of up to 9.6%. Statement of Contribution: The first author is the main contributor of the study, focusing on both theoretical and experimental results, with the other co-authors having academic and industrial advisory roles. The performance evaluation was performed by the first and second authors. Moreover, the fourth and fifth authors implemented sOrTES as an automated supportive tool. The model, the concept development and the industrial case study are the work of the first author.
Chapter 2
Background
In this chapter we provide a brief overview of some required background information on software testing concepts which are central to this doctoral thesis. Moreover, in this chapter we introduce our proposed approaches and solutions for solving the test optimization problem in an industrial domain. The following sections mainly cover the concepts and terminology utilized in this thesis.
2.1
Software Testing
Software testing is a time- and resource-consuming process among the verification and validation (V&V) activities and can be considered one of the critical phases in all software development life cycles [13]. According to reports from both academia and industry, the process of software testing can take up to 50% of the total development cost [14]. The IEEE international standard ISO/IEC/IEEE 29119-1 provides a formal definition of the software testing process as follows:
Definition 2.1. Software testing is the process of analyzing a software item with the aim of detecting the differences between existing and required conditions (hidden bugs) and also evaluating the features of the software item [15].
The process of software testing in industry is performed manually, automatically or semi-automatically, and each approach has its own advantages and disadvantages. A largely automated testing procedure eliminates manual efforts,
which can lead to saving testing time in some scenarios. However, the development and maintenance of automated tests can cost between 3 and 15 times more than manual testing [16]. In the manual testing procedure, the testing process is led by testers who operate a set of test case specifications manually. The testing process continues until the expected behaviors are ensured.
Definition 2.2. A test case specification textually describes the main purpose of a specific test through providing a step-by-step procedure for execution [17]. Moreover, the required inputs, expected outputs and test execution results (pass/fail) are also specified in a test case specification.
Table 2.1 represents an example of a manual test case specification for a safety-critical system at Bombardier Transportation (BT), which consists of a test case description, an expected test result, test steps, a test case ID, corresponding requirements, etc.

Test case name: Drive And Brake Functions            Test case ID: 3EST000231-2756
Test level(s): Sw/Hw Integration                      Test configuration: TCMS baseline 1.2, VCS Platform 3.24
Requirement(s): SRS-Line Voltage 07, SRS-Speed 2051
Tester ID: BR-1211                                    Date: 2018-01-20
Initial State: No active cab                          Test Result: (Pass / Fail)    Comments:

Step | Action                                                             | Reaction                            | Pass / Fail
  1  | No passenger emergency brake activated in consist = F              | Traction safe loop deenergized = 1  |
  2  | Restore the passenger emergency brake handle in the remote consist | Traction safe loop deenergized = 0  |
  3  | Ready to run from A1, Logged in as Driver, MSTO                    | "Start inhibit reason"              |
  4  | Wait 20 seconds                                                     | Start inhibit reason = 0            |
  5  | Activate isolation of Fire System                                   | Start inhibit reason = 72 (8 + 64)  |
  6  | Deactivate Isolation of Fire System                                 |                                     |
  7  | Clean up                                                            |                                     |
Table 2.1: A test case specification example from the safety-critical train control management system at Bombardier Transportation.
As we can see, the test case specification presented in Table 2.1 is designed for manually testing the interaction between the line voltage and speed modules. In order to determine the total number of test cases required for testing a software product, several factors need to be analyzed, including the product size, complexity, testing maturity, testing procedure (manual/automated) and the level of testing. An industrial testing project usually requires a large number of test cases at the various testing levels, and therefore the majority of the total budget
should be allocated towards the testing activities in the software development process [18], [14].
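Because later chapters (ESPRET in Study C and sOrTES in Study F) rely on parsing such specifications automatically, it can help to picture a specification like the one in Table 2.1 as a simple data structure. The following Python sketch is a hypothetical, simplified representation: the field values are taken from Table 2.1, but the classes themselves are illustrative and not part of any tool in this thesis.

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class TestStep:
        number: int
        action: str
        reaction: Optional[str] = None  # expected reaction, if one is specified

    @dataclass
    class TestCaseSpecification:
        test_case_id: str
        name: str
        level: str
        requirements: List[str]
        steps: List[TestStep] = field(default_factory=list)

        def requirement_coverage(self) -> int:
            # Number of requirements this test case covers (used as a criterion in Section 2.4.1).
            return len(self.requirements)

    spec = TestCaseSpecification(
        test_case_id="3EST000231-2756",
        name="Drive And Brake Functions",
        level="Sw/Hw Integration",
        requirements=["SRS-Line Voltage 07", "SRS-Speed 2051"],
        steps=[TestStep(1, "No passenger emergency brake activated in consist = F",
                        "Traction safe loop deenergized = 1")],
    )
    print(spec.requirement_coverage())  # 2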
2.2
Integration Testing
Performing testing activities at several levels makes it possible to detect more faults in the software product and also to evaluate the system's compliance with the specified needs [5]. Moreover, dividing the testing activities into separate levels can provide some clues for identifying missing areas in the software that have not been tested yet. A typical software development life cycle (SDLC) model is founded on six different phases: (i) requirement gathering and analysis, (ii) design, (iii) implementation, (iv) testing, (v) deployment and finally, (vi) maintenance. Generally, the testing phase is broken down into four main levels: unit, integration, system and acceptance testing. Figure 2.1 illustrates the mentioned phases and testing levels as a V-model of the software development life cycle.

[Figure 2.1: V-model diagram. The design phases (Requirements, System Design, Architecture Design, Module Design, Implementation) form the verification side, and the corresponding test levels (Unit Test, Integration Test, System Test, Acceptance Test) form the validation side.]
Figure 2.1: The V-model of the software development life cycle.
Definition 2.3. Integration testing is a level of software testing which occurs after unit testing and before system testing, where individual modules are combined and tested as a group [5].
In some testing scenarios, most of the hidden bugs in a software product can only be detected when the modules are interacting with each other [20], which makes integration testing more complex [5].
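To illustrate why interacting modules complicate this level, the short sketch below orders integration test targets so that a module is only tested after the modules whose output signals it consumes. The signal-flow graph is invented, and the topological ordering is only meant to illustrate dependency-aware ordering; it is not the dependency detection approach developed in Study D.

    from graphlib import TopologicalSorter  # Python 3.9+

    # Hypothetical signal flow: consumer module -> set of producer modules it depends on.
    module_inputs = {
        "LineVoltage": set(),
        "Speed": {"LineVoltage"},
        "Brake": {"LineVoltage", "Speed"},
        "Doors": {"Speed"},
    }

    # Modules whose outputs feed another module are integration-tested first.
    order = list(TopologicalSorter(module_inputs).static_order())
    print(order)  # e.g. ['LineVoltage', 'Speed', 'Brake', 'Doors']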
2.3
Test Optimization
Today, test optimization and efficiency have become an increasingly popular topic in the software testing domain, and according to published research from academia, they are going to become even more important [21], [22]. As highlighted in Chapter 1, the testing process is a critical and costly process, and therefore there is an opportunity to increase test efficiency and decrease testing costs. There are a number of ways to optimize the testing process, such as test suite minimization [23], test case selection, test case prioritization, test case scheduling and test automation. In this regard, a number of different algorithms have been applied to address the test optimization problem [24], such as ant colony optimization [25], particle swarm optimization [26], artificial bee colony optimization [27] and genetic algorithms [28]. In the following sections, we review some of the test optimization aspects which we then utilize for the optimization of industrial-level integration testing.
2.3.1
Test Case Selection
Selecting and evaluating a subset of the generated test cases for execution is one technique to optimize the testing process [29]. Rothermel and Harrold [30] formulate the test case selection problem as follows:
Definition 2.4. Given: a program P, a modified version of P, denoted P′, and a test suite T. Problem: find a subset T′ of T with which to test P′.
Test case selection can be considered a proper optimization approach, e.g., in exploratory and regression testing, where the behavior of modified software can be verified through selecting a subset of test cases for re-execution [31]. Indeed, not all created test cases need to be executed at the same testing level, as they can instead be tested at some other testing level, for instance acceptance testing, where all test cases have already been executed at least once and only a
few test cases need to be selected for re-execution. Previously¹, in [32] and [33], we proposed some methods of compensatory aggregation² for selecting a subset of test cases for execution.
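A minimal reading of Definition 2.4 for a toy example: given a mapping from test cases to the parts of the system they exercise, keep only the test cases that touch the modified parts. The mapping and the set of modified functions below are invented; the selection approaches in Studies A and B are instead driven by dependencies and other criteria.

    # Toy selection in the spirit of Definition 2.4 (data is invented).
    coverage = {
        "TC1": {"brake_ctrl", "speed_sensor"},
        "TC2": {"radio_link"},
        "TC3": {"speed_sensor", "hvac"},
    }
    modified = {"speed_sensor"}  # parts of P that changed in P'

    selected = [tc for tc, parts in coverage.items() if parts & modified]
    print(selected)  # ['TC1', 'TC3']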
2.3.2
Test Case Prioritization
To increase the rate of fault detection, all generated test cases should be ranked for execution in such a way that test cases of higher importance are ranked higher [35]. Test case prioritization can be applied almost at all testing levels with the main purpose of detecting faults earlier in the software product [7]. The problem of test case prioritization is defined by Rothermel and Harroldn [30] as: Definition 2.5. Given: A test suite, T , the set of permutations of T , P T and a function from P T to real numbers, f : P T → R. Problem: To find a T ∈ P T that maximizes f . The main difference between Definition 2.4 and Definition 2.5 is the number of test cases. In Definition 2.4 a subset of test cases will be opted for the test case selection, whereas in Definition 2.5, all generated test cases will be ranked for execution in the test case prioritization. The problem of test selection and prioritization is addressed in this doctoral thesis in Studies A and B respectively.
2.3.3
Test Case Scheduling
Most of the previous works [36], [37], [29], [38] on test case selection and prioritization are only applicable before test execution, meaning that they do not monitor the test results after each execution. In order to optimize the testing process more efficiently, the test execution results of each test case need to be recorded and considered continuously. Selecting and prioritizing test cases for execution based on their execution results leads us to use the term test case scheduling for this way of addressing test optimization. In keeping with the structure of Definition 2.4 and Definition 2.5, we propose the following definition of the test case scheduling problem:

Definition 2.6. Given: a test suite T; for every subset A ⊆ T, the set SP_A of all permutations of A; for every B ⊆ T, the set R_B of all possible outputs after
execution of the test cases in B; and, for each r ∈ R_B, a function f_r : SP_A → R. Problem: to find a prioritized set T′ of T, considering the function f_∅ : PT → R, where PT is the set of permutations of T; to execute the test cases in T′ until the first failure (if any); and to update the previous procedure for T − T_p, considering the function f_{r_e}, until T_p = T, where T_p is the set of passed test cases and r_e is the output of the executed test cases.

Indeed, the deciding factor for choosing which test cases to execute, and in what order, is the set of test execution results. Therefore, the outputs of the executed test cases need to be saved in r_e, and the prioritization process should continue until all generated test cases have been executed at least once. Note that the main difference between Definitions 2.4 and 2.5 on the one hand and Definition 2.6 on the other is that only the latter monitors the results of the test executions. This monitoring of results provides the opportunity for a dynamic test optimization process. If no failures occur after the first execution, then we only need to prioritize the test cases once, according to Definition 2.5 (note that f in Definition 2.5 is the same as f_∅ in Definition 2.6).
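The procedure captured by Definition 2.6 can be sketched as a loop that executes a prioritized order until the first failure and then re-prioritizes the remaining test cases. The prioritize and execute callables below are placeholders standing in for a real scoring model and a real test runner; they are not the tooling used in this thesis.

```python
def schedule_and_run(test_cases, prioritize, execute):
    """Dynamic scheduling: execute until the first failure, record the result,
    re-prioritize what is left, and repeat until every test case has passed.

    prioritize(remaining, history) -> ordered list of test cases
    execute(tc)                    -> "pass" or "fail"
    """
    remaining, history = set(test_cases), []
    while remaining:
        progressed = False
        for tc in prioritize(remaining, history):
            verdict = execute(tc)
            history.append((tc, verdict))
            if verdict == "pass":
                remaining.discard(tc)
                progressed = True
            else:
                break                      # first failure: stop and re-prioritize the rest
        if not progressed:
            break                          # guard for the sketch: nothing passed this round
    return history

# Illustrative stubs: alphabetic priority, everything passes on first attempt.
log = schedule_and_run(["TC2", "TC1", "TC3"],
                       prioritize=lambda rem, hist: sorted(rem),
                       execute=lambda tc: "pass")
print(log)
```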
2.4 Multiple-Criteria Decision-Making (MCDM)
The problem of test optimization can be categorized as a multi-criteria and also a multi-objective decision-making problem. A criterion in software testing can be interpreted as a property of each test case that helps us distinguish between test cases. Recognizing and measuring the criteria for each test case is a challenging task. First, not all properties can be determined precisely. Take, for instance, the test case property of fault detection probability, which can be utilized for test case selection and prioritization. There is no precise value for this property before the first execution, but performing a historical analysis of a system under test (SUT) can provide some clues about the probability that each test case detects a fault. In this regard, identifying the most fault-prone subsystems in the SUT (e.g. the brake system in a train contains more faults than the radio system) can be used for comparing test cases with each other based on this property. For instance, test case A (which tests the brake system) has a higher probability of detecting faults than test case B (which tests the radio system); thus, test case A should be ranked higher for execution. Secondly, in some scenarios, measuring the test case properties requires close proximity to the testing experts in industry. To account for this, several
criteria have previously been proposed by researchers in the testing domain, including code coverage, test case size, execution time and cost, lines of code and requirement coverage. In this thesis, we define, utilize and measure the following criteria for each test case in order to address test optimization in the form of test case selection, prioritization and scheduling of manual integration test cases.
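To illustrate how such per-test-case criteria can be combined into a single score, here is a minimal weighted-sum sketch over linguistic assessments on a five-point scale (VL to VH); the numeric mapping and the criterion weights are illustrative assumptions, not values elicited at BT.

```python
# Map linguistic assessments to a crisp numeric scale (illustrative assumption).
SCALE = {"VL": 1, "L": 2, "M": 3, "H": 4, "VH": 5}

# Illustrative criterion weights summing to 1 (not the weights used at BT).
WEIGHTS = {"requirement_coverage": 0.5, "fault_detection_probability": 0.3, "execution_time": 0.2}

# Linguistic assessments per test case.
assessments = {
    "Drive-S-046": {"requirement_coverage": "H", "fault_detection_probability": "H", "execution_time": "VH"},
    "Speed-S-005": {"requirement_coverage": "VL", "fault_detection_probability": "L", "execution_time": "M"},
}

def score(tc_assessment):
    total = 0.0
    for criterion, weight in WEIGHTS.items():
        value = SCALE[tc_assessment[criterion]]
        if criterion == "execution_time":       # cost criterion: shorter execution is better
            value = len(SCALE) + 1 - value
        total += weight * value
    return total

ranking = sorted(assessments, key=lambda tc: score(assessments[tc]), reverse=True)
print(ranking)    # ['Drive-S-046', 'Speed-S-005']
```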
2.4.1 Requirement Coverage
Requirement coverage indicates the number of requirements covered by each test case. The coverage of requirements is a fundamental necessity throughout the software development life cycle. In some scenarios, one test case can test more than one requirement, and occasionally several test cases are assigned to test a single requirement. Executing test cases with greater requirement coverage (than other test cases) early in the testing process increases the requirement coverage achieved in the earlier stages. As part of this thesis, we also propose an automated approach for measuring the requirement coverage of manual integration test cases (see Studies D and F).
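A minimal sketch of how requirement coverage could be counted once a mapping from test cases to requirements is available; the identifiers below are invented for illustration.

```python
# Hypothetical mapping from test cases to the requirements they cover.
covers = {
    "TC1": {"REQ-1", "REQ-2", "REQ-3"},
    "TC2": {"REQ-2"},
    "TC3": {"REQ-4", "REQ-5"},
}

# Requirement coverage per test case: the number of covered requirements.
req_coverage = {tc: len(reqs) for tc, reqs in covers.items()}

# Ranking test cases by requirement coverage (higher first).
ranked = sorted(req_coverage, key=req_coverage.get, reverse=True)
print(ranked)    # ['TC1', 'TC3', 'TC2']
```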
2.4.2 Execution Time
As the title implies, test execution time is the total time that each test case requires for execution. Note that each execution of a test case can result in a different execution time value. Knowing the execution time of test cases before execution helps test managers divide test cases among several testers. Moreover, estimating the required time for each test case can provide a better overview of the total time required for testing a software product. In this doctoral thesis, we also introduce, apply and evaluate ESPRET as an automated tool for execution time estimation of manual integration test cases (see Study C).
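As a hedged illustration of the underlying idea (not ESPRET itself), the sketch below fits a regression model on hypothetical historical records, assuming that simple features such as the number of test steps have been extracted from the test specifications and that scikit-learn is available.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical features of previously executed test cases:
# (number of test steps, number of involved signals) -> measured execution time in minutes.
X_hist = np.array([[5, 2], [12, 4], [8, 3], [20, 7], [3, 1]])
y_hist = np.array([7.0, 18.5, 11.0, 32.0, 4.5])

model = LinearRegression().fit(X_hist, y_hist)

# Predict the execution time of newly written (not yet executed) test cases.
X_new = np.array([[10, 3], [15, 5]])
print(model.predict(X_new))
```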
2.4.3 Fault Detection Probability
Fault detection probability refers to the average probability that each test case detects a fault. It can be determined by performing historical analysis on previously executed test cases. Sometimes, the fault detection probability is directly related to the complexity of the test cases. Selecting those test cases which have a higher chance of detecting the hidden faults in the system under
test can lead to earlier fault detection in each execution cycle. In this thesis, this criterion is measured by using a questionnaire-based study at BT (see Study A).
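A rough sketch of how historical verdicts could be turned into a per-subsystem fault detection probability and attached to test cases; the counts and the naming convention used to map a test case to its subsystem are assumptions made for illustration.

```python
# Hypothetical historical verdicts per subsystem: (failed executions, total executions).
history = {"brake": (42, 300), "doors": (10, 250), "radio": (2, 200)}

# Relative failure frequency as a rough proxy for fault detection probability.
fdp = {subsystem: failed / total for subsystem, (failed, total) in history.items()}

def test_case_fdp(test_case_id):
    """Assign the subsystem's estimated probability to a test case via its ID prefix."""
    subsystem = test_case_id.split("-")[0].lower()
    return fdp.get(subsystem, 0.0)

print(test_case_fdp("Brake-IVV-31"))   # 0.14
print(test_case_fdp("Radio-S-25"))     # 0.01
```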
2.4.4 Test Case Dependencies
Previous studies have shown that executing test cases without considering the dependencies between them can cause failures at any point during a testing process [39], [40]. The dependency issue becomes more visible at the integration testing level, where testing the interactions between modules can lead to a strong interdependency between the corresponding integration test cases. Dependent test cases directly influence each other's execution results [39], and the dependency issue can therefore be considered a critical problem at the integration testing level. Indeed, paying no attention to the dependencies between test cases can cause redundant test execution failures during the testing process. Moreover, the concept of dependency is important in a wide range of testing contexts (e.g. ranking test cases for execution, selecting a subset of test candidates for automation), and dependency detection has become both a research challenge and an industrial challenge today [41]. Several kinds of dependencies between test cases have been identified by researchers, such as functional dependency, temporal dependency, semantic dependency and abstract dependency. In our collaboration with testers and engineers at Bombardier Transportation, the functional dependency between integration test cases was identified as one of the most critical types of dependency.

Definition 2.7. Test cases TC1 and TC2 are functionally dependent if they are designed to test different parts of function F1 or if they are testing the interaction between functions F1 and F2.

For instance, given two functions F1 and F2 of the same system, let function F2 be allowed to execute only if its required conditions have already been enabled by function F1. Thus, function F2 is dependent on function F1. In this thesis, we assume that all test cases which are designed to test F2 should be executed after the test cases assigned to test F1. Note that, in some testing scenarios, it may be the case that only some of the corresponding test cases for testing function F1 need to be executed before the corresponding test cases for testing function F2. For instance, the required conditions for testing function F2 might be enabled after testing some (e.g. 90%) of the designed test cases for function F1. However, this assumption needs to be relaxed in the future.
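One way to operationalize this assumption is to treat functional dependencies as edges of a directed acyclic graph and derive a dependency-respecting execution order. The sketch below uses Python's standard-library graphlib and an invented dependency map.

```python
from graphlib import TopologicalSorter

# Hypothetical dependencies: test case -> set of test cases it depends on
# (i.e. test cases that should have been executed before it).
depends_on = {
    "TC_F2_a": {"TC_F1_a", "TC_F1_b"},
    "TC_F2_b": {"TC_F1_a"},
    "TC_F1_a": set(),
    "TC_F1_b": set(),
}

# A valid execution order never schedules a test case before its prerequisites.
order = list(TopologicalSorter(depends_on).static_order())
print(order)   # e.g. ['TC_F1_a', 'TC_F1_b', 'TC_F2_b', 'TC_F2_a']
```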
Detecting functional dependencies between test cases can lead to a more efficient use of testing resources by means of:

• avoiding redundant test executions,
• parallel execution of independent test cases,
• simultaneous execution of test cases that test the same functionality,
• any combination of the previous options.

The following three main approaches are proposed, applied and evaluated in this doctoral thesis for detecting the functional dependencies between manual integration test cases:

1. Questionnaire-based study, participant observation and archival data: The data for detecting functional dependencies was collected through participant observation and a questionnaire, complemented by archival data used to find the causes of test case failures. The test experts at BT answered a questionnaire in which test dependencies were identified based on requirements. A dependency graph is designed and proposed, which can help testers to prioritize test cases for execution based on the dependencies between them (see Study A).

2. Signal analysis: A set of internal signals in the implemented software modules is analyzed to detect functional dependencies between requirements and thereby identify dependencies between test cases, such that module 2 depends on module 1 if an output internal signal from module 1 enters as an input internal signal to module 2. Consequently, all requirements (and thereby test cases) for module 2 are dependent on all the designed requirements (and test cases) for module 1 (see Studies D and F).

3. Deep learning-based natural language processing: Since test specifications are usually written in natural language, a natural language processing-based approach can help testers detect dependencies between test cases. To this end, we propose to use the Doc2Vec algorithm [42], which generates a representation vector for a document regardless of its length (see Study E and [42]); given the test specifications, it allows automatic detection of test case dependencies by converting each test case into a vector in an n-dimensional space. These vectors are then grouped into several clusters using the HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) clustering algorithm. Finally, a set of cluster-based test scheduling strategies is proposed for execution [42] (see Study E). A minimal sketch of this pipeline is given after Figure 2.2.
Figure 2.2: Three different approaches for functional dependency detection between manual integration test cases evaluated at Bombardier Transportation. [The figure maps each approach to its inputs and evaluation project: the questionnaire-based study uses test specifications and test results from the Zefiro project (4578 test cases, 7305 test results; Study A); signal analysis uses requirement specifications, internal signals and test specifications from the BR490 project (1748 test cases, 3938 requirement specifications; Study D); and deep learning-based NLP uses test specifications from the C30 project (1408 test cases; Study E and [42]).]
In order to show the feasibility of the proposed approaches, several industrial testing projects at BT have been selected and analyzed as use cases. However, the required inputs differ between the approaches. Figure 2.2 shows the proposed approaches for dependency detection, the required inputs for each approach, the name and size of the analyzed testing project at BT and the corresponding study targeting each approach.
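The sketch below illustrates the third approach in rough outline only (it is not the exact pipeline of Study E); it assumes the gensim and hdbscan packages and uses a handful of invented test-specification strings, whereas a real application would use the BT test specifications and tuned hyperparameters.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import hdbscan

# Hypothetical manual test specifications (one string per test case).
specs = {
    "TC1": "apply the service brake and verify brake pressure",
    "TC2": "release the brake and verify the brake indication lamp",
    "TC3": "open the passenger doors and verify the door status signal",
    "TC4": "close the passenger doors and verify interlocking",
}

docs = [TaggedDocument(words=text.lower().split(), tags=[tc]) for tc, text in specs.items()]

# Train a small Doc2Vec model and embed every test specification as a vector.
model = Doc2Vec(docs, vector_size=16, min_count=1, epochs=50)
vectors = [model.dv[tc] for tc in specs]

# Group semantically similar (potentially dependent) test cases into clusters.
clusterer = hdbscan.HDBSCAN(min_cluster_size=2)
labels = clusterer.fit_predict(vectors)
print(dict(zip(specs, labels)))   # cluster label per test case (-1 means noise)
```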
Chapter 3
Research Overview

In this chapter, we provide an overview of the doctoral thesis. First, we describe the research goals of the thesis. Later, the individual technical contributions which address the research goals are presented. Concluding the chapter is a comprehensive discussion of the research process and research methodology applied in this doctoral thesis.
3.1 Goal of the Thesis
In the past decade, competition among software companies to improve product quality has intensified [43], which in turn has also impacted the testing process [44]. Improving the quality of the software sometimes increases the final cost of the product [45]. Finding a trade-off between quality assurance and the testing resources allocated in industry is a challenging issue, and optimizing the testing process has therefore received much attention from researchers as well as industry [46], [47]. This thesis aims to enable the application of test optimization techniques to improve manual integration testing in industry. Under this target, we define our main research goal as follows: to provide methods for a more efficient manual integration testing process, while decreasing unnecessary test execution failures and increasing the requirement coverage at each testing cycle. This doctoral thesis builds on the preceding licentiate thesis, titled A Decision Support System for Integration Test Selection (see [48]), where a manual
multi-criteria decision support system (DSS) for test case selection and prioritization was proposed by us and the performance of the proposed DSS was evaluated at Bombardier Transportation by measuring the value of return on investment (ROI). The economic models presented in the licentiate thesis (included as Study B in this doctoral thesis) confirm that there is a need for an automated decision support system for test case selection, prioritization and scheduling in industry. While the licentiate thesis mainly focused on the design and economic evaluation of the proposed DSS, this doctoral thesis includes an automated version of the proposed DSS with industrial empirical evaluations. As mentioned in Chapter 2 and in the licentiate thesis [48], executing all generated test cases without any particular order may cause unnecessary test execution failures and thereby leads to a waste of testing resources and time. On the other hand, applying manual approaches for optimizing the testing process is a time- and cost-consuming procedure and requires even more testing resources [48]. In the licentiate thesis, the test case properties (criteria) had been measured through a set of questionnaire-based studies, which can be a taxing process and is therefore prone to human error. Table 3.1 shows a sample survey that had been sent to the testing experts at BT for measuring the test case properties.

Test Case ID  | Execution Time* | Requirement Coverage* | Dependent on*                     | Execution Cost | Fault Detection Probability
Drive-S-046   | VH              | H                     | Air-S-005, Brake-S-25             | M              | H
Speed-S-005   | M               | VL                    | Speed-S-21, Voltage-IVV-4         | M              | L
Doors-S-011   | VL              | H                     | Brake-S-65, Air-S-005, Drive-S-09 | L              | H
Doors-S-022   | H               | L                     | Battery-S-13, DVS-IVV-08          | L              | L
Brake-IVV-31  | VL              | M                     | None                              | L              | M
Brake-IVV-41  | VL              | L                     | Fire-IVV-125                      | M              | M
Drive-S-024   | L               | H                     | Speed-IVV-66, Battery-S-58        | M              | H
Speed-IVV-04  | L               | M                     | HVAC-S-06, Speed-S-17, Brake-S-02 | M              | M
Drive-IVV-30  | L               | H                     | None                              | L              | M
Brake-S-044   | L               | H                     | None                              | H              | M
Brake-S-042   | VL              | H                     | Bogies-IVV-225, Brake-S-88        | M              | M
Drive-S-011   | VL              | M                     | Radio-S-25, Front-S-002           | H              | M
Table 3.1: A sample survey with values very low (VL), low (L), medium (M), high (H) and very high (VH), utilized for measuring five test case properties (execution time, requirement coverage, dependency, execution cost and fault detection probability) at BT in the licentiate thesis [48]. The columns marked with an asterisk represent those properties which are measured automatically today.

Table 3.1 represents the testers' opinions about five test properties captured
by using a set of fuzzy linguistic variables (very low, low, high, etc.); in a fuzzy partition, every fuzzy set corresponds to a linguistic concept such as very low, low, average, high or very high [49]. However, we were faced with several situations where testers had different opinions about the properties of a test case. In this doctoral thesis, the following test case properties are measured automatically: requirement coverage, execution time and the dependencies between test cases, which are highlighted in Table 3.1. Moreover, in the licentiate thesis, test cases were semi-automatically prioritized for execution by using two compensatory aggregation methods, AHP (Analytic Hierarchy Process) and TOPSIS (Technique for Order of Preference by Similarity to Ideal Solution). As discussed previously, in order to continuously optimize the testing process, the execution results of each test case need to be recorded and analyzed, which requires automated tools. For this reason, we have extended the scope of our research on automation issues since the publication of the licentiate thesis. Thus, in this doctoral thesis, we propose two automated supportive tools which are empirically evaluated on several large-scale testing projects at Bombardier Transportation. By considering a large number of manual integration test cases and their property measurements, together with dynamic decision making based on test case execution results, the research goals can be defined as follows:

RG1: Defining approaches for automatically detecting dependencies between test cases.
Motivation: The dependencies between test cases have been identified as one of the most important criteria for test optimization at the integration testing level [50], [39], [51]. Paying no attention to the dependencies between integration test cases might lead to sequential failure of test cases [39], [52] and thereby a waste of testing resources. On the other hand, manual approaches for dependency detection are costly and suffer from uncertainty. Since our final goal in this thesis is to address test efficiency, we focus on semi- and fully automated approaches for dependency detection between manual integration test cases.

RG2: Automatically estimating the execution time for manual test cases.
Motivation: Knowing the execution time of test cases at an early stage of a testing process can help testers and test managers to select, prioritize and schedule test cases for execution. Creating a database of the previously
executed test cases and performing various regression analysis techniques may provide clues for estimating the time required to execute newly generated test cases on the same system under test. A supporting tool can be connected to such a database to predict the execution time of new test cases. This research goal addresses the automation of test case execution time estimation.

RG3: Automatically measuring the requirement coverage for manual test cases.
Motivation: Comparing and measuring the number of requirements allocated to each test case can be considered a solution for ranking test cases for execution. Comparing and ordering test cases for execution based on their requirement coverage can lead to testing more requirements with fewer test cases overall, which thereby minimizes the total testing time. However, this information is not always available in the test case specifications, and sometimes we need to analyze the requirement specifications to measure this value. This research goal addresses the automation of requirement coverage measurement.

RG4: Proposing automated (and semi-automated) methods for test case selection, prioritization and scheduling.
Motivation: Meeting research goals 1 and 2 can help us to optimize the execution order of test cases based on their dependencies and execution times. Moreover, an automated approach in the form of a decision support system can be utilized as a supportive tool for test case selection, prioritization and scheduling, which also monitors the result of each test execution.

RG5: Measuring the effectiveness, in terms of cost and time reduction, of using the proposed optimization approaches.
Motivation: Evaluation of the feasibility and efficiency of the proposed optimization approaches in terms of reducing redundant test execution failures and increasing the requirement coverage at each testing cycle. Applying the proposed approaches in this thesis can decrease the required troubleshooting time and thereby minimize testing cost. The usage of the proposed optimization approaches can also increase the value of return on investment (ROI) for testing companies.

Figure 3.1 illustrates a holistic overview of the six performed studies (A to F), which support the main objective of this PhD thesis. Moreover, the performed
studies provide contributions to the body of knowledge in the field of test optimization.
Figure 3.1: Holistic overview of how the studies included in this PhD thesis support the research goals.
3.2 Technical Contributions
This section provides an overview of the contributions included in this doctoral thesis. The next subsection gives a high-level overview of the provided contributions and how they together achieve the main goal of test efficiency. Later in this section, we discuss the individual contributions in detail.
3.2.1 Overview of the Proposed Approach
In order to realize the main goal of this doctoral thesis, we proposed, applied and evaluated approaches for improving the manual integration testing process, gathering empirical evidence both for and against the use of the proposed optimization methods in industrial practice.
3.2.2 Discussion of Individual Contributions
The technical contributions presented in this thesis can be categorized into five main contributions:
C1: Recognizing the challenges of test optimization
Software quality is playing a more important role than ever before, and an increase in the cost of the software product itself is therefore expected [53]. The increasing demand for quality, fast turnover and limitations in resources have encouraged researchers to provide and utilize optimization techniques in the software testing domain [54], [55], [56]. Test case selection, prioritization and scheduling have been identified as potential approaches for optimizing the testing process. Previously, in [33] and [32], we proposed two methods of compensatory aggregation for addressing the mentioned issues, where test cases are selected and prioritized for execution based on their properties. Organizing test cases for execution without paying attention to their execution results might yield a less than optimal outcome. In fact, the process of test case selection and prioritization should be performed dynamically until all test cases are executed successfully (challenge 1). Moreover, the problem of test optimization is a multi-criteria decision making problem (challenge 2). However, identifying and measuring the test case properties (criteria) is a challenge being taken on by researchers, one of the reasons being that it requires close proximity to industry. The following test case properties have been recognized by us as critical properties which directly impact the process of test optimization: dependency between test cases, requirement coverage, execution time and cost, and fault detection probability. Measuring the effects of the mentioned properties automatically for each test case is a research challenge, especially in a manual testing procedure, where test cases are written by testers in natural language (challenge 3).
Targeting Research Goal: RG3, RG4
Included Study: Study A, D, F

C2: Execution time prediction
As discussed in challenge 2, an automated way of measuring test case execution time is required. The increased complexity of today's test optimization, together with the increased number of test cases created during a testing process, requires prediction models for execution time. We investigate ESPRET (EStimation and PRediction of Execution Time) as an automated tool for execution time estimation and prediction of manual integration test cases.
Targeting Research Goal: RG2
Included Study: Study C
C3: Requirement coverage measurement
Through a pairwise comparison of the identified test case properties (criteria), we came to the realization that requirement coverage has the highest priority (around 67.5%) for test case selection and prioritization at BT [33], [57]. In this thesis, we propose an automated approach for measuring the number of requirements covered by each test case.
Targeting Research Goal: RG3
Included Study: Study D, F

C4: Dependency detection
Our studies indicate that integration test cases can fail for four main reasons: (i) there is a mismatch between a test case and its corresponding requirement, (ii) the testing environment is not yet adequate for testing efficiently, (iii) there are bugs in the system under test, and (iv) no attention is paid to the dependencies between test cases. Among the mentioned causes, failures that stem from failures of interdependent test cases are preventable. As illustrated in Figure 2.2, the dependencies between manual integration test cases are detected by applying three different approaches in this doctoral thesis.
Targeting Research Goal: RG1
Included Study: Study A, D, E

C5: An automated decision support system and measures of effectiveness
Using manual or semi-automated approaches for selecting, prioritizing or scheduling test cases for execution requires testing resources and human judgment, and suffers from uncertainty and ambiguity. Executing a large set of test cases several times during a testing project is difficult to handle manually. An automated decision support system (DSS) can analyze the test cases and make decisions on the execution order of test cases more easily. Moreover, the mentioned test case properties can be measured automatically inside the DSS. Some of the properties (e.g. the dependencies between test cases) can change after execution, and therefore the properties should be re-measured after each test execution. Optimizing the execution order of test cases based on their execution results can provide a more efficient way of using testing resources. In this doctoral thesis, we investigate sOrTES (Stochastic Optimizing of TEst case Scheduling) as a supportive tool for stochastic scheduling of manual integration test cases. Additionally, sOrTES is able to measure the requirement coverage for each test case and also detects
the dependencies between manual integration test cases. Moreover, the effectiveness of sOrTES, as well as of the manual and semi-automated approaches in this thesis, is measured in terms of maximizing the return on investment by applying the approaches at BT.
Targeting Research Goal: RG5
Included Study: Study F, B
3.3 Research Process and Methodology
Industrial research includes more than just publishing research results or technical reports [58]. In fact, it requires close cooperation between industry and academia during the entire research process, where the academic research results can be evaluated in a real industrial setting and thereby improve the industrial development process [59]. A close and dynamic collaboration between researchers and practitioners is the golden key to success [60]. Additionally, choosing a proper research strategy can help researchers to achieve valid answers to their research questions. The research conducted in this doctoral thesis utilized case studies, following the guidelines for conducting and reporting case study research in software engineering by Runeson and Höst [61], and various data collection strategies (unstructured interviews, document analysis and observation). In summary, the research methodology used in this research is described by the following process:

(i) Identifying the research problems and challenges through reviewing the state of the art and current issues, through observation, and through semi-structured interviews with the testing experts at Bombardier Transportation in Sweden.
(ii) Selecting challenges to solve and designing the research objectives, goals and questions.
(iii) Proposing a set of solutions for the identified research goals.
(iv) Evaluating the proposed solutions with testing experts at Bombardier Transportation by running simulations of illustrative examples and also conducting empirical evaluations.

The structure of the research process and framework can be summarized as depicted in Figure 3.2, which has been adapted from the technology transfer model proposed by Gorschek et al. in [58].
Figure 3.2: The research model and technology transfer overview used in this doctoral thesis, adapted from Gorschek et al. [58].

As can be seen in Figure 3.2, the research model proposed by Gorschek et al. [58] is divided into two main phases, industry and academia, where both phases communicate dynamically with each other throughout six individual steps. We now present how the technology transfer model outlined in Figure 3.2 is applied in this doctoral thesis.

• Step 1. Identify potential improvement areas based on industry needs. It is critical to consider the demands of the industry before designing research questions [60]. Thus, we started our research by observing an industrial setting. During this process, several potential areas of improvement at various testing levels at BT were identified, of which the integration testing level was selected as a viable candidate for improvement. Moreover, the test optimization problem, in the form of test case selection, prioritization and scheduling, was recognized as a real industrial challenge at BT during our process assessment and observation activities.

• Step 2. Problem formulation. Based on the demands identified in the previous step, and by collaborating closely with the testing experts at BT, the problem statement was formulated. The testing department at BT consists of several testing teams
including software developers, testers, testing team leaders and middle managers [48]. Researchers from the academic partners in several research projects (e.g. the TESTOMAT (The Next Level of Test Automation [62]), MegaM@RT2 (a scalable model-based framework for continuous development and run-time validation of complex systems [63]), IMPRINT (Innovative Model-Based Product Integration Testing [64]) and TOCSYC (Testing of Critical System Characteristics [65]) projects) have regular meetings with BT. Furthermore, a set of non-directive interviews was employed in this step, which established a common core and vocabulary of the research area and the system under test [58] between researchers and testing experts.

• Step 3. Formulate candidate solutions. In continuous collaboration with the testing teams at BT, a set of candidate solutions for improving the integration testing process was designed. In this step, BT covered the role of keeping the proposed solutions compatible with their testing environment [48]. We, as the research partners, took on the main responsibility for keeping track of the state of the art in the test optimization domain and combining the proposed solutions with new ideas [58]. We designed a multi-criteria decision support system for test case execution scheduling. The main purpose of the proposed solution was to measure test case properties and schedule test cases for execution based on those properties. In agreement with BT, the proposed solution was selected as the most promising solution for test optimization at the integration testing level at BT.

• Step 4. Academic validation. In the principal technology transfer model proposed by Gorschek et al. in [58], several steps are considered for evaluating the candidate solutions proposed in the previous step. In this doctoral thesis, we employed academic and dynamic validation methods to evaluate the proposed solution for solving the test optimization problem at BT. Our scientific work was evaluated by the international review committees of the venues where we published our research results, of which six studies have been selected and presented in this doctoral thesis (Study A to F). In this step, the limitations of the various approaches were identified, and certain solutions for addressing these limitations are provided as future work. Two initial
prototype versions of the proposed approaches were implemented by our master thesis students in the software engineering program at Mälardalen University and the applied mathematics program at KTH Royal Institute of Technology.

• Step 5. Dynamic validation. This step has been performed through two research projects (TESTOMAT and MegaM@RT2). According to the project plans, a physical weekly meeting is held at BT between all industrial and academic partners involved in the mentioned research projects [48]. The results of the conducted case studies, prototypes and experiments are presented by the researchers during these meetings. Moreover, some small workshops were organized by us for the team members of different internal testing projects at BT, mostly the BR490 project (an electric rail car for the S-Bahn Hamburg GmbH network, in production at the Bombardier Hennigsdorf facility [66]) and the C30 project (the new subway carriages ordered by Stockholm public transport in 2008 [67]). The industrial partners (testing experts at BT) gave valuable feedback, some of which was applied in this step. Tool support was the main feedback we received from BT on the proposed solutions.

• Step 6. Release the solution. After receiving and analyzing the feedback from the academic and dynamic validation steps, the proposed solutions were implemented as actual supportive tools. The initial versions of the proposed solutions were implemented by our master thesis students and BT engineers. In this regard, ESPRET was first released as a supportive tool for estimating and predicting the execution time of manual test cases based on their test specifications (see Study C). ESPRET is a Python-based tool which helps testers and test managers to estimate the execution time required for running manual test cases, and it can be employed for test case selection, prioritization and scheduling. As outlined in Chapter 2, execution time is one of the test case properties which needs to be identified at an early stage of the testing process. Today, this property is measured automatically at BT by using ESPRET, and it is highlighted in Table 3.1 as an automatically measured property. Secondly, sOrTES was implemented and released by us as an automated decision support system for stochastic scheduling of manual integration test cases for execution (see Study F).
sOrTES consists of two separate phases, extractor and scheduler, where the dependencies between test cases and their requirement coverage are measured in the extractor phase (highlighted in Table 3.1 as automatically measured properties). The scheduler phase embedded in sOrTES ranks test cases for execution based on their dependencies, requirement coverage, execution time (extracted from ESPRET) and test case execution results. Full implementation of the released solutions in this doctoral thesis is pending results from the trial releases. However, our ultimate goal is to integrate ESPRET and sOrTES and to incrementally release them as an open source tool.
Chapter 4
Conclusions and Future Work

This chapter concludes the doctoral thesis. A summary of the main contributions is presented in Section 4.1, while Section 4.2 contains some suggestions for future research of this thesis.
4.1 Summary and Conclusion
The overall goal of this doctoral thesis is to provide methods for a more efficient manual integration testing process. To work towards this goal, we have conducted research in three sequential stages: (i) measuring the properties of test cases automatically, (ii) prioritizing and scheduling test cases for execution automatically by means of a decision support system, and (iii) empirically evaluating the effectiveness of the proposed approach. All publications included in this doctoral thesis build on empirical research through several industrial case studies at Bombardier Transportation AB in Sweden. The measurement of the test case properties is presented in five studies. In Study A, we were able to assess the test case properties based upon a questionnaire conducted on an industrial use case. Moreover, in Study A we show that a fuzzy linguistic variable can be assigned to each test case as a property; however, this is a time-consuming process as it requires human judgment and assessment. We also show that methods of compensatory aggregation can be utilized for ranking and prioritizing test cases for execution.
Study B reports an economic model, in terms of return on investment (ROI), for the manual approach proposed in Study A. Moreover, a semi-automated and a fully automated approach for data gathering, data analysis and decision making are simulated in Study B in order to find a trade-off between effort and return. In Study B we show that even a manual approach for ranking test cases for execution can reduce unnecessary test execution failures and yields a positive ROI. Furthermore, having a semi-automated or a fully automated DSS leads to additional ROI gains. Since the publication of Study B, we have extended the scope of our research towards automating the approaches proposed in Study A. Three test case properties were selected for automation: execution time, dependency and requirement coverage. Moreover, providing automated tool support which can schedule test cases for execution is also considered in this doctoral thesis. In this regard, in Study C we introduce, apply and evaluate ESPRET as a Python-based supportive tool for estimating the execution time of manual integration test cases. In Study C we show that the execution time of manual test cases is predictable through parsing their test specifications and performing regression analysis techniques on previously executed test cases. Study D provides an automated approach for functional dependency detection between manual test cases at the integration testing level. The necessary inputs for running the proposed approach in Study D are the test case and requirement specifications. Study D indicates that the signal communication between software modules provides some clues about the dependencies between them, and thereby about the dependencies between their corresponding requirements and test cases. Since signal information and requirement specifications are not available for all systems under test, we proposed another automated approach for dependency detection which only requires the test case specifications. Study E shows that the dependencies between manual test cases can be detected by analyzing their test specifications using deep learning-based NLP techniques. Moreover, the similarity between test cases can also be discovered by applying the proposed approach. Study F builds on Study D, where we propose sOrTES as an automated tool for stochastic scheduling of test cases. The automated approach proposed in Study D is embedded in sOrTES for dependency detection in those situations where the signal information and requirement specifications are available. In Study F we schedule test cases for execution based on their requirement coverage and dependencies, where after each test execution a new test schedule is proposed based on the results of the previous execution of each test
case. Study F indicates that the automatic scheduling of test cases based on their properties and execution results can reduce unnecessary redundant test execution failures by up to 40%, which leads to an increase in the requirements coverage of up to 9.6%. As mentioned earlier in this chapter, in order to show the feasibility of all approaches proposed in this doctoral thesis (Studies A to F), several empirical evaluations have been performed on a railway use case at Bombardier Transportation in Sweden.
4.2 Future Work
Work on this doctoral thesis has opened several future research directions for us:

• Future development of sOrTES, and the possibility of integrating ESPRET and sOrTES into one open source tool for stochastic scheduling of integration tests. Today, we predict the execution time for each test case using ESPRET, and the results are inserted manually into the extractor phase of sOrTES. Embedding ESPRET into the extractor phase of sOrTES would allow us to measure three test case properties automatically using only a single tool.

• The deep learning-based NLP approach for dependency detection proposed in Study E can also be added to the extractor phase of sOrTES, where it takes the test case specifications as input and clusters the dependent test cases. Thus, sOrTES could be used in those situations where any of the test process artifacts (e.g. signal information, requirement specifications, test specifications) are available.

• As outlined in Table 3.1, there are two test case properties (execution cost and fault detection probability) which are still candidates for automation. Execution cost can be calculated from the execution time of each test case. It can also be calculated by adding more parameters, such as the complexity of the test case, the testing level and the number of requirements assigned to each test case. Moreover, the probability of each test case detecting faults is also measurable by performing historical analysis on previously tested projects within the same testing environment. For instance, the brake system at BT is the most fault-prone system, and thus test cases which are assigned to test the
brake system have a higher probability of detecting these hidden faults compared with other test cases. Prioritizing the systems and subsystems for implementation and testing can be considered another possibility for contributing to the existing research.

• The concept of requirement coverage indicates the number of requirements assigned to be tested by each test case. In some testing scenarios, one requirement can be assigned to more than one test case. In all studies presented in this thesis, the number of requirements allocated to each test case is summed and presented as a numerical value. In the future, this value will be refined by combining it with a unique requirement coverage value. Unique requirement coverage indicates the number of requirements which are assigned to only one test case. Considering this value during test execution helps testers to test all requirements at least once in any particular proposed test execution order.

• The dependency between functions can also be expressed as a ratio. For instance, if the total dependency ratio between function F1 and function F2 is 90% (F2 depends on F1), then after 90% of the designed test cases for function F1 have been executed successfully, we are able to test the test cases assigned to function F2. Finding the dependency ratio between functions can help us to relax the assumption that all test cases designed to test F2 must be executed after the test cases assigned to test F1. Indeed, some of the dependent test cases can then be tested in parallel with other test cases, which saves time in a testing cycle. Since the dependencies between software modules, functions, requirements and test cases are detected in this doctoral thesis, analyzing the test records of previously executed dependent test cases might provide clues about the dependency ratio between dependent test cases.
Bibliography

[1] M. Cook and R. Gunning. Mathematicians: An Outer View of the Inner World. Princeton University Press, 2009. [2] M. Young. Software Testing and Analysis: Process, Principles, and Techniques. Wiley India Private Limited, 2008. [3] A. Bertolino. Software testing research: Achievements, challenges, dreams. In Future of Software Engineering (FOSE '07). IEEE, 2007. [4] V. Casey. Software Testing and Global Industry: Future Paradigms. Cambridge Scholars Publisher, 2008. [5] M. Ould. Testing in Software Development. Cambridge University Press, 1987. [6] A. Srikanth, K. Nandakishore, N. Venkat, S. Puneet, and S. Praveen Ranjan. Test case optimization using artificial bee colony algorithm. In Advances in Computing and Communications. Springer Berlin Heidelberg, 2011. [7] S. Yoo and M. Harman. Regression testing minimization, selection and prioritization: A survey. Software Testing, Verification and Reliability, 22(2):67–120, 2012. [8] M. Harman. Making the case for morto: Multi objective regression test optimization. In The 4th International Conference on Software Testing, Verification and Validation Workshops (ICSTW'11). IEEE, 2011. [9] B. Baudry, F. Fleurey, J. Jezequel, and Y. Le Traon. Genes and bacteria for automatic test cases optimization in the .net environment. In The 13th International Symposium on Software Reliability Engineering (ISSRE'02). ACM, 2002.
[10] D. Di Nardo, N. Alshahwan, L. Briand, and Y. Labiche. Coverage-based regression test case selection, minimization and prioritization: A case study on an industrial system. Software Testing, Verification and Reliability, 25(4):371–396, 2015. [11] S. Elbaum, G. Rothermel, and J. Penix. Techniques for improving regression testing in continuous integration development environments. In The 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE-22). ACM, 2014. [12] D. Berthier. Pattern-Based Constraint Satisfaction and Logic Puzzles (Second Edition). Lulu Publishing, 2015. [13] M. Khan and F. Khan. Importance of software testing in software development life cycle. International Journal of Computer Science Issues (IJCSI), 11(2), 2014. [14] E. Alégroth, R. Feldt, and P. Kolström. Maintenance of automated test suites in industry: An empirical study on visual gui testing. Information and Software Technology, 73:66–80, 2016. [15] ISO/IEC/IEEE international standard – software and systems engineering – software testing – part 1: concepts and definitions. ISO/IEC/IEEE 29119-1:2013(E), pages 1–64, 2013. [16] D. Mosley and B. Posey. Just Enough Software Test Automation. Just enough series. Prentice Hall PTR, 2002. [17] G. Yu-Hsin Chen and P. Wang. Test case prioritization in a specification-based testing environment. Journal of Software (JSW), 9:2056–2064, 2014. [18] G. Myers. The Art of Software Testing. A Wiley-Interscience publication. Wiley, 1979. [19] A. Basu. Software quality assurance, testing and metrics. Prentice hall of India private limited, 2015. [20] T. Ostrand, E. Weyuker, and R. Bell. Where the bugs are. In The ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA'04). ACM, 2004.
[21] J. Moré, B. Garbow, and K. Hillstrom. Testing unconstrained optimization software. ACM Transactions on Mathematical Software, 7(1):17–41, 1981. [22] J. Pintér. Global Optimization: Software, Test Problems, and Applications, pages 515–569. Springer, 2002. [23] G. Fraser and F. Wotawa. Redundancy based test-suite reduction. In Fundamental Approaches to Software Engineering, pages 291–305. Springer Berlin Heidelberg, 2007. [24] K. Alagarsamy. A synthesized overview of test case optimization techniques. Journal of Recent Research in Engineering and Technology, 1(2):1–10, 2014. [25] S. Biswas, M. Kaiser, and S. Mamun. Applying ant colony optimization in software testing to generate prioritized optimal path and test data. In The International Conference on Electrical Engineering and Information Communication Technology (ICEEICT’15). IEEE, 2015. [26] A. Windisch, S. Wappler, and J. Wegener. Applying particle swarm optimization to software testing. In The 9th Annual Conference on Genetic and Evolutionary Computation (GECCO ’07). ACM, 2007. [27] S. Dahiya, J. Chhabra, and S. Kumar. Application of artificial bee colony algorithm to software testing. In The 21st Australian Software Engineering Conference (ASWEC’10). ACM, 2010. [28] P. Srivastava. Optimization of software testing using genetic algorithm. In The Information Systems, Technology and Management (ISM’09). Springer, 2009. [29] G. Rothermel, R. Untch, and C. Chengyun. Prioritizing test cases for regression testing. IEEE Transactions on Software Engineering, 27(10):929– 948, 2001. [30] G. Rothermel and M. J. Harrold. Analyzing regression test selection techniques. IEEE Transactions on Software Engineering, 22(8):529–551, 1996. [31] E. Engström, P. Runeson, and M. Skoglund. A systematic review on regression test selection techniques. Information and Software Technology, 52(1):14 – 30, 2010.
[32] S. Tahvili, W. Afzal, M. Saadatmand, M Bohlin, D. Sundmark, and S. Larsson. Towards earlier fault detection by value-driven prioritization of test cases using fuzzy topsis. In The 13th International Conference on Information Technology : New Generations (ITNG’16). ACM, 2016. [33] S. Tahvili, M. Saadatmand, and M. Bohlin. Multi-criteria test case prioritization using fuzzy analytic hierarchy process. In The 10th International Conference on Software Engineering Advances (ICSEA’15). IARIA, 2015. [34] A. Rikalovic, I. Cosic, and D. Lazarevic. Gis based multi-criteria analysis for industrial site selection. Procedia Engineering, 69:1054 – 1063, 2014. [35] S. Elbaum, A. Malishevsky, and G. Rothermel. Test case prioritization: A family of empirical studies. IEEE Transactions on Software Engineering, 28(2):159–182, 2002. [36] R. Abid and A. Nadeem. A novel approach to multiple criteria based test case prioritization. In The 13th International Conference on Emerging Technologies (ICET’17). IEEE, 2017. [37] K. Wang, T. Wang, and X. Su. Test case selection using multi-criteria optimization for effective fault localization. Computing, 100(8):787–808, 2018. [38] J. Jones and M. Harrold. Test-suite reduction and prioritization for modified condition/decision coverage. IEEE Transactions on Software Engineering, 29(3):195–209, 2003. [39] S. Arlt, T. Morciniec, A. Podelski, and S. Wagner. If a fails, can b still succeed? inferring dependencies between test results in automotive system testing. In The 8th International Conference on Software Testing, Verification and Validation (ICST’15). IEEE, 2015. [40] P. Caliebe, T. Herpel, and R. German. Dependency-based test case selection and prioritization in embedded systems. In The 5th International Conference on Software Testing, Verification and Validation (ICST’12). IEEE, 2012. [41] M. Broy. Challenges in Automotive Software Engineering. In The 28th International Conference on Software Engineering (ICSE’06). IEEE, 2006.
[42] S. Tahvili, L. Hatvani, M. Felderer, W. Afzal, M. Saadatmand, and M. Bohlin. Cluster-based test scheduling strategies using semantic relationships between test specifications. In The 5th International Workshop on Requirements Engineering and Testing (RET’18), 2018. [43] N. Iqbal, W. Rizwan, and J. Qureshi. Improvement of key problems of software testing in quality assurance. Computing Research Repository (CoRR), 1202.2506, 2012. [44] S. Naik and P. Tripathy. Software Testing and Quality Assurance: Theory and Practice. Wiley, 2008. [45] D. Harter, K. Mayuram, and S. Slaughter. Effects of process maturity on quality, cycle time, and effort in software product development. Management Science, 46(4):451–466, 2000. [46] M. Harman. Making the case for morto: Multi objective regression test optimization. In The 4th International Conference on Software Testing, Verification and Validation Workshops (ICSTW’11). IEEE, 2011. [47] P. McMinn. Search-based software test data generation: A survey: Research articles. Software Testing, Verification and Reliability, 14(2):105– 156, 2004. [48] S. Tahvili. A Decision Support System for Integration Test Selection. 2016. Licentiate Thesis Dissertation, Mälardalen University, Sweden. [49] L. Biacino and G. Giangiacomo. Fuzzy logic, continuity and effectiveness. Archive for Mathematical Logic, 41(7):643–667, 2002. [50] S. Ulewicz and B. Vogel-Heuser. System regression test prioritization in factory automation: Relating functional system tests to the tested code using field data. In The 42nd Annual Conference of the IEEE Industrial Electronics Society (IECON’16). IEEE, 2016. [51] S. Vöst and S. Wagner. Trace-based test selection to support continuous integration in the automotive industry. In The International Workshop on Continuous Software Evolution and Delivery (CSED’16). IEEE, 2016. [52] M. Parsa, A. Ashraf, D. Truscan, and I. Porres. On optimization of test parallelization with constraints. In The 1st Workshop on Continuous Software Engineering co-located with Software Engineering (CSE’16). CEUR-WS, 2016.
[53] I. Burnstein, T. Suwanassart, and R. Carlson. Developing a testing maturity model for software test process evaluation and improvement. In The International Test Conference on Test and Design Validity (ITC'96). IEEE, 1996. [54] M. Usaola and P. Mateo. Mutation testing cost reduction techniques: A survey. IEEE Software, 27(3):80–86, 2010. [55] S. Slaughter, D. Harter, and K. Mayuram. Evaluating the cost of software quality. Communications of the ACM, 41(8):67–73, 1998. [56] B. Boehm and V. Basili. Software defect reduction top 10 list. The Computer Journal, 34(1):135–137, 2001. [57] S. Tahvili, M. Saadatmand, S. Larsson, W. Afzal, M. Bohlin, and D. Sundmark. Dynamic integration test selection based on test case dependencies. In The 11th Workshop on Testing: Academia-Industry Collaboration, Practice and Research Techniques (TAICPART'16). IEEE, 2016. [58] T. Gorschek, P. Garre, S. Larsson, and C. Wohlin. A model for technology transfer in practice. IEEE Software, 23(6):88–95, 2006. [59] S. Fleeger and W. Menezes. Marketing technology to software practitioners. IEEE Software, 17(1):27–33, 2000. [60] V. Basili, F. E. McGarry, R. Pajerski, and M. Zelkowitz. Lessons learned from 25 years of process improvement: the rise and fall of the nasa software engineering laboratory. In The 24th International Conference on Software Engineering (ICSE'02). ACM, 2002. [61] P. Runeson and M. Höst. Guidelines for conducting and reporting case study research in software engineering. Empirical Software Engineering, 14(2):131, 2008. [62] Testomat project – the next level of test automation. http://www.testomatproject.eu.
[63] W. Afzal, H. Bruneliere, W. Di Ruscio, A. Sadovykh, S. Mazzini, E. Cariou, D. Truscan, J. Cabot, A. Gómez, J. Gorroñogoitia, L. Pomante, and P. Smrz. The megam@rt2 ecsel project: Megamodelling at runtime – scalable model-based framework for continuous development and runtime validation of complex systems. Microprocessors and Microsystems, 61:86 – 95, 2018.
[64] Imprint – innovative model-based product integration testing. http://www.sics.se/projects/imprint.
[65] Tocsyc – testing of critical system characteristics. https://www.sics.se/projects/tocsyc2.
[66] Electric multiple unit class 490 – Hamburg, Germany. https://www.bombardier.com/en/transportation/projects/project.ET490-Hamburg-Germany.html.
[67] Bombardier wins order to supply new generation movia metro fleet for stockholm. http://ir.bombardier.com/en/press-releases/pressreleases/44772-bombardier-wins-order-to-supply-new-generationmovia-metro-fleet-for-stockholm.
II Included Papers
Paper A
Chapter 5
Paper A: Dynamic Test Selection and Redundancy Avoidance Based on Test Case Dependencies
Sahar Tahvili, Mehrdad Saadatmand, Stig Larsson, Wasif Afzal, Markus Bohlin and Daniel Sundmark
In the Proceedings of the 11th Workshop on Testing: Academia-Industry Collaboration, Practice and Research Techniques (TAIC PART'16), 2016, IEEE.
Abstract

Prioritization, selection and minimization of test cases are well-known problems in software testing. Test case prioritization deals with the problem of ordering an existing set of test cases, typically with respect to the estimated likelihood of detecting faults. Test case selection addresses the problem of selecting a subset of an existing set of test cases, typically by discarding test cases that do not add any value in improving the quality of the software under test. Most existing approaches for test case prioritization and selection suffer from one or more drawbacks. For example, to a large extent, they utilize static analysis of code for that purpose, making them unfit for higher levels of testing such as integration testing. Moreover, they do not exploit the possibility of dynamically changing the prioritization or selection of test cases based on the execution results of prior test cases. Such dynamic analysis allows for discarding test cases that do not need to be executed and are thus redundant. This paper proposes a generic method for prioritization and selection of test cases in integration testing that addresses the above issues. We also present the results of an industrial case study where initial evidence suggests the potential usefulness of our approach in testing a safety-critical train control management subsystem.
5.1 Introduction
While different characteristics of test cases can be evaluated in an offline fashion to determine and select which test cases to execute, the verdict of a test case can also serve as another factor in the selection of other test cases to execute [1, 2]. Since the complexity of integration testing increases as the number of subsystems grows [3], considering the dependency between test cases plays a critical role for efficient use of test execution resources.

This paper introduces a generic approach for combined static and dynamic prioritization and selection of test cases for integration testing. The prioritization is based on the dependency degree of each test case. Further prioritization is performed among test cases at each dependency degree level using the Fuzzy Analytic Hierarchy Process technique (FAHP, see [4]), a structured method where properties are expressed using degrees of truth. The approach is close to the way people usually reason, and is therefore suitable for this type of complex decision problem. As a prerequisite, we assume the existence of a directed dependency relation, capturing information on which components use other components. In industry, such dependencies between test cases are usually found using reverse engineering [5], but source code analysis [6], interviews with experts and analysis of documentation may also be useful. In the setting of test-driven integration testing, it is often the case that test cases exist for components which have not been implemented yet, making interviews and documentation the most practical sources of this type of information.

In detail, the proposed approach consists of the following two phases (offline and online) and four steps in total:

1. (Offline) The test cases are partially ordered by calculating a dependency degree for each test case. The dependency degree of a test case indicates the extent to which the execution of a test case is redundant given that another test case fails. As a result of this step, some test cases may end up having the same dependency degree.
2. (Offline) Test cases with the same dependency degree are then prioritized by applying FAHP, producing an ordered set of test cases at each dependency degree.
3. (Online) During test execution, test cases are then selected one by one from each ordered set and in ascending order of dependency degrees.
4. (Online) When a test case fails, the test cases that are dependent on it are evaluated to determine whether those dependent cases will also fail due to the failure of the former, hence avoiding redundancy in test execution.
The overarching objective of the proposed approach is thus to avoid the execution of redundant test cases as well as to prioritize executable test cases based on dependencies and various prioritization criteria, in order to enable more efficient use of testing resources at integration testing.
5.2 Background and Preliminaries
Selecting a set of core test cases for execution to see whether further testing would be meaningful is beneficial for efficient use of testing resources [3, 7]. In this context, initially a set of test cases can be selected whose results (pass or fail) provide relevant information on which test cases to select next. This can be done by first testing the core features of the system whose failure can result in the failure of other features. In fact, by identifying the test cases that will fail because of the failure of some other test cases (result dependency), and avoiding executing the former when the latter have failed, a better use of testing resources can be achieved. Conversely, if test cases are selected without considering such dependencies, a test case might fail not because the feature it tests is actually faulty, but because another feature on which it depends has failed. From this perspective, a dependency chain among test cases can be established. In short, dependencies among test cases can be determined before their execution (offline). The result of each test case during execution, when combined with the dependency information, enables us to dynamically identify which test case to execute next. The idea of using dependency information in identifying redundant test cases is also evaluated and confirmed by Arlt et al. [8], where dependency relationships are derived and inferred from a structured requirements specification.
5.2.1 Motivating Example
Many embedded control systems provide the possibility to download applications, updates, and configurations, making it possible to adapt the behavior of the system to the specific task it will control. This means that it needs to be possible to download the application to the control system and ensure that the integrity of the application is maintained. Different mechanisms, such as using checksums, can be used to confirm that the download is correct. When testing the download function, it is necessary to have a communication channel available with the download functionality implemented. In our example, we have three different communication channels as a part of the system: Bluetooth, Wi-Fi and USB. To
be able to test the application download function, at least one of these channels needs to pass basic communication tests; hence the dependency between the download function and communication. This is shown in Figure 5.1. In this case we thus have an OR situation, where it is enough that the tests for one of the channels pass before it is useful to test the application download function.
Figure 5.1: Dependency with AND-OR relations.

We can also get an AND situation for this case. If the tests for creating a checksum fail, there is no point in trying the application download function even if one of the tests for the communication channels passes.
5.2.2 Main definitions
To understand the concept of test case dependency in this work, the key terms that are used to describe dependency relationships between test cases are defined below:

Definition 5.1. A dependent test case - Given two test cases A and B, B is dependent on A if from the failure of A it can be inferred that B will also fail (result dependency: fail based on fail). Consequently, based on the result of A, we can decide whether to also execute B. It only makes sense to execute test case B when A has passed. Otherwise, if A has failed and B is also executed, the execution of B will not be an optimal use of testing resources, since we know, based on the dependency relation and the fail result of A, that B also fails. As a side note, if test case B can still pass even if A has failed, then, based on our definition of dependency, B is not (result-)dependent on A. Moreover, it is important to remember that if A passes, B may still fail if the feature or functionality it tests is erroneous.

A test case which is not dependent on any other test case is referred to as an independent test case, which will not fail due to the failure of another test case.
According to our definition of the dependency relation, we classify test cases into the following groups:

• First Class Test Cases (white nodes): independent test cases.
• Second Class Test Cases (gray nodes): those which are dependent on one or more independent test cases.
• Third Class Test Cases (black nodes): those which are dependent on at least one dependent test case.

In a multiple dependency relationship, two distinct scenarios can exist:

• Passing of both test cases A and B is necessary (but not enough) for C to succeed. In other words, if either A or B fails, it can be concluded that C will also fail. In this case, based on the results of A and B, test case C will not be chosen as a candidate for execution. We refer to this as an AND dependency relationship, which can be formulated with Boolean operators: if result(A) = pass ∧ result(B) = pass → consider C for execution.
• Passing of A or B is enough for C to be selected as a candidate for execution (implying that C also has a chance to pass). In fact, only if both A and B fail can it be concluded that C will also fail and, therefore, will not be chosen for execution. This is regarded as an OR dependency relationship in this paper: if result(A) = pass ∨ result(B) = pass → C can be executed (alternatively: if result(A) = fail ∧ result(B) = fail → do NOT consider C for execution).
5.3 Approach
Our approach for test case prioritization and selection is based on the valuation of both the dependency degree and other test case attributes. The main objective of the approach is to evaluate the effect of test case dependencies in the selection and ordering of test cases such that redundant test cases are avoided during execution. To this end, we assume that there is information on test case dependencies corresponding to a binary relation between test cases. Figure 5.2 shows an overall view of the approach.

Figure 5.2: The steps of the proposed approach.

The proposed approach consists of two phases: offline and online. In the offline phase, a dependency degree for each test case in relation to all other
test cases is calculated. As a result, some test cases might have the same dependency degree. In this step, test cases with the same dependency degree are prioritized by applying FAHP. Considering an ascending order of dependency degrees and their corresponding sets of test cases, an offline order for the selection and execution of test cases can be determined. These sets (prioritized in ascending order based on their dependency degrees) are then used in the online phase of the approach. Now the prioritized test cases are ready for execution. The results (pass or fail) of every single execution are monitored in the online phase. In this phase, it is decided, based on the verdict of a test case (pass or fail), which test case should be chosen for the next execution. By establishing and consulting the dependency relations between test cases, we are able to run them in an order that avoids redundancy and thus results in a more efficient use of test execution resources.
5.3.1 Dependency Degree
Based on the dependency relationships between test cases, a dependency graph is constructed that represents test cases as nodes and the dependency relationships as directed edges. For each node in this graph, a dependency degree value is calculated as follows:
1. The dependency degree of independent nodes (with no incoming edges) is set to 1.

2. Each directed edge $e$ outgoing from a node is assigned a weight $W_e$ (hence a weighted directed graph), calculated as

$$W_e = D_{\mathrm{source\_node}} + 1 \qquad (5.1)$$

where $D_{\mathrm{source\_node}}$ represents the dependency degree of the node at the start of the edge.

3. The weight of the output edge $W_o$ of an AND gate is the maximum of the weights of the incoming edges to the gate:

$$W_o = \max\{W_i\} \qquad (5.2)$$

4. The weight of the output edge $W_o$ of an OR gate is the minimum of the weights of the incoming edges to the gate:

$$W_o = \min\{W_i\} \qquad (5.3)$$

5. The dependency degree of a node $v$ is the weight of the incoming edge $e$ to it (either directly from another node or from an AND or OR gate):

$$D_v = W_e \qquad (5.4)$$
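The five rules above translate into a small recursive computation over the AND-OR dependency graph. The following Python sketch is our own illustration rather than the authors' implementation; the graph encoding (each node has a single incoming source, and a gate that feeds another gate is modelled through an auxiliary node) and all names in it are assumptions made for the example.

```python
# Minimal sketch (ours, not the paper's tool) of the dependency-degree rules.
# graph maps each node to its incoming source: None for an independent test
# case, another node name, or an ("AND"/"OR", [source nodes]) gate.
def dependency_degrees(graph):
    degrees = {}

    def degree(node):
        if node in degrees:
            return degrees[node]
        src = graph[node]
        if src is None:                                 # rule 1: independent node
            d = 1
        elif isinstance(src, tuple):                    # rules 3/4: AND -> max, OR -> min
            op, sources = src
            weights = [degree(s) + 1 for s in sources]  # rule 2: edge weights
            d = max(weights) if op == "AND" else min(weights)
        else:                                           # rule 5: single incoming edge
            d = degree(src) + 1
        degrees[node] = d
        return d

    for node in graph:
        degree(node)
    return degrees

# The Figure 5.1 example: downloading the application requires the checksum test
# AND at least one of the three communication channels (modelled via "AnyChannel").
example = {
    "Bluetooth": None, "WiFi": None, "USB": None, "Checksum": None,
    "AnyChannel": ("OR", ["Bluetooth", "WiFi", "USB"]),
    "DownloadApp": ("AND", ["AnyChannel", "Checksum"]),
}
print(dependency_degrees(example)["DownloadApp"])  # -> 3 under this encoding
```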
Considering that the dependencies of test cases can be complex as described above, we also introduce the concept of an executability condition for each test case and node in the graph. The executability condition of a node is the logical condition that results from the incoming edges to that node. We use the executability condition to reflect when a test case need not be considered for execution based on the fail results of other test cases it is dependent on. In this context, the pass result of a test case is equivalent to logical true, the fail result is equivalent to logical false, and all nodes are assumed to be true by default.

Figure 5.3: An illustration of executability condition.

Therefore, in Figure 5.3, the executability condition of node D will be A ∨ (B ∧ C). In this case, we can determine to skip executing test case D only when test case A and either test case B or C have failed (considering the OR relation between test case A and the AND relation that groups test cases B and C). However, if, for example, only test case A has failed, the executability condition of D can still become true, implying that there is still one more way (through the AND relation) that has to fail (i.e., become false) before we can definitely determine that D will also fail. When the executability condition of a test case is evaluated as false, that test case can be skipped and not selected for execution. Evaluation of the executability condition is done only in the online phase of our approach, while the executability condition itself can be determined and formulated in the offline phase.
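To make the evaluation of such conditions concrete, the following small Python sketch (ours, not part of the paper) checks a condition such as A ∨ (B ∧ C) against the verdicts collected so far; the condition encoding and the function name are illustrative assumptions.

```python
# Pass maps to True, fail to False; test cases without a verdict yet default to
# True, as stated above, so a condition only becomes false once enough fails
# have actually been observed.
def executable(condition, verdicts):
    """condition: a test-case name, or ("AND"/"OR", [sub-conditions])."""
    if isinstance(condition, str):
        return verdicts.get(condition, True)          # assumed true by default
    op, parts = condition
    results = [executable(p, verdicts) for p in parts]
    return all(results) if op == "AND" else any(results)

cond_D = ("OR", ["A", ("AND", ["B", "C"])])           # condition of node D in Figure 5.3

print(executable(cond_D, {"A": False}))               # True  -> D may still be selected
print(executable(cond_D, {"A": False, "B": False}))   # False -> D can be skipped
```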
5.3.2 Test Case Prioritization: FAHP
After calculating dependency degrees, some test cases can end up having the same dependency degree. In this situation, we prioritize them based on other criteria (such as requirement coverage, time efficiency, cost efficiency and fault detection probability). In fact, there is no test execution preference among test cases with the same dependency degree. The main goal of applying FAHP for prioritizing test cases is to give test cases that best satisfy the identified criteria a better chance of earlier execution. FAHP is not, however, limited to any particular set of criteria, and in different systems and contexts users can have their own set of criteria. For computing the effects of the criteria on the test cases, we define a set of linguistic variables (e.g., low, high, etc.) and then questionnaires are sent to testers, where the testers specify the values for each criterion. The answers to the questionnaire are then interpreted in a fuzzy environment. By re-defining AHP in a fuzzy environment (called FAHP), the approach becomes more practical in real-world scenarios where precise quantified values cannot be given for each criterion [9]. Fuzzy truth represents membership in vaguely defined sets. Variables over these sets are called fuzzy variables. From a user perspective, fuzzy properties
are often described using linguistic variables. This section outlines the process of transforming a linguistic value into a fuzzy value. In this paper we use five triangular-shaped membership functions, shown in Figure 5.4.
Figure 5.4: Fuzzy membership functions for the linguistic variables.

Definition 5.2. A triangular fuzzy number (TFN) can be defined as a triplet M = (l, m, u), where l, m, u are real numbers: l indicates the lower bound, m is the modal value and u represents the upper bound (see [10]). By using Table 5.1, we are able to interpret the linguistic variables in the form of TFNs.

Fuzzy number | Description | Triangular fuzzy scale | Domain | m_A(x)
9̃ | Very High | (7, 9, 9) | 7 ≤ x ≤ 9 | (x − 7)/(9 − 7)
7̃ | High | (5, 7, 9) | 5 ≤ x ≤ 7 | (x − 5)/(7 − 5)
  |  |  | 7 ≤ x ≤ 9 | (9 − x)/(9 − 7)
5̃ | Medium | (3, 5, 7) | 3 ≤ x ≤ 5 | (x − 3)/(5 − 3)
  |  |  | 5 ≤ x ≤ 7 | (7 − x)/(7 − 5)
3̃ | Low | (1, 3, 5) | 1 ≤ x ≤ 3 | (x − 1)/(3 − 1)
  |  |  | 3 ≤ x ≤ 5 | (5 − x)/(5 − 3)
1̃ | Very Low | (1, 1, 3) | 1 ≤ x ≤ 3 | (3 − x)/(3 − 1)

Table 5.1: The fuzzy scale of importance.
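The scale in Table 5.1 can be encoded directly as (l, m, u) triplets together with their triangular membership functions. The short Python sketch below is an illustrative encoding of ours, not taken from the paper; the dictionary keys and the function name are assumptions.

```python
# Triangular fuzzy numbers for the five linguistic variables of Table 5.1.
TFN = {
    "VL": (1, 1, 3),
    "L":  (1, 3, 5),
    "M":  (3, 5, 7),
    "H":  (5, 7, 9),
    "VH": (7, 9, 9),
}

def membership(x, tfn):
    """Membership value m_A(x) of x for a triangular fuzzy number (l, m, u)."""
    l, m, u = tfn
    if x < l or x > u:
        return 0.0
    if x <= m:
        return 1.0 if m == l else (x - l) / (m - l)   # ascending branch
    return 1.0 if u == m else (u - x) / (u - m)       # descending branch

print(membership(6, TFN["M"]))   # 0.5, i.e. (7 - 6)/(7 - 5)
print(membership(8, TFN["VH"]))  # 0.5, i.e. (8 - 7)/(9 - 7)
```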
The fuzzy comparison matrix $A = (\tilde{a}_{ij})_{n \times n}$ can be formulated and structured as [11]:

$$A = \begin{pmatrix} (1,1,1) & \tilde{a}_{12} & \cdots & \tilde{a}_{1n} \\ \tilde{a}_{21} & (1,1,1) & \cdots & \tilde{a}_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ \tilde{a}_{n1} & \tilde{a}_{n2} & \cdots & (1,1,1) \end{pmatrix} \qquad (5.5)$$

where $\tilde{a}_{ij}$ ($i = 1, 2, \ldots, n$; $j = 1, 2, \ldots, m$) is an element of the comparison matrix and the reciprocal property of the comparison matrix is defined as $\tilde{a}_{ij} = \tilde{a}_{ji}^{-1}$. The pairwise comparisons need to be applied to every criterion and alternative, and the values for $\tilde{a}_{ij}$ come from the predefined fuzzy scale shown in Table 5.1. Moreover, $\tilde{a}_{ij}$ represents a TFN of the form $\tilde{a}_{ij} = (l_{ij}, m_{ij}, u_{ij})$, and matrix $A$ consists of the following fuzzy numbers:

$$\tilde{a}_{ij} = \begin{cases} 1 & i = j \\ \tilde{1}, \tilde{3}, \tilde{5}, \tilde{7}, \tilde{9} \ \text{or} \ \tilde{1}^{-1}, \tilde{3}^{-1}, \tilde{5}^{-1}, \tilde{7}^{-1}, \tilde{9}^{-1} & i \neq j \end{cases}$$

For computing a priority vector of matrix $A$, we need to calculate the value of the fuzzy synthetic extent $\tilde{S}_i$ for each row in matrix $A$ by (see [10]):

$$\tilde{S}_i = \sum_{j=1}^{m} \tilde{a}_{ij} \otimes \left[ \sum_{i=1}^{n} \sum_{j=1}^{m} \tilde{a}_{ij} \right]^{-1} \qquad (5.6)$$

where $\tilde{a}_{ij}$ is a TFN and $\otimes$ is the fuzzy multiplication operator. The degree of possibility for a convex fuzzy number can then be calculated by:

$$V(\tilde{a}_2 \geq \tilde{a}_1) = \operatorname{hgt}(\tilde{a}_1 \cap \tilde{a}_2) = d = \begin{cases} 1 & \text{if } m_2 \geq m_1 \\ 0 & \text{if } l_1 \geq u_2 \\ \dfrac{l_1 - u_2}{(m_2 - u_2) - (m_1 - l_1)} & \text{otherwise} \end{cases} \qquad (5.7)$$

where $d$ is the ordinate of the highest intersection point between $\tilde{a}_1$ and $\tilde{a}_2$, and the term hgt indicates the height of the fuzzy numbers at the intersection of $\tilde{a}_1$ and $\tilde{a}_2$ (see [10]). As the last step, we measure the weight vector for the criteria, assuming

$$d'(A_i) = \min V(\tilde{S}_i \geq \tilde{S}_k), \quad k = 1, 2, \ldots, n, \; k \neq i,$$

where $A_i$ ($i = 1, 2, \ldots, m$) are the $m$ decision alternatives and $n$ is the number of criteria; the weight vector is then obtained by (see [10]):

$$W'(A_i) = (d'(A_1), d'(A_2), \ldots, d'(A_m))^T, \quad A_i \ (i = 1, 2, \ldots, m) \qquad (5.8)$$

The normalized weight vector can be calculated by normalizing Equation (5.8) (see [12]):

$$W(A_i) = (d(A_1), d(A_2), \ldots, d(A_n))^T \qquad (5.9)$$

where $W$ is a non-fuzzy number and represents the arrangement of the alternatives. The importance degree of a criterion, $W_{C_j}$, can be calculated by:

$$W_{C_j} = \frac{W(A_j)}{\sum_{i=1}^{n} W(A_i)}, \quad j = 1, \ldots, n \qquad (5.10)$$
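Equations (5.6)-(5.10) correspond to Chang's extent analysis over a fuzzy pairwise-comparison matrix. The Python sketch below is a compact illustration of these steps written by us, not the tool used in the paper; the example matrix is hypothetical and only its scale values are drawn from Table 5.1.

```python
# TFNs are (l, m, u) tuples; `matrix` is a square fuzzy pairwise-comparison matrix.
def fuzzy_weights(matrix):
    n = len(matrix)
    # Component-wise row sums and the overall sum.
    row_sums = [tuple(sum(tfn[k] for tfn in row) for k in range(3)) for row in matrix]
    total = tuple(sum(rs[k] for rs in row_sums) for k in range(3))
    # Fuzzy synthetic extent S_i = row_sum_i (x) total^(-1), Eq. (5.6).
    S = [(rs[0] / total[2], rs[1] / total[1], rs[2] / total[0]) for rs in row_sums]

    def V(a, b):                       # degree of possibility V(a >= b), Eq. (5.7)
        l1, m1, u1 = b
        l2, m2, u2 = a
        if m2 >= m1:
            return 1.0
        if l1 >= u2:
            return 0.0
        return (l1 - u2) / ((m2 - u2) - (m1 - l1))

    d = [min(V(S[i], S[k]) for k in range(n) if k != i) for i in range(n)]  # Eq. (5.8)
    s = sum(d)
    return [x / s for x in d]          # normalisation, Eqs. (5.9)-(5.10)

# A hypothetical 3-criteria comparison using the scale of Table 5.1.
A = [
    [(1, 1, 1),       (1, 3, 5),      (3, 5, 7)],
    [(1/5, 1/3, 1),   (1, 1, 1),      (1, 3, 5)],
    [(1/7, 1/5, 1/3), (1/5, 1/3, 1),  (1, 1, 1)],
]
print(fuzzy_weights(A))  # approximately [0.57, 0.38, 0.05]
```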
5.3.3 Offline and online phases
In this section, through an example, we show how the calculation of dependency degrees is done when we have both AND and OR situations and 13 test cases that test the system under test. Figure 5.5 illustrates a sample calculated dependency graph for the test cases, where the calculated dependency degree of each node is specified inside parentheses and the weight of each edge is shown above it.
Figure 5.5: Dependency Graph.

By using Eqs. 5.1, 5.2, 5.3 and 5.4 we get the following dependency degrees for the test cases. Noting that the dependency degree of each independent node is equal to 1, we have D_TC1 = 1. For calculating the dependency degree of the next node in the first row, which is the grey node TC4, first the weight of the incoming edge to this node is calculated using Equation 5.1: W_TC4 = D_source_node + 1 = D_TC1 + 1 = 2.
Since there is no other incoming edge to TC4, Equation 5.4 is applied and therefore D_TC4 = 2 (the weight and value coming from the edge between TC1 and TC4). Similarly, for node TC6, first the weights of the incoming edges are calculated using Equation 5.1. Then, because of the AND relation between the incoming edges, Equation 5.2 is applied for calculating the dependency degree of node TC6: D_TC6 = max{2, 3} = 3.

The set of test cases with the same dependency degree can be further prioritized by FAHP according to a selection of criteria (cost, execution time, etc.). The result of this step will be an ordered set of test cases with the same dependency degree. A sample output as illustrated in Table 5.2 is produced.

Dependency degree | Set of ordered test cases
1 | {TC2, TC3, TC1}
2 | {TC8, TC10, TC12, TC4}
3 | {TC9, TC6, TC11, TC5, TC13}
4 | {TC7}

Table 5.2: Ordered set of test cases per dependency degree by FAHP.
Having an ascending order of test cases for each dependency degree, an offline order for the execution of test cases is generated. This means that, starting from the lowest calculated dependency degree, the test cases can be selected for execution in the order determined for them using FAHP (i.e., the ordered set). This is repeated for the subsequent dependency degrees. In the online phase, the result of each test case execution is also taken into account in the selection of the next test case(s) for execution. The steps performed in the online phase are as follows: the first item in the set of test cases with the lowest dependency degree is selected and executed; then the next item in the same set is executed, until there is no item left; then the ordered set of test cases at the next dependency degree greater than the previously selected one is considered. During the whole process, after the execution of each test case and based on its result, the executability conditions of the test cases that depend on it are re-evaluated. In selecting test cases from the ordered set at each dependency degree, if the executability condition of a test case is false, it is skipped and not selected for execution.
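The online phase just described can be summarised as a small selection loop. The sketch below is a simplified illustration of ours, not the authors' implementation; the data structures and the stand-in run_test callback (which replaces the real, manual execution step) are assumptions.

```python
def _holds(cond, verdicts):
    # cond: test-case name or ("AND"/"OR", [sub-conditions]); unexecuted tests default to true.
    if isinstance(cond, str):
        return verdicts.get(cond, True)
    op, parts = cond
    vals = [_holds(p, verdicts) for p in parts]
    return all(vals) if op == "AND" else any(vals)

def online_phase(ordered_sets, conditions, run_test):
    verdicts = {}
    for degree in sorted(ordered_sets):           # ascending dependency degree
        for tc in ordered_sets[degree]:           # FAHP order within each set
            cond = conditions.get(tc)
            if cond is not None and not _holds(cond, verdicts):
                print(f"skip {tc}: executability condition is false")
                continue
            verdicts[tc] = run_test(tc)           # True = pass, False = fail
    return verdicts

# Canned verdicts stand in for manual execution in this example.
outcomes = {"A": False, "B": False, "C": True, "D": True}
sets = {1: ["A", "B", "C"], 2: ["D"]}
conds = {"D": ("OR", ["A", ("AND", ["B", "C"])])}
print(online_phase(sets, conds, lambda tc: outcomes[tc]))
# D is skipped because both A and B have failed; A, B and C keep their verdicts.
```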
5.4 Industrial Case Study
We have started validating our approach at Bombardier Transportation AB (BT) in Sweden. BT develops and manufactures trains and railway equipment. Reliability and safety of the train control management system, along with all integrated functions, is of great importance for BT. We plan to conduct a series of case studies to continuously adapt and improve our approach for BT. A case study represents a good choice as a research method because we need to develop a deeper understanding of decisions impacting test efficiency at BT. Furthermore, as our final objective is to improve the current state of testing practice at BT and it may involve different kinds of evidence, case study research is further justified [13]. This section presents the results of a case study where we evaluate the feasibility of our proposed approach. The objective of this case study is to understand the existing order of test execution at BT and how our approach is expected to impact test efficiency. We have selected a running project at BT as our case. The project is selected to fit the case study objectives, as we wanted to observe and track the order of test execution. Moreover, our units of analysis are limited to two sub-level function groups (SLFGs): brake system and air supply. A SLFG is a grouping of functions related to a key functional requirement; other examples of SLFGs include aerodynamic performance, propulsion and auxiliary power.

No. | Test case ID | Associated SLFG
1 | Drive-S-IVV-046 | Brake system
2 | Speed-S-IVV-005 | Brake system
3 | ExtDoors-S-IVV-011 | Air supply
4 | ExtDoors-S-IVV-022 | Air supply
5 | Brake-IVV-031 | Brake system
6 | Brake-IVV-041 | Brake system
7 | Drive-S-IVV-024 | Air supply
8 | Speed-IVV-004 | Air supply
9 | Drive-IVV-030 | Brake system
10 | Brake-IVV-044 | Brake system
11 | Brake-S-IVV-042 | Brake system
12 | Drive-S-IVV-011 | Air supply

Table 5.3: Test case IDs with associated SLFG.

Brake system and air supply SLFGs were selected as a matter of convenience, since the test cases for them were ready to be executed as part of the running project at BT. Moreover, these SLFGs represent two of the critical function groups in a train control management system, having inter-dependencies that
must be tested. Our current context is limited to a set of only 12 integration test cases, but these test cases are expensive to run in terms of time (approximately 1 hour per test case) since they cover coarse requirements. Moreover, these test cases are run at a sub-system level, meaning that they are more time consuming to run than tests at unit level [14], and they run on a limited number of expensive simulators; therefore, re-running them due to unintended failures is costly, as simulators are kept busy waiting for other test cases to execute. Table 5.3 lists the test cases used in this case study along with their associated SLFG. We have retained the test case IDs used at BT for brevity. The data collection for the case study was done using participant observation and a questionnaire, as well as taking help from archival data for finding the cause of test case failures.

As shown in Figure 5.2, our approach is usable in two phases: offline and online. In the beginning of the offline phase, a test expert at BT answered a questionnaire where the test dependencies were identified based on requirements. The mapping of these dependencies resulted in two dependency graphs, as shown in Figure 5.6. As given in Section 5.2.1, the white, grey and black nodes in Figure 5.6 show first class (independent), second class and third class test cases. Also, in the current set of test cases, we only have AND situations but no OR situations. Using Equations 5.1, 5.2 and 5.4, the dependency degree for each test case is also calculated, given in Table 5.4.

Dependency degree | Set of test cases
1 | {Drive-S-IVV-046, Brake-S-IVV-042, ExtDoors-S-IVV-022, Brake-IVV-041, ExtDoors-S-IVV-011}
2 | {Speed-S-IVV-005, Brake-IVV-031}
3 | {Brake-IVV-044, Speed-IVV-004, Drive-S-IVV-024}
4 | {Drive-IVV-030}
5 | {Drive-S-IVV-011}

Table 5.4: Set of test cases per dependency degree.

As we can see in Table 5.4, there is more than one test case with dependency degrees 1, 2 and 3. To select the best candidates for execution in the online phase, we need to prioritize test cases having the same dependency degree based on existing criteria. As explained in Section 5.3.2, we propose FAHP for prioritizing test cases in this step. In discussions with the test expert at BT, the following criteria have been identified, sorted in descending order of preference for BT:
• Requirements coverage: refers to the number of requirements tested by a test case.
• Time efficiency: the sum of test case creation time, test case execution time and test environment setup time.
• Cost efficiency: refers to the cost incurred by BT in test case configuration (e.g., setting environment parameters, hardware setup) and test case implementation.
• Fault detection probability: refers to the average probability of detecting a fault by each test case.

Figure 5.6: Directed dependency graphs for Brake system and Air supply SLFGs ((a) Directed Dependency Graph 1, (b) Directed Dependency Graph 2).
We need to reiterate that these criteria have different preferences for BT, with requirements coverage being the most important criterion at sub-system level testing. The resulting weights for the mentioned criteria, as calculated through pairwise comparisons between the criteria, are shown in Table 5.5.

Rank | Criteria | Priority
1 | Requirement Coverage | 67.5%
2 | Time Efficiency | 22.5%
3 | Cost Efficiency | 7.5%
4 | Fault Detection Probability | 2.5%

Table 5.5: Pairwise comparisons of criteria.
While there is a possibility to obtain quantitative numbers for some criteria, e.g., requirements coverage, there is always an element of human judgment in estimating them. In order to get expert judgment on these criteria for our set of test cases, five linguistic variables (Figure 5.4) are defined. A questionnaire was designed where the test experts responded with, for each test case, a linguistic variable for each of the criteria. Table 5.6 represents a sample survey questionnaire which has been sent to the test experts at BT; the variables were assigned using pairwise comparisons between the criteria.

Test Case ID | Requirement Coverage | Time | Cost | Fault Detection
Drive-S-IVV-046 | VH | H | L | M
Speed-S-IVV-005 | M | VL | M | M
ExtDoors-S-IVV-011 | VL | H | H | L
ExtDoors-S-IVV-022 | H | L | M | L
Brake-IVV-031 | VL | M | M | L
Brake-IVV-041 | VL | L | M | M
Drive-S-IVV-024 | L | H | H | M
Speed-IVV-004 | L | M | M | M
Drive-IVV-030 | L | H | L | L
Brake-S-IVV-044 | L | H | L | H
Brake-S-IVV-042 | VL | H | M | M
Drive-S-IVV-011 | VL | M | M | H

Table 5.6: A sample with values very low (VL), low (L), medium (M), high (H) and very high (VH).

The linguistic variables have been interpreted as a set of fuzzy numbers. The last step in the offline phase of our approach involves prioritizing test cases with the same dependency degree by using Equations (5.6) to (5.10). The results are shown in Table 5.7.
Dependency degree | Ordered set of test cases (FAHP)
1 | {Drive-S-IVV-046, ExtDoors-S-IVV-011, ExtDoors-S-IVV-022, Brake-S-IVV-042, Brake-IVV-041}
2 | {Speed-S-IVV-005, Brake-IVV-031}
3 | {Drive-S-IVV-024, Brake-IVV-044, Speed-IVV-004}
4 | {Drive-IVV-030}
5 | {Drive-S-IVV-011}

Table 5.7: Ordered set of test cases by FAHP.
We now have an order of execution of the test cases that takes into account test case dependencies along with multiple criteria of importance for BT.
5.4.1 Preliminary results of online evaluation
The objective of the online evaluation is to identify improvement potential in the current ordering of test executions at BT and to assess whether the online phase of our approach will be of any benefit. So far, we have monitored and observed the execution of a subset of our 12 test cases, and the results have given us an early indication of the usefulness of our approach. The subset of monitored tests is: Drive-S-IVV-024, ExtDoors-S-IVV-022, Brake-IVV-031, Brake-IVV-041 and ExtDoors-S-IVV-011, shown in Figure 5.6a. It should be noted that the current way of executing these tests at BT does not follow a dependency structure; rather, the tester selects a test case to execute based on intuition and knowledge regarding whether the associated functionality has been implemented yet. The tester has to configure the simulator in an effort to successfully run a test case, which also includes configuration of any signal inputs that are expected as part of dependencies between test cases. As will be evident shortly, without any systematic way to identify these dependent signals, the current execution of test cases needs multiple runs, which is both time consuming and expensive for BT. We continued monitoring test execution until every test case had a pass verdict. Table 5.8 presents the results of the four runs of test execution that were required to successfully execute the test cases, while also showing the order of execution.
Execution Order | Test Case ID | Execution 1 | Execution 2 | Execution 3 | Execution 4
1 | Drive-S-IVV-024 | Fail | Fail | Fail | Pass
2 | Brake-IVV-031 | Fail | Fail | Pass | —
3 | ExtDoors-S-IVV-011 | Fail | Pass | — | —
4 | ExtDoors-S-IVV-022 | Not Run | Fail | Pass | —
5 | Brake-IVV-041 | Fail | Pass | — | —

Table 5.8: Test execution order - BT.
It is evident that the execution order (column 1 in Table 5.8) has not followed the directed dependency graph shown in Figure 5.6a. In the first execution run (column 3 in Table 5.8), the first test case executed is Drive-S-IVV-024. According to our calculation, the dependency degree for this test case is 3 (see Table 5.4), which means that it is a dependent test case and its successful execution depends on the successful execution of prior test case(s). This test case failed in the first execution run, shown as 'Fail' in Table 5.8 (column 3). The reason for this failure could be that it found a fault, but reading the logged test record reveals that it failed because 'the door lock status failed', which would have been tested earlier by test case ExtDoors-S-IVV-022. While this was the reason mentioned in the test records, we know from Figure 5.6a that the successful execution of Drive-S-IVV-024 also depends on the successful execution of two other test cases (Brake-IVV-031 & Brake-IVV-041). This is the reason why the test case Drive-S-IVV-024 does not pass until the fourth test run, after the test cases it depends on have successfully passed. If the tester had known the correct dependency structure among test cases, the wasted effort in running Drive-S-IVV-024 three times would have been saved. We also measured the test execution time for a single test case; it took approximately one hour to get a verdict (pass or fail). Considering this time, three hours were wasted just on testing the test case Drive-S-IVV-024.

The test record for Brake-IVV-031 in the first execution showed that this test case failed because of the 'signal service brake failure'. The test case specification for ExtDoors-S-IVV-011 explains that this test case tests the signal service brake as well. This is also evident in Figure 5.6a, where Brake-IVV-031 is dependent on the successful execution of ExtDoors-S-IVV-011. Thus it turned out to be wasted effort to execute Brake-IVV-031 before having a pass verdict on ExtDoors-S-IVV-011; in other words, it was a redundant test case to execute. According to our calculations in Table 5.7, the dependency degree for ExtDoors-S-IVV-011, ExtDoors-S-IVV-022 and Brake-IVV-041 is 1, indicating that these are independent test cases. In the first execution run, none of these independent test cases were able to get a pass verdict, due to reasons attributed to faults in test specifications. For ExtDoors-S-IVV-022, the test case could not even get started (indicated as 'Not Run' in Table 5.8), while for ExtDoors-S-IVV-011 and Brake-IVV-031, failures resulted after the test cases had run for approximately
one hour each. For ExtDoors-S-IVV-022, when the fault in the test specification was fixed to enable it to run in the second execution run, it failed again due to another fault in the test specification. This highlights improvement opportunities in the design of test specifications at BT, but that is not a focus of this paper. In the second execution run, the other two independent test cases (ExtDoors-S-IVV-011 and Brake-IVV-041) were able to get a pass verdict, which allowed Brake-IVV-031 to pass in the third execution run. ExtDoors-S-IVV-022 also eventually passed in the third execution once the problem in the test specifications was fixed. The already passed test cases are represented with '—' in Table 5.8. The test case Drive-S-IVV-024 also eventually passed in the fourth execution run, once the test cases it was dependent on had passed.

These are only preliminary results, but they have given us evidence that much time can be saved by incorporating dependency information in ordering test execution. The total estimated time for executing the test cases in Figure 5.6 is approximately 5 hours (one hour per test case), but the total time taken to execute the test cases in Table 5.8 is 45 hours. We need to consider that re-running a test case has additional associated costs, such as troubleshooting the cause of failure, potential updates of the test case implementation, restarting the simulator and potential configuration settings. In this case, 40 hours of testing time was wasted. Given that it takes approximately 0.5 hours to find dependencies in our case, 39.5 hours of testing time could potentially be saved by using our method. Our proposed approach further recommends an ordering of test cases that have the same dependency degree, which promises to further cut down test costs. The early results presented here suggest that efficiency gains can be made using our approach. We, however, need to provide further quantitative evidence in support of our approach by executing the online phase, which is left as future work.
5.5 Discussion & Future Extensions
In our proposed approach, test cases were first categorized based on their dependency degree, resulting in sets of test cases for each dependency degree. As each set can contain one or more test cases, FAHP was introduced to prioritize test cases at each dependency degree. This prioritization was based on a set of test case attributes serving as criteria in the decision-making process. From this perspective, dependency was used in our work as a separate criterion. An alternative way is to use dependencies directly as another criterion in the decision-making process. One can also consider using fuzzy dependency relationships. In other words, in our current approach, a test case is either
independent or not (i.e., binary: 0 or 1). By including concepts from fuzzy logic, the strength of the dependency between test cases can be specified with fuzzy variables mapping to values over the interval [0, 1]. This idea can be particularly helpful in cases where test engineers cannot determine whether two test cases are fully dependent or not. To visualize the dependencies of test cases, a directed graph is used in this paper. However, we did not modify the structure of the graph after it was constructed. Another possible extension could be to update the graph dynamically during test execution (e.g., by removing some edges). Regarding the use of the graph in providing visual hints to testers, we grouped test cases into three classes with respect to their dependency relations (white, grey, and black nodes). We believe this is a useful basis for discussions in a testing team, not only for dependency issues but potentially also for resolving traceability issues.

One interesting future direction is to investigate the opposite form of result dependency. In other words, while here we determined redundancy of test cases based on the fail results of other test cases (i.e., fail based on fail), it would be interesting to consider whether and how, from the pass result of a test case, it can similarly be asserted that the results of some other test cases will have to be pass as well (i.e., pass based on pass). So, in our current work, we start with the test cases with the lowest dependency degree and move to the ones with higher dependency degrees, while considering which fail verdicts will result in the failure of other dependent test cases. For the opposite case, test cases might be considered from the highest dependency degree towards the ones with lower dependency degrees, determining their verdict whenever a test case further along the dependency path has passed. A combination of these two approaches (i.e., fail based on fail & pass based on pass) will be another possible future direction of this research.

The approach is generic and independent of the type of analysis performed to identify dependencies. Currently, we are identifying individual dependencies by interviewing test experts. However, this approach may not be feasible when the number of test cases is larger. As a next step, we are therefore considering analyzing dependencies based on a combination of temporal order and pattern matching applied to historic test record data.
5.5.1 Delimitations
In discussions with BT, four prioritization criteria were agreed upon. But there can be other applicable criteria, e.g., requirements volatility. The increase in the number of criteria is not a limitation of our proposed FAHP approach but it might take more time for pair-wise comparisons. We did not undertake such an analysis in this study. Also, the answers to the criteria were given
by one test expert at BT. There is a risk that another test expert would give different ratings on the criteria, leading to a different prioritization of test cases. However, we minimized this risk by having an experienced test expert who has a long background in testing train control management systems. We have used triangular fuzzy membership functions for evaluating the effect of the identified criteria on each test case. We did not compare other membership functions, e.g., bell-shaped, which might produce a better prioritization of test cases. We used result dependency (fail based on fail) for creating the dependency model. If, in a different context, another type of dependency such as state dependency is considered and is more relevant, the approach might not be applicable as it is and might require some modifications. Moreover, our approach assumes that test case dependencies are identified, either manually or otherwise. We did not assess the cost of identifying these dependencies, but in cases where more complex dependencies exist, an automatic inference and extraction of dependencies is more feasible, see e.g., Arlt and Morciniec [8].
5.6 Related Work
Use of dependency information to prioritize, select and minimize test suites has recently received much attention. Bates and Horwitz [15] use program dependence graphs and slicing with test adequacy data to identify components of the modified program that can be tested using files from the old test suite and to avoid unproductive retesting of unaffected components. Rothermel and Harrold [16] also used slicing with program dependence graphs to identify changed def-use pairs for regression testing. Our approach is black-box in the sense that it is independent of source-code modifications. We do not have access to implementation details of functions, which is realistic for testing at higher levels. Also, we do not address regression testing in particular. Ryser and Glinz [17] propose the use of dependency charts to manage dependencies between scenarios for systematically developing test cases for system test. They differentiate between three types of dependencies: abstraction, temporal and causal, while data and resource dependencies are taken as special cases of causal dependencies. Test cases are derived by traversing paths in the dependency chart, taking into account data and resource annotations and other specified conditions. While test suite reduction or prioritization is not their objective, their approach shows the importance of managing dependencies and interrelations between scenarios for thorough system testing, e.g., trying to break constraints and restrictions.
Zhang et al. [18] challenge the test independence assumption of much of the traditional regression test prioritization (e.g., [19, 20, 21]) and test selection (e.g., [22, 23, 24, 25]) approaches. This assumption stems from the controlled regression testing assumption [26], which states: given a program P and a modified version P', when P' is tested with test t, all factors that may influence the outcome of this test remain constant, except for the modified code in P'. Zhang et al. [18] show that dependent test cases affect the output of five test case prioritization techniques. They further implemented four algorithms to detect dependent tests. An empirical study of 96 real-world dependent tests from 5 software issue tracking systems showed that dependent tests do arise in practice, both for human-written and automatically generated tests. The presence of dependencies between tests is also confirmed by Bell et al. [27] and Luo et al. [28]. Haidry and Miller [29] use the dependency structure of test cases, in the form of a directed acyclic graph, to prioritize test cases. The test cases are prioritized based on different forms of graph coverage values; however, a set of independent tests is arbitrarily prioritized, which leads to lower performance in the case of fewer unconnected tests. The authors emphasize the need to combine dependency with other types of information to improve test prioritization. Our work contributes to fill this gap, whereby test case dependencies along with a number of other criteria are used to prioritize test cases. Caliebe et al. [30] present an approach based on dependencies between components, whereby analysis can be performed on a graph representation of such dependencies. Two applications of their proposed approach are possible: general test case selection and test case prioritization for regression testing. Arlt et al. [8] use logical dependencies between requirements written in a structured format to automatically detect redundant test cases. Their approach is essentially a test suite reduction technique based on the current status of successful and failed tests. While being similar in purpose, our focus is mainly on the steps after the identification of dependencies. Moreover, our proposed approach covers a more general form of dependencies that can address more complex scenarios consisting of AND and OR relations. There has also been previous work on using fuzzy computing approaches for multi-faceted test case fitness evaluation, prioritization and selection. Kumar et al. [31] use a fuzzy similarity measure to filter out unfit and high-ambiguity test cases based on four parameters: statement coverage, branch coverage, fault detection capability and execution time. Tahvili et al. [9] formulate test prioritization as a multi-criteria decision making problem and apply the analytic hierarchy process (AHP) in a fuzzy environment to prioritize test cases. Alakeel [32] presents a test case prioritization approach that uses fuzzy logic to measure the
effectiveness of a given test in violating program assertions of modified programs, while Malz et al. [33] combine software agents and fuzzy logic for automated prioritization of system test cases. Xu et al. [34] use a fuzzy expert system to prioritize test cases based on knowledge represented by customer profiles, analysis of past test case results, system failure rate, and changes in architecture. A similar approach is used by Hettiarachchi et al. [35], where requirements risk indicators such as requirements modification status, complexity, security, and size of the software requirements are used in a fuzzy expert system to prioritize test cases. Schwartz and Do [36] use a fuzzy expert system to choose the most cost-effective regression testing technique for regression testing sessions. While similar to these studies in the use of a fuzzy approach, this paper is unique in combining test case dependencies with multiple criteria.
5.7 Summary & Conclusion
In this paper, we provide the following main contributions: (1) we formally define the dependency degree as a metric to be used in test case prioritization, together with an algorithm for calculating it; (2) we introduce a new approach for dynamic test case selection using the results of executed test cases and their dependency degrees, whereby an offline order based on the dependency of test cases is produced along with a prioritization of test cases using FAHP; (3) we apply the method to an industrial case study of a safety-critical train control subsystem, compare it to the baseline test case execution order, and give a brief analysis of the results. The results of the BT case study show that the concept of 'fail based on fail' is applicable and can reduce test execution time. When the testers did not follow and consider test case dependency relations, some test cases were selected which failed due to the failure of test cases they depend on. Consequently, using our approach will enable higher test execution efficiency by identifying and avoiding such forms of test redundancy.
Acknowledgements

This work was supported by VINNOVA grant 2014-03397 (IMPRINT), the Swedish Knowledge Foundation (KKS) grant 20130085 (TOCSYC) and the ITS-EASY industrial research school. Special thanks to Anders Skytt, Ola Sellin and Kjell Bystedt at Bombardier Transportation, Västerås, Sweden.
Bibliography

[1] M. Saadatmand and M. Sjödin. On combining model-based analysis and testing. In Proceedings of the 10th International Conference on Information Technology: New Generations (ITNG'13). IEEE Computer Society, 2013. [2] S. Tahvili, W. Afzal, M. Saadatmand, M. Bohlin, D. Sundmark, and S. Larsson. Towards earlier fault detection by value-driven prioritization of test cases using fuzzy TOPSIS. In Proceedings of the 13th International Conference on Information Technology: New Generations (ITNG'16). Springer, 2016. [3] A. Srivastava, J. Thiagarajan, and C. Schertz. Efficient integration testing using dependency analysis. Technical Report MSR-TR-2005-94, 2005. [4] Ch. Carlsson and F. Fuller. Fuzzy multiple criteria decision making: Recent developments. Fuzzy Sets and Systems, pages 415–437, 1996. [5] A. Onoma, W. Tsai, M. Poonawala, and H. Suganuma. Regression testing in an industrial environment. Communications of the ACM, 41(5):81–86, 1998. [6] H. Agrawal, J. Horgan, E. Krauser, and S. London. Incremental regression testing. In Conference on Software Maintenance (ICSM'93). IEEE, 1993. [7] International Software Testing Qualifications Board. Standard glossary of terms used in software testing. ISTQB, v2.4 edition, 2014. [8] S. Arlt, T. Morciniec, A. Podelski, and S. Wagner. If A fails, can B still succeed? Inferring dependencies between test results in automotive system testing. In Proceedings of the 8th IEEE International Conference on Software Testing, Verification and Validation (ICST'15). IEEE, 2015.
[9] S. Tahvili, M. Saadatmand, and M. Bohlin. Multi-criteria test case prioritization using fuzzy analytic hierarchy process. In The 10th International Conference on Software Engineering Advances (ICSEA’15). IARIA, 2015. [10] T. Yu-Cheng and L. Thomas. Application of the fuzzy analytic hierarchy process to the lead-free equipment selection decision. International Journal of Business and Systems Research, 5:35–56, 2011. [11] T. Terano. Fuzzy Engineering Toward Human Friendly Systems. Number v. 2. Ohmsha, 1992. [12] D. Chang. Applications of the extent analysis method on fuzzy AHP. European Journal of Operational Research, 95:649 – 655, 1996. [13] P. Runeson and M. Höst. Guidelines for conducting and reporting case study research in software engineering. Empirical Software Engineering, 14(2):131–164, 2009. [14] H. Hemmati, L. Briand, A. Arcuri, and Sh. Ali. An enhanced test case selection approach for model-based testing: An industrial case study. In Proceedings of the 18th ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE’10). ACM, 2010. [15] S. Bates and S. Horwitz. Incremental program testing using program dependence graphs. In Proceedings of the 20th SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL’93). ACM, 1993. [16] G. Rothermel and M. Harrold. Selecting tests and identifying test coverage requirements for modified software. In Proceedings of the International Symposium on Software Testing and Analysis (ISSTA’94). ACM, 1994. [17] J. Ryser and M. Glinz. Using dependency charts to improve scenariobased testing - Management of inter-scenario relationships: Depicting and managing dependencies between scenarios. In the 17th International Conference on Testing Computer Software (ICTSS’20). ACM, 2000. [18] S. Zhang, D. Jalali, J. Wuttke, K. Mu¸slu, W. Lam, M. Ernst, and D. Notkin. Empirically revisiting the test independence assumption. In Proceedings of the 2014 International Symposium on Software Testing and Analysis (ISSTA’14). ACM, 2014.
[19] S. Elbaum, A. Malishevsky, and G. Rothermel. Prioritizing test cases for regression testing. SIGSOFT Software Engineering Notes, 25(5):102–112, 2000. [20] B. Jiang, Z. Zhang, W. Chan, and T. Tse. Adaptive random test case prioritization. In Proceedings of the 2009 IEEE/ACM International Conference on Automated Software Engineering (ASE’09). IEEE Computer Society, 2009. [21] A. Srivastava and J. Thiagarajan. Effectively prioritizing tests in development environment. SIGSOFT Software Engineering Notes, 27(4):97–106, 2002. [22] M. Harrold, J. Jones, T. Li, D. Liang, A. Orso, M. Pennings, S. Sinha, S. Spoon, and A. Gujarathi. Regression test selection for Java software. SIGPLAN Notes, 36(11):312–326, 2001. [23] A. Nanda, S. Mani, S. Sinha, M. Harrold, and A. Orso. Regression testing in the presence of non-code changes. In 4th International Conference on Software Testing, Verification and Validation (ICST’11). IEEE, 2011. [24] H. Hsu and A. Orso. Mints: A general framework and tool for supporting test-suite minimization. In Proceedings of the 31st International Conference on Software Engineering (ICSE’09). IEEE Computer Society, 2009. [25] A. Orso, N. Shi, and M. Harrold. Scaling regression testing to large software systems. SIGSOFT Software Engineering Notes, 29(6):241–251, 2004. [26] G. Rothermel and M. Harrold. Analyzing regression test selection techniques. IEEE Transactions on Software Engineering, 22(8):529–551, 1996. [27] J. Bell, G. Kaiser, E. Melski, and M. Dattatreya. Efficient dependency detection for safe java test acceleration. In Proceedings of the 10th Joint Meeting on Foundations of Software Engineering (ESEC/FSE’15). ACM, 2015. [28] Q. Luo, F. Hariri, L. Eloussi, and D. Marinov. An empirical analysis of flaky tests. In Proceedings of the 22nd SIGSOFT International Symposium on Foundations of Software Engineering (FSE’14). ACM, 2014.
[29] S. Haidry and T. Miller. Using dependency structures for prioritization of functional test suites. IEEE Transactions on Software Engineering, 39(2):258–275, 2013. [30] P. Caliebe, T. Herpel, and R. German. Dependency-based test case selection and prioritization in embedded systems. In Proceedings of the 5th International Conference on Software Testing, Verification and Validation (ICST’12). IEEE, 2012. [31] K. Manoj, Sh. Arun, and K. Rajesh. Fuzzy entropy-based framework for multi-faceted test case classification and selection: an empirical study. IET Software, 8:103–112(9), 2014. [32] A. Alakeel. Using fuzzy logic in test case prioritization for regression testing programs with assertions. The Scientific World Journal, pages 1–9, 2014. [33] C. Malz, N. Jazdi, and P. Gohner. Prioritization of test cases using software agents and fuzzy logic. In Proceedings of the 5th International Conference on Software Testing, Verification and Validation (ICST’12), 2012. [34] Zh. Xu, K. Gao, T. Khoshgoftaar, and N. Seliya. System regression test planning with a fuzzy expert system. Information Sciences, 259:532 – 543, 2014. [35] C. Hettiarachchi, H. Do, and B. Choi. Risk-based test case prioritization using a fuzzy expert system. Information and Software Technology, 69:1– 15, 2016. [36] A. Schwartz and H. Do. A fuzzy expert system for cost-effective regression testing strategies. In Proceedings of the 29th International Conference on Software Maintenance (ICSME’13). IEEE, 2013.
Chapter 6
Paper B: Cost-Benefit Analysis of Using Dependency Knowledge at Integration Testing
Sahar Tahvili, Markus Bohlin, Mehrdad Saadatmand, Stig Larsson, Wasif Afzal and Daniel Sundmark
In the Proceedings of the 17th International Conference on Product-Focused Software Process Improvement (PROFES'16), 2016, Springer.
Abstract

In software system development, testing can take considerable time and resources, and there are numerous examples in the literature of how to improve the testing process. In particular, methods for selection and prioritization of test cases can play a critical role in using testing resources efficiently. This paper focuses on the problem of selecting and ordering of integration-level test cases. Integration testing is performed to evaluate the correctness of several units in composition. Furthermore, for reasons of both effectiveness and safety, many embedded systems are still tested manually. To this end, we propose a process for ordering and selecting test cases based on the test results of previously executed test cases, which is supported by an online decision support system. To analyze the economic efficiency of such a system, a customized return on investment (ROI) metric tailored for system integration testing is introduced. Using data collected from the development process of a large-scale safety-critical embedded system, we perform Monte Carlo simulations to evaluate the expected ROI of three variants of the proposed new process. The results show that our proposed decision support system is beneficial in terms of ROI at system integration testing and thus qualifies as an important element in improving the integration testing process.
6.1 Introduction
The software testing process is typically performed at various levels, such as unit, integration, system and acceptance level testing. At all levels, software testing suffers from time and budget limitations. Improving the testing process is thus essential from both product quality and economic perspectives. Towards this goal, the application of more efficient testing techniques as well as the automation of different steps of the testing process (e.g., test case generation, test execution, etc.) can be considered. For test execution, the decision of which test cases to select and the order in which they are executed can play an important role in improving test efficiency. In our previous work [1], we introduced a technique based on dependencies between test cases and their execution results at runtime. The technique dynamically selects test cases to execute by avoiding redundant test cases. In our technique, identified dependencies among test cases give partial information on the verdict of a test case from the verdict of another one.

In this paper, we present a cost-benefit analysis and a return on investment (ROI) evaluation of the dependency-based test selection proposed in [1]. The analysis is conducted by means of a case study of the integration testing process in a large organization developing embedded software for trains. In particular, we analyze the various costs that are required to introduce our decision support system (DSS) and compare these costs to the achieved cost reductions enabled by its application. To improve the robustness of the analysis, a stochastic simulation of tens of thousands of possible outcomes has been performed. In summary, the paper makes the following contributions:

• A high-level cost estimation model, based on Monte Carlo simulation, for the evaluation of integration test-case prioritization with test-case dependencies. The model is generic and can be used to analyze integration testing for a wide range of systems that exhibit test case dependencies.
• An application of the cost estimation model in an industrial case study at Bombardier Transportation (BT), where three alternatives for process improvement are compared to the baseline test execution order.
• A sensitivity analysis of the model parameter values in the case study. Through the analysis, various scenarios have been identified where the application of the proposed DSS can be deemed as either cost beneficial or not.

The remainder of this paper is structured as follows. Section 6.2 presents the background, while Section 6.3 provides a description of the DSS for test
case prioritization. Section 6.4 describes a generic economic model. Section 6.5 provides a case study of a safety-critical train control management subsystem, and gives a comparison with the currently used test case execution order. In Section 6.6, the results and limitations are discussed and finally Section 6.7 concludes the paper.
6.2 Background
Numerous techniques for test case selection and prioritization have been proposed in the last decade [2], [3], [4]. Most of the proposed techniques for ordering test cases are offline, meaning that the order is decided before execution and the current execution results play no part in prioritizing or selecting test cases to execute. Furthermore, only a few of these techniques are multi-objective, whereby a reasonable trade-off is reached among multiple, potentially competing, criteria.

The number of test cases that are required for testing a system depends on several factors, including the size of the system under test and its complexity. Executing a large number of test cases can be expensive in terms of effort and wall-clock time. Moreover, selecting too few test cases for execution might leave a large number of faults undiscovered. These limiting factors (allocated budget and time constraints) emphasize the importance of test case prioritization in order to identify test cases that enable earlier detection of faults while respecting such constraints. While this has been the target of test selection and prioritization research for a long time, surprisingly few approaches actually take into account the specifics of integration testing, such as dependency information between test cases. Exploiting dependencies between test cases has recently received much attention (see e.g., [5, 6]), but not for test cases written in natural language, which is the only available format of test cases in our context. Furthermore, little research has been done in the context of embedded system development in a real, industrial context, where integration of subsystems is one of the most difficult and fault-prone tasks. Lastly, managing the complexity of integration testing requires online decision support for test professionals as well as trading off between multiple criteria; incorporating such aspects in a tool or a framework is lacking in current research.

The cost of quality is typically broken down into two components: conformance and non-conformance costs [7]. The conformance costs are prevention and appraisal costs. Prevention costs include money invested in activities such as training, requirements and code reviews. Appraisal costs include money spent
on testing, such as test planning, test case development and test case execution. The non-conformance costs include internal and external failure costs. The cost of internal failure includes the cost of test case failure and the cost of bug fixing. The cost of external failure includes costs incurred when a customer finds a failure [8]. This division of the cost of quality is also the basis for some well-known quality cost models such as the Prevention-Appraisal-Failure (PAF) model [9] and Crosby's model [10]. While general in nature, such quality cost models have also been used for estimating the cost of software quality, see e.g., [11, 12, 13].

Software testing is one important determinant of software quality, and smart software managers consider the cost incurred in test-related activities (i.e., appraisal cost) as an investment in quality [8]. However, very few economic cost models of software testing exist; in particular, metrics for calculating the return on testing investment are not well-researched. It is also not clear how the existing software test process improvement approaches [14] cater for software testing economics. One reason for this lack of attention to economics in software quality in general is given by Wagner [15]. According to him, empirical knowledge in the area is hampered by difficulties in gathering cost data from companies, since such data is considered sensitive.

Nikolik [16] proposes a set of test case based economic metrics such as test case cost, test case value and return on testing investment. A test cost model to compare regression test strategies is presented by Leung and White [17]. They distinguish between two cost types: direct and indirect costs. Direct costs include time for all those activities that a tester performs. This includes system analysis cost, test selection cost, test execution cost and result analysis cost. Indirect costs include test tool development cost, test management cost and the cost of storing test-related information. A test cost model inspired by the PAF model is also presented by Black [18], while several cost factors for ROI calculation for automated test tools are given in other studies [19, 20, 21]. Some other related work is done by Felderer et al. [22], [23], where they develop a generic decision support procedure for model-based testing in an industrial project and compare estimated costs and benefits throughout all phases of the test process.
6.3 Decision Support System for Test Case Prioritization
In this section we outline our proposed DSS, which prioritizes and selects integration test cases based on analysis of test case dependencies. Although not the focus of this paper, the DSS is also capable of performing multi-criteria
decision analysis. The details of the approach can be found in [1]. In essence, the DSS provides an optimized order for execution of test cases by taking into account the execution result of a test case, its dependency relations and various test case properties. The steps performed in the DSS can be categorized into an offline and an online phase: the offline phase produces an order for execution of test cases based on different test case properties (e.g., fault detection probability, execution time, cost, requirement coverage, etc.), while in the online phase, the pass or fail verdict of executed test cases is taken into account in order to identify and exclude upcoming test cases based on knowledge of dependencies between executed and scheduled test cases.

The following definition of result dependency for integration test cases, first introduced in [1], constitutes the basis of the dependency-based prioritization considered in this paper:

Definition 6.1. For two test cases A and B, B is dependent on A if, from the failure of A, it can be inferred that B will also fail.

In industrial practice, such dependencies may exist, e.g., whenever a subsystem uses the result of another subsystem. During testing, the dependency may manifest whenever a test case B, dependent on test case A, is scheduled for execution before a component, tested by A, has been fully and adequately implemented and tested. By delaying the execution of B until A has passed, we ensure that the prerequisites for testing B are met. For instance, if the power system in a train fails to work, the lighting and air conditioning systems will not function either.
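To make Definition 6.1 concrete, the following minimal sketch shows one way such an executability check could be encoded; the function name and the dictionary-based dependency encoding are illustrative assumptions and not part of the BT tooling.

# Minimal sketch of Definition 6.1: a test case should be delayed if any test
# case it depends on has failed in the current cycle. The encoding (a mapping
# from a test case to the set of test cases it depends on) is hypothetical.

def executable(test_case, depends_on, failed):
    """True if no test case that `test_case` depends on has failed so far."""
    return not (depends_on.get(test_case, set()) & failed)

# Example: B (lighting) depends on A (power), C (air conditioning) depends on B.
depends_on = {"B": {"A"}, "C": {"B"}}
failed = {"A"}                                   # A failed during execution
print(executable("B", depends_on, failed))       # False -> delay B until A passes
print(executable("C", depends_on, failed))       # True for this direct check; a
# transitive variant would also delay C, because its prerequisite B is delayed.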
6.3.1 Architecture and Process of DSS
In this section, we give the basic architecture and process of the decision support system [1]. We use the term 'decision support system' to emphasize that it can be instantiated in contexts similar to ours, i.e., test cases written in natural language, meant for testing the integration of subsystems in embedded system development. In Figure 6.1, we describe the steps of the semi-automated decision support system for optimizing integration test selection.

Figure 6.1: Architecture of the proposed online DSS.

New test cases are continuously collected in a test pool (1) as they are developed. The test cases are initially not ordered and are candidates for prioritization. As a preparation for the prioritization of the test cases, the values for a selected set of criteria need to be determined (2) for the new test cases. To prioritize among the test cases, the DSS expects the use of a multi-criteria decision making technique (3, prioritization) [24]. Once prioritized, the test cases are executed
(preferably) according to the recommended order (4). The result of executing each test case is either Pass or Fail. We have previously shown in [1] that by detecting the dependencies between test cases, we are able to avoid redundant executions. When a test case fails during execution, all test cases dependent on it should be disabled for execution. The failed test cases from the previous step enter a queue for troubleshooting and are reconsidered for execution once the reason for their failure is resolved (5). The result of each test is monitored (6) to enable (re-)evaluation of the executability condition (see [1]) of the test cases that depend on it; this determines whether a dependent test case should be selected for execution or not. Furthermore, the completeness of the testing process is monitored through the fault failure rate metric (see [25]), denoted by λ in Figure 6.1. This metric is the proportion of failed test cases to the total number of executed test cases. The goal is to successfully execute as many test cases as possible in order to finish the current test execution cycle; the cycle stops (7) once the fault failure rate becomes 0. The steps in the DSS are performed in the integration phase for each iteration/release of the system. This ensures that the relevant tests are executed in a suitable sequence, and that the test resources are used in an optimal way.
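As a rough illustration of the online phase just described (steps 4–7 in Figure 6.1), the sketch below executes a prioritized list, disables test cases whose prerequisites failed, and reports the fault failure rate; all names are assumptions made for the example.

# Illustrative online loop: run test cases in the recommended order, skip the
# ones whose prerequisites have failed, and monitor the fault failure rate.

def run_cycle(ordered_tests, depends_on, execute):
    """`execute(tc)` runs one test case and returns True on Pass, False on Fail."""
    failed, executed = set(), 0
    for tc in ordered_tests:
        if depends_on.get(tc, set()) & failed:
            continue                      # step 5: re-consider in a later cycle
        executed += 1                     # step 4: execution
        if not execute(tc):
            failed.add(tc)                # step 6: monitor verdicts
    fault_failure_rate = len(failed) / executed if executed else 0.0
    return failed, fault_failure_rate     # step 7: the cycle stops when the rate is 0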
6.4 Economic Model
In this section we describe an economic model for the cost-benefit analysis of a software integration testing process where test cases are delayed until their execution requirements are fulfilled. The model is independent of the specific
multi-criteria decision making technique used and of the specific system under test. The purpose of the economic model is to adequately capture the costs and benefits which are directly related to the adoption of a DSS-supported testing process aiding in the prioritization of test cases. In industry, it is common that time and cost figures for software development processes only exist as rough estimates or averages, if at all. Finally, for an analysis to be useful in practice, the analysis model should be reasonably lightweight and contain only the absolutely necessary parameters. Using the proposed economic model, a stochastic ROI analysis can then be obtained by Monte Carlo simulation. A stochastic analysis avoids the sensitivity of traditional ROI analyses by considering a large number of possible parameter values, thereby offsetting some of the disadvantages of being forced to use less reliable parameter value estimates.

In this paper, we use a cost model with the following parameters:

1. A one-time fixed DSS implementation cost, corresponding to a fixed-price contract negotiated beforehand,

2. Variable costs for DSS training, on a per person-day of DSS usage basis,

3. Variable costs for DSS maintenance, on a per person-day of DSS usage basis,

4. Variable costs for (a) DSS data collection and (b) test planning, on a per test-case basis, and

5. Variable costs for (a) test-case execution (per executed test case) and (b) troubleshooting (per failed test case).

We make the following simplifying assumptions on the testing process to be analyzed:

• A test case is executed at most once in each release cycle,

• If a test case fails, it is delayed until the next release cycle,

• Reliability of an already implemented and properly maintained system grows according to a simplified Goel-Okumoto model [26], and

• Test execution and troubleshooting efforts are independent of each other and between test cases.
In the model, we only include the costs and benefits which are affected by using the DSS, and hence do not need to consider other efforts, such as the effort of testing for test cases that pass or that fail for reasons other than dependency. The following cost model is used for the approach in each release cycle t:

C_t = C_t^I + d_t(C_t^T + C_t^M) + n_t(C_t^D + C_t^P) + γ_t λ_t n_t (C_t^E + C_t^B),   (6.1)

where C_t^I is the implementation cost, d_t is the number of person-days in t, C_t^T is the training cost, C_t^M is the maintenance cost, n_t is the number of test cases, C_t^D is the data collection cost, C_t^P is the test order planning cost (including possible preprocessing, test case data input, DSS runtime, post-processing and test case prioritization output), λ_t is the fraction of failed test cases in the release cycle, γ_t is the fraction of test cases that failed due to a fail-based-on-fail dependency (out of the failed test cases), C_t^E is the average test execution cost, and C_t^B is the average troubleshooting cost. The last term, γ_t λ_t n_t (C_t^E + C_t^B), captures the cost of unnecessarily running test cases which will surely fail due to dependencies. This is the only runtime cost we need to consider when comparing a DSS-supported process and the baseline process without DSS support, as the other costs for running test cases remain the same.

Over the course of time, the maintenance cost for a deployed system with a small number of adaptations will be approximately proportional to the failure intensity of the system. In this paper, we therefore assume that the DSS software reliability grows according to a simplified Goel-Okumoto model (see [26]), i.e., that the failure intensity decreases exponentially as λ(d) = λ_0 e^(-σd), where λ(d) is the failure intensity at time d, λ_0 is the initial failure intensity, and σ is the rate of reduction. Further, for any release cycle t, each test case belonging to t is assumed to be tested exactly once within t. It is therefore reasonable to assume that there is no decrease in the test case failure rate during any single release cycle. Under these assumptions, the expected maintenance cost in release cycle t can be calculated as

C_t^M = C_0^M · D_t · e^(-σ D_t),   (6.2)

where C_0^M is the initial maintenance cost and D_t = Σ_{i=1}^{t} d_i is the total project duration at release cycle t, with d_i the duration of a single cycle i. Apart from the implementation cost C^I, there are other unrelated infrastructure costs which are not affected by the process change and can therefore be disregarded. Likewise, full staff costs are not included, as the team size remains constant; we instead focus on measuring the savings in work effort cost from a change in process. In the model in Eq. (6.1), the savings of
a process change taking test case dependencies into account can be measured as a difference in costs, under the assumption that all other costs are equal. As each integration test case is normally executed once in each release cycle, after each cycle there is a set of failed test cases that needs to be retested in the next cycle. In this paper, we are interested in estimating the economic benefits of delaying test cases whose testability depends on the correctness of other parts of the system. In other words, we are interested in estimating the number of test cases which fail solely due to dependencies, because for these test cases we can save execution and troubleshooting efforts. If a fraction γ_t · λ_t of the test cases fail due to dependencies, then, from Eq. (6.1), we have that by delaying the execution of such test cases, the saving (i.e., benefit) of changing the process can be at most

B_t = γ_t λ_t n_t (C_t^E + C_t^B).   (6.3)

The estimate is an upper bound, as in reality we may not be able to capture all dependencies in our analysis. Further, there is a possibility that the analysis falsely identifies dependencies which do not exist. The effect of delaying the corresponding test cases to the next phase is to delay the project slightly; however, this effect is likely small and we therefore disregard it in this paper.
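A small numeric sketch of Eqs. (6.1)–(6.3) follows. The grouping of the per-day and per-test-case terms reflects one reading of the cost bases listed above, and the function names are invented for the example; treat it as an interpretation of the model rather than a verbatim transcription.

import math

def maintenance_cost(c_m0, total_days, sigma):
    # Eq. (6.2): C_t^M = C_0^M * D_t * exp(-sigma * D_t)
    return c_m0 * total_days * math.exp(-sigma * total_days)

def cycle_cost(c_i, d, c_t, c_m, n, c_d, c_p, gamma, lam, c_e, c_b):
    # Eq. (6.1): implementation cost + per-day training/maintenance +
    # per-test-case data collection/planning + redundant executions
    return c_i + d * (c_t + c_m) + n * (c_d + c_p) + gamma * lam * n * (c_e + c_b)

def cycle_benefit(gamma, lam, n, c_e, c_b):
    # Eq. (6.3): execution and troubleshooting effort saved by delaying the
    # test cases that would fail solely because of dependencies
    return gamma * lam * n * (c_e + c_b)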
6.4.1 Return on Investment Analysis
A return on investment (ROI) analysis represents a widely used approach for measuring and evaluating the value of a new process and product technology [27]. In this study we consider all costs directly related to the process change to be part of the investment cost. If we also assume that the sets of test cases to execute for all release cycles are disjoint, we can calculate the total costs and benefits by adding the costs and benefits for each release cycle. We use the following ROI model, based on the net present value of future cash flows until time T and an interest rate r:

R_T = ( Σ_{t=0}^{T} B_t (1+r)^(-t) ) / ( Σ_{t=0}^{T} C_t (1+r)^(-t) ) − 1.   (6.4)

We assume that the implementation cost is paid upfront, so that C_t^I = 0 when t ≥ 1, and that there are no other benefits or costs at time t = 0. In other words, B_0 = 0, C_0 = C_0^I and, consequently, R_0 = −1. The interest rate r is used to discount future cash flows and is typically the weighted average cost of capital for the firm, i.e., the minimum rate of return that investors expect to provide the needed capital.
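The ROI model in Eq. (6.4) can be computed as below; the list-based interface is an assumption made only for the example.

# Net-present-value based ROI (Eq. 6.4): benefits and costs of each release
# cycle are discounted with interest rate r before the ratio is formed.

def roi(benefits, costs, r):
    """benefits[t] and costs[t] are the cash flows of release cycle t = 0..T."""
    npv_b = sum(b / (1 + r) ** t for t, b in enumerate(benefits))
    npv_c = sum(c / (1 + r) ** t for t, c in enumerate(costs))
    return npv_b / npv_c - 1

# With B_0 = 0 and C_0 = C_0^I (implementation paid upfront),
# roi([0.0], [1650.0], 0.05) evaluates to -1, matching R_0 = -1 above.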
6.5 Case Study
In order to analyze the economic feasibility of our approach, we carried out a case study at Bombardier Transportation (BT) in Sweden, inspired by the guidelines of Runeson and Höst [28] and specifically the way these guidelines are followed in the paper by Engström et al. [4]. We investigated the software/hardware integration testing process for the train control management subsystem (TCMS) in the Trenitalia Frecciarossa 1000, a non-articulated high-speed trainset. The process aims to identify faults at the interface of software and hardware. The case study spanned six releases of 13 major and 46 minor function groups of the TCMS during a time period of 2.5 years, which involved in total 12 developers and testers and a total testing time of 4440 hours.

The testing process is divided into different levels of integration, following a variation of the conventional V-model. The integration tests are performed manually, both in a simulated environment and in a lab in the presence of different equipment such as motors, gearboxes and related electronics. The testing for each release has a specific focus, and therefore, there are only minor overlaps between the test cases in different releases. Each test case has a specification in free-text form and contains information (managed using IBM Rational DOORS) on the:

• test result,
• execution date and time,
• tester ID, and
• testing level.

The test result is one of the following:

• Failed (i.e., all steps in the test case failed),
• Not Run (i.e., the test case could not be executed),
• Partly Pass (i.e., some of the steps in the test case passed, but not all), and
• Pass (i.e., all steps in the test case passed).

According to the test policy in effect, all failed test cases (including "Not Run" and "Partly Pass" test cases) should be retested in the next release. Furthermore, each of these initiates a troubleshooting process that incurs cost and effort. In the rest of this paper, we therefore use the term failed to mean any test verdict
except "Pass". The objective of the case study is to analyze the improvement potential for the integration testing process at BT from decreasing the number of unsuccessful test executions using knowledge of test-case dependencies. The chosen method is to estimate the economic effect on BT in the form of earned ROI using Monte Carlo simulation. We answer the following research question in this case study:

• RQ. What is the economic effect of introducing a DSS for reducing the number of unsuccessful integration test executions based on dependency knowledge?

The data collection for the case study was done through expert judgment, inspection of documentation and a series of semi-structured interviews. The initial parameter value estimates for C^I, C^T and C^M were made by the author team, as it was judged that the members of the team, having deployed several decision support systems in the past (see e.g., [29, 30]), possessed the necessary experience for this evaluation. Likewise, C^P was estimated by the research team through multiple meetings and re-evaluations. The documentation consists of the test case specifications and verdict records in DOORS. In particular, the fault failure rate (λ) was calculated directly by counting the corresponding test case reports, and the fraction of dependent test cases (γ) was estimated through manual inspection of the comments in the same set of reports. Finally, a series of semi-structured interviews was conducted both to estimate the parameter values for the testing process itself and to cross-validate the full set of parameter values already identified. The interviews were conducted with two testers (T1 & T2), a developer (D), a technical project leader (PL), a department manager (DM) and an independent researcher (R) in verification and validation. The composition and main purpose of the interviews are shown in Table 6.1. The final parameter values can be found later in this paper in Table 6.2 and Table 6.3.

Table 6.1: Series of interviews to establish parameter values. The purposes covered, in interview order, were: estimate C^D from a dependency questionnaire; identify criteria for dependencies; validate C^D; validate dependencies; validate the number of dependencies (γ); estimate C^E; estimate C^B; validate C^E, C^B, C^I, C^T, C^M and C^P; and validate C^I, C^T and C^M.
6.5.1 Test Case Execution Results
To estimate the number of result dependencies between test cases, we performed a preliminary analysis of the results for 4578 test cases. The analysis was based on an early discussion with testing experts, in which test result patterns that were likely to indicate a dependency were identified. The patterns have previously been independently cross-validated on a smaller set of 12 test cases by other test experts at BT (see [1]). We classified the test results using the same patterns, which yielded 823 possible dependency failures out of 1734 failed test cases, giving a total estimate of γ ≈ 0.475. In the semi-structured interviews, two testers independently estimated that approximately 45% of the failed test cases were caused by dependencies, which is close to our estimate. Table 6.2 shows the full results for the six release cycles.

Parameter                   | Release 1 | Release 2 | Release 3 | Release 4 | Release 5 | Release 6 | Total
Working days (d)            | 62        | 89        | 168       | 65        | 127       | 44        | 555
Test cases (n)              | 321       | 1465      | 630       | 419       | 1458      | 285       | 4578
Fault failure rate (λ)      | 0.545     | 0.327     | 0.460     | 0.513     | 0.346     | 0.246     | 0.379
Fail-based-on-fail rate (γ) | 0.411     | 0.267     | 0.393     | 0.753     | 0.630     | 0.457     | 0.475

Table 6.2: Quantitative numbers on various collected parameters per release. Note that the γ rate is reported as a fraction of the fault failure rate (λ).
6.5.2 DSS Alternatives under Study
We analyzed three different DSS variants, which all prioritize the test cases by aligning them with the identified dependencies, but which vary in the amount of automation they offer. The goal was to identify the tool-supported process change which is most cost-effective (as measured by the ROI metric) within a reasonable time horizon. The following DSS variants were considered:

• Manual version: prioritization and selection of test cases at the integration testing level is performed manually. In this version, a questionnaire on test execution time, troubleshooting time and the set of dependencies to other test cases is sent to the testing experts. To be manageable for an individual, the questionnaire is partitioned into smaller parts according to the subsystem breakdown. To increase precision and decrease recall, it is preferable that several experts answer the same questionnaire; however,
the exact number should be decided based on the experience level of the testers. One of the commercially and publicly available toolboxes for multi-criteria decision analysis (such as FAHP or TOPSIS) is then used for prioritization of the test cases. Data is fed manually into and out of the DSS, and a spreadsheet is used to filter, prioritize and keep track of the runtime test case pool.

• Prototype version: dependencies are collected as in the manual version. However, the DSS is custom-made to read the input in a suitable format, to automatically prioritize, filter and keep track of the runtime test case pool, and to output a testing protocol outline suitable for the manual integration testing process.

• Automated version: in addition to the prototype version, the DSS detects the dependencies automatically by utilizing a publicly available toolbox (such as Parser [31]). The criteria determination step (in Figure 6.1) would be applied to the test cases by utilizing learning algorithms (for example, a counting algorithm that calculates the number of test steps in a test case in order to estimate its execution time; a toy sketch of such a heuristic is given at the end of this subsection).

As explained earlier in Section 6.4, we divide the total cost needed for software testing into fixed (one-time) and variable costs. The fixed cost includes the DSS cost for the three versions, which comprises implementation, maintenance and training costs. The variable cost contains the execution cost and also the troubleshooting cost for the failed test cases. The variable cost changes in proportion to the number of executed test cases and the number of failed test cases per project.
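As one toy example of the kind of lightweight heuristic mentioned for the automated version, a step-counting estimator could look like this; the regular expression and the per-step time are invented for the illustration and are not BT values.

import re

# Hypothetical heuristic: approximate the execution time of a manual test case
# by the number of numbered steps in its specification times an assumed
# average time per step.

STEP_PATTERN = re.compile(r"^\s*\d+[.)]\s", re.MULTILINE)   # lines starting "1.", "2)", ...

def estimated_execution_time(spec_text, minutes_per_step=5.0):
    steps = len(STEP_PATTERN.findall(spec_text))
    return steps * minutes_per_step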
6.5.3 ROI Analysis Using Monte-Carlo Simulation
The three versions of the DSS were evaluated on the six release cycles described before. Like many other mathematical methods, ROI analyses are sensitive to small changes in the input parameter values. As a result, the calculated ROI can fluctuate depending on the varying time estimates. For this reason, we chose both to evaluate the ROI model above using Monte Carlo simulation, as detailed below, and to perform a sensitivity analysis by varying the expected value of some of the time estimates, as detailed in the results section. The parameters for the three versions are shown in Table 6.3.
Parameter | Distribution | Distribution parameter (Manual / Semi / Auto) | Comment
γ_t       | Constant     | see Table 6.2                                 | Failed dependency TC rate
λ_t       | Poisson      | see Table 6.2                                 | Failed TC rate
C^E       | Rayleigh     | 2 / 2 / 2                                     | TC execution time, per TC
C^B       | Rayleigh     | 4 / 4 / 4                                     | TC troubleshooting time, per TC
C^I       | Rayleigh     | 120 / 825 / 1650                              | Total implementation time
C^T       | Rayleigh     | 540 / 360 / 360                               | Training time, per year
C^M       | Rayleigh     | 40 / 165 / 330                                | Maintenance time, per year
C^D       | Rayleigh     | 69.2 / 69.2 / 0.00                            | DSS data collection time, per TC
C^P       | Rayleigh     | 32.3 / 5.37 / 0.00                            | DSS run time, per TC

Table 6.3: DSS-specific model parameters and distributions.

The focus on initial analysis means that estimation efforts should be kept low. For this reason, a single-parameter Rayleigh distribution, which is the basis
in the Putnam model (see [32], [33]), was chosen for the distribution of effort for software testing and implementation tasks. Test-case failures were sampled from a Poisson distribution.
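A hedged sketch of one Monte Carlo replication is shown below. Whether the values in Table 6.3 are Rayleigh scale parameters or expected values is not stated here, so treating them as the NumPy `scale` argument is an assumption, as are the function and variable names.

import numpy as np

rng = np.random.default_rng()

def simulate_cycle(n_tests, lam, gamma, scale_exec, scale_bug):
    """One replication of the avoidable effort in a release cycle (cf. Eq. 6.3)."""
    n_failed = rng.poisson(lam * n_tests)          # failed test cases in the cycle
    n_dep_failed = int(round(gamma * n_failed))    # failures caused by dependencies
    exec_effort = rng.rayleigh(scale_exec, n_dep_failed).sum()   # execution effort
    bug_effort = rng.rayleigh(scale_bug, n_dep_failed).sum()     # troubleshooting effort
    return exec_effort + bug_effort                # effort saved by delaying these tests

# Repeating this over the six release cycles and many replications yields the
# distribution of costs, benefits and ROI that is summarized in the results.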
[Figure: ROI results for the three DSS alternatives: manual, semi-automatic and automatic.]
10 and remain unchanged after D = 22 due to the lack of larger distances in the ground truth dependency graph. The results for 11 ≤ D ≤ 22 have therefore been omitted from the presented results. To clarify the presented results in Table 9.1, we analyze the results for HDBSCAN at D = 2. A distance of 2 means that, for the ground truth, if two test cases are to be considered dependent, they have to depend on the same requirement (as illustrated in Figure 9.4a). If the shortest path between two test cases is 3 or more, the test cases are considered independent of each other. For HDBSCAN at D = 2 we obtain the following counts (a numerical check of these counts is given after the list):

• True positive (TP): 390 of the test case pairs that are dependent in the ground truth are also clustered together by the HDBSCAN algorithm.

• True negative (TN): 1524482 of the test case pairs that are not dependent in the ground truth are correctly identified as independent test case pairs.
• False positive (FP): 1758 of the test case pairs that are not dependent in the ground truth are incorrectly clustered together.

• False negative (FN): 248 of the test case pairs that are dependent in the ground truth are not clustered together.
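As a quick consistency check, these counts reproduce the HDBSCAN row for D = 2 in Table 9.2; the short snippet below is only illustrative.

# Pairwise confusion counts for HDBSCAN at D = 2 (from Table 9.1).
tp, tn, fp, fn = 390, 1524482, 1758, 248

precision = tp / (tp + fp)                          # ~0.1816
recall = tp / (tp + fn)                             # ~0.6113
f1 = 2 * precision * recall / (precision + recall)  # ~0.2800
print(round(precision, 4), round(recall, 4), round(f1, 4))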
9.6.2 Performance Metric Selection
Choosing suitable performance metrics is critical and also influences the measured performance of our approach. The performance metric used depends on the intended use as well as on the balance between positive and negative instances. A pairwise approach, as used in this paper, exacerbates the relative sparseness of dependent test cases and thus the imbalance of the data set. It is well known that using accuracy as the performance metric for imbalanced data sets can give misleading results [43]. Therefore, we have opted to measure the performance of our proposed approach using Precision, Recall, F1 Score, and Cohen's Kappa [44].

Precision
D  | HDB    | FCM    | RND    | All-Dep
2  | 0.1816 | 0.0030 | 0.0004 | 0.0004
3  | 0.1830 | 0.0044 | 0.0009 | 0.0009
4  | 0.2295 | 0.0134 | 0.0034 | 0.0034
5  | 0.2318 | 0.0218 | 0.0061 | 0.0061
6  | 0.2430 | 0.0396 | 0.0120 | 0.0120
7  | 0.2477 | 0.0537 | 0.0171 | 0.0172
8  | 0.2505 | 0.0755 | 0.0254 | 0.0254
9  | 0.2556 | 0.0899 | 0.0322 | 0.0323
10 | 0.2616 | 0.1062 | 0.0413 | 0.0414

Recall
D  | HDB    | FCM    | RND
2  | 0.6113 | 0.3871 | 0.5298
3  | 0.2781 | 0.2534 | 0.5039
4  | 0.0958 | 0.2127 | 0.4978
5  | 0.0535 | 0.1913 | 0.4974
6  | 0.0286 | 0.1772 | 0.4993
7  | 0.0203 | 0.1672 | 0.4972
8  | 0.0139 | 0.1589 | 0.4995
9  | 0.0111 | 0.1490 | 0.4983
10 | 0.0089 | 0.1373 | 0.4983

F1 Score
D  | HDB    | FCM    | RND    | All-Dep
2  | 0.2800 | 0.0060 | 0.0009 | 0.0008
3  | 0.2207 | 0.0086 | 0.0019 | 0.0018
4  | 0.1352 | 0.0252 | 0.0067 | 0.0067
5  | 0.0870 | 0.0391 | 0.0120 | 0.0121
6  | 0.0511 | 0.0648 | 0.0234 | 0.0237
7  | 0.0375 | 0.0813 | 0.0330 | 0.0338
8  | 0.0263 | 0.1023 | 0.0483 | 0.0496
9  | 0.0213 | 0.1122 | 0.0605 | 0.0625
10 | 0.0172 | 0.1198 | 0.0762 | 0.0795

Cohen's Kappa
D  | HDB    | FCM    | RND
2  | 0.2795 | 0.0052 | 0.0001
3  | 0.2199 | 0.0068 | 0.0000
4  | 0.1334 | 0.0190 | 0.0000
5  | 0.0849 | 0.0285 | −0.0001
6  | 0.0487 | 0.0461 | 0.0000
7  | 0.0350 | 0.0567 | −0.0002
8  | 0.0237 | 0.0703 | 0.0000
9  | 0.0187 | 0.0749 | −0.0002
10 | 0.0145 | 0.0767 | −0.0002

Table 9.2: Precision, Recall, F1 Score, and Cohen's Kappa metrics based on the data from Table 9.1. All values are between −1 and 1; HDB and RND are abbreviations for HDBSCAN and the random classifier.

The mentioned performance metrics are selected based on the following considerations: (i) identifying independent test cases (true negatives) is as important as identifying dependent ones, and (ii) the number of missing dependencies between test cases outweighs the number of dependencies. Precision and Recall are commonly used metrics to measure the accuracy of a binary prediction system. F1 Score is a measure of a test's accuracy, defined as the harmonic mean of Precision and Recall, and does not directly include true negatives (independent test cases). Cohen's Kappa is a measure of agreement between two observers and is less than or equal to 1; values of 0 or less imply a useless classifier [45]. Table 9.2 shows the computed results of the mentioned performance metrics. As outlined before, the
most common dependency degree between test cases is D = 2; to make the results more readable we only present degrees 2 to 10 in Table 9.2 (the performance metrics for degree 1 are equal to 0, and a very high dependency degree is not applicable in industry).
9.6.3 Metric Comparison
In this subsection we examine each of the metrics shown in Table 9.2 individually.

Precision shows the number of correctly detected dependencies over the total number of detected dependencies using our proposed approach. Compared to the other methods, HDBSCAN consistently has the highest value, followed by the FCM method; at D = 2, a fraction of 0.1816 of the dependencies identified by HDBSCAN are detected correctly. For the No-Dep method, Precision is always 0 and has thus been omitted from Table 9.2. As explained before, paying no attention to the dependencies between test cases can lead to unnecessary failures between test cases, which means that, even with a random classifier, the risk of redundant execution is reduced.

Recall shows the number of correctly detected dependencies over the total number of dependencies that should be detected. However (in the same way as Precision), it does not account for true negatives (independent test cases). For the All-Dep approach Recall is always equal to 1 and for the No-Dep approach it is always 0; these columns have therefore been omitted from Table 9.2. For D = 2, HDBSCAN outperforms all other shown approaches, and this correlates with the previous conclusion that the distance D = 2 is the most industrially applicable interpretation of the ground truth.

F1 Score is the harmonic mean of Precision and Recall and is used as a compound measurement. Its best value is 1, corresponding to perfect Precision and Recall. According to Table 9.2, the highest F1 Score is achieved for D = 2 using the HDBSCAN algorithm; however, our alternative clustering approach, FCM, outperforms it from D = 6 onwards. Moreover, both proposed clustering algorithms reach a better F1 Score than the random, All-Dep, and No-Dep classifiers.
Cohen's Kappa can be considered a robust measurement for classifiers on imbalanced data sets [46]. Cohen's Kappa is a coefficient developed to measure agreement among observers [46], and can be calculated as

K = (P_Obs − P_Chance) / (1 − P_Chance),   (9.1)
where P_Obs represents the relative observed agreement among raters and P_Chance is the hypothetical probability of chance agreement given the observed data. As we can see in Table 9.2, using the HDBSCAN algorithm for clustering leads to the largest value of Cohen's Kappa, 0.2795, compared to the FCM and random approaches [45]. Moreover, the All-Dep and No-Dep approaches result in a Kappa of 0 and have thus been omitted from Table 9.2. Comparing the performance measurement results of the proposed clustering algorithms with each other and with the ground truth shows that HDBSCAN generally provides better results for clustering in terms of functional dependencies between test cases. The primary reason for this result is HDBSCAN's ability to identify non-clusterable data points. Note that the performance of the proposed approach is case dependent, which means that in another context, with more structured test case specifications, we might get better results for the mentioned performance metrics. Even though the classification methods are only fairly good, they are still sufficient to make improvements in industrial settings by identifying the functional dependencies.
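Eq. (9.1) can be checked against the pairwise counts reported earlier for HDBSCAN at D = 2; the sketch below treats the ground truth and the clustering as the two observers and is only illustrative.

# Eq. (9.1) evaluated on the pairwise confusion counts of Table 9.1
# (HDBSCAN, D = 2).

tp, tn, fp, fn = 390, 1524482, 1758, 248
n = tp + tn + fp + fn

p_obs = (tp + tn) / n
p_chance = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
kappa = (p_obs - p_chance) / (1 - p_chance)
print(round(kappa, 4))   # ~0.2795, consistent with Table 9.2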
9.6.4 Random Undersampling Strategy for Imbalanced Datasets
The ratio of dependent to independent test case pairs in the ground truth is highly biased towards the independent pairs. As illustrated in Figure 9.7, the imbalance of the ground truth varies depending on the observed maximum dependency depth, from 99.96% for distance 2 to 92.79% for distance 22. This imbalance does not affect the training of our system, since the ground truth is not used for the training. However, two popular evaluation metrics, F1 Score and Accuracy, are tailored towards balanced data [47], and thus we present these metrics for a re-balanced data set. There are several techniques for handling an imbalanced data distribution, such as oversampling, undersampling, cost-sensitive learning, K-fold cross-validation and anomaly detection. In an undersampling approach, a subset of the original data is selected in order to balance the amount of data within each class.
Figure 9.7: The ratio of the dependent and independent test-case pairs in our ground truth dataset.
We utilize a random undersampling approach in this work to select a representative sample.

   | HDBSCAN                                  | FCM
D  | Precision | Recall | Accuracy | F1 Score | Precision | Recall | Accuracy | F1 Score
2  | 0.9981    | 0.6113 | 0.8050   | 0.7582   | 0.8789    | 0.3871 | 0.6668   | 0.5375
3  | 0.9959    | 0.2781 | 0.6385   | 0.4348   | 0.8275    | 0.2534 | 0.6002   | 0.3879
4  | 0.9891    | 0.0958 | 0.5474   | 0.1747   | 0.8021    | 0.2127 | 0.5801   | 0.3363
5  | 0.9804    | 0.0535 | 0.5262   | 0.1015   | 0.7845    | 0.1913 | 0.5694   | 0.3076
6  | 0.9628    | 0.0286 | 0.5137   | 0.0555   | 0.7727    | 0.1772 | 0.5625   | 0.2883
7  | 0.9476    | 0.0203 | 0.5096   | 0.0397   | 0.7653    | 0.1672 | 0.5580   | 0.2745
8  | 0.9287    | 0.0139 | 0.5064   | 0.0273   | 0.7583    | 0.1589 | 0.5541   | 0.2627
9  | 0.9104    | 0.0111 | 0.5050   | 0.0220   | 0.7477    | 0.1490 | 0.5494   | 0.2485
10 | 0.8913    | 0.0089 | 0.5039   | 0.0176   | 0.7340    | 0.1373 | 0.5438   | 0.2313

Table 9.3: The averages across 100 observations of performance metrics after random undersampling to a 50:50 ratio.

We randomly undersampled the majority class (independent test case pairs) to a 50:50 post-sampling class distribution. To avoid random sampling bias, we sampled each group 100 times and computed averages (see Table 9.3); the standard deviations of the metrics are presented in Table 9.4.
Table 9.3 shows that the average F1 Score across 100 observations at D = 2 is 0.7582 for HDBSCAN and 0.5375 for FCM, which is a significant improvement compared with the results before random undersampling. We need to consider that accuracy may not be a good measure when the target classes are imbalanced; in our case, the class ratio in the ground truth (Figure 9.7) is up to approximately 1:2400. After balancing the data distribution, we opted to measure the accuracy metric as follows:

Accuracy = (tp + tn) / (tp + tn + fp + fn).   (9.2)

Accuracy in classification problems is the number of correct predictions made by the model over all predictions [44], but it may not be a suitable metric for imbalanced data. As highlighted earlier, D = 2 is the most common dependency degree between test cases, for which applying the HDBSCAN algorithm leads to around 80% accuracy according to Table 9.3. Since Table 9.3 shows averages across 100 observations, we include Table 9.4, which presents the standard deviations for the same observations.

   | HDBSCAN                                  | FCM
D  | Precision | Recall | Accuracy | F1 Score | Precision | Recall | Accuracy | F1 Score
2  | 0.0019    | 0.0000 | 0.0006   | 0.0005   | 0.0174    | 0.0000 | 0.0044   | 0.0033
3  | 0.0032    | 0.0000 | 0.0004   | 0.0003   | 0.0161    | 0.0000 | 0.0030   | 0.0018
4  | 0.0044    | 0.0000 | 0.0002   | 0.0001   | 0.0092    | 0.0000 | 0.0015   | 0.0008
5  | 0.0056    | 0.0000 | 0.0002   | 0.0000   | 0.0075    | 0.0000 | 0.0012   | 0.0006
6  | 0.0075    | 0.0000 | 0.0001   | 0.0000   | 0.0054    | 0.0000 | 0.0008   | 0.0004
7  | 0.0077    | 0.0000 | 0.0001   | 0.0000   | 0.0046    | 0.0000 | 0.0007   | 0.0003
8  | 0.0111    | 0.0000 | 0.0001   | 0.0000   | 0.0035    | 0.0000 | 0.0005   | 0.0002
9  | 0.0112    | 0.0000 | 0.0001   | 0.0000   | 0.0036    | 0.0000 | 0.0005   | 0.0002
10 | 0.0114    | 0.0000 | 0.0001   | 0.0000   | 0.0033    | 0.0000 | 0.0004   | 0.0002

Table 9.4: Standard deviations of the performance metrics across 100 randomly undersampled observations of the results.
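A minimal sketch of the undersampling protocol used for Tables 9.3 and 9.4 is shown below; it assumes a pairwise data set split into dependent and independent pairs and a user-supplied `evaluate` function that returns a dictionary of metrics, both of which are placeholders.

import random

def undersampled_averages(dependent_pairs, independent_pairs, evaluate, repeats=100):
    """Randomly undersample the majority class to a 50:50 ratio, evaluate the
    metrics on the balanced set, and average over `repeats` repetitions."""
    runs = []
    for _ in range(repeats):
        sample = random.sample(independent_pairs, len(dependent_pairs))
        runs.append(evaluate(dependent_pairs, sample))   # dict of metric values
    return {m: sum(r[m] for r in runs) / len(runs) for m in runs[0]}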
9.7 Threats to Validity
In this section, we discuss threats to validity, limitations and challenges faced in conducting the present study.

Construct validity: the major construct validity threat in this study is using the test specification for detecting the functional dependencies. In some contexts,
the test specification might not exist or the information in it might differ from our case. Furthermore, the proposed approach in this paper is natural-language based and might face difficulties when applied to automated test scripts.

Internal validity: one of the threats is related to the structure of the test specifications at BT. The proposed approach in the present work was applied to a set of well-defined natural language test specifications, which can be analyzed by the Doc2Vec algorithm, implemented in Python. We did not experiment with more complicated test specification structures, for which the Doc2Vec algorithm might not perform as well.

External validity: the proposed approach has been applied to just one industrial testing project in the railway domain. However, the approach should be applicable to other similar contexts, i.e., where manual, semi-structured test specifications exist; further studies should be able to corroborate or refute our findings.

Conclusion validity: this study started in parallel with the integration testing phase of BR490, and the intermediate results of this paper have been presented in a few BR490 project meetings. Moreover, a workshop was conducted by us at BT for the members of the BR490 project. The testers' and engineers' opinions about the detected dependencies between requirements and test cases have been gathered by us from the beginning, resulting in necessary modifications to the proposed approach to make it realistic.
9.8 Discussion and Future Work
One of the most important uses of dependencies and similarities between test cases is test optimization. Automated identification of dependencies, similarities and independencies, and utilizing them for test case scheduling and prioritization, are the main aspects of test optimization. Knowing the dependency information at an early stage of the testing process has the potential to provide a more accurate overview of the required time and human resources for testing a system. This can lead to on-time delivery of the final product and, consequently, cost minimization. Dependency can thus be considered an important factor for minimizing the total cost of a testing project: avoiding unnecessary failures between test cases can lead to lower penalty fees. The dependencies found in
this paper can be used as an input for test scheduling at BT, along with the test cases' execution times and requirements coverage (measured by us previously in [48, 3, 49]). A set of independent and also dependent test cases is then suggested to testers for execution.

As explained in Section 9.4.2, HDBSCAN provides a set of non-clusterable data points as noise (the gray points in Figure 9.3a). According to our hypothesis, non-clusterable data points can be interpreted as independent test cases. From the ground truth (Figure 9.6), a total of 1026 test cases have been identified as independent test cases. HDBSCAN provides 526 non-clustered data points (test cases). When comparing the independent test cases in the ground truth with the non-clustered data points provided by HDBSCAN, a total of 365 test cases were correctly identified as independent. Moreover, by analyzing the values in Table 9.1, we can conclude that 390 dependent test case pairs were correctly detected by HDBSCAN, as well as 1524482 non-dependent test case pairs. In comparison, FCM assigns every point to every cluster with a variable degree of belonging. With each point assigned to the cluster for which it has maximum probability, FCM has overall performed worse than HDBSCAN in our experiments.

As explained in Section 10.1, the cost of re-executing a failed test case at BT is 8 times higher than the normal test execution cost. The true positive and true negative values presented in Table 9.1 show that there is a potential for cost reduction in other testing projects at other companies as well. The most accurate approach to date for dependency detection at BT is presented in [10] and is used as the ground truth. Executing similar test cases at the same time, which may require the same preconditions, initial states and system setups, can lead to testing cost reduction.

Due to the possibility of different interpretations of the ground truth (Section 9.5.2), in this work we have focused only on one variant of our approach, even though it can be fine-tuned using an array of parameters. Hence, as the next step in our research, we plan to develop a generic method for discovering parameters for the deep learning and clustering elements that would yield the optimal dependency reconstruction. In this work we have focused on reconstructing disjoint clusters of interdependent test cases and separating them from the independent test cases, giving us access to a subset of test case scheduling methods. As future work, we plan to develop an approach that will reconstruct the directed dependencies between the test cases, allowing us to use all available test case scheduling methods. As highlighted before, the Doc2Vec algorithm provides a set of high-dimensional data points. Using other dimensionality reduction techniques (t-SNE has been utilized in this work)
such as PCA (Principal Component Analysis), missing values ratio and random forests on the obtained data sets might provide better results for the performance metrics. The idea behind dimensionality reduction techniques is simply to find a low-dimensional set of axes that summarizes the data. Furthermore, grouping the dependent and independent test cases in the ground truth into several clusters and measuring the similarity between the clusters provided in this work and the ground truth clusters can be considered as another direction of the present work. Several approaches have been proposed for measuring the similarity between clusters [50], [51]. The results achieved in this paper may change when test specifications from another testing project are used; providing more structured test cases to the Doc2Vec algorithm might lead to even more accurate dependency detection for manual integration test cases. Additionally, Feldt et al. [52] proposed a similar approach, where cognitive similarity between manual test cases is used for ranking test cases for execution. Using the obtained results in this paper for test scheduling, and comparing the performance of the proposed approaches in [15] and [52], is another future direction of the present study. As the presented results in this paper have potential for industrial usage, we note that Briand et al. [53] emphasize that context-driven research in software engineering is needed today and should be extensively replicated.
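For orientation, the overall pipeline (vectorize the test specifications, reduce dimensionality, cluster, and treat noise points as independent test cases) could be sketched as below. The paper uses a PyTorch Doc2Vec implementation [40] together with t-SNE; the gensim and hdbscan packages and all hyper-parameters here are stand-ins chosen only for illustration.

import numpy as np
import hdbscan
from gensim.models.doc2vec import Doc2Vec, TaggedDocument   # gensim >= 4
from sklearn.manifold import TSNE

def cluster_test_specs(specs):
    """specs: list of natural-language test specification strings."""
    docs = [TaggedDocument(words=s.lower().split(), tags=[i]) for i, s in enumerate(specs)]
    model = Doc2Vec(docs, vector_size=50, min_count=1, epochs=40)   # placeholder settings
    vectors = np.array([model.dv[i] for i in range(len(specs))])
    embedded = TSNE(n_components=2).fit_transform(vectors)          # dimensionality reduction
    labels = hdbscan.HDBSCAN(min_cluster_size=2).fit_predict(embedded)
    return labels   # label -1 marks noise points, interpreted as independent test cases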
9.9 Conclusion
In this paper, we proposed a deep-learning-based natural language processing approach to analyze the correlation between test case text semantic similarities and their functional dependencies. The test cases are interpreted as a set of vectors in n-dimensional space using the Doc2Vec algorithm. Our test cases are written in English; however, the method is language independent [54]. Two clustering algorithms (HDBSCAN and FCM) were applied to the generated vectors in order to classify the dependent test cases into clusters. The proposed approach has been applied to an industrial use case at Bombardier Transportation AB. The performance of the proposed clustering algorithms (HDBSCAN and FCM) is compared with a random classifier, a classifier that always claims dependency, and a classifier that always claims that there is no dependency. Moreover, Precision, Recall, F1 Score, and Cohen's Kappa are selected as the performance metrics. In order to deal with the imbalanced data distribution, the random undersampling technique is applied to the ground truth. We have re-measured the Precision, Recall, and F1 Score metrics together with the Accuracy
metric on the randomly undersampled data sets. The results indicate that using the HDBSCAN algorithm yields an accuracy level of around 80% and an F1 Score of up to 75%. The results of the empirical evaluation show that the proposed approach performs better with the HDBSCAN clustering algorithm.
Bibliography

[1] W. Afzal, S. Alone, K. Glocksien, and R. Torkar. Software test process improvement approaches: A systematic literature review and an industrial case study. Journal of Systems and Software, 111:1–33, 2016.
[2] S. Tahvili, M. Bohlin, M. Saadatmand, S. Larsson, W. Afzal, and D. Sundmark. Cost-benefit analysis of using dependency knowledge at integration testing. In Product-Focused Software Process Improvement: 17th International Conference (PROFES'16). Springer, 2016.
[3] S. Tahvili, M. Saadatmand, S. Larsson, W. Afzal, M. Bohlin, and D. Sundmark. Dynamic integration test selection based on test case dependencies. In The 11th Workshop on Testing: Academia-Industry Collaboration, Practice and Research Techniques (TAIC PART'16). IEEE, 2016.
[4] S. Tahvili, M. Saadatmand, and M. Bohlin. Multi-criteria test case prioritization using fuzzy analytic hierarchy process. In The 10th International Conference on Software Engineering Advances (ICSEA'15). IARIA, 2015.
[5] G. Cartaxo, D. Machado, and O. Neto. On the use of a similarity function for test case selection in the context of model-based testing. Software Testing, Verification and Reliability, 21(2):75–100.
[6] A. Pawar and V. Mago. Calculating the similarity between words and sentences using a lexical database and corpus statistics. Computing Research Repository (CoRR), abs/1802.05667:1–14, 2018.
[7] J. Kwon, C. Moon, S. Park, and D. Baik. Measuring semantic similarity based on weighting attributes of edge counting. In Artificial Intelligence and Simulation (AIS'04). Springer Berlin Heidelberg, 2005.
[8] P. Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI'95). Morgan Kaufmann, 1995.
[9] S. Arlt, T. Morciniec, A. Podelski, and S. Wagner. If A fails, can B still succeed? Inferring dependencies between test results in automotive system testing. In Proceedings of the 8th IEEE International Conference on Software Testing, Verification and Validation (ICST'15). IEEE, 2015.
[10] S. Tahvili, M. Ahlberg, E. Fornander, W. Afzal, M. Saadatmand, M. Bohlin, and M. Sarabi. Functional dependency detection for integration test cases. In Companion of the 18th IEEE International Conference on Software Quality, Reliability, and Security (QRS'18). IEEE, 2018.
[11] S. Haidry and T. Miller. Using dependency structures for prioritization of functional test suites. IEEE Transactions on Software Engineering, 39(2):258–275, 2013.
[12] G. Kumar and P. Bhatia. Software testing optimization through test suite reduction using fuzzy clustering. CSI Transactions on ICT, 1(3):253–260, 2013.
[13] S. Eldh. On Test Design. PhD thesis, Mälardalen University, School of Innovation, Design and Engineering, 2011.
[14] S. Tahvili, W. Afzal, M. Saadatmand, M. Bohlin, D. Sundmark, and S. Larsson. Towards earlier fault detection by value-driven prioritization of test cases using FTOPSIS. In Proceedings of the 13th International Conference on Information Technology: New Generations (ITNG'16). Springer, 2016.
[15] S. Tahvili, L. Hatvani, M. Felderer, W. Afzal, M. Saadatmand, and M. Bohlin. Cluster-based test scheduling strategies using semantic relationships between test specifications. In 5th International Workshop on Requirements Engineering and Testing (RET'18). ACM, 2018.
[16] S. Bates and S. Horwitz. Incremental program testing using program dependence graphs. In Symposium on Principles of Programming Languages (POPL'93). ACM, 1993.
[17] G. Rothermel and M. Harrold. Selecting tests and identifying test coverage requirements for modified software. In Proceedings of the International Symposium on Software Testing and Analysis (ISSTA'94). ACM, 1994.
[18] J. Ryser and M. Glinz. Using dependency charts to improve scenario-based testing management of inter-scenario relationships: Depicting and managing dependencies between scenarios. In Proceedings of the 17th International Conference on Testing Computer Software (TCS 2000). ACM, 2000.
[19] D. Acharya, P. Mohapatra, and N. Panda. Model based test case prioritization for testing component dependency in CBSD using UML sequence diagram. International Journal of Advanced Computer Science and Applications, 1(3):108–113, 2010.
[20] H. Hemmati and F. Sharifi. Investigating NLP-based approaches for predicting manual test case failure. In 11th International Conference on Software Testing, Verification and Validation (ICST'18). IEEE, 2018.
[21] M. Unterkalmsteiner, T. Gorschek, R. Feldt, and N. Lavesson. Large-scale information retrieval in software engineering: an experience report from industrial application. Empirical Software Engineering, 21(6):2324–2365, 2016.
[22] S. Thomas, H. Hemmati, A. Hassan, and D. Blostein. Static test case prioritization using topic models. Empirical Software Engineering, 19(1):182–212, 2014.
[23] D. Flemström, P. Potena, D. Sundmark, W. Afzal, and M. Bohlin. Similarity-based prioritization of test case automation. Software Quality Journal, pages 1–29, 2018.
[24] I. Ahsan, W. Butt, M. Ahmed, and M. Anwar. A comprehensive investigation of natural language processing techniques and tools to generate automated test cases. In Proceedings of the 2nd International Conference on Internet of Things, Data and Cloud Computing (ICC'17). ACM, 2017.
[25] Q. Le and T. Mikolov. Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning (ICML'14). Morgan Kaufmann, 2014.
[26] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems, Volume 2 (NIPS'13). Curran Associates Inc., 2013.
[27] J. Lau and T. Baldwin. An empirical evaluation of doc2vec with practical insights into document embedding generation. In Proceedings of the 1st Workshop on Representation Learning for NLP (RepL4NLP). Association for Computational Linguistics, 2016.
[28] M. Mimura and H. Tanaka. Reading network packets as a natural language for intrusion detection. In Information Security and Cryptology (ICISC'18). Springer, 2018.
[29] V. Phi, L. Chen, and Y. Hirate. Distributed representation-based recommender systems in e-commerce. In Proceedings of the 8th Forum on Data Engineering and Information Management (DEIM'16). Springer Berlin Heidelberg, 2016.
[30] L. Trieu, H. Tran, and M. Tran. News classification from social media using twitter-based doc2vec model and automatic query expansion. In Proceedings of the Eighth International Symposium on Information and Communication Technology (ICTD'16). ACM, 2017.
[31] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. Computing Research Repository (CoRR), abs/1301.3781:1–12, 2013.
[32] M. Pelevina, N. Arefyev, C. Biemann, and A. Panchenko. Making sense of word embeddings. In Proceedings of the 1st Workshop on Representation Learning for NLP (RepL4NLP). Association for Computational Linguistics, 2016.
[33] O. Levy and Y. Goldberg. Neural word embedding as implicit matrix factorization. In Proceedings of the 27th International Conference on Neural Information Processing Systems, Volume 2 (NIPS'13). Curran Associates Inc., 2014.
[34] R. Namayandeh, F. Didehvar, and Z. Shojaei. Clustering validity based on the most similarity. Computing Research Repository (CoRR), abs/1302.3956, 2013.
[35] M. Steinbach, L. Ertöz, and V. Kumar. The Challenges of Clustering High Dimensional Data, pages 273–309. Springer Berlin Heidelberg, Berlin, Heidelberg, 2004.
[36] M. Ahmed, S. Yamany, N. Mohamed, A. Farag, and T. Moriarty. A modified fuzzy c-means algorithm for bias field estimation and segmentation of MRI data. IEEE Transactions on Medical Imaging, 21(3):193–199, 2002.
[37] L. McInnes, J. Healy, and S. Astels. HDBSCAN: Hierarchical density based clustering. Journal of Open Source Software, 2(11), 2017.
[38] J. Bezdek, R. Ehrlich, and W. Full. FCM: The fuzzy c-means clustering algorithm. Computers and Geosciences, 10(2):191–203, 1984.
[39] Electric Multiple Unit Class 490 – Hamburg, Germany. https://www.bombardier.com/en/transportation/projects/project.ET-490Hamburg-Germany.html. [Accessed: 2018-02-13].
[40] N. Ilenic. A PyTorch implementation of paragraph vectors (doc2vec). https://github.com/inejc/paragraph-vectors, 2017.
[41] L. Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008.
[42] L. Hatvani and S. Tahvili. Clustering dependency detection. https://github.com/leohatvani/clustering-dependency-detection, 2018.
[43] H. Han, W. Wang, and B. Mao. Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. In Advances in Intelligent Computing (ICIC'05). Springer Berlin Heidelberg, 2005.
[44] D. Powers. Evaluation: From precision, recall and F-measure to ROC, informedness, markedness & correlation. Journal of Machine Learning Technologies, 2(1):37–63, 2011.
[45] A. Viera and J. Mills Garrett. Understanding interobserver agreement: the kappa statistic. Family Medicine, 37(5):360–363, 2005.
[46] L. Jeni, J. Cohn, and F. De La Torre. Facing imbalanced data: recommendations for the use of performance metrics. In Proceedings of the Humaine Association Conference on Affective Computing and Intelligent Interaction (ACII'13). IEEE Computer Society, 2013.
[47] L. Jeni, J. Cohn, and F. De La Torre. Facing imbalanced data: recommendations for the use of performance metrics. In Proceedings of the 2013 Humaine Association Conference on Affective Computing and Intelligent Interaction (ACII'13), pages 245–251. IEEE Computer Society, 2013.
[48] S. Tahvili, M. Saadatmand, M. Bohlin, W. Afzal, and S. Hasan Ameerjan. Towards execution time prediction for test cases from test specification. In 43rd Euromicro Conference on Software Engineering and Advanced Applications (SEAA'17). IEEE, 2017.
[49] S. Tahvili, W. Afzal, M. Saadatmand, M. Bohlin, and S. Hasan Ameerjan. ESPRET: A tool for execution time estimation of manual test cases. Journal of Systems and Software, 146:26–41, 2018.
[50] G. Torres, R. Basnet, A. Sung, S. Mukkamala, and B. Ribeiro. A similarity measure for clustering and its applications. Proceedings of World Academy of Science, Engineering and Technology, 31:490–496, 2009.
[51] M. Steinbach, G. Karypis, and V. Kumar. A comparison of document clustering techniques. In Workshop on Text Mining (KDD). ACM, 2000.
[52] R. Feldt, R. Torkar, T. Gorschek, and W. Afzal. Searching for cognitively diverse tests: Towards universal test diversity metrics. In Proceedings of the 1st Search-Based Software Testing Workshop (SBST'08). IEEE, 2008.
[53] L. Briand, D. Bianculli, S. Nejati, F. Pastore, and M. Sabetzadeh. The case for context-driven software engineering research: Generalizability is overrated. IEEE Software, 34(5):72–75, 2017.
[54] M. Dehghani, R. Boghrati, K. Man, J. Hoover, S. Gimbel, A. Vaswani, J. Zevin, M. Immordino-Yang, A. Gordon, A. Damasio, and J. Kaplan. Decoding the neural representation of story meanings across languages. Human Brain Mapping, 38(12):6096–6106.
Paper F

Chapter 10

Paper F: sOrTES: A Supportive Tool for Stochastic Scheduling of Manual Integration Test Cases

Sahar Tahvili, Rita Pimentel, Wasif Afzal, Marcus Ahlberg, Eric Fornande, Markus Bohlin
Journal of IEEE Access (IEEE-Access), IEEE, 2018, In revision.
Abstract

The main goal of software testing is to detect as many hidden bugs as possible in the final software product before release. Generally, a software product is tested by executing a set of test cases, which can be performed manually or automatically. The number of test cases required to test a software product depends on several parameters, such as the product type, size and complexity. Executing all test cases in no particular order can lead to a waste of time and resources. Test optimization can provide a partial solution for saving time and resources, which can lead to the final software product being released earlier. In this regard, test case selection, prioritization and scheduling can be considered as possible solutions for test optimization. Most companies do not provide direct support for ranking test cases on their own servers. In this paper we introduce, apply and evaluate sOrTES as our decision support system for manual integration test scheduling. sOrTES is a Python-based supportive tool which schedules manual integration test cases written in natural language text. The feasibility of sOrTES is studied through an empirical evaluation performed on a railway use case at Bombardier Transportation in Sweden. The empirical evaluation indicates that around 40% of test execution failures can be avoided by using the execution schedules proposed by sOrTES, which leads to an increase in the requirements coverage of up to 9.6%.
10.1 Introduction
The crucial role of software testing in sustainable software development cannot be ignored. To make the testing process more effective and efficient, several factors should be considered. One of the most important factors is estimating the total cost required for testing a software product, which can be a step toward justifying any software testing initiative. Software testing costs can be classified as fixed costs (including test team salaries, tester training, the testing environment and automated testing tools) or variable costs, which cover the troubleshooting and re-execution efforts [1]. Reducing the fixed testing cost is more related to the organization's policies and procedures, whereas minimizing the variable testing cost is an optimization problem directly impacted by test efficiency. In an inefficient testing process, a wide range of redundant test cases is created and thereby a large number of redundant executions can occur during the testing phase. In recent years, utilizing different techniques for test case selection, prioritization and test suite minimization has received much attention. By using the above-mentioned aspects, we can also address other testing issues such as earlier fault detection [2] and faster release of the software product [3]. Sequencing and scheduling are a form of dynamic and continuous decision making with industrial applicability. The dynamic decision-making approach fits environments that change over time, where a previous decision might affect a new decision [4]. The test scheduling problem can be considered a dynamic decision-making problem, and solving it can be a practical way of minimizing redundant executions in industry. The results of previously executed test cases can affect the results of new test cases, especially at the integration testing level, where test cases are more interdependent; monitoring the test records (results) therefore matters. Knowing the test result (pass or fail) of each test case can help testers to select the right test candidate for execution. However, monitoring and making decisions on a large set of test cases is both a challenging and costly process. Through analyzing several industrial case studies, seen in [2, 5] and [6], it has been shown that the problem of test optimization is a multi-criteria decision making problem, which applies to the dynamic test scheduling problem as well. Measuring the effect of several criteria on the test cases and also dynamically scheduling them is a challenging task which is addressed in this paper. In the present work, we introduce, apply and evaluate sOrTES (Stochastic Optimizing TEstcase Scheduling) as an automated decision support system for test scheduling. sOrTES is a multi-criteria decision support system with
fast performance, which makes continuous execution decisions for manual integration test cases. Furthermore, the feasibility of sOrTES is evaluated in the railway domain at Bombardier Transportation (BT) in Sweden. This paper makes the following contributions: (i) detecting the dependencies between manual integration test cases, (ii) dynamically scheduling test cases through analyzing their execution results, (iii) increasing the requirements coverage by up to 9.6% and (iv) decreasing the total required troubleshooting time by up to 40%. The organization of this paper is laid out as follows: Section 10.2 provides a background of the initial problem and also an overview of research on test optimization, and Section 10.3 describes the proposed approach. The structure of sOrTES is depicted in Section 10.4. An industrial case study is designed in Section 10.5. Section 10.6 compares the performance of sOrTES, BT and a history-based test case prioritization approach. Threats to validity and delimitations are discussed in Section 10.7. Section 10.8 clarifies some future directions of the present work and finally Section 10.9 concludes the paper.
10.2 Background and Related work
The concept of test optimization has received much attention in the past decade. Test optimization can be performed through several approaches, such as test automation, test suite minimization, test case selection, prioritization and also test scheduling. However, not all of the mentioned approaches are applicable in every testing environment within industry. The tradeoff between the test optimization effort and the expected gain should be considered at an early stage of a testing process. For instance, changing the test procedure from manual to automated requires a huge effort and is sometimes not a proper decision in terms of fault detection [7]. Moreover, manual testing is a popular approach for testing safety-critical systems, where human judgment and supervision are superior to machines [8]. Among the mentioned test optimization aspects, test case selection, prioritization and scheduling can be applied in almost all industries, where the testing process can be optimized by minimizing the costs and effort required for running the proposed approaches. Selecting a subset of the generated test cases, or ranking them for execution, can lead to a more efficient usage of the allocated testing resources. The test cases generated for testing a software product have different quality attributes and therefore do not have the same value for execution. Identifying the properties of the test cases and measuring
their value for execution can be considered a master key to solving the test optimization problem. As explained in Section 10.1, the test optimization problem is a multi-criteria and multi-objective decision making problem, where the properties of the test cases (e.g. execution time, requirement coverage) are the criteria and the targeted objectives are to maximize requirement coverage and detect faults earlier. Determining the critical criteria and the desirable objectives depends on several factors, such as the test cases' size, complexity and diversity and also the testing procedure. For instance, the number of lines of code (LOC) can be considered as a metric to measure the size of a test script, but it is not a valid criterion in a manual testing procedure. Furthermore, the objectives to be satisfied can change during the testing process based on different test optimization aspects. In this section, we briefly discuss three different aspects of test optimization.
10.2.1 Test case selection
Generally, test case selection deals with choosing a subset of the designed test cases for execution. Test case selection techniques can be used in exploratory testing, where the testers are involved in minimum planning and maximum test execution [9]. The problem of test case selection is formulated as follows by Yoo and Harman [10]:

Definition 10.1. Given: The program, P, the modified version of P, P′, and a test suite, T. Problem: To find a subset of T, T′, with which to test P′.

In other words, not all generated test cases need to be executed. This situation also arises at some levels of testing, such as acceptance testing, where all test cases have already been executed at least once and just some test cases need to be selected (randomly) for execution. Test case selection can also be utilized for test automation, where a subset of test cases can be selected as good candidates for automation in a manual testing approach.
10.2.2 Test case prioritization
Ranking all designed test cases for execution is called test case prioritization, which can be applied at all testing levels, such as unit testing, regression testing and system integration testing. The main goal of test prioritization is to give a higher priority to those test cases which have better quality attributes for execution.
The following definition of test case prioritization is proposed by Yoo and Harman [10]:

Definition 10.2. Given: A test suite, T, the set of permutations of T, PT, and a function from PT to the real numbers, f : PT → R. Problem: To find a T′ ∈ PT that maximizes f.

Several objectives, such as total testing time minimization and earlier fault detection, can be satisfied by applying test case prioritization techniques. Moreover, test cases can be prioritized for test automation in a manual testing approach, where the most critical manual test cases can be top ranked among all manual test cases.
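To make Definition 10.2 concrete, the following minimal Python sketch enumerates all permutations of a small, purely hypothetical test suite and selects the one that maximizes an illustrative objective f (here, requirement coverage weighted towards early positions). The test case names, coverage values and objective are assumptions made for illustration only, not part of the definition.

    from itertools import permutations

    # Hypothetical test suite: test case name -> number of covered requirements.
    coverage = {"TC1": 3, "TC2": 1, "TC3": 5}

    def f(order):
        # Illustrative objective: reward covering many requirements early;
        # the first position gets the highest weight.
        n = len(order)
        return sum((n - pos) * coverage[tc] for pos, tc in enumerate(order))

    best = max(permutations(coverage), key=f)
    print(best)  # ('TC3', 'TC1', 'TC2') for these illustrative values

In practice, f would encode criteria such as fault detection potential or execution time, and exhaustive enumeration would be replaced by a heuristic, since the n! permutations quickly become intractable.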
10.2.3 Test Case Stochastic Scheduling
Selecting a subset of test cases or prioritizing test cases for execution is usually performed offline and is not a daily task during a testing process. However, optimizing test cases for execution without monitoring the test results (after execution) is not the most efficient approach in terms of test optimization. The results (pass or fail) of previously executed test cases can influence the execution results of new test cases. This problem can be seen clearly at the integration testing level, where the interactions between software modules are tested. There is a strong interdependency between integration test cases, which directly impacts their execution results [6, 11]. This level of interdependency also influences the test optimization process, where selecting and prioritizing test cases should satisfy the dependency constraints. Considering the above-mentioned issues, we opted to use a stochastic scheduling model for ranking and sequencing test cases for execution. Stochastic scheduling is a class of optimization problems in which the processing times of tasks are modeled as random variables [12]; therefore, a job's processing time is not known until it is completed [13]. In the test scheduling problem, a new execution decision should be made based on the results of the previously executed test cases. We propose the following definition of the problem of test case stochastic scheduling:

Definition 10.3. Given: A test suite, T. For each subset of T, A ⊆ T, the set of all permutations of A, SPA. For each B ⊆ T, the set of all possible outputs after execution of the test cases in B, R. For each r ∈ R, a function fr : SPA → R. Problem: To find a prioritized set of T, T′, considering the function f∅ : PT → R, where PT is the set of permutations of T. To execute the test cases in T′ until the first failure (if any). To update the previous procedure for T − Tp, considering the function fre, until Tp = T, where Tp is the set of passed test cases and re is the output of the executed test cases, respectively.

In other words, the executed test cases need to be saved in re and the prioritizing process should be continued until all generated test cases have been executed at least once. Note that the main difference between Definition 10.2 and Definition 10.3 is the monitoring of the test execution results, which leads to a dynamic test optimization process. If no failures occur after the first execution, then we only need to prioritize the test cases once, according to Definition 10.2.
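A minimal sketch of the dynamic loop behind Definition 10.3, under the assumption that a ranking function and an execution step are available as placeholders: the remaining suite is re-prioritized, executed until the first failure, passed test cases are removed, and the procedure repeats until every test case has passed. Both helper functions below are hypothetical stand-ins, not part of any actual tool.

    import random

    def prioritize(test_cases, history):
        # Placeholder for the ranking function (f_r in Definition 10.3); a real
        # implementation would use the recorded results in `history` and other
        # criteria such as dependencies and requirement coverage.
        return sorted(test_cases)

    def execute(test_case):
        # Placeholder for a (manual or automated) test execution; returns True
        # for pass and False for fail. Randomized here purely for illustration.
        return random.random() > 0.3

    def schedule_until_all_pass(test_cases):
        remaining = set(test_cases)
        history = []                      # recorded outputs (r_e) of executed test cases
        while remaining:                  # continue until T_p = T
            for tc in prioritize(remaining, history):
                passed = execute(tc)
                history.append((tc, passed))
                if passed:
                    remaining.discard(tc)
                else:
                    break                 # first failure: re-prioritize what is left
        return history

    print(schedule_until_all_pass(["TC1", "TC2", "TC3"]))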
10.2.4 Related Work
Test prioritization tries to optimally order a set of test cases for execution, typically by balancing criteria such as detecting faults as early as possible with minimal cost, which depends on execution time. However, in most cases a selective subset of the possible test cases is utilized for test prioritization. Several test optimization techniques have been proposed in the literature [10], while more and more multi-objective and multi-criteria techniques are being proposed [14]. In [15], Walcott et al. present time-aware test prioritization that balances execution time and code coverage. A similar approach is also presented in [16]. In [17], Wang et al. introduce resource-aware multi-objective test prioritization where one cost measure (total time) and three effectiveness measures (prioritization density, test resource usage, fault detection capability) were defined and formulated into a fitness function. Although a greedy algorithm may produce a suboptimal result, it has received much attention for regression test case prioritization. In [18], Li et al. empirically investigated metaheuristic algorithms for regression test case prioritization. Strandberg et al. [19] present a multiple-factor automated system-level regression test prioritization approach that combines multiple properties of tests, such as test duration, previous fault detection success, interval since last executed and modifications to the code tested. System-level test case prioritization has also been investigated in [20], where requirements coverage and/or volatility can be considered as one important prioritization criterion. It is interesting to note that during integration testing, dependencies between components and functions of the system under test become a critical criterion. Few studies have investigated test case prioritization
based on such dependencies. Caliebe et al. [21] utilized a system graph based on the component dependency model and on path searching methods through the graph, where test cases were selected based on dependencies between the components in the system. Similarly, Haidry and Miller [22] prioritized functionally-dependent test cases based on different forms of graph coverage values. In order to detect functional dependencies among integration test cases at an early stage, our earlier work [23] proposed using natural language processing (NLP) to analyze multiple related artefacts (test specification, software requirement specification and relevant signaling information between functions under test).

Table 10.1: Summary of related work.

Reference | Purpose of paper | Drawback
Wang et al. [17] | Uses total time, prioritization density, test resource usage and fault detection capability for test prioritization | Static prioritization which does not check the execution results of test cases
Strandberg et al. [19] | Utilizes a set of test cases' properties for regression test prioritization | Not efficient in terms of minimizing test failures
Caliebe et al. [21] | Selects a subset of test cases for execution based on component dependency models | The component dependency model might not be available in all testing environments
Srivastava and Thiagarajan [24] | Uses the changes that have been made to the program for test prioritization | Use of changes makes it unfit for prioritization at an early stage of testing
Nahas and Bautista-Quintero [25] | Identifies a set of representative implementation classes for a TTC scheduler | Designing an appropriate set of test cases is a challenging task
Elbaum et al. [26] | Uses modules' history for test case prioritization | Focus on version-specific test cases
Kim and Porter [27] | Ranks test cases based on their execution histories | Historical data of the test cases required

Nahas and Bautista-Quintero [25] introduced Scheduler Test Case (STC) as a new technique which provides a systematic procedure for documenting and testing. STC supplies a black-box tool which predicts the behavior of implementation sets of real-time scheduling algorithms. The STC uses
several scheduling example tasks and also a subset of test cases in order to examine the expected behavior output of time-triggered co-operative architectures. The problem of test case prioritization is treated as a single-objective optimization problem by Wong et al. in [28], where all test cases are ranked based on their increasing cost per additional coverage. In [24], Srivastava and Thiagarajan measured the changes that had been made to the program and prioritized test cases based on that. Moreover, branch coverage has been identified several times as the most critical criterion by Rothermel et al. [29], [30], [31] and Elbaum et al. [26], where test cases are ranked based on a single criterion. History-based test prioritization is another single-objective test optimization approach, proposed by Kim and Porter [27]. Table 10.1 presents a summary of the related work. In contrast to most of the proposed approaches, which are single-objective, we propose a multi-criteria decision approach for scheduling test cases for execution. Furthermore, our proposed approach monitors the execution results of test cases and a new decision is made after each failure.
10.3 Proposed Approach
Figure 10.1: The input-output and phases of sOrTES. [Flowchart: the extraction phase reads the requirement specification (.docx, Table 10.2) and the test specification (.docx, Table 10.3), detects dependencies between requirements and between test cases and measures the requirement coverage, producing Table 10.4 and the dependency visualizations in Figures 10.3 and 10.5; the scheduling phase ranks test cases for execution and removes passed test cases, producing Table 10.5.]

In the present work, we introduce sOrTES as a supportive tool for stochastic test scheduling. sOrTES measures the interdependency between integration test cases and ranks them for execution based on their requirement coverage and execution time. A new schedule is proposed after each execution for the remaining test cases. As outlined earlier, a number of critical criteria influence the test cases. A criterion in the testing context can be interpreted as a property of each test case, which creates a difference between
test cases. The following criteria are utilized by sOrTES for integration test case scheduling:

• Functional dependency between test cases: test cases TC1 and TC2 are functionally dependent if they are designed to test different parts of function F1, or if they are testing the interaction between functions F1 and F2. For instance, given two functions F1 and F2 of the same system, let function F2 only be allowed to execute if its required conditions have already been enabled by function F1. Thus, function F2 is dependent on function F1. Consequently, all test cases which are designed to test F2 should be executed after the test cases assigned to test F1. Detecting functional dependencies between test cases can lead to a more efficient use of testing resources by means of [1]:
  – avoiding redundant executions,
  – parallel execution of independent test cases,
  – simultaneous execution of test cases that test the same functionality,
  – any combination of the previous options.

• Execution time: the total time that each test case requires for execution. A test case's execution time can differ from one test case to another. Knowing the execution time of test cases before execution can help test managers to divide test cases between several testers. Moreover, estimating the required time for each test case provides a better overview of the total time required for testing a software product.

• Requirement coverage: as the title implies, the number of requirements which are fulfilled by a test case. The coverage of requirements is a fundamental need throughout the software life cycle. Sometimes a test case can test more than one requirement, and sometimes several test cases are designed to test just one requirement.

Identifying and measuring the influence of the mentioned criteria on each test case requires a close collaboration with the testing experts in industry, which consumes time and resources. On the other hand, human judgment suffers from uncertainty, but eliminating it from the experiments would impact the results. One of the main objectives of designing sOrTES as a supportive tool is the automatic measurement of the critical testing criteria, whereby human judgment is reduced and thereby a more trustworthy result is produced.
However, in some testing processes there is no information available about the dependencies between test cases. In the absence of a requirement traceability matrix at a testing level, the requirement coverage cannot be measured automatically. Moreover, in most companies the execution time of test cases is only available after execution. We need to consider that, to obtain an efficient testing schedule, the effect of the mentioned criteria on the test cases should be measured at an early stage of the testing process, even before the first execution. The criteria measurement process can become even more complicated in a manual testing procedure. Analyzing a wide range of test specifications, written by humans with varying language and testing skills, makes the problem more complex. The problem of dependency detection between manual integration test cases has previously been addressed through several approaches, such as a questionnaire-based study, deep learning, natural language processing (NLP) and machine learning [6, 23, 32]. We also proposed an aiding tool called ESPRET¹ for estimating the execution time of manual integration test cases before the first execution [33]. Previously, test cases have been prioritized and ranked by adapting ranking algorithms such as AHP² in [5, 6] and TOPSIS³ in [2]. Since test cases need to be scheduled for execution more than once (daily scheduling), running the manual methods is not an optimal approach. The second main goal of designing sOrTES as a supportive tool is therefore the need for fast and daily scheduling. Note that the test scheduling problem is a dynamic task, meaning that the results of the scheduled test cases in the first cycle (first day) can influence the scheduling plan for the second day, for the following two reasons:

1. The dependency structure between test cases changes after a successful execution. In other words, when an independent test case passes, the next test case that depends on it can be considered as an independent test case. Since the dependencies between test cases change continuously during a testing process, a new schedule based on the current dependency status needs to be proposed.

2. The passed test cases need to be removed from the testing cycle (testing pool) and the failed test cases need to be re-scheduled for second-round execution. The failed test cases must be troubleshot first, and the time needed to accomplish this needs to be considered when scheduling the remaining test cases for execution.

¹EStimation and PRediction of Execution Time
²Analytic Hierarchy Process
³Technique for Order of Preference by Similarity to Ideal Solution
10.4 sOrTES - Stochastic Optimizing of Test Case Scheduling
Table 10.2: Software requirement specification examples - Bombardier Transportation.

(a) Brake system (SLFG)
Requirement ID | BTI/SFC Requirement | Interface
SRS-BHH-Brake system 768 | Emergency brake loop is not de-energized without order. | input: Internal Signal 44-A34.X11.4.DI3; output: MVB SDr3 EmRel
SRS-BHH-Brake system 241 | Emergency Brake Not (26-K23) is not active. | input: Internal Signal 44-A34.X11.4.DI3; output: Internal Signal 95-B27.X01.5.PI10
SRS-BHH-Brake system 98 | The activation of the emergency brake shall be shown on the HMI. | input: Internal Signal 95-B27.X01.5.PI10; output: MCB 44-F34

(b) Line voltage (SLFG)
Requirement ID | BTI/SFC Requirement | Interface
SRS-BHH-Line Voltage 327 | Display the line voltage in AC-mode and the rail voltage in DC-mode. | input: 44-A23.X11.4.DI2; output: Internal Signal 95-B27.X01.5.PI10
SRS-BHH-Line Voltage 243 | TCMS shall indicate the AC line voltage to the driver via HMI. | input: Internal Signal 44-A37.X11.1.DI5; output: Internal Signal 44-A34.X11.4.DI3
SRS-BHH-Line Voltage 398 | TCMS shall suppress values between 0.1 and 2.0 kV. | input: I/O iDcuPd128; output: Internal Signal 95-B27.X01.5.PI10

sOrTES is a Python-based automated decision support system which consists of two separate phases, (1) an extractor and (2) a scheduler, and which dynamically schedules manual integration test cases for execution based on three main criteria: (i) the dependencies between test cases, (ii) the test cases' execution time and (iii) the requirement coverage. sOrTES reads the requirement and test specifications as inputs and provides a list of ranked test cases for execution as an output. In the extraction phase, the functional dependencies and the requirement coverage for each test case are computed, and in the scheduling phase the test cases are sorted for execution. The execution time for each test case is predicted by ESPRET (our proposed tool for execution time prediction [33]), which can be added to the test cases manually, or the same time value can be assumed for all test cases. Figure 10.1 shows the required inputs, the expected outputs and also the phases embedded inside sOrTES, which is exemplified through the analysis of an industrial case study later in this paper. The following paragraphs describe the mentioned phases.
10.4.1 The Extraction Phase
To get a clearer picture of the required inputs for running sOrTES, we provide some examples of software requirement specifications (SRSs) and a test case specification, extracted from the DOORS⁴ database at BT. A typical SRS at BT consists of different pieces of information, including the signal information and standards, which are described textually. The SRSs are written by the requirement engineering teams' members as soon as the needed input is available to the project. Requirement adjustments are performed continuously during the project life cycle. Each requirement from the SRS is assigned to a sub-level functional group (SLFG). The requirement is then implemented as a part of one module within the SLFG, or as a part of several modules within the same SLFG. Table 10.2a and Table 10.2b represent two requirement specification examples for the brake system and the line voltage, respectively. During the project, some of the SRSs might be removed or merged, or new SRSs might be added to the project.

⁴Dynamic Object-Oriented Requirements System
10.4.2 Functional Dependencies Detection
In the extraction phase, the functional dependencies between test cases are detected by analyzing the internal signal communications between the software modules, which are described textually in the SRSs. In order to illustrate the signal communications between the software modules, a traceability graph for the train-borne distributed control system (TCMS) at BT is provided in Figure 10.2.

Figure 10.2: The traceability graph for TCMS. [Within the TCMS platform, requirements Req1 and Req2 and test cases TC1-TC4 are linked to Module 1 (Line voltage), requirements Req3 and Req4 and test cases TC5 and TC6 are linked to Module 2 (Brake system), and Module 1 sends an internal signal to Module 2.]

Figure 10.2 shows how two software modules from two different SLFGs
(line voltage and brake system) are communicating by sending and receiving an internal signal to each other. As depicted in Figure 10.2, module 1 sends an internal signal to module 2, which builds a dependency relation between those software modules. Since module 2 is functionally dependent on module 1, it should be tested after it. Thereby, all assigned requirements (e.g. the requirements in Table 10.2a) and test cases for module 2 are functionally dependent on the assigned requirements (e.g. the requirements in Table 10.2b) and test cases for module 1. According to Figure 10.2, the requirements (Req1 and Req2) describe how module 1 should be tested, and four test cases TC1, TC2, TC3 and TC4 are also designed to test module 1 based on Req1 and Req2. The test cases that are designed to test module 1 (i.e. TC1, TC2, TC3 and TC4) should ideally be top ranked for execution. TC3 and TC4 are designed to test both modules, but they need to be tested with the assigned test cases for module 1 in the first testing cycle.

Table 10.3: A test case specification example from the safety-critical train control management system at Bombardier Transportation.

Test case name: Auxiliary Compressor Control          Test level(s): Sw/Hw Integration
Test case ID: 3EST001845-2032 - RCM (v.1)             Date: 2018-01-20
Test configuration: TCMS baseline: TCMS 1.2.3.0; Test rig: VCS Release 1.16.5, VCS Platform 3.24.0
Requirement(s): SRS-BHH-Line Voltage 1707, SRS-BHH-Speed 2051
Tester ID: BR-1211                                    Initial State: No active cab

Step | Action
1 | Lock and set Auxiliary reservoir pressure < 5.5 bar
2 | Activate cab A2, lock and set signal braking mode from ATP to 109
3 | Lock and set Auxiliary reservoir pressure > 5.5 bar
4 | Wait 20 seconds
5 | Reset dynamic brake in the train for 5 seconds
6 | Set Auxiliary reservoir pressure < 5.5 bar
7 | Clean up

Reaction (expected results recorded for the steps, with a Pass/Fail field and a Comments field per step): Signal Command auxiliary compressor to IDU in B1 car as On; Signal braking mode to IDU is set to 109; Signal Auxiliary compressor is running to IDU is set to FALSE; Signal Auxiliary compressor is running to IDU is set to FALSE.

The required information for creating Figure 10.2 is described textually in the requirement specifications at BT. sOrTES reads the requirement specifications in the format of an Excel sheet (see Table 10.2) and finds the matched internal input-output signals. As we can see in Table 10.2a, the internal signal 44-A34.X11.4.DI3 is described as an input for the brake module, which is described by the requirement SRS-BHH-Brake system 768 (see the Interface column in Table 10.2a). However, the same signal (Internal Signal 44-A34.X11.4.DI3) is described as an output signal in the interface column assigned to the requirement SRS-BHH-Line Voltage 243 in Table 10.2b.

Figure 10.3: The dependency between the requirements (blue nodes) and the test cases (red nodes).
Moreover, Table 10.3 represents an example of a manual test case for a safety-critical system at BT, which consists of a test case description, an expected test result, test steps, a test case ID, corresponding requirements, etc. In this example, two requirements (SRS-BHH-Line Voltage 1707 and SRS-BHH-Speed 2051) are assigned to the test case example in Table 10.3. Once the dependencies between requirements are detected, sOrTES searches for the dependencies between test cases by mapping the assigned test cases to the corresponding requirements. Since both the requirement and test specifications are written in natural language text, several library packages are used, such as xlrd for reading Excel (.xls) documents (the requirement and test specifications) and vis.js,
which is a dynamic browser-based visualization library, for visualizing the dependencies. Moreover, some specific algorithms were implemented in Python to extract the required information from the requirement and test specifications. Since the implementation details (packages, libraries and pseudocode) are already described at length in [32], the Python implementation details are omitted from this paper. The extraction phase is currently embedded in sOrTES. Furthermore, in a large testing project, with a wide range of requirements and test cases, the dependency between requirements and test cases is very complex. Figure 10.3 displays a part of the dependency relations between the requirements together with the test cases in an industrial testing project at BT. The blue and red nodes in Figure 10.3 represent the requirements and the test cases, respectively. Note that all red nodes are connected only to blue nodes, even if it might seem otherwise due to the dense visualization. As mentioned earlier, there is a strong and complicated dependency between requirements and test cases at the integration testing level. As one can see in Figure 10.3, often more than one test case (red node) is assigned to test a requirement. However, in some testing scenarios, several requirements (blue nodes) are meant to be tested by just one test case. Visualizing the dependency relationships between the requirements together with the test cases, and also showing the test diversity data, can assist testers and test managers in improving the testing process in terms of test case selection and prioritization. For example, the minimum number of test cases required for testing a certain set of requirements can easily be extracted from Figure 10.3.
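The signal matching described above can be sketched in a few lines of Python. This is not the sOrTES implementation (whose details are given in [32]); it only illustrates the idea, and it assumes that the requirement interfaces and the requirement-to-test-case mapping have already been parsed from the specifications (e.g. with xlrd). The test case names below are hypothetical.

    # Already-parsed (hypothetical) data: per requirement, its input and output
    # signals, and the test cases assigned to it.
    req_interfaces = {
        "SRS-BHH-Brake system 768": {"in": {"44-A34.X11.4.DI3"}, "out": {"MVB SDr3 EmRel"}},
        "SRS-BHH-Line Voltage 243": {"in": {"44-A37.X11.1.DI5"}, "out": {"44-A34.X11.4.DI3"}},
    }
    req_to_tests = {
        "SRS-BHH-Brake system 768": {"TC_brake_1"},
        "SRS-BHH-Line Voltage 243": {"TC_linevolt_1"},
    }

    def requirement_dependencies(interfaces):
        # Requirement A depends on requirement B if one of A's input signals
        # appears as an output signal of B.
        return [(a, b)
                for a, ia in interfaces.items()
                for b, ib in interfaces.items()
                if a != b and ia["in"] & ib["out"]]

    def test_case_dependencies(req_deps, mapping):
        # Lift requirement-level dependencies to the assigned test cases.
        return {(ta, tb)
                for a, b in req_deps
                for ta in mapping.get(a, set())
                for tb in mapping.get(b, set())}

    req_deps = requirement_dependencies(req_interfaces)
    print(req_deps)
    # [('SRS-BHH-Brake system 768', 'SRS-BHH-Line Voltage 243')]
    print(test_case_dependencies(req_deps, req_to_tests))
    # {('TC_brake_1', 'TC_linevolt_1')}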
10.4.3 Requirement Coverage Measurement
Computing the number of assigned requirements for each test case provides the requirement coverage information. sOrTES captures the inserted information in each test specification (e.g. Table 10.3) and provides a number acting as the requirement coverage to each test case. The requirement coverage is equal to 2 for the provided test specification example in Table 10.3. However, Table 10.4 shows an output example provided by the extraction phase in sOrTES for dependencies and requirement coverage for each test case. In Table 10.4 we can see how many test cases need to be executed before testing another particular test case. The independent test cases have a 0 value in the Dependent on column in Table 10.4. Moreover, the number of test cases that can be tested after each test case is inserted in the Output column. The requirement coverage (the number of assigned requirements to a single test case) is calculated and stored in the Requirement coverage column in Table 10.4. For instance, test case number
1 is an independent test case, which does not require the execution of any other test cases before it. Furthermore, a total of 38 requirements are assigned to this test case. The testers can plan to prioritize the execution of this test case first given its requirements coverage. In certain industrial contexts, the requirement coverage is assumed to be the most important criterion for test case selection and prioritization [6]. The tester can also test this test case at any time due to its independent nature. On the other hand, test case number 5 is dependent on 13 other test cases, with a requirements coverage equal to 6 and just one test case being dependent on it. Thus, test case number 5 is not a good candidate for first-cycle execution given its dependency on 13 other test cases.

Table 10.4: Independent and dependent test cases and the requirement coverage for each test case.

Nr. | Test case ID | Requirement coverage | Dependent on | Output
1 | IVVS-ATP-IVV-51 | 38 | 0 | 0
2 | IVVS-Battery-IVV-96 | 24 | 0 | 0
3 | IVVS-Linevoltage-IVV-5 | 21 | 1 | 0
4 | IVVS-Drive-IVV-27 | 20 | 0 | 0
5 | IVVS-Trainradio-IVV-02 | 6 | 13 | 1
6 | IVVS-Linevoltage-IVV-3 | 46 | 12 | 0
7 | IVVS-Braketest-IVV-21 | 3 | 9 | 0
8 | IVVS-Speed-IVV-04 | 2 | 0 | 46
9 | IVVS-Speed-IVV-06 | 5 | 0 | 45
10 | IVVS-TC-IVV-07 | 4 | 0 | 36

However, after a successful execution of test case number 8, a total of 46 test cases will be available for execution (as independent test cases), and therefore this test case can also be considered for early execution in the testing cycle. As we can see in Table 10.4, test cases number 9 and 10 are two independent test cases, on which a total of 45 and 36 test cases depend, respectively. The names of some test cases which are dependent on test case numbers 9 and 10 are shown in Table 10.4. However, the information inserted in Table 10.4 can be used as an input to any other decision support system or ranking algorithm, or can even be utilized manually for test case selection and prioritization (one-time ranking).
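The quantities shown in Table 10.4 can be derived mechanically once the extraction phase has produced the requirement-to-test-case assignments and the pairwise test case dependencies. A small sketch under that assumption, with hypothetical data; note that the Output column is approximated here as the number of direct dependents of a test case:

    from collections import defaultdict

    # Hypothetical extraction results: assigned requirements per test case and
    # dependency pairs (a, b), meaning test case a depends on test case b.
    assigned_reqs = {"TC_A": {"R1", "R2", "R3"}, "TC_B": {"R2"}, "TC_C": {"R4"}}
    dependencies = {("TC_B", "TC_A"), ("TC_C", "TC_A")}

    def summarize(assigned, deps):
        dependent_on = defaultdict(int)   # how many test cases must run before this one
        output = defaultdict(int)         # how many test cases become testable after it
        for a, b in deps:
            dependent_on[a] += 1
            output[b] += 1
        return {tc: {"requirement coverage": len(reqs),
                     "dependent on": dependent_on[tc],
                     "output": output[tc]}
                for tc, reqs in assigned.items()}

    for tc, row in summarize(assigned_reqs, dependencies).items():
        print(tc, row)
    # TC_A {'requirement coverage': 3, 'dependent on': 0, 'output': 2}
    # TC_B {'requirement coverage': 1, 'dependent on': 1, 'output': 0}
    # TC_C {'requirement coverage': 1, 'dependent on': 1, 'output': 0}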
10.4.4 The Scheduling phase
The information provided in Table 10.4 can help us to schedule test cases for dynamic execution. The test execution results of dependent test cases directly impact the other test cases. On the other hand, maximizing the requirement
coverage, or the execution of as many test cases as possible in a limited period of time, is always demanded by industry. The proposed optimization solution for test scheduling in this work can be divided into two main parts: (i) finding a feasible set and (ii) maximizing the requirement coverage for each testing cycle. Finding a feasible set of test cases for execution deals directly with the interdependence, interaction and relationships between test cases. First, a dependency graph (directed graph) should be built for the dependent test cases, which represents our constraints. The objective function is to maximize the requirement coverage. Ignoring the dependencies can lead to redundant failures, thereby increasing the troubleshooting cost and the total testing time [1]. Additionally, maximizing the requirement coverage can help testers to finish the testing process faster, which might lead to an earlier release of the final software product. Since the main constraint of this optimization problem is dependency, we are not able to rank test cases for execution based on their requirement coverage (or any other criterion) alone. For instance, the test case inserted in line 6 of Table 10.4 has the highest value for the requirement coverage (46), but this test case is a multi-dependent test case and cannot be executed as the first test case in a testing cycle, because 12 other test cases should be executed successfully before reaching this one. Choosing the test candidate for execution from the feasible set can minimize the risk of unnecessary failures. Generally, test cases can fail due to the following causes: (i) there is a mismatch between the test case and the requirement, (ii) the testing environment is not ready for testing, (iii) bugs in the system and (iv) paying no attention to the dependencies between test cases. A failed test case needs to go through troubleshooting, which consumes testing resources. Thus, minimizing the redundant executions and thereby the unnecessary failures (based on the dependency cause) can lead to a decrease in troubleshooting efforts. Note that the proposed approach in this paper only deals with unnecessary failures caused by dependent test cases, as we are not able to avoid the other failure causes.
10.4.5 Model Assumptions and Problem Description
Let us consider a subset of n test cases, designated by TC1 , TC2 , ... TCn , that will be tested in a testing cycle, which are chosen from the testing pool among all test cases. Each test case, TCi , is characterized by its execution time, ti > 0, the required time for performing the troubleshooting process in case of failure, bi > 0, and the requirement coverage, si ∈ N. After each execution, a testing result such as pass, fail, not run and partly
pass is recorded for the executed test cases. According to the testing policy at BT, all failed, partly passed and not run test cases must be executed again. Following their procedure, we consider all results different than pass as fail. For test case TCi , we denote its testing result as ri , which we represent either by 1 for fail or 0 for pass. Indeed, ri is a realization of a random variable Ri , given that we do not know what the result will be beforehand, i.e.
R_i =
\begin{cases}
1, & \text{if test case } TC_i \text{ fails} \\
0, & \text{if test case } TC_i \text{ passes,}
\end{cases}
where R_i ∼ Bernoulli(p_i) with p_i = P(TC_i fails). Our main goal is to define an optimal execution order to test all test cases within each testing cycle. In our study, we assume that (i) the n test cases are all tested in each testing cycle only once; and (ii) when a failed test case is sent for troubleshooting, it will be tested again in another testing cycle (not in the current one). Note that these assumptions are acceptable to our industrial partner. In general, for n test cases, there are n! ways of ordering the test cases. However, the order should be chosen in a way that avoids unnecessary failures based on dependency and thereby redundant troubleshooting. As explained before, some of these failure causes are unavoidable, in contrast to failures based on dependency, which are avoidable. Let us consider each test case in a testing cycle to be a node (the red nodes in Figure 10.3) in a complete directed graph. Thus, there are bi-directed edges connecting every pair of nodes (but there are no edges from a node to itself). In fact, we are going to solve a stochastic traveling salesman problem (TSP) [34] as follows:
Definition 10.4. Given: The n test cases and their troubleshooting times. Problem: To find the best execution order to test each test case, minimizing the total troubleshooting time.
To formalize the TSP, we define:

\begin{equation}
\begin{aligned}
\min \quad & E\!\left[ \sum_{i=1}^{n} \sum_{\substack{j=1 \\ j \neq i}}^{n} x_{ij}\, b_i\, R_i \right] \\
\text{s.t.} \quad & x_{ij} \in \{0, 1\}, \quad \forall\, i, j = 1, \ldots, n \\
& \sum_{\substack{j=1 \\ j \neq i}}^{n} x_{ij} = 1, \quad \forall\, i = 1, \ldots, n \\
& \sum_{\substack{i=1 \\ i \neq j}}^{n} x_{ij} = 1, \quad \forall\, j = 1, \ldots, n \\
& \sum_{i \in V} \sum_{\substack{j \in V \\ j \neq i}} x_{ij} \leq |V| - 1, \quad \forall\, V \subsetneq \{1, 2, \ldots, n\},\ V \neq \emptyset
\end{aligned}
\tag{10.1}
\end{equation}
The solution of Equation 10.1 gives us the best order in which to execute the n test cases such that the troubleshooting time is minimized at each testing cycle, which can be written as an n-dimensional vector: (TC·,1, TC·,2, · · · , TC·,n). In our notation, TC·,k represents the k-th test case that should be tested, i.e. the second index stands for the execution order while the first index indicates the test case number. For instance, TC5,2 represents that TC5 should be executed in second place in a testing cycle. In the present work we are preoccupied with minimizing the needless troubleshooting time caused by interdependent integration test cases. In other words, if TCj depends on TCi, then it is very unlikely that TCj passes if TCi was not tested before; we therefore consider that:

P(TCj passes | TCi was not tested) = 0.    (10.2)
Thus, if we choose an order where TCj is tested before TCi then we will have Rj = 1 almost certainly (and a troubleshooting time will be summed up). On the other hand, if all tests that TCj is dependent on have already passed, then TCj will behave as an independent test case. For instance, let us consider that TCj only depends on TCi and TCk , then we have: P(TCj passes|TCi and TCk already passed) = P(TCj passes).
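The effect of Equation (10.2) on a candidate order can be illustrated with a short sketch that computes the total expected troubleshooting time of an order, assuming illustrative failure probabilities p_i and troubleshooting times b_i: whenever a test case is scheduled before one of its precedents, its full troubleshooting time is counted (it fails almost surely); otherwise only its expected value b_i p_i is added.

    def expected_troubleshooting(order, precedents, p, b):
        # order: test case names in execution order
        # precedents: test case -> set of test cases it directly depends on
        # p: intrinsic failure probability; b: troubleshooting time per test case
        tested, total = set(), 0.0
        for tc in order:
            if precedents.get(tc, set()) - tested:
                total += b[tc]            # scheduled before a precedent: fails (Eq. 10.2)
            else:
                total += b[tc] * p[tc]    # behaves as an independent test case
            tested.add(tc)
        return total

    # The example above: TCj depends on TCi and TCk (illustrative p and b values).
    precedents = {"TCj": {"TCi", "TCk"}}
    p = {"TCi": 0.1, "TCk": 0.2, "TCj": 0.1}
    b = {tc: 8.0 for tc in p}

    print(expected_troubleshooting(["TCi", "TCk", "TCj"], precedents, p, b))  # 3.2
    print(expected_troubleshooting(["TCj", "TCi", "TCk"], precedents, p, b))  # 10.4

The sketch conditions only on whether the precedents were scheduled earlier, mirroring the expected-time comparison used below.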
In order to avoid adding unnecessary troubleshooting time before testing a test case with dependencies, we should first test all test cases that it depends on. However, to accomplish this, an embedded digraph of dependencies should be described first. This is one of the main contributions of sOrTES in the extraction phase. To clarify the explanation, let us consider a dummy example⁵, where we only have 5 test cases: TC1, TC2, TC3, TC4 and TC5. We have 5! = 120 different execution orders for testing the 5 test cases. However, let us assume that we are able to describe the following embedded digraph of dependencies:

Figure 10.4: Example of an embedded digraph of dependencies for 5 test cases. [Nodes TC1-TC5; TC3 depends on TC1, and TC5 depends on TC2, TC3 and TC4.]

⁵It is a dummy example in the sense that the number of test cases is very small compared with real industry cases.

According to Figure 10.4, TC1, TC2 and TC4 are independent test cases, TC3 depends on TC1, and TC5 directly depends on TC2, TC3 and TC4, which we call its precedents, and indirectly depends on TC1. Knowing this grid of dependencies and taking into account Equation (10.2), we realize that there are only 12 feasible choices for executing those 5 test cases (always with TC5 in the last place and TC3 after TC1). To formalize the set of feasible choices, for each test case TCi we consider the set of all precedents, Pi, i.e. the set of test cases that TCi directly depends on. For instance, for the example presented in Figure 10.4, P1 = P2 = P4 = ∅, P3 = {TC1} and P5 = {TC2, TC3, TC4}. In order to avoid unnecessary troubleshooting time, let us call F the set of all possible ways to test the n test cases, which takes into account the
interdependencies. We call it the feasible set and it is defined as follows:

F = \left\{ (TC_{\cdot,1}, TC_{\cdot,2}, \cdots, TC_{\cdot,n}) : P_{\cdot,1} = \emptyset,\ \forall\, i = 2, \cdots, n,\ P_{\cdot,i} \subseteq \{TC_{\cdot,1}, \cdots, TC_{\cdot,i-1}\} \right\}.
Considering again the example in Figure 10.4, if we have (TC·,1, TC·,2, TC·,3, TC·,4, TC·,5) ∈ F, the total expected time for troubleshooting is b1 p1 + b2 p2 + b3 p3 + b4 p4 + b5 p5. On the other hand, if (TC·,1, TC·,2, TC·,3, TC·,4, TC·,5) ∉ F, the total expected time for troubleshooting is either b1 p1 + b2 p2 + b3 p3 + b4 p4 + b5, or b1 p1 + b2 p2 + b3 + b4 p4 + b5 p5, or b1 p1 + b2 p2 + b3 + b4 p4 + b5. This means that the total expected time for troubleshooting for elements in the feasible set F is always lower than the corresponding expected time for elements that do not belong to F. In other words, any solution not included in F is sub-optimal with respect to minimizing the total expected time for troubleshooting. This implies that from now on we only consider elements in F. Currently, sOrTES considers all elements in the feasible set F and orders them by requirement coverage, i.e. it aims to achieve the highest possible requirement coverage as early as possible. We can formalize this in the following way: we want to choose (TC·,1, TC·,2, · · · , TC·,n) ∈ F, with respective requirements coverage s·,1, s·,2, ..., s·,n, such that, if we consider another order (TC′·,1, TC′·,2, · · · , TC′·,n) ∈ F with respective requirements coverage s′·,1, s′·,2, ..., s′·,n, then for all i = 1, 2, ..., n the following inequality holds:

\sum_{j=1}^{i} s_{\cdot,j} \;\geq\; \sum_{j=1}^{i} s'_{\cdot,j}.  \tag{10.3}
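One greedy reading of this ordering rule, sketched below: at every position, among the test cases whose precedents have all been placed already (so the order stays inside F), pick one with the highest requirement coverage. An order satisfying inequality (10.3) against every other feasible order need not exist in general, so this is a heuristic illustration of the selection rule rather than the sOrTES implementation; the coverage values are made up.

    def coverage_first_schedule(coverage, precedents):
        # coverage: test case -> requirement coverage s_i
        # precedents: test case -> set of test cases it directly depends on (P_i)
        placed, order = set(), []
        while len(order) < len(coverage):
            ready = [tc for tc in coverage
                     if tc not in placed and precedents.get(tc, set()) <= placed]
            if not ready:
                raise ValueError("cyclic dependencies: no feasible order exists")
            best = max(ready, key=lambda tc: coverage[tc])
            order.append(best)
            placed.add(best)
        return order

    # The 5-test-case example of Figure 10.4 with illustrative coverage values.
    coverage = {"TC1": 2, "TC2": 4, "TC3": 5, "TC4": 1, "TC5": 3}
    precedents = {"TC3": {"TC1"}, "TC5": {"TC2", "TC3", "TC4"}}
    print(coverage_first_schedule(coverage, precedents))
    # ['TC2', 'TC1', 'TC3', 'TC4', 'TC5']  (TC5 last and TC3 after TC1, as required)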
10.5 Empirical Evaluation
In order to analyze the feasibility of sOrTES, we designed an industrial case study at Bombardier Transportation (BT) in Sweden, by following the proposed guidelines of Runeson and Höst [35] and also Engström et al. [36]. BT provides various levels of testing in both manual and automated approaches, where the integration testing is performed completely manually. The number of required test cases for testing a train product at BT is rather large and the testing process is performed in several testing cycles.
Figure 10.5: The dependencies between the BR project’s test cases at BT, using sOrTES.
10.5.1 Unit of Analysis and Procedure
The units of analysis in the case under study are manual test cases at the integration testing level for a safety-critical train control subsystem at BT. sOrTES is, however, not restricted to this testing level or method and can be applied to other testing procedures at other levels of testing (e.g. unit, regression and system level, in both manual and automated procedures) and in other domains. The case study is performed in several steps:

• A commuter train for Hamburg, called the BR project⁶, is selected as the case under study.
• A total number of 3938 requirement specifications (SRSs) from 17 different sub-level function groups are extracted from the DOORS database at BT for the BR project.
• The dependencies between requirements are detected for 3201 SRSs, while 737 SRSs are detected as being independent.
• A total number of 1748 test specifications are extracted from DOORS and analyzed for dependency detection.
• The results of the dependency detection between test cases are presented to the BR project team members.
• Three different test execution cycles from the beginning, middle and end of the BR project are selected to be compared with the execution schedules proposed by sOrTES.
• The testers' and engineers' opinions about the number of test cases failed due to dependencies are collected and analyzed.

⁶The BR series is an electric rail car built specifically for the S-Bahn Hamburg GmbH network, in production at the Bombardier Hennigsdorf facility since June 2013. Bombardier will deliver 60 new single and dual-voltage commuter trains with innovative technology, desirable energy consumption and low maintenance costs. The heating system in the BR project is designed to use waste heat from the traction equipment system to heat the passenger compartments [37].
10.5.2 Case Study Report
As mentioned before, the main goal of proposing sOrTES is dynamic test scheduling. The results of executed test cases can impact how the dependencies between test cases are shaped. Therefore, making a correct decision for test execution is directly related to the test cases' dependencies. sOrTES first detects the dependencies between the requirements assigned to the BR project, and thereby the dependencies between the corresponding test cases are detected. At the next level, the requirement coverage is computed for each test case, and thereafter the test cases are ranked based on their dependencies and requirement coverage.

Table 10.5: An execution schedule example proposed by sOrTES for the BR project at Bombardier; RC stands for the requirement coverage and Time represents the execution time.

Schedule Project: BR                                  [Toggle Graph]
Test case ID | Priority | RC | Time | Result
IVVS-ATP-S-IVV-051 | 1 | 38 | 1 |
IVVS-Battery-IVV-096 | 2 | 24 | 1 |
IVVS-LineVoltage-S-IVV-015 | 3 | 27 | 1 |
IVVS-Drive-IVV-027 | 4 | 20 | 1 |
IVVS-S-ConfItems-IVV-013 | 5 | 17 | 1 |
IVVS-Fire-S-IVV-036 | 6 | 17 | 1 |
IVVS-LineVoltage-S-IVV-018 | 7 | 21 | 1 |
IVVS-Drive-IVV-004 | 8 | 20 | 1 |
IVVS-Drive-IVV-008 | 9 | 16 | 1 |
IVVS-Fire-S-IVV-034 | 10 | 16 | 1 |
Table 10.5 represents the user interface of sOrTES with an execution schedule example, where the dependent test cases are highlighted in red. Moreover, the execution time and requirement coverage (RC column in Table 10.5) are available for each test case. Since the execution time for test cases is captured by another tool (ESPRET), we assume the same time value for all test cases in the present work. Note that by changing the execution time value in Table 10.5, another execution schedule will be proposed which might be different than the current one. In other words, test cases are ranked for execution based on their dependencies, requirement coverage and the execution time, where the user is able to specify that a lower or higher execution time per test case is demanded. The priority column in Table 10.5 shows the execution order for each test case. After each execution, the testers can insert the test results for those test cases which are passed by clicking on the play circle icon in Table 10.5. Thus, the passed test cases are removed automatically from the proposed schedule and a new execution schedule will appear in Table 10.5. Using the red color for the dependent test cases can help testers to have a better overview of the dependencies between test cases. By clicking on the Toggle Graph button (embedded in sOrTES user interface seen in Table 10.5), Figure 10.5 is shown, which represents the dependencies between test cases (the red lines in Table 10.5) for the BR project. Moreover, a new graph is created after removing the passed test cases from Table 10.5.
10.6 Performance evaluation
In this section we evaluate the performance of sOrTES by comparing the execution schedules proposed by sOrTES with three different execution orders at BT. The purpose of the performance evaluation is to adequately compare the amount of fulfilled requirements and the troubleshooting effort in the different testing cycles during the project.
10.6.1 Performance comparison between sOrTES and Bombardier
For each set of test cases, in each testing cycle, sOrTES and BT provide an execution schedule, ordering the test cases based on their own criteria. Following the notation already introduced, we represent the scheduling order given by sOrTES and BT, respectively by:
TC·,1S, TC·,2S, ..., TC·,mS   and   TC·,1B, TC·,2B, ..., TC·,mB,

where TC·,1S represents the first test case in the execution schedule proposed by sOrTES. For instance, TC5,1S means that TC5 is the first test case that needs to be executed according to sOrTES. Suppose that at BT, TC5 is tested in 12th place; then we have TC5,12B. From now on, whenever it is necessary, we use S to associate with sOrTES and B to associate with BT. Our goal is to compare the performance of both of the provided schedules. The test cases presented in Table 10.4 are arranged in alphabetical order for the 1st testing cycle in Table 10.6. As Table 10.6 shows, TC1 is scheduled by sOrTES as the first test candidate for execution; however, TC1 was executed by BT as the 942nd test case in the 1st testing cycle.

Figure 10.6: Comparing the scheduling execution results at BT with the proposed execution schedule by sOrTES for the BR project. [Plots of the cumulative requirement coverage against the number of test cases for sOrTES and BT: (a) after the 1st, (b) after the 2nd and (c) after the 3rd test execution cycle.]
Table 10.6: Execution order in the 1st testing cycle, for a subset of 10 test cases (among 1462), according to sOrTES and BT.

TCi | Test case ID | sOrTES | BT
TC1 | IVVS_SBHH_SyTS_ATP_S-IVV-051 | TC1,1S | TC1,942B
TC2 | IVVS_SBHH_SyTS_Battery-IVV-096 | TC2,2S | TC2,119B
TC3 | IVVS_SBHH_SyTS_Braketest-IVV-021 | TC3,1191S | TC3,224B
TC4 | IVVS_SBHH_SyTS_Drive-IVV-027 | TC4,4S | TC4,387B
TC5 | IVVS_SBHH_SyTS_Linevoltage-IVV-003 | TC5,237S | TC5,560B
TC6 | IVVS_SBHH_SyTS_Linevoltage_S-IVV-015 | TC6,3S | TC6,1340B
TC7 | IVVS_SBHH_SyTS_Speed-IVV-004 | TC7,1399S | TC7,784B
TC8 | IVVS_SBHH_SyTS_Speed-IVV-006 | TC8,108S | TC8,785B
TC9 | IVVS_SBHH_SyTS_TC-IVV-007 | TC9,173S | TC9,894B
TC10 | IVVS_SBHH_SyTS_Trainradio-IVV-002 | TC10,1199S | TC10,888B

To compare the planned schedules, we consider a variable which represents the sum of the requirements coverage fulfilled after the execution of the k-th test case, namely:

CS^S(k) = \sum_{j=1}^{k} s_{\cdot,j_S}  \quad \text{and} \quad  CS^B(k) = \sum_{j=1}^{k} s_{\cdot,j_B}.
Indeed, CS^S(k) and CS^B(k) are the cumulative requirements coverage achieved by sOrTES and BT, respectively. To compare the schedules after execution, just as before, we consider a variable that gives us the cumulative requirement coverage. However, now we assume a penalty every time a test case gets a failed result: we do not sum the requirement coverage of the failed test cases, i.e. we have instead:

CSA^S(k) = \sum_{i=1}^{k} s_{\cdot,i_S} \left(1 - r_{\cdot,i_S}\right)  \quad \text{and} \quad  CSA^B(k) = \sum_{i=1}^{k} s_{\cdot,i_B} \left(1 - r_{\cdot,i_B}\right).
It is also worthwhile to consider a variable for the troubleshooting time. This variable represents the total number of labor hours spent on troubleshooting after the execution of the $k$-th test case. In fact, whenever a test case gets a fail verdict, we add its troubleshooting time, i.e.

$$CTA^{S}(k) = \sum_{i=1}^{k} b^{S}_{\cdot,i}\, r^{S}_{\cdot,i} \quad \text{and} \quad CTA^{B}(k) = \sum_{i=1}^{k} b^{B}_{\cdot,i}\, r^{B}_{\cdot,i}.$$

In other words, $CSA$ is the cumulative requirement coverage after execution for the passed test cases and $CTA$ represents the troubleshooting time for the failed test cases.
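As a minimal illustration of these three measures (a sketch only, not part of the tool; the per-test-case values s, r and b and the list layout below are assumed for the example), the cumulative sums can be computed directly from an execution log:

from itertools import accumulate

# Hypothetical execution log for one schedule: s = requirements covered,
# r = 1 if the test case failed, b = troubleshooting time in labor hours.
log = [
    {"s": 38, "r": 0, "b": 16},
    {"s": 5,  "r": 1, "b": 16},
    {"s": 12, "r": 0, "b": 16},
]

CS  = list(accumulate(tc["s"] for tc in log))                  # planned cumulative coverage
CSA = list(accumulate(tc["s"] * (1 - tc["r"]) for tc in log))  # coverage of passed test cases only
CTA = list(accumulate(tc["b"] * tc["r"] for tc in log))        # accumulated troubleshooting hours

print(CS, CSA, CTA)   # [38, 43, 55] [38, 38, 50] [0, 16, 16]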
Figure 10.6 illustrates the gained cumulative requirement coverage (CSA) for three different execution cycles at BT, where sOrTES maximizes the number of covered requirements in each execution cycle compared with BT. According to Figure 10.6a, in the 1st execution cycle at BT, 2000 requirements are tested by executing 880 test cases with the execution order proposed by the testers. However, the same number of requirements could be covered by executing 580 test cases if the testers followed the execution schedule proposed by sOrTES. Since we use the cumulative requirement coverage at the end of each testing cycle, the total number of tested requirements would be the same for both proposed execution schedules as long as there are no failures caused by dependencies between test cases. However, given that sOrTES has fewer failures, it can cover more requirements. Moreover, comparing the tested requirements in Figure 10.6 indicates that, in all execution schedules proposed by BT, fewer requirements have been tested compared with sOrTES. In some testing companies, testing as many requirements as possible in a limited period of time is demanded; therefore, prioritizing test cases based on their requirement coverage is a well-known and accepted approach. Note that the execution schedule proposed by sOrTES in Figure 10.6 is based not only on the requirement coverage but also on the dependencies between test cases. In other words, by using the execution schedules proposed by sOrTES we avoid the risk of failures caused by dependencies between test cases. Hence, sOrTES ranks test cases for execution based on their requirement coverage (after dependency detection), and those test cases which cover more requirements are ranked higher for execution. As seen in Table 10.5, the first test candidate for execution covers 38 requirements. The average number of executed test cases in each testing cycle of the BR project is around 1500, which makes it very unlikely that the test cases with a high requirement coverage are executed in the early stages of a testing cycle. According to Figure 10.6, using the execution schedules proposed by sOrTES increases the average requirement coverage by up to 9.6%. In Figure 10.8, we compare the scheduled cumulative requirement coverage with 10000 random test execution orders.
[Figure 10.7 (three panels): troubleshooting time versus number of test cases for sOrTES and BT; (a) after the 1st test execution cycle, (b) after the 2nd test execution cycle, (c) after the 3rd test execution cycle.]
Figure 10.7: Troubleshooting time comparison at BT with the proposed execution schedules by sOrTES for the BR project.

As can be seen, 2000 requirements are fulfilled by sOrTES through executing 400 test cases, whereas among the 10000 random execution orders (over more than 1400 test cases), at least 700 test cases are needed to obtain the same 2000 requirements (note that Figure 10.8 shows the first execution schedule plan, before the first execution, i.e. $CS^{S}$ as defined above; for this reason Figure 10.8 does not coincide with Figure 10.6a). As highlighted in Section 10.3, one of the main advantages of using sOrTES is its ability to avoid unnecessary failures caused by previous failures (fails based on dependency). All failed test cases at BT are labeled with a tag such as software change request (SCR). SCRs are usually used when a test case fails due to an existing bug in the system or a mismatch between a requirement and a test case. To analyze the performance of sOrTES from the troubleshooting time perspective, we assumed that all test cases tagged with SCR (and failed in each testing cycle) would also result in a failure under the execution order proposed by sOrTES. Since the dependencies between test cases are detected by sOrTES, we are able to monitor the execution results of dependent test cases after each execution cycle. In fact, we have searched for "failed based on failed" test cases among all failed test cases.
Figure 10.8: Cumulative requirement coverage for random test execution orders.

To exhibit the differences between the execution schedule proposed by sOrTES and BT's execution orders, we consider those test cases whose failures resulted from dependencies as passes instead. In all three execution cycles at BT, a test case which depends on another test case fails when: (i) the independent test case was already tested and its result was a fail, or (ii) the independent test case was not tested before the dependent one (wrong execution order). As highlighted before, using the execution order proposed by sOrTES does not avoid any other causes of failure (e.g. mismatches between test cases and requirements); those kinds of failures would most likely also occur with the sOrTES schedules. As a previous study [1] shows, fails based on dependency are the most common cause of failure at the integration testing level for the Zefiro project at BT (Bombardier Zefiro is a family of very high-speed passenger trains designed by Bombardier Transportation, whose variants have top operating speeds of between 250 km/h (155 mph) and 380 km/h (236 mph)). Our analysis shows that 20.79% of the executed test cases in the first execution cycle of the BR project failed, and 40.79% of these failures occurred due to the dependencies between test cases. The corresponding shares are 44.20% and 36.18%, out of overall failure percentages of 33.92% and 19.86%, for the second and third execution cycles, respectively.
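The classification rule above is simple enough to express in a few lines. The following is a minimal sketch (not the sOrTES implementation; the record layout, field names and example data are assumptions made for illustration):

def fails_based_on_dependency(tc, depends_on, execution_order, results):
    """True if failed test case 'tc' counts as a dependency-based failure:
    some test case it depends on either (i) was executed earlier and failed,
    or (ii) was not executed before 'tc' at all (wrong execution order)."""
    if results.get(tc) != "fail":
        return False
    position = {t: i for i, t in enumerate(execution_order)}
    for dep in depends_on.get(tc, []):
        executed_before = dep in position and position[dep] < position[tc]
        if not executed_before:            # case (ii): wrong execution order
            return True
        if results.get(dep) == "fail":     # case (i): independent test case failed
            return True
    return False

# Hypothetical data: TC2 depends on TC1, which failed before TC2 was run.
depends_on = {"TC2": ["TC1"]}
order = ["TC1", "TC2"]
results = {"TC1": "fail", "TC2": "fail"}
print(fails_based_on_dependency("TC2", depends_on, order, results))   # True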
In collaboration with the testing experts at BT, the total troubleshooting time for each test case is estimated as 16 labor hours, meaning that a failed test case takes two days to cycle back into testing. Figure 10.7 illustrates the troubleshooting time for the failed test cases in the three testing cycles at BT. As can be seen in Figure 10.7a, the total required troubleshooting time in the first execution cycle at BT is close to 5000 labor hours, whereas using the schedule proposed by sOrTES would lead to less than 3000 labor hours. Furthermore, in the second execution cycle (see Figure 10.7b), the troubleshooting time at BT increases even more, reaching 8000 labor hours, which is almost twice the troubleshooting time required by sOrTES. According to Figure 10.7, the execution order proposed by BT shows fewer failures at the beginning of each testing cycle. In the first cycle, after executing 100 test cases, a significant increase in troubleshooting can be observed (see Figure 10.7a), which indicates the fail-based-on-fail problem. In the second execution cycle, failures start from the beginning (see Figure 10.7b), while in the third execution cycle failures start after almost 100 executions (see Figure 10.7c). The main reason why the execution orders at BT in Figures 10.7a and 10.7c show fewer failures at the beginning is that the testers usually execute the simple test cases first in each testing cycle. These test cases, which are not complicated and generally cover one or two requirements, have a higher chance of successful execution. In the execution schedule proposed by sOrTES, the test cases covering a high number of requirements need to be tested as soon as possible. Postponing more complex test cases is not an optimal decision: first, ranking test cases based on their requirement coverage (and execution time) leads to testing the software product faster; second, if a complex test case fails at the end of the testing process, it must be cycled back through testing, which takes more overall time and could ultimately require a deadline extension for the testing project.
10.6.2 Performance comparison including a history-based test case prioritization approach
As reviewed earlier in Section 10.2.4, several approaches for test case selection and prioritization have been proposed in the state of the art. Among the existing methods, we have opted to compare the performance of sOrTES with a history-based prioritization approach in this subsection.
[Figure 10.9 (four panels), comparing sOrTES, BT and history-based prioritization against the number of test cases: (a) cumulative requirement coverage after the 2nd test execution cycle, (b) cumulative requirement coverage after the 3rd test execution cycle, (c) troubleshooting time after the 2nd test execution cycle, (d) troubleshooting time after the 3rd test execution cycle.]
Figure 10.9: Comparing the scheduling execution results at BT with the proposed execution schedule by sOrTES together with a history-based test case prioritization approach for the BR project.
The history-based approaches utilize information from the test cases' history data [38] for selecting and prioritizing test cases for the next execution. Strandberg et al. [19] provide a list of critical factors for test case prioritization, in which test case history is identified as the factor with the highest impact. Generally, in history-based test prioritization, the test cases that failed in a testing cycle are ranked high for execution in the next test cycle [39]. In some cases, history-based prioritization improves test efficiency compared to other existing test case prioritization techniques [40]. Furthermore, history-based test prioritization is considered a common and popular approach for reducing the risk of failure among test cases that have failed once before. In the previous subsection, the performance of sOrTES was compared with the test scheduling approach at BT over three test execution cycles, in terms of requirement coverage and the reduction of unnecessary failures. Since the execution results for the BR project's test cases are available, we can schedule them once more using the history-based prioritization method, so that the performance of all three approaches (sOrTES, BT and the history-based approach) can be compared with each other. To this end, in the history-based prioritization, we rank the test cases that failed in the first execution cycle high, so that they are executed first in the second cycle.

                                                         1st execution cycle   2nd execution cycle   3rd execution cycle
Percentage of failed test cases                               20.79%                33.92%                19.86%
Percentage of re-tested test cases                              −                   93.36%                97.60%
Percentage of re-failed test cases                              −                   98.66%                56.56%
Percentage of re-failed test cases based on dependency          −                   43.39%                30.43%

Table 10.7: Test execution failure analysis in each testing cycle for the BR project at BT.

Table 10.7 shows the percentage of test failures, the share of re-failed test cases and the share of re-failed test cases caused by dependencies in each testing cycle. As one can see, 20.79% of the test cases failed in the first execution cycle, and 93.36% of those were considered for re-execution in the second execution cycle at BT. This indicates that around 6.64% of the failed test cases (in the 1st execution cycle) were omitted in the 2nd execution cycle. Using the history-based test prioritization technique, the failed test cases of the 1st execution cycle are ranked high in the 2nd execution cycle. For ranking the other test cases (those which did not fail in the 1st execution cycle), the execution approach proposed by BT is used. As Table 10.7 shows, 98.66% of the test cases that failed in the 1st execution cycle failed again in the 2nd execution cycle, where 43.39% of
the failures occurred due to the dependencies between test cases. A similar interpretation holds for the 3rd execution cycle. Moreover, Figure 10.9 shows the gained cumulative requirement coverage and the troubleshooting time for the failed test cases in the two testing cycles at BT. As we can see in Figures 10.9a and 10.9b, sOrTES maximizes the number of covered requirements in both execution cycles compared with the history-based prioritization approach. As mentioned earlier, the first failed test case of the 1st execution cycle is assumed to be executed as the first test case in the 2nd execution cycle (see Figures 10.9a and 10.9c). Likewise, all failed test cases of the 2nd execution cycle are ranked high for execution in the 3rd execution cycle (see Figures 10.9b and 10.9d). As Table 10.7 shows, the dependency between test cases is an important cause of failure in each testing cycle, and therefore the troubleshooting time of the history-based approach (see Figures 10.9c and 10.9d) is higher than that of sOrTES and BT. The cumulative requirement coverage in the 2nd and 3rd execution cycles (see Figures 10.9a and 10.9b) indicates that the test scheduling approach proposed by BT covers more requirements than the history-based approach in both execution cycles. This reflects the intuition and testing skills of BT's testers in ranking test cases for execution. Moreover, one of the main reasons behind these results is the unnecessary failures that occurred again among the previously failed test cases in the next execution cycle due to their dependencies (see the last row in Table 10.7). The empirical results presented in Figure 10.9 indicate that sOrTES outperforms the history-based prioritization approach in terms of requirement coverage and unnecessary failures between integration test cases, which leads to reduced troubleshooting time. However, the history-based prioritization technique can still be an efficient approach in other testing environments where there are no complex dependencies between test cases.
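For reference, the history-based ordering used in this comparison can be sketched in a few lines (an illustration only, not the exact procedure of the cited approaches; the baseline order and the verdict data below are hypothetical):

def history_based_order(baseline_order, previous_verdicts):
    """Place the test cases that failed in the previous cycle first (keeping
    their relative baseline order), followed by the remaining test cases."""
    failed = [tc for tc in baseline_order if previous_verdicts.get(tc) == "fail"]
    others = [tc for tc in baseline_order if previous_verdicts.get(tc) != "fail"]
    return failed + others

# Hypothetical data: BT's order for the next cycle plus last cycle's verdicts.
bt_order = ["TC1", "TC2", "TC3", "TC4"]
verdicts = {"TC1": "pass", "TC2": "fail", "TC3": "pass", "TC4": "fail"}
print(history_based_order(bt_order, verdicts))   # ['TC2', 'TC4', 'TC1', 'TC3']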
10.7 Threats to Validity
The threats to validity, the limitations and the challenges faced in conducting the present study are discussed in this section.
• Construct Validity: is one of the three main types of validity evidence and concerns whether the study measures what it intends to measure [41]. The major construct validity threat in the present study is the way the functional dependencies between test cases are detected. Utilizing the internal signal information of the software modules for dependency detection may not be attainable in other testing processes. Generally, the information about the signals is provided at the design level of a software
product, and it might be hard (or not possible) to capture this information at the testing level. On the other hand, communication between different departments in an organization in order to capture and share information may require more time and effort than the amount of time saved by test scheduling. Moreover, scheduling test cases for execution based on other types of dependencies (e.g. temporal dependency) might be a more efficient way of scheduling. The execution sequence is an important factor in a real-time system. Functionally dependent test cases can be tested after each other at any time during the testing cycles, whereas temporally dependent test cases should be executed directly after each other within a specific time period.
• Internal Validity: concerns the conclusions of the study [35]. In order to reduce the threats to internal validity, the obtained results in this study are compared with the execution results of the test cases from three different testing cycles at BT. Furthermore, we assumed that those test cases which had failed due to other testing failure causes (e.g. mismatches between requirements and test cases) might also fail in the execution schedule proposed by sOrTES. Therefore, only the test cases that failed due to dependencies are assumed to pass when following our execution order. In the present work, multiple test artifacts such as the test specification, requirement specification and test records are analyzed, and the experiential knowledge of testing experts is also considered. However, the structure of the software requirement specification and the test specification at our use-case company can be considered a threat to this study. sOrTES was applied to a set of well-defined, semi-natural-language SRSs and test specifications which can be decomposed and analyzed quickly. In a more complex case with a different structure of SRSs and test cases, sOrTES might (or might not) perform accordingly, which can influence the accuracy of the detected dependencies and thereby the proposed execution schedules.
• External Validity: addresses the generalization of the proposed approach and findings [42]. sOrTES has been applied to just one industrial testing project in the railway domain; however, it should be applicable to other similar contexts in other testing domains. Primarily, the functional dependencies between requirements (and thereafter test cases) can be detected through analyzing the internal signal communications between the software modules in many different contexts. The extraction phase in sOrTES is currently designed based on the input structure of the
requirements and test case specifications. Secondly, scheduling test cases for execution based on several criteria and on their execution results is also applicable in other testing environments. Since the scheduling phase is designed based on a stochastic traveling salesman problem, sOrTES can also be applied to other queueing or stochastic traffic problems. Moreover, context information has been provided in Section 10.3 to enable comparison with other proposed approaches in the testing domain.
• Conclusion Validity: deals with the factors which can impact the observations and lead to an inaccurate conclusion [43]. Generally, an inadequate data analysis can yield conclusions that a proper analysis of the data would not have supported [44]. Utilizing human judgment, decisions and supervision might decrease this threat. Since sOrTES is designed and developed for industrial usage at BT, a close collaboration and dialogue with the domain experts was established in order to capture the industry's requirements and needs. The study presented in this paper was started in parallel with the integration testing phase of the BR project, and the initial results (dependency detection) were presented in several BR project meetings. Moreover, we conducted a workshop at BT for the members of BR and also of C30, a parallel ongoing project at BT (MOVIA C30 is a metro car project ordered by Stockholm Public Transport). The testers' and engineers' opinions about the schedules proposed by sOrTES were gathered, resulting in necessary modifications to sOrTES.
• Reliability: concerns the repeatability and consistency of the study [35]. The extraction phase of sOrTES has been tested for accuracy of results [32]. As outlined in Section 10.5, a total of 737 requirements are identified as independent requirements, which means that sOrTES could not find any matches for them. In consultation with the testers and engineers at BT, a number of spelling mistakes were found in the requirements and test case specification documents. Thus, the data in the DOORS database sometimes contains ambiguity, uncertainty and spelling issues. sOrTES searches for exact names of input and output signals when detecting dependencies. If no output signal match is found for an internal input signal, the corresponding requirement is counted as an independent requirement. Indeed, a single missing letter in the name of a signal means that no match is found, even if the signal enters (or exits) several requirements. This issue can directly impact the proposed
execution schedules as well. In addition, most language parsing techniques suffer from performance issues when a large set of data needs to be processed, and the available tools for natural language processing have their own shortcomings.
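A minimal sketch of the exact-name matching described above (an illustration only, not the sOrTES extraction phase; the requirement and signal names are invented) shows how a single misspelled character prevents a match and leaves a requirement classified as independent:

def dependent_requirements(requirements):
    """Match internal input signals against output signals by exact name.
    A requirement whose input signals are output by no other requirement
    is counted as independent."""
    outputs = {sig: req for req, io in requirements.items() for sig in io["out"]}
    deps, independent = {}, []
    for req, io in requirements.items():
        matched = {outputs[sig] for sig in io["in"] if sig in outputs}
        matched.discard(req)                       # ignore self-matches
        if matched:
            deps[req] = sorted(matched)
        else:
            independent.append(req)
    return deps, independent

# Hypothetical signals: a one-letter difference ("Voltage" vs "Volage") breaks the match.
reqs = {
    "R1": {"in": [],                "out": ["LineVoltageOK"]},
    "R2": {"in": ["LineVoltageOK"], "out": []},   # matched -> depends on R1
    "R3": {"in": ["LineVolageOK"],  "out": []},   # misspelled -> counted as independent
}
print(dependent_requirements(reqs))
# ({'R2': ['R1']}, ['R1', 'R3'])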
10.8 Discussion and Future work
The main goal of this study is to design, implement and evaluate an automated decision support system which schedules manual integration test cases for execution. To this end, we make the following contributions:
• We have proposed an NLP-based approach to automatically detect the dependencies between manual integration test cases. The dependencies have been extracted by analyzing multiple test process artifacts such as the software requirement specification and the test case specification (the extraction phase). A stochastic approach for the optimal scheduling of test execution has been proposed, in which the travelling salesman problem (TSP) is utilized for identifying a feasible set among the test cases (the scheduling phase). These phases are integrated in a Python-based tool called sOrTES.
• The evaluation of sOrTES was performed by applying it to an industrial testing project (BR) in a safety-critical train control management subsystem at Bombardier Transportation in Sweden. Moreover, the execution schedules proposed by sOrTES were compared with three different execution orders previously performed by BT.
• The performance analysis of sOrTES indicates that the number of fulfilled requirements increased by 9.6% compared to the execution orders at BT.
• The total troubleshooting time is reduced by up to 40% through avoiding redundant test executions caused by dependencies.
Scheduling test cases for execution provides an opportunity to use the testing resources more efficiently. Decreasing redundant executions directly impacts the testing quality and the troubleshooting cost. Utilizing sOrTES at an early stage of a testing process can help testers to get a clearer overview of the dependencies between the requirements. This information can also be used for designing more effective test cases.
The information provided about the test cases' properties (dependency and requirement coverage) in Table 10.4 can also be utilized for test case selection and prioritization purposes. At some testing levels, a subset of test cases only needs to be executed once; for this purpose, the test cases listed in Table 10.4 can be ranked for execution. However, the problem of dependencies between test cases does not exist in all testing environments, in which case test cases can be prioritized based on a single criterion or multiple criteria, such as their requirement coverage or execution time. As discussed in Section 10.6, around 40% of the failures occurred due to the dependencies between test cases, and these can be eliminated by detecting the dependencies before execution. Moreover, maximizing the requirement coverage in each execution cycle is another optimization aspect of the approach proposed in this paper. Developing sOrTES into a more powerful tool that can handle even larger sets of requirements and test specifications is one of the future directions of the present work. Merging ESPRET, for execution time prediction, with sOrTES is another research direction we are considering. In the future, one more step will be added to the extraction phase of sOrTES, which estimates a time value for each test case as the maximum time required for its execution. Dealing with time as a constraint in the scheduling problem leads us toward the traveling salesman problem with time windows, which supports this intuition.
10.9 Conclusion
Test optimization plays a critical role in the software development life cycle and can be performed through test case selection, prioritization and scheduling. In this paper, we introduced, applied and evaluated our proposed approach and tool, sOrTES, for scheduling manual integration test cases for execution. sOrTES takes software requirement specifications and test specifications as input and provides the dependencies and requirement coverage of each test case as output. First, a feasible set of dependent test cases is identified by sOrTES, and secondly, test cases are ranked for execution based on their requirement coverage. Our empirical evaluation at Bombardier Transportation and the analysis of the results of one industrial project show that sOrTES is an applicable tool for scheduling test cases for execution. sOrTES is also able to handle large sets of manual requirements and test specifications. sOrTES assigns a higher rank to those test cases which have fewer dependencies and cover more requirements. By removing the passed test cases from each testing schedule, a new execution schedule is proposed for the remaining test cases. This process continues until there are no test cases left in the testing cycle.
Continuous monitoring of the test cases' execution results minimizes the risk of redundant executions and thereby the troubleshooting effort. Moreover, utilizing sOrTES maximizes the number of fulfilled requirements per execution, thereby allowing a faster release of the final software product. Having these two optimization aspects at the same time results in a more efficient testing process and a higher-quality software product.
Bibliography

[1] S. Tahvili, M. Bohlin, M. Saadatmand, S. Larsson, W. Afzal, and D. Sundmark. Cost-Benefit Analysis of Using Dependency Knowledge at Integration Testing. Springer, 2016.
[2] S. Tahvili, W. Afzal, M. Saadatmand, M. Bohlin, D. Sundmark, and S. Larsson. Towards earlier fault detection by value-driven prioritization of test cases using fuzzy TOPSIS. In 13th International Conference on Information Technology: New Generations (ITNG'16). Springer, 2016.
[3] G. Kumar and P. Bhatia. Software testing optimization through test suite reduction using fuzzy clustering. CSI Transactions on ICT, 1(3):253–260, 2013.
[4] G. Cleotilde, F. Javier, and C. Lebiere. Instance-based learning in dynamic decision making. Cognitive Science, 27:591–635, 2003.
[5] S. Tahvili, M. Saadatmand, and M. Bohlin. Multi-criteria test case prioritization using fuzzy analytic hierarchy process. In The Tenth International Conference on Software Engineering Advances (ICSEA'15). IARIA, 2015.
[6] S. Tahvili, M. Saadatmand, S. Larsson, W. Afzal, M. Bohlin, and D. Sundmark. Dynamic integration test selection based on test case dependencies. In 11th Workshop on Testing: Academia-Industry Collaboration, Practice and Research Techniques. IEEE, 2016.
[7] E. Enoiu, D. Sundmark, A. Čaušević, and P. Pettersson. A comparative study of manual and automated testing for industrial control software. In International Conference on Software Testing, Verification and Validation (ICST'17). IEEE, 2017.
[8] R. Ramler, T. Wetzlmaier, and C. Klammer. An empirical study on the application of mutation testing for a safety-critical industrial software system. In Proceedings of the Symposium on Applied Computing (SAC'17). ACM, 2017.
[9] J. Itkonen and M. Mäntylä. Are test cases needed? Replicated comparison between exploratory and test-case-based software testing. Empirical Software Engineering, 19(2):303–342, 2014.
[10] S. Yoo and M. Harman. Regression testing minimization, selection and prioritization: A survey. Software Testing, Verification and Reliability, 22(2):67–120, 2012.
[11] S. Arlt, T. Morciniec, A. Podelski, and S. Wagner. If A fails, can B still succeed? Inferring dependencies between test results in automotive system testing. In 2015 IEEE 8th International Conference on Software Testing, Verification and Validation (ICST'15). IEEE, 2015.
[12] X. Cai, X. Wu, and X. Zhou. Optimal Stochastic Scheduling. International Series in Operations Research & Management Science. Springer US, 2014.
[13] J. Nino-Mora. Stochastic scheduling. In Encyclopedia of Optimization. Springer Netherlands, 2009.
[14] M. Harman. Making the case for MORTO: Multi objective regression test optimization. In 4th International Conference on Software Testing, Verification and Validation Workshops (ICSTW). IEEE, 2011.
[15] K. Walcott, M. Soffa, G. Kapfhammer, and R. Roos. Time aware test suite prioritization. In Proceedings of the International Symposium on Software Testing and Analysis (ISSTA'06). ACM, 2006.
[16] L. Zhang, S. Hou, C. Guo, T. Xie, and H. Mei. Time-aware test-case prioritization using integer linear programming. In Proceedings of the 18th International Symposium on Software Testing and Analysis (ISSTA'09). ACM, 2009.
[17] S. Wang, A. Shaukat, Y. Tao, B. Oyvind, and L. Marius. Enhancing test case prioritization in an industrial setting with resource awareness and multi-objective search. In Proceedings of the 38th International Conference on Software Engineering Companion (ICSE'38). ACM, 2016.
[18] Z. Li, M. Harman, and H. Hierons. Search algorithms for regression test case prioritization. IEEE Transactions on Software Engineering, 33(4):225–237, 2007.
[19] P. Strandberg, D. Sundmark, W. Afzal, T. J. Ostrand, and E. J. Weyuker. Experience report: Automated system level regression test prioritization using multiple factors. In 27th International Symposium on Software Reliability Engineering (ISSRE'16). IEEE, 2016.
[20] H. Srikanth, L. Williams, and J. Osborne. System test case prioritization of new and regression test cases. In International Symposium on Empirical Software Engineering (ESEM'05). IEEE, 2005.
[21] P. Caliebe, T. Herpel, and R. German. Dependency-based test case selection and prioritization in embedded systems. In 5th International Conference on Software Testing, Verification and Validation (ICST'12). IEEE, 2012.
[22] S. Haidry and T. Miller. Using dependency structures for prioritization of functional test suites. IEEE Transactions on Software Engineering, 39(2):258–275, 2013.
[23] S. Tahvili, L. Hatvani, M. Felderer, W. Afzal, M. Saadatmand, and M. Bohlin. Cluster-based test scheduling strategies using semantic relationships between test specifications. In 5th International Workshop on Requirements Engineering and Testing (RET'18). IEEE, 2018.
[24] S. Amitabh and J. Thiagarajan. Effectively prioritizing tests in development environment. SIGSOFT Software Engineering Notes, 27(4):97–106, 2002.
[25] N. Mouaaz and B. Ricardo. Developing scheduler test cases to verify scheduler implementations in time-triggered embedded systems, 2016.
[26] G. Rothermel, R. Untch, C. Chu, and M. Harrold. Prioritizing test cases for regression testing. IEEE Transactions on Software Engineering, 27(10):929–948, 2001.
[27] J. Kim and A. Porter. A history-based test prioritization technique for regression testing in resource constrained environments. In Proceedings of the 24th International Conference on Software Engineering (ICSE'02). ACM, 2002.
[28] W. Wong, J. Horgan, S. London, and H. Agrawal. A study of effective regression testing in practice. In Proceedings of the 8th International Symposium on Software Reliability Engineering (ISSRE'97). IEEE, 1997.
[29] H. Do, S. Elbaum, and G. Rothermel. Supporting controlled experimentation with testing techniques: An infrastructure and its potential impact. Empirical Software Engineering, 10(4):405–435, 2005.
[30] S. Elbaum, A. Malishevsky, and G. Rothermel. Incorporating varying test costs and fault severities into test case prioritization. In Proceedings of the 23rd International Conference on Software Engineering (ICSE'01). IEEE, 2001.
[31] S. Elbaum, A. Malishevsky, and G. Rothermel. Test case prioritization: a family of empirical studies. IEEE Transactions on Software Engineering, 28(2):159–182, 2002.
[32] S. Tahvili, M. Ahlberg, E. Fornander, W. Afzal, M. Saadatmand, M. Bohlin, and M. Sarabi. Functional dependency detection for integration test cases. In The 18th International Conference on Software Quality, Reliability and Security (QRS'18). IEEE, 2018.
[33] S. Tahvili, W. Afzal, M. Saadatmand, M. Bohlin, and Sh. Hasan Ameerjan. ESPRET: A tool for execution time estimation of manual test cases. Journal of Systems and Software, 146:26–41, 2018.
[34] J. Beneoluchi, M. Kahar, and M. Nizam. Solving the traveling salesman's problem using the African buffalo optimization. Computational Intelligence and Neuroscience, 2016:3:3–3:3, 2016.
[35] P. Runeson and M. Höst. Guidelines for conducting and reporting case study research in software engineering. Empirical Software Engineering, 14(2):131, 2008.
[36] E. Engström, P. Runeson, and A. Ljung. Improving regression testing transparency and efficiency with history-based prioritization – an industrial case study. In 4th International Conference on Software Testing, Verification and Validation (ICST'11). IEEE, 2011.
[37] Electric multiple unit class 490, Hamburg, Germany. [Accessed: 2018-02-13].
[38] Y. Cho, J. Kim, and E. Lee. History-based test case prioritization for failure information. In 23rd Asia-Pacific Software Engineering Conference (APSEC'16). IEEE, 2016.
[39] T. Noo and H. Hemmati. A similarity-based approach for test case prioritization using historical failure data. In 26th International Symposium on Software Reliability Engineering (ISSRE'15). IEEE, 2015.
[40] X. Wang and H. Zeng. History-based dynamic test case prioritization for requirement properties in regression testing. In Proceedings of the International Workshop on Continuous Software Evolution and Delivery (CSED'16), pages 41–47. ACM, 2016.
[41] C. Robson. Real world research: a resource for users of social research methods in applied settings. John Wiley & Sons, Chichester, West Sussex, third edition, 2011. Previous edition: 2002.
[42] C. Wohlin, P. Runeson, M. Höst, M. Ohlsson, B. Regnell, and A. Wesslén. Experimentation in Software Engineering: An Introduction. Kluwer Academic Publishers, 2000.
[43] P. Cozby and C. Rawn. Methods in Behavioural Research. McGraw-Hill Ryerson, 2012.
[44] E. Drost. Validity and reliability in social science research. Education Research and Perspectives, 38(1), 2011.