Artificial neural networks as multi-networks automated test oracle

Autom Softw Eng (2012) 19:303–334 DOI 10.1007/s10515-011-0094-z

Artificial neural networks as multi-networks automated test oracle Seyed Reza Shahamiri · Wan M.N. Wan-Kadir · Suhaimi Ibrahim · Siti Zaiton Mohd Hashim

Received: 26 September 2010 / Accepted: 8 September 2011 / Published online: 24 September 2011 © Springer Science+Business Media, LLC 2011

Abstract  One of the important issues in software testing is to provide an automated test oracle. Test oracles are reliable sources of how the software under test must operate; in particular, they are used to evaluate the actual results produced by the software. However, in order to generate an automated test oracle, it is necessary to map the input domain to the output domain automatically. In this paper, Multi-Networks Oracles based on Artificial Neural Networks are introduced to handle the mapping automatically. They are an enhanced version of previous ANN-based Oracles. The proposed model was evaluated within a framework provided by mutation testing and applied to test two industry-sized case studies. In particular, a mutated version of each case study was provided and injected with some faults. Then, a fault-free version of it was developed as a Golden Version to evaluate the capability of the proposed oracle to find the injected faults. Meanwhile, the quality of the proposed oracle is measured by assessing its accuracy, precision, misclassification error and recall. Furthermore, the results of the proposed oracle are compared with former ANN-based Oracles. The accuracy of the proposed oracle was up to 98.93%, and the oracle detected up to 98% of the injected faults. The results of the study show that the proposed oracle has better quality and applicability than the previous model.

Keywords  Automated software testing · Software test oracle · Artificial neural networks · Mutation testing

S.R. Shahamiri ()
Department of Software Engineering, Faculty of Computer Science and Information Technology, University of Malaya, 50603 Kuala Lumpur, Malaysia
e-mail: [email protected]

W.M.N. Wan-Kadir · S. Ibrahim · S.Z.M. Hashim
Department of Software Engineering, Faculty of Computer Science and Information Systems, Universiti Teknologi Malaysia, 81310 UTM Skudai, Johor, Malaysia


1 Introduction

Software testing is the process of evaluating software behavior to check whether it operates as expected, in order to improve its quality and reliability, which are important attributes of today's software applications (Bourlard and Morgan 1994). Since the testing process is highly time and resource consuming, complete testing is almost impossible; thus, testers use automated approaches to facilitate the process and decrease its costs (Sun et al. 2010). Whittaker (2000) explained that software testing is divided into four phases: modeling the software's environment, selecting test scenarios, running and evaluating the test scenarios, and measuring the testing process. Test oracles are usually used in the third phase, when testers want to evaluate the scenarios.

After test cases are executed and the testing results are generated, it is necessary to decide whether the results are valid in order to determine the correctness of the software behavior. To verify the behavior of the SUT (Software Under Test), correct results are compared with the results generated by the software. The results produced by the SUT that need to be verified are called actual outputs, and the correct results used to evaluate the actual outputs are called expected outputs (Whittaker 2000). Test oracles serve as a reliable source of how the SUT must behave and as a tool to verify the correctness of the actual outputs (Ran et al. 2009). Usually, a verifier compares actual and expected outputs. The problem of finding correct and reliable expected outputs is called the oracle problem (Ammann and Offutt 2008). Automated oracles must automatically provide correct expected outputs for any input combination specified by the software documentation. In particular, it is necessary to map the input domain to the output domain automatically. Then, an automated comparator is required to compare actual and expected results in order to verify the correctness of the SUT behavior.

Previously, Artificial Neural Network (ANN) based oracles, which we call Single-Network Oracles (S-N Oracles), were employed to perform the required mapping in order to test small software applications. S-N Oracles use only one ANN to model the SUT; thus, their accuracy may not be adequate when they are applied to test complex software applications requiring large-scale scenarios (Hall 2008), because their capacity is limited to a single ANN. We overcome this limitation by introducing Multi-Networks Oracles (M-N Oracles), which consist of several standalone ANNs instead of only one, in order to automate the mapping and provide an accurate oracle for more complicated software applications where S-N Oracles may fail to deliver a high-quality oracle. They are explained in detail in Sect. 4.

Furthermore, since all ANN-based oracles are only an approximation of the SUT, they may not produce exactly the same outputs as the expected results. In particular, a direct point-to-point comparison is sometimes not sufficient, and it is necessary to allow some tolerance when comparing the actual and the expected outputs. As an illustration, the actual and expected outputs may differ slightly yet still be considered the same. Therefore, the proposed model includes an automated comparator that applies thresholds to define the comparison tolerance and adjust the precision of the proposed oracle.
We evaluated the proposed approach using mutation testing (Woodward 1993) by applying it to test two industry-sized case studies.


First, a Mutated Version of each application was provided and injected with some faults. Then, a fault-free version of it was developed as a Golden Version¹ to evaluate the capability of the proposed oracle to find the injected faults. The quality of the proposed approach was assessed by measuring the following parameters:

¹ A Golden Version of a software application is a fault-free implementation of it that generates correct expected results. It was produced to evaluate the proposed oracle in this study.

(1) Accuracy: how accurate the oracle results are. It conveys what percentage of the expected outputs generated by the proposed oracle is correct. It was measured by comparing the correct expected results, which were derived from the Golden Version, with the results produced by the proposed oracle.
(2) Precision: the tolerance parameter dictating how close expected and actual results must be to each other for the comparator to rule a match. It can be adjusted using threshold parameters.
(3) Misclassification Error: the number of false reports produced by the comparator.
(4) Recall: the proportion of the injected faults that was identified by the proposed oracle. Note that it is different from accuracy, because accuracy considers both successful and unsuccessful test cases, whereas recall considers only unsuccessful test cases (i.e. the test cases that reveal a fault).

The process of measuring the above parameters is explained in Sects. 8 and 9. Moreover, in order to highlight the advantages of the proposed model, an S-N Oracle for each case study was provided and applied in the same way as the M-N Oracle. Then, the results produced by all oracles were compared in detail.
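As a concrete illustration of how these four measures relate to the comparison outcomes defined later in Sect. 8, the following is a minimal sketch in Python (the study's own tooling was written in C#/.NET); treating recall as TN/(TN+FP) is our reading of the paper's definitions, not an equation stated in it.

    # Illustrative Python sketch (not the study's C#/.NET implementation).
    # tp, tn, fp, fn are the four comparison outcomes defined in Sect. 8.

    def oracle_quality(tp, tn, fp, fn):
        total = tp + tn + fp + fn
        accuracy = (tp + tn) / total                    # share of correct oracle verdicts
        misclassification = (fp + fn) / total           # share of false reports
        recall = tn / (tn + fp) if (tn + fp) else 0.0   # assumed: share of injected faults detected
        return accuracy, misclassification, recall

    # Example with made-up counts: 9000 passing cases, 980 detected faults,
    # 20 missed faults and 50 false alarms.
    acc, mce, rec = oracle_quality(tp=9000, tn=980, fp=20, fn=50)
    print(f"accuracy={acc:.4f}  misclassification={mce:.4f}  recall={rec:.4f}")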

2 Related work

This section provides a survey of prominent and state-of-the-art oracles. First, some of the most popular oracles are explained. Then, after recent studies on automated test oracles are surveyed, previous studies on ANN-based oracles are described. In addition, some applications of ANNs to automating other software testing activities are reviewed as well.

2.1 Prominent test oracles

This section explains some popular test oracles. In general, automated oracles can be used as shown in Fig. 1; each of the oracles mentioned here can take the place of the oracle in the figure.

Human oracles are the most popular type of non-automated oracle. They are people who understand the software domain, such as programmers, customers, and domain experts. Human oracles are not automated, they cannot be completely reliable when there are thousands of test cases, and they can be expensive.


Fig. 1 Using an automated oracle in software testing

Random Testing is a black-box test-data-generation technique that randomly selects test cases to match an operational profile (Spillner et al. 2007). It uses statistical models in order to predict the reliability of the SUT. It can scarcely be considered a test oracle because it cannot provide expected outputs for the entire I/O domain, which violates the oracle definition. It automates test case selection, but the expected outputs must still be provided by some other approach. Moreover, although the cost of random testing is low, its effectiveness is not adequate (Ntafos 2001).

Cause-Effect Graphs and Decision Tables (Jorgensen 2002) are two popular test oracles that can be applied to address the I/O mapping. Although there are tools that build the required structures from formal software specifications and written programs in order to create the oracles automatically, they still need some human observation and improvement to achieve the best oracle. They are usually considered a black-box testing method, but they may be seen as gray-box if the improvements require examining the source code. In addition, they are limited to verifying the logical relationships between causes and their effects as implemented by the SUT. Similarly, they can grow very quickly and lose readability for complex applications with many conditions and related actions (Spillner et al. 2007). Therefore, these oracles may not be effective when the SUT applies heavy calculations to produce its results.

Formal Oracles have been considered before as well. They can be generated from formal models such as formal specifications (Stocks and Carrington 1996; Richardson et al. 1992) and relational program documentation (Peters and Parnas 1994, 1998). They may provide both fully automated and reliable oracles if an accurate and complete formal model of the SUT, from which the oracle is generated, exists. However, especially for large-scale software applications, formal models can be expensive and difficult to provide. Moreover, there seems to be no compelling evidence of their effectiveness for large-scale software applications (Pfleeger and Hatton 1997).

Providing an oracle for model transformation testing was discussed by Kessentini et al. (2011a). They used an example-based method to generate oracle functions.

2.2 State-of-the-art automated oracle models

Automated oracles have recently been considered as a way of finding an effective solution to the oracle challenges. N-Version Diverse System is a testing method based on various implementations of the SUT. It uses a different, redundant implementation of the computation as the Golden Version. To put it differently, it uses various versions of the SUT that all implement the same functionalities but are developed independently by different development teams and methods.


The independent versions may be seen as a test oracle. Nonetheless, the idea can result in an expensive process because different versions of the SUT must be implemented by different programming teams. In addition, it cannot guarantee the efficiency of the testing process, since each of the implementations can itself be faulty.

Manolache and Kourie (2001) suggested a similar approach to decrease the cost of N-Version Diverse Systems. In particular, the authors described another solution based on N-Version Testing called M-Model Programs testing (M-mp). The new approach aims to reduce the cost of the former approach and increase the reliability of the testing process by providing a more precise oracle. M-mp testing implements only different versions of the functions to be tested, whereas N-Version Diverse implements several versions of the whole software. Although M-mp testing reduces the cost of the previous method, it can still be very expensive and unreliable.

Kessentini et al. proposed an oracle for model transformation testing. It relies on the premise that the more a transformation deviates from well-known good transformation examples, the more likely it is erroneous (Kessentini et al. 2011b). To put it differently, this oracle compares expected results with a base of examples that contains good-quality transformation traces, and then assigns a risk level to them accordingly.

There have been several attempts to apply Artificial Intelligence (AI) methods in order to build test oracles automatically. These methods vary according to the applied AI technique. As an illustration, Last and his colleagues (Last and Freidman 2004; Last et al. 2004) introduced a fully automated black-box regression tester using the Info Fuzzy Network (IFN). IFN is an approach developed for knowledge discovery and data mining. The interactions between the input and the target attributes (discrete and/or continuous) are represented by an information theoretic connectionist network. An IFN represents the functional requirement by an oblivious tree-like structure, where each input attribute is associated with a single layer and the leaf nodes correspond to combinations of input values (Last and Freidman 2004). The authors developed an automated oracle that could generate, execute, and evaluate test cases automatically based on previous versions of the SUT, in order to regression test (Briand et al. 2009) the upcoming versions. There are three differences between the model proposed by the authors and the one proposed here. First, Last and his colleagues applied Info Fuzzy Networks to produce a multi-target model that may act as a test oracle, whereas we consider ANNs. The capability of their model to simulate complex test oracles should be studied before it can be compared with the proposed Multi-Networks Oracle, because the case study the authors considered was an expert system for solving partial differential equations, which cannot be compared with our case studies in terms of software domain and complexity. Second, the multi-target model proposed by the authors is less accurate than a single-target model, which is what we propose, although they claimed the former is more compact than the latter. Finally, their model can only be applied to verify the functionalities that remain unchanged in new increments of the software (i.e. regression testing).

In Shahamiri et al. (2011), we introduced the idea of using multi-network ANNs to compute input/output relationships for test oracles and showed how these can fit within a full framework for automated testing.


By contrast, in this paper we focus on showing how multi-network ANNs can improve performance over that of single-network ANNs for this purpose.

Genetic Algorithms (GA) have been employed to provide oracles for condition coverage testing, which concerns conditional statements (Woodward and Hennell 2006). Manually searching for test cases that increase condition coverage is difficult when the SUT has many nested decision-making structures. Therefore, using an automated approach to generate effective test cases that cover more conditional statements is helpful. Michael et al. (Michael and McGraw 1998; Michael et al. 2001) introduced an automated Dynamic Test Generator using GA in order to identify effective test cases. Dynamic test generators are white-box techniques that examine the source code of the SUT in order to collect information about it; this information can then be used to optimize the testing process. The authors applied GA to perform condition coverage for C/C++ programs, identifying test cases that may increase the source-code coverage criteria. The drawback of this approach is that it may not be reliable for testing complex programs consisting of many nested conditional statements, each with several logical operations. Furthermore, the model is not platform-independent. Building on the above approach, Sofokleous and Andreou (2008) combined GA with a Program Analyzer to propose an automated framework for generating optimized test cases targeting condition coverage testing. The program analyzer examines the source code and provides its control flow graph, which can be used to measure the coverage achieved by the test cases. Then, GA applies the graph to create optimized test cases that may reach higher condition-coverage criteria. The authors claimed that this framework provides better coverage than the previous one for testing complex decision-making structures. However, it is still platform dependent. Such GA-based test frameworks may be considered an automated test oracle that addresses the I/O mapping only for condition coverage testing, and they do not guarantee that full coverage can be achieved. Therefore, they may provide an imperfect oracle. Other methods such as Equivalence Partitioning and Boundary Value Analysis (Ammann and Offutt 2008; Spillner et al. 2007) are test case generation approaches rather than test oracles, hence they are not discussed here as test oracles.

Memon et al. (Memon et al. 2000, 2005; Memon 2004) applied AI planning as an automated GUI (graphical user interface) test oracle. By representing the GUI elements and actions, the internal behavior of the GUI was modeled in order to extract the GUI's expected state automatically during execution of each test case. A formal model of the GUI, composed of its objects and their specifications, was designed based on GUI attributes and used as an oracle. GUI actions were defined by their preconditions and effects, and expected states were automatically generated using both the model and the actions from test cases. Similarly, the actual states were described by a set of objects and their properties obtained by the oracle from an execution monitor. In addition, the oracle applied a verifier to compare the two states automatically and find faults in the GUI.


The problem is that a formal model of the GUI is required; moreover, the approach can only be used to verify GUI states.

We conducted a study reviewing prominent methods that may be considered for automating software testing activities and showed how ANNs can be used to automate the software testing activities mentioned above (Shahamiri et al. 2009; Shahamiri and Wan 2008). As an illustration, Su and Huang proposed an approach to estimate and model software reliability using ANNs (Su and Huang 2007). An effective test case selection approach using ANNs is introduced in Saraph et al. (2003), which studied the application of I/O analysis to identify which input attributes have more influence over outputs. It concluded that I/O analysis could significantly reduce the number of test cases; the ANN was applied to automate I/O analysis in order to determine the important I/O attributes and their ranks. Khoshgoftaar et al. (1992, 1995) proposed a method that employs ANNs to predict the number of faults in the SUT based on software metrics. In addition, testability of program modules was studied by the same authors in Khoshgoftaar et al. (2000).

2.3 ANN based single-network oracles

Vanmali and his colleagues proposed an automated oracle based on ANNs (Vanmali et al. 2002). The authors modeled an ANN to simulate the software behavior using the previous version of the SUT, and applied this model to regression test unchanged software functionalities. Using the previous version of the SUT, expected outputs were generated and the training pairs were employed to train the ANN. Aggarwal et al. applied the same approach to solve the triangle classification problem (Aggarwal et al. 2004). In particular, an application that implemented triangle classification was tested using a trained ANN. Their work was followed by Jin et al. (2008). We applied Single-Network Oracles to test decision-making structures (Shahamiri et al. 2010a) and to verify complex logical modules (Shahamiri et al. 2010b). In particular, a Multilayered Perceptron ANN that modeled a university subject-registration policy was trained using domain experts' knowledge of the system. Then, the trained ANN was employed to test an application that performed the subject registration and verified whether students followed the policies or not.

All the above ANN-based oracles were applied to test discrete functions. Mao et al. formulated ANNs as test oracles to test continuous functions (Mao et al. 2006). Consider the continuous function y = F(x), where x is the software input vector, y is the corresponding output vector, and F is the software behavior. The function F was modeled and expected outputs were generated using a trained ANN. Unlike the Multilayered Perceptron neural networks applied in all the above studies, Lu and Ye (2007) used a different type of ANN to provide automated oracles and test a small mathematical continuous function. Perceptron neural networks are explained later.

Although all of these studies have shown the significance of S-N Oracles based on ANNs, as we illustrate later, they may not be reliable when the complexity of the SUT increases, because they require larger training samples that can make the ANN learning process complicated. A tiny ANN error could increase the oracle misclassification error significantly in large software applications.


Previous ANN-based oracle studies were evaluated on small applications with small I/O domains; therefore, one neural network was enough to perform the mapping in those studies. Furthermore, ANN-based oracles may not provide exactly the same output vector as the expected results. In particular, a minuscule difference may exist between the expected output generated by the ANN-based oracle and the correct one. A direct comparison may classify such results as faulty even though they can be perceived as correct, because the SUT may not require ultimate precision. We introduce M-N Oracles to perform the I/O mapping in order to test more complicated software applications where S-N Oracles may fail to deliver a high-quality oracle. In addition, the proposed model includes a comparator with defined thresholds in order to adjust the oracle precision and to prevent correct results from being classified as incorrect when the distance between the oracle results and the expected results is small.

3 Multi-layered perceptron neural networks

In recent years, the research community has moved from purely theoretical research toward practical research in order to find solutions for hard-to-solve problems. Similarly, researchers are interested in model-free intelligent dynamic systems based on experimental data. ANNs are one such system: they discover the hidden knowledge in experimental data and learn it while processing the data. They are called intelligent systems because they can learn from calculations on numerical data and examples. ANNs try to model natural neural systems, but they can only simulate the information processing capability of such systems to a limited extent (Schalkoff 1997). They are network structures composed of interconnected elements called neurons; each neuron has one or more inputs and outputs and performs a simple local summation. Each neuron input has a corresponding weight, which acts as the neuron's memory. Furthermore, neurons have Bias Weights and Activation Functions that squash the summation results into specific value ranges. Adjusting these weights causes the network to learn the hidden knowledge being modeled, through a process called the Training Process. In particular, Training Samples, which consist of inputs and corresponding outputs, are given to the network during this process while the ANN tries to discover the relationships between inputs and outputs and learn them. After training, whenever an input combination is provided to the trained network, it should be able to generate the associated result(s). In other words, the ANN recognizes how to respond to each input pattern. Note that the accuracy of the ANN depends on how well the network structure is defined and how well the training process is performed.

The Learning Rate is one of the training parameters; it controls how fast the ANN learns and how effective the training is. It lies in the range (0..1); choosing a value very close to zero requires a large number of training cycles and makes the training process extremely slow. On the contrary, large values may cause the weights to diverge and make the objective error function fluctuate heavily; thus, the resulting network reaches a state where the training procedure can no longer be effective.
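The neuron computation described above (a weighted local sum plus a bias weight, squashed by an activation function) can be sketched as follows. This is an illustrative Python fragment, not code from the study, and the values are arbitrary.

    import numpy as np

    # Illustrative sketch of a single neuron: a weighted sum of its inputs plus
    # a bias weight, squashed by a sigmoid activation function.

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def neuron(inputs, weights, bias):
        return sigmoid(np.dot(weights, inputs) + bias)   # local add, then squash

    x = np.array([0.2, 0.7, 1.0])    # normalized inputs in [0, 1]
    w = np.array([0.5, -0.3, 0.8])   # input weights (the neuron's "memory")
    print(neuron(x, w, bias=0.1))    # a single output squashed into (0, 1)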


Networks with only one neuron are limited because they cannot implement non-linear relationships; hence, Multilayer Perceptron networks are one of the most popular types of ANNs for solving non-linear problems (Menhaj 2001). They are multi-layered networks with no limitation on the number of neurons, and they must have an input layer, one or more hidden layers (or middle layers), and an output layer. The general model of Perceptron networks is Feed-Forward with a Back-Propagation training procedure. In a feed-forward network, the inputs of the first layer are connected and propagated to the middle layers, and the middle layers in turn to the output layer. In the back-propagation procedure, after the network results are generated, the parameters from the last layer back to the first layer are corrected in order to decrease the network misclassification error. The misclassification error can be represented by the MSE,³ which is the squared difference between the training sample outputs and the results generated by the network. According to Heiat (2002), the MSE shows how well the ANN outputs fit the expected outputs: a lower MSE represents better training quality and more accuracy in producing correct results.

³ Mean squared error.
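A minimal sketch of a one-hidden-layer feed-forward Perceptron trained by back-propagation to reduce the MSE, as described in this section. It is written in Python/NumPy purely for illustration (the study used C# and NeuronDotNet), and the layer sizes, learning rate and random toy data are assumptions.

    import numpy as np

    # Illustrative NumPy sketch of a feed-forward Multilayer Perceptron with one
    # hidden layer, trained by back-propagation to reduce the MSE.

    rng = np.random.default_rng(0)
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    n_in, n_hidden, n_out, lr = 8, 30, 4, 0.01
    W1, b1 = rng.normal(0, 0.5, (n_in, n_hidden)), np.zeros(n_hidden)
    W2, b2 = rng.normal(0, 0.5, (n_hidden, n_out)), np.zeros(n_out)

    X = rng.random((100, n_in))    # normalized training inputs in [0, 1]
    Y = rng.random((100, n_out))   # normalized expected outputs in [0, 1]

    for cycle in range(1000):                      # training cycles
        H = sigmoid(X @ W1 + b1)                   # feed-forward pass
        O = sigmoid(H @ W2 + b2)
        err = O - Y
        mse = np.mean(err ** 2)                    # objective error function
        dO = err * O * (1 - O)                     # back-propagate the error data
        dH = (dO @ W2.T) * H * (1 - H)
        W2 -= lr * H.T @ dO;  b2 -= lr * dO.sum(axis=0)
        W1 -= lr * X.T @ dH;  b1 -= lr * dH.sum(axis=0)

    print(f"final MSE: {mse:.4f}")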

4 Multi-networks oracles versus single-network oracles

As mentioned in the previous sections, ANN-based oracles comprising only one ANN (i.e. Single-Network Oracles) may not be able to model the SUT if the software application is too complicated and generates several results. The main drawback of such oracles is that they learn the entire set of functionalities using only one ANN. Thus, if the number of functionalities or their complexity increases, the single ANN may fail to learn them with enough accuracy. This is why we introduce the Multi-Networks Oracle, which uses several standalone ANNs in parallel instead of one, in order to distribute the complexity of the SUT among several ANNs. Suppose the output domain of the SUT consists of outputs O1 to On:

Output Domain = {O1, O2, ..., On}

In order to test the software using a Single-Network Oracle, the entire I/O domain is modeled using only one ANN. In particular, the ANN must learn all of the functionalities required to generate every output itself. On the other hand, using a Multi-Networks Oracle, the functionalities associated with each output are modeled by a standalone ANN. In other words, a Multi-Networks Oracle uses several Single-Network Oracles in parallel. Since the complexity of the SUT is distributed among several ANNs, each ANN has less to learn, which eases the training process; thus, it is easier for the ANNs to converge on the training data. Because complex software applications may require a huge training dataset, the practicality of Single-Network Oracles for finding faults may be reduced. We therefore introduced Multi-Networks Oracles, which use several ANNs instead of one to learn the SUT. To put it differently, a Multi-Networks Oracle is composed of several Single-Network Oracles.


Fig. 2 A single-network oracle

Fig. 3 A multi-networks oracle

A single network is defined for each output item of the output domain; all of the networks together then make up the oracle. As an illustration, if the SUT produces seven output items, we need seven ANNs to create the Multi-Networks Oracle. In particular, the complexity of the software is distributed among several networks instead of having a single network do all of the learning. Consequently, separating the ANNs may reduce the complexity of the training process and increase the practicality of the oracle for finding faults. Note that the training process must be carried out for each ANN separately, using the same input vectors but only the output to be generated by that ANN. Figure 2 shows a Single-Network Oracle and Fig. 3 depicts the Multi-Networks Oracle. There are other ways to distribute the complexity among the ANNs as the software functionalities increase. For example, it may be possible to consider software modules instead of outputs: for each module of the software, we can use an ANN to learn its related functionalities. Consequently, the complexity may be decreased by inserting a new ANN into the oracle. Moreover, Multi-Networks Oracles increase the flexibility to use several types of ANNs with different structures and parameters.
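The difference between the two oracle styles can be sketched as follows, using scikit-learn's MLPRegressor as a stand-in for the study's C#/NeuronDotNet networks; the hyperparameters and toy data are illustrative assumptions.

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    # Illustrative sketch; scikit-learn stands in for the study's tooling.

    rng = np.random.default_rng(0)
    X = rng.random((500, 8))   # 8 normalized inputs
    Y = rng.random((500, 4))   # 4 normalized expected outputs

    # Single-Network Oracle: one ANN must learn the whole output vector.
    sn_oracle = MLPRegressor(hidden_layer_sizes=(30,), activation="logistic",
                             learning_rate_init=0.01, max_iter=2000).fit(X, Y)

    # Multi-Networks Oracle: one standalone ANN per output item, trained on the
    # same input vectors but only its associated output column.
    mn_oracle = [MLPRegressor(hidden_layer_sizes=(30,), activation="logistic",
                              learning_rate_init=0.01, max_iter=2000).fit(X, Y[:, i])
                 for i in range(Y.shape[1])]

    # The per-output predictions together form the complete expected output vector.
    expected = np.column_stack([net.predict(X) for net in mn_oracle])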


Fig. 4 Multi-networks oracle training procedure

5 Making the multi-networks oracle

The ANNs need to be trained before they are applied as automated oracles. They must be trained using training samples composed of inputs and corresponding expected results. The training samples are required in order to perform the training process, and they have to be accurate. The ANNs model the SUT by trying to learn these samples, which can be generated using the software documentation, domain experts, or any other reliable approach. These samples are then used to train the ANNs. Note that ANN-based test oracles can be considered a black-box test method. In the following, the ANN training phase is described first, and then the process of applying the proposed oracle is explained.

5.1 The training process

Figure 4 depicts the training process. Before proceeding to ANN training, all non-numeric inputs and outputs must be normalized into numeric values in order to achieve better ANN accuracy. Non-numeric and text data are mapped to numbers, while binary data are treated as zero and one. Similarly, continuous numeric data are scaled to [0, 1]. Then, all of the normalized I/O vectors are used as the training samples to train the ANNs. By learning the training samples, the ANNs become able to decide which output vector must be chosen for each input vector.

In the training phase, each training sample is used to simulate the SUT's functional behavior. The input vectors are provided to the network and each ANN generates its associated output. The results generated by each ANN are compared with the outputs from the training samples. In order to adjust each network's parameters, the training algorithm back-propagates the error data into the network and modifies the network parameters automatically.
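A minimal sketch of the normalization step just described; the field names, category codes and value ranges are illustrative assumptions rather than the encoding actually used in the study (Python is used here for illustration only).

    # Illustrative sketch of the normalization step; not the study's encoding.

    LICENSE_CODES = {"A": 1, "B": 2, "C": 3, "D": 4}    # discrete text values -> integers

    def normalize_input(license_type, full_insurance, claimed_amount, max_amount=10000.0):
        """Map one raw input record to the numeric vector fed to the ANNs."""
        return [
            LICENSE_CODES[license_type] / len(LICENSE_CODES),  # categorical, scaled into (0, 1]
            1.0 if full_insurance else 0.0,                    # binary treated as zero and one
            min(claimed_amount / max_amount, 1.0),             # continuous, scaled to [0, 1]
        ]

    print(normalize_input("B", True, 2500.0))    # [0.5, 1.0, 0.25]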


Then, the whole process is repeated until an adequate error, i.e. an adequate MSE, is reached. If the training process is not successful enough (i.e. the MSE remains large), the network structure needs to be modified by trial and error until an adequate error is achieved. Note that there is no practical method to automatically determine how low the network error should be, because ANNs are not exact; this must be found through trial and error, and it also depends on the expertise of the tester in building ANNs. Nevertheless, if the training error is very large, there might be incoherencies or ambiguities within the training samples; for example, there may be two different output vectors for a single input vector.

During the training phase, each training sample is applied in order to make an accurate ANN-based test oracle. Note that the training samples given to each ANN comprise all of the inputs but only the output that is associated with that ANN. In particular, the input vectors are given to the network and the result is generated by the ANN. In order to increase the quality of the network, each ANN result is compared with the correct expected result provided by the training sample. The network error (i.e. MSE) can be measured by calculating the squared distance between the expected results and the network-generated results. To achieve an adequate error rate, the network parameters (the neuron and bias weights mentioned in Sect. 3) are adjusted by back-propagating the error data into the network. Then, the whole process is repeated until an adequate error rate is obtained. At this point, the network adjustment cycles in Fig. 4 are finished and the resulting network is ready to be used as one of the networks required to make the Multi-Networks Oracle. This process is repeated for the rest of the networks.

5.2 Using the trained ANNs as multi-networks oracle

Once the networks are trained, we can apply them in the testing process. The test cases are executed on the trained networks (the Multi-Networks Oracle) and on the SUT simultaneously. The process of using the proposed oracle is as follows: test cases are given to the trained networks (the M-N Oracle) while they are being executed on the SUT. The actual outputs, which are to be evaluated, are generated by the SUT, and the expected outputs by the oracle. Each ANN is responsible for generating the result it was trained for; therefore, the results from all of the networks make up the complete output vectors. An automated comparator is defined to decide whether the actual outputs produced by the SUT are faulty or not. The algorithm of the automated comparator is given in Sect. 8. Note that the entire process may be automated, with minimal human effort needed to prepare the environment and the necessary dataset.

6 Case studies

For evaluation purposes, the proposed model was applied to two industry-size case studies, which are presented in the following.

6.1 The first case study

The first case study is a web-based car insurance application that manages and maintains insurance records, determines the payment amount claimed by customers, handles customers' requests for renewal, and performs other related insurance operations.


Moreover, the application is responsible for applying the insurance policies to customers' data and producing four outputs that are used throughout the application. According to the insurance policies, the customers' data are fed to the application through eight inputs; thus, the input vector has eight inputs and the output vector has four outputs. Half of the outputs are continuous and the other half are binary.

Equivalence Partitioning was used to provide the input domain (i.e. the inputs of the training samples). It is a test case reduction technique that classifies a group of data values into an equivalence class that the SUT processes the same way. Testing one representative of the equivalence class is sufficient, because for any other input value of the same class the SUT does not behave differently (Jorgensen 2002; McCaffrey 2009; Myers 2004; Patton 2005; Spillner et al. 2007). Table 1 shows the input vector and its equivalence partitions, and Table 2 depicts the output vector. The first case study has an I/O domain comprising 13824 equivalence partitions, as explained in Table 1. The business layer of the application was implemented in 1384 lines of code, excluding the user-interface layer. Cyclomatic Complexity (Patton 2005) indicates the structural complexity of program code; in particular, it measures how many different paths must be traversed in order to fully cover each line of program code. Higher numbers mean more testing is required and the SUT is more complicated. The Cyclomatic Complexity of the insurance policies implemented by the case study was 38, measured using the Code Analysis package included in Microsoft Visual Studio 2010 Ultimate. The case study was developed using C# and ASP.Net.

We applied the proposed model to verify how well the insurance policies were implemented by the application. In order to provide a complete test oracle, we must provide at least one training sample for each of the insurance policies and train the ANNs completely. Each insurance policy was represented by one of the identified equivalence partitions. Therefore, 13824 equivalence partitions were required as the training samples to cover all the insurance policies implemented by the software. To put it differently, the insurance policies were reflected in 13824 different training (I/O) pairs, which were chosen by equivalence partitioning. The framework proposed by Shahamiri et al. (2011) was employed to automatically generate the SUT responses to the identified partitions (i.e. the output domain) and provide the training samples. In particular, the authors employed I/O Relationship Analysis (Schroeder et al. 2002; Schroeder and Korel 2000) as part of their framework to generate a reduced set of expected results and expand them in order to provide the rest of the outputs. However, the training samples may also be generated using software specifications, domain experts, or any other test-data-generation method.

6.2 The second case study

The second case study is a registration-verifier application. The goal of the software is to maintain and manage the students' records and validate their registration based on the Iranian Universities Bachelor-Students Registration Policies. The policies require a complex logical process based on the students' data, which are given to the software as the input vector and consist of eight data items.


Table 1 The input domain of the first case study (equivalence classes and their counts)

  1. The driver's experience: less than 5 years; between 5 and 10 years; more than 10 years (3 classes)
  2. Type of the driver license: A, B, C, D (4 classes)
  3. Type of the car: Sedan; SUV; MPV; Hatchback; Sport; Pickup (6 classes)
  4. The credit remaining in the insurance account: less than 100$; more than 100$ (2 classes)
  5. Cost of this accident: less than 25% of the initial credit; between 25% and 50% of the initial credit; more than 50% of the initial credit (3 classes)
  6. Type of the insurance: Full; Third Party (2 classes)
  7. The car age: less than 10 years; between 10 and 15 years; between 15 and 20 years; more than 20 years (4 classes)
  8. Number of registered accidents since last year: 0; between 0 and 5; between 5 and 8; more than 8 (4 classes)

  Total Equivalence Classes = 3 × 4 × 6 × 2 × 3 × 2 × 4 × 4 = 13824

The software and the oracles implement these rules and make decisions on the validity of registrations, the maximum number of courses the students are allowed to select, and whether they can apply for a discount or not. The Cyclomatic Complexity of the registration policies implemented by the case study is 18. The input domain consists of eight inputs, as shown in Table 3, and the output domain consists of three outputs, depicted in Table 4. The equivalence classes in Table 3 were provided using equivalence partitioning, and the last value given for each input is the number of its equivalence classes. For example, the fourth input has three equivalence classes; hence, the software behaves the same for any value of input 4 less than or equal to 12.


Table 2 The outputs of the first case study

  1. Insurance Extension Allowance: a Boolean data item that is true if the insurance is allowed for extension and false otherwise.
  2. Insurance Elimination: a Boolean data item that is true if the insurance account is terminated and false otherwise.
  3. The Payment Amount: the amount to be paid for accidents based on the claimed amount (input 5) and the insurance rules. This output is continuous.
  4. Credit: the amount assigned to each insurance account as its available credit. This output is continuous.

Table 3 The second case study input domain

  GPA: the student's GPA. Equivalence classes: 14; …
  Total-Number-of-Conditional-Registration: the total number of conditional registrations the student has used. Equivalence classes: ≤ 3; > 3 (2 classes)
  CPA: the student's CPA. Equivalence classes: ≤ 12; > 12 and ≤ 17; > 17 (3 classes)
  StudyingMode: the applicant can be a full-time or part-time student. Equivalence classes: Full Time (True); Part Time (False) (2 classes)
  HasDiscountBefore: whether the student applied for a discount before or not. Equivalence classes: True; False (2 classes)
  …

  Total Equivalence Classes = 2 × 2 × 2 × 3 × 2 × 3 × 2 × 2 = 576


Table 4 The second case study output domain

  1. IsAllowedToRegister: is the student permitted to register for the requested semester?
  2. MaxAllowedCourses: the maximum number of courses that the student can apply for.
  3. Discount: whether the student is eligible to apply for a discount.

Table 5 The single-network oracle structure (first case study)

  Input Neurons: 8    Hidden Neurons: 30    Output Neurons: 4    Learning Rate: 0.01    Training Cycles: 10000
  MSE per output: Output1 = 0.0016, Output2 = 0.0005, Output3 = 0.0147, Output4 = 0.0018
  Total MSE: 0.0046

As can be seen in Table 3, the input domain of the second case study is composed of 576 equivalence partitions, which were generated in the same way as for the first case study.
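The partition counts in Tables 1 and 3 are simply the Cartesian product of the per-input equivalence classes. The following is a minimal sketch (Python, for illustration only; the representative labels per class are assumptions, since the paper does not list the concrete representatives it used):

    from itertools import product

    # Illustrative sketch: the input combinations of Table 1 are the Cartesian
    # product of one representative per equivalence class.

    equivalence_classes = {
        "driver_experience":   ["<5y", "5-10y", ">10y"],
        "license_type":        ["A", "B", "C", "D"],
        "car_type":            ["Sedan", "SUV", "MPV", "Hatchback", "Sport", "Pickup"],
        "remaining_credit":    ["<100$", ">100$"],
        "accident_cost":       ["<25%", "25-50%", ">50%"],
        "insurance_type":      ["Full", "ThirdParty"],
        "car_age":             ["<10y", "10-15y", "15-20y", ">20y"],
        "accidents_last_year": ["0", "0-5", "5-8", ">8"],
    }

    combinations = list(product(*equivalence_classes.values()))
    print(len(combinations))   # 3*4*6*2*3*2*4*4 = 13824 training-sample inputs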

7 The experiment

The proposed oracle was applied to verify the case studies. After the training samples were ready, we proceeded with the training process. A small tool was created to train the required ANNs using the NeuronDotNet⁴ package, which provides the required libraries and tools to create and train ANNs using Microsoft Visual Studio. We conducted the experiments by providing both the S-N Oracle and the M-N Oracle in order to compare their results. Note that all of the networks are Multilayered Perceptron networks with a Sigmoid activation function and one hidden layer. Since all the outputs for both case studies are scaled or normalized to the range [0, 1], the Sigmoid is adequate. All of the ANNs were trained using the back-propagation algorithm with the same training samples, which represent the insurance policies for the first case study and the registration policies for the second case study.

7.1 The first case study experiment

Table 5 shows the structure of the S-N Oracle and Fig. 5 depicts the corresponding training error graph for the first case study. The network training error is presented for each output separately.

⁴ http://neurondotnet.freehostia.com/.
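The per-output and total MSE values reported in Tables 5 through 8 can be computed as sketched below. This is an illustrative Python fragment rather than the study's C#/NeuronDotNet code; the example arrays are made up, and taking the total MSE as the mean of the per-output values is our assumption based on the reported numbers.

    import numpy as np

    # Illustrative sketch of the MSE figures in Tables 5-8: the mean squared
    # difference between expected and generated outputs, per output column.

    def per_output_mse(expected, generated):
        """expected, generated: arrays of shape (n_samples, n_outputs) in [0, 1]."""
        return np.mean((expected - generated) ** 2, axis=0)

    expected  = np.array([[1.00, 0.00, 0.25, 0.60],
                          [0.00, 1.00, 0.50, 0.30]])
    generated = np.array([[0.97, 0.02, 0.22, 0.61],
                          [0.05, 0.98, 0.49, 0.33]])

    mse = per_output_mse(expected, generated)
    print("per-output MSE:", np.round(mse, 4), "  total MSE:", round(mse.mean(), 4))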


Fig. 5 The single-network oracle training error graphs

Since the insurance policies were complicated, an S-N Oracle failed to learn them with enough accuracy; the minimum MSE we observed through several experiments was as large as 0.0046 over 10000 training cycles on an 8 × 30 × 4 Multilayer Perceptron network with learning rate 0.01 (eight input, 30 hidden and four output neurons). On the other hand, the proposed Multi-Networks Oracle could reach a lower MSE of 0.00076, as shown in Table 6. Table 6 gives the Multi-Networks Oracle structure and the MSEs, and Fig. 6 depicts the corresponding training error graphs. Each of the networks in Fig. 6 is associated with its output: Network 1 is the first ANN, which generates the first output; Network 2 is the second ANN, which generates the second output; and so on. Since the output vector has four outputs, four ANNs were required to make the Multi-Networks Oracle. Note that each of the ANNs in the Multi-Networks Oracle was given only its associated output; for example, the training samples for the second ANN, which modeled the second output, contained all of the inputs but only the second output, since that ANN must generate the second output only. Thus the first, third, and fourth outputs are not required when training the second ANN.

All of the inputs fed to and outputs generated by the ANNs are normalized. In particular, the third and fourth outputs are continuous data items that were scaled to the range [0..1].


Table 6 The multi-networks oracle structure (first case study)

  ANN1: Output1 (Insurance Extension Allowance), binary. Input neurons: 8, hidden neurons: 13, output neurons: 1, learning rate: 0.01, training cycles: 1400, MSE: 0.00025
  ANN2: Output2 (Insurance Elimination), binary. Input neurons: 8, hidden neurons: 3, output neurons: 1, learning rate: 0.1, training cycles: 150, MSE: 0.00002
  ANN3: Output3 (The Payment Amount), continuous. Input neurons: 8, hidden neurons: 30, output neurons: 1, learning rate: 0.25, training cycles: 2000, MSE: 0.0023
  ANN4: Output4 (Credit), continuous. Input neurons: 8, hidden neurons: 30, output neurons: 1, learning rate: 0.25, training cycles: 2000, MSE: 0.00048
  Total MSE: 0.00076

For example, The Payment Amount (i.e. output 3) is the percentage of the claimed amount according to the insurance rules; the percentage was divided by 100 and scaled to the range. The fourth output was treated the same way. Binary inputs were scaled to zero percent (i.e. "False") and one hundred percent (i.e. "True"); the first and second outputs are binary, so they were treated the same way. The discrete inputs were mapped to integer values; as an illustration, the "A" value of the second input was regarded as one, "B" as two, and so on.

A comparison between Fig. 5 and Fig. 6 shows that the S-N Oracle could not converge on the training samples as easily as the M-N Oracle did, because it had higher MSEs and more training cycles were necessary. This was caused by the complexity of the insurance policies that the single ANN had to learn. On the other hand, the M-N Oracle reached a stable state easily because the complexity of the policies was distributed among several ANNs.

7.2 The second case study experiment

Since the input domain of the second case study has eight inputs, the input layer has eight neurons. Similarly, the output domain has three outputs; therefore, the Multi-Networks Oracle consists of three ANNs. On the other hand, only one ANN is required if we provide an S-N Oracle. As mentioned earlier, the experiments were repeated with an S-N Oracle in order to compare its results with those of the proposed M-N Oracle. Table 7 shows the structure of the S-N Oracle and Table 8 shows the same for the M-N Oracle. The network parameters (the number of hidden neurons and layers, training cycles, the learning rate and the activation functions) were adjusted by trial and error in order to minimize the MSE. In addition, to provide better confidence in the networks, we also measured the actual Absolute Error Rate of the trained networks after applying the test cases. The absolute error rate is the squared difference between the proposed oracle results and the correct expected results.

Once the neural networks were trained, it was possible to use them as an automated oracle. Nevertheless, the quality of the resulting oracle needed to be verified against accuracy, precision, misclassification error and recall in order to evaluate the proposed approach, as mentioned in the first section.


Fig. 6 The multi-networks oracle training error graphs

Table 7 The single-network oracle structure (second case study)

  Input Neurons: 8    Hidden Neurons: 40    Output Neurons: 3    Learning Rate: 0.01    Training Cycles: 10000
  MSE per output: Output1 = 0.01764, Output2 = 0.00081, Output3 = 0.00102
  Total MSE: 0.00589

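Because each network in a Multi-Networks Oracle is standalone, its structure and training parameters can be tuned independently, as in Tables 6 and 8. The following is a minimal sketch of such per-output configurations, again using scikit-learn as a stand-in for the study's C#/NeuronDotNet tooling (the dictionary keys and the reuse of Table 6's values here are purely illustrative):

    from sklearn.neural_network import MLPRegressor

    # Illustrative sketch: each per-output network can get its own structure and
    # training parameters; the values mirror Table 6 but this API is not the
    # study's code.

    per_output_configs = {
        "InsuranceExtensionAllowance": dict(hidden_layer_sizes=(13,), learning_rate_init=0.01, max_iter=1400),
        "InsuranceElimination":        dict(hidden_layer_sizes=(3,),  learning_rate_init=0.10, max_iter=150),
        "PaymentAmount":               dict(hidden_layer_sizes=(30,), learning_rate_init=0.25, max_iter=2000),
        "Credit":                      dict(hidden_layer_sizes=(30,), learning_rate_init=0.25, max_iter=2000),
    }

    oracle = {name: MLPRegressor(activation="logistic", **cfg)
              for name, cfg in per_output_configs.items()}

    # Each network would then be fitted on the full input vectors and only its
    # own output column, e.g. oracle["Credit"].fit(X_train, y_credit).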

8 Evaluation

The proposed model was evaluated using mutation testing, by developing two versions of each case study. The first version was a Golden Version, which was a complete, fault-free implementation of the case studies that generated correct expected results.


Table 8 The multi-networks oracle structure (second case study)

  ANN1: Output1 (AllowedToRegister). Input neurons: 8, hidden neurons: 30, output neurons: 1, learning rate: 0.01, training cycles: 10000, MSE: 0.0073
  ANN2: Output2 (MaxAllowedCourses). Input neurons: 8, hidden neurons: 30, output neurons: 1, learning rate: 0.01, training cycles: 10000, MSE: 0.00001
  ANN3: Output3 (Discount). Input neurons: 8, hidden neurons: 30, output neurons: 1, learning rate: 0.01, training cycles: 10000, MSE: 0.00005
  Total MSE: 0.0024

The other version was a Mutated Version, which was injected with common programming mistakes. In order to verify the oracles, the ability of the networks to find the injected faults was evaluated using the following steps:

1. The test cases were executed on both the Golden Version and the Mutated Version.
2. Meanwhile, the test cases were given to the proposed oracle.
3. The Golden Version results were completely fault free and were considered the expected results.
4. All the results (the results from the oracle, the mutated results, and the expected results) were compared with each other, and any distance between them larger than a defined tolerance was reported as a possible fault, as explained later.
5. To make sure the oracle results were correct, the outputs from the oracle were compared with the expected results (the Golden Version results), and their distances were considered the oracle's absolute error.

Figure 7 shows the verification process. Since the neural networks are only an approximation of the case studies, some of their outputs may be incorrect. On the other hand, the mutated classes themselves may produce errors, and finding their faults is the main reason for the testing process. Therefore, the results of the comparison can be divided into four categories:

1. True Positive: all the results are the same, so the comparator reports "no fault". This means there is actually no fault in either the mutated version or the oracle. True positives represent the successful test cases.
2. True Negative: the expected and the oracle results are the same, but they are different from the mutated results. In this case, the oracle results are correct, and the oracle correctly finds a fault in the mutated version.
3. False Positive: both the oracle and the mutated version produced the same incorrect results. Therefore, they are different from the expected results, and faults in both the oracle and the mutated version are reported. To put it differently, the oracle missed a fault.
4. False Negative: the mutated and the expected results are the same, but they are different from the oracle results. Thus, the comparator reports a faulty oracle.


Fig. 7 The evaluation process

Since the third and fourth categories represent the oracle misclassification error, we mainly considered these types of faults when measuring the accuracy of the proposed oracle. Moreover, as shown in Tables 5 through 8, the MSEs were not exactly zero. Hence, we defined thresholds specifying how much distance between the expected and the oracle results could be ignored (i.e. the comparison tolerance). If the distance was less than the threshold, both results were considered the same; otherwise, the results of the oracle were not accurate and false positive or false negative faults were reported. The thresholds define the precision of the proposed oracle, and they can be increased or decreased as necessary, as discussed later. Figure 8 shows the comparison algorithm.

9 Results

This section explains the results of the experiment. The training quality of the neural networks was already discussed in the previous sections; hence, the results of applying the produced oracles to the case studies are presented here. The quality benchmarks explained earlier were measured to assess the quality of the proposed approach. In order to evaluate the quality of the Multi-Networks Oracles, two Single-Network Oracles, one for each case study, were provided as well. Five experiments were conducted on the case studies to evaluate the accuracy of the proposed oracle under different precision levels.


1. Start
2. Calculate the distance between the expected result and the oracle result as absoluteDistance
3. Calculate the distance between the mutated result and the oracle result as mutatedDistance
4. If absoluteDistance ≤ threshold then
   a. The oracle is correct
   b. If mutatedDistance ≤ threshold then True Positive is reported (no fault)
   c. Else True Negative is reported (a fault is found)
5. Else
   a. The oracle is faulty
   b. If the mutated result and the expected result are not the same, False Positive is reported (missed fault)
   c. Else False Negative is reported (the oracle failed to verify the mutated result)
6. End

Fig. 8 The comparison algorithm
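A minimal, runnable rendering of the Fig. 8 algorithm (the study's comparator was part of its C#/.NET tooling); results are assumed to be normalized scalars, the threshold value is arbitrary, and the exact-equality check of step 5b is approximated here with the same tolerance.

    # Illustrative rendering of the Fig. 8 comparison algorithm.

    def compare(expected, oracle, mutated, threshold=0.05):
        """Classify one test-case outcome as TP, TN, FP or FN."""
        absolute_distance = abs(expected - oracle)   # oracle vs. Golden Version result
        mutated_distance = abs(mutated - oracle)     # mutated (SUT) result vs. oracle
        if absolute_distance <= threshold:           # the oracle is correct
            return "TP" if mutated_distance <= threshold else "TN"
        # the oracle is faulty
        return "FP" if abs(mutated - expected) > threshold else "FN"

    print(compare(expected=0.60, oracle=0.61, mutated=0.60))   # TP: no fault
    print(compare(expected=0.60, oracle=0.61, mutated=0.90))   # TN: injected fault found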

Table 9 Samples of the mutants (columns: Original Code, Mutated Code, Error Type, Affected Output)

  if (Credit 15) && (CarAge 10) && (CarAge