Fault Evaluator: A Tool for Experimental Investigation of Effectiveness in Software Testing

William Jenkins, Sergiy Vilkomir, William Ballance
Department of Computer Science, East Carolina University, Greenville, USA
{jenkinsw10, vilkomirs, ballancew08}@ecu.edu


Abstract—The specifications for many software systems, including safety-critical control systems, are often described using complex logical expressions. It is important to find effective methods to test implementations of such expressions. Analyzing the effectiveness of testing logical expressions manually is a tedious and error-prone endeavor, so special software tools are required for this purpose. This paper presents Fault Evaluator, a new tool for experimental investigation of testing logical expressions in software. The goal of this tool is to evaluate logical expressions with various test sets created according to a specific testing method and to estimate the effectiveness of that testing method in detecting specific faulty variations of the original expressions. The main functions of the tool are the generation of complete sets of faults in logical expressions for several specific types of faults; obtaining expected (oracle) values of logical expressions; testing faulty expressions and detecting whether a test set reveals a specific fault; and evaluating the effectiveness of a testing approach.

Keywords: software testing; effectiveness; logical expressions; tool

I. INTRODUCTION

The availability of a variety of software testing methods necessitates a comparison of different testing approaches and their effectiveness. Researchers who have studied the effectiveness of testing methods usually define "effectiveness" as the ability to detect failures in a software program. Different measures of testing effectiveness have been suggested, including the expected number of detected failures (E-measure), the probability of detecting at least one failure (P-measure) [1], the number of test cases required to detect the first failure (F-measure) [2], and measures of effectiveness for subdomain-based testing criteria [3].

This paper presents Fault Evaluator, a new software tool developed for experimental investigation of testing logical expressions in software. The goal of this tool is to evaluate several logical expressions using various test sets (created according to any specific testing method) and then use these data to estimate the effectiveness of the testing method in detecting specific faulty variations of the original expressions. The main functions of this tool are the following:



• Generation of complete sets of faults in logical expressions for several specific types of faults.
• Obtaining expected (oracle) values of logical expressions.
• Testing faulty expressions and detecting whether a test set reveals a specific fault.
• Evaluating the effectiveness of a testing approach.

The tool can also generate test sets for random testing, and it can work with test sets created by external tools. By using many test sets, Fault Evaluator combines the results of testing various expressions with all possible faults of different types. This, in turn, allows the collection of statistical data sufficient for precise evaluation of probabilistic measures of testing effectiveness.

No single testing approach can be expected to be equally effective in all situations (for different types of software, different types of faults, etc.). For this reason, the effectiveness of software testing methods is often evaluated separately for one or more specific areas. One such area is the testing of logical expressions in software. Approaches used in this area include simple decision and condition coverage criteria as well as more complicated and effective approaches such as Modified Condition/Decision Coverage (MC/DC) [4], Reinforced Condition/Decision Coverage (RC/DC) [5], and MUMCUT [6]. The effectiveness of these approaches has been discussed in many papers, including [7, 8, 9, 10, 11]. This is an important research area because specifications of safety-critical software, for instance control systems for nuclear power plants and avionics software, are often represented by logical expressions. At the same time, this area offers interesting research opportunities because of the mathematical nature of the specifications: testing criteria [12] and types of faults [13] for logical expressions can be clearly defined and formalized, and studies of testing effectiveness can be automated using software tools for experimental investigation. Various tools [14, 15, 16] have been developed and used to investigate testing effectiveness. However, researchers often report the results of experimental investigations without detailed information on the tools being used. Such information, including descriptions of tool functionality, algorithms, and input and output data, is imperative for understanding the nature of experimental results and their trustworthiness.



(Figure 1 depicts the Main Class, the Engine, the Expression DB, the Test DB, and the Fault Generator, together with the Input Files (Expressions and Test Sets, possibly produced by External Tools), the Compact, Summary, and Detail result Windows, and the CSV and HTML Output Files.)

Figure 1. Structure of the Fault Evaluator tool.


This paper is organized as follows. Section 2 considers the implementation details and general architecture of Fault Evaluator. Section 3 addresses tool usage from the user's point of view. Section 4 describes the outputs of the tool, including output tables with experimental results at different levels of detail and output files in different formats. Section 5 provides a short description of a case study in which Fault Evaluator was used for an experimental investigation of the effectiveness of a pair-wise testing approach for testing logical expressions in software. Section 6 concludes the paper with directions for future work.

II. TOOL IMPLEMENTATION

Fault Evaluator was developed using the Eclipse IDE for Java Developers and runs on Java Runtime Environment 1.6. It uses the Java Swing library for graphical user interface (GUI) support. The classes, with few exceptions, were written from scratch. The structure of the Fault Evaluator tool is shown in Fig. 1, which provides an abstract view of the main conceptual classes in the tool.

The core class is called Main; its purpose is to instantiate an Engine instance and to build the main GUI window that the user first sees. Main handles the resulting GUI callbacks and loads data from the files selected by the user before sending it to the Engine instance for storage and additional processing. Expressions are stored within their own objects, which handle conversion and formatting of the expression throughout the program. To further manage expression objects, a database is used to keep them grouped, with each correct expression acting as the first member of a group of faulty expressions.

Test sets are then stored in a test database, which associates each test case with its test set. When it is time to evaluate, Engine pulls each expression from the expression database and evaluates it with the valid test cases that it finds in the test set database. Engine then stores the results for each test case within the expression itself, where the index of the test matches the index of the storage space within the expression. With this data, Main regains control and waits for the user to decide how to view the results. Depending on the output requested by the user, a specialized window class is used to manage the widgets for each type of request. The Engine is responsible for preparing the data as a table widget, but each window is responsible for its own layout and GUI callbacks. Whenever a window receives a command to save the data, complex windows relay the request back to Engine, whereas compact views simply copy the table data to a file.

Fault generation is handled by a series of complex string manipulations. The expression in which faults are to be generated is sent to a dedicated Fault Generator, which is responsible for generating all types of faults that the program supports. The Engine uses the Fault Generator on this string to generate a single type of fault; the Fault Generator is then freed, and a new generator is constructed for each string. This is less efficient, but given the complexity of some manipulations, it avoids the difficulty of correctly resetting state variables. The general methodology of generating faults starts with the expression, which at this point is just an infix string, being passed through a tokenizer. The tokens are then searched for the position before or after which the fault will be inserted.
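To make the evaluation step described above more concrete, a rough sketch in Java is given below. The class and method names (Expression, TestCase, storeResult, and so on) are our own illustrative shorthand, not the actual Fault Evaluator API.

```java
import java.util.List;
import java.util.Map;

final class EvaluationSketch {

    /** Evaluates every stored expression against every compatible test case. */
    static void evaluateAll(List<Expression> expressions, List<TestCase> testCases) {
        for (Expression expr : expressions) {
            for (int i = 0; i < testCases.size(); i++) {
                TestCase tc = testCases.get(i);
                // Only test cases with the right number of inputs are valid for this expression.
                if (tc.inputCount() != expr.variableCount()) {
                    continue;
                }
                boolean result = expr.evaluate(tc.inputs());
                // Results are stored in the expression itself, indexed by the test case,
                // mirroring the storage scheme described in the text.
                expr.storeResult(i, result);
            }
        }
    }

    // Minimal placeholder types so the sketch is self-contained.
    interface Expression {
        int variableCount();
        boolean evaluate(Map<String, Boolean> inputs);
        void storeResult(int testIndex, boolean value);
    }

    interface TestCase {
        int inputCount();
        Map<String, Boolean> inputs();
    }
}
```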

III. TOOL USAGE

Fault Evaluator starts by requesting that the user load prebuilt expression files. Note that other options within the tool are disabled until certain criteria are met. This prevents the user from accessing methods that require data from expressions, test sets, or the evaluation results, as illustrated in Fig. 2.

Fault Evaluator supports and automatically generates five different types of faults in logical expressions:
• Variable Negation Fault (VNF).
• Operator Reference Fault (ORF).
• Variable Reference Fault (VRF).
• Associative Shift Fault (ASF).
• Expression Negation Fault (ENF).
These types of faults have been considered previously by many researchers. Precise definitions for each can be found, for example, in [7] or [13].
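As a simple illustration (ours, not output produced by the tool), one possible fault of each type for the correct expression (a and b) or c is:

```
Correct expression:  (a and b) or c
VNF:  (not a and b) or c     (one variable occurrence negated)
ORF:  (a or b) or c          (an "and" replaced by "or")
VRF:  (c and b) or c         (the variable "a" replaced by another variable, "c")
ASF:  a and (b or c)         (parentheses shifted, changing the scope)
ENF:  not (a and b) or c     (a subexpression negated)
```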

Files with expressions are simple text files, with one expression per line. Each file must consist of syntactically correct expressions that combine variables with "and", "or", and "not" operators in various scopes. An example of such an input file with logical expressions is shown in Fig. 3.
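For instance, a small input file of this kind might contain the following lines (a made-up example; the actual file shown in Fig. 3 may differ):

```
(a and b) or (not c)
a and (b or (c and d))
not (a or b) and c
```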

The process of generating VNF faults is initiated by counting the number of variable instances in the expression. This number indicates how many derivatives will be created from this one expression and allows the allocation of storage for each result. From this point onwards, the string is tokenized and the tokens are copied to each of the resulting expressions. If a token is a variable, a negation operator is inserted before that same token in one of the faults, and the program proceeds until all tokens have been processed. Once this process is complete, the tool returns the resulting array of faulty string expressions.
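A minimal sketch of this VNF procedure, assuming a whitespace-separated infix form and our own helper names, might look as follows; it only approximates the string manipulation implemented in the tool.

```java
import java.util.ArrayList;
import java.util.List;

final class VnfSketch {

    /** Returns one faulty expression per variable occurrence, each with that occurrence negated. */
    static List<String> generateVnf(String expression) {
        String[] tokens = expression.trim().split("\\s+");   // simple whitespace tokenizer
        List<String> faults = new ArrayList<String>();
        for (int i = 0; i < tokens.length; i++) {
            if (!isVariable(tokens[i])) {
                continue;                                     // only variable tokens are negated
            }
            StringBuilder fault = new StringBuilder();
            for (int j = 0; j < tokens.length; j++) {
                if (j == i) {
                    fault.append("not ");                     // insert negation before this occurrence
                }
                fault.append(tokens[j]).append(' ');
            }
            faults.add(fault.toString().trim());
        }
        return faults;
    }

    private static boolean isVariable(String token) {
        return !token.equals("and") && !token.equals("or") && !token.equals("not")
                && !token.equals("(") && !token.equals(")");
    }

    public static void main(String[] args) {
        // Example: "( a and b ) or c" yields three VNF variants.
        System.out.println(generateVnf("( a and b ) or c"));
    }
}
```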

These expressions can be supplied in two formats. In the first format, all the expressions in the file are deemed correct and can be used with the automatic fault generation to produce derivative faulty expressions. This is the fastest way to load large numbers of expressions into the program. In the second format, only the first expression in the file is deemed correct, and all subsequent expressions are treated as faults derived from that first expression. In this way, the user can evaluate faults whose automatic generation is not currently supported by the program. These faults are labeled within the program as the "Unknown Fault" type.

The ORF process is very similar to VNF. Instead of inserting a negation, each "and" or "or" operator is replaced by its opposite in one of the faults. VRF follows a similar approach to both VNF and ORF but must be adapted slightly for the various replacements. In VRF, the first step is to find the unique variables and build a list of them. Then, as the tokens of the string are stepped through and variables are found, each variable occurrence is replaced in turn by the other variables in the list, and each replacement yields a unique fault.
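A corresponding sketch of the ORF step, under the same assumptions about tokenization as above, could look like this:

```java
import java.util.ArrayList;
import java.util.List;

final class OrfSketch {

    /** Returns one faulty expression per operator occurrence, with "and" and "or" swapped. */
    static List<String> generateOrf(String expression) {
        String[] tokens = expression.trim().split("\\s+");
        List<String> faults = new ArrayList<String>();
        for (int i = 0; i < tokens.length; i++) {
            if (!tokens[i].equals("and") && !tokens[i].equals("or")) {
                continue;
            }
            String[] copy = tokens.clone();
            copy[i] = copy[i].equals("and") ? "or" : "and";   // replace the operator by its opposite
            StringBuilder fault = new StringBuilder();
            for (String token : copy) {
                fault.append(token).append(' ');
            }
            faults.add(fault.toString().trim());
        }
        return faults;
    }
}
```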

Once expressions (and any pre-built faults) are loaded, users can either load a test set file or choose to create random test sets for the evaluations. In the former case, the test set file is an index to various other test set files, allowing the user to quickly load many different test set files with one import. Each test set file contains input values that are substituted into the expression, with one test case per line.
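As a purely hypothetical illustration of this layout (the exact file syntax is not reproduced here), an index file might list the test set files, and each test set file might hold one test case per line:

```
# tests-index.txt (hypothetical file names and syntax): one test set file per line
pairwise-3vars.txt
random-3vars.txt

# pairwise-3vars.txt: one test case per line, one input value per variable
0 0 1
0 1 0
1 0 0
1 1 1
```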

In more complex faults such as ENF and ASF, the tokens are rearranged, and search algorithms ensure that a token swap will create the appropriate fault and still produce a syntactically correct expression. In ASF, parentheses are shifted, thus altering the scope of the expression. However, there are distinct rules governing how a parenthesis can move. For example, a parenthesis cannot shift beyond the beginning or end of the expression, and a parenthesis cannot shift past another parenthesis. A parenthesis must shift so that it encompasses at least one more operator and operand, except when a "not" is involved. In that case, if the parenthesis is an opening parenthesis, it must stop before a negation of a variable unless it was already just after the negation. A closing parenthesis can ignore negations and simply aims to encase another operator and operand.
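For example (our illustration, not generated by the tool), one legal associative shift for a simple expression is:

```
Original:        (a and b) or c
One ASF fault:   a and (b or c)    (the parentheses shift right, changing which operator is in scope)
```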

Figure 2. The Main Menu window of the Fault Evaluator tool.

ENF is the most complicated fault type supported by the program. It involves two steps: the first is to insert a negation before each opening parenthesis. In the second step, the program recursively searches for groups of operands that are joined by an "and" operator. These groups can include other scoped parts of the expression or the entire expression itself. Each of these "and blocks" is surrounded by parentheses, and then the entire block is negated. Finally, the program converts each resulting fault into a binary tree and performs a tree comparison to ensure that no duplicate faults were generated during this process. The resulting data is compacted into an array with no empty entries and returned to the Engine.
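For example (again our illustration), the two ENF steps applied to the expression a and (b or c) would yield faults such as:

```
Original:  a and (b or c)
Step 1:    a and not (b or c)      (negation inserted before the opening parenthesis)
Step 2:    not (a and (b or c))    (the whole "and" block parenthesized and negated)
```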

Figure 3. Example of an input file with logical expressions.


The output tables represent three levels of detail. The first three options are "Compact Views," which show the entire set of results divided by the denoted aspect. The next level is the "General Summary" view, and from this level it is possible to go to the "Detailed View," which has detailed information for individual faulty expressions and for each test case. The three "Compact View" windows break down the results by expression, fault type, and test set. An example of the "Compact View" window for two expressions and six test sets is shown in Fig. 5. For expression 1 in Fig. 5, there are 45 faults derived from this particular correct expression. The next column shows that there are 3 test sets that are appropriate for this expression, that is, they have an appropriate number of inputs for the unique variables in the expression. The third column indicates that from these 45 faulty expressions and 3 test sets, 135 evaluations are possible.

Figure 4. The options for displaying results within the Results menu.


The following column shows how many of those evaluations detected a fault. Finally, in the last column, the number of detections over the number of evaluations is shown to determine the effectiveness of all the test sets for this single expression.
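For instance, if (hypothetically) 54 of the 135 possible evaluations for expression 1 detected a fault, the reported effectiveness for that expression would be 54/135 = 0.4, that is, 40%.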

The last step before evaluation is to generate faults from the correct expressions. Faults are generated via complex rules encoded into the program, which are applied to each expression's string form. The user may select any combination of the five supported fault types or rely on previously loaded faults paired with a correct expression.

The more detailed "General Summary" view shows the percentages of effectiveness for each expression and test set, broken down by fault type. This view provides a significant amount of information, but large numbers of expressions or test sets are discouraged, as the window can quickly become crowded. Finally, from the "General Summary" window, each cell can be clicked to expand a "Detailed View," which shows the results of evaluation for that expression and test set, as well as all the faults generated from that expression. The "Detailed View" is available for any cell where an evaluation was appropriate for the test set; cells where an evaluation was not appropriate are filled with double dashes. The total column then shows a weighted average of the effectiveness of this single test set at detecting all of the faults. Examples of the "General Summary" and "Detailed View" windows are presented in Fig. 6 and 7, respectively.

Finally, the program proceeds to process each test case through each appropriate expression. The program checks the number of unique variables in each correct expression and then loads only the test sets that match that number of inputs. After these processes are complete, the program calculates a set of hardcoded statistics based on the number of faulty expressions whose results differ from those of the correct expression. Such a difference represents a "detection," and the percentage of detections indicates how well a test set exposes faults in the original expression.
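The detection statistic itself amounts to a small calculation of the following kind; this is our own sketch rather than the tool's internal code.

```java
final class DetectionSketch {

    /**
     * A fault is "detected" by a test set when the faulty expression and the
     * correct expression produce different results for at least one test case.
     */
    static boolean detects(boolean[] correctResults, boolean[] faultyResults) {
        for (int i = 0; i < correctResults.length; i++) {
            if (correctResults[i] != faultyResults[i]) {
                return true;
            }
        }
        return false;
    }

    /** Effectiveness = detected evaluations / total evaluations, as shown in the result tables. */
    static double effectiveness(int detections, int evaluations) {
        return evaluations == 0 ? 0.0 : (double) detections / evaluations;
    }
}
```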

IV. TOOL OUTPUTS

Fault Evaluator focuses on visual output in the form of tables, which are available from the Results menu. The main options are shown in Fig. 4.

Figure 5. Compact Summary window for Expression view.

Figure 6. General Summary window.


Figure 7. Detailed View window.


All of these result windows provide the option to save the viewed data as an HTML or CSV file. To simply preserve a record of the data, HTML files can be created; these show the results of the evaluations and the percentage of faulty expressions detected (Fig. 8).

The use of Fault Evaluator was instrumental in the investigation: it drastically reduced the time needed for fault generation and evaluation and eliminated miscalculations of effectiveness. Without Fault Evaluator, the investigation would have been tedious and error-prone due to the high number of test cases and faulty logical expressions.

HTML files are generally inappropriate, however, for any use other than logging or providing a printable version of the evaluations. A better format is the comma-delimited CSV format, which allows the data to be moved from the program to other tools.

Description of our initial experimental results is presented in [20]. More than 100,000 test cases were generated and applied according to the pair-wise approach, and more than 5,000 faulty logical expressions were evaluated.

This format preserves the table as it is, complete with table headers and percentages, which allows the data to be reused in later sessions. However, the real power of saving as a comma-delimited file lies in the ability to move that file to other programs, such as Microsoft Excel, Numbers, or Open Spreadsheet, and then process the data with more complicated algorithms. The raw CSV markup and how it is rendered are shown in Fig. 9 and 10, respectively.
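A saved file of this kind might look roughly as follows (an invented illustration; the actual column layout of Fault Evaluator's CSV output is the one shown in Fig. 9):

```
Expression,Faults,Test sets,Evaluations,Detections,Effectiveness
1,45,3,135,54,40%
2,30,4,120,66,55%
```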

V. CASE STUDY

Fault Evaluator can use test cases generated according to any testing approach. As a case study, we used Fault Evaluator to investigate the effectiveness of a pair-wise testing approach for logical expressions in software. This is a non-traditional application of pair-wise testing, which is usually used to cover various combinations of input parameters. The only previous investigation in this area that we are aware of is [11]. We analyzed the same set of 20 logical specifications of the TCAS II system [7, 17] that were considered in [11]. In our investigation, pair-wise test cases were generated using the tools Allpairs [18] and TConfig [19], and all other activities (fault generation, expression evaluation, and testing effectiveness estimation) were handled by Fault Evaluator.

Figure 8. Example of the rendered HTML generated by the program.


Figure 9. CSV markup.

Figure 10. CSV markup as rendered by Microsoft Excel.

It was estimated that the effectiveness of pair-wise testing for logical expressions is in the range 0.19–0.48 for different types of faults. The results show that the pair-wise approach is not highly effective for testing logical expressions. However, if a pair-wise approach is already used to cover an input domain, applying it at the same time to logical expressions is useful and provides additional benefits.

VI. CONCLUSIONS AND FUTURE WORK

Specifications for many software systems, including safety-critical control systems, are often described using complex logical expressions. It is important to find effective methods to test implementations of such expressions. Analyzing the effectiveness of testing logical expressions manually is a tedious and error-prone endeavor, so special software tools should be used for this purpose. That is why Fault Evaluator, a new tool for experimental investigation of testing logical expressions in software, was developed. This paper has described the architecture and main functionality of the tool and the various formats in which final results can be represented. The usage of the tool from the user's point of view has also been considered.


Such a tool is a key element of success in investigations of testing effectiveness. The tool should provide a large, statistically sound number of evaluations of faulty logical expressions to guarantee the trustworthiness of the experimental results. In Fault Evaluator, the number of original logical expressions and the number of test sets used are determined by the user of the tool. However, the tool automatically generates all possible faulty expressions for each original expression for five different types of faults. This allows precise probabilistic measures of testing effectiveness to be obtained. The tool has been used extensively for evaluating the effectiveness of testing logical expressions with random and pair-wise testing approaches.


There are still a variety of additions that this program could receive in the future to expand its goal from "quicker evaluations" to a more comprehensive test suite. For example, the tool could generate random logical expressions of a desired degree of complexity, with a predetermined number of scopes and variables. Another direction for future work is the automatic generation of test sets according to various testing criteria. Finally, the dictionary of fault types is planned to be replaced with an engine that can apply various rules to an expression. This would mean that any additional fault types could be added to the program without modifying the source code.


REFERENCES


[1] T. Chen, F. Kuo, and R. Merkel, "On the statistical properties of testing effectiveness measures," Journal of Systems and Software, vol. 79, no. 5, 2006, pp. 591–601.
[2] T. Chen, H. Leung, and I. Mak, "Adaptive random testing," Proc. of the 9th Asian Computing Science Conference (ASIAN 2004), LNCS vol. 3321, Springer, 2004, pp. 320–329.
[3] E. Weyuker, "Can we measure software testing effectiveness?", Proc. of the 1st International Software Metrics Symposium, May 21–22, 1993, Baltimore, MD, USA, 1993, pp. 100–107.
[4] J. Chilenski and S. Miller, "Applicability of Modified Condition/Decision Coverage to software testing," Software Engineering Journal, September 1994, pp. 193–200.
[5] S. Vilkomir and J. Bowen, "From MC/DC to RC/DC: Formalization and Analysis of Control-Flow Testing Criteria," Formal Aspects of Computing, vol. 18, no. 1, March 2006, pp. 42–62.

[17]

[18] [19]

[20]

1083

Y. Yu, M. Lau, and T. Chen, “Automatic generation of test cases from Boolean specifications using the MUMCUT strategy,” Journal of Systems and Software, vol. 79, no. 6, June 2006, pp. 820-840. E. Weyuker, T. Goradia, and A. Singh, “Automatically generating test data from a Boolean specification,” IEEE Transactions on Software Engineering, vol. 20, no. 5, 1994, pp. 353–363. P. Frankl and E. Weyuker, “A formal analysis of the fault detecting ability of testing methods,” IEEE Transactions on Software Engineering, vol. 19, no. 3, 1993, pp. 202–213. M. Lau and Y. Yu, “On Comparing Testing Criteria for Logical Decisions,” Proc. of the 14th Ada-Europe International Conference, Brest, France, June 8-12, 2009, LNCS, vol. 5570, 2009, pp. 44–58. S. Vilkomir, K. Kapoor and J. Bowen, “Tolerance of Control-Flow Testing Criteria,” Proc. of 27th IEEE Annual International Computer Software and Applications Conference (COMPSAC 2003), Dallas, Texas, USA, 3-6 November 2003. IEEE Computer Society Press, 2003, pp. 182-187. N. Kobayashi, T. Tsuchiya, and T. Kikuno, “Non-specification-based approaches to logic testing for software,” Information and Software Technology, vol. 44, no. 2, 2002, pp. 113–121. S. Vilkomir and J. Bowen, “Formalization of software testing criteria using the Z notation,” Proc. of COMPSAC 2001: 25th IEEE Annual International Computer Software and Applications Conference, Chicago, Illinois, USA, 8--12 October 2001. IEEE Computer Society Press, 2001, pp. 351 - 356. D. Kuhn, “Fault classes and error detection capability of specificationbased testing,” ACM Transactions on Software Engineering and Methodology, vol. 8, no. 4, 1999, pp. 411-424. J. Bradbury, J. Cordy, and J. Dingel, “An empirical framework for comparing effectiveness of testing and property-based formal analysis,” SIGSOFT Softw. Eng. Notes, vol. 31, no. 1, Jan. 2006, pp. 2-5. A. Offutt, “An integrated automatic test data generation system,” Journal of Systems Integration, vol. 1, no. 3, November 1991, pp. 391-409. S. Sprenkle, L. Pollock, H. Esquivel, B. Hazelwood, and S. Ecott, “Automated Oracle Comparators for TestingWeb Applications,” In Proc. of the the 18th IEEE international Symposium on Software Reliability (November 05 - 09, 2007). ISSRE. IEEE Computer Society, Washington, DC, pp. 117-126. N. Leveson, M. Heimdahl, H. Hildreth, and J. Reese, “Requirements specification for process-control systems”, Technical Report, Department of Information and Computer Science, University of California, Irvine, 1992, pp. 92-106. J. Bach, “ALLPAIRS Test Case Generation Tool (Version 1.2.1)”, http://www.satisfice.com/tools/pairs.zip Accessed on July 5, 2010. A. Williams, J. Lo, and A. Lareau, “TConfig,” http://www.site.uottawa.ca/~awilliam/TConfig.jar Accessed on July 5, 2010. W. Ballance, W. Jenkins, and S. Vilkomir, “Probabilistic Assessment of Effectiveness of Software Testing for Safety-Critical Systems,” Proc. of the 10th International Probabilistic Safety Assessment & Management Conference (PSAM 10), 7-11 June 2010, Seattle, Washington, USA