Error Trapping and Metamorphic Testing for Spreadsheet Failure Detection
ABSTRACT

This study deepens the research on error trapping (ET) and metamorphic testing (MT) for detecting spreadsheet failures. It is observed that some spreadsheet developers and testers confuse ET with MT, because the two techniques are similar to each other in some respects. Inspired by this observation, this paper first outlines the main concepts of ET and MT, using several examples for illustration. This is followed by a discussion of an experiment that investigates and compares the failure detection capabilities of the two techniques. The results of the experiment indicate that neither technique is sufficient on its own for spreadsheet testing. Rather, ET and MT are complementary, and they should be used together in spreadsheet testing whenever possible.

Keywords: Metamorphic relation, oracle problem, spreadsheet testing, test oracle
INTRODUCTION

Nowadays, spreadsheet-based systems (or simply spreadsheets) have become an indispensable tool in various business areas such as accounting and financial reporting, asset recording, production scheduling, and engineering design (Lu et al., 1991; McDaid & Rust, 2009). Spreadsheets are often used to generate important information for managers and executives in making strategic decisions (Caulkins et al., 2007; Kruck et al., 2003). Although spreadsheet applications are extremely popular, numerous spreadsheet failures [1] have been reported in the literature (Bishop & McDaid, 2011; Leon et al., 2015; Morrison et al., 2002; Panko, 1998, 1999, 2007; Panko & Aurigemma, 2010). Some studies further substantiate the widespread growth of faulty spreadsheets. For example, Panko & Ordway (2005) found that nearly all (94%) of the spreadsheets they examined contained faults. A possible reason contributing to the large number of faulty spreadsheets is that their development has shifted from IT professionals to "nontechnical" departmental end users such as accounting or marketing staff (hereafter referred to as end-user programmers); many of the latter have little formal training in software development and testing (Powell et al., 2008, 2009). According to the European Spreadsheet Risks Interest Group (EuSpRIG), faulty spreadsheets could result in various business risks including: (a) loss in revenue, profit, cash, assets, and tax; (b) mispricing and poor decision-making; and (c) financial failure. In view of the above risks, several systematic techniques have been developed for dynamic testing (that is, testing involving software execution) of spreadsheets. These techniques include the constraint-based spreadsheet testing method (Abraham & Erwig, 2006), the "What You See Is What You Test (WYSIWYT)" methodology (Fisher et al., 2006), error trapping (ET) (Jain, 2010; Powell & Baker, 2009, pp. 115–116), and metamorphic testing (MT) (Chen et al., 2003; Liu et al., 2014; Poon et al., 2014).
Among these techniques, the constraint-based spreadsheet testing method requires formal training in software engineering, and the WYSIWYT methodology requires some technical knowledge of data-flow adequacy criteria (Jee et al., 2009) and coverage monitoring (Vilkomir et al., 2003). These requirements pose difficulties to end-user programmers, who often do not possess such technical knowledge. On the other hand, ET and MT are more applicable to end-user programmers because both require less technical knowledge for their use [2]. Since this paper is concerned with dynamic spreadsheet testing from the end-user programmer's perspective, ET and MT are its main focus. In the course of our research into spreadsheet testing, we have observed that many people confuse ET with MT, because the two techniques are similar in some respects. Thus, this paper aims to help readers understand the two concepts. It also investigates their failure detection effectiveness and determines whether they are complements or substitutes. The paper first compares the two techniques analytically, focusing on their similarities, differences, and relationship. The analytical comparison is followed by a discussion of an experiment that investigates their effectiveness in detecting spreadsheet failures. Thereafter, practical guidance on spreadsheet testing using ET and MT is provided.
DYNAMIC AND STATIC TESTING

Testing is a verification and validation technique, and is often categorized as being either dynamic or static. Dynamic testing involves executing the software system with test data, and then checking the output and the operational behavior of the software (Sommerville, 2011). In the checking process, if the actual output (or operational behavior) is found to be different from the expected output (or operational behavior), a failure is revealed, indicating that one or more faults exist in the software. Static testing (also known as human testing), however, does not involve software execution. Reviews, inspections, and audits are examples of static testing (Myers, 2004). In organizations, program code is often tested in the preliminary construction and final construction phases of systems development (Everett & McLeod, 2007, pp. 35–37). In the preliminary construction phase, code is statically tested (for example, in the form of code inspections or code walkthroughs). Myers (2004, p. 21) argues that, although not all testers read code, code review (a form of static testing) is widely accepted because this practice is quite effective in finding errors [3]. Thus, static testing should be performed between the time the program is coded and the time when dynamic testing commences. Dynamic testing is performed in both the preliminary construction and final construction phases in various forms such as function, performance, and load tests (Everett & McLeod, 2007, pp. 35–37). (Our paper focuses on functional testing, which attempts to identify discrepancies between the software system and its expected behavior from the end-user's point of view (Myers, 2004, p. 129).) Although organizations typically spend about 25–50% of the total
software development cost on testing (Watkins & Mills, 2011, p. 43), they rarely mandate that spreadsheets be tested after development (Panko, 2006b).
ET

Despite best efforts, wrong input data and formula faults may occasionally creep into a spreadsheet. These "upstream" problems may cause subsequent problems in "downstream" cells that depend on the "upstream" input data or computation results. ET can be used to detect these problems and to minimize their adverse impact on spreadsheet users. More specifically, ET allows end-user programmers to determine what happens in the event of an unintended run-time error, to prevent loss of recent changes to a spreadsheet, and to prevent the spreadsheet from refusing to function at all. ET often takes the form of conditional checks using built-in commands such as IF, IFERROR, ISERROR, and ISNA provided by the spreadsheet, or the ON ERROR command provided by Microsoft Visual Basic (VB) for more advanced error-handling procedures. Below are two examples, which involve a formula fault and wrong input data, respectively.

Example 1 (division by zero)

Consider the partial spreadsheet in Table 1, whose column D includes formulae to calculate the percentage changes in sales since last year for some vehicle types. Suppose the formula used in cell D4 is "=(C4-B4)/B4". Because cell B4 contains zero, the formula in cell D4 involves division by zero, resulting in the display of the error message "#DIV/0!". This situation will undoubtedly cause confusion and frustration to the user. One way to solve the problem is to replace the original formulae in column D with other formulae involving the ISERROR command, such as "=IF(ISERROR((C4-B4)/B4), "No sales last year", (C4-B4)/B4)" in cell D4. After this replacement, the ISERROR function will return a TRUE value, and the spreadsheet will display the message "No sales last year" (rather than "#DIV/0!") to alert the user about the error.
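For readers who prefer to see the trap outside a spreadsheet, the following minimal Python sketch mirrors the guarded formula in cell D4. The function name percentage_change and the sample figures taken from Table 1 are used purely for illustration and are not part of the original spreadsheet.

```python
def percentage_change(last_year, this_year):
    """Mirror of "=IF(ISERROR((C4-B4)/B4), "No sales last year", (C4-B4)/B4)"."""
    try:
        return (this_year - last_year) / last_year
    except ZeroDivisionError:
        # The trapped error is replaced by a human-readable message,
        # just as ISERROR suppresses #DIV/0! in the spreadsheet.
        return "No sales last year"

print(percentage_change(150_000, 180_000))  # 0.2, i.e., 20%
print(percentage_change(0, 30_000))         # "No sales last year"
```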
Example 2 (unbalanced balance sheet)

Consider the spreadsheet in Table 2, corresponding to a balance sheet at the end of a financial year. In accounting, the total assets (corresponding to cell B12, which is the sum of the current assets and the fixed assets) must be equal to the sum of the total liabilities (corresponding to cell D10, which is the sum of the current liabilities and the long-term liabilities) and the total equity (corresponding to cell F5). Otherwise, some accounting entries have been entered incorrectly or omitted. To check this equality, the formula "=B12-D10-F5" in cell F7 is used. In Table 2, cell F7 contains a non-zero value, indicating that the balance sheet is unbalanced due to incorrect and/or missing accounting entries.
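A corresponding Python sketch of the reconciliation check in cell F7, again purely illustrative, shows how the single ET property of Example 2 can be evaluated programmatically; the function name and the use of the Table 2 totals are our own assumptions.

```python
def reconciliation(total_assets, total_liabilities, total_equity):
    # Mirrors the ET formula "=B12-D10-F5": a zero result means the sheet balances.
    return total_assets - total_liabilities - total_equity

# Totals taken from Table 2; a non-zero difference flags wrong or missing entries.
difference = reconciliation(1_482_000, 114_500, 1_300_000)
if difference != 0:
    print(f"Balance sheet is unbalanced by ${difference:,}")  # prints $67,500
```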
MT

The fundamental concept of MT is to use some specific properties of the problem to be implemented to form corresponding metamorphic relations (MRs). These MRs are then used to generate new test cases for testing and to verify the test results. If the results across multiple software executions violate any MR, then the software system under test is faulty. MT was originally developed by Chen et al. (2003) as a "general" testing technique. Recently, Poon et al. (2014) have introduced MT for spreadsheet testing. MT is particularly useful in testing when a test oracle (or simply an oracle) does not exist. Here, an oracle refers to a procedure or a mechanism by which the tester can verify the correctness of the system output (Chen et al., 2003). Thus, the oracle is of utmost importance in testing. In testing, the oracle problem is said to occur when: (Scenario 1) an oracle does not exist, or (Scenario 2) an oracle exists but it is infeasible to apply, possibly due to resource constraints. Many spreadsheet researchers and practitioners have reported the frequent occurrence and severity of the oracle problem in spreadsheet testing (Grossman & Özlük, 2010; Panko, 1998, 1999, 2006a; Panko & Aurigemma, 2010; Pryor, 2004):

"Such independent calculations [an oracle] are, in my experience, rarely available and so full [spreadsheet] system testing is not often performed." (Pryor, 2004)

"In most cases, there was no comparable calculation [an oracle] before spreadsheets … In complex spreadsheets, then, there usually is no oracle other than the spreadsheet calculations, which may not be correct … This lack of a readily-found oracle is the most serious problem in spreadsheet execution testing. Without a strong and easy-to-apply oracle, execution testing simply makes no sense for error-reduction testing." (Panko, 2006a)

"In the event that the correct outputs [the oracle] are not knowable, testing is of little value. For this reason software testing is not applicable to a large class of scientific and business models, including large financial planning models, because the only calculation of the outputs is computed by the software [spreadsheet] being tested." (Grossman & Özlük, 2010)

Example 3 below illustrates the oracle problem in spreadsheet testing and how MT can be used to alleviate this problem.

Example 3 (total annual sales commission)

Consider the spreadsheet in Table 3 used by a vehicle retailer, who has 100 different types of vehicles for sale. Details of these vehicle types are stored in rows 2 to 101 of the spreadsheet. Cells B2–B101 and C2–C101 store the sales amounts for the two periods January–June and July–December, respectively, in the current year. Similarly, cells D2–D101 and E2–E101 store the commission rates for the two periods, respectively. In addition, cell
E103 contains the array formula "=SUMPRODUCT(B2:C101,D2:E101)" [4] to calculate the total annual sales commission. To verify the correctness of the computation result in cell E103, one could use a hand-held calculator to manually compute the "expected" result, which is then compared with the "actual" result in cell E103. This manual computation of the "expected" result, however, is tedious, time-consuming, and prone to human error. This causes the oracle problem (Scenario 2) in testing the spreadsheet. MT makes use of some properties associated with the sales commission scheme to form MRs for testing. Four examples of these MRs are as follows:

(MR1)
If every row i (where 2 ≤ i ≤ 101) is rotated "forward" (or "backward") by C (where C ≥ 1) rows, the total annual sales commission (cell E103) will remain unchanged.
(MR2)
Suppose there exists a set of vehicle types {v1, v2, …, vn} (where n ≥ 2) such that these vehicle types have the same commission rate in the period January–June. Suppose, further, that the respective sales amounts of these vehicle types are {s1, s2, …, sn}. If these sales amounts are changed to {s1', s2', …, sn'} such that (s1' + s2' + … + sn') = (s1 + s2 + … + sn), the total annual sales commission (cell E103) will remain unchanged.
(MR3)
Select any two vehicle types {v1, v2} whose associated sales amounts and commission rates in the period July–December are {s1, s2} and {r1, r2}, respectively, such that r1 > r2. If we change the sales amounts of these two vehicle types to {s1 + s2, 0}, the total annual sales commission will increase by $(s2 × (r1 - r2)) [5].
(MR4)
If all the commission rates in the two periods (cells D2–D101 and E2–E101) are multiplied by a constant C, the total annual sales commission (cell E103) will also be multiplied by C.
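To make the idea of an MR concrete, the sketch below encodes MR4 as an executable check over a simple Python model of the commission calculation in cell E103. The function names, the tolerance, and the use of only two sample rows are illustrative assumptions rather than part of the spreadsheet itself.

```python
def total_commission(rows):
    # Model of the correct array formula in cell E103; each row is
    # (sales_jan_jun, sales_jul_dec, rate_jan_jun, rate_jul_dec).
    return sum(s1 * r1 + s2 * r2 for s1, s2, r1, r2 in rows)

def mr4_holds(rows, c):
    # MR4: multiplying every commission rate by a constant C should
    # multiply the total annual sales commission by C as well.
    source_output = total_commission(rows)
    follow_up_rows = [(s1, s2, r1 * c, r2 * c) for s1, s2, r1, r2 in rows]
    follow_up_output = total_commission(follow_up_rows)
    return abs(follow_up_output - c * source_output) < 1e-6

# Two sample rows from Table 3 (Toyota Corolla and Honda Accord).
rows = [(200_000, 185_000, 0.06, 0.07), (80_000, 60_000, 0.05, 0.06)]
print(mr4_holds(rows, 2))  # expected: True for a correct implementation
```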
Note that the effectiveness of an MR in detecting spreadsheet failures depends on whether or not it is defined in a "fine grain" manner. Consider, for instance, replacing MR3 above with another, less "fine grain" MR5 as follows:

(MR5)
Select any two vehicle types {v1, v2} whose associated sales amounts and commission rates in the period July–December are {s1, s2} and {r1, r2}, respectively, such that r1 > r2. If we change the sales amounts of these two vehicle types to {s1 + s2, 0}, the total annual sales commission will increase.
Obviously, the effectiveness of MR5 in detecting spreadsheet failures will be lower than that of MR3. In what follows, MR1 and MR2 are used to illustrate how new test cases (which are combinations of input data) are generated from existing ones to detect spreadsheet failures. Suppose cell E103 of Table 3 is wrongly coded as "=SUMPRODUCT(B2:C100,D2:E100)", that is, row 101 has been carelessly omitted from the formula. In this case, the computation result in cell E103 would be "$1,516,950" (= $1,522,000 - $45,000 × 0.05 - $40,000 × 0.07), which is wrong (see Table 4). However, because of the oracle problem, it is very difficult for the end-user programmer to verify whether or not the value "$1,516,950" is correct.

First, consider MR1. Suppose every row between rows 2 and 101 is rotated "forward" by one row (that is, C in MR1 is equal to one). In other words, the rows are rotated such that row 2 → row 3, row 3 → row 4, …, row 100 → row 101, and row 101 → row 2. In this case, with respect to MR1, the spreadsheet should be executed twice. Suppose that, in the first execution, the end-user programmer entered the test case into the spreadsheet as shown in Table 4, and obtained "$1,516,950" in cell E103 (recall that the wrong formula "=SUMPRODUCT(B2:C100,D2:E100)" has been coded). In the second execution, with respect to MR1, the end-user programmer should enter the test case as shown in Table 5, by rearranging the rows between rows 2 and 101. After rearranging, cell E103 of Table 5 will decrease to $1,514,400 (= $1,516,950 + $45,000 × 0.05 + $40,000 × 0.07 - $90,000 × 0.04 - $80,000 × 0.05). The decrease in the total annual sales commission as computed in cell E103 of Table 5 violates MR1 (which states that the total annual commission should remain unchanged), indicating that the array formula in cell E103 is wrong. Note that detecting this fault does not require the end-user programmer to know whether or not the value "$1,516,950" in cell E103 of Table 4 and the value "$1,514,400" in the same cell of Table 5 are correct. In MT, the test cases used in the first execution (such as the input data in Table 4) are called source test cases, and the subsequent test cases generated in accordance with an MR (such as the input data in Table 5) are called follow-up test cases.

Next, consider MR2. Suppose that, between January and June, only Honda Accord and Volvo XC60 have the same commission rate of 0.05 (refer to Table 4). Thus, in MR2, n = 2, and v1 and v2 correspond to Honda Accord and Volvo XC60, respectively. In this case, the source test case is given in Table 4 with s1 = $80,000 and s2 = $45,000, while a follow-up test case is given in Table 6 with s1' = $125,000 and s2' = $0. Suppose that the source test case is first executed with the wrong array formula, as represented by Table 4. After making the change to the spreadsheet in Table 4 in accordance with MR2 (shown in Table 6), the second execution result in cell E103 will increase from "$1,516,950" to "$1,519,200" (= $1,516,950 + $45,000 × 0.05). (Note that "0.05" is the commission rate for Honda Accord in the period January–June.) This is because the sales commission for Volvo XC60 in the period January–June (= $45,000 × 0.05 = $2,250) had not been included in computing the total annual sales commission (= $1,516,950) in the first execution, due to the omission of row 101 from the array formula in cell E103. This amount of $2,250 (= $45,000 × 0.05), however, is now included in computing the total annual sales commission in the second execution. The increase in the total annual sales commission violates MR2, thereby revealing the wrong array formula in cell E103.

Example 3 shows that MT checks whether or not each identified MR holds among multiple executions, rather than focusing on the correctness of the outputs from individual executions (which requires the expected output values to be known). It is this characteristic of MT that makes the technique applicable to test software systems with
the oracle problem. Although Example 3 illustrates the use of MT for testing when the oracle problem exists, readers are cautioned that MT is also applicable in the absence of the oracle problem.
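The same check can be automated end to end. The following Python sketch, which is ours and not part of the paper's experiment, reproduces the MR1 scenario on a deliberately faulty implementation that omits the last row (mirroring the omission of row 101 from the array formula); a violation of MR1 is detected without ever knowing the correct total.

```python
def faulty_total_commission(rows):
    # Mirrors "=SUMPRODUCT(B2:C100,D2:E100)": the last row is mistakenly omitted.
    return sum(s1 * r1 + s2 * r2 for s1, s2, r1, r2 in rows[:-1])

def rotate_forward(rows, c=1):
    # Builds the follow-up test case for MR1: rotate the rows forward by c
    # positions, so the last c rows move to the top (as in Table 5).
    return rows[-c:] + rows[:-c]

# A small stand-in for rows 2-101 of Table 4 (values taken from the table).
rows = [
    (200_000, 185_000, 0.06, 0.07),  # Toyota Corolla
    (80_000, 60_000, 0.05, 0.06),    # Honda Accord
    (90_000, 80_000, 0.04, 0.05),    # Ford Focus
    (45_000, 40_000, 0.05, 0.07),    # Volvo XC60
]

source_output = faulty_total_commission(rows)
follow_up_output = faulty_total_commission(rotate_forward(rows))
# MR1 requires the two outputs to be equal; the difference below exposes the
# fault even though the "correct" total is never computed.
print(source_output == follow_up_output)  # expected: False for the faulty formula
```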
ERROR TRAPPING VERSUS METAMORPHIC TESTING

As observed in spreadsheet testing research, people may confuse ET with MT. This section analyzes their similarities, differences, and complementary relationship.

Similarities

ET and MT are similar to each other in the following aspects:
a) Both techniques use some properties of the application (or application domain) for testing. Examples of these properties are the equality between the total assets and the sum of the total liabilities and the total equity in Example 2 (in the rest of the paper, a property of the application used for ET is called an ET property), and the properties associated with MR1 to MR5 in Example 3. This characteristic implies that the effectiveness of both techniques in detecting spreadsheet failures largely depends on the knowledge and expertise of the end-user programmer about the application implemented by the spreadsheet, because ET properties and MRs are often defined by the end-user programmer.
b) Checking the fulfillment of ET properties in ET and MRs in MT can be largely automated and, hence, a large amount of testing resources can be saved.
c)
MT can be used for testing spreadsheets even when the oracle problem exists. This situation also applies to ET, provided that ET properties associated with one single spreadsheet execution can be identified (such as the equality between the total assets and the sum of the total liabilities and the total equity in Example 2).
Differences

Despite their similarities, ET and MT differ from each other in the following aspects:
a) ET attempts to warn the user about potential errors by checking for violations to ET properties within one single spreadsheet execution. On the other hand, MT checks for violations to MRs by comparing the output results between multiple spreadsheet executions. As such, MT has a higher demand for testing resources than ET.
b) ET often requires the inclusion of additional formulae in the spreadsheet for checking, such as the formula "=IF(ISERROR((C4-B4)/B4), "No sales last year", (C4-B4)/B4)" in cell D4 in Example 1 and the formula "=B12-D10-F5" in cell F7 in Example 2. This requirement, however, is not mandatory in MT [6]. For
instance, Example 3 shows that no additional formula needs to be incorporated into the spreadsheet when using MT. c)
There exist some testing scenarios in which no ET property can be identified. On the other hand, this is almost never a problem in MT [7]. Consider, for instance, Example 3, which shows how MRs can be identified to test the spreadsheet in Table 4 across multiple executions. No ET property, however, exists in this spreadsheet, rendering ET inapplicable.
Complementary Relationship

Both ET and MT have their own strengths and weaknesses and, hence, should be used together whenever possible. Consider, for instance, the unbalanced balance sheet in Example 2. This example shows how ET can be used to detect wrong or missing accounting entries by checking the equality between the total assets and the sum of the total liabilities and the total equity in one spreadsheet execution. This checking, however, is irrelevant to MT, which involves multiple spreadsheet executions. On the other hand, while MT is effective in detecting, for instance, the array formula fault ("=SUMPRODUCT(B2:C100,D2:E100)") in cell E103 in Example 3, ET is unable to reveal this fault because of the absence of any ET property.

When to Use ET and MT?

There are two major factors affecting the decision of using ET and MT for testing spreadsheets: (a) the amount of available testing resources, and (b) the existence of ET properties and MRs in the spreadsheet under test. With respect to factor (a), if testing resources are available, both ET and MT should be used together because they are complementary to each other: some faults can be detected by one of them but not the other (as explained in the previous Subsection "Complementary Relationship" and further confirmed by the experimental results discussed in the next section). On the other hand, if testing resources are tight, ET is preferable to MT with a view to saving testing cost, because MT has a higher demand for testing resources (see item (a) in the Subsection "Differences" of this section). With respect to factor (b), if ET properties cannot be identified from the spreadsheet under test, then ET is not applicable and MT should be used instead (see item (c) in the Subsection "Differences" of this section).
EXPERIMENT

Many software researchers and practitioners, such as Meyer (2008), argue that a successful test is only relevant to quality assessment if it causes a software failure. This is because a failure will alert the end-user programmer about the existence of spreadsheet faults and, hopefully, lead to their subsequent removal. In view of their
argument, this section evaluates and compares the effectiveness of ET and MT in terms of their failure detection capabilities.

Subject Spreadsheets and Participants

The experiment involved five spreadsheets (denoted by S1, S2, …, S5, respectively) with real faults inadvertently introduced by their developers. These spreadsheets were obtained from the EUSES Spreadsheet Corpus, which contains over 4,400 "real-world" spreadsheets (Fisher & Rothermel, 2005). We chose these five subject spreadsheets because each was related to a different application domain and, together, they covered the following six common and major spreadsheet fault types [8] according to Ronen's classification scheme (Ronen et al., 1989):
Mistakes in logic (F1)
Incorrect ranges in formulae (F2)
Incorrect cell references (F3)
Incorrectly copied formulae (F4)
Accidentally overwritten formulae (F5)
Misuse of built-in functions (F6)
The paper by Ronen et al. (1989) is commonly regarded as a classical paper on spreadsheet analysis and design, and has been widely cited in the spreadsheet literature (for example, Panko, 1998; Tukiainen, 2000) [9]. Table 7 gives some details about the subject spreadsheets, including the types of faults that each contains. Five participants, who were non-IT graduates and had not been formally trained as IT professionals, were recruited for the experiment. However, all of them had practical experience in spreadsheet development as part of their job duties. To help the participants become familiar with ET and MT, they were offered a one-hour training session before the experiment commenced, during which hands-on exercises on both techniques were given.

Procedures and Results of Experiment

Each participant spent about an hour, alone, identifying as many ET properties and MRs as possible for each subject spreadsheet. We observed that participants spent about the same time on identifying ET properties and MRs. After tallying, the numbers of distinct ET properties and distinct MRs identified by the participants for the subject spreadsheets are shown in Table 8. The mean numbers of distinct ET properties and distinct MRs identified by each participant were 5.6 (= 28/5) and 8.6 (= 43/5), respectively. In other words, on average, each participant identified about 54% more MRs than ET properties for a spreadsheet. One plausible explanation for this large difference is that the number of ET properties associated with some spreadsheet
applications, for example, S2 in Table 8, is very small. Readers may refer to difference (c) discussed in the Section "Error Trapping versus Metamorphic Testing". For each identified ET property, five test cases were randomly generated to test the corresponding spreadsheet, with a view to checking whether or not this ET property was violated in one single spreadsheet execution. Similarly, for each identified MR, five source test cases were randomly generated. Then, for each such source test case, a follow-up test case was generated in accordance with the respective MR. Each source test case and its corresponding follow-up test case is called a metamorphic test pair (or simply a test pair). Thus, each identified MR was associated with five test pairs. Each test pair involved two spreadsheet executions (the first execution involved the source test case and the second execution involved the follow-up test case) to determine whether or not an MR was violated. Note that a violation to an ET property or an MR indicates that the spreadsheet is faulty. For ease of discussion, we call all the test cases associated with each ET property a test set. Similarly, we also call all the test pairs associated with each MR a test set. Furthermore, we use the notation EPm.n (or MRm.n) to denote the nth ET property (or the nth MR) for spreadsheet Sm, where 1 ≤ m ≤ 5 and n ≥ 1. The results of the experiment are shown in Tables 9 and 10. Consider ET in Table 9 first. We found that, for fault types F3 to F6, each type was detected by the test set of at least one ET property. For example, F5 was detected by the test sets of EP1.3, EP1.4, and EP1.5. Considering the five spreadsheets together, an average of about 71% of the ET properties and about 63% of the test cases revealed a spreadsheet failure. Next, consider MT in Table 10. Each fault type was detected by the test set of at least one MR. Considering all the spreadsheets together, an average of about 33% of the MRs and about 24% of the test pairs uncovered a spreadsheet failure. At first glance, these data may indicate that ET is superior to MT in detecting spreadsheet failures. We caution the readers, however, that the data also show that MT is more effective than ET in some respects. More specifically, two (F1 and F2) of the six fault types were not detected by ET (see Table 9), but all six fault types were detected by MT (see Table 10). Also, for those spreadsheets where ET properties are infeasible or difficult to identify (see, for example, spreadsheet S2 in Table 8), it is still feasible and relatively easier to identify MRs from these spreadsheets (see note 7). In this regard, MT has a wider applicability than ET. The above echoes our earlier discussion in the Subsection "Complementary Relationship" that neither technique is better than the other in all aspects and, hence, they are complementary.
THREATS TO VALIDITY

There are two main threats to validity in the experiment. Firstly, the experiment involved only five participants. Although it would be desirable to have more participants involved, it is not easy to find a large group of people who possess the following characteristics of an end-user programmer: (a) being non-IT graduates, (b) having no formal IT training before commencing the experiment, and (c) having practical experience in spreadsheet development as part of their job duties. We would, however, like to point
out that these five participants generated a total of 28 distinct ET properties and 43 distinct MRs from the five subject spreadsheets (see Table 8). As mentioned in the Subsection "Procedures and Results of Experiment": (i) for each distinct ET property, five test cases were randomly generated, and (ii) for each distinct MR, five test pairs were generated (with each pair involving two test cases). In other words, the experiment involved a total of 570 (= 28 × 5 + 43 × 5 × 2) test cases, with each test case involving one spreadsheet execution, to determine the fault detection effectiveness of ET and MT. With respect to the objective of determining the fault detection effectiveness of the test cases generated by the two techniques, the number of distinct test cases is much more important than the number of participants. Thus, to a large extent, we believe that the experimental results are valid and reliable. Secondly, the experiment involved only five subject spreadsheets. However, their faults are "real" ones that were inadvertently introduced by the developers during development, rather than artificially seeded into the spreadsheets for the sake of the experiment. This arrangement posed difficulties in selecting the subject spreadsheets for the experiment. Nevertheless, as mentioned in the Subsection "Subject Spreadsheets and Participants", these five spreadsheets already cover all six common and major spreadsheet fault types according to Ronen's classification scheme (Ronen et al., 1989). Even with the above two threats to validity, the experiment still provides some initial, useful insights into the failure detection effectiveness of ET and MT.
GUIDANCE ON USING ET AND MT

Based on the above analytical and experimental comparisons between ET and MT, and on our experience with both techniques, the following practical steps are recommended for using ET and MT:
1) Determine the amount of testing resources available and use it to decide the maximum number of test cases (N) that the tester can afford for testing.
2) With reference to the specification document and the application domain of the spreadsheet under test, identify the properties associated with this spreadsheet. Those properties related to one single spreadsheet execution are ET properties, whereas those related to multiple executions are further used to define their corresponding MRs.
3) Based on the tester's experience and judgement regarding the following three factors, decide the numbers of test cases to be generated for ET (A) and for MT (B) (note that N ≥ A + B):
With respect to the saving of testing resources (that is, a smaller value of (A + B)), focus more on ET than on MT when generating test cases. This is because only one test case is needed for testing each ET property. In contrast, at least two test cases are needed for testing each MR.
With respect to the potential fault detection effectiveness, focus more on MT than on ET when generating test cases. This is because, according to the experimental results, MT was able to detect all six fault types (F1 to F6), whereas ET was able to detect only four fault types (F3 to F6).
For those spreadsheets with only a few ET properties, test case generation should be based primarily on MT.
RELATED WORK

In our experiment, all the participants identified MRs for the five subject spreadsheets in an ad hoc manner. Recently, Chen et al. (2016) have developed a systematic methodology, known as METRIC, for identifying MRs from specifications, together with an automated tool to support the METRIC methodology. As compared with ET and MT, the constraint-based spreadsheet testing method (Abraham & Erwig, 2006) and the "What You See Is What You Test (WYSIWYT)" methodology (Fisher et al., 2006) are less amenable to automation by end-user programmers; they do not focus on the intrinsic properties of the spreadsheet application, and they cannot be applied when the oracle problem occurs. Relatively speaking, more research work has been done on static testing (such as reviews, inspections, and audits) than on dynamic testing in the spreadsheet paradigm. For example, Panko (1999) investigated the effectiveness of individual and group inspections in detecting spreadsheet faults. He observed that group inspection found 80% of all faults, while individual inspection found only 63%. Morrison et al. (2002) introduced a code inspection approach that helps visualize the structure of a linked spreadsheet in order to detect linking errors (that is, incorrect references to spreadsheet cell values on separate work areas). Aurigemma & Panko (2014) investigated the performance of two popular spreadsheet static analysis programs, and found that these programs were very poor at detecting every category of natural spreadsheet errors. Here, we caution the readers that neither dynamic testing nor static testing is sufficient on its own; so, whenever possible, both should be used. Also, readers are reminded that the importance and contributions of dynamic testing are well known in the software testing community. Graham (1994) argued that dynamic testing is essential in the development of any software system, because we need this technique to assess what the system actually does, and how well it does it, in its final environment. Graham (1994) further argued that "a system without [dynamic] testing is merely a paper exercise [for example, inspection and review]; it may work or may not, but without [dynamic] testing, there is no way of knowing this before live use." Pressman (2010) also argued that dynamic testing is an unavoidable part of any responsible effort to develop a software system. For the above reasons, although static testing (such as inspection) can be very cost-effective in identifying defects, dynamic testing (such as ET and MT) must still be performed to provide assurance to the developers and the end users about the actual execution behavior of the software. Thus, many software researchers and
practitioners, such as those from the Jet Propulsion Laboratory of Caltech (Kandt, 2009) and from Hewlett-Packard (Franz & Shih, 1994), have used both static and dynamic testing.
SUMMARY AND CONCLUSION

The primary purpose of this paper is to help clear up the confusion between error trapping (ET) and metamorphic testing (MT). To this end, we first outlined the main concepts of the two techniques, followed by an analytical comparison between them. We explained that both techniques have their own strengths and weaknesses. In this regard, they are complementary and, hence, should be used together in spreadsheet testing whenever possible. Thereafter, we described an experiment to evaluate the failure detection capabilities of ET and MT. Two major observations of the experiment were: (a) ET was able to detect more failures than MT in terms of the percentage of ET properties and the percentage of test cases that revealed a spreadsheet failure, although it did not detect some fault types; and (b) MT, unlike ET, was able to detect all the fault types. These two observations, again, support our earlier argument that the two techniques are complementary to each other. Overall, this research provides a basis from which researchers can further investigate the issue of dynamic spreadsheet testing. This study is also beneficial to practitioners, especially end-user programmers, as it provides an in-depth discussion on how to use the two techniques.

Acknowledgements. This research was partially supported by an Australian Research Council grant (ARC Linkage Project No. LP100200208).
NOTES

1. According to the IEEE Standard Glossary, a failure is an "observed" malfunction of a program, which is caused by a fault in that program, which in turn is caused by a human mistake (or simply mistake). In their book, Senders & Moray (1991) refer to a failure, a fault, and a mistake collectively as an error.
2. ET is a very popular spreadsheet testing technique, and it often involves the IFERROR and ISERROR built-in functions provided by EXCEL. The popularity of ET in spreadsheet testing is reflected by the numerous references related to this technique (Jain, 2010; Powell & Baker, 2009, pp. 115–116; Wallis, 2015). MT has been receiving increasing attention in the software testing community (Barr et al., 2015; Gotlieb & Botella, 2003). MT has been successfully applied in several application domains such as bioinformatics (Chen et al., 2009), healthcare (Murphy et al., 2011), air traffic control (Hui et al., 2013), and machine learning (Xie et al., 2011). More recently, MT has been successfully applied to spreadsheet
testing (Poon et al., 2014; Singh, 2013).
3. Desk checking is a static testing technique in which code listings are visually examined, usually by the person who generated them, to identify problems such as software faults and violations of development standards (IEEE, 1990). Desk checking can be viewed as a "one-person inspection or walkthrough". This technique is argued to be relatively ineffective in detecting software faults because it is an undisciplined process and it lacks the synergistic effect of the inspection team (Myers, 2004, p. 40). Another, more formal static testing technique is code inspection, which involves a set of procedures and error-detection techniques for group code reading (Myers, 2004, pp. 24–26). An inspection team consists of a few people, two of whom are the moderator and the programmer. The moderator works like a quality-control engineer, and is responsible for ensuring that the team discussions proceed along productive lines and that the team members focus their attention on finding (but not correcting) errors. After the inspection session, the programmer then performs error correction.
4. The array formula "=SUMPRODUCT(B2:C101,D2:E101)" is equivalent in operation to the formula "=(B2 * D2 + C2 * E2) + (B3 * D3 + C3 * E3) + … + (B101 * D101 + C101 * E101)".
5. With respect to MR3, suppose that Table 4 represents the source test case used for the first execution. When generating a follow-up test case for the second execution, the tester may select Toyota Corolla as v1 and Honda Accord as v2 (note that r1 (0.07) > r2 (0.06)). For the follow-up test case so generated, the sales amounts for Toyota Corolla and Honda Accord in the period July–December are $245,000 (= $185,000 + $60,000) and $0, respectively. When comparing the computation results between the two executions, in accordance with MR3, the total annual sales commission (cell E103) will increase by $600 (= $60,000 × (0.07 - 0.06)).
6. In MT, the end-user programmer may occasionally add one or more formulae into a spreadsheet to automate the comparison between the outputs of source and follow-up test cases. This inclusion of additional formulae, however, is not a mandatory requirement. This is because, instead of adding extra formulae into a spreadsheet, other techniques such as API (application programming interface) reading/writing can be used to automate the comparison of outputs between multiple spreadsheet executions.
7. Metamorphic relations similar to MR4 in Example 3 can be identified for almost any spreadsheet.
8. Ronen et al. (1989) have identified a total of eight fault types for spreadsheets. In addition to the six types listed in the main text, the other two are "incorrect use of formats and column widths" and "confused range names". These two fault types do not affect the correctness of the computation results. Also, detecting these two fault types often does not involve spreadsheet execution (note that this paper
focuses on dynamic testing). Thus, their detection is better left to static testing such as reviews and inspections. In this regard, our experiment did not cover these two fault types.
9. There are several spreadsheet error taxonomies other than the one developed by Ronen et al. (1989). Examples include Galletta et al. (1997), Leon et al. (2015), Panko & Aurigemma (2010), Purser & Chadwick (2006), and Rajalingham et al. (2000). These other taxonomies, however, are mainly based on mistakes instead of faults (also see note 1 above). Thus, they are inapplicable to the experiment described in this paper.
Table 1. Division by zero

Row | A | B | C | D
1 | Vehicle Type | Annual Sales Amount: Last Year ($) | Annual Sales Amount: This Year ($) | % Change
2 | Toyota Corolla | 150,000 | 180,000 | 20%
3 | Honda Accord | 200,000 | 220,000 | 10%
4 | BMW 318 | 0 | 30,000 | #DIV/0!
: | : | : | : | :
Table 2. Unbalanced balance sheet at the end of a financial year

Row | A | B | C | D | E | F
1 | Current Assets | Amount ($) | Current Liabilities | Amount ($) | Equity | Amount ($)
2 | Cash | 24,000 | Accounts payable | 9,000 | Common stock | 1,000,000
3 | Accounts receivable net | 35,000 | Wages payable | 42,000 | Retained earnings | 300,000
4 | Inventory | 10,000 | Interest payable | 2,500 | |
5 | Supplies | 2,000 | Taxes payable | 6,000 | Total Equity | 1,300,000
6 | | | | | |
7 | Fixed Assets | | Long-term Liabilities | | Reconciliation | 67,500
8 | Plant & equipment | 11,000 | Bonds payable | 55,000 | |
9 | Land | 500,000 | | | |
10 | Buildings | 900,000 | Total Liabilities | 114,500 | |
11 | | | | | |
12 | Total Assets | 1,482,000 | | | |
Note: Cell B12 contains the formula “=SUM(B2:B5)+SUM(B8:B10)”. Cell D10 contains the formula “=SUM(D2:D5)+D8”. Cell F5 contains the formula “=F2+F3”. Cell F7 contains the formula “=B12-D10-F5”.
Table 3. Total annual sales commission with the correct array formula in cell E103

Row | A | B | C | D | E
1 | Vehicle Type | Sales Amount ($): Jan to Jun | Sales Amount ($): Jul to Dec | Commission Rate: Jan to Jun | Commission Rate: Jul to Dec
2 | Toyota Corolla | 200,000 | 185,000 | 0.06 | 0.07
3 | Honda Accord | 80,000 | 60,000 | 0.05 | 0.06
4 | BMW 318 | 120,000 | 75,000 | 0.10 | 0.15
: | : | : | : | : | :
100 | Ford Focus | 90,000 | 80,000 | 0.04 | 0.05
101 | Volvo XC60 | 45,000 | 40,000 | 0.05 | 0.07
102 | | | | |
103 | | | | Total Annual Sales Commission ($) | 1,522,000
Table 4. First execution: Total annual sales commission with a wrong array formula ("=SUMPRODUCT(B2:C100,D2:E100)") in cell E103

Row | A | B | C | D | E
1 | Vehicle Type | Sales Amount ($): Jan to Jun | Sales Amount ($): Jul to Dec | Commission Rate: Jan to Jun | Commission Rate: Jul to Dec
2 | Toyota Corolla | 200,000 | 185,000 | 0.06 | 0.07
3 | Honda Accord | 80,000 | 60,000 | 0.05 | 0.06
4 | BMW 318 | 120,000 | 75,000 | 0.10 | 0.15
: | : | : | : | : | :
100 | Ford Focus | 90,000 | 80,000 | 0.04 | 0.05
101 | Volvo XC60 | 45,000 | 40,000 | 0.05 | 0.07
102 | | | | |
103 | | | | Total Annual Sales Commission ($) | 1,516,950
Table 5. Second execution in accordance with MR1: Total annual sales commission with a wrong array formula ("=SUMPRODUCT(B2:C100,D2:E100)") in cell E103

Row | A | B | C | D | E
1 | Vehicle Type | Sales Amount ($): Jan to Jun | Sales Amount ($): Jul to Dec | Commission Rate: Jan to Jun | Commission Rate: Jul to Dec
2 | Volvo XC60 | 45,000 | 40,000 | 0.05 | 0.07
3 | Toyota Corolla | 200,000 | 185,000 | 0.06 | 0.07
4 | Honda Accord | 80,000 | 60,000 | 0.05 | 0.06
5 | BMW 318 | 120,000 | 75,000 | 0.10 | 0.15
: | : | : | : | : | :
101 | Ford Focus | 90,000 | 80,000 | 0.04 | 0.05
102 | | | | |
103 | | | | Total Annual Sales Commission ($) | 1,514,400
Table 6. Second execution in accordance with MR2: Total annual sales commission with a wrong array formula ("=SUMPRODUCT(B2:C100,D2:E100)") in cell E103

Row | A | B | C | D | E
1 | Vehicle Type | Sales Amount ($): Jan to Jun | Sales Amount ($): Jul to Dec | Commission Rate: Jan to Jun | Commission Rate: Jul to Dec
2 | Toyota Corolla | 200,000 | 185,000 | 0.06 | 0.07
3 | Honda Accord | 125,000 | 60,000 | 0.05 | 0.06
4 | BMW 318 | 120,000 | 75,000 | 0.10 | 0.15
: | : | : | : | : | :
100 | Ford Focus | 90,000 | 80,000 | 0.04 | 0.05
101 | Volvo XC60 | 0 | 40,000 | 0.05 | 0.07
102 | | | | |
103 | | | | Total Annual Sales Commission ($) | 1,519,200
Table 7. Subject spreadsheets with real faults

Spreadsheet | Application Domain | Types of Faults Contained †
S1 | School equipment management | F1 & F5
S2 | Stores management & stock control | F2
S3 | Air quality monitoring | F4
S4 | Database performance evaluation | F6
S5 | Household expense management & analysis | F3

† For each fault type, the number of faults contained in the respective spreadsheet is 1.
Table 8. Number of distinct ET properties and distinct MRs identified by the participants

Spreadsheet | Number of Distinct ET Properties | Number of Distinct MRs
S1 | 5 | 9
S2 | 1 | 11
S3 | 6 | 5
S4 | 9 | 5
S5 | 7 | 13
Total | 28 | 43
Table 9. Effectiveness of ET in failure detection (entries give the number of test cases, out of 5, associated with each ET property that revealed failures)

Spreadsheet | Fault type | Test cases revealing failures, per ET property | Mean number of ET properties that revealed failures | Mean number of test cases that revealed failures
S1 | F1 | EP1.1: 0, EP1.2: 0, EP1.3: 0, EP1.4: 0, EP1.5: 0 | 0.00 (= 0/5) | 0.00 (= 0/25)
S1 | F5 | EP1.1: 0, EP1.2: 0, EP1.3: 5, EP1.4: 5, EP1.5: 5 | 0.60 (= 3/5) | 0.60 (= 15/25)
S2 | F2 | EP2.1: 0 | 0.00 (= 0/1) | 0.00 (= 0/5)
S3 | F4 | EP3.1: 5, EP3.2: 5, EP3.3: 5, EP3.4: 3, EP3.5: 5, EP3.6: 2 | 1.00 (= 6/6) | 0.83 (= 25/30)
S4 | F6 | EP4.1: 5, EP4.2: 5, EP4.3: 5, EP4.4: 2, EP4.5: 5, EP4.6: 5, EP4.7: 0, EP4.8: 0, EP4.9: 5 | 0.78 (= 7/9) | 0.71 (= 32/45)
S5 | F3 | EP5.1: 0, EP5.2: 5, EP5.3: 5, EP5.4: 5, EP5.5: 1, EP5.6: 0, EP5.7: 0 | 0.57 (= 4/7) | 0.46 (= 16/35)
Average | | | 0.71 (= 20/28) | 0.63 (= 88/140)
Table 10. Effectiveness of MT in failure detection (entries give the number of test pairs, out of 5, associated with each MR that revealed failures)

Spreadsheet | Fault type | Test pairs revealing failures, per MR | Mean number of MRs that revealed failures | Mean number of test pairs that revealed failures
S1 | F1 | MR1.1: 0, MR1.2: 0, MR1.3: 0, MR1.4: 0, MR1.5: 0, MR1.6: 0, MR1.7: 2, MR1.8: 5, MR1.9: 5 | 0.33 (= 3/9) | 0.27 (= 12/45)
S1 | F5 | MR1.1: 1, MR1.2: 0, MR1.3: 2, MR1.4: 5, MR1.5: 0, MR1.6: 0, MR1.7: 0, MR1.8: 0, MR1.9: 0 | 0.33 (= 3/9) | 0.18 (= 8/45)
S2 | F2 | MR2.1: 2, MR2.2: 0, MR2.3: 0, MR2.4: 0, MR2.5: 0, MR2.6: 3, MR2.7: 0, MR2.8: 0, MR2.9: 0, MR2.10: 0, MR2.11: 0 | 0.18 (= 2/11) | 0.09 (= 5/55)
S3 | F4 | MR3.1: 5, MR3.2: 5, MR3.3: 2, MR3.4: 0, MR3.5: 0 | 0.60 (= 3/5) | 0.48 (= 12/25)
S4 | F6 | MR4.1: 0, MR4.2: 5, MR4.3: 0, MR4.4: 0, MR4.5: 0 | 0.20 (= 1/5) | 0.20 (= 5/25)
S5 | F3 | MR5.1: 0, MR5.2: 0, MR5.3: 5, MR5.4: 0, MR5.5: 0, MR5.6: 0, MR5.7: 0, MR5.8: 0, MR5.9: 5, MR5.10: 0, MR5.11: 0, MR5.12: 0, MR5.13: 0 | 0.15 (= 2/13) | 0.15 (= 10/65)
Average | | | 0.33 (= 14/43) | 0.24 (= 52/215)
REFERENCES

Abraham, R., & Erwig, M. (2006). AutoTest: A tool for automatic test case generation in spreadsheets. Proceedings of the IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC '06) (pp. 43–50). Los Alamitos, CA: IEEE Computer Society Press.
Aurigemma, S., & Panko, R. (2014). Evaluating the effectiveness of static analysis programs versus manual inspection in the detection of natural spreadsheet errors. Journal of Organizational and End User Computing, 26(1), 47–65.
Barr, E.T., Harman, M., McMinn, P., Shahbaz, M., & Yoo, S. (2015). The oracle problem in software testing: A survey. IEEE Transactions on Software Engineering, 41(4), 507–525.
Bishop, B., & McDaid, K. (2011). Expert and novice end-user spreadsheet debugging: A comparative study of performance and behaviour. Journal of Organizational and End User Computing, 23(2), 57–80.
Caulkins, J.P., Morrison, E.L., & Weidemann, T. (2007). Spreadsheet errors and decision making: Evidence from field interviews. Journal of Organizational and End User Computing, 19(3), 1–23.
Chen, T.Y., Ho, J.W.K., Liu, H., & Xie, X. (2009). An innovative approach for testing bioinformatics programs using metamorphic testing. BMC Bioinformatics, 10, doi:10.1186/1471-2105-10-24
Chen, T.Y., Poon, P.-L., & Xie, X. (2016). METRIC: METamorphic Relation Identification based on the Category-choice framework. Journal of Systems and Software, 116, 177–190.
Chen, T.Y., Tse, T.H., & Zhou, Z.Q. (2003). Fault-based testing without the need of oracles. Information and Software Technology, 45(1), 1–9.
Everett, G.D., & McLeod, R., Jr. (2007). Software testing: Testing across the entire software development life cycle. Hoboken, NJ: Wiley.
Fisher, M., II, & Rothermel, G. (2005). The EUSES spreadsheet corpus: A shared resource for supporting experimentation with spreadsheet dependability mechanisms. Proceedings of the 1st Workshop on End-User Software Engineering (pp. 1–5). New York: ACM Press.
Fisher, M., II, Rothermel, G., Brown, D., Cao, M., Cook, C., & Burnett, M. (2006). Integrating automated test generation into the WYSIWYT spreadsheet testing methodology. ACM Transactions on Software Engineering and Methodology, 15(2), 150–194.
Franz, L.A., & Shih, J.C. (1994). Estimating the value of inspections and early testing for software projects. Hewlett-Packard Journal, 45(6), 60–67.
Galletta, D.F., Hartzel, K.S., Johnston, S., & Joseph, J.L. (1997). Spreadsheet presentation and error detection: An experimental study. Journal of Management Information Systems, 13(3), 45–63.
Gotlieb, A., & Botella, B. (2003). Automated metamorphic testing. Proceedings of the 27th Annual International Computer Software and Applications Conference (COMPSAC 2003) (pp. 34–40). Los Alamitos, CA: IEEE Computer Society Press.
Graham, D.R. (1994). Testing. In J.J. Marciniak (Ed.), Encyclopedia of Software Engineering (pp. 1330–1353). New York: Wiley.
Grossman, T.A., & Özlük, Ö. (2010). Spreadsheets grow up: Three spreadsheet engineering methodologies for large financial planning models. Paper presented at the Annual Conference of the European Spreadsheet Risks Interest Group (EuSpRIG), London, UK.
Hui, Z., Huang, S., Ren, Z., & Yao, Y. (2013). Metamorphic testing integer overflow faults of mission critical program: A case study. Mathematical Problems in Engineering, article ID 381389. Retrieved December 7, 2015, from http://www.hindawi.com/journals/mpe/2013/381389/
Institute of Electrical & Electronics Engineers (IEEE). (1990). IEEE standard 610.12-1990: IEEE standard glossary of software engineering terminology. New York: IEEE.
Jain, A. (2010). Add error trap: Wrap your formulas with IFERROR or ISERROR. Excel Items. Retrieved November 26, 2015, from http://www.excelitems.com/2011/03/wrap-iferror-iserror-formulas-add.html
Jee, E., Yoo, J., Cha, S., & Bae, D. (2009). A data flow-based structural testing technique for FBD programs. Information and Software Technology, 51(7), 1131–1139.
Kandt, R.K. (2009). Experiences in improving flight software development processes. IEEE Software, 26(3), 58–64.
Kruck, S.E., Maher, J.J., & Barkhi, R. (2003). Framework for cognitive skill acquisition and spreadsheet training. Journal of Organizational and End User Computing, 15(1), 20–37.
Leon, L., Przasnyski, Z.H., & Seal, K.C. (2015). Introducing a taxonomy for classifying qualitative spreadsheet errors. Journal of Organizational and End User Computing, 27(1), 33–56.
Liu, H., Kuo, F.-C., Towey, D., & Chen, T.Y. (2014). How effectively does metamorphic testing alleviate the oracle problem? IEEE Transactions on Software Engineering, 40(1), 4–22.
Lu, M.-T., Litecky, C.R., & Lu, D.H. (1991). Application controls for spreadsheet development. Journal of Organizational and End User Computing, 3(1), 12–22.
McDaid, K., & Rust, A. (2009). Test-driven development for spreadsheet risk management. IEEE Software, 26(5), 31–36.
Meyer, B. (2008). Seven principles of software testing. IEEE Computer, 41(8), 99–101.
Morrison, M., Morrison, J., Melrose, J., & Wilson, E.V. (2002). A visual code inspection approach to reduce spreadsheet linking errors. Journal of Organizational and End User Computing, 14(3), 51–63.
Murphy, C., Raunak, M., King, A., Chen, S., Imbriano, C., Kaiser, G., Lee, I., Sokolsky, O., Clarke, L., & Osterweil, L. (2011). On effective testing of health care simulation software. Proceedings of the 3rd Workshop on Software Engineering in Health Care (SEHC 2011) (pp. 40–47). New York: ACM Press.
Myers, G.J. (2004). The art of software testing (2nd ed.). Hoboken, NJ: Wiley.
Panko, R.R. (1998). What we know about spreadsheet errors. Journal of End User Computing, 10(2), 15–21.
Panko, R.R. (1999). Applying code inspection to spreadsheet testing. Journal of Management Information Systems, 16(2), 159–176.
Panko, R.R. (2006a). Recommended practices for spreadsheet testing. Paper presented at the Annual Conference of the European Spreadsheet Risks Interest Group (EuSpRIG), Cambridge, UK.
Panko, R.R. (2006b). Spreadsheets and Sarbanes-Oxley: Regulations, risks, and control frameworks. Communications of the AIS, 17(1), article 29.
Panko, R.R. (2007). Two experiments in reducing overconfidence in spreadsheet development. Journal of Organizational and End User Computing, 19(1), 1–23.
Panko, R.R., & Aurigemma, S. (2010). Revising the Panko-Halverson taxonomy of spreadsheet errors. Decision Support Systems, 49(2), 235–244.
Panko, R.R., & Ordway, N. (2005). Sarbanes-Oxley: What about all the spreadsheets? Paper presented at the Annual Conference of the European Spreadsheet Risks Interest Group (EuSpRIG), London, UK.
Poon, P.-L., Kuo, F.-C., Liu, H., & Chen, T.Y. (2014). How can non-technical end users effectively test their spreadsheets? Information Technology and People, 27(4), 440–462.
Powell, S.G., & Baker, K.R. (2009). Management science: The art of modeling with spreadsheets (3rd ed.). Hoboken, NJ: Wiley.
Powell, S.G., Baker, K.R., & Lawson, B. (2008). A critical review of the literature on spreadsheet errors. Decision Support Systems, 46(1), 128–138.
Powell, S.G., Baker, K.R., & Lawson, B. (2009). Errors in operational spreadsheets. Journal of Organizational and End User Computing, 21(3), 24–36.
Pressman, R.S. (2010). Software engineering: A practitioner's approach (7th ed.). New York: McGraw-Hill.
Pryor, L. (2004). When, why and how to test spreadsheets. Paper presented at the Annual Conference of the European Spreadsheet Risks Interest Group (EuSpRIG), Klagenfurt, Austria.
Purser, M., & Chadwick, D. (2006). Does an awareness of differing types of spreadsheet errors aid end-users in identifying spreadsheet errors? Paper presented at the Annual Conference of the European Spreadsheet Risks Interest Group (EuSpRIG), Cambridge, UK.
Rajalingham, K., Chadwick, D., & Knight, B. (2000). Classification of spreadsheet errors. Paper presented at the Annual Conference of the European Spreadsheet Risks Interest Group (EuSpRIG), Greenwich, UK.
Ronen, B., Palley, M.A., & Lucas, H.C., Jr. (1989). Spreadsheet analysis and design. Communications of the ACM, 32(1), 84–93.
Senders, J.W., & Moray, N.P. (1991). Human error: Cause, prediction, and reduction. Hillsdale, NJ: Lawrence Erlbaum.
Singh, B. (2013). Implementation of metamorphic testing on spreadsheet applications. International Journal of Modern Engineering Research, 3(2), 990–995.
Sommerville, I. (2011). Software engineering (9th ed.). Boston, MA: Pearson.
Tukiainen, M. (2000). Uncovering effects of programming paradigms: Errors in two spreadsheet systems. In A.F. Blackwell & E. Bilotta (Eds.), Proceedings of the 12th Workshop of the Psychology of Programming Interest Group (pp. 247–266). Psychology of Programming Interest Group.
Vilkomir, S.A., Kapoor, K., & Bowen, J.P. (2003). Tolerance of control-flow testing criteria. Proceedings of the 27th Annual International Computer Software and Applications Conference (COMPSAC 2003) (pp. 182–187). Los Alamitos, CA: IEEE Computer Society Press.
Wallis, D. (2015). Error trapping and handling in Excel macros. Retrieved November 26, 2015, from http://www.consultdmw.com/excel-macro-error-handling.htm
Watkins, J., & Mills, S. (2011). Testing IT: An off-the-shelf software testing process (2nd ed.). New York: Cambridge University Press.
Xie, X., Ho, J.W.K., Murphy, C., Kaiser, G., Xu, B., & Chen, T.Y. (2011). Testing and validating machine learning classifiers by metamorphic testing. Journal of Systems and Software, 84(4), 544–558.