Backward-Slice-Based Statistical Fault Localization without Test Oracles

2013 13th International Conference on Quality Software

Yan Lei, Xiaoguang Mao
School of Computer, National University of Defense Technology, Changsha, China
{yanlei, xgmao}@nudt.edu.cn

Tsong Yueh Chen Swinburne University of Technology Hawthorn, Victoria, 3122, Australia [email protected]

Abstract—A recent promising technique for fault localization, Backward-Slice-based Statistical Fault Localization (BSSFL), statistically analyzes the backward slices and results of a set of test cases to evaluate the suspiciousness of a statement being faulty. However, BSSFL, like many existing fault localization approaches, assumes the existence of a test oracle to determine whether the result of a test case is a failure or a pass. In reality, test oracles do not always exist, and in such cases BSSFL is infeasible. Metamorphic testing has been widely studied as a technique to alleviate the oracle problem. Hence, we leverage metamorphic testing to conduct BSSFL without test oracles. With metamorphic testing, our approach uses the backward slices and the metamorphic result of violation or non-violation for a metamorphic test group, rather than the backward slice and the result of failure or pass for an individual test case in BSSFL. Because our approach does not need the execution result of a test case, BSSFL can be extended to application domains where no test oracle exists. The experimental results on 8 programs and 2 groups of the maximal suspiciousness evaluation formulas show that the effectiveness of our approach is comparable to that of existing BSSFL techniques in the cases where test oracles exist.

covered by a failed test case, whereas its suspiciousness should decrease when it is executed by a passed test case. Following this intuition, many formulas have been developed, such as Ochiai [9], Naish1 [24], Wong1 [24], etc. Although SFL has shown promise in locating faults, it relies on coverage information, which cannot identify those entities whose execution affects the output of a test case. This weakness of coverage information may reduce the effectiveness of SFL, because a faulty program entity generally cannot cause a failure unless its execution affects the output. Therefore, a new fault localization approach, Backward-Slice-based Statistical Fault Localization (BSSFL), has been proposed to solve this issue [2]. BSSFL utilizes the backward slicing technique [3-5] to define strong-semantics statistical variables, which can determine whether the executions of a statement affect (or do not affect) the outputs of test cases. Since the strong-semantics statistical variables are analogous to the weak-semantics statistical variables defined by SFL, BSSFL uses the same suspiciousness evaluation formulas as SFL, with strong-semantics variables, to evaluate the suspiciousness of each statement being faulty. It has been empirically demonstrated that BSSFL significantly outperforms SFL and delivers promising fault localization results [2]. Nevertheless, all existing BSSFL techniques share the same assumption as SFL, that is, they need a test oracle to determine whether a test case has failed or passed. Such an assumption does not hold for many programs, such as machine learning algorithms, simulations, combinatorial calculations, etc. [6,7]. This constraint has made BSSFL, like SFL, infeasible in many application domains. Recently, metamorphic testing [8] has been used to alleviate the oracle problem in the conventional SFL techniques.
Based on the framework of metamorphic testing, this new SFL technique uses the new concepts of metamorphic slice and metamorphic test result as the counterparts of coverage information and test result in the conventional SFL. Because BSSFL adopts the structure of the conventional SFL and shares the same assumption with it, it is natural to investigate how BSSFL can adopt the solution of the new SFL technique to deal with the oracle problem. Inspired by the new SFL technique [8], we define a new concept of metamorphic backward slice (mbslice) and metamorphic test result as the counterparts of the backward slice and test result

Keywords-statistical fault localization; backward slice; test oracle; oracle problem; metamorphic testing;

I. INTRODUCTION

Being a key activity in the development and maintenance of software, debugging normally consumes a considerable amount of resources in an actual project. A typical process of software debugging consists of reproducing failures, locating faults, repairing faults and verifying repairs, among which locating faults has been recognized as a tedious, difficult and time-consuming process [1]. With the aim of decreasing the cost of debugging and improving software reliability, many fault localization techniques have been proposed. Among current research, a popular fault localization technique, Spectrum-based Fault Localization (SFL), has received much attention in recent years. SFL [9] statistically analyzes the coverage information and results of a set of test cases to evaluate the suspiciousness of a program entity being faulty, that is, it defines the suspiciousness of a program entity using its coverage and test results. The basic intuition of this approach is that the suspiciousness of a program entity being faulty should increase when it is

978-0-7695-5039-8/13 $26.00 © 2013 IEEE. DOI 10.1109/QSIC.2013.45


used in the current BSSFL techniques. Each mbslice¹ is related to a specific property [8] of a program defined through a metamorphic relation (MR). An MR involves multiple inputs and their outputs, and must be satisfied if the program is correct. Hence, when an MR is specified, the correctness of a program can, to some extent, be determined by whether the MR is violated or non-violated, rather than by whether the test result is failed or passed. In other words, BSSFL can use the violated or non-violated result of a metamorphic test group to replace the failed or passed result of a test case in the absence of a test oracle. Consequently, BSSFL with mbslice can still be applied when there is no test oracle. This implies that our approach significantly enhances the applicability of the current BSSFL techniques beyond the domains in which a test oracle is available. In this paper, an experimental study is conducted on 8 programs of different sizes, which suffer from the oracle problem due to some of their functionalities. Our approach is applied to 2 groups of the maximal evaluation formulas to evaluate its applicability. The empirical results demonstrate that our approach is applicable to situations where these programs have no test oracle and the conventional BSSFL is infeasible in practice. In addition, the effectiveness of our proposed approach is comparable to that of the conventional BSSFL with the assistance of test oracles. As a consequence, test oracles are no longer a requirement for BSSFL. The remainder of this paper is organized as follows. Section II introduces the background of backward-slice-based statistical fault localization and metamorphic testing. Section III defines the concept of mbslice and shows how it is applied to BSSFL. Section IV presents the setup for our experimental study. Section V reports and analyzes the results of the experimental study and the threats to its validity.
Section VI gives the conclusion and future work.

II. BACKGROUND

$$
\begin{array}{ccc}
 & M \text{ statements} & \text{failed/passed} \\
N \text{ bslices} &
\begin{bmatrix}
x_{11} & x_{12} & \cdots & x_{1M} \\
x_{21} & x_{22} & \cdots & x_{2M} \\
\vdots & \vdots & \ddots & \vdots \\
x_{N1} & x_{N2} & \cdots & x_{NM}
\end{bmatrix} &
\begin{bmatrix} r_1 \\ r_2 \\ \vdots \\ r_N \end{bmatrix}
\begin{matrix} t_1 \\ t_2 \\ \vdots \\ t_N \end{matrix} \\
 & \begin{matrix} s_1 & s_2 & \cdots & s_M \end{matrix} &
\end{array}
$$

Figure 1. Input to BSSFL.

First, we assume that a program P comprises a set of program statements $S = \{s_1, s_2, \ldots, s_M\}$ and runs against a set of test cases $T = \{t_1, t_2, \ldots, t_N\}$ (see Fig. 1). Let $bslice(t_i)$ be the backward slice of the output of the test case $t_i$. The $N \times (M+1)$ matrix above represents the input to BSSFL. An element $x_{ij}$ is equal to 1 if $s_j \in bslice(t_i)$ and 0 otherwise. If the backward slice is a dynamic slice [2,4,5], $bslice(t_i)$ includes those statements whose execution affected the output of test case $t_i$ according to data and control dependencies. To capture the influence of the execution of a statement on the output, BSSFL adopts the dynamic slice as the backward slice. Hence, $x_{ij} = 1$ also means that the execution of statement $s_j$ affected the output of $t_i$, and $x_{ij} = 0$ means that it did not. The vector r in the rightmost column of the matrix represents the test results. The element $r_i$ is equal to 1 if test case $t_i$ failed, and 0 if $t_i$ passed. Apart from the result vector r, the rest of the matrix is denoted by the matrix A. The ith row of A shows which statements affected the output of $t_i$ through their execution, while the jth column of A indicates which test cases had their outputs affected by the executions of statement $s_j$. BSSFL evaluates the suspiciousness of a statement being faulty by utilizing the similarity between the statement vector and the result vector of the matrix in Fig. 1. Hence, we define the following four statistical variables in Eq. (1) based on the matrix in Fig. 1.

$$
a_{pq}(s_j) = \left| \{\, t_i \mid (x_{ij} = p) \wedge (r_i = q) \,\} \right| \qquad (1)
$$

where $p, q \in \{0, 1\}$. $a_{00}(s_j)$ and $a_{01}(s_j)$ represent the number of test cases whose output is not affected by the executions of statement $s_j$ with a passed and failed result, respectively. $a_{10}(s_j)$ and $a_{11}(s_j)$ denote the number of test cases whose output is affected by the executions of statement $s_j$ with a passed and failed result, respectively. Because the above variables are analogous to those defined by spectrum-based fault localization (SFL) [9], BSSFL adopts the evaluation formulas of SFL and conducts the evaluation with the above variables instead of those defined by SFL. In addition, the variables of SFL cannot identify those statements whose execution affects the output of a test run, whereas the BSSFL variables can. Consequently, BSSFL significantly outperforms SFL and shows high fault localization effectiveness [2]. Recent research [24] has theoretically proven that 5 out of 30 investigated SFL suspiciousness evaluation formulas are the most efficient formulas (referred to as the maximal formulas) under the single-fault scenario. Two of these formulas are equivalent and constitute a group ER1, and the other three equivalent formulas compose a group ER5. TABLE I describes the five

A. Backward-Slice-based fault localization (BSSFL)

The basic intuition of BSSFL is that the suspiciousness of a program entity² being faulty should increase when its execution affects the output of a failed test case, while its suspiciousness should decrease when its execution influences the output of a passed test case. BSSFL [2] utilizes backward slicing techniques [3-5] to define new variables that can identify whether the execution of a statement affects the output of a test case, and then adopts the promising structure of spectrum-based fault localization to evaluate the suspiciousness of a statement being faulty.

1 As compared with the metamorphic slice used in [8], our mbslice is based on the backward slice concerned with data and control dependencies instead of coverage information, that is, our mbslice can identify the influence of the execution of a statement on the output whereas the metamorphic slice in [8] cannot.

2 A program entity is a part of a program, such as basic blocks, functions, paths, etc. In this paper, we adopt the widely used entity type for SFL, namely statements.


MR must be satisfied. Any violation of the MR indicates an incorrect implementation. Let us use the shortest path as an example to show the process of MT. First, we assume that a program searches for the shortest path between any two nodes in an undirected graph and outputs the length of the shortest path. Let SP(x, y, G) represent the length of the shortest path from node x to node y in the graph G. If the graph G is complicated and the value of SP(x, y, G) is fairly large, it is very costly or even almost impossible to check whether the computed value is correct due to the large number of possible paths between x and y. Under such circumstances, the program is said to suffer from the oracle problem. Because MT verifies program properties instead of directly verifying the correctness of the output, the first step of applying MT is to define an MR based on some properties of the program. With some well-known properties in graph theory, it is easy to define some MRs. For example, we define an MR (referred to as MR1) as SP(x, y, G)=SP(y, x, G), because the length of the shortest path remains unchanged when the start node and destination node are swapped. Likewise, another MR (referred to as MR2) can be defined as SP(x, y, G)=SP(x, w, G)+SP(w, y, G), where w is a node on the shortest path from x to y. The simple idea is that even though it is difficult to verify the correctness of the individual outputs, namely SP(x, y, G), SP(y, x, G), SP(x, w, G) and SP(w, y, G), it is easy to verify whether the above two MRs are satisfied, that is, MT just needs to examine whether SP(x, y, G)=SP(y, x, G) and SP(x, y, G)=SP(x, w, G)+SP(w, y, G). For instance, if SP(x, y, G)=99999, MT will check whether SP(y, x, G) is equal to 99999 according to MR1. If SP(y, x, G) is not equal to 99999, then we can conclude that the program is incorrect due to the violation of MR1.
If SP(y, x, G) is also 99999, we cannot draw a definitive conclusion about the correctness of the program. This, however, is an inherent limitation of software testing. In this example, (x, y, G) represents the source test case, (y, x, G) denotes the follow-up test case of MR1, and (x, w, G) and (w, y, G) are the follow-up test cases of MR2. Note that some MRs may involve multiple source test cases or follow-up test cases in practice. In addition, follow-up test cases may depend not only on the source test cases but also on their outputs. As MT is simple in concept and independent of programming languages, it should be easy to understand and automate metamorphic testing in many applications once the MRs are defined. The difficult part of applying MT is the identification of the MRs, which may need considerable effort. However, recent studies have presented some empirical evidence that, after a brief general training, engineers can define proper MRs for conducting effective MT on the target programs [16,17]. This research is beyond the scope of this paper. Interested readers may consult [18].
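The shortest-path example can be sketched in Python as follows. This is only an illustration under stated assumptions, not part of the original study: `shortest_path` is a hypothetical Dijkstra-based stand-in for the program under test, the graph is assumed to be connected with non-negative integer weights, and, unlike the program described above, the function also returns the path so that a node w can be picked for MR2.

```python
import heapq

def shortest_path(graph, source, target):
    """Hypothetical program under test: returns (length, path) from source to target."""
    dist = {source: 0}
    prev = {}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == target:
            break
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for neighbour, weight in graph[node].items():
            nd = d + weight
            if nd < dist.get(neighbour, float("inf")):
                dist[neighbour] = nd
                prev[neighbour] = node
                heapq.heappush(heap, (nd, neighbour))
    # Rebuild the path backwards (assumes target is reachable from source).
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return dist[target], path[::-1]

def mr1_violated(sp, graph, x, y):
    """MR1: SP(x, y, G) = SP(y, x, G)."""
    return sp(graph, x, y)[0] != sp(graph, y, x)[0]

def mr2_violated(sp, graph, x, y):
    """MR2: SP(x, y, G) = SP(x, w, G) + SP(w, y, G) for w on the reported path."""
    length, path = sp(graph, x, y)
    return any(sp(graph, x, w)[0] + sp(graph, w, y)[0] != length
               for w in path[1:-1])

# Small undirected example graph (adjacency map with edge weights).
G = {
    "a": {"b": 1, "c": 4},
    "b": {"a": 1, "c": 2},
    "c": {"a": 4, "b": 2},
}
print(mr1_violated(shortest_path, G, "a", "c"))
print(mr2_violated(shortest_path, G, "a", "c"))
```

Neither check needs to know the correct value of SP(x, y, G); each only compares outputs of related executions, which is the point of MT.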

maximal formulas: Naish1, Naish2, Wong1, Russel&Rao and Binary. Because BSSFL adopts the structure of SFL, it uses the same names for the formulas as SFL. Please note that the definitions of the variables in the formulas differ between BSSFL and SFL. Although BSSFL has shown promising results in previous experiments [2], it assumes the existence of a test oracle to determine whether the result of a test case is failed or passed. However, many programs actually suffer from the oracle problem, that is, it is almost impossible or too costly to verify the correctness of the computed outputs. For instance, suppose that we need to test a compiler, that is, we need to determine whether the compiler compiles the source code exactly into its equivalent object code. It is difficult to accomplish this task in practice. Other examples include testing programs related to machine learning algorithms, complex computational programs, graphics display on the monitor, etc. [6,7]. In such situations, the application of existing BSSFL techniques is infeasible.

TABLE I. MAXIMAL FORMULAS OF BSSFL AND BSSFL WITH MBSLICE

Group  Name        Formula
ER1    Naish1      $-1$ if $a_{01}(s_j) > 0$; $a_{00}(s_j)$ if $a_{01}(s_j) \le 0$
ER1    Naish2      $a_{11}(s_j) - \dfrac{a_{10}(s_j)}{a_{10}(s_j) + a_{00}(s_j) + 1}$
ER5    Wong1       $a_{11}(s_j)$
ER5    Russel&Rao  $\dfrac{a_{11}(s_j)}{a_{11}(s_j) + a_{01}(s_j) + a_{10}(s_j) + a_{00}(s_j)}$
ER5    Binary      $0$ if $a_{01}(s_j) > 0$; $1$ if $a_{01}(s_j) \le 0$
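As a rough illustration of how the four statistical variables and the maximal formulas of TABLE I fit together, the following Python sketch counts $a_{pq}(s_j)$ from a 0/1 slice matrix A and a result vector r and then evaluates each formula. The function names and the tiny example matrix are assumptions made for this sketch only; they do not come from the original study.

```python
def counts(A, r, j):
    """Return {(p, q): a_pq(s_j)} for column j of matrix A and result vector r."""
    a = {(p, q): 0 for p in (0, 1) for q in (0, 1)}
    for row, result in zip(A, r):
        a[(row[j], result)] += 1
    return a

def naish1(a):
    return -1 if a[(0, 1)] > 0 else a[(0, 0)]

def naish2(a):
    return a[(1, 1)] - a[(1, 0)] / (a[(1, 0)] + a[(0, 0)] + 1)

def wong1(a):
    return a[(1, 1)]

def russel_rao(a):
    total = sum(a.values())
    return a[(1, 1)] / total if total else 0.0  # guard against an empty test set

def binary(a):
    return 0 if a[(0, 1)] > 0 else 1

# Hypothetical input: 3 test cases (rows), 3 statements (columns).
A = [[1, 1, 0],
     [0, 1, 1],
     [1, 0, 1]]
r = [1, 0, 1]  # 1 = failed, 0 = passed

for j in range(3):
    a = counts(A, r, j)
    print(f"s{j + 1}: Naish1={naish1(a)} Binary={binary(a)}")
```

In this example only the first statement lies in the slice of every failed test case, so it is the only one not assigned the minimum value by Naish1 and Binary.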

B. Metamorphic testing

Metamorphic testing (MT) [10-15] is a technique designed for alleviating the oracle problem. It extracts specific properties from the problem domain and defines metamorphic relations (MRs) to express the relationship between multiple test cases and their outputs. MT verifies whether the MRs are satisfied, rather than verifying the correctness of the computed outputs of individual test cases. Given a program under test, the first requirement for applying MT is to identify some MRs. An MR is a property of the program that establishes the relationship between multiple test cases and their outputs. The multiple test cases consist of two parts: source test cases and follow-up test cases. Source test cases are usually obtained according to various generation and selection strategies, such as random testing, coverage criteria-based testing, mutation testing, etc. Follow-up test cases are constructed from the source test cases and the relationship established by the MR. A source test case (or a group of source test cases) and its related follow-up test cases are usually referred to as a metamorphic test group. The basic intuition of MT is to verify the property of a program expressed by an MR after executing all test cases of a metamorphic test group, instead of verifying the correctness of the output of each individual test case. If the implementation of the program is correct,

III. APPROACH

In this section, we will define a new concept of metamorphic backward slice (mbslice) that is a group of backward slices associated with the output of a metamorphic test group. With mbslice, BSSFL can be extended to the situations where there exists no test oracle.

A. Mbslice

Given a test case t, BSSFL uses the backward dynamic slicing technique to define bslice(t), the backward dynamic slice of the output of the test case t. In BSSFL, bslice(t) must always be bound together with the test result of failure or pass for test case t. Nevertheless, such a methodology can be infeasible in reality due to the oracle problem. Recently, metamorphic testing has been proposed to alleviate the oracle problem in the conventional SFL techniques [8]. Since BSSFL adopts the structure of the conventional SFL, it is natural to consider how to integrate the conventional BSSFL with metamorphic testing, to extend the application of BSSFL to those areas where a test oracle does not exist. Therefore, we define a new concept of metamorphic backward slice (mbslice) to achieve this goal. Given an MR, let $T^S = \{t_1^S, \ldots, t_{k_s}^S\}$ and $T^F = \{t_1^F, \ldots, t_{k_f}^F\}$ be the source test cases and follow-up test cases for the MR. $T^F$ is constructed from the MR and $T^S$. $T^S$ and $T^F$ compose a metamorphic test group g according to the MR. Based on the bslice, a mbslice is defined as follows:

• A mbslice, $mbslice(MR, T^S)$, is the union of all bslice(t) for t involved in the relevant MT executions with $T^S$ and $T^F$, that is,

$$
mbslice(MR, T^S) = \Big( \bigcup_{k=1}^{k_s} bslice(t_k^S) \Big) \cup \Big( \bigcup_{k=1}^{k_f} bslice(t_k^F) \Big) \qquad (2)
$$

$$
\begin{array}{ccc}
 & M \text{ statements} & \text{vio/non-vio} \\
N \text{ mbslices} &
\begin{bmatrix}
x_{11} & x_{12} & \cdots & x_{1M} \\
x_{21} & x_{22} & \cdots & x_{2M} \\
\vdots & \vdots & \ddots & \vdots \\
x_{N1} & x_{N2} & \cdots & x_{NM}
\end{bmatrix} &
\begin{bmatrix} r_1 \\ r_2 \\ \vdots \\ r_N \end{bmatrix}
\begin{matrix} g_1 \\ g_2 \\ \vdots \\ g_N \end{matrix} \\
 & \begin{matrix} s_1 & s_2 & \cdots & s_M \end{matrix} &
\end{array}
$$

Figure 2. Input to BSSFL with mbslice.

As shown in Fig. 2, the $N \times (M+1)$ matrix represents the input to BSSFL with mbslice. An element $x_{ij}$ is equal to 1 if $s_j \in mbslice(MR, t_i)$ and 0 otherwise. As $mbslice(MR, t_i)$ denotes the union of $bslice(t_i)$ and $bslice(t_i^f)$, $mbslice(MR, t_i)$ includes those statements whose execution affected the output of metamorphic test group $g_i$ according to data and control dependencies. Hence, $x_{ij} = 1$ also means that the execution of statement $s_j$ affected the output of $g_i$, and $x_{ij} = 0$ means that it did not. The vector r in the rightmost column of the matrix represents the metamorphic test results. The element $r_i$ is equal to 1 if $g_i$ violated the metamorphic relation MR, and 0 if $g_i$ did not violate the MR. Apart from the result vector r, the rest of the matrix is denoted by the matrix A. The ith row of A shows which statements affected the output of $g_i$ through their execution, while the jth column of A indicates which metamorphic test groups had their outputs affected by the executions of statement $s_j$. BSSFL with mbslice follows the structure of BSSFL and measures the suspiciousness of a statement being faulty by utilizing the similarity between the statement vector and the result vector of the matrix in Fig. 2. Hence, we define four new statistical variables in Eq. (3) using the matrix in Fig. 2.

Essentially, a mbslice is a group of bslices bound together with a metamorphic test group of an MR, that is, a mbslice can potentially play the role of bslice in BSSFL. As a reminder, the output of a metamorphic test group consists of the outputs of all test cases in the metamorphic test group. In addition, a mbslice is associated with a metamorphic testing result of violation or non-violation. Intuitively speaking, the test result of failure or pass corresponds to the metamorphic result of violation or non-violation, respectively. Therefore, in BSSFL, instead of using the bslice and the result of failure or pass for an individual test case, the mbslice and the metamorphic result of violation or non-violation for a metamorphic test group can be used. As a consequence, test oracles are no longer a requirement for BSSFL, and it is feasible to extend the application of BSSFL beyond the areas in which test oracles are available. In the following section, without loss of generality, we assume that the given metamorphic relation MR has only one source test case and one follow-up test case, that is, $T^S$ and $T^F$ have only one element each.
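The union that defines a mbslice can be written down directly. The following Python sketch assumes that each bslice is already available as a set of statement identifiers; how these sets are produced by a slicing tool is outside the sketch, and the function name and statement numbers are hypothetical.

```python
def mbslice(source_bslices, follow_up_bslices):
    """Union of the backward slices of all source and follow-up test cases
    in one metamorphic test group (each slice is a set of statement ids)."""
    return set().union(*source_bslices, *follow_up_bslices)

# Hypothetical group with two source test cases and one follow-up test case.
source = [{1, 2, 5}, {2, 7}]
follow_up = [{2, 5, 9}]
print(sorted(mbslice(source, follow_up)))
```

With one source and one follow-up test case, as assumed in the rest of the paper, the function reduces to the union of two slices.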



$$
a_{pq}(s_j) = \left| \{\, g_i \mid (x_{ij} = p) \wedge (r_i = q) \,\} \right| \qquad (3)
$$

where $p, q \in \{0, 1\}$. $a_{00}(s_j)$ and $a_{01}(s_j)$ represent the number of metamorphic test groups whose output is not affected by the executions of statement $s_j$ with a non-violated and violated result, respectively. $a_{10}(s_j)$ and $a_{11}(s_j)$ denote the number of metamorphic test groups whose output is affected by the executions of statement $s_j$ with a non-violated and violated result, respectively. As described above, the proposed approach replaces bslice with mbslice to identify the influence of the execution of a statement on the output of a metamorphic test group rather than an individual test case. Moreover, it uses the metamorphic test result of violated or non-violated instead of the test result of failed or passed. After the transformation from Fig. 1 to Fig. 2, the proposed approach applies the same procedure as BSSFL to form the vectors and uses the newly defined variables to evaluate the suspiciousness of each statement. In the conventional BSSFL, a failed test case implies that the execution of a faulty statement affects the faulty output and the faulty statement should be in the corresponding bslice³, while a passed test case does not

B. BSSFL with mbslice

Let us use the program P with a set of statements $S = \{s_1, s_2, \ldots, s_M\}$ and the set of test cases $T = \{t_1, t_2, \ldots, t_N\}$ defined in Section II.A. The set of test cases T represents the source test cases, and the set of follow-up test cases $T^f = \{t_1^f, t_2^f, \ldots, t_N^f\}$ is generated according to the source test cases T and a metamorphic relation MR. Thus, $t_i$ and $t_i^f$ constitute a metamorphic test group $g_i$ (see Fig. 2).
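To make the construction of the input in Fig. 2 concrete, the sketch below builds the matrix A and the result vector r from a list of metamorphic test groups. It assumes one source and one follow-up test case per group, slices given as sets of statement identifiers, and a precomputed boolean stating whether the group violated the MR; the function name and all data are hypothetical.

```python
def build_mbslice_input(statements, groups):
    """groups: iterable of (source_bslice, follow_up_bslice, violated).
    Returns (A, r) where A[i][j] = 1 iff statement j is in the mbslice of
    group i, and r[i] = 1 iff group i violated the MR."""
    A, r = [], []
    for source_bslice, follow_up_bslice, violated in groups:
        mb = source_bslice | follow_up_bslice
        A.append([1 if s in mb else 0 for s in statements])
        r.append(1 if violated else 0)
    return A, r

statements = [1, 2, 3, 4]
groups = [
    ({1, 2}, {2, 3}, True),    # g1 violated the MR
    ({1}, {1, 4}, False),      # g2 did not violate the MR
]
A, r = build_mbslice_input(statements, groups)
print(A, r)
```

The resulting A and r have the same shape as the input of the conventional BSSFL, which is why the same suspiciousness formulas can be reused unchanged.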

3 Backward slicing, in some cases, cannot capture the influence of the execution of certain types of faults on the output [2]. Such cases are not included in this discussion.


imply that the execution of a faulty statement does not affect the expected output and that the corresponding bslice is free of faulty statements. Similarly, a violated metamorphic test group implies that there is at least one failed test case within it. Even though we do not know which test cases are actually the failed ones, we can still conclude that the execution of a faulty statement must affect the violated output of the metamorphic test group and that a faulty statement should be included in the union of all the corresponding bslices, that is, the mbslice. On the other hand, a non-violated metamorphic test group does not provide a definite conclusion about whether it involves any failed test case and, as a consequence, the correctness of all statements in its mbslice is also uncertain.

IV. EXPERIMENTAL SETUP

All the experiments ran on an Ubuntu 10.04 machine with a 2.20 GHz Intel(R) Core(TM)2 Duo CPU and 2 GB of memory.

B. Definition of MR

Since mbslice is a concept associated with an MR, we should first define metamorphic relations for each program. In this case study, we define three MRs for each program. Due to space limitations, we only describe the MRs for grep in this paper. The detailed specification of grep can be found on the GNU website⁴. For the grep program, the three MRs are defined from the equivalence of different regular expressions. Given an input file “infile.txt” and a pattern $R_s$ described by a regular expression, grep reads “infile.txt” line by line, searches the file for lines containing a match to $R_s$, and finally prints all the matched lines. Without changing the input file and the command options, the three MRs focus on generating a follow-up regular expression $R_f$ which is equivalent to $R_s$. Because the different formats of $R_s$ and $R_f$ are equivalent to each other, the corresponding output $O_s$ should be identical to the output $O_f$. The definition of each MR is presented in detail below.

MR1: Completely decomposing the bracketed sub-expression. The usual format of a bracketed sub-expression is “[x1…xn]”, where xi is a single character. If the characters “x1,…,xn” are consecutive according to the values of their ASCII codes, they can be expressed by the format “[x1-xn]”. For a bracketed sub-expression, one of its equivalent formats is a complete decomposition of this “[ ]” structure using “|”. Therefore, in MR1, $R_f$ is constructed by replacing each such bracketed sub-expression in $R_s$ with its completely decomposed equivalent format. This transformation should satisfy the relation $O_f = O_s$. For example, if $R_s$ contains a bracketed sub-expression “[a-c]” or “[abc]”, MR1 uses the expression “a|b|c” instead in $R_f$.

MR2: Splitting the bracket structure. Given a bracketed sub-expression “[x1…xn]” or “[x1-xn]”, MR2 uses the symbol “|” to decompose the bracketed sub-expression. More specifically, MR2 generates an equivalent format by splitting the bracket structure of the sub-expression into two parts separated by “|”, where each part still keeps the bracket structure. Hence, in this MR, $R_f$ is constructed by replacing a bracketed substring with its equivalent split. For example, if $R_s$ contains “[a-c]” or “[abc]”, MR2 uses the expression “[ab]|[c]” instead in $R_f$. Likewise, the relation $O_s = O_f$ should hold.

MR3: Bracketing simple characters. Except for those key words with special meanings, any simple character should be equivalent to itself enclosed in brackets. That is, “xi” is equivalent to “[xi]” if xi is not a key word. Hence, MR3 replaces some simple characters in $R_s$ with their equivalent bracketed formats to form the follow-up regular expression $R_f$. For example, if $R_s$ contains “abc”, MR3 uses the expression “[a][b][c]” instead in $R_f$. Again, $O_s = O_f$.
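The three grep MRs can be illustrated with a small Python sketch. Several assumptions are made here and none of this is the tooling used in the study: Python's `re` module stands in for grep, the hypothetical helper functions only handle simple bracketed sub-expressions over letters and digits, and a non-capturing group `(?:...)` is added around each alternation so that the follow-up pattern keeps the same precedence inside a longer expression.

```python
import re

# Simple bracketed sub-expressions only: "[a-c]" or "[abc]" (no negation, no classes).
BRACKET = re.compile(r"\[([A-Za-z0-9]-[A-Za-z0-9]|[A-Za-z0-9]+)\]")

def expand(body):
    """Characters denoted by a bracket body such as 'a-c' or 'abc'."""
    if len(body) == 3 and body[1] == "-":
        return [chr(c) for c in range(ord(body[0]), ord(body[2]) + 1)]
    return list(body)

def mr1_decompose(pattern):
    """MR1: '[a-c]' -> '(?:a|b|c)'."""
    return BRACKET.sub(lambda m: "(?:" + "|".join(expand(m.group(1))) + ")", pattern)

def mr2_split(pattern):
    """MR2: '[a-c]' -> '(?:[ab]|[c])'."""
    def split(m):
        chars = expand(m.group(1))
        if len(chars) < 2:
            return m.group(0)  # nothing to split
        return "(?:[" + "".join(chars[:-1]) + "]|[" + chars[-1] + "])"
    return BRACKET.sub(split, pattern)

def mr3_bracket(pattern):
    """MR3: 'abc' -> '[a][b][c]' for letters and digits outside brackets.
    Escapes are kept intact; quantifier braces are not handled in this sketch."""
    out, in_bracket, i = [], False, 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            out.append(pattern[i:i + 2])
            i += 2
            continue
        if ch == "[":
            in_bracket = True
        elif ch == "]":
            in_bracket = False
        out.append(f"[{ch}]" if ch.isalnum() and not in_bracket else ch)
        i += 1
    return "".join(out)

def grep(pattern, lines):
    """Stand-in for grep: the lines containing a match to the pattern."""
    return [line for line in lines if re.search(pattern, line)]

def violated(source_pattern, follow_up_pattern, lines):
    """The metamorphic test group is violated when O_s differs from O_f."""
    return grep(source_pattern, lines) != grep(follow_up_pattern, lines)

lines = ["xaz", "xbz", "xdz", "abc", "cab"]
Rs = "x[a-c]z"
for mr in (mr1_decompose, mr2_split, mr3_bracket):
    Rf = mr(Rs)
    print(Rf, violated(Rs, Rf, lines))
```

On a correct matcher, none of the three follow-up patterns should produce a violation; a violation on a mutant would mark the corresponding metamorphic test group with $r_i = 1$.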

A. Objectives

TABLE II. DESCRIPTION OF THE SIEMENS SUITE AND GREP

Program         eLOC   Description
print_tokens    342    Lexical analyzer
print_tokens2   355    Lexical analyzer
replace         512    Pattern recognition
schedule        292    Priority scheduler
schedule2       262    Priority scheduler
tcas            135    Altitude separation
tot_info        274    Information measure
grep            7309   Pattern match

To evaluate the effectiveness of the proposed approach, this study has chosen the Siemens Suite and grep as our subject programs. The Siemens Suite was originally developed at the Siemens Research Corporation and contains 7 programs. grep is a real-life UNIX utility program that performs pattern matching. The version used in our experiments is grep 2.0. All the subject programs are obtained from the Software-artifact Infrastructure Repository (SIR) [19]. We select the Siemens programs because they have been widely used and studied in the fault localization community despite their small sizes. We also include grep because it is larger than the Siemens programs. More importantly, verifying the correctness of the outputs of these programs is difficult in most cases, that is, they have the oracle problem. For example, in grep, it is easy for us to check whether all the printed lines actually contain matches. However, it is almost impossible to know whether grep has printed all the matched lines, unless we search the entire file to conduct an exhaustive examination. This shows that it may be easy to verify the soundness of the results but hard to verify their completeness. As a reminder, all the previous BSSFL studies simply assume the original versions as the test oracles. TABLE II lists the programs, the lines of executable code, and the functional descriptions of the corresponding programs. The lines of executable code, collected by SLOCCount 2.26, exclude lines such as blank lines and lines containing only a left or right brace.

4 GNU, http://www.gnu.org/software/grep/.

code, and then searches for potential locations where a mutant operator can be applied. Finally, our tool randomly chooses one of the defined mutant operators and applies it to a randomly selected potential location. For each program, the mutants that could not be compiled successfully are first excluded. Existing fault localization studies, including those on the conventional BSSFL techniques with bslice, commonly exclude mutants for which no failure is revealed. Likewise, mutants with no violation are excluded in our investigation. The overall numbers of the used mutants for each program under each MR are shown in TABLE IV.
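The two mutant operators can be sketched at the text level in Python. This is a deliberately naive illustration and not the tool used in the study: the function names are hypothetical, goto label replacement is omitted, and purely textual replacement can produce invalid code (for example on `i++`), which mirrors why mutants that fail to compile have to be filtered out afterwards.

```python
import random
import re

ARITHMETIC = ["+", "-", "*", "/", "%"]
LOGICAL = ["&&", "||"]

def statement_mutants(line):
    """Statement mutation: continue <-> break."""
    if re.search(r"\bcontinue\b", line):
        return [re.sub(r"\bcontinue\b", "break", line, count=1)]
    if re.search(r"\bbreak\b", line):
        return [re.sub(r"\bbreak\b", "continue", line, count=1)]
    return []

def operator_mutants(line):
    """Operator mutation: replace the first occurrence of an arithmetic
    (or logical) operator with each different operator of the same family."""
    mutants = []
    for family in (LOGICAL, ARITHMETIC):
        for op in family:
            pos = line.find(op)
            if pos < 0:
                continue
            for other in family:
                if other != op:
                    mutants.append(line[:pos] + other + line[pos + len(op):])
    return mutants

def random_mutant(source_lines, rng):
    """Pick a random line that admits a mutation and apply one random mutant."""
    candidates = [(i, m) for i, line in enumerate(source_lines)
                  for m in statement_mutants(line) + operator_mutants(line)]
    if not candidates:
        return None
    index, mutated = rng.choice(candidates)
    return source_lines[:index] + [mutated] + source_lines[index + 1:]

program = ["x = a + b;", "if (x > 0 && y > 0)", "    continue;"]
print(random_mutant(program, random.Random(0)))
```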

C. Test suite generation

For the 7 programs in the Siemens Suite, since the “universe” test suites provided by SIR are sufficient for applying our MRs, we simply adopt the “universe” test plans as our source test suites. For grep, although SIR provides 807 test cases, we do not utilize them because only a small part of them contains substrings that are sufficient for effectively applying our MRs. Therefore, to provide source test cases for our experiments, we use a random test pool and select qualified test cases for each MR. Finally, 2982 test cases are chosen as source test cases for MR1, 5003 for MR2, and 2084 for MR3. The numbers of source test cases for all 8 programs are listed in TABLE III.

E. Evaluation formulas and metric

This study focuses on single-fault scenarios. Since ER1 and ER5 have recently been proved to be the maximal formulas [24], there is little value in evaluating our approach with the non-maximal formulas. Thus, we evaluate our approach using only Naish1 out of the two equivalent formulas in ER1 and Binary out of the three equivalent maximal formulas in ER5. Their evaluation formulas are listed in TABLE I. The experiment evaluates the effectiveness of using mbslice and bslice with these two maximal formulas. In light of the equivalence within each group, the following section uses ER1 and ER5 to represent Naish1 and Binary, respectively. Following the algorithm of the BSSFL techniques in [2], this study also adopts the Approximate Dynamic Backward Slice (ADBS) as the backward slice (that is, bslice). The ADBS is the intersection of the statements in the static backward slice of the output with the executed statements. The static backward slice is computed with the FEMA slicing tool [20] and the executed statements are collected with Gcov⁶. Fault localization effectiveness is commonly evaluated by the percentage of executable code that needs to be examined before finding the actual faulty statement (referred to as the EXAM score [8]). A lower EXAM score indicates better performance. Following this convention, our study also adopts the EXAM score as the evaluation metric. Statements with the same suspiciousness value are ranked according to their original line order in the source code.
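The ADBS and the EXAM score described above can be sketched as follows. The sketch assumes that statements are identified by their line numbers, that the static slice and the executed statements are already available as sets, and that the faulty statement itself counts as examined; the last point is an assumption of this sketch, since the text does not state it explicitly.

```python
def adbs(static_backward_slice, executed_statements):
    """Approximate Dynamic Backward Slice: statements that are both in the
    static backward slice of the output and actually executed."""
    return static_backward_slice & executed_statements

def exam_score(suspiciousness, faulty_statement):
    """Percentage of statements examined until the faulty one is reached.
    suspiciousness maps line number -> score; ties are broken by line order."""
    ranking = sorted(suspiciousness, key=lambda s: (-suspiciousness[s], s))
    rank = ranking.index(faulty_statement) + 1
    return 100.0 * rank / len(ranking)

# Hypothetical data for a four-statement program.
print(sorted(adbs({1, 2, 3}, {2, 3, 4})))
scores = {1: 0.0, 2: 1.0, 3: 1.0, 4: -1.0}
print(exam_score(scores, 3))
```

In the example, statements 2 and 3 share the highest score, so statement 2 is examined first because it appears earlier in the source code.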

TABLE III. NUMBER OF SOURCE TEST CASES

| Program | Number |
|---|---|
| print_tokens | 4130 |
| print_tokens2 | 4115 |
| replace | 5542 |
| schedule | 2650 |
| schedule2 | 2710 |
| tcas | 1608 |
| tot_info | 1052 |
| grep | 2982 (MR1), 5003 (MR2), 2084 (MR3) |

D. Mutant generation

TABLE IV. NUMBER OF USED MUTANTS

| Program | MR1 | MR2 | MR3 |
|---|---|---|---|
| print_tokens | 132 | 120 | 103 |
| print_tokens2 | 120 | 124 | 97 |
| replace | 104 | 90 | 85 |
| schedule | 95 | 88 | 92 |
| schedule2 | 105 | 113 | 85 |
| tcas | 129 | 132 | 136 |
| tot_info | 29 | 33 | 32 |
| grep | 220 | 218 | 78 |

SIR provides a number of mutants for the Siemens Suite (footnote 5) and grep. However, the number of these mutants is too small for our study. In addition, many of them are actually equivalent mutants with respect to our selected MRs, and thus no violation can be revealed in these mutants. Therefore, besides the provided mutants for the Siemens and grep programs, we randomly generate 300 extra mutants to give more meaningful statistics. Our mutant generation focuses on producing single-fault mutants with non-omission faults. Future work will include the investigation of multiple faults. The generation uses two types of mutant operators: statement mutation and operator mutation. For statement mutation, three types of statements are considered: continue, break and goto statements. A continue statement is replaced with a break statement and vice versa. The label of a goto statement is replaced with another valid label. Operator mutation replaces an arithmetic (or logical) operator with a different arithmetic (or logical) operator. To generate a mutant, our tool first randomly chooses a line in the relevant part within the source

V. AN EXPERIMENTAL STUDY
To show the effectiveness of our proposed approach in locating faults when test oracles do not exist, BSSFL with mbslice is applied to each program using all relevant MRs with both Naish1 and Binary. Since the conventional BSSFL is evaluated with the assistance of an assumed oracle, the experiments use its effectiveness as the benchmark against which the effectiveness of BSSFL with mbslice is compared. Thus, four scenarios are used:
(1) MS: BSSFL using mbslice with all metamorphic test groups.
(2) B-ST: BSSFL using bslice with all source test cases.

Footnote 5: Two real-life faults in the Siemens Suite (print_tokens and schedule) are reported in [8]. The faults in the original versions of these two programs were corrected, and all mutants of the two programs were generated from the corrected versions.

Footnote 6: A test coverage tool used in concert with the GCC compiler (see http://gcc.gnu.org/onlinedocs/gcc/Gcov.html).


(3) B-FT: BSSFL using bslice with all follow-up test cases.
(4) B-AT: BSSFL using bslice with all source and follow-up test cases.
The comparison between MS and the other three scenarios is based on two observations. First, the number of bslices used in either B-ST or B-FT is the same as the number of mbslices used in MS, while B-AT uses twice as many bslices. Therefore, B-ST, B-FT and MS have the same amount of raw data for the evaluation formulas, and hence show the same degree of reliability. Second, the number of test executions in MS is the same as that in B-AT, and is twice the number of test executions in either B-ST or B-FT. Thus the program execution overheads of MS and B-AT are the same, while those of B-ST and B-FT are similar to each other but about half of those of MS or B-AT. As a reminder, B-ST, B-FT and B-AT require a test oracle to determine whether a test case has failed or passed. Like some previous studies, we adopt the non-mutated versions as the assumed test oracles in these three scenarios for our experiments. MS, however, does not require such assumed oracles: it uses mbslice and checks whether each metamorphic test group is violated or non-violated with respect to an MR. The following sections analyze the experimental results from two aspects: the EXAM score distribution and the statistical comparison using the Wilcoxon-Signed-Rank test.
A. EXAM score distribution
To demonstrate the effectiveness of our approach, we first examine the EXAM distribution of a particular formula over all the mutants of a program under the four scenarios MS, B-ST, B-FT and B-AT. The EXAM distributions for each of the eight programs are presented in Fig. 3(a) to Fig. 3(h). Each figure summarizes the results of all three MRs for a program under ER1 and ER5 respectively.
In these figures, the x-axis represents the EXAM score, while the y-axis denotes the percentage of mutants whose faults have been located within the corresponding EXAM score. The faster a curve rises and reaches 100% of mutants, the more effective the corresponding technique is. Generally speaking, we can observe from these graphs that the EXAM distributions of MS are similar to those of the other three scenarios. For example, in program print_tokens, the four scenarios have almost identical effectiveness. In program schedule2, the curves of MS cross the other three curves, such that MS performs slightly worse at some intervals of the EXAM score while its curves exceed the other three at other intervals. In program grep, the curves of MS are always above those of the other three scenarios. In summary, the performance difference between our approach and the existing techniques is small; that is, the EXAM score distribution does not present any visually significant difference between our approach and the existing techniques.
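The curves described above are cumulative distributions, and a single curve can be sketched as follows. This is only an illustration under assumptions: the function name and the threshold grid are hypothetical, and the sample scores are made up rather than taken from Fig. 3.

```python
from typing import List, Sequence, Tuple


def exam_distribution(exam_scores: Sequence[float],
                      thresholds: Sequence[float]) -> List[Tuple[float, float]]:
    """For each EXAM threshold (x-axis), return the percentage of mutants
    whose fault is located within that threshold (y-axis)."""
    total = len(exam_scores)
    return [(t, 100.0 * sum(1 for s in exam_scores if s <= t) / total)
            for t in thresholds]


# Hypothetical EXAM scores (in percent) of four mutants under one scenario.
curve = exam_distribution([5.0, 10.0, 20.0, 40.0], thresholds=[10, 20, 100])
print(curve)
```

Computing one such curve per scenario (MS, B-ST, B-FT, B-AT) and plotting them together gives the kind of comparison the figures present: a curve that lies above another locates more faults for the same inspection effort.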


B. Statistical comparison
Although the EXAM score distribution provides a visual comparison, the differences are too small for that comparison to be conclusive. Therefore, we further conduct a more rigorous comparison: the paired Wilcoxon-Signed-Rank test. The paired Wilcoxon-Signed-Rank test is a non-parametric statistical hypothesis test for the differences between pairs of measurements F(x) and G(y) that do not follow a normal distribution [21]. It makes use of the sign and the magnitude of the rank of the differences between F(x) and G(y). At a given significance level σ, we can use both the 2-tailed p-value and the 1-tailed p-value to obtain a conclusion. For the 2-tailed p-value, if p ≥ σ, the null hypothesis H0 that F(x) and G(y) are not significantly different is accepted; otherwise, the alternative hypothesis H1 that F(x) and G(y) are significantly different is accepted. For the 1-tailed p-value, there are two cases: the lower case and the upper case. In the lower case, if p ≥ σ, H0 that F(x) does not significantly tend to be greater than G(y) is accepted; otherwise, H1 that F(x) significantly tends to be greater than G(y) is accepted. In the upper case, if p ≥ σ, H0 that F(x) does not significantly tend to be less than G(y) is accepted; otherwise, H1 that F(x) significantly tends to be less than G(y) is accepted.
The experiments performed three paired Wilcoxon-Signed-Rank tests: MS vs. B-ST, MS vs. B-FT and MS vs. B-AT. Each test uses both the 2-tailed and 1-tailed checking. Given a program and a formula, we use the list of EXAM scores for all the mutants with MS as the list of measurements of F(x), while the list of measurements of G(y) is the list of EXAM scores for all the mutants with B-ST, B-FT and B-AT respectively. Hence, in the 2-tailed test, MS has SIMILAR effectiveness to B-ST, B-FT or B-AT when H0 is accepted at the significance level of 0.05. In the 1-tailed test (lower), MS has WORSE effectiveness than B-ST, B-FT or B-AT when H1 is accepted. In the 1-tailed test (upper), MS has BETTER effectiveness than B-ST, B-FT or B-AT when H1 is accepted.
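The decision rule above can be written down compactly. The following is a sketch under stated assumptions, not the authors' procedure: the function name is hypothetical, and checking the 2-tailed p-value before the 1-tailed ones is an assumed precedence that the text does not spell out. The sample p-values are taken from rows of the MS vs. B-ST results.

```python
def classify(p_two_tailed: float, p_lower: float, p_upper: float,
             sigma: float = 0.05) -> str:
    """Map the three Wilcoxon p-values to SIMILAR, WORSE or BETTER.

    F(x) holds the EXAM scores of MS, so "F(x) greater than G(y)" means MS
    needs more code examined, that is, MS is WORSE.
    """
    if p_two_tailed >= sigma:   # H0 accepted: no significant difference
        return "SIMILAR"
    if p_lower < sigma:         # H1 (lower): F(x) tends to be greater
        return "WORSE"
    if p_upper < sigma:         # H1 (upper): F(x) tends to be less
        return "BETTER"
    return "SIMILAR"            # assumed fallback when neither tail rejects


print(classify(0.95299, 0.52415, 0.47649))          # print_tokens, ER1
print(classify(3.75151e-11, 1.87576e-11, 1.0))      # replace, ER1
print(classify(9.23365e-17, 1.0, 4.61682e-17))      # schedule, ER5
```

The p-values themselves would come from a statistics package. As one possible alternative to the tool used in the study, `scipy.stats.wilcoxon` accepts an `alternative` argument (`'two-sided'`, `'greater'`, `'less'`), although its exact p-values may differ from those reported here.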

The experiments use OriginPro 9.0, developed by OriginLab, to conduct the above statistical analysis. TABLE V to TABLE VII show the results of the statistical comparison between MS and the other three scenarios, namely MS vs. B-ST, MS vs. B-FT, and MS vs. B-AT. They show that in most combinations of programs and formulas, BETTER and SIMILAR results occur more frequently than WORSE results. TABLE VIII summarizes the numbers of SIMILAR, BETTER and WORSE cases in TABLE V to TABLE VII. It shows that in all cases, the number of BETTER and SIMILAR results is larger than that of WORSE results. Overall, more than 50% of the results are BETTER or SIMILAR. Therefore, the performance of MS is statistically comparable to that of the other three scenarios, in which a test oracle is required.
C. Summary
From the above two aspects of the analysis, we can conclude that the performance of our approach is similar to that of the conventional BSSFL techniques in the cases where test oracles are mandatory. Note that the main interest of our approach is to extend the applicability of the conventional BSSFL techniques to cases without test oracles. As a consequence, the effectiveness of the proposed approach is not the main focus. Nevertheless, our approach performs similarly to or better than the conventional BSSFL using an assumed oracle in over 50% of cases. This implies that our approach can still conduct effective BSSFL when no test oracle is available.
D. Threats to Validity

The primary threat to the validity of our experiment is the subject programs used by the study. The experiment chooses 8 subject programs, with just three MRs for each, because they are extensively used in the field of fault localization. However, the programs of the Siemens Suite are small, and even the real-life UNIX utility grep is still not a very large program. Hence, further applications of our approach to large real-life programs would strengthen the validity of our work.
Next, a threat is the type of mutants generated and used in our study. Although these mutants are randomly generated, each mutant contains exactly one fault and the types of faults are also limited. The research in [22] has shown that multiple faults pose a negligible effect on the effectiveness of fault localization. This finding increases our confidence in further applications of our approach to programs with multiple faults. Nevertheless, there are still many unknown and complicated factors in realistic debugging. Hence, further study (e.g., using clustering techniques [23]) should be conducted to investigate the effectiveness of our approach in the context of multiple faults.
Another threat is that the identification of effective MRs may be difficult for some programs. In the experimental study, we could easily identify effective MRs for the subject programs because they are widely understood and studied in both the academic and industrial communities. Even though recent studies have empirically shown that testers with brief general training can properly and effectively define MRs for the target programs [16,17], we cannot simply conclude that the identification of effective MRs is a trivial task or that it will be easy for other, more complicated programs.

Figure 3. EXAM distribution in each program.

TABLE V. MS VS. B-ST

| Program | Formula | 2-tailed | 1-tailed (lower) | 1-tailed (upper) | Conclusion |
|---|---|---|---|---|---|
| print_tokens | ER1 | 0.95299 | 0.52415 | 0.47649 | SIMILAR |
| print_tokens | ER5 | 0.24097 | 0.12048 | 0.87991 | SIMILAR |
| print_tokens2 | ER1 | 0.19507 | 0.9026 | 0.09753 | SIMILAR |
| print_tokens2 | ER5 | 0.24047 | 0.87991 | 0.12023 | SIMILAR |
| replace | ER1 | 3.75151E-11 | 1.87576E-11 | 1 | WORSE |
| replace | ER5 | 0.75068 | 0.37534 | 0.62494 | SIMILAR |
| schedule | ER1 | 0.20703 | 0.89664 | 0.10352 | SIMILAR |
| schedule | ER5 | 9.23365E-17 | 1 | 4.61682E-17 | BETTER |
| schedule2 | ER1 | 0.34407 | 0.82825 | 0.17203 | SIMILAR |
| schedule2 | ER5 | 0.52023 | 0.74013 | 0.26011 | SIMILAR |
| tcas | ER1 | 0 | 0 | 1 | WORSE |
| tcas | ER5 | 0.00331 | 0.00166 | 0.99835 | WORSE |
| tot_info | ER1 | 1.13619E-7 | 5.68095E-8 | 1 | WORSE |
| tot_info | ER5 | 0.55882 | 0.27941 | 0.72186 | SIMILAR |
| grep | ER1 | 7.44148E-10 | 1 | 3.72074E-10 | BETTER |
| grep | ER5 | 4.39697E-22 | 1 | 2.19849E-22 | BETTER |

TABLE VI. MS VS. B-FT

| Program | Formula | 2-tailed | 1-tailed (lower) | 1-tailed (upper) | Conclusion |
|---|---|---|---|---|---|
| print_tokens | ER1 | 0.72875 | 0.36438 | 0.63625 | SIMILAR |
| print_tokens | ER5 | 0.40356 | 0.20178 | 0.79874 | SIMILAR |
| print_tokens2 | ER1 | 0.18307 | 0.90859 | 0.09153 | SIMILAR |
| print_tokens2 | ER5 | 0.04988 | 0.9751 | 0.02494 | BETTER |
| replace | ER1 | 8.45638E-10 | 1 | 4.22819E-10 | WORSE |
| replace | ER5 | 0.80066 | 0.40033 | 0.59995 | SIMILAR |
| schedule | ER1 | 0.41491 | 0.7928 | 0.20745 | SIMILAR |
| schedule | ER5 | 8.22003E-15 | 1 | 4.11002E-15 | BETTER |
| schedule2 | ER1 | 0.11682 | 0.94172 | 0.05841 | SIMILAR |
| schedule2 | ER5 | 0.31395 | 0.15698 | 0.84321 | SIMILAR |
| tcas | ER1 | 8.45899E-10 | 1 | 4.22949E-10 | BETTER |
| tcas | ER5 | 1.78619E-5 | 0.99999 | 8.93094E-6 | BETTER |
| tot_info | ER1 | 1.13619E-7 | 5.68095E-8 | 1 | WORSE |
| tot_info | ER5 | 0.01628 | 0.00814 | 0.99194 | WORSE |
| grep | ER1 | 0.15002 | 0.92505 | 0.07501 | SIMILAR |
| grep | ER5 | 5.38492E-17 | 1 | 2.69246E-17 | BETTER |

TABLE VII. MS VS. B-AT

| Program | Formula | 2-tailed | 1-tailed (lower) | 1-tailed (upper) | Conclusion |
|---|---|---|---|---|---|
| print_tokens | ER1 | 0.87567 | 0.56281 | 0.43783 | SIMILAR |
| print_tokens | ER5 | 0.23375 | 0.11688 | 0.88352 | SIMILAR |
| print_tokens2 | ER1 | 0.1051 | 0.94753 | 0.05255 | SIMILAR |
| print_tokens2 | ER5 | 0.15284 | 0.92369 | 0.07642 | SIMILAR |
| replace | ER1 | 3.75151E-11 | 1.87576E-11 | 1 | WORSE |
| replace | ER5 | 0.8446 | 0.4223 | 0.57799 | SIMILAR |
| schedule | ER1 | 0.20703 | 0.89664 | 0.10352 | SIMILAR |
| schedule | ER5 | 7.17551E-16 | 1 | 3.58775E-16 | BETTER |
| schedule2 | ER1 | 0.38618 | 0.8072 | 0.19309 | SIMILAR |
| schedule2 | ER5 | 0.65088 | 0.67485 | 0.32544 | SIMILAR |
| tcas | ER1 | 0 | 0 | 1 | WORSE |
| tcas | ER5 | 0.02414 | 0.01207 | 0.98794 | WORSE |
| tot_info | ER1 | 1.13619E-7 | 5.68095E-8 | 1 | WORSE |
| tot_info | ER5 | 0.01628 | 0.00814 | 0.99194 | WORSE |
| grep | ER1 | 0.15002 | 0.92505 | 0.07501 | SIMILAR |
| grep | ER5 | 3.10372E-17 | 1 | 1.55186E-17 | BETTER |

TABLE VIII. OVERALL STATISTICAL RESULTS

| Comparison | Formula | BETTER | WORSE | SIMILAR |
|---|---|---|---|---|
| MS vs. B-ST | ER1 | 1 | 3 | 4 |
| MS vs. B-ST | ER5 | 2 | 1 | 5 |
| MS vs. B-FT | ER1 | 1 | 2 | 5 |
| MS vs. B-FT | ER5 | 4 | 1 | 3 |
| MS vs. B-AT | ER1 | 0 | 3 | 5 |
| MS vs. B-AT | ER5 | 2 | 2 | 4 |

VI. CONCLUSION
Recently, a new fault localization technique, Backward-Slice-based Statistical Fault Localization (BSSFL), was proposed to improve one of the state-of-the-art techniques, namely Spectrum-based Fault Localization (SFL). Despite the promising result that BSSFL significantly outperforms SFL, BSSFL shares an assumption with SFL: both assume the existence of a test oracle to determine whether the result of a test case is a failure or a pass. In reality, this assumption does not hold in many applications, which makes both BSSFL and SFL infeasible when no test oracle exists. To alleviate the oracle problem, a new SFL technique leverages metamorphic testing to extend the conventional SFL techniques to applications without test oracles. Inspired by that technique, it is natural to investigate whether its solution can help BSSFL alleviate the oracle problem. Hence, in this paper, we propose a new fault localization approach that aims to extend the conventional BSSFL techniques to the areas which suffer from the oracle problem.
The proposed approach uses the framework of metamorphic testing to define a new concept, the metamorphic backward slice (mbslice), as the counterpart of the backward slice (bslice) used in the conventional BSSFL techniques. In addition, the test result of failure or pass is replaced by the metamorphic test result of violation or non-violation. With these redefined concepts, the proposed approach does not require the existence of a test oracle and significantly enhances the applicability of BSSFL. The experimental study chooses 8 subject programs of different sizes and 2 groups of maximal evaluation formulas to investigate the effectiveness of our approach. The results demonstrate that our approach with mbslice shows a performance comparable to that of the conventional BSSFL techniques with the assistance of test oracles. This shows that our approach has successfully extended the conventional BSSFL to situations where no test oracles exist; that is, test oracles are no longer mandatory for BSSFL.
In future work, we plan to conduct additional experiments across a much broader spectrum of programs, especially large programs with multiple faults. We will further study the applicability of our approach to different types of backward slices. Additionally, we will consider a further study on our approach and SFL.

ACKNOWLEDGMENT
This work is partially supported by the Australian Research Council under Grant No. DP120104773, the National Natural Science Foundation of China under Grant No. 91118007, the National High Technology Research and Development Program of China (863 program) under Grant No. 2011AA010106 and 2012AA011201, and the Program for New Century Excellent Talents in University.

REFERENCES
[1] J. A. Jones and M. J. Harrold, "Empirical evaluation of the tarantula automatic fault-localization technique," in Proceedings of the 20th International Conference on Automated Software Engineering (ASE), Long Beach, CA, USA, 2005, pp. 273-282.
[2] Y. Lei, X. Mao, Z. Dai, and C. Wang, "Effective statistical fault localization using program slices," in Proceedings of the 36th Annual International Computer Software and Applications Conference (COMPSAC), Izmir, Turkey, 2012, pp. 1-10.
[3] M. Weiser, "Program slicing," IEEE Transactions on Software Engineering, vol. 10, pp. 352-357, 1984.
[4] B. Korel and J. Laski, "Dynamic program slicing," Information Processing Letters, vol. 29, pp. 155-163, 1988.
[5] T. Gyimóthy, Á. Beszédes, and I. Forgács, "An efficient relevant slicing method for debugging," in Proceedings of the 7th European Software Engineering Conference held jointly with the 7th ACM SIGSOFT International Symposium on Foundations of Software Engineering (ESEC/FSE), Toulouse, France, 1999, pp. 303-321.
[6] T. Y. Chen, J. W. K. Ho, H. Liu, and X. Xie, "An innovative approach for testing bioinformatics programs using metamorphic testing," BMC Bioinformatics, vol. 10, no. 1, pp. 24-35, 2009.
[7] X. Xie, J. W. K. Ho, C. Murphy, G. Kaiser, B. Xu, and T. Y. Chen, "Application of metamorphic testing to supervised classifiers," in Proceedings of the 9th International Conference on Quality Software (QSIC), Jeju, Korea, 2009, pp. 135-144.
[8] X. Xie, W. E. Wong, T. Y. Chen, and B. Xu, "Metamorphic slice: an application in spectrum-based fault localization," Information and Software Technology, in press, 2013.
[9] R. Abreu, P. Zoeteweij, and A. J. C. van Gemund, "On the accuracy of spectrum-based fault localization," in Proceedings of Testing: Academic and Industrial Conference Practice and Research Techniques - MUTATION, Windsor, UK, 2007, pp. 89-98.
[10] A. Gotlieb and B. Botella, "Automated metamorphic testing," in Proceedings of the 27th Annual International Conference on Computer Software and Applications (COMPSAC), Dallas, TX, USA, 2003, pp. 34-40.
[11] W. Chan, S. Cheung, and K. Leung, "A metamorphic testing approach for online testing of service-oriented software applications," International Journal of Web Services Research, vol. 4, no. 2, pp. 61-81, 2007.
[12] C. Murphy, K. Shen, and G. Kaiser, "Using JML runtime assertion checking to automate metamorphic testing in applications without test oracles," in Proceedings of the 2nd International Conference on Software Testing, Verification and Validation (ICST), Denver, Colorado, USA, 2009, pp. 436-445.
[13] C. Murphy, K. Shen, and G. Kaiser, "Automatic system testing of programs without test oracles," in Proceedings of the 2009 International Symposium on Software Testing and Analysis (ISSTA), Chicago, IL, USA, 2009, pp. 189-200.
[14] C. Murphy, "Metamorphic testing techniques to detect defects in applications without test oracles," Ph.D. dissertation, Columbia University, 2010.
[15] T. Y. Chen, S. C. Cheung, and S. M. Yiu, "Metamorphic testing: a new approach for generating next test cases," Technical Report HKUST-CS98-01, Dept. of Computer Science, Hong Kong Univ. of Science and Technology, 1998.
[16] P. F. Hu, Z. Zhang, W. K. Chan, and T. H. Tse, "An empirical comparison between direct and indirect test result checking approaches," in Proceedings of the 3rd International Workshop on Software Quality Assurance, Portland, USA, 2006, pp. 6-13.
[17] Z. Y. Zhang, W. K. Chan, T. H. Tse, and P. F. Hu, "Experimental study to compare the use of metamorphic testing and assertion checking," Journal of Software, vol. 20, no. 10, pp. 2637-2654, 2009.
[18] T. Y. Chen, T. H. Tse, and Z. Q. Zhou, "Semi-proving: an integrated method for program proving, testing, and debugging," IEEE Transactions on Software Engineering (TSE), vol. 37, no. 1, pp. 109-125, 2011.
[19] SIR, http://sir.unl.edu/php/index.php.
[20] W. Dong, J. Wang, C. Zhao, X. Zhang, and J. Tian, "Automating software FMEA via formal analysis of dependence relations," in Proceedings of the 32nd Annual IEEE International Computer Software and Applications Conference (COMPSAC), Turku, Finland, 2008, pp. 490-491.
[21] G. W. Corder and D. I. Foreman, Nonparametric Statistics for Non-Statisticians: A Step-by-Step Approach. Wiley, 2009.
[22] N. DiGiuseppe and J. Jones, "On the influence of multiple faults on coverage-based fault localization," in Proceedings of the 2011 International Symposium on Software Testing and Analysis (ISSTA), Toronto, Canada, 2011, pp. 210-220.
[23] J. Jones, J. Bowring, and M. Harrold, "Debugging in parallel," in Proceedings of the 2007 International Symposium on Software Testing and Analysis (ISSTA), New York, NY, USA, 2007, pp. 16-26.
[24] X. Xie, T. Y. Chen, F-C. Kuo, and B. Xu, "A theoretical analysis of the risk evaluation formulas for spectrum-based fault localization," ACM Transactions on Software Engineering and Methodology (TOSEM), in press, 2013.
