Early Bug Detection in Deployed Software Using Support Vector Machine
Saeed Parsa, Somaye Arabi Nare, and Mojtaba Vahidi-Asl
Department of Computer Engineering, Iran University of Science and Technology, Tehran, Iran
[email protected], {Atarabi,
[email protected]}
Abstract. Software crashes can be disastrous and cause great economic damage, so the reliability and safety of software products can be critical. This paper presents a new mechanism to detect errors and prevent software crashes at run time. The novelty of the proposed technique is its use of the Support Vector Machine (SVM) method to detect bugs early, before they cause program crashes. By applying the SVM method, two clearly distinguishable patterns of failing and passing executions of the program are constructed in a relatively short amount of time, before the program is deployed. The vectors are constructed from the decision-making expressions, in other words predicates, appearing within the program text. After deployment, these patterns are used to estimate the probability of program failure from early symptoms, before the program crashes. Our experiments with bug prediction in the Siemens software demonstrate the ability of the proposed technique to predict errors before they can cause any damage.
Keywords: Software Debugging, Early Bug Detection, Deployed Software, Support Vector Machine, Predicate.
H. Sarbazi-Azad et al. (Eds.): CSICC 2008, CCIS 6, pp. 518–525, 2008. © Springer-Verlag Berlin Heidelberg 2008
1 Introduction
The more complicated the software, the more difficult it is to debug. This creates a need for automated software debugging tools [1]. One important area in this context is debugging deployed software [2]. Bug localization is hard in these systems because there is no mechanism to inform the user when the software behaves anomalously. The only sign of anomalous behavior is a crash or undesirable output [3]. Bugs in vital deployed software may have critical and even deadly outcomes. For instance, the battery failure of NASA's Mars Global Surveyor was the result of a series of events linked to a computer error made five months earlier [19], and at least five patients died in the Therac-25 radiation therapy machine accidents (1985-1987) because of a fault in the software control of a safety-critical system [20].
Current automated debugging techniques develop a profile of a program's execution through either static inspection or dynamic instrumentation [4]. A static analysis detects program bugs by checking the source code using a well-specified program
model (such as a control flow graph) [5]. A dynamic analysis usually tries to locate defects by contrasting the runtime behavior of correct and incorrect executions [6]. Dynamic techniques are based upon the analysis of predicates. Predicates are simple Boolean expressions at various program points, designed to capture potentially interesting program behaviors such as results of function calls, directions of branches, or values of variables [7, 8, 9]. To collect such information, extra code is inserted before each predicate within the program code. This process is called instrumentation [10]. During program execution, the number of times each predicate is observed to be true or false is counted. This information is analyzed later to find potential bugs. These techniques require the complete execution of the program, and hence cannot be used for early bug detection in deployed software.
One efficient approach to debugging deployed software is to employ models based upon learning algorithms. In these approaches, a learning model that represents program behavior is constructed before the software is deployed. This model (pattern) is used later to detect anomalous behaviors of the program and to inform the user about the existence of error-prone code [11, 12].
This paper presents a new machine-learning technique to detect anomalies dynamically while the program is executing. The technique employs a machine learning method called Support Vector Machine (SVM) [13] to build a model of a program's behavior from its passing and failing executions. It then uses the model to identify error-prone points. The distinguishing feature of our approach is that it detects anomalies dynamically during program execution and finds the location of a bug before the system crashes, in a relatively small amount of time.
The remainder of this paper is organized as follows. Section 2 discusses previous approaches to debugging deployed software. In Section 3, we introduce our proposed method for early bug detection. Section 4 evaluates the functionality of the proposed method in two case studies and includes experimental results. We conclude with final remarks and outline future work in Section 5.
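The predicate counting used by the dynamic techniques described in the introduction can be illustrated with a minimal sketch. The class name PredicateCounter, the site label "max_of:a>b" and the toy function max_of are hypothetical and chosen only for illustration; the sketch also wraps the predicate in a probe call rather than inserting code before it, which is a simplification.

```python
from collections import defaultdict


class PredicateCounter:
    """Hypothetical probe: counts how often each predicate is true or false."""

    def __init__(self):
        # Maps a predicate site label to its true/false observation counts.
        self.counts = defaultdict(lambda: {True: 0, False: 0})

    def observe(self, site, value):
        # Record the outcome, then return it so the branch behaves as before.
        value = bool(value)
        self.counts[site][value] += 1
        return value


def max_of(a, b, probe):
    # Original statement: `if a > b:`. The probe call is the instrumentation.
    if probe.observe("max_of:a>b", a > b):
        return a
    return b


probe = PredicateCounter()
for a, b in [(3, 1), (2, 5), (4, 4)]:
    max_of(a, b, probe)

# Per-site counts that a later analysis could contrast between
# passing and failing runs.
print(dict(probe.counts))
```

Because the counts are only meaningful once a run has finished, a scheme like this one illustrates why such techniques need the complete execution of the program.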
2 Related Work
Little work has been done on early bug detection in deployed software. Statistical debugging is one of the strongest dynamic methods within the software engineering field [4, 7, 8, 14]. It gathers information about program variables after the program is instrumented. The statistical data collected from a program execution are formulated into a report containing specific execution data and parameters; this is referred to as a test case. Statistical debugging techniques take a set of such test cases and apply an algorithm to determine which predicates are responsible for the program's failure. The applied algorithm generally uses a number of metrics that depend on correct and incorrect test cases [14]. Statistical debugging techniques depend upon information that is complete only after the program has executed, so they cannot be applied dynamically and cannot detect anomalous or failing behavior while the software is executing. Our technique tries to find bugs before they manifest and informs the user before the software crashes.
In [15] a completely different approach to anomaly detection in deployed software is proposed. The tool introduced in this approach is called Diduce, which is
inspired by Daikon [16]. The tool extracts properties of the program, called invariants, which are generated from passing test cases. Invariants are properties of a program that hold in all of its successful executions. The generated invariants can be used to find the locations of potential defects when they are violated in failing test cases. This technique tries to debug deployed software. Its drawback is the high overhead it imposes on the executing program, which is not desirable in deployed systems. Our proposed method imposes very low overhead on the executing software, because it uses simple vectors and because the SVM method that we employ performs classification quickly and precisely.
In [11] a collection of statistical data is gathered based on predictive properties of the program, such as branches, in order to understand program behavior and detect faults. This data is used to build behavior models by applying statistical machine learning techniques. The technique builds a Markov model for failing and passing executions of a program and then classifies them based on the result of the execution. Program behaviors can be predicted with this classifier. However, the technique is not suitable for deployed software products because it needs complete program executions to build the Markov model.
3 Anomaly Prediction
Early bug detection is performed in two main phases: training and deployment. As shown in Figure 1.a, the training phase consists of three steps. Figure 1.b shows the deployment phase. The details of these two phases are discussed in this section.
[Figure 1. (a) Training phase: Program P, Instrumentation, Execution, Learning, Program Model. (b) Deployment phase: Inputs, P, P*, Program Model, Error Report.]
Fig. 1. A view of our proposed approach: (a) Training phase (b) Deployment phase
3.1 Training Phase
In the training phase, a model is built from passing and failing executions of a given program. The training phase consists of three major steps: instrumentation, execution, and learning. In the instrumentation step, probes are inserted into a program P before the branch statements, or at the locations within the program code where the value of a predicate may change, to generate the instrumented program P*. During the execution step, P* is executed on the failing and passing test cases. For each instrumentation point, a separate predicate vector is built. These predicate vectors are the input to the learning step, in which a model representing the program behavior is built. The important steps of the training phase are execution and learning. These steps are described below.
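As a rough illustration of the learning step, the sketch below trains a linear soft-margin SVM on labeled predicate vectors by subgradient descent on the regularized hinge loss. This is an assumption made for illustration only: the source does not state which SVM formulation, kernel, or solver the authors use. The toy vectors, the 0/1 encoding of predicate values, the names train_linear_svm and classify, and the hyperparameters are all hypothetical.

```python
FAIL, PASS = 1, -1  # hypothetical label encoding for the two classes


def decision(w, b, x):
    """Signed score of predicate vector x under the linear model (w, b)."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def train_linear_svm(vectors, labels, lam=0.01, lr=0.1, iterations=500):
    """Full-batch subgradient descent on lam/2 * |w|^2 + mean hinge loss."""
    n, dim = len(vectors), len(vectors[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(iterations):
        grad_w = [lam * wi for wi in w]  # gradient of the regularizer
        grad_b = 0.0
        for x, y in zip(vectors, labels):
            if y * decision(w, b, x) < 1:  # sample violates the margin
                for j in range(dim):
                    grad_w[j] -= y * x[j] / n
                grad_b -= y / n
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def classify(w, b, x):
    """Assign a predicate vector to the Fail or Pass pattern."""
    return FAIL if decision(w, b, x) >= 0 else PASS


# Hypothetical training set: each vector holds the 0/1 values of three
# predicates at one instrumented point; in this toy data a run fails
# exactly when the first predicate is true.
vectors = [[1, 1, 0], [1, 0, 1], [1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 1]]
labels = [FAIL, FAIL, FAIL, PASS, PASS, PASS]

w, b = train_linear_svm(vectors, labels)
print([classify(w, b, x) for x in vectors])
```

A linear model keeps classification at deployment time to one dot product per vector, which fits the low-overhead goal stated in Section 2; a kernel SVM from an existing library would be a natural alternative.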
Execution. In the execution step of the training phase, the instrumented program is run several times on the failing and passing test cases. At each run, a separate vector is built for each instrumented point within the program. Within a loop construct such as For or While, a separate predicate vector, including the loop iteration number, is created at each iteration for every instrumented statement in the loop body. Each cell of the vector represents the value of a predicate at the time execution reaches the corresponding instrumented point. The vectors are classified into two groups, Fail and Pass, depending on whether the program fails or executes successfully. Based on the observation that errors most often originate at predicates [10], we decided to select some of the critical predicates as instrumented points. Here, the criticality of a predicate is calculated as the number of program branches rooted at the predicate. If the predicate at an instrumented point includes either ≤ or ≥, we split it into two predicates. For example, the predicate A≤B is converted into two predicates A