Predicting Vulnerable Software Components with Dependency Graphs ∗
Viet Hung Nguyen, University of Trento, Italy ([email protected])
Le Minh Sang Tran, University of Trento, Italy ([email protected])
ABSTRACT
Security metrics and vulnerability prediction for software have gained a lot of interest from the community. Many software security metrics have been proposed, e.g., complexity metrics and cohesion and coupling metrics. In this paper, we propose a novel code metric based on dependency graphs to predict vulnerable components. To validate the effectiveness of the proposed metric, we build a prediction model targeting the JavaScript Engine of Firefox. In this experiment, our prediction model obtains very good accuracy and recall rates. This empirical result is evidence that dependency graphs are a good option for early indication of vulnerabilities.
Categories and Subject Descriptors
H.4 [Information Systems Applications]: Miscellaneous; D.2.8 [Software Engineering]: Metrics—complexity measures, performance measures

General Terms
Security, Vulnerability, Prediction
1. INTRODUCTION
It is a fact that most software has security problems during its lifetime. This insecurity eventually comes from two causes: complexity and motivation [15]. As requirements continuously evolve and grow, software includes more and more functions while still guaranteeing backward compatibility. This increases the complexity of the software. Meanwhile, the software market is a market of "lemons" [1, 15]: first comers take better advantages and have a higher probability of success. Therefore, software vendors lack the motivation to put more testing effort into their products; instead, they rush to reduce the time to market.∗

∗ This research is partly supported by the project EU-FET-IP-SECURE CHANGE (www.securechange.eu).

A widely adopted trade-off is to focus only on the most vulnerable parts of the software. The rest is addressed later, in the maintenance phase, in which discovered vulnerabilities are fixed by periodic patches. For this purpose, software developers can employ security metrics to prioritize testing effort. The product manager, on the other hand, also needs security metrics to determine whether the company's products have reached an acceptable level of security and are ready for the market.

So far many software security metrics have been introduced, but they are inadequate. For example, estimating the complexity of a software system and its correlation with vulnerability could give an indicator of the security level of the software's components. However, software complexity itself is very difficult to measure. It can be seen from two perspectives: structural and semantic. Structural complexity is represented in the code structure; most existing complexity metrics, such as total lines of code and nesting levels, fall into this category. Semantic complexity is the complexity of the problems and algorithms the code implements, and measuring it is more difficult than measuring structural complexity. As a consequence, current complexity metrics cannot perfectly reflect the real complexity of a software system.

Contribution. We are interested in studying security metrics, in particular metrics for vulnerability prediction. In this work, we introduce a new prediction model using dependency graphs of a software system. These dependency graphs are based on the relationships among software elements (i.e., components, classes, functions, variables), which can be obtained from a static code analyzer (e.g., Doxygen) or from a detailed design specification. Therefore, the proposed model can be applied in both the design phase and the development phase. The proposed model is supported by an experiment on the JavaScript Engine of Firefox.

The rest of this paper is organized as follows. We briefly present the vulnerability prediction model in the next section (§2), where we discuss issues that a prediction model needs to address and how to evaluate a prediction model. Then we describe the definition of dependency graphs (§3) and our experiment (§4). Next we discuss the validity (§5) of our work, covering both internal and external factors that might impact the experiment result. Finally, we concisely describe studies in this area from the last ten years (§6) and conclude the paper with future work (§7).
2. VULNERABILITY PREDICTION MODEL

This section discusses common research issues and objectives of defect/vulnerability prediction (prediction, for short) papers. We also describe a common prediction framework as well as the way to evaluate prediction models.

The objective of prediction studies is to prioritize, explicitly or implicitly, testing effort. A good prediction model will indicate vulnerability-prone components for testing. This helps to reduce testing time and to provide a certain level of assurance that the remaining components are defect-free. To this purpose, there are two typical issues for a prediction model [22]:

• Classification: this issue concerns the question "Are you vulnerable/defective?"

• Ranking: in other words, it is the question "How likely are you to be vulnerable?" or "Which are the most vulnerable components?".

According to [2], there are different approaches to implementing a software fault prediction model, such as genetic programming, neural networks, case-based reasoning, fuzzy logic, Dempster-Shafer networks, statistics, and machine learning. [2] also points out the trends in modern vulnerability research. Figure 1 describes a common solution using machine learning techniques. The approach has two phases: training and predicting. In the training phase, the code base and the vulnerability data (traceable to the code base) are fed into a black box to produce the training dataset, which includes code base metrics and vulnerability data. This dataset is used to train the predictor (a.k.a. classifier). Table 1 describes some popular classification techniques. The second phase is the predicting phase, which classifies a new code module. The metrics for this module are also generated by the black box, and the trained classifier uses these values to determine whether the code module is vulnerable or not.

Figure 1: Basic setting for machine learning based prediction model.

Method | Description
Bayesian Network | A Bayesian Network based classifier
Random Forest | An ensemble of decision trees, each tree trained on a bootstrap sample of the given training dataset
Neural Network | A back propagation neural network
Logistic Regression | A linear logistic regression based classification
Support Vector Machine (SMO) | A sequential minimal optimization algorithm for support vector classification
Naive Bayes | A standard probabilistic Naive Bayes model

Table 1: Some common classification techniques

The outcome of a prediction model for classification can be true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN), as depicted in Figure 2.

Figure 2: Precision and recall explained (from [12]).

To evaluate the prediction model, we use Precision, Recall, F-measure and Accuracy. The Precision relates the number of true defective components to the number of components predicted as defective:

Precision = TP / (TP + FP)

The Recall defines the rate of true defective components in comparison to the total number of defective components:

Recall = TP / (TP + FN)

A good prediction model should obtain high Precision and Recall [14]. However, high recall often comes at the cost of low precision and vice versa. Therefore the F-measure is used to combine precision and recall as a harmonic mean:

F-measure = 2 × Recall × Precision / (Recall + Precision)

The Accuracy metric (Acc, or success rate) measures the overall accuracy of the prediction and is defined as follows:

Acc = (TP + TN) / (TP + TN + FP + FN)

Back to the main objective: in order to effectively prioritize testing effort, the recall value should be as high as possible, meaning that the model should recognize as many vulnerable components as possible before release time. However, there should be a trade-off between precision and recall; otherwise, a cheating model could always answer "vulnerable" for every source module and thus reach a recall rate of 100%.
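To make these four measures concrete, the following is a minimal sketch (our illustration, not an artifact of the paper; all helper names are hypothetical) that computes them from binary vulnerability labels.

# Minimal sketch: compute the evaluation measures defined above from labels.
def confusion_counts(actual, predicted, positive=True):
    """Count TP, FP, TN, FN for a binary 'vulnerable' label."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == positive and p == positive)
    fp = sum(1 for a, p in zip(actual, predicted) if a != positive and p == positive)
    tn = sum(1 for a, p in zip(actual, predicted) if a != positive and p != positive)
    fn = sum(1 for a, p in zip(actual, predicted) if a == positive and p != positive)
    return tp, fp, tn, fn

def evaluate(actual, predicted):
    tp, fp, tn, fn = confusion_counts(actual, predicted)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * recall * precision / (recall + precision)) if recall + precision else 0.0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"Precision": precision, "Recall": recall,
            "F-measure": f_measure, "Accuracy": accuracy}

# Example: a "cheating" model that flags every component as vulnerable
# reaches 100% recall but poor precision, as discussed above.
actual = [True, False, False, True, False]
cheater = [True, True, True, True, True]
print(evaluate(actual, cheater))   # recall = 1.0, precision = 0.4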
3. DEPENDENCY GRAPHS
The software code base is composed of many components, which are compiling units. A component, depending on the programming language, could be located in a single source file, e.g., a .java file in Java, or in multiple ones, e.g., a header file (.h) and a body file (.c, .cpp) in C/C++. We consider two files, a header file (e.g., .h, .hpp) and its implementation (e.g., .c, .cpp), as a single component. In our work, we assume that header and implementation files share the same name with different extensions.

From a static point of view, a component is the aggregation of data items and executable code. In procedural languages, the former are global variables (or shared variables) and the latter is procedures (or functions). Meanwhile, in object-oriented languages, data items are data members of classes and executable code is class methods. In the case of C++, which supports both procedural and object-oriented programming, data items are both shared variables and data members, and executable code is both functions and methods. Below, we formally present the definition of a software component.

Definition 1 (Component). A component is a pair ⟨D_C, M_C⟩, where D_C is a set of shared variables and data members, and M_C is a set of functions and methods.

A software system, from a static point of view, is a set of components and relations among members of these components. A relation is an ordered pair of members, in which data flows from the source (first member) to the target (second member). There are four types of relation: parameter passing, function return, data read, and data write. The first two relation types are between a method and a method, where the caller method calls and passes data as parameters to the callee, and might receive returned data from the callee. The last two types are between a method and a data item (data write), or a data item and a method (data read). The following presents a formal definition of a software system in this work.

Definition 2 (Software system). A software system is a quadruplet ⟨S_C, D_S, M_S, R⟩, where S_C is a set of components, D_S = ∪_{C∈S_C} D_C is a set of data items, M_S = ∪_{C∈S_C} M_C is a set of all functions, and R ⊆ (D_S ∪ M_S) × (D_S ∪ M_S) is a set of member relations.

Notably, there are two special cases. Firstly, we treat the call to a zero-parameter function as a null parameter-passing relation. Secondly, we treat all kinds of parameter passing as value-passing and ignore the side effects of other kinds of parameter passing, e.g., reference-passing and name-passing.
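As an illustration only (not from the paper), Definitions 1 and 2 map naturally onto a small data model; all names below are hypothetical.

# Illustrative sketch of Definitions 1 and 2; names are hypothetical.
from dataclasses import dataclass, field
from enum import Enum

class RelationKind(Enum):
    PARAMETER_PASSING = "parameter_passing"  # caller -> callee
    FUNCTION_RETURN = "function_return"      # callee -> caller
    DATA_READ = "data_read"                  # data item -> method
    DATA_WRITE = "data_write"                # method -> data item

@dataclass(frozen=True)
class Component:
    """A component <D_C, M_C>: shared variables/data members plus functions/methods."""
    name: str
    data_items: frozenset   # D_C
    members: frozenset      # M_C

@dataclass
class SoftwareSystem:
    """A software system <S_C, D_S, M_S, R>."""
    components: list = field(default_factory=list)
    relations: list = field(default_factory=list)  # [(src, dst, RelationKind), ...]

    @property
    def data_items(self):   # D_S: union of all D_C
        return set().union(*(c.data_items for c in self.components)) if self.components else set()

    @property
    def members(self):      # M_S: union of all M_C
        return set().union(*(c.members for c in self.components)) if self.components else set()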
3.1 Member Dependency Graph (MDG)
The member dependency graph (MDG) is a directed graph constructed from the data items and methods of a software system. The edges connecting two nodes fall into three categories: call-edges, return-edges and data-edges. The first two edge types correspond to the parameter-passing and function-return relations of methods (or functions). Meanwhile, a data-edge can be a data read or a data write, depending on the direction of the edge. Below, we present the formal definition of the MDG.

Definition 3 (Member dependency graph). The member dependency graph of a software system S = ⟨S_C, D_S, M_S, R⟩ is a directed graph G_MD = ⟨V_M^d, V_M^m, E_M^c, E_M^r, E_M^d⟩ where

• V_M^d ≡ D_S: is a set of data nodes,
• V_M^m ≡ M_S: is a set of function nodes,
• E_M^c ⊆ V_M^m × V_M^m: is a set of call-edges,
• E_M^r ⊆ V_M^m × V_M^m: is a set of return-edges,
• E_M^d ⊆ (V_M^d × V_M^m) ∪ (V_M^m × V_M^d): is a set of data-edges (data reads and data writes, respectively).

Figure 3: Member dependency graph (DC: Direct Call, RMI: Remote Method Invocation).

Figure 3 presents a sample member dependency graph of a software system. A round node represents a function, and a rectangular node represents a data item. A solid line represents either a call-edge or a data-edge, depending on the source and the target. A dashed line represents a return-edge. The label on each edge denotes the number of data items transferred and the communication means, which could be in-process or out-of-process, e.g., RPC, RMI, WS/HTTP.
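A minimal sketch (ours, with hypothetical names) of how the typed member relations of Definition 2 could be turned into MDG edge sets classified per Definition 3:

# Sketch: derive MDG edge sets from typed member relations (hypothetical names).
# Each relation is (source, target, kind), with kind one of
# "parameter_passing", "function_return", "data_read", "data_write".
def build_mdg(functions, data_items, relations):
    """Return the MDG as three edge sets: call-, return- and data-edges."""
    call_edges, return_edges, data_edges = set(), set(), set()
    for src, dst, kind in relations:
        if kind == "parameter_passing" and src in functions and dst in functions:
            call_edges.add((src, dst))
        elif kind == "function_return" and src in functions and dst in functions:
            return_edges.add((src, dst))
        elif kind == "data_read" and src in data_items:
            data_edges.add((src, dst))      # data item -> method
        elif kind == "data_write" and dst in data_items:
            data_edges.add((src, dst))      # method -> data item
    return {"call": call_edges, "return": return_edges, "data": data_edges}

# Example with two members of component A and B:
functions = {"A::f", "B::g"}
data_items = {"B::counter"}
relations = [("A::f", "B::g", "parameter_passing"),
             ("B::g", "A::f", "function_return"),
             ("A::f", "B::counter", "data_write")]
print(build_mdg(functions, data_items, relations))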
3.2 Component Dependency Graph

While the MDG focuses on the member level of a software system, the component dependency graph (CDG) focuses on the component level. For the sake of clarity, we use the terms member node and member connection to refer to the nodes and edges of the MDG, and component node and component connection to refer to those of the CDG. We also use the terms connection, edge and relation interchangeably throughout this work. For convenience, we use the notions V_M and E_M whenever all member nodes and all member edges of an MDG are mentioned, respectively. Obviously, V_M = V_M^m ∪ V_M^d and E_M = E_M^c ∪ E_M^r ∪ E_M^d.

Basically, the CDG is generated from the MDG by gathering all member nodes of the same component into a component node. Connections between members which belong to two components and have the same direction are collapsed into a single connection between these two components.

Definition 4 (Component Dependency Graph). Given a software system S = ⟨S_C, D_S, M_S, R⟩ and its member dependency graph G_MD = ⟨V_M^d, V_M^m, E_M^c, E_M^r, E_M^d⟩, the component dependency graph G_CD = ⟨V_C, E_C⟩ is constructed as follows: V_C ≡ S_C is a set of component nodes, and E_C ⊆ S_C × S_C is a set of component connections, with E_C = {⟨C_1, C_2⟩ | C_i ∈ S_C, ∃⟨m_1, m_2⟩ ∈ E_M . m_i ∈ C_i, i = 1, 2}.

Figure 4 illustrates an example of a CDG generated from the MDG depicted in Figure 3. In this example, the first character of a member node's name is its component name.

Figure 4: The component dependency graph generated from the member dependency graph in Figure 3.
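A minimal sketch (ours, hypothetical names) of the collapse step in Definition 4, mapping member edges onto component connections; we assume here that intra-component edges are counted as internal dataflow (Section 3.3) rather than as CDG self-loops.

# Sketch: collapse MDG member edges into CDG component connections (Definition 4).
def build_cdg(member_to_component, member_edges):
    """member_to_component: dict mapping a member to its component name.
    member_edges: iterable of (m1, m2) member connections (any edge kind).
    Returns the set of component connections E_C."""
    component_edges = set()
    for m1, m2 in member_edges:
        c1, c2 = member_to_component[m1], member_to_component[m2]
        if c1 != c2:   # assumption: intra-component edges become internal dataflow, not self-loops
            component_edges.add((c1, c2))
    return component_edges

# Example reusing the toy MDG from Section 3.1:
member_to_component = {"A::f": "A", "B::g": "B", "B::counter": "B"}
edges = [("A::f", "B::g"), ("B::g", "A::f"), ("A::f", "B::counter")]
print(build_cdg(member_to_component, edges))   # {('A', 'B'), ('B', 'A')}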
3.3 Predicting Vulnerable Components

To serve the prediction purpose, the elements of the MDG and CDG are enriched with extra information coming from a static code analyzer. The MDG and CDG thus become attributed graphs in which edges and nodes are associated with the attributes presented in Table 2. In this table, attributes of nodes and edges are divided into different groups. An attribute's short name is presented in the first column, the description of the attribute goes into the second column, and the last column describes how the attribute value is obtained: a blank means the value comes from a source code analyzer, otherwise it denotes the formula used to calculate the value from the others. Depending on the granularity of the vulnerability database, each node additionally has an attribute denoting the number of security bugs of the underlying code entity.

The predictor discussed in Section 2 is trained on the values of these attributes. In this study, we use two tuples, ⟨rin, rout, oir, eir⟩ and ⟨rin, rout, V-density, I-density, oir, eir⟩, in which rin/rout are the numbers of incoming and outgoing relations of a component from/to others, oir is the ratio between the average outgoing and incoming dataflow, eir is the ratio between the average external and internal dataflow, and V-density and I-density are, respectively, the density of the McCabe and interface complexity. On one hand, we use the first tuple to determine the prediction power of the CDG. On the other hand, we use the second one to see whether the prediction power can be improved (or even worsened). The following presents the formulas for these attributes of a component:

V-density = Σ_{m∈C} (McCabe complexity of m) / number of member nodes   (1)

I-density = Σ_{m∈C} (parameters of m + return points of m) / number of member nodes   (2)

oir = (Σ outgoing dataflow × rin) / (Σ incoming dataflow × rout)   (3)

eir = (Σ external dataflow × internal edges) / (Σ internal dataflow × (rout + rin))   (4)

Name | Description | Formula
Member Node Attributes
L | lines of code |
LoC | lines of comments |
LB | blank lines |
V | McCabe complexity |
P | the number of parameters in the method signature |
E | the number of return points in the method |
Member Edge Attributes
d | number of data items transferred in an edge | d(e⟨m1, m2⟩) = P(m2) if m2 ∈ V_M^m; 1 if m2 ∈ V_M^d
Component Node Attributes
n | the number of member nodes that build up the component node |
V-density | the McCabe density of a component C | V-density(C) = (1/n) Σ_{m∈C} V(m)
I-density | the interface complexity density of a component C | I-density(C) = (1/n) Σ_{m∈C} (P(m) + E(m))
ie | the number of edges between two arbitrary nodes of a component |
rin | the number of incoming connections to a component |
rout | the number of outgoing connections from a component |
idf | the internal dataflow of a component | idf(C) = Σ_{e∈INTERNAL(C)} d(e)
edfout | the outgoing dataflow of a component | edfout(C) = Σ_{e∈OUT(C)} d(e)
edfin | the incoming dataflow of a component | edfin(C) = Σ_{e∈IN(C)} d(e)
idf̄ | the average internal dataflow | idf̄(C) = idf/ie
edf̄out | the average outgoing dataflow | edf̄out(C) = edfout/rout
edf̄in | the average incoming dataflow | edf̄in(C) = edfin/rin
edf̄ | the average external dataflow | edf̄(C) = (edfout + edfin)/(rout + rin)
oir | the ratio between average outgoing and incoming dataflow | oir(C) = edf̄out/edf̄in
eir | the ratio between external and internal dataflow | eir(C) = edf̄/idf̄

Table 2: Attributes of the member node and component node.
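As an illustration only (ours, with hypothetical names), the component attributes used in the prediction tuples could be computed as follows. We assume that member edges already carry their d(e) dataflow values and, as a simplification, count rin/rout at the member-edge level; the paper's Table 2 may instead count collapsed component connections.

# Sketch: compute rin, rout, oir, eir, V-density and I-density for one component.
# member_edges: (m1, m2, dataflow) triples with d(e) already attached.
# mccabe, params, returns: per-function metric dicts (data nodes contribute 0).
def component_attributes(component, member_to_component, member_edges,
                         mccabe, params, returns):
    internal, incoming, outgoing = [], [], []
    for m1, m2, dflow in member_edges:
        c1, c2 = member_to_component[m1], member_to_component[m2]
        if c1 == component and c2 == component:
            internal.append(dflow)
        elif c1 == component:
            outgoing.append(dflow)
        elif c2 == component:
            incoming.append(dflow)

    members = [m for m, c in member_to_component.items() if c == component]
    n = len(members)  # assumes the component has at least one member node
    v_density = sum(mccabe.get(m, 0) for m in members) / n
    i_density = sum(params.get(m, 0) + returns.get(m, 0) for m in members) / n

    rin, rout, ie = len(incoming), len(outgoing), len(internal)
    idf, edf_out, edf_in = sum(internal), sum(outgoing), sum(incoming)
    avg_out = edf_out / rout if rout else 0.0
    avg_in = edf_in / rin if rin else 0.0
    avg_ext = (edf_out + edf_in) / (rout + rin) if rout + rin else 0.0
    avg_idf = idf / ie if ie else 0.0
    oir = avg_out / avg_in if avg_in else 0.0    # ratio of avg outgoing to incoming dataflow
    eir = avg_ext / avg_idf if avg_idf else 0.0  # ratio of external to internal dataflow
    return {"rin": rin, "rout": rout, "oir": oir, "eir": eir,
            "V-density": v_density, "I-density": i_density}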
4. EXPERIMENT AND DISCUSSION

In this section we present an experiment that classifies and ranks vulnerable components of the JavaScript Engine (JSE) used in Mozilla Firefox. We address the JSE for several reasons. First, JavaScript is one of the most attractive and vulnerable vectors by which browsers are attacked (see Figure 2 in [12]). Second, the JSE is part of a popular open source project, Mozilla, so the code base of the JSE is easy to access via the CVS server. Third, vulnerability data for the JSE is available both from the vendor advisories (MFSA and Bugzilla) and from public vulnerability databases (NVD, Bugtraq). Finally, prior papers by Shin and others [16–18] have analyzed the JSE, so we are able to compare our work to theirs.
Figure 5: Steps to generate CDG from code base by using Doxygen and RSM.
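As a rough illustration of the extraction step in Figure 5 (ours, not the paper's tooling), dependency information could be pulled from Doxygen's XML output. The element names below follow Doxygen's XML schema but should be treated as an assumption, and the file paths are hypothetical.

# Rough sketch: collect member-level dependencies from Doxygen XML output.
# Assumes doxygen was run with GENERATE_XML = YES (and REFERENCES_RELATION = YES),
# producing xml/*.xml files whose <memberdef> elements carry <references> children.
import glob
import xml.etree.ElementTree as ET

def collect_member_edges(xml_dir="doxygen/xml"):
    edges = []  # (referencing member, referenced member)
    for path in glob.glob(f"{xml_dir}/*.xml"):
        root = ET.parse(path).getroot()
        for member in root.iter("memberdef"):
            if member.get("kind") not in ("function", "variable"):
                continue
            name_el = member.find("name")
            source = name_el.text if name_el is not None else None
            for ref in member.findall("references"):
                if source and ref.text:
                    edges.append((source, ref.text))
    return edges

if __name__ == "__main__":
    for src, dst in collect_member_edges()[:10]:
        print(f"{src} -> {dst}")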
Classifier | v1.5 Re(%) | v1.5 Pr(%) | v1.5 FP rate(%) | v1.5 Acc(%) | v2.0 Re(%) | v2.0 Pr(%) | v2.0 FP rate(%) | v2.0 Acc(%)
Bayesian Network | 74.42 | 55.17 | 16.15 | 81.86 | 62.50 | 58.14 | 10.84 | 83.98
Naïve Bayes | 69.77 | 65.22 | 9.94 | 85.78 | 62.50 | 55.56 | 12.05 | 83.01
Neural Network | 48.84 | 72.41 | 4.97 | 85.29 | 55.00 | 81.48 | 3.01 | 88.83
Random Forest | 65.12 | 62.22 | 10.56 | 84.31 | 57.50 | 52.27 | 12.65 | 81.55
SMO | 39.53 | 85.00 | 1.86 | 85.78 | 62.50 | 55.56 | 12.05 | 83.01
Average | 59.53 | 68.01 | 8.70 | 84.61 | 60.00 | 60.60 | 10.12 | 84.08

Table 3: Classification results for the JSE of Firefox 1.5 and 2.0.
4.1 Experiment setup

For the code base, two snapshots of the JSE, in Firefox 1.5 and 2.0, are taken into account. The vulnerability data is obtained by combining several sources, as discussed in [3, 12, 18]. We obtain these data from the Vulnerability Database for Firefox [9] that we have built at the University of Trento. This database contains vulnerability information for Mozilla Firefox from version 1.0 to 3.5. The vulnerabilities of the JSE are determined by the components containing fixes for these vulnerabilities: if the component is located in the JSE source folder (i.e., mozilla/js/src), the correlated vulnerability is assumed to belong to the JSE. The granularity of the vulnerability data is at the component level.

Figure 5 summarizes the steps for generating the CDG from the code base of the JSE. Firstly, the code base is fed into Doxygen (http://www.doxygen.org), a source code documentation generator, and RSM (http://msquaredtechnologies.com/), a tool calculating standard source code metrics, to generate reports. Doxygen is configured to extract all private and static members of a component, and to generate dependency information for these members. The output of Doxygen is analyzed to extract the dependency information (including the dependency graph, variables and fields). This information is used to construct the MDG. On the other hand, the function metrics are extracted from the output of RSM and then embedded into the MDG. Finally, the nodes and edges of the MDG are synthesized to produce the CDG.

The constructed dataset of component node metrics is used to train and test the predictor. In this work, instead of implementing classification techniques from scratch, we employ WEKA, an open source data mining toolkit, which supports many learning techniques. We use 10-fold cross validation to validate the prediction models. During the validation, the whole dataset is randomly partitioned into 10 folds (segments). Each fold in turn is held out as the test data and a classifier is trained on the remaining nine folds. In this experiment, we use several classification techniques. Each classifier has its own recall, precision, F-measure and accuracy. We take the average of these values to estimate the quality of our proposed model. A sketch of this validation protocol is given below.
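The experiment itself uses WEKA; the following is a minimal sketch of the same 10-fold protocol in scikit-learn (our illustration, not the paper's setup; the CSV file and its column names are hypothetical).

# Sketch: 10-fold cross validation over the CDG attribute tuple <rin, rout, oir, eir>,
# mirroring the WEKA setup described above (illustrative stand-in, not the original tooling).
import pandas as pd
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Hypothetical dataset: one row per component, with a binary 'vulnerable' label.
data = pd.read_csv("jse_components.csv")
X = data[["rin", "rout", "oir", "eir"]]
y = data["vulnerable"]

classifiers = {
    "Naive Bayes": GaussianNB(),
    "Random Forest": RandomForestClassifier(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Neural Network": MLPClassifier(max_iter=2000),
    "SVM": SVC(),   # not WEKA's SMO implementation, used here only as a stand-in
}

for name, clf in classifiers.items():
    scores = cross_validate(clf, X, y, cv=10,
                            scoring=["precision", "recall", "accuracy"])
    print(f"{name}: P={scores['test_precision'].mean():.2f} "
          f"R={scores['test_recall'].mean():.2f} "
          f"Acc={scores['test_accuracy'].mean():.2f}")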
4.2 Result and Discussion

Table 3 shows the 10-fold cross validation results of the prediction models constructed from our dataset using the four attributes generated from the CDG, ⟨rin, rout, oir, eir⟩. The table reports the recall (column Re), the precision (column Pr), the false positive rate (column FP rate), and the accuracy (column Acc) of the different classification techniques supported by WEKA. We do not present the results of the prediction model using the tuple ⟨rin, rout, V-density, I-density, oir, eir⟩ because we obtained the same values for both tuples.

From this table, we can conclude that predicting vulnerable components using the CDG yields a good model in terms of accuracy, recall and false positive rate. The model attained a high accuracy (approximately 84%), meaning that about 84% of all components are classified correctly, and a fairly high recall rate (approximately 60%), meaning that the model recognizes up to 60% of the actually vulnerable components.

Table 4 displays, side by side, the results adapted from the work of Shin and Williams [17] and ours. The left half of the table is the experiment result discussed in [17], in which they used logistic regression to classify vulnerable methods using the nesting level metric. Their model has a very high accuracy (approximately 93% on average) and a very low FP rate (approximately 0.5% on average). It means that methods flagged as vulnerable by their model are likely to be actual ones. However, their model suffers from a very high FN rate (approximately 90%), and thus misses a large portion of vulnerabilities. The result of our model, in the right half, has significantly lower FN rates (we hit more vulnerabilities) at the price of a small increase in FP rates (we may erroneously flag one component out of ten). However, it is still too soon to conclude whether the prediction power of the CDG is better or worse than the nesting level metric, for the following reasons.

• The vulnerability data used in [17] are collected only from MFSA and Bugzilla, which only indicate the revision containing the fix of a component, not the vulnerable revisions. The assumption that the same vulnerability existed in all previous revisions might not be correct. Moreover, vulnerabilities reported in version 2.0 probably existed in version 1.5. Therefore, the quality of the training dataset used in [17] is not clear. We have discussed this phenomenon and a solution to address the problem in [9]. Our dataset, instead, is taken from [9], so we do not suffer from this issue.

• As discussed in [3], [17] also suffers from an imbalanced training set because they work at the function level. This might bias their classifier. We work at the component level, where the imbalance of the training set is smaller than at the function level.

Another interesting issue is the ranking problem discussed in Section 2. We use linear regression, which is also supported by WEKA, to estimate the vulnerability level of each component. The predicted results are tested against the actual number of vulnerabilities of each component by using the Spearman rank correlation (ρ). The correlation strength is judged by the range of ρ [4]: a value below 0.3 indicates a weak correlation, a value greater than 0.5 a strong one, and values in between a medium one. In this experiment, we obtain a ρ, computed with [19], of 0.58 (uncorrected) and 0.45 (corrected). This means that our vulnerability ranking has a good accuracy.
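A small sketch (ours) of this ranking check with SciPy, assuming hypothetical arrays of predicted vulnerability levels and actual vulnerability counts per component:

# Sketch: Spearman rank correlation between predicted vulnerability level
# and the actual number of vulnerabilities per component (hypothetical data).
from scipy.stats import spearmanr

predicted_level = [0.9, 0.1, 0.4, 0.7, 0.2, 0.6]   # regression output per component
actual_vulns = [5, 0, 1, 3, 0, 2]                  # known vulnerabilities per component

rho, p_value = spearmanr(predicted_level, actual_vulns)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
# Following [4]: rho < 0.3 is weak, 0.3-0.5 medium, > 0.5 strong correlation.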
          | Nesting level(*) v1.5 | Nesting level(*) v2.0 | CDG(**) v1.5 | CDG(**) v2.0
Accuracy  | 91.13% | 96.81% | 84.61% | 84.08%
FP rates  | 0.98%  | 0.06%  | 8.70%  | 10.12%
FN rates  | 89.93% | 95.08% | 40.47% | 40.00%

(*): values are taken from Shin et al.'s [17] report. (**): the average values of the techniques listed in Table 3.

Table 4: Comparison of the prediction power between the nesting level metric (Shin et al. [17]) and the CDG.

5. VALIDITY
In this section we discuss both internal and external threats to the validity of our work.

• Bugs in tools. The code base of the JSE is written in C/C++. In this language, preprocessor features such as macros and compile-time conditions are widely used, so parsing the source code is harder for third-party tools than for the C/C++ compiler itself. Therefore, tools like Doxygen and RSM might produce errors. For example, the existence of macros in a function declaration misleads Doxygen and RSM when parsing the function name and signature. This might cause the same function to be recognized differently by these tools.

• Bugs in data collection. The MDG is constructed by gathering the reports generated by Doxygen and RSM. The code that parses these reports and constructs the MDG might be buggy and produce errors. This threat, however, can be mitigated by manually checking the outcome on a small amount of data. Faulty code was fixed and checked again to make the program as bug-free as possible.

• Quality of the vulnerability database. We collect vulnerability information from [9]. The quality of this database is not clear at the moment. Our work heavily relies on this database; therefore its quality strongly influences the quality of our prediction model.
6. RELATED WORK

Nagappan and Ball [11] present a prediction model using code churn for system defect density. The experiment data come from the source version control log and the defect data of Windows 2003.

Neuhaus et al. [12] introduce new metrics called imports and function calls to construct a prediction model. Imports in programming languages (#include in C++, import in Java) allow developers to reuse services provided by other libraries. Import patterns indicate sets of imports that usually go together; similarly, function-call patterns are groups of functions called together. The experiment conducted on the entire Mozilla code base (13,111 files) reports a good precision (70%) and recall (45%). To further assess the model, the authors use a simple cost model. This model assumes that there are V vulnerabilities distributed arbitrarily among m components, and testers have a "testing budget" of T units. Spending 1 unit on a component will determine whether this component is vulnerable. Obviously, both V and T are less than m. If the testers randomly choose components for the test, the test efficiency is TV/m. With the guidance of a prediction model with precision p, the test efficiency is Tp. Therefore, the prediction model is qualified iff p > V/m. In practice, they estimate V by V', the number of known vulnerabilities. Applying this quality check, their prediction model has m = 10,452 and V' = 424, which shows that the model is more than fifteen times better than a random test strategy.
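A quick check of this cost-model argument with the figures quoted above (our illustration):

# Sketch: Neuhaus et al.'s cost-model check with the numbers quoted above.
m = 10_452       # components in the Mozilla code base, as reported in [12]
v_known = 424    # V': known vulnerabilities
p = 0.70         # reported precision of their predictor

random_efficiency = v_known / m   # expected hit rate per tested component, random choice
guided_efficiency = p             # expected hit rate per tested component, following the model
print(guided_efficiency / random_efficiency)   # about 17.3, i.e. "more than fifteen times better"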
Menzies et al. [10] claim that choosing attribute metrics is less significant than choosing how to use these metric values. In an experiment on the MDP datasets, they rank different metrics by their InformationGain values to select the metrics for the predictor; the ranking of a metric differs across projects. The accuracy of the predictor is evaluated by the probability of detection (pd) and the probability of false alarm (pf). However, Zhang et al. [20] point out that assessment using the Information Retrieval notions of precision and recall is better than pd and pf. Zhang et al. [21] replicate the work in [10] with a combination of three function-level complexity metrics to do the prediction.

Olague et al. [13] compare the prediction power of three different metric suites: CK, MOOD and QMOOD. Their experiment is conducted on six versions of Rhino, an open-source JavaScript implementation by Mozilla. The defect data are collected from Mozilla Bugzilla. The authors use logistic regression to perform the prediction for each metric suite. The results show that the CK suite is the superior prediction method for Rhino.

Zimmermann et al. [23] build a logistic regression model to predict post-release defects of Eclipse using several metrics at different levels of the code base, i.e., methods, classes, files and packages. The defect data are obtained by analyzing the log of the source version control system. This method is detailed in [24] and employed by [12]. The final dataset is available in the PROMISE repository. In another work, Zimmermann and Nagappan [22] exploit program dependencies as metrics for their predictor. However, they did not state clearly where the defect data used in their study come from.

Shin and Williams [16–18] raise the research question of the correlation between complexity and software security. To validate their hypotheses, the authors conduct an experiment on the JSE. They mine the code base of four JSE versions for complexity metric values. Meanwhile, faults and vulnerabilities for these versions are collected from MFSA and Bugzilla. Their prediction model is based on the nesting level metric and logistic regression. Although the overall accuracy is quite high, their experiment still misses a large portion of vulnerabilities.

Jiang et al. [8] study whether metrics or learning methods have the greater impact on the predictive power of machine learning based vulnerability discovery models. Their experiments are based on the MDP datasets, using many metrics belonging to three categories (code level, design level, and a combination of code and design level) as well as several machine learning methods. The results show that the metrics strongly impact the power of the models, while there is not much difference among learning methods. Also, the most powerful metrics are the combination of both code and design level metrics, and the design metrics are the most inferior ones.

Gegick et al. [5–7] use warnings of automatic source analysis tools (ASA), code churn and total lines of code to implement their prediction model. However, the conducted experiments are based on private defect data sources. Their studies are not reproducible, and hence are less convincing.

Chowdhury and Zulkernine [3] combine complexity, coupling and cohesion metrics for a vulnerability prediction model. These values are fed to a trained classifier to determine whether the source code is vulnerable. In their experiment, the authors used a vulnerability dataset for Firefox assembled from MFSA and Bugzilla.
7. CONCLUSION
In this work, we have proposed a new model to predict vulnerable components in order to prioritize testing effort. Our proposed model uses existing machine learning techniques based on a set of metrics generated from the component dependency graph (CDG) of a software system. The prediction quality of the model is backed up by evidence from the JSE of Mozilla Firefox 1.5 and 2.0. The high accuracy and recall reported from the experiment show the potential of using the CDG in the field of vulnerability prediction. Since we can construct the CDG of a software system at the design phase, our model might be used as an early predictor for vulnerabilities.

In the future, we will replicate our work using the nesting level metric to obtain a better comparison between this metric and ours. We will also conduct an experiment on the entire code base of several versions of Firefox. Finally, we plan to enrich the CDG with other metrics to improve the prediction quality.
8. ACKNOWLEDGEMENT
We would like to thank Professor Fabio Massacci (University of Trento, Italy) and Professor Mohammad Zulkernine (Queen's University, Canada) for many valuable comments and suggestions.
9. REFERENCES
[1] R. Anderson. Why information security is hard - an economic perspective. In Proc. of ACSAC'01, 2001.
[2] C. Catal and B. Diri. A systematic review of software fault prediction studies. Expert Sys. with App., 36(4):7346–7354, 2009.
[3] I. Chowdhury and M. Zulkernine. Using complexity, coupling, and cohesion metrics as early predictors of vul. J. of Soft. Arch., 2010.
[4] J. Cohen. Statistical Power Analysis for the Behavioral Sciences, 2nd ed. Hillsdale, NJ: Lawrence Erlbaum Associates, 1988.
[5] M. Gegick. Failure-prone components are also attack-prone components. In Proc. of the 23rd ACM SIGPLAN Conf. on Object-Oriented Prog., Sys., Lang., and Applications (OOPSLA'08), pages 917–918. ACM Press, 2008.
[6] M. Gegick and L. Williams. Ranking attack-prone components with a predictive model. In Proc. of ISSRE'08, pages 315–316, 2008.
[7] M. Gegick, L. Williams, J. Osborne, and M. Vouk. Prioritizing software security fortification through code-level metrics. In Proc. of QoP'08, pages 31–38. ACM, 2008.
[8] Y. Jiang, B. Cukic, T. Menzies, and N. Bartlow. Comparing design and code metrics for software quality prediction. In Proc. of PROMISE'08, pages 11–18. ACM, 2008.
[9] F. Massacci and V. H. Nguyen. Which is the right source for vulnerability studies? An empirical analysis on Mozilla Firefox. In Proc. of MetriSec'10, 2010.
[10] T. Menzies, J. Greenwald, and A. Frank. Data mining static code attributes to learn defect predictors. TSE, 33(9):2–13, 2007.
[11] N. Nagappan and T. Ball. Use of relative code churn measures to predict system defect density. In Proc. of ICSE'05, pages 284–292, 2005.
[12] S. Neuhaus, T. Zimmermann, C. Holler, and A. Zeller. Predicting vulnerable software components. In Proc. of CCS'07, pages 529–540, October 2007.
[13] H. M. Olague, S. Gholston, and S. Quattlebaum. Empirical validation of three software metrics suites to predict fault-proneness of object-oriented classes developed using highly iterative or agile software development processes. TSE, 33(6):402–419, 2007.
[14] T. J. Ostrand and E. J. Weyuker. How to measure success of fault prediction models. In Proc. of SOQUA'07, pages 25–30, New York, NY, USA, 2007. ACM.
[15] A. Ozment. Vulnerability Discovery and Software Security. PhD thesis, University of Cambridge, Cambridge, UK, 2007.
[16] Y. Shin. Exploring complexity metrics as indicators of software vulnerability. In Proc. of the Int. Doctoral Symp. on Empirical Soft. Eng. (IDoESE'08).
[17] Y. Shin and L. Williams. An empirical model to predict security vulnerabilities using code complexity metrics. In Proc. of ESEM'08, 2008.
[18] Y. Shin and L. Williams. Is complexity really the enemy of software security? In Proc. of QoP'08, pages 47–50, 2008.
[19] P. Wessa. Free Statistics Software, Office for Research Development and Education, version 1.1.23-r6, 2010. http://www.wessa.net/.
[20] H. Zhang and X. Zhang. Comments on "data mining static code attributes to learn defect predictors". TSE, 33(9):635–637, 2007.
[21] H. Zhang, X. Zhang, and M. Gu. Predicting defective software components from code complexity measures. In Proc. of PRDC'07, pages 93–96, 2007.
[22] T. Zimmermann and N. Nagappan. Predicting defects with program dependencies. In Proc. of ESEM'09, 2009.
[23] T. Zimmermann, R. Premraj, and A. Zeller. Predicting defects for Eclipse. In Proc. of PROMISE'07, page 9. IEEE Computer Society, 2007.
[24] T. Zimmermann and P. Weißgerber. Preprocessing CVS data for fine-grained analysis. In Proc. of the 1st Int. Working Conf. on Mining Soft. Repo. (MSR'04), pages 2–6, May 2004.