ISSN 0146-4116, Automatic Control and Computer Sciences, 2012, Vol. 46, No. 7, pp. 338–344. © Allerton Press, Inc., 2012. Original Russian Text © M.I. Glukhikh, V.M. Itsykson, V.A. Tsesko, 2011, published in Modelirovanie i Analiz Informatsionnykh Sistem, 2011, No. 4, pp. 68–79.
Using Dependencies to Improve Precision of Code Analysis¹
M. I. Glukhikh, V. M. Itsykson, and V. A. Tsesko
St. Petersburg State Polytechnical University, Russia
e-mail: [email protected], [email protected], [email protected], http://kspt.ftk.spbstu.ru
Received September 15, 2011
Abstract: This paper considers the development of dependency analysis methods that improve the precision of static code analysis. The reasons for precision loss when detecting defects in program source code with abstract interpretation methods are explained. The need for extraction and interpretation of dependencies between program objects is justified by a number of real-world examples. A classification of dependencies is presented. The necessity of analyzing values and dependencies together is considered. Dependency extraction from assignment statements is described. Dependency interpretation based on logical inference using logic and arithmetic rules is proposed. The proposed methods are implemented in the defect detection tool Digitek Aegis, and a significant increase in precision is shown.
Keywords: static analysis, abstract interpretation, dependency analysis, program defect detection
DOI: 10.3103/S0146411612070097
1. INTRODUCTION

Software quality assurance is receiving much attention nowadays. According to existing research, popular open-source software (Linux, Apache, MySQL, etc.) contains 0.1–0.5 errors per 1000 lines of source code. Sometimes faults lead to disasters resulting in serious losses [1].

Software quality may be significantly improved through static defect detection. Static analysis methods can examine different properties of a program without running it. They use program source code, formal specifications and other design-stage artifacts to determine properties of the whole program rather than of a single execution trace. Both functional and nonfunctional program errors may be detected this way.

Nonfunctional errors (also known as program defects) are usually caused by software complexity or by violation of program contracts. These errors include memory leaks, buffer overruns, null pointer dereferences, uses of uninitialized objects, etc. Apart from testing, static analysis methods may be used to detect such errors [2].

Defect detection algorithms may be compared using three interdependent properties: computational complexity, soundness and precision. Analysis soundness is the ratio of true program defects detected to all the defects existing in the program. Analysis precision is the ratio of true defects detected to all the detected defects.

Sound static analysis algorithms seem to be the most interesting ones: only sound algorithms make it possible to detect all the defects in a program, whereas unsound algorithms may leave some defects undetected. However, sound static analysis algorithms pose the difficult problem of ensuring affordable computational complexity and high analysis precision at the same time. Usually, high-performance algorithms are not very precise. Utilizing dependencies between program objects seems to be a promising approach to this problem [5, 6, 16].
A dependency is a proposition constraining the values of several variables, such as "f(x, y) is true". Dependencies are used to refine results and to eliminate false positives when the current program state combines several states from different execution traces and the analyzed objects therefore have multiple values.

The main aim of this paper is to define dependency analysis methods that improve the precision of abstract interpretation of programs for defect detection.

The rest of the paper is organized as follows. Section 2 contains a brief review of related work. A set of examples justifying extraction and interpretation of dependencies for improving the quality of abstract interpretation is presented in Section 3. Section 4 contains a classification of dependencies. Dependency implementation is considered in Section 5. Further development of the described approach is presented in Section 6. Experimental evaluation of the dependency implementation is discussed in Section 7.

¹ The article was translated by the author.

2. RELATED WORKS

Static analysis methods are of strong research interest. The most popular static analysis method used for defect detection is abstract interpretation [3, 4]. Abstract interpretation is similar to ordinary interpretation, but it operates on values from an abstract semantic domain chosen for a specific task rather than on ordinary variable values (numbers, pointers, etc.). The domain for nonfunctional defect detection includes interval values [12, 14], object addresses [15, 20, 21] and object dependency sets.

The dependency analysis research field is backed by the foundational work of Cousot and Halbwachs [6]. There, dependency extraction is based on abstract interpretation, and linear statements of the form $x_j = a_1 x_1 + a_2 x_2 + \dots + a_n x_n$ are considered. A closed convex polyhedral model in $\mathbb{R}^n$, where n is the number of program variables, is used to approximate the constraints on program variables.

Nesov and Malikov [5] propose improving the precision of defect detection by using enhanced interval analysis with inequality systems based on linear dependencies y(x) = kx + b.

Bush et al. [16] consider a program state representation based on values and value dependencies. This approach assumes that some variables have precise values. If it is impossible to determine a precise value of a variable, the value may be restricted by various constraints such as a < 5, a > b, a = c + 3, etc. Thus precise values are complemented by variable interdependencies. This approach sounds promising, but it needs to be extended to more general dependency types.

Wang et al. [17] propose proving the correctness of Java programs with the help of the HOL theorem prover [18] and the WHY verification platform [19]. JML annotations are used to extract source data.
The VCC tool [22] implements a similar approach to prove the correctness of multithreaded C programs. The major disadvantage of such an approach is the need for extensive annotations, which is infeasible for large programs.

There are some production tools for defect detection, for instance, Microsoft Code Analyzer [7], Frama-C [8], Splint [9], Fortify [10], etc. Experimental evaluation of some of these tools has shown that they do not provide the soundness property [11].

The results presented in this paper are obtained using the Digitek Aegis tool [13]. This tool is aimed at detecting the most frequent defect classes, providing soundness at the cost of a possible loss of precision.

3. LOW PRECISION OF ABSTRACT INTERPRETATION

Defect detection based on abstract interpretation that provides both soundness and high precision is characterized by high computational complexity. The main cause is the need to analyze all the possible execution traces and to store the values of all the reachable objects in these traces.

The large number of program execution traces is caused by branch statements: each branch potentially doubles the number of traces. Medium or large programs may therefore contain an astronomical number of execution traces. Besides, the number of reachable objects usually increases significantly during the analysis of an execution trace. This leads to computational complexity (in both time and memory) that is exponential in the size of the analyzed program, which makes it infeasible to analyze even small programs precisely.

To reduce the number of analyzed execution traces, program states can be merged, with subsequent analysis of the merged traces [12]. This approach, also known as path merging, provides linear analysis complexity when the analyzed program does not contain loops. On the other hand, variables in different execution traces may have different values. As a result of program state merging, multiple values of a variable appear, for example, x ∈ [5…10], ptr ∈ {0, &arr}, etc.
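The effect of state merging on variable values can be illustrated with a small sketch. It is only an illustration in Python under simplifying assumptions, not the representation used in Aegis: the `Interval` class and the `merge_states` function are hypothetical names introduced here, and each variable is modelled by a closed integer interval.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Interval:
    """Closed integer interval [lo, hi] (hypothetical helper type)."""
    lo: int
    hi: int

    def join(self, other: "Interval") -> "Interval":
        # Smallest interval that covers both operands.
        return Interval(min(self.lo, other.lo), max(self.hi, other.hi))

    def values(self) -> range:
        return range(self.lo, self.hi + 1)


def merge_states(a: dict, b: dict) -> dict:
    """Merge two abstract states variable by variable (path merging)."""
    return {name: a[name].join(b[name]) for name in a.keys() & b.keys()}


# Two execution traces that reach the same program point.
then_state = {"size": Interval(5, 5), "n": Interval(5, 5)}
else_state = {"size": Interval(10, 10), "n": Interval(10, 10)}

merged = merge_states(then_state, else_state)

# Concrete (size, n) pairs admitted by the merged abstract state.
admitted = set(product(merged["size"].values(), merged["n"].values()))
# Concrete (size, n) pairs that can actually occur at run time.
real = {(5, 5), (10, 10)}

print(merged["size"], merged["n"])
print(len(admitted), "admitted pairs,", len(real), "real pairs")
print("n == size in every admitted pair:", all(s == n for s, n in admitted))
```

In both original traces n equals size, but the merged state also admits pairs such as (5, 10), so the relation between the two variables is no longer visible to the analysis.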
When multiple values are manipulated, dependency information is lost, which decreases analysis precision. Let us consider some examples.

3.1. Dynamic Array Size

    int * ptr = malloc(size * sizeof(int));
    for (int i = 0; i < size; i++)
        ptr[i] = i;

Let size ∈ [5…10]. In this case abstract interpretation of the loop requires 10 iterations. On the last iteration i is 9, which leads to detection of an out-of-bounds defect whenever the array size is less than 10. To eliminate this false positive, we need to take into account the dependency between size and the allocated memory size, and also the dependency i < size.

3.2. Noninitialized Element Count

    int arr[MAXSIZE], size = 0;
    while (!in.eof() && size < MAXSIZE)
        in >> arr[size++];
    for (int i = 0; i < size; i++) cout
5) or y = x + 3
• Merge dependency, constructed when two branches merge, for instance, if (…) ptr = malloc(…) else ptr = 0.

There are three ways to use the extracted dependencies:
• When interpreting conditions: in this case logical inference starts with the assertion "cond is TRUE" or "cond is FALSE", after which corollaries are inferred using the extracted dependencies.
• When detecting defects: in this case the validity of some assertion usually must be checked, for instance, 0 ≤ index < sizeof(array) when detecting an "out of bounds" defect.
• When interpreting assignments: in this case logical inference starts with an assignment such as a = b – c and, using the dependencies known about b and c (if any exist), infers some knowledge about the value of a.

5. DEPENDENCY IMPLEMENTATION

Dependency implementation includes program state representation, dependency extraction and dependency interpretation methods.

5.1. Program State Representation

Abstract interpretation deals with a program state containing program object values from an abstract semantic domain [2, 3]. For dependency analysis it is reasonable to use a predicate domain, that is, logical assertions describing the program state and program object values. The predicate domain includes all the possible predicates about program objects. A program state is a subset of the domain that includes only the predicates which are true at a specific instant of program execution. These assertions may be divided into two groups:
1. Assertions about object values:
   - precise: obj = const, obj1 points to obj2
   - multiple: Comp(obj; op; const), obj points to one of (obj1, …, objN)
2. Assertions about dependencies between objects (see Section 4).

5.2. Dependency Extraction

Dependencies are extracted during interpretation of program statements.

Assignment statements produce assignment dependencies. The same applies to the indirect assignments a = *ptr and *ptr = a if it is known precisely which object ptr points to. An algorithm for interpreting *ptr = a may be the following:

    Set predicates; // predicates: dependencies and values

    function extract(DereferenceAssignment da(ptr, a)):
        for each (p points to obj) in predicates:
            if (p == ptr):
                predicates.add(Comp(obj, =, a))

When the value of a program object changes or the object goes out of scope, all the associated dependencies are reset. If other dependencies are linked through this object, their superposition is inferred first. For instance, if a = f(x) and x = g(b), then the dependency a = f(g(b)) is produced when x dies.

When dynamic memory is allocated, a dependency for the size of the memory region should be produced. For instance, during analysis of the statement ptr = malloc(size) the dependencies "ptr points to dynamic object" and "sizeof(dynamic object) = size" are constructed. Similar rules may be specified for other special objects (for instance, strings).

When a merge of execution traces is interpreted, the program state union C = A ∪ B is performed. In practice this dependency union may require too many resources because it produces dependencies of the form "D1 or D2". Therefore the transformation A ∪ B = (A ∩ B) ∪ (A\B) ∪ (B\A) is applied: the intersection A ∩ B is calculated and the union (A\B) ∪ (B\A) is ignored. Precise rules for dependency union are yet to be developed.
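The extraction rules of this subsection can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the Aegis implementation: predicates are modelled as plain tuples, and the tuple layouts, the function names and the object name `dyn1` are all hypothetical.

```python
# Predicate kinds; the tuple layouts are assumptions made for this sketch.
POINTS_TO = "points_to"  # (POINTS_TO, pointer, object)
COMP = "comp"            # (COMP, object, operator, operand)
SIZE_OF = "sizeof"       # (SIZE_OF, object, size expression)


def extract_deref_assignment(predicates: set, ptr: str, a: str) -> None:
    """Interpret `*ptr = a`: every precisely known target of ptr gets value a."""
    # Iterate over a snapshot because the set is extended inside the loop.
    for pred in list(predicates):
        if pred[0] == POINTS_TO and pred[1] == ptr:
            predicates.add((COMP, pred[2], "=", a))


def extract_malloc(predicates: set, ptr: str, size: str, obj: str) -> None:
    """Interpret `ptr = malloc(size)`; obj names the new dynamic object."""
    predicates.add((POINTS_TO, ptr, obj))
    predicates.add((SIZE_OF, obj, size))


def merge(a: set, b: set) -> set:
    """Merge two states, keeping only predicates that hold on both traces."""
    return a & b


state = set()
extract_malloc(state, "ptr", "size", "dyn1")   # ptr = malloc(size)
state.add((POINTS_TO, "q", "x"))               # q is known to point to x
extract_deref_assignment(state, "q", "a")      # *q = a

# Two branches that differ only in the dependency recorded for y.
then_branch = state | {(COMP, "y", "=", "x + 3")}
else_branch = state | {(COMP, "y", "=", "0")}
merged = merge(then_branch, else_branch)

print(sorted(merged))
```

After the merge, the predicates shared by both branches survive, while the two conflicting assertions about y are dropped instead of being combined into a disjunction, which mirrors the intersection-only rule described above.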
5.3. Dependency Interpretation

During the dependency interpretation phase, logical inference is performed using the existing assertions. Logical inference is based on well-known theorem proving rules, which may be subdivided into the following groups:
• General logical inference rules: Modus Ponens, substitution
• Logic algebra rules
• Integer arithmetic rules

Dependency interpretation in the Aegis tool for condition expressions in conditional statements may be described by the following algorithm. Here x is a condition, and v selects the true branch (the inference starts from "x is TRUE") or the false branch (the inference starts from "x is FALSE"):

    function interpret(Condition x, boolean v):
        if v:
            for each Logic(f, x1, op, x2) in predicates:
                if x == f && op == and:
                    useModusPonens(x1, true);
                    useModusPonens(x2, true);
        else:
            for each Logic(f, x1, op, x2) in predicates:
                if x == f && op == or:
                    useModusPonens(x1, false);
                    useModusPonens(x2, false);
        useModusPonens(x, v);
        expandDeps();

    function useModusPonens(Condition x, boolean v):
        for each Equiv(f, dep) in predicates:
            if x == f:
                predicates.add(v ? dep : dep.invert()) // Modus Ponens

The interpret function applies the rules of logic algebra to interpret the dependencies Logic(x; x1; and; x2) and Logic(x; x1; or; x2). The useModusPonens function uses the Modus Ponens rule and the Equiv(x; D) dependencies to extract the dependencies that are true in the corresponding branch. The expandDeps function expands the list of known dependencies using substitution and integer arithmetic rules.

6. FURTHER DEVELOPMENT

We plan to use existing logical inference tools for dependency interpretation. These tools may be subdivided into three major groups:

Theorem provers. The HOL prover is an example [18]. Usually these tools implement higher-order logic. Without human intervention they are able to prove only the simplest statements; in more complex cases a user has to guide the prover. Besides, these provers have low performance and lack a convenient programming interface.
SMT solvers. Microsoft Z3 Solver is an example [23]. These tools are aimed at solving first-order logic problems. Usually they include powerful methods for solving systems of linear equations and inequalities. Many SMT solvers use the unified problem description language SMT-LIB [24], which allows one solver to be replaced by another.

Logic programming languages. The most famous example is the Prolog language. This language uses a subset of first-order logic, namely Horn clauses. An advantage of this approach is the simplicity of integration into any software, but it may be not powerful enough for solving complex tasks.

Analysis of the existing dependencies and inference rules shows that the major tasks arising in static analysis may be described using first-order logic. Therefore the most promising approach is to use SMT solvers, because theorem provers are too complex and logic programming languages are not powerful enough.
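As an illustration of how such a task might be phrased for an SMT solver, the sketch below builds an SMT-LIB 2 script for the array example of Section 3.1 and, since the value ranges are small, also checks the same question by exhaustive search. It is an assumption-laden sketch, not the formalization used in Aegis: the function names and the auxiliary variable `elems` (the number of allocated elements) are hypothetical.

```python
def bounds_check_script(lo: int, hi: int) -> str:
    """Build an SMT-LIB 2 script asking whether ptr[i] can be out of bounds."""
    return "\n".join([
        "(set-logic QF_LIA)",
        "(declare-const size Int)",
        "(declare-const i Int)",
        "(declare-const elems Int)  ; number of allocated elements",
        f"(assert (and (>= size {lo}) (<= size {hi})))  ; value of size",
        "(assert (= elems size))  ; dependency produced by malloc",
        "(assert (and (>= i 0) (< i size)))  ; loop condition is TRUE",
        "(assert (not (and (>= i 0) (< i elems))))  ; negated safety check",
        "(check-sat)",
    ])


def violation_exists(lo: int, hi: int, use_dependency: bool) -> bool:
    """Exhaustively search for an out-of-bounds access ptr[i]."""
    for size in range(lo, hi + 1):
        # With the dependency the element count equals size; without it,
        # any value from the merged interval [lo, hi] is admitted.
        elems_choices = [size] if use_dependency else range(lo, hi + 1)
        for elems in elems_choices:
            for i in range(size):            # loop condition: 0 <= i < size
                if not (0 <= i < elems):     # safety check fails
                    return True
    return False


print(bounds_check_script(5, 10))
print("violation with dependency:", violation_exists(5, 10, True))
print("violation without dependency:", violation_exists(5, 10, False))
```

The script asserts the negated safety condition, so an answer of unsat from a solver would mean that no out-of-bounds access is possible. The exhaustive search agrees: with the dependency no violation is found, while without it the spurious case i = 9 with a 5-element array appears.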
Table 1. Results of static analysis of open-source projects

                         Dependencies OFF           Dependencies ON
Project     Code size,   False       Analysis       False        Analysis
            KLOC         positives   time, s        positives    time, s
base64      10           55          105            24 (-56%)    136 (+29%)
ping        2.5          24          19             15 (-37%)    22 (+15%)
bwnng       5            63          35             41 (-34%)    43 (+22%)
heme        2.5          18          7              8 (-56%)     12 (+71%)
rhapsody    19           8           218            2 (-75%)     228 (+4%)
resetxlog   4            9           37             1 (-88%)     45 (+21%)
Total       43           177         421            91 (-48%)    486 (+15%)
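The totals and relative changes in the table can be recomputed from the raw counts. The short Python check below is only a reader's aid; the names `ROWS` and `change_percent` are introduced here for illustration, and the rounding convention of the percentages is an assumption, so computed and reported values are compared with a tolerance of one percentage point.

```python
# (project, false positives OFF, time OFF, false positives ON, time ON,
#  reported change in false positives (%), reported change in time (%))
ROWS = [
    ("base64", 55, 105, 24, 136, -56, 29),
    ("ping", 24, 19, 15, 22, -37, 15),
    ("bwnng", 63, 35, 41, 43, -34, 22),
    ("heme", 18, 7, 8, 12, -56, 71),
    ("rhapsody", 8, 218, 2, 228, -75, 4),
    ("resetxlog", 9, 37, 1, 45, -88, 21),
]


def change_percent(before: float, after: float) -> float:
    """Relative change from before to after, in percent."""
    return 100.0 * (after - before) / before


# Column totals: false positives OFF, time OFF, false positives ON, time ON.
fp_off, t_off, fp_on, t_on = (sum(row[k] for row in ROWS) for k in range(1, 5))

for name, f0, t0, f1, t1, rep_fp, rep_t in ROWS:
    print(f"{name:10s} {change_percent(f0, f1):+7.1f}% (reported {rep_fp:+d}%)"
          f" {change_percent(t0, t1):+7.1f}% (reported {rep_t:+d}%)")

print("totals:", fp_off, t_off, fp_on, t_on)
print(f"false positives: {change_percent(fp_off, fp_on):+.1f}%")
print(f"analysis time:   {change_percent(t_off, t_on):+.1f}%")
```

The column sums reproduce the Total row (177 and 91 false positives, 421 and 486 seconds), and every recomputed percentage lies within one point of the value printed in the table.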
7. EXPERIMENTAL EVALUATION

The methods described in Section 5 have been implemented in the Digitek Aegis tool, and several open-source utilities have been analyzed using this tool. Experimental results are shown in Table 1. base64 and ping come from a Linux distribution, pg_resetxlog is a part of the PostgreSQL server, and the other projects can be downloaded from http://www.sourceforge.net.

Implementing the dependency analysis yielded a significant gain in overall analysis precision: the number of false positives decreased by 48%. Analysis time increased moderately, by 15% overall. The average false positive ratio is about 2 false defects per 1 KLOC when analyzing programs of about 10 KLOC.

8. CONCLUSION

The presented research demonstrates that extraction and interpretation of data dependencies is one of the most important aspects of code analysis. There are several directions for improving the suggested approach:
• development of more precise dependency merging rules;
• use of automated theorem proving methods when interpreting dependencies.

At the moment, the rules for statement interpretation, simplification, inference and defect detection are being formalized in the SMT-LIB [24] language.

This scientific work was subsidized by the Saint Petersburg Committee for Science and Higher Education.

REFERENCES

1. Zhivich, M. and Cunningham, R., The Real Cost of Software Errors, IEEE Security and Privacy, IEEE Comput. Soc., 2009, vol. 7, no. 2, pp. 87–90.
2. Nielson, F., Nielson, H.R., and Hankin, C., Principles of Program Analysis, Springer-Verlag, 2005.
3. Cousot, P., Abstract Interpretation, ACM Comput. Surveys, 1996, vol. 28, no. 2, pp. 324–328.
4. Jones, N. and Nielson, F., Abstract Interpretation: A Semantic-Based Tool for Program Analysis, in Handbook of Logic in Computer Science, vol. 4: Semantic Modeling, Oxford: Oxford University Press, 1995.
5. Nesov, V.S. and Malikov, O.R., Using the Linear Dependencies for Vulnerability Detection in Program Source Code, Trudy Inst. Sist. Progr. Ross. Akad. Nauk, 2006, no. 9, pp. 51–57.
6. Cousot, P. and Halbwachs, N., Automatic Discovery of Linear Restraints Among Variables of a Program, Proc. 5th ACM SIGACT-SIGPLAN Symp. on Principles of Programming Languages (POPL'78), New York, 1978, pp. 84–96.
7. Code Analysis for C/C++. Overview. http://msdn.microsoft.com/enus/library/d3bbz7tz.aspx
8. Frama-C Software Analyzers. http://framac.com
9. Splint Home Page. http://www.splint.org
10. Fortify Software. http://www.fortify.com
11. Itsykson, V.M., Moiseev, M.Yu., Tsesko, V.A., and Karpenko, A.V., Research on Tools for Automation of Defects Detection in Program Source Code, Sci. J. St. Petersburg State Polytechnical Univ. Informat. Telecommun., 2008, no. 5(65), pp. 119–127.
12. Schwartzbach, M., Lecture Notes on Static Analysis, Aarhus, 2000.
13. Aegis – A Defect Detection System. http://www.digiteklabs.ru/en/aegis/platform/
14. Itsykson, V.M., Moiseev, M.Yu., Tsesko, V.A., Zakharov, A.V., and Akhin, M.Kh., Interval Analysis Algorithms for Source Code Defect Detection, Inf. Control Syst., 2009, no. 2(39), pp. 34–41.
15. Itsykson, V.M., Moiseev, M.Yu., Akhin, M.Kh., Zakharov, A.V., and Tsesko, V.A., Points-to Analysis Algorithms for Source Code Defect Detection, in Sb. st. "Sistemnoe programmirovanie" (Coll. Papers "System Programming"), Terekhov, A.N. and Bulychev, D.Yu., Eds., St. Petersburg: St. Petersburg Univ., 2009, no. 4, pp. 5–30.
16. Bush, W., Pincus, J., and Sielaff, D., A Static Analyzer for Finding Dynamic Programming Errors, Softw. Pract. Exper., 2000, vol. 30, pp. 795–802.
17. Wang, A., Fei, H., Gu, M., and Song, X., Verifying Java Programs by Theorem Prover HOL, Proc. 30th Ann. Int. Computer Software and Applications Conf., Washington, 2006.
18. HOL 4 Kananaskis. http://hol.sourceforge.net
19. WHY – A Software Verification Platform. http://why.lri.fr
20. Steensgaard, B., Points-to Analysis in Almost Linear Time, Proc. 23rd ACM SIGPLAN-SIGACT Symp. on Principles of Programming Languages, New York, 1996.
21. Avots, D., Dalton, M., Livshits, V., and Lam, M., Improving Software Security with a C Pointer Analysis, Proc. 27th Int. Conf. on Software Engineering, New York, 2005, pp. 139–142.
22. VCC. http://vcc.codeplex.com
23. Z3. http://research.microsoft.com/enus/um/Redmond/projects/z3
24. SMT-LIB. http://www.smtlib.org