2012 IEEE Sixth International Conference on Software Security and Reliability Companion
Bodhi: Detecting Buffer Overflows with a Game
Jie Chen 1,2 and Xiaoguang Mao 2
1 National Laboratory for Parallel and Distributed Processing, Changsha, P.R. China
2 School of Computer, National University of Defense Technology, Changsha, P.R. China
{jchen, xgmao}@nudt.edu.cn
Abstract—Buffer overflow is one of the most dangerous and common vulnerabilities in CPS software. Despite static and dynamic analysis, manual analysis is still heavily used; it is useful but costly. Human computation harnesses humans' time and energy through game play to solve computational problems. In this paper we propose a human computation method for detecting buffer overflows that does not ask a person directly whether there is a potential vulnerability, but instead asks what a randomly chosen partner would think. We implement this method as a game called Bodhi, in which each player is shown a code snippet and asked to decide whether their partner would think there is a buffer overflow vulnerability at a given position in the code. The purpose of the game is to make use of rich, distributed human resources to increase the effectiveness of manual buffer overflow detection. The game has proven to be efficient and enjoyable in practice.

Keywords-CPS; software vulnerability; buffer overflow; human computation; game

I. INTRODUCTION

Cyber-Physical Systems (CPS) are integrations of computation with physical processes [26]. CPS software is in charge of the interactions between the cyber and physical worlds, so software quality is a key element of the whole system. However, software errors are still inevitable, and buffer overflows are among the most dangerous, leading to serious security vulnerabilities. In the list of the 2011 CWE/SANS top 25 most dangerous software errors, "Buffer Copy without Checking Size of Input" ranks 3rd and "Incorrect Calculation of Buffer Size" ranks 20th [1]. This weakness can often be detected by dynamic execution with large sets of test inputs, but dynamic execution can rarely cover all possible execution paths and may slow the software down. Static analysis requires no real execution to detect buffer overflows, and it is relatively easy for static analysis to cover almost all execution scenarios; however, due to overly aggressive abstractions [2], static analysis often reports too many false alarms, which makes the final report very difficult for users to investigate. Manual analysis is the traditional and useful way of finding buffer overflows, but it may not achieve the desired code coverage within limited time constraints [3]. A number of automatic methods, such as dynamic and static analysis, have been proposed to avoid this tedious work. Unfortunately, to date this effort has had limited impact in industry, where buffer overflow detection remains largely manual. Rather than automating buffer overflow detection, the goal of this paper is to improve the effectiveness of the manual process.

Human computation is the idea of using human effort to perform tasks that computers cannot yet perform, usually in an enjoyable manner [27]. It is also known as "games with a purpose" [19], an approach that seeks to elicit useful information through entertaining games. Buffer overflow detection is a task that computers can attempt but that humans can often do better. We therefore aim to change the traditionally tedious manual process and use human computation to enlist the public's help. In this paper we introduce a game called Bodhi that gives two randomly chosen partners a code snippet with a potential buffer overflow and then asks them, "Does your partner think there is a program error at the highlighted part in the code?" If both partners choose the same button (yes or no), they both obtain points; otherwise neither of them receives points. It is similar to the online game Matchin [4], which is used to elicit user preferences. Though seemingly simple, the game has proven enjoyable and effective at making use of rich human resources for better manual buffer overflow detection.

The remainder of this paper is organized as follows. Section 2 shows an example in which humans can do better than state-of-the-art static analyzers. Section 3 presents our game mechanism, user interface and several implementation details. Section 4 describes the experiments and evaluations. Section 5 discusses related work, and Section 6 concludes the paper.

978-0-7695-4743-5/12 $26.00 © 2012 IEEE  DOI 10.1109/SERE-C.2012.35
II. PROBLEM
Automatic buffer overflow detection techniques have various limitations. Dynamic execution affects software performance and needs a large set of test inputs; a preliminary study shows that 44% of branches are not adequately covered by state-of-the-art automatic test generation techniques [25]. Static analysis can examine almost all paths without executing the program, but false alarms and missed detections are two major problems. As investigated by Kratkiewicz and Lippmann [6], PolySpace [9] and Splint [8] report a false alarm for line 6 of the simple example in Figure 1. If we rewrite line 2 as "if ((argc < 5) || (atoi(argv[1]) > 10))", then ARCHER [10], BOON [12] and UNO [11] miss the detection. Obviously, buffer overflow detection in examples like this one is not always difficult for humans, since humans can understand program semantics more flexibly.

int main(int argc, char *argv[])
{
1    char buf[10];
2    if ((argc < 5) || (atoi(argv[1]) > 9))
3    {
4        return 0;
5    }
6    buf[atoi(argv[1])] = 'A';
7    return 0;
}
Figure 1. Example code snippet

Though manual analysis is useful and widely used, it is tedious and time-consuming, and the desired code coverage is hard to achieve within limited time constraints. Human computation can help with this challenge: by playing a special game, many people are attracted to do the work in an enjoyable way. Players do not feel that they are doing buffer overflow detection work, only that they are competing. During play, we collect the data that players submit; finally, a bug report can be generated from the collected data.

III. GAME MECHANISM

A. System Description
1) System framework
There are four principles for developing human algorithm games [28]:
• The problem to be addressed must be clearly defined.
• The human interaction that performs the computation must be identified.
• Output verification must be incorporated into the interaction and design of the system.
• Principles of game design must be applied to increase the enjoyability of the interaction.

Figure 2. The framework of Bodhi (the server sends code to the players, the players submit their input, and the server returns an incentive or not)

As presented in Figure 2, we designed the Bodhi framework in client/server (C/S) mode, with one server and many players. The framework works as follows:
• The server shows the code to the players.
• The players choose their input and submit it to the server.
• Depending on the players' input, the server decides whether or not to give the players an incentive.
During this process, players submit their choices for fun and to earn more incentives, and the server records this information. In this way, Bodhi makes use of rich human resources to detect buffer overflows efficiently and enjoyably.

2) Game process
The main difficulty in creating games with a purpose is making them enjoyable [4]. Bodhi is a two-player game that can be deployed over the Internet. First, two players are matched randomly from all the players connected to the server. Bodhi does not tell the players who their partners are, and any communication between the players is forbidden; the only thing two partners have in common is a code snippet with a potential buffer overflow. They then play the game for five minutes at a time, and in each game they can play as many rounds as possible. In each round, the two players see the same code snippet and both are asked to answer the question. If they choose the same answer, they both receive points; otherwise, neither of them does.

The first version of the game interface is shown in Figure 3. We directly asked players "Is there a program error at the highlighted part in the code?", which was meant to make the players examine the code; they may or may not actually do so. However, directly asking for contributions is generally not considered a valid human algorithm game, since the process is not enjoyable for players [28]. Besides, for a given code snippet no one knows in advance whether there is a real error (here, a buffer overflow), so the game is unfair when players hold different opinions: because the buffer overflow must in fact be either present or absent, one player is right while the other is wrong, and under this condition giving no points to either is unfair to the right player.

Figure 3. The 1st version of Bodhi game

Usually, when a person suggests a solution for a problem to others, it means that the person would also handle the problem in the same way; however, the way one person thinks does not necessarily agree with the way another does. This is like the difference between the questions "What do you like?" and "What do you think others will like?" in [4]. Furthermore, asking people to consider what others think makes them reason more thoroughly, using more external information. Accordingly, we designed a second version of Bodhi, shown in Figure 4. Instead of asking directly, we ask the players "Does your partner think there is a program error at the highlighted part in the code?" Our judge still gives a pair of players no points when they disagree, but that is because they guessed the other's answer wrongly; it does not mean their own opinions about the program error are wrong.
Figure 4. The 2nd version of Bodhi game

In this new version, Bodhi does not ask the players whether there is an error in the code, but the players still have to examine the code snippet carefully if they want to score points, because their partners may find the error. Players can also click the pass button to opt out of difficult code; they then receive a message, and the game continues to the next round once their partner also clicks the pass button. Bodhi records all the clicks and the time players spend thinking. At the end of a game, the two players can review all the code snippets on which they did not score; if they agree with each other, choosing the same option for a snippet, they obtain some points.

B. Other Stories
1) Motivation
One of the challenges in any human computation system is finding a way to motivate people to participate [29]. Pay, altruism, enjoyment, reputation and implicit work are the most common motivating factors. Reputation is the most attractive element in making people like playing Bodhi: to most people, programming is intelligent work, so by displaying a scoreboard we motivate players with the chance to receive public recognition of their intelligence and abilities. To make the game more enjoyable, we designed Bodhi to give players more points when they reach agreements consecutively; as investigated in [4], an exponential or sigmoid scoring function can make the game more fun and attractive. Learning programming is also an important goal of the Bodhi design: by playing the game, players can learn novel programming skills from software written by different people all over the world.

2) Human skill
Human computation games for collecting common sense, such as the ESP game [5], usually require no special human skill. Combining human computation games with software source code analysis is not as easy, since we have not found a better representation than showing the source code directly. Bodhi therefore requires basic programming knowledge or ability; students who are learning programming and software engineers in companies are two potential user groups. Since the source code displayed in Bodhi is always code that state-of-the-art static analyzers cannot analyze well, the game helps factor out the problems that a computer could easily handle.

3) "Software error" versus "buffer overflow"
There are many kinds of software errors. If we tried to detect all of them, the players would be confused, since one code line may contain several errors and a player can hardly be familiar with every kind. In this paper we concentrate only on buffer overflows: buffer overflow is one of the most common and dangerous software vulnerabilities, and it is relatively easy for humans to detect. Moreover, we can use existing sound static analyzers to find all potential buffer overflow positions in the source code and thereby avoid wasting human effort. In the game interface we ask about a "program error" rather than a "buffer overflow", because we believe most people are more familiar with the former term.

4) Up-front analysis
Before showing source code to players, we first run a sound static analyzer based on abstract interpretation over the software project to detect buffer overflows. Because of its soundness, it does not miss any real buffer overflows, so we only need to consider the positions the analyzer reports; these become the targets of our game and are shown to the players. We believe that small code snippets are suitable for manual investigation, whereas large and complex programs would confuse players. Thus, if a potential buffer overflow reported by the static analyzer lies in a small code snippet, we show that snippet to the players directly; otherwise, we use program slicing to obtain smaller code. In this way, we decompose buffer overflow detection for a whole software project into detection problems over small code snippets; these snippets, which are independent of one another, are shown to the game players.

5) Aggregation
The probability of a buffer overflow comes from the agreements and disagreements of pairs of players. Agreement by a pair of independent players implies that the buffer overflow in the given snippet is probably real or probably absent; disagreement means the players hold different opinions. After collecting enough information through Bodhi, we generate the final bug report by combining all contributions. The probability is highest when all players agree on "YES"; such errors should appear at the beginning of the report so that users investigate them first. The probability is higher for snippets where more players agree on "YES" than for those where more agree on "NO", which in turn are more probable than those with no agreement; errors are reported in this descending order of probability. A buffer overflow is likely a false alarm when all players agree on "NO".

6) When is a code snippet "done"?
As a particular code snippet passes through Bodhi several times, it accumulates agreements and disagreements about whether it contains a buffer overflow. The question is at what point the collected information is sufficient. Our solution is a threshold: a code snippet must pass through Bodhi at least X times before we decide the probability of a buffer overflow in it, where X is the threshold. The threshold can be very low (X = 1: only one pair of players examines the snippet, which makes the process fast but less accurate) or very high (X = 100: one hundred pairs of players consider it, which makes the process slow but more accurate).
We can set the threshold according to the practical need for accuracy and the number of available players.

7) Cheating
To get high scores, players may adopt strategies to cheat. Bodhi takes several steps to prevent this:
• Players are matched randomly among all the players connected to the server, and players with the same IP or physical network address are never matched. Players therefore cannot partner with themselves, and the probability of playing with a friend is low.
• Borrowing from [5], the game server starts a game every 30 seconds, waiting for the 30-second boundary to pair newly logged-in players. This prevents cheating by logging in at the same time.
• Players do not know who their partners are, and no communication between them is allowed. The only thing a pair of partners has in common is a code snippet with a potential buffer overflow, so they cannot agree on a unified strategy in advance (e.g., "let's click the 'YES' button for all code snippets").
• The server inserts some code snippets for which it knows the correct answers, which makes some cheating impossible: if players submit wrong answers for these snippets, all of their answers are discarded. This also makes it impossible to play the game with a computer program or by clicking buttons randomly.
• The average time players take to make a decision is measured; a sharp decrease may indicate cheating, and such players' choices are reviewed or discarded.

8) Pre-recorded game play
Bodhi is a two-player game, yet there may be situations in which a player cannot find a partner. How can a single player play Bodhi? As in the ESP game [5], we designed a bot that uses pre-recorded game data (the button choices and time consumed); a single player can then be partnered with the bot. The bot emulates the actions of a human taken from previously recorded game sessions involving two people (not bots), mimicking one person per game. If the player reaches an agreement with the bot, this information is also useful for our buffer overflow detection.
IV. EXPERIMENTS AND EVALUATIONS

The current version of Bodhi is implemented in C++ and is played in C/S mode over a local area network; everyone can play the game by connecting a client to the game server. We use the test programs from a widely used benchmark presented by Kratkiewicz and Lippmann [6], which contains 291 small C programs that can be used to diagnostically determine the basic capabilities of static and dynamic buffer overflow detection tools. We asked 40 undergraduate volunteers with one year of C programming experience to take part in the experiment. In the following, we evaluate the effectiveness, efficiency and enjoyability of Bodhi. The results are very encouraging, though the scale of our experiments is small.

A. Effectiveness
We randomly asked 20 volunteers to play Bodhi, setting the threshold X = 5, which means a code snippet is finished after obtaining answers from 5 pairs of players.

TABLE I. DETECTION PERFORMANCE COMPARISON
Tool        Total Detections   Detection Rate
Archer      264                90.72%
Boon        2                  0.69%
Polyspace   290                99.66%
Splint      164                56.36%
Uno         151                51.89%
Bodhi       291                100%
Total       291                100%

Table I gives the detection performance comparison; the data for Archer, Boon, Polyspace, Splint and Uno are all from [6]. The first column gives the names of five well-known static analyzers and Bodhi; the second column gives the number of alarms each tool reported on the benchmark; the third column shows the corresponding detection rate. If one pair of players agrees that there is an error in the source code, Bodhi reports a buffer overflow at that position, and all alarms are listed in descending order of probability, determined by the number of agreeing pairs. The results show that Bodhi detects all the buffer overflows in the benchmark and is more effective than all the static analyzers listed.

TABLE II. FALSE ALARM COMPARISON
Tool        Total False Alarms   False Alarm Rate
Archer      0                    0.00%
Boon        0                    0.00%
Polyspace   7                    2.41%
Splint      35                   12.03%
Uno         0                    0.00%
Bodhi       0                    0.00%

Table II gives the false alarm comparison. Although Bodhi reports a buffer overflow as soon as one pair of players agrees that there is an error in the source code, it reported no false alarms; Bodhi thus achieves a false alarm rate as low as many state-of-the-art static analyzers. The reasons for the 100% detection rate and 0% false alarm rate are that all the benchmark programs are small and easy for humans to read, while some program properties, such as inter-procedural method calls, complex data structures, string operations and loops, are difficult for static detection tools to analyze.

B. Efficiency
During the experiment, we observed that many people were eager to get a high score. Since each game lasts the same amount of time, the only way to achieve this is to examine the code carefully and quickly. We asked another 20 volunteers to examine the code one by one and calculated the mean efficiency they achieved.

Figure 5. Efficiency comparison

From Figure 5, we can see that Bodhi players examined code more slowly than traditional manual analysts at the beginning, probably because they were not yet familiar with the game mechanism. The game soon improves the efficiency of buffer overflow detection, however, and the high efficiency also lasts longer, because examining code one by one soon becomes boring for a human.

C. Enjoyability
We asked the first group of 20 volunteers five questions:
• Do you have fun during the game?
• Does the game score attract you?
• Do the programming skills attract you?
• Will you play Bodhi again?
• Will you introduce Bodhi to your friends?
The results are shown in Table III.

TABLE III. ENJOYABILITY SURVEY
Question   Yes   No
1          19    1
2          18    2
3          10    10
4          20    0
5          17    3

As hoped, 95% said "yes" to question 1, which indicates that our game Bodhi is interesting. From questions 2 and 3, we learn that 90% feel attracted by the scoring system, while 50% think they learned some programming skills from the game. We therefore believe that the score mechanism of Bodhi works, and that the game could be even more enjoyable if we used challenging source code for players with a passion for programming technology. All the volunteers (100%) said they would play Bodhi again, and 85% would introduce Bodhi to their friends. These results are very encouraging.

V. RELATED WORK AND DISCUSSION

A. Automatic Detection of Buffer Overflows
Buffer overflow is one of the key factors influencing software security; NIST reports that more than 20% of software vulnerabilities are due to buffer overflows [7]. Researchers have long tried to automate the detection process. Many static analyzers are available for detecting buffer overflows [8,9,10,11,12]; it is relatively easy for static analyzers to cover almost all execution scenarios, since they require no real execution with real inputs. However, current static analysis tools have high false alarm rates and insufficient detection rates [13]. Compared to static analysis, dynamic approaches produce no false alarms: they generate run-time checks at compile time and then run the programs with different test inputs. They can rarely cover all possible paths, however, since it remains very difficult to generate test inputs that represent all possible execution paths. Many such tools are also available [14,15,16,17].
B. Human Computation
Human computation is a technique that makes use of human abilities to solve computational problems [18]. It aims to use human effort to perform tasks that computers are not good at but that are trivial for humans, usually in an enjoyable way, as in Games With A Purpose (GWAP) [19]. These games have become hugely successful, and many other domains [4,20,21,22,23,24] have since adopted the mechanism. For buffer overflow detection, both dynamic and static approaches have disadvantages, and manual analysis is still heavily used in industry. In the Bodhi project, we turn the tedious manual examination process into an enjoyable game that helps us obtain better results in an interesting and efficient way. GATE [25] is a game-based testing environment: it shows the method to the user and encourages players to complete sub-models of the test criterion, while maintaining high controllability for developers. Our Bodhi, by contrast, concentrates on buffer overflow detection and combines game and work in a natural way.

VI. CONCLUSION AND FUTURE WORK

We have presented a game called Bodhi that uses a human computation mechanism to detect buffer overflows in programs. People are willing to play the game and have fun doing so, and during play, useful data about buffer overflows are collected. In the end, we obtained better results than traditional manual analysis and state-of-the-art static analyzers. We have carried out several experiments to evaluate our game; though their scale is small, the results show that Bodhi is effective, efficient and enjoyable for buffer overflow detection. In the future, we plan to deploy the game on the Internet so that more people can play it and detect buffer overflows in more software.

ACKNOWLEDGMENT

We would like to thank Rui Wang, Pei Fan and Yali Pan for discussions, and Jiahong Jang and all the volunteers for help with the experiments. We also thank the anonymous referees for their helpful comments.

REFERENCES
[1] 2011 CWE/SANS Top 25 Most Dangerous Software Errors, http://cwe.mitre.org/top25/
[2] Kim, Y., Lee, J., Han, H., Choe, K.M., Filtering false alarms of buffer overflow analysis using SMT solvers. Inf. Softw. Technol. 52:2 (2010), pp. 210-219.
[3] CWE-120: Buffer Copy without Checking Size of Input ('Classic Buffer Overflow'), http://cwe.mitre.org/data/definitions/120.html
[4] Hacker, S. and von Ahn, L., Matchin: Eliciting user preferences with an online game. In Proc. of the SIGCHI Conf. on Human Factors in Computing Systems (Boston, April 4-9). ACM, New York, 2009.
[5] von Ahn, L. and Dabbish, L., Labeling images with a computer game. In Proc. SIGCHI Conf. on Human Factors in Computing Systems (Vienna, April 24-29). ACM, New York, 2004, pp. 319-326.
[6] K. Kratkiewicz, R. Lippmann, "A taxonomy of buffer overflows for evaluating static and dynamic software testing tools", in Proc. of the Workshop on Software Security Assurance Tools, Techniques, and Metrics, NIST Special Publication 500-265, National Institute of Standards and Technology, 2005, pp. 44-51.
[7] NIST, ICAT vulnerability statistics. http://icat.nist.gov/, Feb. 2005.
[8] D. Evans, D. Larochelle, Improving security using extensible lightweight static analysis, IEEE Software 19:1 (2002), pp. 42-51.
[9] Abstract interpretation. http://www.polyspace.com/downloads.htm, September 2001.
[10] Xie, Y., Chou, A., and Engler, D., ARCHER: Using symbolic, path-sensitive analysis to detect memory access errors. In Proc. of the 9th European Software Engineering Conference / 10th ACM SIGSOFT International Symposium on Foundations of Software Engineering, Helsinki, Finland, 2003, pp. 327-336.
[11] G. Holzmann, Static source code checking for user-defined properties. Pasadena, CA, USA, June 2002.
[12] D. Wagner, J. S. Foster, E. A. Brewer, and A. Aiken, A first step towards automated detection of buffer overrun vulnerabilities. In Network and Distributed System Security Symposium, San Diego, CA, February 2000, pp. 3-17.
[13] M. Zitser, Securing software: An evaluation of static source code analyzers. Master's thesis, Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, Aug. 2003.
[14] F. Bellard, TCC: Tiny C compiler. http://www.tinycc.org, Oct. 2003.
[15] G. C. Necula, S. McPeak, and W. Weimer, CCured: Type-safe retrofitting of legacy code. In Proc. of the Symposium on Principles of Programming Languages, 2002, pp. 128-139.
[16] Parasoft, Insure++: Automatic runtime error detection. http://www.parasoft.com, 2004.
[17] O. Ruwase and M. Lam, A practical dynamic buffer overflow detector. In Proc. of the Network and Distributed System Security Symposium, 2004, pp. 159-169.
[18] Yuen, M., Chen, L., and King, I., A survey of human computation systems. In Proc. CSE 2009.
[19] von Ahn, L., Games with a purpose. IEEE Computer 39:6 (2006), pp. 96-98.
[20] Lee, B. and von Ahn, L., Squigl: A Web game to generate datasets for object detection algorithms. In submission.
[21] Mityagin, A. and Chickering, M., PictureThis. http://club.live.com/Pages/Games/GameList.aspx?game=Picture_This
[22] Turnbull, D., Liu, R., Barrington, L., and Lanckriet, G., A game-based approach for collecting semantic annotations of music. In Proc. 8th Intl. Conf. on Music Information Retrieval (Vienna, September 23-27), 2007, pp. 535-538.
[23] Mandel, M. and Ellis, D., A Web-based game for collecting music metadata. Journal of New Music Research 37:2 (2009), pp. 151-165.
[24] Siorpaes, K. and Hepp, M., Games with a purpose for the semantic Web. IEEE Intelligent Systems, 2008, pp. 50-60.
[25] Chen, N., GATE: Game-based testing environment. In ICSE '11, May 21-28, 2011.
[26] E. A. Lee, Cyber-physical systems: are computing foundations adequate? Position paper for the NSF Workshop on Cyber-Physical Systems: Research Motivation, Techniques and Roadmap, 2006.
[27] L. von Ahn, Human computation. PhD thesis, December 2005.
[28] Law, E. and von Ahn, L., Input-agreement: a new mechanism for collecting data using human computation games. In CHI '09, 2009.
[29] A. Quinn and B. Bederson, Human computation: charting the growth of a burgeoning field. College Park, Maryland, USA, 2010.