Wuhan Univ. J. Nat. Sci. 2008, Vol.13 No.5, 615-620 Article ID 1007-1202(2008)05-0615-06 DOI 10.1007/s11859-008-0521-6
Static Extracting Method of Software Intended Behavior Based on API Functions Invoking

PENG Guojun, PAN Xuanchen, FU Jianming, ZHANG Huanguo
School of Computer, Wuhan University, Wuhan 430072, Hubei, China

Abstract: Extracting and precisely describing the intended behavior of software has become a key issue in the dynamic and trusted authentication of software behavior. In this paper, we propose a method for statically extracting SIBDS (software intended behaviors describing sets) from a binary executable based on the software’s API function invocations, and define in detail the structure used to store the SIBDS. Experimental results demonstrate that the extraction method and the storage structure offer three properties: (i) they describe the software’s intended behavior accurately; (ii) they require little storage; (iii) they provide a strong capability to defend against mimicry attacks.
Key words: API functions invoking; software intended behavior; trusted behavior
CLC number: TP 309

Received date: 2008-03-20
Foundation item: Supported by the National Natural Science Foundation of China (60673071, 60743003, 90718005, 90718006) and the National High Technology Research and Development Program of China (863 Program) (2006AA01Z442, 2007AA01Z411)
Biography: PENG Guojun (1979-), male, Ph. D. candidate, Lecturer, research direction: trusted software, network security. E-mail: guojpeng@whu.edu.cn
† To whom correspondence should be addressed. E-mail: [email protected]

0 Introduction
How to dynamically carry out trusted authentication of software behavior is an unavoidable hard problem in the field of trusted computing. Solving it requires a comprehensive and deep understanding of the inconsistency between the expected behavior of software and its actual behavior. In other words, accurately describing software behavior and precisely defining the storage structure of that behavior are two significant prerequisites for dynamic trusted authentication of software behavior. In this paper, we introduce a technique that uses static analysis to extract the software’s intended behavior from the Win32 API functions invoked by executable files.
Static analysis is a typical technique in the fields of anti-virus and reverse engineering[1-4]. It disassembles the target binary executable and yields an easy-to-read assembly version of the code, on which analysis of the software’s various properties can conveniently be carried out. Considerable research results based on this technique have also appeared in the fields of software testing, vulnerability mining and intrusion detection[5-10], and many researchers have worked on intrusion detection using sequences of Win32 API calls or system calls[6-14].
Currently, there is much research based on static analysis of binary files, and the main methods are as follows:
① Statically analyze the IAT (import address table) of PE (portable executable) format executable files to obtain the program’s main API function invocations, and statistically analyze large numbers of programs of the same category (e.g., viruses) to obtain the statistical properties of API function invocations for that kind of program.
② Statically analyze malicious software to construct signatures of malicious actions, according to which malicious software is detected, as in the heuristic scanning technology used in the antivirus field.
③ Use static disassembly to obtain the normal short sequences of API invocations and build a normal behavior database, according to which anomalies are detected and the user’s software is protected.
In this paper, we propose a method for statically extracting SIBDS (software intended behaviors describing sets) from a binary executable based on the software’s API function invocations, and define in detail the structure used to store the SIBDS.
1 Extracting the Software Intended Behavior
1.1 Component of the Software Intended Behaviors Describing Sets
We study software intended behavior based on the invoking sequences of the Win32 API and formally define the concept of SIBDS below.
Definition 1 (SIBDS) SIBDS (software intended behaviors describing sets) is an anticipated behavior set which describes the normal API invoking sequences and their interrelations. It is composed of the API module set (AMS) and the API modules relationship set (AMRS), say SIBDS = <AMS, AMRS>.
Definition 2 (AM) AM (API functions invoking module) refers to an API functions module which has a relatively stable and independent sequence of API function invocations when the software runs normally. AM is made up of SIPS (sequential instruction pieces sets) and SIPRS (sequential instruction pieces relationship sets). The detailed definitions of SIPS and SIPRS are presented in the following section.
The hierarchical relationship among the data structures of SIBDS is as follows:
SIBDS (software intended behaviors describing sets)
|-- AMRS (API modules relationship set)
|-- AMS (API module set)
    |-- AM (API functions invoking module)
        |-- SIPRS (sequential instruction pieces relationship sets)
        |-- SIPS (sequential instruction pieces sets)
            |-- SIP (sequential instruction pieces)
                |-- APIFC (API functions calling)
                |-- SDFC (self-defined functions calling)
1.2 Building Method and Optimization Algorithm of the API Invoking Flow Graph
As defined in Definition 1, SIBDS is composed of AMS and AMRS. Usually, AM is extracted from the main process at the highest module level. After the AMs are properly extracted, we can draw the flow graph according to the jumping relationships among the modules. For example, Fig.1 is the flow graph drawn from one of our software examples (Mi is an AM). We can then carry out further extraction and refinement, and finally describe the software behavior completely.
Fig.1 Call graph of AM in the main process
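The hierarchy of Definitions 1 and 2 can be sketched as a set of nested record types. The following Python sketch is only an illustration: the class and field names (`dll`, `name`, `label`, `calls`, and so on) are our own assumptions and do not appear in the paper, and edges are represented simply as pairs of labels.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union


@dataclass
class APIFC:
    """API function call (APIFC); the dll/name fields are illustrative."""
    dll: str
    name: str


@dataclass
class SDFC:
    """Self-defined function call (SDFC)."""
    name: str


@dataclass
class SIP:
    """Sequential instruction piece: an ordered run of calls."""
    label: int
    calls: List[Union[APIFC, SDFC]] = field(default_factory=list)


@dataclass
class AM:
    """API functions invoking module: a SIPS plus a SIPRS."""
    name: str
    sips: List[SIP] = field(default_factory=list)               # SIPS
    siprs: List[Tuple[int, int]] = field(default_factory=list)  # SIPRS: (from, to)


@dataclass
class SIBDS:
    """SIBDS = <AMS, AMRS>."""
    ams: List[AM] = field(default_factory=list)                 # AMS
    amrs: List[Tuple[str, str]] = field(default_factory=list)   # AMRS: (from, to)


# A tiny hypothetical instance: one module with two SIPs and one edge.
m1 = AM("M1",
        sips=[SIP(1, [APIFC("kernel32.dll", "CreateFileA")]),
              SIP(2, [SDFC("MouseLD")])],
        siprs=[(1, 2)])
sibds = SIBDS(ams=[m1], amrs=[])
```

Keeping the relationship sets as plain edge lists mirrors the way the paper separates the pieces (SIPS, AMS) from the relations between them (SIPRS, AMRS).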
To further obtain the behavior characteristics of an AM, we examine the assembly code of the AM and insert a “module bookmark” (BM) at the following places: ① the start address of the current function; ② the address that a jump instruction jumps to; ③ the address of the instruction following a jump instruction; ④ the address of a return instruction.
Definition 3 (SIP) SIP (sequential instruction piece) refers to the block of sequential instructions between two consecutive module bookmarks (BM).
After inserting module bookmarks into a piece of assembly code, we can divide the code into several SIPs, and all of these SIPs form the SIPS. Then, we give each SIP a unique label to identify it, record the jumping relationships between different SIPs (the SIPRS) using directed edges (each directed edge represents a jumping relation in the call graph), and draw the flow graph according to the SIPS and SIPRS. The result is the call graph of an AM. The call graphs of other AMs can be drawn in the same way. The drawing process is shown in Fig.2.
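The four bookmark rules and Definition 3 can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the `Insn` record, the mnemonic tests, and the sample addresses are hypothetical, and every jump target is assumed to lie inside the function being divided.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Insn:
    """Hypothetical, simplified instruction record."""
    addr: int
    mnem: str                     # e.g. "call", "jmp", "jz", "loop", "ret", "mov"
    target: Optional[int] = None  # jump destination, if any


def is_jump(ins: Insn) -> bool:
    return ins.mnem == "loop" or ins.mnem.startswith("j")


def split_into_sips(code: List[Insn]) -> Tuple[List[List[Insn]], List[Tuple[int, int]]]:
    """Divide one function into SIPs and return (SIPS, SIPRS edges)."""
    marks = {code[0].addr}                    # rule 1: function start
    for k, ins in enumerate(code):
        if is_jump(ins):
            marks.add(ins.target)             # rule 2: jump destination
            if k + 1 < len(code):
                marks.add(code[k + 1].addr)   # rule 3: instruction after a jump
        elif ins.mnem == "ret":
            marks.add(ins.addr)               # rule 4: return instruction

    sips: List[List[Insn]] = []
    for ins in code:
        if ins.addr in marks:                 # a bookmark opens a new SIP
            sips.append([])
        sips[-1].append(ins)

    # Labels are assigned in address order, starting from 1.
    label_of = {sip[0].addr: n for n, sip in enumerate(sips, start=1)}
    edges: List[Tuple[int, int]] = []
    for n, sip in enumerate(sips, start=1):
        last = sip[-1]
        if last.mnem == "ret":
            continue                          # ending SIP: out-degree 0
        if is_jump(last):
            edges.append((n, label_of[last.target]))
        if last.mnem != "jmp" and n < len(sips):
            edges.append((n, n + 1))          # fall-through edge
    return sips, edges


CODE = [
    Insn(0x1000, "call"),
    Insn(0x1005, "jz", 0x1010),
    Insn(0x1007, "call"),
    Insn(0x100C, "jmp", 0x1015),
    Insn(0x1010, "call"),
    Insn(0x1015, "ret"),
]
sips, edges = split_into_sips(CODE)
```

For the sample `CODE`, the function yields four SIPs; the conditional jump gives the first SIP an out-degree of two, the unconditional jump gives the second an out-degree of one, and the SIP holding `ret` has no outgoing edge.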
Fig.2 Process of dividing assembly code into SIPS and drawing the corresponding call graph
In the process of extracting the sequence of API invocations, we focus on the function invoking instructions (for self-defined functions and API functions) and all kinds of jumping instructions (e.g., JMP, JXX, LOOP, RET and so on), neglecting all other instructions. In this way, the graph we finally obtain is an API invoking relationship graph. The optimization principles for turning the call graph into an API invoking relationship graph are as follows:
① Inter-SIP optimization
First, we retain the assembly instructions related to the program’s function, namely the instructions that invoke an API or a self-defined function, and delete the other instructions, such as data operation instructions and instructions generated by the compiler.
Example 1 As shown in Fig.3, only two API invocations need to be retained; we delete the other assembly instructions and thereby complete the optimization of this SIP. In particular, if a SIP contains no API invocation, we turn it into a NULL SIP.
Fig.3 Inter-SIP optimization process of Example 1
If the SIPS of a self-defined function has a single SIP, we can replace the invocation of the self-defined function with the content of its SIP, in order to reduce the number of AMs we need to define and store.
Example 2 In this example, there is a self-defined function MouseLD in the user’s program, but its SIPS has only one SIP with no jumping relations, and the content of that SIP is as follows:
Call ds:__imp__mouse_event@20
Then, we can perform the optimization shown in Fig.4.
Fig.4 Inter-SIP optimization process of Example 2
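The two inter-SIP steps (dropping non-call instructions, then inlining a single-SIP self-defined function) can be illustrated as follows. This is a hedged sketch: instructions are modelled as hypothetical `(mnemonic, operand)` pairs, the helper names are ours, and the `GetTickCount` call is an invented sample; only `MouseLD` and `ds:__imp__mouse_event@20` come from Example 2.

```python
from typing import Dict, List, Tuple


def reduce_sip(instructions: List[Tuple[str, str]]) -> List[str]:
    """Keep only the call instructions of a SIP.

    An empty result corresponds to a NULL SIP.
    """
    return [operand for mnem, operand in instructions if mnem.lower() == "call"]


def inline_single_sip(calls: List[str],
                      single_sip_bodies: Dict[str, List[str]]) -> List[str]:
    """Replace calls to self-defined functions whose SIPS has a single SIP
    by the content of that SIP; all other calls are kept unchanged."""
    out: List[str] = []
    for callee in calls:
        out.extend(single_sip_bodies.get(callee, [callee]))
    return out


# Hypothetical SIP before optimization.
raw_sip = [
    ("push", "0"),
    ("call", "ds:__imp__GetTickCount@0"),
    ("mov", "esi, eax"),
    ("call", "MouseLD"),
]
# MouseLD has one SIP containing one API call (Example 2).
bodies = {"MouseLD": ["ds:__imp__mouse_event@20"]}

optimized = inline_single_sip(reduce_sip(raw_sip), bodies)
```

After both steps, `optimized` contains only API invocations, so no separate record for `MouseLD` needs to be stored.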
② Horizontal optimization
If identical SIPs exist in the horizontal direction, we can use one SIP to represent the SIPs that have the same API invoking content, as shown in Fig.5.
Example 3 Figure 5 shows a typical situation of horizontal optimization.
Fig.5 Horizontal optimization process of Example 3
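A possible reading of horizontal optimization is sketched below. Since Fig.5 is not reproduced here, the sketch assumes that "the same SIP in the horizontal direction" means SIPs with identical API content that also share the same predecessors and successors (parallel branches); that interpretation, the function name, and the sample graph are our assumptions.

```python
from typing import Dict, Set, Tuple


def merge_horizontal(content: Dict[int, Tuple[str, ...]],
                     edges: Set[Tuple[int, int]]):
    """Merge SIPs that have equal content, predecessors and successors.

    content maps a SIP label to its tuple of calls (an empty tuple is a NULL SIP);
    edges is the SIPRS as a set of (source, destination) labels.
    """
    preds = {n: frozenset(s for s, d in edges if d == n) for n in content}
    succs = {n: frozenset(d for s, d in edges if s == n) for n in content}

    survivor = {}   # signature -> label of the SIP that is kept
    rename = {}     # every label -> label that represents it after merging
    for n in sorted(content):
        signature = (content[n], preds[n], succs[n])
        rename[n] = survivor.setdefault(signature, n)

    new_content = {n: c for n, c in content.items() if rename[n] == n}
    new_edges = {(rename[s], rename[d]) for s, d in edges}
    return new_content, new_edges


# Hypothetical diamond: SIP 2 and SIP 3 are parallel branches with equal content.
content = {1: ("ReadFile",), 2: ("WriteFile",), 3: ("WriteFile",), 4: ()}
edges = {(1, 2), (1, 3), (2, 4), (3, 4)}
merged_content, merged_edges = merge_horizontal(content, edges)
```

In this example SIP 3 is folded into SIP 2, which also lowers the out-degree of SIP 1 from two to one.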
③ Vertical optimization
In the vertical direction, if a group of SIPs contains no non-null SIP and the total out-degree of this all-null SIP group is at most two, then we can optimize the group into a single null SIP. (The operation succeeds only if the deletion does not affect the integrity of the overall control flow; for the sake of the storage definition, the out-degree of a SIP should be at most two.) Moreover, we can combine two sequential SIPs, the former with out-degree 1 and the latter with in-degree 1, into one SIP.
Example 4 As shown in Fig.6, we optimize a group of null SIPs into a single null SIP.
Example 5 Figure 7 shows the optimization of two consecutive SIPs with only one arrow between them.
Fig.6 Optimization process of Example 4
Fig.7 Optimization process of Example 5
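The second vertical rule (combining a SIP of out-degree 1 with a successor of in-degree 1) can be sketched as a simple fixed-point loop. This sketch covers only that rule; the merging of an all-null group with several exits is not modelled, and the function name and sample graph are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def merge_vertical(content: Dict[int, List[str]],
                   edges: Set[Tuple[int, int]]):
    """Repeatedly merge a -> b when out-degree(a) == 1 and in-degree(b) == 1."""
    content = {n: list(c) for n, c in content.items()}
    edges = set(edges)
    changed = True
    while changed:
        changed = False
        for a, b in sorted(edges):
            out_a = [d for s, d in edges if s == a]
            in_b = [s for s, d in edges if d == b]
            if a != b and len(out_a) == 1 and len(in_b) == 1:
                # Append b's calls to a, then let a inherit b's outgoing edges.
                content[a].extend(content.pop(b))
                edges = {(a if s == b else s, d)
                         for s, d in edges if (s, d) != (a, b)}
                changed = True
                break
    return content, edges


# Hypothetical chain 1 -> 2 -> 3 followed by a two-way branch at SIP 3.
content = {1: ["A"], 2: [], 3: ["B"], 4: ["C"], 5: []}
edges = {(1, 2), (2, 3), (3, 4), (3, 5)}
merged_content, merged_edges = merge_vertical(content, edges)
```

The chain collapses into SIP 1, which ends with the branch that SIP 3 originally carried; SIPs 4 and 5 are left alone because their predecessor now has an out-degree of two.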
Finally, after applying these three kinds of optimization repeatedly, we obtain an optimized and concise API invoking relationship flow graph, on the basis of which we turn to the storage definition of the software’s anticipated behavior.
1.3 Storage Definition of API Invoking Sequential Module
1.3.1 Storage representation of the AMS (API module set) and AMRS (API module relationship set)
In this paper, we represent the AMS and AMRS by a DFSM (deterministic finite state machine), DFSM = (K, Σ, f, S, Z), in which K is the set of all states, Σ is the input character set of the FSM (namely the AMS), f is the transfer function, which is used to represent the AMRS, S is the initial state of the FSM and Z is the ending state of the FSM. Fig.1 shows the extraction result for a program, and the corresponding finite state graph is shown in Fig.8.
Fig.8 Corresponding finite state graph of Fig.1
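The DFSM = (K, Σ, f, S, Z) representation can be sketched as a small class whose transfer function is a dictionary keyed by (state, module). The states `q0` to `q2`, the modules `M1` to `M3`, and the transitions below are purely hypothetical, because the content of Fig.1 and Fig.8 is not reproduced in the text; Z is modelled as a set of ending states.

```python
from typing import Dict, Iterable, Set, Tuple


class DFSM:
    """DFSM = (K, Sigma, f, S, Z) whose input symbols are AMs."""

    def __init__(self, K: Set[str], sigma: Set[str],
                 f: Dict[Tuple[str, str], str], S: str, Z: Set[str]) -> None:
        self.K, self.sigma, self.f, self.S, self.Z = K, sigma, f, S, Z

    def accepts(self, modules: Iterable[str]) -> bool:
        """Return True if the observed AM sequence follows the AMRS
        and stops in an ending state."""
        state = self.S
        for am in modules:
            nxt = self.f.get((state, am))
            if am not in self.sigma or nxt is None:
                return False          # unknown module or undefined transition
            state = nxt
        return state in self.Z


# Hypothetical machine with three modules.
machine = DFSM(
    K={"q0", "q1", "q2"},
    sigma={"M1", "M2", "M3"},
    f={("q0", "M1"): "q1", ("q1", "M2"): "q1", ("q1", "M3"): "q2"},
    S="q0",
    Z={"q2"},
)
```

A sequence such as `["M1", "M2", "M3"]` is accepted by this machine, while a sequence that starts with `M2` is rejected because no transition is defined for it from the initial state.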
In this FSM, each API function invoking module is an input element of the finite alphabet Σ, and these input elements should be recognized (trusted authentication) before the state transition of the FSM. Thus, defining a storage structure for AM is necessary.
1.3.2 Storage definition of AM
As stated above, an AM can be accurately specified by its SIPRS and SIPS. In the following, we introduce the storage structure of these two data elements.
① Storage of SIPRS
To store and represent the SIPRS of an AM conveniently, we allocate the following attributes in the SIP’s structure:
CurrentNo: the serial number of the current SIP (8 bits in this paper).
Out: the out-degree, namely the number of possible SIPs following the current SIP (ranging from 0 to 2).
APICount: the number of API invocations in the current SIP.
NextNo1: the serial number of the first possible next SIP (8 bits in this paper).
NextNo2: the serial number of the second possible next SIP (8 bits in this paper).
If the out-degree equals 1, the serial number of the single next SIP is stored in NextNo1; an out-degree of 0 means that the current SIP is one of the ending SIPs (a SIP that contains the return instruction). For convenient storage, we allocate 4 bytes to store the above attributes; the structure is shown in Table 1.
Example 6 The example data given in Table 1 shows the actual data of a SIP’s attributes in one of a program’s modules. The data indicates that the current SIP is the first SIP in the current AM, that it has two jumping possibilities and contains two API invocations, and that the possible next SIPs are SIP number 2 and SIP number 11.
Table 1 Storage structure definition of SIP’s attributes
Attribute   CurrentNo   Out   APICount   NextNo1    NextNo2
Bit         0-7         8-9   10-15      16-23      24-31
Example     00000001    10    000010     00000010   00001011
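The 4-byte layout of Table 1 can be checked with a short packing routine. The sketch below assumes that bit 0 in the table is the most significant bit of the 32-bit word, which is how the example row reads from left to right; the function names are ours.

```python
def pack_sip(current_no: int, out: int, api_count: int,
             next_no1: int = 0, next_no2: int = 0) -> int:
    """Pack SIP attributes into a 32-bit word (bit 0 = most significant bit)."""
    if not all(0 <= v < 256 for v in (current_no, next_no1, next_no2)):
        raise ValueError("serial numbers use 8 bits")
    if not 0 <= out <= 2:
        raise ValueError("out-degree ranges from 0 to 2")
    if not 0 <= api_count < 64:
        raise ValueError("APICount uses 6 bits")
    return ((current_no << 24) | (out << 22) | (api_count << 16)
            | (next_no1 << 8) | next_no2)


def unpack_sip(word: int) -> dict:
    """Inverse of pack_sip."""
    return {
        "CurrentNo": (word >> 24) & 0xFF,
        "Out": (word >> 22) & 0x3,
        "APICount": (word >> 16) & 0x3F,
        "NextNo1": (word >> 8) & 0xFF,
        "NextNo2": word & 0xFF,
    }


# Example 6: SIP 1, two successors (SIP 2 and SIP 11), two API invocations.
word = pack_sip(current_no=1, out=2, api_count=2, next_no1=2, next_no2=11)
bits = format(word, "032b")
```

Under that bit-order assumption, `bits` reproduces the example row of Table 1 when split into fields of 8, 2, 6, 8 and 8 bits.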
The serial number is assigned according to the RVA of the current SIP, and for now we support at most 2^8 − 1 SIPs.
② Storage of SIPS
SIPRS is an array using SIP as the array element. The maximum number of elements it supports is 2^8 − 1. The API invocations contained in a SIP are classified as system API invocations and user self-defined function invocations. In other words, each SIP is actually a sequence of a certain number of API invoking events.
In order to locate the detailed information of a SIPRS rapidly and conveniently, we relate the VA of the first SIP in the SIPS to the corresponding SIPRS. To achieve this, we build an associative array that records the association between each SIPRS and the corresponding virtual address (this array is actually prepared for the implementation of the dynamic monitor, which is specified in another paper).
In this paper, we define the following attributes to describe a specific API invoking event:
DLLVersion: the version identification of the system DLL.
SRCType: the source type of the API; (001) means the API function comes from a system DLL, (010) means a user DLL and (100) means an inner function of the program.
APIType: the type of the API; 0 means an exported API while 1 means an un-exported one.
Reserved: reserved bits.
DLLNo: 6 bits identifying the DLL file from which the API function comes (e.g., kernel32.DLL is defined as DLLNo 0; the actual DLLNo should be defined by whoever defines the behavior rules).
APINo: 10 bits identifying the corresponding API function (it can be self-defined, or defined as the export ordinal of the API function in the DLL).
Layer: the function’s depth in the current call stack.
VA: the virtual address, i.e., the address when the PE file is loaded into system memory. The inner functions of a program are recognized by their VA.
We use the following storage structures to represent API function invocations: Table 2 shows the storage structure used to represent a system API function, and Table 3 shows the storage structure used to represent an inner function of a program.
Table 2 System API function storage structure
Attribute   SRCType   DLLVersion   APIType   Reserved   DLLNo    APINo
Bit         0-2       3-5          6-7       8-15       16-21    22-31
Example     001       001          01        00000000   000011   0000010011
Table 3 Program’s inner functions storage structure
Attribute   SRCType   Layer   Reserved   VA
Bit         0-2       3-5     6-7        8-31
Example     100       010     00         401080H
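The two 32-bit records of Tables 2 and 3 can be packed in the same way. As before, this is a sketch that assumes bit 0 is the most significant bit; the function names are ours, and the layout for user DLLs (SRCType 010) is not given in the paper, so it is not modelled.

```python
SRC_SYSTEM_DLL = 0b001   # API function from a system DLL
SRC_INNER = 0b100        # inner function of the program


def pack_system_api(dll_version: int, api_type: int,
                    dll_no: int, api_no: int) -> int:
    """Table 2 layout: SRCType(3) DLLVersion(3) APIType(2) Reserved(8) DLLNo(6) APINo(10)."""
    if not (0 <= dll_version < 8 and 0 <= api_type < 4
            and 0 <= dll_no < 64 and 0 <= api_no < 1024):
        raise ValueError("field out of range")
    return ((SRC_SYSTEM_DLL << 29) | (dll_version << 26) | (api_type << 24)
            | (dll_no << 10) | api_no)


def pack_inner_function(layer: int, va: int) -> int:
    """Table 3 layout: SRCType(3) Layer(3) Reserved(2) VA(24)."""
    if not (0 <= layer < 8 and 0 <= va < (1 << 24)):
        raise ValueError("field out of range")
    return (SRC_INNER << 29) | (layer << 26) | va


def src_type(word: int) -> int:
    """The top three bits tell which of the two layouts applies."""
    return (word >> 29) & 0x7


api_word = pack_system_api(dll_version=1, api_type=1, dll_no=3, api_no=19)
inner_word = pack_inner_function(layer=2, va=0x401080)
```

A reader of the stored data would first inspect `src_type(word)` and then decode the remaining 29 bits according to the matching table.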
2 Related Experimental Data and Result Evaluation
We select a well-known program, ping.exe from Windows XP, as our experimental data source. To simplify the experiment, we carry it out with the help of IDA 5.0. First, we load the symbol files of ping.exe in order to obtain the names of the program’s inner functions and other useful information. Then, we disassemble the program with IDA and obtain its flow graph, to which we apply the optimization method described above. Lastly, we compare the optimized graph with the raw graph and obtain the following data. Table 4 compares the data for ping.exe before and after optimization.
The experimental data demonstrate that, using the extraction method proposed in this paper, we can extract the intended behavior of ping.exe precisely and represent and store it completely. Moreover, from the data that stores the intended behavior we can restore the original flow graph of the target program, which means that the proposed method is able to describe the software intended behavior completely. Meanwhile, the optimization method considerably decreases the complexity of the flow graph.
Table 4 Comparison of SIP numbers before and after optimization
                    Before optimization     After optimization
Function name       SIP    non-null SIP     SIP    non-null SIP
Main                195    65               102    50
GetDefaultTTL       7      3                5      3
SetFamily           5      1                3      1
NlsPutMsg           5      2                3      2
Str2ip              10     2                4      2
ResolveTarget       13     4                5      4
GetSource           4      1                1      1
PrintUsage          1      1                1      1
Icmp6CreateFile     1      1                1      1
IcmpCreateFile      1      1                1      1
Icmp6SendEcho2      1      1                1      1
IcmpSendEcho2       1      1                1      1
print_statistics    6      3                4      3
IcmpCloseHandle     1      1                1      1
ProcessOptions      63     20               33     15
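The effect of the optimization can be summarized by totalling the columns of Table 4. The sketch below only performs arithmetic on the published rows; the helper names are ours, and the totals are derived figures that the paper itself does not report.

```python
# (function, SIP before, non-null before, SIP after, non-null after), from Table 4
TABLE4 = [
    ("Main", 195, 65, 102, 50),
    ("GetDefaultTTL", 7, 3, 5, 3),
    ("SetFamily", 5, 1, 3, 1),
    ("NlsPutMsg", 5, 2, 3, 2),
    ("Str2ip", 10, 2, 4, 2),
    ("ResolveTarget", 13, 4, 5, 4),
    ("GetSource", 4, 1, 1, 1),
    ("PrintUsage", 1, 1, 1, 1),
    ("Icmp6CreateFile", 1, 1, 1, 1),
    ("IcmpCreateFile", 1, 1, 1, 1),
    ("Icmp6SendEcho2", 1, 1, 1, 1),
    ("IcmpSendEcho2", 1, 1, 1, 1),
    ("print_statistics", 6, 3, 4, 3),
    ("IcmpCloseHandle", 1, 1, 1, 1),
    ("ProcessOptions", 63, 20, 33, 15),
]


def totals(rows):
    """Column sums: (SIP before, non-null before, SIP after, non-null after)."""
    return tuple(sum(row[i] for row in rows) for i in range(1, 5))


def reduction(before: int, after: int) -> float:
    """Fraction of items removed by the optimization."""
    return (before - after) / before


sip_before, nonnull_before, sip_after, nonnull_after = totals(TABLE4)
print(f"SIPs: {sip_before} -> {sip_after} "
      f"({reduction(sip_before, sip_after):.1%} fewer)")
print(f"null SIPs remaining: {sip_after - nonnull_after}")
```

The remaining null SIPs computed here are the ones discussed in the paragraph that follows the table.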
As the experimental data show, a number of null SIPs still exist, which is caused by the rule we defined that the out-degree of a SIP should be at most 2. In other words, we can reduce the number of null SIPs by properly increasing the maximum out-degree of a SIP in certain areas, such as at the level of the main process.
3 Conclusion
In this paper, we propose a method based on API function invocations for extracting the intended behavior of software. By disassembling the binary program, slicing it into sequential instruction pieces, optimizing those pieces and so on, we obtain the API function invoking flow graph, on the basis of which we define the storage structure of the SIBDS. Finally, we carry out an experiment that extracts the intended behavior of ping.exe from Windows XP; the experimental results demonstrate the feasibility of the proposed extraction method and storage scheme.
References
[1] Chinchani R, van den Berg E. A Fast Static Analysis Approach to Detect Exploit Code inside Network Flows[C]//Proceedings of the International Symposium on Recent Advances in Intrusion Detection (RAID). Berlin: Springer-Verlag, 2006: 284-308.
[2] Xu Jianyun, Sung A H, C Patrick, et al. Polymorphic Malicious Executable Scanner by API Sequence Analysis[C]//Fourth International Conference on Hybrid Intelligent Systems (HIS ’04). Washington D C: IEEE Computer Society Press, 2004: 378-383.
[3] Christodorescu M, Jha S. Static Analysis of Executables to Detect Malicious Patterns[C/OL]. [2007-12-20]. http://www.usenix.org/events/sec03/tech/full_papers/christodorescu/christodorescu.pdf.
[4] Bergeron J, Debbabi M, Desharnais J. Static Detection of Malicious Code in Executable Programs[C/OL]. [2007-12-10]. http://www.sreis.org/old/2001/papers/sreis014.pdf.
[5] Wagner D A. Static Analysis and Computer Security: New Techniques for Software Assurance[D/OL]. [2007-12-10]. http://http.cs.berkeley.edu/~daw/papers/phd-dis.ps.
[6] Sung A H, Xu J, Chavez P, et al. Static Analyzer of Vicious Executables[C]//Computer Security Applications Conference 2004. Washington D C: IEEE Computer Society Press, 2004: 326-334.
[7] Liu Zhen, Bridges S M, Vaughn R B. Combining Static Analysis and Dynamic Learning to Build Accurate Intrusion Detection Models[C]//IEEE International Information Assurance Workshop 2005. Washington D C: IEEE Computer Society Press, 2005: 164-177.
[8] Feng H H, Giffin J T, Huang Y, et al. Formalizing Sensitivity in Static Analysis for Intrusion Detection[C]//Proceedings of the IEEE Symposium on Security and Privacy. Washington D C: IEEE Computer Society, 2004: 194-210.
[9] Forrest S, Hofmeyr S, Somayaji A, et al. A Sense of Self for Unix Processes[C]//Proceedings of the 1996 IEEE Symposium on Security and Privacy. Washington D C: IEEE Computer Society Press, 1996: 120-128.
[10] Hofmeyr S, Forrest S, Somayaji A. Intrusion Detection Using Sequences of System Calls[J]. Journal of Computer Security, 1998, (6): 151-180.
[11] Su Purui, Yang Yi. Intrusion Detection Model Based on Executable Static Analysis[J]. Chinese Journal of Computers, 2006, (9): 1572-1578 (Ch).
[12] Yan Qiao, Xie Weixin, Song Ge. System Call Anomaly Detection Method Based on HMM[J]. Acta Electronica Sinica, 2003, (8): 1486-1490 (Ch).
[13] Tan Xiaobin, Wang Weiping, Xi Hongsheng. A Hidden Markov Model Used in Intrusion Detection[J]. Journal of Computer Research and Development, 2003, (2): 245-250 (Ch).
[14] Zhang Xiangfeng, Sun Yufang, Zhao Qingsong. Intrusion Detection Based on Sub-Set of System Calls[J]. Acta Electronica Sinica, 2004, (8): 1338-1442 (Ch).