A Static Birthmark for MS Windows Applications Using Import Address Table JongCheon Choi, YongMan Han, Seong-je Cho*, HaeYoungYoo, Jinwoon Woo Dept. of Computer Science Dankook University Yongin, Korea, 448-701 {godofslp, grid_ym, sjcho, yoohy, jwwoo}@dankook.ac.kr
Minkyu Park
Youngsang Song, Lawrence Chung
Dept. of Computer Engineering Konkuk University Chungju, Korea, 380-701
[email protected]
Dept. of Computer Science University of Texas at Dallas Texas 75080, USA {youngsang.song, chung}@utdallas.edu
Abstract—A software birthmark is unique and native characteristics of software, and thus can be used to detect the theft of software. We propose a new static software birthmark for programs on Microsoft Windows which have Portable Executable (PE) format. These programs use different Dynamic Link Libraries (DLLs) and Application Program Interfaces (APIs) while they are executing. The number and names of the used DLLs and APIs are unique to each program. The proposed birthmark is based on these numbers and names. This information can be obtained from the Import Address Table (IAT), which is part of the PE file. By inspecting the proposed birthmark, we can identify certain software and detect pirated software. To evaluate the effectiveness of the proposed birthmark, we inspect and compare several applications of different kinds. The experimental results show that the proposed birthmark can identify Windows applications, leading to the prevention of an illegal distribution of copyrighted software. Keywords—Software birthmark, Static birthmark, Import address table, DLL, API
I.
INTRODUCTION
Though recent anti-piracy measures monitor Internet for detecting illegal upload or download of music and movies, copyrighted software has been still illegally distributed over web-hard servers and P2P networks. Software piracy is a growing concern in today's competitive world of software. Indeed, many incidents have been reported, and many software developers and copyright holders have been victimized by software theft. The Business Software Alliance (BSA) publishes the yearly study about copyright infringement of software. The Ninth Annual BSA 2011 Piracy Study reported that 57 percent of the world's personal computer users admit to pirating software [1]. The commercial value of all these pirated software rose from $58.8 billion in 2010 to $63.4 billion in 2011. Undoubtedly, software piracy causes severe damages to software industries, stifling not only IT innovation but also job creation across all * Corresponding author
sectors of the economy. To protect the intellectual property for software developers, many software protection techniques have been proposed. Among them, software birthmark is a prominent technique [2-9]. A software birthmark is a unique characteristic, or set of characteristics, that a program inherently has and can be used to identify that program. Existing birthmark schemes have some limitations, though. For example, a static source code-based birthmark [2] requires source code, and is not applicable to binary executable programs. This source code-based birthmark and other birthmarks, such as static executable code-based birthmark [3], dynamic whole program path (WPP)-based birthmark [4], and dynamic API-based birthmark [5], are not resilient to semantics-preserving obfuscation attacks, such as outlining and ordering transformation. Also none of the existing static birthmarks has been evaluated on large-scale programs. We propose a new software birthmark based on the number and names of Dynamic Link Libraries (DLLs) and Application Programming Interface (APIs) that are used in a Windows executable program. This birthmark can be used to detect the obfuscated Microsoft Windows applications, including large-scale programs, and consequently to detect illegal distribution of copyrighted software over web-hard servers and P2P networks. Windows executable programs have Portable Executable (PE) format, and their DLL And API information is stored in a section of PE, Import Address Table (IAT). The number and names of DLLs and APIs are inherent to each program and can be used as a unique birthmark. The software identification process proceeds in four steps: (1) inspecting the number of DLLs, (2) inspecting the names of DLLs, (3) analyzing the names and how many times APIs are called in DLLs (but excluding the three most commonly used DLLs by Windows applications), and (4) analyzing the names and how many times APIs are called in the three most commonly used DLLs. The rest of the paper is organized as follows. Section 2 outlines the background and related work. Section 3 describes a design and implementation of the proposed
scheme. In Section 4, we introduce IAT analysis of executable files and describe experiments for assessing our birthmark. Finally, we summarize our conclusions and describe future work.
II.
BACKGROUND AND RELATED WORK
A. Import Address Table Microsoft Windows operating systems use the Portable Executable (PE) format for executable files, object code, and DLLs [10]. The PE format contains dynamic library references for linking, API export and import tables, resource management data and thread-local storage (TLS) data. A PE file consists of a few headers and sections that tell the dynamic linker how to map the file into memory. When a program is loaded, the Windows loader loads all the DLLs the application uses and maps them into the process address space. A DLL is simply a file that contains one or more pre-compiled functions. That is, each DLL contains pre-compiled implementation code for API functions. The executable file lists all the functions it requires from each DLL. This loading and joining is accomplished by using the Import Address Table (IAT). The IAT is a table of function pointers filled in by the Windows loader as the DLLs are loaded. The IAT is a lookup table when the application is calling a function from a different module. It can be in the form of both import by ordinal and import by name [10]. The IAT of a PE file is used to store virtual addresses of functions that are imported from external PE files. From the IAT, we can obtain the feature information of the program, such as the number of DLLs, the names of DLLs, and the names of API functions in each DLL. B. Related Work A source code-based birthmark uses names of variables and functions. This birthmark, however, no longer exists after compilation without special handling. Given only an executable file, we cannot use this birthmark for its original purpose. Because of this limitation, many researchers are studying on API-based or system call-based birthmarks [2-9]. These birthmarks are intact through compilation and can be used for detecting software theft and computer forensics. Existing birthmarks can be classified into two categories. Static birthmarks extract statically available information in the program source code or executable files [2,3], for example, the types or initial values of the fields. Dynamic birthmarks, in contrast, rely on information gathered from the execution of a program [4,5]. Tamada et al. [2] proposed four types of static birthmark: constant values in field variables, sequence of method calls, inheritance structure, and used classes. All the four types are vulnerable to obfuscation techniques, such as code removal or splitting of variables [4]. In addition, their technique needs
to access the source code and only works for an objectoriented programming language, such as Java. Myles and Collberg proposed K-Gram-based birthmark, a static technique, which uniquely identifies a program through instruction sequences [3,9]. Instruction (opcode) sequences of length k are extracted from a program, and kgram techniques, which were used to detect the similarity of documents, are used for the opcode sequence. The k-gram static birthmark is still fragile to some obfuscation methods, such as statement reordering, invalid instruction insertion, and compiler optimization. Myles and Collberg presented another dynamic birthmark called a whole program path (WPP) and evaluated its performance on a Java program [4, 9]. WPP is originally used to represent the dynamic control flow graphs (DCFGs) of a program. It collects all the compact DCFGs and regards them as a program’s birthmarks. However, a WPP may not work for large-scale programs because of the overwhelming volume of WPP traces. Also, it is vulnerable to program optimization, such as loop transformations and inline functions. Tamada et al. [5] introduced two types of dynamic birthmark for Windows applications: sequence of API function calls and frequency of API function calls. The sequence and frequency of Windows API calls are recorded during the execution of a program. Shuler and Dallmeier [7] presented a dynamic birthmark based on the extraction of API call sequence sets during program execution. API birthmarks are more robust to obfuscation than WPP birthmarks [6]. However, dynamic birthmarks need program executions which are dependent on user interactions, inputs, and environments. Wang et al. [6] proposed two types of system call birthmark: system call short sequence birthmark and inputdependent system call subsequence birthmark. System callbased birthmarks can be platform-independent and are more robust to counter-attacks than API-based ones. They also need a program execution. Moreover, there are no easy ways to record system call traces of each application during program execution on Microsoft Windows systems. Choi et al. [8] suggested a static API birthmark for Windows. Their birthmark is a set of possible API calls which are statically extracted by analyzing disassembled code. They did not use DLL information, which can be easily obtained from the IAT. Current birthmarks are limited in their capabilities: some solutions are not strong enough to adequately prevent software theft, some cause significant performance degradation for large-scale programs, and some need program execution or work only for Java programs.
III.
A STATIC BIRTHMARK USING IMPORT ADDRESS TABLE (IAT)
in the structure of a PE file and Feature Database which stores the features of PE files.
A. Detection Method We use information not only about APIs but also about DLLs to identify programs. We extract a list of DLLs and APIs from the IAT of an executable file and the API frequency counted from the code section as a birthmark. We use this IAT-based birthmark (IATB) for software identification. This identification process consists of four steps. Step 1. Comparing the number of DLLs In the first step, we compare the number of DLLs loaded in the IAT of target programs and filter programs with the different number of DLLs. If the programs are not identified, then go to Step 2. Step 2. Comparing the names of DLLs and the number of APIs We compare the names of DLLs of the target programs one by one and filter the programs with different DLLs. We also compare the number of APIs of each DLL to filter more programs. If the programs are not identified, then go to Step 3. Step 3. Comparing the names and frequency of APIs in DLLs that are not among the three most commonly used DLLs
Figure 1. PE format and Feature database The Schema of Feature Database is shown in Figure 2. This database is a relational database and has tables used for each step. These tables are File information table, DLL information table, and API information table. The tables can be accessed using a file name and a DLL name.
We compare the names and frequency of APIs in DLLs, and filter programs with different names and frequency. In this step, we exclude most commonly-used three DLLs to speed up the comparison. These three DLLs are KERNEL32.DLL, USER32.DLL, and GDI32.DLL. The three DLLs account for 62.2% of all DLLs of a program on average. Thus, comparing these APIs in these DLLs is time consuming but does not get satisfactory results. If the programs are not identified, go to Step 4. Step 4. Comparing the names and frequency of APIs in the three most commonly used DLLs If a target program is not identified until Step 3, we must compare the names and frequency of APIs in KERNEL32.DLL, USER32DLL, and GDI32.DLL. Special Case 1. If the number of DLLs of a target program is less than three, we compare the names and frequency of APIs in all DLLs in Step 3. Special Case 2. If a target program uses only some of the three most commonly used DLLs, we skip Step 3 and perform Step 4 after Step 2. B. Design and Implementation We extract the list and the number of DLLs and APIs in an Import Address Table of an executable file. The calls to APIs are counted from the code segment. We store all this information to Feature Database. Figure 1 shows the features
Figure 2. The Schema of Feature Database
IV.
EXPERIMENTS AND EVALUATION
A. Target Applications We extracted information about DLLs and APIs from 22 widely used applications for Windows of five types, including commercial and open-source products (see Table 1). We mixed both kinds of software and make them close to real-world situations. Shaded entries in the table indicate open-source software. Debugging tools are used to simulate identification (filtering) and analysis on a PE format and code. Debugging
Figure 4. The number of DLLs used in programs tools analyze the PE header of a target application and extract information about DLLs and APIs, including the entry point (virtual address) of each API. We can obtain the number of references to an API by counting the API name that appears in a disassembled text section. Our identification functionality is implemented as a plug-in of a Python script. Table 1 Target applications Group FTP Client Image Viewer
Media Player
Text Editor
Compress Tools
No.
Program
Version
Size(Kb)
(1) (2) (3) (4) (5) (6) (7) (8) (9) (10) (11) (12) (13) (14) (15) (16) (17) (18) (19) (20) (21)
Alftp Filezilla WinSCP Alsee FSViewer Imagine Xnview Alshow GOM Player KMPlayer Mplayerc AkelPad Eclipse Editplus EXPAD Hwp2010 notepad++ WINWORD2007 7zFM Alzip peazip
5.3.2 3.5.3 4.3.9 6.8 4.6 1.0.8 1.99 2.02 2.1.43 3.3.0 6.4.9.1 4.7.7 1.4.9 3.20 0.4 8.5.8 6.1.5 12.0 9.22 8.53 4.6.1
4,109 7,994 6,329 6,960 1,728 17 4,624 117 3,948 7,521 4,308 357 52 1,787 845 4,627 1,584 401 723 2,855 4,023
(22)
TUGZip
3.5
3,361
B. IAT Analysis We extracted information about how many DLLs and APIs are used in a program from the IAT of the executable file. The analysis results are shown in Figure 3. The number of DLLs is proportional to the number of APIs referenced. In the graph of Figure 3, each number at the bottom of each bar corresponds to a program in Table 1. Figure 4 shows the DLLs and APIs used in 22 programs. They use many different DLLs and APIs. We could not find any specific pattern, but identified frequently used DLLs.
Figure 3 The number of DLLs and APIs
In many programs, USER32.DLL, KERNEL32.DLL, and GDI32.DLL accounted for a large portion of DLLs used (see Figure 5).
Figure 5. The portion of USER32.DLL, KERNEL32.DLL, and GDI32.DLL These three DLLs accounted for over 40 percent in 18 programs except ALShow, Eclipse, HWP, and ALZip. Commercial programs such as ALShow, HWP, and ALZip use their own proprietary DLLs with a similarly high ratio. Eclipse showed a low percentage of these DLLs because it has only 59 APIs and also mainly uses its own DLLs. Meanwhile, Imagine program uses only USER32.DLL and KERNEL32.DLL. From this, we can draw that these three DLLs are frequently used in many Windows applications. Table 2 shows the basic functionality of the three DLLs. These frequently used DLLs would not help compare the similarity of programs; therefore, we consider the characteristics of such DLLs last when filtering software. Table 2. Tasks of USER32.DLL, KERNEL32.DLL, and GDI32.DLL DLL Kernel32.dll
User32.dll
GDI32.dll
Description Handles memory management, input/output operations, and interrupts Contains Windows API functions related to the Windows user interface (Window handling, basic UI functions, etc.) Contains functions for the Windows GDI (Graphical Device Interface) which assists Windows in creating simple 2dimensional objects
API LoadLibrary, GetCurrentProcess, ExitProcess, TerminateProcess, etc MessageBoxA, CreateWindowExA, SetCapture, SendMessageA, etc
Bitblt, TextOutA, SetROP2, LineTo, SetTextColor, etc
C. Applications in Same Categories The overall comparison results are shown in Table 3. Table 3. Overall comparison results
Program Alftp Filezilla WinSCP Alsee FSViewer Imagine Xnview Alshow GOM Player KMPlayer Mplayerc AkelPad Eclipse Editplus EXPAD Hwp2010 notepad++ WINWOR D2007 7zFM Alzip Peazip TUGZip
Step 1 2/22 2/22 2/22 identified identified identified identified 2/22
Step 2 identified identified identified
2/22
identified
identified 2/22 2/22 identified 2/22 identified identified identified
identified
identified identified identified
identified 2/22 2/22 2/22 2/22
identified identified identified identified
We could detect all different programs after performing Step 2. The value 2/22 of the Step 1 column means two out of 22 programs are detected to be identical after Step 1. In Step 2, we compared these programs only. In the case of AkelPad, the number of DLLs is 10, which is the same as that of peazip, but the names of DLLs are different from those of peazip. In the case of Alzip, the number of DLLs is 14, which is the same as that of TUGZip, but the names of DLLs are different from those of TUGZip. This shows the DLLs information is a good birthmark to identify each Windows application efficiently. D. Applications with Different Versions Using experiments, we can see that the same program with a different version has a similar distribution of DLLs and the number of APIs referenced in those DLLs. Figure 7 shows the distribution of DLLs referenced by the three versions of Alftp, Filezilla, and WinSCP. In the case of Alftp, all versions reference the same number of DLLS and APIs: 13 DLLs and 555 APIs. In the case of Filezilla, version 3.5.3 references the same number of DLLs as other versions do, but references a different number of APIs. The ratios of the number of APIs in a certain DLL to the total number of APIs are approximately the same for all versions. In the case of version 3.5.3, 155 APIs of KERNEL32 are referenced and account for 21.3% of total 729 APIs. In the case of version 3.4.0, 136 APIs of KERNEL32 are referenced and account for 19.6% of total 694 APIs. In the case of WinSCP, version 4.3.9 and 4.3.8 reference 16 DLLs but version 4.0.4 reference 17 DLLs.
However, the ratios of the number of APIs in a certain DLL to the total number of APIs also are approximately the same for all versions. The overall results of comparison of the same programs based on these results are shown in Table 4.
store this birthmark into a Feature Database and use it to compare the features of programs in concern. Using the proposed birthmark, we can perform comparisons faster with a smaller amount of information than an API Sequence-based one does, because our birthmark does not need API sequence extraction. In addition, it can provide an evidence of software theft or privacy even when code transformations have been applied to the copy by a malicious adversary. Through a set of experiments, we have shown that the proposed birthmark is effective in protecting Windows applications from illegal copying and redistribution over web-hard servers or torrent sites. We are working on ways to improve the efficiency of detecting illegal software and to elaborate comparisons with frequently used DLLs.
Figure 7. The portion of DLLs in the same programs with different versions In these experiments, the number of the target applications is 28, since we add six applications that include two different versions each for three applications. Table 4. Overall comparison of programs with different versions Program
Step 1
Step 2
Step 3
REFERENCES
Step 4
Alftp 5.3.2
4/28
3/4
3/3
identified
Alftp 5.3.1
4/28
3/4
3/3
identified
Alftp 5.2.2
4/28
3/4
3/3
identified
filezilla 3.5.3
5/28
3/5
identified
filezilla 3.5.2
5/28
3/5
identified
filezilla 3.4.0
5/28
3/5
identified
WinSCP 4.3.9 WinSCP 4.3.8
5/28 5/28
2/5 2/5
2/2 2/2
WinSCP 4.0.4
2/28
identified
identified identified
Programs with different versions are similar and more programs are detected in Step 1. In the case of WinSCP 4.3.9, the number of DLLs is 16, which is the same as that of other four programs, filezilla-3.5.3, filezilla-3.5.2, filezilla-3.4.0 and WinSCP 4.3.8. In Step 2, the names of DLLs of WinSCP 4.3.9 are the same as those of WinSCP 4.3.8. In Step 3, the names and the call counts of APIs of WinSCP 4.3.9 are also the same as those of WinSCP 4.3.8. We finally had only one program in Step 4. Our experiments show that IATB can be used to identify similar or same programs. V.
ACKNOWLEDGMENT This research was supported by Ministry of Culture, Sports and Tourism (MCST) and from Korea Copyright Commission in 2013, and supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (2012-0007372)
CONCLUSION AND FUTURE WORK
To detect software theft or piracy, a birthmark relies on the inherent characteristics of an application, which can be used show that one program is a copy of another. In this paper, we have proposed a new static birthmark scheme using the notion of Import Address Table (IAT), which can be used to identify Windows executable files. Our birthmark is obtained by analyzing a Windows PE executables. We
[1]
Business Software Alliance, 2011 BSA Global Software Piracy Study, 9th Edition, May 2012. [2] Haruaki Tamada, Masahide Nakamura, Akito Monden, and Ken-ichi Matsumoto, “Design and Evaluation of Birthmarks for Detecting Theft of Java Programs”, In Proc. IASTED International Conference on Software Engineering (IASTED SE 2004), pp.569--575, Feb. 2004. [3] Ginger Myles, and Christian Collberg, “K-gram based software birthmarks”, in SAC, 2005. [4] Ginger Myles, and Christian Collberg, “Detecting Software Theft via Whole Program Path Birthmarks”, In Information Security Conference, September 27-29, 2004. [5] Haruaki Tamada, Keiji Okamoto, Masahide Nakamura, Akito Monden, and Ken-ichi Matsumoto, “Dynamic Software Birthmarks to Detect the Theft of Windows Applications”, In Proc. International Symposium on Future Software Technology 2004 (ISFST 2004), October 2004. [6] Xinran Wang, Yoon-Chan Jhi, Sencun Zhu, and Peng Liu, “Detecting Software Theft via System Call Based Birthmarks”, in Annual Computer Security Applications Conference (ACSAC '09.), 2009. [7] David Schuler, and Valentin Dallmeier, “Detecting Software Theft with API Call Sequence Sets”, In Proceedings of the 8th Workshop on Software Reengineering, May 2006. [8] Seokwoo Choi, Heewan Park, Hyun-il Lim, and Taisook Han, “A Static Birthmark of Binary Executables Based on API Call Structure”, In The 12th Asian Computing Science Conference (ASIAN), December 2007. [9] Ginger Myles, Software Theft Detection Through Program Identification, PhD thesis, Department of Computer Science, The University of Arizona, 2006. [10] Microsoft, Microsoft Portable Executable and Common Object File Format Specification, Revision 8.2, 2010. 09. (also see Understanding the Import Address Table, http:// sandsprite.com/CodeStuff/Understanding_imports.html)