Using Fuzzy Pattern Recognition to Detect Unknown Malicious Executables Code

Boyun Zhang1,2, Jianping Yin1, and Jingbo Hao1

1 School of Computer Science, National University of Defense Technology, Changsha 410073, China
[email protected]
2 Department of Computer Science, Hunan Public Security College, Changsha 410138, China
Abstract. An intelligent detection system for recognizing unknown computer viruses is proposed. Using a fuzzy pattern recognition algorithm, we also design a network model for detecting malicious executable code. The model targets Win32 binary viruses on the Intel IA32 architecture and can detect both known and unknown malicious code by analyzing program behavior. We gathered 423 benign and 209 malicious executable programs in the Windows Portable Executable (PE) format as the experimental dataset. After extracting the most relevant API calls as features, we evaluated the fuzzy pattern recognition algorithm for detecting computer viruses.
1 Introduction

Malicious code is “any code added, changed, or removed from a software system to intentionally cause harm or subvert the system’s intended function” [1]. Such software has been used to compromise computer systems, to destroy their information, and to render them useless. Excellent technology exists for detecting known malicious executables. Software for virus detection has been quite successful, and programs such as McAfee Virus Scan and Norton AntiVirus are ubiquitous. These programs search executable code for known patterns. One shortcoming of this method is that a copy of a malicious program must be obtained before the pattern necessary for its detection can be extracted.

Our efforts to address this problem have resulted in a fielded application built using techniques from fuzzy pattern recognition and machine learning. The Malicious Executable Classification System currently detects unknown malicious executable code without removing any obfuscation. As far as we know, our experiments are the first to apply methods based on fuzzy pattern recognition to the detection of malicious executables.

In the following sections, we first describe related research in the area of malicious code detection. Section 3 illustrates the architecture of our detection model. Section 4 details how features are extracted from programs and describes the working procedure of the detect engine. Section 5 details the implementation and the experimental results. We state our plans for future work in Section 6.

L. Wang and Y. Jin (Eds.): FSKD 2005, LNAI 3613, pp. 629 – 634, 2005. © Springer-Verlag Berlin Heidelberg 2005
2 Related Work

There have been few attempts to use machine learning and data mining to identify new or unknown malicious code. In an early attempt, Lo et al. [2] analyzed several programs, evidently by hand, and identified tell-tale signs, which they subsequently used to filter new programs. Researchers at IBM's T.J. Watson Research Center investigated neural networks for virus detection and incorporated a similar approach for detecting boot-sector viruses into IBM's Anti-Virus software [3]. More recently, instead of focusing on boot-sector viruses, Schultz et al. [4] used data mining methods, such as naïve Bayes, to detect malicious code. There are other methods of guarding against malicious code, such as object reconciliation, which involves comparing current files and directories to past copies. One can also compare cryptographic hashes, audit running programs, and statically analyze executables using pre-defined malicious patterns. These approaches are not based on data mining.
3 Model Structure

We first describe a general framework for detecting malicious executable code. Figure 1 illustrates the proposed architecture. The framework is divided into three parts: the Application Server, the Detect Server, and a Virus Detect Firewall based on character code (signature) scanning. Before a file is saved to the application server, it is scanned by the virus detect firewall. If the file is infected with a virus, it is quarantined. Otherwise, if the firewall finds no malicious information in the file, two copies are made: one is sent to the application server, and the other is sent to the detect server, which runs the detect engine based on Fuzzy Pattern Recognition (FPR). In the next stage, the detect server extracts the file's features and drives the FPR detect engine to check the copy again. If the detect server reports that the file is infected with unknown malicious code, the application server is notified to remove its copy from the application database. The file is then quarantined in a special database or sent to an expert for manual analysis.
Fig. 1. Architecture
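The file flow through the three components can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `route_file`, `Verdict`, and the two scanner callables are hypothetical names that do not appear in the paper, and the sketch collapses the two servers into one synchronous call, whereas the described system stores the file first and removes it later if the detect server raises an alarm.

```python
from enum import Enum
from typing import Callable


class Verdict(Enum):
    # Hypothetical outcome labels, chosen for this sketch only.
    KNOWN_MALICIOUS = "quarantined by the virus detect firewall"
    UNKNOWN_MALICIOUS = "removed from the application server and quarantined"
    ACCEPTED = "kept on the application server"


def route_file(
    data: bytes,
    signature_scan: Callable[[bytes], bool],
    fpr_detect: Callable[[bytes], bool],
) -> Verdict:
    """Route one incoming file through the two-stage check.

    Both callables are stand-ins (assumptions) that return True when
    they flag the file as malicious.
    """
    # Stage 1: the signature-based firewall catches known malicious code.
    if signature_scan(data):
        return Verdict.KNOWN_MALICIOUS
    # Stage 2: a copy of the file is checked by the FPR detect engine.
    if fpr_detect(data):
        return Verdict.UNKNOWN_MALICIOUS
    return Verdict.ACCEPTED
```

The second stage only runs on files that pass the firewall, which matches the order of checks described in the text.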
4 Malicious Code Detect Engine

4.1 Feature Extraction

Our first intuition was to extract from the PE executables the information that dictates their behavior. We chose Windows API function calls as the main feature. A large number of API function calls can be obtained by tracing the programs in the training set, but the calls clearly differ in how much they contribute to detecting malicious code: an API call that appears often in malicious code but seldom in benign code plays a more important role in detection. We therefore use the mean square deviation between classes as the main criterion for selecting the API function calls that form a program's features. It is computed as follows:

(1) Trace each sample program in the training set to obtain its API call sequence $A = \{A_1, A_2, \ldots, A_p\}$, and count the frequency $A_{ij}^V$ of each API function $A_i$ $(1 \le i \le p)$ in every malicious executable $V_j$ and its frequency $A_{ij}^N$ in every benign executable $N_j$.

(2) Compute the average frequencies $E(A_i^V)$ and $E(A_i^N)$ of each API function $A_i$ over the malicious executable set and the benign executable set:

$$E(A_i^V) = \frac{1}{s}\sum_{j=1}^{s} A_{ij}^V, \qquad E(A_i^N) = \frac{1}{n}\sum_{j=1}^{n} A_{ij}^N \tag{1}$$

where $s$ is the number of malicious executables and $n$ is the number of benign executables.

(3) Compute the total mean frequency $E(A_i)$ of each API function $A_i$:

$$E(A_i) = \frac{E(A_i^V) + E(A_i^N)}{2} \tag{2}$$

(4) Compute the mean square deviation $D(A_i)$ of each API function $A_i$:

$$D(A_i) = \left(E(A_i) - E(A_i^V)\right)^2 + \left(E(A_i) - E(A_i^N)\right)^2 \tag{3}$$
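Because $E(A_i)$ in Equation (2) is the midpoint of the two class means, Equation (3) reduces to a single squared difference. The short derivation below is added for illustration and uses only the symbols already defined:

```latex
E(A_i) - E(A_i^V) = \frac{E(A_i^N) - E(A_i^V)}{2}, \qquad
E(A_i) - E(A_i^N) = \frac{E(A_i^V) - E(A_i^N)}{2}

D(A_i) = 2\left(\frac{E(A_i^V) - E(A_i^N)}{2}\right)^2
       = \frac{\left(E(A_i^V) - E(A_i^N)\right)^2}{2}
```

Ranking API functions by $D(A_i)$ is therefore equivalent to ranking them by the absolute gap between their mean frequencies in malicious and benign programs.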
Finally, we sort the API functions by $D(A_i)$ in descending order and take the first $t$ of them as the fuzzy feature vector. Table 1 shows an example entry of the feature vector.

Table 1. Feature Vector List Sample

| № | Program's behavior | API function calls | DLL reference |
|---|---|---|---|
| 1 | Search File | FindClose; FindFirstFileA; FindNextFileA; FindResourceA | KERNEL32.dll |
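The selection procedure of steps (1) to (4) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `select_features` and the input layout (one dictionary of API name to call frequency per traced program) are assumptions made for this example.

```python
from typing import Dict, List


def select_features(
    malicious: List[Dict[str, int]],
    benign: List[Dict[str, int]],
    t: int,
) -> List[str]:
    """Rank API functions by D(A_i) from Eqs. (1)-(3) and return the top t.

    Each list element is an assumed per-program mapping from API name to
    call frequency, such as an API tracer might produce.
    """
    apis = set()
    for counts in malicious + benign:
        apis.update(counts)

    def mean(samples: List[Dict[str, int]], api: str) -> float:
        # Eq. (1): average frequency over one class; absent calls count as 0.
        return sum(c.get(api, 0) for c in samples) / len(samples)

    scores = {}
    for api in apis:
        e_v = mean(malicious, api)
        e_n = mean(benign, api)
        e = (e_v + e_n) / 2                              # Eq. (2)
        scores[api] = (e - e_v) ** 2 + (e - e_n) ** 2    # Eq. (3)

    # Sort by descending D; ties are broken by name for reproducibility.
    ranked = sorted(apis, key=lambda a: (-scores[a], a))
    return ranked[:t]
```

An API function that is called equally often in both classes scores $D(A_i) = 0$ and ends up at the bottom of the ranking, which is the behavior the selection criterion is meant to produce.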
4.2 Detection Algorithm

The result of examining a program is either “benign” or “malicious”. Let $C = \{C_1, C_2\}$ be the class set, where $C_1$ denotes benign and $C_2$ denotes malicious. From each sample file $x$ we obtain a feature set $F$, and our goal is to determine the class of the file from $F$. In our method, a program file is described by a fuzzy set. Given the domain $Q = \{q_1, q_2, \ldots, q_n\}$, a fuzzy set on $Q$ is

$$M = \{\mu_1/q_1, \mu_2/q_2, \ldots, \mu_n/q_n\} \tag{4}$$

where $n$ is the number of features, $\mu_i$ is a real number in $[0, 1]$, and $\mu_i/q_i$ denotes that the test file has feature $q_i$ with degree of membership $\mu_i$. Benign code and malicious code can therefore both be described by fuzzy sets on the domain $Q$.

Let $E(A_i^V)$ be the mean frequency of API function call $A_i$ in the malicious files and $E(A_i^N)$ its mean frequency in the benign files. The membership function of the malicious file set $V$ is constructed from the normal-type fuzzy distribution (F distribution) as:

$$\mu_V(E(A_i^V)) = \begin{cases} 0, & E(A_i^V) < 0 \\ 1 - e^{-(E(A_i^V))^2/\sigma^2}, & E(A_i^V) \ge 0 \end{cases} \tag{5}$$

where $\sigma = \max\{E(A_1^V), E(A_2^V), \ldots, E(A_t^V)\}/3$ and $t$ is the number of features. The membership function of the benign file set $N$ is:

$$\mu_N(E(A_i^N)) = \begin{cases} 0, & E(A_i^N) < 0 \\ 1 - e^{-(E(A_i^N))^2/\sigma^2}, & E(A_i^N) \ge 0 \end{cases} \tag{6}$$

In the same way, the membership function of the test file is expressed in terms of the frequency $A_i$ of each feature in that file:

$$\mu_M(A_i) = \begin{cases} 0, & A_i < 0 \\ 1 - e^{-(A_i)^2/\sigma^2}, & A_i \ge 0 \end{cases} \tag{7}$$
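The membership functions in Equations (5) to (7) share one shape, so a single helper covers all three. The sketch below is illustrative only: the names `sigma_from`, `membership`, and `fuzzy_set` are assumptions introduced here, and the text defines $\sigma$ explicitly only for the malicious set, so which frequencies to pass to `sigma_from` for the other two sets is left to the caller.

```python
import math
from typing import List, Sequence


def sigma_from(frequencies: Sequence[float]) -> float:
    """sigma = max(frequencies) / 3, as defined after Eq. (5)."""
    sigma = max(frequencies) / 3.0
    if sigma <= 0:
        # Guard added for this sketch: the formula is undefined for sigma = 0.
        raise ValueError("at least one frequency must be positive")
    return sigma


def membership(x: float, sigma: float) -> float:
    """Eqs. (5)-(7): 0 for x < 0, otherwise 1 - exp(-x^2 / sigma^2)."""
    if x < 0:
        return 0.0
    return 1.0 - math.exp(-(x * x) / (sigma * sigma))


def fuzzy_set(frequencies: Sequence[float], sigma: float) -> List[float]:
    """Map a vector of feature frequencies to membership degrees."""
    return [membership(x, sigma) for x in frequencies]
```

With $\sigma$ set to one third of the largest frequency, that largest frequency maps to $1 - e^{-9}$, which is about 0.9999, so the membership degrees span almost the whole interval $[0, 1]$.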
During the training step, we compute the frequency of every feature over the malicious code set. Applying the membership function $\mu_V(E(A_i^V))$ gives the fuzzy set $V$:

$$V = \{\mu_1^V/A_1, \mu_2^V/A_2, \ldots, \mu_t^V/A_t\} \tag{8}$$

In the same way, we compute the frequency of every feature over the benign code set and obtain the fuzzy set $N$:

$$N = \{\mu_1^N/A_1, \mu_2^N/A_2, \ldots, \mu_t^N/A_t\} \tag{9}$$

For a test file $M$, we first trace its API function calls to obtain its API call sequence and then compute the frequency of every feature, which gives $\{A_1, A_2, \ldots, A_t\}$, where $t = 88$ in our experiment.
Applying the membership function $\mu_M(A_i)$ then gives the fuzzy set $M$:

$$M = \{\mu_1/A_1, \mu_2/A_2, \ldots, \mu_t/A_t\} \tag{10}$$

In the second step, the degree of similarity $\psi(M, V)$ between $M$ and $V$ and the degree of similarity $\psi(M, N)$ between $M$ and $N$ are computed as follows:

$$\psi(A, B) = 1 - \frac{1}{\sqrt{t}}\left(\sum_{i=1}^{t}\left(\mu_i^A - \mu_i^B\right)^2\right)^{1/2} \tag{11}$$

where $\psi(A, B)$ is the Euclid degree of similarity. In the last step, we determine the class of the test file by Theorem 1.

Theorem 1: Let $A_i$ $(i = 1, 2, \ldots, n)$ and $B$ be fuzzy sets, and let $\psi(A, B)$ be the Euclid degree of similarity between $A$ and $B$. If there exists an $i$ such that

$$\psi(A_i, B) = \bigvee_{1 \le j \le n} \psi(A_j, B),$$

then $A_i$ and $B$ are classified into the same class.

Here $\vee$ denotes the maximum, so the test file $M$ is assigned to the class, malicious or benign, whose fuzzy set $V$ or $N$ has the larger degree of similarity to $M$.
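The second and last steps can be sketched as below. The code assumes that Equation (11) normalizes the Euclidean distance by $\sqrt{t}$, as in the standard Euclidean closeness degree; the function names and the dictionary of class fuzzy sets are hypothetical and serve only this illustration.

```python
import math
from typing import Dict, Sequence


def euclid_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Eq. (11), assuming the form 1 - sqrt(sum((a_i - b_i)^2)) / sqrt(t)."""
    if len(a) != len(b) or not a:
        raise ValueError("fuzzy sets must be non-empty and of equal length")
    t = len(a)
    dist = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return 1.0 - dist / math.sqrt(t)


def classify(test: Sequence[float], classes: Dict[str, Sequence[float]]) -> str:
    """Theorem 1: return the class whose fuzzy set is most similar to the test set."""
    return max(classes, key=lambda name: euclid_similarity(test, classes[name]))
```

Because every membership degree lies in $[0, 1]$, the distance is at most $\sqrt{t}$ and the similarity stays in $[0, 1]$: identical fuzzy sets give 1, and sets that differ by 1 in every component give 0.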
5 Experiment Results

We evaluate our method on the data set in Table 2. The data set consists of PE format executables: 423 benign programs and 209 malicious executables. The malicious executables were downloaded from http://vx.netlux.org and http://www.cs.columbia.edu/ids/mef/ . The clean programs were gathered from a freshly installed Windows 2000 server machine. Each sample was labeled with its class (malicious or benign) by a commercial virus scanner. After verifying the data set, we extracted features from the programs using APISPY.EXE, an API tracing tool that we designed.

To evaluate our system we were interested in several quantities: (1) false negatives, the number of malicious executables classified as benign; and (2) false positives, the number of benign programs classified as malicious. We were interested in the detection rate of the classifier, which in our case is the percentage of all malicious programs that are labeled malicious. We were also interested in the false positive rate, the percentage of benign programs that are labeled malicious, also called false alarms.

For each algorithm we plotted the detection rate using Receiver Operating Characteristic (ROC) curves. The ROC curve in Fig. 2 shows that our method had the lowest false negative rate, 4.45%. Notice that the curve declines slowly as the number of samples increases. This makes the method well suited to detecting malicious code when malicious samples are difficult to obtain. In another experiment [5], we used an algorithm based on K Nearest Neighbor (KNN) to classify the data set in Table 2; the result is shown in Fig. 3. That algorithm had the lowest false positive rate, 4.8%. The Fuzzy Pattern Recognition (FPR) algorithm has a better detection rate than the KNN-based algorithm, but the KNN algorithm consumes fewer computing resources than FPR. The trade-off between detection rate and system overhead must be weighed in practical applications.
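The evaluation quantities defined in this section can be computed from labeled predictions as sketched below. The function name and the boolean encoding (True for malicious) are assumptions made for this illustration; the numbers it produces on toy input are not results from the experiment.

```python
from typing import Sequence, Tuple


def detection_metrics(
    actual: Sequence[bool], predicted: Sequence[bool]
) -> Tuple[float, float, float]:
    """Return (detection rate, false negative rate, false positive rate).

    True means malicious in both sequences (an encoding assumed here).
    """
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)          # malicious, flagged
    fn = sum(1 for a, p in pairs if a and not p)      # malicious, missed
    fp = sum(1 for a, p in pairs if not a and p)      # benign, false alarm
    tn = sum(1 for a, p in pairs if not a and not p)  # benign, passed
    malicious, benign = tp + fn, fp + tn
    if malicious == 0 or benign == 0:
        raise ValueError("both classes must be present in the test set")
    return tp / malicious, fn / malicious, fp / benign
```

The detection rate and the false negative rate always sum to 1, since both are fractions of the malicious samples.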
Table 2. Dataset in experiment

| | Benign Code | Malicious Code | Sum |
|---|---|---|---|
| Sample space | 423 | 209 | 632 |
| Training set | 373 | 159 | 532 |
| Test set | 50 | 50 | 100 |

Fig. 2. Fuzzy Pattern Recognition ROC

Fig. 3. K Nearest Neighbor ROC
6 Conclusion

We presented a method for detecting previously undetectable malicious executables. To our knowledge, this is the first time that a fuzzy pattern recognition algorithm has been used to detect computer viruses. However, the false alarm rate in our experiment appears high. Future work therefore involves extending our learning algorithms to make better use of API call sequences and other features of viruses. We plan to use neural networks to gain higher accuracy and detection rates. We would also like to implement the system on a network of computers to evaluate its time and space performance in real-world environments. Finally, we plan to test this method on a larger set of malicious and benign executables.
References

1. McGraw, G., Morrisett, G.: Attacking malicious code: A report to the Infosec Research Council. IEEE Software. 5 (2000) 33-41
2. Lo, R., Levitt, K., Olsson, R.: MCF: A malicious code filter. Computers & Security. 14 (1995) 541-566
3. Tesauro, G., Kephart, J., Sorkin, G.: Neural networks for computer virus recognition. IEEE Expert. 11 (1996) 5-6
4. Schultz, M., Eskin, E., Zadok, E., Stolfo, S.: Data mining methods for detection of new malicious executables. In: Proceedings of the IEEE Symposium on Security and Privacy. IEEE Press, Los Alamitos, CA (2001) 38-49
5. Zhang, B., Yin, J., Zhang, D., Hao, J.: Unknown computer virus detection based on K-nearest neighbor algorithm. Computer Engineering and Applications. 6 (2005) 7-10