ISSN : 2368-1209129
Ishita Basu et al, American Journal of Advanced Computing, Vol. III (1), 18-37
Malware Detection Based on Source Data using Data Mining: A Survey

Ishita Basu1, Nidhi Sinha2, Diksha Bhagat3, Saptarsi Goswami4

1 Department of Computer Science and Engineering, Institute of Engineering and Management, Kolkata, India
[email protected]
2 Department of Computer Science and Engineering, Institute of Engineering and Management, Kolkata, India
[email protected]
3 Department of Computer Science and Engineering, Institute of Engineering and Management, Kolkata, India
[email protected]
4 Department of Computer Science and Engineering, Institute of Engineering and Management, Kolkata, India
[email protected]
ABSTRACT In this information age, malware has become a serious threat. Malware authors write code that can damage an entire computer, spread over the network, steal secret information and ruin the whole system. In this paper we review the application of data mining methods to malware detection and summarize the detection techniques applied to specific kinds of data. The different data sources are clearly described, and the reviewed papers are summarized in tables according to the data used for detection. The review explains, step by step, what each data source is and how it is used in malware detection. Our proposed approach focuses on API call tracing. We used a dataset of 135 files containing 112 benign files and 23 malicious files. We use the Xptruss tool to obtain API calls. N-grams (e.g. 2, 3, 4) are used to match API call sequences, and a machine learning algorithm, Naïve Bayes, is used for comparison.
KEYWORDS Malware, Data Mining, Detection Techniques, PE File, API call
1. INTRODUCTION Malware means malicious software. Only modest programming skill is needed to write malicious code, such code is easily available on the internet, and malicious-code authoring tools make it even easier to build new malware. Hence, the amount of malware keeps increasing. This paper presents a survey of several approaches to malware detection using data mining. The approaches are categorized by the type of data source. We also describe the types of malware encountered in daily life and the different kinds of analysis that researchers follow for malware detection. Our own work uses the PE file for malware detection. We extract API calls from running processes. For a new, unknown process, we extract its API calls and try to match the sequences of API calls against known ones. If they do not match known benign sequences, there is a high chance that the new process is infected. We apply a comparison algorithm between the previously known infected dataset and the suspected dataset to obtain the result.
The unique contributions of the paper are as follows:
1. Sources of data are categorized and a brief description of each source is given.
2. The surveyed papers are divided into several tables based on their source data.
3. The main idea and data collection of each paper are summarized.
4. The data mining techniques and other methods used by the different approaches are clearly stated.
5. An approach for malware detection using the PE (Portable Executable) file is proposed.
2. MALWARE CLASSIFICATION AND ANALYSIS
Several types of malware spread over the internet every day at high speed, and malware is now found everywhere: on PCs, mobile phones, the internet, etc. Malware can be classified into the following categories.
1. Viruses: Viruses enter a system secretly, replicate themselves and infect the system.
2. Trojan horses: A Trojan horse is a software program that invites the user to run it, delivers harmful content, and gains access to and control over the system.
3. Worms: Worms spread over the internet, exploit network holes, take control of other machines through the network and run as separate processes.
4. Spyware: Without the user's consent, spyware is installed on the user's machine, collects secret information and sends it back to its creator.
5. Rootkits: Rootkits hide malicious programs from the system's process list and try to avoid detection by antivirus programs.
6. Backdoors: Backdoors secretly bypass the user's authentication procedure through the network to gain future access to the user's machine.
7. Logic bombs: A logic bomb is a harmful piece of code inserted intentionally into an application's software and triggered at a specific time or condition.
Malware analysis is categorized into three parts: static analysis, dynamic analysis and hybrid analysis.
Figure 1: Malware Analysis
Static Analysis: Static analysis examines executable code without executing the original file; it looks at the code and uncovers information about it. In static analysis, operational code (opcode) frequency distributions, string signatures, byte sequences, control flow graphs, n-grams, etc. are used to detect malicious code. Knowledge of assembly language and of the operating system is required for static analysis.
Dynamic Analysis: In dynamic analysis the file must be executed. After executing the malicious code, its behaviour is monitored to see how much it affects the host machine, which is why dynamic analysis is also called behavioural analysis. Unknown malware is easier to detect with dynamic analysis. Sandboxes, simulators, virtual machines, etc. are used to analyse the infected code.
Hybrid Analysis: Hybrid analysis includes both static and dynamic analysis. It takes the signature part from static analysis and combines it with the behavioural part of dynamic analysis, so hybrid analysis is more effective than either of the other two, but it also inherits the limitations of both.
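As a concrete illustration of one such static feature, the following minimal Python sketch (not taken from any of the surveyed papers) counts the overlapping byte n-grams of an executable without running it; the file path and n-gram size are placeholders.

```python
# Minimal sketch: static byte n-gram frequencies of a binary file.
# The file name and the choice n=2 are illustrative assumptions only.
from collections import Counter

def byte_ngram_counts(path, n=2):
    """Count overlapping n-grams of raw bytes in a file (no execution)."""
    with open(path, "rb") as f:
        data = f.read()
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))

# Example use: inspect the most common 2-byte sequences of a sample.
# counts = byte_ngram_counts("sample.exe", n=2)
# print(counts.most_common(10))
```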
3. SURVEY ON EXISTING WORK
3.1 SOURCES FROM WHICH DATA HAS BEEN COLLECTED
We surveyed malware detection work that uses data mining methods. In this paper we emphasize the source data on which the experiments for detecting new and existing malware were performed. Broadly, the source data are based on the PE file, the assembly file and the kernel. From a PE file, data can be collected in three ways: 1. the API call sequence can be extracted from the PE file; 2. a byte sequence of the PE file can be collected; 3. a PE file has two parts, header and sections; the header contains various fields, and meaningful features can be extracted from it and used as experimental data. The assembly file contains opcodes, and the frequency of opcode sequences can be extracted and used as data. The kernel is the most important part of the operating system and malicious code often attacks it, so features extracted from the kernel may show noticeable changes when it is infected.

Table 1: Sources of Data Collected for Malware Detection
Sl. No. | Source | Types of Data Used
1 | PE file | The API call graph; byte sequence; features extracted from PE header and sections
2 | Assembly file | Opcode sequence frequency of instructions
3 | Kernel | Features extracted from the kernel

The types of data listed in Table 1 are described below.
1. API call graph: An API (Application Program Interface) is a set of commands, functions and protocols used to build software for a specific operating system. APIs are predefined functions used to build new software.
Figure 2: API Graph Construction
2. Byte sequence: Data are collected as byte sequences extracted from files and used as features. The frequency of byte codes is analysed; the byte-code frequencies of malicious code and benign code may differ.
3. Features extracted from the PE header and sections: A PE file consists of a number of headers and sections organized as a linear stream of data. It contains different features such as DLLs, API function calls, instruction-code frequencies, API call sequences, text-based search techniques, hashing, the attribute certificate, date/time stamp, file pointer, linker information, CPU type, and the PE logical structure (section alignment, code size, debug flags), as illustrated by the sketch after Figure 3.
Figure 3: PE File Format
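For illustration, the following hedged Python sketch shows how header fields of the kind listed above could be read with the third-party pefile package; pefile is not the tool used in the surveyed papers, and the selected fields and file name are examples only.

```python
# Sketch of extracting a few PE-header features with the "pefile" package.
# The chosen fields and the file name are illustrative assumptions.
import pefile

def pe_header_features(path):
    pe = pefile.PE(path)
    return {
        "NumberOfSections": pe.FILE_HEADER.NumberOfSections,
        "SizeOfCode": pe.OPTIONAL_HEADER.SizeOfCode,
        "SizeOfInitializedData": pe.OPTIONAL_HEADER.SizeOfInitializedData,
        "SizeOfImage": pe.OPTIONAL_HEADER.SizeOfImage,
        "CheckSum": pe.OPTIONAL_HEADER.CheckSum,
        "DllCharacteristics": pe.OPTIONAL_HEADER.DllCharacteristics,
        # Number of imported DLLs, if the import directory is present.
        "NumberOfImportedDlls": len(getattr(pe, "DIRECTORY_ENTRY_IMPORT", [])),
    }

# features = pe_header_features("sample.exe")
```

A feature vector of this kind can then be fed to any of the classifiers discussed in the surveyed papers.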
4. Opcode sequence frequency extracted from instructions: Each assembly file contains a list of opcodes, and the frequency of each opcode sequence is analysed.
5. Features extracted from the kernel: Features are collected from the kernel in terms of system calls relating to files, registries, processes and threads, networking, and memory sections. Sometimes a kernel function is directly treated as a feature.
3.2 DATA, METHODS, IDEAS
We summarize the surveyed papers in five tables. For each paper we specify the data source and how it was collected, the data mining techniques and other methods used, and a summary of its detection approach. The tables are given below.
Table 2a: Summary of API call graph as Data, Methods, and Ideas (source data: the API call graph). Each entry gives the main idea, the data source and collection, and the data mining techniques or other methods used.

1. Main idea: Detect malware using data-dependent API call graphs extracted from PE files. Graph matching (based on the Longest Common Subsequence (LCS) algorithm) and graph similarity are used to trace common subgraphs in the data-dependent API call graph. • Unpack the PE file and extract API calls from it. • Extract the data-dependent API call graph. • Compare it with malware API call graphs. • Apply the LCS graph matching algorithm to find the subgraph with the highest similarity. • Finally, evaluate the result. [1]
Data source & collection: The malware dataset was downloaded from the http://www.nexginrc.org/ web site.
Data mining, methods, etc.: API call extraction, data-dependent API call extraction, LCS graph matching algorithm, N-gram, ROC curves.

2. Main idea: • They took two datasets, of malicious files and innocent files. • API calls are extracted from DLLs by a PE parser and stored in a data mine. • Machine learning is then applied, with classification by the Random Forest algorithm and the Information Gain algorithm. • Knowledge is derived from that and results are generated. [2]
Data source & collection: The dataset was extracted from over 5000 executables.
Data mining, methods, etc.: Decision Tree, Naive Bayes, Random Forest, Information Gain algorithm.

3. Main idea: • After collecting data, they adopted two approaches for feature extraction, based on feature frequency and on extraction of features common to a class. • The reduced set was processed to form an ARFF database in a Boolean format ("0" and "1"). • The ARFF representation of the features was supplied to WEKA for the application of machine learning algorithms and statistical analysis. • The ZeroR algorithm was used as a baseline for all experiments; then two decision-tree algorithms (J48 and Random Forest), one rule-based algorithm (JRip), one Bayesian algorithm (Naive Bayes) and one support vector machine algorithm (SMO) were applied. • Statistical analysis of the results was performed. [3]
Data source & collection: A data set of 137 files containing 119 benign files and 18 spyware files. Benign files were collected from http://download.com and malicious files from http://spywareguide.com.
Data mining, methods, etc.: N-grams (size 4, 5, 6) for feature extraction; machine learning algorithms such as J48, Random Forest, JRip, SMO and Naive Bayes; 10-fold cross-validation; ROC curve.

4. Main idea: • They developed the Intelligent Malware Detection System (IMDS) using Objective-Oriented Association (OOA) mining based classification of Windows API calls. • It is an integrated system consisting of three major modules: PE parser, OOA rule generator and rule-based classifier. • They used the OOA_Fast_FP-Growth algorithm to obtain all association rules with certain support and confidence thresholds. • After that they applied Naïve Bayes and SVMs (support vector machines) for classification. • Feature selection was done with the Max-Relevance algorithm to select the set of API calls with the highest relevance to the target class, i.e. the file type. [4]
Data source & collection: 1,001 benign executables gathered from the system files of the Windows 2000/NT and XP operating systems, and 3,265 malicious executables from several FTP sites.
Data mining, methods, etc.: OOA_Fast_FP-Growth algorithm, Max-Relevance algorithm, Naïve Bayes algorithm, SVMs (support vector machines).

5. Main idea: • First construct the API execution sequences by developing a PE parser. • Then extract OOA rules using the OOA Fast FP-Growth algorithm. • Finally, conduct classification based on the association rules. [5]
Data source & collection: 29,580 executables, of which 12,214 are benign and 17,366 are malicious; these executables called 12,964 APIs in total. The malicious executables were collected in the KingSoft Antivirus Laboratory and the benign executables were gathered from the system files of the Windows 2000/NT and XP operating systems.
Data mining, methods, etc.: OOA_Fast_FP-Growth algorithm, Max-Relevance algorithm, OOA mining, Naive Bayes classifier, Decision Tree (version J4.8).

6. Main idea: They proposed a framework that combines signature-based with behaviour-based detection using an API graph system. They extract low-level information such as Control Flow Graphs (CFGs), Data-Flow Graphs (DFGs) and system call analysis. • They build their graph in different ways and analyse and compare graphs using different methods; to build the graph, most researchers represent graph nodes as system calls. • They create their graph by transforming the PE file into a call graph whose nodes are system calls and whose edges are the system call sequence; the call graph is then minimized into a code graph to speed up analysing and comparing graphs. [6]
Data source & collection: Not mentioned.
Data mining, methods, etc.: Control Flow Graphs (CFGs), Data-Flow Graphs (DFGs), system call analysis.

7. Main idea: They proposed a multi-view classification algorithm based on the local property that describes the affinity between API functions in network, file I/O or other operations, evaluated on a large corpus of malicious code. • Their system is divided into two parts: offline analysis and online analysis. • The offline part constructs the classification model by training on historical data. • The online part performs three functions: getting the API call sequences of an unknown piece of software, extracting features from the API call sequences, and verifying whether it is malicious. • They first collected API call sequences and characterized them using a 4-gram algorithm. • They designed a multiple-classifier fusion algorithm based on the BKS (behaviour knowledge space) proposed by Huang [7]: the API call sequence of a piece of software is divided into seven sub-sequences according to the API categories mentioned in their Section 3.1; each sub-sequence is classified with machine learning algorithms, giving seven results; these are fused with the BKS algorithm or other ensemble methods, and the fusion result is used as the software's label. • The proposed classifier is compared with other classifiers such as LibSVM, Id3, J48, Naive Bayes and SMO. [8]
Data source & collection: 817 sample programs including benign programs, trojans, viruses and worms, downloadable from http://vx.netlux.org/vl.php. They ran Microsoft Windows XP in VMware and used API Monitor (http://www.APImonitor.com/) as a "hook" program to catch the API call sequences of a program.
Data mining, methods, etc.: 4-gram algorithm; classifiers such as LibSVM, Id3, J48, Naive Bayes and SMO; 10-fold cross-validation.

8. Main idea: Their paper examines the accuracy of behaviour-based detection systems in which Application Programming Interface (API) calls are analysed and monitored, and identifies the problems that affect the accuracy of such detection models. They proposed an approach using the costimulation phenomenon that occurs inside Human Immune Systems (HIS). • They used unsupervised and supervised classifiers. • In the unsupervised classifier each vector represents a six-element window of a long API sequence, and an unsupervised Self-Organizing Map neural network is used for this test. • The supervised classifier predicts the status of an API call vector with a supervised neural network that uses Feedforward Back-Propagation (FFBP) as the training algorithm. [9]
Data source & collection: The work analysed 10,000 PE applications. Malicious samples were collected from two locations: vx.netlux.org and www.offensivecomputing.net.
Data mining, methods, etc.: Filtration, costimulation.

9. Main idea: • They target metamorphic malware with a CFG (Control Flow Graph) based detection method, because CFGs have been successful in detecting simple malware. • They first generate the API call graph and enrich the simple CFG with beneficial information by assigning the called APIs to the CFG. • The resulting sparse graph is converted to a vector to reduce the complexity of the graph mining algorithms. [10]
Data source & collection: 2,140 benign Windows PE files and 2,305 Windows 32-bit network worms, selected randomly from the malware repository of the APA malware research center at Shiraz University.
Data mining, methods, etc.: Classifiers such as SMO, Naïve Bayes, Random Tree, Lazy K-Star and Random Forest; cross-validation; graph isomorphism technique.

10. Main idea: • Their approach is static malware detection for resource-limited mobile environments. • They extracted function calls from binaries in order to apply their clustering mechanism, called the centroid. • The method is capable of detecting unknown malware, and they investigate where the mechanism might find application at distribution channels. • The proposed mechanism is also suitable for use directly on smartphones for pre-checking installed applications. [11]
Data source & collection: 33 Symbian OS 2nd edition malware samples (http://www.dailabor.de) and 49 benign programs. 3,620 unique function calls were discovered, of which 254 appeared only in malware.
Data mining, methods, etc.: Feature extraction, Centroid Machine, Naive Bayes, binary SVM, ten-fold cross-validation.

11. Main idea: They used the information gain method as a feature selector combined with the evolving clustering method as an evolving learning classifier. The proposed approach consists of three phases. • In the first phase it analyses the Windows API execution sequences called by the malware PE files using the malware PE analyser. • The most informative features are selected in the second phase: (I) correlation measures based on the theoretical concept of entropy, a measure of the uncertainty of random variables, are used; (II) the classification power of each feature is derived by calculating its information gain (IG) from the number of its appearances in the malicious class and the benign class; (III) features with negligible information gain are removed to reduce the number of features and speed up classification. • The evolving clustering method then performs the detection in the third phase: (I) a classifier divides the data set into a number of classes in the n-dimensional input space by evolving rule nodes; (II) each rule node rj is associated with a class through a label; (III) its receptive field R(j) covers a part of the n-dimensional space around the rule node. [12]
Data source & collection: 2,700 Windows malicious executables and 2,300 benign Windows executables. 1,000 malicious executables were randomly selected from the VX Heavens malware collection (VX Heavens, 2011) and 1,700 were collected from the Internet.
Data mining, methods, etc.: Correlation analysis, feature selection, information gain; classifiers such as Naïve Bayes, decision tree and neural network.

12. Main idea: They proposed an approach that extracts the API call frequency from binary or executable files and applies similarity measures to identify malware. • Assembly code is first extracted from the binary executables using the IDA Pro disassembler. • The frequency of important features is extracted and stored in a Slate database for batch analysis. • API calls are extracted from the database and similarity analysis is applied. • In the similarity analysis the unknown binary executables are compared using Cosine, Extended Jaccard measure and Pearson correlation. [13]
Data source & collection: Not mentioned.
Data mining, methods, etc.: Cosine similarity, Extended Jaccard measure, Pearson correlation analysis.

13. Main idea: Their proposed architecture for detecting malicious patterns in executables is resilient to common obfuscation transformations. They built a prototype tool called SAFE (a static analyzer for executables). • The technique first searches for the virus signature (a virus-specific sequence of instructions) inside the program. • If the signature is found, there is a high probability that the program is infected. • They built an abstract representation of the malicious program for detecting the virus; this representation is a "generalization" of the malicious code. • With this generalization and abstract representation of the malicious code, infected executables can be detected. [14]
Data source & collection: They used a Microsoft Windows 2000 machine for their experiment.
Data mining, methods, etc.: Control flow graph, code transposition, instruction substitution.

14. Main idea: They used API call sequences for malware detection, which can also successfully detect zero-day malware. • Disassembly code is extracted from binary executables using the IDA Pro disassembler. • Using IDA2SQLite, the API sequence is extracted from the disassembly code. • It is compared with the DB-API signature database. • The dataset is divided into training and testing sets. • A supervised learning algorithm is applied to the training set. • 10-fold cross-validation is carried out on both the training and testing data sets. [15]
Data source & collection: 29,580 binary executables, of which 12,214 were clean and 17,366 malicious. The malicious programs were collected from the Honeynet project, VX Heavens (VX Heavens 2011) and some other sources.
Data mining, methods, etc.: Naïve Bayes algorithm, SVM, SMO (Sequential Minimal Optimization), J48 classifier, KNN algorithm, ANN algorithm, 10-fold cross-validation.

15. Main idea: Their approach is based on API call sequences using Object Oriented Association (OOA) mining. • After decomposing the PE code, the API sequence is extracted using a PE parser. • It is stored in a signature database. • Object-oriented association rules are applied and stored in a rule database. • The results are compared with different classification methods and results are generated. [16]
Data source & collection: 29,580 Windows PE executables, of which 12,214 were recognized as benign and 17,366 as malicious. The malicious executables were collected from the KingSoft Corporation laboratory and the benign executables from Windows 2000/NT and XP operating system environments.
Data mining, methods, etc.: OOA Fast FP-Growth mining algorithm, Naïve Bayes, decision tree, SVM (Support Vector Machine), J4.8 classifier.
Table 2b: Summary of Byte sequence as Data, Methods, and Ideas (source data: byte sequence). Each entry gives the main idea, the data source and collection, and the data mining techniques or other methods used.

1. Main idea: • They use the Multi-Naïve Bayes algorithm, which uses byte sequences in a file as features. • It is basically a collection of Naïve Bayes algorithms obtained by splitting the data into sets. • Malicious executables have common intentions and may share similar byte code, so malicious executables can be detected by frequency analysis of the byte code in a file. • Byte sequences are extracted using the hex dump tool in Linux. [17]
Data source & collection: The dataset consists of 312 benign (non-malicious) executables and 614 spyware executables. The spyware was collected from http://vx.netlux.org and the benign executables from the system files of the Windows XP operating system and from the programs of a typical user.
Data mining, methods, etc.: TP, TN, FP, FN, detection rate, false positive rate and overall accuracy are analysed for the Multi-Naïve Bayes algorithm with window sizes 2 and 4, with and without Trojans.

2. Main idea: They proposed a method where the generated classifiers are used to classify new, previously unseen binaries as either legitimate software or spyware. • Byte sequences are first extracted from hexadecimal dumps as n-grams of different sizes. • N-gram generation and feature extraction are performed. • A feature reduction method (CBFE) is applied to reduce the data set. • The reduced feature set is converted into the Attribute-Relation File Format (ARFF), ASCII text files that include a set of data instances, each described by a set of features. • The ARFF file is used as a training set and given as input to train the J48 classifier. • The trained J48 classifier is used for prediction on unseen binaries. • 10-fold cross-validation is used to evaluate classifier performance. [18]
Data source & collection: 51 binaries, of which 31 are benign and 20 are spyware. Benign files were collected from download.com and spyware files from spywareguide.com.
Data mining, methods, etc.: Byte-sequence generation using "xxd" (a UNIX-based utility), n-gram generation, feature extraction, feature reduction (CBFE), J48 classifier, 10-fold cross-validation.

3. Main idea: Their approach detects unknown viruses automatically by using heuristic rules based on expert knowledge together with data mining techniques. • In their context it is a classification problem: determine whether a program is malicious or benign. • The main problem of classification is how to extract features from the captured runtime instruction sequences, so they use instruction association. [19]
Data source & collection: 267 Win32 viruses from VX Heavens and 368 benign executables consisting of Windows system executables, commercial executables and open-source executables.
Data mining, methods, etc.: Logic assembly, abstract assembly, feature selection, classification models (SVM, C4.5 decision tree), 5-fold cross-validation.

4. Main idea: They proposed a formal learning framework for malware detection based on Kolmogorov complexity. • Each executable was converted to hexadecimal codes and all headers and white space were removed. • The main idea is to treat the informative content of a string s as the size of the ultimate compression of s, K(s). • The framework uses a learning methodology that works on unstructured raw executables with an underlying statistical compression model. • Two compression models are constructed, one from a collection of malicious code and one from a collection of benign code. • A new executable is classified according to which of the resulting models compresses its code more efficiently. [20]
Data source & collection: A set of 2,000 distinct Windows EXE and DLL files collected from the "Program Files" and "system32" folders on the authors' Windows 7 machine, among which 300 (15%) were randomly selected for training and 1,700 (85%) used for testing. 1,000 malicious samples were collected from the VX Heavens web site (http://vx.netlux.org), among which 300 (30%) were randomly chosen for training and 700 (70%) used for testing.
Data mining, methods, etc.: Prediction by Partial Matching (PPM), Dynamic Markov Compression (DMC), ROC curve.

5. Main idea: They proposed a scalable clustering approach to identify and group malware samples that exhibit similar behaviour. • Dynamic analysis is used to obtain the execution traces of malware programs. • These execution traces are generalized into behavioural profiles, which characterize the activity of a program in more abstract terms. • The profiles are used as input to an efficient clustering algorithm. [21]
Data source & collection: They clustered a data set of more than 75,000 samples in less than three hours.
Data mining, methods, etc.: Locality Sensitive Hashing (LSH), hierarchical clustering.
Table 2c: Summary of Features Extracted from PE header and sections as Data, Methods, and Ideas (source data: features from the PE header and sections). Each entry gives the main idea, the data source and collection, and the data mining techniques or other methods used.

1. Main idea: • Construct a Support Vector Machine (SVM) classifier for each client. • To keep the classifier up-to-date, one machine acts as a server to collect reports from all clients. • The classifier is retrained and the new classifier redistributed to each client. [22]
Data source & collection: 407 spyware programs collected from the websites mmbest.com/index.html, www.kobayashi.cjb.net/, www.xfocus.net/index.html, www.hf110.com/Index.html, www.hacker365.com, www.heibai.net/main.htm and www.chinesehack.org/, from Jan. 2005 to Jun. 2005, and 740 benign programs of similar size to the spyware from the web site http://toget.pchome.com.tw. The experiment dataset can be found at http://140.118.155.64/MyResearch/spyware.aspx.
Data mining, methods, etc.: Static vs. dynamic analyses, information gain, Support Vector Machines, k-fold cross-validation.

2. Main idea: • They proposed a method that extracts a small number of features from the PE header and compares various data mining methods to increase the accuracy of the clusters, which in turn improves the accuracy of detecting benign or malicious executables. • The features used were file size, number of sections, number of unknown sections, number of DLLs called, size of code, size of initialized data, size of uninitialized data, size of image, checksum, DLL characteristics, size of stack reserve, size of stack commit and number of directories. • They compared three unsupervised data mining techniques: hierarchical clustering, K-means and SOM. • The main objective is to develop a method that improves the efficiency of distinguishing benign and malicious executables using data mining, feature reduction and classification algorithms based on static malware analysis. [23]
Data source & collection: Three datasets of malware and benign binaries were created. Malware was collected from http://vxheaven.org/vl.php and benign files were gathered from various versions of Microsoft Windows operating system files and various Windows applications, on 32-bit and 64-bit. The first dataset contains 22,172 binaries, the second 14,467 and the third 11,960 binaries, comprising both DLLs and executables.
Data mining, methods, etc.: Feature extraction, feature selection, hierarchical learning algorithm, K-means learning algorithm and Self-Organizing Map algorithm.

3. Main idea: • They try to find PE-format-specific features that can be statically extracted from PE files to distinguish between benign and malicious files. • Each extracted PE format feature is analysed. • Seven features are chosen: Debug Size, Image Version, IatRVA, Export Size, Resource Size, VirtualSize2 and Number of Sections. • The dataset of 100,000 malware and 16,000 clean files is divided into five parts, and those features are used as input to IBk, J48, J48 Graft, PART, Random Forest and Ridor for classification; J48 was the best of the six classifiers tried for these data. • Finally, ten-fold cross-validation is run and the true positive rate, false positive rate and accuracy are evaluated. • The results of this classification can be used by antivirus programs to improve their detection rates. [24]
Data source & collection: Benign files were collected from base installations of Windows XP and Windows 7, and malicious files from a subset of the VX Heavens archive at vx.netlux.org.
Data mining, methods, etc.: Feature extraction (with Microsoft's pedump utility), feature selection, classification by the IBk, J48, J48 Graft, PART, Random Forest and Ridor classifiers.

4. Main idea: Spyware detection using data mining classification algorithms based on static anomaly detection that can detect known and new, previously unseen spyware from Windows Portable Executable (PE) files. • Collect a data set to perform experiments, in the absence of a known spyware dataset. • Find and select the most related features and feature types appropriate for the mining process. • Use data mining classification techniques (Support Vector Machines, Naive Bayes and Decision Tree) and find the best one to detect spyware. • Try to increase the detection rate and reduce the false positive and false negative rates. • Design and build a spyware detection model based on static anomaly detection using data mining. [25]
Data source & collection: 2,084 spyware and 1,065 benign Windows Portable Executable (PE) files were collected. The dataset consisted of a total of 4,266 programs, containing 3,265 malicious and 1,001 clean programs.
Data mining, methods, etc.: Data cleaning, data integration, data selection, data transformation, classification algorithms (Random Forest, Naïve Bayes (NB), K-Nearest Neighbor (kNN), JRip, J48 decision trees and support vector machines (SVMs)), ROC curve, pattern evaluation, knowledge representation.

5. Main idea: The basic method (Mal-ID) is a new static (form-based) analysis methodology that uses common segment analysis to detect malware files. It works in two phases, a setup phase and a malware detection phase. Setup phase: • Build a Common Function Library (CFL). • Build a Threat Function Library (TFL). • The CFL repository is constructed from benign files and the TFL repository from malware files. Malware detection phase: • The file is first broken into segments. • After calculating segment entropy, features (3-grams) are extracted for each segment. • For each file segment, the features are aggregated using the CFL and TFL separators to create indices. • Segments are then filtered using the computed indices. • A second-level index aggregation is performed. • Finally the files are classified. [26]
Data source & collection: 2,627 benign files were gathered from programs installed under the Windows XP Program Files folders, and 849 malware files were gathered from the Internet, with lengths ranging from 6 KB to 4.25 MB (200 executables were above 300 KB).
Data mining, methods, etc.: Common segment analysis, supervised learning, 3-grams.

6. Main idea: They used automated behaviour-based malware detection with machine learning techniques. • The behaviour of each malware sample in an emulated (sandbox) environment is automatically analysed to generate behaviour reports. • These reports are preprocessed into sparse vector models for further machine learning (classification). • Based on the analysis of the tests and experimental results of all classifiers, the overall best performance was achieved by the J48 decision tree. [27]
Data source & collection: A total of 220 unique malware (Indonesian malware) samples was acquired. A total of 250 benign samples was collected from system files located in the "System32" directory of a clean installation of Windows XP Professional 32-bit with Service Pack 2.
Data mining, methods, etc.: Automatic behaviour monitoring, data preprocessing, classifiers such as k-Nearest Neighbors (KNN), Naïve Bayes, Support Vector Machines (SVMs), J48 decision tree and Multilayer Perceptron neural network (MLP).

7. Main idea: They created an architecture for detecting malware that consists of three main modules: (1) PE-Miner, (2) feature selection and data transformation, and (3) learning algorithms. • PE-Miner extracts DLL calls, API functions and PE header information from the PE file. • These are sent to a feature database. • Feature selection and transformation are performed, the dataset is divided into training and testing sets, and after the learning algorithm, classification is done. [28]
Data source & collection: The dataset consists of 247,348 Windows programs in PE file format, of which 236,756 are malicious and 10,592 are clean. The malicious programs were collected from several sources such as the VX Heavens Virus Collection database, the authors' local laboratory and a colleague's laboratory. The benign programs were collected from Download.com, Windows system files and the authors' local laboratory program files.
Data mining, methods, etc.: SVM, J48 and NB classification algorithms.
Table 2d: Summary of Opcode sequence frequency as Data, Methods, and Ideas (source data: opcode sequence frequency from instructions). Each entry gives the main idea, the data source and collection, and the data mining techniques or other methods used.

1. Main idea: This paper describes how to use an opcode-sequence-frequency representation of executables to detect and classify malware, and provides empirical validation with an extensive study of data-mining models for detecting and classifying unknown malicious software. • They first take the opcode and discard the operand from each instruction. • A weight is computed for each opcode; this weight represents the relevance of the opcode to malicious and benign executables, based on whether it appears more frequently in malware or in benign executables. Computing it follows three steps: obtain the assembly files with the help of the NewBasic Assembler; generate an opcode profile based on the assembly files, i.e. a list of operation codes and the number of times each opcode appears within both the benign and the malicious dataset; compute the opcode relevance. • Feature extraction is done from the assembly code. • Classification is done with machine learning algorithms (supervised, unsupervised and semi-supervised learning): decision trees, a Support Vector Machine classifier, the K-Nearest Neighbor (KNN) algorithm and Bayesian networks. • k-fold cross-validation is used to evaluate the performance of the machine learning classifiers. After learning the model, they measure the required representation time, number of features, feature selection time, training time and testing time when testing the model. [29]
Data source & collection: Malware samples were collected from the VX Heavens website, which contains 17,000 malicious programs covering 585 malware families that represent Trojan horses, viruses, worms, etc. For the benign dataset, 1,000 legitimate executables were collected from their own computers.
Data mining, methods, etc.: Feature extraction, decision tree, classification using a Support Vector Machine (SVM) classifier, K-Nearest Neighbor (KNN) algorithm, Bayesian networks, k-fold cross-validation.

Table 2e: Summary of Features extracted from Kernel as Data, Methods, and Ideas (source data: features extracted from the kernel). Each entry gives the main idea, the data source and collection, and the data mining techniques or other methods used.

1. Main idea: Their approach makes use of two types of kernels: 1) a Gaussian kernel and 2) a spectral kernel. The notions of similarity that these two kernels measure are quite distinct and they complement each other very well. 1) The Gaussian kernel searches for local similarities between the adjacency matrices. It works by taking the exponential of the squared distances between corresponding edges in the weighted adjacency matrices; the motivation behind this kernel is that two different classes of programs should have different pathways of execution, which would result in a low similarity score. 2) The spectral kernel is based on spectral techniques. These methods use the eigenvectors of the graph Laplacian to infer global properties about the graph; the eigenvectors encode global information about the graph's smoothness, diameter, number of components and stationary distribution, among other things. [30]
Data source & collection: A data set containing 615 instances of benign software and 1,615 instances of malware.
Data mining, methods, etc.: Combined kernel, Gaussian kernel, spectral kernel, n-grams (2 to 6), Support Vector Machine (SVM), Markov chain.

2. Main idea: They studied the diversity of system calls, and their analysis of data based on system call sequences demonstrates that simple malware detectors can face significant challenges in such environments. • Their proposed model characterizes the general interactions between benign programs and the operating system (OS). • Their system-centric approach models how benign programs access OS resources (e.g. files and registry entries). • The approach captures well the behaviour of benign programs and raises very few false positives while being able to detect a significant fraction of recent malware. [31]
Data source & collection: They collected 114.5 GB of data from running Windows XP machines, consisting of 1.556 billion system calls from 362,600 processes and 242 distinct applications. Their system collected data from each machine at an average rate of 8.2 MB/minute, with heavily used machines producing logs at 40 MB/minute and idle machines at 1.5 MB/minute.
Data mining, methods, etc.: Preprocessing, model generation, n-gram analysis.
4. PROPOSED APPROACH
We have worked on malware behaviour and its detection. We collected benign and infected datasets from our college laboratory, in a Windows 2007 and XP operating system environment, of which 30% is used for training and 70% for testing. We work on PE (Portable Executable) files. We used Xptruss, a tool that detects API calls, to collect the API calls of running processes. It creates a text file for each process and stores all the API calls obtained from that process. It traces the link-time DLLs and exported functions of the current processes. The traced API calls can be exceptions that occurred during runtime, threads, methods (get, set), etc. Examples are given below.

Exception:
NEW EXCEPTIONS FOUND : 1
IMM32.DLL!CtfImmIsTextFrameServiceDisabled : Found new exception: CtfImmIsTextFrameServiceDisabled Instrumented
Figure 4: Exception Found In OUT Process

Thread:
IMM32.DLL!CtfImmIsCiceroStartedInThread : Instrumented
ole32.dll!CoVrfCheckThreadState : Instrumented
Figure 5: Thread Found In OUT Process

Methods:
ADVAPI32.dll!WmiExecuteMethodW : Instrumented
GDI32.dll!D3DKMTGetMultisampleMethodList : Instrumented
Figure 6: Different Methods In OUT Process
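A minimal sketch of how such per-process trace files can be turned into API call sequences is given below; the "DLL!Function : ..." line format is only inferred from the excerpts above, so the regular expression and file name are assumptions rather than the exact Xptruss format.

```python
# Sketch: parse one per-process trace file into an ordered list of API names.
# The line pattern "SOMELIB.dll!FunctionName : ..." is assumed from the
# examples shown above; real Xptruss output may differ.
import re

API_LINE = re.compile(r"^([\w.]+\.(?:dll|DLL))!(\w+)\s*:")

def api_sequence(trace_path):
    calls = []
    with open(trace_path, encoding="utf-8", errors="ignore") as f:
        for line in f:
            m = API_LINE.match(line.strip())
            if m:
                calls.append(m.group(2))  # keep only the function name
    return calls

# seq = api_sequence("out_process.txt")
```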
We apply n-grams over the dataset. We use 2-, 3- and 4-grams and determine which n-gram size gives the maximum matching of API call sequences; the 3-gram gives the best result.
Figure 7: Finding API Call Sequences Using R
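The matching step can be sketched as follows (an illustrative reconstruction, not the exact code behind Figure 7): the n-grams of the known benign sequences form a reference set, and a new process is scored by the fraction of its n-grams that appear in this set.

```python
# Sketch of n-gram matching of API call sequences; n = 3 follows the
# observation above that 3-grams gave the best result.
def ngrams(seq, n=3):
    return {tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)}

def match_fraction(new_seq, benign_seqs, n=3):
    """Fraction of the new sequence's n-grams already seen in benign traces."""
    known = set()
    for s in benign_seqs:
        known |= ngrams(s, n)
    grams = ngrams(new_seq, n)
    return len(grams & known) / len(grams) if grams else 0.0

# A low match fraction suggests the process behaves unlike known benign ones
# and may be flagged as suspicious; the threshold is an open parameter.
```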
We compare the API call sequences of the training set and the test set. If the API call sequences do not match properly with the API call sequences of the previously known benign data set, then we can conclude that the data might be infected. We also apply a machine learning algorithm (Naïve Bayes) on the same data set to compare the results.
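For comparison, the Naïve Bayes step can be sketched with scikit-learn as below; this is an assumed implementation, not our exact tool chain: the 3-gram features and the 30%/70% train/test split are taken from the description above, and all function names belong to scikit-learn.

```python
# Sketch: Naive Bayes over API-call 3-gram count features with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def classify(api_sequences, labels):
    """api_sequences: list of API-name lists; labels: 0 = benign, 1 = malicious."""
    docs = [" ".join(seq) for seq in api_sequences]           # one document per file
    vec = CountVectorizer(ngram_range=(3, 3), token_pattern=r"\S+")
    X = vec.fit_transform(docs)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, labels, train_size=0.3, random_state=0)            # 30% train / 70% test
    model = MultinomialNB().fit(X_tr, y_tr)
    return accuracy_score(y_te, model.predict(X_te))
```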
Figure 8: Proposed Framework
5. CONCLUSIONS Malware, or malicious code, infects users' machines and steals their secure and private information without permission. It also damages the machine, spreads over the network or internet and takes control. In this literature survey, different approaches to detecting malware are explored in which data mining models are applied to various types of data. The survey gives a clear picture of malware detection procedures and will help further research on detecting new and unknown malware. The proposed approach can help detect both new, unseen malware and known malware. We intend to implement the proposed approach in antivirus software in the near future.
ACKNOWLEDGEMENTS We would like to thank everyone who encouraged our work and gave us support. Special thanks to our supervisor, who guided and encouraged us at every step whenever we encountered a problem in our work.
REFERENCES
[1] Ammar Ahmed E. Elhadi, Mohd Aizaini Maarof and Bazara I. A. Barry. "Improving the Detection of Malware Behaviour Using Simplified Data Dependent API Call Graph." International Journal of Security and Its Applications, Vol. 7, No. 5 (2013), pp. 29-42.
[2] Priyank Singhal, Nataasha Raul. "Malware Detection Module using Machine Learning Algorithms to Assist in Centralized Security in Enterprise Networks."
[3] Raja M. Khurram Shahzad, Syed Imran Haider. "Detection of Spyware by Mining Executable Files." Master Thesis, Computer Science, Thesis no. MSC-2009:5.
[4] Yanfang Ye, Dingding Wang, Tao Li, Dongyi Ye, Qingshan Jiang. "An intelligent PE-malware detection system based on association mining." Springer-Verlag France 2008, J Comput Virol (2008) 4:323-334, DOI 10.1007/s11416-008-0082-4.
[5] Yanfang Ye, Dingding Wang, Tao Li, Dongyi Ye. "IMDS: Intelligent Malware Detection System." Industrial and Government Track Short Paper, KDD'07, August 12-15, 2007, San Jose, California, USA. ACM 978-1-59593-609-7/07/0008.
[6] Ammar Ahmed E. Elhadi, Mohd Aizaini Maarof, Ahmed Hamza Osman. "Malware Detection Based on Hybrid Signature Behaviour Application Programming Interface Call Graph." American Journal of Applied Sciences 9 (3): 283-288, 2012, ISSN 1546-9239.
[7] Y. S. Huang, et al. "The Behavior-Knowledge Space Method for Combination of Multiple Classifiers." Proceedings of Computer Vision and Pattern Recognition, IEEE, 1993, pp. 347-352.
[8] Guo Shanqing, Yu Qixia, Lin Fengbo, Wang Fengyu, Ban Tao. "A Malware Detection Algorithm Based on Multi-view Fusion."
[9] Saman Mirza Abdulla, Miss Laiha Mat Kiah, Omar Zakaria. "Minimizing Errors in Identifying Malicious API to Detect PE Malwares Using Artificial Costimulation." International Conference on Emerging Trends in Computer and Electronics Engineering (ICETCEE'2012), March 24-25, 2012, Dubai.
[10] Mojtaba Escandari, Sattar Hashemi. "Metamorphic Malware Detection using Control Flow Graph Mining." IJCSNS International Journal of Computer Science and Network Security, Vol. 11, No. 12, December 2011.
[11] Aubrey-Derrick Schmidt, Jan Hendrik Clausen, Ahmet Camtepe, Sahin Albayrak. "Detecting Symbian OS Malware through Static Function Call Analysis."
[12] Altyeb Altaher, Supriyanto, Ammar ALmomani, Mohammed Anbar, Sureswaran Ramadass. "Malware detection based on evolving clustering method for classification." Scientific Research and Essays, Vol. 7(22), pp. 2031-2036, 14 June 2012.
[13] Kevadia Kaushal, Prashant Swadas, Nilesh Prajapati. "Metamorphic Malware Detection Using Statistical Analysis." International Journal of Soft Computing and Engineering (IJSCE), ISSN 2231-2307 (Online), Volume 2, Issue 3, July 2012.
[14] Mihai Christodorescu, Somesh Jha. "Static Analysis of Executables to Detect Malicious Patterns."
[15] Mamoun Alazab, Sitalakshmi Venkatraman, Paul Watters, Moutaz Alazab. "Zero-day Malware Detection based on Supervised Learning Algorithms of API call Signatures." Proceedings of the 9th Australasian Data Mining Conference (AusDM'11), Ballarat, Australia.
[16] V. Naresh Kumar, Ramesh Nallamalli, Gorantla Praveen, I. Pavan Kumar, K. Subrahamanyam. "Object Oriented Association (OOA) Mining Based Classification in Storage Cloud." International Journal of Modern Engineering Research (IJMER), Vol. 2, Issue 2, Mar-Apr 2012, pp. 547-551, ISSN 2249-6645.
[17] Cumhur Doruk Bozagac. "Application of Data Mining based Malicious Code Detection Techniques for Detecting new Spyware."
[18] Manik K. Chavan, Dinesh A. Zende. "Spyware Solution: Detection of Spyware by Data mining and Machine learning Technique."
[19] Jianyong Dai, Ratan Guha, Joohan Lee. "Efficient Virus Detection Using Dynamic Instruction Sequences." Journal of Computers, Vol. 4, No. 5, May 2009, Academy Publisher.
[20] Wei Deng, Qiao Liu, Hongrong Cheng, Zhiguang Qin. "A Malware Detection Framework Based on Kolmogorov Complexity." Journal of Computational Information Systems 7:8 (2011), pp. 2687-2694, Binary Information Press, August 2011.
[21] Ulrich Bayer, Paolo Milani Comparetti, Clemens Hlauschek, Christopher Kruegel, Engin Kirda. "Scalable, Behavior-Based Malware Clustering."
[22] Tzu-Yen Wang, Shi-Jinn Horng, Ming-Yang Su, Chin-Hsiung Wu, Peng-Chu Wang, Wei-Zen Su. "A Surveillance Spyware Detection System Based on Data Mining Methods." 2006 IEEE Congress on Evolutionary Computation, Vancouver, BC, Canada, July 16-21, 2006.
[23] Dalbir Kaur R. Chhabra. "Feature selection and clustering for malicious and benign software characterization." University of New Orleans Theses and Dissertations, Paper 1864, 2014 (http://scholarworks.uno.edu/td).
[24] Karthik Raman. "Selecting Features to Classify Malware." Adobe Systems Incorporated, 2012.
[25] Fadel Omar Shaban, Tawfiq S. Barhoom. "Spyware Detection Using Data Mining for Windows Portable Executable Files."
[26] Gil Tahan, Lior Rokach, Yuval Shahar. "Mal-ID: Automatic Malware Detection Using Common Segment Analysis and Meta-Features." Journal of Machine Learning Research 13 (2012), pp. 949-979.
[27] Ivan Firdausi, Charles Lim, Alva Erwin, Anto Satriyo Nugroho. "Analysis of Machine Learning Techniques Used in Behavior-Based Malware Detection." © 2009 IEEE.
[28] Usukhbayar Baldangombo, Nyamjav Jambaljav, Shi-Jinn Horng. "A Static Malware Detection System Using Data Mining Methods."
[29] Igor Santos, Felix Brezo, Xabier Ugarte-Pedrero, Pablo G. Bringas. "Opcode Sequences as Representation of Executables for Data-mining-based Unknown Malware Detection."
[30] Blake H. Anderson, Daniel A. Quist, Joshua C. Neil. "Graph-based Malware Detection Using Dynamic Analysis." Los Alamos National Laboratory, Associate Directorate for Theory, Simulation, and Computation (ADTSC), LA-UR 12-20429.
[31] Andrea Lanzi, Davide Balzarotti, Christopher Kruegel, Mihai Christodorescu, Engin Kirda. "AccessMiner: Using System-Centric Models for Malware Protection." CCS'10, October 4-8, 2010, ACM 978-1-4503-0244-9/10/10.