Signature based Malware Detection for Unstructured Data in Hadoop

Abhaya Kumar Sahoo
Department of Information Technology
C.V. Raman College of Engineering
Bhubaneswar, India
[email protected]

Kshira Sagar Sahoo
Department of Information Technology
C.V. Raman College of Engineering
Bhubaneswar, India
[email protected]

Mayank Tiwary
Department of Information Technology
C.V. Raman College of Engineering
Bhubaneswar, India
[email protected]

Abstract—Hadoop is a highly efficient distributed processing framework based on the map-reduce approach, in which an application is divided into small fragments of work, each of which may be executed on any node in the cluster. Hadoop is effective at storing and processing unstructured, semi-structured and structured data. Unstructured data usually refers to data stored in files rather than in the traditional row-and-column form; examples include e-mail messages, videos, audio files, photos, web pages and many other kinds of business documents. Our work focuses on detecting malware in unstructured data stored in the Hadoop Distributed File System. We use Clam AV's freely available, regularly updated virus-signature database, and we propose a fast string-search algorithm based on the map-reduce approach.

Keywords—Malwares; Map-reduce; Hadoop; Cluster; Pattern Matching; Signatures;

I. INTRODUCTION

Hadoop is a fully open-source product from Apache that provides a fundamentally new way of storing and processing data. In place of the traditional approach of expensive, proprietary hardware and specialized software for storing and processing data, Apache Hadoop offers an efficient distributed and parallel processing platform for huge amounts of data across industry-standard servers that both store and process the data, with no practical limit on scale. With Hadoop systems, no data set is too big: Hadoop can handle many different types of data regardless of native format, including structured data, unstructured data, log files, pictures, audio files, communication records and e-mail.

A virus scanner is a computer program that detects malicious programs that can harm our systems. Today a virus scanner protects not only against viruses arriving from local mass storage or optical media but also against threats from local area networks and, especially, Internet traffic. The rapid day-by-day growth in the number of virus signatures steadily degrades the scanning performance of antivirus software. Addressing this issue requires highly effective pattern-matching algorithms that can make virus scanning far faster than expected, and in particular fast map-reduce-based string-search algorithms that can efficiently detect malware in big data and Hadoop environments.


The Hadoop Distributed File System (HDFS) is a Java-based file system designed for scalable and reliable data storage spanning large clusters. Files in Hadoop are stored in HDFS: Hadoop automatically indexes each file, breaks it into blocks based on the file input stream, and stores the blocks on different nodes. Once stored in HDFS, files are processed with the map-reduce approach.

Our work searches the contents of unstructured data for malware signatures using map-reduce. Malware includes viruses, ransomware, spyware, adware, scareware and other types of malicious programs. Because unstructured data is not stored in the traditional row-column format, it can carry malware. Examples of unstructured data include e-mail messages, videos, audio files, photos, web pages and many other kinds of business documents; unstructured data also frequently includes executables, zip archives and other compressed files. Files of these types must be searched against malware signatures, and malware detection often also requires checking behavior.

The rapid growth of data drives its storage in Hadoop systems, so clients storing personal data in Hadoop also require protection of that data from viruses. We therefore developed a map-reduce approach to scan HDFS for viruses in real time, so that administrators can easily know which blocks contain malware. We tested our algorithms on a five-node cluster, each node with a Core i5 processor, using files of different types such as zip, bzip, gzip, MP3 and exe. Many file types require only their headers or trailers to be scanned, while compressed and executable files require their whole contents to be searched. Our map-reduce algorithm therefore behaves differently for different file extensions, applying different signature sets to different file types, and finally writes the results to separate files. We used Clam AV's virus-signature database to perform our experiment.

II. DISTRIBUTED FILE SYSTEMS

Unstructured data is stored in a distributed manner across a number of systems, or nodes, organized in a cluster. These nodes store data using a distributed file system such as HDFS or GFS.

A. Google File System

The Google File System (GFS), developed by Google, is designed to provide efficient, reliable access to data using large clusters. Files are mutated by appending new data rather than overwriting existing data; once written, files are read sequentially. GFS runs on computing clusters built from ordinary computers, and a cluster may contain hundreds to thousands of nodes. Each cluster consists of a master node and a large number of chunk servers. Every file stored in GFS is divided into fixed-size chunks, and each chunk receives a unique 64-bit label assigned by the master node at creation time; logical mappings from files to their constituent chunks are maintained. Chunks are replicated throughout the network, with a minimum replication factor of three. The master server does not store the actual data; it stores the metadata of the chunks, for example the tables mapping 64-bit labels to chunk locations and to the files they make up, the locations of chunk copies, and which processes are reading or writing a particular chunk. The master server keeps this metadata current by periodically checking the heartbeats of the chunk servers.

Fig. 1. Architecture of the Google File System

B. Hadoop Distributed File System

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities to existing distributed file systems such as GFS, but also significant differences. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. It provides very high throughput to applications and is suited to applications with very large data sets. It relaxes a few POSIX requirements to enable high-performance streaming access to file system data. HDFS was originally built as infrastructure for the Apache Nutch web search engine project.

A computation requested by an application is far more efficient if it executes near the data it operates on, that is, on the node where the data resides. The gain in efficiency and performance is especially large when the data set is huge, compared with bringing the data to some other processing unit for the computation. Moving computation to the data also benefits the network: it minimizes network traffic congestion, increases the overall throughput of the system, and often removes the network-bandwidth bottleneck for computations that would otherwise involve transferring data. It is therefore usually better to move the computation closer to where the data is located than to move the data to where the processing takes place. HDFS provides an efficient platform for applications to move themselves closer to where the data is located.

Fig. 2. Architecture of the Hadoop Distributed File System

III. RELATED WORK

Researchers have addressed malware detection based on system-call monitoring [6]-[8]. The detection methods in this line of research identify a particular set of one or more behaviors found in previously discovered malware; they focus on monitoring or inspecting behavior. Four common behavior types are so characteristically malicious that they alone can be used to detect malware, and much research has been devoted to detecting them.

• Replication and propagation behavior is considered the defining behavior of a virus. In [9], the authors developed a computer virus detection system based on a program's GSR (Gene of Self-Replication), a specific sequence of system calls issued by a running malware program when it requests or tries to replicate its code.

• Privacy invasion behavior is a critical form of malicious behavior. In [9], the authors proposed a methodology to track the flow of sensitive information processed by web browsers.

• Malicious code injection behavior is an important attack method employed by malware. In [7], the authors present methods that can inspect whether a DLL copied into the memory space of a target process is malicious.

• Persistent behavior is an important feature of malware. In [10], the authors use ASEP (auto-start extensibility point) monitoring techniques to detect malware: any process exhibiting ASEP modification behavior is considered malware.

The related work also includes various pattern matching algorithms and other approaches used to speed up the virus-scanning stage for big data in HDFS. The word "pattern" usually refers to the hexadecimal string in a virus signature. Many pattern matching algorithms [3], [5] have been proposed to address the slowness of virus-scanning systems. Many of them are shift-based algorithms originating from the classic single-pattern matching algorithm BM [2]. BM uses information from the pattern to shift quickly over the text during searching or scanning, reducing the number of comparisons as much as possible; it captures this information with a bad-character heuristic [2]. Hash-AV [4] introduced an approach that uses cache-resident Bloom filters to decide most no-match cases quickly. Bloom filters, however, produce false positives, so to further check the positive cases Hash-AV must be used together with the original Clam-AV. Most algorithms for optimizing virus-scanning time are serial and run on a small number of general-purpose CPU cores, and they yield a speed-up of at most 0.6 times over the normal or existing ones.

UNDERSTANDING MALWARE SIGNATURES

Malware detection software is essentially fast string-search software used to prevent, detect and remove malware such as computer viruses, malicious BHOs, hijackers and spyware; it also helps protect systems against social engineering attacks. Antivirus software generally scans against virus-signature databases, and this scanning is usually done serially.

A. Analyzing Virus Signatures

Currently the total number of signatures in Clam-AV is about 150,000, and the number keeps increasing. In this paper we use the virus database downloaded from Clam-AV as a sample, both to analyze the general characteristics of a typical antivirus signature database and to carry out our experimental testing.

The virus signatures from Clam-AV are mostly divided into three types:

• Basic signatures: These are plain hexadecimal strings. Clam-AV matches basic signatures against the full content of a file.

• MD5 signatures: A file matches an MD5 signature when the MD5 checksum of the whole target file, or of a specific location within it, equals the signature (a matching sketch follows this list).

• Regular expression signatures: An extension of basic signatures that additionally supports several kinds of wildcards. The antivirus generally matches signatures of this type against the whole content of the file.
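As referenced in the list above, matching an MD5 signature reduces to hashing the file (or the designated region) and looking the digest up in a set of known checksums. The following is a minimal sketch using the JDK's MessageDigest; loading the signature set from Clam-AV's hash database is assumed to happen elsewhere, and the class and method names here are illustrative, not Clam-AV's.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.util.Set;

// Checks whether a file's MD5 digest appears in a set of known
// malware checksums, given as lowercase hex strings.
public final class Md5SignatureCheck {

    public static boolean isKnownMalware(Path file, Set<String> md5Signatures)
            throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);   // hash the whole file content
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return md5Signatures.contains(hex.toString());
    }
}
```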

Except from these three major types, there are a few other signatures for certain extended functions. There are 66 signatures for archive metadata and also 167 signatures for anti-phishing. In general the percentage of basic signatures 52.9%, MD5 signatures - 43.7%, Regex - 3.3% and others is 0.16%. The CPU overhead in scanning is very high in basic signatures and Regex signatures. B. Current scenario of Clam-AV
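Because basic signatures are fixed strings, the shift-based searching discussed under related work applies to them directly. As a reference point, here is a minimal sketch of a search that uses only the bad-character heuristic of BM [2] (essentially Horspool's simplification); it is not the BMEXT variant that Clam-AV actually uses.

```java
// Boyer-Moore search using only the bad-character heuristic.
// Returns the index of the first occurrence of pattern in text,
// or -1 if the pattern does not occur.
public final class BadCharSearch {

    public static int indexOf(byte[] text, byte[] pattern) {
        int m = pattern.length, n = text.length;
        if (m == 0) return 0;
        if (m > n) return -1;

        // Shift table: for each byte value, how far the pattern may
        // slide when the text byte under its last position mismatches.
        int[] shift = new int[256];
        java.util.Arrays.fill(shift, m);
        for (int i = 0; i < m - 1; i++) {
            shift[pattern[i] & 0xFF] = m - 1 - i;
        }

        int pos = 0;
        while (pos <= n - m) {
            int j = m - 1;
            while (j >= 0 && pattern[j] == text[pos + j]) j--;
            if (j < 0) return pos;                   // full match found
            pos += shift[text[pos + m - 1] & 0xFF];  // slide by table entry
        }
        return -1;
    }
}
```

The larger the shifts produced by the table, the fewer positions are examined, which is exactly why longer signatures (and heuristics that exploit more of their characters) are attractive, as argued below.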

Currently Clam-AV uses two pattern matching algorithms. The extended version of BM [2] (BMEXT) algorithm handles the basic signatures and the regular expression signatures are handled by a modified AC [1] algorithm. Firstly, when implemented with DFA [5] data structure, AC consumes a large amount of memory for such large scale signature database. If implemented with NFA [5] data structure, there are several memory compressing techniques available to reduce memory consumption, however such techniques usually come with more memory access, which hurts the performance badly. Secondly, the BMEXT uses the last 3 characters of a signature to generate shifts. Given that the average length and shortest length of virus signatures are comparably long, we believe larger shifts could be produced by utilizing more characters of existing signature. IV. HADOOP MAP-REDUCE APPROACH Map-Reduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster of computers. Usually programming with map-reduce approach involves calling of two main functions i.e. a mapper and a reducer. The map function takes a key, value pair as arguments and outputs a list of intermediate values with the key. The map function is written in a specific way such that multiple map functions can run at once, so it’s the part of the program that divides up tasks. The reduce function then takes the output of the map functions, and does some process on them, usually combining values, to generate the desired result in an output file. When a map-reduce program runs through Hadoop, main job is assigned to name node or master node or job tracker. Then job tracker divides the main job into sub jobs which are

run in parallel and sends to different slave nodes or task trackers. Also, the job tracker keeps track failed works, so that these tasks are again distributed to other task trackers, only causing a slight increase in execution time. In case of slower workers, any task still running once, there are no more new jobs which are left ,given to machines that have already finished their tasks. Every process nodes which have small piece of large file, these files are accessed by utilizing high bandwidth of use of more hard disks in parallel. In this way, the performance of Hadoop may be able to be improved by working the I/O of nodes concurrently, providing more throughput.

Fig. 3. The Map-Reduce approach

V. ALGORITHM IMPLEMENTATION

In this section we present the detailed Map-Reduce implementation of our signature-based scanning algorithm, Map-Reduce Antivirus, and in particular how our proposed scheme is applied. We apply a single Map-Reduce job but tested different data sets with different pattern matching algorithms. Our testing was done on executable and compressed files, with extensions such as bzip2, gzip and tar. Malware is mainly embedded in executable files or stored inside compressed files, so the whole contents of executable and compressed files must be scanned against the virus signatures, and each different file type needs its own set of virus signatures. According to our algorithm, if any match is found between the malware signatures and the contents of a file, the result is written to a separate file. For the remaining file types it is not necessary to scan the complete contents; adequate scanning can be done by matching only the headers and trailers of the file's contents.

A. Map-Reduce AV Algorithm I (Signature Based)

We explain our algorithm starting with the mapper function, then the reducer function.

Mapper: The input to the mapper is blocks of 64 MB. Different variables are assigned to different sets of malware signature strings. The mapper function breaks a block into sentences or lines, which become the values to be emitted. To select the key for a given value, it is important to know the type, or extension, of the file to which the block belongs, so that the appropriate signatures can be applied; we obtain the file name, and hence its extension, from filesplit.getPath().getName(). We then select the appropriate key and value, namely a signature and a sentence, and emit the key-value pairs to the reducer, which performs pattern matching of the key string against the collected value strings.

Fig. 4. Implementation of the mapper
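Since Fig. 4 is not reproduced in this text, the following is a minimal sketch consistent with the mapper described above, assuming Hadoop's newer mapreduce API. SignatureStore.signaturesFor is a hypothetical helper, not part of Hadoop.

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Emits (signature, line) pairs. The signatures chosen depend on the
// extension of the file the current split belongs to, so that each
// line is matched only against the signatures relevant to its file type.
public class SignatureMapper extends Mapper<LongWritable, Text, Text, Text> {

    private String fileName;

    @Override
    protected void setup(Context context) {
        // Name (and hence extension) of the file this split was taken from.
        fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // SignatureStore.signaturesFor is a hypothetical helper returning
        // the signature strings applicable to this file's extension.
        for (String signature : SignatureStore.signaturesFor(fileName)) {
            context.write(new Text(signature), line);
        }
    }
}
```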

Reducer: The input to the reducer is the key-value pairs from the mapper. The key is the signature to be scanned against the values, which are the sentences, or contents, of the files. The reducer collects all sentences for a single key through an iterator and then runs an appropriate pattern matching algorithm to scan the key against the values. If a match is found, the name of the file is written to the output file along with the result, either a 0 or a 1. For the pattern matching step we compared our work with three existing algorithms: Boyer-Moore's, Knuth-Morris-Pratt's and finally Rabin-Karp's.

Fig. 5. Implementation of the reducer
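Correspondingly, a minimal sketch of the reducer described above; a simple substring test stands in for whichever pattern matching algorithm (Boyer-Moore, Knuth-Morris-Pratt or Rabin-Karp) is plugged in.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// For each signature (key), scans all collected lines (values) and
// writes 1 if the signature occurs in any of them, otherwise 0.
// The reducer described above also records the file name; carrying it
// inside the value emitted by the mapper is one way to do that.
public class SignatureReducer extends Reducer<Text, Text, Text, IntWritable> {

    @Override
    protected void reduce(Text signature, Iterable<Text> lines, Context context)
            throws IOException, InterruptedException {
        int found = 0;
        for (Text line : lines) {
            // contains() stands in for the chosen pattern matching
            // algorithm (Boyer-Moore, Knuth-Morris-Pratt or Rabin-Karp).
            if (line.toString().contains(signature.toString())) {
                found = 1;
                break;
            }
        }
        context.write(signature, new IntWritable(found));
    }
}
```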

B. Hadoop Streaming

Hadoop streaming is an excellent feature of Apache Hadoop that allows us to create and run map-reduce jobs with any executable or script. When such an executable or script is supplied, each mapper first launches it as a separate process while the mapper itself is being initialized. As the mapper runs, it converts its inputs into lines and feeds them to the stdin of the running process; at the same time, it collects the line-oriented output from the process's stdout and converts each line into a key-value pair, which becomes the output of the mapper function and the input of the reducer function. With the default configuration and Hadoop's supplied data types (LongWritable, IntWritable or Text), the prefix of an input line up to the first tab character is the key for that line and the remainder of the line is the corresponding value; if no tab character is encountered, the whole line or sentence is taken as the key and the value is null. Since the Map-Reduce program above works on only one input directory, we apply Hadoop's streaming facility to process a larger number of files at once.
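For completeness, a driver that wires the mapper and reducer sketched earlier into a native (non-streaming) job over a single HDFS input directory might look like the following; it uses only standard Hadoop APIs, and the mapper and reducer class names are the hypothetical ones introduced above.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver for the sketched signature-scanning job:
// args[0] = HDFS input directory, args[1] = HDFS output directory.
public class MapReduceAV {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "mapreduce-av");
        job.setJarByClass(MapReduceAV.class);
        job.setMapperClass(SignatureMapper.class);
        job.setReducerClass(SignatureReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

With Hadoop streaming, the same pipeline can instead be driven by external executables or scripts, as described above.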

VI. EXPERIMENTAL ANALYSIS

Our experimental analysis gives worst-case time comparisons for different volumes of files. We call it worst-case because most existing malware detectors do not scan the whole contents of a file; they scan only the header or the trailer, or decide what to scan depending on the type of the input file.

Fig. 6. Time consumption for different volumes of data with more than 15,000 virus signatures

We took a real-world data set and applied our algorithm to it using three general string matching algorithms: Boyer-Moore's, Knuth-Morris-Pratt's and Rabin-Karp's. We observed different time consumption for the different algorithms, as well as different levels of malware detection accuracy across the pattern matching algorithms. We measured time consumption using Hadoop's benchmarking system.

VII. CONCLUSION

In this paper we have presented a map-reduce approach to detect malware residing inside unstructured data stored in a distributed file system. As the demand for storage increases day by day, a fast and efficient way to search the stored data for malware is required. We implemented different pattern matching algorithms to search for signatures in the contents of files. Our experiments show up to a tenfold speed-up as we increase the number of nodes in the cluster. We performed the experiment on a real-world file set, and the speed-up grows as the size of the file set increases. From this exploration we believe that our algorithm will provide compelling benefits in the field of antivirus applications: workloads that used to rely on a cluster or a supercomputer could be handled on a desktop. We also expect better performance in terms of execution time by optimizing our map-reduce algorithm as new versions of Hadoop are released.

VIII. REFERENCES

[1] A. V. Aho and M. J. Corasick, "Efficient string matching: an aid to bibliographic search," Communications of the ACM, 18(6):333-340, 1975.
[2] R. S. Boyer and J. S. Moore, "A fast string searching algorithm," Communications of the ACM, 20(10):762-772, 1977.
[3] B. Xu, X. Zhou and J. Li, "Recursive shift indexing: a fast multi-pattern string matching algorithm," in Proc. of the 4th International Conference on Applied Cryptography and Network Security (ACNS), 2006.
[4] O. Erdogan and P. Cao, "Hash-AV: fast virus signature scanning by cache-resident filters," in Proc. of the International Conference on Systems and Networks Communications (ICSNC), 2007.
[5] M. Fisk and G. Varghese, "An analysis of fast string matching applied to content-based forwarding and intrusion detection," Technical Report CS2001-0670, University of California, San Diego, 2002.
[6] H. Yin et al., "Panorama: capturing system-wide information flow for malware detection and analysis," in Proc. of the 14th ACM Conference on Computer and Communications Security, 2007, pp. 116-127.
[7] M. Christodorescu et al., "Semantics-aware malware detection," in Proc. of the 2005 IEEE Symposium on Security and Privacy, 2005, pp. 32-46.
[8] D. Brumley et al., "Automatically identifying trigger-based behavior in malware," in Botnet Detection, vol. 36, W. Lee et al., Eds. Springer US, 2008, pp. 65-88.
[9] M. Egele et al., "Dynamic spyware analysis," in Proc. of the 2007 USENIX Annual Technical Conference, Santa Clara, CA, 2007.
[10] S. Y. Dai and S. Y. Kuo, "MAPMon: a host-based malware detection tool," in Proc. of the 13th Pacific Rim International Symposium on Dependable Computing, 2007, pp. 349-356.
