Data Diversity of a Distributed Honey Net Based Malware Collection ...

Data Diversity of a Distributed Honey Net Based Malware Collection System

Saurabh Chamotra, Rakesh Kumar Sehgal, Raj Kamal, J.S. Bhatia


Abstract: The value of a data collection mechanism such as a honeypot or honeynet lies in being attacked and probed [1]. The efficiency of these resources therefore depends on the amount and the value of the data they collect, yet there is no appropriate measure to quantify the value of that data. Most honeynet projects demonstrate the efficiency of their systems by the volume of data collected, but volume by itself can be a misleading parameter: a honeypot may collect a high volume of data that lacks diversity because it captures the same attacks again and again, from different sources, within a given time frame. In this paper we 1) introduce the diversity index, commonly used in ecological studies, as a measure to quantify the value of data in terms of its diversity, and 2) show that the diversity of the data collected by a distributed honeynet is greater than that of a honeynet deployed at a single location.

I. INTRODUCTION

Malware has become a major threat to the Internet, as its prevalence has greatly increased in the past few years. In response to these increasing malware attacks, honeypots have emerged as one of the popular proactive defence techniques. Honeypots are information system resources capable of luring, capturing and collecting malware attacks. Tools like Nepenthes [11], HoneyBow [2], Honeytrap [4] and Amun [3] are some of the well-established malware collection honeypots available; a network of such honeypots is known as a honeynet. Many global projects such as SGNET [10], DHP (Distributed Honeynet Project) [16], HIVE [12] and Honeynet.org [15] use honeynets as a base for countering large-scale malware attacks. Worldwide projects like these are proving their worth by providing active support to CERTs and other security agencies.

Most distributed honeynet projects evaluate the performance of the underlying collection system by the volume of data collected, that is, the number of binary samples gathered. Although volume is a quantitative metric of data value, an evaluation based on qualitative measures of the data, such as the classes of attacks the honeynet is able to lure and capture, is missing. The multitude of malware has evolved into a complex environment with properties similar to those of ecological systems [5]; hence the well-established principles of ecological studies could be applied to better understand the current malware landscape. For example, to quantify the qualitative value of the data collected by honeynets, one can use diversity indices, a well-established concept for understanding the ecology of a region. A diversity index quantifies diversity based on the richness and the evenness of the species in a given collection.

In this paper we put forward two points: 1) data diversity is an important metric that should be considered when evaluating the performance of a honeynet system, and 2) the diversity of the data collected by a honeynet system deployed in a distributed fashion is greater than that of one deployed in standalone mode. To support this we use the results of our distributed honeynet system, whose nodes are deployed within different ISPs. The data is collected in the form of binary samples, which we label using an up-to-date antivirus engine. Once the samples are labelled and categorized into appropriate classes, we apply various diversity indices to the labelled data and compare the data diversity of the distributed honeynet as a whole with that of a standalone honeynet node.

978-1-4577-0240-2/11/$26.00 ©2011 IEEE

II. MALWARE ECOLOGY & MALWARE EVOLUTION

Ecology is the scientific study of the distribution and abundance of organisms, their relations, and their interactions with the environment [6]. It is basically the descriptive study of the organisms in a system. The Internet has provided a virtual world in which malware can evolve and emerge, and this world shares similarities with ecological systems. Efforts have already been made to understand the similarities between the Internet organisms called malware and those found in an ecosystem [5]. Such study is crucial to aspects of malware work such as collection, triage, analysis, intelligence estimates, detection, mitigation and forensics. Studying malware ecology has the potential to bring the complexity of malware evolution and its behavioural analysis within the scope of novel analysis and defence techniques. Ecological concepts such as nestedness and indicator species [17], ecological diversity, competition, facilitation (i.e., positive interactions between species) and habitat filtering (i.e., differences in site accessibility and habitability) could be adopted in malware analysis to better understand the malware ecosystem. Jedidiah R. Crandall and his team [5] have applied the concepts of competition, facilitation and habitat filtering to the study of malware. Similarly, concepts like island biogeography [7], the species-area relationship [8], the species-time relationship [9] and the ecological diversity index could be applied in the malware collection context. Ecologists already have many tools and techniques for understanding these phenomena, which could well be applied in the malware context. In this paper we use ecological diversity indices, which are drawn from information theory, to quantify the diversity of the data collected by the honeynet system. These indices quantify the diversity of an ecological system based on species richness and species evenness.

III. MALWARE COLLECTION USING HONEYNETS

The malware ecosystem has evolved into one of the most complicated ecosystems since the first worm (the Morris worm) was discovered on November 2, 1988 [18]. In this arms race, attackers keep producing new malware samples to evade the latest detection techniques. The malware landscape has become very diverse and complicated; new malware classes such as mutant worms, formed by combining file infectors, Trojans, file downloaders and worms in a single malware attack, illustrate this trend. To counter these threats, honeypots have emerged as one of the popular and effective approaches. Honeypots provide proactive solutions that help not only to prevent and detect attacks but also to respond to them. They contribute to prevention by deceiving attackers, to detection by acting as a burglar alarm, and to response by providing valuable data about the attacker and his techniques. As honeypots offer a wide variety of solutions for capturing and collecting malware samples [2][3][12], they are an obvious choice for building a system for the collection and study of malware.

In this paper we use a distributed honeynet system as the malware collection system. Each remote node of the distributed system consists of a combination of high- and low-interaction honeypots. The nodes are deployed in different ISPs and monitor different IP ranges. The data collected from each node takes the form of malware binaries, which are periodically gathered from the nodes and sent to the central server. The collected binary samples are then correlated with the system log files and transformed into a relational database format. The samples are further labelled using an up-to-date antivirus engine.

The reason for deploying the honeynet nodes in a geographically dispersed fashion and in different ISPs is to address the targeted attack problem. Targeted malware attacks are a recently observed trend in the malware landscape, in which attack targets are selected on the basis of their functional, geographical and operational characteristics. Due to this, malware samples found in one functional domain may not be present in another, just as certain species of plants or animals are found only in specific landscapes. Being geographically dispersed and deployed in different IP domains, these systems can monitor a wider Internet space and hence could collect more diverse data. In this paper we try to show that a distributed honeynet system is able to collect more diverse data than a standalone honeynet system.

IV. DATA DIVERSITY & ITS QUANTIFICATION

In ecology, a diversity index is a statistic intended to measure the biodiversity of an ecosystem. More generally, diversity indices can be used to assess the diversity of any population in which each member belongs to a unique species. For an index normalised to the unit interval, a perfectly homogeneous set has a diversity score of 0, whereas a perfectly heterogeneous set has a score of 1. Different types of diversity indices can be calculated to compare the diversity of ecological communities. Species diversity normally has two components:

1. Species richness refers to the number of species found in a community.
2. Species evenness refers to the relative abundance of each species in an ecological system.

An ecological system is said to have high species diversity if many nearly equally abundant species are present. If a community has only a few species, or if only a few species are very abundant, species diversity is low. Information theory provides mathematical measures to quantify this diversity in the form of the Shannon and Simpson indices. Margalef was the first to propose the use of information theory in ecology, and Pielou played an important role in promoting information-theoretic heterogeneity indices in ecology. Hence the popular diversity indices used in ecological studies are taken from information theory and follow Pielou's axioms. According to these axioms, a diversity index should reach its maximum when all species are present in equal proportion, and of two ecological systems whose species are equally proportioned, the one with more species should have the higher index value. These indices estimate diversity based on the probability that the system has assembled randomly, and quantify the uncertainty in assigning an individual to a given species within the community [24].

The Shannon index [13], also referred to as the Shannon-Wiener index, is one of the most popular diversity indices used in ecological studies. Figure 2 shows the formula for the Shannon diversity index. Here P_i is the relative proportion of the i-th species in the collection and s is the number of species present in the collection.

$$H_1 = -\sum_{i=1}^{s} P_i \log_2 P_i$$

Figure 2 Shannon Diversity index
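The Shannon index of Figure 2 can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation (the paper reports using R); the function name `shannon_index` and the sample counts are assumptions made for the example and are not taken from the paper's data set.

```python
import math


def shannon_index(counts, base=2.0):
    """Shannon diversity H = -sum(P_i * log(P_i)) over species with non-zero counts."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must contain at least one sample")
    h = 0.0
    for c in counts:
        if c > 0:
            p = c / total              # relative proportion P_i of species i
            h -= p * math.log(p, base)
    return h


# Hypothetical sample counts per species (illustrative only).
even_four = [25, 25, 25, 25]    # four equally abundant species
skewed_four = [85, 5, 5, 5]     # same richness, one dominant species
even_eight = [10] * 8           # equally abundant, but more species

print(shannon_index(even_four), shannon_index(skewed_four), shannon_index(even_eight))
```

The three hypothetical collections are chosen to exercise the two properties stated above: with equal richness the evenly distributed collection scores higher than the skewed one, and of two evenly distributed collections the one with more species scores higher.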

Another popular index for measuring system diversity is Simpson's diversity index [14]. Simpson's measure is based on the probability that two individuals randomly selected from a sample belong to the same species (or to some category other than species); the form given in Figure 3 is the complement of that probability, so that larger values indicate greater diversity. Here P_i and P_j are the relative proportions of the i-th and j-th species and s is the number of species.

$$H_2 = 2\sum_{i>j=1}^{s} P_i P_j$$

Figure 3 Simpsons diversity index

Based upon the labelling done by the antivirus engine, we classify the binary samples into the following five classes:

1. Bots
2. Trojans
3. Backdoors
4. Worms
5. Zero day

The zero-day class refers to those binary samples which were not detected as any malicious species by the antivirus engine. Table 1 shows the number of binary samples collected from the respective nodes during a period of one month.

The Shannon, Simpson and species richness indices are special cases of the generalised Rényi entropy, which is calculated using the formula given in Figure 4.

$$H_\alpha = \frac{1}{1-\alpha} \log_2 \sum_{i=1}^{s} P_i^{\alpha}$$

Figure 4 Renyi's entropy
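To make the relation between the Rényi entropy and the Simpson index concrete, the following Python sketch evaluates both. It is an illustration under stated assumptions, not the authors' code: the function names and the sample counts are hypothetical, the limit at alpha = 1 is handled by the Shannon form, and alpha = infinity is handled by the standard limit of the formula.

```python
import math


def proportions(counts):
    """Relative proportions P_i of the species with non-zero counts."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must contain at least one sample")
    return [c / total for c in counts if c > 0]


def renyi_entropy(counts, alpha, base=2.0):
    """Renyi entropy H_alpha = log(sum(P_i ** alpha)) / (1 - alpha), for alpha >= 0."""
    if alpha < 0:
        raise ValueError("alpha must be non-negative")
    p = proportions(counts)
    if alpha == 1:
        # The formula is undefined at alpha = 1; its limit is the Shannon index.
        return -sum(x * math.log(x, base) for x in p)
    if math.isinf(alpha):
        # Limit for alpha -> infinity: only the most abundant species matters.
        return -math.log(max(p), base)
    return math.log(sum(x ** alpha for x in p), base) / (1.0 - alpha)


def simpson_index(counts):
    """Simpson diversity 2 * sum_{i>j} P_i * P_j, which equals 1 - sum(P_i ** 2)."""
    p = proportions(counts)
    return 2.0 * sum(p[i] * p[j] for i in range(len(p)) for j in range(i))


# Hypothetical sample counts per species (illustrative only).
counts = [50, 30, 15, 5]
for alpha in (0, 0.25, 0.5, 1, 2, 4, 8, 16, 32, 64, math.inf):
    print(alpha, renyi_entropy(counts, alpha))
print("Simpson:", simpson_index(counts))
```

Evaluating the entropy over a range of alpha values gives a diversity profile for one collection; a collection whose profile lies above another's for every alpha is the more diverse of the two regardless of which single index is preferred.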

For different values of the parameter alpha this metric yields different diversity measures: as alpha approaches one it gives the Shannon diversity, for alpha equal to zero it gives the logarithm of the species richness, and for alpha equal to two it gives the logarithm of the inverse Simpson index.

V. EXPERIMENTAL SETUP & RESULTS

We collected data from our distributed honeynet framework over a period of one month. The framework consists of four nodes deployed in different ISPs. Each distributed node incorporates a combination of high-interaction and low-interaction honeypots. The high-interaction honeypot runs Windows XP Service Pack 2 as its operating system, whereas Nepenthes, with all its vulnerability modules loaded, is deployed as the low-interaction honeypot.

TABLE 1

Distributed Honeynet Node    Malware samples count
Node 63                      3709
Node 97                      741
Node 42                      2310
Node 19                      1502

The data collected by these nodes contains repetitions of the same malware binaries. To obtain the actual number of unique malware binary samples, we refined this data set by calculating the MD5 hash values of the dropped binaries. Table 2 shows the number of unique binary samples collected by each honeynet node.

TABLE 2

Distributed Honeynet Node    Unique malware samples count
Node 63                      264
Node 97                      133
Node 42                      155
Node 19                      151
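The MD5-based de-duplication step described above can be sketched as follows. This is a minimal illustration that assumes the dropped binaries of one node sit as plain files in a single directory; the directory layout and the function names are assumptions, not details given in the paper.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def unique_samples(sample_dir):
    """Map each distinct MD5 digest to the first file (in name order) with that content."""
    unique = {}
    for path in sorted(Path(sample_dir).iterdir()):
        if path.is_file():
            unique.setdefault(md5_of_file(path), path)
    return unique
```

Calling `len(unique_samples(node_dir))` on a node's sample directory (here a hypothetical `node_dir`) would give that node's unique sample count in the sense of Table 2.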

As mentioned earlier, these unique binary samples were further categorized into five classes. Table 3 shows the distribution over the five classes obtained on each node. For the calculation of the different diversity indices we used the R system [19], an open-source implementation of the S language for data modelling. It is a statistical analysis framework which provides various statistical functions for both confirmatory and exploratory analysis of data. For this experiment we used the vegan package, an R package that implements functions for ecological studies. It therefore gives us an easy-to-use open-source environment for the data analysis. The results generated for the above data are given in Table 4.

Figure 4 Network Diagram of Distributed System

Figure 4 shows the network diagram of the distributed honeynet system with four nodes. These nodes were deployed in the networks of different ISPs, in /24, /27, /28 and /29 subnets. Binary files dropped on these honeynets are captured and sent to the central database server on a regular basis, and at the central server the collected data is converted into a relational database format. Further, the collected binary samples are labelled by the Symantec antivirus engine and classified on the basis of that labelling.
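The step from antivirus labels to class counts can be sketched as below. The paper does not state how a label is mapped to one of the five classes, so the keyword rules, the label strings and the extra "Unclassified" bucket are all hypothetical and serve only to illustrate the tallying.

```python
from collections import Counter

# Hypothetical keyword rules, checked in order; the paper does not specify the mapping.
CLASS_RULES = [
    ("bot", "Bots"),
    ("backdoor", "Backdoors"),
    ("trojan", "Trojans"),
    ("worm", "Worms"),
]


def classify(label):
    """Map an antivirus label to a class; an empty label means the sample was undetected."""
    if not label:
        return "Zero day"
    lowered = label.lower()
    for keyword, malware_class in CLASS_RULES:
        if keyword in lowered:
            return malware_class
    return "Unclassified"  # assumption: labels matching no rule are kept apart


def class_counts(labels):
    """Tally the labels of one node's unique samples into per-class counts."""
    return Counter(classify(label) for label in labels)


# Made-up labels for illustration; None stands for a sample with no detection.
example_labels = ["Trojan.Example", "W32.Example.Worm", None, "Backdoor.Example", None]
print(class_counts(example_labels))
```

The resulting per-class counts for a node are the kind of input a diversity index expects: one abundance value per class or species.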


TABLE 3

Node       BOT    Worms    Backdoors    Trojans    Zero day/Undetected
Node 63    95     17       30           59         63
Node 97    31     22       5            33         33
Node 42    54     20       8            31         32
Node 19    71     19       3            14         44

TABLE 4

Diversity Index    Shannon Index    Simpson      Inverse Simpson    Fisher's log series
Node 19            1.763835         0.7557734    2.147848           4.094558
Node 42            2.177934         0.8540809    2.575972           6.853114
Node 63            2.788158         0.8954789    2.828073           9.567448
Node 97            2.246577         0.8620876    3.235690           7.250980

TABLE 6

Diversity Index    Shannon Index    Simpson      Inverse Simpson    Fisher's log series
Node 19            0.8786771        0.5344178    2.147848           1.226895
Node 42            1.0920054        0.6117970    2.575972           1.345137
Node 63            1.3639228        0.6464024    2.828073           3.191579
Node 97            1.3633567        0.6909469    3.235690           2.216315

Figure 5 below shows the Rényi entropy for different values of the parameter alpha. The dots show the values obtained by each node for the different values of alpha, and the lines represent the maximum, minimum and median in the data set. One can see in the figure that for node 63 all the points lie on the maximum line, which means that the malware data set collected at node 63 was the most diverse in comparison to the other nodes.

Figure 5 Rényi entropy for different values of parameter alpha (panels: node19, node42, node63, node97; x-axis: alpha = 0, 0.25, 0.5, 1, 2, 4, 8, 16, 32, 64, Inf; y-axis: diversity)

Further, we calculated the diversity of the data based on the classification done by the Symantec antivirus engine. As per the Symantec classification of the binary samples, a total of 48 unique species were discovered across the four nodes, excluding the zero-day or undetected attacks. The distribution for each node is given in Table 5.

TABLE 5

Distributed nodes    No. of species present
Node 19              14
Node 63              39
Node 42              15
Node 97              16

Figure 6 below shows the Rényi diversities at the four distributed honeynet nodes for Table 6. The plot uses Trellis graphics with a separate panel for each site. The dots show the values for the sites, and the lines represent the extremes and the median.

Figure 6. Rényi diversities at four distributed Honeynet nodes (panels: node19, node42, node63, node97; x-axis: alpha = 0, 0.25, 0.5, 1, 2, 4, 8, 16, 32, 64, Inf; y-axis: diversity)

This figure shows that the diversity of node 63 is greater than that of any other node in terms of the number of different classes of malware samples. Table 7 shows the data diversity calculated for the complete distributed system as a whole.

TABLE 7

Diversity indexes      Diversity
Shannon index          2.624180
Simpson index          0.8766566
Inverse Simpson        8.107444
Fisher's log-series    13.05359
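Computing the diversity of the distributed system as a whole amounts to pooling the per-node species counts before applying an index. The sketch below illustrates this with the Shannon index; the node names and species counts are hypothetical and are not the paper's data, and the helper names are assumptions.

```python
import math
from collections import Counter


def shannon(counts):
    """Shannon index (log base 2) of a mapping from species to sample count."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values() if c > 0)


def pool(nodes):
    """Sum the per-node species counts into one collection for the whole system."""
    pooled = Counter()
    for counts in nodes.values():
        pooled.update(counts)
    return pooled


# Hypothetical per-node species counts (illustrative only).
nodes = {
    "node_a": Counter({"species_1": 12, "species_2": 8}),
    "node_b": Counter({"species_2": 5, "species_3": 9, "species_4": 6}),
}
pooled = pool(nodes)

for name, counts in nodes.items():
    print(name, len(counts), shannon(counts))
print("pooled", len(pooled), shannon(pooled))
```

With this illustrative data the pooled collection has both more species and a higher Shannon index than either node. Pooling can never reduce richness, but an index that also weighs evenness need not rise for every data set, which is why the comparison has to be made on the measured counts.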

Table 6 shows the diversity indices calculated for the species-level data of Table 5 at the individual nodes.

Figure 7 plots the Rényi diversity for different values of alpha; it shows that the diversity of the distributed system as a whole is greater than the diversity of an individual collection node.

Figure 7. Rényi diversity for different values of alpha (x-axis: alpha = 0, 0.25, 0.5, 1, 2, 4, 8, 16, 32, 64, Inf; y-axis: diversity)

VI. CONCLUSION

Our experiments show that ecological concepts such as diversity explain certain phenomena in the malware landscape well and can be used in the malware context to better understand the ecology of malware. The experiments also clearly showed that nodes deployed with the same configuration, for a similar duration, in different ISPs vary greatly in the value of the data they collect in terms of diversity. Although the amount of data collected by some of the nodes was almost the same, when evaluated on the diversity scale the value of their data differed greatly, and some nodes proved to be better collectors than others. As discussed in the paper, this also indicates the prevalence of targeted attacks, which select specific targets based on their functional, geographical and operational properties. The experiments further indicated that the data collected by a distributed system as a whole is more diverse than the data collected by a standalone honeynet system. Hence distributed honeynet systems are able to collect considerably more diverse data. In future we will look into applying other ecological concepts, such as habitat filtering and species nestedness, to better understand the malware landscape.

REFERENCES

[1] L. Spitzner, "Honeypots: Tracking Hackers", Addison-Wesley, ISBN 0-321-10895-7, 2002.
[2] Jianwei Zhuge, Thorsten Holz, Xinhui Han, "Collecting autonomous spreading malware using high-interaction Honeypots", ICICS, ACM, 2007.
[3] Jan Göbel, "Amun: A Python Honeypot", Technical Report, Laboratory for Dependable Distributed Systems, University of Mannheim, Germany.
[4] T. Werner, Honeytrap. http://honeytrap.mwcollect.org/
[5] Jedidiah R. Crandall, "The Ecology of Malware", ACM, 47, 3 (March 2004).
[6] http://en.wikipedia.org/wiki/Ecology
[7] R. H. MacArthur and E. O. Wilson, The Theory of Island Biogeography, volume 1 of Monographs in Population Biology. Princeton Press, 1967.
[8] M. Rosenzweig, "Preston's ergodic conjecture: the accumulation of species in space and time", Biodiversity Dynamics, July 1998.
[9] E. F. Connor and E. D. McCoy, "The statistics and biology of the species-area relationship", The American Naturalist, 113(6):791–883, June 1979.
[10] Corrado Leita, Marc Dacier, "SGNET: a worldwide deployable framework to support the analysis of malware threat models", Institut Eurecom, Sophia Antipolis, France.
[11] Paul Baecher, Markus Koetter, Thorsten Holz, Maximillian Dornseif, "The Nepenthes Platform: An Efficient Approach to Collect Malware", RAID 2006, LNCS.
[12] Davide Cavalca and Emanuele, "HIVE: an Open Infrastructure for Malware Collection and Analysis", Department of Computer Engineering and Systems Science, 2008.
[13] C. Shannon, "A mathematical theory of communication", Bell System Technical Journal, 27:379–423, 1948.
[14] E. H. Simpson, "Measurement of diversity", Nature, 163:688, 1949.
[15] http://www.honeynet.org/
[16] http://distributed.honeynets.org
[17] M. Dufrene and P. Legendre, "Species assemblages and indicator species: The need for a flexible asymmetrical approach", Ecological Monographs, 67(3):345–366, 1997.
[18] http://en.wikipedia.org/wiki/Morris_worm
[19] http://www.rsystems.com
