New Payload Attribution Methods for Network Forensic Investigations

MIROSLAV PONEC∗†, PAUL GIURA∗, JOEL WEIN∗† and HERVÉ BRÖNNIMANN∗

∗Polytechnic Institute of NYU, †Akamai Technologies
Payload attribution can be an important element in network forensics. Given a history of packet transmissions and an excerpt of a possible packet payload, a Payload Attribution System (PAS) makes it feasible to identify the sources, destinations, and times of appearance on a network of all the packets that contained the specified payload excerpt. A PAS, as one of the core components in a network forensics system, enables investigating cybercrimes on the Internet by, for example, tracing the spread of worms and viruses, identifying who has received a phishing email in an enterprise, or discovering which insider allowed an unauthorized disclosure of sensitive information. Due to the increasing volume of network traffic in today's networks, it is infeasible to effectively store and query all the actual packets for extended periods of time in order to allow analysis of network events for investigative purposes; therefore we focus on extremely compressed digests of the packet activity. We propose several new methods for payload attribution which utilize Rabin fingerprinting, shingling, and winnowing. Our best methods allow building practical payload attribution systems which provide data reduction ratios greater than 100:1 while supporting efficient queries with very low false positive rates. We demonstrate the properties of the proposed methods and specifically analyze their performance and practicality when used as modules of the network forensics system ForNet. Our experimental results outperform current state-of-the-art methods both in terms of false positives and data reduction ratio. Finally, these approaches directly allow the collected data to be stored and queried by an untrusted party without disclosing any payload information or the contents of queries.
Categories and Subject Descriptors: C.2.0 [Computer-Communication Networks]: General— Security and protection General Terms: Algorithms, Performance, Security Additional Key Words and Phrases: Bloom Filter, Network Forensics, Payload Attribution
1. INTRODUCTION
Cybercrime today is alive and well on the Internet and growing both in scope and sophistication [Richardson and Peters 2007]. Given the trends of increasing Internet usage by individuals and companies alike and the numerous opportunities for

Author's address: Polytechnic Institute of NYU, Department of Computer Science and Engineering, 6 MetroTech Center, Brooklyn, NY 11201, U.S.A.; Akamai Technologies, 8 Cambridge Center, Cambridge, MA 02142, U.S.A. This research is supported by NSF CyberTrust Grant 0430444. This paper is a significantly extended and updated version of [Ponec et al. 2007]. We would like to thank Kulesh Shanmugasundaram for helpful discussions. Permission to make digital/hard copy of all or part of this material without fee for personal or classroom use is granted provided that the copies are not made or distributed for profit or commercial advantage, the ACM copyright/server notice, the title of the publication, and its date appear, and notice is given that copying is by permission of the ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists requires prior specific permission and/or a fee. © 20YY ACM 0000-0000/20YY/0000-0001 $5.00

ACM Journal Name, Vol. V, No. N, Month 20YY, Pages 1–32.
anonymity and non-accountability of Internet use, we expect this trend to continue for some time. While there is much excellent work going on targeted at preventing cybercrime, there is unfortunately a parallel need to develop good tools to aid law-enforcement or corporate security professionals in investigating committed crimes. Identifying the sources and destinations of all packets that appeared on a network and contained a certain excerpt of a payload, a process called payload attribution, can be an extremely valuable tool in helping to determine the perpetrators or the victims of a network event and to analyze security incidents in general [Shanmugasundaram et al. 2003; Staniford-Chen and Heberlein 1995; Garfinkel 2002; King and Weiss 2002]. It is possible to collect full packet traces even with commodity hardware [Anderson and Arlitt 2006], but the storage and analysis of terabytes of such data from today's high-speed networks is extremely cumbersome. Supporting network forensics by simply capturing and logging raw network traffic is therefore infeasible for anything but short periods of history. First, storage requirements limit the time over which the data can be archived (e.g., a 100 Mbit/s WAN can fill up 1 TB in just one day), and it is a common practice to overwrite old data when that limit is reached. Second, string matching over such massive amounts of data is very time-consuming. Shanmugasundaram et al. [2004] recently presented an architecture for network forensics in which payload attribution is a key component. They introduced the idea of using Bloom filters to achieve a reduced-size digest of the packet history that would support queries about whether any packet containing a certain payload excerpt has been seen; the reduction in data representation comes at the price of a manageable false positive rate in the query results.
Subsequently a different group offered a variant technique for the same problem [Cho et al. 2006]. Our contribution in this paper is to present new methods for payload attribution that have substantial performance improvements over these state-of-the-art payload attribution systems. Our approach to payload attribution, which constitutes a crucial component of a network forensic system, can be easily integrated into any existing network monitoring system. The best of our methods allow data reduction ratios greater than 100:1 and achieve very low overall false positive rates. With a data reduction ratio of 100:1 our best method gives no false positive answers for query excerpt sizes of 250 bytes and longer; in contrast, the prior best techniques had a 100% false positive rate at that data reduction ratio and excerpt size. The reduction in storage requirements makes it feasible to archive data taken over an extended time period and query for events in a substantially distant past. Our methods are capable of effectively querying for small excerpts of a payload but can also be extended to handle excerpts that span several packets. The accuracy of attribution increases with the length of the excerpt and the specificity of the query. Further, the collected payload digests can be stored and queries performed by an untrusted party without disclosing any payload information or the contents of the queries. This paper is organized as follows. In the next section we review related prior work. In Section 3 we provide a detailed design description of our payload attribution techniques, with a particular focus on payload processing and querying. In
Section 4 we discuss several issues related to the implementation of these techniques in a full payload attribution system. In Section 5 we present a performance comparison of the proposed methods and quantitatively measure their effectiveness for multiple workloads. Finally, we summarize our conclusions in Section 6.

2. RELATED WORK
When processing a packet payload by the methods described in Section 3, the overall approach is to partition the payload into blocks and store them in a Bloom filter. In this section we first give a short description of Bloom filters and introduce Rabin fingerprinting and winnowing, which are techniques for block boundary selection. Thereafter we review the work related to payload attribution systems.

2.1 Bloom Filters
Bloom filters [Bloom 1970] are space-efficient probabilistic data structures supporting membership queries and are used in many network and other applications [Broder and Mitzenmacher 2002]. An empty Bloom filter is a bit vector of m bits, all set to 0, together with k different hash functions, each of which maps a key value to one of the m positions in the vector. To insert an element into the Bloom filter, we compute the k hash function values and set the bits at the corresponding k positions to 1. To test whether an element was inserted, we hash the element with these k hash functions and check if all corresponding bits are set to 1, in which case we say the element is in the filter. The space savings of a Bloom filter is achieved at the cost of introducing false positives; the greater the savings, the greater the probability of a query returning a false positive. Equation 1 gives an approximation of the false positive rate α after n distinct elements were inserted into the Bloom filter [Mitzenmacher 2002]. Further analysis reveals that optimal utilization of a Bloom filter is achieved when the number of hash functions k equals (ln 2) · (m/n) and the probability of each bit of the Bloom filter being 0 is 1/2. In practice, of course, k has to be an integer, and a smaller k is usually preferred to reduce the amount of necessary computation. Note also that while we use Bloom filters throughout this paper, all of our payload attribution techniques can be easily modified to use any data structure that allows insertion and querying for strings, with no changes to the structural design and implementation of the attribution methods.
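The insert/query mechanics described above can be sketched in a few lines of Python; deriving the k hash functions by salting a single SHA-256 is an illustrative choice here, not a construction prescribed by the paper:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: an m-bit vector with k salted hash functions."""

    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = bytearray((m + 7) // 8)

    def _positions(self, item):
        # Derive k positions by salting one cryptographic hash with the index.
        for i in range(self.k):
            h = hashlib.sha256(i.to_bytes(4, "big") + item).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def insert(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def query(self, item):
        # True means "possibly inserted"; False is always correct.
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

bf = BloomFilter(m=1 << 16, k=4)
bf.insert(b"ABCD")
assert bf.query(b"ABCD")   # an inserted element always answers True
```

A query for an element that was never inserted may still answer True with the probability given by Equation 1, which is exactly the false-positive trade-off discussed above.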
α = (1 − (1 − 1/m)^{kn})^k ≈ (1 − e^{−kn/m})^k.  (1)

2.2 Rabin Fingerprinting
Fingerprints are short checksums of strings with the property that the probability of two different objects having the same fingerprint is very small. Rabin defined a fingerprinting scheme [Rabin 1981] for binary strings based on polynomials in the following way. We associate a polynomial S(x) of degree N − 1 with coefficients in Z2 with every binary string S = (s1, . . . , sN), for N ≥ 1:

S(x) = s1 x^{N−1} + s2 x^{N−2} + · · · + sN.  (2)
Then we take a fixed irreducible polynomial P(x) of degree K over Z2 and define the fingerprint of S to be the polynomial f(S) = S(x) mod P(x). This scheme, only slightly modified, has found several applications [Broder 1993], for example, in defining block boundaries for identifying similar files [Manber 1994] and for web caching [Rhea et al. 2003]. We derive a fingerprinting scheme for payload content based on Rabin's scheme in Section 3.3 and use it to pick content-dependent boundaries for a priori unknown substrings of a payload. For details on the applications, properties, and implementation issues of Rabin's scheme, refer to [Broder 1993].
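As an illustration, the reduction S(x) mod P(x) can be computed with shift-and-XOR arithmetic over Z2. The degree-8 polynomial below (x^8 + x^4 + x^3 + x^2 + 1) is only an example chosen for readability; a deployed system would use a larger, randomly chosen irreducible P(x):

```python
# f(S) = S(x) mod P(x) over Z2, treating a bit string as a polynomial.
# P is an illustrative degree-8 irreducible polynomial: x^8 + x^4 + x^3 + x^2 + 1.
P = 0b100011101
K = 8  # degree of P

def rabin_fingerprint(data: bytes) -> int:
    f = 0
    for byte in data:
        for i in range(7, -1, -1):   # feed the bits of S most-significant first
            f = (f << 1) | ((byte >> i) & 1)
            if f >> K:               # degree reached K: subtract (XOR) P once
                f ^= P
    return f                         # a polynomial of degree < K, i.e. K bits
```

The invariant is that f always has degree below K, so the fingerprint of any string fits in K bits; two different strings collide only when their difference is divisible by P(x).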
2.3 Winnowing
Winnowing [Schleimer et al. 2003] is an efficient fingerprinting algorithm enabling accurate detection of full and partial copies between documents. It works as follows: for each sequence of ν consecutive characters in a document, we compute its hash value and store it in an array. Thus, the first item in the array is a hash of c1 c2 . . . cν, the second item is a hash of c2 c3 . . . cν+1, etc., where ci are the characters in the document of size Ω bytes, for i = 1, . . . , Ω. We then slide a window of size w through the array of hashes and select the minimum hash within each window. If several hashes tie for the minimum value, we choose the rightmost one. These selected hashes form the fingerprint of the document. It is shown in [Schleimer et al. 2003] that fingerprints selected by winnowing are better for document fingerprinting than the subset of Rabin fingerprints containing the hashes equal to 0 mod p, for some fixed p, because winnowing guarantees that in any window of size w at least one hash is selected. We will use this idea to select boundaries for blocks in packet payloads in Section 3.5.
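The selection step can be sketched directly from this description (a truncated SHA-256 serves as the ν-gram hash here purely for illustration):

```python
import hashlib

def winnow(text: bytes, nu: int, w: int):
    """Winnowing: select the minimum hash in each window of w consecutive
    nu-gram hashes; on ties, the rightmost minimum wins."""
    hashes = [
        int.from_bytes(hashlib.sha256(text[i:i + nu]).digest()[:4], "big")
        for i in range(len(text) - nu + 1)
    ]
    selected = set()  # (position, hash) pairs forming the fingerprint
    for i in range(len(hashes) - w + 1):
        window = hashes[i:i + w]
        m = min(window)
        # rightmost occurrence of the minimum within this window
        j = i + (w - 1 - window[::-1].index(m))
        selected.add((j, hashes[j]))
    return sorted(selected)

fingerprint = winnow(b"do androids dream of electric sheep", 4, 5)
```

Because the same minimum tends to be selected by many overlapping windows, the fingerprint is much smaller than the full hash array, yet every window of w consecutive hashes is guaranteed to contribute at least one selected position.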
2.4 Attribution Systems
There has been a major research effort over the last several years to design and implement feasible network traffic traceback systems, which identify the machines that directly generated certain malicious traffic and the network path this traffic subsequently followed. These approaches, however, restrict the queries to network floods, connection chains, or, in the best case, the entire payload of a single packet. The Source Path Isolation Engine (SPIE) [Snoeren et al. 2001] is a hash-based technique for IP traceback that generates audit trails for traffic within a network and uses them to trace the origin of any single packet delivered by the network in the recent past. Each router creates a packet digest for every forwarded packet, using the packet's non-mutable header fields and a short prefix of the payload, and stores it in a Bloom filter for a predefined time. Upon detection of a malicious attack by an intrusion detection system, SPIE can be used to trace the packet's attack path back to the source by querying SPIE devices along the path. In many cases, however, an investigator may not have any header information about a packet of interest but may know some excerpt of the payload of the packets she wishes to see. Designing techniques for this problem that achieve significant data reduction (compared to storing raw packets) is a much greater challenge: the entire packet payload is much larger than the information hashed by SPIE, and in addition we need to store information about numerous substrings of the payload to support queries about excerpts.

Shanmugasundaram et al. [2004] introduced the Hierarchical Bloom Filter (HBF), a compact hash-based payload digesting data structure, which we describe in Section 3.1. A payload attribution system based on an HBF is a key module of ForNet, a distributed system for network forensics [Shanmugasundaram et al. 2003]. The system has a low memory footprint and achieves reasonable processing speed at a low false positive rate. It monitors network traffic, creates hash-based digests of payload, and archives them periodically. A user-friendly query mechanism based on XML provides an interface to answer postmortem questions about network traffic. SPIE and HBF are both digesting schemes, but while SPIE is a packet digesting scheme, HBF is a payload digesting scheme. With an HBF module running in ForNet (or a module using any of our methods presented in this paper), one can query for substrings of the payload (called excerpts throughout this paper).

Recently, another group suggested an alternative approach to the payload attribution problem, the Rolling Bloom Filter (RBF) [Cho et al. 2006], which uses packet content fingerprints based on a generalization of the Rabin-Karp string-matching algorithm. Instead of aggregating queries in a hierarchy as an HBF does, they aggregate query results linearly from multiple Bloom filters. The RBF tries to solve the problem of finding a correct alignment of blocks in the process of querying an HBF by considering many possible alignments of blocks at once, i.e., the RBF rolls a fixed-size window over the packet payload and records all the window positions as payload blocks. They report performance similar to the best-case performance of the HBF. The design of an HBF is well documented in the literature and HBFs are currently used in practice.
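The rolling idea can be sketched as follows; this is a simplified illustration of the blocking step only, since the actual RBF fingerprints these windows with Rabin-Karp hashes and digests them in Bloom filters:

```python
def rolling_blocks(payload: bytes, w: int):
    """RBF-style blocking: slide a fixed-size window one byte at a time
    and record every window position as a payload block."""
    return [(payload[i:i + w], i) for i in range(len(payload) - w + 1)]

blocks = rolling_blocks(b"ABCDEFGHIJ", 4)
```

For a p-byte payload this records p − w + 1 blocks, versus ⌊p/w⌋ for a single fixed partitioning, which illustrates why schemes that consider every alignment pay a processing and storage overhead in exchange for alignment robustness.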
We created our implementation of the HBF as an example of a current payload attribution method and include it in our comparisons in Section 5. The RBF's performance is comparable to that of the HBF, and experimental results presented in [Cho et al. 2006] show that the RBF achieves low false positive rates only for small data reduction ratios (about 32:1).

3. METHODS FOR PAYLOAD ATTRIBUTION
In this section we introduce various data structures for payload attribution. Our primary goal is to find techniques that give the best data reduction for payload fragments of significant size at reasonable computational cost. When viewed through this lens, roughly speaking, a technique that we call Winnowing Multi-Hashing (WMH) is the best and substantially outperforms previous methods; a thorough experimental evaluation is presented in Section 5. Our exposition of WMH will develop gradually, starting with naive approaches to the problem, building through previous work (HBF), and introducing a variety of new techniques. Our purpose is twofold. First, this exposition should develop a solid intuition for the reader as to the various considerations that were taken into account in developing WMH. Second, and equally important, there are a variety of lenses through which one may consider and evaluate the different techniques. For example, one may want to perform less computation and be unable to utilize data aging techniques, and as a result opt for a method such as Winnowing Block Shingling (WBS), which
is more appropriate than WMH under those circumstances. Additionally, some applications may have specific requirements on the block size and therefore prefer a different method. By carefully developing and experimentally evaluating the different methods, we present the reader with a spectrum of possibilities and a clear understanding of which to use when. As noted earlier, all of these methods follow the general program of dividing packet payloads into blocks and inserting them into a Bloom filter. They differ in how the blocks are chosen, in the methods used to determine which blocks belong to which payload in which order (“consecutiveness resolution”), and in miscellaneous other techniques used to reduce the number of necessary queries and the probability of false positives. We first describe the basics of block-based payload attribution and the Hierarchical Bloom Filter [Shanmugasundaram et al. 2004] as the current state-of-the-art method. We then propose several new methods which solve multiple problems in the design of the former methods. A naive way to design a simple payload attribution system is to store the payload of all packets. In order to decrease the demand for storage capacity and to provide some privacy guarantees, we can store hashes of payloads instead of the actual payloads. This approach reduces the amount of data per packet to about 20 bytes (by using SHA-1, for example) at the cost of false positives due to hash collisions. By storing payloads in a Bloom filter (described in Section 2.1), we can further reduce the required space. The false positive rate of a Bloom filter depends on the data reduction ratio it provides. A Bloom filter preserves privacy because we can only ask whether a particular element was inserted into it, but it cannot be coerced into revealing the list of elements stored; even if we try to query for all possible elements, the result will be useless due to false positives.
Compared to storing hashes directly, the advantage of using Bloom filters is not only the space savings but also the speed of querying. It takes only a short constant time to query the Bloom filter for any packet. Inserting the entire payload into a Bloom filter, however, does not allow supporting queries for payload excerpts. Instead of inserting the entire payload into the Bloom filter, we can partition it into blocks and insert them individually. This simple modification allows queries for excerpts of the payload by checking if all the blocks of an excerpt are in the Bloom filter. Yet, we need to determine whether two blocks appeared consecutively in the same payload, or if their presence is just an artifact of the blocking scheme. The methods presented in this section deal with this problem by using offset numbers or block overlaps. The simplest data structure that uses a Bloom filter and partitions payloads into blocks with offsets is a Block-based Bloom Filter (BBF) [Shanmugasundaram et al. 2004]. Note that, assuming we do one decomposition of the payload into blocks during payload processing, starting at the beginning of the packet, we will need to query the data structure for multiple starting positions of our excerpt in the payload during the excerpt-querying phase, as the excerpt need not start at the beginning of a block. For example, if the payload being partitioned with a block size of 4 bytes was ABCDEFGHIJ, we would insert blocks ABCD and EFGH into the Bloom filter (the remainder IJ is not long enough to form a block and is therefore not processed). Later on, when we query for an excerpt, for example, DEFGHI, we
would partition the excerpt into blocks (with a block size of 4 bytes, as done previously on the original payload). This would give us just one block to query the Bloom filter for, DEFG. However, because we do not know where the excerpt could be located within the payload, we also need to try partitioning the excerpt from starting position offsets 1 and 2, which gives us blocks EFGH and FGHI, respectively. We are then guaranteed that the Bloom filter answers positively for the correct block EFGH; however, we can also get positive answers for blocks DEFG and FGHI due to false positives of the Bloom filter. The payload attribution methods presented in this section try to limit or completely eliminate (see Section 3.3) this negative effect. Alternative payload processing schemes, such as [Cho et al. 2006; Gu et al. 2007], perform partitioning of the payload at all possible starting offsets during the payload processing phase (which is essentially equivalent to working on all n-grams of the payload), but this incurs a large processing overhead and multiplies the storage requirements.

We also need to set two parameters which determine the time precision of our answers and the smallest query excerpt size. First, we want to be able to attribute to each excerpt for which we query the time when the packet containing it appeared on the network. We solve that by having multiple Bloom filters, one for each time interval. The duration of each interval depends on the number of blocks inserted into the Bloom filter. In order to guarantee an upper bound on the false positive rate, we replace the Bloom filter by a new one and store the previous Bloom filter in permanent storage after a certain number of elements have been inserted into it. There is also an upper bound on the maximum length of one interval, to limit the coarseness of the time determination. Second, we specify the size of blocks.
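The partition-and-realign procedure just illustrated with ABCDEFGHIJ and DEFGHI can be sketched as follows; a plain Python set stands in for the Bloom filter, and max_off is an assumed bound on the number of blocks per packet:

```python
def process_payload(payload: bytes, s: int, digest: set):
    # Insert fixed-size blocks together with their offset numbers;
    # a tail shorter than s is discarded, as in the BBF.
    for off in range(len(payload) // s):
        digest.add((payload[off * s:(off + 1) * s], off))

def query_excerpt(excerpt: bytes, s: int, digest: set, max_off: int = 64) -> bool:
    # The excerpt need not start on a block boundary, so try every
    # alignment (s choices) and every offset the first block may have had.
    for start in range(s):
        blocks = [excerpt[i:i + s]
                  for i in range(start, len(excerpt) - s + 1, s)]
        if not blocks:
            continue
        for first in range(max_off):
            if all((b, first + j) in digest for j, b in enumerate(blocks)):
                return True
    return False

digest = set()
process_payload(b"ABCDEFGHIJ", 4, digest)   # inserts (ABCD, 0) and (EFGH, 1)
```

With a real Bloom filter in place of the set, each membership test can additionally answer True spuriously, which is the false-positive effect the text describes.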
If the chosen block size is too small, we get too many collisions, as there are not enough unique patterns and the Bloom filter gets filled quickly by many blocks. If the block size is too large, there is not enough granularity to answer queries for smaller excerpts. We need to distinguish blocks from different packets to be able to answer who has sent/received the packet. The BBF as briefly described above is not able to recognize the origins and destinations of packets. In order to work properly as an attribution system over multiple packets, a unique flow identifier (flowID) must be associated with each block before insertion into the Bloom filter. A flow identifier can be the concatenation of source and destination IP addresses, optionally with source/destination port numbers. We maintain a list (or a more efficient data structure) of flowIDs for each Bloom filter, and our data reduction estimates include the storage required for this list. The connection records (flowIDs) for each Bloom filter (i.e., a time interval) can alternatively be obtained from other modules monitoring the network. The need to test all the flowIDs in a list significantly increases the number of queries required for the attribution, as the flowID of the packet that contained the query excerpt is not known a priori; this leads to a higher false positive rate and decreases the total query performance. Therefore, we may either maintain two separate Bloom filters to answer queries, one into which we insert blocks only and one with blocks concatenated with the corresponding flowIDs, or insert both into one larger Bloom filter. The former allows data aging, i.e., for very old data we can delete the first Bloom filter and store only the one with flowIDs, at the cost of a higher false positive rate and slower querying. Another method to save storage
space by reducing the size taken by very old data is to take a Bloom filter of size 2b and replace it by a new Bloom filter of size b obtained by computing the logical OR of the two halves of the original Bloom filter. This halves the amount of data and still allows querying, but the false positive rate increases significantly. An alternative construction which allows the determination of source/destination pairs is to use separate Bloom filters for each flow. Then, instead of using one Bloom filter and inserting blocks concatenated with flowIDs, we simply select a Bloom filter for the insertion of blocks based on the flowID. Because we cannot anticipate the number of blocks each flow will contain during a time interval, we use small Bloom filters, flush them to disk more often, and apply additional compression (such as gzip) to the Bloom filters before saving them to disk, which helps to significantly reduce storage requirements for very sparse flows. Having multiple small Bloom filters also has some performance advantages compared to one large Bloom filter because of caching; the size of a Bloom filter can be selected to fit into a memory cache block. This technique would most likely use TCP stream reconstruction, which makes the packet processing stateful compared to the method using flowIDs. It may thus be suitable when there is another module in the system, such as an intrusion detection (or prevention) system, which already does the stream reconstruction and to which a PAS module can be attached. With this technique, the evaluation of the methods would be extremely dependent on the distribution of payload among the streams. For clarity of explanation, we do not take flowIDs into further consideration in the method descriptions throughout this section. We have identified several important data structure properties of the methods presented in this paper; a summary can be found in Table I. Figure 1 shows a tree structure representing the evolution of the methods.
For example, the VHBF method was derived from HBF by the use of variable-sized blocks. These properties of all methods are thoroughly explained within the description of the method in which they first appear. Their impact on performance is discussed in Section 5. There are many possible combinations of the techniques presented in this paper, and the following list of methods is not a complete list of all combinations. For example, a method which builds a hierarchy of blocks with winnowing as a boundary selection technique could be developed. However, the selected subset provides enough detail for a reader to construct and analyze the other alternative methods, and, having experimented with them, we believe the presented subset accomplishes the goal of selecting the most suitable one, which is, in the general case, the Winnowing Multi-Hashing technique (Section 3.12). In all the methods in this section we can extend the answer from a simple yes/no (meaning that there was/wasn't a packet containing the specified excerpt in a specified time interval, and if yes, providing also the list of flowIDs of packets that contained the excerpt) to give additional details about which parts of the excerpt were found (i.e., blocks) and return, for instance, the longest continuous part of the excerpt that was found.

3.1 Hierarchical Bloom Filter (HBF)
This subsection describes the (former) state-of-the-art payload attribution technique, called an HBF [Shanmugasundaram et al. 2004], in detail and extends the description of previous work from Section 2. The following eleven subsections, each
Table I. Summary of properties of methods from Section 3. We show how each method selects boundaries of blocks when processing a payload and how it affects the block size, how each method resolves the consecutiveness of blocks, its special characteristics, and finally, whether each method allows false negative and N/A answers to excerpt queries.
Fig. 1. The evolution tree shows the relationship among presented methods for payload attribution. Arrow captions describe the modifications made to the parent method.
Fig. 2. Processing of a payload consisting of blocks X0 X1 X2 X3 X4 X5 in a Hierarchical Bloom Filter.
showing a new technique, represent our novel contribution. An HBF supports queries for excerpts of a payload by dividing the payload of each packet into a set of blocks of fixed size s bytes (where s is a parameter specified by the system administrator [1]). The blocks of a payload of a single packet form a hierarchy (see Figure 2) which is inserted into a Bloom filter with appropriate offset numbers. Thus, besides inserting all blocks of a payload as in the BBF, we insert several super-blocks, i.e., blocks created by the concatenation of 2, 4, 8, etc., subsequent blocks, into the HBF. This produces the same result as having multiple BBFs with block sizes multiplied by powers of two, and a BBF can be looked upon as the base level of the hierarchy in an HBF. When processing a payload, we start at level 0 of the hierarchy by inserting all blocks of size s bytes. In the next level we double the size of a block and insert all blocks of size 2s. In the n-th level we insert blocks of size 2^n s bytes. We continue until the block size exceeds the payload size. The total number of blocks inserted into an HBF for a payload of size p bytes is Σ_l ⌊p/(2^l s)⌋, where l is the level index such that 0 ≤ l ≤ ⌊log2(p/s)⌋. Therefore, an HBF needs about twice as much storage space as a BBF to achieve the same theoretical false positive rate of a Bloom filter, because the number of elements inserted into the Bloom filter is twice as high. However, for longer excerpts the hierarchy improves the confidence of the query results because they are assembled from the results for multiple levels. We use one Bloom filter to store blocks from all levels of the hierarchy to improve space utilization, because the number of blocks inserted into Bloom filters at different levels depends on the distribution of payload sizes and is therefore dynamic.
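The insertion pattern for one packet can be sketched as follows; this is an illustrative enumeration only, as each (super-block, offset, level) tuple would in practice be hashed into the Bloom filter:

```python
def hbf_elements(payload: bytes, s: int):
    """Enumerate the (super-block, offset, level) tuples an HBF inserts:
    level l uses blocks of 2^l * s bytes, until the block size exceeds
    the payload size."""
    out = []
    l = 0
    while (size := (1 << l) * s) <= len(payload):
        for off in range(len(payload) // size):
            out.append((payload[off * size:(off + 1) * size], off, l))
        l += 1
    return out
```

For a 22-byte payload and s = 4 this yields ⌊22/4⌋ + ⌊22/8⌋ + ⌊22/16⌋ = 5 + 2 + 1 = 8 insertions, matching the sum Σ_l ⌊p/(2^l s)⌋ above.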
The utilization of this single Bloom filter is easy to control by limiting the number of inserted elements; thus we can limit the (theoretical) false positive rate. Offset numbers are the sequence numbers of blocks within the payload. Offsets are appended to block contents before insertion into an HBF: (content||offset), where 0 ≤ offset ≤ ⌊p/(2^l s)⌋ − 1, p is the size of the entire payload, and l is the

[1] The block size used for several years by an HBF-enabled system running in our campus network is 64 and 32 bytes, respectively, depending on whether it is deployed on the main gateway or smaller local ones. Longer blocks allow higher data reduction ratios but lower the querying capability for smaller excerpts.
Fig. 3. The hierarchy in the HBF does not cover double-blocks at odd offset numbers. In this example, we assume that two payloads X0X1X2X3 and Y0Y1Y2Y3 were processed by the HBF. If we query for an excerpt X1Y2, we get a positive answer, which represents an offset collision: two blocks (X1||1) and (Y2||2) were inserted from different payloads, but there was no packet containing X1Y2.
level of the hierarchy. Offset numbers are unique within one level of the hierarchy. See the example given in Fig. 2. We first insert all blocks of size s with the appropriate offsets: (X0 ||0), (X1 ||1), (X2 ||2), (X3 ||3), (X4 ||4). Then we insert the blocks at level 1 of the hierarchy: (X0 X1 ||0), (X2 X3 ||1). And finally level 2: (X0 X1 X2 X3 ||0). Note that in Figure 2 blocks X0 to X4 have size s bytes, but since block X5 is smaller than s bytes it does not form a block and its content is not processed. We analyze the percentage of discarded payload content for each method in Section 5. Offsets do not provide a reliable solution to the problem of detecting whether two blocks appeared consecutively in the same packet. For example, if a BBF processes two packets made up of blocks X0 X1 X2 X3 and Y0 Y1 Y2 Y3 Y4 , respectively, and we later query for an excerpt X2 Y3 , the BBF will answer that it has seen a packet with a payload containing such an excerpt. We call this event an offset collision. It happens because a block X2 with offset 2 from the first packet and a block Y3 with offset 3 from the second packet were both inserted into the BBF. When blocks from different packets are inserted at the appropriate offsets, a BBF can answer as if they occurred inside a single packet. An HBF reduces the false positive rate due to offset collisions and due to the inherent false positives of a Bloom filter by adding supplementary checks when querying for an excerpt composed of multiple blocks. In this example, an HBF would answer correctly that it did not see such an excerpt because the check for X2 Y3 in the next level of the hierarchy fails. However, if we query for an excerpt X1 Y2 , both the HBF and the BBF fail to answer correctly (i.e., they answer positively as if there were a packet containing X1 Y2 ). Figure 3 shows how the hierarchy improves the resistance to offset collisions but still fails for two-block strings at odd offsets.
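As a concrete illustration, the following sketch (function name ours) enumerates the (block, level, offset) triples an HBF inserts for a payload; in the actual system each triple would be hashed into the Bloom filter as (content||offset), together with a flowID:

```python
def hbf_elements(payload: bytes, s: int):
    """Enumerate the (block, level, offset) triples an HBF inserts.
    Level l holds blocks of 2^l * s bytes; a trailing piece shorter
    than the current block size is discarded, as in the text."""
    elements = []
    level = 0
    while (2 ** level) * s <= len(payload):
        size = (2 ** level) * s
        for offset in range(len(payload) // size):
            elements.append((payload[offset * size:(offset + 1) * size], level, offset))
        level += 1
    return elements
```

For a 10-byte payload and s = 2 this yields 5 + 2 + 1 = 8 elements, matching the ∑_l ⌊p/(2^l·s)⌋ count.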
We discuss offset collisions in an HBF further in Section 3.8. Because in the actual payload attribution system we insert blocks along with their flowIDs, collisions are less common, but they can still occur for payload inside one stream of packets within one time interval, as these blocks have the same flowID and are stored in one Bloom filter. Querying an HBF for an excerpt x starts with the same procedure as querying a BBF. First, we have to try all possible offsets at which x could have occurred inside one packet. We also have to try s possible starting positions of the first block inside x, since the excerpt may not start exactly on a block boundary of the
original payload. To do this, we slide a window of size s starting at each of the first s positions of x and query the HBF for this window (with all possible starting offsets). After a match is found for this first block, the query proceeds to the next block at the next offset until all blocks of the excerpt at level 0 are matched. An HBF continues by querying the next level for super-blocks of twice the size of the blocks in the previous level. Super-blocks start only at blocks from the previous level which have even offset numbers. We go up in the hierarchy until the queries at all levels succeed. The answer to an excerpt query is positive only if the answers from all levels of the hierarchy were positive. The maximum number of queries to a Bloom filter in an HBF in the worst case is roughly twice the number for a BBF.

3.2 Fixed Block Shingling (FBS)
In a BBF and an HBF we use offsets to determine whether blocks appeared consecutively inside one packet's payload. This causes a problem when querying for an excerpt because we do not know where the excerpt starts inside the payload (the starting offset is unknown). We have to try all possible starting offsets, which not only slows down the query process but also increases the false positive rate, because a false positive result may occur for any of these queries. As an alternative to using offsets we can use block overlapping, which we call shingling. In this scheme, the payload of a packet is divided into blocks of size s bytes as in a BBF, but instead of inserting these blocks we insert strings of size s + o bytes (the block plus a part of the next block) into the Bloom filter. Blocks overlap like shingles on a roof (see Figure 4), and the overlapping part makes it likely that two blocks appeared consecutively if they share a common part and both of them are in the Bloom filter. For a payload of size p bytes, the number of elements inserted into the Bloom filter is ⌊(p − o)/s⌋ for an FBS, which is close to the ⌊p/s⌋ of a BBF. However, the maximum number of queries to a Bloom filter in the worst case is about v times smaller than in a BBF, where v is the number of possible starting offsets. Since the value of v can be estimated as the system's maximum transmission unit (MTU) divided by the block size s, this improvement is significant, which is supported by the experimental results presented in Section 5.
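A sketch of the FBS split (function name and example parameters are ours): each element inserted into the Bloom filter is a block of s bytes extended by the first o bytes of the following block:

```python
def fbs_shingles(payload: bytes, s: int, o: int):
    """Fixed Block Shingling: emit blocks of s bytes, each extended
    by an o-byte overlap into the next block (the shingle)."""
    shingles = []
    pos = 0
    while pos + s + o <= len(payload):      # need a full block plus overlap
        shingles.append(payload[pos:pos + s + o])
        pos += s                             # advance by one block, not one shingle
    return shingles
```

Consecutive shingles share the o-byte overlap, which is what makes consecutiveness checkable at query time.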
Fig. 4. Processing of a payload with a Fixed Block Shingling (FBS) method (parameters: block size s = 8, overlap size o = 2).
The goal of the FBS scheme (of using an overlapping part) is to avoid trying all possible offsets during query processing, as in an HBF, to solve the consecutiveness problem. However, neither technique (shingling or offsets) is guaranteed to answer correctly: an HBF fails because of offset collisions and an FBS because
multiple blocks can start with the same string, and an FBS then confuses their positions inside the payload (see Figure 5). Thus both can increase the number of false positive answers. For example, the FBS will incorrectly answer that it has seen a string of blocks X0 X1 Y1 Y2 after processing two packets X and Y made of blocks X0 X1 X2 X3 X4 and Y0 Y1 Y2 , respectively, where X2 has the same prefix (of size at least o bytes) as Y1 .
Fig. 5. An example of a collision due to a shingling failure. The same prefix prevented an FBS method from determining whether two blocks appeared consecutively within one payload. The FBS method incorrectly treats the string of blocks X0 X1 Y1 Y2 as if it were processed inside one payload.
Querying a Bloom filter in the FBS scheme is similar to querying a BBF except that we do not use any offsets and therefore do not have to try all possible offset positions of the first block of an excerpt. Thus, when querying for an excerpt x we slide a window of size s + o bytes starting at each of the first s positions of x and query the Bloom filter for this window. When a match is found for this first block, the query proceeds with the next block (including the overlap) until all blocks of the excerpt are matched. Since these blocks overlap, we assume that they occurred consecutively inside one single payload. The answer to an excerpt query is considered positive only if there exists an alignment (i.e., a position of the first block's boundary) for which all tested blocks were found in the Bloom filter. Figures 6 and 7 show examples of querying in the FBS method. Note that these examples omit the need to determine all flowIDs of the excerpts found; even after a match is found for some alignment and flowID, we must continue checking other alignments and flowIDs, because multiple packets in multiple flows could contain such an excerpt.

3.3 Variable Block Shingling (VBS)
The use of shingling instead of offsets in an FBS method lets us avoid testing all possible offset numbers of the first block during querying, but we still have to test all possible alignments of the first block inside an excerpt (as shown in Fig. 6 and 7). A Variable Block Shingling (VBS) solves this problem by setting block boundaries based on the payload itself.
Fig. 6. An example of querying a FBS method (with a block size 8 bytes and an overlap size 2 bytes). Different alignments of the first block of the query excerpt (shown on top) are tested. When a match is found in the Bloom filter for some alignment of the first block we try subsequent blocks. In this example all blocks for the alignment starting at the third byte are found and therefore the query substring (at the bottom) is reported as found. We assume that the FBS processed the packet in Fig. 4 prior to querying.
Fig. 7. An example of querying an FBS method for an excerpt which is not supposed to be found (i.e., no packet containing such a string has been processed). The query processing starts by testing the Bloom filter for the presence of the first block of the query excerpt at different alignment positions. For alignment 2 the first block is found, because we assume that the FBS processed the packet in Fig. 4 prior to executing this query. The second block for this alignment was also found, due to a false positive answer of the Bloom filter. The third block for this alignment was not found, so we continue by testing the first block at alignment 3. As there was no alignment for which all blocks were found, we report that the query excerpt was not found.
We slide a window of size k bytes through the whole payload and for each position of the window we compute a value of the function H(c1 , . . . , ck ) on the byte values of the payload. When H(c1 , . . . , ck ) mod m is equal to zero, we insert a block boundary immediately after the current position of byte ck . Note that we can choose to put a block boundary before or after any of the bytes ci , 1 ≤ i ≤ k, but this selection has to be fixed. For use with shingling it is better to put the boundary after byte ck , so that the overlaps are not restricted to strings having only the special values which satisfy the above condition for boundary insertion (which would increase shingle collisions). If the function H is random and uniform, the parameter m sets the expected size of a block: for random payloads we will get a distribution of block sizes with an average size of m bytes. The drawback of this variable block size technique is that we can get many very small blocks, which can flood the Bloom filter, or some large blocks, which prevent us from querying for smaller excerpts. Therefore, we introduce an enhanced version of this scheme, EVBS, in
Fig. 8. Processing of a payload with a Variable Block Shingling (VBS) method.
the next section. In order to save computational resources it is convenient to use a function that can reuse the computations performed for the previous position of the window as we move from bytes c1 , . . . , ck to c2 , . . . , ck+1 . Rabin fingerprints (see Section 2.2) have such an iterative property, and we define the fingerprint F of a substring c1 c2 . . . ck , where ci is the value of the i-th byte of the substring of a payload, as:

F(c1 , . . . , ck ) = (c1·p^(k−1) + c2·p^(k−2) + · · · + ck ) mod M,    (3)

where p is a fixed prime number and M is a constant. To compute the fingerprint of the substring c2 . . . ck+1 , we need only add the last element and remove the first one:

F(c2 , . . . , ck+1 ) = (p·F(c1 , . . . , ck ) + ck+1 − c1·p^k ) mod M.    (4)

Because p and k are fixed we can precompute the value of p^(k−1). It is also possible to use Rabin fingerprints as hash functions in the Bloom filter. In our implementation we use a modified scheme [Broder 1997] to increase randomness without any additional computational cost:

F(c2 , . . . , ck+1 ) = (p·(F(c1 , . . . , ck ) + ck+1 − c1·p^k )) mod M.    (5)
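A sketch combining Eqs. (3) and (4) with the VBS boundary rule (parameter values are ours, chosen for illustration only): the fingerprint is maintained incrementally, and a boundary is placed after byte c_k of each window whose fingerprint is divisible by m:

```python
def vbs_blocks(payload: bytes, k=4, m=8, o=2, p=31, M=2**32):
    """Variable Block Shingling sketch: roll a Rabin fingerprint F over
    k-byte windows (Eqs. 3 and 4) and put a block boundary after byte
    c_k whenever F mod m == 0; emit blocks with an o-byte overlap."""
    if len(payload) < k:
        return [payload]
    F = 0
    for c in payload[:k]:              # fingerprint of the first window, Eq. (3)
        F = (F * p + c) % M
    p_k = pow(p, k, M)                 # precomputed power of p
    boundaries = []
    for i in range(len(payload) - k + 1):
        if i > 0:                      # roll: add c_{k+1}, drop c_1, Eq. (4)
            F = (F * p + payload[i + k - 1] - payload[i - 1] * p_k) % M
        if F % m == 0 and i + k < len(payload):
            boundaries.append(i + k)   # boundary after byte c_k
    blocks, start = [], 0
    for b in boundaries:
        blocks.append(payload[start:b + o])   # block plus o-byte shingle
        start = b
    blocks.append(payload[start:])
    return blocks
```

Because boundaries depend only on local content, splitting a query excerpt with the same function reproduces the same interior blocks, which is exactly the alignment-free property described below.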
The advantage of picking block boundaries using Rabin functions is that when we take an excerpt of a payload and divide it into blocks using the same Rabin function that was used for splitting during the processing of the payload, we get exactly the same blocks. Thus, we do not have to try all possible alignments of the first block of a query excerpt as in the previous methods. The rest of this method is similar to the FBS scheme, except that instead of fixed-size blocks we have variable-size blocks depending on the payload. To process a payload we slide a window of size k bytes through the whole payload. For each of its positions we check whether the value of F modulo m is zero, and if so we set a new block boundary. All blocks are inserted with an overlap of o bytes, as shown in Figure 8. Querying in a VBS method is the simplest of all the methods in this section because there are no offsets and no alignment problems. Therefore, this method involves far fewer tests for membership in a Bloom filter. Querying for an excerpt is done in the same way as processing a payload above, except that instead of inserting blocks we query the Bloom filter for them. The answer is positive only if all blocks are found in the Bloom filter. The maximum number of queries to a Bloom filter in
Fig. 9. Processing of a payload with an Enhanced Variable Block Shingling (EVBS) method.
the worst case is about v · s times smaller than in a BBF, where v is the number of possible starting offsets and s is the number of possible alignments of the first block in a BBF, assuming the average block size in a VBS method to be s.

3.4 Enhanced Variable Block Shingling (EVBS)
The enhanced version of the variable block shingling method addresses a problem with block sizes. A VBS can create many small blocks, which can flood the Bloom filter and do not provide enough discriminability, or some large blocks, which can prevent querying for smaller excerpts. In an EVBS we form superblocks composed of the blocks found by a VBS method to achieve better control over block sizes. To be precise, when processing a payload we slide a window of size k bytes through the entire payload and for each position of the window we compute the value of the fingerprinting function H(c1 , . . . , ck ) on the byte values of the payload, as in the VBS method. When H(c1 , . . . , ck ) mod m is equal to zero, we insert a block boundary after the current position of byte ck . We take the resulting blocks of expected size m bytes, one by one from the start of the payload, and form superblocks, i.e., new non-overlapping blocks made of multiple original blocks, with size at least m′ bytes, where m′ ≥ m. We do this by selecting some of the original block boundaries to be the boundaries of the new superblocks: every boundary that creates a superblock of size greater than or equal to m′ is selected (Figure 9, where the minimum superblock size is m′ ). Finally, superblocks with an overlap of o bytes into the next superblock are inserted into the Bloom filter. The maximum number of queries to a Bloom filter in the worst case is about the same as for a VBS, assuming the average block sizes for the two methods are the same. Superblock formation leads, however, to a problem when querying for an excerpt. If we use the same fingerprinting function H and parameter m we get the same block boundaries in the excerpt as in the original payload, but the starting boundary of the first superblock inside the excerpt is unknown.
Therefore, we have to try all boundaries in the first m′ bytes of an excerpt (or the first one that follows, if there is none) as the starting boundary of the first superblock. The number of possible boundaries we have to try in an EVBS method (approximately m′ /m) is much smaller than the number of possible alignments (i.e., the block size s) in an HBF, for usual parameter values.
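The superblock formation step can be sketched as a single pass over the VBS boundary positions (names ours): keep every boundary that closes a superblock of at least m′ bytes.

```python
def evbs_superblock_boundaries(vbs_boundaries, m_prime):
    """EVBS superblock formation: from ascending VBS boundary byte
    positions, select each boundary that makes the current superblock
    at least m_prime bytes long."""
    selected, last = [], 0
    for b in vbs_boundaries:
        if b - last >= m_prime:    # this boundary closes a big-enough superblock
            selected.append(b)
            last = b
    return selected
```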
Fig. 10. Processing of a payload with a Winnowing Block Shingling (WBS) method. First, we compute hash values for each payload byte position. Boundaries are then selected at the positions of the rightmost maximum hash value inside the winnowing window, which we slide through the array of hashes. Bytes between consecutive pairs of boundaries form blocks (plus the overlap).
3.5 Winnowing Block Shingling (WBS)
In a Winnowing Block Shingling method we use the idea of winnowing, described in Section 2.3, to select block boundaries, and shingling to resolve the consecutiveness of blocks. We select a winnowing window size instead of a block size and are guaranteed to have at least one boundary in any window of this size inside the payload. This also sets an upper bound on the block size. We start by computing hash values for each payload byte position. In our implementation this is done by sliding a window of size k bytes through the whole payload and computing, for each position of the window, the value of a fingerprinting function H(c1 , . . . , ck ) on the byte values of the payload, as in the VBS method. In this way we get an array of hashes, where the i-th element is the hash of bytes ci , . . . , ci+k−1 , ci being the i-th byte of the payload of size p, for i = 1, . . . , (p − k + 1). Then we slide a winnowing window of size w through this array and for each position of the winnowing window we put a boundary immediately before the position of the maximum hash value within this window. If several hashes tie for the maximum, we choose the rightmost one. Bytes between consecutive pairs of boundaries form blocks (plus the beginning of size o of the next block, the overlap) and are inserted into a Bloom filter. See Figure 10. When querying for an excerpt we perform the same process except that we query the Bloom filter for the blocks instead of inserting them. If all blocks are found in the Bloom filter, the answer to the query is positive. The maximum number of queries to a Bloom filter in the worst case is about the same as for a VBS, assuming the average block sizes for the two methods are the same. There is at least one block boundary in any window of size w; therefore the longest possible block is w + 1 + o bytes. This also guarantees that there are always at least two boundaries to form a block in an excerpt of size at least 2w + o bytes.
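The winnowing boundary selection can be sketched as follows (a sketch under our naming; the hash array would come from the rolling fingerprint of the previous sections):

```python
def winnow_boundaries(hashes, w):
    """WBS boundary selection: slide a window of w hashes and, at each
    position, place a boundary at the rightmost maximum hash in the
    window.  Every window contributes a boundary inside itself, so no
    two consecutive boundaries are more than w positions apart."""
    boundaries = set()
    for i in range(len(hashes) - w + 1):
        window = hashes[i:i + w]
        mx = max(window)
        j = max(idx for idx, h in enumerate(window) if h == mx)  # rightmost max
        boundaries.add(i + j)
    return sorted(boundaries)
```

Ties broken toward the rightmost maximum make the selection deterministic, so the same boundaries are recovered when the query excerpt is winnowed.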
3.6 Variable Hierarchical Bloom Filter (VHBF)
Querying an HBF involves trying s possible starting positions of the first block in an excerpt. In a VHBF method we avoid this by splitting the payload into variable-sized blocks (see Figure 11) determined by a fingerprinting function as in Section 3.3 (VBS). Building the hierarchy, the insertion of blocks and querying are the same as in the original Hierarchical Bloom Filter; only the block boundary definition has changed. Reducing the number of queries by a factor of s (the size of one block in an HBF) helps reduce the resulting false positive rate of this method. Notice that even if we added overlaps between blocks (i.e., used shingling) we would still need offsets, because they determine whether to check the next level of the hierarchy during the query phase: we check the next level only for even offset numbers.
Fig. 11. Processing of a payload with a Variable Hierarchical Bloom Filter (VHBF) method.

3.7 Fixed Doubles (FD)
The method of fixed doubles is designed to address a shortcoming of the hierarchy in an HBF. The hierarchy in an HBF is not complete in the sense that we do not insert all double-blocks (blocks of size 2s), all quadruple-blocks (4s), and so on, into the Bloom filter. For example, when inserting a packet consisting of blocks S0 S1 S2 S3 S4 into an HBF, we insert blocks S0 ,. . . , S4 , S0 S1 , S2 S3 , and S0 S1 S2 S3 . If we query for an excerpt of size 2s (or up to size 4s − 2 bytes; see Figure 3), for example S1 S2 , this block of size 2s is not found in the HBF (as with all other double-blocks at odd offsets), and the false positive rate in this case is worse than that of a BBF, because an HBF needs about twice as much space as a BBF with the same false positive rate of the Bloom filter. The same is true for the other levels of the hierarchy; in fact, the probability of this event rises exponentially with the level number. As an alternative approach to the hierarchy, we insert all double-blocks as shown in Figure 12, but we do not continue to the next level, so as not to increase the storage requirements. Note that this method is not identical to an FBS scheme with an overlap of size s, because in an FD we insert all single blocks and also all double-blocks. In this method we use neither shingling nor offsets, because the consecutiveness problem is solved by the level of double-blocks, which overlap each other,
New Payload Attribution Methods for Network Forensic Investigations
·
19
each sharing its first half with the previous double-block and its second half with the next one.
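A sketch of what FD inserts (names ours): all single blocks plus all double-blocks, the latter starting at every block offset.

```python
def fd_elements(payload: bytes, s: int):
    """Fixed Doubles: emit every single block of size s and every
    double-block of size 2s, starting at every block offset
    (not only even offsets, unlike the HBF hierarchy)."""
    n = len(payload) // s                                  # full single blocks
    singles = [payload[i * s:(i + 1) * s] for i in range(n)]
    doubles = [payload[i * s:(i + 2) * s] for i in range(n - 1)]
    return singles + doubles
```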
Fig. 12. Processing of a payload with a Fixed Doubles (FD) method.
The query mechanism works as follows. We first find the correct alignment of the first block of an excerpt by querying the Bloom filter for all windows of size s starting at positions 0 through s − 1. Note that we can get multiple positive answers, in which case we continue the process independently for all of them. Then we split the excerpt into blocks of size s starting at the position found and query for each of them. Finally, we query for all double-blocks, and if all answers are positive we claim that the excerpt was found. The FD scheme inserts 2⌊p/s⌋ − 1 blocks into the Bloom filter for a payload of size p bytes, which is approximately the same as an HBF and about two times more than an FBS scheme. The maximum number of queries to a Bloom filter in the worst case is about twice the number for an FBS method.

3.8 Variable Doubles (VD)
This method is similar to the previous one (FD) but the block boundaries are determined by a fingerprinting function as in a VBS (Section 3.3). Hence, we do not have to find the correct alignment of the first block of an excerpt when querying, and the blocks have variable size. Both the number of blocks inserted into the Bloom filter and the maximum number of queries in the worst case are approximately the same as for the FD scheme. An example is given in Figure 13. As with the FD method, we use neither shingling nor offsets, because the consecutiveness problem is solved by the level of double-blocks, which overlap with each other and with the single blocks. During querying we simply divide the query excerpt into blocks by the fingerprinting method and query the Bloom filter for all blocks and for all double-blocks. Finally, if all answers are positive we claim that the excerpt was found.

3.9 Enhanced Variable Doubles (EVD)
The Enhanced Variable Doubles method uses the technique from Section 3.4 (EVBS) to create an extension of the VD method by forming superblocks of a payload. These superblocks are then treated the same way as blocks in a VD method. Thus, we insert all superblocks and all doubles of these superblocks into the Bloom filter as
Fig. 13. Processing of a payload with a Variable Doubles (VD) method.
shown in Figure 14. The number of blocks inserted into the Bloom filter as well as the maximum number of queries in the worst case is similar to that of the VD method (assuming similar average block sizes of both schemes).
Fig. 14. Processing of a payload with an Enhanced Variable Doubles (EVD) method.

3.10 Multi-Hashing (MH)
One might expect the VBS technique to be strengthened by using multiple independent VBS methods, since this provides greater flexibility in the choice of parameters, such as the expected block size. We call this technique Multi-Hashing; it uses t independent fingerprinting methods (or fingerprinting functions with different parameters) to set block boundaries, as shown in Figure 15. It is equivalent to using t independent Variable Block Shingling methods, and the answer to an excerpt query is positive only if all t methods answer positively. Note that even if we set the overlapping part, i.e., the parameter o, to zero for all instances of the VBS, we would still retain a guarantee that the excerpt appeared on the network as one continuous fragment, because the blocks of different instances overlap each other. Moreover, by using expected block sizes of the instances that are multiples of powers of two, we can generate a hierarchical structure with the MH method. The expected number of blocks inserted into the Bloom filter for a payload of size p bytes is ∑_{i=1}^{t} ⌊p/mi ⌋, where mi is the expected block size of the i-th VBS.
Fig. 15. Processing of a payload with a Multi-Hashing (MH) method. In this case, the MH uses two independent instances of the Variable Block Shingling method simultaneously to process the payload.
3.11 Enhanced Multi-Hashing (EMH)
The enhanced version of Multi-Hashing uses multiple instances of EVBS to increase the certainty of answers to excerpt queries. Blocks inserted by independent instances of EVBS are different and overlap with each other, which improves the robustness of this method. Aside from superblock formation, all other aspects are the same as in the MH method. In our experiments in Section 5 we use two independent instances of EVBS with identical parameters and store the data for both in one Bloom filter.

3.12 Winnowing Multi-Hashing (WMH)
The WMH method uses multiple instances of WBS (Section 3.5) to reduce the probability of false positives for excerpt queries. The WMH not only gives excellent control over the block sizes due to winnowing (see Figure 18(c)) but also provides much greater confidence about the consecutiveness of the blocks inside the query excerpt, because of overlaps both inside each instance of WBS and among the blocks of the multiple instances. Both querying and payload processing are done for all t WBS instances and the final answer to an excerpt query is positive only if all t answers are positive. In our experiments in Section 5 we use two instances of WBS with an identical winnowing window size and store the data from both methods in one Bloom filter. By storing the data of each instance in a separate Bloom filter we could instead support data aging: space is saved by keeping only some of the Bloom filters for very old data, at the cost of higher false positive rates. For an example of processing payload and querying in a WMH method see the multi-packet case in Fig. 16 and 17.

4. PAYLOAD ATTRIBUTION SYSTEMS (PAS)
A payload attribution system performs two separate tasks: payload processing and query processing. In payload processing, the payload of all traffic that passes through the network where the PAS is deployed is examined and some information is saved to permanent storage. This has to be done at line speed, and the underlying raw packet capture component can also perform some filtering of the packets, for
example, choosing to process only HTTP traffic. Data is stored in archive units, each of which has two timestamps (the start and end of the time interval during which the data was collected). For each time interval we also need to save all flowIDs (e.g., pairs of source and destination IP addresses) to allow querying later on. This information can alternatively be obtained from connection records collected by firewalls, intrusion detection systems or other log files. During query processing, given an excerpt and a time interval of interest, we retrieve all the corresponding archive units from storage. We query each unit for the excerpt and, if we get a positive answer, we successively query for each of the flowIDs appended to the blocks of the excerpt and report all matches to the user.

4.1 Attacks on PAS
As with any security system, there are ways an adversary can evade proper attribution. We identify the following types of attacks on a PAS (mostly similar to those in [Shanmugasundaram et al. 2004]):

4.1.1 Compression & Encryption. If the payload is compressed or encrypted, a PAS can support queries only for the exact compressed or encrypted form.

4.1.2 Fragmentation. An attacker can transform the stream of data into a sequence of packets with payload sizes much smaller than the (average) block size used by the PAS. Methods with variable block sizes, where block boundaries depend on the payload, are harder to defeat, but for very small fragments, e.g., 6 bytes each, the system will not be able to perform the attribution correctly. A solution is to make the PAS stateful so that it concatenates the payloads of one data stream prior to processing. However, such a solution would impose additional memory and computational costs, and there are known attacks on stateful IDS systems [Handley et al. 2001], such as incorrect fragmentation and timing attacks.

4.1.3 Boundary Selection Hacks. For methods whose block boundaries depend on the payload, an attacker can try to send special packets whose payloads contain too many boundaries or none at all. The PAS can use different parameters for the boundary selection algorithm in each archive unit, making it impossible for an attacker to reliably fool the system. Moreover, winnowing guarantees at least one boundary in each winnowing window.

4.1.4 Hash Collisions. Hash collisions are very unpredictable and therefore hard for an attacker to exploit, because we use a different salt for the hash computation in each Bloom filter.

4.1.5 Stuffing. An attacker can inject characters into the payload which are ignored by applications but which change the payload structure at the network layer.
Our methods are robust against stuffing because the attacker would have to modify most of the payload to avoid correct attribution, as we can match even very small excerpts of payload.

4.1.6 Resource Exhaustion. Flooding attacks can impair a PAS. However, our methods are more robust to these attacks than raw packet loggers due to the data
reduction they provide. Moreover, processing identical payloads repeatedly does not impact the precision of attribution, because insertion into a Bloom filter is an idempotent operation. On the other hand, the list of flowIDs is vulnerable to flooding, for example when a worm tries to propagate out of the network by trying many random destination addresses.

4.1.7 Spoofing. Source IP addresses can be spoofed, and a PAS is primarily concerned with attributing payload according to what packets have been delivered by the network. The scope of possible spoofing depends on the deployment of the system and the filtering applied in the affected networks.

4.2 Multi-packet queries
The methods described in Section 3 show how to query for excerpts inside one packet's payload. Nevertheless, we can extend the querying mechanism to handle strings that span multiple packets. Methods that use offsets must, when a block is not found at its sequential offset number, continue querying for it with a zero offset instead, and must also try all alignments of that block, because fragmentation into packets may leave a piece smaller than the block size at the end of the first packet. This is very inefficient and increases the false positive rate. Moreover, for methods that form a hierarchy of blocks, the hierarchy cannot be fully utilized. The payload attribution system can fix this by performing TCP stream reconstruction and working on the reconstructed flow. Methods using shingling, on the other hand, can be extended without any further changes if we return, as the answer to the query, the full sequence of blocks found (see the WMH example in Fig. 16 and 17).
Fig. 16. Processing of payloads of two packets with a Winnowing Multi-Hashing (WMH) method where both packets are processed by two independent instances of the Winnowing Block Shingling (WBS) method simultaneously.
4.3 Privacy and Simple Access Control
Fig. 17. Querying for an excerpt spanning multiple packets in a Winnowing Multi-Hashing (WMH) method comprised of two instances of WBS. We assume the WMH method processed the packets in Fig. 16 prior to querying. In this case, we see that WMH can easily query for an excerpt spanning two packets and that the blocks found overlap significantly, which increases the confidence of the query result. However, there is still a small gap between the two parts because WMH works on individual packets (unless we perform TCP stream reconstruction).

Processing and archiving payload information must comply with the privacy and security policies of the network where they are performed. Furthermore, authorization to use the payload attribution system should be granted only to properly authorized parties, and all necessary precautions must be taken to minimize the possibility of a system compromise. The privacy of our methods stems from using a Bloom filter to hold the data: the Bloom filter can be queried only for a specific packet content, and it cannot be forced to provide a list of the packet data stored inside. Simple access control (i.e., restricting the ability to query the Bloom filter) can be achieved as follows. Our methods allow the collected data to be stored and queried by an untrusted party without disclosing any payload information or giving the query engine any knowledge of the contents of queries. We achieve this by adding a secret salt when computing the hashes used to insert into and query the Bloom filter. A different salt is used for each Bloom filter and serves the purpose of a secret key. We can also easily achieve a much finer granularity of access control by using different keys for different protocols or for subranges of the IP address space. Without the key, a third party cannot query the Bloom filter, and the key does not have to be made available to the querier (only the indices of the bits we want to query are disclosed). However, additional measures must be taken to ensure that the third party provides correct answers and does not alter the archived data. Also note that this kind of access control is not cryptographically secure and some information leakage can occur. On the other hand, it incurs no additional computational or storage cost, and there is no need to decrypt the data before querying, as is common with standard techniques. A detailed analysis of the privacy achieved by using Bloom filters can be found in [Bellovin and Cheswick 2004].
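A minimal sketch of the salting scheme (assuming HMAC-SHA256 as the keyed hash; the paper does not prescribe a particular hash function, and the sizes below are illustrative): the party holding the salt derives the bit indices locally and discloses only those indices to the untrusted party holding the filter.

```python
import hmac, hashlib

M = 1 << 20   # Bloom filter size in bits (illustrative)
K = 4         # number of hash functions

def bit_indices(block: bytes, salt: bytes, m: int = M, k: int = K):
    """Derive the k Bloom filter bit indices for a block using the
    secret salt as an HMAC key. Without the salt, a third party can
    neither compute the indices for a block of interest nor recover
    payload content from the indices it is shown."""
    return [int.from_bytes(hmac.new(salt, bytes([i]) + block,
                                    hashlib.sha256).digest()[:8], 'big') % m
            for i in range(k)]

# Collector side: set a block's bits in the salted filter.
salt = b"per-filter secret key"
bits = bytearray(M // 8)
for idx in bit_indices(b"example payload block", salt):
    bits[idx // 8] |= 1 << (idx % 8)

# Querier side: only the indices are revealed, not the block or salt.
query = bit_indices(b"example payload block", salt)
print(all(bits[i // 8] & (1 << (i % 8)) for i in query))  # True
```

Because each filter uses its own salt, finer-grained access control (per protocol or per address range, as described above) amounts to handing out only the corresponding keys.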
4.4 Compression
In addition to the inherent data reduction provided by our attribution methods through the use of Bloom filters, our experiments show that we can achieve roughly another 20 percent of storage savings by compressing the archived data (after careful optimization of the Bloom filter parameters [Mitzenmacher 2002]), for example with gzip. The results presented in the next section do not include this additional compression.

5. EXPERIMENTAL RESULTS
In this section we present performance measurements of the payload attribution methods described in Section 3 and discuss the results from various perspectives. For this purpose we collected a network trace of 4 GB of HTTP traffic from our campus network. For the performance evaluation throughout this section we treat a 3.1 MB segment of the trace (about 5000 packets) as one unit trace collected during one time interval. As discussed earlier, we store all network traffic information in one Bloom filter in memory and save it to permanent storage at predefined time intervals or when it becomes full. A new Bloom filter is used for each time interval. The time interval should be short because it determines the time precision of attribution, i.e., we can determine only in which time interval a packet appeared on the network, not the exact time. The results presented do not depend on the size of the unit because we use a data reduction ratio to set the Bloom filter size (e.g., a 100:1 ratio means a Bloom filter of size 31 kB). Each method uses one Bloom filter of equal size to store all data. Our results did not show any deviations depending on the selection of the segment within the trace. All methods were tested to select the best combination of parameters for each of them. The results are grouped into subsections by different points of interest.

5.1 Performance Metrics
To compare payload attribution methods we consider several aspects which are not completely independent. The first and most important aspect is the amount of storage space a method needs to allow querying with a false positive rate bounded by a predefined value; we provide a detailed comparison and analysis in the following subsections. Second, the methods differ in the number of elements they insert into a Bloom filter when processing packets, and in the number of Bloom filter queries performed when querying for an excerpt in the worst case (i.e., when the answer is negative); a summary can be found in Table II. Methods which use shingling and a variable block size achieve a significant decrease in the number of queries they have to perform for each excerpt. This is important not only for computational performance but also for the resulting false positive rate, as each query to the Bloom filter risks a false positive answer. The boundary selection techniques these methods use are computationally very efficient and can be performed in a single pass through the payload. The implementation can be highly optimized for a particular platform, and parts of the processing can also be done in special hardware. Our implementation running on a Linux-based commodity PC (with a kernel modified for fast packet capturing [NTOP 2008]) can smoothly handle 200 Mbps, and the processing can easily be split among multiple machines (e.g., by having each machine process packets according to a hash value of the packet header).

5.2 Block Size Distribution
The graphs in Figure 18 show the distributions of block sizes for three different methods of block boundary selection. We use a block (or winnowing window) size parameter of 32 bytes, a small block size of 8 bytes for an EVBS, and an overlap of 4 bytes. Both VBS and EVBS show a distribution with an exponential decrease in the number of blocks with increasing block size, shifted by the overlap size for a VBS or by the block size plus the overlap size for an EVBS. Long tails were cropped for clarity; the longest block was 1029 bytes long. A winnowing method, on the other hand, results in a fairly uniform distribution in which the block sizes are bounded by the winnowing window size plus the overlap.

The apparent peaks at the smallest block size in graphs 18(a) and 18(c) are caused by low-entropy payloads, such as long runs of zeros. The distributions of block sizes obtained by processing random payloads generated with the same payload sizes as in the original trace show the same distributions, just without these peaks. Nevertheless, the large number of small blocks does not significantly affect the attribution because inserting a block into the Bloom filter is an idempotent operation.

Table II. Comparison of payload attribution methods from Section 3 based on the number of elements inserted into a Bloom filter when processing one packet of a fixed size and the number of blocks tested for presence in a Bloom filter when querying for an excerpt of a fixed size in the worst case (i.e., when the answer is negative). Note that the values are approximations and we assume all methods have the same average block size. The variables refer to: n, the number of blocks inserted by a BBF, taken as a base; s, the size of a block in fixed block size methods (BBF, HBF, FBS, FD); v, the number of possible offset numbers; p, the number of alignments tested for the enhanced methods (EVBS, EVD, EMH). Note that the actual number of bits tested or set in a Bloom filter depends on the number of hash functions used by each method; this table therefore presents numbers of blocks.

5.3 Unprocessed Payload
Some fraction of each packet's payload is not processed by the attribution mechanisms presented in Section 3. Table III(b) shows how each boundary selection method affects the percentage of unprocessed payload. For methods with a fixed block size, the part of a payload between the last block boundary and the end of the payload is ignored by the payload attribution system. With (enhanced) Rabin fingerprinting and with winnowing methods, the part from the beginning of the payload to the first block boundary and the part between the last block boundary and the end of the payload are not processed. The enhanced version of Rabin fingerprinting achieves much better results because the small block size, which was four times smaller than the superblock size in our test, applies when selecting the first block boundary in a payload. Winnowing performs better than the other variable block size methods in terms of unprocessed payload. Note also that a WMH method, even though it uses winnowing for block boundary selection just as a WBS does, has a percentage of unprocessed payload about t times smaller than a WBS, because each of the t instances of WBS within the WMH covers a different part of the payload independently. Moreover, the "inner" part of the payload is covered t times, which makes the method much more resistant to collisions, since t collisions have to occur at the same time to produce a false positive answer. For large packets the small percentage of unprocessed payload does not pose a problem; however, very small packets, e.g., only 6 bytes long, may not be processed at all. Therefore we can optionally insert the entire payload of a packet in addition to all its blocks, and add a special query type to the system to support queries for exact packets. This will increase the
Fig. 18. The distributions of block sizes for three different methods of block boundary selection after processing 100000 packets of HTTP traffic are shown: (a) VBS (the inner graph shows an exponential decrease in log scale), (b) EVBS, (c) WBS.
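The bounded block sizes of Fig. 18(c) follow directly from the winnowing selection rule, which can be sketched as follows (an illustrative reimplementation, not the paper's code; the 4-byte k-gram and blake2b hash are arbitrary stand-ins for the actual fingerprinting function):

```python
import hashlib

def winnow_boundaries(payload: bytes, window: int = 32, kgram: int = 4):
    """Winnowing: hash the k-gram at every position, then select the
    position of the minimum hash in each sliding window of `window`
    consecutive hashes. Every window contributes a boundary, so
    consecutive boundaries are at most `window` positions apart,
    which bounds the resulting block sizes."""
    n = len(payload) - kgram + 1
    if n < window:
        return []
    hashes = [int.from_bytes(hashlib.blake2b(payload[i:i + kgram],
                                             digest_size=4).digest(), 'big')
              for i in range(n)]
    selected = set()
    for s in range(n - window + 1):
        w = hashes[s:s + window]
        selected.add(s + w.index(min(w)))
    return sorted(selected)

bounds = winnow_boundaries(bytes(range(256)) * 4, window=32)
gaps = [b - a for a, b in zip(bounds, bounds[1:])]
print(max(gaps) <= 32)  # True: winnowing bounds the gap by the window size
```

If two consecutive boundaries were farther apart than the window, some window between them would contain no selected position, contradicting the fact that every window selects its minimum; this is why winnowing, unlike Rabin fingerprinting, yields no long tail of oversized blocks.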
Table III. (a) False positive rates for a data reduction ratio of 130:1 for a WMH method with a winnowing window size of 64 bytes and therefore an average block size of about 32 bytes. The table summarizes the answers to 10000 queries for each query excerpt size. All 10000 answers should be NO; YES answers are due to false positives inherent to a Bloom filter. WMH guarantees no N/A answers for these excerpt sizes. (b) The percentage of unprocessed payload of 50000 packets depending on the block boundary selection method used. Details are provided in Sections 5.2 and 5.3.
Fig. 19. The graph shows the number of correct answers to 10000 excerpt queries for a varying length of a query excerpt for each method (with block size or winnowing window size 64 bytes) and data reduction ratio 100:1. This reduction ratio can be further improved to about 120:1 by using a compressed Bloom filter. The WMH method has no false positives for excerpt sizes 250 bytes and longer. The previous state-of-the-art method HBF does not provide any useful answers at this high data reduction ratio for excerpts shorter than about 400 bytes.
storage requirements only slightly because we would insert one additional element per packet into the Bloom filter.

5.4 Query Answers
Table IV. Measurements of the false positive rate for a data reduction ratio of 50:1. The table summarizes the answers to 10000 excerpt queries using all the methods described in Section 3 (with a block size of 32 bytes) for various query excerpt lengths (top row). The queries were performed after processing a real packet trace, and all methods use the same size of Bloom filter (50 times smaller than the trace). All 10000 answers should be NO, since the excerpts were not present in the trace; YES answers are due to false positives in the Bloom filter, and N/A answers mean that no boundaries were selected inside the excerpt to form a block to query for.

To measure and compare the performance of the attribution methods, and in particular to analyze the false positive rate, we processed the trace by each method and queried for random strings, each of which included in the middle a small excerpt of size 8 bytes that
did not occur in the trace. In this way we made sure that the query strings did not represent payloads inserted into the Bloom filters. Every method has to answer YES, NO, or not available (N/A) to each query. A YES answer means that a match was found for the entire query string for at least one flowID, and represents a false positive. A NO answer is the correct answer for the query, and an N/A answer is returned if the blocking mechanism specific to the method did not select even one block inside the query excerpt. An N/A answer can occur, for example, when the query excerpt is smaller than the block size in an HBF, or when one or no boundary was selected in the excerpt by a VBS method, so that there was no block to query for. In Table I we summarize the possibility of getting N/A and false negative answers for each method. A false negative answer can occur when we query for an excerpt whose size is greater than the block size but none of the alignments of blocks inside the excerpt fits the alignment that was used to process the payload containing the excerpt. For example, if we processed the payload ABCDEFGHI with an HBF with a block size of 4 bytes, we would insert the blocks ABCD, EFGH, and ABCDEFGH; if we then queried for the excerpt BCDEFG, the HBF would answer NO. Note that false negatives can occur only for excerpts smaller than twice the block size, and only for methods which involve testing the alignment of blocks. Table IV provides detailed results of 10000 excerpt queries for all methods with the same storage capacity and a data reduction ratio of 50:1. The WMH method achieves the best results among the listed methods for all excerpt sizes, and for excerpts longer than 200 bytes it has no false positives. WMH also has the fewest N/A answers among the methods with a variable block size because it guarantees at least one block boundary in each winnowing window.
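The ABCDEFGHI example can be reproduced with a toy sketch (a simplified model: a plain set stands in for the Bloom filter, and offset numbers and the hierarchy are omitted; the helper names are ours):

```python
def fixed_blocks(payload: bytes, s: int = 4):
    """Non-overlapping blocks at one fixed alignment, as produced when
    a fixed block size method processes a payload."""
    return {payload[i:i + s] for i in range(0, len(payload) - s + 1, s)}

def query_excerpt(stored, excerpt: bytes, s: int = 4):
    """Try every alignment of the excerpt: the query succeeds only if
    some alignment yields consecutive blocks that are all present."""
    for offset in range(s):
        blocks = [excerpt[i:i + s]
                  for i in range(offset, len(excerpt) - s + 1, s)]
        if blocks and all(b in stored for b in blocks):
            return True
    return False

stored = fixed_blocks(b"ABCDEFGHI")          # {b'ABCD', b'EFGH'}
print(query_excerpt(stored, b"ABCDEFGH"))    # True:  matches at offset 0
print(query_excerpt(stored, b"BCDEFG"))      # False: no alignment of the
                                             # excerpt fits the stored one
```

The excerpt BCDEFG yields the candidate blocks BCDE, CDEF, and DEFG across its alignments; none of them was ever inserted, so the method returns a false negative exactly as described above.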
The results also show that methods with a variable block size are in general better than methods with a fixed block size because there are no problems with finding the right
alignment. The enhanced version of Rabin fingerprinting for block boundary selection does not perform better than the original version, mostly because we need to try all alignments of blocks inside superblocks when querying, which increases the false positive rate. The enhanced methods therefore improve the block size control but not necessarily the false positive rate. The graph in Fig. 19 shows the number of correct answers to 10000 excerpt queries as a function of the query excerpt length for each method, with the block size parameter set to 64 bytes and a data reduction ratio of 100:1. The WMH method outperforms all other methods for all excerpt lengths. The HBF's results, on the other hand, are the worst, because it can fully utilize its hierarchy only for long excerpts and it has a very high false positive rate at high data reduction ratios due to the use of offsets, problems with block alignments, and the large number of elements inserted into the Bloom filter. As can be observed by comparing the HBF's results to those of an FD method, using double-blocks instead of building a hierarchy is a significant improvement; for excerpt sizes of 220 bytes and longer, an FD performs even better than the variable block size version of the hierarchy, a VHBF. Interestingly, an HBF outperforms a VHBF for excerpts of length 400 bytes; for excerpts longer than 400 bytes (not shown) they perform about the same, and both quickly reach the point of no false positives. The results in this graph are clearly separated into two groups by performance: the curves representing the methods which use a variable block size and no offset numbers or superblocks (i.e., VBS, VD, WBS, MH, WMH) have a concave shape and in general perform better. For very long excerpts all methods provide highly reliable results.
Winnowing Multi-Hashing achieves the best overall performance in all our tests and allows querying for very small excerpts, because the average block size is approximately half of the winnowing window size (plus the overlap size); the average block size was 18.9 bytes for a winnowing window size of 32 bytes and an overlap of 4 bytes. Table III(a) shows the false positive rates of WMH for a data reduction ratio of 130:1. This data reduction ratio means that the total size of the processed payload was 130 times the size of the Bloom filter which is archived to allow querying. The Bloom filter could additionally be compressed to achieve a final compression of about 158:1, but note that the parameters of the Bloom filter (i.e., the relation among the number of hash functions used, the number of elements inserted, and the size of the Bloom filter) then have to be set differently [Mitzenmacher 2002]. We have to decide in advance whether we want to use this additional compression and, if so, optimize the parameters for it; otherwise the Bloom filter data has very high entropy and is hard to compress. The compression is possible because the approximate representation of a set by a standard Bloom filter does not reach the information-theoretic lower bound [Broder and Mitzenmacher 2002]. The winnowing block shingling method (WBS) performs almost as well as WMH (see Fig. 19) and requires t times less computation. However, the confidence of its results is lower than with WMH, because multi-hashing covers the majority of the query excerpt multiple times; moreover, if storage needs to be reclaimed, a data aging method can downgrade an archived WMH to a simple WBS later.
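The compressibility effect can be sketched numerically (illustrative parameters; zlib stands in for gzip, which uses the same DEFLATE algorithm): a Bloom filter tuned for minimal size has roughly half its bits set and is essentially incompressible, while a sparser, deliberately oversized filter compresses well.

```python
import os, random, zlib

m_bytes = 8192                      # filter size in bytes (illustrative)

# A filter sized for minimum storage has ~50% of its bits set:
# its contents look like random bytes, so DEFLATE cannot shrink them.
dense = os.urandom(m_bytes)

# A sparser filter (fewer elements per bit) has low-entropy contents
# and compresses substantially below its nominal size.
sparse = bytearray(m_bytes)
rng = random.Random(0)
for _ in range(m_bytes):            # set roughly 1/8 of the bits
    i = rng.randrange(8 * m_bytes)
    sparse[i // 8] |= 1 << (i % 8)

print(len(zlib.compress(dense)) / m_bytes)          # ~1.0: no gain
print(len(zlib.compress(bytes(sparse))) / m_bytes)  # well below 1.0
```

This is why the choice must be made in advance: only a filter parameterized to stay sparse (more bits, fewer hash functions than the storage-optimal setting) leaves entropy slack for the compressor to exploit.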
6. CONCLUSION
In this paper we presented novel methods for payload attribution. When incorporated into a network forensics system, they provide an efficient probabilistic query mechanism to answer queries for excerpts of a payload that passed through the network. Our methods allow data reduction ratios greater than 100:1 while maintaining a very low false positive rate. They support queries for very small excerpts of a payload, and also for excerpts that span multiple packets. The experimental results show that our methods represent a significant improvement in query accuracy and storage requirements compared to previous attribution techniques. More specifically, we found that winnowing is the best technique for block boundary selection in payload attribution applications; that shingling as a method for consecutiveness resolution is a clear win over the use of offset numbers; and that the use of multiple instances of a payload attribution method provides additional benefits, such as improved false positive rates and a data-aging capability. These techniques combined form a payload attribution method called Winnowing Multi-Hashing, which substantially outperforms previous methods. The experimental results also confirm that, in general, the accuracy of attribution increases with the length and the specificity of a query. Moreover, privacy and simple access control are achieved by the use of Bloom filters and one-way hashing with a secret key: even if the system is compromised, no raw traffic data is ever exposed, and querying the system is possible only with knowledge of the secret key. We believe that these methods also have a much broader range of applicability in various areas where large amounts of data are processed.

REFERENCES

Anderson, E. and Arlitt, M. 2006. Full Packet Capture and Offline Analysis on 1 and 10 Gb Networks. Technical Report HPL-2006-156.
Bellovin, S. and Cheswick, W. 2004. Privacy-enhanced searches using encrypted Bloom filters. Cryptology ePrint Archive, Report 2004/022. Available at http://eprint.iacr.org/.
Bloom, B. 1970. Space/time tradeoffs in hash coding with allowable errors. Communications of the ACM (CACM), 422–426.
Broder, A. 1993. Some applications of Rabin's fingerprinting method. In Sequences II: Methods in Communications, Security, and Computer Science. Springer-Verlag, 143–152.
Broder, A. 1997. On the resemblance and containment of documents. In Proceedings of Compression and Complexity of Sequences.
Broder, A. and Mitzenmacher, M. 2002. Network Applications of Bloom Filters: A Survey. In Annual Allerton Conference on Communication, Control, and Computing. Urbana-Champaign, Illinois, USA.
Cho, C. Y., Lee, S. Y., Tan, C. P., and Tan, Y. T. 2006. Network forensics on packet fingerprints. In 21st IFIP Information Security Conference (SEC 2006). Karlstad, Sweden.
Garfinkel, S. 2002. Network forensics: Tapping the Internet. O'Reilly Network.
Gu, G., Porras, P., Yegneswaran, V., Fong, M., and Lee, W. 2007. BotHunter: Detecting Malware Infection Through IDS-Driven Dialog Correlation. In Proceedings of the 16th USENIX Security Symposium. 167–182.
Handley, M., Kreibich, C., and Paxson, V. 2001. Network Intrusion Detection: Evasion, Traffic Normalization, and End-to-End Protocol Semantics. In Proceedings of the USENIX Security Symposium. Washington, USA.
King, N. and Weiss, E. 2002. Network Forensics Analysis Tools (NFATs) reveal insecurities, turn sysadmins into system detectives. Information Security. Available at www.infosecuritymag.com/2002/feb/cover.shtml.
Manber, U. 1994. Finding similar files in a large file system. In Proceedings of the USENIX Winter 1994 Technical Conference. San Francisco, CA, USA, 1–10.
Mitzenmacher, M. 2002. Compressed Bloom Filters. IEEE/ACM Transactions on Networking (TON) 10, 5, 604–612.
NTOP. 2008. PF_RING Linux kernel patch. Available at http://www.ntop.org/PF_RING.html.
Ponec, M., Giura, P., Brönnimann, H., and Wein, J. 2007. Highly Efficient Techniques for Network Forensics. In Proceedings of the 14th ACM Conference on Computer and Communications Security. Alexandria, Virginia, USA, 150–160.
Rabin, M. O. 1981. Fingerprinting by random polynomials. Technical Report 15-81, Harvard University.
Rhea, S., Liang, K., and Brewer, E. 2003. Value-based web caching. In Proceedings of the Twelfth International World Wide Web Conference.
Richardson, R. and Peters, S. 2007. 2007 CSI Computer Crime and Security Survey Shows Average Cyber-Losses Jumping After Five-Year Decline. CSI Press Release. Available at http://www.gocsi.com/press/20070913.jhtml.
Schleimer, S., Wilkerson, D. S., and Aiken, A. 2003. Winnowing: local algorithms for document fingerprinting. In SIGMOD '03: Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data. ACM Press, New York, NY, USA, 76–85.
Shanmugasundaram, K., Brönnimann, H., and Memon, N. 2004. Payload Attribution via Hierarchical Bloom Filters. In Proc. of ACM CCS.
Shanmugasundaram, K., Memon, N., Savant, A., and Brönnimann, H. 2003. ForNet: A Distributed Forensics Network. In Proc. of MMM-ACNS Workshop. 1–16.
Snoeren, A. C., Partridge, C., Sanchez, L. A., Jones, C. E., Tchakountio, F., Kent, S. T., and Strayer, W. T. 2001. Hash-based IP traceback. In ACM SIGCOMM. San Diego, California, USA.
Staniford-Chen, S. and Heberlein, L. 1995. Holding intruders accountable on the Internet. In Proceedings of the 1995 IEEE Symposium on Security and Privacy. Oakland, CA, USA.
Received February 2008; revised July 2008; accepted November 2008
ACM Journal Name, Vol. V, No. N, Month 20YY.