Scalable Packet Digesting Schemes for IP Traceback
Tsern-Huei Lee, Department of Communication Engineering, NCTU. Email: [email protected]
Wei-Kai Wu, Computer Science and Information Engineering, NCTU. Email: [email protected]
Tze-Yau William Huang, Department of Communication Engineering, NCTU. Email: [email protected]
Abstract - Identifying the sources of an attack is an important task in the Internet security area. An attack could consist of a large number of packet streams generated by many compromised slaves that consume resources associated with various network elements to deny normal services or a few offending packets to disable a system. Several techniques based on probabilistic samples of transit packets have been developed to determine the sources of large packet flows. It seems that logging of packet digests is necessary for traceback of an individual packet. A clever technique based on Bloom filters has recently been proposed to generate the audit trails for each individual packet within the network. The scheme is effective. However, the storage requirement is approximately 0.5% of the link capacity, which becomes a problem as link capacity increases. In this paper, we propose packet digesting schemes for flows and sets of packets sharing the same source and destination addresses. Compared with the individual packet digesting scheme, these schemes can achieve similar goals and are much more scalable. Simulations with real Internet traffic show that the storage requirements of our proposed schemes are one to two orders of magnitude lower.∗
I. INTRODUCTION
The impact of network attacks is getting more and more significant as the Internet becomes ubiquitous. A widely reported attack, known as distributed denial-of-service (DDoS), is one that uses a number of compromised slaves to generate large packet flows to consume resources of various network elements so that normal services are seriously degraded or totally denied [1]. Another common attack is for the attacker to generate a few well-targeted packets to disable a system [2][3]. Identifying the sources of offending packets is therefore an important task to make the attackers accountable.
∗ This work was supported by the Ministry of the Education, Taiwan, Republic of China, under contract 89-E-FA04-1-4
IEEE Communications Society
However, the anonymous nature of the IP protocol makes it difficult to identify the true attacker because the source IP address of the attack packets can be easily forged. The
IP traceback problem concerns tracing spoofed packets to identify the machines that directly generate the attack packets. Since every packet contains its own source address, eliminating IP spoofing is the simplest approach to IP traceback. One possible technique is ingress filtering, defined in RFC 2827 [4]. The fundamental idea of ingress filtering is to block all packets carrying invalid source IP addresses at network edges. The problem with ingress filtering is that all edge networks have to implement the scheme to make it work, which is unlikely to happen in the near future. Therefore, other techniques that allow incremental deployment are necessary for IP traceback.
To cope with DDoS attacks, several techniques send probabilistic samples of the identities of auditing routers on a flow's path to the destination, so that the victim can reconstruct the attack path once a sufficient number of packets has been collected from the flow. The ICMP traceback message [5][6] has been proposed as an out-of-band carrier of the sample information. Probabilistic packet marking [7][8][9], on the other hand, uses IP header bits in randomly selected packets to carry the information in-band. The sampling nature of these approaches limits their application to path identification for flood-based attacks.
For attacks that require only one or a few packets, most attack path identification schemes require storage of audit trails on the routers. Storing plain traffic logs on the routers is prohibitive in terms of memory requirement. Selective logging [10] reduces the memory requirement by tracking only packets of commonly abused protocols, but it is nearly impossible to profile suspicious packets for all potential victims without including a large portion of the packets passing through the network. To save memory, hash-based IP traceback exploits hashing techniques to record the passage of individual packets through each auditing router [11][12].
0-7803-8533-0/04/$20.00 (c) 2004 IEEE
The passage of a set of packets is recorded by storing the corresponding packet digests to a digest table. A specific packet is determined, with a controlled false positive rate (FPR), to be a member of the set if its packet digest maps to an existing pattern stored in the digest table. Specifically, the Source Path Isolation Engine (SPIE) proposed in [11][12] employs the space-efficient Bloom filter that maps data to be tested through multiple hash functions
into a single array of bits [13]. The FPR is controlled by allowing an individual digest table to store only a limited number of digest sets [14]. For a given FPR bound, the memory required for recording the passage of packets per unit time is proportional to the link capacity per unit time. Motivated by the desire to trace the attack path of an individual packet, SPIE takes the first 24 invariant bytes of every packet passing through each router as the digest input. It was shown that the memory requirement is roughly 0.5% of the link capacity. As illustrated in [11], a core router with a maximum capacity of about 320 Mpkts/sec would require roughly 1.6 Gb/sec of digest table memory, which increases the cost significantly. Besides, recording packets at their arrival rate requires technology that is currently not scalable. This places limits on the digest table size and the time period covered by each digest table. Consequently, SPIE needs to examine more digest tables to cover a period long enough to offset the timing uncertainties. This increases the complexity of implementation and reduces the reliability of results.
To decrease the memory requirement of hash-based IP traceback, we propose digesting packet aggregation units instead of individual packets. Flow and source-destination set are the two typical units of packet aggregation. A flow is identified by a five-tuple comprising the protocol field and the source and destination addresses in the IP header, together with the first four IP payload bytes (the source and destination port numbers for TCP/UDP). We define a source-destination set as the set of all packets carrying the same source and destination IP addresses. Flow digesting and source-destination set digesting provide useful capabilities with much smaller memory requirements than individual packet digesting. We elaborate on the reduction in memory requirement obtained by digesting aggregations of packets in the next section.
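The three digesting granularities described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `Packet` record, the key functions, the `DigestTable` class, and the use of SHA-256 as a stand-in hash family are hypothetical and are not the SPIE implementation, and `packet_key` only approximates the 24-byte invariant signature.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    """Hypothetical minimal packet record; field names are illustrative."""
    src: str
    dst: str
    proto: int
    payload: bytes  # IP payload; first four bytes are the ports for TCP/UDP


def sd_key(p: Packet) -> bytes:
    # Source-destination set signature: the address pair only.
    return f"{p.src}|{p.dst}".encode()


def flow_key(p: Packet) -> bytes:
    # Flow signature: protocol, addresses, first four payload bytes.
    return f"{p.proto}|{p.src}|{p.dst}|".encode() + p.payload[:4]


def packet_key(p: Packet) -> bytes:
    # Simplified stand-in for the individual packet signature.
    return flow_key(p) + p.payload[4:12]


class DigestTable:
    """Bloom filter: k hash functions mapped into one array of m bits."""

    def __init__(self, m_bits: int, k_hashes: int):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, key: bytes):
        # Derive k bit positions by salting SHA-256 with the hash index.
        for i in range(self.k):
            h = hashlib.sha256(bytes([i]) + key).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, key: bytes) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key: bytes) -> bool:
        # True for every added key; may also be True for others (false positive).
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


# Two packets of the same TCP flow (ports 1234 -> 80) with different data.
ports = bytes([0x04, 0xD2, 0x00, 0x50])
a = Packet("10.0.0.1", "10.0.0.9", 6, ports + b"hello-01")
b = Packet("10.0.0.1", "10.0.0.9", 6, ports + b"hello-02")

flow_table = DigestTable(m_bits=1024, k_hashes=3)
flow_table.add(flow_key(a))  # one insertion covers the whole flow
```

In this sketch a single insertion into the flow table answers membership queries for every packet of the flow, which is the source of the memory saving; the packet-level keys of `a` and `b` would each need their own entry.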
In Sections III and IV, we explore the capabilities of hash-based flow traceback and source-destination set traceback, respectively. We discuss the issues of deployment and vulnerabilities in Section V before concluding in Section VI.
II. MEMORY REDUCTION
In hash-based IP traceback, it is important to reduce the memory requirement so that the audit trail can be stored for a longer period of time on high-end routers. With a Bloom filter, the digest table size is proportional to the number of unique members that can be recorded for a given FPR bound. The relationship was derived in [11] as
$$m = 1.44\, n \cdot \log_2\left(\frac{1}{P}\right)$$
where m is the digest table size, n represents the number of members digested, and P denotes the false positive rate. The above equation shows that the required digest table size per unit time is proportional to the number of members to be digested per unit time. Let us define m/n as the memory scaling factor (MSF) of a digesting scheme. Note that the MSF is the average number of bits used to digest a signature and, given an FPR bound, its value should be as small as possible.
The types of digested members we are interested in are packet, flow, and source-destination set. When we observe n_p individual packets, n_f separate flows, and n_sd distinct source-destination sets through a router in a particular period of time, it is certain that n_p ≥ n_f ≥ n_sd. Further, the memory scaling factors for individual packet digesting, flow digesting, and source-destination set digesting are, respectively,
$$\mathrm{MSF}_p = a(P) = 1.44 \cdot \log_2\left(\frac{1}{P}\right)$$
$$\mathrm{MSF}_f = a(P) \cdot \frac{n_f}{n_p}$$
$$\mathrm{MSF}_{sd} = a(P) \cdot \frac{n_{sd}}{n_p}$$
Taking MSF_p/MSF_f as the storage time gain of flow digesting over individual packet digesting, the gain equals n_p/n_f, which is the average flow length. The average flow length observed on one peering link in 2001 is 7.75 [15]. As a consequence, switching to flow digesting can potentially offer a storage time gain close to one order of magnitude. Similarly, the storage time gain of source-destination set digesting is n_p/n_sd. We used Cisco NetFlow [16] to collect 30 billion packets passing through a campus router over a period of 70 days. As shown in Fig. 1, the n_p/n_sd gain can reach as high as 80. For a time period of 1 second, which we think is a reasonable duration for one digest table on a high-end router, the gain is about 30.
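The sizing relation and the memory scaling factors of this section can be evaluated numerically with a short Python sketch. The function names are hypothetical, the FPR bound P = 0.125 is chosen only because it makes the arithmetic easy, and the packet counts are invented so that the ratios reproduce the average flow length of 7.75 and the one-second gain of about 30 quoted in the text.

```python
import math


def a(P: float) -> float:
    """Bits per digested member for a Bloom filter with FPR bound P."""
    return 1.44 * math.log2(1.0 / P)


def digest_table_bits(n: int, P: float) -> float:
    """Digest table size m = 1.44 * n * log2(1/P)."""
    return a(P) * n


def msf(n_members: int, n_packets: int, P: float) -> float:
    """Memory scaling factor: table bits per observed packet."""
    return digest_table_bits(n_members, P) / n_packets


P = 0.125      # assumed FPR bound, for illustration only
n_p = 23250    # hypothetical packets observed in one interval
n_f = 3000     # hypothetical flows, so n_p / n_f = 7.75
n_sd = 775     # hypothetical source-destination sets, so n_p / n_sd = 30

msf_p = msf(n_p, n_p, P)    # individual packet digesting
msf_f = msf(n_f, n_p, P)    # flow digesting
msf_sd = msf(n_sd, n_p, P)  # source-destination set digesting

flow_gain = msf_p / msf_f   # storage time gain n_p / n_f
sd_gain = msf_p / msf_sd    # storage time gain n_p / n_sd
```

Because a(P) cancels in the ratios, the storage time gains depend only on the traffic mix and not on the chosen FPR bound.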
[Figure 1 plot: vertical axis n_p/n_sd, ticks from 30 to 90; horizontal axis Time Scale, ticks from 0 to 100.]
Figure 1. The n_p/n_sd aggregation gain as a function of time.
III. FLOW TRACEBACK
Flow traceback applies flow digesting in hash-based IP traceback. It can attain similar goals accomplished by individual packet traceback and identify the ingress points to
the traceback-enabled network or one or more compromised routers within the enabled network. To spoof an attack packet with the flow characteristics of a legitimate concurrent flow, the attacker must know or correctly guess the port numbers used by a concurrent legitimate transport in the cases of TCP and UDP. This usually means that the attacker can eavesdrop on some legitimate traffic towards the victim and spoof the individual packet signature of a legitimate concurrent packet. Flow traceback is equally effective in isolating the path of an attack packet or the paths associated with a packet in a distributed attack. In the following, we analyze the capability of flow traceback under the assumption that there is no false positive node caused by hash collision. Two cases, i.e., external attackers and internal attackers, are considered separately.
A. External Attackers
A traceback scheme should be able to identify the ingress points of an attack when the attacker is outside the traceback-enabled network. Let A(s, v) be the set of all routers passed through by any packet that arrived at a victim v bearing the source IP address s. Let A_p(s, v) and A_f(s, v) be the subsets of A(s, v) passed through by packets bearing the individual packet signature and the flow signature of the attack packet, respectively. Let E_p(s, v) and E_f(s, v) be the sets of ingress points to the traceback-enabled network derived from A_p(s, v) and A_f(s, v), respectively. E_p(s, v) is a subset of E_f(s, v). Let E_R(s, v) be the set of all ingress points to the traceback-enabled network of legitimate routes from s to v.
• A: {R1, R2, R5, R6, R7, R8, R9, R10, R11}
• A_f: {R1, R5, R7, R8, R10, R11}
• A_p: {R1, R5, R8, R11}
Figure 2. Illustration of the ingress point of external attacker sets derived from different digesting schemes. The victim v receives an attack packet with the apparent source address of s. A rectangular packet symbol is placed next to a router for each packet with the address pair s and v passing through that router. A white triangle appears inside the packet symbol if the packet produces the flow signature of the attack packet. The triangle turns black, however, if that packet also produces the individual packet signature of the attack packet.
We compare the source path isolation capabilities of flow traceback and individual packet traceback. We can deduce E_p(s, v) from individual packet digesting and E_f(s, v) from flow digesting, but neither digesting scheme can produce the set the other produces unless the two sets are identical. If E_R(s, v) and E_f(s, v) are disjoint, every point in E_f(s, v) should be considered a hole in the defense against the particular attack associated with the attack packet. The difference between E_f(s, v) and E_p(s, v), if not null, can be caused by multi-path routing or by multiple concurrent sources of packets with the same attack flow signature. Identifying additional sources of attack with a single traceback is an advantage. If we can assume that there is only one source of the attack flow, then the larger the E_f(s, v) set is, the smaller the number of routers that can forward a packet to v through all members of E_f(s, v). Given knowledge of the routing topology outside the traceback-enabled network, flow digesting may provide more information about the attack source region than individual packet digesting in the presence of multi-path routing.
We now consider the complementary situation where E_R(s, v) intersects E_f(s, v). If E_R(s, v) intersects E_p(s, v), which is a subset of E_f(s, v), then the source of the attack may not be spoofed, or the attacker may reside along the legitimate paths from s to some router in E_R(s, v) (i.e., flow amplification). Both E_f(s, v) and E_p(s, v) identify potential ingress points of an external attacker. If E_R(s, v) intersects E_f(s, v) but not E_p(s, v), the worst case is that the attacker can spoof the flow signature but cannot, or chooses not to, spoof the individual packet signature. This kind of attack is difficult to achieve if the attacker has no control of any agent along the legitimate paths from s to some router in E_R(s, v).
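The case analysis above can be summarized as a small Python sketch over plain sets. The function name, the case labels, and the ingress router names G1 to G4 are all hypothetical and serve only to make the three situations concrete; the sets themselves would come from traceback queries and routing knowledge.

```python
from typing import Dict, Set


def compare_ingress_sets(E_p: Set[str], E_f: Set[str], E_R: Set[str]) -> Dict:
    """Classify a traceback result following the external-attacker analysis."""
    if not E_p <= E_f:
        raise ValueError("E_p(s, v) must be a subset of E_f(s, v)")
    if not (E_f & E_R):
        # No ingress point lies on a legitimate route: all of E_f are holes.
        case = "disjoint"
    elif E_p & E_R:
        # Source may be genuine, or the attacker sits on a legitimate path.
        case = "packet-signature-on-legitimate-route"
    else:
        # Flow signature matches on a legitimate route, packet signature does not.
        case = "flow-signature-only-on-legitimate-route"
    return {
        "case": case,
        # Ingress points revealed by flow traceback but not packet traceback.
        "extra_from_flow": E_f - E_p,
        # Ingress points that are not on any legitimate route from s to v.
        "off_legitimate_route": E_f - E_R,
    }


E_R = {"G1"}  # hypothetical legitimate ingress point for source s
r1 = compare_ingress_sets({"G3"}, {"G3", "G4"}, E_R)
r2 = compare_ingress_sets({"G1"}, {"G1", "G3"}, E_R)
r3 = compare_ingress_sets({"G3"}, {"G1", "G3"}, E_R)
```

The `off_legitimate_route` entry corresponds to the ingress points of obviously spoofed sources, which the discussion treats as the primary threat whenever that difference is not empty.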
When we have the knowledge of E_R(s, v), we can choose not to implicate any member of E_R(s, v) as the ingress point of the particular attack packet if the difference between E_f(s, v) and E_R(s, v) is not null. We should treat the ingress points of obviously spoofed sources as the primary threat. Overall, flow traceback can isolate the ingress points of packets in the flow of the attack packet with a spoofed source address and can sometimes reveal more suspicious ingress
points than individual packet traceback. With a less elaborate signature, flow traceback does broaden the attack packet space from which attack packets can be drawn to mimic the members of a concurrent flow from the real source to the victim. It is not obvious that attackers can take advantage of this difference between flow traceback and individual packet traceback, but prior knowledge of routing and topology can help reduce the ambiguity arising from this gap.
B. Internal Attackers
A main reason for SPIE [11] to emphasize attack path isolation is that the scheme assumes that some router may be subverted and that the attack packet may originate from within the traceback-enabled network. Reconstructing the attack path helps to isolate those compromised routers.
• O_f: {R6}
• O_p: {R5}
• O_R: {R6}
Figure 3. Illustration of the origin of internal attacker sets in the graphical convention of Fig. 2. The attacker a may represent a compromised routine within router R5. Notice that A_p = {R1, R2} is a proper subset of A_f = {R1, R2, R5, R6}, yet the origin sets O_p and O_f do not intersect.
We define O_p(s, v) and O_f(s, v) as the sets of traceback "dead ends" derived from A_p(s, v) and A_f(s, v), respectively. Each member of O_p(s, v) is a member of A_p(s, v) inside a traceback-enabled network that is not the next-hop router of any other member of A_p(s, v). Members of O_f(s, v) are similarly defined. The set O_R(s, v) contains all routers inside a traceback-enabled network serving as local gateways for s. It is possible that some member of O_p(s, v) is not a member of O_f(s, v). Such a router is then covered up by the forwarding path from a member of E_f(s, v), O_f(s, v) or O_R(s, v) to v. Such compromised routers usually have the opportunity to camouflage their attacks against v through flow amplification against individual packet traceback. Besides those routers, the set of suspect routers implicated by the difference between
O_f(s, v) and O_R(s, v) should be a superset of those implicated by the difference between O_p(s, v) and O_R(s, v). Flow traceback can sometimes reveal more or fewer subverted routers than individual packet traceback, depending on routing and attacker capabilities.
IV. SOURCE-DESTINATION SET TRACEBACK
Source-destination set traceback applies source-destination set digesting in hash-based IP traceback. It is useful in identifying the sources of all packets sent to the victim using a particular source address. Let A(s, v) denote the set of routers passed through by packets with the source-destination set signature of the attack packet. We define E(s, v) as the subset of A(s, v) constituting the set of ingress points to the traceback-enabled network. Both E_p(s, v) and E_f(s, v) defined in the last section are subsets of E(s, v). When E(s, v) is not a subset of E_R(s, v), we have to assume that the attack packet comes through a router in E(s, v) but not in E_R(s, v). That is, given the very loose traceback criteria, we can only implicate address spoofing sources if they are present. However, if a distributed attack uses the same spoofed source address, we will learn all the ingress points with a single traceback. When E(s, v) is a subset of E_R(s, v), we can then assume that the attack packet did come through E_R(s, v). This information is useful in determining whether a reflector attack packet was reflected at the apparent source.
For attack packets originating from within the traceback-enabled network, source-destination set traceback offers the same source path isolation capabilities as flow traceback. The collection of traceback "dead ends", O(s, v), is a superset of O_f(s, v). Those additional members, if present, may camouflage more subverted routers on the legitimate path but reveal other subverted routers with one traceback.
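The "dead end" sets used for internal attackers follow directly from their definition: a member of the traced router set that is not the next hop of any other member. A minimal Python sketch, assuming a hypothetical linear topology Rd -> Rc -> Rb -> Ra -> v whose router names and next-hop map are invented for illustration and are not taken from Fig. 3:

```python
from typing import Dict, Set


def dead_ends(A: Set[str], next_hops: Dict[str, Set[str]]) -> Set[str]:
    """Members of A that are not the next-hop router of any other member of A."""
    return {
        r for r in A
        if not any(r in next_hops.get(u, set()) for u in A if u != r)
    }


# Hypothetical forwarding relation towards the victim v.
next_hops = {"Rd": {"Rc"}, "Rc": {"Rb"}, "Rb": {"Ra"}, "Ra": set()}

A_f = {"Ra", "Rb", "Rc", "Rd"}  # routers that saw the flow signature
A_p = {"Ra", "Rb", "Rc"}        # routers that saw the packet signature

O_f = dead_ends(A_f, next_hops)
O_p = dead_ends(A_p, next_hops)
```

In this example the packet-level dead end Rc does not appear in the flow-level result, which illustrates how a coarser signature can cover up one router behind the forwarding path of another.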
Source-destination set traceback extends the coverage of address spoofing sources in both space and time at the cost of the ability to implicate the source of a specific attack packet. Strictly speaking, however, the traceback can still be completed with a copy of a single IP packet.
V. DISCUSSION
A. Multi-path routing
Multi-path routing occurs when packets are forwarded from a source s to a destination v through different sets of routers. It may be caused by load balancing or route changes. When we digest the passage of packet aggregation units such as flows and source-destination sets, our traceback follows the ensemble of paths traversed by all packets belonging to an aggregation unit during a period of time. The impact of multi-path routing on aggregated IP traceback is reduced if the time resolution is increased. While it is possible to reduce the digest duration of each digest table by reducing the digest table size, the timing uncertainties across
the network prevent us from taking advantage of finer time resolution in storage. In many cases, however, routes in the networks traversed by the attack flow are static enough for the path ensemble of a physical flow to converge into a single path. Some vendors boast features that maintain a consistent route for a flow going through a load-balancing network to prevent packet reordering. One such example is the application of Cisco's NetFlow [16] packet filtering to load balancing forwarding. We expect this to be a trend as more real-time interactive applications are transported across the Internet.
B. False positives
Besides weakening the ability of aggregated IP traceback to implicate the path of a particular attack packet, multi-path routing can increase the false positives. Ensembles of attack paths may contain more nodes than the attack path of an individual packet. That usually means an increase in the number of neighbors one hop upstream of any node. Each neighbor contributes to the probability of becoming a spurious node in the traceback graph through a false positive in its Bloom filter query. According to [11], the maximum number of additional spurious nodes in the resulting attack graph is given as
$$n \cdot \frac{dP}{1 - dP}$$
where n is the number of nodes that actually see a packet with the digest signature, d represents the maximum degree of neighbors among all nodes, and P denotes the FPR of a Bloom filter. If P is small enough, the maximum number of extra nodes approximates ndP. To keep this value constant, a k-fold increase in n requires P to be divided by k. Applying P/k to the multiplier function in the MSF_p formula, we get
$$a\left(\frac{P}{k}\right) = 1.44 \cdot \log_2\left(\frac{k}{P}\right) = a(P) + 1.44 \cdot \log_2(k)$$
For a reasonably small P, only a small increase in the memory scaling factor is required to compensate for a doubling in attack graph nodes. The additional false positives that result from digesting packet aggregation units under multi-path routing are, therefore, very manageable.
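To make the size of this compensation concrete, the bound on spurious nodes and the resulting increase in the memory scaling factor can be computed with a short Python sketch. The function names and the values n = 10, d = 4 and P = 0.01 are assumptions chosen for illustration and do not come from the measurements in this paper.

```python
import math


def a(P: float) -> float:
    """Bloom filter bits per member for FPR bound P: 1.44 * log2(1/P)."""
    return 1.44 * math.log2(1.0 / P)


def max_spurious_nodes(n: int, d: int, P: float) -> float:
    """Upper bound n * dP / (1 - dP) on spurious nodes in the attack graph."""
    if d * P >= 1.0:
        raise ValueError("bound requires d * P < 1")
    return n * d * P / (1.0 - d * P)


n, d, P = 10, 4, 0.01  # hypothetical attack graph and FPR
k = 2                  # the path ensemble doubles the number of nodes

before = max_spurious_nodes(n, d, P)           # about n * d * P = 0.4
after = max_spurious_nodes(k * n, d, P / k)    # n scaled by k, P divided by k
extra_bits = a(P / k) - a(P)                   # equals 1.44 * log2(k)
```

With these assumed values, doubling the number of nodes while halving P keeps the bound near 0.4 spurious nodes and costs 1.44 additional bits per digested member.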
C. Transformation processing
Both flow traceback and source-destination set traceback can use the transformation processing schemes proposed for individual packet traceback in [11], as long as the traceback query contains the same 24-byte individual packet signature.
1) Transformation lookup with flow signature: To take advantage of packet aggregation, we may opt to include only the five-tuple flow signature in a traceback query. The Transformation Lookup Table (TLT) [11] should then be indexed on a flow digest instead. This TLT should handle most of the transformations supported in individual packet traceback.
2) Transformation lookup with source and destination addresses: For source-destination set traceback, the prospect of including only source and destination addresses in the traceback query is very attractive. Legal and societal concerns about privacy often outweigh the technical merits for security [17]. We can build a TLT indexed on a digest of source and destination addresses and log all source-destination pair candidates that have been transformed into the pair being queried. The resulting attack graph may be scattered too widely to be useful, as in the case of NAT. Alternatively, we can keep a TLT indexed on the individual packet or flow signature but retain another Bloom filter on the source-destination set that requires additional query information. The additional traceback signature required for transformation can then be obtained through a call-back request to the victim.
3) Fragmentation: Fragmentation is not considered a transformation for source-destination set traceback. Flow traceback, on the other hand, has a complicated issue with fragmentation. When a packet is fragmented, the first four bytes of the IP payload of each fragment other than the first become arbitrary. Only the first fragment carries the signature of the original flow. Digesting the rest of the fragments incurs an additional risk of colliding with the signatures of other flows. We can consider the flow traceback accomplished with the traceback of the first fragment and exclude the remaining fragments from being digested. Alternatively, we can implement a special five-tuple encoding scheme for fragments that are not the first fragment. This encoding scheme should transform the flow signature to minimize signature collisions and facilitate the traceback of fragments at least to the point of fragmentation.
D. Vulnerabilities
Aggregated IP traceback shares most of the vulnerabilities of the individual packet traceback proposed in [11].
The extended digest storage time, however, strengthens the aggregated version against DDoS attacks with respect to the traceback query communication. The option of making a traceback query with a smaller set of attack packet information also reduces concerns about information leakage. With respect to the threat of flow amplification, the aggregated versions have the same risk as the individual packet version. We do, however, consider the situation where a router is subverted to the point of being able to alter one passing-by packet that completes the attack. For example, if an FTP control packet can be altered to trigger a buffer overflow on some type of server, a subverted router can perform the attack in-path without being traced if an FTP control packet happens to pass through it on the way to the victim. None of the hash-based IP traceback schemes we have discussed will be able to isolate the attacker. Source-destination set traceback, however, has the additional vulnerability that the subverted
router can alter any passing-by packet destined for the victim into an FTP control packet without being detected.
E. Hybrid Deployment
We have discussed the merits and issues of flow and source-destination set traceback schemes as homogeneous systems. Hash-based IP traceback, however, can be deployed with different digesting schemes across the traceback-enabled network. Individual packet traceback can be deployed on user gateways and ISP edge routers to isolate the attacker with higher confidence. Flow traceback can be deployed on ISP peering ingress links for balanced memory and source isolation performance. Source-destination set traceback can be deployed on ISP core routers to accommodate the memory requirement under high link capacity. By drawing on the strengths of different digest granularities, we obtain a scalable IP traceback architecture that is more suitable for deployment throughout the diverse landscape of the Internet.
VI. CONCLUSION
In this paper, we have investigated two intuitive ways of aggregating the packets traced by the scheme proposed in [11] in order to extend the length of time over which traceback queries can be served. Flow traceback and source-destination set traceback offer useful source path isolation capabilities with lower memory scaling factors. The attack graphs resulting from these aggregated IP traceback schemes sometimes offer information not available from individual packet traceback. Individual packet traceback may not always be desirable, and the hybrid deployment of these schemes with different traceback granularities constitutes a scalable IP traceback architecture. Through our investigation, we found it difficult to come up with a set of criteria that puts the different traceback granularities in a simple order. For good reasons, IP traceback can only provide partial information about the threats towards a victim over the Internet.
The value of the partial information obtained through a traceback scheme depends on what is to be done with the knowledge gained. Our investigation remains inconclusive until we evaluate the traceback schemes in the context of a security system complete with detection and filtering operations in different parts of a network.
REFERENCES
[1] V. Paxson, “An analysis of using reflectors for distributed denial-of-service attacks,” ACM Comp. Comm. Review, vol. 31, no. 3, 2001.
[2] M. Corporation, “Stop 0A in tcpip.sys when receiving out of band (OOB) data,” http://support.microsoft.com/support/kb/articles/Q143/4/78.asp.
[3] R. Sekar, Y. Guang, S. Verma, and T. Shanbhag, “A High-Performance Network Intrusion Detection System,” in Proceedings of the 6th ACM Conference on Computer and Communications Security, 1999, pp. 8–17.
[4] P. Ferguson and D. Senie, “Network ingress filtering: Defeating denial of service attacks which employ IP source address spoofing,” RFC 2827, IETF, May 2000.
[5] S. M. Bellovin, “ICMP Traceback Messages,” IETF Draft, 2000, http://www.research.att.com/smb/papers/draft-bellovin-itrace-00.txt.
[6] A. Mankin, D. Massey, C.-L. Wu, S. F. Wu, and L. Zhang, “On Design and Evaluation of “Intention-Driven” ICMP Traceback,” in IEEE International Conference on Computer Communications and Networks, October 2001.
[7] H. Burch and B. Cheswick, “Tracing anonymous packets to their approximate source,” in Proc. USENIX LISA ’00, December 2000.
[8] S. Savage, D. Wetherall, A. Karlin, and T. Anderson, “Network Support for IP Traceback,” IEEE/ACM Transactions on Networking, vol. 9, no. 3, pp. 226–237, June 2001.
[9] D. X. Song and A. Perrig, “Advanced and Authenticated Marking Schemes for IP Traceback Messages,” in Proc. IEEE Infocom ’01, April 2001.
[10] G. Sager, “Security Fun with OCxmon and Cflowd,” Internet 2 Working Group Meeting, November 1998, http://www.caida.org/projects/NGI/content/security/1198.
[11] A. C. Snoeren, C. Partridge, L. A. Sanchez, C. E. Jones, F. Tchakountio, B. Schwartz, S. T. Kent, and W. T. Strayer, “Single-Packet IP Traceback,” IEEE/ACM Transactions on Networking, vol. 10, no. 6, pp. 721–734, December 2002.
[12] A. C. Snoeren, C. Partridge, L. A. Sanchez, C. E. Jones, F. Tchakountio, B. Schwartz, S. T. Kent, and W. T. Strayer, “Hash-Based IP Traceback,” in ACM SIGCOMM ’01, August 2001. [Online]. Available: http://nms.lcs.mit.edu/ snoeren/papers/spieton.html
[13] B. H. Bloom, “Space/Time Trade-Offs in Hash Coding with Allowable Errors,” Communications of the ACM, vol. 13, no. 7, pp. 422–426, July 1970. [Online]. Available: http://www.mit.edu/ jli/afbv/
[14] L. Fan, P. Cao, J. Almeida, and A. Z. Broder, “Summary Cache: a Scalable Wide-Area Web Cache Sharing Protocol,” IEEE/ACM Transactions on Networking, vol. 8, no. 3, pp. 281–293, 2000.
[15] N. Duffield, C. Lund, and M. Thorup, “Properties and Prediction of Flow Statistics from Sampled Packet Streams,” in Proc. Internet Measurement Workshop, November 2002.
[16] CISCO, “Cisco Netflow,” http://www.cisco.com/warp/public/732/netflow/index.html.
[17] S. Lee and C. Shields, “Challenges to Automated Attack Traceback,” IT Professional, vol. 4, no. 3, pp. 12–18, May-June 2002.