
IEEE COMMUNICATIONS LETTERS, VOL. 18, NO. 10, OCTOBER 2014


Efficient Flow Sampling With Back-Annotated Cuckoo Hashing

S. Pontarelli, P. Reviriego, and J. A. Maestro

Abstract—One of the applications of network traffic monitoring is to detect anomalies and security threats. Due to the huge number of packets that traverse networks, monitoring is typically implemented by sampling the traffic. Sampling can be done per packet or per flow. For flow sampling, the decision to select a flow can be purely random or based on some properties of the flows. In this latter case, each incoming packet has to be compared against the set of flows being monitored to determine if the packet belongs to any of those flows. This matching can be implemented using a content addressable memory (CAM) or hash based data structures. Among those, one option is Cuckoo hashing, which provides good memory utilization and a deterministic worst-case number of memory accesses. However, in the case of flow sampling, most packets will not belong to any of the flows being monitored. Therefore, all tables will be accessed and the worst-case number of accesses will be required, thus reducing throughput. In this letter, a technique to reduce the average number of accesses to search for items that are not stored in the Cuckoo hash is proposed and evaluated. The results show that the proposed scheme can significantly reduce the average number of accesses in a flow sampling application. This means that the technique can be used to increase the throughput substantially.

Index Terms—Traffic monitoring, flow sampling, Cuckoo hashing, intrusion detection, security.

I. INTRODUCTION

Traffic monitoring is used in networks to detect problems, for network planning, and for security, among other applications [1]. One example of the use of monitoring is to detect anomalies that may be related to security threats [2]. As the volume of traffic in modern networks is huge, monitoring is commonly implemented by sampling. The simplest way to implement sampling is to select a percentage of packets. Another option is to sample flows given by the 5-tuple of source and destination addresses, ports and protocol. In that case, the decision to sample is taken per flow, so that when a flow is selected all the packets that belong to it are monitored. The use of flow monitoring can be more effective in detecting anomalies [3] and can also facilitate coordinated sampling at the network level [4]. The decision of which flows to sample can be pseudo-random or based on some features of the flows [2],

Manuscript received April 14, 2014; revised July 17, 2014; accepted August 6, 2014. Date of publication August 15, 2014; date of current version October 8, 2014. This work was supported by the Spanish Ministry of Science and Education under Grant AYA2009-13300-C03. This brief is part of a collaboration in the framework of COST ICT Action 1103 Manufacturable and Dependable Multicore Architectures at Nanoscale. The associate editor coordinating the review of this paper and approving it for publication was Dr. A. Rabbachin. S. Pontarelli is with Consorzio Nazionale Interuniversitario per le Telecomunicazioni (CNIT), Rome 00133, Italy (e-mail: [email protected]). P. Reviriego and J. A. Maestro are with Universidad Antonio de Nebrija, Madrid 28040, Spain (e-mail: [email protected]; [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/LCOMM.2014.2347959

[5], [6]. For example, sampling small or large flows has been suggested to improve the effectiveness in detecting anomalies. Pseudo-random sampling can be implemented by applying a hash function to the 5-tuple; the flows whose hash value falls in a given range are then sampled. The implementation of flow sampling based on flow features is more complicated. In this case, each incoming packet has to be matched against the set of flows that are being monitored. This can be done with a content addressable memory (CAM) [7] or with different hash structures [8]. The use of CAMs is restricted to specialized hardware implementations and, even in that case, typically incurs significant power consumption. On the other hand, hash based data structures are more generally applicable but commonly provide lower throughput.

Among the data structures used to perform matching, Cuckoo hashing is an attractive option as it enables good memory utilization and a constant worst-case number of memory accesses [9]. A constant worst case is interesting in packet processing applications as it minimizes jitter and provides a predictable worst-case performance. This worst case occurs when a match is found in the last hash table or when no match is found, as in both cases all tables are accessed. In a flow sampling application only a small fraction of the flows will be monitored. This means that most packets will not find a match and will incur the worst-case number of accesses. This reduces the throughput of Cuckoo hashing in a flow sampling application. Therefore, in this application, the throughput could be significantly increased if the average number of memory accesses for searches that do not find a match is reduced. Another situation in which this occurs is in hardware implementations for software defined networks, in which matching is performed in parallel by a hash table and a TCAM [10].
The hash table stores a set of single flows to be matched, while the TCAM checks against general rules. When no match is found in the hash tables, the system uses the TCAM output to define the action associated with the packet. In this letter, a technique to increase the throughput of Cuckoo hashing in flow sampling applications is presented. This is done by reducing the average number of memory accesses needed to determine that a given flow is not stored in a Cuckoo hash table. The proposed scheme introduces additional information in the first hash tables. This information is such that, for each item, it is known whether there are flows stored in the next hash tables that map to this particular item. When a hash table is accessed and there are no further flows in the next tables associated with this item, the matching operation can be stopped, avoiding accesses to the remaining tables. The proposed scheme has been evaluated by simulation using traffic traces. The results show that the throughput can be significantly increased compared to a traditional Cuckoo hash implementation. The rest of the letter is structured as follows: in Section II an overview of Cuckoo hashing is provided. In Section III, the proposed scheme is

1089-7798 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.


presented and in Section IV it is evaluated. The letter ends with the conclusions, which are summarized in Section V.

II. CUCKOO HASHING

There is a wide variety of hash based techniques for packet processing [8]. One of those is Cuckoo hashing, which uses several hash tables to store items and enables efficient match operations against the stored items. In more detail, Cuckoo hashing uses a set of d hash tables. A given item x can be placed in tables 1, 2, . . ., d in the positions determined by a set of hash functions: h1(x), h2(x), . . ., hd(x). The match, insertion and removal operations are defined as follows:
- Match: the first table (i = 1) is accessed and position h1(x) is read and compared with x. If the stored value is not equal to x, the second table is accessed and the comparison is repeated. If again no match is found in that table, the third table is accessed and the operation continues until either a match is found or all d tables have been accessed.
- Insertion: the first table (i = 1) is accessed and position h1(x) is read. If it is free, the new item is stored there. If not, the second table is accessed and the process is repeated, continuing until a table with a free position is found or all d tables have been accessed. In the latter case, a random table j is chosen and the new item x is stored in position hj(x). Then, the insertion operation is repeated for the entry y that was evicted when inserting x, excluding table j as an option for insertion. This process is recursive and items are moved to accommodate the new item if needed.
- Removal: the same process as a match, but when the item is found it is removed.

The main difference with other multiple hash table schemes is that Cuckoo hashing allows the movement of stored items during insertion to maximize memory utilization.
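As an illustration of the operations above, the following Python sketch models a d-table Cuckoo hash. It is not the implementation used in the letter; the hash functions, table size, and eviction bound are arbitrary choices for illustration.

```python
import random

class CuckooHash:
    """Minimal d-table Cuckoo hash; hash functions are illustrative only."""

    def __init__(self, d=3, size=16, max_kicks=32):
        self.d = d                      # number of hash tables
        self.size = size                # entries per table
        self.max_kicks = max_kicks      # bound on the eviction chain length
        self.tables = [[None] * size for _ in range(d)]

    def _h(self, i, x):
        # One hash function per table (a real design uses independent ones).
        return hash((i, x)) % self.size

    def match(self, x):
        # Probe tables 1..d in order; a miss always costs d accesses.
        return any(self.tables[i][self._h(i, x)] == x for i in range(self.d))

    def insert(self, x):
        # First try to find a free position in table order.
        for i in range(self.d):
            pos = self._h(i, x)
            if self.tables[i][pos] is None:
                self.tables[i][pos] = x
                return True
        # All d positions occupied: evict from a random table and
        # re-place the victim, excluding the table it was evicted from.
        item, last = x, None
        for _ in range(self.max_kicks):
            j = random.choice([k for k in range(self.d) if k != last])
            pos = self._h(j, item)
            item, self.tables[j][pos] = self.tables[j][pos], item
            for i in range(self.d):
                if i != j and self.tables[i][self._h(i, item)] is None:
                    self.tables[i][self._h(i, item)] = item
                    return True
            last = j
        return False  # give up; a real implementation would rehash or resize

    def remove(self, x):
        # Same probing as a match; clear the slot when found.
        for i in range(self.d):
            pos = self._h(i, x)
            if self.tables[i][pos] == x:
                self.tables[i][pos] = None
                return True
        return False
```

Note that in this structure a removal or an unsuccessful match always accesses all d tables; this is precisely the cost that the back-annotation of Section III reduces.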
In fact, utilizations above 90% can be achieved with three hash tables and over 95% with four hash tables. This comes at the cost of a more complex insertion procedure.

There are different options to implement Cuckoo hashing. The most direct implementation is to store all the tables in a single memory, so that tables are accessed sequentially. In hardware implementations it is also possible to use a different memory device for each table, so that all tables can be accessed in parallel. In the rest of the letter, the sequential implementation using a single memory is considered; this is the option used when the implementation runs on a processor. In this sequential implementation, the throughput for match operations is determined by the average number of accesses needed to perform a match. For operations that find a match, the number of accesses is between 1 and d. However, for operations that do not find a match, d accesses are required. As discussed before, for flow sampling most match operations will not find a match, as only a small fraction of the flows will be monitored. This means that the performance will be close to that of the worst case.

III. PROPOSED TECHNIQUE

The proposed scheme is based on adding information to the first hash tables to avoid access to subsequent tables. In more detail, a counter is added to each item of the first d − 1 hash

Fig. 1. Illustration of the proposed scheme for d = 3.

tables. The counter is placed with the stored item as shown in Fig. 1. This ensures that both the item and the counter can be read in a single memory access. Then, when an item x is added to a given table, for example table j, the counters in positions h1(x), h2(x), . . ., hj−1(x) of the previous tables are incremented. Similarly, the counters in those positions are decremented when the item is removed. The match procedure is also modified to check that the counter is greater than zero before proceeding to the next table. Therefore, in this modified structure, the presence of items in subsequent tables is back-annotated in previous tables to improve match performance. The three operations on the proposed scheme are as follows:
- Match: the first table (i = 1) is accessed and position h1(x) is read and compared with x. If the stored value is not equal to x, the counter is checked; if it is greater than zero, the second table is accessed and the comparison is repeated. If again no match is found in that table and its counter is greater than zero, the third table is accessed, and the operation continues until either a match is found or all d tables have been accessed. In all cases, when a zero counter is found the operation ends.
- Insertion: the first table (i = 1) is accessed and position h1(x) is read. If it is free, the new item is stored there. If not, the second table is accessed and the process is repeated, continuing until a table with a free position is found or all d tables have been accessed. When the item is inserted in table l, the counters in positions h1(x), h2(x), . . ., hl−1(x) of tables 1 to l − 1 are incremented. When no free position is found, a random table j is chosen, the new item x is stored in position hj(x), and the counters in positions h1(x), h2(x), . . ., hj−1(x) are incremented. Conversely, the counters in positions h1(y), h2(y), . . ., hj−1(y) associated with the item y that has been evicted are decremented. Then, the insertion operation is repeated for y, excluding table j as an option for insertion. This process is recursive and items are moved to accommodate the new item if needed.
- Removal: the same process as a match, but when the item is found in table j it is removed and the counters in positions h1(x), h2(x), . . ., hj−1(x) are decremented.

In Fig. 1, the proposed scheme is illustrated with an example for the case d = 3. Let us consider that a match operation for an item t is performed and that the hash values are h1(t) = 0, h2(t) = 1, and h3(t) = 3. The first table is accessed and the stored item and counter are retrieved (as mentioned before, this requires one memory access as they are stored together in the same memory position). Then the item is compared and no match is found (as z ≠ t), but since the counter is greater than zero the second table is accessed. Again no


match is found in the second table, but in this case the counter is zero and therefore the operation can stop. This saves one memory access compared to the traditional implementation. Finally, it is interesting to note that position 3 in the first table is free but has a counter value of one. This situation occurs when item insertions and removals take place over time, as is the case in flow monitoring applications. Let us now consider the insertion of a new item f with hash values h1(f) = 2, h2(f) = 2 and h3(f) = 0. The item would be inserted in table 3 and the counters in position 2 of the first two tables are incremented. The reduction in the average number of memory accesses that can be achieved with the proposed technique depends on the values of the counters and on the number of items stored, and it is difficult to estimate analytically. The same applies to the number of bits required for the counters to ensure that no overflow occurs. In this case, it can intuitively be seen that more bits will be required in the first tables. In the next section, the benefits of the proposed scheme and the storage requirements are evaluated by simulation.
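The modified operations can be sketched as follows. This is an illustrative Python model, not the letter's C++ implementation; the recursive eviction path (and its counter updates for displaced items) is omitted for brevity, and the hash functions are arbitrary.

```python
class BackAnnotatedCuckoo:
    """Sketch of back-annotated Cuckoo hashing: each entry of the first
    d-1 tables carries a counter of the items that map to that position
    but are stored in a later table."""

    def __init__(self, d=3, size=16):
        self.d = d
        self.size = size
        self.items = [[None] * size for _ in range(d)]
        # Counters are only needed for tables 1..d-1; the last list is unused.
        self.counters = [[0] * size for _ in range(d)]

    def _h(self, i, x):
        return hash((i, x)) % self.size  # illustrative hash function

    def match(self, x):
        """Return (found, accesses); a zero counter ends the search early."""
        for i in range(self.d):
            pos = self._h(i, x)
            if self.items[i][pos] == x:
                return True, i + 1
            if i < self.d - 1 and self.counters[i][pos] == 0:
                return False, i + 1  # no item maps here in any later table
        return False, self.d

    def insert(self, x):
        """Store in the first free position and back-annotate the counters
        at x's positions in all earlier tables (eviction path omitted)."""
        for i in range(self.d):
            pos = self._h(i, x)
            if self.items[i][pos] is None:
                self.items[i][pos] = x
                for k in range(i):
                    self.counters[k][self._h(k, x)] += 1
                return True
        return False

    def remove(self, x):
        # Same probing as a match; decrement the earlier counters when found.
        for i in range(self.d):
            pos = self._h(i, x)
            if self.items[i][pos] == x:
                self.items[i][pos] = None
                for k in range(i):
                    self.counters[k][self._h(k, x)] -= 1
                return True
        return False
```

In this sketch a miss can terminate after any table whose counter is zero, so the cost of an unsuccessful match ranges from 1 to d accesses instead of always being d.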


Fig. 2. Average number of memory accesses vs table occupancy for d = 2.

IV. EVALUATION

The proposed scheme has been implemented in C++ and compared with a traditional implementation of Cuckoo hashing. In the simulations, several values of d and of the number of stored items are used. The size of each of the hash tables was 4 K items. For each configuration, first the target number of items is inserted. Then, successive removals and insertions of items are performed to simulate a steady state in which flows end and start over time. The number of insertions/removals was set to four times the target number of items. During this process, the maximum value of the counters is logged. Finally, 32 K matches for items not stored in the tables are performed and the average number of memory accesses is logged. In all the experiments, flows are taken from traces of high speed Internet links to ensure that the distribution of IP addresses and ports is realistic. In particular, traces from the 2012 CAIDA Anonymized Internet Traces Dataset have been used [11]. Each trace contains hundreds of thousands of flows, and several traces have been used to ensure that the results are consistent.

The results are presented in Figs. 2–4, where the number of accesses for a traditional implementation is also shown for comparison. For d = 2 the maximum achievable table occupancy is around 0.5 and therefore only occupancies in the range 0 to 0.5 are considered. For d = 4 and d = 8 the maximum occupancies are close to 1. It can be observed that the average number of memory accesses is in all cases lower than d. The largest reductions are achieved at low/medium occupancies. For example, for a 0.5 occupancy and d = 4 or 8, reductions of approximately 50% are achieved (from 4 to approximately 2 and from 8 to 4). For high occupancies the proposed method also provides sizeable reductions. For a 0.4 occupancy and d = 2 the savings are around 40% (from 2 to approximately 1.2).
For a 0.95 occupancy the proposed scheme is able to reduce the number of accesses by 25% for d = 4 and by 12% for d = 8. These results clearly show that the method can reduce the average number of memory accesses significantly.

Fig. 3. Average number of memory accesses vs table occupancy for d = 4.

Fig. 4. Average number of memory accesses vs table occupancy for d = 8.

The main overhead introduced by the proposed scheme is the need to store a counter for each table element in the first d − 1 tables. As mentioned before, the maximum values observed in the counters were logged for all the simulations.
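The relative memory overhead of this counter storage can be computed directly from the figures given later in this section (5-bit counters, 104-bit 5-tuple keys, counters only in the first d − 1 tables); a small illustrative calculation:

```python
def counter_overhead(d, counter_bits=5, key_bits=104):
    """Relative memory overhead of back-annotation: counters are stored
    only in the first d-1 tables, while every entry of all d tables
    holds a key (a 104-bit 5-tuple)."""
    return counter_bits * (d - 1) / (key_bits * d)

print(f"{counter_overhead(2):.1%}")  # d = 2 -> 2.4%
print(f"{counter_overhead(4):.1%}")  # d = 4 -> 3.6%
print(f"{counter_overhead(8):.1%}")  # d = 8 -> 4.2%
```

For d = 4 this reproduces the 3.6% overhead quoted later in this section, and for any d the overhead stays below counter_bits/key_bits = 5/104 ≈ 4.8%.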


TABLE I. SIMULATION RESULTS FOR THE COUNTERS

TABLE II. MEMORY OVERHEAD

The results are summarized in Table I. Based on those results, counters are dimensioned with 5 bits to provide ample margin against overflow. Since the 5-tuple requires 104 bits, the counter overhead is approximately 5%. In fact, the value is lower because, as mentioned before, the counters are only needed in the first d − 1 tables. The exact overheads are presented in Table II. Finally, it should be noted that in many implementations the memory width will be a power of two and therefore the 104 bits of the 5-tuple will be stored in 128 bits, leaving 24 spare bits. In those cases, five of those spare bits can be used to implement the counters with no practical overhead.

The other overhead introduced by the scheme is the updating of the counters that is performed during the insertion and removal of items. This requires some additional write operations. In flow sampling, insertions and removals only take place when the decision to monitor a flow is taken or when monitoring ends. Therefore, those operations are much less frequent than searches, which take place for every incoming packet. This means that the impact on performance will be negligible. As an example, let us consider a configuration with d = 4. This provides a good trade-off between memory utilization and number of memory accesses and is used, for example, in [10]. For this configuration the proposed scheme requires only a 3.6% memory overhead and provides a reduction of at least 25% in the average number of memory accesses. This reduction corresponds to the maximum table occupancy. However, in most systems, table occupancy will not be maximum all of the time and therefore the average reduction will be larger.

V. CONCLUSION

In this letter, a technique to improve the performance of Cuckoo hashing when searching for items not stored in the tables has been presented. Optimizing for this case is important in

applications like flow sampling, in which only a small percentage of the incoming packets are monitored and therefore most searches do not find a match. The technique, back-annotated Cuckoo hashing, introduces counters on the elements of the first tables that are used to know if there can be a match in subsequent tables. In more detail, the presence of elements in the last tables is back-annotated in the first ones. This makes it possible to complete search operations without accessing all the tables, thus reducing the average number of accesses. The proposed scheme has been evaluated by simulations using flows taken from publicly available traces of high speed Internet links. The results show that significant reductions in the average number of accesses are achieved in most cases, with values exceeding 25% for practical configurations. The main overhead of the scheme is a small increase in memory size that, in the case of flow sampling, is below 5% in all the configurations considered. The theoretical analysis of the proposed scheme, both in terms of its performance and the dimensioning of the counters, is an area for future work. Another option that could be studied in future work is to implement the counters only in a subset of the tables to reduce the memory overhead.

REFERENCES

[1] M. Crovella and B. Krishnamurthy, Internet Measurement: Infrastructure, Traffic and Applications. Hoboken, NJ, USA: Wiley, 2006.
[2] A. Patcha and J.-M. Park, "An overview of anomaly detection techniques: Existing solutions and latest technological trends," Comput. Netw., vol. 51, no. 12, pp. 3448–3470, Aug. 2007.
[3] G. Androulidakis and S. Papavassiliou, "Improving network anomaly detection via selective flow-based sampling," IET Commun., vol. 2, no. 3, pp. 399–409, Mar. 2008.
[4] V. Sekar, A. Gupta, M. K. Reiter, and H. Zhang, "Coordinated sampling sans origin-destination identifiers: Algorithms and analysis," in Proc. 2nd Commun. Syst. Netw. Conf., Jan. 2010, pp. 1–10.
[5] N. Duffield, C. Lund, and M. Thorup, "Estimating flow distributions from sampled flow statistics," IEEE/ACM Trans. Netw., vol. 13, no. 5, pp. 933–946, Oct. 2005.
[6] K. Bartos, M. Rehak, and V. Krmicek, "Optimizing flow sampling for network anomaly detection," in Proc. 7th IWCMC, 2011, pp. 1304–1309.
[7] K. Pagiamtzis and A. Sheikholeslami, "Content-addressable memory (CAM) circuits and architectures: A tutorial and survey," IEEE J. Solid-State Circuits, vol. 41, no. 3, pp. 712–727, Mar. 2006.
[8] A. Kirsch, M. Mitzenmacher, and G. Varghese, "Hash-based techniques for high-speed packet processing," in Algorithms for Next Generation Networks. London, U.K.: Springer-Verlag, 2010, pp. 181–218.
[9] R. Pagh and F. F. Rodler, "Cuckoo hashing," J. Algorithms, vol. 51, no. 2, pp. 122–144, May 2004.
[10] P. Bosshart et al., "Forwarding metamorphosis: Fast programmable match-action processing in hardware for SDN," in Proc. SIGCOMM Conf. Appl., Technol., Architectures, Protocols Comput. Commun., 2013, pp. 99–110.
[11] CAIDA Anonymized Internet Traces 2012 Dataset. [Online]. Available: http://www.caida.org/data/passive/passive_2012_dataset.xml