IEEE TRANSACTIONS ON COMPUTERS,
Parallel d-Pipeline: A Cuckoo Hashing Implementation for Increased Throughput

Salvatore Pontarelli, Pedro Reviriego, and Juan Antonio Maestro

Abstract—Cuckoo hashing has proven to be an efficient option to implement exact matching in networking applications. It provides good memory utilization and deterministic worst case access time. The continuous increase in speed and complexity of networking devices creates a need for higher throughput exact matching in many applications. In this paper, a new Cuckoo hashing implementation named parallel d-pipeline is proposed to increase throughput. The scheme presented is targeted to implementations in which the tables are accessed in parallel. A parallel implementation increases the throughput and therefore is well suited to high speed applications. Parallel schemes are common for ASIC/FPGA implementations in which the tables are stored in several embedded memories. Using the proposed technique, the throughput can be significantly increased with gains that in practical scenarios can reach 60 percent compared to existing parallel implementations. The new scheme has been evaluated using a case study and detailed results for performance and implementation costs are reported.

Index Terms—High speed packet processing, Cuckoo hashing, parallel pipelined
1
INTRODUCTION
Packet classification plays a fundamental role in most networking applications [1]. As link speed and packet processing complexity increase, there is a growing need for fast and efficient packet classification implementations. This is exacerbated because the speed of memories increases more slowly than that of links. Therefore, it is increasingly difficult to support the throughput needed to match the link speeds. One option is to use algorithms based on hash tables [2]. An example of packet processing is exact matching, in which a group of bits in the packet is compared against a set of patterns to find an exact match. It is used, for example, for load balancing [3], routing [4] or quality of service [5].

A hash table is a data structure used to implement an associative memory able to map keys to values. A hash table uses a hash function to compute an index into an array of buckets or slots, from which the correct value can be found. Ideally, hash tables are able to retrieve values in O(1) time. The major drawback of hash tables is the hash collision, in which different keys map to the same index. To manage this issue, several hashing schemes have been proposed, which differ in the collision resolution strategy used to handle such events. The memory size and the average and worst case times for the insert and query operations depend heavily on the collision resolution policy applied. One common strategy is to place the items with the same index into a chained list. When an item is searched, the chain is scanned until the value corresponding to the searched item is found [6], [7]. In another strategy, called open addressing, when a new entry has to be inserted, the buckets are examined, starting with the index given by the hash and proceeding in some probe sequence, until an unoccupied slot is found.
The search operation scans the buckets in the same sequence, until either the target record is found, or an unused array slot is found, which indicates that there is no such
S. Pontarelli is with Consorzio Nazionale Interuniversitario per le Telecomunicazioni, CNIT, Via del Politecnico 1 - 00133, Rome, Italy. E-mail:
[email protected]. P. Reviriego and J.A. Maestro are with Universidad Antonio de Nebrija, C/ Pirineos 55 E-28040, Madrid, Spain. E-mail: {previrie, jmaestro}@nebrija.es.
Manuscript received 14 Mar. 2014; revised 14 Oct. 2014; accepted 8 Mar. 2015. Date of publication 26 Mar. 2015; date of current version 16 Dec. 2015. Recommended for acceptance by K. Li. For information on obtaining reprints of this article, please send e-mail to: reprints@ieee.org, and reference the Digital Object Identifier below. Digital Object Identifier no. 10.1109/TC.2015.2417524
VOL. 65, NO. 1,
JANUARY 2016
key in the table [6]. For both strategies described above, the average and worst case query times are not deterministic and depend on the load of the table. Cuckoo hashing is a multiple hash table scheme that provides constant access time and high memory utilization [8]. These features make it attractive for exact matching implementations and it has, for example, recently been used for packet processing in Software Defined Networking [9]. To implement Cuckoo hashing, the tables can be stored in a lower speed external memory [10] or can be realized using embedded memories in ASIC [9] or FPGA [10] based networking devices. The design choice depends on the requirements in terms of speed and size of the hash table to be implemented. When external memories are used, all tables are commonly stored in the same memory device and are accessed serially. A single external memory device is used to store all the hash tables because of the pin limitations of network hardware devices, whose pin count can already exceed 1,000 pins [10]. Recently, the optimization of an implementation that combines both a fast on chip memory and a larger external memory has been studied in [11]. This can be useful for large tables that cannot be stored in on chip memory. Another recent work has focused on optimizing the throughput of Cuckoo hashing for flow sampling applications in which most of the searches are not successful (as only a few flows are selected to be sampled) [12]. However, as discussed in the following, in many networking applications most searches are successful. Therefore, that optimization will only be useful in some cases. In ASIC/FPGA implementations and for small-medium tables, an embedded memory per table can be easily used so that tables are accessed in parallel to further speed up access.
In both cases, the worst case access time is fixed: in the serial case it is a number of memory access cycles equal to the number of tables, and in the parallel case it is just one memory access. The number of tables is typically small; for example, with four tables a memory utilization larger than 95 percent is achieved with Cuckoo hashing. As link speed and the complexity of packet processing increase, there is a growing need for faster implementations of Cuckoo hashing. This is further exacerbated by the fact that memory speed increases less than link speed [10]. This creates a throughput bottleneck for high speed switches and routers. Therefore, alternative implementations are needed to keep up with link speed. One trivial option is to use several implementations of the same Cuckoo hashing tables on different memory devices. This increases the throughput by a factor equal to the number of times that the tables are replicated. The problem is that the cost in terms of memory storage is also increased by the same factor. Therefore, it would be more interesting to design alternative Cuckoo hashing implementations that speed up access without requiring memory replication. This has been the purpose of the work that is presented in this paper. In particular, an optimized architecture to implement Cuckoo hashing is presented. The proposed scheme is targeted to parallel implementations and it can provide significant speed-ups that can reach 60 percent in practical scenarios. The rest of the paper is organized as follows. Section 2 presents a brief overview of Cuckoo hashing. The proposed architecture for optimized Cuckoo hashing implementation is introduced in Section 3 and evaluated in Section 4. The conclusions and some ideas for future work are discussed in Section 5.
2
OVERVIEW OF CUCKOO HASHING
In this section a brief overview of Cuckoo hashing is presented, focusing on the implementation aspects that are needed to understand the optimized architecture presented in the next section. As discussed in the introduction, Cuckoo hashing uses a number d of hash tables and an element x can be placed in those tables 1, 2, ..., d in positions $h_1(x), h_2(x), \ldots, h_d(x)$ where $h_i(x)$ are hash
0018-9340 © 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Fig. 1. Illustration of the parallel pipeline implementation.
functions. This is similar to other multiple hash table techniques such as d-left hashing [13]. The key difference is what is done if, when inserting an element x, all positions $h_1(x), h_2(x), \ldots, h_d(x)$ are already occupied by other elements. In other structures such as d-left hashing, the new element cannot be inserted and this limits memory utilization. In Cuckoo hashing, the elements stored in those positions are moved to alternative positions in order to make room for the new element. This strategy is applied recursively to increase the probability of successfully inserting the new element in a table [8]. This results in a much larger memory utilization. For example, when d is four, utilization is larger than 95 percent (for two tables utilization is approximately 50 percent and for three tables around 90 percent). In many practical implementations four tables are used as it is the smallest value that achieves a utilization that is close to 100 percent. When inserting an element into the Cuckoo hash, it can occur that several of the $h_1(x), h_2(x), \ldots, h_d(x)$ positions are free. In that case two simple algorithms can be used to select the position in which the element is inserted. The first one is to insert the element in the first table that is free starting from table 1. This is similar to what is done in d-left hashing. The second option is to select one table at random. The insertion is then done in the first free table starting from the randomly selected table. In the following, the first option will be referred to as ordered insertion and the second one as random insertion. To perform exact matching of a value v, the positions $h_i(v)$ are accessed and the elements stored are compared to v. If a match is found, the operation is successful; otherwise that value is not stored in the Cuckoo hash. As mentioned before, the d tables can be stored in one or several memories. In the first case, tables are accessed serially and in the second case in parallel.
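The insertion and matching procedures described above can be sketched in a few lines of Python. This is only an illustrative model, not the implementation used in the paper: the class name `CuckooHash`, the salted use of Python's built-in `hash` as a stand-in for the d hash functions, the random choice of the evicted element and the `max_moves` limit are all assumptions made for the example.

```python
import random


class CuckooHash:
    """Illustrative d-table Cuckoo hash with one key per bucket (hypothetical names)."""

    def __init__(self, d=4, size=1024, max_moves=50, random_insertion=True, seed=0):
        self.d = d
        self.size = size
        self.max_moves = max_moves
        self.random_insertion = random_insertion
        self.rng = random.Random(seed)
        # One salt per table stands in for d independent hash functions.
        self.salts = [self.rng.getrandbits(64) for _ in range(d)]
        self.tables = [[None] * size for _ in range(d)]

    def _h(self, i, key):
        # h_i(key): bucket of key in table i (assumed hash, not the paper's).
        return hash((self.salts[i], key)) % self.size

    def lookup(self, key):
        """Return the 0-based index of the table holding key, or None."""
        for i in range(self.d):
            if self.tables[i][self._h(i, key)] == key:
                return i
        return None

    def insert(self, key):
        """Insert key, moving stored keys if needed; return True on success."""
        if self.lookup(key) is not None:
            return True
        for _ in range(self.max_moves + 1):
            # Ordered insertion starts at table 0, random insertion at a random table.
            start = self.rng.randrange(self.d) if self.random_insertion else 0
            order = [(start + k) % self.d for k in range(self.d)]
            for i in order:
                pos = self._h(i, key)
                if self.tables[i][pos] is None:
                    self.tables[i][pos] = key
                    return True
            # All d candidate buckets are occupied: evict one and re-insert it.
            i = self.rng.choice(order)
            pos = self._h(i, key)
            key, self.tables[i][pos] = self.tables[i][pos], key
        # Move limit reached: the key still in hand is dropped in this sketch.
        return False
```

With `random_insertion=False` the sketch follows ordered insertion, so the first tables fill up first; with the default setting, items spread evenly over the d tables.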
For the serial access, the worst case access time occurs when the element is not stored in the tables or it is found in the last table. In both cases, d memory accesses are needed. The average number of accesses will be lower and depends on the table occupancy and on the strategy used for insertion. As with d-left hashing, ordered insertion provides a lower average access time as items are concentrated on the first tables and therefore found faster. In the parallel case, there are d memories such that each table is stored in a different memory and search operations can be done by accessing all memories in parallel. In that case, all searches require d accesses and are completed in one memory cycle. Alternatively, search operations can be pipelined to reduce the number of memory accesses. In the pipeline architecture, a search operation accesses each of the memories sequentially such that when the current operation moves to access the second memory, the next search operation can access the first memory. Therefore the performance in terms of throughput is the same as that of parallel searches. However, in the pipeline architecture once a search is successful, the rest of the memories do not have to be accessed. Therefore, the average number of accesses is reduced to that of the serial case. Due to the pipeline, all the search operations are completed in d memory cycles. The parallel pipeline implementation is illustrated in Fig. 1. In this architecture, an incoming element is searched in
Fig. 2. Proposed parallel d-pipeline implementation.
table one in the first cycle. If the element is not found, then it is searched in table two and so on. As soon as the element is found, no further searches are done and the element is just propagated through the pipeline until it reaches the output.
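The differences between the serial, parallel and parallel pipeline schemes can be summarized with a small helper. It is a sketch under the assumptions stated in this section (tables probed in the order 1 to d, one memory access per cycle per memory); the function name `search_cost` and its arguments are hypothetical.

```python
def search_cost(found_in, d, scheme):
    """Return (memory_accesses, latency_cycles) for a single search.

    found_in: 0-based index of the table holding the key, or None for a miss.
    scheme:   "serial", "parallel" or "pipeline".
    """
    # Tables are probed in order until the key is found; a miss probes all d.
    probes = d if found_in is None else found_in + 1
    if scheme == "serial":
        return probes, probes   # one memory, tables read one after another
    if scheme == "parallel":
        return d, 1             # all d memories are read in the same cycle
    if scheme == "pipeline":
        return probes, d        # early stop, but the key still crosses d stages
    raise ValueError(scheme)
```

For example, with d = 4 a key stored in the first table costs one access in the serial and pipeline schemes but four accesses in the parallel scheme, while the pipeline always takes four cycles to deliver the result.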
3
PROPOSED IMPLEMENTATION
The proposed technique to optimize the throughput of Cuckoo hashing implementations is based on a simple observation: in the parallel pipeline implementation, the tables are not accessed in all cycles. For example, in a parallel pipeline implementation if an element is found in Table 1, then on the following cycles Tables 2, 3, ..., d are not accessed. Therefore, throughput can potentially be increased with implementations that make use of all the tables in all the cycles. For the parallel pipeline implementation this can be done by using d pipelines as illustrated in Fig. 2. This will be referred to in the following as parallel d-pipeline implementation. In this implementation, each pipeline has a different entry point marked in blue. This allows us to input an element to any table that is idle in a given cycle. For example, if an element is in the first pipeline and a match is found in the first table, then in the next cycle an element can be inserted in the second pipeline to make use of table two. The same applies to the third and fourth pipelines and tables three and four. In this implementation, in one cycle several elements can leave the input FIFO to be placed on several pipelines. Similarly, several elements can leave the pipelines and be sent to the output in one cycle. This enables an increase in throughput. The proposed scheme can be formalized as follows. Let us define each of the d tables as $T_i$ and each of the pipelines as $p_j$. Each pipeline has d positions that will be denoted $p_j[i]$. As mentioned before, the first position of each pipeline is different. Elements are inserted in position $p_j[j]$ and leave the pipeline d cycles later. In each memory cycle, for each table i, the pipeline elements $p_j[i]$ for $j \neq i$ are checked. If all have been already found in previous memory cycles, then a new element is taken from the FIFO and is inserted in $p_i[i]$ and used to access table i in that cycle.
This procedure ensures by design that all the tables are accessed in all the cycles (as long as there are items waiting on the FIFO) and that the delay to perform exact matching is constant with a value of d cycles.
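The scheduling rule formalized above can be modeled at cycle level as follows. This is a simplified sketch, not the hardware design: it tracks only which unresolved key must read each table in each cycle, assumes wrap-around from table d to table 1 within each pipeline, and omits the fixed d-cycle output delay because it does not affect throughput. The function name `simulate_d_pipeline` and the encoding of keys by the table that stores them are assumptions of the example.

```python
import random
from collections import deque


def simulate_d_pipeline(locations, d):
    """Cycle-level model of the parallel d-pipeline.

    locations: for each queued key, the 0-based table that holds it
               (None for a key that is not stored).
    Returns (cycles, accesses): cycles until the last table access and the
    total number of table accesses performed.
    """
    fifo = deque(locations)
    # active[i] = (location, accesses_so_far) of the unresolved key that must
    # read table i in this cycle, or None if table i is idle.
    active = [None] * d
    cycles = accesses = 0
    while fifo or any(a is not None for a in active):
        # An idle table pulls a new key from the FIFO (entry point of pipeline i).
        for i in range(d):
            if active[i] is None and fifo:
                active[i] = (fifo.popleft(), 0)
        nxt = [None] * d
        for i in range(d):
            if active[i] is None:
                continue
            loc, n = active[i]
            accesses += 1
            n += 1
            # Resolved on a hit, or after all d tables have been read (miss);
            # otherwise the key moves on to the next table.
            if loc != i and n < d:
                nxt[(i + 1) % d] = (loc, n)
        active = nxt
        cycles += 1
    return cycles, accesses


if __name__ == "__main__":
    rng = random.Random(0)
    locs = [rng.randrange(4) for _ in range(100000)]  # keys spread evenly over 4 tables
    cycles, _ = simulate_d_pipeline(locs, 4)
    print("keys per cycle:", len(locs) / cycles)
```

When every key is stored and the keys are spread evenly over four tables, the model keeps all tables busy and processes roughly 1.6 keys per cycle; when no key is stored, every search needs d accesses and the rate falls back to one key per cycle.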
Fig. 3. Timeline of the match operation with different implementations. a) serial implementation; b) parallel implementation; c) parallel pipeline implementation; d) proposed parallel d-pipelined implementation.
In Fig. 3, an example of the timeline of memory access operations for the different implementations is illustrated. Fig. 3a is the timeline for a serial implementation, in which several keys (K1, ..., Kn) are given as input to a Cuckoo hash with d = 4 tables. When the item is found, the corresponding box is colored, while items that are not present in a given table are surrounded by white boxes. Fig. 3b shows the behavior of the parallel implementation, in which at each time slot the same key is used for all the tables, and the Cuckoo hash always gives one match. Fig. 3c illustrates the behavior of the traditional parallel pipeline implementation in which keys access the tables as they move through the pipeline. Finally, Fig. 3d presents the proposed parallel d-pipeline implementation. In this last case, it can be seen that during the first time slot, the item K2 is found, and in the second time slot the remaining items (K1, K3, K4) are shifted, while a new item (K5) substitutes the matched item. In the second time slot, two items are matched and therefore the third time slot can start to process two new items (K6 and K7). It can also be noticed that the worst case access time occurs for item K1, which requires four time slots to be matched. For the traditional pipeline implementation shown in Fig. 1, the throughput is constant and equals one element per cycle. This is not the case for the new parallel d-pipeline scheme (or for the serial scheme) in which throughput depends on several factors. The first one is the percentage of the matching operations that are successful. This is because an operation that fails requires accesses to all d tables and leaves no room for improvement. On the other hand, a successful operation may require fewer than d accesses, leaving some cycles to be used by other operations. The second factor is whether ordered or random insertion is used in the Cuckoo hashing.
Finally, and related to the second factor, the occupancy of the tables also affects the throughput. In most applications of Cuckoo hashing in networking, the percentage of search operations that are successful is close to 100 percent. This is because in these applications, when an item not present in the table is accessed, the system must add the item to the hash table. Therefore, subsequent queries of the same item give a successful result. Typical examples of this behavior are: 1) flow monitoring, in which unsuccessful search operations occur only for the first packet of a new flow, and the new flow is added to the table [14] and 2) network switches, in which the table is updated when new flows arrive or the link state changes [15]. Therefore, in the evaluation presented in the next section, it will be assumed that only search operations for items that are stored in the Cuckoo hash are performed. To discuss the throughput of the new scheme, let us first consider that table occupancy is close to 100 percent. Let us also assume that there is a fraction m of searches that are not successful. In that case, when an element is in the tables, the probability of
finding an element in any of the d tables in the first access will be approximately 1/d. Therefore, the second access will be needed only in (1 - 1/d) of the operations. After the second access, the third access will be needed in (1 - 2/d) of the operations and so on. For elements that are not in the tables, the number of accesses is always d. Therefore, the expected number of accesses to find the match will be approximately:

\[
E[\text{Num Acc}] = d \cdot m + (1-m)\left(1 + \left(1-\frac{1}{d}\right) + \left(1-\frac{2}{d}\right) + \left(1-\frac{3}{d}\right) + \cdots + \left(1-\frac{d-1}{d}\right)\right) = d \cdot m + (1-m)\,\frac{d+1}{2} = \frac{d+1}{2} + m\,\frac{d-1}{2}.
\]

Let us now consider that m is close to zero, which as discussed before is the case in many networking applications. Then, for d = 4, which is a value commonly used in practical Cuckoo hashing implementations, this gives a value of 2.5. Since in the proposed implementation the four tables are accessed in all cycles, the average throughput per cycle will be approximately 4/2.5 = 1.6. This means a 60 percent increase over the traditional pipeline implementation that provides a throughput of 1.0. Similarly, for d = 2, the average throughput would be 2/1.5 = 1.33 or 33 percent better than a traditional parallel pipeline implementation. When the table occupancy is significantly lower than 100 percent, the probability of finding an item on one of the tables depends on how the insertion procedure is implemented. For example, when ordered insertion is used, it will be much more likely to find the items in the first tables. On the other hand, with random insertion the probability will be approximately the same for all tables and equal to 1/d. Therefore, with random insertion, the average throughput will be independent of the occupancy and, for d = 4, equal to 1.6 items per cycle. However, for ordered insertion it will vary with table occupancy. As an example, let us consider a situation in which Table 1 has an occupancy of 70 percent, Table 2 of 20 percent, Table 3 of 5 percent and Table 4 of also 5 percent so that the overall occupancy will be 25 percent.
Then the average number of accesses for a search operation that starts in pipeline one will be 1 + 0.3 + 0.1 + 0.05 = 1.45. However, a search operation that starts in pipeline three requires on average 1 + 0.95 + 0.9 + 0.2 = 3.05 accesses. Similarly, a search that starts in pipeline two requires on average 1 + 0.8 + 0.75 + 0.7 = 3.25. The average number of accesses for the entire implementation depends on the probabilities of an item entering the system on the different pipelines. In the example considered, in many cases the items will enter through
Fig. 4. Plot of the throughput achievable with the proposed parallel d-pipeline implementation using the ordered and random insertion for d = 4.
Fig. 5. Plot of the throughput achievable with the proposed parallel d-pipeline implementation using the ordered and random insertion for d = 2.
pipeline two as it tends to be free since table one gives a 70 percent match probability. Therefore the final value will be closer to 3.25 than to 1.45. A complete theoretical analysis could be done, but since the simulation results presented in the next section clearly show that random insertion outperforms ordered insertion in all cases considered, this analysis is left for future work. Finally, it is interesting to note that this is different from the serial case in which ordered insertion is more efficient and is in fact the main idea behind d-left hashing. From the previous discussion it becomes apparent that the proposed parallel d-pipeline implementation combined with random insertion is a good option to increase throughput in Cuckoo hashing implementations. This will be confirmed by the experiments presented in the following section.
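The arithmetic of this section can be checked with a short script. It is a worked example only: the function names are hypothetical, m is treated as a fraction between 0 and 1, and `share[i]` is assumed to be the fraction of all stored items held by table i (in the ordered insertion example, the occupancies 70, 20, 5 and 5 percent add up to 100, so the shares are 0.70, 0.20, 0.05 and 0.05).

```python
def expected_accesses(d, m=0.0):
    """Expected accesses per search at ~100% occupancy: d*m + (1-m)*(d+1)/2."""
    hit = sum(1 - k / d for k in range(d))  # equals (d + 1) / 2
    return d * m + (1 - m) * hit


def accesses_from_pipeline(start, share):
    """Average accesses of a successful search that enters at table `start`.

    share[i] is the fraction of the stored items held by table i (sums to 1).
    """
    d = len(share)
    total, remaining = 0.0, 1.0
    for k in range(d):
        total += remaining                  # probability that access k+1 is needed
        remaining -= share[(start + k) % d]
    return total


if __name__ == "__main__":
    for d in (2, 4):
        e = expected_accesses(d)
        print(d, e, round(d / e, 2))        # expected accesses and keys per cycle
    share = [0.70, 0.20, 0.05, 0.05]
    for start in range(4):
        print("pipeline", start + 1, round(accesses_from_pipeline(start, share), 2))
```

The first loop reproduces the 2.5 accesses and 1.6 items per cycle for d = 4 (1.5 and 1.33 for d = 2); the second reproduces the 1.45, 3.25 and 3.05 accesses quoted for pipelines one, two and three.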
4
EVALUATION
In this section, the proposed implementation is evaluated to show its benefits. The first step is to simulate the scheme to assess its throughput for different loads and insertion procedures. This is done using a C++ implementation of the scheme with ideal hash functions and randomly generated items. Then a case study of flow identification is used to show the benefits in a practical scenario. In this case study, the proposed scheme is implemented on an FPGA and compared with both a standard parallel and a parallel pipeline scheme in terms of throughput and of implementation cost.
4.1
Simulation Results
To evaluate the efficiency of the proposed implementation, we simulated the parallel d-pipeline implementation with d = 4 and with the size of each table set to n = 32 K. Both the random and ordered insertions have been investigated. The simulations have been set up to vary the load from 10 to 90 percent. The simulation first fills the tables with insertions of random items up to the required load factor. It then evaluates the time (in terms of simulated clock cycles) needed to perform 1 million queries. For each load factor, 1,000 simulation runs have been performed and the average is reported. Fig. 4 shows the simulation results for the ordered and random insertion. As expected, the random insertion shows a throughput of 1.6 queries per clock cycle independently of the load factor. Instead, the throughput of the hash tables filled with the ordered insertion procedure is load dependent and varies from 1.17 to 1.58 queries per clock cycle. As an example, the table occupancies for a load of 10 percent were 32.91, 6.86, 0.23 and 0 percent for ordered insertion and 9.96, 10.06, 10.00 and 9.98 percent for random insertion. This clearly shows how ordered insertion concentrates the items on the first tables. This reduces the benefits of the parallel d-pipeline scheme as the last tables are mostly unused and do not contribute to the system throughput. The simulations have also been done for d = 2. In this case, the loads tested are from 10 to 50 percent as the maximum occupancy that can be achieved with d = 2 is around 50 percent. The results are shown in Fig. 5. In this case, random insertion has a throughput of 1.33 queries per clock cycle
Fig. 6. Block diagram of the traffic monitoring system implemented on the FPGA with the different options for the Cuckoo hash.
TABLE 1 Logic Resources

                  standard            pipeline            d-pipeline
# of Slice LUT    24,993 (25%)        25,288 (25%)        26,279 (27%)
# of BRAM         115 of 212 (54%)    115 of 212 (54%)    118 of 212 (55%)

TABLE 2 Critical Path for the Cuckoo Hash Implementations
independently of the load factor. The throughput for ordered insertion ranges from 1.09 to 1.30. These results confirm the reasoning made in the previous section that random insertion should be used in the parallel d-pipeline scheme.
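The table occupancies reported for a 10 percent load can be approximated with a short Monte Carlo sketch. It is a simplification made for illustration and not the C++ simulator used for the results above: ideal hash functions are modeled with a seeded pseudo-random generator, Cuckoo moves are omitted (at low load almost every item finds a free bucket), and the function name `fill_tables` is hypothetical.

```python
import random


def fill_tables(d, n, load, ordered, seed=0):
    """Insert random items up to the given overall load; return per-table occupancy.

    An item whose d candidate buckets are all taken is skipped (no Cuckoo moves),
    which is an assumption that only holds well at low load.
    """
    rng = random.Random(seed)
    used = [[False] * n for _ in range(d)]
    count = [0] * d
    target = int(load * d * n)
    while sum(count) < target:
        pos = [rng.randrange(n) for _ in range(d)]  # ideal hash values h_i(x)
        start = 0 if ordered else rng.randrange(d)
        for k in range(d):
            i = (start + k) % d
            if not used[i][pos[i]]:
                used[i][pos[i]] = True
                count[i] += 1
                break
    return [c / n for c in count]


if __name__ == "__main__":
    print("ordered:", fill_tables(4, 32768, 0.10, ordered=True))
    print("random: ", fill_tables(4, 32768, 0.10, ordered=False))
```

With ordered insertion, roughly a third of the first table fills up while the last table stays almost empty; with random insertion, all four tables end up near 10 percent, in line with the occupancies reported above.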
4.2
FPGA Implementation
To evaluate the proposed parallel d-pipeline architecture, we developed on an FPGA a simple network measurement application that counts the packets and the bytes of each inspected TCP/UDP flow (see footnote 1). For each packet arriving in the system, the hash table is queried to retrieve the index associated with the corresponding flow. If the index is found, the counters associated with the flow are updated. Otherwise, a new item is stored in the hash table and the corresponding counters are initialized. A scrubbing block removes the items corresponding to flows that are no longer active. The target FPGA is a Xilinx Virtex5 XC5VLX155T [16]. The Virtex5 FPGA provides 212 BRAM blocks of 36 Kbits. Three versions of the system have been designed: one with the standard parallel implementation, another with the parallel pipelined implementation and the third one with the proposed parallel d-pipeline. The block diagram of the designed system is shown in Fig. 6. It is composed of the following blocks:
- A block extracting the flow identification (104 bits) and the packet size of the incoming packets.
- A FIFO that takes as input the flow identification coming from the previous block and provides the items to the hash tables. In the case of the proposed parallel d-pipeline, a multi-output FIFO able to provide up to 4 items to the 4 pipelines is used.
- An FSM controlling the accesses to the hash tables.
- 4 hash tables with 4 K entries of 128 bits. The tables store the key and the index in which the byte and packet counters of the flow are located. In the case of the proposed parallel d-pipeline, the inputs of the hash tables are multiplexed as depicted in Fig. 2. For simplicity, the Cuckoo hash table implementation limits the number of moves when inserting new items to one, as proposed by [17].
- Four 4 K x 64-bit memories for storing the flow counters.
- An FSM controlling the accesses to the counter memory.

The system was implemented with a clock frequency of 156.25 MHz for the three options considered. The FPGA resources occupied by the three versions of the system (standard, pipeline and d-pipeline) are reported in Table 1. The numbers are similar for all the options and the additional cost of the proposed implementation is below 5 percent for both LUTs and BRAMs. The traditional systems (both standard and pipeline) have a throughput of 1.0 and therefore are able to process up to 156.25 million packets per second. Assuming a minimum packet size of 40 bytes, the maximum throughput sustainable under worst case packet size conditions is 50 Gbps. Instead, since the proposed parallel d-pipeline can be clocked at the same frequency as the traditional parallel pipeline implementation, but processes an average of 1.6 packets per cycle, the
maximum throughput can be estimated as 80 Gbps. Therefore, for a given FPGA technology, the packet throughput can be increased significantly. This clearly illustrates the benefits of the proposed scheme. The main disadvantage of the scheme is that it requires some additional FPGA resources. To further study the impact of the modifications introduced by the proposed scheme, the three designs have also been synthesized setting the Xilinx synthesis tool for the maximum speed. The goal in this case is to assess the impact on timing of the proposed scheme. The critical paths for the three options considered are reported in Table 2. It can be observed that the values are similar in all cases, with the pipeline scheme having the lowest delay. This can be explained as it reduces the fan-out coming from the input to the tables. The proposed scheme has the largest delay but the increment over the pipeline is less than 5 percent. In addition, when looking at the results for the whole system (as shown in Fig. 6), the critical path is 5.39 ns. This critical path is located in the ID extraction block. Therefore, in this FPGA case study, the d-pipeline scheme has no impact on the overall system timing.

Table 2 values:
          standard   pipeline   d-pipeline
delay     4.62 ns    4.47 ns    4.63 ns

5
CONCLUSIONS
In this paper, an architecture to increase the throughput of Cuckoo hashing implementations has been proposed and evaluated. The scheme is targeted to high-speed parallel implementations and modifies the existing pipeline architecture to ensure that all tables are accessed in all cycles. This is done by introducing one pipeline per table such that elements can enter the system at any given table. This enables a more efficient use of the tables and therefore an increase in throughput. The proposed scheme has been evaluated both in simulation and in an FPGA implementation. The results show that significant gains in speed can be achieved. For the case study considered, the gains were close to 60 percent. The implementation cost of the proposed scheme is larger than that of a simple parallel pipeline architecture. The overheads in the FPGA implementation are less than 5 percent in the case study. This increase in FPGA resource usage is clearly offset by the gains in throughput. Therefore, the proposed scheme can be an interesting option for Cuckoo hashing implementations that require high throughput. The proposed architecture could also potentially be used to speed up parallel implementations of other multiple hash schemes such as d-left hashing and Bloom filters [18]. This can be an attractive topic for future work.

ACKNOWLEDGMENTS
Salvatore Pontarelli is the corresponding author. This brief is part of a collaboration in the framework of COST ICT Action 1103 “Manufacturable and Dependable Multicore Architectures at Nanoscale”.

1. A flow is defined as the set of packets having: the same source and destination IP addresses, the same source and destination ports and the same protocol field. For IPv4, the flow identification requires exact matching of 104 bits.

REFERENCES
[1] P. Gupta and N. McKeown, “Algorithms for packet classification,” IEEE Netw., vol. 15, no. 2, pp. 24–32, Mar./Apr. 2001.
[2] A. Kirsch, M. Mitzenmacher, and G. Varghese, “Hash-based techniques for high-speed packet processing,” in Algorithms for Next Generation Networks. London, U.K.: Springer, 2010, pp. 181–218.
[3] Y. Azar, A. Broder, A. Karlin, and E. Upfal, “Balanced allocations,” in Proc. 26th ACM Symp. Theory Comput., 1994, pp. 593–602.
[4] M. Waldvogel, G. Varghese, J. Turner, and B. Plattner, “Scalable high speed IP routing lookups,” in Proc. Conf. Appl., Technol., Archit., Protocols Comput. Commun., 1997, pp. 25–36.
[5] Z. Wang, Internet QoS: Architectures and Mechanisms for Quality of Service. San Mateo, CA, USA: Morgan Kaufmann, 2001.
[6] D. E. Knuth, “Art of computer programming,” in Sorting and Searching, 2nd ed., vol. 3. Reading, MA, USA: Addison-Wesley, 1998.
[7] P. Reviriego, L. Holst, and J. A. Maestro, “On the expected longest length probe sequence for hashing with separate chaining,” J. Discrete Algorithms, vol. 9, no. 3, pp. 307–312, 2011.
[8] R. Pagh and F. F. Rodler, “Cuckoo hashing,” Int. J. Algorithms, vol. 51, no. 2, pp. 122–144, 2004.
[9] P. Bosshart, G. Gibb, H.-S. Kim, G. Varghese, N. McKeown, M. Izzard, F. Mujica, and M. Horowitz, “Forwarding metamorphosis: Fast programmable match-action processing in hardware for SDN,” in Proc. Conf. Appl., Technol., Arch., Protocols Comput. Commun., 2013, pp. 99–110.
[10] C. Hermsmeyer, et al., “Towards 100G packet processing: Challenges and technologies,” Bell Labs Tech. J., vol. 14, no. 2, pp. 57–79, 2009.
[11] Y. Kanizo, D. Hay, and I. Keslassy, “Maximizing the throughput of hash tables in network devices with combined SRAM/DRAM Memory,” IEEE Trans. Parallel Distrib. Syst., vol. 26, no. 3, pp. 796–809, Mar. 2014.
[12] S. Pontarelli, P. Reviriego, and J. A. Maestro, “Efficient flow sampling with back-annotated cuckoo hashing,” IEEE Commun. Lett., vol. 18, no. 10, pp. 1695–1698, Oct. 2014.
[13] A. Broder and M. Mitzenmacher, “Using multiple hash functions to improve IP lookups,” in Proc. 20th Annu. Joint Conf. IEEE Comput. Commun. Soc., 2001, vol. 3, pp. 1454–1463.
[14] C. Estan and G. Varghese, “New directions in traffic measurement and accounting,” in Proc. 1st ACM SIGCOMM Workshop Internet Meas., 2001, pp. 323–336.
[15] J. Naous, D. Erickson, G. Covington, G. Appenzeller, and N. McKeown, “Implementing an OpenFlow switch on the NetFPGA platform,” in Proc. 4th ACM/IEEE Symp. Archit. Netw. Commun. Syst., 2008, pp. 1–9.
[16] Virtex-5 Family Overview [Online]. Available: http://www.xilinx.com/support/documentation/data_sheets/ds100.pdf, 2009.
[17] A. Kirsch and M. Mitzenmacher, “The power of one move: Hashing schemes for hardware,” IEEE/ACM Trans. Netw., vol. 18, no. 6, pp. 1752–1765, Dec. 2010.
[18] B. Bloom, “Space/time tradeoffs in hash coding with allowable errors,” Commun. ACM, vol. 13, no. 7, pp. 422–426, 1970.