
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 15, NO. 9, SEPTEMBER 2007

An Efficient Approach to On-Chip Logic Minimization Seraj Ahmad and Rabi N. Mahapatra

Abstract—Boolean logic minimization is being applied increasingly to a new variety of applications that demand very fast and frequent minimization services. These applications typically have access to very limited computing and memory resources, rendering traditional logic minimizers ineffective. We present a new approximate logic minimization algorithm based on a ternary trie. We compare its performance with the Espresso-II and ROCM logic minimizers for routing table compaction and demonstrate that it is 100 to 1000 times faster and can execute with as little as 16 kB of data memory. We also found that the proposed approach can support up to 25 000 incremental updates per second. We further compare its performance for compaction of router access control lists and demonstrate that the proposed approach is highly suitable for minimizing large access control lists containing several thousand entries. The algorithm is therefore ideal for on-chip logic minimization.

Index Terms—Access control list (ACL), compaction, Internet protocol (IP), logic minimization, minimization trie (m-Trie), routing table, ternary content addressable memory (TCAM).

I. INTRODUCTION

Logic minimization techniques traditionally have been used in logic synthesis to reduce the number of gates required for a given circuit. However, in recent years, logic minimization has been applied to numerous applications other than logic synthesis, such as routing table reduction [1], access control list (ACL) reduction in network processors [2], and hardware/software partitioning [3]. These applications are generally embedded in nature and have access to very limited computing and memory resources. They typically require minimization of a fairly large database table containing more than 100 000 entries. Note that the monolithic database table is generally divided into several subtables based on some criteria; however, a large number of entries can still be concentrated in a few of the resulting subtables. In addition, these applications demand high-frequency logic minimization services due to continuous updates to the tables targeted for minimization, and should be able to satisfy the worst case update time requirement corresponding to the largest subtable. Logic minimization is known to be an NP-complete problem, making exact algorithms unsuitable for practical-size tables.

Manuscript received September 22, 2004; revised December 8, 2005. S. Ahmad is with the Advanced Microlithography Division, Magma Design Automation, Inc., San Jose, CA 95110 USA (e-mail: [email protected]). R. N. Mahapatra is with the Department of Computer Science, Texas A&M University, College Station, TX 77843 USA (e-mail: [email protected]). Digital Object Identifier 10.1109/TVLSI.2007.902202

Also, the currently available approximate algorithms

are suited only to the workstation environment due to their large computing and memory requirements. Thus, in order to utilize the existing approximate algorithms for the resource-constrained applications described earlier, an off-chip approach is utilized [1]. In an off-chip approach, the target table is minimized on a workstation equipped with a powerful minimizer, and the result is transmitted to the intended embedded applications. This approach seems to circumvent the resource limitations but is still inadequate to meet the requirements of the applications described earlier. For example, the Espresso-II logic minimization algorithm described in [4] takes 109 s to minimize a routing table containing 11 091 entries and supports two worst case updates per second on a 400-MHz ARM platform. Further, it requires 500 kB of data memory and 100 kB of instruction memory. A newer logic minimizer, ROCM, described in [2], takes 120 s to minimize the same routing table, but offers about 30 worst case updates per second. Contrast this with the over 100 000 routing table entries and peak rates of over 2000 worst case updates per second required in current backbone routers. Furthermore, the off-chip approaches are subject to communication overhead. This overhead is unacceptable in practice due to the high frequency of updates. Thus, a logic minimizer intended for an embedded environment must be kept close to the application and share the limited computing and memory resources available on-chip. In addition, the logic minimizer should have a small code footprint, fast execution time, and the ability to operate within a given data memory budget at an acceptable loss of minimization. Graceful degradation of minimization under different data memory budgets is a key requirement to support even the most resource-constrained devices.
This paper introduces a novel linear-time approximate minimization technique based on a retrieval tree (trie) data structure called a minimization trie (m-Trie) [5]. The technique provides fast minimization and efficient incremental updates using localized minimization heuristics. It has a very small code footprint of about 20 kB and can operate with a data memory budget as small as 16 kB within a 2%–5% loss of minimization compared to Espresso-II and ROCM. Experimental results show that it can sustain up to 25 000 updates per second. The rest of this paper is organized as follows. Section II presents background and discusses related work on logic minimization. Sections III and IV describe the m-Trie data structure, its mapping to Boolean sum-of-products (SOP) forms, and the insertion/deletion algorithms. Section V presents a case study and results on routing table compaction to demonstrate the usefulness of the proposed approach. Section VI

1063-8210/$25.00 © 2007 IEEE


presents another case study, on access control list compaction, to further establish its usefulness. The conclusion and future enhancements are discussed in Section VII. The Appendix presents a ternary content addressable memory (TCAM) architecture and the enhancements needed to support single-cycle prefix lookup in TCAM.

TABLE I TRUTH TABLE FOR THE EDGE TREND δ(e1, e2)

II. BACKGROUND AND RELATED WORK

A Boolean function f can be specified by an on-set F and a dc-set D. The on-set contains all the input combinations where the function assumes a value of logic 1. The dc-set, also known as the don't care set, contains all the input combinations for which the value of f is unspecified. Two-level logic minimization involves finding a minimum covering set for the specified function f; the covering set contains a sum-of-products (SOP) representation over the input variables. The first exact solution to this problem was given by Quine [6], [7] and McCluskey [8]. The procedure first generates all the primes (SOPs that are not contained in another SOP). This is followed by finding a minimum set of primes that covers all the points in the on-set F. Studies have suggested that the Quine–McCluskey procedure and its variants are inefficient, as they generate a huge number of primes (sometimes more than ten million) in the first stage. Therefore, many newer logic minimizers try to generate only those primes that are part of some minimal cover of f, thereby pruning out a large number of primes. Because the logic minimization problem is NP-complete, these algorithms still require a considerable amount of time to find an exact solution. A number of heuristic algorithms have been proposed that find a near-optimal solution within an acceptable time budget. The most notable of these are SPAM [9], PRESTO [10], MINI [11], and Espresso-II. Several other variants of these algorithms exist, offering better performance; however, Espresso-II is by far the most popular two-level logic minimizer. To deal with resource-constrained applications, Espresso-II provides a fast option that uses a single expand stage during the refinement of the initial minimal cover. The expand stage promotes smaller product terms to bigger product terms by removing one or more literals. The removal of a literal is considered valid only if the resulting product term does not include a point from the off-set. Minimization is achieved by deleting all the product terms included in the expanded product. The complexity of the single-expand procedure is quadratic in the number of product terms. One study in [12] has suggested a distance-one merge (d-merge) heuristic to achieve an acceptable level of compaction for on-chip applications in considerably less time. The distance-one merge heuristic iteratively merges product terms that are at a Hamming distance of one. Both of these heuristics achieve good speedup, but they do not minimize the memory requirement. Binary decision diagram (BDD)-based minimization approaches work similarly; they are very sensitive to the ordering of the literals and can sometimes require huge memory resources. Moreover, incremental deletion may not be easily achieved in a BDD-based approach. Another logic minimizer, ROCM, studied in [2], also uses a single expand stage, but effectively minimizes the required memory. Although the execution performance of ROCM is not better than Espresso-II, it offers good performance for the incremental minimization needed by most on-chip logic minimization applications.
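The distance-one merge idea can be illustrated with a short sketch. This is not the exact procedure from [12]; the string representation of product terms ('0' for a complemented literal, '1' for a plain literal, '*' for an absent literal) and the function names are ours:

```python
def distance_one(t1, t2):
    """Return the index at which two equal-length ternary terms differ
    in exactly one position, provided that position is a 0/1 pair."""
    diff = [i for i, (a, b) in enumerate(zip(t1, t2)) if a != b]
    if len(diff) == 1 and {t1[diff[0]], t2[diff[0]]} == {'0', '1'}:
        return diff[0]
    return None

def d_merge(terms):
    """Repeatedly merge distance-one pairs, promoting the differing
    position to the don't-care symbol '*'."""
    terms = set(terms)
    changed = True
    while changed:
        changed = False
        for t1 in list(terms):
            if t1 not in terms:
                continue
            for t2 in list(terms):
                if t1 == t2 or t2 not in terms:
                    continue
                i = distance_one(t1, t2)
                if i is not None:
                    terms -= {t1, t2}
                    terms.add(t1[:i] + '*' + t1[i + 1:])
                    changed = True
                    break
    return terms
```

For example, `d_merge(['000', '001', '010', '011'])` collapses four minterms into the single term `0**`, regardless of the merge order chosen.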

III. PRELIMINARIES

The m-Trie is defined in terms of a 3-ary tree known as a ternary trie. Each nonleaf node u in a ternary trie can have up to three children u0, u1, and u*. The edges connecting the node u to its children u0, u1, and u* are labeled 0, 1, and *, respectively, to specify their direction. The symbols 0 and 1 denote two disjoint directions, while * is defined as a union of both the directions 0 and 1. The set of strings formed by these symbols is represented as Σ*, where Σ = {0, 1, *}. The basic unit of insertion and deletion in the ternary trie is a path. A path containing all the edges between nodes u and v is denoted as p(u, v) and can be uniquely mapped to a string in Σ* known as the route. The route is formed by concatenating the directions of all the edges between nodes u and v. Therefore, a path can be specified as (u, r), where u is the starting node and r specifies the route taken by the path. A path can also be specified simply as p(r), where the starting node is implicitly assumed to be the root. A cover for a given path p is defined as any path formed by promoting one or more of the 0 and 1 directions of p to *. The height h(u) of a node u is defined as the number of nodes visited to reach the lowest leaf node from u. The height of the root node is denoted by H. The subtrie rooted at node u is denoted as T(u). The trend δ(e1, e2) between two edges e1 and e2 in the ternary trie is defined according to Table I. This definition can be extended to define a trend between two paths p1 = (u, r1) and p2 = (v, r2) as follows: Δ(p1, p2) = δ1 δ2 ... δk if |r1| = |r2| = k, and is undefined otherwise, where δi = δ(e1i, e2i) and e1i, e2i denote the ith edges of p1 and p2, respectively,

and the path trend Δ(p1, p2) is the string formed by concatenating the edge trends δi. The edges e1 and e2 are said to match if they follow the same trend. The distance D(p1, p2) between two paths p1 and p2 in a ternary trie is defined as the number of mismatches between e1i and e2i for i = 1, ..., k. The similarity S(p1, p2) of two paths p1 and p2 is defined as follows: S(p1, p2) = 1 if D(p1, p2) = 0, and S(p1, p2) = 0 otherwise. The two paths are said to be dissimilar when S(p1, p2) = 0. Two tries T1 and T2 are said to be dissimilar if there are no paths p1 in T1 and p2 in T2 such that S(p1, p2) = 1.
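These definitions can be made concrete with a small sketch (Python used for illustration only; the exact trend encoding of Table I is not reproduced here and is simplified to label equality):

```python
def distance(r1, r2):
    """Number of edge mismatches between two equal-length routes
    over the alphabet {'0', '1', '*'}."""
    assert len(r1) == len(r2), "distance is defined for equal-length routes"
    return sum(a != b for a, b in zip(r1, r2))

def similarity(r1, r2):
    """1 when the two routes follow the same trend throughout
    (distance 0), 0 otherwise."""
    return 1 if distance(r1, r2) == 0 else 0

def tries_dissimilar(routes1, routes2):
    """Two tries, given here as sets of equal-length routes, are
    dissimilar when no pair of routes is similar."""
    return all(similarity(p1, p2) == 0
               for p1 in routes1 for p2 in routes2)
```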


Fig. 1. Mapping of a path to a product term.

IV. MINIMIZED TRIE

If we map product terms to routes of paths, the two-level logic minimization problem can be modeled as the minimization of the number of leaf nodes in a ternary trie. For example, the path 10100 can be mapped to the five-variable product term x1 x2' x3 x4' x5'. Another path, 10000, can be mapped to the five-variable product term x1 x2' x3' x4' x5'. Thus, a minimization algorithm should attempt to minimize the number of leaf nodes by merging portions of paths in directions 0 and 1 showing similarity and rerouting them towards direction *. For example, the paths p(10100) and p(10000), as shown in Fig. 1, branch out at a node u located at the end of the path segment 10. However, the branches going in direction 0 and direction 1 are similar and, therefore, can be merged into the single path p(10*00). This is equivalent to merging the two product terms x1 x2' x3 x4' x5' and x1 x2' x3' x4' x5' into the product x1 x2' x4' x5'. In Section IV-A, we define the minimization retrieval tree (m-Trie) as an extension of the ternary trie. The m-Trie introduces a set of minimization constraints imposed upon its structure, and the minimization algorithm is formulated in terms of insertion and deletion upon the m-Trie.

Definition: An m-Trie, or minimized trie, is a ternary trie T such that for every node u, the following properties are satisfied:
1) all leaf nodes of T are at the same level H;
2) the subtries T(u0) and T(u1) are dissimilar;
3) the subtries T(u0) and T(u*) are dissimilar;
4) the subtries T(u1) and T(u*) are dissimilar.
The properties enumerated above are enforced on the ternary trie during insertion/deletion in order to obtain fewer leaf nodes. These properties are selected to achieve a balance between execution time and compaction performance and do not guarantee an exact minimum for the number of leaf nodes. Property 1 requires all the leaf nodes to have the same level to facilitate checking for similarity and path merging. Properties 2, 3, and 4 are minimization constraints and state that the m-Tries rooted at the children of u must be pairwise dissimilar. If there is some similarity between the subtries T(u0) and T(u1), the paths showing similarity must be removed from T(u0) and T(u1), merged, and rerouted to T(u*). However, if there is some similarity between the subtries T(u0) and T(u*), the path in T(u0) showing the similarity is removed and merged with the similar path in T(u*). The same reasoning can be applied for similarity between the subtries T(u1) and T(u*).
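The path-to-product-term mapping used above can be sketched as follows (Python for illustration; the literal names x1, ..., xn follow the paper's example, while the helper itself is ours):

```python
def route_to_product(route):
    """Map a ternary route to a product term: '1' keeps the literal,
    '0' complements it, and '*' (a merged direction) drops it."""
    lits = []
    for i, sym in enumerate(route, start=1):
        if sym == '1':
            lits.append(f"x{i}")
        elif sym == '0':
            lits.append(f"x{i}'")
        # '*' contributes no literal
    return ''.join(lits)
```

Merging the paths 10100 and 10000 into 10*00 then corresponds exactly to dropping the third literal from both product terms.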

Algorithm 1: Traversal algorithm.

Fig. 2. Path traversal during insertion.

The algorithm to remove similarity between a pair of tries with n1 and n2 paths, respectively, takes n1 n2 path comparisons. If there are n prefix paths present in the trie, distributed equally among the children at every level, then building the m-Trie directly from a ternary trie takes on the order of n^2 path comparisons. Therefore, the complexity of building the m-Trie from a ternary trie is O(n^2). Section IV-A provides an insertion algorithm on the m-Trie. Because building the m-Trie is equivalent to n insertions on the m-Trie, an incremental insertion-based approach can build it in time linear in n. The reduction in complexity from quadratic to linear is mainly attributed to the localization heuristics adopted in the insertion algorithm.

A. Insertion

The insert operation on the m-Trie consists of two steps. The first step involves traversal of the path to be inserted, creating those parts of the path that do not already exist in the trie. The traversal procedure, given in Algorithm 1, maintains a path counter C(v) to track the number of paths passing through each node v. Traversal increments the path counter of each visited node by 1 and returns the leaf node. Fig. 2 shows a path to be inserted in the trie. The path starting from the root to node u already exists in the trie, so this path is simply traversed till node u. There is no path from u to the leaf node; therefore, that segment of the path is created during traversal.
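The traversal step of Algorithm 1 can be sketched as follows; the node layout and the counter field name are illustrative stand-ins, not the paper's exact data structure:

```python
class Node:
    """Ternary-trie node: a path counter and up to three children
    keyed by the direction symbols '0', '1', and '*'."""
    def __init__(self):
        self.count = 0        # number of paths passing through this node
        self.children = {}

def traverse(root, route):
    """Walk the path for `route` from the root, creating missing nodes
    on the way, increment the path counter of every visited node by 1,
    and return the leaf node (as in Algorithm 1)."""
    node = root
    node.count += 1
    for sym in route:
        node = node.children.setdefault(sym, Node())
        node.count += 1
    return node
```

A second insertion sharing a prefix with an existing path only creates the missing suffix; the counters on the shared prefix simply increase, which is what later allows deletion to know when a node is no longer used.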


Fig. 3. Similarity at nodes u and v.

Algorithm 3: Search algorithm.

Algorithm 2: Insertion algorithm.

The second step involves merging of the newly added path p by rerouting. It consists of inspecting all the nodes on p, starting from the leaf node, for a similarity condition with their offspring and resolving this condition. Note that the existence of a similarity condition is a violation of Properties 2–4. In order to remove the similarity at a node u, the path segments showing similarity are untraversed and traversed again in direction * at u. Fig. 3 shows a path similarity condition at node u. Here, a path below u shows similarity with a path below node v, where node v is an offspring of u. To break the similarity, the two paths are untraversed and traversed again in direction *. Insertion maintains two bitmaps B0 and B1 of length H at every leaf node. The bitmap B0 (respectively, B1) remembers the node positions where a path in direction 0 (respectively, 1) was merged with a path in direction *. The path counters and bitmaps together are used to support deletion in the m-Trie. The correctness and complexity of Algorithm 2 are established in the following lemma.

Lemma 1: The minimization constraints for an m-Trie are violated only along the newly added path during insertion upon the m-Trie. The constraints can be restored in O(H^2) steps.

Proof: Suppose u is a node on the newly inserted path p in the m-Trie, and the subtries rooted at the offspring of u have depth h. If we assume that the minimization constraints are not violated below u, the subtrie containing the newly inserted segment has only one path, which may have a similar path in one of its sibling subtries. If a similar path exists, both paths are merged and rerouted to the subtrie rooted in direction * to restore the minimization constraints. The merging and rerouting of the path implies that there is still only one path up to u which may have similarity with other paths in the subtries rooted at the offspring of u's parent. This shows that constraint violations are localized along the path p. Further, the constraint restoration procedure requires searching for the presence of an identical path in the sibling subtries, which takes O(h) steps. The procedure has to be applied to all the nodes on p, starting from the last nonleaf node up to the root in a bottom-up manner, to completely restore the minimization constraints. Therefore, the minimization constraints can be restored in O(H^2) steps.

B. Deletion

The deletion procedure is described in Algorithm 4. The deletion operation on the m-Trie consists of three major steps: searching, untraversal, and balancing. Due to path merging and rerouting during insertion, an inserted path may not be directly present in the m-Trie. For example, an edge in direction 0 or 1 may have been rerouted to direction * to restore the minimization constraints during insertion. This rerouting scheme merges the original path p into another path p' which is a cover of p. For example, the path below node u in Fig. 3 is rerouted at u during insertion and can afterwards be found only along the path in direction *. The first step in the deletion operation therefore involves searching for a covering path. Algorithm 3 gives a backtracking method to search for the covering path in the m-Trie. The algorithm first tries to find the given path in the original direction. If the path is not present in the original direction, it backtracks

Algorithm 4: Deletion algorithm.

Algorithm 5: Untraversal algorithm.

one edge and tries to search in the direction *. After all the paths diverging from the current node are searched, it backtracks to its parent and repeats the procedure until the covering path is found. The algorithm returns the leaf node of the covering path. The uniqueness and search complexity of the covering path are proved in Lemma 2. The second step involves untraversing the path to be deleted along the covering path. The untraverse operation, described in Algorithm 5, decrements the path counter C(v) of every visited node v by 1 and returns the leaf node. Untraverse deletes the path segment containing nodes with C(v) = 0. During insertion, path pairs of the form (p0, p1), (p0, p*), and (p1, p*) are merged into a path in direction *. The deletion of one of the merged paths breaks the similarity, causing an imbalance and indicating the need to reroute the remaining path in its original direction. The positions where the covering path p' differs from the path p to be deleted may have an imbalance. We can capture these positions by a bitmap. However, the bitmap indicates only potential imbalance positions, which may not exist in reality. For example, p and p' may differ at one position while the deletion of p requires no rerouting. The third step

consists of finding the imbalance positions and rerouting the unbalanced paths in their original directions. Note that if several contiguous imbalance points are encountered on the path, they are rerouted together in the original direction of the node farthest from the leaf. The following two lemmas discuss the complexity of the different steps involved in the deletion procedure.

Lemma 2: The covering path p' for a given path p is unique and can be found in O(H n/γ) steps in the worst case, where n is the number of paths inserted in the m-Trie and γ is the compaction factor.

Proof: Due to rerouting, we may have to explore the original direction (0 or 1) as well as the direction * at every level in order to find the exact path. Therefore, the cover search complexity S(H) in an m-Trie of height H satisfies S(H) <= 2 S(H - 1). In the worst case, we may have to perform a bidirectional search at each level; therefore, the search complexity for the covering path of p seems to be exponential in H. However, it should be noted that the number of actual paths present in the m-Trie is not exponential but is limited to n/γ. Further, it can be noted that the path will be found either in its original direction or in direction *, but not both, as the latter would violate constraints 3 and 4 given in Section III. Hence, there is a unique path, denoted by p', in the m-Trie covering the given path p.

Lemma 3: Let p be the path to be deleted and p' be a path present in the m-Trie such that p' covers p. The minimization constraints for an m-Trie are violated only along the path p'. The constraints can be restored in O(H^2) steps.

Proof: The deletion of the path p along the covering path p' may leave unbalanced path segments at the nodes on p' where p suggests a 0 or 1 direction. Rerouting a path segment in the m-Trie takes O(H) steps, and in the worst case we can encounter a node having an imbalance at each level on the path p'. Therefore, the complexity of restoring the balance after an untraverse operation is O(H^2).

V. CASE STUDY I: ROUTING TABLE COMPACTION IN TCAM

A. Background

The Internet is a packet-switched network consisting of a number of routers and hosts interconnected by communication links. Hosts reside at the boundary of the network, and host-to-host communication takes place via a number of routers using the Internet protocol (IP). The communication between two nodes is converted into a series of packets known as IP datagrams. Every datagram carries a 32-bit destination address to facilitate independent routing. The packet is forwarded to the node corresponding to the best-known path to the destination, which is called the next hop. The next hop is determined by an IP lookup operation performed on a routing table. The IP lookup operation searches for the most specific network hosting the specified destination in a list of variable-length network identifiers known as IP prefixes. IP prefixes can be represented using a ternary string

AHMAD AND MAHAPATRA: AN EFFICIENT APPROACH TO ON-CHIP LOGIC MINIMIZATION

1045

of the form b1 b2 ... bl * ... *, where the bi are the specified bits, * represents a don't care symbol, and l is the prefix length. Only the l most significant bits of the prefix are compared against the specified destination address to decide a match. Because the routing table contains several overlapping prefixes, the destination address can match multiple IP prefixes. The IP lookup operation searches the routing table to find the longest prefix matching the destination. The longest-match semantics requires pattern matching as well as length determination, which makes a practical implementation of the IP lookup operation harder, especially in high-end routers. The difficulty is attributed to the super-linear growth of the size of the routing table as well as the increasing gap between silicon and optical fiber speeds, with the latter growing at an exponential rate [13].

B. Issues in IP Lookup

A survey of software- and hardware-based methods for IP lookup can be found in [13] and [14]. The lookup algorithms designed for conventional memory to solve the longest prefix match problem require several memory accesses to retrieve the next hop. This can quickly become a bottleneck for high-speed backbone routers operating at gigabit speeds and with large routing tables containing more than 100 000 entries. Francis et al. investigate techniques for O(1) IP lookup using binary content addressable memory (BCAM) and TCAM [15]. BCAMs allow storage of 0 and 1 in each memory cell and can perform only a fixed-length match. Hence, multiple BCAMs are required to search variable-length prefixes in a single cycle. This can lead to significant underutilization of the available memory. TCAMs are similar to BCAMs but allow storage of the 0, 1, and * states. The * state is treated as a don't care and ignored during a matching operation. Thus, TCAMs allow the storage of variable-length prefixes in a single unit, achieving more economy. Also, TCAMs offer easier management and update of the routing tables.
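The ternary-match and longest-prefix-match semantics described above can be modeled in software as follows. This is purely illustrative: a real TCAM compares the address against all entries in parallel in a single cycle, whereas this sketch scans the table:

```python
def matches(entry, addr):
    """A ternary entry matches an address when every specified
    (non-'*') bit agrees with the corresponding address bit."""
    return all(e == '*' or e == a for e, a in zip(entry, addr))

def longest_prefix_match(table, addr):
    """Return the next hop of the most specific matching entry;
    `table` maps ternary prefix strings to next hops."""
    best_hop, best_len = None, -1
    for entry, nexthop in table.items():
        plen = len(entry) - entry.count('*')   # number of specified bits
        if matches(entry, addr) and plen > best_len:
            best_hop, best_len = nexthop, plen
    return best_hop
```

For example, with entries `10******`, `1010****`, and the default `********`, the address `10101100` matches all three, and the 4-bit prefix wins as the most specific.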
Despite these advantages, TCAM-based lookup solutions remained unpopular due to their high cost, low capacity, and poor performance. However, recent advances in manufacturing and interconnection technology allow fabrication of high-capacity, high-performance, and low-cost TCAM units, matching the requirements of today's backbone routers. For example, the latest TCAMs available in the market operate at 100 million searches per second and offer capacities of up to 16 MB [16].

C. Reducing Power Consumption in TCAM

TCAM-based fast lookup seems promising, but it is not without its disadvantages. Because power consumption is proportional to the number of TCAM entries enabled for searching, TCAMs consume considerable power under normal operating conditions. Research efforts to reduce TCAM power consumption can be divided into two categories. The first approach attempts to reduce power consumption by partitioning the entire TCAM memory into a set of TCAM pages, and then finding a suitable hashing algorithm to map each entry into a set of target pages [17]–[19]. During searching, only the target pages are enabled. This reduces power consumption by a ratio of k/N, where k and N are the average number of target pages and the total number of pages, respectively. The second approach reduces the power consumption by compacting routing table entries

Algorithm 6: m-Trie-based minimization algorithm.

using logic minimization techniques, as discussed in [1] and [2]. IP prefixes contain the * symbol only at the end, while in minimized IP prefixes * can occur at any position. Because TCAM allows storage of * at any bit position, routing semantics can be guaranteed even with a minimized routing table. Here, the reduction in power consumption depends on the compaction ratio achieved by the logic minimization technique applied. Logic minimization-based power reduction can be applied to existing TCAMs, while the partitioning-based approach requires hardware modification of the TCAM architecture to support paging.

D. Logic Minimization Using m-Trie

In order to apply logic minimization, a given IP routing table is first partitioned according to next hops into subtables. Each of these subtables is pruned to remove overlapping entries and further subdivided on the basis of prefix length. The logic minimization algorithm is applied to each of the resulting subtables. The m-Trie-based logic minimization is described in Algorithm 6. The algorithm takes a subtable and a memory budget as its inputs. The prefixes in the subtable are treated as routes of paths in the m-Trie. To minimize the subtable, an empty m-Trie is created and prefix paths are added to it one at a time. If the memory budget is reached before all the prefixes in the subtable have been consumed, the paths in the current m-Trie are enumerated as minimized prefixes and flushed out to the TCAM. The m-Trie is then deleted to reclaim the memory consumed by it, and the algorithm starts building a new m-Trie by inserting the remaining prefixes in the current subtable. The whole process is repeated until all the prefixes in the subtable are consumed by the algorithm. Note that the memory budget must be at least the minimum amount of memory needed to create a prefix path during insertion in the m-Trie.
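The budgeted flow of Algorithm 6 can be sketched as follows. The m-Trie itself is replaced here by a simple merge-on-insert set, and the memory budget is modeled as a route count; both are stand-ins for illustration only:

```python
def minimize_with_budget(prefixes, budget):
    """Sketch of Algorithm 6: insert prefixes into an in-memory
    structure; when the memory budget (modeled as a route count)
    is reached, flush the current minimized routes to the output
    (the TCAM) and start a fresh structure."""
    def insert(routes, p):
        # stand-in for m-Trie insertion: merge with a route differing
        # in exactly one specified bit, else add the route as-is
        for q in routes:
            diff = [i for i, (a, b) in enumerate(zip(p, q)) if a != b]
            if len(diff) == 1 and '*' not in (p[diff[0]], q[diff[0]]):
                routes.remove(q)
                return insert(routes, p[:diff[0]] + '*' + p[diff[0] + 1:])
        routes.add(p)
        return routes

    output, routes = [], set()
    for p in prefixes:
        if len(routes) >= budget:          # budget reached: flush
            output.extend(sorted(routes))
            routes = set()
        insert(routes, p)
    output.extend(sorted(routes))          # flush the final structure
    return output
```

With a generous budget the two example routes merge into one minimized entry; with a budget of a single route the flush fires between insertions, trading compaction for memory, which mirrors the graceful degradation discussed earlier.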
E. Experimental Results

To establish the suitability of m-Trie-based minimization, we evaluated its performance on the standard routing table traces used in [1] and [2], as well as two additional large routing


TABLE II COMPARISON OF PERFORMANCE—ESPRESSO-II VERSUS m-TRIE

TABLE III COMPARISON OF PERFORMANCE—ON-CHIP MINIMIZER VERSUS m-TRIE

Fig. 4. Compaction performance versus memory budget.

table traces from bbnplanet and attcanada, against two existing logic minimizers, Espresso-II and ROCM. All results were obtained on a CerfCube running embedded Linux on a 400-MHz Intel XScale processor [20]. The main results are summarized in Tables II and III. The first column gives the original and pruned routing table sizes for each router. Note that pruning alone can compact the routing tables by 20%–30%. The remaining columns give the data memory, execution time, and size of the minimized routing table for each algorithm. As we can see, the m-Trie-based minimization is the fastest among all the existing approaches, by a factor of 100 to 1000 for all the standard routing tables. Further, we found that m-Trie shows only a 0.2%–5.0% loss of compaction with a memory budget of 16 kB, which is 10 to 100 times lower than what is required by the other algorithms. The routing table compaction achieved by m-Trie with respect to different memory budgets is shown in Fig. 4. Note that the total routing table compaction achieved by m-Trie rapidly approaches 95% of peak compaction within a

memory budget of 16 kB and 99% within 200 kB. This is because each instance of path merging reduces the average memory required to insert a path; therefore, more paths can be inserted for a given memory budget. The m-Trie keeps inserting paths until the memory budget is reached, thereby giving more compaction opportunities to the regions where more compaction is possible. This greedy approach results in better compaction under a limited memory budget. Also note that the peak compaction achieved by m-Trie usually exceeds that of the other algorithms. This extra compaction is achieved due to the pruning property built into the m-Trie: it causes the m-Trie to greedily prune overlapping prefixes even during minimization, a function the other algorithms lack. The peak compaction for each router is given in Fig. 4.

We further observed that the performance of Espresso-II and ROCM depends heavily on the distribution of subtable sizes. For example, ROCM or Espresso-II minimization on the bbnplanet routing table takes less time than on the attcanada routing table even though bbnplanet has more routing table entries. This occurs because most of the attcanada routing entries are concentrated in a few partitions, while the bbnplanet routing table maintains a fairly uniform distribution. In contrast, we observed a linear dependence of m-Trie-based minimization time on routing table size. Thus, m-Trie shows a very predictable execution time even under worst case partitioning.

We also evaluated the update performance of the different algorithms on four randomly selected subtables from the attcanada and bbnplanet routing tables. The results for a single update in routing tables of different sizes are summarized in Table VI. We found that an m-Trie-based update requires about 40 μs in our case study and outperforms the Espresso-II and ROCM update methods by a factor of 1000 to 10 000. The performance remains fairly independent of the routing table size. The speed advantage, however, comes at the price of 17 MB of additional data memory for the attcanada testbed. This data memory is required to maintain the m-Trie by preventing garbage collection. We also evaluated the performance of a memory-efficient update scheme on the m-Trie that reminimizes the updated subtables. This update scheme can be supported by as little as 16 kB of memory while achieving update rates comparable to ROCM at the better end. We also compared the code complexity in terms of C code lines for each of the algorithms. The code complexity for m-Trie is the lowest at 300 lines of C code, as compared to 1800 and 8000 lines for ROCM and Espresso-II (fast or single-expand), respectively.

VI. CASE STUDY II: NETWORK ACL COMPACTION

A. Background

Another important problem in Internet routers is high-speed packet filtering to control the quality of service and traffic flow, prevent routing updates, and implement security measures. Here, each incoming packet undergoes a classification and filtering decision before it is forwarded to its next hop. The classification and filtering is done according to an ordered set of rules known as the ACL. Each entry in the ACL contains a pattern specification and an action. The pattern is specified by a combination of packet fields, such as the higher layer protocol transported, packet type, source and destination addresses, and their corresponding ports. The possible actions include permit and deny forwarding of the packets based on the result of pattern matching. The problem of pattern matching on multiple fields is a multidimensional generalization of the IP lookup problem discussed in Section V.

B. Issues

In modern routers, packet classification accounts for most of the computation time in the data path. To relieve this bottleneck, a number of hardware- and software-based techniques for packet filtering are suggested in [12] and [13]. The software-based solutions utilizing conventional memory tend to be slower and unable to meet the expected line rate performance. Also, some of the deeper classifications (involving several pattern fields) may require multiple searches, multiplying the worst case performance requirement several times. For example, an IP router operating at 10 Gb/s may have to process up to 32 million packets per second [21]. This requires the router to support a four-search-deep classification to deliver a throughput equivalent to a 40-Gb/s operational speed. Hence, packet classification in high-speed routers is offloaded to specialized hardware called the IP classification coprocessor. The classification coprocessor can be designed either with a custom design approach or with commercial off-the-shelf (COTS) TCAMs. The latter is preferred due to its cost and flexibility advantages. However, TCAMs may cause the overall power budget to increase significantly. Also, security requirements and intelligent intrusion detection systems may cause ACL databases to grow severalfold in the future, thereby increasing power requirements. This affects the board packaging, power budget, and cooling requirements of the classification coprocessor. The partitioning schemes proposed for reducing TCAM power in IP lookup applications are not suitable for classification engines due to their large memory overhead. Hence, efficient techniques to compact ACL databases are required to achieve space economy at reasonable cost. The compaction of ACL databases was first studied in [12] and revisited by [2]. Reference [2] discusses an efficient online minimization technique based on ROCM; however, it may require considerable initial minimization time, reducing router availability or compromising security (please refer to [22]). Reference [12] proposes ACL compaction using the -merge algorithm. The -merge is faster than ROCM, but requires more code and data memory and may not be suitable for on-chip minimization.

C. ACL Minimization Using m-Trie

The ACL entries commonly found in routers can be mapped to a tuple of pattern fields. The protocol field denotes the type of the IP packet under consideration. The source and destination address fields denote ranges of IP addresses and have the same form as the IP prefixes discussed in Section V. The port field is mapped to a list of protocol ports for which the rule is applicable. The ports can be specified by applying layer 4 operators on one or two port operands. The operators < and > represent an upper and a lower bound on the ports, respectively. The operator ":" is a range operator and represents a lower as well as an upper bound. The operator "*" acts as a wildcard and represents all possible ports. A single operand with no explicit operator acts as an identity and represents a single port. To illustrate the use of these operators, consider the example >1024, which specifies all the ports from 1025 to 65535. Similarly, <1024 specifies all the ports in the range 0 to 1023, and 1000:1033 specifies all the ports in the range 1000 to 1033. In order to use TCAM for ACL classification, the tuple should be transformed into entries suitable for storing in TCAM. This transformation causes an explosion in the number of entries to be stored in TCAM due to the difficulty of implementing port ranges with ternary patterns.
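The explosion can be seen concretely by expanding a port range into the ternary prefixes a TCAM can store. The sketch below assumes a 16-bit port space and the operator spellings described above; `operator_to_range` and `range_to_prefixes` are illustrative names, not part of the paper's implementation.

```python
# Sketch of port-operator semantics and range-to-prefix expansion.
# Each aligned block of ports [lo, lo + 2^k - 1] becomes one ternary
# prefix: fixed high-order bits followed by don't cares ('X').

PORT_BITS = 16
PORT_MAX = (1 << PORT_BITS) - 1

def operator_to_range(expr: str):
    """Map a layer-4 port operator expression to an inclusive [lo, hi]."""
    if expr == '*':                       # wildcard: all ports
        return 0, PORT_MAX
    if expr.startswith('>'):              # lower bound (exclusive)
        return int(expr[1:]) + 1, PORT_MAX
    if expr.startswith('<'):              # upper bound (exclusive)
        return 0, int(expr[1:]) - 1
    if ':' in expr:                       # explicit range
        lo, hi = expr.split(':')
        return int(lo), int(hi)
    return int(expr), int(expr)           # identity: a single port

def range_to_prefixes(lo: int, hi: int):
    """Greedy cover of [lo, hi] by maximal aligned power-of-two blocks."""
    prefixes = []
    while lo <= hi:
        size = 1
        # grow the block while it stays aligned and inside the range
        while lo % (size * 2) == 0 and lo + size * 2 - 1 <= hi:
            size *= 2
        fixed = PORT_BITS - size.bit_length() + 1   # number of fixed bits
        word = format(lo, f'0{PORT_BITS}b')
        prefixes.append(word[:fixed] + 'X' * (PORT_BITS - fixed))
        lo += size
    return prefixes

lo, hi = operator_to_range('>1024')        # ports 1025-65535
print(len(range_to_prefixes(lo, hi)))      # -> 15 ternary entries
```

Even this minimal prefix cover turns one range operand into 15 TCAM entries; crossing several range fields in one rule multiplies the counts, which is the explosion the text describes.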


For example, one such rule gets exploded into 64 513 entries using range expansion, or to the following seven entries:

TABLE IV UPDATE PERFORMANCE

TABLE V COMPARISON OF PERFORMANCE—ACL COMPACTION AND EXECUTION TIME

using controlled expansion. The controlled expansion requires a precomputed transformation pattern for frequently occurring ranges. However, there is no efficient way to control the expansion of 721:534. In practice, random ranges are seen to occur very infrequently; therefore, some router vendors provide a limited set of registers to prevent ACL explosion [23] (although this defeats the cost advantage of using COTS classification devices). As networking applications evolve, security requirements may render these solutions inadequate.

Because the exploded ACL exhibits a lot of similarity, logic minimization can be applied effectively to minimize the ACL. In addition, different ACL entries may also have a lot of similarity that can be exploited to minimize the ACL further. In order to minimize the ACL, we first create subtables based on the action field. Here, all the ACL entries with contiguously similar actions are grouped together; an action pattern with five contiguous runs of permit/deny actions, for example, leads to five corresponding subtables. The minimization algorithm is applied to each of these subtables separately to compact the ACL. The minimized ACL entries are stored in TCAM in the same order as in the original list.

D. Experimental Results

To assess the suitability of m-Trie-based minimization, we evaluated its performance on real-life ACL traces provided by a large router company. We collected the results for the m-Trie, ROCM, and Espresso-II algorithms on a PC platform with a 3-GHz Intel Pentium-IV processor. The ACL traces used in this study are the same as those used in [2]; however, we take a different approach when dealing with port ranges. In [2], the ACL entries are transformed using controlled expansion as explained previously, resorting to two-entry expansion for random port-range patterns. This scheme implicitly assumes the availability of the register bank in the router. In our approach, we explode all the ACL entries using range expansion without assuming any dependence on the availability of the register bank. However, this results in a huge table that was found to be unsuitable for logic minimization using either ROCM or Espresso-II. Therefore, we adopted another approach: we exploded an ACL entry, immediately followed with logic minimization using the -merge algorithm, and then collected the minimized entries. After processing all the ACL entries, the resulting minimized table is reminimized using Espresso-II and ROCM to achieve further compaction. It is worth noting that we

apply m-Trie on the exploded table itself, avoiding the two-pass approach, which yields better compaction.

The main results for ACL compaction are summarized in Table V. The first column gives the original and exploded table sizes. The remaining columns give the size of the compacted table and the execution time for each of the algorithms. We classified the ACL traces into three groups, typical, bad, and long, to represent their usage and configuration type. We have not included the time taken by -merge to compact the tables in the first pass. From Table V, it can be seen that Espresso-II consistently gives the best minimization results. However, Espresso-II is not suitable for larger tables due to its prohibitive execution time. ROCM performs well for smaller tables; however, it loses significant compaction opportunities for larger tables and also incurs significant execution overhead. m-Trie consistently provides better compaction performance for all the traces at an acceptable execution overhead. It can also be noted that m-Trie tends to achieve more compaction for larger tables while requiring very limited computing resources. m-Trie took about 2 s on average to minimize the exploded ACL table, while the other algorithms took considerably longer, forcing us to switch to the two-pass approach. In the second pass, ROCM took about 500 s to minimize, while Espresso-II took about 1500 s.

VII. CONCLUSION AND FUTURE WORK

The m-Trie-based logic minimization approach offers memory-efficient and fast logic minimization and incremental updates suitable for high-frequency logic minimization services on resource constrained embedded devices. The m-Trie is also suitable for implementing a chain-ancestor ordering scheme for TCAM updates as suggested in [24]. The m-Trie-based logic minimization compares well with ROCM and Espresso-II on the benchmarks. The m-Trie also exhibits excellent performance for compacting large access control lists.
Its further usage and applicability to other on-chip logic minimization problems remain to be investigated.

APPENDIX
COVERING PATH LOOKUP IN TCAM

As we have noted in Section III, the most expensive step in the deletion procedure is finding the covering path in the m-Trie. After minimization, each path in the m-Trie is mapped into a


Fig. 5. (a) NOR-based TCAM cell. (b) TCAM architecture with modified search register.

TABLE VI ENCODING OF TERNARY SYMBOLS

TABLE VII MATCH BEHAVIOR FOR TERNARY SYMBOLS

TABLE VIII SAMPLE ROUTING TABLE

minimized prefix and stored in TCAM. Hence, the covering path can be searched in O(1) time in TCAM with a suitable enhancement of the TCAM architecture. This enables the deletion algorithm to perform on par with the insertion algorithm. The following describes the standard TCAM architecture and the enhancements for searching the covering path.

A NOR-based TCAM cell is shown in Fig. 5(a). It uses two SRAM-based binary storage cells to store the states 0, 1, and X, based on the encoding scheme given in Table VI. It has four transistor switches T1–T4 to assist comparison. These transistor switches prevent the matchline from being shorted to ground when a match occurs. For example, a 0 state in the TCAM cell turns off the storage transistor in one pull-down path. A search for 0 applied on the search lines turns off the search transistor in the other pull-down path, blocking the matchline from being shorted to ground and implying a match. A search for 1, however, turns on that search transistor, creating a path to ground through the storage transistor that is on. On the other hand, an X state turns off both storage transistors, blocking all paths to ground and thus matching all search keys applied to the TCAM cell.

A simplified TCAM architecture, adapted from [25], is shown in Fig. 5(b). An array of TCAM cells is arranged to form a TCAM word. In order to perform word comparison, all cells belonging to a single word share a common matchline. Since the data being searched can match multiple words in TCAM due to variable length matching, all the matchlines are connected to a priority encoder. The priority encoder selects the word at the lowest address among all the matched words. To initiate a search in TCAM, the matchline is charged high. The data to be searched is stored in the search register. Because the search data is fed to a large number of cells, the TCAM provides drivers to handle the capacitive load contributed by each cell. TCAM words that do not match the search data cause the matchline to be discharged, which is detected with the aid of a sense amplifier.

In order to perform covering path lookup in TCAM, the search register is enhanced to store 0, 1, and X. This can be easily accomplished by using 2 bits to represent each symbol, using the encoding scheme given in Table VI. The corresponding hardware overhead is quite small and does not impact the TCAM cells or interconnections. The opposite encoding of the don't care symbol in the TCAM and in the search register results in the match behavior shown in Table VII. For example, consider the TCAM cell shown in Fig. 5(a). If the X symbol is stored in the search register, the encoding scheme implies that it turns on both search transistors. If a "0" or "1" is stored in the cell, this causes a mismatch, since one of the storage transistors is on, creating a discharge path to ground. However, if the don't care symbol is stored in the TCAM, it matches 0, 1, and X in the search register, because both storage transistors are off, blocking all paths to ground. This differential match behavior for the X symbol gives TCAMs the desired combined IP lookup and covering path lookup capability. For example, a search for IP addresses prefixed with "100000" will match all the TCAM entries in Table VIII and return 1000*, which is the longest prefix among all the matched prefixes. However, a search for path routes of the form 100* will match 10* and 100*, and return 100*, which is the longest matching route in this case.
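The asymmetric match rule of Table VII can be captured in a small software model: a stored X matches any searched symbol, while a searched X matches only a stored X. The entries below are hypothetical 6-bit stand-ins in the spirit of Table VIII, and the priority encoder is approximated by picking the match with the most specified bits.

```python
# Software model of the enhanced TCAM match semantics (Tables VI-VIII).
# Ternary words are strings over {'0', '1', 'X'}.

def symbol_match(stored: str, searched: str) -> bool:
    """Per-cell match behavior: stored X matches anything,
    searched X matches only stored X (Table VII)."""
    return stored == 'X' or stored == searched

def word_match(stored_word: str, search_key: str) -> bool:
    return all(symbol_match(s, k) for s, k in zip(stored_word, search_key))

def tcam_search(entries, search_key):
    """Return the matching entry with the most specified (non-X) bits,
    mimicking the priority encoder with longest prefixes at low addresses."""
    matches = [e for e in entries if word_match(e, search_key)]
    if not matches:
        return None
    return max(matches, key=lambda e: sum(c != 'X' for c in e))

# Hypothetical stored prefixes (6-bit keys).
entries = ["1XXXXX", "10XXXX", "100XXX", "1000XX"]

# IP lookup: a fully specified key matches every covering prefix,
# and the longest one wins.
print(tcam_search(entries, "100000"))   # -> "1000XX"

# Covering path lookup: searching with the ternary path itself matches
# only entries at least as general, returning the tightest cover.
print(tcam_search(entries, "100XXX"))   # -> "100XXX"
```

Note how the same search operation serves both uses: a binary key performs longest-prefix IP lookup, while a ternary key finds its covering path, exactly the dual behavior the enhanced search register provides.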


REFERENCES

[1] H. Liu, "Routing table compaction in ternary-CAM," IEEE Micro, vol. 22, no. 1, pp. 58–64, Jan./Feb. 2002.
[2] R. Lysecky and F. Vahid, "On-chip logic minimization," in Proc. 40th Conf. Des. Autom., 2003, pp. 334–337.
[3] G. Stitt, R. Lysecky, and F. Vahid, "Dynamic hardware/software partitioning: A first approach," in Proc. 40th Conf. Des. Autom., 2003, pp. 250–255.
[4] R. Brayton et al., Logic Minimization Algorithms for VLSI Synthesis. Boston, MA: Kluwer, 1984.
[5] S. Ahmad and R. Mahapatra, "m-Trie: A fast and efficient approach to on-chip logic minimization," in Proc. Int. Conf. Comput.-Aided Des., 2004, pp. 428–435.
[6] W. V. Quine, "The problem of simplifying truth functions," Amer. Math. Monthly, vol. 59, no. 8, pp. 521–531, 1952.
[7] W. V. Quine, "A way to simplify truth functions," Amer. Math. Monthly, vol. 62, no. 9, pp. 627–631, 1955.
[8] E. J. McCluskey, "Minimization of Boolean functions," Bell Syst. Tech. J., vol. 35, no. 6, pp. 1417–1444, Nov. 1956.
[9] S. Kang and W. M. van Cleemput, "Automatic PLA synthesis from a DDL-P description," in Proc. 18th Conf. Des. Autom., 1981, pp. 391–397.
[10] D. W. Brown, "A state-machine synthesizer—SMS," in Proc. 18th Conf. Des. Autom., 1981, pp. 301–305.
[11] S. J. Hong, R. G. Cain, and D. L. Ostapko, "Mini: A heuristic approach for logic minimization," IBM J. Res. Develop., vol. 18, no. 5, pp. 443–458, Sep. 1974.
[12] D. Alessandri, "Access control list processing in hardware," M.S. thesis, Elect. Eng. Dept., Eidgenössische Technische Hochschule, Zürich, Switzerland, 1997.
[13] P. Gupta, "Algorithms for routing lookups and packet classification," Ph.D. dissertation, Dept. Comput. Sci., Stanford University, Stanford, CA, 2000.
[14] M. Ruiz-Sánchez, E. W. Biersack, and W. Dabbous, "Survey and taxonomy of IP address lookup algorithms," IEEE Netw., vol. 15, no. 2, pp. 8–23, Mar./Apr. 2001.
[15] A. J. McAuley and P. Francis, "Fast routing table lookup using CAMs," in Proc. INFOCOM (3), 1993, pp. 1382–1391.
[16] SiberCore Technologies, Ottawa, ON, Canada, "Ultra 2M Family," SCT2000CB3, 2004. [Online]. Available: www.sibercore.com
[17] F. Zane, G. Narlikar, and A. Basu, "CoolCAMs: Power-efficient TCAMs for forwarding engines," in Proc. INFOCOM, 2003, pp. 42–52.
[18] R. Panigrahi and S. Sharma, "Reducing TCAM power consumption and increasing throughput," in Proc. HOT Interconnects, 2002.
[19] V. C. Ravikumar, R. N. Mahapatra, and L. N. Bhuyan, "EaseCAM: An energy and storage efficient TCAM-based router architecture for IP lookup," IEEE Trans. Comput., vol. 54, no. 5, pp. 521–533, May 2005.
[20] Intrinsyc, Vancouver, BC, Canada, "CerfCube 255 with embedded Linux." [Online]. Available: http://www.intrinsyc.com/products/cerfcube
[21] Integrated Device Technology, Inc., San Jose, CA, "Taking packet processing to the next level," 2002.
[22] Integrated Device Technology, Inc., San Jose, CA, "Intrusion prevention without security holes," 2004.
[23] Cisco Systems Inc., San Jose, CA, "Understanding ACL merge algorithms and ACL hardware resources on Cisco Catalyst 6500 switches," 2003.
[24] D. Shah and P. Gupta, "Fast incremental updates on ternary-CAMs for routing lookups and packet classification," presented at the HOT Interconnects (8), Stanford, CA, 2000.
[25] K. J. Schultz, "CAM-based circuits for ATM switching networks," Ph.D. dissertation, Dept. Elect. Comput. Eng., Univ. Toronto, Toronto, ON, Canada, 1996.

Seraj Ahmad received the B.Tech. degree from Indian Institute of Technology Guwahati, Guwahati, India, and the M.S. degree in computer engineering from Texas A&M University, College Station, in 2005. He is currently with the Advanced Microlithography Division, Magma Design Automation, Inc., San Jose, CA. His research interests include microlithography, design for manufacturing, embedded systems, wireless networking, logic synthesis, and VLSI algorithms.

Rabi N. Mahapatra is an Associate Professor with the Department of Computer Science, Texas A&M University, College Station. Since 2001, he has been directing the Embedded System Research Group, Texas A&M University. His research interests include the areas of embedded systems, low-power design, SoC and VLSI design, and computer architecture. He has published more than 95 research papers in refereed international journals and conferences. His research has been funded by the National Science Foundation, DoT, NASA, and other major industries. Dr. Mahapatra is a Ford Fellow and a Distinguished Visitor of the IEEE Computer Society.