Reducing Processing Latency in Network Packet Filters

Vic Grout, John McGinn and John Davies
Centre for Applied Internet Research (CAIR), University of Wales NEWI, UK
e-mail: {v.grout|j.mcginn|j.n.davies}@newi.ac.uk
Abstract
The efficiency of packet filters is discussed: specifically, the effect of the order of rules in Access Control Lists (ACLs) on processing time. The problem is introduced and formulated, and various approaches are considered. A current optimisation technique is described and its potential difficulties noted. A simpler solution, more appropriate for real-time implementation on production routers for example, is then developed and its implementation discussed.
Keywords
Packet filters, Access Control Lists (ACLs), Rule dependencies, Heuristics, Processing time, Real-time optimisation
1. Introduction to the Problem, Objectives and Related Work

The essential tool in the implementation of all forms of packet-based traffic policy is the Access Control List (ACL), known informally as a packet filter. The simplest use of ACLs is in the blocking of certain types of traffic to or from all or part of a network. However, these filters also offer a means of identifying and selecting packets for a variety of other traffic management purposes, some fairly sophisticated, including:

• Network Address Translation/Port Address Translation (NAT/PAT)
• Bandwidth allocation, queuing strategy implementation and traffic shaping
• Authentication, access control, encryption and other security policies
As an example, a network administrator configuring Border Gateway Protocol (BGP) on a router may use ACLs to select traffic for route maps, prefix lists, communities or other BGP attributes. The use of ACLs is widespread in advanced routing. Consequently, along with looking up a packet's (probably Internet Protocol, IP) address in its routing table, a router will spend the larger part of its processing time for each packet in matching it against various ACLs for different policies. Inefficiently structured ACLs can thus lead to increased packet latency and have a negative effect on the router's, and in turn the network's, performance.

1.1. ACL Structure and Rules

An ACL is an ordered sequence of rules, applied through a router's operating system, to select packets with certain characteristics for some policy. A rule may apply criteria such as: source or destination address (which could be a single host or a range, network or subnet), use of a particular protocol or protocols (e.g. IP, TCP or UDP and those using them at higher OSI layers), or certain specific characteristics (such as the acknowledgement bit being set in the packet header). Addressing may be classful or classless (Malhotra, 2002). An example, using a generic format for the purposes of this paper, would be:

    include all packets from address range A to address range B using a protocol in P with characteristics C

which, for convenience, may be shortened syntactically to

    include all from A to B using P with C

Each of A, B, P and C may then be a single value or a set of possible values. A packet will match the rule if it satisfies all parameters. The alternative is the exclude operator, as in:

    exclude all using P
which matches all packets (from anywhere to anywhere) using a protocol in P (with any characteristics). There can be anything between two and several thousand rules in a working ACL (Stoica, 2001). On a Cisco 2600 router, a single rule takes approximately 0.5 to 1.5µs to process against each packet (Grout & McGinn, 2005); a small figure in itself but one that grows quickly when multiplied by the number of rules in a list, the number of lists on a router and a semi-continuous packet stream passing through a sequence of routers. Raw processing time may be shortened with newer router models such as the CRS-1 (Cisco, 2005) but the relative significance of reducing cumulative delay remains as important as ever.

1.2. Rules and Policies

A particular traffic policy or traffic selection process is achieved by a sequence of rules in an ACL. Every incoming packet is tested against each rule in top-to-bottom order until a match is found. The first match determines the fate (included or excluded) of the packet and no further rules are tested. (There is an implicit exclude all rule at the end of each list to reject any unmatched packet.) Thus, the order of rules in an ACL is critical. Consider the following rule pair, for example:

    include all from A
    exclude all to B

A packet from A to B (i.e. satisfying both rules) will be accepted by the first rule and the second will be redundant. However, if the order of the rules is reversed, the packet will be rejected. Define two (or more) rules to be dependent if it is possible that a single packet could match both rules. Although it may be possible for a number of different orderings of the same rules to achieve the same policy, it is not acceptable to reorder dependent rules if the behaviour of the ACL is to remain the same. Thus, only independent rules may be reordered.

1.3. ACL Efficiency

The following formulation of ACL efficiency is simplified from Grout & McGinn (2005).
An ACL L is a sequence of n rules [r1, r2, ..., rn]. Each rule ri has two key characteristics:

• Its hit-rate, h(ri): some rules will be more likely to match packets than others, for example by having larger address or protocol ranges, etc., or by being more aligned with current traffic flow. If necessary, hit-rates can be calculated through a count maintained for each rule.
• Its latency, λ(ri): rules will take different amounts of time to process, through having to test more fields in each packet for example. Latencies depend largely on the type of rule being applied and can be taken from look-up tables.
Ideally, rules with high hit-rates would be placed towards the top of a list (to minimise the number of rules examined before a match is found) but rules with large latencies towards the bottom (to minimise the time taken to process each unmatched rule). However, of course, these ideals may be in conflict. (Where should a rule with a high hit-rate and latency be placed?) The cumulative latency, κi, of a rule ri (at position i) in a list is the time taken to execute ri and all rules before it. So

    \kappa_i = \sum_{j=1}^{i} \lambda(r_j)                                                   (1)

and the expected latency (mean latency), E(L), of the list is then

    E(L) = \sum_{i=1}^{n} h(r_i)\,\kappa_i = \sum_{i=1}^{n} h(r_i) \sum_{j=1}^{i} \lambda(r_j).          (2)
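As an informal illustration of equations (1) and (2), and of the first-match semantics described in Section 1.2, the following Python sketch models a generic ACL. The rule representation, field names and all numeric values are invented for the example and are not taken from the paper.

    # Illustrative sketch only: a generic ACL model with per-rule hit-rates and
    # latencies, used to evaluate E(L) as in equations (1) and (2).
    # All rule definitions and numbers below are invented for the example.
    from dataclasses import dataclass

    @dataclass
    class Rule:
        action: str       # "include" or "exclude"
        match: callable   # predicate over a packet (dict) -> bool
        hit_rate: float   # h(r_i), relative likelihood of matching
        latency: float    # lambda(r_i), processing time in microseconds

    def first_match(acl, packet):
        """Return the action of the first matching rule (implicit exclude at the end)."""
        for rule in acl:
            if rule.match(packet):
                return rule.action
        return "exclude"

    def expected_latency(acl):
        """E(L) = sum_i h(r_i) * kappa_i, with kappa_i the cumulative latency (eqs. 1, 2)."""
        e, kappa = 0.0, 0.0
        for rule in acl:
            kappa += rule.latency          # kappa_i = sum_{j<=i} lambda(r_j)
            e += rule.hit_rate * kappa
        return e

    # Example ACL: 'include all from A', 'exclude all to B' (invented parameters).
    acl = [
        Rule("include", lambda p: p["src"] == "A", hit_rate=0.3, latency=1.0),
        Rule("exclude", lambda p: p["dst"] == "B", hit_rate=0.2, latency=1.5),
    ]
    print(first_match(acl, {"src": "A", "dst": "B"}))  # 'include' - the first match wins
    print(expected_latency(acl))                       # 0.3*1.0 + 0.2*(1.0+1.5) = 0.8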
Maximising or improving the efficiency of L means minimising or reducing E(L). Rearranging the order of the rules ri in L may reduce E(L), and there will be an optimal ordering which minimises E(L). However, all optimisation must be constrained by the need to preserve the order of dependent rules, as discussed in the previous subsection. The problem in this form is NP-Complete (Grout & McGinn, 2005), although it becomes computationally easy if either individual rule latencies or rule dependencies are ignored.

1.4. Work to Date

The issue of efficiency in packet filters was first addressed in this context by Stoica (2001). Shih & Qian (2002) discuss the crucial question of how to identify rule dependencies in ACLs, although the subject is first considered in any form in Hari et al. (2000). The first attempt at optimisation comes from Cisco (2003) but this work ignores individual rule latencies: all rules are assumed to take the same time to process. Bukhatwa & Patel (2003) show the value of ACL optimisation but ignore both differences in rule latencies and rule dependencies. Bukhatwa (2004) gives a simplified method for ordering a list efficiently, based on the classification of rules by latency, but still fails to consider rule dependencies; in this approach, rules are permitted to migrate freely within the list. Al-Shaer & Hamed (2004) give an improved treatment of the problem, but only for the purpose of discovering rule anomalies.
Grout & McGinn (2005) consider the complete problem and compare alternative solution methods. This work is presented in overview in the next section. Further improved and simplified techniques are proposed and discussed in Section 3.
2. Optimisation Principles

The problem of minimising the expected latency of an ACL, subject to dependency constraints, can be shown (Grout & McGinn, 2005) to be equivalent to an established NP-Complete problem, "Sequencing to Minimize Weighted Completion Time" (Garey & Johnson, 1979). Consequently, without a significant breakthrough in computability theory and practice (see Papadimitriou, 1994, for a full discussion), no algorithm running in reasonable time can be expected to find an optimal solution to the problem for other than small instances. For larger problems, approximation methods, in the form of heuristics, will be necessary to find acceptable results in acceptable time. However, for small or moderately sized ACLs, exact methods provide benchmarks against which to test heuristics in simulation exercises.

2.1. A Local Search Algorithm

The need to implement an optimisation routine in real time, within the tight (space and time) restrictions of (say) a production router, will not tolerate complex heuristics. There will be no value in an optimisation process that takes so long to run that it uses up too much router processing time and actually adds to network latency. Often the simplest, and sometimes the most effective and efficient, solution in such cases is a greedy or greedy-type algorithm (Aarts & Lenstra, 2003). Starting with an ACL, L = [r1, r2, ..., rn], the following will look for improvements in expected latency, and implement the greatest at each iteration, until no further such gains are to be found.

    repeat
        ∆max := 0
        for i := 1 to n-2 do                        \ For all rule pairs
            for j := i+1 to n-1 do
                if dij = 0 then                     \ If not dependent
                begin
                    ∆ := E(L) - E(L');              \ Consider swapping              (3)
                    if ∆ > ∆max then
                    begin
                        ∆max := ∆; i* := i; j* := j \ Record largest gain
                    end
                end;
        if ∆max > 0 then swap(ri*, rj*)             \ Implement best swap
    until ∆max = 0                                  \ Until no improvement

(The nth rule is an implicit exclude all, having a dependency with all others.) In this algorithm, known generically as 2-Opt or 2-Swap, D = (dij) is the dependency matrix (dij = 1 if there is a dependency between rules ri and rj; dij = 0 otherwise) and L' is the list L with rules ri and rj swapped. swap(ri*, rj*) has the natural interpretation.
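The 2-Opt procedure above can be prototyped offline. The following Python sketch is an illustration, not the authors' implementation: the rule representation, the dependency test and the recomputation of E(L) for every candidate swap are assumptions made for clarity, and the implicit final exclude all rule is not modelled.

    # Illustrative 2-Opt sketch for ACL reordering (not the authors' code).
    # rules: list of (hit_rate, latency) tuples; dep[a][b] = 1 if rules a and b
    # (by original index) are dependent and may not be reordered, 0 otherwise.

    def expected_latency(rules):
        """E(L) = sum_i h(r_i) * kappa_i  (equation 2)."""
        e, kappa = 0.0, 0.0
        for h, lam in rules:
            kappa += lam
            e += h * kappa
        return e

    def two_opt(rules, dep):
        """Repeatedly apply the best independent pairwise swap until no gain remains.
        'order' tracks original rule indices so dep[] is consulted by rule identity."""
        current = list(rules)
        order = list(range(len(rules)))
        while True:
            best_gain, best = 0.0, None
            base = expected_latency(current)
            for i in range(len(current) - 1):
                for j in range(i + 1, len(current)):
                    if dep[order[i]][order[j]]:        # dependent rules: skip
                        continue
                    trial = list(current)
                    trial[i], trial[j] = trial[j], trial[i]
                    gain = base - expected_latency(trial)
                    if gain > best_gain:               # record the largest gain
                        best_gain, best = gain, (i, j)
            if best is None:                           # no improving swap remains
                return current
            i, j = best                                # implement the best swap
            current[i], current[j] = current[j], current[i]
            order[i], order[j] = order[j], order[i]

    # Tiny example (invented numbers): three mutually independent rules.
    rules = [(0.1, 2.0), (0.6, 1.0), (0.3, 1.0)]
    dep = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
    print(expected_latency(rules), expected_latency(two_opt(rules, dep)))  # 3.2 -> 1.6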
2.2. Local Search Results

Applying 2-Opt to ACLs of different lengths gives results summarised in Table 1 (Grout & McGinn, 2005). 2-Opt achieves results on average within a few percent of optimal cost. It is also shown that using a substantially more sophisticated algorithm (based upon that of Lin & Kernighan, 1973 - 'L-K' in Table 1) achieves only marginally better results. However, this is at the expense of substantially longer run times (excluded for brevity).

    n (number     Improvement in E(L) over starting list (%)
    of rules)     Optimal solution     2-Opt     L-K
    10            14.28                13.41     13.48
    25            15.88                12.54     12.78
    50            18.60                12.70     13.80
    100           23.20                14.80     18.00

    Table 1. Performance of 2-Opt.

2-Opt (and L-K-Opt) are examples of local search algorithms (Aarts & Lenstra, 2003): small perturbations are applied to an initial solution to identify improved orderings. The natural initial solution is the ACL as entered by the network administrator. Whilst 2-Opt clearly performs well, two concerns remain with its operation on a production router:

• Although reasonably efficient (particularly with respect to exact optimisation and complex heuristics), 2-Opt is still iterative in nature (an indefinite number of O(n²) steps with all rule pairs being considered as potential swaps) and likely to add to the latency of processing packets. This is compounded by the need to recalculate the expected latency (Equation 2) for each prospective swap.
• Network traffic varies continuously over time. Packet characteristics, and consequently rule hit-rates, will vary also. It is unclear, in such an environment, how this can be monitored or even how frequently ACL optimisation should be performed.
Analysis of these difficulties, and particularly the relationship between them, leads to the proposals in the following section.
3. A Simplified Real-Time Approach

A technique is needed to monitor changing hit-rates and reduce the complexity of 2-Opt. Considering the significance of hit-rates carefully, the following observations may be made:

• If hit-rates are relative (e.g. probabilities) then they may be normalised. That is, if H = \sum_{i=1}^{n} h(r_i) then, intuitively, H = 1. However, this need not be the case. In practice, 2-Opt only uses relative values of h(ri) to compute ∆ := E(L) - E(L'), so that independent changes to h(ri) (say, for some i), making H > 1, will still permit the algorithm to function correctly.
• If this is the case then the processing of each packet against a list can be taken as only having an effect on (i.e. increasing) the hit-rate of the rule that it matches. This will have a relative, not absolute, effect on the hit-rates of the other rules.
• If this then is the case, and the reordering of L is to be attempted following the processing of a single packet matching rule ri, then it is only necessary to consider the relocation of ri and any rules displaced (e.g. by swapping) by it.
• If, following the processing of a packet matching rule ri, ri is to be moved in L, then it can only move up in L, and the most probable rule with which to swap will be ri-1.
These observations lead to a simplified, real-time algorithm detailed in the next subsection.

3.1. A Compound Algorithm

Define the trade-off coefficient, Ti, to be the decrease in expected latency from swapping rules ri and ri-1. Then

    T_i = \left[ \sum_{k=1}^{i-2} h(r_k)\kappa_k + h(r_{i-1})\kappa_{i-1} + h(r_i)\kappa_i + \sum_{k=i+1}^{n} h(r_k)\kappa_k \right]
          - \left[ \sum_{k=1}^{i-2} h(r_k)\kappa_k + h(r_{i-1})\kappa_i + h(r_i)\kappa_{i-1} + \sum_{k=i+1}^{n} h(r_k)\kappa_k \right]        (4)
        = h(r_{i-1})\lambda(r_{i-1}) + h(r_i)\left[\lambda(r_{i-1}) + \lambda(r_i)\right] - h(r_i)\lambda(r_i) - h(r_{i-1})\left[\lambda(r_{i-1}) + \lambda(r_i)\right]
        = h(r_i)\lambda(r_{i-1}) - h(r_{i-1})\lambda(r_i)

where the κ terms in the second bracket denote cumulative latencies in the swapped list.
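To illustrate (with invented numbers): if h(ri-1) = 0.1 and h(ri) = 0.3, and both rules have latency 1.0µs, then Ti = 0.3 × 1.0 - 0.1 × 1.0 = 0.2 > 0, so promoting ri above ri-1 reduces the expected latency.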
So, for consecutive rules, Ti is a simple calculation, not a re-evaluation of the complete expected latency, E(L). The following algorithm, named δ-Opt since it considers only small perturbations to the current solution (adjacent swaps on rule ri), is in three parts. Step 1 is executed on initial configuration (or reconfiguration) of the ACL, Step 2 is executed after each packet is processed against the ACL and Step 3 is executed when renormalisation becomes necessary or idle processor time permits.

    Step 1: Initialisation
        for i := 1 to n do h(ri) := 1/n           \ Hit-rates set equal to start         (5)

    Step 2: On processing a packet matching rule ri
        h(ri) := θh(ri);                          \ Increase matched hit-rate            (6)
        if (di-1,i = 0) and (Ti > 0) then         \ If a valid gain then
            swap(ri-1, ri)                        \ Move matched rule up one

    Step 3: Renormalisation when H > Ω
        for i := 1 to n do h(ri) := h(ri) / H     \ Periodically set H = 1               (7)
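As a sketch of how δ-Opt might look in code (an illustration only, not the authors' implementation; THETA, OMEGA, the rule representation and the example values are all invented), the per-packet step uses the closed form of Ti from equation (4):

    # Illustrative delta-Opt sketch (not the authors' code). Each rule keeps its
    # hit-rate h and latency lam; dep(a, b) says whether the rules currently at
    # positions a and b are dependent. THETA and OMEGA are invented example values.
    THETA = 2.0    # factor by which a matched rule's hit-rate is increased
    OMEGA = 1e6    # renormalise when the total hit-rate H exceeds this limit

    def initialise(rules):
        """Step 1: set all hit-rates equal (equation 5)."""
        for r in rules:
            r["h"] = 1.0 / len(rules)

    def on_match(rules, i, dep):
        """Step 2: run after a packet matches the rule at position i (equation 6)."""
        rules[i]["h"] *= THETA                     # increase matched hit-rate
        if i > 0 and not dep(i - 1, i):
            # T_i = h(r_i)*lambda(r_{i-1}) - h(r_{i-1})*lambda(r_i)   (equation 4)
            t = rules[i]["h"] * rules[i - 1]["lam"] - rules[i - 1]["h"] * rules[i]["lam"]
            if t > 0:                              # a valid gain: move matched rule up one
                rules[i - 1], rules[i] = rules[i], rules[i - 1]

    def renormalise(rules):
        """Step 3: when H > OMEGA, rescale so that H = 1 (equation 7)."""
        total = sum(r["h"] for r in rules)
        if total > OMEGA:
            for r in rules:
                r["h"] /= total

    # Tiny usage example with invented latencies and no dependencies: a burst of
    # packets repeatedly matches the rule with latency 0.5, which drifts upwards.
    rules = [{"lam": 1.0}, {"lam": 1.5}, {"lam": 0.5}]
    initialise(rules)
    for _ in range(3):
        i = next(k for k, r in enumerate(rules) if r["lam"] == 0.5)  # its current position
        on_match(rules, i, dep=lambda a, b: False)
        renormalise(rules)
    print([r["lam"] for r in rules])               # [0.5, 1.0, 1.5]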
Step 1 sets initial hit-rates. Step 2 increases the hit-rate of the rule matched by the last processed packet and attempts to move the matched rule one place up the list. Step 3 renormalises hit-rates from time to time, when the total hit-rate exceeds some limit (Ω, dependent on implementation), to prevent overflow. Steps 1 and 3 are of O(n) complexity. Step 2, the critical part following the processing of each packet, is (small and) constant. All potential latency difficulties have been removed. δ-Opt is configurable through the choice of the value θ, the factor by which the hit-rate of the matched rule is increased. Larger values of θ will bring a rule to prominence faster than smaller ones.

3.2. Discussion of Results

This is a highly efficient algorithm, particularly suitable for typical packet flows in which streams of similar packets are to be processed. The promotion of a matched rule up the list exploits a form of 'locality of reference' and offers the equivalent of caching (Mierlutiu, 2003). The real-time component of δ-Opt is negligible in complexity. On average it will appear to give poorer results than 2-Opt since the scope of its local search is smaller; however, it is in the unpredictable nature of greedy-type algorithms that it sometimes performs better. Table 2 summarises simulation results for δ-Opt, with randomly generated packets, compared with 2-Opt applied statically to the final set of hit-rates (θ = 2 for these tests). It shows that the increase in expected latency of δ-Opt over 2-Opt grows with larger numbers of rules, but not quickly.
    n (number     Cases where δ-Opt = 2-Opt     Cases where δ-Opt < 2-Opt     Cases where δ-Opt > 2-Opt
    of rules)     in E(L) (%)                   in E(L) (%)                   in E(L) (%)
    10            44                            24                            32
    25            9                             11                            80
    50            1                             2                             97
    100           0                             0                             100

    Table 2. Performance of δ-Opt compared with 2-Opt.