Robust Optimization for Selecting NetFlow Points of Measurement in an IP Network
Mustapha Bouhtou and Olivier Klopfenstein
France Télécom R&D, 38 rue du Général Leclerc, 92130 Issy-les-Moulineaux, France
Email: {mustapha.bouhtou,olivier.klopfenstein}@orange-ftgroup.com
Abstract— NetFlow is a solution to make an in-depth traffic analysis possible in IP networks. Most telecommunication operators have already deployed it, or intend to do so, for network management purposes. While the best description of network state would be achieved by activating NetFlow on all router interfaces at the same time, this is rarely desirable in practice. A natural alternative is to select a smaller set of interfaces, while ensuring that a sufficiently large proportion of the total amount of traffic is characterized. A critical factor when selecting these interfaces is traffic variability. Indeed, the decision taken at a given moment has to be robust to future traffic variations, at least over a given period. Mathematical models and algorithms are proposed to tackle this issue. They are based on recent advances in robust optimization to deal with probabilistic constraints. The method is tested on real-life data taken from an international France Télécom IP backbone.
I. INTRODUCTION

Network operators need traffic measurements to manage their networks effectively. Classically, they have relied on quite basic information, such as traffic amounts on a given link, provided by the Simple Network Management Protocol (SNMP). In recent years, new measurement technologies have been proposed by equipment vendors. These systems make an in-depth analysis of the traffic possible in an IP network. This is in particular the case for the so-called NetFlow technology, proposed by Cisco [16] (implementations are also available from other equipment providers, such as Juniper [11]). By using NetFlow on router interfaces, network managers have access to a detailed characterization of traffic, with in particular information on sources and destinations of flows (IP addresses). An important application of this latter feature is related to traffic matrix measurement and estimation, that is, an end-to-end characterization of the network traffic [1], [5], [9], [10], [12], [14]. The need for such a detailed description of traffic is critical in many fields, such as network monitoring, network planning, security analysis, traffic engineering and even billing issues. That is the reason why most telecommunication operators have adopted such measurement technologies. However, this solution also presents drawbacks [13], [15]. First, as observed by Zang and Nucci in [13], it is costly: software and equipment have to be bought, updated and maintained. This translates into capital and operational expenditures. Moreover, the collection of the measured information impacts the network load. Indeed, data have to be sent regularly
through the network to centralized servers. Additionally, the frequent traffic measurements may consume significant CPU and memory resources on routers. These considerations motivate limiting the use of NetFlow in the network, by not activating it everywhere at the same time. A natural question is then the following: how should the set of interfaces on which NetFlow is activated be selected? Zang and Nucci have dealt with this question in the context of a completely new deployment of NetFlow in the network [13]. Using integer linear programs, they provide a methodology to optimize the NetFlow deployment process by identifying which routers and which interfaces should be NetFlow-enabled. Their model relies on traffic data assumed to be precisely known. The current work extends this earlier approach by taking traffic variability in time into account. The objective is to select a set of router interfaces in the network, while ensuring that a given proportion of the total amount of traffic in the network is characterized. Our focus is on making the interface selection robust to traffic variations in time. Indeed, the selection of interfaces is decided at a given moment, but the amounts of traffic going through the interfaces will necessarily vary afterwards. Hence, the difficulty is to ensure that the selection will actually characterize the desired proportion of the total network traffic at any time. Our aim is to ensure that the decision made at a given time remains valid with a guaranteed probability, despite traffic variability. To deal with this traffic variability, we rely on original mathematical programming models. Robust optimization is an attractive way to deal with data uncertainty, since it is highly tractable. Thus, taking advantage of recent advances in this field, an effective algorithm is designed to solve our problem.
Numerical tests are performed on data collected in an international France Télécom IP backbone: they show that the proposed algorithm is usable in practice on real-life instances. Since it runs fast enough, it can be used in a dynamic setting to regularly update earlier decisions and track the evolution of network traffic.

II. PROBLEM DESCRIPTION
A. Notations and mathematical models

Let R be the set of routers in the network. To each router r ∈ R, we associate the set I_r of its external input interfaces (i.e., access, customer and peering links). Note that only external input interfaces are considered here, to avoid redundancy in measurements. SNMP measurements are supposed to be available on all network interfaces ∪_{r∈R} I_r, and we denote by t_ri ≥ 0 the traffic amount on interface i of router r. For a given interface i on router r ∈ R, the variable x_ri takes value 1 if NetFlow has to be activated on this interface, and 0 otherwise. As in [13], our constraint is to select enough interfaces to characterize a proportion at least α ∈ [0, 1] of the total network traffic. This can be written mathematically as:

Σ_{(r,i)∈RI} t_ri x_ri ≥ α Σ_{(r,i)∈RI} t_ri   (1)

where RI denotes the set of all network interfaces: RI = {(r, i) | r ∈ R, i ∈ I_r}. We denote by n = |RI| the total number of interfaces in the network. At a given time, the traffic data {t_ri}_{(r,i)∈RI} are assumed to be known thanks to SNMP measurements. The main difficulty comes from satisfying constraint (1) while traffic varies in time. Indeed, the selection of interfaces on which to activate NetFlow is made at a given time, but the traffic values will certainly vary afterwards. Thus, we have to consider that the data t_ri are known only approximately. From a mathematical point of view, this means that t_ri is not a scalar, but a random variable. Let us denote by Pr the probability measure. We are interested in a selection x of interfaces satisfying:

Pr( Σ_{(r,i)∈RI} t_ri x_ri ≥ α Σ_{(r,i)∈RI} t_ri ) ≥ p   (2)
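For a fixed snapshot of traffic values, constraint (1) is straightforward to evaluate. The following minimal sketch checks it for a candidate selection; the data layout and toy values are ours, not the paper's.

```python
# Hypothetical sketch: check the deterministic coverage constraint (1).
# Keys are (router, interface) pairs; values are SNMP traffic amounts.

def coverage_ok(traffic, selected, alpha):
    """Return True if the selected interfaces carry at least a
    fraction `alpha` of the total traffic (constraint (1))."""
    total = sum(traffic.values())
    captured = sum(t for key, t in traffic.items() if key in selected)
    return captured >= alpha * total

# toy example: 90 of 100 traffic units are captured, so alpha = 0.8 holds
traffic = {("r1", "i1"): 40.0, ("r1", "i2"): 10.0, ("r2", "i1"): 50.0}
ok = coverage_ok(traffic, {("r1", "i1"), ("r2", "i1")}, alpha=0.8)
```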
where p ∈ (0, 1] is the probability target fixed by the network manager. Inequality (2) means that the selection x of interfaces will enable the measurement of a proportion α of the total traffic at any time, with a guaranteed probability p. Finally, constraint (2) has to be inserted into a global decision problem. To keep the analysis as general and flexible as possible, we simply denote by f(x) the cost associated with a selection x. This cost may include many different objectives, such as expenditures, network overload coming from sending data, impact on router performance, etc. On the other hand, requirements other than (2) may also be imposed on a decision x; we simply denote them by g(x) ≥ 0. Thus, our decision problem is the following:

min f(x)
s.t. g(x) ≥ 0
     Pr( Σ_{(r,i)∈RI} t_ri (x_ri − α) ≥ 0 ) ≥ p
     x ∈ {0,1}^n   (3)
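When traffic scenarios can be sampled (for instance, replayed from SNMP history), the probability appearing in (2) and (3) can be estimated empirically. The following Monte Carlo sketch is purely illustrative and is not the paper's estimator; the independent uniform traffic model is our assumption.

```python
import random

def feasibility_probability(scenarios, selected, alpha):
    """Estimate Pr( sum t_ri (x_ri - alpha) >= 0 ) by Monte Carlo.
    `scenarios` is a list of dicts mapping (router, interface) to a
    sampled traffic value. Illustrative only, not the paper's method."""
    hits = 0
    for t in scenarios:
        # (k in selected) plays the role of x_ri in {0, 1}
        if sum(v * ((k in selected) - alpha) for k, v in t.items()) >= 0:
            hits += 1
    return hits / len(scenarios)

# toy scenarios: independent uniform traffic within known intervals
# (an assumption of ours, made only for this example)
random.seed(0)
intervals = {("r1", "i1"): (30, 50), ("r2", "i1"): (40, 60), ("r1", "i2"): (5, 15)}
scenarios = [{k: random.uniform(lo, hi) for k, (lo, hi) in intervals.items()}
             for _ in range(10_000)]
p_hat = feasibility_probability(scenarios, {("r1", "i1"), ("r2", "i1")}, alpha=0.8)
```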
A particular setting of f and g will be presented in the numerical tests (see Section IV). Note that when α = 1, we have x_ri ≤ α for all (r, i). This implies that Σ_{(r,i)∈RI} t_ri (x_ri − α) is always a non-positive quantity. Hence, satisfying the probabilistic requirement imposes taking x_ri = 1 for all (r, i): NetFlow has to be activated on all interfaces.
On the other hand, the case α = 0 is meaningless in practice. From now on, we suppose that α ∈ (0, 1).

B. A pessimistic approach

Traffic values t_ri have been considered as varying in time. We now need some characterization of their randomness. The idea is to rely on earlier traffic values to characterize future ones. We suppose that t_ri takes values in an interval [t̲_ri, t̄_ri], t̲_ri (resp. t̄_ri) being the minimum (resp. maximum) traffic value on interface i of router r. These extreme values are deduced from the history of earlier SNMP measurements.

Observe that inequality (1) can be written equivalently: F(x, t) = Σ_{(r,i)∈RI} t_ri (x_ri − α) ≥ 0. We want to ensure that F(x, t) is always non-negative, which is equivalent to taking p = 1 in the probabilistic inequality (2). Then, for a given (r, i) ∈ RI, we have to find the lowest value that t_ri (x_ri − α) can reach. This value depends on whether x_ri = 1 or x_ri = 0. If x_ri = 1, then t_ri (x_ri − α) ≥ (1 − α) t̲_ri (recall that α ≤ 1). On the contrary, if x_ri = 0, then t_ri (x_ri − α) ≥ −α t̄_ri. This implies that:

t_ri (x_ri − α) ≥ x_ri (1 − α) t̲_ri − (1 − x_ri) α t̄_ri.

From this analysis, we deduce that the following inequality always holds:

Σ_{(r,i)∈RI} t_ri (x_ri − α) ≥ Σ_{(r,i)∈RI} [ x_ri (1 − α) t̲_ri − (1 − x_ri) α t̄_ri ]

Hence, a pessimistic approach to our interface selection problem consists of:

min f(x)
s.t. g(x) ≥ 0
     Σ_{(r,i)∈RI} x_ri [ α t̄_ri + (1 − α) t̲_ri ] ≥ α Σ_{(r,i)∈RI} t̄_ri
     x ∈ {0,1}^n
(4)

This approach has the advantage of providing very robust solutions, even when no probability information such as distributions or correlations is available. However, the computed solution is valid for the case when, for each selected interface, the traffic is minimal, while it is maximal on all the non-selected interfaces. Such a case will never occur in practice. That is the reason for considering probability tradeoffs p < 1.

C. Getting traffic intervals in a dynamic environment

To build the traffic interval [t̲_ri, t̄_ri] for interface i of router r, we rely on earlier SNMP measurements. These data must cover a sufficiently long period. Indeed, suppose that SNMP data describe the traffic of one single hour, say from 7:00 to 8:00 am. Then t̲_ri and t̄_ri correspond respectively to the weakest and strongest traffic values observed on the interface during this hour. But these values will certainly be very different from those corresponding to 8:00 to 9:00 am. When working with measurements over a day, things are a bit more stable. In particular, it is known that working days (i.e., Monday to Friday) are very similar. Thus, it is justified to use the measurements of Monday, for instance, to characterize the traffic of Tuesday. Similar reasoning holds for weeks or months.
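The interval construction just described reduces to taking per-interface extremes over the reference period; a minimal sketch, with an illustrative data layout of our own:

```python
# Hypothetical sketch: derive per-interface traffic intervals from a
# window of historical SNMP samples.

def traffic_intervals(history):
    """`history` maps (router, interface) -> list of SNMP traffic samples
    over the reference period; returns (t_min, t_max) per interface."""
    return {key: (min(samples), max(samples)) for key, samples in history.items()}

history = {("r1", "i1"): [42.0, 55.0, 38.5], ("r1", "i2"): [9.0, 12.5, 11.0]}
intervals = traffic_intervals(history)
# intervals[("r1", "i1")] == (38.5, 55.0)
```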
Then, we can define a time period such that each period is similar to the following one in terms of traffic. From the measurements of a given period, we obtain a characterization of traffic for the period to come. Hence, new decisions on where to activate NetFlow may be taken regularly to better fit the evolution of network traffic.

III. AN EFFECTIVE ALGORITHM

The algorithm designed is an adaptation of the one proposed in [7] for the chance-constrained knapsack problem. Because of paper size limitations, many technical details are omitted. For more details, we refer to [2], [3], [7], [8]. The idea is to approximate the hard probabilistic problem (3) through a sequence of integer linear problems, called "robust", which are easy to solve with commercial solvers.

A. A robust model

The robust approach of [2] is used. Consider an integer Γ ∈ {0, . . . , n}. The main idea is to find a solution x which captures the desired proportion α of the total traffic, even though the traffic values take their "worst value" on Γ interfaces, while taking their "best value" on the n − Γ other interfaces. More precisely, following the analysis of Section II-B, for all (r, i) ∈ RI, we know that the following inequalities always hold:

t_ri (x_ri − α) ≥ x_ri (1 − α) t̲_ri − (1 − x_ri) α t̄_ri   (i)
t_ri (x_ri − α) ≤ x_ri (1 − α) t̄_ri − (1 − x_ri) α t̲_ri   (ii)

We look for the best solution remaining feasible even when any Γ of the n quantities {t_ri (x_ri − α)}_{(r,i)∈RI} take their lowest values (i), while the n − Γ other ones take their largest values (ii). In particular, observe that if Γ = n, the pessimistic framework described in Section II-B is recovered. On the contrary, if Γ = 0, we make highly optimistic assumptions. Mathematically, we look for a solution x satisfying, for all S ⊆ RI such that |S| = Γ:

Σ_{(r,i)∈S} [ x_ri (1 − α) t̲_ri − (1 − x_ri) α t̄_ri ] + Σ_{(r,i)∉S} [ x_ri (1 − α) t̄_ri − (1 − x_ri) α t̲_ri ] ≥ 0
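The reduction of this exponential family of constraints can be sketched as follows; this is a standard duality argument in the style of [2], and the protection function β, the variables z and the gaps d_ri are our notation, not the paper's.

```latex
% For fixed x, the worst case over all S with |S| = Gamma reduces the
% optimistic total by the protection function
\beta(x,\Gamma) = \max\Big\{ \textstyle\sum_{(r,i)\in RI} d_{ri}(x)\, z_{ri}
  \;:\; \textstyle\sum_{(r,i)\in RI} z_{ri} \le \Gamma,\; 0 \le z_{ri} \le 1 \Big\},
\qquad d_{ri}(x) = \big[\alpha + (1-2\alpha)x_{ri}\big]\,\delta_{ri},
% where d_ri(x) is the gap between the optimistic bound (ii) and the
% pessimistic bound (i). By linear programming duality (u for the budget
% row, v_ri for the bound z_ri <= 1):
\beta(x,\Gamma) = \min\Big\{ \Gamma u + \textstyle\sum_{(r,i)\in RI} v_{ri}
  \;:\; u + v_{ri} \ge d_{ri}(x),\; u, v \ge 0 \Big\}.
% Requiring (optimistic total) - beta(x, Gamma) >= 0 and merging the inner
% minimization into the outer problem yields the robust constraints below.
```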
Following the analysis of [2], this exponential-size formulation can be simplified into a polynomial-size one, using mainly linear programming duality. Then, after some technical work (not detailed here because of paper size limitations), we obtain the robust version of our global decision problem (3), associated with the robust parameter Γ ∈ [0, n] (possibly fractional in this model):

min f(x)
s.t. g(x) ≥ 0
     Σ_{(r,i)∈RI} [ t̄_ri (1 − α) x_ri − t̲_ri α (1 − x_ri) ] − Γu − Σ_{(r,i)∈RI} v_ri ≥ 0
     u + v_ri ≥ [ α + (1 − 2α) x_ri ] δ_ri,  ∀(r, i) ∈ RI
     u, v ≥ 0
     x ∈ {0,1}^n   (5)

where δ_ri = t̄_ri − t̲_ri is the maximal traffic variation on interface i of router r. We call this problem the Robust NetFlow
Measurement Selection Problem associated with Γ, RNMSP(Γ) for short. If f and g represent respectively a linear objective and linear constraints, RNMSP(Γ) is a 0-1 integer linear program of reasonable size. In most cases, it can be solved effectively by commercial software. When Γ increases, the optimal value of RNMSP(Γ) also increases (i.e., it deteriorates), but the feasibility probability of the corresponding solution increases as well (i.e., it improves).

B. The algorithm

The algorithm consists of solving a sequence of robust problems RNMSP(Γ) for increasing values of Γ (cf. [7]). Starting from Γ = 0, the robust parameter is progressively increased. For each new value of Γ, the corresponding robust problem is solved. Thus, the feasibility probability of the obtained solutions tends to increase. We stop as soon as we have a feasible solution for (3), that is, when the probability requirement is met.

Heuristic for solving (3)
Step 0: Let Γ = 0, and let d > 0 be arbitrarily small.
Step 1: Solve RNMSP(Γ), and let x be the obtained solution.
Step 2: Check the probability requirement (2). If it is satisfied: STOP.
Step 3: Compute Γ(x) as described below.
Step 4: Update the robust parameter: Γ ← Γ(x) + d. Go to Step 1.

For a given selection x of interfaces, Γ(x) is the largest value of the robust parameter such that x remains a feasible solution of the robust problem (in particular, observe that Γ(x) ≥ Γ). It can be computed by the following algorithm:

Algorithm to compute Γ(x)
Step 0: Let Q = 0, Γ(x) = 0.
Step 1: For all (r, i) ∈ RI:
- if x_ri = 1, Q ← Q + (1 − α) t̄_ri;
- if x_ri = 0, Q ← Q − α t̲_ri.
Step 2: Sort the {|x_ri − α| · δ_ri}_{(r,i)∈RI} in non-increasing order. Denote by {v_k}_{1≤k≤n} the sorted list obtained.
Step 3: For k = 1 to n:
- if Q − v_k < 0: Γ(x) ← Γ(x) + Q/v_k; STOP.
- otherwise: Q ← Q − v_k, Γ(x) ← Γ(x) + 1.

This algorithm runs in O(n log n) time (the complexity of the sorting in Step 2).
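The three steps above translate directly into code; a sketch with an illustrative data layout of our own (the function and variable names are not the paper's):

```python
# Sketch of the O(n log n) computation of Gamma(x): the largest robust
# parameter for which selection x stays feasible for RNMSP(Gamma).

def gamma_of_x(x, t_lo, t_hi, alpha):
    """x, t_lo, t_hi: dicts sharing the same interface keys; x maps to
    0/1, t_lo/t_hi to the interval bounds. Returns Gamma(x), possibly
    fractional."""
    # Step 1: optimistic (nominal) value Q of the robust constraint.
    q = sum((1 - alpha) * t_hi[k] if x[k] == 1 else -alpha * t_lo[k] for k in x)
    # Step 2: per-interface deviations |x_ri - alpha| * delta_ri,
    # sorted in non-increasing order.
    devs = sorted((abs(x[k] - alpha) * (t_hi[k] - t_lo[k]) for k in x),
                  reverse=True)
    # Step 3: consume Q with the largest deviations until it turns negative.
    gamma = 0.0
    for v in devs:
        if q - v < 0:
            return gamma + q / v
        q -= v
        gamma += 1.0
    return gamma
```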
C. Probability assumptions

Step 2 of the above heuristic requires assessing the feasibility probability of a solution. In this section, we give a practical way of estimating it. The average value of t_ri is assumed to be known for each interface (r, i); it is denoted by E[t_ri] (expected value). As for the lower and upper bounds t̲_ri and t̄_ri, it can be computed from earlier SNMP measurements. It would
From this theorem, we obtain directly the following useful result:

Lemma 1: Suppose that the random variables {t_ri}_{(r,i)∈RI} are independent. Let x̄ be a selection of interfaces to activate NetFlow, and denote T = Σ_{(r,i)∈RI} (α − x̄_ri) t_ri. Then, if E[T] < 0:

Pr( Σ_{(r,i)∈RI} t_ri (x̄_ri − α) ≥ 0 ) ≥ 1 − exp( −2 E[T]² / Σ_{(r,i)∈RI} (α − x̄_ri)² δ_ri² )   (6)

This result relies on two major assumptions. The first one is the independence of the random variables {t_ri}_{(r,i)∈RI}. This may not be satisfied in practice, and dependencies may possibly be included earlier in the model. Nevertheless, in most cases, the independence assumption is not excessively restrictive. The second major assumption lies in E[T]
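The right-hand side of (6) is cheap to evaluate for a candidate selection; a sketch, with an illustrative data layout of our own:

```python
import math

# Sketch of the bound in Lemma 1: a lower bound on the feasibility
# probability of a selection x_bar, assuming independent traffic
# variables with known means and known intervals.

def feasibility_lower_bound(x_bar, e_t, t_lo, t_hi, alpha):
    """Return the right-hand side of (6), a lower bound on
    Pr( sum t_ri (x_ri - alpha) >= 0 ). Requires E[T] < 0, with
    T = sum (alpha - x_ri) t_ri. All dicts share the same keys."""
    mean_t = sum((alpha - x_bar[k]) * e_t[k] for k in x_bar)
    if mean_t >= 0:
        raise ValueError("Lemma 1 requires E[T] < 0")
    denom = sum((alpha - x_bar[k]) ** 2 * (t_hi[k] - t_lo[k]) ** 2 for k in x_bar)
    return 1.0 - math.exp(-2.0 * mean_t ** 2 / denom)
```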