Network Traffic Matrix: How Can One Learn the Prior Distributions from the Link Counts Only?

S. Vaton, J.-S. Bedo

Abstract— The Origin-Destination (OD) traffic matrix of a network gives the volume of traffic for each OD pair. It is very useful for capacity planning, routing, operations and management, SLAs, customer reports, etc. But this matrix cannot be measured directly: that would require very costly upgrades of the existing infrastructure. In contrast, SNMP reports periodically give the volume of traffic on each link. The traffic matrix must therefore be estimated by mathematical methods from the link counts only. This inverse problem is typically ill-posed, since the number of OD pairs is much greater than the number of links. One of the existing techniques is Bayesian: it runs Markov Chain Monte Carlo (MCMC) algorithms to simulate the joint distribution of the OD pairs given the link counts. It requires a prior distribution for each OD pair, and most of the time those priors are arbitrary. A challenging issue is to train the priors from the only available data, that is, the link counts themselves. This is exactly what we do in this paper. We prove the validity of our approach on a network for which direct measurements of the OD counts were made available.

I. Introduction

Various statistical and non-statistical methods have been proposed in the literature for the estimation of the traffic matrix [1][2][3][4][5]. One of the statistical methods is Bayesian [2]: it consists of simulating the OD counts by running Markov Chain Monte Carlo (MCMC) algorithms. Despite the generality and the power of MCMC techniques when used judiciously, this method has one serious drawback: it requires a prior distribution for each of the OD counts. For real datasets this prior distribution is obviously very difficult to fix in advance. This results in a serious bias in the estimate, since these

Corresponding author: S. Vaton, GET/ENST Bretagne, Brest, France, [email protected]. This project is financially supported by the Région Bretagne.

IEEE Communications Society
0-7803-8533-0/04/$20.00 (c) 2004 IEEE

prior distributions are fixed more or less arbitrarily [6]. Instead, the prior distributions of the OD counts should be learnt from the only available data, that is, the link counts themselves. This is the challenging issue that we tackle in this paper. We develop a Bayesian approach in which the prior distributions are learnt blindly from the link counts only. This is performed iteratively, by exchanging the estimated OD matrix and the prior distributions between two steps; moreover, the exchange is suitably smoothed during the first iterations of the algorithm, as in simulated annealing methods. The algorithm has already been validated on simulated traffic in previous papers [7][8]. In this paper the algorithm is validated on real traffic data from a single-router network for which direct measurements of the OD counts were made available.

II. Need for valid prior distributions

The traffic matrix problem is typically ill-posed. If one denotes by y the column vector of link counts and by x the column vector of OD pairs, then y = Ax, where A is the routing matrix. This linear system is strongly underdetermined because c ≫ r, where r is the number of links and c the number of OD pairs. Therefore the dimension of the space of solutions is c − r.
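As a quick sanity check, the deficit c − r can be read off the routing matrix directly; a minimal sketch for a two-link, three-OD-pair network (the matrix below matches the example discussed in this section, but any routing matrix works):

```python
import numpy as np

# Routing matrix of a two-link, three-OD-pair network:
# row i lists which OD pairs traverse link i.
A = np.array([[1, 0, 1],   # link 1 carries x1 and x3
              [1, 1, 0]])  # link 2 carries x1 and x2
r = np.linalg.matrix_rank(A)   # number of independent link equations
c = A.shape[1]                 # number of OD pairs
dim_solution_space = c - r     # dimension of the affine solution set (here 1)
```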

[Figure: nodes A, B and C in line; the link from A to B carries y1 = x1 + x3, the link from B to C carries y2 = x1 + x2; the OD pairs are x1 (A to C), x2 (B to C) and x3 (A to B).]
Fig. 1. Network with 2 links and 3 OD pairs.

Take the example of the network of Figure 1, with two links and three OD pairs. Clearly y1 = x1 + x3 and y2 = x1 + x2. Suppose for example that y1 = 3 and y2 = 5. In that case there are four non-negative integer solutions: (x1, x2, x3) = (0, 5, 3), (1, 4, 2), (2, 3, 1) and (3, 2, 0). These solutions have nothing in common except that they all satisfy the equations x1 + x3 = 3 and x1 + x2 = 5. There are different approaches to this problem. One possible approach is to introduce prior distributions over the OD pairs in order to single out some solutions as more likely than others. But those priors should not be arbitrary: an arbitrary choice of prior distribution would seriously bias the estimate of the traffic matrix. Ideally the priors should be estimated from the link counts only, since the link counts are the only available data. This is exactly what we propose to do in this paper.
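The four solutions above are easy to reproduce with a brute-force enumeration of the non-negative integer solutions for y1 = 3, y2 = 5:

```python
# All non-negative integer solutions of x1 + x3 = 3 and x1 + x2 = 5
# for the toy network of Fig. 1 (the search is bounded by the link counts).
solutions = [(x1, x2, x3)
             for x1 in range(6)
             for x2 in range(6)
             for x3 in range(4)
             if x1 + x3 == 3 and x1 + x2 == 5]
# -> [(0, 5, 3), (1, 4, 2), (2, 3, 1), (3, 2, 0)]
```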

III. MCMC sampling for computing the expected traffic matrix

This section is a detailed description of the MCMC approach for computing the OD traffic matrix [2]. The goal is to simulate x under its posterior distribution p(x | y) with the constraint that y = Ax. The inputs are the vector of link counts y, the routing matrix A, and a prior distribution on the OD pairs, which are assumed to be independent: p(x) = ∏_{i=1}^{c} p_i(x_i). In this section we follow Tebaldi et al. [2]. The routing matrix A has full row rank r. Then, up to some linear combinations of the rows of A and some permutations of its columns, one can write A = [A1 A2], where A1 is an invertible r × r matrix and A2 is an r × (c − r) matrix. The same linear combinations must be applied to the components of y, and the OD pairs must be reordered accordingly. x is similarly decomposed into an upper part of size r and a lower part of size c − r: x = (x1^T, x2^T)^T. It then follows from y = Ax that x1 = A1^{−1}(y − A2 x2), where A1^{−1} is the inverse of A1. Therefore, given y, x2 is a set of free variables, and simulating p(x | y) reduces to simulating x2 under the posterior distribution p(x2 | y) and then obtaining x1 from x1 = A1^{−1}(y − A2 x2).

A. Gibbs algorithm

The simulation of x2 under p(x2 | y) can be performed by running a Gibbs algorithm. In general, the Gibbs algorithm is used to simulate a multidimensional random variable z = (z1, . . . , zN) under its joint distribution p(z). It is often impossible to sample directly from the joint distribution p(z), whereas the conditional distributions p_i(z_i | z_{−i}) are easy to simulate, where p_i(z_i | z_{−i}) is the conditional distribution of z_i when the other components (the vector z_{−i}) are held fixed. The principle of the Gibbs algorithm is to simulate one component z_i at a time: each component z_i is drawn under its conditional distribution p_i(z_i | z_{−i}) where z_{−i} takes its current value. One sweeps iteratively across the components of z as follows:

z1^{n+1} ← p1(z1 | z2^n, . . . , zN^n)
z2^{n+1} ← p2(z2 | z1^{n+1}, z3^n, . . . , zN^n)
. . .
z_{N−1}^{n+1} ← p_{N−1}(z_{N−1} | z1^{n+1}, . . . , z_{N−2}^{n+1}, zN^n)
zN^{n+1} ← pN(zN | z1^{n+1}, . . . , z_{N−1}^{n+1})

When n (the number of iterations) is large, z^n = (z1^n, z2^n, . . . , zN^n) is approximately distributed under the joint distribution p(z). Coming back to the traffic matrix, the target distribution is p(x2 | y) and the instrumental distributions are the conditional distributions p(x2^i | y, x2^{−i}). It is therefore necessary to compute the probability density functions p(x2^i | y, x2^{−i}). After a few straightforward computations, described in [8], one obtains p(x2^i | y, x2^{−i}) ∝ p(x2^i) ∏_{j=1}^{r} p(x1^j), where ∝ denotes proportionality and x1 = A1^{−1}(y − A2 x2). The conditional distribution p(x2^i | y, x2^{−i}) cannot be simulated directly; one must use another MCMC algorithm, namely the Metropolis algorithm. This leads to a Metropolis algorithm nested within a Gibbs algorithm, a so-called Metropolis-within-Gibbs algorithm.

B. Metropolis algorithm

The Metropolis algorithm makes it possible to simulate a random variable whose probability density function is known only up to a multiplicative factor. The principle is to draw a candidate from a distribution that one can simulate, and to accept the draw with a probability equal to the Metropolis ratio. This ratio is a



function of the likelihoods of the candidate and of the previous sample under the distribution that one wants to simulate (the target distribution) and under the distribution that one can simulate (the instrumental distribution). The sequence of random variables produced by the Metropolis algorithm converges to the target distribution as the number of iterations grows. In our case the target distribution is the distribution of x2^i conditionally on y and x2^{−i}; the instrumental distribution can be, for example, the prior distribution of x2^i.

IV. Fitting a weighted mixture of Gaussians to the successive values of each OD pair

In this section our goal is to propose a suitable model for the prior distribution of each OD pair. We propose to model each OD pair by a mixture of Gaussians: many distributions can be fitted conveniently by mixtures, and mixtures can adapt to variability in the traffic by letting the weights of the different components change over time. In this section we are interested in a single OD pair. The successive values x1, x2, . . . , xT of that OD pair are distributed as a mixture of Gaussians: p(xk) = Σ_{j=1}^{K} wj(k) G_{µj,σj}(xk), where G_{µ,σ}(•) is the probability density function of the Gaussian distribution with mean µ and standard deviation σ, and wj(k) is the probability that xk stems from component j. The problem is now to estimate the number K of components, as well as the means µj, the standard deviations σj and the weights wj(k), from the successive values taken by that OD pair. Of course we do not know the real values of that OD pair; we only have estimates of those values. Therefore, in our global iterative algorithm, we do not use the real values of the OD pair but their estimated counterparts, as explained in Section V.
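Putting Section III together on the toy network of Fig. 1: a minimal sketch of the sampler, with hypothetical independent Gaussian priors standing in for the mixture priors (all prior parameters below are made up for illustration). Since c − r = 1 here, there is a single free variable, so the Gibbs sweep degenerates into a plain random-walk Metropolis step on x2.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy network of Fig. 1: r = 2 links, c = 3 OD pairs, y = A x.
A = np.array([[1.0, 0.0, 1.0],   # y1 = x1 + x3
              [1.0, 1.0, 0.0]])  # y2 = x1 + x2
y = np.array([3.0, 5.0])

# Decomposition A = [A1 A2]: here A1 (first two columns) is invertible.
A1, A2 = A[:, :2], A[:, 2:]
A1_inv = np.linalg.inv(A1)

def x1_of(x2):
    # x1 = A1^{-1} (y - A2 x2): the constrained block, given the free block x2.
    return A1_inv @ (y - A2 @ x2)

# Hypothetical independent Gaussian priors on the three OD pairs.
mu = np.array([2.0, 3.0, 1.0])
sigma = np.array([1.0, 1.0, 1.0])

def log_target(x2):
    # log p(x2 | y) up to a constant: log p(x2^i) + sum_j log p(x1^j).
    x1 = x1_of(x2)
    if np.any(x1 < 0.0) or np.any(x2 < 0.0):
        return -np.inf           # OD counts must stay non-negative
    x = np.concatenate([x1, x2])
    return float(np.sum(-0.5 * ((x - mu) / sigma) ** 2))

# Random-walk Metropolis on the free block x2 (here the OD pair x3).
x2 = np.array([1.0])
samples = []
for _ in range(5000):
    prop = x2 + rng.normal(0.0, 0.5, size=1)
    if np.log(rng.uniform()) < log_target(prop) - log_target(x2):
        x2 = prop                # accept with the Metropolis ratio
    samples.append(x2[0])

x2_hat = np.mean(samples[1000:])                      # posterior mean, burn-in dropped
x_hat = np.concatenate([x1_of(np.array([x2_hat])), [x2_hat]])
```

By construction every sample satisfies the link constraints y = Ax exactly, which is the whole point of the x = (x1, x2) decomposition.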

A. The EM algorithm

The weights wj(k), the means µj and the standard deviations σj can be estimated with an EM algorithm [9]. In the case of a mixture of Gaussians the EM algorithm is very simple and extremely fast: the computational load added to Tebaldi's method by estimating K, the µj, the σj and the wj(k) is about 10% of the overall cost.

The EM algorithm iterates the following two steps:
1. Expectation step: compute the probability that xk stems from component j of the mixture, wj(k) = wj G_{µj,σj}(xk) / Σ_{i=1}^{K} wi G_{µi,σi}(xk).
2. Maximization step: update the weights wj = (1/T) Σ_{k=1}^{T} wj(k), the means µj = Σ_{k=1}^{T} wj(k) xk / Σ_{k=1}^{T} wj(k), and the variances σj² = Σ_{k=1}^{T} wj(k)(xk − µj)² / Σ_{k=1}^{T} wj(k).

B. The BIC criterion

To estimate the number K of components in the mixture we use a BIC criterion. The principle is to try different values of K (from K = 1 to K = Kmax) and to select the one that maximizes BIC = 2 L(K) − ν(K) log(T), where L(K) = log p(x; θ̂K) is the log-likelihood of the mixture with K components that best fits the OD pair (θ̂K is produced by the EM algorithm) and ν(K) = 3K − 1 is the number of free parameters.

V. The global loop

In Section III we considered one time period and estimated the various OD pairs for that period. In Section IV we considered one OD pair and fitted a mixture model to it. In this section we estimate the successive values of all the OD pairs jointly by splitting that problem in two.

A. The feedback principle

Our principle is a divide-and-conquer method: we split this difficult problem into two subproblems, and global convergence is obtained by exchanging information between the two subproblems.

[Fig. 2 shows the loop: the link counts feed the Metropolis-within-Gibbs box, initialized by a gravity model at iteration 1 and by the feedback priors at iterations > 1; its output, the estimated traffic matrix, feeds the EM algorithm box. The feedback consists, for each OD pair i and each time t, of the prior p(x_i^t) = Σ_j w_{i,j}^t N(m_{i,j}, σ_{i,j}²): the means and variances m_{i,j}, σ_{i,j}² and the weights w_{i,j}^t of the component distributions.]

Fig. 2. The global loop.
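The EM updates and the BIC selection rule of Section IV can be sketched as follows (the data are synthetic: a hypothetical two-component "OD pair", with time-independent weights, so that ν(K) = 3K − 1):

```python
import numpy as np

rng = np.random.default_rng(1)

def dens_matrix(x, w, mu, sigma):
    # dens[k, j] = w_j * G_{mu_j, sigma_j}(x_k)
    return w * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) \
             / (sigma * np.sqrt(2.0 * np.pi))

def em_gmm_1d(x, K, n_iter=100):
    """EM for a one-dimensional K-component Gaussian mixture (Section IV-A)."""
    T = len(x)
    mu = np.quantile(x, np.linspace(0.1, 0.9, K))   # spread-out initial means
    sigma = np.full(K, np.std(x))
    w = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E step: responsibility w_j(k) of component j for sample x_k
        dens = dens_matrix(x, w, mu, sigma)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M step: update weights, means and variances
        nj = resp.sum(axis=0)
        w = nj / T
        mu = (resp * x[:, None]).sum(axis=0) / nj
        sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nj)
        sigma = np.maximum(sigma, 1e-6)   # guard against degenerate components
    loglik = np.log(dens_matrix(x, w, mu, sigma).sum(axis=1)).sum()
    return w, mu, sigma, loglik

def bic(loglik, K, T):
    # BIC = 2 L(K) - nu(K) log T, with nu(K) = 3K - 1 free parameters
    # (K means, K standard deviations, K - 1 independent weights).
    return 2.0 * loglik - (3 * K - 1) * np.log(T)

# Synthetic OD counts drawn from a 2-component mixture (hypothetical units).
x = np.concatenate([rng.normal(100, 10, 700), rng.normal(300, 30, 300)])
scores = {K: bic(em_gmm_1d(x, K)[3], K, len(x)) for K in range(1, 5)}
K_best = max(scores, key=scores.get)
```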

Our algorithm consists of the loop displayed in Figure 2. The first box (Metropolis within Gibbs)



runs the MCMC method of Section III: its inputs are the link counts and the prior distributions; its output is the estimated traffic matrix. The second box (EM algorithm) takes the successive values of each OD pair as input; it produces the parameters (weights, means and standard deviations) of a mixture of Gaussians, as explained in Section IV. These mixtures are then fed back as input to the first box. The process of exchanging information between the two boxes is iterated until convergence to a fixed point.

B. Initialization of the global loop

We want our method to be as uninformative as possible: we do not want to set anything arbitrarily, and this also holds for the initialization of the global loop. To launch the iterative process we need a first estimate of the traffic matrix, since no reliable prior distribution is available. For the initialization we considered two very simple methods: the gravity model [10] and a method based on the second moments of the traffic flows [11]. If x_{O,D} is the amount of traffic between an origin node O and a destination node D, then the gravity model estimates this quantity as x̂_{O,D} = x_{O,•} x_{•,D} / Σ_{D'} x_{•,D'}, where x_{O,•} is the total amount of traffic originating from node O and x_{•,D} is the total amount of traffic destined to node D. The estimate produced by the gravity model is considerably improved if it is projected onto the set defined by the linear constraints y = Ax and x_i ≥ 0: min ‖x − x̂‖² subject to y = Ax and x_i ≥ 0 for all i. This quadratic optimization problem with linear constraints is solved with an Uzawa algorithm. The other initialization method is based on second moments: the traffic between origin O and destination D is modeled as a proportion of the total traffic x_{O,•} from node O, x̂_{O,D} = [cov(x_{O,•}, x_{•,D}) / var(x_{O,•})] x_{O,•}. Once again a projection onto the set {y = Ax, x_i ≥ 0} is performed to reduce the variance of the estimate.

C. Smoothing the exchange of information between boxes

It is convenient to "smooth" the exchange of information between the boxes in order to avoid premature convergence to a fixed point during the first iterations:
(i) The prior distributions p(x_i^t) (OD pair i, time t) that are fed back into the first box are smoothed during the first iterations by replacing p(x_i^t) with [p(x_i^t)]^α, where α < 1.0. In practice one only has to set p(x_2^i | y, x_2^{−i}) ← [p(x_2^i)]^α ∏_{j=1}^{r} [p(x_1^j)]^α.
(ii) Similarly, it is convenient to smooth the information (the estimated OD pairs) provided by the first box to the second box. Since an estimate x̂_i^t enters the EM algorithm only through its likelihoods G_{µ,σ}(x̂_i^t), smoothing is obtained by raising those likelihoods to the power α in the Expectation step of the EM algorithm: G_{µ,σ}(x̂_i^t) ← [G_{µ,σ}(x̂_i^t)]^α with α < 1.0. In practice α = 0.5 for iterations 1 to 10, α = 0.75 for iterations 11 to 15, and α = 1.0 for iterations 16 to 20.

VI. Simulation results

The proposed algorithm is validated on the real dataset that Cao et al. used to test their own algorithm [3]. The exchanges between the two boxes effectively improve the estimate of the traffic matrix over the successive iterations, as we had already observed on simulated data [7][8]. Fig. 3 shows the correlation between the true OD pairs and their estimated counterparts over the successive iterations.
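The gravity initialization of Section V-B is one line of linear algebra; a minimal sketch (the traffic volumes below are hypothetical, and the Uzawa projection step is not shown):

```python
import numpy as np

def gravity_estimate(x_out, x_in):
    # x_hat[O, D] = x_{O,.} * x_{.,D} / (total traffic), Section V-B.
    return np.outer(x_out, x_in) / x_in.sum()

# Hypothetical row sums (traffic originated) and column sums (traffic received).
x_out = np.array([30.0, 70.0])
x_in = np.array([40.0, 60.0])
x_hat = gravity_estimate(x_out, x_in)
```

By construction the estimate reproduces the observed origin and destination totals; what it cannot capture is any affinity between particular OD pairs, which is exactly what the projection and the Bayesian loop then correct.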

Fig. 3. x-axis: iteration number. y-axis: correlation between true and estimated OD pairs.

We compare our algorithm with the methods of Section V-B, that is, the gravity model and the second-moments method, with and without projection onto the set {y = Ax, x_i ≥ 0}. The correlation between the true and the estimated OD pairs is displayed in Table I. Our



OD pair | Grav.  | Grav. + Proj. | 2nd Mom. | 2nd Mom. + Proj. | Vaton et al.
   1    | 0.9773 | 0.9852        | 0.9310   | 0.9777           | 0.9909
   2    | 0.9638 | 0.9869        | 0.5956   | 0.9816           | 0.9940
   3    | 0.6995 | 0.8827        | 0.5440   | 0.9074           | 0.9775
   4    | 0.2271 | 0.4420        | 0.2224   | 0.3861           | 0.5180
   5    | 0.9810 | 0.9896        | 0.9954   | 0.9957           | 0.9994
   6    | 0.9320 | 0.9933        | 0.8214   | 0.9882           | 0.9959
   7    | 0.9907 | 0.9982        | 0.9944   | 0.9989           | 0.9991

TABLE I
Correlation between true and estimated OD pairs for the 7 biggest OD pairs (each remaining pair carries less than 1 percent of the traffic). Five methods compared: gravity and second moments (each with and without projection) and our method.

Fig. 4. OD pair 5. Estimated versus true values. 5 days of traffic (1440 measurement periods of 5 minutes each).

method significantly outperforms the other methods wherever there is still room for improvement. Fig. 4 displays the estimated and true values of OD pair 5 over 5 days of traffic; each point corresponds to 5 minutes of traffic. Note that a perfect estimation of all the OD pairs cannot be expected, since only part of the information (the link counts) is available. It would be interesting to derive the Cramér-Rao Lower Bound (CRLB) for this problem and to check whether or not our estimate attains it.

VII. Conclusion

MCMC methods are flexible and reliable techniques that can be used for estimating the traffic matrix of a network. One major drawback of these methods is that they require reliable knowledge of a prior distribution for the OD pairs. In this paper we have described in detail a method for training these prior distributions from the only available data, that is, the link counts themselves. We have checked the validity of our approach on real traffic data from a simple network. This confirms the very good results that we had obtained on simulated data [7][8].

References

[1] Y. Vardi, "Network tomography: Estimating source-destination traffic intensities from link data," JASA, vol. 91, no. 433, March 1996.
[2] C. Tebaldi and M. West, "Bayesian inference on network traffic using link count data," JASA, vol. 93, no. 442, 1998.
[3] J. Cao, D. Davis, S. Vander Wiel, and B. Yu, "Time-varying network tomography: Router link data," JASA, vol. 95, no. 452, 2000.
[4] Y. Zhang, M. Roughan, N. Duffield, and A. Greenberg, "Fast accurate computation of large-scale IP traffic matrices from link loads," in Proc. ACM SIGMETRICS, 2003.
[5] Y. Zhang, M. Roughan, C. Lund, and D. Donoho, "An information-theoretic approach to traffic matrix estimation," in Proc. ACM SIGCOMM, 2003.
[6] A. Medina, N. Taft, S. Battacharya, C. Diot, and K. Salamatian, "Traffic matrix estimation: Existing techniques compared and new directions," in Proc. ACM SIGCOMM, 2002.
[7] S. Vaton and A. Gravey, "Iterative Bayesian analysis of network traffic matrices in the case of bursty flows," in Proc. ACM IMW, 2002.
[8] S. Vaton and A. Gravey, "Network tomography: An iterative Bayesian analysis," in Proc. ITC-18, 2003.
[9] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," JRSS B, vol. 39, pp. 1-38, 1977.
[10] M. Roughan, A. Greenberg, C. Kalmanek, M. Rumsewicz, J. Yates, and Y. Zhang, "Experience in measuring backbone traffic variability: Models, metrics, measurements and meaning," in Proc. ACM IMW, 2002.
[11] E. van Zwet, "Method of moments estimation for origin-destination traffic on a network," Dept. of Statistics, University of California, Berkeley.


