
IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 8, AUGUST 2011

Structural Models for Dual Modality Data With Application to Network Tomography

Harsh Singhal and George Michailidis

Abstract—We propose models for the joint distribution of two modalities of network flow volumes. While these models are motivated by computer network applications, the underlying structural assumptions are more generally applicable. In the case of computer network flow volumes, this corresponds to joint modeling of packet and byte volumes, and it enables computer network tomography, whose goal is to estimate characteristics of source-destination flows based on aggregate link measurements. Network tomography is a prototypical example of a linear inverse problem on graphs. We introduce two generative models for the relation between packet and byte volumes, establish identifiability of their parameters, and discuss different estimation procedures. The proposed estimators of the flow characteristics are evaluated using both simulated and emulated data. Finally, the proposed models allow us to estimate parameters of the packet size distribution, thus providing additional insights into the composition of network traffic.

Index Terms—Compound model, computer networks, identifiability, inverse problem, packet size distribution, tomography, traffic matrix estimation.

I. INTRODUCTION

Manuscript received April 10, 2009; revised December 16, 2010; accepted January 11, 2011. Date of current version July 29, 2011. This work was supported in part by NSF Grant DMS-0806094 and NSA Grant H98230-10-1-0203. H. Singhal is with Consumer Risk, Bank of America, Charlotte, NC 28202 USA. G. Michailidis is with the Department of Statistics and the Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI 48109 USA. Communicated by A. Nosratinia, Associate Editor for Communication Networks. Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TIT.2011.2158474

Joint modeling of related modalities is a common theme in many imaging and sensing applications. Such models are a prerequisite for merging data from different modalities. While there are some common threads that run through these models, they are also naturally tied to the application under consideration. In this paper, we look at the problem of estimating flow volumes in networks. The main mission of many physical networks, such as computer, road, and supply chain networks, is to carry flows of different objects: packets, vehicles, and goods in these three types of networks, respectively. Estimating traffic volumes is important for monitoring and provisioning such networks, but the high number of flows traversing a network makes this a challenging task. Network tomography techniques, which estimate source-destination traffic volumes based on aggregate link measurements, offer a computationally feasible solution (for a recent overview of tomography techniques see [1] and references therein). In many cases, there may be available

Fig. 1. Aggregate Volume Measurements.

information on different but related modalities (types) of flow volumes; examples include packets and bytes for computer networks, vehicles and passengers for road networks, and so forth. Joint modeling of such modalities can enhance the effectiveness of tomography techniques, a topic explored in this paper with a primary focus on computer networks. An alternative kind of data that enables estimation of flow volumes is sampled data, and issues related to such data are also an active area of research [2].

A computer network is comprised of network elements, such as workstations, routers, and switches, and links that connect those elements. A logical node in the network may correspond to a collection of these elements, i.e., a sub-network. A network flow may be defined in many different ways; we will assume that a flow contains all the traffic originating in a logical node and destined for some other logical node in the network. Each flow can in principle traverse a set of paths connecting its origin and destination, determined by the routing policy. In computer networks, the flow traffic is carried in packets, whose payload is expressed in bytes. The volume of traffic measured on a link may refer to the number of packets and/or the number of bytes, and such data for a particular time interval (typically of the order of a couple of minutes) are available through queries using the Simple Network Management Protocol (SNMP) [3]. The volume of traffic on a link is the sum of the volumes of all flows traversing that link. This produces highly aggregated data, and the question of interest is to estimate various statistics of the underlying flows. Under the assumption of independence of flow volumes, second order statistics of the observed aggregate data can be used to estimate second order parameters of the flow volume distributions.
The independence assumption limits the number of parameters to be estimated, which is crucial for identifiability. This assumption is standard in network tomography [4], although it may be relaxed to some extent [5]. The basic idea can be demonstrated through the toy network depicted in Fig. 1, comprised of 3 nodes and two links. Let $X_1$, $X_2$, and $X_3$ denote the volumes of the flows from node 1 to node 2, from node 2 to node 3, and from node 1 to node 3, respectively. The observations on links 1 and 2 are then respectively given by

$$Y_1 = X_1 + X_3, \qquad Y_2 = X_2 + X_3.$$

0018-9448/$26.00 © 2011 IEEE



Fig. 2. Byte volume versus packet volume for 3 observed flows in Tokyo Trace data (details in Section V).

Note that if the flow volumes are independent random variables, then their variances are "identifiable" from the joint distribution of the observed edge volumes $Y_1$ and $Y_2$, as follows:

$$\mathrm{Var}(Y_1) = \mathrm{Var}(X_1) + \mathrm{Var}(X_3), \qquad \mathrm{Var}(Y_2) = \mathrm{Var}(X_2) + \mathrm{Var}(X_3), \qquad \mathrm{Cov}(Y_1, Y_2) = \mathrm{Var}(X_3).$$

Thus, the vector that contains the variances and the covariance of $(Y_1, Y_2)$ uniquely determines the vector that contains the variances of $X_1$, $X_2$, and $X_3$, since the matrix

$$\begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \\ 0 & 0 & 1 \end{bmatrix}$$

relating the two is of full rank. By assuming a network-wide relation between the second order and first order parameters of the flow volume distribution, one can estimate the latter, which is the usual quantity of interest, from the former.

The term network tomography was introduced by Vardi [6] for the problem of estimating source-destination flow volumes from aggregate link measurements. There, the flow volumes were modeled as Poisson random variables, the difficulties of maximum likelihood estimation were demonstrated, and a low complexity method of moments estimator was proposed as an alternative. In [4], flow volumes were modeled as normally distributed with flow variances proportional to their means; the proportionality assumption leads to identifiability of the mean parameters through identifiability of the variances, and an estimator based on the EM algorithm was proposed. Recently, a sufficient condition for identifiability of the entire distribution of flow volumes up to their means was established in [7], together with an estimator based on the characteristic function of the aggregate data. In [5], several identifiability results were derived for settings incorporating specific spatial correlations, multimodal measurements, and heavy tailed distributions of flow volumes.

The classical network tomography setup does not consider packet volumes and byte volumes simultaneously. In this work, we use the fact that these two measures of flow volume are related through the packet size distribution. We assume that none of the flows have packets of identical sizes. While there exist networks where packet sizes are identical (for example
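The identifiability argument for the toy network can be verified numerically. The normal flow distributions and variance values below are illustrative choices, not from the paper; the point is only that the full-rank linear map lets the flow variances be recovered from the observed link second moments.

```python
import numpy as np

rng = np.random.default_rng(0)

# Flow volumes X1 (node 1 -> 2), X2 (node 2 -> 3), X3 (node 1 -> 3),
# independent, with known true variances (hypothetical values).
true_var = np.array([4.0, 9.0, 1.0])
X = rng.normal(10.0, np.sqrt(true_var), size=(100_000, 3))

# Link observations: link 1 carries flows 1 and 3, link 2 carries flows 2 and 3.
Y1 = X[:, 0] + X[:, 2]
Y2 = X[:, 1] + X[:, 2]

# Second moments of (Y1, Y2) are a full-rank linear map of the flow variances:
#   Var(Y1) = v1 + v3,  Var(Y2) = v2 + v3,  Cov(Y1, Y2) = v3
M = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0]])

obs = np.array([Y1.var(), Y2.var(), np.cov(Y1, Y2)[0, 1]])
recovered = np.linalg.solve(M, obs)
print(recovered)   # close to true_var
```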

when tunneling IPv6 through IPv4, or for VPN traffic), in these cases little additional information is gained from measuring both modalities compared to measuring either one alone. Motivated by empirical evidence, we introduce two models that capture the relationship between packet and byte volumes. In the first model, we assume a compound structure for the byte volume, with the packet volume as the compounding variable. In the second model, we assume that each flow is made up of independent sub-flows, each with a fixed packet size. For both models we make some network-wide assumptions in the spirit of classical network tomography. These assumptions exploit the structural relationship between the packet volume and byte volume of a flow, and can be viewed as a type of regularization that enables us to estimate flow volume means from aggregated data. The models introduced in this study try to capture the main characteristics of the packet-byte relationship, although the true one may be more complex, as evidenced by the plots of three flows obtained from real network traces shown in Fig. 2. Experience suggests that such complex relationships tend to be present in flows that are not highly aggregated; the models presented in this paper are most closely applicable to highly aggregated flows.

The remainder of the paper is organized as follows: in Section II we introduce the proposed flow volume models, while in Section III we address identifiability issues. In Section IV we study estimation of the models based on a pseudo-likelihood framework and establish consistency and asymptotic normality of the estimators. The performance of the models on simulated and emulated data is assessed in Section V. The issue of estimating characteristics of the packet size distribution is examined in Section VI. Finally, some concluding remarks are drawn in Section VII. The notation used in this paper for probabilistic operators is standard.
Specifically, $E(X)$ refers to the expectation of a random variable or random vector $X$, $\mathrm{Var}(X)$ refers to the variance of a random variable $X$, $\mathrm{Cov}(X, Y)$ refers to the covariance of random variables $X$ and $Y$ (or, in the context of random vectors, their cross-covariance matrix), and $\mathrm{Cov}(X)$ refers to the covariance matrix of a random vector $X$.

II. FLOW MODELS

Suppose there are $n$ flows and $m$ directed links in a network. Let $A$ be the $m \times n$ routing matrix such that $A_{ej} = 1$ if flow



$j$ traverses link $e$ and 0 otherwise. Further, let $X(t)$ and $B(t)$ be vectors of length $n$ whose elements are the packet and byte volumes of the flows in time interval $t$, for $t = 1, \dots, T$. Define

$$W(t) = (X(t)', B(t)')'$$

and the aggregate SNMP measurements

$$Y(t) = ((A X(t))', (A B(t))')'.$$

A. Compound Model

We assume

$$B_i = \sum_{k=1}^{X_i} S_{ik} \qquad (1)$$

where $S_{ik}$ is the size (in bytes) of the $k$th packet of the $i$th flow in the time interval (we suppress the time interval indexing for notational convenience). It is assumed that $S_{i1}, S_{i2}, \dots$ are independent and identically distributed (i.i.d.) from some distribution $F_i$ corresponding to flow $i$. Also, the random variables $S_{ik}$ are assumed independent of the packet count $X_i$, for $i = 1, \dots, n$. Further, it is assumed that the packet counts $X_i$, $i = 1, \dots, n$, are independent across flows. Note that (1) itself is obviously true; it is the independence requirements that constitute the substantive assumption of this model. Note also that the distributions $F_i$ may well have finite support; we make no assumption about them being continuous. Define the following parameters:
1) Mean packet volume vector $\lambda$, i.e., $\lambda_i = E(X_i)$.
2) Packet volume variance vector $\phi$, i.e., $\phi_i = \mathrm{Var}(X_i)$; from our assumption of independence of packet counts, $\mathrm{Cov}(X) = \mathrm{diag}(\phi)$ (this assumption can be relaxed to include the most significant empirically observed spatial correlations in flow volumes, i.e., the ones between forward and reverse flows due to the connection-oriented nature of Internet traffic [5]).
3) Mean packet size vector $\mu$, i.e., $\mu_i$ is the mean of $F_i$.
4) Packet size variance vector $\sigma^2$, i.e., $\sigma^2_i$ is the variance of $F_i$.

From (1), we have

$$E(B) = \mu \circ \lambda \qquad \text{and} \qquad \mathrm{Cov}(X, B) = \mathrm{diag}(\mu \circ \phi),$$

where $\circ$ denotes element-wise multiplication. Conditioning on the packet counts gives

$$\mathrm{Var}(B_i) = \mu_i^2 \phi_i + \sigma_i^2 \lambda_i,$$

and thus

$$\mathrm{Cov}(W) = \begin{bmatrix} \mathrm{diag}(\phi) & \mathrm{diag}(\mu \circ \phi) \\ \mathrm{diag}(\mu \circ \phi) & \mathrm{diag}(\mu \circ \mu \circ \phi + \sigma^2 \circ \lambda) \end{bmatrix}. \qquad (2)$$

Here $\sigma^2 \circ \lambda$ is the excess variance in the byte distribution "not explained by the variance in the packet distribution." Collecting these parameters, if $\theta = (\lambda', \phi', \mu', (\sigma^2)')'$, then the first and second moments of $W(t)$ are parametrized by $\theta$. Further, we assume that $\{W(t)\}$ is a stationary sequence.
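The compound structure in (1) and its moment identities can be illustrated by direct simulation. The Poisson packet counts and the two-point packet size distribution below are hypothetical choices for illustration, not the paper's data.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical two-point packet size distribution F (sizes in bytes).
sizes = np.array([40.0, 1500.0])
probs = np.array([0.4, 0.6])
mu = float(sizes @ probs)                    # mean packet size
sigma2 = float(((sizes - mu) ** 2) @ probs)  # packet size variance

# Compound model (1): byte volume B is a sum of X i.i.d. packet sizes,
# with X the packet volume of the flow in the interval (Poisson here).
lam = 200.0
X = rng.poisson(lam, size=20_000)
B = np.array([rng.choice(sizes, size=n, p=probs).sum() for n in X])

# Moment identities behind (2):
#   E(B) = mu * E(X),  Var(B) = mu^2 * Var(X) + sigma2 * E(X)
print(B.mean(), mu * lam)
print(B.var(), mu**2 * X.var() + sigma2 * X.mean())
```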

In the usual setup for traffic demand tomography, one assumes a functional relationship between flow volume means and variances of the type $\phi_i = c\,\lambda_i$ for all flows. This is a way to get identifiability of mean flow volumes from identifiability of flow volume variances (see for example [4]). For comparison purposes, the true relationship between the means and variances of flow volumes in the data examined (see Section V) is shown in Fig. 3(a). Joint modeling of packet and byte volumes allows us to estimate mean flow volumes under a different assumption. We will assume (except in Section VI) that $\mu_i = \mu$ and $\sigma_i^2 = \sigma^2$ for all flows $i$. In this case the mean packet volume vector is identifiable (as shown in Section III).

B. Independent Sub-Flow Model

Another way to jointly model packet and byte flow volumes is to assume that each flow is comprised of independent sub-flows, each with a characteristic packet size. Empirical evidence suggests that just a few packet sizes account for most of the traffic in a network. The histogram of the observed packet sizes, recovered from the header trace of the data described in Section V, is shown in Fig. 3(b). These packet sizes are determined by the dominant protocols in the flows, typically TCP for web browsing and file transfers and UDP for streaming traffic (e.g., audio and video applications). Other empirical studies have found dominant packet sizes at 40 bytes, 576 bytes, and 1500 bytes [8], [9]. Thus, we assume that each class of traffic, such as bulk transfers versus streaming traffic, results in one or more sub-flows, each with a fixed packet length.



Fig. 3. Variance versus mean of flow volumes (a) and Observed packet size distribution (b).

Assume each origin-destination flow is made up of $K$ sub-flows, each of a different "type." Further, all packets of a sub-flow of type $k$ have size $s_k$, for $k = 1, \dots, K$. Also let $N_{ik}$ be the number of packets in sub-flow $k$ of flow $i$ in a time interval. Let $\mathbf{1}$ be a vector of length $K$ with each element equal to 1, let $s$ be a vector of length $K$ with $k$th element equal to $s_k$, and let $N$ be the $n \times K$ matrix whose $(i,k)$th element is $N_{ik}$. Now, the packet volume vector $X$ can be written as

$$X = N \mathbf{1},$$

while the byte volume vector $B$ as

$$B = N s.$$

We will assume that the $N_{ik}$ are independent for all $i$ and $k$. This is the most important assumption of the model, in that the applicability of the model depends on how well this assumption is met. Now let $\Lambda$ and $\Phi$ be $n \times K$ matrices with $(i,k)$th elements $E(N_{ik})$ and $\mathrm{Var}(N_{ik})$, respectively. Under this model we have

$$\mathrm{Var}(X_i) = (\Phi \mathbf{1})_i, \qquad \mathrm{Cov}(X_i, B_i) = (\Phi s)_i, \qquad \mathrm{Var}(B_i) = (\Phi s^{(2)})_i,$$

where $s^{(2)}$ is the vector of element-wise squares of $s$.
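As a quick numerical check of these identities, the sub-flow construction can be simulated directly. The two packet sizes, the Gamma sub-flow counts, and all parameter values below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

n_flows, K, T = 5, 2, 50_000
s = np.array([40.0, 1500.0])     # fixed packet size of each sub-flow type
ones = np.ones(K)

# Independent sub-flow packet counts N[t, i, k]; Gamma with a common
# scale theta_k per type, so Var(N_ik) = theta_k * E(N_ik).
theta = np.array([2.0, 3.0])
shape = rng.uniform(5.0, 50.0, size=(n_flows, K))
N = rng.gamma(shape, theta, size=(T, n_flows, K))

X = N @ ones     # packet volumes: X = N 1
B = N @ s        # byte volumes:   B = N s

# Second-moment identities: Var(X_i) = (Phi 1)_i, Cov(X_i, B_i) = (Phi s)_i,
# with Phi[i, k] = Var(N_ik) = shape[i, k] * theta[k]^2 for the Gamma counts.
Phi = shape * theta**2
print(np.allclose(X.var(axis=0), Phi @ ones, rtol=0.1))
```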

We set $K = 2$ and parametrize the model by the sub-flow means, variances, and packet sizes; i.e., $\Lambda$, $\Phi$, and $s$. As a regularizing constraint for tomography we will assume that $s$ is known and that $\Phi_{ik} = \theta_k \Lambda_{ik}$, as elaborated in Section III. Some comments on the case of general $K$ are given in Section V.

C. Equivalence Under Poisson Model

If the packet volumes $X_i$ are distributed as independent Poisson random variables with parameters $\lambda_i$, and all packet size distributions $F_i$ have finite support $\{s_1, \dots, s_K\}$, then the distribution of $(X, B)$ under the compound model is identical to the one under the independent sub-flow model with Poisson sub-flows. Note that in this case (1) can be re-written as

$$B_i = \sum_{k=1}^{K} s_k N_{ik}, \qquad N_{ik} = \#\{j \leq X_i : S_{ij} = s_k\}.$$

The independence of the $N_{ik}$ for all $i$ and $k$ follows from the independence property of thinned Poisson processes. We will not assume the Poisson model in the remainder of the paper and instead treat the compound and independent sub-flow models separately.

III. IDENTIFIABILITY AND REGULARIZING ASSUMPTIONS

In this section, we address the issue of identifiability of the parameters of the two proposed models; i.e., we show that the parameters of interest are uniquely determined by the observed data distribution (or statistics thereof). The strategy for proving identifiability of the parameters in our models has two steps.


First, we establish identifiability of the parameters associated with the covariance of $W$, and subsequently prove the identifiability of the remaining parameters. The former is based on an identifiability result from [5]. We introduce next some useful definitions for subsequent developments.
1) Let $\mathcal{S}$ be the set of symmetric positive definite matrices of the form

$$\begin{bmatrix} \mathrm{diag}(\phi^{p}) & \mathrm{diag}(\phi^{pb}) \\ \mathrm{diag}(\phi^{pb}) & \mathrm{diag}(\phi^{b}) \end{bmatrix}$$

where $\phi^{p}$, $\phi^{pb}$, and $\phi^{b}$ are length-$n$ vectors of the variances of packet volumes, the covariances of packet and byte volumes, and the variances of byte volumes, respectively.
2) We call a weighted (directed) graph symmetric if the weight on edge $(u, v)$ is the same as the weight on edge $(v, u)$, for all edges.
3) A path is a finite sequence of connected nodes. The weight of a path is defined as the sum of the weights of all edges in that path. A path $P$ from node $u$ to node $v$ is called a minimum weight path if there is no path from $u$ to $v$ with a smaller weight than the weight of $P$.
4) We call a (minimum weight) routing scheme balanced if the path of the flow from node $u$ to node $v$ is the reverse of that of the flow from $v$ to $u$. In other words, if the traffic from node $u$ to node $v$ is carried on path $P$, then the traffic from node $v$ to $u$ is carried on the reverse of path $P$.

The following lemma from [5] proves useful for establishing identifiability.

Lemma 1: Under balanced minimum weight routing on a symmetric graph, flow volume covariances $\mathrm{Cov}(X)$ (alternatively $\mathrm{Cov}(W) \in \mathcal{S}$) are identifiable from the covariance of cumulative link measurements.

The above assumptions may seem restrictive in light of empirical studies on real networks [10], [11]. However, it is important to note the following:
1) The above conditions are sufficient, but not necessary. Easily verifiable necessary and sufficient conditions for identifiability are provided in [5]. These conditions typically require full rank of a certain matrix product of the routing matrix with itself.
2) The above conditions only need to hold on a sub-network for identifiability. Further, the actual routing policy may be arbitrary, as long as the "realized routing" on the sub-network is implied by some balanced minimum weight routing. Specifically, consider the situation when the network has internal and terminal nodes. Internal nodes neither generate nor sink traffic. Flows exist only between pairs of terminal nodes, which are only connected to internal nodes and not to other terminal nodes directly. This is a reasonable model if the network under consideration corresponds to a combination of a backbone network and sub-networks, with the latter being connected amongst themselves through the backbone network; the nodes of the backbone network are then the internal nodes. In this case the routing may be arbitrary, but the conditions of the above Lemma would still hold on the links connecting the terminal and internal nodes. A similar network structure was assumed in [4] to prove identifiability results.

For the rest of the paper we assume identifiability of second moments: thus, if two parameter settings yield the same covariance of the cumulative link measurements, they yield the same flow covariance matrix in $\mathcal{S}$. Further, given a $d$-dimensional one-to-one parametrization $\eta \mapsto \Sigma(\eta)$ of this covariance, identifiability of the covariance implies identifiability of $\eta$.

A. Compound Model

As mentioned before, in order to establish identifiability of this model, we require the following regularizing assumption: we assume the packet size distribution $F_i$ is the same for all flows $i$. As mentioned earlier, this implies $\mu_i = \mu$ and $\sigma^2_i = \sigma^2$ for all $i$.

Lemma 2: Under balanced minimum weight routing on a symmetric graph, and assuming all flows have identical packet size distributions, the parameters of the compound model are identifiable from cumulative link measurements.

Proof: With $\eta = (\phi', \mu, (\sigma^2 \lambda)')'$, it is clear that $\eta \mapsto \mathrm{Cov}(W)$ is a one-to-one map. Thus, based on the previous result, $\eta$ is identifiable. Identifiability of $\sigma^2$ follows from the fact that $\sigma^2 \lambda$ is a nonzero vector with non-negative entries and that no nontrivial vector with non-negative entries can lie in the null space of $A$; this is because all entries of $A$ are non-negative and every flow traverses at least one link. Thus, $\sigma^2$ can be identified from the relation $A(\sigma^2 \lambda) = \sigma^2 E(AX)$. Finally, we get $\lambda = (\sigma^2 \lambda)/\sigma^2$.

B. Independent Sub-Flows Model

There may be many different regularizing assumptions that lead to identifiability for tomography problems under the independent sub-flows model. However, we focus on the following, as it works well in practice. First note that in Fig. 3(b) the packet size distribution is concentrated on 2 support points, roughly corresponding to streaming traffic (small payloads) and bulk transfers (1500-byte payloads). Thus, we assume $K = 2$ with $s_1 = 40$ and $s_2 = 1500$ (for identifiability purposes, we only need that $s_1 \neq s_2$ and that both are known). Further, we assume that $\Phi_{ik} = \theta_k \Lambda_{ik}$ for $k = 1, 2$. This is similar to the assumption of proportionality of means and variances in classical tomography, except that we allow for separate proportionality constants for the two sub-flows.

Lemma 3: Under balanced minimum weight routing on a symmetric graph, and assuming two sub-flows with known sizes $s_1 \neq s_2$ and $\Phi_{ik} = \theta_k \Lambda_{ik}$, the parameters of the independent sub-flow model, $\Lambda$ and $(\theta_1, \theta_2)$, are identifiable from cumulative link measurements.

Proof: With $\eta = (\Phi_{\cdot 1}', \Phi_{\cdot 2}')'$, where $\Phi_{\cdot k}$ denotes the $k$th column of $\Phi$, the map $\eta \mapsto \mathrm{Cov}(W)$ can be seen to be one-to-one, since the matrix

$$\begin{bmatrix} 1 & 1 \\ s_1 & s_2 \\ s_1^2 & s_2^2 \end{bmatrix}$$

mapping $(\Phi_{i1}, \Phi_{i2})'$ to $(\mathrm{Var}(X_i), \mathrm{Cov}(X_i, B_i), \mathrm{Var}(B_i))'$ is of full rank when $s_1 \neq s_2$. Thus, $\eta$ is identifiable. Now

$$E(AX) = \frac{1}{\theta_1} A\Phi_{\cdot 1} + \frac{1}{\theta_2} A\Phi_{\cdot 2} \qquad (3)$$

$$E(AB) = \frac{s_1}{\theta_1} A\Phi_{\cdot 1} + \frac{s_2}{\theta_2} A\Phi_{\cdot 2}. \qquad (4)$$


Thus

$$E\begin{pmatrix} AX \\ AB \end{pmatrix} = \begin{bmatrix} A\Phi_{\cdot 1} & A\Phi_{\cdot 2} \\ s_1 A\Phi_{\cdot 1} & s_2 A\Phi_{\cdot 2} \end{bmatrix} \begin{pmatrix} 1/\theta_1 \\ 1/\theta_2 \end{pmatrix}.$$

As before, we have that $A\Phi_{\cdot 1} \neq 0$ and $A\Phi_{\cdot 2} \neq 0$, since a nontrivial vector with non-negative entries cannot be in the null space of $A$. Since $\Phi_{\cdot 1}$ and $\Phi_{\cdot 2}$ are identifiable, so are $\theta_1$ and $\theta_2$. Thus, $\Lambda$ is identifiable.

IV. ESTIMATION PROCEDURE AND ITS PROPERTIES

We adopt a pseudo-likelihood approach [12] for estimation purposes. Specifically, we obtain the estimates by maximizing a function which is not the likelihood of the available data, but rather the likelihood of a normal distribution that has the same mean and covariance as the distribution of the data. There are several computational advantages to using a normal likelihood, and in practice departures from the regularizing assumptions tend to have a greater impact than other misspecifications of the likelihood. For a given parametrization $(m(\vartheta), \Sigma(\vartheta))$ of the mean vector and covariance matrix of a random vector $Y$, the normal likelihood of observations $Y(1), \dots, Y(T)$,

$$\ell(\vartheta) = -\frac{T}{2} \log\det \Sigma(\vartheta) - \frac{1}{2} \sum_{t=1}^{T} (Y(t) - m(\vartheta))'\, \Sigma(\vartheta)^{-1}\, (Y(t) - m(\vartheta)), \qquad (5)$$

defines a pseudo-likelihood function.

However, optimizing the above likelihood to obtain an estimate of $\vartheta$ was found to have quite slow convergence for both second order and EM type algorithms. The intuitive reason is that certain parameters appear in both $m$ and $\Sigma$, which makes the likelihood surface ill-conditioned; the condition number of the information matrix for the normal approximation of the compound model described in Section V was found to be very large. Hence, we propose the following "hybrid" estimator. Suppose $\Sigma$ is parametrized as a one-to-one function of the parameter vector $\eta$; this is true for both of the proposed models, with $\eta$ as defined in the proofs of Lemmas 2 and 3 for the compound and the independent sub-flows model, respectively. For estimation, we follow a two-step strategy.

In the first step, we obtain a consistent estimate of $\eta$. The covariance-only pseudo-likelihood for $\eta$ is expressed in terms of the covariance of $Y$; since $\Sigma(\eta)$ is one-to-one, by definition we have that if $\eta \neq \eta'$ then $\Sigma(\eta) \neq \Sigma(\eta')$. Maximizing this pseudo-likelihood, obtained from (5), can be accomplished through the EM algorithm presented in the Appendix. Therefore, at the end of the first step, a consistent estimate $\hat{\eta}$ has been obtained.

The second step proceeds as follows. In both models the mean of the measurements satisfies $E(Y) = F\beta$ for some matrix $F$ and vector $\beta$, where $F$ is identifiable from the covariance, in other words a function of the $\eta$ estimated in the first step. For the compound model, $F = (A', \mu A')'$ and $\beta = \lambda$; for the independent sub-flows model, $F$ is the matrix with columns built from $A\Phi_{\cdot 1}$ and $A\Phi_{\cdot 2}$ displayed above and $\beta = (1/\theta_1, 1/\theta_2)'$. Next, letting the QR decomposition of $\hat{F}$ be given by $\hat{F} = QR$, the mean can be reparametrized as $Q\gamma$ with $\gamma = R\beta$; for $R$ of full rank, $\beta$ is recovered from $\gamma$. Further, it is easy to get a consistent estimate of $E(Y)$, e.g., the sample mean $\bar{Y}$. Finally, a consistent estimate of $\beta$ is obtained by solving

$$\hat{\beta} = \arg\min_{\beta} \|\bar{Y} - \hat{F}\beta\|^2. \qquad (6)$$

Since $\beta$ has non-negative entries, in practice the above optimization is done subject to the constraint $\beta \geq 0$, which corresponds to a quadratic program.

We establish next the main properties of the pseudo-likelihood estimator $\hat{\eta}$.

Proposition 1: For $\{W(t)\}$ (defined in Section II) a stationary sequence whose fourth moments exist, the pseudo-likelihood estimator $\hat{\eta}$ satisfies

$$\sqrt{T}\,(\hat{\eta} - \eta) \Rightarrow N(0, V) \qquad (7)$$

and a consistent estimate $\hat{V}$ of $V$ (8) is available under fairly general conditions (specifically, temporal independence is not required [13]).

Corollary: Under the conditions of Proposition 1, the hybrid estimator $(\hat{\eta}, \hat{\beta})$ is also asymptotically normal.
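The constrained least-squares step corresponding to (6) amounts to a non-negative least-squares (quadratic programming) problem. A minimal sketch with SciPy follows; the design matrix, its dimensions, and the noise level are synthetic stand-ins for the quantities identified in the first step.

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(3)

# Hypothetical design: A plays the role of the matrix F-hat in (6), b the
# observed mean vector, and lam_true >= 0 the parameters to recover.
A = np.abs(rng.normal(size=(30, 8)))
lam_true = np.abs(rng.normal(size=8))
b = A @ lam_true + 0.01 * rng.normal(size=30)

# Least squares subject to the non-negativity constraint lam >= 0.
lam_hat, residual = nnls(A, b)
print(np.max(np.abs(lam_hat - lam_true)))
```

With a well-conditioned design and small noise, the constrained solution is close to the truth while never producing the negative entries an unconstrained solve could yield.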



Fig. 5. Abilene Topology used for Numerical Study.

V. PERFORMANCE ASSESSMENT

Fig. 4. Mean byte volume versus mean packet volume for flows in Tokyo network trace data.

The proof and details of the asymptotic distribution of $\hat{\eta}$ and $\hat{\beta}$ are given in the Appendix. The computational complexity of this estimator is determined by the first step, which involves an EM algorithm; the per-iteration complexity of the EM step is derived in the Appendix.

The pseudo-likelihood (5) does not take into account that, given the packet volumes, the byte volumes are heteroskedastic, as opposed to the case of joint normality. An alternative is to assume that $X$ is normally distributed and that $B$ given $X$ is normal with mean and variance given by the relation above. In this case, the marginal distribution of $B$ does not correspond to any well known distribution and the likelihood of the observations cannot be written explicitly. Consequently, the obvious way to obtain estimates would be to use an EM algorithm where the "full data" are the flow volumes $(X, B)$. Since this likelihood more closely reflects the compound model, it can be expected to be statistically more efficient. However, it was found to have several drawbacks. First, the E-step can no longer be carried out analytically, and one has to resort to MCMC methods. Second, the MCMC E-step has to be carried out individually for each time interval, which makes it computationally quite expensive. Finally, the gains in statistical efficiency were found to be marginal at best. Thus, we do not pursue that direction in this paper.

The computational complexity of the first step in the estimation can be reduced by using a method of moments estimator for $\eta$ instead of maximum (pseudo-)likelihood estimation. For both models, the elements of the covariance matrix can be written as a linear combination of the elements of $\eta$. For the compound model this requires an additional estimation step in which $\mu$ is estimated and then treated as a known constant in the method of moments step. Thus, we get a linear relation between the sample covariances and $\eta$, and a consistent estimate of $\eta$ can be obtained by minimizing the corresponding least-squares criterion subject to the constraint $\eta \geq 0$. This again corresponds to solving a quadratic program.

For performance assessment, we use simulated and real network data from two sources. The data sets and simulation setups used in our numerical study are described next.

The first real network dataset was obtained from a complete packet header trace of a high capacity link [14]. We refer to this as the Tokyo trace data. We split the data into bidirectional flows between sub-networks, using the first 8 bytes of the IP-address to identify the corresponding sub-network, and aggregate flow volumes to a bin size of 5 minutes. The total duration under consideration is 12.5 hours. Thus, we have data on packet and byte volumes of 55 flow pairs (110 flows) in each of 150 time intervals. The mean byte volume of each of these 110 flows is plotted versus the mean packet volume in Fig. 4.

The second real network dataset was obtained from sampled Netflow data from Internet2, a large backbone network. We refer to this as the Internet2 data set. The data was collected on Feb 19th, 2009, and each packet was assigned to one of 72 source/destination flows as described in [15]. Flow volumes are aggregated to a bin size of 2 minutes and a total of 5.5 hours worth of data is used, resulting in 165 observations.

To generate data from the compound model, we simulate the packet volumes as independently Gamma distributed with means and variances equal to the corresponding parameters in the Tokyo trace data set. For each time interval and flow, given the packet volume, the byte volume is generated as normally distributed with mean and variance proportional to the packet volume; the proportionality constants are the mean packet size and the variance of the packet size. The mean packet size is estimated from the Tokyo trace data set over all flows, and the variance of the packet size distribution is calculated from the mean by assuming that the packet size distribution is supported entirely on 40 and 1500 bytes.
To simulate from the Independent Sub-Flow model, we generate two sub-flow (packet) volumes for each flow in each time interval. The first sub-flow corresponds to a packet size of 40 bytes and uses Gamma distributions with a common scale parameter across all flows and randomly generated shape parameters; similarly, the second sub-flow corresponds to a packet size of 1500 bytes, again with a common scale parameter across all flows and randomly generated shape parameters. The packet and byte volumes of the flows are then generated as the appropriate linear combinations of the sub-flow volumes.

Finally, we also look at data generated from the Independent Sub-Flow model above with the additional constraint that the scale parameter of all sub-flows is identical. In this case the means of the packet volumes are proportional to their variances over all flows, so estimation based on the classical tomography relation [4] would be consistent. We refer to this as the classical data generation method.
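Under the classical generation scheme, the variance-to-mean ratio of packet volumes is constant across flows, which is exactly what makes the classical tomography relation hold. A small check, with all parameter values hypothetical:

```python
import numpy as np

rng = np.random.default_rng(5)

# "Classical" generation: all sub-flows share one Gamma scale parameter,
# so packet volume variances are proportional to means across flows.
theta = 2.5                                   # common scale (hypothetical)
shape = rng.uniform(5.0, 50.0, size=(20, 2))  # per-flow, per-sub-flow shapes
N = rng.gamma(shape, theta, size=(20_000, 20, 2))
X = N.sum(axis=2)                             # packet volume of each flow

ratio = X.var(axis=0) / X.mean(axis=0)        # constant (= theta) across flows
print(ratio.round(2))
```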



Fig. 6. Estimated (with s.d. error bars) versus the true parameters for data simulated from the Compound model, and estimation under the Compound (a), Independent Sub-Flow (b), and Classical Tomography (c) models.

Fig. 7. Estimated (with s.d. error bars) versus the true parameters for data simulated from the Independent Sub-Flow model, and estimation under the Compound (a), Independent Sub-Flow (b), and Classical Tomography (c) models.

Fig. 8. Estimated (with s.d. error bars) versus the true parameters for data simulated from the classical data generation model, and estimation under the Compound (a), Independent Sub-Flow (b), and Classical Tomography (c) models.

For both the Tokyo trace data and the simulated data, the Abilene network topology (Fig. 5) is used. It consists of 11 nodes with directed edges between connected pairs of nodes (bidirectional links). Flows exist between all pairs of nodes, resulting in a total of 110 flows. We assume that these flows are routed through minimum distance paths. Further, we assume that cumulative flow volumes (SNMP data) are available from all the edges. For the Internet2 data, the topology used is a subset of that shown in Fig. 5 and consists of 9 nodes and 26 directed edges.

The key findings from the numerical study are discussed next. In the case of simulated data, 200 replications of each scenario were run to obtain the mean and standard deviation of the estimates. In Figs. 6-8, the results of estimating the mean packet volumes are shown using the Compound (left), Independent Sub-flows (center), and classical tomography (right) models, when the data generation mechanism corresponds to the Compound, the Independent Sub-flows, and the classical tomography model, respectively.

Fig. 9. Estimated (with s.d. error bars) versus the true parameters for data simulated from the Compound model and estimation under the Compound model (left) and the Independent Sub-Flow model (right).

It can be seen that when the model is correctly specified, the resulting estimates exhibit no discernible bias. Further, when data are generated from the Independent Sub-flows model, classical tomography performs well (see Fig. 7), while when

the data are generated from the classical tomography mechanism, both the Compound model and the Independent Sub-flows model estimate the means well (see Fig. 8).


5063

Fig. 10. Estimated (with s.d. error bars) versus the true parameters for data simulated from the Independent Sub-Flow model and estimated under the Compound model (left) and Independent Sub-Flow model (right).

In Figs. 9 and 10, the estimates obtained at the end of the first stage of the “hybrid” procedure are depicted, when the generative model is specified as Compound and Independent Sub-flows, respectively. It can be seen that the results exhibit no discernible bias in the case of correct specification (left panels in Fig. 9 and right panels in Fig. 10). On the other hand, estimates from the Independent Sub-flows model exhibit a strong systematic bias for data generated from the Compound model (right panels in Fig. 9), while estimates from the Compound model show no such bias for data generated from the Independent Sub-flows model (left panels in Fig. 10). Fig. 11 shows the corresponding estimates when data are generated from the classical model and estimation is performed with the Compound and Independent Sub-flows models. The Independent Sub-flows model performs well in general, while the Compound model adequately estimates only part of the parameters. Finally, the variance of the estimates decreases as the sample size increases (results not shown). Table I shows the median (over flows) of the relative mean squared error for various scenarios of data generation and estimation. The median is used in order to avoid the results being overwhelmed by lighter flows, which have large relative MSE. The relative MSE for a parameter is defined as follows: let θ be the true value of the parameter and let θ̂_r be the estimate from the r-th replication out of a total of R replications; then the relative MSE is equal to (1/R) Σ_{r=1}^{R} (θ̂_r − θ)² / θ².
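As a concrete sketch of this summary statistic (our own helper functions, not code from the paper):

```python
def relative_mse(theta_true, estimates):
    """Average over replications of (estimate - true)^2 / true^2."""
    r = len(estimates)
    return sum((e - theta_true) ** 2 for e in estimates) / (r * theta_true ** 2)

def median_relative_mse(true_by_flow, est_by_flow):
    """Median over flows of the per-flow relative MSE; the median keeps
    light flows, which tend to have large relative errors, from dominating."""
    vals = sorted(relative_mse(t, e) for t, e in zip(true_by_flow, est_by_flow))
    n = len(vals)
    return vals[n // 2] if n % 2 else 0.5 * (vals[n // 2 - 1] + vals[n // 2])

# Two flows, two replications each: per-flow relative MSEs are 0.01 and 1.0.
print(median_relative_mse([10.0, 1.0], [[9.0, 11.0], [0.0, 2.0]]))  # 0.505
```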


Fig. 11. Estimated (with s.d. error bars) versus the true parameters for data simulated from the classical data generation model and estimated under the Compound model (left) and Independent Sub-Flow model (right).

TABLE I
MEDIAN RELATIVE MSE FOR VARIOUS ESTIMATION AND GENERATIVE MODELS

Fig. 12. Estimated versus the true parameters for Tokyo Data, assuming the Compound model (left) and Independent Sub-Flow model (right).

Figs. 12 and 13 display the estimated versus true values for the Tokyo trace data for the three estimation techniques. In Fig. 12, the Independent Sub-flows model clearly does better than the Compound model. For the final estimates (Fig. 13), both the Compound model and the Independent Sub-flows model suffer from a single outlier, while the classical tomography estimates contain many values equal to 0. The outlier in Fig. 13(a) and (b) corresponds to the flow in Fig. 2(c). This is clearly an exceptional flow. We substitute this flow by the flow in Fig. 14, which is constructed by averaging the packet and byte volumes over all other flows in each time interval. Fig. 15 shows the estimated versus true values for the three estimation methods with this replacement. In this case, the estimates based on the Compound and Independent Sub-flows models appear to outperform those based on classical tomography. We can quantify the performance of the methods in two different ways. We define the usual mean squared error (MSE) between a vector of estimated values θ̂ and a vector of true values θ as MSE(θ̂, θ) = (1/n) Σᵢ (θ̂ᵢ − θᵢ)². Suppose

θ̄ represents the mean of θ, i.e., θ̄ = (1/n) Σᵢ θᵢ, and σ²(θ) represents the variance in θ, i.e., σ²(θ) = (1/n) Σᵢ (θᵢ − θ̄)². Instead of reporting the MSE, we report the closely related R-squared statistic, defined as R² = 1 − MSE(θ̂, θ)/σ²(θ). The R-squared statistic is always less than 1 and increases with decreasing MSE. We also report the mean squared relative error (MSRE), defined as MSRE = (1/n) Σᵢ ((θ̂ᵢ − θᵢ)/θᵢ)². Using the above metrics, the results of Fig. 15 can be summarized as follows. The R-squared statistics for estimates based on the Compound, Independent Sub-Flow and classical tomography methods show that, in an MSE sense, the Compound and Independent Sub-Flow methods outperform the classical tomography method. On the other hand, the MSRE values are 5.10, 0.511 and 1.24, respectively, which implies that, based on MSRE, the classical tomography estimates are better than the estimates from the Compound model. Note that all


Fig. 13. Estimated versus the true means for Tokyo Data, assuming the Compound model (a), Independent Sub-Flow model (b), and Classical Tomography (c).
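The comparison criteria used in this section, the R-squared statistic and the MSRE, can be sketched directly from their definitions (a minimal illustration with our own function names):

```python
def mse(est, true):
    """Mean squared error between estimate and truth vectors."""
    return sum((a - b) ** 2 for a, b in zip(est, true)) / len(true)

def r_squared(est, true):
    """R^2 = 1 - MSE / Var(true): always at most 1, larger is better."""
    m = sum(true) / len(true)
    var = sum((t - m) ** 2 for t in true) / len(true)
    return 1.0 - mse(est, true) / var

def msre(est, true):
    """Mean squared relative error; small flows can dominate this metric."""
    return sum(((a - b) / b) ** 2 for a, b in zip(est, true)) / len(true)

true = [1.0, 2.0, 3.0]
print(r_squared([1.0, 2.0, 4.0], true))  # 0.5
print(msre([1.0, 2.0, 4.0], true))
```

The two metrics can rank methods differently: R-squared weights absolute errors (dominated by heavy flows), while MSRE weights errors relative to each flow's true value (dominated by light flows), which is exactly the divergence seen in the results above.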

VI. PACKET SIZE TOMOGRAPHY

Fig. 14. Substitute flow volumes for the outlier flow.

methods perform poorly in a relative sense for the small flows, and these relative errors may be the dominant terms in the MSRE. Fig. 16 shows the estimated versus true values for the three estimation methods when applied to the Internet2 dataset. In this case, the R-squared statistics are 0.569, 0.809 and 0.682, respectively, for estimates based on the Compound, Independent Sub-Flow and classical tomography models. Thus, in this case the Independent Sub-flows model performs best in terms of MSE, while classical tomography performs best in terms of relative MSE. The above results show that it is possible to substantially improve (in the MSE sense) the estimates based on classical tomography using the models presented in this paper. This is not surprising, since the structural models allow us to combine both the packet count and byte count data to estimate flow volumes. Naturally, we have to make certain assumptions to utilize all the information, and hence the performance of estimates based on these models depends on how closely these models approximate the true behavior of the network flows.
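The compound relation between packet and byte volumes that underlies these comparisons can be illustrated with a small simulation (all parameter values hypothetical; Gaussian packet sizes are used purely for illustration and are not the paper's distributional assumption): the byte volume is a random sum of packet sizes over a Poisson packet count, so the ratio of mean byte volume to mean packet volume recovers the mean packet size.

```python
import math
import random

def draw_poisson(lam, rng):
    # Knuth's multiplication method; adequate for moderate lam.
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def simulate_compound_flow(lam, psi, sigma2, n_obs, rng):
    """(packet, byte) volume pairs under a compound model: packet count
    N ~ Poisson(lam); byte volume is the sum of N i.i.d. packet sizes
    with mean psi and variance sigma2 (Gaussian for illustration only)."""
    sd = math.sqrt(sigma2)
    data = []
    for _ in range(n_obs):
        n = draw_poisson(lam, rng)
        data.append((n, sum(rng.gauss(psi, sd) for _ in range(n))))
    return data

rng = random.Random(7)
data = simulate_compound_flow(lam=50, psi=500.0, sigma2=100.0, n_obs=2000, rng=rng)
mean_pkts = sum(n for n, _ in data) / len(data)
mean_bytes = sum(b for _, b in data) / len(data)
print(mean_bytes / mean_pkts)  # close to psi = 500
```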

The packet size distribution of a flow is a useful quantity for network monitoring purposes and is indicative of the traffic composition [16]. Joint modeling of packet and byte volumes allows us to estimate parameters of the packet size distribution from cumulative measurements as well. This is most easily accomplished through the Compound model, and that is the focus of this section. We start by removing the constraint of a common packet size. The objective here is to estimate the vector of mean packet sizes of all flows. Recall that if the covariance matrix is constrained to be diagonal as described in Section 3.1, then it is identifiable from the observations. This in turn means that the mean packet sizes are identifiable. Mean packet volumes are not identifiable and are in fact “confounded” with the mean packet sizes. Thus, we use the parametrization (2). With this parametrization, the “covariance only pseudo-likelihood” (5) is well behaved. As before, the pseudo-likelihood estimator maximizes (5). An EM algorithm very similar to that used for the hybrid estimator and given in the Appendix can be used for the optimization. The only difference is that (18) is replaced by

(9)

Remark: The computational complexity of each EM step is the same as that for the Hybrid Estimator of Section 3.1.

A. Numerical Study

First, we consider the performance of our estimates on simulated data. The data are generated as described in Section V for the Compound model, with the exception that the mean and variance of the packet size distribution are calculated separately for each flow and the data are generated correspondingly. A sample size of is considered. Fig. 17 shows the estimated versus the true values of the parameters. Note that the “natural” parameters of the covariance matrix are well estimated. However, certain parameters have large MSE of estimation. The reason


Fig. 15. Estimated versus the true means for Tokyo Data after replacing the outlier flow, assuming the Compound model (a), Independent Sub-Flow model (b), and Classical Tomography (c).

Fig. 16. Estimated versus the true means for Internet2 Data, assuming the Compound model (a), Independent Sub-Flow model (b), and Classical Tomography (c).

for this is as follows. Estimating is similar to estimating a regression coefficient, with playing the role of the variance of the corresponding covariate. As in any regression problem, if the covariate variances span a big range of values, the coefficients corresponding to small covariate variances are not well estimated. This issue is demonstrated more clearly in Fig. 18. The plot in the left panel shows the MSE from the above simulation versus the packet volume variance (both on a log scale), while the plot in the right panel shows the asymptotic variance (as described in the following) versus the packet volume variance (again on log scales). The asymptotic variance is calculated from the Fisher information matrix corresponding to the covariance-only likelihood (5) when alone is unknown, evaluated at the true value of . Both figures show that a large variance for the estimates is observed for small values of the packet volume variance. The differences between the two plots are expected, due to departures from normality in the data. In reality, the interest is primarily in estimating properties of heavy flows, which usually correspond to large values of the packet volume variance. Since the packet volume variance is itself well estimated, reliable estimates can be provided for the most interesting flows. Fig. 19 shows the estimated versus true values of the mean packet size for the Tokyo trace data for heavy flows only. Here, heavy flows are defined to be the top 40% of flows in terms of estimated packet volume variance. Naturally, the data would

have some departures from the compound model that would impact the performance of the estimates. It is likely that more highly aggregated data would follow the compound model more closely and would lead to better estimation.

VII. DISCUSSION

The use of packet and byte information for traffic volume measurement, along with structural modeling of their joint distribution, opens up several options for more detailed network tomography. We have proposed two models for this task, the Compound model and the Independent Sub-flows model. Further, we made specific network-wide regularizing assumptions that lead to identifiability of the parameters of interest. These choices and their performance clearly depend on the data at hand and are also closely tied to the estimation strategy. Estimation, in turn, poses significant challenges. As demonstrated by the simulation studies, mis-specification of the distribution family is not necessarily the biggest challenge; the heterogeneity observed in real computer network flows and, of course, departures from the regularizing assumptions are significant factors. Finally, the Independent Sub-flows model and the Compound model provide a framework for defining, investigating the identifiability of, and estimating several interesting characteristics of the joint distribution of packet and byte volumes


Fig. 17. Estimated (with s.d. error bars) versus the true parameters for data simulated and estimated under the Compound model.

Fig. 18. Dependence of (a) MSE (simulation) and (b) asymptotic variance (normal approximation) on the packet volume variance.

of a flow. In particular, the Independent Sub-flows model can incorporate a larger number of sub-flows. Indeed, it is easy to see that the variances of up to 3 sub-flows are identifiable from the covariance of packet and byte volumes of flows.


(10)

Assume that at the th E-step the estimated parameter is . Let

(11)

Fig. 19. Estimated versus true mean packet sizes for heavy flows in the Tokyo trace data.

Using carefully chosen parametric families and the information from higher cumulants, it may be possible to estimate an even larger number of sub-flows; however, the practical viability of such an approach could be limited due to the various challenges posed by data from real networks.

(12) Now [see (13), shown at the bottom of the page]. Define

(14)

APPENDIX A
EM ALGORITHM FOR COVARIANCE ONLY PSEUDO-LIKELIHOOD

A very simple EM algorithm can be derived to maximize the pseudo-likelihood in (5). Assume . Then

(15)

The above would be the true likelihood of if

(16)

were distributed i.i.d. We use this model to derive the EM algorithm. Let be the likelihood function based on .

Using the above, we get the expectation step. E-Step

(13)


The M-step involves maximization of over and is straightforward from the following observations. The first and second terms in the last expression involve only . The maximum likelihood estimate of , subject to the diagonal constraint, is given simply by replacing the off-diagonal elements of the unconstrained MLE with 0. Let be the function which replaces the off-diagonal elements of a matrix with zeros. The th stage M-step then gives the following parameter estimates.

M-Step

and

Let be a sequence of random matrices, defined as

We will establish that and , for some random vector and constant matrix . In the following, all functions and derivatives are evaluated at . It is easy to show that

(17) (21) (18) Define .. .

(19)

Computational Complexity

The computational complexity of each EM step can be obtained as follows. Assume that the number of flows, , is of order . The matrix inversions in (11)–(12) involve matrices, and hence have complexity . Computing involves multiplying a matrix with a matrix and would have complexity if done naively. However, if the sparsity of is exploited, then the complexity reduces to . On the other hand, computing from involves multiplying a matrix with a matrix, neither of which is necessarily sparse. The complexity of this operation is . Note that we never need to multiply two matrices. Thus, the overall complexity of each iteration is . Note that while (12) is expressed in terms of the individual , we only need the following “sufficient” statistic

Thus, . From the CLT, converges in distribution to a random matrix with all entries jointly normally distributed. Thus, has a multivariate normal distribution. The mean of is and the covariance matrix is given by . On the other hand

(22)

for evaluation of (13)–(16). This would involve a one-time cost of .

APPENDIX B
ASYMPTOTIC DISTRIBUTION OF THE ESTIMATOR

For the purpose of deriving the asymptotic distribution of the estimator that maximizes (5), we make explicit the dependence on

(20)

We refer to the true value of the parameters as and to the estimate as .

Proof of Proposition 1: Let be a sequence of random vectors, defined as

Thus, . Clearly, consistent estimates of can be obtained by replacing in (8) by a consistent estimate such as . Now, and can be consistently estimated by replacing and by their consistent estimates, and expectations by their empirical means, in (7). Also, is consistently estimated by .

Proof of the Corollary: For the asymptotic distribution of the hybrid estimator, note that, neglecting the positivity constraints, the objective function (6) is maximized for

where . Now


Further

ACKNOWLEDGMENT

Hence

The authors would like to thank two anonymous referees for useful comments and suggestions that improved the presentation of the material. They would also like to thank Joel Vaughan for providing the Internet2 data.

REFERENCES

Thus

So, finally,

(23)

or, making explicit the dependence on in (23):

Now, from the proposition, is a mean-0 normal random variable. Similarly, is another mean-0 normal random variable. Thus, a simple application of the delta method suggests an asymptotic distribution given by

where

are jointly normally distributed with

(24)

The partial derivatives in the above expression can be written more explicitly as

[1] E. Lawrence, G. Michailidis, V. Nair, and B. Xi, “Network tomography: A review and recent developments,” in Frontiers in Statistics, J. Fan and H. Koul, Eds. London, U.K.: Imperial College Press, 2006, pp. 345–364.
[2] N. G. Duffield, C. Lund, and M. Thorup, “Learn more, sample less: Control of volume and variance in network measurement,” IEEE Trans. Inf. Theory, vol. 51, no. 5, pp. 1756–1775, May 2005.
[3] L. Peterson and B. Davie, Computer Networks: A Systems Approach. San Francisco, CA: Morgan Kaufmann, 2003.
[4] J. Cao, D. Davis, S. Wiel, and B. Yu, “Time-varying network tomography: Router link data,” J. Amer. Statist. Assoc., vol. 95, pp. 1063–1075, 2000.
[5] H. Singhal and G. Michailidis, “Identifiability of flow distributions from link measurements with applications to computer networks,” Inv. Probl., vol. 23, no. 5, pp. 1821–1849, 2007 [Online]. Available: http://stacks.iop.org/0266-5611/23/1821
[6] Y. Vardi, “Network tomography: Estimating source-destination traffic intensities from link data,” J. Amer. Statist. Assoc., vol. 91, pp. 365–377, 1996.
[7] A. Chen, J. Cao, and T. Bu, “Network tomography: Identifiability and Fourier domain estimation,” in Proc. 26th IEEE Int. Conf. Computer Communications, May 2007, pp. 1875–1883.
[8] S. A. R. Group, IP Data Analysis [Online]. Available: http://research.sprintlabs.com/packstat/packetoverview.php
[9] R. Sinha, C. Papadopoulos, and J. Heidemann, Internet Packet Size Distributions: Some Observations, Univ. Southern California, Los Angeles, Tech. Rep. ISI-TR-2007-643, 2007.
[10] U. Weinsberg, Y. Shavitt, and Y. Schwartz, “Stability and symmetry of internet routing,” in Proc. 28th IEEE Int. Conf. Computer Communications Workshops, Piscataway, NJ, 2009, pp. 407–408.
[11] V. E. Paxson, “Measurement and analysis of end-to-end Internet dynamics,” Ph.D. dissertation, Univ. California, Berkeley, CA, 1998.
[12] C. Gourieroux, A. Monfort, and A. Trognon, “Pseudo maximum likelihood methods: Theory,” Econometrica, vol. 52, no. 3, pp. 681–700, 1984.
[13] M. Taniguchi and Y. Kakizawa, Asymptotic Theory of Statistical Inference for Time Series. New York: Springer-Verlag, 2000.
[14] K. Cho, WIDE-TRANSIT 150 Megabit Ethernet Trace 2007-01-09 (Anonymized) (collection) [Online]. Available: http://mawi.wide.ad.jp/mawi/samplepoint-F/20070109
[15] J. Vaughan, S. Stoev, and G. Michailidis, Network-Wide Statistical Modeling and Prediction of Computer Traffic, Dept. Statistics, Univ. Michigan, Ann Arbor, Tech. Rep. 508, 2010.
[16] K. Thompson, G. Miller, and R. Wilder, “Wide-area Internet traffic patterns and characteristics,” IEEE Network, vol. 11, no. 6, pp. 10–23, Nov./Dec. 1997.

Harsh Singhal graduated from the University of Michigan in April 2009 with a Ph.D. in statistics. His research interests are in multivariate statistics, machine learning, dynamical systems, optimal experiment design and survival analysis. He currently works in Consumer Risk at Bank of America.

George Michailidis received his Ph.D. in mathematics from UCLA in 1996. He was a postdoctoral fellow in the Department of Operations Research at Stanford University from 1996 to 1998. He joined the University of Michigan in 1998, where he is currently a Professor of Statistics, Electrical Engineering, and Computer Science. His research interests are in the areas of stochastic network modeling and performance evaluation, queuing analysis and congestion control, statistical modeling and analysis of Internet traffic, network tomography and analysis of high dimensional data with network structure.
