Fitting Univariate Distributions to Computer Network Traffic Data Using GUI
Petar Čisar* and Sanja Maravić Čisar**
* Academy of Criminalistic and Police Studies, Belgrade-Zemun, Serbia
** Subotica Tech, Department of Informatics, Subotica, Serbia
[email protected],
[email protected]
Abstract - The available literature is not unanimous on which type(s) of probability distribution best model computer network traffic. The statistical analysis presented in this paper demonstrates the use of a graphical interface for fitting univariate distributions to authentic network traffic data. The analysis is carried out in a Matlab-based GUI, using the Distribution Fitting Tool.
I. INTRODUCTION
A random variable is a variable (typically denoted by x) that takes a single numerical value determined by chance. A probability distribution is a graph, table or formula that gives the probability for each value of the random variable. A univariate distribution is a probability distribution of one random variable.

Discrete distributions - If x is a random variable, then P(x) denotes the probability that x occurs. It must hold that $0 \le P(x) \le 1$ for each value of x and

$$\sum_j p_j = 1$$

where j runs over all possible values that x can take and $p_j$ is the probability at $x_j$.

Figure 1. Discrete distribution [8]

Continuous distributions - A continuous probability function f(x) is a function that satisfies the following properties:
1. The probability that x lies between two points a and b is
$$p[a \le x \le b] = \int_a^b f(x)\,dx$$
2. It is non-negative for all real x.
3. The integral of the probability function is one, that is,
$$\int_{-\infty}^{+\infty} f(x)\,dx = 1$$

Figure 2. Continuous distribution [8]

Since continuous probability functions are defined for an infinite number of points over a continuous interval, the probability at a single point is always zero. Probabilities are measured over intervals, not at single points: the area under the curve between two distinct points defines the probability for that interval. Consequently, the height of the probability function can in fact be greater than one. The property that the integral must equal one is the counterpart of the property of discrete distributions that the sum of all probabilities must equal one.

Fitting a distribution consists of finding a mathematical function that represents a statistical variable well. A common situation in statistics is the following: there are observations of a quantitative character x1, x2, ..., xn, and the task is to test whether those observations, being a sample from an unknown population, belong to a population with a probability density function (pdf) f(x,q), where q is a vector of parameters to be estimated from the available data. In Matlab, pdfs are estimated with appropriate parameters. Each supported pdf represents a parametric family of distributions; the input arguments are arrays of outcomes followed by a list of parameter values specifying a particular member of the distribution family.
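The defining properties of discrete and continuous distributions given in this section can be checked numerically. The following is a minimal sketch in Python (used here only for illustration; the analysis in this paper is done in Matlab). The fair die, the uniform density on [0, 0.5] and the helper names `uniform_pdf` and `integrate` are assumptions chosen for the example, not taken from the paper.

```python
import math

# Discrete case: probabilities of a fair six-sided die (hypothetical example).
pmf = {face: 1 / 6 for face in range(1, 7)}
assert all(0 <= p <= 1 for p in pmf.values())  # each probability lies in [0, 1]
total = sum(pmf.values())                      # sum over all possible values


def uniform_pdf(x, low=0.0, high=0.5):
    """Density of a uniform distribution on [low, high]; its height is 1/(high - low)."""
    return 1.0 / (high - low) if low <= x <= high else 0.0


def integrate(f, a, b, steps=10_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    width = (b - a) / steps
    return width * sum(f(a + (i + 0.5) * width) for i in range(steps))


height = uniform_pdf(0.25)                # density value at one point (exceeds one)
area = integrate(uniform_pdf, 0.0, 0.5)   # integral over the whole support
prob = integrate(uniform_pdf, 0.1, 0.3)   # probability that 0.1 <= x <= 0.3

print("sum of discrete probabilities:", total)
print("pdf height:", height, "| total area:", area, "| P(0.1 <= x <= 0.3):", prob)
```

The density takes the value 2 on its support, yet the area under it is still one, which illustrates why the height of a pdf is not itself a probability.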
II. COMPUTER NETWORK TRAFFIC DISTRIBUTIONS
The different computer network traffic models each have their own advantages and disadvantages. The type of network under observation and the traffic characteristics largely determine the choice of the traffic model used for analysis. Traffic models that cannot detect or describe the statistical characteristics of the actual traffic on the network are to be avoided, since choosing such models results in under-estimation or over-estimation of network performance. There is no single model that can be used universally for modeling traffic in all types of networks. For heavy-tailed traffic, it can be shown that the Poisson distribution model under-estimates the traffic [1]. In the case of high-speed networks with unexpected demand on packet transfers, Pareto distribution based traffic models are acceptable, since they take into consideration the long-term correlation in packet arrival times [2]. Markov models, although mathematically correct, fail to fit the actual traffic of high-speed networks. The available literature is not unanimous on which type(s) of probability distribution best model network traffic. For example, the uniform, Poisson, lognormal (Figure 3), Pareto and Rayleigh distributions have been used in different applications.
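One way to see why a light-tailed model can under-estimate heavy-tailed traffic is to compare tail probabilities P(X > x) of an exponential distribution (the inter-arrival law of a Poisson process) with those of a Pareto distribution of equal mean. The sketch below is only illustrative: the Pareto shape 1.5, the scale 1.0 and the evaluation points are assumed values, not parameters from the cited studies.

```python
import math


def exponential_tail(x, mean):
    """P(X > x) for an exponential distribution with the given mean."""
    return math.exp(-x / mean)


def pareto_tail(x, shape, scale):
    """P(X > x) for a Pareto distribution with the given shape and scale (minimum value)."""
    return 1.0 if x < scale else (scale / x) ** shape


shape, scale = 1.5, 1.0                # assumed illustrative Pareto parameters
mean = shape * scale / (shape - 1)     # Pareto mean, finite because shape > 1

for x in (5, 10, 30, 100):
    print(f"x = {x:>3}: exponential tail = {exponential_tail(x, mean):.3e}, "
          f"Pareto tail = {pareto_tail(x, shape, scale):.3e}")
```

For large x the Pareto tail decays polynomially while the exponential tail decays exponentially, so with equal means the exponential model assigns far less probability to very large values.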
Figure 3. Network traffic distribution [3]
III. NETWORK TRAFFIC DATA
For the analysis of network traffic curves, this research uses daily, weekly and monthly traffic graphs of several larger Internet users, produced by the popular network software MRTG (Multi Router Traffic Grapher) for periods of one day, one week and one month. Without loss of generality, the curves of three users are presented below; the observed traffic curves of other users do not deviate significantly from the forms shown here [4].
[Figure panels: User 1, User 2, User 3; daily, weekly and monthly curves]
Figure 4. Traffic curves of different users
The daily outgoing traffic of a typical user (in this case, for September 21, 2010, Tuesday) is taken as an example, in which the following four characteristic intervals can be identified (Figure 5): 02–06h (night traffic), 06−10h (morning traffic), 10−22h (daily traffic) and 22−02h (night traffic).
Figure 5. Daily traffic curve
Using the ability of the monitoring software PRTG [5] to provide numeric values as well (Figure 6), 349 consecutive hourly averages were taken for the first 15 days of the monthly period (Aug 24 - Sep 06).
Figure 6. PRTG - Traffic samples (example)
Figure 7. Fitting distributions to network traffic data using GUI
This paper uses 144 samples of daily traffic (10-22 h). Descriptive statistics for these samples are given in the following table (rates in kbit/s).

TABLE I. DESCRIPTIVE STATISTICS OF DAILY TRAFFIC

Descriptive statistics       10-22h
Mean                         59144.47319
Standard Error               443.5149934
Median                       60117.992
Mode                         #N/A
Standard Deviation           5322.179921
Sample Variance              28325599.11
Kurtosis                     -0.38901581
Skewness                     -0.280017092
Range                        25817.343
Minimum                      45322.681
Maximum                      71140.024
Sum                          8516804.14
Count                        144
Confidence Level (95.0%)     876.6926133
Upper Control Limit          75111.01296
Lower Control Limit          43177.93343
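Several entries of Table I are linked by simple arithmetic, which gives a quick consistency check on the tabulated values. The sketch below (Python, for illustration only) recomputes them from the mean, standard deviation, sum, count, minimum and maximum. That the control limits equal the mean plus or minus three standard deviations is inferred here from the numbers themselves; the paper does not state how they were defined.

```python
import math

# Values taken from Table I (kbit/s).
n = 144
mean = 59144.47319
std_dev = 5322.179921
total = 8516804.14
minimum, maximum = 45322.681, 71140.024

mean_from_sum = total / n                 # compare with the tabulated Mean
standard_error = std_dev / math.sqrt(n)   # compare with Standard Error
variance = std_dev ** 2                   # compare with Sample Variance
value_range = maximum - minimum           # compare with Range
upper_limit = mean + 3 * std_dev          # compare with Upper Control Limit (inferred rule)
lower_limit = mean - 3 * std_dev          # compare with Lower Control Limit (inferred rule)

for label, value in [
    ("mean from sum", mean_from_sum),
    ("standard error", standard_error),
    ("sample variance", variance),
    ("range", value_range),
    ("upper control limit", upper_limit),
    ("lower control limit", lower_limit),
]:
    print(f"{label:>20}: {value:.4f}")
```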
Using these data as input to the GUI Distribution Fitting Tool [7] gives the result shown in Figure 7.
Weibull distribution − A random variable X is said to have a Weibull distribution with parameters α and β if the pdf of X is:
$$f(x;\alpha,\beta) = \begin{cases} \dfrac{\alpha}{\beta^{\alpha}}\, x^{\alpha-1} e^{-(x/\beta)^{\alpha}}, & x \ge 0 \\ 0, & x < 0 \end{cases}$$
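To make the Weibull definition concrete, the following sketch evaluates the pdf with shape α and scale β and estimates both parameters from a sample. It is written in Python for illustration and is not the Matlab tool used in this paper. Maximum likelihood is assumed as the fitting criterion, the helper names `weibull_pdf` and `fit_weibull` are hypothetical, and the data are synthetic (generated with assumed parameters α = 3, β = 2), not the traffic samples.

```python
import math
import random


def weibull_pdf(x, alpha, beta):
    """Weibull density with shape alpha and scale beta; zero for x < 0."""
    if x < 0:
        return 0.0
    if x == 0 and alpha < 1:
        return math.inf  # the density is unbounded at zero when alpha < 1
    return (alpha / beta ** alpha) * x ** (alpha - 1) * math.exp(-((x / beta) ** alpha))


def fit_weibull(samples, lo=0.01, hi=100.0, iterations=100):
    """Maximum-likelihood estimate (alpha, beta) from strictly positive samples.

    The shape solves a one-dimensional equation, located here by bisection
    (assuming the true shape lies between lo and hi); the scale then follows
    in closed form.
    """
    top = max(samples)
    scaled = [s / top for s in samples]  # rescale into (0, 1] to avoid overflow
    mean_log = sum(math.log(s) for s in scaled) / len(scaled)

    def score(k):
        # Derivative of the profile log-likelihood with respect to the shape, per sample.
        powers = [s ** k for s in scaled]
        weighted = sum(p * math.log(s) for p, s in zip(powers, scaled))
        return weighted / sum(powers) - 1.0 / k - mean_log

    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if score(mid) < 0:
            lo = mid
        else:
            hi = mid
    alpha = 0.5 * (lo + hi)
    beta = top * (sum(s ** alpha for s in scaled) / len(scaled)) ** (1.0 / alpha)
    return alpha, beta


random.seed(0)
# Note: random.weibullvariate takes (scale, shape), the reverse of (alpha, beta) above.
data = [random.weibullvariate(2.0, 3.0) for _ in range(5000)]
alpha_hat, beta_hat = fit_weibull(data)
print(f"estimated shape = {alpha_hat:.3f}, estimated scale = {beta_hat:.3f}")
```

Rescaling the sample by its maximum leaves the shape estimate unchanged and keeps the powers bounded, which matters for values as large as the traffic rates in Table I.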