A Phenomenological Approach to Internet Traffic Self-Similarity* Javier Aracil, Richard Edell and Pravin Varaiya
We analyze four days' worth of IP traffic measured in the UCB campus network and study the long-range dependence features of this traffic. We find that the FTP data arrival process is self-similar. This behavior can be explained in terms of the heavy-tail properties of the distribution of interarrival times. A correspondence with fractal renewal processes can be established, giving analytical justification to our hypothesis. We also analyze chaotic maps and their application to Internet traffic models. We study the role that network transport protocols (TCP) play in this particular behavior and we draw some useful conclusions for network dimensioning and control.
I. INTRODUCTION

Recent studies [13, 15] show that Internet traffic should be modeled as a self-similar process and not as a Poisson process. These studies find that Internet traffic exhibits long-range dependence, in contrast with the short-range dependence of a standard Poisson or compound Poisson process. The implications of self-similarity in network provisioning have already been outlined [9] in terms of congestion control and avoidance. There is a large literature on self-similar processes and their applications to various fields including economics, fluid theory and telecommunications traffic (see [18] and references therein). A self-similar process [18] has slowly decaying correlations (or a spectral density that tends to infinity in the vicinity of zero). Intuitively, the averaged packet counting process X_k^{(m)} does not smooth out to the mean rate as the aggregation level m increases, as a Poisson process would. This behavior is also observed in the Index of Dispersion of Counts (IDC), also called the Fano Factor [4], which grows with the size of the counting window. The variance of X_k^{(m)} decays slowly with the aggregation level m [15]. This indicates the presence of burstiness at any time scale and has dramatic consequences on network performance [9].

We present a statistical analysis of four days of IP traffic between U.C. Berkeley and the "rest of the world." Our purpose is twofold. First, we check currently accepted hypotheses on traffic modeling for Internet services, such as FTP. Second, we present an explanation of the self-similar nature of FTP data traffic in terms of the distributions of user think time and file transmission duration, analyzing the effect of network transport protocols on such behavior.
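The Poisson baseline mentioned above can be illustrated with a short numerical sketch. It is not part of the measurement study: the rate, the number of slots and the function names are illustrative assumptions. For a Poisson process the variance of the averaged counts should fall roughly as 1/m and the IDC should stay close to 1 at every window size, which is the behavior a self-similar process departs from.

```python
import random
import statistics

def poisson_counts(rate, n_slots, rng):
    """Counts per unit-length slot for a Poisson process of the given rate."""
    counts = [0] * n_slots
    t = rng.expovariate(rate)
    while t < n_slots:
        counts[int(t)] += 1
        t += rng.expovariate(rate)
    return counts

def block_sums(counts, m):
    """Sums of the counts over non-overlapping blocks of m slots."""
    n_blocks = len(counts) // m
    return [sum(counts[i * m:(i + 1) * m]) for i in range(n_blocks)]

def aggregate(counts, m):
    """Averaged process X^(m): block sums divided by the block length m."""
    return [s / m for s in block_sums(counts, m)]

def idc(counts, m):
    """Index of Dispersion of Counts for a window of m slots (variance / mean)."""
    sums = block_sums(counts, m)
    return statistics.pvariance(sums) / statistics.mean(sums)

# Hypothetical parameters, chosen only for illustration.
rng = random.Random(1)
counts = poisson_counts(rate=5.0, n_slots=20000, rng=rng)
for m in (1, 10, 100):
    var_m = statistics.pvariance(aggregate(counts, m))
    print(m, round(var_m, 4), round(idc(counts, m), 3))
```

Applying the same two functions to a measured count series gives a quick first check: an IDC that keeps growing with m points away from the Poisson model.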
*Research funded by Pacific Telesis, the California Micro and National Science Foundation. J. Aracil is supported by a Fulbright/Central Hispano grant.
Our aim is to provide a practical explanation of self-similarity in the Internet and to assess the contribution of the network and the user, so that this harmful effect can be avoided through proper control. In this paper we mainly deal with detailed analysis of connection arrival processes for Telnet, FTP control and FTP data sessions. The main contribution of this paper in comparison with other studies [3, 13, 15] is the characterization of the connection arrival process in terms of Fractal Renewal Processes (FRPs) and chaotic maps, which help to justify the self-similarity effect in terms of transport protocols and user dynamics.

The structure of this paper is as follows: in section II we describe the measurement scenario; section III is devoted to data analysis and comparison with other studies [13, 15]; in section IV we dwell on modeling of FTP data sessions with chaotic maps. Section V is devoted to the application of Fractal Renewal Processes (FRPs) to our data, and sections VI and VII analyze the influence of network transport protocols using simulation. Finally, section VIII presents our conclusions.

II. MEASUREMENT SCENARIO

Measurements are performed in the Berkeley campus FDDI rings, whose topology is shown in figure 1. Four complete days' worth of data are recorded, comprising 3,692,301 records of common Internet services connections (Telnet, FTP, SMTP, etc.).

FIGURE 1. Berkeley campus Network [diagram labels: Internet; UCB Network Router; two FDDI rings; 128.32.x.x; 136.152.x.x; Measurement Device (BayBridge)]
The BayBridge router [6] was used to record traffic traces comprising internal connections involving hosts within the same FDDI ring, connections whose source and destination hosts are attached to different FDDI rings and, finally, external connections between UCB hosts and the rest of the Internet. Similar measurement scenarios (with both internal and external connections being recorded) are found in previous work [13, 15]. A timestamp resolution of 100 µs is used to achieve high accuracy in the recorded trace. We first consider FTP connections, which present interesting properties regarding self-similarity, and leave Telnet connections for subsequent sections.
III. FTP CONNECTIONS ANALYSIS

A number of FTP data sessions are initiated within an FTP control session. We analyze the birth process of both kinds of session, so that each event in the time series under analysis represents the initiation of a new connection. We focus our attention on the following statistics:
• Long-term behavior of the birth process of FTP control and data sessions over the whole measurement period.
• Correlation between duration and size of FTP sessions.
• Heavy-tail properties of FTP sessions size and duration.
• Poisson nature of the birth process of FTP control and data sessions.

III. 1. Long term behavior. Trends.
Figure 2 shows the hourly arrivals of FTP control sessions during four days, starting Monday, January 16th 1995 at 19:00 hours. A trend can be clearly identified, with a peak at noon and troughs at night that can be explained in terms of the user behavior. At night, the network load decreases to less than half the peak load. This behavior can also be observed in telephone networks. If users spawned their FTP connections at night we would observe a homogeneous load, which would certainly ease network dimensioning and resource allocation. Appropriate time-of-use pricing may produce the desired network load smoothing, as outlined in [5].
FIGURE 2. FTP control sessions per hour [arrivals per hour, 0 to 1750, versus hour, 1 to 97]
We consider one day of data, since a clear periodicity is observed for the entire dataset. Furthermore, the sample size is large enough to guarantee significant confidence intervals. We limit the analysis to the sample size summarized in table 1 (one day of data).

TABLE 1. Sample Size

Connection type    Number of connections
Telnet             13,665
FTP Control        23,409
FTP Data           51,766
It is interesting to observe that, on average, roughly two FTP data sessions are initiated per FTP control session (51,766 / 23,409 ≈ 2.2).

III. 2. Correlation between duration and size
Figure 3 shows a scatterplot of duration versus size of FTP sessions. No correlation between duration and size can be seen, which can be explained by the varying load conditions of the network and the effect of window congestion avoidance mechanisms (like Slow Start). If we consider small time scales, a positive correlation between duration and size can be noted, as shown in figure 4. For short FTP data sessions the dominant factor is transmission time, giving rise to a quasilinear relation between size and duration (figure 4). For longer FTP data sessions the window congestion control mechanism, together with varying roundtrip delays, produces high variability in the connection duration.

FIGURE 3. Scatterplot of duration-size (hours) [size (bytes), 0 to 50000, versus duration (s), 0 to 2500]

FIGURE 4. Scatterplot of duration-size (seconds) [size (bytes), 0 to 7000, versus duration (s), 0 to 4]
III. 3. Distribution of FTP data sessions size
It has been reported [15, 13] that the distribution of FTP data sessions size (bytes) is heavy-tailed, with a few huge bursts having a dominating effect on connections. Our analysis confirms this hypothesis. A distribution F(x) of a random variable X is heavy-tailed iff the complementary distribution follows the power law

$$ P(X > x) = 1 - F(x) \approx x^{-\alpha}, \qquad x > \theta \qquad (1) $$
for some θ. The moments of X can be infinite depending on the value of α. In general, the nth moment is finite if α > n. The smaller α is, the more variable the outcomes are, since very significant deviations from the mean can take place. Distributions with 1 < α < 2 have a finite mean and infinite variance.
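The power-law tail of equation 1 can be explored numerically. The sketch below is an illustration rather than the estimation procedure used on the traces: it assumes an exact Pareto law with scale θ, draws samples by inverse transform, and recovers α with a Hill-type maximum-likelihood estimate. All parameter values and function names are assumptions.

```python
import math
import random

def pareto_sample(alpha, theta, n, rng):
    """Inverse-transform samples with P(X > x) = (x / theta) ** -alpha for x >= theta."""
    # 1 - rng.random() lies in (0, 1], which avoids raising zero to a negative power.
    return [theta * (1.0 - rng.random()) ** (-1.0 / alpha) for _ in range(n)]

def hill_alpha(sample, theta):
    """Hill-type maximum-likelihood estimate of alpha from the points above theta."""
    tail = [x for x in sample if x > theta]
    return len(tail) / sum(math.log(x / theta) for x in tail)

def ccdf(sample, x):
    """Empirical complementary distribution P(X > x)."""
    return sum(1 for v in sample if v > x) / len(sample)

# Hypothetical parameters: alpha between 1 and 2 gives finite mean, infinite variance.
rng = random.Random(7)
sample = pareto_sample(alpha=1.2, theta=1.0, n=20000, rng=rng)
alpha_hat = hill_alpha(sample, theta=1.0)
print(round(alpha_hat, 2), round(ccdf(sample, 10.0), 4))
```

On measured data the threshold θ is not known in advance, so the estimate would normally be repeated over several candidate thresholds and read off where it stabilizes.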
[Figure: Log[P(X > x)] versus x, x from 5 to 30]
FIGURE 13. FTP data sessions interarrival times (Low-pass filter 5.0 s.) [Log[P(X > x)] versus x; curves for 3:00, 11:00, 17:00 and 19:00]
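The low-pass filtering behind figure 13 removes arrivals that follow another arrival from the same source by less than a threshold. A minimal sketch of one possible reading of that filter is given here. It assumes that each record is a (timestamp, source) pair sorted by time, and that an arrival is compared with the previous arrival from the same source whether or not that one was kept; both are assumptions, and the function name is hypothetical.

```python
def filter_bursts(arrivals, threshold):
    """Drop arrivals closer than `threshold` seconds to the previous one from the same source.

    arrivals: list of (timestamp, source) tuples sorted by timestamp.
    """
    last_seen = {}  # source -> timestamp of its most recent arrival (kept or not)
    kept = []
    for t, src in arrivals:
        prev = last_seen.get(src)
        if prev is None or t - prev >= threshold:
            kept.append((t, src))
        last_seen[src] = t
    return kept

# Hypothetical records: two sources, one short burst from source "a".
records = [(0.0, "a"), (1.0, "a"), (2.0, "b"), (9.0, "a"), (9.5, "b")]
filtered = filter_bursts(records, threshold=5.0)
print(filtered)
```

Comparing against the last kept arrival instead would remove longer runs of closely spaced connections; the text does not say which convention was used.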
On the other hand, our findings in the previous section showed non-Poisson behavior (long-range dependence) for the FTP data birth process, even though the interarrival times distribution is exponential. Let us first consider a hypothesis and the corresponding counter-hypothesis:
• Users tend to cluster arrivals (with, for instance, mget commands). This fact would explain the non-Poisson nature of the birth process [15].
• The user population is very large, so that this clustering effect should not influence the superposed birth process.

To test both hypotheses we analyze filtered versions of the birth process. We proceed by filtering out connections coming from the same source whose interarrival times lie below a certain threshold. We consider 0.5, 1.0 and 5.0 seconds (a 4.0 seconds threshold is chosen in [15] to separate connection bursts). Figure 13 shows the interarrival times complementary distribution of a filtered version of the FTP data sessions arrival process, obtained by removing arrivals coming from the same source and separated in time by less than 5.0 seconds. Figures 12 and 13 show curves with the same shape, so that only the scale parameter of F_i(x) changes; that is, the data come from the same type of distribution. Also, we obtain the same values for the estimates of the H parameter (figure 9). Thus, long-range dependence in the FTP data arrivals process cannot be attributed solely to users clustering arrivals. Other factors, such as TCP dynamics and user think time, have an important influence, as we will see.

In conclusion, the interarrival times for FTP control, Telnet and FTP data sessions can be approximated by an exponential random variable. Interarrival times are independent for FTP control and Telnet. However, a simple Kolmogorov-Smirnov test for FTP control and Telnet sessions would lead us to reject the null hypothesis of the interarrival times being exponentially distributed. For a discussion on why this test should not be used for a process with long-range dependence such as the FTP data sessions arrival process, see [2]. We observe the largest deviations from exponential behavior taking place in the tail of the distribution; elsewhere the exponential approximation is valid.

III. 6. Conclusions
Several partial conclusions can be drawn from the preceding analysis:
• There is a clear daily trend in the FTP control and data sessions due to user behavior.
• Experimental data show no correlation between duration and size of FTP connections, which can be attributed to varying network load conditions or network transport protocol dynamics.
• FTP data sessions size and duration show heavy-tail behavior, in accordance with previously reported data.
• Telnet and FTP control session arrival processes can be modeled as renewal processes, whose interarrival times are very close to exponential random variables. The Poisson approximation is reasonable.
• FTP data sessions interarrival times are also exponentially distributed; however, the arrivals are not renewal. This behavior is indicated by the Whittle estimator, log-log plots of variance versus aggregation level and a power law behavior that can be seen in the periodogram. Figure 9 shows the values of the Whittle estimator for 24 one-hour intervals, in which the birth process can be assumed homogeneous.

In the previous sections we have observed the self-similarity features of the FTP data sessions arrival process (figure 9). Also, the interarrival times obey an exponential distribution (figures 12 and 13). In the next section we examine two different explanations for this behavior:
• Users tend to follow a typical pattern of FTP data sessions within an FTP control connection. Ask yourself what you usually do immediately after you establish an FTP control connection. Normally users type ls followed by several cd and finally get or mget, and then the process starts over again. This behavior is fairly deterministic and we study the possible connections to the theory of chaotic maps [12].
• The interarrival times of FTP data sessions within the same FTP control connection (i.e., coming from the same user) can be decomposed into two contributions: file transmission duration and user think time. The result is a heavy-tailed renewal process whose superposition with other processes generates a superposed Fractal Renewal Process (FRP) [17]. The marginal distribution of interarrival times is exponential if a sufficient number of users are superposed.

We will show that the data support the second hypothesis but not the first.

IV. CHAOTIC MAPS AND FTP DATA SESSIONS

The FTP data sessions birth process exhibits long-range dependence features (figure 9). The same behavior is not observed for FTP control and Telnet sessions. It has been argued [15] that users tend to cluster their arrivals (for instance by typing the mget command) and that this fact may influence the non-Poisson nature of the arrivals. We discussed this hypothesis in the last section and now we argue that it is possible to find a typical pattern for the arrivals of such sessions. A typical FTP control session starts with an ls and several cd commands, until we reach the target directory and
download one or more files by typing get or mget. This process constitutes a typical series of events that can be noted in the interarrival times (for instance a short interarrival time (ls) followed by a larger one (get)). We first summarize elementary notions about chaotic maps, then we analyze the deterministic behavior of FTP data sessions through phase plots [8]. Lastly, we study the self-similarity features of chaotic maps with stochastic perturbations and their relation to our empirical data.

IV. 1. Probabilistic behavior of deterministic systems
We give a brief introduction to the probabilistic behavior of deterministic transformations. For a rigorous treatment we refer the reader to [12]. Let S be a continuous mapping from ℜ to ℜ and let f_0(x) be the initial density of points (regarded as states [12]) on ℜ. We iterate each point x in the real line with the transformation S and form the sequence

$$ x,\; S(x),\; S^2(x),\; \ldots \qquad (4) $$

At each transition, a new density f_1(x) is obtained; on an interval [a, x] it is given by

$$ \int_a^x f_1(u)\,du = \int_{S^{-1}([a,x])} f_0(u)\,du \qquad (5) $$

We can write f_1(x) in terms of f_0(x) as f_1 = P f_0, so that we can rewrite the last equation as

$$ Pf(x) = \frac{d}{dx} \int_{S^{-1}([a,x])} f(u)\,du \qquad (6) $$

P is the Frobenius-Perron operator associated with the transformation S. Even if the trajectories defined by equation 4 exhibit chaotic behavior, a stationary density \hat{f} can be found for asymptotically stable transformations, which satisfies

$$ P\hat{f} = \hat{f} \qquad (7) $$
It is intuitively clear that the process defined by equation 4 should present long-range dependence features, since each point depends entirely on the past history of the process. If we model the interarrival time series as successive iterations of an asymptotically stable map, we find a marginal distribution for the interarrival times that is stationary, as described in equation 7. As we will see next, it is possible to find chaotic maps whose associated stationary density is exponential, as is the case for the FTP data session interarrival times.
IV. 2. Phase plots
A chaotic map whose associated stationary density is exponential can be found by performing a change of variable of the tent map [12, chap. 6], and is given by

$$ S(x) = -\frac{1}{\lambda} \ln\left| 1 - 2e^{-\lambda x} \right|, \qquad 0 \le x < \infty \qquad (8) $$
FIGURE 14. Chaotic map [S(x) versus x, x from 0 to 50, for lambda = 0.1, 0.2 and 0.3]
The stationary density corresponding to this transformation is the exponential density with parameter λ,

$$ f(x) = \lambda \exp(-\lambda x) \qquad (9) $$
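The invariance of the exponential density can be checked directly from the definition of the Frobenius-Perron operator in equation 6 with a = 0. The worked step below assumes that the map is read as S(u) = -(1/λ) ln|1 - 2e^{-λu}|, that is, the tent map conjugated through y = 1 - e^{-λu}; this reading is an assumption made for the check, not a statement taken from [12].

```latex
% Preimage of [0, x] under the assumed map S(u) = -(1/\lambda) \ln|1 - 2e^{-\lambda u}|:
% S(u) \le x  \iff  |1 - 2e^{-\lambda u}| \ge e^{-\lambda x}
S^{-1}([0,x]) =
  \left\{ u : e^{-\lambda u} \le \tfrac{1}{2}\bigl(1 - e^{-\lambda x}\bigr) \right\}
  \cup
  \left\{ u : e^{-\lambda u} \ge \tfrac{1}{2}\bigl(1 + e^{-\lambda x}\bigr) \right\}

% Under f(u) = \lambda e^{-\lambda u}, the variable e^{-\lambda U} is uniform on (0, 1), so
\int_{S^{-1}([0,x])} \lambda e^{-\lambda u}\,du
  = \tfrac{1}{2}\bigl(1 - e^{-\lambda x}\bigr)
  + 1 - \tfrac{1}{2}\bigl(1 + e^{-\lambda x}\bigr)
  = 1 - e^{-\lambda x}

% Differentiating with respect to x recovers the same density:
Pf(x) = \frac{d}{dx}\bigl(1 - e^{-\lambda x}\bigr) = \lambda e^{-\lambda x} = f(x)
```

Under that reading, the exponential density satisfies the fixed-point condition of equation 7.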
Figure 15 shows a trajectory (discrete time series of interarrival times) generated with this map (λ = 0.2) and the complementary distribution of states after a large number of iterations (log scale on the y-axis). We see that the generated density corresponds to the exponential density of equation 9. The trajectory looks chaotic and the density of states is exponential, as is the case for the FTP data sessions interarrival times distribution (compare to figures 12 and 13).

FIGURE 15. Trajectory and stationary distribution [panels: X_n versus n, n from 0 to 10000; Log10[1-F(x)] versus x for S(x) and Exp(-0.2)]
In order to analyze the hypothesized deterministic behavior of the FTP data session arrival process we show in figure 16 the phase plot associated with FTP data sessions of 255 randomly chosen connections and a simulated series of the same number of sessions whose interarrival times are taken by iterating on equation 8. A phase plot consists of a graph in which t_{i+1} is plotted against t_i, where the t_i are in this case successive interarrival times.

FIGURE 16. Phase plot of FTP data interarrival [t_{i+1} versus t_i, both axes from 0 to 500]

FIGURE 17. Phase plot of Chaotic map [t_{i+1} versus t_i]
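The construction of a phase plot from a list of arrival timestamps is short enough to sketch. The timestamps below are hypothetical and the function names are illustrative; plotting the returned pairs with any plotting tool gives the kind of graph shown in figures 16 and 17.

```python
def interarrival_times(arrival_times):
    """Differences between successive arrival timestamps."""
    return [b - a for a, b in zip(arrival_times, arrival_times[1:])]

def phase_pairs(series):
    """Points (t_i, t_{i+1}) of the phase plot of a series."""
    return list(zip(series, series[1:]))

# Hypothetical arrival timestamps in seconds.
arrivals = [0.0, 0.5, 2.5, 3.0, 7.0]
t = interarrival_times(arrivals)
pairs = phase_pairs(t)
print(pairs)
```

For a purely deterministic series every pair lies on the curve of the generating map, whereas independent interarrival times fill the plane with no visible structure.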
We see that the phase plot constructed from the chaotic map rapidly converges to the chaotic map itself (compare figures 14 and 17), as we expect. In the phase plot of the empirical data (figure 16) we do not observe significant deterministic features, which should appear as lines or clusters (see [8] for a comparison with an Ethernet data phase plot, where these characteristics are rather striking). This non-deterministic behavior could be due to stochastic perturbations caused by user think time and varying network load conditions that affect transmission time (see the scatterplot of duration-size in figure 3). A deterministic contribution can be conjectured taking into account file transmission duration, arguing that the typical pattern that we mentioned before is given by transmission times (a short transmission time produced by an ls followed by longer transmission times caused by a get or mget). However, the stochastic perturbation added by users and network dynamics makes the interarrival times appear totally random, so that modeling with a deterministic chaotic map does not seem plausible. To reinforce this point, we perform simulation studies of the chaotic map defined by equation 8, which show that its long-range dependence is highly sensitive to stochastic perturbation. Stochastic perturbations could occur as follows:

$$ X_{k+1} = \eta_k S(X_k) + (1 - \eta_k)\,\xi_k, \qquad k = 0, 1, \ldots \qquad (10) $$
where η_k takes only two values, 0 and 1, with probabilities (1-p) and p respectively. Equation 10 defines a new transformation based on equation 8. Assuming that the ξ_k and η_k are independent identically distributed random variables, also independent of X_k, a unique stationary density for the X_k exists (proposition 10.4.1 in [12]). For values of p very close to 1, we observe long-range dependence. However, small decreases of p (on the order of 10^{-2}) cause a significant drop of the self-similarity parameter H. For FTP data session arrivals, we are faced with continuously applied stochastic perturbations in the interarrival times due to user think time, which make deterministic chaotic map models unsuitable.

V. FRACTAL RENEWAL PROCESS APPROACH

In this section we adopt a different approach for modeling and analysis of FTP data sessions self-similarity. Instead of exploiting the potential determinism in the typical pattern of arrivals in FTP data sessions, we suppose that the interarrival times form a renewal process. This fact, together with the heavy-tail behavior of the distribution, leads to the study of Fractal Renewal Processes (FRPs) [17]. A brief introduction to FRPs is given in the first subsection, then we recall the heavy-tailed property of the FTP data session arrival process. Finally we compare our empirical traces to those obtained by simulation of FRPs.

V. 1. Introduction to FRPs
Self-similarity in FRPs is achieved by interarrival times that are independent identically distributed random variables with the heavy-tail property. Despite the fact that the process is renewal, the coincidence rate, autocorrelation function and spectral density are power laws. This is due to the slowly decaying tail of the distribution of the interarrival times. The counting process of an FRP is a second-order self-similar process [17] and there are explicit expressions for deriving the self-similarity parameter H from the mean rate and the α parameter of the interarrival time distribution (see section III. 3).

FIGURE 18. Fractal Renewal Process [events along a time axis; label: Heavy tail]
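A single FRP is straightforward to simulate, which helps in building intuition for the clustered arrival pattern sketched in figure 18. The sketch below makes a simplifying assumption: interarrival times follow an exact Pareto law with minimum value a_min, which may differ from the interarrival density used in [17]. The parameter values and function names are illustrative only.

```python
import random

def frp_arrivals(alpha, a_min, horizon, rng):
    """Arrival times in [0, horizon) of a renewal process with Pareto interarrival times."""
    times = []
    t = 0.0
    while True:
        # Inverse-transform sample with P(T > x) = (x / a_min) ** -alpha for x >= a_min.
        t += a_min * (1.0 - rng.random()) ** (-1.0 / alpha)
        if t >= horizon:
            return times
        times.append(t)

def counts_per_window(times, window, horizon):
    """Number of arrivals in each consecutive window of the given length."""
    counts = [0] * int(horizon // window)
    for t in times:
        idx = int(t // window)
        if idx < len(counts):
            counts[idx] += 1
    return counts

# Hypothetical parameters: a tail exponent between 1 and 2.
rng = random.Random(3)
times = frp_arrivals(alpha=1.5, a_min=1.0, horizon=10000.0, rng=rng)
counts = counts_per_window(times, window=100.0, horizon=10000.0)
print(len(times), min(counts), max(counts))
```

The counts per window can then be fed to any estimator of H, such as a variance-versus-aggregation-level plot, to compare against measured traces.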
This model, however, does not apply to our case study, since its interarrival times are heavy-tailed and not exponential as in the data (figures 12 and 13). To achieve an exponential marginal distribution of the interarrival times several FRPs are multiplexed, yielding the Sup-FRP (Superposed FRP) point process. The assumptions made for FRPs regarding self-similarity also hold for Sup-FRPs. For a Sup-FRP of M sources, the following equations hold [17]:
$$ H = \frac{3-\alpha}{2}, \qquad \lambda = \frac{M\alpha}{\left[1 + (\alpha-1)^{-1} e^{-\alpha}\right] A}, \qquad T_0^{\,2-\alpha} = \frac{(2-\alpha)(3-\alpha)\, e^{-\alpha} \left[1 + (\alpha-1)\, e^{\alpha}\right]^{2} A^{2-\alpha}}{2\alpha^{2}(\alpha-1)} \qquad (11) $$

where T_0 is the fractal onset time and A equals the minimum value of the interarrival time [17]. Related to this last parameter, it is interesting to note that the counting process of a Sup-FRP is self-similar in a time scale given by the range of the interarrival times. In practical cases, this interarrival time is bounded, say by A