Journal …International Journal of Computer Mathematics
Article ID … GCOM 266877
TO: CORRESPONDING AUTHOR AUTHOR QUERIES - TO BE ANSWERED BY THE AUTHOR Dear Author Please address all the numbered queries on this page which are clearly identified on the proof for your convenience. Thank you for your cooperation
AQ
Please confirm that it is OK for all figures to be printed in black and white (the online version will show figures in colour).
Q1
References has been renumbered as per Journal style. Please check.
Q2
In Section 4.3/para 3/line no. 9, the brackets in the sentence “…is around 2 or 3 MB/user/day (in Dec 2003…..” has been deleted for correctness. Please check. Production Editorial Department, Taylor & Francis 4 Park Square, Milton Park, Abingdon OX14 4RN Telephone: +44 (0) 1235 828600 Facsimile: +44 (0) 1235 829000
International Journal of Computer Mathematics Vol. 00, No. 0, Month 2008, 1–12 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
Modelling user’s activity in a real-world complex network (SI-complex networks) Carmen Pellicer-Lostaoa *, Daniel Moratob and Ricardo López-Ruizc a Department of Computer Science, Universidad de Zaragoza, Zaragoza, Spain; b Department of Automatics and Computer Science, Universidad Pública de Navarra, Pamplona, Spain; c Department of Computer Science and BIFI, Universidad de Zaragoza, Zaragoza, Spain
(Received 25 June 2007; revised version received 22 August 2007; accepted 04 September 2007) In this paper, a new statistical user model for the Internet access is presented. Real traces of Internet traffic in a heterogeneous campus network are analysed. We find three clearly different styles of individual user’s behaviour, study their common features and group particular users behaving alike in three clusters. This allows us to build a probabilistic mixture model that implements the expected global behaviour for the different types of users. The implications of this emergent phenomenology are discussed in the field of multi-agent complex systems. Keywords: internet traffic; emergent and collective behaviour; multi-agent systems AMS Subject Classifications: 90B18; 68M10; 94C30; 62P30
1.
Introduction
One of the main objectives of traffic analysis is to identify predictive models. Models are so important because they make available practical tools (such as traffic simulators) that are widely required by TelCos (telecommunication companies) in core areas such as network engineering and marketing. In addition, those models can supply mathematical and physical insights into the phenomenon’s nature, giving a better understanding, providing generalist views and thus, making possible valuable innovation. Till today, a lot of efforts has been made to typify different aspects and qualities of the Internet traffic: flows [3, 16, 22], TCP connection arrivals [17], size of FTP files [7], Web transfers [18], etc. However, there are few works focused on modelling the user itself [4, 5, 23]. In fact, there is not much perspective of the Internet as a collection of multiple human-driven agents (partially due to its size and ever changing nature). As a consequence, this could be an interesting perspective, specially if it could shed some clues of globally efficient, emergent or self-organized behaviour. From this point of view, this paper describes the process of traffic analysis and modelling of an internet access link taking the user as a reference. We consider the campus network of the *Corresponding author. Email:
[email protected]
ISSN 0020-7160 print/ISSN 1029-0265 online © 2008 Taylor & Francis DOI: 10.1080/00207160701670328 http://www.informaworld.com
Techset Composition Ltd, Salisbury
GCOM266877.TeX
Page#:
12
Printed: 13/11/2007
2
51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
C. Pellicer-Lostao et al.
Public University of Navarre (UPNA). It represents a local network connected to the Internet with a typical and heterogeneous traffic profile and a community of several thousands of users. We explore the use of the Internet access link and infer a new statistical user model based on the probabilistic traffic demand observed for individual users. In order to identify the characteristics and origins of this particular model, a statistical data analysis is further performed. The paper is structured as follows. Section 2 describes the measurement environment and the traffic traces collected for this work. Section 3 discusses the global behaviour observed in the network and reveals the co-existence of different patterns of behaviour in the users’ community. Section 4 presents the mathematical modelling applied to each of these patterns (or types of users). And as a conclusion, Sections 5 and 6 obtain the global model and discusses its significance.
2.
Measurement environment
We consider the campus network of the Public University of Navarre (UPNA) in Pamplona (Spain), which is connected to the Internet by two access links coming out from a border router. At the time of the measurements (December 2003 and April 2004) those were two ATM OC12 links with a normal use of 10% of the bandwidth and peaks of 25%. There were a minimal set of filters active in the firewall rules for specific applications: specially for applications designed for use in the local network, some security limitations on the mail (SMTP), and an incipient filtering on individual ports trying (unsuccessfully) to prevent peer-to-peer traffic). Therefore, a mostly unrestricted access to the Internet was granted for every user. A passive network monitor, placed in the fast ethernet link to the border router, was used to sniff and record the traffic packets on normal operation of the network. Software-based measurement tools were used to passively copy the traffic and build a trace file. The measurement enviroment is described in Figure 1 with the details of downstream and upstream directions for the captured traffic. It consists of a fast-ethernet hub and a commodity workstation with GNU/Linux operating system and the packet capture library (Lib-pcap, [13]) installed, to provide packet capture
Figure 1.
Details of the measurement scenario.
International Journal of Computer Mathematics
101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150
3
Table 1. Detail of traffic traces used in the study. The number of IP datagrams (in millions, MDtg) and the amount of bytes (in gigabytes, GB) are given for each day of measurement. Day 1 2 3 4 5 –
Date
MDtg (Down)
GBytes (Down)
MDtg (Up)
GBytes (Up)
We 17 December 2003 Th 18 December 2003 Mo 19 April 2004 Tu 20 April 2004 We 21 April 2004 Total traffic
71.06 79.21 68.75 74.31 71.73 365.07
41.86 44.79 43.31 42.69 41.05 213.73
62.58 67.69 68.02 89.33 75.65 363.28
18.80 19.32 20.55 22.40 21.26 102.35
capability. A validation is performed on these tools to ensure that no packet losses occur during the process. The post-process data analysis is done by means of ad hoc developed applications. The details of the collected packet traces are shown in Table 1. They comprise five complete (24 hours/each) normal working days, belonging to two different periods of the year (December 2003 and April 2004) in order to measure the time evolution of the model also. These traces involved 728.36 millions. of IP datagrams and 316.08 gigabytes of total traffic. The data bytes in the IP datagrams include the counting of the header bytes. The monitored traffic presents a standard time profile and composition, with typical day-time patterns of activity, TCP protocol predominance (98% of total traffic bytes) and asymmetric traffic (70% downstream, 30% upstream). As a consequence, the community under study shows some characteristics that are typical of an ordinary community of users [8], a medium-sized Intranet connected to the Internet in flat rate mode.
3. Traffic analysis in the community of users 3.1. The global view First, the global behaviour of the entire population is analysed. As it was shown in Section 2, TCP comprises the most significant load of traffic registered. Moreover, total traffic is dominated and shaped by the download component. These effects will lead us to focus on TCP downstream traffic for modelling. If the whole community of users is ordered according to their daily traffic activity, we obtain an enlightening global picture (see Figure 2). Figure 2a shows the TCP traffic registered per user on Day 2 with its downstream and upstream components. Here, the dominance of the downstream component over the total traffic amount is observed, the same way as the upstream is found to have a minor weight. Users are ordered in the horizontal axis according to their traffic loads (total, down or up, respectively). Thus, the first user on this axis is the user with the heaviest traffic activity and user number 4608 is the user with the lowest activity in the community. These users can be different persons for each component, as the most intensive user in the download is typically a different person than the most intensive user in the upload component. Figure 2b shows the TCP traffic downloaded per user and day for all traces (two days in December 2003, depicted in blue and three days April 2004, shown in red). A positive drift is observed in the consumption among the traces spaced in four months, but the shape of the plots remains constant. Total and Upstream traffic are removed from this figure for clarity but they also show similar shapes in all traces. Two main characteristics are observed in Figure 2: (1) Two inflection points and three different regions become visible in the graphics (Figure 2b). The first point is found at the value of around 30 MB/day for all traces, and the other one
4
151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200
C. Pellicer-Lostao et al.
Figure 2. (a) Bytes of TCP traffic registered per user on Thursday 18 December 2003 (total, down and upstream components), and (b) downstream bytes of TCP traffic registered per user for all days.
at the values of 10 or 30 KB/day for December 2003 or April 2004, respectively. The three areas are found in all the traces. A longitudinal variation gives information to a tendency of increasing consumption for the lighter users as months pass, but the shapes of the curves remain constant for the whole community. Each inflection point represents a mark at which the behaviour of users’ consumption changes dramatically. These findings reveal that somehow all users don’t behave uniformly. In fact, they can be classified in three clusters of different statistical behaviour, as we show in the next sections. (2) A great number of users don’t generate traffic (0 bytes/user/day in the upstream). In Figure 2a, it can be seen that users beyond number 1688 do not show any uploaded traffic. Their IP addresses register download traffic because they were automatically traced from the Internet by different applications (in hacking activities, etc.) but they never respond. This leads us to consider that they are not real users, but just unallocated IP addresses or unconnected machines in the Network. These IP addresses are eliminated from our study. 3.2.
Searching for patterns of behaviour
As a second step, we analyse users’behaviour during each day of the five collected traces. The main statistical magnitudes of downloaded TCP traffic for users within the three regions of consumption are obtained. These different groups of users have invariant patterns for each one of the analysed days. We refer to these groups as types A, B and C users, for the regions of the heaviest, medium and the lightest consumers, respectively. The data displayed in Tables 2 and 3 show the invariance of the behaviour in all traffic traces in terms of number of users, downloaded traffic per day and behaviour during the day. In Table 2, we show that the number of users in each of the three regions in Figure 2 and its global consumption per day are almost invariant for all traces. These magnitudes are expressed as a percentage over the total number of users and total bytes of traffic per day, in order to make it clearer. This is also true for the levels of mean consumption per user and day and the standard deviation of the statistical sample. So, each type of user represents a group of individuals that is almost constant in number and traffic demanded per day. It is important to remark though, that the threshold values, which separate A, B and C regions in previous subsection, were subjective selections of division points within two areas of inflection points in Figure 2. No absolute threshold
International Journal of Computer Mathematics
201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250
5
Table 2. Main statistical magnitudes obtained for each region show stability over time. These magnitudes are: the distribution of number of users and traffic consumption in each region (expressed as a percentage of the total magnitudes for each day) and the levels of consumption per user and day, expressed as the mean value and the standard deviation σ of the downloaded traffic (bytes) for the cluster of users in every region. Object
Statistical Magnitude
Day 1
Day 2
Day 3
Day 4
Day 5
Total Users Total Traffic TYPE A
Number of Users Downloaded GBytes % Users % Download σ
1727 40.60 7,01% 87,07% 89 MB 541 MB
1688 41.37 8,12% 87,48% 78 MB 527 MB
1782 42.40 8,36% 82,60% 63 MB 472 MB
1766 41.63 7,42% 84,41% 91 MB 422 MB
1775 40.35 7,94% 85,44% 75 MB 366 MB
TYPE B
% Users % Download σ
65,32% 12,92% 2.08 MB 6.57 MB
66,59% 12,52% 2.01 MB 6.69 MB
67,62% 17,39% 3.04 MB 7.69 MB
66,25% 15,57% 2.72 MB 7.30 MB
63,83% 14,54% 2.55 MB 6.75 MB
TYPE C
% Users % Download σ
27,68% 0,01% 4.47 KB 1.53 KB
25,30% 0,004% 3.11 KB 1.65 KB
24,02% 0,01% 10.69 KB 6.55 KB
26,33% 0,01% 7.77 KB 7.03 KB
28,23% 0,01% 7.79 KB 6.77 KB
Table 3. Distribution of consumption during Day 1 (Wednesday 17 December 2003) for the three observed groups. The daily distribution of downloaded traffic can be observed in its day-night composition. Also the day components are disaggregated with the morning-afternoon detail. Type of user
Morning (9h/13h)
Afternoon (13h/18h)
Total day (9h/18h)
Night (18h/9h)
TYPE A TYPE B TYPE C
26,63% 41,16% 28,25%
27,82% 32,75% 23,14%
54,45% 73,91% 51,39%
45,55% 26,09% 48,61%
values could be established and, as a consequence, it may happen that some elephants could be mules or some mules, mice. This can be appreciated in Table 2, where the obtained values of the standard deviation in Types A and B users are very high. Table 3 describes the pattern of consumption for each group during Day 1 (Wednesday (17 December 2003)), where the different day-time periods were chosen according to the Spanish standard labour day. This is another invariant property observed in the registered traces and it shows a characteristic daytime pattern for each group. In region A, users have a constant and heavy demand over day and night. Users in region B show a stronger consumption in the mornings. In region C, the traffic registered is very low and similar during the whole day.
3.3. Three clusters of users The analysis sketched in Section 3.2 gives us the portrayal of a community with three different types of users (A, B and C) coexisting in the Internet access. Clustering them in three different groups allows us to highlight the size disparity in traffic consumption. Size disparity is often called the “mice-elephants” phenomenon, traditionally observed in Internet flows (according to duration or bit-rate), but also referred as volume of traffic [1]. An object is labelled as mouse if it contributes to the number of total objects of an ensemble, elephant, if it contributes to the total
6
251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300
C. Pellicer-Lostao et al.
volume mass and in [2] hosts in between are called mules. Finally we name each type of users after this terminology. The main features of each group are: (1) Type A users (or Elephants): They are the great consumers, with a heavy and timely constant demand of Internet traffic. They take the 90% of the total traffic (see Table 2), and they have a continuous activity during the day, with almost 50% traffic at night and 50% at day (see Table 3). (2) TYPE B users (or Mules): They are the pragmatic users. Their activity on the net matches with business hours of activity. It takes place more intensively in the mornings (41%, see Table 3), and the total amount of day load is almost the 75% of their total download. It seems that they use the Internet access as a working tool. (3) TYPE C users (or Mice): They are the erratic and vanishing users with an extremely low and sporadic demand of traffic, and random daily patterns difficult to be tracked. The total amount of traffic demanded by these users is residual (see Table 2). The consumption is very small and it has the same residual weight at day and at night; almost 50% traffic at day and 50% at night (see Table 3). This community seems to follow the Pareto principle or the ‘80–20 rule’, which says that, when something is shared among a sufficiently large set of participants, there will always be a number k, such that k% of the resources is taken by (100 − k)% of the participants. In our case, k = 90, meaning that the 10% of the population (the Elepants group) is responsible for the 90% of the downloaded traffic. For the rest of the users, medium consumers (Mules) are the 65% and they make a business-like use of the resources, taking almost the rest of the bandwidth ‘pie’. Low consumers (Mice) contribute only in numbers, the 25% of the total, and they have residual and apparently random traffic demands. Let us remark at this point that many complex systems present similar generic features that include the Pareto-like statistical behaviour. The most popular examples can be found in areas such as econophysics (e.g. distributions of wealth or income in western economies or standardized price returns on individual stocks markets, Company size distribution [6, 12, 19]), social networks (e.g. size distribution of human settlements, frequency of human relations [14, 20]), scale-free networks (Epidemic spreading and transportation dynamics [9, 15]), transportation networks (e.g. search of the Pareto minimum paths [21]), natural phenomena (e.g. Gutenberg–Richter law of earthquake magnitudes [10]), biocomputation (e.g. optimization of genetic algorithms [24]), or Biology (protein sequence length distribution [11]). Hence, some similarities appear between all these systems and the distribution of downloaded Internet traffic for users of an Intranet.
4.
Models for the users’ downloaded traffic
As explained in previous section, the particular users conducting alike are grouped in three different clusters and a mathematical model is built here for each region of Figure 2b. Thus, the global model will be obtained as the mixture of three different users’ models. If the real-valued random variable X represents the amount of incoming TCP traffic that a user consumes per day, expressed in bytes, the downstream traffic belonging to each group is used to build three different cumulative distribution functions (CDFs). In our case, X attains discrete values xi , with probability pi = p(xi ) = P (X = xi ). The values of pi are calculated as the number ni of users that consumed xi bytes divided by the number Nc of total of users in the same cluster. ni P (X = xi ) = p(xi ) = . (1) F (x) = P (X ≤ x) = Nc x ≤x x ≤x x ≤x i
i
i
International Journal of Computer Mathematics
301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350
7
As we are interested in the probability that a user downloads more than x bytes/day, we will work with the complementary cumulative distribution function (CCDF), which is defined as: Fc (x) = P (X > x) = 1 − F (x).
(2)
In the next subsections, we select three CCDFs to model the downstream TCP traffic for each different group and we fit our monitored real data to these CCDFs. We will begin with Type A users, continue with Type C and end up with Type B users, for they are going to build the bridge between the two extremes. 4.1. Model for type A users This cluster represents around the 10% of total users, consuming almost 90% of the traffic (see Table 2). It follows the Pareto principle. If we move in the x-axis towards user number one in Figure 2b, a staggering increase in the traffic demand per user can be seen. So a fast variation is observed only for this cluster and intuitively could give some clues of a long-tail behaviour in the CCDF. If we plot the measured CCDF data in a double logarithmic plot, we observe a straight line arrangement for the data-points, which is consistent with a power-law dependence. Based on these observations, a Pareto distribution is used to model this group. Fc (x) = b ∗ x −α ,
b ≥ 0, α ≥ 0, x ≥ b1/α .
(3)
A double logarithmic plot for the five days is done in Figure 3 to show the details of the fitting. The data follow a two straight line arrangement and the approach of fitting the empirical data to two Pareto distributions is selected, as it seems to be more general and complete than selecting a truncated Pareto as CCDF (see [9]). The first line of data is labelled as “Pareto 1” in Figure 3a and b and appears in the area of transition between Type A and Type B users. This transition exists because the threshold value that separates regions A and B (30 MB/user/day) is a point, that is chosen among an area of inflexion points. The result of finding two straight lines and not just one, indicates that there is no well-defined threshold or boundary that separates Type A users from the rest (observed in [14]). It also
Figure 3. (a), (c)–(f) Graphic details of the fittings for Elephant users to two Pareto distributions (Pareto 1 and Pareto 2) in double logarithmic plot for all the traces, and (b) empirical and fitted CCDF obtained for Day 1. A two straight line arrangement and the approach of fitting the empirical data to two Pareto distributions.
8
351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400
C. Pellicer-Lostao et al. Table 4. Values of the fitting parameters b and α obtained for the five days in the two zones ‘Pareto1’ and ‘Pareto2’. Day b1 α1 b2 α2
1
2
3
4
5
7143.08 0.519 1.45E + 16 1.944
44700.9 0.630 1.25E + 24 2.842
19500.0 0.578 1.09E + 19 2.272
18613.8 0.576 9.25E + 15 1.928
10242.7 0.541 8.06E + 24 2.92
indicates that there are two speeds of consumption within the Elephants herd, as the power-law behaviour-has different characteristics in each distribution. The ‘Pareto 1’ zone has an exponent α < 1 and the value of α for the ‘Pareto 2’ is around 2. Table 4 shows the details of the fitting parameters. These users consume a heavy load of traffic every day. Consumption varies in a wide range between users (10 GB and 30 MB/user/day). The range of bytes between the user who has the maximum discharge (10,000 MB/day) and the user who discharged the minimum (30 MB/day) in this group is 9970 MB, meaning that the heaviest consumer takes around 1000 times more than the lightest in the group. In Table 2, the mean value of incoming traffic is around 80 MB/user/day but with a high dispersion value, typical of long-tail distributions (the standard deviation for this population is around 400–500 MB, revealing a high spread of its values). They are always on line and perform an intensive use of the link, with a constant and uniform traffic profile over day and night (see Table 3). They download high quantities of data, probably heavy Web- or FTP-files, or even they are connected to P2P networks. 4.2.
Model for type C users
Users belonging to this group consume a residual quantity of traffic, with random and irregular patterns of activity (see Tables 2 and 3). They show a better marked mean consumption than Elephants, and plotting the measured CCDF data in a natural logarithmic plot, a straight line arrangement appears for the data-points, which is consistent with an exponential dependence. Finally the exponential distribution is used to model this group: Fc (x) = be−ax ,
a ≥ 0, x ≥ 0.
(4)
Figure 4 shows the details of the fitting in a plot that represents y-axis in natural logarithm for the five days. Table 5 shows the details of the fitting parameters. As it can be seen in Figure 4, the model proposed seems to be less applicable for very low consumption (i.e. x under 4 KB in Day 1 in Figure 4(b)). The volume of consumption of Mice is so low that one could question the worth of modelling them. But let us recall that it is an important population within the network, around the 25% of the users belong to this group. They contribute to the number of entities but not to the volume. The demand of traffic is below tens of KB/user/day, with a weak tendency of growth as time passes, from threshold values of 10KB in December 2003 to 30 KB in April 2004 (see Figure 2b). This means that just a hundred of packets (40–1500 bytes long) have been received per day by one of this users, typical for sporadic surfing or mail applications activities. The range of bytes between the user who has the maximum discharge (10–30 KB per day) and the user who discharged the minimum in this group is around 9–29 KB, meaning that the heaviest consumer takes around 10–30 times more than the lightest in the group, which consumed around 1 KB per day.
International Journal of Computer Mathematics
401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450
9
Figure 4. (a), (c)–(f) Graphic details of the fittings for Mice users in natural logarithm plot for all days, and (b) empirical and fitted CCDF obtained for Day 1. Table 5. Values of the fitting parameters a and b obtained for the five days. Day a b
1
2
3
4
5
6.37E − 4 7.32
1.50E − 4 2.41
1.49E − 4 1.61
6.05E − 4 3.01
1.43E − 4 1.53
4.3. Model for type B users This closter accomodates maximum number of users (∼65%, see Table 2). Their consumption lies somewhere in between the last two extremes. The Weibull distribution can offer a way of transition from an exponential to a heavy-tailed distribution with a shape parameter (β < 1), and it is found in literature that some empirical distributions are often close to this type of distribution [1]. Hence, a Weibull distribution is selected in this case as CCDF, due to its flexibility and because it can mimic a transition from the exponential to the Pareto distribution. Fc (x) = e−ax , β
a ≥ 0, β ≥ 0, x ≥ 0.
(5)
The details of this approximation are shown in Figure 5 in a logln–log plot that represents the decimal and natural logarithm of Fc (x) in the y-axis and the decimal logarithm of x in the x-axis for the five days of data collection. Table 6 shows the details of the fitting parameters. This group of users belong to an intermediate kind (see Table 2), with a moderate consumption of traffic and use of the Internet access. A Mule digests a traffic load in the range of tens of KB to tens of MB per user/day, showing a growing tendency for the lighter consumers as time passes. The range of bytes between the user who has the maximum discharge (30 MB/day) and the user who discharged the minimum (10–30 KB/day) in this group is 9–29 MB, meaning that the heaviest consumer takes around 1000 times more than the lightest in the group. The average demand per user is around 2 or 3 MB/user/day in December 2003 and seven in April 2004, respectively, and Q2 the dispersion value is high (the value of the standard deviation is around 6–7 MB/user/day), getting closer to the Elephant’s profile. It is also observed in Table 3 that they concentrate their activity in business hours, possibly using the Internet access in their work activities. These users seem to read emails, make sporadic surfing on the net and download relatively light Web pages or files (email attachment or FTP).
10
451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500
C. Pellicer-Lostao et al.
Figure 5. (a), (c)–(f) Graphic details of the fittings for Mule users in logln–log plot for all days, and (b) empirical and fitted CCDF obtained for Day 1.
Table 6. Values of the fitting parameters a and β obtained for the five days. Day a β
1
2
3
4
5
7.08E − 5 0.633
3.59E − 5 0.663
3.33E − 5 0.675
3.62E − 5 0.678
3.17E − 5 0.676
5. The general model We proceed now to propose a statistical user model to describe the incoming traffic in the Internet access for the general community. This is the result of the combination of the three models which were obtained for each type of user. The CDF function F (x) for this global model is plotted in Figure 6, showing the empirical data registered on Day 1 (Wednesday, 17 December 2003) and the theoretical distributions fitted to these data. This global CDF represents the probability that a user demands less than an specific traffic load. This probability is calculated from the CDF for the type of user (or region of consumption) averaged by the relative weight of this type of user in the community (averaged by the number of users of each type shown in Table 2). The three different groups of users are very well identified in the general model. We can think of the Network as a multi-agent system in an active environment. At first sight, the agents could seem to be in a weak competition for the bandwidth in the Internet access, but it is more realistic to think that they are non-interacting agents without any mutual information among them. Even under these circumstances, and surprisingly, the system self-organizes with the pattern of behaviour given in Figure 6. The part of the population that receives the highest traffic income, generally less than 10% of the individuals, follows a power-law distribution (Pareto behaviour). This means that the consumption is dominated by a few individuals and, for the misfortune of planning engineers, that there always exists a significant probability or risk of finding a more intensive user in the Network. In the range of low traffic download, the cumulative distribution follows the Boltzmann–Gibbs law (exponential behaviour with Fc (x) ∝ exp(−x/T )). Hence, by analogy with statistical physics, one may expect that the activity of these users is characterized by some kind of random access to
International Journal of Computer Mathematics
501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550
11
Figure 6. (a) The general model built from the combination of the individual models for each type of user (empirical data and model curves). The x-axis is in logarithmic scale, and (b) details of the Elephants region. The two Pareto CDF approach used in the fitting procedure can be observed here, and (c) details of the Mules region, and (d) details of the Mice region.
the Net (memory-less), where the effective traffic ‘temperature’ (with T = 1/a in Table 5) would be related to the average amount of bytes consumed per agent. Let us remark that these two regimes, the Pareto and the Boltzmann–Gibbs statistical patterns, can be found in other real systems which also depend on the human behaviour, such as the distribution of wealth or income in the western societies [6]. But also a third different regime is observed here, namely the group of medium consumers or Mules described by the Weibull distribution function. This distribution can model the transition from an exponential to a heavy-tailed distribution by varying its shape parameter (β < 1). The low traffic demand dominates this group, but the consumption increases rapidly for the last users that are close to the Pareto-like region. Summarizing, the model depicted in Figure 6 collects and brings out all the information needed for the statistical forecasting of consumption that a community of users in a Intranet will perform for the Internet access link. As it is, it could also be useful for TelCos, to investigate new strategies in network planning or marketing areas of Internet consumption.
6.
Conclusions
This work describes the process of traffic analysis and modelling of an Internet access link taking the user as a reference. A new probabilistic global model is obtained to describe user traffic consumption in terms of bytes per user per day. Considering the Network as a multi-agent system, an emergent statistical structure is found for the Internet demand within the community. The identification of three different groups of users and the probabilistic distributions that describe their behaviour have been established. Some particular properties are present in these experimental results. Apart from the usual Pareto and Boltzmann–Gibbs behaviours, which are also found in other multi-agent systems commanded by the human action, a new emergent regime is found here. This can be modelled by the Weibull distribution, that could be interpreted as a bridge between the Pareto and the Boltzmann–Gibbs regimes.
12
551 552 553 554 555 556 557 558 559 560 561 562 563 564 Q1 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600
C. Pellicer-Lostao et al.
Let us stand out that different models and real systems, with dynamical processes and local interactions taking place at the microscopic scale, show similar macroscopic characteristics to those observed in this work. This similar pattern of behaviour, that continues to emerge in our analysis, is somehow striking since it would be more realistic to think that the agents are independent and non-interacting when they access to the Internet link. We hope that the stastics presented in this paper will help for a better understanding of the laws governing the interactions in multi-agent systems, and that they will provide empirical views to achieve enhanced modelling for complex systems. Acknowledgements The authors thank the Public University of Navarre (UPNA) for allowing them to monitor the data in its Internet access link.
References [1] A. Broido et al., Their share: diversity and disparity in IP traffic, Proceedings of Passive and Active Network Measurement (PAM), 2004. [2] A. Broido, E. Nemeth, and K. Claffy, Spectroscopy of Private DNS Update Sources, Proceedings of IEEE 3rd Workshop on Internet Applications (WIAPP’03), 2003. [3] N. Brownlee, and K.C. Claffy, Internet stream size distributions, Proceedings of ACM SIGCOMM, 2002, pp. 282–283. [4] A. Chattarjee, Login times in e-mail servers are scale-free, Preprint arXiv:cond-mat/0307533v1, 2003. [5] C. Cunha, A. Bestavros, and M. Crovella, Characteristics of WWW client-based traces, IEEE/ACM Trans. Netw. 1 (1999), pp. 134–233. [6] A. Dragulescu, and V.M. Yakovenko, Exponential and power-law probability distributions of wealth and income in the United Kingdom and the United States, Physica A 299 (2001), pp. 213–221. [7] A. Downey, Evidence for long-tailed distributions in the Internet, Proceedings of ACM SIGCOMM Internet Measurement Workshop (IMW), 2001, pp. 229–241. [8] S. Floyd, and V. Paxson, Difficulties in simulating the Internet Networking, IEEE/ACM Trans. Netw. 9 (2001), pp. 392–403. [9] R. Guimera et al., The worldwide air transportation network: anomalous centrality, community structure, and cities’ global roles, Proc. Nat. Acad. Sci. USA 102 (2005), pp. 7794–7799. [10] B. Gutenberg and C.F. Richter, Seismicity of the Earth and Associated Phenomena, Princeton University Press, 1954. [11] R. Jaina and S. Ramakuma, Stochastic dynamics modeling of the protein sequence length distribution in genomes: implications for microbial evolution, Physica A 273 (1999), pp. 476–485. [12] B. Leong Lan and Y. Oon Tan, Statistical properties of stock market indices of different economies, Physica A 377 (2007), pp. 605–611. [13] Libpcap. Available at ftp://ftp.ee.lbl.gov/libpcap.tar.z [14] F. Liljeros et al., The web of human sexual contacts, Nature 411 (2001), pp. 907–908. [15] R. Pastor-Satorras and A. Vespignani, Epidemic spreading in a scale-free networks, Phys. Rev. E. 86 (2001), pp. 3200–3203. [16] V. Paxson, End-to-end internet packet dynamics. IEEE/ACM Trans. Netw. 7 (1999), pp. 277–292. [17] V. Paxson and S. Floyd, Wide-area traffic: the failure of Poisson modeling, IEEE/ACM Trans. Netw. 3 (1995), pp. 226–244. [18] J. Pitkow, Summary of WWW Characterizations, World Wide Web, 2 (1999), pp. 3–13. [19] J.J. Ramsdena and Gy. Kiss-Haypal, Company size distribution in different countries, Physica A 277 (2000), pp. 220–227. [20] B. Roehner, Evolution of urban systems in the Pareto plane, J. Reg. Sci. 35 (1995), pp. 277–300. [21] V.N. Sastry, T.N. Janakiraman, and S. Ismail Mohideen, New polynomial time algorithms to compute a set of Pareto optimal paths for multi-objective shorthest path problems, Int. J. Comput. Math. 82 (2005), pp. 289–300. [22] K. Thompson, G.J. Miller, and R. Wilder, Wide-area Internet traffic patterns and characteristics, IEEE Netw. 11 (1997), pp. 10–23. [23] N. Vicari and S. Köhler, Measuring Internet user traffic behaviour dependent on access speed, Proceedings of 13th ITC Specialist Seminar on IP Traffic Modeling, Measurement and Management, 2000. [24] X. Zou and L. Kang, Fast annealing genetic algorithm for multi-objective optimization problems, Int. J. Comput. Math. 82 (2005), pp. 931–940.