Application Profiling of IP Traffic

Danielle Liu, Frank Huebner
AT&T Labs, 200 Laurel Avenue, Middletown, NJ 07748

Abstract

Application profiling is a critical step in QoS solution design of today’s IP networks. To aggregate applications into QoS classes effectively, one has to know the key characteristics of the applications. This is needed to arrive at a QoS solution design that meets the QoS targets for all applications, while allowing for maximum use of the network link capacities. We collected NetFlow data at several locations on a corporate Intranet and analyzed the data, producing application breakdowns by flows, packets and bytes. For the most popular and bandwidth-consuming applications, we analyzed stochastic characteristics such as the distributions of flow lengths, packet sizes and throughputs. Our study reveals that the measured traffic composition does not vary much on a day-by-day basis, but that it can be very different from location to location. Therefore traffic measurements have to be collected at each congested link in the network, since measurements taken at a single point may lead to inaccurate assessments of the live traffic mix, resulting in ineffective QoS solution designs. Also, to account for changes in user behavior, increases or decreases in the user population and changes in the application mix, the process of traffic profiling should be repeated every 3-6 months.

1 Introduction

Extensive research effort has been devoted to IP traffic modeling and engineering. “The key issue on the design or selection of a congestion management scheme is the traffic pattern, and the traffic pattern is dependent upon the application” [2]. The heterogeneity of IP traffic calls for partitioning the traffic. Traffic from different applications, e.g., HTTP, FTP, IP voice, DNS and NTP, exhibits distinctive characteristics. In addition, there is a keen demand for preferential services for some business-critical applications. To provide differentiated services based on the QoS class, understanding the traffic characteristics of the top bandwidth-consuming applications is the key to effective management of network resources.

This study is based on NetFlow data collected at several locations on a corporate Intranet. The objective of this study is to identify the top applications that are running on this network and to summarize the key characteristics of these applications. These characteristics or profiles of the applications can be used to drive the traffic used in performance testing and to derive tractable models for network dimensioning that capture the important properties of the traffic data.

Many studies have shown that the statistical properties of IP traffic are considerably different from those of circuit voice traffic (e.g., [1,3,7]). The difficulties encountered in modeling IP traffic data stem from the fact that IP networks use multiple protocol layers and applications. Each layer plays a distinct role and the traffic viewed at each layer has distinct features. As we illustrate in Section 3, the traffic viewed at the flow, packet and byte levels can be very different. The characteristics of TCP traffic are well known to be very different from those of UDP. In addition, the heterogeneity of IP traffic due to the diversity of applications calls for partitioning the data by application and characterizing and modeling the pieces separately [4].

The composition of IP traffic is known to be dynamic, i.e., it changes over time. New

Proceedings of the 27th Annual IEEE Conference on Local Computer Networks (LCN’02) 0742-1303/02 $17.00 © 2002 IEEE

applications like Napster and Instant Messaging were unknown a few years ago, but they consume significant bandwidth in our networks today. Napster's share is already diminishing, as Napster is offering fewer titles and new competitors have emerged. The process of traffic profiling needs to be done on potentially congested links in the network, since these are of interest regarding QoS issues. Some fine-tuning may be needed when the traffic changes over time. This process should be repeated every 3-6 months.

The outline of the paper is as follows: In Section 2, we discuss the data sets we collected. In Section 3, we identify the top applications that are running in the network and demonstrate that the traffic compositions by flows, packets and bytes are very different. In Section 4, we examine the traffic characteristics of the key applications. Implications of our findings and conclusions are summarized in Section 5.

2 Data Description

The first set of data was collected over a 24-hour period from a Cisco 7500 router located on the U.S. west coast, serving Asia-Pacific customers. To investigate the variation of traffic composition over location and time, two additional data sets for different dates were collected on the U.S. east coast. The total number of flows, packets and bytes collected each day is summarized in Table 1. The last 3 columns in the table give the average flow length in bytes and packets, and the average packet size for each data set. The average number of bytes in a flow of data set 3 is about twice that of data set 1, and the average packet size in data sets 2 and 3 is over 50% larger than that of data set 1.

Data set  Date           flows        pkts        bytes  flow bytes  flow pkts  pkt bytes
data 1    11/29/2000  63929756  1209935154  3.87982E+11     6068.88      18.93     320.66
data 2    02/20/2001   3074166    50748890  22693872525     7382.12      16.51     447.18
          02/21/2001   4756510    78834991  40898361895     8598.40      16.57     518.78
          02/22/2001   3717877    62068590  34131528964     9180.38      16.69     549.90
          02/23/2001   2015155    39548437  17891761639     8878.60      19.63     452.40
          02/25/2001   1637418    28828638  15157705300     9257.08      17.61     525.79
          total       15201126   260029546  1.30773E+11     8602.86      17.11     502.92
data 3    03/02/2000  25700960   607527732  3.35104E+11    13038.56      23.64     551.59
          03/03/2000  14274538   311951832  1.38782E+11     9722.33      21.85     444.88
          03/04/2000  27477024   623999587  3.16775E+11    11528.71      22.71     507.65
          03/05/2000  24472647   491724834  2.58216E+11    10551.23      20.09     525.12
          total       91925169  2035203985  1.04888E+12    11410.11      22.14     515.37

Table 1. Summary of the data sets
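The last three columns of Table 1 are ratios of the first three. As a minimal sketch (the function and variable names are illustrative, not taken from the study), the following recomputes them for two days of data set 2 whose byte counts are printed in full:

```python
# (flows, packets, bytes) for two days of data set 2, copied from Table 1.
TABLE1_ROWS = {
    "data 2, 02/20/2001": (3074166, 50748890, 22693872525),
    "data 2, 02/21/2001": (4756510, 78834991, 40898361895),
}


def flow_averages(flows, pkts, nbytes):
    """Return (bytes per flow, packets per flow, bytes per packet)."""
    return nbytes / flows, pkts / flows, nbytes / pkts


averages = {day: flow_averages(*row) for day, row in TABLE1_ROWS.items()}
for day, (flow_bytes, flow_pkts, pkt_bytes) in averages.items():
    print(f"{day}: {flow_bytes:.2f} bytes/flow, "
          f"{flow_pkts:.2f} pkts/flow, {pkt_bytes:.2f} bytes/pkt")
```

Rounded to two decimals, the results should agree with the corresponding rows of the table.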

3 Application Identification

Classification of traffic by port number is becoming increasingly challenging, since a significant fraction of traffic cannot be mapped to applications using well-known port numbers. Many protocols do not depend on well-known port numbers, but either use a well-known service to negotiate the port numbers used by secondary connections, or use arbitrary but fixed port numbers that are not registered with IANA (Internet Assigned Numbers Authority). The most popular application with negotiated port numbers is passive-mode FTP, in which the port number to use for a data connection is negotiated over the command channel. There are many other protocols that show similar behavior, such as Internet telephony applications. Note also that a significant portion of traffic falls into a category labeled `0', in which both the source and destination port numbers equal 0. This includes all the traffic that was not mapped to a specific protocol. This traffic is either generated by protocols that use negotiated port numbers at both ends (e.g., passive-mode FTP), or by new applications that use unregistered fixed ports [1].

Table 2 summarizes the protocol usage of all the flows in data set 1. TCP is by far the dominant protocol, taking up 77% of the flows, 74% of the packets and 89% of the bytes. The UDP applications came second, with 18% of the flows, 20% of the packets and 7% of the bytes.


These two protocols make up about 95% of all the traffic. The last 3 columns in the table give the average flow length in bytes and packets, and the average packet size for each protocol. The average TCP flow length is about 7 Kbytes and 18 packets. The average UDP flow length is about 2,350 bytes and 22 packets. The average packet size for TCP flows is about 3.5 times that of UDP flows. Most of the other protocols, except ICMP, typically have longer flows, both in packets and bytes, than TCP and UDP. The overall average statistics are dominated by those of TCP, as expected.

Protocol     flows        pkts        bytes  % flows  % pkts  % bytes  flow bytes  flow pkts  pkt bytes
TCP       49068320   901052020  3.45493E+11    76.75   74.47    89.05     7041.05      18.36     383.43
UDP       11291943   245017897  26534315705    17.66   20.25     6.84     2349.84      21.70     108.30
GRE          12016    41464464   9915934057     0.02    3.43     2.56   825227.53    3450.77     239.14
SKIP          8068     3393061   2556298824     0.01    0.28     0.66   316844.18     420.56     753.39
ICMP       3479950    13777467   1656550433     5.44    1.14     0.43      476.03       3.96     120.24
ESP          36627     3389917   1236749980     0.06    0.28     0.32    33766.07      92.55     364.83
AH            2695      587355    255064798     0.00    0.05     0.07    94643.71     217.94     434.26
IPINIP         923      868587    249548458     0.00    0.07     0.06   270366.69     941.05     287.30
IPIP          1815      265294     48233841     0.00    0.02     0.01    26575.12     146.17     181.81
IGMP           196       34716     22149379     0.00    0.00     0.01   113007.04     177.12     638.02
other        27207       84384     14470023     0.04    0.01     0.00      531.85       3.10     171.48
Total     63929756  1209935154  3.87982E+11      100     100      100     6068.88      18.93     320.66

Table 2. Protocol summary of data set 1
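The port-based mapping described at the start of this section, and the flow, packet and byte breakdowns it feeds, can be sketched as follows. This is a minimal illustration, not the tool used in the study: the port table is a tiny subset, the rule of labeling unmapped flows by their lower port number is an assumption, and the records are made up.

```python
from collections import defaultdict

# Small illustrative subset of well-known ports; a real table is far larger.
WELL_KNOWN_PORTS = {
    20: "ftp-data", 21: "ftp-ctrl", 25: "smtp", 53: "dns",
    80: "http", 110: "pop3", 119: "nntp", 443: "https",
}


def classify(src_port, dst_port):
    """Map a flow to an application label using whichever port is well known.

    Unmapped flows are labeled by their lower port number (an assumption),
    so traffic with both ports equal to 0 falls into the category "0".
    """
    for port in (dst_port, src_port):
        if port in WELL_KNOWN_PORTS:
            return WELL_KNOWN_PORTS[port]
    return str(min(src_port, dst_port))


def breakdown(flows):
    """Return {app: (% flows, % pkts, % bytes)} for (src, dst, pkts, bytes) records."""
    totals = defaultdict(lambda: [0, 0, 0])
    for src_port, dst_port, pkts, nbytes in flows:
        entry = totals[classify(src_port, dst_port)]
        entry[0] += 1
        entry[1] += pkts
        entry[2] += nbytes
    sums = [sum(entry[i] for entry in totals.values()) for i in range(3)]
    return {app: tuple(100.0 * entry[i] / sums[i] for i in range(3))
            for app, entry in totals.items()}


# Made-up flow records: (src_port, dst_port, packets, bytes).
records = [
    (1025, 80, 10, 4000),
    (1026, 80, 12, 5000),
    (1027, 53, 1, 64),
    (119, 1028, 400, 300000),
    (0, 0, 5, 900),
]
shares = breakdown(records)
# List applications in decreasing order of their share of bytes.
for app, (f, p, b) in sorted(shares.items(), key=lambda kv: -kv[1][2]):
    print(f"{app:6s} flows {f:5.1f}%  pkts {p:5.1f}%  bytes {b:5.1f}%")
```

Even in this toy input, the application with the most flows is not the one with the most bytes, which is why the three views are reported separately.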

[Figure 1: ring chart titled "data set 1: flows, pkts, bytes by port number". Legend labels, in order of appearance: http, nntp, smtp, 0, ftp-data, napster, https, 4041, 9995, dns, 4040, pop3, 1755, kshell, audio, half life, web-proxy, 1352, 5000, 2048, 2340, ftp-ctrl, 5501, 6901, 1044, host2-ns, 4970, 1046, 1330, 1672, tcpmux, 2233, netbios-ns, 6112, 1043, 1080, 1025, 1047, 1039, 1055, 49606, 1462, 1037, 1036, 1074, 3128, 4000, 1045, 1065, rest.]

Figure 1. Percentage of flows/packets/bytes by application for data set 1

Figure 1 is a graphic representation of the proportion of traffic by application. The inner ring illustrates the percentage of flows for the top 50 applications and the rest of the applications. The outer and middle rings show the percentages of bytes and packets for the top 50 applications and the rest of the applications. The applications are sorted by the percentage of bytes each application contributes.


The plot indicates that the composition of the traffic can look very different depending on whether the quantity of interest is the number of flows, packets or bytes. We see that HTTP is by far the most popular application: about 61% of the flows and 45% of the packets and bytes are HTTP. NNTP constitutes 16% of the bytes, but only a tiny fraction of the flows. This indicates that NNTP flows are long, with an average flow length of about 300 Kbytes and 422 packets. Other applications that have long flows are FTP, Napster, Audio, and some gaming applications. DNS constitutes 12% of the flows, but those flows are known to be short, with an average of 300 bytes.

The applications we will focus on are FTP, HTTP, SNMP, NNTP, Napster and DNS, since these are the most common applications and they make up most of the bandwidth consumption in the network.

Next we investigate the variation in the traffic composition over location and time. Additional NetFlow data was collected at two locations on the U.S. east coast. Figure 2 shows the traffic composition by flows, packets and bytes. It is clear from the charts that the traffic composition can be very different from one location to another. The view of traffic composition by flows, packets and bytes can also be very different due to the different characteristics of the applications. Day-to-day variation is minimal based on the data we collected.

[Figure 2: six panels titled "data set 2: flows", "data set 3: flows", "data set 2: packets", "data set 3: packets", "data set 2: bytes" and "data set 3: bytes". Vertical axes: % flows, % pkts and % bytes, from 0 to 100. Horizontal axes: dates from 02/20/2001 to 02/25/2001 for data set 2 and from 03/02/2001 to 03/05/2001 for data set 3. Legend: ICMP, UDP-Other, UDP-DNS, TCP-Other, TCP-NAPSTER, TCP-FTP, TCP-SMTP, TCP-NNTP, TCP-WWW.]

Figure 2. Traffic composition of data sets 2 and 3

4 Application Profiling

In this section, we examine several of the most popular and bandwidth-consuming applications and determine the stochastic characteristics of these applications based on the NetFlow data we have. For brevity, the analysis discussed in this section is based on data set 1.

4.1 Flow Length in Bytes and Time

The probability density functions of the bytes in a flow for the 6 applications are shown in Figure 3, plotted on a log scale. We see that none of these applications has a unimodal distribution function, indicating that there are several typical sizes for each application. Most applications have some mass of small flows that typically consist of acknowledgements. For HTTP, the common size is about 1,000 bytes. For Napster, the most common size is about 100 Mbytes; the mass is widely spread and drops off at around 10 Gbytes. Most of the DNS flows are short, at less than 100 bytes.

Figure 3. Probability density functions of flow bytes
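A density on a logarithmic size axis can be approximated by counting flows in logarithmically spaced bins. The sketch below is one simple way to do this and is not necessarily how Figure 3 was produced; the bin width and the sample values are assumptions chosen for illustration.

```python
import math
from collections import Counter


def log_binned_density(flow_bytes, bins_per_decade=1):
    """Fraction of flows falling into each logarithmic size bin.

    Returns {lower bin edge in bytes: fraction of flows}, ordered by size.
    """
    counts = Counter(
        math.floor(math.log10(b) * bins_per_decade)
        for b in flow_bytes if b > 0
    )
    total = sum(counts.values())
    return {10 ** (k / bins_per_decade): c / total
            for k, c in sorted(counts.items())}


# Made-up flow sizes in bytes.
sizes = [64, 80, 300, 1200, 1500, 40000]
density = log_binned_density(sizes)
for edge, frac in density.items():
    print(f">= {edge:>8.0f} bytes: {frac:.2f}")
```

Several separated peaks in such a histogram correspond to the multiple typical sizes noted above.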

Figure 4 shows the tail distributions of the flow bytes for the 6 applications of interest. They are plotted on a log-log scale. We observe that, except for Napster and HTTP, the 4 other applications exhibit clear heavy tails. In this context, we use the definition discussed in [3]. We say a distribution is heavy-tailed if

$$P(X > x) \sim x^{-\alpha} \quad \text{for large } x,\; 0 \le \alpha \le 2.$$

The DNS application has an almost perfect hyperbolic upper tail, with a relatively fast decay rate of -1.62. FTP and SNMP have similar decay rates of about -0.6. For NNTP, the tail constitutes about 5% of the mass, but it has a very slow decay rate of -0.28. Because of this distribution, the NNTP application has an average flow length of about 300 Kbytes, but the median is less than 400 bytes.

Figure 4. Tail distribution of flow bytes
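On a log-log plot, a tail of the form P(X > x) ~ x^(-alpha) appears as a straight line with slope -alpha, so a decay rate can be read off by fitting a line to the upper tail. The following is a minimal sketch of that idea using an ordinary least-squares fit on synthetic Pareto samples; the fitting method, the 10% tail cut-off and the sample data are assumptions, and the paper does not state how its decay rates were obtained.

```python
import math
import random


def tail_index(samples, tail_fraction=0.1):
    """Estimate alpha in P(X > x) ~ x^(-alpha) from the upper tail.

    Fits a least-squares line to log10 P(X > x) versus log10 x over the
    largest `tail_fraction` of the samples; the slope of that line is -alpha.
    """
    xs = sorted(samples)
    n = len(xs)
    start = int(n * (1.0 - tail_fraction))
    # Empirical P(X > xs[i]) is (n - 1 - i) / n; the last point (P = 0) is skipped.
    pts = [(math.log10(xs[i]), math.log10((n - 1 - i) / n))
           for i in range(start, n - 1)]
    mean_x = sum(p[0] for p in pts) / len(pts)
    mean_y = sum(p[1] for p in pts) / len(pts)
    sxx = sum((p[0] - mean_x) ** 2 for p in pts)
    sxy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in pts)
    return -sxy / sxx


random.seed(1)
alpha_true = 1.5
# Synthetic flow sizes with a Pareto tail and a minimum of 100 bytes.
flow_bytes = [100.0 * random.paretovariate(alpha_true) for _ in range(20000)]
alpha_hat = tail_index(flow_bytes)
print(f"estimated alpha = {alpha_hat:.2f}")
```

For alpha at or below 1 the mean of such a distribution is infinite, which is consistent with an observed mean that sits far above the median, as reported for NNTP.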

Figure 5 shows the probability density function for flow durations in ms, plotted on a log scale. The number on each plot indicates the fraction of flows with duration 0, that is, flows with 1 packet. As expected, most of the DNS flows have only 1 packet. The other applications have about 5-15% of 1-packet flows. These flows are mostly acknowledgement packets. The maximum flow duration is about 1,950 seconds. This is the flow length limit set at the router. Any flows longer than that expire, and a new flow entry is created for those pairs that continue the conversation. That is why we see the drastic drop-off for the Napster application, suggesting that many of the Napster flows probably got split into multiple flow records. Most flows of all applications except Napster last between 100 ms and 100 seconds.

Figure 5. Probability density function of flow duration in ms
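The effect of the router's flow length limit can be illustrated with a small sketch. The function name and the assumption that a conversation is cut into back-to-back records of at most 1,950 seconds are hypothetical simplifications; a real router's expiry logic also involves inactivity timers and cache pressure.

```python
def split_into_flow_records(duration_s, max_flow_s=1950.0):
    """Split one conversation into the record durations a router would export.

    Assumes a record expires exactly when it reaches max_flow_s and that a
    new record starts immediately for the remainder of the conversation.
    """
    records = []
    remaining = duration_s
    while remaining > max_flow_s:
        records.append(max_flow_s)
        remaining -= max_flow_s
    records.append(remaining)
    return records


# A hypothetical 5,000-second transfer shows up as three flow records.
long_transfer = split_into_flow_records(5000.0)
print(long_transfer)
```

Under this assumption no record is ever longer than the limit, which is why the duration distribution cannot show a heavy tail even if the underlying conversations have one.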

4.2 Average Packet Size Distribution

Since the NetFlow data does not contain records of each individual packet, we could not derive an accurate packet size distribution here. We can only calculate the average packet size for each flow. What is the percentage of flows with small packets? From Figure 6, we see that all applications have over 50% of the flows with very small packets. These are typically flows with acknowledgements. DNS is the special case here since it uses UDP, rather than TCP, as its transport protocol. About 50% of the DNS flows have a packet size of 64 bytes. These are the standard DNS queries. None of the DNS flows has an average packet size larger than 550 bytes.

Figure 6. Cumulative distribution function of average packet size per flow
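Because NetFlow exports only per-flow totals, the curves in Figure 6 are built from per-flow averages (bytes divided by packets for each record). The sketch below, using made-up records and illustrative function names, shows how this per-flow view can differ from the average taken over all packets when a few long flows carry large packets.

```python
def per_flow_avg_packet_sizes(flows):
    """Average packet size of each flow, from (packets, bytes) totals."""
    return [nbytes / pkts for pkts, nbytes in flows if pkts > 0]


def overall_avg_packet_size(flows):
    """Average over all packets: total bytes divided by total packets."""
    return sum(b for _, b in flows) / sum(p for p, _ in flows)


def fraction_at_or_below(values, threshold):
    """Empirical cumulative distribution function evaluated at threshold."""
    return sum(v <= threshold for v in values) / len(values)


# Made-up (packets, bytes) records: nine short flows of 60-byte packets
# and one long flow of 1000-byte packets.
flows = [(2, 120)] * 9 + [(1000, 1_000_000)]
per_flow = per_flow_avg_packet_sizes(flows)
small_share = fraction_at_or_below(per_flow, 64)
overall = overall_avg_packet_size(flows)
print(f"{small_share:.0%} of flows average at most 64 bytes per packet")
print(f"overall average packet size: {overall:.1f} bytes")
```

Here most flows have small packets, yet the overall average is dominated by the single long flow.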

In the case of Napster, about 20% of the flows have an average packet size near the typical MTU size of 576 bytes, and about 10% of the flows have a packet size greater than 1,000 bytes. NNTP is another interesting case, with over 90% of the flows having a very small average packet size. However, the overall average packet size for NNTP is over 700 bytes. That average is taken over all the packets, while the distribution shown in Figure 6 is based on the averages per flow. This suggests that the small percentage of long flows with larger packets makes up most of the bytes consumed by the application. The same statement also applies to the FTP, SNMP and Napster applications.

4.3 Average Throughput Distribution

Another quantity of interest is the throughput an application can achieve. Based on the NetFlow data, we can calculate the average throughput for each flow by dividing the total bytes of the flow by the duration of the flow. We exclude all the flows with only 1 packet and with an average packet size less than 60 bytes. By doing that, the calculated throughput captures the data flows rather than acknowledgements. Figure 7 shows the probability density functions for the average throughput on a log scale. The number on the chart is the average throughput in kbps across all the flows for each application. The DNS application has the highest average throughput of 44 kbps. Keep in mind that we excluded flows with only 1 packet from the calculation, and about 70% of the DNS flows have only 1 packet. However, being the only UDP application of the group, DNS is able to achieve much higher throughput than the elastic TCP applications when it has lots of bytes to transmit. Among the TCP applications, FTP and NNTP have the highest average throughput of about 28 and 24 kbps respectively. Napster has an average throughput of 12 kbps. HTTP and SNMP have about the same average throughput of about 10 kbps.

Figure 7. Probability density function of the average throughput per flow
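The per-flow throughput calculation described in this section can be sketched as follows. The record layout, the function name and the reading of the exclusion rule (a flow is dropped if it has a single packet or if its average packet size is below 60 bytes) are assumptions made for illustration.

```python
def average_throughput_kbps(flows, min_avg_pkt_bytes=60):
    """Mean per-flow throughput in kbps over flows that look like data transfers.

    Each flow is (packets, bytes, duration_ms). Flows with a single packet,
    or with an average packet size below min_avg_pkt_bytes, are skipped.
    """
    rates = []
    for pkts, nbytes, duration_ms in flows:
        if pkts <= 1 or duration_ms <= 0:
            continue  # one-packet flows have zero duration
        if nbytes / pkts < min_avg_pkt_bytes:
            continue  # most likely acknowledgements only
        rates.append(nbytes * 8 / duration_ms)  # bits per ms equals kbps
    return sum(rates) / len(rates) if rates else 0.0


# Made-up flow records: (packets, bytes, duration in ms).
sample_flows = [
    (1, 64, 0),           # single packet: excluded
    (10, 400, 2000),      # 40-byte packets: excluded
    (20, 20000, 8000),    # 20 kbps
    (50, 70000, 20000),   # 28 kbps
]
mean_kbps = average_throughput_kbps(sample_flows)
print(f"average throughput: {mean_kbps:.1f} kbps")
```

Note that this is an unweighted mean over flows, so short bursts and long transfers count equally.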

Since the traffic we captured here is between the U.S. and Asia, the round trip time is significantly higher than for traffic within the U.S. The throughput of a TCP connection during the congestion avoidance phase can be estimated by

$$\frac{MSS}{RTT} \sqrt{\frac{3}{2p}},$$

where MSS is the maximum segment size, RTT is the round trip time and p is the loss probability [6]. We observe that the longer the round trip time, the smaller the throughput. In addition, the packet loss rate over transpacific links is also relatively high. We expect higher throughput for connections with end points within the U.S.
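To make the dependence on round trip time and loss concrete, the estimate from [6] can be evaluated numerically. The MSS, RTT and loss values below are illustrative assumptions, not measurements from this study.

```python
import math


def tcp_throughput_bps(mss_bytes, rtt_s, loss_prob):
    """Congestion-avoidance estimate (MSS / RTT) * sqrt(3 / (2 p)), in bits/s."""
    return (mss_bytes * 8 / rtt_s) * math.sqrt(3.0 / (2.0 * loss_prob))


# Hypothetical paths: a short domestic path and a long, lossier transpacific path.
domestic = tcp_throughput_bps(mss_bytes=536, rtt_s=0.05, loss_prob=0.01)
transpacific = tcp_throughput_bps(mss_bytes=536, rtt_s=0.25, loss_prob=0.04)
print(f"domestic:     {domestic / 1000:.0f} kbps")
print(f"transpacific: {transpacific / 1000:.0f} kbps")
```

In this example the round trip time is five times longer and the loss probability four times higher, so the estimate drops by a factor of ten (five from the RTT term and two from the square root of the loss term).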

5 Implications and Conclusions

We collected NetFlow data at several locations on a corporate Intranet. Initial analysis of the data reveals the following:

- TCP is still (see also [5]) by far the dominant protocol, taking up over 70% of the flows and about 90% of the bytes. The UDP applications came second, with about 20% of the flows and 5% of the bytes. These two protocols make up over 90% of all the traffic.
- The average TCP flow length is about 7 Kbytes and 18 packets. The average UDP flow length is about 2,350 bytes and 22 packets. The average packet size for TCP flows is about 3.5 times that of UDP flows.
- The top applications are HTTP, SNMP, FTP, NNTP, Napster and DNS. These 6 applications make up about 80% of the total bytes.
- The composition of the applications is relatively stable from day to day, but can vary greatly from one location to another.
- The distribution of the number of bytes in a flow is heavy-tailed for SNMP, FTP, NNTP and DNS.
- Flow duration distributions do not exhibit heavy-tail behavior because of the artificial timeout value set on the routers. Evidence suggests many long flows are split into multiple flow records, especially in the case of the Napster application.
- The UDP applications achieve higher throughput than the elastic TCP applications. Based on the data we collected for transpacific connections, the DNS application has the highest average throughput of 44 kbps. Keep in mind that the calculation excludes flows with only 1 packet, and about 70% of the DNS flows have only 1 packet. Among the TCP applications, FTP and NNTP have the highest average throughput of about 28 and 24 kbps respectively. Napster has an average throughput of 12 kbps. HTTP and SNMP have about the same average throughput of about 10 kbps.

Application profiling is a critical step in QoS solution design. To aggregate applications into QoS classes effectively, one has to know the key characteristics of the applications. This is important not only for achieving, e.g., lower delay or jitter for applications that require these metrics to be below certain bounds, but also for mitigating some protocol effects, e.g., UDP achieving higher throughput than TCP in a congestion episode due to the elasticity (back-off) of the TCP protocol. Also, the amount of network management traffic (which in general needs to be assigned a higher priority to keep the network alive in case of failures, etc.) needs to be considered in the QoS design, in order to achieve the desired QoS targets for all applications.

Our study reveals that the measured traffic can be very different from location to location. Therefore there is no cookie-cutter solution for QoS design. One has to collect traffic measurements at each congested link to configure the QoS control parameters. We also found that the day-to-day variation is relatively small. There is hope that a robust solution can be effective over some time period. It may not be necessary to use a dynamic policy manager to apply real-time control. However, some fine-tuning may be needed when the traffic changes over time. This process of traffic profiling should be repeated every 3-6 months.

6 References

[1] Sean McCreary and K.C. Claffy, “Trends in Wide Area IP Traffic Patterns”, www.caida.org/outreach/papers/AIX0005.
[2] R. Caceres, P. Danzig, S. Jamin and D. Mitzel, “Characteristics of Wide-Area TCP/IP Conversations”, ACM SIGCOMM, 1991.
[3] Mark Crovella and Azer Bestavros, “Self-Similarity in World Wide Web Traffic: Evidence and Possible Causes”, ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, 1996.
[4] Danielle Liu, Frank Huebner and Yonatan Levy, “A Hierarchical Multi-Class Traffic Model for Data Networks”, ITC 16, 1999.
[5] Anja Feldmann, Jennifer Rexford and Ramon Caceres, “Efficient policies for carrying Web traffic over flow-switched networks”, IEEE/ACM Transactions on Networking, December 1998.
[6] Matthew Mathis, Jeffrey Semke, Jamshid Mahdavi and Teunis Ott, “The Macroscopic Behavior of the TCP Congestion Avoidance Algorithm”, Computer Communication Review, July 1997.
[7] V. Bolotin, J. Coombs-Reyes, D. Heyman, L. Levy and D. Liu, “IP Traffic Characterization for Planning and Control”, ITC 16, 1999.
