Relations Between In- and Outbound Email Traffic
Berend Dekens
b.h.j.dekens @ student.utwente.nl
Faculty of Computer Science, University of Twente - The Netherlands
January 9th 2006
ABSTRACT In this paper we analyze the relations between outbound and inbound email traffic. We define outbound traffic as email sent using SMTP and inbound traffic as email received or retrieved using POP or IMAP. IMAP appears to be used less than POP. Corporate networks are likely to carry more SMTP than POP/IMAP traffic, while in residential networks the situation is reversed. Even so, the amount of email traffic is negligible in comparison to the total amount of network traffic.
Keywords SMTP, POP, IMAP, email traffic, email traffic analysis

DISCLAIMER Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission. 4th Twente Student Conference on IT, Enschede 30 January, 2006. Copyright 2006, University of Twente, Faculty of Electrical Engineering, Mathematics and Computer Science

1. INTRODUCTION
Email today is based on standards developed more than 20 years ago. With the growth and development of the internet over the years, the standards for email delivery and retrieval have had some minor updates but are still largely the same. Because the internet has changed considerably in terms of use, size and bandwidth, the old standards are in many respects outdated. However, because of the wide acceptance of these standard protocols, changes in the way we use or exchange email are hard to accomplish.

Current day email relies on two kinds of transport: a transport protocol for sending email and transport protocols for retrieving email. Sending email is usually performed using SMTP (Simple Mail Transfer Protocol, [5]). Retrieving email is performed using IMAP (Internet Message Access Protocol, [4]) or POP (Post Office Protocol, [6]), the latter being the most popular due to its simplicity. Although there are alternative methods for email exchange, we leave these out of the study and focus on the commonly used protocols.

Even though email is considered a widely used communication medium, only a limited number of papers have been written about the performance of these protocols. In this paper we analyze the mentioned email protocols and compare them to each other in terms of network traffic.

1.2. Research Questions
What are the relationships between SMTP, POP and IMAP traffic? To answer this we need to determine the following:
• What is the quantity of email traffic (SMTP, POP and IMAP) in relation to the total quantity of network traffic?
• What is the quantity of sending email traffic (SMTP) in relation to the quantity of retrieving email traffic (POP, IMAP)?
• What is the average connection time when sending or retrieving email, and does this imply something about the efficiency of a particular protocol?

1.3. Research Approach
The beginning of the paper is based on existing literature about the subject. We then use the conclusions of the referenced studies as a base for this paper and to answer parts of the research questions.

Then we analyze network traffic to generate statistics which are used to fully answer the research questions. The required statistics include traffic quantity, number of connections and time-of-day data, and are collected from a network repository. We use the M2C Measurement Data Repository [7], which contains network packets from different institutes in the Netherlands and is stored locally on the same university network as the servers used for analyzing the data (hence providing fast access to the multi-gigabyte storage). The data is stored in PCAP format and has been made anonymous by stripping the payload and remapping internet addresses. This makes it impossible to determine the original content and origin of the data, but provides us with a dataset which can be analyzed and scanned for email traffic.

The captured data is analyzed using a custom designed Java program which uses the JPCAP library to read the capture files. Each capture file is scanned for traffic on ports 25 (SMTP), 110 (POP) and 143 (IMAP), and every connection is tracked while storing statistics. Since every location has multiple traces, ranging from 10 MB to 4 GB in size, the program merges the results of every analysis per location, providing a general impression of each location.
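To make the tracking step concrete, the following is a minimal sketch in Python rather than the Java/JPCAP program used for this paper. It assumes packets have already been decoded into simple tuples; the record layout, the function name `summarize_connections` and the sample addresses are all hypothetical. A connection is approximated here by its address and port 4-tuple, which ignores TCP state such as SYN/FIN handling and port reuse.

```python
from collections import defaultdict

# Ports named in the text; anything else is ignored as non-email traffic.
EMAIL_PORTS = {25: "SMTP", 110: "POP", 143: "IMAP"}


def summarize_connections(packets):
    """Aggregate per-protocol connection statistics.

    `packets` yields (timestamp_s, src, sport, dst, dport, length) tuples,
    a hypothetical record layout standing in for a real PCAP reader.
    """
    connections = {}
    for ts, src, sport, dst, dport, length in packets:
        # Normalise both directions of a flow to one (client, server) key.
        if dport in EMAIL_PORTS:
            key = (src, sport, dst, dport)
        elif sport in EMAIL_PORTS:
            key = (dst, dport, src, sport)
        else:
            continue
        conn = connections.setdefault(
            key, {"first": ts, "last": ts, "packets": 0, "bytes": 0}
        )
        conn["first"] = min(conn["first"], ts)
        conn["last"] = max(conn["last"], ts)
        conn["packets"] += 1
        conn["bytes"] += length

    # Group the tracked connections by the protocol of their server port.
    per_protocol = defaultdict(list)
    for (_, _, _, server_port), conn in connections.items():
        per_protocol[EMAIL_PORTS[server_port]].append(conn)

    summary = {}
    for protocol, conns in per_protocol.items():
        durations = [c["last"] - c["first"] for c in conns]
        summary[protocol] = {
            "connections": len(conns),
            "packets": sum(c["packets"] for c in conns),
            "bytes": sum(c["bytes"] for c in conns),
            "min_duration": min(durations),
            "max_duration": max(durations),
            "avg_duration": sum(durations) / len(durations),
        }
    return summary


# Made-up sample records (documentation addresses), not data from the repository.
sample = [
    (0.0, "10.0.0.1", 40000, "192.0.2.1", 25, 100),
    (1.5, "192.0.2.1", 25, "10.0.0.1", 40000, 60),
    (2.0, "10.0.0.2", 40001, "192.0.2.2", 110, 80),
    (2.5, "192.0.2.2", 110, "10.0.0.2", 40001, 500),
    (3.0, "10.0.0.3", 50000, "192.0.2.3", 80, 1500),  # not an email port
]

summary = summarize_connections(sample)
for protocol, stats in sorted(summary.items()):
    print(protocol, stats)
```

Merging the results of several traces for one location would then amount to summing the counters and recomputing the averages over the combined connection list.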
2. RECENT STUDIES We will now discuss results from previous studies about email protocol comparisons.
In 'A Comparative Study of Simple Mail Transfer Protocol (SMTP), Post Office Protocol (POP) and X.400 Electronic Mail Protocols' we can find that 'POP3 proved to be the most efficient protocol although it’s functionality is limited' ([1], page 10, chapter 5).
In a paper from Motorola called 'Mail Servers with Embedded Data Compression Mechanisms' there are some suggestions for improving or altering the email protocols. This paper summarizes a study that tries to determine whether modifying the SMTP and POP transport protocols to include compression mechanisms can yield an increase in protocol efficiency. The study shows clearly that adding the compression mechanisms indeed speeds up transmission. This also implies a drop in the volume of email traffic when sending or receiving email. The results show this improvement especially when sending large amounts of non-binary data ([2], page 1).

In 'A Performance Study on Internet Server Provider Mail Servers' [3] the research showed that a large part (40-50%) of the lifetime of a connection is spent on authenticating the mail user. We can therefore conclude that results with extremely short connection durations are likely incomplete or failed authentication attempts.

According to the paper 'Wide-Area Internet Traffic Patterns and Characteristics', a study in April 1995 showed that 6% of all packets were SMTP. This is a huge drop when we consider that in 1989 at Bell Laboratories almost 50% of network traffic was SMTP [8]. Because the basic workings of the protocols have not changed and email only became more popular, we can assume this relative change is due to the growth in available bandwidth.

3. ANALYSIS OF NETWORK DATA
We will now analyze the measured network traffic from the repository [7] as mentioned in chapter 1.3.

3.1. Strategy
The data will be analyzed using the JPCAP library and a custom designed program gathering the required data. We will create statistics for the amount of traffic of each protocol (SMTP, POP and IMAP) on their respective ports as well as the total amount of traffic analyzed. The transmission time stamps from the transport layers are used to generate time-of-day graphs. Even though the traces have large gaps in time, these graphs can still give a usable picture of traffic throughput. For each of the transport protocols we will record the number of connections (TCP sessions) as well as the minimum, maximum and average connection times.

3.2. Data Source
The network traffic used for the statistics in this paper is from the M2C Measurement Data Repository [7]. This traffic was captured anonymously from 4 different networks in the Netherlands. We can roughly divide the sources into 2 types: networks that contain one or more email servers, and networks where all (or at least the official) email servers are outside the monitored network.

To illustrate the difference, look at figure 3.1. In this situation we are monitoring at the router, which shows us all traffic from and to clients but not between servers. Thus the only SMTP traffic should come from a client sending an email to one of the servers outside of the intranet.

Figure 3.1. Intranet without servers

The other possibility is illustrated in figure 3.2. In this situation we are not able to see traffic between the email servers inside the intranet, but SMTP communication between servers inside and servers outside of the intranet should be visible. Of course, only POP and IMAP traffic to and from an email server outside the intranet is visible; if the clients connect to one of the internal servers, this is not visible at the router. In theory, since we are now able to see server to server SMTP streams, the statistics of this situation should show a high amount of SMTP traffic.

Figure 3.2. Intranet with servers
Location 1: The first location uses a 300 Mbit/s ethernet link with an average load of 60%. The data was gathered in July 2002. This is a residential university network connecting about 2000 students to the core network of this university. Being a residential network, it should provide us with statistics that approximate the traffic pattern of an average ISP (probably including the receiving email servers themselves). This is the network situation shown in figure 3.2. See website [7] for more information.

Location 2: The 2nd location is a 1 Gbit/s ethernet link with an average load of 1%. The data was gathered from May till August 2003. This is the network uplink of a university research institute connecting about 200 researchers and support staff to the internet. The email servers themselves are placed outside the measured network, therefore traffic between the email servers will not appear in the statistics. The situation in the network is mostly like figure 3.1, although some small mail servers might be used on the intranet. See website [7] for more information. Note that this is an office network instead of a residential network. We will still analyze the data and compare it to the (normal) residential networks.

Location 3: The 3rd location from the repository is a large college with an average load of 10-15%. The data was gathered from September till December 2003. This academic network provides internet access for over 1000 students and staff. Once again, the email servers are placed outside the measured scope, so we will only be able to analyze traffic to and from the mail servers. The situation for this network is shown in figure 3.1. See website [7] for more information. Like location 2, this is more like an office network, as it is meant for non-personal use. The only difference compared to location 2 is that the main part of the traffic is generated by the students.

Location 4: The 4th location is a 1 Gbit/s aggregated uplink for an ADSL network with an average load of 15%. The data was gathered from February till July 2004. This network provides internet access to a few hundred computers, mostly in student dorms. The bandwidth of each connection varies from 256 kbit/s to 8 Mbit/s down and 256 kbit/s to 1 Mbit/s up. Like locations 2 and 3, the mail servers are outside the scope of the measurements, so we have the situation without an internal mail server as shown in figure 3.1. Like location 1, this is a residential network which should approximate an average ISP and thus show the same traffic patterns.
4. ANALYSIS In this chapter we discuss the results of the analysis of the repository data. In 4.1 we start by discussing the results in terms of quantities of network traffic (amounts of bytes, packets and connections). Then we discuss some more details about the connections we analyzed.
4.1 Traffic Quantities Let's first compare the amount of packets for each location per protocol. Note that the statistics in table 4.1 are presented in thousands of packets (10^3) for readability.
Table 4.1. Amounts per location per protocol in packets (x 10^3)
        Loc 1     Loc 2     Loc 3     Loc 4
All     335709    162519    881369    673334
SMTP    227       3469      8520      356
POP3    883       256       6072      718
IMAP    52        21        166       321
Please note that the All row in tables 4.1 and 4.2 shows all packets or megabytes that were captured on the network, not just the email traffic. As we can see, the amount of packets used for email traffic is negligible compared to the total amount of packets. See table 4.3 for the exact percentages.

In table 4.2 we have the traffic in megabytes. One of the first things that stands out is that the packet sizes are very different for each protocol. Looking at location 1, the distribution between SMTP, POP3 and IMAP is roughly 20% to 75% to 5% when we only look at the amount of packets (table 4.1). However, when looking at the amount of traffic in megabytes (table 4.2), the distribution is roughly 35% to 60% to 5%. The protocols rank in the same order by packets and by bytes, but the distribution is not the same. It seems that SMTP uses fewer packets with larger payloads, while POP3 uses many small packets. The other locations seem to show the same pattern.

Table 4.2. Amounts per location per protocol in MB
        Loc 1     Loc 2     Loc 3     Loc 4
All     254543    107951    645102    408056
SMTP    127       1783      3032      102
POP3    220       82        1455      154
IMAP    14        9         50        102
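The two distributions quoted for location 1 can be recomputed from the location 1 columns of tables 4.1 and 4.2. The sketch below is only an illustration of that arithmetic; the helper name `protocol_mix` is our own and not part of the analysis program.

```python
# Location 1 columns of table 4.1 (packets x 10^3) and table 4.2 (MB).
LOC1_PACKETS_K = {"SMTP": 227, "POP3": 883, "IMAP": 52}
LOC1_MEGABYTES = {"SMTP": 127, "POP3": 220, "IMAP": 14}


def protocol_mix(amounts):
    """Return each protocol's share of the email total, in percent."""
    total = sum(amounts.values())
    return {protocol: 100.0 * value / total for protocol, value in amounts.items()}


packet_mix = protocol_mix(LOC1_PACKETS_K)
byte_mix = protocol_mix(LOC1_MEGABYTES)

for protocol in LOC1_PACKETS_K:
    print(f"{protocol}: {packet_mix[protocol]:.0f}% of packets, "
          f"{byte_mix[protocol]:.0f}% of megabytes")
```

SMTP's share grows when moving from packets to megabytes while POP3's share shrinks, which is the basis for the remark about packet sizes.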
Another remarkable point appears when we look at table 4.2 and treat locations 1 and 4 as the non-office type of networks and locations 2 and 3 as the office networks: in both location 1 and 4 the amount of inbound traffic (POP3 and IMAP) exceeds the amount of outbound traffic by roughly a factor 2. When we look at the quantities of locations 2 and 3 we see the opposite: in location 2 the amount of SMTP traffic exceeds that of the other 2 protocols by almost a factor 20. This could be explained by assuming that office users send more email than non-office users. Also, as the statistics of location 2 are from a research department, we could assume the users email results and project information to each other and to other researchers, thus generating significantly more traffic than any other type of email user.

Even so, when we compare the amounts of inbound and outbound email traffic (see table 4.3) to the total amount of network traffic, it is clear that email is a negligible minority. For an office network we are looking at about 2% of the total traffic. However, when we look at the non-office (residential) situation, the amount of network traffic spent on email communication is 0.3% or less.

Table 4.3. Totals of email traffic compared to all network traffic
          Loc 1    Loc 2    Loc 3    Loc 4
Packets   0.3%     2.3%     1.7%     0.2%
MBytes    0.1%     1.7%     1.1%     0.1%
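As an illustration of how the Packets row of table 4.3 follows from table 4.1, the sketch below divides the summed email packets by the All row for each location. It covers the packet counts only, and the variable and function names are our own.

```python
# Table 4.1, packets x 10^3, one column per location (Loc 1 .. Loc 4).
ALL_PACKETS = [335709, 162519, 881369, 673334]
EMAIL_PACKETS = {
    "SMTP": [227, 3469, 8520, 356],
    "POP3": [883, 256, 6072, 718],
    "IMAP": [52, 21, 166, 321],
}


def email_share_percent(email_rows, all_row):
    """Percentage of all packets that are email packets, per location."""
    shares = []
    for column, total in enumerate(all_row):
        email = sum(row[column] for row in email_rows.values())
        shares.append(100.0 * email / total)
    return shares


packet_shares = email_share_percent(EMAIL_PACKETS, ALL_PACKETS)
for location, share in enumerate(packet_shares, start=1):
    print(f"Loc {location}: {share:.1f}% of packets are email")
```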
When we plot the percentage of email traffic relative to the total amount of traffic against the date of the measurements (we took 2003 as an average for the repositories) and take into account the 50% from 1989 and the 6% from 1995, we get a graph that shows an exponential decrease, see figure 4.4.

Figure 4.4. Amount of email traffic compared to the total amount of traffic (x-axis: years 1989 to 2003; y-axis: 0% to 50%)

Even though email is used more and more, the graph in figure 4.4 shows a decrease. This can be explained by the explosive increase in bandwidth over the years.

When we look at the average packet size (table 4.5), we can clearly see that all the email protocols use considerably smaller packets than the average for all network traffic.

Table 4.5. Average packet size in bytes
        Loc 1    Loc 2    Loc 3    Loc 4
All     795      696      767      635
SMTP    587      538      373      302
POP     260      336      251      224
IMAP    277      453      314      332

POP also has the smallest packets at every location, even though the average packet size of each protocol varies considerably between locations.

4.2 Connections
The number of connections is roughly proportional to the amount of packets, see table 4.6.

Table 4.6. Number of recorded connections
        Loc 1    Loc 2    Loc 3     Loc 4
SMTP    4962     82490    605425    12796
POP     31080    3954     229406    27365
IMAP    1432     246      2463      5670

Location 1 shows a different picture, with a huge leap in the number of connections for POP while the amount of packets and the amount of bytes are not proportionally higher. This could be caused by email clients checking for new mail at a regular interval.

The claim that POP is the most efficient protocol [1] can be confirmed by looking at the average amount of bytes needed for each connection (table 4.7). Except for location 3, all locations show POP as the least bandwidth consuming protocol. The fact that location 3 shows SMTP as the most efficient protocol can be explained by the amount of email traffic. As this location has a high volume of SMTP traffic, it is possible that most of the traffic consists of short emails. The result is a decrease in the overall size of SMTP transmissions.

Table 4.7. Average amount of bytes per connection
        Loc 1    Loc 2    Loc 3    Loc 4
SMTP    26930    22666    5250     8428
POP     7417     21828    6651     5903
IMAP    10201    39362    21298    18847

Another aspect of protocol efficiency is the required time per connection. In table 4.8 we can clearly see that POP requires the least amount of time to operate. One feature of IMAP is that it is designed to reuse a single connection, which results in longer connection durations. This is also clearly visible in table 4.8.

Table 4.8. Average connection duration in seconds
        Loc 1    Loc 2    Loc 3    Loc 4
SMTP    13.9     29.9     60.3     54.1
POP     3.3      9.4      9.4      2.9
IMAP    69.3     328.0    83.8     53.7

When looking at the average throughput of every protocol, we can only conclude that POP is by far the most efficient, as it is the fastest. See table 4.10.

Table 4.10. Average throughput in bytes/second
        Loc 1    Loc 2    Loc 3    Loc 4
SMTP    1937     758      87       156
POP     2234     2326     704      2026
IMAP    147      120      254      350

For reference we included the time of day chart for location 4 in figure 4.9. Note that the chart consists of 2 parts: from 04:10 to 04:24 and from 20:01 to 20:15. The average variance in the time of day chart for location 4 is 3.6% for SMTP, 3.1% for POP and 3.4% for IMAP.

Figure 4.9. Time of day for location 4 (series: SMTP, POP and IMAP; x-axis: 04:10 to 04:24 and 20:01 to 20:15; y-axis: 0 to 20)
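The throughput figures follow from the other per-connection statistics: average bytes per connection divided by average connection duration. The sketch below recomputes them from the values of tables 4.7 and 4.8 and compares the result with table 4.10. It is an illustration with names of our own choosing; small deviations are expected because the published durations are rounded to one decimal.

```python
# Per-protocol values for Loc 1 .. Loc 4, as read from the tables.
BYTES_PER_CONNECTION = {  # table 4.7
    "SMTP": [26930, 22666, 5250, 8428],
    "POP": [7417, 21828, 6651, 5903],
    "IMAP": [10201, 39362, 21298, 18847],
}
DURATION_SECONDS = {  # table 4.8
    "SMTP": [13.9, 29.9, 60.3, 54.1],
    "POP": [3.3, 9.4, 9.4, 2.9],
    "IMAP": [69.3, 328.0, 83.8, 53.7],
}
REPORTED_THROUGHPUT = {  # table 4.10, bytes/second
    "SMTP": [1937, 758, 87, 156],
    "POP": [2234, 2326, 704, 2026],
    "IMAP": [147, 120, 254, 350],
}


def throughput(bytes_per_connection, durations):
    """Average bytes per second for each location of one protocol."""
    return [b / d for b, d in zip(bytes_per_connection, durations)]


computed = {
    protocol: throughput(BYTES_PER_CONNECTION[protocol], DURATION_SECONDS[protocol])
    for protocol in BYTES_PER_CONNECTION
}

for protocol, values in computed.items():
    for location, (value, reported) in enumerate(
        zip(values, REPORTED_THROUGHPUT[protocol]), start=1
    ):
        print(f"{protocol} Loc {location}: computed {value:.0f} B/s, table {reported} B/s")
```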
5. SUMMARY As stated, the only network traces that should include interserver email communication are those of location 1. It therefore seems reasonable to expect an increase in SMTP traffic there, as the servers send mail to each other. For some reason, this increase is not visible at all. We can conclude that either the dataset does not contain interserver communication or (more likely) the amount of POP and IMAP traffic exceeds the increase in server to server SMTP traffic.

One thing that stands out is that POP appears to have the smallest packets (on average) and uses fewer bytes per connection than both IMAP and SMTP (note that the difference between SMTP and POP seems to shrink when SMTP is used more). The conclusion that POP is more efficient than SMTP [1] seems to hold: the average amount of bytes needed in a connection is lower for POP than for both IMAP and SMTP. The fact that POP traffic constitutes the largest portion of the inbound traffic can be explained by the high number of connections.

We could explain the preference for POP over IMAP by looking at the protocols. IMAP never caught up with POP even though it has more features. While IMAP provides more advanced functionality, this also means the protocol is more complex, which puts more load on an email server. This might explain why POP3 seems to be favored over IMAP from the perspective of an administrator, since the server should serve as many users as possible on its existing hardware. Software which requires more processing power increases the hardware needs and with them the maintenance costs of the server. The result is that after decades we are still mainly using the simple (and feature limited) but far more efficient and thus faster POP over IMAP. Hence most users might find themselves with a mail server which does not support IMAP at all.
In every data set the results show that the amount of email traffic is very small compared to the total amount of captured network traffic. In 1.35 TB of analyzed network traffic we have roughly 4.9 GB of SMTP data, 1.9 GB of POP data and 175 MB of IMAP data. As we found in [8], almost 6% of the total traffic was SMTP back in 1995. If we assume this total data set is a valid general representation of internet traffic, we have to conclude that ten years later at most 0.4% of the total traffic is SMTP. (Note that this percentage is obtained by adding the SMTP traffic from both the office and non-office situations.) When including IMAP and POP with SMTP, we can say that 0.5% of the total amount of traffic is email.
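These totals can be reproduced by summing the rows of table 4.2 over all four locations. The sketch below does so under the assumption that the units are binary (1 GB = 1024 MB, 1 TB = 1024 GB), which is the reading that matches the quoted figures; the names are illustrative only.

```python
# Table 4.2 in MB, one value per location (Loc 1 .. Loc 4).
MEGABYTES = {
    "All": [254543, 107951, 645102, 408056],
    "SMTP": [127, 1783, 3032, 102],
    "POP3": [220, 82, 1455, 154],
    "IMAP": [14, 9, 50, 102],
}

# Sum every row over the four locations.
totals_mb = {name: sum(values) for name, values in MEGABYTES.items()}

# Assumed binary units: 1 GB = 1024 MB, 1 TB = 1024 GB.
total_tb = totals_mb["All"] / 1024 ** 2
smtp_gb = totals_mb["SMTP"] / 1024
pop_gb = totals_mb["POP3"] / 1024

smtp_share = 100.0 * totals_mb["SMTP"] / totals_mb["All"]
email_mb = totals_mb["SMTP"] + totals_mb["POP3"] + totals_mb["IMAP"]
email_share = 100.0 * email_mb / totals_mb["All"]

print(f"Total traffic: {total_tb:.2f} TB")
print(f"SMTP: {smtp_gb:.1f} GB, POP: {pop_gb:.1f} GB, IMAP: {totals_mb['IMAP']} MB")
print(f"SMTP share: {smtp_share:.2f}%, all email: {email_share:.2f}%")
```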
6. CONCLUSIONS Because of the rapid pace of technology development since the design of the protocols studied here (SMTP, POP and IMAP), the standards might be outdated in terms of security and features. However, owing to the explosive increase in bandwidth, the amount of traffic they generate is negligible compared to the total network traffic. The study [2] shows that the protocols can be extended to increase performance, although the amount of bandwidth required for email is already acceptable for any network, as less than 0.5% of the traffic is email (in general and on average).

We saw that the amount of IMAP traffic is very low compared to its POP counterpart. As mentioned before (see chapter 5), this is probably due to protocol complexity. It also seems reasonable to expect that the amount of outgoing traffic matches the amount of incoming traffic. The statistics show otherwise: in residential networks the amount of POP and IMAP traffic is much higher than the amount of SMTP traffic. In office networks we saw the opposite: SMTP traffic exceeds POP and IMAP traffic by a wide margin.

The small packets and numbers of packets per connection in the result sets back up the conclusion that POP is an efficient protocol [1], even without any enhancements [2]. Another indication of POP's efficient behavior is the fact that it has the shortest connections on average. SMTP is the runner-up, with IMAP coming in last with very long lasting connections. The latter can be explained by the fact that IMAP can maintain a connection while reading mail and checking for new mail, thereby removing the need to re-authenticate for every update, which is a time consuming part of the mail exchange [3].
References
[1] P. Tzerefos, C. Smythe, I. Stergiou and S. Cvetkovic, “A Comparative Study of Simple Mail Transfer Protocol (SMTP), Post Office Protocol (POP) and X.400 Electronic Mail Protocols”, IEEE Computer Society, Proceedings of Conference on Local Computer Networks, November 1997
[2] A. Nand, T. Lai Yu, “Mail Servers with Embedded Data Compression Mechanisms”, IEEE Computer Society, Proceedings of the Conference on Data Compression, 1998, ISSN: 1068-0314
[3] J. Wang, Y. Hu, “A Performance Study on Internet Server Provider Mail Servers”, IEEE Computer Society, Proceedings of Computers and Communications, 2004, ISBN: 0-7803-8623-X, pages 56-61 of Vol. 1
[4] RFC 2060 - Internet Message Access Protocol - Version 4rev1, seen on December 1st 2005, http://www.ietf.org/rfc/rfc2060.txt
[5] RFC 821 - Simple Mail Transfer Protocol, seen on December 1st 2005, http://www.ietf.org/rfc/rfc0821.txt
[6] RFC 1939 - Post Office Protocol - Version 3, seen on December 1st 2005, http://www.ietf.org/rfc/rfc1939.txt
[7] M2C Measurement Data Repository, seen on December 1st 2005, http://m2c-a.cs.utwente.nl/repository/
[8] K. Thompson, G.J. Miller, and R. Wilder, “Wide-Area Internet Traffic Patterns and Characteristics”, IEEE Network, November/December 1997