PACE: Prefetching and Filtering of Personalized Emails at the Network Edges (Technical Report: CS-MIST-TR-2003-005)

Jayashree Ravi†, Weisong Shi†, and Chengzhong Xu¶

† Department of Computer Science, Wayne State University, {jravi123,weisong}@wayne.edu

¶ Department of ECE, Wayne State University, [email protected]

Abstract

In this paper, we present a new technique for generating personalized dynamic pages that display emails, using pre-fetching and caching at the edges of the network. Our approach is to pre-fetch emails from all of an individual’s email accounts, cache the necessary templates of the service provider, and create the personalized dynamic page at the edge of the network. We have proposed, developed, and evaluated a prototype, comparing it with popular HTTP-based free email servers. The results show an improvement in user-perceived latency of up to 93% and bandwidth savings of up to 86% for the most popular email sizes. We also propose centralized spam management at the proxy level, which helps both the user and the service provider eliminate spam at the origin server and thus save the bandwidth that spam emails would otherwise consume.

1 Introduction

Dynamic and personalized content delivery has recently attracted a great deal of attention from both the commercial and research communities. One reason is the growing popularity of dynamic Web services, exemplified by news sites and personalized sites (e.g., my.yahoo.com), both of which require dynamic generation of content. The other reason is the “trickle-down” effect [10, 15] of widely deployed proxy caches and content delivery networks (CDNs), which effectively filter incoming requests for static Web content and presumably shift the traffic seen on the Internet from popular static objects to less popular objects and dynamic Web content.

Dynamic Web content can be broadly classified into two types. The first type, which we henceforth call simply dynamic pages, is generated without taking the user’s session into account. Such a page does not depend on who accesses it: at any instant, the generated page is the same for every user. The second type, which we call personalized dynamic pages, is generated when the user accesses the page through a secured system. In this case the generated page is tailor-made for each user. The most popular personalized dynamic page today is the one generated for displaying emails. For example, Hotmail.com alone hosts about 84 million free email accounts, whose holders are spread across the globe. Our previous analysis of a mid-size personalized Web site shows that 90% of the accesses to the site are for email [27].

Today most of us have at least one Web-based email account. A Web-based (a.k.a. HTTP-based) email account has the advantage that it can be accessed from any place with an Internet access point. However, for every email access the user is logged onto the main HTTP email servers, and conventional Web caching and CDNs are of little help. Because the content is personalized and requires authentication, Akamai [1] servers simply pass the entire request to the origin servers and pass the response back to the user without caching any data in between. Several techniques have been proposed to cache dynamic pages, as described in Section 2.1. Personalized pages could be cached with these techniques, but this requires storing additional per-user information at the cache. This is the method adopted in CONCA [26].

From the account holder’s point of view, a user may have numerous email accounts with different service providers, and the emails belonging to him/her are stored in geographically distant places spread across the globe. However, all this data must eventually flow toward the user, which happens when the user reads his/her emails. The user also has to access each email account separately, with separate authentication for each, even though all of them deliver the same kind of information: emails with attachments, which are sent over SMTP at the back end. The user also has to manage a separate spam list at each email account, even though the lists may be identical and could be copied to the other servers instead of being entered at each server separately, as is done today. From these observations, we conclude that caching and generating personalized email pages must be addressed with a technique that coherently handles all the common features of an email page while leaving enough room for diversification to accommodate each service provider’s requirements.

In this paper we propose such a technique and develop a prototype called PACE that generates these pages at the edges of the network. PACE is built on five concepts: (1) capture the user’s details, pre-fetch the most recent emails from all of the user’s email accounts, and store them separately; (2) cache the different templates of the various email service providers; (3) centralized spam management to help the user eliminate spam at the origin email servers; (4) a single authentication per user to access all of his/her email accounts; (5) dynamically decide the placement of per-user information at the proxy closest to the user or nomadic user, based on the user’s access patterns.

We have built and evaluated a PACE prototype. The evaluation results show the following benefits for the most common email sizes: (1) improvement in user-perceived latency of between 5% and 93%; (2) bandwidth savings of between 75% and 86% from caching the basic email templates alone. Beyond these measured benefits, we also expect the following additional benefits: (1) service availability is improved due to the close proximity of the application to the end user; (2) network problems between the proxy and the origin server would not affect the system; (3) the load on the origin servers is reduced, improving their scalability; (4) additional bandwidth is saved because spam is eliminated at the origin server itself.

The rest of this paper is organized as follows. In Section 2, we describe the operation of a regular HTTP email system. The design and implementation of the PACE architecture are described in Sections 3 and 4, respectively. Section 5 presents the performance evaluation. Section 6 discusses related work, and we summarize in Section 7.

2 Background

2.1 Caching Methods for Dynamic Pages

Present-day techniques for caching dynamic Web content can be broadly classified into two methods, content caching and function caching, either of which can be used at server-side or cache-side proxies.

In content caching, the HTML page that the application generates is cached as separate fragments/channels. These fragments/channels are maintained as separate objects in the edge server’s cache and are dynamically assembled into Web pages using XML/JavaScript-type languages, fetching only non-cacheable or expired fragments/channels from the origin server in response to user requests. The origin server supports this assembly, and the information exchanged between the origin server and the proxy is XML-type data. Representatives of this approach include Akamai [1], CONCA [26], and Client Side Include (CSI) [24]. All of them are built upon the edge-side include (ESI) technology [29].

In function caching, the application itself is replicated and cached along with its associated applications, so that the edge servers run the application instead of the origin server. What is exchanged between the origin server and the proxy is the application itself. Examples of this method are vMatrix [4], IBM WebSphere [16], Active Cache [6], Gemini, and SEE [19].
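To make the content caching idea concrete, the following is a minimal sketch of fragment assembly at an edge server. It is not taken from any of the cited systems: the function name `assemble_page`, the cache layout, and the convention that a time-to-live of 0 marks a non-cacheable fragment are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple


def assemble_page(fragment_ids: List[str],
                  cache: Dict[str, Tuple[str, float]],
                  fetch_from_origin: Callable[[str], Tuple[str, float]],
                  now: float) -> str:
    """Build a page from fragments, refetching only missing or expired ones.

    cache maps fragment id -> (content, expiry time). fetch_from_origin
    returns (content, ttl_seconds); a ttl of 0 marks a non-cacheable
    fragment (an assumed convention for this sketch).
    """
    parts = []
    for frag_id in fragment_ids:
        cached = cache.get(frag_id)
        if cached is None or cached[1] <= now:
            # Missing or expired: this fragment must come from the origin.
            content, ttl = fetch_from_origin(frag_id)
            if ttl > 0:
                cache[frag_id] = (content, now + ttl)
        else:
            # Still fresh: serve it from the edge cache.
            content = cached[0]
        parts.append(content)
    return "".join(parts)
```

In this sketch only the fragments that are absent, expired, or non-cacheable cause traffic to the origin server; everything else is served from the edge cache.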

2.2 Basic HTTP Email Architecture

A typical HTTP-based email system is shown in Figure 2(a). The user logs in to the HTTP server through a secured protocol such as HTTPS. The HTTP server is in turn connected to a back-end mail server on which the user has an account. The back-end mail server exchanges mail with other mail servers through the SMTP protocol. Another way of viewing emails is through the POP protocol, by configuring an email agent such as Outlook Express on the client machine. The disadvantage of this approach appears when a user accesses email from several different machines.

Spam handling: In HTTP email systems, spam protection is handled at the origin mail servers, and each of the user’s email accounts at different email servers has to be configured separately for spam. With user agents using the POP protocol, a spam email is first downloaded to the client machine and only then classified as spam by the user agent, by which time the network bandwidth has already been consumed.

Generic content of an email page: A typical email page consists of three parts: (1) The basic email pertaining to the user. (2) HTML objects that the email service provider embeds with the email and that arrive together with it as one single page, with a single session key associated with the set. These objects make up 18KB to 40KB of data, as observed on popular email Web sites such as Yahoo and Hotmail. We refer to this as the basic template in this paper. (3) Objects referenced through ‘src’ attributes in the basic template, which are typically used to retrieve images through separate HTTP GET requests that may or may not be associated with the particular session.

Even if the email itself is 1 byte, a page of 18-40KB is generated to carry this 1 byte of wanted information. This 18-40KB of data can be kept at the edge server if the page is generated there instead of at the origin server. Across millions of users, saving 18-40KB per email will significantly reduce the Internet traffic between the edge server and the origin server. Many solutions are available today for caching part (3) objects; however, we have not come across solutions that address parts (1) and (2). Our contribution in this paper is to pre-fetch and cache parts (1) and (2) and to generate the page at the edge.

Figure 1: Email size distribution (CDF versus email size in bytes).

The emails that a user receives vary in size depending on what the sender has sent. Whatever the size of the email, the template added by a given service provider is of a fixed size. To understand the size distribution of emails, we collected the sizes of the emails that our group members received in the inboxes of their email accounts. The size distribution is shown in Figure 1. More than 90% of the emails our group members received are smaller than 6KB, of which more than 80% are smaller than 3KB. However, the basic template added to each email is between 18KB and 40KB. This data reinforces our motivation to generate the dynamic page at the edge by caching the template.
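The sizes quoted above allow a rough estimate of how much of an origin-generated page is template. The sketch below is a back-of-the-envelope calculation under simplifying assumptions, not the measurement method of this paper: it ignores part (3) objects and protocol overhead and counts only the basic template and the email body. The function name `template_share` is ours.

```python
def template_share(template_kb: float, email_kb: float) -> float:
    """Fraction of an origin-generated page taken up by the basic template."""
    return template_kb / (template_kb + email_kb)


for template_kb in (18, 40):      # template sizes quoted in the text
    for email_kb in (3, 6):       # email sizes from the Figure 1 discussion
        share = template_share(template_kb, email_kb)
        print(f"{template_kb}KB template + {email_kb}KB email: "
              f"{share:.0%} of the page is template")
```

For an 18KB template this gives 75% for a 6KB email and about 86% for a 3KB email. These values line up with the 75%-86% bandwidth savings quoted in the Introduction, although the paper does not state that the range was derived this way.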

3 Design of PACE

The flow of email in a PACE system is shown in Figure 2(b). Instead of accessing the origin HTTP mail server, the user now accesses a PACE account to check his/her emails; PACE in turn pre-fetches the emails and caches the templates and other associated objects needed to generate the email page for the user.

3.1 Content Design

Figure 2: (a) Flow of emails in regular HTTP/POP email systems; (b) Flow of email in a PACE system.

The content of the HTML page generated by PACE is designed for two possible customers:

1. Email service provider: The email service provider can use the services of an Internet service provider that hosts PACE (similar to outsourcing to a CDN) to pre-fetch emails and cache the email service provider’s templates, which are plugged in when the emails are delivered to the user. The user can have a free account with PACE as well as with the email service provider, and views different templates for different accounts. This use is in line with the concept of pre-fetching and caching dynamic pages at the edges of the network, for which PACE is primarily designed.

2. User: If the email service provider does not wish to use the services of PACE, the end user can still use it. The end user can create an account with the Internet service provider by paying for the services of a PACE account. In this case PACE simply retrieves the emails and displays them on an HTML page without any advertisements from the email service provider. The user gets all the benefits of an HTTP email account, and PACE works as a Web-based user agent with improved accessibility, latency, and availability, and with additional edge services (e.g., virus scanning and spam management) provided by the service provider.

The Internet service provider (ISP) is the agency that hosts PACE; hence we have not considered the ISP as a customer. However, an ISP that wishes to plug in its own advertisements with the emails, instead of those of the email service provider, can do so. In this case the user can open an account with the ISP for free, and the ISP adds its advertisements to the emails retrieved from the email service provider and generates the HTML page. All the other edge services could also be provided with it as a complete package to the user.

With these options PACE can generate three types of HTML pages for displaying emails, depending on the type of the user’s account: (1) an HTML page with emails and advertisements from the email service provider; (2) a simple HTML page with only the user’s emails and no advertisements; (3) an HTML page with emails and advertisements from the Internet service provider that hosts PACE. A mix of these three types is also possible, depending on which agency is paying for the services of PACE.

The PACE proxy does not include a mail server. Hence all email accounts have to be created with the respective mail servers of the email service providers, and the account information must be entered in PACE. To send mail, the user has to access the respective mail server where he/she has an account; PACE provides links that redirect the user to the email service provider’s Web site. The PACE design could be extended to other templates too; for example, templates used for sending emails could also be cached.

3.2 Support for Dynamic Placement of Per User Information

Any user who has created a proxy account may access PACE from many different locations. For example, a student may access the account from home in the mornings and evenings and from school during the daytime. Similarly, an office worker would access it from the office during the daytime. We assume that most users operate from locations that lie within a radius for which a single proxy is the closest to all of them. However, if the locations fall under different PACE proxies, as shown for user 1 in Figure 3(a), then each proxy that is close to one of the user’s access points (i.e., home and school/office) becomes a child PACE proxy for the user. Generally the two Internet access points belong to different networks. A parent PACE proxy is then identified that is higher up in the network hierarchy and connected to all the child proxies. This parent proxy keeps all the pre-fetched emails of the user, and the user’s home account is established there. Each child caches the other common cacheable components that make up the dynamic page. The email page is generated at the child proxy by fetching the emails from the parent proxy and plugging in the common components held at the child. To improve latency, the emails could be replicated from the parent proxy to the child proxies. We intend to study this separately to find the best method of caching and replicating data between child and parent.

Figure 3: (a) Distributed caches dynamically identify the parent node. (b) Structure showing shared and personalized portions.

However, if the user temporarily moves away from the parent zone, we extend the nomadic support of CONCA [26] to handle the situation. Each CONCA node stores per-user state, which allows the state to be recreated on another proxy node that the user is currently close to. When a user travels away from his/her home cache, requests from client applications are routed to whichever cache is nearest to the user’s current location. This new cache contacts the user’s home cache to obtain the state associated with the user. It can then satisfy the user’s requests more efficiently by reusing locally cached content and pre-fetching personalized content from the user’s home cache. Each proxy node stores shared and personalized information separately, as shown in Figure 3(b).
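The division of labor between parent and child proxies described in this section can be sketched as follows. This is an illustrative model only, not the prototype’s code: the class names `ParentProxy` and `ChildProxy`, the in-memory dictionaries, and the `{{EMAILS}}` placeholder in the template are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ParentProxy:
    """Holds the pre-fetched emails (personalized portion) for each user."""
    emails: Dict[str, List[str]] = field(default_factory=dict)

    def emails_for(self, user_id: str) -> List[str]:
        return list(self.emails.get(user_id, []))


@dataclass
class ChildProxy:
    """Holds shared templates and asks the parent for per-user emails."""
    parent: ParentProxy
    templates: Dict[str, str] = field(default_factory=dict)  # provider -> template

    def render_page(self, user_id: str, provider: str) -> str:
        template = self.templates[provider]                 # shared, cached at the child
        body = "\n".join(self.parent.emails_for(user_id))   # personalized, from the parent
        return template.replace("{{EMAILS}}", body)         # hypothetical placeholder
```

Only the small personalized portion travels from the parent to the child in this model; the larger shared template never leaves the child’s cache.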

3.3 Authentication

We extend the CONCA proxy to handle user authentication. The edge server itself can handle trust for the purpose of email retrieval instead of passing the HTTPS request on to the origin server. The user enters the login information for each of his/her email accounts into the PACE database, and PACE uses this information to pre-fetch the emails. The pre-fetched emails are stored in PACE. We propose a single authentication for the end user at PACE to view the emails of all of the user’s email accounts. Generally, a user who checks email on one account would like to check all of his/her email accounts in one go. Hence a single authentication at the edge proxy server saves the time and bandwidth needed for authentication at the individual origin email servers. However, when the user needs to enter credit card information or perform other more trust-dependent activities (e.g., using a shopping cart), a link is provided that takes the user to the higher degree of trust management at the origin servers.

Pre-fetching: When a user creates a profile in the proxy, a separate ID is created for the user and is used to create a per-user directory in the proxy, where emails and attachments are stored in the personalized portion of the proxy as shown in Figure 3(b). A pre-fetching program called MailFetcher runs as a daemon process, checking for new emails at regular intervals. All the emails are stored in the per-user directories and the database. After a certain period of time the cache may fill up with the user’s emails, and both the origin server and the cache would hold the same set of data. Our intention is not to duplicate the storage of emails. We assume that most people want to see only their latest emails and not those they have already read. Once the user logs in and views emails in the cache, those emails are marked for deletion, and after a time interval they are deleted, which conserves space at PACE. The origin server, however, still carries all the emails, both new and old. Hence the origin mail server remains the primary storage for all of the user’s emails. When the user wishes to see an old email, the required mail is fetched at that instant and the email page is generated. Generating this page takes longer than viewing any of the latest emails, but we assume the time is close to the latency of accessing it from the origin server directly.
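The mark-then-delete policy for viewed emails can be illustrated with a small sketch. The class name `EmailCache`, the `retention_seconds` parameter, and the in-memory storage are assumptions made for illustration; the prototype stores emails in per-user directories and a database, and the paper does not specify the deletion interval.

```python
import time
from typing import Dict, Optional


class EmailCache:
    """Per-user email cache that drops messages some time after they are viewed."""

    def __init__(self, retention_seconds: float) -> None:
        self.retention_seconds = retention_seconds   # assumed grace period
        self._bodies: Dict[str, str] = {}             # message id -> body
        self._viewed_at: Dict[str, float] = {}        # message id -> first view time

    def store(self, msg_id: str, body: str) -> None:
        self._bodies[msg_id] = body

    def view(self, msg_id: str, now: Optional[float] = None) -> str:
        """Return the body and mark the message for later deletion."""
        now = time.time() if now is None else now
        self._viewed_at.setdefault(msg_id, now)
        return self._bodies[msg_id]

    def purge(self, now: Optional[float] = None) -> int:
        """Delete messages viewed at least retention_seconds ago."""
        now = time.time() if now is None else now
        expired = [m for m, t in self._viewed_at.items()
                   if now - t >= self.retention_seconds]
        for msg_id in expired:
            del self._bodies[msg_id]
            del self._viewed_at[msg_id]
        return len(expired)

    def __contains__(self, msg_id: str) -> bool:
        return msg_id in self._bodies
```

Unread emails are never purged in this sketch; a request for a purged email would have to fall back to the origin server, as the text describes.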

3.4 Spam Filtering

Recent studies have shown that 40% of all emails today are spam [2], and the related cost is escalating tremendously [17]. We propose to address this problem in two ways.

In the first method, the user can mark any email from any account as spam. This is logged in the database, and appropriate action is then taken while pre-fetching emails from the origin servers. Hence identifying a spam sender once is enough to handle that spam in all the accounts. However, identifying the sender’s email ID alone is not enough to handle spam: at present most spam emails are sent with an invalid or false sender address, and sometimes the sender address of the spam is the user’s own address.

To handle this, as the second method we propose to develop a spam management system that studies the user’s behavior to identify spam. An agent deployed for each user would study, on a regular basis, the emails that the user deletes with or without reading them. If the user has deleted a particular email from a particular account, the agent searches for the same content in the emails of the other accounts. If it finds another email whose content matches by more than a threshold value α, it classifies that email as spam. In this case we intend to focus on the content of the email rather than the sender’s email ID. The information from one user’s agent can be merged with the results of other users’ agents, and putting all the users’ data together can help in determining spam at a higher level of abstraction. The information on spam collected individually at every node can be exchanged with other proxy nodes to tackle spam effectively. Every identified spam mail is assigned a weight based on these methods, and the mail with the highest weight is circulated to the other caches. Identified spam can be deleted at the origin server itself by sending control signals from PACE, instead of using bandwidth to fetch the email to the proxy. This would be especially advantageous for users who use user agents to fetch mail. However, self-detecting spam management is not yet incorporated in our prototype.
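The threshold test on content similarity can be sketched as below. The paper does not specify how content matching is measured, so the use of Python’s `difflib.SequenceMatcher` ratio as the similarity score, the default value of α, and the function names are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
from typing import Iterable


def content_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; SequenceMatcher is an assumed stand-in metric."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def looks_like_spam(candidate: str, deleted_bodies: Iterable[str],
                    alpha: float = 0.9) -> bool:
    """Flag candidate if it matches any user-deleted email by more than alpha."""
    return any(content_similarity(candidate, d) > alpha for d in deleted_bodies)
```

A per-user agent could call `looks_like_spam` on each newly pre-fetched email, passing the bodies of emails the user recently deleted from any account. Choosing α trades missed spam against wrongly flagged legitimate mail, and the paper leaves that choice open.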

4 Implementation

We have designed and developed a prototype in Java, using JSP and Servlets for the user interface. The Jakarta Tomcat 3.2.3 server [3] is used for generating JSP pages. We use MySQL [22] as the back-end database to store emails and user information. Our prototype has two main modules.

The first module is the user interface, which enables the user to create an account with our edge server proxy, PACE. During account creation, the user enters a login and password for the proxy as well as the details of all his/her email accounts, including the authentication details for each account. For example, if the user has a WSU account, the user enters smtp.wayne.edu as the mail server URL for retrieving mail, together with the login and password for the WSU account. This per-user information is stored in the proxy and used to pre-fetch emails from the origin server. After account creation, the user can log into his/her account as shown in Figure 4(a) and read emails by clicking on the links provided for each account, without a separate authentication for each account, as shown in Figure 4(b).

Figure 4: A snapshot of dynamic pages generated from PACE: (a) login page; (b) JSP page generated using the Hotmail template.

Figure 5: Sequences of operations: (a) user interface; (b) Mail Fetcher. Flowchart labels in (a): Accept request; Is email request? (NO: forward the request like a regular proxy; YES: forward to Tomcat to handle); Authentication; Generate personalized JSP pages with email account links; Logout. Flowchart labels in (b): Mail Server; Any new mails?; Store the mails in PACE.

The second module is the Mail Fetcher (see Figure 5(b)), which periodically pre-fetches the emails of all users for all the accounts they have configured for pre-fetching. When the Mail Fetcher checks for mail, it ensures that mails that have already been fetched are not fetched again. This check is done before a mail is retrieved from the server, so no extra bandwidth is spent on emails that were fetched earlier. The complete information of the fetched mails is stored in the MySQL database, and attachments are saved as files on the hard disk in per-user directories.

The inputs for making the personalized dynamic pages are the emails pre-fetched through SMTP and the templates of the email service provider or the ISP, which are stored on the hard disk. In Tomcat, the class files are compiled the first time a page is invoked. On subsequent invocations the pre-compiled Java classes are reused, so the page takes less time to generate than on the first invocation even though the underlying data in the database changes. However, we would like to emphasize that how the dynamic pages are created is purely the proxy vendor’s choice, and there is no binding between the proxy application used to generate the pages and the origin mail server application.
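The paper does not say how the Mail Fetcher recognizes already-fetched mail before downloading it. One common way to do this over POP (the retrieval protocol mentioned in Section 5.1) is the UIDL command, which lists a unique ID per message without transferring message bodies. The sketch below uses Python rather than the prototype’s Java; the helper name `new_messages`, the `seen_uids` set, and the placeholder host name are assumptions, while `poplib.POP3.uidl()` and `retr()` are standard-library calls.

```python
from typing import Iterable, List, Set, Tuple


def new_messages(uidl_lines: Iterable[bytes],
                 seen_uids: Set[str]) -> List[Tuple[int, str]]:
    """Return (message number, uid) pairs whose uid has not been fetched yet.

    uidl_lines has the shape of the listing returned by poplib.POP3.uidl():
    one b"<msgnum> <uid>" entry per message.
    """
    fresh = []
    for line in uidl_lines:
        num, uid = line.decode("ascii").split(maxsplit=1)
        if uid not in seen_uids:
            fresh.append((int(num), uid))
    return fresh


# Hypothetical wiring against a real server (not executed here):
#   import poplib
#   conn = poplib.POP3_SSL("pop.example.org")    # placeholder host
#   conn.user(login); conn.pass_(password)
#   _, listing, _ = conn.uidl()                  # ids only, no message bodies
#   for num, uid in new_messages(listing, seen_uids):
#       _, body_lines, _ = conn.retr(num)        # download only new mail
#       seen_uids.add(uid)
#   conn.quit()
```

Because only the ID listing is transferred before the filter runs, previously fetched messages cost no download bandwidth, which matches the behavior described for the Mail Fetcher.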

5 Performance Evaluation

User-perceived latency is the sum of the latencies of the network, the server, and the client. Network and server latencies in turn depend on dynamic factors (e.g., network traffic and server load). We cannot simulate these dynamic factors in our experiments, as there are no correlations or dependencies between them. Client latency depends on the browser program, the client CPU speed, and the other programs running on the same machine. We keep the client latency constant by running no other program on the client machine and by using a single Web browser, in our case Netscape, to render the pages. Against this background, we assume that in a real-world scenario users would typically access their accounts from home and from office/school. We therefore evaluated PACE in two environments:

1. Experimental setup 1 (Home): Tomcat running on Windows XP on a 1.6 GHz Pentium 4 CPU with 512 MB of RAM, connected to the Internet through the Comcast cable Internet service. Two clients were used to measure the latency: one on the same machine that runs PACE, and another on Windows 98 connected to the PACE machine through a wireless LAN.

2. Experimental setup 2 (School): Tomcat running on Windows XP, connected to the Internet through the LAN on the WSU campus. Two clients were used to measure the latency in this setup too: one on the same machine that runs PACE, and another on a Linux machine within the WSU campus, connected to the Internet through the WSU campus LAN.

In the real world, PACE would be hosted by ISPs on high-speed clusters with fast disk access, such as SCSI disks with RAID, which would significantly improve the processing time of a JSP page. Moreover, these clusters would be connected to the Internet directly. Our experimental setup, in contrast, uses a personal computer, and the connection to the Internet is through the local LAN. The traffic to the server competes with other traffic on the LAN, over which we have little control, especially on the WSU campus LAN. The templates are stored on the hard disk, which is accessed every time a response is generated for a client request. Since the host is a personal computer, it uses IDE hard disks, which are not particularly fast. Despite these shortcomings, our experimental results show a good reduction in user-perceived latency when the pages are generated at PACE rather than at the origin server.

5.1 Evaluation of User Perceived Latency

For our evaluation we selected the Wayne State University (WSU) Web email service, Hotmail, Yahoo, and Rediff mail [25]. The WSU mail server is located on the WSU campus itself. The Yahoo servers on which we ran the experiments are located in California; we were unable to determine the location of the Hotmail servers. We also selected Rediff mail, another free email portal, whose server is located in India. PACE should be especially helpful in reducing cross-continental Internet traffic if all the emails are pre-fetched and the dynamic pages are generated at edges close to the client.


We evaluated our model by sending the same emails, with cc, to the various accounts. Since WSU enables users to retrieve emails through the POP protocol, our Mail Fetcher pre-fetches the emails from the WSU account. The user-perceived latency of fetching the same emails from the various accounts is measured and compared with that of the PACE account. Due to the dynamics of the Web, it is very hard to pinpoint the factors that affect the latency at any point in time. To overcome this issue, for each measurement we accessed all the Web servers one after the other in round-robin fashion for a particular dynamic email page. We discarded the occasional large download time for a page from any origin server. Overall, we find that the user-perceived latency for a particular origin email server is consistent, varying with only a small deviation across multiple invocations of the same page within a given period, although the latency for the same page differs at other points in time. We find no significant time difference between the first invocation and the second or subsequent invocations of the same page; in many cases a later invocation is slower than the first, probably due to network congestion or server load at that time. From this we can reasonably assume that no cache is currently operating in between to help generate these dynamic pages. The content of the dynamic pages generated at PACE to display emails was designed for two different customers: the email service provider and the user.

5.1.1 Email Service Provider

We studied a typical email page of Hotmail and Yahoo and find that Yahoo adds 40 KB of data along with the email and Hotmail adds about 18 KB of data. This is the data (which we call basic template) which is embedded along with the email itself and come as one single page. Along with this there are other objects which are retrieved by separate GET methods from the client. Javascripts make a set of objects which the browser downloads to run them on the browser. Every time the same page is invoked different advertisement images are downloaded. We cached one such instance and we assume that the effect of changing images for every loading of the page has negligible affect on latency. The emails which are pre-fetched are plugged in along with the templates and images/Javascipts stored at PACE and a dynamic page is generated for the client. The size of objects that are cached at PACE for Yahoo was around 120KB. This is apart from the basic template for email page. For Hotmail we cached around 56KB of objects over and above the basic template. These 120KB and 56KB could already be cached as on today as session ID’s are not required to fetch them. We do not think that basic templates are cached anywhere at the moment. We studied the user-perceived latency for each of these pages generated at PACE and those of retrieving the same from the origin servers. We measured the user-perceived latency as seen by the client to access the page generated by PACE and the page generated by origin server at a given interval of time. We cached two templates: Hotmail template and Yahoo template with associated objects in both cases. We tested for two clients with home 1 setup; (1) Client and PACE run on the same machine. Results are as shown in Figure 6. (2) Client and PACE run on different machines. Results as shown in Figure 7. 
From the results in Figures 6 and 7 we see that, despite using a personal computer as the Web server, user-perceived latency is reduced for the pages generated at PACE using the Yahoo and Hotmail templates. We noted earlier that most emails are smaller than 6 KB, to which 40-140 KB of template and object data is added. Hence, taking 100 KB as the total size of most dynamic pages, we see an improvement in user-perceived latency of 59% and 66% from caching Hotmail objects, and of 5% and 23% from caching Yahoo objects.

5.1.2 User

The user is interested in seeing only his or her emails, not the advertisements that come along with them. The user-centric evaluation was carried out by removing all this additional data and generating the page with only the email in it.

¹ Due to space constraints we have not shown the results for the School setup here.

[Figure 6: two line plots. (a) With Hotmail template, comparing Hotmail-Template and Hotmail; (b) With Yahoo template, comparing Yahoo-Template and Yahoo. x-axis: Email size (Bytes), 1k to 800k; y-axis: Time in seconds.]

Figure 6: User-perceived latency for the page created by PACE using origin server templates and for the page retrieved from the origin server, with the client on the same machine as PACE. Both run on Windows XP, connected to the Internet through Comcast cable Internet service.

[Figure 7: two line plots. (a) With Hotmail template, comparing Hotmail-Template and Hotmail; (b) With Yahoo template, comparing Yahoo template and Yahoo. x-axis: Email size (Bytes), 1k to 800k; y-axis: Time in seconds.]

Figure 7: User-perceived latency for the page created by PACE using origin server templates and for the page retrieved from the origin server, with the client on Windows 98 and PACE on Windows XP. The client is connected to PACE through a wireless LAN, and the LAN is connected to the Internet through Comcast cable Internet service.


We carried out the experiments in both experimental setups, with two clients in each setup. In the first case the client and PACE run on the same machine, and in the second case they run on different machines. Through the client we accessed the same email page from all the accounts with the various email service providers and with PACE. Figure 8 shows the results for the two clients in the School setup and Figure 9 shows the results for the two clients in the Home setup. The results show that, for the same email, the user-perceived latency differs across email accounts. PACE gave the best performance and Hotmail the worst. Though the performance of WSU webmail was very close to that of PACE, the time taken grows at a higher rate with email size for both WSU webmail and Hotmail. Though Yahoo plugs in more data for every email, the performance of fetching emails from Yahoo was the best during our experimentation. Overall, we see an improvement in user-perceived latency of between 28% and 93% for a dynamic page size of 100 KB.
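The measurement procedure used throughout this section (visiting the servers in round-robin order, discarding an occasional large download time, and comparing the remaining latencies) can be sketched as follows. This is an illustration only: the function names, the outlier rule (dropping samples above three times the median), and the relative-improvement formula are assumptions, since the text does not specify them.

```python
import statistics
import time


def measure_round_robin(fetchers, rounds, clock=time.perf_counter):
    """Time each fetcher once per round, visiting them in round-robin order.

    `fetchers` maps a server name to a zero-argument callable that downloads
    the email page from that server. Returns {name: [latency, ...]}.
    """
    samples = {name: [] for name in fetchers}
    for _ in range(rounds):
        for name, fetch in fetchers.items():
            start = clock()
            fetch()
            samples[name].append(clock() - start)
    return samples


def drop_outliers(latencies, factor=3.0):
    """Discard occasional large samples (assumed rule: > factor * median)."""
    median = statistics.median(latencies)
    return [t for t in latencies if t <= factor * median]


def improvement(origin_latency, pace_latency):
    """Relative latency reduction of PACE over an origin server, in percent."""
    return 100.0 * (origin_latency - pace_latency) / origin_latency


# Usage with made-up latencies (seconds); these are not measured values.
origin = statistics.mean(drop_outliers([2.0, 2.1, 1.9, 9.5]))
pace = statistics.mean(drop_outliers([0.5, 0.5, 0.5]))
print(f"improvement: {improvement(origin, pace):.1f}%")
```

Interleaving the servers within each round means that a transient change in network conditions affects all of them at roughly the same time, instead of biasing a single server's samples.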

[Figure 8: two line plots, (a) and (b), each comparing Pace, Wayne, Yahoo, Rediff, and Hotmail. x-axis: Email size (Bytes), 1k to 700k; y-axis: Time in seconds.]

Figure 8: (a) User-perceived latency with the client on Linux and PACE on Windows XP, with the client accessing PACE through the WSU campus LAN. (b) User-perceived latency with the client and PACE on the same Windows XP machine, connected to the Internet through the WSU campus LAN.

[Figure 9: two line plots, (a) and (b), each comparing Pace, Wayne, Yahoo, Rediff, and Hotmail. x-axis: Email size (Bytes), 1k to 800k; y-axis: Time in seconds.]

Figure 9: (a) User-perceived latency with the client on Windows 98 and PACE on Windows XP, with the client accessing PACE through a wireless LAN and the LAN connected to Comcast cable service. (b) User-perceived latency with the client and PACE on the same Windows XP machine, connected to the Internet through Comcast cable service.

5.2 Bandwidth Evaluation

By caching the templates at the edge proxy and generating the dynamic page at the edge, bandwidth is used only for retrieving the emails themselves. The bandwidth used to retrieve expired basic templates and objects can be amortized across all the users served from that edge proxy. We can broadly classify the cacheable components into two types. One is the basic template itself, which is used to generate the email page; the other is the embedded objects referenced by 'src' attributes in the basic template, which are retrieved through separate GET requests. Figure 10 shows the bandwidth savings that can be achieved for different email sizes, both for the basic template alone and for all objects, that is, the basic template plus the embedded objects. As can be expected, smaller emails achieve higher bandwidth savings.

In this study we have considered the email page alone. However, before the email page is retrieved, a main page listing all the emails is generated. This page contains the links to the emails. Different email providers use different header fields of the emails as links: for example, Hotmail uses the subject of the email as the link, while Yahoo uses the sender's name. Our approach of caching the template applies to this page too; however, we have shown the bandwidth savings for the email page alone.
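The trend that smaller emails yield higher savings follows from simple arithmetic. The sketch below assumes that the saving is the fraction of the page's bytes served from the edge cache, cached / (cached + email); this formula and the function name are assumptions, not taken from the paper, while the template and object sizes are the approximate figures quoted in the Email Service Provider evaluation.

```python
KB = 1024

# Approximate sizes quoted earlier (basic template, other cached objects).
CACHED_SIZES = {
    "Hotmail": {"basic_template": 18 * KB, "objects": 56 * KB},
    "Yahoo": {"basic_template": 40 * KB, "objects": 120 * KB},
}


def bandwidth_savings(cached_bytes, email_bytes):
    """Fraction of page bytes that need not cross the WAN (assumed formula)."""
    return cached_bytes / (cached_bytes + email_bytes)


for provider, sizes in CACHED_SIZES.items():
    all_objects = sizes["basic_template"] + sizes["objects"]
    for email_kb in (1, 10, 100, 700):
        email = email_kb * KB
        basic = bandwidth_savings(sizes["basic_template"], email)
        total = bandwidth_savings(all_objects, email)
        print(f"{provider:8s} email={email_kb:4d}k "
              f"basic={basic:.2f} all={total:.2f}")
```

Under this assumed formula the saving tends toward 1 as the email shrinks and toward 0 as the email grows, for both the basic template alone and all objects.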

[Figure 10: two plots, (a) Hotmail template and (b) Yahoo template, each showing the series Basic template and All objects. x-axis: Email size (Bytes), 1k to 700k; y-axis: Bandwidth savings, 0 to 1.]

Figure 10: Bandwidth savings by caching the templates at the edge proxy and generating the page from the edge.

6 Related Work and Discussion

Our work on PACE builds upon a large body of related work in the general area of Web caching. Instead of describing each effort separately, we group related efforts into three broad categories: dynamic content generation and delivery, edge services framework, and pre-fetching.

Dynamic content caching. Caching dynamic pages is a challenging area in which much research is concentrated today, including server-side efforts [7, 8, 31], proxy-side efforts [6, 9, 20, 21, 26], and, recently, client-side efforts [24]. For example, the DUP algorithm [7, 8] dramatically improves performance and cache hit rates by maintaining data dependency information between cached objects and the underlying data. As soon as the system becomes aware of changes in the underlying data by way of triggers, graph traversal algorithms are applied to determine which cached objects are affected by the change. The constraint of this system is that the data storage source and the cache are highly integrated, and it is implemented at the source end of the cache. Adopting the same approach at the network edge requires more complex integration with the origin server, and the implementation becomes vendor and application specific. Moreover, it works for dynamic pages that are not personalized; building a session into the dynamic page generation is not addressed.

Another technique, used by Akamai [1] and the CONCA cache [26], is to assemble dynamic content on the edge servers using Edge Side Include (ESI) [29] technology. ESI lets a content provider break a dynamic page into fragments with independent cacheability properties. These fragments are maintained as separate objects in the edge server's cache and are dynamically assembled into Web pages in response to user requests. This again works for pages that are not personalized and for which sessions are not maintained. Whenever authentication is required, Akamai [1] edge servers do not cache any data and the entire request is forwarded to the origin server.
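The dependency-tracking idea attributed to DUP [7, 8] above can be illustrated with a small graph traversal. This is a generic sketch of the idea, not the algorithm as published: the graph layout, the node names, and the `affected_objects` function are hypothetical.

```python
from collections import deque


def affected_objects(dependents, changed, cached):
    """Return the cached objects reachable from the changed data items.

    `dependents` maps a node to the nodes that are derived from it.
    """
    seen = set(changed)
    queue = deque(changed)
    while queue:
        node = queue.popleft()
        for nxt in dependents.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen & set(cached)


# Hypothetical graph: a database row feeds a fragment, which feeds two pages.
graph = {
    "row:scores": ["frag:scoreboard"],
    "frag:scoreboard": ["page:home", "page:results"],
    "row:weather": ["page:home"],
}
cached_pages = {"frag:scoreboard", "page:home", "page:results", "page:about"}
stale = affected_objects(graph, ["row:weather"], cached_pages)
print(sorted(stale))
```

A breadth-first traversal is used here for simplicity; any traversal that visits every node reachable from the changed data would identify the same set of stale objects.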
Client Side Include [24] assembles the ESI fragments at the client end using the browser. JavaScript or ActiveX objects are used to run applications in the browser that fetch the needed fragments from the origin server. This reduces latency in the “last mile”, which is especially useful for dial-up clients with slow connections. However, personalized dynamic pages are not addressed there either. In this paper we concentrate specifically on the generation and delivery of personalized dynamic email pages. To our knowledge, this is the first effort to move personalized dynamic page generation to the edge.

Edge services framework. The email spam filtering and management proposed in this paper is motivated by the IETF's Open Pluggable Edge Services (OPES) framework [28] and by content adaptation [13, 14]. Content adaptation allows the system to inject additional functionality along the data path between client and server. OPES proposes an environment for providing value-added services to end users and/or content providers. Providing services at the edges enables incremental deployment and amortization of operating costs, thus benefiting both the client and the provider [5, 13, 19]. Although content adaptation and filtering have been proposed for a while, PACE is the first real system that supports email spam filtering.

Pre-fetching. In this paper we pre-fetch emails from the origin server and use the pre-fetched emails to generate dynamic pages at the edges. Pre-fetching has been demonstrated to be an efficient mechanism for reducing Web access latency. As noted by several researchers [11, 12, 18, 23], there are three distinct pre-fetching scenarios: pre-fetching between clients and servers, between clients and proxies, and between proxies and servers [18]. We use pre-fetching between proxies and servers. Personalized pre-fetching was introduced in [30], in which HTML pages are pre-fetched based on keywords extracted from the pages most often visited by the user. We extend this notion to pre-fetch personalized emails. The effectiveness of pre-fetching is measured by the hit rate of the pre-fetched documents. Unlike pre-fetching in other scenarios, we expect our hit rate to be almost 100%, without any bandwidth waste, when implemented in a commercial environment, as every person would like to view his or her latest emails at all times.
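As a small illustration of the metric mentioned above, the sketch below computes the hit rate of a set of pre-fetched documents and the bytes wasted on documents that were never requested. The definitions and names are assumptions chosen for illustration; the paper does not give formulas.

```python
def prefetch_stats(prefetched, requested):
    """Return (hit_rate, wasted_bytes) for a pre-fetch run.

    `prefetched` maps a document id to its size in bytes; `requested` is the
    collection of ids the user actually asked for afterwards.
    """
    if not prefetched:
        return 0.0, 0
    requested = set(requested)
    hits = [doc for doc in prefetched if doc in requested]
    wasted = sum(size for doc, size in prefetched.items()
                 if doc not in requested)
    return len(hits) / len(prefetched), wasted


# Made-up example: three emails pre-fetched, two of them later read.
rate, wasted = prefetch_stats({"m1": 4000, "m2": 6000, "m3": 2000}, ["m1", "m3"])
print(rate, wasted)
```

If every pre-fetched email is eventually read, as the text expects for email, the hit rate under this definition is 1.0 and no bytes are wasted.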

7 Summary and Future Work

Personalized dynamic pages generated to display emails make up an important part of Internet traffic today, making it necessary to find efficient solutions that improve user-perceived latency and reduce the traffic resulting from emails on the network. In this paper we have presented a new technique in which emails are pre-fetched at the edge servers and dynamic pages are generated at the edges using information from the origin server. By developing a prototype, we have shown that user-perceived latency is reduced when this method is adopted. Centralized spam management at the proxy will help reduce spam traffic, which today accounts for more than 40% of email traffic. We have also presented a comparative study of user-perceived latencies across several popular Web email services. We have shown that this technique reduces latency and saves bandwidth. Our future work includes developing AI-based spam management [30]. We would also like to integrate this work with our CONCA proxy and to evaluate the model in a more distributed environment that includes the ISP network, placing the proxy at the ISP and caching all types of templates.

References

[1] Akamai Technologies Inc., http://www.akamai.com/.
[2] America Online steps up spam fight by launching litigation offensive against spammers, Apr. 2003, http://media.aoltimewarner.com/media/newmedia/cb_press_view.cfm?release_num=55253129.
[3] Apache Jakarta Project, http://jakarta.apache.org.
[4] A. Awadallah and M. Rosenblum. The vMatrix: A network of virtual machine monitors for dynamic content distribution. Proc. of the 7th International Workshop on Web Caching and Content Distribution (WCW'02), Aug. 2002.
[5] A. Beck and M. Hofmann. Enabling the internet to deliver content-oriented services. Proc. of the 6th International Workshop on Web Caching and Content Distribution (WCW'01), June 2001, http://www.cs.bu.edu/techreports/2001-017-wcw01-proceedings/107_beck.pdf.
[6] P. Cao, J. Zhang, and K. Beach. Active cache: Caching dynamic contents on the web. Proc. of IFIP Int'l Conf. Dist. Sys. Platforms and Open Dist. Processing, pp. 373-388, 1998, http://www.cs.wisc.edu/~cao/papers/active-cache.ps.
[7] J. Challenger, A. Iyengar, and P. Dantzig. A scalable system for consistently caching dynamic web data. Proc. of IEEE Conference on Computer Communications (INFOCOM'99), Mar. 1999.
[8] J. Challenger, A. Iyengar, K. Witting, C. Ferstat, and P. Reed. A publishing system for efficiently creating dynamic web content. Proc. of IEEE Conference on Computer Communications (INFOCOM'00), Mar. 2000.
[9] F. Douglis, A. Haro, and M. Rabinovich. HPP: HTML macro-pre-processing to support dynamic document caching. Proc. of the 1st USENIX Symposium on Internet Technologies and Systems (USITS'97), pp. 83-94, Dec. 1997, http://www.douglis.org/fred/work/papers/hpp.pdf.
[10] R. Doyle, J. Chase, S. Gadde, and A. Vahdat. The trickle-down effect: Web caching and server request distribution. Proc. of the 6th International Workshop on Web Caching and Content Distribution (WCW'01), June 2001.
[11] D. Duchamp. Prefetching hyperlinks. Proc. of the 2nd USENIX Symposium on Internet Technologies and Systems (USITS'99), Oct. 1999.
[12] L. Fan, P. Cao, and Q. Jacobson. Web prefetching between low-bandwidth clients and proxies: Potential and performance. Proceedings of ACM SIGMETRICS'99, May 1999, http://www.cs.wisc.edu/~cao/papers/prepush.ps.gz.
[13] A. Fox, S. Gribble, Y. Chawathe, and E. A. Brewer. Adapting to Network and Client Variation Using Infrastructural Proxies: Lessons and Perspectives. IEEE Personal Communication, Aug. 1998, http://www.cs.washington.edu/homes/gribble/papers/adapt.ps.zip.
[14] X. Fu, W. Shi, A. Akkerman, and V. Karamcheti. CANS: Composable, Adaptive Network Services Infrastructure. Proc. of the 3rd USENIX Symposium on Internet Technologies and Systems (USITS'01), pp. 135-146, Mar. 2001.
[15] J. Gecsei. Determining hit ratios for multilevel hierarchies. IBM J. Res. Dev, July 1974.
[16] IBM Corp. Websphere platform, http://www.ibm.com/websphere.
[17] J. Krim. Spam's cost to business escalates, Mar. 2003, http://www.washingtonpost.com/ac2/wp-dyn/A17754-2003Mar12.
[18] T. M. Kroeger, D. E. Long, and J. C. Mogul. Exploring the bounds of web latency reduction from caching and prefetching. Proc. of the 1st USENIX Symposium on Internet Technologies and Systems (USITS'97), Dec. 1997, http://www.cse.ucsc.edu/~tmk/publications/ideal.
[19] V. Mastoli, V. Desai, and W. Shi. SEE: a service execution environment for edge services. Proceedings of the 3rd IEEE Workshop on Internet Applications (WIAPP'03), June 2003.
[20] M. Mikhailov and C. E. Wills. Change and relationship-driven content caching, distribution and assembly. Tech. Rep. WPI-CS-TR-01-03, Computer Science Department, WPI, Mar. 2001, http://www.cs.wpi.edu/~cew/papers/tr01-03.pdf.
[21] A. Myers, J. Chuang, U. Hengartner, Y. Xie, W. Zhang, and H. Zhang. A secure and publisher-centric web caching infrastructure. Proc. of IEEE Conference on Computer Communications (INFOCOM'01), Apr. 2001.
[22] MySQL Project, http://www.mysql.org.
[23] J. Pitkow and P. Pirolli. Mining longest repeating subsequences to predict world wide web surfing. Proc. of the 2nd USENIX Symposium on Internet Technologies and Systems (USITS'99), Oct. 1999.
[24] M. Rabinovich, Z. Xiao, F. Douglis, and C. Kalmanek. Moving edge side includes to the real edge - the clients. Proc. of the 4th USENIX Symposium on Internet Technologies and Systems (USITS'03), Mar. 2003.
[25] Rediff Mail, http://www.rediff.com.
[26] W. Shi and V. Karamcheti. CONCA: An architecture for consistent nomadic content access. Workshop on Cache, Coherence, and Consistency (WC3'01), June 2001.
[27] W. Shi, R. Wright, E. Collins, and V. Karamcheti. Workload characterization of a personalized web site - and its implication on dynamic content caching. Proc. of the 7th International Workshop on Web Caching and Content Distribution (WCW'02), pp. 1-16, Aug. 2002, http://www.cs.wayne.edu/~weisong/papers/wcw02.pdf.
[28] G. Tomlinson, R. Chen, and M. Hofmann. A model for open pluggable edge services, work in progress, Nov. 2001, http://www.ietf.org/internet-drafts/draft-tomlinson-opes-model-00.txt.
[29] M. Tsimelzon, B. Weihl, and L. Jacobs. ESI language specification 1.0, 2000, http://www.esi.org.
[30] C. Xu and T. Ibrahim. Keyword-based semantic prefetching in internet news services. Accepted by IEEE Transactions on Knowledge and Data Engineering, 2003.
[31] H. Zhu and T. Yang. Class-based cache management for dynamic web content. Proc. of IEEE Conference on Computer Communications (INFOCOM'01), Apr. 2001, http://www.cs.ucsb.edu/projects/swala/cache2001.ps.