Caching Dynamic Data for E-Business Applications*

Mehregan Mahdavi¹, Boualem Benatallah¹, and Fethi Rabhi²

¹ School of Computer Science and Engineering, The University of New South Wales, Sydney, NSW 2052, Australia
² School of Information Systems, Technology and Management, The University of New South Wales, Sydney, NSW 2052, Australia

* This research has been partly supported by an Australian Research Council (ARC) Discovery grant number DP0211207.
Abstract. This paper is concerned with business portals, one of the most rapidly growing classes of web applications. It investigates the problem of providing fast response times in such applications, particularly through the use of caching techniques. It discusses issues related to caching providers' response messages at business portals and proposes a caching strategy based on collaboration between the portal and its providers.

1 Introduction

It is now common for businesses to offer a web site through which customers can search for and buy products or services on-line. Such businesses are referred to as product or service providers. Due to the large number of existing providers, business portals have emerged as Internet-based applications that enable access to different providers through a single web interface. The idea is to save time and effort for customers, who only need to access the portal's web interface instead of having to navigate through many provider web sites.

Fig. 1 shows the general architecture of a business portal. Each provider may have a membership relationship with a number of portals, and each provider may in turn have a number of sub-providers. Each provider stores its own catalog, and the integrated catalog represents the aggregation of all providers' catalogs. The portal deals with a request from a customer by sending requests to the appropriate providers. Responses from the providers are sent back to the portal, processed, and a final response is returned to the customer. Emerging technologies such as web services promise to take portal-enabled applications a step further [3].

However, providing a fast response time is one of the most critical issues in such applications. Network traffic between the portal and individual providers, server workload, and failures at provider sites are some of the factors contributing to slow response times. Previous research has shown that abandonment of web sites increases dramatically as response time increases [20], resulting in a loss of revenue for businesses. In general, providing a fast response time is one of the critical issues that today's e-business applications must deal with. Caching is one of the key techniques that promise to overcome some of these performance issues.


Fig. 1. The Architecture of Business Portals: customers access the business portal, which maintains an integrated catalog and meta-data and interacts with the catalogs of the individual providers.

In particular, caching response messages (which we also refer to as dynamic objects or, for short, objects¹) gives portals the ability to respond to some customer requests locally. As a result, the response time experienced by the customer is improved, customer satisfaction is increased, and better revenue is generated for the portal and the providers. In addition, network traffic and the workload on the providers' servers are considerably reduced. This in turn improves scalability and reduces hardware costs.

The best candidates for caching are objects which are requested frequently and not changed very often. Products such as Oracle Web Cache, IBM WebSphere Edge Server, and Dynamai from Persistence Software enable system administrators to specify caching policies. Server logs (i.e., access logs and database update logs) are also used to identify objects to be cached. Caching dynamic objects at business portals, however, introduces new problems. Since the portal may be dealing with a large number of providers, determining such objects by an administrator or by processing logs is impractical. On one hand, an administrator cannot identify candidate objects in a highly dynamic environment where providers may join and leave the portal frequently. On the other hand, keeping and processing access logs at the portal is impractical due to high storage space and processing time requirements. Moreover, database update logs are not normally accessible to the portal.

We propose a caching strategy based on collaboration between the portal and its providers. Providers trace their logs, extract information to identify good candidates for caching, and notify the portal. Each provider associates a score with each response message which represents the usefulness of caching that object. Our strategy deals with the problem of inconsistencies between scores from different providers: the portal can trace the performance of the cache and dynamically regulate the scores coming from different providers. The major contributions of our work are: (i) a collaborative scheme for caching dynamic objects at remote servers (i.e., business portals) based on a score given by providers, and (ii) a regulation technique for monitoring and adjusting caching scores.

The remainder of this paper is organised as follows. Section 2 presents our caching scheme. Section 3 discusses implementation aspects. Related work and conclusions are presented in Section 4.

¹ A dynamic object is a data item requested by the portal, such as the result of a database query, the output of a JSP page, or an XML or SOAP response message.

2 Caching at Web Portals

Whether to cache a particular object at the portal depends on the available storage space, the response time (QoS) requirements, and the access and update frequency of objects [11]. As mentioned earlier, it is not realistic for the portal itself to identify which objects to cache. As owners of the objects, providers are better placed to decide which objects should be selected. Following this approach, our strategy allocates to each object a caching score determined mainly from a score, called cache worthiness, given by providers to response messages. The caching score also takes into account other parameters such as the recency of objects, the importance of providers, and the correlation between objects; these parameters, however, are not the focus of this paper.

Cache worthiness is a score assigned by the content provider to each object. Its value, which is in the range [0, 1], represents the usefulness of caching the object at a portal. At the two extremes, a value of 0 indicates that the object cannot be cached at the portal, while a value of 1 indicates that the object must be cached. We assume that each provider defines these scores independently, based on its own policies and priorities.

The rest of this section explains the caching strategy in more detail. The first sub-section describes the meta-data used to support the caching strategy, and the second explains how providers can use their logs to determine object scores and how the portal regulates these scores.

2.1 Meta-Data Support

The caching strategy is supported by two major tables: the cache look-up table, used by the portal to keep track of the cached objects, and the cache validation table, used by each provider to validate the objects cached at the portal(s).

An entry in the cache look-up table mainly consists of a Request-Instance (RI) and a Generation Time-Stamp (GTS). An RI represents the request from which the provider generates and returns a response; it contains the name of the requested service operation plus the values of the input parameters. For example, a web service operation can be invoked through a URI which includes the name of a Servlet and the input values; in this case the RI is represented by that URI. The GTS records the time the response was generated by the content provider and is used for validation when a cache hit occurs.

The cache validation table keeps track of the responses sent to the portal and is used to validate the freshness of the cached objects. For each cached object, it contains the RI and the GTS, which correspond to a previous request and the time its response was generated, respectively. When a hit is detected at the portal using the cache look-up table, a validation request message is sent to the relevant provider. The message includes the corresponding RI and GTS. The provider checks the freshness of the object by probing the cache validation table to find the entry for the request instance. If the cache validation table does not contain an entry for the request instance, the corresponding object is considered no longer fresh due to changes in the database (it is also possible that the entry was removed for other reasons, such as space limitations), and the object is regenerated. After the object is sent back, the portal responds to the customer request, and a copy of the object may be cached at the portal for future requests.


If an entry is found, the GTS in the message is compared with the corresponding field in the cache validation table; if the two are equal, the provider confirms that the cached copy is still valid.

Changes in the back-end database invalidate entries in the cache validation table: if a change to the content of the database affects the freshness of an object, the corresponding entry in the provider's cache validation table is removed. Solutions for detecting changes in the back-end database and invalidating the relevant entry/object are provided in [2,17,6].

The cache look-up table reflects the content of the portal's cache. When an object is cached, the relevant entry is created in the table; similarly, when an object is removed from the cache, the relevant entry is removed from the table. The cache validation table, however, may become inconsistent with the cache in two cases: (i) An object is cached at a portal but there is no entry for it in the cache validation table at the provider. In this case, when the portal sends a validation request message, the provider cannot check the freshness of the object and has to regenerate it, even though the cached copy may still be fresh. (ii) There is an entry in the cache validation table but the corresponding object is no longer cached anywhere. In this case, the provider wastes storage space keeping such entries in the cache validation table.

To provide an effective caching strategy, the cache validation tables should closely reflect the content of the caches at the portal(s). For this reason, we assume that an entry is created in the cache validation table only when the cached object has a cache worthiness above a certain threshold (τ). We use an LRU replacement strategy to free space in the table when it becomes full; more efficient methods are under investigation.
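
To make the meta-data support concrete, the following is a minimal sketch in Java (the language of the implementation in Section 3) of the two tables and the freshness check. It assumes simple in-memory maps keyed by the RI; the class and method names (CacheLookupTable, CacheValidationTable, isFresh, and so on) are illustrative and not taken from the paper.

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Portal side: one entry per cached object, keyed by the Request-Instance (RI).
    class CachedObject {
        final String payload;   // the cached response message (e.g., an XML/SOAP body)
        final long gts;         // Generation Time-Stamp assigned by the provider
        CachedObject(String payload, long gts) { this.payload = payload; this.gts = gts; }
    }

    class CacheLookupTable {
        private final Map<String, CachedObject> entries = new LinkedHashMap<String, CachedObject>();
        CachedObject lookup(String ri) { return entries.get(ri); }   // non-null means a cache hit
        void put(String ri, CachedObject obj) { entries.put(ri, obj); }
        void remove(String ri) { entries.remove(ri); }                // keeps the table consistent with the cache
    }

    // Provider side: remembers the GTS of responses whose cache worthiness exceeded the
    // threshold tau, evicts the least recently used entry when full, and drops entries
    // when the back-end data changes.
    class CacheValidationTable {
        private final int capacity;
        private final LinkedHashMap<String, Long> gtsByRi;

        CacheValidationTable(int capacity) {
            this.capacity = capacity;
            this.gtsByRi = new LinkedHashMap<String, Long>(16, 0.75f, true) {
                @Override protected boolean removeEldestEntry(Map.Entry<String, Long> eldest) {
                    return size() > CacheValidationTable.this.capacity;
                }
            };
        }

        void record(String ri, long gts) { gtsByRi.put(ri, gts); }
        void invalidate(String ri) { gtsByRi.remove(ri); }            // called when a database change affects the object

        // Answers a validation request message carrying (RI, GTS) from the portal.
        boolean isFresh(String ri, long gts) {
            Long stored = gtsByRi.get(ri);
            return stored != null && stored.longValue() == gts;       // no entry means the object must be regenerated
        }
    }

The access-ordered LinkedHashMap gives the LRU replacement behaviour described above for the cache validation table.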

2.2 Heterogeneous Cache Policy Support

This section discusses inconsistencies between the scores given by different providers and describes a mechanism to regulate these scores. First, an object scoring technique that uses server logs is described. Then, we discuss how inconsistencies can arise even when all providers use the same scoring strategy. Finally, our solution for regulating the scores is described.

Calculating Cache Worthiness. The best candidates for caching are objects that are (i) requested frequently, (ii) not changed frequently, and (iii) expensive to compute or deliver [19]. For other objects, the caching overheads may outweigh the caching benefits. Server logs at provider sites can be used to calculate a score for cache worthiness. In the rest of the paper we use O_{i,m} to denote an object i from a provider m. We identify four important parameters:

• The access frequency, A(O_{i,m}, k), indicates how often O_{i,m} is accessed through a portal k in a specified time period. The access rate AR(O_{i,m}, k) = A(O_{i,m}, k) / Σ_r A(O_{r,m}, k) represents the popularity of the object. Values close to 1 represent more popular objects and values close to 0 represent less popular objects. AR(O_{i,m}, k) is calculated by processing the web application server's access log.

• The update frequency, U(O_{i,m}), indicates the number of changes to the back-end database which invalidate O_{i,m}. The update rate is UR(O_{i,m}) = U(O_{i,m}) / Σ_r U(O_{r,m}). Values close to 1 represent more frequently changed objects and values close to 0 represent less frequently changed objects. UR(O_{i,m}) is calculated by processing the database update log.

• The computation cost, C(O_{i,m}), indicates the cost of generating O_{i,m} in terms of database access. It is calculated by processing the database request/delivery log and measuring the time elapsed between the request and the delivery of the result from the database. CR(O_{i,m}) = C(O_{i,m}) / (K1 + C(O_{i,m})) is the computation cost rate. Values close to 1 represent objects that are more expensive to generate from the database and values close to 0 represent less expensive ones. K1 ∈ R⁺ is a constant required by the transform function and is set by the system administrator; it can simply be set to C(O_{i,m}).

• The delivery cost is represented by the size of the object: larger objects are more expensive to deliver in terms of time and network bandwidth consumption. SR(O_{i,m}) = Size(O_{i,m}) / (K2 + Size(O_{i,m})) maps the size of O_{i,m} to a rate in [0, 1]. Values close to 1 represent larger objects and values close to 0 represent smaller ones. K2 ∈ R⁺ is a constant required by the transform function and is set by the system administrator; it can simply be set to Size(O_{i,m}).

The cache worthiness score is then calculated from these parameters as follows:

W(O_{i,m}, k) = ν1·AR(O_{i,m}, k) + ν2·(1 − UR(O_{i,m})) + ν3·CR(O_{i,m}) + ν4·SR(O_{i,m}),
where 0 ≤ ν1, ν2, ν3, ν4 ≤ 1 and ν1 + ν2 + ν3 + ν4 = 1.

The value W(O_{i,m}, k) ∈ [0, 1] represents the cache worthiness and indicates how useful it is to cache O_{i,m} at portal k. The weights ν1, ν2, ν3, and ν4 allow the effect of each term on the score to be tuned.
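
As an illustration, the following Java sketch shows how a provider might compute W(O_{i,m}, k) from raw log statistics using the formula above. The class and parameter names are illustrative, and the weights ν1, ..., ν4 and the constants K1 and K2 are assumed to be configured by the provider's administrator.

    // Sketch of the provider-side cache worthiness computation.
    class CacheWorthiness {
        final double nu1, nu2, nu3, nu4;   // weights, each in [0, 1], summing to 1
        final double k1, k2;               // positive constants of the transform functions

        CacheWorthiness(double nu1, double nu2, double nu3, double nu4, double k1, double k2) {
            this.nu1 = nu1; this.nu2 = nu2; this.nu3 = nu3; this.nu4 = nu4;
            this.k1 = k1; this.k2 = k2;
        }

        // accesses      : A(O_{i,m}, k), accesses of this object through portal k
        // totalAccesses : sum over r of A(O_{r,m}, k)
        // updates       : U(O_{i,m}), invalidating changes on the back-end database
        // totalUpdates  : sum over r of U(O_{r,m})
        // cost          : C(O_{i,m}), time to generate the object from the database
        // size          : Size(O_{i,m}), size of the response message
        double score(long accesses, long totalAccesses,
                     long updates, long totalUpdates,
                     double cost, double size) {
            double ar = totalAccesses == 0 ? 0.0 : (double) accesses / totalAccesses;
            double ur = totalUpdates == 0 ? 0.0 : (double) updates / totalUpdates;
            double cr = cost / (k1 + cost);
            double sr = size / (k2 + size);
            return nu1 * ar + nu2 * (1.0 - ur) + nu3 * cr + nu4 * sr;   // W in [0, 1]
        }
    }

For example, with equal weights of 0.25, a frequently accessed, rarely updated, large, and expensive-to-generate object obtains a score close to 1 and is therefore a strong caching candidate.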

Regulating Cache Worthiness. Although all providers may use the same strategy to score their objects, the resulting scores may not be consistent. This is mainly because: (i) each provider uses a limited amount of log data to extract the required information, and this amount varies from one provider to another; (ii) each provider may use different weights for the terms in the above formula (i.e., ν1, ν2, ν3, and ν4); (iii) the value of CR(O_{i,m}) depends on the provider's hardware and software platform, workload, and so on; and (iv) providers may use other mechanisms altogether to score their objects.

To achieve an effective caching strategy, the portal should detect these inconsistencies and regulate the scores given by different providers. For this purpose, the portal uses a regulating factor λ(m) for each provider m. When the portal receives a cache worthiness score, it multiplies it by λ(m) and uses the result in the calculation of the overall caching score. This factor can be set by an administrator initially and is then adapted dynamically based on the performance of the cache.


Before explaining the process of dynamically adjusting λ(m), two terms must be defined: false hits and real hits. A hit which occurs at the portal when the object has already been invalidated is called a false hit. In contrast, a hit which occurs when the object is still fresh and can be served from the cache is called a real hit. False hits degrade performance and increase the overheads at both the portal and the provider sites without any benefit. These overheads include probing the cache look-up table and generating validation request messages at the portal, and probing the cache validation table at the provider.

The portal adapts λ(m) dynamically as follows:

• When the ratio of false hits (i.e., the number of false hits divided by the total number of hits) for a provider exceeds a threshold τ1, the portal decreases λ(m) for that provider: λ(m) ← θ·λ(m), where 0 < θ < 1.

• When the ratio of real hits (i.e., the number of real hits divided by the total number of hits) for a provider exceeds a threshold τ2, the portal increases λ(m): λ(m) ← ρ·λ(m), where ρ > 1.
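
The following sketch shows one way the portal could implement this adjustment. The paper does not specify the window over which the hit ratios are measured, so this sketch re-evaluates λ(m) after every "window" hits and then resets the counters; the thresholds τ1 and τ2 and the factors θ (< 1) and ρ (> 1) are assumed to be administrator-set.

    // Portal-side regulation of lambda(m) for one provider m.
    class ProviderRegulator {
        private double lambda;              // regulating factor lambda(m)
        private long falseHits, realHits;
        private final long window;          // assumed evaluation window (not specified in the paper)
        private final double tau1, tau2, theta, rho;

        ProviderRegulator(double initialLambda, long window,
                          double tau1, double tau2, double theta, double rho) {
            this.lambda = initialLambda;
            this.window = window;
            this.tau1 = tau1; this.tau2 = tau2; this.theta = theta; this.rho = rho;
        }

        // Called for every cache hit at the portal; a false hit is one whose object
        // turned out to be already invalidated at the provider.
        void recordHit(boolean objectWasFresh) {
            if (objectWasFresh) realHits++; else falseHits++;
            long total = falseHits + realHits;
            if (total < window) return;
            if ((double) falseHits / total > tau1) lambda *= theta;   // too many false hits: trust this provider less
            else if ((double) realHits / total > tau2) lambda *= rho; // mostly real hits: trust this provider more
            falseHits = 0; realHits = 0;
        }

        // The portal multiplies a provider's cache worthiness by lambda(m) before
        // using it in the overall caching score.
        double regulatedScore(double cacheWorthiness) { return lambda * cacheWorthiness; }
    }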

3 Implementation Aspects

In order to evaluate the performance of our caching scheme, we have implemented a business portal in a case study involving providers offering accommodation and car rental services. Fig. 2 shows the architecture of the system.

Fig. 2. The Architecture of the Implemented System: the portal's web components (HTML, Servlets, JSP), business components (Java classes and EJBs), SOAP connectors, registry, and portal database communicate via SOAP with the service components of accommodation and car rental providers, each backed by its own database accessed through JDBC.

Each provider offers the following operations: getHotels, getRoomDetails, bookRoom, and unBookRoom for accommodation providers, and getPickUpLocations, getVehicleDetails, bookVehicle, and unBookVehicle for car rental providers. To support caching, providers also supply an implicit operation which calculates the cache worthiness of a response message.

Once a provider registers its service with the portal, an entry for the provider is created in the registry at the portal. When the portal receives a request (e.g., a search request), it first accesses the registry to find the appropriate providers to process the request (e.g., based on whether the request is for accommodation or car rental, or on the location). For each such provider, a connector object is created to handle the request and response.


Connectors create SOAP requests, send them to the providers, and receive the corresponding SOAP responses (a minimal connector sketch is given at the end of this section). In this implementation, the BEA WebLogic application server and the Oracle 8i DBMS are used to implement the portal service, while the Jakarta Tomcat web server and the MySQL DBMS are used to implement the provider services. The Apache Axis SOAP server is used to create, send, and receive messages.

Ongoing work includes evaluating the performance of the proposed caching scheme. The evaluation uses two metrics: (i) the number of web interactions per second, and (ii) the consumed network bandwidth. We are also studying the accuracy of the regulation process (i.e., choosing appropriate values of θ and ρ). Finally, we are examining the implications of divergence between the cache look-up and cache validation tables on the overall performance.
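
As a minimal sketch of such a connector, the following Java class uses the Apache Axis client API to invoke an accommodation provider's getHotels operation over SOAP. The endpoint URL, namespace, and parameter values are assumptions for illustration and do not reflect the actual configuration of the case study.

    import javax.xml.namespace.QName;
    import org.apache.axis.client.Call;
    import org.apache.axis.client.Service;

    public class AccommodationConnector {
        private final String endpoint;   // provider endpoint URL obtained from the portal's registry

        public AccommodationConnector(String endpoint) { this.endpoint = endpoint; }

        // Builds a SOAP request for the getHotels operation, sends it to the provider,
        // and returns the provider's response message.
        public String getHotels(String location, String checkIn, String checkOut) throws Exception {
            Service service = new Service();
            Call call = (Call) service.createCall();
            call.setTargetEndpointAddress(endpoint);
            call.setOperationName(new QName("urn:Accommodation", "getHotels"));   // assumed namespace
            return (String) call.invoke(new Object[] { location, checkIn, checkOut });
        }

        public static void main(String[] args) throws Exception {
            // Hypothetical endpoint; in the case study the URL would come from the registry.
            AccommodationConnector connector =
                new AccommodationConnector("http://provider.example.com/axis/services/Accommodation");
            System.out.println(connector.getHotels("Sydney", "2004-01-10", "2004-01-12"));
        }
    }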

4 Related Work and Conclusions

Proxy servers are still the most popular caching mechanism. However, they only deal with static objects and do little or nothing for dynamic objects. In [13], caching dynamic objects at the proxy server level is enabled by transferring to the proxy server programs which generate the dynamic parts of the objects. ICP [18] was developed to enable querying of other proxies in order to find requested web objects. In Summary Cache [10], each cache server keeps a summary table representing the contents of the caches in other servers, in order to minimise the number of ICP messages. CARP [14] can be considered a routing protocol which uses a hash function to determine the owner of a requested object in an array of proxy servers.

Companies such as Akamai and Digital Island have been providing Content Delivery/Distribution Network (CDN) services for several years. CDN services deploy cache/replication servers, called edge servers, at different geographical locations. The first generation of these services was designed to cache static objects such as HTML pages and image, audio, and video files. Nowadays, Edge Side Includes (ESI) enables different cachability to be defined for different fragments of an object. Processing ESI at edge servers enables the dynamic assembly of objects, which would otherwise have to be done at a server accelerator, a proxy server, or the browser.

Caching policies for web objects are studied in [1,4,7]. Weave [19] is a web site management system which provides a language to specify a customised cache management strategy. Maintaining cache consistency has been studied in [8,9,15,5,12]. Triggers can be deployed to detect changes to the back-end database and invalidate cached objects. Oracle Web Cache [16] uses a time-based invalidation mechanism [2]. CachePortal [17] intercepts and analyses system logs to detect changes to the database and invalidate the corresponding object(s). The Data Update Propagation (DUP) algorithm [6] uses an object dependence graph to determine the dependences between cached objects and the underlying data; it provides an API for application programs to explicitly manage caches by adding, deleting, and updating cached objects. Dynamai from Persistence Software uses an event-based invalidation technique, where invalidation can be requested through the application or triggered by external events such as a system administrator.

In summary, we have discussed the limitations of current approaches for providing an effective caching strategy in business portals.


We have proposed a caching scheme for business portals based on the providers' collaboration. The meta-data required to support the strategy consists of the cache look-up and cache validation tables. We have provided a mechanism for providers to score objects for caching, and a mechanism for the portal to regulate these scores.

References

1. Aggarwal C., Wolf J. L., Yu P. S. (1999) Caching on the World Wide Web. IEEE TKDE
2. Anton J., Jacobs L., Liu X., Parker L., Zeng Z., Zhong T. (2002) Web Caching for Database Applications with Oracle Web Cache. ACM SIGMOD Conference
3. Benatallah B., Casati F., editors (2002) Special Issue on Web Services. Distributed and Parallel Databases, An International Journal, 12, 2–3
4. Cao P., Irani S. (1997) Cost-Aware WWW Proxy Caching Algorithms. USENIX Symposium on Internet Technologies and Systems
5. Cao Y., Özsu M. T. (2002) Evaluation of Strong Consistency Web Caching Techniques. World Wide Web Journal
6. Challenger J., Iyengar A., Dantzig P. (1999) A Scalable System for Consistently Caching Dynamic Web Data. IEEE INFOCOM
7. Cheng K., Kambayashi Y. (2000) LRU-SP: A Size-Adjusted and Popularity-Aware LRU Replacement Algorithm for Web Caching. IEEE COMPSAC
8. Deolasee P., Katkar A., Panchbudhe A., Ramamritham K., Shenoy P. (2001) Adaptive Push-Pull: Disseminating Dynamic Web Data. The Tenth World Wide Web Conference (WWW-10)
9. Duvvuri V., Shenoy P., Tewari R. (2000) Adaptive Leases: A Strong Consistency Mechanism for the World Wide Web. IEEE INFOCOM
10. Fan L., Cao P., Broder A. (2000) Summary Cache: A Scalable Wide-Area Web Cache Sharing Protocol. IEEE/ACM Transactions on Networking
11. Kossmann D., Franklin M. J. (2000) Cache Investment: Integrating Query Optimization and Distributed Data Placement. ACM TODS
12. Liu C., Cao P. (1998) Maintaining Strong Cache Consistency in the World-Wide Web. International Conference on Distributed Computing Systems
13. Luo Q., Naughton J. F. (2001) Form-Based Proxy Caching for Database-Backed Web Sites. VLDB Conference
14. Microsoft Corporation (1997) Cache Array Routing Protocol and Microsoft Proxy Server 2.0. White Paper. http://www.mcoecn.org/WhitePapers/Mscarp.pdf
15. Olston C., Widom J. (2001) Best-Effort Synchronization with Source Cooperation. ACM SIGMOD Conference
16. Oracle Corporation (2001) Oracle9iAS Web Cache. White Paper
17. Candan K. S., Li W. S., Luo Q., Hsiung W. P., Agrawal D. (2001) Enabling Dynamic Content Caching for Database-Driven Web Sites. ACM SIGMOD Conference
18. Wessels D., Claffy K. (1997) Application of Internet Cache Protocol (ICP), Version 2. Network Working Group
19. Yagoub K., Florescu D., Valduriez P., Issarny V. (2000) Caching Strategies for Data-Intensive Web Sites. VLDB Conference
20. Zona Research Inc. (2001) Zona Research Releases Need for Speed II. http://www.zonaresearch.com/info/press/01-may03.htm
