Adaptive Caching of Fresh Web Search Results

Liudmila Ostroumova Prokhorenkova, Yury Ustinovskiy, Egor Samosvat, Damien Lefortier, and Pavel Serdyukov

Yandex, Moscow, Russia
{ostroumova-la,yuraust,sameg,damien,pavser}@yandex-team.ru

Abstract. In this paper, we study the problem of caching search results that degrade at a rapid rate. We suggest a new caching algorithm based on query frequencies and the predicted staleness of cached results. We also introduce a new performance metric for caching algorithms called staleness degree, which measures the level of degradation of a cached result. In the case of frequently changing search results, this metric is more sensitive to those changes than the previously used stale traffic ratio.

Keywords: SERP caching, staleness degree, fresh vertical.

1 Introduction

Modern search engines have to continuously process a large number of queries by evaluating them against huge document indexes. This may cause an overload of the back-end servers, an increase in the latency of query processing, and hence, eventually, user dissatisfaction. Caching of search results helps to reduce query traffic to the back-end servers and thus to avoid overloading them. When queries with cached results are issued again, these results may be served from the cache, since this is much faster than processing those queries anew. Unfortunately, caching may also lead to user dissatisfaction due to the chance of serving stale results. Therefore, cached results should be updated when they are likely to be stale. However, reissuing queries to update all cached results too often may lead to an immediate overload of the back-end servers.

In previous works on search result caching, a cached result is considered stale if the ordered set of top-k documents has changed even slightly: either their order has changed, or some documents have been added or deleted [4,6,9]. In this paper we show that this notion of staleness should be more flexible. First of all, the "time-to-die" for a cache entry should not necessarily come immediately after the corresponding SERP has changed even a bit. SERPs for recency-sensitive queries [7] may change very fast, which makes it impossible or, at least, impractical to update their cache entries every time such a small result change takes place. It is also often counterintuitive to update caches after even the slightest search result changes, since many of them are insignificant in terms of their influence on user satisfaction. Following these requirements and intuitions, our method, in contrast to its predecessors, tries to predict not only the fact that the result set for a query has changed at some point, but also the magnitude of its change.

The main contributions of our paper are the following:


– We argue that the "time-to-die" for a SERP does not need to come immediately with every change of that SERP. Being interested in the magnitude of a SERP's changes, we define a new measure of caching algorithms' performance that takes this intuition into account.
– We propose a new caching algorithm specially designed for search engines whose results change extremely fast. Our experiments demonstrate its advantage over the existing caching methods.
– The proposed algorithm is built upon an optimization framework that dynamically derives an optimal cache update policy based on the frequencies of queries and the predicted degree of staleness of cache entries.

The rest of the paper is organized as follows. In Section 2, we discuss previous research on search results caching. In Section 3, we describe the caching framework we rely on, describe the data we use for the experiments, and discuss the new measure of caching algorithms' performance. We present the baseline algorithms and new caching strategies in Sections 4 and 5. The experimental results are presented in Section 6. Finally, we conclude the paper and outline directions for future research.

2 Related Work

Usually, a cached SERP is considered stale if its top-k documents have changed. A cache invalidation policy is needed to detect whether a cached query result is stale before processing the query. There are two groups of approaches to the invalidation of cached results. The first group uses knowledge about index changes [1,4]. These approaches are effective, but hard to implement in practice. First of all, they are computationally expensive due to the necessity of accurate and timely determination of changes in the index. Second, it is usually hard to observe how index changes affected cached SERPs while avoiding costly processing of the corresponding queries.

The algorithm proposed in this paper belongs to the second group of approaches, which do not monitor any index changes. Such algorithms rely solely on the query log and the history of SERP changes and can be of two types: active and passive. Passive methods update stale cache entries only in response to a user request [2]. Active methods update stale cache entries whenever they have available resources to do that [6,9]. Previous studies show that active methods outperform passive ones [10]. Hence, this paper aims at advancing the state-of-the-art active cache updating policies.

Passive methods use only TTL (time-to-live) values to invalidate cached results. If the age of a cache entry is greater than its TTL, then this entry is marked as expired. Often, each entry is associated with a fixed TTL [1,4,5,6], although methods with adaptive TTLs have also been suggested [2]. Active policies are often also supplemented with TTLs in order to set limits on the age of shown results. Note that in our case we cannot use the adaptive TTLs from [2], which considered an idealized setting with no constraints on computing resources. The reason is that the sets of top documents from a fresh vertical change extremely frequently for almost all queries, which leads to very small adaptive TTLs according to [2]. Small TTLs inevitably lead to prohibitively frequent updates of cache entries, which is exactly what a search engine always tries to avoid.


This motivated us to predict the magnitude of SERP changes instead of just predicting whether the SERP has changed or not.

An active caching policy was suggested in [9], where a machine learning method is used to estimate the arrival times of future queries. Based on this information, some cached results are chosen for updating. Since in our case many queries are issued several times per second (see Section 3.3), we predict not the arrival times of queries, but their frequencies. Moreover, as we discuss in Section 5, even the precise knowledge of when and how many queries will be issued in the near future gives only a rather small improvement. Another active caching policy, with proactive prefetching of cached results, was suggested in [6]. It also uses TTLs to invalidate cached results but, in addition, leverages idle cycles of the back-end servers to reprocess queries and refresh cache entries proactively, even before they expire according to their TTLs. A cached result is chosen for updating based on the product of its age and the frequency of the respective query. Given all the above-mentioned constraints, we regard this method as the only appropriate baseline for our study.

3 General Framework

3.1 System Architecture

In this section, we describe the system architecture we consider in this paper. All queries are issued to the front-end of a search engine. A query result can be either taken from the cache or retrieved by forwarding the query further to the back-end search cluster. All previously unseen queries are always forwarded to the back-end cluster, since there are no cached results for them. Also, if TTL values are used by the search engine, all expired entries are deleted from the cache and the corresponding queries are treated as unseen. Results for all other queries are served to users from the cache. The cache refreshing scheduler decides which SERP's cache to update during idle cycles of the search cluster (by using spare computing resources of the search engine).

The final SERP shown to the user usually contains results from different verticals with different indexes [3]. In this paper, we are interested in the vertical serving fresh content, which removes any document older than 10 days from its index. This threshold is chosen in accordance with [7], where it was observed that most relevant documents for recency-sensitive queries are younger than 10 days. However, such a threshold may depend on the current search engine's settings and its understanding of how old content should be to be considered "fresh". For all queries seeking documents from this vertical, the list of top relevant documents changes extremely fast; therefore, it is very important to have an efficient policy for caching the results served by this vertical.

3.2 Metrics

The primary requirements for caching algorithms with infinite cache capacity are high freshness of the served results and a reduced load on the search engine's back-end. The common evaluation measures of these algorithms are the stale traffic ratio,


i.e., the fraction of stale query results shown to users, and the false positive ratio, i.e., the fraction of redundant cache updates [2,4,6]. Most previous studies consider that a cached result page $S_c(q)$ (i.e., the list of $k$ most relevant documents) served for the query $q$ can be in just two states, either fresh or stale: $I_S(q) = 0$ if $S_c(q) = S_a(q)$ (fresh state) and $I_S(q) = 1$ if $S_c(q) \neq S_a(q)$ (stale state), where $S_a(q)$ is the actual up-to-date list of top-$k$ documents obtained by processing the query [4,6,9]. Then the stale traffic ratio $ST$ is the average of the binary staleness $I_S(q)$ over all queries $q \in Q$ issued within a given period of time $[t_0, t_1]$: $ST([t_0, t_1]) = \frac{1}{|Q|} \sum_{q \in Q} I_S(q)$.

We argue that staleness of a cached result page is not necessarily a binary property. Clearly, all stale result pages $S_c(q)$ are stale to a varying degree. Therefore, we propose to study more discriminative measures of staleness than the $ST$ metric. We decided to focus on an NDCG-like measure, which we consider the most suitable for the task of caching fresh vertical results. First, we introduce the staleness degree $d(S_c(q), S_a(q))$ of a single served result $S_c(q)$, a non-binary alternative to $I_S(q)$. Second, we define the staleness degree ratio, similarly to $ST$, as the average of $d(S_c(q), S_a(q))$ over the queries $Q$ issued within a period $[t_0, t_1]$:

$$StDeg([t_0, t_1]) = \frac{1}{|Q|} \sum_{q \in Q} d(S_c(q), S_a(q)). \quad (1)$$

To measure the staleness degree $d(S_c(q), S_a(q))$ we compare the top-$k$ documents shown to users in $S_c(q)$ with the actual up-to-date top-$k$ most relevant documents $S_a(q)$. The parameter $k$ will be referred to as the cut-off parameter. Motivated by the principles embodied in the classical NDCG [8] measure, we introduce the notions of "gain" and "discount" to compute the quality of a cached search result. Let $pos_c(u)$ and $pos_a(u)$ denote the positions of a document $u$ in $S_c(q)$ and $S_a(q)$ respectively. As in the NDCG measure, the gain of a document $u$ is some increasing function of its relevance and its discount is a decreasing function of its position. Since the only information about the current relative relevance of the documents in $S_c(q)$ we normally have is their positions in $S_a(q)$, we assume that the relevance of a document $u$ depends solely on $pos_a(u)$. The resulting quality measure of $S_c(q)$ is

$$M(S_c(q), S_a(q)) = \sum_{u \in S_c(q)} gain(pos_a(u)) \cdot discount(pos_c(u)).$$

If $u \in S_c(q)$ but $u \notin S_a(q)$, $u$ could have been ranked at any position $\geq k + 1$. In that case, $u$ is assigned the position $k + 1$: $pos_a(u) := k + 1$. As in NDCG, we define the staleness degree $d(S_c(q), S_a(q))$ by normalizing $M(S_c(q), S_a(q))$ to the unit segment ($M_{max}$ and $M_{min}$ are the maximal and minimal possible values of $M$; they depend on the choice of gain and discount):

$$d(S_c(q), S_a(q)) = 1 - \frac{M(S_c(q), S_a(q)) - M_{min}}{M_{max} - M_{min}}. \quad (2)$$

In this paper, we set $gain(u) = 1/(pos_a(u) + 1)$ and $discount(u) = 1/(pos_c(u) + 1)$. In Section 6, we analyze how the choice of different values of $k$ (the number of documents in $S_c(q)$ and $S_a(q)$) affects the performance of a caching algorithm optimized for $StDeg$. Further, if not specified otherwise, we use the fixed cut-off parameter $k = 10$. Here are some examples of staleness degree values (for $k = 10$): $d(S_c(q), S_a(q)) = 0.52$ if the first document of $S_a(q)$ is replaced by another (irrelevant) document in $S_c(q)$; $d(S_c(q), S_a(q)) = 0.36$ if the first document of $S_a(q)$ (the most relevant) is absent in $S_c(q)$ (all other documents move up by one position); $d(S_c(q), S_a(q)) = 0.15$ if the second document is absent.
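To make the metric concrete, below is a minimal Python sketch of $d(S_c, S_a)$ under two assumptions of ours that the text leaves implicit: positions are 1-indexed, and $M_{min}$ is realized by a cached list sharing no documents with the actual one. With these choices the sketch reproduces the second example above ($d \approx 0.36$); the other constants may differ slightly from the paper's exact normalization.

```python
def gain(pos_a: int) -> float:
    return 1.0 / (pos_a + 1)

def discount(pos_c: int) -> float:
    return 1.0 / (pos_c + 1)

def staleness_degree(s_c: list, s_a: list, k: int = 10) -> float:
    """d(S_c, S_a) from Equation (2): 0 for a fresh result, up to 1 for a fully stale one."""
    pos_a = {doc: i + 1 for i, doc in enumerate(s_a[:k])}
    # Documents absent from S_a are treated as ranked at position k + 1.
    m = sum(gain(pos_a.get(doc, k + 1)) * discount(i + 1)
            for i, doc in enumerate(s_c[:k]))
    m_max = sum(gain(i) * discount(i) for i in range(1, k + 1))      # S_c == S_a
    m_min = sum(gain(k + 1) * discount(i) for i in range(1, k + 1))  # disjoint lists (assumption)
    return 1.0 - (m - m_min) / (m_max - m_min)

# The most relevant document is missing and the rest move up one position:
s_a = [f"d{i}" for i in range(1, 11)]
s_c = s_a[1:] + ["d_new"]
print(round(staleness_degree(s_c, s_a), 2))  # ~0.36, matching the example above
```

$StDeg([t_0, t_1])$ from Equation (1) is then simply the mean of these values over all queries issued within the interval.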

3.3 Data

Our experiments are based on the query log of Yandex (yandex.com, yandex.ru), the most popular among the search engines operating in Russia. First, we collected the dataset D1 required to conduct our motivating experiments (see Section 4) and to tune several parameters of our algorithm. We sampled 6K random unique queries from the stream of queries issued on December 29, 2012. Then we monitored all issues of these queries by real users from December 29, 2012 to February 18, 2013 (∼700M issues). We also collected the dataset D2 required to evaluate the performance of our algorithms. For that purpose, we sampled another set of 6K random queries issued on February 25, 2013 and monitored all issues of these queries from February 25, 2013 to March 4, 2013 (∼85M issues). The most frequent query was issued 20 times per second on average. Also, 23% of the queries were, at some point, issued several times per second. On the other hand, 75% of the queries were each issued less than 5 times per hour on average. Note that, like the standard ST metric, the StDeg metric can be used only for offline tuning, since it requires the computation of actual query results in the search back-end.

In order to perform an offline evaluation and to train our predictor of search result changes, we needed to understand the dynamics of SERP changes. Therefore, we issued all 12K selected queries every 10 minutes during the corresponding above-mentioned time periods and saved the result pages of the vertical serving fresh content. In total, we saved 45M SERPs for the first period and 7M SERPs for the second period. The size of our dataset is comparable to or exceeds the sizes of the datasets used in previous studies of SERP caching with offline evaluation. In [2], e.g., 4,500 queries were issued once a day for a period of 120 days and the top 10 results were saved, so only 540K SERPs were collected.

4 Algorithms

In this section, we describe the general framework of all the algorithms we consider. All the algorithms have a limited quota for cache updates N: the number of query results which can be computed in the search back-end per second. Note that N can be any positive real number. For example, if N < 1, then it is allowed to process only one query per 1/N seconds. As in the previous studies on caching [1,4,5,6,9], we also use TTL in order to avoid ever showing too old results to users. As already mentioned, all cache entries which have been in the cache for longer than TTL are marked as expired, and, if the corresponding queries are issued by users again, they are passed directly to the back-end cluster. The cache entries for expired-but-reissued queries always have the highest priority and are updated before any other cache entries, as proposed in [6]. Naturally, the more queries with expired TTL are issued, the fewer spare resources we have and the fewer cached results with non-expired TTL can be updated. Thus, the number of allowed cache updates per second in our system is dynamic.


The core part of the caching algorithms in our architecture is the refreshing scheduler, which updates cached results using spare computing resources. At every moment of time, we have a set of triples in the cache, each characterized by a query, its cached result, and the time of its last update: Tq = {q, Sc(q), tu}. Every τ seconds (the re-ranking period) we rank all the triples {Tq} according to their priorities in our refreshing scheduler, forming a queue of results Sc(q) to update. As soon as we have spare resources, at time tnow, we take the top queries from this queue, update their cache entries (i.e., for a query q, we replace its cached SERP Sc(q) with the actual one Sa(q)), and remove them from the queue. The quality of our algorithm essentially depends on the way we prioritize cached results to be updated. In general, this prioritization should continuously estimate the cost of keeping an outdated cache entry in the cache over the next period of time; roughly speaking, triples corresponding to frequent queries with highly outdated results Sc should be ranked higher. So, the refreshing scheduler should periodically perform the following steps: 1) estimate the frequency of query q; 2) estimate the integral staleness of the cached result Sc in the next τ seconds following the current batch cache update; 3) combine these quantities into a ranking function. Further in this section, we describe three approaches to implementing a refreshing scheduler.
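To make this framework concrete, here is a minimal sketch of one re-ranking cycle; the data layout (a dict mapping each query to its cached SERP and last update time) and all names are our own illustration, with the priority function left pluggable so that the AF, SF, and LSF strategies below can be dropped in.

```python
import heapq
import time

def refresh_cycle(cache, priority, process_query, n_quota, tau):
    """One re-ranking period of length tau seconds: rank all cached triples
    {q, S_c(q), t_u} by the given priority function and spend the update
    budget (roughly N * tau back-end requests) on the top entries."""
    t_now = time.time()
    budget = int(n_quota * tau)
    # Steps 1)-3) above are folded into priority(q, t_u, t_now).
    top = heapq.nlargest(budget, cache.items(),
                         key=lambda kv: priority(kv[0], kv[1][1], t_now))
    for q, _ in top:
        cache[q] = (process_query(q), time.time())  # S_c(q) := S_a(q)
```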

4.1 Baseline

We implement the age-frequency (AF) strategy from [6] as our baseline. In [6], the scheduler ranks all cached results according to the value f(q)Δt, where f(q) is the frequency of q and Δt = tnow − tu is the age of the cached SERP Sc(q). In some sense, AF estimates the staleness of Sc simply as Δt and combines the staleness Δt and the frequency f(q) simply by taking their product. In the next section we show that, on real data, the staleness of Sc(q) as a function of Δt is neither linear nor query-independent. This observation motivated us to propose a refreshing scheduler that takes these properties into account.
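In code, the AF priority is a one-liner; wrapping it in a factory (our own convention, matching the scheduler sketch above) keeps the frequency estimator pluggable:

```python
def make_priority_af(freq):
    """AF priority from [6]: f(q) * dt, where freq(q) is a historical
    frequency estimate (see Section 5.1)."""
    def priority(q, t_u, t_now):
        return freq(q) * (t_now - t_u)
    return priority
```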

4.2 Staleness-Frequency Strategy

We start with some motivating experiments. For all queries we computed the average freshness (i.e., $1 - d(S_a(q), S_c(q))$) of cached results with ages of 10, 20, ..., 2000 minutes. Figure 1 shows the obtained results averaged over all queries ('All queries') and for two exemplary queries, 'java' and 'msi cx620'; the y-axis shows freshness on a logarithmic scale and the x-axis shows the age Δt of the cached result in hours.

[Fig. 1. Freshness of Δt old cached results]

It follows from this figure that the freshness of the cached result for the query q can be approximated as

$$1 - d(S_c(q), S_a(q)) = e^{-\theta(q)\Delta t}, \quad (3)$$

where θ(q) is the degradation rate of a cached result Sc(q) created at time tu, Δt is its age, and Sa(q) is the actual result for the query q at time tu + Δt.
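Equation (3) says that log-freshness decays linearly with age, so a degradation rate can be read off a plot like Figure 1 as the slope of a least-squares line. The sketch below illustrates this reading of the figure under our own assumptions; it is not the estimator the paper uses in practice (that one is given in Section 5.2).

```python
import math

def fit_theta(ages, staleness):
    """Least-squares fit of theta in 1 - d = exp(-theta * age): the slope of
    -log(1 - d) regressed on age. Assumes every observed d is strictly < 1."""
    ys = [-math.log(1.0 - d) for d in staleness]
    n = len(ages)
    mean_x = sum(ages) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(ages, ys))
    var = sum((x - mean_x) ** 2 for x in ages)
    return cov / var

# Noiseless toy data with theta = 0.1 per hour, observed every 10 minutes:
ages = [i / 6 for i in range(1, 13)]
obs = [1 - math.exp(-0.1 * a) for a in ages]
print(round(fit_theta(ages, obs), 3))  # -> 0.1
```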


Apart from the evolution of freshness for the 'average query', Figure 1 demonstrates that the degradation rate of a cached result essentially depends on the query: θ = θ(q). Figure 2 gives additional evidence supporting this assertion: it displays the distribution (percent of queries against StDeg) of the staleness degree d(Sc(q), Sa(q)) over all 10-minute-old cached results. Altogether, these observations allow us to reduce the problem of estimating the staleness of Sc(q) at a certain moment of time to the problem of estimating θ(q).

[Fig. 2. Staleness of 10 min old cached results]

The staleness-frequency (SF) strategy aims at minimizing the staleness degree of all results shown to users within some period of time $[t_0, t_1]$. Assume that: (1) we aim at minimizing $StDeg([t_0, t_1])$; (2) query issues are uniformly distributed within the time interval $[t_0, t_1]$ ($f(q)$ issues per second); (3) the interval $[t_0, t_1]$ is 'small' (see details below). If we do not update the triple $T_q = \{q, S_c(q), t_u\}$ during $[t_0, t_1]$, its contribution to $StDeg([t_0, t_1])$ is ($1/|Q|$ is the factor from the definition of $StDeg$):

$$L(q) = \frac{1}{|Q|} \int_{t_0}^{t_1} d(S_c(q), S_a^{(t)}(q)) f(q)\, dt = \frac{1}{|Q|} \int_{t_0 - t_u}^{t_1 - t_u} \left(1 - e^{-\theta(q)x}\right) f(q)\, dx \approx \frac{t_1 - t_0}{|Q|} f(q) \left(1 - e^{-\theta(q)(t_0 - t_u)}\right). \quad (4)$$

In the latter equality, we assume that $\theta(q)(t_1 - t_0) \ll 1$; hence $1 - e^{-\theta(q)x}$ is almost constant on $[t_0 - t_u, t_1 - t_u]$. The greedy solution to the $StDeg([t_0, t_1])$-minimization problem always updates the cache entry with the maximal loss value $L(q)$. In our framework we reorder the queue of non-expired cache entries every τ seconds; therefore, $StDeg$ and $L(q)$ in Equation (4) are computed over the interval $[t_{now}, t_{now} + \tau]$. Note that the assumption $\theta(q)(t_1 - t_0) \ll 1$ is automatically satisfied if τ is small enough. The refreshing scheduler sorts the triples $T_q$ according to the value $L(q)$, forming a queue. Basically, it sorts queries according to the value $f(q)(1 - e^{-\theta(q)\Delta t})$, since the factor $(t_1 - t_0)/|Q|$ in Equation (4) is the same for all triples. Afterwards, we greedily optimize $StDeg([t_{now}, t_{now} + \tau])$ by continuously updating the cache entries waiting in this queue during the period τ. The computation of $L(q)$ requires $f(q)$ and $\theta(q)$ for all queries. Section 5 describes how to estimate these values from the query log.
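Dropping the common factor $(t_1 - t_0)/|Q|$, the SF priority can be sketched in the same pluggable style as above (names ours; `freq` and `theta` stand for the estimators of Section 5):

```python
import math

def make_priority_sf(freq, theta):
    """SF priority: f(q) * (1 - exp(-theta(q) * dt)), i.e. the per-query loss
    L(q) of Equation (4) up to a factor shared by all cache entries."""
    def priority(q, t_u, t_now):
        dt = t_now - t_u
        return freq(q) * (1.0 - math.exp(-theta(q) * dt))
    return priority
```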

4.3 Log-Staleness-Frequency Strategy

Although the SF strategy optimizes our performance measure StDeg greedily and directly, its performance may be heavily affected by our TTL policy. The reason is that the SF policy tends to update frequent queries rather often, while rare queries with completely outdated results always have lower priority. For instance, assume that for queries q1 and q2 we have f(q1) ≫ f(q2). The staleness degree after 10 minutes on our data is greater than or equal to 0.001 for 99% of queries;


thus, if f(q1)/f(q2) > 1000, the query q1 will be ranked higher irrespective of the degradation of q2 for the majority of query pairs q1 and q2. Given that moderately frequent and rare queries constitute a large share of the total query volume, continuously serving stale results for them will soon result in a dramatic increase in user dissatisfaction. The TTL mechanism described earlier allows us to upper-bound the age of cached results but, eventually, due to the above-mentioned imbalance, leads to an overabundance of expired cache entries.

This observation motivated us to develop a strategy which gives more weight to highly stale results in the queue and hence updates them before their expiration more often. This strategy is based on the greedy optimization of a new intermediate objective function $P$. As for the SF strategy, we assume that the staleness $d(S_c(q), S_a(q))$ does not change within the short time interval $[t_0, t_1]$ under consideration, i.e., $\theta(q)(t_1 - t_0) \ll 1$. We define $P$ as the product of the freshnesses (which is one minus staleness) of all results $Q$ shown within $[t_0, t_1]$ and maximize it:

$$P([t_0, t_1]) = \prod_{q \in Q} \left(1 - d(S_c(q), S_a(q))\right)^{f(q)(t_1 - t_0)} \to \max.$$

As we discussed in the previous section, the function $e^{-\theta(q)\Delta t}$ gives a reasonable estimate of the freshness of $S_c(q)$ at time $t_{now}$, where the age of the cache entry is $\Delta t = t_{now} - t_u$. Thus, the maximization reduces to the following minimization problem:

$$-\log P([t_0, t_1]) = -(t_1 - t_0) \sum_{q} f(q) \log\left(1 - d(S_c(q), S_a(q))\right) = (t_1 - t_0) \sum_{q} f(q)\,\theta(q)\,\Delta t \to \min. \quad (5)$$

As in the SF strategy (4), a greedy solution to this minimization problem on the time interval $[t_{now}, t_{now} + \tau]$ updates the results with the maximal value of $\theta(q) f(q) \Delta t$. Note that for small values of $\Delta t$, $\theta(q)\Delta t$ is close to $1 - e^{-\theta(q)\Delta t} = d(S_a, S_c)$. So, this strategy is similar to SF for those cache entries which are updated often. More importantly, it tends to update old cached results more often, since $1 - e^{-\theta(q)\Delta t} \ll \theta(q)\Delta t$ for large $\Delta t$. In Section 6 we show that this algorithm, referred to as LSF (log-staleness-frequency), indeed outperforms SF.
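The corresponding priority (again in our illustrative pluggable style) differs from SF only in the decay term, which keeps growing linearly instead of saturating at 1, so long-unrefreshed entries are not starved:

```python
def make_priority_lsf(freq, theta):
    """LSF priority: f(q) * theta(q) * dt, the per-query term of Equation (5).
    For small dt it behaves like the SF loss (1 - exp(-x) ~ x); for large dt
    it dominates it, pushing old entries up the queue."""
    def priority(q, t_u, t_now):
        return freq(q) * theta(q) * (t_now - t_u)
    return priority
```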

5 Prediction of Staleness and Query Frequency

The proposed algorithms require estimates of the query frequency f(q) and the degradation rate θ(q). In this section, we describe our estimation methods.

5.1 Frequency

The frequency of a query is calculated over the whole period of past observations. The same historical frequency was used in [6]. We also noticed that knowing the true number of times a query will be issued in the next τ seconds (oracle strategies) improves the quality of all algorithms by only ∼1%; thus, the estimation of the frequency f(q) represents a much less challenging problem than the estimation of the degradation rate θ(q).
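For completeness, here is a minimal sketch of such a historical-frequency estimator; the class and its interface are our own, chosen to match the `freq` argument of the priority factories above:

```python
import time
from collections import Counter

class FrequencyEstimator:
    """Historical frequency f(q): issues of q per second over the whole
    period of past observations, as in [6]."""
    def __init__(self):
        self.start = time.time()
        self.counts = Counter()

    def observe(self, q: str) -> None:
        self.counts[q] += 1

    def freq(self, q: str) -> float:
        return self.counts[q] / max(time.time() - self.start, 1.0)

# Usage with the factories above, e.g. make_priority_af(estimator.freq).
```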

5.2 Staleness

Query-Independent Estimation. First, we estimate the query-independent rate of staleness on the training data. That is, for every temporally consecutive pair of search engine result pages $S_c^1(q)$ at time $t_1$ and $S_c^2(q)$ at time $t_2$ we estimate the value of $\theta(q)$ as

$$\theta(q; T_q^1, T_q^2) = \frac{-\log(1 - d(S_c^1(q), S_c^2(q)))}{t_2 - t_1}.$$

In order to define $\hat{\theta}$, we first average these estimates over all temporally consecutive pairs $S_c^1(q)$ and $S_c^2(q)$ for a given query $q$ and then average the obtained values over all queries. Recall that the interval $t_2 - t_1$ between two temporally consecutive search results is 10 minutes in our study (see Section 3.3). Basically, $\hat{\theta}$ is just minus the slope of the 'All queries' line in Figure 1.

Historical Estimation. Now we proceed with the description of the estimation of the query-dependent parameter $\theta(q)$. At any moment of time, for each query $q$ we have a sequence of cached results at the preceding moments of time $H(q) = \{(S_c^i(q), t_u^i)\}_{i=0}^{H}$, where $t_u^i$ are the moments at which we updated the cache entry for query $q$. Note that both the moments $t_u^i$ and the number of cached results $H$ depend on our caching strategy. We use these historical data in order to make the historical estimation of $\theta(q)$. Namely, for each adjacent pair $(S_c^{i-1}(q), t_u^{i-1})$, $(S_c^i(q), t_u^i)$ in $H(q)$, we calculate the $i$-th estimate of $\theta(q)$:

$$\theta^i(q) := \frac{-\log(1 - d(S_c^{i-1}(q), S_c^i(q)))}{t_u^i - t_u^{i-1}}. \quad (6)$$

We derive the $H$-th historical estimation $\tilde{\theta}_H(q)$ of $\theta(q)$ from the sequence $\theta^i(q)$ by taking the historical average $\tilde{\theta}_H(q) = \frac{1}{H} \sum_{i=1}^{H} \theta^i(q)$ over cached results $S_c^i(q)$ that are at most $T$ hours old. For the experiments we took $T = 24$. We tried other values of $T$ (12h, 48h) on the dataset D1 and noticed almost no influence of this parameter (±1%) on StDeg. We also tried to use an exponential moving average instead of the historical average and obtained better performance with the historical average.

Since our approach requires sufficient historical data $H(q)$ in order to construct a reasonable estimation of $\theta(q)$, we need some prior estimate for new queries and for queries with little history $H(q)$. For that purpose, we combine the historical estimation $\tilde{\theta}(q)$ with the query-independent estimation $\hat{\theta}$:

$$\theta_C(q) = \frac{H}{w + H}\,\tilde{\theta}(q) + \frac{w}{w + H}\,\hat{\theta}, \quad (7)$$

where $w$ is a parameter of our algorithm accounting for the weight of the query-independent estimation $\hat{\theta}$. The more cached results $H$ we have for the query $q$, the more reliable the historical estimation $\tilde{\theta}(q)$ is, and the higher the weight given to it in Equation (7) for a given $w$.
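A sketch of the combined estimator, reusing the `staleness_degree` function from Section 3.2; the signs follow Equation (3) (so that θ is a positive rate), and the interface is our own:

```python
import math

def estimate_theta(history, theta_prior, w=10.0, horizon_hours=24.0):
    """theta_C(q) from Equations (6)-(7). `history` is the time-ordered list
    [(S_c^i, t_u^i in hours)] of cache snapshots for q; `theta_prior` is the
    query-independent estimate."""
    t_last = history[-1][1] if history else 0.0
    estimates = []
    for (s_prev, t_prev), (s_cur, t_cur) in zip(history, history[1:]):
        if t_last - t_prev > horizon_hours:
            continue  # use only snapshots at most T = 24 hours old
        d = min(staleness_degree(s_prev, s_cur), 1.0 - 1e-9)  # keep log finite
        estimates.append(-math.log(1.0 - d) / (t_cur - t_prev))  # Equation (6)
    h = len(estimates)
    hist_avg = sum(estimates) / h if h else 0.0
    return (h * hist_avg + w * theta_prior) / (w + h)  # Equation (7)
```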

6 Experiments

In this section, we compare our algorithms SF and LSF with the baseline strategy AF and analyze the influence of the parameters on their performance. Our algorithm has several parameters. The number of allowed cache updates per second N and the TTL are usually defined by the requirements of a search engine: N directly


corresponds to the maximum possible load of the back-end cluster and the TTL limits the age of shown results. Further in this section, we analyze the influence of both parameters on the performance of the algorithms.

Re-Ranking Period τ. The re-ranking period is a parameter of all the algorithms. On the one hand, small values of τ lead to higher flexibility and better estimation of parameters. On the other hand, in our framework τ is a lower bound on the time between successive updates of a cache entry: we cannot update an entry more than once during one cycle. Therefore, the influence of τ is not necessarily monotone. In the further experiments, for each algorithm and for every combination of other parameters, we tune τ on the dataset D1.

The Estimation of Frequency and Staleness. For the algorithms SF and LSF we tuned the weight w (see Equation 7) on the dataset D1. We noticed that the performance of these algorithms is the best, and almost constant, when w is in the interval [5, 100]. For the rest of the experiments we fix w = 10.


Comparison with the Baseline. We compare all three strategies on the one-week test query log (dataset D2, see Section 3.3). For every moment of time t ∈ [ts, te], where ts and te correspond to the start and the end of our testing period, we define the staleness degree metric StDeg(t) as follows (see Equation (1)): StDeg(t) := StDeg([t, t + 24h]). Then for each algorithm A ∈ {AF, SF, LSF} let StDegA(t) be the corresponding metric. Figure 3 demonstrates the relative improvement of StDegA(t) over the AF strategy.

[Fig. 3. Relative improvement of StDeg(t) over AF during the test period]

As one can see from the figure, both SF and LSF outperform the AF strategy according to StDeg(t) during the entire test period, except for a short warm-up interval. Throughout this interval, SF and LSF accumulate information on query-specific historical degradation to better predict the staleness of the corresponding results. Naturally, the more historical data on the degradation for a given query we collect, the more accurate the prediction of the rate θ(q) we make with Equation (7), improving the overall quality of both methods. Furthermore, as mentioned in Section 4.3, LSF often outperforms the SF strategy. Most of the time, both methods improve StDeg with respect to AF by 5-15%. The effect of our caching algorithms on the false positive, stale traffic, and staleness degree measures over the test period [ts, te], excluding a 1-day warm-up period, is given in Table 1 (see N = 1). To give an interpretation of the value StDeg = 0.02 (see Section 3.2): StDeg = 0.02 if a completely irrelevant list of results is shown for 2% of queries, or if the most relevant document is missed for 6% of queries, or if the second most relevant document is missed for 13% of queries.


Number of Allowed Updates Per Second. We evaluated all three algorithms with various values of N on the one-week test log (dataset D2). For realistic experiments, this parameter should be proportional to the size of the analyzed query sample and should be adequate for the needs of a certain vertical. Since our dataset contains 6000 unique queries, we considered rather small values of N (N ≤ 1). Note that if it is allowed to update only one cache entry per second (N = 1), then the cache for all 6000 queries can be updated in 100 minutes, i.e., it is possible to keep all the cached results no older than 2 hours, which is acceptable for a vertical serving fresh content. For comparison, the parameters used in [6] allow updating the cache for all unique queries in 3.5-7.5 days.

For every method we computed the average values of each metric during the one-week test period, excluding the one-day warm-up period. Exclusion of the warm-up period is natural, since our algorithms are highly unstable in the beginning.

Table 1. Influence of N

Alg.           FP     ST     StDeg
AF,  N = 1     0.23   0.067  0.021
SF,  N = 1     0.21   0.064  0.019
LSF, N = 1     0.21   0.063  0.019
AF,  N = 1/2   0.12   0.092  0.030
SF,  N = 1/2   0.12   0.091  0.029
LSF, N = 1/2   0.11   0.090  0.028
AF,  N = 1/3   0.079  0.12   0.039
SF,  N = 1/3   0.075  0.12   0.038
LSF, N = 1/3   0.073  0.11   0.037
AF,  N = 1/5   0.025  0.18   0.057
SF,  N = 1/5   0.024  0.18   0.058
LSF, N = 1/5   0.024  0.17   0.055

Table 1 demonstrates the growth of StDeg with the decrease of N. As we expected, the influence of the parameter N on the quality of a caching algorithm is much stronger than, say, the choice of the refreshing scheduler. Indeed, with a small value of N we spend almost all available resources on queries with expired cache entries, leaving no idle cycles to proactively update the non-expired cache entries queued by our caching algorithm. It is interesting to note that for larger values of allowed updates per second (e.g., N = 1), both the SF and LSF policies perform relatively well, improving over AF's quality (see Table 1). On the contrary, small values of N result in an evident degradation of the greedy SF algorithm in terms of StDeg in comparison with both the AF and LSF methods. This observation was quite surprising to us, since SF directly optimizes the objective measure StDeg. In fact, the results of this comparison of AF with SF motivated us to develop the modification of SF, namely LSF; we discuss this observation in detail in Section 4.3.

TTL. Table 2 shows the influence of the TTL on the performance of the algorithms. As expected, too small values of TTL make the performance of all algorithms worse. The reason is that all algorithms spend too many resources updating expired cache entries. When the TTL becomes larger, the performance stabilizes, since there are not too many expired entries and the algorithms are able to follow their main strategies. Note that all considered values of TTL are comparable to, but slightly smaller than, previously used TTL values [6]. We consider smaller values due to the specificity of the fresh vertical, whose users have lower tolerance to stale results than users of an average vertical.

Table 2. Influence of TTL

Alg.             FP     ST     StDeg
AF,  TTL = 2 h   0.16   0.086  0.026
SF,  TTL = 2 h   0.15   0.084  0.025
LSF, TTL = 2 h   0.15   0.084  0.025
AF,  TTL = 5 h   0.22   0.069  0.022
SF,  TTL = 5 h   0.21   0.066  0.020
LSF, TTL = 5 h   0.21   0.066  0.020
AF,  TTL = 10 h  0.23   0.067  0.021
SF,  TTL = 10 h  0.21   0.064  0.019
LSF, TTL = 10 h  0.21   0.063  0.019


Table 3. Relative improvements of StDeg_k of SF/LSF over AF for various k

k     1      2     3     4     5     6     7     8     9      10
LSF   16.0%  7.6%  7.1%  6.2%  5.6%  4.6%  5.6%  6.0%  10.9%  9.4%
SF    12.9%  6.8%  4.7%  4.7%  4.7%  3.3%  4.4%  3.4%  10.6%  6.8%


Cut-Off Parameter. The choice of the cut-off parameter k affects both the metric and our algorithms. Previously we fixed k = 10; now, for each k = 1, ..., 10, we run the corresponding SF and LSF algorithms and measure their improvement over the baseline algorithm. For each k we denote the StDeg metric by StDeg_k. Since for different k the algorithms SF and LSF aim at optimizing StDeg_k, we report improvements according to these metrics, see Table 3. One can see that regardless of the choice of the cut-off k, all our algorithms still significantly outperform the baseline according to the corresponding metric.

Estimation of θ(q). It was also interesting to know how much the caching algorithms can be improved by improving our estimation of θ(q). To answer this question, we define and evaluate the Oracle-based prediction of staleness, which takes the real staleness at the current moment, d(Sc(q), Sa(q)), and uses it in Equation (4) instead of 1 − e^{−θ(q)(t1−t0)} and likewise in Equation (5). Figure 4 demonstrates StDeg(t) for the AF baseline, for the ordinary SF and LSF methods, and for the SF and LSF algorithms employing the Oracle estimation of staleness (again, N = 1, TTL = 10 h).

[Fig. 4. Oracle prediction of staleness]

As one can see, the knowledge of the real staleness dramatically improves the performance of the algorithms, indicating that staleness prediction is a promising subject of future research.


7 Conclusion and Future Work

In this paper, we focus on algorithms for caching results of the vertical serving fresh content, where the top documents for a query change extremely fast. This motivated us to introduce and measure a new, highly discriminative metric of cache entry quality: staleness degree. The algorithms we suggest are based on the minimization of the new metric, the average staleness degree of results presented to users. The observed properties of this metric allow us to solve the minimization problem greedily and directly. Our experimental results show that, independent of the specific settings of the various common parameters of the algorithms, our methods outperform the baseline. The core part of both of our methods is the query-specific estimation of the degradation rate of a cache entry. In additional experiments, we demonstrate that our approach has the potential to be improved by a more accurate estimation of the degradation rate, which reveals a novel and promising direction in this research area.


We have also noticed that the staleness d(Sc(q), Sa(q)) is not only well approximated by an exponential function (see Equation (3)) whose parameter takes different values for different queries, but that this parameter also changes over time. We observed that the degradation rate is smaller during weekends and at night, since, indeed, new content usually appears and web pages are updated more often during business hours. Our method of degradation rate estimation utilizes rather small windows of historical averages and is hence able to adapt to daily trends dynamically. However, in our future work, we are going to experiment with a time-dependent estimation of θ(q) which takes into account daily and weekly fluctuations.

References

1. Alici, S., Altingovde, I.S., Ozcan, R., Cambazoglu, B.B., Ulusoy, O.: Timestamp-based result cache invalidation for web search engines. In: Proc. SIGIR 2011 (2011)
2. Alici, S., Altingovde, I.S., Ozcan, R., Cambazoglu, B.B., Ulusoy, O.: Adaptive time-to-live strategies for query result caching in web search engines. In: Proc. 34th ECIR Conf., pp. 401–412 (2012)
3. Arguello, J., Diaz, F., Callan, J.: Learning to aggregate vertical results into web search results. In: Proc. 20th ACM CIKM Conf., pp. 201–210 (2011)
4. Blanco, R., Bortnikov, E., Junqueira, F., Lempel, R., Telloli, L., Zaragoza, H.: Caching search engine results over incremental indices. In: Proc. SIGIR 2010 (2010)
5. Bortnikov, E., Lempel, R., Vornovitsky, K.: Caching for realtime search. In: Proc. ECIR 2011. LNCS, vol. 6611, pp. 104–116. Springer, Heidelberg (2011)
6. Cambazoglu, B.B., Junqueira, F., Plachouras, V., Banachowski, S., Cui, B., Lim, S., Bridge, B.: A refreshing perspective of search engine caching. In: Proc. WWW 2010 (2010)
7. Dong, A., Chang, Y., Zheng, Z., Mishne, G., Bai, J., Zhang, R., Buchner, K., Liao, C., Diaz, F.: Towards recency ranking in web search. In: Proc. WSDM 2010 (2010)
8. Järvelin, K., Kekäläinen, J.: Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems (TOIS) 20(4), 422–446 (2002)
9. Jonassen, S., Cambazoglu, B.B., Silvestri, F.: Prefetching query results and its impact on search engines. In: Proc. SIGIR 2012, pp. 631–640 (2012)
10. Kroeger, T.M., Long, D.D.E., Mogul, J.C.: Exploring the bounds of web latency reduction from caching and prefetching. In: Proc. 1st USITS (1997)