New Caching Techniques for Web Search Engines

Mauricio Marin (1,2), Veronica Gil-Costa (1), Carlos Gomez-Pantoja (1)

(1) Yahoo! Research Latin America
(2) University of Santiago of Chile

ABSTRACT
This paper proposes a cache hierarchy that enables Web search engines to efficiently process user queries. The different caches in the hierarchy are used to store pieces of data which are useful to solve frequent queries. Cached items range from specific data such as query answers to generic data such as segments of index retrieved from secondary memory. The paper also presents a comparative study based on discrete-event simulation and bulk-synchronous parallelism. The studied performance metrics include overall query throughput, single-user query latency and power consumption. In all cases, the results show that the proposed cache hierarchy leads to better performance than a baseline approach built on state-of-the-art caching techniques.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—Search process

General Terms
Algorithms, Performance

Keywords
Web Search Engines, Caching Strategies, Discrete-Event Simulation, Models for Parallel Computing, Query Processing

1. INTRODUCTION

Web search engines embody a number of optimization techniques designed to let them efficiently cope with heavy and highly dynamic user query traffic [2, 3]. Among these techniques are application caches, which keep in main memory pre-computed answers for frequent queries [1, 5, 6, 7, 8, 10, 11, 12, 13, 14, 16, 21]. Web search engines are composed of a large set of search nodes and a broker machine in charge of sending user queries to the search nodes for answer calculation and receiving the results. Different caches are placed in both the search nodes and the broker machine.

A first contribution of this paper is the proposal of a new hierarchy of cache memories upon this architecture, together with the admission and eviction strategies required to achieve efficient performance.

The performance of the caching strategies proposed so far has been studied by means of extensive experimentation using queries submitted by actual users of search engines. Mathematical analyses are not effective because of the strong dependency on user behavior. Previous studies have focused on cache hits rather than on query throughput. As pointed out in [10], the purpose of caching in Web search engines is to improve query throughput, and thus cache hits may well not be the relevant metric to optimize. The reason is that a very frequent query could be solved much faster on the search nodes than a less frequent one, which should then prevail in the cache. Counting cache hits over a large set of queries is a fairly simple experiment to perform. In contrast, predicting query throughput is a much more challenging experiment because queries are propagated over a number of search nodes to exploit parallelism. A second contribution of this paper is a technique devised for this kind of experiment. Our performance evaluation technique is generic and combines the trace-based cache-hit experiment with a well-known model of parallel computing and process-oriented discrete-event simulation. Our approach is useful to comparatively evaluate additional complex metrics such as power consumption, which would otherwise be impossible to evaluate in practice on actual large-scale high-performance hardware.

The remainder of this paper is organized as follows. Section 2 discusses related work. Section 3 builds on related work to present a baseline cache hierarchy; major search engines keep the details of their caching strategies confidential, so we need to define a baseline against which to compare our proposal. Section 3 also describes the details of the proposed caching hierarchy. Section 4 describes the method we use to evaluate performance and Section 5 presents the experimental evaluation. Section 6 presents concluding remarks.

2. RELATED WORK

Regarding caching strategies, one of the first ideas studied in the literature was a static cache of results (RCache for short), which stores queries identified as frequent from an analysis of a query log file. Markatos et al. [14] showed that the performance of static caches is generally poor, mainly because the majority of the queries submitted to a search engine are not frequent ones and, therefore, the static cache achieves a low hit rate.

In this sense, dynamic caching techniques based on replacement policies like LRU or LFU achieve better performance. In another line of research, Lempel and Moran [12] calculated a score for queries that estimates how likely it is that a query will be submitted again in the future, a technique called Probability Driven Caching (PDC). Later, Fagni et al. [7] proposed a caching structure that maintains a static collection and a dynamic one, called Static-Dynamic Caching (SDC), achieving good results. In SDC, the static part stores frequent queries and the dynamic part is handled with replacement policies like LRU or LFU. With this, SDC achieved a hit ratio higher than 30% in experiments conducted on an Altavista log. Long and Suel [13] showed that by storing pairs of frequent terms, determined by their co-occurrence in the query logs, it is possible to increase the hit ratio further. To this end, the authors proposed placing the pairs of frequent terms at an intermediate caching level between the broker cache and the end-server caches. Baeza-Yates et al. [2] have shown that caching posting lists is also possible, obtaining higher hit rates than when caching only results and/or terms. Recently, Gan and Suel [10] have studied weighted result caching, in which the cost of processing queries is considered when deciding whether to admit a given query to the RCache.

Work related to our objective is that of Puppin et al. [21]. The authors proposed a method in which a large query log is used to form P clusters of documents and Q clusters of queries by using a co-clustering algorithm. This allows defining a matrix where each entry contains a measure of how pertinent a query cluster is to a document cluster. In addition, for each query cluster a text file containing all the terms found in the queries of the cluster is maintained. Upon reception of a query q, the query receptionist machine computes how pertinent the query q is to each query cluster by using the BM25 similarity measure. These values are combined with the co-clustering matrix entries to compute a ranking of document clusters. The method is used as a collection selection strategy and achieves good precision results. Also related to this objective is our prior work [8, 18], in which a location cache is introduced in the query receptionist machine to allow search node selection based on semantic caching and machine learning methods.

3. CACHING AND QUERY ROUTING

3.1 The baseline approach

Data centers for large Web search engines contain thousands of processors arranged in highly-communicating groups called clusters. Usually each cluster is devoted to a single specialized operation related to the processing of queries arriving at the search engine. The operation of interest in this paper is the determination of the top-K document IDs that best match a given user query. Other operations served by separate clusters, such as constructing the results Web page for the top-K documents and selecting advertising related to the query terms, are out of the scope of this paper. Nevertheless, the top-K determination is known to be the most costly part of processing a query because of the huge sizes of Web text collections. For the top-K setting, the standard cluster architecture is an array of P × D processors or search nodes, where P indicates the level of text collection partitioning and D the level of text collection replication.

The rationale of this 2D array is as follows. The query answer time is proportional to the size of the text collection and, to achieve answer times that are below a certain upper bound, the text collection is divided into P disjoint partitions. The bound is achieved by using parallelism. Each query is sent to all of the P partitions and, in parallel, the local top-K documents in each partition are determined. Then these local top-K results are collected together to determine the global top-K document IDs for the query. A separate machine called the broker is in charge of sending arriving queries to the search nodes and collecting the results. The level of replication indicates that each of the P partitions is replicated D times. The purpose of replication is two-fold. It provides support for fault tolerance and it also enables efficient query throughput, since queries are sent to all partitions and, for each partition, one of the replicas is selected uniformly at random. Therefore, at any given time, different queries can be solved by different replicas.

Search nodes in the 2D array achieve efficient performance by using indexing and caching techniques. Regarding indexing, each search node contains an inverted file, which is a data structure used to efficiently map from query terms to relevant documents [3, 15, 19, 24]. For the set of documents assigned to the search node, a table is constructed. Each table entry is associated with a relevant term found in the text collection and a respective list of items called a posting list. Each posting list item contains a document ID and other data used for document ranking purposes. A particular document is referenced by one of the items in the posting list if the document text contains the respective term. The posting lists associated with the query terms quickly provide the relevant set of document IDs from which the top-K ones are obtained by means of document ranking procedures. The length of the posting lists can be huge, and during long periods of time it usually happens that a large subset of terms in the inverted file is not referenced by queries submitted to the search engine. For this reason, most of the posting lists are kept in secondary memory and a subset of them is kept in main memory in what is called a posting list cache. This cache is administered with the standard LRU strategy.

A second cache is maintained in the broker machine, namely the above mentioned machine in charge of receiving queries from users, sending the queries to the 2D array of processors for their solution, and collecting back the results to respond to users with the answers. The cache kept in the broker is a results cache which stores the answer Web pages for frequent queries. The state-of-the-art strategy for the results cache is the SDC strategy [7]. It keeps a read-only section to store the answers for queries that are frequent over long periods of time. This part is a static cache administered with the LFU policy, populated with queries taken from a large log of queries submitted to the search engine in the recent past. The second part of SDC is a dynamic cache maintained with the LRU policy, whose objective is to capture queries that become very popular during short periods of time. Experiments reported in [7] suggest a proportion of 80%–20% for the static and dynamic parts respectively.
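To make the baseline concrete, the following is a minimal sketch of an SDC-style results cache lookup as described above: a read-only static part loaded from a training log plus an LRU-managed dynamic part. The class and type names are illustrative assumptions, not taken from any real system.

```cpp
#include <cstddef>
#include <list>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>

using ResultsPage = std::string;  // serialized answer Web page

class SdcCache {
  std::unordered_map<std::string, ResultsPage> static_part_;   // read-only at run time
  std::list<std::pair<std::string, ResultsPage>> lru_;         // most recently used at front
  std::unordered_map<std::string,
                     std::list<std::pair<std::string, ResultsPage>>::iterator> index_;
  std::size_t dyn_capacity_;

 public:
  explicit SdcCache(std::size_t dyn_capacity) : dyn_capacity_(dyn_capacity) {}

  // Filled once with the most frequent queries of the training log (LFU selection).
  void LoadStatic(const std::string& query, const ResultsPage& page) {
    static_part_.emplace(query, page);
  }

  std::optional<ResultsPage> Lookup(const std::string& query) {
    auto s = static_part_.find(query);
    if (s != static_part_.end()) return s->second;        // static hit: no locking needed
    auto d = index_.find(query);
    if (d != index_.end()) {
      lru_.splice(lru_.begin(), lru_, d->second);          // dynamic hit: move to front
      return d->second->second;
    }
    return std::nullopt;                                   // miss: query goes to the search nodes
  }

  void InsertDynamic(const std::string& query, const ResultsPage& page) {
    if (static_part_.count(query) || index_.count(query)) return;
    lru_.emplace_front(query, page);
    index_[query] = lru_.begin();
    if (lru_.size() > dyn_capacity_) {                     // evict the least recently used entry
      index_.erase(lru_.back().first);
      lru_.pop_back();
    }
  }
};
```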

Recently, [10] showed that the dynamic part is better administered with the query-throughput counterpart of the hit-based LRU policy, called Landlord. This policy considers the cost of processing a query when deciding whether or not to store it in the cache. Initially, the priority of a cache entry is given by the average cost L of processing the respective query in a search node. Each time an entry is re-used, its priority is increased by L units. When an entry is replaced by a new query, all remaining entries in the cache are decremented by the remaining priority of the entry being replaced. The efficient implementation of this strategy uses a priority queue as follows. Each new entry receives priority s + L, where initially s = 0. The next entry to be replaced is the one with the least numerical value v in the priority queue, and replacing it sets s = v, so the new item is stored with priority s + L. For the static part of the cache we use f · L as the priority value, where f is the frequency of occurrence of the query in the training query log.

Additionally, a third cache is maintained in the scheme. This is an intersection cache [13] which keeps, in each search node, the intersection of the posting lists associated with pairs of terms frequently occurring in queries. The intersection operation is useful to detect the documents that contain all of the query terms, which is a typical requirement for the top-K results in major search engines. In this way, the three caches define a hierarchy that goes from the most specific pre-computed data for queries (results cache) to the most generic data required to solve queries (posting list cache). Figure 1.a illustrates a state-of-the-art cache hierarchy where the intersection and posting list caches are maintained in each of the P × D search nodes. Figure 1.b illustrates what happens when a query is not found in the results cache of the broker machine. In this case, the query is sent to one search node of each column. In each column, the respective search node is selected uniformly at random.
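The Landlord bookkeeping described above can be sketched as follows, using the running offset s instead of decrementing every entry on eviction. This is only an illustrative rendering of the policy as summarized in the text; names and container choices are ours.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <unordered_map>

class LandlordCache {
  std::multimap<double, std::string> by_priority_;   // least priority value first
  std::unordered_map<std::string,
                     std::multimap<double, std::string>::iterator> pos_;
  double s_ = 0.0;        // running offset: priority of the last evicted entry
  std::size_t capacity_;

 public:
  explicit LandlordCache(std::size_t capacity) : capacity_(capacity) {}

  bool Contains(const std::string& query) const { return pos_.count(query) > 0; }

  // cost = L, or the L*m / L*P/m variants used later in the paper.
  void Touch(const std::string& query, double cost) {
    auto it = pos_.find(query);
    if (it != pos_.end()) {
      double new_priority = it->second->first + cost;  // re-use: add cost units
      by_priority_.erase(it->second);
      pos_[query] = by_priority_.emplace(new_priority, query);
      return;
    }
    if (!by_priority_.empty() && pos_.size() >= capacity_) {
      auto victim = by_priority_.begin();              // entry with the least value v
      s_ = victim->first;                              // set s = v
      pos_.erase(victim->second);
      by_priority_.erase(victim);
    }
    pos_[query] = by_priority_.emplace(s_ + cost, query);  // new entry gets s + cost
  }
};
```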

3.2 Proposed cache hierarchy

The proposal of this paper is based on the use of two additional caches, as shown in Figure 2.a. We have presented the location cache idea in complementary papers [8, 16, 18]. In those papers, the location cache is discussed in the context of P × 1 search nodes, that is, replication is not considered. The novelty of the current paper is the extension of the location cache idea to the P × D setting, which also leads to the introduction of the top-K cache and the query routing scheme illustrated in Figure 2.b, all of which are described in detail in this section.

Results and location caches. Each entry in the results cache holds several KBs with the answer Web page for the respective query, whereas the location cache keeps only the list of column IDs from which the answer to a query comes. That is, the space occupied by an entry in the location cache is much smaller than the space of an entry in the results cache. A given query can be stored either in the results cache or in the location cache, but not in both caches at the same time. A good candidate to be stored in the results cache is a frequent query that requires a large amount of cluster resources to be entirely processed. The trade-off is represented by the value of f · L · m, where f is the query occurrence frequency, L is the average cost of processing the query in a search node, and m is the number of search nodes producing documents within the global top-K results.

Notice that K is of the order of a few tens (usually the first results Web page for the query) whereas P is of the order of hundreds, and thereby m ≤ K where K < P. This means that, on average, far fewer than P columns are able to produce documents within the global top-K results, and m can be made much smaller than K by applying document clustering techniques to distribute documents onto the search nodes. The value of f · L · m is used to decide which queries are stored in the static part of the results cache, whereas the dynamic part is administered with Landlord by using L · m as the cost of queries. Frequent queries not stored in the results cache and whose solution requires the use of very few cluster resources (columns) are good candidates to be placed in the location cache. The idea is to avoid the unnecessary cost of sending these frequent queries to all P search nodes. In this case, the trade-off is represented by the value of f · L · P/m, and we admit a query into the location cache provided that P/m ≥ 2. Like the results cache, the location cache is also divided into a static and a dynamic part with a similar 80%–20% proportion and costs f · L · P/m and L · P/m respectively (experimentally we have found these values to be suitable for efficient query throughput). The static part is initialized by running a training query log. Every time the broker receives the results for a new query that it previously sent to the search nodes for solution, the broker must decide whether to store the query in the dynamic part of either cache. We use the following rule for this decision. If the value of L · m is larger than the least priority value in the results cache priority queue, then the item is stored in this cache. If not, the broker tries to store it in the location cache by verifying that P/m ≥ 2 and that the value L · P/m is larger than the least priority value stored in the location cache priority queue.

The top-K cache. This type of cache is present in all of the P × D search nodes. Each entry in a top-K cache maintains the IDs of the documents found to be the global top-K ones for a frequent query. We emphasize that this cache holds global top-K results instead of results which are local to one or a subset of the search nodes. When the broker does not find a newly arriving user query in the results cache, it sends the query to one search node selected uniformly at random by applying a hashing function to the query terms. This search node becomes the one that caches the global top-K results for the query. The search node receives the query, sends it to the search nodes in charge of solving the query, integrates the results produced by the participating search nodes, stores the results in the top-K cache using the LRU policy, and responds to the broker with the global top-K results for the query. The search node can prevent the query from being stored in both the results and top-K caches because, like the broker, it knows the values of L and m for each query. When the broker finds the newly arriving query in the location cache, it attaches to the query the list of columns that are able to produce documents within the global top-K results. Otherwise, it is implicit that the query must be sent to all of the columns. This decision is taken when the query is received by the search node selected to cache the top-K results for the query, namely the search node whose ID matches the value returned by the hashing function executed on the query terms.
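Assuming the dynamic parts of both broker caches expose their current least priority value, the admission rule described above could look roughly as follows; the interface and all names are hypothetical.

```cpp
#include <string>

struct QueryStats {
  double L;   // average processing cost of the query in one search node
  int m;      // number of columns producing documents in the global top-K
};

// Hypothetical minimal interface of the dynamic (Landlord-managed) cache parts.
struct DynamicCache {
  virtual double MinPriority() const = 0;                       // least value in the priority queue
  virtual void Admit(const std::string& query, double cost) = 0;
  virtual ~DynamicCache() = default;
};

void AdmitQuery(const std::string& query, const QueryStats& q, int P,
                DynamicCache& rcache_dyn, DynamicCache& lcache_dyn) {
  const double rcache_cost = q.L * q.m;                 // value of caching the answer page
  const double lcache_cost = q.L * P / q.m;             // value of knowing the m columns
  if (rcache_cost > rcache_dyn.MinPriority()) {
    rcache_dyn.Admit(query, rcache_cost);               // expensive query: keep its answer
  } else if (static_cast<double>(P) / q.m >= 2.0 &&
             lcache_cost > lcache_dyn.MinPriority()) {
    lcache_dyn.Admit(query, lcache_cost);               // hits few columns: keep its locations
  }                                                     // otherwise the query is not cached
}
```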

Figure 1: Baseline approach to query processing. (a) Standard cache hierarchy. (b) Standard query routing on search nodes.

Figure 2: This paper's proposal. (a) Proposed cache hierarchy. (b) Proposed query routing on search nodes.

Certainly, this search node ID is not necessarily related to the list of columns found in the location cache for the query.

Query routing. As illustrated in Figure 2.b, we apply document clustering at the column level and term clustering at the row level. Document clustering means that the documents forming the text collection are initially mapped to the P columns in a non-random manner. The clustering is made to reduce the average number of columns contributing documents to the top-K results of queries. Our clustering algorithm [16] groups together documents in accordance with relationships determined among them. We determine these relationships using the results from previous queries submitted to the search engine. For each of these queries, we calculate the top n·K results with n as large as possible. In this way, a large set of documents is related to each other through query IDs, and these relationships are used as input to the clustering algorithm. The remaining documents are mapped at random onto the P columns. Further details can be found in [16], along with an evaluation against alternative document mapping methods including random allocation. Notice that even in the case of random allocation, we should expect m ≤ K < P for a large scale system.
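As an illustration of the routing step just described, the sketch below picks the merger node by hashing the query terms and either uses the column list found in the location cache or falls back to all P columns; the replica (row) choice anticipates the term-to-row cache described below. All names and the concrete hash are assumptions.

```cpp
#include <functional>
#include <random>
#include <string>
#include <vector>

struct Route {
  int merger_node;              // node that caches the global top-K for this query
  std::vector<int> columns;     // columns that must process the query
};

Route RouteQuery(const std::string& normalized_query_terms,
                 const std::vector<int>* lcache_columns,  // nullptr on LCache miss
                 int P, int D) {
  Route r;
  // Merger node: deterministic hash on the query terms over the P*D nodes.
  std::size_t h = std::hash<std::string>{}(normalized_query_terms);
  r.merger_node = static_cast<int>(h % static_cast<std::size_t>(P * D));

  if (lcache_columns != nullptr) {
    r.columns = *lcache_columns;  // LCache hit: only the columns known to contribute
  } else {
    r.columns.resize(P);          // LCache miss: the query visits every column
    for (int c = 0; c < P; ++c) r.columns[c] = c;
  }
  return r;
}

// Within each selected column, one of the D replicas is picked; the row is taken
// from the term-to-row cache when available, otherwise uniformly at random.
int PickReplica(int preferred_row, bool row_known, int D, std::mt19937& rng) {
  if (row_known) return preferred_row % D;
  std::uniform_int_distribution<int> pick(0, D - 1);
  return pick(rng);
}
```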

A contribution of this paper is related to the way in which the D replicas are treated along each column. We propose a method to efficiently exploit the cumulative main memory space available from the P × D search nodes. The idea is that if we avoid, as much as possible, both (a) caching items that do not contribute to the global top-K results of queries and (b) caching exactly the same item in two or more search nodes along the same column, then the overall number of accesses to secondary memory can be significantly reduced. Intuitively, this benefits performance since accesses to secondary memory are expensive in running time, and reducing them can improve query throughput and reduce the rate of disk failure in the search engine. The problem is to achieve this objective without losing much of the gain to load imbalance. Notice that the data stored in the three caches maintained in main memory may differ across search node caches associated with the same column. Correct operation is not compromised since, as soon as a given replica needs data that are not present in one of its caches, the search node can read them from its own secondary memory. This is also valid for search node failures, as other nodes in the same column can use their secondary memory contents to re-execute the queries that were being processed by the failing node (recall that each search node along a column keeps the same data on its local disk).

In the baseline strategy (Figure 1.b), the search nodes in each column are selected uniformly at random. Even though this is beneficial for overall load balance, in the random approach the effective main memory space provided by the cumulative sum of cache space associated with any column tends to be quickly reduced because of identical items cached in different search nodes of the column. That is, popular query terms distributed at random over the search nodes of a column tend to compete for, and win, cache entries in several search nodes. This may produce a waste of resources causing degradation of query throughput due to excessive use of secondary memory. At the column side, the location cache alleviates the problem of caching items that do not contribute to the top-K results since, on average, its effect is that for a certain percentage of processed queries only the search nodes capable of placing document IDs into the top-K results are contacted by the broker. In addition, we significantly reduce the probability of caching the same item in two search nodes of the same column by grouping terms into rows. We refer here to the terms of the text collection, and terms are grouped into rows virtually, in the sense that the main memory of each search node along a column is assigned to a subset of the terms that it holds in its secondary memory. The query log used to determine the document partitioning is also used to determine the correlations among terms. These correlations can be used to map a given pair of terms to the same row if they frequently appear together in queries. We use the following heuristic for term clustering (a code sketch is given below). First, a cost function is computed that considers the relative frequency of occurrence of each pair of terms in the query log and the lengths of the respective posting lists. Pairs of high cost should be assigned to the same row, so the algorithm considers pairs in decreasing cost order, assigning a pair of unassigned terms to the least loaded row. If one of the terms of the pair has already been assigned, the other term is assigned to the same row. The remaining terms are assumed to be assigned uniformly at random (this is effected by means of a dynamic cache, as explained below).

To support the P × D case with D > 1, the location cache is extended with a cache that stores the row IDs to which terms are mapped. This cache is also composed of a static and a dynamic part. In the static part, a fraction of the most frequent terms is kept in a read-only table. These frequent terms are determined from the query log and are mapped to rows by using the above described heuristic. The dynamic part uses the LRU policy to keep track of the row to which a term has been mapped. The first time a term that is not in the static part of the cache is placed in the dynamic part, a row is selected uniformly at random. If the term becomes very popular, the following queries containing it will preferentially be routed to the same row.

Figure 3 shows the many paths that a query can follow in the proposed caching hierarchy. RCache and LCache stand for results and location cache respectively. In the figure, queries go from the broker to one search node pi selected in accordance with the result of the hash on the query terms. This is indicated in the figure as SEARCH NODE Pi.
From this search node the query can be sent to one or more search nodes, which are indicated in the figure as SEARCH NODES.
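The term-to-row clustering heuristic referenced above can be sketched as follows. The exact cost function used in the paper is not spelled out beyond combining pair frequency and posting list lengths, so the form shown here is an assumption.

```cpp
#include <algorithm>
#include <string>
#include <unordered_map>
#include <vector>

struct TermPair {
  std::string a, b;
  double freq;                    // relative co-occurrence frequency in the query log
  double list_len_a, list_len_b;  // posting list lengths of the two terms
  double Cost() const { return freq * (list_len_a + list_len_b); }  // assumed form
};

// Assigns terms to rows 0..D-1: pairs in decreasing cost order, unassigned pairs
// to the least loaded row, half-assigned pairs follow their already placed term.
std::unordered_map<std::string, int> AssignTermsToRows(std::vector<TermPair> pairs,
                                                       int D /* number of rows */) {
  std::unordered_map<std::string, int> row_of;
  std::vector<double> load(D, 0.0);

  std::sort(pairs.begin(), pairs.end(),
            [](const TermPair& x, const TermPair& y) { return x.Cost() > y.Cost(); });

  for (const TermPair& p : pairs) {
    auto ia = row_of.find(p.a), ib = row_of.find(p.b);
    if (ia != row_of.end() && ib != row_of.end()) continue;   // both already placed
    int row;
    if (ia != row_of.end()) row = ia->second;                 // follow the placed term
    else if (ib != row_of.end()) row = ib->second;
    else row = static_cast<int>(
             std::min_element(load.begin(), load.end()) - load.begin());
    if (ia == row_of.end()) { row_of[p.a] = row; load[row] += p.list_len_a; }
    if (ib == row_of.end()) { row_of[p.b] = row; load[row] += p.list_len_b; }
  }
  return row_of;   // remaining terms are handled dynamically (LRU cache + random row)
}
```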

Discussion. As mentioned above, the location cache is meant to contain minimal data about the answer of each cached query, namely the list of search node IDs from which each document in the global top-K results comes, and in fact it can be administered with any existing caching policy. Notice that if we distribute related document clusters onto the P search nodes and the remaining ones (sorted by similarity distance between cluster centers) onto contiguous-ID search nodes, then the lists of search nodes stored in the location cache are highly compressible by employing delta encoding. This is useful for the static part of the location cache. For the dynamic part, one can impose an upper limit on the length of the list of search node IDs allowed to reside in the cache. Moreover, because of clustering, the length of the list of search node IDs is expected to be smaller than K. Notice that the location cache can also work in a setting where documents are evenly distributed at random onto the processors. Here a small memory footprint is still feasible since, for large scale systems, the number of search nodes is expected to be larger than the number of global top-K results presented to the users, at least for the first page of results. In other words, it is perfectly feasible to configure a system so that an entry in the location cache requires much less space than the corresponding entry in the results cache. Independently of the kind of queries stored in the location cache, their resolution time can be further reduced, at the cost of more memory per entry, by storing the K pairs (procID, docID) corresponding to the global top results of the respective query, namely a mix between the location and the top-K caches proposed in this paper. The idea is to use each location cache entry to go directly to the respective processor and retrieve the specific documents in order to build the answer Web page. This only requires secondary memory operations and no document ranking is necessary. The pairs (procID, docID) are also highly compressible since we can re-number the document IDs at each processor following the order given by the document clusters stored in them. In fact, the pair (procID, docID) can be stored in the bits required to store a global document ID by using the most significant bits to store the processor ID and the least significant bits to store the document ID local to the processor. We achieve the same effect in two steps because we handle the location and the top-K caches separately. The reason for this separation is that we target a search engine design capable of keeping all search nodes operating at a higher utilization than current ones do. This implies that the broker has to be prepared to react to sudden peaks in query traffic containing topic-shifting queries, as in cases in which users suddenly become concerned about a world-wide event. In [22] it is proposed to achieve this goal by responding with approximate answers to queries during these short periods of extremely high traffic and evolving to an exact answer when the peak vanishes. Peaks are to be quickly absorbed by the results cache, but in the meantime the approximate answers for the new popular queries to be stored in the results cache need to be calculated by the search nodes. In [8, 18] we propose using semantic and machine learning methods on the location cache contents to quickly select the search nodes most likely to provide a good approximate answer to the topic-shifting queries.
In this case, it is essential to store search node IDs in the location cache entries, and to use the space as efficiently as possible by storing just search node IDs, in order to enlarge the number of entries in the location cache.
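The (procID, docID) packing mentioned in the discussion can be illustrated with a few lines of bit manipulation; the 10/22-bit split below is an arbitrary example, not a value from the paper.

```cpp
#include <cstdint>

constexpr unsigned kProcBits = 10;                   // up to 1024 search nodes (assumed)
constexpr unsigned kDocBits  = 32 - kProcBits;       // local document IDs within a node
constexpr std::uint32_t kDocMask = (1u << kDocBits) - 1u;

// Most significant bits hold the processor ID, least significant bits the local doc ID.
constexpr std::uint32_t Pack(std::uint32_t proc_id, std::uint32_t local_doc_id) {
  return (proc_id << kDocBits) | (local_doc_id & kDocMask);
}
constexpr std::uint32_t ProcOf(std::uint32_t global_id) { return global_id >> kDocBits; }
constexpr std::uint32_t DocOf(std::uint32_t global_id)  { return global_id & kDocMask; }

// Example: processor 37, local document 123456, round-trips through a global ID.
static_assert(ProcOf(Pack(37, 123456)) == 37, "processor ID recovered");
static_assert(DocOf(Pack(37, 123456)) == 123456, "local document ID recovered");
```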

Figure 3: Flow of queries across the proposed cache hierarchy.

Our methods outperform those proposed in [22] for search node selection, with the advantage that they are on-line. On the other hand, the reason for maintaining a static and a dynamic part in each cache is that search nodes are multi-threaded systems in which it is usual to process each incoming query with a different thread. The read-only static part of the cache causes no read/write conflicts among the concurrent threads, whereas the 80%–20% rule ensures that only a small number of threads set locks on the cache entries of the dynamic part. Finally, notice that even though we assume the use of the strategy presented in [13] for handling intersections between pairs of terms, any given processor can alternatively make direct remote memory accesses to other processors' main memories to look for already computed intersections, as proposed in [9]. This prevents the processor from using secondary memory to store pre-computed intersections, which is potentially more efficient than the scheme proposed in [13].

4. PERFORMANCE EVALUATION

Evaluating the performance of parallel information retrieval systems is a complex problem in its own right. It shares with other applications the challenge of accurately predicting the cost of parallel operations on actual hardware. Models for general-purpose parallel computation can help us, provided that one is able to properly represent the application work-load requirements within the model's worldview. A key difference with most applications is the strong dependency of the results on user behavior. This makes it pertinent to resort to experimentation driven by large sets of queries produced by actual users. To the best of our knowledge, all experiments published so far have been made using either simple simulators (e.g., cache hit counters) or actual realizations of the strategies run on fairly small clusters of processors. Neither of these is suitable for evaluating complex performance metrics such as power consumption or scalability.

Most implementations reported so far are based on the message passing approach to parallel computing, in which we can even see combinations of multi-threaded and computation/communication overlapped systems. We believe that under such heterogeneous forms of parallel computing it is quite risky to make claims about comparative performance, as it is difficult to reproduce the same experimental conditions. The execution of the program depends on the particular state of the machine and its fluctuations. On the other hand, artifacts such as threads are potential sources of overheads that can produce unpredictable outcomes in terms of running time. Yet another source of unpredictable behavior can be the accesses to disk used to retrieve the posting lists. In fact, the difficulty of comparing these kinds of systems has been discussed in [20]. In terms of comparative evaluation, one of the benefits of using a practical model of parallel computation is that one can concentrate directly on the essentials of performance and compare strategies under exactly the same conditions. In our case, software overheads such as message and cache administration are basically the same in both the baseline and the proposed strategy, and they are known to be negligible since ours is a large-grain application where the cost-dominant operations can be clearly identified. For both strategies, what is relevant is to count and charge for these dominant operations, since the strategies perform them in different order and quantity across processors; this is what marks their difference in performance. Among the practical models intended to be simple, accurate and general-purpose, the bulk-synchronous parallel (BSP) model is suitable for predicting the performance of algorithms running on distributed memory parallel machines. In BSP, parallel computation is organized as a sequence of supersteps, in each of which the processors may perform computations on local data and/or send messages to other processors. The messages are available for processing at their destinations by the next superstep, and each superstep is ended with the barrier synchronization of the processors. The cost of supersteps is basically determined by the observed maxima of computation and communication across processors.

The BSP cost model is defined as follows. The total running time of a BSP program is the cumulative sum of the costs of its supersteps, and the cost of each superstep is the sum of three quantities: w, h·G and L, where w is the maximum amount of computation performed by any processor, h is the maximum number of message words sent/received by any processor with each word costing G units of running time, and L is the cost of the barrier synchronization of the processors. The effect of the computer architecture is included through the parameters G and L, which are increasing functions of P; we assume they are O(log P). The average cost of each access to disk is represented by S. The average values of G, L and S can be obtained by running benchmark programs on the target machine. In our simulators we distribute the cost-relevant operations among the supersteps and processors, where the cost of each operation is determined by a set of process-oriented discrete-event simulation objects which model the resource contention of operations. In each superstep, the use of these resources causes cumulative costs in the respective processors. The degree of contention is determined by the outcome of the cache admission and eviction policies, which in the simulators we implement exactly as they are to work in a production system. Communication among processors is performed in accordance with BSP, as its cost model takes the network latency into consideration. We further describe the basics of our simulators with an example. In the baseline strategy depicted in Figure 1.a, the work-load generated by processing a one-term query t, whose posting list has global length r · k · P, can be decomposed into r iterations where each iteration involves the processing of a k-sized piece of posting list in each search node. Assuming that the query first arrives at search node pi from the broker, this work-load can be represented by the following sequence of primitive operations:

B(t)^{pi → P} → [ F(k) ∥_P → R(k) ∥_P ]^r → S(K) ∥_P → E(P·K)^{pi}

In this sequence, the broadcast operation (B) represents the part in which the search node pi sends the term t to the P search nodes. Then, in parallel (∥_P), all P search nodes perform the fetching (F) from secondary memory of the k-sized piece of posting list. They rank (R) the documents present in their pieces of posting lists, and the sequence F-R is repeated r times. (For more than one term, the rank operation takes into consideration the costs of intersecting the posting lists and scoring the resulting documents.) Then, in parallel, they send (S) their local top-K results to search node pi. Finally, search node pi merges (E) them to produce the global top-K results for the query. In our simulators, these operations are properly combined with operations for the cache memories. For instance, if a given section of a posting list is already stored in a cache, we do not charge the cost of fetching (F) in that particular search node. The above sequence of primitive operations exemplifies the execution DAG associated with each query. These DAGs are independent of each other, but they implicitly synchronize when they compete for hardware resources (i.e., processors, disks and network). Our simulators emulate these interactions with resources by modeling them as queuing devices, and the DAGs dynamically form queuing networks among them. To this end, we use process-oriented simulation where each active query is a simulation process that must follow a DAG-driven circuit to use resources across BSP processors and supersteps.
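As a rough illustration of how such a sequence of primitive operations is charged under the BSP cost model (each superstep costing w + h·G + L), consider the following sketch. The per-item cost constants, the uniform-cost assumption across columns, and the folding of disk fetches into w are ours, not measured values; in the full simulator disks are separate queuing devices.

```cpp
#include <vector>

struct SuperstepCost {
  double w = 0.0;   // max local computation over the processors
  double h = 0.0;   // max words sent/received by any processor
};

double TotalCost(const std::vector<SuperstepCost>& steps, double G, double L) {
  double t = 0.0;
  for (const SuperstepCost& s : steps) t += s.w + s.h * G + L;  // w + h*G + L per superstep
  return t;
}

// One-term query with a posting list of global length r*k*P:
// broadcast B, r iterations of fetch F(k) and rank R(k) on the P columns in parallel,
// then send S(K) of the local top-K lists, then merge E(P*K) at the merger node.
double OneTermQueryCost(int P, int r, int k, int K, double fetch_per_item,
                        double rank_per_item, double merge_per_item,
                        double G, double L) {
  std::vector<SuperstepCost> steps;
  steps.push_back({0.0, static_cast<double>(P)});                 // B: pi sends the term to P columns
  for (int i = 0; i < r; ++i)
    steps.push_back({k * (fetch_per_item + rank_per_item), 0.0}); // F(k) and R(k), same max on all columns
  steps.push_back({0.0, static_cast<double>(P) * K});             // S(K): merger receives P local top-K lists
  steps.push_back({merge_per_item * P * K, 0.0});                 // E(P*K): merger builds the global top-K
  return TotalCost(steps, G, L);
}
```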

During the simulation, within a superstep and a processor (search node), a given query process that must use a resource to account for the cost of a primitive operation executes a SIMULA-like hold(cost) operation before proceeding with the next primitive indicated in its DAG. This is useful to model resource utilization, which is important to predict the power consumption of hardware devices. To properly model the fact that at any instant a high-traffic search engine is processing a large set of queries in parallel, each with the same chance of using the hardware resources, the primitive operations are executed in the BSP computer by allowing each active query to execute at most one hold(cost) operation per superstep. In other words, we perform a round-robin of primitive operations across processors and supersteps, which is the key fact that allows us to use the BSP cost model to evaluate our performance metrics. The round-robin of small quanta of work per query, per superstep and per processor allows us to properly model the asynchronous continuous system, since we let BSP act as a sort of discretization of the real system. The fact that BSP accurately predicts the execution cost of algorithms on actual hardware has been shown by many other authors (we verify this claim in the next section). Also, at any instant, we keep Q = q·P·D active queries and, as soon as a query finishes processing, a new one is injected into the simulation. Queries remain under processing for differing numbers of supersteps as they present different load requirements to the system. To reflect the fact that secondary memory accesses take place in parallel with main-memory computations, we process disk requests by using asynchronous simulation processes representing an asynchronous thread in our simulator (one per search node), and we make sure that as soon as a disk operation is entirely serviced its results are taken into consideration at the start of the next superstep. In each processor we use a synchronous simulation object representing a bulk-synchronous thread upon which we charge the simulated cost of each primitive operation. The broker is a separate simulation object that reads queries and asynchronously sends them to the search nodes depending on the contents of its results and location caches.

We use an object-oriented approach to build our simulators. The asynchronous and BSP threads are simulated by concurrent objects of class Thread attached to each object of class Processor. Competition for resources among the simulated threads is modeled with P concurrent queuing objects of classes CPU and Disk, which are attached to each Processor object respectively, and one concurrent object of class Network that simulates communication among processors. Each time a thread must execute a Rank(x) operation, it sends a request for a quantum of size x to the CPU object. This object serves requests in a round-robin manner. In the same way, a Fetch(x) request is served by the respective Disk object in a FCFS fashion. These objects, CPU and Disk, execute the hold(time_interval) operation to simulate the period of time in which those resources are servicing the requests, and the requesting Thread objects "sleep" until their requests are served. The concurrent object Network simulates an all-to-all communication topology among processors. For our simulation study the particular topology is not relevant as long as all strategies are compared under the same set of resources CPU, Disk and Network, and their behavior is not influenced by the caching strategy. The Network object contains a queuing communication channel between all pairs of processors. The average rate of channel data transfer between any two processors is determined empirically by using benchmarks on the hardware, as described in the next section. Also the average service rate per unit quantum in ranking and fetching is determined empirically using the same hardware, which provides a precise estimation of the relative speed among the different resources. We further refined this by calibrating the simulator cost parameters to achieve a query throughput similar to that of a third-party system running on actual hardware. Certainly, there can be variations to our basic simulation model, but we have found the above described model to predict reality quite well. For instance, bulk-synchronous communication could also overlap computation as in the case of secondary memory accesses. As mentioned above, the key fact is that the baseline and proposed strategies are both run over the same simulated BSP computer and input queries, wherein the effect of caching is included in the simulations by means of actual implementations of the respective algorithms.

Figure 4: Log query frequency.

5. EXPERIMENTS

The experiments were performed using a log of 36,389,567 queries submitted to the AOL search service between March 1 and May 31, 2006. We preprocessed the query log following the rules applied in [10]. The resulting query set has 16,900,873 queries, of which 6,614,176 are unique, and the vocabulary consists of 1,069,700 distinct query terms. In the experiments, these queries were executed in chronological order, as they appear in the log, using the first 60% (10,140,523) to warm up the caching system and the remaining 40% (6,760,350) to measure the performance metrics. We simulated demanding query traffic situations by using this last set of queries (40%). Figure 4 shows that the query frequencies follow a Zipf law. Over the terms found in the fraction of the query log used as the training set (60%), we executed our heuristic to cluster the terms into the rows of the P × D array of search nodes. These queries were also applied to a sample (1.5TB) of the UK Web obtained in 2005 by Yahoo!, over which a 26,000,000-term and 56,153,990-document inverted index was constructed. We executed the queries against this index in order to obtain the top 50 documents for each query. We then used the relationships between queries and these documents to execute our document clustering algorithm [16]. This defined the distribution of documents over the P columns of the array of P × D search nodes.

Figure 5: Validation against third party results. (a) Cache hit ratio vs. cache size for LRU, LFU, SDC and Landlord. (b) Real vs. simulated normalized running time (Top-128 and Top-1024) for increasing numbers of processors.

The remaining queries (40%) were executed against the same index in order to generate work-load traces indicating the use of the hardware devices that each query requires to be entirely solved. We injected these traces into discrete-event simulators to evaluate our performance metrics. We also used the UK sample to obtain the intersection sets for the posting lists associated with the query terms.

Validation. As a validation experiment, and considering that the proposal in this paper is critically dependent on proper caching, we executed the different caching strategies discussed in [10] over our log. Figure 5.a shows that our results precisely mimic the performance figures obtained in [10]. This experiment allowed us to verify that our query log was properly pre-processed and that our caching policies are correctly implemented and trained with the initial 60% set of queries. Proper tuning of the simulator cost parameters is also critical to the comparative evaluation of the proposed cache hierarchy. Figure 5.b shows normalized results both for real execution time and simulation time for the baseline query processing strategy with D = 1. These results are for an implementation independent of ours. They were obtained with the Zettair Search Engine (www.seg.rmit.edu.au/zettair) and we adjusted our simulator parameters to simulate the performance of Zettair. We installed Zettair on 64 processors of an RLX cluster and broadcast queries to all processors using TCP/IP sockets to measure query throughput. The cost of processor computation, disk accesses, communication and synchronization in our simulator was determined by performing benchmarks on the RLX cluster using the BSP on MPI library (http://bsponmpi.sourceforge.net). The simulator was implemented using C++ and the LibCppSim library [17].

The results show that the simulated and measured curves almost overlap at each point in Figure 5.b.

Performance metrics. Our comparative study considers four metrics relevant to our setting. The first one is overall query throughput, defined as the total number of processed queries divided by the total simulation time. If the proposed cache hierarchy were prone to serious imbalance, this would be reflected in poor throughput. The second metric is user-experienced latency, namely the maximum query response time observed in the stream of queries executed by the search engine. If some hardware devices occasionally become saturated (large queues) because of skewed query routing in the proposed hierarchy, this is reflected in the latency metric. Our third metric is power consumption. Most optimizations introduced in the query processing regime of search engines have an impact on the economical operation of data centers. The relative importance of this impact is growing as more optimizations are introduced in other power demanding components such as cooling [4]. We determine overall power consumption by means of the observed utilization of the simulated queuing devices. For each device, curves that map from utilization to power consumption can be found in the literature; in particular we use those presented in [4]. Notice that the cost of searching and administering the different caches is not significant in the overall running time [10]. In our simulators we do not consider this cost, but we do assign utilization values to the main memories in order to charge them for power consumption. We consider that the time of using a main memory is equal to the total parallel time in which the search node performed ranking operations, disk accesses and communication (in practice these tasks use main memory). In order to properly tune the simulator to reflect power consumption values which are close to reality, we set the simulator power consumption curves in accordance with the curves presented in [4]. The results can be seen in Figure 6.a (left part), which shows the average energy consumed by the different hardware devices for the baseline strategy; the values are identical to those presented in [4]. Finally, our fourth metric is related to the impact of faulty nodes on query processing efficiency. To this end we use a simple fault injection model. We periodically select uniformly at random one of the search nodes and take it out of the simulation. The network simulation object simulating the message passing among processors detects that the search node is not responding and signals the search nodes waiting for results. Our simulators then proceed as follows. Just before a failure, search nodes are performing merging of local top-K results to produce the global top-K ones for a subset of the active queries. They are also performing local ranking to provide their local top-K results to the mergers of another subset of queries. Upon failure of search node pi, a new replica is selected uniformly at random for each query for which pi is a merger. The other search nodes performing local ranking for those queries send their local top-K results to those replicas of pi, and the replicas selected as mergers must re-execute the local ranking for those queries to finally obtain the global top-K results for the affected queries. The search node pi was also acting as a local ranker for mergers located in other search nodes. In that case, the randomly selected replicas of pi recalculate the local top-K for the affected queries and send them to their mergers.
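A minimal sketch of the fault injection step described above, assuming search nodes are numbered row-major over the P × D array, is given below; the structure and names are illustrative.

```cpp
#include <random>
#include <unordered_set>
#include <vector>

struct Cluster {
  int P, D;                              // columns x replicas
  std::unordered_set<int> down;          // node IDs currently out of service
  int NodeId(int column, int row) const { return row * P + column; }
};

// Returns a randomly chosen live replica (same column) that takes over merging
// for node `failed`, or -1 if every replica of that column is down.
int ReassignMerger(const Cluster& c, int failed, std::mt19937& rng) {
  int column = failed % c.P;
  std::vector<int> alive;
  for (int row = 0; row < c.D; ++row) {
    int id = c.NodeId(column, row);
    if (id != failed && c.down.count(id) == 0) alive.push_back(id);
  }
  if (alive.empty()) return -1;
  std::uniform_int_distribution<std::size_t> pick(0, alive.size() - 1);
  return alive[pick(rng)];
}

void InjectFault(Cluster& c, double max_failed_fraction, std::mt19937& rng) {
  if (c.down.size() + 1 > max_failed_fraction * c.P * c.D) return;  // respect the 5%/15% bound
  std::uniform_int_distribution<int> pick(0, c.P * c.D - 1);
  c.down.insert(pick(rng));
  // The affected queries' local top-K results are then re-sent to the replica
  // chosen by ReassignMerger, which re-executes the work of the failed node.
}
```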

Comparative Evaluation. In the following we show the results obtained with the simulations of the approaches described in Figures 1 and 2, namely the baseline approach and the proposal of this paper respectively. In all cases we show results normalized to 1 in order to better illustrate the comparative performance; to this end, we divide all quantities by the observed maximum in each case. Also, in all experiments, measures are taken after processing the first 60% of the input queries.

Power consumption. Figure 6.b shows the impact of the proposed caching strategy on the reduction of power consumption. In this figure we show two cases. The left-side bars present the results obtained with the power curves of actual devices, as in the context of Figure 6.a (left part). Current devices are far from optimal in terms of energy efficiency: typically at 0% utilization they still demand about 50% of peak power. In a second experiment, we adjusted the power consumption curves so that at 0% utilization devices consume 15% of peak power, which represents an ideal case. These results are presented in the right part of Figure 6.b, which corresponds to the cases illustrated in Figure 6.a (right part). In both cases, real and ideal, the results show that the proposed cache hierarchy can lead to a significant reduction of total power consumption. Power consumption is directly related to device utilization. Figure 7 shows the average CPU utilization for the baseline and proposed strategies along different intervals of simulation time (one curve per search node) and under high query traffic. On average, the proposed strategy reduces device utilization significantly. We observed similar behavior in the other simulated devices. For the proposed strategy, the figure shows imbalance in a few search nodes which are under-utilized because of queries skewed towards popular terms and the effect of the location cache. We observed that, on average, each query found in the location cache is sent to 11.3 search nodes out of 128. This explains the smaller utilization values of the proposed strategy with respect to the baseline strategy. Nevertheless, the price of imbalance is worth paying here since the proposed strategy improves the performance metrics by more than 30% in all cases.

Search nodes hit by queries and space savings. In the following experiments we fix D = 4 and range P from 64 to 512. In all cases the total number of active queries Q = q·P·D increases proportionally with P, where q = 8. Figure 8 (left) shows the number of processors visited by individual queries when the search engine is operating under different query traffic conditions with P = 128 and D = 4. In the baseline strategy all queries always visit the 128 search nodes, whereas using our proposal the number of search nodes hit by queries is reduced to an average of 64.5 search nodes. Figure 8 (right) shows the effective memory used for caching. In this case we counted the total number of unique items stored in the P × D posting list and intersection caches. The proposed cache hierarchy is able to store about 40% more unique terms than the baseline strategy. This implies a potential reduction in secondary memory accesses, and the saving in effective space is by far larger than the 5% extra space we paid for the inclusion of the location and top-K caches.

Table 1: Cache hits for P = 128 and D = 4.

    Type of Cache          Baseline    Proposal
    Results cache            7.13%       7.09%
    Intersection cache       2.0%        3.70%
    Posting lists cache     19.17%      35.79%
    Location cache             -        21.96%
    Top-K cache                -         3.10%

Figure 6: Power consumption results. (a) Validation: average energy per device (NET, DISK, RAM, CPU) at different load levels, for real and ideal power curves. (b) Results: normalized power for the baseline and the proposal under low, medium and high query traffic, for real and ideal power curves.

Figure 7: Normalized CPU utilization using P = 128 and D = 4 (baseline vs. proposal, one curve per search node, over intervals of time).

Figure 8: Left: Number of search nodes visited per query under low, medium and high query traffic (P = 128, D = 4). Right: Memory capacity (number of unique cached terms) for P = 64 to 512.

The reduction in secondary memory accesses is reflected in the hit ratios of the different caches. Table 1 shows the gain in hit ratio achieved with the proposed strategy. Notice that we deliberately defined a results cache size such that its hits are no larger than 10%; this is to let a larger stream of queries flow through the cache hierarchy. In a production system the results cache hit ratio is larger, but it is known that the frequency distribution of the stream of queries flowing down the hierarchy is still Zipfian, as in Figure 4.

Throughput and query answer time. Figure 9.a shows the query throughput obtained with different values of D = 1, 2, 4 and 8 over a time interval and P = 128. In this and the next experiment we measure throughput at regular small intervals of simulation time to detect singularities in the stream of queries. As expected, the larger the number of replicas, the bigger the increase in throughput. The proposed strategy is in general more efficient than the baseline strategy on this metric. However, there are some intervals in which our strategy slows throughput down to values similar to those of the baseline strategy. This happens when the queries currently being processed hit roughly the same reduced set of search nodes, determined by the same set of columns and rows of the 2D array; in this case, performance is limited by the answer time of those few nodes. One might expect the baseline strategy to outperform our strategy here because it can distribute the traffic coming to the same column among the D rows uniformly at random, whereas the proposed strategy is restricted to a few rows. The results show that this is not the case because, at the same time, the popular queries directed to a few rows tend to be trapped by the top-K cache. Figure 9.b shows a similar situation but with a larger degree of dispersion of the data points because of imbalance. In both experiments the size of the text collection is kept constant, and one should expect more imbalance as the collection is partitioned into more search nodes.

Regarding scalability with either the number of search nodes P or the size N of the collection, we study two cases for high query traffic. The first one keeps the size N of the text collection constant as we increase P. The second case increases N proportionally with P. Figure 10.a presents the results for both cases, and they show that the proposed cache hierarchy is able to increase query throughput by more than 30% for large numbers of search nodes. These are average values over the whole interval that starts after processing the first 60% of the input queries. Figure 10.b presents individual query response times for P = 128 and D = 4. The results show that the baseline strategy exceeds the maximum acceptable user latency by a wide margin at regular intervals because of saturation, whereas the proposed strategy keeps query latency steady. As mentioned above, we obtained these results by paying a 5% increase in the total space occupied by caches over the baseline strategy.

Figure 9: Query throughput results. (a) Increasing replicas D for P = 128. (b) Increasing search nodes P for D = 4.

Figure 11: Taking into account fault tolerance. (a) Query throughput. (b) Query response times.

Figure 10: Query processing results. (a) Query throughput. (b) Query response times.

Evaluating fault tolerance. In the next figures we show results for the behavior of the search engine in situations in which search nodes fail temporarily, that is, they leave service for a while and are then re-incorporated. In each simulation time interval of fixed size ∆i, we inject faults into the system so that the total number of search nodes out of service does not exceed 5% or 15% of the total number of search nodes at any time. These are dramatic cases of search node failure. Then, in each time interval of size ∆j < ∆i, one of the search nodes that is out of service is re-instated. We impose quality of service in the sense that we always present exact results for the query to the user. This means that upon a failure it is necessary to re-execute the parts of the queries being processed by the faulty search nodes.

Figure 11.a shows the throughput obtained with the proposed strategy and the baseline strategy. In this figure, the average ratios of the baseline strategy to the proposed strategy over the whole interval are 0.39 and 0.55 for the 5% and 15% failure rates respectively. This shows that the proposed strategy achieves better query throughput in general. The individual points of Figure 11.a show, however, that the throughput within small intervals of time can be equally compromised in both strategies. Nevertheless, Figure 11.b clearly shows that the proposed strategy is able to keep user query latency at small values. Notice that in both panels the results are normalized to the maximum of the whole set of points, which gives a panoramic view of the effects of going from 5% to 15% failures. Finally, Figure 12 shows overall averages for query throughput, final simulation clock time, query response time, and response time of the queries affected by failures. In each case, the proposed strategy outperforms the baseline strategy.
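The fault-injection schedule described above can be summarized with the following sketch, an illustrative Python approximation of the simulation logic; the concrete interval lengths, node count, and data structures are assumptions, not the simulator's actual code. At every interval of size ∆i, faults are injected while keeping the number of out-of-service nodes at or below the target failure rate, and at every smaller interval of size ∆j one failed node is re-instated.

```python
import random

P = 128                 # total number of search nodes (assumed)
FAILURE_RATE = 0.05     # 5% scenario; use 0.15 for the 15% scenario
DELTA_I = 100.0         # fault-injection interval Delta_i (simulation time units, assumed)
DELTA_J = 20.0          # re-instatement interval Delta_j, with DELTA_J < DELTA_I
SIM_END = 1000.0        # length of the simulated interval (assumed)

def fault_schedule():
    """Generate (time, event, node) tuples describing temporary node failures."""
    failed = set()
    events = []
    t, next_inject, next_restore = 0.0, 0.0, DELTA_J
    while t < SIM_END:
        if t >= next_inject:
            # Inject new faults, never exceeding FAILURE_RATE * P nodes down at once.
            budget = int(FAILURE_RATE * P) - len(failed)
            candidates = sorted(set(range(P)) - failed)
            for node in random.sample(candidates, max(budget, 0)):
                failed.add(node)
                events.append((t, "fail", node))
            next_inject += DELTA_I
        if t >= next_restore:
            # Re-instate one out-of-service node per interval of size DELTA_J.
            if failed:
                node = failed.pop()
                events.append((t, "recover", node))
            next_restore += DELTA_J
        t = min(next_inject, next_restore)
    return events

if __name__ == "__main__":
    for time, event, node in fault_schedule()[:10]:
        print(f"t={time:7.1f}  {event:7s}  node {node}")
```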

Figure 12: Averages in the presence of 5% and 15% failure rate for search nodes selected uniformly at random.

6. CONCLUSIONS

We have proposed a new cache hierarchy for Web search engines. Overall, the experimental results show that the proposed strategy outperforms the baseline approach in the relevant performance metrics. Indeed, a feasible way to evaluate metrics such as power consumption is by means of simulators that are properly tuned against reality. Our performance evaluation method combines, in the same simulation programs, techniques that are widely accepted as valid for the cost prediction of complex systems, namely process-oriented discrete-event simulation and bulk-synchronous parallelism. Our claims on comparative performance are pertinent for the following reasons: (a) in the experiments we used a real-life workload coming from a very large sequence of queries issued by actual users; (b) the processing requirements of each query were precisely determined by indexing a 1.5TB sample of the Web; (c) the interaction among queries and cost-relevant cluster devices was precisely simulated by executing the sequence of queries upon actual realizations of the admission and eviction caching policies; and (d) the parallel cost of processing the queries not found in any cache was determined by circulating those queries among queuing hardware devices and BSP supersteps. A relevant point is that the baseline and proposed strategies were evaluated upon exactly the same simulated parallel search engine and input workload, which, by points (a)-(d) above, makes the comparison fair.
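For completeness, the superstep cost accounting used by bulk-synchronous parallel simulations is the standard BSP formula below; this is the conventional notation of the bridging model, and the simulator's internal bookkeeping may differ in its details.

```latex
% Cost of one BSP superstep on P processors:
%   w_i : local computation performed by processor i during the superstep
%   h   : maximum number of messages sent or received by any processor
%   g   : communication gap (time per message under continuous traffic)
%   L   : cost of the barrier synchronization that ends the superstep
T_{\mathrm{superstep}} = \max_{0 \le i < P} w_i + h\,g + L
```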


