On Efficient Posting List Intersection with Multicore Processors

Shirish Tatikonda
The Ohio State University, Columbus, OH, USA
[email protected]

Flavio Junqueira, B. Barla Cambazoglu, and Vassilis Plachouras
Yahoo! Research, Barcelona, Spain
(fpj,barla)@yahoo-inc.com, [email protected]

Copyright is held by the author/owner(s). SIGIR'09, July 19–23, 2009, Boston, Massachusetts, USA. ACM 978-1-60558-483-6/09/07.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—Retrieval models

General Terms
Algorithms, Design, Performance

Keywords
multicores, parallel query processing, list intersection

1. INTRODUCTION

The size of the indexable Web and the number of search queries submitted by users have grown consistently throughout the past decade. With such growth, efficient and scalable methods for implementing information retrieval (IR) systems become critical for user satisfaction. Thus far, the performance of IR systems with respect to query throughput and query latency has been improved by designing new list intersection algorithms [2] and by developing novel caching strategies [1]. In contrast to these techniques, we explore a new research direction for improving IR efficiency: designing algorithms that leverage modern computer architectures such as multicore systems.

Multicores, primarily motivated by energy and power constraints, pack two or more cores on a single die. The cores typically share the on-chip L2 cache as well as the front-side bus to main memory. As these systems become more popular, the general trend has been from single-core to many-core: from dual-, quad-, and eight-core chips to chips with tens of cores. So far, however, very little has been done to exploit the full potential of these chips in the context of IR. Strohman and Croft used 64-bit machines and four-core chips to show modest improvements in throughput [6]. Their techniques suffer from bandwidth issues and, as a result, provide only limited scalability. Bonacic et al. used synchronous strategies to group queries into batches and then process the batches sequentially [3]. Ding et al. parallelized posting list intersections using graphics processors (GPUs) [4]. These techniques, however, fail to deliver good performance as the number of cores increases.

In this article, we present and discuss two parallel query processing models for multicore systems: inter-query parallelism and intra-query parallelism. While the former exploits the parallelism between queries, the latter exploits the parallelism within a given query. With the intra-query model, we are able to improve both the throughput and the query response time simultaneously; to the best of our knowledge, ours is the first model to do so. Commercial search engines are typically driven by query latency, so reducing the time to process individual queries is crucial. The query latency includes the time for decompression, posting list intersection, document scoring, and result page generation. Herein, we consider only the time for decompression and list intersection; we refer to this time as the query latency or the query response time.

2. PARALLEL RETRIEVAL MODELS

Developing efficient parallel query processing models is quite challenging. Posting lists are usually kept in compressed form (to reduce storage requirements), which makes it difficult to support random access into the lists. Parallel strategies need to partition the work across cores so as to balance the load evenly, i.e., to reduce the idle time per core. Furthermore, the memory accesses of individual cores should be minimized so that the memory bandwidth is not saturated.

The posting list of each term is a sorted list of document identifiers stored as a skip list [5]. A skip is a pointer i → j between two non-consecutive documents i and j in the posting list. The number of documents skipped between i and j is called the skipsize. For a term t, the posting list L(t) is a tuple (S_t, C_t), where S_t = {s_1, s_2, ..., s_k} is a sequence of skips and C_t contains the remaining documents (between skips), stored using the PForDelta compression scheme [7]. While skips are commonly used to speed up the list intersection process, we leverage them to provide random access over compressed posting lists.
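To make this layout concrete, the following sketch models a posting list as a skip directory over a compressed document stream. The names Skip, PostingList, and decodeBlock are ours, and PForDelta is replaced by plain d-gap encoding purely for illustration; what matters is that each skip entry lets one block be located and decompressed independently of the rest of the list.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One skip entry: locates one block of postings in the compressed stream.
struct Skip {
    uint32_t doc;   // first document id of the block (stored uncompressed)
    size_t offset;  // start of the block's d-gaps in the compressed stream
    size_t count;   // number of postings in the block (the skipsize)
};

// L(t) = (S_t, C_t): a skip sequence over a compressed posting sequence.
struct PostingList {
    std::vector<Skip> skips;     // S_t
    std::vector<uint32_t> gaps;  // C_t, d-gap encoded (PForDelta stand-in)

    // Decompress a single block on demand; no other block is touched,
    // which is what provides random access over the compressed list.
    std::vector<uint32_t> decodeBlock(size_t b) const {
        const Skip& s = skips[b];
        std::vector<uint32_t> docs;
        docs.reserve(s.count);
        uint32_t doc = s.doc;
        docs.push_back(doc);
        for (size_t i = 1; i < s.count; ++i) {
            doc += gaps[s.offset + i - 1];
            docs.push_back(doc);
        }
        return docs;
    }
};
```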

Figure 1a shows the differences between our two parallel IR models. The inter-query model exploits the parallelism among queries by handling each query on a different core. Here, the posting lists of a given query are intersected using the standard merge-based technique with appropriate skip-based pruning strategies. The documents within a skip pointer are decompressed on demand.

Figure 1: (a) Parallel query processing models. (b) Architecture of the intra-query model.
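The paper gives no pseudocode for the per-core work of the inter-query model; the following is our reading of "merge-based with skip-based pruning", assuming the PostingList layout sketched above. Blocks whose document ranges cannot overlap are skipped without being decompressed, and overlapping blocks are decoded on demand and merged.

```cpp
#include <limits>

// Skip-pruned merge intersection of two compressed posting lists.
// A block spans [skips[i].doc, skips[i+1].doc); blocks that cannot
// overlap are pruned without decompression.
std::vector<uint32_t> intersect(const PostingList& a, const PostingList& b) {
    const uint32_t INF = std::numeric_limits<uint32_t>::max();
    std::vector<uint32_t> out;
    size_t ia = 0, ib = 0;
    while (ia < a.skips.size() && ib < b.skips.size()) {
        uint32_t hiA = ia + 1 < a.skips.size() ? a.skips[ia + 1].doc : INF;
        uint32_t hiB = ib + 1 < b.skips.size() ? b.skips[ib + 1].doc : INF;
        if (hiA <= b.skips[ib].doc) { ++ia; continue; }  // a's block too low
        if (hiB <= a.skips[ia].doc) { ++ib; continue; }  // b's block too low
        // Ranges overlap: decompress both blocks and merge them.
        std::vector<uint32_t> da = a.decodeBlock(ia);
        std::vector<uint32_t> db = b.decodeBlock(ib);
        size_t pa = 0, pb = 0;
        while (pa < da.size() && pb < db.size()) {
            if (da[pa] < db[pb]) ++pa;
            else if (db[pb] < da[pa]) ++pb;
            else { out.push_back(da[pa]); ++pa; ++pb; }
        }
        if (hiA <= hiB) ++ia;  // advance whichever block ends first;
        if (hiB <= hiA) ++ib;  // on a tie, advance both
    }
    return out;
}
```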

The intra-query model, on the other hand, exploits the parallelism within a query by dividing the associated work into independent tasks (see Figure 1b). Each task holds a sequence of documents from both posting lists on which the intersection is performed. Consider a query q with two terms a and b, whose posting lists are L(a) = (S_a, C_a) and L(b) = (S_b, C_b), with m = |S_a| and n = |S_b|. Assume, without loss of generality, that the query terms are sorted in increasing order of their posting list size, i.e., m ≤ n. For each skip pointer in L(a), we create a task with one or more skip pointers from L(b), such that the intersection is performed on the resulting sequences of postings. More specifically, we generate a set of independent tasks {t_1, t_2, ..., t_m}, where t_i = (s_i, s_{i+1}, s_j, s_k) with s_i, s_{i+1} ∈ S_a for 1 ≤ i ≤ m and s_j, s_k ∈ S_b for 1 ≤ j ≤ k ≤ n. Note that s_{i+1} is undefined when i = m. For given s_i and s_{i+1} in L(a), the skips from L(b) are chosen such that s_i ≥ s_j and s_{i+1} ≤ s_k. In other words, all the documents within a skip pointer s_i → s_{i+1} fall in the document interval given by [s_j, s_k]. To find the common elements of these skip ranges, we apply typical list intersection methods. It is straightforward to extend this approach to queries with more terms. The tasks themselves are generated by applying a modified merge-based or search-based list intersection algorithm on the skips in S_a and S_b, as sketched below; each task fully specifies the portion of the posting lists that needs to be intersected.
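A sketch of the merge-based variant of task generation, under the layout above (Task and makeTasks are our names): for each block i of the shorter list, a single forward scan over the longer list's skips finds the blocks j..k whose document range covers it.

```cpp
// One unit of intra-query work: t_i = (s_i, s_{i+1}, s_j, s_k),
// expressed here as block indices into the two skip directories.
struct Task {
    size_t aBlock;         // block i of the shorter list L(a)
    size_t bFirst, bLast;  // blocks j..k of L(b) covering its range
};

// Merge-based task generation: one pass over both skip sequences.
std::vector<Task> makeTasks(const PostingList& a, const PostingList& b) {
    const uint32_t INF = std::numeric_limits<uint32_t>::max();
    std::vector<Task> tasks;
    size_t j = 0;  // never moves backwards: skips are sorted
    for (size_t i = 0; i < a.skips.size(); ++i) {
        uint32_t lo = a.skips[i].doc;
        uint32_t hi = i + 1 < a.skips.size() ? a.skips[i + 1].doc : INF;
        // Advance j while the next block of b still starts at or below lo.
        while (j + 1 < b.skips.size() && b.skips[j + 1].doc <= lo) ++j;
        // Extend k while the next block of b starts inside [lo, hi).
        size_t k = j;
        while (k + 1 < b.skips.size() && b.skips[k + 1].doc < hi) ++k;
        tasks.push_back({i, j, k});
    }
    return tasks;
}
```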

Once the tasks are created and pushed into a task pool, different cores process them independently (see Figure 1b). The common documents found are then fed to the ranking phase for further processing. Since document scores are independent of each other, we can easily parallelize the ranking phase as well: each core takes a document from the rank pool, scores it, and proceeds to the next one. Efficiency can be improved further by integrating the intersection and ranking phases.
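A minimal sketch of the task pool with std::thread workers follows; the threading mechanism is our choice, as the paper does not specify its primitives. Tasks are claimed through an atomic counter, each worker intersects its slice, and the per-task outputs are concatenated in task order. Because tasks cover disjoint, increasing ranges of L(a), this concatenation preserves the global document order.

```cpp
#include <atomic>
#include <thread>

// Intra-query intersection: workers pull tasks from a shared pool.
std::vector<uint32_t> intersectParallel(const PostingList& a,
                                        const PostingList& b,
                                        unsigned cores) {
    std::vector<Task> tasks = makeTasks(a, b);
    std::vector<std::vector<uint32_t>> partial(tasks.size());
    std::atomic<size_t> next{0};  // the "task pool": a shared cursor

    auto worker = [&]() {
        for (size_t t; (t = next.fetch_add(1)) < tasks.size(); ) {
            std::vector<uint32_t> da = a.decodeBlock(tasks[t].aBlock);
            for (size_t bb = tasks[t].bFirst; bb <= tasks[t].bLast; ++bb) {
                std::vector<uint32_t> db = b.decodeBlock(bb);
                size_t pa = 0, pb = 0;  // two-pointer merge of the slices
                while (pa < da.size() && pb < db.size()) {
                    if (da[pa] < db[pb]) ++pa;
                    else if (db[pb] < da[pa]) ++pb;
                    else { partial[t].push_back(da[pa]); ++pa; ++pb; }
                }
            }
        }
    };

    std::vector<std::thread> pool;
    for (unsigned c = 0; c < cores; ++c) pool.emplace_back(worker);
    for (std::thread& th : pool) th.join();

    // Concatenation in task order preserves global document order.
    std::vector<uint32_t> out;
    for (const auto& p : partial) out.insert(out.end(), p.begin(), p.end());
    return out;
}
```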

3. EMPIRICAL EVALUATION

Let Q be the given query workload. We consider three performance measures. (i) Speedup, defined as T_1/T_P, where T_1 is the time to process the query workload with one processor and T_P is the time using P processors. (ii) Throughput, measured as the ratio between the total number of queries |Q| and the total time spent processing them. (iii) Average query latency, computed as (Σ_{i=1}^{|Q|} (f_i − s_i)) / |Q|, where s_i is the time at which the intersection process for the i-th query starts and f_i is the time at which it completes.
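For concreteness, a small helper that computes the three measures from per-query start and finish timestamps; this is our illustration, and the names and units are assumptions.

```cpp
struct Metrics {
    double speedup;     // T_1 / T_P
    double throughput;  // queries per second
    double avgLatency;  // seconds per query
};

// t1, tP: workload times (seconds) on 1 core and on P cores.
// start, finish: per-query timestamps (seconds) from the P-core run.
Metrics computeMetrics(double t1, double tP,
                       const std::vector<double>& start,
                       const std::vector<double>& finish) {
    double totalLatency = 0.0;
    for (size_t i = 0; i < start.size(); ++i)
        totalLatency += finish[i] - start[i];
    return {t1 / tP,
            static_cast<double>(start.size()) / tP,
            totalLatency / static_cast<double>(start.size())};
}
```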

Our data consists of a crawl of documents from the UK domain and an Altavista query log with 200,000 queries. The queries are processed in a streaming fashion, and the skipsize of the skip lists is set to 512.

With respect to speedup and throughput, the inter-query model (Inter) achieves almost linear scalability owing to its very simple parallelization (see Figure 2a). The intra-query model (Intra), on the other hand, incurs runtime overhead due to task creation and task pool maintenance, resulting in sub-linear performance. The average query latency of Inter is almost constant, since only the parallelism between different queries is exploited. In contrast, the latency of Intra decreases continuously as the number of cores increases.

Figure 2: (a) Query throughput (queries/sec) and (b) average query latency (sec/query) for Inter and Intra on 1 to 8 cores.

It is important to note that the drop in throughput and speedup of the intra-query model relative to the inter-query model is less than 20%, whereas the improvement in query latency is more than 5-fold. In a nutshell, Inter improves only the query throughput and speedup, whereas Intra provides an excellent improvement in query latency while sacrificing some throughput and speedup. Unlike existing approaches [6], we found that the memory accesses in our models, especially in the intra-query model with its small tasks, are small and uniform. Thus, it is highly unlikely that the memory bandwidth reaches saturation. We omit the results obtained by varying the query length and the skipsize due to lack of space. We are currently evaluating more sophisticated intersection algorithms to see whether they provide any benefit over the simple merge-based method. In the future, we plan to investigate both parallel models with respect to power and energy management techniques such as DVFS and core-hopping.

Acknowledgments: This work has been partly supported by NSF grants NGS-CNS-0406386, CAREER-IIS-0347662, RI-CNS-0403342, and CCF-0702587.

4. REFERENCES



[1] R. Baeza-Yates, A. Gionis, F. Junqueira, V. Murdock, V. Plachouras, and F. Silvestri. The impact of caching on search engines. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 183–190, 2007.
[2] J. Barbay, A. Lopez-Ortiz, and T. Lu. Faster adaptive set intersections for text searching. In Proceedings of the 5th Workshop on Experimental Algorithms (WEA), pages 146–157, 2006.
[3] C. Bonacic, C. Garcia, M. Marin, M. Prieto, and F. Tirado. Improving search engines performance on multithreading processors. In Proceedings of the 8th International Meeting on High Performance Computing for Computational Science (VECPAR), pages 201–213, 2008.
[4] S. Ding, J. He, H. Yan, and T. Suel. Using graphics processors for high-performance IR query processing. In Proceedings of the 17th International World Wide Web Conference (WWW), pages 1213–1214, 2008.
[5] W. Pugh. Skip lists: a probabilistic alternative to balanced trees. Communications of the ACM, 33(6):668–676, 1990.
[6] T. Strohman and W. Croft. Efficient document retrieval in main memory. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 175–182, 2007.
[7] M. Zukowski, S. Heman, N. Nes, and P. Boncz. Super-scalar RAM-CPU cache compression. In Proceedings of the 22nd International Conference on Data Engineering (ICDE), page 59, 2006.



