Query Answering Using Inverted Indexes

Inverted Indexes
•  Example query: “Brutus” AND “Calpurnia”

J. Pei: Information Retrieval and Web Search -- Query Answering Using Inverted Indexes

Document-at-a-time Evaluation
•  The conceptually simplest query answering method


Algorithm
•  Find the posting lists of the query terms
•  Can be implemented efficiently by maintaining the top-k list at all times
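A minimal Python sketch of document-at-a-time evaluation, under stated assumptions: the index layout (term mapped to a list of `(doc_id, weight)` pairs sorted by `doc_id`), the additive scoring, and the function name `document_at_a_time` are illustrative choices, not taken from the slides. Documents that contain any query term are scored (disjunctive matching).

```python
import heapq

def document_at_a_time(index, query_terms, k):
    """Score one document at a time; keep only the k best in a min-heap."""
    lists = [index[t] for t in query_terms if t in index]
    positions = [0] * len(lists)          # one cursor per posting list
    top_k = []                            # min-heap of (score, doc_id)
    while True:
        # The next document to score is the smallest doc_id under any cursor.
        current = [pl[pos][0] for pl, pos in zip(lists, positions) if pos < len(pl)]
        if not current:
            break
        doc = min(current)
        score = 0.0
        for i, pl in enumerate(lists):
            pos = positions[i]
            if pos < len(pl) and pl[pos][0] == doc:
                score += pl[pos][1]       # add this term's contribution
                positions[i] += 1         # advance the lists in lockstep
        if len(top_k) < k:
            heapq.heappush(top_k, (score, doc))
        elif (score, doc) > top_k[0]:
            heapq.heapreplace(top_k, (score, doc))
    return sorted(top_k, reverse=True)

# Hypothetical toy index with made-up weights.
index = {"brutus": [(1, 1.0), (2, 2.0), (4, 0.5)],
         "calpurnia": [(2, 1.5), (3, 3.0)]}
print(document_at_a_time(index, ["brutus", "calpurnia"], 2))
# → [(3.5, 2), (3.0, 3)]
```

Only the heap of k results is kept in memory, at the price of reading all posting lists in a synchronized way.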

Term-at-a-time Evaluation


Algorithm

•  Compute scores on one term at a time
•  Can be implemented efficiently by maintaining the top-k list at all times
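A minimal Python sketch of term-at-a-time evaluation, assuming the same illustrative index layout (term mapped to a list of `(doc_id, weight)` pairs) and additive scoring; the name `term_at_a_time` and the accumulator dictionary are assumptions made for this example.

```python
import heapq
from collections import defaultdict

def term_at_a_time(index, query_terms, k):
    """Process one posting list at a time, accumulating partial scores."""
    accumulators = defaultdict(float)     # partial score of every document seen
    for term in query_terms:
        # Each inverted list is read once, from start to end.
        for doc_id, weight in index.get(term, []):
            accumulators[doc_id] += weight
    # Pick the k best accumulators; ties are broken by doc_id.
    return heapq.nlargest(k, ((s, d) for d, s in accumulators.items()))

# Hypothetical toy index with made-up weights.
index = {"brutus": [(1, 1.0), (2, 2.0), (4, 0.5)],
         "calpurnia": [(2, 1.5), (3, 3.0)]}
print(term_at_a_time(index, ["brutus", "calpurnia"], 2))
# → [(3.5, 2), (3.0, 3)]
```

The accumulator table grows with the number of matching documents, which is the memory cost of this strategy.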

Comparison
•  Memory usage
  –  Document-at-a-time only needs to maintain a priority queue R holding a limited number of results
  –  Term-at-a-time needs to store the current scores of all documents
•  Disk access
  –  Document-at-a-time needs more disk seeks, and more buffers for seeking, since multiple lists are read in a synchronized way
  –  Term-at-a-time reads through each inverted list from start to end, requiring minimal disk seeking and buffering

List Skipping
•  Consider an inverted list of n bytes with a skip pointer added after every c bytes, each pointer being k bytes long
  –  Reading the whole list: Θ(n) bytes
  –  Jumping through the list using the skip pointers: Θ(kn/c) = Θ(n) bytes, since k and c are constants
  –  No asymptotic gain
  –  When c is large and k is small, skipping may still be a gain in practice
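A small Python sketch of the idea, simplified to count postings rather than bytes: one skip entry is kept for every c-th posting, and a lookup walks the short skip list before scanning inside a single interval. The names `build_skips` and `skip_to` are hypothetical.

```python
def build_skips(postings, c):
    """Return (position, doc_id) skip entries, one for every c-th posting."""
    return [(i, postings[i]) for i in range(0, len(postings), c)]

def skip_to(postings, skips, target):
    """Return (index of first posting >= target, postings examined)."""
    start = 0
    for pos, doc_id in skips:             # walk the skip list: about n/c entries
        if doc_id > target:
            break
        start = pos
    examined = 0
    i = start
    while i < len(postings) and postings[i] < target:
        i += 1                            # linear scan inside one interval only
        examined += 1
    return i, examined

postings = list(range(0, 1000, 2))        # 500 sorted doc ids: 0, 2, ..., 998
skips = build_skips(postings, 50)
i, examined = skip_to(postings, skips, 731)
print(postings[i], examined)
# → 732 16
```

The walk over the skip entries is still linear in n/c, which is why the gain is practical rather than asymptotic.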


Big Skips
•  If c gets too large, the average performance drops
•  Consider finding p postings in a list of n bytes
  –  There are n/c intervals in the list in total
  –  Need to read kn/c bytes of skip pointers
  –  Need to read data in p intervals: assuming that, on average, each posting we want lies about halfway between two skip pointers, this reads an additional pc/2 bytes
  –  The total number of bytes to read: kn/c + pc/2
  –  When p approaches n/c, almost every interval has to be read, so skipping does not help
•  Most disks require a skip of at least 100,000 postings to gain a speedup
  –  Skipping is still useful for reducing the time spent on decoding compressed data and on processing cached data
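The cost expression kn/c + pc/2 can be explored numerically. The sketch below is illustrative: setting the derivative with respect to c to zero gives c = √(2kn/p), a step that is not on the slide, and the function names and sample numbers are assumptions. The model only applies while p is much smaller than n/c.

```python
import math

def bytes_read(n, c, k, p):
    """Bytes read to find p postings: all skip pointers plus half an interval each."""
    return k * n / c + p * c / 2

def best_interval(n, k, p):
    """Interval length c that minimizes bytes_read: d/dc (kn/c + pc/2) = 0."""
    return math.sqrt(2 * k * n / p)

n, k, p = 1_000_000, 4, 200               # made-up list size, pointer size, lookups
for c in (20, 200, 2000):
    print(c, bytes_read(n, c, k, p))
# → 20 202000.0
# → 200 40000.0
# → 2000 202000.0
print(best_interval(n, k, p))
# → 200.0
```

Both too small and too large an interval cost more than the balanced choice, which matches the observation that big skips hurt average performance.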

Computing Cosine Score


Efficient Scoring
•  For a query q = w1 w2
  –  The unit vector v(q) has only two nonzero components
  –  If query terms are not weighted, the nonzero components are equal to √2/2 ≈ 0.707
•  Generally, for any two documents d1 and d2
  –  V(q) · v(d1) > V(q) · v(d2) if and only if v(q) · v(d1) > v(q) · v(d2)
•  V(q) · v(d) is the weighted sum, over all terms in query q, of the weights of those terms in d
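A short Python check of this equivalence, as a sketch: vectors are stored as term-to-weight dictionaries, and the names `normalize`, `dot`, and `rank` as well as the toy documents are assumptions. Ranking with the unnormalized query vector V(q) gives the same order as ranking with the unit vector v(q), because the two differ only by a positive constant factor.

```python
import math

def normalize(vec):
    """Scale a sparse vector {term: weight} to unit length."""
    length = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / length for t, w in vec.items()}

def dot(a, b):
    """Dot product of two sparse vectors."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def rank(query_vec, doc_vecs):
    """Doc ids ordered by decreasing dot product with query_vec."""
    return sorted(doc_vecs, key=lambda d: dot(query_vec, doc_vecs[d]), reverse=True)

V_q = {"w1": 1.0, "w2": 1.0}              # unweighted two-term query
v_q = normalize(V_q)                      # both components are sqrt(2)/2
docs = {                                  # made-up unit document vectors v(d)
    "d1": normalize({"w1": 3.0, "w3": 4.0}),
    "d2": normalize({"w1": 1.0, "w2": 1.0}),
    "d3": normalize({"w2": 1.0, "w3": 2.0}),
}
print(rank(V_q, docs) == rank(v_q, docs))
# → True
```

In practice this means the query vector never needs to be normalized when only the ordering of the results matters.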

Efficient Scoring Algorithm

•  Using a heap, the top k answers can be selected efficiently: building the heap takes at most 2J comparisons, where J is the number of answers with nonzero scores, and each of the k top answers is then extracted in O(log J) comparisons
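A sketch of heap-based top-k selection using Python's standard `heapq` module. The function name `top_k_by_heap` and the score dictionary are illustrative; `heapq` is a min-heap, so scores are negated here to pop the largest first.

```python
import heapq

def top_k_by_heap(scores, k):
    """Select the k highest-scoring documents from {doc_id: score}."""
    # Only the J documents with nonzero scores enter the heap.
    heap = [(-s, d) for d, s in scores.items() if s > 0]
    heapq.heapify(heap)                   # linear-time heap construction
    result = []
    for _ in range(min(k, len(heap))):
        neg_score, doc = heapq.heappop(heap)   # O(log J) per extraction
        result.append((doc, -neg_score))
    return result

scores = {1: 0.2, 2: 0.9, 3: 0.0, 4: 0.5}  # made-up cosine scores
print(top_k_by_heap(scores, 2))
# → [(2, 0.9), (4, 0.5)]
```

This avoids fully sorting the J scored documents when only k of them are needed.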

Approximate Top-K Retrieval
•  Retrieve K documents that are likely to be among the K highest scoring documents
  –  Goal: reduce the cost of query answering
  –  The cosine measure is itself only an approximation of the information need
•  Major cost: computing cosine similarities between the query and a large number of documents
•  Approximation strategies
  –  Find a set A of documents that are contenders, where K < |A| « N, such that A is likely to have many documents with scores near those of the top K
  –  Return the top-K documents in A

Index Elimination
•  For a multi-term query q, we only need to consider documents containing at least one of the query terms
•  Only consider documents containing terms whose IDF exceeds a preset threshold
  –  Only check those discriminative words
  –  Benefit: the postings lists of low-IDF terms are generally long (many are stop words)
•  Only consider documents that contain many of the query terms
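Both heuristics can be sketched in a few lines of Python. The IDF definition log10(N/df), the threshold value, and the names `eliminate_terms` and `candidate_docs` are assumptions made for this example, not taken from the slides.

```python
import math

def eliminate_terms(query_terms, doc_freq, num_docs, idf_threshold):
    """Keep only query terms whose IDF = log10(N / df) exceeds the threshold."""
    kept = []
    for t in query_terms:
        df = doc_freq.get(t, 0)
        if df > 0 and math.log10(num_docs / df) > idf_threshold:
            kept.append(t)
    return kept

def candidate_docs(index, terms, min_match):
    """Docs that contain at least min_match of the given terms."""
    counts = {}
    for t in terms:
        for doc_id in index.get(t, []):
            counts[doc_id] = counts.get(doc_id, 0) + 1
    return sorted(d for d, c in counts.items() if c >= min_match)

# Hypothetical collection statistics and postings.
doc_freq = {"the": 900, "brutus": 10, "calpurnia": 5}
index = {"the": [1, 2, 3, 4, 5], "brutus": [1, 2, 4], "calpurnia": [2, 3]}
terms = eliminate_terms(["the", "brutus", "calpurnia"], doc_freq, 1000, 1.0)
print(terms)
# → ['brutus', 'calpurnia']
print(candidate_docs(index, terms, 2))
# → [2]
```

Dropping the low-IDF term avoids traversing its long posting list, and the `min_match` filter shrinks the candidate set further.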

Champion Lists
•  For each term t in the dictionary, precompute the top-r documents of the highest weights for t, where r is a preset parameter
  –  Set different r for different terms: larger for rare terms and smaller for frequent terms
•  Given a query q, let A be the union of the champion lists for each of the terms comprising q
  –  Compute cosine similarity only between q and those documents in A
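A minimal sketch of building champion lists and forming the candidate set A. The index layout (term mapped to a `{doc_id: weight}` dictionary) and the names `build_champion_lists` and `candidate_set` are illustrative assumptions; `r` may be a single integer or a per-term dictionary.

```python
import heapq

def build_champion_lists(index, r):
    """For each term, keep the r documents with the highest weights."""
    champions = {}
    for term, postings in index.items():
        limit = r[term] if isinstance(r, dict) else r
        best = heapq.nlargest(limit, postings.items(), key=lambda item: item[1])
        champions[term] = {doc_id for doc_id, _ in best}
    return champions

def candidate_set(champions, query_terms):
    """A = union of the champion lists of the query terms."""
    a = set()
    for t in query_terms:
        a |= champions.get(t, set())
    return a

# Hypothetical weights, for illustration only.
index = {"brutus": {1: 0.9, 2: 0.1, 4: 0.5},
         "calpurnia": {2: 0.7, 3: 0.2}}
champions = build_champion_lists(index, 1)
print(sorted(candidate_set(champions, ["brutus", "calpurnia"])))
# → [1, 2]
```

Cosine similarity is then computed only for the documents in A, which is built at indexing time rather than at query time.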

Static Quality Scores and Ordering
•  Different documents have different importance
  –  Example: how good are the reviews of a web page?
  –  Modeled by a quality measure g(d) ∈ [0, 1]
  –  TotalScore(q, d) = g(d) + (V(q) · V(d)) / (|V(q)| × |V(d)|)
•  Sort the documents in every posting list in descending order of g(d)
  –  Example: suppose g(1) = 0.25, g(2) = 0.5, and g(3) = 1; then document 3 precedes document 2, which precedes document 1, in every posting list
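The total score and the quality-based ordering can be sketched as follows. Vectors are term-to-weight dictionaries, and the helper names `cosine`, `total_score`, and `order_by_quality` are assumptions for this example.

```python
import math

def cosine(q, d):
    """Cosine similarity of two sparse vectors {term: weight}."""
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    norm = (math.sqrt(sum(w * w for w in q.values()))
            * math.sqrt(sum(w * w for w in d.values())))
    return dot / norm if norm else 0.0

def total_score(g_d, q, d):
    """TotalScore(q, d) = g(d) + cosine(q, d), with g(d) in [0, 1]."""
    return g_d + cosine(q, d)

def order_by_quality(posting_docs, g):
    """Sort a posting list by decreasing static quality g(d)."""
    return sorted(posting_docs, key=lambda doc: g[doc], reverse=True)

g = {1: 0.25, 2: 0.5, 3: 1.0}
print(order_by_quality([1, 2, 3], g))
# → [3, 2, 1]
print(total_score(g[2], {"a": 1.0}, {"a": 2.0}))
# → 1.5
```

Because every posting list uses the same g(d) ordering, the lists can still be merged in a single synchronized pass.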


Using Quality Score Ordering
•  For a well-chosen value r, maintain for each term t a global champion list of the top-r documents with the highest value of g(d) + TFIDF(t, d)
  –  At query time, only compute TotalScore for documents in the union of those global champion lists
•  Maintain for each term t two posting lists
  –  High list: the m documents with the highest TF values for t
  –  Low list: the other documents containing t
  –  Scan the high lists first; go on to the low lists only if the high lists generate fewer than K answers
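A sketch of the high/low list scheme, under simplifying assumptions: candidates are taken as the union of the high lists of the query terms, and the low lists are consulted only when fewer than K candidates result. The names `build_high_low` and `candidates` and the toy TF values are hypothetical.

```python
def build_high_low(postings_tf, m):
    """Split {doc_id: tf} into (high, low): the m highest-TF docs and the rest."""
    ranked = sorted(postings_tf, key=lambda d: postings_tf[d], reverse=True)
    return set(ranked[:m]), set(ranked[m:])

def candidates(high_low, query_terms, k):
    """Use the high lists first; add the low lists only if fewer than k docs result."""
    found = set()
    for t in query_terms:
        found |= high_low[t][0]
    if len(found) < k:
        for t in query_terms:
            found |= high_low[t][1]
    return found

# Hypothetical term frequencies.
tf = {"brutus": {1: 5, 2: 1, 4: 3}, "calpurnia": {2: 4, 3: 1}}
high_low = {t: build_high_low(p, 1) for t, p in tf.items()}
print(sorted(candidates(high_low, ["brutus", "calpurnia"], 2)))
# → [1, 2]
print(sorted(candidates(high_low, ["brutus", "calpurnia"], 3)))
# → [1, 2, 3, 4]
```

TotalScore would then be computed only for the returned candidates.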


Tiered Indexes
•  Generalization of champion lists


Clustering and NN Search
•  Clustering
  –  Pick √N documents as leaders at random from the collection
  –  For each document that is not a leader (called a follower), compute its nearest leader
  –  Each cluster has about N/√N = √N followers
  –  Alternatively, a follower can be assigned to its b1 nearest leaders
•  Query answering as nearest neighbor search
  –  For a query q, find the leader L (or the b2 leaders) closest to q, by computing the cosine similarities between q and the √N leaders
  –  The candidate set A contains the closest leader and its followers
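A compact Python sketch of this cluster-pruning scheme for the simplest case b1 = b2 = 1. Documents are dense vectors in a dictionary keyed by doc id; the names `build_clusters` and `search`, the fixed random seed, and the toy vectors in the usage note are assumptions for illustration.

```python
import math
import random

def cosine(a, b):
    """Cosine similarity of two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def build_clusters(docs, seed=0):
    """Pick about sqrt(N) random leaders; attach each follower to its nearest leader."""
    rng = random.Random(seed)
    n_leaders = max(1, round(math.sqrt(len(docs))))
    leaders = rng.sample(sorted(docs), n_leaders)
    clusters = {leader: [] for leader in leaders}
    for doc_id, vec in docs.items():
        if doc_id not in clusters:        # followers only
            nearest = max(leaders, key=lambda l: cosine(vec, docs[l]))
            clusters[nearest].append(doc_id)
    return clusters

def search(docs, clusters, query, k):
    """Score only the closest leader and its followers (the candidate set A)."""
    leader = max(clusters, key=lambda l: cosine(query, docs[l]))
    candidates = [leader] + clusters[leader]
    return sorted(candidates, key=lambda d: cosine(query, docs[d]), reverse=True)[:k]
```

A possible use: with nine documents such as `docs = {i: (math.cos(0.1 * i), math.sin(0.1 * i)) for i in range(9)}`, `build_clusters(docs)` picks three leaders, and `search(docs, clusters, (1.0, 0.0), 2)` compares the query against those three leaders and one cluster instead of all nine documents. The result is approximate: a true top-K document that follows a different leader is missed.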

Example


Putting All Together


Summary and To-Do-List
•  Query evaluation
  –  Document-at-a-time versus term-at-a-time
•  List skipping
•  Efficient scoring
•  Approximate top-K retrieval
  –  Index elimination, champion lists, quality score and ranking, clustering and nearest neighbor search
•  Read Section 5.7
