
Query Answering Using Inverted Indexes

J. Pei: Information Retrieval and Web Search -- Query Answering Using Inverted Indexes

Inverted Indexes
•  Query: “Brutus” AND “Calpurnia”
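A conjunctive query such as “Brutus” AND “Calpurnia” is typically answered by intersecting the two posting lists. The sketch below is illustrative only: the toy documents and the helper names `build_index` and `intersect` are assumptions, not taken from the slides.

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to a sorted list of IDs of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def intersect(p1, p2):
    """Merge two sorted posting lists, keeping the IDs present in both."""
    i = j = 0
    answer = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

# Hypothetical toy collection.
docs = {
    1: "Brutus killed Caesar",
    2: "Calpurnia warned Caesar about Brutus",
    3: "Caesar ignored Calpurnia",
    4: "Brutus and Calpurnia",
}
index = build_index(docs)
result = intersect(index.get("brutus", []), index.get("calpurnia", []))
print(result)  # [2, 4]
```

The merge walks each list once, so the cost is linear in the combined length of the two lists.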



Document-at-a-time Evaluation
•  The conceptually simplest query answering method



Algorithm
•  Find the posting lists of the query terms
•  Can be implemented efficiently by keeping the top-k list at any time
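A minimal sketch of document-at-a-time evaluation follows. It is not the algorithm from the slide's figure: the function name, the additive scoring (a document's score is the sum of its per-term weights), and the toy index are all assumptions made for illustration.

```python
import heapq

def document_at_a_time(query_terms, index, k):
    """index maps term -> list of (doc_id, weight) postings sorted by doc_id."""
    lists = [index[t] for t in query_terms if t in index]
    positions = [0] * len(lists)
    top_k = []  # min-heap of (score, doc_id), never larger than k
    while True:
        # The next document to score is the smallest unprocessed doc ID.
        heads = [lst[pos][0] for lst, pos in zip(lists, positions) if pos < len(lst)]
        if not heads:
            break
        doc_id = min(heads)
        score = 0.0
        for i, lst in enumerate(lists):
            pos = positions[i]
            if pos < len(lst) and lst[pos][0] == doc_id:
                score += lst[pos][1]  # assumed scoring: sum of term weights
                positions[i] += 1
        # Keep only the k best results seen so far.
        if len(top_k) < k:
            heapq.heappush(top_k, (score, doc_id))
        elif (score, doc_id) > top_k[0]:
            heapq.heapreplace(top_k, (score, doc_id))
    return sorted(top_k, reverse=True)

# Hypothetical postings with precomputed weights.
index = {
    "brutus": [(1, 1.0), (2, 0.5), (4, 2.0)],
    "calpurnia": [(2, 1.5), (3, 1.0), (4, 0.5)],
}
daat_result = document_at_a_time(["brutus", "calpurnia"], index, k=2)
print(daat_result)  # [(2.5, 4), (2.0, 2)]
```

All lists advance together, so each document's score is complete before the next document is touched, and only the k-entry heap has to stay in memory.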


Term-at-a-time Evaluation




•  Compute scores on one term at a time
•  Can be implemented efficiently by keeping the top-k list at any time
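Term-at-a-time evaluation can be sketched as below. The accumulator table, the additive scoring, and the toy index are illustrative assumptions rather than the slide's own algorithm.

```python
import heapq
from collections import defaultdict

def term_at_a_time(query_terms, index, k):
    """index maps term -> list of (doc_id, weight) postings."""
    accumulators = defaultdict(float)  # doc_id -> partial score
    for term in query_terms:
        # Read one inverted list from start to end before touching the next.
        for doc_id, weight in index.get(term, []):
            accumulators[doc_id] += weight
    # Scores are final only after the last list has been processed.
    return heapq.nlargest(k, ((score, doc_id) for doc_id, score in accumulators.items()))

# Hypothetical postings with precomputed weights.
index = {
    "brutus": [(1, 1.0), (2, 0.5), (4, 2.0)],
    "calpurnia": [(2, 1.5), (3, 1.0), (4, 0.5)],
}
taat_result = term_at_a_time(["brutus", "calpurnia"], index, k=2)
print(taat_result)  # [(2.5, 4), (2.0, 2)]
```

The accumulator table holds a partial score for every document that appears in any query term's list, which is where the memory cost of this method comes from.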


Comparison
•  Memory usage
  –  Document-at-a-time only needs to maintain a priority queue R of a limited number of results
  –  Term-at-a-time needs to store the current scores for all documents
•  Disk access
  –  Document-at-a-time needs more disk seeking and more buffers for seeking, since multiple lists are read in a synchronized way
  –  Term-at-a-time reads through each inverted list from start to end, requiring minimal disk seeking and buffering


List Skipping
•  Consider an inverted list of n bytes; suppose we add a skip pointer after every c bytes, and each pointer is k bytes long
  –  Reading the whole list: Θ(n) bytes
  –  Jumping through the list using the skip pointers: Θ(kn/c) = Θ(n) bytes
  –  No asymptotic gain
  –  When c is large and k is small, skipping may still pay off in practice
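The following sketch illustrates how skip pointers reduce the amount of list data examined. For simplicity it counts postings rather than bytes and stores the skip pointers in a separate array; both choices, and the names `build_skips` and `find`, are assumptions made for illustration.

```python
def build_skips(postings, c):
    """Record a (doc_id, position) skip pointer for every c-th posting."""
    return [(postings[i], i) for i in range(0, len(postings), c)]

def find(postings, skips, target):
    """Return (found, skip_pointers_read, postings_scanned)."""
    start = 0
    skips_read = 0
    for doc_id, pos in skips:
        skips_read += 1
        if doc_id > target:
            break  # the target, if present, lies in the previous interval
        start = pos
    scanned = 0
    for i in range(start, len(postings)):
        scanned += 1
        if postings[i] >= target:
            return postings[i] == target, skips_read, scanned
    return False, skips_read, scanned

postings = list(range(0, 1000, 2))  # 500 sorted doc IDs: 0, 2, ..., 998
skips = build_skips(postings, 50)   # 10 skip pointers
found, skips_read, scanned = find(postings, skips, 642)
print(found, skips_read, scanned)   # True 8 22
```

A plain linear scan would examine 322 postings to reach document 642; with skip pointers the lookup reads 8 pointers and 22 postings.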



Big Skips
•  If c gets too large, the average performance drops
•  Consider finding p postings in a list of n bytes
  –  There are n/c total intervals in the list
  –  Need to read kn/c bytes in skip pointers
  –  Need to read data in p intervals; assuming the postings we want lie, on average, about halfway between two skip pointers, this reads an additional pc/2 bytes
  –  The total number of bytes to read: kn/c + pc/2
  –  When p approaches n/c, nearly every interval contains a wanted posting, so skipping does not help
•  Most disks require a skip of at least 100,000 postings to see any speedup
  –  Skipping is still useful for reducing the time spent on decoding compressed data and on processing cached data
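The cost formula kn/c + pc/2 can be explored numerically. In the sketch below the parameter values are invented for illustration, and the closed-form minimizer c = sqrt(2kn/p) is not stated on the slide; it follows from setting the derivative of the formula with respect to c to zero.

```python
import math

def bytes_read(n, p, c, k):
    """Bytes read to find p postings in an n-byte list: kn/c + pc/2."""
    return k * n / c + p * c / 2

def best_interval(n, p, k):
    """Interval c that minimizes kn/c + pc/2 (derivative set to zero)."""
    return math.sqrt(2 * k * n / p)

# Hypothetical setting: 1 MB list, 100 wanted postings, 4-byte skip pointers.
n, p, k = 1_000_000, 100, 4
for c in (10, 100, 1000, 10000):
    print(c, bytes_read(n, p, c, k))

c_star = best_interval(n, p, k)
print(round(c_star, 2), round(bytes_read(n, p, c_star, k), 1))
```

Small intervals are dominated by the pointer cost kn/c, large intervals by the scanning cost pc/2, which is why performance drops once c grows past the minimizer.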


Computing Cosine Score



Efficient Scoring
•  For a query q = w1 w2
  –  The unit vector $\vec{v}(q)$ has only two nonzero components
  –  If query terms are not weighted, the nonzero components are both equal to $\sqrt{2}/2 \approx 0.707$
•  Generally, for any two documents d1 and d2
  –  $\vec{V}(q) \cdot \vec{v}(d_1) > \vec{V}(q) \cdot \vec{v}(d_2)$ if and only if $\vec{v}(q) \cdot \vec{v}(d_1) > \vec{v}(q) \cdot \vec{v}(d_2)$
•  $\vec{V}(q) \cdot \vec{v}(d)$ is the weighted sum, over all terms in query q, of the weights of those terms in d
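The ranking equivalence can be checked on a small example. The sketch below assumes an unweighted two-term query and three made-up document vectors; the helper names `unit` and `dot` are illustrative.

```python
import math

def unit(vec):
    """Scale a sparse term-weight vector to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()}

def dot(a, b):
    """Dot product of two sparse vectors stored as dicts."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())

V_q = {"w1": 1.0, "w2": 1.0}  # unnormalized, unweighted query vector
v_q = unit(V_q)               # both components are sqrt(2)/2

# Hypothetical document vectors, normalized to unit length.
docs = {
    "d1": unit({"w1": 3.0, "w2": 1.0, "w3": 2.0}),
    "d2": unit({"w1": 1.0, "w3": 5.0}),
    "d3": unit({"w2": 2.0, "w4": 1.0}),
}

rank_unnormalized = sorted(docs, key=lambda d: dot(V_q, docs[d]), reverse=True)
rank_normalized = sorted(docs, key=lambda d: dot(v_q, docs[d]), reverse=True)
print(rank_unnormalized, rank_normalized)
```

Normalizing the query divides every score by the same positive constant, so the order of the documents cannot change and the normalization can be skipped when only the ranking matters.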


Efficient Scoring Algorithm

Using a heap, the top K answers can be selected efficiently: building the heap takes at most 2J comparisons, where J is the number of documents with nonzero scores, and each of the K top scores is then extracted with O(log J) comparisons
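A possible shape for such a scoring routine is sketched below, assuming unweighted query terms, postings that carry precomputed term weights, and a table of document vector lengths. The names `fast_cosine_top_k`, `postings`, and `length`, and all of the numbers, are illustrative assumptions.

```python
import heapq

def fast_cosine_top_k(query_terms, postings, length, k):
    """postings: term -> list of (doc_id, weight); length: doc_id -> vector length."""
    scores = {}
    for term in query_terms:
        for doc_id, weight in postings.get(term, []):
            # Unweighted query: simply add up the document-side weights.
            scores[doc_id] = scores.get(doc_id, 0.0) + weight
    # heapq is a min-heap, so store negated normalized scores.
    heap = [(-score / length[doc_id], doc_id) for doc_id, score in scores.items()]
    heapq.heapify(heap)  # linear-time heap construction over the J scored docs
    top = []
    for _ in range(min(k, len(heap))):
        neg_score, doc_id = heapq.heappop(heap)  # O(log J) per extraction
        top.append((doc_id, -neg_score))
    return top

# Hypothetical postings and document lengths.
postings = {
    "w1": [(1, 3.0), (2, 1.0)],
    "w2": [(1, 1.0), (3, 2.0)],
}
length = {1: 2.0, 2: 1.0, 3: 4.0}
top2 = fast_cosine_top_k(["w1", "w2"], postings, length, k=2)
print(top2)  # [(1, 2.0), (2, 1.0)]
```

Only documents that appear in at least one query term's posting list receive a score, so the heap holds J entries rather than one per document in the collection.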


Approximate Top-K Retrieval
•  Retrieve K documents that are likely to be among the K highest scoring documents
  –  Goal: lower the cost of answering a query
  –  The cosine measure is itself only an approximation of the information need
•  Major cost: computing cosine similarities between the query and a large number of documents
•  Approximation strategies
  –  Find a set A of documents that are contenders, where K < |A| ≪ N, such that A is likely to contain many documents with scores near those of the top K
  –  Return the top-K documents in A


Index Elimination
•  For a multi-term query q, we only need to consider documents containing at least one of the query terms
•  Only consider documents containing terms whose IDF exceeds a preset threshold
  –  Only check the discriminative words
  –  Benefit: the postings lists of low-IDF terms are generally long (many are stop words)
•  Only consider documents that contain many of the query terms
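The two pruning heuristics can be combined as in the sketch below. The base-10 IDF formula, the threshold values, the function names, and the toy index are assumptions chosen for illustration.

```python
import math

def idf(term, index, n_docs):
    """Inverse document frequency, here log10(N / df)."""
    return math.log10(n_docs / len(index[term]))

def candidates(query_terms, index, n_docs, idf_threshold, min_terms):
    """Docs containing at least min_terms of the high-IDF query terms."""
    kept = [t for t in query_terms
            if t in index and idf(t, index, n_docs) > idf_threshold]
    counts = {}
    for term in kept:
        for doc_id in index[term]:
            counts[doc_id] = counts.get(doc_id, 0) + 1
    return sorted(d for d, c in counts.items() if c >= min_terms)

# Hypothetical index over a 10-document collection.
n_docs = 10
index = {
    "the": list(range(1, 11)),  # stop word: appears everywhere, IDF = 0
    "brutus": [1, 2, 4],
    "calpurnia": [2, 3, 4],
}
query = ["the", "brutus", "calpurnia"]
strict = candidates(query, index, n_docs, idf_threshold=0.1, min_terms=2)
loose = candidates(query, index, n_docs, idf_threshold=0.1, min_terms=1)
print(strict, loose)  # [2, 4] [1, 2, 3, 4]
```

Dropping the low-IDF term means its long posting list is never traversed, which is where most of the saving comes from.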


Champion Lists
•  For each term t in the dictionary, precompute the top-r documents with the highest weights for t, where r is a preset parameter
  –  r can be set differently for different terms: larger for rare terms and smaller for frequent terms
•  Given a query q, let A be the union of the champion lists of the terms comprising q
  –  Compute cosine similarity only between q and the documents in A
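A minimal sketch of champion lists, assuming one global r for all terms (the slide allows r to vary per term) and a made-up index of weighted postings. The function names are illustrative.

```python
import heapq

def build_champion_lists(index, r):
    """index: term -> list of (doc_id, weight). Keep the r heaviest docs per term."""
    return {term: heapq.nlargest(r, plist, key=lambda posting: posting[1])
            for term, plist in index.items()}

def candidate_set(query_terms, champions):
    """Union of the champion lists of all query terms."""
    a = set()
    for term in query_terms:
        a.update(doc_id for doc_id, _ in champions.get(term, []))
    return a

# Hypothetical weighted postings.
index = {
    "brutus": [(1, 0.2), (2, 0.9), (4, 0.7), (5, 0.1)],
    "calpurnia": [(2, 0.3), (3, 0.8), (4, 0.6)],
}
champions = build_champion_lists(index, r=2)
A = candidate_set(["brutus", "calpurnia"], champions)
print(sorted(A))  # [2, 3, 4]
```

The champion lists are built once at indexing time; at query time only the documents in A are scored, so documents 1 and 5 are never considered here.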


Static Quality Scores and Ordering
•  Different documents have different importance
  –  Example: how good are reviews on a web page?
  –  Modeled by a quality measure g(d) ∈ [0, 1]
  –  $\mathrm{TotalScore}(q, d) = g(d) + \frac{\vec{V}(q) \cdot \vec{V}(d)}{|\vec{V}(q)| \times |\vec{V}(d)|}$
•  Sort documents in posting lists in descending order of g(d)
  –  Example (figure not preserved): suppose g(1) = 0.25, g(2) = 0.5, and g(3) = 1
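Because every posting list uses the same g(d) ordering, lists can still be intersected with a single merge pass. The sketch below uses the g values from the example; the posting lists, the cosine values, and the function names are assumptions for illustration.

```python
def order_by_quality(doc_ids, g):
    """Sort doc IDs by descending g(d), breaking ties by doc ID."""
    return sorted(doc_ids, key=lambda d: (-g[d], d))

def intersect(p1, p2, g):
    """Merge two posting lists that share the same g(d)-descending order."""
    key = lambda d: (-g[d], d)
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif key(p1[i]) < key(p2[j]):
            i += 1
        else:
            j += 1
    return out

def total_score(g_d, cosine):
    """TotalScore(q, d) = g(d) + cosine similarity of q and d."""
    return g_d + cosine

g = {1: 0.25, 2: 0.5, 3: 1.0}
brutus = order_by_quality([1, 2, 3], g)  # [3, 2, 1]
calpurnia = order_by_quality([1, 3], g)  # [3, 1]
common = intersect(brutus, calpurnia, g)
cosine = {3: 0.25, 1: 0.5}               # hypothetical cosine similarities
scores = {d: total_score(g[d], cosine[d]) for d in common}
print(common, scores)  # [3, 1] {3: 1.25, 1: 0.75}
```

High-quality documents come first in every list, so a traversal that stops early has already seen the documents most likely to end up with a high total score.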



Using Quality Score Ordering
•  For a well-chosen value r, maintain for each term t a global champion list of the top-r documents with the highest value of g(d) + TFIDF(t, d)
  –  At query time, only compute TotalScore for documents in the union of those global champion lists
•  Maintain for each term t two posting lists
  –  High list: the m documents with the highest TF values for t
  –  Low list: the other documents containing t
  –  Scan the high lists first; continue into the low lists only if the high lists yield fewer than K answers
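The high/low list scheme can be sketched as follows. The postings, the value of m, and the function names are hypothetical, and the candidate test is simplified to "appears in at least one query term's list".

```python
def split_high_low(postings, m):
    """postings: list of (doc_id, tf). High list = the m docs with the highest tf."""
    by_tf = sorted(postings, key=lambda posting: posting[1], reverse=True)
    return sorted(by_tf[:m]), sorted(by_tf[m:])

def candidates(query_terms, high, low, k):
    """Use the high lists alone if they yield at least k documents."""
    docs = set()
    for term in query_terms:
        docs.update(doc_id for doc_id, _ in high.get(term, []))
    if len(docs) < k:
        # Not enough answers: fall back to the low lists as well.
        for term in query_terms:
            docs.update(doc_id for doc_id, _ in low.get(term, []))
    return docs

# Hypothetical (doc_id, tf) postings.
index = {
    "brutus": [(1, 5), (2, 1), (4, 3)],
    "calpurnia": [(2, 4), (3, 1), (4, 2)],
}
high, low = {}, {}
for term, postings in index.items():
    high[term], low[term] = split_high_low(postings, m=1)

enough = candidates(["brutus", "calpurnia"], high, low, k=2)
fallback = candidates(["brutus", "calpurnia"], high, low, k=3)
print(sorted(enough), sorted(fallback))  # [1, 2] [1, 2, 3, 4]
```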



Tiered Indexes
•  Generalization of champion lists



Clustering and NN Search
•  Clustering
  –  Pick $\sqrt{N}$ documents as leaders at random from the collection
  –  For each document that is not a leader (called a follower), compute its nearest leader
  –  Each leader has, in expectation, about $N / \sqrt{N} = \sqrt{N}$ followers
  –  Alternatively, a follower can be assigned to its b1 nearest leaders
•  Query answering as nearest neighbor search
  –  For a query q, find the leader L (or the b2 leaders) closest to q by computing cosine similarities between q and the $\sqrt{N}$ leaders
  –  The candidate set A contains the closest leader and its followers
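A minimal sketch of this leader/follower scheme with b1 = b2 = 1. The dense two-dimensional vectors, the fixed random seed, and the function names are assumptions made for illustration.

```python
import math
import random

def cosine(a, b):
    """Cosine similarity of two nonzero dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def build_clusters(docs, seed=0):
    """Pick about sqrt(N) random leaders and attach each follower to its nearest leader."""
    rng = random.Random(seed)
    n_leaders = max(1, round(math.sqrt(len(docs))))
    leaders = rng.sample(sorted(docs), n_leaders)
    clusters = {leader: [] for leader in leaders}
    for doc_id, vec in docs.items():
        if doc_id not in clusters:
            nearest = max(leaders, key=lambda l: cosine(vec, docs[l]))
            clusters[nearest].append(doc_id)
    return clusters

def candidate_set(query, docs, clusters):
    """The leader closest to the query, followed by that leader's followers."""
    leader = max(clusters, key=lambda l: cosine(query, docs[l]))
    return [leader] + clusters[leader]

# Hypothetical collection of N = 9 documents as 2-D vectors.
docs = {
    1: (1.0, 0.1), 2: (0.9, 0.2), 3: (1.0, 0.3),
    4: (0.5, 0.5), 5: (0.6, 0.5), 6: (0.5, 0.6),
    7: (0.1, 1.0), 8: (0.2, 0.9), 9: (0.3, 1.0),
}
clusters = build_clusters(docs)
A = candidate_set((1.0, 0.0), docs, clusters)
print(clusters, A)
```

At query time the query is compared with the leaders only, and cosine scoring is then restricted to the documents in A.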





Putting All Together



Summary and To-Do-List
•  Query evaluation
  –  Document-at-a-time versus term-at-a-time
•  List skipping
•  Efficient scoring
•  Approximate top-K retrieval
  –  Index elimination, champion lists, quality score and ranking, clustering and nearest neighbor search
•  Read Section 5.7