J. Pei: Information Retrieval and Web Search -- Query Answering Using Inverted Indexes
Query Answering Using Inverted Indexes
Inverted Indexes
• Query: “Brutus” AND “Calpurnia”
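A conjunctive query such as “Brutus” AND “Calpurnia” is commonly answered by merging the two sorted posting lists. The sketch below is one minimal way to do that; the function name `intersect` and the doc IDs in the example are illustrative assumptions, not data from the slides.

```python
def intersect(p1, p2):
    """Merge two posting lists sorted by doc ID, keeping IDs present in both."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the list that is behind
        else:
            j += 1
    return answer

# Hypothetical posting lists (doc IDs are made up for illustration).
brutus = [1, 2, 4, 11, 31, 45, 173, 174]
calpurnia = [2, 31, 54, 101]
print(intersect(brutus, calpurnia))  # [2, 31]
```

The merge touches each posting at most once, so the cost is linear in the total length of the two lists.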
Document-at-a-time Evaluation
• The conceptually simplest query answering method
Algorithm
• Find the posting lists of the query terms
• Can be implemented efficiently by maintaining the top-k list at all times
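Document-at-a-time evaluation can be sketched as follows. This is a minimal illustration, not the algorithm figure from the slide: it assumes an in-memory index mapping each term to a list of `(doc_id, weight)` pairs sorted by doc ID, and it assumes a document's score is simply the sum of its term weights. The function name and the example data are hypothetical.

```python
import heapq

def document_at_a_time(index, query_terms, k):
    """Score one document at a time; keep only the k best in a min-heap."""
    lists = [index[t] for t in query_terms if t in index]
    positions = [0] * len(lists)          # one cursor per posting list
    top_k = []                            # min-heap of (score, doc_id)
    while True:
        heads = [lst[pos][0] for lst, pos in zip(lists, positions) if pos < len(lst)]
        if not heads:
            break                         # every list is exhausted
        doc = min(heads)                  # next document in doc-ID order
        score = 0.0
        for i, lst in enumerate(lists):
            pos = positions[i]
            if pos < len(lst) and lst[pos][0] == doc:
                score += lst[pos][1]      # assumed scoring: sum of weights
                positions[i] += 1
        if len(top_k) < k:
            heapq.heappush(top_k, (score, doc))
        elif (score, doc) > top_k[0]:
            heapq.heapreplace(top_k, (score, doc))
    return sorted(top_k, reverse=True)

# Hypothetical index with made-up weights.
dat_index = {
    "brutus": [(1, 0.5), (2, 2.0), (4, 1.0)],
    "calpurnia": [(2, 3.0), (5, 1.5)],
}
print(document_at_a_time(dat_index, ["brutus", "calpurnia"], 2))
```

Because each document's score is final as soon as it is computed, only the heap of at most k entries has to stay in memory.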
Term-at-a-time Evaluation
Algorithm
• Compute partial scores for one term at a time
• Can be implemented efficiently by maintaining the top-k list at all times
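Term-at-a-time evaluation can be sketched with a table of accumulators, one per document seen so far. As an assumption for illustration, the index maps each term to `(doc_id, weight)` pairs and the score is the sum of weights; the names and data are hypothetical.

```python
import heapq
from collections import defaultdict

def term_at_a_time(index, query_terms, k):
    """Process one posting list completely before moving to the next term."""
    accumulators = defaultdict(float)     # doc_id -> partial score
    for term in query_terms:
        for doc_id, weight in index.get(term, []):
            accumulators[doc_id] += weight
    # Scores are final only after the last list has been read.
    return heapq.nlargest(k, ((score, doc) for doc, score in accumulators.items()))

# Hypothetical index with made-up weights.
taat_index = {
    "brutus": [(1, 0.5), (2, 2.0), (4, 1.0)],
    "calpurnia": [(2, 3.0), (5, 1.5)],
}
print(term_at_a_time(taat_index, ["brutus", "calpurnia"], 2))
```

Each list is read sequentially from start to end, at the price of holding an accumulator for every document that contains a query term.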
Comparison
• Memory usage
  – Document-at-a-time only needs to maintain a priority queue R holding a limited number of results
  – Term-at-a-time needs to store the current scores for all documents
• Disk access
  – Document-at-a-time needs more disk seeks and more buffer space, since multiple lists are read in a synchronized way
  – Term-at-a-time reads through each inverted list from start to end, requiring minimal disk seeking and buffering
List Skipping
• Consider an inverted list of n bytes; suppose we add a skip pointer after every c bytes, and each pointer is k bytes long
  – Reading the whole list: Θ(n) bytes
  – Jumping through the list using the skip pointers: Θ(kn/c) = Θ(n) bytes
  – No asymptotic gain
  – When c is large and k is small, there may be a gain in practice
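The idea of jumping through a list with skip pointers can be sketched as follows. As a simplifying assumption, the sketch places a skip entry every `c` postings rather than every `c` bytes, and stores each entry as a `(position, doc_id)` pair; the function names and data are hypothetical.

```python
def build_skips(postings, c):
    """Return one (position, doc_id) skip entry for every c-th posting."""
    return [(i, postings[i]) for i in range(0, len(postings), c)]

def find_with_skips(postings, skips, target):
    """Hop over skip entries, then scan only the one interval that may hold target."""
    start, end = 0, len(postings)
    for pos, doc_id in skips:
        if doc_id > target:
            end = pos          # target, if present, lies before this entry
            break
        start = pos            # safe to jump here: doc_id <= target
    return target in postings[start:end]

# Hypothetical posting list, skip entry every 3 postings.
skip_postings = [2, 4, 8, 16, 19, 23, 28, 43, 50, 61]
skip_entries = build_skips(skip_postings, 3)
print(find_with_skips(skip_postings, skip_entries, 23))  # True
print(find_with_skips(skip_postings, skip_entries, 5))   # False
```

Only the skip entries and a single interval of at most `c` postings are examined per lookup, which is where the practical gain comes from.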
Big Skips
• If c gets too large, the average performance drops
• Consider finding p postings in a list of n bytes
  – There are n/c total intervals in the list
  – Need to read kn/c bytes of skip pointers
  – Need to read data in p intervals; assuming that, on average, the postings we want are about halfway between two skip pointers, we read an additional pc/2 bytes
  – The total number of bytes to read: kn/c + pc/2
  – When p is close to n/c, nearly every interval must be read, so skipping does not help
• Most disks require a skip of at least 100,000 postings before any speedup is seen
  – Skipping is still useful for reducing the time spent decoding compressed data and processing cached data
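The cost expression kn/c + pc/2 can be explored numerically. Setting its derivative with respect to c to zero gives c = √(2kn/p), which is a standard calculus step rather than something stated on the slide. The function names and the example values of n, k, and p are assumptions for illustration.

```python
import math

def bytes_read(n, c, k, p):
    """Skip-pointer bytes (k*n/c) plus data bytes in p half-read intervals (p*c/2)."""
    return k * n / c + p * c / 2

def best_interval(n, k, p):
    """Interval length minimizing bytes_read, from d/dc (k*n/c + p*c/2) = 0."""
    return math.sqrt(2 * k * n / p)

# Hypothetical list: 1,000,000 bytes, 4-byte pointers, 100 wanted postings.
n, k, p = 1_000_000, 4, 100
c_star = best_interval(n, k, p)
for c in (c_star / 10, c_star, c_star * 10):
    print(round(c), round(bytes_read(n, c, k, p)))
```

Intervals that are much shorter than the optimum waste bytes on pointers, while much longer ones waste bytes scanning inside each interval, matching the observation that overly large c hurts.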
Computing Cosine Score
Efficient Scoring
• For a query q = w1 w2
  – The unit vector v(q) has only two nonzero components
  – If query terms are not weighted, the nonzero components are equal to √2/2 ≈ 0.707
• Generally, for any two documents d1 and d2, with V(q) the unnormalized query vector
  – V(q) · v(d1) > V(q) · v(d2) if and only if v(q) · v(d1) > v(q) · v(d2)
• V(q) · v(d) is the weighted sum, over all terms in query q, of the weights of those terms in d
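The claim that normalizing the query does not change the ranking can be checked on a small example. The sketch below assumes sparse vectors stored as dictionaries; the three document vectors and all names are made up for illustration.

```python
import math

def normalize(vec):
    """Scale a sparse vector (term -> weight) to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()}

def dot(a, b):
    """Dot product of two sparse vectors."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())

V_q = {"w1": 1.0, "w2": 1.0}      # unweighted two-term query
v_q = normalize(V_q)              # both components become sqrt(2)/2

# Hypothetical documents, length-normalized.
unit_docs = {
    "d1": normalize({"w1": 3.0, "w2": 1.0, "w3": 2.0}),
    "d2": normalize({"w1": 1.0, "w3": 4.0}),
    "d3": normalize({"w2": 2.0, "w4": 1.0}),
}

rank_unnormalized = sorted(unit_docs, key=lambda d: dot(V_q, unit_docs[d]), reverse=True)
rank_normalized = sorted(unit_docs, key=lambda d: dot(v_q, unit_docs[d]), reverse=True)
print(v_q["w1"], rank_unnormalized, rank_normalized)
```

Since v(q) is V(q) divided by a positive constant, every score shrinks by the same factor, so the two orderings agree and the query normalization can be skipped.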
Efficient Scoring Algorithm
• Using a heap, the top K answers can be selected efficiently: building the heap takes about 2J comparisons, where J is the number of documents with nonzero scores, and each of the K answers is then read off with O(log J) comparisons
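A minimal sketch of scoring with an unnormalized query followed by heap-based top-K selection is given below. It assumes every query term has weight 1, that the index stores `(doc_id, weight)` pairs, and that document lengths are precomputed in a separate table; these names and the data are hypothetical, not the slide's algorithm figure.

```python
import heapq
from collections import defaultdict

def fast_cosine_top_k(index, doc_length, query_terms, k):
    """Accumulate term weights per document, normalize by length, pick top k."""
    scores = defaultdict(float)
    for term in query_terms:
        for doc_id, weight in index.get(term, []):
            scores[doc_id] += weight              # query weight assumed to be 1
    # Negate scores so Python's min-heap behaves as a max-heap.
    heap = [(-score / doc_length[d], d) for d, score in scores.items()]
    heapq.heapify(heap)                           # linear-time heap construction
    top = []
    for _ in range(min(k, len(heap))):
        neg_score, d = heapq.heappop(heap)        # logarithmic cost per answer
        top.append((d, -neg_score))
    return top

# Hypothetical index and document lengths.
fc_index = {"w1": [(1, 3.0), (2, 1.0)], "w2": [(1, 1.0), (3, 2.0)]}
fc_length = {1: 5.0, 2: 2.0, 3: 2.0}
print(fast_cosine_top_k(fc_index, fc_length, ["w1", "w2"], 2))
```

Only documents with a nonzero score ever enter the heap, so the selection cost depends on J rather than on the collection size.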
Approximate Top-K Retrieval
• Retrieve K documents that are likely to be among the K highest scoring documents
  – Goal: lower the cost of query answering
  – The cosine measure is itself only an approximation of the information need
• Major cost: computing cosine similarities between the query and a large number of documents
• Approximation strategies
  – Find a set A of documents that are contenders, where K < |A| ≪ N, such that A is likely to contain many documents with scores near those of the top K
  – Return the top-K documents in A
Index Elimination
• For a multi-term query q, we only need to consider documents containing at least one of the query terms
• Only consider documents containing terms whose IDF exceeds a preset threshold
  – Only check those discriminative words
  – Benefit: the postings lists of low-IDF terms are generally long (many are stop words)
• Only consider documents that contain many of the query terms
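The two pruning rules above can be combined in a short sketch. It assumes an index mapping each term to a list of doc IDs, uses a base-10 logarithm for IDF, and counts "many query terms" against the terms that survive the IDF filter; the threshold values, names, and data are all illustrative assumptions.

```python
import math

def idf(term, index, n_docs):
    """Inverse document frequency, log10(N / df)."""
    return math.log10(n_docs / len(index[term]))

def candidate_docs(index, query_terms, n_docs, idf_threshold, min_terms):
    """Keep docs containing at least min_terms of the high-IDF query terms."""
    kept = [t for t in query_terms
            if t in index and idf(t, index, n_docs) > idf_threshold]
    counts = {}
    for t in kept:                       # long low-IDF lists are never scanned
        for doc_id in index[t]:
            counts[doc_id] = counts.get(doc_id, 0) + 1
    return sorted(d for d, c in counts.items() if c >= min_terms)

# Hypothetical 10-document collection.
ie_index = {
    "the": list(range(1, 11)),
    "brutus": [1, 2, 4],
    "calpurnia": [2, 7],
}
ie_query = ["the", "brutus", "calpurnia"]
print(candidate_docs(ie_index, ie_query, 10, 0.3, 2))
print(candidate_docs(ie_index, ie_query, 10, 0.3, 1))
```

With these values the posting list for "the" is skipped entirely, which is the main saving the slide points to.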
Champion Lists
• For each term t in the dictionary, precompute the r documents with the highest weights for t, where r is a preset parameter
  – r can differ across terms: larger for rare terms and smaller for frequent terms
• Given a query q, let A be the union of the champion lists of the terms in q
  – Compute cosine similarity only between q and the documents in A
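Building champion lists and forming the candidate set A might look like the sketch below. For brevity it assumes a single value of r for all terms, although the slide allows r to vary by term; the function names and weights are hypothetical.

```python
import heapq

def build_champion_lists(index, r):
    """For each term, keep the r postings with the highest weights."""
    return {
        term: heapq.nlargest(r, postings, key=lambda posting: posting[1])
        for term, postings in index.items()
    }

def candidate_set(champions, query_terms):
    """A = union of the champion lists of the query terms."""
    docs = set()
    for term in query_terms:
        docs.update(doc_id for doc_id, _ in champions.get(term, []))
    return docs

# Hypothetical index: term -> [(doc_id, weight), ...]
cl_index = {
    "brutus": [(1, 0.2), (2, 0.9), (4, 0.5), (8, 0.1)],
    "calpurnia": [(2, 0.3), (5, 0.7), (9, 0.6)],
}
champions = build_champion_lists(cl_index, 2)
print(champions)
print(sorted(candidate_set(champions, ["brutus", "calpurnia"])))
```

The lists are built once at indexing time, so at query time only documents in A need a cosine computation.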
Static Quality Scores and Ordering
• Different documents have different importance
  – Example: how good are the reviews of a web page?
  – Modeled by a quality measure g(d) ∈ [0, 1]
  – TotalScore(q, d) = g(d) + (V(q) · V(d)) / (|V(q)| × |V(d)|)
• Sort documents in posting lists in descending order of g(d)
  – Example: suppose g(1) = 0.25, g(2) = 0.5, and g(3) = 1; then every posting list orders the documents it contains as 3, 2, 1
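Ordering every posting list by g(d) keeps a common order across lists, so lists can still be merged in one pass. The sketch below uses the g values from the slide; the posting lists, the tie-break on doc ID, and the function names are assumptions for illustration.

```python
g = {1: 0.25, 2: 0.5, 3: 1.0}            # quality scores from the example

def quality_key(doc_id):
    """Sort key: higher g(d) first, ties broken by doc ID (an assumption)."""
    return (-g[doc_id], doc_id)

def order_by_quality(doc_ids):
    return sorted(doc_ids, key=quality_key)

def intersect_ordered(p1, p2):
    """Merge two lists that share the g(d)-descending order."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif quality_key(p1[i]) < quality_key(p2[j]):
            i += 1
        else:
            j += 1
    return out

def total_score(doc_id, cosine):
    """TotalScore(q, d) = g(d) + cosine similarity of q and d."""
    return g[doc_id] + cosine

# Hypothetical posting lists for two terms.
list_a = order_by_quality([1, 2, 3])
list_b = order_by_quality([1, 3])
print(list_a, list_b, intersect_ordered(list_a, list_b))
```

Because high-quality documents come first, a scan that stops early has already seen the documents most likely to score well.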
Using Quality Score Ordering
• For a well-chosen value r, maintain for each term t a global champion list of the r documents with the highest values of g(d) + TFIDF(t, d)
  – At query time, only compute TotalScore for documents in the union of those global champion lists
• Maintain for each term t two posting lists
  – High list: the m documents with the highest TF values for t
  – Low list: the other documents containing t
  – Use the high lists alone if they yield at least K answers; otherwise continue with the low lists
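The high/low list scheme can be sketched as follows. As a simplifying assumption, a document counts as an answer if it appears on the high list of any query term; other variants require it to appear on several. All names and term frequencies are hypothetical.

```python
def split_postings(postings, m):
    """Split (doc_id, tf) postings into the m highest-TF ones and the rest."""
    ranked = sorted(postings, key=lambda posting: posting[1], reverse=True)
    return ranked[:m], ranked[m:]

def gather_candidates(high, low, query_terms, k):
    """Use high lists first; fall back to low lists if fewer than k docs found."""
    docs = set()
    for term in query_terms:
        docs.update(doc_id for doc_id, _ in high.get(term, []))
    if len(docs) >= k:
        return docs                       # low lists are never touched
    for term in query_terms:
        docs.update(doc_id for doc_id, _ in low.get(term, []))
    return docs

# Hypothetical index: term -> [(doc_id, tf), ...]
hl_index = {
    "brutus": [(1, 5), (2, 1), (4, 3)],
    "calpurnia": [(2, 4), (7, 2)],
}
hl_high, hl_low = {}, {}
for term, postings in hl_index.items():
    hl_high[term], hl_low[term] = split_postings(postings, 1)

print(sorted(gather_candidates(hl_high, hl_low, ["brutus", "calpurnia"], 2)))
print(sorted(gather_candidates(hl_high, hl_low, ["brutus", "calpurnia"], 3)))
```

The candidates returned would then be scored with TotalScore as usual; the sketch only shows how the two lists limit which documents are considered.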
Tiered Indexes
• Generalization of champion lists
Clustering and NN Search
• Clustering
  – Pick √N documents as leaders at random from the collection
  – For each document that is not a leader (called a follower), compute its nearest leader
  – Each leader has about N/√N = √N followers in expectation
  – Alternatively, a follower can be assigned to its b1 nearest leaders
• Query answering as nearest neighbor search
  – For a query q, find the leader L (or the b2 leaders) closest to q by computing cosine similarities between q and the √N leaders
  – The candidate set A contains the closest leader and its followers
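Cluster pruning with b1 = b2 = 1 can be sketched as below. The sketch assumes dense, nonzero document vectors held in a dictionary and a caller-supplied random generator so that runs are repeatable; the function names and the nine 2-D vectors are made up for illustration.

```python
import math
import random

def cosine(a, b):
    """Cosine similarity of two dense, nonzero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def build_clusters(docs, rng):
    """Pick about sqrt(N) random leaders; attach each follower to its nearest leader."""
    n_leaders = max(1, round(math.sqrt(len(docs))))
    leaders = rng.sample(sorted(docs), n_leaders)
    clusters = {leader: [] for leader in leaders}
    for doc_id, vec in docs.items():
        if doc_id not in clusters:
            nearest = max(leaders, key=lambda l: cosine(vec, docs[l]))
            clusters[nearest].append(doc_id)
    return clusters

def search(query, docs, clusters, k):
    """Compare q with the leaders only, then rank the closest leader's cluster."""
    leader = max(clusters, key=lambda l: cosine(query, docs[l]))
    candidates = [leader] + clusters[leader]      # the candidate set A
    ranked = sorted(candidates, key=lambda d: cosine(query, docs[d]), reverse=True)
    return ranked[:k]

# Hypothetical collection of N = 9 documents.
cp_docs = {
    1: [1.0, 0.1], 2: [0.9, 0.2], 3: [0.8, 0.3],
    4: [0.5, 0.5], 5: [0.4, 0.6], 6: [0.6, 0.4],
    7: [0.1, 1.0], 8: [0.2, 0.9], 9: [0.3, 0.8],
}
cp_clusters = build_clusters(cp_docs, random.Random(0))
print(cp_clusters)
print(search([1.0, 0.1], cp_docs, cp_clusters, 2))
```

The query is compared with roughly √N leaders plus one cluster instead of all N documents, which is the source of the savings; the result is approximate because the true nearest documents may sit in a neighboring cluster.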
Example
Putting It All Together
Summary and To-Do-List
• Query evaluation
  – Document-at-a-time versus term-at-a-time
• List skipping
• Efficient scoring
• Approximate top-K retrieval
  – Index elimination, champion lists, quality score and ranking, clustering and nearest neighbor search
• Read Section 5.7