Random Sampling from a Search Engine's Index
Ziv Bar-Yossef, Maxim Gurevich
Department of Electrical Engineering, Technion
Search Engine Samplers

[Diagram: a sampler issues queries to the search engine's public interface; the engine returns the top k results from its index of web documents D; the sampler outputs a random document x ∈ D.]
Motivation

Useful tool for search engine evaluation:
- Freshness: fraction of up-to-date pages in the index
- Topical bias: identification of overrepresented/underrepresented topics
- Spam: fraction of spam pages in the index
- Security: fraction of pages in the index infected by viruses/worms/trojans
- Relative size: number of documents indexed compared with other search engines
Size Wars

August 2005: We index 20 billion documents.
September 2005: We index 8 billion documents, but our index is 3 times larger than our competition's.

So, who's right?
Why Does Size Matter, Anyway?

- Comprehensiveness: a good crawler covers as many documents as possible
- Narrow-topic queries: e.g., get the homepage of John Doe
- Prestige: a marketing advantage
Measuring size using random samples [BharatBroder98, CheneyPerry05, GulliSignorini05]

Sample pages uniformly at random from the search engine's index. Two alternatives:
- Absolute size estimation: sample until a collision; a collision is expected after k ~ N^(1/2) random samples (birthday paradox); return k^2.
- Relative size estimation: check how many samples from search engine A are present in search engine B, and vice versa.
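The absolute size estimator can be sketched as follows; `sample_uniform` is a hypothetical stand-in for any uniform sampler over the index, simulated here with a toy index of N documents:

```python
import random

def estimate_size(sample_uniform, max_draws=10**6):
    """Absolute size estimation via the birthday paradox: draw
    uniform samples until the first collision; if the first repeat
    occurs on draw k, return k**2 as the index-size estimate."""
    seen = set()
    for k in range(1, max_draws + 1):
        doc = sample_uniform()
        if doc in seen:
            return k * k
        seen.add(doc)
    return None  # no collision within the draw budget

# Toy check against a simulated "index" of N documents.
random.seed(0)
N = 10_000
estimate = estimate_size(lambda: random.randrange(N))
```

A single run has high variance; in practice one would average k^2 over many independent repetitions.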
Other Approaches

- Anecdotal queries [SearchEngineWatch, Google, BradlowSchmittlein00]
- Queries from user query logs [LawrenceGiles98, DobraFeinberg04]
- Random sampling from the whole web [Henzinger et al 00, Bar-Yossef et al 00, Rusmevichientong et al 01]
The Bharat-Broder Sampler: Preprocessing Step

[Diagram: a lexicon L is built from a large corpus C; L stores each term with its corpus frequency: t1, freq(t1,C); t2, freq(t2,C); …]
The Bharat-Broder Sampler

[Diagram: the BB sampler draws two random terms t1, t2 from the lexicon L, sends the query "t1 AND t2" to the search engine, and returns a random document from the top k results.]

Samples are uniform only if:
- all queries return the same number of results ≤ k
- all documents are of the same length
The Bharat-Broder Sampler: Drawbacks

- Documents have varying lengths → bias towards long documents
- Some queries have more than k matches → bias towards documents with high static rank
Our Contributions

- A pool-based sampler: guaranteed to produce near-uniform samples (focus of this talk)
- A random walk sampler: after sufficiently many steps, guaranteed to produce near-uniform samples; does not need an explicit lexicon/pool at all!
Search Engines as Hypergraphs

[Diagram: queries "news", "google", "bbc", "maps" shown as hyperedges over documents such as www.cnn.com, news.google.com, news.bbc.co.uk, www.google.com, www.foxnews.com, maps.google.com, www.mapquest.com, en.wikipedia.org/wiki/BBC, www.bbc.co.uk, maps.yahoot.com.]

- results(q) = { documents returned on query q }
- queries(x) = { queries that return x as a result }
- P = query pool = a set of queries

Query pool hypergraph:
- Vertices: indexed documents
- Hyperedges: { results(q) | q ∈ P }
Query Cardinalities and Document Degrees

[Diagram: the same query pool hypergraph over the example documents.]

- Query cardinality: card(q) = |results(q)|
  card("news") = 4, card("bbc") = 3
- Document degree: deg(x) = |queries(x)|
  deg(www.cnn.com) = 1, deg(news.bbc.co.uk) = 2

Cardinality and degree are easily computable.
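A minimal sketch of this bookkeeping, using a hypothetical `results` map holding two of the example queries; `card` and `deg` reproduce the slide's values:

```python
# Toy query-pool hypergraph: two example queries and their results.
results = {
    "news": {"www.cnn.com", "news.google.com",
             "news.bbc.co.uk", "www.foxnews.com"},
    "bbc":  {"news.bbc.co.uk", "en.wikipedia.org/wiki/BBC",
             "www.bbc.co.uk"},
}

def card(q):
    """Query cardinality: card(q) = |results(q)|."""
    return len(results[q])

def deg(x):
    """Document degree: deg(x) = |queries(x)|."""
    return sum(x in docs for docs in results.values())
```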
Sampling documents uniformly

- Sampling documents from D uniformly: hard
- Sampling documents from D non-uniformly: easier
- Will show later: can sample documents proportionally to their degrees: p(x) = deg(x) / Σx' deg(x')
Sampling documents by degree

[Diagram: the example hypergraph; the total degree is 13, so sampling proportionally to degree gives p(news.bbc.co.uk) = 2/13 and p(www.cnn.com) = 1/13.]
Monte Carlo Simulation

- We need: samples from the uniform distribution
- We have: samples from the degree distribution
- Can we somehow use the samples from the degree distribution to generate samples from the uniform distribution?
- Yes! Monte Carlo simulation methods: Rejection Sampling, Importance Sampling, Metropolis-Hastings, Maximum-Degree
Monte Carlo Simulation

- π: target distribution (in our case: π = uniform on D)
- p: trial distribution (in our case: p = degree distribution)
- Bias weight of p(x) relative to π(x): w(x) = π(x) / p(x)

[Diagram: a p-sampler feeds weighted samples (x1, w(x1)), (x2, w(x2)), … into a Monte Carlo simulator, which outputs a sample x from π.]
Bias Weights

- Unnormalized forms of π and p: π(x) = π̂(x) / Zπ, p(x) = p̂(x) / Zp
  (Zπ, Zp: unknown normalization constants)
- Examples: π = uniform: π̂(x) = 1; p = degree distribution: p̂(x) = deg(x)
- Bias weight: w(x) = π̂(x) / p̂(x) = 1 / deg(x)
Rejection Sampling [von Neumann]

- C: envelope constant, C ≥ w(x) for all x
- The algorithm:
    accept := false
    while (not accept)
        generate a sample x from p
        toss a coin whose heads probability is w(x)/C
        if coin comes up heads, accept := true
    return x
- In our case: C = 1 and acceptance probability = 1/deg(x)
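The algorithm can be sketched in Python. The toy check below assumes a hypothetical three-document index with degrees 1, 2 and 3; drawing from the degree distribution and accepting with probability 1/deg(x) (C = 1) yields near-uniform counts:

```python
import random
from collections import Counter

def rejection_sample(sample_p, w, C):
    """Rejection sampling: draw x from the trial distribution p and
    accept with probability w(x)/C, where C >= w(x) for all x."""
    while True:
        x = sample_p()
        if random.random() < w(x) / C:
            return x

# Toy check: trial = degree distribution over {a, b, c} with
# degrees 1, 2, 3; weight w(x) = 1/deg(x) should un-bias it.
random.seed(1)
degree = {"a": 1, "b": 2, "c": 3}
by_degree = [x for x, d in degree.items() for _ in range(d)]
counts = Counter(
    rejection_sample(lambda: random.choice(by_degree),
                     lambda x: 1 / degree[x], C=1)
    for _ in range(30_000)
)
```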
Pool-Based Sampler

Degree distribution: p(x) = deg(x) / Σx' deg(x')

[Diagram: the pool-based sampler sends queries q1, q2, … to the search engine and receives results(q1), results(q2), …; a degree-distribution sampler produces documents with weights (x1, 1/deg(x1)), (x2, 1/deg(x2)), …, i.e. documents sampled from the degree distribution with corresponding weights; rejection sampling turns them into a uniform sample x.]
Sampling documents by degree

[Diagram: the example hypergraph.]

- Select a random query q; select a random x ∈ results(q)
- Documents with high degree are more likely to be sampled
- But if we sample q uniformly, we "oversample" documents that belong to narrow queries
- We need to sample q proportionally to its cardinality
Sampling documents by degree (2)

[Diagram: the example hypergraph.]

- Select a query q proportionally to its cardinality
- Select a random x ∈ results(q)
- Analysis: Pr[x] = Σq ∈ queries(x) (card(q) / Σq' card(q')) · (1 / card(q)) = deg(x) / Σq' card(q'), i.e. the degree distribution (note Σq' card(q') = Σx' deg(x'))
Degree Distribution Sampler

[Diagram: a cardinality-distribution sampler draws a query q, sends it to the search engine, receives results(q), and samples x uniformly from results(q); the output document x is distributed according to the degree distribution.]
Sampling queries by cardinality

- Sampling queries from the pool uniformly: easy
- Sampling queries from the pool by cardinality: hard; requires knowing the cardinalities of all queries in the search engine
- Use Monte Carlo methods to simulate biased sampling via uniform sampling:
  - Target distribution: the cardinality distribution
  - Trial distribution: uniform distribution on the query pool
Sampling queries by cardinality

- Bias weight of the cardinality distribution relative to the uniform distribution: w(q) = card(q)
- Can be computed using a single search engine query
- Use rejection sampling:
  - Envelope constant: C ≥ card(q) for all q
  - Queries are sampled uniformly from the pool
  - Each query q is accepted with probability w(q)/C = card(q)/C
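A sketch of this step, assuming a hypothetical two-query pool with known cardinalities and taking the envelope constant C as an upper bound on card(q):

```python
import random
from collections import Counter

def sample_query(pool, card, C):
    """Simulate cardinality-weighted query sampling: draw q
    uniformly from the pool and accept with probability card(q)/C,
    where C >= card(q) for every q (e.g. the result limit k)."""
    while True:
        q = random.choice(pool)           # uniform trial distribution
        if random.random() < card[q] / C:
            return q

# Toy check: a query with cardinality 3 should be sampled three
# times as often as a query with cardinality 1.
random.seed(2)
card = {"q1": 1, "q2": 3}
counts = Counter(sample_query(["q1", "q2"], card, C=3)
                 for _ in range(20_000))
```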
Complete Pool-Based Sampler

[Diagram: a uniform query sampler produces (q, card(q)), …; a first rejection-sampling step yields a query sampled from the cardinality distribution; its (q, results(q)), … feed the degree-distribution sampler, which outputs (x, 1/deg(x)), …, documents sampled from the degree distribution with corresponding weights; a second rejection-sampling step outputs x, a uniform document sample.]
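Putting the pieces together, a minimal end-to-end sketch under stated assumptions (toy `results` map, no overflowing queries, all names hypothetical):

```python
import random
from collections import Counter

def pool_based_sample(results, k):
    """End-to-end pool-based sampler sketch, assuming no query
    overflows (card(q) <= k): (1) uniform query + rejection with
    prob card(q)/k gives a cardinality-distributed query; (2) a
    uniform result of q gives a degree-distributed document;
    (3) rejection with prob 1/deg(x) gives a uniform document."""
    pool = sorted(results)
    deg = lambda x: sum(x in docs for docs in results.values())
    while True:
        q = random.choice(pool)                    # uniform query
        if random.random() >= len(results[q]) / k:
            continue                               # reject query
        x = random.choice(sorted(results[q]))      # degree-distributed
        if random.random() < 1 / deg(x):
            return x                               # uniform document

# Toy index: degrees are a:1, b:2, c:1, yet samples come out uniform.
random.seed(3)
results = {"q1": {"a", "b"}, "q2": {"b", "c"}}
counts = Counter(pool_based_sample(results, k=2) for _ in range(30_000))
```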
Dealing with Overflowing Queries

- Problem: some queries may overflow (card(q) > k) → bias towards highly ranked documents
- Solutions:
  - Select a pool P in which overflowing queries are rare (e.g., phrase queries)
  - Skip overflowing queries
  - Adapt rejection sampling to deal with approximate weights
- Theorem: samples of the PB sampler are at most β-away from uniform (β = overflow probability of P).
Creating the query pool

[Diagram: the query pool P (q1, q2, …) is extracted from a large corpus C.]

- Example: P = all 3-word phrases that occur in C
- If "to be or not to be" occurs in C, P contains: "to be or", "be or not", "or not to", "not to be"
- Choose P that "covers" most documents in D
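The 3-word-phrase construction can be sketched directly; the function below treats the corpus as a list of document strings:

```python
def phrase_pool(corpus, n=3):
    """Build a query pool of all n-word phrases occurring in the
    corpus: every run of n consecutive words in any document
    becomes one phrase query."""
    pool = set()
    for text in corpus:
        words = text.split()
        for i in range(len(words) - n + 1):
            pool.add(" ".join(words[i:i + n]))
    return pool

# The slide's example sentence yields exactly four 3-word phrases.
P = phrase_pool(["to be or not to be"])
```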
A random walk sampler

- Define a graph G over the indexed documents: (x,y) ∈ E iff queries(x) ∩ queries(y) ≠ ∅
- Run a random walk on G; its limit distribution = the degree distribution
- Use MCMC methods (Metropolis-Hastings, Maximum-Degree) to make the limit distribution uniform
- Does not need a preprocessing step, but is less efficient than the pool-based sampler
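A sketch of the Metropolis-Hastings variant on a hypothetical three-document graph; `neighbors` is a toy adjacency map, and deg(x) here means the number of graph neighbors of x:

```python
import random
from collections import Counter

def mh_step(neighbors, x):
    """One Metropolis-Hastings step: propose a uniform neighbor y
    of x and accept with probability min(1, deg(x)/deg(y)); this
    turns the walk's limit distribution from degree-proportional
    into uniform."""
    y = random.choice(neighbors[x])
    if random.random() < len(neighbors[x]) / len(neighbors[y]):
        return y
    return x

# Toy graph (path a-b-c) with unequal degrees; after many steps the
# walk's position is near-uniform over the three documents.
random.seed(4)
neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
counts = Counter()
for _ in range(3_000):
    x = "a"
    for _ in range(50):
        x = mh_step(neighbors, x)
    counts[x] += 1
```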
Bias towards Long Documents

[Bar chart: percent of sampled documents per size decile (documents ordered by size, deciles 1-10) for the pool-based, random walk, and Bharat-Broder samplers; y-axis from 0% to 60%.]
Relative Sizes of Google, MSN and Yahoo!

Google = 1, Yahoo! = 1.28, MSN Search = 0.73
Top-Level Domains in Google, MSN and Yahoo!

[Bar chart: percent of sampled documents per top-level domain (com, org, net, uk, edu, de, gov, au, ca, us, it, no, es, ie, info) for Google, MSN and Yahoo!; y-axis from 0% to 60%.]
Conclusions

- Two new search engine samplers: the pool-based sampler and the random walk sampler
- The samplers are guaranteed to produce near-uniform samples, under plausible assumptions
- The samplers show little or no bias in experiments
Thank You