lecture15-webchar-handout-6-per


Introduction to Information Retrieval    Sec. 19.5

Sampling URLs

- Ideal strategy: generate a random URL and check for containment in each index.
- Problem: random URLs are hard to find! It is enough to generate a random URL contained in a given engine.
- Approach 1: generate a random URL contained in a given engine
  - Suffices for the estimation of relative size
- Approach 2: random walks / IP addresses
  - In theory: might give us a true estimate of the size of the web (as opposed to just relative sizes of indexes)

Statistical methods

- Approach 1
  - Random queries
  - Random searches
- Approach 2
  - Random IP addresses
  - Random walks
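A minimal sketch of how containment checks on sampled URLs could give a relative size estimate, assuming a uniform sample from each index is available. The function names and the toy indexes are hypothetical, for illustration only. The reasoning: if a fraction x of A's sample is found in B and a fraction y of B's sample is found in A, then x·|A| ≈ y·|B|, so |A|/|B| ≈ y/x.

```python
from typing import Callable, Iterable


def containment_fraction(sample: Iterable[str], contains: Callable[[str], bool]) -> float:
    """Fraction of sampled URLs that the other index reports as present."""
    urls = list(sample)
    if not urls:
        raise ValueError("sample must not be empty")
    return sum(1 for u in urls if contains(u)) / len(urls)


def relative_size(sample_a, in_b, sample_b, in_a) -> float:
    """Estimate |A| / |B| from two samples and two containment checks."""
    x = containment_fraction(sample_a, in_b)  # share of A's sample found in B
    y = containment_fraction(sample_b, in_a)  # share of B's sample found in A
    if x == 0:
        raise ValueError("no overlap observed in A's sample; ratio undefined")
    return y / x


# Toy data (hypothetical): A holds 6 URLs, B holds 3, and 2 are shared.
index_a = {f"http://example.com/{i}" for i in range(6)}
index_b = {"http://example.com/0", "http://example.com/1", "http://example.org/x"}

# Using the full indexes as "samples" keeps the toy example deterministic.
ratio = relative_size(index_a, index_b.__contains__, index_b, index_a.__contains__)
print(ratio)
```

With real engines the samples would be far smaller than the indexes, and the estimate is only as good as the sampling is uniform.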
Random URLs from random queries

- Generate random query: how?
  - Lexicon: 400,000+ words from a web crawl (not an English dictionary)
  - Conjunctive queries: w1 and w2, e.g., vocalists AND rsi
- Get 100 result URLs from engine A
- Choose a random URL as the candidate to check for presence in engine B
- This distribution induces a probability weight W(p) for each page.

Query Based Checking
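The sampling steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the method of any real engine: `search_a` stands in for whatever query interface engine A exposes, and the three-page toy index and three-word lexicon are made up.

```python
import random
from typing import Callable, List, Optional, Sequence


def random_conjunctive_query(lexicon: Sequence[str], rng: random.Random) -> str:
    """Pick two distinct lexicon words and join them with AND."""
    w1, w2 = rng.sample(list(lexicon), 2)
    return f"{w1} AND {w2}"


def sample_url(search_a: Callable[[str], List[str]], lexicon: Sequence[str],
               rng: random.Random, max_results: int = 100) -> Optional[str]:
    """Query engine A and pick one of its top results uniformly at random."""
    results = search_a(random_conjunctive_query(lexicon, rng))[:max_results]
    return rng.choice(results) if results else None


# Toy stand-in for engine A (hypothetical): URL -> set of words on the page.
pages = {
    "http://example.com/a": {"vocalists", "rsi", "guitar"},
    "http://example.com/b": {"vocalists", "rsi"},
    "http://example.com/c": {"guitar", "rsi"},
}


def toy_search(query: str) -> List[str]:
    """Return every toy page that contains all query terms."""
    terms = query.split(" AND ")
    return sorted(url for url, words in pages.items() if all(t in words for t in terms))


rng = random.Random(0)
candidate = sample_url(toy_search, ["vocalists", "rsi", "guitar"], rng)
print(candidate)
```

In the toy index, page `a` matches every word pair while `b` and `c` match fewer, so `a` is drawn more often. That non-uniformity is the weight W(p) the slide refers to.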

- Strong query to check whether an engine B has a document D:
  - Download D. Get list of words.
  - Use 8 low-frequency words as an AND query to B
  - Check if D is present in result set.
- Problems:
  - ...
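A rough sketch of the strong-query check, under several assumptions that are not in the slides: word frequencies come from a hypothetical `doc_freq` table, words missing from that table are treated as rarest, tokenization is a simple lowercase split, and `search_b` stands in for engine B's query interface.

```python
import re
from typing import Callable, Dict, List


def strong_query(doc_text: str, doc_freq: Dict[str, int], k: int = 8) -> str:
    """Build an AND query from the k rarest distinct words in the document."""
    words = set(re.findall(r"[a-z0-9]+", doc_text.lower()))
    # Assumption: a word absent from the frequency table counts as frequency 0.
    rarest = sorted(words, key=lambda w: (doc_freq.get(w, 0), w))[:k]
    return " AND ".join(rarest)


def engine_has(url: str, doc_text: str, search_b: Callable[[str], List[str]],
               doc_freq: Dict[str, int]) -> bool:
    """Check whether the URL appears in B's results for the document's strong query."""
    return url in search_b(strong_query(doc_text, doc_freq))


# Toy stand-in for engine B (hypothetical): URL -> page text.
index_b = {
    "http://example.com/d": "rare zymurgy notes on quixotic vocalists and the rsi of web search",
    "http://example.com/e": "notes on the web and search",
}
# Hypothetical frequency table; higher means more common.
doc_freq = {"the": 1000, "and": 900, "on": 800, "of": 700,
            "web": 50, "search": 40, "notes": 30}


def toy_search_b(query: str) -> List[str]:
    """Return every toy page whose text contains all query terms."""
    terms = query.split(" AND ")
    return [u for u, text in index_b.items() if all(t in text.split() for t in terms)]


d_url = "http://example.com/d"
print(engine_has(d_url, index_b[d_url], toy_search_b, doc_freq))
```

Matching on the URL alone is a simplification. A real check would have to decide how to treat the same document reachable under different URLs.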

This document was uploaded on 02/26/2014.
