lecture15-webchar-handout-6-per

Count normalized URLs in result sets; use ratio statistics


Introduction to Information Retrieval Sec. 19.5

   The web is really infinite
      Dynamic content, e.g., calendars
      Soft 404: www.yahoo.com/<anything> is a valid page
   Static web contains syntactic duplication, mostly due to mirroring (~30%)
   Some servers are seldom connected

   Who cares?
      Media, and consequently the user
      Engine design
      Engine crawl policy. Impact on recall.

   Relative sizes of search engines
      The notion of a page being indexed is still reasonably well defined.
      Already there are problems
         Document extension: e.g., engines index pages not yet crawled, by indexing anchortext.
         Document restriction: All engines restrict what is indexed (first n words, only relevant words, etc.)
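To make the document restriction problem concrete, here is a minimal sketch of how two engines that crawled the same page can still disagree on whether it is "indexed" for a given term. The `build_index` helper, its `max_words` parameter, and the example URLs are all hypothetical, chosen only for illustration; real engines apply their own, undisclosed restriction rules.

```python
def build_index(docs, max_words=None):
    """Map each term to the set of URLs whose indexed text contains it.

    max_words=None indexes the full text. An integer keeps only the first
    max_words tokens of each document (an assumed restriction policy).
    """
    index = {}
    for url, text in docs.items():
        tokens = text.lower().split()
        if max_words is not None:
            tokens = tokens[:max_words]
        for token in tokens:
            index.setdefault(token, set()).add(url)
    return index


# Toy corpus with made-up URLs.
docs = {
    "http://example.com/a": "web search engines crawl and index pages",
    "http://example.com/b": "index size estimates depend on sampling",
}

full_index = build_index(docs)
restricted_index = build_index(docs, max_words=3)

# "pages" is the 7th word of document a, so the restricted engine misses it.
print(full_index.get("pages", set()))        # {'http://example.com/a'}
print(restricted_index.get("pages", set()))  # set()
```

Both toy engines hold the same URL, yet a term-based containment check would report the page as present in one and absent in the other.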
New definition?
   The statically indexable web is whatever search engines index.
      IQ is whatever the IQ tests measure.
   Different engines have different preferences
      max url depth, max count/host, anti-spam rules, priority rules, etc.
   Different engines index different things under the same URL:
      frames, meta-keywords, document restrictions, document extensions, ...

Relative Size from Overlap
Given two engines A and B
   Sample URLs randomly from A
   Check if contained in B and vice versa

   A∩B = (1/2) * Size A
   A∩B = (1/6) * Size B
   (1/2) * Size A = (1/6) * Size B
   ∴ Size A / Size B = (1/6)/(1/2) = 1/3

Each test involves: (i) Sampling (ii) Checking
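The overlap arithmetic above can be sketched in a few lines of Python. This is only an illustration under strong assumptions: each engine's index is modeled as an in-memory set of URL identifiers, sampling is perfectly uniform, and checking is an exact membership test. The function names `containment_fraction` and `relative_size` are made up for this sketch. In practice neither uniform sampling nor exact checking is available, which is why each test has to address both steps.

```python
import random
from fractions import Fraction


def containment_fraction(source_urls, target_index, sample_size, rng):
    """Sample URLs from one engine; return the fraction found in the other."""
    sample = rng.sample(sorted(source_urls), sample_size)
    hits = sum(1 for url in sample if url in target_index)
    return hits / sample_size


def relative_size(frac_a_in_b, frac_b_in_a):
    """Estimate Size A / Size B from the two overlap fractions.

    frac_a_in_b * Size A and frac_b_in_a * Size B both estimate the size
    of A∩B, so Size A / Size B is about frac_b_in_a / frac_a_in_b.
    """
    if frac_a_in_b == 0:
        raise ValueError("no sampled URL from A was found in B")
    return frac_b_in_a / frac_a_in_b


# The numbers from the slide: half of A is in B, one sixth of B is in A.
print(relative_size(Fraction(1, 2), Fraction(1, 6)))  # 1/3

# Toy engines (assumed data): 300 and 900 URLs sharing 150 of them.
engine_a = set(range(0, 300))
engine_b = set(range(150, 1050))

rng = random.Random(0)
frac_a_in_b = containment_fraction(engine_a, engine_b, 100, rng)
frac_b_in_a = containment_fraction(engine_b, engine_a, 100, rng)
print(relative_size(frac_a_in_b, frac_b_in_a))  # noisy estimate near 1/3
```

With samples of 100 URLs the estimate fluctuates around the true ratio of 1/3; larger samples tighten it.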
 4 Se...

This document was uploaded on 02/26/2014.
