lecture15-webchar-handout-6-per

... & overlap for individual queries
   Estimated index size ratio & overlap by averaging over all queries

Advantage
   Might be a better reflection of the human perception of coverage

Issues
   Samples are correlated with source of log
   Duplicates
   Technical statistical problems (must have non-zero results, ratio average not statistically sound)
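
The averaging step and its statistical issues can be made concrete with a small sketch. It assumes two engines, A and B, whose result URLs for each query have already been normalized. The function names, the definition of overlap (fraction of A's results also returned by B), and the sample URLs are illustrative assumptions, not taken from the study.

```python
from statistics import mean


def per_query_stats(results_a, results_b):
    """Return (size_ratio, overlap) for one query, or None if unusable.

    Assumed definitions: size_ratio = |A| / |B|, overlap = |A & B| / |A|.
    A query is unusable when either engine returns nothing, because the
    ratio or the overlap would then be undefined.
    """
    a, b = set(results_a), set(results_b)
    if not a or not b:
        return None
    return len(a) / len(b), len(a & b) / len(a)


def estimate(per_query_results):
    """Average per-query size ratio and overlap over all usable queries."""
    stats = [per_query_stats(a, b) for a, b in per_query_results]
    stats = [s for s in stats if s is not None]
    if not stats:
        raise ValueError("no query returned results from both engines")
    avg_ratio = mean([ratio for ratio, _ in stats])
    avg_overlap = mean([overlap for _, overlap in stats])
    return avg_ratio, avg_overlap, len(stats)


def pooled_ratio(per_query_results):
    """Ratio of total result counts, for comparison with the averaged ratio."""
    total_a = sum(len(set(a)) for a, _ in per_query_results)
    total_b = sum(len(set(b)) for _, b in per_query_results)
    return total_a / total_b


# Hypothetical result sets for three queries: (engine A, engine B).
queries = [
    (["u1", "u2", "u3", "u4"], ["u1", "u2"]),
    (["u5"], ["u5", "u6"]),
    (["u7", "u8"], []),  # dropped by estimate(): engine B has zero results
]
avg_ratio, avg_overlap, used = estimate(queries)
print(avg_ratio, avg_overlap, used, pooled_ratio(queries))
```

On this toy data the mean of the per-query ratios and the pooled ratio disagree, and one query is silently dropped. Both effects are instances of the technical statistical problems listed above.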
Introduction to Information Retrieval, Sec. 19.5

Queries from Lawrence and Giles study
   adaptive access control
   neighborhood preservation topographic
   hamiltonian structures
   right linear grammar
   pulse width modulation neural
   unbalanced prior probabilities
   ranked assignment method
   internet explorer favourites importing
   karvel thornber
   zili liu
   softmax activation function
   bose multidimensional system theory
   gamma mlp
   dvi2pdf
   john oliensis
   rieke spikes exploring neural
   video watermarking
   counterpropagation network
   fat shattering dimension
   abelson amorphous computing

Random IP addresses
   Generate random IP addresses
   Find a web server at the given address
      If there’s one
   Collect all pages from server
   From this, choose a page at random
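
The random-IP procedure can be sketched as follows. This is a minimal illustration, not the method of any particular study. `has_web_server` and `collect_pages` are hypothetical callables supplied by the caller (in practice an HTTP probe and a site crawler), and drawing uniformly from the full 32-bit IPv4 space, including reserved ranges, is a simplifying assumption.

```python
import ipaddress
import random


def random_ipv4(rng):
    """Draw one IPv4 address uniformly from the full 32-bit space."""
    return ipaddress.IPv4Address(rng.getrandbits(32))


def sample_page(rng, has_web_server, collect_pages, max_tries=1000):
    """Return (address, page) for one sampled page, or None if none is found.

    has_web_server(addr) -> bool and collect_pages(addr) -> list of URLs
    are hypothetical hooks; real versions would issue HTTP requests.
    """
    for _ in range(max_tries):
        addr = random_ipv4(rng)            # generate a random IP address
        if not has_web_server(addr):       # is there a web server at it?
            continue
        pages = collect_pages(addr)        # collect all pages from the server
        if pages:
            return addr, rng.choice(pages) # choose one page at random
    return None


# Stub hooks so the sketch runs offline: pretend a quarter of all
# addresses host a two-page site.
def fake_has_web_server(addr):
    return int(addr) % 4 == 0


def fake_collect_pages(addr):
    return [f"http://{addr}/", f"http://{addr}/about"]


rng = random.Random(0)
print(sample_page(rng, fake_has_web_server, fake_collect_pages))
```

Passing the probe and the crawler in as parameters keeps the sampling logic testable without network access.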
Random IP addresses
   HTTP requests to random IP addresses
      Ignored: empty or authorization required or excluded
      [Lawr99] Estimated 2.8 million IP addresses running crawlable web servers (16 million total) from observing 2500 servers.
      OCLC using IP sampling found 8.7 M hosts in 2001
         Netcraft [Netc02] accessed 37.2 million...

Advantages & disadvantages
   Advantages
      Clean statistics
      Independent of crawling strategies
   Disadvantages
      Doesn’t deal with duplication
      Many hosts might share one IP, or not accept requests
      No guarantee all pages are linked to root page.
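
Server-count estimates obtained from random IP sampling rest on proportional extrapolation: the fraction of probed addresses that answer is scaled up to the whole address space. The sketch below shows that generic estimator with a binomial standard error. The source does not give the probe counts or the exact procedure behind the quoted figures, so the numbers in the example are hypothetical.

```python
import math


def estimate_server_count(probes, hits, address_space=2**32):
    """Scale the observed hit fraction up to the whole address space."""
    if probes <= 0 or not 0 <= hits <= probes:
        raise ValueError("need probes > 0 and 0 <= hits <= probes")
    return hits / probes * address_space


def standard_error(probes, hits, address_space=2**32):
    """Binomial standard error of the estimate, assuming independent probes."""
    if probes <= 0 or not 0 <= hits <= probes:
        raise ValueError("need probes > 0 and 0 <= hits <= probes")
    p = hits / probes
    return math.sqrt(p * (1 - p) / probes) * address_space


# Hypothetical survey: 1,000,000 random probes, 250 crawlable servers found.
print(estimate_server_count(1_000_000, 250))
print(standard_error(1_000_000, 250))
```

The estimator counts IP addresses, not sites, so it inherits the disadvantage noted on the slides that many hosts might share one IP.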
This document was uploaded on 02/26/2014.
