lecture2-dictionary-handout-6-per


Sec. 2.2.3 Introduction to Information Retrieval

With a stop list, you exclude from the dictionary entirely the commonest words. Intuition:
  - They have little semantic content: the, a, and, to, be
  - There are a lot of them: ~30% of postings for top 30 words
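
To make the idea of a stop list concrete, here is a minimal sketch of an index builder that skips stop words when filling the dictionary. The stop list holds only the five example words above; the function name `build_index`, the whitespace tokenizer, and the three sample documents are assumptions made for illustration, not part of the lecture.

```python
from collections import defaultdict

# Stop list limited to the five example words named above (real lists are longer).
STOP_LIST = frozenset({"the", "a", "and", "to", "be"})


def build_index(docs, stop_list=STOP_LIST):
    """Map each non-stop term to the sorted list of IDs of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        # Hypothetical tokenizer: lowercase, then split on whitespace.
        for token in text.lower().split():
            if token not in stop_list:
                postings[token].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


# Made-up sample documents.
docs = {
    1: "the king and the queen",
    2: "a flight to london",
    3: "to be the king",
}
index = build_index(docs)
print(sorted(index))
print(index["king"])
```

Passing `stop_list=frozenset()` to the same function keeps every word, which gives a quick way to compare dictionary sizes with and without the stop list.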

But the trend is away from doing this:
  - Good compression techniques (lecture 5) mean the space for including stop words in a system is very small
  - Good query optimization techniques (lecture 7) mean you pay little at query time for including stop words
  - You need them for:
      - Phrase queries: “King of Denmark”
      - Various song titles, etc.: “Let it be”, “To be or not to be”
      - “Relational” queries: “flights to London”
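
The three query types above can be checked with a small sketch that shows what is left of each query once stop words are stripped. The stop list here is a hypothetical one (it adds words such as "of", "or", "not", "it" and "from" to the lecture's examples), and `strip_stop_words` is an illustrative helper, not a function from any library.

```python
# Hypothetical stop list, extended beyond the lecture's five examples.
STOP_WORDS = frozenset(
    {"a", "and", "be", "from", "it", "not", "of", "or", "the", "to"}
)


def strip_stop_words(query):
    """Lowercase the query, split on whitespace, and drop stop words."""
    return [t for t in query.lower().split() if t not in STOP_WORDS]


for q in [
    "King of Denmark",
    "Let it be",
    "To be or not to be",
    "flights to London",
    "flights from London",
]:
    print(q, "->", strip_stop_words(q))
```

Under these assumptions, "To be or not to be" is reduced to an empty query, and "flights to London" becomes indistinguishable from "flights from London", which is the loss of meaning the "relational" example points at.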

Sec. 2.2.3 Normalization to terms

We need to “normalize” words in indexed text as well as query words into the same form
  - We want to match U.S.A. and USA
Result is terms: a term is a (normalized) word type, which is an entry in our IR system dictionary
We most commonly implicitly define equivalence classes of terms by, e.g.,
  - deleting periods to form a term
      - U.S.A., USA → USA
  - deleting hyp...

Normalization: other languages

Accents: e.g., French résumé vs. resume
Umlauts: e.g., German: Tuebingen vs. Tübingen
  - Should be equivalent
Most important criterion:
  - How are your users likely to write their queries for these words?
Even in languages that standardly have accents, users often may not type them
  - Often best to normalize to a de-accented term

This document was uploaded on 02/26/2014.
