lecture2-dictionary-handout-6-per


... hyphens to form a term
- anti-discriminatory, antidiscriminatory → antidiscriminatory
- Tuebingen, Tübingen, Tubingen → Tubingen
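
The mapping of spelling variants onto one term can be sketched roughly as below. This is an illustration, not code from the lecture: the function names `strip_diacritics` and `normalize_token` are hypothetical, and the rules cover only diacritics and hyphens. The "ue" spelling (Tuebingen) is deliberately left out, since folding it would need a language-specific rule.

```python
import unicodedata

def strip_diacritics(token):
    """Remove combining marks, e.g. the umlaut in 'Tübingen'."""
    # NFKD splits 'ü' into 'u' plus a combining diaeresis, which we drop.
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def normalize_token(token):
    """Map a token to its equivalence-class term (hypothetical rules)."""
    token = strip_diacritics(token)
    # Delete the ASCII hyphen-minus and the Unicode hyphen U+2010.
    return token.replace("-", "").replace("\u2010", "")

for word in ["Tübingen", "Tubingen", "anti-discriminatory", "antidiscriminatory"]:
    print(word, "->", normalize_token(word))
```

Both spellings in each pair end up as the same string, so they would share one dictionary entry.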
Introduction to Information Retrieval (Sec. 2.2.3)

Normalization: other languages

- Normalization of things like date forms
  - 7月30日 vs. 7/30
  - Japanese use of kana vs. Chinese characters
- Tokenization and normalization may depend on the language, and so are intertwined with language detection
  - Morgen will ich in MIT …
  - Is this German "mit"?
- Crucial: need to "normalize" indexed text as well as query terms into the same form

Case folding (Sec. 2.2.3)

- Reduce all letters to lower case
  - Exception: upper case in mid-sentence?
    - e.g., General Motors
    - Fed vs. fed
    - SAIL vs. sail
  - Often best to lower case everything, since users will use lowercase regardless of 'correct' capitalization…
- Google example:
  - Query C.A.T.
  - #1 result was for "cat" (well, Lolcats), not Caterpillar Inc.
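
A minimal sketch of case folding applied the same way at index time and at query time, which is the "same form" requirement stated above. The toy documents, the whitespace tokenizer, and the names `normalize`, `build_index`, and `search` are assumptions made for illustration only.

```python
from collections import defaultdict

def normalize(token):
    """Case-fold a token; the one normalizer shared by indexing and querying."""
    return token.lower()

def build_index(docs):
    """Build a term -> set of doc ids index over whitespace-split tokens."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[normalize(token)].add(doc_id)
    return index

def search(index, query_term):
    """Look up a single query term after normalizing it like the indexed text."""
    return index.get(normalize(query_term), set())

# Hypothetical toy collection.
docs = {1: "General Motors", 2: "the Fed raised rates", 3: "fed the cat"}
index = build_index(docs)

print(sorted(search(index, "FED")))      # [2, 3]
print(sorted(search(index, "general")))  # [1]
```

The query "FED" retrieves both the Fed and fed documents, which shows the trade-off from the slide: lowercasing everything matches how users type, but the Fed/fed distinction is lost.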
Normalization to terms (Sec. 2.2.3)

- An alternative to equivalence classing is to do asymmetric expansion
- An example of where this may be useful
  - Enter: window     Search: window, windows
  - Enter: windows    Search: Windows, windows, window
  - Enter: Windows    Search: Windows

Thesauri and soundex

- Do we handle synonyms and homonyms?
  - E.g., by hand-constructed equivalence classes
    - car = automobile
    - color = colour
  - We can rewrite to form equivalence-class terms
    - When the document contains automobile, index it under car-automobile (and vice-versa)
  - Or we can expand a query
    - When the...
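
The two strategies above, rewriting to equivalence-class terms at index time and asymmetric expansion at query time, can be contrasted in a short sketch. The expansion table and the car/automobile class come from the slides. The toy documents, the `color-colour` label, and all function names are assumptions made for illustration.

```python
from collections import defaultdict

# Hand-constructed equivalence classes; "car-automobile" is the slide's label,
# "color-colour" is an assumed label built the same way.
EQUIVALENCE_CLASS = {
    "car": "car-automobile", "automobile": "car-automobile",
    "color": "color-colour", "colour": "color-colour",
}

# Asymmetric expansion table, taken from the window/windows/Windows example.
EXPANSIONS = {
    "window": ["window", "windows"],
    "windows": ["Windows", "windows", "window"],
    "Windows": ["Windows"],
}

def class_term(token):
    """Rewrite a token to its equivalence-class term, if it has one."""
    return EQUIVALENCE_CLASS.get(token, token)

def build_index(docs, rewrite=None):
    """Build a term -> doc ids index, optionally rewriting each token."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[rewrite(token) if rewrite else token].add(doc_id)
    return index

def search_expanded(index, query_term):
    """Union the postings of every variant the query term expands to."""
    hits = set()
    for variant in EXPANSIONS.get(query_term, [query_term]):
        hits |= index.get(variant, set())
    return hits

# Hypothetical toy collection.
docs = {1: "broken window", 2: "open windows", 3: "Windows update",
        4: "red automobile", 5: "fast car"}

raw_index = build_index(docs)                        # terms left untouched
class_index = build_index(docs, rewrite=class_term)  # terms rewritten to classes

print(sorted(search_expanded(raw_index, "windows")))   # [1, 2, 3]
print(sorted(search_expanded(raw_index, "Windows")))   # [3]
print(sorted(class_index[class_term("car")]))          # [4, 5]
```

With expansion, the index keeps the original terms and each query costs several lookups. With equivalence classing, one lookup suffices, but the mapping is symmetric, so a query for "Windows" could not be kept separate from "window".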

This document was uploaded on 02/26/2014.
