lecture2-dictionary-handout-6-per


But these tasks are often done heuristically …

… unit document?
   A file?
   An email? (Perhaps one of many in an mbox.)
   An email with 5 attachments?
   A group of files (PPT or LaTeX as HTML pages)

Introduction to Information Retrieval

TOKENS AND TERMS

Sec. 2.2.1  Tokenization
   Input: “Friends, Romans, Countrymen”
   Output: Tokens
      Friends   Romans   Countrymen
   A token is a sequence of characters in a document
   Each such token is now a candidate for an index entry, after further processing
      Described below
   But what are valid tokens to emit?
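
The Input/Output example above can be sketched in a few lines of Python. This is only a minimal illustration, not the tokenizer of any particular IR system: the function name tokenize is hypothetical, and treating every maximal run of word characters as a token is an assumption that sidesteps the harder cases (hyphens, apostrophes, numbers).

```python
import re

def tokenize(text):
    """Return the maximal runs of word characters in text, in order.

    Assumption for illustration: punctuation and whitespace only
    separate tokens and are never part of one.
    """
    return re.findall(r"\w+", text)

tokens = tokenize("Friends, Romans, Countrymen")
print(tokens)
```

Each element of the returned list is a candidate index entry; later processing decides which terms actually go into the index.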

Sec. 2.2.1  Tokenization
   Issues in tokenization:
      Finland’s capital → Finland? Finlands? Finland’s?
      Hewlett-Packard → Hewlett and Packard as two tokens?
         state-of-the-art: break up hyphenated sequence.
         co-education
         lowercase, lower-case, lower case?
         It can be effective to get the user to put in possible hyphens
      San Francisco: one token or two?
         How do you decide it is one token?

Sec. 2.2.1  Numbers
   3/12/91      Mar. 12, 1991      12/3/91
   55 B.C.
   B-52
   My PGP key is 324a3df234cb23e
   (800) 234-2333
      Often have embedded spaces
      Older IR systems may not index numbers
         But often very useful: think about things like looking up error codes/stacktraces on the web
         (One answer is using n-grams: Lecture 3)
   Will often index “meta-data” separately
      Creation date, format, etc.
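
To make the choices in the tokenization and number examples concrete, here is a small Python sketch. It does not pick a "right" answer; it only enumerates the candidate forms the slides ask about. All function names and the two number patterns (US-style phone numbers and slash-separated dates) are assumptions made for illustration; a real system would need many more patterns.

```python
import re

def hyphen_variants(token):
    """Candidate index forms for a hyphenated token (ASCII '-' or U+2010)."""
    parts = [p for p in re.split(r"[-\u2010]", token) if p]
    return {
        "keep": "-".join(parts),   # lower-case
        "split": parts,            # lower, case
        "joined": "".join(parts),  # lowercase
    }

def apostrophe_variants(token):
    """Candidate forms for a token containing an apostrophe."""
    normalized = token.replace("\u2019", "'")  # curly to straight apostrophe
    if "'" not in normalized:
        return [normalized]
    stem, _, suffix = normalized.partition("'")
    return [stem, stem + suffix, normalized]   # Finland, Finlands, Finland's

# Illustrative patterns only: "(800) 234-2333" and "3/12/91" style strings.
NUMBER_LIKE = r"\(\d{3}\)\s\d{3}-\d{4}|\d{1,2}/\d{1,2}/\d{2,4}"
TOKEN_PATTERN = re.compile(NUMBER_LIKE + r"|\w+")

def tokenize_keeping_numbers(text):
    """Tokenize, keeping phone numbers and dates as single tokens."""
    return TOKEN_PATTERN.findall(text)

print(hyphen_variants("lower-case"))
print(apostrophe_variants("Finland's"))
print(tokenize_keeping_numbers("Call (800) 234-2333 on 3/12/91"))
```

Because the number patterns are tried before the generic word pattern, a phone number with an embedded space survives as one token instead of being split into three.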