lecture2-dictionary-handout-6-per

223 Introduction to Information Retrieval


Introduction to Information Retrieval

Tokenization: language issues
   French
      L'ensemble → one token or two?
         L ? L’ ? Le ?
         Want l’ensemble to match with un ensemble
            Until at least 2003, it didn’t on Google
               Internationalization!
   Chinese and Japanese have no spaces between words:
      莎拉波娃现在居住在美国东南部的佛罗里达。
      Not always guaranteed a unique tokenization
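
The point that a string without spaces may not have a unique tokenization can be illustrated with a small sketch. It uses greedy maximum matching against a dictionary, run once left to right and once right to left. Maximum matching is only one simple segmentation heuristic and is not prescribed by the slide. The lexicon and the example string are hypothetical and chosen to make the ambiguity visible.

```python
# Toy lexicon: a hypothetical stand-in for a real segmentation dictionary.
LEXICON = {"研究", "研究生", "生命", "起源"}


def forward_max_match(text, lexicon=LEXICON):
    """Greedy left-to-right: take the longest lexicon word at each position."""
    max_len = max(map(len, lexicon))
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Fall back to a single character when no lexicon word matches.
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


def backward_max_match(text, lexicon=LEXICON):
    """Greedy right-to-left: take the longest lexicon word ending at each position."""
    max_len = max(map(len, lexicon))
    tokens, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                j -= size
                break
    return tokens[::-1]


text = "研究生命起源"
print(forward_max_match(text))   # ['研究生', '命', '起源']
print(backward_max_match(text))  # ['研究', '生命', '起源']
```

Both outputs are built only from dictionary words and single characters, yet they disagree. An indexer and a query processor therefore need to apply the same segmentation policy for terms to match.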

   Further complicated in Japanese, with multiple alphabets intermingled
      Dates/amounts in multiple formats
      フォーチュン500社は情報不足のため時間あた$500K(約6,000万円)
      Scripts labeled in the example: Katakana, Hiragana, Kanji, Romaji
      End-user can express query entirely in hiragana!
   German noun compounds are not segmented
      Lebensversicherungsgesellschaftsangestellter
      ‘life insurance company employee’
      German retrieval systems benefit greatly from a compound splitter module
         Can give a 15% performance boost for German
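
As a rough illustration of what a compound splitter module does, here is a minimal dictionary-based sketch. It assumes a tiny hypothetical lexicon and allows an optional linking "s" between parts. Real splitters use much larger lexicons, more linking elements, and frequency information to choose among candidate splits. The function name `split_compound` is made up for this example.

```python
# Hypothetical mini-lexicon; a real system would load a full word list.
LEXICON = {"leben", "versicherung", "gesellschaft", "angestellter"}
# Optional linking element between parts: none, or "s" (an assumption
# that covers this example only).
LINKERS = ("", "s")


def split_compound(word, lexicon=LEXICON):
    """Return the lexicon words that make up `word`, or None if no split is found."""
    word = word.lower()
    if word in lexicon:
        return [word]
    # Try each known prefix, longest first, optionally followed by a linker.
    for end in range(len(word) - 1, 0, -1):
        head = word[:end]
        if head not in lexicon:
            continue
        rest = word[end:]
        for linker in LINKERS:
            if not rest.startswith(linker):
                continue
            tail = split_compound(rest[len(linker):], lexicon)
            if tail is not None:
                return [head] + tail
    return None


print(split_compound("Lebensversicherungsgesellschaftsangestellter"))
# ['leben', 'versicherung', 'gesellschaft', 'angestellter']
```

Indexing the parts alongside the full compound would let a query such as Versicherung match a document that contains only the long compound.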
Sec. 2.2.1
Tokenization: language issues
   Arabic (or Hebrew) is basically written right to left, but with certain items like numbers written left to right
   Words are separated, but letter forms within a word form complex ligatures
      [Arabic example sentence not recoverable from the extracted text; reading-direction markers: ← → ← → ← start]
      ‘Algeria achieved its independence in 1962 after 132 years of French occupation.’
   With Unicode, the surface presentation is complex, but the stored form is...

Sec. 2.2.2
Stop words

This document was uploaded on 02/26/2014.
