Notes for LING post today

Document classification: we represent a document as a vector of term weights. Documents x, y, z each have a value for each word (e.g., "meeting" or "mortgage"); in the binary vector space model that value is 0 or 1:

    w1 = meeting, w2 = mortgage
    x = <1, 0>
    y = <0, 1>
    z = <1, 1>

We can improve on this by recording how many times a word occurred rather than just whether it is present.

Problem: this treats every word the same. Intuitively, some words will be better than others: some words are common across documents (e.g., "the"), but that doesn't mean every document is ABOUT "the". We could simply ignore these kinds of words (a stop list), but sometimes we don't want to do this. So we need "DEGREES of stopness":

Term frequency (tf): the number of times a word occurs in a particular document.
Document frequency: the number of documents a word occurs in.
Inverse document frequency: idf = log(N / n_i), where N is the total number of documents and n_i is the number of documents word i occurs in.

cos 90° = 0 and cos 180° = -1: the cosine of the angle between two vectors ranges between -1 and 1. The closer it is to 1, the more similar the documents are, so we can estimate document similarity by similarity in vector space.

The length of a vector is a function of document length: longer documents have more words and so look similar to more other documents. We are primarily interested in the difference in direction between vectors, not in their length.

We can also represent words as little documents. This reduces the classification task to a simple mathematical formula. However, there are problems: someone still has to decide what kinds of classes and keywords you want. Sparse data: a lot of the entries will be 0 (this is common), and words that don't occur in both documents contribute nothing to similarity. Big models.

Solution: pick keywords. One method: latent semantic analysis, which uses first- and second-order relations.

____word____...
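A minimal sketch of the tf, idf, and cosine ideas above, in Python. Only "meeting" and "mortgage" come from the notes; the toy corpus and the extra words ("agenda", "rate") are made up for illustration.

```python
import math

# Hypothetical toy corpus; only "meeting" and "mortgage" appear in the notes.
docs = {
    "x": ["meeting", "meeting", "agenda"],
    "y": ["mortgage", "rate"],
    "z": ["meeting", "mortgage"],
}

def tf(word, doc):
    """Term frequency: number of times `word` occurs in this document."""
    return doc.count(word)

def idf(word, docs):
    """Inverse document frequency: log(N / n_i), where N is the number of
    documents and n_i is the number of documents containing `word`.
    A word occurring in every document gets idf = log(1) = 0, which is the
    'degree of stopness' idea: maximally stop-like words are weighted out."""
    n_i = sum(1 for d in docs.values() if word in d)
    return math.log(len(docs) / n_i)

def tfidf_vector(doc, docs, vocab):
    """Represent a document as a vector of tf*idf term weights."""
    return [tf(w, doc) * idf(w, docs) for w in vocab]

def cosine(u, v):
    """Cosine of the angle between two vectors: dot product over lengths.
    Dividing by the lengths removes the effect of document length, so we
    compare direction, not magnitude."""
    dot = sum(a * b for a, b in zip(u, v))
    len_u = math.sqrt(sum(a * a for a in u))
    len_v = math.sqrt(sum(b * b for b in v))
    return dot / (len_u * len_v)

vocab = sorted({w for d in docs.values() for w in d})
vecs = {name: tfidf_vector(d, docs, vocab) for name, d in docs.items()}
```

With the binary vectors from the notes, x = <1, 0> and y = <0, 1> share no words, so `cosine([1, 0], [0, 1])` is 0, matching cos 90° = 0.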
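The notes end by naming latent semantic analysis as one method for picking keywords. A sketch of the core idea, assuming NumPy is available: take the term-document matrix, keep only the top singular values of its SVD, and compare documents in the reduced space, where second-order relations (words that co-occur with the same words) emerge. The matrix below is the binary example from the notes; the choice k = 1 is a toy assumption.

```python
import numpy as np

# Term-document matrix: rows = words (meeting, mortgage), columns = docs (x, y, z).
A = np.array([
    [1, 0, 1],   # meeting
    [0, 1, 1],   # mortgage
], dtype=float)

# LSA = truncated SVD: keep only the top-k singular values/vectors.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 1
docs_reduced = np.diag(s[:k]) @ Vt[:k]  # documents in the k-dim latent space

# In the full space, x and y are orthogonal (no shared words). In the reduced
# space both load on the latent dimension they share with z: a second-order
# relation recovered by the decomposition.
```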

This note was uploaded on 09/06/2009 for the course LING 571 taught by Professor Staff during the Fall '08 term at San Diego State.
