The more similar two documents are in topic, the more words they will share. We can estimate document similarity by measuring similarity in vector space. The raw length of a document's vector is a function of document length: longer documents contain more words, so they overlap with more documents. Since we are primarily interested in the direction of the vectors, not their length, similarity should be length-normalized (e.g., cosine similarity). We can also represent words themselves as "little documents" (vectors of the contexts they occur in). This reduces the classification task to a simple mathematical formula.

Problems:
- Someone still has to decide which classes and keywords to use.
- Sparse data: most entries in a vector are 0, and words that do not occur in both documents contribute nothing to the similarity.
- Big models.

Solution: pick keywords. One method: Latent Semantic Analysis, which exploits first- and second-order co-occurrence relations.
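A minimal sketch of the idea above, with invented example documents: each document becomes a term-count vector over a shared vocabulary, and cosine similarity compares vector direction rather than length, so document size does not dominate the score.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity: dot product of u and v divided by their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def doc_vector(tokens, vocab):
    """Represent a document as raw term counts over a fixed vocabulary."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]

# Toy corpus (made up for illustration): two documents on the same
# topic and one on a different topic.
doc1 = "the cat sat on the mat".split()
doc2 = "the cat sat on the cat".split()
doc3 = "stocks fell on wall street".split()
vocab = sorted(set(doc1) | set(doc2) | set(doc3))

v1, v2, v3 = (doc_vector(d, vocab) for d in (doc1, doc2, doc3))

# Topical overlap dominates once length is normalized away:
print(cosine(v1, v2))  # high: shared vocabulary
print(cosine(v1, v3))  # low: only "on" in common
```

Note that most entries in each vector are 0, which is exactly the sparse-data problem the notes mention; a real system would use tf-idf weighting or a dimensionality reduction such as LSA rather than raw counts.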
This note was uploaded on 09/06/2009 for the course LING 571 taught by Professor Staff during the Fall '08 term at San Diego State.