lecture21-annotated - 1 Eric Xing © Eric Xing @ CMU,...

Unformatted text preview:

Machine Learning 10-701/15-781, Fall 2008
Principal Components Analysis and Topic Models
Eric Xing
Lecture 21, November 24, 2008
Reading: Chap 12.1, CB book
Modified from www.cs.princeton.edu/picasso/mats/Lecture1_jps.ppt
© Eric Xing @ CMU, 2006-2008

The Problem: NLP and Data Mining
We want:
- Semantic-based search
- Infer topics and categorize documents
- Multimedia inference
- Automatic translation
- Predict how topics evolve
- …
[Figure: research topics over time, 1900 to 2000]

Modeling document collections
- A document collection is a dataset where each data point is itself a collection of simpler data.
  - Text documents are collections of words.
  - Segmented images are collections of regions.
  - User histories are collections of purchased items.
- Many modern problems ask questions of such data.
  - Is this text document relevant to my query?
  - Which documents are about a particular topic?
  - How have topics changed over time?
  - What does author X write about? Who is likely to write about topic Y? Who wrote this specific document?
  - Which category is this image in? Create a caption for this image.
  - What movies would I probably like?
  - and so on…

Text document retrieval
- Represent each document by a high-dimensional vector in the space of words

Example
- Relevant docs may not have the query terms → but may have many "related" terms
- Irrelevant docs may have the query terms → but may not have any "related" terms

Problems
- Looks for literal term matches
- Terms in queries (esp. short ones) don't always capture the user's information need well
- Problems:
  - Synonymy: other words with the same meaning
    - Car and automobile
    - No associations between words are made in the vector space representation.
  - Polysemy: the same word having other meanings
    - Apple (fruit and company)
    - The vector space model is unable to discriminate between different meanings of the same word.
- What if we could match against 'concepts' that represent related words, rather than the words themselves?

Latent Semantic Indexing (LSI) (Deerwester et al., 1990)
- Uses statistically derived conceptual indices instead of individual words for retrieval
- Assumes that there is some underlying or latent structure in word usage that is obscured by variability in word choice
- Key idea: instead of representing documents and queries as vectors in a t-dim space of terms
  - Represent them (and terms themselves) as vectors in a lower-dimensional space whose axes are concepts that effectively group together similar words
- Uses SVD to reduce document representations
- The axes are the Principal Components from SVD (singular value decomposition)...
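
The synonymy problem and the LSI idea described in the slides can be illustrated with a small sketch. This is not code from the lecture: the three-document corpus, the five-word vocabulary, the helper names (to_vector, cosine), and the choice of k = 2 concepts are all assumptions made for illustration. Projecting queries and documents onto the top-k left singular vectors is one common convention for comparing them in concept space; other scalings exist.

```python
import numpy as np

# Hypothetical toy vocabulary and corpus (not from the lecture).
terms = ["car", "automobile", "engine", "apple", "fruit"]
docs = [["car", "engine"], ["automobile", "engine"], ["apple", "fruit"]]

def to_vector(words):
    """Bag-of-words count vector in the space of terms."""
    v = np.zeros(len(terms))
    for w in words:
        v[terms.index(w)] += 1.0
    return v

def cosine(a, b):
    """Cosine similarity between two nonzero vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Term-document matrix: one column per document.
A = np.column_stack([to_vector(d) for d in docs])
q = to_vector(["car"])  # the query contains only the word "car"

# Literal term matching: cosine similarity in the original term space.
term_sims = [cosine(q, A[:, j]) for j in range(A.shape[1])]

# LSI-style matching: keep the top-k left singular vectors as "concept" axes
# and compare the query and documents after projecting onto them.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2  # assumed number of concepts for this toy example
U_k = U[:, :k]
q_k = U_k.T @ q        # query in concept space
docs_k = U_k.T @ A     # documents in concept space, one column each
lsi_sims = [cosine(q_k, docs_k[:, j]) for j in range(A.shape[1])]

print("term-space similarities:   ", np.round(term_sims, 3))
print("concept-space similarities:", np.round(lsi_sims, 3))
```

In this toy corpus the query "car" shares no term with the "automobile engine" document, so their term-space similarity is 0. Because "car" and "automobile" both co-occur with "engine", the two documents fall on the same concept axis after the rank-2 projection, and the concept-space similarity is close to 1, while the "apple fruit" document stays unrelated.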
This note was uploaded on 01/26/2010 for the course MACHINE LE 10701 taught by Professor Eric P. Xing during the Fall '08 term at Carnegie Mellon.

