CS4487 - Machine Learning
Lecture 7 - Linear Dimensionality Reduction
Dr. Antoni B. Chan
Dept. of Computer Science, City University of Hong Kong

Outline
1. Linear Dimensionality Reduction for Vectors
   A. Principal Component Analysis (PCA)
   B. Random Projections
   C. Fisher's Linear Discriminant (FLD)
2. Linear Dimensionality Reduction for Text
   A. Latent Semantic Analysis (LSA)
   B. Non-negative Matrix Factorization (NMF)
   C. Latent Dirichlet Allocation (LDA)

Dimensionality Reduction
Goal: Transform high-dimensional vectors into low-dimensional vectors.
- Dimensions in the low-dim data represent co-occurring features in the high-dim data.
- Dimensions in the low-dim data may have semantic meaning.
- Example: document analysis
  - high-dim: bag-of-words vectors of documents
  - low-dim: each dimension represents similarity to a topic.
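All of the methods in the outline share the same linear form: the low-dim vector is a matrix projection of the high-dim vector. A minimal illustrative sketch with toy data (the projection matrix here is random, as in Random Projections; PCA, FLD, and LSA each choose it differently):

import numpy as np

rng = np.random.default_rng(0)

# toy high-dim data: 100 samples in 50 dimensions
X = rng.random((100, 50))

# linear dimensionality reduction: z_i = W^T x_i, mapping 50 dims -> 5 dims
W = rng.normal(size=(50, 5))   # random projection matrix (one possible choice)
Z = X @ W                      # low-dim representations, shape (100, 5)
print(Z.shape)                 # -> (100, 5)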
Latent Semantic Analysis (LSA)
Also called Latent Semantic Indexing.
- Consider a bag-of-words representation (e.g., TF, TF-IDF): document vector $x_i$, where $x_{i,j}$ is the frequency of word $j$ in document $i$.
- Approximate each document vector as a weighted sum of topic vectors:
  $\hat{x}_i = \sum_{p=1}^{n} w_p v_p$
  - Topic vector $v_p$ contains co-occurring words and corresponds to a particular topic or theme.
  - Weight $w_p$ represents the similarity of the document to the p-th topic.
- Objective: minimize the squared reconstruction error (similar to PCA):
  $\min_{v,w} \sum_i \| x_i - \hat{x}_i \|^2$
- Represent each document by its topic weights, then apply other machine learning algorithms...
- Advantage: finds relations between terms (synonymy and polysemy).
  - Distances/similarities now compare topics rather than words.
  - Higher-level semantic representation.
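To make the objective concrete, a small numpy sketch (toy numbers, hypothetical variable names) that evaluates the squared reconstruction error for given topic vectors and per-document weights:

import numpy as np

rng = np.random.default_rng(0)
n_docs, n_words, n_topics = 100, 50, 5

X = rng.random((n_docs, n_words))    # rows are document vectors x_i
V = rng.random((n_topics, n_words))  # rows are topic vectors v_p
W = rng.random((n_docs, n_topics))   # W[i, p]: weight of topic p for document i

Xhat = W @ V                   # row i is the reconstruction sum_p W[i, p] * v_p
err = ((X - Xhat) ** 2).sum()  # objective: sum_i ||x_i - xhat_i||^2
print(err)

LSA searches over V and W to minimize this error, just as the slide's objective states.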
Example on Spam Email dataset
- use bag-of-words representation with 50 words
- term-frequency (TF) normalization

In [2]:
from sklearn import datasets, feature_extraction

# Load spam/ham text data from directories
textdata = datasets.load_files("email", encoding="utf8", decode_error="replace")

# convert to bag-of-words representation
cntvect = feature_extraction.text.CountVectorizer(stop_words='english', max_features=50)
X = cntvect.fit_transform(textdata.data)
Y = textdata.target

# TF representation
tf_trans = feature_extraction.text.TfidfTransformer(norm='l1', use_idf=False)
Xtf = tf_trans.fit_transform(X)

# print the vocabulary
print(cntvect.vocabulary_)

LSA on Spam data
Apply LSA with 5 topics.
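A minimal sketch of this step, assuming LSA is implemented with scikit-learn's TruncatedSVD (a standard choice; the lecture's own code may differ). Variable names continue from the setup cell above:

from sklearn.decomposition import TruncatedSVD

# LSA = truncated SVD of the TF document-term matrix, with 5 topics
lsa = TruncatedSVD(n_components=5, random_state=0)
W = lsa.fit_transform(Xtf)   # per-document topic weights, shape (n_docs, 5)
V = lsa.components_          # topic vectors, shape (5, n_words)

# show the highest-weight words in each topic vector
vocab = cntvect.get_feature_names_out()
for p, topic in enumerate(V):
    top = topic.argsort()[::-1][:10]
    print("topic %d:" % p, ", ".join(vocab[j] for j in top))

TruncatedSVD works directly on sparse matrices without centering them, which is why it is the usual implementation of LSA on bag-of-words data.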