Ch27b_ir2-vectorspace-95

Database Management Systems, R. Ramakrishnan

Computing Relevance, Similarity: The Vector Space Model
Chapter 27, Part B
Based on Larson and Hearst’s slides at UC-Berkeley
http://www.sims.berkeley.edu/courses/is202/f00/

Document Vectors
- Documents are represented as “bags of words”
- Represented as vectors when used computationally
    - A vector is like an array of floating-point numbers
    - Has direction and magnitude
    - Each vector holds a place for every term in the collection
    - Therefore, most vectors are sparse

Document Vectors: One location for each word.

      nova  galaxy  heat  h’wood  film  role  diet  fur
  A    10     5      3

- “Nova” occurs 10 times in text A
- “Galaxy” occurs 5 times in text A
- “Heat” occurs 3 times in text A
- (Blank means 0 occurrences.)
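The "one location for each word" layout can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the vocabulary is the eight terms shown on the slide, tokenization is a plain whitespace split, and the helper names `to_sparse_vector` and `to_dense_vector` as well as the sample text are hypothetical. The sparse form stores only non-zero counts, which is why it suits vectors that are mostly empty.

```python
from collections import Counter

# Term order taken from the slide; a real collection would have far more terms.
VOCABULARY = ["nova", "galaxy", "heat", "h'wood", "film", "role", "diet", "fur"]


def to_sparse_vector(text):
    """Count vocabulary terms in the text; terms that never occur are not stored."""
    vocab = set(VOCABULARY)
    tokens = text.lower().split()  # assumption: naive whitespace tokenization
    return Counter(token for token in tokens if token in vocab)


def to_dense_vector(sparse):
    """Expand a sparse count mapping into one float slot per vocabulary term."""
    return [float(sparse.get(term, 0)) for term in VOCABULARY]


# Hypothetical text built to reproduce the counts shown for text A.
text_a = " ".join(["nova"] * 10 + ["galaxy"] * 5 + ["heat"] * 3)

sparse_a = to_sparse_vector(text_a)
dense_a = to_dense_vector(sparse_a)

print(dict(sparse_a))  # only the three non-zero entries are stored
print(dense_a)         # eight slots, five of them 0.0
```

The dense list mirrors the row for text A on the slide, with blanks written as 0.0.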
Document Vectors

[Table: term counts for documents A to I (document ids) over the terms nova, galaxy, heat, h’wood, film, role, diet, fur. Column alignment was lost in extraction; the cell values in reading order are: 10 5 3 51 0 10 8 7 91 0 5 10 10 0 57 9 61 0 28 75 1 3]

We Can Plot the Vectors

[Figure: documents plotted on the axes “Star” and “Diet”, with points labeled “Doc about astronomy”, “Doc about movie stars”, and “Doc about mammal behavior”.]

Assumption: Documents that are “close” in space are similar.

Vector Space Model
- Documents are represented as vectors in term space
    - Terms are usually stems
    - Documents represented by binary vectors of terms
- Queries represented the same as documents
- A vector distance measure between the query and documents is used to rank retrieved documents
    - Query and document similarity is based on length and direction of their vectors
    - Vector operations to capture boolean query conditions
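The idea that "close" documents are similar, and that a distance measure ranks documents against a query, might look like the sketch below. The two-dimensional coordinates on the "Star" and "Diet" axes are hypothetical values chosen for illustration, not data from the slides. Euclidean distance depends on both length and direction of the vectors; cosine similarity, a common alternative that the visible slides do not name, depends on direction only.

```python
import math


def euclidean_distance(u, v):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical (star, diet) term counts for three documents and a query.
docs = {
    "astronomy": (9.0, 0.0),
    "movie stars": (8.0, 2.0),
    "mammal behavior": (1.0, 9.0),
}
query = (5.0, 1.0)

# Smaller distance means closer, so sort ascending.
by_distance = sorted(docs, key=lambda name: euclidean_distance(query, docs[name]))

# Larger cosine means more similar, so sort descending.
by_cosine = sorted(docs, key=lambda name: cosine_similarity(query, docs[name]), reverse=True)

print(by_distance)
print(by_cosine)
```

With these made-up coordinates both measures put the two "star" documents ahead of the mammal-behavior document, which matches the intuition in the plot.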
Vector Space Documents and Queries

  docs   t1   t2   t3   RSV=Q.Di
  D1      1    0    1      4
  D2      1    0    0      1
  D3      0    1    1      5
  D4      1    0    0      1
  D5      1    1    1      6
  D6      1    1    0      3
  D7      0    1    0      2
  D8      0    1    0      2
  D9      0    0    1      3
  D10     0    1    1      5
  D11     1    0    1      4
  Q       1    2    3
         q1   q2   q3

[Figure: documents D1 to D11 plotted in the space spanned by the axes t1, t2, t3, grouped by Boolean term combinations.]

Q is a query, also represented as a vector.

Assigning Weights to Terms
1. Binary Weights
2. Raw term frequency
3. tf x idf
    - Recall the Zipf distribution
    - Want to weight terms highly if they are
        - frequent in relevant documents … BUT
        - infrequent in the collection as a whole

Binary Weights
- Only the presence (1) or absence (0) of a term is included in the vector

  docs   t1   t2   t3
  D1      1    0    1
  D2      1    0    0
  D3      0    1    1
  D4      1    0    0
  D5      1    1    1
  D6      1    1    0
  D7      0    1    0
  D8      0    1    0
  D9      0    0    1
  D10     0    1    1
  D11     1    0    1
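The retrieval status value RSV = Q.Di is a plain dot product between the query vector and each document vector. The sketch below computes it for the binary document vectors and the query Q = (1, 2, 3) from the table; the names `DOCS`, `QUERY`, and `rsv` are illustrative choices, not taken from the slides.

```python
# Binary term vectors (t1, t2, t3) for D1..D11, as listed in the table.
DOCS = {
    "D1": (1, 0, 1),
    "D2": (1, 0, 0),
    "D3": (0, 1, 1),
    "D4": (1, 0, 0),
    "D5": (1, 1, 1),
    "D6": (1, 1, 0),
    "D7": (0, 1, 0),
    "D8": (0, 1, 0),
    "D9": (0, 0, 1),
    "D10": (0, 1, 1),
    "D11": (1, 0, 1),
}

# Query weights (q1, q2, q3) from the table.
QUERY = (1, 2, 3)


def rsv(query, doc):
    """Retrieval status value: the dot product of query and document vectors."""
    return sum(q * d for q, d in zip(query, doc))


scores = {name: rsv(QUERY, vec) for name, vec in DOCS.items()}

# Rank documents from highest to lowest score; sorted() is stable,
# so ties keep the D1..D11 listing order.
ranking = sorted(scores.items(), key=lambda item: item[1], reverse=True)

for name, score in ranking:
    print(f"{name:>3}  {score}")
```

Because the document vectors are binary, each score is simply the sum of the query weights for the terms the document contains, so D5, which contains all three terms, ranks first.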
Raw Term Weights
- The frequency of occurrence for the term in each document is included in the vector

  docs   t1   t2   t3
  D1      2    0    3
  D2      1    0    0
  D3      0    4    7
  D4      3    0    0
  D5      1    6    3
  D6      3    5    0
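Raw counts like these are the "tf" half of the tf x idf scheme listed among the weighting options. The preview does not show how the slides define idf, so the sketch below assumes one common variant, idf = ln(N / df), where N is the number of documents and df is the number of documents containing the term. It uses only the six rows visible in the table, and the names `RAW_COUNTS`, `document_frequencies`, and `tf_idf` are hypothetical.

```python
import math

# Raw term frequencies (t1, t2, t3) for the six rows visible in the table.
RAW_COUNTS = {
    "D1": (2, 0, 3),
    "D2": (1, 0, 0),
    "D3": (0, 4, 7),
    "D4": (3, 0, 0),
    "D5": (1, 6, 3),
    "D6": (3, 5, 0),
}


def document_frequencies(counts):
    """For each term position, count the documents in which the term occurs."""
    num_terms = len(next(iter(counts.values())))
    return [sum(1 for vec in counts.values() if vec[i] > 0) for i in range(num_terms)]


def tf_idf(counts):
    """Weight each raw count by ln(N / df); this idf form is an assumption."""
    n_docs = len(counts)
    df = document_frequencies(counts)
    # A term that occurs in no document gets idf 0 to avoid dividing by zero.
    idf = [math.log(n_docs / d) if d > 0 else 0.0 for d in df]
    weights = {
        name: tuple(tf * w for tf, w in zip(vec, idf))
        for name, vec in counts.items()
    }
    return df, idf, weights


df, idf, weights = tf_idf(RAW_COUNTS)

print("df :", df)
print("idf:", [round(w, 3) for w in idf])
for name, vec in weights.items():
    print(name, [round(w, 3) for w in vec])
```

In this small sample t1 appears in five of the six documents, so its idf is low and its counts are discounted relative to t2 and t3, which is the effect the weighting aims for: terms that are frequent in a document but infrequent in the collection score highest.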

This note was uploaded on 02/06/2010 for the course CSE 302 taught by Professor Joel during the Summer '05 term at Punjab Engineering College.
