2 - Preliminaries Information Retrieval Introduction Text...

Preliminaries: Information Retrieval

Introduction
• Text mining refers to data mining that uses text documents as data.
• Most text mining tasks use Information Retrieval (IR) methods to preprocess text documents.
• These methods differ substantially from the traditional preprocessing methods used for relational tables.
• Web search also has its roots in IR.
Information Retrieval (IR)
• Conceptually, IR is the study of finding needed information, i.e., IR helps users find information that matches their information needs.
  o Information needs are expressed as queries.
• Historically, IR is about document retrieval, emphasizing the document as the basic unit.
  o Finding documents relevant to user queries
• Technically, IR studies the acquisition, organization, storage, retrieval, and distribution of information.

IR architecture
IR queries
• Keyword queries
• Boolean queries (using AND, OR, NOT)
• Phrase queries
• Proximity queries
• Full document queries
• Natural language questions

Information retrieval models
• An IR model governs how a document and a query are represented and how the relevance of a document to a user query is defined.
• Main models:
  o Boolean model
  o Vector space model
  o etc.
Boolean model
• Each document or query is treated as a bag of words or terms. Word sequence is not considered.
• Given a collection of documents $D$, let $V = \{t_1, t_2, \ldots, t_{|V|}\}$ be the set of distinct words/terms in the collection. $V$ is called the vocabulary.
• A weight $w_{ij} > 0$ is associated with each term $t_i$ of a document $d_j \in D$. For a term that does not appear in document $d_j$, $w_{ij} = 0$. Each document is thus a vector $d_j = (w_{1j}, w_{2j}, \ldots, w_{|V|j})$.
• Boolean model: the weight is Boolean, i.e., $w_{ij} = 1$ if $t_i$ appears in $d_j$ and $w_{ij} = 0$ otherwise.
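The representation above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the helper names `build_vocabulary` and `boolean_vector`, the sample documents, and the whitespace tokenization with lowercasing are all assumptions made for the example.

```python
def build_vocabulary(docs):
    """Return the sorted list of distinct terms in the collection (the vocabulary V)."""
    return sorted({term for doc in docs for term in doc.lower().split()})


def boolean_vector(doc, vocabulary):
    """Return the 0/1 weight vector of one document over the vocabulary."""
    terms = set(doc.lower().split())  # bag of words: order and counts are ignored
    return [1 if t in terms else 0 for t in vocabulary]


# Hypothetical two-document collection D.
docs = ["data mining finds patterns", "text mining uses text data"]
V = build_vocabulary(docs)
vectors = [boolean_vector(d, V) for d in docs]

print(V)        # ['data', 'finds', 'mining', 'patterns', 'text', 'uses']
print(vectors)  # [[1, 1, 1, 1, 0, 0], [1, 0, 1, 0, 1, 1]]
```

Note that "text" occurs twice in the second document, yet its weight is still 1: the Boolean model records only presence or absence.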

Boolean model (contd)
• Query terms are combined logically using the Boolean operators AND, OR, and NOT.
  o E.g., ((data AND mining) AND (NOT text))
• Retrieval
  o Given a Boolean query, the system retrieves every document that makes the query logically true.
  o This is called exact match.
• The retrieval results are usually quite poor because term frequency is not considered.
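Exact-match retrieval for the example query can be sketched as follows. The nested-tuple query encoding, the function names `matches` and `retrieve`, and the sample documents are assumptions chosen for illustration; a real system would typically evaluate queries against an inverted index rather than scanning every document.

```python
def matches(query, terms):
    """Evaluate a nested Boolean query against a document's set of terms.

    A query is either a term (str) or a tuple: ("AND", q1, q2, ...),
    ("OR", q1, q2, ...), or ("NOT", q).
    """
    if isinstance(query, str):
        return query in terms
    op, *args = query
    if op == "AND":
        return all(matches(a, terms) for a in args)
    if op == "OR":
        return any(matches(a, terms) for a in args)
    if op == "NOT":
        return not matches(args[0], terms)
    raise ValueError(f"unknown operator: {op}")


def retrieve(query, docs):
    """Return the indices of all documents that make the query logically true."""
    return [i for i, d in enumerate(docs) if matches(query, set(d.lower().split()))]


# Hypothetical collection.
docs = [
    "data mining finds patterns",
    "text mining uses text data",
    "web search engines",
]
# ((data AND mining) AND (NOT text))
query = ("AND", ("AND", "data", "mining"), ("NOT", "text"))
result = retrieve(query, docs)
print(result)  # [0]
```

The result is an unranked set: a document either satisfies the query or it does not, which is why how often a term occurs plays no role.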
Vector space model
• Documents are also treated as a bag of words or terms.
• Each document is represented as a vector.
• However, the term weights are no longer 0 or 1. Each term weight is computed using some variation of the TF or TF-IDF scheme.
• Term Frequency (TF) scheme: the weight of a term $t_i$ in document $d_j$ is the number of times that $t_i$ appears in $d_j$, denoted by $f_{ij}$. Normalization may also be applied.
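A small sketch of the TF scheme is given below. The slide does not say which normalization is meant, so dividing by the count of the most frequent term in the document is an assumption here (one common choice); the function names and the whitespace tokenization are also illustrative.

```python
from collections import Counter


def term_frequencies(doc):
    """Raw TF weights: f_ij = number of times each term appears in the document."""
    return Counter(doc.lower().split())


def normalized_tf(doc):
    """TF divided by the largest count in the document (assumed normalization)."""
    counts = term_frequencies(doc)
    if not counts:
        return {}  # empty document: no terms, no weights
    max_count = max(counts.values())
    return {term: count / max_count for term, count in counts.items()}


doc = "text mining uses text data"
print(term_frequencies(doc)["text"])  # 2
print(normalized_tf(doc))  # {'text': 1.0, 'mining': 0.5, 'uses': 0.5, 'data': 0.5}
```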

TF-IDF term weighting scheme
• The most well-known weighting scheme
  o TF: still term frequency
  o IDF: inverse document frequency
    $N$: total number of documents
    $df_i$: the number of documents in which $t_i$ appears
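The formulas themselves do not appear in this preview, so the sketch below assumes the commonly used definitions $idf_i = \log(N / df_i)$ and $w_{ij} = f_{ij} \times idf_i$, with raw counts as TF. Other variants (normalized TF, smoothed IDF) exist and the original slides may use one of them. The function name `tf_idf` and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document, using w_ij = f_ij * log(N / df_i)."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    # df_i: number of documents in which term t_i appears (each document counted once).
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)  # f_ij: raw term frequency
        weights.append(
            {term: f * math.log(n_docs / df[term]) for term, f in counts.items()}
        )
    return weights


# Hypothetical collection.
docs = [
    "data mining finds patterns",
    "text mining uses text data",
    "web search engines",
]
weights = tf_idf(docs)
print(weights[1]["text"])  # 2 * log(3 / 1): frequent in this document, rare in the collection
print(weights[1]["data"])  # 1 * log(3 / 2): appears in two of the three documents
```

Under this definition a term that appears in every document gets weight 0, since $\log(N / N) = 0$; such a term does not help distinguish one document from another.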

This note was uploaded on 02/08/2012 for the course CSCI 6907 taught by Professor Zhang during the Spring '11 term at GWU.
