2 - Preliminaries: Information Retrieval Introduction Text...

Preliminaries: Information Retrieval

Introduction
• Text mining refers to data mining that uses text documents as data.
• Most text mining tasks use Information Retrieval (IR) methods to pre-process text documents.
• These methods are quite different from the traditional data pre-processing methods used for relational tables.
• Web search also has its roots in IR.

Information Retrieval (IR)
• Conceptually, IR is the study of finding needed information, i.e., IR helps users find information that matches their information needs.
  o Information needs are expressed as queries.
• Historically, IR is about document retrieval, emphasizing the document as the basic unit.
  o Finding documents relevant to user queries
• Technically, IR studies the acquisition, organization, storage, retrieval, and distribution of information.

IR architecture
[Figure: IR architecture]

IR queries
• Keyword queries
• Boolean queries (using AND, OR, NOT)
• Phrase queries
• Proximity queries
• Full document queries
• Natural language questions

Information retrieval models
• An IR model governs how a document and a query are represented and how the relevance of a document to a user query is defined.
• Main models:
  o Boolean model
  o Vector space model
  o etc.

Boolean model
• Each document or query is treated as a bag of words or terms. Word sequence is not considered.
• Given a collection of documents $D$, let $V = \{t_1, t_2, \ldots, t_{|V|}\}$ be the set of distinct words/terms in the collection. $V$ is called the vocabulary.
• A weight $w_{ij} > 0$ is associated with each term $t_i$ of a document $d_j \in D$. For a term that does not appear in document $d_j$, $w_{ij} = 0$. Each document is thus a vector of weights:
  $d_j = (w_{1j}, w_{2j}, \ldots, w_{|V|j})$
• Boolean model: the weight is Boolean, i.e., $w_{ij} = 1$ if $t_i$ appears in $d_j$ and $w_{ij} = 0$ otherwise.
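The document representation described for the Boolean model can be sketched in a few lines of Python. This is a minimal illustration, not the course's implementation: the helper names `build_vocabulary` and `boolean_vector`, the whitespace tokenization, the lowercasing, and the sample documents are all assumptions made for the example.

```python
def build_vocabulary(docs):
    """Return the sorted list of distinct terms in the collection (the set V)."""
    return sorted({term for doc in docs for term in doc.lower().split()})


def boolean_vector(doc, vocabulary):
    """Return d_j as a list of 0/1 weights, one per vocabulary term."""
    terms = set(doc.lower().split())
    return [1 if term in terms else 0 for term in vocabulary]


# Hypothetical two-document collection D.
docs = ["data mining finds patterns", "text mining uses text data"]
vocabulary = build_vocabulary(docs)
vectors = [boolean_vector(doc, vocabulary) for doc in docs]

print(vocabulary)
for vector in vectors:
    print(vector)
```

Because the weights are Boolean, the second document gets a weight of 1 for "text" even though the word occurs twice; the count is discarded.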

Boolean model (contd)
• Query terms are combined logically using the Boolean operators AND, OR, and NOT.
  o E.g., ((data AND mining) AND (NOT text))
• Retrieval
  o Given a Boolean query, the system retrieves every document that makes the query logically true.
  o This is called exact match.
• The retrieval results are usually quite poor because term frequency is not considered.
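One way to picture exact-match retrieval is with set operations over an inverted index: AND becomes intersection, OR becomes union, and NOT becomes a complement against the full set of document ids. The sketch below assumes a nested-tuple query representation and the hypothetical helpers `build_index` and `evaluate`; neither comes from the slides, and real systems parse query strings and store the index far more compactly.

```python
def build_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = {}
    for doc_id, text in enumerate(docs):
        for term in set(text.lower().split()):
            index.setdefault(term, set()).add(doc_id)
    return index


def evaluate(query, index, all_ids):
    """Return the ids of documents that make the query logically true.

    A query is either a term (str), ("NOT", q), ("AND", q1, q2),
    or ("OR", q1, q2). This tuple encoding is an assumption of the sketch.
    """
    if isinstance(query, str):
        return index.get(query, set())
    operator = query[0]
    if operator == "NOT":
        return all_ids - evaluate(query[1], index, all_ids)
    left = evaluate(query[1], index, all_ids)
    right = evaluate(query[2], index, all_ids)
    if operator == "AND":
        return left & right
    if operator == "OR":
        return left | right
    raise ValueError(f"unknown operator: {operator}")


# Hypothetical collection; document ids are list positions.
docs = [
    "data mining finds patterns",
    "text mining uses text data",
    "web search has roots in retrieval",
]
index = build_index(docs)
all_ids = set(range(len(docs)))

# ((data AND mining) AND (NOT text))
query = ("AND", ("AND", "data", "mining"), ("NOT", "text"))
result = evaluate(query, index, all_ids)
print(result)
```

The result is an unordered set: a document either satisfies the query or it does not, so nothing in this scheme ranks one match above another.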
Vector space model
• Documents are also treated as a bag of words or terms.
• Each document is represented as a vector.
• However, the term weights are no longer 0 or 1. Each term weight is computed using some variant of the TF or TF-IDF scheme.
• Term Frequency (TF) scheme: the weight of a term $t_i$ in document $d_j$ is the number of times that $t_i$ appears in $d_j$, denoted by $f_{ij}$. Normalization may also be applied.
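The TF scheme amounts to counting term occurrences per document. The sketch below computes the raw counts $f_{ij}$ and, as one possible normalization, divides each count by the largest count in the same document. The slide does not say which normalization is meant, so that choice, the helper names, and the whitespace tokenization are assumptions.

```python
from collections import Counter


def term_frequencies(doc):
    """Return the raw count f_ij of every term in one document."""
    return Counter(doc.lower().split())


def normalized_tf(doc):
    """Scale counts by the document's maximum count (an assumed normalization)."""
    counts = term_frequencies(doc)
    if not counts:
        return {}
    max_count = max(counts.values())
    return {term: count / max_count for term, count in counts.items()}


doc = "text mining uses text data"  # hypothetical document
raw = term_frequencies(doc)
scaled = normalized_tf(doc)
print(raw)
print(scaled)
```

Dividing by the maximum count keeps every weight between 0 and 1, which stops long documents from dominating simply because they repeat words more often.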

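TF-IDF term weighting scheme
• The best-known weighting scheme
  o TF: still the term frequency
  o IDF: the inverse document frequency, defined in terms of
    $N$: the total number of documents
    $df_i$: the number of documents in which $t_i$ appears
  o In its standard form, $idf_i = \log(N / df_i)$, and the final weight of a term is the product of its TF and IDF values.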
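A short sketch may help show how $N$ and $df_i$ enter the weight. It assumes the commonly used form $idf_i = \log(N / df_i)$ with the natural logarithm and multiplies it by the raw count $f_{ij}$; the slide's own formula is not preserved in this text, so treat the exact variant, the function name `tf_idf`, and the sample documents as assumptions.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document, weight = f_ij * log(N / df_i)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # df_i: number of documents in which each term appears at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append(
            {term: f * math.log(n_docs / doc_freq[term]) for term, f in counts.items()}
        )
    return weights


# Hypothetical collection.
docs = [
    "data mining finds patterns",
    "text mining uses text data",
    "web search has roots in retrieval",
]
weights = tf_idf(docs)
for doc_weights in weights:
    print(doc_weights)
```

A term that occurs in every document gets $\log(N/N) = 0$ and therefore contributes nothing, while a term confined to a single document receives the largest IDF.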