info-retrieval

info-retrieval - Chapter 5: Information Retrieval and Web...

Chapter 5: Information Retrieval and Web Search
An introduction
Most slides courtesy of Bing Liu

Introduction
- Text mining refers to data mining using text documents as data.
- Most text mining tasks use Information Retrieval (IR) methods to pre-process text documents.
- These methods are quite different from the traditional data pre-processing methods used for relational tables.
- Web search also has its roots in IR.
Information Retrieval (IR)
- Conceptually, IR is the study of finding needed information, i.e., IR helps users find information that matches their information needs.
  - Information needs are expressed as queries.
- Historically, IR is about document retrieval, emphasizing the document as the basic unit.
  - Finding documents relevant to user queries.
- Technically, IR studies the acquisition, organization, storage, retrieval, and distribution of information.

IR architecture
[Figure: IR architecture diagram]
IR queries
- Keyword queries
- Boolean queries (using AND, OR, NOT)
- Phrase queries
- Proximity queries
- Full document queries
- Natural language questions

Information retrieval models
- An IR model governs how a document and a query are represented and how the relevance of a document to a user query is defined.
- Main models:
  - Boolean model
  - Vector space model
  - Statistical language model
  - etc.
Boolean model
- Each document or query is treated as a “bag” of words or terms. Word sequence is not considered.
- Given a collection of documents $D$, let $V = \{t_1, t_2, \ldots, t_{|V|}\}$ be the set of distinct words/terms in the collection. $V$ is called the vocabulary.
- A weight $w_{ij} > 0$ is associated with each term $t_i$ of a document $d_j \in D$. For a term that does not appear in document $d_j$, $w_{ij} = 0$.
- $d_j = (w_{1j}, w_{2j}, \ldots, w_{|V|j})$
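The bag-of-words representation above can be sketched in a few lines of Python. This is a minimal illustration, not the slides' own code: the function names, the whitespace tokenizer, the sample documents, and the choice of weight 1 for every term that occurs (the slide only requires a weight greater than 0) are all assumptions.

```python
def build_vocabulary(docs):
    """Return V: the sorted list of distinct terms across all documents."""
    return sorted({term for doc in docs for term in doc.lower().split()})


def boolean_vector(doc, vocabulary):
    """Return d_j = (w_1j, ..., w_|V|j) with w_ij = 1 if t_i occurs, else 0."""
    terms = set(doc.lower().split())  # word order is discarded ("bag" of words)
    return [1 if term in terms else 0 for term in vocabulary]


# Hypothetical two-document collection D.
docs = ["data mining finds patterns", "text mining uses text data"]
vocab = build_vocabulary(docs)
vectors = [boolean_vector(doc, vocab) for doc in docs]
```

Each vector has one position per vocabulary term, so every document in the collection maps to a vector of the same length |V|.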

Boolean model (contd)
- Query terms are combined logically using the Boolean operators AND, OR, and NOT.
  - E.g., ((data AND mining) AND (NOT text))
- Retrieval: given a Boolean query, the system retrieves every document that makes the query logically true. This is called exact match.
- The retrieval results are usually quite poor because term frequency is not considered.
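Exact-match retrieval for the example query can be sketched as follows. The nested-tuple query encoding, the `evaluate` helper, and the three sample documents are hypothetical choices made for illustration; the slides do not specify a query representation.

```python
def evaluate(query, terms):
    """Return True if the set of document terms makes the query logically true."""
    if isinstance(query, str):  # a bare term: true if the document contains it
        return query in terms
    op, *args = query
    if op == "AND":
        return all(evaluate(arg, terms) for arg in args)
    if op == "OR":
        return any(evaluate(arg, terms) for arg in args)
    if op == "NOT":
        return not evaluate(args[0], terms)
    raise ValueError(f"unknown operator: {op}")


# ((data AND mining) AND (NOT text)) in the assumed tuple encoding.
query = ("AND", ("AND", "data", "mining"), ("NOT", "text"))

# Hypothetical collection; each document is reduced to its set of terms.
docs = {
    "d1": "data mining finds patterns",
    "d2": "text mining uses text data",
    "d3": "web search engines",
}
retrieved = [
    doc_id for doc_id, text in docs.items()
    if evaluate(query, set(text.lower().split()))
]
```

Here only `d1` is retrieved. Note that `d2` is rejected outright because it contains "text", and that the result is an unranked set: a document mentioning "mining" once is treated the same as one mentioning it many times, which is the weakness the slide points out.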
Vector space model
- Documents are also treated as a “bag” of words or terms.
- Each document is represented as a vector.
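As a rough sketch of what "represented as a vector" can mean, the snippet below maps a document to a vector of raw term counts over a fixed vocabulary. The preview ends before the slides define the actual weighting scheme, so raw counts are an assumption here, as are the function name `tf_vector`, the three-term vocabulary, and the sample document.

```python
from collections import Counter


def tf_vector(doc, vocabulary):
    """Return one count per vocabulary term (an assumed raw term-frequency weight)."""
    counts = Counter(doc.lower().split())
    return [counts[term] for term in vocabulary]  # missing terms count as 0


# Hypothetical vocabulary and document.
vocabulary = ["data", "mining", "text"]
vec = tf_vector("text mining uses text data", vocabulary)
```

Unlike the 0/1 weights of the Boolean model, these weights record how often each term occurs, so "text" (which appears twice) gets a larger weight than "data" or "mining".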