Lecture-D

Lecture D Searching the Web

Introduction
- The Web can be treated as a very large, unstructured, global information system, but it is not a DBMS.
- Web documents: text, audio, video, and images.
- There is a demand for efficient tools to manage, retrieve, and filter information from the Web.
- Approaches:
  - Search Web documents using user-specified words / patterns in a text
  - Search engines: index a portion of the Web documents
  - Web directories: classify Web documents by subject
  - Hyperlinks: create pointers from one Web page to another
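The search-engine approach above (indexing documents so they can be found by user-specified words) can be illustrated with a toy inverted index. This is only a minimal sketch: the function names `build_inverted_index` and `search`, the whitespace tokenization, and the sample pages are assumptions made for illustration, not something taken from the lecture.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercase word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every query word (Boolean AND)."""
    words = query.lower().split()
    if not words:
        return set()
    # Start from the postings of the first word, then intersect with the rest.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical sample pages, for illustration only.
docs = {
    "page1": "searching the web with search engines",
    "page2": "web directories classify documents by subject",
    "page3": "hyperlinks connect one web page to another",
}
index = build_inverted_index(docs)
matches = search(index, "web directories")
```

A real engine would add tokenization, stemming, and ranking on top of this lookup; the sketch only shows why an index makes word-based retrieval cheap.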
Challenges
- Distributed network systems: data spans many computers, interconnected with no predefined topology, over an unreliable network (bandwidth varies)
- High percentage of volatile data: dynamic data, dangling links, and data relocation
- Huge volume: exponential growth rate; difficult to cope with
- Unstructured and redundant data: Web text documents are semi-structured / unstructured in nature and replicated (~30%)
- Quality of data: no editorial process on Web data, which can be invalid, poorly written, and erroneous
- Heterogeneous data: multiple media types (formats) with different languages

Problems Posed by the Web
- Most of the problems, such as diversity of data types and poor data quality, are unsolvable.
- Problems in interacting with the retrieval system:
  - How do we specify a query?
  - How do we interpret the answers provided by the system?
  - How do we handle a large answer (say, 1,000 documents)?
  - How do we rank the documents?
  - How do we select documents that really are of interest to the user?
  - How do we browse efficiently in large retrieved documents?
- Design goal: (i) formulate "good" queries, and (ii) obtain a manageable and relevant answer.
Ranking Algorithms
Yuwono and Lee (1997) propose 4 ranking algorithms:
1) Boolean Spreading Activation: classical Boolean + "simplified link analysis"
2) Most-cited: based only on the terms included in pages having a link to the pages in the answer set
3) TFxIDF: based on word distribution statistics
4) Vector Spreading Activation: vector space model + spreading activation model; relatively superior, 76% (precision)
- (1) and (2): rely on WWW meta-information (i.e., hyperlink structure) without considering term frequencies
- (3): uses only word occurrence statistics
- (4): uses both word occurrence statistics and hyperlink structure
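To make "word distribution statistics" concrete, here is a small sketch of a generic textbook TF-IDF score. It is not necessarily the exact TFxIDF formula used by Yuwono and Lee, which does not appear in this text; the function name `tfidf_rank`, the length-normalized term frequency, the logarithmic IDF, and the sample documents are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tfidf_rank(docs, query):
    """Rank documents by a generic TF-IDF score, highest first."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each word appear?
    df = Counter()
    for words in tokenized.values():
        df.update(set(words))

    scores = {}
    for doc_id, words in tokenized.items():
        if not words:
            scores[doc_id] = 0.0
            continue
        tf = Counter(words)
        score = 0.0
        for term in query.lower().split():
            if df[term] == 0:
                continue  # term occurs nowhere in the collection
            idf = math.log(n_docs / df[term])
            score += (tf[term] / len(words)) * idf
        scores[doc_id] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical sample collection, for illustration only.
docs = {
    "a": "web search engines index web pages",
    "b": "web directories classify pages by subject",
    "c": "ranking uses word statistics",
}
ranked = tfidf_rank(docs, "search engines")
```

Only document "a" contains the query words here, so it is the only one with a positive score. A word that appears in every document gets an IDF of zero under this formula, which is the usual motivation for weighting rare words more heavily.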

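Ranking Algorithms (notation)
- $M$: the number of query words.
- $N$: the number of Web documents in the collection.
- $q_j$: the $j$th query word, $1 \le j \le M$.
- $d_i$: the $i$th Web document, $1 \le i \le N$.
- $R_{i,q}$: the relevance score of $d_i$ with respect to query $q$.
- $C_{i,j}$: occurrence of $q_j$ in $d_i$, where
  $$C_{i,j} = \begin{cases} 1 & \text{if } q_j \text{ occurs in } d_i \\ 0 & \text{otherwise.} \end{cases}$$
- $L$: [remainder truncated in the source]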
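The occurrence indicator $C_{i,j}$ defined above can be computed directly. The sketch below builds the $N \times M$ matrix for a toy collection. The helper names `occurrence_matrix` and `matched_word_counts` and the sample data are hypothetical. The row sum is shown only as one simple way to use the matrix; it is not the definition of $R_{i,q}$, whose formula is cut off in this text.

```python
def occurrence_matrix(docs, query_words):
    """Return C where C[i][j] = 1 if query word j occurs in document i, else 0."""
    matrix = []
    for text in docs:
        words = set(text.lower().split())
        matrix.append([1 if q.lower() in words else 0 for q in query_words])
    return matrix


def matched_word_counts(matrix):
    """Illustrative only: number of distinct query words found in each document."""
    return [sum(row) for row in matrix]


# Hypothetical collection (N = 3 documents) and query (M = 2 words).
docs = ["web search engines", "web directories", "image retrieval"]
query_words = ["web", "search"]

C = occurrence_matrix(docs, query_words)
counts = matched_word_counts(C)
```

Because $C_{i,j}$ records only presence or absence, the matrix ignores how often a word occurs, in line with the earlier remark that algorithms (1) and (2) do not consider term frequencies.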

This document was uploaded on 10/18/2011.