Ch27c_ir3-websearch-95

Database Management Systems, R. Ramakrishnan

Web Search Engines
Chapter 27, Part C
Based on Larson and Hearst’s slides at UC-Berkeley
http://www.sims.berkeley.edu/courses/is202/f00/

Search Engine Characteristics
- Unedited: anyone can enter content
    - Quality issues; spam
- Varied information types
    - Phone book, brochures, catalogs, dissertations, news reports, weather, all in one place!
- Different kinds of users
    - Lexis-Nexis: paying, professional searchers
    - Online catalogs: scholars searching scholarly literature
    - Web: every type of person with every type of goal
- Scale
    - Hundreds of millions of searches/day; billions of docs

Web Search Queries
- Web search queries are short:
    - ~2.4 words on average (Aug 2000)
    - Has increased, was 1.7 (~1997)
- User expectations:
    - Many say “The first item shown should be what I want to see!”
    - This works if the user has the most popular/common notion in mind, not otherwise.
Directories vs. Search Engines
- Directories
    - Hand-selected sites
    - Search over the contents of the descriptions of the pages
    - Organized in advance into categories
- Search Engines
    - All pages in all sites
    - Search over the contents of the pages themselves
    - Organized in response to a query by relevance rankings or other scores

What about Ranking?
- Lots of variation here
    - Often messy; details proprietary and fluctuating
- Combining subsets of:
    • IR-style relevance: based on term frequencies, proximities, position (e.g., in title), font, etc.
    • Popularity information
    • Link analysis information
- Most use a variant of vector space ranking to combine these. Here’s how it might work:
    - Make a vector of weights for each feature
    - Multiply this by the counts for each feature

Relevance: Going Beyond IR
- Page “popularity” (e.g., DirectHit)
    - Frequently visited pages (in general)
    - Frequently visited pages as a result of a query
- Link “co-citation” (e.g., Google)
    - Which sites are linked to by other sites?
    - Draws upon sociology research on bibliographic citations to identify “authoritative sources”
    • Discussed further in Google case study
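
The weighted combination described under "What about Ranking?" can be illustrated with a minimal sketch. It assumes each page has already been reduced to a few numeric feature counts. The feature names, the weight values, and the example pages are all hypothetical, since the slides note that real ranking details are proprietary and fluctuating.

```python
from typing import Dict, List, Tuple

# Hypothetical weight vector: one weight per feature (values are illustrative only).
WEIGHTS: Dict[str, float] = {
    "term_frequency": 1.0,  # IR-style relevance
    "title_match": 3.0,     # query term appears in the title
    "popularity": 0.5,      # e.g., visit counts
    "inlinks": 2.0,         # link analysis information
}


def score(features: Dict[str, float], weights: Dict[str, float] = WEIGHTS) -> float:
    """Multiply each weight by the matching feature count and sum (a dot product).

    Features missing from a page are treated as 0.
    """
    return sum(w * features.get(name, 0.0) for name, w in weights.items())


def rank(pages: Dict[str, Dict[str, float]]) -> List[Tuple[str, float]]:
    """Return (page, score) pairs sorted from highest to lowest score."""
    scored = [(url, score(feats)) for url, feats in pages.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up pages and feature counts, for illustration only.
pages = {
    "a.example": {"term_frequency": 4, "title_match": 1, "popularity": 2, "inlinks": 0},
    "b.example": {"term_frequency": 2, "title_match": 0, "popularity": 10, "inlinks": 3},
}

for url, s in rank(pages):
    print(url, s)
```

With these made-up weights, the page with fewer query-term matches still ranks first because its popularity and in-link counts outweigh them, which is the kind of effect the "Going Beyond IR" slide describes.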
Web Search Architecture

Standard Web Search Engine Architecture
[Figure: architecture diagram with the labels: crawl the web; check for duplicates, store the documents; create an inverted index; inverted index; search engine servers; user query; show results to user; DocIds]

Inverted Indexes the IR Way
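
The slide content for the inverted index did not survive beyond its title, so the following is only a rough sketch of the general idea: map each term to the set of DocIds that contain it, then answer a query by intersecting those sets. The function names, the whitespace tokenizer, and the sample documents are assumptions made for illustration. Crawling, duplicate detection, and ranking are left out.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_inverted_index(docs: Dict[int, str]) -> Dict[str, Set[int]]:
    """Map each lowercased term to the set of DocIds whose text contains it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():  # naive whitespace tokenizer (assumption)
            index[term].add(doc_id)
    return dict(index)


def search(index: Dict[str, Set[int]], query: str) -> List[int]:
    """Return the sorted DocIds that contain every query term (boolean AND)."""
    terms = query.lower().split()
    if not terms:
        return []
    postings = [index.get(term, set()) for term in terms]
    return sorted(set.intersection(*postings))


# Made-up stored documents keyed by DocId.
docs = {
    1: "web search engines",
    2: "inverted index for web search",
    3: "database systems",
}

index = build_inverted_index(docs)
print(search(index, "web search"))
print(search(index, "inverted web"))
```

A query server built this way would return DocIds only; looking up the stored documents and ordering them by a relevance score would be separate steps.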
How Inverted Files Are Created
- Periodically rebuilt, static otherwise.

This note was uploaded on 02/06/2010 for the course CSE 302 taught by Professor Joel during the Summer '05 term at Punjab Engineering College.
