30_Web_Search_packed



Any surfer looking at page i will:
–  if $c_i = 0$, choose one of the n Web pages at random;
–  if $c_i \neq 0$, flip a coin whose P(heads) = p (the coin is assumed to be independent of the surfing), and
   •  if it's heads, select one of the out-links at random;
   •  if it's tails, select one of the n Web pages at random.
•  One-step transition probabilities are:

\[
p_{ij} =
\begin{cases}
\dfrac{1}{n} & \text{if } c_i = 0 \\[6pt]
p \cdot \dfrac{1}{c_i} + (1 - p) \cdot \dfrac{1}{n} & \text{if } c_i \neq 0 \text{ and link } i \to j \text{ exists} \\[6pt]
(1 - p) \cdot \dfrac{1}{n} & \text{if } c_i \neq 0 \text{ and link } i \to j \text{ does not exist}
\end{cases}
\]

Ilya Pollak

Modified model of Web surfing
•  Assuming that p < 1, the resulting Markov chain graph is fully connected, with $p_{ij} \neq 0$ for all Web pages i and j.
•  Therefore, the entire graph forms a single recurrent class, with no periodic states.
•  Define PageRank(i) as the steady-state probability for the surfer to be at page i after a large number of steps under this model.
•  Then PageRank(i) exists and does not depend on the starting point.
•  Retrieve pages based on word frequency and prominence, and perhaps other criteria, and sort by PageRank.

Comments
•  Google's original algorithm used word frequency, visual prominence (e.g., font size), and anchor text (text surrounding the link to page j in page i), in addition to PageRank.
•  Google's current page ranking algorithm has hundreds of other ingredients which are kept secret and are changed with time, so as to both improve the algorithm and prevent people from taking advantage.
•  Other concurrently developed algorithms for ranking websites were based on the idea that experts' links to page i should count for more than non-experts' links. "Experts" are identified by counting how many highly-ranked search results they link to.
This is the basis for the hubs-and-authorities (or HITS) algorithm of Jon Kleinberg and the SALSA algorithm of Lempel and Moran.

Information retrieval
•  Web search is an example of information retrieval.
•  Before the Web, information retrieval meant searching databases of newspaper articles, scientific papers, patents, legal abstracts, medical records, etc.
•  An interesting application of text-based search to video is SnapStream, which is based on closed captions.
   –  Used by government entities and the entertainment industry (e.g., the Daily Show).
•  PageRank is akin to determining impact factors of scientific publications: being cited helps, especially being cited by important publications.
•  Non-text-based search is more difficult but has wide applications:
   –  forensics (fingerprint matching, footprint matching, face matching);
   –  health care (matching an X-ray image against a database of lung cancer images, to aid in determining the diagnosis and treatment).
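
Returning to the modified surfing model: the one-step transition probabilities and the steady-state definition of PageRank can be tried out on a toy example. The sketch below is only an illustration, not the method of the slides or of any search engine. It takes c_i to be the number of out-links of page i, as the transition formula implies. The function names (`transition_matrix`, `step`, `pagerank`), the dictionary format for `links`, the 4-page toy web, and the value p = 0.85 are all assumptions made for this example. The steady state is approximated by repeatedly applying the update π_j ← Σ_i π_i p_ij until the distribution stops changing.

```python
def transition_matrix(links, n, p):
    """Build the n x n matrix of one-step transition probabilities p_ij.

    links[i] is the set of pages that page i links to (hypothetical input format).
    p is P(heads), the probability of following an out-link when one exists.
    """
    P = []
    for i in range(n):
        out = links.get(i, set())
        c_i = len(out)                      # number of out-links of page i
        if c_i == 0:
            row = [1.0 / n] * n             # c_i = 0: uniformly random page
        else:
            row = [(1.0 - p) / n] * n       # tails: uniformly random page
            for j in out:
                row[j] += p / c_i           # heads: uniformly random out-link
        P.append(row)
    return P


def step(pi, P):
    """One step of the chain: new_pi[j] = sum over i of pi[i] * P[i][j]."""
    n = len(pi)
    return [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]


def pagerank(P, start=None, tol=1e-12, max_steps=10_000):
    """Approximate the steady-state distribution by iterating step()."""
    n = len(P)
    pi = list(start) if start is not None else [1.0 / n] * n
    for _ in range(max_steps):
        nxt = step(pi, P)
        if sum(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    return pi


# Toy web (made up for illustration): 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0,
# and page 3 has no out-links.
links = {0: {1, 2}, 1: {2}, 2: {0}, 3: set()}
P = transition_matrix(links, n=4, p=0.85)
ranks = pagerank(P)

# Pages sorted from highest to lowest approximate PageRank.
order = sorted(range(4), key=lambda i: ranks[i], reverse=True)
print(order, ranks)
```

Calling `pagerank(P, start=[0.0, 0.0, 0.0, 1.0])` should give the same values up to the tolerance, which illustrates the claim that PageRank(i) does not depend on the starting point when p < 1. For a Web-scale graph a dense matrix like this would not be practical; the sketch is meant only to make the formula concrete.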

This note was uploaded on 09/11/2013 for the course ECE 302 taught by Professor Gelfand during the Fall '08 term at Purdue.
