124.11.lec18 - CS 124/LINGUIST 180 From Languages to...


CS 124/LINGUIST 180: From Languages to Information
Dan Jurafsky
Lecture 18: Networks part I: Link Analysis, PageRank
Slides from Chris Manning, a few also from Ray Mooney and Bing Liu

Outline
- Anchor text
- Background on networks
- Bibliometric (citation) networks
- Social networks
- Link analysis for ranking
- PageRank
- HITS
- Search Engine Optimization

The Web as a Directed Graph
[Figure: Page A links to Page B through a hyperlink with anchor text.]
- Assumption 1: A hyperlink between pages denotes author-perceived relevance (a quality signal).
- Assumption 2: The anchor of the hyperlink describes the target page (textual context).

Anchor Text
WWW Worm - McBryan [Mcbr94]
- For the query ibm, how do we distinguish between:
  - IBM's home page (mostly graphical)
  - IBM's copyright page (high term frequency for "ibm")
  - A rival's spam page (arbitrarily high term frequency)
- A million pieces of anchor text with "ibm" send a strong signal.
[Figure: anchor texts "ibm", "ibm.com", and "IBM home page" all pointing to www.ibm.com.]

Indexing anchor text
- When indexing a document D, include anchor text from links pointing to D.
[Figure: pages linking to www.ibm.com, with text such as "Armonk, NY-based computer giant IBM announced today", "Joe's computer hardware links: Sun, HP, IBM", and "Big Blue today announced record profits for the quarter".]

Indexing anchor text
- Can sometimes have unexpected side effects. Like what?
- Can score anchor text with a weight depending on the authority of the anchor page's website.
  - E.g., if we were to assume that content from cnn.com or yahoo.com is authoritative, then trust the anchor text from them.

Anchor Text
- Other applications:
  - Weighting/filtering links in the graph
  - Generating page descriptions from anchor text

Roots of Web Link Analysis
- Bibliometrics
- Social network analysis

Citation Analysis: Impact Factor
- Developed by Garfield in 1972 to measure the importance (quality, influence) of scientific journals.
- Measure of how often papers in the journal are cited by other scientists.
- Computed and published annually by the Institute for Scientific Information (ISI).
- The impact factor of a journal J in year Y is the average number of citations (from indexed documents published in year Y) to a paper published in J in year Y-1 or Y-2.
- Does not account for the quality of the citing article.
(Slide from Ray Mooney)

Citations vs. Links
- Web links are a bit different from citations:
  - Many links are navigational.
  - Many pages with high in-degree are portals, not content providers.
  - Not all links are endorsements.
  - Company websites don't point to their competitors.
  - Citation of relevant literature is enforced by peer review.
(Slide from Ray Mooney)

Social network analysis
- Social network analysis is the study of social entities (people in an organization, called actors) and their interactions and relationships.
- The interactions and relationships can be represented with a network or graph:
  - each vertex (or node) represents an actor, and
  - each link represents a relationship.
(CS583, Bing Liu, UIC)

Centrality
- Important or prominent actors are those that are linked or involved with other actors extensively.
- A person with extensive contacts (links) or communications with many other people in the organization is considered more important than a person with relatively fewer contacts.
- The links can also be called ties. A central actor is one involved in many ties.
(CS583, Bing Liu, UIC)

Prestige
- Prestige is a more refined measure of the prominence of an actor than centrality.
- Distinguish ties sent (out-links) from ties received (in-links).
- A prestigious actor is one who is the object of extensive ties as a recipient.
- To compute prestige, we use only in-links.
- Difference between centrality and prestige:
  - centrality focuses on out-links
  - prestige focuses on in-links.
- PageRank is based on prestige.
(CS583, Bing Liu, UIC)

Drawing on the citation work
- First attempt to do link analysis

Query-independent ordering
- First generation: using link counts as simple measures of popularity.
- Two basic suggestions:
  - Undirected popularity: each page gets a score = the number of in-links plus the number of out-links (3+2=5).
  - Directed popularity: score of a page = number of its in-links (3).

Query processing
- First retrieve all pages meeting the text query (say, venture capital).
- Order these by their link popularity (either variant on the previous slide).
- More nuanced: use link counts as a measure of static goodness, combined with the text match score.

Spamming simple popularity
- Exercise: How do you spam each of the following heuristics so your page gets a high score?
  - Each page gets a static score = the number of in-links plus the number of out-links.
  - Static score of a page = number of its in-links.

Intuition of PageRank
[Figure from Wikipedia: C has higher PageRank than E, even though E has more in-links.]

Pagerank scoring
- Imagine a browser doing a random walk on web pages:
  - Start at a random page.
  - At each step, go out of the current page along one of the links on that page, equiprobably.
- "In the steady state" each page has a long-term visit rate; use this as the page's score.
[Figure: a page with three out-links, each followed with probability 1/3.]

Not quite enough
- The web is full of dead ends.
- A random walk can get stuck in dead ends.
- Then it makes no sense to talk about long-term visit rates.

Teleporting
- At a dead end, jump to a random web page.
- At any non-dead end, with probability 10%, jump to a random web page.
  - With the remaining probability (90%), go out on a random link.
  - The 10% is a parameter.

Result of teleporting
- Now the walk cannot get stuck locally.
- There is a long-term rate at which any page is visited (not obvious, will show this).
- How do we compute this visit rate?

Markov chains
- A Markov chain consists of n states, plus an n×n transition probability matrix P.
- At each step, we are in exactly one of the states.
- For 1 ≤ i,j ≤ n, the matrix entry Pij tells us the probability of j being the next state, given we are currently in state i. Pii>0 is OK.
[Figure: state i with an edge to state j labeled Pij.]

Markov chains
- Clearly, for all i, $\sum_{j=1}^{n} P_{ij} = 1$.
- Markov chains are abstractions of random walks.
- Exercise: represent the teleporting random walk from 3 slides ago as a Markov chain, for this case: [figure not reproduced in this text]

Ergodic Markov chains
- A Markov chain is ergodic if:
  - you have a path from any state to any other, and
  - for any start state, after a finite transient time T0, the probability of being in any state at a fixed time T>T0 is nonzero.
[Figure: a chain that is not ergodic (even/odd).]

Ergodic Markov chains
- For any ergodic Markov chain, there is a unique long-term visit rate for each state.
  - This is the steady-state probability distribution.
- Over a long time period, we visit each state in proportion to this rate.
- It doesn't matter where we start.

Probability vectors
- A probability (row) vector x = (x1, … xn) tells us where the walk is at any point.
- E.g., (000…1…000), with the 1 in position i of n, means we're in state i.
- More generally, the vector x = (x1, … xn) means the walk is in state i with probability xi.

Change in probability vector
- If the probability vector is x = (x1, … xn) at this step, what is it at the next step?
- Recall that row i of the transition probability matrix P tells us where we go next from state i.
- So from x, our next state is distributed as xP.

Steady state example
- The steady state looks like a vector of probabilities a = (a1, … an):
  - ai is the probability that we are in state i.
[Figure: two states, 1 and 2, with transition probabilities 1/4 and 3/4.]
- For this example, a1=1/4 and a2=3/4.
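
As a minimal sketch of the update rule x → xP and of the steady-state example above, the following Python snippet checks that a = (1/4, 3/4) is left unchanged by one step. The matrix P is an assumed reading of the two-state figure (from either state, go to state 1 with probability 1/4 and to state 2 with probability 3/4); the function name `step` is illustrative and not from the slides.

```python
from fractions import Fraction as F


def step(x, P):
    """Return the distribution after one step: the row vector x times P."""
    n = len(P)
    return [sum(x[i] * P[i][j] for i in range(n)) for j in range(n)]


# Assumed reading of the two-state figure: from either state, move to
# state 1 with probability 1/4 and to state 2 with probability 3/4.
P = [[F(1, 4), F(3, 4)],
     [F(1, 4), F(3, 4)]]

# Every row of a transition matrix must sum to 1.
assert all(sum(row) == 1 for row in P)

a = [F(1, 4), F(3, 4)]
# The steady state is unchanged by a step, i.e. a = aP.
assert step(a, P) == a

# Starting in state 1 with certainty, one step gives row 1 of P.
x = [F(1), F(0)]
print(step(x, P))
```

Exact fractions are used here only so that the equality a = aP can be checked without floating-point tolerance.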
How do we compute this vector?
- Let a = (a1, … an) denote the row vector of steady-state probabilities.
- If our current position is described by a, then the next step is distributed as aP.
- But a is the steady state, so a=aP.
- Solving this matrix equation gives us a.
  - So a is the (left) eigenvector for P.
  - (Corresponds to the "principal" eigenvector of P with the largest eigenvalue.)
  - Transition probability matrices always have largest eigenvalue 1.

One way of computing a
- Recall, regardless of where we start, we eventually reach the steady state a.
- Start with any distribution (say x=(10…0)).
- After one step, we're at xP;
- after two steps at xP^2, then xP^3 and so on.
- "Eventually" means for "large" k, xP^k = a.
- Algorithm: multiply x by increasing powers of P until the product looks stable.

Pagerank summary
- Preprocessing:
  - Given the graph of links, build matrix P.
  - From it compute a.
  - The entry ai is a number between 0 and 1: the pagerank of page i.
- Query processing:
  - Retrieve pages meeting the query.
  - Rank them by their pagerank.
  - Order is query-independent.

The reality
- Pagerank is used in Google, but is hardly the full story of ranking.
  - Many sophisticated features are used.
  - Some address specific query classes.
  - Machine-learned ranking is heavily used.

Pagerank: Issues and Variants
- How realistic is the random surfer model?
  - What if we modeled the back button?
  - Surfer behavior is sharply skewed towards short paths.
  - Search engines, bookmarks & directories make jumps non-random.
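
Returning to the teleporting walk and the "One way of computing a" procedure above, here is a hedged Python sketch of the preprocessing step: build P from a link graph, then multiply a start vector by P until the product looks stable. The function names, the tiny four-page graph, the stopping tolerance, and the choice to count repeated links once per occurrence are all assumptions for illustration; only the 10% teleport rate and the uniform jump at dead ends come from the slides.

```python
def transition_matrix(links, n, teleport=0.1):
    """Build P for the teleporting random walk on pages 0..n-1.

    links maps a page to the list of pages it links to.
    """
    P = []
    for i in range(n):
        out = links.get(i, [])
        if not out:
            # Dead end: jump to a page chosen uniformly at random.
            P.append([1.0 / n] * n)
            continue
        # With probability `teleport`, jump to a uniformly random page.
        row = [teleport / n] * n
        # Otherwise follow one of the out-links, equiprobably.
        for j in out:
            row[j] += (1.0 - teleport) / len(out)
        P.append(row)
    return P


def pagerank(P, tol=1e-12, max_steps=10_000):
    """Multiply x by P repeatedly until the vector stops changing."""
    n = len(P)
    x = [1.0] + [0.0] * (n - 1)  # start with all mass on page 0
    for _ in range(max_steps):
        nxt = [sum(x[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(u - v) for u, v in zip(nxt, x)) < tol:
            return nxt
        x = nxt
    return x


# Hypothetical four-page graph; page 3 is a dead end with no in-links.
links = {0: [1, 2], 1: [2], 2: [0], 3: []}
P = transition_matrix(links, 4)
a = pagerank(P)
print(a)
```

Because teleporting makes every entry of P positive, the chain is ergodic and the iteration settles on the same vector a from any start distribution.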
Biased Surfer Models
- Weight edge traversal probabilities based on match with topic/query (non-uniform edge selection).
- Bias jumps to pages on topic (e.g., based on personal bookmarks & categories of interest).

Topic Specific Pagerank
- Goal: pagerank values that depend on the query topic.
- Conceptually, we use a random surfer who teleports, with say 10% probability, using the following rule:
  - Select a category (say, one of the 16 top-level ODP categories) based on a query- and user-specific distribution over the categories.
  - Teleport to a page uniformly at random within the chosen category.
- Sounds hard to implement: can't compute PageRank at query time!

Topic Specific Pagerank
- Offline: compute pagerank for individual categories.
  - Query independent as before.
  - Each page has multiple pagerank scores, one for each ODP category, with teleportation only to that category.
- Online: the distribution of weights over categories is computed by query context classification.
  - Generate a dynamic pagerank score for each page: a weighted sum of category-specific pageranks.

Influencing PageRank ("Personalization")
- Input:
  - Web graph W
  - Influence vector v over topics, v : (page → degree of influence)
  - The vector has one component for each topic.
- Output:
  - Rank vector r: (page → page importance wrt v)
  - r = PR(W, v)

Non-uniform Teleportation
[Figure: Sports pages highlighted. Teleport with 10% probability to a Sports page.]

Interpretation of Composite Score
- Given a set of personalization vectors {vj}:
  ∑j [wj · PR(W, vj)] = PR(W, ∑j [wj · vj])
- Given a user's preferences over topics, express them as a combination of the "basis" vectors vj.

Interpretation
[Figure: Sports pages; 10% Sports teleportation.]

Interpretation
[Figure: Health pages; 10% Health teleportation.]

Interpretation
[Figure: Health and Sports pages.]
- pr = (0.9 PRsports + 0.1 PRhealth) gives you: 9% sports teleportation, 1% health teleportation.

Hyperlink-Induced Topic Search (HITS)
- In response to a query, instead of an ordered list of pages each meeting the query, find two sets of inter-related pages:
  - Hub pages are good lists of links on a subject.
    - e.g., "Bob's list of cancer-related links."
  - Authority pages occur recurrently on good hubs for the subject.
- Best suited for "broad topic" queries rather than for page-finding queries.
- Gets at a broader slice of common opinion.

Hubs and Authorities
- Thus, a good hub page for a topic points to many authoritative pages for that topic.
- A good authority page for a topic is pointed to by many good hubs for that topic.
- Circular definition: will turn this into an iterative computation.

The hope
[Figure: hubs pointing to authorities; example topic: long distance telephone companies.]

High-level scheme
- Extract from the web a base set of pages that could be good hubs or authorities.
- From these, identify a small set of top hub and authority pages, using an iterative algorithm.

Spam in Search
- Or, "Search Engine Optimization"

The trouble with paid search ads … (Sec. 19.2.2)
- It costs money. What's the alternative?
- Search Engine Optimization:
  - "Tuning" your web page to rank highly in the algorithmic search results for select keywords
  - Alternative to paying for placement
  - Thus, intrinsically a marketing function
- Performed by companies, webmasters and consultants ("Search engine optimizers") for their clients
- Some perfectly legitimate, some very shady
Search engine optimization (Spam)
- Motives
  - Commercial, political, religious, lobbies
  - Promotion funded by advertising budget
- Operators
  - Contractors (Search Engine Optimizers) for lobbies, companies
  - Web masters
  - Hosting services
- Forums
  - E.g., Web master world ( www.webmasterworld.com )
    - Search engine specific tricks
    - Discussions about academic papers

Simplest forms
- First generation engines relied heavily on tf/idf.
  - The top-ranked pages for the query maui resort were the ones containing the most maui's and resort's.
- SEOs responded with dense repetitions of chosen terms.
  - e.g., maui resort maui resort maui resort
  - Often, the repetitions would be in the same color as the background of the web page.
    - Repeated terms got indexed by crawlers,
    - but were not visible to humans on browsers.
- Pure word density cannot be trusted as an IR signal.

Variants of keyword stuffing
- Misleading meta-tags, excessive repetition
- Hidden text with colors, style sheet tricks, etc.
- Meta-Tags = "… London hotels, hotel, holiday inn, hilton, discount, booking, reservation, sex, mp3, britney spears, viagra, …"

Cloaking
- Serve fake content to the search engine spider.
- DNS cloaking: switch IP address. Impersonate.
[Figure: "Is this a Search Engine spider?" Y: serve SPAM; N: serve Real Doc.]
More spam techniques
- Doorway pages
  - Pages optimized for a single keyword that re-direct to the real target page
- Link spamming
  - Mutual admiration societies, hidden links, awards
  - Domain flooding: numerous domains that point or re-direct to a target page
- Robots
  - Millions of submissions via Add-Url

The war against spam
- Quality signals: prefer authoritative pages based on:
  - Votes from authors (linkage signals)
  - Votes from users (usage signals)
- Robust link analysis
  - Ignore statistically implausible linkage (or text)
  - Use link analysis to detect spammers (guilt by association)
- Spam recognition by machine learning
  - Training set based on known spam
- Family friendly filters
  - Linguistic analysis, general classification techniques, etc.
  - For images: flesh tone detectors, source text analysis, etc.
- Editorial intervention
  - Blacklists
  - Top queries audited
  - Complaints addressed
  - Suspect pattern detection

More on spam
- Web search engines have policies on SEO practices they tolerate/block
  - http://help.yahoo.com/help/us/ysearch/index.html
  - http://www.google.com/intl/en/webmasters/
- Adversarial IR: the unending (technical) battle between SEOs and web search engines
- Research: http://airweb.cse.lehigh.edu/

NY Times article on JC Penney link spam
- http://www.nytimes.com/2011/02/13/business/13search.html
- In the last several months, JCPenney.com was in the #1 spot for:
  - "dresses", "bedding", "area rugs"
- "Someone paid to have thousands of links placed on hundreds of sites scattered around the Web"
- "2,015 pages with phrases like "casual dresses," "evening dresses," "little black dress" or "cocktail dress"."
- The NY Times informed Google.
- At 7 p.m., J. C. Penney was the #1 result for "Samsonite carry on luggage."
- Two hours later, it was at No. 71.

New Problem: Content Farms
- Demand Media: eHow, etc.
- http://www.wired.com/magazine/2009/10/ff_demandmedia/all/1
- Demand Media's "legion of low-paid writers" "pump out 4,000 videoclips and articles a day. It starts with an algorithm" based on:
  1. Search terms (popular terms from more than 100 sources comprising 2 billion searches a day),
  2. The ad market (a snapshot of which keywords are sought after and how much they are fetching),
  3. The competition (what's online already and where a term ranks in search results).
- Wired on Google's change:
  - http://www.wired.com/epicenter/2011/02/google-clamp-down-content-factories/
  - "Google updated its core ranking algorithm…to decrease the prevalence of…content farms in top search results."

How to address content farms?
- From the Google blog: "We've been exploring different algorithms to detect content farms, which are sites with shallow or low-quality content. One of the signals we're exploring is explicit feedback from users. To that end, today we're launching an early, experimental Chrome extension so people can block sites from their web search results. If installed, the extension also sends blocked site information to Google, and we will study the resulting feedback and explore using it as a potential ranking signal for our search results."
...

This document was uploaded on 06/01/2011.
