CS 124/LINGUIST 180: From Languages to Information
Dan Jurafsky
Lecture 18: Networks part I: Link Analysis, PageRank
Slides from Chris Manning, a few also from Ray Mooney and Bing Liu

Outline
Anchor text
Background on networks
  Bibliometric (citation) networks
  Social networks
Link analysis for ranking
  PageRank
  HITS
Search Engine Optimization
Slide from Chris Manning

The Web as a Directed Graph
[Figure: Page A links to Page B through a hyperlink with anchor text.]
Assumption 1: A hyperlink between pages denotes author-perceived relevance (quality signal).
Assumption 2: The anchor of the hyperlink describes the target page (textual context).

Anchor Text
WWW Worm, McBryan [Mcbr94]
For the query ibm, how to distinguish between:
  IBM's home page (mostly graphical)
  IBM's copyright page (high term freq. for 'ibm')
  Rival's spam page (arbitrarily high term freq.)
[Figure: anchor texts "ibm", "ibm.com" and "IBM home page" all point to www.ibm.com.]
A million pieces of anchor text with "ibm" send a strong signal.

Indexing anchor text
When indexing a document D, include anchor text from links pointing to D.
[Figure: three pages link to www.ibm.com. Their text reads "Armonk, NY-based computer giant IBM announced today", "Joe's computer hardware links: Sun, HP, IBM", and "Big Blue today announced record profits for the quarter".]

Indexing anchor text
Can sometimes have unexpected side effects: like what?
Can score anchor text with weight depending on the authority of the anchor page's website.
  E.g., if we were to assume that content from cnn.com or yahoo.com is authoritative, then trust the anchor text from them.

Anchor Text
Other applications:
  Weighting/filtering links in the graph
  Generating page descriptions from anchor text

Roots of Web Link Analysis
Bibliometrics
Social network analysis

Citation Analysis: Impact Factor
Developed by Garfield in 1972 to measure the importance (quality, influence) of scientific journals.
Measure of how often papers in the journal are cited by other scientists.
Computed and published annually by the Institute for Scientific Information (ISI).
The impact factor of a journal J in year Y is the average number of citations (from indexed documents published in year Y) to a paper published in J in year Y-1 or Y-2.
Does not account for the quality of the citing article.
Slide from Ray Mooney

Citations vs. Links
Web links are a bit different from citations:
  Many links are navigational.
  Many pages with high in-degree are portals, not content providers.
  Not all links are endorsements.
  Company websites don't point to their competitors.
  Citation of relevant literature is enforced by peer review.

Social network analysis
Social network analysis is the study of social entities (people in an organization, called actors), and their interactions and relationships.
The interactions and relationships can be represented with a network or graph, where each vertex (or node) represents an actor and each link represents a relationship.
CS583, Bing Liu, UIC

Centrality
Important or prominent actors are those that are linked or involved with other actors extensively.
A person with extensive contacts (links) or communications with many other people in the organization is considered more important than a person with relatively fewer contacts.
The links can also be called ties. A central actor is one involved in many ties.

Prestige
Prestige is a more refined measure of prominence of an actor than centrality.
  Distinguish: ties sent (out-links) and ties received (in-links).
A prestigious actor is one who is the object of extensive ties as a recipient.
  To compute the prestige: we use only in-links.
Difference between centrality and prestige:
  centrality focuses on out-links
  prestige focuses on in-links.
PageRank is based on prestige.

Drawing on the citation work
First attempt to do link analysis

Query-independent ordering
First generation: using link counts as simple measures of popularity.
Two basic suggestions:
  Undirected popularity: Each page gets a score = the number of in-links plus the number of out-links (3+2=5).
  Directed popularity: Score of a page = number of its in-links (3).
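
The two link-count scores can be sketched in a few lines of Python. The graph below is a made-up example (the page names and edges are assumptions, not taken from the slides), chosen so that page "P" has 3 in-links and 2 out-links, matching the (3+2=5) and (3) figures above.

```python
from collections import Counter

# Hypothetical link graph as (source, target) pairs; not from the slides.
edges = [
    ("A", "P"), ("B", "P"), ("C", "P"),  # three in-links to P
    ("P", "D"), ("P", "E"),              # two out-links from P
]

def link_counts(edges):
    """Return (in_degree, out_degree) counters for a list of directed edges."""
    in_deg = Counter(target for _, target in edges)
    out_deg = Counter(source for source, _ in edges)
    return in_deg, out_deg

def undirected_popularity(page, edges):
    """Score = number of in-links plus number of out-links."""
    in_deg, out_deg = link_counts(edges)
    return in_deg[page] + out_deg[page]

def directed_popularity(page, edges):
    """Score = number of in-links only."""
    in_deg, _ = link_counts(edges)
    return in_deg[page]

print(undirected_popularity("P", edges))  # 5
print(directed_popularity("P", edges))    # 3
```

Both scores depend only on the link graph, not on any query, which is why they can be computed once ahead of time.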

Query processing
First retrieve all pages meeting the text query (say venture capital).
Order these by their link popularity (either variant on the previous page).
More nuanced: use link counts as a measure of static goodness, combined with text match score.

Spamming simple popularity
Exercise: How do you spam each of the following heuristics so your page gets a high score?
  Each page gets a static score = the number of in-links plus the number of out-links.
  Static score of a page = number of its in-links.

Intuition of PageRank
[Figure from Wikipedia: C has higher PageRank than E, even though there are more in-links to E.]

Pagerank scoring
Imagine a browser doing a random walk on web pages:
  Start at a random page.
  At each step, go out of the current page along one of the links on that page, equiprobably.
[Figure: a page with three out-links, each followed with probability 1/3.]
"In the steady state" each page has a long-term visit rate; use this as the page's score.

Not quite enough
The web is full of dead-ends.
  Random walk can get stuck in dead-ends.
  Makes no sense to talk about long-term visit rates.

Teleporting
At a dead end, jump to a random web page.
At any non-dead end, with probability 10%, jump to a random web page.
  With remaining probability (90%), go out on a random link.
  10% is a parameter.
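
As a rough illustration of the teleporting rule, the sketch below simulates a single random surfer and estimates long-term visit rates by counting visits. The three-page graph, the function names, the step count and the seed are all assumptions made for this example; page "C" is deliberately a dead end.

```python
import random

def next_page(page, links, pages, rng, teleport=0.1):
    """Pick the surfer's next page under the teleporting rule."""
    out = links.get(page, [])
    # Dead end: always jump. Otherwise jump with probability `teleport`.
    if not out or rng.random() < teleport:
        return rng.choice(pages)
    return rng.choice(out)

def visit_rates(links, steps=50_000, teleport=0.1, seed=0):
    """Estimate visit rates by simulating one long walk."""
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    rng = random.Random(seed)
    counts = dict.fromkeys(pages, 0)
    page = rng.choice(pages)  # start at a random page
    for _ in range(steps):
        counts[page] += 1
        page = next_page(page, links, pages, rng, teleport)
    return {p: c / steps for p, c in counts.items()}

# Hypothetical graph: A links to B and C, B links to C, C has no out-links.
links = {"A": ["B", "C"], "B": ["C"], "C": []}
rates = visit_rates(links)
print(rates)
```

Because of teleporting, the walk never gets stuck at "C", and every page ends up with a nonzero share of visits.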

Result of teleporting
Now cannot get stuck locally.
There is a long-term rate at which any page is visited (not obvious, will show this).
How do we compute this visit rate?

Markov chains
A Markov chain consists of n states, plus an n×n transition probability matrix P.
At each step, we are in exactly one of the states.
For 1 ≤ i,j ≤ n, the matrix entry P_ij tells us the probability of j being the next state, given we are currently in state i.
  P_ii > 0 is OK.
[Figure: state i with an arrow labeled P_ij to state j.]
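
One way to make the matrix P concrete is to build it for the teleporting random walk described earlier (jump uniformly at a dead end; otherwise jump with probability 10% and follow a random out-link with probability 90%). This is a minimal sketch: the function name and the three-page graph are assumptions for illustration.

```python
def transition_matrix(links, pages, teleport=0.1):
    """Build P as a list of rows; P[i][j] = probability of moving from i to j."""
    n = len(pages)
    index = {p: i for i, p in enumerate(pages)}
    P = []
    for p in pages:
        out = links.get(p, [])
        if not out:
            # Dead end: jump to a page chosen uniformly at random.
            row = [1.0 / n] * n
        else:
            # Teleport share spread over all pages, rest over the out-links.
            row = [teleport / n] * n
            for target in out:
                row[index[target]] += (1.0 - teleport) / len(out)
        P.append(row)
    return P

# Hypothetical graph: A links to B and C, B links to C, C is a dead end.
pages = ["A", "B", "C"]
links = {"A": ["B", "C"], "B": ["C"], "C": []}
P = transition_matrix(links, pages)
for row in P:
    print([round(value, 4) for value in row])
```

Each row is a probability distribution over next states, so every row sums to 1.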

Markov chains
Clearly, for all i, sum_{j=1..n} P_ij = 1.
Markov chains are abstractions of random walks.
Exercise: represent the teleporting random walk from 3 slides ago as a Markov chain, for this case: [figure not preserved]

Ergodic Markov chains
A Markov chain is ergodic if:
  you have a path from any state to any other;
  for any start state, after a finite transient time T0, the probability of being in any state at a fixed time T > T0 is nonzero.
Not ergodic (even/odd). [figure not preserved]

Ergodic Markov chains
For any ergodic Markov chain, there is a unique long-term visit rate for each state.
  Steady-state probability distribution.
Over a long time period, we visit each state in proportion to this rate.
It doesn't matter where we start.

Probability vectors
A probability (row) vector x = (x_1, … x_n) tells us where the walk is at any point.
E.g., (0 0 0 … 1 … 0 0 0), with the 1 in position i out of n, means we're in state i.
More generally, the vector x = (x_1, … x_n) means the walk is in state i with probability x_i.

Change in probability vector
If the probability vector is x = (x_1, … x_n) at this step, what is it at the next step?
Recall that row i of the transition prob. matrix P tells us where we go next from state i.
So from x, our next state is distributed as xP.

Steady state example
The steady state looks like a vector of probabilities a = (a_1, … a_n):
  a_i is the probability that we are in state i.
[Figure: two states, 1 and 2, with transition probabilities 1/4 and 3/4 on the edges.]
For this example, a_1 = 1/4 and a_2 = 3/4.
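
The two-state example can be checked by applying the update x → xP directly. The figure is not preserved in this text, so the matrix below is an assumption: it is the matrix with edge probabilities 1/4 and 3/4 whose steady state is the stated (1/4, 3/4). Exact fractions are used to avoid rounding.

```python
from fractions import Fraction

# Assumed transition matrix for the two-state example (inferred, see above).
P = [[Fraction(1, 4), Fraction(3, 4)],
     [Fraction(1, 4), Fraction(3, 4)]]

def step(x, P):
    """One step of the walk: multiply the row vector x by the matrix P."""
    n = len(P)
    return [sum(x[i] * P[i][j] for i in range(n)) for j in range(n)]

a = [Fraction(1, 4), Fraction(3, 4)]

# The steady state is unchanged by one more step of the walk.
print(step(a, P) == a)  # True

# Starting in state 1 or in state 2 leads to the same distribution.
print(step([Fraction(1), Fraction(0)], P) == a)  # True
print(step([Fraction(0), Fraction(1)], P) == a)  # True
```

In this particular chain both rows of P are identical, so the walk reaches the steady state after a single step; in general many steps are needed.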

How do we compute this vector?
Let a = (a_1, … a_n) denote the row vector of steady-state probabilities.
If our current position is described by a, then the next step is distributed as aP.
But a is the steady state, so a = aP.
Solving this matrix equation gives us a.
  So a is the (left) eigenvector for P.
  (Corresponds to the "principal" eigenvector of P with the largest eigenvalue.)
  Transition probability matrices always have largest eigenvalue 1.

One way of computing a
Recall, regardless of where we start, we eventually reach the steady state a.
Start with any distribution (say x = (1 0 … 0)).
After one step, we're at xP; after two steps at xP^2, then xP^3 and so on.
"Eventually" means for "large" k, xP^k = a.
Algorithm: multiply x by increasing powers of P until the product looks stable.

Pagerank summary
Preprocessing:
  Given graph of links, build matrix P.
  From it compute a.
  The entry a_i is a number between 0 and 1: the pagerank of page i.
Query processing:
  Retrieve pages meeting query.
  Rank them by their pagerank.
  Order is query-independent.

The reality
Pagerank is used in Google, but is hardly the full story of ranking.
  Many sophisticated features are used.
  Some address specific query classes.
  Machine learned ranking heavily used.

Pagerank: Issues and Variants
How realistic is the random surfer model?
  What if we modeled the back button?
  Surfer behavior sharply skewed towards short paths.
  Search engines, bookmarks & directories make jumps non-random.
Biased Surfer Models
  Weight edge traversal probabilities based on match with topic/query (non-uniform edge selection).
  Bias jumps to pages on topic (e.g., based on personal bookmarks & categories of interest).

Topic Specific Pagerank
Goal: pagerank values that depend on query topic.
Conceptually, we use a random surfer who teleports, with say 10% probability, using the following rule:
  Selects a category (say, one of the 16 top level ODP categories) based on a query & user-specific distribution over the categories.
  Teleport to a page uniformly at random within the chosen category.
Sounds hard to implement: can't compute PageRank at query time!

Topic Specific Pagerank
Offline: Compute pagerank for individual categories.
  Query independent as before.
  Each page has multiple pagerank scores, one for each ODP category, with teleportation only to that category.
Online: Distribution of weights over categories computed by query context classification.
  Generate a dynamic pagerank score for each page: weighted sum of category-specific pageranks.

Influencing PageRank ("Personalization")
Input:
  Web graph W
  Influence vector v over topics
    v : (page → degree of influence)
    Vector has one component for each topic
Output:
  Rank vector r: (page → page importance wrt v)
  r = PR(W, v)

Nonuniform Teleportation
[Figure: Sports pages.]
Teleport with 10% probability to a Sports page
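
A minimal sketch of PR(W, v) for a single topic, computed by repeatedly stepping the walk: the surfer follows a random out-link with probability 90% and otherwise teleports to a page drawn uniformly from the topic. The four-page graph, the page names and the function name are assumptions for illustration, as is the choice to teleport into the topic at a dead end.

```python
def topic_pagerank(links, pages, topic_pages, teleport=0.1, iters=200):
    """Approximate pagerank when teleports land only on `topic_pages`."""
    index = {p: i for i, p in enumerate(pages)}
    # Teleport distribution v: uniform over the topic, zero elsewhere.
    v = [1.0 / len(topic_pages) if p in topic_pages else 0.0 for p in pages]
    x = v[:]  # start the walk from the teleport distribution
    for _ in range(iters):
        nxt = [0.0] * len(pages)
        for p in pages:
            mass = x[index[p]]
            out = links.get(p, [])
            # Assumption: at a dead end the surfer always teleports into the topic.
            jump = teleport if out else 1.0
            for j, vj in enumerate(v):
                nxt[j] += jump * mass * vj
            for target in out:
                nxt[index[target]] += (1.0 - teleport) * mass / len(out)
        x = nxt
    return dict(zip(pages, x))

# Hypothetical graph: two sports pages (s1, s2) and two health pages (h1, h2),
# linked symmetrically so that only the teleport target breaks the tie.
pages = ["s1", "s2", "h1", "h2"]
links = {"s1": ["s2", "h1"], "s2": ["s1"], "h1": ["h2", "s1"], "h2": ["h1"]}
sports_rank = topic_pagerank(links, pages, {"s1", "s2"})
print(sports_rank)
```

With teleports restricted to the sports pages, the sports pages receive more of the total rank than the otherwise identical health pages.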

Interpretation of Composite Score
Given a set of personalization vectors {v_j}:
  sum_j [w_j PR(W, v_j)] = PR(W, sum_j [w_j v_j])
Given a user's preferences over topics, express as a combination of the "basis" vectors v_j.

Interpretation
[Figure: Sports pages.]
10% Sports teleportation

Interpretation
[Figure: Health pages.]
10% Health teleportation

Interpretation
[Figure: Health and Sports pages.]
pr = (0.9 PR_sports + 0.1 PR_health) gives you:
9% sports teleportation, 1% health teleportation
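
The composite-score identity can be checked numerically on a small graph. The sketch below is an illustration under assumptions: the four-page graph and all names are made up, every page has at least one out-link (so no dead-end rule is needed), and the walk is iterated a fixed number of times rather than solved exactly.

```python
def personalized_pagerank(links, v, teleport=0.1, iters=200):
    """Iterate x <- (1 - teleport) * (x followed along links) + teleport * v.

    links: page -> list of out-links (every page needs at least one).
    v: page -> teleport probability; the values sum to 1.
    """
    x = dict(v)  # start from the teleport distribution
    for _ in range(iters):
        nxt = {p: teleport * v[p] for p in v}
        for p, mass in x.items():
            share = (1.0 - teleport) * mass / len(links[p])
            for target in links[p]:
                nxt[target] += share
        x = nxt
    return x

# Hypothetical graph with two sports pages and two health pages.
links = {"s1": ["s2", "h1"], "s2": ["s1"], "h1": ["h2", "s1"], "h2": ["h1"]}
v_sports = {"s1": 0.5, "s2": 0.5, "h1": 0.0, "h2": 0.0}
v_health = {"s1": 0.0, "s2": 0.0, "h1": 0.5, "h2": 0.5}

pr_sports = personalized_pagerank(links, v_sports)
pr_health = personalized_pagerank(links, v_health)

# Left side: weighted sum of the two topic-specific pageranks.
combined = {p: 0.9 * pr_sports[p] + 0.1 * pr_health[p] for p in links}

# Right side: one pagerank run with the mixed teleport vector
# (9% sports teleportation, 1% health teleportation overall).
v_mix = {p: 0.9 * v_sports[p] + 0.1 * v_health[p] for p in links}
pr_mix = personalized_pagerank(links, v_mix)

print(all(abs(combined[p] - pr_mix[p]) < 1e-9 for p in links))  # True
```

This is what makes the offline/online split workable: the per-category pageranks are computed once, and a query-time mix is only a weighted sum.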

Hyperlink-Induced Topic Search (HITS)
In response to a query, instead of an ordered list of pages each meeting the query, find two sets of inter-related pages:
  Hub pages are good lists of links on a subject.
    e.g., "Bob's list of cancer-related links."
  Authority pages occur recurrently on good hubs for the subject.
Best suited for "broad topic" queries rather than for page-finding queries.
Gets at a broader slice of common opinion.

Hubs and Authorities
Thus, a good hub page for a topic points to many authoritative pages for that topic.
A good authority page for a topic is pointed to by many good hubs for that topic.
Circular definition: will turn this into an iterative computation.

The hope
[Figure: hubs pointing to authorities, for the example topic of long-distance telephone companies.]

High-level scheme
Extract from the web a base set of pages that could be good hubs or authorities.
From these, identify a small set of top hub and authority pages;
  → iterative algorithm.

Spam in Search
Or, "Search Engine Optimization"

Sec. 19.2.2
The trouble with paid search ads …
It costs money. What's the alternative?
Search Engine Optimization:
  "Tuning" your web page to rank highly in the algorithmic search results for select keywords
  Alternative to paying for placement
  Thus, intrinsically a marketing function
Performed by companies, webmasters and consultants ("Search engine optimizers") for their clients
Some perfectly legitimate, some very shady

Search engine optimization (Spam)
Motives
  Commercial, political, religious, lobbies
  Promotion funded by advertising budget
Operators
  Contractors (Search Engine Optimizers) for lobbies, companies
  Web masters
  Hosting services
Forums
  E.g., Web master world (www.webmasterworld.com)
    Search engine specific tricks
    Discussions about academic papers

Simplest forms
First generation engines relied heavily on tf/idf.
  The top-ranked pages for the query maui resort were the ones containing the most maui's and resort's.
SEOs responded with dense repetitions of chosen terms.
  e.g., maui resort maui resort maui resort
  Often, the repetitions would be in the same color as the background of the web page.
    Repeated terms got indexed by crawlers.
    But not visible to humans on browsers.
Pure word density cannot be trusted as an IR signal.

Variants of keyword stuffing
Misleading meta-tags, excessive repetition
Hidden text with colors, style sheet tricks, etc.
Meta-Tags = "… London hotels, hotel, holiday inn, hilton, discount, booking, reservation, sex, mp3, britney spears, viagra, …"

Cloaking
Serve fake content to search engine spider
DNS cloaking: Switch IP address. Impersonate
[Figure: Cloaking. "Is this a Search Engine spider?" Y: serve SPAM. N: serve Real Doc.]

More spam techniques
Doorway pages
  Pages optimized for a single keyword that re-direct to the real target page
Link spamming
  Mutual admiration societies, hidden links, awards
  Domain flooding: numerous domains that point or re-direct to a target page
Robots
  Millions of submissions via Add-Url

The war against spam
Quality signals
  Prefer authoritative pages based on:
    Votes from authors (linkage signals)
    Votes from users (usage signals)
Robust link analysis
  Ignore statistically implausible linkage (or text)
  Use link analysis to detect spammers (guilt by association)
Spam recognition by machine learning
  Training set based on known spam
Family friendly filters
  Linguistic analysis, general classification techniques, etc.
  For images: flesh tone detectors, source text analysis, etc.
Editorial intervention
  Blacklists
  Top queries audited
  Complaints addressed
  Suspect pattern detection

More on spam
Web search engines have policies on SEO practices they tolerate/block
  http://help.yahoo.com/help/us/ysearch/index.html
  http://www.google.com/intl/en/webmasters/
Adversarial IR: the unending (technical) battle between SEO's and web search engines
Research: http://airweb.cse.lehigh.edu/

NY Times article on JC Penney link spam
http://www.nytimes.com/2011/02/13/business/13search.html
"in the last several months, JCPenny.com in the #1 spot for: "dresses", "bedding", "area rugs""
"Someone paid to have thousands of links placed on hundreds of sites scattered around the Web"
"2,015 pages with phrases like "casual dresses," "evening dresses," "little black dress" or "cocktail dress"."
The NY Times informed Google
At 7 p.m., J. C. Penney #1 result for "Samsonite carry on luggage."
Two hours later, it was at No. 71

New Problem: Content Farms
Demand Media: eHow, etc
http://www.wired.com/magazine/2009/10/ff_demandmedia/all/1
Demand Media's "legion of low-paid writers" "pump out 4,000 videoclips and articles a day. It starts with an algorithm" based on:
  1. Search terms (popular terms from more than 100 sources comprising 2 billion searches a day),
  2. The ad market (a snapshot of which keywords are sought after and how much they are fetching),
  3. The competition (what's online already and where a term ranks in search results).
Wired on Google's change:
http://www.wired.com/epicenter/2011/02/google-clamp-down-content-factories/
"Google updated its core ranking algorithm…to decrease the prevalence of…content farms in top search results."

How to address content farms?
From Google blog: "We've been exploring different algorithms to detect content farms, which are sites with shallow or low-quality content. One of the signals we're exploring is explicit feedback from users. To that end, today we're launching an early, experimental Chrome extension so people can block sites from their web search results. If installed, the extension also sends blocked site information to Google, and we will study the resulting feedback and explore using it as a potential ranking signal for our search results." ...
 Winter '09
