CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu

What is the structure of the Web? How is it organized?
2/7/2011 Jure Leskovec, Stanford CS246: Mining Massive Datasets
Web as a directed graph: pages are nodes, hyperlinks are directed edges.

Two types of directed graphs:
  DAG (Directed Acyclic Graph): has no cycles; if u can reach v, then v cannot reach u
  Strongly connected: any node can reach any other node via a directed path
Any directed graph can be expressed in terms of these two types of graphs.
A strongly connected component (SCC) is a set of nodes S such that:
  Every pair of nodes in S can reach each other
  There is no larger set containing S with this property
Any directed graph is a DAG on its SCCs:
  Each SCC is a super-node
  Super-node A links to super-node B if a node in A links to a node in B
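To make the "DAG on its SCCs" idea concrete, here is a minimal sketch that groups nodes into SCCs and then builds the super-node graph. The function names `reach` and `condensation` and the toy graph are hypothetical, chosen only for illustration. The sketch runs one forward and one backward search per component, which is simple but not efficient; linear-time algorithms such as Tarjan's or Kosaraju's would be the usual choice on large graphs.

```python
from collections import defaultdict

def reach(adj, s):
    """Return the set of nodes reachable from s by following edges in adj."""
    seen, stack = {s}, [s]
    while stack:
        u = stack.pop()
        for w in adj[u]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen

def condensation(nodes, edges):
    """Return (list of SCCs, set of edges between SCC ids)."""
    fwd, rev = defaultdict(list), defaultdict(list)
    for a, b in edges:
        fwd[a].append(b)   # original direction
        rev[b].append(a)   # flipped direction
    comp = {}              # node -> id of the SCC (super-node) it belongs to
    sccs = []
    for v in nodes:
        if v in comp:
            continue
        # Nodes that v can reach and that can also reach v.
        scc = reach(fwd, v) & reach(rev, v)
        for u in scc:
            comp[u] = len(sccs)
        sccs.append(scc)
    # Super-node A links to super-node B if some node in A links to a node in B.
    dag_edges = {(comp[a], comp[b]) for a, b in edges if comp[a] != comp[b]}
    return sccs, dag_edges

# Hypothetical toy graph: a 3-cycle, a 2-cycle, and one extra node.
nodes = [1, 2, 3, 4, 5, 6]
edges = [(1, 2), (2, 3), (3, 1), (3, 4), (4, 5), (5, 4), (6, 1)]
sccs, dag_edges = condensation(nodes, edges)
print(sccs)               # [{1, 2, 3}, {4, 5}, {6}]
print(sorted(dag_edges))  # [(0, 1), (2, 0)]
```

In the toy graph the super-node graph has edges 2 -> 0 -> 1 and no cycle, as the statement above says it must.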

Take a large snapshot of the Web and try to understand how its SCCs "fit" together as a DAG.
Computational issue: how do we find the SCC containing a specific node v?
Observation:
  Out(v) … nodes reachable from v (via out-edges)
  In(v) … nodes that can reach v (via in-edges)
  SCC containing v = Out(v, G) ∩ In(v, G) = Out(v, G) ∩ Out(v, G'),
  where G' is G with the directions of all edges flipped
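The observation above can be sketched as two breadth-first searches: one over the out-edges and one over the flipped edges, followed by a set intersection. This is only an illustrative sketch; the names `reachable` and `scc_containing` and the toy edge list are assumptions, not part of the lecture.

```python
from collections import defaultdict, deque

def reachable(adj, start):
    """Return the set of nodes reachable from start by BFS over adj."""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for w in adj.get(u, ()):
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return seen

def scc_containing(edges, v):
    """SCC of v, computed as Out(v) intersected with In(v)."""
    out_adj = defaultdict(list)  # G: follow edges forward
    in_adj = defaultdict(list)   # G with edge directions flipped
    for a, b in edges:
        out_adj[a].append(b)
        in_adj[b].append(a)
    out_v = reachable(out_adj, v)  # nodes v can reach
    in_v = reachable(in_adj, v)    # nodes that can reach v
    return out_v & in_v

# Hypothetical toy graph: 1 -> 2 -> 3 -> 1 is a cycle; 4 and 5 hang off it.
edges = [(1, 2), (2, 3), (3, 1), (3, 4), (5, 1)]
print(sorted(scc_containing(edges, 1)))  # [1, 2, 3]
```

Each search touches every edge at most once, so finding the SCC of a single node costs time linear in the size of the graph.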
250 million webpages, 1.5 billion links [Altavista] [Broder et al., ‘00]

Out-/In-Degree Distribution:
  p_k: fraction of nodes with k out-/in-links
  Histogram of p_k vs. k (y-axis: normalized count, p_k)
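As a small illustration of how p_k could be computed from an edge list, here is a minimal sketch. The function name `degree_distribution`, its `direction` parameter, and the toy graph are hypothetical. Nodes with no links in the chosen direction are counted under k = 0.

```python
from collections import Counter

def degree_distribution(nodes, edges, direction="out"):
    """Return {k: p_k}, the fraction of nodes with k out- or in-links."""
    idx = 0 if direction == "out" else 1      # source end or target end of each edge
    deg = Counter(e[idx] for e in edges)      # node -> degree (missing nodes count as 0)
    counts = Counter(deg[v] for v in nodes)   # k -> number of nodes with degree k
    n = len(nodes)
    return {k: c / n for k, c in sorted(counts.items())}

# Hypothetical toy graph with 4 nodes and 4 directed edges.
nodes = [1, 2, 3, 4]
edges = [(1, 2), (1, 3), (1, 4), (2, 3)]
print(degree_distribution(nodes, edges, "out"))  # {0: 0.5, 1: 0.25, 3: 0.25}
print(degree_distribution(nodes, edges, "in"))   # {0: 0.25, 1: 0.5, 2: 0.25}
```

Plotting the returned values against k gives the histogram of p_k vs. k.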