5 Web Search
Outline: 1. Page rank, for discovering the most "important" pages on the Web, as used in Google. 2. Hubs and authorities, a more detailed evaluation of the importance of Web pages using a variant of the eigenvector calculation used for Page ra
CS 345 Data Mining Lecture 1
Introduction to Web Mining
What is Web Mining?
Discovering useful information from the World-Wide Web and its usage patterns
Applications
Web search, e.g., Google, Yahoo
Vertical search, e.g., FatLens, Become
Recommendations e
6 Mining the Web
Outline: 1. Dynamic itemset counting: searching for interesting sets of items in a space too large ever to consider even each pair of items. 2. "Books and authors": Sergey Brin's intriguing experiment to mine the Web for relational data.
CS345 Data Mining
Virtual Databases
Example
Find marketing manager openings in Internet companies so that my commute is shorter than 10 miles.
Structured queries e.g., in SQL
Virtual Relations
Web
Applications
Comparison shopping
shopping.com, fatlens, mo
CS345 Data Mining
Page Rank Variants
Review Page Rank
Web graph encoded by matrix M
N×N matrix (N = number of web pages); Mij = 1/|O(j)| iff there is a link from j to i; Mij = 0 otherwise; O(j) = set of pages node j links to
Define matrix A as follows
Aij = M
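The definition of M in the review above can be checked on a tiny graph. The following is a minimal sketch, not the lecture's code: the three-page web, the page names, and the helper `link_matrix` are all hypothetical, and exact fractions are used only so that the column sums are easy to inspect.

```python
from fractions import Fraction

def link_matrix(out_links, pages):
    """Build M with M[i][j] = 1/|O(j)| if page j links to page i, else 0."""
    index = {page: k for k, page in enumerate(pages)}
    n = len(pages)
    M = [[Fraction(0)] * n for _ in range(n)]
    for source, targets in out_links.items():
        j = index[source]
        targets = set(targets)          # O(j): the pages that j links to
        for target in targets:
            M[index[target]][j] = Fraction(1, len(targets))
    return M

# Hypothetical web: a -> b, c;  b -> c;  c -> a
pages = ["a", "b", "c"]
out_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
M = link_matrix(out_links, pages)

# Every page here has at least one out-link, so each column sums to 1.
column_sums = [sum(M[i][j] for i in range(len(pages))) for j in range(len(pages))]
```

Because column j spreads a total weight of 1 over the pages in O(j), M is column-stochastic whenever no page is a dead end.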
CS345 Data Mining
Web Spam Detection
Economic considerations
Search has become the default gateway to the web
Very high premium to appear on the first page of search results
e.g., e-commerce sites, advertising-driven sites
What is web spam?
Spamming = any
SQL Recursion
WITH <stuff that looks like Datalog rules> <an SQL query about EDB, IDB>
Rule = [RECURSIVE] R(<arguments>) AS <SQL query>
Example
Find Sally's cousins, using EDB Par(child, parent).
WITH Sib(x,y) AS SELECT p1.child, p2.child FROM Par p1, Par p2 WHERE p1.p
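The idea behind a recursive rule can be illustrated outside SQL by iterating to a fixed point. This sketch does not reproduce the truncated query above; the rules it uses are assumptions: Sib(x,y) holds when x and y are different children of one parent, Cousin(x,y) holds when Sib(x,y) holds, and Cousin(x,y) also holds when the parents of x and y are cousins. The family data is hypothetical.

```python
def siblings(par):
    """Assumed rule: Sib(x, y) if x and y are different children of one parent."""
    return {(x, y) for (x, p) in par for (y, q) in par if p == q and x != y}

def cousins(par):
    """Least fixed point of the assumed Cousin rules (siblings count as cousins)."""
    cousin = set(siblings(par))                 # base rule: Cousin(x, y) <- Sib(x, y)
    while True:
        # recursive rule: Cousin(x, y) <- Par(x, xp), Par(y, yp), Cousin(xp, yp)
        derived = {(x, y)
                   for (x, xp) in par
                   for (y, yp) in par
                   if (xp, yp) in cousin}
        if derived <= cousin:                   # nothing new: fixed point reached
            return cousin
        cousin |= derived

# Hypothetical EDB relation Par(child, parent)
par = {("Sally", "Ann"), ("Tom", "Bob"), ("Ann", "Gran"), ("Bob", "Gran")}
sallys_cousins = {y for (x, y) in cousins(par) if x == "Sally"}
```

Each pass applies the recursive rule to everything derived so far and stops when a pass adds no new tuples.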
CS345
Compact Skeletons
Assume tuple components are scattered over a website. We have a tagger that can tag all tuple components on the website
Assume no noise for now
Reconstruct relation
Compact Skeletons
Relation Skeleton Dat
9 Sequence Matching
Sequences are lists of values S = x1, x2, ..., xk, although we shall often think of the same sequence as a continuous function defined on the interval 0-to-1. That is, the sequence S can be thought of as sample values from a continuou
CS345 Data Mining
Mining the Web for Structured Data
Our view of the web so far
Web pages as atomic units
Great for some applications
e.g., conventional web search
But not always the right model
Going beyond web pages
Question answering
What is the height
4 Query Flocks
Goal: apply a-priori trick and other association-rule tricks to a more general class of complex queries.
4.1 Query Flock Notation
A query flock is a generate-and-test system consisting of: 1. A query with parameters; we write the query in Dat
Evaluating the Web
PageRank; Hubs and Authorities
PageRank
Intuition: solve the recursive equation: a page is important if important pages link to it. In high-falutin terms: importance = the principal eigenvector of the stochastic matrix of the Web.
A fe
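The recursive intuition for PageRank can be sketched as repeated multiplication by the stochastic matrix. This is a minimal illustration under assumptions, not the method from the slides: the three-page matrix is hypothetical, `pagerank` is an illustrative name, and no correction for dead ends or spider traps is applied.

```python
def pagerank(M, iterations=100):
    """Approximate the principal eigenvector of a column-stochastic matrix M."""
    n = len(M)
    r = [1.0 / n] * n                           # start with equal importance
    for _ in range(iterations):
        # r[i] becomes the importance flowing into page i from pages linking to it
        r = [sum(M[i][j] * r[j] for j in range(n)) for i in range(n)]
    return r

# Hypothetical web: a -> b, c;  b -> c;  c -> a  (column j lists where page j sends weight)
M = [
    [0.0, 0.0, 1.0],
    [0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0],
]
rank = pagerank(M)
```

On this graph the iteration settles near 0.4, 0.2, 0.4 for a, b, c, the vector that satisfies r = Mr and sums to 1.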
1 What Is Data Mining?
Originally, "data mining" was a statistician's term for overusing data to draw invalid inferences. Bonferroni's theorem warns us that if there are too many possible conclusions to draw, some will be true for purely statistical reason
3 Low-Support, High-Correlation Mining
We continue to assume a "market-basket" model for data, and we visualize the data as a boolean matrix, where rows = baskets and columns = items. Key assumptions: 1. Matrix is very sparse; almost all 0's. 2. The number
CS345 - Data Mining
Introductions; What Is It?; Cultures of Data Mining
Course Staff
Instructors:
Anand Rajaraman, Jeff Ullman
TA:
Robbie Yan
Requirements
Homework (Gradiance and other) 20%
Gradiance class code BB8F698B
Project 40% Final Exam 40%
Proje
Clustering Large Datasets in Arbitrary Metric Spaces
Venkatesh Ganti, Raghu Ramakrishnan, Johannes Gehrke, Computer Sciences Department, University of Wisconsin-Madison; Allison Powell, James French, Department of Computer Science, University of Virginia, Ch
10 Mining Episodes
In the episode model, the data is a history of events; each event has a type and a time of occurrence. An example of event type might be: "switch 34 became overloaded and had to drop a packet." It is probably too general to have an even
Still More Stream-Mining
Frequent Itemsets; Elephants and Troops; Exponentially Decaying Windows
Counting Items
Problem: given a stream, which items appear more than s times in the window? Possible solution: think of the stream of baskets as one binary st
More Stream-Mining
Counting How Many Elements; Computing Moments
Counting Distinct Elements
Problem: a data stream consists of elements chosen from a set of size n. Maintain a count of the number of distinct elements seen so far. Obvious approach: mainta
Mining Data Streams
The Stream Model; Sliding Windows; Counting 1s
The Stream Model
Data enters at a rapid rate from one or more input ports. The system cannot store the entire stream. How do you make critical calculations about the stream using a limited
Low-Support, High-Correlation
Finding Rare but Similar Items; Minhashing; Locality-Sensitive Hashing
The Problem
Rather than finding high-support itempairs in basket data, look for items that are highly correlated.
If one appears in a basket, there is a g
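Minhashing, named in the outline of this entry, can be sketched on a small basket matrix. This is a minimal illustration under assumptions: each item's column is stored as the set of basket (row) numbers that contain it, Jaccard similarity is used as the similarity measure, explicit random row permutations stand in for hash functions, and the item names and data are hypothetical.

```python
import random

def jaccard(a, b):
    """Fraction of baskets containing either item that contain both."""
    return len(a & b) / len(a | b)

def minhash_signature(column, permutations):
    """For each row permutation, record the smallest permuted row number in the column."""
    return [min(perm[row] for row in column) for perm in permutations]

def estimate_similarity(sig_a, sig_b):
    """Fraction of permutations on which the two minhash values agree."""
    agree = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return agree / len(sig_a)

rng = random.Random(0)                  # fixed seed so runs are repeatable
num_rows = 8
permutations = []
for _ in range(200):
    perm = list(range(num_rows))
    rng.shuffle(perm)
    permutations.append(perm)

# Hypothetical columns: the baskets in which each item appears
beer = {0, 1, 2, 5}
diapers = {0, 1, 2, 6}
caviar = {3, 4}

sig_beer = minhash_signature(beer, permutations)
sig_diapers = minhash_signature(diapers, permutations)
sig_caviar = minhash_signature(caviar, permutations)
estimate = estimate_similarity(sig_beer, sig_diapers)
```

For a random permutation, the chance that two columns share a minhash value equals their Jaccard similarity, so `estimate` should land near `jaccard(beer, diapers)` while needing only the short signatures.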
More Clustering
CURE Algorithm; Clustering Streams
The CURE Algorithm
Problem with BFR/k-means:
Assumes clusters are normally distributed in each dimension. And axes are fixed - ellipses at an angle are not OK.
CURE:
Assumes a Euclidean distance. Allows
Clustering
Distance Measures; Hierarchical Clustering; k-Means Algorithms
The Problem of Clustering
Given a set of points, with a notion of distance between points, group the points into some number of clusters, so that members of a cluster are in some s
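One way to make the clustering problem concrete is a basic k-means loop, one of the algorithm families named in this entry's outline. The sketch below assumes Euclidean distance and caller-supplied starting centroids; the function name `k_means` and the sample points are hypothetical.

```python
import math

def k_means(points, centroids, max_rounds=100):
    """Assign points to the nearest centroid, then recompute centroids, until stable."""
    centroids = [tuple(c) for c in centroids]
    labels = None
    for _ in range(max_rounds):
        new_labels = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:        # assignments stopped changing
            break
        labels = new_labels
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:                 # an empty cluster keeps its old centroid
                centroids[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return labels, centroids

# Hypothetical data: two well-separated groups in the plane
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = k_means(points, centroids=[(0, 0), (10, 10)])
```

With these starting centroids the first assignment is already stable, so the loop stops after recomputing the two centroids once.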
CS345 Data Mining
Crawling the Web
Web Crawling Basics
Start with a seed set of to-visit urls
[Figure: crawl loop, with labels: to-visit urls, get next url, get page, Web, extract urls, visited urls, web pages]
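The basic crawl loop can be sketched as follows. This is a minimal illustration, not a production crawler: `get_page` and `extract_urls` are hypothetical callables supplied by the caller, and a small in-memory dictionary stands in for the Web so that no network access is needed.

```python
from collections import deque

def crawl(seeds, get_page, extract_urls, max_pages=1000):
    """Breadth-first crawl: take the next url, fetch its page, queue its links."""
    to_visit = deque(seeds)             # the to-visit urls, seeded up front
    visited = set()                     # urls already taken from the queue
    pages = {}                          # url -> fetched page
    while to_visit and len(pages) < max_pages:
        url = to_visit.popleft()        # get next url
        if url in visited:
            continue
        visited.add(url)
        page = get_page(url)            # get page
        if page is None:                # unreachable url: nothing to store
            continue
        pages[url] = page
        for link in extract_urls(page): # extract urls
            if link not in visited:
                to_visit.append(link)
    return pages

# Hypothetical stand-in for the Web: each "page" is a space-separated list of links
FAKE_WEB = {"a": "b c", "b": "c", "c": "a d"}
pages = crawl(["a"], FAKE_WEB.get, str.split)
```

The `max_pages` cap is one simple response to the resource limits that any real crawl faces.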
Crawling Issues
Load on web servers; insufficient resources to crawl entire web
Which
8 More About Clustering
We continue our discussion of large-scale clustering algorithms, covering: 1. Fastmap, and other ways to create a Euclidean space from an arbitrary distance measure. 2. The GRGPF algorithm for clustering without a Euclidean space.
1 Clustering
Given points in some space, often a high-dimensional space, group the points into a small number of clusters, each cluster consisting of points that are "near" in some sense. Some applications: 1. Many years ago, during a cholera outbreak in
Hash-Based Improvements to A-Priori
Park-Chen-Yu Algorithm; Multistage Algorithm; Approximate Algorithms
PCY Algorithm
Hash-based improvement to A-Priori. During Pass 1 of A-priori, most memory is idle. Use that memory to keep counts of buckets into which
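The use of idle Pass 1 memory for bucket counts can be sketched for pairs as below. This is a minimal illustration under assumptions: pairs are hashed with Python's built-in `hash`, an itemset counts as frequent when its count is at least `support`, and the baskets and function name are hypothetical.

```python
from collections import Counter
from itertools import combinations

def pcy_frequent_pairs(baskets, support, num_buckets):
    """Two-pass PCY sketch: bucket counts from pass 1 prune pair counting in pass 2."""
    def bucket(pair):
        return hash(pair) % num_buckets

    # Pass 1: count single items and, in otherwise idle memory, hashed pair buckets.
    item_counts = Counter()
    bucket_counts = [0] * num_buckets
    for basket in baskets:
        items = sorted(set(basket))
        item_counts.update(items)
        for pair in combinations(items, 2):
            bucket_counts[bucket(pair)] += 1

    frequent_items = {i for i, c in item_counts.items() if c >= support}
    bitmap = [c >= support for c in bucket_counts]   # one bit per bucket

    # Pass 2: count a pair only if both items are frequent and its bucket is frequent.
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket) & frequent_items)
        for pair in combinations(items, 2):
            if bitmap[bucket(pair)]:
                pair_counts[pair] += 1
    return {p: c for p, c in pair_counts.items() if c >= support}

# Hypothetical baskets
baskets = [
    ["bread", "milk"],
    ["bread", "milk", "beer"],
    ["milk", "beer"],
    ["bread", "milk", "diapers"],
]
frequent_pairs = pcy_frequent_pairs(baskets, support=2, num_buckets=7)
```

A frequent pair always lands in a frequent bucket, so the bitmap can only discard pairs that could not have been frequent anyway; the final answer does not depend on the hash function.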
Association Rules
Market Baskets; Frequent Itemsets; A-priori Algorithm
The Market-Basket Model
A large set of items, e.g., things sold in a supermarket. A large set of baskets, each of which is a small set of the items, e.g., the things one customer buys
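The market-basket model and the A-priori idea named in this entry's outline can be sketched together. This is a minimal illustration under assumptions: an itemset is treated as frequent when at least `support` baskets contain it, counting is done by brute force over the baskets, and the data and function name are hypothetical.

```python
from itertools import combinations

def apriori(baskets, support):
    """Return {itemset: count} for every itemset found in at least `support` baskets."""
    baskets = [frozenset(b) for b in baskets]

    def count(candidates):
        return {c: sum(1 for b in baskets if c <= b) for c in candidates}

    singles = {frozenset([item]) for b in baskets for item in b}
    current = {c: n for c, n in count(singles).items() if n >= support}
    frequent = {}
    k = 1
    while current:
        frequent.update(current)
        k += 1
        pool = sorted(set().union(*current))         # items still in play
        # A-priori pruning: keep a size-k candidate only if all of its
        # (k-1)-subsets were frequent in the previous round.
        candidates = {
            frozenset(c) for c in combinations(pool, k)
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        current = {c: n for c, n in count(candidates).items() if n >= support}
    return frequent

# Hypothetical baskets
baskets = [
    {"bread", "milk"},
    {"bread", "milk", "beer"},
    {"milk", "beer"},
    {"bread", "milk", "diapers"},
]
frequent = apriori(baskets, support=2)
```

Here the triple {beer, bread, milk} is never counted, because its subset {beer, bread} was not frequent among the pairs.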
2 Association Rules and Frequent Itemsets
The market-basket problem assumes we have some large number of items, e.g., "bread," "milk." Customers fill their market baskets with some subset of the items, and we get to know what items people buy together, even i