More Stream-Mining
Counting Distinct Elements; Computing "Moments"; Frequent Itemsets; Elephants and Troops; Exponentially Decaying Windows
Counting Distinct Elements
Problem: a data stream consists of elements chosen from a set of size n. Maintain a count
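The problem statement above is cut off, so the following is only a minimal sketch of one well-known way to estimate a distinct count in small memory, in the style of the Flajolet-Martin estimator. The slide text does not confirm that this is the method the lecture uses. The helper names (hash32, trailing_zeros, estimate_distinct), the use of SHA-256 as the hash, and the median-of-estimates combination are all assumptions made for illustration.

```python
import hashlib
import statistics


def trailing_zeros(x: int) -> int:
    """Number of trailing zero bits in x (taken to be 0 when x == 0)."""
    if x == 0:
        return 0
    return (x & -x).bit_length() - 1


def hash32(element, seed: int) -> int:
    """Hypothetical helper: a seeded 32-bit hash built from SHA-256."""
    data = f"{seed}:{element}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest()[:4], "big")


def estimate_distinct(stream, num_hashes: int = 32):
    """Estimate the number of distinct elements seen in the stream.

    For each hash function we remember only the largest number of
    trailing zeros R observed; 2**R is a rough estimate of the count.
    The per-hash estimates are combined with a median (an assumption;
    other combination rules exist).
    """
    max_zeros = [0] * num_hashes
    for element in stream:
        for seed in range(num_hashes):
            z = trailing_zeros(hash32(element, seed))
            if z > max_zeros[seed]:
                max_zeros[seed] = z
    return statistics.median(2 ** r for r in max_zeros)
```

The state kept is one small integer per hash function, regardless of stream length, and repeated elements never change it, which is what makes the approach suitable for streams.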
Near-Neighbor Search
Applications; Matrix Formulation; Minhashing
Example Application: Face Recognition
We have a database of (say) 1 million face images. We want to find the most similar images in the database. Represent faces by (relatively) invariant v
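The outline for this deck lists minhashing. As a rough illustration of that idea (not taken from the slides), the sketch below builds minhash signatures for sets of integer row ids and compares them. The function names, the linear hash family (a*x + b) mod p, and the prime are assumptions chosen for the example; how faces would be turned into sets is not covered here.

```python
import random

PRIME = 2_147_483_647  # 2**31 - 1, used as the modulus (an assumption)


def make_hash_params(k: int, seed: int = 0):
    """Draw k (a, b) pairs defining hash functions h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(k)]


def minhash_signature(items, hash_params):
    """Signature of a non-empty set of integer row ids: one minimum per hash."""
    return [min((a * x + b) % PRIME for x in items) for a, b in hash_params]


def estimated_similarity(sig1, sig2) -> float:
    """Fraction of signature positions that agree; estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)


def jaccard(a: set, b: set) -> float:
    """Exact Jaccard similarity, for comparison with the estimate."""
    return len(a & b) / len(a | b)


params = make_hash_params(100)
s1, s2 = {1, 2, 3, 4}, {3, 4, 5, 6}
exact = jaccard(s1, s2)
approx = estimated_similarity(minhash_signature(s1, params),
                              minhash_signature(s2, params))
```

The signatures have a fixed length of k values per set, so comparing two signatures costs the same no matter how large the underlying sets are.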
Near-Neighbor Search
Applications; Matrix Formulation; Minhashing
Example Problem - Face Recognition
We have a database of (say) 1 million face images. We are given a new image and want to find the most similar images in the database. Represent faces by (
What is Database Theory?
A collection of studies, often connected to the relational model of data. Restricted forms of logic, between SQL and full first-order. Dependency theory: generalizing functional dependencies. Conjunctive queries (CQ's): useful, decida
CS345 Data Mining
Link Analysis Algorithms Page Rank
Anand Rajaraman, Jeffrey D. Ullman
Link Analysis Algorithms
Page Rank; Hubs and Authorities; Topic-Specific Page Rank; Spam Detection Algorithms
Other interesting topics we won't cover: detecting duplicates and mirrors; mining for communities
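The deck above only lists Page Rank as a topic. As a hedged illustration, here is a minimal power-iteration sketch on a tiny link graph. The function name pagerank, the damping parameter beta = 0.85, the fixed iteration count, and the way dead ends are handled (their rank is spread evenly over all pages) are assumptions for the example and are not taken from the slides.

```python
def pagerank(links, beta=0.85, iterations=50):
    """Approximate Page Rank by power iteration.

    links maps each page to the list of pages it links to; every link
    target must also appear as a key. beta is the probability of
    following a link rather than jumping to a random page (an assumption).
    """
    pages = sorted(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives the random-jump share first.
        new_rank = {p: (1.0 - beta) / n for p in pages}
        for p in pages:
            outs = links[p]
            if outs:
                share = beta * rank[p] / len(outs)
                for q in outs:
                    new_rank[q] += share
            else:
                # Dead end: spread its rank evenly over all pages.
                share = beta * rank[p] / n
                for q in pages:
                    new_rank[q] += share
        rank = new_rank
    return rank


# Hypothetical three-page web: a -> b, a -> c, b -> c, c -> a.
example = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
```

With this dead-end rule the ranks keep summing to 1 at every iteration, so they can be read as a probability distribution over pages.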
CS345 Data Mining
Link Analysis 2 Page Rank Variants
Anand Rajaraman, Jeffrey D. Ullman
Topics
This lecture
Many-walkers model; Tricks for speeding convergence; Topic-Specific Page Rank
Random walk interpretation
At time 0, pick a page on the web uniformly
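The description of the random walk is cut off after its first step, so the simulation below is only a sketch under stated assumptions: after the uniform start, the walker follows a uniformly chosen out-link at each step, and jumps to a uniformly chosen page when it reaches a page with no out-links. The function name random_walk_visits and the toy graph are hypothetical.

```python
import random
from collections import Counter


def random_walk_visits(links, steps, seed=0):
    """Simulate one random walker and return the fraction of time on each page.

    links maps each page to the list of pages it links to; every link
    target must also appear as a key.
    """
    rng = random.Random(seed)
    pages = sorted(links)
    current = rng.choice(pages)  # time 0: pick a page uniformly at random
    visits = Counter()
    for _ in range(steps):
        visits[current] += 1
        outs = links[current]
        # Follow a random out-link; on a dead end, jump anywhere (assumption).
        current = rng.choice(outs) if outs else rng.choice(pages)
    return {p: visits[p] / steps for p in pages}


# Hypothetical three-page web: a -> b, a -> c, b -> c, c -> a.
fractions = random_walk_visits({"a": ["b", "c"], "b": ["c"], "c": ["a"]}, steps=10_000)
```

Over a long walk, pages that are easier to reach are visited a larger fraction of the time, which is the intuition behind reading a page's rank as a visit probability.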
CS345 Data Mining
Recommendation Systems
Anand Rajaraman, Jeffrey D. Ullman
Recommendations
Search
Recommendations
Items
Products, web sites, blogs, news items,
The Long Tail
Source: Chris Anderson (2004)
From scarcity to abundance
Shelf space is a scarc
CS 345A Data Mining
MapReduce
Single-node architecture
[Figure: a single node with CPU, Memory, and Disk; labels: Machine Learning, Statistics; "Classical" Data Mining]
Commodity Clusters
Web data sets can be very large
Cannot mine on a single server (why?) Standard architecture emerging:
Te
CS 345A Data Mining
MapReduce
Single-node architecture
[Figure: a single node with CPU, Memory, and Disk; labels: Machine Learning, Statistics; Classical Data Mining]
Commodity Clusters
Web data sets can be very large
Tens to hundreds of terabytes
Cannot mine on a single server (why?) Standard archi
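The slides above motivate MapReduce but the preview stops before describing it. As a rough single-process illustration of the programming model only (not of any cluster framework), the sketch below runs a map phase, groups values by key, and applies a reduce function. The names map_reduce, word_count_mapper, and word_count_reducer are hypothetical, and word count is chosen here simply as a familiar example.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run mapper over every record, group values by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # stands in for the shuffle step
    return {key: reducer(key, values) for key, values in groups.items()}


def word_count_mapper(line):
    """Emit (word, 1) for every whitespace-separated word in a line."""
    for word in line.split():
        yield word.lower(), 1


def word_count_reducer(word, counts):
    """Sum the partial counts collected for one word."""
    return sum(counts)


lines = ["the quick brown fox", "the lazy dog", "The end"]
counts = map_reduce(lines, word_count_mapper, word_count_reducer)
```

On a real cluster the records, the mapper calls, and the per-key reducer calls would be spread across many machines; here everything runs in one loop so the data flow is easy to follow.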
Still More Stream-Mining
Frequent Itemsets; Elephants and Troops; Exponentially Decaying Windows
Counting Items
Problem: given a stream, which items appear more than s times in the window? Possible solution: think of the stream of baskets as one binary st
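The proposed solution above is cut off, so the sketch below does not try to reproduce it. It shows only an exact baseline for the stated problem: keep the last N items in memory and report those that occur more than s times. The class name WindowCounter and its methods are hypothetical, and storing the whole window is an assumption that would not hold for very large windows.

```python
from collections import Counter, deque


class WindowCounter:
    """Exact item counts over a sliding window of the most recent items."""

    def __init__(self, window_size: int):
        self.window_size = window_size
        self.window = deque()
        self.counts = Counter()

    def add(self, item) -> None:
        """Append a new item and evict the oldest one if the window is full."""
        self.window.append(item)
        self.counts[item] += 1
        if len(self.window) > self.window_size:
            old = self.window.popleft()
            self.counts[old] -= 1
            if self.counts[old] == 0:
                del self.counts[old]

    def frequent(self, s: int) -> set:
        """Items that appear more than s times in the current window."""
        return {item for item, c in self.counts.items() if c > s}


wc = WindowCounter(window_size=4)
for item in ["a", "b", "a", "a", "c", "a"]:
    wc.add(item)
# The window now holds the last four items: a, a, c, a.
```

This baseline uses memory proportional to the window size, which is the cost that approximate stream methods are designed to avoid.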
Stream Clustering
Extension of DGIM to More Complex Problems
Clustering a Stream
Assume points enter in a stream. Maintain a sliding window of points. Queries ask for clusters of points within some suffix of the window. Important issue: where are the cl
CS345 Data Mining
Introductions; What Is It?; Cultures of Data Mining
Course Staff
Instructors:
Anand Rajaraman Jeff Ullman
TA:
Robbie Yan
Requirements
Homework (Gradiance and other) 20%
Project 40%
Final Exam 40%
Gradiance class code BB8F69
CS345 - Data Mining
Introductions; What Is It?; Cultures of Data Mining
Course Staff
Instructors:
Anand Rajaraman Jeff Ullman
TA:
Jeff Klingner
Requirements
Homework (Gradiance and other) 20%
Gradiance class code DD984360
Project 40% Final Exam 40%
Pr
CS345A: Data Mining on the Web
Course Introduction; Issues in Data Mining; Bonferroni's Principle
Course Staff
Instructors:
Anand Rajaraman Jeff Ullman
TA:
Babak Pahlavan
Requirements
Homework (Gradiance and other) 20%
Gradiance class code B0E9A
CS345A: Data Mining on the Web
Course Introduction; Issues in Data Mining; Bonferroni's Principle
Course Staff
Instructors:
Anand Rajaraman Jeff Ullman
Reach us at cs345awin0809staff @ lists.stanford.edu.
More info on www.stanford.edu/class/cs345a.
Generalizing MapReduce
The Computational Model; MapReduce-Like Algorithms; Computing Joins
Overview
There is a new computing environment available:
MapReduce allows us to exploit this environment easily.
But not everything is MapReduce.
What else c