CS246: Mining Massive Data Sets
Winter 2013
Final
These questions require thought, but do not require long answers. Be as concise as possible.
You have three hours to complete this final. The exam has 22 pages and the total is 180
points so that you can p
Stanford CS 246H:
Mining Massive Data Sets
Hadoop Lab
Stanford CS 246H Winter 15
Hadoop, There It Is!
Big Data Problems Getting Bigger
[Diagram: data growth driven by end-user applications, the Internet, and mobile]
Jeffrey D. Ullman
Stanford University
The entity-resolution problem is to examine a
collection of records and determine which refer
to the same entity.
Entities could be people, events, etc.
Typically, we want to merge records if their
values in corresponding fields are similar.
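One simple way to act on this idea is to normalize each record's fields and merge records that agree on the normalized values. The sketch below is illustrative, not the lecture's method; the sample records and the exact-match-after-normalization rule are assumptions.

```python
# Minimal entity-resolution sketch (illustrative): records whose normalized
# name and email match are treated as the same entity.

def normalize(s):
    # Lowercase and drop non-alphanumerics so "J. Smith" matches "j smith"
    return "".join(ch for ch in s.lower() if ch.isalnum())

def resolve(records):
    """Group records by a key built from their normalized fields."""
    entities = {}
    for rec in records:
        key = (normalize(rec["name"]), normalize(rec["email"]))
        entities.setdefault(key, []).append(rec)
    return list(entities.values())

records = [
    {"name": "J. Smith", "email": "jsmith@example.com"},
    {"name": "j smith",  "email": "JSmith@example.com"},
    {"name": "A. Jones", "email": "ajones@example.com"},
]
groups = resolve(records)
print(len(groups))  # 2 entities: the two Smith records merge
```

Real entity resolution usually needs fuzzy matching (edit distance, token overlap) rather than exact matching on normalized keys, since real records contain typos.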
Application: Similar Documents
Shingling
Minhashing
Locality-Sensitive Hashing
Application: Entity Resolution
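The pipeline named above can be sketched end to end: shingle two documents, minhash their shingle sets, and compare signatures to estimate Jaccard similarity. The documents, the 3-character shingle size, and the 100 hash functions below are illustrative choices.

```python
import random

# Shingling + minhashing sketch: the fraction of positions where two minhash
# signatures agree estimates the Jaccard similarity of the shingle sets.

def shingles(doc, k=3):
    return {doc[i:i + k] for i in range(len(doc) - k + 1)}

def minhash_sig(shingle_set, hash_funcs):
    # Signature = minimum hash value of the set under each hash function
    return [min(h(s) for s in shingle_set) for h in hash_funcs]

P = 2**61 - 1  # large prime for the random linear hash functions
random.seed(0)

def make_hash():
    a, b = random.randrange(1, P), random.randrange(P)
    return lambda s: (a * hash(s) + b) % P

hashes = [make_hash() for _ in range(100)]

d1 = shingles("the quick brown fox jumps over the lazy dog")
d2 = shingles("the quick brown fox jumped over the lazy dog")

sig1 = minhash_sig(d1, hashes)
sig2 = minhash_sig(d2, hashes)

est = sum(a == b for a, b in zip(sig1, sig2)) / len(sig1)
true_jac = len(d1 & d2) / len(d1 | d2)
print(round(est, 2), round(true_jac, 2))  # estimate tracks true Jaccard
```

LSH would then band the signatures so that only pairs agreeing in some band become candidates, avoiding the all-pairs comparison.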
It has been said that the mark of a computer
scientist is that they believe hashing is real.
I.e., it is p
Given a set of training points (x, y), where:
1. x is a real-valued vector of d dimensions, and
2. y is a binary decision +1 or -1,
a perceptron tries to find a linear separator
between the positive and negative input
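The update rule can be sketched directly from this definition: whenever a training point is misclassified, move the separator toward it. The toy 2-D dataset below is made up for illustration.

```python
# Minimal perceptron sketch: repeat over the data, nudging (w, b) whenever
# y * (w.x + b) <= 0, until every point is on the correct side.

def perceptron(points, epochs=100):
    d = len(points[0][0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        updated = False
        for x, y in points:
            if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0:
                # Misclassified: move the separator toward x
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                updated = True
        if not updated:
            break  # converged: all points correctly classified
    return w, b

data = [((2, 3), +1), ((3, 3), +1), ((-1, -2), -1), ((-2, -1), -1)]
w, b = perceptron(data)
ok = all(y * (sum(wi * xi for wi, xi in zip(w, x)) + b) > 0 for x, y in data)
print(ok)  # True: every training point satisfies y(w.x + b) > 0
```

Convergence is guaranteed only when the data are linearly separable; otherwise the loop runs out its epoch budget.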
CS246: Mining Massive Datasets
Jure Leskovec, Stanford University
http://cs246.stanford.edu
Training movie rating data
100 million ratings, 480,000 users, 17,770 movies
6 years of data: 2000-2005
Test movie rating data
Last few ratings of each user
Often, our data can be represented by an
m-by-n matrix.
And this matrix can be closely approximated by
the product of two matrices that share a small
common dimension r.
M (m-by-n) ~ U (m-by-r) x V (r-by-n)
There are hidden, or latent, factors that explain the data.
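A tiny concrete instance of this shape: a 4-by-3 matrix built exactly as the product of a 4-by-2 and a 2-by-3 matrix, so its rank is only 2. The numbers are toy values chosen here.

```python
# Low-rank illustration: M = U x V with shared dimension r = 2, so M has
# only 2 independent "latent factor" directions.

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

U = [[1, 0], [0, 1], [1, 1], [2, 1]]   # m-by-r: each row = a row entity's factors
V = [[1, 2, 3], [4, 5, 6]]             # r-by-n: each column = a column entity's factors
M = matmul(U, V)                       # m-by-n, but rank 2

print(M)  # [[1, 2, 3], [4, 5, 6], [5, 7, 9], [6, 9, 12]]
```

The storage payoff appears at scale: for m = n = 1000 and r = 10, M has a million entries while U and V together have only 20,000.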
Given a set of points, with a notion of distance
between points, group the points into some
number of clusters, so that members of a
cluster are close to each other, while
members of different clusters are far.
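k-means is one standard way to do this grouping; the sketch below alternates between assigning points to the nearest center and recomputing centers. The 1-D points and k = 2 are illustrative choices.

```python
import random

# Minimal k-means sketch: assign each point to its nearest center, then
# move each center to the mean of its assigned points; repeat.

def kmeans(points, k, iters=20):
    random.seed(1)  # fixed seed so the sketch is reproducible
    centers = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: abs(p - centers[j]))
            clusters[i].append(p)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

pts = [1.0, 1.2, 0.8, 10.0, 10.3, 9.7]
centers = kmeans(pts, 2)
print(centers)  # centers near 1.0 and 10.0
```

With well-separated data like this the result is stable; in general k-means can converge to a local optimum that depends on the initial centers.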
Web pages are important if people visit them a
lot.
But we can't watch everybody using the Web.
A good surrogate for visiting pages is to assume
people follow links randomly.
Leads to random surfer model:
Start at a random page.
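The random-surfer model can be computed by power iteration: repeatedly push each page's rank along its out-links, mixing in a small teleport probability. The 3-page link graph below is a made-up example, and beta = 0.85 is a conventional choice.

```python
# Power-iteration sketch of PageRank: with probability beta the surfer
# follows a random out-link; with probability 1 - beta it teleports to a
# uniformly random page.

def pagerank(links, beta=0.85, iters=50):
    n = len(links)
    rank = {p: 1.0 / n for p in links}
    for _ in range(iters):
        new = {p: (1 - beta) / n for p in links}
        for p, outs in links.items():
            for q in outs:
                # surfer at p picks one of p's out-links uniformly
                new[q] += beta * rank[p] / len(outs)
        rank = new
    return rank

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
r = pagerank(links)
print({p: round(v, 3) for p, v in r.items()})  # C collects the most rank
```

This toy graph has no dead ends or spider traps; handling those (taxation, dead-end removal) is what the teleport term and preprocessing are for in the full algorithm.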
Jeffrey D. Ullman
Stanford University/Infolab
Slides mostly developed by
Anand Rajaraman
Classic model of (offline) algorithms:
You get to see the entire input, then compute
some function of it.
Online algorithm:
You get to see the input one piece at a time.
Suppose we have a dataset stored in a
distributed file system, spread over many
chunks (e.g. blocks of 64MB).
We want to find a particular value V, looking at
as few chunks as possible.
A Bloom filter on each chunk lets us skip chunks that cannot contain V.
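A Bloom filter works because a lookup can answer "definitely absent" or "possibly present", with no false negatives. The sketch below uses a plain int as the bit array; the sizes (m = 1024 bits, k = 3 hashes) and the SHA-256-based hash family are illustrative choices, not tuned values.

```python
import hashlib

# Minimal Bloom filter sketch: adding a key sets k bit positions; a lookup
# reports "possibly present" only if all k positions are set.

class BloomFilter:
    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = 0  # an int used as an m-bit array

    def _positions(self, key):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False means definitely absent; True means possibly present
        return all(self.bits >> pos & 1 for pos in self._positions(key))

bf = BloomFilter()
for v in ["alpha", "beta", "gamma"]:
    bf.add(v)
print(bf.might_contain("alpha"), bf.might_contain("zzz"))
```

In the distributed-file-system setting, each chunk would carry such a filter over the values it holds, so a query for V reads only chunks whose filters say "possibly present".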
3 announcements:
Thanks for filling out the HW1 poll
HW2 is due today 5pm (scans must be readable)
HW3 will be posted today
CS246: Mining Massive Datasets
Jure Leskovec, Stanford University
http://cs246.stanford.edu
[Course map: high-dim. data, graph data, infinite data, machine learning, apps]
CS246: Mining Massive Datasets
Winter 2015
Hadoop Tutorial
Due 5:00pm January 13, 2015
General Instructions
The purpose of this tutorial is (1) to get you started with Hadoop and (2) to get you
acquainted with the code and homework submission system. Comp