21 Markov Chains
Markov chains represent and model the flow of information in a graph; they give insight into how a graph is connected and which nodes are important. As we will see, they also provide
Interesting pages fall into two classes:
1. Authorities are pages containing useful information:
   Newspaper home pages
   Course home pages
   Home pages of auto manufacturers
2. Hubs are pages that link to authorities.
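The two roles above can be scored by mutual reinforcement, in the style of the HITS algorithm: a good hub links to good authorities, and a good authority is linked to by good hubs. Below is a minimal sketch; the tiny link graph and the page names are made-up examples, not from the notes.

```python
# A minimal sketch of hub/authority scoring in the HITS style.
# The link graph below is a made-up toy example.

def hits(links, iters=50):
    """links: dict mapping page -> list of pages it links to."""
    pages = set(links) | {q for qs in links.values() for q in qs}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iters):
        # A page's authority score sums the hub scores of pages linking to it.
        auth = {p: sum(hub[q] for q in links if p in links.get(q, []))
                for p in pages}
        # A page's hub score sums the authority scores of pages it links to.
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        # Normalize so the scores do not blow up over iterations.
        an = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        hn = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / an for p, v in auth.items()}
        hub = {p: v / hn for p, v in hub.items()}
    return hub, auth

# "portal" links to two content pages, so it should emerge as the top hub;
# "news" is linked to by both other pages, so it should be the top authority.
graph = {"portal": ["news", "course"], "blog": ["news"]}
hub, auth = hits(graph)
print(max(hub, key=hub.get), max(auth, key=auth.get))
```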
B490 Mining the Big Data
1 Finding Similar Items
Qin Zhang
Motivations
Finding similar documents/webpages/images
(Approximate) mirror sites.
Application: Don't want to show both when Googling.
11 Spectral Clustering
Another perspective on clustering is that there are three main types: (1) Bottom-Up, (2) Assignment-Based,
and (3) Top-Down. The bottom-up variety was like the hierarchical clustering
3 Mining Frequent Items
Motivations
Find what is hot!
IP addresses in network routers
Most popular movies on Netflix
People with the most fans on Twitter
5 Models for Big Data
MapReduce
The MapReduce model (Dean & Ghemawat 2004): the input is processed by a Map phase, intermediate results are regrouped by a Shuffle phase, and a Reduce phase produces the output. It is the standard model in industry for massive data.
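The three phases can be illustrated with word count, the standard teaching example for the model. This is a minimal in-memory sketch, not a distributed implementation; the function names and the example documents are mine.

```python
# A minimal in-memory sketch of the Map -> Shuffle -> Reduce pipeline,
# using word count as the example problem.
from collections import defaultdict

def map_phase(doc):
    # Map: emit a (key, value) pair for every word in the document.
    for word in doc.split():
        yield (word, 1)

def shuffle(pairs):
    # Shuffle: group all emitted values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: combine the values of one key into the final result.
    return (key, sum(values))

docs = ["big data", "mining big data"]
pairs = [kv for doc in docs for kv in map_phase(doc)]
counts = dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())
print(counts)  # {'big': 2, 'data': 2, 'mining': 1}
```

In the real system each phase runs in parallel across many machines, and the shuffle moves data over the network; the logic per key is exactly as above.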
10 k-Means Clustering
Probably the most famous clustering formulation is k-means. This is the focus today. Note: k-means is not
an algorithm; it is a problem formulation.
k-Means is in the family of assignment-based clustering
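Since k-means is a problem formulation, it needs a separate algorithm; the common heuristic is Lloyd's algorithm, which alternates assigning points to their nearest center and moving each center to its cluster mean. A minimal 1-D sketch, with made-up data:

```python
# A minimal sketch of Lloyd's algorithm, the standard heuristic for the
# k-means problem (find k centers minimizing the sum of squared distances
# from each point to its nearest center). 1-D points, made-up data.
import random

def lloyd(points, k, iters=20, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: (p - centers[j]) ** 2)
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        # (keep the old center if a cluster ends up empty).
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    return sorted(centers)

# Two obvious 1-D clusters around 1 and 10.
pts = [0.9, 1.0, 1.1, 9.9, 10.0, 10.1]
centers = lloyd(pts, k=2)
print(centers)  # two centers near 1.0 and 10.0
```

Lloyd's algorithm only converges to a local optimum, which is one reason the problem formulation and the algorithm should not be conflated.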
6 Locality Sensitive Hashing
In the last few lectures we saw how to convert from a document full of words or characters to a set, and then
to a matrix, and then to a k-dimensional vector. And from the
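The set-to-short-vector step described above can be sketched with MinHash signatures: for each of k hash functions, keep the minimum hash value over the set, so that similar sets agree on many coordinates. This sketch salts a standard hash to simulate k independent hash functions; the sentences are made-up examples.

```python
# A minimal sketch of turning a document's word set into a short MinHash
# signature; the fraction of agreeing coordinates between two signatures
# estimates the Jaccard similarity of the underlying sets.
import hashlib

def minhash_signature(words, k=20):
    """For each of k (salted) hash functions, keep the minimum hash value."""
    sig = []
    for i in range(k):
        # Salting md5 with the index i simulates k independent hash functions.
        sig.append(min(
            int(hashlib.md5(f"{i}:{w}".encode()).hexdigest(), 16)
            for w in words))
    return sig

def estimate_jaccard(sig_a, sig_b):
    # Fraction of agreeing coordinates estimates Jaccard similarity.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = set("the quick brown fox jumps over the lazy dog".split())
b = set("the quick brown fox leaps over the lazy cat".split())
sa, sb = minhash_signature(a), minhash_signature(b)
print(estimate_jaccard(sa, sb))
```

With only k = 20 coordinates the estimate is coarse; larger k trades space for accuracy.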
7 Distances
We have mainly been focusing on similarities so far, since it is easiest to explain locality sensitive hashing
that way, and in particular the Jaccard similarity is easy to define in regards
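For concreteness, the Jaccard similarity of two sets is the size of their intersection over the size of their union, and 1 minus it gives the Jaccard distance. A minimal sketch, with made-up word sets (the empty-set convention below is one common choice):

```python
# Jaccard similarity |A ∩ B| / |A ∪ B| and the corresponding
# Jaccard distance 1 - J(A, B), which is a metric on sets.

def jaccard_similarity(a, b):
    # Convention: two empty sets are considered identical (J = 1).
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def jaccard_distance(a, b):
    return 1.0 - jaccard_similarity(a, b)

A = {"data", "mining", "big"}
B = {"data", "mining", "fast"}
print(jaccard_similarity(A, B))  # 2 shared / 4 total = 0.5
print(jaccard_distance(A, B))    # 0.5
```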
9 Hierarchical Clustering
This marks the beginning of the clustering section. The basic idea is to take a set X of items and somehow
partition X into subsets, so each subset has similar items. Obvious
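One concrete way to build such a partition bottom-up is agglomerative clustering with single linkage: start with every item in its own cluster and repeatedly merge the two clusters whose closest pair of points is closest. A minimal 1-D sketch with made-up data (quadratic-time, for illustration only):

```python
# A minimal sketch of bottom-up (agglomerative) clustering with single
# linkage: repeatedly merge the two clusters whose closest pair of
# points is closest, until only k clusters remain.

def single_linkage(points, k):
    clusters = [[p] for p in points]
    while len(clusters) > k:
        # Find the pair of clusters with the smallest closest-pair distance.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(p - q) for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)
    return [sorted(c) for c in clusters]

result = single_linkage([1.0, 1.2, 5.0, 5.1, 9.0], k=2)
print(result)  # [[1.0, 1.2, 5.0, 5.1], [9.0]]
```

Note how single linkage chains nearby groups together: 5.0 and 5.1 end up with the points near 1 rather than with the isolated point 9.0.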
13 Frequent Itemsets
A classic problem in data mining is association rule mining. The basic problem is posed as follows: We
have a large set of m tuples {T1 , T2 , . . . , Tm }, each tuple Tj = {
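The core subroutine behind association rule mining is counting the support of candidate itemsets over the tuples (baskets) and keeping those that meet a minimum support threshold. A minimal brute-force sketch with made-up baskets; real algorithms such as A-Priori prune candidates rather than enumerating all of them.

```python
# Brute-force support counting for itemsets of a fixed size: an itemset is
# frequent if it is contained in at least min_support of the baskets.
from itertools import combinations

def frequent_itemsets(baskets, min_support, size):
    """Return all itemsets of the given size with support >= min_support."""
    items = sorted({x for basket in baskets for x in basket})
    result = {}
    for candidate in combinations(items, size):
        support = sum(1 for basket in baskets if set(candidate) <= basket)
        if support >= min_support:
            result[candidate] = support
    return result

baskets = [{"milk", "bread"}, {"milk", "bread", "beer"},
           {"bread", "beer"}, {"milk"}]
freq = frequent_itemsets(baskets, min_support=2, size=2)
print(freq)  # {('beer', 'bread'): 2, ('bread', 'milk'): 2}
```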
B490: Probability Basics
Instructor: Qin Zhang
1 Expectation and Variance
If random variable X takes discrete values x1 , x2 , . . . with probabilities p1 , p2 , . . ., then E[X] = Σi xi pi , and Var[X] = E[X^2] − (E[X])^2 = Σi xi^2 pi − (Σi xi pi)^2 .
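A quick numeric check of these formulas, using a fair six-sided die as the example (exact arithmetic via fractions to avoid rounding):

```python
# E[X] = sum_i x_i p_i and Var[X] = E[X^2] - (E[X])^2,
# checked on a fair six-sided die.
from fractions import Fraction

xs = [1, 2, 3, 4, 5, 6]
ps = [Fraction(1, 6)] * 6

E = sum(x * p for x, p in zip(xs, ps))        # expectation E[X]
E2 = sum(x * x * p for x, p in zip(xs, ps))   # second moment E[X^2]
Var = E2 - E ** 2                             # variance

print(E, Var)  # 7/2 35/12
```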
0 Introduction
Data Mining
What is Data Mining?
A definition: Discovery of useful, possibly unexpected,
patterns in data.