21 Markov Chains
Markov chains represent and model the flow of information in a graph; they give insight into how a graph is connected and which nodes are important.
As we will see, they also provide important life lessons:
[L1] Only your current position matters.
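The "only your current position matters" idea can be made concrete with a tiny sketch (the 3-node chain below is invented, not from the notes): repeatedly applying the transition matrix drives any start distribution to the chain's stationary distribution, one way of ranking node importance.

```python
# A minimal sketch (made-up 3-node chain): repeatedly applying a
# row-stochastic transition matrix P to a start distribution
# approximates the stationary distribution pi, satisfying pi = pi P.

def stationary(P, steps=1000):
    """Power iteration from the uniform distribution."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(steps):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

# node 1 is the best connected, so it gets the most stationary mass
P = [[0.50, 0.50, 0.00],
     [0.25, 0.50, 0.25],
     [0.00, 0.50, 0.50]]
pi = stationary(P)   # approximately [0.25, 0.5, 0.25]
```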
Interesting pages fall into two classes:
1. Authorities are pages containing useful information
   Newspaper home pages
   Course home pages
   Home pages of auto manufacturers
2. Hubs are pages that link to authorities
   List of newspapers
   Course bulletin
   List of auto manufacturers
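This split is the heart of Kleinberg's HITS algorithm, which scores hubs and authorities by mutual reinforcement; a hedged sketch on an invented link graph (page names are made up):

```python
# HITS sketch on a made-up graph: two "list" pages act as hubs that
# link to three content pages (the would-be authorities).
links = {"list1": ["pageA", "pageB"],
         "list2": ["pageB", "pageC"]}
pages = {"list1", "list2", "pageA", "pageB", "pageC"}

hub = {p: 1.0 for p in pages}
auth = {p: 1.0 for p in pages}
for _ in range(50):
    # authority score: total hub score of the pages linking in
    auth = {p: sum(hub[u] for u, outs in links.items() if p in outs)
            for p in pages}
    norm = sum(v * v for v in auth.values()) ** 0.5
    auth = {p: v / norm for p, v in auth.items()}
    # hub score: total authority of the pages linked to
    hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
    norm = sum(v * v for v in hub.values()) ** 0.5
    hub = {p: v / norm for p, v in hub.items()}
# pageB, linked by both hubs, ends up with the top authority score
```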
B490 Mining the Big Data
1 Finding Similar Items
Qin Zhang
Motivations
Finding similar documents/webpages/images
(Approximate) mirror sites.
Application: don't want to show both in Google search results.
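One standard way to make "similar" precise for documents (developed later in the course; k = 3 character shingles is an assumption here) is the Jaccard similarity of their shingle sets:

```python
# Jaccard similarity of k-shingle sets; the example strings are made up.
def shingles(text, k=3):
    """All length-k substrings of text, as a set."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b)

s1 = shingles("mirror site example")
s2 = shingles("mirror site sample")
sim = jaccard(s1, s2)   # high, but below 1: near-duplicate pages
```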
11 Spectral Clustering
Another perspective on clustering is that there are three main types: (1) bottom-up, (2) assignment-based, and (3) top-down. The bottom-up variety is like hierarchical clustering, where we start with very small clusters and build them up by merging …
3 Mining Frequent Items
Motivations
Find what is hot!
IP addresses in network routers
Most popular movies on Netflix
People having the most fans on Twitter
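A classic way to find the hot items in a stream with very little memory is a counter-based summary; the sketch below uses the Misra–Gries algorithm, which is an assumption here, since the lecture itself may develop a different method.

```python
def misra_gries(stream, k):
    """Keep at most k-1 counters; every item occurring more than
    len(stream)/k times is guaranteed to survive in the summary."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:
            # no free counter: decrement everything, dropping zeros
            for y in list(counters):
                counters[y] -= 1
                if counters[y] == 0:
                    del counters[y]
    return counters

# "a" and "b" dominate the stream, so both must appear in the summary
stream = ["a"] * 50 + ["b"] * 30 + list("cdefghij")
summary = misra_gries(stream, k=4)
```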
5 Models for Big Data
MapReduce
The MapReduce model (Dean & Ghemawat 2004): Input → Map → Shuffle → Reduce → Output.
The standard model in industry for massive data computation, e.g., Hadoop.
Goal: minimize (1) total communication …
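The phases can be mimicked sequentially in a few lines; this toy word count is the standard illustrative MapReduce job (not taken from the slides), with a real system distributing each phase across machines.

```python
from collections import defaultdict

def map_phase(docs):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in docs:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine each key's values, here by summing."""
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data mining", "mining big graphs"]
counts = reduce_phase(shuffle(map_phase(docs)))
# counts == {"big": 2, "data": 1, "mining": 2, "graphs": 1}
```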
10 k-Means Clustering
Probably the most famous clustering formulation is k-means. This is the focus today. Note: k-means is not
an algorithm, it is a problem formulation.
k-Means is in the family of assignment-based clustering. Each cluster is represented by a single point, to which the points in the cluster are assigned.
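The usual heuristic for this formulation is Lloyd's algorithm, alternating an assignment step and a center-update step; a 1-D sketch with invented data (real implementations add smarter seeding, e.g. k-means++):

```python
def lloyd(points, centers, iters=20):
    """Alternate assign-to-nearest-center and move-center-to-mean."""
    for _ in range(iters):
        # assignment step: each point goes to its nearest center
        clusters = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)), key=lambda c: (p - centers[c]) ** 2)
            clusters[i].append(p)
        # update step: each center moves to its cluster's mean
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

centers = lloyd([1.0, 2.0, 3.0, 10.0, 11.0, 12.0], [0.0, 5.0])
# converges to the two group means, [2.0, 11.0]
```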
6 Locality Sensitive Hashing
In the last few lectures we saw how to convert from a document full of words or characters to a set, then to a matrix, and then to a k-dimensional vector. From the final vector we could approximate the Jaccard distance between them.
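That set-to-vector pipeline is minhashing; the sketch below (the hash family and data are assumptions) builds signatures whose agreement rate estimates Jaccard similarity:

```python
import random

# fixed integer ids stand in for the shingles of two documents
ids = {"the": 1, "quick": 2, "brown": 3, "fox": 4, "dog": 5}
A = {"the", "quick", "brown", "fox"}
B = {"the", "quick", "brown", "dog"}   # true Jaccard(A, B) = 3/5

PRIME = (1 << 61) - 1
random.seed(0)

def make_hash():
    """A random linear hash x -> (a*x + b) mod p (assumed family)."""
    a, b = random.randrange(1, PRIME), random.randrange(PRIME)
    return lambda x: (a * ids[x] + b) % PRIME

def signature(s, hashes):
    """One min-hash value per hash function."""
    return [min(h(x) for x in s) for h in hashes]

hashes = [make_hash() for _ in range(200)]
sigA, sigB = signature(A, hashes), signature(B, hashes)
est = sum(a == b for a, b in zip(sigA, sigB)) / len(hashes)
# est concentrates around the true Jaccard similarity, 0.6
```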
7 Distances
We have mainly been focusing on similarities so far, since it is easiest to explain locality sensitive hashing that way, and in particular the Jaccard similarity is easy to define with regard to the k-shingles of text documents. In this lecture we focus on distances.
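A similarity s in [0, 1] is typically turned into a distance as d = 1 - s; a couple of illustrative examples (not from the lecture):

```python
def jaccard_dist(a, b):
    """Jaccard distance: 1 minus the Jaccard similarity of two sets."""
    return 1 - len(a & b) / len(a | b)

def lp_dist(x, y, p):
    """L_p distance between two equal-length vectors."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

d_jac = jaccard_dist({1, 2, 3}, {2, 3, 4})   # 1 - 2/4 = 0.5
d1 = lp_dist([0, 0], [3, 4], 1)              # Manhattan: 7.0
d2 = lp_dist([0, 0], [3, 4], 2)              # Euclidean: 5.0
```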
9 Hierarchical Clustering
This marks the beginning of the clustering section. The basic idea is to take a set X of items and somehow
partition X into subsets, so each subset has similar items. Obviously, it would be great if we could be more specific, but …
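A minimal bottom-up sketch (single-link merging on 1-D points; both choices are assumptions made for brevity):

```python
def agglomerative(points, k):
    """Start with singleton clusters; repeatedly merge the closest
    pair (single-link distance) until only k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)
    return clusters

clusters = agglomerative([1.0, 1.5, 2.0, 9.0, 9.5], k=2)
# the two natural groups: [1.0, 1.5, 2.0] and [9.0, 9.5]
```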
13 Frequent Itemsets
A classic problem in data mining is association rule mining. The basic problem is posed as follows: We
have a large set of m tuples {T1, T2, ..., Tm}, each tuple Tj = {tj,1, tj,2, ..., tj,k} has a small number (not all …
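The core subroutine of association rule mining is support counting; a small sketch with made-up baskets finds all pairs contained in at least two tuples:

```python
from itertools import combinations

# four made-up baskets (tuples) over three items
baskets = [{"milk", "bread"}, {"milk", "bread", "butter"},
           {"bread", "butter"}, {"milk", "butter"}]

def support(itemset, baskets):
    """Number of baskets that contain every item of the itemset."""
    return sum(itemset <= b for b in baskets)

# candidate pairs drawn from the baskets themselves
pairs = {frozenset(c) for b in baskets for c in combinations(sorted(b), 2)}
frequent = {tuple(sorted(p)): support(p, baskets)
            for p in pairs if support(p, baskets) >= 2}
# every candidate pair here occurs in exactly 2 of the 4 baskets
```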
0 Introduction
Data Mining
What is Data Mining?
A definition: Discovery of useful, possibly unexpected, patterns in data.