CSE 255
Data Mining and Predictive Analytics
Introduction
What is CSE 255?
In this course we will build
models that help us to
understand data in order to gain
insights and make predictions
Examples: Recommender Systems
Prediction: what (star-) rating will
Using RP-trees to
analyze weather
Patterns
Yoav Freund
UCSD
Goal of final HW
Model TMAX and TMIN in the continental USA as a
function of location and of time.
Input: TMAX,TMIN for 365 days of a particular
station/year.
Final Output: A model (program +
Stochastic Gradient Descent
For PCA
The gradient
f : R^d → R is a smooth function from R^d to R
The gradient of f at the point x, denoted ∇f(x),
is a vector pointing in the direction of steepest ascent (increase) of f
The gradient ∇f(x) can
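The steepest-ascent property can be checked numerically. The following is a minimal sketch, not part of the original slides: the test function `f`, the helper `numerical_gradient`, and the step sizes are all assumptions chosen for illustration. It estimates the gradient by central differences and confirms that a small step along it increases f.

```python
def numerical_gradient(f, x, h=1e-5):
    """Estimate the gradient of f at x using central differences."""
    grad = []
    for i in range(len(x)):
        up = list(x)
        down = list(x)
        up[i] += h
        down[i] -= h
        grad.append((f(up) - f(down)) / (2 * h))
    return grad


def f(x):
    # Hypothetical smooth function from R^2 to R; its exact gradient is (2*x0, 3).
    return x[0] ** 2 + 3 * x[1]


x = [1.0, 2.0]
g = numerical_gradient(f, x)

# Take a small step in the direction of the gradient.
step = 0.01
x_up = [xi + step * gi for xi, gi in zip(x, g)]

print("gradient estimate:", g)
print("f before:", f(x), "f after:", f(x_up))
```

Stepping against the gradient instead (subtracting `step * gi`) decreases f, which is the move gradient descent makes.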
Streaming & Chunking
Big Data Analytics
The Memory Hierarchy
B=Block size
Total Size:
Moore's law
The only way to improve performance
is by going parallel
Multi-core computers
Data-centers
Data-Center networks
Up to date Characteristics
Cost
L1-L2
On-Chip
Review for Big Data
Analytics
(CSE255, DSE230)
What are the steps involved in handling a Cache
miss? (fill in missing part)
Choose which slot to replace
- write chosen slot to slower memory
Read requested page into slot.
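The three steps above can be sketched with a small simulated cache. This is an illustrative sketch under stated assumptions, not the course's reference answer: the class name `WriteBackCache`, the least-recently-used replacement policy, and the rule of writing back only modified ("dirty") slots are choices made here, and a plain dict stands in for the slower memory.

```python
from collections import OrderedDict


class WriteBackCache:
    """Toy cache: LRU replacement, write-back of dirty slots on eviction."""

    def __init__(self, capacity, backing_store):
        self.capacity = capacity
        self.backing = backing_store   # dict standing in for slower memory
        self.slots = OrderedDict()     # page -> (data, dirty flag)
        self.misses = 0

    def _handle_miss(self, page):
        self.misses += 1
        if len(self.slots) >= self.capacity:
            # Step 1: choose which slot to replace (assumed policy: LRU).
            victim, (data, dirty) = self.slots.popitem(last=False)
            # Step 2: write the chosen slot to slower memory (only if modified).
            if dirty:
                self.backing[victim] = data
        # Step 3: read the requested page into the freed slot.
        self.slots[page] = (self.backing.get(page), False)

    def read(self, page):
        if page not in self.slots:
            self._handle_miss(page)
        self.slots.move_to_end(page)   # mark as most recently used
        return self.slots[page][0]

    def write(self, page, data):
        if page not in self.slots:
            self._handle_miss(page)
        self.slots[page] = (data, True)
        self.slots.move_to_end(page)


backing = {"a": 1, "b": 2, "c": 3}
cache = WriteBackCache(capacity=2, backing_store=backing)
cache.write("a", 10)   # miss; "a" is now dirty in the cache
cache.read("b")        # miss
cache.read("c")        # miss; evicts "a" and writes 10 back to slower memory
print(backing["a"], cache.misses, list(cache.slots))
```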
Which of the following conditions
HDFS and Hadoop
Yoav Freund / Big Data Analytics
Plan of class
HDFS
Bucket sort and the sorting competitions
Map-Reduce
The architecture of Hadoop
The complexity that you don't see (unless you are
the Hadoop manager)
Distributed File Systems (DFS)
Disks ar
Clustering
Clustering in R^p
Two common uses of clustering:
Vector quantization
Find a finite set of representatives that provides good coverage of a
comple
Intrinsic dimension
Yoav Freund
UCSD
Intrinsic dimension
Suppose we have a uniform distribution over some
domain.
We partition it into n cells.
The Diameter of the partition is the maximal
distance between two points belonging to the same
cell.
As n i
Compression
and Entropy
Yoav Freund
UCSD
The fax machine
56KBps
An A4 BW page at 300 DPI = 8.6MB
-> sending a page without compression would
take 8,600/56 = 153 sec
= 2 minutes 33 sec per page.
A typical fax takes 5-10 seconds/page
How?
Run-length en
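The arithmetic above, and the run-length idea the slide starts to name, can be sketched as follows. This is a minimal illustration under assumptions: the page size and line rate are taken in the slide's own units (8,600 and 56), the scan line is a made-up example of a mostly white row, and the function names are hypothetical. It does not reproduce any actual fax coding standard.

```python
from itertools import groupby


def transmission_seconds(page_size, rate):
    """Seconds needed to send a page; both arguments share one size unit."""
    return page_size / rate


def run_length_encode(pixels):
    """Collapse a pixel sequence into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(pixels)]


def run_length_decode(runs):
    """Expand (value, run_length) pairs back into the pixel sequence."""
    pixels = []
    for value, count in runs:
        pixels.extend([value] * count)
    return pixels


# Uncompressed transmission time, using the figures quoted on the slide.
seconds = transmission_seconds(8600, 56)
minutes, rest = divmod(int(seconds), 60)
print(f"{seconds:.1f} s = {minutes} min {rest} s")

# Hypothetical scan line: white (0) with one short black (1) stroke.
line = [0] * 1000 + [1] * 12 + [0] * 1468
runs = run_length_encode(line)
print(len(line), "pixels ->", len(runs), "runs")
```

Long runs of identical pixels collapse to a handful of pairs, which is why mostly white pages compress well.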
Streaming Algorithms
Yoav Freund
This lecture is based on:
Data Streams: Algorithms and Applications by S. Muthukrishnan
Mining Massive Datasets by Jure Leskovec, Anand Rajaraman and
Jeffrey D. Ullman
Streaming Algorithm
Why do we want to process only o
The Hadoop
Eco-System
HBase
A column-based key-value data storage.
Example: find all of the records corresponding to stations in
the continental USA
HBase is a distributed column-oriented database that sits on
top of HDFS.
Good for analytics - scanning ma
Locality Sensitive Hashing
Yoav Freund
This lecture is based on:
Mining Massive Datasets by Jure Leskovec, Anand Rajaraman and
Jeffrey D. Ullman, Sections 3.6-3.8
Formal definition of locality sensitive Hash functions
A family F of hash functions is said t
Big Data Analytics
Introduction
Yoav Freund
UCSD / Computer Science and
Engineering
Coordinates
Class meets Tue, Thu, 3:00-4:50
Demonstrations and homework will be based on
iPython notebooks.
Hadoop and Spark cl
CSE 255 Lecture 2
Data Mining and Predictive Analytics
Supervised learning: Regression
Supervised versus unsupervised learning
Learning approaches attempt to
model data in order to solve a problem
Unsupervised learning approaches find
patterns/relationship
CSE 255 Lecture 3
Data Mining and Predictive Analytics
Supervised learning: Classification
Last week
Last week we started looking at
supervised learning problems
We studied linear regression, in order
to learn linear relationships between
feature
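As an illustration of learning a linear relationship, here is a minimal least-squares sketch for a single feature. It is an assumption-laden example, not the lecture's code: the function name `fit_line` and the toy data are hypothetical, and the closed-form one-feature solution is used for brevity.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data lying exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```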
CSE 255 Lecture 4
Data Mining and Predictive Analytics
Nearest Neighbour & Classifier
Evaluation
Office hours
In addition to my office hours (Tuesday 9:30-11:30),
tutors will run office hours on Friday 12-2pm; check
Piazza for the location
Last lecture
How
Case study 2
Data Mining and Predictive Analytics
Understanding Opinions and Preferences
in Product Networks
Relationships between products
browsed together
bought together
Relationships between products: why?
1. To understan
CSE 255 Lecture 5
Data Mining and Predictive Analytics
Dimensionality Reduction
Course outline
Week 4: I'll cover homework 1, and
get started on Recommender Systems
Week 5: I'll cover homework 2 (at the
end of the week), and do some
midterm prep
Will cov
Map Reduce
Algorithms and Theory
Map Reduce
Mapper: maps each input record to one or
more key-value pairs
Reducer: maps a set of key-value pairs, all with
identical keys, to a single key-value pair.
Shuffle/Sort: moving all pairs
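The three roles can be sketched in a single process with a word count. This is a toy illustration under assumptions, not Hadoop's API: the function names and the word-count task are hypothetical, and the shuffle step here simply groups values by key in memory.

```python
from collections import defaultdict


def mapper(record):
    """Map one input record to zero or more (key, value) pairs."""
    for word in record.split():
        yield (word.lower(), 1)


def shuffle(pairs):
    """Group all values that share a key, sorted by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reducer(key, values):
    """Reduce all values for one key to a single (key, value) pair."""
    return (key, sum(values))


def map_reduce(records):
    pairs = [pair for record in records for pair in mapper(record)]
    return [reducer(key, values) for key, values in shuffle(pairs)]


counts = map_reduce(["big data", "Big clusters", "data data"])
print(counts)
```

In a real cluster the mapper and reducer calls run on different machines, and the shuffle moves pairs across the network so that each reducer sees every value for its keys.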
Stochastic Gradient Descent
for Perceptron, LMS, Back-prop
Big Data Analytics
The per-example gradient of PCA
In the case of finding the first eigen-value of the covariance matrix:
$f(v) = v^T \left( \frac{1}{n} \sum_{i=1}^{n} x_i x_i^T \right) v = \frac{1}{n} \sum_{i=1}^{n} v^T x_i x_i^T v = \frac{1}{n} \sum_{i=1}^{n} (v^T x_i)^2$
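A per-example update for this problem can be sketched as below. The sketch rests on assumptions not stated in the excerpt: the data is mean-zero, the objective is the average of (v . x)^2 over the examples with v constrained to unit length, so the per-example gradient is 2 (v . x) x, and v is renormalized after every step. The function names, step size, and synthetic data are hypothetical.

```python
import math
import random


def normalize(v):
    """Scale v to unit length."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def per_example_gradient(v, x):
    """Gradient of (v . x)^2 with respect to v, which is 2 (v . x) x."""
    dot = sum(vi * xi for vi, xi in zip(v, x))
    return [2.0 * dot * xi for xi in x]


def sgd_top_direction(data, steps=2000, eta=0.01, seed=0):
    """Stochastic gradient ascent on the unit sphere, one example per step."""
    rng = random.Random(seed)
    v = normalize([1.0] * len(data[0]))
    for _ in range(steps):
        x = rng.choice(data)
        g = per_example_gradient(v, x)
        v = normalize([vi + eta * gi for vi, gi in zip(v, g)])
    return v


# Synthetic mean-zero data with most of its variance along the first axis.
data_rng = random.Random(1)
data = [(3.0 * data_rng.gauss(0, 1), 0.5 * data_rng.gauss(0, 1))
        for _ in range(500)]

v = sgd_top_direction(data)
print("estimated top direction:", v)
```

With a constant step size the estimate keeps fluctuating around the top direction; a decaying step size is the usual way to make it settle.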
PCA and SVD
Sanjoy Dasgupta
Dimensionality reduction
Why reduce the number of features in a data set?
1. It reduces storage and computation time.
2. High-dimensional data often has a lot of redundancy.
3. It removes noisy or irrelevant features.
Dimensionality r