CSE 255
Data Mining and Predictive Analytics
Introduction
What is CSE 255?
In this course we will build
models that help us to
understand data in order to gain
insights and make predictions
Examples R
Using RP-trees to
analyze weather
Patterns
Yoav Freund
UCSD
Goal of final HW
Model TMAX and TMIN in the continental USA as a
function of location and of time.
Input: TMAX, TMIN for 365 days of a part
Stochastic Gradient Descent
For PCA
The gradient
f : R^d → R is a smooth function from R^d to R.
The gradient of f at the point x, denoted ∇f(x),
is a vector pointing in the direction of ste
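A minimal sketch of the definition above, not taken from the slides: the gradient is the vector of partial derivatives of f, so it can be checked numerically with central differences. The function `f` (a sum of squares), the point `x`, and the step size `h` are illustrative assumptions.

```python
def f(x):
    """Example smooth function f: R^d -> R (chosen for illustration): sum of squares."""
    return sum(xi * xi for xi in x)


def grad_f(x):
    """Analytic gradient of the example f: the vector of partial derivatives, 2*x."""
    return [2.0 * xi for xi in x]


def numerical_gradient(func, x, h=1e-5):
    """Central-difference estimate of each partial derivative of func at x."""
    grad = []
    for i in range(len(x)):
        forward = list(x)
        backward = list(x)
        forward[i] += h
        backward[i] -= h
        grad.append((func(forward) - func(backward)) / (2 * h))
    return grad


x = [1.0, -2.0, 0.5]
analytic = grad_f(x)
numeric = numerical_gradient(f, x)
max_error = max(abs(a - n) for a, n in zip(analytic, numeric))

# A small step along the gradient should increase f.
step = 1e-3
moved = [xi + step * gi for xi, gi in zip(x, analytic)]
increase = f(moved) - f(x)

print("analytic gradient:", analytic)
print("numeric gradient: ", numeric)
print("f increases along the gradient:", increase > 0)
```

Comparing the analytic and numerical gradients in this way is a common sanity check before running gradient descent on a new objective.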
Streaming & Chunking
Big Data Analytics
The Memory Hierarchy
B=Block size
Total Size:
Moore's law
The only way to improve performance
is by going parallel
Multi-core computers
Data-centers
Data-Center
Review for Big Data
Analytics
(CSE255, DSE230)
What are the steps involved in handling a cache
miss? (fill in missing part)
Choose which slot to replace
- write chosen slot to slower memory
Read requ
HDFS and Hadoop
Yoav Freund / Big Data Analytics
Plan of class
HDFS
Bucket sort and the sorting competitions
Map-Reduce
The architecture of Hadoop
The complexity that you don't see (unless you are
the
Clustering
Clustering in R^p
Two common uses of clustering:
Vector quantization
Find a finite set of
Intrinsic dimension
Yoav Freund
UCSD
Intrinsic dimension
Suppose we have a uniform distribution over some
domain.
We partition it into n cells.
The diameter of the partition is the maximal
distance
Compression
and Entropy
Yoav Freund
UCSD
The fax machine
56 Kbps
An A4 BW page at 300 DPI = 8.6 Mb
-> sending a page without compression would
take 8,600/56 = 153 sec
= 2 minutes 33 sec per page.
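The arithmetic above can be reproduced in a few lines. This is a sketch under stated assumptions: the page size is read as 8,600 kilobits and the modem rate as 56 kilobits per second, and the helper names are hypothetical. The second helper estimates the raw size of a 1-bit-per-pixel A4 page (210 mm x 297 mm) as a rough cross-check of the page-size figure.

```python
# Figures from the slide; the units (kilobits, kilobits per second) are an assumption.
PAGE_KILOBITS = 8_600
MODEM_KBPS = 56


def transmission_seconds(size_kilobits, rate_kbps):
    """Whole seconds needed to send size_kilobits at rate_kbps."""
    return int(size_kilobits / rate_kbps)


def a4_bits_at_dpi(dpi, bits_per_pixel=1):
    """Approximate raw size in bits of an A4 page (210 mm x 297 mm) at the given DPI."""
    width_px = round(210 / 25.4 * dpi)
    height_px = round(297 / 25.4 * dpi)
    return width_px * height_px * bits_per_pixel


seconds = transmission_seconds(PAGE_KILOBITS, MODEM_KBPS)
minutes, rest = divmod(seconds, 60)
print(f"{seconds} sec = {minutes} min {rest} sec per page")
print(f"A4 at 300 DPI, 1 bit/pixel: about {a4_bits_at_dpi(300) / 1e6:.1f} megabits")
```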
A
Streaming Algorithms
Yoav Freund
This lecture is based on:
Data Streams: Algorithms and Applications by S. Muthukrishnan
Mining Massive Datasets by Jure Leskovec, Anand Rajaraman and
Jeffrey D. Ullm
The Hadoop
Eco-System
HBase
A column-based key-value data storage.
Example: find all of the records corresponding to stations in
the continental USA
HBase is a distributed column-oriented database tha
Locality Sensitive Hashing
Yoav Freund
This lecture is based on:
Mining Massive Datasets by Jure Leskovec, Anand Rajaraman and
Jeffrey D. Ullman, Sections 3.6-3.8
Formal definition of locality sensitiv
Big Data Analytics
Introduction
Yoav Freund
UCSD / Computer Science and
Engineering
Coordinates
Class meets Tue, Thu, 3:00-4:50
Demonstrations and homework will be ba
CSE 255 Lecture 2
Data Mining and Predictive Analytics
Supervised learning: Regression
Supervised versus unsupervised learning
Learning approaches attempt to
model data in order to solve a problem
Unsu
CSE 255 Lecture 3
Data Mining and Predictive Analytics
Supervised learning: Classification
Last week
Last week we started looking at
supervised learning problems
We studied linear regression,
CSE 255 Lecture 4
Data Mining and Predictive Analytics
Nearest Neighbour & Classifier
Evaluation
Office hours
In addition to my office hours (Tuesday 9:30-11:30),
tutors will run office hours on Frida
Case study 2
Data Mining and Predictive Analytics
Understanding Opinions and Preferences
in Product Networks
Relationships between products
browsed together
bought toget
CSE 255 Lecture 5
Data Mining and Predictive Analytics
Dimensionality Reduction
Course outline
Week 4: I'll cover homework 1, and
get started on Recommender Systems
Week 5: I'll cover homework 2 (at t
An example of LSH and
amplification
Hash function and hash table
Hash Table
index content
Records
0 0111011101 . . . 1101000100
1 1000011000 . . . 1101010110
Map Reduce
Algorithms and Theory
Map Reduce
Mapper: maps each input record to one or
more key-value pairs
Reducer: maps key-value pairs to a single key-value
pair, all
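A minimal single-process sketch of the mapper and reducer roles described above, using word counting as the example. The function names and the in-memory grouping step are illustrative assumptions; this is not how Hadoop itself is invoked, only a stand-in for the map, group-by-key, reduce flow.

```python
from collections import defaultdict


def mapper(record):
    """Map one input record (a line of text) to zero or more (key, value) pairs."""
    for word in record.lower().split():
        yield (word, 1)


def reducer(key, values):
    """Reduce all values that share a key to a single (key, value) pair."""
    return (key, sum(values))


def run_map_reduce(records):
    """Single-process stand-in for the framework: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # grouping by key plays the role of the shuffle
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


counts = run_map_reduce(["big data", "big models", "data data"])
print(counts)
```

In a real cluster the mapper calls run in parallel on different blocks of the input, and the grouping step moves data between machines; the sketch keeps only the logical contract between the two functions.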
Stochastic Gradient Descent
for Perceptron, LMS, Back-prop
Big Data Analytics
The per-example gradient of PCA
In the case of finding the first eigen-value of the covariance matrix:
PCA and SVD
Sanjoy Dasgupta
Dimensionality reduction
Why reduce the number of features in a data set?
1. It reduces storage and computation time.
2. High-dimensional data often has a lot of redundancy.
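The two points above can be illustrated with a small sketch. The toy data set is an assumption chosen so that the second feature is exactly twice the first, which makes the redundancy obvious. The top principal direction is found with plain power iteration on the covariance matrix, a simple substitute for a library SVD routine, and the 2-D points are then stored as a single number each.

```python
import math


def mean_center(rows):
    """Subtract the per-feature mean; return the centered rows and the means."""
    d = len(rows[0])
    means = [sum(r[j] for r in rows) / len(rows) for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows], means


def covariance(centered):
    """Covariance matrix of already-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / n for j in range(d)]
            for i in range(d)]


def top_eigenvector(matrix, iterations=200):
    """Power iteration: repeatedly multiply by the matrix and renormalize.

    Assumes the start vector is not orthogonal to the top eigenvector.
    """
    d = len(matrix)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(wi * wi for wi in w))
        v = [wi / norm for wi in w]
    return v


# Toy data (assumed): the second feature is redundant, always 2x the first.
data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
centered, means = mean_center(data)
v = top_eigenvector(covariance(centered))

# Each 2-D point is stored as a single coordinate along v.
codes = [sum(r[j] * v[j] for j in range(2)) for r in centered]
reconstructed = [[means[j] + c * v[j] for j in range(2)] for c in codes]
max_error = max(abs(a - b)
                for row_a, row_b in zip(data, reconstructed)
                for a, b in zip(row_a, row_b))
print("principal direction:", v)
print("max reconstruction error:", max_error)
```

Because the data lie exactly on a line, one coordinate per point reconstructs them up to rounding error; with real data the error would reflect the variance left in the discarded directions.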