Data Science Infrastructure:
Data Frames and Data Frameworks
D.S. Parker
UCLA
May 26, 2017
What is Data Science Infrastructure?
How can we define Infrastructure?
Probably the best answer: look at the industry
1. KDD Cup 2014: Predicting Excitement at DonorsChoose.org
a) Problem Description: DonorsChoose.org is an online charity that makes it
easy to help students in need through school donations. At any time,
thousands of teachers in K-12 schools propose proje
The following papers come from KDD 2014, ICDM 2014, ICDE 2014, CIKM 2014,
and VLDB 2014
1. Graph Classification
(1) Scalable SVM-based Classification in Dynamic Graphs (ICDM14)
(2) Multi-Graph-View Learning for Graph Classification (ICDM14)
2. Graph Summa
Association Rule Mining
CS249
Winter 2015
The UNIVERSITY of CALIFORNIA at LOS ANGELES
Sequential Pattern Mining
Why sequential pattern mining?
GSP algorithm
FreeSpan and PrefixSpan
Border Collapsing
Constraints and extensions
CS249: Big Data Analytics
Association Rule Mining
CS249
Winter 2015
The UNIVERSITY of CALIFORNIA at LOS ANGELES
Partition: Scan Database Only Twice
Partition the database into n partitions
Itemset X is frequent ⇒ X is frequent in at least one partition
Scan 1: partition database and
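The two-scan idea can be sketched as follows. This is a minimal illustration, not the method from the slides: the slide text is cut off after "Scan 1", so the second scan (a global recount of the locally frequent candidates) follows the usual Partition scheme and is an assumption here. The names `local_frequent` and `partition_mine`, the brute-force local miner, and the `max_size` cap are all hypothetical.

```python
from itertools import combinations

def local_frequent(transactions, min_frac, max_size=3):
    """Brute-force the itemsets (up to max_size items) frequent in one partition."""
    min_count = min_frac * len(transactions)
    counts = {}
    for t in transactions:
        for k in range(1, max_size + 1):
            for itemset in combinations(sorted(t), k):
                counts[itemset] = counts.get(itemset, 0) + 1
    return {s for s, c in counts.items() if c >= min_count}

def partition_mine(transactions, n_parts, min_frac, max_size=3):
    # Scan 1: split the database and collect locally frequent itemsets.
    size = -(-len(transactions) // n_parts)  # ceiling division
    parts = [transactions[i:i + size] for i in range(0, len(transactions), size)]
    candidates = set()
    for part in parts:
        candidates |= local_frequent(part, min_frac, max_size)
    # Scan 2 (assumed): count every candidate against the full database.
    min_count = min_frac * len(transactions)
    result = {}
    for cand in candidates:
        count = sum(1 for t in transactions if set(cand) <= t)
        if count >= min_count:
            result[cand] = count
    return result

# Toy database of four transactions, mined with two partitions.
db = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b"}]
frequent = partition_mine(db, n_parts=2, min_frac=0.5)
```

Because a globally frequent itemset must be frequent in at least one partition, the union of local results is a superset of the true answer, and the recount only has to discard false positives.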
Association Rule Mining
CS249
Winter 2015
The UNIVERSITY of CALIFORNIA at LOS ANGELES
Constraints in Data Mining
Knowledge type constraint:
classification, association, etc.
Data constraint using SQL-like queries
find product pairs sold together in stores
Association Rule Mining
CS249
Winter 2015
The UNIVERSITY of CALIFORNIA at LOS ANGELES
Outline
What is association rule mining?
Methods for association rule mining
Extensions of association rule
What Is Association Rule Mining?
Clustering
CS 249
Winter 2015
Wei Wang
The UNIVERSITY of CALIFORNIA at LOS ANGELES
Outline
What is clustering
Partitioning methods
Hierarchical methods
Density-based methods
Grid-based methods
Model-based clustering methods
Outlier analysis
Classification
CS249
Winter 2015
The UNIVERSITY of CALIFORNIA at LOS ANGELES
Classification vs. Prediction
Classification:
predicts categorical class labels (discrete or nominal)
the ordering among values has no meaning
classifies data (constructs a model
Classification 2 (CS249)
2015-01-29
TA: Ruirui Li
Outline
Team, Survey, and Project
Classification
Naive Bayesian Classification
Bayesian Networks
Classification based on Association
Team, Survey, and Project
Formed team, survey, and project selectio
CS 249
Big Data Analytics
Instructor: Wei Wang
Winter 2015
The UNIVERSITY of CALIFORNIA at LOS ANGELES
Big Data are Everywhere
So are the Challenges
Welcome!
Instru
CS249 - Spring 2017 - D.S. Parker 2017
Spam and Logistic Regression
The dataset is downloadable from the Machine Learning datasets at UCI:
https://archive.ics.uci.edu/ml/machine-learning-databases/spambase
In [1]:
SpamBase = read.csv("http://archive.ics.uci
LDA and QDA: a Quick Overview
D.S. Parker
UCLA
April 23, 2017
Refer to the Texts for a complete Presentation
This is just a very quick overview. For a more complete explanation, see
for exa
JSS
Journal of Statistical Software
April 2011, Volume 40, Issue 1.
http://www.jstatsoft.org/
The Split-Apply-Combine Strategy for Data Analysis
Hadley Wickham
Rice University
Abstract
Many data analysis problems involve the application of a split-apply-co
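The strategy named in the title can be illustrated with a small sketch in plain Python. The helper `split_apply_combine` and the toy records are hypothetical and are not taken from the paper: the data are split into groups by a key, a function is applied to each group, and the per-group results are combined into one structure.

```python
from collections import defaultdict
from statistics import mean

def split_apply_combine(rows, key, value, func):
    # Split: group the chosen value column by the key column.
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    # Apply func to each group, then combine the results into one dict.
    return {k: func(v) for k, v in groups.items()}

# Hypothetical records, used only for illustration.
rows = [
    {"species": "cat", "weight": 4.0},
    {"species": "dog", "weight": 10.0},
    {"species": "cat", "weight": 6.0},
]
avg_weight = split_apply_combine(rows, "species", "weight", mean)
```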
Distributions
D.S. Parker
UCLA
April 15, 2017
Benford's Law
Benford's Law: the probability that the leading digit is d is log10((d+1)/d)
[Figure: Benford density log10((x+1)/x); vertical axis: Density]
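The Benford probabilities are easy to tabulate. The sketch below computes log10((d+1)/d) for d = 1..9 and, as an illustrative check that is not part of the slides, tallies the leading digits of the first 1000 powers of 2. The names `benford_probability` and `leading_digit` are hypothetical.

```python
import math
from collections import Counter

def benford_probability(d):
    """Probability that the leading digit is d, for d in 1..9."""
    return math.log10((d + 1) / d)

def leading_digit(n):
    """First decimal digit of a positive integer."""
    return int(str(n)[0])

# The nine probabilities telescope to log10(10) = 1.
probs = {d: benford_probability(d) for d in range(1, 10)}

# Empirical leading-digit frequencies for 2**0 .. 2**999.
counts = Counter(leading_digit(2 ** k) for k in range(1000))
freqs = {d: counts[d] / 1000 for d in range(1, 10)}
```

Comparing `freqs` with `probs` digit by digit shows how closely this sequence follows the Benford density.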
Outline
Mathematical background
PCA
SVD
Some PCA and SVD applications
Case study: LSI
Iyad Batal
Mathematical Background
Variance
If we have one dimension:
English: The average square of the distance from the mean of the
data set to its points
Definitio
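The verbal description of variance above corresponds to the following small sketch. It assumes the population form, which divides by n; the slide's formal definition is cut off, so it may instead divide by n - 1. The function name `variance` is illustrative.

```python
def variance(xs):
    """Average squared distance of the points from their mean (divides by n)."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

v = variance([1.0, 2.0, 3.0, 4.0])  # mean is 2.5
```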
Visualization and Information Graphics
D.S. Parker
UCLA
May 26, 2017
Existing Information Graphics
Google Charts (GViz) (gallery)
Tableau (VizQL) (gallery)
Jupyter (notebook) (gallery)
ggplot (ggplot2)
Matrix Factorization
D.S. Parker
UCLA
May 26, 2017
Triangular Decompositions
There are many established decompositions using triangular matrices:
QR-decomposition A = Q R where Q is orthogonal and R is right
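A QR factorization can be computed by hand for a small matrix. The sketch below uses classical Gram-Schmidt on the columns, assuming a square matrix with linearly independent columns. It is chosen for readability rather than numerical robustness, and the function name `qr_gram_schmidt` is hypothetical. The resulting R is upper (right) triangular.

```python
import math

def qr_gram_schmidt(a):
    """Return (q, r) with a = q r, for a square matrix with independent columns."""
    n = len(a)
    cols = [[a[i][j] for i in range(n)] for j in range(n)]  # columns of a
    q_cols = []
    r = [[0.0] * n for _ in range(n)]
    for j, v in enumerate(cols):
        w = list(v)
        # Remove the components of column j along the earlier orthonormal columns.
        for i, q in enumerate(q_cols):
            r[i][j] = sum(qk * vk for qk, vk in zip(q, v))
            w = [wk - r[i][j] * qk for wk, qk in zip(w, q)]
        # Normalize what is left to obtain the next orthonormal column.
        r[j][j] = math.sqrt(sum(wk * wk for wk in w))
        q_cols.append([wk / r[j][j] for wk in w])
    q = [[q_cols[j][i] for j in range(n)] for i in range(n)]
    return q, r

a = [[3.0, 1.0], [4.0, 2.0]]
q, r = qr_gram_schmidt(a)
```

Multiplying `q` by `r` reproduces `a` up to rounding, and the columns of `q` are orthonormal.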
Graph Mining
D.S. Parker
UCLA
June 2, 2017
The Data Science Industry
Mapping Big Data
http://demo.relato.io/oreilly
http://www.oreilly.com/data/free/files/mapping-big-data.pdf
Applied Predictive Modeling
D.S. Parker
UCLA
May 14, 2017
Applied Predictive Modeling: Knowing what to Do
Which models should I use?
How does one select model parameter values?
Is there a good way to co
Applied Predictive Modeling
An integrated package for supervised learning, using over 50 kinds of models and a variety of different metrics:
Applied Predictive Modeling
M. Kuhn and K. Johnson
Springer-Verlag, 2013.
ISBN: 978-1-4614-6848-6 (Print)
Classification
CS249
Winter 2015
The UNIVERSITY of CALIFORNIA at LOS ANGELES
Classification vs. Prediction
Classification:
predicts categorical class labels (discrete or nominal)
classifies data (constructs a model) based on the
training set and the value
Clustering
CS 249
Winter 2016
Wei Wang
The UNIVERSITY of CALIFORNIA, LOS ANGELES
PAM: A K-medoids Method
PAM: Partitioning Around Medoids
Arbitrarily choose k objects as the initial medoids
Until no change, do
(Re)assign each object to the cluster to whic
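A k-medoids loop in this spirit can be sketched as follows. The slide is cut off after the reassignment step, so the swap step below is an assumption based on the usual PAM description: a medoid is replaced by a non-medoid whenever that lowers the total distance. The sketch searches all swaps exhaustively. The names `total_cost` and `pam` are hypothetical, and the points are assumed to be distinct and hashable.

```python
from itertools import product

def total_cost(points, medoids, dist):
    """Sum of distances from each point to its nearest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)

def pam(points, k, dist):
    medoids = list(points[:k])  # "arbitrary" initial choice: the first k objects
    improved = True
    while improved:  # until no change
        improved = False
        best = total_cost(points, medoids, dist)
        # Try replacing each medoid with each non-medoid; keep any swap that helps.
        for i, candidate in product(range(k), points):
            if candidate in medoids:
                continue
            trial = medoids[:i] + [candidate] + medoids[i + 1:]
            cost = total_cost(points, trial, dist)
            if cost < best:
                medoids, best, improved = trial, cost, True
    # Assign each object to the cluster of its nearest medoid.
    clusters = {m: [] for m in medoids}
    for p in points:
        clusters[min(medoids, key=lambda m: dist(p, m))].append(p)
    return medoids, clusters

points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
medoids, clusters = pam(points, k=2, dist=lambda a, b: abs(a - b))
```

Each accepted swap strictly lowers the total cost, so the loop terminates. Unlike k-means centers, the medoids are always actual data points.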
UID:
NAME:
CS249 Basic Data Science
Spring 2017
© D.S. Parker 2017
SAMPLE Midterm Examination
OPEN BOOK, OPEN NOTES, OPEN COMPUTER
BUT NO COMMUNICATION WITH OTHERS OR USE OF INTERNET
Saturday, May 6, 1:00pm-3:00pm
Problem  1    2    3    4    Total
Points   /25  /25  /2
Overview of PCA (Principal Components Analysis)
D.S. Parker
UCLA
April 30, 2017
Principal Components Analysis in one slide
Input data: n × p matrix X, where p is large (high dimensional)
Goal: find a way
Linear Regression Overview
D.S. Parker
UCLA
April 30, 2017
Linear Regression, in one slide
Input (training data): real n × p matrix X,
Objective (linear model): find optimal coefficients:
Least Squares
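A least-squares fit can be sketched in plain Python by solving the normal equations. Several things here are assumptions, because the slide's own notation did not survive extraction: the response vector `y`, the coefficient name `beta`, and the function name `least_squares` are all illustrative. The columns of X are assumed to be linearly independent.

```python
def least_squares(X, y):
    """Solve the normal equations (X^T X) beta = X^T y by Gaussian elimination."""
    n, p = len(X), len(X[0])
    # Build the augmented matrix [X^T X | X^T y].
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(p)]
         + [sum(X[i][r] * y[i] for i in range(n))] for r in range(p)]
    # Forward elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(col + 1, p):
            f = A[r][col] / A[col][col]
            A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    # Back substitution.
    beta = [0.0] * p
    for r in reversed(range(p)):
        s = sum(A[r][c] * beta[c] for c in range(r + 1, p))
        beta[r] = (A[r][p] - s) / A[r][r]
    return beta

# Toy data lying exactly on y = 1 + 2x; the first column of X is the intercept.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
beta = least_squares(X, y)
```

The normal equations are convenient for a small illustration. For ill-conditioned X, a QR-based or SVD-based solver is usually the safer choice.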
Vectors and Matrices
D.S. Parker
UCLA
April 27, 2017
Vectors
Transpose
Scalar Product
Norms
Rotations
Bases and Coordinate Systems
Vector Spaces
Vector Projection