Cluster Analysis
Bin Li
EXST7142 - Statistical Data Mining
Supervised vs. unsupervised learning
Supervised learning:
Data: input variables (X1, X2, ..., Xp) and at least one response Y on n observations.
Goal: predict Y using X1 , X

Classical Linear Models
and
Generalized Linear Models
Topics Covered in this Session
Classical Linear Models
Model formulae
Generic functions
Example 1: Janka Dataset
Example 2: Iowa wheat example
Example 3: Boston housing example (handout)
Generalized

Market Basket Analysis
and
Mining Association Rules
Bin Li
What is association rule analysis?
Association rule analysis has emerged as a popular tool for mining commercial databases.
Goal: find joint values of the variables X = (X1 , X2 , . . .

Mining association rules in US census data
This data set is from the UCI Machine Learning Repository and is included in the arules package as the
data set AdultUCI. The dataset originates from the U.S. Census Bureau database and contains 48,842
instances with 14

CART examples in R
Bin Li
Regression Tree Example: Body Fat Study
Overweight and obesity are considered to be major health problems because of their strong association with a higher risk of diseases of the metabolic syndrome, including diabetes mellitus

Import and Export Data
Outline
Getting stuff in
Low-level functions: scan
Functions for small to mid-sized datasets: read.table, read.csv
Editing data
Importing binary files
Reading in large datafiles: ODBC connections
Getting stuff out
Low level fun

Bagging examples using R
Bin Li
Credit approval example
This example concerns credit card applications. All attribute names and values have been changed to
meaningless symbols to protect confidentiality of the data. This dataset is interesting because t

Linear regression example in R
Bin Li
The Boston housing dataset is a classic benchmark dataset in data mining. It was originally
used by Harrison and Rubinfeld in 1978. The dataset describes housing values in the suburbs of Boston.
There are 506 obse

Introduction to Statistical Learning
Bin Li
EXST7142 - Statistical Data Mining
What is statistical learning?
Statistical learning is the science of learning from data using statistical methods.
Predict the price of a stock in 6 mon

Principal Component Analysis
Bin Li
EXST7142 - Statistical Data Mining
Outline
Introduction to principal component analysis (PCA) through an illustration of variable redundancy.
Explain the properties of PCA through a numeric example.
US crim

Detecting Insults in Social Commentary
A Case Study in Text Mining
Data description
Source: Kaggle 2012 data mining competition.
Objective: Predict whether a commentary posted during a
public discussion is considered insulting to one of the
p

Manipulating Data in R
Outline
Sorting
Sorting and ordering elements: sort, rank and order
Dates and times
Manipulation of dates and conversion
Summarizing data
Tabulating and separating data: table and split
Merge two datasets: merge
Vectorize

Logistic regression example in R
Bin Li
This example concerns credit card applications. All attribute names and values have been changed
to meaningless symbols to protect confidentiality of the data. This dataset is interesting because there
is a good mix

Nonparametric Regression and
Generalized Additive Models (GAM)
Bin Li
EXST7142 - Statistical Data Mining
Nonparametric regression
In traditional regression analysis, the form of the
regression function has been specified. For example, we migh

Ensemble Methods: Bagging and Random Forest
Bin Li
EXST7142 - Statistical Data Mining
Model selection vs. model combination
Model selection:
Many models, which one to choose?
Goal: good interpretability and/or better predictive performance.

Generalized additive model example in R
This is a study of the relationship between atmospheric ozone concentration, O3, and other meteorological variables in the Los Angeles Basin in 1976. To simplify matters, let's focus on only three predictors:
temperat

Model Assessment and Selection
Bin Li
EXST7142 - Statistical Data Mining
Loss functions
Typical choices for quantitative response Y:
Squared error: L(Y, f(X)) = (Y - f(X))^2
Absolute error: L(Y, f(X)) = |Y - f(X)|
Typical choices for categor
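The squared and absolute error losses for a quantitative response listed above can be sketched in R as follows. This is only an illustrative sketch: the function names squared_error and absolute_error, and the vectors y and f_hat, are hypothetical and do not come from the course material.

```r
# Squared and absolute error losses for a quantitative response.
# All names and numbers below are made up for illustration.
squared_error  <- function(y, f_hat) (y - f_hat)^2
absolute_error <- function(y, f_hat) abs(y - f_hat)

y     <- c(3.0, 5.0, 2.0)   # hypothetical observed responses Y
f_hat <- c(2.5, 5.0, 4.0)   # hypothetical predictions f(X)

# Per-observation losses
se <- squared_error(y, f_hat)
ae <- absolute_error(y, f_hat)

# Averaging the losses gives the mean squared and mean absolute error
mse <- mean(se)
mae <- mean(ae)
print(c(mse = mse, mae = mae))
```

Because the squared error grows quadratically in the residual, the third observation (residual of size 2) dominates the mean squared error much more than it dominates the mean absolute error.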

Suppose we wish to know if the expression level of a gene in a tumor can predict if patients will have a recurrence of their cancer.
High expression => high probability of recurrence
[Figure labels: Not cancer, Cancer, Threshold; axis: Gene expression]
In this case, gene express

Neural network examples in R
Ozone data
We apply neural networks to the ozone data analyzed earlier, using the nnet package of
Venables and Ripley (2002). We start with just three variables for simplicity and fit a feed-forward
neural n

Introduction to R
EXST7142 - Statistical Data Mining
Outline
What is R?
Installing R and R packages.
R basics.
Data objects in R: vector, matrix, data frame, and list.
Indexing in R.
Graphics in R.
Displaying high dimensional and/or la

A Case Study of Predicting Algae Blooms
EXST7142 - Statistical Data Mining
Problem description
High concentrations of certain harmful algae in rivers constitute a serious
ecological problem with a strong impact not only on river lifeforms, but
al

Classification and Regression Trees (CART)
Bin Li
EXST7142 - Statistical Data Mining
Classification tree in iris example
The famous iris data set gives the measurements in centimeters of
the variables sepal length and width and petal length and wid

Neural Networks
Bin Li
EXST7142 - Statistical Data Mining
Biological inspiration
Idea: make the computer more robust, more intelligent, and able to learn.
Let's model our computer software (and/or hardware) after the brain.
Neurons in the brain