Chapter 4 Dimension Reduction
Exploring the data
Statistical summary of data: common metrics
Average
Median
Minimum
Maximum
Standard deviation
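The metrics listed above can be computed with base R functions. This is a minimal sketch; the vector x holds made-up values and is not taken from any course data set.

```r
# Hypothetical sample values, for illustration only
x <- c(12, 15, 11, 20, 17)

stats <- c(
  average = mean(x),
  median  = median(x),
  minimum = min(x),
  maximum = max(x),
  sd      = sd(x)   # sample standard deviation (divides by n - 1)
)
print(stats)
```

For a whole data frame, summary() reports most of these metrics per column in one call.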
K Nearest Neighbors
Supervised statistical learning method
A simple classification method used to predict the class of a
categorical variable
Assign a new example to the class that is most common among its k
nearest neighbors in the training samples.
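The majority-vote rule above can be sketched in a few lines of base R. The helper name knn_predict, the Euclidean distance measure, and the toy training data are all assumptions made for illustration; the slides do not specify them.

```r
# Minimal k-nearest-neighbors classifier (hypothetical helper, not a library function).
# Assumes numeric predictors and Euclidean distance.
knn_predict <- function(train_x, train_y, new_x, k = 3) {
  # Subtract the new example from every training row, column by column
  diffs <- sweep(as.matrix(train_x), 2, new_x)
  dists <- sqrt(rowSums(diffs^2))
  # Indices of the k closest training examples
  nearest <- order(dists)[seq_len(k)]
  # Majority vote among their class labels (ties go to the first label in sorted order)
  votes <- table(train_y[nearest])
  names(votes)[which.max(votes)]
}

# Made-up training data: two well-separated classes
train_x <- data.frame(x1 = c(1, 1.2, 0.8, 5, 5.3, 4.9),
                      x2 = c(1, 0.9, 1.1, 5, 4.8, 5.2))
train_y <- c("A", "A", "A", "B", "B", "B")

pred_a <- knn_predict(train_x, train_y, c(1.1, 1.0), k = 3)
pred_b <- knn_predict(train_x, train_y, c(5.0, 5.1), k = 3)
```

In practice the predictors are usually rescaled first, since a distance-based vote is sensitive to the units of each variable.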
S
Logistic Regression
Regression model where the dependent (output) variable is
categorical.
If a binary variable is a function of a continuous input variable,
logistic regression may be used to estimate the conditional
distribution u
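As a rough illustration of the idea, the sketch below fits a logistic regression with glm() on simulated data and estimates the probability of the positive class at a few input values. The data, the true coefficients (0.5 and 1.5), and the variable names are assumptions chosen for the example.

```r
set.seed(1)

# Simulated data (hypothetical): binary y depends on a continuous input x
x <- seq(-3, 3, length.out = 200)
p_true <- 1 / (1 + exp(-(0.5 + 1.5 * x)))
y <- rbinom(length(x), size = 1, prob = p_true)

# Logistic regression: models the log-odds of y = 1 as linear in x
fit <- glm(y ~ x, family = binomial)

# Estimated P(y = 1 | x) at three new input values
p_hat <- predict(fit, newdata = data.frame(x = c(-2, 0, 2)), type = "response")
print(p_hat)
```

With type = "response", predict() returns probabilities rather than log-odds.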
Wage data Example
Set your working directory in RStudio to the directory that contains the file "wagedata.csv"
Read data from "wagedata.csv" into a dataframe called wagedf
wagedf = read.csv("wagedata.csv")
Inspect the structure of the data frame wagedf
s
Chapter 17 Smoothing Methods
Smoothing is data driven
Regression methods assume underlying
unchanging structure (linear, exponential,
polynomial)
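One common data-driven smoother is the trailing moving average, used here only as an example of the contrast drawn above; the slides do not name a specific method at this point. The series y and the window width w are made up.

```r
# Hypothetical series, for illustration only
y <- c(10, 12, 11, 15, 14, 18, 17, 21)
w <- 3  # window width (an assumption)

# Trailing moving average: mean of the current value and the previous w - 1 values.
# sides = 1 makes the filter use past values only; the first w - 1 results are NA.
ma <- stats::filter(y, rep(1 / w, w), sides = 1)
print(ma)
```

No trend shape is assumed here: each smoothed value depends only on the most recent w observations.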
Chapter 16 Regression-Based Forecasting
Main ideas
Fit linear trend, time as predictor
Modify & use also for non-linear trends
Exponential
Polyn
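The first two ideas can be sketched with lm(), using a time index as the only predictor. The series below are noise-free and invented so the fitted coefficients are easy to check; the name period is an assumption.

```r
period <- 1:12  # time index used as the predictor

# Linear trend: y = 100 + 5 * period (hypothetical, noise-free)
y_lin <- 100 + 5 * period
fit_lin <- lm(y_lin ~ period)

# Exponential trend: fit a linear model to log(y), then exponentiate forecasts
y_exp <- 50 * exp(0.1 * period)
fit_exp <- lm(log(y_exp) ~ period)

# Forecast the next period from each model
next_period <- data.frame(period = 13)
forecast_lin <- predict(fit_lin, newdata = next_period)
forecast_exp <- exp(predict(fit_exp, newdata = next_period))
```

With real data the fit would not be exact, and the residuals should be checked for leftover structure before trusting the forecast.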
Chapter 15 Handling Time Series
Main ideas
Forecast future values of a time series
Distinction between forecasting (main focus) and
describing/exp
Chapter 14 Cluster Analysis
Clustering: The Main Idea
Goal: Form groups (clusters) of similar records
Used for segmenting markets into groups of
sim
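A small sketch of grouping similar records, using hierarchical clustering from base R. The choice of method (average linkage on standardized Euclidean distances), the number of clusters, and the records themselves are assumptions for illustration.

```r
# Made-up customer records with two obvious groups
records <- data.frame(
  income = c(20, 22, 21, 80, 82, 79),
  age    = c(25, 27, 26, 50, 52, 51)
)

# Standardize so both variables contribute comparably to the distance
d <- dist(scale(records))

# Agglomerative clustering, then cut the tree into two clusters
hc <- hclust(d, method = "average")
groups <- cutree(hc, k = 2)
print(groups)
```

Each entry of groups is the cluster label assigned to the corresponding record.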
Chapter 13 Association Rules
What are Association Rules?
Study of what goes with what
Customers who bought X also bought Y
What symptoms go with
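To make "what goes with what" concrete, the sketch below scores the rule "bread => milk" on a handful of invented transactions. Support, confidence, and lift are the standard measures for such rules, but their use here, along with the helper name has, is an assumption, since the slide text shown does not define them.

```r
# Hypothetical market-basket transactions
transactions <- list(
  c("bread", "milk"),
  c("bread", "butter"),
  c("bread", "milk", "butter"),
  c("milk"),
  c("bread", "milk")
)

# TRUE for each transaction that contains every item in `items`
has <- function(items) {
  sapply(transactions, function(tr) all(items %in% tr))
}

support_x  <- mean(has("bread"))              # share of baskets with bread
support_y  <- mean(has("milk"))               # share of baskets with milk
support_xy <- mean(has(c("bread", "milk")))   # share with both

confidence <- support_xy / support_x   # P(milk | bread)
lift       <- confidence / support_y   # above 1 suggests a positive association
```

Here the lift falls slightly below 1, so in this toy data buying bread does not make milk more likely than its baseline rate.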
Decision Tree Inductive Learning
Decision Trees
Given a set of training examples in the form of a set of attribute
values as inputs and classes as outputs, a tree is constructed such
that:
Each non-terminal node is an attribute.
Each arc from a node co
Artificial Neural Networks
Feedforward Neural Network
[Figure: feedforward neural network with input nodes (X1, X2), a bias input (1), hidden nodes, and an output node giving the predicted output Y]
Weights associated with the connections
are iteratively adjusted during training
to decrease the prediction error.
Training stops
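The iterative weight adjustment can be sketched with a tiny network: two inputs plus a bias, two hidden nodes, and one output node, trained by plain gradient descent on squared error. The training data (logical OR), the sigmoid activation, the learning rate, and the fixed number of epochs used as the stopping rule are all assumptions, not details taken from the slides.

```r
set.seed(42)
sigmoid <- function(z) 1 / (1 + exp(-z))

# Hypothetical training data: logical OR; the first column is the bias input (1)
X <- cbind(1, c(0, 0, 1, 1), c(0, 1, 0, 1))
y <- c(0, 1, 1, 1)

# Small random starting weights
W1 <- matrix(runif(3 * 2, -0.5, 0.5), nrow = 3)  # inputs + bias -> 2 hidden nodes
W2 <- runif(3, -0.5, 0.5)                        # hidden + bias -> output node

forward <- function(X, W1, W2) {
  H <- cbind(1, sigmoid(X %*% W1))               # hidden activations plus bias
  list(H = H, out = as.vector(sigmoid(H %*% W2)))
}
mse <- function(out) mean((out - y)^2)

error_before <- mse(forward(X, W1, W2)$out)

rate <- 0.5              # learning rate (assumption)
for (epoch in 1:2000) {  # fixed epoch count as the stopping rule (assumption)
  f <- forward(X, W1, W2)
  # Error signal at the output node
  delta_out <- (f$out - y) * f$out * (1 - f$out)
  grad_W2 <- as.vector(t(f$H) %*% delta_out)
  # Propagate the error back to the hidden nodes (bias column excluded)
  delta_hid <- (delta_out %o% W2[-1]) * f$H[, -1] * (1 - f$H[, -1])
  grad_W1 <- t(X) %*% delta_hid
  # Move the weights against the gradient to reduce the prediction error
  W2 <- W2 - rate * grad_W2
  W1 <- W1 - rate * grad_W1
}

error_after <- mse(forward(X, W1, W2)$out)
```

Comparing error_before with error_after shows the effect of the weight updates on the training error.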
Chapter 11 Neural Nets
Basic Idea
Combine input information in a complex & flexible
neural net model
Model coefficients are continually tweaked in
Overview
Core Ideas in Data Mining
Classification
Prediction
Association Rules
Data Reduction
Data Exploration
Visualization
Super
9/14/2015
Chapter 3 Data Visualization
Graphs for Data Exploration
Basic Plots
Line Graphs
Bar Charts
Scatterplots
Distribution Plots
Boxplots
Histograms
9/
Chapter 5 Evaluating Classification & Predictive Performance
Why Evaluate?
Multiple methods are available to classify or
predict
For each method,
Chapter 6: Multiple Linear Regression
Topics
Explanatory vs. predictive modeling with regression
Example: prices of Toyota Corollas
Fitting a pred
Chapter 7 K-Nearest-Neighbor
Characteristics
Data-driven, not model-driven
Makes no assumptions about the data
9/28/2015
Basic Idea
For a given re
Bayesian Classifier
Naive Bayes Classifier
Naive Bayes Classifier:
The Naive Bayes classifier is based on Bayes' theorem, named after Thomas Bayes (1702-1761).
Naive Bayes classifiers are a family of simple probabilistic classifiers base
Chapter 9 Classification and Regression Trees
Trees and Rules
Goal: Classify or predict an outcome based on a
set of predictors
The output is a set
Chapter 10 Logistic Regression
Logistic Regression
Extends idea of linear regression to situation
where outcome variable is categorical
Widely use
Introduction and Overview
Junping Sun
Data Mining
Data Mining:
Extracting useful information from large data sets. (Hand et al. 2001)
It is a process of non-trivial extraction of implicit, previously unknown, and
potentially useful infor