Regression
Data Mining Task 5: Regression
Example:
Predict tomorrow's stock price based on
past prices.
Works on continuous variables.
Usually covered in a statistics class, not
a data mining class.
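As a minimal sketch of the stock price example, the snippet below fits a least-squares line that predicts the next price from the current one. The price series and the function name fit_line are hypothetical, and the one-lag linear model is an assumption chosen for illustration, not a recommended forecasting method.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical daily closing prices (not from the slides).
prices = [10.0, 11.0, 12.0, 13.0, 14.0]

# Pair each day's price with the following day's price.
today, tomorrow = prices[:-1], prices[1:]
intercept, slope = fit_line(today, tomorrow)

# Predict the next price from the most recent one.
prediction = intercept + slope * prices[-1]
print(prediction)
```

The target here is a continuous value, which is what separates regression from classification.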
Determine the Best Split:
Information Gain
How to Determine the Best Split
Before Splitting: 10 records of Class 0,
10 records of Class 1
Candidate splits (figure):
Own Car? Yes: C0: 6, C1: 4. No: C0: 4, C1: 6.
Car Type? Family: C0: 1, C1: 3. Sports: C0: 8, C1: 0. Luxury: C0: 1, C1
Student ID? c1
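A minimal sketch of information gain for one candidate split follows. It assumes the first two count pairs in the figure (C0: 6, C1: 4 and C0: 4, C1: 6) are the two branches of the Own Car? split, and it uses entropy as the impurity measure. The function names are hypothetical.

```python
from math import log2

def entropy(counts):
    """Entropy (in bits) of a node given its per-class record counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = sum(parent)
    weighted = sum(sum(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = [10, 10]            # before splitting: 10 records of each class
own_car = [[6, 4], [4, 6]]   # assumed branch counts for Own Car?

gain_own_car = information_gain(parent, own_car)
print(round(gain_own_car, 3))
```

The gain is small because both branches stay close to an even class mix, so this split removes little impurity.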
Visualization With R
Visualization in R
Base graphics overview
Important high-level plotting functions
Basic plot: generic x/y plotting
Common chart types: bar plot, box plot,
hist, pie, qqnorm, qqplot
Image-like plot types: image, heat map,
contour
Visuali
Determine the Best Split
Impurity Measure
Measure of Impurity: Gini
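Gini index for a given node t:
$\mathrm{Gini}(t) = 1 - \sum_{j=1}^{|\mathrm{Classes}|} [p(j \mid t)]^2$
where p(j | t) is the relative frequency of class j at node t
and nc is the number of classes.
Maximum (1 - 1/nc) when records are equally
distributed among all classes, implying the least
interesting information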
Minimum (0.0) when all records
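A minimal sketch of the Gini index, assuming the standard definition (one minus the sum of squared class frequencies at the node). The function name gini and the example counts are hypothetical.

```python
def gini(counts):
    """Gini index of a node given its per-class record counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

even_two = gini([5, 5])        # two classes, equally distributed
even_three = gini([4, 4, 4])   # three classes, equally distributed
pure = gini([10, 0])           # all records in one class
mixed = gini([6, 4])

print(even_two, even_three, pure, mixed)
```

With equally distributed classes the value reaches 1 - 1/nc (0.5 for two classes, about 0.667 for three), and it drops to 0 for a pure node.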
Bagging Method
Bagging
Create k samples by sampling with
replacement:
D1, D2, ..., Dk
Train a classifier on each sample.
Take the majority vote of all classifiers.
Works best with unstable classifiers.
Robust to overfitting when applied to
noisy data.
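The steps above can be sketched as follows. This is an illustration under stated assumptions: the base learner is a hypothetical one-feature threshold classifier (a decision stump), and the toy data set, the function names, and the choice of k = 25 are all made up for the example.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) records with replacement."""
    return [rng.choice(data) for _ in data]

def train_stump(sample):
    """Fit a threshold classifier on (x, label) pairs by minimizing errors."""
    best = None
    for thr in sorted({x for x, _ in sample}):
        for left, right in ((0, 1), (1, 0)):
            errors = sum((left if x <= thr else right) != y for x, y in sample)
            if best is None or errors < best[0]:
                best = (errors, thr, left, right)
    _, thr, left, right = best
    return lambda x: left if x <= thr else right

def bagging_fit(data, k, seed=0):
    """Train one classifier on each of k bootstrap samples D1, ..., Dk."""
    rng = random.Random(seed)
    return [train_stump(bootstrap_sample(data, rng)) for _ in range(k)]

def bagging_predict(models, x):
    """Return the majority vote of all classifiers."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: (feature value, class label).
data = [(1.0, 0), (2.0, 0), (3.0, 0), (4.0, 1), (5.0, 1), (6.0, 1)]
models = bagging_fit(data, k=25)
print(bagging_predict(models, 1.0), bagging_predict(models, 6.0))
```

Individual stumps trained on unlucky samples can be wrong, but the vote over many samples averages those errors out, which is why the method suits unstable classifiers.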
Bagging
Sampl
Introduction With
Mammogram Example
The Value of Mammogram Screen Test
Cornell professor Steven Strogatz's blog article in The New
York Times, "Chances Are":
http://opinionator.blogs.nytimes.com/2010/04/25/chances-are/
The Value of Mammogram Screen Test
Following
Graph-Based Clustering:
Chameleon Algorithm
Graph-Based Clustering
Graph-based clustering uses the proximity graph.
Consider each point as a node in a graph.
Each edge between two nodes has a weight that is the
proximity between the two points.
Start with
Introduction to Frequent
Pattern Mining
Association Rule Mining
Given a set of transactions, find rules that will predict the
occurrence of an item based on the occurrences of other
items in the transaction.
Market Basket Transactions
Table columns: TID, Items (TIDs 1 to 5)
Issues with Decision
Tree Algorithms
Decision Tree Issues
Data fragmentation
Search strategy
Expressiveness
Overfitting
Data Fragmentation
The number of instances gets smaller as you
traverse down the tree.
The number of instances at the leaf nodes
could be too s
Test of Significance
for Accuracy
Test of Significance
Given two models:
Model M1: accuracy = 85%, tested on 30 instances
Model M2: accuracy = 75%, tested on 5,000
instances
Can we say M1 is better than M2?
How much confidence can we place on accuracy of
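One way to approach the question is to put a confidence interval around each accuracy. The sketch below uses the Wilson score interval at roughly 95% confidence (z = 1.96). Both the choice of interval and the function name are assumptions for illustration, not necessarily the method used in the slides.

```python
from math import sqrt

def wilson_interval(accuracy, n, z=1.96):
    """Wilson score interval for an accuracy measured on n test instances."""
    denom = 1 + z * z / n
    center = (accuracy + z * z / (2 * n)) / denom
    half = z * sqrt(accuracy * (1 - accuracy) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

m1 = wilson_interval(0.85, 30)      # Model M1: 85% on 30 instances
m2 = wilson_interval(0.75, 5000)    # Model M2: 75% on 5,000 instances

print([round(v, 3) for v in m1])
print([round(v, 3) for v in m2])
```

Under these assumptions the interval for M1 is much wider than the one for M2 and the two overlap, so the 30-instance result alone does not establish that M1 is better.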
Model Overfitting and
Occam's Razor Principle
Model Evaluation
Metrics for performance evaluation
How to evaluate the performance of a model?
Methods for performance evaluation
How to obtain reliable estimates?
Methods for model comparison
How to compare
Clustering
Data Mining Task 2: Clustering
A real-world example:
Customer segmentation:
Goal: To find the subgroups among a large customer
base.
Clustering approach:
Collect some attributes about the customers, like their age,
income, favorite brands, and
ANN Architecture
Artificial Neural Networks (ANN)
Model is an assembly of
interconnected nodes and
weighted links.
Figure: input nodes X1, X2; black box
The output node sums up
its input values
according to the weights
of its links.
Compare output node
against s
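A minimal sketch of a single output node follows. It assumes the comparison is against a threshold value and that the node outputs 1 when the weighted sum exceeds it. The weights, threshold, and function name are hypothetical.

```python
def output_node(inputs, weights, threshold):
    """Sum the inputs according to the link weights, then compare to a threshold."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > threshold else 0

# Hypothetical weights and threshold (not from the slides).
weights = [0.3, 0.3]
threshold = 0.4

both_on = output_node([1, 1], weights, threshold)   # weighted sum 0.6
one_on = output_node([1, 0], weights, threshold)    # weighted sum 0.3
print(both_on, one_on)
```

With these values the node fires only when both inputs are active, so the weights and threshold together encode the decision rule.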
Bayesian Classifier for
Multiple Attributes
What If the Diagnosis Is Determined by More
Than the Mammogram Result?
Attribute 1: mammogram test result: positive or negative
Attribute 2: family history: yes or no
Attribute 3: race: Caucasian or not
Need to
Application of Bayes Theorem
for Mammogram Example
Back to the Mammogram Example
$P(Y \mid X) = \frac{P(X \mid Y)\, P(Y)}{P(X)}$
Target attribute Y: cancer vs. no cancer
Explanatory attribute X: a patient's mammogram test result
(positive vs. negative)
Among patients
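A minimal sketch of Bayes' theorem applied to a screening test follows. The three input probabilities are illustrative assumptions, not values taken from the slides, and the variable names are hypothetical.

```python
# Illustrative assumptions (not from the slides):
p_cancer = 0.008              # P(Y = cancer), prior
p_pos_given_cancer = 0.90     # P(X = positive | Y = cancer)
p_pos_given_healthy = 0.07    # P(X = positive | Y = no cancer)

# Total probability of a positive result, P(X = positive).
p_pos = (p_pos_given_cancer * p_cancer
         + p_pos_given_healthy * (1 - p_cancer))

# Bayes' theorem: P(Y | X) = P(X | Y) P(Y) / P(X).
p_cancer_given_pos = p_pos_given_cancer * p_cancer / p_pos
print(round(p_cancer_given_pos, 3))
```

Under these assumed numbers the posterior stays below 10%, because the small prior means most positive results come from patients without cancer.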
Cluster Validity Concepts
Cluster Validity
For supervised classification we have a variety of
measures to evaluate how good our model is.
Accuracy, precision, recall
For cluster analysis, the analogous question is how
to evaluate the goodness of the resul
Receiver Operating
Characteristic (ROC Method)
for Model Comparison
Model Evaluation
Metrics for performance evaluation
How to evaluate the performance of a model?
Methods for performance evaluation
How to obtain reliable estimates?
Methods for model comp
Data Visualization Introduction
Data Visualization: Link Data and Questions
Benefits of visualization
Know data better.
Visually discover data patterns.
Prepare for more targeted data exploration.
Choice of the right chart and graph
Desired features
Interactiv
Smooth Technique for Bayesian
Classifier Method
How to Estimate Probabilities of Continuous
Attributes?
Two approaches for continuous attributes:
Discretization
[0, 60k), [60k, 100k), [100k, ∞)
Probability density estimation
Assume attribute follows a nor
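Both approaches can be sketched briefly. The bin edges come from the intervals listed above. The density estimate assumes the attribute follows a normal distribution whose mean and standard deviation are estimated from sample values. The income samples and function names are hypothetical.

```python
from bisect import bisect_right
from statistics import NormalDist

BIN_EDGES = [60_000, 100_000]
BIN_LABELS = ["[0, 60k)", "[60k, 100k)", "[100k, inf)"]

def discretize(income):
    """Map a continuous income to one of the three intervals."""
    return BIN_LABELS[bisect_right(BIN_EDGES, income)]

def gaussian_density(value, samples):
    """Density of value under a normal distribution fitted to samples."""
    return NormalDist.from_samples(samples).pdf(value)

# Hypothetical incomes observed for one class (not from the slides).
incomes = [50_000, 70_000, 90_000, 110_000]

print(discretize(75_000))
print(gaussian_density(80_000, incomes) > gaussian_density(200_000, incomes))
```

Discretization yields a category whose probability can be counted like any other discrete attribute, while the density approach scores a value by how close it is to the class's typical range.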
Outlier Resistance Analysis
and Multivariate Techniques
Outlier-Resistant Analysis
Outlier-sensitive measures:
Mean, variance, standard deviation, range
Outlier-resistant measures:
Median
IQR
MAD
Trade-off: more computing cost to sort
values.
Rule of thum
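The contrast between the two groups of measures can be sketched with a small example. It assumes MAD means the median absolute deviation from the median and computes the IQR from the quartiles returned by Python's statistics.quantiles. The data values and helper names are hypothetical.

```python
from statistics import mean, median, quantiles

def mad(values):
    """Median absolute deviation from the median."""
    m = median(values)
    return median([abs(v - m) for v in values])

def iqr(values):
    """Interquartile range: third quartile minus first quartile."""
    q1, _, q3 = quantiles(values, n=4)
    return q3 - q1

clean = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
with_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]   # last value replaced

for name, fn in (("mean", mean), ("median", median), ("IQR", iqr), ("MAD", mad)):
    print(name, fn(clean), fn(with_outlier))
```

Replacing a single value with 100 shifts the mean substantially, while the median, IQR, and MAD are unchanged. All three resistant measures rely on sorted order, which is the extra computing cost noted above.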
HAC (Hierarchical Agglomerative
Clustering) Algorithm
Hierarchical Clustering
Produces a set of nested clusters organized
as a hierarchical tree
Can be visualized as a dendrogram
A tree-like diagram that records the sequence
of merges or splits
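A minimal sketch of agglomerative clustering that records the sequence of merges, which is the information a dendrogram displays. It assumes single-link distance (the closest pair of points between two clusters) on one-dimensional data. The points and the function name are hypothetical.

```python
def single_link_hac(points):
    """Repeatedly merge the two closest clusters, recording each merge."""
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single link: distance between the closest members.
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((sorted(clusters[i]), sorted(clusters[j]), d))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges

# Hypothetical one-dimensional points (not from the slides).
merges = single_link_hac([1.0, 2.0, 6.0, 7.5, 20.0])
for left, right, distance in merges:
    print(left, "+", right, "at distance", distance)
```

Each recorded merge corresponds to one junction in the dendrogram, and the merge distance gives the height at which the junction is drawn.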
Dataset Types
Dataset Types
Record data: data in the
tabular format
Each row is a data example.
Each column is an attribute.
Fixed set of attributes.
The kind of dataset we are
going to analyze in this class.
Dataset Types
Tid | Refund | Marital Status | Taxabl
Predictive vs. Descriptive
Data Mining
Predictive vs. Descriptive Analysis
Predictive analysis:
Use some variables to predict unknown or future
values of other variables: classification,
regression.
Descriptive analysis:
Derive patterns (correlations, tre
Apriori Algorithm for Frequent
Pattern Generation: Theory
Association Rule Mining Task
Brute-force approach
List all possible association rules.
Compute the support and confidence for each rule.
Prune rules that fail the min_sup and min_conf
threshold
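The brute-force approach can be sketched as follows. The transactions are hypothetical, the function names are made up for the example, and min_sup and min_conf are passed as fractions.

```python
from itertools import combinations

# Hypothetical market basket transactions (not from the slides).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def brute_force_rules(min_sup, min_conf):
    """List every rule lhs -> rhs, score it, and prune by both thresholds."""
    items = sorted(set().union(*transactions))
    kept = []
    for size in range(2, len(items) + 1):
        for combo in combinations(items, size):
            itemset = frozenset(combo)
            sup = support(itemset)
            for k in range(1, size):
                for lhs_items in combinations(combo, k):
                    lhs = frozenset(lhs_items)
                    lhs_sup = support(lhs)
                    conf = sup / lhs_sup if lhs_sup > 0 else 0.0
                    if sup >= min_sup and conf >= min_conf:
                        kept.append((set(lhs), set(itemset - lhs), sup, conf))
    return kept

rules = brute_force_rules(min_sup=0.5, min_conf=1.0)
for lhs, rhs, sup, conf in rules:
    print(lhs, "->", rhs, "support", sup, "confidence", conf)
```

Even with four distinct items the loops score dozens of candidate rules, and the count grows exponentially with the number of items, which is what makes the brute-force approach impractical.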
Data Sampling
Sampling
Why sampling?
Sample when obtaining or analyzing the
entire set of data of interest is too expensive
or time-consuming.
And the sample is representative, meaning it
has approximately the same property (of
interest) as the origina
Data Issues
Data Quality
What kinds of data quality problems?
How can we detect problems with the
data?
What can we do about these problems?
Examples of data quality problems:
Noise and outliers
Missing values
Duplicate data
Noise
Noise refers to modifica
Attribute Discretization
How Many Values Can an Attribute Have?
Discrete:
Finite: Limited number of possible values.
E.g., the attribute "a person's home country" can have as
many as ~300 possible values.
Countably infinite: The number of values is
infinite but
Data Mining Overview
What Is Data Mining?
Many definitions
Automatically discovering useful information
in large data repositories
Nontrivial extraction of implicit, previously
unknown, and potentially useful information
from data (Gregory Piatetsky-Sha
Traditional Model Performance
Metrics: Accuracy and Its Limit
Problem With Accuracy Measure
We need to learn some fundamental concepts first:
Confusion matrix for two classes (can be
extended to multiple classes)
PREDICTED CLASS
ACTUAL CLASS
a: TP (true p
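A minimal sketch of accuracy computed from a two-class confusion matrix, showing one way the measure can mislead. The counts describe a hypothetical imbalanced test set in which the model predicts the negative class for every record. The function names are made up for the example.

```python
def accuracy(tp, fn, fp, tn):
    """Fraction of all records that were classified correctly."""
    return (tp + tn) / (tp + fn + fp + tn)

def recall(tp, fn):
    """Fraction of actual positives that were predicted positive."""
    return tp / (tp + fn)

# Hypothetical test set: 10 positives, 990 negatives,
# and a model that always predicts negative.
acc = accuracy(tp=0, fn=10, fp=0, tn=990)
rec = recall(tp=0, fn=10)
print(acc, rec)
```

The model scores 99% accuracy without detecting a single positive record, so accuracy alone says little when the classes are heavily imbalanced.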
Data Transformation
How to Prepare Data for Analysis?
Understand the meaning of data.
Assess the quality of data.
Transform data for analysis.
Data Transformation
Aggregation
Attribute transformation
Sampling
Dimensionality reduction and
feature selection