Data Mining
Classification: Basic Concepts,
Decision Trees, and Model Evaluation
Lecture Notes for Chapter 4
Introduction to Data Mining
by
Tan, Steinbach, Kumar
Tan,Steinbach, Kumar
Introduction to Data Mining
4/18/2004
1
Classification: Definition
Give
CSE 572 Data Mining
Probability
Random Variable
A random variable is a quantity that depends on
the outcome of a random experiment
Can be discrete or continuous
Examples:
Discrete random var
CSE 572 Data Mining
Classification
Decision Tree Induction Algorithm
- Node Impurity
Measures of Node Impurity
Gini Index
Entropy
Misclassification error
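The three measures listed above can be sketched as follows. This is a minimal illustration that assumes the standard textbook definitions computed from the class counts at a node: Gini = 1 - sum(p_i^2), entropy = -sum(p_i * log2(p_i)), and misclassification error = 1 - max(p_i). The function names and the sample counts are illustrative and do not come from the slides.

```python
import math

def class_probabilities(counts):
    """Convert per-class record counts at a node into fractions."""
    total = sum(counts)
    return [c / total for c in counts]

def gini(counts):
    """Gini index: 1 minus the sum of squared class fractions."""
    return 1.0 - sum(p * p for p in class_probabilities(counts))

def entropy(counts):
    """Entropy in bits; empty classes contribute nothing."""
    return sum(-p * math.log2(p) for p in class_probabilities(counts) if p > 0)

def misclassification_error(counts):
    """Error made by always predicting the majority class."""
    return 1.0 - max(class_probabilities(counts))

# Illustrative node with 5 records of each of two classes.
for measure in (gini, entropy, misclassification_error):
    print(measure.__name__, measure([5, 5]))
```

All three measures are zero for a pure node and reach their maximum when the classes are evenly mixed.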
CSE 572 Data Mining
Classification
Basics
Classification: Definition
Given a collection of records (training set)
Each record contains a set of attributes; one of the
attributes is the class.
CSE 572 Data Mining
Rule Based Classifiers Direct Methods
Direct Method: Sequential Covering
1. Start from an empty rule
2. Repeat
   a. Grow a rule using the Learn-One-Rule function
Start with a
CSE 572 Data Mining
Model Overfitting
Underfitting and Overfitting
(Example)
Two class problem: +, o
3000 data points (30% for training, 70% for testing)
Data set for + class is generated from
CSE 572 Data Mining
Rule Based Classifiers Indirect Methods
Indirect Methods
Indirect Method: C4.5rules
Use class-based ordering
R
CSE 572 Data Mining
Introduction to Data Mining
What is Data Mining?
Data mining is
The derivation of Information from Data
The extraction of useful patterns from data sources,
e.g., databas
CSE 572 Data Mining
Data Basics of Data Mining Data
Data for Data Mining
Data mining typically deals with data that have
already been collected for some purpose other
than data mining.
Data m
CSE 572 Data Mining
Basic Cluster Analysis
(Chapter 8 Section 1)
What is Cluster Analysis?
Finding groups of objects such that the objects in a group
will be similar (or related) to one another and diffe
CSE 572: Data Mining
Density-based Clustering
Read Section 8.4
Density-based Clustering
Locates regions of high density that are
separated from one another by regions of low
density
High density regions
Low density background
6 density-based clusters
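One way to make "high density" concrete is to count how many points fall within a fixed radius of each point. The sketch below assumes a DBSCAN-style criterion, which this excerpt does not name: a point lies in a high-density region when at least `min_pts` points (itself included) are within distance `eps`. The function name, both parameters, and the sample points are illustrative assumptions.

```python
import math

def core_points(points, eps, min_pts):
    """Return the points that have at least min_pts points within eps.

    The point itself counts as one of its own neighbors.
    """
    core = []
    for p in points:
        neighbors = sum(1 for q in points if math.dist(p, q) <= eps)
        if neighbors >= min_pts:
            core.append(p)
    return core

# Illustrative data: three nearby points and one isolated point.
sample = [(0, 0), (0, 1), (1, 0), (10, 10)]
print(core_points(sample, eps=1.5, min_pts=3))
```

The isolated point at (10, 10) has only itself as a neighbor, so it is treated as part of the low-density background.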
CSE 572 Data Mining
Basic Cluster Analysis
Using K-Means
(Chapter 8 Section 2)
K-means Clustering
Partitional clustering approach
Each cluster is associated with a centroid (center point)
Each point is a
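The centroid-based approach described above can be sketched as follows. This is a minimal illustration of the usual K-means loop (assign each point to its nearest centroid, then recompute each centroid as the mean of its points), assuming squared Euclidean distance and caller-supplied initial centroids. The function name, the fixed iteration count, and the sample data are illustrative choices, not details from the slides.

```python
def kmeans(points, centroids, iterations=10):
    """Run a fixed number of K-means iterations on tuples of numbers.

    `centroids` holds the initial centers. Returns the final centroids and
    the clusters from the last assignment step.
    """
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return centroids, clusters

# Illustrative data: two well-separated groups of points.
data = [(0, 0), (0, 1), (10, 10), (10, 11)]
final_centroids, final_clusters = kmeans(data, centroids=[(0, 0), (10, 10)])
print(final_centroids)
```

Passing the initial centroids explicitly keeps the example deterministic; in practice they are often chosen at random, and different starting points can lead to different clusterings.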
CSE 572: Data Mining
Association Analysis
Association Rule Mining
Given a set of transactions, find rules that will predict the
occurrence of an item based on the occurrences of other
items in the transaction
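A rule of this kind is usually scored by its support and confidence. Those two measures are not shown in this excerpt, so the sketch below assumes their standard definitions: support is the fraction of transactions containing an itemset, and the confidence of X -> Y is support(X and Y together) divided by support(X). The transactions are hypothetical and are not the ones from the slide.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated chance of seeing `consequent` given that `antecedent` occurs."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)

# Hypothetical market-basket transactions, for illustration only.
baskets = [
    {"tea", "sugar"},
    {"tea", "sugar", "lemon"},
    {"tea"},
    {"lemon"},
]
print(support(baskets, {"tea", "sugar"}))
print(confidence(baskets, {"tea"}, {"sugar"}))
```

Here the rule {tea} -> {sugar} holds in two of the three transactions that contain tea.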
Market-Basket transactions
TID
Items
1
Bread, Mil
CSE 572 Data Mining
Support Vector Machines (SVM)
Support Vector Machines
In machine learning, support vector machines
(SVMs, also support vector networks) are
supervised learning models with a
CSE 572 Data Mining
Data Preprocessing
Review of Linear Algebra
Vectors
A vector is a quantity that has magnitude and
direction
D
Magnitude,
Direction,
Properties of Vector
Addition and subtraction
Properties of Vector
Transpose: ,
Dot product:
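The formulas for these vector operations did not survive in this excerpt, so the sketch below assumes the standard definitions: element-wise addition, the dot product as the sum of element-wise products, magnitude as the square root of a vector's dot product with itself, and direction as the vector divided by its magnitude. The function names are illustrative.

```python
import math

def add(u, v):
    """Element-wise sum of two vectors of equal length."""
    return [a + b for a, b in zip(u, v)]

def dot(u, v):
    """Dot product: sum of element-wise products."""
    return sum(a * b for a, b in zip(u, v))

def magnitude(v):
    """Euclidean length of a vector."""
    return math.sqrt(dot(v, v))

def direction(v):
    """Unit vector pointing the same way as v (v must be non-zero)."""
    m = magnitude(v)
    return [a / m for a in v]

print(add([1, 2], [3, 4]))
print(dot([1, 2], [3, 4]))
print(magnitude([3, 4]))
print(direction([3, 4]))
```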
CSE 572 Data Mining
Data Data Quality
Data Quality
What kinds of data quality problems?
How can we detect problems with the data?
What can we do about these problems?
Examples
of data quali
Classification: Alternative Techniques
Lecture Notes for Chapter 5
Introduction to Data Mining
by
Tan, Steinbach, Kumar
Instance-Based Classifiers
[Figure: Set of Stored Cases, with attribute columns Atr1 ... AtrN]
Data Mining: Data
Lecture Notes for Chapter 2
Introduction to Data Mining
by
Tan, Steinbach, Kumar
What is Data?
Collection of data objects and
their attributes
An attribute is a property or
ch
Some Basic Concepts
The transpose of a vector or matrix is obtained by interchanging the
corresponding rows and columns. The transpose of A is denoted by A^T.
If a matrix A is m by n (has m rows and n columns), then its transpose will have n
rows and m colu
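The row and column interchange described above can be sketched as follows, assuming a matrix stored as a list of equal-length rows. The function name and the sample matrix are illustrative.

```python
def transpose(matrix):
    """Swap rows and columns: element (i, j) moves to position (j, i)."""
    return [list(column) for column in zip(*matrix)]

# A 2 by 3 matrix becomes a 3 by 2 matrix.
a = [[1, 2, 3],
     [4, 5, 6]]
print(transpose(a))
```

Transposing twice returns the original matrix.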
CSE 471/598
Introduction to Artificial Intelligence
Fall 2014
(Acknowledgement: Some slides are borrowed from
Rao Kambhampati's and Huan Liu's slides.)
Introduction
Time and Place: M W 4:30-5:45 PM; BYAC-110
Me: Chitta Baral, BYENG 572,
[email protected]
ht
Data Mining: Introduction
Lecture Notes for Chapter 1
Introduction to Data Mining
by
Tan, Steinbach, Kumar
Why Mine Data? Commercial Viewpoint
Lots of data is being collected and warehoused
We
Data Mining: Exploring Data
Lecture Notes for Chapter 3
Introduction to Data Mining
by
Tan, Steinbach, Kumar
Tan,Steinbach, Kumar
Introduction to Data Mining
8/05/2005
1
Iris Sample Data Set
Many of the exploratory data techniques are illustrated with th
Midterm Review
Date: March 16, 2017 (Thursday)
Venue: CAVC 359
Time: 12:00pm-1:15pm
Syllabus: Chapter 1 through the chapter on Support Vector Machines
You are allowed to bring one A4 size cheat sheet
Calculators are allowed. But you are not allowed to access the inte
Clingo installation guide
Download
Download Clingo at http://sourceforge.net/projects/potassco/files/clingo/
Versions 3.x and 4.x are not fully compatible. More at
http://sourceforge.net/projects/potassco/files/clingo/4.2.0
Precompiled versions for
Introduction to AI Class 2
8/27/2014
Natural Language to KR
Learning
Learning cont.
Probabilistic reasoning
From http://opinionator.blogs.nytimes.com/2010/04/25/chances-are/
The probability that low-risk women (40 to 50 years old,
with no symptoms or fa
CSE 572 Data Mining
Rule Based Classifiers
Rule-Based Classifier
Classify records by using a collection of "if...then" rules
Rule: (Condition) -> y
where
Condition is a conjunction of attributes
y i
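A rule of the form (Condition) -> y can be sketched as a pair of attribute tests and a class label. The example below is a minimal illustration that assumes equality tests only and an ordered rule list in which the first matching rule decides the class. The function names, attribute names, rules, and record are all hypothetical.

```python
def matches(conditions, record):
    """True when every attribute test in the conjunction holds for the record."""
    return all(record.get(attr) == value for attr, value in conditions.items())

def classify(rules, record, default=None):
    """Return the label of the first rule whose condition covers the record."""
    for conditions, label in rules:
        if matches(conditions, record):
            return label
    return default

# Hypothetical rules and record, for illustration only.
rules = [
    ({"gives_birth": "no", "can_fly": "yes"}, "bird"),
    ({"gives_birth": "yes", "blood_type": "warm"}, "mammal"),
]
record = {"gives_birth": "yes", "can_fly": "no", "blood_type": "warm"}
print(classify(rules, record))
```

A record that no rule covers falls through to the `default` label, which is one common way to handle uncovered records.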
CSE 572 Data Mining
Classification
Decision Trees
Example of a Decision Tree
[Column annotations: categorical, categorical, continuous, class]
Tid | Refund | Marital Status | Taxable Income | Cheat
1
Yes
CSE 572 Data Mining
Data Data Preprocessing Part 1
Data Preprocessing
Aggregation
Sampling
Dimensionality Reduction
Feature subset selection
Feature creation
Discretization and Binarizatio