8/21/2014
Differences and Ratios
Difference
Ratios
When the deviation of values from a standard is
more meaningful than the absolute value, the
difference may be more useful.
Example
Standard: 100
Values: 90, 100, 110
Difference: -10,
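The example above can be worked through in a few lines of Python. This is a minimal sketch, assuming the difference is computed as value minus standard (consistent with the -10 shown for the value 90) and the ratio as value divided by standard; the function name `differences_and_ratios` is hypothetical.

```python
def differences_and_ratios(values, standard):
    """Return (differences, ratios) of each value relative to the standard.

    Assumes difference = value - standard and ratio = value / standard.
    """
    if standard == 0:
        raise ValueError("standard must be non-zero to compute ratios")
    differences = [v - standard for v in values]
    ratios = [v / standard for v in values]
    return differences, ratios


# Standard and values taken from the slide example.
differences, ratios = differences_and_ratios([90, 100, 110], 100)
print(differences)  # [-10, 0, 10]
print(ratios)       # [0.9, 1.0, 1.1]
```

The differences keep the original units, while the ratios are unitless, which is one reason to prefer one over the other.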
Data Reduction Strategies
A data store may have terabytes of data
Complex data analysis may take a very long time to run on the
complete data set
Data reduction
Obtain a reduced representation of the data set that is much
smaller in volume but y
Underfitting and Overfitting
Methods for Performance Evaluation
Obtaining a reliable estimate of performance
Performance of a model may depend on other
factors besides the learning algorithm
Class distribution
Cost of misclassification
Size o
CIS 436 Data Mining
Fall 2014
MWF 12:20-1:10
Hartwell 27
Professor: Dr. Anthony Scime
Office: Brown 219
Email: ascime@brockport.edu
Course Description: Studies the data mining process with the goal of discovering nontrivial,
interesting and actionable knowled
The Market-Basket Model
Association
A large set of items, e.g., things sold in a
supermarket.
A large set of baskets, each of which is a small
set of the items, e.g., the things one customer
buys on one day.
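One way to picture the model is to store each basket as a set of items and count how many baskets contain each item. The sketch below is illustrative only: the basket contents are made-up data and the helper name `item_counts` is hypothetical.

```python
from collections import Counter

# Made-up baskets for illustration; each basket is a small set of items.
baskets = [
    {"milk", "bread"},
    {"milk", "diapers", "beer"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
]


def item_counts(baskets):
    """Count, for every item, the number of baskets that contain it."""
    counts = Counter()
    for basket in baskets:
        # A set holds each item at most once, so an item is counted
        # once per basket regardless of quantity purchased.
        counts.update(basket)
    return counts


counts = item_counts(baskets)
print(counts["milk"])    # 3
print(counts["butter"])  # 2
```

Using sets rather than lists reflects the model's view that a basket records which items were bought, not how many of each.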
Application - Real market baskets
A
This is the WEKA Classification Homework with the correct answers.
To get the answers below you must use the Voting Data on ANGEL and the files already divided
into training and testing sets. Use Voting Data Train.arff and Voting Data Test.arff
If you cha
Smoothing
The values of an attribute may range over a
number of distinct values.
However, if the range contains very many
distinct values,
taken as a whole their value may be limited
and
their differences may not be significant.
Smoo
CIS 436 Data Mining Project
Project: A substantial part of this course is completion of a data mining project. The project
should be worked on throughout the course with presentations discussing progress. It culminates
with a paper. Students will be assig
This is an example of a Data Mining Log.
July 14, 2014: received data set (Copy of Attacks on Countries(1)); need to make these changes:
Convert ARI-Raw to bins where 0-3.3 is Low, 3.3-6.6 is Medium, and 6.6-10 is Large
Convert SRI to bins where 0-3.3 is Low, 3.
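The binning step recorded in the log for ARI-Raw could be sketched as follows. The log does not say which bin a boundary value such as 3.3 belongs to, so this sketch assumes each upper bound is inclusive; the function name `to_bin` is hypothetical.

```python
def to_bin(score):
    """Map a score on the 0-10 scale to Low, Medium, or Large.

    Assumption: upper bounds are inclusive, so 3.3 is Low and 6.6 is Medium.
    """
    if not 0 <= score <= 10:
        raise ValueError("score must be between 0 and 10")
    if score <= 3.3:
        return "Low"
    if score <= 6.6:
        return "Medium"
    return "Large"


print([to_bin(s) for s in [2.0, 3.3, 5.0, 9.9]])
# ['Low', 'Low', 'Medium', 'Large']
```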
9/5/2014
Normalization
Decimal Scaling Normalization
Transformation of data to scale data to a
specific range of values
When the distances between data points vary
and where larger distances distort the data,
normalization can mitigate the di
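Decimal scaling, named above, is commonly defined as dividing every value by 10^j, where j is the smallest integer that brings the largest absolute value below 1. The sketch below follows that common definition; the function name `decimal_scaling` and the sample values are illustrative assumptions, not taken from the slides.

```python
def decimal_scaling(values):
    """Scale values by a power of ten so that every |scaled value| < 1.

    Returns the scaled values and the exponent j that was used.
    Raises ValueError on an empty input (from max()).
    """
    max_abs = max(abs(v) for v in values)
    j = 0
    # Find the smallest j with max_abs / 10**j < 1.
    while max_abs / (10 ** j) >= 1:
        j += 1
    return [v / (10 ** j) for v in values], j


# Illustrative values only.
scaled, j = decimal_scaling([-986, 917])
print(j)       # 3
print(scaled)  # [-0.986, 0.917]
```

Because the divisor is a power of ten, the scaled values keep the same digits as the originals, which makes the transformation easy to read and to reverse.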
Attribute Relevance Analysis - Why?
Which dimensions should be included?
How high a level of generalization?
Reduce the number of attributes
Easy to understand patterns
Attribute Relevance
Attribute Relevance Analysis - What?
Attribute Relevance Analysis