Project 0
1: What do you think data mining is for?
Identify a problem from your own experience that you think would be amenable to data
mining. Describe:
What the data is.
What type of benefit you might hope to get from data mining.
What type of data mini

Association Rules
Start date: Sept 30, Due date: Oct. 14th
Your task for this project is to identify and perform an association rule mining task. This
involves
1. Selecting an appropriate data set
2. Preparing and preprocessing the data
3. Finding r

Project 2: Classification
Start date Oct 23, due Nov 6 beginning of class.
The goal of this project is to choose and evaluate classification mechanisms. I would
suggest using the mechanisms available in Weka or RapidMiner, although you may
implement your

Project 3: Clustering
Start date Nov 13
The goal of this project is to choose and evaluate clustering mechanisms. I would suggest
using the mechanisms available in Weka or RapidMiner, although you may implement
your own if you wish.
Use Image Segmentation

Project 3: Clustering
Start date: Nov 18
The goal of this project is to choose and evaluate clustering mechanisms. I would suggest
using the mechanisms available in Weka, although you may implement your own if you
wish.
Use Image Segmentation, housing and

Mining Sequential Patterns
Slides are adapted from Introduction to Data Mining by Tan, Steinbach, Kumar
Tan, Steinbach, Kumar
Introduction to Data Mining
4/18/2004
Sequence Data
Timeline
Sequence Database:
Object:    A  A  A  B  B  B  B  C
Timestamp: 10 20 23 11 17 2

Chapter 2
Getting to Know Your Data
2.1
Exercises
1. Give three additional commonly used statistical measures (i.e., not illustrated in this chapter) for the
characterization of data dispersion, and discuss how they can be computed efficiently in large datab

Fall 2013 CAP4770 Quiz 1
Name: _
September 5th, 2012
Panther ID: _
Note: a) Total Points: 10 pts
b) Closed book and notes
1. Describe the difference between a discrete attribute and a continuous attribute. (3 pts)
2. What are typical functions/tasks of data min

Fall 2014 CAP4770 Quiz 2 September 11, 2014
Name: _
Panther ID: _
Note: a) Total Points: 10 pts
b) Closed book and notes
1. Give two methods for handling missing values. (1 pt)
2. Describe the major tasks in data pre-processing. (2 pts)
3. Suppose a group of 12

Fall 2014 COP4770 Quiz 4 September 23, 2014
Name_ Panther ID_
1. The following is an example student dataset with 2 dimensions and 8 instances:
dimension X is college major, dimension Y is whether the student attends data
mining class or not. Both dimensi

CAP4770 Quiz 8
Name: _
October 16, 2014
Panther ID: _
1. List all the 4-subsequences contained in the data sequence: < {1, 3} {2} {2, 3} {4} > (4 pts)
2. List all the 3-element subsequences contained in the data sequence: < {1, 3} {2} {2, 3}

CAP4770
Practice exam solutions
1. Describe major tasks in data-preprocessing
Ans: data cleaning, data integration, data transformation, data reduction, data
discretization
2. What is the main idea of principal component analysis (PCA)?
Ans: The basic ide

Question I: Short Answers
Consider a binary classification problem with the following set of attributes and attribute
values:
Air Conditioner = {Working, Broken},
Engine = {Good, Bad},
Mileage = {High, Medium, Low},
Rust = {Yes, No}
Suppose a

CAP4770 Quiz 6
Name: _
October 7th, 2014
Panther ID: _
1. Generally what are the two steps for mining association rules? (2pt)
2. What is the candidate pruning principle used in Apriori? (1pt)
3. Given the following dataset:
Transaction-id
10
20
30
40
50

CAP4770 Introduction to Data Mining
Final Project Instruction
Data:
We use a gene data set as our data for the final project. This data set is
in attributes-in-rows format, comma-separated values. It can be
downloaded by following this link:
ht tp:/use

Data Preparation
Initially, the data set contained 7071 columns, one for each gene and one for a serial number for
each instance. The information about each patient was recorded in rows. There were 70 rows: one for
each of the 69 patients and one with names of each g

What Is the Problem of the K-Means Method?
The k-means algorithm is sensitive to outliers, since an object with an extremely large value may
substantially distort the distribution of the data.
K-Medoids: Instead of taking the mean value of the object in a c
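
The outlier sensitivity noted above can be seen in a small example. This is a minimal sketch, not taken from the slides: the data values and the helper name `medoid` are made up for illustration, and the medoid is assumed to be the cluster member with the smallest total absolute distance to the other members.

```python
from statistics import mean

def medoid(points):
    """Return the member with the smallest total absolute distance to all members."""
    return min(points, key=lambda p: sum(abs(p - q) for q in points))

# Hypothetical 1-D cluster; 100.0 plays the role of the extreme value.
cluster = [1.0, 2.0, 3.0, 4.0, 100.0]

center_mean = mean(cluster)      # dragged toward the outlier
center_medoid = medoid(cluster)  # remains an actual, central data point

print(center_mean, center_medoid)
```

The mean lands far from the four typical points, while the medoid stays among them, which is the motivation for using a representative object rather than an average.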

Project 1: Association Rules
Start date Sept 30, Due: Oct. 16th
Your task for this project is to identify and perform an association rule mining task. This
involves
1. Selecting an appropriate data set
2. Preparing and preprocessing the data
3. Find

CAP 4770:
Introduction to Data Mining
Fall 2008
Dr. Tao Li
Florida International University
Outline
Course Logistics
Data Mining Introduction
Four Key Characteristics
Combination of Theory and Application
Engineering Process
Collection of Functionaliti

Concepts and Techniques
Chapter 2
August 29, 2011
Data Mining: Concepts and Techniques
What is about Data?
General data characteristics
Basic data description and exploration
Measuring data similarity
What is Data?
A

What Is Association Mining?
Association rule mining:
Finding frequent patterns, associations, correlations, or causal
structures among sets of items or objects in transaction
databases, relational databases, and other information
repositories.
Frequent pa
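
To make the idea of frequent patterns in a transaction database concrete, here is a minimal sketch. It assumes the standard support and confidence measures, which this excerpt does not define, and uses a made-up basket dataset; the function names are hypothetical.

```python
# Hypothetical transaction database: each transaction is a set of items.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions)

def support(itemset, transactions):
    """Fraction of transactions that contain itemset."""
    return count(itemset, transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Of the transactions containing antecedent, the fraction that also contain consequent."""
    both = set(antecedent) | set(consequent)
    return count(both, transactions) / count(antecedent, transactions)

s = support({"diapers", "beer"}, transactions)
c = confidence({"diapers"}, {"beer"}, transactions)
print(s, c)
```

An itemset is usually called frequent when its support meets a user-chosen threshold; the threshold itself is a modelling choice, not something fixed by the definition.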

What Is Cluster Analysis?
Finding groups of objects such that the objects in a group
will be similar (or related) to one another and different
from (or unrelated to) the objects in other groups
Intra-cluster distances are minimized
Inter-cluster distances are

Hierarchical Clustering: Time and Space Requirements
O(N^2) space, since it uses the proximity matrix.
N is the number of points.
O(N^3) time in many cases.
There are N steps, and at each step the proximity matrix, of size N^2,
must be updated and searched.
Compl
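
The space and time costs above can be read off a naive implementation. The following is a minimal sketch under stated assumptions: it uses single-link (minimum) distance and 1-D points, neither of which is specified in the slide, and the function name `single_link` is hypothetical.

```python
def single_link(points, k):
    """Naive agglomerative clustering of 1-D points down to k clusters."""
    n = len(points)
    # Proximity matrix: N x N entries, hence O(N^2) space.
    dist = [[abs(points[i] - points[j]) for j in range(n)] for i in range(n)]
    clusters = {i: [i] for i in range(n)}  # cluster id -> member indices

    while len(clusters) > k:
        ids = sorted(clusters)
        # Search the matrix for the closest pair of active clusters:
        # O(N^2) work per merge, and up to N - 1 merges, so O(N^3) overall.
        a, b = min(
            ((i, j) for x, i in enumerate(ids) for j in ids[x + 1:]),
            key=lambda pair: dist[pair[0]][pair[1]],
        )
        # Update row/column a with the single-link distance to every other cluster.
        for c in ids:
            if c not in (a, b):
                dist[a][c] = dist[c][a] = min(dist[a][c], dist[b][c])
        # Merge cluster b into cluster a.
        clusters[a].extend(clusters.pop(b))

    return [sorted(points[i] for i in members) for members in clusters.values()]

result = single_link([1.0, 2.0, 9.0, 10.0, 25.0], k=2)
print(result)
```

The full matrix is kept for the whole run, which is where the quadratic space comes from; the repeated full search is what makes the naive version cubic in time.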

Cluster Validity
For supervised classification we have a variety of
measures to evaluate how good our model is:
accuracy, precision, recall.
For cluster analysis, the analogous question is how to
evaluate the goodness of the resulting clusters.
But cl