COMP7103 Assignment 1
Deadline: Oct 15, 2014 11:55pm
Question 1
Classify the following attributes discussed in lecture notes Chapter 2 Page 11 as binary, nominal,
ordinal or numerical:
Answers (a to d): binary, nominal, numerical, binary
a) Refund
b) Marital Status
c) Taxa
Introduction
Overview
Why data mining?
Data Mining and Knowledge Discovery
Data mining tasks
Classification
Association analysis
Cluster analysis
The KDD process
Why Mine Data? Commercial
applications
Lots of data is being collected
and wareho
CSIS7103 Assignment 2
Due date: Nov 26, 2014 23:55
No late submission is allowed. Submit a softcopy in WORD or PDF format through the
Moodle system.
Question 1
You are given a data set, weather2013.csv (available on Moodle), which contains some
weather
Data
Data
What are the types of data?
How do we measure data quality?
How do we preprocess data for analysis?
How do we measure similarity between data objects?
Typical Structured Data
Collection of data objects and their attributes
An attribute
Classification (Part 1)
Overview
Basic decision tree classifier (DT) construction
Some technical issues of DT classification
Evaluating classifiers
Classification
What common characteristics are shared by the
red people, and not by the purple people?
OLAP
Relational Model
A traditional relational database presents its user with a
one-dimensional table of facts.
In the relational model, information is captured by
relations (or tables). Each relation is described by a
relation schema.
A schema is a set of
Cluster Analysis
Cluster Analysis
What is clustering?
Applications
Types of clusters
Partitioning methods
Hierarchical methods
What is Cluster Analysis?
Finding groups of objects such that the objects in a
group are similar (or related) to each anothe
Association Analysis (Part 1)
Overview
Association rule definition
The Apriori Algorithm
Efficiency issues
Compact representations
Market Basket Analysis
The principal application of association rules is market
basket analysis
It models supermarket data a
Classification (Part 2)
Overview
Nearest-neighbor classifiers
Bayesian classifiers
Support vector machines
Ensemble methods
Nearest Neighbor Classifiers
Basic idea:
Given an unlabeled record Y, find the records in the
training set that are most
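The basic idea above can be sketched as a k-nearest-neighbor majority vote. This is only a minimal sketch: the slide text is cut off before it names a similarity measure, so Euclidean distance is an assumption here, and the function name `knn_predict`, the parameter `k`, and the toy records are hypothetical.

```python
import math
from collections import Counter


def knn_predict(training, y, k=3):
    """Predict a label for the unlabeled record y.

    training: list of (features, label) pairs (hypothetical layout).
    y: tuple of numeric features with the same length as each training record.
    k: number of neighbors that vote (assumed parameter).
    """
    # Rank training records by Euclidean distance to y (assumed measure).
    nearest = sorted(training, key=lambda rec: math.dist(rec[0], y))[:k]
    # Majority vote among the labels of the k closest records.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy data, invented for illustration only.
training = [
    ((1.0, 1.0), "red"),
    ((1.2, 0.8), "red"),
    ((5.0, 5.0), "purple"),
    ((5.5, 4.5), "purple"),
]

print(knn_predict(training, (1.1, 0.9), k=3))
```

With an odd `k` and two classes the vote cannot tie; for more classes a tie-breaking rule would have to be chosen.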
Association Analysis (Part 2)
Overview
Interestingness
Mining quantitative association rules
Mining sequential patterns
Pattern Evaluation
Association rule algorithms tend to produce too many
rules, many of which are uninteresting or redundant
Redundant
About the course
Instructor:
Prof. Ben Kao (kao@cs.hku.hk, CYC326)
Tutors:
Kevin Lam (yklam2@cs.hku.hk, CYC319)
Jolly Cheng (mycheng@cs.hku.hk, CYC319)
Forum, emails, course announcements: Moodle
Basic Information
Lectures
Friday, 7-10 pm, CPD-2.