CS57300: Homework 4
Due date: Sunday Apr 12, midnight (submit via turnin)
More Exploration of Naive Bayes Classifiers using Word Clusters
In this programming assignment you will run further experiments
Data Mining
CS57300
Purdue University
November 16, 2010
Descriptive modeling: evaluation
Cluster validity
For prediction tasks there are a variety of external evaluation metrics
Accuracy, squared lo
CS57300: Homework 2
Due date: Sunday September 23, midnight (submit pdf to Blackboard)
Submit both your answers to the questions and the R code that you used for analysis. Your
homework must be typed.
CS57300: Homework 1
Due date: Sunday February 1, midnight (submit pdf to Blackboard)
Submit both your answers to the questions and the R code that you used for analysis. Your homework
must be typed an
Name:
CS573 Midterm: Fall 2010
This is a closed-book, closed-notes exam. Non-programmable calculators are allowed for probability
calculations.
There are 11 pages including the cover page. The total n
Data Mining
CS57300
Purdue University
February 5, 2015
Decision making
Are A and B the same color?
The trick uses the biases in the human visual system
Heuristics and biases
Tversky & Kahneman, ps
Data Mining
CS57300
Purdue University
January 22, 2015
What is data?
Collection of entities and their attributes
Entity: collection of
attributes
Aka: record, point,
case, sample,
object, or i
Data Mining
CS57300
Purdue University
January 13, 2015
Introduction
What is data mining?
Why now?
Data mining process
Course overview
Data mining
The process of identifying valid, novel, potent
Data Mining
CS57300
Purdue University
February 19, 2015
NBC learning
Model space
Parametric model with specific form
(i.e., based on Bayes rule and assumption of conditional independence),
Models
Data Mining
CS57300
Purdue University
February 12, 2015
Predictive modeling: introduction
Data mining components
Task specification: Prediction
Data representation: Homogeneous IID data
Knowledge r
Data Mining
CS57300
Purdue University
February 3, 2015
Dimensionality reduction
Identify and describe the dimensions that underlie the data
May be more fundamental than those directly measured but
Data Mining
CS57300
Purdue University
February 10, 2015
Types of hypotheses
Descriptive: propositions that describe a
characteristic of an object
Relational: propositions that describe
the relat
Data Mining
CS57300
Purdue University
January 27, 2015
Covariance and correlation
Covariance
Measures how variables Xj and Xk vary together
$COV(X_j, X_k) = \frac{1}{n} \sum_{i=1}^{n} (x_{ij} - \bar{X}_j)(x_{ik} - \bar{X}_k)$
Positive if l
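As a small illustration of the covariance definition, the sketch below computes it for two toy variables. It assumes the 1/n (population) normalization suggested by the slide; the function name `covariance` and the sample values are hypothetical and not taken from the course material.

```python
def covariance(xs, ys):
    """Covariance with 1/n normalization: mean of the products of deviations."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum the products of each pair's deviations from its own mean.
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


# Hypothetical toy data: the second variable rises with the first.
x_j = [1.0, 2.0, 3.0, 4.0]
x_k = [2.0, 4.0, 6.0, 8.0]

cov_same = covariance(x_j, x_k)            # variables move together
cov_opposite = covariance(x_j, x_k[::-1])  # variables move in opposite directions
```

Here `cov_same` is positive because the two variables increase together, and `cov_opposite` is negative because one decreases as the other increases.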
Data Mining
CS57300
Purdue University
January 20, 2015
Probability and statistics basics
Modeling uncertainty
Necessary component of almost all data analysis
Approaches to modeling uncertainty:
F
CS57300: Homework 3
Due date: Friday October 12, midnight (submit via turnin)
Bag of Words Naive Bayes
In this programming assignment you will implement a naive Bayes classification (NBC) algorithm
and
CS57300: Homework 1
Due date: Friday September 7, midnight (submit pdf to Blackboard)
1
Probability (4 pts)
A deck of playing cards contains 52 cards, divided into 4 suits (♠, ♥, ♦, ♣), with each suit
con
Data Mining
CS57300
Purdue University
February 24, 2015
Tree learning
Top-down recursive divide and conquer algorithm
Start with all examples at root
Select best attribute/feature
Partition exa
Data Mining
CS57300
Purdue University
February 26, 2015
Other predictive models
Nearest neighbor
Instance-based method
Learning
Stores training data and delays processing until a new instance mus
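The instance-based idea described here (store the training data, defer all work until a query arrives) can be sketched as follows. This is a minimal illustration, not the course's implementation: the class name `NearestNeighborClassifier`, the Euclidean distance, and the majority vote over `k` neighbors are all assumptions made for the example.

```python
import math
from collections import Counter


class NearestNeighborClassifier:
    """Hypothetical k-nearest-neighbor sketch using Euclidean distance."""

    def __init__(self, k=1):
        self.k = k
        self.examples = []

    def fit(self, points, labels):
        # "Learning" only stores the training data; no model is built.
        self.examples = list(zip(points, labels))
        return self

    def predict(self, query):
        if not self.examples:
            raise ValueError("fit must be called before predict")
        # All processing happens at query time: rank stored points by distance.
        ranked = sorted(self.examples, key=lambda ex: math.dist(ex[0], query))
        votes = Counter(label for _, label in ranked[: self.k])
        return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated groups.
train_points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
train_labels = ["a", "a", "b", "b"]

model = NearestNeighborClassifier(k=3).fit(train_points, train_labels)
pred_near_origin = model.predict((0.2, 0.1))
pred_far = model.predict((5.5, 5.0))
```

Because `fit` does nothing beyond storing the examples, the cost of this approach is paid at prediction time, when every stored point is compared with the query.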
CS57300: Homework 3
Due date: Monday March 9, midnight (submit via turnin)
Bag of Words Naive Bayes
In this programming assignment you will implement a naive Bayes classification (NBC) algorithm
and use
CS57300: Homework 2
Due date: Wednesday February 18, midnight (submit pdf to Blackboard)
Submit both your answers to the questions and the code that you used for analysis. Your
homework must be typed.
CS573 HW4
Haoran Lin
1 day extension
1
(a)
[Figure: standard kmeans. Score (axis ticks 0 to 150000) vs. cluster size (10, 20, 50, 100, 200); series: wp, wnp]
The best k value is 200
(b)
there are some noticeable topics like fren
Gaussian mixture models (revisited)
Assume that the data are generated
from a mixture of K multi-dimensional
Gaussians, where each component
has parameters: $N_k(\mu_k, \Sigma_k)$
$f(x) = \sum_{k=1}^{K} w_k f_k$
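To make the mixture density concrete, the sketch below evaluates a weighted sum of Gaussian component densities. It is simplified to one dimension (the slide describes multi-dimensional Gaussians), and the function names `gaussian_pdf` and `mixture_pdf` and the parameter values are hypothetical.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a one-dimensional Gaussian at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def mixture_pdf(x, weights, means, stds):
    """Mixture density: the weighted sum of the component densities at x."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("mixture weights must sum to 1")
    return sum(w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, stds))


# Hypothetical two-component mixture, symmetric around zero.
density_at_zero = mixture_pdf(0.0, [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
```

The weights act as the prior probability of each component, so they are required to sum to 1 for the result to remain a valid density.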
Data Mining
CS57300
Purdue University
Association rules
April 21, 2015
Association rules
Rule evaluation
Data
Support (aka frequency)
Basket: customer transaction; items: products
s() = fr() / N
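Support as a frequency over N baskets can be illustrated with a short sketch. The function name `support` and the basket contents are hypothetical; the sketch assumes an itemset counts as occurring in a basket when every one of its items is present.

```python
def support(itemset, baskets):
    """Fraction of baskets that contain every item in the itemset."""
    if not baskets:
        raise ValueError("at least one basket is required")
    items = frozenset(itemset)
    # Count the baskets in which the whole itemset appears.
    frequency = sum(1 for basket in baskets if items <= set(basket))
    return frequency / len(baskets)


# Hypothetical customer transactions.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]

s_bread_milk = support({"bread", "milk"}, baskets)
s_diapers = support({"diapers"}, baskets)
```

In this toy data, the pair bread and milk appears in 2 of the 4 baskets and diapers in 3 of the 4.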
Data Mining
CS57300
Purdue University
Anomaly detection
(source: Introduction to Data Mining by Tan, Steinbach and Kumar)
April 23, 2015
Task
Examples
Anomalies/outliers: data points that are conside
Data Mining
CS57300
Purdue University
Descriptive modeling
March 31, 2015
Data mining components
Descriptive models
Task specification: Description
Descriptive models summarize the data
Data represen
Data Mining
Modeling pathologies
CS57300
Purdue University
March 23, 2015
Source: David Jensen, University of Massachusetts, CS383
Pathologies of induction algorithms
Overfitting
Overfitting
Accuracy
Ad