CS57300: Homework 3
Due date: Monday March 9, midnight (submit via turnin)
Bag of Words Naive Bayes
In this programming assignment you will implement a naive Bayes classification (NBC) algorithm
and use it on a sample of the reviews from the Yelp data set to
Data Mining
CS57300
Purdue University
February 24, 2015
Tree learning
Top-down recursive divide and conquer algorithm
Start with all examples at root
Select best attribute/feature
Partition examples by selected attribute
Recurse and repeat
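The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the course's reference implementation: it assumes categorical attributes stored in dicts, uses information gain as the "best attribute" criterion (the slide does not name one), and the function names `entropy`, `best_attribute`, `build_tree`, and `predict` are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def best_attribute(rows, labels, attributes):
    """Pick the attribute with the highest information gain (an assumed criterion)."""
    def gain(attr):
        remainder = 0.0
        for value in set(r[attr] for r in rows):
            subset = [y for r, y in zip(rows, labels) if r[attr] == value]
            remainder += len(subset) / len(labels) * entropy(subset)
        return entropy(labels) - remainder
    return max(attributes, key=gain)


def build_tree(rows, labels, attributes):
    """Top-down recursive divide and conquer; all examples start at the root."""
    # Stop when the node is pure or no attributes remain; return the majority label.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    attr = best_attribute(rows, labels, attributes)      # select best attribute
    remaining = [a for a in attributes if a != attr]
    children = {}
    for value in set(r[attr] for r in rows):             # partition examples
        part = [(r, y) for r, y in zip(rows, labels) if r[attr] == value]
        sub_rows = [r for r, _ in part]
        sub_labels = [y for _, y in part]
        children[value] = build_tree(sub_rows, sub_labels, remaining)  # recurse
    return (attr, children)


def predict(tree, row):
    """Follow branches until a leaf label is reached (unseen values raise KeyError)."""
    while isinstance(tree, tuple):
        attr, children = tree
        tree = children[row[attr]]
    return tree
```

Internal nodes are `(attribute, children)` tuples and leaves are plain labels, which keeps the sketch short; a fuller version would also handle attribute values not seen during training.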
Other
Data Mining
CS57300
Purdue University
February 19, 2015
NBC learning
Model space
Parametric model with specific form
(i.e., based on Bayes rule and assumption of conditional independence),
Models vary based on parameter estimates in CPDs
Search alg
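As a rough illustration of the model space described here, the sketch below estimates the class prior and one conditional probability distribution (CPD) per word, then classifies by Bayes rule under conditional independence. It assumes binary word-presence features and Laplace smoothing with a hypothetical `alpha` parameter; neither choice is stated in the slide, and `train_nbc` and `predict_nbc` are illustrative names.

```python
from collections import Counter, defaultdict
from math import log


def train_nbc(docs, labels, alpha=1.0):
    """Estimate P(class) and P(word present | class) from tokenized documents."""
    vocab = {w for d in docs for w in d}
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for d, y in zip(docs, labels):
        for w in set(d):                      # count presence, not frequency
            word_counts[y][w] += 1
    priors = {y: c / len(labels) for y, c in class_counts.items()}
    # Laplace-smoothed CPDs (assumed); two outcomes per word: present or absent.
    cpds = {
        y: {w: (word_counts[y][w] + alpha) / (class_counts[y] + 2 * alpha)
            for w in vocab}
        for y in class_counts
    }
    return priors, cpds, vocab


def predict_nbc(model, doc):
    """Return the class with the highest log posterior (up to a constant)."""
    priors, cpds, vocab = model
    present = set(doc) & vocab                # words outside the vocabulary are ignored
    scores = {}
    for y in priors:
        score = log(priors[y])
        for w in vocab:
            p = cpds[y][w]
            score += log(p) if w in present else log(1 - p)
        scores[y] = score
    return max(scores, key=scores.get)
```

Working in log space avoids numerical underflow when the vocabulary is large, and smoothing keeps every estimated probability strictly between 0 and 1.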
Data Mining
CS57300
Purdue University
January 13, 2015
Introduction
What is data mining?
Why now?
Data mining process
Course overview
Data mining
The process of identifying valid, novel, potentially useful, and
ultimately understandable patterns in
Data Mining
CS57300
Purdue University
January 22, 2015
What is data?
Collection of entities and their attributes
Entity: collection of attributes
Aka: record, point, case, sample, object, or instance
Entities
Attribute: property or characteristic
Data Mining
CS57300
Purdue University
February 5, 2015
Decision making
Are A and B the same color?
The trick uses the biases in the human visual system
Heuristics and biases
Tversky & Kahneman, psychologists, propose that people often do not follow
ru
Data Mining
CS57300
Purdue University
February 12, 2015
Predictive modeling: introduction
Data mining components
Task specification: Prediction
Data representation: Homogeneous IID data
Knowledge representation
Learning technique
Prediction technique
Data Mining
CS57300
Purdue University
February 3, 2015
Dimensionality reduction
Identify and describe the dimensions that underlie the data
May be more fundamental than those directly measured but hidden to the user
Reduce dimensionality of modeling
Data Mining
CS57300
Purdue University
February 10, 2015
Types of hypotheses
Descriptive: propositions that describe a characteristic of an object
Relational: propositions that describe the relationship between 2+ variables
Descriptive
Hypothesis
Non
Data Mining
CS57300
Purdue University
January 27, 2015
Covariance and correlation
Covariance
Measures how variables Xj and Xk vary together
$COV(X_j, X_k) = \frac{1}{n} \sum_{i=1}^{n} (x_{ij} - \bar{X}_j)(x_{ik} - \bar{X}_k)$
Positive if large values of Xj are associated with large values of X
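A small worked example may help make the definition concrete. The sketch below computes covariance as the average product of deviations from each variable's mean, dividing by n as the slide's formula does (many libraries divide by n - 1 instead). The function name `covariance` is illustrative.

```python
def covariance(xs, ys):
    """Population covariance: mean of the products of deviations from the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


# Values that rise together give a positive covariance;
# values that move in opposite directions give a negative one.
rising = covariance([1, 2, 3], [2, 4, 6])
falling = covariance([1, 2, 3], [6, 4, 2])
```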
Data Mining
CS57300
Purdue University
January 20, 2015
Probability and statistics basics
Modeling uncertainty
Necessary component of almost all data analysis
Approaches to modeling uncertainty:
Fuzzy logic: form of many-valued logic that reasons with
CS57300: Homework 3
Due date: Friday October 12, midnight (submit via turnin)
Bag of Words Naive Bayes
In this programming assignment you will implement a naive Bayes classification (NBC) algorithm
and use it on the 152,327 review objects in the Yelp academ
CS57300: Homework 2
Due date: Sunday September 23, midnight (submit pdf to Blackboard)
Submit both your answers to the questions and the R code that you used for analysis. Your
homework must be typed. Use of Latex is recommended, but not required.
In this
CS57300: Homework 1
Due date: Friday September 7, midnight (submit pdf to Blackboard)
1
Probability (4 pts)
A deck of playing cards contains 52 cards, divided into 4 suits (♠, ♥, ♦, ♣), with each suit
containing 13 ranks (2, 3, 4, 5, 6, 7, 8, 9, 10, Jack, Que
Name:
CS590D / STAT 598M: Midterm
1
Data mining components (8 pts)
Read the excerpts from the paper Neural Data Mining for Credit Card Fraud Detection by R.
Brause, T. Langsdorf, and M. Hepp, ICTAI, 1999 on the last page.
1. Describe the data mining task.
Data Mining
CS57300
Purdue University
February 26, 2015
Other predictive models
Nearest neighbor
Instance-based method
Learning
Stores training data and delays processing until a new instance must be classified
Assumes that all points are represented
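The instance-based idea can be sketched as follows: "learning" only stores the data, and all work happens when a query arrives. This sketch assumes numeric points, Euclidean distance, and a majority vote among the k closest neighbors; the slide does not specify these, and `knn_predict` is a hypothetical name.

```python
from collections import Counter
from math import dist  # Euclidean distance, available in Python 3.8+


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest stored examples."""
    # Processing is deferred until now: sort the stored examples by distance.
    neighbors = sorted(
        zip(train_points, train_labels),
        key=lambda pair: dist(pair[0], query),
    )[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]
```

Sorting the whole training set is the simplest approach; spatial indexes can reduce the per-query cost for larger data sets.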
Data Mining
CS57300
Purdue University
February 17, 2015
Learning predictive models
Choose a data representation
Select a knowledge representation (a model)
Defines a space of possible models M = {M1, M2, ..., Mk}
Use search to identify best model(s)
Se
CS57300: Homework 2
Due date: Wednesday February 18, midnight (submit pdf to Blackboard)
Submit both your answers to the questions and the code that you used for analysis. Your
homework must be typed. Use of Latex is recommended, but not required.
In this
CS57300: Homework 1
Due date: Sunday February 1, midnight (submit pdf to Blackboard)
Submit both your answers to the questions and the R code that you used for analysis. Your homework
must be typed and submitted as a PDF. Use of Latex is recommended, but
CS57300: Homework 4
Due date: Sunday Apr 12, midnight (submit via turnin)
More Exploration of Naive Bayes Classifiers using Word Clusters
In this programming assignment you will run further experiments with the NBC. Instructions below
detail how to use turn
CS573 HW4
Haoran Lin
1 day extension
1
(a)
[Figure: standard kmeans; score (y-axis, 0 to 150000) versus cluster size (x-axis: 10, 20, 50, 100, 200); series: wp, wnp]
The best k value is 200
(b)
There are some noticeable topics, such as French restaurants. In the French
restaurant topic, there are
Gaussian mixture models (revisited)
Assume that the data are generated from a mixture of k multi-dimensional Gaussians, where each component has parameters: $N_k(\mu_k, \Sigma_k)$
$f(x) = \sum_{k=1}^{K} w_k f_k(x; \theta)$
$p(x) = \sum_{k=1}^{K} p(k) p(x|k)$
For each data point:
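To make the mixture density concrete, the sketch below evaluates p(x) as a weighted sum of component densities. It is simplified to one-dimensional Gaussians (the slide describes the multi-dimensional case), and the names `gaussian_pdf` and `mixture_density` are illustrative.

```python
from math import exp, pi, sqrt


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return exp(-((x - mean) ** 2) / (2 * var)) / sqrt(2 * pi * var)


def mixture_density(x, weights, means, variances):
    """p(x) = sum over components k of p(k) * p(x | k); weights should sum to 1."""
    return sum(
        w * gaussian_pdf(x, m, v)
        for w, m, v in zip(weights, means, variances)
    )
```

For example, `mixture_density(0.0, [0.5, 0.5], [0.0, 4.0], [1.0, 1.0])` is dominated by the first component, because the second component's mean lies four standard deviations away from the query point.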
Data Mining
CS57300
Purdue University
Association rules
April 21, 2015
Association rules
Data
Basket: customer transaction; items: products
Basket: document; items: words
Rule evaluation
Support (aka frequency)
s() = fr() / N
Proportion of N items
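Support as a frequency divided by N can be sketched as below. The sketch assumes each basket is a collection of items and that N is the number of baskets; the function name `support` and the sample baskets are hypothetical.

```python
def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    matches = sum(1 for basket in baskets if itemset <= set(basket))
    return matches / len(baskets)


# Hypothetical market-basket data: each set is one customer transaction.
baskets = [
    {"bread", "milk"},
    {"bread", "beer"},
    {"milk", "beer", "bread"},
    {"milk"},
]
bread_support = support({"bread"}, baskets)
bread_milk_support = support({"bread", "milk"}, baskets)
```

The same function applies to the document case by treating each document as a basket and its words as items.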
Data Mining
CS57300
Purdue University
Anomaly detection
(source: Introduction to Data Mining by Tan, Steinbach and Kumar)
April 23, 2015
Task
Examples
Anomalies/outliers: data points that are considerably different from the
remainder of the data
Fraud det
Data Mining
CS57300
Purdue University
Pattern mining: representation & learning
April 16, 2015
Data mining components
Pattern discovery
Task specification: Pattern discovery
Models describe entire dataset (or large part of it)
Data representation: Homoge
Data Mining
CS57300
Purdue University
Descriptive modeling
March 31, 2015
Data mining components
Descriptive models
Task specification: Description
Descriptive models summarize the data
Data representation: Homogeneous IID data
Global summary
Knowledge
Data Mining
Modeling pathologies
CS57300
Purdue University
March 23, 2015
Source: David Jensen, University of Massachusetts, CS383
Pathologies of induction algorithms
Overfitting
Overfitting
Accuracy
Adding components to models that reduce performance or le