CS57300: Homework 4
Due date: Sunday Apr 12, midnight (submit via turnin)
More Exploration of Naive Bayes Classifiers using Word Clusters
In this programming assignment you will run further experiments with the NBC. Instructions below
detail how to use turn
Descriptive modeling: evaluation
Cluster validity
For prediction tasks there are a variety of external evaluation metrics
Accuracy, squared loss, area under ROC, etc.
For cluster analysis the exte
Due date: Sunday September 23, midnight (submit pdf to Blackboard)
Submit both your answers to the questions and the R code that you used for analysis. Your
homework must be typed. Use of LaTeX is recommended, but not required.
In this
Due date: Sunday February 1, midnight (submit pdf to Blackboard)
Submit both your answers to the questions and the R code that you used for analysis. Your homework
must be typed and submitted as a PDF. Use of LaTeX is recommended, but
Decision making
Are A and B the same color?
The trick uses the biases in the human visual system
Heuristics and biases
Tversky & Kahneman, psychologists, propose that people often do not follow
ru
What is data?
Collection of entities and their attributes
Entity: collection of attributes
Aka: record, point, case, sample, object, or instance
Entities
Attribute: property or characteristic
Introduction
What is data mining?
Why now?
Data mining process
Course overview
Data mining
The process of identifying valid, novel, potentially useful, and
ultimately understandable patterns in
NBC learning
Model space
Parametric model with specific form
(i.e., based on Bayes rule and assumption of conditional independence),
Models vary based on parameter estimates in CPDs
Search alg
Predictive modeling: introduction
Data mining components
Task specification: Prediction
Data representation: Homogeneous IID data
Knowledge representation
Learning technique
Prediction technique
Dimensionality reduction
Identify and describe the dimensions that underlie the data
May be more fundamental than those directly measured but hidden to the
user
Reduce dimensionality of modeling
Types of hypotheses
Descriptive: propositions that describe a
characteristic of an object
Relational: propositions that describe
the relationship between 2+ variables
Descriptive
Hypothesis
Non
Covariance and correlation
Covariance
Measures how variables Xj and Xk vary together
\[ COV(X_j, X_k) = \frac{1}{n} \sum_{i=1}^{n} (x_{ij} - \bar{X}_j)(x_{ik} - \bar{X}_k) \]
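As a minimal sketch of the covariance definition above, the following Python function averages the products of deviations from the two column means, using the 1/n normalization shown on the slide. The function name `covariance` and the sample lists are illustrative assumptions, not part of the course material.

```python
def covariance(xs, ys):
    """Covariance of two equal-length samples with 1/n normalization."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Average of the products of deviations from each mean.
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


# Hypothetical data: the second variable grows with the first,
# so the covariance is positive.
x_j = [1.0, 2.0, 3.0, 4.0]
x_k = [2.0, 4.0, 6.0, 8.0]
print(covariance(x_j, x_k))
```

Note that many libraries default to the 1/(n-1) sample estimate instead, so results can differ slightly from this sketch on small samples.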
Positive if large values of Xj are associated with large values of X
Probability and statistics basics
Modeling uncertainty
Necessary component of almost all data analysis
Approaches to modeling uncertainty:
Fuzzy logic: form of many-valued logic that reasons with
CS57300: Homework 3
Due date: Friday October 12, midnight (submit via turnin)
Bag of Words Naive Bayes
In this programming assignment you will implement a naive Bayes classification (NBC) algorithm
and use it on the 152,327 review objects in the Yelp academ
Tree learning
Top-down recursive divide and conquer algorithm
Start with all examples at root
Select best attribute/feature
Partition examples by selected attribute
Recurse and repeat
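The steps above can be sketched in Python as a short recursive function. The slide does not say how the "best" attribute is chosen, so this sketch assumes information gain (entropy reduction) as the selection criterion; the names `entropy`, `best_attribute`, and `build_tree`, and the toy data, are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def best_attribute(rows, labels, attributes):
    """Pick the attribute with the highest information gain (an assumed criterion)."""
    def gain(attr):
        remainder = 0.0
        for value in set(row[attr] for row in rows):
            subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
            remainder += len(subset) / len(labels) * entropy(subset)
        return entropy(labels) - remainder
    return max(attributes, key=gain)


def build_tree(rows, labels, attributes):
    """Top-down recursive divide and conquer over categorical attributes."""
    # Stop when the node is pure or no attributes remain; predict the majority label.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    attr = best_attribute(rows, labels, attributes)       # select best attribute
    remaining = [a for a in attributes if a != attr]
    children = {}
    for value in set(row[attr] for row in rows):          # partition examples
        idx = [i for i, row in enumerate(rows) if row[attr] == value]
        children[value] = build_tree(                     # recurse and repeat
            [rows[i] for i in idx], [labels[i] for i in idx], remaining
        )
    return (attr, children)


# Hypothetical toy data: "outlook" separates the classes perfectly, "windy" does not.
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rain", "windy": "no"},
    {"outlook": "rain", "windy": "yes"},
]
labels = ["play", "play", "stay", "stay"]
tree = build_tree(rows, labels, ["outlook", "windy"])
print(tree)
```

A leaf is represented by a bare class label and an internal node by an `(attribute, children)` tuple; this keeps the sketch short but is only one of several reasonable representations.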
Other
Other predictive models
Nearest neighbor
Instance-based method
Learning
Stores training data and delays processing until a new instance must be
classified
Assumes that all points are represented
CS57300: Homework 3
Due date: Monday March 9, midnight (submit via turnin)
Bag of Words Naive Bayes
In this programming assignment you will implement a naive Bayes classification (NBC) algorithm
and use it on a sample of the reviews from the Yelp data set to
CS57300: Homework 2
Due date: Wednesday February 18, midnight (submit pdf to Blackboard)
Submit both your answers to the questions and the code that you used for analysis. Your
homework must be typed. Use of Latex is recommended, but not required.
In this
CS573 HW4
Haoran Lin
1 day extension
1
(a)
[Figure: standard k-means. y-axis: score (0 to 150000); x-axis: cluster size (10, 20, 50, 100, 200); series: wp, wnp]
The best k value is 200
(b)
There are some noticeable topics, such as French restaurants. In the French
restaurant topic, there are
Gaussian mixture models (revisited)
Assume that the data are generated
from a mixture of K multi-dimensional
Gaussians, where each component k
has parameters: \( N_k(\mu_k, \Sigma_k) \)
\[ p(x) = \sum_{k=1}^{K} p(k)\, p(x \mid k) \]
For each data point:
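To make the mixture density concrete, here is a minimal Python sketch that evaluates p(x) as a weighted sum of component densities. For brevity it assumes one-dimensional Gaussians (the slide describes the multi-dimensional case), and the names `gaussian_pdf` and `mixture_density` as well as the parameter values are illustrative assumptions.

```python
import math


def gaussian_pdf(x, mean, variance):
    """Density of a one-dimensional Gaussian at x."""
    return math.exp(-((x - mean) ** 2) / (2 * variance)) / math.sqrt(2 * math.pi * variance)


def mixture_density(x, weights, means, variances):
    """p(x) = sum over k of p(k) * p(x | k), with weights playing the role of p(k)."""
    return sum(w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances))


# Hypothetical two-component mixture; the weights must sum to 1.
weights = [0.3, 0.7]
means = [-2.0, 3.0]
variances = [1.0, 4.0]
print(mixture_density(0.0, weights, means, variances))
```

Extending this to the multi-dimensional case would replace the scalar variance with a covariance matrix per component.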
Association rules
Data
Basket: customer transaction; items: products
Basket: document; items: words
Rule evaluation
Support (aka frequency)
s() = fr() / N
Proportion of N items
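As a rough illustration of support, the Python sketch below reads fr() as the number of baskets that contain every item in a given itemset and N as the total number of baskets. That reading is an assumption, since the argument symbols did not survive on the slide; the function name `support` and the baskets are hypothetical.

```python
def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    if not baskets:
        raise ValueError("baskets must be non-empty")
    wanted = set(itemset)
    # fr: number of baskets that include the whole itemset.
    frequency = sum(1 for basket in baskets if wanted <= set(basket))
    return frequency / len(baskets)


# Hypothetical customer transactions (basket = transaction, items = products).
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]
print(support({"bread", "milk"}, baskets))
```

The same function applies unchanged to the document setting, with each document treated as a basket of words.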
Task
Examples
Anomalies/outliers: data points that are considerably different from the
remainder of the data
Fraud det
Data mining components
Pattern discovery
Task specification: Pattern discovery
Models describe entire dataset (or large part of it)
Data representation: Homoge
Data mining components
Descriptive models
Task specification: Description
Descriptive models summarize the data
Data representation: Homogeneous IID data
Global summary
Knowledge
Pathologies of induction algorithms
Overfitting
Accuracy
Adding components to models that reduce performance or le