Name:
CS39000-DM0 Midterm 1: Spring 2014
This is a closed-book, closed-notes exam. Non-programmable calculators are allowed for probability
calculations.
There are six pages including the cover page. The total number of points for the exam is 30. Note
the
Data Mining & Machine Learning
CS39000-DM0
Purdue University
September 27, 2016
Predictive modeling: introduction
Data mining components
Task specification: Prediction
Data representation: Homogeneous IID data
Knowledge representation
Learning techniq
Data Mining & Machine Learning
CS39000-DM0
Purdue University
September 8, 2016
Last time: Hypothesis and (human) Bias
A Hypothesis allows us to reason about the data in a formal way.
Describes the connection between random variables
We can make differe
Data Mining & Machine Learning
CS39000-DM0
Purdue University
September 15, 2015
Data exploration
and visualization
Measurement
[Figure: Real world -> Data; Relationship in real world -> Relationship in data]
Goal: map domain entities to symbolic representations
Tabular da
Data Mining & Machine Learning
CS39000-DM0
Purdue University
October 6, 2016
Classification
In its simplest form, a classification model defines a decision
boundary (h) and labels for each side of the boundary
Input: x = {x1, x2, ..., xn} is a set of attribu
Data Mining & Machine Learning
CS39000-DM0
Purdue University
October 13, 2016
Linear Classifiers
Linear threshold functions
Associate a weight (wi) with each feature (xi)
Prediction: sign(b + w^T x) = sign(b + \sum_i w_i x_i)
b + w^T x \geq 0: predict y = 1
Otherwise, p
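
The linear threshold rule above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `predict`, the example weights, and the inputs are hypothetical, and two details are assumptions rather than facts from the slides: a score of exactly zero is sent to the positive class, and the negative class is labeled -1.

```python
def predict(weights, bias, features):
    """Linear threshold prediction: the sign of b + w^T x."""
    # Weighted sum of the features plus the bias term b.
    activation = bias + sum(w * x for w, x in zip(weights, features))
    # Assumption: ties at zero go to +1, and the negative label is -1.
    return 1 if activation >= 0 else -1


# Hypothetical weights and bias, chosen only for illustration.
w = [2.0, -1.0]
b = -0.5

print(predict(w, b, [1.0, 0.5]))  # activation = 2.0 - 0.5 - 0.5 = 1.0
print(predict(w, b, [0.0, 1.0]))  # activation = -1.0 - 0.5 = -1.5
```

The first input falls on the positive side of the boundary and the second on the negative side, so the two calls return different labels.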
Data Mining & Machine Learning
CS39000-DM0
Purdue University
September 22, 2016
Data exploration
and visualization
http://extremepresentation.typepad.com/blog/2006/09/choosing_a_good.html
http://extremepresentation.typepad.com/blog/2006/09/choosing_a_good.h
Data Mining & Machine Learning
CS39000-DM0
Purdue University
September 29, 2016
Predictive modeling: introduction
Classification
In its simplest form, a classification model defines a decision
boundary (h) and labels for each side of the boundary
Input:
Data Mining & Machine Learning
CS39000-DM0
Purdue University
August 23, 2016
Adapted from Jennifer Neville Fall15 slides
Course overview
Goals
Identify key elements of data mining and
machine learning algorithms
Understand how algorithmic elements
inter
Data Mining & Machine Learning
CS39000-DM0
Purdue University
September 1, 2016
Probability and statistics basics
Modeling uncertainty
Necessary component of almost all data analysis
Approaches to modeling uncertainty:
Fuzzy logic
Possibility theory
R
Data Mining & Machine Learning
CS39000-DM0
Purdue University
October 18, 2016
Linear Classifiers
Linear threshold functions
Associate a weight (wi) with each feature (xi)
Prediction: sign(b + w^T x) = sign(b + \sum_i w_i x_i)
b + w^T x \geq 0: predict y = 1
Otherwise, p
CS39000-DM0 Homework 2
Due date: Wednesday, February 15, 11:59pm in Blackboard.
Submit a PDF with both your answers to the questions and the R code that you used for
analysis. Your homework must be typed. Use of LaTeX is recommended, but not required.
In
CS39000-DM0 Homework 3
Due date: Thursday March 9, 11:59pm
In this programming assignment you will implement a naive Bayes classification algorithm
and evaluate it on the Yelp dataset. Instructions below detail how to turn in your code
and assignment to B
CS39000-DM0 Homework 4
Due date: Wednesday March 29, 11:59pm
In this programming assignment you will implement the k-means clustering algorithm and
apply it on the Yelp dataset. Instructions below detail how to turn in your code and
assignment to Blackboa
CS39000-DM0 Homework 1
Due date: Friday January 27, 11:59pm (submit pdf to Blackboard)
1
Basic Probability and Statistics
1. (4 pts)
(a) Suppose that E, F and G are independent events. Prove that
P[E \wedge (F \vee G)] = P(E) P(F \vee G)
(b) Let A and B be indepen
CS39000-DM0 Homework 5
Due date: Monday April 17, 11:59pm.
In this programming assignment you will use Python to implement an association rule
algorithm and apply it to the yelp4.csv dataset, which has only discrete attributes and
no missing values. Instr
CS39000-DM0 Homework 4
Due date: Wednesday April 9, midnight
In this programming assignment you will implement a regression tree classification
algorithm and evaluate it on the Yelp dataset. This is the dataset that we used in homework 2 and 3. Instructions bel
Data Mining & Machine Learning
CS39000-DM0
Purdue University
February 18, 2014
Predictive modeling: learning
Learning predictive models
Choose a data representation
Select a knowledge representation (a model)
Defines a space of possible models M = {M
Data Mining & Machine Learning
CS39000-DM0
Purdue University
March 6, 2014
Naive Bayes classifiers
Classification as probability estimation
Instead of learning a function f that assigns labels
Learn a conditional probability distribution over the output
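
As a rough illustration of classification as probability estimation, the sketch below estimates a conditional distribution over labels by counting, in the naive Bayes style. Everything in it is an assumption made for illustration: the function names `train` and `posterior`, the toy data, the restriction to discrete features, and the use of add-one (Laplace) smoothing, which the slides do not specify.

```python
from collections import Counter, defaultdict


def train(examples):
    """Count labels and per-label feature values from (features, label) pairs."""
    label_counts = Counter()
    feature_counts = defaultdict(Counter)  # (label, position) -> value counts
    for features, label in examples:
        label_counts[label] += 1
        for i, value in enumerate(features):
            feature_counts[(label, i)][value] += 1
    return label_counts, feature_counts


def posterior(features, label_counts, feature_counts, values_per_feature=2):
    """Return P(label | features) under the naive independence assumption."""
    total = sum(label_counts.values())
    scores = {}
    for label, n in label_counts.items():
        score = n / total  # prior P(label)
        for i, value in enumerate(features):
            # Assumption: add-one smoothing over values_per_feature possible values.
            count = feature_counts[(label, i)][value]
            score *= (count + 1) / (n + values_per_feature)
        scores[label] = score
    normalizer = sum(scores.values())
    return {label: s / normalizer for label, s in scores.items()}


# Hypothetical toy data: two binary features and a binary label.
data = [
    (("yes", "yes"), "pos"),
    (("yes", "no"), "pos"),
    (("no", "no"), "neg"),
    (("no", "yes"), "neg"),
]
label_counts, feature_counts = train(data)
result = posterior(("yes", "yes"), label_counts, feature_counts)
print(result)
```

The output is a distribution over labels rather than a single label; a label can then be chosen by taking the most probable one.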
CS39000-DM0 Homework 1
Due date: Saturday Feb 1, 11:59pm in Blackboard
Note the point value of each question and allocate your effort accordingly. In order to
receive full credit for correct answers and partial credit for others, please show your work.
Your
CS39000-DM0 Homework 1 Solutions
Data Mining Tasks (20 pts)
Summarize two examples of data mining applications from recent conference papers/reports.
EXAMPLE 1:
1. Briefly summarize (1-2 sentences) the article and include a reference.
This paper focuses on
Data Mining & Machine Learning
CS39000-DM0
Purdue University
January 21, 2014
Probability and statistics basics
Modeling uncertainty
Necessary component of almost all data analysis
Approaches to modeling uncertainty:
Fuzzy logic
Possibility theory
Data Mining & Machine Learning
CS39000-DM0
Purdue University
January 30, 2014
Data exploration
and visualization
Visualization
Human eye/brain have evolved powerful methods to detect structure in nature
Display data in ways that exploit human patter
Data Mining & Machine Learning
CS39000-DM0
Purdue University
February 6, 2014
Predictive modeling: introduction
Data mining components
Task specification: Prediction
Data representation: Homogeneous IID data
Knowledge representation
Learning techniqu
Data Mining & Machine Learning
CS39000-DM0
Purdue University
February 11, 2014
Python: A Simple Tutorial
Slides adapted from UPenn CIS530 python tutorial
Why Python?
Interpreted language
Dynamically typed: variables do not have a predefined type
Rich,
Data Mining & Machine Learning
CS39000-DM0
Purdue University
January 28, 2014
Data and Measurement
Measurement
[Figure: Real world -> Data; Relationship in real world -> Relationship in data]
Goal: map domain entities to symbolic representations
What is data?
Collecti
Data Mining & Machine Learning
CS39000-DM0
Purdue University
January 23, 2014
Probability and statistics (cont)
Expectation
Denotes the expected value or mean value of a random variable X
Discrete
E[X] = \sum_x x \, p(x)
Continuous
E[X] = \int x \, p(x) \, dx
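
For the discrete case, the expected value can be computed directly as a probability-weighted sum. The sketch below is illustrative only: the function name `expectation`, the dictionary representation of the distribution, and the fair-die example are all assumptions, not taken from the slides.

```python
from fractions import Fraction


def expectation(pmf):
    """E[X] for a discrete distribution given as {value: probability}."""
    return sum(x * p for x, p in pmf.items())


# Hypothetical example: a fair six-sided die, using exact fractions.
die = {face: Fraction(1, 6) for face in range(1, 7)}
print(expectation(die))  # 7/2
```

The result, 7/2 = 3.5, is not a value the die can actually show, which is a useful reminder that the expected value is a weighted average, not a most-likely outcome.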
Expectation of a f
Data Mining & Machine Learning
CS39000-DM0
Purdue University
February 4, 2014
Data exploration
and visualization
http://extremepresentation.typepad.com/blog/2006/09/choosing_a_good.html
http://extremepresentation.typepad.com/blog/2006/09/choosing_a_good.
Data Mining & Machine Learning
CS39000-DM0
Purdue University
January 16, 2014
Elements of Data Mining
& Machine Learning Algorithms
Knowledge Discovery in
Databases: Process
Interpretation/
Evaluation
Data Mining
Preprocessing
Knowledge
Knowledge
Patte
Data Mining & Machine Learning
CS39000-DM0
Purdue University
January 14, 2014
Data mining
The process of identifying valid, novel, potentially useful, and
ultimately understandable patterns in data
(Fayyad, Piatetsky-Shapiro & Smyth 1996)
Artificial Int