Ulsan National Institute of Science and Technology
data mining
HSE 570

Fall 2014
Overview
Terminology
Variable: Any measurement on the records,
including both the input (X) variables and
the output (Y) variables
Training data: Portion of data used to fit a
model
Test data: Portion of the data used only at
the end of the model building
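The training/test distinction above can be sketched as a random partition of the records. This is a minimal illustration, not the course's procedure: the function name `train_test_split`, the `test_fraction` parameter and the fixed seed are all assumptions made for the example.

```python
import random

def train_test_split(records, test_fraction=0.4, seed=0):
    """Shuffle a copy of the records and split it into (train, test)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = int(round(len(shuffled) * test_fraction))
    # The first n_test shuffled records are held out; the rest fit the model.
    return shuffled[n_test:], shuffled[:n_test]

records = list(range(10))  # hypothetical record IDs
train, test = train_test_split(records, test_fraction=0.3, seed=42)
print(len(train), len(test))
```

The held-out `test` list is touched only once, after all model choices are made.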
Chapter 7 K-Nearest Neighbor
Galit Shmueli and Peter Bruce 2010
Characteristics
Data-driven, not model-driven
Makes no assumptions about the data
Basic Idea
For a given record to be classified, identify
nearby records
Near means records with similar predi
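The basic idea above (find nearby records, then classify) can be sketched as follows. This is a hedged illustration: Euclidean distance, majority voting, the function name `knn_classify` and the toy records are assumptions, not details taken from the slides.

```python
import math
from collections import Counter

def knn_classify(train, new_record, k=3):
    """train is a list of (predictor_tuple, class_label) pairs."""
    # Sort training records by Euclidean distance to the new record (assumed metric).
    by_distance = sorted(train, key=lambda rec: math.dist(rec[0], new_record))
    # Majority vote among the k nearest neighbors (assumed voting rule).
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical two-predictor training set.
train = [((1.0, 1.0), "owner"), ((1.2, 0.8), "owner"), ((0.9, 1.1), "owner"),
         ((5.0, 5.0), "nonowner"), ((5.2, 4.9), "nonowner")]
print(knn_classify(train, (1.1, 1.0), k=3))
```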
Fourth Assignment for Data Mining
20111691
Q1. Specify characteristics of different schemes of clustering.
Min: Also called the minimum-distance or single-linkage method. This scheme measures the
minimum distance between two clusters. This method is useful
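The Min (single-linkage) scheme described above can be sketched in a few lines. Euclidean distance, the function name `single_linkage_distance` and the sample points are assumptions chosen for illustration.

```python
import math

def single_linkage_distance(cluster_a, cluster_b):
    """Smallest pairwise distance between a point of A and a point of B."""
    return min(math.dist(p, q) for p in cluster_a for q in cluster_b)

# Hypothetical clusters on a line; the closest pair is (1, 0) and (4, 0).
a = [(0.0, 0.0), (1.0, 0.0)]
b = [(4.0, 0.0), (9.0, 0.0)]
print(single_linkage_distance(a, b))  # 3.0
```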
ASL
Autonomous Systems Lab
Localization - Introduction to Map-Based Localization
Autonomous Mobile Robots
Roland Siegwart
Margarita Chli, Paul Furgale, Marco Hutter, Martin Rufli, Davide Scaramuzza
HSE570 Multivariate Analysis and Data Mining
Homework 03
At the beginning of the class, 4/21/2015
made by Seungtae Park
modified by Seungchul Lee
Problem 1 (preparation for data classification)
In this problem, we will try to classify handwritten digits.
HSE570 Multivariate Analysis and Data Mining
Homework 06
At the beginning of the class, 5/26/2015
made by Seungchul Lee
Problem 1
The uniform distribution for a continuous variable x is defined by
$U(x; a, b) = \frac{1}{b - a}, \quad a \le x \le b$
Verify that this distribution
HSE570 Multivariate Analysis and Data Mining
Homework 07
At the beginning of the class, 6/02/2015
made by Seungchul Lee
Problem 1
Suppose you randomly sample m data points from an exponential distribution. You want to
estimate the expectation of the distr
HSE570 Multivariate Analysis and Data Mining
Homework 02
At the beginning of the class, 4/07/2015
Made by Seungchul Lee
You can numerically solve the optimization problems.
Problem 1
Each dimension of the shaft shown in Figure 1 is to be machined individua
Chapter 13 Association Rule Mining
What are Association Rules?
Transaction-based or event-based
Analysis of what goes with what
What symptoms go with what diagnosis
What movies go with what movies
Originated with the study of customer transaction databases
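A first look at "what goes with what" can be had by counting how often pairs of items share a transaction. The sketch below is only illustrative: the function name `pair_counts` and the grocery-style transactions are assumptions, and it stops at raw co-occurrence counts without deriving rules.

```python
from collections import Counter
from itertools import combinations

def pair_counts(transactions):
    """Count, for every item pair, the number of transactions containing both."""
    counts = Counter()
    for items in transactions:
        # sorted(set(...)) removes repeats and gives each pair one canonical order.
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    return counts

# Hypothetical customer transactions.
transactions = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "jam"],
]
counts = pair_counts(transactions)
print(counts.most_common(1))
```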
Data Visualization
Graph for Data Exploration
Basic plots
Line graphs
Bar charts
Scatterplots
Distribution plots
Boxplots
Histogram
Line Graph for Time Series
Bar Chart for Categorical Variable
Scatterplot
Displays relationship between two numerical varia
Midterm
Name:
Student ID:
K. Bae created an economy with two assets called Master and Under on July 1, 2015.
He is not able to make a risk-free asset yet because it has been just four months since
he got out of the long dark tunnel of PhD program in Fina
20111691
4.1
a)
Numerical value: sodium, fat, carbo, fiber, potass, sugars, vitamins, cups, rating, weight, protein, calories
Ordinal value: shelf
Nominal value: mfr, type
b)
Variables: calories, protein, fat, sodium, fiber, carbo, sugars, potass, vitamins, weight, cups, ratings
Mean
106.8
Chapter 5 Evaluating Classification & Predictive Performance
Why Evaluate?
Multiple methods are available to classify or predict
For each method, multiple choices are available for settings
To choose the best model, we need to assess each model's performance
Accu
Introduction to Data Mining
Changyong Lee, Ph.D.
Proposition: A bow while B shake hands
What is A and B?
Asians bow while Caucasians shake
hands.
People wearing clothes of a similar color bow while
people wearing cl
7.2 a) k=5 is best: its error is the lowest.
b) k=18.85
c) The model is derived from the training dataset, so it makes no error on that data.
d) The validation dataset is also used in building the model, like the training dataset, so its error is low.
e) It takes very much t
Dimension Reduction
Exploring the data
Statistical summary of data: common metrics
Average
Median
Minimum
Maximum
Standard deviation
Counts & percentages
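The common metrics listed above can be computed with Python's standard library. This is a small sketch on made-up values; the variable names and data are assumptions, and `statistics.stdev` is the sample (n - 1) standard deviation.

```python
import statistics
from collections import Counter

values = [2.0, 4.0, 4.0, 5.0, 10.0]  # hypothetical numerical variable
summary = {
    "average": statistics.mean(values),
    "median": statistics.median(values),
    "minimum": min(values),
    "maximum": max(values),
    "std_dev": statistics.stdev(values),  # sample standard deviation
}

categories = ["a", "b", "a", "a"]  # hypothetical categorical variable
counts = Counter(categories)
percentages = {key: 100 * n / len(categories) for key, n in counts.items()}

print(summary)
print(counts, percentages)
```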
Summary Statistics Boston Housing
Correlation Analysis
Below: Correlation matrix for portion of Bosto
Chapter 8 Naive Bayes
Galit Shmueli and Peter Bruce 2010
Characteristics
Data-driven, not model-driven
Makes no assumptions about the data
Naive Bayes: The Basic Idea
For a given new record to be classified, find
other records like it (i.e., same values for
HSE570 Multivariate Analysis and Data Mining
Homework 04
At the beginning of the class, 5/12/2015
made by Seungchul Lee
Problem 1
I have demonstrated the PCA algorithm using video recordings of a spring and mass system.
Figure 1 shows PCA results which yo
HSE570 Multivariate Analysis and Data Mining
Homework 05
At the beginning of the class, 5/19/2015
made by Seungchul Lee
Problem 1
You will use K-means to compress an image by reducing the number of colors it contains.
1) Image Representation
The data for
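The idea of reducing an image's colors with K-means can be sketched on a handful of RGB pixels. This is not the homework's reference solution: the function names `kmeans` and `compress`, the initialization from the first k distinct pixels, the fixed iteration count and the four sample pixels are all assumptions.

```python
import math

def kmeans(pixels, k, iterations=10):
    """Plain K-means on RGB tuples; returns the k centroid colors."""
    # Assumed initialization: the first k distinct pixels.
    centroids = []
    for p in pixels:
        if p not in centroids:
            centroids.append(p)
        if len(centroids) == k:
            break
    for _ in range(iterations):
        # Assignment step: each pixel joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in pixels:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid becomes the mean color of its cluster.
        centroids = [
            tuple(sum(ch) / len(members) for ch in zip(*members))
            if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
    return centroids

def compress(pixels, centroids):
    """Replace every pixel by its nearest centroid color."""
    return [min(centroids, key=lambda c: math.dist(p, c)) for p in pixels]

# Hypothetical image: two reddish and two bluish pixels.
pixels = [(250, 0, 0), (0, 0, 250), (255, 10, 5), (5, 10, 255)]
centroids = kmeans(pixels, k=2)
compressed = compress(pixels, centroids)
print(len(set(compressed)))
```

The result depends on the initial centroids, which is why practical implementations usually try several random starts.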
HSE570 Multivariate Analysis and Data Mining
Homework 01
before the class, 3/31/2015
Made by Seungtae Park
Modified by Seungchul Lee
Problem 1
Problem 2
Problem 3
Problem 4
Problem 5
Problem 6
Problem 7
Problem 8
Problem 9 (Polynomial interpolation)
n
Nonlinear regression
[Figure: Peak Hourly Demand (GW) vs. High Temperature (F). High temperature / peak demand observations for all days in 2008-2011]
Central idea of nonlinear regression: same as linear regression,
just with nonlinea
Multi-objective least-squares
in many problems we have two (or more) objectives
we want $J_1 = \|Ax - y\|^2$ small
and also $J_2 = \|Fx - g\|^2$ small
($x \in \mathbf{R}^n$ is the variable)
usually the objectives are competing
we can make one smaller, at the expense of making the other
Predict peak demand from high temperature
What will peak demand be tomorrow?
If we know something else about tomorrow (like the high
temperature), we can use this to predict peak demand
[Figure: peak hourly demand (GW) plotted against high temperature]
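One way to turn a known high temperature into a peak-demand prediction is to fit a straight line to past observations. The sketch below assumes a linear relationship and uses ordinary least squares; the function name `fit_line` and the four data points are hypothetical, not values from the lecture.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical observations: high temperature (F) and peak demand (GW).
temps = [60.0, 70.0, 80.0, 90.0]
demand = [1.5, 2.0, 2.5, 3.0]

slope, intercept = fit_line(temps, demand)
predicted = slope * 85.0 + intercept  # forecast for a day with an 85 F high
print(slope, intercept, predicted)
```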
Linear Algebra Review
Prof. Seungchul Lee
iSystems Design Lab.
Acknowledgement to
 Prof. Stephen Boyd (Stanford)
 Prof. Sanjay Lall (Stanford)
 Prof. Zico Kolter (CMU)
for material of this lecture
Linear equations
Set of linear equations (two equation