DATA MINING
Lecture 1
Javier Cabrera
Fall 2013
Outline
1. What is Data mining?
2. Software
3. Data Visualization, EDA
4. Modeling: Penalized, Unsupervised, Supervised
5. Intro to R, R packages
What is Data mining?
KDD 2013 SPONSORS
NSA
VECTORS AND MATRICES:
Matrix: a rectangular array of elements
Dimension: r x c means r rows by c columns
Example: A = [a_{ij}], i = 1, 2, 3; j = 1, 2
In general: A = [a_{ij}], i = 1, ..., r; j = 1, ..., c
x = (x_1, x_2, ..., x_p)^T is a multivariate observation.
Multivariate observ
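The notation above can be tried directly in R. This is only a minimal sketch: the numbers and the variable names (A, x, height, weight, age) are made up for illustration and do not come from the lecture.

```r
# A 3 x 2 matrix: r = 3 rows, c = 2 columns (illustrative values only)
A <- matrix(c(1, 4,
              2, 5,
              3, 6), nrow = 3, ncol = 2, byrow = TRUE)

dim(A)     # dimension as (rows, columns)
A[3, 2]    # element a_{ij} with i = 3, j = 2

# A multivariate observation with p = 3 components (hypothetical variables)
x <- c(height = 170, weight = 65, age = 30)
p <- length(x)
```

Stacking n such observations as rows gives the usual n x p data matrix.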
BASIC STATISTICAL CONCEPTS
These are some of the topics that you are required to know for
the Data mining class. If you feel that you are not familiar with
some of them, please read about them in your basic stats textbook
or come and talk to me and I will
Final Report:
SALES OF ORTHOPEDIC EQUIPMENT
The objective of this study is to find ways to increase sales of orthopedic material from our
company to hospitals in the United States. I want each person to concentrate on a subset of 3000
hospitals chosen at
3D Plots: Are sometimes useful but may need animation
Conditional plots
(In Rweb) data(state)
attach(data.frame(state.x77))  #> don't need `data' arg. below
coplot(Life.Exp ~ Income | Illiteracy * state.region, number = 3,
panel = function(x, y, ...) panel.smo
Part 2
An economist compares salaries of 30-year-old employed men and women in NJ and Pennsylvania,
with and without a college diploma. Salaries are in $1000s. The data are
     state  gender  diploma  salary
1    NJ     M       N        150
GEE MODELS
GEEs are sometimes used to analyze longitudinal data with normal or categorical
responses. Below we have two examples, one with continuous and one
with categorical responses.
GEEs require the specification of a marginal model, so general forms
STAT 586 Fall 2010. Report I. (Group report). Due Oct 20, 2010
Topic: Analyzing Survey Data with Categorical Variables.
Application: Survey of Wine preferences in SB and HK
This survey was conducted in the summer of 2008 and was
directed to decision maker
Topic: Analyzing Survey Data with Categorical Variables.
Application: Survey of Wine preferences in SB and HK
The objective of this project is to compare .
Two projects:
1. Census Database 5% sample from New Jersey or
California. This is for postgraduate
PROJECT 1
An economist compares salaries of 30-year-old employed men and women in NJ and Pennsylvania,
with and without a college diploma. Salaries are in $1000s. The data are
     state  gender  diploma  salary
1    NJ     M
Exploratory Data Analysis
1. EDA (Exploratory Data Analysis, by J. Tukey).
The basic idea behind EDA methodology is to learn how to explore data and find
valuable information, structures and relationships among variables. Find the structure
of the majority
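A first exploratory pass in R might look like the sketch below. It assumes the built-in stackloss data set purely for illustration. The handout does not say which data or which summaries it uses, so the choice of five-number summaries, a correlation matrix, and the boxplot rule is an assumption.

```r
# Illustrative EDA pass on a built-in data set (an assumption, not the course data)
data(stackloss)

# Five-number summary (min, lower hinge, median, upper hinge, max) per variable
five <- sapply(stackloss, fivenum)

# Pairwise relationships among the variables
r <- cor(stackloss)

# Points flagged by the boxplot rule for the response
out <- boxplot.stats(stackloss$stack.loss)$out

five
round(r, 2)
out
```

Resistant summaries such as the median and hinges describe the bulk of the data even when a few points are unusual.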
Code for Robust Regression
IN R:
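library(MASS)                       # provides lmsreg()
l1 = lmsreg(stack.x,stack.loss)     # least median of squares (robust) fit
l2 = lsfit(stack.x,stack.loss)      # ordinary least squares fit
plot(stack.x[,2],resid(l1))         # robust residuals against the second predictor
abline(0,1)
plot(resid(l1))                     # robust residuals in observation order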
IN SAS
You can also import the dataset nym into PROC IML and from there use a robust
regres
FACTOR ANALYSIS:
FA assumes the existence of a few latent variables that define the phenomena under
study. The observations are functions of these unknown latent variables, or more specifically are
linear combinations of the latent variables.
The objectiv
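The latent-variable idea can be illustrated with simulated data. In this minimal sketch the loadings, the sample size, and the names f1, f2 and x1 to x6 are all invented for the example. It uses factanal() from base R, which fits the factor model by maximum likelihood.

```r
set.seed(1)
n  <- 500
f1 <- rnorm(n)   # first latent variable (unobserved in practice)
f2 <- rnorm(n)   # second latent variable

# Each observed variable is a linear combination of a latent variable plus noise
X <- cbind(
  x1 = 0.9 * f1 + rnorm(n, sd = 0.4),
  x2 = 0.8 * f1 + rnorm(n, sd = 0.4),
  x3 = 0.7 * f1 + rnorm(n, sd = 0.4),
  x4 = 0.9 * f2 + rnorm(n, sd = 0.4),
  x5 = 0.8 * f2 + rnorm(n, sd = 0.4),
  x6 = 0.7 * f2 + rnorm(n, sd = 0.4)
)

# Fit a two-factor model and inspect the estimated loadings
fa <- factanal(X, factors = 2)
print(fa$loadings, cutoff = 0.3)
```

With this setup the first three variables should load mainly on one factor and the last three on the other.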
How to load Fortran subroutines into R
Fortran, C, and C++ are compiled languages that generate machine code, so they can be used to
speed up computations in R. R has a mechanism to load Fortran or C code.
Example: The following example shows how t
1. What is Grubbs test or rule and what is it used for?
2. Give 3 standard methods for identifying outliers.
3. Compare Grubbs rule with the resistant z-score rule for det
PROJECT 1
The following data is the estimated number of three kinds of plankton caught in six hauls
each with two nets. You have two weeks to do it, with your group.
Net A
KindI
KindII
KindIII
897
107
98
107 43
1101 28
4568 68
696
55
324
Net B
566 467
32
HOW TO WRITE REPORTS
Please follow these guidelines when you write
your reports.
1. Project Outline
Title Page: Must contain
o Project title.
o Your name.
o Date.
o Executive Summary (very short and to the point).
The executive summary contains a brief
Multivariate Data Analysis and
Data Mining
Outline
1. Multivariate Data
2. Data Visualization for Multivariate Data.
3. A basic multivariate example: Crime data.
4. Geometric intuition of Multivariate data.
5. Dimension Reduction: Principal Components
6.
Robust Fitting
We saw resistant estimates that are not affected by some percentage (greater
than 0%) of outliers.
Breakdown Point (BP): The BP of an estimator is the smallest proportion of a sample
that needs to be changed so that the estimator can take an arbitrary value. Th
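The breakdown point can be seen numerically by replacing observations with an extreme value. The sample and the helper contaminate() below are hypothetical, chosen only to contrast the mean with the median.

```r
# Small illustrative sample (made-up values)
x <- c(2.1, 2.5, 2.8, 3.0, 3.2, 3.6, 4.0)

# Hypothetical helper: replace the first k observations with a wild value
contaminate <- function(x, k, bad = 1e6) {
  x[seq_len(k)] <- bad
  x
}

# One bad point out of 7 is enough to carry the mean away
mean(contaminate(x, 1))

# The median stays within the original range with 3 of 7 points replaced ...
median(contaminate(x, 3))

# ... and breaks only once more than half the sample is replaced
median(contaminate(x, 4))
```

This matches the usual statement that the mean has breakdown point 1/n (0% in the limit), while the median has a breakdown point of about 50%.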
Project II. Comparing low income housing in New Jersey with other US
states.
Dataset: You need the file with NJ data (Gnj.zip) plus the other two states
assigned to your group.
Important variables for affordability are
Value = housing value code
Puma1 = p
R Software Documentation
1. Installation
To install R, go to the webpage cran.r-project.org and download the executable for MS Windows or the source code. The
self-installing file will guide you through the installation procedure. You can also get
1/30/2013
Outline
Lecture 1: R Basics
Why R, and R Paradigm
References, Tutorials and links
R Overview
R Interface
R Workspace
Help
R Packages
Input/Output
Reusing Results
Applied Statistical Computing and
Graphics
R has a Steep
Learning Curve
Why R?
(ste
SAS Basics.
A. SAS UNDER WINDOWS
1. Open SAS
2. Locate windows:
Explorer, Log, Editor, Output,
3. Locate SAS user and Work folders in the Explorer window.
4. Observe changes in the buttons as we change windows.
5. Open the text file c:\hospital.csv
This
Modern procedures for model
selection
When there are many Xs, the fitted models have high variance and prediction
suffers. Three solutions:
1. Subset selection
   Score: AIC, BIC, etc.
   All-subsets + leaps-and-bounds,
   Stepwise methods,
2. Shrinkage/Ridge Regressi
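The first two approaches can be sketched in R with standard tools. The mtcars data set, the lambda grid, and the object names are assumptions made for this example. step() comes from base R and lm.ridge() from the MASS package, and the slides may well use different functions.

```r
library(MASS)   # for lm.ridge()

# Full model with all available predictors (illustrative data set)
full <- lm(mpg ~ ., data = mtcars)

# 1. Subset selection by stepwise search, scored by AIC (the default, k = 2)
sel_aic <- step(full, direction = "both", trace = 0)

#    The same search scored by BIC: penalty k = log(n)
sel_bic <- step(full, direction = "both", k = log(nrow(mtcars)), trace = 0)

# 2. Shrinkage: ridge regression over a grid of penalties (grid is an assumption)
ridge <- lm.ridge(mpg ~ ., data = mtcars, lambda = seq(0, 20, by = 0.5))
best_lambda <- ridge$lambda[which.min(ridge$GCV)]   # penalty minimizing GCV

formula(sel_aic)
formula(sel_bic)
best_lambda
```

Stepwise search keeps or drops whole predictors, while ridge regression keeps every predictor and shrinks the coefficients toward zero.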
Brief R Tutorial
June 6, 2008
The best way to go through this tutorial is to first install a version of R (see installation section below) and type the commands along with the examples given. This way you can see for yourself what output each command give