DATA MINING
Lecture 1
Javier Cabrera
Fall 2013
Outline
1. What is Data mining?
2. Software
3. Data Visualization, EDA
4. Modeling: Penalized, Unsupervised, Supervised
5. Intro to R, R packages
What is
R Software Documentation
1. Installation
To install R, go to the webpage cran.r-project.org and download the executable for MS Windows or the source code. The
self installation file will guide y
VECTORS AND MATRICES:
Matrix: a rectangular array of elements
Dimension: r x c means r rows by c columns
Example: A = [a_{ij}], i = 1, 2, 3; j = 1, 2
In general: A = [a_{ij}], i = 1, ..., r; j = 1, ..., c
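As a small illustration of the notation above, the following R sketch builds a 3 x 2 matrix and reads off its dimension and one element. The entries 1 to 6 are arbitrary values chosen only for this example.

```r
# A 3 x 2 matrix: 3 rows and 2 columns (entries are arbitrary, for illustration).
# matrix() fills column by column, so column 1 is 1,2,3 and column 2 is 4,5,6.
A <- matrix(1:6, nrow = 3, ncol = 2)

d <- dim(A)      # number of rows, then number of columns
a32 <- A[3, 2]   # the element a_{ij} with i = 3, j = 2

print(A)
print(d)
print(a32)
```

Indexing with A[i, j] follows the same row-then-column order as the subscripts in a_{ij}.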
x1
x2
x = is a mu
Final Report:
SALES OF ORTHOPEDIC EQUIPMENT
The objective of this study is to find ways to increase sales of orthopedic material from our
company to hospitals in the United States. I want each person
3D plots are sometimes useful but may need animation.
Conditional plots
(In Rweb) data(state)
attach(data.frame(state.x77))  #> don't need 'data' arg. below
coplot(Life.Exp ~ Income | Illiteracy * state.
Part 2
An economist compares the salaries of 30-year-old employed men and women in NJ and Pennsylvania,
with and without a college diploma. Salaries are in $1000s. The data is
state
GEE MODELS
GEEs are sometimes used to analyze longitudinal data with normal or
categorical responses. Below we have two examples, one with continuous and one
with categorical responses.
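A minimal sketch of what such fits can look like in R is given here. It assumes the geepack package and its geeglm() function, which the notes do not name, and it uses simulated data with hypothetical variables (id, time, y, z) rather than the course examples. The exchangeable working correlation is also only an illustrative choice.

```r
# Sketch only: assumes the 'geepack' package is installed (not named in the notes).
if (requireNamespace("geepack", quietly = TRUE)) {
  library(geepack)
  set.seed(1)

  # Hypothetical longitudinal data: 50 subjects, 4 visits each, sorted by id.
  n_subj <- 50
  n_time <- 4
  d <- data.frame(id   = rep(1:n_subj, each = n_time),
                  time = rep(0:(n_time - 1), times = n_subj))
  subj_effect <- rnorm(n_subj)                      # induces within-subject correlation
  d$y <- 1 + 0.5 * d$time + subj_effect[d$id] + rnorm(nrow(d))   # continuous response
  d$z <- rbinom(nrow(d), 1, plogis(-0.5 + 0.3 * d$time))         # binary response

  # GEE with a continuous (normal) response.
  fit_cont <- geeglm(y ~ time, id = id, data = d,
                     family = gaussian, corstr = "exchangeable")

  # GEE with a categorical (binary) response.
  fit_cat <- geeglm(z ~ time, id = id, data = d,
                    family = binomial, corstr = "exchangeable")

  print(summary(fit_cont))
  print(summary(fit_cat))
}
```

The rows are kept sorted by id because geeglm() treats consecutive rows with the same id as one cluster.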
GEEs require t
STAT 586 Fall 2010. Report I. (Group report). Due Oct 20, 2010
Topic: Analyzing Survey Data with Categorical Variables.
Application: Survey of Wine preferences in SB and HK
This survey was conducted i
Topic: Analyzing Survey Data with Categorical Variables.
Application: Survey of Wine preferences in SB and HK
The objective of this project is to compare .
Two projects:
1. Census Database 5% sample f
PROJECT 1
An economist compares the salaries of 30-year-old employed men and women in NJ and Pennsylvania,
with and without a college diploma. Salaries are in $1000s. The data is
Exploratory Data Analysis
1. EDA (Exploratory Data Analysis, by J. Tukey).
The basic idea behind EDA methodology is to learn how to explore data and find
valuable information, structures and relationshi
Code for Robust Regression
IN R:
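library(MASS)                      # provides lmsreg()
l1 = lmsreg(stack.x,stack.loss)    # least median of squares (robust) fit
l2 = lsfit(stack.x,stack.loss)     # ordinary least squares fit
plot(stack.x[,2],resid(l1))        # robust residuals against the second predictor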
abline(0,1)
plot(resid(l1)
IN SAS
You can also import the datas
BASIC STATISTICAL CONCEPTS
These are some of the topics that you are required to know for
the Data mining class. If you feel that you are not familiar with
some of them please read about them in your
FACTOR ANALYSIS:
FA assumes the existence of a few latent variables that define the phenomena under
study. The observations are functions of these unknown latent variables, or more specifically are
li
How to load Fortran subroutines into R
Fortran, C, and C++ are compiled languages that generate machine code. Therefore they can be used to
speed up the computations in R. R has a mechanism to load Fortr
PROJECT 1
The following data are the estimated numbers of three kinds of plankton caught in six hauls,
each with two nets. You have two weeks to complete the project with your group.
Net A
KindI
KindII
KindIII
897
107
HOW TO WRITE REPORTS
Please follow these guidelines when you write
your reports.
1. Project Outline
Title Page: Must contain
o Project title.
o Your name.
o Date.
o Executive Summary. (very short and
Multivariate Data Analysis and
Data Mining
Outline
1. Multivariate Data
2. Data Visualization for Multivariate Data.
3. A basic multivariate example: Crime data.
4. Geometric intuition of Multivaria
Robust Fitting
We saw resistant estimates that are not affected by some percentage (greater
than 0%) of outliers.
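A quick way to see this resistance is to compare the mean and the median when one observation is corrupted. The sketch below uses made-up numbers chosen only for illustration.

```r
# Made-up sample of 9 values, and a copy with one gross outlier.
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)
x_bad <- x
x_bad[9] <- 1000          # replace the largest value with an outlier

mean_clean   <- mean(x)
mean_bad     <- mean(x_bad)    # pulled far away by the single outlier
median_clean <- median(x)
median_bad   <- median(x_bad)  # unchanged: the middle value is still 5

print(c(mean_clean = mean_clean, mean_bad = mean_bad))
print(c(median_clean = median_clean, median_bad = median_bad))
```

One bad value out of nine is enough to move the mean arbitrarily far, while the median stays where it was.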
Breakdown Point (BP): The BP of an estimator is the proportion of a sample
that needs
Project II. Comparing low-income housing in New Jersey with other US states.
Dataset: You need the file with NJ data (Gnj.zip) plus the other two states
assigned to your group.
Important variables for
1/30/2013
Outline
Lecture 1: R Basics
Why R, and R Paradigm
References, Tutorials and links
R Overview
R Interface
R Workspace
Help
R Packages
Input/Output
Reusing Results
Applied Statistical Computin
SAS Basics.
A. SAS UNDER WINDOWS
1. Open SAS
2. Locate windows:
Explorer, Log, Editor, Output,
3. Locate SAS user and Work folders in the Explorer window.
4. Observe changes in the buttons as we chan
Modern procedures for model selection
When there are many Xs, the fitted models have high variance and prediction
suffers. Three solutions:
1. Subset selection
Score: AIC, BIC, etc.
All-subsets + leaps-an
1. What is Grubbs' test or rule, and what is it used for?
2. Give 3 standard methods for identifying outliers.
3. Comp
Brief R Tutorial
June 6, 2008
The best way to go through this tutorial is to first install a version of R (see installation section below) and type the commands along with the examples given. This way