OR474 Fall 2002 D. Ruppert
Exam 2
Solutions to In-Class Part
1. A regression tree is used when the response is interval and a classification tree is
used when the response is categorical.
2. We would simply grow a right-sized tree if we knew how to do thi

Statistical Data Mining ORIE 474
Fall 2007 Tatiyana Apanasovich 11/14/07 Link Analysis
What is Link Analysis?
Technique for drawing conclusions about linked data
Specifically,
a set of nodes linked by relations
Hyperlink data, citation data
Link analysis

Statistical Data Mining ORIE 474
Fall 2007 Tatiyana Apanasovich 11/14/07 Collaborative Filtering
Bellcore's MovieRecommender
Participants sent email to videos@bellcore.com
System replied with a list of 500 movies to rate on a 1-10 scale (250 random, 250

Statistical Data Mining ORIE 474
Spring 2007 Tatiyana Apanasovich 11/12/07
Memory-based Reasoning(MBR)
The human ability to reason from experience depends on the ability to recognize appropriate examples from the past. One first identifies similar cases f

Statistical Data Mining ORIE 474
Fall 2007 Tatiyana Apanasovich 11/07/07
Market basket analysis
Market basket analysis uses information about what customers purchase to provide insight into who they are and why they make certain purchases and the

Statistical Data Mining ORIE 474
Fall 2006 Tatiyana Apanasovich 11/27/06 Artificial Neural Networks(Cont.)
NN: training and overfitting
The most delicate part of neural network modeling is generalization, the development of a model that is reliable in pr

Statistical Data Mining ORIE 474
Fall 2007 Tatiyana Apanasovich 11/05/07 Artificial Neural Networks
[Artificial] Neural Networks: A class of powerful, general-purpose tools readily applied to:
Prediction, Classification, Clustering
Biological Neural Net (h

Statistical Data Mining ORIE 474
Fall 2007 Tatiyana Apanasovich 10/31/06 Cluster Analysis
Cluster Analysis (CA): Building Blocks
Distance measures
We need to specify what we mean by certain points being closer to each other can be characterized by the ki
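The point about specifying what "closer" means can be illustrated with a minimal sketch. The two measures below (Euclidean and Manhattan) are common choices picked here as an assumption; the excerpt does not say which measures the lecture covers, and the points are made up.

```python
import math


def euclidean(x, y):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def manhattan(x, y):
    """Sum of absolute coordinate differences (city-block distance)."""
    return sum(abs(xi - yi) for xi, yi in zip(x, y))


# Hypothetical points: which of q and r is closer to p depends on the measure.
p = (0.0, 0.0)
q = (3.0, 3.0)
r = (0.0, 4.5)

closer_euclidean = "q" if euclidean(p, q) < euclidean(p, r) else "r"
closer_manhattan = "q" if manhattan(p, q) < manhattan(p, r) else "r"
print(closer_euclidean, closer_manhattan)
```

Under the Euclidean measure q is the nearer point to p, while under the Manhattan measure r is, so the choice of distance measure can change which points end up grouped together.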

Statistical Data Mining ORIE 474
Fall 2007 Tatiyana Apanasovich 10/26/07 Nearest Neighbor Models
10.6 Nearest Neighbor (NN) Models
Data: $(x(i), c(i))$, where $i = 1, \dots, n$ and $c(i) \in \{c_1, \dots, c_m\}$
Distance function: $d(x(i), x(j))$
Model Structure:
To classify a new obj
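A minimal sketch of a nearest neighbor classifier built from the ingredients listed above: labeled data (x(i), c(i)) and a distance function d. The excerpt is cut off before the classification rule is stated, so the majority vote among the k nearest points, the Euclidean distance, the function names, and the toy data are all assumptions made for illustration.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def knn_classify(train, new_x, k=3, dist=euclidean):
    """Return the majority class among the k training points nearest to new_x.

    train is a list of (x, c) pairs: x is a feature vector, c is a class label.
    """
    # Sort training points by distance to the new object and keep the k closest.
    neighbors = sorted(train, key=lambda pair: dist(pair[0], new_x))[:k]
    # Majority vote over the neighbors' class labels.
    votes = Counter(c for _, c in neighbors)
    return votes.most_common(1)[0][0]


# Toy data (made up): two well-separated groups labeled "a" and "b".
train = [
    ((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.1), "a"),
    ((5.0, 5.0), "b"), ((5.2, 4.9), "b"), ((4.8, 5.1), "b"),
]
label_near_a = knn_classify(train, (1.1, 1.0), k=3)
label_near_b = knn_classify(train, (5.1, 5.0), k=3)
print(label_near_a, label_near_b)
```

The distance function is passed in as a parameter so that a different d can be swapped in without touching the voting logic.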

Statistical Data Mining ORIE 474
Fall 2007 Tatiyana Apanasovich 10/24/07 Classification Modeling
10.1 Predictive Modeling
Aims to predict the unknown value of a variable of interest based on the known values of other variables
Learn a mapping
input Xnxp s

Statistical Data Mining ORIE 474
Fall 2007 Tatiyana Apanasovich 10/17/07 Logistic Regression
Why use logistic regression?
There are many important research topics for which the dependent variable is "limited." Binary logistic regression is a type of

Statistical Data Mining ORIE 474
Fall 2007 Tatiyana Apanasovich 10/15/07 Generalized Linear Models & Logistic Regression
11.3 Generalized Linear Models (GLMs)
Recall: Linear model
(i) The Y(i) are independent random variables, with distribution $N(\mu(i), \sigma^2)$,

Statistical Data Mining ORIE 474
Fall 2007 Tatiyana Apanasovich 10/12/07 DM Algorithms Ex: Classification and Regression Trees (CART)
5.1 Introduction
A DM algorithm is a well-defined procedure that takes data as input and produces output in the form of m

Statistical Data Mining ORIE 474
Fall 2007 Tatiyana Apanasovich 10/10/07
Assumptions of Multiple Linear Regression Model
1. Linearity: $E(Y \mid X_1, \dots, X_K) = \beta_0 + \beta_1 X_1 + \dots + \beta_K X_K$
2. Constant variance: The standard deviation of Y for the subpopulation of

Statistical Data Mining ORIE 474
Fall 2007 Tatiyana Apanasovich 10/05/07 Regression (III)
Collect data
Preliminary checks on data quality
Diagnostics for relationships and strong interactions
Remedial measures
Are remedial measures needed?
Data collection

Statistical Data Mining ORIE 474
Fall 2007 Tatiyana Apanasovich 10/03/07 Regression (II)
Hypothesis Tests in Multiple Linear Regression
Tests on Individual Regression Coefficients and Subsets of Coefficients
The hypotheses for testing the significance of

Statistical Data Mining ORIE 474
Fall 2007 Tatiyana Apanasovich 10/01/07 Regression (I)
Multiple Linear Regression Models
Introduction
Many applications of regression analysis involve situations in which there is more than one regressor variable. A regr

Statistical Data Mining ORIE 474
Fall 2007 Tatiyana Apanasovich 11/16/07 Link Analysis
Introduction
Airline Route Maps are useful
Hyperlinks were revolutionary
Apple's HyperCard (Bill Atkinson)
Claim that there are no more than 6 degrees of separation b
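The idea of degrees of separation can be made concrete as the length of the shortest chain of links between two nodes. The sketch below computes it with a breadth-first search; the function name, the adjacency-dictionary representation, and the toy network are hypothetical and not taken from the lecture.

```python
from collections import deque


def degrees_of_separation(links, start, goal):
    """Return the fewest links joining start to goal, or None if unreachable.

    links maps each node to an iterable of the nodes it is linked to.
    """
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        for neighbor in links.get(node, ()):
            if neighbor == goal:
                return depth + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, depth + 1))
    return None


# Toy undirected network (made up): ann - bob - cat - dan, with eve isolated.
links = {
    "ann": ["bob"],
    "bob": ["ann", "cat"],
    "cat": ["bob", "dan"],
    "dan": ["cat"],
    "eve": [],
}
ann_to_dan = degrees_of_separation(links, "ann", "dan")
ann_to_eve = degrees_of_separation(links, "ann", "eve")
print(ann_to_dan, ann_to_eve)
```

Breadth-first search visits nodes in order of increasing distance from the start, so the first time the goal is reached the chain found is a shortest one.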

Statistical Data Mining ORIE 474
Fall 2007 Tatiyana V. Apanasovich 11/19/07 Bayesian NN
Why the Excitement? What are they?
Bayesian nets are a network-based framework for representing and analyzing models involving uncertainty
Where did they come from? Cr

Principal Components Analysis
For example, suppose you are interested in examining the relationship among measures of food consumption from different sources. The sample data set Protein records the amount of protein consumed from nine food groups for ea

PCA using Insight node
1. Start SAS. Download and import the dataset baseball.xls from the course website. Start Enterprise Miner and create a new project.
2. Drag an Input Data Source node onto the diagram, open it, and input the data set. Drag an Insi

Exploring Data (cont.)
3.1. Exploration with MultiPlot node. Many tools available in SAS Enterprise Miner enable you to explore your data further. One such tool is the MultiPlot node. The MultiPlot node is a visualization tool that enables you to explor

Fall 2006 ORIE474: Section 7 notes Nikolai Blizniouk
The goal today is to discuss how to do principal components (PC) regression, how to create interactions in SAS EM and SAS Analyst, and how to use SAS Analyst to extract diagnostic information not provid

Fall 2006 ORIE474: Section 6 notes Nikolai Blizniouk
The goal of these notes is to provide some guidance for the use of the Regression node in SAS. The setup assumes that you have drawn a diagram similar to that from Section 5.
Doing regression with categoric

Fall 2006 ORIE474: Section 4 Notes Nikolai Blizniouk
In this recitation we shall use the DONATIONS data set to investigate the effect of the income level, ownership of a house and level of average amount of donations to date on the amount of donation in the

Clustering Using SAS EM October 6, 2003 The Data and Task
The data in prospect.xls is demographic information from a company's database on 4701 of their customers. (Names and addresses have been removed.) Preferences for different products may depend on thes

Regression by Cluster Using SAS EM The Data and Task
The data file orthopedic.xls for this recitation can be found on the course website. This file contains information compiled by a company that sells orthopedic devices to hospitals. The observations are 470

Recitation Supplement for September 29, 2003 Fitting and Comparing Candidate Models
We will use the same data set, myraw.xls, as was used in the first section. (See recitation supplement for description). We want to build a model that predicts how a donor r

Creating a SAS EM Decision Tree for Classification Data
We will use the data set hmeq.xls that contains baseline and loan performance information for 5,960 recent home equity loans. The target (BAD) is a binary variable that indicates if an applicant eventu