Chapter 8 Naïve Bayes
Naïve Bayes: The Basic Idea
For a given new record to be classified, find other records like it (i.e., same values for
the predictors)
What is the prevalent class among those records?
Assign that class to your new record
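A minimal sketch of this "exact match" lookup, using invented toy data (records match when all predictor values are identical):

```python
from collections import Counter

def exact_bayes_classify(new_record, records, labels):
    """Classify by majority vote among training records whose
    predictor values exactly match the new record."""
    matches = [label for rec, label in zip(records, labels) if rec == new_record]
    if not matches:
        return None  # no identical records found
    return Counter(matches).most_common(1)[0][0]

# toy data: predictors are (income_bracket, homeowner)
X = [("high", "yes"), ("high", "yes"), ("low", "no"), ("high", "no")]
y = ["buyer", "buyer", "non-buyer", "non-buyer"]
print(exact_bayes_classify(("high", "yes"), X, y))  # → buyer
```

The `None` case illustrates the practical weakness of the exact approach: with many predictors, most new records have no exact matches, which is what motivates the naïve (conditional independence) simplification.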
Usage
Requires categorical predictors; numeric variables must be binned

Logistic Regression
Extends the idea of linear regression to situations where the outcome variable is
categorical
Widely used, particularly where a structured model is useful to explain (=profiling)
or to predict
We focus on binary classification
i.e., Y = 0 or Y = 1
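As an illustration, a binary logistic regression could be fit with scikit-learn (assuming it is available; the single-predictor data below is made up):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# toy binary outcome: 1 = accepted offer, 0 = did not (illustrative data)
X = np.array([[20], [25], [30], [35], [40], [45], [50], [55]])  # e.g., income
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression().fit(X, y)

# predicted probabilities P(Y=1 | X), one per training record
probs = model.predict_proba(X)[:, 1]

# class prediction (probability cutoff 0.5 by default) for a new record
print(model.predict([[52]]))  # → [1]
```

The model produces a probability for each record; the 0/1 classification comes from applying a cutoff to that probability, which is why logistic regression serves both profiling (interpreting coefficients) and prediction.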

K-Nearest-Neighbor
Basic Idea
For a given record to be classified, identify nearby records
Near means records with similar predictor values X1, X2, …, Xp
Classify the record as whatever the predominant class is among the nearby records (the
neighbors)
How to measure "near"?
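A common choice is Euclidean distance on the predictors; the procedure can be sketched as follows (two-class toy data invented, k = 3):

```python
import math
from collections import Counter

def knn_classify(new_x, X, y, k=3):
    """Classify new_x as the predominant class among its k nearest
    training records, using Euclidean distance on the predictors."""
    order = sorted(range(len(X)), key=lambda i: math.dist(new_x, X[i]))
    votes = [y[i] for i in order[:k]]
    return Counter(votes).most_common(1)[0][0]

# toy data: two well-separated classes in two predictor dimensions
X = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
y = ["A", "A", "A", "B", "B", "B"]
print(knn_classify((2, 2), X, y))  # → A
```

Note that distance-based neighbors make predictor scaling matter: in practice the X's are usually normalized first so no single variable dominates the distance.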

Negative slope to reference curve
Oversampling and Asymmetric Costs
Rare Cases
Responder to mailing
Someone who commits fraud
Debt defaulter
Often we oversample rare cases to give the model more information to work
with
Typically use 50% 1's and 50% 0's for the training data
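One simple way to reach that 50/50 balance, sketched here, is to resample the rare class with replacement until it matches the common class (the helper name and toy data are illustrative):

```python
import random

def oversample_to_balance(records, labels, rare_label, seed=0):
    """Duplicate rare-class records (sampling with replacement)
    until the two classes are equally represented."""
    rng = random.Random(seed)
    rare = [(x, y) for x, y in zip(records, labels) if y == rare_label]
    common = [(x, y) for x, y in zip(records, labels) if y != rare_label]
    extra = [rng.choice(rare) for _ in range(len(common) - len(rare))]
    balanced = common + rare + extra
    rng.shuffle(balanced)
    return [x for x, _ in balanced], [y for _, y in balanced]

X = list(range(10))
y = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # only 10% rare "1" cases
Xb, yb = oversample_to_balance(X, y, rare_label=1)
print(yb.count(1), yb.count(0))  # → 9 9
```

Only the training data is rebalanced this way; evaluation should still use data with the original class proportions (or scores adjusted back), otherwise error estimates are distorted.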

Advantages of trees
Easy to use, understand
Produce rules that are easy to interpret & implement
Variable selection & reduction is automatic
Do not require the assumptions of statistical models
Can work without extensive handling of missing data
Disadvantages

Pruning
Software lets the tree grow to full extent, then prunes it back
The idea is to find the point at which the validation error begins to rise
Generate successively smaller trees by pruning leaves
At each pruning stage, multiple trees are possible
Use cost complexity to choose the best tree at each stage

What are Association Rules?
Study of what goes with what
Customers who bought X also bought Y
What symptoms go with what diagnosis
Transaction-based or event-based
Also called market basket analysis and affinity analysis
Originated with the study of customer transaction databases
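The "what goes with what" idea is quantified by support and confidence, which can be computed directly from a toy transaction list (items invented):

```python
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]

def support(itemset):
    """Fraction of transactions containing every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Of transactions containing the antecedent, the fraction
    that also contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)

# rule "customers who bought diapers also bought beer"
print(support({"diapers", "beer"}))       # → 0.4
print(confidence({"diapers"}, {"beer"}))  # 2/3 ≈ 0.67
```

Algorithms like Apriori are essentially efficient ways of enumerating itemsets whose support exceeds a threshold before these rule statistics are computed.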

Function constants
Owners: -73.16
Nonowners: -51.42
Adjusted for prior probabilities:
Owners: -73.16 + log(0.15) = -75.06
Nonowners: -51.42 + log(0.85) = -51.58
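The adjustment adds the natural log of each class's prior probability to the corresponding classification function constant; the arithmetic can be checked directly:

```python
import math

# classification function constants from the discriminant analysis output
owners_const, nonowners_const = -73.16, -51.42
p_owner, p_nonowner = 0.15, 0.85  # prior probabilities of the two classes

# adjusted constant = constant + ln(prior)
adj_owners = owners_const + math.log(p_owner)
adj_nonowners = nonowners_const + math.log(p_nonowner)
print(round(adj_owners, 2), round(adj_nonowners, 2))  # → -75.06 -51.58
```

Because ln(0.15) is much more negative than ln(0.85), the rarer class (owners) is penalized more, so a record needs stronger evidence to be classified into it.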
Unequal Misclassification Costs
For the two-class (buyer/non-buyer) case, we can account for

Data summarization
Normalization (= standardization) is usually performed in PCA; otherwise
measurement units affect results
Note: In XLMiner, use correlation matrix option to use normalized variables
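The effect of normalization can be sketched with NumPy: running PCA on standardized (z-score) variables is equivalent to using the correlation matrix (the synthetic data here has deliberately mismatched measurement units):

```python
import numpy as np

rng = np.random.default_rng(0)
# two variables on very different scales (e.g., dollars vs. a ratio)
X = np.column_stack([rng.normal(0, 1000, 200), rng.normal(0, 1, 200)])

def first_pc(data):
    """Return the first principal component (eigenvector of the
    covariance matrix with the largest eigenvalue)."""
    cov = np.cov(data, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)
    return vecs[:, np.argmax(vals)]

# without normalization, the large-unit variable dominates PC1
print(first_pc(X))

# z-scoring first = PCA on the correlation matrix; weights become comparable
Z = (X - X.mean(axis=0)) / X.std(axis=0)
print(first_pc(Z))
```

On the raw data the first component is almost entirely the dollar-scale variable; after standardization both variables contribute comparably, which is why normalization is the usual default in PCA.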
Summary
Data summarization is an important tool for data exploration
Dat

Reducing Categories
A single categorical variable with m categories is typically transformed into m-1
dummy variables
Each dummy variable takes the values 0 or 1
0 = no for the category
1 = yes
Problem: Can end up with too many variables
Solution: Reduce the number of categories by combining similar ones
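With pandas, the m-1 dummy coding above can be produced with `get_dummies(..., drop_first=True)`, which drops one category to serve as the baseline (the region column is invented):

```python
import pandas as pd

df = pd.DataFrame({"region": ["east", "west", "south", "east", "west"]})

# m = 3 categories -> m - 1 = 2 dummy variables; the dropped category
# ("east", the first alphabetically) becomes the implicit baseline
dummies = pd.get_dummies(df["region"], prefix="region", drop_first=True)
print(dummies)  # columns: region_south, region_west
```

A record with 0 in every dummy column is therefore an "east" record; keeping all m dummies would make them collectively redundant (they always sum to 1), which causes problems for regression-type models.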