# Fisher's exact test
?fisher.test()
# the lady tasting tea example
# construct the matrix of results
t <- matrix(c(3, 1, 1, 3), 2, 2)
# use Fisher's exact test, with a one-sided alternative hypothesis (since the lady
# claimed she is good at telling whether
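
A minimal sketch of how the one-sided test might be run on the table above. The row and column labels and the choice of `alternative = "greater"` are assumptions made for illustration, since the original comment is cut off.

```r
# 2x2 table from the script above; the dimnames are assumed, not from the source
t <- matrix(c(3, 1, 1, 3), 2, 2,
            dimnames = list(guess = c("milk first", "tea first"),
                            truth = c("milk first", "tea first")))

# one-sided Fisher's exact test: is the lady doing better than guessing?
res <- fisher.test(t, alternative = "greater")
res$p.value

# the same p-value by hand: P(X >= 3) under the hypergeometric distribution
# with 4 milk-first cups, 4 tea-first cups, and 4 cups labelled milk-first
p_manual <- sum(dhyper(3:4, m = 4, n = 4, k = 4))
p_manual
```

Both numbers should agree, which shows that the test is just a tail probability of the hypergeometric distribution with the table margins held fixed.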

# Movie review - tokenization
# this is a library (of R functions) we will use later
# you may need to install this package first if you have not done so
#install.packages("tm")
library(tm)
# set working directory here
# read in file from .csv
sentiment <- re

STAT W2026
Statistical Applications and Case Studies
Contact Information
Jingjing Zou
jz2335@columbia.edu
SSW 1021
Office Hours
Tuesday and Thursday
After class
Or by appointment
Probably no TA as the form of

W2026 Lecture 8
Choice of λ
- The performance of regularization depends on the choice of the penalty parameter λ
- Recall the form of the penalized
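
One common way to choose the penalty parameter is to fit the model over a grid of candidate values and keep the one with the best validation score. The sketch below is only an illustration under assumptions: it uses ridge regression from the `MASS` package, the built-in `mtcars` data, an arbitrary grid, and generalized cross-validation (GCV) as the score. None of these choices come from the lecture.

```r
library(MASS)  # recommended package shipped with standard R installations

# candidate penalty values (an arbitrary illustrative grid)
lambdas <- seq(0, 20, by = 0.5)

# ridge regression on the built-in mtcars data, one fit per candidate value
fit <- lm.ridge(mpg ~ wt + hp + disp + qsec, data = mtcars, lambda = lambdas)

# generalized cross-validation score for each candidate; smaller is better
best_lambda <- lambdas[which.min(fit$GCV)]
best_lambda
```

Plotting `fit$GCV` against `lambdas` shows how the score changes as the penalty grows.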

a22893b74a4845f1b24b9591ff4a655674ff3eb4
Text
plot : two teen couples go to a church party , drink and then drive . they get into an accident . one of the guys dies , but h
the happy bastard's quick movie review damn that y2k bug . it's got a head start i

# Movie review - tokenization
# this is a library (of R functions) we will use later
library(tm)
# set working directory here
# read in file from .csv
sentiment <- read.csv("movie-reviews-sentiment.csv", header = TRUE, as.is = TRUE)
# exploratory results
dim(se

W2026 Lecture 3
Cont. from Lecture 2
- You are a statistical consultant. One day a client walks in and asks you the following question:

W2026 Lecture 11
Nonparametric Testing Methods
- In the previous lecture, we discussed testing the null hypothesis without assuming a distribution for the data

W2026 Lecture 6
Actuarial vs. Prediction
- In actuarial studies, the main focus is to explain the outcome with predictors
- Including m

W2026 Lecture 4
Ordinary Least Squares vs. Maximum Likelihood
- The ordinary least squares method minimizes the average (or total) squared distance between the actual and fitted values
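
To make the least squares criterion concrete, here is a small sketch that solves for the coefficients directly and compares them with `lm()`. The data set (`mtcars`) and the choice of predictors are assumptions for illustration only.

```r
# built-in data; mpg regressed on weight and horsepower (illustrative choice)
X <- cbind(1, mtcars$wt, mtcars$hp)   # design matrix with an intercept column
y <- mtcars$mpg

# OLS estimate: solve the normal equations (X'X) b = X'y
beta_hat <- solve(crossprod(X), crossprod(X, y))

# total squared distance between the actual and fitted values
rss <- sum((y - X %*% beta_hat)^2)

# lm() minimizes the same criterion, so the coefficients should match
fit <- lm(mpg ~ wt + hp, data = mtcars)
cbind(by_hand = as.numeric(beta_hat), lm = unname(coef(fit)))
```

Any other coefficient vector gives a larger value of `rss`, which is what "minimizes the total squared distance" means.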

W2026 Lecture 15
Support Vector Machines and Applications
When Perfect Linear Separation is Impossible (cont.)
- There are two different methods
- One is

W2026 Lecture 10
Fisher's Exact Test
- Lady tasting tea
- A lady claimed she could tell whether milk or tea was added to a cup first

W2026 Lecture 9
Performance of Predictions in Movie Review Data
- Recall that with frequencies of words as predictors
- And with LASSO to c

W2026 Lecture 7
Variable Selection with AIC
- AIC stands for Akaike information criterion
- For a given model, AIC is $2p - 2 \cdot \text{log-likelihood}$
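
As a quick check of the standard definition, AIC = 2p - 2 × log-likelihood, where p is the number of estimated parameters, the sketch below computes it by hand and compares it with R's `AIC()`. The model and data (`mtcars`) are assumptions chosen for illustration.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

ll <- logLik(fit)
# number of estimated parameters; for lm() this also counts the error variance
p <- attr(ll, "df")

aic_manual <- 2 * p - 2 * as.numeric(ll)
aic_manual
AIC(fit)   # should agree with the manual value
```

When comparing candidate models fitted to the same data, the one with the smaller AIC is preferred.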

W2026 Lecture 12
Rank Based Tests
- In previous lectures we learned about the Wilcoxon signed rank test for paired samples and the Wilcoxon rank sum test for two independent samples
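
As a reminder of how the signed rank statistic is built, here is a sketch on toy paired differences. The numbers are made up for illustration and have no ties or zeros, so the exact test applies.

```r
# toy paired differences (second measurement minus first), invented for illustration
d <- c(1.2, -0.4, 2.1, 0.7, -0.3, 1.5)

# rank the absolute differences, then add up the ranks of the positive ones
r <- rank(abs(d))
V_manual <- sum(r[d > 0])
V_manual

# wilcox.test() on the differences reports the same statistic as V
res <- wilcox.test(d)
res$statistic
res$p.value
```

Calling `wilcox.test(x, y, paired = TRUE)` on the two original samples is equivalent to calling it on their differences.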

W2026 Lecture 13
Movie Review Data Revisited
- In the movie review problem, we were trying to predict the sentiment (positive or negative)

# nonparametric tests
?wilcox.test
require(graphics)
# One-sample test.
# Hollander & Wolfe (1973), 29f.
# Hamilton depression scale factor measurements in 9 patients with
# mixed anxiety and depression, taken at the first (x) and second
# (y) visit after

W2026 Lecture 16
Tree Methods
Classification and Regression Trees (CART)
- Before, we were fitting the whole data set with a single model

# example from http://www.ats.ucla.edu/stat/r/dae/logit.htm
mydata <- read.csv("http://www.ats.ucla.edu/stat/data/binary.csv")
head(mydata)
# The variable rank takes on the values 1 through 4. Institutions with a rank of
# 1 have the highest prestige, while t

W2026 Lecture 14
Support Vector Machines
Linear Classifiers
- In a classification problem, suppose one needs to classify the outcome into 2 categories with p predictors. A linear classifier is in the form of
  $\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p = 0$
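
A linear classifier assigns a point to one category or the other depending on which side of the hyperplane it falls. The sketch below illustrates this with p = 2 predictors. The coefficient values, the ±1 labels, and the function name `classify` are hypothetical, chosen only for illustration.

```r
# hypothetical coefficients for p = 2 predictors (not taken from the lecture)
beta0 <- -1
beta  <- c(2, -1)

# classify each row of x by the sign of beta0 + beta1 * x1 + beta2 * x2
classify <- function(x, beta0, beta) {
  score <- beta0 + as.vector(x %*% beta)
  ifelse(score > 0, 1, -1)
}

# three new points, one per row
x_new <- rbind(c(1, 0), c(0, 2), c(2, 1))
classify(x_new, beta0, beta)
```

Points whose score is exactly 0 lie on the boundary itself. This sketch assigns them to the -1 category, which is an arbitrary convention.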

W2026 Lecture 5
Q-Q Plot
- Q-Q stands for quantile-quantile
- It is for comparing the sample distribution to a given model
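
The points of a normal Q-Q plot can be built by hand, which makes clear what is being compared. This sketch uses a simulated sample, an assumption for illustration, and checks the result against `qqnorm()`.

```r
set.seed(1)
x <- rnorm(50, mean = 10, sd = 2)   # simulated sample, for illustration only
n <- length(x)

theo <- qnorm(ppoints(n))   # theoretical quantiles of the standard normal
samp <- sort(x)             # sample quantiles (the ordered data)

# qqnorm() computes the same pairs; plot.it = FALSE returns them without drawing
qq <- qqnorm(x, plot.it = FALSE)
```

Calling `plot(theo, samp)` followed by `qqline(x)` draws the plot. Points close to the line suggest the sample is consistent with a normal model.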

W2026 Lecture 17
Unsupervised Learning Methods
Supervised vs. Unsupervised Learning
- Methods we discussed earlier, such as regression, SVM, and tree methods, were for supervised problems

W2026 Lecture 2
Piazza is ready to use
The Survey
Overall:
If you have not finished both W2024 and
W2025, then it is highly recommended that you
take those before this course
W2024 is actually open to re

W2026 Lecture 18
Clustering Methods
PCA and Clustering
- One can also use principal components to cluster subjects
- For example, if only one component is used, o
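
One possible way to cluster subjects on a single principal component is sketched below. The `iris` data, the use of k-means, and the choice of two clusters are all assumptions for illustration. The lecture text is cut off, so this may not be the method it goes on to describe.

```r
# principal components of the four numeric iris measurements, standardized first
pc <- prcomp(iris[, 1:4], scale. = TRUE)

# scores of each flower on the first principal component only
pc1 <- pc$x[, 1]

# k-means on the one-dimensional scores (k = 2 is an arbitrary choice)
set.seed(1)
km <- kmeans(pc1, centers = 2, nstart = 10)

# compare the clusters found with the known species labels
table(km$cluster, iris$Species)
```

With one component, the clustering amounts to cutting the first principal axis at a threshold between the two cluster centers.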