Midterm review
Details
Here, 11-12:15, Tuesday, May 16
You can use your notes, your book, R help.
You can NOT use outside resources (Stack Overflow, someone
else's notes)
You WILL need a computer that runs RStudio and all of the
packages we use in class
Ill
The first principal component is a vector that points along
the axis that has the most variation. Scores are projected
onto this axis.
principal components
and review
lizards.csv, houses.csv and morehouses.csv
The second PC points along the axis that is (
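The projection idea above can be sketched with base R's prcomp(); the simulated variables below are illustrative, not from lizards.csv or houses.csv:

```r
# Sketch: PCA on two correlated variables (simulated, for illustration).
set.seed(1)
x1 <- rnorm(100)
x2 <- 2 * x1 + rnorm(100, sd = 0.5)  # variation mostly along one axis
pc <- prcomp(cbind(x1, x2), scale. = TRUE)
pc$rotation[, 1]  # loadings of PC1: the direction of greatest variation
head(pc$x[, 1])   # scores: observations projected onto that axis
summary(pc)       # PC1 should account for most of the variance
```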
# 04/12/2017 Discussion Code
# By: Jeffrey Chao
require(ggplot2)
# Here, we will use the American Time Use Survey data mentioned in the first
# lecture as an example. (This is the "atus copy.csv" file.)
# First, we will plot time spent socializing on the

---
title: "04/19/2017 Stats 101C Discussion"
author: "Jeffrey Chao"
date: "April 18, 2017"
output: html_document
---

First, we load in the data. Make sure the directory has been set to wherever
your CSV file is. We will be using the bank note data mentioned
# 4/26/17 Stat 101C Discussion Code
# By: Jeffrey Chao
library(ISLR)
# The validation set approach
# First, fit a logistic regression model.
# We'll be using the college data set in the ISLR package.
college = College
set.seed(123)
train = sample(x = 1:nrow(college), size = nrow(college) / 2)  # College has 777 rows
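A sketch of the full validation-set workflow on the College data; the choice of predictors below is mine, not from the discussion:

```r
library(ISLR)
college <- College
set.seed(123)
# Hold out roughly half the 777 rows as a validation set.
train <- sample(x = 1:nrow(college), size = nrow(college) / 2)
# Illustrative predictors; any reasonable subset works for the sketch.
fit <- glm(Private ~ Outstate + F.Undergrad, data = college,
           family = binomial, subset = train)
probs <- predict(fit, newdata = college[-train, ], type = "response")
pred <- ifelse(probs > 0.5, "Yes", "No")
mean(pred != college$Private[-train])  # validation-set error rate
```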
# 05/02/17 Stat 101C Discussion Code
# By: Jeffrey Chao
library(ISLR)
library(boot)
oj = OJ
# Doing resampling with the boot package.
# To use the boot function with something like the median, you are going to need
# to write a function like below, and then pass it as the statistic argument.
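A minimal sketch of such a statistic function and a call to boot(); median_fn and the simulated data are my own choices:

```r
library(boot)
# boot() resamples rows and calls statistic(data, indices) on each resample.
median_fn <- function(data, indices) median(data[indices])
set.seed(1)
x <- rexp(200)                       # illustrative skewed sample
b <- boot(data = x, statistic = median_fn, R = 1000)
b                                    # bootstrap bias and SE of the median
boot.ci(b, type = "perc")            # percentile confidence interval
```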
The dilemma
Lecture 4.2
Cross Validation
require(ISLR)
data(Auto)
and maybe
pgatour2006 from CCLE
You've fit a model, and you've found the MSE for
your data.
But you know that when the testing data comes, the
MSE will be bigger.
How, then, can you estimate the testing MSE?
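One standard answer is cross-validation; a sketch with cv.glm() on the Auto data mentioned above (the mpg ~ horsepower model is illustrative):

```r
library(ISLR)
library(boot)
data(Auto)
set.seed(1)
fit <- glm(mpg ~ horsepower, data = Auto)
# 10-fold CV: the average held-out MSE approximates the testing MSE.
cv_err <- cv.glm(Auto, fit, K = 10)$delta[1]
cv_err  # typically larger than the training MSE
```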
upload corealsample.csv into RStudio
Lecture 3.1
Graphics and Linear Discriminant Analysis
Charles Joseph Minard
(1781-1870)
The greatest statistical graphic ever?
William Playfair, 1759-1823
1869
Francis Galton
1822-1911
Florence Nightingale
1820-1910
W.
Outline
Feature Selection
Ridge Regression
Lasso
Principal Components
Ridge Regression
Uses a linear model
$$y = \beta_0 + \sum_{j=1}^{p} \beta_j X_j$$
with parameters chosen to minimize (for a fixed value of lambda):
$$\sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$$
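A minimal sketch of fitting this penalized model with the glmnet package (alpha = 0 selects the ridge penalty; the simulated data is illustrative, not from the course files):

```r
library(glmnet)
set.seed(1)
n <- 100; p <- 10
X <- matrix(rnorm(n * p), n, p)          # simulated predictors
y <- X %*% rnorm(p) + rnorm(n)           # simulated response
# alpha = 0 gives the ridge penalty; cv.glmnet chooses lambda by CV.
cv_fit <- cv.glmnet(X, y, alpha = 0)
coef(cv_fit, s = "lambda.min")           # shrunken coefficient estimates
```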
Name
Aaron Baddeley
Adam Scott
Alex Aragon
Alex Cejka
Arjun Atwal
Arron Oberholser
Bart Bryant
Ben Crane
Ben Curtis
Bernhard Langer
Bill Haas
Billy Andrade
Billy Mayfair
Bo Van Pelt
Bob Estes
Bob May
Bob Tway
Brad
Stats 101C HW 2
Matthew Yun (204667525)
4/18/17
Q1) Make sure you have ggplot2 installed in RStudio. Ask me or the TA if you need help with this. You will
then need to enter require(ggplot2). You only need to do this once (but will have to do it again if
Determining Errors in the Testing Data
Two approaches
1. Use data to estimate
lecture 3.2
Train
estimate
data: banknote (CCLE under Site Info)
data: library(fivethirtyeight); data(bechdel); help(bechdel)
Your Data
Test Data
apply model to your set-aside data
Last Time
Lecture 4.1
You used the bechdel data to see if you could
predict whether a film would pass the Bechdel test.
You were asked to compare LDA, QDA, KNN, and
logistic regression.
The best we got was about a 45% success rate
Alas, about 45% of the f
Today
lecture 2.1:
classification
Upload banknotes.csv into RStudio
(Site Info/Data Not In Textbook)
Reminder of Bias/Variance
Introduction to Classification
K-Nearest Neighbor method
Intro/Overview of Logistic Regression
Theorem
$E\big[(y_0 - \hat{f}(x_0))^2\big]$
what it mea
kNN classification algorithm
Lecture 2.2
Set k equal to an integer.
To classify an observation x_0, find the k nearest
neighbors to x_0 (using Euclidean distance).
The neighbors vote on the classification; majority
vote wins.
The value of k is tuned so as
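The voting steps above can be sketched with knn() from the class package; the two-class simulated data is illustrative:

```r
library(class)
set.seed(1)
# Two simulated classes in two dimensions.
X <- rbind(matrix(rnorm(100), 50, 2),
           matrix(rnorm(100, mean = 2), 50, 2))
y <- factor(rep(c("A", "B"), each = 50))
x0 <- matrix(c(2, 2), nrow = 1)  # the observation to classify
# The 5 nearest neighbors (Euclidean distance) vote; majority wins.
knn(train = X, test = x0, cl = y, k = 5)
```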
Feature selection
load pgatour2006
outline
Study selecting variables with linear models first
Study linear models fit with Least Squares
Study approaches other than LS
Study nonlinear models
download and upload the data (under CCLE/site
info/ data not in
HW 5
Lindsey London (303769968)
February 7, 2017
Stat 102B TA Session 1
Problem 1
#read in file
library(readr)
class <- read_csv("~/stats 102b/distancedatahwk.csv")
# Parsed with column specification:
# cols(
#   code = col_character(),
#   study = col_doubl
CART
Classification and
Regression Trees
banknote data (Site Info)
Large Scale Overview
Using trees to do classification
Easy modification to using trees to do regression
Bootstrapping trees to improve overfitting problem
Random Forests to obtain more pre
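A minimal classification-tree sketch with the rpart package; since the banknote file lives on CCLE, rpart's bundled kyphosis data stands in here:

```r
library(rpart)
# Grow a classification tree (method = "class").
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             method = "class")
printcp(fit)          # complexity table, used to decide where to prune
plot(fit); text(fit)  # draw the fitted tree
```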
Lecture 9.2
unsupervised learning
context
There is no response variable, and we have
no gold standard to help us determine if
classification is correct or not.
Predictors are numerical (but there are
methods that address categorical
predictors. See corr
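A small unsupervised sketch with base R's kmeans(); the simulated clusters are illustrative:

```r
set.seed(1)
# No response variable: just numeric predictors with hidden structure.
X <- rbind(matrix(rnorm(100, mean = 0), 50, 2),
           matrix(rnorm(100, mean = 3), 50, 2))
km <- kmeans(X, centers = 2, nstart = 20)
table(km$cluster)  # cluster sizes; there are no "correct" labels to check
km$centers         # estimated cluster centers
```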
neural nets
The simplest neural networks look something like this:
$$Y_i = f\Big(\sum_j x_j w_j\Big)$$
where
$$f(x) = \frac{1}{1 + \exp(-x)}$$
But that's not a helpful way to picture it. This is better:
Input Layer
Hidden Layer
Output Layer
Neurons fire signals.
The signal travels alo
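The activation formula above can be checked numerically; the inputs and weights in this sketch are made up:

```r
# The logistic activation from the slide, written as an R function.
sigmoid <- function(x) 1 / (1 + exp(-x))
# One neuron: a weighted sum of inputs passed through the activation.
xj <- c(0.5, -1.2, 0.3)   # made-up inputs
wj <- c(0.8, 0.1, -0.5)   # made-up weights
sigmoid(sum(xj * wj))     # about 0.53
```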
Boosting
The story so far
CART, classification and regression trees, increase flexibility over standard regression
approaches
Random forests consist of many randomly generated trees. The randomness is twofold.
First, a bootstrap sample of original data a
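That twofold randomness can be seen with the randomForest package; built-in iris stands in as an illustrative dataset:

```r
library(randomForest)
set.seed(1)
# Each tree: a bootstrap sample of rows, plus a random subset of
# predictors considered at every split.
rf <- randomForest(Species ~ ., data = iris, ntree = 500)
rf$confusion    # out-of-bag confusion matrix
importance(rf)  # mean decrease in Gini impurity per predictor
```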
#lecture pro 3.2#
#logistic regression
# Run the logistic regression
banknote <- read.delim("C:/Users/Wendy Yan/Desktop/UCLA/Quarters/Spring 2017/Stats 101C/week#2/banknote.txt")
logit.1 = glm(Y ~ Left, data = banknote, family = "binomial")
summary(logit.1)
Practice Problems
Choose one or two to work on. Problems 1 and 2 use the houses.csv dataset;
problem 3 uses morehouses.csv. Hint: Before beginning, consider a transformation
of the response variable.
1. Use Backwards Stepwise regression to determine
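A backward stepwise sketch with base R's step(); the houses.csv columns aren't listed here, so built-in mtcars stands in, and the log() echoes the transformation hint:

```r
# Backward stepwise elimination on a stand-in dataset (mtcars).
full <- lm(log(mpg) ~ ., data = mtcars)
back <- step(full, direction = "backward", trace = 0)
summary(back)  # the predictors that survive backward elimination
```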