Lecture 3: Linear Regression, Exploratory Data Analysis,
and the Bootstrap
STAT GR5206
Statistical Computing & Introduction to Data Science
Cynthia Rush
Columbia University
September 23, 2016
Cynthia Rush
Lecture 3: Regression and Graphics
September 23, 2016
1 / 104

Course Notes
• Next week labs are meeting.
• Homework 1 is due on Monday at 8pm. No late homeworks accepted.
• Homework 2 will be assigned on Monday.
• Remember to use Piazza to ask questions.

Last Time
• Filtering. Accessing elements of a structure based on some criteria: v[v>5], m[m[,1]!=0, ].
• Lists. Elements can all be different types. Access with l[[3]] or l$name. Create with list().
• NA and NULL values. NA is missing data; NULL means the object doesn't exist.
• Factors and Tables. Factors are how R represents categorical variables.
• Dataframes. Used for data that is organized with rows indicating cases and columns indicating variables.
• Importing and Exporting Data in R. Use read.csv() and read.table() depending on dataset type. The working directory.
• Control Statements. We studied iteration (for loops and while loops) and if, else statements.
• Vectorized Operations. To be used instead of iteration.
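As a quick refresher, the recap items above can be sketched in a few lines of R. All object names and values here (v, m, l, x, f) are made up for illustration and do not come from the course materials.

```r
# Filtering: keep the elements that satisfy a condition
v <- c(2, 7, 4, 9)
v_big <- v[v > 5]                # elements of v greater than 5

m <- matrix(c(1, 0, 3, 4, 5, 6), nrow = 3)
m_nonzero <- m[m[, 1] != 0, ]    # rows of m whose first column is nonzero

# Lists: elements may have different types
l <- list(name = "a", flag = TRUE, nums = 1:3)
third <- l[[3]]                  # access by position
nm <- l$name                     # access by name

# NA marks a missing value; NULL is the absence of an object
x <- c(1, NA, 3)
n_missing <- sum(is.na(x))       # number of missing entries in x
len_null <- length(NULL)         # NULL has length zero

# Factors encode categorical variables; table() counts the levels
f <- factor(c("yes", "no", "yes"))
counts <- table(f)

# Vectorized operation in place of a for loop
squares <- v^2
```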

Section I
Multiple Linear Regression

Multiple Linear Regression
Example
A large national grocery retailer tracks productivity and costs of its
facilities closely. Consider a data set obtained from a single distribution
center for a one-year period. Each data point for each variable represents
one week of activity. The variables included are the number of cases shipped
in thousands (X1), the indirect costs of labor as a percentage of total costs
(X2), a qualitative predictor called holiday that is coded 1 if the week has
a holiday and 0 otherwise (X3), and total labor hours (Y).

Multiple Linear Regression
Suppose, as statisticians, we are asked to build a model to predict total
labor hours in the future using this dataset.
What information would be useful to provide such a model?
• Is there a relationship between holidays and total labor hours? What about the number of cases shipped? Indirect costs?
• How strong are these relationships?
• Is the relationship linear?
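One way to start on these questions in R is sketched below. The data frame is a synthetic stand-in, not the retailer's data: the column names (cases, indirect, holiday, hours), the value ranges, and the coefficients used to simulate hours are all assumptions chosen only so the code runs.

```r
# Synthetic stand-in data: 52 weekly observations (NOT the retailer's data)
set.seed(1)
n <- 52
grocery <- data.frame(
  cases    = runif(n, 200, 500),          # X1: cases shipped (thousands), assumed range
  indirect = runif(n, 5, 10),             # X2: indirect labor cost (%), assumed range
  holiday  = rep(c(1, rep(0, 12)), 4)     # X3: 1 for a holiday week, 0 otherwise
)

# Hypothetical coefficients, used only to simulate a response Y
grocery$hours <- 4000 + 1 * grocery$cases - 10 * grocery$indirect +
  600 * grocery$holiday + rnorm(n, sd = 100)

# Is there a relationship, and how strong? Correlation of each variable with Y
cors <- cor(grocery)[, "hours"]

# Is the relationship linear? Scatterplots of every pair of variables
pairs(grocery)

# Fit a multiple linear regression of Y on X1, X2, X3
fit <- lm(hours ~ cases + indirect + holiday, data = grocery)
coefs <- coef(fit)
```

Calling summary(fit) would then print the estimated coefficients together with their standard errors, which speaks to the strength of each relationship.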

Multiple Linear Regression