Activity Solution: Introducing Logistic Regression
The relationship between correlation and linear regression is very much
akin to the relationship between odds ratio (discussed last time) and logistic
regression. For a binary response variable Y and a si
Review for Midterm Exam
Time: 50 minutes, in class.
One formula/cheat sheet (letter-size, two-sided, hand-writing
A simple calculator, UBC ID
Two versions of exam
Cover materials up to last Friday
No labs next week
Review for Midte
In statistical analysis, the mean and the standard deviation
(SD) are the two most important statistics. They measure the
center and the variation of data respectively. They provide the
basis for statistical inference such as confidenc
Linear Model Review
Regression models are among the most useful statistical methods.
The simplest regression models are linear models:
$y = \beta_0 + \beta_1 x + e$.
In linear regression models, the parameters may be estimated by
the least squares method. The regression coe
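As a minimal sketch of the least squares method for the simple linear model, the following Python computes the intercept and slope from the usual closed-form expressions. The function name `least_squares` and the data values are illustrative assumptions, not part of the course material.

```python
def least_squares(x, y):
    """Return (b0, b1) minimizing the sum of squared residuals of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Slope: cross-product sum divided by the sum of squares of x.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    # Intercept: the fitted line passes through (x_bar, y_bar).
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Hypothetical data, chosen only for illustration.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1 = least_squares(x, y)
print(f"intercept = {b0:.3f}, slope = {b1:.3f}")
```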
Main considerations in designing experiments:
control confounders (or lurking variables)
A lurking variable is a
Summary: Bootstrap Tests
Bootstrap: distribution-free, works for any samples.
The 95% bootstrap confidence interval for the population mean $\mu$ (or
median $m$) is $(u_{0.025}, u_{0.975})$, where $u_{0.025}$ and $u_{0.975}$ are the 2.5% and
97.5% percentiles estimated from bootstrap
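The percentile interval described above can be sketched as follows. This is an illustration under stated assumptions: the function name `bootstrap_percentile_ci`, the number of resamples, the seed, and the data are all hypothetical, and the percentile indices are a simple approximation rather than the course's exact rule.

```python
import random
import statistics


def bootstrap_percentile_ci(data, stat=statistics.mean, n_boot=2000,
                            alpha=0.05, seed=0):
    """Approximate percentile bootstrap interval for a statistic."""
    rng = random.Random(seed)
    n = len(data)
    # Resample with replacement and recompute the statistic each time.
    boot_stats = sorted(stat(rng.choices(data, k=n)) for _ in range(n_boot))
    # Take the empirical alpha/2 and 1 - alpha/2 percentiles.
    lo_idx = int((alpha / 2) * n_boot)
    hi_idx = int((1 - alpha / 2) * n_boot) - 1
    return boot_stats[lo_idx], boot_stats[hi_idx]


# Hypothetical sample, for illustration only.
sample = [4.1, 5.3, 3.8, 6.0, 5.5, 4.7, 5.1, 4.9, 6.2, 4.4]
ci_low, ci_high = bootstrap_percentile_ci(sample)
print(f"95% bootstrap CI for the mean: ({ci_low:.2f}, {ci_high:.2f})")
```

Passing `stat=statistics.median` gives the corresponding interval for the median.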
Example 1: Kidney stone treatments
This is a real-life example from a medical study comparing the
success rates of two treatments for kidney stones.
Treatment A Treatment B
number cured/total number
Sums of Squares in Regression
Both linear regression models and ANOVA models are
examples of linear models.
Like ANOVA models, for regression models we can also
decompose the total variation (sums of squares) into that from
the model and that from random
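The decomposition of the total sum of squares can be checked numerically. The sketch below fits a simple linear regression by least squares and verifies that the total sum of squares equals the model part plus the error part. The names `sst`, `ssm`, and `sse` and the data are assumptions made for illustration.

```python
def sums_of_squares(x, y):
    """Return (total, model, error) sums of squares for a simple linear fit."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    fitted = [b0 + b1 * xi for xi in x]
    sst = sum((yi - y_bar) ** 2 for yi in y)                 # total variation
    ssm = sum((fi - y_bar) ** 2 for fi in fitted)            # explained by the model
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))   # random error
    return sst, ssm, sse


# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
sst, ssm, sse = sums_of_squares(x, y)
r_squared = ssm / sst  # proportion of variation explained by the model
print(f"SST = {sst:.3f}, SSM = {ssm:.3f}, SSE = {sse:.3f}, R^2 = {r_squared:.4f}")
```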
Parameter Estimators in Regression
In the regression model $y = \beta_0 + \beta_1 x + e$, the parameters $\beta_0$ and $\beta_1$
can be estimated using the least squares method.
The parameter estimates $\hat{\beta}_0 = b_0$ and $\hat{\beta}_1 = b_1$ and their standard
errors $SE(\hat{\beta}_0)$ and $SE(\hat{\beta}_1)$ can be used for inference
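A minimal sketch of how the estimates and their standard errors might be computed, using the standard textbook formulas with the residual variance estimated on n - 2 degrees of freedom. The function name `slope_inference` and the data are hypothetical.

```python
import math


def slope_inference(x, y):
    """Return (b0, b1, se_b0, se_b1) for the simple linear regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    # Residual variance estimate with n - 2 degrees of freedom.
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s2 = sse / (n - 2)
    se_b1 = math.sqrt(s2 / sxx)
    se_b0 = math.sqrt(s2 * (1 / n + x_bar ** 2 / sxx))
    return b0, b1, se_b0, se_b1


# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1, se_b0, se_b1 = slope_inference(x, y)
t_slope = b1 / se_b1  # t statistic for H0: slope = 0, on n - 2 degrees of freedom
print(f"b1 = {b1:.3f}, SE(b1) = {se_b1:.4f}, t = {t_slope:.1f}")
```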
In regression analysis, it is important to check if the model
fits the data well or not (i.e., model diagnostics). This step
should not be skipped.
Two important tools for model diagnostics are residual plots and
normal QQ plots (or no
Multiple linear regression model
In practice, it is more common to have more than one predictor
in a linear regression model, which is called a multiple linear
Examples of multiple regression models
$\text{Risk} = \beta_0 + \beta_1 \text{Age} + \beta_2 \text{Fitness} + e$.
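To illustrate how a model of this form could be fitted, the sketch below solves the least squares normal equations with a small Gaussian elimination routine. Everything here is an assumption for illustration: the helper names, the coefficient values, and the noise-free data generated from Risk = 1 + 0.5 Age - 2 Fitness so that the fit can be checked exactly.

```python
def solve_linear_system(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for r in reversed(range(n)):  # back substitution
        s = sum(m[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (m[r][n] - s) / m[r][r]
    return z


def fit_multiple_regression(rows, y):
    """Least squares coefficients [b0, b1, ..., bp] via the normal equations."""
    design = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    p = len(design[0])
    xtx = [[sum(d[j] * d[k] for d in design) for k in range(p)] for j in range(p)]
    xty = [sum(d[j] * yi for d, yi in zip(design, y)) for j in range(p)]
    return solve_linear_system(xtx, xty)


# Hypothetical, noise-free data generated from Risk = 1 + 0.5*Age - 2*Fitness.
age = [30, 40, 50, 60, 45, 35]
fitness = [3, 5, 2, 4, 1, 6]
risk = [1 + 0.5 * a - 2 * f for a, f in zip(age, fitness)]
coefs = fit_multiple_regression(list(zip(age, fitness)), risk)
print([round(c, 4) for c in coefs])
```

With real data the response would include random error, so the estimates would only approximate the underlying coefficients.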
Regression: Dummy Variables
Regression models are among the most important statistical tools. Regression is
a big area and is very useful in practice.
An ANOVA model may be written as a regression model.
For example, suppose that we wish to compare three groups
Regression: Curve fitting
Regression models are used to capture the main features in
the data. A straight line is the simplest, but it may not
capture the main features in the data.
In some cases, a polynomial model can be more flexible in
modelling data than
Binary variables are widely used in practice. Examples:
smoking/no-smoking, cancer/no-cancer, male/female, pass/fail,
To study the relationship between two binary variables, we can
summarize the data in a $2 \times 2$ table.
The association between two c
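One common measure of association in a 2 x 2 table is the odds ratio. The sketch below computes it from hypothetical counts; the table layout (rows are exposure groups, columns are outcome yes/no), the function name, and the numbers are assumptions for illustration.

```python
def odds_ratio(table):
    """Odds ratio for a 2 x 2 table [[a, b], [c, d]]: (a * d) / (b * c)."""
    (a, b), (c, d) = table
    return (a * d) / (b * c)


# Hypothetical counts: rows = smoker / non-smoker, columns = cancer / no cancer.
counts = [[30, 70],
          [10, 90]]
or_hat = odds_ratio(counts)
print(f"odds ratio = {or_hat:.3f}")
```

An odds ratio of 1 corresponds to no association; values further from 1 indicate a stronger association.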
In linear regression models, the response y is a continuous
variable and is assumed to be normally distributed.
In practice, many response variables may be binary, such as
pass/fail, dead/alive, cancer/no cancer, etc. In this case, li
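For a binary response, a logistic regression model is a standard choice. The sketch below fits a logistic regression with one predictor by Newton-Raphson and illustrates that, with a single binary predictor, the exponentiated slope equals the sample odds ratio of the corresponding 2 x 2 table. The function name, the fixed number of iterations, and the data are assumptions for illustration.

```python
import math


def fit_logistic(x, y, n_iter=25):
    """Fit log-odds(y = 1) = b0 + b1 * x by Newton-Raphson."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p            # gradient of the log-likelihood
            g1 += xi * (yi - p)
            h00 += w                # information matrix entries
            h01 += xi * w
            h11 += xi * xi * w
        det = h00 * h11 - h01 * h01
        # Newton step: add (information matrix)^(-1) times the gradient.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Hypothetical data: x = 1 has 30 successes out of 100, x = 0 has 10 out of 100.
x = [1] * 100 + [0] * 100
y = [1] * 30 + [0] * 70 + [1] * 10 + [0] * 90
b0, b1 = fit_logistic(x, y)
print(f"b0 = {b0:.4f}, b1 = {b1:.4f}, exp(b1) = {math.exp(b1):.4f}")
```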
A time series is a single series of data measured repeatedly
over time. It is common in economic data or financial data.
A key characteristic of time series data is that the data may
be correlated (or dependent), i.e., there may be serial
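Dependence between successive observations can be quantified with the sample autocorrelation. The sketch below computes the lag-k autocorrelation using one common definition (lagged cross-products divided by the total sum of squares about the mean). The function name and the short series are hypothetical.

```python
def autocorrelation(series, lag=1):
    """Sample autocorrelation of a series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((v - mean) ** 2 for v in series)
    # Products of deviations that are `lag` time steps apart.
    numer = sum((series[t] - mean) * (series[t - lag] - mean)
                for t in range(lag, n))
    return numer / denom


# Hypothetical trending series, for illustration only.
series = [1, 2, 3, 4, 5, 6, 7, 8]
r1 = autocorrelation(series, lag=1)
print(f"lag-1 autocorrelation = {r1:.3f}")
```

A value well above zero suggests that neighbouring observations are positively correlated, so methods that assume independent data may not be appropriate.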
Review for Final Exam
Date/Time: Tuesday, December 15, 7:00-9:30pm, in SRC A
The final exam is closed notes/books
Two formula/cheat sheets (letter-size, two-sided, hand-writing
A simple calculator, UBC ID
The final exam will cover all materials thro
Summary: Interaction in Two-factor Designs
In one-way ANOVA, we study the effect of one factor on the
response. The factor has two or more levels. This corresponds to
the simplest experimental design: subjects are randomized into
two groups: treatment gro
Review: ANOVA Part II
When the assumptions for an ANOVA model do not hold, such as
data not normal or variances not equal, we can use the
corresponding nonparametric Kruskal-Wallis test.
An ANOVA model may be viewed as a special linear regression model
Summary: Analysis of Two-way Designs
In the analysis of two-way ANOVA, possible interaction
between the two factors should be taken into account.
In a simple $2 \times 2$ table, the interaction effect can be estimated.
Formulas for interaction effect for more gene
Activity Solution: Properties of Regression Estimators
We have a bivariate sample $(x_1, y_1), \ldots, (x_n, y_n)$ from two variables $X$
and $Y$. The regression line of $Y$ on $X$ can be written
$$y = b_0 + b_1 x, \qquad b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$
Summary: The Sign Test
The data are not assumed to follow any parametric distribution
(i.e., the test is distribution free).
The hypotheses are about the population median (or mean).
The test statistic is the number of positive signs.
The exact distribut
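A minimal sketch of the sign test, assuming the usual setup in which the number of positive signs follows a Binomial(n, 1/2) distribution when the hypothesized median is correct. The function name `sign_test`, the choice to drop observations equal to the hypothesized median, and the data are assumptions for illustration.

```python
from math import comb


def sign_test(data, m0):
    """Two-sided exact sign test of H0: population median = m0."""
    diffs = [v - m0 for v in data if v != m0]  # drop ties with m0
    n = len(diffs)
    s = sum(d > 0 for d in diffs)              # number of positive signs
    # Exact binomial tail probabilities with success probability 1/2.
    lower = sum(comb(n, k) for k in range(0, s + 1)) / 2 ** n
    upper = sum(comb(n, k) for k in range(s, n + 1)) / 2 ** n
    p_value = min(1.0, 2 * min(lower, upper))
    return s, n, p_value


# Hypothetical data and hypothesized median, for illustration only.
data = [5.1, 6.3, 4.8, 7.2, 6.9, 5.5, 6.1, 7.0, 6.6, 5.9]
s, n, p_value = sign_test(data, m0=5.0)
print(f"positive signs = {s} of {n}, p-value = {p_value:.4f}")
```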
Summary: Kruskal-Wallis test
The two-sample t-test is used to compare two population means.
The analysis of variance (ANOVA) is used to compare three or
more population means.
The distribution-free (nonparametric) version of the two-sample
t-test is the W
Summary: Wilcoxon rank sum test
The Wilcoxon rank sum test is a distribution-free
(nonparametric) version of the two-sample t-test.
The test is useful when (i) the data do not follow normal
distributions, or (ii) the sample size is small.
The null hypothe
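The rank sum statistic and an exact p-value can be sketched as follows for small samples. The sketch pools the two samples, assigns midranks, sums the ranks of the first sample, and compares that sum with every possible assignment of ranks to the first sample. The function names and data are hypothetical, and full enumeration is only practical for small sample sizes.

```python
from itertools import combinations


def midranks(values):
    """Ranks starting at 1, with tied values sharing the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def rank_sum_test(x, y):
    """Return (W, two-sided exact p-value), where W is the rank sum of x."""
    ranks = midranks(list(x) + list(y))
    w = sum(ranks[:len(x)])
    # Under H0 every subset of ranks of size len(x) is equally likely.
    sums = [sum(c) for c in combinations(ranks, len(x))]
    p_low = sum(s <= w for s in sums) / len(sums)
    p_high = sum(s >= w for s in sums) / len(sums)
    return w, min(1.0, 2 * min(p_low, p_high))


# Hypothetical samples, for illustration only.
x = [1.1, 2.3, 3.0]
y = [4.2, 5.1, 6.6, 7.3]
w, p_value = rank_sum_test(x, y)
print(f"W = {w}, exact two-sided p-value = {p_value:.4f}")
```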
Summary: Permutation Test
No distributional assumption for the data.
The permutation test is another alternative to the two-sample
t-test. It is also an alternative to the Wilcoxon rank sum test.
It is useful when (i) the sample size is small; or (ii) dat
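A minimal sketch of a two-sample permutation test, assuming the difference in sample means as the test statistic and a Monte Carlo approximation over random relabellings. The function name, the number of permutations, the seed, and the data are assumptions for illustration.

```python
import random


def permutation_test(x, y, n_perm=5000, seed=0):
    """Return (observed mean difference, approximate two-sided p-value)."""
    rng = random.Random(seed)

    def mean(values):
        return sum(values) / len(values)

    observed = mean(x) - mean(y)
    pooled = list(x) + list(y)
    n_x = len(x)
    extreme = 0
    for _ in range(n_perm):
        # Randomly reassign the group labels, keeping the group sizes fixed.
        rng.shuffle(pooled)
        diff = mean(pooled[:n_x]) - mean(pooled[n_x:])
        # Small tolerance guards against floating-point rounding.
        if abs(diff) >= abs(observed) - 1e-12:
            extreme += 1
    return observed, extreme / n_perm


# Hypothetical samples, for illustration only.
x = [1.1, 2.3, 3.0]
y = [4.2, 5.1, 6.6, 7.3]
observed, p_value = permutation_test(x, y)
print(f"observed difference = {observed:.3f}, p-value = {p_value:.3f}")
```

For samples this small, all possible relabellings could be enumerated instead of sampled, which gives an exact p-value.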
Summary: Power of a test
Neyman-Pearson principle of evaluating tests: first control
Type I error (i.e., significance level $\alpha$), then maximize power.
That is, with the same significance level, the higher the power
the more desirable the test.
Type I error: reject
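To make the idea of power concrete, the sketch below computes the power of a one-sided z-test for a mean with known standard deviation. The choice of test, the function name, and the numerical values are assumptions for illustration, not taken from the course notes.

```python
from statistics import NormalDist


def z_test_power(mu0, mu1, sigma, n, alpha=0.05):
    """Power of the one-sided z-test of H0: mu = mu0 against H1: mu > mu0,
    evaluated at the alternative mean mu1, with known sigma."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha)   # critical value at level alpha
    shift = (mu1 - mu0) * n ** 0.5 / sigma    # standardized true effect
    return 1 - std_normal.cdf(z_alpha - shift)


# Hypothetical setting: detect a shift of 0.5 with sigma = 1.
power_n25 = z_test_power(mu0=0.0, mu1=0.5, sigma=1.0, n=25)
power_n50 = z_test_power(mu0=0.0, mu1=0.5, sigma=1.0, n=50)
print(f"power with n = 25: {power_n25:.3f}; with n = 50: {power_n50:.3f}")
```

When `mu1` equals `mu0` the function returns the significance level, which matches the definition of Type I error.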
Summary: Goodness-of-Fit in Contingency Tables
Contingency tables are used to summarize data from two or more
A common hypothesis to be tested is that the variables are independent,
i.e., H0 : the variables are independent. H1 : the
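A test of independence in a contingency table is commonly based on the chi-square statistic, which compares observed counts with the counts expected under independence. The sketch below computes that statistic and its degrees of freedom; the function name and the counts are hypothetical.

```python
def chi_square_independence(table):
    """Return (chi-square statistic, degrees of freedom) for a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df


# Hypothetical 2 x 2 counts, for illustration only.
counts = [[30, 70],
          [10, 90]]
chi2_stat, df = chi_square_independence(counts)
print(f"chi-square = {chi2_stat:.2f} on {df} degree(s) of freedom")
```

With 1 degree of freedom the 5% critical value is about 3.84, so a statistic larger than that would lead to rejecting independence at the 5% level.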
Summary: Density Curve Fitting
We can use the $\chi^2$ goodness-of-fit test method for categorical data to
check if continuous data follow a known distribution (or model).
The basic idea is to divide continuous data into intervals, count the
number of observations
In practice, data may or may not follow a statistical model.
A model is an assumption, which may or may not fit the data.
In statistical analysis, when we assume a model, we need to test
whether the model fits the data or not. Such a test
Summary: Normal Probability Plots
Many statistical methods assume that data follow normal
distributions. Examples include:
one-sample or two-sample t-tests
linear regression models
In data analysis, before applying these methods/models, we
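The points of a normal probability (QQ) plot can be computed without any plotting library. The sketch below pairs the sorted data with standard normal quantiles and reports the correlation between them; a value close to 1 is consistent with approximate normality. The plotting positions (i - 0.5)/n are one common convention and are an assumption here, as are the function names and the data.

```python
from statistics import NormalDist


def normal_qq_points(data):
    """Return (theoretical quantile, ordered observation) pairs."""
    n = len(data)
    ordered = sorted(data)
    std_normal = NormalDist()
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))


def correlation(pairs):
    """Pearson correlation of the two coordinates in a list of pairs."""
    n = len(pairs)
    mean_u = sum(u for u, _ in pairs) / n
    mean_v = sum(v for _, v in pairs) / n
    suv = sum((u - mean_u) * (v - mean_v) for u, v in pairs)
    suu = sum((u - mean_u) ** 2 for u, _ in pairs)
    svv = sum((v - mean_v) ** 2 for _, v in pairs)
    return suv / (suu * svv) ** 0.5


# Hypothetical sample, for illustration only.
sample = [4.1, 5.3, 3.8, 6.0, 5.5, 4.7, 5.1, 4.9, 6.2, 4.4]
points = normal_qq_points(sample)
qq_corr = correlation(points)
print(f"QQ-plot correlation = {qq_corr:.3f}")
```

Plotting the pairs in `points` and checking whether they fall near a straight line gives the usual visual version of this check.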
CHAPTER 1: Non-parametric Tests
Parametric test & Non-parametric test
Parametric test: techniques that are reliant on distributional assumptions
Non-parametric test: tests that are distribution free (but still require data to be
independent and identic
Implicit Bias among Physicians and its Prediction of Thrombolysis
Decisions for Black and White Patients
Alexander R. Green, MD, MPH1, Dana R. Carney, PhD2, Daniel J. Pallin, MD, MPH3,
Long H. Ngo, PhD4, Kristal L. Raymond, MPH5, Lisa I. Iezzoni, MD, MSc6
Assignment hw01 due 09/20/2016 at 11:59pm PDT
describes a sensible null hypothesis of interest?
1. (5 points) How does evidence of economic recession influence an individual's attitudes toward redistribution of wealth?
Is it t
Assignment hw02 due 09/27/2016 at 11:59pm PDT
1. (5 points) Some researchers (see, for example, Cassidy
and Jones, 2001) have suggested that blood pressure may tend
to be higher on average when measured on a subject's right ar
Assignment hw03 due 10/04/2016 at 11:59pm PDT
Suppose that in fact the sibling on the high fat diet will gain
more weight than their sibling in 75% of cases over a twelve-week period. Taking this information to define the alte