Random sample
• Selection bias, nonresponse bias, measurement error: common to design of experiments.
• What is a good sample? Intuitively, we want something representative of the population. In statistics, this is formalized as a random sample: a sample selected from the population in such a way that every different sample of size n has an equal chance of selection.
• Of course, this is easy to say but not at all easy to achieve.
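The definition of a random sample can be sketched in R with sample(), which by default draws without replacement so that every subset of size n is equally likely. The population below is hypothetical (just the unit labels 1 to 1000) and is not data from these notes.

```r
# Hypothetical population: unit labels 1..1000 (an assumption for illustration)
population <- 1:1000
n <- 10

set.seed(515)                         # fix the seed so the draw is reproducible
srs <- sample(population, size = n)   # simple random sample, without replacement
srs
```

In practice the hard part is having a complete list of the population to draw from, not the draw itself.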
EPA car mileage rating data
• However, one can easily get samples like EPA, EPAn2, EPAn06. See the R output and compare their histograms using hist().
• This is a good time to introduce R, a free statistical package, which is downloadable from http://cran.r-project.org/, where you can also find introductions, both quick and comprehensive.
• Advantages of R over Minitab: (1) it is free; (2) it is written by research statisticians working at the frontier, which means more built-in modern statistical packages; (3) it has an interactive interface; and many other features. However, it is not as commercialized as Minitab, so it is less popular in industry.
• How to make a stem-and-leaf display? Command: stem()
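As a rough illustration of stem() next to hist(), the sketch below uses 100 simulated values as a stand-in for mileage data. The vector mpg and its parameters are assumptions made for this example; they are not the EPA sample used in these notes.

```r
# Stand-in data: 100 simulated mileage values (hypothetical, NOT the real EPA data)
set.seed(1)
mpg <- round(rnorm(100, mean = 37, sd = 2.4), 1)

stem(mpg)                                             # text stem-and-leaf display
hist(mpg, main = "Simulated mileage", xlab = "mpg")   # histogram of the same values
```

The stem-and-leaf display keeps the individual values visible, while the histogram only shows counts per bin.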
• Numerical measures of central tendency: One obvious choice is the mean, which is defined as
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n},$$
where the $x_i$ are the data points.
• Looking at the EPA data, one can get the sample mean by using mean(EPA). You can check that with sum(EPA)/100 (here n = 100). The mean tells you where the observations tend to center.
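The same check works for any numeric vector if the divisor is length(x) rather than a hard-coded 100. A minimal sketch, with a short made-up vector x standing in for EPA (the values are hypothetical):

```r
# Hypothetical data vector standing in for EPA
x <- c(36.3, 41.0, 36.9, 37.1, 44.9)

n <- length(x)
xbar_formula <- sum(x) / n   # the definition: sum of the data points over n
xbar_builtin <- mean(x)      # R's built-in sample mean

all.equal(xbar_formula, xbar_builtin)
```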
• The other competing notion is the median: if you have an odd number of data points, the median is defined to be the value right in the middle of the sorted data; if your sample has an even number of points, the median is the average of the two values in the center of the sorted data.
• Compare the median and the mean for the data: 2.3, 4.5, 6.4, 8.4, 3.4, 5.3, 4.7, 3.8. Claim: the median is robust to outliers. In this regard, the median is a more reliable measure of the center when outliers are present.
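The robustness claim can be checked directly on the eight values above. In this sketch the "outlier" is a hypothetical recording error (8.4 mistyped as 84), introduced only for illustration.

```r
x <- c(2.3, 4.5, 6.4, 8.4, 3.4, 5.3, 4.7, 3.8)
mean(x)      # 4.85
median(x)    # 4.6  (average of the two middle values, 4.5 and 4.7)

# Hypothetical recording error: 8.4 mistyped as 84
x_bad <- replace(x, x == 8.4, 84)
mean(x_bad)    # 14.3, pulled far from the bulk of the data
median(x_bad)  # 4.6, unchanged
```

One bad value moves the mean from 4.85 to 14.3 while the median stays at 4.6.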
• Indeed, one may have skewed data due to measurement error, which may bring in outliers. See the data EPAn06. So be careful when measuring the center.
How should one measure the spread, or the variability, of your data?
• You may think of the range, i.e., max − min. But what if there are outliers due to measurement error? Will the range reflect the true spread?
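To see how sensitive the range is, here is a small sketch using the eight-value example data from these notes, with one hypothetical outlier (84) appended for illustration.

```r
x <- c(2.3, 4.5, 6.4, 8.4, 3.4, 5.3, 4.7, 3.8)
diff(range(x))       # max - min = 8.4 - 2.3 = 6.1

x_bad <- c(x, 84)    # hypothetical outlier from a measurement error
diff(range(x_bad))   # 84 - 2.3 = 81.7
```

A single extreme value determines the range entirely, so it says little about how the rest of the data are spread.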
• Statisticians tend to use the so-called sample variance. By formula it is given by
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}.$$
Alternatively, a commonly used related quantity is the sample standard deviation, which is the square root of the sample variance:
$$s = \sqrt{s^2}.$$
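The two formulas can be checked against R's built-in var() and sd(), both of which use the n − 1 divisor. A minimal sketch on the eight-value example data from these notes:

```r
x <- c(2.3, 4.5, 6.4, 8.4, 3.4, 5.3, 4.7, 3.8)
n <- length(x)

s2 <- sum((x - mean(x))^2) / (n - 1)   # sample variance from the formula
s  <- sqrt(s2)                         # sample standard deviation

all.equal(s2, var(x))   # var() uses the same n - 1 divisor
all.equal(s, sd(x))
```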
• As you can imagine, if the whole population is observed, the population variance and its standard deviation would be defined in a similar way. Statisticians tend to denote them by $\sigma^2$ and $\sigma$. But keep in mind, these are usually not available, because the population is unmanageable. So they are parameters (or characteristics, as you may call them) that need to be estimated. Look at
This note was uploaded on 06/06/2011 for the course STAT 515 taught by Professor Zhao during the Spring '10 term at South Carolina.