110906 Class Notes
Data Set 1
{46, 49, 52, 57}
Mean
(46 + 49 + 52 + 57)/4 = 51
Standard Deviation (s_x)
s_x = sqrt[ Σ(x – mean)² / (n – 1) ] = sqrt(66/3) ≈ 4.690
Data Set 2
{24, 34, 55, 91}
Mean = 51
Standard Deviation (s_x) = 29.631
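The two data sets have the same mean but very different spreads. A minimal Python sketch to check the numbers by hand; it assumes the sample standard deviation (divisor n – 1), which is the version that reproduces 4.690 and 29.631. The function name `sample_std` is just an illustrative choice.

```python
import math

def sample_std(data):
    """Sample standard deviation s_x, using the n - 1 divisor."""
    n = len(data)
    mean = sum(data) / n
    # Sum of squared deviations from the mean
    ss = sum((x - mean) ** 2 for x in data)
    return math.sqrt(ss / (n - 1))

data_set_1 = [46, 49, 52, 57]
data_set_2 = [24, 34, 55, 91]

for name, data in [("Data Set 1", data_set_1), ("Data Set 2", data_set_2)]:
    mean = sum(data) / len(data)
    print(f"{name}: mean = {mean}, s_x = {sample_std(data):.3f}")
```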
SHARP EL-531
MODE
STAT
SD
46 M+
49 M+
52 M+
57 M+
RCL (mean) = 51
RCL s_x = 4.690
CASIO
MODE MODE SD
46 M+
49 M+
52 M+
57 M+
[2nd function] (upper left) Svar
Exploratory Data Analysis (EDA)
John Tukey
The Median
The median of a data set is found by putting the
numbers into numerical order and finding:
i. The middle number if n is odd
ii. The average of the two middle numbers if n is even
The median is also called the 50th percentile.
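The odd/even rule above can be sketched in Python as follows. This is only an illustration of the definition (the standard library's `statistics.median` does the same job); the function name `median` is chosen here for convenience.

```python
def median(data):
    """Median by the rule in the notes: sort, then take the middle value(s)."""
    ordered = sorted(data)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # n odd: the single middle number
        return ordered[mid]
    # n even: average of the two middle numbers
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([46, 49, 52, 57]))  # Data Set 1 (n even)
print(median([24, 34, 55, 91]))  # Data Set 2 (n even)
```

For Data Set 1 the two middle numbers are 49 and 52, so the median is 50.5; for Data Set 2 they are 34 and 55, giving 44.5.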
Tukey states that the median divides the data set into
a lower half and a higher half. If n is odd, Tukey
considers the median to belong both to the lower half
and the upper half. Tukey’s lower hinge is the median
of the lower half, and the upper hinge is the median
of the upper half.
Lower hinge ≈ 25th percentile = Q1 = first quartile
Upper hinge ≈ 75th percentile = Q3 = third quartile
Tukey’s h-spread = upper hinge – lower hinge
≈ interquartile range
= Q3 – Q1
Interquartile range helps protect against outliers.
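A short Python sketch of the hinges and h-spread as defined above, assuming Tukey's convention that the median belongs to both halves when n is odd. The names `tukey_hinges` and `h_spread` are illustrative, not from the class.

```python
from statistics import median

def tukey_hinges(data):
    """Return (lower hinge, upper hinge) using Tukey's halves."""
    ordered = sorted(data)
    n = len(ordered)
    # Size of each half; when n is odd the median is counted in both halves
    half = (n + 1) // 2
    lower_half = ordered[:half]
    upper_half = ordered[n - half:]
    return median(lower_half), median(upper_half)

def h_spread(data):
    """Tukey's h-spread = upper hinge - lower hinge."""
    lower, upper = tukey_hinges(data)
    return upper - lower

print(tukey_hinges([46, 49, 52, 57]), h_spread([46, 49, 52, 57]))
```

With n = 45, each half has 23 numbers, so the hinges are the 12th number from each end.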
Application (to data sets from worksheet)
Question 2 (stem-and-leaf notation)
n = 45
mean = 87.04444
s_x = 18.764
median (23rd number) = 85
mean > median
This reflects the positive skew.
NB: These numbers are meaningful for large data
sets but are relatively meaningless for small
data sets.
Lower hinge (12th number) = 74
Upper hinge (12th number from end) = 99
Tukey’s h-spread = 99 – 74 = 25 (≈ interquartile range)
Tukey’s Fences
The lower inner fence is:
Lower hinge minus 1.5 * h-spread
The upper inner fence is:
Upper hinge plus 1.5 * h-spread
NB: 1.5 is an arbitrary number according to Prof.
But Tukey claimed, after spending 10 years in the
computer lab, that this was a good number for his
purposes.
A number in the data set is called a Tukey outlier if
either it is greater than the upper inner fence or less
than the lower inner fence.
For the data in example 2,
Lower inner fence = 74 – 1.5 * 25 = 36.5
No low outliers
Upper inner fence = 99 + 1.5 * 25 = 136.5
138 is an outlier
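The fence arithmetic for this example can be checked with a small Python sketch. It takes the hinges 74 and 99 from the notes as given; the function names and the parameter `k` (the 1.5 multiplier) are illustrative assumptions.

```python
def inner_fences(lower_hinge, upper_hinge, k=1.5):
    """Return (lower inner fence, upper inner fence)."""
    h_spread = upper_hinge - lower_hinge
    return lower_hinge - k * h_spread, upper_hinge + k * h_spread

def is_tukey_outlier(x, lower_fence, upper_fence):
    """True if x lies below the lower fence or above the upper fence."""
    return x < lower_fence or x > upper_fence

lower_fence, upper_fence = inner_fences(74, 99)
print(lower_fence, upper_fence)

# The lowest (51) and highest (138) values in the data set
for value in (51, 138):
    print(value, is_tukey_outlier(value, lower_fence, upper_fence))
```

Only 138 falls outside the fences, which matches the conclusion above.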
Potential exam question – why do statisticians care
about finding outliers?
Outlier data are fairly often incorrect data. Even when
they are correct, statisticians sometimes choose to
exclude them from their analyses because they distort
measures such as the mean and standard deviation. For
this reason, “average” income has been replaced with
“median” income when a city’s incomes are reported.
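To see why the median is preferred for incomes, here is a small Python sketch. The income figures are made up purely for illustration (they are not from the worksheet): one very large income is added to a small list, and the mean and median are compared before and after.

```python
from statistics import mean, median

# Hypothetical incomes, invented for this illustration only
incomes = [30_000, 35_000, 40_000, 45_000, 50_000]
with_outlier = incomes + [1_000_000]

print("without outlier:", mean(incomes), median(incomes))
print("with outlier:   ", mean(with_outlier), median(with_outlier))
```

The single extreme value moves the mean from 40,000 to 200,000, while the median only moves from 40,000 to 42,500.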
Range = highest number minus lowest number
= 138 – 51 = 87
Tukey’s Adjacent Values
Tukey’s lower adjacent value
= lowest non-outlier in the data set
Tukey’s upper adjacent value
= highest non-outlier in the data set
Tukey’s Five-Point Summary
The five-point summary:
Lower adjacent value
Lower hinge
Median
Upper hinge
Upper adjacent value
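Putting the pieces together, the five-point summary could be computed along these lines. This is a sketch under the definitions in these notes (Tukey's hinges, inner fences at 1.5 * h-spread); the function name is illustrative, and the 9-value sample is hypothetical, chosen only so that its hinges match the 74 and 99 from Question 2. It is not the worksheet data.

```python
from statistics import median

def five_point_summary(data, k=1.5):
    """Return (lower adjacent, lower hinge, median, upper hinge, upper adjacent)."""
    ordered = sorted(data)
    n = len(ordered)
    half = (n + 1) // 2                      # median counted in both halves if n is odd
    lower_hinge = median(ordered[:half])
    upper_hinge = median(ordered[n - half:])
    h_spread = upper_hinge - lower_hinge
    lower_fence = lower_hinge - k * h_spread
    upper_fence = upper_hinge + k * h_spread
    # Adjacent values: the lowest and highest non-outliers
    inside = [x for x in ordered if lower_fence <= x <= upper_fence]
    return inside[0], lower_hinge, median(ordered), upper_hinge, inside[-1]

# Hypothetical sample, NOT the worksheet data
sample = [51, 60, 74, 80, 85, 90, 99, 110, 138]
print(five_point_summary(sample))
```

In this sample 138 lies above the upper inner fence (136.5), so the upper adjacent value is 110 rather than 138.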
Box-and-Whisker Plot
See class handout page for example
Boundaries of box are the lower and upper hinge;
mark in middle of box is the median; whiskers extend
out to the upper and lower adjacent values; outliers
are indicated by asterisks.