Statistics – V3100018.001
UNIT 11 – Interval estimation
Giuseppe Arbia, Catholic University of the Sacred Heart, Roma, Italy

In statistical inference we distinguish 3 problems:
1. Point estimation (unit 10)
2. Confidence intervals (this unit)
3. Hypothesis testing (next unit)
1. Point estimation
• We approximate one unknown parameter with a SINGLE value.
• We want to find the BEST approximation.
• E.g., what is the best way of estimating the population mean?
• We need to specify the criteria that define the BEST way.
• The BEST estimator is an estimator which is
  – Unbiased
  – Consistent
  – Most efficient
2. Confidence intervals
• We acknowledge that an estimate contains an error, and we attach to the estimate a measure that expresses the degree of confidence we have in the result (expressed as a probability):

P(m − ε ≤ µ ≤ m + ε) = 1 − α

where ε is the error and α is the probability of error.
3. Hypothesis testing
• Both inductive and deductive.
• We have some a priori idea and we ask the data to confirm or reject this idea.
• E.g., we believe that µ = µ0 and we look at a sample to see how likely this is to be true given the observed sample.
Confidence intervals
1. Confidence intervals around a mean
2. Confidence intervals around a proportion
3. Confidence intervals around a variance
4. Confidence intervals around regression coefficients
Sampling distribution of the mean
How likely is it that the true value of the mean (that is, the population mean) falls within a given range?
Back once again to our deductive exercise
Suppose again that we have an urn that contains 6 balls, each numbered progressively from 1 to 6, and we draw 4 balls from the urn without replacement. We can draw 15 different samples:

C(6, 4) = 6! / (4!·2!) = 15

It is easy to calculate the probability that the sample mean falls in an interval that includes the true population mean, e.g.:

P(3 ≤ m ≤ 4) = 11/15 = 0.7333   (11 cases out of 15)
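A short sketch of this deductive exercise in Python (standard library only; the urn contents and the interval [3, 4] are those of the slide):

from itertools import combinations
from fractions import Fraction

balls = range(1, 7)                       # urn with balls numbered 1..6
samples = list(combinations(balls, 4))    # all 15 possible samples of size 4

# count the samples whose mean falls in the interval [3, 4]
hits = sum(1 for s in samples if 3 <= sum(s) / 4 <= 4)

print(len(samples))                        # 15
print(hits, Fraction(hits, len(samples)))  # 11, 11/15 = 0.7333...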
If we can assume that the population is normally distributed with KNOWN variance σ²,

X ~ N(µ, σ²)

and we further assume that the sample is random (SRS), we know that the sample mean is also normally distributed (at a given sample size n) as:

m ~ N(µ, σ²/n)
So that, standardizing, we have:

(m − µ) / (σ/√n) ~ N(0, 1)
 
Let us recall the meaning of  m ~ N(µ, σ²/n):

If we could draw all possible samples (of a given size n, arbitrary but constant across samples) from a population, and in each sample we calculated the sample mean, then the distribution of these means would be normal, with a mean equal to the true population mean and a standard deviation given by the ratio between the true standard deviation of the population and the square root of n.

The standard deviation of the distribution of the means is called the STANDARD ERROR of the mean.
In general, given a standard normal distribution, we can quite easily calculate the following probability for any given k:

P(−k ≤ z ≤ k) = 1 − α

and, in particular, in our case we can calculate

P(−k ≤ (m − µ)/(σ/√n) ≤ k) = 1 − α

with k depending on α and obtainable from the Normal tables or through the Excel procedure NORM.S.DIST.

NOTICE: if (1 − α) is the probability that z is within the desired interval, then there is a probability of α that z IS NOT in the given interval, in other words the probability of making a mistake.
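As an alternative to the Normal tables or to Excel's NORM.S.DIST / NORM.S.INV, the same quantities can be computed, for example, with SciPy (a sketch, assuming scipy is installed):

from scipy.stats import norm

# P(-k <= z <= k) for a given k (analogue of NORM.S.DIST)
for k in (1, 2, 1.96):
    print(k, norm.cdf(k) - norm.cdf(-k))   # 0.6827, 0.9545, 0.95

# k for a given error probability alpha (analogue of NORM.S.INV)
for alpha in (0.05, 0.01):
    print(alpha, norm.ppf(1 - alpha / 2))  # 1.96, 2.576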
Example: k = 1

P(−1 ≤ (m − µ)/(σ/√n) ≤ 1) = P{z ≤ 1} − P{z ≤ −1}
Using the Excel function NORM.S.DIST, or the Standardized Normal tables, we have:

P{z ≤ 1} − P{z ≤ −1} = 0.8413 − 0.1587 = 0.6827

and we can say that:

P(−1 ≤ (m − µ)/(σ/√n) ≤ 1) = 0.6827
Similarly, using the Excel function NORM.S.INV(p), or again the tables, we can go from an assigned probability back to the corresponding value of k.

[Figures: NORM.S.INV output and the Normal table; for a larger k the probability in the interval is almost 1.]
Analogously, for k = 2 we have

P(−2 ≤ (m − µ)/(σ/√n) ≤ 2) = P{z ≤ 2} − P{z ≤ −2} = 0.9772 − 0.0228 = 0.9545
In statistical inference we often use the value k = 1.96 because in this case:

P(−1.96 ≤ (m − µ)/(σ/√n) ≤ 1.96) = P{z ≤ 1.96} − P{z ≤ −1.96} = 0.975 − 0.025 = 0.95
If we multiply the three members by the standard deviation of the mean, σ/√n, we have

0.95 = P(−1.96·(σ/√n) ≤ [(m − µ)/(σ/√n)]·(σ/√n) ≤ 1.96·(σ/√n))
     = P(−1.96·(σ/√n) ≤ m − µ ≤ 1.96·(σ/√n))
The term m − µ represents the ERROR that we make when we estimate µ with m.

The error is limited by the term 1.96·(σ/√n), which depends on three elements:

1. The population variability. The larger the variability, the larger the error. Zero variability corresponds to NO ERROR, whatever the sample size: a sample of size 1 is enough for a perfect estimation.

2. The sample size. The larger the sample size, the lower the error. At sample size = ∞ we do not make any mistake, no matter how variable the phenomenon is.

3. The "magic number" 1.96, which depends on the assigned probability (in our case we fixed α = 0.05, hence 1 − α = 0.95). The higher the probability of not making a mistake, the higher this number, hence the larger the error. If we increase the probability of not making an error, we have to accept larger errors. We can always reduce the probability of error to almost zero, but then this number becomes very large and the error consequently very large. For instance, if we want a probability α = 0.01, and hence (1 − α) = 0.99, the value on the Normal table is 2.58 and the error is larger.
Let us now add the true population mean µ to all three members of the previous expression. We obtain

0.95 = P(−1.96·(σ/√n) ≤ m − µ ≤ 1.96·(σ/√n))
     = P(µ − 1.96·(σ/√n) ≤ m − µ + µ ≤ µ + 1.96·(σ/√n))
     = P(µ − 1.96·(σ/√n) ≤ m ≤ µ + 1.96·(σ/√n))
This expression corresponds to the interval that we can build in our deductive exercise. It states that, with a probability of 0.95, the sample mean of a random sample will be in a given interval:

P(µ − 1.96·(σ/√n) ≤ m ≤ µ + 1.96·(σ/√n)) = 0.95

Since so far we assume that we know the population variance, this expression represents an interval of known length but unknown position (because the true mean µ is unknown), within which the variable sample mean m will fall.

HOWEVER IT IS NOT USEFUL OPERATIONALLY!
If we go back to the previous expression and we now subtract the sample mean m from all three members, we obtain instead

P(−1.96·(σ/√n) ≤ m − µ ≤ 1.96·(σ/√n)) =
= P(−1.96·(σ/√n) − m ≤ m − µ − m ≤ 1.96·(σ/√n) − m) =
= P(−1.96·(σ/√n) − m ≤ −µ ≤ 1.96·(σ/√n) − m)
By changing the sign and reversing the inequality we obtain another probabilistic expression:

P(m − 1.96·(σ/√n) ≤ µ ≤ m + 1.96·(σ/√n)) = 0.95

that represents an inductive expression: a fixed interval (since both σ and m are known) around the unknown value of µ.
We call confidence interval the interval within which, with a given probability, we expect to find the true (unknown) population parameter.

The assigned probability is called the confidence level.
[Figure: determining the value z such that the confidence level is 95%.]
[Figure: determining the value z such that the confidence level is 99%.]
[Figure: confidence intervals for 5 different samples with n = 25 drawn from a normal population such that µ = 368 and σ = 15.]
Let us recall the very first lesson.

If π is an unknown parameter that we wish to estimate, p is its point estimate (unbiased, consistent and most efficient), and ε is the maximum error that we want to make with a given probability, then we can write:

P(p − ε ≤ π ≤ p + ε) = (1 − α), with 0 ≤ α ≤ 1

Remember that P(p = π) = 0. This is the reason why we need an interval estimation.
Let us use some simulations to fully understand the meaning of a confidence interval.

Let us draw 200 samples of size n = 100 from a population that is N(170; 100) and let us calculate in each sample a confidence interval with a confidence level of 90%.
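A minimal simulation sketch of this experiment with NumPy and SciPy (N(170; 100) is read as µ = 170 and σ² = 100, i.e. σ = 10; the exact counts change with the random seed):

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
mu, sigma, n, reps, conf = 170, 10, 100, 200, 0.90

z = norm.ppf(1 - (1 - conf) / 2)          # 1.645 for 90% confidence
half = z * sigma / np.sqrt(n)             # half-width of each interval

means = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)
left  = np.sum(means + half < mu)         # intervals entirely to the left of 170
right = np.sum(means - half > mu)         # intervals entirely to the right of 170
print(left, right, reps - left - right)   # roughly 90% of intervals cover 170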
[Figure: the 200 confidence intervals, one per interval number (0 to 200), plotted on a scale from 166 to 174 around the true mean 170.]

Number of intervals totally to the left of the true mean (170) = 10
Number of intervals totally to the right of the true mean (170) = 8
Number of intervals containing the true value (170) = 181
Percentage of intervals containing the true value (170) = 91%

If we now draw 10,000 samples and we build confidence intervals with a confidence level of 99%:

Number of intervals totally to the left of the true mean (170) = 50
Number of intervals totally to the right of the true mean (170) = 49
Number of intervals containing the true value (170) = 9901
Percentage of intervals containing the true value (170) = 99.01%
Example
We have the following observations from a population X ~ N(µ; σ = 3):

12, 9, 10, 13

Build the confidence interval for µ at a confidence level of 95%.

m = 44/4 = 11
z_{α/2} = z_{0.025} = 1.96
m ± z_{α/2}·σ/√n = 11 ± 1.96·3/√4 = 11 ± 2.94 = [8.06; 13.94]

So the interval estimation is 8.06 ≤ µ ≤ 13.94.
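The same computation in a few lines of Python (standard library only; σ = 3 and the 1.96 quantile are taken from the example):

from math import sqrt

data = [12, 9, 10, 13]
sigma, z = 3.0, 1.96                # population sigma is known, 95% confidence

n = len(data)
m = sum(data) / n                   # 11.0
half = z * sigma / sqrt(n)          # 2.94
print(m - half, m + half)           # 8.06  13.94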
Excel procedure: confidence interval for the mean
(Suppose that we know the true standard deviation of the population, σ = 10.23.)

[Screenshots: the sample mean and the population standard deviation (or variance) are entered in the worksheet; the 95% confidence interval for the mean is 61.71 ± 2.7036.]
So far so good! But what happens when the population variance is unknown (as in most applications)?
We have seen that, standardizing the sample mean, we have:

(m − µ) / (σ/√n) ~ N(0, 1)
If the true population standard deviation is not known, we can substitute it with its unbiased estimate

s = √( (1/(n − 1)) · Σ_{i=1}^{n} (x_i − m)² )

thus obtaining

(m − µ) / (s/√n) ~ ???   (we do not know the distribution)
This operation (the substitution of the true value with an unbiased estimate), however, is not costless.

In fact the outcome is a distribution that is no longer Normal and where the uncertainty (and therefore the probability of very large or very small values) is greater.
 Student’s
t
distribu6on
 m!µ ~ t -Student "s% $ ' # n& 47
 Student’s
t
distribu6on
 m!µ m!µ m!µ !m = = = "s% sm sm $ ' !m # n& m!µ !m = n ! 1 sm n !1 ! m z "2 n !1 = z 2 " df df The
student’s
t
is
the
ra6o
between
a
standardized
Normal
distribu6on
and
the
square

 root
of
a
Chi‐squared
divided
by
its
degrees
of
freedom.
 So
it
also
depends
on
the
parameter
df
degrees
of
freedom
 48
t Distribution
William Sealy Gosset, known by the pseudonym "Student", introduced the t distribution for the first time. Gosset was an Oxford graduate in mathematics and worked for the Guinness Brewery in Dublin. He developed the t distribution while working on small-scale materials and temperature experiments.

William Sealy Gosset (Canterbury, 13 June 1876 – Beaconsfield, 16 October 1937)
t Distribution
The t distribution is a family of distributions similar to the normal, but with more uncertainty in the tails.

Contrary to the standard normal distribution, which does not depend on any parameter, any specific t distribution depends on a parameter known as the degrees of freedom (df).

Degrees of freedom refer to the number of independent pieces of information that go into the computation of s.
t Distribution
A t distribution with more degrees of freedom has less dispersion.

As the degrees of freedom increase, the difference between the t distribution and the standard normal probability distribution becomes smaller and smaller.
Properties of Student's t distribution
i. N and t are very similar; they are both symmetrical and bell-shaped.
ii. t is more dispersed around the mean than the Normal, due to the greater uncertainty introduced by the substitution of σ with s. In other words, there is more area below the tails and less in the central part.
iii. While there is one and only one Standard Normal distribution, there are several Student's t distributions, one for each df. They are tabulated in Table 2, Appendix B of the textbook.
iv. When df increases, the t distribution tends to look like a standard normal distribution (due to the Central Limit Theorem).
t Distribution
[Figure: density curves of the standard normal distribution and of the t distribution with 10 and 20 degrees of freedom, plotted against z, t.]
t Distribution
For more than 100 degrees of freedom, the standard normal z value provides a good approximation to the t value. The standard normal z values can be found in the infinite-degrees (∞) row of the t distribution table.
Table of Student's t distribution

Degrees of            Area in Upper Tail
Freedom      .20     .10     .05    .025     .01    .005
   50       .849   1.299   1.676   2.009   2.403   2.678
   60       .848   1.296   1.671   2.000   2.390   2.660
   80       .846   1.292   1.664   1.990   2.374   2.639
  100       .845   1.290   1.660   1.984   2.364   2.626
   ∞        .842   1.282   1.645   1.960   2.326   2.576   (standard normal z values)
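The tabulated values can be reproduced with SciPy's quantile functions (a sketch, assuming scipy is installed):

from scipy.stats import t, norm

tails = [0.20, 0.10, 0.05, 0.025, 0.01, 0.005]
for df in (50, 60, 80, 100):
    print(df, [round(t.ppf(1 - a, df), 3) for a in tails])
print("inf", [round(norm.ppf(1 - a), 3) for a in tails])   # last row of the table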
 

 
So, for n large, Student's t tends to resemble the normal distribution.

But in small samples (e.g. n < 40) it will provide larger confidence intervals, to take into account the greater uncertainty connected with our ignorance about the true population variance.
For instance, n = 10.

If the variance is known:

P(m − 1.96·σ/√10 ≤ µ ≤ m + 1.96·σ/√10) = 0.95

But if we do not know the variance and substitute it with an estimate, the interval becomes:

P(m − 2.262·s/√10 ≤ µ ≤ m + 2.262·s/√10) = 0.95

Therefore, at the same confidence level, we have a larger interval (more uncertainty about the true value).
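The two quantiles used above can be checked directly (a sketch, assuming SciPy is installed):

from scipy.stats import norm, t

print(norm.ppf(0.975))      # 1.960, known variance
print(t.ppf(0.975, df=9))   # 2.262, estimated variance, n = 10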
[Figure: area or probability, 2.5% of probability in each tail, so that there is 95% in the interval.]
Another example: n = 4; degrees of freedom = 4 − 1 = 3

t_{3; 0.025} = 3.182

P(−3.182 < t < 3.182) = 95%
P(−3.182 < (m − µ)/(s/√n) < 3.182) = 95%
⇓
µ = m ± 3.182·s/√n
Generalizing:

µ = m ± t_{n−1; 0.025} · s/√n

where n − 1 are the degrees of freedom and 0.025 is the probability in the tail: the confidence interval for the mean at a confidence level of 95% with a sample of size n.

[Figure: t distribution with 99 df.]
Example: The Saxon Company is a hydraulic supply company. Consider a sample of 100 sales of the company in one month:

n = 100, m = 110.27 $, s = 28.95 $

Build the 95% confidence interval for the mean sales in that month.

Using Student's t:
t_{n−1, α/2} = t_{99, 0.025} = 1.9842
m ± t_{n−1, α/2}·s/√n = 110.27 ± 1.984·28.95/√100  →  104.53 ≤ µ ≤ 116.01

Using the Standard Normal:
z_{α/2} = 1.96
m ± z_{α/2}·s/√n = 110.27 ± 1.96·28.95/√100  →  104.60 ≤ µ ≤ 115.94
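A sketch of the same computation from the summary statistics (assuming SciPy is available; n, m and s are those of the Saxon example):

from math import sqrt
from scipy.stats import t, norm

n, m, s = 100, 110.27, 28.95
se = s / sqrt(n)                         # estimated standard error, 2.895

t_half = t.ppf(0.975, df=n - 1) * se     # 1.9842 * 2.895, about 5.74
z_half = norm.ppf(0.975) * se            # 1.96   * 2.895, about 5.67

print(m - t_half, m + t_half)            # about 104.53, 116.01
print(m - z_half, m + z_half)            # about 104.60, 115.94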
Meaning of the df

The calculation of the unbiased sample variance s² requires the calculation of the term

Σ (x_i − m)²

To calculate s² we first need to calculate m. As a consequence, there are only n − 1 sample values that are free to assume any value (degrees of freedom).
Example: n = 5, m = 20

We know that  (1/5)·Σ_{i=1}^{5} x_i = 20  ⇒  Σ_{i=1}^{5} x_i = 100

So, when 4 of the 5 values are known, the fifth can only assume a value such that Σ x_i = 100.

If x1 = 18, x2 = 24, x3 = 19 and x4 = 16  ⇒  x5 = 23
Excel procedure (alternative): Data – Data Analysis – Descriptive Statistics

[Screenshots: the output reports the sample mean and the confidence-level term; one version uses the normal tables (population standard deviation known), the other uses the t distribution tables.]
2. Confidence interval for a proportion

As we know (Unit 10), the sample proportion is distributed as a Normal:

p ~ N(π ; p(1 − p)/n)

where p is the sample proportion and π is the true population proportion.
Therefore, fixing e.g. α = 0.05, the confidence interval for the proportion is

p − 1.96·√(p(1 − p)/n) ≤ π ≤ p + 1.96·√(p(1 − p)/n)

and therefore:

P( p − 1.96·√(p(1 − p)/n) ≤ π ≤ p + 1.96·√(p(1 − p)/n) ) = 0.95

This is the interval, around the sample proportion p, within which we expect to find, at a level of confidence of 95%, the true proportion π.
Example: the Saxon Plumbing Company wants to estimate the percentage of wrong invoices. Suppose that, in a sample of 100 invoices, there are 10 that contain an error. We build a confidence interval at 95% as

p ± z_{α/2}·√(p(1 − p)/n) = 0.10 ± 1.96·√(0.10·0.90/100)

0.0412 ≤ π ≤ 0.1588
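The same interval in a few lines of Python (standard library only; p = 0.10 and n = 100 from the example):

from math import sqrt

n, errors = 100, 10
p, z = errors / n, 1.96

half = z * sqrt(p * (1 - p) / n)   # 1.96 * 0.03 = 0.0588
print(p - half, p + half)          # 0.0412  0.1588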
Example 2: In Italy, from the population of individuals with the right to vote, we draw a Simple Random Sample of size n = 2,305. 60% of the sampled individuals declare that they are not going to vote in the next referendum. For a referendum to be valid, at least 50% of the population has to vote. Determine the confidence interval for the percentage of non-voters at a 95% level of confidence.

The sample is drawn without replacement from a finite population of size N. The ratio n/N = 2,305/30,000,000 is small, so we can neglect the correcting factor (N − n)/(N − 1).

So the confidence interval is

p ± z_{0.025}·√(p(1 − p)/n) = 0.6 ± 1.96·√(0.6·0.4/2305) = 0.6 ± 0.02 = [0.58; 0.62]
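A quick numerical check (plain Python) that the finite-population correction factor is indeed negligible here; N = 30,000,000 is the population size used in the slide:

from math import sqrt

N, n, p, z = 30_000_000, 2305, 0.6, 1.96

fpc = sqrt((N - n) / (N - 1))               # about 0.99996, practically 1
half = z * sqrt(p * (1 - p) / n)            # about 0.02
print(fpc, p - half * fpc, p + half * fpc)  # [0.58; 0.62] either way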
Determination of the optimal sample size

1. Optimal sample size to estimate the mean with a preassigned level of precision

Recall that:

P(m − 1.96·σ/√n ≤ µ ≤ m + 1.96·σ/√n) = 0.95

Therefore, by subtracting m from each member, we have:

P(−1.96·σ/√n ≤ (m − µ) ≤ +1.96·σ/√n) = 0.95

Let us define the estimation error at the 95% confidence level as:

error = e = 1.96·(σ/√n)
If we raise both sides to the power of 2, we have

e² = 1.96²·σ²/n

hence

n = 1.96²·σ²/e²
Example:
• Suppose that we want to estimate the mean making at most an error of ±2 (with a probability of 95%), and suppose that the standard deviation is 4. We need at least

n = 1.96²·4²/2² = 3.8416·16/4 = 15.3 ≅ 16

sample units. (We always round up to the next integer.)
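A minimal helper for this calculation (plain Python; the values σ = 4, e = 2 and z = 1.96 are those of the example above):

from math import ceil

def sample_size_mean(sigma, e, z=1.96):
    """Smallest n such that z * sigma / sqrt(n) <= e (rounded up)."""
    return ceil((z * sigma / e) ** 2)

print(sample_size_mean(sigma=4, e=2))   # 16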
The population variability σ generally is not known!!

Solutions:
i) Past experience, auxiliary variables or a pilot survey that provides a provisional estimate of σ.
ii) Basing the computation only on the minimum and the maximum, and approximating the standard deviation as

σ ≈ Range / 4
In a steel mill we want to check that the average length of steel bars is within some pre-assigned limits. From past experience we know that the discrepancies between the actual length and the desired length are between 1 inch and 28 inches. How large does a simple random sample have to be if we want to estimate the average µ with m, making a maximum absolute error of 3, with a probability of 99%?

σ ≈ Range/4 = (28 − 1)/4 = 27/4 = 6.75

n = 2.576²·σ²/e² = 2.576²·6.75²/3² = 33.58 → 34
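A quick check of the same figures in Python (plain Python; the 99% quantile 2.576 and the Range/4 approximation are the assumptions stated above):

from math import ceil

sigma = (28 - 1) / 4          # Range/4 approximation = 6.75
z, e = 2.576, 3               # 99% confidence, maximum error of 3 inches
n = (z * sigma / e) ** 2
print(n, ceil(n))             # about 33.6, so n = 34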
Generalizing for a generic probability of error α:

Margin of error:  e = z_{α/2}·σ/√n

Necessary sample size:  n = (z_{α/2})²·σ² / e²
NOTICE THAT THE REQUIRED SAMPLE SIZE DOES NOT DEPEND ON THE POPULATION DIMENSION
2. Optimal sample size to estimate the proportion with a preassigned level of precision

We know that, for n large,

p ~ N(π ; p(1 − p)/n)
Therefore, standardizing,

(p − π) / √(p(1 − p)/n) ~ N(0, 1)
Therefore, fixing α = 0.05, we have

P( −1.96 ≤ (p − π)/√(p(1 − p)/n) ≤ 1.96 ) = 0.95

and therefore, at a level of confidence of 95%, the error will be in the interval:

P( −1.96·√(p(1 − p)/n) ≤ (p − π) ≤ +1.96·√(p(1 − p)/n) ) = 0.95
Therefore the error, at 95% confidence, in this case is

e = 1.96·√(p(1 − p)/n)

and, squaring,

e² = 1.96²·p(1 − p)/n
And, solving for n, we have

n = 1.96²·p(1 − p) / e²

p is unknown. We can obtain it from previous surveys or a priori knowledge, as in the case of the mean. However, in the case of the percentage we have a further alternative.
The maximum value of p(1 − p) is 0.25:

  p     1 − p   p·(1 − p)
 0.00   1.00    0.0000
 0.10   0.90    0.0900
 0.20   0.80    0.1600
 0.30   0.70    0.2100
 0.40   0.60    0.2400
 0.45   0.55    0.2475
 0.49   0.51    0.2499
 0.50   0.50    0.2500
 0.51   0.49    0.2499
 0.55   0.45    0.2475
 0.60   0.40    0.2400
 0.70   0.30    0.2100
 0.80   0.20    0.1600
 0.90   0.10    0.0900
 1.00   0.00    0.0000
Therefore, if we do not know the value of p, we can consider the case of maximum uncertainty, that is p = 0.5. Then p·(1 − p) = 0.25 (maximum variance, maximum uncertainty):

n = 1.96²·0.25 / e² = 0.9604 / e² ≈ 1 / e²

For a quick approximate computation, take the inverse of the square of the maximum error.

This means that, in the worst case, we will select a sample of a size higher than necessary, but we avoid the risk of selecting a sample of a size smaller than necessary.
Example 1: What is the optimal sample size to estimate a percentage with probability 95% and making a mistake smaller than 3%?

n = 1.96²·0.25 / e² = 0.9604 / 0.03² = 1067
Example 2: Establish how many sample units we need to estimate the percentage of smokers in one city, aiming at an estimate with a confidence of 95% and a maximum absolute error of 4%.

ε = 0.04;  z_{α/2} = 1.96;  p = (1 − p) = 1/2

n = z²·p(1 − p) / ε² = 1.96²·0.25 / 0.04² = 600.25 ≈ 600
Example 3
Suppose that we impose a 0.99 probability that the sample proportion is within ±0.03 of the population proportion.

How large a sample size is needed to meet the required precision? (A previous sample of similar units yielded 0.44 for the sample proportion.)
error = e = z_{α/2}·√(π(1 − π)/n) = 0.03

At 99% confidence, z_{0.005} = 2.576. Recall that p = 0.44.

n = (z_{α/2})²·p(1 − p) / e² = (2.576)²·(0.44)·(0.56) / (0.03)² ≈ 1817

A sample of size 1817 is needed to reach the desired precision of ±0.03 at 99% confidence.
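A sketch of this sample-size calculation in Python (standard library only; the values are those of the example, and 0.5 is the worst case discussed in the note that follows):

from math import ceil

def sample_size_proportion(e, p=0.5, z=1.96):
    """Smallest n such that z * sqrt(p(1-p)/n) <= e (rounded up)."""
    return ceil(z ** 2 * p * (1 - p) / e ** 2)

print(sample_size_proportion(e=0.03, p=0.44, z=2.576))   # 1817
print(2.576 ** 2 * 0.25 / 0.03 ** 2)                     # about 1843, the worst case p = 0.5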
Note: We used 0.44 as the best estimate of π in the preceding expression. If no information is available about π, then 0.5 can be assumed because it provides the highest possible sample size. If we had used π = 0.5, the recommended n would have been 1843 > 1817.
Again generalizing for a generic probability of error α:

Margin of error:  e = z_{α/2}·√(p(1 − p)/n)

Solving for the necessary sample size, we get

n = (z_{α/2})²·p(1 − p) / e²
NOTICE THAT, AGAIN, THE REQUIRED SAMPLE SIZE DOES NOT DEPEND ON THE POPULATION DIMENSION
3. Confidence interval for the variance

We know that (Unit 10)

(n − 1)·s² / σ² ~ χ²_{n−1}
Therefore, fixing as usual α = 0.05, the confidence interval will be:

χ²_{0.975} ≤ (n − 1)·s²/σ² ≤ χ²_{0.025}

This is the interval around the sample variance where we expect to observe, with probability 95%, the true population variance.
χ²_{0.975} ≤ (n − 1)·s²/σ² ≤ χ²_{0.025}

(n − 1)·s²/χ²_{0.025} ≤ σ²   and   σ² ≤ (n − 1)·s²/χ²_{0.975}

hence

(n − 1)·s² / χ²_{0.025} ≤ σ² ≤ (n − 1)·s² / χ²_{0.975}

Confidence interval for the variance.
Example: In quality control, a low variance is a measure of reliability, in the sense that it guarantees the customer a certain standard. The filling variance for boxes of cereal is designed to be 0.02 or less. A sample of 41 boxes of cereal shows an unbiased sample standard deviation of 0.16 ounces. At a level of confidence of 95%, determine whether the variance in the cereal box fillings is exceeding the design specification.

s² = 0.16² = 0.0256

(n − 1)·s²/χ²_{(n−1), 0.025} ≤ σ² ≤ (n − 1)·s²/χ²_{(n−1), 0.975}

40·0.0256/59.342 ≤ σ² ≤ 40·0.0256/24.433

0.0172 ≤ σ² ≤ 0.0419   (larger than 0.02)

We cannot exclude, at 95% of probability, that the variance is more than 0.02. We cannot exclude, in fact, at the 95% confidence level, that it doubles the prescribed value. So we are not working at the desired level of quality.
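The chi-square quantiles and the interval can be reproduced with SciPy (a sketch, assuming scipy is installed; numbers from the cereal-box example):

from scipy.stats import chi2

n, s2 = 41, 0.16 ** 2                       # 41 boxes, s^2 = 0.0256
lo = (n - 1) * s2 / chi2.ppf(0.975, n - 1)  # 1.024 / 59.342, about 0.017
hi = (n - 1) * s2 / chi2.ppf(0.025, n - 1)  # 1.024 / 24.433, about 0.042
print(lo, hi)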
4. Confidence intervals for the regression coefficients: Data Analysis

[Screenshots of the Excel regression procedure: the data list individuals with gender, height and weight; by default 1 − α = 0.95, but we can edit it; the output table reports the estimated slope together with its confidence limits.]
In the example, the point estimate of the slope (as calculated on a sample of size n = 55) is 0.9053. We expect that, with 95% probability, the true slope at the population level will be between 0.6443 and 1.1664. That means that we have uncertainty about the value, but not about the DIRECTION of the dependence (that is, about the sign of the slope).

A similar argument can be used for the intercept.
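The slide's interval comes from Excel's regression tool; an equivalent computation can be sketched in Python with scipy.stats.linregress. The height/weight data of the n = 55 example are not reproduced here, so the arrays below are hypothetical placeholders:

import numpy as np
from scipy.stats import linregress, t

# hypothetical data standing in for the height/weight sample of the example
height = np.array([160, 165, 170, 172, 175, 168, 180, 158, 177, 169], dtype=float)
weight = np.array([55, 60, 68, 70, 74, 63, 80, 52, 76, 66], dtype=float)

res = linregress(height, weight)
df = len(height) - 2
half = t.ppf(0.975, df) * res.stderr                   # stderr is the slope's standard error
print(res.slope, res.slope - half, res.slope + half)   # point estimate and 95% interval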
More details later on in the course.