CHAPTER 1

Statisticians fall asleep
faster by taking a random
sample of sheep.

Chapter 1: Data and Distributions

Data Example: Determine the nature of the "unusual episode" whose data appears below. You are
permitted to ask me "yes/no" questions suggested by the data.

Data Set: Population at Risk and Death Rates for an Unusual Episode [The article associated with
this dataset appears in the Journal of Statistics Education, Volume 3, Number 3 (November 1995).]

Tables: Population at Risk, Deaths, and Death Rates for an "Unusual Episode"

By Economic Status and Sex

                 Population Exposed        Number of              Deaths per 100
                 to Risk                   Deaths                 Exposed to Risk
Economic
Status           Male   Female   Both      Male   Female   Both   Male   Female   Both
I (high)          180      145    325       118        4    122     65        3     37
II                179      106    285       154       13    167     87       12     59
III               510      196    706       422      106    528     83       54     73
Other             862       23    885       670        3    673     78       13     76
Total            1731      470   2201      1364      126   1490     80       27     67

By Economic Status and Age

                 Population Exposed        Number of              Deaths per 100
                 to Risk                   Deaths                 Exposed to Risk
Economic
Status           Adult   Child   Both      Adult   Child   Both   Adult   Child   Both
I (high)           319       6    325        122       0    122      38       0     37
II                 261      24    285        167       0    167      64       0     59
III                627      79    706        476      52    528      76      66     73
Other              885       0    885        673       0    673      76       -     76
Total             2092     109   2201       1438      52   1490      69      48     67

Section 1.1: Populations, Samples, and Processes

Definition: Statistics is the discipline concerned with the optimal acquisition (garbage in = garbage
out) and analysis of data in order to model a population or process.

"The long-range contribution of statistics depends not so much
upon getting a lot of highly trained statisticians into industry
as it does in creating a statistically minded generation of
physicists, chemists, engineers, and others who will in any
way have a hand in developing and directing the production
processes of tomorrow." - W. A. Shewhart & W. E. Deming

Definition: Target population - the set (actual or conceptual) of all entities of interest in a survey or
study; the group about which the statistician wishes to draw conclusions.
• Different surveys have different target populations.
• The target population, including its time frame, must be clearly defined for a sample to be drawn from it.

Definition: Variable - a characteristic of interest from the population (e.g., height, color, response)
which varies across the entities in the population.

Definition: Observation - the values of a variable of interest for a given entity.

Definition: Census - an attempt to acquire data on every entity in the target population.

Example 1. When I go up for tenure, the promotion and tenure committee needs feedback about my
teaching. For this study, a possible target population is: [handwritten answer illegible]

Branches of Statistics: Descriptive and Inferential Statistics

Descriptive Statistics: Graphical and numerical techniques for describing or summarizing data that
capture the essence of the data.
• Some numerical summaries for describing data include: mean, median, mode, maximum, standard
  deviation, variation, correlation coefficient (bivariate data)
• Some graphical methods for displaying data include: dotplot, stem-and-leaf diagram, histograms,
  time series plots (time-ordered data)

We can summarize the relevant information in a data set by determining what data are present and how often they occur. This representation is the distribution of the data.

Definition: The distribution of a data set provides the following information about a set of numbers:
1. the unique numbers which appear in the set, and
2. how often each number appears in the set.

You can read more about distributions in the optional notes entitled "Distributions and the IID
Assumption" in the TMI folder.

Example 2. (Data taken from site http://www.shodor.org/interactivate/activities/boxplot/). Only 1
variable of interest: Average gas mileage for year 2000 cars by size.

49, 49, 45, 45, 41, 38, 38, 38, 40, 37, 37, 34, 35, 36, 35, 38, 38, 32, 32, 32, 37, 31, 32, 31, 32, 30, 30, 32, 30, 30, 29, 28, 29,
29, 29, 30, 28, 27, 29, 30, 28, 27, 28, 27, 27, 29, 29, 29, 26, 27, 25, 25, 25, 25, 25, 25, 25, 26, 26, 27

Frequency and Relative Frequency Distribution of the Data:

Definition: Inferential statistics are techniques used to make inferences about a population from a
sample of that population.

Population versus Process Data

Definition: Population data: Data in which the order in which the observations were collected does
NOT matter!

Definition: Process data: Data in which the order in which the observations were collected does
matter! The order provides additional information that we lose by representing it as a distribution.

Example 3. Population versus process data.
(a) The height of all students in this classroom.
(b) The total number of minutes we meet each day in class.
(c) The number of ounces of peanut butter taken from a jar off an assembly line every hour.
(d) The number of books each student in this class bought for this quarter's classes.
(e) A student's test scores throughout the quarter.

Univariate, Bivariate, and Multivariate data
Univariate data: Observations on a single variable

Example 4. Number of trick-or-treaters at my door at various times this past Halloween

Number of trick-or-treaters by 10-minute intervals

Interval        Count
6:00 to 6:10    10
6:10 to 6:20    15
6:20 to 6:30    25
6:30 to 6:40    25
6:40 to 6:50    28
6:50 to 7:00    30
7:00 to 7:10    28
7:10 to 7:20    21
7:20 to 7:30    16
7:30 to 7:40    15
7:40 to 7:50    16
7:50 to 8:00    11
8:00 to 8:10    10
8:10 to 8:20     5

Frequency distribution for the # of trick-or-treaters per 10 minutes:

Is this process data? [handwritten answer illegible]

Bivariate data: Observations are made on two variables
Example 5. For each trick-or-treater at my door on Halloween, I recorded their age and weight:

Age (years)   Weight (pounds)
 5             50
 6             39
18            138
 2             24
 4             40
68            110
28            156
...

Frequency distribution for the age of trick-or-treaters:

Multivariate data: Observations are made on more than two variables

Example 6. Collecting data on variables that affect the price of a house:

Number of bedrooms   Age of house (yrs)   Size of yard (acres)   Pool?
5                    15                   1.03                   No
4                    10                   0.33                   No
3                     2                   1.5                    Yes

Section 1.2: Visual Displays for Univariate Data
Read text pages: 8-16 (up to Histograms for Unequal Class Widths), and 17 (Histogram Shapes) - 18

Recall: two major branches of statistics
• Descriptive Statistics
  o Descriptive statistics describe the data in your population or sample.
  o The most common descriptive statistics provide information about a sample's central
    tendency (mean, median, mode) and variability (variance, standard deviation, range).
• Inferential Statistics
  o Inferential statistics are techniques for drawing reasonable inferences about populations
    based on samples from the population.

Types of variables
• Qualitative data or variables have only categories or qualities
  1. Car type: Saturn, Ford, Chevy, BMW...
  2. Customer satisfaction: Excellent, Good, Fair, Poor, Horrible
  3. Classification of a bolt meeting a length requirement: Acceptable or Not Acceptable
• Quantitative data or variables have numerical measures
  1. Daily temperature of my office: 68° F, 72° F, 80° F
  2. Time for Bounty to soak up a milk spill with a 1.5" radius: 30.0 sec, 15.3 sec, 13.2 sec
  3. The length of a bolt measured in centimeters: 12.6, 12.9, 13.4, 12.3, 13.6, 13.5, 12.6

Graphs for displaying QUALITATIVE DATA:
• Pie Chart, Bar Chart - pick up any issue of USA Today and you'll probably see at least one
  of these!

Example 1. Using a pie chart versus a bar chart to display qualitative data. Which do you prefer?

[Figure: pie chart and bar chart of the number of emails in my inbox on 11/8, 11/9, 11/10, 11/11, and 11/12; bar chart horizontal axis: Date]

Going off on a tangent for a minute

Anytime throughout the quarter, bring me or email me a bad graphic example for a bonus point!
[Figures: examples of bad graphics, among them charts of GNP change and ODA change per person ($US), 1990-1997, for donor nations]

Graphical methods that we will consider for displaying QUANTITATIVE DATA: dot diagrams,
histograms, stem-and-leaf plots, box plots

The method used is determined by the type of data and the idea to be presented with the data.

• Dot Plots (or Dot Diagrams) represent each observation by a dot on a numerical axis

Example 2. 12 measurements on the strength of paper to be used in cardboard tubes (in pounds)

163 145 165 170
155 168 163 201 179 139 [remaining values not legible]

[Figure: Dotplot of Paper Strengths (in pounds); axis from 140 to 200]

  o Dot Plots make it easy to pick out outliers. An outlier is an unusual observation or extreme
    value, and it usually warrants further attention.
  o Used on "reasonably small" data sets

• Histograms: Unlike our text, histograms will not be discussed in the context of two separate
  categories, discrete and continuous. We'll be following the "continuous method" for all histogram
  displays (as described below).
  o Frequency distributions condense data into a more manageable and readable format.
  o Frequency Distribution - a table giving a count (frequency) of the number of values within a
    particular category or class interval.
  o Relative Frequency Distribution - a table giving the proportion (rather than the number) of
    values falling within a particular category or class interval.
  o Histogram - a graphical way to display a frequency (or relative frequency, cumulative
    frequency, cumulative relative frequency) distribution.
  o Contrary to what the text suggests (page 16), it's best to keep the class intervals the same width;
    otherwise, the graph is hard to read (in terms of summarizing data).
  o Be careful of using too many or too few cells: Number of bins ≈ √n, where n is the number
    of data values.

Example 3. Age at which the U.S. presidents began their first terms. Data taken from
http://home.comcast.net/~sharenday7/Presidents/APOO.htm

57 61 57 57 58 57 61 54 68 51 49 64 50
48 65 52 56 46 54 49 51 47 55 55 54 42
51 56 55 51 54 51 60 62 43 55 56 61 52
69 64 46 54

Note: Clearly define your bin "boundaries" so your reader knows which bins contain the borderline
values. Label your bins clearly when there is a question involved; don't leave your reader to decipher
your binning scheme!

Frequency Table of Ages. Let x represent the age of the president at the beginning of his term.

Age of Presidents at beginning of their first terms | Frequency | Relative Frequency | Cumulative Frequency | Cumulative Relative Frequency

[Figure: Frequency Histogram of Presidents' 1st day of term age]

Relative Frequency Histogram in Minitab (gives percents, not proportions)

[Figure: Histogram of Presidents' 1st day of term age; horizontal axis labeled 42, 46, 50, 54, 58, 62, 66, 70]
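
The frequency table for Example 3 can also be filled in by machine. The sketch below is one minimal way to do it in Python and is not the Minitab procedure used for the histograms. The function name frequency_table and the class boundaries (start at 42, width 4, a borderline value counted in the upper class) are assumptions, chosen so that the number of classes is close to √n.

```python
import math

# Ages of U.S. presidents at the start of their first terms (Example 3 data).
ages = [57, 61, 57, 57, 58, 57, 61, 54, 68, 51, 49, 64, 50,
        48, 65, 52, 56, 46, 54, 49, 51, 47, 55, 55, 54, 42,
        51, 56, 55, 51, 54, 51, 60, 62, 43, 55, 56, 61, 52,
        69, 64, 46, 54]

def frequency_table(data, start, width):
    """Tabulate classes lo <= x < lo + width (borderline values go in the upper class).

    Assumes integer data with start <= min(data).
    """
    n = len(data)
    n_bins = (max(data) - start) // width + 1
    counts = [0] * n_bins
    for x in data:
        counts[(x - start) // width] += 1
    rows, cum = [], 0
    for i, freq in enumerate(counts):
        cum += freq
        lo = start + i * width
        # (lower bound, upper bound, frequency, relative, cumulative, cumulative relative)
        rows.append((lo, lo + width, freq, freq / n, cum, cum / n))
    return rows

n = len(ages)
suggested_bins = round(math.sqrt(n))               # rule of thumb: about sqrt(n) bins
table = frequency_table(ages, start=42, width=4)   # ASSUMED class boundaries

print(f"n = {n}, suggested number of bins = {suggested_bins}")
print(f"{'Class':<13}{'Freq':>6}{'Rel':>8}{'Cum':>6}{'CumRel':>8}")
for lo, hi, freq, rel, cum, cum_rel in table:
    print(f"{lo} <= x < {hi:<3}{freq:>6}{rel:>8.3f}{cum:>6}{cum_rel:>8.3f}")
```

Writing each class as "lo <= x < hi" follows the note above about borderline values: a president aged exactly 46 is counted in the class 46 <= x < 50, not in 42 <= x < 46.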

Cool Graphic: Histogram of heights constructed using the people. Photograph by Peter Morenus
in conjunction with Professor Linda Strausberg, University of Connecticut. Subjects are University of
Connecticut genetics students, females in white tops, males in dark tops.

Shapes of Histograms

[Figure: three example histograms (vertical axis: Percent): symmetric and unimodal; negatively skewed; positively skewed]

• Stem-and-Leaf Display
  o Effective display of *large* data sets
  o Each data value has 2 parts:
    > Stem: one or more of the leading digits
    > Leaf: remaining digits after the stem value
  o Possible problems: too few stems or too many stems
  o Information a stem-and-leaf diagram conveys:
    > identification of a typical value
    > extent of spread about the typical value
    > presence of gaps in data
    > extent of symmetry in the distribution of values
    > number and locations of peaks
    > presence of outlying data values

Steps for constructing a stem-and-leaf diagram by hand:
(1) Select one or more digits for the stem values. The trailing digits become the leaves.
(2) List possible stem values in a vertical column.
(3) Record the leaf for every observation beside the corresponding stem value.
(4) Indicate the units for stems and leaves someplace in the display!

Example 4. Stem-and-leaf diagram for age of President at beginning of first term. See Example 3 for
data.

Stem-and-Leaf Display: Presidents 1st day of term age

Stem-and-leaf of Presidents 1st day of term age   N = 43
Leaf Unit = 1.0

   2   4  23
   2   4
   5   4  667
   8   4  899
  14   5  011111
  16   5  22
 (9)   5  444445555
  18   5  6667777
  11   5  8
  10   6  0111
   6   6  2
   5   6  445
  HI 68, 69

Here's another view of the same data when the display is incremented by 10's instead of 2's:

Stem-and-Leaf Display: Presidents 1st day of term age

Stem-and-leaf of Presidents 1st day of term age   N = 43
Leaf Unit = 1.0

   8   4  23667899
 (25)  5  0111112244444555566677778
  10   6  01112445
  HI 68, 69

Section 1.3: Describing Distributions

By its definition, statistics is concerned with the acquisition and analysis of data. Often, the data of
interest are numbers or measurements obtained from observation of the units in a population or
process. More formally, data are facts that represent particular characteristics of the units. The
characteristics themselves are variables.

e.g. The time that the sun rises each morning at the Terre Haute International Airport is a
variable. The fact that the sun rose yesterday morning at 7:47 a.m. is data.

Recall: There are two types of variables:
1. Categorical (Qualitative) Variables - variables whose values are categories.
   e.g. gender, favorite soda, state of residence, etc.
   In this class, we do not do much with categorical variables, but there is an entire branch
   of statistics dealing with categorical data analysis (if you're interested).
2. Quantitative Variables - variables which are numeric by nature.
   e.g. income, weight, height, time required to complete a task, etc.

Two types of quantitative variables:
Discrete and Continuous (see page 11 of our text)

The support of a variable is the set of possible values that the variable can assume. The support of a variable
is either:

(1) Discrete: the support is discrete if its set of possible values is either finite or countably
infinite (e.g., 3, 4, 5, ...). Variables with a discrete support are discrete variables.
Examples:
• The number of siblings you have
• The number of books you purchased for this quarter
• The number of tails obtained when you flip a coin twice
• The number of non-smooth Lego side tosses out of 100
• The number of matching lottery picks
• The number of rolls of a die before a "6" appears
• The number of questions correct on the first test
• The number of phone calls you received last week

(2) Continuous: the support is continuous if its set of possible values consists of an entire
interval of real numbers. These variables are continuous variables. Text: A continuous
variable is one whose value is determined by making a measurement of some sort.
Examples:
• The time we spend in class today is any real number in the interval 0 to 55 minutes
• The temperature in my office each morning at 7:30 a.m.
• The weight of your grandmother
• The miles per gallon achieved by your car
• The time on the phone during a call with mother
• The waiting time in Subway's line for lunch
• The lifetime of the battery in your computer

Example 1. I surveyed 200 Terre Haute residents this morning as to how many hours they slept last
night. Suppose I can construct a histogram of the sleep hours in which:
• for each rectangle, area = relative frequency of the interval
• total area of all rectangles = 1.

With a large amount of data, we can envision a smooth curve being a model for the relative frequency histogram.

[Figure: histogram of sleep hours last night for 200 TH residents]

Using the histogram above, determine the approximate proportion of TH residents that:
(a) slept less than 4 hours last night.
(b) slept at least 10 hours last night.
[handwritten answers illegible]

The density function that "best" fits the data above is f(x) = [formula illegible] for all real values of x.

We can now calculate the proportion of values with the function f(x). To calculate the proportion of TH residents that slept less than 4 hours, we just need to determine the area under the curve f(x) for x's
from 0 to 4. Similarly for residents that slept at least 10 hours.

$\int_0^4 f(x)\,dx \approx 0.1573$ (via Maple).

$\int_{10}^{\infty} f(x)\,dx \approx 0.0228$ (via Maple).
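
The two Maple values can be reproduced without Maple once a density is fixed. The formula for f(x) is not legible here, so the sketch below ASSUMES a normal density with mean 6 hours and standard deviation 2 hours. That choice is a guess, but it is the normal curve that gives back both 0.1573 and 0.0228. The helper name normal_cdf is also just for illustration.

```python
import math

def normal_cdf(x, mean, sd):
    """Area under a normal density from -infinity up to x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

# ASSUMPTION (not stated legibly in the notes): normal density, mean 6, sd 2.
MEAN, SD = 6.0, 2.0

# (a) slept less than 4 hours: area under the curve for x from 0 to 4
less_than_4 = normal_cdf(4, MEAN, SD) - normal_cdf(0, MEAN, SD)

# (b) slept at least 10 hours: area under the curve from 10 upward
at_least_10 = 1.0 - normal_cdf(10, MEAN, SD)

print(round(less_than_4, 4))   # 0.1573
print(round(at_least_10, 4))   # 0.0228
```

Note that the lower limit 0 matters in part (a): taking the area all the way from minus infinity to 4 would give about 0.1587 instead.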

Definition: A density function f(x) is used to describe (at least approximately) the population or
process distribution of a continuous variable x. The graph of f(x) is called the density curve, and
f(x) must satisfy the following properties:
(1) $f(x) \ge 0$ for all x,
(2) $\int_{-\infty}^{\infty} f(x)\,dx = 1$ (that is, the area under the density curve is 1),
(3) the proportion of x values between the values a and b is $\int_a^b f(x)\,dx$.

FACT: Since there is no area under a density curve at a single point:
• the proportion of values between a and a is 0.
• "the proportion of x satisfying a ≤ x ≤ b" = "the proportion of x satisfying a < x < b"

Example 2. Suppose I take a bus to RHIT every day and a bus arrives at my bus stop every 5 minutes.
Because I don't always leave my house at exactly the same time, I don't always arrive at the bus stop
at the same time. Let x be my waiting time (in minutes) at the bus stop. Then x is a continuous
variable with support 0 ≤ x ≤ 5. One possible density curve that I can use to model x is:

$f(x) = \begin{cases} 1/5 & 0 \le x \le 5 \\ 0 & \text{otherwise} \end{cases}$

We can graph f(x). Is f(x) a legitimate density function?

What proportion of the time will I have to wait between 2 and 4 minutes for the bus?

$P(2 < x < 4) = \int_2^4 \frac{1}{5}\,dx = \frac{2}{5}$

What proportion of the time will I have to wait at most 1 minute?

$P(x < 1) = \int_0^1 \frac{1}{5}\,dx = \frac{1}{5}$

Example 3. "Time headway" in traffic flow is the elapsed time between the time that one car finished
passing a fixed point and the instant that the next car begins to pass that point. The following density
function f(x) is essentially the time headway (in seconds) for two randomly chosen consecutive cars on
a freeway during a period of heavy flow as suggested in "The Statistical Properties of Freeway Traffic"
(Transportation Research, Volume 11: 221-228):

$f(x) = \begin{cases} 0.15\,e^{-0.15(x-0.5)} & x \ge 0.5 \\ 0 & \text{otherwise} \end{cases}$

[Figure: graph of f(x) for x from 0 to 14]

Is f(x) a legitimate density function? YES! Why?
(1) $f(x) \ge 0$ for all real numbers x
(2) $\int_{-\infty}^{\infty} f(x)\,dx = 1$:

$\int_{-\infty}^{\infty} f(x)\,dx = \int_{0.5}^{\infty} 0.15\,e^{-0.15(x-0.5)}\,dx = 0.15\,e^{0.075} \int_{0.5}^{\infty} e^{-0.15x}\,dx$
$= \lim_{B \to \infty} \left[ -e^{0.075}\,e^{-0.15x} \right]_{0.5}^{B}$
$= \lim_{B \to \infty} \left( -e^{0.075}\,e^{-0.15B} + e^{0.075}\,e^{-0.075} \right)$
$= 1.$

Determine the proportion of cars in which the headway time between them and the following car is at most five seconds.

Proportion of headway times between 0.5 and 5 seconds $= 0.15\,e^{0.075} \int_{0.5}^{5} e^{-0.15x}\,dx \approx 0.4908$

Determine the proportion of cars in which the headway time between them and the following car is
exactly 2 seconds.

$P(x = 2) = \int_2^2 f(x)\,dx = 0$
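
The density calculations in Examples 2 and 3 can be double-checked numerically. The sketch below uses a plain midpoint rule in place of the hand integration. The function names, the number of steps, and the cutoff of 200 seconds standing in for infinity are all assumptions made for illustration.

```python
import math

def headway_density(x):
    """Example 3 density: 0.15 * exp(-0.15 * (x - 0.5)) for x >= 0.5, else 0."""
    return 0.15 * math.exp(-0.15 * (x - 0.5)) if x >= 0.5 else 0.0

def bus_density(x):
    """Example 2 density: 1/5 on 0 <= x <= 5, else 0."""
    return 0.2 if 0.0 <= x <= 5.0 else 0.0

def area(f, a, b, steps=100_000):
    """Midpoint-rule approximation of the area under f between a and b."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))

# Property (2): total area is 1. The upper limit 200 is an ASSUMED stand-in for infinity.
total_headway = area(headway_density, 0.5, 200.0)
total_bus = area(bus_density, 0.0, 5.0)

at_most_5 = area(headway_density, 0.5, 5.0)   # the notes report 0.4908
closed_form = 1.0 - math.exp(-0.15 * 4.5)     # same area from the antiderivative
wait_2_to_4 = area(bus_density, 2.0, 4.0)     # 2/5
exactly_2 = area(headway_density, 2.0, 2.0)   # zero-width interval, so no area

print(round(total_headway, 4), round(total_bus, 4))
print(round(at_most_5, 4), round(closed_form, 4))
print(round(wait_2_to_4, 4), exactly_2)
```

The last line illustrates the FACT stated earlier: an interval of width zero contributes no area, so the proportion of headways equal to exactly 2 seconds is 0.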

One special continuous distribution: the exponential distribution (text page 29)

Definition: A continuous random variable x is said to have an exponential distribution with
parameter λ > 0 if its density function is:

$f(x) = \begin{cases} \lambda e^{-\lambda x} & x \ge 0 \\ 0 & x < 0 \end{cases}$

Example 4. Suppose the length of a phone call (in minutes) x has an exponential distribution with
parameter λ = 1/10.
(a) Determine the density curve and graph it. [handwritten answer illegible]
(b) What proportion of phone calls made in the U.S....
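
For part (a), plugging λ = 1/10 into the definition gives the density curve directly. The sketch below evaluates a few points on that curve and shows how a proportion would be computed from it. The function names are illustrative, and the 5-minute question is a HYPOTHETICAL example; it is not the truncated part (b).

```python
import math

def exponential_density(x, lam):
    """f(x) = lam * exp(-lam * x) for x >= 0, and 0 for x < 0."""
    return lam * math.exp(-lam * x) if x >= 0 else 0.0

def exponential_proportion(a, b, lam):
    """Proportion of values between a and b (0 <= a <= b): area under f from a to b."""
    return math.exp(-lam * a) - math.exp(-lam * b)

LAM = 1 / 10  # parameter from Example 4

# Points on the density curve for part (a); the curve starts at height lam when x = 0
# and decays toward 0 as x grows.
curve = [(x, round(exponential_density(x, LAM), 4)) for x in range(0, 41, 10)]

# HYPOTHETICAL question (not from the notes): calls lasting at most 5 minutes.
at_most_5_min = exponential_proportion(0, 5, LAM)

print(curve)
print(round(at_most_5_min, 4))
```

The proportion formula comes from integrating the density: the area under λe^(-λx) from a to b is e^(-λa) - e^(-λb).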
 Spring '08
 DeVasher
 Statistics
