Basics of the Course
The course is taught using PowerPoint.
These lecture notes WILL change as the
term progresses. Special Slide Pictures Data Notation
Denote data by $x_1, x_2, x_3, \dots, x_n$ where n
is the number of data values we have, called
the sample size.
The collection of $x_1, x_2, x_3, \dots, x_n$ is called a
dataset whereas a particular value $x_i$ is
called a datum, data value or observation.
Example (Spoof)
http://ca.youtube.com/watch?v=MQw12_kNAhU&feature=related Example
Description
This data set gives the average heights and weights for
American women aged 30–39.
Obs  height  weight
1    58      115
2    59      117
3    60      120
4    61      123
5    62      126
Data Types
NOTE: A quantitative variable can be made qualitative…we’ll see
in a second…
Examples with Clickers
1) Height
2) Grades
Clicker Responses:
A) Qualitative
B) Categorical
C) Quantitative
D) Both A and B
Quantitative Data Examples with Clickers
1) Height
2) Number of Cats Owned by Canadians
Clicker Responses:
A) Discrete
B) Continuous
C) Both
D) Neither
Qualitative Data Examples with Clickers
1) Body Size: Skinny, Normal, Obese
2) Type of Stone: Granite
Clicker Responses:
A) Discrete
B) Continuous
C) Both
D) Neither
Analysis
Raw data is hard to analyse. For example,
consider the pH values below. Remember a
pH of 7 is neutral, a pH < 7 is acidic and a pH >
7 is basic.
Are the 100 lakes below acidic, basic or
neutral?
Data
5 6 8 6 7 8 8 7 8 6 8 6 7 7 5 5 7
9 8 7 6 8 8 8 5 6 8 8 5 7 6 7 8 8
4 8 6 6 6 6 6 7 8 8 5 7 7 5 7 8
7 8 6 6 7 6 8 8 6 8 8 7 7 8 7 6
10 6 8 6 6 8 6 6 7 8 7 6 8 7 6 7
7 7 6 5 7 8 7 7 7 6 6 8 7 7 7 7
8 7
Well???
Dataset Characteristics
3 Characteristics
1.
2.
3. Analysis
Two Techniques:
1.
2. Example Dataset
Consider the following data:
1,2,2,3,3,3,4,4,4,4,5,5,5,6,6,7
We can build a display simply by ticking off
every time we see a number. 1,2,2,3,3,3,4,4,4,4,5,5,5,6,6,7 Center
Rough Definition – The middle of the data
Pictorially
Spread
Rough Definition – How separated our data
values are.
Shape
The appearance of the data. Shape
The shape of a dataset can be determined
numerically using measures such as
Kurtosis and Skew – but we will not
investigate these statistics in this course. Center
There are 3 measures of center:
A)
B)
C) Mode
The most popular value. Also the most useless statistic.
e.g.
1,1,1,2,20 Mean
You would call it an “average”.
Notation:
Data:
Mean: Example
Consider the data: 1, 1, 1, 2, 20
The mean is:
Median
The middle value of the data.
Notation (Median): Notation (Sorted Data): Algorithm: Median
Given the data: $x_1, x_2, x_3, \dots, x_n$.
1. Sort the data from smallest to largest.
2. If n is odd, then take the middle value.
3. Else if n is even, take the average of the
middle 2 values.
Example
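The three steps of the median algorithm can be sketched in a few lines of Python. This is only an illustration: the function name `median` and the even-sized test data are my own choices, while 1, 1, 1, 2, 20 is the dataset used earlier for the mode.

```python
def median(values):
    """Median via the sort-then-take-the-middle algorithm."""
    ordered = sorted(values)      # step 1: sort from smallest to largest
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:                # step 2: n odd, take the middle value
        return ordered[middle]
    # step 3: n even, average the two middle values
    return (ordered[middle - 1] + ordered[middle]) / 2


print(median([1, 1, 1, 2, 20]))   # n = 5, odd
print(median([1, 2, 3, 4]))       # n = 4, even (illustrative data)
```

Sorting first is what makes "the middle" meaningful; without it the middle position just depends on the order the data happened to arrive in.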
1. Sort
2. n odd = middle
n even = average
1,1,20,2,1
Example
1,1,20,3,1,4,12,2 What is the median?
A) 1
B) 1.5
C) 2
D) 2.5
E) 3 Example
1. Sort
1, 1, 1, 2, 3, 4, 12, 20
2. n odd = middle
n even = average
n = 8
Q2 =
Outliers
Outliers are values that are more extreme
than the others.
For example: 1, 2, 3, 4, 1000
For example: 0.8, 11, 0.1, 0.6, 1, 0.3, 0.9
Summary: 1,1,1,2,20
Mode: 1, Mean: 5, Median: 1
Question
Why is the mean different from the median
and mode? Order Statistics
The median is called the “second quartile”.
This implies there are “other” quartiles.
A quartile derives its name from quarter and
each quartile divides the data into quarters.
Pictorially
In Words
25% of our data is below Q1, the first quartile.
50% of our data is below Q2, the second
quartile.
75% of our data is below Q3, the third
quartile. Algorithm: Q1
1. Perform the Median Algorithm.
2. Remove all data above the median.
3. Perform the Median Algorithm on the
remaining data.
4. This is the middle of the lower half of the
data, the first quartile. Example
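A rough sketch of the Q1 algorithm in Python, applied to the six values of this example. One assumption is labelled in the code: when n is odd, "remove all data above the median" is read as keeping the median itself in the lower half. Other textbooks and software use other quartile conventions and can give slightly different answers. The name `first_quartile` is mine.

```python
from statistics import median


def first_quartile(values):
    """Q1 as the median of the lower half of the sorted data."""
    ordered = sorted(values)
    n = len(ordered)
    # Assumption: for odd n the median stays in the lower half.
    lower_half = ordered[: (n + 1) // 2]
    return median(lower_half)


data = [0.8, 11, 0.1, 0.6, 1, 0.3]
print(sorted(data))          # the sorted data
print(median(data))          # Q2, the median
print(first_quartile(data))  # Q1, the middle of the lower half
```

With an even number of values, as here, the lower half is simply the smallest n/2 values, so the labelled assumption does not affect this example.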
Given the data:
0.8, 11, 0.1, 0.6, 1, 0.3
1. Sort it
0.1, 0.3, 0.6, 0.8, 1.0, 11.0
2.
RECALL
Dataset Characteristics
3 Characteristics
1. Center
2. Spread
3. Shape Spread
There are several ways in which we can calculate spread:
1.
2.
3.
4.
5.
Range
The range gives the distance between the
largest and smallest values.
Formula in Words: Formula with Notation: Interquartile Range
The interquartile range gives the distance
covered by the middle 50% of the data.
Formula: Data:
Which dataset has
more spread?
A) 1
B) 2
C) 3
D) 1 = 2
E) none of the above Data 1:
1, 2, 3
Data 2:
1, 1, 1, 2, 3, 3, 3
Data 3:
100, 100.5, 101,
101.5, 102 Range Calculation
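As a sketch, the range (largest value minus smallest value) for the three datasets can be computed as follows. The dictionary and function names are illustrative only.

```python
datasets = {
    "Data 1": [1, 2, 3],
    "Data 2": [1, 1, 1, 2, 3, 3, 3],
    "Data 3": [100, 100.5, 101, 101.5, 102],
}


def data_range(values):
    """Distance between the largest and smallest values."""
    return max(values) - min(values)


ranges = {name: data_range(values) for name, values in datasets.items()}
for name, value in ranges.items():
    print(name, "range =", value)
```

The range only looks at two data values, so it cannot tell these three datasets apart even though they are laid out quite differently.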
Data 1:
1, 2, 3
Data 2:
1, 1, 1, 2, 3, 3, 3
Data 3:
100, 100.5, 101, 101.5, 102 IQR Calculation
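A hedged sketch of the IQR calculation, IQR = Q3 − Q1, for the same three datasets. The quartile convention is an assumption: each quartile is the median of one half of the sorted data, with the median kept in both halves when n is odd. A different quartile rule can change these numbers, so treat the output as one reasonable answer rather than the only one.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, Q3) as medians of the lower and upper halves."""
    ordered = sorted(values)
    n = len(ordered)
    # Assumption: for odd n the median belongs to both halves.
    lower_half = ordered[: (n + 1) // 2]
    upper_half = ordered[n // 2 :]
    return median(lower_half), median(upper_half)


def iqr(values):
    """Distance covered by the middle 50% of the data."""
    q1, q3 = quartiles(values)
    return q3 - q1


datasets = {
    "Data 1": [1, 2, 3],
    "Data 2": [1, 1, 1, 2, 3, 3, 3],
    "Data 3": [100, 100.5, 101, 101.5, 102],
}
for name, values in datasets.items():
    print(name, "IQR =", iqr(values))
```

Under this convention the IQR separates Data 2 from the other two datasets, which the range alone does not.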
Data 1:
1, 2, 3
Data 2:
1, 1, 1, 2, 3, 3, 3
Data 3:
100, 100.5, 101, 101.5, 102 Standard Deviation
In words:
the standard deviation is approximately the
average distance the data values are from
the center. Formulas
1. Not nice for Calculation, but great for
interpretation. Formulas
2. Useful for calculation but NOT
interpretation. Formulas
3. Another one! Useful for calculation but
NOT interpretation. 3 Formulae Example
Consider the data 1, 2, 3. Calculate the st.
dev. Example
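The formulas themselves are blank in this copy of the notes, so the sketch below uses the usual sample standard deviation (dividing by n − 1) in two equivalent forms: one built from deviations, and one built from the sums Σx and Σx². Whether these match the three formulae on the slides exactly is an assumption. Function names are mine.

```python
import math
from statistics import stdev


def sd_from_deviations(values):
    """s = sqrt( sum of (x - mean)^2 / (n - 1) ): good for interpretation."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))


def sd_from_sums(n, sum_x, sum_x2):
    """s = sqrt( (sum of x^2 - (sum of x)^2 / n) / (n - 1) ): good for calculation."""
    return math.sqrt((sum_x2 - sum_x ** 2 / n) / (n - 1))


data = [1, 2, 3]
print(sd_from_deviations(data))
print(sd_from_sums(len(data), sum(data), sum(x * x for x in data)))
print(stdev(data))  # Python's built-in sample standard deviation, for comparison
```

The sums version is handy when only n, Σx and Σx² are given and the individual data values are not.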
Given: $\sum_{i=1}^{10} x_i = 10$; $\sum_{i=1}^{10} x_i^2 = 1500$
Calculate the standard deviation:
Interpretation
Deviation
Definition in words: Definition numerically: Standard Deviation
The standard deviation is approximately the
average deviation.
Why approximately????
Deviation Example
Consider the data: 1, 2, 3
Calculate the average deviation: Clicker Question
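A quick sketch of the average-deviation calculation, taking "deviation" to mean a data value minus the mean. That reading is an assumption, since the definition is blank in this copy. The second function shows one common repair, squaring the deviations before averaging.

```python
def average_deviation(values):
    """Average of (x - mean) over all data values."""
    mean = sum(values) / len(values)
    deviations = [x - mean for x in values]
    return sum(deviations) / len(deviations)


def average_squared_deviation(values):
    """Average of (x - mean)^2: squaring stops positives and negatives cancelling."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / len(values)


data = [1, 2, 3]
print(average_deviation(data))
print(average_squared_deviation(data))
```

Try it with any three numbers: the positive and negative deviations cancel. With values that are not exactly representable as floats the result may be a tiny rounding error instead of exactly zero.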
Pick 3 numbers. Calculate the average
deviation. The answer is:
A) 0
B) >0
C) <0
D) I just want the clicker mark.
E) None of the above.
What's the problem!!!???
How do we correct it?????
Other Issues
But….
1. Square rooting doesn’t undo squared
terms!
Example:
√(1² + 2² + 3²) ≠ √(1²) + √(2²) + √(3²)
2. Because of “1”, our value for s is too small,
so we divide by n − 1 instead of n.
n vs n − 1
Degrees of Freedom
n − 1 is called the degrees of freedom.
Another way of thinking about degrees of
freedom:
Suppose I gave you n data values at
random. They are “free” to be whatever I
want them to be. Degrees of Freedom Continued
Now, instead of n data values, I give you n − 1
values + the average. Is that last data
value, the nth, “free”? Range Vs. Standard Deviation
(Typical Plot)
[Figure: standard deviations marked along an axis running from the minimum through the center to the maximum]
Maximum − Minimum = 6s
Which means....
s = Range/6
Note: Sometimes it is not 6 but 4 or another
constant...this depends on the data. Interpretation
For a set of data, the standard deviation is 5.
Is this big, small or uncertain?
A) Big
B) Small
C) Uncertain
Interpretation and Units
Variance
The variance is merely the square of the
standard deviation.
Notation: Coefficient of Variation (CV)
Formula: Interpretation/Use: Example
The lengths of the fish Riley catches, in m on
Monday: 1, 2, 3
In cm on Tuesday: 100, 200, 300
Surface Investigation
Monday Tuesday Which has the
greatest spread?
A) Monday
B) Tuesday
C) Neither Answer Units
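One way to check the fish question numerically, and to see why units matter. The CV formula is blank in this copy, so the sketch assumes the usual definition, CV = s / mean (often reported as a percentage). The function name is mine.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Assumed definition: sample standard deviation divided by the mean."""
    return stdev(values) / mean(values)


monday_m = [1, 2, 3]            # lengths in metres
tuesday_cm = [100, 200, 300]    # lengths in centimetres

print("s, Monday: ", stdev(monday_m))
print("s, Tuesday:", stdev(tuesday_cm))
print("CV, Monday: ", coefficient_of_variation(monday_m))
print("CV, Tuesday:", coefficient_of_variation(tuesday_cm))
```

The standard deviations differ by a factor of 100 purely because of the units, while the CV has no units, so it compares the two days on equal footing.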
The standard deviation, mean, mode,
median all have the same units as the
data.
The variance, which is equal to the standard
deviation squared, has units squared.
Graphical Techniques
In addition to numeric techniques, we have
graphical techniques that can be used to
analyze data.
These graphical techniques include
boxplots, dot plots etc… Example Dataset
Consider the following data:
1,2,2,3,3,3,4,4,4,4,5,5,5,6,6,7
We can build a display simply by ticking off
every time we see a number. 1,2,2,3,3,3,4,4,4,4,5,5,5,6,6,7 Dotplots
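The tick-mark display for the example dataset can be sketched in Python as below. It is only a text stand-in for the picture on the slide, and the function name `tick_display` is my own.

```python
from collections import Counter

data = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 6, 6, 7]


def tick_display(values):
    """One line per distinct value, one tick per time that value occurs."""
    counts = Counter(values)
    return [f"{value}: {'|' * counts[value]}" for value in sorted(counts)]


for line in tick_display(data):
    print(line)
```

Reading the display sideways already hints at the center (around 4) and the roughly symmetric shape of this dataset.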
A dot plot is similar to this tick mark game
that we've played since children. Each
data value is plotted and replaced by a
point.
Hence the data 1,2,3 would look like: 1 2 3 Dotplots with Repeats
For a single set of data we may be
interested in the repeats. In such a case
we may draw a dot for every repeat.
Eg. 1,1,2,3 1 2 3 Example: Soybean What can you see with this
plot?? Frequency Distribution Example
Example: Who is your favourite actor?
A) Brad Pitt
B) This guy
C) Angelina Jolie
D) Her
E) Someone else/don't want to answer
Frequency
We build bars which have a height equal to
the frequency with which a response
occurs.
Non-Categorical Data
If our data is not categorical, we first build
intervals for the data.
Intervals are created subjectively but should
all be the same size.
The x axis shows the intervals while the y axis
shows the frequency.
Example: Grades
What is your Calculus 1 grade?
A) 85% to 100%
B) 70% to 85%
C) 55% to 70%
D) 40% to 55%
E) Prefer not to say. Intervals
These intervals are chosen subjectively.
I could have chosen any set. I did try to
choose them to make them all the same
size.
Clicker Questions
The shape is:
A) Bell
B) Skewed left
C) Skewed right
D) uniform (flat)
E) none of the above Clicker Questions
The center is:
A) 576
B) 578
C) 579
D) 581
E) none of the above Relative Frequency Example
We divide each frequency by n.
The plot is otherwise the same.
Example
[Figure: histogram of the depth of Lake Huron in feet, 1875–1972; x axis "Lake Huron" from 575 to 582, y axis "Density" from 0.00 to 0.35]
Clicker Question
What is the proportion of times that Lake
Huron was less than 578 feet deep?
A) 10%
B) 12%
C) 24%
D) Not able to say. Boxplots Unmodified Boxplot
[Figure: boxplot marking Min, Q1, Q2, Q3 and Max, with IQR = Q3 − Q1 and Range = Max − Min]
Recall: Outliers
Outliers: Data values that are more extreme
(larger or smaller) than the others.
E.g. 1,1,2,2,3,3,4,4,5,5,6,6,25 Finding Outliers
What is an outlier mathematically? Obviously from
the data above the number 25 is suspect.
An outlier is any value that is:
Less than the lower limit: LL = Q1 − 1.5(IQR)
Greater than the upper limit: UL = Q3 + 1.5(IQR)
Why 1.5 times??
Math to Prove 25 is an Outlier
1,1,2,2,3,3,4,4,5,5,6,6,25 Example Continued
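A sketch of the outlier check for this dataset, using LL = Q1 − 1.5(IQR) and UL = Q3 + 1.5(IQR). The quartile convention is an assumption (median of each half, with the median kept in both halves when n is odd). Other conventions shift the limits a little but still flag 25.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, Q3) as medians of the lower and upper halves."""
    ordered = sorted(values)
    n = len(ordered)
    # Assumption: for odd n the median belongs to both halves.
    return median(ordered[: (n + 1) // 2]), median(ordered[n // 2 :])


def outlier_limits(values):
    """Lower and upper limits: Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    q1, q3 = quartiles(values)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def find_outliers(values):
    lower_limit, upper_limit = outlier_limits(values)
    return [x for x in values if x < lower_limit or x > upper_limit]


data = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 25]
print("Q1, Q3:", quartiles(data))
print("LL, UL:", outlier_limits(data))
print("Outliers:", find_outliers(data))
```

The same limits are what the modified boxplot uses to decide where the whiskers stop and which points are drawn individually.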
1,1,2,2,3,3,4,4,5,5,6,6,25 Example Continued
1,1,2,2,3,3,4,4,5,5,6,6,25 Modified Boxplot
Unless stated otherwise I am asking
about the modified boxplot!
The difference: the upper whisker ends at
either the maximum or the largest data value
below the UL, whichever is closer to the center.
The lower whisker ends at either the minimum
or the smallest data value above the LL, whichever is
closer to the center.
Modified Boxplot
[Figure: modified boxplot marking Q1, Q2, Q3 and an outlier, with IQR = Q3 − Q1 and Range = Max − Min]
Example Using:
1,1,2,2,3,3,4,4,5,5,6,6,25 Boxplots and Shape
• The box (Q1 to Q3) gives a good
indication of the shape of our data.
[Figure: three boxplots labelled A, B and C]
Boxplot A is:
A) Symmetric (Bell)
B) Skewed left
C) Skewed right
D) Uniform (flat)
E) None of the above. Boxplots and Shape
• The box (Q1 to Q3) gives a good
indication of the shape of our data.
[Figure: three boxplots labelled A, B and C]
Boxplot B is:
A) Symmetric (Bell)
B) Skewed left
C) Skewed right
D) Uniform (flat)
E) None of the above. Stem And Leaf Plots Loss of Information
Individual data values are lost when we
draw a boxplot, histogram, dot plot etc…
The Stem and Leaf plot attempts to counter
this issue. Example:
Problem: Measurements of the annual flow
of the river Nile at Aswan 1871–1970.
Plan: Not relevant. Data
1120 1160 963 1210 1160 1160 813 1230 1370
1140 995 935 1110 994 1020 960 1180 799
958 1140 1100 1210 1150 1250 1260 1220 1030
1100 774 840 874 694 940 833 701 916
692 1020 1050 969 831 726 456 824 702
1120 1100 832 764 821 768 845 864 862
698 845 744 796 1040 759 781 865 845 944
984 897 822 1010 771 676 649 846 812
742 801 1040 860 874 848 890 744 749 838
1050 918 986 797 923 975 815 1020 906
901 1170 912 746 919 718 714 740 Stem and Leaf Plot Parts
The decimal point is 2 digit(s) to the right of the |
 4 | 6
 5 |
 6 | 5899
 7 | 000123444455667778
 8 | 000011222233344555556667779
 9 | 0011222244466678899
10 | 0122234455
11 | 00012244566678
12 | 112356
13 | 7
Stem and Leaf Plot Example
The decimal point is 2 digit(s) to the right of the |
 4 | 6
 5 |
 6 | 5899
 7 | 000123444455667778
 8 | 000011222233344555556667779
 9 | 0011222244466678899
10 | 0122234455
11 | 00012244566678
12 | 112356
13 | 7
Stem and Leaf Plot
What do you notice????
The decimal point is 2 digit(s) to the right of the |
 4 | 6
 5 |
 6 | 5899
 7 | 000123444455667778
 8 | 000011222233344555556667779
 9 | 0011222244466678899
10 | 0122234455
11 | 00012244566678
12 | 112356
13 | 7
Parts
1) Legend: “The decimal point is 2 digit(s) to the
right of the |”
a) This tells me that the numbers are 4 | 6 = 460.
b) If it had said “2 digit(s) to the LEFT of the |” then
4 | 6 = 0.046
2) Stem is the part to the left of “|”
3) Leaves are the parts to the right of the “|”
4) Each leaf represents a data value. Hence we
have 6 data values starting with 12. Example
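To make the stem/leaf split concrete, here is a minimal sketch that rebuilds a few rows of the Nile plot. It assumes each value is rounded to the nearest 10 before splitting (consistent with 456 appearing as 4 | 6), with the hundreds as the stem and the tens digit as the leaf. Only six values from the data are used, and the function name is mine. Empty stems are not printed here.

```python
from collections import defaultdict


def stem_and_leaf(values, leaf_unit=10):
    """Map each stem to its sorted list of one-digit leaves."""
    stems = defaultdict(list)
    for value in sorted(values):
        # Assumption: round to the nearest leaf unit (halves round up).
        scaled = (value + leaf_unit // 2) // leaf_unit
        stems[scaled // 10].append(scaled % 10)
    return dict(stems)


sample = [456, 649, 676, 692, 694, 1370]   # six of the Nile flow values
plot = stem_and_leaf(sample)
for stem in sorted(plot):
    print(f"{stem:>2} | {''.join(str(leaf) for leaf in plot[stem])}")
```

Each leaf is one data value, so the length of a row is the frequency for that stem, which is why the plot reads like a sideways histogram that keeps (most of) the digits.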
Measurements of vein diameters were taken
on 100 patients. The following stem and
leaf plot was obtained. Example Continued
The decimal point is 2 digit(s) to the left of the |
32 | 78
33 | 224
33 | 5577777899
34 | 0000011111233333444
34 | 5566666678888888999
35 | 0001111111122223344
35 | 5555677788889999
36 | 0112244
36 | 56678
Based on the Legend 32 | 1 Means:
A) 321
B) 32.1
C) 3201
D) 3.21
E) None of the above
The decimal point is 2 digit(s) to the left of the |
32 | 78
33 | 224
33 | 5577777899
34 | 0000011111233333444
34 | 5566666678888888999
35 | 0001111111122223344
35 | 5555677788889999
36 | 0112244
36 | 56678
What do you notice that is
interesting about the stems???
Why was this done??
The decimal point is 2 digit(s) to the left of the |
32 | 78
33 | 224
33 | 5577777899
34 | 0000011111233333444
34 | 5566666678888888999
35 | 0001111111122223344
35 | 5555677788889999
36 | 0112244
36 | 56678
Example:
Problem: Does the stress of machinery
affect the ability of a soya plant to grow?
Further, does the amount of light influence
its ability to grow?
Plan:
52 seeds were potted with one seed per pot. The
52 seeds were randomly divided into 4 samples
with 13 seeds per sample. The seeds in 2
samples were stressed by being shaken for 20
minutes daily, while the seeds in the other two
were not shaken (no stress). The two samples
that received the same exposure to stress were
grown under different levels of light. Thus the
four samples of plants were allocated to one of 4
treatments that were defined by 2 basic
treatments, stress and light. Data:
ln   ly   mn   my
264  235  314  283
200  188  320  312
225  195  310  291
268  205  340  259
215  212  299  216
241  214  268  201
232  182  345  267
256  215  271  326
229  272  285  241
288  163  309  291
253  230  337  269
288  255  282  282
230  202  273  257
Analysis: Under which conditions would you want to
grow your Soybeans?
A) B) C) D)
Moderate Light, Stress
Low Light, Stress
Moderate Light, no stress
Low light, no stress
Example 2 – View Article
From:
Medical Article
http://www.amstat.org/publications/jse/v11n2/datasets.heinz.html
Problem: To investigate the human body.
Plan: Measure the items shown at left on males and females.
Data: Measurements of 247 men & 260 women
Analysis: See article on last slide.
Is it possible for the biacromial measurement of a
particular female to exceed that of a particular male?
A) Yes
B) No
C) zzzzzzz
Probability
We can define probability in 3 ways.
Subjective
Relative frequency
Mathematical / classical
Subjective
Based on intuition we guess what the
probability is.
i.e. There’s a 99% chance I’ll pass! Subjective
Adv: Disad: Relative frequency
The probability of something happening is the number
of times it occurs divided by the # of attempts.
e.g. Coins
Pretend everyone in class is using the same coin. Flip it.
What did you get??
A) Heads
B) Tails Question
Will you write the quizzes more than once
even if you got 100% on the first try?
A) Yes
B) No Relative Frequency
Adv: Disad: Classical
Experiment: A theoretically repeatable process or
phenomenon
e.g.
Trial: One repetition of an experiment
e.g.
Classical ctd.
Outcome: The result of our experiment.
Also called a “simple” event
We use capital letters to denote outcomes
e.g. A
Classical continued
Compound Event: If an event A is made up of more
than one “simple event”
e.g.
Classical Ctd
Universe or Sample Space:
The collection of all outcomes of an
experiment.
We denote it by “S”.
e.g. Review
An outcome might be A = roll a one
An event might be, get an even #,
B = {2, 4, 6}
The size of an event/sample space is the # of
objects/simple events in it. We denote the
size by |B|
e.g. B = {2, 4, 6}, |B| = 3
Probability
Let E be an event containing |E| simple
outcomes.
Let S be the sample space with |S| simple
outcomes.
Then the probability E occurs is Pr(E) = |E|/|S|
Example
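The counting definition Pr(E) = |E|/|S| translates almost directly into code when events are written as sets. This is a sketch that assumes every simple outcome is equally likely, using the die event B = {2, 4, 6} from the review; the function name is my own.

```python
from fractions import Fraction


def classical_probability(event, sample_space):
    """Pr(E) = |E| / |S|, assuming equally likely simple outcomes."""
    if not event <= sample_space:
        raise ValueError("every outcome in the event must be in the sample space")
    return Fraction(len(event), len(sample_space))


S = {1, 2, 3, 4, 5, 6}   # sample space for one roll of a die
B = {2, 4, 6}            # event: get an even number
print(classical_probability(B, S))

coin = {"H", "T"}        # sample space for one coin flip
print(classical_probability({"H"}, coin))
```

Using `Fraction` keeps the answers as exact ratios, which matches how the probabilities are written on the slides.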
1. What is the probability of getting a head on a
coin?
e.g.
A biologist classifies a colony of wild baboons
by fur colour.
E = having light-coloured fur
Of 150 animals observed, 5 are light-coloured
P(light-coloured fur) =
Example
In a genetic experiment brown rabbits are
crossed with black rabbits. As a result, of the
44 progeny, 13 are brown and 5 are black. The
remainder are mottled (various colours). What
is the probability you select a mottled rabbit? Properties of Probabilities
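For the rabbit example, a short sketch of the count-based probabilities. The number of mottled progeny is not stated directly, so it is derived as the remainder; the variable names are illustrative.

```python
from fractions import Fraction

total = 44
counts = {"brown": 13, "black": 5}
counts["mottled"] = total - sum(counts.values())   # the remainder are mottled

probabilities = {colour: Fraction(n, total) for colour, n in counts.items()}
for colour, p in probabilities.items():
    print(colour, p)

# The three colours are mutually exclusive and cover every rabbit,
# so their probabilities should add to 1 and each lies between 0 and 1.
print(sum(probabilities.values()))
```

Checking that the probabilities add to 1 is a cheap way to catch a counting mistake before answering the question.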
1. 2. Properties of Probabilities
3. 4.
Properties of Probabilities
1) $0 \le P(E) \le 1$
2) $P(E) = 0 \equiv E$ never happens
3) $P(E) = 1 \equiv E$ always happens
4) If $E_1, E_2, \dots, E_m$ represent simple events that are mutually exclusive and cover all possible outcomes, then
$P(E_1) + P(E_2) + \dots + P(E_m) = 1$
Leading Questions
What if…
we want to know the probability we select either a
brown OR mottled rabbit?
We want to know the probability that in 2 tries we
select a brown AND a mottled rabbit? Symbol 1
“OR”
Notationally we write:
In words we mean: Symbol 2
“AND”
Notationally we write:
In words we mean: Symbol 3
“Not”
Notationally we write:
In words we mean: Venn Diagrams
A Venn diagram is a pictorial representation of our
probability
The box is the sample space.
e.g. A circle within the box denotes a probability for an event.
e.g. Mutually Exclusive
Two events are mutually exclusive (ME) if they have no
outcomes in common or cannot occur together.
e.g. ME Events: e.g. Not ME Events: Clicker ME
Is the event “Person wears glasses” mutually
exclusive from the event “Person has freckles”?
A) Yes
B) No
C) Uncertain Mutual Exclusion
Are the events A = Roll a one on die 1; B =
Roll a one on die 2; mutually exclusive
(ME)?
A) Yes
B) No
Venn Diagram
In the following Venn diagram, the square
represents the…(best answer)
A) Event
B) Simple Event
C) An Outcome
D) Sample Space
ME
ME and VENN Diagrams
If two events are ME or disjoint, the circles are also
disjoint.
e.g.
Hence $\Pr(A \text{ or } B) = \Pr(A) + \Pr(B)$.
Or in terms of our notation:
ME and VENN Diagrams
If two events are not ME, they overlap:
e.g.
Hence $\Pr(A \text{ or } B) = \Pr(A) + \Pr(B) - \Pr(A \text{ and } B)$
Proof by Picture:
$\Pr(A \text{ or } B) = \Pr(A) + \Pr(B) - \Pr(A \text{ and } B)$
Example
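Before the seal pup data, a small check of the two "OR" rules using one roll of a fair die. The events A, B and C below are illustrative choices of mine, not from the slides. Sets make the Venn-diagram picture literal: `|` is the union (OR) and `&` is the overlap (AND).

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}   # sample space: one roll of a fair die


def pr(event):
    """Classical probability, assuming equally likely outcomes."""
    return Fraction(len(event), len(S))


A = {2, 4, 6}   # roll an even number
B = {1, 2}      # roll less than three (overlaps A at 2)
C = {1}         # roll a one (no overlap with A)

# Not mutually exclusive: subtract the overlap so it is not counted twice.
print(pr(A | B), "=", pr(A) + pr(B) - pr(A & B))

# Mutually exclusive: the overlap is empty, so the simple sum works.
print(pr(A | C), "=", pr(A) + pr(C))
```

The general rule covers both cases, since for mutually exclusive events the overlap term is just zero.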
Problem: To investigate Seal pup fur colour.
Plan: Pups Categorized by Coat Colour and Sex
Data
              Sex
Colour      Male  Female  Total
Yellow      25    10      35
Thin White  10    5       15
Fat White   25    5       30
Grey        15    5       20
Total       75    25      N = 100
Notation
Let G denote Grey.
Let Y denote Yellow.
Let M denote Male.
Let W denote White.
Let T denote Thin. Question 0
What is the probability
a pup is not thin and
white?
          Sex
Colour  M   F   Total
Y       25  10  35
TW      10  5   15
FW      25  5   30
G       15  5   20
Total   75  25  N = 100
Question 1
What is the probability
a coat is Yellow?
A) 25/100
B) 10/100
C) 35/100
D) 25/75
          Sex
Colour  M   F   Total
Y       25  10  35
TW      10  5   15
FW      25  5   30
G       15  5   20
Total   75  25  N = 100
Details Details....
          Sex
Colour  M   F   Total
Y       25  10  35
TW      10  5   15
FW      25  5   30
G       15  5   20
Total   75  25  N = 100
Question 2
What is the probability
a coat is Yellow or
Grey?
A) 25/100
B) 40/100
C) 55/100
D) 40/75
E) None of the Above
          Sex
Colour  M   F   Total
Y       25  10  35
TW      10  5   15
FW      25  5   30
G       15  5   20
Total   75  25  N = 100
Details Details....
          Sex
Colour  M   F   Total
Y       25  10  35
TW      10  5   15
FW      25  5   30
G       15  5   20
Total   75  25  N = 100
Question 3
What is the probability
a randomly selected
pup is yellow and
male?
A) 85/100
B) 75/100
C) 35/100
D) 25/100
          Sex
Colour  M   F   Total
Y       25  10  35
TW      10  5   15
FW      25  5   30
G       15  5   20
Total   75  25  N = 100
Details Details....
          Sex
Colour  M   F   Total
Y       25  10  35
TW      10  5   15
FW      25  5   30
G       15  5   20
Total   75  25  N = 100
Question 4
Are the events yellow
and male ME?
A) Yes
B) No
C) Can't say
          Sex
Colour  M   F   Total
Y       25  10  35
TW      10  5   15
FW      25  5   30
G       15  5   20
Total   75  25  N = 100
Question 4 – Start
What about Yellow OR
male?? What is the
probability a randomly
selected pup is yellow
OR male?
          Sex
Colour  M   F   Total
Y       25  10  35
TW      10  5   15
FW      25  5   30
G       15  5   20
Total   75  25  N = 100
Details, Details…
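One way to work through the seal pup questions is to store the table as counts and let the computer do the adding. This is a sketch: the dictionary layout and the helper `prob` are my own, while the counts come straight from the table.

```python
from fractions import Fraction

# (colour, sex) -> number of pups, copied from the table
counts = {
    ("Y", "M"): 25, ("Y", "F"): 10,
    ("TW", "M"): 10, ("TW", "F"): 5,
    ("FW", "M"): 25, ("FW", "F"): 5,
    ("G", "M"): 15, ("G", "F"): 5,
}
N = sum(counts.values())


def prob(condition):
    """Probability that a randomly selected pup satisfies condition(colour, sex)."""
    favourable = sum(n for (colour, sex), n in counts.items() if condition(colour, sex))
    return Fraction(favourable, N)


p_yellow = prob(lambda colour, sex: colour == "Y")
p_male = prob(lambda colour, sex: sex == "M")
p_yellow_and_male = prob(lambda colour, sex: colour == "Y" and sex == "M")
p_yellow_or_male = prob(lambda colour, sex: colour == "Y" or sex == "M")

print("Pr(Y)      =", p_yellow)
print("Pr(M)      =", p_male)
print("Pr(Y and M) =", p_yellow_and_male)
print("Pr(Y or M)  =", p_yellow_or_male)
# Yellow and male overlap, so the OR rule needs the subtraction:
print("Pr(Y) + Pr(M) - Pr(Y and M) =", p_yellow + p_male - p_yellow_and_male)
```

Note that `Fraction` reduces results to lowest terms, so 85/100 prints as 17/20. Adding Pr(Y) and Pr(M) without subtracting the overlap would count the 25 yellow males twice.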
What about Yellow OR
male?? W...