Statistics, the discipline, is the art and science of extracting
useful information from data. A statistic is a number calculated from data.
When do we know that X affects Y?
• Controlled experiment
  o Physical sciences: x is controlled, y is measured
  o Medical sciences: clinical trials, double-blind placebo
  o Social sciences: subjects exposed to controlled x, y observed
• Strong theory
  o Physics: opposite charges attract
  o Bio: descent of species
  o No strong theory for social sciences
Observational data -> prediction
The ASSOCIATION
Clustered: curve or line NOT the right fit
Convex = cup, Concave = cap
No association = random
cov(x,y) = 1/(n-1) [(x1 - xbar)(y1 - ybar) + (x2 - xbar)(y2 - ybar) + … + (xn - xbar)(yn - ybar)]
s²(x+y) = s²(x) + s²(y) + 2cov(x,y)
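A quick numerical check of the covariance formula and the variance-of-a-sum identity, as a minimal Python sketch. The data values and the helper name cov are made up for illustration; only the standard library is assumed.

```python
from statistics import mean, variance

def cov(x, y):
    """Sample covariance with the 1/(n-1) divisor (hypothetical helper)."""
    xbar, ybar = mean(x), mean(y)
    return sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (len(x) - 1)

# Made-up data, for illustration only
x = [1.0, 2.0, 4.0, 7.0]
y = [2.0, 3.0, 3.0, 8.0]

sums = [xi + yi for xi, yi in zip(x, y)]
lhs = variance(sums)                             # s²(x+y)
rhs = variance(x) + variance(y) + 2 * cov(x, y)  # s²(x) + s²(y) + 2cov(x,y)
print(cov(x, y), lhs, rhs)
```

The two sides should agree up to floating-point rounding for any paired data, not just these values.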
Cauchy-Schwarz inequality
-s(x)s(y) ≤ cov(x,y) ≤ s(x)s(y)
If perfect positive linear assoc
cov(x,y) = s(x)s(y)
If perfect negative linear assoc
cov(x,y) = -s(x)s(y)
Comparing cov(x,y) w/ the product of sdevs every time is inconvenient, so define the correlation
c(x,y) = cov(x,y)/(s(x)s(y)), which always lies between -1 and 1
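The bounds on the covariance and the correlation can be illustrated with a short Python sketch. The data and the helper names cov and cor are assumptions for illustration, not part of the notes.

```python
from statistics import mean, stdev

def cov(x, y):
    """Sample covariance with the 1/(n-1) divisor (hypothetical helper)."""
    xbar, ybar = mean(x), mean(y)
    return sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (len(x) - 1)

def cor(x, y):
    """Correlation: covariance divided by the product of sample sdevs."""
    return cov(x, y) / (stdev(x) * stdev(y))

# Made-up data, for illustration only
x = [1.0, 2.0, 4.0, 7.0]
y_pos = [2 * xi + 1 for xi in x]    # perfect positive linear assoc
y_neg = [-3 * xi + 5 for xi in x]   # perfect negative linear assoc
y_other = [2.0, 3.0, 3.0, 8.0]      # imperfect assoc

r_pos, r_neg, r_other = cor(x, y_pos), cor(x, y_neg), cor(x, y_other)
print(r_pos, r_neg, r_other)
```

The perfect linear cases should give 1 and -1 up to rounding, and any other data should land strictly in between.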
covariance is theoretically important: for algebra
correlation is practically important: convenient measure of linear assoc
Standardized values z = (x - xbar)/s(x): zbar = 0, sample sdev of z = 1, ALWAYS; z is unit-free
Standardization can be done w/ ANY location measure (e.g. median) and ANY
dispersion measure (e.g. IQR); the result will still have location measure 0
and dispersion measure 1.
cov(Zx,Zy) = c(x,y)
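A minimal Python sketch of standardization, assuming made-up data and hypothetical helper names (zscores, robust_scores, iqr, cov). It checks that z-scores have mean 0 and sdev 1, that cov(Zx,Zy) equals the correlation, and that the median/IQR version has median 0 and IQR 1.

```python
from statistics import mean, median, quantiles, stdev

def cov(x, y):
    """Sample covariance with the 1/(n-1) divisor (hypothetical helper)."""
    xbar, ybar = mean(x), mean(y)
    return sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (len(x) - 1)

def zscores(v):
    """Standardize with mean and sample sdev."""
    m, s = mean(v), stdev(v)
    return [(vi - m) / s for vi in v]

def iqr(v):
    """Interquartile range, using the statistics module's default quartile rule."""
    q1, _, q3 = quantiles(v, n=4)
    return q3 - q1

def robust_scores(v):
    """Standardize with median and IQR instead of mean and sdev."""
    m, d = median(v), iqr(v)
    return [(vi - m) / d for vi in v]

# Made-up data, for illustration only
x = [1.0, 2.0, 4.0, 7.0, 11.0]
y = [2.0, 3.0, 3.0, 8.0, 9.0]

zx, zy = zscores(x), zscores(y)
rx = robust_scores(x)
cor_xy = cov(x, y) / (stdev(x) * stdev(y))
print(mean(zx), stdev(zx), cov(zx, zy), cor_xy, median(rx), iqr(rx))
```

Quartile conventions differ between software packages; the median-0, IQR-1 result holds as long as the same convention is used before and after standardizing.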
Scatterplot matrices
• Plots in a column all have the same x axis
• Plots in a row all have the same y axis
If the linear assoc is perfect: Zy = a(Zx) + b with abs val of a = 1 and b = 0,
i.e. Zy = ±Zx
cor(x,y) = (s²(Zy+Zx) - s²(Zy-Zx))/4 = cov(x,y)/(s(x)s(y))
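The identity follows from the variance-of-a-sum rule applied to z-scores: s²(Zy+Zx) = 2 + 2cor(x,y) and s²(Zy-Zx) = 2 - 2cor(x,y), so their difference is 4cor(x,y). A short Python sketch to check it numerically, with made-up data and a hypothetical zscores helper:

```python
from statistics import mean, stdev, variance

def zscores(v):
    """Standardize with mean and sample sdev (hypothetical helper)."""
    m, s = mean(v), stdev(v)
    return [(vi - m) / s for vi in v]

# Made-up data, for illustration only
x = [1.0, 2.0, 4.0, 7.0, 11.0]
y = [2.0, 3.0, 3.0, 8.0, 9.0]
zx, zy = zscores(x), zscores(y)

s2_sum = variance([b + a for a, b in zip(zx, zy)])   # s²(Zy+Zx)
s2_diff = variance([b - a for a, b in zip(zx, zy)])  # s²(Zy-Zx)
cor_from_variances = (s2_sum - s2_diff) / 4

# Direct computation: cov(x,y)/(s(x)s(y))
xbar, ybar = mean(x), mean(y)
cov_xy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / (len(x) - 1)
cor_direct = cov_xy / (stdev(x) * stdev(y))
print(cor_from_variances, cor_direct)
```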
ni = count of ith label
pi = proportion of ith label
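Counts and proportions for a qualitative variable can be tabulated in a few lines of Python. This is only a sketch: the labels are made up, and pi is taken to be ni divided by the total number of observations.

```python
from collections import Counter

# Made-up labels, for illustration only
labels = ["A", "B", "A", "C", "A", "B"]

counts = Counter(labels)   # ni = count of ith label
n = len(labels)
props = {label: c / n for label, c in counts.items()}  # pi = ni / n
print(counts, props)
```

The counts add up to n and the proportions add up to 1.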
sdev even more problematic w/ skew, squares amplify influence of outliers
time series
simple: daily stock price of 1 company
multiple: daily stock prices of multiple companies
Qual
1 var: barplot, comparing freq across labels
2 var: mosaic, comparing conditional freq; compare proportion of Y
by groups of X
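The conditional frequencies that a mosaic plot displays can be computed directly. A minimal Python sketch, assuming made-up (X, Y) pairs and hypothetical group names:

```python
from collections import Counter, defaultdict

# Made-up (X group, Y label) pairs, for illustration only
pairs = [
    ("treated", "yes"), ("treated", "yes"), ("treated", "no"), ("treated", "yes"),
    ("control", "no"), ("control", "yes"), ("control", "no"), ("control", "no"),
]

# Count Y labels separately within each X group
counts = defaultdict(Counter)
for x_group, y_label in pairs:
    counts[x_group][y_label] += 1

# Proportion of each Y label, conditional on the X group
cond = {
    x_group: {y_label: c / sum(cnt.values()) for y_label, c in cnt.items()}
    for x_group, cnt in counts.items()
}
print(cond)
```

Within each X group the conditional proportions add up to 1, so groups of different sizes can be compared directly.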
Quant
1 var:
Histogram: see shape/skew
Boxplot: show location, dispersion, outliers = pts outside whiskers
2 var: scatterplot
Comparison box plot: compare levels of quant variable by groups of qual
variables
Nested barplots: heights of bars reflect freq of Y groups nested w/in X groups;
compare importance of Y w/in X
Graphical methods
see data as whole
discover unexpected facts
Numerical summaries
simplicity by condensing lot of data to few #’s
precision, e.g. when comparing groups (vs. eyeballing a plot)
ways to reason about uncertainty
NEITHER REPLACES THE OTHER
Quant variables
Measures of location: mean, median, quantiles, min, max
Dispersion: sd, IQR, range
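The location and dispersion measures listed above can be computed with Python's standard library. This is a sketch with made-up data and a hypothetical summary helper; it also shows how a single outlier moves the mean, sd and range while leaving the median and IQR alone.

```python
from statistics import mean, median, quantiles, stdev

def summary(v):
    """Location and dispersion measures for a quantitative variable."""
    q1, _, q3 = quantiles(v, n=4)  # quartiles, statistics module default rule
    return {
        "mean": mean(v), "median": median(v), "min": min(v), "max": max(v),
        "sd": stdev(v), "IQR": q3 - q1, "range": max(v) - min(v),
    }

# Made-up data, for illustration only
data = [2.0, 3.0, 3.0, 4.0, 5.0, 6.0, 7.0]
clean = summary(data)
with_outlier = summary(data[:-1] + [70.0])  # replace the largest value by an outlier

for name in clean:
    print(name, clean[name], with_outlier[name])
```

For this data the median and IQR are identical in both versions, while the sd grows a lot because squared deviations amplify the outlier.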
 Spring '09
 Statistics, Standard Deviation, Variance, Probability theory
