Statistics, the discipline, is the art and science of extracting
useful information from data.
Statistic: a number calculated from data
When do we know that X affects Y?
• Controlled experiment
  o Physical sciences, x is controlled, y is measured
  o Med sciences, clinical trials, double-blind placebo
  o Social sciences, subjects exposed to controlled x, y observed
• Strong theory
  o Physics, opposite charges attract
  o Bio, descent of species
  o No strong theory for social sciences
Observational data --> prediction
The ASSOCIATION
Clustered: curve or line NOT the right fit
Convex = cup, Concave = cap
No association = random
cov(x,y) = 1/(n-1)[(x1-xbar)(y1-ybar)+(x2-xbar)(y2-ybar)+…+(xn-xbar)(yn-ybar)]
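A minimal Python sketch of the covariance formula above. The helper names `mean` and `cov` and the data are made up for illustration; they are not from the notes.

```python
def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def cov(x, y):
    """Sample covariance of paired data, using the 1/(n-1) divisor."""
    n = len(x)
    xbar, ybar = mean(x), mean(y)
    return sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (n - 1)


# Made-up paired observations, purely for illustration.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 5.0, 9.0]

print(cov(x, y))  # products of deviations sum to 11, divided by n - 1 = 3
print(cov(x, x))  # covariance of x with itself is the sample variance of x
```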
s^2(x+y) = s^2(x) + s^2(y) + 2cov(x,y)
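A quick numerical check of the variance-of-a-sum identity, as a sketch. The helpers `cov` and `var` and the data are illustrative assumptions.

```python
def cov(x, y):
    """Sample covariance with the 1/(n-1) divisor."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    return sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (n - 1)


def var(x):
    """Sample variance, i.e. the covariance of x with itself."""
    return cov(x, x)


# Made-up data for illustration.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 5.0, 9.0]
total = [xi + yi for xi, yi in zip(x, y)]

lhs = var(total)                        # variance of x + y
rhs = var(x) + var(y) + 2 * cov(x, y)   # right-hand side of the identity
print(lhs, rhs)  # equal up to floating-point rounding
```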
Cauchy-Schwarz inequality
-s(x)s(y) ≤ cov(x,y) ≤ s(x)s(y)
If perfect positive linear assoc
cov(x,y) = s(x)s(y)
If perfect negative linear assoc
cov(x,y) = -s(x)s(y)
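The bounds can be checked numerically. A sketch with made-up data: two exactly linear y's (one increasing, one decreasing) hit the bounds, a non-linear y falls strictly inside. The helper names are illustrative.

```python
import math


def cov(x, y):
    """Sample covariance with the 1/(n-1) divisor."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    return sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (n - 1)


def sd(x):
    """Sample standard deviation."""
    return math.sqrt(cov(x, x))


x = [1.0, 2.0, 3.0, 4.0]
y_up = [2.0 * xi + 1.0 for xi in x]      # perfect positive linear assoc
y_down = [-3.0 * xi + 5.0 for xi in x]   # perfect negative linear assoc
y_noisy = [2.0, 4.0, 5.0, 9.0]           # positive, but not perfectly linear

for name, y in [("up", y_up), ("down", y_down), ("noisy", y_noisy)]:
    bound = sd(x) * sd(y)
    print(name, -bound, cov(x, y), bound)
```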
Comparing cov(x,y) w/ the product s(x)s(y) is inconvenient, so divide by it to get the correlation:
c(x,y) = cov(x,y)/(s(x)s(y))
covariance is theoretically important for algebra
correlation is practically important as a convenient measure of linear assoc
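One way to see the practical convenience: changing units rescales the covariance but leaves the correlation alone. A sketch with made-up height/weight data; the names `cov` and `cor` are illustrative.

```python
import math


def cov(x, y):
    """Sample covariance with the 1/(n-1) divisor."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    return sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (n - 1)


def cor(x, y):
    """Correlation: covariance divided by the product of the sdevs."""
    return cov(x, y) / math.sqrt(cov(x, x) * cov(y, y))


# Made-up measurements, purely for illustration.
height_m = [1.60, 1.70, 1.75, 1.82, 1.90]
weight_kg = [55.0, 68.0, 70.0, 80.0, 85.0]
height_cm = [100.0 * h for h in height_m]  # same heights, different unit

print(cov(height_m, weight_kg), cov(height_cm, weight_kg))  # differ by 100x
print(cor(height_m, weight_kg), cor(height_cm, weight_kg))  # same value
```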
Standardized values: z = (x - xbar)/s(x)
zbar = 0, sample sdev of z = 1, ALWAYS; z is unit-free
Standardization can be done w/ ANY location measure (e.g. median) and ANY
dispersion measure (e.g. IQR); the result will still have that location measure = 0
and that dispersion measure = 1.
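A sketch of both kinds of standardization, using Python's standard `statistics` module. The function names `standardize` and `iqr` and the data are assumptions for illustration. The quartiles use the module's default convention, which may differ slightly from other software.

```python
import statistics


def iqr(values):
    """Interquartile range, using statistics.quantiles' default method."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1


def standardize(values, location, dispersion):
    """Subtract a location measure, then divide by a dispersion measure."""
    loc = location(values)
    disp = dispersion(values)
    return [(v - loc) / disp for v in values]


# Made-up, right-skewed data for illustration.
x = [2.0, 3.0, 5.0, 8.0, 13.0, 40.0]

z = standardize(x, statistics.mean, statistics.stdev)  # usual z-scores
r = standardize(x, statistics.median, iqr)             # median/IQR version

print(statistics.mean(z), statistics.stdev(z))  # approximately 0 and 1
print(statistics.median(r), iqr(r))             # approximately 0 and 1
```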
cov(Zx,Zy) = c(x,y)
Scatterplot matrices
• Plots in column all have same x axis
• Plots in row all have same y axis
Perfect linear assoc in standardized units:
Zy = a(Zx) + b, abs val of a = 1, b = 0
Zy = ± Zx
cor(x,y) = (s^2(Zy+Zx) - s^2(Zy-Zx))/4 = cov(x,y)/(s(x)s(y))
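The three routes to the correlation can be compared numerically. A sketch with made-up data; `cov` and `standardize` are illustrative helpers.

```python
import statistics


def cov(x, y):
    """Sample covariance with the 1/(n-1) divisor."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    return sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (n - 1)


def standardize(values):
    """z-scores: subtract the mean, divide by the sample sdev."""
    m = statistics.mean(values)
    s = statistics.stdev(values)
    return [(v - m) / s for v in values]


# Made-up data for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 6.0]
zx, zy = standardize(x), standardize(y)

cor_direct = cov(x, y) / (statistics.stdev(x) * statistics.stdev(y))
cor_from_z = cov(zx, zy)  # covariance of the standardized variables
var_sum = statistics.variance([a + b for a, b in zip(zy, zx)])
var_diff = statistics.variance([a - b for a, b in zip(zy, zx)])
cor_from_variances = (var_sum - var_diff) / 4

print(cor_direct, cor_from_z, cor_from_variances)  # all three agree
```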
ni = count of ith label
pi = proportion of ith label = ni/n
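Counts and proportions for a qualitative variable, as a small sketch. The labels are made up.

```python
from collections import Counter

# Made-up labels for illustration.
labels = ["red", "blue", "red", "green", "red", "blue"]

counts = Counter(labels)  # ni: count of each label
n = len(labels)
proportions = {label: count / n for label, count in counts.items()}  # pi

for label in counts:
    print(label, counts[label], proportions[label])
```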
sdev even more problematic w/ skew, squares amplify influence of outliers
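A sketch of how one outlier inflates the sdev while leaving the IQR alone. The data are made up, and `iqr` is an illustrative helper built on the `statistics` module's default quartile convention.

```python
import statistics


def iqr(values):
    """Interquartile range, using statistics.quantiles' default method."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1


clean = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
dirty = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 80.0]  # largest value replaced

print(statistics.stdev(clean), statistics.stdev(dirty))  # sdev jumps
print(iqr(clean), iqr(dirty))                            # IQR is unchanged
```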
time series
simple: daily stock price of 1 company
multiple: daily stock prices of multiple companies
Qual
1 var: barplot, comparing freq across labels
2 var: mosaic, comparing conditional freq; compare proportion of Y by groups of X
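The conditional frequencies a mosaic plot displays can be tabulated directly. A sketch with made-up (X, Y) pairs; the variable names are illustrative.

```python
from collections import Counter

# Made-up (X group, Y label) observations for illustration.
observations = [
    ("A", "yes"), ("A", "no"), ("A", "yes"),
    ("B", "no"), ("B", "no"), ("B", "yes"), ("B", "no"),
]

joint = Counter(observations)                  # count of each (X, Y) pair
x_totals = Counter(x for x, _ in observations)  # count of each X group

# Proportion of each Y label within each X group.
conditional = {
    (x, y): count / x_totals[x] for (x, y), count in joint.items()
}

for (x, y), p in sorted(conditional.items()):
    print(f"P(Y={y} | X={x}) = {p:.3f}")
```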
Quant
1 var:
Histogram: see shape/skew
Boxplot: show location, dispersion, outliers = pts outside whiskers
2 var: scatterplot
Comparison box plot: compare levels of quant variable by groups of qual variable
Nested barplots: heights of bars reflect freq of Y groups nested w/in X groups;
compare importance of Y w/in X
Graphical methods
-see data as whole
-discover unexpected facts
Numerical summaries
-simplicity by condensing a lot of data to a few #’s
-precision, e.g. when comparing groups (vs. eyeballing)
-ways to reason about uncertainty
NEITHER REPLACES THE OTHER
Quant variables
Measures of location: mean, median, quantiles, min, max
Dispersion: sd, IQR, range
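The measures listed above, computed with Python's standard `statistics` module. This is a sketch on made-up data. The quartiles follow the module's default convention, so other software may report slightly different values.

```python
import statistics

# Made-up, right-skewed data for illustration.
data = [2.0, 3.0, 5.0, 8.0, 13.0, 40.0]

q1, q2, q3 = statistics.quantiles(data, n=4)  # quartiles, default method

summary = {
    # Measures of location
    "mean": statistics.mean(data),
    "median": statistics.median(data),
    "q1": q1,
    "q3": q3,
    "min": min(data),
    "max": max(data),
    # Measures of dispersion
    "sd": statistics.stdev(data),
    "iqr": q3 - q1,
    "range": max(data) - min(data),
}

for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```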