Statistics, the discipline, is the art and science of extracting useful information from data.
Statistic: a number calculated from data

When do we know that X affects Y?
Controlled experiment
o Physical sciences: x is controlled, y is measured
o Med sciences: clinical trials, double-blind, placebo
o Social sciences: subjects exposed to controlled X, y observed
Strong theory
o Physics: opposite charges attract
o Bio: descent of species
o No strong theory for social sciences
Observational data --> prediction

The ASSOCIATION
Clustered: curve or line NOT the right fit
Convex = cup, Concave = cap
No association = random

cov(x,y) = 1/(n-1) [(x1-xbar)(y1-ybar) + (x2-xbar)(y2-ybar) + … + (xn-xbar)(yn-ybar)]
s(x+y)^2 = s(x)^2 + s(y)^2 + 2cov(x,y)
Cauchy-Schwarz inequality: -s(x)s(y) ≤ cov(x,y) ≤ s(x)s(y)
If perfect positive linear assoc: cov(x,y) = s(x)s(y)
If perfect negative linear assoc: cov(x,y) = -s(x)s(y)
Comparing cov w/ the product of sdevs is inconvenient, so divide by it:
c(x,y) = cov(x,y)/(s(x)s(y))
covariance is theoretically important (algebra)
correlation is practically important (convenient measure of linear assoc)

Standardized values z = (x - xbar)/s(x): zbar = 0, sample sdev of z = 1, ALWAYS, unit-free
Standardization can be done w/ ANY location measure (e.g. median) and ANY dispersion measure (e.g. IQR); the result still has location 0 and dispersion 1 under those same measures.
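The covariance, correlation, and standardization facts above can be checked numerically. The sketch below is only an illustration: it uses Python's standard library, the data in x and y are made up, and the helper names cov, cor, and standardize are not from the notes.

```python
from statistics import mean, stdev, variance


def cov(x, y):
    """Sample covariance with the 1/(n-1) divisor used in the notes."""
    n = len(x)
    xbar, ybar = mean(x), mean(y)
    return sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (n - 1)


def cor(x, y):
    """Correlation c(x,y) = cov(x,y) / (s(x) s(y))."""
    return cov(x, y) / (stdev(x) * stdev(y))


def standardize(x):
    """z-scores: subtract the sample mean, divide by the sample sdev."""
    xbar, s = mean(x), stdev(x)
    return [(xi - xbar) / s for xi in x]


# Made-up illustrative data (an assumption, not from the notes).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

# Variance of a sum: s(x+y)^2 = s(x)^2 + s(y)^2 + 2 cov(x,y)
total = [xi + yi for xi, yi in zip(x, y)]
lhs = variance(total)
rhs = variance(x) + variance(y) + 2 * cov(x, y)

# Cauchy-Schwarz bound: |cov(x,y)| <= s(x) s(y)
bound = stdev(x) * stdev(y)

# Standardized values have mean 0 and sdev 1, and cov(Zx,Zy) = c(x,y)
zx, zy = standardize(x), standardize(y)

print("variance of sum:", lhs, "vs", rhs)
print("cov:", cov(x, y), "bound:", bound)
print("cor:", cor(x, y), "cov of z-scores:", cov(zx, zy))
print("mean and sdev of Zx:", mean(zx), stdev(zx))
```

Up to floating-point rounding, the two sides of the variance identity agree, the covariance stays within the bound, and an exactly linear y (for example y = 2x + 1) gives a correlation of 1.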
cov(Zx,Zy) = c(x,y)
If perfect linear assoc: Zy = a(Zx) + b with |a| = 1, b = 0, i.e. Zy = ± Zx
cor(x,y) = (s^2(Zy+Zx) - s^2(Zy-Zx))/4 = cov(x,y)/(s(x)s(y))

Scatterplot matrices
Plots in a column all have the same x axis
Plots in a row all have the same y axis

ni = count of ith label
pi = proportion of ith label
sdev even more problematic w/ skew: squares amplify the influence of outliers

Time series
simple: daily stock price of 1 company
multiple: daily stock prices of multiple companies

Qual
1 var: barplot, comparing freq across labels
2 var: mosaic, comparing conditional freq; compare proportion of Y by groups of X
Quant
1 var:
Histogram: see shape/skew
Boxplot: shows location, dispersion, outliers = pts outside the whiskers
2 var: scatterplot
Comparison box plot: compare levels of a quant variable by groups of a qual variable
Nested barplots: heights of bars reflect freq of Y groups nested w/in X groups
Compare importance of Y w/in X

Graphical methods
-see data as a whole
-discover unexpected facts
Numerical summaries
-simplicity by condensing a lot of data to a few numbers
-precision, e.g. when comparing groups (vs. eyeballing)
-ways to reason about uncertainty
NEITHER REPLACES THE OTHER

Quant variables
Measures of location: mean, median, quantiles, min, max
Dispersion: sd, IQR, range
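The location and dispersion measures listed above, and the note that sdev is sensitive to outliers, can be illustrated with a small sketch. It assumes Python's standard statistics module; the data sets and the helper name summary are made up for illustration. The quartiles come from statistics.quantiles with its default method, and other software may use a slightly different quartile rule.

```python
from statistics import mean, median, quantiles, stdev


def summary(data):
    """Location and dispersion measures for one quant variable."""
    q1, _, q3 = quantiles(data, n=4)  # quartiles, default 'exclusive' method
    return {
        "min": min(data),
        "max": max(data),
        "mean": mean(data),
        "median": median(data),
        "sd": stdev(data),
        "IQR": q3 - q1,
        "range": max(data) - min(data),
    }


# Made-up data: the second list adds a single large outlier.
clean = [2, 3, 3, 4, 5, 5, 6, 7]
skewed = clean + [60]

s_clean = summary(clean)
s_skewed = summary(skewed)

# Compare how much each measure moves when the outlier is added.
for name in ("mean", "median", "sd", "IQR", "range"):
    print(name, s_clean[name], "->", s_skewed[name])
```

In this example the mean, sd, and range shift far more than the median and IQR, which is why the median and IQR are often preferred for skewed data.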

