-Samples produce different $b_0$ and $b_1$, but the sampling distribution model of the regression slope is centered at $\beta_1$ (the slope of the idealized regression line)
-Standardize slopes by subtracting the model mean and dividing by SE: Student's t with n − 2 df
$t_{n-2} = (b_1 - \beta_1) / SE(b_1)$
Usual $H_0$: $\beta_1 = 0$, because if the slope = 0, there is no linear association between the 2 variables
CI for regression slope: $b_1 \pm t^*_{n-2} \times SE(b_1)$
***Regression estimates Rate of Change--CANNOT TELL CAUSATION
***As SE increases, the t-score decreases (less significant results)
-A very low P-value means the association you see in the data is unlikely to have occurred by chance: reject $H_0$
-Can also predict the mean y value for all cases OR the y-value for a particular case
**MORE PRECISION PREDICTING MEANS (difference is all in SE--the farther from the center of our data, the LESS precise--SE for an INDIVIDUAL predicted value is LARGER than SE for the MEAN--extra variability)
Predicting for a new individual (not part of original data set): "x sub new" = $x_\nu$, $\hat{y}_\nu = b_0 + b_1 x_\nu$
CI for mean predicted value: $\hat{y}_\nu \pm t^*_{n-2} \times SE(\hat{\mu}_\nu)$ "mean y value for all with that value for x" Narrower CI and smaller SE
Prediction Interval for individual: $\hat{y}_\nu \pm t^*_{n-2} \times SE(\hat{y}_\nu)$ "exact y value for particular individual with that x" Wider interval and larger SE
A 95% prediction interval has a 95% chance of capturing the true y value of a randomly selected individual with the given x-value
Watch OUT
1. High influence points & outliers
2. Extrapolation
3. Make sure errors are Normal
4. Watch out for plot thickening (spread that changes along the line)
5. Don't fit linear regression to data that aren't straight
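The slope test and the two kinds of interval above can be sketched in a few lines. This is only an illustration under stated assumptions: the data are made up, the function names (`fit_line`, `slope_test`, `intervals_at`) are hypothetical, and the two SE formulas are the usual textbook ones, which the notes themselves do not spell out. The critical value 2.306 is the two-sided 95% t* for 8 df.

```python
import math

def fit_line(x, y):
    """Least-squares fit; returns the pieces needed for slope inference."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    # Residual standard deviation uses n - 2 degrees of freedom.
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s_e = math.sqrt(sse / (n - 2))
    se_b1 = s_e / math.sqrt(sxx)
    return {"n": n, "xbar": xbar, "b0": b0, "b1": b1, "s_e": s_e, "se_b1": se_b1}

def slope_test(fit, t_star):
    """t statistic for H0: beta1 = 0, and the CI b1 +/- t* x SE(b1)."""
    t = (fit["b1"] - 0.0) / fit["se_b1"]
    margin = t_star * fit["se_b1"]
    return t, (fit["b1"] - margin, fit["b1"] + margin)

def intervals_at(x_new, fit, t_star):
    """CI for the mean response and prediction interval for one individual."""
    y_hat = fit["b0"] + fit["b1"] * x_new
    # Assumed textbook formula: SE for the mean grows with distance from xbar.
    var_mean = fit["se_b1"] ** 2 * (x_new - fit["xbar"]) ** 2 + fit["s_e"] ** 2 / fit["n"]
    se_mean = math.sqrt(var_mean)
    # The individual SE adds the residual variance (the "extra variability").
    se_indiv = math.sqrt(var_mean + fit["s_e"] ** 2)
    return {
        "y_hat": y_hat,
        "se_mean": se_mean,
        "se_indiv": se_indiv,
        "ci_mean": (y_hat - t_star * se_mean, y_hat + t_star * se_mean),
        "pi_indiv": (y_hat - t_star * se_indiv, y_hat + t_star * se_indiv),
    }

# Made-up data: n = 10, so df = n - 2 = 8.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 20.2]
T_STAR = 2.306

fit = fit_line(x, y)
t, ci_slope = slope_test(fit, T_STAR)
near = intervals_at(5.5, fit, T_STAR)   # at the center of the data
far = intervals_at(20, fit, T_STAR)     # extrapolating far from xbar
print("t =", t, "CI for slope =", ci_slope)
print("near:", near["ci_mean"], near["pi_indiv"])
print("far: ", far["ci_mean"], far["pi_indiv"])
```

Comparing `near` and `far` shows both points from the notes: the prediction interval is always wider than the CI for the mean, and both widen as $x_\nu$ moves away from the center of the data.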
Sample (statistics): Latin letters | Quantity | Population (parameters): Greek letters
$\bar{y}$ | Mean | $\mu$
$s$ | Stand Dev | $\sigma$
$r$ | Correlation | $\rho$
$\hat{p}$ | Proportion | $p$
Categorical Data: frequency tables, bar & pie charts, contingency tables
Quantitative Data: histograms, stem & leaf, dot plot, boxplot, scatterplots
Marginal Distribution: distribution of either variable alone; the counts or percentages are the totals found in the margins (last row / column) of the table
Data: information w/ a context (the W's of data: Who and What are essential / When, Where, Why are good to have); the Who are called "cases"
MAKE A PICTURE w/ data
Relative Frequencies / Proportions: depend on whether taken from column total, row total, or grand total (marginal total)
5 Number Summary: min, Q1, median, Q3, max
Measures of Spread: Standard Dev, IQR
Measures of Position: Mean, Median, Quartiles
Independence: in a contingency table, when the distribution of one variable is the same for all categories of another
Simpson's paradox: when averages are taken across different (unrelated) groups, they may appear contradictory
CHANGING CENTER: adding a constant to each value adds the same amount to Mean, Median, and Quartiles, but DOES NOT change Stand Dev or IQR
CHANGING SCALE: multiplying each data value by a constant changes the measures of position (Mean, Median, Quartiles) and the measures of spread too
HISTOGRAMS: each column in a histogram is a bin; its height shows how many cases fall in that bin
1. Shape (General Trend)
a. Unimodal / Bimodal / Multimodal
b. Symmetric or Skewed ("strewn")
- To the right ( : : . . ) Mean > Median
- To the left ( . . : : ) Median > Mean
2. Center: Median (middle number when in order), Mean (average)
3. Spread: Range = max − min (single #); IQR = Q3 − Q1 (each quarter holds 25% of the data); Stand Dev = $s = \sqrt{\text{variance}} = \sqrt{\sum (y - \bar{y})^2 / (n - 1)}$, same units as data
IQR and Median used for skewed data / Mean and Standard Dev better for symmetric, unimodal data (DON'T use when outliers present)
Box Plot: good for comparing groups
Dot Plot
Stem & Leaf
___________________________________________________________________________________________________________________________
SCATTERPLOTS: used to compare 2 quantitative variables
1) direction: + or −
2) form: straight or curved
3) scatter: little, moderate, great
4) outliers
X-axis: variable helping to explain, model, or predict
Y-axis: variable we are trying to model, explain, predict
-Amount of scatter determines the strength of the association
Standardize: mean = 0, s = 1; Z-SCORE (no units): $z = (y - \bar{y}) / s$ (z = 1 means 1 standard deviation away from the mean)
68-95-99.7 Rule: for a Normal distribution, 68% of data lie within 1 s of the mean / 95% within 2 s / 99.7% within 3 s (works if unimodal / symmetric & approx Normal)
Models: all are wrong, useful only because they simplify reality...
Normal Model: inflection point is 1 s away from the mean
Normal Model is appropriate for a distribution whose shape is roughly unimodal and symmetric
Correlation r: how close two variables are to having a straight line relationship ( $-1 \le r \le 1$ )
* horizontal line: r = 0
* Conditions = 1) quantitative variables 2) straight relationship 3) no outliers
no units
$z_y = (y - \bar{y}) / s_y$, $z_x = (x - \bar{x}) / s_x$, $r = \sum z_x z_y / (n - 1)$
*** CORRELATION ≠ CAUSATION ***
DON'T say "CORRELATION" when you see only a general "ASSOCIATION"
Moving any number of SDs in X moves the predicted Y by r times that number of SDs
If r = .46, then if you move 2 SDs away from the mean in X, the predicted y moves .92 (.46 * 2) SDs in Y
-Correlation is NOT affected by changes of center OR scale; it depends ONLY ON Z-SCORES
Linear Model: $\hat{y} = b_0 + b_1 x$
slope: $b_1 = r \, s_y / s_x$
intercept: $b_0 = \bar{y} - b_1 \bar{x}$
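The z-score, correlation, and linear-model formulas above fit together as in the following minimal sketch. The data are invented and the helper names (`z_scores`, `correlation`, `linear_model`) are hypothetical; only the standard library is assumed.

```python
import math
import statistics

def z_scores(values):
    """Standardize: subtract the mean, divide by the sample SD (n - 1)."""
    m = statistics.mean(values)
    s = statistics.stdev(values)
    return [(v - m) / s for v in values]

def correlation(x, y):
    """r = sum(z_x * z_y) / (n - 1)."""
    zx, zy = z_scores(x), z_scores(y)
    return sum(a * b for a, b in zip(zx, zy)) / (len(x) - 1)

def linear_model(x, y):
    """Slope b1 = r * s_y / s_x, intercept b0 = ybar - b1 * xbar."""
    r = correlation(x, y)
    b1 = r * statistics.stdev(y) / statistics.stdev(x)
    b0 = statistics.mean(y) - b1 * statistics.mean(x)
    return b0, b1

# Made-up data for illustration only.
x = [2.0, 4.0, 5.0, 7.0, 9.0, 12.0]
y = [10.0, 14.0, 13.0, 19.0, 22.0, 30.0]

r = correlation(x, y)
b0, b1 = linear_model(x, y)

# Changing center (+100) and scale (x2.5) of x leaves r unchanged.
x_rescaled = [2.5 * v + 100 for v in x]
r_rescaled = correlation(x_rescaled, y)

print("r =", r, "after rescaling x:", r_rescaled)
print("y-hat =", b0, "+", b1, "* x")
```

Because the intercept is defined as $\bar{y} - b_1 \bar{x}$, the fitted line always passes through the point $(\bar{x}, \bar{y})$.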
Regression Line: "line of best fit", least squares line, ordinary least squares (OLS) - minimum sum of squared residuals: $\min \sum e^2$
Variable | Coefficient
Constant | .3
Year70 | .5
$\hat{y}$ = .3 + .5 * x-value
Slope = units of y per x-unit: "a change of one unit in x will change $\hat{y}$ by this amount"
-Slope of the least squares line for z-scores is the correlation (r): "over one standard deviation in x, up r standard deviations in $\hat{y}$"
Intercept = a starting value in y-units; "predicted y ($\hat{y}$) when x = 0"
-The farther the x value is from $\bar{x}$, the less trust we should place in the predicted value
Extrapolation: prediction in which we venture into far x territory
Leverage ($h_i$): measures the influence of each case on the regression equation (x-outliers); $0 < h_i < 1$; DEPENDS ONLY ON PREDICTOR (X) VARIABLE
-Cases w/ high leverage should be noted because they may dominate the determination of the regression equation
Influential points: points that have high leverage and are model outliers (**can hide in plots of residuals because they can pull the line close to them, so they would have small residuals)
Lurking Variable: hidden variable that stands behind a relationship and determines it by simultaneously affecting both variables
Subsets: all data in a linear model must be homogeneous; if the data consist of 2 or more groups, it is usually better to fit a different linear model to each group
Residual (e) = observed y − predicted y: $e = y - \hat{y}$
Data = Model + Residual
-Best residual plot is scattered / not clustered; it should show no pattern at all
-A negative residual means the predicted value is an overestimate / a positive residual means the predicted value is an underestimate
-If points go from spread out to narrow (clustered), then the linear model may not be appropriate because the spread is changing; if they maintain equal spread but there is a dip, the linear model is not appropriate because the relationship is not linear
-The standard deviation of residuals measures the scatter around the line
$R^2$ = squared correlation: "the fraction of the variability of y accounted for by its least squares linear regression on x"; accounts for VARIATION of y
$R^2$ = 79% means 79% of the variation in y can be accounted for by the linear model ($\hat{y} = b_0 + b_1 x$)
$1 - R^2$ = the fraction of variability left behind in the residuals
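The relationship between residuals and $R^2$ can be checked numerically. The sketch below uses invented data and hypothetical helper names; it fits the least squares line, computes $e = y - \hat{y}$, and compares $1 - SSE/SST$ with the squared correlation.

```python
import math

def least_squares(x, y):
    """Return (b0, b1) minimizing the sum of squared residuals."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1

def residuals(x, y, b0, b1):
    """Residual e = observed y - predicted y."""
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

def r_squared(y, resid):
    """Fraction of the variation in y accounted for by the line."""
    ybar = sum(y) / len(y)
    sst = sum((yi - ybar) ** 2 for yi in y)
    sse = sum(e ** 2 for e in resid)
    return 1 - sse / sst

def correlation(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Made-up data for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [3.1, 4.4, 6.2, 6.9, 9.3, 9.8, 12.4, 12.9]

b0, b1 = least_squares(x, y)
resid = residuals(x, y, b0, b1)
r2 = r_squared(y, resid)
r = correlation(x, y)

print("residuals:", resid)
print("R^2 =", r2, " r^2 =", r ** 2, " left in residuals =", 1 - r2)
```

A positive entry in `resid` marks a case the line underestimates, and a negative entry marks an overestimate, matching the sign rule in the notes.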
Re-expression of Data: using logarithms, square roots, squares, reciprocals, etc. to
1) Make a distribution more symmetric (histogram)
2) Make the spread of several groups more alike (boxplots)
3) Make the form of a scatterplot more linear
4) Make the scatter of a scatterplot more evenly spread
Regression to the Mean: because the correlation is always less than 1.0 in magnitude, each predicted y tends to be fewer Stand Devs away from its mean than its corresponding x was from its mean
Who: 882 births What: mother's age, length of pregnancy, gender of baby, etc When: 1998 Where: City hospital Why: researchers were investigating impact of prenatal care on baby How: probably hospital records
Who: automobiles in student and staff lots (case: each car is a case) What: make, type When: NA Where: large university Why: NA How: survey
Who: 28,892 men aged 30-87 (case: each man) What: fitness level When: over 10 yr period Where: NA Why: establish association between death and fitness
Random: an event is random if we know what outcomes could happen, but not which particular values did or will happen
Outcome: an individual result of a component of a simulation--random if we cannot predict its value beforehand
-Random outcomes exhibit long-run regularity; the tendency of an observed proportion to settle down with more and more data is an example of regularity; regularity lets us use randomness to reduce bias in sampling and designed experiments
Simulation: models random events by using random numbers to specify event outcomes with relative frequencies that correspond to the true real-world frequencies we are trying to model
Simulation component: a situation in a simulation in which something happens at random
Population: the entire group of individuals or instances about whom we hope to learn
Sample: a (representative) subset of a population, examined in hope of learning about the population
Sample Size: number of individuals in a sample
Size of population does not matter, only sample size does (soup example)--a large sample yields more precise estimates about the whole population, unless the sample is nonrandom or unrepresentative; large sample size cannot correct a biased design
Census: sample that consists of the entire population
Sampling Frame: a list of individuals from whom the sample is drawn; the group we are trying to collect info from
Sampling variability: natural tendency for randomly drawn samples to differ; natural result of random sampling (sometimes called sampling error)
Surveys: used to learn about a population
Randomization: best defense against bias; each individual is given a fair, equal chance of selection
Matching: attempt to force a sample to resemble specified attributes; may make a better sample, but is no replacement for randomization
Population Parameter: a numerically valued attribute of a MODEL for a pop.; "value of statistic on entire population"
Statistic: value calculated from sampled data; # used to describe real world data
Statistical Inference: take data at hand and apply it to the population at large
Representative: a sample is said to be representative if the statistics computed from it accurately represent the corresponding population parameters
Simple Random Sample: each element of the population has an equal and independent chance of being selected
Stratified Random Sample: population divided into several subpopulations (strata, whose members share characteristics) and random samples are drawn from each stratum; if we know the true pop. proportions of these characteristics, we should arrange for the sample to represent the population by having the correct proportions
Cluster Sample: entire groups, or clusters, are chosen at random; usually used as a matter of convenience, practicality, or cost; each cluster should be heterogeneous (representative of pop.)
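The difference between a simple random sample and a stratified random sample can be sketched as follows. The sampling frame, the strata labels "A" and "B", and the function names are all hypothetical; proportional allocation by rounding is an assumption and, in general, the rounded shares need not add up exactly to the requested sample size.

```python
import random

def simple_random_sample(frame, k, rng):
    """Every individual in the frame has the same chance of selection."""
    return rng.sample(frame, k)

def stratified_sample(frame, stratum_of, k, rng):
    """Draw a random sample from each stratum, sized in proportion to it."""
    strata = {}
    for person in frame:
        strata.setdefault(stratum_of(person), []).append(person)
    sample = []
    for members in strata.values():
        # Proportional allocation (assumed); rounding may not sum to k in general.
        share = round(k * len(members) / len(frame))
        sample.extend(rng.sample(members, share))
    return sample

# Hypothetical frame: 60 individuals in stratum "A", 40 in stratum "B".
rng = random.Random(0)
frame = [("A", i) for i in range(60)] + [("B", i) for i in range(40)]

srs = simple_random_sample(frame, 10, rng)
strat = stratified_sample(frame, lambda person: person[0], 10, rng)

count_a = sum(1 for person in strat if person[0] == "A")
count_b = sum(1 for person in strat if person[0] == "B")
print("SRS:", srs)
print("Stratified: A =", count_a, "B =", count_b)
```

The stratified draw always reproduces the 60/40 split of the frame, whereas the split in the simple random sample varies from draw to draw (sampling variability).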
Systematic Sample: a sample drawn by selecting individuals systematically from a sampling frame; only when there is no relationship between the order of the sampling frame and the variables of interest can a systematic sample be representative; to make it random, must start w/ a randomly selected person
Multistage Sample: combines several sampling methods
Convenience Sample: consists of individuals who are conveniently available; often fails to be representative because each person in the pop. is not equally convenient
Voluntary Response Bias: when individuals choose on their own whether to participate in the sample; these samples are always invalid and cannot be recovered
Undercoverage: a design that fails to sample from some part of the pop. OR that samples in a way that gives one part of the pop. less representation than it has in the population
Nonresponse Bias: when a large fraction of those sampled refuse to participate, or cannot be contacted, observed, or measured; those who do respond are likely not to represent the entire sample; voluntary response bias is a form of nonresponse bias, but nonresponse bias can occur for other reasons
Response Bias: anything in survey design that influences responses; ex: wording of questions, changes in responses to please the interviewer
Observational Study: researchers don't assign choices, they simply observe them; a study based on data in which no manipulation of factors is used
-Observational studies LACK RANDOMIZATION
Matching - sometimes this may be necessary if subjects are similar in ways not under study--in this case, matching subjects can reduce unwanted variation in much the same way as blocking does in an experiment
Experiment: manipulates the factor levels to create treatments; randomly assigns subjects to these treatment levels; compares the responses of the subject groups across treatment levels; the experimenter actively and deliberately manipulates the factors to control details of the possible treatments, and assigns the subjects to those treatments at random
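The systematic sample defined earlier in this section (every k-th individual from the frame, starting from a randomly selected person) can be sketched as below. The frame of 100 numbered individuals and the function name `systematic_sample` are hypothetical.

```python
import random

def systematic_sample(frame, step, rng):
    """Pick a random start among the first `step` individuals, then every step-th."""
    start = rng.randrange(step)
    return frame[start::step]

# Hypothetical sampling frame: individuals numbered 1 to 100.
rng = random.Random(1)
frame = list(range(1, 101))
sample = systematic_sample(frame, 10, rng)
print(sample)
```

The random start is what gives every individual an equal chance of selection; the sample is only representative if the order of the frame is unrelated to the variables of interest.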
[Diagram: Random Allocation -> Group 1 (Treatment 1) and Group 2 (Treatment 2) -> Compare]
Retrospective Study: study in which subjects are selected and their previous conditions or behaviors are determined; not random samples
Prospective Study: subjects are followed to observe future outcomes; no treatments deliberately applied (**BETTER because people have bad memories)
Response: variable being acted upon and measured (Quantitative)
Subject (or participant): individuals on whom or which we experiment (more generically called experimental units)
Factor: explanatory variable in an experiment (Qualitative)
-Specific values the experimenter chooses for a factor are called the levels of the factor
Treatment: combination of specific levels from all factors that an experimental unit receives
Statistically Significant: when the differences we observed are bigger than we would expect from chance alone, so they can be attributed to the treatments
Control Treatment: default treatment or null (placebo) treatment
Control Group: group of subjects to whom the control treatment is applied
Confounding: when levels of one factor are associated with the levels of another factor (different teaching styles in different seasons)
Blinding: disguise treatments; blinding by misleading is also good; two classes of individuals can affect the outcome:
1. Those who influence results (subjects, treatment administrators, technicians)
2. Those who evaluate the results (judges, treating physicians, etc)
Single Blind: when every individual in either of the classes is blinded
Double Blind: when everyone in both classes is blinded
Placebo: "fake" (null) treatment; looks like the treatments being tested
Placebo effect: when subjects report changes from a placebo
Best Experiment: randomized, comparative (multiple trials), double-blind, placebo-controlled
FOUR PRINCIPLES OF EXPERIMENTAL DESIGN:
1. Control: we control sources of variation other than the factors being tested by making conditions as similar as possible for all treatment groups
2. Randomize: in assigning subjects to treatments; allows us to equalize the effects of unknown or uncontrollable sources of variation; if subjects are not assigned to treatments at random, you will not be able to use the powerful methods of statistics to draw conclusions from your study
3. Replicate: 2 kinds of replication: a. repeat the experiment, applying the treatments to DIFFERENT SUBJECTS; b. replicate when experimental units are not a representative sample from the population of interest; the outcome of an experiment on a single subject is an anecdote, not data; replication of an entire experiment with the controlled sources of variation at different levels is an essential step in science
4. Block: when attributes of the subjects that we are not studying and cannot control may affect the outcomes of the experiment (such as some people being much more skilled than others); if we group similar individuals together and randomize within these blocks, we can remove much of the variability due to the difference among the blocks; blocking is an important compromise between randomization and control--NOT required
-Completely Randomized Design: all experimental units have an equal chance of receiving any treatment
-Randomized Block Design: when we assign experimental units to treatments at random, but within each block
-BLOCKING in an experiment is similar to STRATIFYING in a SAMPLE: both help remove variation in the data ...
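The two designs named above differ only in where the randomization happens. The sketch below is one way to illustrate that, under stated assumptions: the eight subjects, the "skill" blocking attribute, the two treatment labels, and the function names are all hypothetical.

```python
import random

def completely_randomized(subjects, treatments, rng):
    """Shuffle all subjects, then deal them out to treatments in turn."""
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    return {s: treatments[i % len(treatments)] for i, s in enumerate(shuffled)}

def randomized_block(subjects, block_of, treatments, rng):
    """Group similar subjects into blocks, then randomize within each block."""
    blocks = {}
    for s in subjects:
        blocks.setdefault(block_of[s], []).append(s)
    assignment = {}
    for members in blocks.values():
        assignment.update(completely_randomized(members, treatments, rng))
    return assignment

# Hypothetical subjects, blocked on an attribute we are not studying (skill).
rng = random.Random(2)
subjects = ["s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8"]
skill = {"s1": "high", "s2": "high", "s3": "high", "s4": "high",
         "s5": "low", "s6": "low", "s7": "low", "s8": "low"}
treatments = ["treatment", "placebo"]

crd = completely_randomized(subjects, treatments, rng)
rbd = randomized_block(subjects, skill, treatments, rng)

for s in subjects:
    print(s, skill[s], "CRD:", crd[s], "RBD:", rbd[s])
```

In the completely randomized design the skill groups may end up unevenly split between treatments; in the randomized block design each skill level is split evenly by construction, which is how blocking removes that source of variation.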
- Fall '07