Statistics 10 / Fall 11
Review of the concepts for midterm one
Plots and when to use each depending on the type of the variable
Histogram (used for showing the distribution of quantitative/numerical data: weight, height, IQ)
Boxplot / side-by-side boxplots (used for quantitative data; particularly useful for looking at the five-number summary and for the relationship between a qualitative and a quantitative variable; example: distribution of SATQ scores for males and females)
Bar chart (used for qualitative data)
Stem-and-leaf plot (used for quantitative data when N is not large; it shows the original data, while a histogram does not)
Segmented bar chart: used for showing the frequencies that result from a contingency table based on two qualitative variables. To make the bars the same size, it is recommended to use row or column percentages. Example: percentage of males and females who major in engineering and medicine. You could have two bars, one for medicine and one for engineering, and within each bar the percentage of males and females. Or you could have two bars, one for males and one for females, and within each bar the percentage of engineering and medicine.
Contingency table with interpretation of row and column percentages: see the explanation for the segmented bar chart. Contingency tables give you the information needed to build the segmented bar chart.
Ogives / concept of cumulative percent: given X, find the cumulative percent, and vice versa
An ogive is a scatterplot of cumulative percentage vs. X. Example: cumulative percentage vs. IQ. Given an IQ, you can find its percentile, and vice versa.
Scatterplots: used to show the relationship between two quantitative variables, such as height and weight of infants, or college GPA and freshman GPA.
Measures of center: mean, median, mode
Mean or $\bar{X}$: $\sum X / N$
Median = 50th percentile (better measure of center for skewed data)
Mode = X with the highest frequency (least useful as a measure of center)
Percentages and cumulative percents
Percentage = frequency * 100 / N
Cumulative percentages are the same as percentiles. For example, Q1 is the 25th percentile: the point below which 25% of your data fall and above which 75% of the data fall.
Measures of spread
IQR (Q3 - Q1): recommended for very skewed data with outliers that affect the SD
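The measures of center, the percentage and cumulative-percentage definitions, and the IQR can be checked on a small example. The sketch below uses made-up scores (not the course data), and it reads Q1 and Q3 off the cumulative percentages the way one would read an ogive; other quartile conventions give slightly different values.

```python
from collections import Counter
from statistics import mean, median, multimode

# Hypothetical scores, made up for illustration only.
scores = [380, 440, 440, 450, 480, 480, 480, 500, 530, 540]
n = len(scores)

center = {
    "mean": mean(scores),        # sum of X divided by N
    "median": median(scores),    # 50th percentile
    "mode": multimode(scores),   # value(s) with the highest frequency
}

# Percentage = frequency * 100 / N, accumulated into cumulative percent.
freq = Counter(scores)
cumulative = {}
running = 0.0
for x in sorted(freq):
    running += freq[x] * 100 / n
    cumulative[x] = running


def percentile_of(x):
    """Given X, return the cumulative percent of scores at or below X."""
    return sum(1 for s in scores if s <= x) * 100 / n


def score_at(pct):
    """Given a cumulative percent, return the smallest X that reaches it."""
    for x in sorted(cumulative):
        if cumulative[x] >= pct:
            return x
    return max(scores)


q1, q3 = score_at(25), score_at(75)
iqr = q3 - q1

print(center)
print(cumulative)   # the (X, cumulative percent) pairs are the points of an ogive
print(q1, q3, iqr)
```

Plotting the pairs in `cumulative` gives the ogive; `percentile_of` and `score_at` are the two directions of reading it (given X find the cumulative percent, and vice versa).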
Variance and standard deviation
Variance: $S^2 = \sum (X - \bar{X})^2 / (N - 1)$
We square the deviations because $\sum (X - \bar{X})$ is equal to zero. We divide by N - 1 because we want the measure of scatter to be independent of N. We subtract one because we are estimating the mean, so we lose one degree of freedom (we will discuss this in detail later; do not worry about it).
SD = square root of the variance. The reason we take the square root of the variance and use SD as a measure of spread is that the measures of center and spread need to be in the same unit. For instance, if we are looking at weight in pounds, the variance will be in pounds^2, so we need to take its square root to get the SD, which will be in pounds.
Normal distribution
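As a quick numerical check of the variance and SD formulas, the sketch below uses made-up weights in pounds. It shows that the raw deviations sum to zero (the reason for squaring) and that the hand computation with N - 1 agrees with Python's `statistics.variance`.

```python
import math
import statistics

# Hypothetical weights in pounds, made up for illustration only.
weights = [120.0, 135.0, 150.0, 165.0, 180.0]
n = len(weights)
x_bar = sum(weights) / n

deviations = [x - x_bar for x in weights]
sum_dev = sum(deviations)                              # zero, so we square

variance = sum(d ** 2 for d in deviations) / (n - 1)   # units: pounds squared
sd = math.sqrt(variance)                               # units: pounds

# statistics.variance also divides by N - 1 (sample variance).
assert math.isclose(variance, statistics.variance(weights))

print(sum_dev, variance, round(sd, 3))
```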
Calculation of z to find percentile
Use $Z = (X - \bar{X}) / S(X)$ to find Z, then go to the table to find the relevant area.
Given percentile, find X
Given the area, find Z and then solve for X. Be careful to find the right Z value. One thing to keep in mind is that if your percentile is less than 50, Z will be negative, and if your percentile is more than 50, Z will be positive.
Normal quantile plots used to check normality
If your data are normal, then when you build the normal quantile plot, which is a plot of Z values (range of -3 to +3) against X (the variable of interest), there should be a pretty good linear relationship and the points should line up. If not, the points will fall away from the line, or cluster above or below it, depending on the shape of the histogram of the data.
When to use the normal distribution and when not to
You cannot use the normal approximation if your data do not fit the normal distribution.
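The two z calculations in this section (X to percentile, and percentile back to X) can be sketched as follows. The mean of 100 and SD of 15 are assumed values chosen for illustration, and Python's `statistics.NormalDist` stands in for the printed Z table.

```python
from statistics import NormalDist

# Assumed values for illustration: IQ-like scores with mean 100 and SD 15.
x_bar, s = 100.0, 15.0
standard_normal = NormalDist()   # mean 0, SD 1; plays the role of the Z table


def percentile_from_x(x):
    """Z = (X - mean) / SD, then look up the area below Z."""
    z = (x - x_bar) / s
    return z, standard_normal.cdf(z) * 100


def x_from_percentile(pct):
    """Find Z for the given area, then solve X = mean + Z * SD."""
    z = standard_normal.inv_cdf(pct / 100)
    return z, x_bar + z * s


z_hi, pct_hi = percentile_from_x(115)   # above the mean: Z is positive
z_lo, x_lo = x_from_percentile(25)      # percentile below 50: Z is negative

print(z_hi, round(pct_hi, 2))
print(round(z_lo, 4), round(x_lo, 2))
```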
Normality can be checked through the normal quantile plot, or you can judge the fit by seeing whether your actual data are close to what the normal distribution predicts: do about 68% and 95% of your data fall within one and two standard deviations of the mean, respectively?
Correlation
Pearson correlation can be used only for linear, not nonlinear, relationships. It is used only for the relationship between two quantitative variables, not qualitative ones. Before using the formula for correlation, draw the scatterplot. Be aware of the effect of influential points on the magnitude of the correlation.
$r = \sum (X - \bar{X})(Y - \bar{Y}) / [S(X) \, S(Y) \, (N - 1)]$
$r = \sum Z(X) \, Z(Y) / (N - 1)$, where Z(X) and Z(Y) are the standardized scores
Mathematically, the coefficient of correlation is the relationship between two sets of standardized scores, and since it is also divided by N - 1, it is unit-free and independent of N. So coefficients of correlation computed on data with different sample sizes and units of measurement are comparable. For example, if we find r between ACT and GPA for 200 students and r between SAT and GPA for 500 students, we can compare the values.
Interpretation of correlation
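A minimal sketch of the correlation formula, written as the sum of products of standardized scores divided by N - 1. The data are made up for illustration. The sketch also illustrates three points from the correlation notes: r does not change when the units change, a single extreme point can change r drastically, and a clearly curved relationship can give r = 0.

```python
import math


def mean(values):
    return sum(values) / len(values)


def sd(values):
    """Sample standard deviation (divides by N - 1)."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def pearson_r(xs, ys):
    """r = sum of Z(X) * Z(Y), divided by N - 1."""
    mx, my, sx, sy = mean(xs), mean(ys), sd(xs), sd(ys)
    products = [((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)


# Hypothetical data, made up for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

r = pearson_r(x, y)

# Changing the units of X and shifting Y leaves r unchanged (unit-free).
r_rescaled = pearson_r([v * 100 for v in x], [v + 50 for v in y])

# Adding one extreme point far from the rest changes r drastically.
r_with_extreme = pearson_r(x + [20.0], y + [0.0])

# A symmetric curved relationship (Y = X squared) gives r = 0.
r_curved = pearson_r([-2.0, -1.0, 0.0, 1.0, 2.0], [4.0, 1.0, 0.0, 1.0, 4.0])

print(round(r, 4), round(r_rescaled, 4), round(r_with_extreme, 4), round(r_curved, 4))
```

Comparing `r` with `r_with_extreme` is the "run it with and without the point" check: here the sign of the correlation flips when the extreme point is included.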
Linear correlation simply shows whether there is an association between two quantitative variables; it does not lead to causal conclusions.
Nonlinear relationships
A nonlinear relationship cannot be measured with the Pearson coefficient of correlation. If we use the Pearson formula to calculate r for a nonlinear relationship, we might find r = 0 even though the variables are clearly related.
Linear regression
Slope and intercept: calculation and interpretation
If there is a linear relationship between two quantitative variables, you can use one to predict the other. The predictor usually happens before the outcome.
Slope: $b_1 = r \cdot S(Y) / S(X)$
Intercept: $b_0 = \bar{Y} - b_1 \bar{X}$
Y = outcome, X = predictor
b1 interpretation: as the predictor changes by one unit, the outcome changes by the amount of the slope, b1.
b0 interpretation: if the predictor = zero, the outcome = the intercept, which is sometimes also called the constant.
Least squares regression line
$\hat{Y} = b_0 + b_1 X$
It is the line that minimizes the sum of the squared errors, that is, the squared distances between the actual scores and the predicted scores. The predicted scores are always on the least squares regression line, and the actual ones are scattered around the line. The smaller the distance of the actual score from the predicted score, the smaller the residual, or error.
Residuals
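The slope, intercept, and regression line formulas can be followed step by step on a small example. The data below are made up for illustration; the residual for each point is the actual score minus the predicted score.

```python
import math

# Hypothetical data, made up for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]   # predictor
y = [2.0, 4.0, 5.0, 4.0, 5.0]   # outcome
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n
s_x = math.sqrt(sum((v - x_bar) ** 2 for v in x) / (n - 1))
s_y = math.sqrt(sum((v - y_bar) ** 2 for v in y) / (n - 1))
r = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (s_x * s_y * (n - 1))

b1 = r * s_y / s_x        # slope
b0 = y_bar - b1 * x_bar   # intercept

predicted = [b0 + b1 * v for v in x]                        # points on the line
residuals = [actual - fit for actual, fit in zip(y, predicted)]

print(round(b1, 3), round(b0, 3))
print([round(e, 3) for e in residuals])
```

With these numbers the slope is 0.6 and the intercept is 2.2, so each one-unit increase in the predictor raises the predicted outcome by 0.6.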
Residuals are the differences between the actual and the predicted scores: $e = Y_i - \hat{Y}_i$
Regression assumptions (linearity, equality of variance, independence) and how you can check them
Independence: all of the persons in the study should have an equal chance of being selected, and the choice of one should not depend on the choice of another. This assumption is not checked statistically; it is checked through how the sample was selected.
Linearity: there should be a linear relationship between the predictor and the outcome. This is checked by the scatterplot or the plot of residuals vs. X. The scatterplot should show a linear relationship, and there should be no pattern to the data in the plot of residuals. The residuals should be equally scattered around their mean of zero (mean of residuals = 0).
Equality of the variance of the error: this is checked by the plot of residuals (e) or standardized residuals Z(e) vs. X. No pattern to the data shows that the variance of the error is similar for different values of X.
Plots of residuals and standardized residuals vs. the predictor and what they show: see the explanation under the regression assumptions.
Mean, variance, and standard deviation of residuals
The mean of the residuals is equal to zero because some of the residuals are positive and some are negative, so their sum is zero and so is the mean.
$\bar{e} = \sum (Y - \hat{Y}) / N = \sum e / N = 0$
$S^2(e) = \sum (e - \bar{e})^2 / (N - 2) = \sum e^2 / (N - 2)$
We are squaring and summing the errors and dividing by N - 2. The reason we divide by N - 2 is that we are estimating the slope and the intercept. This formula is very similar to the variance of X, $S^2(X) = \sum (X - \bar{X})^2 / (N - 1)$. All you are doing is replacing X with e, and the mean drops out because the mean of the errors is equal to zero. The concept is exactly the same.
S(e) is the square root of $S^2(e)$.
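A short numerical check of the residual formulas, using made-up data. The line is fitted with the equivalent least squares form of the slope (sum of cross-products over sum of squared X deviations), and the standardized residual is taken here simply as e divided by S(e), which is an assumption of this sketch; software packages often use a more refined version.

```python
import math

# Hypothetical data, made up for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n
b1 = (sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
      / sum((a - x_bar) ** 2 for a in x))
b0 = y_bar - b1 * x_bar

residuals = [b - (b0 + b1 * a) for a, b in zip(x, y)]

e_bar = sum(residuals) / n                        # mean of residuals, zero
var_e = sum(e ** 2 for e in residuals) / (n - 2)  # divide by N - 2
s_e = math.sqrt(var_e)                            # standard deviation of residuals

# Simple standardized residuals Z(e) = e / S(e), since the mean of e is zero.
z_e = [e / s_e for e in residuals]

print(round(e_bar, 10), round(var_e, 4), round(s_e, 4))
print([round(z, 3) for z in z_e])
```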
Outliers, leverages, and influential points
The best way to decide whether a point is an outlier is to look at its standardized residual: if it is more than +3 or less than -3, you can consider it an outlier. Remember that 99.7% of the data fall within three standard deviations of the mean, so you are considering 0.3% of your data to be outliers. However, this is not set in stone; you can set your own rule, such as greater than +2 or less than -2, which treats about 5% as outliers. So, outliers have a large residual and are far away from the regression line.
Leverage points usually have an extreme X value, far from the mean of X.
Influential points are those that totally change the direction of the regression line and have a major effect on the coefficient of correlation. To decide whether a point is influential, find out what happens to the magnitude of the coefficient of correlation (does it go up or down?) if you remove that point from the data. If nothing much happens, then the point is not really influential. The recommendation is to run the data with and without the points that you consider potentially influential and see how the coefficient of correlation changes.
Formulas you need to know
They are given to you within the context of the explanations offered above.
Mean, variance, standard deviation, correlation, slope, intercept, least squares regression line, residual, variance of residuals, standard deviation of residuals.
[Frequency table: SATQ score for a sample of UCLA students. Columns: Frequency, Valid Percent, Cumulative Percent.]
 Fall '11
 Gould
