10_review_midterm_one_fall_11


Statistics 10 / Fall 11. Review of the concepts for midterm one

Plots and when to use each depending on the type of the variable

Histogram: used for showing the distribution of quantitative (numerical) data such as weight, height, or IQ.

Boxplot / side-by-side boxplots: used for quantitative data. Particularly useful for looking at the five-number summary and, when drawn side by side, at the relationship between a qualitative and a quantitative variable. Example: distribution of SATQ scores for males and females.

Bar chart: used for qualitative data.

Stem-and-leaf plot: used for quantitative data when N is not large. It shows the original data, while a histogram does not.

Segmented bar chart: used for showing the frequencies that result from a contingency table based upon two qualitative variables. To make the bars the same size, it is recommended to use row or column percentages. Example: percentage of males and females who major in engineering and medicine. You could have two bars, one showing medicine and one showing engineering, and within each bar you have the percentage of males and females. Or you could have two bars showing males and females, and within each bar you have the percentage of engineering and medicine.

Contingency table with interpretation of row and column percentages: see the explanation for the segmented bar chart. Contingency tables give you the information needed to build the segmented bar chart.

Ogives / concept of cumulative percent; given X find cum% and vice versa: an ogive is a plot of cumulative percentage vs. X. Example: cumulative percentage vs. IQ. Given IQ, you can find the percentile, and vice versa.

Scatterplots: used to show the relationship between two quantitative variables, such as height and weight of infants, college GPA and freshman GPA, etc.
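The ogive idea above (given X, find the cumulative percent, and vice versa) can be sketched in Python. This is only a minimal illustration under stated assumptions: the function names and the IQ scores are invented for the example, and "given cum%, find X" is taken here to mean the smallest X whose cumulative percent reaches the target.

```python
from collections import Counter

def cumulative_percent_table(values):
    """Return (x, frequency, percent, cumulative percent) rows sorted by x."""
    n = len(values)
    counts = Counter(values)
    rows = []
    running = 0
    for x in sorted(counts):
        running += counts[x]
        # Percentage = frequency * 100 / N; cumulative uses the running count.
        rows.append((x, counts[x], 100 * counts[x] / n, 100 * running / n))
    return rows

def cum_percent_at(rows, x):
    """Given X, find the cumulative percent of values <= x."""
    result = 0.0
    for value, _freq, _pct, cum in rows:
        if value <= x:
            result = cum
    return result

def x_at_cum_percent(rows, target):
    """Given a cumulative percent, find the smallest X that reaches it."""
    for value, _freq, _pct, cum in rows:
        if cum >= target:
            return value
    return rows[-1][0]

# Hypothetical IQ scores, invented for illustration only.
iq_scores = [85, 90, 90, 100, 100, 100, 110, 110, 115, 130]
rows = cumulative_percent_table(iq_scores)

for x, freq, pct, cum in rows:
    print(x, freq, pct, cum)

print(cum_percent_at(rows, 100))    # cumulative percent at IQ = 100
print(x_at_cum_percent(rows, 50))   # X at the 50th percentile (median)
```

Plotting the first and last columns of `rows` against each other would give the ogive itself.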
Measures of center: mean, median, mode

Mean: $\bar{X} = \sum X / N$
Median = 50th percentile (a better measure of center for skewed data)
Mode = the X value with the highest frequency (least useful as a measure of center)

Percentages and cumulative percents

Percentage = frequency * 100 / N
Cumulative percentages are the same as percentiles. For example, Q1 is the 25th percentile: the point below which 25% of your data fall and above which 75% of the data fall.

Measures of spread

IQR (Q3 - Q1): recommended for very skewed data with outliers that affect the SD.

Variance and standard deviation

Variance: $S^2 = \sum (X - \bar{X})^2 / (N - 1)$. We square the deviations because $\sum (X - \bar{X})$ is equal to zero. We divide by N - 1 because we want the measure of scatter to be independent of N. We subtract one because we are estimating the mean, so we lose one degree of freedom (we will discuss this in detail later; do not worry about it).

SD = square root of the variance. The reason we take the square root of the variance and use the SD as a measure of spread is that the measures of center and spread need to be in the same unit. For instance, if we are looking at weight in pounds, the variance will be in pounds^2, so we need to take its square root to get the SD, which will be in pounds.

Normal distribution

Calculation of Z to find the percentile: use $Z = (X - \bar{X}) / S(X)$ to find Z, then go to the table to find the relevant area.

Given the percentile, find X: given the area, find Z and then solve for X, that is, $X = \bar{X} + Z \cdot S(X)$. Be careful to find the right Z value. One thing to keep in mind is that if your percentile is less than 50, Z will be negative, and if your percentile is more than 50, Z will be positive.

Normal quantile plots used to check normality

If your data are normal, then when you build the normal quantile plot, which is a plot of Z values (range of -3 to +3) against X (the variable of interest), there should be a pretty good linear relationship and the points should line up.
If the data are not normal, the points on the normal quantile plot will fall away from the line, or cluster above or below it, depending on the shape of the histogram of the data.

When to use the normal distribution and when not to

You cannot use the normal approximation if your data do not fit the normal distribution. Normality can be checked through the normal quantile plot, or you can judge the fit by seeing whether your actual data are close to what the normal distribution predicts: do about 68% and 95% of your data fall within one and two standard deviations of the mean, respectively?

Correlation

Pearson correlation can be used only for linear, not nonlinear, relationships. It is only used for the relationship between two quantitative, not qualitative, variables. Before using the formula for correlation, draw the scatterplot. Be aware of the effect of influential points on the magnitude of the correlation.

$r = \sum [(X - \bar{X})(Y - \bar{Y})] / [S(X) \cdot S(Y) \cdot (N - 1)]$

Equivalently, writing $x = X - \bar{X}$ and $y = Y - \bar{Y}$ for the deviation scores:

$r = \sum (x \cdot y) / [S(X) \cdot S(Y) \cdot (N - 1)]$

Mathematically, the coefficient of correlation is the relationship between two sets of standardized scores, and since it is also divided by N - 1, it is unit free and independent of N. So coefficients of correlation computed on data with different sample sizes and units of measurement are comparable. For example, if we find r between ACT and GPA for 200 students and r between SAT and GPA for 500 students, we can compare the values.

Interpretation of correlation

Linear correlation simply shows whether there is an association between two quantitative variables; it does not lead to causal conclusions.

Nonlinear relationships

A nonlinear relationship cannot be measured with the Pearson coefficient of correlation. If we use the Pearson formula to calculate r for a nonlinear relationship, we might find r = 0 even though the two variables are related.

Linear regression

Slope and intercept: calculation and interpretation

If there is a linear relationship between two quantitative variables, you can use one to predict the other. But the predictor usually happens before the outcome.
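Before the slope and intercept formulas, which build on r, S(X), and S(Y), here is a minimal Python sketch of the Z-score and correlation calculations described above. It is an illustration under assumptions, not the course's own procedure: the function names and the data are hypothetical, and Python's `statistics.NormalDist` stands in for the printed Z table.

```python
from statistics import NormalDist, mean, stdev

def z_scores(values):
    """Standardize each value: Z = (X - mean) / S(X), with S using N - 1."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def percentile_from_x(x, m, s):
    """Given X, find Z and then the area below Z (as a percentile)."""
    z = (x - m) / s
    return 100 * NormalDist().cdf(z)

def x_from_percentile(percentile, m, s):
    """Given a percentile strictly between 0 and 100, find Z and solve for X."""
    z = NormalDist().inv_cdf(percentile / 100)
    return m + z * s

def pearson_r(x, y):
    """r = sum of cross-products of deviations / (S(X) * S(Y) * (N - 1))."""
    x_bar, y_bar = mean(x), mean(y)
    cross = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return cross / (stdev(x) * stdev(y) * (len(x) - 1))

def pearson_r_from_z(x, y):
    """Same r, computed as the sum of products of Z scores over N - 1."""
    products = (zx * zy for zx, zy in zip(z_scores(x), z_scores(y)))
    return sum(products) / (len(x) - 1)

# Hypothetical paired data, invented for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)
print(r, pearson_r_from_z(x, y))        # the two forms agree
print(percentile_from_x(115, 100, 15))  # percentile of X = 115 when mean = 100, SD = 15
print(x_from_percentile(25, 100, 15))   # below the 50th percentile, so Z is negative
```

Computing r both ways makes the "standardized scores" interpretation concrete: the result does not change when X or Y is rescaled to different units.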
Slope: $b_1 = r \cdot S(Y) / S(X)$
Intercept: $b_0 = \bar{Y} - b_1 \bar{X}$
Y = outcome, X = predictor

b1 interpretation: as the predictor changes by one unit, the outcome changes by the amount of the slope, b1.
b0 interpretation: if the predictor = zero, the outcome = the intercept, which is sometimes also called the constant.

Least squares regression line

$\hat{Y} = b_0 + b_1 X$

It is the line that minimizes the sum of the squared errors, that is, the squared distances between the actual scores and the predicted scores. The predicted scores are always on the least squares regression line, and the actual ones are scattered around the line. The smaller the distance of the actual score from the predicted score, the smaller the residual, or error.

Residuals

Residuals are the difference between the actual and the predicted score: $e_i = Y_i - \hat{Y}_i$

Regression assumptions (linearity, equality of variance, independence) and how you can check them

Independence: all of the persons in the study should have an equal chance of being selected, and the choice of one does not depend on the choice of another. This assumption is not checked statistically. It is checked through how the sample was selected.

Linearity: there is a linear relationship between the predictor and the outcome. This is checked by the scatterplot or the plot of residuals vs. X. The scatterplot should show a linear relationship, and there should be no pattern to the data in the plot of residuals. The residuals should be equally scattered around the mean of zero (mean of residuals = 0).

Equality of the variance of the error: this is checked by the plot of residuals (e) or standardized residuals Z(e) vs. X. No pattern to the data shows that the variance of the error is similar for different values of X.

Plots of residuals and standardized residuals vs. the predictor and what they show

See the explanation under the regression assumptions.
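The slope, intercept, and residual formulas above can be sketched in Python as follows. This is a minimal illustration, assuming small made-up data; the function names are hypothetical, and the slope is computed exactly as written in the notes (r times S(Y) over S(X)) rather than through a library routine.

```python
from statistics import mean, stdev

def pearson_r(x, y):
    """r = sum of cross-products of deviations / (S(X) * S(Y) * (N - 1))."""
    x_bar, y_bar = mean(x), mean(y)
    cross = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return cross / (stdev(x) * stdev(y) * (len(x) - 1))

def fit_line(x, y):
    """Return (b0, b1) for the least squares line Y-hat = b0 + b1 * X."""
    b1 = pearson_r(x, y) * stdev(y) / stdev(x)  # slope
    b0 = mean(y) - b1 * mean(x)                 # intercept
    return b0, b1

def predict(b0, b1, x):
    """Predicted scores, which all lie on the regression line."""
    return [b0 + b1 * xi for xi in x]

def residuals(b0, b1, x, y):
    """e = actual score minus predicted score."""
    return [yi - y_hat for yi, y_hat in zip(y, predict(b0, b1, x))]

# Hypothetical predictor (X) and outcome (Y), invented for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b0, b1 = fit_line(x, y)
e = residuals(b0, b1, x, y)

print(b0, b1)   # intercept and slope
print(e)        # residuals scatter around zero
print(sum(e))   # sums to zero up to floating-point rounding
```

Plotting `e` against `x` gives the residual plot used to check linearity and equality of variance.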
Mean, variance, and standard deviation of residuals

The mean of the residuals is equal to zero because some of the residuals are positive and some are negative, so the sum is zero and so is the mean.

$\bar{e} = \sum (Y - \hat{Y}) / N = \sum e / N = 0$

$S^2(e) = \sum (e - \bar{e})^2 / (N - 2) = \sum e^2 / (N - 2)$

We are squaring and summing the errors and dividing by N - 2. The reason we divide by N - 2 is that we are estimating the slope and the intercept. This formula is very similar to the variance of X, $S^2(X) = \sum (X - \bar{X})^2 / (N - 1)$. All you are doing is replacing X with e, and you do not have a mean because the mean of the errors is equal to zero. The concept is exactly the same. S(e) is the square root of $S^2(e)$.

Outliers, leverages, and influential points

The best way to decide if a point is an outlier is to look at the standardized residual: if it is more than +3 or less than -3, you can consider it an outlier. Remember that 99.7% of the data fall within three standard deviations of the mean. So, you are considering 0.3% of your data to be outliers. However, this is not set in stone, and you can set your own rule, such as greater than +2 or less than -2, considering 5% to be outliers, etc. So, outliers have a large residual and are far away from the regression line.

Leverage points usually have a large X value, so that their X value is far from the mean of X.

Influential points are those that substantially change the direction of the regression line and have a major effect on the coefficient of correlation. To decide if a point is influential or not, you have to find out what happens to the magnitude of the coefficient of correlation (does it go up or down?) if you remove that point from the data. If nothing much happens, then the point is not really influential. The recommendation is to run the data with and without the points that you consider potentially influential and see how the coefficient of correlation changes.

Formulas you need to know

They are given to you within the context of the explanations offered above.
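As one illustration of the formulas discussed above, the following Python sketch computes S(e), flags outliers by standardized residual, and compares r with and without a suspect point. It rests on assumptions that should be read as such: the data and function names are invented, and the standardized residual is taken to be simply e / S(e), whereas statistical software often applies an additional leverage adjustment.

```python
import math

def mean(values):
    return sum(values) / len(values)

def fit_line(x, y):
    """Least squares (b0, b1); this slope equals r * S(Y) / S(X)."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

def standardized_residuals(x, y):
    """Z(e) = e / S(e), with S^2(e) = sum(e^2) / (N - 2).

    Assumes the fit is not perfect, otherwise S(e) would be zero.
    """
    b0, b1 = fit_line(x, y)
    e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    s_e = math.sqrt(sum(ei ** 2 for ei in e) / (len(x) - 2))
    return [ei / s_e for ei in e]

def flag_outliers(x, y, threshold=3.0):
    """Indexes of points whose standardized residual exceeds the cutoff."""
    z = standardized_residuals(x, y)
    return [i for i, zi in enumerate(z) if abs(zi) > threshold]

def pearson_r(x, y):
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

def r_with_and_without(x, y, index):
    """Return (r using all points, r after removing the point at index)."""
    x_rest = x[:index] + x[index + 1:]
    y_rest = y[:index] + y[index + 1:]
    return pearson_r(x, y), pearson_r(x_rest, y_rest)

# Hypothetical data: Y equals X except for one unusual point at index 5.
x = [float(v) for v in range(1, 12)]
y = list(x)
y[5] = 16.0

print(flag_outliers(x, y, threshold=2.0))  # the looser +/-2 rule
print(flag_outliers(x, y, threshold=3.0))  # the stricter +/-3 rule
print(r_with_and_without(x, y, 5))         # how r changes when the point is removed
```

With a sample this small, no standardized residual can reach 3, which is one practical reason the notes allow a looser cutoff such as 2.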
The formulas to know: mean, variance, standard deviation, correlation, slope, intercept, least squares regression line, residual, variance of residuals, standard deviation of residuals.

[Frequency table: SATQ score for a sample of UCLA students. Columns include Frequency, Valid Percent, and Cumulative Percent. Rows: valid scores (including 380.00, 440.00, 450.00, 480.00, 500.00, 530.00, 540.00), Missing (System), and Total ...]

This note was uploaded on 12/03/2011 for the course STATISTICS 10 taught by Professor Gould during the Fall '11 term at UCLA.
