### stats_class_2_notes

Course: SHUG 2053, Fall 2009
School: East Los Angeles College

Notes on Stats Class 2 (Rehan Ali, Linacre College)

1. Mode: The mode is the data value which occurs most frequently in a sample. When asked for a mode for continuous data, you first need to assign the data values to equal-sized intervals, then plot a histogram. The tallest peak will represent the data range that contains the mode.

Modes are normally not a good estimate of the actual value of your random variable (i.e. the thing you're trying to measure), as they can easily be biased by a skewed distribution. The mean and median are more resilient to this. However, there are exceptions, as discussed below.

2. Bimodal distributions: Sometimes you can have more than one obvious peak in a sample. In such a case, you have a bimodal distribution where, as the name suggests, you have two modes.

Figure 1: Bimodal distribution

In this case, the mean and median will be misleading (as they won't give any information about the two distinct groups, only about the entire sample). The mode is more useful here, as you can assign two modes, one to each main peak, and this will reflect the properties of your sample better.

3. Validity of regression lines: When fitting a regression line, it's useful to get a measure of how well that line fits the data. One popular method is to calculate a regression / correlation coefficient, where values close to 1 indicate a good fit. However, this is now considered a bad measure, as it only tells you how well a straight line can fit your data, but not whether it's better to fit a straight line as opposed to a different type of line (e.g. a curve). Try the Java applet at http://www.stattucino.com/berrie/dsl/regression/regression.html to see how misleading it can be with slightly curved data.

Instead, it's better to calculate and plot residuals.
To do this, simply find the distance between each data point and the regression / best-fit line (going straight up or down), and plot these. This is the method ...
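
As a rough illustration of the residual idea, here is a minimal sketch in plain Python. The data set (points on the curve y = x^2) and the function names `fit_line`, `correlation` and `residuals` are made up for this example and are not from the class. It fits a straight line by ordinary least squares, then computes the vertical distance from each point to that line.

```python
from math import sqrt


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def correlation(xs, ys):
    """Pearson correlation coefficient between xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy)


def residuals(xs, ys, slope, intercept):
    """Vertical distance (observed minus fitted) for each data point."""
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]


# Hypothetical, deliberately curved data: y = x^2 for x = 0..10.
xs = list(range(11))
ys = [x ** 2 for x in xs]

slope, intercept = fit_line(xs, ys)
r = correlation(xs, ys)
res = residuals(xs, ys, slope, intercept)

print(f"correlation coefficient r = {r:.3f}")
for x, e in zip(xs, res):
    print(f"x = {x:2d}  residual = {e:+.1f}")
```

For this made-up data the correlation coefficient comes out at about 0.963, which looks like a good fit. The residuals tell a different story: they are positive at both ends and negative in the middle. A pattern like that, rather than a random scatter around zero, suggests that a curve would describe the data better than a straight line.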

