# Statistics Flashcards

Terms and definitions

- **population**: the entire collection of individuals or objects about which information is desired
- **sample**: a subset of the population selected for study
- **categorical data (qualitative)**: a univariate data set whose observations are categorical
- **numerical data (quantitative)**: a univariate data set in which each observation is a number
- **discrete data**: the possible values of the variable correspond to isolated points on the number line; observations are determined by counting
- **continuous data**: the possible values form an entire interval on the number line
- **frequency distribution**: a table that displays the possible categories along with the associated frequencies and/or relative frequencies
- **bar chart**: used with categorical data; the horizontal axis shows the category names and the vertical axis shows frequency or relative frequency; look for frequently and infrequently occurring categories
- **observational study**: observes characteristics of a sample selected from one or more existing populations; the goal is to draw conclusions about the corresponding population or about differences between two or more populations
- **experiment**: the investigator observes how a response variable behaves when one or more explanatory variables (factors) are manipulated; the goal is to determine the effect of the manipulated factors; the researcher controls who is in which group
- **selection bias (undercoverage)**: the tendency for samples to differ from the corresponding population as a result of systematic exclusion of some part of the population
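
The frequency distribution card can be illustrated with a short sketch. The data values and the helper name `frequency_distribution` below are made up for illustration; the only idea taken from the cards is that a relative frequency is a frequency divided by the total number of observations.

```python
from collections import Counter

# Hypothetical categorical sample (made-up values, for illustration only).
responses = ["car", "bus", "car", "bike", "car", "bus", "walk", "car"]


def frequency_distribution(data):
    """Map each category to (frequency, relative frequency)."""
    counts = Counter(data)
    n = len(data)
    return {category: (count, count / n) for category, count in counts.items()}


table = frequency_distribution(responses)

# Print one row per category: name, frequency, relative frequency.
for category, (freq, rel_freq) in sorted(table.items()):
    print(f"{category:<6}{freq:>4}{rel_freq:>8.3f}")
```

The relative frequencies always sum to 1, which is why they are the better choice for the vertical axis when groups of different sizes are compared.
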
- **response bias (measurement)**: the tendency for samples to differ from the corresponding population because the method of observation tends to produce values that differ from the true value
- **simple random sample (SRS)**: a sample of size n selected from a population in a way that ensures every different possible sample of the desired size has the same chance of being selected
- **explanatory variables (factors)**: variables whose values are controlled by the experimenter
- **response variable**: a variable that is not controlled by the experimenter and is measured as part of the experiment
- **treatment**: an experimental condition
- **extraneous variable**: a variable that is not one of the explanatory variables in the study but is thought to affect the response variable
- **random assignment** (of subjects to treatments or of treatments to trials): used to ensure that the experiment does not systematically favor one experimental condition (treatment) over another
- **comparative bar chart**: gives a visual comparison of two or more groups by constructing two or more bar charts on the same set of horizontal and vertical axes; use relative frequency for the vertical scale so that comparisons are meaningful when sample sizes differ
- **stem-and-leaf display**: a compact way to summarize univariate numerical data; each number is broken into two pieces; used with a small to moderate number of observations (not large); the stem is the first part of the number and consists of the beginning digit(s); the leaf is the last part and consists of the final digit(s)
- **outlier** (p103): an unusually small or large data value
- **relative frequency distribution**: relative frequencies are calculated by dividing each frequency by the total number of observations in the data set
- **histogram**: a graph of the frequency or relative frequency distribution, similar to a bar chart for categorical data; used for discrete numerical data; works well for large data sets; has a horizontal and a vertical scale
- **unimodal**: a histogram with a single peak
- **bimodal**: a histogram with two peaks
- **positively skewed (right skewed)**: the upper tail of the histogram stretches out much farther than the lower tail
- **negatively skewed (left skewed)**: the lower tail is much longer than the upper tail
- **symmetric**: there is a vertical line of symmetry such that the part of the histogram to the left of the line is a mirror image of the part to the right
- **scatterplot**: the most important graph for bivariate numerical data; each observation (x, y) is plotted as the point where a vertical line from its x value on the x-axis meets a horizontal line from its y value on the y-axis; a fairly strong curved pattern indicates a strong relationship, though not a linear one
- **sample mean (average)**: the sum of all observations in the sample divided by the number of observations in the sample
- **sample median**: obtained by first ordering the n observations from smallest to largest (with any repeated values included, so that every sample observation appears in the ordered list); the sample median is then the single middle value if n is odd, or the average of the middle two values if n is even
- **comparing mean and median**: the median is the value on the measurement axis that separates the smoothed histogram into two equal parts; the mean is the balance point of the distribution. If the histogram is symmetric, the dividing point and the balance point coincide, so the mean and median are the same. If the histogram is unimodal with a longer upper tail (positive skew), the outlying values in the upper tail pull the mean up, so it will generally lie above the median; for example, an unusually high exam score raises the mean but does not affect the median. The reverse holds for negative skew.
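
The mean and median cards can be checked numerically. This is a minimal sketch using Python's standard `statistics` module; the exam scores are hypothetical values chosen so the arithmetic is easy to follow.

```python
from statistics import mean, median

# Hypothetical exam scores (made-up values, for illustration only).
scores = [60, 62, 65, 68, 70]

# Replace the top score with an unusually high one to lengthen the upper tail.
scores_with_high = scores[:-1] + [100]

# Odd n: the median is the single middle value of the ordered list.
print("original:", mean(scores), median(scores))

# The unusually high score pulls the mean up; the median does not move.
print("with high score:", mean(scores_with_high), median(scores_with_high))

# Even n: the median is the average of the middle two values (62 and 65).
even_scores = [60, 62, 65, 68]
print("even n median:", median(even_scores))
```

In the first list the mean and median agree because the values are roughly symmetric about 65. After the change the mean rises above the median, which matches the positive-skew case described on the card.
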
- **sample proportion**: the number of "S's" in the sample divided by n
- **range**: largest observation minus smallest observation
- **sample variance**: the sum of squared deviations from the mean divided by n - 1, that is, $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$
- **sample standard deviation**: the size of a "typical" or "representative" deviation from the mean; it is the positive square root of the sample variance and is denoted by s
- **quartiles and interquartile range (IQR)**: the IQR is a measure of variability that is resistant to the effects of outliers; lower quartile = median of the lower half of the sample; upper quartile = median of the upper half of the sample; IQR = upper quartile - lower quartile
- **five-number summary**: smallest observation, lower quartile (median of the lower half), median, upper quartile (median of the upper half), largest observation
- **skeletal boxplot**: see Figure 4.8 (practice)
- **outlier** (p185): an observation more than 1.5(IQR) away from the nearest quartile (the nearest end of the box); it is extreme if it is more than 3(IQR) from the nearest quartile and mild otherwise
- **z-score**: (value - mean) divided by the standard deviation; tells us how many standard deviations the value is from the mean; it is positive or negative according to whether the value lies above or below the mean
- **correlation coefficient r**: measures the strength of any linear relationship between two numerical variables; also called Pearson's correlation coefficient
- **least-squares line**: the sample regression line, that is, the line that minimizes the sum of squared vertical deviations of the points from the line; $\hat{y} = a + bx$, where a is the intercept and b is the slope; $\hat{y}$ is the prediction of y that results from substituting a particular x value into the equation
- **Data**: pieces of information about individuals, organized in variables
- **Dataset**: a set of data identified with particular circumstances
- **Nominal variables**: no natural order, e.g., gender or eye color
- **Ordinal variables**: natural order, e.g., categories ordered from strongly disagree to strongly agree
- **Interval variables**: a measurement or count for which it makes sense to discuss the difference between values but not the ratio, e.g., temperature
- **Ratio variables**: the ratio between values has intrinsic meaning, e.g., income, weight, or time
- **Distribution**: what values the variable takes and how often it takes those values
- **Mode**: the most commonly occurring value in a distribution (the value with the highest frequency)
- **Mean**: the average of a set
- **Median**: the midpoint; half of the observations are smaller and half are larger
- **Complement Rule**: P(A) = 1 - P(not A), or P(not A) = 1 - P(A); cue phrase: "at least"
- **Multiplication Rule**: P(A and B) = P(A) * P(B), for independent events A and B
- **The General Addition Rule**: for any two events A and B, P(A or B) = P(A) + P(B) - P(A and B)
- **Conditional Probability**: the probability of event B given A is P(B | A) = P(A and B) / P(A)
- **Check Independence**: compare P(B | A) and P(B | not A); A and B are independent if the two are equal
- **The General Multiplication Rule**: for any two events A and B, P(A and B) = P(A) * P(B | A)
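
The probability rules on the last few cards can be verified on a small example. The sketch below assumes one roll of a fair six-sided die; the events `A` and `B` and the helpers `prob` and `cond_prob` are illustrative choices, not part of the deck. Exact fractions are used so the equalities hold without rounding.

```python
from fractions import Fraction

# Assumed sample space: one roll of a fair six-sided die (equally likely outcomes).
OUTCOMES = frozenset(range(1, 7))


def prob(event):
    """P(event) for an event given as a set of outcomes."""
    return Fraction(len(event & OUTCOMES), len(OUTCOMES))


def cond_prob(b, a):
    """P(B | A) = P(A and B) / P(A); assumes P(A) > 0."""
    return prob(a & b) / prob(a)


A = frozenset({2, 4, 6})   # hypothetical event: the roll is even
B = frozenset({4, 5, 6})   # hypothetical event: the roll is greater than 3
not_A = OUTCOMES - A

# Complement rule: P(not A) = 1 - P(A)
complement_holds = prob(not_A) == 1 - prob(A)

# General addition rule: P(A or B) = P(A) + P(B) - P(A and B)
addition_holds = prob(A | B) == prob(A) + prob(B) - prob(A & B)

# General multiplication rule: P(A and B) = P(A) * P(B | A)
multiplication_holds = prob(A & B) == prob(A) * cond_prob(B, A)

# Independence check: compare P(B | A) with P(B | not A)
independent = cond_prob(B, A) == cond_prob(B, not_A)

print(complement_holds, addition_holds, multiplication_holds, independent)
```

Here P(B | A) and P(B | not A) differ, so A and B are not independent, and the simple multiplication rule P(A) * P(B) would give the wrong value for P(A and B); the general multiplication rule still holds.
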