Terms  Definitions 

inference 
the process of learning about a population by studying a sample

sample regression 
estimates the association between x and y in the entire population

regression line 
an estimate from a sample trying to describe the true regression line from the population

observational study 
a statistical study in which the subjects are not modified (just observed) so that researchers can measure and record certain characteristics

experiment (experimental study) 
A statistical study in which a "treatment" is applied to the subjects (i.e. they are modified) and researchers measure the effect of the treatment

lurking variable (confounding variable) 
other variables that may influence the response that are not studied

explanatory variable 
variable that explains or causes the differences in another variable ("x" or independent variable)

response variable 
variable which is thought to depend on the value of the explanatory variable, ("y", dependent variable)

study question 
the question about the population that the study is attempting to answer

population 
the complete set of all individuals/objects the study is attempting to answer a question about, the whole group of individuals we are interested in

study subjects 
the individuals actually measured in the study (i.e. the selected sample of individuals/objects from the population)

treatment 
what the researcher does/gives to some or all of the study subjects; the factor whose effect is under study; also called the explanatory variable

control group 
group of subjects that has the same sources of variability as the group receiving the treatment but does NOT receive the treatment; sometimes called the placebo group

confounding factor 
any factor other than the experimental treatment that can affect the response variable in the experiment

completely randomized design 
a design in which the treatments in the experiment are randomly assigned to the experimental units without using matched pairs or blocks

researchers 
people who make measurements

single blinding 
subject doesn't know if he/she is in the treatment or control group

double blinding 
neither RESEARCHERS nor SUBJECTS know where the participants are assigned between the control and treatment group

matched pair design 
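takes two measurements on each subject, or one measurement on each member of a pair of similar subjects, so that the two measurements can be compared directly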

blocking design
extension of completely randomized design: put similar subjects into blocks (the blocks are expected to differ with respect to the response variable), then do a completely randomized experiment within each block

block 
a group of subjects that are similar in some way

"blocks" refers to ... 
individuals

"experimental units" refers to... 
repeated time periods in which the blocks receive the varying treatments

scatter plot
used to compare two variables; both variables must be measured on a common individual (an individual can be a person, place, or even time) and are then plotted against each other

positive association 
this type of association occurs when the value of one variable tends to increase as the value of the other variable increases

negative association 
this type of association occurs when the value of one variable tends to decrease as the value of the other variable tends to increase

nonlinear association 
this type of association occurs when the relationship between two variables follows a curved pattern rather than a straight line

correlation 
a number that indicates the strength and the direction of a straight-line relationship between two quantitative variables

strength of correlation 
determined by the absolute value of the correlation, indicates the overall closeness of the points to a straight line

direction of the correlation 
determined by the sign of the correlation

magnitude of r 
absolute value of r, indicates the strength of the relationship

r = 1 or r = -1
indicates that there is a perfect linear relationship and all data points fall on a straight line

squared correlation, r²
this is the proportion of variation in the response variable that is explained by the explanatory variable. It is always between 0 and 1.

r (referring to a correlation)
correlation coefficient, used to measure linear relationship between x and y
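
As a rough illustration of how r and r² can be computed, here is a minimal Python sketch of the usual Pearson formula. The function name `pearson_r` and the hours/scores data are made up for this example.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of products and squares of deviations from the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: hours studied (x) and exam score (y)
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 70]

r = pearson_r(hours, scores)
r_squared = r ** 2  # proportion of variation in y explained by x
print(round(r, 3), round(r_squared, 3))
```

The sign of `r` gives the direction of the association and its absolute value gives the strength.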

the line of best fit
this estimates the average value of y when you know x; individuals' values will vary around the predicted value. It can be used to give a prediction of a value of y, given a specific value of x

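
A minimal sketch of fitting a line and using it for prediction, assuming "best fit" means the least-squares line (the usual choice in an introductory course). The function name `least_squares_line` and the data are hypothetical.

```python
def least_squares_line(x, y):
    """Return (slope, intercept) of the least-squares line y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x  # the line passes through (mean_x, mean_y)
    return slope, intercept

# Hypothetical data: hours studied (x) and exam score (y)
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 70]

slope, intercept = least_squares_line(hours, scores)
# Predicted average score for someone who studies 6 hours
predicted_at_6 = intercept + slope * 6
print(slope, intercept, predicted_at_6)
```

The prediction is an estimate of the average y at that x; an individual's actual value will vary around it.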
randomization test 
a test on two groups when paired data is NOT available
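
One common way to carry out a randomization test is to shuffle the group labels many times and see how often the shuffled difference in means is at least as large as the observed one. The sketch below assumes that approach; the function name, the number of shuffles, and the data are illustrative choices only.

```python
import random

def randomization_test(group_a, group_b, n_shuffles=10_000, seed=0):
    """Approximate two-sided p-value for a difference in group means."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    n_a, n_b = len(group_a), len(group_b)
    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)
    pooled = list(group_a) + list(group_b)
    extreme = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)  # reassign values to the two groups at random
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed - 1e-12:  # small tolerance for rounding error
            extreme += 1
    return extreme / n_shuffles

# Hypothetical data: two clearly separated groups
p_value = randomization_test([1, 2, 3, 4, 5], [11, 12, 13, 14, 15])
print(p_value)
```

A small result means a difference this large would rarely arise from random assignment alone.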

sampling frame 
the list of individuals from which the sample is actually drawn; ideally it includes all individuals in the population

in hypothesis testing, population parameter = 
null value

null hypothesis
the statement being tested: a statement that describes some aspect of the statistical behavior of a set of data and is treated as valid unless the actual behavior of the data contradicts it

null value
the specific number the parameter equals if the null hypothesis is true; the value of the population parameter being tested in the null hypothesis

alternative hypothesis
a statement that something is happening; it may be a statement that the assumed status quo is false, that there is a relationship, or that there is a difference. This is the statement researchers hope to find evidence for.

two types of alternative hypothesis 
one-sided test, two-sided test

one-sided test
when Ha specifies a single direction

two-sided test
when Ha includes values in both directions

p-value
the probability of obtaining a test statistic at least as extreme as the one that was actually observed, assuming Ho is true

level of significance 
(α) is the borderline for deciding that the p-value is low enough to justify rejecting the null hypothesis in favor of the alternative hypothesis

hypothesis testing about paired differences 
matched pairs design

paired t-test
a one-sample t-test used on the sample of differences to examine whether the sample mean difference is significantly different from 0
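
A minimal sketch of the test statistic for a paired t-test: take the differences, then divide their mean by their standard error. The function name and the before/after data are made up; turning the statistic into a p-value needs a t distribution table or a statistics library, which is left out here.

```python
import math
import statistics

def paired_t_statistic(first, second):
    """t statistic for testing whether the mean paired difference is 0."""
    diffs = [b - a for a, b in zip(first, second)]  # one difference per subject
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    standard_error = statistics.stdev(diffs) / math.sqrt(n)
    return mean_diff / standard_error

# Hypothetical measurements on the same five subjects
before = [10, 12, 9, 11, 13]
after = [12, 14, 10, 13, 14]

t = paired_t_statistic(before, after)
print(round(t, 3))  # compared against a t distribution with n - 1 degrees of freedom
```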

sampling distribution
describes the possible values the statistic might have when random samples are taken from a population; the distribution of a statistic ("x-bar" or "p-hat") over all possible samples of a given size (n) from the same population

statistical inference 
gives us methods for drawing conclusions about a population based on data from samples

confidence interval 
an interval of values computed from sample data that is likely to include the true population parameter

standard error 
is the estimated standard deviation of the sampling distribution of the statistic

confidence level 
proportion of samples for which the confidence interval will capture the true parameter; the percentage of the time we expect the procedure to work; it determines how frequently intervals computed this way contain the parameter

standard error of sample mean 
s/√n, where (s) is the sample standard deviation and n is the sample size
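
The standard error of the sample mean is commonly estimated as s/√n. A minimal sketch under that assumption, with a made-up sample and a hypothetical function name:

```python
import math
import statistics

def standard_error_of_mean(sample):
    """Estimated standard deviation of the sample mean: s / sqrt(n)."""
    s = statistics.stdev(sample)  # sample standard deviation (divides by n - 1)
    return s / math.sqrt(len(sample))

# Hypothetical sample of five measurements
se = standard_error_of_mean([4, 8, 6, 5, 7])
print(round(se, 4))
```

Larger samples give a smaller standard error, since n sits in the denominator.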

statistic 
a number that summarizes some characteristic of the sample data; it is computed from the sample values, so it is a known value that varies from sample to sample

sampling distribution of a statistic
the distribution of possible values of the statistic for repeated samples of the same size taken from the same population

mean of a sampling distribution 
the average of all possible values of the statistic for repeated samples of the same size from a population

the standard deviation (SD) of a sampling distribution
measures the average distance of the possible values of the statistic from the mean of the sampling distribution, roughly speaking

there is a difference between N and n! N = ? n = ?
n = sample size (number of values in one sample/subgroup); N = number of samples (number of subgroups)

Law of Large Numbers (LLN) 
as you average more observations, sample mean settles down at population mean
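
The Law of Large Numbers can be seen in a small simulation. This sketch assumes a fair six-sided die (population mean 3.5) and uses a fixed seed so the run is reproducible; the function name is hypothetical.

```python
import random

def mean_of_die_rolls(n_rolls, seed=0):
    """Average of n_rolls simulated rolls of a fair six-sided die."""
    rng = random.Random(seed)
    return sum(rng.randint(1, 6) for _ in range(n_rolls)) / n_rolls

# Sample means for increasing numbers of observations
means = {n: mean_of_die_rolls(n) for n in (10, 1_000, 100_000)}
for n, m in means.items():
    print(n, round(m, 3))  # should drift toward 3.5 as n grows
```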

graphs used for categorical variables
1. pie chart
2. bar graph

graphic representations for quantitative variables
1. histogram
2. stem-and-leaf plot
3. box plot

standard deviation 
a value that measures the variability (spread) of data.

density curve
the outline of the histogram which approximates the overall pattern of a distribution
1. It is always on or above the horizontal axis
2. It has area of exactly 1 underneath it

standard normal distribution
this is a normal distribution with a mean of 0 and a standard deviation of 1; all other normal distributions are compared to this

z-score
(a standardized value) that is the distance between a specified value and the mean, measured in number of standard deviations
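
A minimal sketch of standardizing a value. The function name and the numbers (a score of 85 when the mean is 70 and the standard deviation is 10) are made up for illustration.

```python
def z_score(value, mean, sd):
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    return (value - mean) / sd

z = z_score(85, 70, 10)
print(z)  # 1.5: the value is one and a half standard deviations above the mean
```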

observation (individual) 
an individual or the value of a single measurement

variable 
a characteristic that can differ from one individual to the next

categorical variables 
the observational units are divided into categories; there is no special ordering of the categories

ordinal variables
the observational units are divided into categories which have an order; basically a categorical variable with ordered categories

quantitative variables
variables that take numerical values; you should be able to do mathematical operations with these numbers, such as adding or multiplying (a social security number would not be one of these)

graphs for quantitative variables
1. Histogram
2. Stem-and-Leaf Plot
3. Dot Plot

Pie Chart 
each slice of a pie corresponds to a category and the size of the angle of the slice shows the percentage of the individuals in the corresponding category

Bar Graph
each category is presented as a bar; the height of the bar represents the number (or percentage) of individuals in the corresponding category

range
highest value minus the lowest value

histogram 
a bar graph for a quantitative variable: the range of possible values is broken into intervals, and each bar shows how many values fall in that interval

frequency 
actual number of individuals who fall into each interval (of a histogram)

relative frequency 
proportion or percentage that are in an interval (of a histogram)
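
To make frequency and relative frequency concrete, the sketch below groups values into equal-width intervals, as a histogram would. The function name, the interval width of 10, and the data are illustrative assumptions.

```python
from collections import Counter

def frequency_table(values, width):
    """Map each interval's lower bound to (frequency, relative frequency)."""
    # (v // width) * width is the lower bound of the interval containing v
    counts = Counter((v // width) * width for v in values)
    total = len(values)
    return {start: (count, count / total) for start, count in sorted(counts.items())}

# Hypothetical data set of ten values
table = frequency_table([3, 7, 12, 15, 18, 21, 25, 29, 34, 38], width=10)
for start, (freq, rel) in table.items():
    print(f"{start}-{start + 9}: frequency={freq}, relative frequency={rel}")
```

The relative frequencies always add up to 1 (or 100%).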

stem and leaf plot 
every individual data value is shown

dot plot 
display a dot for each observation along a number line

distribution 
the overall pattern of how often the possible values occur

shape of a distribution 
describes how the values are spread across the distribution, e.g. symmetric, skewed, or with one or more peaks

center 
location, average, mean and median measure this

outlier
an unusual value that does not fit with the rest of the pattern (may be due to a data entry error or may be an actual unusual value)

symmetric distribution
one half of the distribution is the mirror image of the other (e.g. a bell shape)

bimodal distributions 
has two peaks which can be caused by two or more groups of values in the sample

multimodal distribution 
distribution with several peaks

median 
the middle number of the data when it is ordered, 50% of the data is above it and 50% of the data is below it

two measures of the center 
mean and median

symmetric distribution (mean ? median) 
mean = median

right skewed distribution (mean ? median)
mean > median (mean is greater than median)

left skewed distribution (mean ? median)
mean < median (mean is less than median)

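
A quick check of how skew pulls the mean away from the median, using a small made-up data set with one large value on the right:

```python
import statistics

# Hypothetical right-skewed data: most values are small, one is large
right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]

mean_value = statistics.mean(right_skewed)
median_value = statistics.median(right_skewed)
print(mean_value, median_value)  # the large value drags the mean above the median
```

The median barely reacts to the extreme value, which is why it is often preferred for skewed data.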
First Quartile (Q1) 
25% of the data is at or below this number

Third Quartile (Q3) 
75% of the data is at or below this number

InterQuartile Range (IQR) 
A value describing the spread over approximately the middle 50% of the data

the five number summary includes
1) minimum
2) Q1
3) median
4) Q3
5) maximum

boxplot 
a graphical representation of the 5 number summary

1.5*IQR Rule
an outlier is any value that lies more than 1.5 times the IQR (the length of the box) below Q1 or above Q3
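
A minimal sketch of flagging outliers with this rule. Quartile conventions differ between textbooks and software; this example assumes the default method of Python's `statistics.quantiles`, and the function name and data are made up.

```python
import statistics

def iqr_outliers(values):
    """Values more than 1.5 * IQR below Q1 or above Q3."""
    q1, _median, q3 = statistics.quantiles(values, n=4)  # three quartile cut points
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [v for v in values if v < low_fence or v > high_fence]

# Hypothetical data with one suspiciously large value
outliers = iqr_outliers([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 50])
print(outliers)
```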

variance 
measures the spread of the data as the average squared distance of the values from the mean
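
A step-by-step sketch of the sample variance, assuming the usual n - 1 divisor for sample data; the data set is made up. The standard deviation is the square root of the variance.

```python
import statistics

data = [4, 8, 6, 5, 7]  # hypothetical sample

mean_value = statistics.mean(data)
# Squared distance of each value from the mean
squared_deviations = [(v - mean_value) ** 2 for v in data]
sample_variance = sum(squared_deviations) / (len(data) - 1)
sample_sd = sample_variance ** 0.5
print(sample_variance, round(sample_sd, 4))
```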

strata 
sub groups of population which might have different responses to the question of interest

stratified sample 
is a collection of samples taken in each stratum of the population

cluster samples 
sampling technique used when natural groups are evident in a statistical population

systematic samples 
select every kth individual from the sampling frame
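
A minimal sketch of systematic sampling, assuming a random starting point among the first k individuals (a common convention). The function name and the 100-person frame are hypothetical.

```python
import random

def systematic_sample(frame, k, seed=None):
    """Pick a random start among the first k individuals, then take every kth one."""
    rng = random.Random(seed)
    start = rng.randrange(k)  # index from 0 to k - 1
    return frame[start::k]

# Hypothetical sampling frame of 100 individuals
frame = [f"person_{i}" for i in range(1, 101)]
sample = systematic_sample(frame, k=10, seed=1)
print(sample)
```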

undercoverage
sampling frame does not include all of the population

overcoverage
sampling frame includes individuals who are not in the population being examined

data entry errors 
person recording the data makes mistakes

question wording error 
the set up of the question can have a big influence on the answers

definition of statistics 
a collection of procedures and principles for gathering data and analyzing information to help people make decisions when faced with uncertainty

individuals
the objects described by the data set (each student in the class is an observational unit or individual)

variables
characteristics of the individuals (max speed, sex of the students, height, time of sleep)

sample 
subgroup of the population examined to measure the variables and gather information

parameter 
a number that describes a characteristic of the population. It is mostly a summary of a population. Its value is unknown.

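statistic
a number that describes a characteristic of the sample; it is computed from the sample values, so its value is known but varies from sample to sample
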
census 
taken to measure ALL individuals in the population

selection bias 
this method of selection of participants favors a particular outcome

non response bias 
some part of the individuals in the sample cannot be reached or do not respond, this creates a bias because respondents may differ in meaningful ways from nonrespondents.

response bias 
participants give incorrect information

response rate 
the proportion of the sample that responded to the question

nonresponse rate 
the proportion of the sample that didn't respond to the question

convenience samples 
investigators choose individuals that are easy to reach

volunteer response samples 
individuals decide whether to answer the questions or not

simple random sample 
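a sample of size n chosen so that every possible group of n individuals from the population has the same chance of being selected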
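
A simple random sample is usually understood as one in which every group of n individuals is equally likely to be chosen. A minimal sketch using the standard library's `random.sample`, with a hypothetical frame and function name:

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Draw n distinct individuals, each subset of size n being equally likely."""
    rng = random.Random(seed)
    return rng.sample(frame, n)  # sampling without replacement

# Hypothetical sampling frame of 50 individuals
frame = [f"student_{i}" for i in range(1, 51)]
chosen = simple_random_sample(frame, n=5, seed=3)
print(chosen)
```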

statistical significance 
a result is unlikely to have occurred just by chance

practical significance 
the difference from the claimed value we observe is actually meaningful

numbers in "stem" column of stem and leaf plot
first digit of each number in the data set

numbers in "leaf" column of stem and leaf plot
contains only the last digit of the number, regardless of whether it falls before or after the decimal point

