# The t-Test

## The t-Test

A t-test is any statistical hypothesis test in which the test statistic follows a Student's t-distribution if the null hypothesis is supported.

### Learning Objectives

Outline the appropriate uses of t-tests and the role of Student's t-distribution in them

### Key Takeaways

#### Key Points

• The t-statistic was introduced in 1908 by William Sealy Gosset, a chemist working for the Guinness brewery in Dublin, Ireland.
• The t-test can be used to determine if two sets of data are significantly different from each other.
• The t-test is most commonly applied when the test statistic would follow a normal distribution if the value of a scaling term in the test statistic were known.

#### Key Terms

• t-test: Any statistical hypothesis test in which the test statistic follows a Student's t-distribution if the null hypothesis is supported.
• Student's t-distribution: A family of continuous probability distributions that arises when estimating the mean of a normally distributed population in situations where the sample size is small and population standard deviation is unknown.

A t-test is any statistical hypothesis test in which the test statistic follows a Student's t-distribution if the null hypothesis is supported. It can be used to determine if two sets of data are significantly different from each other, and is most commonly applied when the test statistic would follow a normal distribution if the value of a scaling term in the test statistic were known. When the scaling term is unknown and is replaced by an estimate based on the data, the test statistic (under certain conditions) follows a Student's t-distribution.

### History

The t-statistic was introduced in 1908 by William Sealy Gosset, a chemist working for the Guinness brewery in Dublin, Ireland. Gosset had been hired due to Claude Guinness's policy of recruiting the best graduates from Oxford and Cambridge to apply biochemistry and statistics to Guinness's industrial processes. Gosset devised the t-test as a cheap way to monitor the quality of stout. The t-test work was submitted to and accepted in Biometrika, the journal that Karl Pearson had co-founded and for which he served as Editor-in-Chief. The company allowed Gosset to publish his mathematical work, but only if he used a pseudonym (he chose "Student"). Gosset left Guinness on study leave during the first two terms of the 1906-1907 academic year to study in Professor Karl Pearson's Biometric Laboratory at University College London. Gosset's work on the t-test was published in Biometrika in 1908.

William Sealy Gosset: Writing under the pseudonym "Student", Gosset published his work on the t-test in 1908.

### Uses

Among the most frequently used t-tests are:

• A one-sample location test of whether the mean of a normally distributed population has a value specified in a null hypothesis.
• A two-sample location test of a null hypothesis that the means of two normally distributed populations are equal. All such tests are usually called Student's t-tests, though strictly speaking that name should only be used if the variances of the two populations are also assumed to be equal. The form of the test used when this assumption is dropped is sometimes called Welch's t-test. These tests are often referred to as "unpaired" or "independent samples" t-tests, as they are typically applied when the statistical units underlying the two samples being compared are non-overlapping.
• A test of a null hypothesis that the difference between two responses measured on the same statistical unit has a mean value of zero. For example, suppose we measure the size of a cancer patient's tumor before and after a treatment. If the treatment is effective, we expect the tumor size for many of the patients to be smaller following the treatment. This is often referred to as the "paired" or "repeated measures" t-test.
• A test of whether the slope of a regression line differs significantly from 0.

## The t-Distribution

Student's
$\text{t}$
-distribution arises in estimation problems where the goal is to estimate an unknown parameter when the data are observed with additive errors.

### Learning Objectives

Calculate the Student's
$\text{t}$
-distribution

### Key Takeaways

#### Key Points

• Student's
$\text{t}$
-distribution (or simply the
$\text{t}$
-distribution) is a family of continuous probability distributions that arises when estimating the mean of a normally distributed population in situations where the sample size is small and population standard deviation is unknown.
• The
$\text{t}$
-distribution (for a sample of size
$\text{n}$
) can be defined as the distribution of the location of the true mean, relative to the sample mean and divided by the sample standard deviation, after multiplying by the normalizing term.
• The
$\text{t}$
-distribution with
$\text{n}-1$
degrees of freedom is the sampling distribution of the
$\text{t}$
-value when the samples consist of independent identically distributed observations from a normally distributed population.
• As the number of degrees of freedom grows, the
$\text{t}$
-distribution approaches the normal distribution with mean
$0$
and variance
$1$
.

#### Key Terms

• confidence interval: A type of interval estimate of a population parameter used to indicate the reliability of an estimate.
• Student's t-distribution: A family of continuous probability distributions that arises when estimating the mean of a normally distributed population in situations where the sample size is small and population standard deviation is unknown.
• chi-squared distribution: A distribution with
$\text{k}$
degrees of freedom is the distribution of a sum of the squares of
$\text{k}$
independent standard normal random variables.

Student's
$\text{t}$
-distribution (or simply the
$\text{t}$
-distribution) is a family of continuous probability distributions that arises when estimating the mean of a normally distributed population in situations where the sample size is small and population standard deviation is unknown. It plays a role in a number of widely used statistical analyses, including the Student's
$\text{t}$
-test for assessing the statistical significance of the difference between two sample means, the construction of confidence intervals for the difference between two population means, and in linear regression analysis.

If we take a sample of
$\text{n}$
observations from a normal distribution with fixed unknown mean and variance, and if we compute the sample mean and sample variance of these
$\text{n}$
observations, then the
$\text{t}$
-distribution (for
$\text{n}$
) can be defined as the distribution of the location of the true mean, relative to the sample mean and divided by the sample standard deviation, after multiplying by the normalizing term
$\sqrt { \text{n} }$
. In this way, the
$\text{t}$
-distribution can be used to estimate how likely it is that the true mean lies in any given range.

The
$\text{t}$
-distribution with
$\text{n} - 1$
degrees of freedom is the sampling distribution of the
$\text{t}$
-value when the samples consist of independent identically distributed observations from a normally distributed population. Thus, for inference purposes,
$\text{t}$
is a useful "pivotal quantity" in the case when the mean and variance (
$\mu$
,
$\sigma^2$
) are unknown population parameters, in the sense that the
$\text{t}$
-value has then a probability distribution that depends on neither
$\mu$
nor
$\sigma^2$
.

### History

The
$\text{t}$
-distribution was first derived as a posterior distribution in 1876 by Helmert and Lüroth. In the English-language literature it takes its name from William Sealy Gosset's 1908 paper in Biometrika under the pseudonym "Student." Gosset worked at the Guinness Brewery in Dublin, Ireland, and was interested in the problems of small samples, for example of the chemical properties of barley, where sample sizes might be as small as three. Gosset's paper refers to the distribution as the "frequency distribution of standard deviations of samples drawn from a normal population." It became well known through the work of Ronald A. Fisher, who called the distribution "Student's distribution" and referred to the value as
$\text{t}$
.

### Distribution of a Test Statistic

Student's
$\text{t}$
-distribution with
$\nu$
degrees of freedom can be defined as the distribution of the random variable
$\text{T}$
:

$\text{T}=\dfrac{\text{Z}}{\sqrt{\text{V}/ \nu}} = \text{Z} \sqrt{\dfrac{\nu}{\text{V}}}$

where:

• $\text{Z}$
is normally distributed with expected value
$0$
and variance
$1$
• $\text{V}$
has a chi-squared distribution with
$\nu$
degrees of freedom
• $\text{Z}$
and
$\text{V}$
are independent
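This construction can be checked directly by simulation. In the sketch below (the degrees of freedom, sample size, and seed are illustrative choices of mine, not from the text), $\text{Z}$ is drawn from a standard normal, $\text{V}$ is built as a sum of $\nu$ squared standard normals (hence chi-squared with $\nu$ degrees of freedom), and $\text{T}=\text{Z}/\sqrt{\text{V}/\nu}$:

```python
import random
import statistics

random.seed(0)   # reproducible draws

nu = 10          # degrees of freedom (illustrative choice)
n_draws = 20000  # number of simulated T values

t_values = []
for _ in range(n_draws):
    z = random.gauss(0.0, 1.0)                               # Z ~ N(0, 1)
    v = sum(random.gauss(0.0, 1.0) ** 2 for _ in range(nu))  # V ~ chi-squared(nu)
    t_values.append(z / (v / nu) ** 0.5)                     # T = Z / sqrt(V / nu)

# For nu > 2, Student's t-distribution has mean 0 and variance nu / (nu - 2).
print(statistics.mean(t_values))      # close to 0
print(statistics.variance(t_values))  # close to 10 / 8 = 1.25
```

With a large number of draws, the empirical mean and variance land close to the theoretical values $0$ and $\nu/(\nu-2)$.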

A different distribution is defined as that of the random variable defined, for a given constant
$\mu$
, by:

$\left( \text{Z}+\mu \right) \sqrt { \dfrac { \nu }{ \text{V} } }$

This random variable has a noncentral
$\text{t}$
-distribution with noncentrality parameter
$\mu$
. This distribution is important in studies of the power of Student's
$\text{t}$
-test.

### Shape

The probability density function is symmetric; its overall shape resembles the bell shape of a normally distributed variable with mean
$0$
and variance
$1$
, except that it is a bit lower and wider. In more technical terms, it has heavier tails, meaning that it is more prone to producing values that fall far from its mean. This makes it useful for understanding the statistical behavior of certain types of ratios of random quantities, in which variation in the denominator is amplified and may produce outlying values when the denominator of the ratio falls close to zero. As the number of degrees of freedom grows, the
$\text{t}$
-distribution approaches the normal distribution with mean
$0$
and variance
$1$
.
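The heavier tails and the convergence to the normal can be made concrete with the closed-form density, $f(t) = \frac{\Gamma((\nu+1)/2)}{\sqrt{\nu\pi}\,\Gamma(\nu/2)}\left(1+t^2/\nu\right)^{-(\nu+1)/2}$. A minimal sketch (the evaluation points are arbitrary):

```python
import math

def t_pdf(x, nu):
    """Density of Student's t-distribution with nu degrees of freedom."""
    coef = math.gamma((nu + 1) / 2) / (math.sqrt(nu * math.pi) * math.gamma(nu / 2))
    return coef * (1 + x * x / nu) ** (-(nu + 1) / 2)

def normal_pdf(x):
    """Density of the standard normal distribution."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

# Heavier tails: at x = 3 the t-density with 3 df exceeds the normal density.
print(t_pdf(3, 3), normal_pdf(3))

# Convergence: with 30 df the t-density at 0 is already close to the normal's.
print(t_pdf(0, 30), normal_pdf(0))
```

At $x=3$ the $\text{t}$-density with $3$ degrees of freedom is several times the normal density, while at $30$ degrees of freedom the two densities nearly coincide.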

Shape of the $\text{t}$-Distribution: These images show the density of the
$\text{t}$
-distribution (red) for increasing values of
$\nu$
(1, 2, 3, 5, 10, and 30 degrees of freedom). The normal distribution is shown as a blue line for comparison. Previous plots are shown in green. Note that the
$\text{t}$
-distribution becomes closer to the normal distribution as
$\nu$
increases.

### Uses

Student's
$\text{t}$
-distribution arises in a variety of statistical estimation problems where the goal is to estimate an unknown parameter, such as a mean value, in a setting where the data are observed with additive errors. If (as in nearly all practical statistical work) the population standard deviation of these errors is unknown and has to be estimated from the data, the
$\text{t}$
-distribution is often used to account for the extra uncertainty that results from this estimation. In most such problems, if the standard deviation of the errors were known, a normal distribution would be used instead of the
$\text{t}$
-distribution.

Confidence intervals and hypothesis tests are two statistical procedures in which the quantiles of the sampling distribution of a particular statistic (e.g., the standard score) are required. In any situation where this statistic is a linear function of the data, divided by the usual estimate of the standard deviation, the resulting quantity can be rescaled and centered to follow Student's
$\text{t}$
-distribution. Statistical analyses involving means, weighted means, and regression coefficients all lead to statistics having this form.
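As a sketch of the confidence-interval case, a 95% interval for a mean has the form $\bar{\text{x}} \pm \text{t}^* \cdot \text{s}/\sqrt{\text{n}}$, where $\text{t}^*$ is the appropriate quantile of the $\text{t}$-distribution with $\text{n}-1$ degrees of freedom. The summary numbers below are illustrative, and the value $2.093$ is the two-sided 95% critical value for $19$ degrees of freedom read from a t table:

```python
import math

x_bar = 50.2    # sample mean (illustrative)
s = 2.5         # sample standard deviation (illustrative)
n = 20          # sample size, so n - 1 = 19 degrees of freedom
t_star = 2.093  # two-sided 95% critical value for t(19), from a t table

margin = t_star * s / math.sqrt(n)  # t* times the standard error of the mean
lo, hi = x_bar - margin, x_bar + margin
print(round(lo, 2), round(hi, 2))   # 95% CI for the population mean
```

Using a normal quantile ($1.96$) instead of $\text{t}^*$ would give a slightly narrower interval, understating the extra uncertainty from estimating the standard deviation.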

A number of statistics can be shown to have
$\text{t}$
-distributions for samples of moderate size under null hypotheses that are of interest, so that the
$\text{t}$
-distribution forms the basis for significance tests. For example, the distribution of Spearman's rank correlation coefficient
$\rho$
, in the null case (zero correlation) is well approximated by the
$\text{t}$
-distribution for sample sizes above about
$20$
.

## Assumptions

Assumptions of a
$\text{t}$
-test depend on the population being studied and on how the data are sampled.

### Learning Objectives

Explain the underlying assumptions of a
$\text{t}$
-test

### Key Takeaways

#### Key Points

• Most
$\text{t}$
-test statistics have the form
$\text{t}=\frac{\text{Z}}{\text{s}}$
, where
$\text{Z}$
and
$\text{s}$
are functions of the data.
• Typically,
$\text{Z}$
is designed to be sensitive to the alternative hypothesis (i.e., its magnitude tends to be larger when the alternative hypothesis is true), whereas
$\text{s}$
is a scaling parameter that allows the distribution of
$\text{t}$
to be determined.
• The assumptions underlying a
$\text{t}$
-test are that:
$\text{Z}$
follows a standard normal distribution under the null hypothesis, and
$\text{s}^2$
follows a
$\chi^2$
distribution with
$\text{p}$
degrees of freedom under the null hypothesis, where
$\text{p}$
is a positive constant.
• $\text{Z}$
and
$\text{s}$
are independent.

#### Key Terms

• alternative hypothesis: a rival hypothesis to the null hypothesis, whose likelihoods are compared by a statistical hypothesis test
• t-test: Any statistical hypothesis test in which the test statistic follows a Student's
$t$
-distribution if the null hypothesis is supported.
• scaling parameter: A special kind of numerical parameter of a parametric family of probability distributions; the larger the scale parameter, the more spread out the distribution.

Most
$\text{t}$
-test statistics have the form
$\text{t}=\frac{\text{Z}}{\text{s}}$
, where
$\text{Z}$
and
$\text{s}$
are functions of the data. Typically,
$\text{Z}$
is designed to be sensitive to the alternative hypothesis (i.e., its magnitude tends to be larger when the alternative hypothesis is true), whereas
$\text{s}$
is a scaling parameter that allows the distribution of
$\text{t}$
to be determined.

As an example, in the one-sample
$\text{t}$
-test:

$\text{Z}=\dfrac{\bar{\text{X}}}{(\sigma/\sqrt{\text{n}})}$

where
$\bar { \text{X} }$
is the sample mean of the data,
$\text{n}$
is the sample size, and
$\sigma$
is the population standard deviation of the data;
$\text{s}$
in the one-sample
$\text{t}$
-test is
$\hat { \sigma } / \sigma$
, where
$\hat { \sigma }$
is the sample standard deviation, so that
$\text{t} = \text{Z}/\text{s} = \dfrac{\bar{\text{X}}}{(\hat{\sigma}/\sqrt{\text{n}})}$
.

The assumptions underlying a
$\text{t}$
-test are that:

• $\text{Z}$
follows a standard normal distribution under the null hypothesis.
• $\text{s}^2$
follows a
$\chi^2$
distribution with
$\text{p}$
degrees of freedom under the null hypothesis, where
$\text{p}$
is a positive constant.
• $\text{Z}$
and
$\text{s}$
are independent.

In a specific type of
$\text{t}$
-test, these conditions are consequences of the population being studied, and of the way in which the data are sampled. For example, in the
$\text{t}$
-test comparing the means of two independent samples, the following assumptions should be met:

• Each of the two populations being compared should follow a normal distribution. This can be tested using a normality test, or it can be assessed graphically using a normal quantile plot.
• If using Student's original definition of the
$\text{t}$
-test, the two populations being compared should have the same variance (testable using the
$\text{F}$
-test or assessable graphically using a Q-Q plot). If the sample sizes in the two groups being compared are equal, Student's original
$\text{t}$
-test is highly robust to the presence of unequal variances. Welch's
$\text{t}$
-test is insensitive to equality of the variances regardless of whether the sample sizes are similar.
• The data used to carry out the test should be sampled independently from the two populations being compared. This is, in general, not testable from the data, but if the data are known to be dependently sampled (i.e., if they were sampled in clusters), then the classical
$\text{t}$
-tests discussed here may give misleading results.

## t-Test for One Sample

The
$\text{t}$
-test is the most powerful parametric test for calculating the significance of a small sample mean.

### Learning Objectives

Derive the degrees of freedom for a t-test

### Key Takeaways

#### Key Points

• A one-sample
$\text{t}$
-test has the null hypothesis, or
$\text{H}_0$
, of
$\mu = \text{c}$
.
• The
$\text{t}$
-test is the small-sample analog of the
$\text{z}$
-test, which is suitable for large samples.
• For a
$\text{t}$
-test the degrees of freedom of the single mean is
$\text{n}-1$
because only one population parameter (the population mean) is being estimated by a sample statistic (the sample mean).

#### Key Terms

• t-test: Any statistical hypothesis test in which the test statistic follows a Student's
$t$
-distribution if the null hypothesis is supported.
• degrees of freedom: any unrestricted variable in a frequency distribution

The
$\text{t}$
-test is the most powerful parametric test for calculating the significance of a small sample mean. A one-sample
$\text{t}$
-test has the null hypothesis, or
$\text{H}_0$
, that the population mean equals the hypothesized value. Expressed formally:

$\text{H}_0: \, \mu = \text{c}$

where the Greek letter
$\mu$
represents the population mean and
$\text{c}$
represents its assumed (hypothesized) value. The
$\text{t}$
-test is the small-sample analog of the
$\text{z}$
-test, which is suitable for large samples. A small sample is generally regarded as one of size
$\text{n} < 30$
.

In order to perform a
$\text{t}$
-test, one first has to calculate the degrees of freedom. This quantity takes into account the sample size and the number of parameters that are being estimated. Here, the population parameter
$\mu$
is being estimated by the sample statistic
$\bar { \text{X} }$
, the mean of the sample data. For a
$\text{t}$
-test the degrees of freedom of the single mean is
$\text{n}-1$
. This is because only one population parameter (the population mean) is being estimated by a sample statistic (the sample mean).

### Example

A college professor wants to compare her students' scores with the national average. She chooses a simple random sample of
$20$
students who score an average of
$50.2$
on a standardized test. Their scores have a standard deviation of
$2.5$
The national average on the test is
$60$
. She wants to know if her students scored significantly lower than the national average.

1. First, state the problem in terms of a distribution and identify the parameters of interest. Mention the sample. We will assume that the scores (
$\text{X}$
) of the students in the professor's class are approximately normally distributed with unknown parameters
$\mu$
and
$\sigma$
.

2. State the hypotheses in symbols and words:

${ \text{H} }_{ 0 }:\, \mu =60$

i.e.: The null hypothesis is that her students scored on par with the national average.

${ \text{H} }_{ \text{A} }:\, \mu <60$

i.e.: The alternative hypothesis is that her students scored lower than the national average.

3. Identify the appropriate test to use. Since we have a simple random sample of small size and do not know the standard deviation of the population, we will use a one-sample
$\text{t}$
-test. The formula for the
$\text{t}$
-statistic
$\text{T}$
for a one-sample test is as follows:

$\text{T}=\dfrac { \bar { \text{X} } -60 }{ \text{S}/\sqrt { 20 } }$

where
$\bar { \text{X} }$
is the sample mean and
$\text{S}$
is the sample standard deviation. The standard deviation of the sample divided by the square root of the sample size is known as the "standard error" of the sample.

4. State the distribution of the test statistic under the null hypothesis. Under
$\text{H}_0$
the statistic
$\text{T}$
will follow a Student's distribution with
$19$
degrees of freedom:
$\text{T}\sim \text{t}(20-1)$
.

5. Compute the observed value
$\text{t}$
of the test statistic
$\text{T}$
, by entering the values, as follows:

$\text{t}=\dfrac { \bar { \text{x} } -60 }{ s/\sqrt { 20 } } =\dfrac { 50.2-60 }{ 2.5/\sqrt { 20 } } =\dfrac { -9.8 }{ 0.559 } =-17.5$

6. Determine the so-called
$\text{p}$
-value of the value
$\text{t}$
of the test statistic
$\text{T}$
. We will reject the null hypothesis for too-small values of
$\text{T}$
, so we compute the left
$\text{p}$
-value:

$\text{p} = \text{P}\left( \text{T}\le t;{ \text{H} }_{ 0 } \right) =\text{P}\left( \text{T}\left( 19 \right) \le -17.5 \right) \approx 0$

A table of Student's distribution gives a critical value of
$1.729$
at probability
$0.95$
and degrees of freedom
$19$
. The observed value
$-17.5$
lies far beyond this critical value, so the
$\text{p}$
-value is effectively
$0$
.

7. Lastly, interpret the results in the context of the problem. The
$\text{p}$
-value indicates that the results almost certainly did not happen by chance and we have sufficient evidence to reject the null hypothesis. This is to say, the professor's students did score significantly lower than the national average.
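Steps 5 and 6 of the worked example can be reproduced in a few lines. This is a minimal sketch using the numbers stated in the example:

```python
import math

x_bar = 50.2  # sample mean from the example
mu_0 = 60     # hypothesized population mean (national average)
s = 2.5       # sample standard deviation
n = 20        # sample size

se = s / math.sqrt(n)    # standard error of the sample mean
t = (x_bar - mu_0) / se  # observed value of the test statistic T
df = n - 1               # degrees of freedom

print(round(se, 3), round(t, 1), df)
```

The standard error comes out near $0.559$ and the statistic near $-17.5$, matching the computation above; any t table confirms that such a value with $19$ degrees of freedom gives a p-value indistinguishable from $0$.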

## t-Test for Two Samples: Independent and Overlapping

Two-sample t-tests for a difference in mean involve independent samples, paired samples, and overlapping samples.

### Learning Objectives

Contrast paired and unpaired samples in a two-sample t-test

### Key Takeaways

#### Key Points

• Under the null hypothesis, the observed t-statistic is equal to the difference between the two sample means divided by the standard error of the difference between the sample means.
• The independent samples t-test is used when two separate sets of independent and identically distributed samples are obtained—one from each of the two populations being compared.
• An overlapping samples t-test is used when there are paired samples with data missing in one or the other samples.

#### Key Terms

• blocking: A schedule for conducting treatment combinations in an experimental study such that any effects on the experimental results due to a known change in raw materials, operators, machines, etc., become concentrated in the levels of the blocking variable.
• null hypothesis: A hypothesis set up to be refuted in order to support an alternative hypothesis; presumed true until statistical evidence in the form of a hypothesis test indicates otherwise.

The two-sample t-test is used to compare the means of two independent samples. Under the null hypothesis, the observed t-statistic is equal to the difference between the two sample means divided by the standard error of the difference between the sample means. If the two population variances can be assumed equal, the standard error of the difference is estimated from the pooled (weighted) variance about the means. If the variances cannot be assumed equal, then the standard error of the difference between means is taken as the square root of the sum of the individual variances divided by their respective sample sizes. In the latter case the estimated t-statistic must either be tested with modified degrees of freedom, or it can be tested against different critical values. A weighted t-test must be used if the unit of analysis comprises percentages or means based on different sample sizes.
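The two estimates of the standard error described above can be sketched side by side. The data lists are made up for illustration; with equal group sizes the two statistics coincide, but the degrees of freedom differ:

```python
import math
import statistics

a = [1, 2, 3, 4]  # sample from the first population (illustrative)
b = [2, 4, 6, 8]  # sample from the second population (illustrative)
n1, n2 = len(a), len(b)
va, vb = statistics.variance(a), statistics.variance(b)  # sample variances
diff = statistics.mean(a) - statistics.mean(b)

# Equal variances assumed: pooled (weighted) variance, n1 + n2 - 2 df.
pooled = ((n1 - 1) * va + (n2 - 1) * vb) / (n1 + n2 - 2)
t_pooled = diff / math.sqrt(pooled * (1 / n1 + 1 / n2))

# Equal variances not assumed (Welch): sum of variances over sample sizes,
# tested with the Welch-Satterthwaite approximate degrees of freedom.
se_w = math.sqrt(va / n1 + vb / n2)
t_welch = diff / se_w
df_welch = (va / n1 + vb / n2) ** 2 / (
    (va / n1) ** 2 / (n1 - 1) + (vb / n2) ** 2 / (n2 - 1)
)

print(round(t_pooled, 3), round(t_welch, 3), n1 + n2 - 2, round(df_welch, 2))
```

With unequal group variances, the Welch degrees of freedom fall below $\text{n}_1+\text{n}_2-2$, which is how the "modified degrees of freedom" mentioned above enter the test.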

The two-sample t-test is probably the most widely used (and misused) statistical test. Comparing means based on convenience sampling or non-random allocation is meaningless. If, for any reason, one is forced to use haphazard rather than probability sampling, then every effort must be made to minimize selection bias.

### Unpaired and Overlapping Two-Sample T-Tests

Two-sample t-tests for a difference in mean involve independent samples, paired samples, and overlapping samples. Paired t-tests are a form of blocking, and have greater power than unpaired tests when the paired units are similar with respect to "noise factors" that are independent of membership in the two groups being compared. In a different context, paired t-tests can be used to reduce the effects of confounding factors in an observational study.

### Independent Samples

The independent samples t-test is used when two separate sets of independent and identically distributed samples are obtained, one from each of the two populations being compared. For example, suppose we are evaluating the effect of a medical treatment, and we enroll 100 subjects into our study, then randomize 50 subjects to the treatment group and 50 subjects to the control group. In this case, we have two independent samples and would use the unpaired form of the t-test.

Medical Treatment Research: Medical experiments that randomize subjects into two independent groups may use an independent samples t-test.

### Overlapping Samples

An overlapping samples t-test is used when there are paired samples with data missing in one or the other samples (e.g., due to selection of "I don't know" options in questionnaires, or because respondents are randomly assigned to a subset question). These tests are widely used in commercial survey research (e.g., by polling companies) and are available in many standard crosstab software packages.

## t-Test for Two Samples: Paired

Paired-samples
$\text{t}$
-tests typically consist of a sample of matched pairs of similar units, or one group of units that has been tested twice.

### Learning Objectives

Criticize the shortcomings of paired-samples
$\text{t}$
-tests

### Key Takeaways

#### Key Points

• A paired-difference test uses additional information about the sample that is not present in an ordinary unpaired testing situation, either to increase the statistical power or to reduce the effects of confounders.
• $\text{t}$
-tests are carried out as paired difference tests for normally distributed differences where the population standard deviation of the differences is not known.
• A paired samples
$\text{t}$
-test based on a "matched-pairs sample" results from an unpaired sample that is subsequently used to form a paired sample, by using additional variables that were measured along with the variable of interest.
• Paired samples
$\text{t}$
-tests are often referred to as "dependent samples
$\text{t}$
-tests" (as are
$\text{t}$
-tests on overlapping samples).

#### Key Terms

• confounding: Describes a phenomenon in which an extraneous variable in a statistical model correlates (positively or negatively) with both the dependent variable and the independent variable; confounder = noun form.
• paired difference test: A type of location test that is used when comparing two sets of measurements to assess whether their population means differ.

### Paired Difference Test

In statistics, a paired difference test is a type of location test used when comparing two sets of measurements to assess whether their population means differ. A paired difference test uses additional information about the sample that is not present in an ordinary unpaired testing situation, either to increase the statistical power or to reduce the effects of confounders.
$\text{t}$
-tests are carried out as paired difference tests for normally distributed differences where the population standard deviation of the differences is not known.

### Paired-Samples $\text{t}$-Test

Paired samples
$\text{t}$
-tests typically consist of a sample of matched pairs of similar units, or one group of units that has been tested twice (a "repeated measures"
$\text{t}$
-test).

A typical example of the repeated measures t-test would be where subjects are tested prior to a treatment, say for high blood pressure, and the same subjects are tested again after treatment with a blood-pressure lowering medication. By comparing the same patient's numbers before and after treatment, we are effectively using each patient as their own control. That way the correct rejection of the null hypothesis (here: of no difference made by the treatment) can become much more likely, with statistical power increasing simply because the random between-patient variation has now been eliminated.

Blood Pressure Treatment: A typical example of a repeated measures $\text{t}$-test is in the treatment of patients with high blood pressure to determine the effectiveness of a particular medication.

Note, however, that an increase of statistical power comes at a price: more tests are required, each subject having to be tested twice. Because half of the sample now depends on the other half, the paired version of Student's
$\text{t}$
-test has only
$\frac{\text{n}}{2}-1$
degrees of freedom (with
$\text{n}$
being the total number of observations). Pairs become individual test units, and the sample has to be doubled to achieve the same number of degrees of freedom.
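A paired test reduces to a one-sample test on the within-pair differences. A minimal sketch with made-up before/after blood-pressure readings:

```python
import math
import statistics

before = [120, 130, 125, 140, 135]  # readings before treatment (illustrative)
after = [115, 126, 121, 133, 130]   # readings after treatment (illustrative)

d = [b - a for b, a in zip(before, after)]  # within-pair differences
n_pairs = len(d)

d_bar = statistics.mean(d)
s_d = statistics.stdev(d)                 # sample sd of the differences
t = d_bar / (s_d / math.sqrt(n_pairs))    # one-sample t on the differences
df = n_pairs - 1                          # pairs are the test units

print(round(t, 2), df)
```

Note the degrees of freedom: $10$ raw observations yield only $5$ pairs and hence $4$ degrees of freedom, matching the $\frac{\text{n}}{2}-1$ count discussed above.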

A paired-samples
$\text{t}$
-test based on a "matched-pairs sample" results from an unpaired sample that is subsequently used to form a paired sample, by using additional variables that were measured along with the variable of interest. The matching is carried out by identifying pairs of values consisting of one observation from each of the two samples, where the pair is similar in terms of other measured variables. This approach is sometimes used in observational studies to reduce or eliminate the effects of confounding factors.

Paired-samples
$\text{t}$
-tests are often referred to as "dependent samples
$\text{t}$
-tests" (as are
$\text{t}$
-tests on overlapping samples).

## Calculations for the t-Test: One Sample

The following is a discussion on explicit expressions that can be used to carry out various
$\text{t}$
-tests.

### Learning Objectives

Assess a null hypothesis in a one-sample
$\text{t}$
-test

### Key Takeaways

#### Key Points

• In each case, the formula for a test statistic that either exactly follows or closely approximates a
$\text{t}$
-distribution under the null hypothesis is given.
• Also, the appropriate degrees of freedom are given in each case.
• Once a
$\text{t}$
-value is determined, a
$\text{p}$
-value can be found using a table of values from Student's
$\text{t}$
-distribution.
• If the calculated
$\text{p}$
-value is below the threshold chosen for statistical significance (usually the
$0.10$
, the
$0.05$
, or
$0.01$
level), then the null hypothesis is rejected in favor of the alternative hypothesis.

#### Key Terms

• p-value: The probability of obtaining a test statistic at least as extreme as the one that was actually observed, assuming that the null hypothesis is true.
• standard error: The standard deviation of the sampling distribution of a statistic; for a sample mean, it is estimated by the sample standard deviation divided by the square root of the sample size.

The following is a discussion on explicit expressions that can be used to carry out various
$\text{t}$
-tests. In each case, the formula for a test statistic that either exactly follows or closely approximates a
$\text{t}$
-distribution under the null hypothesis is given. Also, the appropriate degrees of freedom are given in each case. Each of these statistics can be used to carry out either a one-tailed test or a two-tailed test.

Once a
$\text{t}$
-value is determined, a
$\text{p}$
-value can be found using a table of values from Student's
$\text{t}$
-distribution. If the calculated
$\text{p}$
-value is below the threshold chosen for statistical significance (usually the
$0.10$
, the
$0.05$
, or
$0.01$
level), then the null hypothesis is rejected in favor of the alternative hypothesis.

### One-Sample T-Test

In testing the null hypothesis that the population mean is equal to a specified value
$\mu_0$
, one uses the statistic:

$\text{t}=\dfrac { \bar { \text{x} } -{ \mu }_{ 0 } }{ \text{s}/\sqrt { \text{n} } }$

where
$\bar { \text{x} }$
is the sample mean,
$\text{s}$
is the sample standard deviation, and
$\text{n}$
is the sample size. The degrees of freedom used in this test is
$\text{n}-1$
.
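The statistic above can be packaged as a small function operating on raw data. This is a sketch; the function name and the data list are illustrative:

```python
import math
import statistics

def one_sample_t(data, mu_0):
    """Return the one-sample t-statistic and its degrees of freedom."""
    n = len(data)
    x_bar = statistics.mean(data)       # sample mean
    s = statistics.stdev(data)          # sample standard deviation
    t = (x_bar - mu_0) / (s / math.sqrt(n))
    return t, n - 1

# The sample mean of this data equals mu_0, so t is essentially 0.
t, df = one_sample_t([5.1, 4.9, 5.0, 5.2, 4.8], mu_0=5.0)
print(round(t, 6), df)
```

A t-value near $0$ indicates no evidence against the null hypothesis; the further $|t|$ moves from $0$ relative to the $\text{t}(\text{n}-1)$ distribution, the smaller the p-value.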

### Slope of a Regression

Suppose one is fitting the model:

${ \text{Y} }_{ \text{i} }=\alpha +\beta { \text{x} }_{ \text{i} }+{ \varepsilon }_{ \text{i} }$

where
$\text{x}_\text{i}, \text{i}=1, \cdots, \text{n}$
are known,
$\alpha$
and
$\beta$
are unknown, and
$\varepsilon_\text{i}$
are independent identically normally distributed random errors with expected value
$0$
and unknown variance
$\sigma^2$
, and
$\text{Y}_\text{i},\text{i}=1,\cdots,\text{n}$
are observed. It is desired to test the null hypothesis that the slope
$\beta$
is equal to some specified value
$\beta_0$
(often taken to be
$0$
, in which case the hypothesis is that
$\text{x}$
and
$\text{y}$
are unrelated). Let
$\hat{\alpha}$
and
$\hat{\beta}$
be least-squares estimators, and let
$\text{SE}_{\hat{\alpha}}$
and
$\text{SE}_{\hat{\beta}}$
, respectively, be the standard errors of those least-squares estimators. Then,

$\text{t}=\dfrac{\hat{\beta}-\beta_0}{\text{SE}_{\hat{\beta}}}$

has a
$\text{t}$
-distribution with
$\text{n} - 2$
degrees of freedom if the null hypothesis is true. The standard error of the slope coefficient is:

$\displaystyle{\text{SE}_{\hat{\beta}}=\frac{\sqrt{\frac{1}{\text{n}-2} \sum_{\text{i}=1}^\text{n} \left(\text{Y}_\text{i} - \hat{\text{y}}_\text{i} \right) ^2}}{\sqrt{\sum_{\text{i}=1}^\text{n} \left( \text{x}_\text{i} - \bar{\text{x}} \right) ^2}}}$

This expression can be written in terms of the residuals
$\hat{\varepsilon}_\text{i}$
:

$\hat{\varepsilon}_\text{i} = \text{Y}_\text{i} - \hat{\text{y}}_\text{i} = \text{Y}_\text{i} - (\hat{\alpha} + \hat{\beta}\text{x}_\text{i})$

Therefore, the sum of the squares of residuals, or
$\text{SSR}$
, is given by:

$\displaystyle{\text{SSR} = \sum_{\text{i}=1}^\text{n} \hat{\varepsilon}_\text{i}^2}$

Then, the
$\text{t}$
-score is given by:

$\displaystyle{\text{t} = \frac{ \left( \hat{\beta} - \beta_0\right) \sqrt{\text{n}-2}}{\sqrt{ \frac{\text{SSR}}{\sum_{\text{i}=1}^\text{n}\left( \text{x}_\text{i} - \bar{\text{x}}\right)^2}}}}$
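The whole calculation, from the least-squares estimates through the t-score, can be sketched in plain Python; the data are hypothetical and no external libraries are assumed:

```python
def slope_t(x, y, beta0=0.0):
    """t statistic for H0: slope = beta0 in Y = alpha + beta*x + error.

    Returns (t, degrees of freedom), with df = n - 2.
    """
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)                    # sum of (x_i - xbar)^2
    beta_hat = sum((xi - xbar) * (yi - ybar)
                   for xi, yi in zip(x, y)) / sxx              # least-squares slope
    alpha_hat = ybar - beta_hat * xbar                         # least-squares intercept
    ssr = sum((yi - (alpha_hat + beta_hat * xi)) ** 2
              for xi, yi in zip(x, y))                         # sum of squared residuals
    se_beta = (ssr / (n - 2)) ** 0.5 / sxx ** 0.5              # standard error of the slope
    return (beta_hat - beta0) / se_beta, n - 2

# Hypothetical data with a clear upward trend
t, df = slope_t([1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8])
```

With such a strong linear trend the t-score is large, so the null hypothesis of zero slope would be rejected at any conventional significance level.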

## Calculations for the t-Test: Two Samples

The following is a discussion on explicit expressions that can be used to carry out various t-tests.

### Learning Objectives

Calculate the t value for different types of sample sizes and variances in an independent two-sample t-test

### Key Takeaways

#### Key Points

• A two-sample t-test for equal sample sizes and equal variances is used only when both the two sample sizes are equal and it can be assumed that the two distributions have the same variance.
• A two-sample t-test for unequal sample sizes and equal variances is used only when it can be assumed that the two distributions have the same variance.
• A two-sample t-test for unequal (or equal) sample sizes and unequal variances (also known as Welch's t-test) is used only when the two population variances are assumed to be different and hence must be estimated separately.

#### Key Terms

• pooled variance: A method for estimating variance given several different samples taken in different circumstances where the mean may vary between samples but the true variance is assumed to remain the same.
• degrees of freedom: The number of values in the final calculation of a statistic that are free to vary.

The following is a discussion on explicit expressions that can be used to carry out various t-tests. In each case, the formula for a test statistic that either exactly follows or closely approximates a t-distribution under the null hypothesis is given. Also, the appropriate degrees of freedom are given in each case. Each of these statistics can be used to carry out either a one-tailed test or a two-tailed test.

Once a t-value is determined, a p-value can be found using a table of values from Student's t-distribution. If the calculated p-value is below the threshold chosen for statistical significance (usually at the 0.10, 0.05, or 0.01 level), then the null hypothesis is rejected in favor of the alternative hypothesis.

### Equal Sample Sizes, Equal Variance

This test is only used when both:

• the two sample sizes (that is, the number, n, of participants of each group) are equal; and
• it can be assumed that the two distributions have the same variance.

Violations of these assumptions are discussed below. The t-statistic to test whether the means are different can be calculated as follows:

$\text{t}=\dfrac { { \bar { \text{X} } }_{ 1 }-{ \bar { \text{X} } }_{ 2 } }{ \text{S}_{\text{X}_1\text{X}_2}\cdot \sqrt { \frac { 2 }{ \text{n} } } }$
,

where

$\text{S}_{\text{X}_1\text{X}_2}=\sqrt { \frac { 1 }{ 2 } \left( \text{S}_{\text{X}_1}^2+\text{S}_{\text{X}_2}^2 \right) }$
.

Here,
$\text{S}_{\text{X}_1\text{X}_2}$
is the grand standard deviation (or pooled standard deviation); the subscripts 1 and 2 denote group one and group two. The denominator of
$\text{t}$
is the standard error of the difference between the two means.

For significance testing, the number of degrees of freedom for this test is $2\text{n} - 2$, where $\text{n}$ is the number of participants in each group.
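A minimal Python sketch of this statistic, using hypothetical groups of equal size:

```python
from statistics import mean, stdev

def two_sample_t_equal(x1, x2):
    """Two-sample t for equal sizes and assumed equal variances; df = 2n - 2."""
    n = len(x1)                                        # both groups have n participants
    s_pooled = (0.5 * (stdev(x1) ** 2 + stdev(x2) ** 2)) ** 0.5
    t = (mean(x1) - mean(x2)) / (s_pooled * (2 / n) ** 0.5)
    return t, 2 * n - 2

# Hypothetical groups of equal size
t, df = two_sample_t_equal([2, 4, 6], [1, 3, 5])
```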

### Unequal Sample Sizes, Equal Variance

This test is used only when it can be assumed that the two distributions have the same variance. The t-statistic to test whether the means are different can be calculated as follows:

$\text{t}=\dfrac { { \bar { \text{X} } }_{ 1 }-{ \bar { \text{X} } }_{ 2 } }{ \text{S}_{\text{X}_1\text{X}_2}\cdot \sqrt { \frac { 1 }{ { \text{n} }_{ 1 } } +\frac { 1 }{ { \text{n} }_{ 2 } } } }$
,

where

$\text{S}_{\text{X}_1\text{X}_2}=\sqrt { \dfrac { \left( \text{n}_1-1 \right) \text{S}_{\text{X}_1}^2+\left( \text{n}_2-1 \right) \text{S}_{\text{X}_2}^2 }{ \text{n}_1+\text{n}_2-2 } }$

is the pooled standard deviation for a two-sample t-test with unequal sample sizes but equal variances.

$\text{S}_{\text{X}_1\text{X}_2}$
is an estimator of the common standard deviation of the two samples: it is defined in this way so that its square is an unbiased estimator of the common variance, whether or not the population means are the same. In these formulae, $\text{n}_\text{i}$ is the number of participants in group $\text{i}$ ($\text{i} = 1$ or $2$). $\text{n}_\text{i} - 1$ is the number of degrees of freedom for either group, and the total sample size minus two (that is, $\text{n}_1 + \text{n}_2 - 2$) is the total number of degrees of freedom, which is used in significance testing.
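A plain-Python sketch of the pooled calculation and the resulting t statistic, with hypothetical groups of unequal size:

```python
def two_sample_t_pooled(x1, x2):
    """Two-sample t with pooled variance; df = n1 + n2 - 2."""
    n1, n2 = len(x1), len(x2)
    m1, m2 = sum(x1) / n1, sum(x2) / n2
    v1 = sum((v - m1) ** 2 for v in x1) / (n1 - 1)     # unbiased sample variances
    v2 = sum((v - m2) ** 2 for v in x2) / (n2 - 1)
    s_pooled = (((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)) ** 0.5
    t = (m1 - m2) / (s_pooled * (1 / n1 + 1 / n2) ** 0.5)
    return t, n1 + n2 - 2

# Hypothetical groups of unequal size
t, df = two_sample_t_pooled([1, 2, 3, 4], [2, 4, 6])
```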

### Unequal (or Equal) Sample Sizes, Unequal Variances

This test, also known as Welch's t-test, is used only when the two population variances are assumed to be different (the two sample sizes may or may not be equal) and hence must be estimated separately. The t-statistic to test whether the population means are different is calculated as:

$\text{t}=\frac { { \bar { \text{X} } }_{ 1 }-{ \bar { \text{X} } }_{ 2 } }{ { \text{s} }_{ { \bar { \text{X} } }_{ 1 }-{ \bar { \text{X} } }_{ 2 } } }$

where

$\text{s}_{\bar{\text{X}}_1-\bar{\text{X}}_2}=\sqrt{\dfrac{\text{s}_1^2}{\text{n}_1}+\dfrac{\text{s}_2^2}{\text{n}_2}}$

is the unpooled standard error of the difference between the two means. Here, $\text{s}_\text{i}^2$ is the unbiased estimator of the variance of sample $\text{i}$, and $\text{n}_\text{i}$ is the number of participants in group $\text{i}$ ($\text{i}=1$ or $2$). Note that in this case
${ { \text{s} }_{ { \bar { \text{X} } }_{ 1 }-{ \bar { \text{X} } }_{ 2 } } }^{ 2 }$
is not a pooled variance. For use in significance testing, the distribution of the test statistic is approximated as an ordinary Student's t-distribution with the degrees of freedom calculated using:

$\displaystyle{\text{d.f.}=\frac{\left(\frac{\text{s}_1^2}{\text{n}_1}+\frac{\text{s}_2^2}{\text{n}_2}\right)^2}{\frac{\left(\text{s}_1^2/\text{n}_1\right)^2}{\text{n}_1-1}+\frac{\left(\text{s}_2^2/\text{n}_2\right)^2}{\text{n}_2-1}}}$

This is known as the Welch–Satterthwaite equation. The true distribution of the test statistic actually depends (slightly) on the two unknown population variances.
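Both the Welch t statistic and the Welch–Satterthwaite degrees of freedom can be sketched together in plain Python; the data are hypothetical:

```python
def welch_t(x1, x2):
    """Welch's t-test: unequal variances; returns (t, Welch-Satterthwaite df)."""
    n1, n2 = len(x1), len(x2)
    m1, m2 = sum(x1) / n1, sum(x2) / n2
    v1 = sum((v - m1) ** 2 for v in x1) / (n1 - 1)     # variances estimated separately
    v2 = sum((v - m2) ** 2 for v in x2) / (n2 - 1)
    se2 = v1 / n1 + v2 / n2                            # squared SE of the difference
    t = (m1 - m2) / se2 ** 0.5
    df = se2 ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    return t, df                                       # df is generally non-integer

t, df = welch_t([1, 2, 3, 4], [2, 4, 6])               # hypothetical data
```

Note that the degrees of freedom come out as a non-integer (a little over 3 here), which is expected: the approximating t-distribution need not have integer degrees of freedom.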

## Multivariate Testing

Hotelling's
$\text{T}$
-square statistic allows for the testing of hypotheses on multiple (often correlated) measures within the same sample.

### Learning Objectives

Summarize Hotelling's
$\text{T}$
-squared statistics for one- and two-sample multivariate tests

### Key Takeaways

#### Key Points

• Hotelling's
$\text{T}$
-squared distribution is important because it arises as the distribution of a set of statistics which are natural generalizations of the statistics underlying Student's
$\text{t}$
-distribution.
• In particular, the distribution arises in multivariate statistics in undertaking tests of the differences between the (multivariate) means of different populations, where tests for univariate problems would make use of a
$\text{t}$
-test.
• For a one-sample multivariate test, the hypothesis is that the mean vector (
$\mu$
) is equal to a given vector (
${ \mu }_{ 0 }$
).
• For a two-sample multivariate test, the hypothesis is that the mean vectors (
${ \mu }_{ 1 }$
and
${ \mu }_{ 2 }$
) of two samples are equal.

#### Key Terms

• Hotelling's T-square statistic: A generalization of Student's
$\text{t}$
-statistic that is used in multivariate hypothesis testing.
• Type I error: An error occurring when the null hypothesis (
$\text{H}_0$
) is true, but is rejected.

A generalization of Student's
$\text{t}$
-statistic, called Hotelling's
$\text{T}$
-square statistic, allows for the testing of hypotheses on multiple (often correlated) measures within the same sample. For instance, a researcher might submit a number of subjects to a personality test consisting of multiple personality scales (e.g., the Minnesota Multiphasic Personality Inventory). Because measures of this type are usually highly correlated, it is not advisable to conduct separate univariate
$\text{t}$
-tests to test hypotheses, as these would neglect the covariance among measures and inflate the chance of falsely rejecting at least one hypothesis (type I error). In this case a single multivariate test is preferable for hypothesis testing. Hotelling's
$\text{T}^2$
statistic follows a
$\text{T}^2$
distribution.

Hotelling's
$\text{T}$
-squared distribution is important because it arises as the distribution of a set of statistics which are natural generalizations of the statistics underlying Student's
$\text{t}$
-distribution. In particular, the distribution arises in multivariate statistics in undertaking tests of the differences between the (multivariate) means of different populations, where tests for univariate problems would make use of a
$\text{t}$
-test. It is proportional to the
$\text{F}$
-distribution.

### One-sample $\text{T}^2$ Test

For a one-sample multivariate test, the hypothesis is that the mean vector (
$\mu$
) is equal to a given vector (
${ \mu }_{ 0 }$
). The test statistic is defined as follows:

$\text{T}^2 = \text{n} (\bar{\mathbf{\text{x}}}-\mu_0)' \mathbf{\text{S}}^{-1} (\bar{\mathbf{\text{x}}}-\mu_0)$

where
$\text{n}$
is the sample size,
$\bar { \text{x} }$
is the vector of column means and
$\text{S}$
is the
$\text{m} \times \text{m}$
sample covariance matrix.
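As a minimal sketch of the one-sample case restricted to two variables, the quadratic form can be computed with the explicit 2×2 matrix inverse, avoiding any linear-algebra library; the data are hypothetical:

```python
def hotelling_t2_2d(data, mu0):
    """One-sample Hotelling T-squared for two correlated variables.

    data: list of (x, y) observations; mu0: hypothesized mean vector (pair).
    """
    n = len(data)
    mx = sum(p[0] for p in data) / n
    my = sum(p[1] for p in data) / n
    # entries of the 2x2 sample covariance matrix S (n - 1 denominator)
    sxx = sum((p[0] - mx) ** 2 for p in data) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in data) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in data) / (n - 1)
    det = sxx * syy - sxy ** 2
    dx, dy = mx - mu0[0], my - mu0[1]
    # n * d' S^{-1} d, with the 2x2 inverse written out explicitly
    return n * (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det

t2 = hotelling_t2_2d([(1, 2), (2, 3), (3, 5), (4, 6)], (2, 3))  # hypothetical data
```

For more than two variables the same quadratic form applies, but a general matrix inverse (e.g., from a linear-algebra library) would be needed.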

### Two-Sample $\text{T}^2$ Test

For a two-sample multivariate test, the hypothesis is that the mean vectors (
${ \mu }_{ 1 },{ \mu }_{ 2 }$
) of two samples are equal. The test statistic is defined as:

$\text{T}^2 = \dfrac{\text{n}_1\text{n}_2}{\text{n}_1 + \text{n}_2}(\bar{\mathbf{\text{x}}}_1 - \bar{\mathbf{\text{x}}}_2)' {\mathbf{\text{S}}_{\text{pooled}}}^{-1} (\bar{\mathbf{\text{x}}}_1 - \bar{\mathbf{\text{x}}}_2)$

## Alternatives to the t-Test

When the normality assumption does not hold, a nonparametric alternative to the
$\text{t}$
-test can often have better statistical power.

### Learning Objectives

Explain how Wilcoxon Rank Sum tests are applied to data distributions

### Key Takeaways

#### Key Points

• The
$\text{t}$
-test provides an exact test for the equality of the means of two normal populations with unknown, but equal, variances.
• The Welch's
$\text{t}$
-test is a nearly exact test for the case where the data are normal but the variances may differ.
• For moderately large samples and a one-tailed test, the
$\text{t}$
is relatively robust to moderate violations of the normality assumption.
• If the sample size is large, Slutsky's theorem implies that the distribution of the sample variance has little effect on the distribution of the test statistic.
• For two independent samples when the data distributions are asymmetric (that is, the distributions are skewed) or the distributions have large tails, then the Wilcoxon Rank Sum test can have three to four times higher power than the
$\text{t}$
-test.
• The nonparametric counterpart to the paired-samples
$\text{t}$
-test is the Wilcoxon signed-rank test for paired samples.

#### Key Terms

• Wilcoxon signed-rank test: A nonparametric statistical hypothesis test used when comparing two related samples, matched samples, or repeated measurements on a single sample to assess whether their population mean ranks differ (i.e., it is a paired difference test).
• central limit theorem: The theorem that states: If the sum of independent identically distributed random variables has a finite variance, then it will be (approximately) normally distributed.
• Wilcoxon Rank Sum test: A non-parametric test of the null hypothesis that two populations are the same against an alternative hypothesis, especially that a particular population tends to have larger values than the other.

The
$\text{t}$
-test provides an exact test for the equality of the means of two normal populations with unknown, but equal, variances. The Welch's
$\text{t}$
-test is a nearly exact test for the case where the data are normal but the variances may differ. For moderately large samples and a one-tailed test, the
$\text{t}$
is relatively robust to moderate violations of the normality assumption.

For exactness, the
$\text{t}$
-test and
$\text{Z}$
-test require normality of the sample means, and the
$\text{t}$
-test additionally requires that the sample variance follows a scaled
$\chi^2$
distribution, and that the sample mean and sample variance be statistically independent. Normality of the individual data values is not required if these conditions are met. By the central limit theorem, sample means of moderately large samples are often well-approximated by a normal distribution even if the data are not normally distributed. For non-normal data, the distribution of the sample variance may deviate substantially from a
$\chi^2$
distribution. If the data are substantially non-normal and the sample size is small, the
$\text{t}$
-test can give misleading results. However, if the sample size is large, Slutsky's theorem implies that the distribution of the sample variance has little effect on the distribution of the test statistic.

Slutsky's theorem extends some properties of algebraic operations on convergent sequences of real numbers to sequences of random variables. The theorem was named after Eugen Slutsky. The statement is as follows:

Let
$\{\text{X}_\text{n}\}$
,
$\{\text{Y}_\text{n}\}$
be sequences of scalar/vector/matrix random elements. If
$\text{X}_\text{n}$
converges in distribution to a random element
$\text{X}$
, and
$\text{Y}_\text{n}$
converges in probability to a constant
$\text{c}$
, then:

$\displaystyle{ \text{X}_\text{n} + \text{Y}_\text{n} \overset{\text{d}}{\rightarrow} \text{X} + \text{c}\\ \text{Y}_\text{n}\text{X}_\text{n} \overset{\text{d}}{\rightarrow} \text{cX}\\ \text{Y}_\text{n}^{-1}\text{X}_\text{n} \overset{\text{d}}{\rightarrow} \text{c}^{-1} \text{X} }$

where
$\overset{\text{d}}{\rightarrow}$
denotes convergence in distribution.

When the normality assumption does not hold, a nonparametric alternative to the
$\text{t}$
-test can often have better statistical power. For example, for two independent samples when the data distributions are asymmetric (that is, the distributions are skewed) or the distributions have large tails, then the Wilcoxon Rank Sum test (also known as the Mann-Whitney
$\text{U}$
test) can have three to four times higher power than the
$\text{t}$
-test. The nonparametric counterpart to the paired samples
$\text{t}$
-test is the Wilcoxon signed-rank test for paired samples.
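The Mann-Whitney U statistic underlying the Wilcoxon Rank Sum test has a simple counting definition: the number of pairs (one value from each sample) in which the first sample's value is larger, with ties counting one half. A minimal Python sketch, with hypothetical data in the usage example:

```python
def mann_whitney_u(x, y):
    """U statistic: count of pairs with the x-value larger; ties count one half."""
    u = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u += 1.0
            elif xi == yj:
                u += 0.5
    return u

u = mann_whitney_u([7, 8, 9], [1, 2, 3])   # complete separation: u = 9.0
```

With complete separation between the samples, U attains its maximum of len(x) * len(y); a table or normal approximation then converts U to a p-value.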

One-way analysis of variance generalizes the two-sample
$\text{t}$
-test when the data belong to more than two groups.

## Cohen's d

Cohen's
$\text{d}$
is a method of estimating effect size in a
$\text{t}$
-test based on means or distances between/among means.

### Learning Objectives

Justify Cohen's
$\text{d}$
as a method for estimating effect size in a
$\text{t}$
-test

### Key Takeaways

#### Key Points

• An effect size is a measure of the strength of a phenomenon (for example, the relationship between two variables in a statistical population) or a sample-based estimate of that quantity.
• An effect size calculated from data is a descriptive statistic that conveys the estimated magnitude of a relationship without making any statement about whether the apparent relationship in the data reflects a true relationship in the population.
• Cohen's
$\text{d}$
is an example of a standardized measure of effect; such measures are used when the metrics of variables do not have intrinsic meaning, when results from multiple studies are being combined, when the studies use different scales, or when effect size is conveyed relative to the variability in the population.
• As in any statistical setting, effect sizes are estimated with error, and may be biased unless the effect size estimator that is used is appropriate for the manner in which the data were sampled and the manner in which the measurements were made.
• Cohen's
$\text{d}$
is defined as the difference between two means divided by a standard deviation for the data:
$\text{d}=\frac { { \bar { \text{x} } }_{ 1 }-{ \bar { \text{x} } }_{ 2 } }{ \sigma }$
.

#### Key Terms

• p-value: The probability of obtaining a test statistic at least as extreme as the one that was actually observed, assuming that the null hypothesis is true.
• Cohen's d: A measure of effect size indicating the amount of difference between two groups on a construct of interest, in standard deviation units.

Cohen's
$\text{d}$
is a method of estimating effect size in a
$\text{t}$
-test based on means or distances between/among means. An effect size is a measure of the strength of a phenomenon—for example, the relationship between two variables in a statistical population (or a sample-based estimate of that quantity). An effect size calculated from data is a descriptive statistic that conveys the estimated magnitude of a relationship without making any statement about whether the apparent relationship in the data reflects a true relationship in the population. In that way, effect sizes complement inferential statistics such as
$\text{p}$
-values. Among other uses, effect size measures play an important role in meta-analysis studies that summarize findings from a specific area of research, and in statistical power analyses.

Cohen's $\text{d}$: Plots of the densities of Gaussian distributions showing different Cohen's effect sizes.

The concept of effect size already appears in everyday language. For example, a weight loss program may boast that it leads to an average weight loss of 30 pounds. In this case, 30 pounds is an indicator of the claimed effect size. Another example is that a tutoring program may claim that it raises school performance by one letter grade. This grade increase is the claimed effect size of the program. These are both examples of "absolute effect sizes," meaning that they convey the average difference between two groups without any discussion of the variability within the groups.

Reporting effect sizes is considered good practice when presenting empirical research findings in many fields. The reporting of effect sizes facilitates the interpretation of the substantive, as opposed to the statistical, significance of a research result. Effect sizes are particularly prominent in social and medical research.

Cohen's
$\text{d}$
is an example of a standardized measure of effect. Standardized effect size measures are typically used when the metrics of variables being studied do not have intrinsic meaning (e.g., a score on a personality test on an arbitrary scale), when results from multiple studies are being combined, when some or all of the studies use different scales, or when it is desired to convey the size of an effect relative to the variability in the population. In meta-analysis, standardized effect sizes are used as a common measure that can be calculated for different studies and then combined into an overall summary.

As in any statistical setting, effect sizes are estimated with error, and may be biased unless the effect size estimator that is used is appropriate for the manner in which the data were sampled and the manner in which the measurements were made. An example of this is publication bias, which occurs when scientists only report results when the estimated effect sizes are large or are statistically significant. As a result, if many researchers are carrying out studies under low statistical power, the reported results are biased to be stronger than true effects, if any.

### Relationship to Test Statistics

Sample-based effect sizes are distinguished from test statistics used in hypothesis testing in that they estimate the strength of an apparent relationship, rather than assigning a significance level reflecting whether the relationship could be due to chance. The effect size does not determine the significance level, or vice-versa. Given a sufficiently large sample size, a statistical comparison will always show a significant difference unless the population effect size is exactly zero. For example, a sample Pearson correlation coefficient of
$0.1$
is strongly statistically significant if the sample size is
$1000$
. Reporting only the significant
$\text{p}$
-value from this analysis could be misleading if a correlation of
$0.1$
is too small to be of interest in a particular application.

### Cohen's D

Cohen's
$\text{d}$
is defined as the difference between two means divided by a standard deviation for the data:

$\text{d}=\dfrac { { \bar { \text{x} } }_{ 1 }-{ \bar { \text{x} } }_{ 2 } }{ \sigma }$

Cohen's
$\text{d}$
is frequently used in estimating sample sizes. A lower Cohen's
$\text{d}$
indicates that a larger sample size is necessary, and vice versa; the required sample size can then be determined together with the desired significance level and statistical power.

The precise definition of the standard deviation s was not originally made explicit by Jacob Cohen; he defined it (using the symbol
$\sigma$
) as "the standard deviation of either population" (since they are assumed equal). Other authors make the computation of the standard deviation more explicit with the following definition for a pooled standard deviation with two independent samples:

$\displaystyle{\text{s}=\sqrt{\frac{(\text{n}_1 - 1)\text{s}_1^2 + (\text{n}_2 -1) \text{s}_2^2}{\text{n}_1 + \text{n}_2 - 2}}}$
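The pooled-standard-deviation version of Cohen's d can be sketched in plain Python; the data are hypothetical:

```python
def cohens_d(x1, x2):
    """Cohen's d using the pooled standard deviation for two independent samples."""
    n1, n2 = len(x1), len(x2)
    m1, m2 = sum(x1) / n1, sum(x2) / n2
    v1 = sum((v - m1) ** 2 for v in x1) / (n1 - 1)     # unbiased sample variances
    v2 = sum((v - m2) ** 2 for v in x2) / (n2 - 1)
    s = (((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)) ** 0.5
    return (m1 - m2) / s

d = cohens_d([2, 4, 6], [1, 3, 5])   # hypothetical groups; d = 0.5 here
```

By Cohen's conventional benchmarks, a d of 0.5 would be described as a medium effect.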