Tarter, M. E. (2008). Data transformations. In S. Boslaugh (Ed.), Encyclopedia of Epidemiology (pp. 249–254). Thousand Oaks, CA: Sage Publications.
Transformations
Introduction
Data transformations modify measured values systematically.
For example, suppose the heart-rate (HR) ratio variate, HRR = (HR Work – HR Rest)/(HR Predicted Maximum – HR Rest), is transformed to the new variate arcsin(HRR).
In terms of the match between the statistical methodology applied to study arcsin(HRR) and the assumptions that underlie this methodology, a variate like arcsin(HRR) is often a preferred transform of a variate like HRR.
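As a minimal sketch of the transformation just described, the following Python fragment computes HRR from three heart-rate readings and then applies the arcsin function. The function names and the numerical readings are hypothetical and chosen only for illustration; the range check reflects the fact that arcsin is defined only on [-1, 1].

```python
import math

def heart_rate_ratio(hr_work, hr_rest, hr_predicted_max):
    """HRR = (HR Work - HR Rest) / (HR Predicted Maximum - HR Rest)."""
    reserve = hr_predicted_max - hr_rest
    if reserve <= 0:
        raise ValueError("predicted maximum HR must exceed resting HR")
    return (hr_work - hr_rest) / reserve

def arcsin_transform(hrr):
    """Return arcsin(HRR); arcsin is defined only for -1 <= HRR <= 1."""
    if not -1.0 <= hrr <= 1.0:
        raise ValueError("arcsin is undefined outside [-1, 1]")
    return math.asin(hrr)

# Hypothetical readings in beats per minute (not taken from the source).
hrr = heart_rate_ratio(hr_work=120, hr_rest=60, hr_predicted_max=180)
transformed = arcsin_transform(hrr)
print(hrr, transformed)
```

The transformed value is in radians. Whether it suits a given analysis better than HRR itself depends on the assumptions of the method applied afterwards.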
In modern statistical usage, transformations often help preprocess raw data before a general-purpose software package is applied.
If the steps from data input to the output of some display or printing device were compared to a journey by car through a city, a transformation like arcsin(HRR) would play the role of an access road to the software package's freeway on-ramp.
Software validity, or loosely speaking journey safety, depends on underlying assumptions.
Hence data transformations can be classified by the type of assumption they address.
These include a measured variate's standard Normality, its general Normality, model linearity and/or variate homoscedasticity, i.e. equal standard deviations.
In addition, some useful transformations are not designed to
preprocess measurements individually.
Instead, once an estimator or test statistic has
been computed using raw measurements, these transformations can help enhance the
Normality of the estimator or test statistic.
Transformations and Simulated Data
Transformations are not applied to measured values alone: a transformation procedure is also usually among the steps implemented to simulate artificial data values.
For example, using a pair of uniformly distributed random numbers as input, a Box-Muller transformation (BMT) generates a pair of independent standard Normal variates, in other words, Normal variates with zero expectation and unit variance.
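The BMT can be sketched in a few lines of Python. This is an illustrative version of the standard polar-coordinate form of the transformation, not code from the source; the function names, the seed and the sample size are assumptions made for the example.

```python
import math
import random

def box_muller(u1, u2):
    """Map two uniform values to two independent standard Normal values."""
    if not 0.0 < u1 <= 1.0:
        raise ValueError("u1 must lie in (0, 1] so that log(u1) is finite")
    radius = math.sqrt(-2.0 * math.log(u1))
    angle = 2.0 * math.pi * u2
    return radius * math.cos(angle), radius * math.sin(angle)

def simulate_standard_normal_pairs(n_pairs, seed=0):
    """Generate n_pairs of BMT outputs from a seeded uniform generator."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_pairs):
        u1 = 1.0 - rng.random()  # random() lies in [0, 1); shift to (0, 1]
        u2 = rng.random()
        pairs.append(box_muller(u1, u2))
    return pairs

# Check that the simulated values have roughly zero mean and unit variance.
pairs = simulate_standard_normal_pairs(20000)
z = [value for pair in pairs for value in pair]
sample_mean = sum(z) / len(z)
sample_var = sum((v - sample_mean) ** 2 for v in z) / (len(z) - 1)
print(sample_mean, sample_var)
```

With a sample of this size, the printed mean and variance should be close to 0 and 1 respectively, although the exact figures depend on the seed.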
To answer two questions, (1) why does the BMT have so many applications, and (2) how are transformation components assembled, it is helpful to call upon the following notational conventions.
The two Greek letters φ and Φ represent the standard Normal density function, i.e. curve, and cumulative distribution function (cdf), respectively.
In the same way that sin⁻¹ often designates the arcsin function, Φ⁻¹ designates the inverse of Φ.
The three symbols that form Φ⁻¹ (which in older statistical and epidemiological texts is often called the probit function) provide a useful notational device, because transformation and other data analysis steps tend to be taken in the reverse of the order in which the components of a data simulation process are implemented.
For instance, no data analysis text discusses a scale parameter σ before discussing a location parameter µ.
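The reversal of order can be illustrated with a short Python sketch built on the standard library's NormalDist class, whose inv_cdf and cdf methods play the roles of Φ⁻¹ and Φ. The particular ordering of steps shown here (Φ⁻¹, then σ, then µ for simulation, and the reverse for analysis) is an assumption made for illustration, as are the function names and the numerical values of µ and σ.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist(mu=0.0, sigma=1.0)

def simulate_value(u, mu, sigma):
    """Simulation order: apply the inverse cdf, scale by sigma, shift by mu."""
    z = STANDARD_NORMAL.inv_cdf(u)  # probit of a uniform value, 0 < u < 1
    return mu + sigma * z

def analyze_value(x, mu, sigma):
    """Analysis order: remove location mu, remove scale sigma, apply the cdf."""
    z = (x - mu) / sigma
    return STANDARD_NORMAL.cdf(z)

# Hypothetical location and scale values, chosen only for illustration.
u = 0.975
x = simulate_value(u, mu=70.0, sigma=10.0)
recovered = analyze_value(x, mu=70.0, sigma=10.0)
print(x, recovered)
```

Applying the analysis steps to a simulated value returns the uniform input, up to floating-point error. This is the sense in which the two sequences mirror each other.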