The Effect of Bandwidth Selection in Histogram Construction and Comparison over Density Function Regimes and Differing Sample Sizes
Derek Rampal
4/19/2009
Abstract
Characterizing an unknown distribution function is a frequently encountered problem in probability and statistics.
The usual first step is exploratory data analysis and the construction of a histogram, or naïve density estimator.
The bandwidth, or bin width, is a critical parameter, but its calculation is frequently overlooked when constructing histograms.
This parameter can be tuned to a given distribution function, leading to a more effective histogram and thereby a better understanding of the underlying sampling distribution.
The methods suggested by Sturges, by Scott, and by Freedman and Diaconis will be compared visually to the optimal bin width for different sample sizes and a variety of distributions.
Introduction
The properties of a sample's distribution function are frequently of great interest to statisticians.
Unfortunately, determining this function exactly is not trivial.
Characterization is confounded by the fact that most properties of the distribution function are frequently unknown to the researcher.
Furthermore, because of randomness within the sample itself, a perfect fit is theoretically impossible except in the limit as the sample size grows.
Nonparametric kernel density estimation (KDE) is one method that has been suggested for fitting the sampling distribution.
It is worth noting that the choice of kernel function turns out to matter much less in KDE than the choice of bandwidth, and nowhere is this more evident than in the histogram (the origin of KDE).
Because of the complexity of a random data set, exploratory data analysis becomes a useful tool for a preliminary visual analysis of the sample data.
The histogram is one tool frequently used by statisticians, but it carries a complexity of its own in the selection of the bandwidth, or bin width, used in its construction.
A simple analysis of any data set shows that histograms with bin widths that are too small are limited in their smoothness (precision), while histograms with bin widths that are too large are limited in their exactness (accuracy).
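The trade-off described above can be illustrated numerically. The following is a minimal sketch, assuming a standard normal sample of size 1000 and three arbitrarily chosen bin widths; the helper names (`histogram_density`, `integrated_squared_error`) and the bin origin of -6 are illustrative choices, not part of the original study. It measures the integrated squared error between each histogram and the true density.

```python
import math
import random


def histogram_density(data, h, origin):
    """Return a function x -> histogram density estimate with bin width h."""
    n = len(data)
    counts = {}
    for x in data:
        idx = math.floor((x - origin) / h)
        counts[idx] = counts.get(idx, 0) + 1

    def f_hat(x):
        idx = math.floor((x - origin) / h)
        return counts.get(idx, 0) / (n * h)

    return f_hat


def normal_pdf(x):
    """Standard normal density, used here as the known 'true' distribution."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def integrated_squared_error(f_hat, f, lo, hi, steps=4000):
    """Midpoint-rule approximation of the integral of (f_hat - f)^2 on [lo, hi]."""
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        total += (f_hat(x) - f(x)) ** 2
    return total * dx


random.seed(0)  # fixed seed so the run is repeatable
sample = [random.gauss(0.0, 1.0) for _ in range(1000)]

ise = {}
for h in (0.02, 0.35, 3.0):  # too small, moderate, too large (assumed values)
    f_hat = histogram_density(sample, h, origin=-6.0)
    ise[h] = integrated_squared_error(f_hat, normal_pdf, -6.0, 6.0)
    print(f"h = {h:4.2f}  ISE = {ise[h]:.4f}")
```

Under these assumptions the moderate width should give the smallest error: the very narrow bins produce a noisy estimate, and the very wide bins flatten the shape of the density.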
Though a theoretically optimal bandwidth exists, its calculation involves the actual distribution function, which, as stated previously, is frequently unknown.
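For reference, the dependence on the unknown density can be made explicit. The expression below is the standard asymptotic result obtained by minimizing the mean integrated squared error of the histogram; it is quoted here as background and is not taken from this paper. It assumes a density f whose derivative f' is square-integrable.

$$h^{*} = \left( \frac{6}{n \int f'(x)^{2}\,dx} \right)^{1/3}$$

As a worked case, for a normal density with standard deviation $\sigma$ the integral is $\int f'(x)^{2}\,dx = 1/(4\sqrt{\pi}\,\sigma^{3})$, so

$$h^{*} = (24\sqrt{\pi})^{1/3}\,\sigma\,n^{-1/3} \approx 3.49\,\sigma\,n^{-1/3},$$

which matches the constant of roughly 3.5 in Scott's rule once $\sigma$ is replaced by the sample standard deviation s.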
A variety of bandwidth selection procedures are available in the literature.
Sturges, in 1926, suggested
(eq. 1) $k = 1 + \log_2 n$ [3]
where k is the number of “bins” and n is the sample size.
An alternative construction was suggested by Scott in 1979,
(eq. 2) $h = 3.5\, s\, n^{-1/3}$ [2]
where h is the bandwidth or bin size, s is the square root of the sample variance and n is the sample size.
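As a rough illustration of eq. 1 and eq. 2, the sketch below computes both quantities for simulated normal samples. The function names are hypothetical, rounding k up to an integer is an assumption (eq. 1 does not say how to handle non-integer values), and converting a bin width to a bin count by covering the sample range is likewise only one possible convention.

```python
import math
import random
import statistics


def sturges_bins(n):
    """Number of bins from eq. 1: k = 1 + log2(n), rounded up (assumed)."""
    return math.ceil(1 + math.log2(n))


def scott_width(data):
    """Bin width from eq. 2: h = 3.5 * s * n^(-1/3)."""
    s = statistics.stdev(data)  # square root of the sample variance
    return 3.5 * s * len(data) ** (-1.0 / 3.0)


def bins_from_width(data, h):
    """Number of bins of width h needed to cover the sample range (assumed convention)."""
    return max(1, math.ceil((max(data) - min(data)) / h))


random.seed(1)  # fixed seed so the run is repeatable
for n in (50, 500, 5000):
    sample = [random.gauss(0.0, 1.0) for _ in range(n)]
    h = scott_width(sample)
    print(n, sturges_bins(n), round(h, 3), bins_from_width(sample, h))
```

The bin count from eq. 1 grows only logarithmically in n, whereas the width from eq. 2 shrinks like n^(-1/3), so the two rules can be expected to diverge as the sample size increases.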
A similar calculation was suggested by Freedman and Diaconis in 1981
This note was uploaded on 09/19/2009 for the course MATH compstat taught by Professor Qian during the Spring '09 term at FAU.