derek_rampal_Project2

# derek_rampal_Project2 - The Effect of Bandwidth Selection...


The Effect of Bandwidth Selection in Histogram Construction and Comparison over Density Function Regimes and Differing Sample Size

Derek Rampal

4/19/2009

Abstract

Characterization of unknown distribution functions is a frequently encountered problem in probability and statistics. The usual first step is exploratory data analysis and the construction of a histogram, or naïve density estimator. The bandwidth, or bin width, is a critical parameter, but its calculation is frequently overlooked when constructing histograms. This parameter can be finely tuned for a given distribution function, leading to more effective construction and thereby a better understanding of the underlying sampling distribution. The methodologies suggested by Sturges, by Scott, and by Freedman and Diaconis will be compared visually to the optimal bin width for different sample sizes and a variety of distributions.

Introduction

The properties of a sample's underlying distribution function are frequently of great interest to statisticians. Unfortunately, determining this function exactly is not trivial. Characterization is confounded by the fact that most properties of the distribution function are unknown to the researcher. Furthermore, because of randomness within the sample itself, a perfect fit is theoretically impossible except in the limit. Non-parametric Kernel Density Estimation (KDE) is one method that has been suggested for fitting the sampling distribution. It is worth noting that kernel function selection turns out to matter much less in KDE than bandwidth selection, and this is nowhere more evident than in the histogram (the origin of KDE).

Because of the complexity of a random data set, exploratory data analysis is a useful tool for a preliminary visual analysis of the sample data. The histogram is one tool frequently used by statisticians, but it carries a complication of its own: the selection of the bandwidth, or bin width, used in its construction. A simple analysis of any data set shows that histograms with bin widths that are too small are limited in smoothness (precision), while histograms with bin widths that are too large are limited in exactness (accuracy). Though a theoretically optimal bandwidth exists, its calculation involves the actual distribution function, which, as stated previously, is frequently unknown.

A variety of bandwidth selection procedures are available in the literature. Sturges, in 1926, suggested

(eq. 1) $k = 1 + \log_2 n$ [3]

where $k$ is the number of "bins" and $n$ is the sample size. An alternative construction was suggested by Scott in 1976,

(eq. 2) $h = 3.5\, s\, n^{-1/3}$ [2]

where $h$ is the bandwidth or bin size, $s$ is the square root of the sample variance, and $n$ is the sample size.
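As a rough illustration, the two rules above (eq. 1 and eq. 2) can be sketched in Python. This is a minimal sketch, not the author's code: the function names `sturges_bins` and `scott_width` are hypothetical, rounding the bin count up to a whole number is an assumption, and the exponent in eq. 2 is read as $-1/3$ so that the bin width shrinks as the sample grows.

```python
import math
import statistics


def sturges_bins(n):
    """Sturges' rule (eq. 1): k = 1 + log2(n).

    Rounding up to a whole number of bins is an assumption made here;
    the text only gives the formula.
    """
    return math.ceil(1 + math.log2(n))


def scott_width(sample):
    """Scott's rule (eq. 2): h = 3.5 * s * n^(-1/3).

    s is the square root of the sample variance (n - 1 denominator).
    The exponent is taken as -1/3, an assumed reading of eq. 2.
    """
    n = len(sample)
    s = statistics.stdev(sample)
    return 3.5 * s * n ** (-1 / 3)


if __name__ == "__main__":
    data = [1, 2, 3, 4, 5, 6, 7, 8]  # toy sample, for illustration only
    print("Sturges bins:", sturges_bins(len(data)))
    print("Scott width: ", scott_width(data))
```

Note that Sturges' rule depends only on the sample size, while Scott's rule also uses the spread of the data, so the two can disagree noticeably for heavy-tailed or very large samples.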
A similar calculation was suggested by Freedman and Diaconis in 1981


## This note was uploaded on 09/19/2009 for the course MATH compstat taught by Professor Qian during the Spring '09 term at FAU.
