This is essentially the value that splits the dataset in two: approximately half of the data is below the median and half is above it. More generally, we can define sample percentiles.

Definition: Sample Percentiles

Calculation of sample percentiles is not done the same way everywhere, and most statistical packages use a definition that involves interpolation (like the median above).
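To make the point about differing conventions concrete, here is a minimal sketch in Python. The data are hypothetical (not the dataset from these slides), and `percentile_linear` is an illustrative helper implementing just one common interpolation convention; the standard library's `statistics.quantiles` offers two others.

```python
import statistics

# Hypothetical data, not taken from the slides.
data = [15, 20, 35, 40, 50]

def percentile_linear(values, p):
    """Sample percentile by linear interpolation between order statistics.

    Uses the 0-based position h = (n - 1) * p / 100 in the sorted sample.
    This is one convention among several; packages differ.
    """
    xs = sorted(values)
    h = (len(xs) - 1) * p / 100
    lo = int(h)                       # order statistic just below position h
    hi = min(lo + 1, len(xs) - 1)     # order statistic just above position h
    return xs[lo] + (h - lo) * (xs[hi] - xs[lo])

median = percentile_linear(data, 50)
q1_linear = percentile_linear(data, 25)

# Two other conventions for the first quartile, from the standard library.
q1_inclusive = statistics.quantiles(data, n=4, method="inclusive")[0]
q1_exclusive = statistics.quantiles(data, n=4, method="exclusive")[0]

print(median, q1_linear, q1_inclusive, q1_exclusive)
```

On this small sample the "inclusive" and "exclusive" methods disagree on the first quartile, which is the kind of discrepancy one may see between statistical packages.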

I.24 Sample Median and Percentiles

For our dataset the median is 26. This value does not change if we remove the two entries valued 99. The sample median is a measure of location that is robust to outliers, unlike the sample mean. However, the median also seems to discard a lot of information in comparison with the sample mean. A compromise between the two is the trimmed mean.

Definition: 10% Trimmed Mean

In our example
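The comparison between mean, median and trimmed mean can be sketched as follows. The sample is hypothetical: it is not the dataset from these slides, only built to share the two features mentioned above (median 26, two entries equal to 99). The `trimmed_mean` helper assumes the usual definition, dropping the lowest and highest 10% of the observations before averaging.

```python
import statistics

# Hypothetical sample (NOT the slides' dataset): median 26, two outliers at 99.
data = [18, 20, 21, 22, 23, 24, 25, 25, 26, 26,
        26, 26, 27, 28, 29, 30, 31, 33, 99, 99]

def trimmed_mean(values, percent=10):
    """Mean after dropping floor(percent% of n) points from each end.

    Assumed definition; some texts round the number of dropped points differently.
    """
    xs = sorted(values)
    k = len(xs) * percent // 100      # number of points removed at each end
    kept = xs[k:len(xs) - k]
    return sum(kept) / len(kept)

without_outliers = [x for x in data if x != 99]

mean_all = statistics.mean(data)
mean_clean = statistics.mean(without_outliers)
median_all = statistics.median(data)
median_clean = statistics.median(without_outliers)
tmean = trimmed_mean(data)

print(mean_all, mean_clean)      # the mean moves a lot when the 99s are removed
print(median_all, median_clean)  # the median does not move
print(tmean)                     # the trimmed mean sits close to the median
```

With 20 observations, 10% trimming removes exactly two points at each end, so both outliers are discarded here. With fewer outliers trimmed than present, the trimmed mean would still be pulled upward.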
I.25 Graphical Representations

Especially for large datasets, graphical representations are often much more (qualitatively) informative than numerical summaries. Perhaps the simplest graphical representation is the scatter-plot.

[Figure: scatter-plot of baby weight, in grams]

It is sometimes convenient to jitter the abscissae of the points, so it is easier to see what's going on…

[Figure: jittered scatter-plot of baby weight, in grams]
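Jittering can be sketched without any plotting library: each observation keeps its value and receives a small random offset along the axis that carries no data, so that tied or nearly tied points no longer overlap. The weights, the `width` parameter and the `jitter` helper below are all hypothetical, chosen only for illustration.

```python
import random

def jitter(values, width=0.2, seed=0):
    """Pair each value with a random offset drawn uniformly from [-width, width].

    The offset is meant for the plot axis that carries no data; the seed is
    fixed only so the sketch is reproducible.
    """
    rng = random.Random(seed)
    return [(v, rng.uniform(-width, width)) for v in values]

# Hypothetical baby weights in grams, with repeated values.
weights = [2500, 3100, 3100, 3100, 3400, 3400]
points = jitter(weights)

for value, offset in points:
    print(value, round(offset, 3))
```

The pairs in `points` can then be handed to any plotting routine. The data values themselves are unchanged, so the picture stays faithful along the data axis.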

I.26 Histograms

Scatterplots are still a bit difficult to read. We can get a better view by aggregating the data into bins.

[Figure: histogram of x; horizontal axis x, vertical axis Frequency]
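Aggregating data into bins amounts to counting how many observations fall into each interval. A minimal sketch, assuming equal-width bins and hypothetical weights (the helper name, the bin range and the number of bins are illustrative choices, not taken from the slides):

```python
def histogram_counts(values, lo, hi, n_bins):
    """Count the values falling in each of n_bins equal-width bins on [lo, hi]."""
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        if lo <= v <= hi:
            # A value equal to hi is placed in the last bin.
            idx = min(int((v - lo) / width), n_bins - 1)
            counts[idx] += 1
    return counts

# Hypothetical baby weights in grams.
weights = [1800, 2600, 2900, 3100, 3150, 3300, 3400, 3600, 3900, 4700]
counts = histogram_counts(weights, lo=1500, hi=5000, n_bins=7)

# Crude text rendering of the histogram, one row per 500 g bin.
for i, c in enumerate(counts):
    print(1500 + 500 * i, "#" * c)
```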
I.27 Histograms – Choice of Binning

The choice of the number of bins is a tricky business…

[Figure: three histograms of x (Frequency against x), labelled "Too few!!!", "Too many!!!" and "Just right!!!"]

There are rules-of-thumb for the number of bins that most software will use… You don't need to worry too much (yet)...
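Two widely used rules-of-thumb are Sturges' rule and the Freedman–Diaconis rule. The slides do not say which rule they have in mind, so the sketch below is only an illustration; the function names are hypothetical, and the quartiles use one of several possible conventions.

```python
import math
import statistics

def sturges_bins(n):
    """Sturges' rule: ceil(log2(n)) + 1 bins for a sample of size n."""
    return math.ceil(math.log2(n)) + 1

def freedman_diaconis_bins(values):
    """Freedman-Diaconis rule: bin width 2 * IQR / n**(1/3), converted to a bin count."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    width = 2 * (q3 - q1) / len(values) ** (1 / 3)
    if width == 0:
        return 1  # degenerate sample: all central values are equal
    return math.ceil((max(values) - min(values)) / width)

# Sample size reported on the density plot in these slides (N = 1236).
print(sturges_bins(1236))
```

Sturges' rule depends only on the sample size, while Freedman–Diaconis also looks at the spread of the data, so the two can suggest quite different bin counts for skewed or heavy-tailed samples.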

I.28 Histograms

Actually, if the data can be viewed as independent samples from some continuous distribution, the histogram (after proper normalization) can be interpreted as an estimate of the true underlying density function!!!

[Figure: histogram of y; horizontal axis y, vertical axis Density]

Baby weight: this histogram has a bell-like shape. Is it reasonable to model baby weight as a sample from a normal distribution?
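"Proper normalization" here means dividing each bin count by the sample size times the bin width, so that the bar areas sum to one, as a density must. A minimal sketch with hypothetical weights and equal-width bins (helper name and bin settings are illustrative):

```python
def histogram_density(values, lo, hi, n_bins):
    """Return (densities, width): bin heights scaled so the total bar area is 1."""
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        if lo <= v <= hi:
            counts[min(int((v - lo) / width), n_bins - 1)] += 1
    n = sum(counts)
    # height = count / (n * width), so height * width is the bin's relative frequency
    return [c / (n * width) for c in counts], width

# Hypothetical baby weights in grams.
weights = [1800, 2600, 2900, 3100, 3150, 3300, 3400, 3600, 3900, 4700]
densities, width = histogram_density(weights, lo=1500, hi=5000, n_bins=7)

total_area = sum(d * width for d in densities)
print(densities, total_area)
```

Because the weights are in the thousands of grams, the density values are small (of order 1e-4 per gram), which matches the scale of the Density axis in the figure above.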
I.29 Density Estimators

Histograms are actually a very crude density estimator. There are much better alternatives, like kernel-based estimators. The principle behind all these estimators is still the same – locally averaging data. However, these can be much more accurate than the histogram.

[Figure: density.default(x = y, n = 50000); N = 1236, Bandwidth = 102; vertical axis Density]
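The idea of locally averaging data can be sketched with a Gaussian kernel estimator, which places a small bell curve of standard deviation h (the bandwidth) on each observation and averages them: f̂(x) = (1 / (n h)) Σ K((x − xᵢ) / h). The data and the bandwidth below are hypothetical, and the helper is an illustration rather than the routine used to produce the slide's figure.

```python
import math

def gaussian_kde(sample, bandwidth):
    """Return a function x -> kernel density estimate with a Gaussian kernel."""
    n = len(sample)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        # Average of Gaussian bumps centred at each observation.
        return norm * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in sample
        )

    return density

# Hypothetical baby weights in grams and a hypothetical bandwidth.
weights = [1800, 2600, 2900, 3100, 3150, 3300, 3400, 3600, 3900, 4700]
f_hat = gaussian_kde(weights, bandwidth=300)

for x in (2000, 3000, 4000, 5000):
    print(x, f_hat(x))
```

A larger bandwidth gives a smoother but flatter estimate, and a smaller one follows the individual observations more closely. This plays the same role as the bin width in a histogram.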