Lecture 3: Middle

Last time
• We examined some basic graphical and numerical summaries, essentially covering and extending Chapter 6 of your text. We saw a series of plots, some familiar, some new, but all aimed at helping you read a data set.
• Some plots were designed to illustrate the “structure” of a single variable (symmetry, uni- or multimodality, skew), while others helped us assess association between two variables.

Today
• We will start by taking up our discussion of developing a boxplot for more than a single variable, a graphic to summarize the shape of a 2-dimensional point cloud.
• We will then examine tools for viewing (continuous) data in 2 or more dimensions, spending some time with projections and linked displays.
• We’ll end with some material for your (first) homework assignment. The subject of graphics will not end here, however, in that we’ll also examine spatial (map-based) data as well as text as data later in the term.

Frequency displays
• We began examining some simple graphical devices to display the counts per category for a qualitative variable.
[Figure: bar plot of counts by general health category (excellent, very good, good, fair, poor); count axis from 0 to 6000]

Frequency displays
• And we introduced a new display, a mosaic plot, to exhibit association between two qualitative variables.
[Figure: mosaic plot “BMI and general health”; bmicat (underweight, normal, overweight, obese) against general health (excellent, very good, good, fair, poor)]

Frequency displays
• In the previous case, we discretized a quantitative variable, respondents’ BMI values, into categories and then used a mosaic plot to exhibit possible associations with variables like the respondents’ general health.
• If we didn’t know about the CDC’s categories (obese, overweight, normal, etc.), could we still employ this technique? How would we divide the continuous BMI measure into categories?

Continuous variables
• Frequency displays for continuous variables help us examine the “shape” of a data set. Histograms, for example, function in the same way as barplots after binning the data into (for our purposes, equally sized) intervals.
• We describe these plots with terms like symmetric or skewed, uni- or multimodal. The shape is a story, and opens up questions for further study.
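To make the "barplot after binning" idea concrete, here is a minimal sketch in R. The values are simulated stand-ins (not the survey's BMI data), and the 5-unit bin width is an arbitrary choice made for illustration.

```r
# Simulated stand-in for BMI values (NOT the survey data)
set.seed(1)
bmi <- rgamma(2000, shape = 20, rate = 0.75)   # right-skewed, centered near 27

# Equally sized bins: 5-unit intervals covering the range of the data
breaks <- seq(5 * floor(min(bmi) / 5), 5 * ceiling(max(bmi) / 5), by = 5)

# Binning by hand: assign each value to an interval, then count per interval
bins   <- cut(bmi, breaks = breaks, right = TRUE, include.lowest = TRUE)
counts <- as.vector(table(bins))

# hist() performs the same binning internally (plot = FALSE returns the counts)
h <- hist(bmi, breaks = breaks, plot = FALSE)

# A barplot of the binned counts has the same bar heights as the histogram
barplot(counts, names.arg = levels(bins), las = 2, ylab = "Frequency")
```

Changing `by = 5` to a smaller or larger width is a quick way to see how much the apparent shape depends on the binning.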
[Figure: “Histogram of BMI”; x-axis: BMI (703*weight/height^2), y-axis: Frequency]

Continuous variables
• Last time we informally presented the construction of another kind of graphic, a quantile-quantile plot that allows us to compare a known distributional shape to that of our data.
• In the last lecture, we used the normal distribution as a kind of ruler...
[Figure: “Normal Q-Q Plot”; x-axis: Theoretical Quantiles, y-axis: Sample Quantiles]

Quantiles
• Last time we made a passing reference to quantiles. The qth quantile of a probability distribution is the point $x_q$ such that the area under the density to the left of $x_q$ equals q.
[Figure: standard normal density with the area q to the left of $x_q$ marked]

Sample quantiles
• Similarly, given a data set with n points, $x_1, \ldots, x_n$, we can represent the sorted data as $x_{(1)}, \ldots, x_{(n)}$, where $x_{(1)}$ is the smallest point and $x_{(n)}$ is the largest in our sample, for example.
• Assuming we have no repeat values in $x_1, \ldots, x_n$, it should be clear then that a fraction j/n of the points in our sample are less than or equal to $x_{(j)}$. These are the sample quantiles and you can, with various extension strategies, define the sample quantiles for any q between 0 and 1.
• The Q-Q plot then plots the theoretical j/n quantile from the normal distribution against the sorted data $x_{(1)}, \ldots, x_{(n)}$.
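The construction just described can be sketched by hand in R. The data are simulated for illustration. One detail is an R convention rather than something stated on the slide: using j/n exactly would map the largest point to `qnorm(1) = Inf`, so `ppoints()` shifts the probability levels slightly.

```r
set.seed(2)
x <- rnorm(50, mean = 27, sd = 5)     # simulated stand-in data
n <- length(x)

x_sorted <- sort(x)                   # x_(1), ..., x_(n)

# Probability levels for the sorted points; for n > 10, ppoints(n)
# returns (j - 0.5)/n instead of j/n
p <- ppoints(n)
theoretical <- qnorm(p)               # standard normal quantiles

plot(theoretical, x_sorted,
     xlab = "Theoretical Quantiles", ylab = "Sample Quantiles")

# qqnorm() computes the same coordinates (returned in the order of x)
qq <- qqnorm(x, plot.it = FALSE)
```

If the data follow a normal shape, the points fall close to a straight line whose intercept and slope reflect the mean and standard deviation.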
[Figure: “Normal Q-Q Plot”; x-axis: Theoretical Quantiles, y-axis: Sample Quantiles]

Shape
• Of course, the normal distribution is not our only “ruler” and we often want to compare the data to some other known distribution. On the next slide, we compare the BMI values to the quantiles of the exponential distribution.
• We can also compare two data sets in this way, plotting sorted data sets against each other.
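Both comparisons can be sketched in a few lines of R. The samples below are simulated and of equal size (an assumption that keeps the two-sample case simple; with unequal sizes `qqplot()` interpolates the larger sample).

```r
set.seed(3)
x <- rexp(200, rate = 1 / 5)          # simulated, exponential by construction
y <- rnorm(200, mean = 27, sd = 5)    # simulated, roughly symmetric
n <- length(x)

# Exponential "ruler": swap qnorm() for qexp()
theo_exp <- qexp(ppoints(n))
plot(theo_exp, sort(x),
     xlab = "Theoretical Quantiles", ylab = "Sample Quantiles",
     main = "Exponential Q-Q plot, exponential data")
plot(theo_exp, sort(y),
     xlab = "Theoretical Quantiles", ylab = "Sample Quantiles",
     main = "Exponential Q-Q plot, symmetric data")

# Two data sets of equal size: sorted values of one against the other
qq <- qqplot(x, y, plot.it = FALSE)
plot(qq$x, qq$y, xlab = "sorted x", ylab = "sorted y")
```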
[Figure: “Exponential Q-Q Plot”; x-axis: Theoretical Quantiles, y-axis: Sample Quantiles]

Box plots
• We then considered a cartoon or thumbnail of a distribution to compare data that fall into different groups. The box covers the central 50% of the data (the region between the 25th and 75th percentiles, that is, the quantiles with q=0.25 and q=0.75).
• We also discussed a technique to highlight outlying or “outside” points that are possibly too large (or too small) and might warrant further investigation.
• On the next slide, we present the BMI data again, this time broken down by respondent’s general health. You might compare this display to what we saw a few slides back using mosaic plots.
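As a rough sketch of the pieces of a box plot, the R code below computes the box and the "outside" points by hand and compares them with `boxplot.stats()`. The values are made up, and the cutoff of 1.5 times the box height is R's default (`coef = 1.5`), assumed here because the slide does not state the rule.

```r
x <- c(18, 21, 22, 23, 24, 25, 26, 27, 29, 31, 33, 58)   # made-up values

# The box: lower hinge, median and upper hinge (R's versions of the quartiles)
five   <- fivenum(x)               # min, lower hinge, median, upper hinge, max
hinges <- five[c(2, 4)]
spread <- diff(hinges)             # height of the box

# "Outside" points: beyond 1.5 box heights from the edges of the box
lower_fence <- hinges[1] - 1.5 * spread
upper_fence <- hinges[2] + 1.5 * spread
outside <- x[x < lower_fence | x > upper_fence]

# boxplot.stats() reports the same quantities
bs <- boxplot.stats(x)

# Side-by-side boxes for groups use the formula interface
group <- rep(c("a", "b"), each = 6)
boxplot(x ~ group)
```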
[Figure: box plots of BMI by general health (poor, fair, good, very good, excellent)]

Extensions
• Last time, we considered an extension of the box plot to more than one variable. That is, a display that gives a cartoon of the “spatial” distribution of points.
• What concepts do we have to generalize to do this?
[Figure: scatterplot of desired weight against weight; both axes from 100 to 700]

Extensions: Optimization
• Last time, one of you mentioned capturing the “center” through some notion of distance. Let’s make that a bit concrete.
• Suppose we have a set of n values $x_1, \ldots, x_n$ for a single variable and consider the expression
$$\sum_{i=1}^{n} |z - x_i|$$
• The value of z that minimizes this quantity turns out to be the median!
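A quick numerical check of this claim, as a minimal sketch in R: the five values are made up, and an odd number of points is used so that the median is unique.

```r
x <- c(3, 1, 10, 7, 25)               # made-up values; odd n, so one median

# Sum of absolute deviations as a function of a candidate center z
sad <- function(z) sum(abs(z - x))

# Evaluate the objective on a grid and pick the smallest value
z_grid <- seq(0, 30, by = 0.5)
values <- sapply(z_grid, sad)
z_best <- z_grid[which.min(values)]

plot(z_grid, values, type = "l", xlab = "z", ylab = "sum_i |z - x_i|")
abline(v = median(x), lty = 2)        # the minimum sits at the median
```

The curve is piecewise linear with kinks at the data points, which is why the minimizer lands on a data value here.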
[Figure: $\sum_i |x_i - z|$ plotted against z, for z from 10 to 70]

Extensions: Optimization
• Assuming $|z - x_i|$ is not zero, its derivative (as a function of z) is simply the sign of $z - x_i$. If it is zero, then its derivative from the left is $-1$ and its derivative from the right is $+1$.
• Using this fact, you can show that the expression
$$\sum_{i=1}^{n} |z - x_i|$$
must be convex. If we have an even number of data points (n even) with no repetitions among the $x_1, \ldots, x_n$, then any point strictly between $x_{(n/2)}$ and $x_{(n/2+1)}$ has a zero derivative.

Extensions: Optimization
• We have already seen the interquartile range as a notion of spread in the data: the width of the interval (the height of a box in a box plot) that covers the central 50% of the data.
• There are competing notions for the spread that are based on absolute deviations. For example, the MAD or median absolute deviation is defined to be
$$\mathrm{median}\{\, |x_1 - x_{med}|,\ |x_2 - x_{med}|,\ \ldots,\ |x_n - x_{med}| \,\}$$
• We say that the median and either the IQR or the MAD are “robust” to outlying data. We’ll make that precise in a moment...

Aside: Mean and variance
• The arithmetic mean is another notion of the center of a distribution. We recall that the mean of n values $x_1, \ldots, x_n$ of a variable is simply
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$
• The associated measure of spread is the sample variance. It is the sum of squared deviations from the mean, divided by $n - 1$:
$$s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}$$

Aside: Mean and variance
• If we consider the sum of squared deviations as a function of any point z, then we can show (this time taking derivatives is easy!) that the minimizer of
$$\sum_{i=1}^{n} (x_i - z)^2$$
is the sample mean, $z = \bar{x}$! Go ahead, try it!

Aside: Mean and variance
• While the median and the interquartile range are very direct notions of center and spread, the mean and standard deviation are slightly more delicate. For example, the mean is very much influenced by one or more “extreme points”.
• Why would we expect this? Does the median have the same problem?

Aside: Web visits example
[Figures: four slides of plots for the Web visits example]

Aside: The normal distribution
• As we will see next time, the sample mean and variance are tied up with estimating the parameters of a normal distribution (the population mean and variance). In a modeling context, you certainly wouldn’t propose a normal distribution for the Web visits data!
• John Tukey was one of the first statisticians to call attention to the fact that departures from a normal distribution could hurt the mean and variance. His test case was a “mixture” of two normals.

Aside: Robustness
• Imagine tossing a coin such that 99.2% of the time, you sampled an observation from the normal distribution with mean $\mu$ and variance $\sigma^2$; and 0.8% of the time you selected an observation from a normal with mean $\mu$ and variance $9\sigma^2$.
• This was Tukey’s idea of “contamination”: that there could be some error process that was introducing wild observations at a very low rate, but even with this low rate bad things happened.
• His work from the early 60s (and before, actually) led to a subfield of statistics concerned with the robustness of estimators to departures from assumptions (like the data come from a single normal population).

Aside: Quantiles (again)
• It turns out that we can define quantiles as a minimization problem as well. Let’s define a new function
$$\rho_\tau(u) = |u| + \tau u$$
• For a given level $0 < q < 1$, we can set $\tau = 2q - 1$ and then define the qth quantile to be the value of z that minimizes
$$\sum_{i=1}^{n} \rho_\tau(x_i - z)$$
• At the right we have an example of the function $\rho_\tau(u)$ for q=0.5 (the median) and q=0.9 (the 90th percentile).
[Figure: $\rho_\tau(u)$ against u, two panels: median (q=0.5) and 90th percentile (q=0.9)]
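The minimization can be tried numerically. The sketch below uses simulated data and applies $\rho_\tau$ to $x_i - z$, following the axis label of the slide's figure. Because the objective is piecewise linear and convex in z, a minimum is attained at one of the data points, so the search is restricted to those (a shortcut chosen for this example).

```r
set.seed(4)
x <- rnorm(101, mean = 27, sd = 5)    # simulated stand-in data, n = 101

rho <- function(u, tau) abs(u) + tau * u

# Objective from the slide, with tau = 2q - 1
objective <- function(z, q) sum(rho(x - z, tau = 2 * q - 1))

# Evaluate the objective at every data point and keep the best one
minimize_at_data <- function(q) {
  vals <- sapply(x, objective, q = q)
  x[which.min(vals)]
}

z_med <- minimize_at_data(0.5)        # should match the sample median
z_90  <- minimize_at_data(0.9)        # should match the 91st sorted value

c(z_med, median(x))
c(z_90, sort(x)[91], quantile(x, 0.9, type = 1))
```

With $\tau = 0$ the function reduces to $|u|$, so the q=0.5 case is exactly the sum of absolute deviations from before.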
[Figure: median (q=0.5); $\sum_i |x_i - z|$ plotted against z, for z from 10 to 70]
[Figure: 90th percentile (q=0.9); $\sum_i \left( |x_i - z| + (2 \cdot 0.9 - 1)(x_i - z) \right)$ plotted against z, for z from 10 to 70]

Back to the center of points in space...

Extensions: Optimization
• One way to think about the center of a 2-variable data cloud would be to extend this optimization approach. This was done, for example, in the late 1800s and early 1900s by the U.S. Census.
• One definition of the “center” of the U.S. population involved a hypothetical assembly of all the people in the country. Simply, the median of this spatial distribution of people is the point that minimizes the total distance the population would have to travel to assemble there.

Extensions: Optimization
• In symbols, if the distance between a point on the map z and the spatial coordinate $x_i$ where a person lived (the latitude and longitude, say) is the Euclidean distance $\|x_i - z\|$, then the median is defined as the value of z minimizing
$$\sum_{i=1}^{n} \|x_i - z\|$$
• Notice that when we have univariate data again, the expression above reduces to the sum of absolute deviations and we get back our univariate median!

Extensions: Optimization
• If our data are in the plane (consisting of two measurements per point), $x_i = (x_{i1}, x_{i2})$ and $z = (z_1, z_2)$, then the distance is just
$$\|x_i - z\| = \sqrt{(x_{i1} - z_1)^2 + (x_{i2} - z_2)^2}$$

Extensions: Optimization
• If we have a single measurement per point so that $x_i$ and z are scalars, then
$$\|x_i - z\| = \sqrt{(x_i - z)^2} = |x_i - z|$$
so that the quantity we are minimizing (with respect to z) is again
$$\sum_{i=1}^{n} |x_i - z|$$

Extensions: Optimization
• While we won’t prove it, defined in this way, we can rotate our points in the plane and still come up with the same center or median. This would not be true if we took our center to be the vector consisting of the median of $x_{11}, \ldots, x_{n1}$ (the values of the first coordinate), and the median of $x_{12}, \ldots, x_{n2}$ (the values of the second).
• Why might this kind of invariance be important?

Other metaphors
• The distance approach is just one way to generalize the median or center of the distribution, and, as you might expect, statisticians have had a lot of time to think this concept over.
• We’ll talk about just one other approach because it gives some insights into the structure of data in two (and higher) dimensions. On the next slide we present the associated “bagplot”, an extension of the boxplot.
[Figure: bagplots of desired weight against weight, three panels: men in excellent health, men in very good health, men in good health]

Depth
• Let’s start by thinking about what we’re doing when we define the median for data from a single variable (say just the BRFSS reported weights). Last time we took it to be the point that divides our data into two equal pieces (plus or minus some extra details when we have an even number of points).
• We will first consider the median as the “deepest” location relative to our data set and then consider how to generalize that notion. Again, we do this because it gives us insight into concepts like the median and displays like boxplots.
[Figure: “10 weights from the BRFSS”, shown on a line from 100 to 250, with the median marked]

Depth
• To define the depth of any location on the real line relative to this data set, we count the proportion of points to the left and to the right and define its depth to be the smaller of the two.
• Here are a couple of examples...
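The definition translates almost directly into R. In this sketch the ten weights are made up (they are not the BRFSS values shown on the slides), and a data point sitting exactly at the location is counted on both sides, as the slides do.

```r
x <- c(105, 120, 135, 140, 150, 165, 180, 195, 210, 240)  # ten made-up weights

# Depth of location z: the smaller of the proportions at-or-left and at-or-right
depth_1d <- function(z, x) min(mean(x <= z), mean(x >= z))

depth_1d(200, x)   # 8/10 to the left, 2/10 to the right
depth_1d(135, x)   # a data point: 3/10 to the left, 8/10 to the right

# Depth of every data point; the deepest ones bracket the median
depths  <- sapply(x, depth_1d, x = x)
deepest <- x[depths == max(depths)]
mean(deepest)      # midpoint of the deepest interval
median(x)
```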
[Figure: “10 weights from the BRFSS”; 8/10 of the points lie to the left and 2/10 to the right, so the depth = 0.2]
[Figure: “10 weights from the BRFSS”; 4/10 of the points lie to the left and 6/10 to the right, so the depth = 0.4]
[Figure: “10 weights from the BRFSS”; 3/10 of the points lie to the left and 8/10 to the right, so the depth = 0.3 (we include the data point itself in both the left and right totals)]

Depth
• The median, then, is the location on the real line having the greatest depth. If an “interval” of locations has the greatest depth (as is the case on the previous slides, where every location between and including the 5th and 6th data points has depth 1/2), we take the midpoint of the interval as the median.
• Now, how do we generalize this to two dimensions? How do we generalize the notion of left and right?
[Figures: “20 points from the CDC BRFSS”; desired weight against weight, both axes from 120 to 260, with a line drawn through a chosen location]
• 5/20 points lie above the line and 15/20 lie below.
• 8/20 points lie above the line and 12/20 lie below.
• 1/20 of the points lie above the dark gray line and 19/20 lie below. 1/20 is the smallest proportion we can find, rotating the line through every angle, so the depth = 0.05.
• 3/20 of the points lie above the dark gray line and 18/20 lie below, and this is the smallest proportion we can find by rotating the line through every angle, so the depth = 0.15.
[Figure: desired weight against weight, both axes from 100 to 700, with locations labeled depth = 0.0001, depth = 0.007, depth = 0.07 and depth = 0.33]
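The "rotate the line through every angle" idea can be approximated in R by trying a finite set of angles. This is a sketch under stated assumptions: the function name `halfspace_depth` and the 360-angle grid are choices made here, the five points are made up, and because only finitely many angles are tried the result can overstate the true depth slightly.

```r
# Approximate depth of location z relative to the rows of the matrix pts
halfspace_depth <- function(z, pts, n_angles = 360) {
  angles   <- seq(0, pi, length.out = n_angles + 1)[-(n_angles + 1)]
  centered <- sweep(pts, 2, z)                 # shift so that z is the origin
  props <- sapply(angles, function(a) {
    # signed distance of each point from the line through z at this angle
    side <- centered %*% c(cos(a), sin(a))
    # points on the line are counted on both sides
    min(mean(side >= 0), mean(side <= 0))
  })
  min(props)                                   # smallest proportion found
}

# Made-up example: a center point surrounded by four others
pts <- rbind(c(0, 0), c(1, 0), c(0, 1), c(-1, 0), c(0, -1))

halfspace_depth(c(0, 0), pts)   # the central point is deepest
halfspace_depth(c(1, 0), pts)   # a point on the edge of the cloud
halfspace_depth(c(5, 5), pts)   # a location outside the cloud
```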
A generalization
• We can then define the “depth median” as the deepest location (if it’s unique) or the “center of gravity” of the set of deepest locations if there is more than one (it’s guaranteed to be a closed, bounded and convex set if any of those words speak to you, and it’s not important if not).
• Similarly, we can use depth to define the deepest 50% of the data (essentially), creating a generalization of the box part of the box plot. The “whiskers”, or in this case an outer “loop”, are defined by inflating the middle 50% (default is a factor of 3, again based on simulations) and settling back on the data.
• The authors of the graphic say:
  Like the univariate boxplot, the bagplot also visualizes several characteristics of the data: its location (the depth median), spread (the size of the bag), correlation (the orientation of the bag), skewness (the shape of the bag and the loop), and tails (the points near the boundary of the loop and the outliers).

Scatterplots
• For two continuous variables, we have already introduced a scatterplot as a tool for assessing associations. R provides considerable control over what a plot like this looks like.
• On the next few slides, we change plotting characters as well as the range of the x- and y-axes on the plot of respondents’ desired weight versus their current weight. We also add a reference line with intercept 0 and slope 1 (Why?)

# first plot
plot(cdc$weight, cdc$wtdesire)
# second plot, changing plotting character
plot(cdc$weight, cdc$wtdesire, pch=".")
# third plot, changing range of x- and y-axes
plot(cdc$weight, cdc$wtdesire, pch=".", xlim=c(70,700), ylim=c(70,700))
# fourth plot, adding a reference line with slope 1 and intercept 0
plot(cdc$weight, cdc$wtdesire, pch=".", xlim=c(70,700), ylim=c(70,700))
abline(0,1)
[Figures: scatterplots of cdc$wtdesire and cdc$weight, with the default plotting character, with pch=".", with both axes running to 700, and with the reference line added]

Alterations
• One problem that we’ve ignored with this plot is that it represents 20,000 pairs of values. It’s hard to see that because there is considerable overplotting.
• That is, we have seen already that weight tends to be reported in 5-pound increments, and the same is true for respondents’ desired weights. Inevitably that means that some pairs of values are represented multiple times.
• What can we do to bring these points out?
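To see how much overplotting rounding can cause, here is a small R sketch with simulated stand-ins for the two weights (the real data are not used, and the means, spreads and relationship below are invented). It counts how many distinct pairs remain after rounding to 5-pound increments and then jitters the points apart.

```r
set.seed(5)
n <- 2000
# Simulated stand-ins, rounded to 5-pound increments like the reported weights
weight   <- 5 * round(rnorm(n, mean = 170, sd = 35) / 5)
wtdesire <- 5 * round((0.7 * weight + 45 + rnorm(n, sd = 10)) / 5)

# How many distinct (weight, desired weight) pairs are there,
# and how often does the most common pair occur?
pair_counts <- table(weight, wtdesire)
n_distinct  <- sum(pair_counts > 0)
max_repeat  <- max(pair_counts)
c(points = n, distinct = n_distinct, most_repeated = max_repeat)

# Without jitter, repeated pairs are drawn on top of each other
plot(weight, wtdesire, pch = ".")
# jitter() adds a little random noise so coincident points separate
plot(jitter(weight), jitter(wtdesire), pch = ".")
```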
[Figure: jittered scatterplot of jitter(cdc$weight) and jitter(cdc$wtdesire); axes from 100 to 700]
[Figure: hexagonal binning of cdc$wtdesire against cdc$weight, with a Counts legend running from 1 to 1522]
[Figure: hexagonal binning of cdc$wtdesire against cdc$weight with more bins, with a Counts legend running from 1 to 302]
# first plot
plot(cdc$weight, cdc$wtdesire)
# second plot, changing plotting character
plot(cdc$weight, cdc$wtdesire, pch=".")
# third plot, changing range of x- and y-axes
plot(cdc$weight, cdc$wtdesire, pch=".", xlim=c(70,700), ylim=c(70,700))
# fourth plot, adding a reference line with slope 1 and intercept 0
plot(cdc$weight, cdc$wtdesire, pch=".", xlim=c(70,700), ylim=c(70,700))
abline(0,1)
# final plots, jittering...
plot(jitter(cdc$weight), jitter(cdc$wtdesire), pch=".", xlim=c(70,700), ylim=c(70,700))
# then binning the data using hexbin (needs the hexbin package)...
library(hexbin)
h <- hexbin(cdc$weight, cdc$wtdesire)
plot(h)
# ... or in one go (and changing the number of bins to 200)
plot(hexbin(cdc$weight, cdc$wtdesire, xbins=200))

Scatterplots
• The scatterplot is so ubiquitous that we barely recognize what we’re doing. It seems sensible and almost automatic and, hence, practically invisible.
• While this kind of plot is great for two dimensions, what do we do if we have three variables? Four variables? Twenty? What are our options now?

Vulnerability
• The underlying question here is interesting and relevant (they usually are, for what it’s worth). Here we are interested in understanding how climate change (and the accompanying increase in extreme weather events) will affect different parts of the world.
• Specifically, the researchers produce a model that relates variables capturing some notion of vulnerability to the impacts that weather-related natural disasters have had, country by country.

Results
The first stage of our analysis was to estimate statistical models of losses from climate-related disasters, based on a set of climatic and socioeconomic variables that will likely change over time, which appear in Table 1. The dependent variables are logged values of the number of people per million of national population killed or affected, respectively, by droughts, floods, or storms over the period 1990–2007. The variable number of disasters is the logged value of numbers reported by each country over the same period, and accounts for climate exposure; estimated coefficient values greater than 1 in both models indicate that average losses per disaster are higher in more disaster-prone countries. We expected that larger countries are likely to experience disasters over a smaller proportion of their territory or population, and also benefit from potential economies of scale in their disaster management infrastructure, both resulting in lower average per capita losses; the negative coefficient estimates for the variable national population in both models are consistent with this expectation. The variable HDI represents the Human Development Index, a United Nations (UN) indicator comprised of per capita income, average education and literacy rates, and average life expectancy at birth. Recent studies of disaster losses, not limited to climate-related events, have shown that countries with medium HDI values experience the highest average losses, whereas countries with high HDI values experience the lowest (14, 15). We therefore included the logged HDI values in quadratic form. Negative coefficient estimates for both HDI and HDI^2 in both models are thus consistent with these expectations, given that logged HDI values are always negative, and the square of the logged values are in turn positive. Finally, we considered several additional socioeconomic variables not directly captured by HDI, and found only two that improved model fit. For the model of the number of people killed, the positive coefficient estimate for female fertility indicates that countries with higher birth rates experience greater average numbers of deaths. We do not take this to mean that there is a direct connection between fertility and natural hazard deaths, but rather that higher birth rates are associated with lower female empowerment, and lower female empowerment is associated with higher disaster vulnerability, as has been shown previously (16, 17). For the model of the number of people affected, the negative coefficient estimate for the proportion urban population is consistent with urban residents being less likely to require post-disaster assistance than rural residents, also observed previously (18, 19). Both models yield an R^2 statistic slightly greater than 0.5, indicating that variance in the independent variables explains just over half of the variance in the numbers killed and affected. This is consistent with results from past analyses based on similar data and methods (8–10).

Vulnerability
• In the end, a great deal of attention is paid to a regression table (below), the form of which we should be fairly familiar with.
• In each row they present the regression of the logarithm of the number of people killed by weather-related natural disasters from 1990 to 2007 as a function of several predictors, one of which is slightly special...

Looking at the data
• The data we were given consist of measurements associated with 144 different countries. For each we have the following variables:
  • country_name: the name of the country
  • ln_events: the natural logarithm of the number of droughts, floods and storms occurring in the country from 1990-2007
  • ln_pop: the natural logarithm of the country’s population
  • ln_fert: the natural logarithm of an estimate of the country’s female fertility
  • hdi: the Human Development Index for the country
  • death_risk: the number of people per 1M of population killed in droughts, floods and storms (stored in logged form as ln_death_risk)
• There are four predictor variables (if you count HDI and its square as one), which, while not big by any stretch of the imagination, is complex enough to keep us from “seeing” the whole data set.
• Instead, we might opt for partial views...

# the first few countries...
> head(vul)
  country_name ln_events   ln_fert       hdi   ln_pop ln_death_risk
1      Albania  2.302585 1.2383740 0.7530000 4.006120     0.7102835
2      Algeria  3.496508 1.5993880 0.7025000 6.283885     0.8961845
3       Angola  3.044523 1.9459100 0.4460000 5.556056     0.2246880
4    Argentina  3.637586 1.0116010 0.8525001 6.483515     1.1036180
5      Armenia  1.386294 0.7654679 0.7380000 3.976562     2.3671240
6    Australia  4.394449 0.7654679 0.9480000 5.837925     1.0504330
# ... and the last few countries
> tail(vul)
           country_name ln_events  ln_fert    hdi   ln_pop ln_death_risk
139           Venezuela  2.995732 1.335001 0.7810 6.072891     4.2457150
140            Viet Nam  4.595120 1.504077 0.7025 7.248978     1.9736860
141               Yemen  3.091043 1.994700 0.4735 5.808543     0.7734824
142 Zaire/Congo Dem Rep  2.944439 1.887070 0.4010 6.846872     1.3174430
143              Zambia  2.564949 1.871802 0.4365 5.222156     2.4495670
144            Zimbabwe  2.484907 1.704748 0.5630 5.360353     0.3104967

# order according to death_risk
> tail(vul[order(vul$ln_death_risk),])
    country_name ln_events  ln_fert    hdi   ln_pop ln_death_risk
93     Nicaragua  3.135494 1.589235 0.6735 4.499810      3.717359
55         Haiti  3.784190 1.568616 0.5080 5.033049      3.860935
10    Bangladesh  4.836282 1.547562 0.5000 7.828728      4.111300
139    Venezuela  2.995732 1.335001 0.7810 6.072891      4.245715
56      Honduras  3.496508 1.686399 0.6765 4.701086      4.938241
88       Myanmar  2.772589 1.398717 0.5830 6.688770      5.118413
> max(vul$ln_events)
[1] 5.948035
> exp(max(vul$ln_events))
[1] 383
[Figure: “Histogram of vul$hdi”; x-axis: vul$hdi, y-axis: Frequency]
[Figure: “Normal Q-Q Plot”; x-axis: Theoretical Quantiles, y-axis: Sample Quantiles]
[Figure: scatterplot of vul$ln_fert against vul$hdi]
[Figure: scatterplot of vul$ln_death_risk against vul$hdi]

Scatterplot matrix
• A scatterplot lets us examine pairs of variables. We can stack the plots for all
possible pairs of variables in the form of a matrix, a scatterplot matrix.
• On the next slide, we plot all but the country names. Tell me something about the
associations you see...
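In R, a display of this kind can be drawn with pairs(). The sketch below assumes a simulated stand-in for `vul`: the column names match the lecture's table, but the values are generated at random, and `ln_fert` is deliberately built to fall as `hdi` rises so that there is an association to see.

```r
# Synthetic stand-in for `vul`; the values are simulated, not the lecture's data.
set.seed(1)
n <- 50
hdi <- runif(n, 0.3, 0.95)
toy <- data.frame(
  country_name  = paste0("country", seq_len(n)),
  ln_events     = rnorm(n, 3, 1),
  ln_fert       = 2.5 - 2 * hdi + rnorm(n, sd = 0.2),  # built to fall as hdi rises
  hdi           = hdi,
  ln_pop        = rnorm(n, 6, 1),
  ln_death_risk = rnorm(n, 0, 2)
)

# Drop the name column (column 1) and draw every pairwise scatterplot.
num <- toy[, -1]
pdf(NULL)              # null device, so the sketch also runs non-interactively
pairs(num)
invisible(dev.off())

# A numerical companion to the picture: the matrix of pairwise correlations.
cors <- cor(num)
print(round(cors, 2))
```

In an interactive session, leave out the pdf(NULL) and dev.off() lines and the matrix appears in a plot window. The panel in row i and column j plots variable i against variable j, so each pair shows up twice, once on either side of the diagonal.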
[Figure: scatterplot matrix of ln_events, ln_fert, hdi, ln_pop and ln_death_risk]

• Aside: Parallel coordinates plot
• As with the multivariable median, statisticians have had decades to think about
visualizing many variables at one time. Here, however, we haven’t been as
successful in coming up with definitive tools.
• The next plot views each variable as a vertical axis and then represents the values
from a single country as a broken line running between these axes.
[Figure: parallel coordinates plots with vertical axes ln_events, ln_fert, hdi, ln_pop, ln_death_risk]

• Multidimensional data
• Although I’m not sure exactly who first did this, or in what context, we can
map each row of a data table to a point in Euclidean space. If you think about it,
this is a big conceptual leap.
• So in the case of the vulnerability data, we have 5 quantities recorded for each
country (well, 6 if you consider its name, but we’ll leave that out for now, viewing it
more as an index than a measured quantity).
• That means each row of data can be thought of as a point in 5-dimensional
Euclidean space. The scatterplot matrix represents a series of projections of the
data into two dimensions.
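To see why a panel of the scatterplot matrix counts as a projection, we can write one out as a matrix product. This is a sketch under assumptions: the matrix `X` holds simulated values rather than the vulnerability data, and the names `P` and `shadow` are ours.

```r
# Synthetic stand-in for the five numeric columns of `vul` (simulated values).
set.seed(2)
X <- matrix(rnorm(20 * 5), nrow = 20,
            dimnames = list(NULL, c("ln_events", "ln_fert", "hdi",
                                    "ln_pop", "ln_death_risk")))

# Each row of X is one point in 5-dimensional space.
# An axis-aligned projection onto the (ln_fert, hdi) plane keeps two coordinates.
# Written as a matrix product, it uses columns 2 and 3 of the 5 x 5 identity.
P <- diag(5)[, c(2, 3)]
shadow <- X %*% P

# The product is the same as simply selecting those two columns.
same <- all.equal(unname(shadow), unname(X[, c("ln_fert", "hdi")]))
print(same)
```

Each off-diagonal panel of the scatterplot matrix corresponds to one such choice of two identity columns, which is what makes these projections axis-aligned.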
[Figure: axes labelled ln_death_risk, ln_fert and hdi]

• Multidimensional data
• We can carry this idea further and examine two-dimensional projections of our data
set that are not “axis-aligned” as in the scatterplot matrix. We can consider casting
shadows of the data when viewed from different angles.
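One way to cast such a shadow is to pick two perpendicular unit directions at random and record each point's coordinates along them. The sketch below assumes simulated points in place of the vulnerability data, and using a QR decomposition to build the plane is our choice for illustration, not something the slides specify.

```r
set.seed(3)
X <- matrix(rnorm(20 * 5), nrow = 20)   # simulated points in 5 dimensions

# Draw a random plane: orthonormalize two random directions with a QR decomposition.
A <- qr.Q(qr(matrix(rnorm(5 * 2), nrow = 5)))   # 5 x 2 with orthonormal columns

# Coordinates of each point's "shadow" on that plane.
shadow <- X %*% A

pdf(NULL)              # null device, so the sketch also runs non-interactively
plot(shadow, xlab = "direction 1", ylab = "direction 2")
invisible(dev.off())

# Check that the two viewing directions are orthonormal (A'A is the 2 x 2 identity).
gram <- crossprod(A)
print(round(gram, 10))
```

Changing the seed gives a different viewing angle. Replacing `A` with two columns of the identity matrix recovers an axis-aligned view.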
[Figure: point cloud]

• Multidimensional data
• For those of you who have had linear algebra (or for whom this material still feels
familiar), we can make this projection idea rigorous. As a graphical tool, projections
let us turn the data over and examine it for structures that might not be obvious from
axis-aligned projections.
• One approach to this idea is the so-called grand tour. We select a series of
random directions from which to view the data and then move smoothly between
them, interpolating the motion.
• There is an excellent tool for doing this...

• Linked displays
• GGobi also implements linking between the displays, allowing you to highlight data
in one window and examine where it appears in other plots. On the next page we
“brush” a histogram and see where the points fall on a scatterplot.
• Linking displays in this way is a powerful way to examine the dependence structure
in your data and to examine outliers...
• And GGobi implements the grand tour...

• Projections
• As you watch the projections dance across the screen, we are scanning for
directions that are “interesting”, providing us with a view into the clustering or
grouping of data that might not be immediately evident otherwise.
• It turns out (a consequence of the Central Limit Theorem) that most of these projected
views of the data will be “uninteresting” in that they will look roughly like a bivariate
normal distribution.
• This, then, becomes one possible definition of “uninteresting”, and we can score
views by how dissimilar they are from this distribution. In the late 1970s and early
1980s, this led to a statistical technique known as projection pursuit.

• Time series
• Finally, your book presents time series plots, that is, plots of a single variable as a
function of time (in quality control or process monitoring applications, these kinds of
plots are exceedingly common).
• The next few pages are plots of some basic statistics about the BRFSS as it has
evolved in time. These are mostly token figures and we’ll be making much more
extensive use of time series plots later in the quarter.
[Figure: BRFSS state summary by year]
(x-axis: year; y-axis: participating states)
[Figure: BRFSS question summary by year (x-axis: year; y-axis: total questions)]
[Figure: BRFSS respondent summary by year (x-axis: year; y-axis: total respondents)]

• Finally, a prelude to your homework...

• The registrar
• The registrar maintains a record of every class you take; in addition to recording
which classes, it publishes a catalog of when classes meet and how many people
were enrolled.
• On the next page, we present a few lines from a data file we will eventually consider
in lab; it was provided by the registrar (at a cost of $85) and contains the schedules
for every student on campus last quarter.*
• In all, we have 162380 separate rows in this table, each corresponding to one student
in a single class, with 31981 students in total. What can we learn from these data?
And, more importantly, how?

*Note that the identification number in this table is not your student ID, or
even part of it, but a random number generated to replace your real ID.
   id        subj_area_cd cat_sort MEET_BLDG_CD meet_rm_sort strt_time end_time DAYS_OF_WK_CD career_cd
1  816640632 ANTHRO       0009     HAINES       00314        10:00:00  10:50:00 M             U
2  816640632 ANTHRO       0009     FOWLER       A00103B      11:00:00  12:15:00 TR            U
3  816640632 GEOG         0005     HAINES       00039        13:00:00  14:15:00 MW            U
4  816640632 ENGCOMP      0003     HUMANTS      A00046       09:30:00  10:45:00 TR            U
5  816640632 GEOG         0005     BUNCHE       A00170       11:00:00  12:50:00 M             U
6  816643648 MGMT         0403     GOLD         B00313       09:30:00  12:45:00 S             G
7  816643648 MGMT         0405     GOLD         B00313       14:00:00  17:15:00 S             G
8  816577472 COMM ST      0187     PUB AFF      01222        09:30:00  10:45:00 TR            U
9  816577472 COMM ST      0168     ROYCE        00362        17:00:00  19:50:00 M             U
10 816577472 COMM ST      0133     DODD         00175        10:00:00  10:50:00 MWF           U
12 806029941 EDUC         0491     KAUFMAN      00153        17:00:00  19:50:00 W             G
13 806029941 EDUC         0330D    FIELD                     08:00:00  14:50:00 MTWRF         G
14 821748664 ANTHRO       0007     HAINES       00039        09:00:00  09:50:00 MWF           U
15 821748664 SPAN         0120     FOWLER       A00139       15:30:00  16:50:00 MW            U
16 821748664 SPAN         0120     HUMANTS      A00046       11:00:00  11:50:00 R             U
17 821748664 WOM STD      0107C M  HAINES       A00025       14:00:00  15:50:00 TR            U
18 821748664 ANTHRO       0007     HAINES       00350        12:00:00  12:50:00 R             U
19 820969784 ENGR         0180     BOELTER      02444        18:00:00  18:50:00 M             U
20 820969784 EL ENGR      0115AL   ENGR IV      18132        12:00:00  15:50:00 T             U
21 820969784 EL ENGR      0115A    ROLFE        01200        08:00:00  09:50:00 MW            U
22 820969784 EL ENGR      0115A    BOELTER      05280        09:00:00  09:50:00 F             U
23 820969784 STATS        0105     PAB          02434        15:00:00  15:50:00 R             U
24 820969784 STATS        0105     FRANZ        02258A       12:00:00  12:50:00 MWF           U
25 820969784 ENGR         0180     BOELTER      02444        16:00:00  17:50:00 MW            U
26 821030697 GEOG         0005     HAINES       00039        13:00:00  14:15:00 MW            U

• The registrar
• At the end of your career here, you receive a transcript, an aggregate of all the data
the registrar has on you (but printed in some reduced format). I’d like you to
consider these data and what we might learn from them when they are aggregated
across all the students at UCLA.
• “Learning” here might involve recoding variables, sometimes reshaping the data set
entirely, to give us a different view (changing, say, the unit of observation, which here
is a “student in a class” or an “enrollment event”).
• This is part of your first homework assignment and is due by Friday. I will create
data sets based on your suggestions and you are going to apply your data summary
and visualization skills to tell me something!
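As one illustration of changing the unit of observation, the sketch below collapses rows of the form "student in a class" into one row per student. It uses the first seven rows of the sample shown earlier and only some of their columns. The data frame names `sched` and `per_student` and the summaries computed are our own choices, offered as one possibility rather than the intended homework answer.

```r
# First seven rows of the sample table (a subset of its columns).
sched <- data.frame(
  id           = c(816640632, 816640632, 816640632, 816640632, 816640632,
                   816643648, 816643648),
  subj_area_cd = c("ANTHRO", "ANTHRO", "GEOG", "ENGCOMP", "GEOG", "MGMT", "MGMT"),
  strt_time    = c("10:00:00", "11:00:00", "13:00:00", "09:30:00", "11:00:00",
                   "09:30:00", "14:00:00"),
  career_cd    = c("U", "U", "U", "U", "U", "G", "G"),
  stringsAsFactors = FALSE
)

# Change the unit of observation from "student in a class" to "student".
# table() and tapply() both order their results by increasing id.
per_student <- data.frame(
  id         = sort(unique(sched$id)),
  n_rows     = as.vector(table(sched$id)),
  n_subjects = as.vector(tapply(sched$subj_area_cd, sched$id,
                                function(s) length(unique(s)))),
  # Zero-padded HH:MM:SS strings sort correctly as text, so min() works here.
  earliest   = as.vector(tapply(sched$strt_time, sched$id, min)),
  stringsAsFactors = FALSE
)
print(per_student)
```

On the full file the same idea would turn 162380 enrollment rows into 31981 student rows, an average of about 5.1 rows per student.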
Winter '12
Hansen