This preview has intentionally blurred sections. Sign up to view the full version.
View Full DocumentThis preview has intentionally blurred sections. Sign up to view the full version.
View Full DocumentThis preview has intentionally blurred sections. Sign up to view the full version.
View Full DocumentThis preview has intentionally blurred sections. Sign up to view the full version.
View Full Document
Unformatted text preview: Chapter 1 Section 1: Analyzing Categorical Data The value of categorical variables is often used to place variables in certain categories such as
“male" or “female.” The distribution of a categorical variable lists the categories and gives either the
count or the percent of the individuals who fall in each category. Example:
The radio audience rating service Arbitron places the country’s 13,838 radio stations into categories that describe the kinds of programs they broadcast. Here are two different tables showing the distribution of station formats:
W t.
~ Format Percemotstaﬂons
Adult contemporary 112
Mutt standards 8.6
4.1
. 14.9
:2. News/lblkﬂntormation 15.7
, Oldies ‘ 7.7
Religious “.6
Hook 6.3 .. Spanish language 54
,' .. Otherfomlats 11.4 . ' Total ' 99.9
In this case, the individuals are the radio stations and the variable being measured is the kind of
programming that each station broadcasts. The table on the left, which we call a frequency table,
displays the counts (frequencies) of stations in each format category. On the right, we see a relative
frequency table of the data that shows the percents (relative frequencies) of stations in each format category. It is a good idea to check data for consistency. The counts should add to 13,838, the total number of
stations. They do. The percent should add to 100%. In fact, they add to 99.9%. What happened?
Each percent is rounded to the nearest tenth. The exact percents would add to 100, but the rounded
percents only come close. This is a roundoff error. Roundoff errors do not point to he mistakes of our work, just the effect of rounding off results. Bar Graphs and Pie Charts Columns of numbers can take time to read, so the distribution of the categorical data can be
displayed by a pie chart or bar graph. ’ siioe occupiers 14.9%TJf the pie
beans: 14.9% of the radio stations
um a‘Cuuuu'y" fmml. Contemporary hit This bar has height [49%
becmm: 14.9% 0! stationis
fit the “Country" format. Percent Mentions
G 1.2 '.‘ E f : ‘rv‘ v . tr .. ,
  x . . . , a '
vyuoryﬁcycfi$$y&§:§¢dy (931;; 0‘6}
(b) Radio station formal nouns Lt (a) Pie chart and (b) bar graph of US. radio stations by format. Do the data tell you what you want to know? Let’s say that you plan to buy radio time to advertise your Web site for downloading MP3 music ﬁles.
How helpful are the data in the above ﬁgure? it is not very helpful. You are not interested in counting
stations, but in counting listeners. For example, 14.6% of all stations are religious, but they only have
a 5.5% share of the radio audience, according to Arbitron. In fact, you are not even interested in the
entire radio audience, because MP3 users are mostly young people. You really want to know what
kinds of radio stations reach the largest numbers of young people. Always think about whether the
data you help answer your questions. Pie charts show the distribution of a categorical variable as a “pie" whose slices are sized by
the counts or percents for the categories. A pie chart must include all the categories to make up a
whole. In the radio station example, we need the "Other Formats" category to complete the whole (all
radio stations) and allow us to make the pie chart. Use a pie chart only when you want to emphasize
each category’s relation to the whole. Pie charts are awkward to make by hand, but technology will
do the job for you. Bar graphs (or bar charts) represent each category as a bar. The bar heights show the
categorical counts or percents. Bar graphs are easier to make than pie charts and are also easier to
read. To convince yourself, try to use the pie chart in the previous figure to estimate the percent of
radio stations that have an “Oldies” format. Now look at the bar graph — it is easy to see that the
answer is about 8%. Bar graphs are also more flexible than pie charts. Both graphs can display the distribution of a
categorical variable, but a bar graph can also compare any set of quantities that are measured in the
same units. Example: Who owns an MP3 Player?
Portable MP3 players, such as the Apple iPod. are popular — but not equally popular with all age
groups. Here are the percents of people in various age groups who own a portable MP3 player,
according to an Arbitron survey of 1112 randomly selected people. Age group (year) Percent owning an MP3 player
12w ‘
mm ‘—
“—
55mm. —_ a) Let's make a well labeled bar graph of the information. 9m 0% M93 Owws an; Ara/L C'lekPS . 2‘3”“: ‘agnIEIWIﬁigﬁ ' ‘IEIIIII :33
""EIE.£EE:"“"E:.EEE§§EE§  . m: V'¥\< . . I . I . ' III
' .1 :1 r ‘ I IIIEEEIIIIIIIIIIIE . IIIIIIIIII
3 q V .IIIIIII==I IiIIIIIIIIIII I IIIIIII
 ‘ VAHVAEIEI:IIII IIII..IIIIIII III
' ' I I I
_ y, “III:
I!_I_= IE: I MEWIIE=III IIIﬂ.
I. I IWWIII III III II
Ia—rl \sra“ 25—3ﬁ “'54 5 5 4 ‘ ‘ r C c g ’\ \z b) Would it be appropriate to make a pie chart for these data? Why or why not? J
X fax/F 5mm. “M \HA at Me M93 OM is M
" ”\>\¢ A” onQ (‘05? \aa qffﬁgna‘rt 40 05.: 4 \de Cxﬂf’rk» TwoWay Tables and Marginal Distributions We have seen some techniques for analyzing the distribution of a single categorical variable.
What do we do when a data set involves two categorical variables? The following table will be used
to analyze various distributions. Example: I’m Gonna Be Rich!!
A survey of 4826 randomly selected young adults (aged 19 to 25) asked, “What do you think are the
chances you will have much more than a middleclass income at age 30?” The table below shows
the responses, omitting a few people who refused to respond or who said they were already rich. Young adults by gender and chance of getting rich
Opinion Female Male Total Almost no n E_E__ Some chance but probably not 426 m—
A 5050 chance Accce cnencc m_1421 .
Almost certain 1083 20? . 1] Total 2367 2459 4826 «emu Stoz
This is a twoway table because it describes two categorical values, gender and opinion about
becoming rich. Opinion is the row variable. Gender is the column variable. The entries in the table are the counts of individuals in each opinionby—gender class. To get a better grasp in the information in the table, we can look at the distribution of each
variable separately. The distributions of opinion alone and gender alone are marginal distributions
because they appear at the right and bottom of the twoway table. Deﬁnition: Marginal distribution
The marginal distribution of one of the categorical variables in a twoway table of counts is the distribution of values of that variable among all individuals described by the table. Percents are more often more informative than counts, especially when we are comparing
groups of different sizes. We can display the marginal distributions of opinions in percents by dividing
each row total by the table total and converting to a percent. For instance, the percent of these young
adults who think they are almost certain to be rich by age 30 is almost certain total ___ 1083 = 0.224 ___ 224% table total 4826 Young adults by gender and chance of getting rich
Opinion Female Male Total mum“W
_mV
—a——m.
__——2«. 7.
was 24 .47. 2367 2459 4826 _
'44‘57 7 6i,0/, a) Use the data in the twoway table to calculate the marginal distribution (in percents) of opinions. b) Make a graph to display the marginal distributions. Describe what'you see. M ”x ‘ l I II IIIII ll. IT“... 1'“: II
2:: Innllnlullulul Ilunllgulgglgii II
33* Eli!Iiiiiiﬂiﬂlﬂiﬁiﬂﬁ‘hﬂi IE:
355% “a~"!~‘5‘~9~"“ EEE‘JHEEEEEEEEi'ﬁi Organizing a Statistical Problem
As you learn more about statistics you will be asked to solve more complex problems. Although no single strategy will work on every problem, it might be helpful to have a general
framework for organizing your thinking. Here is a fourstep process you can follow. How to Organize a Statistical Problem: A FourStep Process State: What is the question that you are trying to answer? Plan: How will you go about answering the question? Do: Make graphs and carry out needed calculations. Conclude: Give our oractical conclusion in the settin of the realworld Example: Based on the survey data, can we conclude that young men and women differ in their opinions about
the likelihood of future wealth? Give appropriate evidence to support your answer. Follow the four
step process. §L§t§ 2 ~
T5 ‘3 l a g‘ﬂaua. ' ll\ 6‘: 1‘ (u on
‘l‘o 9A 53101“ FAA—gr: ““5sz = Pla n:
U.) { ‘\U<\\ («\conaxf W Coc a. \«Hov‘c\
1 5.. fl VC ‘1)?! C
OW‘ (\dxﬂ'lrf‘. '5. :'\ q
(w an cm; A mcpw ' :3 r' w J J :17
‘3\ Ac _, ‘L;y . 3‘.) ' ‘._ ..( ‘24 ber'n. ‘ A; ‘ \x" \Y’JE
4 kt R5°Vi5 r 4:
.EE+EE
53!! iii: a
. 
I a In [,7 I.
Eﬁyﬁﬁ
[fir:25 an ll
: ‘l
>1§§3l ‘. ,1;
6
ﬂ \ ‘ II
ilii' ‘ "ﬁiiiﬁr’im
“ﬁyﬁﬁﬂf
ssirg ks! HalII
IIIIIIIIIH
: I
u
I?“
\gams .V a
533$!
n§ ‘ ::v I!
iﬁﬂigﬁﬁ 1 s§§§§ “IIII J: I
*3 QC! .4 \‘=§;
\\“ i‘ 3e
: ‘53
mg! 1 §i§§§
‘!!%ﬁ
““3 g: A Relationships between Categorical Variables: Conditional Distributions The twoway table contains much more information than the two marginal distributions of
opinion alone and gender alone. Marginal distributions tell us nothing about the relationship between
the two variables To describe the relationship between two categorical variables, we must calculate
some well chosen percents from the counts given in the body of ~ »   ~ , ,
the table. We can study the opinions of women alone by looking
only at the “Female” column in the twoway table. To ﬁnd the
percent of young women who think they are almost certain to be
rich by age 30, divide the count of such women by the total
number of women, the column total: W=ﬂi=oz=os 205%
column total 2367 Doing this for all ﬁve entries in the “Female column gives
the conditional distribution of opinion among women. See
the table to the right. We use the term “conditional” because ' 
this distribution describes only young adults who satisfy the condition that they are female. Deﬁnition: Conditional Distribution
A conditional distribution of a variable describes the values of that variable among individuals who have a speciﬁc value of another variable. There is a separate conditional distribution for each value
of the other variable. Example:
Calculate the conditional probability of opinion among men. Conditional Distribution of opinion among men Conclude: . \
A CoorS‘ta 5n; . \OY'sk ‘3 bar 63W?“ wt. (‘90 51a ,kal Pesraks hen/ﬂ $64126 {\X» “‘3‘; °fi 1“" ‘t n m \S‘T V“ Gr \ “V3“? . "" {Wm
TR 0V\Y\\ m" W\Ls (Lg ‘5" D f \‘C\\ I\‘na. “LL:1 T“: r' a 'a , ‘ ‘” K: o?
be"“3 0. 50—30 . , I. g (A 9, \N‘; « km?” l \ i—c‘r.~.“.\£s in. W “ 7 . — > _"’“~“‘L ' We could have also used a segmented bar graph to
compare the distributions of male and female responses in
the previous example. The ﬁgure to the right shows the
completed graph. Each bar has ﬁve segments — one for
each of the opinion categories. It is fairly difﬁcult to compare
the percents of males and females in each category because
the “middle” segments in the two bars start at different
locations on the vertical'axis. The sideby—side bar graph we
created makes the comparison easier. Both graphs provide evidence of an association
between gender and opinion about future wealth in this
sample of young adults. That is, the values of one variable
(opinion) tend to occur more or less frequently combination
with speciﬁc values of the other variable (gender). Men
more often rated their chances of becoming rich in the two
highest categories; women said “some chance but probably not" much more frequently. Can we say
that there is an association between gender and opinion in the population of young adults? Making
this determination requires formal inference, which will have to wait a few chapters. Deﬁnition: Association
We say that there is an association between two variables if speciﬁc values of one variable tend to occur in common with speciﬁc values of the other. There is one caution that we need to offer: even a strong association between two categorical
variables can be inﬂuenced by other variables lurking in the background. The Data Exploration that
follows gives you a chance to explore this idea using a famous data set. Class of Travel Survived
First Class 140 W 71 Second Class
Third Class
1:3 (, (‘H 9". ‘H é (is) In the movie Titanic there was a suggestion that: ' “I“
 Firstclass passengers received special treatment in boarding the lifeboats while some other
passengers were prevented from doing so (especially thirdclass passengers).  Women and children boarded the lifeboats ﬁrst, followed by the men. 1) What do the data tell us about the e two suggestions? \, 4; A ‘r r}; 2.1 M ,;.\,o\ “A ” W '
15‘ w a m S ’ .W‘A, R. N g, » (law a . .2Q,r\f5,x:,.r5 “~ »
‘ 5:" :. , “5 . ‘ ' g k . _,‘
030‘s! . A” be ;‘ \(\\5\W U“ or} ,k we win) , C r N a. new: 2) How does gender affect the relationship between class and travel and survival status? Explain. ‘F‘ (2')? .Z ‘ _ . NE, 1,.33\,\ A" <5 1‘, but“ Print" \ \H L \Y A: V :7} “ ‘1 f 0‘ 5
3‘31 ax“. w an r n 5A <' i 2.: s c‘ \\>\ \r<\‘ °\ ix) Lu W. \ r 3202” l
\G («A 3A. ( ": \n'r ff Surm u (~\ \ 3,. A—g; ’ ...
View
Full Document
 Fall '15
 Mr. Sanchez
 Statistics, Probability, Chapter 1 Notes, Analyzing Categorical Data

Click to edit the document details