Chapter 1 Section 1: Analyzing Categorical Data

The values of a categorical variable place individuals into categories, such as “male” or “female.” The distribution of a categorical variable lists the categories and gives either the count or the percent of the individuals who fall in each category.

Example: The radio audience rating service Arbitron places the country’s 13,838 radio stations into categories that describe the kinds of programs they broadcast. Here are two different tables showing the distribution of station formats:

[Frequency table: count of stations by format. Not legible in this copy.]

Format                    Percent of stations
Adult contemporary        11.2
Adult standards            8.6
Contemporary hit           4.1
Country                   14.9
News/Talk/Information     15.7
Oldies                     7.7
Religious                 14.6
Rock                       6.3
Spanish language           5.4
Other formats             11.4
Total                     99.9

In this case, the individuals are the radio stations and the variable being measured is the kind of programming that each station broadcasts. The table on the left, which we call a frequency table, displays the counts (frequencies) of stations in each format category. On the right, we see a relative frequency table of the data that shows the percents (relative frequencies) of stations in each format category.

It is a good idea to check data for consistency. The counts should add to 13,838, the total number of stations. They do. The percents should add to 100%. In fact, they add to 99.9%. What happened? Each percent is rounded to the nearest tenth. The exact percents would add to 100, but the rounded percents only come close. This is a roundoff error. Roundoff errors do not point to mistakes in our work; they are just the effect of rounding off results.

Bar Graphs and Pie Charts

Columns of numbers can take time to read, so the distribution of a categorical variable can be displayed with a pie chart or a bar graph.

[Figure annotations: This slice occupies 14.9% of the pie because 14.9% of the radio stations fit the “Country” format. This bar has height 14.9% because 14.9% of stations fit the “Country” format.]
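The consistency check on the percents described above can be sketched in a few lines of Python. This is only an illustration: the dictionary name `percents` and the helper `rounded_percents` are made up here, and the three-category toy example is not from the text. It is included to show how exact percents that add to 100 can add to 99.9 once each is rounded to the nearest tenth.

```python
# Percent of stations in each format, copied from the relative frequency table.
percents = {
    "Adult contemporary": 11.2,
    "Adult standards": 8.6,
    "Contemporary hit": 4.1,
    "Country": 14.9,
    "News/Talk/Information": 15.7,
    "Oldies": 7.7,
    "Religious": 14.6,
    "Rock": 6.3,
    "Spanish language": 5.4,
    "Other formats": 11.4,
}

# Round the sum to one decimal place to hide floating-point noise.
total = round(sum(percents.values()), 1)
gap = round(100 - total, 1)
print(total, gap)  # 99.9 0.1


def rounded_percents(counts, digits=1):
    """Convert counts to percents, rounding each one separately."""
    grand_total = sum(counts)
    return [round(100 * c / grand_total, digits) for c in counts]


# Toy illustration (hypothetical counts, not from the text):
# three equal categories are exactly 33.33...% each, which add to 100,
# but the rounded values are 33.3 each, which add to only 99.9.
toy = rounded_percents([1, 1, 1])
print(toy, round(sum(toy), 1))  # [33.3, 33.3, 33.3] 99.9
```

A gap of a tenth or two in either direction is what rounding alone would produce; a much larger gap would suggest an arithmetic or data-entry mistake instead.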
[Figure: (a) Pie chart and (b) bar graph of U.S. radio stations by format. The bar graph plots percent of stations against radio station format.]

Do the data tell you what you want to know? Let’s say that you plan to buy radio time to advertise your Web site for downloading MP3 music files. How helpful are the data in the above figure? They are not very helpful. You are not interested in counting stations, but in counting listeners. For example, 14.6% of all stations are religious, but they only have a 5.5% share of the radio audience, according to Arbitron. In fact, you are not even interested in the entire radio audience, because MP3 users are mostly young people. You really want to know what kinds of radio stations reach the largest numbers of young people. Always think about whether the data you have help answer your questions.

Pie charts show the distribution of a categorical variable as a “pie” whose slices are sized by the counts or percents for the categories. A pie chart must include all the categories that make up a whole. In the radio station example, we need the “Other formats” category to complete the whole (all radio stations) and allow us to make the pie chart. Use a pie chart only when you want to emphasize each category’s relation to the whole. Pie charts are awkward to make by hand, but technology will do the job for you.

Bar graphs (or bar charts) represent each category as a bar. The bar heights show the category counts or percents. Bar graphs are easier to make than pie charts and are also easier to read. To convince yourself, try to use the pie chart in the previous figure to estimate the percent of radio stations that have an “Oldies” format. Now look at the bar graph: it is easy to see that the answer is about 8%. Bar graphs are also more flexible than pie charts.
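As a rough illustration of why bar lengths are easy to read, here is a minimal text-only bar graph in Python built from the relative frequency table of station formats. The function name `text_bar_graph` and the choice of two `#` characters per percentage point are assumptions made for this sketch; a plotting library would normally do this job.

```python
# Percent of stations in each format, from the relative frequency table.
percents = {
    "Adult contemporary": 11.2,
    "Adult standards": 8.6,
    "Contemporary hit": 4.1,
    "Country": 14.9,
    "News/Talk/Information": 15.7,
    "Oldies": 7.7,
    "Religious": 14.6,
    "Rock": 6.3,
    "Spanish language": 5.4,
    "Other formats": 11.4,
}


def text_bar_graph(data, scale=2):
    """Return one line per category; bar length is proportional to the percent."""
    width = max(len(name) for name in data)
    lines = []
    for name, pct in data.items():
        bar = "#" * round(pct * scale)  # scale = characters per percentage point
        lines.append(f"{name:<{width}} | {bar} {pct}%")
    return lines


for line in text_bar_graph(percents):
    print(line)
```

Because every bar starts at the same baseline, the “Oldies” bar can be compared directly with its neighbors, which is much harder to do with pie slices.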
Both graphs can display the distribution of a categorical variable, but a bar graph can also compare any set of quantities that are measured in the same units.

Example: Who owns an MP3 player?

Portable MP3 players, such as the Apple iPod, are popular, but not equally popular with all age groups. Here are the percents of people in various age groups who own a portable MP3 player, according to an Arbitron survey of 1112 randomly selected people.

[Table: age group (years) and percent owning an MP3 player. The entries are not legible in this copy.]

a) Let’s make a well-labeled bar graph of the information.

[Hand-drawn bar graph of the percent of MP3 owners by age group.]

b) Would it be appropriate to make a pie chart for these data? Why or why not?

[Handwritten answer not legible.]

Two-Way Tables and Marginal Distributions

We have seen some techniques for analyzing the distribution of a single categorical variable. What do we do when a data set involves two categorical variables? The following table will be used to analyze various distributions.

Example: I’m Gonna Be Rich!!

A survey of 4826 randomly selected young adults (aged 19 to 25) asked, “What do you think are the chances you will have much more than a middle-class income at age 30?” The table below shows the responses, omitting a few people who refused to respond or who said they were already rich.

Young adults by gender and chance of getting rich

Opinion                          Female    Male    Total
Almost no chance                     --      --       --
Some chance but probably not        426      --       --
A 50-50 chance                       --      --       --
A good chance                        --      --     1421
Almost certain                       --      --     1083
Total                              2367    2459     4826

(Entries shown as -- are not legible in this copy.)

This is a two-way table because it describes two categorical variables, gender and opinion about becoming rich. Opinion is the row variable. Gender is the column variable. The entries in the table are the counts of individuals in each opinion-by-gender class.

To get a better grasp of the information in the table, we can look at the distribution of each variable separately. The distributions of opinion alone and gender alone are marginal distributions because they appear in the right and bottom margins of the two-way table.

Definition: Marginal distribution
The marginal distribution of one of the categorical variables in a two-way table of counts is the distribution of values of that variable among all individuals described by the table.

Percents are often more informative than counts, especially when we are comparing groups of different sizes. We can display the marginal distribution of opinion in percents by dividing each row total by the table total and converting to a percent. For instance, the percent of these young adults who think they are almost certain to be rich by age 30 is

almost certain total / table total = 1083 / 4826 = 0.224 = 22.4%

[The two-way table is repeated here with handwritten marginal percents; the annotations are not legible.]

a) Use the data in the two-way table to calculate the marginal distribution (in percents) of opinions.

b) Make a graph to display the marginal distribution. Describe what you see.

[Hand-drawn graph not legible.]

Organizing a Statistical Problem

As you learn more about statistics, you will be asked to solve more complex problems. Although no single strategy will work on every problem, it might be helpful to have a general framework for organizing your thinking. Here is a four-step process you can follow.
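The marginal-distribution calculation above (each margin total divided by the table total) can be sketched in Python using only the totals that appear in the two-way table. The function name `marginal_percents` is an assumption for this sketch, not something from the text.

```python
def marginal_percents(totals, digits=1):
    """Convert margin totals (row or column totals) to percents of the table total."""
    table_total = sum(totals.values())
    return {
        label: round(100 * count / table_total, digits)
        for label, count in totals.items()
    }


# Column totals from the bottom margin of the two-way table.
gender_totals = {"Female": 2367, "Male": 2459}
gender_marginal = marginal_percents(gender_totals)
print(gender_marginal)  # {'Female': 49.0, 'Male': 51.0}

# The worked example from the text: the "almost certain" row total
# divided by the table total.
almost_certain = round(100 * 1083 / 4826, 1)
print(almost_certain)  # 22.4
```

The same function gives the marginal distribution of opinion when it is passed the five row totals from the right margin instead of the column totals.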
How to Organize a Statistical Problem: A Four-Step Process

State: What is the question that you are trying to answer?
Plan: How will you go about answering the question?
Do: Make graphs and carry out needed calculations.
Conclude: Give your practical conclusion in the setting of the real-world problem.

Example: Based on the survey data, can we conclude that young men and women differ in their opinions about the likelihood of future wealth? Give appropriate evidence to support your answer. Follow the four-step process.

State: [handwritten work not legible]

Plan: [handwritten work not legible]

Do: [hand-drawn graph not legible]

Relationships between Categorical Variables: Conditional Distributions

The two-way table contains much more information than the two marginal distributions of opinion alone and gender alone. Marginal distributions tell us nothing about the relationship between the two variables. To describe the relationship between two categorical variables, we must calculate some well-chosen percents from the counts given in the body of the table.

We can study the opinions of women alone by looking only at the “Female” column in the two-way table. To find the percent of young women who think they are almost certain to be rich by age 30, divide the count of such women by the total number of women, the column total:

(women who answered “almost certain”) / (column total of 2367) = 0.205 = 20.5%

Doing this for all five entries in the “Female” column gives the conditional distribution of opinion among women. See the table to the right.
We use the term “conditional” because this distribution describes only young adults who satisfy the condition that they are female.

Definition: Conditional distribution
A conditional distribution of a variable describes the values of that variable among individuals who have a specific value of another variable. There is a separate conditional distribution for each value of the other variable.

Example: Calculate the conditional distribution of opinion among men.

[Table: conditional distribution of opinion among men. Handwritten entries not legible.]

Conclude: [handwritten work not legible]

We could have also used a segmented bar graph to compare the distributions of male and female responses in the previous example. The figure to the right shows the completed graph. Each bar has five segments, one for each of the opinion categories. It is fairly difficult to compare the percents of males and females in each category because the “middle” segments in the two bars start at different locations on the vertical axis. The side-by-side bar graph we created makes the comparison easier.

Both graphs provide evidence of an association between gender and opinion about future wealth in this sample of young adults. That is, the values of one variable (opinion) tend to occur more or less frequently in combination with specific values of the other variable (gender). Men more often rated their chances of becoming rich in the two highest categories; women said “some chance but probably not” much more frequently.

Can we say that there is an association between gender and opinion in the population of young adults? Making this determination requires formal inference, which will have to wait a few chapters.
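Comparing conditional distributions, as described above, can be sketched in Python. Because most cells of the survey table are not legible in this copy, the counts below are hypothetical toy numbers chosen only to show the mechanics; the function name `conditional_distribution` and the labels “Group A”, “Group B”, “Yes”, and “No” are likewise assumptions for illustration.

```python
def conditional_distribution(table, column, digits=1):
    """Percent of each row category among the individuals in one column.

    `table` maps each row label to a dict of column counts.
    """
    column_total = sum(row[column] for row in table.values())
    return {
        label: round(100 * row[column] / column_total, digits)
        for label, row in table.items()
    }


# Hypothetical two-way table of counts (NOT the survey data from the text).
toy_table = {
    "Yes": {"Group A": 30, "Group B": 10},
    "No": {"Group A": 20, "Group B": 40},
}

cond_a = conditional_distribution(toy_table, "Group A")
cond_b = conditional_distribution(toy_table, "Group B")
print(cond_a)  # {'Yes': 60.0, 'No': 40.0}
print(cond_b)  # {'Yes': 20.0, 'No': 80.0}
```

In this toy table the two conditional distributions differ sharply (60% versus 20% answering “Yes”), which is the kind of evidence of association within a sample that the side-by-side bar graph displays. If the two distributions were nearly identical, the sample would show little or no association.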
Definition: Association
We say that there is an association between two variables if specific values of one variable tend to occur in common with specific values of the other.

There is one caution that we need to offer: even a strong association between two categorical variables can be influenced by other variables lurking in the background. The Data Exploration that follows gives you a chance to explore this idea using a famous data set.

[Table: survival status by class of travel (first class, second class, third class). Most entries are not legible in this copy.]

In the movie Titanic there was a suggestion that:
- First-class passengers received special treatment in boarding the lifeboats while some other passengers were prevented from doing so (especially third-class passengers).
- Women and children boarded the lifeboats first, followed by the men.

1) What do the data tell us about these two suggestions?

[Handwritten answer not legible.]

2) How does gender affect the relationship between class of travel and survival status? Explain.

[Handwritten answer not legible.]

...