Lecture 1 Slides Handout.pdf - The Economist

The Economist

Scenario: You’re an analyst for The Economist. Your editor pairs you up with a journalist to write a piece about the perception of corruption and its relationship to the human development index. It is your job to inform the analytic direction of the article and to provide the data visualizations that will accompany the article.

Statistics: Unlocking the Power of Data Lock5

What Can We Do?
Which questions would you like to answer based on the data you have seen?

What Information is Available?
What can you all tell me about the data below?
CPI - Corruption Perception Index
HDI - Human Development Index
HDI.Rank - Low HDI implies low rank in the HDI

As the analyst
⚫ How do you begin analyzing the data?
⚫ Which questions can be answered with the data?
⚫ How would you answer those questions?
⚫ How do you know your results are valid and meaningful?
⚫ How would you explain the results to:
   ⚪ A technical peer?
   ⚪ A non-technical stakeholder?

Examples

Which Tools to Use?
What is the relationship between the human development index and the corruption perception index?
- Hypothesis Testing
- Modeling
Does HDI vary by region?
- In which regions does it vary?
- How much does it vary?

Do the Results Mean Anything?
- What is a p-value and what does it mean?
- Are my standard errors accurate?
- Are any assumptions, such as normality, met?

Course Expectations
Visualizations and Intro to Modeling
⚫ Assess and explore a new dataset.
⚫ Determine appropriate questions.
⚫ Differentiate between the questions you want to answer and the questions you can answer.
⚫ Determine which statistical methods can be applied to the available data.
⚫ Interpret the results of a given statistical method.
⚫ Understand the limitations of interpretation.
⚫ Explain your analysis and choices to others.
⚫ Gain data literacy skills to help you in the future.
⚫ Understand the importance of data ethics and the role of bias in data science.

Outline
Section 1.1 The Structure of Data
⚫ Data
⚫ Cases and variables
⚫ Categorical and quantitative variables
⚫ Explanatory and response variables

Why Statistics?
⚫ Statistics is all about data
   ⚪ Collecting data
   ⚪ Describing data: summarizing, visualizing
   ⚪ Analyzing data
⚫ Using data to answer a question
Data are everywhere! Regardless of your field, interests, lifestyle, etc., you will almost definitely have to make decisions based on data, or evaluate decisions someone else has made based on data.

Data
⚫ Data are a set of measurements taken on a set of individual units
⚫ Usually data is stored and presented in a dataset, comprised of variables measured on cases

Intro Statistics Survey Data

The Economist Data

Cases and Variables
We obtain information about cases or units. A variable is any characteristic that is recorded for each case.
⚫ Generally each case makes up a row in a dataset, and each variable makes up a column

Kidney Cancer
Counties with the highest kidney cancer death rates
Source: Gelman et al., Bayesian Data Analysis, CRC Press, 2004.

Kidney Cancer
If the values in the kidney cancer dataset are rates of kidney cancer deaths, then what are the cases?
(a) The people living in the US
(b) The counties of the US
Answer: (b) The counties of the US. A person either has kidney cancer or doesn’t… a rate must apply to a group of people, such as a county.

Kidney Cancer
If the values in the kidney cancer dataset are yes/no, then what are the cases?
(a) The people living in the US
(b) The counties of the US
Answer: (a) The people living in the US. A person either has kidney cancer or doesn’t. Yes/no doesn’t make sense for a county.

Categorical versus Quantitative
⚫ Variables are classified as either categorical or quantitative:
   • A categorical variable divides the cases into groups
   • A quantitative variable measures a numerical quantity for each case

Variables
For each of the following situations:
⚪ What are the variables?
⚪ Is each variable categorical or quantitative?
2. Can eating a yogurt a day cause you to lose weight?
3. Do males find females more attractive if they wear red?
4. Does louder music cause people to drink more beer?
5. Are lions more likely to attack after a full moon?

Hollywood Movies
Do movies that are comedies tend to get higher audience ratings than movies that are dramas?

In a dataset to answer this question, what are the cases?
(a) Comedies
(b) Dramas
(c) Movies
(d) Audience ratings
Answer: (c) Movies. We are collecting data about movies, so the cases are the movies.

In a dataset to answer this question, how many variables are there?
(a) 1
(b) 2
(c) 3
(d) 4
(e) 5
Answer: (b) 2. There are two variables: whether the movie is a comedy or a drama, and what the audience rating is for the movie.

In a dataset to answer this question, how many of the variables are categorical?
(a) 0
(b) 1
(c) 2
Answer: (b) 1. Whether the movie is a comedy or a drama is categorical.

In a dataset to answer this question, how many of the variables are quantitative?
(a) 0
(b) 1
(c) 2
Answer: (b) 1. Audience rating is quantitative.

Explanatory and Response
⚫ Sometimes we are interested in one variable. Other times we are interested in the relationship between two variables.
⚫ If we are using one variable to help us understand or predict values of another variable, we call the former the explanatory variable and the latter the response variable
⚫ Examples:
   ⚪ Does meditation help reduce stress?
   ⚪ Does sugar consumption increase hyperactivity?

Variables
For each of the following situations:
⚪ Which is the explanatory and which is the response variable?
2. Can eating a yogurt a day cause you to lose weight?
3. Do males find females more attractive if they wear red?
4. Does louder music cause people to drink more beer?
5. Are lions more likely to attack after a full moon?

What do you want to know?
We’ll do a class survey, collecting data you are interested in.
⚫ What do you want to know about your peers?
⚫ What are the variables? (one or two?)
⚫ Are they categorical or quantitative?
⚫ Is there an explanatory and response variable?

What do you want to know?
Write a question to measure each variable of interest. Write questions so the resulting data will be accurate and easy to analyze.
⚫ Quantitative variable? Give units.
⚫ Categorical variable? Give the possible categories (no more than 5).
⚫ Be clear and specific.

Summary
⚫ Data are everywhere, and pertain to a wide variety of topics
⚫ A dataset is usually comprised of variables measured on cases
⚫ Variables are either categorical or quantitative
⚫ Data can be used to provide information about essentially anything we are interested in and want to collect data on!
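
Example (sketch): cases, variables, and variable types
The Hollywood Movies question can be laid out as a small dataset in Python: each case (a movie) is one row, and each variable is one column. This is only a sketch. The titles and ratings below are made up for illustration, and the helper names `variable_type` and `mean_rating_by_genre` are hypothetical, not part of the course materials.

```python
# Hypothetical data: each dict is one case (a movie), each key is a variable.
# Titles and ratings are invented for illustration only.
movies = [
    {"title": "Movie A", "genre": "Comedy", "audience_rating": 71},
    {"title": "Movie B", "genre": "Drama", "audience_rating": 80},
    {"title": "Movie C", "genre": "Comedy", "audience_rating": 64},
    {"title": "Movie D", "genre": "Drama", "audience_rating": 58},
]


def variable_type(values):
    """Call a variable quantitative if every value is a number, else categorical."""
    numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    return "quantitative" if numeric else "categorical"


# The two variables needed to answer the comedy-versus-drama question.
variables = ["genre", "audience_rating"]
types = {name: variable_type([case[name] for case in movies]) for name in variables}


def mean_rating_by_genre(cases):
    """Average the response (rating) within each group of the explanatory variable."""
    groups = {}
    for case in cases:
        groups.setdefault(case["genre"], []).append(case["audience_rating"])
    return {genre: sum(r) / len(r) for genre, r in groups.items()}


print(types)
print(mean_rating_by_genre(movies))
```

Note that checking whether values are numbers is a rough rule of thumb: a category stored as a number code (for example, a zip code) would be mislabeled, so the variable type should always be decided from what the variable means.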

Outline
Section 1.2 Sampling from a Population
⚫ Sample versus Population
⚫ Statistical Inference
⚫ Sampling Bias
⚫ Simple Random Sample
⚫ Other Sources of Bias

Sample versus Population
A population includes all individuals or objects of interest.
A sample is all the cases that we have collected data on (a subset of the population).
Statistical inference is the process of using data from a sample to gain information about the population.

The Big Picture
[Diagram: Population -> (Sampling) -> Sample -> (Statistical Inference) -> Population]

Most Important to You
Which of the following is most important to you?
a) Athletics
b) Academics
c) Social Life
d) Community Service
e) Other

Most Important to You
⚫ Suppose researchers studying student life use the results of our clicker question to investigate what students find important
⚫ What is the sample?
⚫ What is the population?
⚫ Can the sample data be generalized to make inferences about the population? Why or why not?

Dewey Defeats Truman?
⚫ The paper was published before the conclusion of the 1948 presidential election, and was based on the results of a large telephone poll which showed Dewey sweeping Truman
⚫ However, Harry S. Truman won the election
⚫ What went wrong?

Sampling Bias
Sampling bias occurs when the method of selecting a sample causes the sample to differ from the population in some relevant way.
⚫ If sampling bias exists, we cannot trust generalizations from the sample to the population

Can you avoid sampling bias?
⚫ The next slide shows Lincoln’s Gettysburg Address. The entire population, all words in his address, will be shown to you. What is the average word length?
⚫ Your task: Select a sample of 10 words that resemble the overall address. Write them down.
⚫ Calculate the average number of letters for the words in your sample
⚫ Enter your 10 random words into this sheet (paste in Zoom chat)

Lincoln’s Gettysburg Address
“Four score and seven years ago our fathers brought forth, on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal.

Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived and so dedicated, can long endure. We are met on a great battle-field of that war. We have come to dedicate a portion of that field, as a final resting place for those who here gave their lives that that nation might live. It is altogether fitting and proper that we should do this.

But, in a larger sense, we can not dedicate—we can not consecrate—we can not hallow—this ground. The brave men, living and dead, who struggled here, have consecrated it, far above our poor power to add or detract. The world will little note, nor long remember what we say here, but it can never forget what they did here. It is for us the living, rather, to be dedicated here to the unfinished work which they who fought here have thus far so nobly advanced. It is rather for us to be here dedicated to the great task remaining before us—that from these honored dead we take increased devotion to that cause for which they here gave the last full measure of devotion—that we here highly resolve that these dead shall not have died in vain—that this nation, under God, shall have a new birth of freedom—and that government of the people, by the people, for the people, shall not perish from the earth.”

GOAL: Select a sample that is similar to the population, only smaller

Can you avoid sampling bias?
⚫ Actual average??
⚫ We need a better way…

Random vs Non-Random Sampling
⚫ Random samples have averages that are centered around the correct number
⚫ Non-random samples may suffer from sampling bias, and averages may not be centered around the correct number
⚫ Only random samples can truly be trusted when making generalizations to the population!

Simple Random Sample
In a simple random sample, each unit of the population has the same chance of being selected, regardless of the other units chosen for the sample
⚫ More complicated random sampling schemes exist, but will not be covered in this course

Random Sampling
⚫ Before the 2008 election, the Gallup Poll took a random sample of 2,847 Americans. 52% of those sampled supported Obama
⚫ In the actual election, 53% voted for Obama
⚫ Random sampling is a very powerful tool!!!

Random Sampling
How can we make sure to avoid sampling bias?
Take a RANDOM sample!
⚫ Imagine putting the names of all the units of the population into a hat, and drawing out names at random to be in the sample
⚫ More often, we use technology
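
Example (sketch): drawing a simple random sample with technology
One way to "draw names from a hat" with technology is Python's standard `random.sample`, which selects units without replacement so that every unit has the same chance of being chosen. The sketch below is an illustration under stated assumptions, not the course's own tool: to keep it short, the population is only the first sentence of the address rather than the full text, the seed and the number of repetitions are arbitrary, and the names `word_lengths` and `srs_mean` are hypothetical.

```python
import random
import statistics
import string

# Assumption: the population is the first sentence of the address only,
# an excerpt chosen to keep this sketch short.
EXCERPT = (
    "Four score and seven years ago our fathers brought forth, on this "
    "continent, a new nation, conceived in Liberty, and dedicated to the "
    "proposition that all men are created equal."
)


def word_lengths(text):
    """Return the number of letters in each word, ignoring punctuation."""
    words = [w.strip(string.punctuation) for w in text.split()]
    return [len(w) for w in words if w]


population = word_lengths(EXCERPT)
population_mean = statistics.mean(population)


def srs_mean(lengths, n, rng):
    """Mean word length in a simple random sample of n words (no replacement)."""
    return statistics.mean(rng.sample(lengths, n))


rng = random.Random(1)  # fixed seed so the sketch is repeatable
sample_means = [srs_mean(population, 10, rng) for _ in range(2000)]

print(f"population mean word length: {population_mean:.2f}")
print(f"average of 2000 sample means: {statistics.mean(sample_means):.2f}")
```

Any single random sample of 10 words can land above or below the population average, but over many repetitions the sample averages center on it, which is the property described on the Random vs Non-Random Sampling slide.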
⚫ People are TERRIBLE at selecting a good sample, even when explicitly trying to avoid sampling bias!

Realities of Sampling
⚫ While a random sample is ideal, often it isn’t feasible. A list of the entire population may not be available, or it may be impossible or too difficult to contact all members of the population.
⚫ Sometimes, your population of interest has to be altered to something more feasible to sample from. Generalization of results is limited to the population that was actually sampled from.
⚫ In practice, think hard about potential sources of sampling bias, and try your best to avoid them

Non-Random Samples
Suppose you want to estimate the average number of hours that students spend studying each week. Which of the following is the best method of sampling?
(a) Go to the library and ask all the students there how much they study
(b) Email all students asking how much they study, and use all the data you get
(c) Give a clicker question in this class and force every student to respond
(d) Stand outside the student center and ask everyone going in how much they study
All are flawed!
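
Example (sketch): why the library sample is flawed
A small simulation can show how option (a) goes wrong. Everything here is an assumption made for illustration: the study hours are generated from a made-up model, and the chance of finding a student in the library is assumed to grow in proportion to the hours they study. No real survey data is used.

```python
import random
import statistics

rng = random.Random(7)  # fixed seed so the sketch is repeatable

# Hypothetical population: weekly study hours for 5,000 students (made-up model).
population = [max(0.0, rng.gauss(15, 6)) for _ in range(5000)]
true_mean = statistics.mean(population)

# Assumption: the chance that a student is in the library when the surveyor
# visits is proportional to their study hours (capped at 1).
library_sample = [h for h in population if rng.random() < min(1.0, h / 40)]
library_mean = statistics.mean(library_sample)

# A simple random sample of the same size, for comparison.
srs = rng.sample(population, len(library_sample))
srs_mean = statistics.mean(srs)

print(f"true mean:           {true_mean:.1f} hours")
print(f"library sample mean: {library_mean:.1f} hours")
print(f"random sample mean:  {srs_mean:.1f} hours")
```

Under these assumptions the library sample overstates the average, because heavy studiers are more likely to be included, while the simple random sample of the same size stays close to the population average. Making the library sample larger would not remove the bias.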

Bad Methods of Sampling
⚫ Sampling units based on something obviously related to the variable(s) you are studying
   ⚪ Sampling only students in the library when asking how much they study, or sampling only students taking a statistics class
   ⚪ “Today’s Poll” on fitnessmagazine.com asked “Have you ever hired a personal trainer?” 27% of respondents said “yes”. Can we infer that 27% of all humans have hired a personal trainer?

Bad Methods of Sampling
⚫ Letting your sample be comprised of whoever chooses to participate (volunteer bias)
   ⚪ Emailing or mailing the entire population, and then making conclusions about the population based on whoever chooses to respond
   ⚪ Example: An airline emails all of its customers asking them to rate their satisfaction with their recent travel
⚫ People who choose to participate or respond are probably not representative of the entire population

Alcohol, Marijuana, and Driving
⚫ The Federal Office of Road Safety in Australia conducted a study on the effects of alcohol and marijuana on performance
⚫ Volunteers who responded to advertisements for the study on rock radio stations were given a random combination of the two drugs, then their performance was observed
⚫ What is the sample? What is the population?
⚫ Is there sampling bias?
⚫ Will the results be informative and/or do you think the study is worth conducting?
Source: Chesher, G., Dauncey, H., Crawford, J. and Horn, K., “The Interaction between Alcohol and Marijuana: A Dose Dependent Study on the Effects of Human Moods and Performance Skills,” Report No. C40, Federal Office of Road Safety, Federal Department of Transport, Australia, 1986.

Data Collection and Bias
[Diagram: Population -> (Sampling bias?) -> Sample -> (Other forms of bias?) -> DATA]

Other Forms of Bias
⚫ Even with a random sample, data can still be biased, especially when collected on humans
⚫ Other forms of bias to watch out for in data collection:
   ⚪ Question wording
   ⚪ Context
   ⚪ Inaccurate responses
   ⚪ Many other possibilities: examine the specifics of each study!

Question Wording
⚫ “Do you think the US should allow public speeches against democracy?” 21% said speeches should be allowed
⚫ “Do you think the US should not forbid public speeches against democracy?” 39% said speeches should not be forbidden
Source: Rugg, D. (1941). “Experiments in wording questions,” Public Opinion Quarterly, 5, 91-92.

Question Wording
A random sample was asked: “Should there be a tax cut, or should money be used to fund new government programs?”
Tax Cut: 60%   Programs: 40%
A different random sample was asked: “Should there be a tax cut, or should money be spent on programs for education, the environment, health care, crime-fighting, and military defense?”
Tax Cut: 22%   Programs: 78%

Context
Having Children
If we were to run the question all by itself in the newspaper with a request for responses...