Statistics 231

• Course web-site is on ANGEL
• These slides will be posted on the ANGEL site, along with the course information sheet
• Please feel free to email me questions, which I will answer in the next class
• Course notes are available at Pixel planet.

Assessment

• Four assignments (20%), two midterms (30%), and final (50%)
• You must pass the final in order to pass the course.

Cheating and Academic Discipline

• Cheating on assignments and projects includes copying another student's solution and submitting it as your own, allowing another student to copy your solution, or collaborating excessively with another student.
• Plagiarism is the act of presenting the ideas, words or other intellectual property of another as one's own. The use of other people's work must be properly acknowledged and referenced in all written material such as take-home examinations, essays, laboratory reports, work-term reports, design projects, statistical data, computer programs and research results.
• The standard penalty for cheating or plagiarism on an assignment or project is as follows: no marks for the assignment and a deduction of 5% from the final course grade.

Cheating and Academic Discipline

• It is permissible, and indeed desirable, to discuss assignment solution methods with classmates, TAs, and instructors.
• You should work through the solution yourself and write it in your own words. The only exceptions are assignments or projects which the instructor designates as 'group' activities.
• In academic work, it is customary to acknowledge, in writing, all sources of help. We require that, for each assignment or project submitted, you write (and sign) an acknowledgement of help received, which includes the names of the people (if any) with whom you discussed your solutions.

Prerequisites

• This course uses many of the tools and ideas of STAT 230 and any of this material can be used during assessment.
• This first section of the notes gives a very brief review of the required material.
• The first assignment will also help you revise this material.

Chapter 1: Data Analysis

• Section 1.1 of the notes is a review of probability from STAT 230 which is used in this course.
• You need to work through this section yourself and refer back to it as needed.
• Questions on this material will appear in Assignments, Midterms and the Final.

Chapter 1: Data Analysis

In this section we look at data in various forms. Let us first look at three definitions of data:

• Data: facts given from which others may be inferred. (Chambers Dictionary)
• Data: Things given or granted, something known or assumed as fact or made the basis of reasoning or calculation. (Shorter English Dictionary)
• Data: The set of measurements from an experiment. (Statistics textbook)

Statistics

We can also give a number of definitions of statistics:

• It is the science of making decisions in the face of uncertainty.
• Finding patterns and structure in data.
• Empirical problem solving.

This brings in a final aspect: the information we have is almost always incomplete, so we are forced to make decisions without knowing in advance exactly what the outcome of the decision will be.

Data Types

1. Discrete data. This is numerical data which comes either in whole numbers or has been counted. Examples include numbers of courses passed, numbers of cars parked in a car park, number of cases of AIDS in a population.
2. Continuous data. This is numerical data which is a real number. In general it has been measured. Examples include height, weight, voltage in a circuit. Note that most often this data is only recorded to a certain accuracy.
3. Categorical data. This is non-numerical and the data has been chosen from some set of categories which are often pre-determined. Examples include month of birth, name, marital status.
4. Binary data. This is categorical data which has only two categories. The answers are often 'Yes' or 'No'. Examples include responses to the question: Are you vegetarian?
5. Ordinal data. Any data which has an underlying order is ordinal. Often this is numerical data but not always. Examples include height, numbers of items produced by a factory. Differences though might not be interpretable.
6. Grouped or frequency data. This is data which has been recorded in the form of the numbers of observations of particular categories. Examples include the numbers of men and women in this class, and the number of Pure Maths, Act. Sci. etc. students.
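
As a small illustration of the last data type, raw categorical observations can be turned into grouped (frequency) data by counting how often each category occurs. The sketch below uses only the Python standard library; the list of responses is invented for illustration and is not course data.

```python
from collections import Counter

# Hypothetical binary responses to the question "Are you vegetarian?"
responses = ["No", "Yes", "No", "No", "Yes", "No"]

# Grouped (frequency) data: the number of observations in each category.
frequencies = Counter(responses)

for category, count in sorted(frequencies.items()):
    print(category, count)
```

The individual responses are binary data; the resulting counts per category are frequency data.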

Notation

• All the above are called datatypes.
• A collection of data is called a dataset.

Examples

A dataset on the distribution of resources around the world.

Name         Deaths       Infant   Male life   Female life   G.N.P.   Country
             /1000 pop.   mort.    expt'cy     expt'cy                Group
Albania       5.7         30.8     69.6        75.5            600    1
Bulgaria     11.9         14.4     68.3        74.7           2250    1
Czech...     11.7         11.3     71.8        77.7           2980    1
E. Germany   12.4          7.6     69.8        75.9              *    1
...

Review Keywords: Data, Discrete, Continuous, Categorical, Binary, Ordinal, Datatypes, Dataset

Transformations on data

• A transformation is called a translation and scale, or affine, if it is in the following form: y = Ax + B
• Affine maps include the transformations £ → $, Fahrenheit to Centigrade.
• Given categorical data, a transformation to numerical data is called coding.
• An example of this would be to code the month of birth in the following way: January = 1, February = 2, March = 3 etc.
• Ranking. For ordinal data, ordering the data from smallest to largest. The associated position is called the rank of the data.
• If the data is {1.2, 3.4, 5.2, −0.3, 0.2, 10.3} the rank transformation will be {1.2, 3.4, 5.2, −0.3, 0.2, 10.3} → {3, 4, 5, 1, 2, 6}.

Monotone transformations

• A general transformation F is called monotone increasing if the ranks of {x_1, x_2, x_3, ..., x_n} are the same as those of {F(x_1), F(x_2), F(x_3), ..., F(x_n)}.
• If the ranks are reversed it is called monotone decreasing.
• A translation and scale transformation x → Ax + B is monotone increasing if A is positive and monotone decreasing if A is negative.

Log transformations

The (natural) log transformation is often very useful. Note that x must be positive for log(x) to exist. Then

log(xy) = log(x) + log(y)
log(x/y) = log(x) − log(y)
log(x^n) = n log(x)

It is also often a useful fact that log is a monotone increasing transformation.

Review Keywords: Scaling, Translation, Affine map, Coding, Ranking, Monotone increasing transformation, Logs

Statistical Method - A Systematic Approach

• Problem: A clear statement of what we are trying to learn
• Plan: The procedures we use to carry out the study
• Data: The data are collected according to plan
• Analysis: The data are summarized and analyzed to answer the questions posed
• Conclusion: Conclusions are drawn about what has been learned

Problem

• Very often statistics is concerned with making statements about certain populations of individuals.
• These can be populations of people, companies, types of virus, .... These individuals will be referred to as units.
• Any characteristic of a single unit is called a variate.
• The problem can usually be defined in terms of functions defined on these populations; such functions are called attributes.
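
Looking back at the transformation slides (affine maps, ranking, monotone maps and logs), the definitions can be checked numerically. The following is a minimal sketch using the ranking example from the slides; the helper names `ranks` and `affine`, the particular values of A and B, and the pair x, y are illustrative assumptions, and `ranks` assumes there are no tied values.

```python
import math

def ranks(data):
    """Return the rank (1 = smallest) of each value, assuming no ties."""
    order = sorted(data)
    return [order.index(x) + 1 for x in data]

def affine(data, a, b):
    """Apply the translation and scale map x -> a*x + b to every value."""
    return [a * x + b for x in data]

data = [1.2, 3.4, 5.2, -0.3, 0.2, 10.3]
r = ranks(data)
print(r)  # [3, 4, 5, 1, 2, 6], as on the ranking slide

# A > 0: monotone increasing, so the ranks are unchanged
# (here the Centigrade-to-Fahrenheit map, A = 1.8, B = 32).
print(ranks(affine(data, 1.8, 32.0)) == r)

# A < 0: monotone decreasing, so the ranks are reversed
# (rank k becomes n + 1 - k).
n = len(data)
print(ranks(affine(data, -2.0, 1.0)) == [n + 1 - k for k in r])

# log is monotone increasing on positive values, so ranks are preserved.
positive = [x + 1.0 for x in data]  # shift so that every value is > 0
print(ranks([math.log(x) for x in positive]) == ranks(positive))

# The three log rules, checked for one hypothetical pair x, y.
x, y = 3.0, 4.0
print(math.isclose(math.log(x * y), math.log(x) + math.log(y)))
print(math.isclose(math.log(x / y), math.log(x) - math.log(y)))
print(math.isclose(math.log(x ** 5), 5 * math.log(x)))
```

Comparing rank lists is a direct translation of the slide's definition of monotone increasing and decreasing transformations, which is why the sketch tests ranks rather than the transformed values themselves.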

Aspect

The aspect of the problem is a categorisation of the primary concern of a question. It is defined in the problem stage of the PPDAC (Problem, Plan, Data, Analysis, Conclusion) cycle.

• It can be descriptive, where the answer involves learning about some particular attribute of the target population.
• It can be causative, where the answer involves the existence (or non-existence) of a causal link between variates. With such an aspect the variates are divided into two types by the nature of the problem: response variates and explanatory variates, with the understanding that it is changes in the explanatory variates which 'cause' the changes in the response.
• It can be predictive, where the answer involves predicting the value of a response variate for a given unit.

Review Keywords: Population, units, variate, attributes, aspect, descriptive, causal, predictive.

Plan

A population is a set of units. This population could be potentially infinite, or even hypothetical. If the time that the units are measured is important then the population is often called a process. In any analysis it is important to be clear about what exactly is the definition of the population. We consider three types of set of units:

• the target population is the set of units which the investigator sets out to investigate in the definition of the problem
• the study population is the set of units which could have been selected in the sample
• the sample is the set of units actually selected in the investigation. The number of units in the sample is called the sample size, and the way that the elements in the sample (and their number) are selected (often using some random mechanism) is called the sampling protocol.
The choice of the sample size is part of the protocol.

Populations

                    Objects                           Examples
Unit                Variates: response, explanatory   Cost of unit #i
Sample              Attributes                        Average cost for units in sample
Study Population    Attributes                        Average cost for units in study pop.
Target Population   Attributes                        Average cost for units in target pop.

Errors

• Each of these populations is, in some sense, representative of the others, but there is always an error involved in such a representation.
• Suppose the attribute of interest is a(·) (a function from populations to real numbers); then the study error is a(P_Study) − a(P_Target).
• Let S represent the sample and let a(S) be the value of the attribute of interest on the sample. Define the sample error to be a(S) − a(P_Study).

Plans

• We call a plan experimental if the investigator deliberately manipulates one or more explanatory variates on the units of the sample.
• If there is no such control we call the plan observational.

Review Keywords: Process, target population, study population, sample, sample size, sampling protocol, study error, sample error, experimental plan, observation plan, attribute

Data Quality

• Run a visual, graphical or automatic check for values which are logically inconsistent or in conflict with the prior information, that is, information which you have from any source before you have got the data.
• Look for extreme observations which seem to be away from the bulk of the data. Such observations are sometimes called outliers. You must take care though not to be over keen on identifying outliers.
• Check on the methods of data collection for sources of bias in measurement, where bias is defined as a systematic error which applies to all or most of the data and cannot be averaged out.
• Search for missing observations, including observations which have been omitted because of their highly suspicious character. Often missing observations are entered in some conventional way with a 0, 99 (not recommended) or ∗, NA etc.

Graphical Data Summaries I.

1. All graphics should be displayed at an appropriately clear size.
2. Graphics should have clear titles which are fairly self explanatory.
3. Axes should be labelled and units given where appropriate.
4. The choice of scales should be made with care.
5. Graphics should not be used without thought; there may well be better ways of displaying the information.
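
Two ideas from the Errors and Data Quality slides lend themselves to a short numerical sketch: the study and sample errors for an attribute a(·), and an automatic check for missing-value codes and logically inconsistent values. Everything in the example is hypothetical: the unit costs, the chest sizes, the choice of the average as the attribute, and the set of missing-value markers are assumptions made for illustration only.

```python
from statistics import mean

# --- Study error and sample error (hypothetical unit costs) ---
target_population = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0]
study_population = [10.0, 12.0, 14.0, 16.0]  # units that could be sampled
sample = [12.0, 16.0]                        # units actually selected

def attribute(units):
    """The attribute a(.) of interest: here, the average cost."""
    return mean(units)

# Study error: a(P_Study) - a(P_Target)
study_error = attribute(study_population) - attribute(target_population)
# Sample error: a(S) - a(P_Study)
sample_error = attribute(sample) - attribute(study_population)
print(study_error, sample_error)  # -2.0 1.0

# --- Data quality checks (hypothetical chest sizes in inches) ---
MISSING_CODES = {"*", "NA", 99}  # assumed conventional missing markers

raw = [38, 40, 99, 39, "NA", -4, 41]

missing = [v for v in raw if v in MISSING_CODES]
observed = [v for v in raw if v not in MISSING_CODES]
# A chest size cannot be zero or negative: logically inconsistent values.
inconsistent = [v for v in observed if v <= 0]
print(missing)       # [99, 'NA']
print(inconsistent)  # [-4]
```

Using 99 as a missing-value code illustrates why the slide does not recommend it: the check cannot tell a coded missing value from a genuine measurement of 99.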

Example

In the population of Minnesota residents for the year 1975 the number of deaths, by type, are given in the table (Data from Health and Numbers: Chap Le)

Type            Frequency   Relative Frequency
Heart disease       12378   0.379
Cancer               6448   0.197
Stroke               3958   0.121
Accidents            1814   0.055
Other                8088   0.247

Pie chart

[Pie chart with sectors: Heart disease, Cancer, Stroke, Other, Accidents]
Figure 1: Pie chart of type of death

Bar chart

This same frequency information is usually better shown on a bar chart. Here you can get back the relative frequencies by using the y-axis.

[Bar chart titled "Causes of death"; categories: Heart disease, Cancer, Stroke, Accidents, Other; y-axis: % of Total]
Figure 2: Bar chart of relative frequencies of type of death in Minnesota

Histogram

• If we have continuous data then it is common to summarise it by grouping it into classes or bins.
• We then record only the frequency or number in each bin.
• A bar chart of this frequency data is called a histogram.
• The area of the bar represents the proportion of observations which fall in the interval covered by the bar. Note however that often the frequency is plotted and care must be taken in the interpretation.

Example

The following data is the chest measurements of 5732 Scottish soldiers. They have been grouped to the nearest inch.
The bins are therefore 33 ≤ x < 34, 34 ≤ x < 35, ..., 48 ≤ x < 49.

[Histogram titled "Chest size (in inches)"; x-axis: Size (in); y-axis: Density]

Example

The number of deaths by age for the state of Minnesota in 1987

[Histogram titled "Histogram: Deaths in Minnesota"; x-axis: Age, with groups <1, 5-14, 15-24, 25-34, 35-44, 45-54, 55-64, 65-74, 75-84, >85; y-axis: density]

Income for nonwhites in USA 1983

[Histogram titled "income US non-white families"; x-axis: income; y-axis: Density]

Income

Income distribution for households in Singapore in two different years

[Two histograms titled "Household Incomes in Singapore 1990" and "Household Incomes in Singapore 2000"; x-axis: Household Income S$; y-axis: Density]

Shape of distribution

• Note that while the income histograms are different they all have a similar 'shape'.
• They are all highly skewed to the right, which means there are a small number of very large observations.
• This is different from the shape of the Scottish soldier data, which is very symmetric.

Frequency polygon

[Frequency polygon titled "Chest size (in inches)"; x-axis: Size (in); y-axis: Frequency]

Kernel density estimates

[Two kernel density estimate panels, (a) and (b), each titled "Chest size (in inches)"; x-axis: Size (in); y-axis: Density]

Cumulative Frequency plot

• Another way to show the same frequency information is to count the number of data points which are smaller than any given value.

Chest size   Frequency   Cumulative Frequency
33                   3                      3
34                  19                     22
35                  81                    103
36                 189                    292
37                 409                    701
38                 753                   1454
39                1062                   2516
40                1082                   3598
41                 935                   4533
42                 646                   5179
43                 313                   5492
44                 168                   5660
45                  50                   5710
46                  18                   5728
47                   3                   5731
48                   1                   5732

Example

[Cumulative frequency plot titled "Chest size (in inches)"; x-axis: Size (in); y-axis: Cumulative Frequency]

Comparing Distributions

[Cumulative proportion plot titled "Income distribution, 1983"; y-axis: Cumulative proportion; legend: White Income, Nonwhite income]
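
The frequency summaries used in these slides (relative frequencies, histogram densities and cumulative frequencies) can be recomputed from the two tables given in the slides. The sketch below uses the Minnesota and chest-size frequencies as they appear in the tables; the function name `relative_frequencies` and the variable names are illustrative choices, and the unit bin width follows from the one-inch bins.

```python
from itertools import accumulate

# Minnesota deaths by type, 1975 (frequencies from the table).
deaths = {"Heart disease": 12378, "Cancer": 6448, "Stroke": 3958,
          "Accidents": 1814, "Other": 8088}

def relative_frequencies(freqs):
    """Divide each frequency by the total number of observations."""
    total = sum(freqs)
    return [f / total for f in freqs]

for name, rel in zip(deaths, relative_frequencies(deaths.values())):
    print(f"{name}: {rel:.3f}")

# Scottish soldiers: chest size bins 33 <= x < 34, ..., 48 <= x < 49.
sizes = list(range(33, 49))
chest = [3, 19, 81, 189, 409, 753, 1062, 1082,
         935, 646, 313, 168, 50, 18, 3, 1]

# Histogram bar height = proportion / bin width, so bar area = proportion.
bin_width = 1.0
density = [p / bin_width for p in relative_frequencies(chest)]

# Cumulative frequency: running total of the bin counts.
cumulative = list(accumulate(chest))

for size, f, c, d in zip(sizes, chest, cumulative, density):
    print(size, f, c, round(d, 4))
```

Because each bin is one inch wide, the density heights here equal the proportions; with unequal bins (as in the age-group histogram) dividing by the bin width is what keeps the bar areas equal to the proportions.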