slide1 - Statistics 231


Statistics 231
• The course web-site is on ANGEL.
• These slides will be posted on the ANGEL site and the course information sheet.
• Please feel free to email me questions, which I will answer in the next class.
• Course notes are available at Pixel planet.

Assessment
• Four assignments (20%), two midterms (30%), and final (50%).
• You must pass the final in order to pass the course.

Cheating and Academic Discipline
• Cheating on assignments and projects includes copying another student's solution and submitting it as your own, allowing another student to copy your solution, or collaborating excessively with another student.
• Plagiarism is the act of presenting the ideas, words or other intellectual property of another as one's own. The use of other people's work must be properly acknowledged and referenced in all written material such as take-home examinations, essays, laboratory reports, work-term reports, design projects, statistical data, computer programs and research results.
• The standard penalty for cheating or plagiarism on an assignment or project is as follows: no marks for the assignment and a deduction of 5% from the final course grade.
• It is permissible, and indeed desirable, to discuss assignment solution methods with classmates, TAs, and instructors.
• You should work through the solution yourself and write it in your own words. The only exceptions are assignments or projects which the instructor designates as 'group' activities.
• In academic work, it is customary to acknowledge, in writing, all sources of help. We require that, for each assignment or project submitted, you write (and sign) an acknowledgement of help received, which includes the names of the people (if any) with whom you discussed your solutions.

Prerequisites
• This course uses many of the tools and ideas of STAT 230 and any of this material can be used during assessment.
• This first section of the notes gives a very brief review of the required material.
• The first assignment will also help you revise this material.

Chapter 1: Data Analysis
• Section 1.1 of the notes is a review of probability from STAT 230 which is used in this course.
• You need to work through this section yourself and refer back to it as needed.
• Questions on this material will appear in Assignments, Midterms and the Final.

In this section we look at data in various forms. Let us first look at three definitions of data:
• Data: facts given from which others may be inferred. (Chambers Dictionary)
• Data: Things given or granted, something known or assumed as fact or made the basis of reasoning or calculation. (Shorter English Dictionary)
• Data: The set of measurements from an experiment. (Statistics textbook)

Statistics
We can also give a number of definitions of statistics:
• It is the science of making decisions in the face of uncertainty.
• Finding patterns and structure in data.
• Empirical problem solving.

This brings in a final aspect: the information we have is almost always incomplete, and so we are forced to make decisions without knowing in advance exactly what the outcome of the decision will be.

Data Types
1. Discrete data. This is numerical data which comes either in whole numbers or has been counted. Examples include number of courses passed, number of cars parked in a car park, number of cases of AIDS in a population.
2. Continuous data. This is numerical data which is a real number. In general it has been measured. Examples include height, weight, voltage in a circuit. Note that most often this data is only recorded to a certain accuracy.
3. Categorical data. This is non-numerical and the data has been chosen from some set of categories which are often pre-determined. Examples include month of birth, name, marital status.
4. Binary data. This is categorical data which has only two categories. The answers are often 'Yes' or 'No'. Examples include responses to the question: Are you vegetarian?
5. Ordinal data. Any data which has an underlying order is ordinal. Often this is numerical data, but not always. Examples include height, number of items produced by a factory. Differences, though, might not be interpretable.
6. Grouped or frequency data. This is data which has been recorded in the form of the numbers of observations in particular categories. Examples include the numbers of men and women in this class, and the numbers of Pure Maths, Act. Sci. etc. students.

Notation
• All the above are called datatypes.
• A collection of data is called a dataset.

Examples
A dataset on the distribution of resources around the world.

Name         Deaths/1000 pop.  Infant mort.  Male life expt'cy  Female life expt'cy  G.N.P.  Country Group
Albania       5.7              30.8          69.6               75.5                  600    1
Bulgaria     11.9              14.4          68.3               74.7                 2250    1
Czech...     11.7              11.3          71.8               77.7                 2980    1
E. Germany   12.4               7.6          69.8               75.9                    *    1
. . .

Review Keywords: Data, Discrete, Continuous, Categorical, Binary, Ordinal, Datatypes, Dataset

Transformations on data
• A transformation is called a translation and scale, or affine, if it is of the form $y = Ax + B$.
• Affine maps include the transformations £ → $ and Fahrenheit to Centigrade.
• Given categorical data, a transformation to numerical data is called coding.
• An example of this would be to code the month of birth in the following way: January = 1, February = 2, March = 3, etc.
• Ranking. For ordinal data, order the data from smallest to largest. The associated position is called the rank of the data.
• If the data is {1.2, 3.4, 5.2, −0.3, 0.2, 10.3} the rank transformation will be {1.2, 3.4, 5.2, −0.3, 0.2, 10.3} → {3, 4, 5, 1, 2, 6}.

Monotone transformations
• A general transformation $F$ is called monotone increasing if the ranks of $\{x_1, x_2, x_3, \ldots, x_n\}$ are the same as those of $\{F(x_1), F(x_2), F(x_3), \ldots, F(x_n)\}$.
• If the ranks are reversed it is called monotone decreasing.
• A translation and scale transformation $x \to Ax + B$ is monotone increasing if $A$ is positive and monotone decreasing if $A$ is negative.

Log transformations
The (natural) log transformation is often very useful. Note that $x$ must be positive for $\log(x)$ to exist. Then
$\log(xy) = \log(x) + \log(y)$
$\log(x/y) = \log(x) - \log(y)$
$\log(x^n) = n \log(x)$
Also, it is often a useful fact that log is a monotone increasing transformation.

Review Keywords: Scaling, Translation, Affine map, Coding, Ranking, Monotone increasing transformation, Logs

Statistical Method - A Systematic Approach
• Problem: A clear statement of what we are trying to learn
• Plan: The procedures we use to carry out the study
• Data: The data are collected according to plan
• Analysis: The data are summarized and analyzed to answer the questions posed
• Conclusion: Conclusions are drawn about what has been learned

Problem
• Very often statistics is concerned with making statements about certain populations of individuals.
• These can be populations of people, companies, types of virus, . . . These individuals will be referred to as units.
• Any characteristic of a single unit is called a variate.
• The problem can usually be defined in terms of functions defined on these populations; such functions are called attributes.
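
Returning to the ranking and monotone transformation slides above, the following is a minimal Python sketch that reproduces the rank example and checks the claims about affine and log maps. The function name `ranks` and the particular values of A and B are illustrative choices, not part of the course notes, and the sketch assumes there are no tied values.

```python
import math

def ranks(data):
    """Return the rank (1 = smallest) of each value, assuming no ties."""
    order = sorted(range(len(data)), key=lambda i: data[i])
    result = [0] * len(data)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result

data = [1.2, 3.4, 5.2, -0.3, 0.2, 10.3]
print(ranks(data))  # [3, 4, 5, 1, 2, 6]

# Affine maps x -> A*x + B (A and B chosen arbitrarily for illustration).
increasing = [2.0 * x + 1.0 for x in data]   # A > 0: ranks unchanged
decreasing = [-2.0 * x + 1.0 for x in data]  # A < 0: ranks reversed
n = len(data)
print(ranks(increasing) == ranks(data))
print(ranks(decreasing) == [n + 1 - r for r in ranks(data)])

# log needs positive inputs, so only the positive values are used here.
positive = [x for x in data if x > 0]
logged = [math.log(x) for x in positive]
print(ranks(logged) == ranks(positive))
```

Each comparison should print True, which is the rank-based definition of monotone increasing (or decreasing) applied to this one dataset.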

Aspect
The aspect of the problem is a categorisation of the primary concern of a question. It is defined in the problem stage of the PPDAC cycle.
• It can be descriptive, where the answer involves learning about some particular attribute of the target population.
• It can be causative, where the answer involves the existence (or non-existence) of a causal link between variates. With such an aspect the variates are divided into two types by the nature of the problem: response variates and explanatory variates, with the understanding that it is changes in the explanatory variates which 'cause' the changes in the response.
• It can be predictive, where the answer involves predicting the value of a response variate for a given unit.

Review Keywords: Population, units, variate, attributes, aspect, descriptive, causal, predictive.

Plan
A population is a set of units. This population could be potentially infinite, or even hypothetical. If the time at which the units are measured is important then the population is often called a process. In any analysis it is important to be clear about exactly what the definition of the population is. We consider three types of set of units:
• the target population is the set of units which the investigator sets out to investigate in the definition of the problem.
• the study population is the set of units which could have been selected in the sample.
• the sample is the set of units actually selected in the investigation. The number of units in the sample is called the sample size, and the way that (and number of) elements in the sample are selected (often using some random mechanism) is called the sampling protocol.
The choice of the sample size is part of the protocol.

Populations

Population          Objects                            Examples
Unit                Variates: response, explanatory    Cost of unit #i
Sample              Attributes                         Average cost for units in sample
Study Population    Attributes                         Average cost for units in study pop.
Target Population   Attributes                         Average cost for units in target pop.

Errors
• Each of these populations is, in some sense, representative of the others, but there is always an error involved in such a representation.
• Suppose the attribute of interest is $a(\cdot)$ (a function from populations to real numbers), and so the study error would be $a(P_{Study}) - a(P_{Target})$.
• Let $S$ represent the sample, and let $a(S)$ be the value of the attribute of interest on the sample. Define the sample error to be $a(S) - a(P_{Study})$.

Plans
• We call a plan experimental if the investigator deliberately manipulates one or more explanatory variates on the units of the sample.
• If there is no control we call the plan observational.

Review Keywords: Process, target population, study population, sample, sample size, sampling protocol, study error, sample error, experimental plan, observation plan, attribute

Data Quality
• Run a visual, graphical or automatic check for values which are logically inconsistent or in conflict with the prior information, that is, information which you have, from any source, before you have got the data.
• Look for extreme observations which seem to be away from the bulk of the data. Such observations are sometimes called outliers. You must take care though not to be over keen on identifying outliers.
• Check on the methods of data collection for sources of bias in measurement, where bias is defined as a systematic error which applies to all or most of the data and cannot be averaged out.
• Search for missing observations, including observations which have been omitted because of their highly suspicious character. Often missing observations are entered in some conventional way with a 0, 99 (not recommended) or ∗, NA etc.

Graphical Data Summaries I
1. All graphics should be displayed at an appropriately clear size.
2. Graphics should have clear titles which are fairly self explanatory.
3. Axes should be labelled and units given where appropriate.
4. The choice of scales should be made with care.
5. Graphics should not be used without thought; there may well be better ways of displaying the information.
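
Going back to the Data Quality checks listed earlier, here is a rough Python sketch of an automatic check for missing-value codes and logically inconsistent entries. Everything specific in it is an assumption made for illustration: the set `MISSING_CODES`, the function name `check_quality`, the example entries and the plausible range 0 to 120 are not from the course notes.

```python
# Hypothetical conventions for recording a missing observation.
MISSING_CODES = {"NA", "*", "", "99"}

def check_quality(raw_values, low, high):
    """Split raw text entries into valid numbers, missing positions and suspect positions."""
    valid, missing, suspect = [], [], []
    for position, entry in enumerate(raw_values):
        text = entry.strip()
        if text in MISSING_CODES:
            missing.append(position)
            continue
        try:
            value = float(text)
        except ValueError:
            suspect.append(position)  # not a number at all
            continue
        if low <= value <= high:
            valid.append(value)
        else:
            suspect.append(position)  # conflicts with prior information
    return valid, missing, suspect

raw = ["34", "NA", "41", "-5", "99", "38", "abc", "*"]
valid, missing, suspect = check_quality(raw, low=0, high=120)
print(valid)    # [34.0, 41.0, 38.0]
print(missing)  # [1, 4, 7]
print(suspect)  # [3, 6]
```

Note that a genuine value of 99 would be silently treated as missing here, which illustrates why numeric codes such as 0 or 99 are not recommended.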

Example
In the population of Minnesota residents for the year 1975, the numbers of deaths, by type, are given in the table (Data from Health and Numbers: Chap Le).

Type            Frequency   Relative Frequency
Heart disease   12378       0.379
Cancer           6448       0.197
Stroke           3958       0.121
Accidents        1814       0.055
Other            8088       0.247

Pie chart
[Figure 1: Pie chart of type of death]

Bar chart
This same frequency information is usually better shown on a bar chart. Here you can get back the relative frequencies by using the y-axis.
[Figure 2: Bar chart of relative frequencies of type of death in Minnesota. Title: Causes of death; y-axis: % of Total]

Histogram
• If we have continuous data then it is common to summarise it by grouping it into classes or bins.
• We then record only the frequency or number in each bin.
• A bar chart of this frequency data is called a histogram.
• The area of the bar represents the proportion of observations which fall in the interval covered by the bar. Note however that often the frequency is plotted and care must be taken in the interpretation.

Example
The following data is the chest measurements of 5732 Scottish soldiers. They have been grouped to the nearest inch.
So that the bins are 33 ≤ x < 34, 34 ≤ x < 35, . . . , 48 ≤ x < 49.
[Figure: histogram "Chest size (in inches)"; x-axis: Size (in); y-axis: Density]

Example
The number of deaths by age for the state of Minnesota in 1987.
[Figure: "Histogram: Deaths in Minnesota"; x-axis: Age (grouped in bands from <1 to >85); y-axis: density]

Income for nonwhites in USA
[Figure: histogram "1983 income US non−white families"; x-axis: income; y-axis: Density]

Income
Income distribution for households in Singapore in two different years.
[Figure: two histograms, "Household Incomes in Singapore 1990" and "Household Incomes in Singapore 2000"; x-axis: Household Income S$; y-axis: Density]

Shape of distribution
• Note that while the income histograms are different they all have a similar 'shape'.
• They are all highly skewed to the right, which means there are a small number of very large observations.
• This is different to the shape of the Scottish soldier data, which is very symmetric.

Frequency polygon
[Figure: frequency polygon "Chest size (in inches)"; x-axis: Size (in); y-axis: Frequency]

Kernel density estimates
[Figure: two panels, (a) and (b), each titled "Chest size (in inches)"; x-axis: Size (in); y-axis: Density]

Cumulative Frequency plot
• Another way to show the same frequency information is to count the number of data points which are smaller than any given value.

Chest size   Frequency   Cumulative Frequency
33              3            3
34             19           22
35             81          103
36            189          292
37            409          701
38            753         1454
39           1062         2516
40           1082         3598
41            935         4533
42            646         5179
43            313         5492
44            168         5660
45             50         5710
46             18         5728
47              3         5731
48              1         5732

Example
[Figure: cumulative frequency plot "Chest size (in inches)"; x-axis: Size (in); y-axis: Cumulative Frequency]

Comparing Distributions
[Figure: "Income distribution, 1983"; cumulative proportion against income ($) for White Income and Nonwhite income]

Review Keywords: Pie chart, Bar chart, Histogram, Frequency Polygon, Cumulative Frequency plot.
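
The relative frequencies in the Minnesota table and the cumulative frequencies in the chest-size table can be recomputed directly from the raw counts. The sketch below does this in Python; the variable names are illustrative, and the density line assumes every bin is one inch wide, as stated for the chest data.

```python
from itertools import accumulate

# Deaths by type, Minnesota 1975 (counts taken from the table above).
deaths = {
    "Heart disease": 12378,
    "Cancer": 6448,
    "Stroke": 3958,
    "Accidents": 1814,
    "Other": 8088,
}
total_deaths = sum(deaths.values())
relative = {kind: round(count / total_deaths, 3) for kind, count in deaths.items()}
print(relative)

# Chest sizes 33..48 inches with their frequencies (from the table above).
sizes = list(range(33, 49))
freqs = [3, 19, 81, 189, 409, 753, 1062, 1082,
         935, 646, 313, 168, 50, 18, 3, 1]
cumulative = list(accumulate(freqs))
print(cumulative[-1])  # 5732, the number of soldiers

# Histogram bar height on the density scale: proportion / bin width.
bin_width = 1.0
density = [f / (cumulative[-1] * bin_width) for f in freqs]
print(round(sum(d * bin_width for d in density), 6))  # bar areas sum to 1.0
```

Plotting `density` rather than `freqs` changes only the vertical scale, which is why care is needed when reading a histogram whose axis shows raw frequency.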

Algebraic methods: continuous data
We concentrate on two aspects of the data initially:
1. What is a typical or average value of the data set? This is called the location of the data.
2. How far does the data spread from this typical value?
Later we look in more detail at the ways that variability of the data can be described algebraically.

Location
• The mean. Given a numerical data set $\{x_1, x_2, \ldots, x_n\}$ the mean is given by
$$\bar{x} = \frac{x_1 + x_2 + x_3 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$$
• The median. This can be used when the data is ordinal. The median is defined to be the value $Q_2$ such that half the data is less than $Q_2$ and half is greater than $Q_2$. Using the notation of the previous section, if we have ranked the data as $x_{(1)}, x_{(2)}, \ldots, x_{(n)}$ where $x_{(1)} \le x_{(2)} \le \ldots \le x_{(n)}$, then if $n$ is odd the median is $x_{((n+1)/2)}$; when $n$ is even the median is $\frac{1}{2}\left(x_{(n/2)} + x_{(n/2+1)}\right)$.
• The mode. This is used when we have grouped or frequency data. The mode is that group with the highest frequency. This would be called the modal class.
• When the data is such that the histogram has a single maximum and is roughly symmetric then in general the mean, median and mode will have approximately the same value.
• Consider the case where the histogram of the data is strongly skewed. Then there will be a large difference between the three measures. The mean is much more influenced by extreme values or outliers than the other two.
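
The three location measures can be sketched in a few lines of Python. This is an illustration rather than the course's own code: the function names and the small datasets are made up, and the median uses the usual convention that for even n it averages the two middle order statistics, $x_{(n/2)}$ and $x_{(n/2+1)}$.

```python
def mean(xs):
    """Arithmetic mean: the sum of the values divided by their number."""
    return sum(xs) / len(xs)

def median(xs):
    """Middle order statistic (odd n) or average of the two middle ones (even n)."""
    ordered = sorted(xs)
    n = len(ordered)
    if n % 2 == 1:
        return ordered[(n + 1) // 2 - 1]   # x_((n+1)/2), shifted to 0-based indexing
    return 0.5 * (ordered[n // 2 - 1] + ordered[n // 2])  # x_(n/2) and x_(n/2+1)

def modal_class(frequencies):
    """Return the class (dictionary key) with the highest frequency."""
    return max(frequencies, key=frequencies.get)

symmetric = [2, 3, 4, 5, 6]
skewed = [2, 3, 4, 5, 60]      # one extreme value added on the right
print(mean(symmetric), median(symmetric))  # 4.0 4
print(mean(skewed), median(skewed))        # 14.8 4
print(median([1, 2, 3, 4]))                # 2.5
print(modal_class({39: 1062, 40: 1082, 41: 935}))  # 40
```

Replacing 6 by 60 moves the mean from 4.0 to 14.8 while the median stays at 4, which is the sensitivity to extreme values described in the last bullet.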

Spread
• To get the crudest measure of the spread we could look at the maximum and minimum values of the data. The range is defined as maximum − minimum.
• A generalisation of the median is given by the quartiles. The lower quartile $Q_1$ is the value which has 25% of the data smaller than it and 75% greater. The upper quartile $Q_3$ splits the data in the ratio 75% : 25%. One way of calculating the quartiles is from the cumulative frequency plot.
• The difference $Q_3 - Q_1$ is called the interquartile range.
• A further generalisation of this idea is given by the percentiles: the $p$th percentile divides the data in the ratio $p : (100 - p)$.

Quartiles
[Figure: "Cumulative Distribution of Weights"; x-axis: Weights (lbs); y-axis: Cumulative Proportion (%); the 25%, 50% and 75% levels are read across to give the lower quartile, median and upper quartile]

Variance and standard deviation
• The variance is defined to be the mean of the squared difference from the average. The population variance is given by
$$\sigma_n^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}$$
• The sample variance is very similar and is given by
$$\sigma_{n-1}^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$$
• The standard deviation is given by
$$\sigma_n = \sqrt{\sigma_n^2} \quad \text{or} \quad \sigma_{n-1} = \sqrt{\sigma_{n-1}^2}$$
The difference between these two is a little subtle and we shall explore it later.
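
As a hedged illustration of the spread measures, the sketch below computes the range, quartiles, interquartile range and both versions of the variance for a small made-up dataset. The quartiles come from Python's `statistics.quantiles` with `method="inclusive"`; this is only one of several quartile conventions and is an assumption here, since the slides define quartiles only informally.

```python
import math
import statistics

def population_variance(xs):
    """Mean squared difference from the average (divisor n)."""
    x_bar = sum(xs) / len(xs)
    return sum((x - x_bar) ** 2 for x in xs) / len(xs)

def sample_variance(xs):
    """Same sum of squares, but with divisor n - 1."""
    x_bar = sum(xs) / len(xs)
    return sum((x - x_bar) ** 2 for x in xs) / (len(xs) - 1)

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up values

data_range = max(data) - min(data)
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print(data_range)                                   # 7.0
print(q1, q2, q3, iqr)
print(population_variance(data))                    # 4.0
print(math.sqrt(population_variance(data)))         # 2.0
print(sample_variance(data))                        # 32/7, a little larger
```

The two variances share the numerator 32 and differ only in the divisor (8 against 7), so the gap between them shrinks as the sample size grows.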

Review Keywords: Mean, Median, Mode, Average, Max, Min, Range, Quartiles, Interquartile range, Percentiles, Variance, Sample Variance, Standard Deviation

Graphical Data Summaries II
• Box-plots. A good graphical tool for showing the shape of a distribution is given by a box-plot.
• The QQ plot is useful for comparing the shape of the distribution of two samples, or of a sample and the theoretical distribution.

Example: weights of 57 children
[Figure: "Boxplot of Childrens' weights"; y-axis: Weight (lbs); labelled features: lower adjacent value, lower quartile, median, upper quartile, upper adjacent value, possible outlier]

Example: saturation of bile
[Figure: "Comparing ages of patients"; % Saturation for the groups labelled "Patients younger 50" and "Patients older 50"]

QQ-plot: chest size data
[Figure: two panels, (a) Histogram (x-axis: Chest (in); y-axis: Density) and (b) QQ plot (x-axis: Theoretical Quantiles; y-axis: Sample Quantiles)]

Review Keywords: Box-plots, QQ-plots.
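
The features labelled on the box-plot of children's weights (quartiles, adjacent values, possible outliers) can be computed as below. This is a sketch under assumptions the slides do not state: it uses the common 1.5 × IQR rule to define the fences, the "inclusive" quartile method of `statistics.quantiles`, and a made-up list of weights rather than the 57 children's data.

```python
import statistics

def boxplot_summary(xs):
    """Quartiles, adjacent values and possible outliers using 1.5 * IQR fences."""
    ordered = sorted(xs)
    q1, q2, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [x for x in ordered if lower_fence <= x <= upper_fence]
    return {
        "lower_adjacent": inside[0],    # smallest value inside the fences
        "q1": q1,
        "median": q2,
        "q3": q3,
        "upper_adjacent": inside[-1],   # largest value inside the fences
        "possible_outliers": [x for x in ordered
                              if x < lower_fence or x > upper_fence],
    }

weights = [18, 22, 25, 27, 30, 31, 33, 36, 40, 85]  # hypothetical weights (lbs)
summary = boxplot_summary(weights)
for name, value in summary.items():
    print(name, value)
```

With these made-up values, 85 lies beyond the upper fence and is reported as a possible outlier, while the whiskers stop at 18 and 40. Whether such a point is really an error is a judgement for the Data Quality stage, not something the rule decides.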

Bivariate data
• Suppose two variables $(X, Y)$ are measured on a sample of $n$ units to give data $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$.
• The variables $X$ and $Y$ may, to a greater or lesser extent, be associated.
• This might occur, for example, in an investigation with a causal aspect in which $y_i$ is the response variate for unit #i while $x_i$ is the corresponding explanatory variate.
• The problem of such an investigation is then to try to establish if there is an association between the two variates and, if there is, whether it is a causal one.

Scatterplots
[Figure: two panels, "Scatterplot of Votes/Seats" (Seats % against Votes %) and "Plot of Seats by Time" (Seats % against Year)]

Scatter-plot matrix
[Figure: scatter-plot matrix of Year, Votes % and Seats %]

Measures of association
• If we have two, or more, variates measured on each unit then it is often useful to have a measure of the association between these variates.
• Which particular tool we use to measure association will depend on the type of data. For example if we have two categorical variates we can use ideas of independence which are very similar to those of the independence of probabilities in STAT 230.
• However if both variates are continuous we can use an algebraic measure which is derived from the scatter plot.

Dependence & relative risk
Suppose that we wished to investigate the relationship between two categorical variates, for example the binary variates of being a Smoker (Yes/No) and being at high risk for a heart attack (Yes/No).

Smoke/Risk   High   Low   Total
Yes            42     7      49
No             12    39      51
Total          54    46     100

Table 1: Frequencies for paired categorical variables

• Recall from STAT 230 that we define two events, $A$ and $B$, to be independent if
$$P(A \mid B) = P(A \mid \text{not } B) = P(A).$$
• Thus we can measure dependence by seeing if these equalities fail, in particular looking at how far
$$\frac{P(A \mid B)}{P(A \mid \text{not } B)}$$
is from 1.
• What holds for probabilities also holds for proportions. So use
$$\frac{\text{Proportion of smokers who are high risk}}{\text{Proportion of non smokers who are high risk}}.$$
• This is called the relative risk. In our example this will be
$$\frac{42/49}{12/51} = 3.64 > 1$$

Correlation
• A commonly used measure of strength of a linear relationship between two continuous variables $(x_i, y_i)$, $i = 1, \ldots, n$ is called correlation.
• This is usually denoted by $r$ and is defined by
$$r = \frac{S_{XY}}{\sqrt{S_{XX} S_{YY}}}$$
where
$$S_{XX} = \sum_{i=1}^{n} (x_i - \bar{x})^2, \quad S_{YY} = \sum_{i=1}^{n} (y_i - \bar{y})^2, \quad S_{XY} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}).$$
• The following rough interpretation can be used: when the points lie exactly on a straight line with positive (negative) slope, $r = 1$ ($r = -1$); when there is a reasonably strong positive (negative) linear relationship, $r$ will be appreciably positive (negative); when there is no apparent relationship between the variables, $r$ will be near zero.
• For the election data example the correlation is 0.93, which indicates a strong relationship between the two variables.

Keywords: Association, dependence, relative risk, correlation

Time series plots
• One of the most commonly encountered types of data is the time series; this occurs when the investigation concerns a process.
• We have data points of the form $(t, y(t))$ where $t$ is time and $y(t)$ is the value of the variate at time $t$.
• Sometimes $y(t)$ is recorded continuously (e.g. the trace on a seismograph), but more usually at discrete and often equally-spaced time points (e.g. weekly sales figures, monthly unemployment figures, annual maximum high water level at a coastal site).
• Plot the data points $(t, y(t))$ as an x-y graph or scatter plot. To emphasise the fact that the points are consecutive in time, neighbouring points are usually joined by straight lines.
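
The two measures of association defined above, relative risk and correlation, can be checked numerically. The Python sketch below uses the smoker counts from Table 1 for the relative risk; the correlation function follows the $S_{XX}$, $S_{YY}$, $S_{XY}$ definition, but the x and y values fed to it are made up for illustration because the election data are not reproduced in these slides. Function and argument names are illustrative.

```python
import math

def relative_risk(cases_in_group, group_total, cases_in_other, other_total):
    """Ratio of the proportion of cases in one group to that in the other."""
    return (cases_in_group / group_total) / (cases_in_other / other_total)

# Table 1: 42 of 49 smokers and 12 of 51 non-smokers are high risk.
rr = relative_risk(42, 49, 12, 51)
print(round(rr, 2))  # 3.64

def correlation(xs, ys):
    """Correlation r = S_XY / sqrt(S_XX * S_YY)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return s_xy / math.sqrt(s_xx * s_yy)

xs = [1.0, 2.0, 3.0, 4.0]          # hypothetical values
on_rising_line = [3.0, 5.0, 7.0, 9.0]   # y = 2x + 1
on_falling_line = [9.0, 7.0, 5.0, 3.0]  # y = -2x + 11
print(correlation(xs, on_rising_line))   # 1.0
print(correlation(xs, on_falling_line))  # -1.0
```

The two straight-line cases reproduce the extremes $r = 1$ and $r = -1$ from the rough interpretation; real data such as the election example fall strictly between them.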

Example: Dow street crash
[Figure: two time series plots of the Dow Jones Index, "Dow Jones Index (1920−1941)" and "Dow Jones Index (1929−1931)"; x-axis: Years; y-axis: Dow Jones Index]

Example: Denmark monthly birth data
[Figure: "Denmark 20th Century birthrate"; x-axis: Year; y-axis: Births per month]

Keywords: Trend, Season effect, Change points