Week 7 - Hypothesis Testing Continued and Simple Regression



Week 7 – Hypothesis Testing Continued, Regression Analysis

Minitab
• We covered normality plots and hypothesis testing in Minitab using the questions in the next few slides. I'll send out the Minitab files so you can have a look at them and get familiarized.

Single-Sample Tests for Proportions
• To see if at least 50% of the compressors manufactured by a firm can withstand 5 years of continuous operation without failure, 100 compressors were put on test. If 38 of them were still running after 5 years, can it be said, at the 0.05 level of significance, that this standard has not been met?

Significance Tests for Population Means (one-sample t-test)
• A test for the mean follows the same steps 1–4 as any other hypothesis test
• Test hypothesis H0: μ = μ0 (the mean actually equals the stated value)
  – Ex: "The mean water volume is exactly 1 liter"
• By comparing standardized values at the required level of significance, we specify what value of the sample mean will cause us to reject H0 because it is deemed significantly different from μ0

Significance Test for the Sample Mean – Example 1
• Test at the 0.05 level of significance whether the mean of a random sample of size 16 differs significantly from 10, given that the distribution from which the sample was taken is approximately normal and the sample mean and standard deviation are calculated to be 8.4 and 3.2, respectively.

Significance Test for the Sample Mean – Example 2
• The Righteous Insurance Company will insure an auto only if the mean repair cost after a 10 mph collision is less than $1000. The company uses a standard α = 0.05 as its significance level. From 5 crash tests conducted by the auto manufacturer, the costs of repairs were: $1050, $400, $720, $600, and $1100.
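The two single-sample test statistics above (the compressor proportion test and t-test Example 1) can be sketched with plain Python, using only the standard library rather than Minitab:

```python
import math

# Proportion test: H0: p >= 0.5, Ha: p < 0.5; n = 100 compressors, 38 survived
n, successes, p0 = 100, 38, 0.5
p_hat = successes / n                                # 0.38
z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)      # -2.4
# z is beyond the lower-tail critical value -1.645, so reject H0:
# the 50% standard has not been met.

# One-sample t test (Example 1): H0: mu = 10; n = 16, xbar = 8.4, s = 3.2
xbar, s, n, mu0 = 8.4, 3.2, 16, 10
t = (xbar - mu0) / (s / math.sqrt(n))                # -2.0
# |t| = 2.0 < t_0.025,15 = 2.131, so we fail to reject H0 at alpha = 0.05.
```

Both statistics follow the same pattern: (estimate − hypothesized value) divided by the standard error of the estimate.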
Significance Test for the Sample Mean – Example 2 (continued)
• Step 1: H0: μ ≥ $1000 (mean cost is too high); Ha: μ < $1000 (mean cost is low enough to insure)
• Step 2: The test statistic follows the t distribution: t = (x̄ − μ0)/(s/√n), where μ0 is the hypothesized mean of $1000 and the standard error is s/√n
• For this example, x̄ = $774, s = $298, n = 5, df = 5 − 1 = 4
• t_obs = (774 − 1000)/(298/√5) = −1.696
• Let α = 0.05
• Step 3: p-value = P(t < −1.70 | df = 4) > 0.05, from the t-table for 4 df
• Step 4: Since the p-value > the α value of 0.05, we cannot reject H0. The mean repair cost is not significantly less than $1000, hence no insurance will be offered for this kind of car.

Two-Sample Comparisons – Proportions and Means
• Compares one sample with another to determine whether there is a statistically significant difference between the two samples
• Our objective is to estimate the true difference: p1 − p2 or x̄1 − x̄2
• The significance of the result depends on:
  – the size of the difference between p1 and p2 or μ1 and μ2 (the larger the difference, the easier it is to reject H0)
  – the size of the samples n1 and n2 (larger sample sizes, easier to reject H0)
  – the size of α (higher value, easier to reject H0)
• H0 is always: there is no difference between the proportions or means of the two samples (the difference equals zero)
• Ha is again one of 3 possibilities:
  – There is a difference (≠)
  – One is greater than the other (>)
  – One is less than the other (<)

Difference Between Proportions – Summary for CI
• p1 and p2 are the population success proportions
• Proportion of successes in each sample: p̂1 = x1/n1, p̂2 = x2/n2
• The estimator is p̂1 − p̂2, an approximately normally distributed variable when n is large (when np > 5)
• CI: (p̂1 − p̂2) ± z_{α/2} √( p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2 )
• Since the 2 populations are assumed independent, the variances of our estimates are additive

A Note on Paired Comparisons – Remember
• Make sure the differences are approximately normally distributed before using this test.
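The t statistic from Example 2 can be reproduced directly from the five crash-test costs, a quick check on the hand computation above:

```python
import math
import statistics

costs = [1050, 400, 720, 600, 1100]          # repair costs from the 5 crash tests
n = len(costs)
xbar = statistics.mean(costs)                 # 774
s = statistics.stdev(costs)                   # sample std dev, ~298.1
mu0 = 1000                                    # hypothesized mean under H0
t_obs = (xbar - mu0) / (s / math.sqrt(n))     # ~ -1.70, df = 4
# t_0.05,4 = -2.132 for a lower-tailed test; t_obs does not fall below it,
# so we cannot reject H0: the mean cost is not significantly below $1000.
```

Note that the slide rounds s to $298; using the unrounded standard deviation gives essentially the same t value.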
Goodness-of-Fit Tests
• Normality tests/plots – easy to do in Minitab
  H0: Data are normally distributed
  Ha: Data are not normally distributed
• If the p-value < 0.05, we reject the null hypothesis (i.e., the data are not normally distributed)
• If the p-value > 0.05, the data can be considered normally distributed
• Remember to check your data!

Probability Plots and Chi-Squared Test
• Minitab can also do goodness-of-fit tests for other continuous distributions, such as the extreme-value, Weibull, etc. This is done using probability plots
• For discrete distributions, the chi-squared test can be used to test whether a binomial or Poisson distribution fits the data, using the test statistic: χ² = Σ (Oi − Ei)²/Ei, where Oi and Ei are the observed and expected counts in each category

Application to Quality
• It is impossible to judge the validity of guesses; scientific estimates must be applied to data to verify the validity of statements
• Make sure that statistical inferences are not misused – often statistical significance doesn't provide a result that is meaningful – there is no substitute for common sense, and great statistics mean little if you're solving the wrong problem!
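A minimal sketch of the chi-squared goodness-of-fit statistic for a discrete distribution. The observed and expected counts here are hypothetical, invented purely for illustration (they are not from the slides):

```python
# Hypothetical example: five categories, expected counts under the hypothesized
# distribution vs. observed counts from a sample.
observed = [18, 22, 20, 25, 15]   # hypothetical observed counts
expected = [20, 20, 20, 20, 20]   # expected counts under H0

# Chi-squared statistic: sum of (O - E)^2 / E over all categories
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
# chi2 = 2.9 with df = 5 - 1 = 4; since 2.9 < chi^2_0.05,4 = 9.488,
# we would fail to reject H0 for this made-up data.
```

In practice the degrees of freedom are reduced further by one for each distribution parameter estimated from the data.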
Regression Analysis – Introduction
• So far we have looked at statistical methods that deal with observations involving a single variable
• Many problems of quality improvement involve relations among several variables
• We are going to look at statistical methods that apply to simultaneous observations made on multiple variables
• We will look at simple linear regression, correlation, multiple regression, analysis of residuals, and their applications to quality management
• A widely used method for studying relationships between multiple variables is regression analysis
• Examples of where this is used are everywhere in the product development and manufacturing environment
• The relationships between price and sales, weight and speed, geometry and drag are some examples of where regression analysis could be used to solve problems of interest to quality managers and engineers

Why Perform Regression?
• We use regression in order to:
  – Learn something about the relationship between two variables
  – Remove a portion of the variation in one variable in order to gain a better understanding of the remaining variation
  – Estimate or predict values of one variable based on another variable

Variable Relationships: Monotonic vs. Nonmonotonic
• Regression and correlation deal with describing/quantifying relationships between random variables
• Types of relationships:
  – Monotonic: variable relationships that have no reversals in slope (a linear relationship is a special case of a monotonic relationship where the slope is constant over x)
  – Nonmonotonic: variable relationships where the slope reverses

Variable Relationships: Linear vs. Nonlinear
• Linear functions satisfy the following properties:
  – Additivity (aka the superposition property): f(x + y) = f(x) + f(y) (the net response to two or more factors is the sum of the responses caused by each factor)
  – Homogeneity of degree 1: f(ax) = af(x) for all a
• Nonlinear: a system which does not satisfy the above principles (often used to represent natural phenomena)

Simple Linear Regression
• Simple linear regression: the use of statistical methods to find the 'best fit' linear relationship between two variables
• The value of a given random variable may depend upon the value of another (non-random) variable
• In simple regression we are dealing with a single relationship between two variables; in multiple regression, we are dealing with multiple relationships between variables
• The non-random variable, x, also called the independent variable or control variable, is fixed at certain known values without error in order to observe its effect on the values of the random variable
• The random variable, y, also called the response variable or dependent variable, is the variable we observe during the experiment; we are interested in knowing how the value of y is affected by changes in the value of x

Simple Linear Regression continued
• We make the assumptions that y is a value of a random variable whose mean is a linear function of x and that the standard deviation of y is constant and independent of x, resulting in the following model equation: y = α + βx + ε
  – α and β are parameters we estimate to find the line of best fit to a set of n observations (xi, yi)
  – ε is a value of a random variable with mean zero and standard deviation σ
• The fitted line has the equation ŷ = a + bx, which estimates the predicted value of y when the independent variable takes on the value x; a and b are estimates of α and β
• The errors associated with using predicted values of y (according to the regression line equation) to estimate the y value for a given xi are called residuals (ei)

Regression Lines
• We minimize the total spread of the y values from the line by looking at all the squared y distances from the line
• The regression or least-squares line is the line with the smallest SSE – an aggregate measure of how much the line's predicted ŷi differ from the actual yi:
  SSE = Σ_{i=1}^{n} (yi − ŷi)²

Sum of Squares Expression Abbreviations
• SSxx = Σ (xi − x̄)²: sum of squares around the mean x-value – measures the spread of the xi
• SSyy = Σ (yi − ȳ)²: sum of squares around the mean y-value – measures the spread of the yi
• SSxy = Σ (xi − x̄)(yi − ȳ): cross product (used with SSxx to determine b)

Regression Line Fit – Components of Variability in y-values
• SSE measures how much the predicted values of y differ from the actual y values
• SSR measures the total variability due to the regression (predicted y values)
• SSE/SSyy is the proportion of error relative to the total spread
• SSR/SSyy = 1 − SSE/SSyy = R², the proportion of variability accounted for by the regression

Method of Least Squares
• The method of least squares is used to find the line of 'best fit' by minimizing the errors/residuals between the predicted and observed values of the dependent variable y
• We can then use these sums of squares to find the values of a and b, the parameters of the regression line equation: b = SSxy/SSxx, a = ȳ − b·x̄
• We minimize the sum of squared residuals by taking partial derivatives of the sum of squared residuals and setting the result to zero, which yields a system of equations that allow us to solve for a and b

Example: Scrap Material Length & Weight

  xi (mm) | yi (g) | xi − x̄ | yi − ȳ | (xi − x̄)² | (yi − ȳ)² | (xi − x̄)(yi − ȳ)
  60      |  84    |  −8    |  −56   |   64      |  3136     |   448
  62      |  95    |  −6    |  −45   |   36      |  2025     |   270
  64      | 140    |  −4    |    0   |   16      |     0     |     0
  66      | 155    |  −2    |   15   |    4      |   225     |   −30
  68      | 119    |   0    |  −21   |    0      |   441     |     0
  70      | 175    |   2    |   35   |    4      |  1225     |    70
  72      | 145    |   4    |    5   |   16      |    25     |    20
  74      | 197    |   6    |   57   |   36      |  3249     |   342
  76      | 150    |   8    |   10   |   64      |   100     |    80
  sum 612 | 1260   |        |        | SSxx 240  | SSyy 10426| SSxy 1200
  mean 68 |  140   |        |        |           |           |

• From the tabulated data we can calculate b = SSxy/SSxx = 1200/240 = 5 and a = ȳ − b·x̄ = 140 − 5(68) = −200, so the fitted line is ŷ = −200 + 5x

How Well Does Our Regression Line Fit?
• The goodness of fit depends on the size of SSE relative to the total spread of the data
• We can quantify this fit by apportioning (assigning a weighting to) the variability in y between the regression (predicted values) and the actual measured values

ANOVA Table

  xi | yi  |  ŷi | ŷi − ȳ | (ŷi − ȳ)² | yi − ŷi | (yi − ŷi)²
  60 |  84 | 100 |  −40   |   1600    |  −16    |   256
  62 |  95 | 110 |  −30   |    900    |  −15    |   225
  64 | 140 | 120 |  −20   |    400    |   20    |   400
  66 | 155 | 130 |  −10   |    100    |   25    |   625
  68 | 119 | 140 |    0   |      0    |  −21    |   441
  70 | 175 | 150 |   10   |    100    |   25    |   625
  72 | 145 | 160 |   20   |    400    |  −15    |   225
  74 | 197 | 170 |   30   |    900    |   27    |   729
  76 | 150 | 180 |   40   |   1600    |  −30    |   900
                           SSR = 6000            SSE = 4426

• SSyy = SSR + SSE = 6000 + 4426 = 10426

Minitab: Fitted Line Plot
• Minitab: Regression or Fitted Line Plot – can also use Excel
• Gives the regression equation, fits the line, offers a choice of different types of functions, and can also show confidence intervals and prediction intervals

A Quick Note on Correlation
• The degree of linearity between two variables is defined by a scale-invariant measure of the proportion of total spread in y accounted for by the regression:
  R² = SSR/SSyy = 1 − SSE/SSyy = r²

Necessary Assumptions for Least Squares Regression
• Correct model form: y is linearly related to x
• Data used to fit the model are representative of the data of interest
• The variance of the residuals is constant (it does not depend on x)
• The residuals are independent (not dependent on time)
• The residuals are normally distributed

Analysis of Residuals
• An analysis of residuals helps to determine if the data are adequately described by the fitted equation
• Several types of graphs are helpful in residual analysis:
  – Normal scores plot of the residuals
  – Plotting residuals against the predicted values of y
  – Plotting residuals against the run numbers

Analysis of Residuals – Residual Scatter Plots
• Residual scatter plots help to uncover violations of the assumptions underlying the regression model
• Procedure: Plot the residuals ei against the predicted y values (the error is assumed to be independent of y)
• Errors in the equation form show up as curvilinear trends
• A random scatterplot indicates that the model assumptions for least squares regression hold
• Patterns indicate a problem with one of the initial assumptions for least squares regression

Analysis of Residuals – Plotting Residuals Against Run Numbers
• A plot of residuals against integers reflecting the order of sampling can also reveal (unwanted) patterns or trends
• What types of factors do you think this could reveal?

Predicting Mean Response
• We are also interested in predicting the mean response of y at a fixed value of x
• The confidence interval for the mean response ŷ0 = a + b·x0 is:
  ŷ0 ± t_{α/2, n−2} · s · √( 1/n + (x0 − x̄)²/SSxx )

Predicting New (Unmeasured) Values of y
• The prediction interval for a new individual y_new at an observed x_new is:
  ŷ_new ± t_{α/2, n−2} · s · √( 1 + 1/n + (x_new − x̄)²/SSxx )
• Note: The error increases farther from the sample mean because changing the slope of the regression line creates a larger difference at points on the line further away from the mean

Testing Linear Relationships
• To check that the relationship is linear, examine the slope of the fitted line by means of a scatter plot showing the independent variable on the x axis and the dependent variable on the y axis
• To ensure the relationship is linear, we need to test whether b, the slope of the best-fit line, is significantly different from zero
• H0: the regression line slope equals zero
• Ha: the regression line slope does not equal zero

Test for Significance of a Linear Regression
• We want to determine whether y is in fact linearly related to x (i.e., does the slope of the linear regression equation differ significantly from zero?)
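The least-squares computations for the scrap length/weight example above can be reproduced in a few lines of plain Python, following the formulas b = SSxy/SSxx, a = ȳ − b·x̄, SSE = SSyy − b·SSxy, and R² = SSR/SSyy:

```python
# Scrap material data from the worked example
x = [60, 62, 64, 66, 68, 70, 72, 74, 76]            # length (mm)
y = [84, 95, 140, 155, 119, 175, 145, 197, 150]     # weight (g)

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n                 # 68, 140

# Sums of squares
SSxx = sum((xi - xbar) ** 2 for xi in x)            # 240
SSyy = sum((yi - ybar) ** 2 for yi in y)            # 10426
SSxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))   # 1200

# Least-squares slope and intercept
b = SSxy / SSxx                                     # 5.0
a = ybar - b * xbar                                 # -200.0, so yhat = -200 + 5x

# Decomposition of variability and goodness of fit
SSE = SSyy - b * SSxy                               # 4426
SSR = SSyy - SSE                                    # 6000
R2 = SSR / SSyy                                     # ~0.575
```

About 57.5% of the variability in scrap weight is accounted for by the regression on length; the rest is residual error.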
• If the slope of the fitted least-squares line is significantly different from zero, there is a linear relation between x and y at the chosen level of significance
• Note that two variables may have a strong relationship even when there is no linear relationship between them

Test for Significance of a Linear Regression – Residual Variance
• The least-squares slope estimate, b, has an error related to the standard deviation of the residuals
• The variance of the residuals can be estimated by the mean squared deviation of the y values of the data points from the least-squares line: s² = SSE/(n − 2)

Test for Significance of a Linear Regression – t Test Value
• Assuming that the residuals (ei) are values of a random variable having the normal distribution, we can calculate a t value
• The t value can be compared with table values of t_{α/2} to determine whether b is significantly different from zero at the significance level α
• t value for testing linearity: t = b/(s/√SSxx), with n − 2 degrees of freedom

Confidence Limits for β
• Using methods similar to those we used to find confidence limits for population means and proportions, we can also find confidence limits for the regression line slope
• 1 − α confidence limits for β: b ± t_{α/2, n−2} · s/√SSxx

Correlation Analysis
• Correlation analysis determines the degree of linear interrelation or association between 2 random variables
• All correlation measures (ρ) are dimensionless and scaled to lie in the range −1 ≤ ρ ≤ 1
• For uncorrelated data, ρ = 0
• Correlation does not provide evidence for a causal relationship between two variables
• Two variables may be correlated because one causes the other (ex. mixing produces heat) or because they both share a common factor that influences both variables (ex. two measured solutes are both influenced by variations in the source of water)

Difference Between Correlation and Regression
• Correlation of x vs. y = correlation of y vs. x
• Regression of x vs. y ≠ regression of y vs. x
• In regression, a line is fitted to explain y from x or x from y, and the relationship cannot be reversed unless the fit is perfect
• A regression line can be used to predict y values

Coefficient of Correlation (r)
• A coefficient of correlation quantifies and tests the strength of the monotonic relationship between two variables x and y
• Pearson's correlation coefficient, r, measures the linear correlation/association between x and y
• r is invariant to changes in scale, and has a dimensionless value obtained by dividing the deviations of the random variables from their means by their respective sample standard deviations
• This scale-invariant measure of the strength of a linear regression can be constructed by comparing the variability "explained" by the regression to the original variability of y
• For random variables: ρ = Cov(X, Y)/(σX·σY)
• For samples: r = SSxy/√(SSxx·SSyy)

Correlation's Relation to the Least-Squares Line
• The strength of a linear relationship can be measured by the slope of the least-squares regression line, but this measure depends on the magnitude of the slope, i.e. the scale of measurement for y
• A scale-invariant measure of linear relationship can be constructed by comparing the variability explained by the regression to the original variability of the y values, resulting in the population correlation coefficient

What Does r Tell Us About the Least-Squares Fit Line?
• If r takes on a positive value close to 1, the data points (xi, yi) are nearly collinear and the slope of the corresponding least-squares line is positive
• If r takes on a negative value close to −1, the data points (xi, yi) are nearly collinear and the slope of the corresponding least-squares line is negative
• If r takes on a value close to zero, there is little or no linear relationship between x and y

Remember Our Previous Example
• The squared correlation is the proportion of the total SSyy accounted for by the regression
• Comparing the variance of the residuals to the original variance of the y values gives the proportion of the reduction in variability associated with the regression relative to the original variability of y

Coefficient of Correlation
• The correlation coefficient can form the basis of a statistical test of independence
• H0: ρ = 0 (the yi are independent, identically distributed normal random variables, not dependent on the xi)
• Ha: ρ ≠ 0
• The test statistic t is defined as: t = r√(n − 2)/√(1 − r²)
• H0 is rejected if |t| > t_crit at n − 2 degrees of freedom

Significance Tests and Confidence Intervals for r
• For our test statistic, we use the t-distribution, with t defined as above
• We reject H0 if |t| > t_crit
• t_crit is the value of the t-distribution with n − 2 d.f. and a probability of exceedance of α/2
• In-class example

Application to Quality
• Quality-related issues are often the result of the combined effects of multiple variables that do not act independently
• Regression analysis is useful in screening: if a variable is not correlated with another and causes little change in y, it can probably be excluded from the analysis
• Problems associated with correlated variables, non-measured variables, and equation form are often overlooked – a bad idea!
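The test of H0: ρ = 0 can be sketched on the same scrap length/weight data, using t = r√(n − 2)/√(1 − r²) with n − 2 degrees of freedom:

```python
import math

# Scrap material data from the worked regression example
x = [60, 62, 64, 66, 68, 70, 72, 74, 76]
y = [84, 95, 140, 155, 119, 175, 145, 197, 150]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
SSxx = sum((xi - xbar) ** 2 for xi in x)
SSyy = sum((yi - ybar) ** 2 for yi in y)
SSxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

# Sample correlation coefficient: r = SSxy / sqrt(SSxx * SSyy)
r = SSxy / math.sqrt(SSxx * SSyy)                    # ~0.759

# Test statistic for H0: rho = 0
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)     # ~3.08, df = 7
# |t| = 3.08 > t_0.025,7 = 2.365, so reject H0: the correlation is
# significantly different from zero at alpha = 0.05.
```

Note that r² ≈ 0.575 equals the R² from the regression fit, and this t statistic is identical to the t test on the regression slope b: the two tests are equivalent in simple linear regression.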

This note was uploaded on 03/29/2011 for the course ENGR 9397 taught by Professor Susan Hunt during the Winter '11 term at Memorial University.
