Week 7: Hypothesis Testing Continued & Regression Analysis

Minitab
• We covered normality plots and hypothesis testing in Minitab using the questions in the next few slides. I'll send out the Minitab files so you can have a look at them and get familiarized.

Single Sample Tests for proportions
• To see if at least 50% of the compressors manufactured by a firm can withstand 5 years of continuous operation without failure, 100 compressors were put on test. If 38 of them were still running after 5 years, can it be said, at the 0.05 level of significance, that this standard has not been met?
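The slides work this example in Minitab; as a cross-check, here is a minimal sketch in Python with scipy, assuming the usual large-sample z-test for a single proportion (H0: p ≥ 0.5 vs. Ha: p < 0.5):

```python
import math
from scipy import stats

n, x, p0, alpha = 100, 38, 0.50, 0.05
p_hat = x / n                          # sample proportion = 0.38
se = math.sqrt(p0 * (1 - p0) / n)      # standard error under H0 = 0.05
z = (p_hat - p0) / se                  # z = -2.4
p_value = stats.norm.cdf(z)            # lower-tail test: P(Z <= z)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
if p_value < alpha:
    print("Reject H0: the 50% standard has not been met")
```

With z = −2.4 the p-value is about 0.008 < 0.05, so H0 is rejected and we conclude the standard has not been met.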
Significance Tests for population means (one-sample t test)
• A test for the mean follows the same 1–4 steps as for any hypothesis test
• Test hypothesis H0: µ = µ0 (the mean actually equals the stated value)
– Ex: "The mean water volume is exactly 1 liter"
• By comparing the standardized test statistic to critical values at the required level of significance, we specify which sample means will cause us to reject H0 because they are deemed significantly different from µ0

Significance test for the sample mean, example 1
• Test at the 0.05 level of significance whether the mean of a random sample of size 16 differs significantly from 10 if the distribution from which the sample was taken is approximately normal and the sample mean and standard deviation are calculated to be 8.4 and 3.2, respectively.

Significance test for the sample mean, example 2
• The Righteous Insurance Company will insure an auto only if the mean repair cost after a 10 mph collision is less than $1000. The company uses a standard α = 0.05 as its significance level. From 5 crash tests conducted by the auto manufacturer, the costs of repairs were: $1050, $400, $720, $600, and $1100.
• Step 1: H0: µ ≥ $1000 (mean cost is too high); Ha: µ < $1000 (mean cost is low enough to insure)
• Step 2: The test statistic follows the t distribution: t = (x̄ − µ0)/(s/√n), where µ0 is the hypothesized mean of $1000 and s/√n is the standard error
• For this example, x̄ = $774, s = $298, n = 5, df = 5 − 1 = 4
• tobs = (774 − 1000)/(298/√5) = −1.696
• Let α = 0.05
• Step 3: p-value = P(t < −1.70, df = 4) > 0.05 from the t table for 4 df
• Step 4: Since the p-value > the α value of 0.05, we cannot reject H0. Mean damage is not significantly less than $1000, hence no insurance will be given for this kind of car.
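A sketch of example 2 in Python with scipy, which reproduces the hand calculation above:

```python
from scipy import stats

costs = [1050, 400, 720, 600, 1100]    # repair costs from the 5 crash tests
mu0, alpha = 1000, 0.05

# H0: mu >= $1000 vs Ha: mu < $1000 (lower-tail test)
# alternative= requires scipy >= 1.6
t_obs, p_value = stats.ttest_1samp(costs, popmean=mu0, alternative='less')
print(f"t = {t_obs:.3f}, p-value = {p_value:.4f}")   # t ~ -1.70, p ~ 0.08
if p_value >= alpha:
    print("Cannot reject H0: mean cost not significantly below $1000")
```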
Two-Sample Comparisons – Proportions and means
• Compares one sample with another to determine whether there is a statistically significant difference between the two samples
• Our objective is to estimate the true difference: p1 − p2 or µ1 − µ2
• The significance of the result depends on:
– the size of the difference between p1 and p2 or µ1 and µ2 (larger difference, easier to reject H0)
– the size of the samples n1 and n2 (larger sample sizes, easier to reject H0)
– the size of α (higher value, easier to reject H0)

Two-Sample Comparisons – Proportions and means
• H0 is always: there is no difference between the proportions or means of the two samples (the difference equals zero)
• Ha is again one of 3 possibilities:
– There is a difference (≠)
– One is greater than the other (>)
– One is less than the other (<)

Difference between Proportions – Summary for CI
• p1 and p2 are the population success proportions
• Proportion of successes in each sample: p̂1 = x1/n1 and p̂2 = x2/n2
• The estimator is p̂1 − p̂2, a normally distributed variable when n is large (when np > 5)
• Since the 2 populations are assumed independent, the variances of our estimates are additive: Var(p̂1 − p̂2) = p1(1 − p1)/n1 + p2(1 − p2)/n2
• CI: (p̂1 − p̂2) ± z(α/2)·√( p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2 )
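A minimal sketch of this confidence interval in Python; the counts x1, n1, x2, n2 below are hypothetical values chosen only for illustration (they are not from the slides):

```python
import math
from scipy import stats

x1, n1 = 60, 100          # hypothetical: 60 successes out of 100
x2, n2 = 45, 90           # hypothetical: 45 successes out of 90
p1, p2 = x1 / n1, x2 / n2

z = stats.norm.ppf(0.975)                 # z(alpha/2) = 1.96 for a 95% CI
# independent samples, so the variances of the two estimates add:
se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
diff = p1 - p2
print(f"95% CI for p1 - p2: ({diff - z*se:.3f}, {diff + z*se:.3f})")
```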
A note on Paired comparisons... remember
• Make sure the differences are approximately normally distributed before using this test.

Goodness-of-fit Tests
• Normality tests/plots – easy to do in Minitab
H0: Data are normally distributed
Ha: Data are not normally distributed
• If the p-value < 0.05, we reject the null hypothesis (i.e. the data are not normally distributed)
• If the p-value > 0.05, the data can be considered normally distributed
• Remember to check your data!!!
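Outside Minitab, the same H0/Ha logic can be applied with, for example, the Shapiro–Wilk test in scipy (Minitab's normality test typically uses the Anderson–Darling statistic instead). The sample below is a randomly generated placeholder:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
data = rng.normal(loc=10, scale=2, size=30)   # placeholder sample

# H0: data are normally distributed; Ha: data are not normally distributed
stat, p_value = stats.shapiro(data)
print(f"W = {stat:.3f}, p-value = {p_value:.3f}")
if p_value < 0.05:
    print("Reject H0: data cannot be considered normally distributed")
else:
    print("Cannot reject H0: treat the data as approximately normal")
```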
Probability Plots and Chi-Squared Test
• Minitab can also do goodness-of-fit tests for other continuous distributions such as extreme-value, Weibull, etc. This is done using probability plots
• For discrete distributions, the Chi-Squared test can be used to test whether a Binomial or Poisson distribution fits the data, and we use the test statistic: χ² = Σ (observed − expected)² / expected
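As an illustration of the chi-squared statistic above, here is a sketch testing a Poisson fit to hypothetical defect-count data (the counts are invented for the example; λ is estimated from the data, which costs one extra degree of freedom):

```python
import numpy as np
from scipy import stats

# hypothetical data: number of units with 0, 1, 2, and 3+ defects
observed = np.array([32, 38, 20, 10])
n = observed.sum()
lam = (0*32 + 1*38 + 2*20 + 3*10) / n    # Poisson mean, treating "3+" as 3

# expected frequencies under Poisson(lam); last cell pools P(X >= 3)
p = [stats.poisson.pmf(k, lam) for k in range(3)]
p.append(1 - sum(p))
expected = n * np.array(p)

# ddof=1 because one parameter (lam) was estimated from the data
chi2, p_value = stats.chisquare(observed, expected, ddof=1)
print(f"chi-squared = {chi2:.2f}, p-value = {p_value:.3f}")
```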
Application to Quality
• It is impossible to judge the validity of guesses; scientific estimates must be applied to data to verify the validity of statements
• Make sure that statistical inferences are not misused – often statistical significance doesn't provide a result that is meaningful – there is no substitute for common sense, and great statistics mean little if you're solving the wrong problem!

Regression Analysis – Introduction

Regression Analysis – Introduction
• So far, we have looked at statistical methods that deal with observations involving a single variable
• Many problems of quality improvement involve relations among several variables
• We are going to look at statistical methods that apply to simultaneous observations made on multiple variables
• We will look at simple linear regression, correlation, multiple regression, analysis of residuals, and their applications to quality management

Regression Analysis – Introduction
• A widely used method for studying relationships between multiple variables is regression analysis
• Examples of where this is used are everywhere in the product development and manufacturing environment
• The relationships between price and sales, weight and speed, geometry and drag are some examples of where regression analysis could be used to solve problems of interest to quality managers and engineers

Why perform Regression?
• We use regression in order to:
– Learn something about the relationship between two variables
– Remove a portion of the variation in one variable in order to gain a better understanding of the remaining variation
– Estimate or predict values of one variable based on another variable

Variable Relationships: Monotonic vs. Nonmonotonic
• Regression and correlation deal with describing/quantifying relationships between random variables
• Types of Relationships:
• Monotonic: variable relationships that have no reversals in slope (a linear relationship is a special case of a monotonic relationship where the slope is constant over x)
• Nonmonotonic: variable relationships where the slope reverses

Variable Relationships: Linear vs. Non-linear
• Linear functions satisfy the following properties:
Additivity (aka the superposition property): f(x + y) = f(x) + f(y) (the net response to two or more factors is the sum of the responses caused by each factor)
Homogeneity of degree 1: f(ax) = af(x) for all a
• Nonlinear: a system which does not satisfy the above principles (often used to represent natural phenomena)

Simple Linear Regression
• Simple linear regression: the use of statistical methods to find the 'best fit' linear relationship between two variables
• The value of a given random variable may depend upon the value of another (non-random) variable
• In simple regression we are only dealing with a single relationship between two variables
• In multiple regression, we are dealing with multiple relationships between variables

Simple Linear Regression
• The non-random variable, x, also called the independent or control variable, is fixed at certain known values without error in order to observe its effect on the values of the random variable
• The random variable, y, also called the response or dependent variable, is the variable we observe during the experiment; we are interested in knowing how the value of y is affected by changes in the value of x

Simple Linear Regression continued
• We make the assumptions that y is a value of a random variable whose mean is a linear function of x and that the standard deviation of y is constant and independent of x, resulting in the following summary equation: y = α + βx + ε
– α and β are parameters we estimate to find the line of best fit to a set of n observations, (xi, yi)
– ε is a value of a random variable with mean zero and standard deviation σ

Simple Linear Regression continued
• The fitted line has the equation ŷ = a + bx, which estimates the predicted value of y when the independent variable takes on the value x; a and b are estimates of α and β
• The errors associated with using predicted values of y (according to the regression line equation) to estimate the y value for a given xi are called residuals: ei = yi − ŷi

Regression Lines
• We minimize the total spread of the y values from the line by looking at all the squared y distances from the line
• The regression or least-squares line is the line with the smallest SSE – an aggregate measure of how much the line's predicted ŷi differ from the actual yi:
SSE = Σ (yi − ŷi)²  (sum over i = 1, ..., n)
Sum of Squares Expression Abbreviations
• SSxx: Sum of squares around the mean's x-value – measures the spread of the xi:
SSxx = Σ (xi − x̄)²
• SSyy: Sum of squares around the mean's y-value – measures the spread of the yi:
SSyy = Σ (yi − ȳ)²
• SSxy: Cross product (used with SSxx to determine b):
SSxy = Σ (xi − x̄)(yi − ȳ)

Regression line fit – components of variability in y-values
• SSE measures how much the predicted values of y differ from the actual y values
• SSR measures the total variability due to the regression (predicted y values): SSR = Σ (ŷi − ȳ)²
• SSE/SSyy is the proportion of error relative to the total spread
• SSR/SSyy = 1 − SSE/SSyy = R² is the proportion of variability accounted for by the regression
Method of Least Squares
• The method of least squares is used to find the line of 'best fit' by minimizing the errors/residuals between the predicted and observed values of the dependent variable y
• We can then use these sums of squares to find the values of a and b, the parameters of the regression line equation
• We minimize the sum of squared residuals by taking partial derivatives of the sum of squared residuals and setting the results to zero, which gives a pair of equations (the normal equations) that can be solved for a and b:
b = SSxy/SSxx, a = ȳ − b·x̄

Example: scrap material length & weight

Length xi (mm) | Weight yi (g) | (xi − x̄) | (yi − ȳ) | (xi − x̄)² | (yi − ȳ)² | (xi − x̄)(yi − ȳ)
60 | 84 | −8 | −56 | 64 | 3136 | 448
62 | 95 | −6 | −45 | 36 | 2025 | 270
64 | 140 | −4 | 0 | 16 | 0 | 0
66 | 155 | −2 | 15 | 4 | 225 | −30
68 | 119 | 0 | −21 | 0 | 441 | 0
70 | 175 | 2 | 35 | 4 | 1225 | 70
72 | 145 | 4 | 5 | 16 | 25 | 20
74 | 197 | 6 | 57 | 36 | 3249 | 342
76 | 150 | 8 | 10 | 64 | 100 | 80
Sum: 612 | 1260 | | | SSxx = 240 | SSyy = 10426 | SSxy = 1200
Mean: x̄ = 68 | ȳ = 140

Example: scrap material length & weight
• From the tabulated data we can calculate a and b:
b = SSxy/SSxx = 1200/240 = 5
a = ȳ − b·x̄ = 140 − 5(68) = −200
[Fitted line plot: weight vs. length (x from 60 to 80 mm) with the regression line ŷ = −200 + 5x]
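A sketch of the same calculation in Python with numpy, reproducing the tabulated sums of squares and the fitted line:

```python
import numpy as np

length = np.array([60, 62, 64, 66, 68, 70, 72, 74, 76])           # xi (mm)
weight = np.array([84, 95, 140, 155, 119, 175, 145, 197, 150])    # yi (g)

x_bar, y_bar = length.mean(), weight.mean()            # 68 and 140
ss_xx = np.sum((length - x_bar) ** 2)                  # 240
ss_xy = np.sum((length - x_bar) * (weight - y_bar))    # 1200

b = ss_xy / ss_xx            # slope: 1200/240 = 5
a = y_bar - b * x_bar        # intercept: 140 - 5*68 = -200
print(f"fitted line: y-hat = {a:.0f} + {b:.0f}x")
```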
How well does our regression line fit?
• The goodness of fit depends on the size of SSE relative to the total spread of the data
• We can quantify this fit by apportioning (assigning a weighting to) the variability in y between the regression (predicted values) and the actual measured values

ANOVA Table

L (mm) xi | Wt (g) yi | ŷi | (ŷi − ȳ) | (ŷi − ȳ)² REGRESSION | (yi − ŷi) | (yi − ŷi)² ERROR
60 | 84 | 100 | −40 | 1600 | −16 | 256
62 | 95 | 110 | −30 | 900 | −15 | 225
64 | 140 | 120 | −20 | 400 | 20 | 400
66 | 155 | 130 | −10 | 100 | 25 | 625
68 | 119 | 140 | 0 | 0 | −21 | 441
70 | 175 | 150 | 10 | 100 | 25 | 625
72 | 145 | 160 | 20 | 400 | −15 | 225
74 | 197 | 170 | 30 | 900 | 27 | 729
76 | 150 | 180 | 40 | 1600 | −30 | 900
Sample mean: 68 | 140 | | | SSR = 6000 | | SSE = 4426
SSyy = SSR + SSE = 10426

Minitab: Fitted Line Plot
• Minitab: Regression or Fitted Line Plot – can also use Excel
• Gives the regression equation, fits the line, offers a choice of different types of functions, and can also show confidence intervals and prediction intervals
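Continuing the example, a short sketch that reproduces the ANOVA decomposition in the table above and confirms SSyy = SSR + SSE:

```python
import numpy as np

length = np.array([60, 62, 64, 66, 68, 70, 72, 74, 76])
weight = np.array([84, 95, 140, 155, 119, 175, 145, 197, 150])
y_bar = weight.mean()                  # 140
y_hat = -200 + 5 * length              # predicted values from the fitted line

ssr = np.sum((y_hat - y_bar) ** 2)     # regression sum of squares: 6000
sse = np.sum((weight - y_hat) ** 2)    # error sum of squares: 4426
ss_yy = np.sum((weight - y_bar) ** 2)  # total: 10426 = SSR + SSE
print(f"SSR = {ssr}, SSE = {sse}, SSyy = {ss_yy}, R^2 = {ssr/ss_yy:.3f}")
```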
A quick note on correlation
• The degree of linearity between two variables is defined by a scale-invariant measure of the proportion of the total spread in y accounted for by the regression:
R² = SSR/SSyy = 1 − SSE/SSyy  (the sample estimate of ρ²)

Necessary Assumptions for Least Squares Regression
• Correct model form: y is linearly related to x
• Data used to fit the model are representative of the data of interest
• Variance of the residuals is constant (it does not depend on x)
• Residuals are independent (not dependent on time)
• The residuals are normally distributed

Analysis of Residuals
• An analysis of residuals helps to determine if the data are adequately described by the fitted equation
• Several types of graphs are helpful in residual analysis:
– Normal scores plot of the residuals
– Plotting residuals against the predicted values of y
– Plotting residuals against the run numbers

Analysis of Residuals – Residual scatter plots
• Residual scatter plots help to uncover violations of the assumptions underlying the regression model
• Procedure: Plot the residuals ei against the predicted y values (the error is assumed to be independent of y)
• Errors in the equation form show up as curvilinear trends

Analysis of Residuals
• A random scatterplot indicates that the model assumptions for Least Squares Regression hold
• Patterns indicate a problem with one of the initial assumptions for Least Squares Regression

Analysis of Residuals – Plotting residuals against run numbers
• A plot of residuals against integers reflecting the order of sampling can also reveal (unwanted) patterns or trends
• What types of factors do you think this could reveal?

Predicting Mean Response
• We are also interested in predicting the mean response of y at a fixed value of x
• The confidence interval for the mean response ŷ0 = a + bx0 is:
ŷ0 ± t(α/2, n−2) · s · √( 1/n + (x0 − x̄)²/SSxx ), where s = √( SSE/(n − 2) )

Predicting new (unmeasured) values of y
• The prediction interval for a new individual ynew with observed xnew is:
ŷnew ± t(α/2, n−2) · s · √( 1 + 1/n + (xnew − x̄)²/SSxx )
• Note: Error increases farther from the sample mean because changing the slope of the regression line creates a larger difference at points on the line further away from the mean
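A sketch of both intervals for the scrap-material example, using the standard interval formulas above (the slides' formula images are missing, so these are the textbook expressions); x0 = 65 mm is an arbitrary illustrative value:

```python
import numpy as np
from scipy import stats

length = np.array([60, 62, 64, 66, 68, 70, 72, 74, 76])
weight = np.array([84, 95, 140, 155, 119, 175, 145, 197, 150])
n, x_bar = len(length), length.mean()
ss_xx = np.sum((length - x_bar) ** 2)
y_hat = -200 + 5 * length                               # fitted line
s = np.sqrt(np.sum((weight - y_hat) ** 2) / (n - 2))    # residual std. dev.

x0 = 65                                # illustrative new x value (mm)
y0 = -200 + 5 * x0
t_crit = stats.t.ppf(0.975, df=n - 2)  # 95% two-sided intervals
ci = t_crit * s * np.sqrt(1/n + (x0 - x_bar) ** 2 / ss_xx)
pi = t_crit * s * np.sqrt(1 + 1/n + (x0 - x_bar) ** 2 / ss_xx)
print(f"mean response at x0: {y0 - ci:.1f} to {y0 + ci:.1f} g")
print(f"new observation:     {y0 - pi:.1f} to {y0 + pi:.1f} g")
```

Note how the prediction interval is wider than the confidence interval, and both widen as x0 moves away from x̄, as the slide explains.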
Testing linear relationships
• To examine whether the relationship is linear, look at the slope of the fitted line by means of a scatter plot showing the independent variable on the x axis and the dependent variable on the y axis
• To ensure the relationship is linear, we need to test whether b, the slope of the best-fit line, is significantly different from zero
• H0: the regression line slope equals zero
• Ha: the regression line slope does not equal zero

Test for significance of a linear regression
• We want to determine whether y is in fact linearly related to x (i.e. does the slope of the linear regression equation differ significantly from zero?)
• If the slope of the fitted least-squares line is significantly different from zero, there is a linear relation between x and y at the chosen level of significance
• Note that two variables may have a strong relationship even when there is no linear relationship between them

Test for significance of a linear regression – residual variance
• The least-squares slope estimate, b, has an error related to the standard deviation of the residuals
• The variance of the residuals can be estimated by the mean squared deviation of the y values of the data points from the least-squares line: s² = SSE/(n − 2)

Test for significance of a linear regression – t test value
• Assuming that the residuals (ei) are values of a random variable having the normal distribution, we can calculate a t value
• The t value can be compared with table values of t(α/2) to determine if the slope is significantly different from zero at the significance level α
• t value for testing linearity: t = b / (s/√SSxx), with n − 2 degrees of freedom

Confidence limits for β
• Using methods similar to those we used to find confidence limits for population means and proportions, we can also find confidence limits for the regression line slope
• 1 − α Confidence Limits for β: b ± t(α/2, n−2) · s/√SSxx
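Applying this slope test to the scrap-material example (b = 5, SSxx = 240, SSE = 4426, n = 9), a quick sketch:

```python
import numpy as np
from scipy import stats

n, b = 9, 5                       # sample size and fitted slope
ss_xx, sse = 240, 4426            # from the example tables
s = np.sqrt(sse / (n - 2))        # residual std. dev. ~ 25.1

t = b / (s / np.sqrt(ss_xx))              # ~ 3.08
t_crit = stats.t.ppf(0.975, df=n - 2)     # ~ 2.365 at alpha = 0.05
print(f"t = {t:.2f}, t_crit = {t_crit:.3f}")
if abs(t) > t_crit:
    print("Reject H0: the slope differs significantly from zero")
```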
Correlation Analysis
• Correlation Analysis determines the degree of linear interrelation or association between 2 random variables
• All correlation measures (ρ) are dimensionless and scaled to lie in the range: −1 ≤ ρ ≤ 1
• For uncorrelated data, ρ = 0

Correlation Analysis
• Correlation does not provide evidence for a causal relationship between two variables
• Two variables may be correlated because one causes the other (ex. mixing produces heat) or because they both share a common factor that influences both variables (ex. two solutes measured are influenced by variations in the source of water)

Difference between Correlation and Regression
• Correlation of: x vs. y = y vs. x
• Regression of: x vs. y ≠ y vs. x
• In regression, a line is fitted to explain y from x or x from y, and the relationship cannot be reversed unless the fit is perfect
• A regression line can be used to predict y values

Coefficient of Correlation (r)
• A coefficient of correlation quantifies and tests the strength of a monotonic relationship between two variables x and y
• Pearson's correlation coefficient, r, measures the linear correlation/association between x and y
• r is invariant to changes in scale, and has a dimensionless value obtained by dividing the deviations of the random variables from their means by their respective sample standard deviations

Coefficient of correlation
• This scale-invariant measure of the strength of a linear regression can be constructed by comparing the variability "explained" by the regression to the original variability of y
• For random variables: ρ = Cov(x, y) / (σx·σy)
• For samples: r = SSxy / √(SSxx·SSyy)

Correlation's relation to the Least-Squares line
• The strength of a linear relationship can be measured by the slope of the least squares regression line, but this measure depends on the magnitude of the slope, i.e. the scale of measurement for y
• A scale-invariant measure of the linear relationship can be constructed by comparing the variability explained by the regression to the original variability of the y values, resulting in the population correlation coefficient

What does r tell us about the least-squares fit line?
• If r takes on a positive value close to 1, the data points (xi, yi) are nearly collinear, and the slope of the corresponding least-squares line is positive
• If r takes on a negative value close to −1, the data points (xi, yi) are nearly collinear, and the slope of the corresponding least-squares line is negative
• If r takes on a value of zero, there is little or no linear relationship between x and y

Remember our previous example
• The squared correlation, r², is the proportion of the total SSyy accounted for by the regression: r² = SSR/SSyy = 6000/10426 ≈ 0.58
• Comparing the variance of the residuals to the original variance of the y values shows the proportion of the original variability in y that is removed (explained) by the regression

Coefficient of correlation
• The correlation coefficient can form the basis of a statistical test of independence
• H0: ρ = 0 (the yi are independent, identically distributed normal rvs, not dependent on the xi)
• Ha: ρ ≠ 0
• The test statistic t is defined as: t = r·√(n − 2) / √(1 − r²)
• H0 is rejected if |t| > tcrit at n − 2 degrees of freedom

Significance Tests and Confidence Intervals for r
• For our test statistic, we use the t-distribution, where t is defined as above: t = r·√(n − 2) / √(1 − r²)
• We reject H0 if |t| > tcrit
• tcrit is the value of the t-distribution with n − 2 d.f. and a probability of exceedance of α/2
• In-class example
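In the spirit of the in-class example, here is the test applied to the scrap-material data: scipy's pearsonr returns both r and the p-value directly, and the hand statistic t = r√(n−2)/√(1−r²) matches the slope test from earlier:

```python
import numpy as np
from scipy import stats

length = np.array([60, 62, 64, 66, 68, 70, 72, 74, 76])
weight = np.array([84, 95, 140, 155, 119, 175, 145, 197, 150])

r, p_value = stats.pearsonr(length, weight)       # r ~ 0.759
print(f"r = {r:.3f}, p-value = {p_value:.4f}")    # p ~ 0.018 < 0.05

n = len(length)
t = r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)      # ~ 3.08, same t as slope test
print(f"t = {t:.2f}")
```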
Application to Quality
• Quality-related issues are often the result of the combined effects of multiple variables that do not act independently
• Regression analysis is useful for screening variables: if a variable is not correlated with another and causes little change in y, it can probably be excluded from the analysis
• Problems associated with correlated variables, non-measured variables and equation form are often overlooked – bad idea!