### week9

Course: LING 120, Fall 2008
School: UPenn
120 LING Introduction to Speech Analysis Fall 2007 Week 9 Speech analysis IV: Variation and statistical techniques (II) Oct. 29 - Nov. 2, 2007

#### Hypothesis testing

Steps for hypothesis testing:

1. Formulate your hypotheses:
   - You need a null hypothesis (H0) and an alternative hypothesis (HA).
2. Calculate the test statistic:
   - The test statistic summarizes the difference between the data and the null hypothesis.
3. Find the p-value for the test statistic:
   - How probable is your data if the null hypothesis is true?

Basic plan: determine whether the data you currently have would be plausible if H0 were true. If so, H0 cannot be rejected. If the current data are not plausible under H0, H0 should be rejected.

- Type I error (false positive, α error): the null hypothesis is true but is rejected.
- Type II error (false negative, β error): the null hypothesis is false but we fail to reject it.

LING 521 Introduction to Speech Analysis, Fall 2007

#### Hypothesis testing: one sample mean

To test whether the population mean is equal to $\mu_0$ (H0: $\mu = \mu_0$), we can compare the distance between the sample mean and the hypothesized population mean with the standard deviation of the sample mean, $\sigma/\sqrt{n}$:

$$Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}$$

- If $|Z| \geq 1.96$, for example, the sample mean is about two standard deviations (of the sample mean) away from the hypothesized population mean. If H0 were true, there would be only a small chance (5%) of this happening, so we can reject the null hypothesis.

After calculating Z, we can look up the p-value in the N(0,1) table (Z has a standard normal distribution if the distribution of X is normal). If the p-value is smaller than .05 or .01 (whichever level we set in advance), we reject the null hypothesis.

#### t distribution

If the population standard deviation $\sigma$ is unknown, we use the standard error of the sample mean as an estimate of $\sigma/\sqrt{n}$:

$$SE(\bar{X}) = \frac{s}{\sqrt{n}}$$

Therefore:

$$T = \frac{\bar{X} - \mu_0}{s/\sqrt{n}}$$

T has a t distribution with $n - 1$ degrees of freedom.
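
The Z and T formulas above can be sketched in a few lines of Python. This is only an illustration: the function names and the speech-rate values are invented here, and the known population standard deviation of 0.25 is an assumption, not a figure from the course data. The p-value is computed only for Z, because Python's standard library has a normal distribution but no t distribution; for T you would still use a t table with n - 1 degrees of freedom.

```python
import math
from statistics import NormalDist, mean, stdev


def z_statistic(sample, mu0, sigma):
    """Z = (sample mean - mu0) / (sigma / sqrt(n)), with sigma assumed known."""
    n = len(sample)
    return (mean(sample) - mu0) / (sigma / math.sqrt(n))


def two_sided_p_from_z(z):
    """Two-sided p-value for Z under the standard normal N(0, 1)."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def t_statistic(sample, mu0):
    """T = (sample mean - mu0) / (s / sqrt(n)); returns (T, degrees of freedom)."""
    n = len(sample)
    standard_error = stdev(sample) / math.sqrt(n)  # s uses the n - 1 denominator
    return (mean(sample) - mu0) / standard_error, n - 1


# Hypothetical speech rates (syllables per second), invented for illustration.
rates = [4.1, 3.8, 4.5, 4.0, 4.3, 3.9, 4.4, 4.2]

z = z_statistic(rates, mu0=4.0, sigma=0.25)  # sigma = 0.25 is an assumed value
p = two_sided_p_from_z(z)
t, df = t_statistic(rates, mu0=4.0)

print(f"Z = {z:.3f}, two-sided p = {p:.4f}")
print(f"T = {t:.3f} with {df} degrees of freedom")
```

Note that `two_sided_p_from_z(1.96)` comes out very close to .05, which is where the 1.96 cutoff in the text comes from.
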
The t distribution looks like a normal distribution, but has thicker tails controlled by the degrees of freedom:

- The smaller the degrees of freedom, the thicker the tails.
- If the degrees of freedom are large enough (large sample size), the t distribution is practically identical to the normal distribution.

#### One sample t test

H0: $\mu = \mu_0$

Procedure:

1. Calculate:

   $$T = \frac{\bar{X} - \mu_0}{s/\sqrt{n}}$$

2. Look up the p-value in the t distribution table ($n - 1$ degrees of freedom).
3. If the p-value is smaller than the α-level we are willing to accept (normally .05 or .01), we reject the null hypothesis; otherwise, we cannot reject the null hypothesis (which does not mean that we have proved it!).

Question: one-sided (one-tailed) or two-sided (two-tailed)?

#### One sided vs. two-sided

The interval of possible t scores for which the null hypothesis would be rejected is called the rejection region (the red area in the slide's figure).

We should usually use a two-sided test.

#### An example

[Slide figure not preserved.]

#### Two sample t test

We want to compare the means of two populations:

H0: $\mu_1 = \mu_2$, or equivalently, $\mu_1 - \mu_2 = 0$

We can generalize the T formula used in the one sample t test:

$$T = \frac{\text{Observed value of YY} - \text{Expected value of YY under } H_0}{\text{Standard error of YY}}$$

- In the one sample t test, YY is the sample mean (estimating the population mean).
- In the two sample t test, YY is the difference between the two sample means (estimating the difference between the two population means).

#### t test in R

[Slide figure not preserved.]

#### An example

[Slide figure not preserved.]

#### One way ANOVA

What if there are more than two groups? For example, if we want to know whether topic has an effect on speech rate, we need to compare the means of different topics (four in our data set).
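
As a rough illustration of the generalized T formula for the two sample t test described above, the sketch below computes the observed difference between two sample means, subtracts its expected value under H0 (zero), and divides by a standard error. The slides do not say which standard error is used; this sketch assumes the unpooled (Welch) form, $\sqrt{s_1^2/n_1 + s_2^2/n_2}$. The function name and both data sets are invented for illustration, and the degrees of freedom needed for a p-value are not computed here.

```python
import math
from statistics import mean, variance


def two_sample_t(sample_a, sample_b):
    """T = (observed difference - expected difference under H0) / SE(difference).

    Assumption: unpooled (Welch) standard error; the source does not specify one.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    observed_difference = mean(sample_a) - mean(sample_b)
    expected_under_h0 = 0.0  # H0: mu_1 - mu_2 = 0
    standard_error = math.sqrt(variance(sample_a) / n_a + variance(sample_b) / n_b)
    return (observed_difference - expected_under_h0) / standard_error


# Hypothetical speech rates for two groups of speakers, invented for illustration.
group_a = [4.1, 3.8, 4.5, 4.0, 4.3]
group_b = [3.6, 3.9, 3.5, 3.8, 3.7]

t_value = two_sample_t(group_a, group_b)
print(f"T = {t_value:.3f}")
```

Swapping the two groups only flips the sign of T, which is why a two-sided test looks at its absolute value.
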
If we do a t test on each pair (six pairs in total), each with its individual false positive rate (Type I error: wrongly rejecting the null hypothesis) controlled at α (= .05), then the total false positive rate may be as high as $1 - (1 - .05)^6 = .265$ (too high!).

If there are five independent parts and each fails 5% of the time, then the machine as a whole fails $1 - (0.95)^5 \approx 23\%$ of the time! [Graph from: John Dziak, Penn State U.]

ANOVA can be used to compare many groups at the same time. The name ANalysis Of VAriance (ANOVA) comes from the way the procedure uses variances to decide whether the means are different. A better acronym for this model would be ANOVASMAD (analysis of variance to see if means are different)!

The idea is simple:

- If all the groups have the same mean (the grand mean), then the between group variance (differences between group means) should not be very different from the within group variance (differences within the groups).
- Otherwise, the between group variance is larger than the within group variance.

MSB: between group variance. MSW: within group variance.

We reject the null hypothesis, so we know that not all the topics have the same speech rate.

#### Correlation

Is there a relationship between two variables (e.g., age and speech rate)? What is the strength of this relationship?

[Figure: three scatter plots of Y against X, labeled positive correlation, negative correlation, and no correlation.]

Do two variables change together? How can we measure this?

Covariance: the degree to which two variables vary together.

$$\operatorname{cov}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$

- When X increases and Y increases: cov(x, y) is positive.
- When X increases and Y decreases: cov(x, y) is negative.
- When there is no consistent relationship: cov(x, y) = 0.

Covariance depends on the size of the data's standard deviations: if they are large, the value will be greater than if they are small.

Pearson's r standardizes the covariance value by dividing the covariance by the product of the standard deviations of X and Y:

$$r = \frac{\operatorname{cov}(x, y)}{s_x s_y}$$

#### An example

[Slide figures not preserved.]

We know that AGE and RATE are correlated, but how? That is the job of regression.

#### Regression

The goal of regression is to describe the relationship between:

- Y: response, dependent, outcome, etc., and
- X1, X2, ..., Xk: explanatory, independent, predictors, etc.

Simple linear regression describes the situation where there is only a single independent variable X, and the mean response is:

$$\mu_Y = ax + b$$

- b: mean response when x = 0 (intercept)
- a: change in mean response when x increases by 1 unit (slope)

b and a are unknown parameters (like $\mu$).

Goal: Find estimates ...
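
Returning to the covariance and Pearson's r definitions above, here is a minimal sketch of both. The function names and the age and speech-rate values are invented for illustration and are not the course data. The last lines re-express age in months to show the point made in the text: the covariance changes with the scale of the data, while r does not.

```python
import math
from statistics import mean


def covariance(xs, ys):
    """Sample covariance: sum of (x_i - x_bar)(y_i - y_bar), divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


def pearson_r(xs, ys):
    """Covariance divided by the product of the two sample standard deviations."""
    s_x = math.sqrt(covariance(xs, xs))  # cov(x, x) is the sample variance of x
    s_y = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (s_x * s_y)


# Hypothetical ages (years) and speech rates (syllables per second).
ages = [20, 30, 40, 50, 60]
rates = [4.6, 4.4, 4.3, 4.0, 3.9]

cov_years = covariance(ages, rates)
r_years = pearson_r(ages, rates)

# Same ages expressed in months: covariance scales by 12, r is unchanged.
ages_in_months = [age * 12 for age in ages]
cov_months = covariance(ages_in_months, rates)
r_months = pearson_r(ages_in_months, rates)

print(f"cov (years)  = {cov_years:.3f}, r = {r_years:.3f}")
print(f"cov (months) = {cov_months:.3f}, r = {r_months:.3f}")
```
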

