
# 15Bootstrap - the bootstrap


## The bootstrap

- We've seen a lot about inference on $C\beta$ in an nGM (or nAitken) model.
- A huge number of questions can be answered by an appropriate choice of $C$.
- The key point is that $C\hat\beta$ is a linear function of $y$.
- What if the quantity of interest is not a linear function of $y$?
- This next block of material explores techniques for inference (tests, interval estimation) for arbitrary functions of the data.
- Focus on the general approach called the bootstrap.
- Many details/extensions/variations; only present what I find most useful.

© 2011 Dept. Statistics (Iowa State University), Stat 511 section 15

## Examples

- Process optimization
- Many physical processes can be described (at least approximately) as quadratic functions of input variable(s),
  e.g. a very large computation, done using parallel computing on $n$ processors.
- Mean clock time required can be approximated by a quadratic function, $Y = \beta_0 + \beta_1 n + \beta_2 n^2$.
- Optimum number of processors is $N_{opt} = -\beta_1/(2\beta_2)$.
- Can estimate by $\hat N_{opt} = -\hat\beta_1/(2\hat\beta_2)$.
- But what about a confidence interval for $N_{opt}$? Or a test of $N_{opt} = k$?

## Inference on $N_{opt}$

The options for inference (ci or test) on $N_{opt}$ include:

- Asymptotic approximation:
  - Delta method approximation for $\operatorname{Var} \hat N_{opt}$
  - Normal approximation for $(\hat N_{opt} - N_{opt})/\sqrt{\operatorname{Var} \hat N_{opt}}$
- Construct a test from an equivalent linear hypothesis:
  - $N_{opt} = k \Rightarrow -\beta_1/(2\beta_2) = k \Rightarrow \beta_1 = -2k\beta_2 \Rightarrow \beta_1 + 2k\beta_2 = 0$
  - This is a $C\beta$ test for specified $k$.
  - Invert the test to find the confidence interval: the 95% ci for $N_{opt}$ is the set of $k$ for which Ho: $N_{opt} = k$ is not rejected at $\alpha = 0.05$.
  - Confidence interval computation is not trivial!
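The delta method option can be written out explicitly. What follows is a sketch, assuming the usual first-order Taylor expansion of $g(\beta_1, \beta_2) = -\beta_1/(2\beta_2)$; the variances and covariance of $\hat\beta_1$ and $\hat\beta_2$ are assumed to come from the fitted regression, and every term is evaluated at the estimates.

$$
\frac{\partial g}{\partial \beta_1} = -\frac{1}{2\beta_2}, \qquad
\frac{\partial g}{\partial \beta_2} = \frac{\beta_1}{2\beta_2^2}
$$

$$
\operatorname{Var} \hat N_{opt} \approx
\frac{\operatorname{Var} \hat\beta_1}{4\beta_2^2}
+ \frac{\beta_1^2 \operatorname{Var} \hat\beta_2}{4\beta_2^4}
- \frac{\beta_1 \operatorname{Cov}(\hat\beta_1, \hat\beta_2)}{2\beta_2^3}
$$

With the normal approximation, an approximate 95% interval would then be $\hat N_{opt} \pm 1.96 \sqrt{\operatorname{Var} \hat N_{opt}}$.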
## Examples - 2

- Seed germinability, $G$, in long-term storage may be approximated by a quadratic function of time in storage, $T$: $G = \beta_0 + \beta_1 T + \beta_2 T^2$.
- Want to estimate $T_{85}$, the time when germination drops to 85%.
- This is the $T$ for which $0.85 = \beta_0 + \beta_1 T + \beta_2 T^2$.
- Solution is either:
  - one of the values $\left(-\beta_1 \pm \sqrt{\beta_1^2 - 4(\beta_0 - 0.85)\beta_2}\right)/(2\beta_2)$,
  - or doesn't exist.
- Can estimate $T_{85}$ from $\hat\beta_0$, $\hat\beta_1$, and $\hat\beta_2$, but what about inference?
  - No linearization here.
  - Delta method ignores 'does not exist'.

## Examples - 3

- The "Law School" data, Efron and Tibshirani, 1993.
- Correlation between average LSAT and average GPA for students entering law school.
- Data are from a sample of 15 law schools.
- Sample correlation coefficient = 0.78.
- Want a 95% ci for the correlation, $\rho$.
- If the population is bivN, use Fisher's z transform for inference on $\rho$.
  - Knowledge of law schools: population is not likely to be bivN!
  - Data: sample is unlikely to come from a bivN population.
- So how do you construct a confidence interval?

[Figure: scatterplot of GPA against LSAT score for the sampled law schools]

## A general framework for frequentist statistical inference

- Want to estimate $\theta$ for a population.
- The population is described by $F$, its distribution function.
- $F$ might be specified parametrically, e.g. $N(\mu, \sigma^2)$.
- We call $\theta$ a parameter. It could be, but doesn't have to be, a distribution parameter.
- Examples of possible $\theta$'s:
  - standard deviation
  - coefficient of variation
  - a quantile of the population, e.g. the median
  - the proportion of the population with values exceeding $k$
  - the correlation between components of a bivariate distribution
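Each parameter in the list above has a natural sample version computed from the data. The following is a minimal sketch, not part of the original notes: the function names `plug_in_estimates` and `correlation` and the five-value sample are illustrative assumptions, and only the Python standard library is used.

```python
import math
import statistics


def plug_in_estimates(y, k):
    """Sample versions of several parameters theta, computed from data y."""
    mean = statistics.fmean(y)
    sd = statistics.stdev(y)  # sample sd (n - 1 divisor)
    return {
        "sd": sd,
        "cv": sd / mean,                  # coefficient of variation
        "median": statistics.median(y),   # the 0.5 quantile
        "prop_exceeding_k": sum(v > k for v in y) / len(y),
    }


def correlation(x, y):
    """Sample correlation between paired observations x and y."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


y = [3.4, 1.2, 8.5, 7.6, 4.9]  # small illustrative sample (an assumption)
est = plug_in_estimates(y, k=4)
print(est)
```

None of these estimators is a linear function of the data, which is why the linear-model machinery for $C\beta$ does not apply to them directly.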
## Framework - 2

- Key concept: $\theta = t(F)$, where $\theta$ is a real value and $t$ is a known function that maps distribution functions to $\mathbb{R}$.
- We don't know $F$ or $\theta$.
- We have a sample of data from $F$: $Y_1, Y_2, \ldots, Y_n \overset{\text{i.i.d.}}{\sim} F$.
- Our estimator is $\hat\theta = s(Y_1, Y_2, \ldots, Y_n)$, where $s()$ is a known function.
- Example: $\theta$ is the mean of a univariate distribution, $t(F) = \int y \, dF$;
  a reasonable estimator is $s(Y) = \sum_{i=1}^n Y_i/n = \bar Y$.
- Some practical questions about the estimator $s()$:
  - Is it unbiased, i.e. is $E\,\hat\theta = \theta$?
  - What is the variability, $\operatorname{Var} \hat\theta$? How can I estimate $\operatorname{Var} \hat\theta$?
  - What is the sampling distribution of $\hat\theta$?
- The last question is crucial for tests and confidence intervals.

## Theoretical answers

- In some cases, theoretical statistics provides exact answers.
  - If $F \sim N(\mu, 1)$, $\hat\theta = \bar y \sim N(\mu, 1/n)$.
  - Many quantities associated with normal distributions have exact answers.
- In other cases, theory provides approximate answers.
  - Asymptotic approximations are justified by properties of the estimator as $n$ becomes increasingly large.
- In many cases, need to consider other approaches. Possible approaches include:
  - Sampling, when the population is known
  - Parametric bootstrap, when the population form is known
  - Nonparametric bootstrap, when only the sample is known

## Sampling

- Suppose our population is the bivariate distribution of average LSAT score and average GPA for students entering a major law school in the US.
- There are 82 major law schools in the US.
- $F$ is the discrete distribution that assigns probability 1/82 to each of the 82 (LSAT, GPA) pairs.
- Our interest is $\theta$, the correlation between LSAT and GPA across law schools.
- We compute $\hat\theta$ as the sample correlation from a sample of 15 law schools (the sample is drawn without replacement).
- What is the distribution of $\hat\theta$?
- Because the population, $F$, is known, we can use sampling.
  - Repeatedly draw a sample of 15 law schools without replacement from $F$, and compute $\hat\theta$ from each sample.
  - The empirical distribution of $\hat\theta$ is a Monte-Carlo estimate of the sampling distribution of $\hat\theta$.

[Figure: density of the sample correlation, n = 15]

- Practical problem: sampling requires knowing the population.
  - Why estimate $\theta$ in that case?
  - Usually have only one sample from the population.
- The bootstrap (either parametric or nonparametric) provides a way to estimate the sampling distribution of $\hat\theta$ without knowing the population.

## The bootstrap approach

- Use $y_1, y_2, \ldots, y_n \overset{\text{i.i.d.}}{\sim} F$ to estimate:
  - $\theta$ by $\hat\theta = s(y_1, y_2, \ldots, y_n)$, and
  - $F$ by $\hat F$, where $\hat F$ is estimated from the sample $\{y_i\}$.
- Nonparametric bootstrap: $\hat F$ is the empirical cdf of the observations in the sample.
- Parametric bootstrap: $\hat F$ is a specified distribution (e.g. N) with parameters estimated from the sample.
- The bootstrap concept: repeatedly draw a sample of 15 law schools from $\hat F$, and compute $\hat\theta$ from each sample.
- The empirical distribution of these $\hat\theta$ values is a bootstrap estimate of the sampling distribution of $\hat\theta$.

## Empirical cumulative distribution functions (eCDF)

- If using the nonparametric bootstrap, $\hat F(x) = \frac{1}{n} \sum_{i=1}^n 1(y_i \le x)$, where $1(y_i \le x)$ is 1 if $y_i \le x$ and 0 otherwise.
- This $\hat F(x)$ is the empirical cumulative distribution function.
- The eCDF assigns probability $1/n$ to each $y_i$, if all $y_i$ are unique.
- If the sample is 3.4, 1.2, 8.5, 7.6, 4.9, $\hat F$ assigns probability 1/5 to each of the values.
  $\hat F(4) = P[y_i \le 4] = 2/5 = 0.4$. $\hat F(4.9) = P[y_i \le 4.9] = 3/5 = 0.6$.
- If the sample is 3.4, 5.6, 3.4, 7.9, 8.6, $\hat F$ assigns probability 1/5 to the values 5.6, 7.9 and 8.6 and probability 2/5 to the value 3.4.
  $\hat F(1.5) = 0$, $\hat F(3.5) = 0.4$, $\hat F(5.5) = 0.4$, $\hat F(5.6) = 0.6$, $\hat F(8) = 0.8$, $\hat F(9) = 1$.

## Nonparametric bootstrap

- If using a nonparametric bootstrap, draw bootstrap samples with replacement from the data sample.
  - Some observations occur 2 (or even more) times in a bootstrap sample.
  - Other observations don't occur in a bootstrap sample.
- Some bootstrap samples, described by the number of times obs. $i$ occurs in the bootstrap sample:

| Data pt. | boot. samp. 1 | boot. samp. 2 | boot. samp. 3 | boot. samp. 4 |
|---:|---:|---:|---:|---:|
| 1 | 1 | 0 | 1 | 0 |
| 2 | 0 | 0 | 6 | 2 |
| 3 | 1 | 1 | 1 | 0 |
| 4 | 2 | 1 | 1 | 3 |
| 5 | 2 | 1 | 0 | 2 |
| 6 | 1 | 0 | 1 | 0 |
| 7 | 2 | 1 | 0 | 1 |
| 8 | 2 | 1 | 0 | 1 |
| 9 | 0 | 1 | 2 | 1 |
| 10 | 0 | 1 | 2 | 1 |
| 11 | 1 | 2 | 0 | 0 |
| 12 | 1 | 2 | 1 | 1 |
| 13 | 1 | 1 | 0 | 1 |
| 14 | 1 | 1 | 0 | 1 |
| 15 | 0 | 1 | 0 | 0 |

[Figure: density of the correlation in the bootstrap samples]

## Uses for a bootstrap distribution

- $\hat F_B(\hat\theta)$ is the bootstrap estimate of the sampling distribution of $\hat\theta$, expressed as an empirical CDF.
- What can you do with it?
- Estimate the standard error of $\hat\theta$:
  - standard error = estimated sd of a statistic
  - estimate the se by the sd of the $\hat\theta_b$
  - for the 15 law schools, the se of the correlation is 0.13

## Bias corrected estimators

- Estimate the bias of an estimator: $\operatorname{Bias}(\hat\theta) = E_F(\hat\theta) - \theta$.
- The bootstrap estimate of bias is $E_{\hat F}(\hat\theta) - \hat\theta$.
- With $B$ bootstrap samples, each providing $\hat\theta_b$, the estimated bias is $\frac{1}{B} \sum_b \hat\theta_b - \hat\theta$.
- For the 15 sample law school data, $\frac{1}{B} \sum_b \hat\theta_b = 0.7696$.
- Correlation in original sample: $\hat\theta = 0.7764$.
- Estimated bias: $0.7696 - 0.7764 = -0.0068$.
- The sample correlation is slightly too low, on average.
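The standard error and bias calculations above can be sketched in a few lines. This is an illustration under stated assumptions, not the code behind the slides: the 15 (x, y) pairs are simulated stand-ins for a data sample (they are not the law school data), the seed is arbitrary, and the helper names `corr` and `bootstrap_replicates` are hypothetical.

```python
import math
import random
import statistics


def corr(pairs):
    """Sample correlation for a list of (x, y) pairs."""
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def bootstrap_replicates(data, stat, B, rng):
    """Draw B samples of size n with replacement; return stat for each."""
    n = len(data)
    return [stat([data[rng.randrange(n)] for _ in range(n)]) for _ in range(B)]


rng = random.Random(511)  # arbitrary seed, for reproducibility only

# Hypothetical (x, y) pairs standing in for a data sample of 15 units.
data = []
for _ in range(15):
    x = rng.gauss(600, 40)
    data.append((x, 0.005 * x + rng.gauss(0, 0.2)))

theta_hat = corr(data)                          # estimate from the data sample
reps = bootstrap_replicates(data, corr, B=2000, rng=rng)

se = statistics.stdev(reps)                     # bootstrap standard error
bias = statistics.fmean(reps) - theta_hat       # bootstrap estimate of bias
print(f"estimate {theta_hat:.3f}, se {se:.3f}, bias {bias:.4f}")
```

Resampling whole (x, y) pairs, rather than x and y separately, keeps the dependence between the two variables intact in every bootstrap sample.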
## Bias correction

- Logic of a bias corrected estimator:
  - If you know that an estimator is biased, it is tempting to reduce or remove the bias.
  - The bootstrap suggests the sample correlation is too low by 0.0068 (estimated bias $-0.0068$).
  - A "better" estimate of the correlation is $0.7764 - \text{bias} = 0.783$.
- Bias corrected estimator:
  $\tilde\theta = \hat\theta - \operatorname{Bias}(\hat\theta) = \hat\theta - \left(\frac{1}{B} \sum_b \hat\theta_b - \hat\theta\right) = 2\hat\theta - \frac{1}{B} \sum_b \hat\theta_b$
- Bias correction usually increases variability because $\operatorname{Bias}(\hat\theta)$ is an estimate.
- Use $\operatorname{MSE}(t) = E(t - \theta)^2 = \operatorname{Var} t + \operatorname{Bias}(t)^2$ to compare possibly biased estimators.
- Details depend on the estimator, but often have $\operatorname{MSE}(\tilde\theta) > \operatorname{MSE}(\hat\theta)$.

## Bootstrap confidence intervals

- The most common use of a bootstrap sample is to estimate a $1 - \alpha$ confidence interval.
- Many, many different ways to do this. Will talk about the two I find most useful in practice.
- Percentile bootstrap confidence intervals:
  - Endpoints of the $1 - \alpha$ confidence interval are the $\alpha/2$ and $1 - \alpha/2$ quantiles of $\hat F_B(\hat\theta)$,
    e.g. the 95% confidence interval is $\left(\hat F_B^{-1}(0.025), \hat F_B^{-1}(0.975)\right)$.
  - For the 15 sample law school data: (0.45, 0.96).
- This interval is range respecting: the endpoints of the ci must fall within the range of the estimates, so within the valid range of the parameter.
  - Avoids, e.g., a 95% confidence interval of (0.97, 1.02) for a proportion.
- Tends to have poor coverage when $\operatorname{Var} \hat\theta$ is a function of $\theta$.
- Various extensions, e.g. Bias Corrected and BCa bootstraps, attempt to improve the coverage.

## Percentile-t = Studentized bootstrap

- The studentized bootstrap deals explicitly with $\operatorname{Var} \hat\theta = f(\theta)$.
- Calculate $\hat\theta_i$ and $\operatorname{Var} \hat\theta_i$ for each bootstrap sample.
- Calculate $T_i$ from each sample, where $T_i = \dfrac{\hat\theta_i - \hat\theta}{\sqrt{\operatorname{Var} \hat\theta_i}}$.
- This is the t-statistic quantifying the departure of $\hat\theta_i$, for each bootstrap sample, from the $\hat\theta$ of the original data, on a unit-less scale.
- In a standard problem, $\dfrac{\hat\theta - \theta}{\sqrt{\operatorname{Var} \hat\theta}}$ has a theoretical t distribution.
- The empirical CDF of $T$ is the bootstrap estimate of the distribution of $\dfrac{\hat\theta - \theta}{\sqrt{\operatorname{Var} \hat\theta}}$.

## Studentized bootstrap confidence intervals

- Find the $\alpha/2$ and $1 - \alpha/2$ quantiles of $T$: $T_{\alpha/2}$ and $T_{1-\alpha/2}$.
- The lower and upper bounds of the $1 - \alpha$ ci are
  $\hat\theta - \sqrt{\operatorname{Var} \hat\theta}\; T_{1-\alpha/2}$ and $\hat\theta - \sqrt{\operatorname{Var} \hat\theta}\; T_{\alpha/2}$.
- N.B. the bounds are "backwards". For a 95% ci, the lower bound uses minus the 0.975 quantile; the upper bound uses minus the 0.025 quantile.
  - This matches the careful construction of a Student's t interval.
  - It matters because the distribution of $T$ is unlikely to be symmetric.
- Characteristics:
  - Is not range respecting.
  - Adjusts for variation in $\operatorname{Var} \hat\theta$.
  - But requires at least an approximate estimate of $\operatorname{Var} \hat\theta$: delta method, jackknife resampling.
  - Variety of simulation studies: empirical coverage much closer to nominal in difficult problems.

## Studentized bootstrap for correlation

- How to get $\operatorname{Var} \hat\theta$? What do we know that might help?
- Fisher's Z transformation: $Z(r) = \frac{1}{2} \log \frac{1+r}{1-r}$.
- If $(X, Y)$ is bivariate normal, $Z(r)$ is approximately normal with $\operatorname{Var} Z(r) \approx 1/(n - 3)$.
- Calculate $T$ from $Z$ and $\operatorname{Var} Z$ for each bootstrap sample.
- If the population is far from bivN, expect $T$ to be far from normal.
- Calculate a studentized interval for $Z(r)$ and back transform.
- Result: (0.13, 1.56) on the z scale, (0.13, 0.92) for the correlation. Shifted down from the percentile interval.
- The BCa interval, which uses different info to adjust for Var, is (0.32, 0.94), not as shifted as the Studentized interval.
- In the problems I have worked on, Studentized intervals have better empirical coverage than BCa intervals.
- Pictures on the next two slides.

[Figure: density of the bootstrap T values]

[Figure: normal QQ plot of the bootstrap T values, sample quantiles against theoretical quantiles]
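The percentile interval and the studentized interval on Fisher's z scale, as described above, can be sketched as follows. This is an illustration only: the data are simulated stand-ins (not the law school data), the seed is arbitrary, and the helper names `corr`, `fisher_z`, `quantile` and `bootstrap_intervals` are hypothetical. The code takes $\operatorname{Var} Z = 1/(n-3)$ for every bootstrap sample, which is an assumption borrowed from the bivariate normal case.

```python
import math
import random
import statistics


def corr(pairs):
    """Sample correlation for a list of (x, y) pairs."""
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def fisher_z(r):
    """Fisher's Z transformation of a correlation r."""
    return 0.5 * math.log((1 + r) / (1 - r))


def quantile(sorted_vals, p):
    """Quantile by linear interpolation between order statistics."""
    pos = p * (len(sorted_vals) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def bootstrap_intervals(pairs, B, alpha, rng):
    """Percentile and studentized (z scale) intervals for a correlation."""
    n = len(pairs)
    z_hat = fisher_z(corr(pairs))
    se_z = math.sqrt(1 / (n - 3))  # assumed sd of Z; same for every resample
    r_reps, t_reps = [], []
    while len(r_reps) < B:
        sample = [pairs[rng.randrange(n)] for _ in range(n)]
        try:
            r_b = corr(sample)
            z_b = fisher_z(r_b)
        except (ZeroDivisionError, ValueError):
            continue  # degenerate resample (e.g. |r| = 1); draw again
        r_reps.append(r_b)
        t_reps.append((z_b - z_hat) / se_z)
    r_reps.sort()
    t_reps.sort()
    percentile = (quantile(r_reps, alpha / 2), quantile(r_reps, 1 - alpha / 2))
    # Bounds are "backwards": the lower bound uses the upper quantile of T.
    z_lower = z_hat - se_z * quantile(t_reps, 1 - alpha / 2)
    z_upper = z_hat - se_z * quantile(t_reps, alpha / 2)
    studentized = (math.tanh(z_lower), math.tanh(z_upper))  # back transform
    return percentile, studentized


rng = random.Random(23)  # arbitrary seed
data = []                # hypothetical pairs, not the law school data
for _ in range(15):
    x = rng.gauss(600, 40)
    data.append((x, 0.005 * x + rng.gauss(0, 0.2)))

percentile_ci, studentized_ci = bootstrap_intervals(data, B=2000, alpha=0.05, rng=rng)
print("percentile:", percentile_ci, "studentized:", studentized_ci)
```

The back transform uses `math.tanh`, the inverse of Fisher's Z, so the reported interval always lies inside (-1, 1) even though the studentized interval on the z scale is not range respecting.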
## Practical issues

- How many bootstrap samples?
  - To estimate bias or se: 100-1000.
  - To estimate a ci: at least 1000. 5,000 and 10,000 are now common.
  - Need even more if interested in extreme quantiles.
  - It is (relatively) easy to determine the Monte-Carlo variation in a ci.
- How to draw the bootstrap samples?
  - Ordinary bootstrap: simple random sample for each sample.
  - Balanced bootstrap: ensure that each obs occurs a total of R times across the bootstrap samples.
  - Regression bootstrap: if the X values are fixed by design, ensure all bootstrap samples have the same X values.
    Fit the model, estimate residuals, estimate predicted values at each X. Bootstrap the residuals and add them to the predicted values to get (X, Y).

## Confidence intervals for optimum # processors

- Plot of time vs. # processors with fitted quadratic regression on the next page.
- Estimated model is $\widehat{\text{time}} = 5.0772 - 0.4704 N + 0.0143 N^2$.
- Estimated $N_{opt}$ is $-(-0.4704)/(2 \times 0.0143) = 16.45$.
- Want a 99% confidence interval for $N_{opt}$.
- Nonparametric bootstraps: Percentile: (14.9, 21.2), Studentized: (12.5, 21.6).
- Parametric bootstraps: Percentile: (14.4, 22.8), Studentized: (13.9, 22.1).
- Which do you report? Some plots that might help you decide follow.

[Figure: time against n (number of processors), with fitted quadratic regression]

[Figure: density of the bootstrap values of Nopt]

[Figure: density of the parametric bootstrap values of Nopt]

[Figure: residuals against predicted values]
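The regression bootstrap described under practical issues (fit the model, resample residuals, add them back to the predicted values) can be sketched for the optimum-processors problem. This is a rough illustration under stated assumptions: the data are simulated from the fitted curve reported above with an assumed error sd of 0.3, so they are not the course data, and `fit_quadratic` and `residual_bootstrap_nopt` are hypothetical helper names. The interval uses a crude order-statistic quantile.

```python
import random


def fit_quadratic(x, y):
    """Least-squares fit of y = b0 + b1*x + b2*x^2 via the normal equations."""
    cols = [[1.0] * len(x), [float(v) for v in x], [float(v) ** 2 for v in x]]
    # Augmented 3x4 system (X'X | X'y).
    A = [[sum(a * b for a, b in zip(ci, cj)) for cj in cols]
         + [sum(a * b for a, b in zip(ci, y))] for ci in cols]
    # Gauss-Jordan elimination with partial pivoting.
    for i in range(3):
        p = max(range(i, 3), key=lambda r: abs(A[r][i]))
        A[i], A[p] = A[p], A[i]
        for r in range(3):
            if r != i:
                f = A[r][i] / A[i][i]
                A[r] = [a - f * b for a, b in zip(A[r], A[i])]
    return [A[i][3] / A[i][i] for i in range(3)]


def residual_bootstrap_nopt(x, y, B, rng):
    """Bootstrap residuals with the X values held fixed; return Nopt values."""
    b = fit_quadratic(x, y)
    fitted = [b[0] + b[1] * v + b[2] * v * v for v in x]
    resid = [yi - fi for yi, fi in zip(y, fitted)]
    reps = []
    for _ in range(B):
        y_star = [f + rng.choice(resid) for f in fitted]  # same X, new Y
        b_star = fit_quadratic(x, y_star)
        reps.append(-b_star[1] / (2 * b_star[2]))
    return -b[1] / (2 * b[2]), reps


rng = random.Random(15)   # arbitrary seed
x = list(range(2, 26, 2))  # hypothetical processor counts 2, 4, ..., 24
# Hypothetical times: the reported fitted curve plus assumed N(0, 0.3^2) noise.
y = [5.0772 - 0.4704 * n + 0.0143 * n * n + rng.gauss(0, 0.3) for n in x]

B = 1000
nopt_hat, reps = residual_bootstrap_nopt(x, y, B, rng)
reps.sort()
lower = reps[int(0.025 * (B - 1))]  # crude 95% percentile interval
upper = reps[int(0.975 * (B - 1))]
print(f"Nopt estimate {nopt_hat:.2f}, 95% percentile ci ({lower:.2f}, {upper:.2f})")
```

Because $N_{opt}$ is a ratio, a bootstrap sample with $\hat\beta_2$ near zero produces an extreme value, which is one reason to look at the bootstrap distribution before choosing which interval to report.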
## QQ plot of residuals

[Figure: normal QQ plot of the residuals, sample quantiles against theoretical quantiles]

## Nonparametric bootstrap

[Figure: sd of Nopt against Nopt from each nonparametric bootstrap sample]
## Parametric bootstrap

[Figure: sd of Nopt against Nopt from each parametric bootstrap sample]