The bootstrap
We've seen a lot about inference on $C\beta$ in a nGM (or nAitken) model.
A huge number of questions can be answered by appropriate choice of $C$.
The key point is that $C\hat\beta$ is a linear function of $y$.
What if the quantity of interest is not a linear function of $y$?
This next block of material explores techniques for inference (tests, interval estimation) for arbitrary functions of the data.
Focus on the general approach called the bootstrap.
Many details/extensions/variations; I only present what I find most useful.

© 2011 Dept. Statistics (Iowa State University), Stat 511 section 15

Examples
Process optimization
Many physical processes can be described (at least approximately) as quadratic functions of input variable(s).
e.g. a very large computation, done using parallel computing on $n$ processors.
Mean clock time required can be approximated by a quadratic function, $Y = \beta_0 + \beta_1 n + \beta_2 n^2$.
Optimum number of processors is $N_{opt} = -\beta_1/(2\beta_2)$.
Can estimate by $\hat N_{opt} = -\hat\beta_1/(2\hat\beta_2)$.
But what about a confidence interval for $N_{opt}$? Or a test of $N_{opt} = k$?

Examples 2
Seed germinability, $G$, in long-term storage may be approximated by a quadratic function of time in storage, $T$: $G = \beta_0 + \beta_1 T + \beta_2 T^2$.
Want to estimate $T_{85}$, the time when germination drops to 85%.
This is the $T$ for which $0.85 = \beta_0 + \beta_1 T + \beta_2 T^2$.
Solution is either:
  one of the values $\left(-\beta_1 \pm \sqrt{\beta_1^2 - 4(\beta_0 - 0.85)\beta_2}\right)/(2\beta_2)$,
  or doesn't exist.
Can estimate $T_{85}$ from $\hat\beta_0$, $\hat\beta_1$, and $\hat\beta_2$, but what about inference?
  No linearization here.
  Delta method ignores 'does not exist'.

Inference on Nopt
The options for inference (ci or test) on $N_{opt}$ include:
Asymptotic approximation:
  Delta method approximation for $\mathrm{Var}\,\hat N_{opt}$
  Normal approximation for $(\hat N_{opt} - N_{opt})/\sqrt{\mathrm{Var}\,\hat N_{opt}}$
Construct a test from an equivalent linear hypothesis:
  $N_{opt} = k \;\Rightarrow\; -\beta_1/(2\beta_2) = k \;\Rightarrow\; -\beta_1 = 2k\beta_2 \;\Rightarrow\; -\beta_1 - 2k\beta_2 = 0$
  This is a $C\beta$ test for specified $k$.
  Invert the test to find the confidence interval:
  the 95% ci for $N_{opt}$ is the set of $k$ for which Ho: $N_{opt} = k$ is not rejected at $\alpha = 0.05$.
  Confidence interval computation is not trivial!
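A minimal sketch of the asymptotic option for $N_{opt}$, assuming the usual first-order delta method: $\mathrm{Var}\,\hat N_{opt} \approx g'\Sigma g$, where $g$ is the gradient of $-\beta_1/(2\beta_2)$ with respect to $(\beta_1, \beta_2)$ and $\Sigma$ is the covariance matrix of $(\hat\beta_1, \hat\beta_2)$. The coefficient values are the estimates reported for the processor example later in these notes; the variances and covariance are made up purely for illustration, and the function names are hypothetical.

```python
import math

def nopt(b1, b2):
    """Vertex of b0 + b1*n + b2*n**2 (a minimum when b2 > 0)."""
    return -b1 / (2.0 * b2)

def delta_method_var(b1, b2, v11, v12, v22):
    """First-order (delta method) variance of -b1/(2*b2).

    v11 = Var(b1), v22 = Var(b2), v12 = Cov(b1, b2).
    """
    g1 = -1.0 / (2.0 * b2)        # d Nopt / d b1
    g2 = b1 / (2.0 * b2 ** 2)     # d Nopt / d b2
    return g1 * g1 * v11 + 2.0 * g1 * g2 * v12 + g2 * g2 * v22

# Estimates from the processor example; covariance entries are HYPOTHETICAL.
b1_hat, b2_hat = -0.4704, 0.0143
v11, v12, v22 = 0.0016, -4.8e-5, 1.6e-6

nopt_hat = nopt(b1_hat, b2_hat)
var_nopt = delta_method_var(b1_hat, b2_hat, v11, v12, v22)
se_nopt = math.sqrt(var_nopt)

# Normal-approximation 95% interval: estimate +/- 1.96 * se.
lo, hi = nopt_hat - 1.96 * se_nopt, nopt_hat + 1.96 * se_nopt
print(round(nopt_hat, 2))  # 16.45
print(lo, hi)
```

The same recipe cannot be applied blindly to $T_{85}$: the gradient is only defined when the quadratic has a real root, which is the 'does not exist' case the delta method ignores.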
Examples 3: the "Law School" data (Efron and Tibshirani, 1993)
Correlation between average LSAT and average GPA for students entering law school.
Data are from a sample of 15 law schools.
Sample correlation coefficient = 0.78.
Want a 95% ci for the correlation.
If the population is bivN, use Fisher's z transform for inference on the correlation.
Knowledge of law schools: population is not likely to be bivN!
Data: sample is unlikely to come from a bivN population.
So how do you construct a confidence interval?

[Figure: scatterplot of GPA against LSAT score for the 15 law schools]

A general framework for frequentist statistical inference
Want to estimate $\theta$ for a population.
The population is described by $F$, its distribution function.
$F$ might be specified parametrically, e.g. $N(\mu, \sigma^2)$.
Key concept: $\theta = t(F)$, where $\theta$ is a real value and $t$ is a known function that maps distribution functions to $\mathbb{R}$.
We don't know $F$ or $\theta$.
We have a sample of data from $F$: $Y_1, Y_2, \ldots, Y_n \overset{i.i.d.}{\sim} F$.
Our estimator is $\hat\theta = s(Y_1, Y_2, \ldots, Y_n)$, where $s()$ is a known function.
Example: $\theta$ is the mean of a univariate distribution, $t(F) = \int y \, dF$;
a reasonable estimator is $s(Y) = \sum_{i=1}^{n} Y_i/n = \bar Y$.

Framework 2
We call $\theta$ a parameter. It could be, but doesn't have to be, a distribution parameter.
Examples of possible $\theta$'s:
  standard deviation
  coefficient of variation
  a quantile of the population, e.g. the median
  the proportion of the population with values exceeding $k$
  the correlation between components of a bivariate distribution
Some practical questions about the estimator $s()$:
  Is it unbiased, i.e. is $E\,\hat\theta = \theta$?
  What is the variability, $\mathrm{Var}\,\hat\theta$?
  How can I estimate $\mathrm{Var}\,\hat\theta$?
  What is the sampling distribution of $\hat\theta$?
The last question is crucial for tests and confidence intervals.
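To make the notation concrete, here is a small sketch of several possible $s()$ functions, each mapping a sample to an estimate $\hat\theta$. The function names and the five-value sample are illustrative assumptions, not part of the course material.

```python
import math
import statistics

def coef_of_variation(y):
    """Sample sd divided by sample mean."""
    return statistics.stdev(y) / statistics.mean(y)

def prop_exceeding(y, k):
    """Proportion of sample values strictly greater than k."""
    return sum(1 for v in y if v > k) / len(y)

def correlation(x, y):
    """Pearson sample correlation of paired values."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

sample = [3.4, 1.2, 8.5, 7.6, 4.9]   # illustrative values only
estimates = {
    "mean": statistics.mean(sample),
    "sd": statistics.stdev(sample),
    "cv": coef_of_variation(sample),
    "median": statistics.median(sample),
    "prop_gt_4": prop_exceeding(sample, 4),
}
for name, value in estimates.items():
    print(name, value)
```

Each of these is a plug-in estimator: the same function $t$ applied to the sample instead of the population. None of them says anything yet about bias, variability, or the sampling distribution.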
Theoretical answers
In some cases, theoretical statistics provides exact answers.
  If $F \sim N(\mu, 1)$, $\hat\theta = \bar y \sim N(\mu, 1/n)$.
  Many quantities associated with normal distributions have exact answers.
In other cases, theory provides approximate answers.
  Asymptotic approximations are justified by properties of the estimator as $n$ becomes increasingly large.
In many cases, need to consider other approaches. Possible approaches include:
  Sampling, when the population is known
  Parametric bootstrap, when the population form is known
  Nonparametric bootstrap, when only the sample is known

Sampling
Suppose our population is the bivariate distribution of average LSAT score and average GPA for students entering a major law school in the US.
There are 82 major law schools in the US.
$F$ is the discrete distribution that assigns probability 1/82 to each of the 82 (LSAT, GPA) pairs.
Our interest is the correlation between LSAT and GPA across law schools.
We compute $\hat\theta$ as the sample correlation from a sample of 15 law schools (sample is without replacement).
What is the distribution of $\hat\theta$?
Because the population, $F$, is known, we can use sampling:
  Repeatedly draw a sample of 15 law schools without replacement from $F$, compute $\hat\theta$ from each sample.
  The empirical distribution of $\hat\theta$ is a Monte Carlo estimate of the sampling distribution of $\hat\theta$.

[Figure: density histogram of the sample correlation, n = 15]

Practical problem: sampling requires knowing the population.
  Why estimate $\theta$ in that case?
  Usually have only one sample from the population.
The bootstrap (either parametric or nonparametric) provides a way to estimate the sampling distribution of $\hat\theta$ without knowing the population.
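The sampling approach for a fully known population can be sketched as below. The 82 (LSAT, GPA) pairs are simulated stand-ins, not the real law school population, and the seed, the number of repetitions (2000), and the function name `pearson` are arbitrary choices for illustration.

```python
import math
import random

def pearson(x, y):
    """Pearson sample correlation of paired values."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(82)  # arbitrary seed, for reproducibility

# HYPOTHETICAL population of 82 (LSAT, GPA) pairs; not the real data.
population = []
for _ in range(82):
    lsat = rng.gauss(600.0, 40.0)
    gpa = 3.1 + 0.005 * (lsat - 600.0) + rng.gauss(0.0, 0.15)
    population.append((lsat, gpa))

# theta = t(F): the correlation in the whole (known) population.
theta = pearson([p[0] for p in population], [p[1] for p in population])

# Monte Carlo estimate of the sampling distribution of theta_hat, n = 15.
reps = []
for _ in range(2000):
    sample = rng.sample(population, 15)   # 15 schools, without replacement
    reps.append(pearson([p[0] for p in sample], [p[1] for p in sample]))

print("population correlation:", theta)
print("mean of sample correlations:", sum(reps) / len(reps))
```

A histogram of `reps` plays the role of the sampling distribution figure; the point of the bootstrap is to get something similar when `population` is not available.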
The bootstrap approach
Use $y_1, y_2, \ldots, y_n \overset{i.i.d.}{\sim} F$ to estimate:
  $\theta$ by $\hat\theta = s(y_1, y_2, \ldots, y_n)$, and
  $F$ by $\hat F$, where $\hat F$ is estimated from the sample $\{y_i\}$.
Nonparametric bootstrap: $\hat F$ is the empirical cdf of the observations in the sample.
Parametric bootstrap: $\hat F$ is a specified distribution (e.g. N) with parameters estimated from the sample.
The bootstrap concept: repeatedly draw a sample of 15 law schools from $\hat F$, compute $\hat\theta$ from each sample.
The empirical distribution of $\hat\theta$ is a bootstrap estimate of the sampling distribution of $\hat\theta$.

Empirical cumulative distribution functions (eCDF)
If using the nonparametric bootstrap, $\hat F(x) = \frac{1}{n} \sum_{i=1}^{n} 1(y_i \le x)$, where $1(y_i \le x)$ is 1 if $y_i \le x$ and 0 otherwise.
This $\hat F(x)$ is the empirical cumulative distribution function.
The eCDF assigns probability $1/n$ to each $y_i$, if all $y_i$ are unique.
If the sample is 3.4, 1.2, 8.5, 7.6, 4.9, $\hat F$ assigns probability 1/5 to each of the values.
  $\hat F(4) = P[y_i \le 4] = 2/5 = 0.4$. $\hat F(4.9) = P[y_i \le 4.9] = 3/5 = 0.6$.
If the sample is 3.4, 5.6, 3.4, 7.9, 8.6, $\hat F$ assigns probability 1/5 to the values 5.6, 7.9 and 8.6 and probability 2/5 to the value 3.4.
  $\hat F(1.5) = 0$, $\hat F(3.5) = 0.4$, $\hat F(5.5) = 0.4$, $\hat F(5.6) = 0.6$, $\hat F(8) = 0.8$, $\hat F(9) = 1$.

Nonparametric bootstrap
If using a nonparametric bootstrap, draw bootstrap samples with replacement from the data sample.
Some observations occur 2 (or even more) times in a bootstrap sample.
Other observations don't occur in a bootstrap sample.
Some bootstrap samples, described by the number of times obs. i occurs in the bootstrap sample:

Data pt.:        1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
boot. samp. 1    1  0  1  2  2  1  2  2  0  0  1  1  1  1  0
boot. samp. 2    0  0  1  1  1  0  1  1  1  1  2  2  1  1  1
boot. samp. 3    1  6  1  1  0  1  0  0  2  2  0  1  0  0  0
boot. samp. 4    0  2  0  3  2  0  1  1  1  1  0  1  1  1  0

[Figure: density histogram of the correlation in bootstrap samples]
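A short sketch of the two ideas above: the eCDF as a function, checked against the two worked samples, and a nonparametric bootstrap draw summarized by how often each observation occurs. The function names and the seed are hypothetical; the counts printed will differ from the table because the random draws differ.

```python
import random
from collections import Counter

def ecdf(sample):
    """Return F_hat, where F_hat(x) is the proportion of sample values <= x."""
    n = len(sample)
    def F_hat(x):
        return sum(1 for y in sample if y <= x) / n
    return F_hat

F1 = ecdf([3.4, 1.2, 8.5, 7.6, 4.9])
F2 = ecdf([3.4, 5.6, 3.4, 7.9, 8.6])
print(F1(4), F1(4.9))            # 0.4 0.6
print(F2(3.5), F2(5.6), F2(9))   # 0.4 0.6 1.0

def resample_counts(n, rng):
    """Draw n indices with replacement; return how often each obs occurs."""
    counts = Counter(rng.randrange(n) for _ in range(n))
    return [counts[i] for i in range(n)]

rng = random.Random(15)  # arbitrary seed
for b in range(4):
    print("boot. samp.", b + 1, resample_counts(15, rng))
```

Sampling $n$ values from $\hat F$ and sampling $n$ indices with replacement are the same operation, which is why each row of counts sums to $n$.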
Uses for a bootstrap distribution
$\hat F_B(\hat\theta)$ is the bootstrap estimate of the sampling distribution of $\hat\theta$, expressed as an empirical CDF.
What can you do with it?
  Estimate the bias of an estimator.
  Estimate the standard error of $\hat\theta$:
    standard error = estimated sd of a statistic
    estimate the se by the sd of the $\hat\theta_b$
    for the 15 law schools, se of the correlation: 0.13

Bias corrected estimators
$\mathrm{Bias}(\hat\theta) = E_F(\hat\theta) - \theta$
Bootstrap estimate of bias is $E_{\hat F}(\hat\theta) - \hat\theta$.
If there are $B$ bootstrap samples, each providing $\hat\theta_b$, the estimated bias is $\frac{1}{B} \sum_b \hat\theta_b - \hat\theta$.
For the 15 sample law school data, $\frac{1}{B} \sum_b \hat\theta_b = 0.7696$.
Correlation in original sample: $\hat\theta = 0.7764$.
Estimated bias: $0.7696 - 0.7764 = -0.0068$.
Sample correlation is slightly too low, on average.

Bias correction
Logic of a bias corrected estimator:
  If you know that an estimator is biased, it is tempting to reduce or remove the bias.
  Bootstrap suggests the sample correlation is too low by 0.0068.
  "Better" estimate of the correlation is $0.7764 - \text{bias} = 0.783$.
Bias corrected estimator:
  $\tilde\theta = \hat\theta - \widehat{\mathrm{Bias}}(\hat\theta) = \hat\theta - \left(\frac{1}{B} \sum_b \hat\theta_b - \hat\theta\right) = 2\hat\theta - \frac{1}{B} \sum_b \hat\theta_b$
Bias correction usually increases variability because $\widehat{\mathrm{Bias}}(\hat\theta)$ is an estimate.
Use $\mathrm{MSE}(t) = E(t - \theta)^2 = \mathrm{Var}\,t + \mathrm{Bias}(t)^2$ to compare variability of possibly biased estimators.
Details depend on the estimator, but often have $\mathrm{MSE}(\tilde\theta) > \mathrm{MSE}(\hat\theta)$.

Bootstrap confidence intervals
Most common use of a bootstrap sample is to estimate a $1 - \alpha$ confidence interval.
Many, many different ways to do this. Will talk about the two I find most useful in practice.
Percentile bootstrap confidence intervals:
  Endpoints of the $1 - \alpha$ confidence interval are the $\alpha/2$ and $1 - \alpha/2$ quantiles of $\hat F_B(\hat\theta)$.
  e.g. the 95% confidence interval is $(\hat F_B^{-1}(0.025), \hat F_B^{-1}(0.975))$.
  For the 15 sample law school data, (0.45, 0.96).
  This interval is range respecting: endpoints of the ci must fall within the range of the estimates, so within the valid range of parameters.
  Avoids, e.g., a 95% confidence interval (0.97, 1.02) for a proportion.
  Tends to have poor coverage when $\mathrm{Var}\,\hat\theta$ is a function of $\theta$.
  Various extensions, e.g. Bias Corrected and BCa bootstraps, attempt to improve the coverage.
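A minimal sketch of these uses of the bootstrap distribution for the correlation: bias, standard error, bias corrected estimate, and percentile interval. The 15 (LSAT, GPA) pairs are taken to be the sample published in Efron and Tibshirani (1993); they reproduce the sample correlation of 0.7764 quoted above, but should be checked against the source. The seed, $B = 2000$, and the helper names are arbitrary, so the numbers will be close to, not equal to, the values on the slides.

```python
import math
import random
import statistics

LSAT = [576, 635, 558, 578, 666, 580, 555, 661, 651, 605, 653, 575, 545, 572, 594]
GPA = [3.39, 3.30, 2.81, 3.03, 3.44, 3.07, 3.00, 3.43, 3.36, 3.13,
       3.12, 2.74, 2.76, 2.88, 2.96]

def pearson(x, y):
    """Pearson sample correlation of paired values."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def quantile(sorted_vals, p):
    """Linear-interpolation quantile of an already sorted list."""
    pos = p * (len(sorted_vals) - 1)
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])

theta_hat = pearson(LSAT, GPA)
n, B = len(LSAT), 2000
rng = random.Random(2011)  # arbitrary seed

# Nonparametric bootstrap: resample schools (pairs) with replacement.
boot = []
for _ in range(B):
    idx = [rng.randrange(n) for _ in range(n)]
    boot.append(pearson([LSAT[i] for i in idx], [GPA[i] for i in idx]))

boot_mean = statistics.mean(boot)
bias = boot_mean - theta_hat          # (1/B) sum theta_b - theta_hat
se = statistics.stdev(boot)           # sd of the theta_b
theta_tilde = theta_hat - bias        # = 2*theta_hat - boot_mean

boot.sort()
ci = (quantile(boot, 0.025), quantile(boot, 0.975))   # 95% percentile ci

print("theta_hat", theta_hat)
print("bias", bias, "se", se, "bias corrected", theta_tilde)
print("95% percentile ci", ci)
```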
Percentile-t = Studentized bootstrap
Studentized bootstrap confidence intervals:
The studentized bootstrap deals explicitly with $\mathrm{Var}\,\hat\theta = f(\theta)$.
Calculate $\hat\theta_i$ and $\mathrm{Var}\,\hat\theta_i$ for each bootstrap sample.
Calculate $T_i$ from each sample, where $T_i = (\hat\theta_i - \hat\theta)/\sqrt{\mathrm{Var}\,\hat\theta_i}$.
This is the t-statistic quantifying the departure of $\hat\theta_i$ for each bootstrap sample from $\hat\theta$ from the original data, on a unitless scale.
In a standard problem, $(\hat\theta - \theta)/\sqrt{\mathrm{Var}\,\hat\theta}$ has a theoretical t distribution.
The empirical CDF of $T$ is the bootstrap estimate of the distribution of $(\hat\theta - \theta)/\sqrt{\mathrm{Var}\,\hat\theta}$.
Find the $\alpha/2$ and $1 - \alpha/2$ quantiles of $T$: $T_{\alpha/2}$ and $T_{1-\alpha/2}$.
Lower and upper bounds of the $1 - \alpha$ ci are $\hat\theta - \sqrt{\mathrm{Var}\,\hat\theta}\; T_{1-\alpha/2}$ and $\hat\theta - \sqrt{\mathrm{Var}\,\hat\theta}\; T_{\alpha/2}$.
N.B. bounds are "backwards". For a 95% ci, the lower bound is minus the 0.975 quantile; the upper is minus the 0.025 quantile.
  This matches a careful construction of a Student's t interval.
  It matters because the distribution of $T$ is unlikely to be symmetric.
Characteristics:
  Is not range respecting.
  Adjusts for variation in $\mathrm{Var}\,\hat\theta$.
  But requires at least an approximate estimate of $\mathrm{Var}\,\hat\theta$: delta method, jackknife resampling.
  Variety of simulation studies: empirical coverage much closer to nominal in difficult problems.

Studentized bootstrap for correlation
How to get $\mathrm{Var}\,\hat\theta$? What do we know that might help?
Fisher's Z transformation: $Z(r) = \frac{1}{2} \log\frac{1 + r}{1 - r}$
If (X, Y) is bivariate normal, $Z(r)$ is approx. normal with Var $1/(n - 3)$.
Calculate $T$ from $Z$ and $\mathrm{Var}\,Z$ for each bootstrap sample.
If the pop. is far from bivN, expect $T$ far from normal.
Calculate a studentized interval for $Z(r)$ and back transform.
Result: (0.13, 1.56) on the z scale, (0.13, 0.92) for the correlation.
Shifted down from the percentile interval.
BCa interval, which uses different info to adjust for Var, is (0.32, 0.94), not as shifted as the Studentized interval.
In the problems I have worked on, Studentized intervals have better empirical coverage than BCa intervals.
Pictures on next two slides.

[Figure: density histogram of bootstrap T]
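The studentized interval on the Fisher Z scale can be sketched as follows, taking $\mathrm{Var}\,Z = 1/(n - 3)$ for the original sample and for every bootstrap sample (all have the same $n$). The 15 (LSAT, GPA) pairs are taken to be the sample published in Efron and Tibshirani (1993) and should be checked against the source. The seed, $B = 2000$, and the helper names are arbitrary, so the endpoints will only roughly match the (0.13, 0.92) reported above.

```python
import math
import random

LSAT = [576, 635, 558, 578, 666, 580, 555, 661, 651, 605, 653, 575, 545, 572, 594]
GPA = [3.39, 3.30, 2.81, 3.03, 3.44, 3.07, 3.00, 3.43, 3.36, 3.13,
       3.12, 2.74, 2.76, 2.88, 2.96]

def pearson(x, y):
    """Pearson sample correlation of paired values."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def quantile(sorted_vals, p):
    """Linear-interpolation quantile of an already sorted list."""
    pos = p * (len(sorted_vals) - 1)
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])

n = len(LSAT)
z_hat = math.atanh(pearson(LSAT, GPA))   # atanh(r) = 0.5*log((1+r)/(1-r))
se_z = math.sqrt(1.0 / (n - 3))          # bivariate-normal approximation

rng = random.Random(2011)  # arbitrary seed
t_vals = []
while len(t_vals) < 2000:
    idx = [rng.randrange(n) for _ in range(n)]
    if len(set(idx)) < 3:
        continue  # extremely rare degenerate resample: r undefined or +-1
    z_b = math.atanh(pearson([LSAT[i] for i in idx], [GPA[i] for i in idx]))
    t_vals.append((z_b - z_hat) / se_z)  # T_i on the z scale

t_vals.sort()
t_lo, t_hi = quantile(t_vals, 0.025), quantile(t_vals, 0.975)

# Bounds are "backwards": lower uses the 0.975 quantile, upper the 0.025.
z_lower = z_hat - se_z * t_hi
z_upper = z_hat - se_z * t_lo
r_lower, r_upper = math.tanh(z_lower), math.tanh(z_upper)  # back transform

print("z scale:", z_lower, z_upper)
print("correlation scale:", r_lower, r_upper)
```

Because the interval is built on the z scale and back transformed with tanh, the final endpoints stay inside (-1, 1) even though the studentized interval itself is not range respecting.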
[Figure: normal Q-Q plot of the bootstrap T values (sample quantiles against theoretical quantiles)]

Practical issues
How many bootstrap samples?
  To estimate bias or se: 100 to 1000.
  To estimate a ci: at least 1000; 5,000 and 10,000 are now common.
  Need even more if interested in extreme quantiles.
  It is (relatively) easy to determine the Monte Carlo variation in a ci.
How to draw the bootstrap samples?
  Ordinary bootstrap: simple random sample for each sample.
  Balanced bootstrap: ensure that each obs occurs a total of R times.
  Regression bootstrap: if X values are fixed by design, ensure all bootstrap samples have the same X values.
    Fit model, estimate residuals, estimate predicted values at each X.
    Bootstrap residuals, add to predicted values to get (X, Y).
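One way to draw a balanced bootstrap, sketched under the assumption that R is the number of bootstrap samples: pool R copies of the indices, shuffle the pool, and cut it into R samples of size n. The function name is hypothetical.

```python
import random
from collections import Counter

def balanced_bootstrap_indices(n, R, rng):
    """Return R index lists of length n; each index appears R times overall."""
    pool = list(range(n)) * R          # R copies of every observation index
    rng.shuffle(pool)                  # random permutation of the pool
    return [pool[r * n:(r + 1) * n] for r in range(R)]

rng = random.Random(26)  # arbitrary seed
samples = balanced_bootstrap_indices(15, 200, rng)

# Within a sample an observation may occur 0, 1, 2, ... times,
# but across all 200 samples every observation occurs exactly 200 times.
totals = Counter(i for s in samples for i in s)
print(samples[0])
print(sorted(totals.values()))
```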
Confidence intervals for optimum # processors
Plot of time vs. # processors with fitted quadratic regr. on next page.
Estimated model is time $= 5.0772 - 0.4704N + 0.0143N^2$.
Estimated $\hat N_{opt}$ is $0.4704/(2 \times 0.0143) = 16.45$.
Want a 99% confidence interval for $N_{opt}$.
Nonparametric bootstraps:
  Percentile: (14.9, 21.2), Studentized: (12.5, 21.6)
Parametric bootstraps:
  Percentile: (14.4, 22.8), Studentized: (13.9, 22.1)
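A rough sketch of a 99% percentile interval for $N_{opt}$ using a regression (residual) bootstrap with the X values held fixed. The notes do not say which nonparametric scheme produced the intervals above, so residual resampling is an assumption here. The data are simulated from the estimated model with a made-up design (N = 1, ..., 30) and a made-up error sd (0.25), so the interval will not reproduce the reported ones; all function names are hypothetical.

```python
import math
import random

def fit_quadratic(x, y):
    """Least-squares fit of y = b0 + b1*x + b2*x**2 via the normal equations."""
    rows = [[1.0, xi, xi * xi] for xi in x]
    # Augmented 3x4 system [X'X | X'y].
    A = [[sum(r[i] * r[j] for r in rows) for j in range(3)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(3)]
    for c in range(3):                       # elimination with partial pivoting
        p = max(range(c, 3), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(c + 1, 3):
            f = A[r][c] / A[c][c]
            for k in range(c, 4):
                A[r][k] -= f * A[c][k]
    b = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):                      # back substitution
        b[i] = (A[i][3] - sum(A[i][j] * b[j] for j in range(i + 1, 3))) / A[i][i]
    return b

def quantile(sorted_vals, p):
    """Linear-interpolation quantile of an already sorted list."""
    pos = p * (len(sorted_vals) - 1)
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])

rng = random.Random(511)  # arbitrary seed
N = list(range(1, 31))    # HYPOTHETICAL design: 1 to 30 processors
# SIMULATED times from the estimated model plus made-up noise (sd 0.25).
time = [5.0772 - 0.4704 * n + 0.0143 * n * n + rng.gauss(0.0, 0.25) for n in N]

b_hat = fit_quadratic(N, time)
nopt_hat = -b_hat[1] / (2.0 * b_hat[2])
fitted = [b_hat[0] + b_hat[1] * n + b_hat[2] * n * n for n in N]
resid = [t - f for t, f in zip(time, fitted)]

reps = []
for _ in range(2000):
    # Same X values in every bootstrap sample; resample residuals only.
    t_star = [f + rng.choice(resid) for f in fitted]
    b = fit_quadratic(N, t_star)
    reps.append(-b[1] / (2.0 * b[2]))
reps.sort()

lo99, hi99 = quantile(reps, 0.005), quantile(reps, 0.995)
print("Nopt_hat", nopt_hat)
print("99% percentile ci", lo99, hi99)
```

A parametric version would replace `rng.choice(resid)` with draws from a fitted error distribution, e.g. normal with the estimated residual sd.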
Which do you report? Some plots that might help you decide follow.

[Figure: time against number of processors, n, with the fitted quadratic regression]
[Figures: density histograms of bootstrap Nopt values; axis labels "Bootstrap value" and "P bootstrap Nopt"]
[Figure: QQ plot of residuals (sample quantiles against theoretical quantiles)]
[Figure: residuals against predicted values]
[Figure: sd of Nopt against Nopt from each bootstrap sample; panels: Nonparametric bootstrap, Parametric bootstrap]