Empirical Bayes Estimates for Large-Scale Prediction Problems

Bradley Efron*

Abstract

Classical prediction methods such as Fisher's linear discriminant function were designed for small-scale problems, where the number of predictors N is much smaller than the number of observations n. Modern scientific devices often reverse this situation. A microarray analysis, for example, might include n = 100 subjects measured on N = 10,000 genes, each of which is a potential predictor. This paper proposes an empirical Bayes approach to large-scale prediction, where the optimum Bayes prediction rule is estimated employing the data from all the predictors. Microarray examples are used to illustrate the method. The results show a close connection with the shrunken centroids algorithm of Tibshirani et al. (2002), a frequentist regularization approach to large-scale prediction, and also with false discovery rate theory.

Keywords: microarray prediction, empirical Bayes, shrunken centroids, effect size estimation, correlated predictors, local fdr.

1 Introduction

An important class of prediction problems begins with the observation of n independent vectors,

    (x_j, y_j),   j = 1, 2, ..., n.    (1.1)

Here x_j is an N-vector of predictors, while y_j is a real-valued response, taken to be dichotomous in most of what follows. For example, x_j might include age, height, weight, gender, etc. for person j, while y_j indicates whether or not that person later developed cancer. Given a newly observed N-vector X, we would like to predict its corresponding Y value. Our task is to use the training data (1.1) to construct an effective prediction rule.

Classic prediction methods, such as Fisher's linear discriminant function, were fashioned for problems where N is much smaller than n, that is, where the number of predictors is less than the number of training cases.
Current high-throughput scientific technology tends to produce just the opposite situation, with N ≫ n; modern equipment may permit thousands of measurements on a single individual, but recruiting new subjects remains as difficult as ever. Microarrays offer the iconic example. Here x_j is a vector of genetic expression measurements on subject j, one for each of N genes, where N is typically several thousand. In the prostate cancer data (Singh et al., 2002) we will use for motivation, there are N = 6033 genes measured on each of n = 102 men, n_1 = 50 healthy controls and n_2 = 52 prostate cancer patients. Given a new microarray measuring the same 6033 genes, we would like to predict whether or not that man develops prostate cancer....

* Department of Statistics, Stanford University. This work was supported in part by NIH grant 8R01 EB002784 and NSF grant DMS0505673.
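
One way to see why the N much larger than n setting is awkward for Fisher's linear discriminant is that the rule needs the inverse of the pooled within-class covariance matrix, and with two classes that N x N matrix can have rank at most n - 2. The following is only a sketch on simulated Gaussian data, not the Singh et al. measurements; the function name `within_class_rank` is hypothetical, and the sizes simply mirror those quoted for the prostate study.

```python
import numpy as np


def within_class_rank(n1, n2, N, seed=0):
    """Rank of the within-class-centered data matrix for two simulated classes.

    The pooled covariance estimate used by Fisher's linear discriminant is
    xc.T @ xc / (n1 + n2 - 2), so it has the same rank as xc.
    """
    rng = np.random.default_rng(seed)
    # Simulated standard-normal "expression levels"; purely illustrative data.
    x1 = rng.standard_normal((n1, N))
    x2 = rng.standard_normal((n2, N))
    # Center each class at its own mean, then stack into one (n1 + n2) x N matrix.
    xc = np.vstack([x1 - x1.mean(axis=0), x2 - x2.mean(axis=0)])
    return int(np.linalg.matrix_rank(xc))


if __name__ == "__main__":
    n1, n2, N = 50, 52, 6033  # sizes quoted for the prostate data
    rank = within_class_rank(n1, n2, N)
    print(f"rank = {rank}; upper bound n - 2 = {n1 + n2 - 2}; N = {N}")
```

Because the rank cannot exceed n - 2, the N x N pooled covariance is singular whenever N > n - 2, so the classical rule cannot be applied without some form of regularization or pooling of information across predictors.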