Empirical Bayes Estimates for Large-Scale Prediction Problems

Bradley Efron*

Abstract

Classical prediction methods such as Fisher's linear discriminant function were designed for small-scale problems, where the number of predictors $N$ is much smaller than the number of observations $n$. Modern scientific devices often reverse this situation. A microarray analysis, for example, might include $n = 100$ subjects measured on $N = 10{,}000$ genes, each of which is a potential predictor. This paper proposes an empirical Bayes approach to large-scale prediction, where the optimum Bayes prediction rule is estimated employing the data from all the predictors. Microarray examples are used to illustrate the method. The results show a close connection with the shrunken centroids algorithm of Tibshirani et al. (2002), a frequentist regularization approach to large-scale prediction, and also with false discovery rate theory.

Keywords: microarray prediction, empirical Bayes, shrunken centroids, effect size estimation, correlated predictors, local fdr.

1 Introduction

An important class of prediction problems begins with the observation of $n$ independent vectors,

$$(x_j, y_j), \qquad j = 1, 2, \ldots, n. \tag{1.1}$$

Here $x_j$ is an $N$-vector of predictors, while $y_j$ is a real-valued response, taken to be dichotomous in most of what follows. For example, $x_j$ might include age, height, weight, gender, etc. for person $j$, while $y_j$ indicates whether or not that person later developed cancer. Given a newly observed $N$-vector $X$, we would like to predict its corresponding $Y$ value. Our task is to use the training data (1.1) to construct an effective prediction rule.

Classic prediction methods, such as Fisher's linear discriminant function, were fashioned for problems where $N$ is much smaller than $n$, that is, where the number of predictors is less than the number of training cases.
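
To make the small-scale setting concrete, here is a minimal sketch of a two-class Fisher linear discriminant fitted to training pairs of the form (1.1). It is an illustration under stated assumptions, not the paper's method: the data are synthetic, the helper names (`fit_fisher`, `predict`, and so on) are hypothetical, equal class priors are assumed for the cut point, and $n = 200$, $N = 3$ are chosen only so that $N$ is much smaller than $n$.

```python
import random

def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def pooled_covariance(x0, x1):
    """Pooled within-class covariance matrix of two groups of vectors."""
    m0, m1 = mean_vector(x0), mean_vector(x1)
    N = len(m0)
    S = [[0.0] * N for _ in range(N)]
    for rows, m in ((x0, m0), (x1, m1)):
        for r in rows:
            d = [a - b for a, b in zip(r, m)]
            for i in range(N):
                for k in range(N):
                    S[i][k] += d[i] * d[k]
    dof = len(x0) + len(x1) - 2
    return [[v / dof for v in row] for row in S]

def solve(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    N = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(N):
        p = max(range(c, N), key=lambda r: abs(M[r][c]))
        if abs(M[p][c]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, N):
            f = M[r][c] / M[c][c]
            for k in range(c, N + 1):
                M[r][k] -= f * M[c][k]
    w = [0.0] * N
    for r in reversed(range(N)):
        tail = sum(M[r][k] * w[k] for k in range(r + 1, N))
        w[r] = (M[r][N] - tail) / M[r][r]
    return w

def fit_fisher(xs, ys):
    """Return (w, c); the rule predicts Y = 1 when sum(w_i * X_i) > c."""
    x0 = [x for x, y in zip(xs, ys) if y == 0]
    x1 = [x for x, y in zip(xs, ys) if y == 1]
    m0, m1 = mean_vector(x0), mean_vector(x1)
    diff = [a - b for a, b in zip(m1, m0)]
    w = solve(pooled_covariance(x0, x1), diff)
    # Cut point halfway between the projected class means (equal priors assumed).
    c = 0.5 * sum(wi * (a + b) for wi, a, b in zip(w, m0, m1))
    return w, c

def predict(w, c, x):
    """Apply the fitted linear rule to a new N-vector x."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > c else 0

# Synthetic training data (x_j, y_j), j = 1..n, with n = 200 and N = 3.
rng = random.Random(0)
n, N = 200, 3
ys = [j % 2 for j in range(n)]
xs = [[rng.gauss(2.0 * y, 1.0) for _ in range(N)] for y in ys]

w, c = fit_fisher(xs, ys)
train_accuracy = sum(predict(w, c, x) == y for x, y in zip(xs, ys)) / n
print("weights:", w, "cut point:", c, "training accuracy:", train_accuracy)
```

The fit depends on solving a linear system in the $N \times N$ pooled covariance matrix, which is well conditioned here only because there are many more training cases than predictors.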
Current high-throughput scientific technology tends to produce just the opposite situation, with $N \gg n$; modern equipment may permit thousands of measurements on a single individual, but recruiting new subjects remains as difficult as ever. Microarrays offer the iconic example. Here $x_j$ is a vector of genetic expression measurements on subject $j$, one for each of $N$ genes, where $N$ is typically several thousand. In the prostate cancer data (Singh et al., 2002) we will use for motivation, there are $N = 6033$ genes measured on each of $n = 102$ men, $n_1 = 50$ healthy controls and $n_2 = 52$ prostate cancer patients. Given a new microarray measuring the same 6033 genes, we would like to predict whether or not that man develops prostate cancer....

* Department of Statistics, Stanford University. This work was supported in part by NIH grant 8R01 EB002784 and NSF grant DMS0505673.
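
One standard reason classical discriminant rules struggle when predictors outnumber subjects, offered here as background rather than as a claim from the text, is that the pooled within-class covariance matrix cannot have rank above $n - 2$, so it is singular whenever $N > n - 2$. The sketch below checks this numerically. It uses the class sizes quoted for the prostate data ($n_1 = 50$, $n_2 = 52$) but synthetic Gaussian values, and, as an assumption made purely to keep a standard-library example fast, only $N = 300$ predictors instead of 6033. The helper names are hypothetical.

```python
import random

def class_centered(rows):
    """Subtract the class mean vector from every row of one class."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]

def matrix_rank(rows, tol=1e-9):
    """Numerical rank via Gaussian elimination with partial pivoting."""
    M = [row[:] for row in rows]
    n_rows, n_cols = len(M), len(M[0])
    rank = 0
    for c in range(n_cols):
        if rank == n_rows:
            break
        # Largest remaining entry in column c becomes the pivot.
        p = max(range(rank, n_rows), key=lambda r: abs(M[r][c]))
        if abs(M[p][c]) <= tol:
            continue  # no usable pivot in this column
        M[rank], M[p] = M[p], M[rank]
        pivot_row = M[rank]
        for r in range(rank + 1, n_rows):
            f = M[r][c] / pivot_row[c]
            if f != 0.0:
                row = M[r]
                for k in range(c, n_cols):
                    row[k] -= f * pivot_row[k]
        rank += 1
    return rank

rng = random.Random(1)
n1, n2 = 50, 52  # class sizes quoted for the prostate data
N = 300          # assumption: reduced from 6033 to keep pure Python fast

# Synthetic stand-ins for the expression measurements (not real data).
controls = [[rng.gauss(0.0, 1.0) for _ in range(N)] for _ in range(n1)]
patients = [[rng.gauss(0.0, 1.0) for _ in range(N)] for _ in range(n2)]

# D stacks the class-centered rows. The pooled covariance is D^T D / (n - 2),
# an N x N matrix with the same rank as the n x N matrix D.
D = class_centered(controls) + class_centered(patients)
rank = matrix_rank(D)
print("rank of pooled covariance:", rank, "for an", N, "x", N, "matrix")
```

Each class loses one degree of freedom to its own mean, which is why the rank stops at $n_1 + n_2 - 2$ however many predictors are measured. A rule that needs the inverse of this matrix therefore requires some form of regularization or a different estimation strategy in the $N \gg n$ regime.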