Classification of microarray samples. Example: small round blue cell tumors (Khan et al.)


ESL Chapter 4: Linear Methods for Classification
Trevor Hastie and Rob Tibshirani

Classification in high dimensions

• Important for gene expression microarray problems and other genomics problems.
• Starting point: diagonal LDA, which uses $\mathrm{diag}(\hat{\Sigma})$.
• Nearest centroid classification on standardized features is equivalent to diagonal LDA.
• Nearest shrunken centroids regularizes further, by discarding noisy features.

Classification of microarray samples

Example: small round blue cell tumors; Khan et al., Nature Medicine, 2001.

• Tumors classified as BL (Burkitt lymphoma), EWS (Ewing), NB (neuroblastoma) and RMS (rhabdomyosarcoma).
• There are 63 training samples and 25 test samples, although five of the latter were not SRBCTs. 2308 genes.
• Khan et al. report zero training and test errors, using a complex neural network model. Decided that 96 genes were “important”.
• Too complicated!

[Figure: Khan data, with samples grouped by class: BL, EWS, NB, RMS.]

[Figure: Neural network approach.]

[Figure: Class centroids, one panel per class (BL, EWS, NB, RMS). Axes: Gene (0 to 2000); Centroids: Average Expression Centered at Overall Centroid (−1.0 to 1.0).]

Shrunken centroids

• Idea: shrink each class centroid towards the overall centroid. First normalize by the within-class standard deviation for each gene.
• Let $x_{ij}$ be the expression for samples $i = 1, 2, \ldots, n$ and genes $j = 1, 2, \ldots, p$.
• We have classes $1, 2, \ldots, K$, and let $C_k$ be the indices of the $n_k$ samples in class $k$.
• The $j$th component of the centroid for class $k$ is $\bar{x}_{jk} = \sum_{i \in C_k} x_{ij} / n_k$, the mean expression value in class...
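The centroid definition and the nearest-centroid rule on standardized features can be illustrated with a small sketch. This is not code from the slides: the function names, the toy data, and the choice of a pooled within-class standard deviation with denominator n − K are assumptions made for illustration. Class priors are ignored, so the match with diagonal LDA holds only for equal priors. The shrinkage step itself is not shown.

```python
from collections import defaultdict
from math import sqrt


def class_centroids(X, y):
    """Return {class k: [xbar_jk for each gene j]}, where
    xbar_jk = sum over i in C_k of x_ij, divided by n_k."""
    groups = defaultdict(list)
    for row, label in zip(X, y):
        groups[label].append(row)
    return {
        k: [sum(col) / len(rows) for col in zip(*rows)]
        for k, rows in groups.items()
    }


def pooled_within_class_sd(X, y, centroids):
    """Per-gene pooled within-class standard deviation.
    Assumption: denominator n - K, with K the number of classes."""
    n, p, K = len(X), len(X[0]), len(centroids)
    sum_sq = [0.0] * p
    for row, label in zip(X, y):
        centroid = centroids[label]
        for j in range(p):
            sum_sq[j] += (row[j] - centroid[j]) ** 2
    return [sqrt(s / (n - K)) for s in sum_sq]


def nearest_centroid(x, centroids, sd):
    """Assign x to the class whose centroid is closest in squared
    distance after dividing each gene by its within-class sd."""
    def standardized_distance(centroid):
        return sum(((xj - cj) / sj) ** 2
                   for xj, cj, sj in zip(x, centroid, sd))
    return min(centroids, key=lambda k: standardized_distance(centroids[k]))


# Hypothetical toy data: 4 samples, 2 "genes", 2 classes.
X = [[1.0, 5.0], [3.0, 5.5], [8.0, 1.0], [10.0, 1.5]]
y = ["A", "A", "B", "B"]

centroids = class_centroids(X, y)
sd = pooled_within_class_sd(X, y, centroids)
print(nearest_centroid([2.5, 5.0], centroids, sd))
```

Dividing by the per-gene standard deviation before measuring distance is what makes this a diagonal-covariance rule: genes with large within-class spread contribute less to the decision.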

This document was uploaded on 03/10/2014 for the course STATS 315A at Stanford.
