ESL Chapter 12.4–7 — Flexible Discriminant and Mixture Models
Trevor Hastie and Rob Tibshirani

Flexible Discriminant and Mixture Models

Theme: Modular Extensions of Standard Tools

Linear Discriminant Analysis (LDA) is a classic technique for discrimination and classification.

Virtues of LDA:
+ Simple prototype method for multiple class classification
+ Can produce optimal low dimensional views of the data
+ Sometimes produces the best results; e.g. LDA featured in the top 3 classifiers for 11/22 of the STATLOG datasets, and was the overall winner in 3/22.
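To make the "prototype method" reading concrete, here is a minimal sketch of LDA as nearest-centroid classification in Mahalanobis distance, assuming Gaussian classes with a common covariance matrix. The function names `lda_fit` and `lda_predict` and the synthetic three-class data are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def lda_fit(X, y):
    """Estimate class centroids, class priors, and the pooled within-class covariance."""
    classes = np.unique(y)
    n, p = X.shape
    centroids = np.array([X[y == k].mean(axis=0) for k in classes])
    priors = np.array([np.mean(y == k) for k in classes])
    W = np.zeros((p, p))
    for k, mu in zip(classes, centroids):
        R = X[y == k] - mu          # residuals around the class centroid
        W += R.T @ R
    W /= n - len(classes)           # pooled covariance estimate
    return classes, centroids, priors, W

def lda_predict(X, classes, centroids, priors, W):
    """Assign each row to the closest centroid (Mahalanobis distance, adjusted by log prior)."""
    W_inv = np.linalg.inv(W)
    diffs = X[:, None, :] - centroids[None, :, :]            # shape (n, K, p)
    d2 = np.einsum("nkp,pq,nkq->nk", diffs, W_inv, diffs)    # squared distances
    scores = -0.5 * d2 + np.log(priors)
    return classes[np.argmax(scores, axis=1)]

# Synthetic, well-separated classes (an assumption made purely for illustration).
rng = np.random.default_rng(0)
means = np.array([[0.0, 0.0], [6.0, 0.0], [0.0, 6.0]])
X = np.vstack([rng.normal(m, 1.0, size=(50, 2)) for m in means])
y = np.repeat(np.arange(3), 50)

model = lda_fit(X, y)
accuracy = np.mean(lda_predict(X, *model) == y)
print(f"training accuracy: {accuracy:.3f}")
```

Each class is summarized by a single prototype (its centroid), and the shared covariance is what makes the resulting decision boundaries linear.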

Limitations of LDA:
- Lots of data, many predictors: LDA underfits (restricts to linear boundaries)
- Many correlated predictors: LDA overfits (noisy/wiggly coefficients)
- Dimension reduction limited by the number of classes
Example of extension: FDA

$\hat{Y} = S_X(Y)$

where $Y$ is an indicator response matrix and $S_X$ is a regression procedure (linear regression, polynomial regression, additive models, MARS, neural network, ...).

$\mathrm{eigen}(Y^T \hat{Y}) = \mathrm{eigen}(Y^T S_X Y)$

This gives LDA when $S_X$ is linear regression, and flexible extensions of LDA for the other choices.

Typically this amounts to expanding/selecting the predictors via basis transformations chosen by regression, and then (penalized) LDA in the new space.
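The recipe above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the regression procedure $S_X$ is taken to be least squares on a hypothetical per-feature polynomial basis (`polynomial_basis`), the eigenproblem is normalized by the class proportions, and the resulting coordinates are left unscaled. The names `fda_fit`, `Theta`, and `alphas2` are introduced here for illustration only.

```python
import numpy as np

def polynomial_basis(X, degree=2):
    """Hypothetical S_X basis: intercept plus per-feature powers up to `degree`."""
    cols = [np.ones((X.shape[0], 1))] + [X ** d for d in range(1, degree + 1)]
    return np.hstack(cols)

def fda_fit(X, y, degree=2):
    classes = np.unique(y)
    N = X.shape[0]
    Y = (y[:, None] == classes[None, :]).astype(float)   # N x K indicator matrix
    H = polynomial_basis(X, degree)
    B, *_ = np.linalg.lstsq(H, Y, rcond=None)            # regress Y on the basis
    Y_hat = H @ B                                        # Y_hat = S_X(Y)
    M = Y.T @ Y_hat / N                                  # K x K matrix Y^T S_X Y / N
    M = (M + M.T) / 2                                    # symmetrize against round-off
    pi = Y.mean(axis=0)                                  # class proportions
    D_inv_sqrt = np.diag(1.0 / np.sqrt(pi))
    # Generalized eigenproblem M theta = alpha^2 D_pi theta, solved via a symmetric transform.
    evals, V = np.linalg.eigh(D_inv_sqrt @ M @ D_inv_sqrt)
    order = np.argsort(evals)[::-1]
    evals, V = evals[order], V[:, order]
    Theta = D_inv_sqrt @ V                               # scores with Theta^T D_pi Theta = I
    # The leading solution is the trivial constant score (eigenvalue 1); drop it.
    return classes, B, Theta[:, 1:], evals[1:], evals[0]

# Synthetic three-class data (an assumption made purely for illustration).
rng = np.random.default_rng(0)
means = np.array([[0.0, 0.0], [6.0, 0.0], [0.0, 6.0]])
X = np.vstack([rng.normal(m, 1.0, size=(50, 2)) for m in means])
y = np.repeat(np.arange(3), 50)

classes, B, Theta, alphas2, trivial = fda_fit(X, y, degree=2)
coords = polynomial_basis(X, 2) @ B @ Theta              # unscaled discriminant coordinates
print(coords.shape, alphas2)
```

Swapping `polynomial_basis` and the least-squares fit for another regression method changes the flexibility of the boundaries while leaving the eigen-decomposition step untouched, which is the modularity the slide emphasizes. With `degree=1` the basis is linear and the coordinates span the usual LDA discriminant space.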

Example: Vowel Recognition

Vowel  Word     Vowel  Word
i      heed     o      hod
I      hid      c:     hoard
E      head     U      hood
A      had      u:     who’d
a:     hard     3:     heard
y      hud

11 symbols, 8 speakers (train), 7 speakers (test), 6 replications each.
10 input features based on digitized utterances.

Source: Tony Robinson, via Scott Fahlman, CMU

GOAL: Predict Vowels
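The design above fixes the sample sizes. A small arithmetic check, assuming "6 replications each" means every speaker utters every one of the 11 vowels six times (the variable names are illustrative):

```python
# Sample sizes implied by the design, under the assumption stated above.
n_classes = 11          # vowel symbols
replications = 6        # utterances per speaker per vowel (assumed reading)
train_speakers = 8
test_speakers = 7

n_train = n_classes * train_speakers * replications
n_test = n_classes * test_speakers * replications
print(n_train, n_test)  # 528 462
```

Under this reading the training set has 528 observations and the test set 462, each with 10 input features.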
Technique                                         Error rates
                                                  Training   Test
(1)  LDA                                          0.32       0.56
       Softmax                                    0.48       0.67
(2)  QDA                                          0.01       0.53
(3)  CART                                         0.05       0.56
(5)  Single-layer Perceptron                                 0.67
(6)  Multi-layer Perceptron (88 hidden units)                0.49
(7)  Gaussian Node Network (528 hidden units)                0.45
(8)  Nearest Neighbor                                        0.44
(9)  FDA/BRUTO                                    0.06       0.44
       Softmax                                    0.11       0.50
(10) FDA/MARS (degree = 1)                        0.09       0.45
       Best reduced dimension (=2)                0.18       0.42
       Softmax                                    0.14       0.48
(11) FDA/MARS (degree = 2)                        0.02       0.42
       Best reduced dimension (=6)                0.13       0.39
       Softmax                                    0.10       0.50

[Figure: scatter plot of the training data. Axes: Coordinate 1 for Training Data, Coordinate 2 for Training Data.]
